* SCSI mid layer and high IOPS capable devices @ 2012-12-11 0:00 scameron 2012-12-11 8:21 ` Bart Van Assche 2012-12-13 15:22 ` Bart Van Assche 0 siblings, 2 replies; 21+ messages in thread From: scameron @ 2012-12-11 0:00 UTC (permalink / raw) To: linux-scsi; +Cc: stephenmcameron, scameron, dab Is there any work going on to improve performance of the SCSI layer to better support devices capable of high IOPS? I've been playing around with some flash-based devices and have a block driver that uses the make_request interface (calls blk_queue_make_request() rather than blk_init_queue()) and a SCSI LLD variant of the same driver. The block driver is similar in design and performance to the nvme driver. If I compare the performance, the block driver gets about 3x the performance as the SCSI LLD. The SCSI LLD spends a lot of time (according to perf) contending for locks in scsi_request_fn(), presumably the host lock or the queue lock, or perhaps both. All other things being equal, a SCSI LLD would be preferable to me, but, with performance differing by a factor of around 3x, all other things are definitely not equal. I tried using scsi_debug with fake_rw and also the scsi_ram driver that was recently posted to get some idea of what the maximum IOPS that could be pushed through the SCSI midlayer might be, and the numbers were a little disappointing (was getting around 150k iops with scsi_debug with reads and writes faked, and around 3x that with the block driver actually doing the i/o). Essentially, what I've been finding out is consistent with what's in this slide deck: http://static.usenix.org/event/lsf08/tech/IO_Carlson_Accardi_SATA.pdf The driver, like nvme, has a submit and reply queue per cpu. I'm sort of guessing that funnelling all the requests through a single request queue per device that only one cpu can touch at a time as the scsi mid layer does is a big part of what's killing performance. Looking through the scsi code, if I read it correctly, the assumption that each device has a request queue seems to be all over the code, so how exactly one might go about attempting to improve the situation is not really obvious to me. Anyway, just wondering if anybody is looking into doing some improvements in this area. -- steve ^ permalink raw reply [flat|nested] 21+ messages in thread
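For readers who have not written a block driver, a rough sketch of the two submission paths being contrasted above may help. The driver names and helpers (mydrv_*, h->queue[]) are hypothetical and the calls shown are approximately the block-layer interfaces of the 3.x kernels under discussion; this is an illustration, not the poster's actual driver.

    /*
     * (a) Conventional request_fn driver: the block layer (and, for a SCSI
     *     LLD, the SCSI midlayer above it) funnels everything through one
     *     request_queue whose dispatch runs under q->queue_lock.
     */
    static void mydrv_request_fn(struct request_queue *q)
    {
            struct request *rq;

            /* called with q->queue_lock held -- one cpu at a time */
            while ((rq = blk_fetch_request(q)) != NULL)
                    mydrv_dispatch(q->queuedata, rq);
    }
        ...
        q = blk_init_queue(mydrv_request_fn, &h->lock);

    /*
     * (b) make_request driver: bios arrive directly, and the driver can
     *     route each one to a per-cpu submit queue with no shared lock.
     */
    static void mydrv_make_request(struct request_queue *q, struct bio *bio)
    {
            struct mydrv_host *h = q->queuedata;

            /* the cpu number is only a hint for queue selection */
            mydrv_queue_bio(&h->queue[raw_smp_processor_id()], bio);
    }
        ...
        q = blk_alloc_queue(GFP_KERNEL);
        blk_queue_make_request(q, mydrv_make_request);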
* Re: SCSI mid layer and high IOPS capable devices 2012-12-11 0:00 SCSI mid layer and high IOPS capable devices scameron @ 2012-12-11 8:21 ` Bart Van Assche 2012-12-11 22:46 ` scameron 2012-12-13 15:22 ` Bart Van Assche 1 sibling, 1 reply; 21+ messages in thread From: Bart Van Assche @ 2012-12-11 8:21 UTC (permalink / raw) To: scameron; +Cc: linux-scsi, stephenmcameron, dab On 12/11/12 01:00, scameron@beardog.cce.hp.com wrote: > I tried using scsi_debug with fake_rw and also the scsi_ram driver > that was recently posted to get some idea of what the maximum IOPS > that could be pushed through the SCSI midlayer might be, and the > numbers were a little disappointing (was getting around 150k iops > with scsi_debug with reads and writes faked, and around 3x that > with the block driver actually doing the i/o). With which request size was that ? I see about 330K IOPS @ 4 KB and about 540K IOPS @ 512 bytes with the SRP protocol, a RAM disk at the target side, a single SCSI LUN and a single IB cable. These results have been obtained on a setup with low-end CPU's. Had you set rq_affinity to 2 in your tests ? Bart. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: SCSI mid layer and high IOPS capable devices 2012-12-11 8:21 ` Bart Van Assche @ 2012-12-11 22:46 ` scameron 2012-12-13 11:40 ` Bart Van Assche 0 siblings, 1 reply; 21+ messages in thread From: scameron @ 2012-12-11 22:46 UTC (permalink / raw) To: Bart Van Assche; +Cc: linux-scsi, stephenmcameron, dab, scameron On Tue, Dec 11, 2012 at 09:21:46AM +0100, Bart Van Assche wrote: > On 12/11/12 01:00, scameron@beardog.cce.hp.com wrote: > >I tried using scsi_debug with fake_rw and also the scsi_ram driver > >that was recently posted to get some idea of what the maximum IOPS > >that could be pushed through the SCSI midlayer might be, and the > >numbers were a little disappointing (was getting around 150k iops > >with scsi_debug with reads and writes faked, and around 3x that > >with the block driver actually doing the i/o). > > With which request size was that ? 4k (I'm thinking the request size should not matter too much since fake_rw=1 causes the i/o not to actually be done -- there's no data transferred. Similarly with scsi_ram there's a flag to discard reads and writes that I was using.) > I see about 330K IOPS @ 4 KB and > about 540K IOPS @ 512 bytes with the SRP protocol, a RAM disk at the > target side, a single SCSI LUN and a single IB cable. These results have > been obtained on a setup with low-end CPU's. Had you set rq_affinity to > 2 in your tests ? No, hadn't done anything with rq_affinity. I had spread interrupts around by turning off irqbalance and echoing things into /proc/irq/*, and running a bunch of dd processes (one per cpu) like this: taskset -c $cpu dd if=/dev/blah of=/dev/null bs=4k iflag=direct & And the hardware in this case should route the interrupts back to the processor which submitted the i/o (the submitted command contains info that lets the hw know which msix vector we want the io to come back on.) I would be curious to see what kind of results you would get with scsi_debug with fake_rw=1. I am sort of suspecting that trying to put an "upper limit" on scsi LLD IOPS performance by seeing what scsi_debug will do with fake_rw=1 is not really valid (or, maybe I'm doing it wrong) as I know of one case in which a real HW scsi driver beats scsi_debug fake_rw=1 at IOPS on the very same system, which seems like it shouldn't be possible. Kind of mysterious. Another mystery I haven't been able to clear up -- I'm using code like this to set affinity hints int i, cpu; cpu = cpumask_first(cpu_online_mask); for (i = 0; i < h->noqs; i++) { int idx = i ? i + 1 : i; int rc; rc = irq_set_affinity_hint(h->qinfo[idx].msix_vector, get_cpu_mask(cpu)); if (rc) dev_warn(&h->pdev->dev, "Failed to hint affinity of vector %d to cpu %d\n", h->qinfo[idx].msix_vector, cpu); cpu = cpumask_next(cpu, cpu_online_mask); } and those hints are set (querying /proc/irq/*/affinity_hint shows that my hints are in there) but the hints are not "taken", that is /proc/irq/smp_affinity does not match the hints. doing this: for x in `seq $first_irq $last_irq` do cat /proc/irq/$x/affinity_hint > /proc/irq/$x/smp_affinity done (where first_irq and last_irq specify the range of irqs assigned to my driver) makes the hints be "taken". I noticed nvme doesn't seem to suffer from this, somehow the hints are taken automatically (er, I don't recall if /proc/irq/*/smp_affinity matches affinity_hints for nvme, but interrupts seem spread around without doing anything special). 
I haven't seen anything in the nvme code related to affinity that I'm not already doing as well in my driver, so it is a mystery to me why that difference in behavior occurs. -- steve ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: SCSI mid layer and high IOPS capable devices 2012-12-11 22:46 ` scameron @ 2012-12-13 11:40 ` Bart Van Assche 2012-12-13 18:03 ` scameron 0 siblings, 1 reply; 21+ messages in thread From: Bart Van Assche @ 2012-12-13 11:40 UTC (permalink / raw) To: scameron; +Cc: linux-scsi, stephenmcameron, dab On 12/11/12 23:46, scameron@beardog.cce.hp.com wrote: > I would be curious to see what kind of results you would get with scsi_debug > with fake_rw=1. I am sort of suspecting that trying to put an "upper limit" > on scsi LLD IOPS performance by seeing what scsi_debug will do with fake_rw=1 > is not really valid (or, maybe I'm doing it wrong) as I know of one case in > which a real HW scsi driver beats scsi_debug fake_rw=1 at IOPS on the very > same system, which seems like it shouldn't be possible. Kind of mysterious. The test # disable-frequency-scaling # modprobe scsi_debug delay=0 fake_rw=1 # echo 2 > /sys/block/sdc/queue/rq_affinity # echo noop > /sys/block/sdc/queue/scheduler # echo 0 > /sys/block/sdc/queue/add_random results in about 800K IOPS for random reads on the same setup (with a request size of 4 KB; CPU: quad core i5-2400). Repeating the same test with fake_rw=0 results in about 651K IOPS. Bart. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: SCSI mid layer and high IOPS capable devices 2012-12-13 11:40 ` Bart Van Assche @ 2012-12-13 18:03 ` scameron 2012-12-13 17:18 ` Bart Van Assche 0 siblings, 1 reply; 21+ messages in thread From: scameron @ 2012-12-13 18:03 UTC (permalink / raw) To: Bart Van Assche; +Cc: linux-scsi, stephenmcameron, dab, scameron On Thu, Dec 13, 2012 at 12:40:27PM +0100, Bart Van Assche wrote: > On 12/11/12 23:46, scameron@beardog.cce.hp.com wrote: > >I would be curious to see what kind of results you would get with > >scsi_debug > >with fake_rw=1. I am sort of suspecting that trying to put an "upper > >limit" > >on scsi LLD IOPS performance by seeing what scsi_debug will do with > >fake_rw=1 > >is not really valid (or, maybe I'm doing it wrong) as I know of one case in > >which a real HW scsi driver beats scsi_debug fake_rw=1 at IOPS on the very > >same system, which seems like it shouldn't be possible. Kind of > >mysterious. > > The test > > # disable-frequency-scaling > # modprobe scsi_debug delay=0 fake_rw=1 > # echo 2 > /sys/block/sdc/queue/rq_affinity > # echo noop > /sys/block/sdc/queue/scheduler > # echo 0 > /sys/block/sdc/queue/add_random > > results in about 800K IOPS for random reads on the same setup (with a > request size of 4 KB; CPU: quad core i5-2400). > > Repeating the same test with fake_rw=0 results in about 651K IOPS. What are your system specs? Here's what I'm seeing. I have one 6-core processor. [root@localhost scameron]# grep 'model name' /proc/cpuinfo model name : Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz model name : Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz model name : Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz model name : Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz model name : Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz model name : Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz hyperthreading is disabled. Here is the script I'm running. [root@localhost scameron]# cat do-dds #!/bin/sh do_dd() { device="$1" cpu="$2" taskset -c "$cpu" dd if="$device" of=/dev/null bs=4k iflag=direct } do_six() { for x in `seq 0 5` do do_dd "$1" $x & done } do_120() { for z in `seq 1 20` do do_six "$1" done wait } time do_120 "$1" I don't have "disable-frequency-scaling" on rhel6, but I think if I send SIGUSR1 to all the cpuspeed processes, this does the same thing. ps aux | grep cpuspeed | grep -v grep | awk '{ printf("kill -USR1 %s\n", $2);}' | sh [root@localhost scameron]# find /sys -name 'scaling_cur_freq' -print | xargs cat 2000000 2000000 2000000 2000000 2000000 2000000 [root@localhost scameron]# Now, using scsi-debug (300mb size) with delay=0 and fake_rw=1, with rq_affinity set to 2, and add_random set to 0 and noop i/o scheduler I get ~216k iops. With my scsi lld (actually doing the i/o) , I now get ~190k iops. rq_affinity set to 2, add_random 0, noop i/o scheduler, irqs manually spread across cpus (irqbalance turned off). With my block lld (actually doing the i/o), I get ~380k iops. rq_affinity set to 2, add_random 0, i/o scheduler "none" (there is no i/o scheduler with the make_request interface), irqs manually spread across cpus (irqbalance turned off). So the block driver seems to beat the snot out of the scsi lld by a factor of 2x now, rather than 3x, so I guess that's some improvement, but still. -- steve ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: SCSI mid layer and high IOPS capable devices 2012-12-13 18:03 ` scameron @ 2012-12-13 17:18 ` Bart Van Assche 0 siblings, 0 replies; 21+ messages in thread From: Bart Van Assche @ 2012-12-13 17:18 UTC (permalink / raw) To: scameron; +Cc: linux-scsi, stephenmcameron, dab On 12/13/12 19:03, scameron@beardog.cce.hp.com wrote: > What are your system specs? A quad core Intel i5-2400 @ 3.10 GHz. > taskset -c "$cpu" dd if="$device" of=/dev/null bs=4k iflag=direct Please use fio instead of dd for any serious performance measurements. dd doesn't even guarantee that its buffers are page aligned. > I don't have "disable-frequency-scaling" on rhel6, but I think if I send > SIGUSR1 to all the cpuspeed processes, this does the same thing. Depends on which scaling governor and minimum frequency have been configured. This is what I am using: #!/bin/bash for d in /sys/devices/system/cpu/cpu*/cpufreq do if [ -e "$d/scaling_governor" ]; then echo "userspace" >"$d/scaling_governor" echo "$(<$d/cpuinfo_max_freq)" >"$d/scaling_min_freq" fi done And the test I ran is: fio --bs=4096 --ioengine=libaio --rw=randread --buffered=0 --thread \ --numjobs=${cpucount} --iodepth=16 --iodepth_batch=8 \ --iodepth_batch_complete=8 \ --loops=$((2**31)) --runtime=60 --group_reporting --size=${size} \ --gtod_reduce=1 --name=${dev} --filename=${dev} --invalidate=1 Bart. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: SCSI mid layer and high IOPS capable devices 2012-12-11 0:00 SCSI mid layer and high IOPS capable devices scameron 2012-12-11 8:21 ` Bart Van Assche @ 2012-12-13 15:22 ` Bart Van Assche 2012-12-13 17:25 ` scameron 1 sibling, 1 reply; 21+ messages in thread From: Bart Van Assche @ 2012-12-13 15:22 UTC (permalink / raw) To: scameron; +Cc: linux-scsi, stephenmcameron, dab On 12/11/12 01:00, scameron@beardog.cce.hp.com wrote: > The driver, like nvme, has a submit and reply queue per cpu. This is interesting. If my interpretation of the POSIX spec is correct then aio_write() allows to queue overlapping writes and all writes submitted by the same thread have to be performed in the order they were submitted by that thread. What if a thread submits a first write via aio_write(), gets rescheduled on another CPU and submits a second overlapping write also via aio_write() ? If a block driver uses one queue per CPU, does that mean that such writes that were issued in order can be executed in a different order by the driver and/or hardware than the order in which the writes were submitted ? See also the aio_write() man page, The Open Group Base Specifications Issue 7, IEEE Std 1003.1-2008 (http://pubs.opengroup.org/onlinepubs/9699919799/functions/aio_write.html). Bart. ^ permalink raw reply [flat|nested] 21+ messages in thread
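For concreteness, here is a minimal userspace illustration of the scenario described above: two overlapping writes queued back-to-back from one thread with POSIX AIO. The file name and sizes are arbitrary, and whether the two writes may be reordered below glibc is exactly the question being raised.

    /* overlapping-aio.c -- build with: cc overlapping-aio.c -lrt */
    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
            static char a[4096], b[4096];
            struct aiocb cb1 = { 0 }, cb2 = { 0 };
            const struct aiocb *list[] = { &cb1, &cb2 };
            int fd = open("testfile", O_RDWR | O_CREAT, 0644);

            if (fd < 0)
                    return 1;
            memset(a, 'A', sizeof(a));
            memset(b, 'B', sizeof(b));

            cb1.aio_fildes = fd;
            cb1.aio_buf = a;
            cb1.aio_nbytes = sizeof(a);
            cb1.aio_offset = 0;
            cb2 = cb1;
            cb2.aio_buf = b;
            cb2.aio_offset = 2048;          /* overlaps the tail of cb1 */

            aio_write(&cb1);                /* both writes are now in flight ... */
            aio_write(&cb2);                /* ... and they overlap */
            while (aio_error(&cb1) == EINPROGRESS || aio_error(&cb2) == EINPROGRESS)
                    aio_suspend(list, 2, NULL);
            printf("wrote %zd and %zd bytes\n", aio_return(&cb1), aio_return(&cb2));
            close(fd);
            return 0;
    }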
* Re: SCSI mid layer and high IOPS capable devices 2012-12-13 15:22 ` Bart Van Assche @ 2012-12-13 17:25 ` scameron 2012-12-13 16:47 ` Bart Van Assche 0 siblings, 1 reply; 21+ messages in thread From: scameron @ 2012-12-13 17:25 UTC (permalink / raw) To: Bart Van Assche; +Cc: linux-scsi, stephenmcameron, dab, scameron On Thu, Dec 13, 2012 at 04:22:33PM +0100, Bart Van Assche wrote: > On 12/11/12 01:00, scameron@beardog.cce.hp.com wrote: > >The driver, like nvme, has a submit and reply queue per cpu. > > This is interesting. If my interpretation of the POSIX spec is correct > then aio_write() allows to queue overlapping writes and all writes > submitted by the same thread have to be performed in the order they were > submitted by that thread. What if a thread submits a first write via > aio_write(), gets rescheduled on another CPU and submits a second > overlapping write also via aio_write() ? If a block driver uses one > queue per CPU, does that mean that such writes that were issued in order > can be executed in a different order by the driver and/or hardware than > the order in which the writes were submitted ? > > See also the aio_write() man page, The Open Group Base Specifications > Issue 7, IEEE Std 1003.1-2008 > (http://pubs.opengroup.org/onlinepubs/9699919799/functions/aio_write.html). It is my understanding that the low level driver is free to re-order the i/o's any way it wants, as is the hardware. It is up to the layers above to enforce any ordering requirements. For a long time there was a bug in the cciss driver whereby all i/o's submitted to the driver got reversed in order -- adding to head of a list instead of to the tail, or vice versa, I forget which -- and it caused no real problems (apart from some slight performance issues that were mostly masked by the Smart Array's cache). It was caught by firmware guys noticing LBAs coming in in weird orders for supposedly sequential workloads. So in your scenario, I think the overlapping writes should not be submitted by the block layer to the low level driver concurrently, as the block layer is aware that the lld is free to re-order things. (I am very certain that this is the case for scsi low level drivers and block drivers using a request_fn interface -- less certain about block drivers using the make_request interface to submit i/o's, as this interface is pretty new to me.) If I am wrong about any of that, that would be very interesting to know. -- steve ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: SCSI mid layer and high IOPS capable devices 2012-12-13 17:25 ` scameron @ 2012-12-13 16:47 ` Bart Van Assche 2012-12-13 16:49 ` Christoph Hellwig ` (2 more replies) 0 siblings, 3 replies; 21+ messages in thread From: Bart Van Assche @ 2012-12-13 16:47 UTC (permalink / raw) To: scameron; +Cc: linux-scsi, stephenmcameron, dab On 12/13/12 18:25, scameron@beardog.cce.hp.com wrote: > On Thu, Dec 13, 2012 at 04:22:33PM +0100, Bart Van Assche wrote: >> On 12/11/12 01:00, scameron@beardog.cce.hp.com wrote: >>> The driver, like nvme, has a submit and reply queue per cpu. >> >> This is interesting. If my interpretation of the POSIX spec is correct >> then aio_write() allows to queue overlapping writes and all writes >> submitted by the same thread have to be performed in the order they were >> submitted by that thread. What if a thread submits a first write via >> aio_write(), gets rescheduled on another CPU and submits a second >> overlapping write also via aio_write() ? If a block driver uses one >> queue per CPU, does that mean that such writes that were issued in order >> can be executed in a different order by the driver and/or hardware than >> the order in which the writes were submitted ? >> >> See also the aio_write() man page, The Open Group Base Specifications >> Issue 7, IEEE Std 1003.1-2008 >> (http://pubs.opengroup.org/onlinepubs/9699919799/functions/aio_write.html). > > It is my understanding that the low level driver is free to re-order the > i/o's any way it wants, as is the hardware. It is up to the layers above > to enforce any ordering requirements. For a long time there was a bug > in the cciss driver that all i/o's submitted to the driver got reversed > in order -- adding to head of a list instead of to the tail, or vice versa, > I forget which -- and it caused no real problems (apart from some slight > performance issues that were mostly masked by the Smart Array's cache. > It was caught by firmware guys noticing LBAs coming in in weird orders > for supposedly sequential workloads. > > So in your scenario, I think the overlapping writes should not be submitted > by the block layer to the low level driver concurrently, as the block layer > is aware that the lld is free to re-order things. (I am very certain > that this is the case for scsi low level drivers and block drivers using a > request_fn interface -- less certain about block drivers using the > make_request interface to submit i/o's, as this interface is pretty new > to me. As far as I know there are basically two choices: 1. Allow the LLD to reorder any pair of write requests. The only way for higher layers to ensure the order of (overlapping) writes is then to separate these in time. Or in other words, limit write request queue depth to one. 2. Do not allow the LLD to reorder overlapping write requests. This allows higher software layers to queue write requests (queue depth > 1). From my experience with block and SCSI drivers option (1) doesn't look attractive from a performance point of view. From what I have seen performance with QD=1 is several times lower than performance with QD > 1. But maybe I overlooked something ? Bart. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: SCSI mid layer and high IOPS capable devices 2012-12-13 16:47 ` Bart Van Assche @ 2012-12-13 16:49 ` Christoph Hellwig 2012-12-14 9:44 ` Bart Van Assche 0 siblings, 1 reply; 21+ messages in thread From: Christoph Hellwig @ 2012-12-13 16:49 UTC (permalink / raw) To: Bart Van Assche; +Cc: scameron, linux-scsi, stephenmcameron, dab On Thu, Dec 13, 2012 at 05:47:14PM +0100, Bart Van Assche wrote: > From my experience with block and SCSI drivers option (1) doesn't > look attractive from a performance point of view. From what I have > seen performance with QD=1 is several times lower than performance > with QD > 1. But maybe I overlooked something ? What you might be missing is that at least on Linux no one who cares about performance uses the POSIX AIO interface anyway, as the implementation in glibc always has been horrible. The Linux-native aio interface or the various thread pool implementations don't imply useless ordering and thus can be used to fill up large queues. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: SCSI mid layer and high IOPS capable devices 2012-12-13 16:49 ` Christoph Hellwig @ 2012-12-14 9:44 ` Bart Van Assche 2012-12-14 16:44 ` scameron 0 siblings, 1 reply; 21+ messages in thread From: Bart Van Assche @ 2012-12-14 9:44 UTC (permalink / raw) To: Christoph Hellwig; +Cc: scameron, linux-scsi, stephenmcameron, dab On 12/13/12 17:49, Christoph Hellwig wrote: > On Thu, Dec 13, 2012 at 05:47:14PM +0100, Bart Van Assche wrote: >> From my experience with block and SCSI drivers option (1) doesn't >> look attractive from a performance point of view. From what I have >> seen performance with QD=1 is several times lower than performance >> with QD > 1. But maybe I overlooked something ? > > What you might be missing is that at least on Linux no one who cares > about performance uses the Posix AIO inferface anyway, as the > implementation in glibc always has been horrible. The Linux-native > aio interface or the various thread pool implementations don't imply > useless ordering and thus can be used to fill up large queues. Some applications need write ordering without having a need for enforcing durability as fsync() does [1]. What I'm wondering about is whether an operating system kernel like the Linux kernel should penalize application performance when using block drivers and storage hardware that preserve the order of write requests because there exist other drivers and storage devices that do not preserve the order of write requests ? [1] Richard Hipp, Re: [sqlite] SQLite on flash (was: [PATCH 00/16] f2fs: introduce flash-friendly file system), October 10, 2012 (http://www.mail-archive.com/sqlite-users@sqlite.org/msg73033.html). Bart. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: SCSI mid layer and high IOPS capable devices 2012-12-14 9:44 ` Bart Van Assche @ 2012-12-14 16:44 ` scameron 2012-12-14 16:15 ` Bart Van Assche 0 siblings, 1 reply; 21+ messages in thread From: scameron @ 2012-12-14 16:44 UTC (permalink / raw) To: Bart Van Assche Cc: Christoph Hellwig, linux-scsi, stephenmcameron, dab, scameron On Fri, Dec 14, 2012 at 10:44:34AM +0100, Bart Van Assche wrote: > On 12/13/12 17:49, Christoph Hellwig wrote: > >On Thu, Dec 13, 2012 at 05:47:14PM +0100, Bart Van Assche wrote: > >> From my experience with block and SCSI drivers option (1) doesn't > >>look attractive from a performance point of view. From what I have > >>seen performance with QD=1 is several times lower than performance > >>with QD > 1. But maybe I overlooked something ? > > > >What you might be missing is that at least on Linux no one who cares > >about performance uses the Posix AIO inferface anyway, as the > >implementation in glibc always has been horrible. The Linux-native > >aio interface or the various thread pool implementations don't imply > >useless ordering and thus can be used to fill up large queues. > > Some applications need write ordering without having a need for > enforcing durability as fsync() does [1]. What I'm wondering about is > whether an operating system kernel like the Linux kernel should penalize > application performance when using block drivers and storage hardware > that preserve the order of write requests because there exist other > drivers and storage devices that do not preserve the order of write > requests ? Which devices don't re-order requests? So far as I know every single disk drive ever made that is capable of handling multiple requests will also re-order requests as it sees fit. I expect the flash devices re-order requests as well, simply because to feed requests to the things at a sufficient rate, you have to pump requests into them concurrently on multiple hardware queues -- a single cpu jamming requests into them as fast as it can is still not fast enough to keep them busy. Consequently, they *can't* care about ordering, as the relative order requests on different hardware queues are submitted into them is not even really controlled, so the OS *can't* count on concurrent requests not to be essentially "re-ordered", just because of the nature of the way requests get into the device. So I think the property that devices and drivers are free to reorder concurrent requests is not going away. -- steve ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: SCSI mid layer and high IOPS capable devices 2012-12-14 16:44 ` scameron @ 2012-12-14 16:15 ` Bart Van Assche 2012-12-14 19:55 ` scameron 0 siblings, 1 reply; 21+ messages in thread From: Bart Van Assche @ 2012-12-14 16:15 UTC (permalink / raw) To: scameron; +Cc: Christoph Hellwig, linux-scsi, stephenmcameron, dab On 12/14/12 17:44, scameron@beardog.cce.hp.com wrote: > I expect the flash devices re-order requests as well, simply because > to feed requests to the things at a sufficient rate, you have to pump > requests into them concurrently on multiple hardware queues -- a single > cpu jamming requests into them as fast as it can is still not fast enough > to keep them busy. Consequently, they *can't* care about ordering, as the > relative order requests on different hardware queues are submitted into them > is not even really controlled, so the OS *can't* count on concurrent requests > not to be essentially "re-ordered", just because of the nature of the way > requests get into the device. Why should a flash device have to reorder write requests ? These devices typically use a log-structured file system internally. Bart. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: SCSI mid layer and high IOPS capable devices 2012-12-14 16:15 ` Bart Van Assche @ 2012-12-14 19:55 ` scameron 2012-12-14 19:28 ` Bart Van Assche 0 siblings, 1 reply; 21+ messages in thread From: scameron @ 2012-12-14 19:55 UTC (permalink / raw) To: Bart Van Assche Cc: Christoph Hellwig, linux-scsi, stephenmcameron, dab, scameron On Fri, Dec 14, 2012 at 05:15:37PM +0100, Bart Van Assche wrote: > On 12/14/12 17:44, scameron@beardog.cce.hp.com wrote: > >I expect the flash devices re-order requests as well, simply because > >to feed requests to the things at a sufficient rate, you have to pump > >requests into them concurrently on multiple hardware queues -- a single > >cpu jamming requests into them as fast as it can is still not fast enough > >to keep them busy. Consequently, they *can't* care about ordering, as the > >relative order requests on different hardware queues are submitted into > >them > >is not even really controlled, so the OS *can't* count on concurrent > >requests > >not to be essentially "re-ordered", just because of the nature of the way > >requests get into the device. > > Why should a flash device have to reorder write requests ? These devices > typically use a log-structured file system internally. It's not so much that they are re-ordered as that there is no controlled ordering to begin with because multiple cpus are submitting to multiple hardware queues concurrently. If you have 12 requests coming in on 12 cpus to 12 hardware queues to the device, it's going to be racy as to which request is processed first by the device -- and this is fine, the hardware queues are independent of one another and do not need to worry about each other. This is all to provide a means of getting enough commands on the device to actually keep it busy. A single cpu can't do it, the device is too fast. If you have ordering dependencies such that request A must complete before request B completes, then don't submit A and B concurrently, because if you do submit them concurrently, you cannot tell whether A or B will arrive into the device first because they may go into it via different hardware queues. Note, in case it isn't obvious, the hardware queues I'm talking about here are not the struct scsi_device, sdev->request_queue queues, they are typically ring buffers in host memory from which the device DMAs commands/responses to/from depending on if it's a submit queue or a completion queue and with producer/consumer indexes one of which is in host memory and one of which is a register on the device (which is which depends on the direction of the queue, from device (pi = host memory, ci = device register), or to device (pi = device register, ci = host memory)) -- steve ^ permalink raw reply [flat|nested] 21+ messages in thread
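To make the queue structure described above concrete, here is a minimal sketch of one such submit queue. The struct layout, names, and the reduced barrier and error handling are made up for illustration and are not taken from any particular driver.

    /* One per-cpu submit queue: a ring of command blocks in host memory.
     * In this direction the producer index is a device register (the
     * doorbell) and the consumer index is written back by the device into
     * host memory; a completion queue is wired up the other way around.
     */
    struct submit_queue {
            struct my_command *ring;        /* DMA-able ring of command blocks */
            u32 depth;
            u32 pi;                         /* host's shadow of the producer index */
            volatile u32 *ci;               /* consumer index, updated by the device */
            void __iomem *pi_doorbell;      /* producer index register on the device */
            spinlock_t lock;                /* per-queue, so cpus rarely contend */
    };

    static int submit_command(struct submit_queue *sq, struct my_command *cmd)
    {
            u32 next = (sq->pi + 1) % sq->depth;

            if (next == *sq->ci)
                    return -EBUSY;                  /* ring is full */
            sq->ring[sq->pi] = *cmd;
            wmb();                                  /* command visible before the doorbell */
            sq->pi = next;
            writel(sq->pi, sq->pi_doorbell);        /* tell the device there is new work */
            return 0;
    }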
* Re: SCSI mid layer and high IOPS capable devices 2012-12-14 19:55 ` scameron @ 2012-12-14 19:28 ` Bart Van Assche 2012-12-14 21:06 ` scameron 0 siblings, 1 reply; 21+ messages in thread From: Bart Van Assche @ 2012-12-14 19:28 UTC (permalink / raw) To: scameron; +Cc: Christoph Hellwig, linux-scsi, stephenmcameron, dab On 12/14/12 20:55, scameron@beardog.cce.hp.com wrote: > It's not so much that they are re-ordered as that there is no controlled > ordering to begin with because multiple cpus are submitting to multiple > hardware queues concurrently. If you have 12 requests coming in on 12 > cpus to 12 hardware queues to the device, it's going to be racy as to > which request is processed first by the device -- and this is fine, the > hardware queues are independent of one another and do not need to worry > about each other. This is all to provide a means of getting enough commands > on the device to actually keep it busy. A single cpu can't do it, the > device is too fast. If you have ordering dependencies such that request > A must complete before request B completes, then don't submit A and B > concurrently, because if you do submit them concurrently, you cannot tell > whether A or B will arrive into the device first because they may go into > it via different hardware queues. It depends on how these multiple queues are used. If each queue would e.g. be associated with a disjoint LBA range of the storage device then there wouldn't be a risk of request reordering due to using multiple hardware queues. Bart. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: SCSI mid layer and high IOPS capable devices 2012-12-14 19:28 ` Bart Van Assche @ 2012-12-14 21:06 ` scameron 2012-12-15 9:40 ` Bart Van Assche 0 siblings, 1 reply; 21+ messages in thread From: scameron @ 2012-12-14 21:06 UTC (permalink / raw) To: Bart Van Assche Cc: Christoph Hellwig, linux-scsi, stephenmcameron, dab, scameron On Fri, Dec 14, 2012 at 08:28:56PM +0100, Bart Van Assche wrote: > On 12/14/12 20:55, scameron@beardog.cce.hp.com wrote: > >It's not so much that they are re-ordered as that there is no controlled > >ordering to begin with because multiple cpus are submitting to multiple > >hardware queues concurrently. If you have 12 requests coming in on 12 > >cpus to 12 hardware queues to the device, it's going to be racy as to > >which request is processed first by the device -- and this is fine, the > >hardware queues are independent of one another and do not need to worry > >about each other. This is all to provide a means of getting enough > >commands > >on the device to actually keep it busy. A single cpu can't do it, the > >device is too fast. If you have ordering dependencies such that request > >A must complete before request B completes, then don't submit A and B > >concurrently, because if you do submit them concurrently, you cannot tell > >whether A or B will arrive into the device first because they may go into > >it via different hardware queues. > > It depends on how these multiple queues are used. If each queue would > e.g. be associated with a disjoint LBA range of the storage device then > there wouldn't be a risk of request reordering due to using multiple > hardware queues. They are not associated with disjoint LBA ranges. They are associated with CPUs on the submission side, there's a submit queue per cpu, and msix vectors on the completion side (also a completion queue per cpu). The point of the queues is only to provide a wide enough highway to allow enough requests to be shoved down to the device fast enough and completed back to the host fast enough that the device can be kept reasonably busy, instead of being starved for work to do. There is no distinction about what the requests may do based on what hardware i/o queue they come in on (e.g. no lba range partitioning). All the i/o queues are equivalent. Pretty much all current storage devices, disk drives and the devices I'm talking about in particular do depend on the low level driver and storage devices being permitted to re-order requests. So I don't think the discussion about drivers and devices that *do not* reorder requests (which drivers and devices would those be?) is very related to the topic of how to get the scsi mid layer to provide a wide enough highway for requests destined for very low latency devices. -- steve ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: SCSI mid layer and high IOPS capable devices 2012-12-14 21:06 ` scameron @ 2012-12-15 9:40 ` Bart Van Assche 2012-12-19 14:23 ` Christoph Hellwig 0 siblings, 1 reply; 21+ messages in thread From: Bart Van Assche @ 2012-12-15 9:40 UTC (permalink / raw) To: scameron; +Cc: Christoph Hellwig, linux-scsi, stephenmcameron, dab On 12/14/12 22:06, scameron@beardog.cce.hp.com wrote: > [ ... ] how to get the scsi mid layer to provide a wide enough > highway for requests destined for very low latency devices. While the SCSI mid-layer is processing an I/O request not only the queue lock has to be locked and unlocked several times but also the SCSI host lock. The reason that it's unavoidable to lock and unlock the host lock is because the SCSI core has been designed for SCSI equipment that has a queue depth limit per host (shost->can_queue). For single LUN devices that model could be changed into a queue depth limit per LUN. Also, it's probably not that hard to modify software SCSI target implementations such that these have a queue depth limit per LUN instead of per host. It might be interesting to verify whether the following approach helps to improve performance of the SCSI mid-layer:
* Make it possible for SCSI LLDs to tell the SCSI core whether there is a queue depth limit per host or per LUN.
* Do not update shost->host_busy and shost->target_busy when using the QD limit per LUN mode. This change will make it possible to avoid the spin_lock() and spin_unlock() calls inside scsi_request_fn(). It will also avoid having to take the host lock inside scsi_device_unbusy().
* In queue-depth-limit-per-LUN mode, neither add a SCSI device to the starved list if it's busy nor examine the starved list in scsi_run_queue().
Bart. ^ permalink raw reply [flat|nested] 21+ messages in thread
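A rough sketch of what the first two bullets might look like. The qd_limit_per_lun flag and this helper do not exist in the SCSI core; the sketch also assumes sdev->device_busy could become an atomic counter, which is part of what the proposal implies.

    static void scsi_account_busy(struct Scsi_Host *shost, struct scsi_device *sdev)
    {
            if (shost->hostt->qd_limit_per_lun) {
                    /* per-LUN limit: bump only this LUN's counter -- no host
                     * lock, no shost->host_busy or target_busy update */
                    atomic_inc(&sdev->device_busy);
                    return;
            }

            /* per-host limit (current behaviour, lock context simplified):
             * host-wide counters serialize every cpu on the host lock */
            spin_lock_irq(shost->host_lock);
            shost->host_busy++;
            scsi_target(sdev)->target_busy++;
            spin_unlock_irq(shost->host_lock);
    }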
* Re: SCSI mid layer and high IOPS capable devices 2012-12-15 9:40 ` Bart Van Assche @ 2012-12-19 14:23 ` Christoph Hellwig 0 siblings, 0 replies; 21+ messages in thread From: Christoph Hellwig @ 2012-12-19 14:23 UTC (permalink / raw) To: Bart Van Assche Cc: scameron, Christoph Hellwig, linux-scsi, stephenmcameron, dab On Sat, Dec 15, 2012 at 10:40:24AM +0100, Bart Van Assche wrote: > On 12/14/12 22:06, scameron@beardog.cce.hp.com wrote: > >[ ... ] how to get the scsi mid layer to provide a wide enough > >highway for requests destined for very low latency devices. > > While the SCSI mid-layer is processing an I/O request not only the > queue lock has to be locked and unlocked several times but also the > SCSI host lock. The reason that it's unavoidable to lock and unlock > the host lock is because the SCSI core has been designed for SCSI > equipment that has a queue depth limit per host (shost->can_queue). > For single LUN devices that model could be changed in a queue depth > limit per LUN. Also, it's probably not that hard to modify software > SCSI target implementations such that these have a queue depth limit > per LUN instead of per host. We'd also better avoid needing a lock to check these limits, especially if we normally don't hit them. The easiest way to get started would be to simply allow a magic can_queue value that keeps these as unlimited and only let the driver return one of the busy values from ->queuecommand. We could then use unlocked list empty checks to see if anything is in a waiting list and enter a slow path mode. ^ permalink raw reply [flat|nested] 21+ messages in thread
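A sketch of that idea, purely for illustration: the sentinel value and the slow-path helper are invented here, and the real scsi_host_queue_ready() does its accounting under the host lock.

    #define SHOST_QD_UNLIMITED      (-1)    /* hypothetical "don't account per host" */

    static int scsi_host_queue_ready(struct request_queue *q,
                                     struct Scsi_Host *shost,
                                     struct scsi_device *sdev)
    {
            if (shost->can_queue == SHOST_QD_UNLIMITED) {
                    /* Fast path: no host-wide counters, no host lock.  The
                     * unlocked list_empty() is only a hint; fall into the
                     * existing locked slow path when someone is waiting. */
                    if (unlikely(!list_empty(&shost->starved_list)))
                            return scsi_host_queue_ready_slow(q, shost, sdev);
                    return 1;
            }

            /* ... existing can_queue/host_busy checks under the host lock ... */
            return 1;
    }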
* Re: SCSI mid layer and high IOPS capable devices 2012-12-13 16:47 ` Bart Van Assche 2012-12-13 16:49 ` Christoph Hellwig @ 2012-12-13 21:20 ` scameron 2012-12-14 0:22 ` Jack Wang 2 siblings, 0 replies; 21+ messages in thread From: scameron @ 2012-12-13 21:20 UTC (permalink / raw) To: Bart Van Assche; +Cc: linux-scsi, stephenmcameron, dab, scameron On Thu, Dec 13, 2012 at 05:47:14PM +0100, Bart Van Assche wrote: > On 12/13/12 18:25, scameron@beardog.cce.hp.com wrote: > >On Thu, Dec 13, 2012 at 04:22:33PM +0100, Bart Van Assche wrote: > >>On 12/11/12 01:00, scameron@beardog.cce.hp.com wrote: > >>>The driver, like nvme, has a submit and reply queue per cpu. > >> > >>This is interesting. If my interpretation of the POSIX spec is correct > >>then aio_write() allows to queue overlapping writes and all writes > >>submitted by the same thread have to be performed in the order they were > >>submitted by that thread. What if a thread submits a first write via > >>aio_write(), gets rescheduled on another CPU and submits a second > >>overlapping write also via aio_write() ? If a block driver uses one > >>queue per CPU, does that mean that such writes that were issued in order > >>can be executed in a different order by the driver and/or hardware than > >>the order in which the writes were submitted ? > >> > >>See also the aio_write() man page, The Open Group Base Specifications > >>Issue 7, IEEE Std 1003.1-2008 > >>(http://pubs.opengroup.org/onlinepubs/9699919799/functions/aio_write.html). > > > >It is my understanding that the low level driver is free to re-order the > >i/o's any way it wants, as is the hardware. It is up to the layers above > >to enforce any ordering requirements. For a long time there was a bug > >in the cciss driver that all i/o's submitted to the driver got reversed > >in order -- adding to head of a list instead of to the tail, or vice versa, > >I forget which -- and it caused no real problems (apart from some slight > >performance issues that were mostly masked by the Smart Array's cache. > >It was caught by firmware guys noticing LBAs coming in in weird orders > >for supposedly sequential workloads. > > > >So in your scenario, I think the overlapping writes should not be submitted > >by the block layer to the low level driver concurrently, as the block layer > >is aware that the lld is free to re-order things. (I am very certain > >that this is the case for scsi low level drivers and block drivers using a > >request_fn interface -- less certain about block drivers using the > >make_request interface to submit i/o's, as this interface is pretty new > >to me. > > As far as I know there are basically two choices: > 1. Allow the LLD to reorder any pair of write requests. The only way > for higher layers to ensure the order of (overlapping) writes is then > to separate these in time. Or in other words, limit write request > queue depth to one. > > 2. Do not allow the LLD to reorder overlapping write requests. This > allows higher software layers to queue write requests (queue depth > > 1). > > From my experience with block and SCSI drivers option (1) doesn't look > attractive from a performance point of view. From what I have seen > performance with QD=1 is several times lower than performance with QD > > 1. But maybe I overlooked something ? I don't think 1 is how it works, and I know 2 is not how it works. LLD's are definitely allowed to re-order i/o's arbitrarily (and so is the hardware (e.g. array controller or disk drive)). 
If you need an i/o to complete before another begins, don't give the 2nd i/o to the LLD before the 1st completes, but be smarter than limiting all writes to queue depth of 1 by knowing when you care about the order. If my understanding is correct, the buffer cache will, for the most part, make sure there generally aren't many overlapping or order-dependent i/o's by essentially combining multiple overlapping writes into a single write. But for filesystem metadata, or direct i/o, there may of course be application-specific ordering requirements, and the answer is, I think, that the application (e.g. the filesystem) needs to know when it cares about the order, wait for completions as necessary when it does care, and take pains not to care about the order most of the time if performance is important (one of the reasons the buffer cache exists). (I might be wrong though.) -- steve ^ permalink raw reply [flat|nested] 21+ messages in thread
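As a concrete illustration of "wait for the completion when you do care about the order", using the Linux-native AIO interface mentioned earlier in the thread (the file name, offsets, and sizes are arbitrary; build with -laio):

    /* ordered-aio.c: hand write B to the driver only after write A has
     * completed, rather than relying on any lower layer to preserve order.
     */
    #define _GNU_SOURCE
    #include <libaio.h>
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
            io_context_t ctx = 0;
            struct iocb cb, *cbs[1] = { &cb };
            struct io_event ev;
            static char a[4096] __attribute__((aligned(4096)));
            static char b[4096] __attribute__((aligned(4096)));
            int fd = open("testfile", O_RDWR | O_CREAT | O_DIRECT, 0644);

            if (fd < 0 || io_setup(16, &ctx) < 0)
                    return 1;

            memset(a, 'A', sizeof(a));
            io_prep_pwrite(&cb, fd, a, sizeof(a), 0);
            io_submit(ctx, 1, cbs);                 /* submit A ... */
            io_getevents(ctx, 1, 1, &ev, NULL);     /* ... and wait for it */

            memset(b, 'B', sizeof(b));
            io_prep_pwrite(&cb, fd, b, sizeof(b), 0);       /* B overlaps A */
            io_submit(ctx, 1, cbs);                 /* only now does B reach the LLD */
            io_getevents(ctx, 1, 1, &ev, NULL);

            io_destroy(ctx);
            close(fd);
            return 0;
    }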
* RE: SCSI mid layer and high IOPS capable devices 2012-12-13 16:47 ` Bart Van Assche 2012-12-13 16:49 ` Christoph Hellwig 2012-12-13 21:20 ` scameron @ 2012-12-14 0:22 ` Jack Wang [not found] ` <CADzpL0TMT31yka98Zv0=53N4=pDZOc9+gacnvDWMbj+iZg4H5w@mail.gmail.com> 2 siblings, 1 reply; 21+ messages in thread From: Jack Wang @ 2012-12-14 0:22 UTC (permalink / raw) To: 'Bart Van Assche', scameron; +Cc: linux-scsi, stephenmcameron, dab On 12/13/12 18:25, scameron@beardog.cce.hp.com wrote: > On Thu, Dec 13, 2012 at 04:22:33PM +0100, Bart Van Assche wrote: >> On 12/11/12 01:00, scameron@beardog.cce.hp.com wrote: >>> The driver, like nvme, has a submit and reply queue per cpu. >> >> This is interesting. If my interpretation of the POSIX spec is >> correct then aio_write() allows to queue overlapping writes and all >> writes submitted by the same thread have to be performed in the order >> they were submitted by that thread. What if a thread submits a first >> write via aio_write(), gets rescheduled on another CPU and submits a >> second overlapping write also via aio_write() ? If a block driver >> uses one queue per CPU, does that mean that such writes that were >> issued in order can be executed in a different order by the driver >> and/or hardware than the order in which the writes were submitted ? >> >> See also the aio_write() man page, The Open Group Base Specifications >> Issue 7, IEEE Std 1003.1-2008 >> (http://pubs.opengroup.org/onlinepubs/9699919799/functions/aio_write.html). > > It is my understanding that the low level driver is free to re-order > the i/o's any way it wants, as is the hardware. It is up to the > layers above to enforce any ordering requirements. For a long time > there was a bug in the cciss driver that all i/o's submitted to the > driver got reversed in order -- adding to head of a list instead of to > the tail, or vice versa, I forget which -- and it caused no real > problems (apart from some slight performance issues that were mostly masked by the Smart Array's cache. > It was caught by firmware guys noticing LBAs coming in in weird orders > for supposedly sequential workloads. > > So in your scenario, I think the overlapping writes should not be > submitted by the block layer to the low level driver concurrently, as > the block layer is aware that the lld is free to re-order things. (I > am very certain that this is the case for scsi low level drivers and > block drivers using a request_fn interface -- less certain about block > drivers using the make_request interface to submit i/o's, as this > interface is pretty new to me. As far as I know there are basically two choices: 1. Allow the LLD to reorder any pair of write requests. The only way for higher layers to ensure the order of (overlapping) writes is then to separate these in time. Or in other words, limit write request queue depth to one. 2. Do not allow the LLD to reorder overlapping write requests. This allows higher software layers to queue write requests (queue depth > 1). From my experience with block and SCSI drivers option (1) doesn't look attractive from a performance point of view. From what I have seen performance with QD=1 is several times lower than performance with QD > 1. But maybe I overlooked something ? Bart. I was seen low queue depth improve sequential performance, and high queue depth improve random performance. 
Jack ^ permalink raw reply [flat|nested] 21+ messages in thread
* RE: SCSI mid layer and high IOPS capable devices [not found] ` <CADzpL0S5cfCRQftrxHij8KOjKj55psSJedmXLBQz1uQm_SC30A@mail.gmail.com> @ 2012-12-14 4:59 ` Jack Wang 0 siblings, 0 replies; 21+ messages in thread From: Jack Wang @ 2012-12-14 4:59 UTC (permalink / raw) To: 'Stephen Cameron' Cc: 'Bart Van Assche', 'Stephen M. Cameron', linux-scsi, 'dbrace' Steve, Thanks for share detail of your problem. Yes you re right about test I talk. Now I know what you want to discuss on this thread. Jack Right, but if I understand you correctly, you're ganging up 24 device queues and measuring aggregate iops across them all. That is, you have 24 SAS disks all presented individually to the OS, right? (or did the controller aggregate them all into 1 logical drive presented to the OS?) I'm talking about one very low latency single device capable of let's say 450k iops all by itself. The problem is that with the scsi mid layer in this case, there can only be a single request queue feeding that one device (unlike your 24 request queues feeding 24 devices.) That single request queue is essentially single threaded -- only one cpu can touch it at a time to add or remove a request from it. With the block layer's make_request interface, I can take advantage of parallelism in the low level block driver and get essentially a queue per cpu feeding the single device. With the scsi mid layer, the low level driver's queue per cpu is (if I am correct) throttled by the fact that what is feeding those lld queues is one (essentially) single threaded request queue. It doesn't matter that the scsi LLD has a twelve lane highway leading into it because the scsi midlayer has a 1 lane highway feeding into that 12 lane highway. If I understand you correctly, you get 800k iops by measuring 24 highways going to 24 different towns. I have one town and one highway. The part of my highway that I control can handle several hundred kiops, but the part I don't control seemingly cannot. That is why scsi_debug driver can't get very high iops on a single pseudo-device, because there's only one request queue and that queue is protected by a spin lock. perf shows contention on spin locks in scsi_request_fn() -- large percentage of cpu time spent trying to get spin locks in scsi_request_fn(). I forget the exact number right now, but iirc, it was something like 30-40%. That is sort of the whole problem I'm having, as best I understand it, and why I started this thread. And unfortunately I do not have any very good ideas about what to do about it, other than use the block layer's make request interface, which is not ideal for a number of reasons (e.g. people and software (grub, etc.) are very much accustomed to dealing with the sd driver, and all other things being equal, using the sd driver interface is very much preferable.) With flash based storage devices, the age old assumptions that "disks" are glacially slow compared to the cpu(s) and seek penalties exist and are to be avoided which underlie the design of the linux storage subsystem architecture are starting to become false. That's kind of the "big picture" view of the problem. Part of me thinks what we really ought to do is make the non-volatile storage look like RAM at the hardware level, more or less, then put a ramfs on top of it, and call it done (there are probably myriad reasons it's not that simple of which I'm ignorant.) 
-- steve On Thu, Dec 13, 2012 at 7:41 PM, Jack Wang <jack_wang@usish.com> wrote: Maybe, and good to know for real-world scenarios, but scsi-debug with fake_rw=1 isn't even actually doing the i/o. I would think sequential, random, whatever wouldn't matter in that case, because presumably, it's not even looking at the LBAs, much less acting on them, nor would I expect the no-op i/o scheduler to be affected by the LBAs. -- steve For read world hardware, I tested with next generation PMCS SAS controller with 24 SAS disks, 512 sequential read with more than 800K , 512 sequential write with more than 500K similar results with windows 2008, but SATA performance did worse than windows kernel is 3.2.x as I remembered. Jack On Thu, Dec 13, 2012 at 6:22 PM, Jack Wang <jack_wang@usish.com> wrote: On 12/13/12 18:25, scameron@beardog.cce.hp.com wrote: > On Thu, Dec 13, 2012 at 04:22:33PM +0100, Bart Van Assche wrote: >> On 12/11/12 01:00, scameron@beardog.cce.hp.com wrote: >>> The driver, like nvme, has a submit and reply queue per cpu. >> >> This is interesting. If my interpretation of the POSIX spec is >> correct then aio_write() allows to queue overlapping writes and all >> writes submitted by the same thread have to be performed in the order >> they were submitted by that thread. What if a thread submits a first >> write via aio_write(), gets rescheduled on another CPU and submits a >> second overlapping write also via aio_write() ? If a block driver >> uses one queue per CPU, does that mean that such writes that were >> issued in order can be executed in a different order by the driver >> and/or hardware than the order in which the writes were submitted ? >> >> See also the aio_write() man page, The Open Group Base Specifications >> Issue 7, IEEE Std 1003.1-2008 >> (http://pubs.opengroup.org/onlinepubs/9699919799/functions/aio_write.html). > > It is my understanding that the low level driver is free to re-order > the i/o's any way it wants, as is the hardware. It is up to the > layers above to enforce any ordering requirements. For a long time > there was a bug in the cciss driver that all i/o's submitted to the > driver got reversed in order -- adding to head of a list instead of to > the tail, or vice versa, I forget which -- and it caused no real > problems (apart from some slight performance issues that were mostly masked by the Smart Array's cache. > It was caught by firmware guys noticing LBAs coming in in weird orders > for supposedly sequential workloads. > > So in your scenario, I think the overlapping writes should not be > submitted by the block layer to the low level driver concurrently, as > the block layer is aware that the lld is free to re-order things. (I > am very certain that this is the case for scsi low level drivers and > block drivers using a request_fn interface -- less certain about block > drivers using the make_request interface to submit i/o's, as this > interface is pretty new to me. As far as I know there are basically two choices: 1. Allow the LLD to reorder any pair of write requests. The only way for higher layers to ensure the order of (overlapping) writes is then to separate these in time. Or in other words, limit write request queue depth to one. 2. Do not allow the LLD to reorder overlapping write requests. This allows higher software layers to queue write requests (queue depth > 1). From my experience with block and SCSI drivers option (1) doesn't look attractive from a performance point of view. 
From what I have seen performance with QD=1 is several times lower than performance with QD > 1. But maybe I overlooked something ? Bart. I have seen low queue depth improve sequential performance, and high queue depth improve random performance. Jack ^ permalink raw reply [flat|nested] 21+ messages in thread