* SCSI mid layer and high IOPS capable devices @ 2012-12-11 0:00 scameron 2012-12-11 8:21 ` Bart Van Assche 2012-12-13 15:22 ` Bart Van Assche 0 siblings, 2 replies; 21+ messages in thread From: scameron @ 2012-12-11 0:00 UTC (permalink / raw) To: linux-scsi; +Cc: stephenmcameron, scameron, dab Is there any work going on to improve performance of the SCSI layer to better support devices capable of high IOPS? I've been playing around with some flash-based devices and have a block driver that uses the make_request interface (calls blk_queue_make_request() rather than blk_init_queue()) and a SCSI LLD variant of the same driver. The block driver is similar in design and performance to the nvme driver. If I compare the performance, the block driver gets about 3x the performance as the SCSI LLD. The SCSI LLD spends a lot of time (according to perf) contending for locks in scsi_request_fn(), presumably the host lock or the queue lock, or perhaps both. All other things being equal, a SCSI LLD would be preferable to me, but, with performance differing by a factor of around 3x, all other things are definitely not equal. I tried using scsi_debug with fake_rw and also the scsi_ram driver that was recently posted to get some idea of what the maximum IOPS that could be pushed through the SCSI midlayer might be, and the numbers were a little disappointing (was getting around 150k iops with scsi_debug with reads and writes faked, and around 3x that with the block driver actually doing the i/o). Essentially, what I've been finding out is consistent with what's in this slide deck: http://static.usenix.org/event/lsf08/tech/IO_Carlson_Accardi_SATA.pdf The driver, like nvme, has a submit and reply queue per cpu. I'm sort of guessing that funnelling all the requests through a single request queue per device that only one cpu can touch at a time as the scsi mid layer does is a big part of what's killing performance. Looking through the scsi code, if I read it correctly, the assumption that each device has a request queue seems to be all over the code, so how exactly one might go about attempting to improve the situation is not really obvious to me. Anyway, just wondering if anybody is looking into doing some improvements in this area. -- steve ^ permalink raw reply [flat|nested] 21+ messages in thread
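For readers who have not written a block driver, a rough sketch of the two submission paths being contrasted above may help. The driver names and helpers (mydrv_*, h->queue[]) are hypothetical and the calls shown are approximately the block-layer interfaces of the 3.x kernels under discussion; this is an illustration, not the poster's actual driver.

    /*
     * (a) Conventional request_fn driver: the block layer (and, for a SCSI
     *     LLD, the SCSI midlayer above it) funnels everything through one
     *     request_queue whose dispatch runs under q->queue_lock.
     */
    static void mydrv_request_fn(struct request_queue *q)
    {
            struct request *rq;

            /* called with q->queue_lock held -- one cpu at a time */
            while ((rq = blk_fetch_request(q)) != NULL)
                    mydrv_dispatch(q->queuedata, rq);
    }
        ...
        q = blk_init_queue(mydrv_request_fn, &h->lock);

    /*
     * (b) make_request driver: bios arrive directly, and the driver can
     *     route each one to a per-cpu submit queue with no shared lock.
     */
    static void mydrv_make_request(struct request_queue *q, struct bio *bio)
    {
            struct mydrv_host *h = q->queuedata;

            /* the cpu number is only a hint for queue selection */
            mydrv_queue_bio(&h->queue[raw_smp_processor_id()], bio);
    }
        ...
        q = blk_alloc_queue(GFP_KERNEL);
        blk_queue_make_request(q, mydrv_make_request);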
* Re: SCSI mid layer and high IOPS capable devices 2012-12-11 0:00 SCSI mid layer and high IOPS capable devices scameron @ 2012-12-11 8:21 ` Bart Van Assche 2012-12-11 22:46 ` scameron 2012-12-13 15:22 ` Bart Van Assche 1 sibling, 1 reply; 21+ messages in thread From: Bart Van Assche @ 2012-12-11 8:21 UTC (permalink / raw) To: scameron; +Cc: linux-scsi, stephenmcameron, dab On 12/11/12 01:00, scameron@beardog.cce.hp.com wrote: > I tried using scsi_debug with fake_rw and also the scsi_ram driver > that was recently posted to get some idea of what the maximum IOPS > that could be pushed through the SCSI midlayer might be, and the > numbers were a little disappointing (was getting around 150k iops > with scsi_debug with reads and writes faked, and around 3x that > with the block driver actually doing the i/o). With which request size was that ? I see about 330K IOPS @ 4 KB and about 540K IOPS @ 512 bytes with the SRP protocol, a RAM disk at the target side, a single SCSI LUN and a single IB cable. These results have been obtained on a setup with low-end CPU's. Had you set rq_affinity to 2 in your tests ? Bart. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: SCSI mid layer and high IOPS capable devices 2012-12-11 8:21 ` Bart Van Assche @ 2012-12-11 22:46 ` scameron 2012-12-13 11:40 ` Bart Van Assche 0 siblings, 1 reply; 21+ messages in thread From: scameron @ 2012-12-11 22:46 UTC (permalink / raw) To: Bart Van Assche; +Cc: linux-scsi, stephenmcameron, dab, scameron On Tue, Dec 11, 2012 at 09:21:46AM +0100, Bart Van Assche wrote: > On 12/11/12 01:00, scameron@beardog.cce.hp.com wrote: > >I tried using scsi_debug with fake_rw and also the scsi_ram driver > >that was recently posted to get some idea of what the maximum IOPS > >that could be pushed through the SCSI midlayer might be, and the > >numbers were a little disappointing (was getting around 150k iops > >with scsi_debug with reads and writes faked, and around 3x that > >with the block driver actually doing the i/o). > > With which request size was that ? 4k (I'm thinking the request size should not matter too much since fake_rw=1 causes the i/o not to actually be done -- there's no data transferred. Similarly with scsi_ram there's a flag to discard reads and writes that I was using.) > I see about 330K IOPS @ 4 KB and > about 540K IOPS @ 512 bytes with the SRP protocol, a RAM disk at the > target side, a single SCSI LUN and a single IB cable. These results have > been obtained on a setup with low-end CPU's. Had you set rq_affinity to > 2 in your tests ? No, hadn't done anything with rq_affinity. I had spread interrupts around by turning off irqbalance and echoing things into /proc/irq/*, and running a bunch of dd processes (one per cpu) like this: taskset -c $cpu dd if=/dev/blah of=/dev/null bs=4k iflag=direct & And the hardware in this case should route the interrupts back to the processor which submitted the i/o (the submitted command contains info that lets the hw know which msix vector we want the io to come back on.) I would be curious to see what kind of results you would get with scsi_debug with fake_rw=1. I am sort of suspecting that trying to put an "upper limit" on scsi LLD IOPS performance by seeing what scsi_debug will do with fake_rw=1 is not really valid (or, maybe I'm doing it wrong) as I know of one case in which a real HW scsi driver beats scsi_debug fake_rw=1 at IOPS on the very same system, which seems like it shouldn't be possible. Kind of mysterious. Another mystery I haven't been able to clear up -- I'm using code like this to set affinity hints int i, cpu; cpu = cpumask_first(cpu_online_mask); for (i = 0; i < h->noqs; i++) { int idx = i ? i + 1 : i; int rc; rc = irq_set_affinity_hint(h->qinfo[idx].msix_vector, get_cpu_mask(cpu)); if (rc) dev_warn(&h->pdev->dev, "Failed to hint affinity of vector %d to cpu %d\n", h->qinfo[idx].msix_vector, cpu); cpu = cpumask_next(cpu, cpu_online_mask); } and those hints are set (querying /proc/irq/*/affinity_hint shows that my hints are in there) but the hints are not "taken", that is /proc/irq/smp_affinity does not match the hints. doing this: for x in `seq $first_irq $last_irq` do cat /proc/irq/$x/affinity_hint > /proc/irq/$x/smp_affinity done (where first_irq and last_irq specify the range of irqs assigned to my driver) makes the hints be "taken". I noticed nvme doesn't seem to suffer from this, somehow the hints are taken automatically (er, I don't recall if /proc/irq/*/smp_affinity matches affinity_hints for nvme, but interrupts seem spread around without doing anything special). 
I haven't seen anything in the nvme code related to affinity that I'm not already doing as well in my driver, so it is a mystery to me why that difference in behavior occurs. -- steve ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: SCSI mid layer and high IOPS capable devices 2012-12-11 22:46 ` scameron @ 2012-12-13 11:40 ` Bart Van Assche 2012-12-13 18:03 ` scameron 0 siblings, 1 reply; 21+ messages in thread From: Bart Van Assche @ 2012-12-13 11:40 UTC (permalink / raw) To: scameron; +Cc: linux-scsi, stephenmcameron, dab On 12/11/12 23:46, scameron@beardog.cce.hp.com wrote: > I would be curious to see what kind of results you would get with scsi_debug > with fake_rw=1. I am sort of suspecting that trying to put an "upper limit" > on scsi LLD IOPS performance by seeing what scsi_debug will do with fake_rw=1 > is not really valid (or, maybe I'm doing it wrong) as I know of one case in > which a real HW scsi driver beats scsi_debug fake_rw=1 at IOPS on the very > same system, which seems like it shouldn't be possible. Kind of mysterious. The test # disable-frequency-scaling # modprobe scsi_debug delay=0 fake_rw=1 # echo 2 > /sys/block/sdc/queue/rq_affinity # echo noop > /sys/block/sdc/queue/scheduler # echo 0 > /sys/block/sdc/queue/add_random results in about 800K IOPS for random reads on the same setup (with a request size of 4 KB; CPU: quad core i5-2400). Repeating the same test with fake_rw=0 results in about 651K IOPS. Bart. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: SCSI mid layer and high IOPS capable devices 2012-12-13 11:40 ` Bart Van Assche @ 2012-12-13 18:03 ` scameron 2012-12-13 17:18 ` Bart Van Assche 0 siblings, 1 reply; 21+ messages in thread From: scameron @ 2012-12-13 18:03 UTC (permalink / raw) To: Bart Van Assche; +Cc: linux-scsi, stephenmcameron, dab, scameron On Thu, Dec 13, 2012 at 12:40:27PM +0100, Bart Van Assche wrote: > On 12/11/12 23:46, scameron@beardog.cce.hp.com wrote: > >I would be curious to see what kind of results you would get with > >scsi_debug > >with fake_rw=1. I am sort of suspecting that trying to put an "upper > >limit" > >on scsi LLD IOPS performance by seeing what scsi_debug will do with > >fake_rw=1 > >is not really valid (or, maybe I'm doing it wrong) as I know of one case in > >which a real HW scsi driver beats scsi_debug fake_rw=1 at IOPS on the very > >same system, which seems like it shouldn't be possible. Kind of > >mysterious. > > The test > > # disable-frequency-scaling > # modprobe scsi_debug delay=0 fake_rw=1 > # echo 2 > /sys/block/sdc/queue/rq_affinity > # echo noop > /sys/block/sdc/queue/scheduler > # echo 0 > /sys/block/sdc/queue/add_random > > results in about 800K IOPS for random reads on the same setup (with a > request size of 4 KB; CPU: quad core i5-2400). > > Repeating the same test with fake_rw=0 results in about 651K IOPS. What are your system specs? Here's what I'm seeing. I have one 6-core processor. [root@localhost scameron]# grep 'model name' /proc/cpuinfo model name : Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz model name : Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz model name : Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz model name : Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz model name : Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz model name : Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz hyperthreading is disabled. Here is the script I'm running. [root@localhost scameron]# cat do-dds #!/bin/sh do_dd() { device="$1" cpu="$2" taskset -c "$cpu" dd if="$device" of=/dev/null bs=4k iflag=direct } do_six() { for x in `seq 0 5` do do_dd "$1" $x & done } do_120() { for z in `seq 1 20` do do_six "$1" done wait } time do_120 "$1" I don't have "disable-frequency-scaling" on rhel6, but I think if I send SIGUSR1 to all the cpuspeed processes, this does the same thing. ps aux | grep cpuspeed | grep -v grep | awk '{ printf("kill -USR1 %s\n", $2);}' | sh [root@localhost scameron]# find /sys -name 'scaling_cur_freq' -print | xargs cat 2000000 2000000 2000000 2000000 2000000 2000000 [root@localhost scameron]# Now, using scsi-debug (300mb size) with delay=0 and fake_rw=1, with rq_affinity set to 2, and add_random set to 0 and noop i/o scheduler I get ~216k iops. With my scsi lld (actually doing the i/o) , I now get ~190k iops. rq_affinity set to 2, add_random 0, noop i/o scheduler, irqs manually spread across cpus (irqbalance turned off). With my block lld (actually doing the i/o), I get ~380k iops. rq_affinity set to 2, add_random 0, i/o scheduler "none" (there is no i/o scheduler with the make_request interface), irqs manually spread across cpus (irqbalance turned off). So the block driver seems to beat the snot out of the scsi lld by a factor of 2x now, rather than 3x, so I guess that's some improvement, but still. -- steve ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: SCSI mid layer and high IOPS capable devices 2012-12-13 18:03 ` scameron @ 2012-12-13 17:18 ` Bart Van Assche 0 siblings, 0 replies; 21+ messages in thread From: Bart Van Assche @ 2012-12-13 17:18 UTC (permalink / raw) To: scameron; +Cc: linux-scsi, stephenmcameron, dab On 12/13/12 19:03, scameron@beardog.cce.hp.com wrote: > What are your system specs? A quad core Intel i5-2400 @ 3.10 GHz. > taskset -c "$cpu" dd if="$device" of=/dev/null bs=4k iflag=direct Please use fio instead of dd for any serious performance measurements. dd doesn't even guarantee that its buffers are page aligned. > I don't have "disable-frequency-scaling" on rhel6, but I think if I send > SIGUSR1 to all the cpuspeed processes, this does the same thing. Depends on which scaling governor and minimum frequency have been configured. This is what I am using: #!/bin/bash for d in /sys/devices/system/cpu/cpu*/cpufreq do if [ -e "$d/scaling_governor" ]; then echo "userspace" >"$d/scaling_governor" echo "$(<$d/cpuinfo_max_freq)" >"$d/scaling_min_freq" fi done And the test I ran is: fio --bs=4096 --ioengine=libaio --rw=randread --buffered=0 --thread \ --numjobs=${cpucount} --iodepth=16 --iodepth_batch=8 \ --iodepth_batch_complete=8 \ --loops=$((2**31)) --runtime=60 --group_reporting --size=${size} \ --gtod_reduce=1 --name=${dev} --filename=${dev} --invalidate=1 Bart. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: SCSI mid layer and high IOPS capable devices 2012-12-11 0:00 SCSI mid layer and high IOPS capable devices scameron 2012-12-11 8:21 ` Bart Van Assche @ 2012-12-13 15:22 ` Bart Van Assche 2012-12-13 17:25 ` scameron 1 sibling, 1 reply; 21+ messages in thread From: Bart Van Assche @ 2012-12-13 15:22 UTC (permalink / raw) To: scameron; +Cc: linux-scsi, stephenmcameron, dab On 12/11/12 01:00, scameron@beardog.cce.hp.com wrote: > The driver, like nvme, has a submit and reply queue per cpu. This is interesting. If my interpretation of the POSIX spec is correct then aio_write() allows to queue overlapping writes and all writes submitted by the same thread have to be performed in the order they were submitted by that thread. What if a thread submits a first write via aio_write(), gets rescheduled on another CPU and submits a second overlapping write also via aio_write() ? If a block driver uses one queue per CPU, does that mean that such writes that were issued in order can be executed in a different order by the driver and/or hardware than the order in which the writes were submitted ? See also the aio_write() man page, The Open Group Base Specifications Issue 7, IEEE Std 1003.1-2008 (http://pubs.opengroup.org/onlinepubs/9699919799/functions/aio_write.html). Bart. ^ permalink raw reply [flat|nested] 21+ messages in thread
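For concreteness, here is a minimal userspace illustration of the scenario described above: two overlapping writes queued back-to-back from one thread with POSIX AIO. The file name and sizes are arbitrary, and whether the two writes may be reordered below glibc is exactly the question being raised.

    /* overlapping-aio.c -- build with: cc overlapping-aio.c -lrt */
    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
            static char a[4096], b[4096];
            struct aiocb cb1 = { 0 }, cb2 = { 0 };
            const struct aiocb *list[] = { &cb1, &cb2 };
            int fd = open("testfile", O_RDWR | O_CREAT, 0644);

            if (fd < 0)
                    return 1;
            memset(a, 'A', sizeof(a));
            memset(b, 'B', sizeof(b));

            cb1.aio_fildes = fd;
            cb1.aio_buf = a;
            cb1.aio_nbytes = sizeof(a);
            cb1.aio_offset = 0;
            cb2 = cb1;
            cb2.aio_buf = b;
            cb2.aio_offset = 2048;          /* overlaps the tail of cb1 */

            aio_write(&cb1);                /* both writes are now in flight ... */
            aio_write(&cb2);                /* ... and they overlap */
            while (aio_error(&cb1) == EINPROGRESS || aio_error(&cb2) == EINPROGRESS)
                    aio_suspend(list, 2, NULL);
            printf("wrote %zd and %zd bytes\n", aio_return(&cb1), aio_return(&cb2));
            close(fd);
            return 0;
    }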
* Re: SCSI mid layer and high IOPS capable devices 2012-12-13 15:22 ` Bart Van Assche @ 2012-12-13 17:25 ` scameron 2012-12-13 16:47 ` Bart Van Assche 0 siblings, 1 reply; 21+ messages in thread From: scameron @ 2012-12-13 17:25 UTC (permalink / raw) To: Bart Van Assche; +Cc: linux-scsi, stephenmcameron, dab, scameron On Thu, Dec 13, 2012 at 04:22:33PM +0100, Bart Van Assche wrote: > On 12/11/12 01:00, scameron@beardog.cce.hp.com wrote: > >The driver, like nvme, has a submit and reply queue per cpu. > > This is interesting. If my interpretation of the POSIX spec is correct > then aio_write() allows to queue overlapping writes and all writes > submitted by the same thread have to be performed in the order they were > submitted by that thread. What if a thread submits a first write via > aio_write(), gets rescheduled on another CPU and submits a second > overlapping write also via aio_write() ? If a block driver uses one > queue per CPU, does that mean that such writes that were issued in order > can be executed in a different order by the driver and/or hardware than > the order in which the writes were submitted ? > > See also the aio_write() man page, The Open Group Base Specifications > Issue 7, IEEE Std 1003.1-2008 > (http://pubs.opengroup.org/onlinepubs/9699919799/functions/aio_write.html). It is my understanding that the low level driver is free to re-order the i/o's any way it wants, as is the hardware. It is up to the layers above to enforce any ordering requirements. For a long time there was a bug in the cciss driver whereby all i/o's submitted to the driver got reversed in order -- adding to head of a list instead of to the tail, or vice versa, I forget which -- and it caused no real problems (apart from some slight performance issues that were mostly masked by the Smart Array's cache). It was caught by firmware guys noticing LBAs coming in in weird orders for supposedly sequential workloads. So in your scenario, I think the overlapping writes should not be submitted by the block layer to the low level driver concurrently, as the block layer is aware that the lld is free to re-order things. (I am very certain that this is the case for scsi low level drivers and block drivers using a request_fn interface -- less certain about block drivers using the make_request interface to submit i/o's, as this interface is pretty new to me.) If I am wrong about any of that, that would be very interesting to know. -- steve ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: SCSI mid layer and high IOPS capable devices 2012-12-13 17:25 ` scameron @ 2012-12-13 16:47 ` Bart Van Assche 2012-12-13 16:49 ` Christoph Hellwig ` (2 more replies) 0 siblings, 3 replies; 21+ messages in thread From: Bart Van Assche @ 2012-12-13 16:47 UTC (permalink / raw) To: scameron; +Cc: linux-scsi, stephenmcameron, dab On 12/13/12 18:25, scameron@beardog.cce.hp.com wrote: > On Thu, Dec 13, 2012 at 04:22:33PM +0100, Bart Van Assche wrote: >> On 12/11/12 01:00, scameron@beardog.cce.hp.com wrote: >>> The driver, like nvme, has a submit and reply queue per cpu. >> >> This is interesting. If my interpretation of the POSIX spec is correct >> then aio_write() allows to queue overlapping writes and all writes >> submitted by the same thread have to be performed in the order they were >> submitted by that thread. What if a thread submits a first write via >> aio_write(), gets rescheduled on another CPU and submits a second >> overlapping write also via aio_write() ? If a block driver uses one >> queue per CPU, does that mean that such writes that were issued in order >> can be executed in a different order by the driver and/or hardware than >> the order in which the writes were submitted ? >> >> See also the aio_write() man page, The Open Group Base Specifications >> Issue 7, IEEE Std 1003.1-2008 >> (http://pubs.opengroup.org/onlinepubs/9699919799/functions/aio_write.html). > > It is my understanding that the low level driver is free to re-order the > i/o's any way it wants, as is the hardware. It is up to the layers above > to enforce any ordering requirements. For a long time there was a bug > in the cciss driver that all i/o's submitted to the driver got reversed > in order -- adding to head of a list instead of to the tail, or vice versa, > I forget which -- and it caused no real problems (apart from some slight > performance issues that were mostly masked by the Smart Array's cache. > It was caught by firmware guys noticing LBAs coming in in weird orders > for supposedly sequential workloads. > > So in your scenario, I think the overlapping writes should not be submitted > by the block layer to the low level driver concurrently, as the block layer > is aware that the lld is free to re-order things. (I am very certain > that this is the case for scsi low level drivers and block drivers using a > request_fn interface -- less certain about block drivers using the > make_request interface to submit i/o's, as this interface is pretty new > to me. As far as I know there are basically two choices: 1. Allow the LLD to reorder any pair of write requests. The only way for higher layers to ensure the order of (overlapping) writes is then to separate these in time. Or in other words, limit write request queue depth to one. 2. Do not allow the LLD to reorder overlapping write requests. This allows higher software layers to queue write requests (queue depth > 1). From my experience with block and SCSI drivers option (1) doesn't look attractive from a performance point of view. From what I have seen performance with QD=1 is several times lower than performance with QD > 1. But maybe I overlooked something ? Bart. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: SCSI mid layer and high IOPS capable devices 2012-12-13 16:47 ` Bart Van Assche @ 2012-12-13 16:49 ` Christoph Hellwig 2012-12-14 9:44 ` Bart Van Assche 0 siblings, 1 reply; 21+ messages in thread From: Christoph Hellwig @ 2012-12-13 16:49 UTC (permalink / raw) To: Bart Van Assche; +Cc: scameron, linux-scsi, stephenmcameron, dab On Thu, Dec 13, 2012 at 05:47:14PM +0100, Bart Van Assche wrote: > From my experience with block and SCSI drivers option (1) doesn't > look attractive from a performance point of view. From what I have > seen performance with QD=1 is several times lower than performance > with QD > 1. But maybe I overlooked something ? What you might be missing is that at least on Linux no one who cares about performance uses the POSIX AIO interface anyway, as the implementation in glibc always has been horrible. The Linux-native aio interface or the various thread pool implementations don't imply useless ordering and thus can be used to fill up large queues. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: SCSI mid layer and high IOPS capable devices 2012-12-13 16:49 ` Christoph Hellwig @ 2012-12-14 9:44 ` Bart Van Assche 2012-12-14 16:44 ` scameron 0 siblings, 1 reply; 21+ messages in thread From: Bart Van Assche @ 2012-12-14 9:44 UTC (permalink / raw) To: Christoph Hellwig; +Cc: scameron, linux-scsi, stephenmcameron, dab On 12/13/12 17:49, Christoph Hellwig wrote: > On Thu, Dec 13, 2012 at 05:47:14PM +0100, Bart Van Assche wrote: >> From my experience with block and SCSI drivers option (1) doesn't >> look attractive from a performance point of view. From what I have >> seen performance with QD=1 is several times lower than performance >> with QD > 1. But maybe I overlooked something ? > > What you might be missing is that at least on Linux no one who cares > about performance uses the Posix AIO inferface anyway, as the > implementation in glibc always has been horrible. The Linux-native > aio interface or the various thread pool implementations don't imply > useless ordering and thus can be used to fill up large queues. Some applications need write ordering without having a need for enforcing durability as fsync() does [1]. What I'm wondering about is whether an operating system kernel like the Linux kernel should penalize application performance when using block drivers and storage hardware that preserve the order of write requests because there exist other drivers and storage devices that do not preserve the order of write requests ? [1] Richard Hipp, Re: [sqlite] SQLite on flash (was: [PATCH 00/16] f2fs: introduce flash-friendly file system), October 10, 2012 (http://www.mail-archive.com/sqlite-users@sqlite.org/msg73033.html). Bart. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: SCSI mid layer and high IOPS capable devices 2012-12-14 9:44 ` Bart Van Assche @ 2012-12-14 16:44 ` scameron 2012-12-14 16:15 ` Bart Van Assche 0 siblings, 1 reply; 21+ messages in thread From: scameron @ 2012-12-14 16:44 UTC (permalink / raw) To: Bart Van Assche Cc: Christoph Hellwig, linux-scsi, stephenmcameron, dab, scameron On Fri, Dec 14, 2012 at 10:44:34AM +0100, Bart Van Assche wrote: > On 12/13/12 17:49, Christoph Hellwig wrote: > >On Thu, Dec 13, 2012 at 05:47:14PM +0100, Bart Van Assche wrote: > >> From my experience with block and SCSI drivers option (1) doesn't > >>look attractive from a performance point of view. From what I have > >>seen performance with QD=1 is several times lower than performance > >>with QD > 1. But maybe I overlooked something ? > > > >What you might be missing is that at least on Linux no one who cares > >about performance uses the Posix AIO inferface anyway, as the > >implementation in glibc always has been horrible. The Linux-native > >aio interface or the various thread pool implementations don't imply > >useless ordering and thus can be used to fill up large queues. > > Some applications need write ordering without having a need for > enforcing durability as fsync() does [1]. What I'm wondering about is > whether an operating system kernel like the Linux kernel should penalize > application performance when using block drivers and storage hardware > that preserve the order of write requests because there exist other > drivers and storage devices that do not preserve the order of write > requests ? Which devices don't re-order requests? So far as I know every single disk drive ever made that is capable of handling multiple requests will also re-order requests as it sees fit. I expect the flash devices re-order requests as well, simply because to feed requests to the things at a sufficient rate, you have to pump requests into them concurrently on multiple hardware queues -- a single cpu jamming requests into them as fast as it can is still not fast enough to keep them busy. Consequently, they *can't* care about ordering, as the relative order requests on different hardware queues are submitted into them is not even really controlled, so the OS *can't* count on concurrent requests not to be essentially "re-ordered", just because of the nature of the way requests get into the device. So I think the property that devices and drivers are free to reorder concurrent requests is not going away. -- steve ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: SCSI mid layer and high IOPS capable devices 2012-12-14 16:44 ` scameron @ 2012-12-14 16:15 ` Bart Van Assche 2012-12-14 19:55 ` scameron 0 siblings, 1 reply; 21+ messages in thread From: Bart Van Assche @ 2012-12-14 16:15 UTC (permalink / raw) To: scameron; +Cc: Christoph Hellwig, linux-scsi, stephenmcameron, dab On 12/14/12 17:44, scameron@beardog.cce.hp.com wrote: > I expect the flash devices re-order requests as well, simply because > to feed requests to the things at a sufficient rate, you have to pump > requests into them concurrently on multiple hardware queues -- a single > cpu jamming requests into them as fast as it can is still not fast enough > to keep them busy. Consequently, they *can't* care about ordering, as the > relative order requests on different hardware queues are submitted into them > is not even really controlled, so the OS *can't* count on concurrent requests > not to be essentially "re-ordered", just because of the nature of the way > requests get into the device. Why should a flash device have to reorder write requests ? These devices typically use a log-structured file system internally. Bart. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: SCSI mid layer and high IOPS capable devices 2012-12-14 16:15 ` Bart Van Assche @ 2012-12-14 19:55 ` scameron 2012-12-14 19:28 ` Bart Van Assche 0 siblings, 1 reply; 21+ messages in thread From: scameron @ 2012-12-14 19:55 UTC (permalink / raw) To: Bart Van Assche Cc: Christoph Hellwig, linux-scsi, stephenmcameron, dab, scameron On Fri, Dec 14, 2012 at 05:15:37PM +0100, Bart Van Assche wrote: > On 12/14/12 17:44, scameron@beardog.cce.hp.com wrote: > >I expect the flash devices re-order requests as well, simply because > >to feed requests to the things at a sufficient rate, you have to pump > >requests into them concurrently on multiple hardware queues -- a single > >cpu jamming requests into them as fast as it can is still not fast enough > >to keep them busy. Consequently, they *can't* care about ordering, as the > >relative order requests on different hardware queues are submitted into > >them > >is not even really controlled, so the OS *can't* count on concurrent > >requests > >not to be essentially "re-ordered", just because of the nature of the way > >requests get into the device. > > Why should a flash device have to reorder write requests ? These devices > typically use a log-structured file system internally. It's not so much that they are re-ordered as that there is no controlled ordering to begin with because multiple cpus are submitting to multiple hardware queues concurrently. If you have 12 requests coming in on 12 cpus to 12 hardware queues to the device, it's going to be racy as to which request is processed first by the device -- and this is fine, the hardware queues are independent of one another and do not need to worry about each other. This is all to provide a means of getting enough commands on the device to actually keep it busy. A single cpu can't do it, the device is too fast. If you have ordering dependencies such that request A must complete before request B completes, then don't submit A and B concurrently, because if you do submit them concurrently, you cannot tell whether A or B will arrive into the device first because they may go into it via different hardware queues. Note, in case it isn't obvious, the hardware queues I'm talking about here are not the struct scsi_device, sdev->request_queue queues, they are typically ring buffers in host memory from which the device DMAs commands/responses to/from depending on if it's a submit queue or a completion queue and with producer/consumer indexes one of which is in host memory and one of which is a register on the device (which is which depends on the direction of the queue, from device (pi = host memory, ci = device register), or to device (pi = device register, ci = host memory)) -- steve ^ permalink raw reply [flat|nested] 21+ messages in thread
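To make the queue structure described above concrete, here is a minimal sketch of one such submit queue. The struct layout, names, and the reduced barrier and error handling are made up for illustration and are not taken from any particular driver.

    /* One per-cpu submit queue: a ring of command blocks in host memory.
     * In this direction the producer index is a device register (the
     * doorbell) and the consumer index is written back by the device into
     * host memory; a completion queue is wired up the other way around.
     */
    struct submit_queue {
            struct my_command *ring;        /* DMA-able ring of command blocks */
            u32 depth;
            u32 pi;                         /* host's shadow of the producer index */
            volatile u32 *ci;               /* consumer index, updated by the device */
            void __iomem *pi_doorbell;      /* producer index register on the device */
            spinlock_t lock;                /* per-queue, so cpus rarely contend */
    };

    static int submit_command(struct submit_queue *sq, struct my_command *cmd)
    {
            u32 next = (sq->pi + 1) % sq->depth;

            if (next == *sq->ci)
                    return -EBUSY;                  /* ring is full */
            sq->ring[sq->pi] = *cmd;
            wmb();                                  /* command visible before the doorbell */
            sq->pi = next;
            writel(sq->pi, sq->pi_doorbell);        /* tell the device there is new work */
            return 0;
    }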
* Re: SCSI mid layer and high IOPS capable devices 2012-12-14 19:55 ` scameron @ 2012-12-14 19:28 ` Bart Van Assche 2012-12-14 21:06 ` scameron 0 siblings, 1 reply; 21+ messages in thread From: Bart Van Assche @ 2012-12-14 19:28 UTC (permalink / raw) To: scameron; +Cc: Christoph Hellwig, linux-scsi, stephenmcameron, dab On 12/14/12 20:55, scameron@beardog.cce.hp.com wrote: > It's not so much that they are re-ordered as that there is no controlled > ordering to begin with because multiple cpus are submitting to multiple > hardware queues concurrently. If you have 12 requests coming in on 12 > cpus to 12 hardware queues to the device, it's going to be racy as to > which request is processed first by the device -- and this is fine, the > hardware queues are independent of one another and do not need to worry > about each other. This is all to provide a means of getting enough commands > on the device to actually keep it busy. A single cpu can't do it, the > device is too fast. If you have ordering dependencies such that request > A must complete before request B completes, then don't submit A and B > concurrently, because if you do submit them concurrently, you cannot tell > whether A or B will arrive into the device first because they may go into > it via different hardware queues. It depends on how these multiple queues are used. If each queue would e.g. be associated with a disjoint LBA range of the storage device then there wouldn't be a risk of request reordering due to using multiple hardware queues. Bart. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: SCSI mid layer and high IOPS capable devices 2012-12-14 19:28 ` Bart Van Assche @ 2012-12-14 21:06 ` scameron 2012-12-15 9:40 ` Bart Van Assche 0 siblings, 1 reply; 21+ messages in thread From: scameron @ 2012-12-14 21:06 UTC (permalink / raw) To: Bart Van Assche Cc: Christoph Hellwig, linux-scsi, stephenmcameron, dab, scameron On Fri, Dec 14, 2012 at 08:28:56PM +0100, Bart Van Assche wrote: > On 12/14/12 20:55, scameron@beardog.cce.hp.com wrote: > >It's not so much that they are re-ordered as that there is no controlled > >ordering to begin with because multiple cpus are submitting to multiple > >hardware queues concurrently. If you have 12 requests coming in on 12 > >cpus to 12 hardware queues to the device, it's going to be racy as to > >which request is processed first by the device -- and this is fine, the > >hardware queues are independent of one another and do not need to worry > >about each other. This is all to provide a means of getting enough > >commands > >on the device to actually keep it busy. A single cpu can't do it, the > >device is too fast. If you have ordering dependencies such that request > >A must complete before request B completes, then don't submit A and B > >concurrently, because if you do submit them concurrently, you cannot tell > >whether A or B will arrive into the device first because they may go into > >it via different hardware queues. > > It depends on how these multiple queues are used. If each queue would > e.g. be associated with a disjoint LBA range of the storage device then > there wouldn't be a risk of request reordering due to using multiple > hardware queues. They are not associated with disjoint LBA ranges. They are associated with CPUs on the submission side, there's a submit queue per cpu, and msix vectors on the completion side (also a completion queue per cpu). The point of the queues is only to provide a wide enough highway to allow enough requests to be shoved down to the device fast enough and completed back to the host fast enough that the device can be kept reasonably busy, instead of being starved for work to do. There is no distinction about what the requests may do based on what hardware i/o queue they come in on (e.g. no lba range partitioning). All the i/o queues are equivalent. Pretty much all current storage devices, disk drives and the devices I'm talking about in particular do depend on the low level driver and storage devices being permitted to re-order requests. So I don't think the discussion about drivers and devices that *do not* reorder requests (which drivers and devices would those be?) is very related to the topic of how to get the scsi mid layer to provide a wide enough highway for requests destined for very low latency devices. -- steve ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: SCSI mid layer and high IOPS capable devices 2012-12-14 21:06 ` scameron @ 2012-12-15 9:40 ` Bart Van Assche 2012-12-19 14:23 ` Christoph Hellwig 0 siblings, 1 reply; 21+ messages in thread From: Bart Van Assche @ 2012-12-15 9:40 UTC (permalink / raw) To: scameron; +Cc: Christoph Hellwig, linux-scsi, stephenmcameron, dab On 12/14/12 22:06, scameron@beardog.cce.hp.com wrote: > [ ... ] how to get the scsi mid layer to provide a wide enough > highway for requests destined for very low latency devices. While the SCSI mid-layer is processing an I/O request not only the queue lock has to be locked and unlocked several times but also the SCSI host lock. The reason that it's unavoidable to lock and unlock the host lock is because the SCSI core has been designed for SCSI equipment that has a queue depth limit per host (shost->can_queue). For single LUN devices that model could be changed into a queue depth limit per LUN. Also, it's probably not that hard to modify software SCSI target implementations such that these have a queue depth limit per LUN instead of per host. It might be interesting to verify whether the following approach helps to improve performance of the SCSI mid-layer:
* Make it possible for SCSI LLDs to tell the SCSI core whether there is a queue depth limit per host or per LUN.
* Do not update shost->host_busy and shost->target_busy when using the QD limit per LUN mode. This change will make it possible to avoid the spin_lock() and spin_unlock() calls inside scsi_request_fn(). It will also avoid having to take the host lock inside scsi_device_unbusy().
* In queue-depth-limit-per-LUN mode, neither add a SCSI device to the starved list if it's busy nor examine the starved list in scsi_run_queue().
Bart. ^ permalink raw reply [flat|nested] 21+ messages in thread
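A rough sketch of what the first two bullets might look like. The qd_limit_per_lun flag and this helper do not exist in the SCSI core; the sketch also assumes sdev->device_busy could become an atomic counter, which is part of what the proposal implies.

    static void scsi_account_busy(struct Scsi_Host *shost, struct scsi_device *sdev)
    {
            if (shost->hostt->qd_limit_per_lun) {
                    /* per-LUN limit: bump only this LUN's counter -- no host
                     * lock, no shost->host_busy or target_busy update */
                    atomic_inc(&sdev->device_busy);
                    return;
            }

            /* per-host limit (current behaviour, lock context simplified):
             * host-wide counters serialize every cpu on the host lock */
            spin_lock_irq(shost->host_lock);
            shost->host_busy++;
            scsi_target(sdev)->target_busy++;
            spin_unlock_irq(shost->host_lock);
    }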
* Re: SCSI mid layer and high IOPS capable devices 2012-12-15 9:40 ` Bart Van Assche @ 2012-12-19 14:23 ` Christoph Hellwig 0 siblings, 0 replies; 21+ messages in thread From: Christoph Hellwig @ 2012-12-19 14:23 UTC (permalink / raw) To: Bart Van Assche Cc: scameron, Christoph Hellwig, linux-scsi, stephenmcameron, dab On Sat, Dec 15, 2012 at 10:40:24AM +0100, Bart Van Assche wrote: > On 12/14/12 22:06, scameron@beardog.cce.hp.com wrote: > >[ ... ] how to get the scsi mid layer to provide a wide enough > >highway for requests destined for very low latency devices. > > While the SCSI mid-layer is processing an I/O request not only the > queue lock has to be locked and unlocked several times but also the > SCSI host lock. The reason that it's unavoidable to lock and unlock > the host lock is because the SCSI core has been designed for SCSI > equipment that has a queue depth limit per host (shost->can_queue). > For single LUN devices that model could be changed in a queue depth > limit per LUN. Also, it's probably not that hard to modify software > SCSI target implementations such that these have a queue depth limit > per LUN instead of per host. We'd also better avoid needing a lock to check these limits, especially if we normally don't hit them. The easiest way to get started would be to simply allow a magic can_queue value that keeps these as unlimited and only let the driver return one of the busy values from ->queuecommand. We could then use unlocked list empty checks to see if anything is in a waiting list and enter a slow path mode. ^ permalink raw reply [flat|nested] 21+ messages in thread
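A sketch of that idea, purely for illustration: the sentinel value and the slow-path helper are invented here, and the real scsi_host_queue_ready() does its accounting under the host lock.

    #define SHOST_QD_UNLIMITED      (-1)    /* hypothetical "don't account per host" */

    static int scsi_host_queue_ready(struct request_queue *q,
                                     struct Scsi_Host *shost,
                                     struct scsi_device *sdev)
    {
            if (shost->can_queue == SHOST_QD_UNLIMITED) {
                    /* Fast path: no host-wide counters, no host lock.  The
                     * unlocked list_empty() is only a hint; fall into the
                     * existing locked slow path when someone is waiting. */
                    if (unlikely(!list_empty(&shost->starved_list)))
                            return scsi_host_queue_ready_slow(q, shost, sdev);
                    return 1;
            }

            /* ... existing can_queue/host_busy checks under the host lock ... */
            return 1;
    }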
* Re: SCSI mid layer and high IOPS capable devices 2012-12-13 16:47 ` Bart Van Assche 2012-12-13 16:49 ` Christoph Hellwig @ 2012-12-13 21:20 ` scameron 2012-12-14 0:22 ` Jack Wang 2 siblings, 0 replies; 21+ messages in thread From: scameron @ 2012-12-13 21:20 UTC (permalink / raw) To: Bart Van Assche; +Cc: linux-scsi, stephenmcameron, dab, scameron On Thu, Dec 13, 2012 at 05:47:14PM +0100, Bart Van Assche wrote: > On 12/13/12 18:25, scameron@beardog.cce.hp.com wrote: > >On Thu, Dec 13, 2012 at 04:22:33PM +0100, Bart Van Assche wrote: > >>On 12/11/12 01:00, scameron@beardog.cce.hp.com wrote: > >>>The driver, like nvme, has a submit and reply queue per cpu. > >> > >>This is interesting. If my interpretation of the POSIX spec is correct > >>then aio_write() allows to queue overlapping writes and all writes > >>submitted by the same thread have to be performed in the order they were > >>submitted by that thread. What if a thread submits a first write via > >>aio_write(), gets rescheduled on another CPU and submits a second > >>overlapping write also via aio_write() ? If a block driver uses one > >>queue per CPU, does that mean that such writes that were issued in order > >>can be executed in a different order by the driver and/or hardware than > >>the order in which the writes were submitted ? > >> > >>See also the aio_write() man page, The Open Group Base Specifications > >>Issue 7, IEEE Std 1003.1-2008 > >>(http://pubs.opengroup.org/onlinepubs/9699919799/functions/aio_write.html). > > > >It is my understanding that the low level driver is free to re-order the > >i/o's any way it wants, as is the hardware. It is up to the layers above > >to enforce any ordering requirements. For a long time there was a bug > >in the cciss driver that all i/o's submitted to the driver got reversed > >in order -- adding to head of a list instead of to the tail, or vice versa, > >I forget which -- and it caused no real problems (apart from some slight > >performance issues that were mostly masked by the Smart Array's cache. > >It was caught by firmware guys noticing LBAs coming in in weird orders > >for supposedly sequential workloads. > > > >So in your scenario, I think the overlapping writes should not be submitted > >by the block layer to the low level driver concurrently, as the block layer > >is aware that the lld is free to re-order things. (I am very certain > >that this is the case for scsi low level drivers and block drivers using a > >request_fn interface -- less certain about block drivers using the > >make_request interface to submit i/o's, as this interface is pretty new > >to me. > > As far as I know there are basically two choices: > 1. Allow the LLD to reorder any pair of write requests. The only way > for higher layers to ensure the order of (overlapping) writes is then > to separate these in time. Or in other words, limit write request > queue depth to one. > > 2. Do not allow the LLD to reorder overlapping write requests. This > allows higher software layers to queue write requests (queue depth > > 1). > > From my experience with block and SCSI drivers option (1) doesn't look > attractive from a performance point of view. From what I have seen > performance with QD=1 is several times lower than performance with QD > > 1. But maybe I overlooked something ? I don't think 1 is how it works, and I know 2 is not how it works. LLD's are definitely allowed to re-order i/o's arbitrarily (and so is the hardware (e.g. array controller or disk drive)). 
If you need an i/o to complete before another begins, don't give the 2nd i/o to the LLD before the 1st completes, but be smarter than limiting all writes to queue depth of 1 by knowing when you care about the order. If my understanding is correct, the buffer cache will, for the most part, make sure there generally aren't many overlapping or order-dependent i/o's by essentially combining multiple overlapping writes into a single write. But for filesystem metadata, or direct i/o, there may of course be application-specific ordering requirements, and the answer is, I think, that the application (e.g. the filesystem) needs to know when it cares about the order, wait for completions as necessary when it does care, and take pains not to care about the order most of the time if performance is important (one of the reasons the buffer cache exists). (I might be wrong though.) -- steve ^ permalink raw reply [flat|nested] 21+ messages in thread
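As a concrete illustration of "wait for the completion when you do care about the order", using the Linux-native AIO interface mentioned earlier in the thread (the file name, offsets, and sizes are arbitrary; build with -laio):

    /* ordered-aio.c: hand write B to the driver only after write A has
     * completed, rather than relying on any lower layer to preserve order.
     */
    #define _GNU_SOURCE
    #include <libaio.h>
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
            io_context_t ctx = 0;
            struct iocb cb, *cbs[1] = { &cb };
            struct io_event ev;
            static char a[4096] __attribute__((aligned(4096)));
            static char b[4096] __attribute__((aligned(4096)));
            int fd = open("testfile", O_RDWR | O_CREAT | O_DIRECT, 0644);

            if (fd < 0 || io_setup(16, &ctx) < 0)
                    return 1;

            memset(a, 'A', sizeof(a));
            io_prep_pwrite(&cb, fd, a, sizeof(a), 0);
            io_submit(ctx, 1, cbs);                 /* submit A ... */
            io_getevents(ctx, 1, 1, &ev, NULL);     /* ... and wait for it */

            memset(b, 'B', sizeof(b));
            io_prep_pwrite(&cb, fd, b, sizeof(b), 0);       /* B overlaps A */
            io_submit(ctx, 1, cbs);                 /* only now does B reach the LLD */
            io_getevents(ctx, 1, 1, &ev, NULL);

            io_destroy(ctx);
            close(fd);
            return 0;
    }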
* RE: SCSI mid layer and high IOPS capable devices 2012-12-13 16:47 ` Bart Van Assche 2012-12-13 16:49 ` Christoph Hellwig 2012-12-13 21:20 ` scameron @ 2012-12-14 0:22 ` Jack Wang [not found] ` <CADzpL0TMT31yka98Zv0=53N4=pDZOc9+gacnvDWMbj+iZg4H5w@mail.gmail.com> 2 siblings, 1 reply; 21+ messages in thread From: Jack Wang @ 2012-12-14 0:22 UTC (permalink / raw) To: 'Bart Van Assche', scameron; +Cc: linux-scsi, stephenmcameron, dab On 12/13/12 18:25, scameron@beardog.cce.hp.com wrote: > On Thu, Dec 13, 2012 at 04:22:33PM +0100, Bart Van Assche wrote: >> On 12/11/12 01:00, scameron@beardog.cce.hp.com wrote: >>> The driver, like nvme, has a submit and reply queue per cpu. >> >> This is interesting. If my interpretation of the POSIX spec is >> correct then aio_write() allows to queue overlapping writes and all >> writes submitted by the same thread have to be performed in the order >> they were submitted by that thread. What if a thread submits a first >> write via aio_write(), gets rescheduled on another CPU and submits a >> second overlapping write also via aio_write() ? If a block driver >> uses one queue per CPU, does that mean that such writes that were >> issued in order can be executed in a different order by the driver >> and/or hardware than the order in which the writes were submitted ? >> >> See also the aio_write() man page, The Open Group Base Specifications >> Issue 7, IEEE Std 1003.1-2008 >> (http://pubs.opengroup.org/onlinepubs/9699919799/functions/aio_write.html). > > It is my understanding that the low level driver is free to re-order > the i/o's any way it wants, as is the hardware. It is up to the > layers above to enforce any ordering requirements. For a long time > there was a bug in the cciss driver that all i/o's submitted to the > driver got reversed in order -- adding to head of a list instead of to > the tail, or vice versa, I forget which -- and it caused no real > problems (apart from some slight performance issues that were mostly masked by the Smart Array's cache. > It was caught by firmware guys noticing LBAs coming in in weird orders > for supposedly sequential workloads. > > So in your scenario, I think the overlapping writes should not be > submitted by the block layer to the low level driver concurrently, as > the block layer is aware that the lld is free to re-order things. (I > am very certain that this is the case for scsi low level drivers and > block drivers using a request_fn interface -- less certain about block > drivers using the make_request interface to submit i/o's, as this > interface is pretty new to me. As far as I know there are basically two choices: 1. Allow the LLD to reorder any pair of write requests. The only way for higher layers to ensure the order of (overlapping) writes is then to separate these in time. Or in other words, limit write request queue depth to one. 2. Do not allow the LLD to reorder overlapping write requests. This allows higher software layers to queue write requests (queue depth > 1). From my experience with block and SCSI drivers option (1) doesn't look attractive from a performance point of view. From what I have seen performance with QD=1 is several times lower than performance with QD > 1. But maybe I overlooked something ? Bart. I was seen low queue depth improve sequential performance, and high queue depth improve random performance. 
Jack ^ permalink raw reply [flat|nested] 21+ messages in thread
* RE: SCSI mid layer and high IOPS capable devices [not found] ` <CADzpL0S5cfCRQftrxHij8KOjKj55psSJedmXLBQz1uQm_SC30A@mail.gmail.com> @ 2012-12-14 4:59 ` Jack Wang 0 siblings, 0 replies; 21+ messages in thread From: Jack Wang @ 2012-12-14 4:59 UTC (permalink / raw) To: 'Stephen Cameron' Cc: 'Bart Van Assche', 'Stephen M. Cameron', linux-scsi, 'dbrace' Steve, Thanks for share detail of your problem. Yes you re right about test I talk. Now I know what you want to discuss on this thread. Jack Right, but if I understand you correctly, you're ganging up 24 device queues and measuring aggregate iops across them all. That is, you have 24 SAS disks all presented individually to the OS, right? (or did the controller aggregate them all into 1 logical drive presented to the OS?) I'm talking about one very low latency single device capable of let's say 450k iops all by itself. The problem is that with the scsi mid layer in this case, there can only be a single request queue feeding that one device (unlike your 24 request queues feeding 24 devices.) That single request queue is essentially single threaded -- only one cpu can touch it at a time to add or remove a request from it. With the block layer's make_request interface, I can take advantage of parallelism in the low level block driver and get essentially a queue per cpu feeding the single device. With the scsi mid layer, the low level driver's queue per cpu is (if I am correct) throttled by the fact that what is feeding those lld queues is one (essentially) single threaded request queue. It doesn't matter that the scsi LLD has a twelve lane highway leading into it because the scsi midlayer has a 1 lane highway feeding into that 12 lane highway. If I understand you correctly, you get 800k iops by measuring 24 highways going to 24 different towns. I have one town and one highway. The part of my highway that I control can handle several hundred kiops, but the part I don't control seemingly cannot. That is why scsi_debug driver can't get very high iops on a single pseudo-device, because there's only one request queue and that queue is protected by a spin lock. perf shows contention on spin locks in scsi_request_fn() -- large percentage of cpu time spent trying to get spin locks in scsi_request_fn(). I forget the exact number right now, but iirc, it was something like 30-40%. That is sort of the whole problem I'm having, as best I understand it, and why I started this thread. And unfortunately I do not have any very good ideas about what to do about it, other than use the block layer's make request interface, which is not ideal for a number of reasons (e.g. people and software (grub, etc.) are very much accustomed to dealing with the sd driver, and all other things being equal, using the sd driver interface is very much preferable.) With flash based storage devices, the age old assumptions that "disks" are glacially slow compared to the cpu(s) and seek penalties exist and are to be avoided which underlie the design of the linux storage subsystem architecture are starting to become false. That's kind of the "big picture" view of the problem. Part of me thinks what we really ought to do is make the non-volatile storage look like RAM at the hardware level, more or less, then put a ramfs on top of it, and call it done (there are probably myriad reasons it's not that simple of which I'm ignorant.) 
-- steve On Thu, Dec 13, 2012 at 7:41 PM, Jack Wang <jack_wang@usish.com> wrote: Maybe, and good to know for real-world scenarios, but scsi-debug with fake_rw=1 isn't even actually doing the i/o. I would think sequential, random, whatever wouldn't matter in that case, because presumably, it's not even looking at the LBAs, much less acting on them, nor would I expect the no-op i/o scheduler to be affected by the LBAs. -- steve For read world hardware, I tested with next generation PMCS SAS controller with 24 SAS disks, 512 sequential read with more than 800K , 512 sequential write with more than 500K similar results with windows 2008, but SATA performance did worse than windows kernel is 3.2.x as I remembered. Jack On Thu, Dec 13, 2012 at 6:22 PM, Jack Wang <jack_wang@usish.com> wrote: On 12/13/12 18:25, scameron@beardog.cce.hp.com wrote: > On Thu, Dec 13, 2012 at 04:22:33PM +0100, Bart Van Assche wrote: >> On 12/11/12 01:00, scameron@beardog.cce.hp.com wrote: >>> The driver, like nvme, has a submit and reply queue per cpu. >> >> This is interesting. If my interpretation of the POSIX spec is >> correct then aio_write() allows to queue overlapping writes and all >> writes submitted by the same thread have to be performed in the order >> they were submitted by that thread. What if a thread submits a first >> write via aio_write(), gets rescheduled on another CPU and submits a >> second overlapping write also via aio_write() ? If a block driver >> uses one queue per CPU, does that mean that such writes that were >> issued in order can be executed in a different order by the driver >> and/or hardware than the order in which the writes were submitted ? >> >> See also the aio_write() man page, The Open Group Base Specifications >> Issue 7, IEEE Std 1003.1-2008 >> (http://pubs.opengroup.org/onlinepubs/9699919799/functions/aio_write.html). > > It is my understanding that the low level driver is free to re-order > the i/o's any way it wants, as is the hardware. It is up to the > layers above to enforce any ordering requirements. For a long time > there was a bug in the cciss driver that all i/o's submitted to the > driver got reversed in order -- adding to head of a list instead of to > the tail, or vice versa, I forget which -- and it caused no real > problems (apart from some slight performance issues that were mostly masked by the Smart Array's cache. > It was caught by firmware guys noticing LBAs coming in in weird orders > for supposedly sequential workloads. > > So in your scenario, I think the overlapping writes should not be > submitted by the block layer to the low level driver concurrently, as > the block layer is aware that the lld is free to re-order things. (I > am very certain that this is the case for scsi low level drivers and > block drivers using a request_fn interface -- less certain about block > drivers using the make_request interface to submit i/o's, as this > interface is pretty new to me. As far as I know there are basically two choices: 1. Allow the LLD to reorder any pair of write requests. The only way for higher layers to ensure the order of (overlapping) writes is then to separate these in time. Or in other words, limit write request queue depth to one. 2. Do not allow the LLD to reorder overlapping write requests. This allows higher software layers to queue write requests (queue depth > 1). From my experience with block and SCSI drivers option (1) doesn't look attractive from a performance point of view. 
From what I have seen performance with QD=1 is several times lower than performance with QD > 1. But maybe I overlooked something ? Bart. I have seen low queue depth improve sequential performance, and high queue depth improve random performance. Jack ^ permalink raw reply [flat|nested] 21+ messages in thread