* Bypass block layer and Fill SCSI lower layer driver queue
From: Alireza Haghdoost @ 2013-09-18 6:41 UTC (permalink / raw)
To: linux-fsdevel, linux-scsi; +Cc: Jerry Fredin

Hi,

I am working on a high-throughput, low-latency application which cannot tolerate block layer overhead and needs to send IO requests directly to the Fibre Channel lower-layer SCSI driver. I used to work with libaio, but currently I am looking for a way to bypass the block layer and send SCSI commands from the application layer directly to the SCSI driver using a /dev/sgX device and the ioctl() system call.

I have noticed that sending IO requests through the sg device, even with the nonblocking and direct IO flags, is quite slow and does not fill up the lower-layer SCSI driver's TCQ queue, i.e. the IO depth reported in /sys/block/sdX/in_flight is always ZERO. As a result, the application throughput is even lower than sending IO requests through the block layer with libaio and the io_submit() system call. In both cases I used only one IO context (or fd) and a single thread.

I have also noticed that well-known benchmarking tools like fio do not support IO depth for sg devices. Therefore, I was wondering if it is feasible to bypass the block layer and achieve higher throughput and lower latency (for sending IO requests only).

Any comment on my issue is highly appreciated.

Thanks,
Alireza
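For reference, below is a minimal sketch of the /dev/sgX + ioctl() submission described above, using the sg driver's SG_IO ioctl. The device path, LBA, transfer size and 512-byte block size are placeholder assumptions, and note that SG_IO is a synchronous call: it returns only after the command completes, which is one reason a single-threaded ioctl() loop cannot keep the in_flight counter above zero.

/* Minimal sketch: issue one READ(16) through /dev/sgX with the SG_IO ioctl.
 * Device path, LBA and transfer size are placeholder assumptions, and a
 * 512-byte logical block size is assumed.
 * Note: SG_IO is synchronous; it returns only after the command completes,
 * so a single thread using it cannot keep the LLD queue full. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>

int main(void)
{
    unsigned char cdb[16] = { 0x88 };        /* READ(16) opcode              */
    unsigned char sense[32];
    unsigned char buf[4096];
    unsigned long long lba = 0;              /* placeholder LBA              */
    unsigned int nblocks = sizeof(buf) / 512;

    int fd = open("/dev/sg1", O_RDWR);       /* placeholder device node      */
    if (fd < 0) { perror("open"); return 1; }

    /* Fill in LBA (bytes 2..9) and transfer length (bytes 10..13), big-endian. */
    for (int i = 0; i < 8; i++)
        cdb[2 + i] = (lba >> (8 * (7 - i))) & 0xff;
    for (int i = 0; i < 4; i++)
        cdb[10 + i] = (nblocks >> (8 * (3 - i))) & 0xff;

    struct sg_io_hdr hdr;
    memset(&hdr, 0, sizeof(hdr));
    hdr.interface_id    = 'S';
    hdr.cmd_len         = sizeof(cdb);
    hdr.cmdp            = cdb;
    hdr.dxfer_direction = SG_DXFER_FROM_DEV;
    hdr.dxferp          = buf;
    hdr.dxfer_len       = sizeof(buf);
    hdr.sbp             = sense;
    hdr.mx_sb_len       = sizeof(sense);
    hdr.timeout         = 5000;              /* milliseconds                 */

    if (ioctl(fd, SG_IO, &hdr) < 0) { perror("SG_IO"); return 1; }
    printf("status=0x%x host=0x%x driver=0x%x\n",
           hdr.status, hdr.host_status, hdr.driver_status);
    close(fd);
    return 0;
}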
* Re: Bypass block layer and Fill SCSI lower layer driver queue
From: Jack Wang @ 2013-09-18 7:58 UTC (permalink / raw)
To: Alireza Haghdoost; +Cc: linux-fsdevel, linux-scsi, Jerry Fredin

On 09/18/2013 08:41 AM, Alireza Haghdoost wrote:
> Hi,
>
> I am working on a high-throughput, low-latency application which cannot tolerate block layer overhead and needs to send IO requests directly to the Fibre Channel lower-layer SCSI driver. I used to work with libaio, but currently I am looking for a way to bypass the block layer and send SCSI commands from the application layer directly to the SCSI driver using a /dev/sgX device and the ioctl() system call.
>
> I have noticed that sending IO requests through the sg device, even with the nonblocking and direct IO flags, is quite slow and does not fill up the lower-layer SCSI driver's TCQ queue, i.e. the IO depth reported in /sys/block/sdX/in_flight is always ZERO. As a result, the application throughput is even lower than sending IO requests through the block layer with libaio and the io_submit() system call. In both cases I used only one IO context (or fd) and a single thread.

Hi Alireza,

I think what you want to see is the number of in-flight commands the SCSI midlayer has dispatched to the low-level device. I submitted a simple patch to export device_busy:

http://www.spinics.net/lists/linux-scsi/msg68697.html

I also noticed that the fio sg engine does not fill the queue properly, but I haven't looked into it more deeply.

Cheers,
Jack

> I have also noticed that well-known benchmarking tools like fio do not support IO depth for sg devices. Therefore, I was wondering if it is feasible to bypass the block layer and achieve higher throughput and lower latency (for sending IO requests only).
>
> Any comment on my issue is highly appreciated.
>
> Thanks,
> Alireza
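For anyone wanting to watch these counters while testing, here is a small sketch that polls them. The disk name is a placeholder, /sys/block/sdX/inflight is the standard block-layer counter, and the device_busy path is an assumption: it only exists on kernels that carry the patch referenced above (or a later kernel where an equivalent attribute landed).

/* Sketch: print the block-layer in-flight counters and (if present) the
 * SCSI-level device_busy count for one disk. The sysfs paths below are
 * assumptions -- device_busy is only available if the patch above (or a
 * later kernel carrying it) is applied. */
#include <stdio.h>

static void dump_file(const char *path)
{
    char line[128];
    FILE *f = fopen(path, "r");

    if (!f) {
        printf("%s: not available\n", path);
        return;
    }
    if (fgets(line, sizeof(line), f))
        printf("%s: %s", path, line);
    fclose(f);
}

int main(void)
{
    /* "sdb" is a placeholder disk name. */
    dump_file("/sys/block/sdb/inflight");            /* reads writes in flight */
    dump_file("/sys/block/sdb/device/device_busy");  /* commands at the LLD    */
    return 0;
}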
* Re: Bypass block layer and Fill SCSI lower layer driver queue
From: Douglas Gilbert @ 2013-09-18 14:07 UTC (permalink / raw)
To: Jack Wang; +Cc: Alireza Haghdoost, linux-fsdevel, linux-scsi, Jerry Fredin

On 13-09-18 03:58 AM, Jack Wang wrote:
> On 09/18/2013 08:41 AM, Alireza Haghdoost wrote:
>> I have noticed that sending IO requests through the sg device, even with the nonblocking and direct IO flags, is quite slow and does not fill up the lower-layer SCSI driver's TCQ queue, i.e. the IO depth reported in /sys/block/sdX/in_flight is always ZERO.
<SNIP>
>> Therefore, I was wondering if it is feasible to bypass the block layer and achieve higher throughput and lower latency (for sending IO requests only).
>>
>> Any comment on my issue is highly appreciated.

I'm not sure if this is relevant to your problem but by default both the bsg and sg drivers "queue at head" when they inject SCSI commands into the block layer.

The bsg driver has a BSG_FLAG_Q_AT_TAIL flag to change that queueing to what may be preferable for your purposes. The sg driver could, but does not, support that flag.

Doug Gilbert
* Re: Bypass block layer and Fill SCSI lower layer driver queue
From: Boaz Harrosh @ 2013-09-18 14:31 UTC (permalink / raw)
To: dgilbert; +Cc: Jack Wang, Alireza Haghdoost, linux-fsdevel, linux-scsi, Jerry Fredin

On 09/18/2013 05:07 PM, Douglas Gilbert wrote:
> On 13-09-18 03:58 AM, Jack Wang wrote:
>> On 09/18/2013 08:41 AM, Alireza Haghdoost wrote:
<SNIP>
>
> I'm not sure if this is relevant to your problem but by default both the bsg and sg drivers "queue at head" when they inject SCSI commands into the block layer.
>
> The bsg driver has a BSG_FLAG_Q_AT_TAIL flag to change that queueing to what may be preferable for your purposes. The sg driver could, but does not, support that flag.

Yes! The current best bet for keeping the queues full is libaio with direct + asynchronous IO. It should not be significantly slower than bsg. (Believe me, with direct IO the block-device cache is bypassed and the only difference is in who prepares the struct requests for submission.)

As Doug said, sg cannot do it. Also, with bsg and the BSG_FLAG_Q_AT_TAIL flag above, you will need to use the write() interface and not ioctl(), because the latter is synchronous and you want asynchronous submission of commands with background completion of them. (Which is what libaio does with async IO.) With bsg you achieve that by using write() in combination with read() to receive the completions.

> Doug Gilbert

Cheers,
Boaz
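A rough sketch of the asynchronous bsg submission described above: write() queues a command (with BSG_FLAG_Q_AT_TAIL set) and read() later reaps the completion. The /dev/bsg node, CDB and buffer sizes are placeholder assumptions, error handling is minimal, and this applies to kernels of that era; newer kernels have since removed the bsg read()/write() command interface.

/* Sketch of asynchronous command submission via the bsg driver:
 * write() queues the command, read() later collects the completion,
 * and BSG_FLAG_Q_AT_TAIL asks for tail (not head) insertion.
 * The /dev/bsg node and the CDB below are placeholder assumptions,
 * and a 512-byte logical block size is assumed. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <unistd.h>
#include <linux/bsg.h>

int main(void)
{
    unsigned char cdb[16] = { 0x88 };   /* READ(16), LBA 0 -- placeholder   */
    unsigned char sense[32];
    unsigned char *buf;
    struct sg_io_v4 io;

    cdb[13] = 8;                        /* transfer length: 8 blocks        */
    if (posix_memalign((void **)&buf, 4096, 4096))
        return 1;

    int fd = open("/dev/bsg/0:0:0:0", O_RDWR);   /* placeholder device node */
    if (fd < 0) { perror("open"); return 1; }

    memset(&io, 0, sizeof(io));
    io.guard            = 'Q';
    io.protocol         = BSG_PROTOCOL_SCSI;
    io.subprotocol      = BSG_SUB_PROTOCOL_SCSI_CMD;
    io.request          = (uint64_t)(uintptr_t)cdb;
    io.request_len      = sizeof(cdb);
    io.din_xferp        = (uint64_t)(uintptr_t)buf;
    io.din_xfer_len     = 4096;
    io.response         = (uint64_t)(uintptr_t)sense;
    io.max_response_len = sizeof(sense);
    io.timeout          = 5000;                  /* milliseconds             */
    io.flags            = BSG_FLAG_Q_AT_TAIL;
    io.usr_ptr          = 1;                     /* tag to match completion  */

    /* Asynchronous submit: returns once the command is queued. */
    if (write(fd, &io, sizeof(io)) != sizeof(io)) { perror("write"); return 1; }

    /* ... more commands could be queued here to build up IO depth ... */

    /* Reap one completion (blocks unless the fd is O_NONBLOCK / polled). */
    struct sg_io_v4 done;
    if (read(fd, &done, sizeof(done)) != sizeof(done)) { perror("read"); return 1; }
    printf("completed tag %llu, device_status=0x%x\n",
           (unsigned long long)done.usr_ptr, done.device_status);

    free(buf);
    close(fd);
    return 0;
}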
* Re: Bypass block layer and Fill SCSI lower layer driver queue
From: Vladislav Bolkhovitin @ 2013-09-27 6:06 UTC (permalink / raw)
To: linux-scsi; +Cc: dgilbert, Jack Wang, Alireza Haghdoost, linux-fsdevel, Jerry Fredin

Douglas Gilbert, on 09/18/2013 07:07 AM wrote:
> On 13-09-18 03:58 AM, Jack Wang wrote:
>> On 09/18/2013 08:41 AM, Alireza Haghdoost wrote:
<SNIP>
>
> I'm not sure if this is relevant to your problem but by default both the bsg and sg drivers "queue at head" when they inject SCSI commands into the block layer.
>
> The bsg driver has a BSG_FLAG_Q_AT_TAIL flag to change that queueing to what may be preferable for your purposes. The sg driver could, but does not, support that flag.

Just curious: how long is this counterproductive insert-at-head behavior going to stay? I guess by now (almost) nobody can recall why it is this way. This behavior makes the sg interface basically unusable for anything bigger than sg-utils.

Vlad
* Re: Bypass block layer and Fill SCSI lower layer driver queue
From: Nicholas A. Bellinger @ 2013-09-18 21:00 UTC (permalink / raw)
To: Alireza Haghdoost; +Cc: linux-fsdevel, linux-scsi, Jerry Fredin

On Wed, 2013-09-18 at 01:41 -0500, Alireza Haghdoost wrote:
> Hi,
>
> I am working on a high-throughput, low-latency application which cannot tolerate block layer overhead and needs to send IO requests directly to the Fibre Channel lower-layer SCSI driver. I used to work with libaio, but currently I am looking for a way to bypass the block layer and send SCSI commands from the application layer directly to the SCSI driver using a /dev/sgX device and the ioctl() system call.
>
> I have noticed that sending IO requests through the sg device, even with the nonblocking and direct IO flags, is quite slow and does not fill up the lower-layer SCSI driver's TCQ queue, i.e. the IO depth reported in /sys/block/sdX/in_flight is always ZERO. As a result, the application throughput is even lower than sending IO requests through the block layer with libaio and the io_submit() system call. In both cases I used only one IO context (or fd) and a single thread.
<SNIP>
> Any comment on my issue is highly appreciated.

FYI, you've got things backward as to where the real overhead is being introduced.

The block layer / aio overhead is minimal compared to the overhead introduced by the existing scsi_request_fn() logic, and the extreme locking contention between request_queue->queue_lock and scsi_host->host_lock, which are acquired and released multiple times per struct scsi_cmnd dispatch.

This locking contention and other memory allocations currently limit per-struct-scsi_device performance with small-block random IOPs to ~250K, vs. ~1M with raw block drivers providing their own make_request() function.

FYI, there is an early alpha scsi-mq prototype that bypasses the scsi_request_fn() junk altogether and is able to reach small-block IOPs + latency comparable to raw block drivers. Only a handful of LLDs have been converted to run with full scsi-mq pre-allocation thus far, and the code is considered early, early alpha.

It's the only real option for SCSI to get anywhere near raw block driver performance + latency, but it is still quite a ways off mainline.

--nab
* Re: Bypass block layer and Fill SCSI lower layer driver queue
From: Alireza Haghdoost @ 2013-09-19 2:05 UTC (permalink / raw)
To: Nicholas A. Bellinger; +Cc: linux-fsdevel, linux-scsi, Jerry Fredin

On Wed, Sep 18, 2013 at 4:00 PM, Nicholas A. Bellinger <nab@linux-iscsi.org> wrote:
> On Wed, 2013-09-18 at 01:41 -0500, Alireza Haghdoost wrote:
<SNIP>

Hi Nicholas,

Thanks for your reply and for sharing your thoughts with us. Please find my comments below:

> FYI, you've got things backward as to where the real overhead is being introduced.

As far as I understand, you are saying that the overhead of building the SCSI request in this case moves from the kernel to the application layer. That is true. However, our application does not need to create SCSI commands online, i.e. it can prepare a batch of SCSI commands offline (during a warm-up phase) and then send all of them to the device driver when it goes online. The overhead of creating SCSI commands therefore moves to the application layer but is excluded from the critical phase of the application. In the critical phase we have to send IO requests to the driver as fast as possible, and we don't want to spend time creating SCSI commands then. That is the whole motivation for NOT using libaio and a raw device.

> The block layer / aio overhead is minimal compared to the overhead introduced by the existing scsi_request_fn() logic, and the extreme locking contention between request_queue->queue_lock and scsi_host->host_lock, which are acquired and released multiple times per struct scsi_cmnd dispatch.

Does that mean that even if we send SCSI commands directly to the sg device, they still suffer from the overhead of scsi_request_fn()? I was thinking it would bypass scsi_request_fn(), since the SCSI commands are built inside this function (i.e. scsi_prep_fn() called by scsi_request_fn()). In my situation, however, the SCSI commands are built in the application layer, and logically there should be no reason to suffer from the overhead of scsi_request_fn().

OK, below is a trace of function calls I collected using ftrace while running our application sending IO requests directly to a raw device (/dev/sdX) using libaio. I am going to describe my view of the overhead in this example.

The whole io_submit() system call in this case takes about 42us to finish (please note there is some measurement error caused by dynamic instrumentation, but we can ignore that in a relative comparison). libaio consumes 22us to prepare and run the iocb in aio_run_iocb(), which is half of the whole system call time, while scsi_request_fn() consumes only 13us to create the SCSI command and submit it to the low-level driver queue. The SCSI low-level driver takes less than 3us in qla2xxx_queuecommand() to queue the SCSI command.

To me, spending 22us in libaio is a big deal for millions of IO requests. It is more than 50% overhead. That is why I am not in favor of libaio. Moreover, preparing SCSI commands inside scsi_request_fn() takes almost 10us if we exclude the submission time to the low-level driver. That is like 10% overhead. That is why I am interested in doing this job offline inside the application layer. The greedy approach I am looking for is to spend only around 3us running the low-level driver queuing the SCSI command (in my case qla2xxx_queuecommand()) and bypass everything else. I am wondering whether that is possible or not?

 2)               |  sys_io_submit() {
 2)               |    do_io_submit() {
 2)               |      aio_run_iocb() {
 2)               |        blkdev_aio_write() {
 2)               |          __generic_file_aio_write()
 2)               |          generic_file_direct_write() {
 2)               |            blkdev_direct_IO() {
 2)               |              submit_bio() {
 2)               |                generic_make_request() {
 2)               |                  blk_queue_bio() {
 2)   0.035 us    |                    blk_queue_bounce();
 2) + 21.351 us   |                  }
 2) + 21.580 us   |                }
 2) + 22.070 us   |              }
 2)               |              blk_finish_plug() {
 2)               |                queue_unplugged() {
 2)               |                  scsi_request_fn() {
 2)               |                    blk_peek_request() {
 2)               |                      sd_prep_fn() {
 2)               |                        scsi_setup_fs_cmnd() {
 2)               |                          scsi_get_cmd_from_req()
 2)               |                          scsi_init_io()
 2)   4.735 us    |                        }
 2)   0.033 us    |                        scsi_prep_return();
 2)   5.234 us    |                      }
 2)   5.969 us    |                    }
 2)               |                    blk_queue_start_tag() {
 2)               |                      blk_start_request() {
 2)   0.044 us    |                        blk_dequeue_request();
 2)   1.364 us    |                      }
 2)               |                      scsi_dispatch_cmd() {
 2)               |                        qla2xxx_queuecommand() {
 2)               |                          qla24xx_dif_start_scsi()
 2)   3.235 us    |                        }
 2)   3.706 us    |                      }
 2) + 13.792 us   |                  }  // end of scsi_request_fn()
 2) + 14.021 us   |                }
 2) + 15.239 us   |              }
 2) + 15.463 us   |            }
 2) + 42.282 us   |    }
 2) + 42.519 us   |  }

> This locking contention and other memory allocations currently limit per-struct-scsi_device performance with small-block random IOPs to ~250K, vs. ~1M with raw block drivers providing their own make_request() function.
<SNIP>
> --nab
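For comparison, below is a rough sketch of the libaio + O_DIRECT submission path measured in the trace above (io_submit() -> blkdev_direct_IO() -> scsi_request_fn()). The device path, queue depth and block size are placeholder choices; it needs to be linked with -laio.

/* Sketch of the libaio + O_DIRECT path traced above: one io_submit() call
 * queues a batch of reads against a raw block device, io_getevents() reaps
 * them. Device path, queue depth and block size are placeholder choices.
 * Build with: gcc -O2 sketch.c -laio */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <libaio.h>

#define QDEPTH 32
#define BLKSZ  4096

int main(void)
{
    struct iocb cbs[QDEPTH], *cbp[QDEPTH];
    struct io_event events[QDEPTH];
    io_context_t ctx = 0;

    int fd = open("/dev/sdb", O_RDONLY | O_DIRECT);   /* placeholder device */
    if (fd < 0) { perror("open"); return 1; }

    int rc = io_setup(QDEPTH, &ctx);
    if (rc) { fprintf(stderr, "io_setup: %d\n", rc); return 1; }

    for (int i = 0; i < QDEPTH; i++) {
        void *buf;
        if (posix_memalign(&buf, BLKSZ, BLKSZ))       /* O_DIRECT alignment */
            return 1;
        io_prep_pread(&cbs[i], fd, buf, BLKSZ, (long long)i * BLKSZ);
        cbp[i] = &cbs[i];
    }

    /* One system call submits the whole batch, which is what lets the
     * LLD queue (in_flight) actually fill up. */
    int submitted = io_submit(ctx, QDEPTH, cbp);
    if (submitted < 0) { fprintf(stderr, "io_submit: %d\n", submitted); return 1; }

    int done = io_getevents(ctx, submitted, submitted, events, NULL);
    printf("submitted %d, completed %d\n", submitted, done);

    io_destroy(ctx);
    return 0;
}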
* Re: Bypass block layer and Fill SCSI lower layer driver queue
From: Nicholas A. Bellinger @ 2013-09-19 21:49 UTC (permalink / raw)
To: Alireza Haghdoost; +Cc: linux-fsdevel, linux-scsi, Jerry Fredin

On Wed, 2013-09-18 at 21:05 -0500, Alireza Haghdoost wrote:
> On Wed, Sep 18, 2013 at 4:00 PM, Nicholas A. Bellinger <nab@linux-iscsi.org> wrote:
> > On Wed, 2013-09-18 at 01:41 -0500, Alireza Haghdoost wrote:
<SNIP>
> Hi Nicholas,
>
> Thanks for your reply and for sharing your thoughts with us. Please find my comments below:
>
> > FYI, you've got things backward as to where the real overhead is being introduced.
>
> As far as I understand, you are saying that the overhead of building the SCSI request in this case moves from the kernel to the application layer. That is true. However, our application does not need to create SCSI commands online.

The largest overhead in SCSI is not from the formation of the commands, although in the existing code the numerous memory allocations are certainly not helping latency.

The big elephant in the room is scsi_request_fn(), which expects to take request_queue->queue_lock and scsi_host->host_lock *multiple* times for each struct scsi_cmnd that is being dispatched to a LLD.

To put this into perspective, with enough SCSI LUNs (say 10x) and a machine powerful enough to reach 1M IOPs, on the order of ~40% of CPU time is spent contending on these two locks in scsi_request_fn() alone! Contrast that with the blk-mq <-> scsi-mq code, where it's easily possible to reach sustained 1M IOPs within a KVM guest to a scsi_debug ramdisk on a moderately powered laptop.

This is very easy to demonstrate with SCSI in its current state. Take scsi_debug and NOP all REQ_TYPE_FS requests to immediately complete using sc->scsi_done(). Then take a raw block driver and do the same thing. You'll find that as you scale up, the scsi_debug driver will be limited to ~250K IOPs, while the raw block driver is easily capable of ~1M IOPs per LUN. As you add more LUNs to the same struct scsi_host, things only get worse, because of scsi_host-wide locking.

> It can prepare a batch of SCSI commands offline (during a warm-up phase) and then send all of them to the device driver when it goes online. The overhead of creating SCSI commands therefore moves to the application layer but is excluded from the critical phase of the application.
<SNIP>
> To me, spending 22us in libaio is a big deal for millions of IO requests.

Look closer. The latency is not libaio specific. The majority of the overhead is actually in the direct IO codepath.

The latency in DIO is primarily from the awkward order in which it does things: e.g. first it pins userspace pages, then it asks the fs where it's mapping to, which includes the size of the IO it's going to submit, then it allocates a bio, fills it out, etc. etc.

So part of Jens' work on blk-mq has been to optimize the DIO path, and given that blk-mq is now scaling to 10M IOPs per device (yes, that is not a typo), it's clear that libaio and DIO are not the underlying problem when it comes to scaling the SCSI subsystem for heavy random small-block workloads.

> It is more than 50% overhead. That is why I am not in favor of libaio. Moreover, preparing SCSI commands inside scsi_request_fn() takes almost 10us if we exclude the submission time to the low-level driver. That is like 10% overhead. That is why I am interested in doing this job offline inside the application layer.

Trying to bypass scsi_request_fn() is essentially bypassing the request_queue, and all of the queue_depth management that comes along with it. This would be bad, because it would allow user-space to queue more requests to an LLD than the underlying hardware is capable of handling.

> The greedy approach I am looking for is to spend only around 3us running the low-level driver queuing the SCSI command (in my case qla2xxx_queuecommand()) and bypass everything else. I am wondering whether that is possible or not?

No, it's the wrong approach. The correct approach is to use blk-mq <-> scsi-mq, and optimize the scsi-generic codepath from there.

--nab
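To make the "NOP all REQ_TYPE_FS requests" experiment above concrete, here is a hedged sketch of what such a short-circuit might look like inside a queuecommand() handler on a kernel of that era (~3.x). The function, the real_queuecommand() fallback, and the hook point into scsi_debug are assumptions for illustration, not an actual patch; field and macro names follow that era's API.

/* Hedged sketch (kernel of the ~3.x era; not an actual scsi_debug patch):
 * complete every filesystem-originated command immediately from
 * queuecommand(), so only midlayer overhead is measured. */
#include <linux/blkdev.h>
#include <scsi/scsi.h>
#include <scsi/scsi_cmnd.h>
#include <scsi/scsi_host.h>

/* Placeholder for the driver's normal queuecommand path. */
static int real_queuecommand(struct Scsi_Host *shost, struct scsi_cmnd *sc);

static int nop_queuecommand(struct Scsi_Host *shost, struct scsi_cmnd *sc)
{
    if (sc->request && sc->request->cmd_type == REQ_TYPE_FS) {
        sc->result = DID_OK << 16;      /* pretend the command succeeded     */
        sc->scsi_done(sc);              /* complete without touching hardware */
        return 0;
    }
    /* Non-FS (pass-through) commands still go down the normal path. */
    return real_queuecommand(shost, sc);
}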
End of thread (newest message: 2013-09-27 6:06 UTC).

Thread overview: 8+ messages
  2013-09-18  6:41 Bypass block layer and Fill SCSI lower layer driver queue  Alireza Haghdoost
  2013-09-18  7:58 ` Jack Wang
  2013-09-18 14:07   ` Douglas Gilbert
  2013-09-18 14:31     ` Boaz Harrosh
  2013-09-27  6:06     ` Vladislav Bolkhovitin
  2013-09-18 21:00 ` Nicholas A. Bellinger
  2013-09-19  2:05   ` Alireza Haghdoost
  2013-09-19 21:49     ` Nicholas A. Bellinger