* SRP Q's: 1) When is asynchronous I/O complete, 2) Is sequential I/O coalesced, and 3) why is iSCSI faster than SRP in some instances
From: Chris Worley @ 2010-01-07 0:16 UTC (permalink / raw)
To: OFED mailing list, scst-devel
In sifting through a great deal of benchmark data collected from two
identical machines (including the attached drive array), I see the
following SRP anomalies:
1) I'm seeing small block random writes (32KB and smaller) get better
performance over SRP than they do as a local drive. I'm guessing this
is async behavior: once the written data is on the wire, it's deemed
complete, and setting a sync flag would disable this. Is this
correct? If not, any ideas why SRP random writes would be faster than
the same writes locally?
2) I'm seeing very poor sequential vs. random I/O performance (both
read and write) at small block sizes (random performs well, sequential
performance is poor). I'm using direct I/O and the noop scheduler on
the initiator, so there should be no coalescing. Coalescing on these
drives is not a good thing to do, as they are ultra low latency, and
much faster if the OS doesn't try to coalesce. Could anything in the
IB/SRP/SCST stack be trying to coalesce sequential data? If not, any
other ideas on why I might see this?
3) In my iSCSI (tgt) results using the HCA as a 10G interface (not
IPoIB, but mlnx4_en), comparing this to the results of using the same
HCA as IB under SRP, I get much better results with SRP when
benchmarking the raw device, as you'd expect. This is w/ a drive that
does under 1GB/s. When I use MD to mirror that SRP or iSCSI device w/
an identical local device, and benchmark the raw MD device, iSCSI gets
superior write performance and about equal read performance. Does
iSCSI/TGT have some special hook into MD devices that IB/SRP isn't
privy to?
Any ideas or clues would be helpful.
Thanks,
Chris
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: SRP Q's: 1) When is asynchronous I/O complete, 2) Is sequential I/O coalesced, and 3) why is iSCSI faster than SRP in some instances
From: David Dillow @ 2010-01-07 1:57 UTC (permalink / raw)
To: Chris Worley; +Cc: OFED mailing list, scst-devel
On Wed, 2010-01-06 at 17:16 -0700, Chris Worley wrote:
> 1) I'm seeing small block random writes (32KB and smaller) get better
> performance over SRP than they do as a local drive. I'm guessing this
> is async behavior: once the written data is on the wire, it's deemed
> complete, and setting a sync flag would disable this. Is this
> correct?
No, from the initiator point of view, the request is not complete until
the target has responded to the command.
> If not, any ideas why SRP random writes would be faster than
> the same writes locally?
I would guess deeper queue depths and more cache available on the
target, especially if you are using a Linux-based SRP target.
But it would only be a guess without knowing more about your setup.
> 2) I'm seeing very poor sequential vs. random I/O performance (both
> read and write) at small block sizes (random performs well, sequential
> performance is poor). I'm using direct I/O and the noop scheduler on
> the initiator, so there should be no coalescing. Coalescing on these
> drives is not a good thing to do, as they are ultra low latency, and
> much faster if the OS doesn't try to coalesce. Could anything in the
> IB/SRP/SCST stack be trying to coalesce sequential data?
Yes, if you have more requests outstanding than available queue depth --
ie queue backpressure/congestion -- even noop will merge sequential
requests in the queue. You could avoid this by setting max_sectors_kb to
the maximum IO size you wish the drive to see.
Though, I'd be surprised if there was no benefit at all to the OS
coalescing under congestion.
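For reference, a minimal sketch of applying that cap via sysfs (the device name "sdb" is an assumption; substitute the SRP disk on the initiator):

```python
# Sketch: cap the maximum request size the block layer will build for a
# device, so that even under queue congestion the scheduler cannot merge
# small sequential submissions into larger requests.
from pathlib import Path

def cap_max_sectors_kb(device: str, size_kb: int) -> None:
    """Write size_kb to /sys/block/<device>/queue/max_sectors_kb."""
    path = Path("/sys/block") / device / "queue" / "max_sectors_kb"
    path.write_text(f"{size_kb}\n")

# Example (requires root and a real block device):
# cap_max_sectors_kb("sdb", 4)   # limit merged requests to 4 KB
```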
> 3) In my iSCSI (tgt) results using the HCA as a 10G interface (not
> IPoIB, but mlnx4_en), comparing this to the results of using the same
> HCA as IB under SRP, I get much better results with SRP when
> benchmarking the raw device, as you'd expect. This is w/ a drive that
> does under 1GB/s. When I use MD to mirror that SRP or iSCSI device w/
> an identical local device, and benchmark the raw MD device, iSCSI gets
> superior write performance and about equal read performance. Does
> iSCSI/TGT have some special hook into MD devices that IB/SRP isn't
> privy to?
Are you trying to achieve high IOPS or high bandwidth? I'm guessing IOPS
from your other comments, but device-mapper (and I suspect MD as well)
used to suffer from an internal limit on the max_sectors_kb -- you could
have it set to 8 MB on the raw devices, but MD would end up restricting
it to 512 KB. This is unlikely to be the problem if you are going for IOPS,
but can play a factor in bandwidth.
Then again, since the setup seems to be identical, I'm not sure it is
your problem here either. :(
Have you tried using the function tracer or perf tools found in recent
kernels to follow the data path and find the hotspots?
Dave
* Re: SRP Q's: 1) When is asynchronous I/O complete, 2) Is sequential I/O coalesced, and 3) why is iSCSI faster than SRP in some instances
From: Chris Worley @ 2010-01-08 21:40 UTC (permalink / raw)
To: David Dillow; +Cc: OFED mailing list, scst-devel
On Wed, Jan 6, 2010 at 6:57 PM, David Dillow <dave-i1Mk8JYDVaaSihdK6806/g@public.gmane.org> wrote:
> On Wed, 2010-01-06 at 17:16 -0700, Chris Worley wrote:
>> 1) I'm seeing small block random writes (32KB and smaller) get better
>> performance over SRP than they do as a local drive. I'm guessing this
>> is async behavior: once the written data is on the wire, it's deemed
>> complete, and setting a sync flag would disable this. Is this
>> correct?
>
> No, from the initiator point of view, the request is not complete until
> the target has responded to the command.
>
>> If not, any ideas why SRP random writes would be faster than
>> the same writes locally?
>
> I would guess deeper queue depths and more cache available on the
> target, especially if you are using a Linux-based SRP target.
I do set the ib_srp initiator "srp_sg_tablesize" to its maximum of 58.
On the Target, I set the "srp_max_rdma_size" to 128KB (but that won't
affect small blocks). I also set thread=1, to work around another
problem.
>
> But it would only be a guess without knowing more about your setup.
>
>> 2) I'm seeing very poor sequential vs. random I/O performance (both
>> read and write) at small block sizes (random performs well, sequential
>> performance is poor). I'm using direct I/O and the noop scheduler on
>> the initiator, so there should be no coalescing. Coalescing on these
>> drives is not a good thing to do, as they are ultra low latency, and
>> much faster if the OS doesn't try to coalesce. Could anything in the
>> IB/SRP/SCST stack be trying to coalesce sequential data?
>
> Yes, if you have more requests outstanding than available queue depth --
> ie queue backpressure/congestion -- even noop will merge sequential
> requests in the queue. You could avoid this by setting max_sectors_kb to
> the maximum IO size you wish the drive to see.
I thought if the device was opened with the O_DIRECT flag, then the
scheduler should have nothing to coalesce.
>
> Though, I'd be surprised if there was no benefit at all to the OS
> coalescing under congestion.
For sequential I/O benchmarking, I need to see the real results for
that size packet. Direct I/O works for me everywhere except SRP.
The problem turns out to be more curious: sequential reads and writes
are being coalesced. I'm getting my IOPS from diskstats, and they
looked very low because the block size reaching the device driver is
much larger than what was submitted (e.g. the driver sees 32KB requests
while 512 byte blocks were being sent). So, had I been looking at the
bandwidth, I would have seen it inordinately/artificially high.
What's more curious is that write performance excels when coalesced
(w.r.t. the block size you think you're benchmarking), but read
performance does not.
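One way to spot this kind of merging from the initiator's diskstats is to compute the average completed-request size. The field layout below follows Documentation/iostats.txt, and the sample line is synthetic:

```python
# Sketch: compute the average completed-write size from a /proc/diskstats
# line. After "major minor name" the fields are: reads, reads_merged,
# sectors_read, ms_reading, writes, writes_merged, sectors_written,
# ms_writing, ... (sectors are 512 bytes).

def avg_write_kb(diskstats_line: str) -> float:
    fields = diskstats_line.split()
    writes_completed = int(fields[7])
    sectors_written = int(fields[9])
    if writes_completed == 0:
        return 0.0
    return sectors_written * 512 / writes_completed / 1024

# Hypothetical sample: 6400 512-byte writes submitted, merged into 100
# 32 KB requests (writes=100, writes_merged=6300, sectors_written=6400).
sample = "8 16 sdb 0 0 0 0 100 6300 6400 50 0 40 50"
print(avg_write_kb(sample))  # -> 32.0
```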
>
>
>> 3) In my iSCSI (tgt) results using the HCA as a 10G interface (not
>> IPoIB, but mlnx4_en), comparing this to the results of using the same
>> HCA as IB under SRP, I get much better results with SRP when
>> benchmarking the raw device, as you'd expect. This is w/ a drive that
>> does under 1GB/s. When I use MD to mirror that SRP or iSCSI device w/
>> an identical local device, and benchmark the raw MD device, iSCSI gets
>> superior write performance and about equal read performance. Does
>> iSCSI/TGT have some special hook into MD devices that IB/SRP isn't
>> privy to?
>
> Are you trying to achieve high IOPS or high bandwidth? I'm guessing IOPS
> from your other comments, but device-mapper (and I suspect MD as well)
> used to suffer from an internal limit on the max_sectors_kb -- you could
> have it set to 8 MB on the raw devices, but MD would end up restricting
> it to 512 KB. This is unlikely to be the problem if you are going for IOPS,
I'm doing the MD on the initiator side. I'll try playing with this.
> but can play a factor in bandwidth.
>
> Then again, since the setup seems to be identical, I'm not sure it is
> your problem here either. :(
>
> Have you tried using the function tracer or perf tools found in recent
> kernels to follow the data path and find the hotspots?
I have not. I parse the data from diskstats. A pointer to these
tools would be appreciated.
Chris
>
> Dave
>
>
* Re: SRP Q's: 1) When is asynchronous I/O complete, 2) Is sequential I/O coalesced, and 3) why is iSCSI faster than SRP in some instances
From: David Dillow @ 2010-01-08 22:17 UTC (permalink / raw)
To: Chris Worley; +Cc: OFED mailing list, scst-devel
On Fri, 2010-01-08 at 14:40 -0700, Chris Worley wrote:
> On Wed, Jan 6, 2010 at 6:57 PM, David Dillow <dave-i1Mk8JYDVaaSihdK6806/g@public.gmane.org> wrote:
> > On Wed, 2010-01-06 at 17:16 -0700, Chris Worley wrote:
> >> 1) I'm seeing small block random writes (32KB and smaller) get better
> >> performance over SRP than they do as a local drive. I'm guessing this
> >> is async behavior: once the written data is on the wire, it's deemed
> >> complete, and setting a sync flag would disable this. Is this
> >> correct?
> >> If not, any ideas why SRP random writes would be faster than
> >> the same writes locally?
> >
> > I would guess deeper queue depths and more cache available on the
> > target, especially if you are using a Linux-based SRP target.
>
> I do set the ib_srp initiator "srp_sg_tablesize" to its maximum of 58.
The max is 255, which will guarantee you can send up to a 1020 KB I/O
without breaking it into two SCSI commands. In practice, you're likely
to be able to send larger requests, as you will often have some
contiguous runs in the data pages.
This is probably not hurting you at smaller request sizes.
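As a sketch of the arithmetic here (assuming each scatter/gather entry covers at least one 4 KB page, per the 255-entry / 1020 KB figures above):

```python
# A table of N S/G entries guarantees an N * 4 KB I/O can be sent
# without splitting it into two SCSI commands; contiguous runs in the
# data pages can allow larger requests in practice.
PAGE_KB = 4

def guaranteed_io_kb(sg_tablesize: int) -> int:
    return sg_tablesize * PAGE_KB

print(guaranteed_io_kb(255))  # -> 1020, the 1020 KB quoted above
print(guaranteed_io_kb(58))   # -> 232, with the observed max of 58
```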
> >> 2) I'm seeing very poor sequential vs. random I/O performance (both
> >> read and write) at small block sizes (random performs well, sequential
> >> performance is poor). I'm using direct I/O and the noop scheduler on
> >> the initiator, so there should be no coalescing. Coalescing on these
> >> drives is not a good thing to do, as they are ultra low latency, and
> >> much faster if the OS doesn't try to coalesce. Could anything in the
> >> IB/SRP/SCST stack be trying to coalesce sequential data?
> >
> > Yes, if you have more requests outstanding than available queue depth --
> > ie queue backpressure/congestion -- even noop will merge sequential
> > requests in the queue. You could avoid this by setting max_sectors_kb to
> > the maximum IO size you wish the drive to see.
>
> I thought if the device was opened with the O_DIRECT flag, then the
> scheduler should have nothing to coalesce.
Depends on how many I/Os your application has in flight at once,
assuming it is using AIO or threads. If you have more requests in flight
than can be queued, the block layer will coalesce if possible.
> > Though, I'd be surprised if there was no benefit at all to the OS
> > coalescing under congestion.
>
> For sequential I/O benchmarking, I need to see the real results for
> that size packet. Direct I/O works for me everywhere except SRP.
Hmm, that seems a bit odd, but there is nothing in the SRP initiator
that would cause the behavior you are seeing -- it just hands over the
requests the SCSI and block layers give it. Are you observing this via
diskstats at the initiator or the target side of the SRP connection?
You could also try using sgp_dd from lustre-iokit, but I've seen some
oddities from it -- it couldn't drive the hardware I was testing at full
speed, where XDD and some custom tools I wrote did.
You may have mentioned this, but are you using the raw device, or a
filesystem over top of it?
Also, I've seen some interesting things like device mapper reporting a 4
KB read as 8 512 byte sectors, even though it was handed to DM as a 4KB
request, so there could be gremlins there as well. I don't know how the
MD device driver reports this.
What does the output of 'cd /sys/block/sda/queue && head *' look like,
where sda should be replaced with the SRP disk. It would also be
interesting to see that for iSCSI, and
in /sys/class/scsi_disk/0:0:0:0/device for both connection types to see
if there is a difference.
> > Have you tried using the function tracer or perf tools found in recent
> > kernels to follow the data path and find the hotspots?
>
> I have not. I parse the data from diskstats. A pointer to these
> tools would be appreciated.
You can find information on them in the kernel source, under
Documentation/trace/ftrace.txt and tools/perf/Documentation
You can also try blktrace.
Dave
* Re: SRP Q's: 1) When is asynchronous I/O complete, 2) Is sequential I/O coalesced, and 3) why is iSCSI faster than SRP in some instances
From: Chris Worley @ 2010-01-08 22:39 UTC (permalink / raw)
To: David Dillow; +Cc: OFED mailing list, scst-devel
On Fri, Jan 8, 2010 at 3:17 PM, David Dillow <dave-i1Mk8JYDVaaSihdK6806/g@public.gmane.org> wrote:
> On Fri, 2010-01-08 at 14:40 -0700, Chris Worley wrote:
>> On Wed, Jan 6, 2010 at 6:57 PM, David Dillow <dave-i1Mk8JYDVaaSihdK6806/g@public.gmane.org> wrote:
>> > On Wed, 2010-01-06 at 17:16 -0700, Chris Worley wrote:
>> >> 1) I'm seeing small block random writes (32KB and smaller) get better
>> >> performance over SRP than they do as a local drive. I'm guessing this
>> >> is async behavior: once the written data is on the wire, it's deemed
>> >> complete, and setting a sync flag would disable this. Is this
>> >> correct?
>
>> >> If not, any ideas why SRP random writes would be faster than
>> >> the same writes locally?
>> >
>> > I would guess deeper queue depths and more cache available on the
>> > target, especially if you are using a Linux-based SRP target.
>>
>> I do set the ib_srp initiator "srp_sg_tablesize" to its maximum of 58.
>
> The max is 255, which will guarantee you can send up to a 1020 KB I/O
> without breaking it into two SCSI commands. In practice, you're likely
> to be able to send larger requests, as you will often have some
> contiguous runs in the data pages.
I've tried a larger max... 58 is all I can get. Maybe getting more is
dependent on some other setting.
>
> This is probably not hurting you at smaller request sizes.
>
>> >> 2) I'm seeing very poor sequential vs. random I/O performance (both
>> >> read and write) at small block sizes (random performs well, sequential
>> >> performance is poor). I'm using direct I/O and the noop scheduler on
>> >> the initiator, so there should be no coalescing. Coalescing on these
>> >> drives is not a good thing to do, as they are ultra low latency, and
>> >> much faster if the OS doesn't try to coalesce. Could anything in the
>> >> IB/SRP/SCST stack be trying to coalesce sequential data?
>> >
>> > Yes, if you have more requests outstanding than available queue depth --
>> > ie queue backpressure/congestion -- even noop will merge sequential
>> > requests in the queue. You could avoid this by setting max_sectors_kb to
>> > the maximum IO size you wish the drive to see.
>>
>> I thought if the device was opened with the O_DIRECT flag, then the
>> scheduler should have nothing to coalesce.
>
> Depends on how many I/Os your application has in flight at once,
> assuming it is using AIO or threads. If you have more requests in flight
> than can be queued, the block layer will coalesce if possible.
I do use AIO, always 64 threads, each w/ 64 outstanding I/O's. Local
or iSER initiator based, I never see any coalescing. Only w/ SRP.
>
>> > Though, I'd be surprised if there was no benefit at all to the OS
>> > coalescing under congestion.
Benefit isn't the issue. It needs to be benchmarked w/o artificial
aids that cloud the results. I'm not really fond of sequential I/O,
as it seldom really exists in real applications (except for logging
apps), but if I'm going to test it, I need valid numbers.
I could do like the SAN/FC vendors do, and just take the throughput
for 1MB blocks and divide the TPS by 2M and call that the 512 byte
block IOPS ;)
>>
>> For sequential I/O benchmarking, I need to see the real results for
>> that size packet. Direct I/O works for me everywhere except SRP.
>
> Hmm, that seems a bit odd, but there is nothing in the SRP initiator
> that would cause the behavior you are seeing -- it just hands over the
> requests the SCSI and block layers give it. Are you observing this via
> diskstats at the initiator or the target side of the SRP connection?
Diskstats on the initiator side.
There is the scst_vdisk "Direct I/O" option that's been commented out
of the code, as it's not supposed to work... maybe direct I/O doesn't
work... but that would be the target side.
>
> You could also try using sgp_dd from lustre-iokit, but I've seen some
> oddities from it -- it couldn't drive the hardware I was testing at full
> speed, where XDD and some custom tools I wrote did.
>
> You may have mentioned this, but are you using the raw device, or a
> filesystem over top of it?
It depends: this #2 issue, sequential vs random: it's atop the raw
block device. The third issue was atop MD. As some of this thread
has been snipped, I'm not completely sure which issue we're
discussing.
>
> Also, I've seen some interesting things like device mapper reporting a 4
> KB read as 8 512 byte sectors, even though it was handed to DM as a 4KB
> request, so there could be gremlins there as well. I don't know how the
> MD device driver reports this.
>
> What does the output of 'cd /sys/block/sda/queue && head *' look like,
> where sda should be replaced with the SRP disk. It would also be
> interesting to see that for iSCSI, and
> in /sys/class/scsi_disk/0:0:0:0/device for both connection types to see
> if there is a difference.
Initiator or target? The target side isn't a SCSI device, it's a
block device. I guess I could use scst_local to make it look
scsi-ish.
>
>> > Have you tried using the function tracer or perf tools found in recent
>> > kernels to follow the data path and find the hotspots?
>>
>> I have not. I parse the data from diskstats. A pointer to these
>> tools would be appreciated.
>
> You can find information on them in the kernel source, under
> Documentation/trace/ftrace.txt and tools/perf/Documentation
>
> You can also try blktrace.
Thanks,
Chris
>
> Dave
>
>
>
* Re: SRP Q's: 1) When is asynchronous I/O complete, 2) Is sequential I/O coalesced, and 3) why is iSCSI faster than SRP in some instances
From: David Dillow @ 2010-01-08 23:07 UTC (permalink / raw)
To: Chris Worley; +Cc: OFED mailing list, scst-devel
On Fri, 2010-01-08 at 15:39 -0700, Chris Worley wrote:
> On Fri, Jan 8, 2010 at 3:17 PM, David Dillow <dave-i1Mk8JYDVaaSihdK6806/g@public.gmane.org> wrote:
> > On Fri, 2010-01-08 at 14:40 -0700, Chris Worley wrote:
> >> I do set the ib_srp initiator "srp_sg_tablesize" to its maximum of 58.
> >
> > The max is 255, which will guarantee you can send up to a 1020 KB I/O
> > without breaking it into two SCSI commands. In practice, you're likely
> > to be able to send larger requests, as you will often have some
> > contiguous runs in the data pages.
>
> I've tried a larger max... 58 is all I can get. Maybe getting more is
> dependent on some other setting.
options ib_srp srp_sg_tablesize=255
in modprobe.conf is all that's needed. You can
check /sys/module/ib_srp/parameters/srp_sg_tablesize
to be sure it took effect. 255 is not dependent on any other settings,
but other limits can keep you from using all of the S/G entries.
But this still isn't hurting you at the small request sizes we seem to
be talking about. Or do you mean 58 KB, which is believable -- the
default is 12, which guarantees a 48 KB request size is possible, and
you'd only need a few pages to coalesce and you'd be there. Of course, 58
isn't a multiple of 4, so maybe it isn't just me misunderstanding.
> >> I thought if the device was opened with the O_DIRECT flag, then the
> >> scheduler should have nothing to coalesce.
> >
> > Depends on how many I/Os your application has in flight at once,
> > assuming it is using AIO or threads. If you have more requests in flight
> > than can be queued, the block layer will coalesce if possible.
>
> I do use AIO, always 64 threads, each w/ 64 outstanding I/O's. Local
> or iSER initiator based, I never see any coalescing. Only w/ SRP.
With 64 requests, you open the possibility of coalescing, as the maximum
queue depth of the unmodified SRP initiator is 63. Or do you mean 64 *
64 == 4096 requests? In that case you are virtually guaranteed to get
coalescing.
Have you tried lower numbers of requests in flight to see if there is a
threshold where the coalescing stops?
> >> For sequential I/O benchmarking, I need to see the real results for
> >> that size packet. Direct I/O works for me everywhere except SRP.
> >
> > Hmm, that seems a bit odd, but there is nothing in the SRP initiator
> > that would cause the behavior you are seeing -- it just hands over the
> > requests the SCSI and block layers give it. Are you observing this via
> > diskstats at the initiator or the target side of the SRP connection?
>
> Diskstats on the initiator side.
> > You may have mentioned this, but are you using the raw device, or a
> > filesystem over top of it?
>
> It depends: this #2 issue, sequential vs random: it's atop the raw
> block device. The third issue was atop MD. As some of this thread
> has been snipped, I'm not completely sure which issue we're
> discussing.
Sorry, I hate to wade through oceans of text to find a reply, but
sometimes I snip too much. I was curious if any of the tests were over a
filesystem, and it sounds like the answer is no. That's good, it rules
out a variable. Let's focus on getting the raw block device tests doing
what you want, and then worry about the MD layer later.
> > What does the output of 'cd /sys/block/sda/queue && head *' look like,
> > where sda should be replaced with the SRP disk. It would also be
> > interesting to see that for iSCSI, and
> > in /sys/class/scsi_disk/0:0:0:0/device for both connection types to see
> > if there is a difference.
>
> Initiator or target? The target side isn't a SCSI device, it's a
> block device. I guess I could use scst_local to make it look
> scsi-ish.
Let's just worry about initiator side for now -- I know very little
about SCST's implementation. If we can get to where we're sending the
desired requests from the initiator, you can take up issues with the
target side with someone else. :)
Dave
* Re: SRP Q's: 1) When is asynchronous I/O complete, 2) Is sequential I/O coalesced, and 3) why is iSCSI faster than SRP in some instances
From: David Dillow @ 2010-01-09 1:20 UTC (permalink / raw)
To: Chris Worley; +Cc: OFED mailing list, scst-devel
On Fri, 2010-01-08 at 18:07 -0500, David Dillow wrote:
> But this still isn't hurting you at the small request sizes we seem to
> be talking about. Or do you mean 58 KB, which is believable -- the
> default is 12, which guarantees a 48 KB request size is possible, and
> you'd only need a few pages to coalesce and you'd be there. Of course, 58
> isn't a multiple of 4, so maybe it isn't just me misunderstanding.
DOh, edited too fast, this should be "maybe it is just me
misunderstanding."
* Re: SRP Q's: 1) When is asynchronous I/O complete, 2) Is sequential I/O coalesced, and 3) why is iSCSI faster than SRP in some instances
From: Bart Van Assche @ 2010-01-09 13:05 UTC (permalink / raw)
To: Chris Worley; +Cc: David Dillow, OFED mailing list, scst-devel
On Fri, Jan 8, 2010 at 11:39 PM, Chris Worley <worleys-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> On Fri, Jan 8, 2010 at 3:17 PM, David Dillow <dave-i1Mk8JYDVaaSihdK6806/g@public.gmane.org> wrote:
>> On Fri, 2010-01-08 at 14:40 -0700, Chris Worley wrote:
>>> On Wed, Jan 6, 2010 at 6:57 PM, David Dillow <dave-i1Mk8JYDVaaSihdK6806/g@public.gmane.org> wrote:
>>> > On Wed, 2010-01-06 at 17:16 -0700, Chris Worley wrote:
>>> >> 1) I'm seeing small block random writes (32KB and smaller) get better
>>> >> performance over SRP than they do as a local drive. I'm guessing this
>>> >> is async behavior: once the written data is on the wire, it's deemed
>>> >> complete, and setting a sync flag would disable this. Is this
>>> >> correct?
>>
>>> >> If not, any ideas why SRP random writes would be faster than
>>> >> the same writes locally?
>>> >
>>> > I would guess deeper queue depths and more cache available on the
>>> > target, especially if you are using a Linux-based SRP target.
>>>
>>> I do set the ib_srp initiator "srp_sg_tablesize" to its maximum of 58.
>>
>> The max is 255, which will guarantee you can send up to a 1020 KB I/O
>> without breaking it into two SCSI commands. In practice, you're likely
>> to be able to send larger requests, as you will often have some
>> contiguous runs in the data pages.
>
> I've tried a larger max... 58 is all I can get. Maybe getting more is
> dependent on some other setting.
The SRP spec says that the target must specify the maximum message
size in the SRP_LOGIN_RSP information unit. The largest value one can
set the srp_sg_tablesize initiator parameter to is (max. SRP message
size defined by the target - 68) / 16. With older SCST-SRPT revisions
the maximum SRP message size was 996 bytes, hence a maximum of 58 for
srp_sg_tablesize. With newer SCST-SRPT revisions the maximum message
size defaults to 2116, which corresponds to a maximum of 128 for
srp_sg_tablesize. The maximum message size can even be increased
further via the module parameter srp_max_message_size of ib_srpt (see
also srpt/src/README in the SCST source tree).
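The limit described above can be sketched as a one-line formula (the constants 68 and 16 come from the SRP_CMD layout as stated, not from anything new):

```python
# The initiator's S/G descriptors must fit in the SRP command IU, so the
# largest usable srp_sg_tablesize is bounded by the maximum SRP message
# size the target advertises in SRP_LOGIN_RSP.

def max_sg_tablesize(max_srp_message_size: int) -> int:
    return (max_srp_message_size - 68) // 16

print(max_sg_tablesize(996))   # -> 58  (older SCST-SRPT default)
print(max_sg_tablesize(2116))  # -> 128 (newer SCST-SRPT default)
```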
Bart.
* Re: SRP Q's: 1) When is asynchronous I/O complete, 2) Is sequential I/O coalesced, and 3) why is iSCSI faster than SRP in some instances
From: David Dillow @ 2010-01-09 17:16 UTC (permalink / raw)
To: Bart Van Assche; +Cc: Chris Worley, OFED mailing list, scst-devel
On Sat, 2010-01-09 at 14:05 +0100, Bart Van Assche wrote:
> The SRP spec says that the target must specify the maximum message
> size in the SRP_LOGIN_RSP information unit. The largest value one can
> set the srp_sg_tablesize initiator parameter to is (max. SRP message
> size defined by the target - 68) / 16. With older SCST-SRPT revisions
> the maximum SRP message size was 996 bytes, hence a maximum of 58 for
> srp_sg_tablesize. With newer SCST-SRPT revisions the maximum message
> size defaults to 2116, which corresponds to a maximum of 128 for
> srp_sg_tablesize.
I see, thanks for the reminder. It's been a while since I've had to deal
with that part of the spec, and I've been fortunate that all of the
vendors I work with have max message sizes that allow 255 entries.
Does SRPT support RDMA'ing the indirect buffer descriptors from the
Initiator such that it isn't constrained by the partial memory
descriptor list in the command request? The initiator restricts itself
to 255 SG entries because I've not found a target that implemented the
spec fully, though it'd be nice to be able to guarantee the ability to
send larger request sizes. I think this will also become important for
running bidirectional commands, since the room for descriptors cached in
the command is shared for both directions.
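Inverting the same (message size - 68) / 16 formula shows what a target would have to advertise to let the initiator describe 255 SG entries entirely within the command request (illustrative arithmetic, not a quote from the spec):

```python
# SRP message size needed to carry n memory descriptors in the command IU:
# 68 bytes of fixed overhead plus 16 bytes per descriptor.
def msg_size_for_entries(n_entries):
    return 68 + 16 * n_entries

print(msg_size_for_entries(255))  # 4148 bytes needed for 255 entries
print(msg_size_for_entries(128))  # 2116 bytes, the newer SCST-SRPT default
```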
Dave
* Re: SRP Q's: 1) When is asynchronous I/O complete, 2) Is sequential I/O coalesced, and 3) why is iSCSI faster than SRP in some instances
[not found] ` <1263057402.14204.55.camel-1q1vX8mYZiGLUyTwlgNVppKKF0rrzTr+@public.gmane.org>
@ 2010-01-09 17:49 ` Bart Van Assche
[not found] ` <e2e108261001090949x6b4c9e25mfb9e6ad0320879dc-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
0 siblings, 1 reply; 13+ messages in thread
From: Bart Van Assche @ 2010-01-09 17:49 UTC (permalink / raw)
To: David Dillow; +Cc: Chris Worley, OFED mailing list, scst-devel
On Sat, Jan 9, 2010 at 6:16 PM, David Dillow <dave-i1Mk8JYDVaaSihdK6806/g@public.gmane.org> wrote:
>
> On Sat, 2010-01-09 at 14:05 +0100, Bart Van Assche wrote:
> > The SRP spec says that the target must specify the maximum message
> > size in the SRP_LOGIN_RSP information unit. The largest value one can
> > set the srp_sg_tablesize initiator parameter to is (max. SRP message
> > size defined by the target - 68) / 16. With older SCST-SRPT revisions
> > the maximum SRP message size was 996 bytes, hence a maximum of 58 for
> > srp_sg_tablesize. With newer SCST-SRPT revisions the maximum message
> > size defaults to 2116, which corresponds to a maximum of 128 for
> > srp_sg_tablesize.
>
> I see, thanks for the reminder. It's been a while since I've had to deal
> with that part of the spec, and I've been fortunate that all of the
> vendors I work with have max message sizes that allow 255 entries.
>
> Does SRPT support RDMA'ing the indirect buffer descriptors from the
> Initiator such that it isn't constrained by the partial memory
> descriptor list in the command request? The initiator restricts itself
> to 255 SG entries because I've not found a target that implemented the
> spec fully, though it'd be nice to be able to guarantee the ability to
> send larger request sizes. I think this will also become important for
> running bidirectional commands, since the room for descriptors cached in
> the command is shared for both directions.
At this time SRPT only supports indirect buffer descriptors that are
present entirely in the command request. Regarding the number of SG
entries: I'm not sure that I understand why you want to be able to
send SG-lists containing more than 255 SG entries. In the tests I have
run the throughput gain resulting from SG list sizes above 128 was
marginal (a few percent).
Bart.
* Re: SRP Q's: 1) When is asynchronous I/O complete, 2) Is sequential I/O coalesced, and 3) why is iSCSI faster than SRP in some instances
[not found] ` <e2e108261001090949x6b4c9e25mfb9e6ad0320879dc-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2010-01-09 18:13 ` David Dillow
0 siblings, 0 replies; 13+ messages in thread
From: David Dillow @ 2010-01-09 18:13 UTC (permalink / raw)
To: Bart Van Assche; +Cc: Chris Worley, OFED mailing list, scst-devel
On Sat, 2010-01-09 at 18:49 +0100, Bart Van Assche wrote:
> On Sat, Jan 9, 2010 at 6:16 PM, David Dillow <dave-i1Mk8JYDVaaSihdK6806/g@public.gmane.org> wrote:
> > Does SRPT support RDMA'ing the indirect buffer descriptors from the
> > Initiator such that it isn't constrained by the partial memory
> > descriptor list in the command request? The initiator restricts itself
> > to 255 SG entries because I've not found a target that implemented the
> > spec fully, though it'd be nice to be able to guarantee the ability to
> > send larger request sizes. I think this will also become important for
> > running bidirectional commands, since the room for descriptors cached in
> > the command is shared for both directions.
>
> At this time SRPT only supports indirect buffer descriptors that are
> present entirely in the command request. Regarding the number of SG
> entries: I'm not sure that I understand why you want to be able to
> send SG-lists containing more than 255 SG entries. In the tests I have
> run the throughput gain resulting from SG list sizes above 128 was
> marginal (a few percent).
It depends on the hardware, I suppose. It may not make sense for SRPT,
but for some of the vendors I deal with, being able to guarantee 1 MB
requests on IB is worth quite a bit of performance, and their tests on
FC show that larger requests still yield enough of a performance gain
to be worth it. I've done some experiments, and it looks like while IB
is less happy with all the 4 KB requests, it can still get enough data
to the device to saturate the RAID controller.
For the controllers we're using, 1 MB requests perform better than 512
KB requests, and sending a 4 MB random write request stream (highly seek
intensive) gets us to ~90% (IIRC) of the 1 MB pure sequential write
performance, without using writeback cache. The controllers have a fair
amount of overhead per request, so doing fewer, larger requests helps.
Our workload is often largely sequential, with large volumes of data,
but once enough clients start writing, it can become random in a hurry.
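The per-request overhead argument is easy to quantify: at a fixed payload rate, the command rate a controller must sustain scales inversely with request size (a sketch with illustrative numbers, not Dave's measurements):

```python
# Commands per second a controller must process to sustain a given
# payload throughput at a given request size.
def requests_per_second(throughput_bytes_per_s, request_size_bytes):
    return throughput_bytes_per_s / request_size_bytes

payload = 10**9  # 1 GB/s of payload, for illustration
print(requests_per_second(payload, 4 * 1024))     # 4 KB requests: ~244k commands/s
print(requests_per_second(payload, 1024 * 1024))  # 1 MB requests: ~954 commands/s
```

With a fair amount of fixed overhead per command, the ~256x difference in command rate is where large requests win.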
* Re: SRP Q's: 1) When is asynchronous I/O complete, 2) Is sequential I/O coalesced, and 3) why is iSCSI faster than SRP in some instances
[not found] ` <f3177b9e1001061616v4f0015d1h843ba19c8cdd83d-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-01-07 1:57 ` David Dillow
@ 2010-01-09 19:25 ` Bart Van Assche
1 sibling, 0 replies; 13+ messages in thread
From: Bart Van Assche @ 2010-01-09 19:25 UTC (permalink / raw)
To: Chris Worley; +Cc: OFED mailing list, scst-devel
On Thu, Jan 7, 2010 at 1:16 AM, Chris Worley <worleys-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> 3) In my iSCSI (tgt) results using the HCA as a 10G interface (not
> IPoIB, but mlnx4_en), comparing this to the results of using the same
> HCA as IB under SRP, I get much better results with SRP when
> benchmarking the raw device, as you'd expect. This is w/ a drive that
> does under 1GB/s. When I use MD to mirror that SRP or iSCSI device w/
> an identical local device, and benchmark the raw MD device, iSCSI gets
> superior write performance and about equal read performance. Does
> iSCSI/TGT have some special hook into MD devices that IB/SRP isn't
> privy to?
Regarding the write test: were the targets configured for write-back
or for write-through?
Bart.
* Re: SRP Q's: 1) When is asynchronous I/O complete, 2) Is sequential I/O coalesced, and 3) why is iSCSI faster than SRP in some instances
[not found] ` <f3177b9e1001081439j3730acefrbfcf523b0da06306-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-01-08 23:07 ` David Dillow
2010-01-09 13:05 ` Bart Van Assche
@ 2010-01-11 18:44 ` Vladislav Bolkhovitin
2 siblings, 0 replies; 13+ messages in thread
From: Vladislav Bolkhovitin @ 2010-01-11 18:44 UTC (permalink / raw)
To: Chris Worley; +Cc: David Dillow, OFED mailing list, scst-devel
Chris Worley, on 01/09/2010 01:39 AM wrote:
>>> I thought if the device was opened with the O_DIRECT flag, then the
>>> scheduler should have nothing to coalesce.
>> Depends on how many I/Os your application has in flight at once,
>> assuming it is using AIO or threads. If you have more requests in flight
>> than can be queued, the block layer will coalesce if possible.
>
> I do use AIO, always 64 threads, each w/ 64 outstanding I/O's. Local
> or iSER initiator based, I never see any coalescing. Only w/ SRP.
The SRP initiator does not seem to be particularly well optimized for
performance; the iSER initiator is noticeably better in this area.
> There is the scst_vdisk "Direct I/O" option that's been commented out
> of the code, as it's not supposed to work... maybe direct I/O doesn't
> work... but that would be the target side.
O_DIRECT for vdisk is supposed to work. It's a matter of a small patch
to the kernel; see http://scst.sourceforge.net/contributing.html#O_DIRECT.
Meanwhile, you can use the fileio_tgt handler, with which O_DIRECT works well.
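As a side note on why "direct I/O doesn't work" can be environment-dependent: O_DIRECT only works when buffers, offsets, and lengths meet the device's alignment requirements, and some filesystems refuse it entirely. A minimal sketch (filename and 4096-byte alignment are illustrative assumptions):

```python
import mmap, os

ALIGN = 4096  # typical logical block size; the real requirement is device-specific

# mmap returns page-aligned memory, which satisfies the usual
# O_DIRECT buffer-alignment requirement
buf = mmap.mmap(-1, ALIGN)
buf.write(b"x" * ALIGN)

path = "odirect_test.bin"  # illustrative filename
try:
    # O_DIRECT is Linux-specific; fall back to a plain open elsewhere
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | getattr(os, "O_DIRECT", 0))
    os.pwrite(fd, buf, 0)  # aligned buffer, aligned offset, aligned length
    os.close(fd)
except OSError:
    pass  # some filesystems (e.g. tmpfs) reject O_DIRECT outright
finally:
    if os.path.exists(path):
        os.remove(path)
```

Misaligned buffers or offsets typically fail with EINVAL, which is easy to mistake for "O_DIRECT doesn't work".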
Vlad
End of thread (newest message: 2010-01-11 18:44 UTC)
Thread overview: 13+ messages; links below jump to each message.
2010-01-07 0:16 SRP Q's: 1) When is asynchronous I/O complete, 2) Is sequential I/O coalesced, and 3) why is iSCSI faster than SRP in some instances Chris Worley
[not found] ` <f3177b9e1001061616v4f0015d1h843ba19c8cdd83d-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-01-07 1:57 ` David Dillow
[not found] ` <1262829441.29991.10.camel-1q1vX8mYZiGLUyTwlgNVppKKF0rrzTr+@public.gmane.org>
2010-01-08 21:40 ` Chris Worley
[not found] ` <f3177b9e1001081340r323c53cela2fb22907212fc2b-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-01-08 22:17 ` David Dillow
[not found] ` <1262989053.14204.21.camel-1q1vX8mYZiGLUyTwlgNVppKKF0rrzTr+@public.gmane.org>
2010-01-08 22:39 ` Chris Worley
[not found] ` <f3177b9e1001081439j3730acefrbfcf523b0da06306-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-01-08 23:07 ` David Dillow
[not found] ` <1262992077.14204.39.camel-1q1vX8mYZiGLUyTwlgNVppKKF0rrzTr+@public.gmane.org>
2010-01-09 1:20 ` David Dillow
2010-01-09 13:05 ` Bart Van Assche
[not found] ` <e2e108261001090505w58a70e8ax5cfa522cbf2da9cf-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-01-09 17:16 ` David Dillow
[not found] ` <1263057402.14204.55.camel-1q1vX8mYZiGLUyTwlgNVppKKF0rrzTr+@public.gmane.org>
2010-01-09 17:49 ` Bart Van Assche
[not found] ` <e2e108261001090949x6b4c9e25mfb9e6ad0320879dc-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-01-09 18:13 ` David Dillow
2010-01-11 18:44 ` Vladislav Bolkhovitin
2010-01-09 19:25 ` Bart Van Assche