* srp sg_tablesize
@ 2010-08-20 7:49 Bernd Schubert
[not found] ` <201008200949.54595.bs_lists-ivAEE9vf7JuUmYeGgvxl9AC/G2K4zDHf@public.gmane.org>
0 siblings, 1 reply; 12+ messages in thread
From: Bernd Schubert @ 2010-08-20 7:49 UTC (permalink / raw)
To: general-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Bernd Schubert
In ib_srp.c sg_tablesize is defined as 255. With that value we see lots of IO
requests of size 1020. As I already wrote on linux-scsi, that is really
sub-optimal for DDN storage, as lots of IO requests of size 1020 come up.

Now the question is if we can safely increase it. Is there somewhere a
definition what is the real hardware supported size? And shouldn't we
increase sg_tablesize, but also set the .dma_boundary value?
Thanks in advance,
Bernd
--
Bernd Schubert
DataDirect Networks
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply	[flat|nested] 12+ messages in thread

[parent not found: <201008200949.54595.bs_lists-ivAEE9vf7JuUmYeGgvxl9AC/G2K4zDHf@public.gmane.org>]
* Re: srp sg_tablesize
  [not found] ` <201008200949.54595.bs_lists-ivAEE9vf7JuUmYeGgvxl9AC/G2K4zDHf@public.gmane.org>
@ 2010-08-20 14:15   ` David Dillow
  [not found]     ` <1282313740.7441.25.camel-FqX9LgGZnHWDB2HL1qBt2PIbXMQ5te18@public.gmane.org>
  2010-08-21 11:14   ` Bart Van Assche
  1 sibling, 1 reply; 12+ messages in thread

From: David Dillow @ 2010-08-20 14:15 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: general-G2znmakfqn7U1rindQTSdQ, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Bernd Schubert

On Fri, 2010-08-20 at 09:49 +0200, Bernd Schubert wrote:
> In ib_srp.c sg_tablesize is defined as 255. With that value we see lots of IO
> requests of size 1020. As I already wrote on linux-scsi, that is really
> sub-optimal for DDN storage, as lots of IO requests of size 1020 come up.
>
> Now the question is if we can safely increase it. Is there somewhere a
> definition what is the real hardware supported size? And shouldn't we
> increase sg_tablesize, but also set the .dma_boundary value?

Currently, we limit sg_tablesize to 255 because we can only cache 255
indirect memory descriptors in the SRP_CMD message to the target. That's
due to the count being an 8-bit field.

It does not have to be this way -- the spec defines that the indirect
descriptors in the message are just a cache, and the target should RDMA
any additional descriptors from the initiator and then process those as
well. So we could easily take it higher, up to the size of a contiguous
allocation (or bigger, using FMR). However, to my knowledge, no vendor
implements this support.

We could make more descriptors fit in the SRP_CMD by using FMR to make
them virtually contiguous. The initiator currently tries to allocate 512
byte pages, but I think it ends up using 4K pages, as I don't think any
HCA supports a smaller FMR page. That's OK -- I'm pretty sure that the
mid-layer isn't going to pass down an SG list of 512 byte sectors, it
would be in pages, but it's something I'd have to check to be sure.
You could get a ~255 MB request using this method, assuming you didn't
run out of FMR entries (that request would need up to 65,280 entries).

The problem with using FMR in this manner is the failure cases. We have
no way to tell the SCSI mid-layer that it needs to split the request up,
and even if we could, there may be certain commands that must not be
split. We could return BUSY if we fail to allocate an FMR entry, but
then we have no guarantee of forward progress. This should be a rare
case, but it's not something we want in a storage system.

So, we would still want to be able to fall back to the RDMA of indirect
descriptors, even if it is very rarely used.

If you can get Cedric to add it to the target, I'll commit to writing
the initiator part. We'd love to have it, as would many of your other
customers.
--
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office
[parent not found: <1282313740.7441.25.camel-FqX9LgGZnHWDB2HL1qBt2PIbXMQ5te18@public.gmane.org>]
* Re: srp sg_tablesize
  [not found]     ` <1282313740.7441.25.camel-FqX9LgGZnHWDB2HL1qBt2PIbXMQ5te18@public.gmane.org>
@ 2010-08-24 19:47       ` Bernd Schubert
  [not found]         ` <201008242147.50692.bs_lists-ivAEE9vf7JuUmYeGgvxl9AC/G2K4zDHf@public.gmane.org>
  0 siblings, 1 reply; 12+ messages in thread

From: Bernd Schubert @ 2010-08-24 19:47 UTC (permalink / raw)
  To: David Dillow
  Cc: general-G2znmakfqn7U1rindQTSdQ, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Bernd Schubert

David, thanks a lot for your explanation, and I'm sorry for my late reply. I
have to admit that I'm not familiar at all with the SRP protocol, so please
excuse it if you have lost me, and bear with my further questions about it.

On Friday, August 20, 2010, David Dillow wrote:
> On Fri, 2010-08-20 at 09:49 +0200, Bernd Schubert wrote:
> > In ib_srp.c sg_tablesize is defined as 255. With that value we see lots
> > of IO requests of size 1020. As I already wrote on linux-scsi, that is
> > really sub-optimal for DDN storage, as lots of IO requests of size 1020
> > come up.
> >
> > Now the question is if we can safely increase it. Is there somewhere a
> > definition what is the real hardware supported size? And shouldn't we
> > increase sg_tablesize, but also set the .dma_boundary value?
>
> Currently, we limit sg_tablesize to 255 because we can only cache 255
> indirect memory descriptors in the SRP_CMD message to the target. That's
> due to the count being an 8-bit field.

I think the magic is in srp_map_data(), but I do not find any 8-bit field
there?

While looking through the code, I also think I found a bug. In srp_map_data():

	count = ib_dma_map_sg()

Now if something fails, count may become zero, and that is not handled at all.

> It does not have to be this way -- the spec defines that the indirect
> descriptors in the message are just a cache, and the target should RDMA
> any additional descriptors from the initiator and then process those as
> well.
> So we could easily take it higher, up to the size of a contiguous
> allocation (or bigger, using FMR). However, to my knowledge, no vendor
> implements this support.

I have no idea if DDN supports it or not, but I'm sure I could figure it out.

> We could make more descriptors fit in the SRP_CMD by using FMR to make
> them virtually contiguous. The initiator currently tries to allocate 512
> byte pages, but I think it ends up using 4K pages, as I don't think any
> HCA supports a smaller FMR page. That's OK -- I'm pretty sure that the
> mid-layer isn't going to pass down an SG list of 512 byte sectors, it
> would be in pages, but it's something I'd have to check to be sure. You
> could get a ~255 MB request using this method, assuming you didn't run
> out of FMR entries (that request would need up to 65,280 entries).

Hmm, there is already srp_map_fmr(), and if that fails it already uses an
indirect mapping? Or do I completely miss something?

> The problem with using FMR in this manner is the failure cases. We have
> no way to tell the SCSI mid-layer that it needs to split the request up,
> and even if we could, there may be certain commands that must not be
> split. We could return BUSY if we fail to allocate an FMR entry, but
> then we have no guarantee of forward progress. This should be a rare
> case, but it's not something we want in a storage system.
>
> So, we would still want to be able to fall back to the RDMA of indirect
> descriptors, even if it is very rarely used.
>
> If you can get Cedric to add it to the target, I'll commit to writing
> the initiator part. We'd love to have it, as would many of your other
> customers.

Hmm, who is Cedric? One of my European colleagues from Paris is Cedric, but
I doubt you mean him?
Thanks,
Bernd
--
Bernd Schubert
DataDirect Networks
[parent not found: <201008242147.50692.bs_lists-ivAEE9vf7JuUmYeGgvxl9AC/G2K4zDHf@public.gmane.org>]
* Re: srp sg_tablesize
  [not found]         ` <201008242147.50692.bs_lists-ivAEE9vf7JuUmYeGgvxl9AC/G2K4zDHf@public.gmane.org>
@ 2010-08-24 20:23           ` David Dillow
  0 siblings, 0 replies; 12+ messages in thread

From: David Dillow @ 2010-08-24 20:23 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: general-G2znmakfqn7U1rindQTSdQ@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Bernd Schubert

On Tue, 2010-08-24 at 15:47 -0400, Bernd Schubert wrote:
> On Friday, August 20, 2010, David Dillow wrote:
> > Currently, we limit sg_tablesize to 255 because we can only cache 255
> > indirect memory descriptors in the SRP_CMD message to the target. That's
> > due to the count being an 8-bit field.
>
> I think the magic is in srp_map_data(), but I do not find any 8-bit field
> there?

The SRP_CMD message is described in the SRP spec, and also by struct
srp_cmd in include/scsi/srp.h. The fields in question are
data_{in,out}_desc_cnt.

> While looking through the code, I also think I found a bug. In
> srp_map_data():
>
>	count = ib_dma_map_sg()
>
> Now if something fails, count may become zero, and that is not handled at
> all.

Yes, I think you are correct. I don't think it is possible to hit on any
system arch that one would use IB on, but I'll add it to the list of things
I need to fix.

> > It does not have to be this way -- the spec defines that the indirect
> > descriptors in the message are just a cache, and the target should RDMA
> > any additional descriptors from the initiator and then process those as
> > well. So we could easily take it higher, up to the size of a contiguous
> > allocation (or bigger, using FMR). However, to my knowledge, no vendor
> > implements this support.
>
> I have no idea if DDN supports it or not, but I'm sure I could figure it
> out.

You don't; trust me on this. :)

> > We could make more descriptors fit in the SRP_CMD by using FMR to make
> > them virtually contiguous.
> > The initiator currently tries to allocate 512 byte pages, but I think
> > it ends up using 4K pages, as I don't think any HCA supports a smaller
> > FMR page. That's OK -- I'm pretty sure that the mid-layer isn't going
> > to pass down an SG list of 512 byte sectors, it would be in pages, but
> > it's something I'd have to check to be sure. You could get a ~255 MB
> > request using this method, assuming you didn't run out of FMR entries
> > (that request would need up to 65,280 entries).
>
> Hmm, there is already srp_map_fmr(), and if that fails it already uses an
> indirect mapping? Or do I completely miss something?

Yes, that tries to use FMR to map the pages, and we fall back to indirect
mappings if that fails. We could use FMR to reduce the number of S/G
entries, but we would still need a fallback before we could tell the SCSI
mid-layer that we can handle more than 255 entries.
--
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office
* Re: srp sg_tablesize
  [not found] ` <201008200949.54595.bs_lists-ivAEE9vf7JuUmYeGgvxl9AC/G2K4zDHf@public.gmane.org>
  2010-08-20 14:15   ` David Dillow
@ 2010-08-21 11:14   ` Bart Van Assche
  [not found]     ` <AANLkTimMoyEpfYPFSLLqS9ZCg3VyyOQcd4i2zzCQjHMN-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 1 reply; 12+ messages in thread

From: Bart Van Assche @ 2010-08-21 11:14 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: general-ZwoEplunGu1OwGhvXhtEPSCwEArCW2h5,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Bernd Schubert

On Fri, Aug 20, 2010 at 9:49 AM, Bernd Schubert
<bs_lists-ivAEE9vf7JuUmYeGgvxl9AC/G2K4zDHf@public.gmane.org> wrote:
>
> In ib_srp.c sg_tablesize is defined as 255. With that value we see lots of IO
> requests of size 1020. As I already wrote on linux-scsi, that is really
> sub-optimal for DDN storage, as lots of IO requests of size 1020 come up.
>
> Now the question is if we can safely increase it. Is there somewhere a
> definition what is the real hardware supported size? And shouldn't we
> increase sg_tablesize, but also set the .dma_boundary value?

(resending as plain text)

The request size of 1020 indicates that there are less than 60 data
buffer descriptors in the SRP_CMD request. So you are probably hitting
another limit than srp_sg_tablesize. Did this occur with buffered
(asynchronous) or unbuffered (direct) I/O? And in the first case, which
I/O scheduler did you use?

Bart.
[parent not found: <AANLkTimMoyEpfYPFSLLqS9ZCg3VyyOQcd4i2zzCQjHMN-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: srp sg_tablesize
  [not found]     ` <AANLkTimMoyEpfYPFSLLqS9ZCg3VyyOQcd4i2zzCQjHMN-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2010-08-21 16:27       ` David Dillow
  [not found]         ` <1282408043.20840.13.camel-1q1vX8mYZiGLUyTwlgNVppKKF0rrzTr+@public.gmane.org>
  0 siblings, 1 reply; 12+ messages in thread

From: David Dillow @ 2010-08-21 16:27 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Bernd Schubert, general-G2znmakfqn7U1rindQTSdQ,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Bernd Schubert

On Sat, 2010-08-21 at 13:14 +0200, Bart Van Assche wrote:
> On Fri, Aug 20, 2010 at 9:49 AM, Bernd Schubert
> <bs_lists-ivAEE9vf7JuUmYeGgvxl9AC/G2K4zDHf@public.gmane.org> wrote:
> >
> > In ib_srp.c sg_tablesize is defined as 255. With that value we see lots
> > of IO requests of size 1020. As I already wrote on linux-scsi, that is
> > really sub-optimal for DDN storage, as lots of IO requests of size 1020
> > come up.
> >
> > Now the question is if we can safely increase it. Is there somewhere a
> > definition what is the real hardware supported size? And shouldn't we
> > increase sg_tablesize, but also set the .dma_boundary value?
>
> (resending as plain text)
>
> The request size of 1020 indicates that there are less than 60 data
> buffer descriptors in the SRP_CMD request. So you are probably hitting
> another limit than srp_sg_tablesize.

4 KB * 255 descriptors = 1020 KB

IIRC, we verified that we were seeing 255 entries in the S/G list with a
few printk()s, but it has been a few years.

I'm not sure how you came up with 60 descriptors -- could you elaborate
please?

> Did this occur with buffered (asynchronous) or unbuffered (direct) I/O?
> And in the first case, which I/O scheduler did you use?

I'm sure Bernd will speak for his situation, but we've seen it with both
buffered and unbuffered, with the deadline and noop schedulers (mostly
on vendor 2.6.18 kernels). CFQ never gave us larger than 512 KB
requests. Our main use is Lustre, which does unbuffered IO from the
kernel.
--
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office
[parent not found: <1282408043.20840.13.camel-1q1vX8mYZiGLUyTwlgNVppKKF0rrzTr+@public.gmane.org>]
* Re: srp sg_tablesize
  [not found]         ` <1282408043.20840.13.camel-1q1vX8mYZiGLUyTwlgNVppKKF0rrzTr+@public.gmane.org>
@ 2010-08-21 17:28           ` Bart Van Assche
  [not found]             ` <AANLkTimFS=QkHd9+393mS1gQ5ZnL79jSDQaUZ8C_Xd2A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2010-08-21 18:04           ` Bernd Schubert
  1 sibling, 1 reply; 12+ messages in thread

From: Bart Van Assche @ 2010-08-21 17:28 UTC (permalink / raw)
  To: David Dillow
  Cc: Bernd Schubert, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Bernd Schubert

On Sat, Aug 21, 2010 at 6:27 PM, David Dillow
<dillowda-1Heg1YXhbW8@public.gmane.org> wrote:
>
> On Sat, 2010-08-21 at 13:14 +0200, Bart Van Assche wrote:
> > On Fri, Aug 20, 2010 at 9:49 AM, Bernd Schubert
> > <bs_lists-ivAEE9vf7JuUmYeGgvxl9AC/G2K4zDHf@public.gmane.org> wrote:
> > >
> > > In ib_srp.c sg_tablesize is defined as 255. With that value we see
> > > lots of IO requests of size 1020. As I already wrote on linux-scsi,
> > > that is really sub-optimal for DDN storage, as lots of IO requests
> > > of size 1020 come up.
> > >
> > > Now the question is if we can safely increase it. Is there somewhere
> > > a definition what is the real hardware supported size? And shouldn't
> > > we increase sg_tablesize, but also set the .dma_boundary value?
> >
> > (resending as plain text)
> >
> > The request size of 1020 indicates that there are less than 60 data
> > buffer descriptors in the SRP_CMD request. So you are probably hitting
> > another limit than srp_sg_tablesize.
>
> 4 KB * 255 descriptors = 1020 KB
>
> IIRC, we verified that we were seeing 255 entries in the S/G list with a
> few printk()s, but it has been a few years.
>
> I'm not sure how you came up with 60 descriptors -- could you elaborate
> please?

The original message mentions "size 1020" but not the unit of that
size, so I guessed that this referred to an SRP_CMD information unit
of 1020 bytes. And in an SRP_CMD message of 1020 bytes there fit at
most 59 descriptors ((1020-68)/16).
Now that I see your computation, I'm afraid that my guess about the
meaning of the original message was wrong. Looks like I have been
delving too deep into the SRP protocol ...

> > Did this occur with buffered (asynchronous) or unbuffered (direct) I/O?
> > And in the first case, which I/O scheduler did you use?
>
> I'm sure Bernd will speak for his situation, but we've seen it with both
> buffered and unbuffered, with the deadline and noop schedulers (mostly
> on vendor 2.6.18 kernels). CFQ never gave us larger than 512 KB
> requests. Our main use is Lustre, which does unbuffered IO from the
> kernel.

If ib_srp is already sending SRP commands with 255 descriptors,
changing the configuration of the I/O scheduler or the I/O mode will
not help.

What might help - depending on how the target is implemented - is
using an I/O depth larger than one. ib_srp sends all SRP_CMDs with the
task attribute SIMPLE, so a target is allowed to process these
requests concurrently. For the ib_srpt target I see the following
results over a single QDR link and a NULLIO target (fio
--bs=$((1020*1024)) --ioengine=psync --buffered=0 --rw=read --thread
--numjobs=${threads} --group_reporting --gtod_reduce=1 --name=${dev}
--filename=${dev}):

I/O depth    Bandwidth (MB/s)
 1           1270
 2           2300
 4           2500
 8           2670
16           2700

That last result is close to the bandwidth reported by ib_rdma_bw.

Bart.
[parent not found: <AANLkTimFS=QkHd9+393mS1gQ5ZnL79jSDQaUZ8C_Xd2A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: srp sg_tablesize
  [not found]             ` <AANLkTimFS=QkHd9+393mS1gQ5ZnL79jSDQaUZ8C_Xd2A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2010-08-21 18:20               ` Bernd Schubert
  [not found]                 ` <201008212020.55028.bs_lists-ivAEE9vf7JuUmYeGgvxl9AC/G2K4zDHf@public.gmane.org>
  2010-08-21 20:38               ` David Dillow
  1 sibling, 1 reply; 12+ messages in thread

From: Bernd Schubert @ 2010-08-21 18:20 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: David Dillow, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Bernd Schubert

On Saturday, August 21, 2010, Bart Van Assche wrote:
> On Sat, Aug 21, 2010 at 6:27 PM, David Dillow
> <dillowda-1Heg1YXhbW8@public.gmane.org> wrote:
> > On Sat, 2010-08-21 at 13:14 +0200, Bart Van Assche wrote:
> > > On Fri, Aug 20, 2010 at 9:49 AM, Bernd Schubert
> > > <bs_lists-ivAEE9vf7JuUmYeGgvxl9AC/G2K4zDHf@public.gmane.org> wrote:
> > > > In ib_srp.c sg_tablesize is defined as 255. With that value we see
> > > > lots of IO requests of size 1020. As I already wrote on linux-scsi,
> > > > that is really sub-optimal for DDN storage, as lots of IO requests
> > > > of size 1020 come up.
> > > >
> > > > Now the question is if we can safely increase it. Is there somewhere
> > > > a definition what is the real hardware supported size? And shouldn't
> > > > we increase sg_tablesize, but also set the .dma_boundary value?
> > >
> > > (resending as plain text)
> > >
> > > The request size of 1020 indicates that there are less than 60 data
> > > buffer descriptors in the SRP_CMD request. So you are probably hitting
> > > another limit than srp_sg_tablesize.
> >
> > 4 KB * 255 descriptors = 1020 KB
> >
> > IIRC, we verified that we were seeing 255 entries in the S/G list with a
> > few printk()s, but it has been a few years.
> >
> > I'm not sure how you came up with 60 descriptors -- could you elaborate
> > please?
>
> The original message mentions "size 1020" but not the unit of that
> size. So I guessed that this referred to an SRP_CMD information unit
> of 1020 bytes.
> And in an SRP_CMD message of 1020 bytes there fit at
> most 59 descriptors ((1020-68)/16). Now that I see your computation,
> I'm afraid that my guess about the meaning of the original message was
> wrong. Looks like I have been delving too deep into the SRP protocol
> ...

Er, sorry, I really meant 1020K IOs. That is something that can easily be
monitored on DDN storage.

> > > Did this occur with buffered (asynchronous) or unbuffered (direct)
> > > I/O? And in the first case, which I/O scheduler did you use?
> >
> > I'm sure Bernd will speak for his situation, but we've seen it with both
> > buffered and unbuffered, with the deadline and noop schedulers (mostly
> > on vendor 2.6.18 kernels). CFQ never gave us larger than 512 KB
> > requests. Our main use is Lustre, which does unbuffered IO from the
> > kernel.
>
> If ib_srp is already sending SRP commands with 255 descriptors,
> changing the configuration of the I/O scheduler or the I/O mode will
> not help.
>
> What might help - depending on how the target is implemented - is
> using an I/O depth larger than one. ib_srp sends all SRP_CMDs with the

It depends on whether we enable the write-back cache or not. The older S2A
architecture does not mirror the cache at all, and therefore the write-back
cache is supposed to be disabled. The recent SFA architecture mirrors the
write-back cache, so it is supposed to be enabled. With the write-back
cache enabled, an 'improved' command processing is done (I don't know the
details myself). However, cache mirroring is an expensive operation if the
system can do 10GB/s, and IOs will only go into the cache if the size is
not a multiple of 1024K; 1MiB IOs are directly sent to the disks. And that
leaves us with SRP, where we see too many 1020K requests, which will need
to be processed by the write-back cache...

> task attribute SIMPLE, so a target is allowed to process these
> requests concurrently.
> For the ib_srpt target I see the following
> results over a single QDR link and a NULLIO target (fio
> --bs=$((1020*1024)) --ioengine=psync --buffered=0 --rw=read --thread
> --numjobs=${threads} --group_reporting --gtod_reduce=1 --name=${dev}
> --filename=${dev}):
>
> I/O depth    Bandwidth (MB/s)
>  1           1270
>  2           2300
>  4           2500
>  8           2670
> 16           2700
>
> That last result is close to the bandwidth reported by ib_rdma_bw.

How exactly do you do that? Is that something I could try with our storage
as well? I guess only with a special firmware version, which I also do not
have access to.

Thanks,
Bernd
--
Bernd Schubert
DataDirect Networks
[parent not found: <201008212020.55028.bs_lists-ivAEE9vf7JuUmYeGgvxl9AC/G2K4zDHf@public.gmane.org>]
* Re: srp sg_tablesize
  [not found]                 ` <201008212020.55028.bs_lists-ivAEE9vf7JuUmYeGgvxl9AC/G2K4zDHf@public.gmane.org>
@ 2010-08-21 20:50                   ` David Dillow
  2010-08-22  7:15                   ` Bart Van Assche
  1 sibling, 0 replies; 12+ messages in thread

From: David Dillow @ 2010-08-21 20:50 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Bart Van Assche, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Bernd Schubert

On Sat, 2010-08-21 at 20:20 +0200, Bernd Schubert wrote:
> On Saturday, August 21, 2010, Bart Van Assche wrote:
> > What might help - depending on how the target is implemented - is
> > using an I/O depth larger than one. ib_srp sends all SRP_CMDs with the
>
> It depends if we enable write-back cache or not. The older S2A
> architecture does not mirror the cache at all and therefore write-back
> cache is supposed to be disabled. The recent SFA architecture mirrors
> the write-back cache and so it is supposed to be enabled. With the
> write-back cache enabled, an 'improved' command processing is done (I
> don't know details myself). However, cache mirroring is an expensive
> operation if the system can do 10GB/s, and IOs will only go into the
> cache if the size is not a multiple of 1024K; 1MiB IOs are directly
> sent to the disks. And that leaves us with SRP, where we see too many
> 1020K requests, which will need to be processed by the write-back
> cache...

You have a few options here -- if it was a 1024 KB request broken into a
1020 KB and a 4 KB request, you can hold onto the 1020 KB request for a
fraction of a second to see if the next request completes it. The 4 KB
request will almost always be the next request for the LUN. If the next
request doesn't fill out the stripe, then do the effort for write
mirroring.
That can be extended as well -- perhaps start the mirror of the 1020 KB
request, but decide not to mirror the 4 KB request since it completes the
full stripe write; then you can just wait to complete the 4 KB write once
the full stripe write completes to disk, as you would if a 1 MB,
stripe-aligned request came in. And similarly if the next request fills a
stripe but spills into the next stripe -- mirror only the portion that's
needed and switch to waiting for the disk write if you can make a full
stripe on the next request.

> > task attribute SIMPLE, so a target is allowed to process these
> > requests concurrently. For the ib_srpt target I see the following
> > results over a single QDR link and a NULLIO target (fio
>
> How exactly do you do that? Is that something I could try with our
> storage as well? I guess only with a special firmware version, which I
> also do not have access to.

Bart is referring to keeping multiple requests in flight. On your client,
use a non-zero -qd to xdd, for example, or the --thread --numjobs=X for
fio that he showed. If you're not doing direct IO, then you have less
direct control over how the page cache will do its writeback, but I would
expect it to either try to have a decent queue depth, or the block/mm
developers may be interested in patches to get there if it makes sense for
a particular device.
--
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office
* Re: srp sg_tablesize
  [not found]                 ` <201008212020.55028.bs_lists-ivAEE9vf7JuUmYeGgvxl9AC/G2K4zDHf@public.gmane.org>
  2010-08-21 20:50                   ` David Dillow
@ 2010-08-22  7:15                   ` Bart Van Assche
  1 sibling, 0 replies; 12+ messages in thread

From: Bart Van Assche @ 2010-08-22  7:15 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: David Dillow, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Bernd Schubert

On Sat, Aug 21, 2010 at 8:20 PM, Bernd Schubert
<bs_lists-ivAEE9vf7JuUmYeGgvxl9AC/G2K4zDHf@public.gmane.org> wrote:
> On Saturday, August 21, 2010, Bart Van Assche wrote:
> > [ ... ]
> > task attribute SIMPLE, so a target is allowed to process these
> > requests concurrently. For the ib_srpt target I see the following
> > results over a single QDR link and a NULLIO target (fio
> > --bs=$((1020*1024)) --ioengine=psync --buffered=0 --rw=read --thread
> > --numjobs=${threads} --group_reporting --gtod_reduce=1 --name=${dev}
> > --filename=${dev}):
> >
> > I/O depth    Bandwidth (MB/s)
> >  1           1270
> >  2           2300
> >  4           2500
> >  8           2670
> > 16           2700
> >
> > That last result is close to the bandwidth reported by ib_rdma_bw.
>
> How exactly do you do that? Is that something I could try with our
> storage as well? I guess only with a special firmware version, which I
> also do not have access to.

For me it doesn't matter whether you use xdd or fio to repeat the above
test. In case you prefer fio, more information about it can be found here:
http://freshmeat.net/projects/fio/. It is safe to run the above fio
command since it doesn't modify any data. Even if you do not configure the
target for NULLIO but read real data instead, repeating the above test
will reveal whether the IB link is the bottleneck or the storage system
inside the target device.

Bart.
* Re: srp sg_tablesize
  [not found]             ` <AANLkTimFS=QkHd9+393mS1gQ5ZnL79jSDQaUZ8C_Xd2A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2010-08-21 18:20               ` Bernd Schubert
@ 2010-08-21 20:38               ` David Dillow
  1 sibling, 0 replies; 12+ messages in thread

From: David Dillow @ 2010-08-21 20:38 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Bernd Schubert, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Bernd Schubert

On Sat, 2010-08-21 at 19:28 +0200, Bart Van Assche wrote:
> On Sat, Aug 21, 2010 at 6:27 PM, David Dillow
> <dillowda-1Heg1YXhbW8@public.gmane.org> wrote:
> >
> > On Sat, 2010-08-21 at 13:14 +0200, Bart Van Assche wrote:
> > > The request size of 1020 indicates that there are less than 60 data
> > > buffer descriptors in the SRP_CMD request. So you are probably hitting
> > > another limit than srp_sg_tablesize.
> >
> > 4 KB * 255 descriptors = 1020 KB
> >
> > IIRC, we verified that we were seeing 255 entries in the S/G list with a
> > few printk()s, but it has been a few years.
> >
> > I'm not sure how you came up with 60 descriptors -- could you elaborate
> > please?
>
> The original message mentions "size 1020" but not the unit of that
> size. So I guessed that this referred to an SRP_CMD information unit
> of 1020 bytes. And in an SRP_CMD message of 1020 bytes there fit at
> most 59 descriptors ((1020-68)/16). Now that I see your computation,
> I'm afraid that my guess about the meaning of the original message was
> wrong. Looks like I have been delving too deep into the SRP protocol

Sorry, 1020 KB requests have been a perennial thorn in our side, so I'm
intimately familiar with that number. And your deep dives into SRP are
why I asked you to elaborate -- I may have missed something.

> If ib_srp is already sending SRP commands with 255 descriptors,
> changing the configuration of the I/O scheduler or the I/O mode will
> not help.

This is all on the initiator side, but with CFQ we weren't getting 255
descriptors; we only got 512 KB requests.
We'd see that with deadline as well sometimes, but not as much. It seemed to be something in the scheduler breaking up large requests but we never investigated it since noop was the recommended scheduler for DDN hardware. We've been re-evaluating that recently, though. dm-multipath is another source of size restrictions, since it defaults to a max_sectors_kb of 512 if the underlying devices have larger limits. So far the only way to fix that is to patch the kernel or run a systemtap script to fix it up. -- Dave Dillow National Center for Computational Science Oak Ridge National Laboratory (865) 241-6602 office -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 12+ messages in thread
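The arithmetic debated in this exchange, plus the 512 KB splitting described
above, can be checked with a few lines of Python. This is a sketch: the
constants come straight from the messages, and split_request is a toy model,
not the real block-layer code.

```python
# Dave's reading: 255 scatter/gather descriptors, each mapping one
# 4 KB page, give the 1020 KB request size seen on the wire.
page_kb = 4
sg_tablesize = 255
print(page_kb * sg_tablesize)  # 1020

# Bart's (mistaken) reading: a 1020-*byte* SRP_CMD information unit,
# minus a 68-byte fixed portion, holds at most floor(952/16) = 59
# 16-byte memory descriptors.
iu_bytes, fixed_bytes, desc_bytes = 1020, 68, 16
print((iu_bytes - fixed_bytes) // desc_bytes)  # 59

# Toy model of a size cap such as dm-multipath's max_sectors_kb default
# of 512, which breaks a 1020 KB request into smaller pieces.
def split_request(size_kb, cap_kb):
    chunks = []
    while size_kb > 0:
        chunks.append(min(size_kb, cap_kb))
        size_kb -= chunks[-1]
    return chunks

print(split_request(1020, 512))  # [512, 508]
```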
* Re: srp sg_tablesize
  [not found] ` <1282408043.20840.13.camel-1q1vX8mYZiGLUyTwlgNVppKKF0rrzTr+@public.gmane.org>
  2010-08-21 17:28 ` Bart Van Assche
@ 2010-08-21 18:04 ` Bernd Schubert
  1 sibling, 0 replies; 12+ messages in thread
From: Bernd Schubert @ 2010-08-21 18:04 UTC (permalink / raw)
To: David Dillow
Cc: Bart Van Assche, general-G2znmakfqn7U1rindQTSdQ,
    linux-rdma-u79uwXL29TY76Z2rM5mHXA, Bernd Schubert

On Saturday, August 21, 2010, David Dillow wrote:
> On Sat, 2010-08-21 at 13:14 +0200, Bart Van Assche wrote:
> > On Fri, Aug 20, 2010 at 9:49 AM, Bernd Schubert
> > <bs_lists-ivAEE9vf7JuUmYeGgvxl9AC/G2K4zDHf@public.gmane.org> wrote:
> > > In ib_srp.c sg_tablesize is defined as 255. With that value we see
> > > lots of IO requests of size 1020. As I already wrote on linux-scsi,
> > > that is really sub-optimal for DDN storage, as lots of IO requests
> > > of size 1020 come up.
> > >
> > > Now the question is if we can safely increase it. Is there
> > > somewhere a definition what is the real hardware supported size?
> > > And shouldn't we increase sg_tablesize, but also set the
> > > .dma_boundary value?
> >
> > (resending as plain text)
> >
> > The request size of 1020 indicates that there are less than 60 data
> > buffer descriptors in the SRP_CMD request. So you are probably
> > hitting another limit than srp_sg_tablesize.
>
> 4 KB * 255 descriptors = 1020 KB

We at least verified it indirectly. Lustre 1.8.4 will include a patch to
increase SG_ALL from 255 to 256 (not ideal, at least for older kernels,
as it will require an order-1 allocation instead of the previous
order-0). But including that patch into our release and then testing IO
sizes with QLogic FC definitely made the 1020K IO requests vanish.

> IIRC, we verified that we were seeing 255 entries in the S/G list with
> a few printk()s, but it has been a few years.

I probably should do that as well, just some time limitations.

> I'm not sure how you came up with 60 descriptors -- could you
> elaborate please?
>
> > Did this occur with buffered (asynchronous) or unbuffered (direct)
> > I/O? And in the first case, which I/O scheduler did you use?
>
> I'm sure Bernd will speak for his situation, but we've seen it with
> both buffered and unbuffered, with the deadline and noop schedulers
> (mostly on vendor 2.6.18 kernels). CFQ never gave us larger than
> 512 KB requests. Our main use is Lustre, which does unbuffered IO from
> the kernel.

I'm in the DDN Lustre group, so I mainly speak for Lustre as well. I
think Lustre's filterio is directio-like. It is not the classical kernel
direct-IO interface and provides a few buffers for writes, AFAIK. But it
is still almost direct-IO, and its filterio also immediately sends a
disk commit request.

We use the deadline scheduler by default. Differences to noop are small
for streaming writes, but for example mke2fs is 5 times faster with
deadline compared to noop.

Cheers,
Bernd
--
Bernd Schubert
DataDirect Networks
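The effect of the SG_ALL patch mentioned above can be sketched with the same
back-of-the-envelope arithmetic: assuming 4 KB pages as elsewhere in the
thread, 255 scatter/gather entries cap requests at the awkward 1020 KB, while
256 entries allow a clean 1 MB. (Actual request sizes also depend on the
block queue limits and the target, so this is only the upper bound.)

```python
page_kb = 4  # assuming 4 KB pages, as elsewhere in the thread

# Maximum request size for the old and the patched scatter/gather limit.
for sg_all in (255, 256):
    print(sg_all, "entries ->", sg_all * page_kb, "KB max request")
# 255 entries -> 1020 KB (the size DDN arrays handle poorly)
# 256 entries -> 1024 KB (a clean power of two)
```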
end of thread, other threads:[~2010-08-24 20:23 UTC | newest]
Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-08-20 7:49 srp sg_tablesize Bernd Schubert
[not found] ` <201008200949.54595.bs_lists-ivAEE9vf7JuUmYeGgvxl9AC/G2K4zDHf@public.gmane.org>
2010-08-20 14:15 ` David Dillow
[not found] ` <1282313740.7441.25.camel-FqX9LgGZnHWDB2HL1qBt2PIbXMQ5te18@public.gmane.org>
2010-08-24 19:47 ` Bernd Schubert
[not found] ` <201008242147.50692.bs_lists-ivAEE9vf7JuUmYeGgvxl9AC/G2K4zDHf@public.gmane.org>
2010-08-24 20:23 ` David Dillow
2010-08-21 11:14 ` Bart Van Assche
[not found] ` <AANLkTimMoyEpfYPFSLLqS9ZCg3VyyOQcd4i2zzCQjHMN-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-08-21 16:27 ` David Dillow
[not found] ` <1282408043.20840.13.camel-1q1vX8mYZiGLUyTwlgNVppKKF0rrzTr+@public.gmane.org>
2010-08-21 17:28 ` Bart Van Assche
[not found] ` <AANLkTimFS=QkHd9+393mS1gQ5ZnL79jSDQaUZ8C_Xd2A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-08-21 18:20 ` Bernd Schubert
[not found] ` <201008212020.55028.bs_lists-ivAEE9vf7JuUmYeGgvxl9AC/G2K4zDHf@public.gmane.org>
2010-08-21 20:50 ` David Dillow
2010-08-22 7:15 ` Bart Van Assche
2010-08-21 20:38 ` David Dillow
2010-08-21 18:04 ` Bernd Schubert
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox