linux-block.vger.kernel.org archive mirror
* Re: A question regarding "multiple SGL"
       [not found] <20161027005230.9904DC00097@webmail.sinamail.sina.com.cn>
@ 2016-10-27  6:41 ` Christoph Hellwig
  2016-10-27  6:57   ` Qiuxin (robert)
  0 siblings, 1 reply; 5+ messages in thread
From: Christoph Hellwig @ 2016-10-27  6:41 UTC (permalink / raw)
  To: 鑫愿
  Cc: Bart Van Assche, Jens Axboe, linux-block@vger.kernel.org,
	James Bottomley, Martin K. Petersen, Mike Snitzer,
	linux-rdma@vger.kernel.org, Ming Lei,
	linux-nvme@lists.infradead.org, Keith Busch, Doug Ledford,
	linux-scsi@vger.kernel.org, Laurence Oberman, Christoph Hellwig,
	tiger.zhao, qiuxin

Hi Robert,

There is no feature called "Multiple SGL in one NVMe capsule".  The
NVMe over Fabrics specification allows a controller to advertise how
many SGL descriptors it supports using the MSDBD Identify field:

"Maximum SGL Data Block Descriptors (MSDBD): This field indicates the
maximum number of (Keyed) SGL Data Block descriptors that a host is allowed to
place in a capsule. A value of 0h indicates no limit."

Setting this value to 1 is perfectly valid.  Similarly, a host is free
to choose any number of SGL descriptors between 0 (only for commands
that don't transfer data) and the limit imposed by the controller
through the MSDBD field.
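
For illustration, here is a minimal sketch of how a host could derive
its descriptor budget from the Identify Controller data.  The helper
name is made up; the in-tree host drivers simply always use one
descriptor:

#include <linux/kernel.h>
#include <linux/nvme.h>

/* Hypothetical helper, not in-tree code: how many (Keyed) SGL Data
 * Block descriptors we may place in a single capsule. */
static unsigned int nvmf_max_sgl_descriptors(struct nvme_id_ctrl *id)
{
	if (id->msdbd == 0)
		return UINT_MAX;	/* 0h: the controller imposes no limit */
	return id->msdbd;		/* Linux hosts today always use 1 */
}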

There are no plans to support an MSDBD value larger than 1 in the
Linux NVMe target, and there are no plans to ever submit commands with
multiple SGLs from the host driver either.

Cheers,
	Christoph


* Re: A question regarding "multiple SGL"
  2016-10-27  6:41 ` A question regarding "multiple SGL" Christoph Hellwig
@ 2016-10-27  6:57   ` Qiuxin (robert)
  2016-10-27  7:10     ` Christoph Hellwig
  0 siblings, 1 reply; 5+ messages in thread
From: Qiuxin (robert) @ 2016-10-27  6:57 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Bart Van Assche, Jens Axboe, linux-block@vger.kernel.org,
	James Bottomley, Martin K. Petersen, Mike Snitzer,
	linux-rdma@vger.kernel.org, Ming Lei,
	linux-nvme@lists.infradead.org, Keith Busch, Doug Ledford,
	linux-scsi@vger.kernel.org, Laurence Oberman, Tiger zhao

Hi Christoph,

Thanks, got it.

Could you please do me a favor and let me know the background of why we
ONLY support "MSDBD == 1"?  I am NOT trying to resist or oppose
anything, I just want to know the reason.  You know, it is a little
weird for me, as "MSDBD == 1" does not fulfill all the use cases
depicted in the spec.

Best,
Robert Qiuxin


* Re: A question regarding "multiple SGL"
  2016-10-27  6:57   ` Qiuxin (robert)
@ 2016-10-27  7:10     ` Christoph Hellwig
  2016-10-27  9:02       ` Sagi Grimberg
  0 siblings, 1 reply; 5+ messages in thread
From: Christoph Hellwig @ 2016-10-27  7:10 UTC (permalink / raw)
  To: Qiuxin (robert)
  Cc: Bart Van Assche, Jens Axboe, linux-block@vger.kernel.org,
	James Bottomley, Martin K. Petersen, Mike Snitzer,
	linux-rdma@vger.kernel.org, Ming Lei,
	linux-nvme@lists.infradead.org, Keith Busch, Doug Ledford,
	linux-scsi@vger.kernel.org, Laurence Oberman, Tiger zhao

Hi Robert,

please explain the use case that isn't handled.  The one and only
reason to set MSDBD to 1 is to make the code a lot simpler, given that
there is no real use case for supporting more.

RDMA uses memory registrations to register large and possibly
discontiguous data regions under a single rkey, aka a single SGL
descriptor in NVMe terms.  There would be two reasons to support
multiple SGL descriptors: a) to support a larger I/O size than a
single MR can cover, or b) to support a data region format not
mappable by a single MR.
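
As a rough sketch of what that means in verbs terms (error handling
elided; names and flow are illustrative, not lifted from nvme-rdma),
a discontiguous SG list collapses into one rkey, i.e. one keyed SGL
descriptor:

#include <linux/scatterlist.h>
#include <rdma/ib_verbs.h>

/* Sketch: register one MR covering a whole (possibly discontiguous)
 * SG list, so a single rkey describes the entire data region. */
static int map_io_to_single_rkey(struct ib_qp *qp, struct ib_mr *mr,
				 struct scatterlist *sg, int nents)
{
	struct ib_reg_wr reg_wr = { };
	struct ib_send_wr *bad_wr;
	int n;

	n = ib_map_mr_sg(mr, sg, nents, NULL, PAGE_SIZE);
	if (n < nents)
		return -EINVAL;	/* layout not mappable by a single MR */

	reg_wr.wr.opcode = IB_WR_REG_MR;
	reg_wr.mr = mr;
	reg_wr.key = mr->rkey;
	reg_wr.access = IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_READ |
			IB_ACCESS_REMOTE_WRITE;

	/* mr->rkey now stands for the whole region: a single (Keyed)
	 * SGL Data Block descriptor in the NVMe capsule. */
	return ib_post_send(qp, &reg_wr.wr, &bad_wr);
}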

iSER only supports a single rkey (or stag in IETF terminology) and has
been doing fine on a) and mostly fine on b).  There are a few possible
data layouts not supported by the traditional IB/iWarp FR WRs, but the
limit is in fact exactly the same as that imposed by the NVMe PRPs used
for PCIe NVMe devices, so the Linux block layer has support for not
generating them (see the sketch below).  Also, with modern Mellanox
IB/RoCE hardware we can actually register completely arbitrary SGLs.
iSER already supports this registration mode with a trivial code
addition, but for NVMe we haven't had a pressing need yet.
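
For reference, the block layer support mentioned above is the queue's
virt_boundary mask, which kernels of this era set roughly as follows
(sketch only; the mask value depends on the device):

#include <linux/blkdev.h>
#include <linux/sizes.h>

/* Sketch: with this mask set, bios are split so that no segment has
 * a gap at a non-4k boundary inside a request -- the same constraint
 * NVMe PRP lists impose on PCIe devices. */
static void set_prp_like_limits(struct request_queue *q)
{
	blk_queue_virt_boundary(q, SZ_4K - 1);
}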


* Re: A question regarding "multiple SGL"
  2016-10-27  7:10     ` Christoph Hellwig
@ 2016-10-27  9:02       ` Sagi Grimberg
  2016-10-27 14:50         ` Steve Wise
  0 siblings, 1 reply; 5+ messages in thread
From: Sagi Grimberg @ 2016-10-27  9:02 UTC (permalink / raw)
  To: Christoph Hellwig, Qiuxin (robert)
  Cc: linux-block@vger.kernel.org, James Bottomley, Martin K. Petersen,
	Mike Snitzer, linux-rdma@vger.kernel.org, Ming Lei, Tiger zhao,
	linux-nvme@lists.infradead.org, Jens Axboe, Doug Ledford,
	Laurence Oberman, linux-scsi@vger.kernel.org, Bart Van Assche,
	Keith Busch


> Hi Robert,

Hey Robert, Christoph,

> please explain the use case that isn't handled.  The one and only
> reason to set MSDBD to 1 is to make the code a lot simpler given that
> there is no real use case for supporting more.
>
> RDMA uses memory registrations to register large and possibly
> discontiguous data regions for a single rkey, aka single SGL descriptor
> in NVMe terms.  There would be two reasons to support multiple SGL
> descriptors:  a) to support a larger I/O size than supported by a single
> MR, or b) to support a data region format not mappable by a single
> MR.
>
> iSER only supports a single rkey (or stag in IETF terminology) and has
> been doing fine on a) and mostly fine on b).   There are a few possible
> data layouts not supported by the traditional IB/iWarp FR WRs, but the
> limit is in fact exactly the same as imposed by the NVMe PRPs used for
> PCIe NVMe devices, so the Linux block layer has support to not generate
> them.  Also with modern Mellanox IB/RoCE hardware we can actually
> register completely arbitrary SGLs.  iSER supports using this registration
> mode already with a trivial code addition, but for NVMe we didn't have a
> pressing need yet.

Good explanation :)

The IO transfer size is a bit more pressing on some devices (e.g.
cxgb3/4), where the per-MR page limit can indeed cap an MR below a
reasonable transfer size (Steve can correct me if I'm wrong).

However, if there is real demand for this, we'll happily accept
patches :)

Just a note: having this feature in place can bring unexpected behavior
depending on how we implement it:
- If we can use multiple MRs per IO (for multiple SGLs), we can either
prepare for the worst case and allocate enough MRs to satisfy the
various IO patterns. This will be much heavier in terms of resource
allocation and can limit the scalability of the host driver.
- Or we can implement a shared MR pool with a reasonable number of MRs
(a sketch of this option follows below). In this case each IO can
consume one or more MRs at the expense of other IOs, and we may need
to requeue the IO later, when enough MRs become available to satisfy
it. This can yield some unexpected performance gaps for some workloads.
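
A minimal sketch of that second option using the in-kernel ib_mr_pool
helpers; alloc_mrs_for_io() is an invented name and the actual requeue
is left to the caller:

#include <rdma/ib_verbs.h>
#include <rdma/mr_pool.h>

static int alloc_mrs_for_io(struct ib_qp *qp, struct ib_mr **mrs,
			    int nr_needed)
{
	int i;

	for (i = 0; i < nr_needed; i++) {
		mrs[i] = ib_mr_pool_get(qp, &qp->rdma_mrs);
		if (!mrs[i])
			goto out_put;	/* pool drained by other IOs */
	}
	return 0;

out_put:
	/* Return what we took; the caller requeues the IO until
	 * completions give enough MRs back to the pool. */
	while (--i >= 0)
		ib_mr_pool_put(qp, &qp->rdma_mrs, mrs[i]);
	return -EAGAIN;
}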

Cheers,
Sagi.


* RE: A question regarding "multiple SGL"
  2016-10-27  9:02       ` Sagi Grimberg
@ 2016-10-27 14:50         ` Steve Wise
  0 siblings, 0 replies; 5+ messages in thread
From: Steve Wise @ 2016-10-27 14:50 UTC (permalink / raw)
  To: 'Sagi Grimberg', 'Christoph Hellwig',
	'Qiuxin (robert)'
  Cc: linux-block, 'James Bottomley',
	'Martin K. Petersen', 'Mike Snitzer', linux-rdma,
	'Ming Lei', 'Tiger zhao', linux-nvme,
	'Jens Axboe', 'Doug Ledford',
	'Laurence Oberman', linux-scsi, 'Bart Van Assche',
	'Keith Busch'

> > Hi Robert,
> 
> Hey Robert, Christoph,
> 
> > please explain the use case that isn't handled.  The one and only
> > reason to set MSDBD to 1 is to make the code a lot simpler given that
> > there is no real use case for supporting more.
> >
> > RDMA uses memory registrations to register large and possibly
> > discontiguous data regions for a single rkey, aka single SGL descriptor
> > in NVMe terms.  There would be two reasons to support multiple SGL
> > descriptors:  a) to support a larger I/O size than supported by a single
> > MR, or b) to support a data region format not mappable by a single
> > MR.
> >
> > iSER only supports a single rkey (or stag in IETF terminology) and has
> > been doing fine on a) and mostly fine on b).   There are a few possible
> > data layouts not supported by the traditional IB/iWarp FR WRs, but the
> > limit is in fact exactly the same as imposed by the NVMe PRPs used for
> > PCIe NVMe devices, so the Linux block layer has support to not generate
> > them.  Also with modern Mellanox IB/RoCE hardware we can actually
> > register completely arbitrary SGLs.  iSER supports using this registration
> > mode already with a trivial code addition, but for NVMe we didn't have a
> > pressing need yet.
> 
> Good explanation :)
> 
> The IO transfer size is a bit more pressing on some devices (e.g.
> cxgb3/4), where the per-MR page limit can indeed cap an MR below a
> reasonable transfer size (Steve can correct me if I'm wrong).
>

Currently, cxgb4 supports 128KB REG_MR operations on a host with a 4K
page size, via a max MR page list depth of 32 (32 pages * 4KB = 128KB).
Soon that will be bumped up from 32 to 128, and life will be better...

> However, if there is a real demand for this we'll happily accept
> patches :)
> 
> Just a note, having this feature in-place can bring unexpected behavior
> depending on how we implement it:
> - If we can use multiple MRs per IO (for multiple SGLs) we can either
> prepare for the worst-case and allocate enough MRs to satisfy the
> various IO patterns. This will be much heavier in terms of resource
> allocation and can limit the scalability of the host driver.
> - Or we can implement a shared MR pool with a reasonable number of MRs.
> In this case each IO can consume one or more MRs at the expense of
> other IOs. In this case we may need to requeue the IO later when we
> have enough available MRs to satisfy the IO. This can yield some
> unexpected performance gaps for some workloads.
> 

I would like to see the storage protocols deal with a lack of resources
rather than provision for the worst case.  This allows much smaller
usage of both MR and SQ resources, at the expense of adding flow-control
logic to handle the lack of an available MR and/or SQ slot for the next
IO.  I think it can be implemented efficiently: when in flow-control
mode, the code drives new IO submissions off of SQ completions, which
free up SQ slots and most likely MRs from the QP's MR pool.  A rough
sketch of the accounting follows below.
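
A self-contained sketch of that accounting, with invented fc_* names
throughout, just to pin the idea down:

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/types.h>

/* All names here are made up; this shows the idea, not driver code. */
struct fc_queue {
	spinlock_t		lock;
	unsigned int		sq_slots_free;	/* free send queue slots */
	unsigned int		mrs_free;	/* MRs left in the QP's pool */
	struct list_head	parked;		/* IOs waiting for resources */
};

/* Submission path: charge resources up front, or tell the caller to
 * park the IO on fc_queue.parked instead of failing it. */
static bool fc_try_charge(struct fc_queue *q, unsigned int mrs_needed)
{
	bool ok = false;

	spin_lock(&q->lock);
	if (q->sq_slots_free && q->mrs_free >= mrs_needed) {
		q->sq_slots_free--;
		q->mrs_free -= mrs_needed;
		ok = true;
	}
	spin_unlock(&q->lock);
	return ok;
}

/* Send completion path: resources came back, so each completion can
 * drive the submission of the next parked IO. */
static void fc_uncharge(struct fc_queue *q, unsigned int mrs_freed)
{
	spin_lock(&q->lock);
	q->sq_slots_free++;
	q->mrs_free += mrs_freed;
	/* resubmission of list_first_entry(&q->parked, ...) elided */
	spin_unlock(&q->lock);
}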

Steve.




end of thread

Thread overview: 5+ messages
     [not found] <20161027005230.9904DC00097@webmail.sinamail.sina.com.cn>
2016-10-27  6:41 ` A question regarding "multiple SGL" Christoph Hellwig
2016-10-27  6:57   ` Qiuxin (robert)
2016-10-27  7:10     ` Christoph Hellwig
2016-10-27  9:02       ` Sagi Grimberg
2016-10-27 14:50         ` Steve Wise
