public inbox for linux-rdma@vger.kernel.org
* Re: A question regarding "multiple SGL"
From: Christoph Hellwig @ 2016-10-27  6:41 UTC (permalink / raw)
  To: 鑫愿
  Cc: Bart Van Assche, Jens Axboe, linux-block@vger.kernel.org,
	James Bottomley, Martin K. Petersen, Mike Snitzer,
	linux-rdma@vger.kernel.org, Ming Lei,
	linux-nvme@lists.infradead.org, Keith Busch, Doug Ledford,
	linux-scsi@vger.kernel.org, Laurence Oberman, Christoph Hellwig,
	tiger.zhao, qiuxin

Hi Robert,

There is no feature called "Multiple SGL in one NVMe capsule".  The
NVMe over Fabrics specification allows a controller to advertise how
many SGL descriptors it supports using the MSDBD Identify field:

"Maximum SGL Data Block Descriptors (MSDBD): This field indicates the
maximum number of (Keyed) SGL Data Block descriptors that a host is allowed to
place in a capsule. A value of 0h indicates no limit."

Setting this value to 1 is perfectly valid.  Similarly, a host is free
to choose any number of SGL descriptors from 0 (only for commands that
don't transfer data) up to the limit imposed by the controller using the
MSDBD field.
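
As a rough illustration of the rule quoted above, the acceptance check
could be sketched as follows.  The function name `sgl_count_allowed` and
its parameters are hypothetical and are not taken from the Linux NVMe
code; only the MSDBD semantics (0h means no limit, 0 descriptors only for
commands without data) come from the text above.

```python
def sgl_count_allowed(num_descriptors, msdbd, transfers_data):
    """Return True if a capsule carrying num_descriptors SGL data block
    descriptors respects the controller's advertised MSDBD value.

    Hypothetical helper for illustration only.
    """
    if num_descriptors < 0:
        return False
    if num_descriptors == 0:
        # Zero descriptors is only valid for commands that move no data.
        return not transfers_data
    if msdbd == 0:
        # A value of 0h indicates no limit.
        return True
    return num_descriptors <= msdbd


# With MSDBD == 1, a data command may carry exactly one descriptor.
print(sgl_count_allowed(1, 1, True))
print(sgl_count_allowed(2, 1, True))
```

This sketch does not reject a descriptor attached to a command that
transfers no data; whether that should be an error is left open here.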

There are no plans to support an MSDBD value larger than 1 in the Linux
NVMe target, and there are no plans to ever submit commands with multiple
SGLs from the host driver either.

Cheers,
	Christoph
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: A question regarding "multiple SGL"
From: Qiuxin (robert) @ 2016-10-27  6:57 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Bart Van Assche, Jens Axboe, linux-block@vger.kernel.org,
	James Bottomley, Martin K. Petersen, Mike Snitzer,
	linux-rdma@vger.kernel.org, Ming Lei,
	linux-nvme@lists.infradead.org, Keith Busch, Doug Ledford,
	linux-scsi@vger.kernel.org, Laurence Oberman, Tiger zhao

Hi Christoph,

Thanks, got it.

Could you please do me a favor and explain the background for why we ONLY
support "MSDBD == 1"?  I am NOT trying to resist or oppose anything; I just
want to know the reason.  It seems a little weird to me, as "MSDBD == 1"
does not cover all the use cases described in the spec.

Best,
Robert Qiuxin

* Re: A question regarding "multiple SGL"
From: Christoph Hellwig @ 2016-10-27  7:10 UTC (permalink / raw)
  To: Qiuxin (robert)
  Cc: Bart Van Assche, Jens Axboe, linux-block@vger.kernel.org,
	James Bottomley, Martin K. Petersen, Mike Snitzer,
	linux-rdma@vger.kernel.org, Ming Lei,
	linux-nvme@lists.infradead.org, Keith Busch, Doug Ledford,
	linux-scsi@vger.kernel.org, Laurence Oberman, Tiger zhao

Hi Robert,

Please explain the use cases that aren't handled.  The one and only
reason to set MSDBD to 1 is to make the code a lot simpler given that
there is no real use case for supporting more.

RDMA uses memory registrations to register large and possibly
discontiguous data regions for a single rkey, aka single SGL descriptor
in NVMe terms.  There would be two reasons to support multiple SGL
descriptors:  a) to support a larger I/O size than supported by a single
MR, or b) to support a data region format not mappable by a single
MR.

iSER only supports a single rkey (or stag in IETF terminology) and has
been doing fine on a) and mostly fine on b).   There are a few possible
data layouts not supported by the traditional IB/iWarp FR WRs, but the
limit is in fact exactly the same as imposed by the NVMe PRPs used for
PCIe NVMe devices, so the Linux block layer has support to not generate
them.  Also with modern Mellanox IB/RoCE hardware we can actually
register completely arbitrary SGLs.  iSER supports using this registration
mode already with a trivial code addition, but for NVMe we didn't have a
pressing need yet.
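
To make the layout restriction mentioned above more concrete, here is a
small sketch of the usual page-list rule: every segment except the first
must start on a page boundary, and every segment except the last must end
on one.  The function name `mappable_by_page_list`, the (address, length)
tuple representation and the 4K page size are assumptions made for this
illustration, not code from the block layer or any RDMA driver.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch


def mappable_by_page_list(segments, page_size=PAGE_SIZE):
    """Check whether (address, length) segments have no gaps that a
    page-list style mapping (PRP list or a classic FR page list)
    cannot describe.

    Hypothetical helper for illustration only.
    """
    last = len(segments) - 1
    for i, (addr, length) in enumerate(segments):
        if length <= 0:
            return False
        starts_aligned = addr % page_size == 0
        ends_aligned = (addr + length) % page_size == 0
        # Only the first segment may start in the middle of a page.
        if i > 0 and not starts_aligned:
            return False
        # Only the last segment may end in the middle of a page.
        if i < last and not ends_aligned:
            return False
    return True


ok = [(0x1200, 0xE00), (0x5000, 0x1000), (0x9000, 0x200)]
gap = [(0x1000, 0x800), (0x3000, 0x1000)]
print(mappable_by_page_list(ok), mappable_by_page_list(gap))
```

A layout like `gap` is the kind that would need either a second
descriptor or an arbitrary-SGL registration mode.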


* Re: A question regarding "multiple SGL"
From: Sagi Grimberg @ 2016-10-27  9:02 UTC (permalink / raw)
  To: Christoph Hellwig, Qiuxin (robert)
  Cc: Bart Van Assche, Jens Axboe, linux-block@vger.kernel.org,
	James Bottomley, Martin K. Petersen, Mike Snitzer,
	linux-rdma@vger.kernel.org, Ming Lei,
	linux-nvme@lists.infradead.org, Keith Busch, Doug Ledford,
	linux-scsi@vger.kernel.org, Laurence Oberman, Tiger zhao


> Hi Robert,

Hey Robert, Christoph,

> Please explain the use cases that aren't handled.  The one and only
> reason to set MSDBD to 1 is to make the code a lot simpler given that
> there is no real use case for supporting more.
>
> RDMA uses memory registrations to register large and possibly
> discontiguous data regions for a single rkey, aka single SGL descriptor
> in NVMe terms.  There would be two reasons to support multiple SGL
> descriptors:  a) to support a larger I/O size than supported by a single
> MR, or b) to support a data region format not mappable by a single
> MR.
>
> iSER only supports a single rkey (or stag in IETF terminology) and has
> been doing fine on a) and mostly fine on b).   There are a few possible
> data layouts not supported by the traditional IB/iWarp FR WRs, but the
> limit is in fact exactly the same as imposed by the NVMe PRPs used for
> PCIe NVMe devices, so the Linux block layer has support to not generate
> them.  Also with modern Mellanox IB/RoCE hardware we can actually
> register completely arbitrary SGLs.  iSER supports using this registration
> mode already with a trivial code addition, but for NVMe we didn't have a
> pressing need yet.

Good explanation :)

The IO transfer size is a bit more pressing on some devices (e.g.
cxgb3/4) where the number of pages per MR can indeed be lower than
a reasonable transfer size (Steve can correct me if I'm wrong).

However, if there is a real demand for this we'll happily accept
patches :)

Just a note: having this feature in place can bring unexpected behavior,
depending on how we implement it:
- If we can use multiple MRs per IO (for multiple SGLs), we can either
prepare for the worst case and allocate enough MRs to satisfy the
various IO patterns. This will be much heavier in terms of resource
allocation and can limit the scalability of the host driver.
- Or we can implement a shared MR pool with a reasonable number of MRs.
In this case each IO can consume one or more MRs at the expense of
other IOs, and we may need to requeue an IO until enough MRs are
available to satisfy it. This can yield some unexpected performance
gaps for some workloads.
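
The second option above can be sketched roughly like this.  `MRPool`,
`try_acquire` and `release` are hypothetical names chosen for this
illustration and do not correspond to the kernel's MR pool API; MRs are
modeled as plain integers.

```python
class MRPool:
    """Toy model of a shared pool of memory regions (MRs).

    Hypothetical class for illustration only.
    """

    def __init__(self, size):
        self._free = list(range(size))

    def try_acquire(self, count):
        """Take `count` MRs, or return None so the caller can requeue."""
        if count > len(self._free):
            return None
        taken = self._free[:count]
        del self._free[:count]
        return taken

    def release(self, mrs):
        """Give MRs back once the IO that used them has completed."""
        self._free.extend(mrs)


pool = MRPool(4)
first = pool.try_acquire(3)      # a large IO takes three MRs
second = pool.try_acquire(2)     # not enough left: this IO must be requeued
pool.release(first)
retry = pool.try_acquire(2)      # succeeds after the first IO completes
print(first, second, retry)
```

The requeue in the middle is where the performance gap mentioned above
would come from: one multi-MR IO can stall others until it completes.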

Cheers,
Sagi.

* RE: A question regarding "multiple SGL"
From: Steve Wise @ 2016-10-27 14:50 UTC (permalink / raw)
  To: 'Sagi Grimberg', 'Christoph Hellwig',
	'Qiuxin (robert)'
  Cc: linux-block@vger.kernel.org, 'James Bottomley',
	'Martin K. Petersen', 'Mike Snitzer',
	linux-rdma@vger.kernel.org, 'Ming Lei',
	'Tiger zhao', linux-nvme@lists.infradead.org,
	'Jens Axboe', 'Doug Ledford',
	'Laurence Oberman', linux-scsi@vger.kernel.org,
	'Bart Van Assche', 'Keith Busch'

> The IO transfer size is a bit more pressing on some devices (e.g.
> cxgb3/4) where the number of pages per MR can indeed be lower than
> a reasonable transfer size (Steve can correct me if I'm wrong).
>

Currently, cxgb4 supports 128KB REG_MR operations on a host with a 4K page
size, via a max MR page list depth of 32.  Soon it will be bumped up from 32
to 128 and life will be better...
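
The arithmetic behind these numbers can be sketched as follows.  The
helper names `max_mr_bytes` and `mrs_needed` are hypothetical, and the
sketch assumes page-aligned buffers; an unaligned buffer can touch one
extra page.

```python
def max_mr_bytes(page_list_depth, page_size=4096):
    """Largest page-aligned region a single MR can cover."""
    return page_list_depth * page_size


def mrs_needed(io_bytes, page_list_depth, page_size=4096):
    """MRs (and hence SGL descriptors) needed for a page-aligned IO."""
    per_mr = max_mr_bytes(page_list_depth, page_size)
    return -(-io_bytes // per_mr)  # ceiling division


print(max_mr_bytes(32) // 1024)        # KiB per MR with depth 32
print(max_mr_bytes(128) // 1024)       # KiB per MR with depth 128
print(mrs_needed(1024 * 1024, 32))     # MRs for a 1 MiB IO at depth 32
```

With a depth of 32 and 4K pages a single MR covers 128KB, so a 1MB IO
would need 8 MRs; at a depth of 128 one MR covers 512KB and the same IO
would need 2.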

 
> However, if there is a real demand for this we'll happily accept
> patches :)
> 
> Just a note: having this feature in place can bring unexpected behavior,
> depending on how we implement it:
> - If we can use multiple MRs per IO (for multiple SGLs), we can either
> prepare for the worst case and allocate enough MRs to satisfy the
> various IO patterns. This will be much heavier in terms of resource
> allocation and can limit the scalability of the host driver.
> - Or we can implement a shared MR pool with a reasonable number of MRs.
> In this case each IO can consume one or more MRs at the expense of
> other IOs, and we may need to requeue an IO until enough MRs are
> available to satisfy it. This can yield some unexpected performance
> gaps for some workloads.
> 

I would like to see the storage protocols deal with a lack of resources in
the worst case.  This allows much smaller resource usage for both MRs and SQ
resources, at the expense of adding flow control logic to deal with a lack of
available MRs and/or SQ slots to process the next IO.  I think it can be
implemented efficiently such that, when in flow-control mode, the code drives
new IO submissions off of SQ completions, which will free up SQ slots and most
likely MRs from the QP's MR pool.
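
One way to picture the completion-driven flow control described above is
the toy model below.  `FlowControlledQueue` and its methods are
hypothetical names for this sketch; it tracks only SQ slots, and MRs
could be accounted for in the same way.

```python
from collections import deque


class FlowControlledQueue:
    """Toy send queue that defers IOs when no SQ slot is free.

    Hypothetical class for illustration only.
    """

    def __init__(self, sq_slots):
        self.free_slots = sq_slots
        self.pending = deque()
        self.in_flight = []

    def submit(self, io):
        """Post the IO if possible; otherwise park it. Returns True if posted."""
        # Queue behind anything already waiting to preserve ordering.
        if self.pending or self.free_slots == 0:
            self.pending.append(io)
            return False
        self._post(io)
        return True

    def _post(self, io):
        self.free_slots -= 1
        self.in_flight.append(io)

    def on_completion(self, io):
        """A completion frees a slot and drives the next pending submissions."""
        self.in_flight.remove(io)
        self.free_slots += 1
        while self.pending and self.free_slots > 0:
            self._post(self.pending.popleft())


q = FlowControlledQueue(sq_slots=2)
q.submit("io-a")
q.submit("io-b")
q.submit("io-c")          # parked: both slots are in use
q.on_completion("io-a")   # the completion posts io-c
print(q.in_flight)
```

In this model no resources are reserved per IO up front; new work is
posted only as completions return slots.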

Steve.

