linux-nvme.lists.infradead.org archive mirror
* [PATCH] nvme: uring_cmd specific request_queue for SGLs
@ 2025-06-24 21:14 Keith Busch
  2025-06-25  6:09 ` Christoph Hellwig
  0 siblings, 1 reply; 12+ messages in thread
From: Keith Busch @ 2025-06-24 21:14 UTC (permalink / raw)
  To: linux-nvme, hch; +Cc: Keith Busch

From: Keith Busch <kbusch@kernel.org>

User space passthrough IO commands are committed to using the SGL
transfer types if the device supports it. The virt_boundary_mask is a
PRP specific constraint, and this limit causes kernel bounce buffers to
be used when a user vector could have been handled directly. Avoiding
unnecessary copies is important for uring_cmd usage as this is a high
performance interface.

For devices that support SGL, create a new request_queue that drops the
virt_boundary_mask so that vectored user requests can be used with
zero-copy performance. Normal read/write will still use the old boundary
mask, as we can't be sure if forcing all IO to use SGL over PRP won't
cause unexpected regressions for some devices.

Signed-off-by: Keith Busch <kbusch@kernel.org>
---
 drivers/nvme/host/core.c  | 28 +++++++++++++++++++++++++++-
 drivers/nvme/host/ioctl.c |  2 +-
 drivers/nvme/host/nvme.h  |  1 +
 3 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 3da5ac71a9b07..e4e03cb9e5c0e 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -721,7 +721,7 @@ void nvme_init_request(struct request *req, struct nvme_command *cmd)
 	bool logging_enabled;
 
 	if (req->q->queuedata) {
-		struct nvme_ns *ns = req->q->disk->private_data;
+		struct nvme_ns *ns = req->q->queuedata;
 
 		logging_enabled = ns->head->passthru_err_log_enabled;
 		req->timeout = NVME_IO_TIMEOUT;
@@ -4081,6 +4081,27 @@ static void nvme_ns_add_to_ctrl_list(struct nvme_ns *ns)
 	list_add(&ns->list, &ns->ctrl->namespaces);
 }
 
+static void nvme_init_uring_queue(struct nvme_ns *ns)
+{
+	struct nvme_ctrl *ctrl = ns->ctrl;
+	struct queue_limits lim = {};
+	struct request_queue *q;
+
+	if (!nvme_ctrl_sgl_supported(ctrl)) {
+		ns->uring_queue = ns->queue;
+		return;
+	}
+
+	nvme_set_ctrl_limits(ctrl, &lim);
+	lim.virt_boundary_mask = 0;
+
+	q = blk_mq_alloc_queue(ctrl->tagset, &lim, ns);
+	if (IS_ERR(q))
+		ns->uring_queue = ns->queue;
+	else
+		ns->uring_queue = q;
+}
+
 static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
 {
 	struct queue_limits lim = { };
@@ -4157,6 +4178,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, struct nvme_ns_info *info)
 
 	if (!nvme_ns_head_multipath(ns->head))
 		nvme_add_ns_cdev(ns);
+	nvme_init_uring_queue(ns);
 
 	nvme_mpath_add_disk(ns, info->anagrpid);
 	nvme_fault_inject_init(&ns->fault_inject, ns->disk->disk_name);
@@ -4224,6 +4246,10 @@ static void nvme_ns_remove(struct nvme_ns *ns)
 
 	if (!nvme_ns_head_multipath(ns->head))
 		nvme_cdev_del(&ns->cdev, &ns->cdev_device);
+	if (ns->uring_queue != ns->queue) {
+		blk_mq_destroy_queue(ns->uring_queue);
+		blk_put_queue(ns->uring_queue);
+	}
 
 	nvme_mpath_remove_sysfs_link(ns);
 
diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index 6b3ac8ae3f34b..f925a10391001 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -445,7 +445,7 @@ static int nvme_uring_cmd_io(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
 {
 	struct nvme_uring_cmd_pdu *pdu = nvme_uring_cmd_pdu(ioucmd);
 	const struct nvme_uring_cmd *cmd = io_uring_sqe_cmd(ioucmd->sqe);
-	struct request_queue *q = ns ? ns->queue : ctrl->admin_q;
+	struct request_queue *q = ns ? ns->uring_queue : ctrl->admin_q;
 	struct nvme_uring_data d;
 	struct nvme_command c;
 	struct iov_iter iter;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 7df2ea21851f5..d371940bd342d 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -548,6 +548,7 @@ struct nvme_ns {
 
 	struct cdev		cdev;
 	struct device		cdev_device;
+	struct request_queue	*uring_queue;
 
 	struct nvme_fault_inject fault_inject;
 };
-- 
2.47.1




* Re: [PATCH] nvme: uring_cmd specific request_queue for SGLs
  2025-06-24 21:14 [PATCH] nvme: uring_cmd specific request_queue for SGLs Keith Busch
@ 2025-06-25  6:09 ` Christoph Hellwig
  2025-06-25 22:08   ` Keith Busch
  0 siblings, 1 reply; 12+ messages in thread
From: Christoph Hellwig @ 2025-06-25  6:09 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-nvme, hch, Keith Busch

On Tue, Jun 24, 2025 at 02:14:44PM -0700, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
> 
> User space passthrough IO commands are committed to using the SGL
> transfer types if the device supports it. The virt_boundary_mask is a
> PRP specific constraint, and this limit causes kernel bounce buffers to
> be used when a user vector could have been handled directly. Avoiding
> unnecessary copies is important for uring_cmd usage as this is a high
> performance interface.

Not really more high performance than the normal I/O path.

> For devices that support SGL, create a new request_queue that drops the
> virt_boundary_mask so that vectored user requests can be used with
> zero-copy performance. Normal read/write will still use the old boundary
> mask, as we can't be sure if forcing all IO to use SGL over PRP won't
> cause unexpected regressions for some devices.

Note that this directly conflicts with the new DMA API.  There we do
rely on the virt boundary to guarantee that the IOMMU path can always
coalesce the entire request into a single IOVA mapping.  We could still
do it for the direct mapping path, where it makes a difference, but
we really should do that everywhere, i.e. revisit the default
sgl_threshold and see if we could reduce it to 2 * PAGE_SIZE or so,
so that we'd only use PRPs for the simple path where we can trivially
do the virt_boundary check right in NVMe.
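
A minimal sketch of that heuristic, assuming sgl_threshold were dropped to
roughly 2 * PAGE_SIZE (helper name and exact check are illustrative, not the
in-tree code):

static bool nvme_pci_prefer_sgl(struct nvme_dev *dev, struct request *req)
{
        if (!nvme_ctrl_sgl_supported(&dev->ctrl))
                return false;           /* controller only speaks PRP */

        /*
         * Simple path: one small segment that PRP1/PRP2 can describe
         * without a PRP list, checked against the lowered threshold.
         */
        if (blk_rq_nr_phys_segments(req) == 1 &&
            blk_rq_payload_bytes(req) <= sgl_threshold)
                return false;

        return true;            /* everything else goes through SGLs */
}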



* Re: [PATCH] nvme: uring_cmd specific request_queue for SGLs
  2025-06-25  6:09 ` Christoph Hellwig
@ 2025-06-25 22:08   ` Keith Busch
  2025-06-26  5:14     ` Christoph Hellwig
  0 siblings, 1 reply; 12+ messages in thread
From: Keith Busch @ 2025-06-25 22:08 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Keith Busch, linux-nvme

On Wed, Jun 25, 2025 at 08:09:15AM +0200, Christoph Hellwig wrote:
> > User space passthrough IO commands are committed to using the SGL
> > transfer types if the device supports it. The virt_boundary_mask is a
> > PRP specific constraint, and this limit causes kernel bounce buffers to
> > be used when a user vector could have been handled directly. Avoiding
> > unnecessary copies is important for uring_cmd usage as this is a high
> > performance interface.
> 
> Not really more high performance than the normal I/O path.

Right, that's why I said "a" performance path, not "the" performance
path.

If you send a readv/writev with a similar iovec to an O_DIRECT block
device, it will just get split on the gapped virt boundaries but still
use the user pages directly without bouncing. We can't split passthrough
requests though, so it'd be preferable to use the iovec in a single
command if the hardware supports it rather than bounce it.
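
(For reference, the split/bounce decision comes down to a gap test along
these lines; a simplified sketch of the virt_boundary check, not the exact
block layer helper:)

/*
 * Adjacent segments "gap" if the first doesn't end on the boundary or the
 * next doesn't start on it.  With NVMe's PRP-derived mask
 * (NVME_CTRL_PAGE_SIZE - 1), any hole inside a user iovec means a split
 * for regular fs I/O, or a bounce buffer for passthrough.
 */
static bool segments_gap(phys_addr_t prev_end, phys_addr_t next_start,
                         unsigned long virt_boundary_mask)
{
        return (prev_end | next_start) & virt_boundary_mask;
}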
 
> > For devices that support SGL, create a new request_queue that drops the
> > virt_boundary_mask so that vectored user requests can be used with
> > zero-copy performance. Normal read/write will still use the old boundary
> > mask, as we can't be sure if forcing all IO to use SGL over PRP won't
> > cause unexpected regressions for some devices.
> 
> Note that this directly conflicts with the new DMA API.  There we do
> rely on the virt boundary to guarantee that the IOMMU path can always
> coalesce the entire request into a single IOVA mapping.  We could still
> do it for the direct mapping path, where it makes a difference, but
> we really should do that everywhere, i.e. revisit the default
> sgl_threshold and see if we could reduce it to 2 * PAGE_SIZE or so,
> so that we'd only use PRPs for the simple path where we can trivially
> do the virt_boundary check right in NVMe.

Sure, that sounds okay if you mean 2 * NVME_CTRL_PAGE_SIZE.

It looks straightforward to add merging while we iterate for the direct
mapping result if it returns mergeable IOVAs, but I think we'd have to
commit to using SGL over PRP for everything but the simple case, and
drop the PRP-imposed virt boundary. The downside might be we'd lose that
IOVA pre-allocation optimization (dma_iova_try_alloc) you have going on,
but I'm not sure how important that is. Could the direct mapping get too
fragmented to consistently produce contiguous IOVAs in this path?



* Re: [PATCH] nvme: uring_cmd specific request_queue for SGLs
  2025-06-25 22:08   ` Keith Busch
@ 2025-06-26  5:14     ` Christoph Hellwig
  2025-06-26 15:29       ` Keith Busch
  0 siblings, 1 reply; 12+ messages in thread
From: Christoph Hellwig @ 2025-06-26  5:14 UTC (permalink / raw)
  To: Keith Busch; +Cc: Christoph Hellwig, Keith Busch, linux-nvme

On Wed, Jun 25, 2025 at 04:08:28PM -0600, Keith Busch wrote:
> If you send a readv/writev with a similar iovec to an O_DIRECT block
> device, it will just get split on the gapped virt boundaries but still
> use the user pages directly without bouncing. We can't split passthrough
> requests though, so it'd be preferable to use the iovec in a single
> command if the hardware supports it rather than bounce it.

True.

> > Note that this directly conflicts with the new DMA API.  There we do
> > rely on the virt boundary to guarantee that the IOMMU path can always
> > coalesce the entire request into a single IOVA mapping.  We could still
> > do it for the direct mapping path, where it makes a difference, but
> > we really should do that everywhere, i.e. revisit the default
> > sgl_threshold and see if we could reduce it to 2 * PAGE_SIZE or so,
> > so that we'd only use PRPs for the simple path where we can trivially
> > do the virt_boundary check right in NVMe.
> 
> Sure, that sounds okay if you mean 2 * NVME_CTRL_PAGE_SIZE.
> 
> It looks straightforward to add merging while we iterate for the direct
> mapping result if it returns mergeable IOVAs, but I think we'd have to
> commit to using SGL over PRP for everything but the simple case, and
> drop the PRP-imposed virt boundary. The downside might be we'd lose that
> IOVA pre-allocation optimization (dma_iova_try_alloc) you have going on,
> but I'm not sure how important that is. Could the direct mapping get too
> fragmented to consistently produce contiguous IOVAs in this path?

I can't really parse this.  Direct mapping means not using an IOMMU
mapping, either because there is none or because it is configured to
do an identity mapping.  In that case we'll never use the IOVA path.

If an IOMMU is configured for dynamic IOMMU mappings we never use the
direct mapping.  In that case we'd have to do one IOMMU mapping per
segment with the IOVA mapping path that requires (IOMMU) page alignment,
which will be very expensive.



* Re: [PATCH] nvme: uring_cmd specific request_queue for SGLs
  2025-06-26  5:14     ` Christoph Hellwig
@ 2025-06-26 15:29       ` Keith Busch
  2025-06-27  7:25         ` Christoph Hellwig
  0 siblings, 1 reply; 12+ messages in thread
From: Keith Busch @ 2025-06-26 15:29 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Keith Busch, linux-nvme

On Thu, Jun 26, 2025 at 07:14:13AM +0200, Christoph Hellwig wrote:
> On Wed, Jun 25, 2025 at 04:08:28PM -0600, Keith Busch wrote:
> > 
> > It looks straightforward to add merging while we iterate for the direct
> > mapping result if it returns mergeable IOVAs, but I think we'd have to
> > commit to using SGL over PRP for everything but the simple case, and
> > drop the PRP-imposed virt boundary. The downside might be we'd lose that
> > IOVA pre-allocation optimization (dma_iova_try_alloc) you have going on,
> > but I'm not sure how important that is. Could the direct mapping get too
> > fragmented to consistently produce contiguous IOVAs in this path?
> 
> I can't really parse this.  Direct mapping means not using an IOMMU
> mapping, either because there is none or because it is configured to
> do an identity mapping.  In that case we'll never use the IOVA path.

Okay, maybe I'm confused. The code looks like it defaults to the direct
mapping if it can't be coalesced. What if the IOMMU granularity is 8k
against nvme's 4k virt boundary? We still need the IOMMU dma mappings in
the direct mapping fallback, right? They should just appear as different
dma segments.

When it comes to integrity payloads, merged bio's are almost certainly
not eligible to be coalesced in iova space since they're usually
independently allocated in much smaller granularities, so again, I'd
expect we'd get multiple integrity dma segments.
 
> If an IOMMU is configured for dynamic IOMMU mappings we never use the
> direct mapping.  In that case we'd have to do one IOMMU mapping per
> segment with the IOVA mapping path that requires (IOMMU) page alignment,
> which will be very expensive.



* Re: [PATCH] nvme: uring_cmd specific request_queue for SGLs
  2025-06-26 15:29       ` Keith Busch
@ 2025-06-27  7:25         ` Christoph Hellwig
  2025-06-27 15:34           ` Keith Busch
  0 siblings, 1 reply; 12+ messages in thread
From: Christoph Hellwig @ 2025-06-27  7:25 UTC (permalink / raw)
  To: Keith Busch; +Cc: Christoph Hellwig, Keith Busch, linux-nvme

On Thu, Jun 26, 2025 at 09:29:21AM -0600, Keith Busch wrote:
> > I can't really parse this.  Direct mapping means not using an IOMMU
> > mapping, either because there is none or because it is configured to
> > do an identity mapping.  In that case we'll never use the IOVA path.
> 
> Okay, maybe I'm confused. The code looks like it defaults to the direct
> mapping if it can't be coalesced.

Oh, you are referring to blk_dma_map_direct?  I guess that is a bit
misnamed, as it doesn't mean we're doing the dma-level concept of a
direct mapping, but anything that is not an iova-based coalesce
operation.

> What if the IOMMU granularity is 8k against nvme's 4k virt boundary?

Then we won't merge.  Note that most (maybe even all) relevant IOMMUs
support 4k mappings, including on systems that have larger IOMMU page
sizes.

> We still need the IOMMU dma mappings in
> the direct mapping fallback, right? They should just appear as different
> dma segments.



> When it comes to integrity payloads, merged bio's are almost certainly
> not eligible to be coalesced in iova space since they're usually
> independently allocated in much smaller granularities, so again, I'd
> expect we'd get multiple integrity dma segments.

For the current user DMA that depends on how the user allocated it,
but unless they nicely chunk it up based on the IOMMU page size: yes.

But when merging multiple bios with the block layer default integrity,
we can get it now with the right max_sectors, and the same will
usually/always be true for the file system integrity work in progress.




* Re: [PATCH] nvme: uring_cmd specific request_queue for SGLs
  2025-06-27  7:25         ` Christoph Hellwig
@ 2025-06-27 15:34           ` Keith Busch
  2025-06-30  6:00             ` Christoph Hellwig
  0 siblings, 1 reply; 12+ messages in thread
From: Keith Busch @ 2025-06-27 15:34 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Keith Busch, linux-nvme

On Fri, Jun 27, 2025 at 09:25:24AM +0200, Christoph Hellwig wrote:
> 
> Oh, you are referring to blk_dma_map_direct?

Yes, exactly. Sorry I shortened the name to the ambiguous "direct", but
I think we're on the same page now.

Back to this patch, I get that this uring_cmd path wouldn't be able to
use the more efficient coalesced mapping when the IOMMU is on, and would
instead map each segment individually. I think that's still better than
the alternative, though.



* Re: [PATCH] nvme: uring_cmd specific request_queue for SGLs
  2025-06-27 15:34           ` Keith Busch
@ 2025-06-30  6:00             ` Christoph Hellwig
  2025-06-30 14:04               ` Keith Busch
  0 siblings, 1 reply; 12+ messages in thread
From: Christoph Hellwig @ 2025-06-30  6:00 UTC (permalink / raw)
  To: Keith Busch; +Cc: Christoph Hellwig, Keith Busch, linux-nvme

On Fri, Jun 27, 2025 at 09:34:56AM -0600, Keith Busch wrote:
> Back to this patch, I get that this uring_cmd path wouldn't be able to
> use the more efficient coalesced mapping when the IOMMU is on, and would
> instead map each segment individually. I think that's still better than
> the alternative, though.

My back-of-the-envelope calculations (for 8 byte metadata chunks)
suggest otherwise, but I never got around to fully benchmarking it.
If you do have a representative workload that you care about I'd love to
see the numbers.



* Re: [PATCH] nvme: uring_cmd specific request_queue for SGLs
  2025-06-30  6:00             ` Christoph Hellwig
@ 2025-06-30 14:04               ` Keith Busch
  2025-07-01  6:16                 ` Christoph Hellwig
  0 siblings, 1 reply; 12+ messages in thread
From: Keith Busch @ 2025-06-30 14:04 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Keith Busch, linux-nvme

On Mon, Jun 30, 2025 at 08:00:16AM +0200, Christoph Hellwig wrote:
> On Fri, Jun 27, 2025 at 09:34:56AM -0600, Keith Busch wrote:
> > Back to this patch, I get that this uring_cmd path wouldn't be able to
> > use the more efficient coalesced mapping when the IOMMU is on, and would
> > instead map each segment individually. I think that's still better than
> > the alternative, though.
> 
> My back-of-the-envelope calculations (for 8 byte metadata chunks)
> suggest otherwise, but I never got around to fully benchmarking it.
> If you do have a representative workload that you care about I'd love to
> see the numbers.

Metadata isn't actually the important part of this patch.

The workload just receives data from an io_uring zero-copy network path
and writes it out to disk using uring_cmd. The incoming data can have
various offsets, so it often sends an iovec with page gaps.

Currently the kernel provides a bounce buffer when there are page gaps.
That's obviously undesirable when the hardware is capable of handling
the original vector directly.

The options to avoid the copies are either:

  a. Force the application to split each iovec into a separate command

  b. Relax the kernel's limits to match the hardware's capabilities

This patch is trying to do "b".
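
To make "b" concrete, this is roughly how the submissions look from user
space; a hedged sketch assuming recent liburing and kernel uapi headers, a
ring set up with IORING_SETUP_SQE128 | IORING_SETUP_CQE32, and with error
handling and LBA bookkeeping omitted:

#include <stdint.h>
#include <string.h>
#include <sys/uio.h>
#include <liburing.h>
#include <linux/nvme_ioctl.h>

/* Queue one vectored write of nlb+1 logical blocks at slba via uring_cmd. */
static int queue_vectored_write(struct io_uring *ring, int ng_fd, __u32 nsid,
                                struct iovec *iov, int nr_iovs,
                                __u64 slba, __u16 nlb)
{
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        struct nvme_uring_cmd *cmd;

        if (!sqe)
                return -1;

        memset(sqe, 0, sizeof(*sqe));
        sqe->opcode = IORING_OP_URING_CMD;
        sqe->fd = ng_fd;                        /* /dev/ngXnY char device */
        sqe->cmd_op = NVME_URING_CMD_IO_VEC;    /* vectored passthrough */

        cmd = (struct nvme_uring_cmd *)sqe->cmd;
        memset(cmd, 0, sizeof(*cmd));
        cmd->opcode = 0x01;                     /* nvme_cmd_write */
        cmd->nsid = nsid;
        cmd->addr = (__u64)(uintptr_t)iov;      /* iovec array, gaps allowed */
        cmd->data_len = nr_iovs;                /* _IO_VEC: count of iovecs */
        cmd->cdw10 = slba & 0xffffffff;         /* starting LBA, low 32 bits */
        cmd->cdw11 = slba >> 32;                /* starting LBA, high 32 bits */
        cmd->cdw12 = nlb;                       /* 0's based block count */

        return io_uring_submit(ring);
}

With the current virt_boundary in place, any gap between those iovecs makes
the kernel bounce the whole payload; with this patch, an SGL-capable
controller gets the vector as-is.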



* Re: [PATCH] nvme: uring_cmd specific request_queue for SGLs
  2025-06-30 14:04               ` Keith Busch
@ 2025-07-01  6:16                 ` Christoph Hellwig
  2025-07-01 12:15                   ` Keith Busch
  0 siblings, 1 reply; 12+ messages in thread
From: Christoph Hellwig @ 2025-07-01  6:16 UTC (permalink / raw)
  To: Keith Busch; +Cc: Christoph Hellwig, Keith Busch, linux-nvme

On Mon, Jun 30, 2025 at 08:04:47AM -0600, Keith Busch wrote:
> > My back on the envelope calculations (for 8 byte metadata chunks)
> > suggests otherwise, but I never got around fully benchmarking it.
> > If you do have a representation workload that you care about I'd love to
> > see the numbers.
> 
> Metadata isn't actually the important part of this patch.
> 
> The workload just receives data from an io_uring zero-copy network path
> and writes it out to disk using uring_cmd. The incoming data can have
> various offsets, so it often sends an iovec with page gaps.
> 
> Currently the kernel provides a bounce buffer when there are page gaps.
> That's obviously undesirable when the hardware is capable of handling
> the original vector directly.

Yes, the bounce buffer is obviously not very efficient when transferring
large amounts of data.

> The options to avoid the copies are either:
> 
>   a. Force the application to split each iovec into a separate command
> 
>   b. Relax the kernel's limits to match the hardware's capabilities
> 
> This patch is trying to do "b".

a, or a variant of that (not using passthrough) would in general be
my preference.  Why is that not suitable here?



* Re: [PATCH] nvme: uring_cmd specific request_queue for SGLs
  2025-07-01  6:16                 ` Christoph Hellwig
@ 2025-07-01 12:15                   ` Keith Busch
  2025-07-01 14:35                     ` Keith Busch
  0 siblings, 1 reply; 12+ messages in thread
From: Keith Busch @ 2025-07-01 12:15 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Keith Busch, linux-nvme

On Tue, Jul 01, 2025 at 08:16:09AM +0200, Christoph Hellwig wrote:
> On Mon, Jun 30, 2025 at 08:04:47AM -0600, Keith Busch wrote:
> > 
> >   a. Force the application to split each iovec into a separate command
> > 
> >   b. Relax the kernel's limits to match the hardware's capabilities
> > 
> > This patch is trying to do "b".
> 
> a, or a variant of that (not using passthrough) would in general be
> my preference.  Why is that not suitable here?

Why use passthrough? Because until very recently (as in the current rc),
passthrough had been the only way for user space to access write streams
and metadata.

We've had uring_cmd passthrough for several years now, so there are some
decently mature applications making use of it. But can they split
commands such that they dispatch one per iovec instead of as a single
command? I've asked, and I'm still waiting to hear back. I sent this
patch, though, because they bought SGL-capable hardware specifically so
they could do this.



* Re: [PATCH] nvme: uring_cmd specific request_queue for SGLs
  2025-07-01 12:15                   ` Keith Busch
@ 2025-07-01 14:35                     ` Keith Busch
  0 siblings, 0 replies; 12+ messages in thread
From: Keith Busch @ 2025-07-01 14:35 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Keith Busch, linux-nvme

On Tue, Jul 01, 2025 at 08:15:39AM -0400, Keith Busch wrote:
>
> But can they split commands such that they dispatch one per iovec
> instead of as a single command?

That can be done only if the individual iovecs are block-size aligned,
which isn't always the case. Only the total size is guaranteed to be
block-size aligned.
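
A hypothetical example with 4096-byte logical blocks (page_a/page_b are
made-up buffers, purely for illustration):

        struct iovec iov[2] = {
                { .iov_base = page_a + 2560, .iov_len = 1536 }, /* not a block multiple */
                { .iov_base = page_b,        .iov_len = 2560 }, /* not a block multiple */
        };

The pair adds up to exactly one block, so it is a valid single write, but
neither element is a legal command on its own, so splitting one command per
iovec doesn't work here.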


