[PATCH RFC] nvme-ioctl: propagate PRP1 from ioctl to admin cmd

* [PATCH RFC] nvme-ioctl: propagate PRP1 from ioctl to admin cmd
@ 2026-06-22 14:15 David Epping
  2026-06-22 14:35 ` Keith Busch
  0 siblings, 1 reply; 6+ messages in thread
From: David Epping @ 2026-06-22 14:15 UTC (permalink / raw)
  To: linux-nvme
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Leon Romanovsky, Joachim Foerster

Hello,

some NVMe admin commands (like IO queue creation) require information to
be provided via the PRP1 field of the command.
This information has not previously been passed from the ioctl to the NVMe
command structure, which prevented execution of such commands from userspace.
Likely this is not an oversight but intentional because of security concerns?

With this email I would like to find out whether there is interest in
supporting userspace NVMe IO queue creation in the upstream driver.
The attached patch is one way to do it and has been in use for several years
across multiple Kernel versions. Additional patches are required, for example
to allocate enough IO queues at initialization time, but I would like to
focus on the ability to create queues for now.
A final patch would also cover nvme_user_cmd64().

The system setup where this patch has been used is as follows:
- P2P PCIe capable CPU (currently also IOMMU disabled)
- patched Linux in-Kernel NVMe driver for local PCIe NVMe SSDs
- FPGA accelerator implementing NVMe IO queue memory and IO queue handling,
  exposed via PCIe BAR
- vfio-pci Kernel driver plus vfio userspace FPGA driver / application
- The userspace application creates new NVMe IO queues at the SSD using the
  patched admin ioctl and points them towards the FPGA BAR. It then informs
  the FPGA about the SSD BAR address and IO queue ID. From then on the FPGA
  can access the SSD storage entirely without software interaction.
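
As an illustration of the queue-creation step in the list above, here is a minimal sketch of how a userspace application might fill the admin commands. It assumes the patched driver from this RFC, where cmd.addr is copied into PRP1 unchanged. The helper names (fill_create_cq, fill_create_sq) and the choice to leave interrupts disabled are hypothetical and not taken from the original application; the opcodes and the CDW10/CDW11 layouts follow the NVMe base specification.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <linux/nvme_ioctl.h>   /* struct nvme_admin_cmd, NVME_IOCTL_ADMIN_CMD */

/* Admin opcodes from the NVMe base specification. */
#define NVME_ADM_CREATE_SQ 0x01
#define NVME_ADM_CREATE_CQ 0x05

/*
 * Hypothetical helper: fill an admin command that creates I/O completion
 * queue `qid` with `entries` slots located at `queue_addr`.
 * Interrupts stay disabled (IEN = 0), assuming the FPGA polls the queue.
 */
static void fill_create_cq(struct nvme_admin_cmd *cmd, uint64_t queue_addr,
                           uint16_t qid, uint16_t entries)
{
    assert(qid != 0 && entries >= 2);
    memset(cmd, 0, sizeof(*cmd));
    cmd->opcode = NVME_ADM_CREATE_CQ;
    cmd->addr   = queue_addr;                          /* becomes PRP1 only with the patch */
    cmd->cdw10  = ((uint32_t)(entries - 1) << 16) | qid; /* QSIZE is zero-based */
    cmd->cdw11  = 0x1;                                 /* PC = 1: physically contiguous */
}

/*
 * Hypothetical helper: fill an admin command that creates I/O submission
 * queue `qid`, bound to completion queue `cqid`.
 */
static void fill_create_sq(struct nvme_admin_cmd *cmd, uint64_t queue_addr,
                           uint16_t qid, uint16_t entries, uint16_t cqid)
{
    assert(qid != 0 && cqid != 0 && entries >= 2);
    memset(cmd, 0, sizeof(*cmd));
    cmd->opcode = NVME_ADM_CREATE_SQ;
    cmd->addr   = queue_addr;
    cmd->cdw10  = ((uint32_t)(entries - 1) << 16) | qid;
    cmd->cdw11  = ((uint32_t)cqid << 16) | 0x1;        /* CQID in 31:16, PC = 1 */
}
```

The application would then submit each command with ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd) on the controller character device, creating the completion queue before the submission queue that refers to it.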

Since the omission of PRP1 access from userspace is likely intentional, maybe
the discussion and patches by Leon Romanovsky for making dmabuf and p2pdma
available via vfio lead in the right direction:
https://lore.kernel.org/all/20251106-dmabuf-vfio-v7-0-2503bf390699@nvidia.com/

It currently seems focused on SPDK for handling the SSDs in userspace, but the
author also describes it as a general mechanism that can support other
scenarios.
How about having the in-Kernel NVMe driver accept a dmabuf as IO queue location
for creating a new IO queue? And in turn it likely has to provide a small
dmabuf to the FPGA VFIO world for access to the queue doorbells on the SSD.
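
For reference, the doorbell registers mentioned above sit at fixed offsets in the SSD's BAR0, so the region the FPGA needs can be computed from the queue ID and the CAP.DSTRD field. The sketch below follows the register layout in the NVMe base specification; the function names are hypothetical and not part of any driver API.

```c
#include <assert.h>
#include <stdint.h>

/* Doorbell registers start at offset 0x1000 of the controller's BAR0. */
#define NVME_DOORBELL_BASE 0x1000u

/* Distance in bytes between two doorbells: 4 << CAP.DSTRD. */
static uint32_t nvme_doorbell_stride(uint8_t dstrd)
{
    assert(dstrd <= 15);   /* CAP.DSTRD is a 4-bit field */
    return 4u << dstrd;
}

/* BAR0 offset of the submission queue tail doorbell for queue `qid`. */
static uint64_t nvme_sq_tail_doorbell(uint16_t qid, uint8_t dstrd)
{
    return NVME_DOORBELL_BASE +
           (uint64_t)(2u * qid) * nvme_doorbell_stride(dstrd);
}

/* BAR0 offset of the completion queue head doorbell for queue `qid`. */
static uint64_t nvme_cq_head_doorbell(uint16_t qid, uint8_t dstrd)
{
    return NVME_DOORBELL_BASE +
           (uint64_t)(2u * qid + 1u) * nvme_doorbell_stride(dstrd);
}
```

Each queue pair owns two adjacent doorbells, so a per-queue dmabuf would have to cover 2 * (4 << DSTRD) bytes starting at the submission queue tail doorbell.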

Looking forward to your feedback,
David

Signed-off-by: David Epping <david.epping@missinglinkelectronics.com>
---
 drivers/nvme/host/ioctl.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index ca86d3bf7ea4..d5d740f1b554 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -306,6 +306,7 @@ static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
 	c.common.nsid = cpu_to_le32(cmd.nsid);
 	c.common.cdw2[0] = cpu_to_le32(cmd.cdw2);
 	c.common.cdw2[1] = cpu_to_le32(cmd.cdw3);
+	c.common.dptr.prp1 = cpu_to_le64(cmd.addr);
 	c.common.cdw10 = cpu_to_le32(cmd.cdw10);
 	c.common.cdw11 = cpu_to_le32(cmd.cdw11);
 	c.common.cdw12 = cpu_to_le32(cmd.cdw12);
-- 
2.43.0


* Re: [PATCH RFC] nvme-ioctl: propagate PRP1 from ioctl to admin cmd
  2026-06-22 14:15 [PATCH RFC] nvme-ioctl: propagate PRP1 from ioctl to admin cmd David Epping
@ 2026-06-22 14:35 ` Keith Busch
  2026-06-22 14:56   ` David Epping
  0 siblings, 1 reply; 6+ messages in thread
From: Keith Busch @ 2026-06-22 14:35 UTC (permalink / raw)
  To: David Epping
  Cc: linux-nvme, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Leon Romanovsky, Joachim Foerster

On Mon, Jun 22, 2026 at 04:15:42PM +0200, David Epping wrote:
> @@ -306,6 +306,7 @@ static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
>  	c.common.nsid = cpu_to_le32(cmd.nsid);
>  	c.common.cdw2[0] = cpu_to_le32(cmd.cdw2);
>  	c.common.cdw2[1] = cpu_to_le32(cmd.cdw3);
> +	c.common.dptr.prp1 = cpu_to_le64(cmd.addr);

This is not correct: the user space virtual address isn't the device
DMA'able address. The driver already handles mapping the user address to
kernel space, then to dma, then sets the PRP accordingly.



* Re: [PATCH RFC] nvme-ioctl: propagate PRP1 from ioctl to admin cmd
  2026-06-22 14:35 ` Keith Busch
@ 2026-06-22 14:56   ` David Epping
  2026-06-22 15:15     ` Keith Busch
  0 siblings, 1 reply; 6+ messages in thread
From: David Epping @ 2026-06-22 14:56 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-nvme, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Leon Romanovsky, Joachim Foerster

On Mon, Jun 22, 2026 at 08:35:42AM -0600, Keith Busch wrote:
> On Mon, Jun 22, 2026 at 04:15:42PM +0200, David Epping wrote:
> > @@ -306,6 +306,7 @@ static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
> >  	c.common.nsid = cpu_to_le32(cmd.nsid);
> >  	c.common.cdw2[0] = cpu_to_le32(cmd.cdw2);
> >  	c.common.cdw2[1] = cpu_to_le32(cmd.cdw3);
> > +	c.common.dptr.prp1 = cpu_to_le64(cmd.addr);
> 
> This is not correct: the user space virtual address isn't the device
> DMA'able address. The driver already handles mapping the user address to
> kernel space, then to dma, then sets the PRP accordingly.

To clarify, the ioctl struct addr field is not filled with a memory buffer
address by the userspace, but a PCIe mapped BAR address plus offset.
It is obtained by the userspace application operating the FPGA vfio device
by reading from PCI config space via VFIO_PCI_CONFIG_REGION_INDEX.
So it is the address Linux assigned to that BAR (plus offset).
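
A minimal sketch of the BAR lookup described above, assuming the application has already copied the first 64 bytes of the FPGA's PCI config space into a buffer (for example with pread() on the vfio device fd, at the offset that VFIO_DEVICE_GET_REGION_INFO reports for VFIO_PCI_CONFIG_REGION_INDEX). The function and macro names are hypothetical; the BAR bit layout follows the PCI specification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PCI_CFG_BAR0          0x10   /* offset of BAR0 in the config header */
#define PCI_BAR_IO_SPACE      0x1u   /* bit 0: 1 = I/O BAR, 0 = memory BAR */
#define PCI_BAR_MEM_TYPE_MASK 0x6u   /* bits 2:1: memory BAR type */
#define PCI_BAR_MEM_TYPE_64   0x4u   /* 64-bit BAR, uses two dwords */

/* Config space is little-endian regardless of host byte order. */
static uint32_t cfg_read32(const uint8_t *cfg, size_t off)
{
    return (uint32_t)cfg[off] |
           (uint32_t)cfg[off + 1] << 8 |
           (uint32_t)cfg[off + 2] << 16 |
           (uint32_t)cfg[off + 3] << 24;
}

/*
 * Return the address programmed into memory BAR `bar` (0..5), or 0 if the
 * index is invalid or the BAR is an I/O BAR.
 */
static uint64_t pci_mem_bar_address(const uint8_t *cfg, unsigned bar)
{
    if (bar > 5)
        return 0;

    uint32_t lo = cfg_read32(cfg, PCI_CFG_BAR0 + 4 * bar);
    if (lo & PCI_BAR_IO_SPACE)
        return 0;

    uint64_t addr = lo & 0xfffffff0u;   /* strip the four flag bits */
    if ((lo & PCI_BAR_MEM_TYPE_MASK) == PCI_BAR_MEM_TYPE_64) {
        if (bar == 5)
            return 0;                   /* no room for the upper dword */
        addr |= (uint64_t)cfg_read32(cfg, PCI_CFG_BAR0 + 4 * (bar + 1)) << 32;
    }
    return addr;
}
```

The value passed in cmd.addr would be this address plus the offset of the queue memory inside the BAR. Using it directly as a PRP entry relies on the IOMMU being disabled, as in the setup described in the first message.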



* Re: [PATCH RFC] nvme-ioctl: propagate PRP1 from ioctl to admin cmd
  2026-06-22 14:56   ` David Epping
@ 2026-06-22 15:15     ` Keith Busch
  2026-06-23 10:34       ` David Epping
  0 siblings, 1 reply; 6+ messages in thread
From: Keith Busch @ 2026-06-22 15:15 UTC (permalink / raw)
  To: David Epping
  Cc: linux-nvme, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Leon Romanovsky, Joachim Foerster

On Mon, Jun 22, 2026 at 04:56:22PM +0200, David Epping wrote:
> On Mon, Jun 22, 2026 at 08:35:42AM -0600, Keith Busch wrote:
> > On Mon, Jun 22, 2026 at 04:15:42PM +0200, David Epping wrote:
> > > @@ -306,6 +306,7 @@ static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
> > >  	c.common.nsid = cpu_to_le32(cmd.nsid);
> > >  	c.common.cdw2[0] = cpu_to_le32(cmd.cdw2);
> > >  	c.common.cdw2[1] = cpu_to_le32(cmd.cdw3);
> > > +	c.common.dptr.prp1 = cpu_to_le64(cmd.addr);
> > 
> > This is not correct: the user space virtual address isn't the device
> > DMA'able address. The driver already handles mapping the user address to
> > kernel space, then to dma, then sets the PRP accordingly.
> 
> To clarify, the ioctl struct addr field is not filled with a memory buffer
> address by the userspace, but a PCIe mapped BAR address plus offset.
> It is obtained by the userspace application operating the FPGA vfio device
> by reading from PCI config space via VFIO_PCI_CONFIG_REGION_INDEX.
> So it is the address Linux assigned to that BAR (plus offset).

The driver and block layer should already handle PCIe addresses. You're
supposed to mmap it to user space first though, and pass that address in
instead. And you'd also need to set cmd.data_len to a non-zero value so
the driver doesn't skip setting up the data pointers.

Note, creating IO queues from user space, while not explicitly prevented
today, is not supported. The driver doesn't know you've done this so the
queue isn't properly handled on a controller reset.
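
The alternative suggested above can be sketched as follows: pass a user virtual address obtained from mmap() and a non-zero data_len, and let the unpatched driver build the data pointer. This is only an illustration under assumptions: the helper name is hypothetical, queue_va stands for a pointer the application would get by mapping the FPGA BAR through its vfio device fd, and the choice of data_len (queue size in bytes) is a guess at a sensible value. Whether a given kernel accepts such a mapping for an admin passthrough command is not confirmed here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <linux/nvme_ioctl.h>   /* struct nvme_admin_cmd */

#define NVME_ADM_CREATE_CQ 0x05
#define NVME_CQE_BYTES     16u  /* size of one completion queue entry */

/*
 * Hypothetical helper: fill a Create I/O Completion Queue command that
 * refers to queue memory by user virtual address instead of bus address.
 */
static void fill_create_cq_mapped(struct nvme_admin_cmd *cmd, void *queue_va,
                                  uint16_t qid, uint16_t entries)
{
    assert(queue_va != NULL && qid != 0 && entries >= 2);
    memset(cmd, 0, sizeof(*cmd));
    cmd->opcode   = NVME_ADM_CREATE_CQ;
    cmd->addr     = (uint64_t)(uintptr_t)queue_va;       /* user VA, not a bus address */
    cmd->data_len = (uint32_t)entries * NVME_CQE_BYTES;  /* non-zero so addr is mapped */
    cmd->cdw10    = ((uint32_t)(entries - 1) << 16) | qid;
    cmd->cdw11    = 0x1;                                  /* PC = 1, interrupts disabled */
}
```

As noted above, a queue created this way is still unknown to the driver and will not be restored after a controller reset.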



* Re: [PATCH RFC] nvme-ioctl: propagate PRP1 from ioctl to admin cmd
  2026-06-22 15:15     ` Keith Busch
@ 2026-06-23 10:34       ` David Epping
  2026-06-23 12:19         ` Keith Busch
  0 siblings, 1 reply; 6+ messages in thread
From: David Epping @ 2026-06-23 10:34 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-nvme, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Leon Romanovsky, Joachim Foerster

On Mon, Jun 22, 2026 at 09:15:40AM -0600, Keith Busch wrote:
> On Mon, Jun 22, 2026 at 04:56:22PM +0200, David Epping wrote:
> > On Mon, Jun 22, 2026 at 08:35:42AM -0600, Keith Busch wrote:
> > > On Mon, Jun 22, 2026 at 04:15:42PM +0200, David Epping wrote:
> > > > @@ -306,6 +306,7 @@ static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
> > > >  	c.common.nsid = cpu_to_le32(cmd.nsid);
> > > >  	c.common.cdw2[0] = cpu_to_le32(cmd.cdw2);
> > > >  	c.common.cdw2[1] = cpu_to_le32(cmd.cdw3);
> > > > +	c.common.dptr.prp1 = cpu_to_le64(cmd.addr);
> > > 
> > > This is not correct: the user space virtual address isn't the device
> > > DMA'able address. The driver already handles mapping the user address to
> > > kernel space, then to dma, then sets the PRP accordingly.
> > 
> > To clarify, the ioctl struct addr field is not filled with a memory buffer
> > address by the userspace, but a PCIe mapped BAR address plus offset.
> > It is obtained by the userspace application operating the FPGA vfio device
> > by reading from PCI config space via VFIO_PCI_CONFIG_REGION_INDEX.
> > So it is the address Linux assigned to that BAR (plus offset).
> 
> The driver and block layer should already handle PCIe addresses. You're
> supposed to mmap it to user space first though, and pass that address in
> instead. And you'd also need to set cmd.data_len to a non-zero value so
> the driver doesn't skip setting up the data pointers.
> 
> Note, creating IO queues from user space, while not explicitly prevented
> today, is not supported. The driver doesn't know you've done this so the
> queue isn't properly handled on a controller reset.
> 

Hi Keith, I understand that creating IO queues from user space is not
supported by the current driver. That's why we created patches for that a
couple of years ago and ported them to new Kernels since.

My question is, and maybe I should have put this in my initial email
explicitly, is there interest in having such functionality in the upstream
Linux in-Kernel NVMe driver? An interface and mechanism to request and
manage IO queues that are not used by the Linux NVMe driver to perform IO,
but handed to a separate entity for this purpose.

Of course an upstream implementation would have to take many more things
into account, like the reset you mentioned, and IOMMU setup, and much more.
But that's only worth looking at if there is upstream interest in it.

Thanks for your feedback, David



* Re: [PATCH RFC] nvme-ioctl: propagate PRP1 from ioctl to admin cmd
  2026-06-23 10:34       ` David Epping
@ 2026-06-23 12:19         ` Keith Busch
  0 siblings, 0 replies; 6+ messages in thread
From: Keith Busch @ 2026-06-23 12:19 UTC (permalink / raw)
  To: David Epping
  Cc: linux-nvme, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Leon Romanovsky, Joachim Foerster

On Tue, Jun 23, 2026 at 12:34:29PM +0200, David Epping wrote:
> My question is, and maybe I should have put this in my initial email
> explicitly, is there interest in having such functionality in the upstream
> Linux in-Kernel NVMe driver? An interface and mechanism to request and
> manage IO queues that are not used by the Linux NVMe driver to perform IO,
> but handed to a separate entity for this purpose.

Partitioning device resources to assign to special purposes should be
under a well defined framework. Unfortunately the only thing I know of
approaching this is SIOV. :) Not sure how other maintainers and
developers feel about it, but that's the route I would go for this. It
at least provides memory access on a queue granularity and neatly
separates the control plane.

