* [PATCH RFC] nvme-ioctl: propagate PRP1 from ioctl to admin cmd
From: David Epping @ 2026-06-22 14:15 UTC
To: linux-nvme
Cc: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Leon Romanovsky, Joachim Foerster

Hello,

some NVMe admin commands (like IO queue creation) require information to be
provided via the PRP1 field of the command. This information has not
previously been passed from the ioctl to the NVMe command structure, which
prevented execution of such commands from userspace. Likely this is not an
oversight but intentional because of security concerns?

With this email I would like to find out whether there is interest in
getting support for userspace NVMe IO queue creation into the upstream
driver. The attached patch is one way to do it and has been used for
multiple years with multiple kernel versions now. Additional patches are
required, for example to allocate sufficiently many IO queues at
initialization time, but I would like to focus on the ability to create
queues for now. A final patch would also not ignore nvme_user_cmd64().

The system setup where this patch has been used is as follows:
- P2P PCIe capable CPU (currently also IOMMU disabled)
- patched Linux in-kernel NVMe driver for local PCIe NVMe SSDs
- FPGA accelerator implementing NVMe IO queue memory and IO queue handling,
  exposed via PCIe BAR
- vfio-pci kernel driver plus vfio userspace FPGA driver / application
- The userspace application creates new NVMe IO queues at the SSD using the
  patched admin ioctl and points them towards the FPGA BAR. It then informs
  the FPGA about the SSD BAR address and IO queue ID. From then on the FPGA
  can access the SSD storage entirely without software interaction.

Since the omission of PRP1 access from userspace is likely intentional,
maybe the discussion and patches by Leon Romanovsky for making dmabuf and
p2pdma available via vfio lead in the right direction:
https://lore.kernel.org/all/20251106-dmabuf-vfio-v7-0-2503bf390699@nvidia.com/
It currently seems focused on SPDK for handling the SSDs in userspace, but
the author also describes it as a general mechanism that can support other
scenarios.

How about having the in-kernel NVMe driver accept a dmabuf as IO queue
location for creating a new IO queue? And in turn it likely has to provide
a small dmabuf to the FPGA VFIO world for access to the queue doorbells on
the SSD.

Looking forward to your feedback,
David

Signed-off-by: David Epping <david.epping@missinglinkelectronics.com>
---
 drivers/nvme/host/ioctl.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index ca86d3bf7ea4..d5d740f1b554 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -306,6 +306,7 @@ static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
 	c.common.nsid = cpu_to_le32(cmd.nsid);
 	c.common.cdw2[0] = cpu_to_le32(cmd.cdw2);
 	c.common.cdw2[1] = cpu_to_le32(cmd.cdw3);
+	c.common.dptr.prp1 = cpu_to_le64(cmd.addr);
 	c.common.cdw10 = cpu_to_le32(cmd.cdw10);
 	c.common.cdw11 = cpu_to_le32(cmd.cdw11);
 	c.common.cdw12 = cpu_to_le32(cmd.cdw12);
-- 
2.43.0
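The userspace side of the setup described in this message could look roughly
like the sketch below. It only fills in the admin passthrough structure for a
Create I/O Submission Queue command; it is not the application from the
message. The helper name build_create_sq() is hypothetical, the opcode and
CDW10/CDW11 layout follow the NVMe base specification, and leaving data_len
at 0 is an assumption about how the patched ioctl would be driven. It also
assumes a kernel carrying the patch, so that cmd.addr reaches PRP1 unchanged.

```c
#include <stdint.h>
#include <string.h>
#include <linux/nvme_ioctl.h>   /* struct nvme_admin_cmd, NVME_IOCTL_ADMIN_CMD */

#define NVME_ADMIN_CREATE_SQ 0x01   /* Create I/O Submission Queue opcode */

/*
 * Hypothetical helper: build an admin command that creates I/O submission
 * queue `qid` with `entries` slots (entries >= 2), bound to completion
 * queue `cqid`, with its memory at `queue_bus_addr` (e.g. FPGA BAR + offset).
 */
static struct nvme_admin_cmd
build_create_sq(uint16_t qid, uint16_t entries, uint16_t cqid,
                uint64_t queue_bus_addr)
{
    struct nvme_admin_cmd cmd;

    memset(&cmd, 0, sizeof(cmd));
    cmd.opcode = NVME_ADMIN_CREATE_SQ;
    cmd.addr = queue_bus_addr;      /* copied into PRP1 only with the patch */
    cmd.data_len = 0;               /* assumption: no user buffer to map */
    /* CDW10: QSIZE (zero-based) in bits 31:16, QID in bits 15:0 */
    cmd.cdw10 = (((uint32_t)(entries - 1) & 0xffff) << 16) | qid;
    /* CDW11: CQID in bits 31:16, PC (physically contiguous) in bit 0 */
    cmd.cdw11 = ((uint32_t)cqid << 16) | 0x1;
    return cmd;
}
```

The resulting structure would then be submitted with
ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd) on the controller character device.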
* Re: [PATCH RFC] nvme-ioctl: propagate PRP1 from ioctl to admin cmd
From: Keith Busch @ 2026-06-22 14:35 UTC
To: David Epping
Cc: linux-nvme, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Leon Romanovsky, Joachim Foerster

On Mon, Jun 22, 2026 at 04:15:42PM +0200, David Epping wrote:
> @@ -306,6 +306,7 @@ static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
>  	c.common.nsid = cpu_to_le32(cmd.nsid);
>  	c.common.cdw2[0] = cpu_to_le32(cmd.cdw2);
>  	c.common.cdw2[1] = cpu_to_le32(cmd.cdw3);
> +	c.common.dptr.prp1 = cpu_to_le64(cmd.addr);

This is not correct: the user space virtual address isn't the device
DMA'able address. The driver already handles mapping the user address to
kernel space, then to dma, then sets the PRP accordingly.
* Re: [PATCH RFC] nvme-ioctl: propagate PRP1 from ioctl to admin cmd
From: David Epping @ 2026-06-22 14:56 UTC
To: Keith Busch
Cc: linux-nvme, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Leon Romanovsky, Joachim Foerster

On Mon, Jun 22, 2026 at 08:35:42AM -0600, Keith Busch wrote:
> On Mon, Jun 22, 2026 at 04:15:42PM +0200, David Epping wrote:
> > @@ -306,6 +306,7 @@ static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
> >  	c.common.nsid = cpu_to_le32(cmd.nsid);
> >  	c.common.cdw2[0] = cpu_to_le32(cmd.cdw2);
> >  	c.common.cdw2[1] = cpu_to_le32(cmd.cdw3);
> > +	c.common.dptr.prp1 = cpu_to_le64(cmd.addr);
>
> This is not correct: the user space virtual address isn't the device
> DMA'able address. The driver already handles mapping the user address to
> kernel space, then to dma, then sets the PRP accordingly.

To clarify, the ioctl struct addr field is not filled with a memory buffer
address by the userspace, but a PCIe mapped BAR address plus offset.
It is obtained by the userspace application operating the FPGA vfio device
by reading from PCI config space via VFIO_PCI_CONFIG_REGION_INDEX.
So it is the address Linux assigned to that BAR (plus offset).
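The address derivation described in this message (a BAR value read from PCI
config space, plus an offset) can be sketched as a pure decoding step. The
helper names cfg_read32() and bar_address() are hypothetical; the BAR layout
(first BAR at offset 0x10, flag bits 3:0, type 10b for 64-bit BARs) is the
standard PCI one. The sketch works on a buffer already read from the vfio
config region and does not check whether the value vfio reports matches the
bus address the SSD would see; that part is taken from the message as stated.

```c
#include <stddef.h>
#include <stdint.h>

#define PCI_BAR0_OFFSET 0x10   /* first Base Address Register in config space */

/* Read a little-endian 32-bit value from a raw config-space buffer. */
static uint32_t cfg_read32(const uint8_t *cfg, size_t off)
{
    return (uint32_t)cfg[off] |
           ((uint32_t)cfg[off + 1] << 8) |
           ((uint32_t)cfg[off + 2] << 16) |
           ((uint32_t)cfg[off + 3] << 24);
}

/*
 * Hypothetical helper: return the base address stored in memory BAR `bar`
 * plus `offset`, or 0 if that register describes an I/O space BAR.
 */
static uint64_t bar_address(const uint8_t *cfg, unsigned int bar,
                            uint64_t offset)
{
    size_t off = PCI_BAR0_OFFSET + 4u * bar;
    uint32_t lo = cfg_read32(cfg, off);
    uint64_t base;

    if (lo & 0x1)                     /* bit 0 set: I/O space, not memory */
        return 0;
    base = lo & ~0xfu;                /* bits 3:0 are type/prefetch flags */
    if (((lo >> 1) & 0x3) == 0x2)     /* type 10b: upper 32 bits follow */
        base |= (uint64_t)cfg_read32(cfg, off + 4) << 32;
    return base + offset;
}
```

In a real application the buffer would presumably be filled with pread() on
the vfio device fd at the offset reported for VFIO_PCI_CONFIG_REGION_INDEX.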
* Re: [PATCH RFC] nvme-ioctl: propagate PRP1 from ioctl to admin cmd
From: Keith Busch @ 2026-06-22 15:15 UTC
To: David Epping
Cc: linux-nvme, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Leon Romanovsky, Joachim Foerster

On Mon, Jun 22, 2026 at 04:56:22PM +0200, David Epping wrote:
> On Mon, Jun 22, 2026 at 08:35:42AM -0600, Keith Busch wrote:
> > On Mon, Jun 22, 2026 at 04:15:42PM +0200, David Epping wrote:
> > > @@ -306,6 +306,7 @@ static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
> > >  	c.common.nsid = cpu_to_le32(cmd.nsid);
> > >  	c.common.cdw2[0] = cpu_to_le32(cmd.cdw2);
> > >  	c.common.cdw2[1] = cpu_to_le32(cmd.cdw3);
> > > +	c.common.dptr.prp1 = cpu_to_le64(cmd.addr);
> >
> > This is not correct: the user space virtual address isn't the device
> > DMA'able address. The driver already handles mapping the user address to
> > kernel space, then to dma, then sets the PRP accordingly.
>
> To clarify, the ioctl struct addr field is not filled with a memory buffer
> address by the userspace, but a PCIe mapped BAR address plus offset.
> It is obtained by the userspace application operating the FPGA vfio device
> by reading from PCI config space via VFIO_PCI_CONFIG_REGION_INDEX.
> So it is the address Linux assigned to that BAR (plus offset).

The driver and block layer should already handle PCIe addresses. You're
supposed to mmap it to user space first though, and pass that address in
instead. And you'd also need to set cmd.data_len to a non-zero value so
the driver doesn't skip setting up the data pointers.

Note, creating IO queues from user space, while not explicitly prevented
today, is not supported. The driver doesn't know you've done this so the
queue isn't properly handled on a controller reset.
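The alternative suggested in this reply (pass an mmap'ed user address and a
non-zero data_len, and let the unpatched driver build the data pointers) might
be sketched as below. The helper name point_cmd_at_mapping() is hypothetical,
and sizing data_len as entries times 64 bytes assumes the common 64-byte
submission queue entry; the reply only says the length must be non-zero. The
caveat from the reply still applies: the driver does not track a queue created
this way across a controller reset.

```c
#include <stdint.h>
#include <linux/nvme_ioctl.h>   /* struct nvme_admin_cmd */

#define NVME_SQ_ENTRY_BYTES 64u   /* assumed size of one submission queue entry */

/*
 * Hypothetical helper: point an already prepared admin command at queue
 * memory that userspace has mmap'ed (e.g. from the vfio BAR region).
 * `queue_va` is a user virtual address, not a bus address.
 */
static void point_cmd_at_mapping(struct nvme_admin_cmd *cmd, void *queue_va,
                                 uint16_t entries)
{
    cmd->addr = (uint64_t)(uintptr_t)queue_va;
    /* Non-zero length so the driver does not skip data pointer setup. */
    cmd->data_len = (uint32_t)entries * NVME_SQ_ENTRY_BYTES;
}
```

With this approach the bus address never appears in userspace; translating the
mapping into something the SSD can DMA to is left to the driver.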
* Re: [PATCH RFC] nvme-ioctl: propagate PRP1 from ioctl to admin cmd
From: David Epping @ 2026-06-23 10:34 UTC
To: Keith Busch
Cc: linux-nvme, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Leon Romanovsky, Joachim Foerster

On Mon, Jun 22, 2026 at 09:15:40AM -0600, Keith Busch wrote:
> On Mon, Jun 22, 2026 at 04:56:22PM +0200, David Epping wrote:
> > On Mon, Jun 22, 2026 at 08:35:42AM -0600, Keith Busch wrote:
> > > On Mon, Jun 22, 2026 at 04:15:42PM +0200, David Epping wrote:
> > > > @@ -306,6 +306,7 @@ static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
> > > >  	c.common.nsid = cpu_to_le32(cmd.nsid);
> > > >  	c.common.cdw2[0] = cpu_to_le32(cmd.cdw2);
> > > >  	c.common.cdw2[1] = cpu_to_le32(cmd.cdw3);
> > > > +	c.common.dptr.prp1 = cpu_to_le64(cmd.addr);
> > >
> > > This is not correct: the user space virtual address isn't the device
> > > DMA'able address. The driver already handles mapping the user address to
> > > kernel space, then to dma, then sets the PRP accordingly.
> >
> > To clarify, the ioctl struct addr field is not filled with a memory buffer
> > address by the userspace, but a PCIe mapped BAR address plus offset.
> > It is obtained by the userspace application operating the FPGA vfio device
> > by reading from PCI config space via VFIO_PCI_CONFIG_REGION_INDEX.
> > So it is the address Linux assigned to that BAR (plus offset).
>
> The driver and block layer should already handle PCIe addresses. You're
> supposed to mmap it to user space first though, and pass that address in
> instead. And you'd also need to set cmd.data_len to a non-zero value so
> the driver doesn't skip setting up the data pointers.
>
> Note, creating IO queues from user space, while not explicitly prevented
> today, is not supported. The driver doesn't know you've done this so the
> queue isn't properly handled on a controller reset.
>

Hi Keith,

I understand that creating IO queues from user space is not supported by
the current driver. That's why we created patches for that a couple of
years ago and have ported them to new kernels since.

My question is, and maybe I should have put this in my initial email
explicitly, is there interest in having such functionality in the upstream
Linux in-kernel NVMe driver? An interface and mechanism to request and
manage IO queues that are not used by the Linux NVMe driver to perform IO,
but handed to a separate entity for this purpose.

Of course an upstream implementation would have to take many more things
into account, like the reset you mentioned, and IOMMU setup, and much more.
But that's only worth looking at if there is upstream interest in it.

Thanks for your feedback,
David
* Re: [PATCH RFC] nvme-ioctl: propagate PRP1 from ioctl to admin cmd
From: Keith Busch @ 2026-06-23 12:19 UTC
To: David Epping
Cc: linux-nvme, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Leon Romanovsky, Joachim Foerster

On Tue, Jun 23, 2026 at 12:34:29PM +0200, David Epping wrote:
> My question is, and maybe I should have put this in my initial email
> explicitly, is there interest in having such functionality in the upstream
> Linux in-Kernel NVMe driver? An interface and mechanism to request and
> manage IO queues that are not used by the Linux NVMe driver to perform IO,
> but handed to a separate entity for this purpose.

Partitioning device resources to assign to special purposes should be
under a well defined framework. Unfortunately the only thing I know of
approaching this is SIOV. :)

Not sure how other maintainers and developers feel about it, but that's
the route I would go for this. It at least provides memory access on a
queue granularity and neatly separates the control plane.