* [PATCH RFC] nvme-ioctl: propagate PRP1 from ioctl to admin cmd
@ 2026-06-22 14:15 David Epping
  2026-06-22 14:35 ` Keith Busch
  2026-06-29  9:05 ` 顾泽兵
  0 siblings, 2 replies; 13+ messages in thread
From: David Epping @ 2026-06-22 14:15 UTC
  To: linux-nvme
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Leon Romanovsky, Joachim Foerster

Hello,

some NVMe admin commands (like IO queue creation) require information to
be provided via the PRP1 field of the command.
This information has not previously been passed from the ioctl to the NVMe
command structure, which prevented execution of such commands from userspace.
Likely this is not an oversight but intentional because of security concerns?

With this email I would like to find out whether there is interest in
supporting userspace NVMe IO queue creation in the upstream driver.
The attached patch is one way to do it and has been used for multiple years
with multiple Kernel versions now. Additional patches are required, for
example to allocate sufficiently many IO queues at initialization time, but
I would like to focus on the ability to create queues for now.
A final patch would also cover nvme_user_cmd64().

The system setup where this patch has been used is as follows:
- P2P PCIe capable CPU (currently also IOMMU disabled)
- patched Linux in-Kernel NVMe driver for local PCIe NVMe SSDs
- FPGA accelerator implementing NVMe IO queue memory and IO queue handling,
  exposed via PCIe BAR
- vfio-pci Kernel driver plus vfio userspace FPGA driver / application
- The userspace application creates new NVMe IO queues at the SSD using the
  patched admin ioctl and points them towards the FPGA BAR. It then informs
  the FPGA about the SSD BAR address and IO queue ID. From then on the FPGA
  can access the SSD storage entirely without software interaction.
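
For the last step above, the FPGA needs more than the SSD BAR address and
the queue ID: it also has to locate that queue's doorbell registers inside
the BAR. A minimal sketch of that calculation follows, based on the NVMe
doorbell layout (doorbells start at offset 0x1000, spaced by 4 << CAP.DSTRD
bytes). The function names are made up for illustration and are not taken
from the patch or from the driver.

```c
#include <assert.h>
#include <stdint.h>

/* Doorbell registers start at offset 0x1000 of the controller's BAR0. */
#define NVME_DOORBELL_BASE 0x1000u

/*
 * Offset of the submission queue tail doorbell for queue `qid`.
 * `dstrd` is the 4-bit CAP.DSTRD field read from the controller.
 */
static uint64_t nvme_sq_tail_doorbell(uint16_t qid, uint8_t dstrd)
{
	assert(dstrd <= 15);
	return NVME_DOORBELL_BASE + (uint64_t)(2u * qid) * (4u << dstrd);
}

/* Offset of the completion queue head doorbell for queue `qid`. */
static uint64_t nvme_cq_head_doorbell(uint16_t qid, uint8_t dstrd)
{
	assert(dstrd <= 15);
	return NVME_DOORBELL_BASE + (uint64_t)(2u * qid + 1u) * (4u << dstrd);
}
```

Adding these offsets to the SSD BAR address gives the two addresses the
FPGA would write to for a given queue ID.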

Since the omission of PRP1 access from userspace is likely intentional, maybe
the discussion and patches by Leon Romanovsky for making dmabuf and p2pdma
available via vfio lead in the right direction:
https://lore.kernel.org/all/20251106-dmabuf-vfio-v7-0-2503bf390699@nvidia.com/

It currently seems focused on SPDK for handling the SSDs in userspace, but the
author also describes it as a general mechanism that can support other
scenarios.
How about having the in-Kernel NVMe driver accept a dmabuf as IO queue location
for creating a new IO queue? And in turn it likely has to provide a small
dmabuf to the FPGA VFIO world for access to the queue doorbells on the SSD.

Looking forward to your feedback,
David

Signed-off-by: David Epping <david.epping@missinglinkelectronics.com>
---
 drivers/nvme/host/ioctl.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index ca86d3bf7ea4..d5d740f1b554 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -306,6 +306,7 @@ static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
 	c.common.nsid = cpu_to_le32(cmd.nsid);
 	c.common.cdw2[0] = cpu_to_le32(cmd.cdw2);
 	c.common.cdw2[1] = cpu_to_le32(cmd.cdw3);
+	c.common.dptr.prp1 = cpu_to_le64(cmd.addr);
 	c.common.cdw10 = cpu_to_le32(cmd.cdw10);
 	c.common.cdw11 = cpu_to_le32(cmd.cdw11);
 	c.common.cdw12 = cpu_to_le32(cmd.cdw12);
-- 
2.43.0




* Re: [PATCH RFC] nvme-ioctl: propagate PRP1 from ioctl to admin cmd
  2026-06-22 14:15 [PATCH RFC] nvme-ioctl: propagate PRP1 from ioctl to admin cmd David Epping
@ 2026-06-22 14:35 ` Keith Busch
  2026-06-22 14:56   ` David Epping
  2026-06-29  9:05 ` 顾泽兵
  1 sibling, 1 reply; 13+ messages in thread
From: Keith Busch @ 2026-06-22 14:35 UTC
  To: David Epping
  Cc: linux-nvme, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Leon Romanovsky, Joachim Foerster

On Mon, Jun 22, 2026 at 04:15:42PM +0200, David Epping wrote:
> @@ -306,6 +306,7 @@ static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
>  	c.common.nsid = cpu_to_le32(cmd.nsid);
>  	c.common.cdw2[0] = cpu_to_le32(cmd.cdw2);
>  	c.common.cdw2[1] = cpu_to_le32(cmd.cdw3);
> +	c.common.dptr.prp1 = cpu_to_le64(cmd.addr);

This is not correct: the user space virtual address isn't the device
DMA'able address. The driver already handles mapping the user address to
kernel space, then to dma, then sets the PRP accordingly.



* Re: [PATCH RFC] nvme-ioctl: propagate PRP1 from ioctl to admin cmd
  2026-06-22 14:35 ` Keith Busch
@ 2026-06-22 14:56   ` David Epping
  2026-06-22 15:15     ` Keith Busch
  0 siblings, 1 reply; 13+ messages in thread
From: David Epping @ 2026-06-22 14:56 UTC
  To: Keith Busch
  Cc: linux-nvme, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Leon Romanovsky, Joachim Foerster

On Mon, Jun 22, 2026 at 08:35:42AM -0600, Keith Busch wrote:
> On Mon, Jun 22, 2026 at 04:15:42PM +0200, David Epping wrote:
> > @@ -306,6 +306,7 @@ static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
> >  	c.common.nsid = cpu_to_le32(cmd.nsid);
> >  	c.common.cdw2[0] = cpu_to_le32(cmd.cdw2);
> >  	c.common.cdw2[1] = cpu_to_le32(cmd.cdw3);
> > +	c.common.dptr.prp1 = cpu_to_le64(cmd.addr);
> 
> This is not correct: the user space virtual address isn't the device
> DMA'able address. The driver already handles mapping the user address to
> kernel space, then to dma, then sets the PRP accordingly.

To clarify, the ioctl struct addr field is not filled with a memory buffer
address by the userspace, but a PCIe mapped BAR address plus offset.
It is obtained by the userspace application operating the FPGA vfio device
by reading from PCI config space via VFIO_PCI_CONFIG_REGION_INDEX.
So it is the address Linux assigned to that BAR (plus offset).
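
A minimal sketch of how such a BAR address can be decoded from the raw
config space values follows. It assumes the two BAR dwords have already
been read (for example with pread() on the vfio device fd at the config
region offset reported by VFIO_DEVICE_GET_REGION_INFO, plus 0x10 for BAR0)
and converted to host byte order. The macro and function names are local
to this example; the bit layout is the standard PCI one. The result is a
PCI bus address, which only matches what the SSD can use directly in a
setup without address translation, such as the IOMMU-disabled one
described earlier in the thread.

```c
#include <stdbool.h>
#include <stdint.h>

#define BAR_SPACE_IO      0x1u        /* bit 0: 1 = I/O space, 0 = memory */
#define BAR_MEM_TYPE_MASK 0x6u        /* bits 2:1: memory BAR type */
#define BAR_MEM_TYPE_64   0x4u        /* type 0b10: 64-bit BAR */
#define BAR_MEM_ADDR_MASK (~0xfull)   /* low 4 bits are flags, not address */

/*
 * Decode a memory BAR. `lo` is the BAR dword itself, `hi` is the next
 * dword, which is only used when the BAR is 64-bit.
 * Returns false for an I/O BAR, true otherwise with *addr filled in.
 */
static bool bar_bus_address(uint32_t lo, uint32_t hi, uint64_t *addr)
{
	uint64_t val = lo;

	if (lo & BAR_SPACE_IO)
		return false;
	if ((lo & BAR_MEM_TYPE_MASK) == BAR_MEM_TYPE_64)
		val |= (uint64_t)hi << 32;
	*addr = val & BAR_MEM_ADDR_MASK;
	return true;
}
```

The queue location passed on would then be this address plus the offset
of the queue memory inside the BAR.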



* Re: [PATCH RFC] nvme-ioctl: propagate PRP1 from ioctl to admin cmd
  2026-06-22 14:56   ` David Epping
@ 2026-06-22 15:15     ` Keith Busch
  2026-06-23 10:34       ` David Epping
  0 siblings, 1 reply; 13+ messages in thread
From: Keith Busch @ 2026-06-22 15:15 UTC
  To: David Epping
  Cc: linux-nvme, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Leon Romanovsky, Joachim Foerster

On Mon, Jun 22, 2026 at 04:56:22PM +0200, David Epping wrote:
> On Mon, Jun 22, 2026 at 08:35:42AM -0600, Keith Busch wrote:
> > On Mon, Jun 22, 2026 at 04:15:42PM +0200, David Epping wrote:
> > > @@ -306,6 +306,7 @@ static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
> > >  	c.common.nsid = cpu_to_le32(cmd.nsid);
> > >  	c.common.cdw2[0] = cpu_to_le32(cmd.cdw2);
> > >  	c.common.cdw2[1] = cpu_to_le32(cmd.cdw3);
> > > +	c.common.dptr.prp1 = cpu_to_le64(cmd.addr);
> > 
> > This is not correct: the user space virtual address isn't the device
> > DMA'able address. The driver already handles mapping the user address to
> > kernel space, then to dma, then sets the PRP accordingly.
> 
> To clarify, the ioctl struct addr field is not filled with a memory buffer
> address by the userspace, but a PCIe mapped BAR address plus offset.
> It is obtained by the userspace application operating the FPGA vfio device
> by reading from PCI config space via VFIO_PCI_CONFIG_REGION_INDEX.
> So it is the address Linux assigned to that BAR (plus offset).

The driver and block layer should already handle PCIe addresses. You're
supposed to mmap it to user space first though, and pass that address in
instead. And you'd also need to set cmd.data_len to a non-zero value so
the driver doesn't skip setting up the data pointers.

Note, creating IO queues from user space, while not explicitly prevented
today, is not supported. The driver doesn't know you've done this so the
queue isn't properly handled on a controller reset.
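
The two points above (pass the mmap()ed user address, set a non-zero
data_len) can be sketched as follows for a Create I/O Completion Queue
command. This is only an illustration under assumptions: it fills the
struct from <linux/nvme_ioctl.h> and does not issue the ioctl, the helper
and macro names are invented for the example, and whether the driver's
mapping path accepts a given BAR mapping on a given kernel version is not
confirmed here. The field encodings (opcode 0x05, zero-based queue size in
CDW10 bits 31:16, queue ID in bits 15:0, PC in CDW11 bit 0, IEN in bit 1)
follow the NVMe specification.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <linux/nvme_ioctl.h>

#define EX_ADMIN_CREATE_CQ 0x05u /* Create I/O Completion Queue opcode */
#define EX_CQE_BYTES       16u   /* size of one completion queue entry */
#define EX_CQ_PHYS_CONTIG  0x1u  /* CDW11 bit 0: physically contiguous */

/*
 * Fill `cmd` so that the queue memory is described by a user virtual
 * address (`queue_va`, e.g. obtained from mmap()) plus a length, leaving
 * the translation to a DMA address and the PRP setup to the driver.
 */
static void build_create_cq(struct nvme_admin_cmd *cmd, void *queue_va,
			    uint16_t qid, uint16_t entries)
{
	assert(qid != 0 && entries >= 2);

	memset(cmd, 0, sizeof(*cmd));
	cmd->opcode = EX_ADMIN_CREATE_CQ;
	cmd->addr = (uint64_t)(uintptr_t)queue_va;
	/* Non-zero, so the driver does not skip the data pointer setup. */
	cmd->data_len = (uint32_t)entries * EX_CQE_BYTES;
	cmd->cdw10 = ((uint32_t)(entries - 1) << 16) | qid;
	/* Interrupt enable (bit 1) is left clear: the queue is polled. */
	cmd->cdw11 = EX_CQ_PHYS_CONTIG;
}
```

The filled struct would be submitted with ioctl(fd, NVME_IOCTL_ADMIN_CMD,
&cmd) on the controller device. The caveat about controller resets still
applies to any queue created this way.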



* Re: [PATCH RFC] nvme-ioctl: propagate PRP1 from ioctl to admin cmd
  2026-06-22 15:15     ` Keith Busch
@ 2026-06-23 10:34       ` David Epping
  2026-06-23 12:19         ` Keith Busch
  0 siblings, 1 reply; 13+ messages in thread
From: David Epping @ 2026-06-23 10:34 UTC
  To: Keith Busch
  Cc: linux-nvme, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Leon Romanovsky, Joachim Foerster

On Mon, Jun 22, 2026 at 09:15:40AM -0600, Keith Busch wrote:
> On Mon, Jun 22, 2026 at 04:56:22PM +0200, David Epping wrote:
> > On Mon, Jun 22, 2026 at 08:35:42AM -0600, Keith Busch wrote:
> > > On Mon, Jun 22, 2026 at 04:15:42PM +0200, David Epping wrote:
> > > > @@ -306,6 +306,7 @@ static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
> > > >  	c.common.nsid = cpu_to_le32(cmd.nsid);
> > > >  	c.common.cdw2[0] = cpu_to_le32(cmd.cdw2);
> > > >  	c.common.cdw2[1] = cpu_to_le32(cmd.cdw3);
> > > > +	c.common.dptr.prp1 = cpu_to_le64(cmd.addr);
> > > 
> > > This is not correct: the user space virtual address isn't the device
> > > DMA'able address. The driver already handles mapping the user address to
> > > kernel space, then to dma, then sets the PRP accordingly.
> > 
> > To clarify, the ioctl struct addr field is not filled with a memory buffer
> > address by the userspace, but a PCIe mapped BAR address plus offset.
> > It is obtained by the userspace application operating the FPGA vfio device
> > by reading from PCI config space via VFIO_PCI_CONFIG_REGION_INDEX.
> > So it is the address Linux assigned to that BAR (plus offset).
> 
> The driver and block layer should already handle PCIe addresses. You're
> supposed to mmap it to user space first though, and pass that address in
> instead. And you'd also need to set cmd.data_len to a non-zero value so
> the driver doesn't skip setting up the data pointers.
> 
> Note, creating IO queues from user space, while not explicitly prevented
> today, is not supported. The driver doesn't know you've done this so the
> queue isn't properly handled on a controller reset.
> 

Hi Keith, I understand that creating IO queues from user space is not
supported by the current driver. That's why we created patches for that a
couple of years ago and ported them to new Kernels since.

My question is, and maybe I should have put this in my initial email
explicitly, is there interest in having such functionality in the upstream
Linux in-Kernel NVMe driver? An interface and mechanism to request and
manage IO queues that are not used by the Linux NVMe driver to perform IO,
but handed to a separate entity for this purpose.

Of course an upstream implementation would have to take many more things
into account, like the reset you mentioned, and IOMMU setup, and much more.
But that's only worth looking at if there is upstream interest in it.

Thanks for your feedback, David



* Re: [PATCH RFC] nvme-ioctl: propagate PRP1 from ioctl to admin cmd
  2026-06-23 10:34       ` David Epping
@ 2026-06-23 12:19         ` Keith Busch
  2026-06-24  7:40           ` Christoph Hellwig
  0 siblings, 1 reply; 13+ messages in thread
From: Keith Busch @ 2026-06-23 12:19 UTC
  To: David Epping
  Cc: linux-nvme, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Leon Romanovsky, Joachim Foerster

On Tue, Jun 23, 2026 at 12:34:29PM +0200, David Epping wrote:
> My question is, and maybe I should have put this in my initial email
> explicitly, is there interest in having such functionality in the upstream
> Linux in-Kernel NVMe driver? An interface and mechanism to request and
> manage IO queues that are not used by the Linux NVMe driver to perform IO,
> but handed to a separate entity for this purpose.

Partitioning device resources to assign to special purposes should be
under a well defined framework. Unfortunately the only thing I know of
approaching this is SIOV. :) Not sure how other maintainers and
developers feel about it, but that's the route I would go for this. It
at least provides memory access on a queue granularity and neatly
separates the control plane.



* Re: [PATCH RFC] nvme-ioctl: propagate PRP1 from ioctl to admin cmd
  2026-06-23 12:19         ` Keith Busch
@ 2026-06-24  7:40           ` Christoph Hellwig
  2026-06-26 17:55             ` David Epping
  0 siblings, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2026-06-24  7:40 UTC
  To: Keith Busch
  Cc: David Epping, linux-nvme, Jens Axboe, Christoph Hellwig,
	Sagi Grimberg, Leon Romanovsky, Joachim Foerster

On Tue, Jun 23, 2026 at 06:19:08AM -0600, Keith Busch wrote:
> On Tue, Jun 23, 2026 at 12:34:29PM +0200, David Epping wrote:
> > My question is, and maybe I should have put this in my initial email
> > explicitly, is there interest in having such functionality in the upstream
> > Linux in-Kernel NVMe driver? An interface and mechanism to request and
> > manage IO queues that are not used by the Linux NVMe driver to perform IO,
> > but handed to a separate entity for this purpose.
> 
> Partitioning device resources to assign to special purposes should be
> under a well defined framework. Unfortunately the only thing I know of
> approaching this is SIOV. :) Not sure how other maintainers and
> developers feel about it, but that's the route I would go for this. It
> at least provides memory access on a queue granularity and neatly
> separates the control plane.

Yeah, we can't just hand out queues.  I/O to all namespaces can be done
on any queue, and any queue can address any IOVA, so this is fundamentally
unsafe.  Add to that fun like abort handling and it's just not going
to work at all.  We had at least two previous public attempts at such
schemes (Damien's libvnme back in the day, and the Mellanox nvmet
offloading) that were rejected for the same reason.



* Re: [PATCH RFC] nvme-ioctl: propagate PRP1 from ioctl to admin cmd
  2026-06-24  7:40           ` Christoph Hellwig
@ 2026-06-26 17:55             ` David Epping
  2026-06-26 22:22               ` Keith Busch
  0 siblings, 1 reply; 13+ messages in thread
From: David Epping @ 2026-06-26 17:55 UTC
  To: Christoph Hellwig
  Cc: Keith Busch, linux-nvme, Jens Axboe, Sagi Grimberg,
	Leon Romanovsky, Joachim Foerster

On Wed, Jun 24, 2026 at 09:40:45AM +0200, Christoph Hellwig wrote:
> On Tue, Jun 23, 2026 at 06:19:08AM -0600, Keith Busch wrote:
> > Partitioning device resources to assign to special purposes should be
> > under a well defined framework. Unfortunately the only thing I know of
> > approaching this is SIOV. :) Not sure how other maintainers and
> > developers feel about it, but that's the route I would go for this. It
> > at least provides memory access on a queue granularity and neatly
> > separates the control plane.
> 
> Yeah, we can't just hand out queues.  I/O to all namespaces can be done
> on any queue, and any queue can address any IOVA, so this is fundamentally
> unsafe.  Add to that fun like abort handling and it's just not going
> to work at all.  We had at least two previous public attempts at such
> schemes (Damien's libvnme back in the day, and the Mellanox nvmet
> offloading) that were rejected for the same reason.
> 

Thank you both for your feedback, I get the point. I'll definitely look
into using SRIOV or SPDK to migrate the system to an unmodified upstream
NVMe driver mid-term.
Thank you for such a stable base to build upon, David



* Re: [PATCH RFC] nvme-ioctl: propagate PRP1 from ioctl to admin cmd
  2026-06-26 17:55             ` David Epping
@ 2026-06-26 22:22               ` Keith Busch
  2026-06-29 12:20                 ` David Epping
  0 siblings, 1 reply; 13+ messages in thread
From: Keith Busch @ 2026-06-26 22:22 UTC
  To: David Epping
  Cc: Christoph Hellwig, linux-nvme, Jens Axboe, Sagi Grimberg,
	Leon Romanovsky, Joachim Foerster

On Fri, Jun 26, 2026 at 07:55:32PM +0200, David Epping wrote:
> Thank you both for your feedback, I get the point. I'll definitely look
> into using SRIOV or SPDK to migrate the system to an unmodified upstream
> NVMe driver mid-term.

SRIOV could definitely get you there with existing capable hardware and
software as long as you don't need to exceed the VF count, but it is a
bit heavy for what you're describing.

My SIOV suggestion is more fine grained for similar use cases, however
there's no nvme standard or kernel support for the feature, so anything
using the concepts would be a custom solution; NVMe would need some
mechanism to associate an IO queue to a PASID, then attach namespace
access to that queue. After that it's just a matter of implementing the
"mediated" device.

If you're interested, this is a recent proposal to generically set up
SIOV, but it needs some work:

  https://lore.kernel.org/linux-pci/20260604150153.3619662-1-dimitri.daskalakis1@gmail.com/



* Re: [PATCH RFC] nvme-ioctl: propagate PRP1 from ioctl to admin cmd
  2026-06-22 14:15 [PATCH RFC] nvme-ioctl: propagate PRP1 from ioctl to admin cmd David Epping
  2026-06-22 14:35 ` Keith Busch
@ 2026-06-29  9:05 ` 顾泽兵
  2026-06-29 13:02   ` David Epping
  1 sibling, 1 reply; 13+ messages in thread
From: 顾泽兵 @ 2026-06-29  9:05 UTC
  To: David Epping
  Cc: linux-nvme, Keith Busch, Jens Axboe, Christoph Hellwig,
	Sagi Grimberg, Leon Romanovsky, Joachim Foerster

> The system setup where this patch has been used is as follows:
> - P2P PCIe capable CPU (currently also IOMMU disabled)
> - patched Linux in-Kernel NVMe driver for local PCIe NVMe SSDs
> - FPGA accelerator implementing NVMe IO queue memory and IO queue handling,
>   exposed via PCIe BAR
> - vfio-pci Kernel driver plus vfio userspace FPGA driver / application
> - The userspace application creates new NVMe IO queues at the SSD using the
>   patched admin ioctl and points them towards the FPGA BAR. It then informs
>   the FPGA about the SSD BAR address and IO queue ID. From then on the FPGA
>   can access the SSD storage entirely without software interaction.

Hi David,

I would like to ask for your insight on one point about the FPGA
queue-offload setup described in the RFC. This is not about the PRP1
ioctl change itself; I am personally interested in FPGA/NVMe datapath
offload and would like to better understand how your setup handled this.

For the I/O queues handled by the FPGA, how does the FPGA learn that the
SSD has posted new CQEs?

Did your implementation disable interrupts for those CQs and let the
FPGA poll the CQ phase tag, or did you use MSI/MSI-X with the
corresponding NVMe MSI-X vector targeting an FPGA BAR event register
instead of the host interrupt controller?

I also wonder how the I/O work was submitted to the FPGA in this model.
Does the CPU still provide the FPGA with per-I/O information such as the
data buffer address and the NVMe namespace/LBA range, while the FPGA
then builds and submits the NVMe commands? Or is the FPGA able to derive
most of that by itself after the initial queue setup?

Thanks,
Guzebing



* Re: [PATCH RFC] nvme-ioctl: propagate PRP1 from ioctl to admin cmd
  2026-06-26 22:22               ` Keith Busch
@ 2026-06-29 12:20                 ` David Epping
  2026-06-29 12:28                   ` Christoph Hellwig
  0 siblings, 1 reply; 13+ messages in thread
From: David Epping @ 2026-06-29 12:20 UTC
  To: Keith Busch
  Cc: Christoph Hellwig, linux-nvme, Jens Axboe, Sagi Grimberg,
	Leon Romanovsky, Joachim Foerster

On Fri, Jun 26, 2026 at 04:22:32PM -0600, Keith Busch wrote:
> On Fri, Jun 26, 2026 at 07:55:32PM +0200, David Epping wrote:
> > Thank you both for your feedback, I get the point. I'll definitely look
> > into using SRIOV or SPDK to migrate the system to an unmodified upstream
> > NVMe driver mid-term.
> 
> SRIOV could definitely get you there with existing capable hardware and
> software as long as you don't need to exceed the VF count, but it is a
> bit heavy for what you're describing.
> 
> My SIOV suggestion is more fine grained for similar use cases, however
> there's no nvme standard or kernel support for the feature, so anything
> using the concepts would be a custom solution; NVMe would need some
> mechanism to associate an IO queue to a PASID, then attach namespace
> access to that queue. After that it's just a matter of implementing the
> "mediated" device.
> 
> If you're interested, this is a recent proposal to generically set up
> SIOV, but it needs some work:
> 
>   https://lore.kernel.org/linux-pci/20260604150153.3619662-1-dimitri.daskalakis1@gmail.com/
> 

Keith, thank you for the follow-up and for catching my mistake. I noticed
the missing R and just assumed it was a typo... Sorry.
I will absolutely look into SIOV to understand the concept!



* Re: [PATCH RFC] nvme-ioctl: propagate PRP1 from ioctl to admin cmd
  2026-06-29 12:20                 ` David Epping
@ 2026-06-29 12:28                   ` Christoph Hellwig
  0 siblings, 0 replies; 13+ messages in thread
From: Christoph Hellwig @ 2026-06-29 12:28 UTC
  To: David Epping
  Cc: Keith Busch, Christoph Hellwig, linux-nvme, Jens Axboe,
	Sagi Grimberg, Leon Romanovsky, Joachim Foerster

On Mon, Jun 29, 2026 at 02:20:27PM +0200, David Epping wrote:
> Keith, thank you for the follow-up and for catching my mistake. I noticed
> the missing R and just assumed it was a typo... Sorry.
> I will absolutely look into SIOV to understand the concept!

Note that SIOV would indeed be very interesting for these use cases,
but it requires spec work and hardware support.  The NVMe technical
working group is looking into a proposal for it at the moment.



* Re: [PATCH RFC] nvme-ioctl: propagate PRP1 from ioctl to admin cmd
  2026-06-29  9:05 ` 顾泽兵
@ 2026-06-29 13:02   ` David Epping
  0 siblings, 0 replies; 13+ messages in thread
From: David Epping @ 2026-06-29 13:02 UTC
  To: 顾泽兵
  Cc: linux-nvme, Keith Busch, Jens Axboe, Christoph Hellwig,
	Sagi Grimberg, Leon Romanovsky, Joachim Foerster

On Mon, Jun 29, 2026 at 05:05:51PM +0800, 顾泽兵 wrote:
> > The system setup where this patch has been used is as follows:
> > - P2P PCIe capable CPU (currently also IOMMU disabled)
> > - patched Linux in-Kernel NVMe driver for local PCIe NVMe SSDs
> > - FPGA accelerator implementing NVMe IO queue memory and IO queue handling,
> >   exposed via PCIe BAR
> > - vfio-pci Kernel driver plus vfio userspace FPGA driver / application
> > - The userspace application creates new NVMe IO queues at the SSD using the
> >   patched admin ioctl and points them towards the FPGA BAR. It then informs
> >   the FPGA about the SSD BAR address and IO queue ID. From then on the FPGA
> >   can access the SSD storage entirely without software interaction.
> 
> Hi David,
> 
> I would like to ask for your insight on one point about the FPGA
> queue-offload setup described in the RFC. This is not about the PRP1
> ioctl change itself; I am personally interested in FPGA/NVMe datapath
> offload and would like to better understand how your setup handled this.
> 
> For the I/O queues handled by the FPGA, how does the FPGA learn that the
> SSD has posted new CQEs?
> 
> Did your implementation disable interrupts for those CQs and let the
> FPGA poll the CQ phase tag, or did you use MSI/MSI-X with the
> corresponding NVMe MSI-X vector targeting an FPGA BAR event register
> instead of the host interrupt controller?
> 
> I also wonder how the I/O work was submitted to the FPGA in this model.
> Does the CPU still provide the FPGA with per-I/O information such as the
> data buffer address and the NVMe namespace/LBA range, while the FPGA
> then builds and submits the NVMe commands? Or is the FPGA able to derive
> most of that by itself after the initial queue setup?
> 
> Thanks,
> Guzebing
> 

Hi Guzebing,

the I/O queues managed by the FPGA are implemented as FPGA internal SRAM,
and thus the FPGA sees and performs every single queue memory access.
As you assumed, interrupts are disabled for these queues, and software
would call this polling, but for the FPGA it is instantaneous knowledge
about the access.
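
For readers more familiar with the software side, the phase tag polling
mentioned in the question can be sketched like this. It is not the FPGA
implementation, which observes the writes to its own SRAM directly; it is
a minimal software analogue under assumptions: a little-endian host, queue
memory that starts zeroed, and struct and function names invented for the
example. The 16-byte entry layout with the phase tag in bit 0 of the
status word follows the NVMe specification.

```c
#include <stdbool.h>
#include <stdint.h>

/* One 16-byte completion queue entry (little-endian host assumed). */
struct cqe {
	uint32_t result;
	uint32_t rsvd;
	uint16_t sq_head;
	uint16_t sq_id;
	uint16_t command_id;
	uint16_t status;	/* bit 0: phase tag, bits 15:1: status */
};

struct cq_state {
	const volatile struct cqe *ring;
	uint16_t depth;
	uint16_t head;
	uint8_t phase;		/* phase value that marks a new entry */
};

static void cq_init(struct cq_state *cq, const volatile struct cqe *ring,
		    uint16_t depth)
{
	cq->ring = ring;
	cq->depth = depth;
	cq->head = 0;
	cq->phase = 1;	/* the controller writes phase 1 on its first pass */
}

/* Returns true and copies one entry if the controller posted a new CQE. */
static bool cq_poll(struct cq_state *cq, struct cqe *out)
{
	const volatile struct cqe *e = &cq->ring[cq->head];
	uint16_t status = e->status;

	if ((status & 1u) != cq->phase)
		return false;
	/* Real hardware needs a read barrier here before the other fields. */
	out->result = e->result;
	out->rsvd = e->rsvd;
	out->sq_head = e->sq_head;
	out->sq_id = e->sq_id;
	out->command_id = e->command_id;
	out->status = status;
	if (++cq->head == cq->depth) {
		cq->head = 0;
		cq->phase ^= 1u;	/* expected phase flips on every wrap */
	}
	return true;
}
```

After consuming entries, the consumer writes the new head index to the
queue's completion queue head doorbell on the SSD.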

After initial I/O queue setup the FPGA operates completely autonomously as
far as NVMe is concerned.
There is additional Linux userspace software controlling the operation
and telling the FPGA which linear range of LBAs it is allowed to access,
but that is not known or enforced at the NVMe driver/protocol level.
As such, Linux simultaneous access to the same LBAs is technically
possible, but does not make sense because of caching.

We use the FPGA to record data from external sources (FPGA attached
network interfaces, high-speed ADCs, ...) to a set of NVMe SSDs in RAID
configuration. Linux never gets to see this data (or even knows this is
happening). Only after the recording Linux may open and use the RAID
block device (we use mdraid structures). This mutually exclusive access
scheduling is managed by userspace software.

Best regards, David



Thread overview: 13+ messages
2026-06-22 14:15 [PATCH RFC] nvme-ioctl: propagate PRP1 from ioctl to admin cmd David Epping
2026-06-22 14:35 ` Keith Busch
2026-06-22 14:56   ` David Epping
2026-06-22 15:15     ` Keith Busch
2026-06-23 10:34       ` David Epping
2026-06-23 12:19         ` Keith Busch
2026-06-24  7:40           ` Christoph Hellwig
2026-06-26 17:55             ` David Epping
2026-06-26 22:22               ` Keith Busch
2026-06-29 12:20                 ` David Epping
2026-06-29 12:28                   ` Christoph Hellwig
2026-06-29  9:05 ` 顾泽兵
2026-06-29 13:02   ` David Epping