* Re: [PATCH RFC] nvme-ioctl: propagate PRP1 from ioctl to admin cmd
[not found] <ajlDjpjK_clMrnwx@ubuntu-server>
@ 2026-06-22 14:35 ` Keith Busch
2026-06-22 14:56 ` David Epping
2026-06-29 9:05 ` 顾泽兵
1 sibling, 1 reply; 12+ messages in thread
From: Keith Busch @ 2026-06-22 14:35 UTC (permalink / raw)
To: David Epping
Cc: linux-nvme, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
Leon Romanovsky, Joachim Foerster
On Mon, Jun 22, 2026 at 04:15:42PM +0200, David Epping wrote:
> @@ -306,6 +306,7 @@ static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
> c.common.nsid = cpu_to_le32(cmd.nsid);
> c.common.cdw2[0] = cpu_to_le32(cmd.cdw2);
> c.common.cdw2[1] = cpu_to_le32(cmd.cdw3);
> + c.common.dptr.prp1 = cpu_to_le64(cmd.addr);
This is not correct: the user space virtual address isn't the device
DMA'able address. The driver already handles mapping the user address to
kernel space, then to dma, then sets the PRP accordingly.
* Re: [PATCH RFC] nvme-ioctl: propagate PRP1 from ioctl to admin cmd
2026-06-22 14:35 ` [PATCH RFC] nvme-ioctl: propagate PRP1 from ioctl to admin cmd Keith Busch
@ 2026-06-22 14:56 ` David Epping
2026-06-22 15:15 ` Keith Busch
0 siblings, 1 reply; 12+ messages in thread
From: David Epping @ 2026-06-22 14:56 UTC (permalink / raw)
To: Keith Busch
Cc: linux-nvme, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
Leon Romanovsky, Joachim Foerster
On Mon, Jun 22, 2026 at 08:35:42AM -0600, Keith Busch wrote:
> On Mon, Jun 22, 2026 at 04:15:42PM +0200, David Epping wrote:
> > @@ -306,6 +306,7 @@ static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
> > c.common.nsid = cpu_to_le32(cmd.nsid);
> > c.common.cdw2[0] = cpu_to_le32(cmd.cdw2);
> > c.common.cdw2[1] = cpu_to_le32(cmd.cdw3);
> > + c.common.dptr.prp1 = cpu_to_le64(cmd.addr);
>
> This is not correct: the user space virtual address isn't the device
> DMA'able address. The driver already handles mapping the user address to
> kernel space, then to dma, then sets the PRP accordingly.
To clarify, the ioctl struct addr field is not filled with a memory buffer
address by the userspace, but a PCIe mapped BAR address plus offset.
It is obtained by the userspace application operating the FPGA vfio device
by reading from PCI config space via VFIO_PCI_CONFIG_REGION_INDEX.
So it is the address Linux assigned to that BAR (plus offset).
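For readers following along, the BAR lookup described above can be sketched in C as below. This is only an illustration under assumptions, not code from the patch: the helper names `cfg_read32` and `decode_mem_bar` are hypothetical, and the function works on a plain byte copy of the PCI config header (for example one filled by pread() on the vfio device fd at the offset reported for VFIO_PCI_CONFIG_REGION_INDEX).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PCI_CFG_BAR0_OFFSET 0x10u  /* first Base Address Register */

/* Read a little-endian 32-bit value from a config-space byte buffer. */
static uint32_t cfg_read32(const uint8_t *cfg, size_t off)
{
	return (uint32_t)cfg[off] |
	       ((uint32_t)cfg[off + 1] << 8) |
	       ((uint32_t)cfg[off + 2] << 16) |
	       ((uint32_t)cfg[off + 3] << 24);
}

/*
 * Decode memory BAR `bar` (0..5) from a copy of the config header
 * (at least 64 bytes). Returns false for I/O BARs and for a 64-bit
 * BAR that would start in the last BAR slot.
 */
static bool decode_mem_bar(const uint8_t *cfg, unsigned int bar, uint64_t *addr)
{
	if (bar > 5)
		return false;

	size_t off = PCI_CFG_BAR0_OFFSET + 4u * bar;
	uint32_t lo = cfg_read32(cfg, off);

	if (lo & 0x1u)                      /* bit 0 set: I/O space BAR */
		return false;

	uint64_t base = lo & ~0xFu;         /* bits 3:0 are type/prefetch flags */
	if (((lo >> 1) & 0x3u) == 0x2u) {   /* type 10b: 64-bit BAR */
		if (bar == 5)
			return false;       /* no slot left for the upper half */
		base |= (uint64_t)cfg_read32(cfg, off + 4) << 32;
	}
	*addr = base;
	return true;
}
```

The decoded value is a PCI bus address. Whether it may be handed to another device as a DMA target depends on the platform (IOMMU, P2P routing), which is what the rest of the thread discusses.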
* Re: [PATCH RFC] nvme-ioctl: propagate PRP1 from ioctl to admin cmd
2026-06-22 14:56 ` David Epping
@ 2026-06-22 15:15 ` Keith Busch
2026-06-23 10:34 ` David Epping
0 siblings, 1 reply; 12+ messages in thread
From: Keith Busch @ 2026-06-22 15:15 UTC (permalink / raw)
To: David Epping
Cc: linux-nvme, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
Leon Romanovsky, Joachim Foerster
On Mon, Jun 22, 2026 at 04:56:22PM +0200, David Epping wrote:
> On Mon, Jun 22, 2026 at 08:35:42AM -0600, Keith Busch wrote:
> > On Mon, Jun 22, 2026 at 04:15:42PM +0200, David Epping wrote:
> > > @@ -306,6 +306,7 @@ static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
> > > c.common.nsid = cpu_to_le32(cmd.nsid);
> > > c.common.cdw2[0] = cpu_to_le32(cmd.cdw2);
> > > c.common.cdw2[1] = cpu_to_le32(cmd.cdw3);
> > > + c.common.dptr.prp1 = cpu_to_le64(cmd.addr);
> >
> > This is not correct: the user space virtual address isn't the device
> > DMA'able address. The driver already handles mapping the user address to
> > kernel space, then to dma, then sets the PRP accordingly.
>
> To clarify, the ioctl struct addr field is not filled with a memory buffer
> address by the userspace, but a PCIe mapped BAR address plus offset.
> It is obtained by the userspace application operating the FPGA vfio device
> by reading from PCI config space via VFIO_PCI_CONFIG_REGION_INDEX.
> So it is the address Linux assigned to that BAR (plus offset).
The driver and block layer should already handle PCIe addresses. You're
supposed to mmap it to user space first though, and pass that address in
instead. And you'd also need to set cmd.data_len to a non-zero value so
the driver doesn't skip setting up the data pointers.
Note, creating IO queues from user space, while not explicitly prevented
today, is not supported. The driver doesn't know you've done this so the
queue isn't properly handled on a controller reset.
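As an illustration of the advice above (pass a user virtual address and set a non-zero data_len), here is a minimal sketch of filling the passthrough structure for a Create I/O Completion Queue admin command. It assumes the uapi header <linux/nvme_ioctl.h> is available; the helper name `fill_create_cq` and its parameters are hypothetical, and the caller is assumed to pass `entries` of at least 2.

```c
#include <stdint.h>
#include <string.h>
#include <linux/nvme_ioctl.h>   /* struct nvme_admin_cmd, NVME_IOCTL_ADMIN_CMD */

#define NVME_ADMIN_CREATE_CQ 0x05u  /* Create I/O Completion Queue opcode */
#define NVME_CQE_BYTES       16u    /* size of one completion queue entry */

/*
 * Fill a passthrough command that creates I/O completion queue `qid`
 * with `entries` slots, backed by `queue_mem`. `queue_mem` is a user
 * virtual address (for example the result of mmap()ing a BAR region),
 * not a bus address. Interrupts are left disabled for the queue.
 */
static void fill_create_cq(struct nvme_admin_cmd *cmd, void *queue_mem,
			   uint16_t qid, uint16_t entries)
{
	memset(cmd, 0, sizeof(*cmd));
	cmd->opcode   = NVME_ADMIN_CREATE_CQ;
	cmd->addr     = (uint64_t)(uintptr_t)queue_mem;
	/* Non-zero length, so the driver maps addr and sets up the data pointers. */
	cmd->data_len = (uint32_t)entries * NVME_CQE_BYTES;
	/* CDW10: queue size (zero-based) in bits 31:16, queue ID in bits 15:0. */
	cmd->cdw10    = ((uint32_t)(entries - 1) << 16) | qid;
	/* CDW11: PC=1 (physically contiguous), IEN=0 (no interrupt). */
	cmd->cdw11    = 0x1u;
}
```

The filled structure would then be submitted with ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd) on the controller character device. As noted above, queues created this way are unknown to the driver and are not handled on a controller reset.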
* Re: [PATCH RFC] nvme-ioctl: propagate PRP1 from ioctl to admin cmd
2026-06-22 15:15 ` Keith Busch
@ 2026-06-23 10:34 ` David Epping
2026-06-23 12:19 ` Keith Busch
0 siblings, 1 reply; 12+ messages in thread
From: David Epping @ 2026-06-23 10:34 UTC (permalink / raw)
To: Keith Busch
Cc: linux-nvme, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
Leon Romanovsky, Joachim Foerster
On Mon, Jun 22, 2026 at 09:15:40AM -0600, Keith Busch wrote:
> On Mon, Jun 22, 2026 at 04:56:22PM +0200, David Epping wrote:
> > On Mon, Jun 22, 2026 at 08:35:42AM -0600, Keith Busch wrote:
> > > On Mon, Jun 22, 2026 at 04:15:42PM +0200, David Epping wrote:
> > > > @@ -306,6 +306,7 @@ static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
> > > > c.common.nsid = cpu_to_le32(cmd.nsid);
> > > > c.common.cdw2[0] = cpu_to_le32(cmd.cdw2);
> > > > c.common.cdw2[1] = cpu_to_le32(cmd.cdw3);
> > > > + c.common.dptr.prp1 = cpu_to_le64(cmd.addr);
> > >
> > > This is not correct: the user space virtual address isn't the device
> > > DMA'able address. The driver already handles mapping the user address to
> > > kernel space, then to dma, then sets the PRP accordingly.
> >
> > To clarify, the ioctl struct addr field is not filled with a memory buffer
> > address by the userspace, but a PCIe mapped BAR address plus offset.
> > It is obtained by the userspace application operating the FPGA vfio device
> > by reading from PCI config space via VFIO_PCI_CONFIG_REGION_INDEX.
> > So it is the address Linux assigned to that BAR (plus offset).
>
> The driver and block layer should already handle PCIe addresses. You're
> supposed to mmap it to user space first though, and pass that address in
> instead. And you'd also need to set cmd.data_len to a non-zero value so
> the driver doesn't skip setting up the data pointers.
>
> Note, creating IO queues from user space, while not explicitly prevented
> today, is not supported. The driver doesn't know you've done this so the
> queue isn't properly handled on a controller reset.
>
Hi Keith, I understand that creating IO queues from user space is not
supported by the current driver. That's why we created patches for that a
couple of years ago and ported them to new Kernels since.
My question is, and maybe I should have put this in my initial email
explicitly, is there interest in having such functionality in the upstream
Linux in-Kernel NVMe driver? An interface and mechanism to request and
manage IO queues that are not used by the Linux NVMe driver to perform IO,
but handed to a separate entity for this purpose.
Of course an upstream implementation would have to take many more things
into account, like the reset you mentioned, and IOMMU setup, and much more.
But that's only worth looking at if there is upstream interest in it.
Thanks for your feedback, David
* Re: [PATCH RFC] nvme-ioctl: propagate PRP1 from ioctl to admin cmd
2026-06-23 10:34 ` David Epping
@ 2026-06-23 12:19 ` Keith Busch
2026-06-24 7:40 ` Christoph Hellwig
0 siblings, 1 reply; 12+ messages in thread
From: Keith Busch @ 2026-06-23 12:19 UTC (permalink / raw)
To: David Epping
Cc: linux-nvme, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
Leon Romanovsky, Joachim Foerster
On Tue, Jun 23, 2026 at 12:34:29PM +0200, David Epping wrote:
> My question is, and maybe I should have put this in my initial email
> explicitly, is there interest in having such functionality in the upstream
> Linux in-Kernel NVMe driver? An interface and mechanism to request and
> manage IO queues that are not used by the Linux NVMe driver to perform IO,
> but handed to a separate entity for this purpose.
Partitioning device resources to assign to special purposes should be
under a well defined framework. Unfortunately the only thing I know of
approaching this is SIOV. :) Not sure how other maintainers and
developers feel about it, but that's the route I would go for this. It
at least provides memory access on a queue granularity and neatly
separates the control plane.
* Re: [PATCH RFC] nvme-ioctl: propagate PRP1 from ioctl to admin cmd
2026-06-23 12:19 ` Keith Busch
@ 2026-06-24 7:40 ` Christoph Hellwig
2026-06-26 17:55 ` David Epping
0 siblings, 1 reply; 12+ messages in thread
From: Christoph Hellwig @ 2026-06-24 7:40 UTC (permalink / raw)
To: Keith Busch
Cc: David Epping, linux-nvme, Jens Axboe, Christoph Hellwig,
Sagi Grimberg, Leon Romanovsky, Joachim Foerster
On Tue, Jun 23, 2026 at 06:19:08AM -0600, Keith Busch wrote:
> On Tue, Jun 23, 2026 at 12:34:29PM +0200, David Epping wrote:
> > My question is, and maybe I should have put this in my initial email
> > explicitly, is there interest in having such functionality in the upstream
> > Linux in-Kernel NVMe driver? An interface and mechanism to request and
> > manage IO queues that are not used by the Linux NVMe driver to perform IO,
> > but handed to a separate entity for this purpose.
>
> Partitioning device resources to assign to special purposes should be
> under a well defined framework. Unfortunately the only thing I know of
> approaching this is SIOV. :) Not sure how other maintainers and
> developers feel about it, but that's the route I would go for this. It
> at least provides memory access on a queue granularity and neatly
> separates the control plane.
Yeah, we can't just hand out queues. I/O to all namespaces can be done
on any queue, and any queue can address any IOVA, so this is fundamentally
unsafe. Add to that fun like abort handling and it's just not going
to work at all. We had at least two previous public attempts at such
schemes (Damien's libvnme back in the day, and the Mellanox nvmet
offloading) that were rejected for the same reason.
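To make the point about namespaces and IOVAs concrete, here is a sketch of the 64-byte NVMe Read/Write submission queue entry. The struct and helper names (`nvme_rw_sqe`, `make_read`) are hypothetical and the layout assumes a little-endian host; it is an illustration, not the driver's definition. The namespace ID and the data pointer are fields of each command, so whoever writes entries into a queue picks them freely.

```c
#include <stdint.h>
#include <string.h>

/* Sketch of the 64-byte submission queue entry for NVM Read/Write. */
struct nvme_rw_sqe {
	uint8_t  opcode;
	uint8_t  flags;
	uint16_t cid;      /* command identifier */
	uint32_t nsid;     /* namespace: chosen per command, not per queue */
	uint64_t rsvd2;
	uint64_t mptr;     /* metadata pointer */
	uint64_t prp1;     /* data pointer: any address the device can reach */
	uint64_t prp2;
	uint64_t slba;     /* starting LBA */
	uint16_t nlb;      /* number of logical blocks, zero-based */
	uint16_t control;
	uint32_t dsmgmt;
	uint32_t reftag;
	uint16_t apptag;
	uint16_t appmask;
};

_Static_assert(sizeof(struct nvme_rw_sqe) == 64, "SQE must be 64 bytes");

/* Build a Read command; nothing ties nsid or dma_addr to the queue used. */
static struct nvme_rw_sqe make_read(uint16_t cid, uint32_t nsid,
				    uint64_t dma_addr, uint64_t slba,
				    uint16_t nblocks)
{
	struct nvme_rw_sqe sqe;

	memset(&sqe, 0, sizeof(sqe));
	sqe.opcode = 0x02;                     /* NVM Read */
	sqe.cid    = cid;
	sqe.nsid   = nsid;
	sqe.prp1   = dma_addr;
	sqe.slba   = slba;
	sqe.nlb    = (uint16_t)(nblocks - 1);  /* caller passes nblocks >= 1 */
	return sqe;
}
```

Nothing in the entry format restricts `nsid` or `prp1` to what the queue's owner was meant to touch, which is the safety problem described above.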
* Re: [PATCH RFC] nvme-ioctl: propagate PRP1 from ioctl to admin cmd
2026-06-24 7:40 ` Christoph Hellwig
@ 2026-06-26 17:55 ` David Epping
2026-06-26 22:22 ` Keith Busch
0 siblings, 1 reply; 12+ messages in thread
From: David Epping @ 2026-06-26 17:55 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Keith Busch, linux-nvme, Jens Axboe, Sagi Grimberg,
Leon Romanovsky, Joachim Foerster
On Wed, Jun 24, 2026 at 09:40:45AM +0200, Christoph Hellwig wrote:
> On Tue, Jun 23, 2026 at 06:19:08AM -0600, Keith Busch wrote:
> > Partitioning device resources to assign to special purposes should be
> > under a well defined framework. Unfortunately the only thing I know of
> > approaching this is SIOV. :) Not sure how other maintainers and
> > developers feel about it, but that's the route I would go for this. It
> > at least provides memory access on a queue granularity and neatly
> > separates the control plane.
>
> Yeah, we can't just hand out queues. I/O to all namespaces can be done
> on any queue, and any queue can address any IOVA, so this is fundamentally
> unsafe. Add to that fun like abort handling and it's just not going
> to work at all. We had at least two previous public attempts at such
> schemes (Damien's libvnme back in the day, and the Mellanox nvmet
> offloading) that were rejected for the same reason.
>
Thank you both for your feedback, I get the point. I'll definitely look
into using SRIOV or SPDK to migrate the system to an unmodified upstream
NVMe driver mid-term.
Thank you for such a stable base to build upon, David
* Re: [PATCH RFC] nvme-ioctl: propagate PRP1 from ioctl to admin cmd
2026-06-26 17:55 ` David Epping
@ 2026-06-26 22:22 ` Keith Busch
2026-06-29 12:20 ` David Epping
0 siblings, 1 reply; 12+ messages in thread
From: Keith Busch @ 2026-06-26 22:22 UTC (permalink / raw)
To: David Epping
Cc: Christoph Hellwig, linux-nvme, Jens Axboe, Sagi Grimberg,
Leon Romanovsky, Joachim Foerster
On Fri, Jun 26, 2026 at 07:55:32PM +0200, David Epping wrote:
> Thank you both for your feedback, I get the point. I'll definitely look
> into using SRIOV or SPDK to migrate the system to an unmodified upstream
> NVMe driver mid-term.
SRIOV could definitely get you there with existing capable hardware and
software as long as you don't need to exceed the VF count, but it is a
bit heavy for what you're describing.
My SIOV suggestion is more fine grained for similar use cases, however
there's no NVMe standard or kernel support for the feature, so anything
using the concepts would be a custom solution; NVMe would need some
mechanism to associate an IO queue to a PASID, then attach namespace
access to that queue. After that it's just a matter of implementing the
"mediated" device.
If you're interested, this is a recent proposal to generically set up
SIOV, but it needs some work:
https://lore.kernel.org/linux-pci/20260604150153.3619662-1-dimitri.daskalakis1@gmail.com/
* Re: [PATCH RFC] nvme-ioctl: propagate PRP1 from ioctl to admin cmd
[not found] <ajlDjpjK_clMrnwx@ubuntu-server>
2026-06-22 14:35 ` [PATCH RFC] nvme-ioctl: propagate PRP1 from ioctl to admin cmd Keith Busch
@ 2026-06-29 9:05 ` 顾泽兵
2026-06-29 13:02 ` David Epping
1 sibling, 1 reply; 12+ messages in thread
From: 顾泽兵 @ 2026-06-29 9:05 UTC (permalink / raw)
To: David Epping
Cc: linux-nvme, Keith Busch, Jens Axboe, Christoph Hellwig,
Sagi Grimberg, Leon Romanovsky, Joachim Foerster
> The system setup where this patch has been used is as follows:
> - P2P PCIe capable CPU (currently also IOMMU disabled)
> - patched Linux in-Kernel NVMe driver for local PCIe NVMe SSDs
> - FPGA accelerator implementing NVMe IO queue memory and IO queue handling,
> exposed via PCIe BAR
> - vfio-pci Kernel driver plus vfio userspace FPGA driver / application
> - The userspace application creates new NVMe IO queues at the SSD using the
> patched admin ioctl and points them towards the FPGA BAR. It then informs
> the FPGA about the SSD BAR address and IO queue ID. From then on the FPGA
> can access the SSD storage entirely without software interaction.
Hi David,
I would like to ask for your insight on one point about the FPGA
queue-offload setup described in the RFC. This is not about the PRP1
ioctl change itself; I am personally interested in FPGA/NVMe datapath
offload and would like to better understand how your setup handled this.
For the I/O queues handled by the FPGA, how does the FPGA learn that the
SSD has posted new CQEs?
Did your implementation disable interrupts for those CQs and let the
FPGA poll the CQ phase tag, or did you use MSI/MSI-X with the
corresponding NVMe MSI-X vector targeting an FPGA BAR event register
instead of the host interrupt controller?
I also wonder how the I/O work was submitted to the FPGA in this model.
Does the CPU still provide the FPGA with per-I/O information such as the
data buffer address and the NVMe namespace/LBA range, while the FPGA
then builds and submits the NVMe commands? Or is the FPGA able to derive
most of that by itself after the initial queue setup?
Thanks,
Guzebing
* Re: [PATCH RFC] nvme-ioctl: propagate PRP1 from ioctl to admin cmd
2026-06-26 22:22 ` Keith Busch
@ 2026-06-29 12:20 ` David Epping
2026-06-29 12:28 ` Christoph Hellwig
0 siblings, 1 reply; 12+ messages in thread
From: David Epping @ 2026-06-29 12:20 UTC (permalink / raw)
To: Keith Busch
Cc: Christoph Hellwig, linux-nvme, Jens Axboe, Sagi Grimberg,
Leon Romanovsky, Joachim Foerster
On Fri, Jun 26, 2026 at 04:22:32PM -0600, Keith Busch wrote:
> On Fri, Jun 26, 2026 at 07:55:32PM +0200, David Epping wrote:
> > Thank you both for your feedback, I get the point. I'll definitely look
> > into using SRIOV or SPDK to migrate the system to an unmodified upstream
> > NVMe driver mid-term.
>
> SRIOV could definitely get you there with existing capable hardware and
> software as long as you don't need to exceed the VF count, but it is a
> bit heavy for what you're describing.
>
> My SIOV suggestion is more fine grained for similar use cases, however
> there's no NVMe standard or kernel support for the feature, so anything
> using the concepts would be a custom solution; NVMe would need some
> mechanism to associate an IO queue to a PASID, then attach namespace
> access to that queue. After that it's just a matter of implementing the
> "mediated" device.
>
> If you're interested, this is a recent proposal to generically set up
> SIOV, but it needs some work:
>
> https://lore.kernel.org/linux-pci/20260604150153.3619662-1-dimitri.daskalakis1@gmail.com/
>
Keith, thank you for the follow-up and for catching my mistake. I noticed the
missing R and just assumed it was a typo... Sorry.
I will absolutely look into SIOV to understand the concept!
* Re: [PATCH RFC] nvme-ioctl: propagate PRP1 from ioctl to admin cmd
2026-06-29 12:20 ` David Epping
@ 2026-06-29 12:28 ` Christoph Hellwig
0 siblings, 0 replies; 12+ messages in thread
From: Christoph Hellwig @ 2026-06-29 12:28 UTC (permalink / raw)
To: David Epping
Cc: Keith Busch, Christoph Hellwig, linux-nvme, Jens Axboe,
Sagi Grimberg, Leon Romanovsky, Joachim Foerster
On Mon, Jun 29, 2026 at 02:20:27PM +0200, David Epping wrote:
> Keith, thank you for the follow-up and for catching my mistake. I noticed the
> missing R and just assumed it was a typo... Sorry.
> I will absolutely look into SIOV to understand the concept!
Note that SIOV would indeed be very interesting for these use cases,
but it requires spec work and hardware support. The NVMe technical
working group is looking into a proposal for it at the moment.
* Re: [PATCH RFC] nvme-ioctl: propagate PRP1 from ioctl to admin cmd
2026-06-29 9:05 ` 顾泽兵
@ 2026-06-29 13:02 ` David Epping
0 siblings, 0 replies; 12+ messages in thread
From: David Epping @ 2026-06-29 13:02 UTC (permalink / raw)
To: 顾泽兵
Cc: linux-nvme, Keith Busch, Jens Axboe, Christoph Hellwig,
Sagi Grimberg, Leon Romanovsky, Joachim Foerster
On Mon, Jun 29, 2026 at 05:05:51PM +0800, 顾泽兵 wrote:
> > The system setup where this patch has been used is as follows:
> > - P2P PCIe capable CPU (currently also IOMMU disabled)
> > - patched Linux in-Kernel NVMe driver for local PCIe NVMe SSDs
> > - FPGA accelerator implementing NVMe IO queue memory and IO queue handling,
> > exposed via PCIe BAR
> > - vfio-pci Kernel driver plus vfio userspace FPGA driver / application
> > - The userspace application creates new NVMe IO queues at the SSD using the
> > patched admin ioctl and points them towards the FPGA BAR. It then informs
> > the FPGA about the SSD BAR address and IO queue ID. From then on the FPGA
> > can access the SSD storage entirely without software interaction.
>
> Hi David,
>
> I would like to ask for your insight on one point about the FPGA
> queue-offload setup described in the RFC. This is not about the PRP1
> ioctl change itself; I am personally interested in FPGA/NVMe datapath
> offload and would like to better understand how your setup handled this.
>
> For the I/O queues handled by the FPGA, how does the FPGA learn that the
> SSD has posted new CQEs?
>
> Did your implementation disable interrupts for those CQs and let the
> FPGA poll the CQ phase tag, or did you use MSI/MSI-X with the
> corresponding NVMe MSI-X vector targeting an FPGA BAR event register
> instead of the host interrupt controller?
>
> I also wonder how the I/O work was submitted to the FPGA in this model.
> Does the CPU still provide the FPGA with per-I/O information such as the
> data buffer address and the NVMe namespace/LBA range, while the FPGA
> then builds and submits the NVMe commands? Or is the FPGA able to derive
> most of that by itself after the initial queue setup?
>
> Thanks,
> Guzebing
>
Hi Guzebing,
the I/O queues managed by the FPGA are implemented as FPGA internal SRAM,
and thus the FPGA sees and performs every single queue memory access.
As you assumed, interrupts are disabled for these queues, and software
would call this polling, but for the FPGA it is instantaneous knowledge
about the access.
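For comparison, this is roughly what the software equivalent of that completion detection looks like: a consumer checks the phase tag of the entry at its head index and flips the expected phase on wrap-around. The sketch below is an assumption-labelled illustration, not the FPGA logic; the names `nvme_cqe`, `cq_state` and `cq_poll` are hypothetical, and real code reading device-written memory would also need volatile accesses or memory barriers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch of the 16-byte NVMe completion queue entry. */
struct nvme_cqe {
	uint32_t result;   /* command specific */
	uint32_t rsvd;
	uint16_t sq_head;  /* how far the controller has consumed the SQ */
	uint16_t sq_id;
	uint16_t cid;      /* identifier of the completed command */
	uint16_t status;   /* bit 0 is the phase tag */
};

struct cq_state {
	const struct nvme_cqe *ring;
	uint16_t depth;    /* number of entries in the ring */
	uint16_t head;     /* next entry to inspect */
	uint8_t  phase;    /* expected phase tag, 1 after queue creation */
};

/* Return true and copy the entry if a new completion is at the head. */
static bool cq_poll(struct cq_state *cq, struct nvme_cqe *out)
{
	const struct nvme_cqe *e = &cq->ring[cq->head];

	if ((e->status & 1u) != cq->phase)
		return false;          /* stale entry from the previous pass */

	*out = *e;
	if (++cq->head == cq->depth) { /* wrap: the controller inverts the tag */
		cq->head = 0;
		cq->phase ^= 1u;
	}
	return true;
}
```

A consumer would also write the new head value to the CQ head doorbell after processing entries; that step is omitted here.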
After initial I/O queue setup the FPGA operates completely autonomously as
far as NVMe is concerned.
There is additional Linux userspace software controlling the operation
and telling the FPGA which linear range of LBAs it is allowed to access,
but that is not knowledge or enforcement at the NVMe driver/protocol level.
As such, simultaneous access by Linux to the same LBAs is technically
possible, but does not make sense because of caching.
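The kind of range restriction described above could be expressed as a simple window check before a command is issued. The following is a hypothetical sketch only (the names `lba_window` and `lba_access_allowed` are not from the actual system), and it shows why this is a cooperative convention rather than protection: the check lives in the issuer, not in the SSD or the NVMe driver.

```c
#include <stdbool.h>
#include <stdint.h>

/* Linear LBA range the issuer has been told it may use. */
struct lba_window {
	uint64_t first;  /* first permitted LBA */
	uint64_t count;  /* number of permitted LBAs */
};

/* True if [slba, slba + nblocks) lies entirely inside the window. */
static bool lba_access_allowed(const struct lba_window *w,
			       uint64_t slba, uint32_t nblocks)
{
	if (nblocks == 0 || slba < w->first)
		return false;

	uint64_t off = slba - w->first;  /* cannot underflow after the check */

	/* Compare against the remaining room to avoid 64-bit overflow. */
	return off < w->count && nblocks <= w->count - off;
}
```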
We use the FPGA to record data from external sources (FPGA attached
network interfaces, high-speed ADCs, ...) to a set of NVMe SSDs in RAID
configuration. Linux never gets to see this data (or even knows this is
happening). Only after the recording Linux may open and use the RAID
block device (we use mdraid structures). This mutually exclusive access
scheduling is managed by userspace software.
Best regards, David