[PATCH v5 0/9] vfio/pci: Allow MMIO regions to be exported through dma-buf

linux-block.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v5 0/9] vfio/pci: Allow MMIO regions to be exported through dma-buf
@ 2025-10-13 15:26 Leon Romanovsky
  2025-10-13 15:26 ` [PATCH v5 1/9] PCI/P2PDMA: Separate the mmap() support from the core logic Leon Romanovsky
                   ` (9 more replies)
  0 siblings, 10 replies; 45+ messages in thread
From: Leon Romanovsky @ 2025-10-13 15:26 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jason Gunthorpe, Andrew Morton, Bjorn Helgaas,
	Christian König, dri-devel, iommu, Jens Axboe, Joerg Roedel,
	kvm, linaro-mm-sig, linux-block, linux-kernel, linux-media,
	linux-mm, linux-pci, Logan Gunthorpe, Marek Szyprowski,
	Robin Murphy, Sumit Semwal, Vivek Kasireddy, Will Deacon

Changelog:
v5:
 * Rebased on top of v6.18-rc1.
 * Added more validation logic to make sure that DMA-BUF length doesn't
   overflow in various scenarios.
 * Hide kernel config from the users.
 * Fixed type conversion issue. DMA ranges are exposed with u64 length,
   but DMA-BUF uses "unsigned int" as a length for SG entries.
 * Added check to prevent from VFIO drivers which reports BAR size
   different from PCI, do not use DMA-BUF functionality.
v4: https://lore.kernel.org/all/cover.1759070796.git.leon@kernel.org
 * Split pcim_p2pdma_provider() to two functions, one that initializes
   array of providers and another to return right provider pointer.
v3: https://lore.kernel.org/all/cover.1758804980.git.leon@kernel.org
 * Changed pcim_p2pdma_enable() to be pcim_p2pdma_provider().
 * Cache provider in vfio_pci_dma_buf struct instead of BAR index.
 * Removed misleading comment from pcim_p2pdma_provider().
 * Moved MMIO check to be in pcim_p2pdma_provider().
v2: https://lore.kernel.org/all/cover.1757589589.git.leon@kernel.org/
 * Added extra patch which adds new CONFIG, so next patches can reuse
 * it.
 * Squashed "PCI/P2PDMA: Remove redundant bus_offset from map state"
   into the other patch.
 * Fixed revoke calls to be aligned with true->false semantics.
 * Extended p2pdma_providers to be per-BAR and not global to whole
 * device.
 * Fixed possible race between dmabuf states and revoke.
 * Moved revoke to PCI BAR zap block.
v1: https://lore.kernel.org/all/cover.1754311439.git.leon@kernel.org
 * Changed commit messages.
 * Reused DMA_ATTR_MMIO attribute.
 * Returned support for multiple DMA ranges per-dMABUF.
v0: https://lore.kernel.org/all/cover.1753274085.git.leonro@nvidia.com

---------------------------------------------------------------------------
Based on "[PATCH v6 00/16] dma-mapping: migrate to physical address-based API"
https://lore.kernel.org/all/cover.1757423202.git.leonro@nvidia.com/ series.
---------------------------------------------------------------------------

This series extends the VFIO PCI subsystem to support exporting MMIO
regions from PCI device BARs as dma-buf objects, enabling safe sharing of
non-struct page memory with controlled lifetime management. This allows RDMA
and other subsystems to import dma-buf FDs and build them into memory regions
for PCI P2P operations.

The series supports a use case for SPDK where a NVMe device will be
owned by SPDK through VFIO but interacting with a RDMA device. The RDMA
device may directly access the NVMe CMB or directly manipulate the NVMe
device's doorbell using PCI P2P.

However, as a general mechanism, it can support many other scenarios with
VFIO. This dmabuf approach can be usable by iommufd as well for generic
and safe P2P mappings.

In addition to the SPDK use-case mentioned above, the capability added
in this patch series can also be useful when a buffer (located in device
memory such as VRAM) needs to be shared between any two dGPU devices or
instances (assuming one of them is bound to VFIO PCI) as long as they
are P2P DMA compatible.

The implementation provides a revocable attachment mechanism using dma-buf
move operations. MMIO regions are normally pinned as BARs don't change
physical addresses, but access is revoked when the VFIO device is closed
or a PCI reset is issued. This ensures kernel self-defense against
potentially hostile userspace.

The series includes significant refactoring of the PCI P2PDMA subsystem
to separate core P2P functionality from memory allocation features,
making it more modular and suitable for VFIO use cases that don't need
struct page support.

-----------------------------------------------------------------------
The series is based originally on
https://lore.kernel.org/all/20250307052248.405803-1-vivek.kasireddy@intel.com/
but heavily rewritten to be based on DMA physical API.
-----------------------------------------------------------------------
The WIP branch can be found here:
https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git/log/?h=dmabuf-vfio-v5

Thanks

Leon Romanovsky (7):
  PCI/P2PDMA: Separate the mmap() support from the core logic
  PCI/P2PDMA: Simplify bus address mapping API
  PCI/P2PDMA: Refactor to separate core P2P functionality from memory
    allocation
  PCI/P2PDMA: Export pci_p2pdma_map_type() function
  types: move phys_vec definition to common header
  vfio/pci: Enable peer-to-peer DMA transactions by default
  vfio/pci: Add dma-buf export support for MMIO regions

Vivek Kasireddy (2):
  vfio: Export vfio device get and put registration helpers
  vfio/pci: Share the core device pointer while invoking feature
    functions

 block/blk-mq-dma.c                 |   7 +-
 drivers/iommu/dma-iommu.c          |   4 +-
 drivers/pci/p2pdma.c               | 175 ++++++++---
 drivers/vfio/pci/Kconfig           |   3 +
 drivers/vfio/pci/Makefile          |   2 +
 drivers/vfio/pci/vfio_pci_config.c |  22 +-
 drivers/vfio/pci/vfio_pci_core.c   |  63 ++--
 drivers/vfio/pci/vfio_pci_dmabuf.c | 446 +++++++++++++++++++++++++++++
 drivers/vfio/pci/vfio_pci_priv.h   |  23 ++
 drivers/vfio/vfio_main.c           |   2 +
 include/linux/pci-p2pdma.h         | 120 +++++---
 include/linux/types.h              |   5 +
 include/linux/vfio.h               |   2 +
 include/linux/vfio_pci_core.h      |   1 +
 include/uapi/linux/vfio.h          |  25 ++
 kernel/dma/direct.c                |   4 +-
 mm/hmm.c                           |   2 +-
 17 files changed, 785 insertions(+), 121 deletions(-)
 create mode 100644 drivers/vfio/pci/vfio_pci_dmabuf.c

-- 
2.51.0

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v5 1/9] PCI/P2PDMA: Separate the mmap() support from the core logic
  2025-10-13 15:26 [PATCH v5 0/9] vfio/pci: Allow MMIO regions to be exported through dma-buf Leon Romanovsky
@ 2025-10-13 15:26 ` Leon Romanovsky
  2025-10-17  6:30   ` Christoph Hellwig
  2025-10-13 15:26 ` [PATCH v5 2/9] PCI/P2PDMA: Simplify bus address mapping API Leon Romanovsky
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 45+ messages in thread
From: Leon Romanovsky @ 2025-10-13 15:26 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Leon Romanovsky, Jason Gunthorpe, Andrew Morton, Bjorn Helgaas,
	Christian König, dri-devel, iommu, Jens Axboe, Joerg Roedel,
	kvm, linaro-mm-sig, linux-block, linux-kernel, linux-media,
	linux-mm, linux-pci, Logan Gunthorpe, Marek Szyprowski,
	Robin Murphy, Sumit Semwal, Vivek Kasireddy, Will Deacon

From: Leon Romanovsky <leonro@nvidia.com>

Currently the P2PDMA code requires a pgmap and a struct page to
function. The was serving three important purposes:

 - DMA API compatibility, where scatterlist required a struct page as
   input

 - Life cycle management, the percpu_ref is used to prevent UAF during
   device hot unplug

 - A way to get the P2P provider data through the pci_p2pdma_pagemap

The DMA API now has a new flow, and has gained phys_addr_t support, so
it no longer needs struct pages to perform P2P mapping.

Lifecycle management can be delegated to the user, DMABUF for instance
has a suitable invalidation protocol that does not require struct page.

Finding the P2P provider data can also be managed by the caller
without need to look it up from the phys_addr.

Split the P2PDMA code into two layers. The optional upper layer,
effectively, provides a way to mmap() P2P memory into a VMA by
providing struct page, pgmap, a genalloc and sysfs.

The lower layer provides the actual P2P infrastructure and is wrapped
up in a new struct p2pdma_provider. Rework the mmap layer to use new
p2pdma_provider based APIs.

Drivers that do not want to put P2P memory into VMA's can allocate a
struct p2pdma_provider after probe() starts and free it before
remove() completes. When DMA mapping the driver must convey the struct
p2pdma_provider to the DMA mapping code along with a phys_addr of the
MMIO BAR slice to map. The driver must ensure that no DMA mapping
outlives the lifetime of the struct p2pdma_provider.

The intended target of this new API layer is DMABUF. There is usually
only a single p2pdma_provider for a DMABUF exporter. Most drivers can
establish the p2pdma_provider during probe, access the single instance
during DMABUF attach and use that to drive the DMA mapping.

DMABUF provides an invalidation mechanism that can guarantee all DMA
is halted and the DMA mappings are undone prior to destroying the
struct p2pdma_provider. This ensures there is no UAF through DMABUFs
that are lingering past driver removal.

The new p2pdma_provider layer cannot be used to create P2P memory that
can be mapped into VMA's, be used with pin_user_pages(), O_DIRECT, and
so on. These use cases must still use the mmap() layer. The
p2pdma_provider layer is principally for DMABUF-like use cases where
DMABUF natively manages the life cycle and access instead of
vmas/pin_user_pages()/struct page.

In addition, remove the bus_off field from pci_p2pdma_map_state since
it duplicates information already available in the pgmap structure.
The bus_offset is only used in one location (pci_p2pdma_bus_addr_map)
and is always identical to pgmap->bus_offset.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/pci/p2pdma.c       | 43 ++++++++++++++++++++------------------
 include/linux/pci-p2pdma.h | 19 ++++++++++++-----
 2 files changed, 37 insertions(+), 25 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 78e108e47254..59cd6fb40e83 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -28,9 +28,8 @@ struct pci_p2pdma {
 };
 
 struct pci_p2pdma_pagemap {
-	struct pci_dev *provider;
-	u64 bus_offset;
 	struct dev_pagemap pgmap;
+	struct p2pdma_provider mem;
 };
 
 static struct pci_p2pdma_pagemap *to_p2p_pgmap(struct dev_pagemap *pgmap)
@@ -204,8 +203,8 @@ static void p2pdma_page_free(struct page *page)
 {
 	struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page_pgmap(page));
 	/* safe to dereference while a reference is held to the percpu ref */
-	struct pci_p2pdma *p2pdma =
-		rcu_dereference_protected(pgmap->provider->p2pdma, 1);
+	struct pci_p2pdma *p2pdma = rcu_dereference_protected(
+		to_pci_dev(pgmap->mem.owner)->p2pdma, 1);
 	struct percpu_ref *ref;
 
 	gen_pool_free_owner(p2pdma->pool, (uintptr_t)page_to_virt(page),
@@ -270,14 +269,15 @@ static int pci_p2pdma_setup(struct pci_dev *pdev)
 
 static void pci_p2pdma_unmap_mappings(void *data)
 {
-	struct pci_dev *pdev = data;
+	struct pci_p2pdma_pagemap *p2p_pgmap = data;
 
 	/*
 	 * Removing the alloc attribute from sysfs will call
 	 * unmap_mapping_range() on the inode, teardown any existing userspace
 	 * mappings and prevent new ones from being created.
 	 */
-	sysfs_remove_file_from_group(&pdev->dev.kobj, &p2pmem_alloc_attr.attr,
+	sysfs_remove_file_from_group(&p2p_pgmap->mem.owner->kobj,
+				     &p2pmem_alloc_attr.attr,
 				     p2pmem_group.name);
 }
 
@@ -328,10 +328,9 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
 	pgmap->nr_range = 1;
 	pgmap->type = MEMORY_DEVICE_PCI_P2PDMA;
 	pgmap->ops = &p2pdma_pgmap_ops;
-
-	p2p_pgmap->provider = pdev;
-	p2p_pgmap->bus_offset = pci_bus_address(pdev, bar) -
-		pci_resource_start(pdev, bar);
+	p2p_pgmap->mem.owner = &pdev->dev;
+	p2p_pgmap->mem.bus_offset =
+		pci_bus_address(pdev, bar) - pci_resource_start(pdev, bar);
 
 	addr = devm_memremap_pages(&pdev->dev, pgmap);
 	if (IS_ERR(addr)) {
@@ -340,7 +339,7 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
 	}
 
 	error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_unmap_mappings,
-					 pdev);
+					 p2p_pgmap);
 	if (error)
 		goto pages_free;
 
@@ -972,16 +971,16 @@ void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
 }
 EXPORT_SYMBOL_GPL(pci_p2pmem_publish);
 
-static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
-						    struct device *dev)
+static enum pci_p2pdma_map_type
+pci_p2pdma_map_type(struct p2pdma_provider *provider, struct device *dev)
 {
 	enum pci_p2pdma_map_type type = PCI_P2PDMA_MAP_NOT_SUPPORTED;
-	struct pci_dev *provider = to_p2p_pgmap(pgmap)->provider;
+	struct pci_dev *pdev = to_pci_dev(provider->owner);
 	struct pci_dev *client;
 	struct pci_p2pdma *p2pdma;
 	int dist;
 
-	if (!provider->p2pdma)
+	if (!pdev->p2pdma)
 		return PCI_P2PDMA_MAP_NOT_SUPPORTED;
 
 	if (!dev_is_pci(dev))
@@ -990,7 +989,7 @@ static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
 	client = to_pci_dev(dev);
 
 	rcu_read_lock();
-	p2pdma = rcu_dereference(provider->p2pdma);
+	p2pdma = rcu_dereference(pdev->p2pdma);
 
 	if (p2pdma)
 		type = xa_to_value(xa_load(&p2pdma->map_types,
@@ -998,7 +997,7 @@ static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
 	rcu_read_unlock();
 
 	if (type == PCI_P2PDMA_MAP_UNKNOWN)
-		return calc_map_type_and_dist(provider, client, &dist, true);
+		return calc_map_type_and_dist(pdev, client, &dist, true);
 
 	return type;
 }
@@ -1006,9 +1005,13 @@ static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
 void __pci_p2pdma_update_state(struct pci_p2pdma_map_state *state,
 		struct device *dev, struct page *page)
 {
-	state->pgmap = page_pgmap(page);
-	state->map = pci_p2pdma_map_type(state->pgmap, dev);
-	state->bus_off = to_p2p_pgmap(state->pgmap)->bus_offset;
+	struct pci_p2pdma_pagemap *p2p_pgmap = to_p2p_pgmap(page_pgmap(page));
+
+	if (state->mem == &p2p_pgmap->mem)
+		return;
+
+	state->mem = &p2p_pgmap->mem;
+	state->map = pci_p2pdma_map_type(&p2p_pgmap->mem, dev);
 }
 
 /**
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index 951f81a38f3a..1400f3ad4299 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -16,6 +16,16 @@
 struct block_device;
 struct scatterlist;
 
+/**
+ * struct p2pdma_provider
+ *
+ * A p2pdma provider is a range of MMIO address space available to the CPU.
+ */
+struct p2pdma_provider {
+	struct device *owner;
+	u64 bus_offset;
+};
+
 #ifdef CONFIG_PCI_P2PDMA
 int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
 		u64 offset);
@@ -139,11 +149,11 @@ enum pci_p2pdma_map_type {
 };
 
 struct pci_p2pdma_map_state {
-	struct dev_pagemap *pgmap;
+	struct p2pdma_provider *mem;
 	enum pci_p2pdma_map_type map;
-	u64 bus_off;
 };
 
+
 /* helper for pci_p2pdma_state(), do not use directly */
 void __pci_p2pdma_update_state(struct pci_p2pdma_map_state *state,
 		struct device *dev, struct page *page);
@@ -162,8 +172,7 @@ pci_p2pdma_state(struct pci_p2pdma_map_state *state, struct device *dev,
 		struct page *page)
 {
 	if (IS_ENABLED(CONFIG_PCI_P2PDMA) && is_pci_p2pdma_page(page)) {
-		if (state->pgmap != page_pgmap(page))
-			__pci_p2pdma_update_state(state, dev, page);
+		__pci_p2pdma_update_state(state, dev, page);
 		return state->map;
 	}
 	return PCI_P2PDMA_MAP_NONE;
@@ -181,7 +190,7 @@ static inline dma_addr_t
 pci_p2pdma_bus_addr_map(struct pci_p2pdma_map_state *state, phys_addr_t paddr)
 {
 	WARN_ON_ONCE(state->map != PCI_P2PDMA_MAP_BUS_ADDR);
-	return paddr + state->bus_off;
+	return paddr + state->mem->bus_offset;
 }
 
 #endif /* _LINUX_PCI_P2P_H */
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 1/9] PCI/P2PDMA: Separate the mmap() support from the core logic
  2025-10-13 15:26 ` [PATCH v5 1/9] PCI/P2PDMA: Separate the mmap() support from the core logic Leon Romanovsky
@ 2025-10-17  6:30   ` Christoph Hellwig
  2025-10-17 11:53     ` Jason Gunthorpe
  0 siblings, 1 reply; 45+ messages in thread
From: Christoph Hellwig @ 2025-10-17  6:30 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Alex Williamson, Leon Romanovsky, Jason Gunthorpe, Andrew Morton,
	Bjorn Helgaas, Christian König, dri-devel, iommu, Jens Axboe,
	Joerg Roedel, kvm, linaro-mm-sig, linux-block, linux-kernel,
	linux-media, linux-mm, linux-pci, Logan Gunthorpe,
	Marek Szyprowski, Robin Murphy, Sumit Semwal, Vivek Kasireddy,
	Will Deacon

On Mon, Oct 13, 2025 at 06:26:03PM +0300, Leon Romanovsky wrote:
> The DMA API now has a new flow, and has gained phys_addr_t support, so
> it no longer needs struct pages to perform P2P mapping.

That's news to me.  All the pci_p2pdma_map_state machinery is still
based on pgmaps and thus pages.

> Lifecycle management can be delegated to the user, DMABUF for instance
> has a suitable invalidation protocol that does not require struct page.

How?


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 1/9] PCI/P2PDMA: Separate the mmap() support from the core logic
  2025-10-17  6:30   ` Christoph Hellwig
@ 2025-10-17 11:53     ` Jason Gunthorpe
  2025-10-20 12:27       ` Christoph Hellwig
  0 siblings, 1 reply; 45+ messages in thread
From: Jason Gunthorpe @ 2025-10-17 11:53 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Leon Romanovsky, Alex Williamson, Leon Romanovsky, Andrew Morton,
	Bjorn Helgaas, Christian König, dri-devel, iommu, Jens Axboe,
	Joerg Roedel, kvm, linaro-mm-sig, linux-block, linux-kernel,
	linux-media, linux-mm, linux-pci, Logan Gunthorpe,
	Marek Szyprowski, Robin Murphy, Sumit Semwal, Vivek Kasireddy,
	Will Deacon

On Thu, Oct 16, 2025 at 11:30:06PM -0700, Christoph Hellwig wrote:
> On Mon, Oct 13, 2025 at 06:26:03PM +0300, Leon Romanovsky wrote:
> > The DMA API now has a new flow, and has gained phys_addr_t support, so
> > it no longer needs struct pages to perform P2P mapping.
> 
> That's news to me.  All the pci_p2pdma_map_state machinery is still
> based on pgmaps and thus pages.

We had this discussion already three months ago:

https://lore.kernel.org/all/20250729131502.GJ36037@nvidia.com/

These couple patches make the core pci_p2pdma_map_state machinery work
on struct p2pdma_provider, and pgmap is just one way to get a
p2pdma_provider *

The struct page paths through pgmap go page->pgmap->mem to get
p2pdma_provider.

The non-struct page paths just have a p2pdma_provider * without a
pgmap. In this series VFIO uses

+	*provider = pcim_p2pdma_provider(pdev, bar);

To get the provider for a specific BAR.

> > Lifecycle management can be delegated to the user, DMABUF for instance
> > has a suitable invalidation protocol that does not require struct page.
> 
> How?

I think I've answered this three times now - for DMABUF the DMABUF
invalidation scheme is used to control the lifetime and no DMA mapping
outlives the provider, and the provider doesn't outlive the driver.

Hotplug works fine. VFIO gets the driver removal callback, it
invalidates all the DMABUFs, refuses to re-validate them, destroys the
P2P provider, and ends its driver. There is no lifetime issue.

Obviously you cannot use the new p2provider mechanism without some
kind of protection against use after hot unplug, but it doesn't have
to be struct page based.

For VFIO the invalidation scheme is linked to dma_buf_move_notify(),
for instance the hotunplug case goes:

static const struct vfio_device_ops vfio_pci_ops = {
   .close_device	= vfio_pci_core_close_device,

	vfio_pci_dma_buf_cleanup(vdev);

		dma_buf_move_notify(priv->dmabuf);

And then if we follow that into an importer like RDMA:

static struct dma_buf_attach_ops mlx5_ib_dmabuf_attach_ops = {
   .move_notify = mlx5_ib_dmabuf_invalidate_cb,

	mlx5r_umr_update_mr_pas(mr, MLX5_IB_UPD_XLT_ZAP);
	ib_umem_dmabuf_unmap_pages(umem_dmabuf);

	    dma_buf_unmap_attachment(umem_dmabuf->attach, umem_dmabuf->sgt,
				 DMA_BIDIRECTIONAL);
               vfio_pci_dma_buf_unmap()

XLT_ZAP tells the HW to stop doing DMA and the unmap_pages -> 
unmap_attachment -> vfio_pci_dma_buf_unmap()
flow will tear down the DMA API mapping and remove it from the
IOMMU. All of this happens before device_driver remove completes.

There is no lifecycle issue here and we don't need pgmap to solve a
livecycle problem or to help find the p2pdma_provider.

Jason

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 1/9] PCI/P2PDMA: Separate the mmap() support from the core logic
  2025-10-17 11:53     ` Jason Gunthorpe
@ 2025-10-20 12:27       ` Christoph Hellwig
  2025-10-20 12:58         ` Jason Gunthorpe
  0 siblings, 1 reply; 45+ messages in thread
From: Christoph Hellwig @ 2025-10-20 12:27 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, Leon Romanovsky, Alex Williamson,
	Leon Romanovsky, Andrew Morton, Bjorn Helgaas,
	Christian König, dri-devel, iommu, Jens Axboe, Joerg Roedel,
	kvm, linaro-mm-sig, linux-block, linux-kernel, linux-media,
	linux-mm, linux-pci, Logan Gunthorpe, Marek Szyprowski,
	Robin Murphy, Sumit Semwal, Vivek Kasireddy, Will Deacon

On Fri, Oct 17, 2025 at 08:53:20AM -0300, Jason Gunthorpe wrote:
> On Thu, Oct 16, 2025 at 11:30:06PM -0700, Christoph Hellwig wrote:
> > On Mon, Oct 13, 2025 at 06:26:03PM +0300, Leon Romanovsky wrote:
> > > The DMA API now has a new flow, and has gained phys_addr_t support, so
> > > it no longer needs struct pages to perform P2P mapping.
> > 
> > That's news to me.  All the pci_p2pdma_map_state machinery is still
> > based on pgmaps and thus pages.
> 
> We had this discussion already three months ago:
> 
> https://lore.kernel.org/all/20250729131502.GJ36037@nvidia.com/
> 
> These couple patches make the core pci_p2pdma_map_state machinery work
> on struct p2pdma_provider, and pgmap is just one way to get a
> p2pdma_provider *
> 
> The struct page paths through pgmap go page->pgmap->mem to get
> p2pdma_provider.
> 
> The non-struct page paths just have a p2pdma_provider * without a
> pgmap. In this series VFIO uses
> 
> +	*provider = pcim_p2pdma_provider(pdev, bar);
> 
> To get the provider for a specific BAR.

And what protects that life time?  I've not seen anyone actually
building the proper lifetime management.  And if someone did the patches
need to clearly point to that.

> I think I've answered this three times now - for DMABUF the DMABUF
> invalidation scheme is used to control the lifetime and no DMA mapping
> outlives the provider, and the provider doesn't outlive the driver.

How?

> Obviously you cannot use the new p2provider mechanism without some
> kind of protection against use after hot unplug, but it doesn't have
> to be struct page based.

And how does this interact with everyone else expecting pgmap based
lifetime management.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 1/9] PCI/P2PDMA: Separate the mmap() support from the core logic
  2025-10-20 12:27       ` Christoph Hellwig
@ 2025-10-20 12:58         ` Jason Gunthorpe
  2025-10-20 15:04           ` Leon Romanovsky
  2025-10-22  7:10           ` Christoph Hellwig
  0 siblings, 2 replies; 45+ messages in thread
From: Jason Gunthorpe @ 2025-10-20 12:58 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Leon Romanovsky, Alex Williamson, Leon Romanovsky, Andrew Morton,
	Bjorn Helgaas, Christian König, dri-devel, iommu, Jens Axboe,
	Joerg Roedel, kvm, linaro-mm-sig, linux-block, linux-kernel,
	linux-media, linux-mm, linux-pci, Logan Gunthorpe,
	Marek Szyprowski, Robin Murphy, Sumit Semwal, Vivek Kasireddy,
	Will Deacon

On Mon, Oct 20, 2025 at 05:27:02AM -0700, Christoph Hellwig wrote:
> On Fri, Oct 17, 2025 at 08:53:20AM -0300, Jason Gunthorpe wrote:
> > On Thu, Oct 16, 2025 at 11:30:06PM -0700, Christoph Hellwig wrote:
> > > On Mon, Oct 13, 2025 at 06:26:03PM +0300, Leon Romanovsky wrote:
> > > > The DMA API now has a new flow, and has gained phys_addr_t support, so
> > > > it no longer needs struct pages to perform P2P mapping.
> > > 
> > > That's news to me.  All the pci_p2pdma_map_state machinery is still
> > > based on pgmaps and thus pages.
> > 
> > We had this discussion already three months ago:
> > 
> > https://lore.kernel.org/all/20250729131502.GJ36037@nvidia.com/
> > 
> > These couple patches make the core pci_p2pdma_map_state machinery work
> > on struct p2pdma_provider, and pgmap is just one way to get a
> > p2pdma_provider *
> > 
> > The struct page paths through pgmap go page->pgmap->mem to get
> > p2pdma_provider.
> > 
> > The non-struct page paths just have a p2pdma_provider * without a
> > pgmap. In this series VFIO uses
> > 
> > +	*provider = pcim_p2pdma_provider(pdev, bar);
> > 
> > To get the provider for a specific BAR.
> 
> And what protects that life time?  I've not seen anyone actually
> building the proper lifetime management.  And if someone did the patches
> need to clearly point to that.

It is this series!

The above API gives a lifetime that is driver bound. The calling
driver must ensure it stops using provider and stops doing DMA with it
before remove() completes.

This VFIO series does that through the move_notify callchain I showed
in the previous email. This callchain is always triggered before
remove() of the VFIO PCI driver is completed.

> > I think I've answered this three times now - for DMABUF the DMABUF
> > invalidation scheme is used to control the lifetime and no DMA mapping
> > outlives the provider, and the provider doesn't outlive the driver.
> 
> How?

I explained it in detail in the message you are repling to. If
something is not clear can you please be more specific??

Is it the mmap in VFIO perhaps that is causing these questions?

VFIO uses a PFNMAP VMA, so you can't pin_user_page() it. It uses
unmap_mapping_range() during its remove() path to get rid of the VMA
PTEs.

The DMA activity doesn't use the mmap *at all*. It isn't like NVMe
which relies on the ZONE_DEVICE pages and VMAs to link drivers
togther.

Instead the DMABUF FD is used to pass the MMIO pages between VFIO and
another driver. DMABUF has a built in invalidation mechanism that VFIO
triggers before remove(). The invalidation removes access from the
other driver.

This is different than NVMe which has no invalidation. NVMe does
unmap_mapping_range() on the VMA and waits for all the short lived
pgmap references to clear. We don't need anything like that because
DMABUF invalidation is synchronous.

The full picture for VFIO is something like:

[startup]
  MMIO is acquired from the pci_resource
  p2p_providers are setup

[runtime]
  MMIO is mapped into PFNMAP VMAs
  MMIO is linked to a DMABUF FD
  DMABUF FD gets DMA mapped using the p2p_provider

[unplug]
  unmap_mapping_range() is called so all VMAs are emptied out and the
  fault handler prevents new PTEs 
    ** No access to the MMIO through VMAs is possible**

  vfio_pci_dma_buf_cleanup() is called which prevents new DMABUF
  mappings from starting, and does dma_buf_move_notify() on all the
  open DMABUF FDs to invalidate other drivers. Other drivers stop
  doing DMA and we need to free the IOVA from the IOMMU/etc.
    ** No DMA access from other drivers is possible now**

  Any still open DMABUF FD will fail inside VFIO immediately due to
  the priv->revoked checks.
    **No code touches the p2p_provider anymore**

  The p2p_provider is destroyed by devm.

> > Obviously you cannot use the new p2provider mechanism without some
> > kind of protection against use after hot unplug, but it doesn't have
> > to be struct page based.
> 
> And how does this interact with everyone else expecting pgmap based
> lifetime management.

They continue to use pgmap and nothing changes for them.

The pgmap path always waited until nothing was using the pgmap and
thus provider before allowing device driver remove() to complete.

The refactoring doesn't change the lifecycle model, it just provides
entry points to access the driver bound lifetime model directly
instead of being forced to use pgmap.

Leon, can you add some remarks to the comments about what the rules
are to call pcim_p2pdma_provider() ?

Jason

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 1/9] PCI/P2PDMA: Separate the mmap() support from the core logic
  2025-10-20 12:58         ` Jason Gunthorpe
@ 2025-10-20 15:04           ` Leon Romanovsky
  2025-10-22  7:10           ` Christoph Hellwig
  1 sibling, 0 replies; 45+ messages in thread
From: Leon Romanovsky @ 2025-10-20 15:04 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, Alex Williamson, Andrew Morton, Bjorn Helgaas,
	Christian König, dri-devel, iommu, Jens Axboe, Joerg Roedel,
	kvm, linaro-mm-sig, linux-block, linux-kernel, linux-media,
	linux-mm, linux-pci, Logan Gunthorpe, Marek Szyprowski,
	Robin Murphy, Sumit Semwal, Vivek Kasireddy, Will Deacon

On Mon, Oct 20, 2025 at 09:58:54AM -0300, Jason Gunthorpe wrote:
> On Mon, Oct 20, 2025 at 05:27:02AM -0700, Christoph Hellwig wrote:
> > On Fri, Oct 17, 2025 at 08:53:20AM -0300, Jason Gunthorpe wrote:
> > > On Thu, Oct 16, 2025 at 11:30:06PM -0700, Christoph Hellwig wrote:
> > > > On Mon, Oct 13, 2025 at 06:26:03PM +0300, Leon Romanovsky wrote:
> > > > > The DMA API now has a new flow, and has gained phys_addr_t support, so
> > > > > it no longer needs struct pages to perform P2P mapping.
> > > > 
> > > > That's news to me.  All the pci_p2pdma_map_state machinery is still
> > > > based on pgmaps and thus pages.
> > > 
> > > We had this discussion already three months ago:
> > > 
> > > https://lore.kernel.org/all/20250729131502.GJ36037@nvidia.com/
> > > 
> > > These couple patches make the core pci_p2pdma_map_state machinery work
> > > on struct p2pdma_provider, and pgmap is just one way to get a
> > > p2pdma_provider *
> > > 
> > > The struct page paths through pgmap go page->pgmap->mem to get
> > > p2pdma_provider.
> > > 
> > > The non-struct page paths just have a p2pdma_provider * without a
> > > pgmap. In this series VFIO uses
> > > 
> > > +	*provider = pcim_p2pdma_provider(pdev, bar);
> > > 
> > > To get the provider for a specific BAR.
> > 
> > And what protects that life time?  I've not seen anyone actually
> > building the proper lifetime management.  And if someone did the patches
> > need to clearly point to that.
> 
> It is this series!
> 
> The above API gives a lifetime that is driver bound. The calling
> driver must ensure it stops using provider and stops doing DMA with it
> before remove() completes.
> 
> This VFIO series does that through the move_notify callchain I showed
> in the previous email. This callchain is always triggered before
> remove() of the VFIO PCI driver is completed.
> 
> > > I think I've answered this three times now - for DMABUF the DMABUF
> > > invalidation scheme is used to control the lifetime and no DMA mapping
> > > outlives the provider, and the provider doesn't outlive the driver.
> > 
> > How?
> 
> I explained it in detail in the message you are repling to. If
> something is not clear can you please be more specific??
> 
> Is it the mmap in VFIO perhaps that is causing these questions?
> 
> VFIO uses a PFNMAP VMA, so you can't pin_user_page() it. It uses
> unmap_mapping_range() during its remove() path to get rid of the VMA
> PTEs.
> 
> The DMA activity doesn't use the mmap *at all*. It isn't like NVMe
> which relies on the ZONE_DEVICE pages and VMAs to link drivers
> togther.
> 
> Instead the DMABUF FD is used to pass the MMIO pages between VFIO and
> another driver. DMABUF has a built in invalidation mechanism that VFIO
> triggers before remove(). The invalidation removes access from the
> other driver.
> 
> This is different than NVMe which has no invalidation. NVMe does
> unmap_mapping_range() on the VMA and waits for all the short lived
> pgmap references to clear. We don't need anything like that because
> DMABUF invalidation is synchronous.
> 
> The full picture for VFIO is something like:
> 
> [startup]
>   MMIO is acquired from the pci_resource
>   p2p_providers are setup
> 
> [runtime]
>   MMIO is mapped into PFNMAP VMAs
>   MMIO is linked to a DMABUF FD
>   DMABUF FD gets DMA mapped using the p2p_provider
> 
> [unplug]
>   unmap_mapping_range() is called so all VMAs are emptied out and the
>   fault handler prevents new PTEs 
>     ** No access to the MMIO through VMAs is possible**
> 
>   vfio_pci_dma_buf_cleanup() is called which prevents new DMABUF
>   mappings from starting, and does dma_buf_move_notify() on all the
>   open DMABUF FDs to invalidate other drivers. Other drivers stop
>   doing DMA and we need to free the IOVA from the IOMMU/etc.
>     ** No DMA access from other drivers is possible now**
> 
>   Any still open DMABUF FD will fail inside VFIO immediately due to
>   the priv->revoked checks.
>     **No code touches the p2p_provider anymore**
> 
>   The p2p_provider is destroyed by devm.
> 
> > > Obviously you cannot use the new p2provider mechanism without some
> > > kind of protection against use after hot unplug, but it doesn't have
> > > to be struct page based.
> > 
> > And how does this interact with everyone else expecting pgmap based
> > lifetime management.
> 
> They continue to use pgmap and nothing changes for them.
> 
> The pgmap path always waited until nothing was using the pgmap and
> thus provider before allowing device driver remove() to complete.
> 
> The refactoring doesn't change the lifecycle model, it just provides
> entry points to access the driver bound lifetime model directly
> instead of being forced to use pgmap.
> 
> Leon, can you add some remarks to the comments about what the rules
> are to call pcim_p2pdma_provider() ?

Yes, sure.

Thanks

> 
> Jason

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 1/9] PCI/P2PDMA: Separate the mmap() support from the core logic
  2025-10-20 12:58         ` Jason Gunthorpe
  2025-10-20 15:04           ` Leon Romanovsky
@ 2025-10-22  7:10           ` Christoph Hellwig
  2025-10-22 11:43             ` Jason Gunthorpe
  1 sibling, 1 reply; 45+ messages in thread
From: Christoph Hellwig @ 2025-10-22  7:10 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, Leon Romanovsky, Alex Williamson,
	Leon Romanovsky, Andrew Morton, Bjorn Helgaas,
	Christian König, dri-devel, iommu, Jens Axboe, Joerg Roedel,
	kvm, linaro-mm-sig, linux-block, linux-kernel, linux-media,
	linux-mm, linux-pci, Logan Gunthorpe, Marek Szyprowski,
	Robin Murphy, Sumit Semwal, Vivek Kasireddy, Will Deacon

On Mon, Oct 20, 2025 at 09:58:54AM -0300, Jason Gunthorpe wrote:
> I explained it in detail in the message you are repling to. If
> something is not clear can you please be more specific??
> 
> Is it the mmap in VFIO perhaps that is causing these questions?
> 
> VFIO uses a PFNMAP VMA, so you can't pin_user_page() it. It uses
> unmap_mapping_range() during its remove() path to get rid of the VMA
> PTEs.

This all needs to g• into the explanation.

> Instead the DMABUF FD is used to pass the MMIO pages between VFIO and
> another driver. DMABUF has a built in invalidation mechanism that VFIO
> triggers before remove(). The invalidation removes access from the
> other driver.
> 
> This is different than NVMe which has no invalidation. NVMe does
> unmap_mapping_range() on the VMA and waits for all the short lived
> pgmap references to clear. We don't need anything like that because
> DMABUF invalidation is synchronous.

Please add documentation for this model to the source tree.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 1/9] PCI/P2PDMA: Separate the mmap() support from the core logic
  2025-10-22  7:10           ` Christoph Hellwig
@ 2025-10-22 11:43             ` Jason Gunthorpe
  0 siblings, 0 replies; 45+ messages in thread
From: Jason Gunthorpe @ 2025-10-22 11:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Leon Romanovsky, Alex Williamson, Leon Romanovsky, Andrew Morton,
	Bjorn Helgaas, Christian König, dri-devel, iommu, Jens Axboe,
	Joerg Roedel, kvm, linaro-mm-sig, linux-block, linux-kernel,
	linux-media, linux-mm, linux-pci, Logan Gunthorpe,
	Marek Szyprowski, Robin Murphy, Sumit Semwal, Vivek Kasireddy,
	Will Deacon

On Wed, Oct 22, 2025 at 12:10:35AM -0700, Christoph Hellwig wrote:
> On Mon, Oct 20, 2025 at 09:58:54AM -0300, Jason Gunthorpe wrote:
> > I explained it in detail in the message you are repling to. If
> > something is not clear can you please be more specific??
> > 
> > Is it the mmap in VFIO perhaps that is causing these questions?
> > 
> > VFIO uses a PFNMAP VMA, so you can't pin_user_page() it. It uses
> > unmap_mapping_range() during its remove() path to get rid of the VMA
> > PTEs.
> 
> This all needs to g• into the explanation.
> 
> > Instead the DMABUF FD is used to pass the MMIO pages between VFIO and
> > another driver. DMABUF has a built in invalidation mechanism that VFIO
> > triggers before remove(). The invalidation removes access from the
> > other driver.
> > 
> > This is different than NVMe which has no invalidation. NVMe does
> > unmap_mapping_range() on the VMA and waits for all the short lived
> > pgmap references to clear. We don't need anything like that because
> > DMABUF invalidation is synchronous.
> 
> Please add documentation for this model to the source tree.

Okay, Lets see what we can come up with. I think explaining the dmabuf
model with respect to the p2p provider in the new common dmabuf
mapping API code would make sense.

Jason
 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v5 2/9] PCI/P2PDMA: Simplify bus address mapping API
  2025-10-13 15:26 [PATCH v5 0/9] vfio/pci: Allow MMIO regions to be exported through dma-buf Leon Romanovsky
  2025-10-13 15:26 ` [PATCH v5 1/9] PCI/P2PDMA: Separate the mmap() support from the core logic Leon Romanovsky
@ 2025-10-13 15:26 ` Leon Romanovsky
  2025-10-13 15:26 ` [PATCH v5 3/9] PCI/P2PDMA: Refactor to separate core P2P functionality from memory allocation Leon Romanovsky
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 45+ messages in thread
From: Leon Romanovsky @ 2025-10-13 15:26 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Leon Romanovsky, Jason Gunthorpe, Andrew Morton, Bjorn Helgaas,
	Christian König, dri-devel, iommu, Jens Axboe, Joerg Roedel,
	kvm, linaro-mm-sig, linux-block, linux-kernel, linux-media,
	linux-mm, linux-pci, Logan Gunthorpe, Marek Szyprowski,
	Robin Murphy, Sumit Semwal, Vivek Kasireddy, Will Deacon

From: Leon Romanovsky <leonro@nvidia.com>

Update the pci_p2pdma_bus_addr_map() function to take a direct pointer
to the p2pdma_provider structure instead of the pci_p2pdma_map_state.
This simplifies the API by removing the need for callers to extract
the provider from the state structure.

The change updates all callers across the kernel (block layer, IOMMU,
DMA direct, and HMM) to pass the provider pointer directly, making
the code more explicit and reducing unnecessary indirection. This
also removes the runtime warning check since callers now have direct
control over which provider they use.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 block/blk-mq-dma.c         | 2 +-
 drivers/iommu/dma-iommu.c  | 4 ++--
 include/linux/pci-p2pdma.h | 7 +++----
 kernel/dma/direct.c        | 4 ++--
 mm/hmm.c                   | 2 +-
 5 files changed, 9 insertions(+), 10 deletions(-)

diff --git a/block/blk-mq-dma.c b/block/blk-mq-dma.c
index 9495b78b6fd3..badef1d925b2 100644
--- a/block/blk-mq-dma.c
+++ b/block/blk-mq-dma.c
@@ -85,7 +85,7 @@ static inline bool blk_can_dma_map_iova(struct request *req,
 
 static bool blk_dma_map_bus(struct blk_dma_iter *iter, struct phys_vec *vec)
 {
-	iter->addr = pci_p2pdma_bus_addr_map(&iter->p2pdma, vec->paddr);
+	iter->addr = pci_p2pdma_bus_addr_map(iter->p2pdma.mem, vec->paddr);
 	iter->len = vec->len;
 	return true;
 }
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 7944a3af4545..e52d19d2e833 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1439,8 +1439,8 @@ int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
 			 * as a bus address, __finalise_sg() will copy the dma
 			 * address into the output segment.
 			 */
-			s->dma_address = pci_p2pdma_bus_addr_map(&p2pdma_state,
-						sg_phys(s));
+			s->dma_address = pci_p2pdma_bus_addr_map(
+				p2pdma_state.mem, sg_phys(s));
 			sg_dma_len(s) = sg->length;
 			sg_dma_mark_bus_address(s);
 			continue;
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index 1400f3ad4299..9516ef97b17a 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -181,16 +181,15 @@ pci_p2pdma_state(struct pci_p2pdma_map_state *state, struct device *dev,
 /**
  * pci_p2pdma_bus_addr_map - Translate a physical address to a bus address
  *			     for a PCI_P2PDMA_MAP_BUS_ADDR transfer.
- * @state:	P2P state structure
+ * @provider:	P2P provider structure
  * @paddr:	physical address to map
  *
  * Map a physically contiguous PCI_P2PDMA_MAP_BUS_ADDR transfer.
  */
 static inline dma_addr_t
-pci_p2pdma_bus_addr_map(struct pci_p2pdma_map_state *state, phys_addr_t paddr)
+pci_p2pdma_bus_addr_map(struct p2pdma_provider *provider, phys_addr_t paddr)
 {
-	WARN_ON_ONCE(state->map != PCI_P2PDMA_MAP_BUS_ADDR);
-	return paddr + state->mem->bus_offset;
+	return paddr + provider->bus_offset;
 }
 
 #endif /* _LINUX_PCI_P2P_H */
diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 1f9ee9759426..d8b3dfc598b2 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -479,8 +479,8 @@ int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
 			}
 			break;
 		case PCI_P2PDMA_MAP_BUS_ADDR:
-			sg->dma_address = pci_p2pdma_bus_addr_map(&p2pdma_state,
-					sg_phys(sg));
+			sg->dma_address = pci_p2pdma_bus_addr_map(
+				p2pdma_state.mem, sg_phys(sg));
 			sg_dma_mark_bus_address(sg);
 			continue;
 		default:
diff --git a/mm/hmm.c b/mm/hmm.c
index 87562914670a..9bf0b831a029 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -811,7 +811,7 @@ dma_addr_t hmm_dma_map_pfn(struct device *dev, struct hmm_dma_map *map,
 		break;
 	case PCI_P2PDMA_MAP_BUS_ADDR:
 		pfns[idx] |= HMM_PFN_P2PDMA_BUS | HMM_PFN_DMA_MAPPED;
-		return pci_p2pdma_bus_addr_map(p2pdma_state, paddr);
+		return pci_p2pdma_bus_addr_map(p2pdma_state->mem, paddr);
 	default:
 		return DMA_MAPPING_ERROR;
 	}
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v5 3/9] PCI/P2PDMA: Refactor to separate core P2P functionality from memory allocation
  2025-10-13 15:26 [PATCH v5 0/9] vfio/pci: Allow MMIO regions to be exported through dma-buf Leon Romanovsky
  2025-10-13 15:26 ` [PATCH v5 1/9] PCI/P2PDMA: Separate the mmap() support from the core logic Leon Romanovsky
  2025-10-13 15:26 ` [PATCH v5 2/9] PCI/P2PDMA: Simplify bus address mapping API Leon Romanovsky
@ 2025-10-13 15:26 ` Leon Romanovsky
  2025-10-13 15:26 ` [PATCH v5 4/9] PCI/P2PDMA: Export pci_p2pdma_map_type() function Leon Romanovsky
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 45+ messages in thread
From: Leon Romanovsky @ 2025-10-13 15:26 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Leon Romanovsky, Jason Gunthorpe, Andrew Morton, Bjorn Helgaas,
	Christian König, dri-devel, iommu, Jens Axboe, Joerg Roedel,
	kvm, linaro-mm-sig, linux-block, linux-kernel, linux-media,
	linux-mm, linux-pci, Logan Gunthorpe, Marek Szyprowski,
	Robin Murphy, Sumit Semwal, Vivek Kasireddy, Will Deacon

From: Leon Romanovsky <leonro@nvidia.com>

Refactor the PCI P2PDMA subsystem to separate the core peer-to-peer DMA
functionality from the optional memory allocation layer. This creates a
two-tier architecture:

The core layer provides P2P mapping functionality for physical addresses
based on PCI device MMIO BARs and integrates with the DMA API for
mapping operations. This layer is required for all P2PDMA users.

The optional upper layer provides memory allocation capabilities
including gen_pool allocator, struct page support, and sysfs interface
for user space access.

This separation allows subsystems like VFIO to use only the core P2P
mapping functionality without the overhead of memory allocation features
they don't need. The core functionality is now available through the
new pcim_p2pdma_provider() function that returns a p2pdma_provider
structure.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/pci/p2pdma.c       | 139 ++++++++++++++++++++++++++++---------
 include/linux/pci-p2pdma.h |  11 +++
 2 files changed, 119 insertions(+), 31 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 59cd6fb40e83..a2ec7e93fd71 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -25,11 +25,12 @@ struct pci_p2pdma {
 	struct gen_pool *pool;
 	bool p2pmem_published;
 	struct xarray map_types;
+	struct p2pdma_provider mem[PCI_STD_NUM_BARS];
 };
 
 struct pci_p2pdma_pagemap {
 	struct dev_pagemap pgmap;
-	struct p2pdma_provider mem;
+	struct p2pdma_provider *mem;
 };
 
 static struct pci_p2pdma_pagemap *to_p2p_pgmap(struct dev_pagemap *pgmap)
@@ -204,7 +205,7 @@ static void p2pdma_page_free(struct page *page)
 	struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page_pgmap(page));
 	/* safe to dereference while a reference is held to the percpu ref */
 	struct pci_p2pdma *p2pdma = rcu_dereference_protected(
-		to_pci_dev(pgmap->mem.owner)->p2pdma, 1);
+		to_pci_dev(pgmap->mem->owner)->p2pdma, 1);
 	struct percpu_ref *ref;
 
 	gen_pool_free_owner(p2pdma->pool, (uintptr_t)page_to_virt(page),
@@ -227,44 +228,111 @@ static void pci_p2pdma_release(void *data)
 
 	/* Flush and disable pci_alloc_p2p_mem() */
 	pdev->p2pdma = NULL;
-	synchronize_rcu();
+	if (p2pdma->pool)
+		synchronize_rcu();
+	xa_destroy(&p2pdma->map_types);
+
+	if (!p2pdma->pool)
+		return;
 
 	gen_pool_destroy(p2pdma->pool);
 	sysfs_remove_group(&pdev->dev.kobj, &p2pmem_group);
-	xa_destroy(&p2pdma->map_types);
 }
 
-static int pci_p2pdma_setup(struct pci_dev *pdev)
+/**
+ * pcim_p2pdma_init - Initialise peer-to-peer DMA providers
+ * @pdev: The PCI device to enable P2PDMA for
+ *
+ * This function initializes the peer-to-peer DMA infrastructure
+ * for a PCI device. It allocates and sets up the necessary data
+ * structures to support P2PDMA operations, including mapping type
+ * tracking.
+ */
+int pcim_p2pdma_init(struct pci_dev *pdev)
 {
-	int error = -ENOMEM;
 	struct pci_p2pdma *p2p;
+	int i, ret;
+
+	p2p = rcu_dereference_protected(pdev->p2pdma, 1);
+	if (p2p)
+		return 0;
 
 	p2p = devm_kzalloc(&pdev->dev, sizeof(*p2p), GFP_KERNEL);
 	if (!p2p)
 		return -ENOMEM;
 
 	xa_init(&p2p->map_types);
+	/*
+	 * Iterate over all standard PCI BARs and record only those that
+	 * correspond to MMIO regions. Skip non-memory resources (e.g. I/O
+	 * port BARs) since they cannot be used for peer-to-peer (P2P)
+	 * transactions.
+	 */
+	for (i = 0; i < PCI_STD_NUM_BARS; i++) {
+		if (!(pci_resource_flags(pdev, i) & IORESOURCE_MEM))
+			continue;
 
-	p2p->pool = gen_pool_create(PAGE_SHIFT, dev_to_node(&pdev->dev));
-	if (!p2p->pool)
-		goto out;
+		p2p->mem[i].owner = &pdev->dev;
+		p2p->mem[i].bus_offset =
+			pci_bus_address(pdev, i) - pci_resource_start(pdev, i);
+	}
 
-	error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_release, pdev);
-	if (error)
-		goto out_pool_destroy;
+	ret = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_release, pdev);
+	if (ret)
+		goto out_p2p;
 
-	error = sysfs_create_group(&pdev->dev.kobj, &p2pmem_group);
-	if (error)
+	rcu_assign_pointer(pdev->p2pdma, p2p);
+	return 0;
+
+out_p2p:
+	devm_kfree(&pdev->dev, p2p);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(pcim_p2pdma_init);
+
+/**
+ * pcim_p2pdma_provider - Get peer-to-peer DMA provider
+ * @pdev: The PCI device to enable P2PDMA for
+ * @bar: BAR index to get provider
+ *
+ * This function gets peer-to-peer DMA provider for a PCI device.
+ */
+struct p2pdma_provider *pcim_p2pdma_provider(struct pci_dev *pdev, int bar)
+{
+	struct pci_p2pdma *p2p;
+
+	if (!(pci_resource_flags(pdev, bar) & IORESOURCE_MEM))
+		return NULL;
+
+	p2p = rcu_dereference_protected(pdev->p2pdma, 1);
+	return &p2p->mem[bar];
+}
+EXPORT_SYMBOL_GPL(pcim_p2pdma_provider);
+
+static int pci_p2pdma_setup_pool(struct pci_dev *pdev)
+{
+	struct pci_p2pdma *p2pdma;
+	int ret;
+
+	p2pdma = rcu_dereference_protected(pdev->p2pdma, 1);
+	if (p2pdma->pool)
+		/* We already setup pools, do nothing, */
+		return 0;
+
+	p2pdma->pool = gen_pool_create(PAGE_SHIFT, dev_to_node(&pdev->dev));
+	if (!p2pdma->pool)
+		return -ENOMEM;
+
+	ret = sysfs_create_group(&pdev->dev.kobj, &p2pmem_group);
+	if (ret)
 		goto out_pool_destroy;
 
-	rcu_assign_pointer(pdev->p2pdma, p2p);
 	return 0;
 
 out_pool_destroy:
-	gen_pool_destroy(p2p->pool);
-out:
-	devm_kfree(&pdev->dev, p2p);
-	return error;
+	gen_pool_destroy(p2pdma->pool);
+	p2pdma->pool = NULL;
+	return ret;
 }
 
 static void pci_p2pdma_unmap_mappings(void *data)
@@ -276,7 +344,7 @@ static void pci_p2pdma_unmap_mappings(void *data)
 	 * unmap_mapping_range() on the inode, teardown any existing userspace
 	 * mappings and prevent new ones from being created.
 	 */
-	sysfs_remove_file_from_group(&p2p_pgmap->mem.owner->kobj,
+	sysfs_remove_file_from_group(&p2p_pgmap->mem->owner->kobj,
 				     &p2pmem_alloc_attr.attr,
 				     p2pmem_group.name);
 }
@@ -295,6 +363,7 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
 			    u64 offset)
 {
 	struct pci_p2pdma_pagemap *p2p_pgmap;
+	struct p2pdma_provider *mem;
 	struct dev_pagemap *pgmap;
 	struct pci_p2pdma *p2pdma;
 	void *addr;
@@ -312,11 +381,21 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
 	if (size + offset > pci_resource_len(pdev, bar))
 		return -EINVAL;
 
-	if (!pdev->p2pdma) {
-		error = pci_p2pdma_setup(pdev);
-		if (error)
-			return error;
-	}
+	error = pcim_p2pdma_init(pdev);
+	if (error)
+		return error;
+
+	error = pci_p2pdma_setup_pool(pdev);
+	if (error)
+		return error;
+
+	mem = pcim_p2pdma_provider(pdev, bar);
+	/*
+	 * We checked validity of BAR prior to call
+	 * to pcim_p2pdma_provider. It should never return NULL.
+	 */
+	if (WARN_ON(!mem))
+		return -EINVAL;
 
 	p2p_pgmap = devm_kzalloc(&pdev->dev, sizeof(*p2p_pgmap), GFP_KERNEL);
 	if (!p2p_pgmap)
@@ -328,9 +407,7 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
 	pgmap->nr_range = 1;
 	pgmap->type = MEMORY_DEVICE_PCI_P2PDMA;
 	pgmap->ops = &p2pdma_pgmap_ops;
-	p2p_pgmap->mem.owner = &pdev->dev;
-	p2p_pgmap->mem.bus_offset =
-		pci_bus_address(pdev, bar) - pci_resource_start(pdev, bar);
+	p2p_pgmap->mem = mem;
 
 	addr = devm_memremap_pages(&pdev->dev, pgmap);
 	if (IS_ERR(addr)) {
@@ -1007,11 +1084,11 @@ void __pci_p2pdma_update_state(struct pci_p2pdma_map_state *state,
 {
 	struct pci_p2pdma_pagemap *p2p_pgmap = to_p2p_pgmap(page_pgmap(page));
 
-	if (state->mem == &p2p_pgmap->mem)
+	if (state->mem == p2p_pgmap->mem)
 		return;
 
-	state->mem = &p2p_pgmap->mem;
-	state->map = pci_p2pdma_map_type(&p2p_pgmap->mem, dev);
+	state->mem = p2p_pgmap->mem;
+	state->map = pci_p2pdma_map_type(p2p_pgmap->mem, dev);
 }
 
 /**
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index 9516ef97b17a..e307c9380d46 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -27,6 +27,8 @@ struct p2pdma_provider {
 };
 
 #ifdef CONFIG_PCI_P2PDMA
+int pcim_p2pdma_init(struct pci_dev *pdev);
+struct p2pdma_provider *pcim_p2pdma_provider(struct pci_dev *pdev, int bar);
 int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
 		u64 offset);
 int pci_p2pdma_distance_many(struct pci_dev *provider, struct device **clients,
@@ -44,6 +46,15 @@ int pci_p2pdma_enable_store(const char *page, struct pci_dev **p2p_dev,
 ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev,
 			       bool use_p2pdma);
 #else /* CONFIG_PCI_P2PDMA */
+static inline int pcim_p2pdma_init(struct pci_dev *pdev)
+{
+	return -EOPNOTSUPP;
+}
+static inline struct p2pdma_provider *pcim_p2pdma_provider(struct pci_dev *pdev,
+							   int bar)
+{
+	return ERR_PTR(-EOPNOTSUPP);
+}
 static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar,
 		size_t size, u64 offset)
 {
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v5 4/9] PCI/P2PDMA: Export pci_p2pdma_map_type() function
  2025-10-13 15:26 [PATCH v5 0/9] vfio/pci: Allow MMIO regions to be exported through dma-buf Leon Romanovsky
                   ` (2 preceding siblings ...)
  2025-10-13 15:26 ` [PATCH v5 3/9] PCI/P2PDMA: Refactor to separate core P2P functionality from memory allocation Leon Romanovsky
@ 2025-10-13 15:26 ` Leon Romanovsky
  2025-10-17  6:31   ` Christoph Hellwig
  2025-10-13 15:26 ` [PATCH v5 5/9] types: move phys_vec definition to common header Leon Romanovsky
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 45+ messages in thread
From: Leon Romanovsky @ 2025-10-13 15:26 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Leon Romanovsky, Jason Gunthorpe, Andrew Morton, Bjorn Helgaas,
	Christian König, dri-devel, iommu, Jens Axboe, Joerg Roedel,
	kvm, linaro-mm-sig, linux-block, linux-kernel, linux-media,
	linux-mm, linux-pci, Logan Gunthorpe, Marek Szyprowski,
	Robin Murphy, Sumit Semwal, Vivek Kasireddy, Will Deacon

From: Leon Romanovsky <leonro@nvidia.com>

Export the pci_p2pdma_map_type() function to allow external modules
and subsystems to determine the appropriate mapping type for P2PDMA
transfers between a provider and target device.

The function determines whether peer-to-peer DMA transfers can be
done directly through PCI switches (PCI_P2PDMA_MAP_BUS_ADDR) or
must go through the host bridge (PCI_P2PDMA_MAP_THRU_HOST_BRIDGE),
or if the transfer is not supported at all.

This export enables subsystems like VFIO to properly handle P2PDMA
operations by querying the mapping type before attempting transfers,
ensuring correct DMA address programming and error handling.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/pci/p2pdma.c       | 15 ++++++-
 include/linux/pci-p2pdma.h | 85 +++++++++++++++++++++-----------------
 2 files changed, 59 insertions(+), 41 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index a2ec7e93fd71..bdbbc49f46ee 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -1048,8 +1048,18 @@ void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
 }
 EXPORT_SYMBOL_GPL(pci_p2pmem_publish);
 
-static enum pci_p2pdma_map_type
-pci_p2pdma_map_type(struct p2pdma_provider *provider, struct device *dev)
+/**
+ * pci_p2pdma_map_type - Determine the mapping type for P2PDMA transfers
+ * @provider: P2PDMA provider structure
+ * @dev: Target device for the transfer
+ *
+ * Determines how peer-to-peer DMA transfers should be mapped between
+ * the provider and the target device. The mapping type indicates whether
+ * the transfer can be done directly through PCI switches or must go
+ * through the host bridge.
+ */
+enum pci_p2pdma_map_type pci_p2pdma_map_type(struct p2pdma_provider *provider,
+					     struct device *dev)
 {
 	enum pci_p2pdma_map_type type = PCI_P2PDMA_MAP_NOT_SUPPORTED;
 	struct pci_dev *pdev = to_pci_dev(provider->owner);
@@ -1078,6 +1088,7 @@ pci_p2pdma_map_type(struct p2pdma_provider *provider, struct device *dev)
 
 	return type;
 }
+EXPORT_SYMBOL_GPL(pci_p2pdma_map_type);
 
 void __pci_p2pdma_update_state(struct pci_p2pdma_map_state *state,
 		struct device *dev, struct page *page)
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index e307c9380d46..1e499a8e0099 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -26,6 +26,45 @@ struct p2pdma_provider {
 	u64 bus_offset;
 };
 
+enum pci_p2pdma_map_type {
+	/*
+	 * PCI_P2PDMA_MAP_UNKNOWN: Used internally as an initial state before
+	 * the mapping type has been calculated. Exported routines for the API
+	 * will never return this value.
+	 */
+	PCI_P2PDMA_MAP_UNKNOWN = 0,
+
+	/*
+	 * Not a PCI P2PDMA transfer.
+	 */
+	PCI_P2PDMA_MAP_NONE,
+
+	/*
+	 * PCI_P2PDMA_MAP_NOT_SUPPORTED: Indicates the transaction will
+	 * traverse the host bridge and the host bridge is not in the
+	 * allowlist. DMA Mapping routines should return an error when
+	 * this is returned.
+	 */
+	PCI_P2PDMA_MAP_NOT_SUPPORTED,
+
+	/*
+	 * PCI_P2PDMA_MAP_BUS_ADDR: Indicates that two devices can talk to
+	 * each other directly through a PCI switch and the transaction will
+	 * not traverse the host bridge. Such a mapping should program
+	 * the DMA engine with PCI bus addresses.
+	 */
+	PCI_P2PDMA_MAP_BUS_ADDR,
+
+	/*
+	 * PCI_P2PDMA_MAP_THRU_HOST_BRIDGE: Indicates two devices can talk
+	 * to each other, but the transaction traverses a host bridge on the
+	 * allowlist. In this case, a normal mapping either with CPU physical
+	 * addresses (in the case of dma-direct) or IOVA addresses (in the
+	 * case of IOMMUs) should be used to program the DMA engine.
+	 */
+	PCI_P2PDMA_MAP_THRU_HOST_BRIDGE,
+};
+
 #ifdef CONFIG_PCI_P2PDMA
 int pcim_p2pdma_init(struct pci_dev *pdev);
 struct p2pdma_provider *pcim_p2pdma_provider(struct pci_dev *pdev, int bar);
@@ -45,6 +84,8 @@ int pci_p2pdma_enable_store(const char *page, struct pci_dev **p2p_dev,
 			    bool *use_p2pdma);
 ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev,
 			       bool use_p2pdma);
+enum pci_p2pdma_map_type pci_p2pdma_map_type(struct p2pdma_provider *provider,
+					     struct device *dev);
 #else /* CONFIG_PCI_P2PDMA */
 static inline int pcim_p2pdma_init(struct pci_dev *pdev)
 {
@@ -106,6 +147,11 @@ static inline ssize_t pci_p2pdma_enable_show(char *page,
 {
 	return sprintf(page, "none\n");
 }
+static inline enum pci_p2pdma_map_type
+pci_p2pdma_map_type(struct p2pdma_provider *provider, struct device *dev)
+{
+	return PCI_P2PDMA_MAP_NOT_SUPPORTED;
+}
 #endif /* CONFIG_PCI_P2PDMA */
 
 
@@ -120,45 +166,6 @@ static inline struct pci_dev *pci_p2pmem_find(struct device *client)
 	return pci_p2pmem_find_many(&client, 1);
 }
 
-enum pci_p2pdma_map_type {
-	/*
-	 * PCI_P2PDMA_MAP_UNKNOWN: Used internally as an initial state before
-	 * the mapping type has been calculated. Exported routines for the API
-	 * will never return this value.
-	 */
-	PCI_P2PDMA_MAP_UNKNOWN = 0,
-
-	/*
-	 * Not a PCI P2PDMA transfer.
-	 */
-	PCI_P2PDMA_MAP_NONE,
-
-	/*
-	 * PCI_P2PDMA_MAP_NOT_SUPPORTED: Indicates the transaction will
-	 * traverse the host bridge and the host bridge is not in the
-	 * allowlist. DMA Mapping routines should return an error when
-	 * this is returned.
-	 */
-	PCI_P2PDMA_MAP_NOT_SUPPORTED,
-
-	/*
-	 * PCI_P2PDMA_MAP_BUS_ADDR: Indicates that two devices can talk to
-	 * each other directly through a PCI switch and the transaction will
-	 * not traverse the host bridge. Such a mapping should program
-	 * the DMA engine with PCI bus addresses.
-	 */
-	PCI_P2PDMA_MAP_BUS_ADDR,
-
-	/*
-	 * PCI_P2PDMA_MAP_THRU_HOST_BRIDGE: Indicates two devices can talk
-	 * to each other, but the transaction traverses a host bridge on the
-	 * allowlist. In this case, a normal mapping either with CPU physical
-	 * addresses (in the case of dma-direct) or IOVA addresses (in the
-	 * case of IOMMUs) should be used to program the DMA engine.
-	 */
-	PCI_P2PDMA_MAP_THRU_HOST_BRIDGE,
-};
-
 struct pci_p2pdma_map_state {
 	struct p2pdma_provider *mem;
 	enum pci_p2pdma_map_type map;
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 4/9] PCI/P2PDMA: Export pci_p2pdma_map_type() function
  2025-10-13 15:26 ` [PATCH v5 4/9] PCI/P2PDMA: Export pci_p2pdma_map_type() function Leon Romanovsky
@ 2025-10-17  6:31   ` Christoph Hellwig
  2025-10-17 12:14     ` Jason Gunthorpe
  0 siblings, 1 reply; 45+ messages in thread
From: Christoph Hellwig @ 2025-10-17  6:31 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Alex Williamson, Leon Romanovsky, Jason Gunthorpe, Andrew Morton,
	Bjorn Helgaas, Christian König, dri-devel, iommu, Jens Axboe,
	Joerg Roedel, kvm, linaro-mm-sig, linux-block, linux-kernel,
	linux-media, linux-mm, linux-pci, Logan Gunthorpe,
	Marek Szyprowski, Robin Murphy, Sumit Semwal, Vivek Kasireddy,
	Will Deacon


Nacked-by: Christoph Hellwig <hch@lst.de>

As explained to you multiple times, pci_p2pdma_map_type is a low-level
helper that absolutely MUST be wrapper in proper accessors.  It is
dangerous when used incorrectly and requires too much boiler plate.
There is no way this can be directly exported, and you really need to
stop resending this.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 4/9] PCI/P2PDMA: Export pci_p2pdma_map_type() function
  2025-10-17  6:31   ` Christoph Hellwig
@ 2025-10-17 12:14     ` Jason Gunthorpe
  2025-10-20 12:29       ` Christoph Hellwig
  0 siblings, 1 reply; 45+ messages in thread
From: Jason Gunthorpe @ 2025-10-17 12:14 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Leon Romanovsky, Alex Williamson, Leon Romanovsky, Andrew Morton,
	Bjorn Helgaas, Christian König, dri-devel, iommu, Jens Axboe,
	Joerg Roedel, kvm, linaro-mm-sig, linux-block, linux-kernel,
	linux-media, linux-mm, linux-pci, Logan Gunthorpe,
	Marek Szyprowski, Robin Murphy, Sumit Semwal, Vivek Kasireddy,
	Will Deacon

On Thu, Oct 16, 2025 at 11:31:53PM -0700, Christoph Hellwig wrote:
> 
> Nacked-by: Christoph Hellwig <hch@lst.de>
> 
> As explained to you multiple times, pci_p2pdma_map_type is a low-level
> helper that absolutely MUST be wrapper in proper accessors. 

You never responded to the discussion:

https://lore.kernel.org/all/20250727190252.GF7551@nvidia.com/

What is the plan here? Is the new DMA API unusable by modules? That
seems a little challenging.

> It is dangerous when used incorrectly and requires too much boiler
> plate.  There is no way this can be directly exported, and you
> really need to stop resending this.

Yeah, I don't like the boilerplate at all either.

It looks like there is a simple enough solution here. I wanted to
tackle this after, but maybe it is small enough to do it now.

dmabuf should gain some helpers like BIO has to manage its map/unmap
flows, so lets put a start of some helpers in
drivers/dma/dma-mapping.c (or whatever). dmabuf is a built in so it
can call the function without exporting it just like block and hmm are
doing.

The same code as in this vfio patch will get moved into the helper and
vfio will call it under its dmabuf map/unmap ops.

The goal would be to make it much easier for other dmabuf exporters to
switch from dma_map_resource() to this new dmabuf api which is safe
for P2P.

Jason

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 4/9] PCI/P2PDMA: Export pci_p2pdma_map_type() function
  2025-10-17 12:14     ` Jason Gunthorpe
@ 2025-10-20 12:29       ` Christoph Hellwig
  2025-10-20 13:14         ` Jason Gunthorpe
  0 siblings, 1 reply; 45+ messages in thread
From: Christoph Hellwig @ 2025-10-20 12:29 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, Leon Romanovsky, Alex Williamson,
	Leon Romanovsky, Andrew Morton, Bjorn Helgaas,
	Christian König, dri-devel, iommu, Jens Axboe, Joerg Roedel,
	kvm, linaro-mm-sig, linux-block, linux-kernel, linux-media,
	linux-mm, linux-pci, Logan Gunthorpe, Marek Szyprowski,
	Robin Murphy, Sumit Semwal, Vivek Kasireddy, Will Deacon

On Fri, Oct 17, 2025 at 09:14:47AM -0300, Jason Gunthorpe wrote:
> On Thu, Oct 16, 2025 at 11:31:53PM -0700, Christoph Hellwig wrote:
> > 
> > Nacked-by: Christoph Hellwig <hch@lst.de>
> > 
> > As explained to you multiple times, pci_p2pdma_map_type is a low-level
> > helper that absolutely MUST be wrapper in proper accessors. 
> 
> You never responded to the discussion:
> 
> https://lore.kernel.org/all/20250727190252.GF7551@nvidia.com/
> 
> What is the plan here? Is the new DMA API unusable by modules? That
> seems a little challenging.

Yes.  These are only intended to be wrapped by subsystems.

> It looks like there is a simple enough solution here. I wanted to
> tackle this after, but maybe it is small enough to do it now.
> 
> dmabuf should gain some helpers like BIO has to manage its map/unmap
> flows, so lets put a start of some helpers in
> drivers/dma/dma-mapping.c (or whatever). dmabuf is a built in so it
> can call the function without exporting it just like block and hmm are
> doing.

Yes, that sounds much better.  And dmabuf in general could use some
deduplicating of their dma mapping patterns (and eventual fixing).


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 4/9] PCI/P2PDMA: Export pci_p2pdma_map_type() function
  2025-10-20 12:29       ` Christoph Hellwig
@ 2025-10-20 13:14         ` Jason Gunthorpe
  0 siblings, 0 replies; 45+ messages in thread
From: Jason Gunthorpe @ 2025-10-20 13:14 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Leon Romanovsky, Alex Williamson, Leon Romanovsky, Andrew Morton,
	Bjorn Helgaas, Christian König, dri-devel, iommu, Jens Axboe,
	Joerg Roedel, kvm, linaro-mm-sig, linux-block, linux-kernel,
	linux-media, linux-mm, linux-pci, Logan Gunthorpe,
	Marek Szyprowski, Robin Murphy, Sumit Semwal, Vivek Kasireddy,
	Will Deacon

On Mon, Oct 20, 2025 at 05:29:06AM -0700, Christoph Hellwig wrote:
> On Fri, Oct 17, 2025 at 09:14:47AM -0300, Jason Gunthorpe wrote:
> > On Thu, Oct 16, 2025 at 11:31:53PM -0700, Christoph Hellwig wrote:
> > > 
> > > Nacked-by: Christoph Hellwig <hch@lst.de>
> > > 
> > > As explained to you multiple times, pci_p2pdma_map_type is a low-level
> > > helper that absolutely MUST be wrapper in proper accessors. 
> > 
> > You never responded to the discussion:
> > 
> > https://lore.kernel.org/all/20250727190252.GF7551@nvidia.com/
> > 
> > What is the plan here? Is the new DMA API unusable by modules? That
> > seems a little challenging.
> 
> Yes.  These are only intended to be wrapped by subsystems.

Sure, but many subsystems are fully modular too.. RDMA for example.

Well, lets see what comes in the future..

> Yes, that sounds much better.  And dmabuf in general could use some
> deduplicating of their dma mapping patterns (and eventual fixing).

Yes, it certainly could, but I wanted to tackle this later..

I think adding some 'dmabuf create a map for this list of phys on this
provider' is a good simplifed primitive. Simple drivers like VFIO that
only want to expose MMIO can just call it directly.

Once this is settled I want to have RDMA wrap some of its MMIO VMAs in
DMABUF as well, so I can see at least two users of the helper.

Jason

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v5 5/9] types: move phys_vec definition to common header
  2025-10-13 15:26 [PATCH v5 0/9] vfio/pci: Allow MMIO regions to be exported through dma-buf Leon Romanovsky
                   ` (3 preceding siblings ...)
  2025-10-13 15:26 ` [PATCH v5 4/9] PCI/P2PDMA: Export pci_p2pdma_map_type() function Leon Romanovsky
@ 2025-10-13 15:26 ` Leon Romanovsky
  2025-10-13 15:26 ` [PATCH v5 6/9] vfio: Export vfio device get and put registration helpers Leon Romanovsky
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 45+ messages in thread
From: Leon Romanovsky @ 2025-10-13 15:26 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Leon Romanovsky, Jason Gunthorpe, Andrew Morton, Bjorn Helgaas,
	Christian König, dri-devel, iommu, Jens Axboe, Joerg Roedel,
	kvm, linaro-mm-sig, linux-block, linux-kernel, linux-media,
	linux-mm, linux-pci, Logan Gunthorpe, Marek Szyprowski,
	Robin Murphy, Sumit Semwal, Vivek Kasireddy, Will Deacon

From: Leon Romanovsky <leonro@nvidia.com>

Move the struct phys_vec definition from block/blk-mq-dma.c to
include/linux/types.h to make it available for use across the kernel.

The phys_vec structure represents a physical address range with a
length, which is used by the new physical address-based DMA mapping
API. This structure is already used by the block layer and will be
needed by upcoming VFIO patches for dma-buf operations.

Moving this definition to types.h provides a centralized location
for this common data structure and eliminates code duplication
across subsystems that need to work with physical address ranges.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 block/blk-mq-dma.c    | 5 -----
 include/linux/types.h | 5 +++++
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/block/blk-mq-dma.c b/block/blk-mq-dma.c
index badef1d925b2..38f5c34ca223 100644
--- a/block/blk-mq-dma.c
+++ b/block/blk-mq-dma.c
@@ -6,11 +6,6 @@
 #include <linux/blk-mq-dma.h>
 #include "blk.h"
 
-struct phys_vec {
-	phys_addr_t	paddr;
-	u32		len;
-};
-
 static bool __blk_map_iter_next(struct blk_map_iter *iter)
 {
 	if (iter->iter.bi_size)
diff --git a/include/linux/types.h b/include/linux/types.h
index 6dfdb8e8e4c3..2bc56681b2e6 100644
--- a/include/linux/types.h
+++ b/include/linux/types.h
@@ -170,6 +170,11 @@ typedef u64 phys_addr_t;
 typedef u32 phys_addr_t;
 #endif
 
+struct phys_vec {
+	phys_addr_t	paddr;
+	u32		len;
+};
+
 typedef phys_addr_t resource_size_t;
 
 /*
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v5 6/9] vfio: Export vfio device get and put registration helpers
  2025-10-13 15:26 [PATCH v5 0/9] vfio/pci: Allow MMIO regions to be exported through dma-buf Leon Romanovsky
                   ` (4 preceding siblings ...)
  2025-10-13 15:26 ` [PATCH v5 5/9] types: move phys_vec definition to common header Leon Romanovsky
@ 2025-10-13 15:26 ` Leon Romanovsky
  2025-10-13 15:26 ` [PATCH v5 7/9] vfio/pci: Share the core device pointer while invoking feature functions Leon Romanovsky
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 45+ messages in thread
From: Leon Romanovsky @ 2025-10-13 15:26 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Vivek Kasireddy, Jason Gunthorpe, Andrew Morton, Bjorn Helgaas,
	Christian König, dri-devel, iommu, Jens Axboe, Joerg Roedel,
	kvm, linaro-mm-sig, linux-block, linux-kernel, linux-media,
	linux-mm, linux-pci, Logan Gunthorpe, Marek Szyprowski,
	Robin Murphy, Sumit Semwal, Will Deacon

From: Vivek Kasireddy <vivek.kasireddy@intel.com>

These helpers are useful for managing additional references taken
on the device from other associated VFIO modules.

Original-patch-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/vfio/vfio_main.c | 2 ++
 include/linux/vfio.h     | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index 38c8e9350a60..9aa4a5d081e8 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -172,11 +172,13 @@ void vfio_device_put_registration(struct vfio_device *device)
 	if (refcount_dec_and_test(&device->refcount))
 		complete(&device->comp);
 }
+EXPORT_SYMBOL_GPL(vfio_device_put_registration);
 
 bool vfio_device_try_get_registration(struct vfio_device *device)
 {
 	return refcount_inc_not_zero(&device->refcount);
 }
+EXPORT_SYMBOL_GPL(vfio_device_try_get_registration);
 
 /*
  * VFIO driver API
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index eb563f538dee..217ba4ef1752 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -297,6 +297,8 @@ static inline void vfio_put_device(struct vfio_device *device)
 int vfio_register_group_dev(struct vfio_device *device);
 int vfio_register_emulated_iommu_dev(struct vfio_device *device);
 void vfio_unregister_group_dev(struct vfio_device *device);
+bool vfio_device_try_get_registration(struct vfio_device *device);
+void vfio_device_put_registration(struct vfio_device *device);
 
 int vfio_assign_device_set(struct vfio_device *device, void *set_id);
 unsigned int vfio_device_set_open_count(struct vfio_device_set *dev_set);
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v5 7/9] vfio/pci: Share the core device pointer while invoking feature functions
  2025-10-13 15:26 [PATCH v5 0/9] vfio/pci: Allow MMIO regions to be exported through dma-buf Leon Romanovsky
                   ` (5 preceding siblings ...)
  2025-10-13 15:26 ` [PATCH v5 6/9] vfio: Export vfio device get and put registration helpers Leon Romanovsky
@ 2025-10-13 15:26 ` Leon Romanovsky
  2025-10-13 15:26 ` [PATCH v5 8/9] vfio/pci: Enable peer-to-peer DMA transactions by default Leon Romanovsky
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 45+ messages in thread
From: Leon Romanovsky @ 2025-10-13 15:26 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Vivek Kasireddy, Jason Gunthorpe, Andrew Morton, Bjorn Helgaas,
	Christian König, dri-devel, iommu, Jens Axboe, Joerg Roedel,
	kvm, linaro-mm-sig, linux-block, linux-kernel, linux-media,
	linux-mm, linux-pci, Logan Gunthorpe, Marek Szyprowski,
	Robin Murphy, Sumit Semwal, Will Deacon

From: Vivek Kasireddy <vivek.kasireddy@intel.com>

There is no need to share the main device pointer (struct vfio_device *)
with all the feature functions as they only need the core device
pointer. Therefore, extract the core device pointer once in the
caller (vfio_pci_core_ioctl_feature) and share it instead.

Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 30 +++++++++++++-----------------
 1 file changed, 13 insertions(+), 17 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 7dcf5439dedc..ca9a95716a85 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -299,11 +299,9 @@ static int vfio_pci_runtime_pm_entry(struct vfio_pci_core_device *vdev,
 	return 0;
 }
 
-static int vfio_pci_core_pm_entry(struct vfio_device *device, u32 flags,
+static int vfio_pci_core_pm_entry(struct vfio_pci_core_device *vdev, u32 flags,
 				  void __user *arg, size_t argsz)
 {
-	struct vfio_pci_core_device *vdev =
-		container_of(device, struct vfio_pci_core_device, vdev);
 	int ret;
 
 	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_SET, 0);
@@ -320,12 +318,10 @@ static int vfio_pci_core_pm_entry(struct vfio_device *device, u32 flags,
 }
 
 static int vfio_pci_core_pm_entry_with_wakeup(
-	struct vfio_device *device, u32 flags,
+	struct vfio_pci_core_device *vdev, u32 flags,
 	struct vfio_device_low_power_entry_with_wakeup __user *arg,
 	size_t argsz)
 {
-	struct vfio_pci_core_device *vdev =
-		container_of(device, struct vfio_pci_core_device, vdev);
 	struct vfio_device_low_power_entry_with_wakeup entry;
 	struct eventfd_ctx *efdctx;
 	int ret;
@@ -376,11 +372,9 @@ static void vfio_pci_runtime_pm_exit(struct vfio_pci_core_device *vdev)
 	up_write(&vdev->memory_lock);
 }
 
-static int vfio_pci_core_pm_exit(struct vfio_device *device, u32 flags,
+static int vfio_pci_core_pm_exit(struct vfio_pci_core_device *vdev, u32 flags,
 				 void __user *arg, size_t argsz)
 {
-	struct vfio_pci_core_device *vdev =
-		container_of(device, struct vfio_pci_core_device, vdev);
 	int ret;
 
 	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_SET, 0);
@@ -1473,11 +1467,10 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
 }
 EXPORT_SYMBOL_GPL(vfio_pci_core_ioctl);
 
-static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags,
-				       uuid_t __user *arg, size_t argsz)
+static int vfio_pci_core_feature_token(struct vfio_pci_core_device *vdev,
+				       u32 flags, uuid_t __user *arg,
+				       size_t argsz)
 {
-	struct vfio_pci_core_device *vdev =
-		container_of(device, struct vfio_pci_core_device, vdev);
 	uuid_t uuid;
 	int ret;
 
@@ -1504,16 +1497,19 @@ static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags,
 int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
 				void __user *arg, size_t argsz)
 {
+	struct vfio_pci_core_device *vdev =
+		container_of(device, struct vfio_pci_core_device, vdev);
+
 	switch (flags & VFIO_DEVICE_FEATURE_MASK) {
 	case VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY:
-		return vfio_pci_core_pm_entry(device, flags, arg, argsz);
+		return vfio_pci_core_pm_entry(vdev, flags, arg, argsz);
 	case VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP:
-		return vfio_pci_core_pm_entry_with_wakeup(device, flags,
+		return vfio_pci_core_pm_entry_with_wakeup(vdev, flags,
 							  arg, argsz);
 	case VFIO_DEVICE_FEATURE_LOW_POWER_EXIT:
-		return vfio_pci_core_pm_exit(device, flags, arg, argsz);
+		return vfio_pci_core_pm_exit(vdev, flags, arg, argsz);
 	case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN:
-		return vfio_pci_core_feature_token(device, flags, arg, argsz);
+		return vfio_pci_core_feature_token(vdev, flags, arg, argsz);
 	default:
 		return -ENOTTY;
 	}
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v5 8/9] vfio/pci: Enable peer-to-peer DMA transactions by default
  2025-10-13 15:26 [PATCH v5 0/9] vfio/pci: Allow MMIO regions to be exported through dma-buf Leon Romanovsky
                   ` (6 preceding siblings ...)
  2025-10-13 15:26 ` [PATCH v5 7/9] vfio/pci: Share the core device pointer while invoking feature functions Leon Romanovsky
@ 2025-10-13 15:26 ` Leon Romanovsky
  2025-10-16  4:09   ` Nicolin Chen
                     ` (2 more replies)
  2025-10-13 15:26 ` [PATCH v5 9/9] vfio/pci: Add dma-buf export support for MMIO regions Leon Romanovsky
  2025-10-15 21:15 ` [PATCH v5 0/9] vfio/pci: Allow MMIO regions to be exported through dma-buf shinichiro.kawasaki
  9 siblings, 3 replies; 45+ messages in thread
From: Leon Romanovsky @ 2025-10-13 15:26 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Leon Romanovsky, Jason Gunthorpe, Andrew Morton, Bjorn Helgaas,
	Christian König, dri-devel, iommu, Jens Axboe, Joerg Roedel,
	kvm, linaro-mm-sig, linux-block, linux-kernel, linux-media,
	linux-mm, linux-pci, Logan Gunthorpe, Marek Szyprowski,
	Robin Murphy, Sumit Semwal, Vivek Kasireddy, Will Deacon

From: Leon Romanovsky <leonro@nvidia.com>

Make sure that all VFIO PCI devices have peer-to-peer capabilities
enables, so we would be able to export their MMIO memory through DMABUF,

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index ca9a95716a85..fe247d0e2831 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -28,6 +28,7 @@
 #include <linux/nospec.h>
 #include <linux/sched/mm.h>
 #include <linux/iommufd.h>
+#include <linux/pci-p2pdma.h>
 #if IS_ENABLED(CONFIG_EEH)
 #include <asm/eeh.h>
 #endif
@@ -2081,6 +2082,7 @@ int vfio_pci_core_init_dev(struct vfio_device *core_vdev)
 {
 	struct vfio_pci_core_device *vdev =
 		container_of(core_vdev, struct vfio_pci_core_device, vdev);
+	int ret;
 
 	vdev->pdev = to_pci_dev(core_vdev->dev);
 	vdev->irq_type = VFIO_PCI_NUM_IRQS;
@@ -2090,6 +2092,9 @@ int vfio_pci_core_init_dev(struct vfio_device *core_vdev)
 	INIT_LIST_HEAD(&vdev->dummy_resources_list);
 	INIT_LIST_HEAD(&vdev->ioeventfds_list);
 	INIT_LIST_HEAD(&vdev->sriov_pfs_item);
+	ret = pcim_p2pdma_init(vdev->pdev);
+	if (ret != -EOPNOTSUPP)
+		return ret;
 	init_rwsem(&vdev->memory_lock);
 	xa_init(&vdev->ctx);
 
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 8/9] vfio/pci: Enable peer-to-peer DMA transactions by default
  2025-10-13 15:26 ` [PATCH v5 8/9] vfio/pci: Enable peer-to-peer DMA transactions by default Leon Romanovsky
@ 2025-10-16  4:09   ` Nicolin Chen
  2025-10-16  6:10     ` Leon Romanovsky
  2025-10-17  6:32   ` Christoph Hellwig
  2025-10-22 11:54   ` Jason Gunthorpe
  2 siblings, 1 reply; 45+ messages in thread
From: Nicolin Chen @ 2025-10-16  4:09 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Alex Williamson, Leon Romanovsky, Jason Gunthorpe, Andrew Morton,
	Bjorn Helgaas, Christian König, dri-devel, iommu, Jens Axboe,
	Joerg Roedel, kvm, linaro-mm-sig, linux-block, linux-kernel,
	linux-media, linux-mm, linux-pci, Logan Gunthorpe,
	Marek Szyprowski, Robin Murphy, Sumit Semwal, Vivek Kasireddy,
	Will Deacon

Hi Leon,

On Mon, Oct 13, 2025 at 06:26:10PM +0300, Leon Romanovsky wrote:
> @@ -2090,6 +2092,9 @@ int vfio_pci_core_init_dev(struct vfio_device *core_vdev)
>  	INIT_LIST_HEAD(&vdev->dummy_resources_list);
>  	INIT_LIST_HEAD(&vdev->ioeventfds_list);
>  	INIT_LIST_HEAD(&vdev->sriov_pfs_item);
> +	ret = pcim_p2pdma_init(vdev->pdev);
> +	if (ret != -EOPNOTSUPP)
> +		return ret;
>  	init_rwsem(&vdev->memory_lock);
>  	xa_init(&vdev->ctx);

I think this should be:
	if (ret && ret != -EOPNOTSUPP)
		return ret;

Otherwise, init_rwsem() and xa_init() would be missed if ret==0.

Thanks
Nicolin

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 8/9] vfio/pci: Enable peer-to-peer DMA transactions by default
  2025-10-16  4:09   ` Nicolin Chen
@ 2025-10-16  6:10     ` Leon Romanovsky
  0 siblings, 0 replies; 45+ messages in thread
From: Leon Romanovsky @ 2025-10-16  6:10 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Alex Williamson, Jason Gunthorpe, Andrew Morton, Bjorn Helgaas,
	Christian König, dri-devel, iommu, Jens Axboe, Joerg Roedel,
	kvm, linaro-mm-sig, linux-block, linux-kernel, linux-media,
	linux-mm, linux-pci, Logan Gunthorpe, Marek Szyprowski,
	Robin Murphy, Sumit Semwal, Vivek Kasireddy, Will Deacon

On Wed, Oct 15, 2025 at 09:09:53PM -0700, Nicolin Chen wrote:
> Hi Leon,
> 
> On Mon, Oct 13, 2025 at 06:26:10PM +0300, Leon Romanovsky wrote:
> > @@ -2090,6 +2092,9 @@ int vfio_pci_core_init_dev(struct vfio_device *core_vdev)
> >  	INIT_LIST_HEAD(&vdev->dummy_resources_list);
> >  	INIT_LIST_HEAD(&vdev->ioeventfds_list);
> >  	INIT_LIST_HEAD(&vdev->sriov_pfs_item);
> > +	ret = pcim_p2pdma_init(vdev->pdev);
> > +	if (ret != -EOPNOTSUPP)
> > +		return ret;
> >  	init_rwsem(&vdev->memory_lock);
> >  	xa_init(&vdev->ctx);
> 
> I think this should be:
> 	if (ret && ret != -EOPNOTSUPP)
> 		return ret;
> 
> Otherwise, init_rwsem() and xa_init() would be missed if ret==0.

You absolutely right.

Thanks

> 
> Thanks
> Nicolin
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 8/9] vfio/pci: Enable peer-to-peer DMA transactions by default
  2025-10-13 15:26 ` [PATCH v5 8/9] vfio/pci: Enable peer-to-peer DMA transactions by default Leon Romanovsky
  2025-10-16  4:09   ` Nicolin Chen
@ 2025-10-17  6:32   ` Christoph Hellwig
  2025-10-17 11:55     ` Jason Gunthorpe
  2025-10-22 11:54   ` Jason Gunthorpe
  2 siblings, 1 reply; 45+ messages in thread
From: Christoph Hellwig @ 2025-10-17  6:32 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Alex Williamson, Leon Romanovsky, Jason Gunthorpe, Andrew Morton,
	Bjorn Helgaas, Christian König, dri-devel, iommu, Jens Axboe,
	Joerg Roedel, kvm, linaro-mm-sig, linux-block, linux-kernel,
	linux-media, linux-mm, linux-pci, Logan Gunthorpe,
	Marek Szyprowski, Robin Murphy, Sumit Semwal, Vivek Kasireddy,
	Will Deacon

On Mon, Oct 13, 2025 at 06:26:10PM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
> 
> Make sure that all VFIO PCI devices have peer-to-peer capabilities
> enables, so we would be able to export their MMIO memory through DMABUF,

How do you know that they are safe to use with P2P?


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 8/9] vfio/pci: Enable peer-to-peer DMA transactions by default
  2025-10-17  6:32   ` Christoph Hellwig
@ 2025-10-17 11:55     ` Jason Gunthorpe
  2025-10-20 12:28       ` Christoph Hellwig
  0 siblings, 1 reply; 45+ messages in thread
From: Jason Gunthorpe @ 2025-10-17 11:55 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Leon Romanovsky, Alex Williamson, Leon Romanovsky, Andrew Morton,
	Bjorn Helgaas, Christian König, dri-devel, iommu, Jens Axboe,
	Joerg Roedel, kvm, linaro-mm-sig, linux-block, linux-kernel,
	linux-media, linux-mm, linux-pci, Logan Gunthorpe,
	Marek Szyprowski, Robin Murphy, Sumit Semwal, Vivek Kasireddy,
	Will Deacon

On Thu, Oct 16, 2025 at 11:32:59PM -0700, Christoph Hellwig wrote:
> On Mon, Oct 13, 2025 at 06:26:10PM +0300, Leon Romanovsky wrote:
> > From: Leon Romanovsky <leonro@nvidia.com>
> > 
> > Make sure that all VFIO PCI devices have peer-to-peer capabilities
> > enables, so we would be able to export their MMIO memory through DMABUF,
> 
> How do you know that they are safe to use with P2P?

All PCI devices are "safe" for P2P by spec. I've never heard of a
non-complaint device causing problems in this area.

The issue is always SOC support inside the CPU and that is delt with
inside the P2P subsystem logic.

If we ever see a problem it would be delt with by quirking the broken
device through pci-quirks and having the p2p subsystem refuse any p2p
with that device.

Jason

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 8/9] vfio/pci: Enable peer-to-peer DMA transactions by default
  2025-10-17 11:55     ` Jason Gunthorpe
@ 2025-10-20 12:28       ` Christoph Hellwig
  2025-10-20 13:08         ` Jason Gunthorpe
  0 siblings, 1 reply; 45+ messages in thread
From: Christoph Hellwig @ 2025-10-20 12:28 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, Leon Romanovsky, Alex Williamson,
	Leon Romanovsky, Andrew Morton, Bjorn Helgaas,
	Christian König, dri-devel, iommu, Jens Axboe, Joerg Roedel,
	kvm, linaro-mm-sig, linux-block, linux-kernel, linux-media,
	linux-mm, linux-pci, Logan Gunthorpe, Marek Szyprowski,
	Robin Murphy, Sumit Semwal, Vivek Kasireddy, Will Deacon

On Fri, Oct 17, 2025 at 08:55:24AM -0300, Jason Gunthorpe wrote:
> On Thu, Oct 16, 2025 at 11:32:59PM -0700, Christoph Hellwig wrote:
> > On Mon, Oct 13, 2025 at 06:26:10PM +0300, Leon Romanovsky wrote:
> > > From: Leon Romanovsky <leonro@nvidia.com>
> > > 
> > > Make sure that all VFIO PCI devices have peer-to-peer capabilities
> > > enables, so we would be able to export their MMIO memory through DMABUF,
> > 
> > How do you know that they are safe to use with P2P?
> 
> All PCI devices are "safe" for P2P by spec. I've never heard of a
> non-complaint device causing problems in this area.

Real PCIe device, yes.  But we have a lot of stuff mascquerading as
such with is just emulated or special integrated.  I.e. a lot of
integrated Intel GPUs claim had issue there.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 8/9] vfio/pci: Enable peer-to-peer DMA transactions by default
  2025-10-20 12:28       ` Christoph Hellwig
@ 2025-10-20 13:08         ` Jason Gunthorpe
  2025-10-22  7:08           ` Christoph Hellwig
  0 siblings, 1 reply; 45+ messages in thread
From: Jason Gunthorpe @ 2025-10-20 13:08 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Leon Romanovsky, Alex Williamson, Leon Romanovsky, Andrew Morton,
	Bjorn Helgaas, Christian König, dri-devel, iommu, Jens Axboe,
	Joerg Roedel, kvm, linaro-mm-sig, linux-block, linux-kernel,
	linux-media, linux-mm, linux-pci, Logan Gunthorpe,
	Marek Szyprowski, Robin Murphy, Sumit Semwal, Vivek Kasireddy,
	Will Deacon

On Mon, Oct 20, 2025 at 05:28:02AM -0700, Christoph Hellwig wrote:
> On Fri, Oct 17, 2025 at 08:55:24AM -0300, Jason Gunthorpe wrote:
> > On Thu, Oct 16, 2025 at 11:32:59PM -0700, Christoph Hellwig wrote:
> > > On Mon, Oct 13, 2025 at 06:26:10PM +0300, Leon Romanovsky wrote:
> > > > From: Leon Romanovsky <leonro@nvidia.com>
> > > > 
> > > > Make sure that all VFIO PCI devices have peer-to-peer capabilities
> > > > enables, so we would be able to export their MMIO memory through DMABUF,
> > > 
> > > How do you know that they are safe to use with P2P?
> > 
> > All PCI devices are "safe" for P2P by spec. I've never heard of a
> > non-complaint device causing problems in this area.
> 
> Real PCIe device, yes.  But we have a lot of stuff mascquerading as
> such with is just emulated or special integrated.  I.e. a lot of
> integrated Intel GPUs claim had issue there.

Sure, but this should be handled by the P2P subsystem and PCI quirks,
IMHO. It isn't VFIOs job.. If people complain about broken HW then it
is easy to add those things.

I think the majority of stuff is OK, there is a chunk of
configurations that will have clean failures - meaning the initiating
device gets an error indication and handles it. Then there is a small
minority where the platform crashes with a machine check.

IDK where Intel GPU lands on this, but VFIO has always supported P2P
and userspace/VMs have always been able to trigger these kinds of
bugs. If nobody has complained so far I'm not inclined to do anything
right now.

VFIO has always kind of come along with a footnote that if you
actually want fully safe VFIO then it is up to the user to validate
the SOC and device implementations are sane.

Jason

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 8/9] vfio/pci: Enable peer-to-peer DMA transactions by default
  2025-10-20 13:08         ` Jason Gunthorpe
@ 2025-10-22  7:08           ` Christoph Hellwig
  2025-10-22 11:38             ` Jason Gunthorpe
  0 siblings, 1 reply; 45+ messages in thread
From: Christoph Hellwig @ 2025-10-22  7:08 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, Leon Romanovsky, Alex Williamson,
	Leon Romanovsky, Andrew Morton, Bjorn Helgaas,
	Christian König, dri-devel, iommu, Jens Axboe, Joerg Roedel,
	kvm, linaro-mm-sig, linux-block, linux-kernel, linux-media,
	linux-mm, linux-pci, Logan Gunthorpe, Marek Szyprowski,
	Robin Murphy, Sumit Semwal, Vivek Kasireddy, Will Deacon

On Mon, Oct 20, 2025 at 10:08:55AM -0300, Jason Gunthorpe wrote:
> Sure, but this should be handled by the P2P subsystem and PCI quirks,
> IMHO. It isn't VFIOs job.. If people complain about broken HW then it
> is easy to add those things.

I think it is.  You now open up behavior generally that previously
had specific drivers in charge.

> IDK where Intel GPU lands on this, but VFIO has always supported P2P

How?


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 8/9] vfio/pci: Enable peer-to-peer DMA transactions by default
  2025-10-22  7:08           ` Christoph Hellwig
@ 2025-10-22 11:38             ` Jason Gunthorpe
  0 siblings, 0 replies; 45+ messages in thread
From: Jason Gunthorpe @ 2025-10-22 11:38 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Leon Romanovsky, Alex Williamson, Leon Romanovsky, Andrew Morton,
	Bjorn Helgaas, Christian König, dri-devel, iommu, Jens Axboe,
	Joerg Roedel, kvm, linaro-mm-sig, linux-block, linux-kernel,
	linux-media, linux-mm, linux-pci, Logan Gunthorpe,
	Marek Szyprowski, Robin Murphy, Sumit Semwal, Vivek Kasireddy,
	Will Deacon

On Wed, Oct 22, 2025 at 12:08:48AM -0700, Christoph Hellwig wrote:
> On Mon, Oct 20, 2025 at 10:08:55AM -0300, Jason Gunthorpe wrote:
> > Sure, but this should be handled by the P2P subsystem and PCI quirks,
> > IMHO. It isn't VFIOs job.. If people complain about broken HW then it
> > is easy to add those things.
> 
> I think it is.  You now open up behavior generally that previously
> had specific drivers in charge.

It has always been available in VFIO. This series is fixing it up to
not have the lifetime bugs.

> > IDK where Intel GPU lands on this, but VFIO has always supported P2P
> 
> How?

It uses follow_pfnmap_start()/etc to fish the MMIO PFNs out of a VMA and
stick them into the iommu.

Jason

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 8/9] vfio/pci: Enable peer-to-peer DMA transactions by default
  2025-10-13 15:26 ` [PATCH v5 8/9] vfio/pci: Enable peer-to-peer DMA transactions by default Leon Romanovsky
  2025-10-16  4:09   ` Nicolin Chen
  2025-10-17  6:32   ` Christoph Hellwig
@ 2025-10-22 11:54   ` Jason Gunthorpe
  2 siblings, 0 replies; 45+ messages in thread
From: Jason Gunthorpe @ 2025-10-22 11:54 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Alex Williamson, Leon Romanovsky, Andrew Morton, Bjorn Helgaas,
	Christian König, dri-devel, iommu, Jens Axboe, Joerg Roedel,
	kvm, linaro-mm-sig, linux-block, linux-kernel, linux-media,
	linux-mm, linux-pci, Logan Gunthorpe, Marek Szyprowski,
	Robin Murphy, Sumit Semwal, Vivek Kasireddy, Will Deacon

On Mon, Oct 13, 2025 at 06:26:10PM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
> 
> Make sure that all VFIO PCI devices have peer-to-peer capabilities
> enables, so we would be able to export their MMIO memory through DMABUF,

Let's enhance this:

VFIO has always supported P2P mappings with itself. VFIO type 1
insecurely reads PFNs directly out of a VMA's PTEs and programs them
into the IOMMU allowing any two VFIO devices to perform P2P to each
other.

All existing VMMs use this capability to export P2P into a VM where
the VM could setup any kind of DMA it likes. Projects like DPDK/SPDK
are also known to make use of this, though less frequently.

As a first step to more properly integrating VFIO with the P2P
subsystem unconditionally enable P2P support for VFIO PCI devices. The
struct p2pdma_provider will act has a handle to the P2P subsystem to
do things like DMA mapping.

While real PCI devices have to support P2P (they can't even tell if an
IOVA is P2P or not) there may be fake PCI devices that may trigger
some kind of catastrophic system failure. To date VFIO has never
tripped up on such a case, but if one is discovered the plan is to add
a PCI quirk and have pcim_p2pdma_init() fail. This will fully block
the broken device throughout any users of the P2P subsystem in the
kernel.

Thus P2P through DMABUF will follow the historical VFIO model and be
unconditionally enabled by vfio-pci.

Jason

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v5 9/9] vfio/pci: Add dma-buf export support for MMIO regions
  2025-10-13 15:26 [PATCH v5 0/9] vfio/pci: Allow MMIO regions to be exported through dma-buf Leon Romanovsky
                   ` (7 preceding siblings ...)
  2025-10-13 15:26 ` [PATCH v5 8/9] vfio/pci: Enable peer-to-peer DMA transactions by default Leon Romanovsky
@ 2025-10-13 15:26 ` Leon Romanovsky
  2025-10-16 23:53   ` Jason Gunthorpe
                     ` (5 more replies)
  2025-10-15 21:15 ` [PATCH v5 0/9] vfio/pci: Allow MMIO regions to be exported through dma-buf shinichiro.kawasaki
  9 siblings, 6 replies; 45+ messages in thread
From: Leon Romanovsky @ 2025-10-13 15:26 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Leon Romanovsky, Jason Gunthorpe, Andrew Morton, Bjorn Helgaas,
	Christian König, dri-devel, iommu, Jens Axboe, Joerg Roedel,
	kvm, linaro-mm-sig, linux-block, linux-kernel, linux-media,
	linux-mm, linux-pci, Logan Gunthorpe, Marek Szyprowski,
	Robin Murphy, Sumit Semwal, Vivek Kasireddy, Will Deacon

From: Leon Romanovsky <leonro@nvidia.com>

Add support for exporting PCI device MMIO regions through dma-buf,
enabling safe sharing of non-struct page memory with controlled
lifetime management. This allows RDMA and other subsystems to import
dma-buf FDs and build them into memory regions for PCI P2P operations.

The implementation provides a revocable attachment mechanism using
dma-buf move operations. MMIO regions are normally pinned as BARs
don't change physical addresses, but access is revoked when the VFIO
device is closed or a PCI reset is issued. This ensures kernel
self-defense against potentially hostile userspace.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/vfio/pci/Kconfig           |   3 +
 drivers/vfio/pci/Makefile          |   2 +
 drivers/vfio/pci/vfio_pci_config.c |  22 +-
 drivers/vfio/pci/vfio_pci_core.c   |  28 ++
 drivers/vfio/pci/vfio_pci_dmabuf.c | 446 +++++++++++++++++++++++++++++
 drivers/vfio/pci/vfio_pci_priv.h   |  23 ++
 include/linux/vfio_pci_core.h      |   1 +
 include/uapi/linux/vfio.h          |  25 ++
 8 files changed, 546 insertions(+), 4 deletions(-)
 create mode 100644 drivers/vfio/pci/vfio_pci_dmabuf.c

diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index 2b0172f54665..2b9fca00e9e8 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -55,6 +55,9 @@ config VFIO_PCI_ZDEV_KVM
 
 	  To enable s390x KVM vfio-pci extensions, say Y.
 
+config VFIO_PCI_DMABUF
+	def_bool y if VFIO_PCI_CORE && PCI_P2PDMA && DMA_SHARED_BUFFER
+
 source "drivers/vfio/pci/mlx5/Kconfig"
 
 source "drivers/vfio/pci/hisilicon/Kconfig"
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index cf00c0a7e55c..f9155e9c5f63 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -2,7 +2,9 @@
 
 vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
 vfio-pci-core-$(CONFIG_VFIO_PCI_ZDEV_KVM) += vfio_pci_zdev.o
+
 obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o
+vfio-pci-core-$(CONFIG_VFIO_PCI_DMABUF) += vfio_pci_dmabuf.o
 
 vfio-pci-y := vfio_pci.o
 vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index 8f02f236b5b4..1f6008eabf23 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -589,10 +589,12 @@ static int vfio_basic_config_write(struct vfio_pci_core_device *vdev, int pos,
 		virt_mem = !!(le16_to_cpu(*virt_cmd) & PCI_COMMAND_MEMORY);
 		new_mem = !!(new_cmd & PCI_COMMAND_MEMORY);
 
-		if (!new_mem)
+		if (!new_mem) {
 			vfio_pci_zap_and_down_write_memory_lock(vdev);
-		else
+			vfio_pci_dma_buf_move(vdev, true);
+		} else {
 			down_write(&vdev->memory_lock);
+		}
 
 		/*
 		 * If the user is writing mem/io enable (new_mem/io) and we
@@ -627,6 +629,8 @@ static int vfio_basic_config_write(struct vfio_pci_core_device *vdev, int pos,
 		*virt_cmd &= cpu_to_le16(~mask);
 		*virt_cmd |= cpu_to_le16(new_cmd & mask);
 
+		if (__vfio_pci_memory_enabled(vdev))
+			vfio_pci_dma_buf_move(vdev, false);
 		up_write(&vdev->memory_lock);
 	}
 
@@ -707,12 +711,16 @@ static int __init init_pci_cap_basic_perm(struct perm_bits *perm)
 static void vfio_lock_and_set_power_state(struct vfio_pci_core_device *vdev,
 					  pci_power_t state)
 {
-	if (state >= PCI_D3hot)
+	if (state >= PCI_D3hot) {
 		vfio_pci_zap_and_down_write_memory_lock(vdev);
-	else
+		vfio_pci_dma_buf_move(vdev, true);
+	} else {
 		down_write(&vdev->memory_lock);
+	}
 
 	vfio_pci_set_power_state(vdev, state);
+	if (__vfio_pci_memory_enabled(vdev))
+		vfio_pci_dma_buf_move(vdev, false);
 	up_write(&vdev->memory_lock);
 }
 
@@ -900,7 +908,10 @@ static int vfio_exp_config_write(struct vfio_pci_core_device *vdev, int pos,
 
 		if (!ret && (cap & PCI_EXP_DEVCAP_FLR)) {
 			vfio_pci_zap_and_down_write_memory_lock(vdev);
+			vfio_pci_dma_buf_move(vdev, true);
 			pci_try_reset_function(vdev->pdev);
+			if (__vfio_pci_memory_enabled(vdev))
+				vfio_pci_dma_buf_move(vdev, false);
 			up_write(&vdev->memory_lock);
 		}
 	}
@@ -982,7 +993,10 @@ static int vfio_af_config_write(struct vfio_pci_core_device *vdev, int pos,
 
 		if (!ret && (cap & PCI_AF_CAP_FLR) && (cap & PCI_AF_CAP_TP)) {
 			vfio_pci_zap_and_down_write_memory_lock(vdev);
+			vfio_pci_dma_buf_move(vdev, true);
 			pci_try_reset_function(vdev->pdev);
+			if (__vfio_pci_memory_enabled(vdev))
+				vfio_pci_dma_buf_move(vdev, false);
 			up_write(&vdev->memory_lock);
 		}
 	}
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index fe247d0e2831..56b1320238a9 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -287,6 +287,8 @@ static int vfio_pci_runtime_pm_entry(struct vfio_pci_core_device *vdev,
 	 * semaphore.
 	 */
 	vfio_pci_zap_and_down_write_memory_lock(vdev);
+	vfio_pci_dma_buf_move(vdev, true);
+
 	if (vdev->pm_runtime_engaged) {
 		up_write(&vdev->memory_lock);
 		return -EINVAL;
@@ -370,6 +372,8 @@ static void vfio_pci_runtime_pm_exit(struct vfio_pci_core_device *vdev)
 	 */
 	down_write(&vdev->memory_lock);
 	__vfio_pci_runtime_pm_exit(vdev);
+	if (__vfio_pci_memory_enabled(vdev))
+		vfio_pci_dma_buf_move(vdev, false);
 	up_write(&vdev->memory_lock);
 }
 
@@ -690,6 +694,8 @@ void vfio_pci_core_close_device(struct vfio_device *core_vdev)
 #endif
 	vfio_pci_core_disable(vdev);
 
+	vfio_pci_dma_buf_cleanup(vdev);
+
 	mutex_lock(&vdev->igate);
 	if (vdev->err_trigger) {
 		eventfd_ctx_put(vdev->err_trigger);
@@ -1222,7 +1228,10 @@ static int vfio_pci_ioctl_reset(struct vfio_pci_core_device *vdev,
 	 */
 	vfio_pci_set_power_state(vdev, PCI_D0);
 
+	vfio_pci_dma_buf_move(vdev, true);
 	ret = pci_try_reset_function(vdev->pdev);
+	if (__vfio_pci_memory_enabled(vdev))
+		vfio_pci_dma_buf_move(vdev, false);
 	up_write(&vdev->memory_lock);
 
 	return ret;
@@ -1511,6 +1520,19 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
 		return vfio_pci_core_pm_exit(vdev, flags, arg, argsz);
 	case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN:
 		return vfio_pci_core_feature_token(vdev, flags, arg, argsz);
+	case VFIO_DEVICE_FEATURE_DMA_BUF:
+		if (device->ops->ioctl != vfio_pci_core_ioctl)
+			/*
+			 * Devices that overwrite general .ioctl() callback
+			 * usually do it to implement their own
+			 * VFIO_DEVICE_GET_REGION_INFO handlerm and they present
+			 * different BAR information from the real PCI.
+			 *
+			 * DMABUF relies on real PCI information.
+			 */
+			return -EOPNOTSUPP;
+
+		return vfio_pci_core_feature_dma_buf(vdev, flags, arg, argsz);
 	default:
 		return -ENOTTY;
 	}
@@ -2095,6 +2117,7 @@ int vfio_pci_core_init_dev(struct vfio_device *core_vdev)
 	ret = pcim_p2pdma_init(vdev->pdev);
 	if (ret != -EOPNOTSUPP)
 		return ret;
+	INIT_LIST_HEAD(&vdev->dmabufs);
 	init_rwsem(&vdev->memory_lock);
 	xa_init(&vdev->ctx);
 
@@ -2459,6 +2482,7 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
 			break;
 		}
 
+		vfio_pci_dma_buf_move(vdev, true);
 		vfio_pci_zap_bars(vdev);
 	}
 
@@ -2482,6 +2506,10 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
 
 	ret = pci_reset_bus(pdev);
 
+	list_for_each_entry(vdev, &dev_set->device_list, vdev.dev_set_list)
+		if (__vfio_pci_memory_enabled(vdev))
+			vfio_pci_dma_buf_move(vdev, false);
+
 	vdev = list_last_entry(&dev_set->device_list,
 			       struct vfio_pci_core_device, vdev.dev_set_list);
 
diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c
new file mode 100644
index 000000000000..eaba010777f3
--- /dev/null
+++ b/drivers/vfio/pci/vfio_pci_dmabuf.c
@@ -0,0 +1,446 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES.
+ */
+#include <linux/dma-buf.h>
+#include <linux/pci-p2pdma.h>
+#include <linux/dma-resv.h>
+
+#include "vfio_pci_priv.h"
+
+MODULE_IMPORT_NS("DMA_BUF");
+
+struct vfio_pci_dma_buf {
+	struct dma_buf *dmabuf;
+	struct vfio_pci_core_device *vdev;
+	struct list_head dmabufs_elm;
+	size_t size;
+	struct phys_vec *phys_vec;
+	struct p2pdma_provider *provider;
+	u32 nr_ranges;
+	u8 revoked : 1;
+};
+
+static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf,
+				   struct dma_buf_attachment *attachment)
+{
+	struct vfio_pci_dma_buf *priv = dmabuf->priv;
+
+	if (!attachment->peer2peer)
+		return -EOPNOTSUPP;
+
+	if (priv->revoked)
+		return -ENODEV;
+
+	switch (pci_p2pdma_map_type(priv->provider, attachment->dev)) {
+	case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
+		break;
+	case PCI_P2PDMA_MAP_BUS_ADDR:
+		/*
+		 * There is no need in IOVA at all for this flow.
+		 * We rely on attachment->priv == NULL as a marker
+		 * for this mode.
+		 */
+		return 0;
+	default:
+		return -EINVAL;
+	}
+
+	attachment->priv = kzalloc(sizeof(struct dma_iova_state), GFP_KERNEL);
+	if (!attachment->priv)
+		return -ENOMEM;
+
+	dma_iova_try_alloc(attachment->dev, attachment->priv, 0, priv->size);
+	return 0;
+}
+
+static void vfio_pci_dma_buf_detach(struct dma_buf *dmabuf,
+				    struct dma_buf_attachment *attachment)
+{
+	kfree(attachment->priv);
+}
+
+static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, u64 length,
+					 dma_addr_t addr)
+{
+	unsigned int len, nents;
+	int i;
+
+	nents = DIV_ROUND_UP(length, UINT_MAX);
+	for (i = 0; i < nents; i++) {
+		len = min_t(u64, length, UINT_MAX);
+		length -= len;
+		/*
+		 * Follow the DMABUF rules for scatterlist, the struct page can
+		 * be NULL'd for MMIO only memory.
+		 */
+		sg_set_page(sgl, NULL, len, 0);
+		sg_dma_address(sgl) = addr + i * UINT_MAX;
+		sg_dma_len(sgl) = len;
+		sgl = sg_next(sgl);
+	}
+
+	return sgl;
+}
+
+static unsigned int calc_sg_nents(struct vfio_pci_dma_buf *priv,
+				  struct dma_iova_state *state)
+{
+	struct phys_vec *phys_vec = priv->phys_vec;
+	unsigned int nents = 0;
+	u32 i;
+
+	if (!state || !dma_use_iova(state))
+		for (i = 0; i < priv->nr_ranges; i++)
+			nents += DIV_ROUND_UP(phys_vec[i].len, UINT_MAX);
+	else
+		/*
+		 * In IOVA case, there is only one SG entry which spans
+		 * for whole IOVA address space, but we need to make sure
+		 * that it fits sg->length, maybe we need more.
+		 */
+		nents = DIV_ROUND_UP(priv->size, UINT_MAX);
+
+	return nents;
+}
+
+static struct sg_table *
+vfio_pci_dma_buf_map(struct dma_buf_attachment *attachment,
+		     enum dma_data_direction dir)
+{
+	struct vfio_pci_dma_buf *priv = attachment->dmabuf->priv;
+	struct dma_iova_state *state = attachment->priv;
+	struct phys_vec *phys_vec = priv->phys_vec;
+	unsigned long attrs = DMA_ATTR_MMIO;
+	unsigned int nents, mapped_len = 0;
+	struct scatterlist *sgl;
+	struct sg_table *sgt;
+	dma_addr_t addr;
+	int ret;
+	u32 i;
+
+	dma_resv_assert_held(priv->dmabuf->resv);
+
+	if (priv->revoked)
+		return ERR_PTR(-ENODEV);
+
+	sgt = kzalloc(sizeof(*sgt), GFP_KERNEL);
+	if (!sgt)
+		return ERR_PTR(-ENOMEM);
+
+	nents = calc_sg_nents(priv, state);
+	ret = sg_alloc_table(sgt, nents, GFP_KERNEL | __GFP_ZERO);
+	if (ret)
+		goto err_kfree_sgt;
+
+	sgl = sgt->sgl;
+
+	for (i = 0; i < priv->nr_ranges; i++) {
+		if (!state) {
+			addr = pci_p2pdma_bus_addr_map(priv->provider,
+						       phys_vec[i].paddr);
+		} else if (dma_use_iova(state)) {
+			ret = dma_iova_link(attachment->dev, state,
+					    phys_vec[i].paddr, 0,
+					    phys_vec[i].len, dir, attrs);
+			if (ret)
+				goto err_unmap_dma;
+
+			mapped_len += phys_vec[i].len;
+		} else {
+			addr = dma_map_phys(attachment->dev, phys_vec[i].paddr,
+					    phys_vec[i].len, dir, attrs);
+			ret = dma_mapping_error(attachment->dev, addr);
+			if (ret)
+				goto err_unmap_dma;
+		}
+
+		if (!state || !dma_use_iova(state))
+			sgl = fill_sg_entry(sgl, phys_vec[i].len, addr);
+	}
+
+	if (state && dma_use_iova(state)) {
+		WARN_ON_ONCE(mapped_len != priv->size);
+		ret = dma_iova_sync(attachment->dev, state, 0, mapped_len);
+		if (ret)
+			goto err_unmap_dma;
+		sgl = fill_sg_entry(sgl, mapped_len, state->addr);
+	}
+
+	/*
+	 * SGL must be NULL to indicate that SGL is the last one
+	 * and we allocated correct number of entries in sg_alloc_table()
+	 */
+	WARN_ON_ONCE(sgl);
+	return sgt;
+
+err_unmap_dma:
+	if (!i || !state)
+		; /* Do nothing */
+	else if (dma_use_iova(state))
+		dma_iova_destroy(attachment->dev, state, mapped_len, dir,
+				 attrs);
+	else
+		for_each_sgtable_dma_sg(sgt, sgl, i)
+			dma_unmap_phys(attachment->dev, sg_dma_address(sgl),
+					sg_dma_len(sgl), dir, attrs);
+	sg_free_table(sgt);
+err_kfree_sgt:
+	kfree(sgt);
+	return ERR_PTR(ret);
+}
+
+static void vfio_pci_dma_buf_unmap(struct dma_buf_attachment *attachment,
+				   struct sg_table *sgt,
+				   enum dma_data_direction dir)
+{
+	struct vfio_pci_dma_buf *priv = attachment->dmabuf->priv;
+	struct dma_iova_state *state = attachment->priv;
+	unsigned long attrs = DMA_ATTR_MMIO;
+	struct scatterlist *sgl;
+	int i;
+
+	if (!state)
+		; /* Do nothing */
+	else if (dma_use_iova(state))
+		dma_iova_destroy(attachment->dev, state, priv->size, dir,
+				 attrs);
+	else
+		for_each_sgtable_dma_sg(sgt, sgl, i)
+			dma_unmap_phys(attachment->dev, sg_dma_address(sgl),
+				       sg_dma_len(sgl), dir, attrs);
+
+	sg_free_table(sgt);
+	kfree(sgt);
+}
+
+static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf)
+{
+	struct vfio_pci_dma_buf *priv = dmabuf->priv;
+
+	/*
+	 * Either this or vfio_pci_dma_buf_cleanup() will remove from the list.
+	 * The refcount prevents both.
+	 */
+	if (priv->vdev) {
+		down_write(&priv->vdev->memory_lock);
+		list_del_init(&priv->dmabufs_elm);
+		up_write(&priv->vdev->memory_lock);
+		vfio_device_put_registration(&priv->vdev->vdev);
+	}
+	kfree(priv->phys_vec);
+	kfree(priv);
+}
+
+static const struct dma_buf_ops vfio_pci_dmabuf_ops = {
+	.attach = vfio_pci_dma_buf_attach,
+	.detach = vfio_pci_dma_buf_detach,
+	.map_dma_buf = vfio_pci_dma_buf_map,
+	.release = vfio_pci_dma_buf_release,
+	.unmap_dma_buf = vfio_pci_dma_buf_unmap,
+};
+
+static void dma_ranges_to_p2p_phys(struct vfio_pci_dma_buf *priv,
+				   struct vfio_device_feature_dma_buf *dma_buf,
+				   struct vfio_region_dma_range *dma_ranges,
+				   struct p2pdma_provider *provider)
+{
+	struct pci_dev *pdev = priv->vdev->pdev;
+	phys_addr_t pci_start;
+	u32 i;
+
+	pci_start = pci_resource_start(pdev, dma_buf->region_index);
+	for (i = 0; i < dma_buf->nr_ranges; i++) {
+		priv->phys_vec[i].len = dma_ranges[i].length;
+		priv->phys_vec[i].paddr = pci_start + dma_ranges[i].offset;
+		priv->size += priv->phys_vec[i].len;
+	}
+	priv->nr_ranges = dma_buf->nr_ranges;
+	priv->provider = provider;
+}
+
+static int validate_dmabuf_input(struct vfio_pci_core_device *vdev,
+				 struct vfio_device_feature_dma_buf *dma_buf,
+				 struct vfio_region_dma_range *dma_ranges,
+				 struct p2pdma_provider **provider)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	u32 bar = dma_buf->region_index;
+	resource_size_t bar_size;
+	u64 length = 0, sum;
+	u32 i;
+
+	if (dma_buf->flags)
+		return -EINVAL;
+	/*
+	 * For PCI the region_index is the BAR number like  everything else.
+	 */
+	if (bar >= VFIO_PCI_ROM_REGION_INDEX)
+		return -ENODEV;
+
+	*provider = pcim_p2pdma_provider(pdev, bar);
+	if (!*provider)
+		return -EINVAL;
+
+	bar_size = pci_resource_len(pdev, bar);
+	for (i = 0; i < dma_buf->nr_ranges; i++) {
+		u64 offset = dma_ranges[i].offset;
+		u64 len = dma_ranges[i].length;
+
+		if (!len || !PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
+			return -EINVAL;
+
+		if (check_add_overflow(offset, len, &sum) || sum > bar_size)
+			return -EINVAL;
+
+		/* Total requested length can't overflow IOVA size */
+		if (check_add_overflow(length, len, &sum))
+			return -EINVAL;
+
+		length = sum;
+	}
+
+	/*
+	 * DMA API uses size_t, so make sure that requested region length
+	 * can fit into size_t variable, which can be unsigned int (32bits).
+	 *
+	 * In addition make sure that high bit of total length is not used too
+	 * as it is used as a marker for DMA IOVA API.
+	 */
+	if (overflows_type(length, size_t) || length & DMA_IOVA_USE_SWIOTLB)
+		return -EINVAL;
+
+	return 0;
+}
+
+int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
+				  struct vfio_device_feature_dma_buf __user *arg,
+				  size_t argsz)
+{
+	struct vfio_device_feature_dma_buf get_dma_buf = {};
+	struct vfio_region_dma_range *dma_ranges;
+	DEFINE_DMA_BUF_EXPORT_INFO(exp_info);
+	struct p2pdma_provider *provider;
+	struct vfio_pci_dma_buf *priv;
+	int ret;
+
+	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_GET,
+				 sizeof(get_dma_buf));
+	if (ret != 1)
+		return ret;
+
+	if (copy_from_user(&get_dma_buf, arg, sizeof(get_dma_buf)))
+		return -EFAULT;
+
+	if (!get_dma_buf.nr_ranges)
+		return -EINVAL;
+
+	dma_ranges = memdup_array_user(&arg->dma_ranges, get_dma_buf.nr_ranges,
+				       sizeof(*dma_ranges));
+	if (IS_ERR(dma_ranges))
+		return PTR_ERR(dma_ranges);
+
+	ret = validate_dmabuf_input(vdev, &get_dma_buf, dma_ranges, &provider);
+	if (ret)
+		goto err_free_ranges;
+
+	priv = kzalloc(sizeof(*priv), GFP_KERNEL);
+	if (!priv) {
+		ret = -ENOMEM;
+		goto err_free_ranges;
+	}
+	priv->phys_vec = kcalloc(get_dma_buf.nr_ranges, sizeof(*priv->phys_vec),
+				 GFP_KERNEL);
+	if (!priv->phys_vec) {
+		ret = -ENOMEM;
+		goto err_free_priv;
+	}
+
+	priv->vdev = vdev;
+	dma_ranges_to_p2p_phys(priv, &get_dma_buf, dma_ranges, provider);
+	kfree(dma_ranges);
+	dma_ranges = NULL;
+
+	if (!vfio_device_try_get_registration(&vdev->vdev)) {
+		ret = -ENODEV;
+		goto err_free_phys;
+	}
+
+	exp_info.ops = &vfio_pci_dmabuf_ops;
+	exp_info.size = priv->size;
+	exp_info.flags = get_dma_buf.open_flags;
+	exp_info.priv = priv;
+
+	priv->dmabuf = dma_buf_export(&exp_info);
+	if (IS_ERR(priv->dmabuf)) {
+		ret = PTR_ERR(priv->dmabuf);
+		goto err_dev_put;
+	}
+
+	/* dma_buf_put() now frees priv */
+	INIT_LIST_HEAD(&priv->dmabufs_elm);
+	down_write(&vdev->memory_lock);
+	dma_resv_lock(priv->dmabuf->resv, NULL);
+	priv->revoked = !__vfio_pci_memory_enabled(vdev);
+	list_add_tail(&priv->dmabufs_elm, &vdev->dmabufs);
+	dma_resv_unlock(priv->dmabuf->resv);
+	up_write(&vdev->memory_lock);
+
+	/*
+	 * dma_buf_fd() consumes the reference, when the file closes the dmabuf
+	 * will be released.
+	 */
+	return dma_buf_fd(priv->dmabuf, get_dma_buf.open_flags);
+
+err_dev_put:
+	vfio_device_put_registration(&vdev->vdev);
+err_free_phys:
+	kfree(priv->phys_vec);
+err_free_priv:
+	kfree(priv);
+err_free_ranges:
+	kfree(dma_ranges);
+	return ret;
+}
+
+void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
+{
+	struct vfio_pci_dma_buf *priv;
+	struct vfio_pci_dma_buf *tmp;
+
+	lockdep_assert_held_write(&vdev->memory_lock);
+
+	list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
+		if (!get_file_active(&priv->dmabuf->file))
+			continue;
+
+		if (priv->revoked != revoked) {
+			dma_resv_lock(priv->dmabuf->resv, NULL);
+			priv->revoked = revoked;
+			dma_buf_move_notify(priv->dmabuf);
+			dma_resv_unlock(priv->dmabuf->resv);
+		}
+		dma_buf_put(priv->dmabuf);
+	}
+}
+
+void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev)
+{
+	struct vfio_pci_dma_buf *priv;
+	struct vfio_pci_dma_buf *tmp;
+
+	down_write(&vdev->memory_lock);
+	list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
+		if (!get_file_active(&priv->dmabuf->file))
+			continue;
+
+		dma_resv_lock(priv->dmabuf->resv, NULL);
+		list_del_init(&priv->dmabufs_elm);
+		priv->vdev = NULL;
+		priv->revoked = true;
+		dma_buf_move_notify(priv->dmabuf);
+		dma_resv_unlock(priv->dmabuf->resv);
+		vfio_device_put_registration(&vdev->vdev);
+		dma_buf_put(priv->dmabuf);
+	}
+	up_write(&vdev->memory_lock);
+}
diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
index a9972eacb293..28a405f8b97c 100644
--- a/drivers/vfio/pci/vfio_pci_priv.h
+++ b/drivers/vfio/pci/vfio_pci_priv.h
@@ -107,4 +107,27 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
 	return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
 }
 
+#ifdef CONFIG_VFIO_PCI_DMABUF
+int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
+				  struct vfio_device_feature_dma_buf __user *arg,
+				  size_t argsz);
+void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev);
+void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked);
+#else
+static inline int
+vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
+			      struct vfio_device_feature_dma_buf __user *arg,
+			      size_t argsz)
+{
+	return -ENOTTY;
+}
+static inline void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev)
+{
+}
+static inline void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev,
+					 bool revoked)
+{
+}
+#endif
+
 #endif
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index f541044e42a2..30d74b364f25 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -94,6 +94,7 @@ struct vfio_pci_core_device {
 	struct vfio_pci_core_device	*sriov_pf_core_dev;
 	struct notifier_block	nb;
 	struct rw_semaphore	memory_lock;
+	struct list_head	dmabufs;
 };
 
 /* Will be exported for vfio pci drivers usage */
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 75100bf009ba..63214467c875 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -1478,6 +1478,31 @@ struct vfio_device_feature_bus_master {
 };
 #define VFIO_DEVICE_FEATURE_BUS_MASTER 10
 
+/**
+ * Upon VFIO_DEVICE_FEATURE_GET create a dma_buf fd for the
+ * regions selected.
+ *
+ * open_flags are the typical flags passed to open(2), eg O_RDWR, O_CLOEXEC,
+ * etc. offset/length specify a slice of the region to create the dmabuf from.
+ * nr_ranges is the total number of (P2P DMA) ranges that comprise the dmabuf.
+ *
+ * Return: The fd number on success, -1 and errno is set on failure.
+ */
+#define VFIO_DEVICE_FEATURE_DMA_BUF 11
+
+struct vfio_region_dma_range {
+	__u64 offset;
+	__u64 length;
+};
+
+struct vfio_device_feature_dma_buf {
+	__u32	region_index;
+	__u32	open_flags;
+	__u32   flags;
+	__u32   nr_ranges;
+	struct vfio_region_dma_range dma_ranges[];
+};
+
 /* -------- API for Type1 VFIO IOMMU -------- */
 
 /**
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 9/9] vfio/pci: Add dma-buf export support for MMIO regions
  2025-10-13 15:26 ` [PATCH v5 9/9] vfio/pci: Add dma-buf export support for MMIO regions Leon Romanovsky
@ 2025-10-16 23:53   ` Jason Gunthorpe
  2025-10-17  5:40     ` Leon Romanovsky
  2025-10-17  0:01   ` Jason Gunthorpe
                     ` (4 subsequent siblings)
  5 siblings, 1 reply; 45+ messages in thread
From: Jason Gunthorpe @ 2025-10-16 23:53 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Alex Williamson, Leon Romanovsky, Andrew Morton, Bjorn Helgaas,
	Christian König, dri-devel, iommu, Jens Axboe, Joerg Roedel,
	kvm, linaro-mm-sig, linux-block, linux-kernel, linux-media,
	linux-mm, linux-pci, Logan Gunthorpe, Marek Szyprowski,
	Robin Murphy, Sumit Semwal, Vivek Kasireddy, Will Deacon

On Mon, Oct 13, 2025 at 06:26:11PM +0300, Leon Romanovsky wrote:
> +
> +static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf,
> +				   struct dma_buf_attachment *attachment)
> +{
> +	struct vfio_pci_dma_buf *priv = dmabuf->priv;
> +
> +	if (!attachment->peer2peer)
> +		return -EOPNOTSUPP;
> +
> +	if (priv->revoked)
> +		return -ENODEV;
> +
> +	switch (pci_p2pdma_map_type(priv->provider, attachment->dev)) {
> +	case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
> +		break;
> +	case PCI_P2PDMA_MAP_BUS_ADDR:
> +		/*
> +		 * There is no need in IOVA at all for this flow.
> +		 * We rely on attachment->priv == NULL as a marker
> +		 * for this mode.
> +		 */
> +		return 0;
> +	default:
> +		return -EINVAL;
> +	}
> +
> +	attachment->priv = kzalloc(sizeof(struct dma_iova_state), GFP_KERNEL);
> +	if (!attachment->priv)
> +		return -ENOMEM;
> +
> +	dma_iova_try_alloc(attachment->dev, attachment->priv, 0, priv->size);

The lifetime of this isn't good..

> +	return 0;
> +}
> +
> +static void vfio_pci_dma_buf_detach(struct dma_buf *dmabuf,
> +				    struct dma_buf_attachment *attachment)
> +{
> +	kfree(attachment->priv);
> +}

If the caller fails to call map then it leaks the iova.

> +static struct sg_table *
> +vfio_pci_dma_buf_map(struct dma_buf_attachment *attachment,
> +		     enum dma_data_direction dir)
> +{
[..]


> +err_unmap_dma:
> +	if (!i || !state)
> +		; /* Do nothing */
> +	else if (dma_use_iova(state))
> +		dma_iova_destroy(attachment->dev, state, mapped_len, dir,
> +				 attrs);

If we hit this error path then it is freed..

> +static void vfio_pci_dma_buf_unmap(struct dma_buf_attachment *attachment,
> +				   struct sg_table *sgt,
> +				   enum dma_data_direction dir)
> +{
> +	struct vfio_pci_dma_buf *priv = attachment->dmabuf->priv;
> +	struct dma_iova_state *state = attachment->priv;
> +	unsigned long attrs = DMA_ATTR_MMIO;
> +	struct scatterlist *sgl;
> +	int i;
> +
> +	if (!state)
> +		; /* Do nothing */
> +	else if (dma_use_iova(state))
> +		dma_iova_destroy(attachment->dev, state, priv->size, dir,
> +				 attrs);

It is freed here too, but we can call map multiple times. Every time a
move_notify happens can trigger another call to map.

I think just call unlink in those two and put dma_iova_free in detach

Jason

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 9/9] vfio/pci: Add dma-buf export support for MMIO regions
  2025-10-16 23:53   ` Jason Gunthorpe
@ 2025-10-17  5:40     ` Leon Romanovsky
  2025-10-17 15:58       ` Jason Gunthorpe
  0 siblings, 1 reply; 45+ messages in thread
From: Leon Romanovsky @ 2025-10-17  5:40 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Andrew Morton, Bjorn Helgaas,
	Christian König, dri-devel, iommu, Jens Axboe, Joerg Roedel,
	kvm, linaro-mm-sig, linux-block, linux-kernel, linux-media,
	linux-mm, linux-pci, Logan Gunthorpe, Marek Szyprowski,
	Robin Murphy, Sumit Semwal, Vivek Kasireddy, Will Deacon

On Thu, Oct 16, 2025 at 08:53:32PM -0300, Jason Gunthorpe wrote:
> On Mon, Oct 13, 2025 at 06:26:11PM +0300, Leon Romanovsky wrote:
> > +
> > +static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf,
> > +				   struct dma_buf_attachment *attachment)
> > +{
> > +	struct vfio_pci_dma_buf *priv = dmabuf->priv;
> > +
> > +	if (!attachment->peer2peer)
> > +		return -EOPNOTSUPP;
> > +
> > +	if (priv->revoked)
> > +		return -ENODEV;
> > +
> > +	switch (pci_p2pdma_map_type(priv->provider, attachment->dev)) {
> > +	case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
> > +		break;
> > +	case PCI_P2PDMA_MAP_BUS_ADDR:
> > +		/*
> > +		 * There is no need in IOVA at all for this flow.
> > +		 * We rely on attachment->priv == NULL as a marker
> > +		 * for this mode.
> > +		 */
> > +		return 0;
> > +	default:
> > +		return -EINVAL;
> > +	}
> > +
> > +	attachment->priv = kzalloc(sizeof(struct dma_iova_state), GFP_KERNEL);
> > +	if (!attachment->priv)
> > +		return -ENOMEM;
> > +
> > +	dma_iova_try_alloc(attachment->dev, attachment->priv, 0, priv->size);
> 
> The lifetime of this isn't good..
> 
> > +	return 0;
> > +}
> > +
> > +static void vfio_pci_dma_buf_detach(struct dma_buf *dmabuf,
> > +				    struct dma_buf_attachment *attachment)
> > +{
> > +	kfree(attachment->priv);
> > +}
> 
> If the caller fails to call map then it leaks the iova.

I'm relying on dmabuf code and documentation:

   926 /**
   927  * dma_buf_dynamic_attach - Add the device to dma_buf's attachments list
...   
   932  *
   933  * Returns struct dma_buf_attachment pointer for this attachment. Attachments
   934  * must be cleaned up by calling dma_buf_detach().

Successful call to vfio_pci_dma_buf_attach() MUST be accompanied by call
to vfio_pci_dma_buf_detach(), so as far as dmabuf implementation follows
it, there is no leak.

> 
> > +static struct sg_table *
> > +vfio_pci_dma_buf_map(struct dma_buf_attachment *attachment,
> > +		     enum dma_data_direction dir)
> > +{
> [..]
> 
> 
> > +err_unmap_dma:
> > +	if (!i || !state)
> > +		; /* Do nothing */
> > +	else if (dma_use_iova(state))
> > +		dma_iova_destroy(attachment->dev, state, mapped_len, dir,
> > +				 attrs);
> 
> If we hit this error path then it is freed..
> 
> > +static void vfio_pci_dma_buf_unmap(struct dma_buf_attachment *attachment,
> > +				   struct sg_table *sgt,
> > +				   enum dma_data_direction dir)
> > +{
> > +	struct vfio_pci_dma_buf *priv = attachment->dmabuf->priv;
> > +	struct dma_iova_state *state = attachment->priv;
> > +	unsigned long attrs = DMA_ATTR_MMIO;
> > +	struct scatterlist *sgl;
> > +	int i;
> > +
> > +	if (!state)
> > +		; /* Do nothing */
> > +	else if (dma_use_iova(state))
> > +		dma_iova_destroy(attachment->dev, state, priv->size, dir,
> > +				 attrs);
> 
> It is freed here too, but we can call map multiple times. Every time a
> move_notify happens can trigger another call to map.
> 
> I think just call unlink in those two and put dma_iova_free in detach

Yes, it can work.

Thanks

> 
> Jason

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 9/9] vfio/pci: Add dma-buf export support for MMIO regions
  2025-10-17  5:40     ` Leon Romanovsky
@ 2025-10-17 15:58       ` Jason Gunthorpe
  2025-10-17 16:01         ` Jason Gunthorpe
  0 siblings, 1 reply; 45+ messages in thread
From: Jason Gunthorpe @ 2025-10-17 15:58 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Alex Williamson, Andrew Morton, Bjorn Helgaas,
	Christian König, dri-devel, iommu, Jens Axboe, Joerg Roedel,
	kvm, linaro-mm-sig, linux-block, linux-kernel, linux-media,
	linux-mm, linux-pci, Logan Gunthorpe, Marek Szyprowski,
	Robin Murphy, Sumit Semwal, Vivek Kasireddy, Will Deacon

On Fri, Oct 17, 2025 at 08:40:07AM +0300, Leon Romanovsky wrote:
> > > +static void vfio_pci_dma_buf_detach(struct dma_buf *dmabuf,
> > > +				    struct dma_buf_attachment *attachment)
> > > +{
> > > +	kfree(attachment->priv);
> > > +}
> > 
> > If the caller fails to call map then it leaks the iova.
> 
> I'm relying on dmabuf code and documentation:
> 
>    926 /**
>    927  * dma_buf_dynamic_attach - Add the device to dma_buf's attachments list
> ...   
>    932  *
>    933  * Returns struct dma_buf_attachment pointer for this attachment. Attachments
>    934  * must be cleaned up by calling dma_buf_detach().
> 
> Successful call to vfio_pci_dma_buf_attach() MUST be accompanied by call
> to vfio_pci_dma_buf_detach(), so as far as dmabuf implementation follows
> it, there is no leak.

It leaks the ivoa because there is no dma_iova_destroy() unless you
call unmap. detach is not unmap and unmap is not mandatory to call.

Jason

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 9/9] vfio/pci: Add dma-buf export support for MMIO regions
  2025-10-17 15:58       ` Jason Gunthorpe
@ 2025-10-17 16:01         ` Jason Gunthorpe
  0 siblings, 0 replies; 45+ messages in thread
From: Jason Gunthorpe @ 2025-10-17 16:01 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Alex Williamson, Andrew Morton, Bjorn Helgaas,
	Christian König, dri-devel, iommu, Jens Axboe, Joerg Roedel,
	kvm, linaro-mm-sig, linux-block, linux-kernel, linux-media,
	linux-mm, linux-pci, Logan Gunthorpe, Marek Szyprowski,
	Robin Murphy, Sumit Semwal, Vivek Kasireddy, Will Deacon

On Fri, Oct 17, 2025 at 12:58:50PM -0300, Jason Gunthorpe wrote:
> On Fri, Oct 17, 2025 at 08:40:07AM +0300, Leon Romanovsky wrote:
> > > > +static void vfio_pci_dma_buf_detach(struct dma_buf *dmabuf,
> > > > +				    struct dma_buf_attachment *attachment)
> > > > +{
> > > > +	kfree(attachment->priv);
> > > > +}
> > > 
> > > If the caller fails to call map then it leaks the iova.
> > 
> > I'm relying on dmabuf code and documentation:
> > 
> >    926 /**
> >    927  * dma_buf_dynamic_attach - Add the device to dma_buf's attachments list
> > ...   
> >    932  *
> >    933  * Returns struct dma_buf_attachment pointer for this attachment. Attachments
> >    934  * must be cleaned up by calling dma_buf_detach().
> > 
> > Successful call to vfio_pci_dma_buf_attach() MUST be accompanied by call
> > to vfio_pci_dma_buf_detach(), so as far as dmabuf implementation follows
> > it, there is no leak.
> 
> It leaks the ivoa because there is no dma_iova_destroy() unless you
> call unmap. detach is not unmap and unmap is not mandatory to call.

Though putting iova free in detach is problematic for the hot-unplug
case. In that instance we need to ensure the iova is cleaned up prior
to returning from vfio's remove(). detached is called on the importers
timeline but unmap is required to be called in move_notify..

So I guess some kind of flag to trigger the unmap after cleanup to
free the iova?

Jason

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 9/9] vfio/pci: Add dma-buf export support for MMIO regions
  2025-10-13 15:26 ` [PATCH v5 9/9] vfio/pci: Add dma-buf export support for MMIO regions Leon Romanovsky
  2025-10-16 23:53   ` Jason Gunthorpe
@ 2025-10-17  0:01   ` Jason Gunthorpe
  2025-10-17  6:33   ` Christoph Hellwig
                     ` (3 subsequent siblings)
  5 siblings, 0 replies; 45+ messages in thread
From: Jason Gunthorpe @ 2025-10-17  0:01 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Alex Williamson, Leon Romanovsky, Andrew Morton, Bjorn Helgaas,
	Christian König, dri-devel, iommu, Jens Axboe, Joerg Roedel,
	kvm, linaro-mm-sig, linux-block, linux-kernel, linux-media,
	linux-mm, linux-pci, Logan Gunthorpe, Marek Szyprowski,
	Robin Murphy, Sumit Semwal, Vivek Kasireddy, Will Deacon

On Mon, Oct 13, 2025 at 06:26:11PM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
> 
> Add support for exporting PCI device MMIO regions through dma-buf,
> enabling safe sharing of non-struct page memory with controlled
> lifetime management. This allows RDMA and other subsystems to import
> dma-buf FDs and build them into memory regions for PCI P2P operations.
> 
> The implementation provides a revocable attachment mechanism using
> dma-buf move operations. MMIO regions are normally pinned as BARs
> don't change physical addresses, but access is revoked when the VFIO
> device is closed or a PCI reset is issued. This ensures kernel
> self-defense against potentially hostile userspace.

I have drafted the iommufd importer side of this using the private
interconnect approach for now.

https://github.com/jgunthorpe/linux/commits/iommufd_dmabuf/

Due to this iommufd never calls map and we run into trouble here:

> +static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf,
> +				   struct dma_buf_attachment *attachment)
> +{
> +	struct vfio_pci_dma_buf *priv = dmabuf->priv;
> +
> +	if (!attachment->peer2peer)
> +		return -EOPNOTSUPP;
> +
> +	if (priv->revoked)
> +		return -ENODEV;
> +
> +	switch (pci_p2pdma_map_type(priv->provider, attachment->dev)) {
> +	case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
> +		break;
> +	case PCI_P2PDMA_MAP_BUS_ADDR:
> +		/*
> +		 * There is no need in IOVA at all for this flow.
> +		 * We rely on attachment->priv == NULL as a marker
> +		 * for this mode.
> +		 */
> +		return 0;
> +	default:
> +		return -EINVAL;

Where the dev from iommufd is also not p2p capable so the attach
fails.

This is OK since it won't call map.

So I reworked this logic to succeed attach but block map in this
case.. Can we fold this in for the next version? This diff has the
fixing for the iova lifecycle too.

I have a few more checks to make but so far it looks Ok and with some
luck we can get some iommufd p2p support this cycle..

Jason

diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c
index eaba010777f3b7..a0650bd816d99b 100644
--- a/drivers/vfio/pci/vfio_pci_dmabuf.c
+++ b/drivers/vfio/pci/vfio_pci_dmabuf.c
@@ -20,10 +20,21 @@ struct vfio_pci_dma_buf {
 	u8 revoked : 1;
 };
 
+struct vfio_pci_attach {
+	struct dma_iova_state state;
+	enum {
+		VFIO_ATTACH_NONE,
+		VFIO_ATTACH_HOST_BRIDGE_DMA,
+		VFIO_ATTACH_HOST_BRIDGE_IOVA,
+		VFIO_ATTACH_BUS
+	} kind;
+};
+
 static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf,
 				   struct dma_buf_attachment *attachment)
 {
 	struct vfio_pci_dma_buf *priv = dmabuf->priv;
+	struct vfio_pci_attach *attach;
 
 	if (!attachment->peer2peer)
 		return -EOPNOTSUPP;
@@ -31,32 +42,38 @@ static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf,
 	if (priv->revoked)
 		return -ENODEV;
 
+	attach = kzalloc(sizeof(*attach), GFP_KERNEL);
+	if (!attach)
+		return -ENOMEM;
+	attachment->priv = attach;
+
 	switch (pci_p2pdma_map_type(priv->provider, attachment->dev)) {
 	case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
-		break;
+		if (dma_iova_try_alloc(attachment->dev, &attach->state, 0,
+				       priv->size))
+			attach->kind = VFIO_ATTACH_HOST_BRIDGE_IOVA;
+		else
+			attach->kind = VFIO_ATTACH_HOST_BRIDGE_DMA;
+		return 0;
 	case PCI_P2PDMA_MAP_BUS_ADDR:
-		/*
-		 * There is no need in IOVA at all for this flow.
-		 * We rely on attachment->priv == NULL as a marker
-		 * for this mode.
-		 */
+		/* There is no need in IOVA at all for this flow. */
+		attach->kind = VFIO_ATTACH_BUS;
 		return 0;
 	default:
-		return -EINVAL;
+		attach->kind = VFIO_ATTACH_NONE;
+		return 0;
 	}
-
-	attachment->priv = kzalloc(sizeof(struct dma_iova_state), GFP_KERNEL);
-	if (!attachment->priv)
-		return -ENOMEM;
-
-	dma_iova_try_alloc(attachment->dev, attachment->priv, 0, priv->size);
 	return 0;
 }
 
 static void vfio_pci_dma_buf_detach(struct dma_buf *dmabuf,
 				    struct dma_buf_attachment *attachment)
 {
-	kfree(attachment->priv);
+	struct vfio_pci_attach *attach = attachment->priv;
+
+	if (attach->kind == VFIO_ATTACH_HOST_BRIDGE_IOVA)
+		dma_iova_free(attachment->dev, &attach->state);
+	kfree(attach);
 }
 
 static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, u64 length,
@@ -83,22 +100,23 @@ static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, u64 length,
 }
 
 static unsigned int calc_sg_nents(struct vfio_pci_dma_buf *priv,
-				  struct dma_iova_state *state)
+				  struct vfio_pci_attach *attach)
 {
 	struct phys_vec *phys_vec = priv->phys_vec;
 	unsigned int nents = 0;
 	u32 i;
 
-	if (!state || !dma_use_iova(state))
+	if (attach->kind != VFIO_ATTACH_HOST_BRIDGE_IOVA) {
 		for (i = 0; i < priv->nr_ranges; i++)
 			nents += DIV_ROUND_UP(phys_vec[i].len, UINT_MAX);
-	else
+	} else {
 		/*
 		 * In IOVA case, there is only one SG entry which spans
 		 * for whole IOVA address space, but we need to make sure
 		 * that it fits sg->length, maybe we need more.
 		 */
 		nents = DIV_ROUND_UP(priv->size, UINT_MAX);
+	}
 
 	return nents;
 }
@@ -108,7 +126,7 @@ vfio_pci_dma_buf_map(struct dma_buf_attachment *attachment,
 		     enum dma_data_direction dir)
 {
 	struct vfio_pci_dma_buf *priv = attachment->dmabuf->priv;
-	struct dma_iova_state *state = attachment->priv;
+	struct vfio_pci_attach *attach = attachment->priv;
 	struct phys_vec *phys_vec = priv->phys_vec;
 	unsigned long attrs = DMA_ATTR_MMIO;
 	unsigned int nents, mapped_len = 0;
@@ -127,7 +145,7 @@ vfio_pci_dma_buf_map(struct dma_buf_attachment *attachment,
 	if (!sgt)
 		return ERR_PTR(-ENOMEM);
 
-	nents = calc_sg_nents(priv, state);
+	nents = calc_sg_nents(priv, attach);
 	ret = sg_alloc_table(sgt, nents, GFP_KERNEL | __GFP_ZERO);
 	if (ret)
 		goto err_kfree_sgt;
@@ -135,35 +153,42 @@ vfio_pci_dma_buf_map(struct dma_buf_attachment *attachment,
 	sgl = sgt->sgl;
 
 	for (i = 0; i < priv->nr_ranges; i++) {
-		if (!state) {
+		switch (attach->kind) {
+		case VFIO_ATTACH_BUS:
 			addr = pci_p2pdma_bus_addr_map(priv->provider,
 						       phys_vec[i].paddr);
-		} else if (dma_use_iova(state)) {
-			ret = dma_iova_link(attachment->dev, state,
+			break;
+		case VFIO_ATTACH_HOST_BRIDGE_IOVA:
+			ret = dma_iova_link(attachment->dev, &attach->state,
 					    phys_vec[i].paddr, 0,
 					    phys_vec[i].len, dir, attrs);
 			if (ret)
 				goto err_unmap_dma;
 
 			mapped_len += phys_vec[i].len;
-		} else {
+			break;
+		case VFIO_ATTACH_HOST_BRIDGE_DMA:
 			addr = dma_map_phys(attachment->dev, phys_vec[i].paddr,
 					    phys_vec[i].len, dir, attrs);
 			ret = dma_mapping_error(attachment->dev, addr);
 			if (ret)
 				goto err_unmap_dma;
+			break;
+		default:
+			ret = -EINVAL;
+			goto err_unmap_dma;
 		}
 
-		if (!state || !dma_use_iova(state))
+		if (attach->kind != VFIO_ATTACH_HOST_BRIDGE_IOVA)
 			sgl = fill_sg_entry(sgl, phys_vec[i].len, addr);
 	}
 
-	if (state && dma_use_iova(state)) {
+	if (attach->kind == VFIO_ATTACH_HOST_BRIDGE_IOVA) {
 		WARN_ON_ONCE(mapped_len != priv->size);
-		ret = dma_iova_sync(attachment->dev, state, 0, mapped_len);
+		ret = dma_iova_sync(attachment->dev, &attach->state, 0, mapped_len);
 		if (ret)
 			goto err_unmap_dma;
-		sgl = fill_sg_entry(sgl, mapped_len, state->addr);
+		sgl = fill_sg_entry(sgl, mapped_len, attach->state.addr);
 	}
 
 	/*
@@ -174,15 +199,22 @@ vfio_pci_dma_buf_map(struct dma_buf_attachment *attachment,
 	return sgt;
 
 err_unmap_dma:
-	if (!i || !state)
-		; /* Do nothing */
-	else if (dma_use_iova(state))
-		dma_iova_destroy(attachment->dev, state, mapped_len, dir,
-				 attrs);
-	else
+	switch (attach->kind) {
+	case VFIO_ATTACH_HOST_BRIDGE_IOVA:
+		if (mapped_len)
+			dma_iova_unlink(attachment->dev, &attach->state, 0,
+					mapped_len, dir, attrs);
+		break;
+	case VFIO_ATTACH_HOST_BRIDGE_DMA:
+		if (!i)
+			break;
 		for_each_sgtable_dma_sg(sgt, sgl, i)
 			dma_unmap_phys(attachment->dev, sg_dma_address(sgl),
-					sg_dma_len(sgl), dir, attrs);
+				       sg_dma_len(sgl), dir, attrs);
+		break;
+	default:
+		break;
+	}
 	sg_free_table(sgt);
 err_kfree_sgt:
 	kfree(sgt);
@@ -194,20 +226,24 @@ static void vfio_pci_dma_buf_unmap(struct dma_buf_attachment *attachment,
 				   enum dma_data_direction dir)
 {
 	struct vfio_pci_dma_buf *priv = attachment->dmabuf->priv;
-	struct dma_iova_state *state = attachment->priv;
+	struct vfio_pci_attach *attach = attachment->priv;
 	unsigned long attrs = DMA_ATTR_MMIO;
 	struct scatterlist *sgl;
 	int i;
 
-	if (!state)
-		; /* Do nothing */
-	else if (dma_use_iova(state))
-		dma_iova_destroy(attachment->dev, state, priv->size, dir,
-				 attrs);
-	else
+	switch (attach->kind) {
+	case VFIO_ATTACH_HOST_BRIDGE_IOVA:
+		dma_iova_destroy(attachment->dev, &attach->state, priv->size,
+				 dir, attrs);
+		break;
+	case VFIO_ATTACH_HOST_BRIDGE_DMA:
 		for_each_sgtable_dma_sg(sgt, sgl, i)
 			dma_unmap_phys(attachment->dev, sg_dma_address(sgl),
 				       sg_dma_len(sgl), dir, attrs);
+		break;
+	default:
+		break;
+	}
 
 	sg_free_table(sgt);
 	kfree(sgt);

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 9/9] vfio/pci: Add dma-buf export support for MMIO regions
  2025-10-13 15:26 ` [PATCH v5 9/9] vfio/pci: Add dma-buf export support for MMIO regions Leon Romanovsky
  2025-10-16 23:53   ` Jason Gunthorpe
  2025-10-17  0:01   ` Jason Gunthorpe
@ 2025-10-17  6:33   ` Christoph Hellwig
  2025-10-17 12:16     ` Jason Gunthorpe
  2025-10-17 13:02   ` Jason Gunthorpe
                     ` (2 subsequent siblings)
  5 siblings, 1 reply; 45+ messages in thread
From: Christoph Hellwig @ 2025-10-17  6:33 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Alex Williamson, Leon Romanovsky, Jason Gunthorpe, Andrew Morton,
	Bjorn Helgaas, Christian König, dri-devel, iommu, Jens Axboe,
	Joerg Roedel, kvm, linaro-mm-sig, linux-block, linux-kernel,
	linux-media, linux-mm, linux-pci, Logan Gunthorpe,
	Marek Szyprowski, Robin Murphy, Sumit Semwal, Vivek Kasireddy,
	Will Deacon

On Mon, Oct 13, 2025 at 06:26:11PM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
> 
> Add support for exporting PCI device MMIO regions through dma-buf,
> enabling safe sharing of non-struct page memory with controlled
> lifetime management. This allows RDMA and other subsystems to import
> dma-buf FDs and build them into memory regions for PCI P2P operations.
> 
> The implementation provides a revocable attachment mechanism using
> dma-buf move operations. MMIO regions are normally pinned as BARs
> don't change physical addresses, but access is revoked when the VFIO
> device is closed or a PCI reset is issued. This ensures kernel
> self-defense against potentially hostile userspace.

This still completely fails to explain why you think that it actually
is safe without the proper pgmap handling.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 9/9] vfio/pci: Add dma-buf export support for MMIO regions
  2025-10-17  6:33   ` Christoph Hellwig
@ 2025-10-17 12:16     ` Jason Gunthorpe
  0 siblings, 0 replies; 45+ messages in thread
From: Jason Gunthorpe @ 2025-10-17 12:16 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Leon Romanovsky, Alex Williamson, Leon Romanovsky, Andrew Morton,
	Bjorn Helgaas, Christian König, dri-devel, iommu, Jens Axboe,
	Joerg Roedel, kvm, linaro-mm-sig, linux-block, linux-kernel,
	linux-media, linux-mm, linux-pci, Logan Gunthorpe,
	Marek Szyprowski, Robin Murphy, Sumit Semwal, Vivek Kasireddy,
	Will Deacon

On Thu, Oct 16, 2025 at 11:33:52PM -0700, Christoph Hellwig wrote:
> On Mon, Oct 13, 2025 at 06:26:11PM +0300, Leon Romanovsky wrote:
> > From: Leon Romanovsky <leonro@nvidia.com>
> > 
> > Add support for exporting PCI device MMIO regions through dma-buf,
> > enabling safe sharing of non-struct page memory with controlled
> > lifetime management. This allows RDMA and other subsystems to import
> > dma-buf FDs and build them into memory regions for PCI P2P operations.
> > 
> > The implementation provides a revocable attachment mechanism using
> > dma-buf move operations. MMIO regions are normally pinned as BARs
> > don't change physical addresses, but access is revoked when the VFIO
> > device is closed or a PCI reset is issued. This ensures kernel
> > self-defense against potentially hostile userspace.
> 
> This still completely fails to explain why you think that it actually
> is safe without the proper pgmap handling.

Leon, this keeps coming up, please clean up and copy the text from my
prior email into the commit message & cover letter explaining in
detail how lifetime magement works by using revocation/invalidation
driven by dmabuf move_notify callbacks.

Jason

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 9/9] vfio/pci: Add dma-buf export support for MMIO regions
  2025-10-13 15:26 ` [PATCH v5 9/9] vfio/pci: Add dma-buf export support for MMIO regions Leon Romanovsky
                     ` (2 preceding siblings ...)
  2025-10-17  6:33   ` Christoph Hellwig
@ 2025-10-17 13:02   ` Jason Gunthorpe
  2025-10-17 16:13     ` Leon Romanovsky
  2025-10-17 23:40   ` Jason Gunthorpe
  2025-10-22 12:50   ` Jason Gunthorpe
  5 siblings, 1 reply; 45+ messages in thread
From: Jason Gunthorpe @ 2025-10-17 13:02 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Alex Williamson, Leon Romanovsky, Andrew Morton, Bjorn Helgaas,
	Christian König, dri-devel, iommu, Jens Axboe, Joerg Roedel,
	kvm, linaro-mm-sig, linux-block, linux-kernel, linux-media,
	linux-mm, linux-pci, Logan Gunthorpe, Marek Szyprowski,
	Robin Murphy, Sumit Semwal, Vivek Kasireddy, Will Deacon

On Mon, Oct 13, 2025 at 06:26:11PM +0300, Leon Romanovsky wrote:
> +static void dma_ranges_to_p2p_phys(struct vfio_pci_dma_buf *priv,
> +				   struct vfio_device_feature_dma_buf *dma_buf,
> +				   struct vfio_region_dma_range *dma_ranges,
> +				   struct p2pdma_provider *provider)
> +{
> +	struct pci_dev *pdev = priv->vdev->pdev;
> +	phys_addr_t pci_start;
> +	u32 i;
> +
> +	pci_start = pci_resource_start(pdev, dma_buf->region_index);
> +	for (i = 0; i < dma_buf->nr_ranges; i++) {
> +		priv->phys_vec[i].len = dma_ranges[i].length;
> +		priv->phys_vec[i].paddr = pci_start + dma_ranges[i].offset;
> +		priv->size += priv->phys_vec[i].len;
> +	}

This is missing validation, the userspace can pass in any
length/offset but the resource is of limited size. Like this:

static int dma_ranges_to_p2p_phys(struct vfio_pci_dma_buf *priv,
				  struct vfio_device_feature_dma_buf *dma_buf,
				  struct vfio_region_dma_range *dma_ranges,
				  struct p2pdma_provider *provider)
{
	struct pci_dev *pdev = priv->vdev->pdev;
	phys_addr_t len = pci_resource_len(pdev, dma_buf->region_index);
	phys_addr_t pci_start;
	phys_addr_t pci_last;
	u32 i;

	if (!len)
		return -EINVAL;
	pci_start = pci_resource_start(pdev, dma_buf->region_index);
	pci_last = pci_start + len - 1;
	for (i = 0; i < dma_buf->nr_ranges; i++) {
		phys_addr_t last;

		if (!dma_ranges[i].length)
			return -EINVAL;

		if (check_add_overflow(pci_start, dma_ranges[i].offset,
				       &priv->phys_vec[i].paddr) ||
		    check_add_overflow(priv->phys_vec[i].paddr,
				       dma_ranges[i].length - 1, &last))
			return -EOVERFLOW;
		if (last > pci_last)
			return -EINVAL;

		priv->phys_vec[i].len = dma_ranges[i].length;
		priv->size += priv->phys_vec[i].len;
	}
	priv->nr_ranges = dma_buf->nr_ranges;
	priv->provider = provider;
	return 0;
}

Jason

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 9/9] vfio/pci: Add dma-buf export support for MMIO regions
  2025-10-17 13:02   ` Jason Gunthorpe
@ 2025-10-17 16:13     ` Leon Romanovsky
  2025-10-20 16:15       ` Jason Gunthorpe
  0 siblings, 1 reply; 45+ messages in thread
From: Leon Romanovsky @ 2025-10-17 16:13 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Andrew Morton, Bjorn Helgaas,
	Christian König, dri-devel, iommu, Jens Axboe, Joerg Roedel,
	kvm, linaro-mm-sig, linux-block, linux-kernel, linux-media,
	linux-mm, linux-pci, Logan Gunthorpe, Marek Szyprowski,
	Robin Murphy, Sumit Semwal, Vivek Kasireddy, Will Deacon

On Fri, Oct 17, 2025 at 10:02:49AM -0300, Jason Gunthorpe wrote:
> On Mon, Oct 13, 2025 at 06:26:11PM +0300, Leon Romanovsky wrote:
> > +static void dma_ranges_to_p2p_phys(struct vfio_pci_dma_buf *priv,
> > +				   struct vfio_device_feature_dma_buf *dma_buf,
> > +				   struct vfio_region_dma_range *dma_ranges,
> > +				   struct p2pdma_provider *provider)
> > +{
> > +	struct pci_dev *pdev = priv->vdev->pdev;
> > +	phys_addr_t pci_start;
> > +	u32 i;
> > +
> > +	pci_start = pci_resource_start(pdev, dma_buf->region_index);
> > +	for (i = 0; i < dma_buf->nr_ranges; i++) {
> > +		priv->phys_vec[i].len = dma_ranges[i].length;
> > +		priv->phys_vec[i].paddr = pci_start + dma_ranges[i].offset;
> > +		priv->size += priv->phys_vec[i].len;
> > +	}
> 
> This is missing validation, the userspace can pass in any
> length/offset but the resource is of limited size. Like this:
> 
> static int dma_ranges_to_p2p_phys(struct vfio_pci_dma_buf *priv,
> 				  struct vfio_device_feature_dma_buf *dma_buf,
> 				  struct vfio_region_dma_range *dma_ranges,
> 				  struct p2pdma_provider *provider)
> {
> 	struct pci_dev *pdev = priv->vdev->pdev;
> 	phys_addr_t len = pci_resource_len(pdev, dma_buf->region_index);
> 	phys_addr_t pci_start;
> 	phys_addr_t pci_last;
> 	u32 i;
> 
> 	if (!len)
> 		return -EINVAL;
> 	pci_start = pci_resource_start(pdev, dma_buf->region_index);
> 	pci_last = pci_start + len - 1;
> 	for (i = 0; i < dma_buf->nr_ranges; i++) {
> 		phys_addr_t last;
> 
> 		if (!dma_ranges[i].length)
> 			return -EINVAL;
> 
> 		if (check_add_overflow(pci_start, dma_ranges[i].offset,
> 				       &priv->phys_vec[i].paddr) ||
> 		    check_add_overflow(priv->phys_vec[i].paddr,
> 				       dma_ranges[i].length - 1, &last))
> 			return -EOVERFLOW;
> 		if (last > pci_last)
> 			return -EINVAL;
> 
> 		priv->phys_vec[i].len = dma_ranges[i].length;
> 		priv->size += priv->phys_vec[i].len;
> 	}
> 	priv->nr_ranges = dma_buf->nr_ranges;
> 	priv->provider = provider;
> 	return 0;
> }

I have these checks in validate_dmabuf_input(). Do you think that I need
to add extra checks?

Thanks

> 
> Jason

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 9/9] vfio/pci: Add dma-buf export support for MMIO regions
  2025-10-17 16:13     ` Leon Romanovsky
@ 2025-10-20 16:15       ` Jason Gunthorpe
  2025-10-20 16:44         ` Leon Romanovsky
  0 siblings, 1 reply; 45+ messages in thread
From: Jason Gunthorpe @ 2025-10-20 16:15 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Alex Williamson, Andrew Morton, Bjorn Helgaas,
	Christian König, dri-devel, iommu, Jens Axboe, Joerg Roedel,
	kvm, linaro-mm-sig, linux-block, linux-kernel, linux-media,
	linux-mm, linux-pci, Logan Gunthorpe, Marek Szyprowski,
	Robin Murphy, Sumit Semwal, Vivek Kasireddy, Will Deacon

On Fri, Oct 17, 2025 at 07:13:58PM +0300, Leon Romanovsky wrote:
> > static int dma_ranges_to_p2p_phys(struct vfio_pci_dma_buf *priv,
> > 				  struct vfio_device_feature_dma_buf *dma_buf,
> > 				  struct vfio_region_dma_range *dma_ranges,
> > 				  struct p2pdma_provider *provider)
> > {
> > 	struct pci_dev *pdev = priv->vdev->pdev;
> > 	phys_addr_t len = pci_resource_len(pdev, dma_buf->region_index);
> > 	phys_addr_t pci_start;
> > 	phys_addr_t pci_last;
> > 	u32 i;
> > 
> > 	if (!len)
> > 		return -EINVAL;
> > 	pci_start = pci_resource_start(pdev, dma_buf->region_index);
> > 	pci_last = pci_start + len - 1;
> > 	for (i = 0; i < dma_buf->nr_ranges; i++) {
> > 		phys_addr_t last;
> > 
> > 		if (!dma_ranges[i].length)
> > 			return -EINVAL;
> > 
> > 		if (check_add_overflow(pci_start, dma_ranges[i].offset,
> > 				       &priv->phys_vec[i].paddr) ||
> > 		    check_add_overflow(priv->phys_vec[i].paddr,
> > 				       dma_ranges[i].length - 1, &last))
> > 			return -EOVERFLOW;
> > 		if (last > pci_last)
> > 			return -EINVAL;
> > 
> > 		priv->phys_vec[i].len = dma_ranges[i].length;
> > 		priv->size += priv->phys_vec[i].len;
> > 	}
> > 	priv->nr_ranges = dma_buf->nr_ranges;
> > 	priv->provider = provider;
> > 	return 0;
> > }
> 
> I have these checks in validate_dmabuf_input(). 
> Do you think that I need to add extra checks?

I think they work better in this function, so I'd move them here.

Jason

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 9/9] vfio/pci: Add dma-buf export support for MMIO regions
  2025-10-20 16:15       ` Jason Gunthorpe
@ 2025-10-20 16:44         ` Leon Romanovsky
  2025-10-20 16:51           ` Jason Gunthorpe
  0 siblings, 1 reply; 45+ messages in thread
From: Leon Romanovsky @ 2025-10-20 16:44 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Andrew Morton, Bjorn Helgaas,
	Christian König, dri-devel, iommu, Jens Axboe, Joerg Roedel,
	kvm, linaro-mm-sig, linux-block, linux-kernel, linux-media,
	linux-mm, linux-pci, Logan Gunthorpe, Marek Szyprowski,
	Robin Murphy, Sumit Semwal, Vivek Kasireddy, Will Deacon

On Mon, Oct 20, 2025 at 01:15:16PM -0300, Jason Gunthorpe wrote:
> On Fri, Oct 17, 2025 at 07:13:58PM +0300, Leon Romanovsky wrote:
> > > static int dma_ranges_to_p2p_phys(struct vfio_pci_dma_buf *priv,
> > > 				  struct vfio_device_feature_dma_buf *dma_buf,
> > > 				  struct vfio_region_dma_range *dma_ranges,
> > > 				  struct p2pdma_provider *provider)
> > > {
> > > 	struct pci_dev *pdev = priv->vdev->pdev;
> > > 	phys_addr_t len = pci_resource_len(pdev, dma_buf->region_index);
> > > 	phys_addr_t pci_start;
> > > 	phys_addr_t pci_last;
> > > 	u32 i;
> > > 
> > > 	if (!len)
> > > 		return -EINVAL;
> > > 	pci_start = pci_resource_start(pdev, dma_buf->region_index);
> > > 	pci_last = pci_start + len - 1;
> > > 	for (i = 0; i < dma_buf->nr_ranges; i++) {
> > > 		phys_addr_t last;
> > > 
> > > 		if (!dma_ranges[i].length)
> > > 			return -EINVAL;
> > > 
> > > 		if (check_add_overflow(pci_start, dma_ranges[i].offset,
> > > 				       &priv->phys_vec[i].paddr) ||
> > > 		    check_add_overflow(priv->phys_vec[i].paddr,
> > > 				       dma_ranges[i].length - 1, &last))
> > > 			return -EOVERFLOW;
> > > 		if (last > pci_last)
> > > 			return -EINVAL;
> > > 
> > > 		priv->phys_vec[i].len = dma_ranges[i].length;
> > > 		priv->size += priv->phys_vec[i].len;
> > > 	}
> > > 	priv->nr_ranges = dma_buf->nr_ranges;
> > > 	priv->provider = provider;
> > > 	return 0;
> > > }
> > 
> > I have these checks in validate_dmabuf_input(). 
> > Do you think that I need to add extra checks?
> 
> I think they work better in this function, so I'd move them here.

The main idea for validate_dmabuf_input() is to perform as much as
possible checks before allocating kernel memory.

Thanks

> 
> Jason

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 9/9] vfio/pci: Add dma-buf export support for MMIO regions
  2025-10-20 16:44         ` Leon Romanovsky
@ 2025-10-20 16:51           ` Jason Gunthorpe
  0 siblings, 0 replies; 45+ messages in thread
From: Jason Gunthorpe @ 2025-10-20 16:51 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Alex Williamson, Andrew Morton, Bjorn Helgaas,
	Christian König, dri-devel, iommu, Jens Axboe, Joerg Roedel,
	kvm, linaro-mm-sig, linux-block, linux-kernel, linux-media,
	linux-mm, linux-pci, Logan Gunthorpe, Marek Szyprowski,
	Robin Murphy, Sumit Semwal, Vivek Kasireddy, Will Deacon

On Mon, Oct 20, 2025 at 07:44:57PM +0300, Leon Romanovsky wrote:
> On Mon, Oct 20, 2025 at 01:15:16PM -0300, Jason Gunthorpe wrote:
> > On Fri, Oct 17, 2025 at 07:13:58PM +0300, Leon Romanovsky wrote:
> > > > static int dma_ranges_to_p2p_phys(struct vfio_pci_dma_buf *priv,
> > > > 				  struct vfio_device_feature_dma_buf *dma_buf,
> > > > 				  struct vfio_region_dma_range *dma_ranges,
> > > > 				  struct p2pdma_provider *provider)
> > > > {
> > > > 	struct pci_dev *pdev = priv->vdev->pdev;
> > > > 	phys_addr_t len = pci_resource_len(pdev, dma_buf->region_index);
> > > > 	phys_addr_t pci_start;
> > > > 	phys_addr_t pci_last;
> > > > 	u32 i;
> > > > 
> > > > 	if (!len)
> > > > 		return -EINVAL;
> > > > 	pci_start = pci_resource_start(pdev, dma_buf->region_index);
> > > > 	pci_last = pci_start + len - 1;
> > > > 	for (i = 0; i < dma_buf->nr_ranges; i++) {
> > > > 		phys_addr_t last;
> > > > 
> > > > 		if (!dma_ranges[i].length)
> > > > 			return -EINVAL;
> > > > 
> > > > 		if (check_add_overflow(pci_start, dma_ranges[i].offset,
> > > > 				       &priv->phys_vec[i].paddr) ||
> > > > 		    check_add_overflow(priv->phys_vec[i].paddr,
> > > > 				       dma_ranges[i].length - 1, &last))
> > > > 			return -EOVERFLOW;
> > > > 		if (last > pci_last)
> > > > 			return -EINVAL;
> > > > 
> > > > 		priv->phys_vec[i].len = dma_ranges[i].length;
> > > > 		priv->size += priv->phys_vec[i].len;
> > > > 	}
> > > > 	priv->nr_ranges = dma_buf->nr_ranges;
> > > > 	priv->provider = provider;
> > > > 	return 0;
> > > > }
> > > 
> > > I have these checks in validate_dmabuf_input(). 
> > > Do you think that I need to add extra checks?
> > 
> > I think they work better in this function, so I'd move them here.
> 
> The main idea for validate_dmabuf_input() is to perform as much as
> possible checks before allocating kernel memory.

Yeah, but it's fine, it can just be turned into a function to safely
compute the total size. It makes more sense to try to validate once we
have the kernel memory and got the physical range from the driver to
copy into the phys.

Jason

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 9/9] vfio/pci: Add dma-buf export support for MMIO regions
  2025-10-13 15:26 ` [PATCH v5 9/9] vfio/pci: Add dma-buf export support for MMIO regions Leon Romanovsky
                     ` (3 preceding siblings ...)
  2025-10-17 13:02   ` Jason Gunthorpe
@ 2025-10-17 23:40   ` Jason Gunthorpe
  2025-10-22 12:50   ` Jason Gunthorpe
  5 siblings, 0 replies; 45+ messages in thread
From: Jason Gunthorpe @ 2025-10-17 23:40 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Alex Williamson, Leon Romanovsky, Andrew Morton, Bjorn Helgaas,
	Christian König, dri-devel, iommu, Jens Axboe, Joerg Roedel,
	kvm, linaro-mm-sig, linux-block, linux-kernel, linux-media,
	linux-mm, linux-pci, Logan Gunthorpe, Marek Szyprowski,
	Robin Murphy, Sumit Semwal, Vivek Kasireddy, Will Deacon

On Mon, Oct 13, 2025 at 06:26:11PM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
> 
> Add support for exporting PCI device MMIO regions through dma-buf,
> enabling safe sharing of non-struct page memory with controlled
> lifetime management. This allows RDMA and other subsystems to import
> dma-buf FDs and build them into memory regions for PCI P2P operations.

I was looking at how to address Alex's note about not all drivers
being compatible, and how to enable the non-compatible drivers.

It looks like the simplest thing is to make dma_ranges_to_p2p_phys
into an ops and have the driver provide it. If not provided the no
support.

Drivers with special needs can fill in phys in their own way and get
their own provider.

Sort of like this:

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index ac10f14417f2f3..6d41cf26b53994 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -147,6 +147,10 @@ static const struct vfio_device_ops vfio_pci_ops = {
 	.pasid_detach_ioas	= vfio_iommufd_physical_pasid_detach_ioas,
 };
 
+static const struct vfio_pci_device_ops vfio_pci_dev_ops = {
+	.get_dmabuf_phys = vfio_pci_core_get_dmabuf_phys,
+};
+
 static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 {
 	struct vfio_pci_core_device *vdev;
@@ -161,6 +165,7 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 		return PTR_ERR(vdev);
 
 	dev_set_drvdata(&pdev->dev, vdev);
+	vdev->pci_ops = &vfio_pci_dev_ops;
 	ret = vfio_pci_core_register_device(vdev);
 	if (ret)
 		goto out_put_vdev;
diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c
index 358856e6b8a820..dad880781a9352 100644
--- a/drivers/vfio/pci/vfio_pci_dmabuf.c
+++ b/drivers/vfio/pci/vfio_pci_dmabuf.c
@@ -309,47 +309,52 @@ int vfio_pci_dma_buf_iommufd_map(struct dma_buf_attachment *attachment,
 }
 EXPORT_SYMBOL_GPL(vfio_pci_dma_buf_iommufd_map);
 
-static int dma_ranges_to_p2p_phys(struct vfio_pci_dma_buf *priv,
-				  struct vfio_device_feature_dma_buf *dma_buf,
+int vfio_pci_core_get_dmabuf_phys(struct vfio_pci_core_device *vdev,
+				  struct p2pdma_provider **provider,
+				  unsigned int region_index,
+				  struct phys_vec *phys_vec,
 				  struct vfio_region_dma_range *dma_ranges,
-				  struct p2pdma_provider *provider)
+				  size_t nr_ranges)
 {
-	struct pci_dev *pdev = priv->vdev->pdev;
-	phys_addr_t len = pci_resource_len(pdev, dma_buf->region_index);
+	struct pci_dev *pdev = vdev->pdev;
+	phys_addr_t len = pci_resource_len(pdev, region_index);
 	phys_addr_t pci_start;
 	phys_addr_t pci_last;
 	u32 i;
 
 	if (!len)
 		return -EINVAL;
-	pci_start = pci_resource_start(pdev, dma_buf->region_index);
+
+	*provider = pcim_p2pdma_provider(pdev, region_index);
+	if (!*provider)
+		return -EINVAL;
+
+	pci_start = pci_resource_start(pdev, region_index);
 	pci_last = pci_start + len - 1;
-	for (i = 0; i < dma_buf->nr_ranges; i++) {
+	for (i = 0; i < nr_ranges; i++) {
 		phys_addr_t last;
 
 		if (!dma_ranges[i].length)
 			return -EINVAL;
 
 		if (check_add_overflow(pci_start, dma_ranges[i].offset,
-				       &priv->phys_vec[i].paddr) ||
-		    check_add_overflow(priv->phys_vec[i].paddr,
+				       &phys_vec[i].paddr) ||
+		    check_add_overflow(phys_vec[i].paddr,
 				       dma_ranges[i].length - 1, &last))
 			return -EOVERFLOW;
 		if (last > pci_last)
 			return -EINVAL;
 
-		priv->phys_vec[i].len = dma_ranges[i].length;
-		priv->size += priv->phys_vec[i].len;
+		phys_vec[i].len = dma_ranges[i].length;
 	}
-	priv->nr_ranges = dma_buf->nr_ranges;
-	priv->provider = provider;
 	return 0;
 }
+EXPORT_SYMBOL_GPL(vfio_pci_core_get_dmabuf_phys);
 
 static int validate_dmabuf_input(struct vfio_pci_core_device *vdev,
 				 struct vfio_device_feature_dma_buf *dma_buf,
 				 struct vfio_region_dma_range *dma_ranges,
-				 struct p2pdma_provider **provider)
+				 size_t *lengthp)
 {
 	struct pci_dev *pdev = vdev->pdev;
 	u32 bar = dma_buf->region_index;
@@ -365,10 +370,6 @@ static int validate_dmabuf_input(struct vfio_pci_core_device *vdev,
 	if (bar >= VFIO_PCI_ROM_REGION_INDEX)
 		return -ENODEV;
 
-	*provider = pcim_p2pdma_provider(pdev, bar);
-	if (!*provider)
-		return -EINVAL;
-
 	bar_size = pci_resource_len(pdev, bar);
 	for (i = 0; i < dma_buf->nr_ranges; i++) {
 		u64 offset = dma_ranges[i].offset;
@@ -397,6 +398,7 @@ static int validate_dmabuf_input(struct vfio_pci_core_device *vdev,
 	if (overflows_type(length, size_t) || length & DMA_IOVA_USE_SWIOTLB)
 		return -EINVAL;
 
+	*lengthp = length;
 	return 0;
 }
 
@@ -407,10 +409,13 @@ int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
 	struct vfio_device_feature_dma_buf get_dma_buf = {};
 	struct vfio_region_dma_range *dma_ranges;
 	DEFINE_DMA_BUF_EXPORT_INFO(exp_info);
-	struct p2pdma_provider *provider;
 	struct vfio_pci_dma_buf *priv;
+	size_t length;
 	int ret;
 
+	if (!vdev->pci_ops->get_dmabuf_phys)
+		return -EOPNOTSUPP;
+
 	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_GET,
 				 sizeof(get_dma_buf));
 	if (ret != 1)
@@ -427,7 +432,7 @@ int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
 	if (IS_ERR(dma_ranges))
 		return PTR_ERR(dma_ranges);
 
-	ret = validate_dmabuf_input(vdev, &get_dma_buf, dma_ranges, &provider);
+	ret = validate_dmabuf_input(vdev, &get_dma_buf, dma_ranges, &length);
 	if (ret)
 		goto err_free_ranges;
 
@@ -444,10 +449,16 @@ int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
 	}
 
 	priv->vdev = vdev;
-	ret = dma_ranges_to_p2p_phys(priv, &get_dma_buf, dma_ranges, provider);
+	priv->nr_ranges = get_dma_buf.nr_ranges;
+	priv->size = length;
+	ret = vdev->pci_ops->get_dmabuf_phys(vdev, &priv->provider,
+					     get_dma_buf.region_index,
+					     priv->phys_vec, dma_ranges,
+					     priv->nr_ranges);
 	if (ret)
 		goto err_free_phys;
 
+
 	kfree(dma_ranges);
 	dma_ranges = NULL;
 
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index 37ce02e30c7632..4ea2095381eb24 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -26,6 +26,7 @@
 
 struct vfio_pci_core_device;
 struct vfio_pci_region;
+struct p2pdma_provider;
 
 struct vfio_pci_regops {
 	ssize_t (*rw)(struct vfio_pci_core_device *vdev, char __user *buf,
@@ -49,9 +50,26 @@ struct vfio_pci_region {
 	u32				flags;
 };
 
+struct vfio_pci_device_ops {
+	int (*get_dmabuf_phys)(struct vfio_pci_core_device *vdev,
+			       struct p2pdma_provider **provider,
+			       unsigned int region_index,
+			       struct phys_vec *phys_vec,
+			       struct vfio_region_dma_range *dma_ranges,
+			       size_t nr_ranges);
+};
+
+int vfio_pci_core_get_dmabuf_phys(struct vfio_pci_core_device *vdev,
+				  struct p2pdma_provider **provider,
+				  unsigned int region_index,
+				  struct phys_vec *phys_vec,
+				  struct vfio_region_dma_range *dma_ranges,
+				  size_t nr_ranges);
+
 struct vfio_pci_core_device {
 	struct vfio_device	vdev;
 	struct pci_dev		*pdev;
+	const struct vfio_pci_device_ops *pci_ops;
 	void __iomem		*barmap[PCI_STD_NUM_BARS];
 	bool			bar_mmap_supported[PCI_STD_NUM_BARS];
 	u8			*pci_config_map;

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 9/9] vfio/pci: Add dma-buf export support for MMIO regions
  2025-10-13 15:26 ` [PATCH v5 9/9] vfio/pci: Add dma-buf export support for MMIO regions Leon Romanovsky
                     ` (4 preceding siblings ...)
  2025-10-17 23:40   ` Jason Gunthorpe
@ 2025-10-22 12:50   ` Jason Gunthorpe
  5 siblings, 0 replies; 45+ messages in thread
From: Jason Gunthorpe @ 2025-10-22 12:50 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Alex Williamson, Leon Romanovsky, Andrew Morton, Bjorn Helgaas,
	Christian König, dri-devel, iommu, Jens Axboe, Joerg Roedel,
	kvm, linaro-mm-sig, linux-block, linux-kernel, linux-media,
	linux-mm, linux-pci, Logan Gunthorpe, Marek Szyprowski,
	Robin Murphy, Sumit Semwal, Vivek Kasireddy, Will Deacon

On Mon, Oct 13, 2025 at 06:26:11PM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
> 
> Add support for exporting PCI device MMIO regions through dma-buf,
> enabling safe sharing of non-struct page memory with controlled
> lifetime management. This allows RDMA and other subsystems to import
> dma-buf FDs and build them into memory regions for PCI P2P operations.
> 
> The implementation provides a revocable attachment mechanism using
> dma-buf move operations. MMIO regions are normally pinned as BARs
> don't change physical addresses, but access is revoked when the VFIO
> device is closed or a PCI reset is issued. This ensures kernel
> self-defense against potentially hostile userspace.

Let's enhance this:

Currently VFIO can take MMIO regions from the device's BAR and map
them into a PFNMAP VMA with special PTEs. This mapping type ensures
the memory cannot be used with things like pin_user_pages(), hmm, and
so on. In practice only the user process CPU and KVM can safely make
use of these VMA. When VFIO shuts down these VMAs are cleaned by
unmap_mapping_range() to prevent any UAF of the MMIO beyond driver
unbind.

However, VFIO type 1 has an insecure behavior where it uses
follow_pfnmap_*() to fish a MMIO PFN out of a VMA and program it back
into the IOMMU. This has a long history of enabling P2P DMA inside
VMs, but has serious lifetime problems by allowing a UAF of the MMIO
after the VFIO driver has been unbound.

Introduce DMABUF as a new safe way to export a FD based handle for the
MMIO regions. This can be consumed by existing DMABUF importers like
RDMA or DRM without opening an UAF. A following series will add an
importer to iommufd to obsolete the type 1 code and allow safe
UAF-free MMIO P2P in VM cases.

DMABUF has a built in synchronous invalidation mechanism called
move_notify. VFIO keeps track of all drivers importing its MMIO and
can invoke a synchronous invalidation callback to tell the importing
drivers to DMA unmap and forget about the MMIO pfns. This process is
being called revoke. This synchronous invalidation fully prevents any
lifecycle problems. VFIO will do this before unbinding its driver
ensuring there is no UAF of the MMIO beyond the driver lifecycle.

Further, VFIO has additional behavior to block access to the MMIO
during things like Function Level Reset. This is because some poor
platforms may experience a MCE type crash when touching MMIO of a PCI
device that is undergoing a reset. Today this is done by using
unmap_mapping_range() on the VMAs. Extend that into the DMABUF world
and temporarily revoke the MMIO from the DMABUF importers during FLR
as well. This will more robustly prevent an errant P2P from possibly
upsetting the platform.

A DMABUF FD is a prefered handle for MMIO compared to using something
like a pgmap because:
 - VFIO is supported, including its P2P feature, on archs that don't
   support pgmap
 - PCI devices have all sorts of BAR sizes, including ones smaller
   than a section so a pgmap cannot always be created
 - It is undesirable to waste alot of memory for struct pages,
   especially for a case like a GPU with ~100GB of BAR size
 - We want a synchronous revoke semantic to support FLR with light
   hardware requirements

Use the P2P subsystem to help generate the DMA mapping. This is a
significant upgrade over the abuse of dma_map_resource() that has
historically been used by DMABUF exporters. Experience with an OOT
version of this patch shows that real systems do need this. This
approach deals with all the P2P scenarios:
 - Non-zero PCI bus_offset
 - ACS flags routing traffic to the IOMMU
 - ACS flags that bypass the IOMMU - though vfio noiommu is required
   to hit this.

There will be further work to formalize the revoke semantic in
DMABUF. For now this acts like a move_notify dynamic exporter where
importer fault handling will get a failure when they attempt to map.
This means that only fully restartable fault capable importers can
import the VFIO DMABUFs. A future revoke semantic should open this up
to more HW as the HW only needs to invalidate, not handle restartable
faults.

Jason

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v5 0/9] vfio/pci: Allow MMIO regions to be exported through dma-buf
  2025-10-13 15:26 [PATCH v5 0/9] vfio/pci: Allow MMIO regions to be exported through dma-buf Leon Romanovsky
                   ` (8 preceding siblings ...)
  2025-10-13 15:26 ` [PATCH v5 9/9] vfio/pci: Add dma-buf export support for MMIO regions Leon Romanovsky
@ 2025-10-15 21:15 ` shinichiro.kawasaki
  9 siblings, 0 replies; 45+ messages in thread
From: shinichiro.kawasaki @ 2025-10-15 21:15 UTC (permalink / raw)
  To: linux-block; +Cc: shinichiro.kawasaki, dennis.maisenbacher

[-- Attachment #1: Type: text/plain, Size: 378 bytes --]

Dear patch submitter,

Blktests CI has tested the following submission:
Status:     FAILURE
Name:       [v5,0/9] vfio/pci: Allow MMIO regions to be exported through dma-buf
Patchwork:  https://patchwork.kernel.org/project/linux-block/list/?series=1010803&state=*
Run record: https://github.com/linux-blktests/linux-block/actions/runs/18524773591


Failed test cases: nvme/057



^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2025-10-22 12:50 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-10-13 15:26 [PATCH v5 0/9] vfio/pci: Allow MMIO regions to be exported through dma-buf Leon Romanovsky
2025-10-13 15:26 ` [PATCH v5 1/9] PCI/P2PDMA: Separate the mmap() support from the core logic Leon Romanovsky
2025-10-17  6:30   ` Christoph Hellwig
2025-10-17 11:53     ` Jason Gunthorpe
2025-10-20 12:27       ` Christoph Hellwig
2025-10-20 12:58         ` Jason Gunthorpe
2025-10-20 15:04           ` Leon Romanovsky
2025-10-22  7:10           ` Christoph Hellwig
2025-10-22 11:43             ` Jason Gunthorpe
2025-10-13 15:26 ` [PATCH v5 2/9] PCI/P2PDMA: Simplify bus address mapping API Leon Romanovsky
2025-10-13 15:26 ` [PATCH v5 3/9] PCI/P2PDMA: Refactor to separate core P2P functionality from memory allocation Leon Romanovsky
2025-10-13 15:26 ` [PATCH v5 4/9] PCI/P2PDMA: Export pci_p2pdma_map_type() function Leon Romanovsky
2025-10-17  6:31   ` Christoph Hellwig
2025-10-17 12:14     ` Jason Gunthorpe
2025-10-20 12:29       ` Christoph Hellwig
2025-10-20 13:14         ` Jason Gunthorpe
2025-10-13 15:26 ` [PATCH v5 5/9] types: move phys_vec definition to common header Leon Romanovsky
2025-10-13 15:26 ` [PATCH v5 6/9] vfio: Export vfio device get and put registration helpers Leon Romanovsky
2025-10-13 15:26 ` [PATCH v5 7/9] vfio/pci: Share the core device pointer while invoking feature functions Leon Romanovsky
2025-10-13 15:26 ` [PATCH v5 8/9] vfio/pci: Enable peer-to-peer DMA transactions by default Leon Romanovsky
2025-10-16  4:09   ` Nicolin Chen
2025-10-16  6:10     ` Leon Romanovsky
2025-10-17  6:32   ` Christoph Hellwig
2025-10-17 11:55     ` Jason Gunthorpe
2025-10-20 12:28       ` Christoph Hellwig
2025-10-20 13:08         ` Jason Gunthorpe
2025-10-22  7:08           ` Christoph Hellwig
2025-10-22 11:38             ` Jason Gunthorpe
2025-10-22 11:54   ` Jason Gunthorpe
2025-10-13 15:26 ` [PATCH v5 9/9] vfio/pci: Add dma-buf export support for MMIO regions Leon Romanovsky
2025-10-16 23:53   ` Jason Gunthorpe
2025-10-17  5:40     ` Leon Romanovsky
2025-10-17 15:58       ` Jason Gunthorpe
2025-10-17 16:01         ` Jason Gunthorpe
2025-10-17  0:01   ` Jason Gunthorpe
2025-10-17  6:33   ` Christoph Hellwig
2025-10-17 12:16     ` Jason Gunthorpe
2025-10-17 13:02   ` Jason Gunthorpe
2025-10-17 16:13     ` Leon Romanovsky
2025-10-20 16:15       ` Jason Gunthorpe
2025-10-20 16:44         ` Leon Romanovsky
2025-10-20 16:51           ` Jason Gunthorpe
2025-10-17 23:40   ` Jason Gunthorpe
2025-10-22 12:50   ` Jason Gunthorpe
2025-10-15 21:15 ` [PATCH v5 0/9] vfio/pci: Allow MMIO regions to be exported through dma-buf shinichiro.kawasaki

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).