* [PATCH 00/10] vfio/pci: Allow MMIO regions to be exported through dma-buf
@ 2025-07-23 13:00 Leon Romanovsky
  2025-07-23 13:00 ` [PATCH 01/10] PCI/P2PDMA: Remove redundant bus_offset from map state Leon Romanovsky
                   ` (10 more replies)
  0 siblings, 11 replies; 54+ messages in thread
From: Leon Romanovsky @ 2025-07-23 13:00 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Leon Romanovsky, Christoph Hellwig, Jason Gunthorpe,
	Andrew Morton, Bjorn Helgaas, Christian König, dri-devel,
	iommu, Jens Axboe, Jérôme Glisse, Joerg Roedel, kvm,
	linaro-mm-sig, linux-block, linux-kernel, linux-media, linux-mm,
	linux-pci, Logan Gunthorpe, Marek Szyprowski, Robin Murphy,
	Sumit Semwal, Vivek Kasireddy, Will Deacon

From: Leon Romanovsky <leonro@nvidia.com>

---------------------------------------------------------------------------
Based on blk and DMA patches which will be sent during the coming merge window.
---------------------------------------------------------------------------

This series extends the VFIO PCI subsystem to support exporting MMIO regions
from PCI device BARs as dma-buf objects, enabling safe sharing of non-struct
page memory with controlled lifetime management. This allows RDMA and other
subsystems to import dma-buf FDs and build them into memory regions for PCI
P2P operations.

The series supports a use case for SPDK where an NVMe device will be
owned by SPDK through VFIO while interacting with an RDMA device. The
RDMA device may directly access the NVMe CMB or directly manipulate the
NVMe device's doorbell using PCI P2P.

However, as a general mechanism, it can support many other VFIO
scenarios. The same dmabuf approach is also usable by iommufd for
generic and safe P2P mappings.

In addition to the SPDK use-case mentioned above, the capability added
in this patch series can also be useful when a buffer (located in device
memory such as VRAM) needs to be shared between any two dGPU devices or
instances (assuming one of them is bound to VFIO PCI) as long as they
are P2P DMA compatible.

The implementation provides a revocable attachment mechanism using dma-buf
move operations. MMIO regions are normally pinned as BARs don't change
physical addresses, but access is revoked when the VFIO device is closed
or a PCI reset is issued. This ensures kernel self-defense against
potentially hostile userspace.
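
To make the flow concrete, here is a rough userspace sketch. The
VFIO_DEVICE_FEATURE framing (struct vfio_device_feature) is existing
uapi and VFIO_DEVICE_FEATURE_DMA_BUF is added by this series; the
payload layout below is illustrative only, the authoritative struct is
defined in the last patch:

	#include <fcntl.h>
	#include <string.h>
	#include <sys/ioctl.h>
	#include <linux/vfio.h>

	/* Illustrative payload; see the uapi patch for the real layout. */
	struct dma_buf_req {
		__u32 region_index;	/* which BAR to export */
		__u32 open_flags;	/* O_* flags for the new dma-buf FD */
		__u64 offset;		/* start within the BAR */
		__u64 length;		/* number of bytes to export */
	};

	static int vfio_bar_to_dmabuf(int device_fd, __u32 bar, __u64 len)
	{
		char buf[sizeof(struct vfio_device_feature) +
			 sizeof(struct dma_buf_req)];
		struct vfio_device_feature *feat = (void *)buf;
		struct dma_buf_req req = {
			.region_index = bar,
			.open_flags = O_RDWR | O_CLOEXEC,
			.length = len,
		};

		feat->argsz = sizeof(buf);
		feat->flags = VFIO_DEVICE_FEATURE_GET |
			      VFIO_DEVICE_FEATURE_DMA_BUF;
		memcpy(feat->data, &req, sizeof(req));
		/* On success the ioctl returns a new dma-buf FD. */
		return ioctl(device_fd, VFIO_DEVICE_FEATURE, feat);
	}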

The series includes significant refactoring of the PCI P2PDMA subsystem
to separate core P2P functionality from memory allocation features,
making it more modular and suitable for VFIO use cases that don't need
struct page support.

-----------------------------------------------------------------------
This is based on
https://lore.kernel.org/all/20250307052248.405803-1-vivek.kasireddy@intel.com/
but heavily rewritten to be based on the DMA physical API.
-----------------------------------------------------------------------
The WIP branch can be found here:
https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git/log/?h=dmabuf-vfio

Thanks

Leon Romanovsky (8):
  PCI/P2PDMA: Remove redundant bus_offset from map state
  PCI/P2PDMA: Introduce p2pdma_provider structure for cleaner
    abstraction
  PCI/P2PDMA: Simplify bus address mapping API
  PCI/P2PDMA: Refactor to separate core P2P functionality from memory
    allocation
  PCI/P2PDMA: Export pci_p2pdma_map_type() function
  types: move phys_vec definition to common header
  vfio/pci: Enable peer-to-peer DMA transactions by default
  vfio/pci: Add dma-buf export support for MMIO regions

Vivek Kasireddy (2):
  vfio: Export vfio device get and put registration helpers
  vfio/pci: Share the core device pointer while invoking feature
    functions

 block/blk-mq-dma.c                 |   7 +-
 drivers/iommu/dma-iommu.c          |   4 +-
 drivers/pci/p2pdma.c               | 144 +++++++++----
 drivers/vfio/pci/Kconfig           |  20 ++
 drivers/vfio/pci/Makefile          |   2 +
 drivers/vfio/pci/vfio_pci_config.c |  22 +-
 drivers/vfio/pci/vfio_pci_core.c   |  59 ++++--
 drivers/vfio/pci/vfio_pci_dmabuf.c | 321 +++++++++++++++++++++++++++++
 drivers/vfio/pci/vfio_pci_priv.h   |  23 +++
 drivers/vfio/vfio_main.c           |   2 +
 include/linux/dma-buf.h            |   1 +
 include/linux/pci-p2pdma.h         | 114 +++++-----
 include/linux/types.h              |   5 +
 include/linux/vfio.h               |   2 +
 include/linux/vfio_pci_core.h      |   4 +
 include/uapi/linux/vfio.h          |  19 ++
 kernel/dma/direct.c                |   4 +-
 mm/hmm.c                           |   2 +-
 18 files changed, 631 insertions(+), 124 deletions(-)
 create mode 100644 drivers/vfio/pci/vfio_pci_dmabuf.c

-- 
2.50.1



* [PATCH 01/10] PCI/P2PDMA: Remove redundant bus_offset from map state
  2025-07-23 13:00 [PATCH 00/10] vfio/pci: Allow MMIO regions to be exported through dma-buf Leon Romanovsky
@ 2025-07-23 13:00 ` Leon Romanovsky
  2025-07-24  7:50   ` Christoph Hellwig
  2025-07-23 13:00 ` [PATCH 02/10] PCI/P2PDMA: Introduce p2pdma_provider structure for cleaner abstraction Leon Romanovsky
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 54+ messages in thread
From: Leon Romanovsky @ 2025-07-23 13:00 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Leon Romanovsky, Christoph Hellwig, Jason Gunthorpe,
	Andrew Morton, Bjorn Helgaas, Christian König, dri-devel,
	iommu, Jens Axboe, Jérôme Glisse, Joerg Roedel, kvm,
	linaro-mm-sig, linux-block, linux-kernel, linux-media, linux-mm,
	linux-pci, Logan Gunthorpe, Marek Szyprowski, Robin Murphy,
	Sumit Semwal, Vivek Kasireddy, Will Deacon

From: Leon Romanovsky <leonro@nvidia.com>

Remove the bus_off field from pci_p2pdma_map_state since it duplicates
information already available through the pgmap. The offset is used in
only one place (pci_p2pdma_bus_addr_map) and is always identical to the
bus_offset of the containing pci_p2pdma_pagemap.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/pci/p2pdma.c       | 1 -
 include/linux/pci-p2pdma.h | 3 +--
 2 files changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 8d955c25aed36..fe347ed7fd8f4 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -1009,7 +1009,6 @@ void __pci_p2pdma_update_state(struct pci_p2pdma_map_state *state,
 {
 	state->pgmap = page_pgmap(page);
 	state->map = pci_p2pdma_map_type(state->pgmap, dev);
-	state->bus_off = to_p2p_pgmap(state->pgmap)->bus_offset;
 }
 
 /**
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index 075c20b161d98..b502fc8b49bf9 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -146,7 +146,6 @@ enum pci_p2pdma_map_type {
 struct pci_p2pdma_map_state {
 	struct dev_pagemap *pgmap;
 	enum pci_p2pdma_map_type map;
-	u64 bus_off;
 };
 
 /* helper for pci_p2pdma_state(), do not use directly */
@@ -186,7 +185,7 @@ static inline dma_addr_t
 pci_p2pdma_bus_addr_map(struct pci_p2pdma_map_state *state, phys_addr_t paddr)
 {
 	WARN_ON_ONCE(state->map != PCI_P2PDMA_MAP_BUS_ADDR);
-	return paddr + state->bus_off;
+	return paddr + to_p2p_pgmap(state->pgmap)->bus_offset;
 }
 
 #endif /* _LINUX_PCI_P2P_H */
-- 
2.50.1



* [PATCH 02/10] PCI/P2PDMA: Introduce p2pdma_provider structure for cleaner abstraction
  2025-07-23 13:00 [PATCH 00/10] vfio/pci: Allow MMIO regions to be exported through dma-buf Leon Romanovsky
  2025-07-23 13:00 ` [PATCH 01/10] PCI/P2PDMA: Remove redundant bus_offset from map state Leon Romanovsky
@ 2025-07-23 13:00 ` Leon Romanovsky
  2025-07-24  7:51   ` Christoph Hellwig
  2025-07-29 16:12   ` Jason Gunthorpe
  2025-07-23 13:00 ` [PATCH 03/10] PCI/P2PDMA: Simplify bus address mapping API Leon Romanovsky
                   ` (8 subsequent siblings)
  10 siblings, 2 replies; 54+ messages in thread
From: Leon Romanovsky @ 2025-07-23 13:00 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Leon Romanovsky, Christoph Hellwig, Jason Gunthorpe,
	Andrew Morton, Bjorn Helgaas, Christian König, dri-devel,
	iommu, Jens Axboe, Jérôme Glisse, Joerg Roedel, kvm,
	linaro-mm-sig, linux-block, linux-kernel, linux-media, linux-mm,
	linux-pci, Logan Gunthorpe, Marek Szyprowski, Robin Murphy,
	Sumit Semwal, Vivek Kasireddy, Will Deacon

From: Leon Romanovsky <leonro@nvidia.com>

Extract the core P2PDMA provider information (device owner and bus
offset) from the dev_pagemap into a dedicated p2pdma_provider structure.
This creates a cleaner separation between the memory management layer and
the P2PDMA functionality.

The new p2pdma_provider structure contains:
- owner: pointer to the providing device
- bus_offset: computed offset for non-host transactions

This refactoring simplifies the P2PDMA state management by removing
the need to access pgmap internals directly. The pci_p2pdma_map_state
now stores a pointer to the provider instead of the pgmap, making
the API more explicit and easier to understand.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/pci/p2pdma.c       | 42 +++++++++++++++++++++-----------------
 include/linux/pci-p2pdma.h | 18 ++++++++++++----
 2 files changed, 37 insertions(+), 23 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index fe347ed7fd8f4..5a310026bd24f 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -28,9 +28,8 @@ struct pci_p2pdma {
 };
 
 struct pci_p2pdma_pagemap {
-	struct pci_dev *provider;
-	u64 bus_offset;
 	struct dev_pagemap pgmap;
+	struct p2pdma_provider mem;
 };
 
 static struct pci_p2pdma_pagemap *to_p2p_pgmap(struct dev_pagemap *pgmap)
@@ -204,8 +203,8 @@ static void p2pdma_page_free(struct page *page)
 {
 	struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page_pgmap(page));
 	/* safe to dereference while a reference is held to the percpu ref */
-	struct pci_p2pdma *p2pdma =
-		rcu_dereference_protected(pgmap->provider->p2pdma, 1);
+	struct pci_p2pdma *p2pdma = rcu_dereference_protected(
+		to_pci_dev(pgmap->mem.owner)->p2pdma, 1);
 	struct percpu_ref *ref;
 
 	gen_pool_free_owner(p2pdma->pool, (uintptr_t)page_to_virt(page),
@@ -270,14 +269,15 @@ static int pci_p2pdma_setup(struct pci_dev *pdev)
 
 static void pci_p2pdma_unmap_mappings(void *data)
 {
-	struct pci_dev *pdev = data;
+	struct pci_p2pdma_pagemap *p2p_pgmap = data;
 
 	/*
 	 * Removing the alloc attribute from sysfs will call
 	 * unmap_mapping_range() on the inode, teardown any existing userspace
 	 * mappings and prevent new ones from being created.
 	 */
-	sysfs_remove_file_from_group(&pdev->dev.kobj, &p2pmem_alloc_attr.attr,
+	sysfs_remove_file_from_group(&p2p_pgmap->mem.owner->kobj,
+				     &p2pmem_alloc_attr.attr,
 				     p2pmem_group.name);
 }
 
@@ -328,10 +328,9 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
 	pgmap->nr_range = 1;
 	pgmap->type = MEMORY_DEVICE_PCI_P2PDMA;
 	pgmap->ops = &p2pdma_pgmap_ops;
-
-	p2p_pgmap->provider = pdev;
-	p2p_pgmap->bus_offset = pci_bus_address(pdev, bar) -
-		pci_resource_start(pdev, bar);
+	p2p_pgmap->mem.owner = &pdev->dev;
+	p2p_pgmap->mem.bus_offset =
+		pci_bus_address(pdev, bar) - pci_resource_start(pdev, bar);
 
 	addr = devm_memremap_pages(&pdev->dev, pgmap);
 	if (IS_ERR(addr)) {
@@ -340,7 +339,7 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
 	}
 
 	error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_unmap_mappings,
-					 pdev);
+					 p2p_pgmap);
 	if (error)
 		goto pages_free;
 
@@ -973,16 +972,16 @@ void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
 }
 EXPORT_SYMBOL_GPL(pci_p2pmem_publish);
 
-static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
-						    struct device *dev)
+static enum pci_p2pdma_map_type
+pci_p2pdma_map_type(struct p2pdma_provider *provider, struct device *dev)
 {
 	enum pci_p2pdma_map_type type = PCI_P2PDMA_MAP_NOT_SUPPORTED;
-	struct pci_dev *provider = to_p2p_pgmap(pgmap)->provider;
+	struct pci_dev *pdev = to_pci_dev(provider->owner);
 	struct pci_dev *client;
 	struct pci_p2pdma *p2pdma;
 	int dist;
 
-	if (!provider->p2pdma)
+	if (!pdev->p2pdma)
 		return PCI_P2PDMA_MAP_NOT_SUPPORTED;
 
 	if (!dev_is_pci(dev))
@@ -991,7 +990,7 @@ static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
 	client = to_pci_dev(dev);
 
 	rcu_read_lock();
-	p2pdma = rcu_dereference(provider->p2pdma);
+	p2pdma = rcu_dereference(pdev->p2pdma);
 
 	if (p2pdma)
 		type = xa_to_value(xa_load(&p2pdma->map_types,
@@ -999,7 +998,7 @@ static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
 	rcu_read_unlock();
 
 	if (type == PCI_P2PDMA_MAP_UNKNOWN)
-		return calc_map_type_and_dist(provider, client, &dist, true);
+		return calc_map_type_and_dist(pdev, client, &dist, true);
 
 	return type;
 }
@@ -1007,8 +1006,13 @@ static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
 void __pci_p2pdma_update_state(struct pci_p2pdma_map_state *state,
 		struct device *dev, struct page *page)
 {
-	state->pgmap = page_pgmap(page);
-	state->map = pci_p2pdma_map_type(state->pgmap, dev);
+	struct pci_p2pdma_pagemap *p2p_pgmap = to_p2p_pgmap(page_pgmap(page));
+
+	if (state->mem == &p2p_pgmap->mem)
+		return;
+
+	state->mem = &p2p_pgmap->mem;
+	state->map = pci_p2pdma_map_type(&p2p_pgmap->mem, dev);
 }
 
 /**
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index b502fc8b49bf9..27a2c399f47da 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -16,6 +16,16 @@
 struct block_device;
 struct scatterlist;
 
+/**
+ * struct p2pdma_provider
+ *
+ * A p2pdma provider is a range of MMIO address space available to the CPU.
+ */
+struct p2pdma_provider {
+	struct device *owner;
+	u64 bus_offset;
+};
+
 #ifdef CONFIG_PCI_P2PDMA
 int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
 		u64 offset);
@@ -144,10 +154,11 @@ enum pci_p2pdma_map_type {
 };
 
 struct pci_p2pdma_map_state {
-	struct dev_pagemap *pgmap;
+	struct p2pdma_provider *mem;
 	enum pci_p2pdma_map_type map;
 };
 
+
 /* helper for pci_p2pdma_state(), do not use directly */
 void __pci_p2pdma_update_state(struct pci_p2pdma_map_state *state,
 		struct device *dev, struct page *page);
@@ -166,8 +177,7 @@ pci_p2pdma_state(struct pci_p2pdma_map_state *state, struct device *dev,
 		struct page *page)
 {
 	if (IS_ENABLED(CONFIG_PCI_P2PDMA) && is_pci_p2pdma_page(page)) {
-		if (state->pgmap != page_pgmap(page))
-			__pci_p2pdma_update_state(state, dev, page);
+		__pci_p2pdma_update_state(state, dev, page);
 		return state->map;
 	}
 	return PCI_P2PDMA_MAP_NONE;
@@ -185,7 +195,7 @@ static inline dma_addr_t
 pci_p2pdma_bus_addr_map(struct pci_p2pdma_map_state *state, phys_addr_t paddr)
 {
 	WARN_ON_ONCE(state->map != PCI_P2PDMA_MAP_BUS_ADDR);
-	return paddr + to_p2p_pgmap(state->pgmap)->bus_offset;
+	return paddr + state->mem->bus_offset;
 }
 
 #endif /* _LINUX_PCI_P2P_H */
-- 
2.50.1



* [PATCH 03/10] PCI/P2PDMA: Simplify bus address mapping API
  2025-07-23 13:00 [PATCH 00/10] vfio/pci: Allow MMIO regions to be exported through dma-buf Leon Romanovsky
  2025-07-23 13:00 ` [PATCH 01/10] PCI/P2PDMA: Remove redundant bus_offset from map state Leon Romanovsky
  2025-07-23 13:00 ` [PATCH 02/10] PCI/P2PDMA: Introduce p2pdma_provider structure for cleaner abstraction Leon Romanovsky
@ 2025-07-23 13:00 ` Leon Romanovsky
  2025-07-24  7:52   ` Christoph Hellwig
  2025-07-23 13:00 ` [PATCH 04/10] PCI/P2PDMA: Refactor to separate core P2P functionality from memory allocation Leon Romanovsky
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 54+ messages in thread
From: Leon Romanovsky @ 2025-07-23 13:00 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Leon Romanovsky, Christoph Hellwig, Jason Gunthorpe,
	Andrew Morton, Bjorn Helgaas, Christian König, dri-devel,
	iommu, Jens Axboe, Jérôme Glisse, Joerg Roedel, kvm,
	linaro-mm-sig, linux-block, linux-kernel, linux-media, linux-mm,
	linux-pci, Logan Gunthorpe, Marek Szyprowski, Robin Murphy,
	Sumit Semwal, Vivek Kasireddy, Will Deacon

From: Leon Romanovsky <leonro@nvidia.com>

Update the pci_p2pdma_bus_addr_map() function to take a direct pointer
to the p2pdma_provider structure instead of the pci_p2pdma_map_state.
This simplifies the API by removing the need for callers to extract
the provider from the state structure.

The change updates all callers across the kernel (block layer, IOMMU,
DMA direct, and HMM) to pass the provider pointer directly, making
the code more explicit and reducing unnecessary indirection. This
also removes the runtime warning check since callers now have direct
control over which provider they use.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 block/blk-mq-dma.c         | 2 +-
 drivers/iommu/dma-iommu.c  | 4 ++--
 include/linux/pci-p2pdma.h | 7 +++----
 kernel/dma/direct.c        | 4 ++--
 mm/hmm.c                   | 2 +-
 5 files changed, 9 insertions(+), 10 deletions(-)

diff --git a/block/blk-mq-dma.c b/block/blk-mq-dma.c
index 37e2142be4f7d..eeac653e3f3bd 100644
--- a/block/blk-mq-dma.c
+++ b/block/blk-mq-dma.c
@@ -79,7 +79,7 @@ static inline bool blk_can_dma_map_iova(struct request *req,
 
 static bool blk_dma_map_bus(struct blk_dma_iter *iter, struct phys_vec *vec)
 {
-	iter->addr = pci_p2pdma_bus_addr_map(&iter->p2pdma, vec->paddr);
+	iter->addr = pci_p2pdma_bus_addr_map(iter->p2pdma.mem, vec->paddr);
 	iter->len = vec->len;
 	return true;
 }
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index cd4bc22efa966..1853a969e1978 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1427,8 +1427,8 @@ int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
 			 * as a bus address, __finalise_sg() will copy the dma
 			 * address into the output segment.
 			 */
-			s->dma_address = pci_p2pdma_bus_addr_map(&p2pdma_state,
-						sg_phys(s));
+			s->dma_address = pci_p2pdma_bus_addr_map(
+				p2pdma_state.mem, sg_phys(s));
 			sg_dma_len(s) = sg->length;
 			sg_dma_mark_bus_address(s);
 			continue;
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index 27a2c399f47da..eef96636c67e6 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -186,16 +186,15 @@ pci_p2pdma_state(struct pci_p2pdma_map_state *state, struct device *dev,
 /**
  * pci_p2pdma_bus_addr_map - Translate a physical address to a bus address
  *			     for a PCI_P2PDMA_MAP_BUS_ADDR transfer.
- * @state:	P2P state structure
+ * @provider:	P2P provider structure
  * @paddr:	physical address to map
  *
  * Map a physically contiguous PCI_P2PDMA_MAP_BUS_ADDR transfer.
  */
 static inline dma_addr_t
-pci_p2pdma_bus_addr_map(struct pci_p2pdma_map_state *state, phys_addr_t paddr)
+pci_p2pdma_bus_addr_map(struct p2pdma_provider *provider, phys_addr_t paddr)
 {
-	WARN_ON_ONCE(state->map != PCI_P2PDMA_MAP_BUS_ADDR);
-	return paddr + state->mem->bus_offset;
+	return paddr + provider->bus_offset;
 }
 
 #endif /* _LINUX_PCI_P2P_H */
diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index fa75e30700730..de34ee5903766 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -484,8 +484,8 @@ int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
 			}
 			break;
 		case PCI_P2PDMA_MAP_BUS_ADDR:
-			sg->dma_address = pci_p2pdma_bus_addr_map(&p2pdma_state,
-					sg_phys(sg));
+			sg->dma_address = pci_p2pdma_bus_addr_map(
+				p2pdma_state.mem, sg_phys(sg));
 			sg_dma_mark_bus_address(sg);
 			continue;
 		default:
diff --git a/mm/hmm.c b/mm/hmm.c
index 9354fae3ae06f..f9970b0e527ed 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -755,7 +755,7 @@ dma_addr_t hmm_dma_map_pfn(struct device *dev, struct hmm_dma_map *map,
 		break;
 	case PCI_P2PDMA_MAP_BUS_ADDR:
 		pfns[idx] |= HMM_PFN_P2PDMA_BUS | HMM_PFN_DMA_MAPPED;
-		return pci_p2pdma_bus_addr_map(p2pdma_state, paddr);
+		return pci_p2pdma_bus_addr_map(p2pdma_state->mem, paddr);
 	default:
 		return DMA_MAPPING_ERROR;
 	}
-- 
2.50.1



* [PATCH 04/10] PCI/P2PDMA: Refactor to separate core P2P functionality from memory allocation
  2025-07-23 13:00 [PATCH 00/10] vfio/pci: Allow MMIO regions to be exported through dma-buf Leon Romanovsky
                   ` (2 preceding siblings ...)
  2025-07-23 13:00 ` [PATCH 03/10] PCI/P2PDMA: Simplify bus address mapping API Leon Romanovsky
@ 2025-07-23 13:00 ` Leon Romanovsky
  2025-07-23 13:00 ` [PATCH 05/10] PCI/P2PDMA: Export pci_p2pdma_map_type() function Leon Romanovsky
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 54+ messages in thread
From: Leon Romanovsky @ 2025-07-23 13:00 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Leon Romanovsky, Christoph Hellwig, Jason Gunthorpe,
	Andrew Morton, Bjorn Helgaas, Christian König, dri-devel,
	iommu, Jens Axboe, Jérôme Glisse, Joerg Roedel, kvm,
	linaro-mm-sig, linux-block, linux-kernel, linux-media, linux-mm,
	linux-pci, Logan Gunthorpe, Marek Szyprowski, Robin Murphy,
	Sumit Semwal, Vivek Kasireddy, Will Deacon

From: Leon Romanovsky <leonro@nvidia.com>

Refactor the PCI P2PDMA subsystem to separate the core peer-to-peer DMA
functionality from the optional memory allocation layer. This creates a
two-tier architecture:

The core layer provides P2P mapping functionality for physical addresses
based on PCI device MMIO BARs and integrates with the DMA API for
mapping operations. This layer is required for all P2PDMA users.

The optional upper layer provides memory allocation capabilities,
including a gen_pool allocator, struct page support, and a sysfs
interface for user space access.

This separation allows subsystems like VFIO to use only the core P2P
mapping functionality without the overhead of memory allocation features
they don't need. The core functionality is now available through the
new pci_p2pdma_enable() function that returns a p2pdma_provider
structure.
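
As a sketch of the intended consumer pattern, a driver that only needs
the core layer could do something like this (error handling trimmed,
probe-time plumbing elided; pci_p2pdma_enable() and
pci_p2pdma_bus_addr_map() are the functions introduced in this series):

	#include <linux/dma-mapping.h>
	#include <linux/pci-p2pdma.h>

	static dma_addr_t peer_map_bar(struct pci_dev *pdev, phys_addr_t paddr)
	{
		struct p2pdma_provider *provider;

		/* One-time setup, normally done at probe time. */
		provider = pci_p2pdma_enable(pdev);
		if (IS_ERR(provider))
			return DMA_MAPPING_ERROR;

		/*
		 * For a PCI_P2PDMA_MAP_BUS_ADDR transfer: CPU physical
		 * address within the BAR -> PCI bus address.
		 */
		return pci_p2pdma_bus_addr_map(provider, paddr);
	}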

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/pci/p2pdma.c       | 108 +++++++++++++++++++++++++------------
 include/linux/pci-p2pdma.h |   5 ++
 2 files changed, 80 insertions(+), 33 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 5a310026bd24f..8e2525618d922 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -25,11 +25,12 @@ struct pci_p2pdma {
 	struct gen_pool *pool;
 	bool p2pmem_published;
 	struct xarray map_types;
+	struct p2pdma_provider mem;
 };
 
 struct pci_p2pdma_pagemap {
 	struct dev_pagemap pgmap;
-	struct p2pdma_provider mem;
+	struct p2pdma_provider *mem;
 };
 
 static struct pci_p2pdma_pagemap *to_p2p_pgmap(struct dev_pagemap *pgmap)
@@ -204,7 +205,7 @@ static void p2pdma_page_free(struct page *page)
 	struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page_pgmap(page));
 	/* safe to dereference while a reference is held to the percpu ref */
 	struct pci_p2pdma *p2pdma = rcu_dereference_protected(
-		to_pci_dev(pgmap->mem.owner)->p2pdma, 1);
+		to_pci_dev(pgmap->mem->owner)->p2pdma, 1);
 	struct percpu_ref *ref;
 
 	gen_pool_free_owner(p2pdma->pool, (uintptr_t)page_to_virt(page),
@@ -227,44 +228,77 @@ static void pci_p2pdma_release(void *data)
 
 	/* Flush and disable pci_alloc_p2p_mem() */
 	pdev->p2pdma = NULL;
-	synchronize_rcu();
+	if (p2pdma->pool)
+		synchronize_rcu();
+	xa_destroy(&p2pdma->map_types);
+
+	if (!p2pdma->pool)
+		return;
 
 	gen_pool_destroy(p2pdma->pool);
 	sysfs_remove_group(&pdev->dev.kobj, &p2pmem_group);
-	xa_destroy(&p2pdma->map_types);
 }
 
-static int pci_p2pdma_setup(struct pci_dev *pdev)
+/**
+ * pci_p2pdma_enable - Enable peer-to-peer DMA support for a PCI device
+ * @pdev: The PCI device to enable P2PDMA for
+ *
+ * This function initializes the peer-to-peer DMA infrastructure for a PCI
+ * device. It allocates and sets up the necessary data structures to support
+ * P2PDMA operations, including mapping type tracking.
+ */
+struct p2pdma_provider *pci_p2pdma_enable(struct pci_dev *pdev)
 {
-	int error = -ENOMEM;
 	struct pci_p2pdma *p2p;
+	int ret;
 
 	p2p = devm_kzalloc(&pdev->dev, sizeof(*p2p), GFP_KERNEL);
 	if (!p2p)
-		return -ENOMEM;
+		return ERR_PTR(-ENOMEM);
 
 	xa_init(&p2p->map_types);
+	p2p->mem.owner = &pdev->dev;
+	/* On all p2p platforms bus_offset is the same for all BARs */
+	p2p->mem.bus_offset =
+		pci_bus_address(pdev, 0) - pci_resource_start(pdev, 0);
 
-	p2p->pool = gen_pool_create(PAGE_SHIFT, dev_to_node(&pdev->dev));
-	if (!p2p->pool)
-		goto out;
+	ret = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_release, pdev);
+	if (ret)
+		goto out_p2p;
 
-	error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_release, pdev);
-	if (error)
-		goto out_pool_destroy;
+	rcu_assign_pointer(pdev->p2pdma, p2p);
+	return &p2p->mem;
 
-	error = sysfs_create_group(&pdev->dev.kobj, &p2pmem_group);
-	if (error)
+out_p2p:
+	devm_kfree(&pdev->dev, p2p);
+	return ERR_PTR(ret);
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_enable);
+
+static int pci_p2pdma_setup_pool(struct pci_dev *pdev)
+{
+	struct pci_p2pdma *p2pdma;
+	int ret;
+
+	p2pdma = rcu_dereference_protected(pdev->p2pdma, 1);
+	if (p2pdma->pool)
+		/* The pool is already set up, do nothing */
+		return 0;
+
+	p2pdma->pool = gen_pool_create(PAGE_SHIFT, dev_to_node(&pdev->dev));
+	if (!p2pdma->pool)
+		return -ENOMEM;
+
+	ret = sysfs_create_group(&pdev->dev.kobj, &p2pmem_group);
+	if (ret)
 		goto out_pool_destroy;
 
-	rcu_assign_pointer(pdev->p2pdma, p2p);
 	return 0;
 
 out_pool_destroy:
-	gen_pool_destroy(p2p->pool);
-out:
-	devm_kfree(&pdev->dev, p2p);
-	return error;
+	gen_pool_destroy(p2pdma->pool);
+	p2pdma->pool = NULL;
+	return ret;
 }
 
 static void pci_p2pdma_unmap_mappings(void *data)
@@ -276,7 +310,7 @@ static void pci_p2pdma_unmap_mappings(void *data)
 	 * unmap_mapping_range() on the inode, teardown any existing userspace
 	 * mappings and prevent new ones from being created.
 	 */
-	sysfs_remove_file_from_group(&p2p_pgmap->mem.owner->kobj,
+	sysfs_remove_file_from_group(&p2p_pgmap->mem->owner->kobj,
 				     &p2pmem_alloc_attr.attr,
 				     p2pmem_group.name);
 }
@@ -295,6 +329,7 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
 			    u64 offset)
 {
 	struct pci_p2pdma_pagemap *p2p_pgmap;
+	struct p2pdma_provider *mem;
 	struct dev_pagemap *pgmap;
 	struct pci_p2pdma *p2pdma;
 	void *addr;
@@ -312,15 +347,22 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
 	if (size + offset > pci_resource_len(pdev, bar))
 		return -EINVAL;
 
-	if (!pdev->p2pdma) {
-		error = pci_p2pdma_setup(pdev);
+	p2pdma = rcu_dereference_protected(pdev->p2pdma, 1);
+	if (!p2pdma) {
+		mem = pci_p2pdma_enable(pdev);
+		if (IS_ERR(mem))
+			return PTR_ERR(mem);
+
+		error = pci_p2pdma_setup_pool(pdev);
 		if (error)
 			return error;
 	}
 
 	p2p_pgmap = devm_kzalloc(&pdev->dev, sizeof(*p2p_pgmap), GFP_KERNEL);
-	if (!p2p_pgmap)
-		return -ENOMEM;
+	if (!p2p_pgmap) {
+		error = -ENOMEM;
+		goto free_pool;
+	}
 
 	pgmap = &p2p_pgmap->pgmap;
 	pgmap->range.start = pci_resource_start(pdev, bar) + offset;
@@ -328,9 +370,7 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
 	pgmap->nr_range = 1;
 	pgmap->type = MEMORY_DEVICE_PCI_P2PDMA;
 	pgmap->ops = &p2pdma_pgmap_ops;
-	p2p_pgmap->mem.owner = &pdev->dev;
-	p2p_pgmap->mem.bus_offset =
-		pci_bus_address(pdev, bar) - pci_resource_start(pdev, bar);
+	p2p_pgmap->mem = mem;
 
 	addr = devm_memremap_pages(&pdev->dev, pgmap);
 	if (IS_ERR(addr)) {
@@ -343,7 +383,6 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
 	if (error)
 		goto pages_free;
 
-	p2pdma = rcu_dereference_protected(pdev->p2pdma, 1);
 	error = gen_pool_add_owner(p2pdma->pool, (unsigned long)addr,
 			pci_bus_address(pdev, bar) + offset,
 			range_len(&pgmap->range), dev_to_node(&pdev->dev),
@@ -359,7 +398,10 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
 pages_free:
 	devm_memunmap_pages(&pdev->dev, pgmap);
 pgmap_free:
-	devm_kfree(&pdev->dev, pgmap);
+	devm_kfree(&pdev->dev, p2p_pgmap);
+free_pool:
+	sysfs_remove_group(&pdev->dev.kobj, &p2pmem_group);
+	gen_pool_destroy(p2pdma->pool);
 	return error;
 }
 EXPORT_SYMBOL_GPL(pci_p2pdma_add_resource);
@@ -1008,11 +1050,11 @@ void __pci_p2pdma_update_state(struct pci_p2pdma_map_state *state,
 {
 	struct pci_p2pdma_pagemap *p2p_pgmap = to_p2p_pgmap(page_pgmap(page));
 
-	if (state->mem == &p2p_pgmap->mem)
+	if (state->mem == p2p_pgmap->mem)
 		return;
 
-	state->mem = &p2p_pgmap->mem;
-	state->map = pci_p2pdma_map_type(&p2p_pgmap->mem, dev);
+	state->mem = p2p_pgmap->mem;
+	state->map = pci_p2pdma_map_type(p2p_pgmap->mem, dev);
 }
 
 /**
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index eef96636c67e6..83f11dc8659a7 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -27,6 +27,7 @@ struct p2pdma_provider {
 };
 
 #ifdef CONFIG_PCI_P2PDMA
+struct p2pdma_provider *pci_p2pdma_enable(struct pci_dev *pdev);
 int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
 		u64 offset);
 int pci_p2pdma_distance_many(struct pci_dev *provider, struct device **clients,
@@ -45,6 +46,10 @@ int pci_p2pdma_enable_store(const char *page, struct pci_dev **p2p_dev,
 ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev,
 			       bool use_p2pdma);
 #else /* CONFIG_PCI_P2PDMA */
+static inline struct p2pdma_provider *pci_p2pdma_enable(struct pci_dev *pdev)
+{
+	return ERR_PTR(-EOPNOTSUPP);
+}
 static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar,
 		size_t size, u64 offset)
 {
-- 
2.50.1



* [PATCH 05/10] PCI/P2PDMA: Export pci_p2pdma_map_type() function
  2025-07-23 13:00 [PATCH 00/10] vfio/pci: Allow MMIO regions to be exported through dma-buf Leon Romanovsky
                   ` (3 preceding siblings ...)
  2025-07-23 13:00 ` [PATCH 04/10] PCI/P2PDMA: Refactor to separate core P2P functionality from memory allocation Leon Romanovsky
@ 2025-07-23 13:00 ` Leon Romanovsky
  2025-07-24  8:03   ` Christoph Hellwig
  2025-07-23 13:00 ` [PATCH 06/10] types: move phys_vec definition to common header Leon Romanovsky
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 54+ messages in thread
From: Leon Romanovsky @ 2025-07-23 13:00 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Leon Romanovsky, Christoph Hellwig, Jason Gunthorpe,
	Andrew Morton, Bjorn Helgaas, Christian König, dri-devel,
	iommu, Jens Axboe, Jérôme Glisse, Joerg Roedel, kvm,
	linaro-mm-sig, linux-block, linux-kernel, linux-media, linux-mm,
	linux-pci, Logan Gunthorpe, Marek Szyprowski, Robin Murphy,
	Sumit Semwal, Vivek Kasireddy, Will Deacon

From: Leon Romanovsky <leonro@nvidia.com>

Export the pci_p2pdma_map_type() function to allow external modules
and subsystems to determine the appropriate mapping type for P2PDMA
transfers between a provider and target device.

The function determines whether peer-to-peer DMA transfers can be
done directly through PCI switches (PCI_P2PDMA_MAP_BUS_ADDR) or
must go through the host bridge (PCI_P2PDMA_MAP_THRU_HOST_BRIDGE),
or if the transfer is not supported at all.

This export enables subsystems like VFIO to properly handle P2PDMA
operations by querying the mapping type before attempting transfers,
ensuring correct DMA address programming and error handling.
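
For instance, an importer could dispatch on the result as sketched
below; dma_map_resource() merely stands in for whatever mapping API the
importing driver actually uses:

	#include <linux/dma-mapping.h>
	#include <linux/pci-p2pdma.h>

	static dma_addr_t map_p2p_window(struct p2pdma_provider *provider,
					 struct device *dev,
					 phys_addr_t paddr, size_t len)
	{
		switch (pci_p2pdma_map_type(provider, dev)) {
		case PCI_P2PDMA_MAP_BUS_ADDR:
			/* Stays below the PCI switch: use bus addresses. */
			return pci_p2pdma_bus_addr_map(provider, paddr);
		case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
			/* Allowlisted host bridge: map via the DMA API. */
			return dma_map_resource(dev, paddr, len,
						DMA_BIDIRECTIONAL, 0);
		default:
			/* Unsupported pair (or not a P2P transfer). */
			return DMA_MAPPING_ERROR;
		}
	}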

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/pci/p2pdma.c       | 15 ++++++-
 include/linux/pci-p2pdma.h | 85 +++++++++++++++++++++-----------------
 2 files changed, 59 insertions(+), 41 deletions(-)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 8e2525618d922..326c7d88a1690 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -1014,8 +1014,18 @@ void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
 }
 EXPORT_SYMBOL_GPL(pci_p2pmem_publish);
 
-static enum pci_p2pdma_map_type
-pci_p2pdma_map_type(struct p2pdma_provider *provider, struct device *dev)
+/**
+ * pci_p2pdma_map_type - Determine the mapping type for P2PDMA transfers
+ * @provider: P2PDMA provider structure
+ * @dev: Target device for the transfer
+ *
+ * Determines how peer-to-peer DMA transfers should be mapped between
+ * the provider and the target device. The mapping type indicates whether
+ * the transfer can be done directly through PCI switches or must go
+ * through the host bridge.
+ */
+enum pci_p2pdma_map_type pci_p2pdma_map_type(struct p2pdma_provider *provider,
+					     struct device *dev)
 {
 	enum pci_p2pdma_map_type type = PCI_P2PDMA_MAP_NOT_SUPPORTED;
 	struct pci_dev *pdev = to_pci_dev(provider->owner);
@@ -1044,6 +1054,7 @@ pci_p2pdma_map_type(struct p2pdma_provider *provider, struct device *dev)
 
 	return type;
 }
+EXPORT_SYMBOL_GPL(pci_p2pdma_map_type);
 
 void __pci_p2pdma_update_state(struct pci_p2pdma_map_state *state,
 		struct device *dev, struct page *page)
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index 83f11dc8659a7..dea98baee5ce2 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -26,6 +26,45 @@ struct p2pdma_provider {
 	u64 bus_offset;
 };
 
+enum pci_p2pdma_map_type {
+	/*
+	 * PCI_P2PDMA_MAP_UNKNOWN: Used internally as an initial state before
+	 * the mapping type has been calculated. Exported routines for the API
+	 * will never return this value.
+	 */
+	PCI_P2PDMA_MAP_UNKNOWN = 0,
+
+	/*
+	 * Not a PCI P2PDMA transfer.
+	 */
+	PCI_P2PDMA_MAP_NONE,
+
+	/*
+	 * PCI_P2PDMA_MAP_NOT_SUPPORTED: Indicates the transaction will
+	 * traverse the host bridge and the host bridge is not in the
+	 * allowlist. DMA Mapping routines should return an error when
+	 * this is returned.
+	 */
+	PCI_P2PDMA_MAP_NOT_SUPPORTED,
+
+	/*
+	 * PCI_P2PDMA_MAP_BUS_ADDR: Indicates that two devices can talk to
+	 * each other directly through a PCI switch and the transaction will
+	 * not traverse the host bridge. Such a mapping should program
+	 * the DMA engine with PCI bus addresses.
+	 */
+	PCI_P2PDMA_MAP_BUS_ADDR,
+
+	/*
+	 * PCI_P2PDMA_MAP_THRU_HOST_BRIDGE: Indicates two devices can talk
+	 * to each other, but the transaction traverses a host bridge on the
+	 * allowlist. In this case, a normal mapping either with CPU physical
+	 * addresses (in the case of dma-direct) or IOVA addresses (in the
+	 * case of IOMMUs) should be used to program the DMA engine.
+	 */
+	PCI_P2PDMA_MAP_THRU_HOST_BRIDGE,
+};
+
 #ifdef CONFIG_PCI_P2PDMA
 struct p2pdma_provider *pci_p2pdma_enable(struct pci_dev *pdev);
 int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
@@ -45,6 +84,8 @@ int pci_p2pdma_enable_store(const char *page, struct pci_dev **p2p_dev,
 			    bool *use_p2pdma);
 ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev,
 			       bool use_p2pdma);
+enum pci_p2pdma_map_type pci_p2pdma_map_type(struct p2pdma_provider *provider,
+					     struct device *dev);
 #else /* CONFIG_PCI_P2PDMA */
 static inline struct p2pdma_provider *pci_p2pdma_enable(struct pci_dev *pdev)
 {
@@ -105,6 +146,11 @@ static inline ssize_t pci_p2pdma_enable_show(char *page,
 {
 	return sprintf(page, "none\n");
 }
+static inline enum pci_p2pdma_map_type
+pci_p2pdma_map_type(struct p2pdma_provider *provider, struct device *dev)
+{
+	return PCI_P2PDMA_MAP_NOT_SUPPORTED;
+}
 #endif /* CONFIG_PCI_P2PDMA */
 
 
@@ -119,45 +165,6 @@ static inline struct pci_dev *pci_p2pmem_find(struct device *client)
 	return pci_p2pmem_find_many(&client, 1);
 }
 
-enum pci_p2pdma_map_type {
-	/*
-	 * PCI_P2PDMA_MAP_UNKNOWN: Used internally as an initial state before
-	 * the mapping type has been calculated. Exported routines for the API
-	 * will never return this value.
-	 */
-	PCI_P2PDMA_MAP_UNKNOWN = 0,
-
-	/*
-	 * Not a PCI P2PDMA transfer.
-	 */
-	PCI_P2PDMA_MAP_NONE,
-
-	/*
-	 * PCI_P2PDMA_MAP_NOT_SUPPORTED: Indicates the transaction will
-	 * traverse the host bridge and the host bridge is not in the
-	 * allowlist. DMA Mapping routines should return an error when
-	 * this is returned.
-	 */
-	PCI_P2PDMA_MAP_NOT_SUPPORTED,
-
-	/*
-	 * PCI_P2PDMA_MAP_BUS_ADDR: Indicates that two devices can talk to
-	 * each other directly through a PCI switch and the transaction will
-	 * not traverse the host bridge. Such a mapping should program
-	 * the DMA engine with PCI bus addresses.
-	 */
-	PCI_P2PDMA_MAP_BUS_ADDR,
-
-	/*
-	 * PCI_P2PDMA_MAP_THRU_HOST_BRIDGE: Indicates two devices can talk
-	 * to each other, but the transaction traverses a host bridge on the
-	 * allowlist. In this case, a normal mapping either with CPU physical
-	 * addresses (in the case of dma-direct) or IOVA addresses (in the
-	 * case of IOMMUs) should be used to program the DMA engine.
-	 */
-	PCI_P2PDMA_MAP_THRU_HOST_BRIDGE,
-};
-
 struct pci_p2pdma_map_state {
 	struct p2pdma_provider *mem;
 	enum pci_p2pdma_map_type map;
-- 
2.50.1



* [PATCH 06/10] types: move phys_vec definition to common header
  2025-07-23 13:00 [PATCH 00/10] vfio/pci: Allow MMIO regions to be exported through dma-buf Leon Romanovsky
                   ` (4 preceding siblings ...)
  2025-07-23 13:00 ` [PATCH 05/10] PCI/P2PDMA: Export pci_p2pdma_map_type() function Leon Romanovsky
@ 2025-07-23 13:00 ` Leon Romanovsky
  2025-07-23 13:00 ` [PATCH 07/10] vfio: Export vfio device get and put registration helpers Leon Romanovsky
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 54+ messages in thread
From: Leon Romanovsky @ 2025-07-23 13:00 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Leon Romanovsky, Christoph Hellwig, Jason Gunthorpe,
	Andrew Morton, Bjorn Helgaas, Christian König, dri-devel,
	iommu, Jens Axboe, Jérôme Glisse, Joerg Roedel, kvm,
	linaro-mm-sig, linux-block, linux-kernel, linux-media, linux-mm,
	linux-pci, Logan Gunthorpe, Marek Szyprowski, Robin Murphy,
	Sumit Semwal, Vivek Kasireddy, Will Deacon

From: Leon Romanovsky <leonro@nvidia.com>

Move the struct phys_vec definition from block/blk-mq-dma.c to
include/linux/types.h to make it available for use across the kernel.

The phys_vec structure represents a physical address range with a
length, which is used by the new physical address-based DMA mapping
API. This structure is already used by the block layer and will be
needed by upcoming VFIO patches for dma-buf operations.

Moving this definition to types.h provides a centralized location
for this common data structure and eliminates code duplication
across subsystems that need to work with physical address ranges.
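
For example, a contiguous slice of a BAR can then be described as (a
sketch; SZ_1M comes from linux/sizes.h):

	struct phys_vec vec = {
		.paddr = pci_resource_start(pdev, bar),
		.len = SZ_1M,	/* one 1 MiB window of the BAR */
	};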

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 block/blk-mq-dma.c    | 5 -----
 include/linux/types.h | 5 +++++
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/block/blk-mq-dma.c b/block/blk-mq-dma.c
index eeac653e3f3bd..b0fa53c353d9d 100644
--- a/block/blk-mq-dma.c
+++ b/block/blk-mq-dma.c
@@ -5,11 +5,6 @@
 #include <linux/blk-mq-dma.h>
 #include "blk.h"
 
-struct phys_vec {
-	phys_addr_t	paddr;
-	u32		len;
-};
-
 static bool blk_map_iter_next(struct request *req, struct req_iterator *iter,
 			      struct phys_vec *vec)
 {
diff --git a/include/linux/types.h b/include/linux/types.h
index 6dfdb8e8e4c35..2bc56681b2e62 100644
--- a/include/linux/types.h
+++ b/include/linux/types.h
@@ -170,6 +170,11 @@ typedef u64 phys_addr_t;
 typedef u32 phys_addr_t;
 #endif
 
+struct phys_vec {
+	phys_addr_t	paddr;
+	u32		len;
+};
+
 typedef phys_addr_t resource_size_t;
 
 /*
-- 
2.50.1



* [PATCH 07/10] vfio: Export vfio device get and put registration helpers
  2025-07-23 13:00 [PATCH 00/10] vfio/pci: Allow MMIO regions to be exported through dma-buf Leon Romanovsky
                   ` (5 preceding siblings ...)
  2025-07-23 13:00 ` [PATCH 06/10] types: move phys_vec definition to common header Leon Romanovsky
@ 2025-07-23 13:00 ` Leon Romanovsky
  2025-07-23 13:00 ` [PATCH 08/10] vfio/pci: Enable peer-to-peer DMA transactions by default Leon Romanovsky
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 54+ messages in thread
From: Leon Romanovsky @ 2025-07-23 13:00 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Vivek Kasireddy, Christoph Hellwig, Jason Gunthorpe,
	Andrew Morton, Bjorn Helgaas, Christian König, dri-devel,
	iommu, Jens Axboe, Jérôme Glisse, Joerg Roedel, kvm,
	linaro-mm-sig, linux-block, linux-kernel, linux-media, linux-mm,
	linux-pci, Logan Gunthorpe, Marek Szyprowski, Robin Murphy,
	Sumit Semwal, Will Deacon

From: Vivek Kasireddy <vivek.kasireddy@intel.com>

These helpers are useful for managing additional references taken on a
device by other associated VFIO modules.
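
Usage follows the usual try-get/put pattern, e.g.:

	if (!vfio_device_try_get_registration(device))
		return -ENODEV;	/* device is already unregistering */
	/* ... safely use the device ... */
	vfio_device_put_registration(device);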

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/vfio/vfio_main.c | 2 ++
 include/linux/vfio.h     | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index 1fd261efc582d..620a3ee5d04db 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -171,11 +171,13 @@ void vfio_device_put_registration(struct vfio_device *device)
 	if (refcount_dec_and_test(&device->refcount))
 		complete(&device->comp);
 }
+EXPORT_SYMBOL_GPL(vfio_device_put_registration);
 
 bool vfio_device_try_get_registration(struct vfio_device *device)
 {
 	return refcount_inc_not_zero(&device->refcount);
 }
+EXPORT_SYMBOL_GPL(vfio_device_try_get_registration);
 
 /*
  * VFIO driver API
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 707b00772ce1f..ba65bbdffd0b2 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -293,6 +293,8 @@ static inline void vfio_put_device(struct vfio_device *device)
 int vfio_register_group_dev(struct vfio_device *device);
 int vfio_register_emulated_iommu_dev(struct vfio_device *device);
 void vfio_unregister_group_dev(struct vfio_device *device);
+bool vfio_device_try_get_registration(struct vfio_device *device);
+void vfio_device_put_registration(struct vfio_device *device);
 
 int vfio_assign_device_set(struct vfio_device *device, void *set_id);
 unsigned int vfio_device_set_open_count(struct vfio_device_set *dev_set);
-- 
2.50.1



* [PATCH 08/10] vfio/pci: Enable peer-to-peer DMA transactions by default
  2025-07-23 13:00 [PATCH 00/10] vfio/pci: Allow MMIO regions to be exported through dma-buf Leon Romanovsky
                   ` (6 preceding siblings ...)
  2025-07-23 13:00 ` [PATCH 07/10] vfio: Export vfio device get and put registration helpers Leon Romanovsky
@ 2025-07-23 13:00 ` Leon Romanovsky
  2025-07-23 13:00 ` [PATCH 09/10] vfio/pci: Share the core device pointer while invoking feature functions Leon Romanovsky
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 54+ messages in thread
From: Leon Romanovsky @ 2025-07-23 13:00 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Leon Romanovsky, Christoph Hellwig, Jason Gunthorpe,
	Andrew Morton, Bjorn Helgaas, Christian König, dri-devel,
	iommu, Jens Axboe, Jérôme Glisse, Joerg Roedel, kvm,
	linaro-mm-sig, linux-block, linux-kernel, linux-media, linux-mm,
	linux-pci, Logan Gunthorpe, Marek Szyprowski, Robin Murphy,
	Sumit Semwal, Vivek Kasireddy, Will Deacon

From: Leon Romanovsky <leonro@nvidia.com>

Make sure that all VFIO PCI devices have peer-to-peer capabilities
enabled, so that their MMIO memory can be exported through DMABUF.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 4 ++++
 include/linux/vfio_pci_core.h    | 1 +
 2 files changed, 5 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 6328c3a05bcdd..1e675daab5753 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -29,6 +29,7 @@
 #include <linux/nospec.h>
 #include <linux/sched/mm.h>
 #include <linux/iommufd.h>
+#include <linux/pci-p2pdma.h>
 #if IS_ENABLED(CONFIG_EEH)
 #include <asm/eeh.h>
 #endif
@@ -2091,6 +2092,9 @@ int vfio_pci_core_init_dev(struct vfio_device *core_vdev)
 	INIT_LIST_HEAD(&vdev->dummy_resources_list);
 	INIT_LIST_HEAD(&vdev->ioeventfds_list);
 	INIT_LIST_HEAD(&vdev->sriov_pfs_item);
+	vdev->provider = pci_p2pdma_enable(vdev->pdev);
+	if (IS_ERR(vdev->provider))
+		return PTR_ERR(vdev->provider);
 	init_rwsem(&vdev->memory_lock);
 	xa_init(&vdev->ctx);
 
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index fbb472dd99b36..b017fae251811 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -94,6 +94,7 @@ struct vfio_pci_core_device {
 	struct vfio_pci_core_device	*sriov_pf_core_dev;
 	struct notifier_block	nb;
 	struct rw_semaphore	memory_lock;
+	struct p2pdma_provider  *provider;
 };
 
 /* Will be exported for vfio pci drivers usage */
-- 
2.50.1



* [PATCH 09/10] vfio/pci: Share the core device pointer while invoking feature functions
  2025-07-23 13:00 [PATCH 00/10] vfio/pci: Allow MMIO regions to be exported through dma-buf Leon Romanovsky
                   ` (7 preceding siblings ...)
  2025-07-23 13:00 ` [PATCH 08/10] vfio/pci: Enable peer-to-peer DMA transactions by default Leon Romanovsky
@ 2025-07-23 13:00 ` Leon Romanovsky
  2025-07-28 20:55   ` Alex Williamson
  2025-07-23 13:00 ` [PATCH 10/10] vfio/pci: Add dma-buf export support for MMIO regions Leon Romanovsky
  2025-07-30 19:58 ` [PATCH 00/10] vfio/pci: Allow MMIO regions to be exported through dma-buf Alex Williamson
  10 siblings, 1 reply; 54+ messages in thread
From: Leon Romanovsky @ 2025-07-23 13:00 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Vivek Kasireddy, Christoph Hellwig, Jason Gunthorpe,
	Andrew Morton, Bjorn Helgaas, Christian König, dri-devel,
	iommu, Jens Axboe, Jérôme Glisse, Joerg Roedel, kvm,
	linaro-mm-sig, linux-block, linux-kernel, linux-media, linux-mm,
	linux-pci, Logan Gunthorpe, Marek Szyprowski, Robin Murphy,
	Sumit Semwal, Will Deacon

From: Vivek Kasireddy <vivek.kasireddy@intel.com>

There is no need to share the main device pointer (struct vfio_device *)
with all the feature functions as they only need the core device
pointer. Therefore, extract the core device pointer once in the
caller (vfio_pci_core_ioctl_feature) and share it instead.

Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 30 +++++++++++++-----------------
 1 file changed, 13 insertions(+), 17 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 1e675daab5753..5512d13bb8899 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -301,11 +301,9 @@ static int vfio_pci_runtime_pm_entry(struct vfio_pci_core_device *vdev,
 	return 0;
 }
 
-static int vfio_pci_core_pm_entry(struct vfio_device *device, u32 flags,
+static int vfio_pci_core_pm_entry(struct vfio_pci_core_device *vdev, u32 flags,
 				  void __user *arg, size_t argsz)
 {
-	struct vfio_pci_core_device *vdev =
-		container_of(device, struct vfio_pci_core_device, vdev);
 	int ret;
 
 	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_SET, 0);
@@ -322,12 +320,10 @@ static int vfio_pci_core_pm_entry(struct vfio_device *device, u32 flags,
 }
 
 static int vfio_pci_core_pm_entry_with_wakeup(
-	struct vfio_device *device, u32 flags,
+	struct vfio_pci_core_device *vdev, u32 flags,
 	struct vfio_device_low_power_entry_with_wakeup __user *arg,
 	size_t argsz)
 {
-	struct vfio_pci_core_device *vdev =
-		container_of(device, struct vfio_pci_core_device, vdev);
 	struct vfio_device_low_power_entry_with_wakeup entry;
 	struct eventfd_ctx *efdctx;
 	int ret;
@@ -378,11 +374,9 @@ static void vfio_pci_runtime_pm_exit(struct vfio_pci_core_device *vdev)
 	up_write(&vdev->memory_lock);
 }
 
-static int vfio_pci_core_pm_exit(struct vfio_device *device, u32 flags,
+static int vfio_pci_core_pm_exit(struct vfio_pci_core_device *vdev, u32 flags,
 				 void __user *arg, size_t argsz)
 {
-	struct vfio_pci_core_device *vdev =
-		container_of(device, struct vfio_pci_core_device, vdev);
 	int ret;
 
 	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_SET, 0);
@@ -1475,11 +1469,10 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
 }
 EXPORT_SYMBOL_GPL(vfio_pci_core_ioctl);
 
-static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags,
-				       uuid_t __user *arg, size_t argsz)
+static int vfio_pci_core_feature_token(struct vfio_pci_core_device *vdev,
+				       u32 flags, uuid_t __user *arg,
+				       size_t argsz)
 {
-	struct vfio_pci_core_device *vdev =
-		container_of(device, struct vfio_pci_core_device, vdev);
 	uuid_t uuid;
 	int ret;
 
@@ -1506,16 +1499,19 @@ static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags,
 int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
 				void __user *arg, size_t argsz)
 {
+	struct vfio_pci_core_device *vdev =
+		container_of(device, struct vfio_pci_core_device, vdev);
+
 	switch (flags & VFIO_DEVICE_FEATURE_MASK) {
 	case VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY:
-		return vfio_pci_core_pm_entry(device, flags, arg, argsz);
+		return vfio_pci_core_pm_entry(vdev, flags, arg, argsz);
 	case VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP:
-		return vfio_pci_core_pm_entry_with_wakeup(device, flags,
+		return vfio_pci_core_pm_entry_with_wakeup(vdev, flags,
 							  arg, argsz);
 	case VFIO_DEVICE_FEATURE_LOW_POWER_EXIT:
-		return vfio_pci_core_pm_exit(device, flags, arg, argsz);
+		return vfio_pci_core_pm_exit(vdev, flags, arg, argsz);
 	case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN:
-		return vfio_pci_core_feature_token(device, flags, arg, argsz);
+		return vfio_pci_core_feature_token(vdev, flags, arg, argsz);
 	default:
 		return -ENOTTY;
 	}
-- 
2.50.1



* [PATCH 10/10] vfio/pci: Add dma-buf export support for MMIO regions
  2025-07-23 13:00 [PATCH 00/10] vfio/pci: Allow MMIO regions to be exported through dma-buf Leon Romanovsky
                   ` (8 preceding siblings ...)
  2025-07-23 13:00 ` [PATCH 09/10] vfio/pci: Share the core device pointer while invoking feature functions Leon Romanovsky
@ 2025-07-23 13:00 ` Leon Romanovsky
  2025-07-24  5:13   ` Kasireddy, Vivek
  2025-07-29 19:44   ` Robin Murphy
  2025-07-30 19:58 ` [PATCH 00/10] vfio/pci: Allow MMIO regions to be exported through dma-buf Alex Williamson
  10 siblings, 2 replies; 54+ messages in thread
From: Leon Romanovsky @ 2025-07-23 13:00 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Leon Romanovsky, Christoph Hellwig, Jason Gunthorpe,
	Andrew Morton, Bjorn Helgaas, Christian König, dri-devel,
	iommu, Jens Axboe, Jérôme Glisse, Joerg Roedel, kvm,
	linaro-mm-sig, linux-block, linux-kernel, linux-media, linux-mm,
	linux-pci, Logan Gunthorpe, Marek Szyprowski, Robin Murphy,
	Sumit Semwal, Vivek Kasireddy, Will Deacon

From: Leon Romanovsky <leonro@nvidia.com>

Add support for exporting PCI device MMIO regions through dma-buf,
enabling safe sharing of non-struct page memory with controlled
lifetime management. This allows RDMA and other subsystems to import
dma-buf FDs and build them into memory regions for PCI P2P operations.

The implementation provides a revocable attachment mechanism using
dma-buf move operations. MMIO regions are normally pinned as BARs
don't change physical addresses, but access is revoked when the VFIO
device is closed or a PCI reset is issued. This ensures kernel
self-defense against potentially hostile userspace.
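
At its heart the revoke is a dma-buf move notification; a simplified
sketch of what vfio_pci_dma_buf_move() does for each buffer on the
device's dmabufs list (importers must support move_notify):

	dma_resv_lock(priv->dmabuf->resv, NULL);
	priv->revoked = revoked;
	dma_buf_move_notify(priv->dmabuf);
	dma_resv_unlock(priv->dmabuf->resv);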

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/vfio/pci/Kconfig           |  20 ++
 drivers/vfio/pci/Makefile          |   2 +
 drivers/vfio/pci/vfio_pci_config.c |  22 +-
 drivers/vfio/pci/vfio_pci_core.c   |  25 ++-
 drivers/vfio/pci/vfio_pci_dmabuf.c | 321 +++++++++++++++++++++++++++++
 drivers/vfio/pci/vfio_pci_priv.h   |  23 +++
 include/linux/dma-buf.h            |   1 +
 include/linux/vfio_pci_core.h      |   3 +
 include/uapi/linux/vfio.h          |  19 ++
 9 files changed, 431 insertions(+), 5 deletions(-)
 create mode 100644 drivers/vfio/pci/vfio_pci_dmabuf.c

diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index 2b0172f546652..55ae888bf26ae 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -55,6 +55,26 @@ config VFIO_PCI_ZDEV_KVM
 
 	  To enable s390x KVM vfio-pci extensions, say Y.
 
+config VFIO_PCI_DMABUF
+	bool "VFIO PCI extensions for DMA-BUF"
+	depends on VFIO_PCI_CORE
+	depends on PCI_P2PDMA && DMA_SHARED_BUFFER
+	default y
+	help
+	  Enable support for VFIO PCI extensions that allow exporting
+	  device MMIO regions as DMA-BUFs for peer devices to access via
+	  peer-to-peer (P2P) DMA.
+
+	  This feature enables a VFIO-managed PCI device to export a portion
+	  of its MMIO BAR as a DMA-BUF file descriptor, which can be passed
+	  to other userspace drivers or kernel subsystems capable of
+	  initiating DMA to that region.
+
+	  Say Y here if you want to enable VFIO DMABUF-based MMIO export
+	  support for peer-to-peer DMA use cases.
+
+	  If unsure, say N.
+
 source "drivers/vfio/pci/mlx5/Kconfig"
 
 source "drivers/vfio/pci/hisilicon/Kconfig"
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index cf00c0a7e55c8..f9155e9c5f630 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -2,7 +2,9 @@
 
 vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
 vfio-pci-core-$(CONFIG_VFIO_PCI_ZDEV_KVM) += vfio_pci_zdev.o
+vfio-pci-core-$(CONFIG_VFIO_PCI_DMABUF) += vfio_pci_dmabuf.o
+
 obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o
 
 vfio-pci-y := vfio_pci.o
 vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index 8f02f236b5b4b..7e23387a43b4d 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -589,10 +589,12 @@ static int vfio_basic_config_write(struct vfio_pci_core_device *vdev, int pos,
 		virt_mem = !!(le16_to_cpu(*virt_cmd) & PCI_COMMAND_MEMORY);
 		new_mem = !!(new_cmd & PCI_COMMAND_MEMORY);
 
-		if (!new_mem)
+		if (!new_mem) {
 			vfio_pci_zap_and_down_write_memory_lock(vdev);
-		else
+			vfio_pci_dma_buf_move(vdev, true);
+		} else {
 			down_write(&vdev->memory_lock);
+		}
 
 		/*
 		 * If the user is writing mem/io enable (new_mem/io) and we
@@ -627,6 +629,8 @@ static int vfio_basic_config_write(struct vfio_pci_core_device *vdev, int pos,
 		*virt_cmd &= cpu_to_le16(~mask);
 		*virt_cmd |= cpu_to_le16(new_cmd & mask);
 
+		if (__vfio_pci_memory_enabled(vdev))
+			vfio_pci_dma_buf_move(vdev, false);
 		up_write(&vdev->memory_lock);
 	}
 
@@ -707,12 +711,16 @@ static int __init init_pci_cap_basic_perm(struct perm_bits *perm)
 static void vfio_lock_and_set_power_state(struct vfio_pci_core_device *vdev,
 					  pci_power_t state)
 {
-	if (state >= PCI_D3hot)
+	if (state >= PCI_D3hot) {
 		vfio_pci_zap_and_down_write_memory_lock(vdev);
-	else
+		vfio_pci_dma_buf_move(vdev, true);
+	} else {
 		down_write(&vdev->memory_lock);
+	}
 
 	vfio_pci_set_power_state(vdev, state);
+	if (__vfio_pci_memory_enabled(vdev))
+		vfio_pci_dma_buf_move(vdev, false);
 	up_write(&vdev->memory_lock);
 }
 
@@ -900,7 +908,10 @@ static int vfio_exp_config_write(struct vfio_pci_core_device *vdev, int pos,
 
 		if (!ret && (cap & PCI_EXP_DEVCAP_FLR)) {
 			vfio_pci_zap_and_down_write_memory_lock(vdev);
+			vfio_pci_dma_buf_move(vdev, true);
 			pci_try_reset_function(vdev->pdev);
+			if (__vfio_pci_memory_enabled(vdev))
+				vfio_pci_dma_buf_move(vdev, false);
 			up_write(&vdev->memory_lock);
 		}
 	}
@@ -982,7 +993,10 @@ static int vfio_af_config_write(struct vfio_pci_core_device *vdev, int pos,
 
 		if (!ret && (cap & PCI_AF_CAP_FLR) && (cap & PCI_AF_CAP_TP)) {
 			vfio_pci_zap_and_down_write_memory_lock(vdev);
+			vfio_pci_dma_buf_move(vdev, true);
 			pci_try_reset_function(vdev->pdev);
+			if (__vfio_pci_memory_enabled(vdev))
+				vfio_pci_dma_buf_move(vdev, false);
 			up_write(&vdev->memory_lock);
 		}
 	}
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 5512d13bb8899..e5ab5d1cafd9c 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -29,7 +29,9 @@
 #include <linux/nospec.h>
 #include <linux/sched/mm.h>
 #include <linux/iommufd.h>
+#ifdef CONFIG_VFIO_PCI_DMABUF
 #include <linux/pci-p2pdma.h>
+#endif
 #if IS_ENABLED(CONFIG_EEH)
 #include <asm/eeh.h>
 #endif
@@ -288,6 +290,8 @@ static int vfio_pci_runtime_pm_entry(struct vfio_pci_core_device *vdev,
 	 * semaphore.
 	 */
 	vfio_pci_zap_and_down_write_memory_lock(vdev);
+	vfio_pci_dma_buf_move(vdev, true);
+
 	if (vdev->pm_runtime_engaged) {
 		up_write(&vdev->memory_lock);
 		return -EINVAL;
@@ -371,6 +375,8 @@ static void vfio_pci_runtime_pm_exit(struct vfio_pci_core_device *vdev)
 	 */
 	down_write(&vdev->memory_lock);
 	__vfio_pci_runtime_pm_exit(vdev);
+	if (__vfio_pci_memory_enabled(vdev))
+		vfio_pci_dma_buf_move(vdev, false);
 	up_write(&vdev->memory_lock);
 }
 
@@ -691,6 +697,8 @@ void vfio_pci_core_close_device(struct vfio_device *core_vdev)
 #endif
 	vfio_pci_core_disable(vdev);
 
+	vfio_pci_dma_buf_cleanup(vdev);
+
 	mutex_lock(&vdev->igate);
 	if (vdev->err_trigger) {
 		eventfd_ctx_put(vdev->err_trigger);
@@ -1223,7 +1231,10 @@ static int vfio_pci_ioctl_reset(struct vfio_pci_core_device *vdev,
 	 */
 	vfio_pci_set_power_state(vdev, PCI_D0);
 
+	vfio_pci_dma_buf_move(vdev, true);
 	ret = pci_try_reset_function(vdev->pdev);
+	if (__vfio_pci_memory_enabled(vdev))
+		vfio_pci_dma_buf_move(vdev, false);
 	up_write(&vdev->memory_lock);
 
 	return ret;
@@ -1512,6 +1523,8 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
 		return vfio_pci_core_pm_exit(vdev, flags, arg, argsz);
 	case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN:
 		return vfio_pci_core_feature_token(vdev, flags, arg, argsz);
+	case VFIO_DEVICE_FEATURE_DMA_BUF:
+		return vfio_pci_core_feature_dma_buf(vdev, flags, arg, argsz);
 	default:
 		return -ENOTTY;
 	}
@@ -2088,9 +2101,13 @@ int vfio_pci_core_init_dev(struct vfio_device *core_vdev)
 	INIT_LIST_HEAD(&vdev->dummy_resources_list);
 	INIT_LIST_HEAD(&vdev->ioeventfds_list);
 	INIT_LIST_HEAD(&vdev->sriov_pfs_item);
+#ifdef CONFIG_VFIO_PCI_DMABUF
 	vdev->provider = pci_p2pdma_enable(vdev->pdev);
 	if (IS_ERR(vdev->provider))
 		return PTR_ERR(vdev->provider);
+
+	INIT_LIST_HEAD(&vdev->dmabufs);
+#endif
 	init_rwsem(&vdev->memory_lock);
 	xa_init(&vdev->ctx);
 
@@ -2473,11 +2490,17 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
 	 * cause the PCI config space reset without restoring the original
 	 * state (saved locally in 'vdev->pm_save').
 	 */
-	list_for_each_entry(vdev, &dev_set->device_list, vdev.dev_set_list)
+	list_for_each_entry(vdev, &dev_set->device_list, vdev.dev_set_list) {
+		vfio_pci_dma_buf_move(vdev, true);
 		vfio_pci_set_power_state(vdev, PCI_D0);
+	}
 
 	ret = pci_reset_bus(pdev);
 
+	list_for_each_entry(vdev, &dev_set->device_list, vdev.dev_set_list)
+		if (__vfio_pci_memory_enabled(vdev))
+			vfio_pci_dma_buf_move(vdev, false);
+
 	vdev = list_last_entry(&dev_set->device_list,
 			       struct vfio_pci_core_device, vdev.dev_set_list);
 
diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c
new file mode 100644
index 0000000000000..5fefcdecd1329
--- /dev/null
+++ b/drivers/vfio/pci/vfio_pci_dmabuf.c
@@ -0,0 +1,321 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES.
+ */
+#include <linux/dma-buf.h>
+#include <linux/pci-p2pdma.h>
+#include <linux/dma-resv.h>
+
+#include "vfio_pci_priv.h"
+
+MODULE_IMPORT_NS("DMA_BUF");
+
+struct vfio_pci_dma_buf {
+	struct dma_buf *dmabuf;
+	struct vfio_pci_core_device *vdev;
+	struct list_head dmabufs_elm;
+	struct phys_vec phys_vec;
+	u8 revoked : 1;
+};
+
+static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf,
+				   struct dma_buf_attachment *attachment)
+{
+	struct vfio_pci_dma_buf *priv = dmabuf->priv;
+
+	if (!attachment->peer2peer)
+		return -EOPNOTSUPP;
+
+	if (priv->revoked)
+		return -ENODEV;
+
+	switch (pci_p2pdma_map_type(priv->vdev->provider, attachment->dev)) {
+	case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
+		break;
+	case PCI_P2PDMA_MAP_BUS_ADDR:
+		/*
+		 * There is no need for an IOVA at all in this flow.
+		 * We rely on attachment->priv == NULL as a marker
+		 * for this mode.
+		 */
+		return 0;
+	default:
+		return -EINVAL;
+	}
+
+	attachment->priv = kzalloc(sizeof(struct dma_iova_state), GFP_KERNEL);
+	if (!attachment->priv)
+		return -ENOMEM;
+
+	dma_iova_try_alloc(attachment->dev, attachment->priv, 0, priv->phys_vec.len);
+	return 0;
+}
+
+static void vfio_pci_dma_buf_detach(struct dma_buf *dmabuf,
+				    struct dma_buf_attachment *attachment)
+{
+	kfree(attachment->priv);
+}
+
+static void fill_sg_entry(struct scatterlist *sgl, unsigned int length,
+			 dma_addr_t addr)
+{
+	sg_set_page(sgl, NULL, length, 0);
+	sg_dma_address(sgl) = addr;
+	sg_dma_len(sgl) = length;
+}
+
+static struct sg_table *
+vfio_pci_dma_buf_map(struct dma_buf_attachment *attachment,
+		     enum dma_data_direction dir)
+{
+	struct vfio_pci_dma_buf *priv = attachment->dmabuf->priv;
+	struct p2pdma_provider *provider = priv->vdev->provider;
+	struct dma_iova_state *state = attachment->priv;
+	struct phys_vec *phys_vec = &priv->phys_vec;
+	struct scatterlist *sgl;
+	struct sg_table *sgt;
+	dma_addr_t addr;
+	int ret;
+
+	dma_resv_assert_held(priv->dmabuf->resv);
+
+	sgt = kzalloc(sizeof(*sgt), GFP_KERNEL);
+	if (!sgt)
+		return ERR_PTR(-ENOMEM);
+
+	ret = sg_alloc_table(sgt, 1, GFP_KERNEL | __GFP_ZERO);
+	if (ret)
+		goto err_kfree_sgt;
+
+	sgl = sgt->sgl;
+
+	if (!state) {
+		addr = pci_p2pdma_bus_addr_map(provider, phys_vec->paddr);
+	} else if (dma_use_iova(state)) {
+		ret = dma_iova_link(attachment->dev, state, phys_vec->paddr, 0,
+				    phys_vec->len, dir, DMA_ATTR_SKIP_CPU_SYNC);
+		if (ret)
+			goto err_free_table;
+
+		ret = dma_iova_sync(attachment->dev, state, 0, phys_vec->len);
+		if (ret)
+			goto err_unmap_dma;
+
+		addr = state->addr;
+	} else {
+		addr = dma_map_phys(attachment->dev, phys_vec->paddr,
+				    phys_vec->len, dir, DMA_ATTR_SKIP_CPU_SYNC);
+		ret = dma_mapping_error(attachment->dev, addr);
+		if (ret)
+			goto err_free_table;
+	}
+
+	fill_sg_entry(sgl, phys_vec->len, addr);
+	return sgt;
+
+err_unmap_dma:
+	dma_iova_destroy(attachment->dev, state, phys_vec->len, dir,
+			 DMA_ATTR_SKIP_CPU_SYNC);
+err_free_table:
+	sg_free_table(sgt);
+err_kfree_sgt:
+	kfree(sgt);
+	return ERR_PTR(ret);
+}
+
+static void vfio_pci_dma_buf_unmap(struct dma_buf_attachment *attachment,
+				   struct sg_table *sgt,
+				   enum dma_data_direction dir)
+{
+	struct vfio_pci_dma_buf *priv = attachment->dmabuf->priv;
+	struct dma_iova_state *state = attachment->priv;
+	struct scatterlist *sgl;
+	int i;
+
+	if (!state)
+		; /* Do nothing */
+	else if (dma_use_iova(state))
+		dma_iova_destroy(attachment->dev, state, priv->phys_vec.len,
+				 dir, DMA_ATTR_SKIP_CPU_SYNC);
+	else
+		for_each_sgtable_dma_sg(sgt, sgl, i)
+			dma_unmap_phys(attachment->dev, sg_dma_address(sgl),
+				       sg_dma_len(sgl), dir,
+				       DMA_ATTR_SKIP_CPU_SYNC);
+
+	sg_free_table(sgt);
+	kfree(sgt);
+}
+
+static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf)
+{
+	struct vfio_pci_dma_buf *priv = dmabuf->priv;
+
+	/*
+	 * Either this or vfio_pci_dma_buf_cleanup() will remove it from the list.
+	 * The refcount prevents both.
+	 */
+	if (priv->vdev) {
+		down_write(&priv->vdev->memory_lock);
+		list_del_init(&priv->dmabufs_elm);
+		up_write(&priv->vdev->memory_lock);
+		vfio_device_put_registration(&priv->vdev->vdev);
+	}
+	kfree(priv);
+}
+
+static const struct dma_buf_ops vfio_pci_dmabuf_ops = {
+	.attach = vfio_pci_dma_buf_attach,
+	.detach = vfio_pci_dma_buf_detach,
+	.map_dma_buf = vfio_pci_dma_buf_map,
+	.release = vfio_pci_dma_buf_release,
+	.unmap_dma_buf = vfio_pci_dma_buf_unmap,
+};
+
+static void dma_ranges_to_p2p_phys(struct vfio_pci_dma_buf *priv,
+				   struct vfio_device_feature_dma_buf *dma_buf)
+{
+	struct pci_dev *pdev = priv->vdev->pdev;
+
+	priv->phys_vec.len = dma_buf->length;
+	priv->phys_vec.paddr = pci_resource_start(pdev, dma_buf->region_index);
+	priv->phys_vec.paddr += dma_buf->offset;
+}
+
+static int validate_dmabuf_input(struct vfio_pci_core_device *vdev,
+				 struct vfio_device_feature_dma_buf *dma_buf)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	u32 bar = dma_buf->region_index;
+	u64 offset = dma_buf->offset;
+	u64 len = dma_buf->length;
+	resource_size_t bar_size;
+	u64 sum;
+
+	/*
+	 * For PCI the region_index is the BAR number, like everything else.
+	 */
+	if (bar >= VFIO_PCI_ROM_REGION_INDEX)
+		return -ENODEV;
+
+	if (!(pci_resource_flags(pdev, bar) & IORESOURCE_MEM))
+		return -EINVAL;
+
+	if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
+		return -EINVAL;
+
+	bar_size = pci_resource_len(pdev, bar);
+	if (check_add_overflow(offset, len, &sum) || sum > bar_size)
+		return -EINVAL;
+
+	return 0;
+}
+
+int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
+				  struct vfio_device_feature_dma_buf __user *arg,
+				  size_t argsz)
+{
+	struct vfio_device_feature_dma_buf get_dma_buf = {};
+	DEFINE_DMA_BUF_EXPORT_INFO(exp_info);
+	struct vfio_pci_dma_buf *priv;
+	int ret;
+
+	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_GET,
+				 sizeof(get_dma_buf));
+	if (ret != 1)
+		return ret;
+
+	if (copy_from_user(&get_dma_buf, arg, sizeof(get_dma_buf)))
+		return -EFAULT;
+
+	ret = validate_dmabuf_input(vdev, &get_dma_buf);
+	if (ret)
+		return ret;
+
+	priv = kzalloc(sizeof(*priv), GFP_KERNEL);
+	if (!priv)
+		return -ENOMEM;
+
+	priv->vdev = vdev;
+	dma_ranges_to_p2p_phys(priv, &get_dma_buf);
+
+	if (!vfio_device_try_get_registration(&vdev->vdev)) {
+		ret = -ENODEV;
+		goto err_free_priv;
+	}
+
+	exp_info.ops = &vfio_pci_dmabuf_ops;
+	exp_info.size = priv->phys_vec.len;
+	exp_info.flags = get_dma_buf.open_flags;
+	exp_info.priv = priv;
+
+	priv->dmabuf = dma_buf_export(&exp_info);
+	if (IS_ERR(priv->dmabuf)) {
+		ret = PTR_ERR(priv->dmabuf);
+		goto err_dev_put;
+	}
+
+	/* dma_buf_put() now frees priv */
+	INIT_LIST_HEAD(&priv->dmabufs_elm);
+	down_write(&vdev->memory_lock);
+	dma_resv_lock(priv->dmabuf->resv, NULL);
+	priv->revoked = !__vfio_pci_memory_enabled(vdev);
+	list_add_tail(&priv->dmabufs_elm, &vdev->dmabufs);
+	dma_resv_unlock(priv->dmabuf->resv);
+	up_write(&vdev->memory_lock);
+
+	/*
+	 * dma_buf_fd() consumes the reference, when the file closes the dmabuf
+	 * will be released.
+	 */
+	return dma_buf_fd(priv->dmabuf, get_dma_buf.open_flags);
+
+err_dev_put:
+	vfio_device_put_registration(&vdev->vdev);
+err_free_priv:
+	kfree(priv);
+	return ret;
+}
+
+void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
+{
+	struct vfio_pci_dma_buf *priv;
+	struct vfio_pci_dma_buf *tmp;
+
+	lockdep_assert_held_write(&vdev->memory_lock);
+
+	list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
+		if (!get_file_active(&priv->dmabuf->file))
+			continue;
+
+		if (priv->revoked != revoked) {
+			dma_resv_lock(priv->dmabuf->resv, NULL);
+			priv->revoked = revoked;
+			dma_buf_move_notify(priv->dmabuf);
+			dma_resv_unlock(priv->dmabuf->resv);
+		}
+		dma_buf_put(priv->dmabuf);
+	}
+}
+
+void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev)
+{
+	struct vfio_pci_dma_buf *priv;
+	struct vfio_pci_dma_buf *tmp;
+
+	down_write(&vdev->memory_lock);
+	list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
+		if (!get_file_active(&priv->dmabuf->file))
+			continue;
+
+		dma_resv_lock(priv->dmabuf->resv, NULL);
+		list_del_init(&priv->dmabufs_elm);
+		priv->vdev = NULL;
+		priv->revoked = true;
+		dma_buf_move_notify(priv->dmabuf);
+		dma_resv_unlock(priv->dmabuf->resv);
+		vfio_device_put_registration(&vdev->vdev);
+		dma_buf_put(priv->dmabuf);
+	}
+	up_write(&vdev->memory_lock);
+}
diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
index a9972eacb2936..28a405f8b97c9 100644
--- a/drivers/vfio/pci/vfio_pci_priv.h
+++ b/drivers/vfio/pci/vfio_pci_priv.h
@@ -107,4 +107,27 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
 	return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
 }
 
+#ifdef CONFIG_VFIO_PCI_DMABUF
+int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
+				  struct vfio_device_feature_dma_buf __user *arg,
+				  size_t argsz);
+void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev);
+void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked);
+#else
+static inline int
+vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
+			      struct vfio_device_feature_dma_buf __user *arg,
+			      size_t argsz)
+{
+	return -ENOTTY;
+}
+static inline void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev)
+{
+}
+static inline void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev,
+					 bool revoked)
+{
+}
+#endif
+
 #endif
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index d58e329ac0e71..f14b413aae48d 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -483,6 +483,7 @@ struct dma_buf_attach_ops {
  * @dev: device attached to the buffer.
  * @node: list of dma_buf_attachment, protected by dma_resv lock of the dmabuf.
  * @peer2peer: true if the importer can handle peer resources without pages.
+ * @state: DMA IOVA state to support the physical address DMA interface
  * @priv: exporter specific attachment data.
  * @importer_ops: importer operations for this attachment, if provided
  * dma_buf_map/unmap_attachment() must be called with the dma_resv lock held.
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index b017fae251811..548cbb51bf146 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -94,7 +94,10 @@ struct vfio_pci_core_device {
 	struct vfio_pci_core_device	*sriov_pf_core_dev;
 	struct notifier_block	nb;
 	struct rw_semaphore	memory_lock;
+#ifdef CONFIG_VFIO_PCI_DMABUF
 	struct p2pdma_provider  *provider;
+	struct list_head	dmabufs;
+#endif
 };
 
 /* Will be exported for vfio pci drivers usage */
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 5764f315137f9..ad8e303697f97 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -1468,6 +1468,25 @@ struct vfio_device_feature_bus_master {
 };
 #define VFIO_DEVICE_FEATURE_BUS_MASTER 10
 
+/**
+ * Upon VFIO_DEVICE_FEATURE_GET create a dma_buf fd for the
+ * regions selected.
+ *
+ * open_flags are the typical flags passed to open(2), eg O_RDWR, O_CLOEXEC,
+ * etc. offset/length specify a slice of the region to create the dmabuf from.
+ * nr_ranges is the total number of (P2P DMA) ranges that comprise the dmabuf.
+ *
+ * Return: The fd number on success, -1 with errno set on failure.
+ */
+#define VFIO_DEVICE_FEATURE_DMA_BUF 11
+
+struct vfio_device_feature_dma_buf {
+	__u32	region_index;
+	__u32	open_flags;
+	__u64	offset;
+	__u64	length;
+};
+
 /* -------- API for Type1 VFIO IOMMU -------- */
 
 /**
-- 
2.50.1
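
[ Editor's illustration, not part of the patch: a minimal userspace sketch
  of asking VFIO for a dmabuf fd over a slice of a BAR. device_fd is assumed
  to be an already-open VFIO device fd, and the BAR index and length below
  are placeholder values; only the uapi added above is taken from the series. ]

	#include <fcntl.h>
	#include <stdlib.h>
	#include <sys/ioctl.h>
	#include <linux/vfio.h>

	static int vfio_dma_buf_get(int device_fd)
	{
		struct vfio_device_feature *feat;
		struct vfio_device_feature_dma_buf *get;
		int fd;

		/* feature header followed by the dma_buf payload */
		feat = calloc(1, sizeof(*feat) + sizeof(*get));
		if (!feat)
			return -1;

		feat->argsz = sizeof(*feat) + sizeof(*get);
		feat->flags = VFIO_DEVICE_FEATURE_GET |
			      VFIO_DEVICE_FEATURE_DMA_BUF;

		get = (struct vfio_device_feature_dma_buf *)feat->data;
		get->region_index = VFIO_PCI_BAR2_REGION_INDEX;	/* placeholder */
		get->open_flags = O_RDWR | O_CLOEXEC;
		get->offset = 0;	/* must be page aligned */
		get->length = 0x10000;	/* placeholder, page aligned */

		/* returns the dmabuf fd on success, -1 with errno set */
		fd = ioctl(device_fd, VFIO_DEVICE_FEATURE, feat);
		free(feat);
		return fd;
	}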


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* RE: [PATCH 10/10] vfio/pci: Add dma-buf export support for MMIO regions
  2025-07-23 13:00 ` [PATCH 10/10] vfio/pci: Add dma-buf export support for MMIO regions Leon Romanovsky
@ 2025-07-24  5:13   ` Kasireddy, Vivek
  2025-07-24  5:44     ` Leon Romanovsky
  2025-07-29 19:44   ` Robin Murphy
  1 sibling, 1 reply; 54+ messages in thread
From: Kasireddy, Vivek @ 2025-07-24  5:13 UTC (permalink / raw)
  To: Leon Romanovsky, Alex Williamson
  Cc: Leon Romanovsky, Christoph Hellwig, Jason Gunthorpe,
	Andrew Morton, Bjorn Helgaas, Christian König,
	dri-devel@lists.freedesktop.org, iommu@lists.linux.dev,
	Jens Axboe, Jérôme Glisse, Joerg Roedel,
	kvm@vger.kernel.org, linaro-mm-sig@lists.linaro.org,
	linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-media@vger.kernel.org, linux-mm@kvack.org,
	linux-pci@vger.kernel.org, Logan Gunthorpe, Marek Szyprowski,
	Robin Murphy, Sumit Semwal, Will Deacon

Hi Leon,

> Subject: [PATCH 10/10] vfio/pci: Add dma-buf export support for MMIO
> regions
> 
> From: Leon Romanovsky <leonro@nvidia.com>
> 
> Add support for exporting PCI device MMIO regions through dma-buf,
> enabling safe sharing of non-struct page memory with controlled
> lifetime management. This allows RDMA and other subsystems to import
> dma-buf FDs and build them into memory regions for PCI P2P operations.
> 
> The implementation provides a revocable attachment mechanism using
> dma-buf move operations. MMIO regions are normally pinned as BARs
> don't change physical addresses, but access is revoked when the VFIO
> device is closed or a PCI reset is issued. This ensures kernel
> self-defense against potentially hostile userspace.
> 
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
<...>
> +/**
> + * Upon VFIO_DEVICE_FEATURE_GET create a dma_buf fd for the
> + * regions selected.
> + *
> + * open_flags are the typical flags passed to open(2), eg O_RDWR, O_CLOEXEC,
> + * etc. offset/length specify a slice of the region to create the dmabuf from.
> + * nr_ranges is the total number of (P2P DMA) ranges that comprise the dmabuf.
Any particular reason why you dropped the option (nr_ranges) of creating a
single dmabuf from multiple ranges of an MMIO region?

Restricting the dmabuf to a single range (or having to create multiple dmabufs
to represent multiple regions/ranges associated with a single scattered buffer)
would be very limiting and may not work in all cases. For instance, in my use-case,
I am trying to share a large (4k mode) framebuffer (FB) located in GPU's VRAM
between two (p2p compatible) GPU devices. And, this would probably not work
given that allocating a large contiguous FB (nr_ranges = 1) in VRAM may not be
feasible when there is memory pressure.

Furthermore, since you are adding a new UAPI with this patch/feature, as you know,
we cannot go back and tweak it (to add support for nr_ranges > 1) should there
be a need in the future, but you can always use nr_ranges = 1 anytime. Therefore,
I think it makes sense to be flexible in terms of the number of ranges to include
while creating a dmabuf instead of restricting ourselves to one range.
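
To make that concrete, a hypothetical layout (purely illustrative, not a
concrete uapi proposal) could look like:

	struct vfio_region_dma_range {
		__u64	offset;
		__u64	length;
	};

	struct vfio_device_feature_dma_buf {
		__u32	region_index;
		__u32	open_flags;
		__u32	nr_ranges;
		__u32	__reserved;
		struct vfio_region_dma_range dma_ranges[];
	};

This way nr_ranges = 1 stays the trivial case while scattered buffers can
still be described by a single dmabuf.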

Thanks,
Vivek

> + *
> + * Return: The fd number on success, -1 and errno is set on failure.
> + */
> +#define VFIO_DEVICE_FEATURE_DMA_BUF 11
> +
> +struct vfio_device_feature_dma_buf {
> +	__u32	region_index;
> +	__u32	open_flags;
> +	__u64	offset;
> +	__u64	length;
> +};
> +
>  /* -------- API for Type1 VFIO IOMMU -------- */
> 
>  /**
> --
> 2.50.1


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 10/10] vfio/pci: Add dma-buf export support for MMIO regions
  2025-07-24  5:13   ` Kasireddy, Vivek
@ 2025-07-24  5:44     ` Leon Romanovsky
  2025-07-25  5:34       ` Kasireddy, Vivek
  0 siblings, 1 reply; 54+ messages in thread
From: Leon Romanovsky @ 2025-07-24  5:44 UTC (permalink / raw)
  To: Kasireddy, Vivek
  Cc: Alex Williamson, Christoph Hellwig, Jason Gunthorpe,
	Andrew Morton, Bjorn Helgaas, Christian König,
	dri-devel@lists.freedesktop.org, iommu@lists.linux.dev,
	Jens Axboe, Jérôme Glisse, Joerg Roedel,
	kvm@vger.kernel.org, linaro-mm-sig@lists.linaro.org,
	linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-media@vger.kernel.org, linux-mm@kvack.org,
	linux-pci@vger.kernel.org, Logan Gunthorpe, Marek Szyprowski,
	Robin Murphy, Sumit Semwal, Will Deacon

On Thu, Jul 24, 2025 at 05:13:49AM +0000, Kasireddy, Vivek wrote:
> Hi Leon,
> 
> > Subject: [PATCH 10/10] vfio/pci: Add dma-buf export support for MMIO
> > regions
> > 
> > From: Leon Romanovsky <leonro@nvidia.com>
> > 
> > Add support for exporting PCI device MMIO regions through dma-buf,
> > enabling safe sharing of non-struct page memory with controlled
> > lifetime management. This allows RDMA and other subsystems to import
> > dma-buf FDs and build them into memory regions for PCI P2P operations.
> > 
> > The implementation provides a revocable attachment mechanism using
> > dma-buf move operations. MMIO regions are normally pinned as BARs
> > don't change physical addresses, but access is revoked when the VFIO
> > device is closed or a PCI reset is issued. This ensures kernel
> > self-defense against potentially hostile userspace.
> > 
> > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> > Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
> > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > ---
> >  drivers/vfio/pci/Kconfig           |  20 ++
> >  drivers/vfio/pci/Makefile          |   2 +
> >  drivers/vfio/pci/vfio_pci_config.c |  22 +-
> >  drivers/vfio/pci/vfio_pci_core.c   |  25 ++-
> >  drivers/vfio/pci/vfio_pci_dmabuf.c | 321 +++++++++++++++++++++++++++++
> >  drivers/vfio/pci/vfio_pci_priv.h   |  23 +++
> >  include/linux/dma-buf.h            |   1 +
> >  include/linux/vfio_pci_core.h      |   3 +
> >  include/uapi/linux/vfio.h          |  19 ++
> >  9 files changed, 431 insertions(+), 5 deletions(-)
> >  create mode 100644 drivers/vfio/pci/vfio_pci_dmabuf.c

<...>

> > +static int validate_dmabuf_input(struct vfio_pci_core_device *vdev,
> > +				 struct vfio_device_feature_dma_buf *dma_buf)
> > +{
> > +	struct pci_dev *pdev = vdev->pdev;
> > +	u32 bar = dma_buf->region_index;
> > +	u64 offset = dma_buf->offset;
> > +	u64 len = dma_buf->length;
> > +	resource_size_t bar_size;
> > +	u64 sum;
> > +
> > +	/*
> > +	 * For PCI the region_index is the BAR number like  everything else.
> > +	 */
> > +	if (bar >= VFIO_PCI_ROM_REGION_INDEX)
> > +		return -ENODEV;

<...>

> > +/**
> > + * Upon VFIO_DEVICE_FEATURE_GET create a dma_buf fd for the
> > + * regions selected.
> > + *
> > + * open_flags are the typical flags passed to open(2), eg O_RDWR, O_CLOEXEC,
> > + * etc. offset/length specify a slice of the region to create the dmabuf from.
> > + * nr_ranges is the total number of (P2P DMA) ranges that comprise the dmabuf.
> Any particular reason why you dropped the option (nr_ranges) of creating a
> single dmabuf from multiple ranges of an MMIO region?

I did it for two reasons. First, I wanted to simplify the code in order
to speed up discussion of the patchset itself. Second, I failed to
find justification for the need for multiple ranges, as the number of
BARs is limited by VFIO_PCI_ROM_REGION_INDEX (6) and the same
functionality can be achieved by multiple calls to DMABUF import.
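
For example, an untested userspace sketch against the uapi in this patch,
where 'feat' and 'get' are the vfio_device_feature header plus the
vfio_device_feature_dma_buf payload set up in the usual way, and bar,
ranges[] and fds[] are caller-provided:

	/* one dmabuf per page-aligned range of the same BAR */
	for (i = 0; i < nr_ranges; i++) {
		get->region_index = bar;
		get->offset = ranges[i].offset;
		get->length = ranges[i].length;
		fds[i] = ioctl(device_fd, VFIO_DEVICE_FEATURE, feat);
		if (fds[i] < 0)
			break;
	}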

> 
> Restricting the dmabuf to a single range (or having to create multiple dmabufs
> to represent multiple regions/ranges associated with a single scattered buffer)
> would be very limiting and may not work in all cases. For instance, in my use-case,
> I am trying to share a large (4k mode) framebuffer (FB) located in GPU's VRAM
> between two (p2p compatible) GPU devices. And, this would probably not work
> given that allocating a large contiguous FB (nr_ranges = 1) in VRAM may not be
> feasible when there is memory pressure.

Can you please help me and point to the place in the code where this can fail?
I'm probably missing something basic as there are no large allocations
in the current patchset.

> 
> Furthermore, since you are adding a new UAPI with this patch/feature, as you know,
> we cannot go back and tweak it (to add support for nr_ranges > 1) should there
> be a need in the future, but you can always use nr_ranges = 1 anytime. Therefore,
> I think it makes sense to be flexible in terms of the number of ranges to include
> while creating a dmabuf instead of restricting ourselves to one range.

I'm not a big fan of over-engineering. Let's first understand if this
case is needed.

Thanks


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 01/10] PCI/P2PDMA: Remove redundant bus_offset from map state
  2025-07-23 13:00 ` [PATCH 01/10] PCI/P2PDMA: Remove redundant bus_offset from map state Leon Romanovsky
@ 2025-07-24  7:50   ` Christoph Hellwig
  0 siblings, 0 replies; 54+ messages in thread
From: Christoph Hellwig @ 2025-07-24  7:50 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Alex Williamson, Leon Romanovsky, Christoph Hellwig,
	Jason Gunthorpe, Andrew Morton, Bjorn Helgaas,
	Christian König, dri-devel, iommu, Jens Axboe,
	Jérôme Glisse, Joerg Roedel, kvm, linaro-mm-sig,
	linux-block, linux-kernel, linux-media, linux-mm, linux-pci,
	Logan Gunthorpe, Marek Szyprowski, Robin Murphy, Sumit Semwal,
	Vivek Kasireddy, Will Deacon

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 02/10] PCI/P2PDMA: Introduce p2pdma_provider structure for cleaner abstraction
  2025-07-23 13:00 ` [PATCH 02/10] PCI/P2PDMA: Introduce p2pdma_provider structure for cleaner abstraction Leon Romanovsky
@ 2025-07-24  7:51   ` Christoph Hellwig
  2025-07-24  7:55     ` Leon Romanovsky
  2025-07-29 16:12   ` Jason Gunthorpe
  1 sibling, 1 reply; 54+ messages in thread
From: Christoph Hellwig @ 2025-07-24  7:51 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Alex Williamson, Leon Romanovsky, Christoph Hellwig,
	Jason Gunthorpe, Andrew Morton, Bjorn Helgaas,
	Christian König, dri-devel, iommu, Jens Axboe,
	Jérôme Glisse, Joerg Roedel, kvm, linaro-mm-sig,
	linux-block, linux-kernel, linux-media, linux-mm, linux-pci,
	Logan Gunthorpe, Marek Szyprowski, Robin Murphy, Sumit Semwal,
	Vivek Kasireddy, Will Deacon

On Wed, Jul 23, 2025 at 04:00:03PM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
> 
> Extract the core P2PDMA provider information (device owner and bus
> offset) from the dev_pagemap into a dedicated p2pdma_provider structure.
> This creates a cleaner separation between the memory management layer and
> the P2PDMA functionality.
> 
> The new p2pdma_provider structure contains:
> - owner: pointer to the providing device
> - bus_offset: computed offset for non-host transactions
> 
> This refactoring simplifies the P2PDMA state management by removing
> the need to access pgmap internals directly. The pci_p2pdma_map_state
> now stores a pointer to the provider instead of the pgmap, making
> the API more explicit and easier to understand.

I really don't see how anything becomes cleaner or simpler here.
It adds a new structure that only exists embedded in the existing one
and more code for no apparent benefit.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 03/10] PCI/P2PDMA: Simplify bus address mapping API
  2025-07-23 13:00 ` [PATCH 03/10] PCI/P2PDMA: Simplify bus address mapping API Leon Romanovsky
@ 2025-07-24  7:52   ` Christoph Hellwig
  0 siblings, 0 replies; 54+ messages in thread
From: Christoph Hellwig @ 2025-07-24  7:52 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Alex Williamson, Leon Romanovsky, Christoph Hellwig,
	Jason Gunthorpe, Andrew Morton, Bjorn Helgaas,
	Christian König, dri-devel, iommu, Jens Axboe,
	Jérôme Glisse, Joerg Roedel, kvm, linaro-mm-sig,
	linux-block, linux-kernel, linux-media, linux-mm, linux-pci,
	Logan Gunthorpe, Marek Szyprowski, Robin Murphy, Sumit Semwal,
	Vivek Kasireddy, Will Deacon

On Wed, Jul 23, 2025 at 04:00:04PM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
> 
> Update the pci_p2pdma_bus_addr_map() function to take a direct pointer
> to the p2pdma_provider structure instead of the pci_p2pdma_map_state.
> This simplifies the API by removing the need for callers to extract
> the provider from the state structure.
> 
> The change updates all callers across the kernel (block layer, IOMMU,
> DMA direct, and HMM) to pass the provider pointer directly, making
> the code more explicit and reducing unnecessary indirection. This
> also removes the runtime warning check since callers now have direct
> control over which provider they use.

Again I don't actually see any simplification here.  But maybe I'm
missing the ultimate goal here.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 02/10] PCI/P2PDMA: Introduce p2pdma_provider structure for cleaner abstraction
  2025-07-24  7:51   ` Christoph Hellwig
@ 2025-07-24  7:55     ` Leon Romanovsky
  2025-07-24  7:59       ` Christoph Hellwig
  0 siblings, 1 reply; 54+ messages in thread
From: Leon Romanovsky @ 2025-07-24  7:55 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Alex Williamson, Jason Gunthorpe, Andrew Morton, Bjorn Helgaas,
	Christian König, dri-devel, iommu, Jens Axboe,
	Jérôme Glisse, Joerg Roedel, kvm, linaro-mm-sig,
	linux-block, linux-kernel, linux-media, linux-mm, linux-pci,
	Logan Gunthorpe, Marek Szyprowski, Robin Murphy, Sumit Semwal,
	Vivek Kasireddy, Will Deacon

On Thu, Jul 24, 2025 at 09:51:45AM +0200, Christoph Hellwig wrote:
> On Wed, Jul 23, 2025 at 04:00:03PM +0300, Leon Romanovsky wrote:
> > From: Leon Romanovsky <leonro@nvidia.com>
> > 
> > Extract the core P2PDMA provider information (device owner and bus
> > offset) from the dev_pagemap into a dedicated p2pdma_provider structure.
> > This creates a cleaner separation between the memory management layer and
> > the P2PDMA functionality.
> > 
> > The new p2pdma_provider structure contains:
> > - owner: pointer to the providing device
> > - bus_offset: computed offset for non-host transactions
> > 
> > This refactoring simplifies the P2PDMA state management by removing
> > the need to access pgmap internals directly. The pci_p2pdma_map_state
> > now stores a pointer to the provider instead of the pgmap, making
> > the API more explicit and easier to understand.
> 
> I really don't see how anything becomes cleaner or simpler here.
> It adds a new structure that only exists embedded in the exist one
> and more code for no apparent benefit.

Please see the last patch in the series: https://lore.kernel.org/all/aea452cc27ca9e5169f7279d7b524190c39e7260.1753274085.git.leonro@nvidia.com
It gives me a way to call the p2p code with a stable pointer for the whole BAR.
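
Roughly, the exporter side then becomes (a sketch built from the calls
added in this series):

	/* once at device init: one provider covering the device's BARs */
	vdev->provider = pci_p2pdma_enable(vdev->pdev);

	/*
	 * at map time: any physical address inside a BAR goes through the
	 * same stable provider, no per-page pgmap lookup needed
	 */
	addr = pci_p2pdma_bus_addr_map(vdev->provider, phys_vec->paddr);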

Thanks


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 02/10] PCI/P2PDMA: Introduce p2pdma_provider structure for cleaner abstraction
  2025-07-24  7:55     ` Leon Romanovsky
@ 2025-07-24  7:59       ` Christoph Hellwig
  2025-07-24  8:07         ` Leon Romanovsky
  2025-07-27 18:51         ` Jason Gunthorpe
  0 siblings, 2 replies; 54+ messages in thread
From: Christoph Hellwig @ 2025-07-24  7:59 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Christoph Hellwig, Alex Williamson, Jason Gunthorpe,
	Andrew Morton, Bjorn Helgaas, Christian König, dri-devel,
	iommu, Jens Axboe, Jérôme Glisse, Joerg Roedel, kvm,
	linaro-mm-sig, linux-block, linux-kernel, linux-media, linux-mm,
	linux-pci, Logan Gunthorpe, Marek Szyprowski, Robin Murphy,
	Sumit Semwal, Vivek Kasireddy, Will Deacon

On Thu, Jul 24, 2025 at 10:55:33AM +0300, Leon Romanovsky wrote:
> Please see the last patch in the series: https://lore.kernel.org/all/aea452cc27ca9e5169f7279d7b524190c39e7260.1753274085.git.leonro@nvidia.com
> It gives me a way to call the p2p code with a stable pointer for the whole BAR.
> 

That simply can't work.  So I guess you're trying to do the same stupid
things that were shut down before, again?  I might as well not waste my time
reviewing this.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 05/10] PCI/P2PDMA: Export pci_p2pdma_map_type() function
  2025-07-23 13:00 ` [PATCH 05/10] PCI/P2PDMA: Export pci_p2pdma_map_type() function Leon Romanovsky
@ 2025-07-24  8:03   ` Christoph Hellwig
  2025-07-24  8:13     ` Leon Romanovsky
  2025-07-27 19:02     ` Jason Gunthorpe
  0 siblings, 2 replies; 54+ messages in thread
From: Christoph Hellwig @ 2025-07-24  8:03 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Alex Williamson, Leon Romanovsky, Christoph Hellwig,
	Jason Gunthorpe, Andrew Morton, Bjorn Helgaas,
	Christian König, dri-devel, iommu, Jens Axboe,
	Jérôme Glisse, Joerg Roedel, kvm, linaro-mm-sig,
	linux-block, linux-kernel, linux-media, linux-mm, linux-pci,
	Logan Gunthorpe, Marek Szyprowski, Robin Murphy, Sumit Semwal,
	Vivek Kasireddy, Will Deacon

On Wed, Jul 23, 2025 at 04:00:06PM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
> 
> Export the pci_p2pdma_map_type() function to allow external modules
> and subsystems to determine the appropriate mapping type for P2PDMA
> transfers between a provider and target device.

External modules have no business doing this.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 02/10] PCI/P2PDMA: Introduce p2pdma_provider structure for cleaner abstraction
  2025-07-24  7:59       ` Christoph Hellwig
@ 2025-07-24  8:07         ` Leon Romanovsky
  2025-07-27 18:51         ` Jason Gunthorpe
  1 sibling, 0 replies; 54+ messages in thread
From: Leon Romanovsky @ 2025-07-24  8:07 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Alex Williamson, Jason Gunthorpe, Andrew Morton, Bjorn Helgaas,
	Christian König, dri-devel, iommu, Jens Axboe,
	Jérôme Glisse, Joerg Roedel, kvm, linaro-mm-sig,
	linux-block, linux-kernel, linux-media, linux-mm, linux-pci,
	Logan Gunthorpe, Marek Szyprowski, Robin Murphy, Sumit Semwal,
	Vivek Kasireddy, Will Deacon

On Thu, Jul 24, 2025 at 09:59:22AM +0200, Christoph Hellwig wrote:
> On Thu, Jul 24, 2025 at 10:55:33AM +0300, Leon Romanovsky wrote:
> > Please see the last patch in the series: https://lore.kernel.org/all/aea452cc27ca9e5169f7279d7b524190c39e7260.1753274085.git.leonro@nvidia.com
> > It gives me a way to call the p2p code with a stable pointer for the whole BAR.
> > 
> 
> That simply can't work.  So I guess you're trying to do the same stupid
> things that were shut down before, again?  I might as well not waste my time
> reviewing this.

I'm not aware of anything that is not acceptable in this series.

This series is focused on replacing the dma_map_resource() call from v3
https://lore.kernel.org/all/20250307052248.405803-4-vivek.kasireddy@intel.com/
with a proper API.

	if (!state) {
		addr = pci_p2pdma_bus_addr_map(provider, phys_vec->paddr);
	} else if (dma_use_iova(state)) {
		ret = dma_iova_link(attachment->dev, state, phys_vec->paddr, 0,
				    phys_vec->len, dir, DMA_ATTR_SKIP_CPU_SYNC);
		if (ret)
			goto err_free_table;

		ret = dma_iova_sync(attachment->dev, state, 0, phys_vec->len);
		if (ret)
			goto err_unmap_dma;

		addr = state->addr;
	} else {
		addr = dma_map_phys(attachment->dev, phys_vec->paddr,
				    phys_vec->len, dir, DMA_ATTR_SKIP_CPU_SYNC);
		ret = dma_mapping_error(attachment->dev, addr);
		if (ret)
			goto err_free_table;
	}

Thanks


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 05/10] PCI/P2PDMA: Export pci_p2pdma_map_type() function
  2025-07-24  8:03   ` Christoph Hellwig
@ 2025-07-24  8:13     ` Leon Romanovsky
  2025-07-25 16:30       ` Logan Gunthorpe
  2025-07-29  7:52       ` Christoph Hellwig
  2025-07-27 19:02     ` Jason Gunthorpe
  1 sibling, 2 replies; 54+ messages in thread
From: Leon Romanovsky @ 2025-07-24  8:13 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Alex Williamson, Jason Gunthorpe, Andrew Morton, Bjorn Helgaas,
	Christian König, dri-devel, iommu, Jens Axboe,
	Jérôme Glisse, Joerg Roedel, kvm, linaro-mm-sig,
	linux-block, linux-kernel, linux-media, linux-mm, linux-pci,
	Logan Gunthorpe, Marek Szyprowski, Robin Murphy, Sumit Semwal,
	Vivek Kasireddy, Will Deacon

On Thu, Jul 24, 2025 at 10:03:13AM +0200, Christoph Hellwig wrote:
> On Wed, Jul 23, 2025 at 04:00:06PM +0300, Leon Romanovsky wrote:
> > From: Leon Romanovsky <leonro@nvidia.com>
> > 
> > Export the pci_p2pdma_map_type() function to allow external modules
> > and subsystems to determine the appropriate mapping type for P2PDMA
> > transfers between a provider and target device.
> 
> External modules have no business doing this.

VFIO PCI code is built as module. There is no way to access PCI p2p code
without exporting functions in it.
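
For context, the patch in question is essentially a one-liner making the
existing helper visible to modular code, presumably along the lines of:

	/* In drivers/pci/p2pdma.c - sketch of what patch 05 adds */
	EXPORT_SYMBOL_GPL(pci_p2pdma_map_type);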

Thanks

> 
> 

^ permalink raw reply	[flat|nested] 54+ messages in thread

* RE: [PATCH 10/10] vfio/pci: Add dma-buf export support for MMIO regions
  2025-07-24  5:44     ` Leon Romanovsky
@ 2025-07-25  5:34       ` Kasireddy, Vivek
  2025-07-27  6:16         ` Leon Romanovsky
  0 siblings, 1 reply; 54+ messages in thread
From: Kasireddy, Vivek @ 2025-07-25  5:34 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Alex Williamson, Christoph Hellwig, Jason Gunthorpe,
	Andrew Morton, Bjorn Helgaas, Christian König,
	dri-devel@lists.freedesktop.org, iommu@lists.linux.dev,
	Jens Axboe, Jérôme Glisse, Joerg Roedel,
	kvm@vger.kernel.org, linaro-mm-sig@lists.linaro.org,
	linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-media@vger.kernel.org, linux-mm@kvack.org,
	linux-pci@vger.kernel.org, Logan Gunthorpe, Marek Szyprowski,
	Robin Murphy, Sumit Semwal, Will Deacon

Hi Leon,

> Subject: Re: [PATCH 10/10] vfio/pci: Add dma-buf export support for MMIO
> regions
> 
> > >
> > > From: Leon Romanovsky <leonro@nvidia.com>
> > >
> > > Add support for exporting PCI device MMIO regions through dma-buf,
> > > enabling safe sharing of non-struct page memory with controlled
> > > lifetime management. This allows RDMA and other subsystems to import
> > > dma-buf FDs and build them into memory regions for PCI P2P operations.
> > >
> > > The implementation provides a revocable attachment mechanism using
> > > dma-buf move operations. MMIO regions are normally pinned as BARs
> > > don't change physical addresses, but access is revoked when the VFIO
> > > device is closed or a PCI reset is issued. This ensures kernel
> > > self-defense against potentially hostile userspace.
> > >
> > > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> > > Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
> > > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > > ---
> > >  drivers/vfio/pci/Kconfig           |  20 ++
> > >  drivers/vfio/pci/Makefile          |   2 +
> > >  drivers/vfio/pci/vfio_pci_config.c |  22 +-
> > >  drivers/vfio/pci/vfio_pci_core.c   |  25 ++-
> > >  drivers/vfio/pci/vfio_pci_dmabuf.c | 321 +++++++++++++++++++++++++++++
> > >  drivers/vfio/pci/vfio_pci_priv.h   |  23 +++
> > >  include/linux/dma-buf.h            |   1 +
> > >  include/linux/vfio_pci_core.h      |   3 +
> > >  include/uapi/linux/vfio.h          |  19 ++
> > >  9 files changed, 431 insertions(+), 5 deletions(-)
> > >  create mode 100644 drivers/vfio/pci/vfio_pci_dmabuf.c
> 
> <...>
> 
> > > +static int validate_dmabuf_input(struct vfio_pci_core_device *vdev,
> > > +				 struct vfio_device_feature_dma_buf *dma_buf)
> > > +{
> > > +	struct pci_dev *pdev = vdev->pdev;
> > > +	u32 bar = dma_buf->region_index;
> > > +	u64 offset = dma_buf->offset;
> > > +	u64 len = dma_buf->length;
> > > +	resource_size_t bar_size;
> > > +	u64 sum;
> > > +
> > > +	/*
> > > +	 * For PCI the region_index is the BAR number like  everything else.
> > > +	 */
> > > +	if (bar >= VFIO_PCI_ROM_REGION_INDEX)
> > > +		return -ENODEV;
> 
> <...>
> 
> > > +/**
> > > + * Upon VFIO_DEVICE_FEATURE_GET create a dma_buf fd for the
> > > + * regions selected.
> > > + *
> > > + * open_flags are the typical flags passed to open(2), eg O_RDWR,
> > > + * O_CLOEXEC, etc. offset/length specify a slice of the region to
> > > + * create the dmabuf from. nr_ranges is the total number of (P2P DMA)
> > > + * ranges that comprise the dmabuf.
> > Any particular reason why you dropped the option (nr_ranges) of creating
> > a single dmabuf from multiple ranges of an MMIO region?
> 
> I did it for two reasons. First, I wanted to simplify the code in order
> to speed up discussion over the patchset itself. Second, I failed to
> find justification for the need for multiple ranges, as the number of
> BARs is limited by VFIO_PCI_ROM_REGION_INDEX (6) and the same
> functionality can be achieved by multiple calls to DMABUF import.
I don't think the same functionality can be achieved by multiple calls to
dmabuf import. AFAIU, a dmabuf (as of today) is backed by an SGL that can
have multiple entries because it represents a scattered buffer (multiple
non-contiguous entries in System RAM or an MMIO region). But in this
patch you are constraining it such that only one entry associated with a
buffer would be included, which effectively means that we cannot create
a dmabuf to represent scattered buffers (located in a single MMIO region
such as VRAM or other device memory) anymore.

> 
> >
> > Restricting the dmabuf to a single range (or having to create multiple
> > dmabufs to represent multiple regions/ranges associated with a single
> > scattered buffer) would be very limiting and may not work in all cases.
> > For instance, in my use-case, I am trying to share a large (4k mode)
> > framebuffer (FB) located in GPU's VRAM between two (p2p compatible) GPU
> > devices. And, this would probably not work given that allocating a large
> > contiguous FB (nr_ranges = 1) in VRAM may not be feasible when there is
> > memory pressure.
> 
> Can you please help me and point to the place in the code where this can
> fail?
> I'm probably missing something basic as there are no large allocations
> in the current patchset.
Sorry, I was not very clear. What I meant is that it is not prudent to assume
that there will only be one range associated with an MMIO region that we need
to consider while creating a dmabuf. And, I was pointing out my use-case as an
example where vfio-pci needs to create a dmabuf for a large buffer (FB) that
would likely be scattered (and not contiguous) in an MMIO region (such as VRAM).

Let me further explain with my use-case. Here is a link to my Qemu-based test:
https://gitlab.freedesktop.org/Vivek/qemu/-/commit/b2bdb16d9cfaf55384c95b1f060f175ad1c89e95#81dc845f0babf39649c4e086e173375614111b4a_29_46

While exhaustively testing this case, I noticed that the Guest VM (GPU driver)
would occasionally create the buffer (represented by virtio_gpu_simple_resource,
for which we need to create a dmabuf) in such a way that there are multiple
entries (indicated by res->iov_cnt) that need to be included. This is the main
reason why I added support for nr_ranges > 1 to this patch/feature.

Furthermore, creating multiple dmabufs to represent each range of the same
buffer, as you suggest IIUC, is suboptimal and does not align with how dmabuf
works currently.

> 
> >
> > Furthermore, since you are adding a new UAPI with this patch/feature,
> > as you know, we cannot go back and tweak it (to add support for
> > nr_ranges > 1) should there be a need in the future, but you can always
> > use nr_ranges = 1 anytime. Therefore, I think it makes sense to be
> > flexible in terms of the number of ranges to include while creating a
> > dmabuf instead of restricting ourselves to one range.
> 
> I'm not a big fan of over-engineering. Let's first understand if this
> case is needed.
As explained above with my use-case, having support for nr_ranges > 1 is not
just nice to have but absolutely necessary. Otherwise, this feature would be
constrained to creating dmabufs for contiguous buffers (nr_ranges = 1) only,
which would limit its effectiveness as most GPU buffers are rarely contiguous.

Thanks,
Vivek

> 
> Thanks
> 
> >
> > Thanks,
> > Vivek
> >
> > > + *
> > > + * Return: The fd number on success, -1 and errno is set on failure.
> > > + */
> > > +#define VFIO_DEVICE_FEATURE_DMA_BUF 11
> > > +
> > > +struct vfio_device_feature_dma_buf {
> > > +	__u32	region_index;
> > > +	__u32	open_flags;
> > > +	__u64	offset;
> > > +	__u64	length;
> > > +};
> > > +
> > >  /* -------- API for Type1 VFIO IOMMU -------- */
> > >
> > >  /**
> > > --
> > > 2.50.1
> >

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 05/10] PCI/P2PDMA: Export pci_p2pdma_map_type() function
  2025-07-24  8:13     ` Leon Romanovsky
@ 2025-07-25 16:30       ` Logan Gunthorpe
  2025-07-25 18:54         ` Leon Romanovsky
  2025-07-27 19:05         ` Jason Gunthorpe
  2025-07-29  7:52       ` Christoph Hellwig
  1 sibling, 2 replies; 54+ messages in thread
From: Logan Gunthorpe @ 2025-07-25 16:30 UTC (permalink / raw)
  To: Leon Romanovsky, Christoph Hellwig
  Cc: Alex Williamson, Jason Gunthorpe, Andrew Morton, Bjorn Helgaas,
	Christian König, dri-devel, iommu, Jens Axboe,
	Jérôme Glisse, Joerg Roedel, kvm, linaro-mm-sig,
	linux-block, linux-kernel, linux-media, linux-mm, linux-pci,
	Marek Szyprowski, Robin Murphy, Sumit Semwal, Vivek Kasireddy,
	Will Deacon



On 2025-07-24 02:13, Leon Romanovsky wrote:
> On Thu, Jul 24, 2025 at 10:03:13AM +0200, Christoph Hellwig wrote:
>> On Wed, Jul 23, 2025 at 04:00:06PM +0300, Leon Romanovsky wrote:
>>> From: Leon Romanovsky <leonro@nvidia.com>
>>>
>>> Export the pci_p2pdma_map_type() function to allow external modules
>>> and subsystems to determine the appropriate mapping type for P2PDMA
>>> transfers between a provider and target device.
>>
>> External modules have no business doing this.
> 
> VFIO PCI code is built as module. There is no way to access PCI p2p code
> without exporting functions in it.

The solution that would make more sense to me would be for either
dma_iova_try_alloc() or another helper in dma-iommu.c to handle the
P2PDMA case. dma-iommu.c already uses those same interfaces and thus
there would be no need to export the low level helpers from the p2pdma code.
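
Something like the following, presumably (dma_iova_try_alloc_p2p() is a
hypothetical name for such a helper, not an existing API):

	/* Hypothetical dma-iommu helper: resolve the P2P mapping type
	 * internally so callers never touch the p2pdma internals. */
	bool dma_iova_try_alloc_p2p(struct device *dev,
				    struct dma_iova_state *state,
				    struct p2pdma_provider *provider,
				    phys_addr_t phys, size_t size);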

Logan

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 05/10] PCI/P2PDMA: Export pci_p2pdma_map_type() function
  2025-07-25 16:30       ` Logan Gunthorpe
@ 2025-07-25 18:54         ` Leon Romanovsky
  2025-07-25 19:12           ` Logan Gunthorpe
  2025-07-27 19:05         ` Jason Gunthorpe
  1 sibling, 1 reply; 54+ messages in thread
From: Leon Romanovsky @ 2025-07-25 18:54 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Christoph Hellwig, Alex Williamson, Jason Gunthorpe,
	Andrew Morton, Bjorn Helgaas, Christian König, dri-devel,
	iommu, Jens Axboe, Jérôme Glisse, Joerg Roedel, kvm,
	linaro-mm-sig, linux-block, linux-kernel, linux-media, linux-mm,
	linux-pci, Marek Szyprowski, Robin Murphy, Sumit Semwal,
	Vivek Kasireddy, Will Deacon

On Fri, Jul 25, 2025 at 10:30:46AM -0600, Logan Gunthorpe wrote:
> 
> 
> On 2025-07-24 02:13, Leon Romanovsky wrote:
> > On Thu, Jul 24, 2025 at 10:03:13AM +0200, Christoph Hellwig wrote:
> >> On Wed, Jul 23, 2025 at 04:00:06PM +0300, Leon Romanovsky wrote:
> >>> From: Leon Romanovsky <leonro@nvidia.com>
> >>>
> >>> Export the pci_p2pdma_map_type() function to allow external modules
> >>> and subsystems to determine the appropriate mapping type for P2PDMA
> >>> transfers between a provider and target device.
> >>
> >> External modules have no business doing this.
> > 
> > VFIO PCI code is built as module. There is no way to access PCI p2p code
> > without exporting functions in it.
> 
> The solution that would make more sense to me would be for either
> dma_iova_try_alloc() or another helper in dma-iommu.c to handle the
> P2PDMA case. dma-iommu.c already uses those same interfaces and thus
> there would be no need to export the low level helpers from the p2pdma code.

I had the same idea in early versions of the DMA phys API discussion and
it was pointed out (absolutely right) that this is a layering violation.

At that time, that remark wasn't so clear to me because the HMM code
performs a p2p check on every page and calls dma_iova_try_alloc()
before that check. But this VFIO DMABUF code shows it much more clearly.

The p2p check is performed before any DMA calls, and in the case of the
PCI_P2PDMA_MAP_BUS_ADDR p2p type between the DMABUF exporter device and
the DMABUF importer device, we don't call dma_iova_try_alloc() or any
DMA API at all.

So unfortunately, I think that dma*.c|h is not the right place for the
p2p type check.

Thanks

> 
> Logan
> 

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 05/10] PCI/P2PDMA: Export pci_p2pdma_map_type() function
  2025-07-25 18:54         ` Leon Romanovsky
@ 2025-07-25 19:12           ` Logan Gunthorpe
  2025-07-27  6:01             ` Leon Romanovsky
  0 siblings, 1 reply; 54+ messages in thread
From: Logan Gunthorpe @ 2025-07-25 19:12 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Christoph Hellwig, Alex Williamson, Jason Gunthorpe,
	Andrew Morton, Bjorn Helgaas, Christian König, dri-devel,
	iommu, Jens Axboe, Jérôme Glisse, Joerg Roedel, kvm,
	linaro-mm-sig, linux-block, linux-kernel, linux-media, linux-mm,
	linux-pci, Marek Szyprowski, Robin Murphy, Sumit Semwal,
	Vivek Kasireddy, Will Deacon



On 2025-07-25 12:54, Leon Romanovsky wrote:
>> The solution that would make more sense to me would be for either
>> dma_iova_try_alloc() or another helper in dma-iommu.c to handle the
>> P2PDMA case. dma-iommu.c already uses those same interfaces and thus
>> there would be no need to export the low level helpers from the p2pdma code.
> 
> I had the same idea in early versions of the DMA phys API discussion and
> it was pointed out (absolutely right) that this is a layering violation.

Respectfully, I have to disagree with this. Having the layer (i.e.
dma-iommu) that normally checks how to handle a P2PDMA request now check
how to handle these DMA requests is the exact opposite of a layering
violation. Expecting every driver that wants to do P2PDMA to have to
figure out for themselves how to map the memory before calling into the
DMA API doesn't seem like a good design choice to me.

> So unfortunately, I think that dma*.c|h is not the right place for the
> p2p type check.

dma*.c is already where those checks are done. I'm not sure patches to
remove the code from that layer and put it into the NVMe driver would
make a lot of sense (and then, of course, we'd have to put it into every
other driver that wants to participate in p2p transactions).

Logan


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 05/10] PCI/P2PDMA: Export pci_p2pdma_map_type() function
  2025-07-25 19:12           ` Logan Gunthorpe
@ 2025-07-27  6:01             ` Leon Romanovsky
  0 siblings, 0 replies; 54+ messages in thread
From: Leon Romanovsky @ 2025-07-27  6:01 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Christoph Hellwig, Alex Williamson, Jason Gunthorpe,
	Andrew Morton, Bjorn Helgaas, Christian König, dri-devel,
	iommu, Jens Axboe, Jérôme Glisse, Joerg Roedel, kvm,
	linaro-mm-sig, linux-block, linux-kernel, linux-media, linux-mm,
	linux-pci, Marek Szyprowski, Robin Murphy, Sumit Semwal,
	Vivek Kasireddy, Will Deacon

On Fri, Jul 25, 2025 at 01:12:35PM -0600, Logan Gunthorpe wrote:
> 
> 
> On 2025-07-25 12:54, Leon Romanovsky wrote:
> >> The solution that would make more sense to me would be for either
> >> dma_iova_try_alloc() or another helper in dma-iommu.c to handle the
> >> P2PDMA case. dma-iommu.c already uses those same interfaces and thus
> >> there would be no need to export the low level helpers from the p2pdma code.
> > 
> > I had the same idea in early versions of the DMA phys API discussion and
> > it was pointed out (absolutely right) that this is a layering violation.
> 
> Respectfully, I have to disagree with this. Having the layer (i.e.
> dma-iommu) that normally checks how to handle a P2PDMA request now check
> how to handle these DMA requests is the exact opposite of a layering
> violation.

I'm aware of your implementation and have a feeling that it was heavily
influenced by NVMe requirements, so the end result is very tailored
to it. Other users have very different paths if p2p is taken. Just
see the last VFIO patch in this series; it skips all DMA logic.

> Expecting every driver that wants to do P2PDMA to have to
> figure out for themselves how to map the memory before calling into the
> DMA API doesn't seem like a good design choice to me.

We had this discussion on previous versions too. The summary is
that p2p-capable devices are very special anyway. They need to work with
p2p natively. BTW, the implementation is not supposed to live in the
drivers, but in their respective subsystems.

> 
> > So unfortunately, I think that dma*.c|h is not the right place for the
> > p2p type check.
> 
> dma*.c is already where those checks are done. I'm not sure patches to
> remove the code from that layer and put it into the NVMe driver would
> make a lot of sense (and then, of course, we'd have to put it into every
> other driver that wants to participate in p2p transactions).

I don't have plans to remove the existing checks right now, but NVMe was
already converted to the new DMA phys API.

Thanks

> 
> Logan
> 
> 

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 10/10] vfio/pci: Add dma-buf export support for MMIO regions
  2025-07-25  5:34       ` Kasireddy, Vivek
@ 2025-07-27  6:16         ` Leon Romanovsky
  0 siblings, 0 replies; 54+ messages in thread
From: Leon Romanovsky @ 2025-07-27  6:16 UTC (permalink / raw)
  To: Kasireddy, Vivek
  Cc: Alex Williamson, Christoph Hellwig, Jason Gunthorpe,
	Andrew Morton, Bjorn Helgaas, Christian König,
	dri-devel@lists.freedesktop.org, iommu@lists.linux.dev,
	Jens Axboe, Jérôme Glisse, Joerg Roedel,
	kvm@vger.kernel.org, linaro-mm-sig@lists.linaro.org,
	linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-media@vger.kernel.org, linux-mm@kvack.org,
	linux-pci@vger.kernel.org, Logan Gunthorpe, Marek Szyprowski,
	Robin Murphy, Sumit Semwal, Will Deacon

On Fri, Jul 25, 2025 at 05:34:40AM +0000, Kasireddy, Vivek wrote:
> Hi Leon,
> 
> > Subject: Re: [PATCH 10/10] vfio/pci: Add dma-buf export support for MMIO
> > regions
> > 
> > > >
> > > > From: Leon Romanovsky <leonro@nvidia.com>
> > > >
> > > > Add support for exporting PCI device MMIO regions through dma-buf,
> > > > enabling safe sharing of non-struct page memory with controlled
> > > > lifetime management. This allows RDMA and other subsystems to import
> > > > dma-buf FDs and build them into memory regions for PCI P2P operations.
> > > >
> > > > The implementation provides a revocable attachment mechanism using
> > > > dma-buf move operations. MMIO regions are normally pinned as BARs
> > > > don't change physical addresses, but access is revoked when the VFIO
> > > > device is closed or a PCI reset is issued. This ensures kernel
> > > > self-defense against potentially hostile userspace.
> > > >
> > > > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> > > > Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
> > > > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > > > ---
> > > >  drivers/vfio/pci/Kconfig           |  20 ++
> > > >  drivers/vfio/pci/Makefile          |   2 +
> > > >  drivers/vfio/pci/vfio_pci_config.c |  22 +-
> > > >  drivers/vfio/pci/vfio_pci_core.c   |  25 ++-
> > > >  drivers/vfio/pci/vfio_pci_dmabuf.c | 321 +++++++++++++++++++++++++++++
> > > >  drivers/vfio/pci/vfio_pci_priv.h   |  23 +++
> > > >  include/linux/dma-buf.h            |   1 +
> > > >  include/linux/vfio_pci_core.h      |   3 +
> > > >  include/uapi/linux/vfio.h          |  19 ++
> > > >  9 files changed, 431 insertions(+), 5 deletions(-)
> > > >  create mode 100644 drivers/vfio/pci/vfio_pci_dmabuf.c
> > 
> > <...>
> > 
> > > > +static int validate_dmabuf_input(struct vfio_pci_core_device *vdev,
> > > > +				 struct vfio_device_feature_dma_buf *dma_buf)
> > > > +{
> > > > +	struct pci_dev *pdev = vdev->pdev;
> > > > +	u32 bar = dma_buf->region_index;
> > > > +	u64 offset = dma_buf->offset;
> > > > +	u64 len = dma_buf->length;
> > > > +	resource_size_t bar_size;
> > > > +	u64 sum;
> > > > +
> > > > +	/*
> > > > +	 * For PCI the region_index is the BAR number like  everything else.
> > > > +	 */
> > > > +	if (bar >= VFIO_PCI_ROM_REGION_INDEX)
> > > > +		return -ENODEV;
> > 
> > <...>
> > 
> > > > +/**
> > > > + * Upon VFIO_DEVICE_FEATURE_GET create a dma_buf fd for the
> > > > + * regions selected.
> > > > + *
> > > > + * open_flags are the typical flags passed to open(2), eg O_RDWR,
> > > > + * O_CLOEXEC, etc. offset/length specify a slice of the region to
> > > > + * create the dmabuf from. nr_ranges is the total number of (P2P DMA)
> > > > + * ranges that comprise the dmabuf.
> > > Any particular reason why you dropped the option (nr_ranges) of creating
> > > a single dmabuf from multiple ranges of an MMIO region?
> > 
> > I did it for two reasons. First, I wanted to simplify the code in order
> > to speed up discussion over the patchset itself. Second, I failed to
> > find justification for the need for multiple ranges, as the number of
> > BARs is limited by VFIO_PCI_ROM_REGION_INDEX (6) and the same
> > functionality can be achieved by multiple calls to DMABUF import.
> I don't think the same functionality can be achieved by multiple calls to
> dmabuf import. AFAIU, a dmabuf (as of today) is backed by an SGL that can
> have multiple entries because it represents a scattered buffer (multiple
> non-contiguous entries in System RAM or an MMIO region).

I don't know all the reasons why SG was chosen, but one of the main
reasons is that the DMA SG API was the only possible way to handle p2p
transfers (the peer2peer flag).
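
For reference, a p2p-aware importer opts in through the dma-buf attach
ops today (a minimal sketch; my_move_notify is a placeholder callback):

	static const struct dma_buf_attach_ops my_attach_ops = {
		/* importer can handle PCI bus addresses in the mapping */
		.allow_peer2peer = true,
		/* required for dynamic (revocable) attachments */
		.move_notify = my_move_notify,
	};

	attach = dma_buf_dynamic_attach(dmabuf, dev, &my_attach_ops, importer_priv);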


> But in this patch you are constraining it such that only one entry associated with a
> buffer would be included, which effectively means that we cannot create
> a dmabuf to represent scattered buffers (located in a single MMIO region
> such as VRAM or other device memory) anymore. 

Yes

> 
> > 
> > >
> > > Restricting the dmabuf to a single range (or having to create multiple
> > > dmabufs to represent multiple regions/ranges associated with a single
> > > scattered buffer) would be very limiting and may not work in all cases.
> > > For instance, in my use-case, I am trying to share a large (4k mode)
> > > framebuffer (FB) located in GPU's VRAM between two (p2p compatible) GPU
> > > devices. And, this would probably not work given that allocating a large
> > > contiguous FB (nr_ranges = 1) in VRAM may not be feasible when there is
> > > memory pressure.
> > 
> > Can you please help me and point to the place in the code where this can
> > fail?
> > I'm probably missing something basic as there are no large allocations
> > in the current patchset.
> Sorry, I was not very clear. What I meant is that it is not prudent to assume
> that there will only be one range associated with an MMIO region that we need
> to consider while creating a dmabuf. And, I was pointing out my use-case as an
> example where vfio-pci needs to create a dmabuf for a large buffer (FB) that
> would likely be scattered (and not contiguous) in an MMIO region (such as VRAM).
> 
> Let me further explain with my use-case. Here is a link to my Qemu-based test:
> https://gitlab.freedesktop.org/Vivek/qemu/-/commit/b2bdb16d9cfaf55384c95b1f060f175ad1c89e95#81dc845f0babf39649c4e086e173375614111b4a_29_46

Ohh, thanks. I'll add nr_ranges in the next version. I see that you are
using the same region_index for all ranges, and this is how I would like
to keep it: "multiple nr_ranges, same region_index".
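
Something along these lines, presumably (a sketch only; the field names
are guesses based on this thread, not the final UAPI):

	struct vfio_region_dma_range {
		__u64	offset;
		__u64	length;
	};

	struct vfio_device_feature_dma_buf {
		__u32	region_index;
		__u32	open_flags;
		__u32	nr_ranges;
		__u32	__reserved;
		/* nr_ranges entries, all within region_index */
		struct vfio_region_dma_range dma_ranges[];
	};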

Thanks

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 02/10] PCI/P2PDMA: Introduce p2pdma_provider structure for cleaner abstraction
  2025-07-24  7:59       ` Christoph Hellwig
  2025-07-24  8:07         ` Leon Romanovsky
@ 2025-07-27 18:51         ` Jason Gunthorpe
  2025-07-29  7:52           ` Christoph Hellwig
  1 sibling, 1 reply; 54+ messages in thread
From: Jason Gunthorpe @ 2025-07-27 18:51 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Leon Romanovsky, Alex Williamson, Andrew Morton, Bjorn Helgaas,
	Christian König, dri-devel, iommu, Jens Axboe,
	Jérôme Glisse, Joerg Roedel, kvm, linaro-mm-sig,
	linux-block, linux-kernel, linux-media, linux-mm, linux-pci,
	Logan Gunthorpe, Marek Szyprowski, Robin Murphy, Sumit Semwal,
	Vivek Kasireddy, Will Deacon

On Thu, Jul 24, 2025 at 09:59:22AM +0200, Christoph Hellwig wrote:
> On Thu, Jul 24, 2025 at 10:55:33AM +0300, Leon Romanovsky wrote:
> > Please, see last patch in the series https://lore.kernel.org/all/aea452cc27ca9e5169f7279d7b524190c39e7260.1753274085.git.leonro@nvidia.com
> > It gives me a way to call p2p code with stable pointer for whole BAR.
> > 
> 
> That simply can't work.

Why not?

That's the whole point of this, to remove struct page and use
something else as a handle for the p2p when doing the DMA API stuff.

The caller must make sure the lifetimes all work out. The handle must
live longer than any active DMAs, etc, etc. DMABUF with invalidation
lets vfio do that.
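
For reference, the exporter-side revoke in such a scheme boils down to a
flag flip plus a move notification (a sketch; priv->revoked follows the
naming in patch 10, and dma_buf_move_notify() is the existing dma-buf
core call that invokes every importer's move_notify callback):

	dma_resv_lock(priv->dmabuf->resv, NULL);
	/* block further mappings and tell importers to unmap */
	priv->revoked = true;
	dma_buf_move_notify(priv->dmabuf);
	dma_resv_unlock(priv->dmabuf->resv);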

This is why the DMA API code was taught to use phys_addr_t and not
touch the struct page, so it could work with struct-pageless memory.

The idea was to end up with two layers in the P2P code where the lower
layer only works on the handle, and then there is an optional struct
page/genalloc/etc layer for places that want struct page and mmap.

Jason

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 05/10] PCI/P2PDMA: Export pci_p2pdma_map_type() function
  2025-07-24  8:03   ` Christoph Hellwig
  2025-07-24  8:13     ` Leon Romanovsky
@ 2025-07-27 19:02     ` Jason Gunthorpe
  1 sibling, 0 replies; 54+ messages in thread
From: Jason Gunthorpe @ 2025-07-27 19:02 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Leon Romanovsky, Alex Williamson, Leon Romanovsky, Andrew Morton,
	Bjorn Helgaas, Christian König, dri-devel, iommu, Jens Axboe,
	Jérôme Glisse, Joerg Roedel, kvm, linaro-mm-sig,
	linux-block, linux-kernel, linux-media, linux-mm, linux-pci,
	Logan Gunthorpe, Marek Szyprowski, Robin Murphy, Sumit Semwal,
	Vivek Kasireddy, Will Deacon

On Thu, Jul 24, 2025 at 10:03:13AM +0200, Christoph Hellwig wrote:
> On Wed, Jul 23, 2025 at 04:00:06PM +0300, Leon Romanovsky wrote:
> > From: Leon Romanovsky <leonro@nvidia.com>
> > 
> > Export the pci_p2pdma_map_type() function to allow external modules
> > and subsystems to determine the appropriate mapping type for P2PDMA
> > transfers between a provider and target device.
> 
> External modules have no business doing this.

So what's the plan?

Today the new DMA API broadly has the pattern:

        switch (pci_p2pdma_state(p2pdma_state, dev, page)) {
[..]
        if (dma_use_iova(state)) {
                ret = dma_iova_link(dev, state, paddr, offset,
[..]
        } else {
                dma_addr = dma_map_page(dev, page, 0, map->dma_entry_size,
[..]

You can't fully use the new API flow without calling
pci_p2pdma_state(), which is also not exported today.

Is the idea that the full new DMA API flow should not be available to
modules? We did export dma_iova_link().

Otherwise, the p2p step needs two functions - a struct page-full and a
struct page-less version, and they need to be exported.

The names here are not so good; it would be nicer to have them be
dma_*-prefixed functions since they are used with the other dma_*
functions.

Jason

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 05/10] PCI/P2PDMA: Export pci_p2pdma_map_type() function
  2025-07-25 16:30       ` Logan Gunthorpe
  2025-07-25 18:54         ` Leon Romanovsky
@ 2025-07-27 19:05         ` Jason Gunthorpe
  2025-07-28 16:12           ` Logan Gunthorpe
  1 sibling, 1 reply; 54+ messages in thread
From: Jason Gunthorpe @ 2025-07-27 19:05 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Leon Romanovsky, Christoph Hellwig, Alex Williamson,
	Andrew Morton, Bjorn Helgaas, Christian König, dri-devel,
	iommu, Jens Axboe, Jérôme Glisse, Joerg Roedel, kvm,
	linaro-mm-sig, linux-block, linux-kernel, linux-media, linux-mm,
	linux-pci, Marek Szyprowski, Robin Murphy, Sumit Semwal,
	Vivek Kasireddy, Will Deacon

On Fri, Jul 25, 2025 at 10:30:46AM -0600, Logan Gunthorpe wrote:
> 
> 
> On 2025-07-24 02:13, Leon Romanovsky wrote:
> > On Thu, Jul 24, 2025 at 10:03:13AM +0200, Christoph Hellwig wrote:
> >> On Wed, Jul 23, 2025 at 04:00:06PM +0300, Leon Romanovsky wrote:
> >>> From: Leon Romanovsky <leonro@nvidia.com>
> >>>
> >>> Export the pci_p2pdma_map_type() function to allow external modules
> >>> and subsystems to determine the appropriate mapping type for P2PDMA
> >>> transfers between a provider and target device.
> >>
> >> External modules have no business doing this.
> > 
> > VFIO PCI code is built as module. There is no way to access PCI p2p code
> > without exporting functions in it.
> 
> The solution that would make more sense to me would be for either
> dma_iova_try_alloc() or another helper in dma-iommu.c to handle the
> P2PDMA case.

This has nothing to do with dma-iommu.c, the decisions here still need
to be made even if dma-iommu.c is not compiled in.

It could be exported from the main dma code, but I think it would just
be a 1 line wrapper around the existing function? I'd rather rename
the functions and leave them in the p2pdma.c files...
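
I.e. presumably just something like this (dma_p2pdma_map_type() is a
made-up name for illustration):

	static inline enum pci_p2pdma_map_type
	dma_p2pdma_map_type(struct p2pdma_provider *provider, struct device *dev)
	{
		return pci_p2pdma_map_type(provider, dev);
	}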

Jason

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 05/10] PCI/P2PDMA: Export pci_p2pdma_map_type() function
  2025-07-27 19:05         ` Jason Gunthorpe
@ 2025-07-28 16:12           ` Logan Gunthorpe
  2025-07-28 16:41             ` Leon Romanovsky
  0 siblings, 1 reply; 54+ messages in thread
From: Logan Gunthorpe @ 2025-07-28 16:12 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Leon Romanovsky, Christoph Hellwig, Alex Williamson,
	Andrew Morton, Bjorn Helgaas, Christian König, dri-devel,
	iommu, Jens Axboe, Jérôme Glisse, Joerg Roedel, kvm,
	linaro-mm-sig, linux-block, linux-kernel, linux-media, linux-mm,
	linux-pci, Marek Szyprowski, Robin Murphy, Sumit Semwal,
	Vivek Kasireddy, Will Deacon



On 2025-07-27 13:05, Jason Gunthorpe wrote:
> On Fri, Jul 25, 2025 at 10:30:46AM -0600, Logan Gunthorpe wrote:
>>
>>
>> On 2025-07-24 02:13, Leon Romanovsky wrote:
>>> On Thu, Jul 24, 2025 at 10:03:13AM +0200, Christoph Hellwig wrote:
>>>> On Wed, Jul 23, 2025 at 04:00:06PM +0300, Leon Romanovsky wrote:
>>>>> From: Leon Romanovsky <leonro@nvidia.com>
>>>>>
>>>>> Export the pci_p2pdma_map_type() function to allow external modules
>>>>> and subsystems to determine the appropriate mapping type for P2PDMA
>>>>> transfers between a provider and target device.
>>>>
>>>> External modules have no business doing this.
>>>
>>> VFIO PCI code is built as module. There is no way to access PCI p2p code
>>> without exporting functions in it.
>>
>> The solution that would make more sense to me would be for either
>> dma_iova_try_alloc() or another helper in dma-iommu.c to handle the
>> P2PDMA case.
> 
> This has nothing to do with dma-iommu.c, the decisions here still need
> to be made even if dma-iommu.c is not compiled in.

Doesn't it though? Every single call in patch 10 to the newly exported
PCI functions calls into the dma-iommu functions. If there were
non-iommu paths then I would expect the code would use the regular DMA
API directly, which would then call into dma-iommu.

I can't imagine a use case where someone would want to call the p2pdma
functions to map p2p memory and not have a similar path to do the exact
same mapping with vanilla memory and thus call the DMA API. And it seems
much better to me to export higher-level functions to drivers that take
care of the details correctly than to expose the nuts and bolts to every
driver.

The thing that seems special to me about VFIO is that it is calling
directly into dma-iommu code to set up unique mappings as opposed to
using the higher-level DMA API. I don't see in what way it is special
that it needs to know intimate details of the memory it's mapping and
have different paths to map different types of memory. That's what the
dma layer is for.
Logan


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 05/10] PCI/P2PDMA: Export pci_p2pdma_map_type() function
  2025-07-28 16:12           ` Logan Gunthorpe
@ 2025-07-28 16:41             ` Leon Romanovsky
  2025-07-28 17:07               ` Logan Gunthorpe
  0 siblings, 1 reply; 54+ messages in thread
From: Leon Romanovsky @ 2025-07-28 16:41 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Jason Gunthorpe, Christoph Hellwig, Alex Williamson,
	Andrew Morton, Bjorn Helgaas, Christian König, dri-devel,
	iommu, Jens Axboe, Jérôme Glisse, Joerg Roedel, kvm,
	linaro-mm-sig, linux-block, linux-kernel, linux-media, linux-mm,
	linux-pci, Marek Szyprowski, Robin Murphy, Sumit Semwal,
	Vivek Kasireddy, Will Deacon

On Mon, Jul 28, 2025 at 10:12:31AM -0600, Logan Gunthorpe wrote:
> 
> 
> On 2025-07-27 13:05, Jason Gunthorpe wrote:
> > On Fri, Jul 25, 2025 at 10:30:46AM -0600, Logan Gunthorpe wrote:
> >>
> >>
> >> On 2025-07-24 02:13, Leon Romanovsky wrote:
> >>> On Thu, Jul 24, 2025 at 10:03:13AM +0200, Christoph Hellwig wrote:
> >>>> On Wed, Jul 23, 2025 at 04:00:06PM +0300, Leon Romanovsky wrote:
> >>>>> From: Leon Romanovsky <leonro@nvidia.com>
> >>>>>
> >>>>> Export the pci_p2pdma_map_type() function to allow external modules
> >>>>> and subsystems to determine the appropriate mapping type for P2PDMA
> >>>>> transfers between a provider and target device.
> >>>>
> >>>> External modules have no business doing this.
> >>>
> >>> VFIO PCI code is built as module. There is no way to access PCI p2p code
> >>> without exporting functions in it.
> >>
> >> The solution that would make more sense to me would be for either
> >> dma_iova_try_alloc() or another helper in dma-iommu.c to handle the
> >> P2PDMA case.
> > 
> > This has nothing to do with dma-iommu.c, the decisions here still need
> > to be made even if dma-iommu.c is not compiled in.
> 
> Doesn't it though? Every single call in patch 10 to the newly exported
> PCI functions calls into the the dma-iommu functions. If there were
> non-iommu paths then I would expect the code would use the regular DMA
> api directly which would then call in to dma-iommu.

If the p2p type is PCI_P2PDMA_MAP_BUS_ADDR, there will be no dma-iommu
and no DMA at all.

+static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf,
+				   struct dma_buf_attachment *attachment)
+{
+	struct vfio_pci_dma_buf *priv = dmabuf->priv;
+
+	if (!attachment->peer2peer)
+		return -EOPNOTSUPP;
+
+	if (priv->revoked)
+		return -ENODEV;
+
+	switch (pci_p2pdma_map_type(priv->vdev->provider, attachment->dev)) {
+	case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
+		break;
+	case PCI_P2PDMA_MAP_BUS_ADDR:
+		/*
+		 * There is no need in IOVA at all for this flow.
+		 * We rely on attachment->priv == NULL as a marker
+		 * for this mode.
+		 */
+		return 0;
+	default:
+		return -EINVAL;
+	}
+
+	attachment->priv = kzalloc(sizeof(struct dma_iova_state), GFP_KERNEL);
+	if (!attachment->priv)
+		return -ENOMEM;
+
+	dma_iova_try_alloc(attachment->dev, attachment->priv, 0, priv->phys_vec.len);
+	return 0;
+}

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 05/10] PCI/P2PDMA: Export pci_p2pdma_map_type() function
  2025-07-28 16:41             ` Leon Romanovsky
@ 2025-07-28 17:07               ` Logan Gunthorpe
  2025-07-28 23:11                 ` Jason Gunthorpe
  0 siblings, 1 reply; 54+ messages in thread
From: Logan Gunthorpe @ 2025-07-28 17:07 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Jason Gunthorpe, Christoph Hellwig, Alex Williamson,
	Andrew Morton, Bjorn Helgaas, Christian König, dri-devel,
	iommu, Jens Axboe, Jérôme Glisse, Joerg Roedel, kvm,
	linaro-mm-sig, linux-block, linux-kernel, linux-media, linux-mm,
	linux-pci, Marek Szyprowski, Robin Murphy, Sumit Semwal,
	Vivek Kasireddy, Will Deacon



On 2025-07-28 10:41, Leon Romanovsky wrote:
> On Mon, Jul 28, 2025 at 10:12:31AM -0600, Logan Gunthorpe wrote:
>>
>>
>> On 2025-07-27 13:05, Jason Gunthorpe wrote:
>>> On Fri, Jul 25, 2025 at 10:30:46AM -0600, Logan Gunthorpe wrote:
>>>>
>>>>
>>>> On 2025-07-24 02:13, Leon Romanovsky wrote:
>>>>> On Thu, Jul 24, 2025 at 10:03:13AM +0200, Christoph Hellwig wrote:
>>>>>> On Wed, Jul 23, 2025 at 04:00:06PM +0300, Leon Romanovsky wrote:
>>>>>>> From: Leon Romanovsky <leonro@nvidia.com>
>>>>>>>
>>>>>>> Export the pci_p2pdma_map_type() function to allow external modules
>>>>>>> and subsystems to determine the appropriate mapping type for P2PDMA
>>>>>>> transfers between a provider and target device.
>>>>>>
>>>>>> External modules have no business doing this.
>>>>>
>>>>> VFIO PCI code is built as module. There is no way to access PCI p2p code
>>>>> without exporting functions in it.
>>>>
>>>> The solution that would make more sense to me would be for either
>>>> dma_iova_try_alloc() or another helper in dma-iommu.c to handle the
>>>> P2PDMA case.
>>>
>>> This has nothing to do with dma-iommu.c, the decisions here still need
>>> to be made even if dma-iommu.c is not compiled in.
>>
> >> Doesn't it though? Every single call in patch 10 to the newly exported
> >> PCI functions calls into the dma-iommu functions. If there were
> >> non-iommu paths then I would expect the code would use the regular DMA
> >> API directly, which would then call into dma-iommu.
> 
> > If the p2p type is PCI_P2PDMA_MAP_BUS_ADDR, there will be no dma-iommu
> > and no DMA at all.

I understand that and it is completely beside my point.

If the dma mapping for P2P memory doesn't need to create an iommu
mapping then that's fine. But it should be the dma-iommu layer that
decides that. It's not a decision that should be made by every driver
doing this kind of thing.

With P2PDMA memory we are still creating a DMA mapping. It's just that
the dma address will be a PCI bus address instead of an IOVA. My opinion
remains: none of these details should be exposed to the drivers.

Logan

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 09/10] vfio/pci: Share the core device pointer while invoking feature functions
  2025-07-23 13:00 ` [PATCH 09/10] vfio/pci: Share the core device pointer while invoking feature functions Leon Romanovsky
@ 2025-07-28 20:55   ` Alex Williamson
  2025-07-29  8:39     ` Leon Romanovsky
  0 siblings, 1 reply; 54+ messages in thread
From: Alex Williamson @ 2025-07-28 20:55 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Vivek Kasireddy, Christoph Hellwig, Jason Gunthorpe,
	Andrew Morton, Bjorn Helgaas, Christian König, dri-devel,
	iommu, Jens Axboe, Jérôme Glisse, Joerg Roedel, kvm,
	linaro-mm-sig, linux-block, linux-kernel, linux-media, linux-mm,
	linux-pci, Logan Gunthorpe, Marek Szyprowski, Robin Murphy,
	Sumit Semwal, Will Deacon

On Wed, 23 Jul 2025 16:00:10 +0300
Leon Romanovsky <leon@kernel.org> wrote:

> From: Vivek Kasireddy <vivek.kasireddy@intel.com>
> 
> There is no need to share the main device pointer (struct vfio_device *)
> with all the feature functions as they only need the core device
> pointer. Therefore, extract the core device pointer once in the
> caller (vfio_pci_core_ioctl_feature) and share it instead.
> 
> Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> ---
>  drivers/vfio/pci/vfio_pci_core.c | 30 +++++++++++++-----------------
>  1 file changed, 13 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 1e675daab5753..5512d13bb8899 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -301,11 +301,9 @@ static int vfio_pci_runtime_pm_entry(struct vfio_pci_core_device *vdev,
>  	return 0;
>  }
>  
> -static int vfio_pci_core_pm_entry(struct vfio_device *device, u32 flags,
> +static int vfio_pci_core_pm_entry(struct vfio_pci_core_device *vdev, u32 flags,
>  				  void __user *arg, size_t argsz)
>  {
> -	struct vfio_pci_core_device *vdev =
> -		container_of(device, struct vfio_pci_core_device, vdev);
>  	int ret;
>  
>  	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_SET, 0);
> @@ -322,12 +320,10 @@ static int vfio_pci_core_pm_entry(struct vfio_device *device, u32 flags,
>  }
>  
>  static int vfio_pci_core_pm_entry_with_wakeup(
> -	struct vfio_device *device, u32 flags,
> +	struct vfio_pci_core_device *vdev, u32 flags,
>  	struct vfio_device_low_power_entry_with_wakeup __user *arg,
>  	size_t argsz)

I'm tempted to fix the line wrapping here, but I think this patch
stands on its own.  Even if it's rather trivial, it makes sense to
consolidate and standardize on the vfio_pci_core_device getting passed
around within vfio_pci_core.c.  Any reason not to split this off?
Thanks,

Alex

>  {
> -	struct vfio_pci_core_device *vdev =
> -		container_of(device, struct vfio_pci_core_device, vdev);
>  	struct vfio_device_low_power_entry_with_wakeup entry;
>  	struct eventfd_ctx *efdctx;
>  	int ret;
> @@ -378,11 +374,9 @@ static void vfio_pci_runtime_pm_exit(struct vfio_pci_core_device *vdev)
>  	up_write(&vdev->memory_lock);
>  }
>  
> -static int vfio_pci_core_pm_exit(struct vfio_device *device, u32 flags,
> +static int vfio_pci_core_pm_exit(struct vfio_pci_core_device *vdev, u32 flags,
>  				 void __user *arg, size_t argsz)
>  {
> -	struct vfio_pci_core_device *vdev =
> -		container_of(device, struct vfio_pci_core_device, vdev);
>  	int ret;
>  
>  	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_SET, 0);
> @@ -1475,11 +1469,10 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
>  }
>  EXPORT_SYMBOL_GPL(vfio_pci_core_ioctl);
>  
> -static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags,
> -				       uuid_t __user *arg, size_t argsz)
> +static int vfio_pci_core_feature_token(struct vfio_pci_core_device *vdev,
> +				       u32 flags, uuid_t __user *arg,
> +				       size_t argsz)
>  {
> -	struct vfio_pci_core_device *vdev =
> -		container_of(device, struct vfio_pci_core_device, vdev);
>  	uuid_t uuid;
>  	int ret;
>  
> @@ -1506,16 +1499,19 @@ static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags,
>  int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
>  				void __user *arg, size_t argsz)
>  {
> +	struct vfio_pci_core_device *vdev =
> +		container_of(device, struct vfio_pci_core_device, vdev);
> +
>  	switch (flags & VFIO_DEVICE_FEATURE_MASK) {
>  	case VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY:
> -		return vfio_pci_core_pm_entry(device, flags, arg, argsz);
> +		return vfio_pci_core_pm_entry(vdev, flags, arg, argsz);
>  	case VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP:
> -		return vfio_pci_core_pm_entry_with_wakeup(device, flags,
> +		return vfio_pci_core_pm_entry_with_wakeup(vdev, flags,
>  							  arg, argsz);
>  	case VFIO_DEVICE_FEATURE_LOW_POWER_EXIT:
> -		return vfio_pci_core_pm_exit(device, flags, arg, argsz);
> +		return vfio_pci_core_pm_exit(vdev, flags, arg, argsz);
>  	case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN:
> -		return vfio_pci_core_feature_token(device, flags, arg, argsz);
> +		return vfio_pci_core_feature_token(vdev, flags, arg, argsz);
>  	default:
>  		return -ENOTTY;
>  	}


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 05/10] PCI/P2PDMA: Export pci_p2pdma_map_type() function
  2025-07-28 17:07               ` Logan Gunthorpe
@ 2025-07-28 23:11                 ` Jason Gunthorpe
  2025-07-29 20:54                   ` Logan Gunthorpe
  0 siblings, 1 reply; 54+ messages in thread
From: Jason Gunthorpe @ 2025-07-28 23:11 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Leon Romanovsky, Christoph Hellwig, Alex Williamson,
	Andrew Morton, Bjorn Helgaas, Christian König, dri-devel,
	iommu, Jens Axboe, Jérôme Glisse, Joerg Roedel, kvm,
	linaro-mm-sig, linux-block, linux-kernel, linux-media, linux-mm,
	linux-pci, Marek Szyprowski, Robin Murphy, Sumit Semwal,
	Vivek Kasireddy, Will Deacon

On Mon, Jul 28, 2025 at 11:07:34AM -0600, Logan Gunthorpe wrote:
> 
> 
> On 2025-07-28 10:41, Leon Romanovsky wrote:
> > On Mon, Jul 28, 2025 at 10:12:31AM -0600, Logan Gunthorpe wrote:
> >>
> >>
> >> On 2025-07-27 13:05, Jason Gunthorpe wrote:
> >>> On Fri, Jul 25, 2025 at 10:30:46AM -0600, Logan Gunthorpe wrote:
> >>>>
> >>>>
> >>>> On 2025-07-24 02:13, Leon Romanovsky wrote:
> >>>>> On Thu, Jul 24, 2025 at 10:03:13AM +0200, Christoph Hellwig wrote:
> >>>>>> On Wed, Jul 23, 2025 at 04:00:06PM +0300, Leon Romanovsky wrote:
> >>>>>>> From: Leon Romanovsky <leonro@nvidia.com>
> >>>>>>>
> >>>>>>> Export the pci_p2pdma_map_type() function to allow external modules
> >>>>>>> and subsystems to determine the appropriate mapping type for P2PDMA
> >>>>>>> transfers between a provider and target device.
> >>>>>>
> >>>>>> External modules have no business doing this.
> >>>>>
> >>>>> VFIO PCI code is built as module. There is no way to access PCI p2p code
> >>>>> without exporting functions in it.
> >>>>
> >>>> The solution that would make more sense to me would be for either
> >>>> dma_iova_try_alloc() or another helper in dma-iommu.c to handle the
> >>>> P2PDMA case.
> >>>
> >>> This has nothing to do with dma-iommu.c, the decisions here still need
> >>> to be made even if dma-iommu.c is not compiled in.
> >>
> >> Doesn't it though? Every single call in patch 10 to the newly exported
> >> PCI functions calls into the dma-iommu functions.

Patch 10 has lots of flows; only one will end up in dma-iommu.c.

vfio_pci_dma_buf_map() calls pci_p2pdma_bus_addr_map(),
dma_iova_link(), or dma_map_phys().

Only dma_iova_link() would call into dma-iommu.c; if dma_map_phys() is
called, we know that dma-iommu.c won't be called by it.

> >> If there were non-iommu paths then I would expect the code would
> >> use the regular DMA API directly, which would then call into
> >> dma-iommu.
> > 
> > If the p2p type is PCI_P2PDMA_MAP_BUS_ADDR, there will be no dma-iommu
> > and no DMA at all.
> 
> I understand that and it is completely beside my point.
> 
> If the dma mapping for P2P memory doesn't need to create an iommu
> mapping then that's fine. But it should be the dma-iommu layer to decide
> that.

So above, we can't use dma-iommu.c, it might not be compiled into the
kernel but the dma_map_phys() path is still valid.

> It's not a decision that should be made by every driver doing this
> kind of thing.

Sort of. I think we are trying to get to some place where there are
subsystem-specific, or at least data-structure-specific, helpers that do
this (e.g. nvme has BIO helpers), but the helpers should be running this
logic directly for performance. Leon hasn't done it, but I think we
should see helpers for DMABUF too, encapsulating the logic shown in
patch 10. I think we need to prove out these basic points first
before trying to go and convert a bunch of GPU drivers.

The vfio in patch 10 is not the full example since it effectively only
has a single "scatter/gather" entry, but the generalized version loops
over pci_p2pdma_bus_addr_map(), dma_iova_link(), or dma_map_phys() for
each page.
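
For illustration, a rough sketch of that generalized per-page loop (the
pci_p2pdma_*/dma_* calls are the ones used in patch 10; the surrounding
variables are assumptions, not code from the series):

	/* map_type was decided once, before the loop, via pci_p2pdma_map_type() */
	for (i = 0; i < nr_pages; i++) {
		phys_addr_t paddr = phys[i];

		if (map_type == PCI_P2PDMA_MAP_BUS_ADDR) {
			dma_addr[i] = pci_p2pdma_bus_addr_map(provider, paddr);
		} else if (dma_use_iova(state)) {
			ret = dma_iova_link(dev, state, paddr, i * PAGE_SIZE,
					    PAGE_SIZE, dir, attrs);
			if (ret)
				goto err_unlink;
		} else {
			dma_addr[i] = dma_map_phys(dev, paddr, PAGE_SIZE,
						   dir, attrs);
			if (dma_mapping_error(dev, dma_addr[i]))
				goto err_unmap;
		}
	}
	/* the dma_iova_link() path is followed by one dma_iova_sync() call */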

Part of the new API design is to only do one kind of mapping operation
at once, and part of the design is that we know the P2P type is fixed.
It makes no performance sense to check the type inside the
pci_p2pdma_bus_addr_map()/dma_iova_link()/dma_map_phys() calls within
the per-page loop.

I do think some level of abstraction has been lost here in pursuit of
performance. If someone does have a better way to structure this
without a performance hit then fantastic, but that's going back and
revising the new DMA API. This just builds on top of that, and yes, it
is not so abstract.

Jason

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 02/10] PCI/P2PDMA: Introduce p2pdma_provider structure for cleaner abstraction
  2025-07-27 18:51         ` Jason Gunthorpe
@ 2025-07-29  7:52           ` Christoph Hellwig
  2025-07-29  8:53             ` Leon Romanovsky
  2025-07-29 13:15             ` Jason Gunthorpe
  0 siblings, 2 replies; 54+ messages in thread
From: Christoph Hellwig @ 2025-07-29  7:52 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, Leon Romanovsky, Alex Williamson,
	Andrew Morton, Bjorn Helgaas, Christian König, dri-devel,
	iommu, Jens Axboe, Jérôme Glisse, Joerg Roedel, kvm,
	linaro-mm-sig, linux-block, linux-kernel, linux-media, linux-mm,
	linux-pci, Logan Gunthorpe, Marek Szyprowski, Robin Murphy,
	Sumit Semwal, Vivek Kasireddy, Will Deacon

On Sun, Jul 27, 2025 at 03:51:58PM -0300, Jason Gunthorpe wrote:
> On Thu, Jul 24, 2025 at 09:59:22AM +0200, Christoph Hellwig wrote:
> > On Thu, Jul 24, 2025 at 10:55:33AM +0300, Leon Romanovsky wrote:
> > > Please, see last patch in the series https://lore.kernel.org/all/aea452cc27ca9e5169f7279d7b524190c39e7260.1753274085.git.leonro@nvidia.com
> > > It gives me a way to call p2p code with stable pointer for whole BAR.
> > > 
> > 
> > That simply can't work.
> 
> Why not?
> 
> That's the whole point of this, to remove struct page and use
> something else as a handle for the p2p when doing the DMA API stuff.

Because the struct page is the only thing that:

 a) dma-mapping works on
 b) is the only place we can discover the routing information, but also
    more importantly ensure that the underlying page is still present
    and the device is not hot unplugged, or in a very theoretical worst
    case replaced by something else.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 05/10] PCI/P2PDMA: Export pci_p2pdma_map_type() function
  2025-07-24  8:13     ` Leon Romanovsky
  2025-07-25 16:30       ` Logan Gunthorpe
@ 2025-07-29  7:52       ` Christoph Hellwig
  2025-07-29  8:45         ` Leon Romanovsky
  1 sibling, 1 reply; 54+ messages in thread
From: Christoph Hellwig @ 2025-07-29  7:52 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Christoph Hellwig, Alex Williamson, Jason Gunthorpe,
	Andrew Morton, Bjorn Helgaas, Christian König, dri-devel,
	iommu, Jens Axboe, Jérôme Glisse, Joerg Roedel, kvm,
	linaro-mm-sig, linux-block, linux-kernel, linux-media, linux-mm,
	linux-pci, Logan Gunthorpe, Marek Szyprowski, Robin Murphy,
	Sumit Semwal, Vivek Kasireddy, Will Deacon

On Thu, Jul 24, 2025 at 11:13:21AM +0300, Leon Romanovsky wrote:
> On Thu, Jul 24, 2025 at 10:03:13AM +0200, Christoph Hellwig wrote:
> > On Wed, Jul 23, 2025 at 04:00:06PM +0300, Leon Romanovsky wrote:
> > > From: Leon Romanovsky <leonro@nvidia.com>
> > > 
> > > Export the pci_p2pdma_map_type() function to allow external modules
> > > and subsystems to determine the appropriate mapping type for P2PDMA
> > > transfers between a provider and target device.
> > 
> > External modules have no business doing this.
> 
> VFIO PCI code is built as module. There is no way to access PCI p2p code
> without exporting functions in it.

We never ever export anything for "external" modules, and you really
should know that.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 09/10] vfio/pci: Share the core device pointer while invoking feature functions
  2025-07-28 20:55   ` Alex Williamson
@ 2025-07-29  8:39     ` Leon Romanovsky
  0 siblings, 0 replies; 54+ messages in thread
From: Leon Romanovsky @ 2025-07-29  8:39 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Vivek Kasireddy, Christoph Hellwig, Jason Gunthorpe,
	Andrew Morton, Bjorn Helgaas, Christian König, dri-devel,
	iommu, Jens Axboe, Jérôme Glisse, Joerg Roedel, kvm,
	linaro-mm-sig, linux-block, linux-kernel, linux-media, linux-mm,
	linux-pci, Logan Gunthorpe, Marek Szyprowski, Robin Murphy,
	Sumit Semwal, Will Deacon

On Mon, Jul 28, 2025 at 02:55:53PM -0600, Alex Williamson wrote:
> On Wed, 23 Jul 2025 16:00:10 +0300
> Leon Romanovsky <leon@kernel.org> wrote:
> 
> > From: Vivek Kasireddy <vivek.kasireddy@intel.com>
> > 
> > There is no need to share the main device pointer (struct vfio_device *)
> > with all the feature functions as they only need the core device
> > pointer. Therefore, extract the core device pointer once in the
> > caller (vfio_pci_core_ioctl_feature) and share it instead.
> > 
> > Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
> > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > ---
> >  drivers/vfio/pci/vfio_pci_core.c | 30 +++++++++++++-----------------
> >  1 file changed, 13 insertions(+), 17 deletions(-)

<...>

> >  static int vfio_pci_core_pm_entry_with_wakeup(
> > -	struct vfio_device *device, u32 flags,
> > +	struct vfio_pci_core_device *vdev, u32 flags,
> >  	struct vfio_device_low_power_entry_with_wakeup __user *arg,
> >  	size_t argsz)
> 
> I'm tempted to fix the line wrapping here, but I think this patch
> stands on its own.  Even if it's rather trivial, it makes sense to
> consolidate and standardize on the vfio_pci_core_device getting passed
> around within vfio_pci_core.c.  Any reason not to split this off?

No problem, I will send it separately after the merge window ends.

Thanks

> Thanks,
> 
> Alex

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 05/10] PCI/P2PDMA: Export pci_p2pdma_map_type() function
  2025-07-29  7:52       ` Christoph Hellwig
@ 2025-07-29  8:45         ` Leon Romanovsky
  0 siblings, 0 replies; 54+ messages in thread
From: Leon Romanovsky @ 2025-07-29  8:45 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Alex Williamson, Jason Gunthorpe, Andrew Morton, Bjorn Helgaas,
	Christian König, dri-devel, iommu, Jens Axboe,
	Jérôme Glisse, Joerg Roedel, kvm, linaro-mm-sig,
	linux-block, linux-kernel, linux-media, linux-mm, linux-pci,
	Logan Gunthorpe, Marek Szyprowski, Robin Murphy, Sumit Semwal,
	Vivek Kasireddy, Will Deacon

On Tue, Jul 29, 2025 at 09:52:30AM +0200, Christoph Hellwig wrote:
> On Thu, Jul 24, 2025 at 11:13:21AM +0300, Leon Romanovsky wrote:
> > On Thu, Jul 24, 2025 at 10:03:13AM +0200, Christoph Hellwig wrote:
> > > On Wed, Jul 23, 2025 at 04:00:06PM +0300, Leon Romanovsky wrote:
> > > > From: Leon Romanovsky <leonro@nvidia.com>
> > > > 
> > > > Export the pci_p2pdma_map_type() function to allow external modules
> > > > and subsystems to determine the appropriate mapping type for P2PDMA
> > > > transfers between a provider and target device.
> > > 
> > > External modules have no business doing this.
> > 
> > VFIO PCI code is built as module. There is no way to access PCI p2p code
> > without exporting functions in it.
> 
> We never ever export anything for "external" modules, and you really
> should know that.

It is just a wrong word in the commit message. I clearly need it for
the vfio-pci module and nothing more.

"Never attribute to malice that which is adequately explained by stupidity." - Hanlon's razor.

Thanks

> 
> 

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 02/10] PCI/P2PDMA: Introduce p2pdma_provider structure for cleaner abstraction
  2025-07-29  7:52           ` Christoph Hellwig
@ 2025-07-29  8:53             ` Leon Romanovsky
  2025-07-29 10:41               ` Christoph Hellwig
  2025-07-29 13:15             ` Jason Gunthorpe
  1 sibling, 1 reply; 54+ messages in thread
From: Leon Romanovsky @ 2025-07-29  8:53 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jason Gunthorpe, Alex Williamson, Andrew Morton, Bjorn Helgaas,
	Christian König, dri-devel, iommu, Jens Axboe,
	Jérôme Glisse, Joerg Roedel, kvm, linaro-mm-sig,
	linux-block, linux-kernel, linux-media, linux-mm, linux-pci,
	Logan Gunthorpe, Marek Szyprowski, Robin Murphy, Sumit Semwal,
	Vivek Kasireddy, Will Deacon

On Tue, Jul 29, 2025 at 09:52:09AM +0200, Christoph Hellwig wrote:
> On Sun, Jul 27, 2025 at 03:51:58PM -0300, Jason Gunthorpe wrote:
> > On Thu, Jul 24, 2025 at 09:59:22AM +0200, Christoph Hellwig wrote:
> > > On Thu, Jul 24, 2025 at 10:55:33AM +0300, Leon Romanovsky wrote:
> > > > Please, see last patch in the series https://lore.kernel.org/all/aea452cc27ca9e5169f7279d7b524190c39e7260.1753274085.git.leonro@nvidia.com
> > > > It gives me a way to call p2p code with stable pointer for whole BAR.
> > > > 
> > > 
> > > That simply can't work.
> > 
> > Why not?
> > 
> > That's the whole point of this, to remove struct page and use
> > something else as a handle for the p2p when doing the DMA API stuff.
> 
> Because the struct page is the only thing that:
> 
>  a) dma-mapping works on
>  b) is the only place we can discover the routing information, but also
>     more importantly ensure that the underlying page is still present
>     and the device is not hot unplugged, or in a very theoretical worst
>     case replaced by something else.

It is correct in the general case, but here we are talking about MMIO
memory, which is "connected" to device X and the routing information is
stable.

Thanks

> 
> 

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 02/10] PCI/P2PDMA: Introduce p2pdma_provider structure for cleaner abstraction
  2025-07-29  8:53             ` Leon Romanovsky
@ 2025-07-29 10:41               ` Christoph Hellwig
  2025-07-29 11:39                 ` Leon Romanovsky
  0 siblings, 1 reply; 54+ messages in thread
From: Christoph Hellwig @ 2025-07-29 10:41 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Christoph Hellwig, Jason Gunthorpe, Alex Williamson,
	Andrew Morton, Bjorn Helgaas, Christian König, dri-devel,
	iommu, Jens Axboe, Jérôme Glisse, Joerg Roedel, kvm,
	linaro-mm-sig, linux-block, linux-kernel, linux-media, linux-mm,
	linux-pci, Logan Gunthorpe, Marek Szyprowski, Robin Murphy,
	Sumit Semwal, Vivek Kasireddy, Will Deacon

On Tue, Jul 29, 2025 at 11:53:36AM +0300, Leon Romanovsky wrote:
> > Because the struct page is the only thing that:
> > 
> >  a) dma-mapping works on
> >  b) is the only place we can discover the routing information, but also
> >     more importantly ensure that the underlying page is still present
> >     and the device is not hot unplugged, or in a very theoretical worst
> >     case replaced by something else.
> 
> That is correct in the general case, but here we are talking about MMIO
> memory, which is "connected" to device X, so the routing information is
> stable.

MMIO is literally the only thing we support P2P to/from, as that is
how PCIe P2P is defined.  And no, it's not stable - devices can be
unplugged, and BARs can be re-enumerated.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 02/10] PCI/P2PDMA: Introduce p2pdma_provider structure for cleaner abstraction
  2025-07-29 10:41               ` Christoph Hellwig
@ 2025-07-29 11:39                 ` Leon Romanovsky
  0 siblings, 0 replies; 54+ messages in thread
From: Leon Romanovsky @ 2025-07-29 11:39 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jason Gunthorpe, Alex Williamson, Andrew Morton, Bjorn Helgaas,
	Christian König, dri-devel, iommu, Jens Axboe,
	Jérôme Glisse, Joerg Roedel, kvm, linaro-mm-sig,
	linux-block, linux-kernel, linux-media, linux-mm, linux-pci,
	Logan Gunthorpe, Marek Szyprowski, Robin Murphy, Sumit Semwal,
	Vivek Kasireddy, Will Deacon

On Tue, Jul 29, 2025 at 12:41:00PM +0200, Christoph Hellwig wrote:
> On Tue, Jul 29, 2025 at 11:53:36AM +0300, Leon Romanovsky wrote:
> > > Because the struct page is the only thing that:
> > > 
> > >  a) dma-mapping works on
> > >  b) is the only place we can discover the routing information, but also
> > >     more importantly ensure that the underlying page is still present
> > >     and the device is not hot unplugged, or in a very theoretical worst
> > >     case replaced by something else.
> > 
> > That is correct in the general case, but here we are talking about MMIO
> > memory, which is "connected" to device X, so the routing information is
> > stable.
> 
> MMIO is literally the only thing we support P2P to/from, as that is
> how PCIe P2P is defined.  And no, it's not stable - devices can be
> unplugged, and BARs can be re-enumerated.

I have a feeling that we are drifting from the current patchset to a
more general discussion.

The whole idea of the new DMA API is to provide flexibility to the
callers (subsystems), which are perfectly aware of their data and its
limitations, so they can implement direct addressing natively.

In this series, the device is controlled by VFIO and DMABUF. It is not
possible to unplug it without VFIO noticing. In that case, the
p2pdma_provider and related routing information (DMABUF) will be
re-evaluated.

So for VFIO + DMABUF, the pointer is very stable.

For other cases (the general case), the flow is not changed. Users will
continue to call the old and well-known pci_p2pdma_state() to calculate
the P2P type.

Thanks


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 02/10] PCI/P2PDMA: Introduce p2pdma_provider structure for cleaner abstraction
  2025-07-29  7:52           ` Christoph Hellwig
  2025-07-29  8:53             ` Leon Romanovsky
@ 2025-07-29 13:15             ` Jason Gunthorpe
  1 sibling, 0 replies; 54+ messages in thread
From: Jason Gunthorpe @ 2025-07-29 13:15 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Leon Romanovsky, Alex Williamson, Andrew Morton, Bjorn Helgaas,
	Christian König, dri-devel, iommu, Jens Axboe,
	Jérôme Glisse, Joerg Roedel, kvm, linaro-mm-sig,
	linux-block, linux-kernel, linux-media, linux-mm, linux-pci,
	Logan Gunthorpe, Marek Szyprowski, Robin Murphy, Sumit Semwal,
	Vivek Kasireddy, Will Deacon

On Tue, Jul 29, 2025 at 09:52:09AM +0200, Christoph Hellwig wrote:
> On Sun, Jul 27, 2025 at 03:51:58PM -0300, Jason Gunthorpe wrote:
> > On Thu, Jul 24, 2025 at 09:59:22AM +0200, Christoph Hellwig wrote:
> > > On Thu, Jul 24, 2025 at 10:55:33AM +0300, Leon Romanovsky wrote:
> > > > Please, see last patch in the series https://lore.kernel.org/all/aea452cc27ca9e5169f7279d7b524190c39e7260.1753274085.git.leonro@nvidia.com
> > > > It gives me a way to call p2p code with stable pointer for whole BAR.
> > > > 
> > > 
> > > That simply can't work.
> > 
> > Why not?
> > 
> > That's the whole point of this, to remove struct page and use
> > something else as a handle for the p2p when doing the DMA API stuff.
> 
> Because the struct page is the only thing that:
> 
>  a) dma-mapping works on

The main point of the "dma-mapping: migrate to physical
address-based API" series was to remove the struct page dependencies
in the DMA API:

https://lore.kernel.org/all/cover.1750854543.git.leon@kernel.org/

If it is not complete, then it needs more fixing.

>  b) is the only place we can discover the routing information, 

This patch adds the p2pdma_provider structure to discover the routing
information; that is exactly the problem being solved here.
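
For reference, the provider handle itself is tiny - roughly this (a
sketch based on the fields described in the patch, not the literal
code):

struct p2pdma_provider {
	struct device	*owner;		/* device providing the MMIO */
	u64		bus_offset;	/* offset for non-host (bus) transactions */
};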

>     but also more importantly ensure that the underlying page is
>     still present and the device is not hot unplugged, or in a very
>     theoretical worst case replaced by something else.

I already answered this: for DMABUF, the DMABUF invalidation scheme is
used to control the lifetime, so no DMA mapping outlives the provider,
and the provider doesn't outlive the driver.

Hotplug works fine. VFIO gets the driver removal callback, invalidates
all the DMABUFs, refuses to re-validate them, destroys the P2P
provider, and completes driver removal. There is no lifetime issue.
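
As a rough sketch of that revoke step (assuming the priv->revoked flag
from patch 10; not the literal series code):

static void vfio_pci_dma_buf_revoke(struct vfio_pci_dma_buf *priv)
{
	dma_resv_lock(priv->dmabuf->resv, NULL);
	priv->revoked = true;
	/* Importers must unmap now and may not re-validate afterwards. */
	dma_buf_move_notify(priv->dmabuf);
	dma_resv_unlock(priv->dmabuf->resv);
}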

Obviously you cannot use the new p2pdma_provider mechanism without some
kind of protection against use after hot unplug, but it doesn't have
to be struct page based.

Jason

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 02/10] PCI/P2PDMA: Introduce p2pdma_provider structure for cleaner abstraction
  2025-07-23 13:00 ` [PATCH 02/10] PCI/P2PDMA: Introduce p2pdma_provider structure for cleaner abstraction Leon Romanovsky
  2025-07-24  7:51   ` Christoph Hellwig
@ 2025-07-29 16:12   ` Jason Gunthorpe
  1 sibling, 0 replies; 54+ messages in thread
From: Jason Gunthorpe @ 2025-07-29 16:12 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Alex Williamson, Leon Romanovsky, Christoph Hellwig,
	Andrew Morton, Bjorn Helgaas, Christian König, dri-devel,
	iommu, Jens Axboe, Jérôme Glisse, Joerg Roedel, kvm,
	linaro-mm-sig, linux-block, linux-kernel, linux-media, linux-mm,
	linux-pci, Logan Gunthorpe, Marek Szyprowski, Robin Murphy,
	Sumit Semwal, Vivek Kasireddy, Will Deacon

On Wed, Jul 23, 2025 at 04:00:03PM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
> 
> Extract the core P2PDMA provider information (device owner and bus
> offset) from the dev_pagemap into a dedicated p2pdma_provider structure.
> This creates a cleaner separation between the memory management layer and
> the P2PDMA functionality.
> 
> The new p2pdma_provider structure contains:
> - owner: pointer to the providing device
> - bus_offset: computed offset for non-host transactions
> 
> This refactoring simplifies the P2PDMA state management by removing
> the need to access pgmap internals directly. The pci_p2pdma_map_state
> now stores a pointer to the provider instead of the pgmap, making
> the API more explicit and easier to understand.

Based on the conversation how about this as a commit message:

PCI/P2PDMA: Separate the mmap() support from the core logic

Currently the P2PDMA code requires a pgmap and a struct page to
function. This was serving three important purposes:

 - DMA API compatibility, where scatterlist required a struct page as
   input

 - Life cycle management, the percpu_ref is used to prevent UAF during
   device hot unplug

 - A way to get the P2P provider data through the pci_p2pdma_pagemap

The DMA API now has a new flow, and has gained phys_addr_t support, so
it no longer needs struct pages to perform P2P mapping.

Lifecycle management can be delegated to the user; DMABUF, for instance,
has a suitable invalidation protocol that does not require struct
page.

Finding the P2P provider data can also be managed by the caller
without the need to look it up from the phys_addr.

Split the P2PDMA code into two layers. The optional upper layer
effectively provides a way to mmap() P2P memory into a VMA by
providing struct page, a pgmap, a genalloc and sysfs.

The lower layer provides the actual P2P infrastructure and is wrapped
up in a new struct p2pdma_provider. Rework the mmap layer to use new
p2pdma_provider based APIs.

Drivers that do not want to put P2P memory into VMAs can allocate a
struct p2pdma_provider after probe() starts and free it before
remove() completes. When DMA mapping the driver must convey the struct
p2pdma_provider to the DMA mapping code along with a phys_addr of the
MMIO BAR slice to map. The driver must ensure that no DMA mapping
outlives the lifetime of the struct p2pdma_provider.

The intended target of this new API layer is DMABUF. There is usually
only a single p2pdma_provider for a DMABUF exporter. Most drivers can
establish the p2pdma_provider during probe, access the single instance
during DMABUF attach and use that to drive the DMA mapping.

DMABUF provides an invalidation mechanism that can guarantee all DMA
is halted and the DMA mappings are undone prior to destroying the
struct p2pdma_provider. This ensures there is no UAF through DMABUFs
that are lingering past driver removal.

The new p2pdma_provider layer cannot be used to create P2P memory that
can be mapped into VMAs, used with pin_user_pages(), O_DIRECT, and
so on. These use cases must still use the mmap() layer. The
p2pdma_provider layer is principally for DMABUF-like use cases where
DMABUF natively manages the life cycle and access instead of
vmas/pin_user_pages()/struct page.
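
To make the driver-side contract concrete, a sketch (the helper names
here are hypothetical; only struct p2pdma_provider is from this
series):

static int drv_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	struct drv_state *d = drv_state_alloc(pdev);		/* hypothetical */

	/* Allocate the provider after probe() starts. */
	d->provider = p2pdma_provider_create(&pdev->dev);	/* hypothetical */
	return PTR_ERR_OR_ZERO(d->provider);
}

static void drv_remove(struct pci_dev *pdev)
{
	struct drv_state *d = pci_get_drvdata(pdev);

	/* Ensure no DMA mapping outlives the provider, then free it
	 * before remove() completes. */
	drv_invalidate_all_dmabufs(d);				/* hypothetical */
	p2pdma_provider_destroy(d->provider);			/* hypothetical */
}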

Jason

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 10/10] vfio/pci: Add dma-buf export support for MMIO regions
  2025-07-23 13:00 ` [PATCH 10/10] vfio/pci: Add dma-buf export support for MMIO regions Leon Romanovsky
  2025-07-24  5:13   ` Kasireddy, Vivek
@ 2025-07-29 19:44   ` Robin Murphy
  2025-07-29 20:13     ` Jason Gunthorpe
  1 sibling, 1 reply; 54+ messages in thread
From: Robin Murphy @ 2025-07-29 19:44 UTC (permalink / raw)
  To: Leon Romanovsky, Alex Williamson
  Cc: Leon Romanovsky, Christoph Hellwig, Jason Gunthorpe,
	Andrew Morton, Bjorn Helgaas, Christian König, dri-devel,
	iommu, Jens Axboe, Jérôme Glisse, Joerg Roedel, kvm,
	linaro-mm-sig, linux-block, linux-kernel, linux-media, linux-mm,
	linux-pci, Logan Gunthorpe, Marek Szyprowski, Sumit Semwal,
	Vivek Kasireddy, Will Deacon

On 2025-07-23 2:00 pm, Leon Romanovsky wrote:
[...]
> +static struct sg_table *
> +vfio_pci_dma_buf_map(struct dma_buf_attachment *attachment,
> +		     enum dma_data_direction dir)
> +{
> +	struct vfio_pci_dma_buf *priv = attachment->dmabuf->priv;
> +	struct p2pdma_provider *provider = priv->vdev->provider;
> +	struct dma_iova_state *state = attachment->priv;
> +	struct phys_vec *phys_vec = &priv->phys_vec;
> +	struct scatterlist *sgl;
> +	struct sg_table *sgt;
> +	dma_addr_t addr;
> +	int ret;
> +
> +	dma_resv_assert_held(priv->dmabuf->resv);
> +
> +	sgt = kzalloc(sizeof(*sgt), GFP_KERNEL);
> +	if (!sgt)
> +		return ERR_PTR(-ENOMEM);
> +
> +	ret = sg_alloc_table(sgt, 1, GFP_KERNEL | __GFP_ZERO);
> +	if (ret)
> +		goto err_kfree_sgt;
> +
> +	sgl = sgt->sgl;
> +
> +	if (!state) {
> +		addr = pci_p2pdma_bus_addr_map(provider, phys_vec->paddr);
> +	} else if (dma_use_iova(state)) {
> +		ret = dma_iova_link(attachment->dev, state, phys_vec->paddr, 0,
> +				    phys_vec->len, dir, DMA_ATTR_SKIP_CPU_SYNC);

The supposed benefits of this API are only for replacing scatterlists 
where multiple disjoint pages are being mapped. In this case with just 
one single contiguous mapping, it is clearly objectively worse to have 
to bounce in and out of the IOMMU layer 3 separate times and store a 
dma_map_state, to achieve the exact same operations that a single call 
to iommu_dma_map_resource() will perform more efficiently and with no 
external state required.
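
That is, a sketch of the single-call equivalent (dma_map_resource() is
the existing driver-facing wrapper over iommu_dma_map_resource()):

	addr = dma_map_resource(attachment->dev, phys_vec->paddr,
				phys_vec->len, dir, DMA_ATTR_SKIP_CPU_SYNC);
	if (dma_mapping_error(attachment->dev, addr))
		return ERR_PTR(-EIO);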

Oh yeah, and mapping MMIO with regular memory attributes (IOMMU_CACHE) 
rather than appropriate ones (IOMMU_MMIO), as this will end up doing, 
isn't guaranteed not to end badly either (e.g. if the system 
interconnect ends up merging consecutive write bursts and exceeding the 
target root port's MPS.)

> +		if (ret)
> +			goto err_free_table;
> +
> +		ret = dma_iova_sync(attachment->dev, state, 0, phys_vec->len);
> +		if (ret)
> +			goto err_unmap_dma;
> +
> +		addr = state->addr;
> +	} else {
> +		addr = dma_map_phys(attachment->dev, phys_vec->paddr,
> +				    phys_vec->len, dir, DMA_ATTR_SKIP_CPU_SYNC);

And again, if the IOMMU is in bypass (the idea of P2P with vfio-noiommu 
simply isn't worth entertaining) then what purpose do you imagine this 
call serves at all, other than to hilariously crash under 
"swiotlb=force"? Even in the case that phys_to_dma(phys_vec->paddr) != 
phys_vec->paddr, in almost all circumstances (both hardware offsets and 
CoCo environments with address-based aliasing), it is more likely than 
not that the latter is still the address you want and the former is 
wrong (and liable to lead to corruption or fatal system errors), because 
MMIO and memory remain fundamentally different things.

AFAICS you're *depending* on this call being an effective no-op, and 
thus only demonstrating that the dma_map_phys() idea is still entirely 
unnecessary.

> +		ret = dma_mapping_error(attachment->dev, addr);
> +		if (ret)
> +			goto err_free_table;
> +	}
> +
> +	fill_sg_entry(sgl, phys_vec->len, addr);
> +	return sgt;
> +
> +err_unmap_dma:
> +	dma_iova_destroy(attachment->dev, state, phys_vec->len, dir,
> +			 DMA_ATTR_SKIP_CPU_SYNC);
> +err_free_table:
> +	sg_free_table(sgt);
> +err_kfree_sgt:
> +	kfree(sgt);
> +	return ERR_PTR(ret);
> +}
> +
> +static void vfio_pci_dma_buf_unmap(struct dma_buf_attachment *attachment,
> +				   struct sg_table *sgt,
> +				   enum dma_data_direction dir)
> +{
> +	struct vfio_pci_dma_buf *priv = attachment->dmabuf->priv;
> +	struct dma_iova_state *state = attachment->priv;
> +	struct scatterlist *sgl;
> +	int i;
> +
> +	if (!state)
> +		; /* Do nothing */
> +	else if (dma_use_iova(state))
> +		dma_iova_destroy(attachment->dev, state, priv->phys_vec.len,
> +				 dir, DMA_ATTR_SKIP_CPU_SYNC);
> +	else
> +		for_each_sgtable_dma_sg(sgt, sgl, i)

The table always has exactly one entry...

Thanks,
Robin.

> +			dma_unmap_phys(attachment->dev, sg_dma_address(sgl),
> +				       sg_dma_len(sgl), dir,
> +				       DMA_ATTR_SKIP_CPU_SYNC);
> +
> +	sg_free_table(sgt);
> +	kfree(sgt);
> +}

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 10/10] vfio/pci: Add dma-buf export support for MMIO regions
  2025-07-29 19:44   ` Robin Murphy
@ 2025-07-29 20:13     ` Jason Gunthorpe
  2025-07-30  9:32       ` Leon Romanovsky
  2025-07-30 14:49       ` Robin Murphy
  0 siblings, 2 replies; 54+ messages in thread
From: Jason Gunthorpe @ 2025-07-29 20:13 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Leon Romanovsky, Alex Williamson, Leon Romanovsky,
	Christoph Hellwig, Andrew Morton, Bjorn Helgaas,
	Christian König, dri-devel, iommu, Jens Axboe,
	Jérôme Glisse, Joerg Roedel, kvm, linaro-mm-sig,
	linux-block, linux-kernel, linux-media, linux-mm, linux-pci,
	Logan Gunthorpe, Marek Szyprowski, Sumit Semwal, Vivek Kasireddy,
	Will Deacon

On Tue, Jul 29, 2025 at 08:44:21PM +0100, Robin Murphy wrote:

> In this case with just one single
> contiguous mapping, it is clearly objectively worse to have to bounce in and
> out of the IOMMU layer 3 separate times and store a dma_map_state,

The non-contiguous mappings are coming back; they were in earlier drafts
of this. Regardless, the point is to show how to use the general API
that we would want to bring into the DRM drivers that don't have
contiguity, even though VFIO is a bit special.

> Oh yeah, and mapping MMIO with regular memory attributes (IOMMU_CACHE)
> rather than appropriate ones (IOMMU_MMIO), as this will end up doing, isn't
> guaranteed not to end badly either (e.g. if the system interconnect ends up
> merging consecutive write bursts and exceeding the target root port's MPS.)

Yes, I recently noticed this too; it should be fixed.

But so we are all on the same page, a lot of the PCI P2P systems are
set up so P2P does not transit through the iommu. It either takes the
ACS path through a switch or it uses ATS and takes a different ACS
path through a switch. It only transits through the iommu in
misconfigured systems or in the rarer case of P2P between root ports.

> And again, if the IOMMU is in bypass (the idea of P2P with vfio-noiommu simply
> isn't worth entertaining) 

Not quite. DMABUF is sort of upside down.

For example if we are exporting a DMABUF from VFIO and importing it to
RDMA then RDMA will call VFIO to make an attachment and the above VFIO
code will perform the DMA map to the RDMA struct device. DMABUF
returns a dma mapped scatterlist back to the RDMA driver.

The above dma_map_phys(rdma_dev,...) can be in bypass because the rdma
device can legitimately be in bypass, or not have an iommu, or
whatever.

> AFAICS you're *depending* on this call being an effective no-op, and thus
> only demonstrating that the dma_map_phys() idea is still entirely
> unnecessary.

It should not be a full no-op, and it should be closer to
dma_map_resource() to avoid the MMIO issues.

It should fail for cases where it is not supported (i.e.
swiotlb=force), it should still call the legacy dma_ops, and it
should undo any CC mangling with the address. (Also, the
pci_p2pdma_bus_addr_map() needs to deal with any CC issues too.)
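
Roughly, as a sketch of the assumed logic (this is not mainline code;
imagine it inside the MMIO path of dma_map_phys()):

	if (ops) {
		/* Legacy dma_ops: use the resource path, fail if absent. */
		if (!ops->map_resource)
			return DMA_MAPPING_ERROR;
		return ops->map_resource(dev, phys, size, dir, attrs);
	}
	if (is_swiotlb_force_bounce(dev))
		return DMA_MAPPING_ERROR;	/* cannot bounce MMIO */
	return dma_direct_map_resource(dev, phys, size, dir, attrs);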

Jason

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 05/10] PCI/P2PDMA: Export pci_p2pdma_map_type() function
  2025-07-28 23:11                 ` Jason Gunthorpe
@ 2025-07-29 20:54                   ` Logan Gunthorpe
  2025-07-29 22:14                     ` Jason Gunthorpe
  2025-07-30  8:03                     ` Leon Romanovsky
  0 siblings, 2 replies; 54+ messages in thread
From: Logan Gunthorpe @ 2025-07-29 20:54 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Leon Romanovsky, Christoph Hellwig, Alex Williamson,
	Andrew Morton, Bjorn Helgaas, Christian König, dri-devel,
	iommu, Jens Axboe, Jérôme Glisse, Joerg Roedel, kvm,
	linaro-mm-sig, linux-block, linux-kernel, linux-media, linux-mm,
	linux-pci, Marek Szyprowski, Robin Murphy, Sumit Semwal,
	Vivek Kasireddy, Will Deacon



On 2025-07-28 17:11, Jason Gunthorpe wrote:
>> If the dma mapping for P2P memory doesn't need to create an iommu
>> mapping then that's fine. But it should be the dma-iommu layer to decide
>> that.
> 
> So above, we can't use dma-iommu.c, it might not be compiled into the
> kernel but the dma_map_phys() path is still valid.

This is an easily solved problem. I did a very rough sketch below to say
it's really not that hard. (Note it has some rough edges that could be
cleaned up and I based it off Leon's git repo which appears to not be
the same as what was posted, but the core concept is sound).

Logan


diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 1853a969e197..da1a6003620a 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1806,6 +1806,22 @@ bool dma_iova_try_alloc(struct device *dev, struct dma_iova_state *state,
 }
 EXPORT_SYMBOL_GPL(dma_iova_try_alloc);

+void dma_iova_try_alloc_p2p(struct p2pdma_provider *provider, struct device *dev,
+		struct dma_iova_state *state, phys_addr_t phys, size_t size)
+{
+	switch (pci_p2pdma_map_type(provider, dev)) {
+	case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
+		dma_iova_try_alloc(dev, state, phys, size);
+		return;
+	case PCI_P2PDMA_MAP_BUS_ADDR:
+		state->bus_addr = true;
+		return;
+	default:
+		return;
+	}
+}
+EXPORT_SYMBOL_GPL(dma_iova_try_alloc_p2p);
+
 /**
  * dma_iova_free - Free an IOVA space
  * @dev: Device to free the IOVA space for
diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c
index 455541d21538..5749be3a9b58 100644
--- a/drivers/vfio/pci/vfio_pci_dmabuf.c
+++ b/drivers/vfio/pci/vfio_pci_dmabuf.c
@@ -30,25 +30,12 @@ static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf,
 	if (priv->revoked)
 		return -ENODEV;

-	switch (pci_p2pdma_map_type(priv->vdev->provider, attachment->dev)) {
-	case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
-		break;
-	case PCI_P2PDMA_MAP_BUS_ADDR:
-		/*
-		 * There is no need in IOVA at all for this flow.
-		 * We rely on attachment->priv == NULL as a marker
-		 * for this mode.
-		 */
-		return 0;
-	default:
-		return -EINVAL;
-	}
-
 	attachment->priv = kzalloc(sizeof(struct dma_iova_state), GFP_KERNEL);
 	if (!attachment->priv)
 		return -ENOMEM;

-	dma_iova_try_alloc(attachment->dev, attachment->priv, 0, priv->size);
+	dma_iova_try_alloc_p2p(priv->vdev->provider, attachment->dev,
+			       attachment->priv, 0, priv->size);
 	return 0;
 }

@@ -98,26 +85,11 @@ vfio_pci_dma_buf_map(struct dma_buf_attachment *attachment,
 	sgl = sgt->sgl;

 	for (i = 0; i < priv->nr_ranges; i++) {
-		if (!state) {
-			addr = pci_p2pdma_bus_addr_map(provider,
-						       phys_vec[i].paddr);
-		} else if (dma_use_iova(state)) {
-			ret = dma_iova_link(attachment->dev, state,
-					    phys_vec[i].paddr, 0,
-					    phys_vec[i].len, dir, attrs);
-			if (ret)
-				goto err_unmap_dma;
-
-			mapped_len += phys_vec[i].len;
-		} else {
-			addr = dma_map_phys(attachment->dev, phys_vec[i].paddr,
-					    phys_vec[i].len, dir, attrs);
-			ret = dma_mapping_error(attachment->dev, addr);
-			if (ret)
-				goto err_unmap_dma;
-		}
+		addr = dma_map_phys_prealloc(attachment->dev, phys_vec[i].paddr,
+					     phys_vec[i].len, dir, attrs, state,
+					     provider);

-		if (!state || !dma_use_iova(state)) {
+		if (addr != DMA_MAPPING_USE_IOVA) {
 			/*
 			 * In IOVA case, there is only one SG entry which spans
 			 * for whole IOVA address space. So there is no need
@@ -128,7 +100,7 @@ vfio_pci_dma_buf_map(struct dma_buf_attachment *attachment,
 		}
 	}

-	if (state && dma_use_iova(state)) {
+	if (addr == DMA_MAPPING_USE_IOVA) {
 		WARN_ON_ONCE(mapped_len != priv->size);
 		ret = dma_iova_sync(attachment->dev, state, 0, mapped_len);
 		if (ret)
@@ -139,7 +111,7 @@ vfio_pci_dma_buf_map(struct dma_buf_attachment *attachment,
 	return sgt;

 err_unmap_dma:
-	if (!i || !state)
+	if (!i || state->bus_addr)
 		; /* Do nothing */
 	else if (dma_use_iova(state))
 		dma_iova_destroy(attachment->dev, state, mapped_len, dir,
@@ -164,7 +136,7 @@ static void vfio_pci_dma_buf_unmap(struct dma_buf_attachment *attachment,
 	struct scatterlist *sgl;
 	int i;

-	if (!state)
+	if (state->bus_addr)
 		; /* Do nothing */
 	else if (dma_use_iova(state))
 		dma_iova_destroy(attachment->dev, state, priv->size, dir,
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index ba54bbeca861..675e5ac13265 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -70,11 +70,14 @@
  */
 #define DMA_MAPPING_ERROR		(~(dma_addr_t)0)

+#define DMA_MAPPING_USE_IOVA		((dma_addr_t)-2)
+
 #define DMA_BIT_MASK(n)	(((n) == 64) ? ~0ULL : ((1ULL<<(n))-1))

 struct dma_iova_state {
 	dma_addr_t addr;
 	u64 __size;
+	bool bus_addr;
 };

 /*
@@ -120,6 +123,12 @@ void dma_unmap_page_attrs(struct device *dev, dma_addr_t addr, size_t size,
 		enum dma_data_direction dir, unsigned long attrs);
 dma_addr_t dma_map_phys(struct device *dev, phys_addr_t phys, size_t size,
 		enum dma_data_direction dir, unsigned long attrs);
+
+struct p2pdma_provider;
+dma_addr_t dma_map_phys_prealloc(struct device *dev, phys_addr_t phys, size_t size,
+		enum dma_data_direction dir, unsigned long attrs,
+		struct dma_iova_state *state, struct p2pdma_provider *provider);
+
 void dma_unmap_phys(struct device *dev, dma_addr_t addr, size_t size,
 		enum dma_data_direction dir, unsigned long attrs);
 unsigned int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
@@ -321,6 +330,8 @@ static inline bool dma_use_iova(struct dma_iova_state *state)

 bool dma_iova_try_alloc(struct device *dev, struct dma_iova_state *state,
 		phys_addr_t phys, size_t size);
+void dma_iova_try_alloc_p2p(struct p2pdma_provider *provider, struct device *dev,
+		struct dma_iova_state *state, phys_addr_t phys, size_t size);
 void dma_iova_free(struct device *dev, struct dma_iova_state *state);
 void dma_iova_destroy(struct device *dev, struct dma_iova_state *state,
 		size_t mapped_len, enum dma_data_direction dir,
@@ -343,6 +354,11 @@ static inline bool dma_iova_try_alloc(struct device *dev,
 {
 	return false;
 }
+static inline void dma_iova_try_alloc_p2p(struct p2pdma_provider *provider,
+		struct device *dev, struct dma_iova_state *state, phys_addr_t phys,
+		size_t size)
+{
+}
 static inline void dma_iova_free(struct device *dev,
 		struct dma_iova_state *state)
 {
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index e1586eb52ab3..b2110098a29b 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -13,6 +13,7 @@
 #include <linux/iommu-dma.h>
 #include <linux/kmsan.h>
 #include <linux/of_device.h>
+#include <linux/pci-p2pdma.h>
 #include <linux/slab.h>
 #include <linux/vmalloc.h>
 #include "debug.h"
@@ -202,6 +203,27 @@ dma_addr_t dma_map_phys(struct device *dev, phys_addr_t phys, size_t size,
 }
 EXPORT_SYMBOL_GPL(dma_map_phys);

+dma_addr_t dma_map_phys_prealloc(struct device *dev, phys_addr_t phys, size_t size,
+		enum dma_data_direction dir, unsigned long attrs,
+		struct dma_iova_state *state, struct p2pdma_provider *provider)
+{
+	int ret;
+
+	if (state->bus_addr)
+		return pci_p2pdma_bus_addr_map(provider, phys);
+
+	if (dma_use_iova(state)) {
+		ret = dma_iova_link(dev, state, phys, 0, size, dir, attrs);
+		if (ret)
+			return DMA_MAPPING_ERROR;
+
+		return DMA_MAPPING_USE_IOVA;
+	}
+
+	return dma_map_phys(dev, phys, size, dir, attrs);
+}
+EXPORT_SYMBOL_GPL(dma_map_phys_prealloc);
+
 dma_addr_t dma_map_page_attrs(struct device *dev, struct page *page,
 		size_t offset, size_t size, enum dma_data_direction dir,
 		unsigned long attrs)


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [PATCH 05/10] PCI/P2PDMA: Export pci_p2pdma_map_type() function
  2025-07-29 20:54                   ` Logan Gunthorpe
@ 2025-07-29 22:14                     ` Jason Gunthorpe
  2025-07-30  8:03                     ` Leon Romanovsky
  1 sibling, 0 replies; 54+ messages in thread
From: Jason Gunthorpe @ 2025-07-29 22:14 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Leon Romanovsky, Christoph Hellwig, Alex Williamson,
	Andrew Morton, Bjorn Helgaas, Christian König, dri-devel,
	iommu, Jens Axboe, Jérôme Glisse, Joerg Roedel, kvm,
	linaro-mm-sig, linux-block, linux-kernel, linux-media, linux-mm,
	linux-pci, Marek Szyprowski, Robin Murphy, Sumit Semwal,
	Vivek Kasireddy, Will Deacon

On Tue, Jul 29, 2025 at 02:54:13PM -0600, Logan Gunthorpe wrote:
> 
> 
> On 2025-07-28 17:11, Jason Gunthorpe wrote:
> >> If the dma mapping for P2P memory doesn't need to create an iommu
> >> mapping then that's fine. But it should be the dma-iommu layer to decide
> >> that.
> > 
> > So above, we can't use dma-iommu.c, it might not be compiled into the
> > kernel but the dma_map_phys() path is still valid.
> 
> This is an easily solved problem. I did a very rough sketch below to say
> it's really not that hard. (Note it has some rough edges that could be
> cleaned up and I based it off Leon's git repo which appears to not be
> the same as what was posted, but the core concept is sound).

I did hope for something like this in the early days, but it proved
not so easy to get agreement on the details :(

My feeling was that we should get some actual examples of using this
thing, and then it is far easier to discuss ideas, like yours here, to
improve it. Many of the discussions kind of got confused without
enough actual user code for everyone to refer to.

For instance, the nvme use case is a big driver for the API design, and
it is quite different from these simpler flows; this idea needs to show
how it would work there.

Maybe this idea could also have provider = NULL meaning it is CPU
cacheable memory?

> +static inline void dma_iova_try_alloc_p2p(struct p2pdma_provider *provider,
> +               struct device *dev, struct dma_iova_state *state, phys_addr_t phys,
> +               size_t size)
> +{
> +}

Can't be empty - PCI_P2PDMA_MAP_THRU_HOST_BRIDGE vs
PCI_P2PDMA_MAP_BUS_ADDR still matters, so it still must set
dma_iova_state::bus_addr to get dma_map_phys_prealloc() to do the
right thing.

Still, it would make sense to put something like that in dma/mapping.c
and rely on the static inline stub for dma_iova_try_alloc().
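
i.e. something like this for the stub, sketched with Logan's names from
above:

static inline void dma_iova_try_alloc_p2p(struct p2pdma_provider *provider,
		struct device *dev, struct dma_iova_state *state,
		phys_addr_t phys, size_t size)
{
	/* No dma-iommu here, but the bus-address case must still be marked. */
	if (pci_p2pdma_map_type(provider, dev) == PCI_P2PDMA_MAP_BUS_ADDR)
		state->bus_addr = true;
}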

>   	for (i = 0; i < priv->nr_ranges; i++) {
> -		if (!state) {
> -			addr = pci_p2pdma_bus_addr_map(provider,
> -						       phys_vec[i].paddr);
> -		} else if (dma_use_iova(state)) {
> -			ret = dma_iova_link(attachment->dev, state,
> -					    phys_vec[i].paddr, 0,
> -					    phys_vec[i].len, dir, attrs);
> -			if (ret)
> -				goto err_unmap_dma;
> -
> -			mapped_len += phys_vec[i].len;
> -		} else {
> -			addr = dma_map_phys(attachment->dev, phys_vec[i].paddr,
> -					    phys_vec[i].len, dir, attrs);
> -			ret = dma_mapping_error(attachment->dev, addr);
> -			if (ret)
> -				goto err_unmap_dma;
> -		}
> +		addr = dma_map_phys_prealloc(attachment->dev, phys_vec[i].paddr,
> +					     phys_vec[i].len, dir, attrs, state,
> +					     provider);

There was a draft of something like this at some point. The
DMA_MAPPING_USE_IOVA is a new twist, though.

>  #define DMA_BIT_MASK(n)	(((n) == 64) ? ~0ULL : ((1ULL<<(n))-1))
>   struct dma_iova_state {
>  	dma_addr_t addr;
>  	u64 __size;
> +	bool bus_addr;
>  };

Growing this structure has been strongly pushed back on. This can
probably be solved in some other way, a bitfield on size perhaps.

> 
> +dma_addr_t dma_map_phys_prealloc(struct device *dev, phys_addr_t phys, size_t size,
> +		enum dma_data_direction dir, unsigned long attrs,
> +		struct dma_iova_state *state, struct p2pdma_provider *provider)
> +{
> +	int ret;
> +
> +	if (state->bus_addr)
> +		return pci_p2pdma_bus_addr_map(provider, phys);
> +
> +	if (dma_use_iova(state)) {
> +		ret = dma_iova_link(dev, state, phys, 0, size, dir, attrs);
> +		if (ret)
> +			return DMA_MAPPING_ERROR;
> +
> +		return DMA_MAPPING_USE_IOVA;
> +	}
> +
> +	return dma_map_phys(dev, phys, size, dir, attrs);
> +}
> +EXPORT_SYMBOL_GPL(dma_map_phys_prealloc);

I would be tempted to inline this.

Overall, yeah, I would certainly welcome improvements like this if
everyone can agree, but I'd really like to see nvme merged before we
start working on new ideas. That way the proposal can be properly
evaluated by all the stakeholders.

Jason

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 05/10] PCI/P2PDMA: Export pci_p2pdma_map_type() function
  2025-07-29 20:54                   ` Logan Gunthorpe
  2025-07-29 22:14                     ` Jason Gunthorpe
@ 2025-07-30  8:03                     ` Leon Romanovsky
  1 sibling, 0 replies; 54+ messages in thread
From: Leon Romanovsky @ 2025-07-30  8:03 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Jason Gunthorpe, Christoph Hellwig, Alex Williamson,
	Andrew Morton, Bjorn Helgaas, Christian König, dri-devel,
	iommu, Jens Axboe, Jérôme Glisse, Joerg Roedel, kvm,
	linaro-mm-sig, linux-block, linux-kernel, linux-media, linux-mm,
	linux-pci, Marek Szyprowski, Robin Murphy, Sumit Semwal,
	Vivek Kasireddy, Will Deacon

On Tue, Jul 29, 2025 at 02:54:13PM -0600, Logan Gunthorpe wrote:
> 
> 
> On 2025-07-28 17:11, Jason Gunthorpe wrote:
> >> If the dma mapping for P2P memory doesn't need to create an iommu
> >> mapping then that's fine. But it should be the dma-iommu layer to decide
> >> that.
> > 
> > So above, we can't use dma-iommu.c, it might not be compiled into the
> > kernel but the dma_map_phys() path is still valid.
> 
> This is an easily solved problem. I did a very rough sketch below to say
> it's really not that hard. (Note it has some rough edges that could be
> cleaned up and I based it off Leon's git repo which appears to not be
> the same as what was posted, but the core concept is sound).

I started to prepare v2; this is why the posted version is slightly
different from the dmabuf-vfio branch.

In addition to what Jason wrote, there is extra complexity with using
state. The wrappers which operate on dma_iova_state assume that all
memory which is going to be mapped is of the same type: either p2p or
not.

This is not the case for HMM/RDMA users; there you create the state in
advance and get mixed types of pages.
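
For illustration only (a sketch; the helper usage here is assumed, not
code from this series), a mixed flow has to decide per page:

	for (i = 0; i < npages; i++) {
		phys_addr_t phys = page_to_phys(pages[i]);

		if (is_pci_p2pdma_page(pages[i]) &&
		    map_type == PCI_P2PDMA_MAP_BUS_ADDR)
			dma[i] = pci_p2pdma_bus_addr_map(provider, phys);
		else
			dma[i] = dma_map_phys(dev, phys, PAGE_SIZE, dir, attrs);
	}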

Thanks

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 10/10] vfio/pci: Add dma-buf export support for MMIO regions
  2025-07-29 20:13     ` Jason Gunthorpe
@ 2025-07-30  9:32       ` Leon Romanovsky
  2025-07-30 14:49       ` Robin Murphy
  1 sibling, 0 replies; 54+ messages in thread
From: Leon Romanovsky @ 2025-07-30  9:32 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Robin Murphy, Alex Williamson, Christoph Hellwig, Andrew Morton,
	Bjorn Helgaas, Christian König, dri-devel, iommu, Jens Axboe,
	Jérôme Glisse, Joerg Roedel, kvm, linaro-mm-sig,
	linux-block, linux-kernel, linux-media, linux-mm, linux-pci,
	Logan Gunthorpe, Marek Szyprowski, Sumit Semwal, Vivek Kasireddy,
	Will Deacon

On Tue, Jul 29, 2025 at 05:13:51PM -0300, Jason Gunthorpe wrote:
> On Tue, Jul 29, 2025 at 08:44:21PM +0100, Robin Murphy wrote:
> 
> > In this case with just one single
> > contiguous mapping, it is clearly objectively worse to have to bounce in and
> > out of the IOMMU layer 3 separate times and store a dma_map_state,
> 
> The non-contiguous mappings are comming back, it was in earlier drafts
> of this. Regardless, the point is to show how to use the general API
> that we would want to bring into the DRM drivers that don't have
> contiguity even though VFIO is a bit special.

Yes, we will see the comeback of DMA ranges in v2.

Thanks

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 10/10] vfio/pci: Add dma-buf export support for MMIO regions
  2025-07-29 20:13     ` Jason Gunthorpe
  2025-07-30  9:32       ` Leon Romanovsky
@ 2025-07-30 14:49       ` Robin Murphy
  2025-07-30 16:01         ` Jason Gunthorpe
  1 sibling, 1 reply; 54+ messages in thread
From: Robin Murphy @ 2025-07-30 14:49 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Leon Romanovsky, Alex Williamson, Leon Romanovsky,
	Christoph Hellwig, Andrew Morton, Bjorn Helgaas,
	Christian König, dri-devel, iommu, Jens Axboe,
	Jérôme Glisse, Joerg Roedel, kvm, linaro-mm-sig,
	linux-block, linux-kernel, linux-media, linux-mm, linux-pci,
	Logan Gunthorpe, Marek Szyprowski, Sumit Semwal, Vivek Kasireddy,
	Will Deacon

On 2025-07-29 9:13 pm, Jason Gunthorpe wrote:
> On Tue, Jul 29, 2025 at 08:44:21PM +0100, Robin Murphy wrote:
> 
>> In this case with just one single
>> contiguous mapping, it is clearly objectively worse to have to bounce in and
>> out of the IOMMU layer 3 separate times and store a dma_map_state,
> 
> The non-contiguous mappings are comming back, it was in earlier drafts
> of this. Regardless, the point is to show how to use the general API
> that we would want to bring into the DRM drivers that don't have
> contiguity even though VFIO is a bit special.
> 
>> Oh yeah, and mapping MMIO with regular memory attributes (IOMMU_CACHE)
>> rather than appropriate ones (IOMMU_MMIO), as this will end up doing, isn't
>> guaranteed not to end badly either (e.g. if the system interconnect ends up
>> merging consecutive write bursts and exceeding the target root port's MPS.)
> 
> Yes, I recently noticed this too, it should be fixed..
> 
> But so we are all on the same page, alot of the PCI P2P systems are
> setup so P2P does not transit through the iommu. It either takes the
> ACS path through a switch or it uses ATS and takes a different ACS
> path through a switch. It only transits through the iommu in
> misconfigured systems or in the rarer case of P2P between root ports.

For non-ATS (and ATS Untranslated traffic), my understanding is that we 
rely on ACS upstream redirect to send transactions all the way up to the 
root port for translation (and without that then they are indeed pure 
bus addresses, take the pci_p2pdma_bus_addr_map() case, and the rest of 
this is all irrelevant). In Arm system terms, simpler root ports may 
well have to run that traffic out to an external SMMU TBU, at which 
point any P2P would loop back externally through the memory space window 
in the system interconnect PA space, as opposed to DTI-ATS root 
complexes that effectively implement their own internal translation 
agent on the PCIe side. Thus on some systems, even P2P behind a single 
root port may end up looking functionally the same as the cross-RP case, 
but in general cross-RP *is* something that people seem to care about as 
well. We're seeing more and more systems where each slot has its own RP 
as a separate segment, rather than giant root complexes with a host 
bridge and everyone on one big happy root bus together.

>> And again, if the IOMMU is in bypass (the idea of P2P with vfio-noiommu simply
>> isn't worth entertaining)
> 
> Not quite. DMABUF is sort of upside down.
> 
> For example if we are exporting a DMABUF from VFIO and importing it to
> RDMA then RDMA will call VFIO to make an attachment and the above VFIO
> code will perform the DMA map to the RDMA struct device. DMABUF
> returns a dma mapped scatterlist back to the RDMA driver.
> 
> The above dma_map_phys(rdma_dev,...) can be in bypass because the rdma
> device can legitimately be in bypass, or not have a iommu, or
> whatever.

I understand how dma-buf works - obviously DMA mapping for the VFIO 
device itself while it's not even attached to its default domain would 
be silly. I mean that any system that has 64-bit coherent PCIe behind an 
IOMMU such that this VFIO exporter could exist, is realistically going 
to have the same (or equivalent) IOMMU in front of any potential 
importers as well. *Especially* if you expect the normal case for P2P to 
be within a single hierarchy. Thus I was simply commenting that 
IOMMU_DOMAIN_IDENTITY is the *only* realistic reason to actually expect 
to interact with dma-direct here.

But of course, if it's not dma-direct because we're on POWER with TCE, 
rather than VFIO Type1 implying an iommu-dma/dma-direct arch, then who 
knows? I imagine the complete absence of any mention means this hasn't 
been tried, or possibly even considered?

>> AFAICS you're *depending* on this call being an effective no-op, and thus
>> only demonstrating that the dma_map_phys() idea is still entirely
>> unnecessary.
> 
> It should not be a full no-op, and it should be closer to
> dma map resource to avoid the mmio issues.

I don't get what you mean by "not be a full no-op", can you clarify 
exactly what you think it should be doing? Even if it's just the 
dma_capable() mask check equivalent to dma_direct_map_resource(), you 
don't actually want that here either - in that case you'd want to fail 
the entire attachment to begin with since it can never work.

> It should be failing for cases where it is not supported (ie
> swiotlb=force), it should still be calling the legacy dma_ops, and it
> should be undoing any CC mangling with the address. (also the
> pci_p2pdma_bus_addr_map() needs to deal with any CC issues too)

Um, my whole point is that the "legacy DMA ops" cannot be called, 
because they still assume page-backed memory, so at best are guaranteed 
to fail; any "CC mangling" assumed for memory is most likely wrong for 
MMIO, and there simply is no "deal with" at this point.

A device BAR is simply not under control of the trusted hypervisor the 
same way memory is; whatever (I/G)PA it is at must already be the 
correct address, if the aliasing scheme even applies at all. Sticking to 
Arm CCA terminology for example, if a device in shared state tries to 
import a BAR from a device in locked/private state, there is no notion 
of touching the shared alias and hoping it somehow magically works (at 
best it might throw the exporting device into TDISP error state 
terminally); that attachment simply cannot be allowed. If an shared 
resource exists in the shared IPA space to begin with, dma_to_phys() 
will do the wrong thing, and even phys_to_dma() would technically not 
walk dma_range_map correctly, because both assume "phys" represents 
kernel memory. However it's also all moot since any attempt at any 
combination will fail anyway due to SWIOTLB being forced by 
is_realm_world().

(OK, I admit "crash" wasn't strictly the right word to use there - I 
keep forgetting that some of the P2P scatterlist support in dma-direct 
ended up affecting the map_page path too, even though that was never 
really the functional intent - but hey, the overall result of failing to 
work as expected is the same.)

Thanks,
Robin.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 10/10] vfio/pci: Add dma-buf export support for MMIO regions
  2025-07-30 14:49       ` Robin Murphy
@ 2025-07-30 16:01         ` Jason Gunthorpe
  0 siblings, 0 replies; 54+ messages in thread
From: Jason Gunthorpe @ 2025-07-30 16:01 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Leon Romanovsky, Alex Williamson, Leon Romanovsky,
	Christoph Hellwig, Andrew Morton, Bjorn Helgaas,
	Christian König, dri-devel, iommu, Jens Axboe,
	Jérôme Glisse, Joerg Roedel, kvm, linaro-mm-sig,
	linux-block, linux-kernel, linux-media, linux-mm, linux-pci,
	Logan Gunthorpe, Marek Szyprowski, Sumit Semwal, Vivek Kasireddy,
	Will Deacon

On Wed, Jul 30, 2025 at 03:49:45PM +0100, Robin Murphy wrote:
> On 2025-07-29 9:13 pm, Jason Gunthorpe wrote:
> > On Tue, Jul 29, 2025 at 08:44:21PM +0100, Robin Murphy wrote:
> > 
> > > In this case with just one single
> > > contiguous mapping, it is clearly objectively worse to have to bounce in and
> > > out of the IOMMU layer 3 separate times and store a dma_map_state,
> > 
> > The non-contiguous mappings are comming back, it was in earlier drafts
> > of this. Regardless, the point is to show how to use the general API
> > that we would want to bring into the DRM drivers that don't have
> > contiguity even though VFIO is a bit special.
> > 
> > > Oh yeah, and mapping MMIO with regular memory attributes (IOMMU_CACHE)
> > > rather than appropriate ones (IOMMU_MMIO), as this will end up doing, isn't
> > > guaranteed not to end badly either (e.g. if the system interconnect ends up
> > > merging consecutive write bursts and exceeding the target root port's MPS.)
> > 
> > Yes, I recently noticed this too, it should be fixed..
> > 
> > But so we are all on the same page, alot of the PCI P2P systems are
> > setup so P2P does not transit through the iommu. It either takes the
> > ACS path through a switch or it uses ATS and takes a different ACS
> > path through a switch. It only transits through the iommu in
> > misconfigured systems or in the rarer case of P2P between root ports.
> 
> For non-ATS (and ATS Untranslated traffic), my understanding is that we rely
> on ACS upstream redirect to send transactions all the way up to the root
> port for translation (and without that then they are indeed pure bus
> addresses, take the pci_p2pdma_bus_addr_map() case,

My point is it is common for real systems to take the pci_p2pdma_bus_addr_map()
path. Going through the RP is too slow.

> all irrelevant). In Arm system terms, simpler root ports may well have to
> run that traffic out to an external SMMU TBU, at which point any P2P would
> loop back externally through the memory space window in the system

Many real systems simply don't support this at all :(

> But of course, if it's not dma-direct because we're on POWER with TCE,
> rather than VFIO Type1 implying an iommu-dma/dma-direct arch, then who
> knows? I imagine the complete absence of any mention means this hasn't been
> tried, or possibly even considered?

POWER uses dma_ops, and the point of this design is that dma_map_phys()
will still call the dma_ops. See below.

> I don't get what you mean by "not be a full no-op", can you clarify exactly
> what you think it should be doing? Even if it's just the dma_capable() mask
> check equivalent to dma_direct_map_resource(), you don't actually want that
> here either - in that case you'd want to fail the entire attachment to begin
> with since it can never work.

The expectation would be that if the DMA mapping can't succeed, then the
phys map should fail. So if dma_capable() or whatever is not OK, then
fail inside the loop and unwind back to failing the whole attach.

> > It should be failing for cases where it is not supported (ie
> > swiotlb=force), it should still be calling the legacy dma_ops, and it
> > should be undoing any CC mangling with the address. (also the
> > pci_p2pdma_bus_addr_map() needs to deal with any CC issues too)
> 
> Um, my whole point is that the "legacy DMA ops" cannot be called, because
> they still assume page-backed memory, so at best are guaranteed to fail; any
> "CC mangling" assumed for memory is most likely wrong for MMIO, and there
> simply is no "deal with" at this point.

I think we all agreed it should use the resource path. So legacy DMA
ops, including POWER, should end up calling

struct dma_map_ops {
	dma_addr_t (*map_resource)(struct device *dev, phys_addr_t phys_addr,
			size_t size, enum dma_data_direction dir,
			unsigned long attrs);

And if that is NULL it should fail.

> A device BAR is simply not under control of the trusted hypervisor the same
> way memory is;

I'm not sure what you mean? I think it is, at least for CC I expect
ACS to be set up to force translation, and this squarely puts access to
the MMIO BAR under control of the S2 translation.

In ARM terms I expect that the RMM's S2 will contain the MMIO BAR at
the shared IPA (i.e. top bit set), which will match where the CPU should
access it? Linux's IOMMU S2 should mirror this and put the MMIO BAR at
the shared IPA. Meaning that upon locking, the MMIO phys_addr_t
effectively moves?

At least I would be surprised to hear that shared MMIO was placed in
the private IPA space??

Outside CC we do have a rare configuration where the ACS is not
forcing translation, and then your remarks are true. The hypervisor
must enforce IPA == GPA == bus addr. It's a painful configuration to
make work.

> Sticking to Arm CCA terminology for example, if a device in shared
> state tries to import a BAR from a device in locked/private state,
> there is no notion of touching the shared alias and hoping it
> somehow magically works (at best it might throw the exporting device
> into TDISP error state terminally);

Right, we don't support T=1 DMA yet, or locked devices, but when we do
the p2pdma layer needs to be fixed up to catch this and reject it.

I think it is pretty easy: the p2pdma_provider struct can record
whether the exporting struct device has shared or private MMIO. Then,
when doing the mapping, we require that private MMIO be accessed from
T=1.
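
Sketch of what I mean (the field and the T=1 check are invented for
illustration):

struct p2pdma_provider {
	struct device	*owner;
	u64		bus_offset;
	bool		private_mmio;	/* exporter's BAR is TDISP locked/private */
};

static int p2pdma_check_tsm(struct p2pdma_provider *provider,
			    struct device *importer)
{
	/* Private MMIO may only be mapped for trusted (T=1) DMA. */
	if (provider->private_mmio && !dev_is_trusted_dma(importer))	/* hypothetical */
		return -EPERM;
	return 0;
}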

This should be addressed as part of enabling PCI T=1 support, e.g. in
ARM terms along with Aneesh's series "ARM CCA Device Assignment
support"

> simply cannot be allowed. If an shared resource exists in the shared IPA
> space to begin with, dma_to_phys() will do the wrong thing, and even
> phys_to_dma() would technically not walk dma_range_map correctly, because
> both assume "phys" represents kernel memory. 

As above, for CC I am expecting that translation will always be
required. The S2 in both the RMM and hypervisor SMMUs should have
shared accessibility for whatever phys_addr the CPU is using.

So phys_to_dma() just needs to return the normal CPU phys_addr_t to
work, and this looks believable to me. ARM forces the shared IPA
through dma_addr_unencrypted(), but it is already wrong for the core
code to call that function for "encrypted" MMIO.

Not sure about the ranges or dma_to_phys(); I doubt anyone has ever
tested this, so it probably doesn't work - but I don't see anything
architecturally catastrophic here, just some bugs.

> However it's also all moot since any attempt at any combination will
> fail anyway due to SWIOTLB being forced by is_realm_world().

Yep.

Basically P2P for ARM CCA today needs some bug fixing and testing -
not surprising. ARM CCA is already rare, and even we don't use P2P
under any CC architecture today.

I'm sure it will be fixed as separate work; at least we will soon
care about P2P on ARM CCA working.

Regardless, from a driver perspective none of the CC detail should
leak into VFIO. The P2P APIs and the DMA APIs are the right place to
abstract it away, and yes they probably fail to do so right now.

I'm guessing that if DMA_ATTR_MMIO is agreed then a
DMA_ATTR_MMIO_ENCRYPTED would be the logical step. That should provide
enough detail that the DMA API can compute correct addressing.

Maybe this whole discussion improves the case for DMA_ATTR_MMIO.

Jason

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 00/10] vfio/pci: Allow MMIO regions to be exported through dma-buf
  2025-07-23 13:00 [PATCH 00/10] vfio/pci: Allow MMIO regions to be exported through dma-buf Leon Romanovsky
                   ` (9 preceding siblings ...)
  2025-07-23 13:00 ` [PATCH 10/10] vfio/pci: Add dma-buf export support for MMIO regions Leon Romanovsky
@ 2025-07-30 19:58 ` Alex Williamson
  2025-07-31  0:21   ` Jason Gunthorpe
  10 siblings, 1 reply; 54+ messages in thread
From: Alex Williamson @ 2025-07-30 19:58 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Leon Romanovsky, Christoph Hellwig, Jason Gunthorpe,
	Andrew Morton, Bjorn Helgaas, Christian König, dri-devel,
	iommu, Jens Axboe, Jérôme Glisse, Joerg Roedel, kvm,
	linaro-mm-sig, linux-block, linux-kernel, linux-media, linux-mm,
	linux-pci, Logan Gunthorpe, Marek Szyprowski, Robin Murphy,
	Sumit Semwal, Vivek Kasireddy, Will Deacon

On Wed, 23 Jul 2025 16:00:01 +0300
Leon Romanovsky <leon@kernel.org> wrote:

> From: Leon Romanovsky <leonro@nvidia.com>
> 
> ---------------------------------------------------------------------------
> Based on blk and DMA patches which will be sent during coming merge window.
> ---------------------------------------------------------------------------
> 
> This series extends the VFIO PCI subsystem to support exporting MMIO regions
> from PCI device BARs as dma-buf objects, enabling safe sharing of non-struct
> page memory with controlled lifetime management. This allows RDMA and other
> subsystems to import dma-buf FDs and build them into memory regions for PCI
> P2P operations.
> 
> The series supports a use case for SPDK where a NVMe device will be owned
> by SPDK through VFIO but interacting with a RDMA device. The RDMA device
> may directly access the NVMe CMB or directly manipulate the NVMe device's
> doorbell using PCI P2P.
> 
> However, as a general mechanism, it can support many other scenarios with
> VFIO. This dmabuf approach can be usable by iommufd as well for generic
> and safe P2P mappings.

I think this will eventually enable DMA mapping of device MMIO through
an IOMMUFD IOAS for the VM P2P use cases, right?  How do we get from
what appears to be a point-to-point mapping between two devices to a
shared IOVA between multiple devices?  I'm guessing we need IOMMUFD to
support something like IOMMU_IOAS_MAP_FILE for dma-buf, but I can't
connect all the dots.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH 00/10] vfio/pci: Allow MMIO regions to be exported through dma-buf
  2025-07-30 19:58 ` [PATCH 00/10] vfio/pci: Allow MMIO regions to be exported through dma-buf Alex Williamson
@ 2025-07-31  0:21   ` Jason Gunthorpe
  0 siblings, 0 replies; 54+ messages in thread
From: Jason Gunthorpe @ 2025-07-31  0:21 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Leon Romanovsky, Leon Romanovsky, Christoph Hellwig,
	Andrew Morton, Bjorn Helgaas, Christian König, dri-devel,
	iommu, Jens Axboe, Jérôme Glisse, Joerg Roedel, kvm,
	linaro-mm-sig, linux-block, linux-kernel, linux-media, linux-mm,
	linux-pci, Logan Gunthorpe, Marek Szyprowski, Robin Murphy,
	Sumit Semwal, Vivek Kasireddy, Will Deacon

On Wed, Jul 30, 2025 at 01:58:46PM -0600, Alex Williamson wrote:
> On Wed, 23 Jul 2025 16:00:01 +0300
> Leon Romanovsky <leon@kernel.org> wrote:
> 
> > From: Leon Romanovsky <leonro@nvidia.com>
> > 
> > ---------------------------------------------------------------------------
> > Based on blk and DMA patches which will be sent during coming merge window.
> > ---------------------------------------------------------------------------
> > 
> > This series extends the VFIO PCI subsystem to support exporting MMIO regions
> > from PCI device BARs as dma-buf objects, enabling safe sharing of non-struct
> > page memory with controlled lifetime management. This allows RDMA and other
> > subsystems to import dma-buf FDs and build them into memory regions for PCI
> > P2P operations.
> > 
> > The series supports a use case for SPDK where a NVMe device will be owned
> > by SPDK through VFIO but interacting with a RDMA device. The RDMA device
> > may directly access the NVMe CMB or directly manipulate the NVMe device's
> > doorbell using PCI P2P.
> > 
> > However, as a general mechanism, it can support many other scenarios with
> > VFIO. This dmabuf approach can be usable by iommufd as well for generic
> > and safe P2P mappings.
> 
> I think this will eventually enable DMA mapping of device MMIO through
> an IOMMUFD IOAS for the VM P2P use cases, right?  

This is the plan

> How do we get from
> what appears to be a point-to-point mapping between two devices to a
> shared IOVA between multiple devices?

You have it right below: it is a point-to-point mapping between the
vfio device and the iommufd.

> I'm guessing we need IOMMUFD to support something like
> IOMMU_IOAS_MAP_FILE for dma-buf, 

1) The dma phys series which needs more work
2) This series to get basic 'movable' DMABUF support in VFIO
3) Add 'revokable' as a DMABUF concept and implement it with mlx5 and
   vfio
4) Add some way to get the phys_addr list from the DMABUF
5) IOMMU_IOAS_MAP_FILE using a revokable attachment and the phys_addr
   list. When VFIO does FLR the iommufd can remove the IOPTEs and then
   put them back when FLR is done.

It is not so much more code, but I think every step will take a lot of
work to get agreement.

Then we reuse all of the above with some tweaks for the CC problems
too.

Jason

^ permalink raw reply	[flat|nested] 54+ messages in thread

end of thread

Thread overview: 54+ messages
2025-07-23 13:00 [PATCH 00/10] vfio/pci: Allow MMIO regions to be exported through dma-buf Leon Romanovsky
2025-07-23 13:00 ` [PATCH 01/10] PCI/P2PDMA: Remove redundant bus_offset from map state Leon Romanovsky
2025-07-24  7:50   ` Christoph Hellwig
2025-07-23 13:00 ` [PATCH 02/10] PCI/P2PDMA: Introduce p2pdma_provider structure for cleaner abstraction Leon Romanovsky
2025-07-24  7:51   ` Christoph Hellwig
2025-07-24  7:55     ` Leon Romanovsky
2025-07-24  7:59       ` Christoph Hellwig
2025-07-24  8:07         ` Leon Romanovsky
2025-07-27 18:51         ` Jason Gunthorpe
2025-07-29  7:52           ` Christoph Hellwig
2025-07-29  8:53             ` Leon Romanovsky
2025-07-29 10:41               ` Christoph Hellwig
2025-07-29 11:39                 ` Leon Romanovsky
2025-07-29 13:15             ` Jason Gunthorpe
2025-07-29 16:12   ` Jason Gunthorpe
2025-07-23 13:00 ` [PATCH 03/10] PCI/P2PDMA: Simplify bus address mapping API Leon Romanovsky
2025-07-24  7:52   ` Christoph Hellwig
2025-07-23 13:00 ` [PATCH 04/10] PCI/P2PDMA: Refactor to separate core P2P functionality from memory allocation Leon Romanovsky
2025-07-23 13:00 ` [PATCH 05/10] PCI/P2PDMA: Export pci_p2pdma_map_type() function Leon Romanovsky
2025-07-24  8:03   ` Christoph Hellwig
2025-07-24  8:13     ` Leon Romanovsky
2025-07-25 16:30       ` Logan Gunthorpe
2025-07-25 18:54         ` Leon Romanovsky
2025-07-25 19:12           ` Logan Gunthorpe
2025-07-27  6:01             ` Leon Romanovsky
2025-07-27 19:05         ` Jason Gunthorpe
2025-07-28 16:12           ` Logan Gunthorpe
2025-07-28 16:41             ` Leon Romanovsky
2025-07-28 17:07               ` Logan Gunthorpe
2025-07-28 23:11                 ` Jason Gunthorpe
2025-07-29 20:54                   ` Logan Gunthorpe
2025-07-29 22:14                     ` Jason Gunthorpe
2025-07-30  8:03                     ` Leon Romanovsky
2025-07-29  7:52       ` Christoph Hellwig
2025-07-29  8:45         ` Leon Romanovsky
2025-07-27 19:02     ` Jason Gunthorpe
2025-07-23 13:00 ` [PATCH 06/10] types: move phys_vec definition to common header Leon Romanovsky
2025-07-23 13:00 ` [PATCH 07/10] vfio: Export vfio device get and put registration helpers Leon Romanovsky
2025-07-23 13:00 ` [PATCH 08/10] vfio/pci: Enable peer-to-peer DMA transactions by default Leon Romanovsky
2025-07-23 13:00 ` [PATCH 09/10] vfio/pci: Share the core device pointer while invoking feature functions Leon Romanovsky
2025-07-28 20:55   ` Alex Williamson
2025-07-29  8:39     ` Leon Romanovsky
2025-07-23 13:00 ` [PATCH 10/10] vfio/pci: Add dma-buf export support for MMIO regions Leon Romanovsky
2025-07-24  5:13   ` Kasireddy, Vivek
2025-07-24  5:44     ` Leon Romanovsky
2025-07-25  5:34       ` Kasireddy, Vivek
2025-07-27  6:16         ` Leon Romanovsky
2025-07-29 19:44   ` Robin Murphy
2025-07-29 20:13     ` Jason Gunthorpe
2025-07-30  9:32       ` Leon Romanovsky
2025-07-30 14:49       ` Robin Murphy
2025-07-30 16:01         ` Jason Gunthorpe
2025-07-30 19:58 ` [PATCH 00/10] vfio/pci: Allow MMIO regions to be exported through dma-buf Alex Williamson
2025-07-31  0:21   ` Jason Gunthorpe
