* [PATCH v8 01/11] PCI/P2PDMA: Separate the mmap() support from the core logic
2025-11-11 9:57 [PATCH v8 00/11] vfio/pci: Allow MMIO regions to be exported through dma-buf Leon Romanovsky
@ 2025-11-11 9:57 ` Leon Romanovsky
2025-11-11 9:57 ` [PATCH v8 02/11] PCI/P2PDMA: Simplify bus address mapping API Leon Romanovsky
` (9 subsequent siblings)
10 siblings, 0 replies; 63+ messages in thread
From: Leon Romanovsky @ 2025-11-11 9:57 UTC (permalink / raw)
To: Bjorn Helgaas, Logan Gunthorpe, Jens Axboe, Robin Murphy,
Joerg Roedel, Will Deacon, Marek Szyprowski, Jason Gunthorpe,
Leon Romanovsky, Andrew Morton, Jonathan Corbet, Sumit Semwal,
Christian König, Kees Cook, Gustavo A. R. Silva,
Ankit Agrawal, Yishai Hadas, Shameer Kolothum, Kevin Tian,
Alex Williamson
Cc: Krishnakant Jaju, Matt Ochs, linux-pci, linux-kernel, linux-block,
iommu, linux-mm, linux-doc, linux-media, dri-devel, linaro-mm-sig,
kvm, linux-hardening, Alex Mastro, Nicolin Chen
From: Leon Romanovsky <leonro@nvidia.com>
Currently the P2PDMA code requires a pgmap and a struct page to
function. These were serving three important purposes:
- DMA API compatibility, where scatterlist required a struct page as
input
- Life cycle management, the percpu_ref is used to prevent UAF during
device hot unplug
- A way to get the P2P provider data through the pci_p2pdma_pagemap
The DMA API now has a new flow, and has gained phys_addr_t support, so
it no longer needs struct pages to perform P2P mapping.
Lifecycle management can be delegated to the user; DMABUF, for instance,
has a suitable invalidation protocol that does not require struct page.
Finding the P2P provider data can also be managed by the caller without
needing to look it up from the phys_addr.
Split the P2PDMA code into two layers. The optional upper layer,
effectively, provides a way to mmap() P2P memory into a VMA by
providing struct page, pgmap, a genalloc and sysfs.
The lower layer provides the actual P2P infrastructure and is wrapped
up in a new struct p2pdma_provider. Rework the mmap layer to use new
p2pdma_provider based APIs.
Drivers that do not want to put P2P memory into VMAs can allocate a
struct p2pdma_provider after probe() starts and free it before
remove() completes. When DMA mapping the driver must convey the struct
p2pdma_provider to the DMA mapping code along with a phys_addr of the
MMIO BAR slice to map. The driver must ensure that no DMA mapping
outlives the lifetime of the struct p2pdma_provider.
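As a rough sketch of this lifetime contract (the surrounding driver
structure and helpers here are hypothetical; later patches in this series
add proper init/lookup helpers so a driver does not fill the structure by
hand):

  struct my_provider_dev {
          struct pci_dev *pdev;
          /* valid from probe() until the driver tears it down */
          struct p2pdma_provider provider;
  };

  static void my_provider_setup(struct my_provider_dev *md, int bar)
  {
          md->provider.owner = &md->pdev->dev;
          md->provider.bus_offset = pci_bus_address(md->pdev, bar) -
                                    pci_resource_start(md->pdev, bar);
  }

  static void my_provider_teardown(struct my_provider_dev *md)
  {
          /*
           * Before this point the driver must have revoked all DMABUF
           * importers and unmapped every DMA mapping created against
           * &md->provider.
           */
  }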
The intended target of this new API layer is DMABUF. There is usually
only a single p2pdma_provider for a DMABUF exporter. Most drivers can
establish the p2pdma_provider during probe, access the single instance
during DMABUF attach and use that to drive the DMA mapping.
DMABUF provides an invalidation mechanism that can guarantee all DMA
is halted and the DMA mappings are undone prior to destroying the
struct p2pdma_provider. This ensures there is no UAF through DMABUFs
that are lingering past driver removal.
The new p2pdma_provider layer cannot be used to create P2P memory that
can be mapped into VMAs, used with pin_user_pages(), O_DIRECT, and
so on. These use cases must still use the mmap() layer. The
p2pdma_provider layer is principally for DMABUF-like use cases where
DMABUF natively manages the life cycle and access instead of
vmas/pin_user_pages()/struct page.
In addition, remove the bus_off field from pci_p2pdma_map_state since
it duplicates information already available in the pgmap structure.
The bus_offset is only used in one location (pci_p2pdma_bus_addr_map)
and is always identical to pgmap->bus_offset.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Alex Mastro <amastro@fb.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
drivers/pci/p2pdma.c | 43 +++++++++++++++++++++++--------------------
include/linux/pci-p2pdma.h | 19 ++++++++++++++-----
2 files changed, 37 insertions(+), 25 deletions(-)
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 78e108e47254..59cd6fb40e83 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -28,9 +28,8 @@ struct pci_p2pdma {
};
struct pci_p2pdma_pagemap {
- struct pci_dev *provider;
- u64 bus_offset;
struct dev_pagemap pgmap;
+ struct p2pdma_provider mem;
};
static struct pci_p2pdma_pagemap *to_p2p_pgmap(struct dev_pagemap *pgmap)
@@ -204,8 +203,8 @@ static void p2pdma_page_free(struct page *page)
{
struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page_pgmap(page));
/* safe to dereference while a reference is held to the percpu ref */
- struct pci_p2pdma *p2pdma =
- rcu_dereference_protected(pgmap->provider->p2pdma, 1);
+ struct pci_p2pdma *p2pdma = rcu_dereference_protected(
+ to_pci_dev(pgmap->mem.owner)->p2pdma, 1);
struct percpu_ref *ref;
gen_pool_free_owner(p2pdma->pool, (uintptr_t)page_to_virt(page),
@@ -270,14 +269,15 @@ static int pci_p2pdma_setup(struct pci_dev *pdev)
static void pci_p2pdma_unmap_mappings(void *data)
{
- struct pci_dev *pdev = data;
+ struct pci_p2pdma_pagemap *p2p_pgmap = data;
/*
* Removing the alloc attribute from sysfs will call
* unmap_mapping_range() on the inode, teardown any existing userspace
* mappings and prevent new ones from being created.
*/
- sysfs_remove_file_from_group(&pdev->dev.kobj, &p2pmem_alloc_attr.attr,
+ sysfs_remove_file_from_group(&p2p_pgmap->mem.owner->kobj,
+ &p2pmem_alloc_attr.attr,
p2pmem_group.name);
}
@@ -328,10 +328,9 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
pgmap->nr_range = 1;
pgmap->type = MEMORY_DEVICE_PCI_P2PDMA;
pgmap->ops = &p2pdma_pgmap_ops;
-
- p2p_pgmap->provider = pdev;
- p2p_pgmap->bus_offset = pci_bus_address(pdev, bar) -
- pci_resource_start(pdev, bar);
+ p2p_pgmap->mem.owner = &pdev->dev;
+ p2p_pgmap->mem.bus_offset =
+ pci_bus_address(pdev, bar) - pci_resource_start(pdev, bar);
addr = devm_memremap_pages(&pdev->dev, pgmap);
if (IS_ERR(addr)) {
@@ -340,7 +339,7 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
}
error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_unmap_mappings,
- pdev);
+ p2p_pgmap);
if (error)
goto pages_free;
@@ -972,16 +971,16 @@ void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
}
EXPORT_SYMBOL_GPL(pci_p2pmem_publish);
-static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
- struct device *dev)
+static enum pci_p2pdma_map_type
+pci_p2pdma_map_type(struct p2pdma_provider *provider, struct device *dev)
{
enum pci_p2pdma_map_type type = PCI_P2PDMA_MAP_NOT_SUPPORTED;
- struct pci_dev *provider = to_p2p_pgmap(pgmap)->provider;
+ struct pci_dev *pdev = to_pci_dev(provider->owner);
struct pci_dev *client;
struct pci_p2pdma *p2pdma;
int dist;
- if (!provider->p2pdma)
+ if (!pdev->p2pdma)
return PCI_P2PDMA_MAP_NOT_SUPPORTED;
if (!dev_is_pci(dev))
@@ -990,7 +989,7 @@ static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
client = to_pci_dev(dev);
rcu_read_lock();
- p2pdma = rcu_dereference(provider->p2pdma);
+ p2pdma = rcu_dereference(pdev->p2pdma);
if (p2pdma)
type = xa_to_value(xa_load(&p2pdma->map_types,
@@ -998,7 +997,7 @@ static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
rcu_read_unlock();
if (type == PCI_P2PDMA_MAP_UNKNOWN)
- return calc_map_type_and_dist(provider, client, &dist, true);
+ return calc_map_type_and_dist(pdev, client, &dist, true);
return type;
}
@@ -1006,9 +1005,13 @@ static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
void __pci_p2pdma_update_state(struct pci_p2pdma_map_state *state,
struct device *dev, struct page *page)
{
- state->pgmap = page_pgmap(page);
- state->map = pci_p2pdma_map_type(state->pgmap, dev);
- state->bus_off = to_p2p_pgmap(state->pgmap)->bus_offset;
+ struct pci_p2pdma_pagemap *p2p_pgmap = to_p2p_pgmap(page_pgmap(page));
+
+ if (state->mem == &p2p_pgmap->mem)
+ return;
+
+ state->mem = &p2p_pgmap->mem;
+ state->map = pci_p2pdma_map_type(&p2p_pgmap->mem, dev);
}
/**
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index 951f81a38f3a..1400f3ad4299 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -16,6 +16,16 @@
struct block_device;
struct scatterlist;
+/**
+ * struct p2pdma_provider
+ *
+ * A p2pdma provider is a range of MMIO address space available to the CPU.
+ */
+struct p2pdma_provider {
+ struct device *owner;
+ u64 bus_offset;
+};
+
#ifdef CONFIG_PCI_P2PDMA
int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
u64 offset);
@@ -139,11 +149,11 @@ enum pci_p2pdma_map_type {
};
struct pci_p2pdma_map_state {
- struct dev_pagemap *pgmap;
+ struct p2pdma_provider *mem;
enum pci_p2pdma_map_type map;
- u64 bus_off;
};
+
/* helper for pci_p2pdma_state(), do not use directly */
void __pci_p2pdma_update_state(struct pci_p2pdma_map_state *state,
struct device *dev, struct page *page);
@@ -162,8 +172,7 @@ pci_p2pdma_state(struct pci_p2pdma_map_state *state, struct device *dev,
struct page *page)
{
if (IS_ENABLED(CONFIG_PCI_P2PDMA) && is_pci_p2pdma_page(page)) {
- if (state->pgmap != page_pgmap(page))
- __pci_p2pdma_update_state(state, dev, page);
+ __pci_p2pdma_update_state(state, dev, page);
return state->map;
}
return PCI_P2PDMA_MAP_NONE;
@@ -181,7 +190,7 @@ static inline dma_addr_t
pci_p2pdma_bus_addr_map(struct pci_p2pdma_map_state *state, phys_addr_t paddr)
{
WARN_ON_ONCE(state->map != PCI_P2PDMA_MAP_BUS_ADDR);
- return paddr + state->bus_off;
+ return paddr + state->mem->bus_offset;
}
#endif /* _LINUX_PCI_P2P_H */
--
2.51.1
^ permalink raw reply related	[flat|nested] 63+ messages in thread
* [PATCH v8 02/11] PCI/P2PDMA: Simplify bus address mapping API
2025-11-11 9:57 [PATCH v8 00/11] vfio/pci: Allow MMIO regions to be exported through dma-buf Leon Romanovsky
2025-11-11 9:57 ` [PATCH v8 01/11] PCI/P2PDMA: Separate the mmap() support from the core logic Leon Romanovsky
@ 2025-11-11 9:57 ` Leon Romanovsky
2025-11-11 9:57 ` [PATCH v8 03/11] PCI/P2PDMA: Refactor to separate core P2P functionality from memory allocation Leon Romanovsky
` (8 subsequent siblings)
10 siblings, 0 replies; 63+ messages in thread
From: Leon Romanovsky @ 2025-11-11 9:57 UTC (permalink / raw)
To: Bjorn Helgaas, Logan Gunthorpe, Jens Axboe, Robin Murphy,
Joerg Roedel, Will Deacon, Marek Szyprowski, Jason Gunthorpe,
Leon Romanovsky, Andrew Morton, Jonathan Corbet, Sumit Semwal,
Christian König, Kees Cook, Gustavo A. R. Silva,
Ankit Agrawal, Yishai Hadas, Shameer Kolothum, Kevin Tian,
Alex Williamson
Cc: Krishnakant Jaju, Matt Ochs, linux-pci, linux-kernel, linux-block,
iommu, linux-mm, linux-doc, linux-media, dri-devel, linaro-mm-sig,
kvm, linux-hardening, Alex Mastro, Nicolin Chen
From: Leon Romanovsky <leonro@nvidia.com>
Update the pci_p2pdma_bus_addr_map() function to take a direct pointer
to the p2pdma_provider structure instead of the pci_p2pdma_map_state.
This simplifies the API: callers pass exactly the provider they want to
map against, and code that holds a provider without a full map state
can use the helper directly.
The change updates all callers across the kernel (block layer, IOMMU,
DMA direct, and HMM) to pass the provider pointer directly, making
the code more explicit and reducing unnecessary indirection. This
also removes the runtime warning check since callers now have direct
control over which provider they use.
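For illustration, a caller after this change looks roughly like the hunks
below; map_one() is a made-up helper name for the example:

  static dma_addr_t map_one(struct pci_p2pdma_map_state *state,
                            struct device *dev, struct page *page)
  {
          switch (pci_p2pdma_state(state, dev, page)) {
          case PCI_P2PDMA_MAP_BUS_ADDR:
                  /* the provider now comes straight from the cached state */
                  return pci_p2pdma_bus_addr_map(state->mem,
                                                 page_to_phys(page));
          default:
                  /* simplified; real callers fall back to the normal DMA path */
                  return DMA_MAPPING_ERROR;
          }
  }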
Tested-by: Alex Mastro <amastro@fb.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
block/blk-mq-dma.c | 2 +-
drivers/iommu/dma-iommu.c | 4 ++--
include/linux/pci-p2pdma.h | 7 +++----
kernel/dma/direct.c | 4 ++--
mm/hmm.c | 2 +-
5 files changed, 9 insertions(+), 10 deletions(-)
diff --git a/block/blk-mq-dma.c b/block/blk-mq-dma.c
index 449950029872..a1b623744b2f 100644
--- a/block/blk-mq-dma.c
+++ b/block/blk-mq-dma.c
@@ -85,7 +85,7 @@ static inline bool blk_can_dma_map_iova(struct request *req,
static bool blk_dma_map_bus(struct blk_dma_iter *iter, struct phys_vec *vec)
{
- iter->addr = pci_p2pdma_bus_addr_map(&iter->p2pdma, vec->paddr);
+ iter->addr = pci_p2pdma_bus_addr_map(iter->p2pdma.mem, vec->paddr);
iter->len = vec->len;
return true;
}
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 7944a3af4545..e52d19d2e833 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1439,8 +1439,8 @@ int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
* as a bus address, __finalise_sg() will copy the dma
* address into the output segment.
*/
- s->dma_address = pci_p2pdma_bus_addr_map(&p2pdma_state,
- sg_phys(s));
+ s->dma_address = pci_p2pdma_bus_addr_map(
+ p2pdma_state.mem, sg_phys(s));
sg_dma_len(s) = sg->length;
sg_dma_mark_bus_address(s);
continue;
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index 1400f3ad4299..9516ef97b17a 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -181,16 +181,15 @@ pci_p2pdma_state(struct pci_p2pdma_map_state *state, struct device *dev,
/**
* pci_p2pdma_bus_addr_map - Translate a physical address to a bus address
* for a PCI_P2PDMA_MAP_BUS_ADDR transfer.
- * @state: P2P state structure
+ * @provider: P2P provider structure
* @paddr: physical address to map
*
* Map a physically contiguous PCI_P2PDMA_MAP_BUS_ADDR transfer.
*/
static inline dma_addr_t
-pci_p2pdma_bus_addr_map(struct pci_p2pdma_map_state *state, phys_addr_t paddr)
+pci_p2pdma_bus_addr_map(struct p2pdma_provider *provider, phys_addr_t paddr)
{
- WARN_ON_ONCE(state->map != PCI_P2PDMA_MAP_BUS_ADDR);
- return paddr + state->mem->bus_offset;
+ return paddr + provider->bus_offset;
}
#endif /* _LINUX_PCI_P2P_H */
diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 1f9ee9759426..d8b3dfc598b2 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -479,8 +479,8 @@ int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
}
break;
case PCI_P2PDMA_MAP_BUS_ADDR:
- sg->dma_address = pci_p2pdma_bus_addr_map(&p2pdma_state,
- sg_phys(sg));
+ sg->dma_address = pci_p2pdma_bus_addr_map(
+ p2pdma_state.mem, sg_phys(sg));
sg_dma_mark_bus_address(sg);
continue;
default:
diff --git a/mm/hmm.c b/mm/hmm.c
index 87562914670a..9bf0b831a029 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -811,7 +811,7 @@ dma_addr_t hmm_dma_map_pfn(struct device *dev, struct hmm_dma_map *map,
break;
case PCI_P2PDMA_MAP_BUS_ADDR:
pfns[idx] |= HMM_PFN_P2PDMA_BUS | HMM_PFN_DMA_MAPPED;
- return pci_p2pdma_bus_addr_map(p2pdma_state, paddr);
+ return pci_p2pdma_bus_addr_map(p2pdma_state->mem, paddr);
default:
return DMA_MAPPING_ERROR;
}
--
2.51.1
^ permalink raw reply related	[flat|nested] 63+ messages in thread
* [PATCH v8 03/11] PCI/P2PDMA: Refactor to separate core P2P functionality from memory allocation
2025-11-11 9:57 [PATCH v8 00/11] vfio/pci: Allow MMIO regions to be exported through dma-buf Leon Romanovsky
2025-11-11 9:57 ` [PATCH v8 01/11] PCI/P2PDMA: Separate the mmap() support from the core logic Leon Romanovsky
2025-11-11 9:57 ` [PATCH v8 02/11] PCI/P2PDMA: Simplify bus address mapping API Leon Romanovsky
@ 2025-11-11 9:57 ` Leon Romanovsky
2025-11-11 9:57 ` [PATCH v8 04/11] PCI/P2PDMA: Provide an access to pci_p2pdma_map_type() function Leon Romanovsky
` (7 subsequent siblings)
10 siblings, 0 replies; 63+ messages in thread
From: Leon Romanovsky @ 2025-11-11 9:57 UTC (permalink / raw)
To: Bjorn Helgaas, Logan Gunthorpe, Jens Axboe, Robin Murphy,
Joerg Roedel, Will Deacon, Marek Szyprowski, Jason Gunthorpe,
Leon Romanovsky, Andrew Morton, Jonathan Corbet, Sumit Semwal,
Christian König, Kees Cook, Gustavo A. R. Silva,
Ankit Agrawal, Yishai Hadas, Shameer Kolothum, Kevin Tian,
Alex Williamson
Cc: Krishnakant Jaju, Matt Ochs, linux-pci, linux-kernel, linux-block,
iommu, linux-mm, linux-doc, linux-media, dri-devel, linaro-mm-sig,
kvm, linux-hardening, Alex Mastro, Nicolin Chen
From: Leon Romanovsky <leonro@nvidia.com>
Refactor the PCI P2PDMA subsystem to separate the core peer-to-peer DMA
functionality from the optional memory allocation layer. This creates a
two-tier architecture:
The core layer provides P2P mapping functionality for physical addresses
based on PCI device MMIO BARs and integrates with the DMA API for
mapping operations. This layer is required for all P2PDMA users.
The optional upper layer provides memory allocation capabilities
including gen_pool allocator, struct page support, and sysfs interface
for user space access.
This separation allows subsystems like DMABUF to use only the core P2P
mapping functionality without the overhead of memory allocation features
they don't need. The core functionality is now available through the
new pcim_p2pdma_provider() function that returns a p2pdma_provider
structure.
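A minimal sketch of the core-only flow, assuming a driver that never needs
struct pages (the probe function and BAR choice are examples, not part of
this patch):

  static int my_probe(struct pci_dev *pdev, const struct pci_device_id *id)
  {
          struct p2pdma_provider *provider;
          int ret;

          ret = pcim_p2pdma_init(pdev);
          if (ret)
                  return ret;

          provider = pcim_p2pdma_provider(pdev, 0);  /* e.g. BAR 0 */
          if (!provider)
                  return -EINVAL;

          /* hand 'provider' to the DMABUF exporter / DMA mapping code */
          return 0;
  }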
Tested-by: Alex Mastro <amastro@fb.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
drivers/pci/p2pdma.c | 151 +++++++++++++++++++++++++++++++++++----------
include/linux/pci-p2pdma.h | 11 ++++
2 files changed, 131 insertions(+), 31 deletions(-)
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 59cd6fb40e83..855d3493634c 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -25,11 +25,12 @@ struct pci_p2pdma {
struct gen_pool *pool;
bool p2pmem_published;
struct xarray map_types;
+ struct p2pdma_provider mem[PCI_STD_NUM_BARS];
};
struct pci_p2pdma_pagemap {
struct dev_pagemap pgmap;
- struct p2pdma_provider mem;
+ struct p2pdma_provider *mem;
};
static struct pci_p2pdma_pagemap *to_p2p_pgmap(struct dev_pagemap *pgmap)
@@ -204,7 +205,7 @@ static void p2pdma_page_free(struct page *page)
struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page_pgmap(page));
/* safe to dereference while a reference is held to the percpu ref */
struct pci_p2pdma *p2pdma = rcu_dereference_protected(
- to_pci_dev(pgmap->mem.owner)->p2pdma, 1);
+ to_pci_dev(pgmap->mem->owner)->p2pdma, 1);
struct percpu_ref *ref;
gen_pool_free_owner(p2pdma->pool, (uintptr_t)page_to_virt(page),
@@ -227,44 +228,123 @@ static void pci_p2pdma_release(void *data)
/* Flush and disable pci_alloc_p2p_mem() */
pdev->p2pdma = NULL;
- synchronize_rcu();
+ if (p2pdma->pool)
+ synchronize_rcu();
+ xa_destroy(&p2pdma->map_types);
+
+ if (!p2pdma->pool)
+ return;
gen_pool_destroy(p2pdma->pool);
sysfs_remove_group(&pdev->dev.kobj, &p2pmem_group);
- xa_destroy(&p2pdma->map_types);
}
-static int pci_p2pdma_setup(struct pci_dev *pdev)
+/**
+ * pcim_p2pdma_init - Initialise peer-to-peer DMA providers
+ * @pdev: The PCI device to enable P2PDMA for
+ *
+ * This function initializes the peer-to-peer DMA infrastructure
+ * for a PCI device. It allocates and sets up the necessary data
+ * structures to support P2PDMA operations, including mapping type
+ * tracking.
+ */
+int pcim_p2pdma_init(struct pci_dev *pdev)
{
- int error = -ENOMEM;
struct pci_p2pdma *p2p;
+ int i, ret;
+
+ p2p = rcu_dereference_protected(pdev->p2pdma, 1);
+ if (p2p)
+ return 0;
p2p = devm_kzalloc(&pdev->dev, sizeof(*p2p), GFP_KERNEL);
if (!p2p)
return -ENOMEM;
xa_init(&p2p->map_types);
+ /*
+ * Iterate over all standard PCI BARs and record only those that
+ * correspond to MMIO regions. Skip non-memory resources (e.g. I/O
+ * port BARs) since they cannot be used for peer-to-peer (P2P)
+ * transactions.
+ */
+ for (i = 0; i < PCI_STD_NUM_BARS; i++) {
+ if (!(pci_resource_flags(pdev, i) & IORESOURCE_MEM))
+ continue;
- p2p->pool = gen_pool_create(PAGE_SHIFT, dev_to_node(&pdev->dev));
- if (!p2p->pool)
- goto out;
+ p2p->mem[i].owner = &pdev->dev;
+ p2p->mem[i].bus_offset =
+ pci_bus_address(pdev, i) - pci_resource_start(pdev, i);
+ }
- error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_release, pdev);
- if (error)
- goto out_pool_destroy;
+ ret = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_release, pdev);
+ if (ret)
+ goto out_p2p;
- error = sysfs_create_group(&pdev->dev.kobj, &p2pmem_group);
- if (error)
+ rcu_assign_pointer(pdev->p2pdma, p2p);
+ return 0;
+
+out_p2p:
+ devm_kfree(&pdev->dev, p2p);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(pcim_p2pdma_init);
+
+/**
+ * pcim_p2pdma_provider - Get peer-to-peer DMA provider
+ * @pdev: The PCI device to enable P2PDMA for
+ * @bar: BAR index to get provider
+ *
+ * This function gets peer-to-peer DMA provider for a PCI device. The lifetime
+ * of the provider (and of course the MMIO) is bound to the lifetime of the
+ * driver. A driver calling this function must ensure that all references to the
+ * provider, and any DMA mappings created for any MMIO, are all cleaned up
+ * before the driver remove() completes.
+ *
+ * Since P2P is almost always shared with a second driver this means some system
+ * to notify, invalidate and revoke the MMIO's DMA must be in place to use this
+ * function. For example a revoke can be built using DMABUF.
+ */
+struct p2pdma_provider *pcim_p2pdma_provider(struct pci_dev *pdev, int bar)
+{
+ struct pci_p2pdma *p2p;
+
+ if (!(pci_resource_flags(pdev, bar) & IORESOURCE_MEM))
+ return NULL;
+
+ p2p = rcu_dereference_protected(pdev->p2pdma, 1);
+ if (WARN_ON(!p2p))
+ /* Someone forgot to call to pcim_p2pdma_init() before */
+ return NULL;
+
+ return &p2p->mem[bar];
+}
+EXPORT_SYMBOL_GPL(pcim_p2pdma_provider);
+
+static int pci_p2pdma_setup_pool(struct pci_dev *pdev)
+{
+ struct pci_p2pdma *p2pdma;
+ int ret;
+
+ p2pdma = rcu_dereference_protected(pdev->p2pdma, 1);
+ if (p2pdma->pool)
+ /* We already setup pools, do nothing, */
+ return 0;
+
+ p2pdma->pool = gen_pool_create(PAGE_SHIFT, dev_to_node(&pdev->dev));
+ if (!p2pdma->pool)
+ return -ENOMEM;
+
+ ret = sysfs_create_group(&pdev->dev.kobj, &p2pmem_group);
+ if (ret)
goto out_pool_destroy;
- rcu_assign_pointer(pdev->p2pdma, p2p);
return 0;
out_pool_destroy:
- gen_pool_destroy(p2p->pool);
-out:
- devm_kfree(&pdev->dev, p2p);
- return error;
+ gen_pool_destroy(p2pdma->pool);
+ p2pdma->pool = NULL;
+ return ret;
}
static void pci_p2pdma_unmap_mappings(void *data)
@@ -276,7 +356,7 @@ static void pci_p2pdma_unmap_mappings(void *data)
* unmap_mapping_range() on the inode, teardown any existing userspace
* mappings and prevent new ones from being created.
*/
- sysfs_remove_file_from_group(&p2p_pgmap->mem.owner->kobj,
+ sysfs_remove_file_from_group(&p2p_pgmap->mem->owner->kobj,
&p2pmem_alloc_attr.attr,
p2pmem_group.name);
}
@@ -295,6 +375,7 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
u64 offset)
{
struct pci_p2pdma_pagemap *p2p_pgmap;
+ struct p2pdma_provider *mem;
struct dev_pagemap *pgmap;
struct pci_p2pdma *p2pdma;
void *addr;
@@ -312,11 +393,21 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
if (size + offset > pci_resource_len(pdev, bar))
return -EINVAL;
- if (!pdev->p2pdma) {
- error = pci_p2pdma_setup(pdev);
- if (error)
- return error;
- }
+ error = pcim_p2pdma_init(pdev);
+ if (error)
+ return error;
+
+ error = pci_p2pdma_setup_pool(pdev);
+ if (error)
+ return error;
+
+ mem = pcim_p2pdma_provider(pdev, bar);
+ /*
+ * We checked validity of BAR prior to call
+ * to pcim_p2pdma_provider. It should never return NULL.
+ */
+ if (WARN_ON(!mem))
+ return -EINVAL;
p2p_pgmap = devm_kzalloc(&pdev->dev, sizeof(*p2p_pgmap), GFP_KERNEL);
if (!p2p_pgmap)
@@ -328,9 +419,7 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
pgmap->nr_range = 1;
pgmap->type = MEMORY_DEVICE_PCI_P2PDMA;
pgmap->ops = &p2pdma_pgmap_ops;
- p2p_pgmap->mem.owner = &pdev->dev;
- p2p_pgmap->mem.bus_offset =
- pci_bus_address(pdev, bar) - pci_resource_start(pdev, bar);
+ p2p_pgmap->mem = mem;
addr = devm_memremap_pages(&pdev->dev, pgmap);
if (IS_ERR(addr)) {
@@ -1007,11 +1096,11 @@ void __pci_p2pdma_update_state(struct pci_p2pdma_map_state *state,
{
struct pci_p2pdma_pagemap *p2p_pgmap = to_p2p_pgmap(page_pgmap(page));
- if (state->mem == &p2p_pgmap->mem)
+ if (state->mem == p2p_pgmap->mem)
return;
- state->mem = &p2p_pgmap->mem;
- state->map = pci_p2pdma_map_type(&p2p_pgmap->mem, dev);
+ state->mem = p2p_pgmap->mem;
+ state->map = pci_p2pdma_map_type(p2p_pgmap->mem, dev);
}
/**
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index 9516ef97b17a..15471252817b 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -27,6 +27,8 @@ struct p2pdma_provider {
};
#ifdef CONFIG_PCI_P2PDMA
+int pcim_p2pdma_init(struct pci_dev *pdev);
+struct p2pdma_provider *pcim_p2pdma_provider(struct pci_dev *pdev, int bar);
int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
u64 offset);
int pci_p2pdma_distance_many(struct pci_dev *provider, struct device **clients,
@@ -44,6 +46,15 @@ int pci_p2pdma_enable_store(const char *page, struct pci_dev **p2p_dev,
ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev,
bool use_p2pdma);
#else /* CONFIG_PCI_P2PDMA */
+static inline int pcim_p2pdma_init(struct pci_dev *pdev)
+{
+ return -EOPNOTSUPP;
+}
+static inline struct p2pdma_provider *pcim_p2pdma_provider(struct pci_dev *pdev,
+ int bar)
+{
+ return NULL;
+}
static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar,
size_t size, u64 offset)
{
--
2.51.1
^ permalink raw reply related	[flat|nested] 63+ messages in thread
* [PATCH v8 04/11] PCI/P2PDMA: Provide an access to pci_p2pdma_map_type() function
2025-11-11 9:57 [PATCH v8 00/11] vfio/pci: Allow MMIO regions to be exported through dma-buf Leon Romanovsky
` (2 preceding siblings ...)
2025-11-11 9:57 ` [PATCH v8 03/11] PCI/P2PDMA: Refactor to separate core P2P functionality from memory allocation Leon Romanovsky
@ 2025-11-11 9:57 ` Leon Romanovsky
2025-11-11 9:57 ` [PATCH v8 05/11] PCI/P2PDMA: Document DMABUF model Leon Romanovsky
` (6 subsequent siblings)
10 siblings, 0 replies; 63+ messages in thread
From: Leon Romanovsky @ 2025-11-11 9:57 UTC (permalink / raw)
To: Bjorn Helgaas, Logan Gunthorpe, Jens Axboe, Robin Murphy,
Joerg Roedel, Will Deacon, Marek Szyprowski, Jason Gunthorpe,
Leon Romanovsky, Andrew Morton, Jonathan Corbet, Sumit Semwal,
Christian König, Kees Cook, Gustavo A. R. Silva,
Ankit Agrawal, Yishai Hadas, Shameer Kolothum, Kevin Tian,
Alex Williamson
Cc: Krishnakant Jaju, Matt Ochs, linux-pci, linux-kernel, linux-block,
iommu, linux-mm, linux-doc, linux-media, dri-devel, linaro-mm-sig,
kvm, linux-hardening, Alex Mastro, Nicolin Chen
From: Leon Romanovsky <leonro@nvidia.com>
Provide access to the pci_p2pdma_map_type() function to allow subsystems
to determine the appropriate mapping type for P2PDMA transfers between
a provider and a target device.
The pci_p2pdma_map_type() function is the core P2P layer version of
the existing public, but struct page focused, pci_p2pdma_state()
function. It returns the same result. It is required in order to use the
p2p subsystem from drivers that don't use the struct page layer.
Like __pci_p2pdma_update_state() it is not an exported function. The
idea is that only subsystem code will implement mapping helpers for
taking in phys_addr_t lists, this is deliberately not made accessible
to every driver to prevent abuse.
Following patches will use this function to implement a shared DMA
mapping helper for DMABUF.
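As an illustration of the kind of subsystem helper this enables (a sketch
only; the helper name and the dma_map_resource() fallback are assumptions,
not something added by this series):

  static int map_p2p_phys(struct p2pdma_provider *provider,
                          struct device *dev, phys_addr_t paddr, size_t len,
                          dma_addr_t *out)
  {
          switch (pci_p2pdma_map_type(provider, dev)) {
          case PCI_P2PDMA_MAP_BUS_ADDR:
                  /* switch-routed P2P: program the device with a bus address */
                  *out = pci_p2pdma_bus_addr_map(provider, paddr);
                  return 0;
          case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
                  /* allow-listed host bridge: map through the regular DMA API */
                  *out = dma_map_resource(dev, paddr, len,
                                          DMA_BIDIRECTIONAL, 0);
                  return dma_mapping_error(dev, *out);
          default:
                  return -EREMOTEIO;
          }
  }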
Tested-by: Alex Mastro <amastro@fb.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
drivers/pci/p2pdma.c | 14 ++++++--
include/linux/pci-p2pdma.h | 85 +++++++++++++++++++++++++---------------------
2 files changed, 58 insertions(+), 41 deletions(-)
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 855d3493634c..981a76b6b7c0 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -1060,8 +1060,18 @@ void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
}
EXPORT_SYMBOL_GPL(pci_p2pmem_publish);
-static enum pci_p2pdma_map_type
-pci_p2pdma_map_type(struct p2pdma_provider *provider, struct device *dev)
+/**
+ * pci_p2pdma_map_type - Determine the mapping type for P2PDMA transfers
+ * @provider: P2PDMA provider structure
+ * @dev: Target device for the transfer
+ *
+ * Determines how peer-to-peer DMA transfers should be mapped between
+ * the provider and the target device. The mapping type indicates whether
+ * the transfer can be done directly through PCI switches or must go
+ * through the host bridge.
+ */
+enum pci_p2pdma_map_type pci_p2pdma_map_type(struct p2pdma_provider *provider,
+ struct device *dev)
{
enum pci_p2pdma_map_type type = PCI_P2PDMA_MAP_NOT_SUPPORTED;
struct pci_dev *pdev = to_pci_dev(provider->owner);
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index 15471252817b..517e121d2598 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -26,6 +26,45 @@ struct p2pdma_provider {
u64 bus_offset;
};
+enum pci_p2pdma_map_type {
+ /*
+ * PCI_P2PDMA_MAP_UNKNOWN: Used internally as an initial state before
+ * the mapping type has been calculated. Exported routines for the API
+ * will never return this value.
+ */
+ PCI_P2PDMA_MAP_UNKNOWN = 0,
+
+ /*
+ * Not a PCI P2PDMA transfer.
+ */
+ PCI_P2PDMA_MAP_NONE,
+
+ /*
+ * PCI_P2PDMA_MAP_NOT_SUPPORTED: Indicates the transaction will
+ * traverse the host bridge and the host bridge is not in the
+ * allowlist. DMA Mapping routines should return an error when
+ * this is returned.
+ */
+ PCI_P2PDMA_MAP_NOT_SUPPORTED,
+
+ /*
+ * PCI_P2PDMA_MAP_BUS_ADDR: Indicates that two devices can talk to
+ * each other directly through a PCI switch and the transaction will
+ * not traverse the host bridge. Such a mapping should program
+ * the DMA engine with PCI bus addresses.
+ */
+ PCI_P2PDMA_MAP_BUS_ADDR,
+
+ /*
+ * PCI_P2PDMA_MAP_THRU_HOST_BRIDGE: Indicates two devices can talk
+ * to each other, but the transaction traverses a host bridge on the
+ * allowlist. In this case, a normal mapping either with CPU physical
+ * addresses (in the case of dma-direct) or IOVA addresses (in the
+ * case of IOMMUs) should be used to program the DMA engine.
+ */
+ PCI_P2PDMA_MAP_THRU_HOST_BRIDGE,
+};
+
#ifdef CONFIG_PCI_P2PDMA
int pcim_p2pdma_init(struct pci_dev *pdev);
struct p2pdma_provider *pcim_p2pdma_provider(struct pci_dev *pdev, int bar);
@@ -45,6 +84,8 @@ int pci_p2pdma_enable_store(const char *page, struct pci_dev **p2p_dev,
bool *use_p2pdma);
ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev,
bool use_p2pdma);
+enum pci_p2pdma_map_type pci_p2pdma_map_type(struct p2pdma_provider *provider,
+ struct device *dev);
#else /* CONFIG_PCI_P2PDMA */
static inline int pcim_p2pdma_init(struct pci_dev *pdev)
{
@@ -106,6 +147,11 @@ static inline ssize_t pci_p2pdma_enable_show(char *page,
{
return sprintf(page, "none\n");
}
+static inline enum pci_p2pdma_map_type
+pci_p2pdma_map_type(struct p2pdma_provider *provider, struct device *dev)
+{
+ return PCI_P2PDMA_MAP_NOT_SUPPORTED;
+}
#endif /* CONFIG_PCI_P2PDMA */
@@ -120,45 +166,6 @@ static inline struct pci_dev *pci_p2pmem_find(struct device *client)
return pci_p2pmem_find_many(&client, 1);
}
-enum pci_p2pdma_map_type {
- /*
- * PCI_P2PDMA_MAP_UNKNOWN: Used internally as an initial state before
- * the mapping type has been calculated. Exported routines for the API
- * will never return this value.
- */
- PCI_P2PDMA_MAP_UNKNOWN = 0,
-
- /*
- * Not a PCI P2PDMA transfer.
- */
- PCI_P2PDMA_MAP_NONE,
-
- /*
- * PCI_P2PDMA_MAP_NOT_SUPPORTED: Indicates the transaction will
- * traverse the host bridge and the host bridge is not in the
- * allowlist. DMA Mapping routines should return an error when
- * this is returned.
- */
- PCI_P2PDMA_MAP_NOT_SUPPORTED,
-
- /*
- * PCI_P2PDMA_MAP_BUS_ADDR: Indicates that two devices can talk to
- * each other directly through a PCI switch and the transaction will
- * not traverse the host bridge. Such a mapping should program
- * the DMA engine with PCI bus addresses.
- */
- PCI_P2PDMA_MAP_BUS_ADDR,
-
- /*
- * PCI_P2PDMA_MAP_THRU_HOST_BRIDGE: Indicates two devices can talk
- * to each other, but the transaction traverses a host bridge on the
- * allowlist. In this case, a normal mapping either with CPU physical
- * addresses (in the case of dma-direct) or IOVA addresses (in the
- * case of IOMMUs) should be used to program the DMA engine.
- */
- PCI_P2PDMA_MAP_THRU_HOST_BRIDGE,
-};
-
struct pci_p2pdma_map_state {
struct p2pdma_provider *mem;
enum pci_p2pdma_map_type map;
--
2.51.1
^ permalink raw reply related	[flat|nested] 63+ messages in thread
* [PATCH v8 05/11] PCI/P2PDMA: Document DMABUF model
2025-11-11 9:57 [PATCH v8 00/11] vfio/pci: Allow MMIO regions to be exported through dma-buf Leon Romanovsky
` (3 preceding siblings ...)
2025-11-11 9:57 ` [PATCH v8 04/11] PCI/P2PDMA: Provide an access to pci_p2pdma_map_type() function Leon Romanovsky
@ 2025-11-11 9:57 ` Leon Romanovsky
2025-11-19 9:18 ` Christian König
2025-11-11 9:57 ` [PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine Leon Romanovsky
` (5 subsequent siblings)
10 siblings, 1 reply; 63+ messages in thread
From: Leon Romanovsky @ 2025-11-11 9:57 UTC (permalink / raw)
To: Bjorn Helgaas, Logan Gunthorpe, Jens Axboe, Robin Murphy,
Joerg Roedel, Will Deacon, Marek Szyprowski, Jason Gunthorpe,
Leon Romanovsky, Andrew Morton, Jonathan Corbet, Sumit Semwal,
Christian König, Kees Cook, Gustavo A. R. Silva,
Ankit Agrawal, Yishai Hadas, Shameer Kolothum, Kevin Tian,
Alex Williamson
Cc: Krishnakant Jaju, Matt Ochs, linux-pci, linux-kernel, linux-block,
iommu, linux-mm, linux-doc, linux-media, dri-devel, linaro-mm-sig,
kvm, linux-hardening
From: Jason Gunthorpe <jgg@nvidia.com>
Reflect latest changes in p2p implementation to support DMABUF lifecycle.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
Documentation/driver-api/pci/p2pdma.rst | 95 +++++++++++++++++++++++++--------
1 file changed, 72 insertions(+), 23 deletions(-)
diff --git a/Documentation/driver-api/pci/p2pdma.rst b/Documentation/driver-api/pci/p2pdma.rst
index d0b241628cf1..77e310596955 100644
--- a/Documentation/driver-api/pci/p2pdma.rst
+++ b/Documentation/driver-api/pci/p2pdma.rst
@@ -9,22 +9,47 @@ between two devices on the bus. This type of transaction is henceforth
called Peer-to-Peer (or P2P). However, there are a number of issues that
make P2P transactions tricky to do in a perfectly safe way.
-One of the biggest issues is that PCI doesn't require forwarding
-transactions between hierarchy domains, and in PCIe, each Root Port
-defines a separate hierarchy domain. To make things worse, there is no
-simple way to determine if a given Root Complex supports this or not.
-(See PCIe r4.0, sec 1.3.1). Therefore, as of this writing, the kernel
-only supports doing P2P when the endpoints involved are all behind the
-same PCI bridge, as such devices are all in the same PCI hierarchy
-domain, and the spec guarantees that all transactions within the
-hierarchy will be routable, but it does not require routing
-between hierarchies.
-
-The second issue is that to make use of existing interfaces in Linux,
-memory that is used for P2P transactions needs to be backed by struct
-pages. However, PCI BARs are not typically cache coherent so there are
-a few corner case gotchas with these pages so developers need to
-be careful about what they do with them.
+For PCIe the routing of Transaction Layer Packets (TLPs) is well-defined up
+until they reach a host bridge or root port. If the path includes PCIe switches
+then based on the ACS settings the transaction can route entirely within
+the PCIe hierarchy and never reach the root port. The kernel will evaluate
+the PCIe topology and always permit P2P in these well-defined cases.
+
+However, if the P2P transaction reaches the host bridge then it might have to
+hairpin back out the same root port, be routed inside the CPU SOC to another
+PCIe root port, or routed internally to the SOC.
+
+As this is not well-defined or well-supported in real HW the kernel defaults to
+blocking such routing. There is an allow list to allow detecting known-good HW,
+in which case P2P between any two PCIe devices will be permitted.
+
+Since P2P inherently is doing transactions between two devices it requires two
+drivers to be co-operating inside the kernel. The providing driver has to convey
+its MMIO to the consuming driver. To meet the driver model lifecycle rules the
+MMIO must have all DMA mapping removed, all CPU accesses prevented, all page
+table mappings undone before the providing driver completes remove().
+
+This requires the providing and consuming driver to actively work together to
+guarantee that the consuming driver has stopped using the MMIO during a removal
+cycle. This is done by either a synchronous invalidation shutdown or waiting
+for all usage refcounts to reach zero.
+
+At the lowest level the P2P subsystem offers a naked struct p2p_provider that
+delegates lifecycle management to the providing driver. It is expected that
+drivers using this option will wrap their MMIO memory in DMABUF and use DMABUF
+to provide an invalidation shutdown. These MMIO pages have no struct page, and
+if used with mmap() must create special PTEs. As such there are very few
+kernel uAPIs that can accept pointers to them; in particular they cannot be used
+with read()/write(), including O_DIRECT.
+
+Building on this, the subsystem offers a layer to wrap the MMIO in a ZONE_DEVICE
+pgmap of MEMORY_DEVICE_PCI_P2PDMA to create struct pages. The lifecycle of
+pgmap ensures that when the pgmap is destroyed all other drivers have stopped
+using the MMIO. This option works with O_DIRECT flows, in some cases, if the
+underlying subsystem supports handling MEMORY_DEVICE_PCI_P2PDMA through
+FOLL_PCI_P2PDMA. The use of FOLL_LONGTERM is prevented. As this relies on pgmap
+it also relies on architecture support along with alignment and minimum size
+limitations.
Driver Writer's Guide
@@ -114,14 +139,38 @@ allocating scatter-gather lists with P2P memory.
Struct Page Caveats
-------------------
-Driver writers should be very careful about not passing these special
-struct pages to code that isn't prepared for it. At this time, the kernel
-interfaces do not have any checks for ensuring this. This obviously
-precludes passing these pages to userspace.
+While the MEMORY_DEVICE_PCI_P2PDMA pages can be installed in VMAs,
+pin_user_pages() and related will not return them unless FOLL_PCI_P2PDMA is set.
-P2P memory is also technically IO memory but should never have any side
-effects behind it. Thus, the order of loads and stores should not be important
-and ioreadX(), iowriteX() and friends should not be necessary.
+The MEMORY_DEVICE_PCI_P2PDMA pages require care to support in the kernel. The
+KVA is still MMIO and must still be accessed through the normal
+readX()/writeX()/etc helpers. Direct CPU access (e.g. memcpy) is forbidden, just
+like any other MMIO mapping. While this will actually work on some
+architectures, others will experience corruption or just crash in the kernel.
+Supporting FOLL_PCI_P2PDMA in a subsystem requires scrubbing it to ensure no CPU
+access happens.
+
+
+Usage With DMABUF
+=================
+
+DMABUF provides an alternative to the above struct page-based
+client/provider/orchestrator system. In this mode the exporting driver will wrap
+some of its MMIO in a DMABUF and give the DMABUF FD to userspace.
+
+Userspace can then pass the FD to an importing driver which will ask the
+exporting driver to map it.
+
+In this case the initiator and target pci_devices are known and the P2P subsystem
+is used to determine the mapping type. The phys_addr_t-based DMA API is used to
+establish the dma_addr_t.
+
+Lifecycle is controlled by DMABUF move_notify(). When the exporting driver wants
+to remove() it must deliver an invalidation shutdown to all DMABUF importing
+drivers through move_notify() and synchronously DMA unmap all the MMIO.
+
+No importing driver can continue to have a DMA map to the MMIO after the
+exporting driver has destroyed its p2p_provider.
P2P DMA Support Library
--
2.51.1
^ permalink raw reply related	[flat|nested] 63+ messages in thread
* Re: [PATCH v8 05/11] PCI/P2PDMA: Document DMABUF model
2025-11-11 9:57 ` [PATCH v8 05/11] PCI/P2PDMA: Document DMABUF model Leon Romanovsky
@ 2025-11-19 9:18 ` Christian König
2025-11-19 13:13 ` Leon Romanovsky
2025-11-19 13:35 ` Jason Gunthorpe
0 siblings, 2 replies; 63+ messages in thread
From: Christian König @ 2025-11-19 9:18 UTC (permalink / raw)
To: Leon Romanovsky, Bjorn Helgaas, Logan Gunthorpe, Jens Axboe,
Robin Murphy, Joerg Roedel, Will Deacon, Marek Szyprowski,
Jason Gunthorpe, Andrew Morton, Jonathan Corbet, Sumit Semwal,
Kees Cook, Gustavo A. R. Silva, Ankit Agrawal, Yishai Hadas,
Shameer Kolothum, Kevin Tian, Alex Williamson
Cc: Krishnakant Jaju, Matt Ochs, linux-pci, linux-kernel, linux-block,
iommu, linux-mm, linux-doc, linux-media, dri-devel, linaro-mm-sig,
kvm, linux-hardening
On 11/11/25 10:57, Leon Romanovsky wrote:
> From: Jason Gunthorpe <jgg@nvidia.com>
>
> Reflect latest changes in p2p implementation to support DMABUF lifecycle.
>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
> Documentation/driver-api/pci/p2pdma.rst | 95 +++++++++++++++++++++++++--------
> 1 file changed, 72 insertions(+), 23 deletions(-)
>
> diff --git a/Documentation/driver-api/pci/p2pdma.rst b/Documentation/driver-api/pci/p2pdma.rst
> index d0b241628cf1..77e310596955 100644
> --- a/Documentation/driver-api/pci/p2pdma.rst
> +++ b/Documentation/driver-api/pci/p2pdma.rst
> @@ -9,22 +9,47 @@ between two devices on the bus. This type of transaction is henceforth
> called Peer-to-Peer (or P2P). However, there are a number of issues that
> make P2P transactions tricky to do in a perfectly safe way.
>
> -One of the biggest issues is that PCI doesn't require forwarding
> -transactions between hierarchy domains, and in PCIe, each Root Port
> -defines a separate hierarchy domain. To make things worse, there is no
> -simple way to determine if a given Root Complex supports this or not.
> -(See PCIe r4.0, sec 1.3.1). Therefore, as of this writing, the kernel
> -only supports doing P2P when the endpoints involved are all behind the
> -same PCI bridge, as such devices are all in the same PCI hierarchy
> -domain, and the spec guarantees that all transactions within the
> -hierarchy will be routable, but it does not require routing
> -between hierarchies.
> -
> -The second issue is that to make use of existing interfaces in Linux,
> -memory that is used for P2P transactions needs to be backed by struct
> -pages. However, PCI BARs are not typically cache coherent so there are
> -a few corner case gotchas with these pages so developers need to
> -be careful about what they do with them.
> +For PCIe the routing of Transaction Layer Packets (TLPs) is well-defined up
> +until they reach a host bridge or root port. If the path includes PCIe switches
> +then based on the ACS settings the transaction can route entirely within
> +the PCIe hierarchy and never reach the root port. The kernel will evaluate
> +the PCIe topology and always permit P2P in these well-defined cases.
> +
> +However, if the P2P transaction reaches the host bridge then it might have to
> +hairpin back out the same root port, be routed inside the CPU SOC to another
> +PCIe root port, or routed internally to the SOC.
Please keep the reference to the PCIe specification where that behavior is defined somewhere here. E.g. "See PCIe r4.0, sec 1.3.1".
> +
> +As this is not well-defined or well-supported in real HW the kernel defaults to
> +blocking such routing. There is an allow list to allow detecting known-good HW,
> +in which case P2P between any two PCIe devices will be permitted.
That section sounds not correct to me. This is well supported in current HW, it's just not defined in some official specification.
> +
> +Since P2P inherently is doing transactions between two devices it requires two
> +drivers to be co-operating inside the kernel. The providing driver has to convey
> +its MMIO to the consuming driver. To meet the driver model lifecycle rules the
> +MMIO must have all DMA mapping removed, all CPU accesses prevented, all page
> +table mappings undone before the providing driver completes remove().
> +
> +This requires the providing and consuming driver to actively work together to
> +guarantee that the consuming driver has stopped using the MMIO during a removal
> +cycle. This is done by either a synchronous invalidation shutdown or waiting
> +for all usage refcounts to reach zero.
> +
> +At the lowest level the P2P subsystem offers a naked struct p2p_provider that
> +delegates lifecycle management to the providing driver. It is expected that
> +drivers using this option will wrap their MMIO memory in DMABUF and use DMABUF
> +to provide an invalidation shutdown.
> These MMIO pages have no struct page, and
Well please drop "pages" here. Just say MMIO addresses.
> +if used with mmap() must create special PTEs. As such there are very few
> +kernel uAPIs that can accept pointers to them; in particular they cannot be used
> +with read()/write(), including O_DIRECT.
> +
> +Building on this, the subsystem offers a layer to wrap the MMIO in a ZONE_DEVICE
> +pgmap of MEMORY_DEVICE_PCI_P2PDMA to create struct pages. The lifecycle of
> +pgmap ensures that when the pgmap is destroyed all other drivers have stopped
> +using the MMIO. This option works with O_DIRECT flows, in some cases, if the
> +underlying subsystem supports handling MEMORY_DEVICE_PCI_P2PDMA through
> +FOLL_PCI_P2PDMA. The use of FOLL_LONGTERM is prevented. As this relies on pgmap
> +it also relies on architecture support along with alignment and minimum size
> +limitations.
Actually that is up to the exporter of the DMA-buf what approach is used.
For the P2PDMA API it should be irrelevant if struct pages are used or not.
So I think you should potentially completely drop that description here.
>
>
> Driver Writer's Guide
> @@ -114,14 +139,38 @@ allocating scatter-gather lists with P2P memory.
> Struct Page Caveats
> -------------------
>
> -Driver writers should be very careful about not passing these special
> -struct pages to code that isn't prepared for it. At this time, the kernel
> -interfaces do not have any checks for ensuring this. This obviously
> -precludes passing these pages to userspace.
> +While the MEMORY_DEVICE_PCI_P2PDMA pages can be installed in VMAs,
> +pin_user_pages() and related will not return them unless FOLL_PCI_P2PDMA is set.
>
> -P2P memory is also technically IO memory but should never have any side
> -effects behind it. Thus, the order of loads and stores should not be important
> -and ioreadX(), iowriteX() and friends should not be necessary.
> +The MEMORY_DEVICE_PCI_P2PDMA pages require care to support in the kernel. The
> +KVA is still MMIO and must still be accessed through the normal
> +readX()/writeX()/etc helpers. Direct CPU access (e.g. memcpy) is forbidden, just
> +like any other MMIO mapping. While this will actually work on some
> +architectures, others will experience corruption or just crash in the kernel.
> +Supporting FOLL_PCI_P2PDMA in a subsystem requires scrubbing it to ensure no CPU
> +access happens.
> +
> +
> +Usage With DMABUF
> +=================
> +
> +DMABUF provides an alternative to the above struct page-based
> +client/provider/orchestrator system. In this mode the exporting driver will wrap
> +some of its MMIO in a DMABUF and give the DMABUF FD to userspace.
> +
> +Userspace can then pass the FD to an importing driver which will ask the
> +exporting driver to map it.
"to map it to the importer".
Regards,
Christian.
> +
> +In this case the initiator and target pci_devices are known and the P2P subsystem
> +is used to determine the mapping type. The phys_addr_t-based DMA API is used to
> +establish the dma_addr_t.
> +
> +Lifecycle is controlled by DMABUF move_notify(). When the exporting driver wants
> +to remove() it must deliver an invalidation shutdown to all DMABUF importing
> +drivers through move_notify() and synchronously DMA unmap all the MMIO.
> +
> +No importing driver can continue to have a DMA map to the MMIO after the
> +exporting driver has destroyed its p2p_provider.
>
>
> P2P DMA Support Library
>
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v8 05/11] PCI/P2PDMA: Document DMABUF model
2025-11-19 9:18 ` Christian König
@ 2025-11-19 13:13 ` Leon Romanovsky
2025-11-19 13:35 ` Jason Gunthorpe
1 sibling, 0 replies; 63+ messages in thread
From: Leon Romanovsky @ 2025-11-19 13:13 UTC (permalink / raw)
To: Christian König
Cc: Bjorn Helgaas, Logan Gunthorpe, Jens Axboe, Robin Murphy,
Joerg Roedel, Will Deacon, Marek Szyprowski, Jason Gunthorpe,
Andrew Morton, Jonathan Corbet, Sumit Semwal, Kees Cook,
Gustavo A. R. Silva, Ankit Agrawal, Yishai Hadas,
Shameer Kolothum, Kevin Tian, Alex Williamson, Krishnakant Jaju,
Matt Ochs, linux-pci, linux-kernel, linux-block, iommu, linux-mm,
linux-doc, linux-media, dri-devel, linaro-mm-sig, kvm,
linux-hardening
On Wed, Nov 19, 2025 at 10:18:08AM +0100, Christian König wrote:
>
>
> On 11/11/25 10:57, Leon Romanovsky wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> >
> > Reflect latest changes in p2p implementation to support DMABUF lifecycle.
> >
> > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> > ---
> > Documentation/driver-api/pci/p2pdma.rst | 95 +++++++++++++++++++++++++--------
> > 1 file changed, 72 insertions(+), 23 deletions(-)
<...>
> > These MMIO pages have no struct page, and
>
> Well please drop "pages" here. Just say MMIO addresses.
>
> > +if used with mmap() must create special PTEs. As such there are very few
> > +kernel uAPIs that can accept pointers to them; in particular they cannot be used
> > +with read()/write(), including O_DIRECT.
<...>
> > +DMABUF provides an alternative to the above struct page-based
> > +client/provider/orchestrator system. In this mode the exporting driver will wrap
> > +some of its MMIO in a DMABUF and give the DMABUF FD to userspace.
> > +
> > +Userspace can then pass the FD to an importing driver which will ask the
> > +exporting driver to map it.
>
> "to map it to the importer".
No problem, changed.
>
> Regards,
> Christian.
>
> > +
> > +In this case the initiator and target pci_devices are known and the P2P subsystem
> > +is used to determine the mapping type. The phys_addr_t-based DMA API is used to
> > +establish the dma_addr_t.
> > +
> > +Lifecycle is controlled by DMABUF move_notify(). When the exporting driver wants
> > +to remove() it must deliver an invalidation shutdown to all DMABUF importing
> > +drivers through move_notify() and synchronously DMA unmap all the MMIO.
> > +
> > +No importing driver can continue to have a DMA map to the MMIO after the
> > +exporting driver has destroyed its p2p_provider.
> >
> >
> > P2P DMA Support Library
> >
>
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v8 05/11] PCI/P2PDMA: Document DMABUF model
2025-11-19 9:18 ` Christian König
2025-11-19 13:13 ` Leon Romanovsky
@ 2025-11-19 13:35 ` Jason Gunthorpe
2025-11-19 14:06 ` Christian König
1 sibling, 1 reply; 63+ messages in thread
From: Jason Gunthorpe @ 2025-11-19 13:35 UTC (permalink / raw)
To: Christian König
Cc: Leon Romanovsky, Bjorn Helgaas, Logan Gunthorpe, Jens Axboe,
Robin Murphy, Joerg Roedel, Will Deacon, Marek Szyprowski,
Andrew Morton, Jonathan Corbet, Sumit Semwal, Kees Cook,
Gustavo A. R. Silva, Ankit Agrawal, Yishai Hadas,
Shameer Kolothum, Kevin Tian, Alex Williamson, Krishnakant Jaju,
Matt Ochs, linux-pci, linux-kernel, linux-block, iommu, linux-mm,
linux-doc, linux-media, dri-devel, linaro-mm-sig, kvm,
linux-hardening
On Wed, Nov 19, 2025 at 10:18:08AM +0100, Christian König wrote:
> > +As this is not well-defined or well-supported in real HW the kernel defaults to
> > +blocking such routing. There is an allow list to allow detecting known-good HW,
> > +in which case P2P between any two PCIe devices will be permitted.
>
> That section sounds not correct to me.
It is correct in that it describes what the kernel does right now.
See calc_map_type_and_dist(), host_bridge_whitelist(), cpu_supports_p2pdma().
> This is well supported in current HW, it's just not defined in some
> official specification.
Only AMD HW.
Intel HW is a bit hit and miss.
ARM SOCs are frequently not supporting even on server CPUs.
> > +At the lowest level the P2P subsystem offers a naked struct p2p_provider that
> > +delegates lifecycle management to the providing driver. It is expected that
> > +drivers using this option will wrap their MMIO memory in DMABUF and use DMABUF
> > +to provide an invalidation shutdown.
>
> > These MMIO pages have no struct page, and
>
> Well please drop "pages" here. Just say MMIO addresses.
"These MMIO addresses have no struct page, and"
> > +Building on this, the subsystem offers a layer to wrap the MMIO in a ZONE_DEVICE
> > +pgmap of MEMORY_DEVICE_PCI_P2PDMA to create struct pages. The lifecycle of
> > +pgmap ensures that when the pgmap is destroyed all other drivers have stopped
> > +using the MMIO. This option works with O_DIRECT flows, in some cases, if the
> > +underlying subsystem supports handling MEMORY_DEVICE_PCI_P2PDMA through
> > +FOLL_PCI_P2PDMA. The use of FOLL_LONGTERM is prevented. As this relies on pgmap
> > +it also relies on architecture support along with alignment and minimum size
> > +limitations.
>
> Actually that is up to the exporter of the DMA-buf what approach is used.
The above is not talking about DMA-buf, it is describing the existing
interface that is all struct page based. The driver invoking the
P2PDMA APIs gets to pick if it uses the stuct page based interface
described above or the lower level p2p provider interface this series
introduces.
> For the P2PDMA API it should be irrelevant if struct pages are used or not.
Only for the lowest level p2p provider based P2PDMA API - there is a
higher level API family within P2PDMA's API that is all about creating
and managing ZONE_DEVICE struct pages and a pgmap, the above describes
that family.
Thanks,
Jason
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v8 05/11] PCI/P2PDMA: Document DMABUF model
2025-11-19 13:35 ` Jason Gunthorpe
@ 2025-11-19 14:06 ` Christian König
2025-11-19 19:45 ` Jason Gunthorpe
0 siblings, 1 reply; 63+ messages in thread
From: Christian König @ 2025-11-19 14:06 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Leon Romanovsky, Bjorn Helgaas, Logan Gunthorpe, Jens Axboe,
Robin Murphy, Joerg Roedel, Will Deacon, Marek Szyprowski,
Andrew Morton, Jonathan Corbet, Sumit Semwal, Kees Cook,
Gustavo A. R. Silva, Ankit Agrawal, Yishai Hadas,
Shameer Kolothum, Kevin Tian, Alex Williamson, Krishnakant Jaju,
Matt Ochs, linux-pci, linux-kernel, linux-block, iommu, linux-mm,
linux-doc, linux-media, dri-devel, linaro-mm-sig, kvm,
linux-hardening
On 11/19/25 14:35, Jason Gunthorpe wrote:
> On Wed, Nov 19, 2025 at 10:18:08AM +0100, Christian König wrote:
>>> +As this is not well-defined or well-supported in real HW the kernel defaults to
>>> +blocking such routing. There is an allow list to allow detecting known-good HW,
>>> +in which case P2P between any two PCIe devices will be permitted.
>>
>> That section sounds not correct to me.
>
> It is correct in that it describes what the kernel does right now.
>
> See calc_map_type_and_dist(), host_bridge_whitelist(), cpu_supports_p2pdma().
Well I'm the one who originally suggested that whitelist and the description still doesn't sound correct to me.
I would write something like "The PCIe specification doesn't define the forwarding of transactions between hierarchy domains...."
The previous text was actually much better than this summary since now it leaves out the important information where all of this is comes from.
What the kernel does can be figure out by reading the code, but we need to describe why it does it.
>
>> This is well supported in current HW, it's just not defined in some
>> official specification.
>
> Only AMD HW.
>
> Intel HW is a bit hit and miss.
>
> ARM SOCs are frequently not supporting even on server CPUs.
IIRC ARM actually has a validation program for this, but I've forgotten the name of it again.
Randy should know the name of it and I think mentioning the status of the vendors here would be a good idea.
>>> +At the lowest level the P2P subsystem offers a naked struct p2p_provider that
>>> +delegates lifecycle management to the providing driver. It is expected that
>>> +drivers using this option will wrap their MMIO memory in DMABUF and use DMABUF
>>> +to provide an invalidation shutdown.
>>
>>> These MMIO pages have no struct page, and
>>
>> Well please drop "pages" here. Just say MMIO addresses.
>
> "These MMIO addresses have no struct page, and"
+1
>
>>> +Building on this, the subsystem offers a layer to wrap the MMIO in a ZONE_DEVICE
>>> +pgmap of MEMORY_DEVICE_PCI_P2PDMA to create struct pages. The lifecycle of
>>> +pgmap ensures that when the pgmap is destroyed all other drivers have stopped
>>> +using the MMIO. This option works with O_DIRECT flows, in some cases, if the
>>> +underlying subsystem supports handling MEMORY_DEVICE_PCI_P2PDMA through
>>> +FOLL_PCI_P2PDMA. The use of FOLL_LONGTERM is prevented. As this relies on pgmap
>>> +it also relies on architecture support along with alignment and minimum size
>>> +limitations.
>>
>> Actually that is up to the exporter of the DMA-buf what approach is used.
>
> The above is not talking about DMA-buf, it is describing the existing
> interface that is all struct page based. The driver invoking the
> P2PDMA APIs gets to pick if it uses the struct page based interface
> described above or the lower level p2p provider interface this series
> introduces.
>
>> For the P2PDMA API it should be irrelevant if struct pages are used or not.
>
> Only for the lowest level p2p provider based P2PDMA API - there is a
> higher level API family within P2PDMA's API that is all about creating
> and managing ZONE_DEVICE struct pages and a pgmap, the above describes
> that family.
I completely agree with all of this, but that's not what I meant.
The documentation makes it sound like DMA-buf is limited to not using struct pages and direct I/O, but that is not true.
You can have DMA-bufs backed by pages, both system memory and zone device pages.
But DMA-buf can also handle PCIe MMIO BARs which are micro controller doorbells or even classical HW registers.
Regards,
Christian.
>
> Thanks,
> Jason
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v8 05/11] PCI/P2PDMA: Document DMABUF model
2025-11-19 14:06 ` Christian König
@ 2025-11-19 19:45 ` Jason Gunthorpe
2025-11-19 20:45 ` Leon Romanovsky
0 siblings, 1 reply; 63+ messages in thread
From: Jason Gunthorpe @ 2025-11-19 19:45 UTC (permalink / raw)
To: Christian König
Cc: Leon Romanovsky, Bjorn Helgaas, Logan Gunthorpe, Jens Axboe,
Robin Murphy, Joerg Roedel, Will Deacon, Marek Szyprowski,
Andrew Morton, Jonathan Corbet, Sumit Semwal, Kees Cook,
Gustavo A. R. Silva, Ankit Agrawal, Yishai Hadas,
Shameer Kolothum, Kevin Tian, Alex Williamson, Krishnakant Jaju,
Matt Ochs, linux-pci, linux-kernel, linux-block, iommu, linux-mm,
linux-doc, linux-media, dri-devel, linaro-mm-sig, kvm,
linux-hardening
On Wed, Nov 19, 2025 at 03:06:18PM +0100, Christian König wrote:
> On 11/19/25 14:35, Jason Gunthorpe wrote:
> > On Wed, Nov 19, 2025 at 10:18:08AM +0100, Christian König wrote:
> >>> +As this is not well-defined or well-supported in real HW the kernel defaults to
> >>> +blocking such routing. There is an allow list to allow detecting known-good HW,
> >>> +in which case P2P between any two PCIe devices will be permitted.
> >>
> >> That section sounds not correct to me.
> >
> > It is correct in that it describes what the kernel does right now.
> >
> > See calc_map_type_and_dist(), host_bridge_whitelist(), cpu_supports_p2pdma().
>
> Well I'm the one who originally suggested that whitelist and the description still doesn't sound correct to me.
>
> I would write something like "The PCIe specification doesn't define the forwarding of transactions between hierarchy domains...."
Ok
> The previous text was actually much better than this summary since
> now it leaves out the important information about where all of this
> comes from.
Well, IMHO, it doesn't "come from" anywhere; this is all
implementation-specific behavior.
> > ARM SOCs are frequently not supporting even on server CPUs.
>
> IIRC ARM actually has a validation program for this, but I've forgotten the name of it again.
I suspect you mean SBSA, and I know at least one new SBSA approved
chip that doesn't have working P2P through the host bridge.. :(
> Randy should know the name of it and I think mentioning the status
> of the vendors here would be a good idea.
I think referring to the kernel code is best for what is currently permitted.
> The documentation makes it sound like DMA-buf is limited to not
> using struct pages and direct I/O, but that is not true.
Okay, I see what you mean, the intention was to be very strong and say
if you are not using struct pages then you must be using DMABUF or
something like it to control lifetime. Not to say that was the only
way DMABUF can be used.
Leon let's try to clarify that a bit more
Jason
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v8 05/11] PCI/P2PDMA: Document DMABUF model
2025-11-19 19:45 ` Jason Gunthorpe
@ 2025-11-19 20:45 ` Leon Romanovsky
0 siblings, 0 replies; 63+ messages in thread
From: Leon Romanovsky @ 2025-11-19 20:45 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Christian König, Bjorn Helgaas, Logan Gunthorpe, Jens Axboe,
Robin Murphy, Joerg Roedel, Will Deacon, Marek Szyprowski,
Andrew Morton, Jonathan Corbet, Sumit Semwal, Kees Cook,
Gustavo A. R. Silva, Ankit Agrawal, Yishai Hadas,
Shameer Kolothum, Kevin Tian, Alex Williamson, Krishnakant Jaju,
Matt Ochs, linux-pci, linux-kernel, linux-block, iommu, linux-mm,
linux-doc, linux-media, dri-devel, linaro-mm-sig, kvm,
linux-hardening
On Wed, Nov 19, 2025 at 03:45:06PM -0400, Jason Gunthorpe wrote:
> On Wed, Nov 19, 2025 at 03:06:18PM +0100, Christian König wrote:
> > On 11/19/25 14:35, Jason Gunthorpe wrote:
> > > On Wed, Nov 19, 2025 at 10:18:08AM +0100, Christian König wrote:
> > >>> +As this is not well-defined or well-supported in real HW the kernel defaults to
> > >>> +blocking such routing. There is an allow list to allow detecting known-good HW,
> > >>> +in which case P2P between any two PCIe devices will be permitted.
<...>
> > The documentation makes it sound like DMA-buf is limited to not
> > using struct pages and direct I/O, but that is not true.
>
> Okay, I see what you mean, the intention was to be very strong and say
> if you are not using struct pages then you must be using DMABUF or
> something like it to control lifetime. Not to say that was the only
> way DMABUF can be used.
>
> Leon let's try to clarify that a bit more
diff --git a/Documentation/driver-api/pci/p2pdma.rst b/Documentation/driver-api/pci/p2pdma.rst
index 32e9b691508b..280673b50350 100644
--- a/Documentation/driver-api/pci/p2pdma.rst
+++ b/Documentation/driver-api/pci/p2pdma.rst
@@ -156,7 +156,8 @@ Usage With DMABUF
=================
DMABUF provides an alternative to the above struct page-based
-client/provider/orchestrator system. In this mode the exporting driver will wrap
+client/provider/orchestrator system and should be used when struct page
+doesn't exist. In this mode the exporting driver will wrap
some of its MMIO in a DMABUF and give the DMABUF FD to userspace.
Userspace can then pass the FD to an importing driver which will ask the
>
> Jason
^ permalink raw reply related [flat|nested] 63+ messages in thread
* [PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine
2025-11-11 9:57 [PATCH v8 00/11] vfio/pci: Allow MMIO regions to be exported through dma-buf Leon Romanovsky
` (4 preceding siblings ...)
2025-11-11 9:57 ` [PATCH v8 05/11] PCI/P2PDMA: Document DMABUF model Leon Romanovsky
@ 2025-11-11 9:57 ` Leon Romanovsky
2025-11-18 23:02 ` Jason Gunthorpe
` (3 more replies)
2025-11-11 9:57 ` [PATCH v8 07/11] vfio: Export vfio device get and put registration helpers Leon Romanovsky
` (4 subsequent siblings)
10 siblings, 4 replies; 63+ messages in thread
From: Leon Romanovsky @ 2025-11-11 9:57 UTC (permalink / raw)
To: Bjorn Helgaas, Logan Gunthorpe, Jens Axboe, Robin Murphy,
Joerg Roedel, Will Deacon, Marek Szyprowski, Jason Gunthorpe,
Leon Romanovsky, Andrew Morton, Jonathan Corbet, Sumit Semwal,
Christian König, Kees Cook, Gustavo A. R. Silva,
Ankit Agrawal, Yishai Hadas, Shameer Kolothum, Kevin Tian,
Alex Williamson
Cc: Krishnakant Jaju, Matt Ochs, linux-pci, linux-kernel, linux-block,
iommu, linux-mm, linux-doc, linux-media, dri-devel, linaro-mm-sig,
kvm, linux-hardening, Alex Mastro, Nicolin Chen
From: Leon Romanovsky <leonro@nvidia.com>
Add dma_buf_map() and dma_buf_unmap() helpers to convert an array of
MMIO physical address ranges into scatter-gather tables with proper
DMA mapping.
These common functions are a starting point and support any PCI
drivers creating mappings from their BAR's MMIO addresses. VFIO is one
case, as shortly will be RDMA. We can review existing DRM drivers to
refactor them separately. We hope this will evolve into routines to
help common DRM that include mixed CPU and MMIO mappings.
Compared to the dma_map_resource() abuse this implementation handles
the complicated PCI P2P scenarios properly, especially when an IOMMU
is enabled:
- Direct bus address mapping without IOVA allocation for
PCI_P2PDMA_MAP_BUS_ADDR, using pci_p2pdma_bus_addr_map(). This
happens if the IOMMU is enabled but the PCIe switch ACS flags allow
transactions to avoid the host bridge.
Further, this handles the slightly obscure case of MMIO with a
phys_addr_t that is different from the physical BAR programming
(bus offset). The phys_addr_t is converted to a dma_addr_t and
accommodates this effect. This enables certain real systems to
work, especially on ARM platforms.
- Mapping through host bridge with IOVA allocation and DMA_ATTR_MMIO
attribute for MMIO memory regions (PCI_P2PDMA_MAP_THRU_HOST_BRIDGE).
This happens when the IOMMU is enabled and the ACS flags are forcing
all traffic to the IOMMU - ie for virtualization systems.
- Cases where P2P is not supported through the host bridge/CPU. The
P2P subsystem is the proper place to detect this and block it.
Helper functions fill_sg_entry() and calc_sg_nents() handle the
scatter-gather table construction, splitting large regions into
UINT_MAX-sized chunks to fit within sg->length field limits.
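For example (numbers purely for illustration), a single contiguous 9 GiB
range comes out as three scatterlist entries:

        nents = DIV_ROUND_UP(9663676416ULL, UINT_MAX);  /* 9 GiB -> 3 */
        /* sg_dma_len per entry: 4294967295, 4294967295, 1073741826 bytes */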
Since the physical address based DMA API forbids use of the CPU list
of the scatterlist this will produce a mangled scatterlist that has
a fully zero-length and NULL'd CPU list. The list is 0 length,
all the struct page pointers are NULL and zero sized. This is stronger
and more robust than the existing mangle_sg_table() technique. It is
a future project to migrate DMABUF as a subsystem away from using
scatterlist for this data structure.
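As a usage illustration (not part of this patch), an exporter that keeps its
struct p2pdma_provider and the MMIO ranges of the exported BAR slice in a
private structure could wrap these helpers in its dma_buf_ops roughly like
this; struct my_exporter_priv and the my_exporter_*() names are made up for
the example:

/* Illustrative exporter-side wrapper around dma_buf_map()/dma_buf_unmap() */
struct my_exporter_priv {
        struct p2pdma_provider *provider;
        struct dma_buf_phys_vec *phys_vec; /* MMIO ranges of the BAR slice */
        size_t nr_ranges;
        size_t size;                       /* sum of all phys_vec lengths */
};

static struct sg_table *
my_exporter_map_dma_buf(struct dma_buf_attachment *attach,
                        enum dma_data_direction dir)
{
        struct my_exporter_priv *priv = attach->dmabuf->priv;

        /* Called with the dma-buf reservation lock held */
        return dma_buf_map(attach, priv->provider, priv->phys_vec,
                           priv->nr_ranges, priv->size, dir);
}

static void my_exporter_unmap_dma_buf(struct dma_buf_attachment *attach,
                                      struct sg_table *sgt,
                                      enum dma_data_direction dir)
{
        dma_buf_unmap(attach, sgt, dir);
}

The importer then walks only the DMA side of the returned table with
for_each_sgtable_dma_sg(); the CPU side is intentionally empty
(orig_nents == 0).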
Tested-by: Alex Mastro <amastro@fb.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
drivers/dma-buf/dma-buf.c | 235 ++++++++++++++++++++++++++++++++++++++++++++++
include/linux/dma-buf.h | 18 ++++
2 files changed, 253 insertions(+)
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 2bcf9ceca997..cb55dff1dad5 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -1254,6 +1254,241 @@ void dma_buf_unmap_attachment_unlocked(struct dma_buf_attachment *attach,
}
EXPORT_SYMBOL_NS_GPL(dma_buf_unmap_attachment_unlocked, "DMA_BUF");
+static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length,
+ dma_addr_t addr)
+{
+ unsigned int len, nents;
+ int i;
+
+ nents = DIV_ROUND_UP(length, UINT_MAX);
+ for (i = 0; i < nents; i++) {
+ len = min_t(size_t, length, UINT_MAX);
+ length -= len;
+ /*
+ * DMABUF abuses scatterlist to create a scatterlist
+ * that does not have any CPU list, only the DMA list.
+ * Always set the page related values to NULL to ensure
+ * importers can't use it. The phys_addr based DMA API
+ * does not require the CPU list for mapping or unmapping.
+ */
+ sg_set_page(sgl, NULL, 0, 0);
+ sg_dma_address(sgl) = addr + i * UINT_MAX;
+ sg_dma_len(sgl) = len;
+ sgl = sg_next(sgl);
+ }
+
+ return sgl;
+}
+
+static unsigned int calc_sg_nents(struct dma_iova_state *state,
+ struct dma_buf_phys_vec *phys_vec,
+ size_t nr_ranges, size_t size)
+{
+ unsigned int nents = 0;
+ size_t i;
+
+ if (!state || !dma_use_iova(state)) {
+ for (i = 0; i < nr_ranges; i++)
+ nents += DIV_ROUND_UP(phys_vec[i].len, UINT_MAX);
+ } else {
+ /*
+ * In IOVA case, there is only one SG entry which spans
+ * for whole IOVA address space, but we need to make sure
+ * that it fits sg->length, maybe we need more.
+ */
+ nents = DIV_ROUND_UP(size, UINT_MAX);
+ }
+
+ return nents;
+}
+
+/**
+ * struct dma_buf_dma - holds DMA mapping information
+ * @sgt: Scatter-gather table
+ * @state: DMA IOVA state relevant in IOMMU-based DMA
+ * @size: Total size of DMA transfer
+ */
+struct dma_buf_dma {
+ struct sg_table sgt;
+ struct dma_iova_state *state;
+ size_t size;
+};
+
+/**
+ * dma_buf_map - Returns the scatterlist table of the attachment from arrays
+ * of physical vectors. This function is intended for MMIO memory only.
+ * @attach: [in] attachment whose scatterlist is to be returned
+ * @provider: [in] p2pdma provider
+ * @phys_vec: [in] array of physical vectors
+ * @nr_ranges: [in] number of entries in phys_vec array
+ * @size: [in] total size of phys_vec
+ * @dir: [in] direction of DMA transfer
+ *
+ * Returns sg_table containing the scatterlist to be returned; returns ERR_PTR
+ * on error. May return -EINTR if it is interrupted by a signal.
+ *
+ * On success, the DMA addresses and lengths in the returned scatterlist are
+ * PAGE_SIZE aligned.
+ *
+ * A mapping must be unmapped by using dma_buf_unmap().
+ */
+struct sg_table *dma_buf_map(struct dma_buf_attachment *attach,
+ struct p2pdma_provider *provider,
+ struct dma_buf_phys_vec *phys_vec,
+ size_t nr_ranges, size_t size,
+ enum dma_data_direction dir)
+{
+ unsigned int nents, mapped_len = 0;
+ struct dma_buf_dma *dma;
+ struct scatterlist *sgl;
+ dma_addr_t addr;
+ size_t i;
+ int ret;
+
+ dma_resv_assert_held(attach->dmabuf->resv);
+
+ if (WARN_ON(!attach || !attach->dmabuf || !provider))
+ /* This function is supposed to work on MMIO memory only */
+ return ERR_PTR(-EINVAL);
+
+ dma = kzalloc(sizeof(*dma), GFP_KERNEL);
+ if (!dma)
+ return ERR_PTR(-ENOMEM);
+
+ switch (pci_p2pdma_map_type(provider, attach->dev)) {
+ case PCI_P2PDMA_MAP_BUS_ADDR:
+ /*
+ * There is no need in IOVA at all for this flow.
+ */
+ break;
+ case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
+ dma->state = kzalloc(sizeof(*dma->state), GFP_KERNEL);
+ if (!dma->state) {
+ ret = -ENOMEM;
+ goto err_free_dma;
+ }
+
+ dma_iova_try_alloc(attach->dev, dma->state, 0, size);
+ break;
+ default:
+ ret = -EINVAL;
+ goto err_free_dma;
+ }
+
+ nents = calc_sg_nents(dma->state, phys_vec, nr_ranges, size);
+ ret = sg_alloc_table(&dma->sgt, nents, GFP_KERNEL | __GFP_ZERO);
+ if (ret)
+ goto err_free_state;
+
+ sgl = dma->sgt.sgl;
+
+ for (i = 0; i < nr_ranges; i++) {
+ if (!dma->state) {
+ addr = pci_p2pdma_bus_addr_map(provider,
+ phys_vec[i].paddr);
+ } else if (dma_use_iova(dma->state)) {
+ ret = dma_iova_link(attach->dev, dma->state,
+ phys_vec[i].paddr, 0,
+ phys_vec[i].len, dir,
+ DMA_ATTR_MMIO);
+ if (ret)
+ goto err_unmap_dma;
+
+ mapped_len += phys_vec[i].len;
+ } else {
+ addr = dma_map_phys(attach->dev, phys_vec[i].paddr,
+ phys_vec[i].len, dir,
+ DMA_ATTR_MMIO);
+ ret = dma_mapping_error(attach->dev, addr);
+ if (ret)
+ goto err_unmap_dma;
+ }
+
+ if (!dma->state || !dma_use_iova(dma->state))
+ sgl = fill_sg_entry(sgl, phys_vec[i].len, addr);
+ }
+
+ if (dma->state && dma_use_iova(dma->state)) {
+ WARN_ON_ONCE(mapped_len != size);
+ ret = dma_iova_sync(attach->dev, dma->state, 0, mapped_len);
+ if (ret)
+ goto err_unmap_dma;
+
+ sgl = fill_sg_entry(sgl, mapped_len, dma->state->addr);
+ }
+
+ dma->size = size;
+
+ /*
+ * No CPU list included — set orig_nents = 0 so others can detect
+ * this via SG table (use nents only).
+ */
+ dma->sgt.orig_nents = 0;
+
+
+ /*
+ * SGL must be NULL to indicate that SGL is the last one
+ * and we allocated correct number of entries in sg_alloc_table()
+ */
+ WARN_ON_ONCE(sgl);
+ return &dma->sgt;
+
+err_unmap_dma:
+ if (!i || !dma->state) {
+ ; /* Do nothing */
+ } else if (dma_use_iova(dma->state)) {
+ dma_iova_destroy(attach->dev, dma->state, mapped_len, dir,
+ DMA_ATTR_MMIO);
+ } else {
+ for_each_sgtable_dma_sg(&dma->sgt, sgl, i)
+ dma_unmap_phys(attach->dev, sg_dma_address(sgl),
+ sg_dma_len(sgl), dir, DMA_ATTR_MMIO);
+ }
+ sg_free_table(&dma->sgt);
+err_free_state:
+ kfree(dma->state);
+err_free_dma:
+ kfree(dma);
+ return ERR_PTR(ret);
+}
+EXPORT_SYMBOL_NS_GPL(dma_buf_map, "DMA_BUF");
+
+/**
+ * dma_buf_unmap - unmaps the buffer
+ * @attach: [in] attachment to unmap buffer from
+ * @sgt: [in] scatterlist info of the buffer to unmap
+ * @dir: [in] direction of DMA transfer
+ *
+ * This unmaps a DMA mapping for @attach obtained by dma_buf_map().
+ */
+void dma_buf_unmap(struct dma_buf_attachment *attach, struct sg_table *sgt,
+ enum dma_data_direction dir)
+{
+ struct dma_buf_dma *dma = container_of(sgt, struct dma_buf_dma, sgt);
+ int i;
+
+ dma_resv_assert_held(attach->dmabuf->resv);
+
+ if (!dma->state) {
+ ; /* Do nothing */
+ } else if (dma_use_iova(dma->state)) {
+ dma_iova_destroy(attach->dev, dma->state, dma->size, dir,
+ DMA_ATTR_MMIO);
+ } else {
+ struct scatterlist *sgl;
+
+ for_each_sgtable_dma_sg(sgt, sgl, i)
+ dma_unmap_phys(attach->dev, sg_dma_address(sgl),
+ sg_dma_len(sgl), dir, DMA_ATTR_MMIO);
+ }
+
+ sg_free_table(sgt);
+ kfree(dma->state);
+ kfree(dma);
+
+}
+EXPORT_SYMBOL_NS_GPL(dma_buf_unmap, "DMA_BUF");
+
/**
* dma_buf_move_notify - notify attachments that DMA-buf is moving
*
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index d58e329ac0e7..545ba27a5040 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -22,6 +22,7 @@
#include <linux/fs.h>
#include <linux/dma-fence.h>
#include <linux/wait.h>
+#include <linux/pci-p2pdma.h>
struct device;
struct dma_buf;
@@ -530,6 +531,16 @@ struct dma_buf_export_info {
void *priv;
};
+/**
+ * struct dma_buf_phys_vec - describe continuous chunk of memory
+ * @paddr: physical address of that chunk
+ * @len: Length of this chunk
+ */
+struct dma_buf_phys_vec {
+ phys_addr_t paddr;
+ size_t len;
+};
+
/**
* DEFINE_DMA_BUF_EXPORT_INFO - helper macro for exporters
* @name: export-info name
@@ -609,4 +620,11 @@ int dma_buf_vmap_unlocked(struct dma_buf *dmabuf, struct iosys_map *map);
void dma_buf_vunmap_unlocked(struct dma_buf *dmabuf, struct iosys_map *map);
struct dma_buf *dma_buf_iter_begin(void);
struct dma_buf *dma_buf_iter_next(struct dma_buf *dmbuf);
+struct sg_table *dma_buf_map(struct dma_buf_attachment *attach,
+ struct p2pdma_provider *provider,
+ struct dma_buf_phys_vec *phys_vec,
+ size_t nr_ranges, size_t size,
+ enum dma_data_direction dir);
+void dma_buf_unmap(struct dma_buf_attachment *attach, struct sg_table *sgt,
+ enum dma_data_direction dir);
#endif /* __DMA_BUF_H__ */
--
2.51.1
^ permalink raw reply related [flat|nested] 63+ messages in thread
* Re: [PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine
2025-11-11 9:57 ` [PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine Leon Romanovsky
@ 2025-11-18 23:02 ` Jason Gunthorpe
2025-11-19 0:06 ` Nicolin Chen
` (2 subsequent siblings)
3 siblings, 0 replies; 63+ messages in thread
From: Jason Gunthorpe @ 2025-11-18 23:02 UTC (permalink / raw)
To: Leon Romanovsky
Cc: Bjorn Helgaas, Logan Gunthorpe, Jens Axboe, Robin Murphy,
Joerg Roedel, Will Deacon, Marek Szyprowski, Andrew Morton,
Jonathan Corbet, Sumit Semwal, Christian König, Kees Cook,
Gustavo A. R. Silva, Ankit Agrawal, Yishai Hadas,
Shameer Kolothum, Kevin Tian, Alex Williamson, Krishnakant Jaju,
Matt Ochs, linux-pci, linux-kernel, linux-block, iommu, linux-mm,
linux-doc, linux-media, dri-devel, linaro-mm-sig, kvm,
linux-hardening, Alex Mastro, Nicolin Chen
On Tue, Nov 11, 2025 at 11:57:48AM +0200, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
>
> Add dma_buf_map() and dma_buf_unmap() helpers to convert an array of
> MMIO physical address ranges into scatter-gather tables with proper
> DMA mapping.
>
> These common functions are a starting point and support any PCI
> drivers creating mappings from their BAR's MMIO addresses. VFIO is one
> case, as shortly will be RDMA. We can review existing DRM drivers to
> refactor them separately. We hope this will evolve into routines to
> help common DRM that include mixed CPU and MMIO mappings.
>
> Compared to the dma_map_resource() abuse this implementation handles
> the complicated PCI P2P scenarios properly, especially when an IOMMU
> is enabled:
>
> - Direct bus address mapping without IOVA allocation for
> PCI_P2PDMA_MAP_BUS_ADDR, using pci_p2pdma_bus_addr_map(). This
> happens if the IOMMU is enabled but the PCIe switch ACS flags allow
> transactions to avoid the host bridge.
>
> Further, this handles the slightly obscure case of MMIO with a
> phys_addr_t that is different from the physical BAR programming
> (bus offset). The phys_addr_t is converted to a dma_addr_t and
> accommodates this effect. This enables certain real systems to
> work, especially on ARM platforms.
>
> - Mapping through host bridge with IOVA allocation and DMA_ATTR_MMIO
> attribute for MMIO memory regions (PCI_P2PDMA_MAP_THRU_HOST_BRIDGE).
> This happens when the IOMMU is enabled and the ACS flags are forcing
> all traffic to the IOMMU - ie for virtualization systems.
>
> - Cases where P2P is not supported through the host bridge/CPU. The
> P2P subsystem is the proper place to detect this and block it.
>
> Helper functions fill_sg_entry() and calc_sg_nents() handle the
> scatter-gather table construction, splitting large regions into
> UINT_MAX-sized chunks to fit within sg->length field limits.
>
> Since the physical address based DMA API forbids use of the CPU list
> of the scatterlist this will produce a mangled scatterlist that has
> a fully zero-length and NULL'd CPU list. The list is 0 length,
> all the struct page pointers are NULL and zero sized. This is stronger
> and more robust than the existing mangle_sg_table() technique. It is
> a future project to migrate DMABUF as a subsystem away from using
> scatterlist for this data structure.
>
> Tested-by: Alex Mastro <amastro@fb.com>
> Tested-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> ---
> drivers/dma-buf/dma-buf.c | 235 ++++++++++++++++++++++++++++++++++++++++++++++
> include/linux/dma-buf.h | 18 ++++
> 2 files changed, 253 insertions(+)
I've looked at this enough times now, the logic for DMA mapping and
the construction of the scatterlist is good:
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Jason
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine
2025-11-11 9:57 ` [PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine Leon Romanovsky
2025-11-18 23:02 ` Jason Gunthorpe
@ 2025-11-19 0:06 ` Nicolin Chen
2025-11-19 13:32 ` Leon Romanovsky
2025-11-19 5:54 ` Tian, Kevin
2025-11-19 13:16 ` [Linaro-mm-sig] " Christian König
3 siblings, 1 reply; 63+ messages in thread
From: Nicolin Chen @ 2025-11-19 0:06 UTC (permalink / raw)
To: Leon Romanovsky
Cc: Bjorn Helgaas, Logan Gunthorpe, Jens Axboe, Robin Murphy,
Joerg Roedel, Will Deacon, Marek Szyprowski, Jason Gunthorpe,
Andrew Morton, Jonathan Corbet, Sumit Semwal,
Christian König, Kees Cook, Gustavo A. R. Silva,
Ankit Agrawal, Yishai Hadas, Shameer Kolothum, Kevin Tian,
Alex Williamson, Krishnakant Jaju, Matt Ochs, linux-pci,
linux-kernel, linux-block, iommu, linux-mm, linux-doc,
linux-media, dri-devel, linaro-mm-sig, kvm, linux-hardening,
Alex Mastro
On Tue, Nov 11, 2025 at 11:57:48AM +0200, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
>
> Add dma_buf_map() and dma_buf_unmap() helpers to convert an array of
> MMIO physical address ranges into scatter-gather tables with proper
> DMA mapping.
>
> These common functions are a starting point and support any PCI
> drivers creating mappings from their BAR's MMIO addresses. VFIO is one
> case, as shortly will be RDMA. We can review existing DRM drivers to
> refactor them separately. We hope this will evolve into routines to
> help common DRM that include mixed CPU and MMIO mappings.
>
> Compared to the dma_map_resource() abuse this implementation handles
> the complicated PCI P2P scenarios properly, especially when an IOMMU
> is enabled:
>
> - Direct bus address mapping without IOVA allocation for
> PCI_P2PDMA_MAP_BUS_ADDR, using pci_p2pdma_bus_addr_map(). This
> happens if the IOMMU is enabled but the PCIe switch ACS flags allow
> transactions to avoid the host bridge.
>
> Further, this handles the slightly obscure case of MMIO with a
> phys_addr_t that is different from the physical BAR programming
> (bus offset). The phys_addr_t is converted to a dma_addr_t and
> accommodates this effect. This enables certain real systems to
> work, especially on ARM platforms.
>
> - Mapping through host bridge with IOVA allocation and DMA_ATTR_MMIO
> attribute for MMIO memory regions (PCI_P2PDMA_MAP_THRU_HOST_BRIDGE).
> This happens when the IOMMU is enabled and the ACS flags are forcing
> all traffic to the IOMMU - ie for virtualization systems.
>
> - Cases where P2P is not supported through the host bridge/CPU. The
> P2P subsystem is the proper place to detect this and block it.
>
> Helper functions fill_sg_entry() and calc_sg_nents() handle the
> scatter-gather table construction, splitting large regions into
> UINT_MAX-sized chunks to fit within sg->length field limits.
>
> Since the physical address based DMA API forbids use of the CPU list
> of the scatterlist this will produce a mangled scatterlist that has
> a fully zero-length and NULL'd CPU list. The list is 0 length,
> all the struct page pointers are NULL and zero sized. This is stronger
> and more robust than the existing mangle_sg_table() technique. It is
> a future project to migrate DMABUF as a subsystem away from using
> scatterlist for this data structure.
>
> Tested-by: Alex Mastro <amastro@fb.com>
> Tested-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
With a nit:
> +err_unmap_dma:
> + if (!i || !dma->state) {
> + ; /* Do nothing */
> + } else if (dma_use_iova(dma->state)) {
> + dma_iova_destroy(attach->dev, dma->state, mapped_len, dir,
> + DMA_ATTR_MMIO);
> + } else {
> + for_each_sgtable_dma_sg(&dma->sgt, sgl, i)
> + dma_unmap_phys(attach->dev, sg_dma_address(sgl),
> + sg_dma_len(sgl), dir, DMA_ATTR_MMIO);
Would it be safer to skip dma_unmap_phys() for the range [i, nents)?
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine
2025-11-19 0:06 ` Nicolin Chen
@ 2025-11-19 13:32 ` Leon Romanovsky
0 siblings, 0 replies; 63+ messages in thread
From: Leon Romanovsky @ 2025-11-19 13:32 UTC (permalink / raw)
To: Nicolin Chen
Cc: Bjorn Helgaas, Logan Gunthorpe, Jens Axboe, Robin Murphy,
Joerg Roedel, Will Deacon, Marek Szyprowski, Jason Gunthorpe,
Andrew Morton, Jonathan Corbet, Sumit Semwal,
Christian König, Kees Cook, Gustavo A. R. Silva,
Ankit Agrawal, Yishai Hadas, Shameer Kolothum, Kevin Tian,
Alex Williamson, Krishnakant Jaju, Matt Ochs, linux-pci,
linux-kernel, linux-block, iommu, linux-mm, linux-doc,
linux-media, dri-devel, linaro-mm-sig, kvm, linux-hardening,
Alex Mastro
On Tue, Nov 18, 2025 at 04:06:11PM -0800, Nicolin Chen wrote:
> On Tue, Nov 11, 2025 at 11:57:48AM +0200, Leon Romanovsky wrote:
> > From: Leon Romanovsky <leonro@nvidia.com>
> >
> > Add dma_buf_map() and dma_buf_unmap() helpers to convert an array of
> > MMIO physical address ranges into scatter-gather tables with proper
> > DMA mapping.
> >
> > These common functions are a starting point and support any PCI
> > drivers creating mappings from their BAR's MMIO addresses. VFIO is one
> > case, as shortly will be RDMA. We can review existing DRM drivers to
> > refactor them separately. We hope this will evolve into routines to
> > help common DRM that include mixed CPU and MMIO mappings.
> >
> > Compared to the dma_map_resource() abuse this implementation handles
> > the complicated PCI P2P scenarios properly, especially when an IOMMU
> > is enabled:
> >
> > - Direct bus address mapping without IOVA allocation for
> > PCI_P2PDMA_MAP_BUS_ADDR, using pci_p2pdma_bus_addr_map(). This
> > happens if the IOMMU is enabled but the PCIe switch ACS flags allow
> > transactions to avoid the host bridge.
> >
> > Further, this handles the slightly obscure case of MMIO with a
> > phys_addr_t that is different from the physical BAR programming
> > (bus offset). The phys_addr_t is converted to a dma_addr_t and
> > accommodates this effect. This enables certain real systems to
> > work, especially on ARM platforms.
> >
> > - Mapping through host bridge with IOVA allocation and DMA_ATTR_MMIO
> > attribute for MMIO memory regions (PCI_P2PDMA_MAP_THRU_HOST_BRIDGE).
> > This happens when the IOMMU is enabled and the ACS flags are forcing
> > all traffic to the IOMMU - ie for virtualization systems.
> >
> > - Cases where P2P is not supported through the host bridge/CPU. The
> > P2P subsystem is the proper place to detect this and block it.
> >
> > Helper functions fill_sg_entry() and calc_sg_nents() handle the
> > scatter-gather table construction, splitting large regions into
> > UINT_MAX-sized chunks to fit within sg->length field limits.
> >
> > Since the physical address based DMA API forbids use of the CPU list
> > of the scatterlist this will produce a mangled scatterlist that has
> > a fully zero-length and NULL'd CPU list. The list is 0 length,
> > all the struct page pointers are NULL and zero sized. This is stronger
> > and more robust than the existing mangle_sg_table() technique. It is
> > a future project to migrate DMABUF as a subsystem away from using
> > scatterlist for this data structure.
> >
> > Tested-by: Alex Mastro <amastro@fb.com>
> > Tested-by: Nicolin Chen <nicolinc@nvidia.com>
> > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
>
> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
>
> With a nit:
>
> > +err_unmap_dma:
> > + if (!i || !dma->state) {
> > + ; /* Do nothing */
> > + } else if (dma_use_iova(dma->state)) {
> > + dma_iova_destroy(attach->dev, dma->state, mapped_len, dir,
> > + DMA_ATTR_MMIO);
> > + } else {
> > + for_each_sgtable_dma_sg(&dma->sgt, sgl, i)
> > + dma_unmap_phys(attach->dev, sg_dma_address(sgl),
> > + sg_dma_len(sgl), dir, DMA_ATTR_MMIO);
>
> Would it be safer to skip dma_unmap_phys() for the range [i, nents)?
[i, nents) is not supposed to be in the SG list which we are iterating.
Thanks
^ permalink raw reply [flat|nested] 63+ messages in thread
* RE: [PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine
2025-11-11 9:57 ` [PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine Leon Romanovsky
2025-11-18 23:02 ` Jason Gunthorpe
2025-11-19 0:06 ` Nicolin Chen
@ 2025-11-19 5:54 ` Tian, Kevin
2025-11-19 13:30 ` Leon Romanovsky
2025-11-19 13:16 ` [Linaro-mm-sig] " Christian König
3 siblings, 1 reply; 63+ messages in thread
From: Tian, Kevin @ 2025-11-19 5:54 UTC (permalink / raw)
To: Leon Romanovsky, Bjorn Helgaas, Logan Gunthorpe, Jens Axboe,
Robin Murphy, Joerg Roedel, Will Deacon, Marek Szyprowski,
Jason Gunthorpe, Andrew Morton, Jonathan Corbet, Sumit Semwal,
Christian König, Kees Cook, Gustavo A. R. Silva,
Ankit Agrawal, Yishai Hadas, Shameer Kolothum, Alex Williamson
Cc: Krishnakant Jaju, Matt Ochs, linux-pci@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
iommu@lists.linux.dev, linux-mm@kvack.org,
linux-doc@vger.kernel.org, linux-media@vger.kernel.org,
dri-devel@lists.freedesktop.org, linaro-mm-sig@lists.linaro.org,
kvm@vger.kernel.org, linux-hardening@vger.kernel.org, Alex Mastro,
Nicolin Chen
> From: Leon Romanovsky <leon@kernel.org>
> Sent: Tuesday, November 11, 2025 5:58 PM
> +
> + if (dma->state && dma_use_iova(dma->state)) {
> + WARN_ON_ONCE(mapped_len != size);
then "goto err_unmap_dma".
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine
2025-11-19 5:54 ` Tian, Kevin
@ 2025-11-19 13:30 ` Leon Romanovsky
2025-11-19 13:37 ` Jason Gunthorpe
0 siblings, 1 reply; 63+ messages in thread
From: Leon Romanovsky @ 2025-11-19 13:30 UTC (permalink / raw)
To: Tian, Kevin
Cc: Bjorn Helgaas, Logan Gunthorpe, Jens Axboe, Robin Murphy,
Joerg Roedel, Will Deacon, Marek Szyprowski, Jason Gunthorpe,
Andrew Morton, Jonathan Corbet, Sumit Semwal,
Christian König, Kees Cook, Gustavo A. R. Silva,
Ankit Agrawal, Yishai Hadas, Shameer Kolothum, Alex Williamson,
Krishnakant Jaju, Matt Ochs, linux-pci@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
iommu@lists.linux.dev, linux-mm@kvack.org,
linux-doc@vger.kernel.org, linux-media@vger.kernel.org,
dri-devel@lists.freedesktop.org, linaro-mm-sig@lists.linaro.org,
kvm@vger.kernel.org, linux-hardening@vger.kernel.org, Alex Mastro,
Nicolin Chen
On Wed, Nov 19, 2025 at 05:54:55AM +0000, Tian, Kevin wrote:
> > From: Leon Romanovsky <leon@kernel.org>
> > Sent: Tuesday, November 11, 2025 5:58 PM
> > +
> > + if (dma->state && dma_use_iova(dma->state)) {
> > + WARN_ON_ONCE(mapped_len != size);
>
> then "goto err_unmap_dma".
It never should happen, there is no need to provide error unwind to
something that you won't get.
>
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Thanks
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine
2025-11-19 13:30 ` Leon Romanovsky
@ 2025-11-19 13:37 ` Jason Gunthorpe
2025-11-19 13:45 ` Leon Romanovsky
0 siblings, 1 reply; 63+ messages in thread
From: Jason Gunthorpe @ 2025-11-19 13:37 UTC (permalink / raw)
To: Leon Romanovsky
Cc: Tian, Kevin, Bjorn Helgaas, Logan Gunthorpe, Jens Axboe,
Robin Murphy, Joerg Roedel, Will Deacon, Marek Szyprowski,
Andrew Morton, Jonathan Corbet, Sumit Semwal,
Christian König, Kees Cook, Gustavo A. R. Silva,
Ankit Agrawal, Yishai Hadas, Shameer Kolothum, Alex Williamson,
Krishnakant Jaju, Matt Ochs, linux-pci@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
iommu@lists.linux.dev, linux-mm@kvack.org,
linux-doc@vger.kernel.org, linux-media@vger.kernel.org,
dri-devel@lists.freedesktop.org, linaro-mm-sig@lists.linaro.org,
kvm@vger.kernel.org, linux-hardening@vger.kernel.org, Alex Mastro,
Nicolin Chen
On Wed, Nov 19, 2025 at 03:30:00PM +0200, Leon Romanovsky wrote:
> On Wed, Nov 19, 2025 at 05:54:55AM +0000, Tian, Kevin wrote:
> > > From: Leon Romanovsky <leon@kernel.org>
> > > Sent: Tuesday, November 11, 2025 5:58 PM
> > > +
> > > + if (dma->state && dma_use_iova(dma->state)) {
> > > + WARN_ON_ONCE(mapped_len != size);
> >
> > then "goto err_unmap_dma".
>
> It never should happen, there is no need to provide error unwind to
> something that you won't get.
It is expected that WARN_ON has recovery code, if it is possible and
not burdensome.
Jason
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine
2025-11-19 13:37 ` Jason Gunthorpe
@ 2025-11-19 13:45 ` Leon Romanovsky
0 siblings, 0 replies; 63+ messages in thread
From: Leon Romanovsky @ 2025-11-19 13:45 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Tian, Kevin, Bjorn Helgaas, Logan Gunthorpe, Jens Axboe,
Robin Murphy, Joerg Roedel, Will Deacon, Marek Szyprowski,
Andrew Morton, Jonathan Corbet, Sumit Semwal,
Christian König, Kees Cook, Gustavo A. R. Silva,
Ankit Agrawal, Yishai Hadas, Shameer Kolothum, Alex Williamson,
Krishnakant Jaju, Matt Ochs, linux-pci@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
iommu@lists.linux.dev, linux-mm@kvack.org,
linux-doc@vger.kernel.org, linux-media@vger.kernel.org,
dri-devel@lists.freedesktop.org, linaro-mm-sig@lists.linaro.org,
kvm@vger.kernel.org, linux-hardening@vger.kernel.org, Alex Mastro,
Nicolin Chen
On Wed, Nov 19, 2025 at 09:37:08AM -0400, Jason Gunthorpe wrote:
> On Wed, Nov 19, 2025 at 03:30:00PM +0200, Leon Romanovsky wrote:
> > On Wed, Nov 19, 2025 at 05:54:55AM +0000, Tian, Kevin wrote:
> > > > From: Leon Romanovsky <leon@kernel.org>
> > > > Sent: Tuesday, November 11, 2025 5:58 PM
> > > > +
> > > > + if (dma->state && dma_use_iova(dma->state)) {
> > > > + WARN_ON_ONCE(mapped_len != size);
> > >
> > > then "goto err_unmap_dma".
> >
> > It never should happen, there is no need to provide error unwind to
> > something that you won't get.
>
> It is expected that WARN_ON has recovery code, if it is possible and
> not burdensome.
It’s not necessary, but since I’m calculating mapped_len again, it’s natural—and completely
harmless—to double-check the arithmetic.
Thanks
>
> Jason
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [Linaro-mm-sig] [PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine
2025-11-11 9:57 ` [PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine Leon Romanovsky
` (2 preceding siblings ...)
2025-11-19 5:54 ` Tian, Kevin
@ 2025-11-19 13:16 ` Christian König
2025-11-19 13:25 ` Jason Gunthorpe
2025-11-19 13:42 ` Leon Romanovsky
3 siblings, 2 replies; 63+ messages in thread
From: Christian König @ 2025-11-19 13:16 UTC (permalink / raw)
To: Leon Romanovsky, Bjorn Helgaas, Logan Gunthorpe, Jens Axboe,
Robin Murphy, Joerg Roedel, Will Deacon, Marek Szyprowski,
Jason Gunthorpe, Andrew Morton, Jonathan Corbet, Sumit Semwal,
Christian König, Kees Cook, Gustavo A. R. Silva,
Ankit Agrawal, Yishai Hadas, Shameer Kolothum, Kevin Tian,
Alex Williamson
Cc: Krishnakant Jaju, Matt Ochs, linux-pci, linux-kernel, linux-block,
iommu, linux-mm, linux-doc, linux-media, dri-devel, linaro-mm-sig,
kvm, linux-hardening, Alex Mastro, Nicolin Chen
On 11/11/25 10:57, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
>
> Add dma_buf_map() and dma_buf_unmap() helpers to convert an array of
> MMIO physical address ranges into scatter-gather tables with proper
> DMA mapping.
>
> These common functions are a starting point and support any PCI
> drivers creating mappings from their BAR's MMIO addresses. VFIO is one
> case, as shortly will be RDMA. We can review existing DRM drivers to
> refactor them separately. We hope this will evolve into routines to
> help common DRM that include mixed CPU and MMIO mappings.
>
> Compared to the dma_map_resource() abuse this implementation handles
> the complicated PCI P2P scenarios properly, especially when an IOMMU
> is enabled:
>
> - Direct bus address mapping without IOVA allocation for
> PCI_P2PDMA_MAP_BUS_ADDR, using pci_p2pdma_bus_addr_map(). This
> happens if the IOMMU is enabled but the PCIe switch ACS flags allow
> transactions to avoid the host bridge.
>
> Further, this handles the slightly obscure case of MMIO with a
> phys_addr_t that is different from the physical BAR programming
> (bus offset). The phys_addr_t is converted to a dma_addr_t and
> accommodates this effect. This enables certain real systems to
> work, especially on ARM platforms.
>
> - Mapping through host bridge with IOVA allocation and DMA_ATTR_MMIO
> attribute for MMIO memory regions (PCI_P2PDMA_MAP_THRU_HOST_BRIDGE).
> This happens when the IOMMU is enabled and the ACS flags are forcing
> all traffic to the IOMMU - ie for virtualization systems.
>
> - Cases where P2P is not supported through the host bridge/CPU. The
> P2P subsystem is the proper place to detect this and block it.
>
> Helper functions fill_sg_entry() and calc_sg_nents() handle the
> scatter-gather table construction, splitting large regions into
> UINT_MAX-sized chunks to fit within sg->length field limits.
>
> Since the physical address based DMA API forbids use of the CPU list
> of the scatterlist this will produce a mangled scatterlist that has
> a fully zero-length and NULL'd CPU list. The list is 0 length,
> all the struct page pointers are NULL and zero sized. This is stronger
> and more robust than the existing mangle_sg_table() technique. It is
> a future project to migrate DMABUF as a subsystem away from using
> scatterlist for this data structure.
>
> Tested-by: Alex Mastro <amastro@fb.com>
> Tested-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> ---
> drivers/dma-buf/dma-buf.c | 235 ++++++++++++++++++++++++++++++++++++++++++++++
> include/linux/dma-buf.h | 18 ++++
> 2 files changed, 253 insertions(+)
>
> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> index 2bcf9ceca997..cb55dff1dad5 100644
> --- a/drivers/dma-buf/dma-buf.c
> +++ b/drivers/dma-buf/dma-buf.c
> @@ -1254,6 +1254,241 @@ void dma_buf_unmap_attachment_unlocked(struct dma_buf_attachment *attach,
> }
> EXPORT_SYMBOL_NS_GPL(dma_buf_unmap_attachment_unlocked, "DMA_BUF");
>
> +static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length,
> + dma_addr_t addr)
> +{
> + unsigned int len, nents;
> + int i;
> +
> + nents = DIV_ROUND_UP(length, UINT_MAX);
> + for (i = 0; i < nents; i++) {
> + len = min_t(size_t, length, UINT_MAX);
> + length -= len;
> + /*
> + * DMABUF abuses scatterlist to create a scatterlist
> + * that does not have any CPU list, only the DMA list.
> + * Always set the page related values to NULL to ensure
> + * importers can't use it. The phys_addr based DMA API
> + * does not require the CPU list for mapping or unmapping.
> + */
> + sg_set_page(sgl, NULL, 0, 0);
> + sg_dma_address(sgl) = addr + i * UINT_MAX;
> + sg_dma_len(sgl) = len;
> + sgl = sg_next(sgl);
> + }
> +
> + return sgl;
> +}
> +
> +static unsigned int calc_sg_nents(struct dma_iova_state *state,
> + struct dma_buf_phys_vec *phys_vec,
> + size_t nr_ranges, size_t size)
> +{
> + unsigned int nents = 0;
> + size_t i;
> +
> + if (!state || !dma_use_iova(state)) {
> + for (i = 0; i < nr_ranges; i++)
> + nents += DIV_ROUND_UP(phys_vec[i].len, UINT_MAX);
> + } else {
> + /*
> + * In IOVA case, there is only one SG entry which spans
> + * for whole IOVA address space, but we need to make sure
> + * that it fits sg->length, maybe we need more.
> + */
> + nents = DIV_ROUND_UP(size, UINT_MAX);
> + }
> +
> + return nents;
> +}
> +
> +/**
> + * struct dma_buf_dma - holds DMA mapping information
> + * @sgt: Scatter-gather table
> + * @state: DMA IOVA state relevant in IOMMU-based DMA
> + * @size: Total size of DMA transfer
> + */
> +struct dma_buf_dma {
> + struct sg_table sgt;
> + struct dma_iova_state *state;
> + size_t size;
> +};
> +
> +/**
> + * dma_buf_map - Returns the scatterlist table of the attachment from arrays
> + * of physical vectors. This function is intended for MMIO memory only.
> + * @attach: [in] attachment whose scatterlist is to be returned
> + * @provider: [in] p2pdma provider
> + * @phys_vec: [in] array of physical vectors
> + * @nr_ranges: [in] number of entries in phys_vec array
> + * @size: [in] total size of phys_vec
> + * @dir: [in] direction of DMA transfer
> + *
> + * Returns sg_table containing the scatterlist to be returned; returns ERR_PTR
> + * on error. May return -EINTR if it is interrupted by a signal.
> + *
> + * On success, the DMA addresses and lengths in the returned scatterlist are
> + * PAGE_SIZE aligned.
> + *
> + * A mapping must be unmapped by using dma_buf_unmap().
> + */
> +struct sg_table *dma_buf_map(struct dma_buf_attachment *attach,
That is clearly not a good name for this function. We already have overloaded the term *mapping* with something completely different.
> + struct p2pdma_provider *provider,
> + struct dma_buf_phys_vec *phys_vec,
> + size_t nr_ranges, size_t size,
> + enum dma_data_direction dir)
> +{
> + unsigned int nents, mapped_len = 0;
> + struct dma_buf_dma *dma;
> + struct scatterlist *sgl;
> + dma_addr_t addr;
> + size_t i;
> + int ret;
> +
> + dma_resv_assert_held(attach->dmabuf->resv);
> +
> + if (WARN_ON(!attach || !attach->dmabuf || !provider))
> + /* This function is supposed to work on MMIO memory only */
> + return ERR_PTR(-EINVAL);
> +
> + dma = kzalloc(sizeof(*dma), GFP_KERNEL);
> + if (!dma)
> + return ERR_PTR(-ENOMEM);
> +
> + switch (pci_p2pdma_map_type(provider, attach->dev)) {
> + case PCI_P2PDMA_MAP_BUS_ADDR:
> + /*
> + * There is no need in IOVA at all for this flow.
> + */
> + break;
> + case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
> + dma->state = kzalloc(sizeof(*dma->state), GFP_KERNEL);
> + if (!dma->state) {
> + ret = -ENOMEM;
> + goto err_free_dma;
> + }
> +
> + dma_iova_try_alloc(attach->dev, dma->state, 0, size);
Oh, that is a clear no-go for the core DMA-buf code.
It's intentionally up to the exporter how to create the DMA addresses the importer can work with.
We could add something like a dma_buf_sg_helper.c or similar and put it in there.
Regards,
Christian.
> + break;
> + default:
> + ret = -EINVAL;
> + goto err_free_dma;
> + }
> +
> + nents = calc_sg_nents(dma->state, phys_vec, nr_ranges, size);
> + ret = sg_alloc_table(&dma->sgt, nents, GFP_KERNEL | __GFP_ZERO);
> + if (ret)
> + goto err_free_state;
> +
> + sgl = dma->sgt.sgl;
> +
> + for (i = 0; i < nr_ranges; i++) {
> + if (!dma->state) {
> + addr = pci_p2pdma_bus_addr_map(provider,
> + phys_vec[i].paddr);
> + } else if (dma_use_iova(dma->state)) {
> + ret = dma_iova_link(attach->dev, dma->state,
> + phys_vec[i].paddr, 0,
> + phys_vec[i].len, dir,
> + DMA_ATTR_MMIO);
> + if (ret)
> + goto err_unmap_dma;
> +
> + mapped_len += phys_vec[i].len;
> + } else {
> + addr = dma_map_phys(attach->dev, phys_vec[i].paddr,
> + phys_vec[i].len, dir,
> + DMA_ATTR_MMIO);
> + ret = dma_mapping_error(attach->dev, addr);
> + if (ret)
> + goto err_unmap_dma;
> + }
> +
> + if (!dma->state || !dma_use_iova(dma->state))
> + sgl = fill_sg_entry(sgl, phys_vec[i].len, addr);
> + }
> +
> + if (dma->state && dma_use_iova(dma->state)) {
> + WARN_ON_ONCE(mapped_len != size);
> + ret = dma_iova_sync(attach->dev, dma->state, 0, mapped_len);
> + if (ret)
> + goto err_unmap_dma;
> +
> + sgl = fill_sg_entry(sgl, mapped_len, dma->state->addr);
> + }
> +
> + dma->size = size;
> +
> + /*
> + * No CPU list included — set orig_nents = 0 so others can detect
> + * this via SG table (use nents only).
> + */
> + dma->sgt.orig_nents = 0;
> +
> +
> + /*
> + * SGL must be NULL to indicate that SGL is the last one
> + * and we allocated correct number of entries in sg_alloc_table()
> + */
> + WARN_ON_ONCE(sgl);
> + return &dma->sgt;
> +
> +err_unmap_dma:
> + if (!i || !dma->state) {
> + ; /* Do nothing */
> + } else if (dma_use_iova(dma->state)) {
> + dma_iova_destroy(attach->dev, dma->state, mapped_len, dir,
> + DMA_ATTR_MMIO);
> + } else {
> + for_each_sgtable_dma_sg(&dma->sgt, sgl, i)
> + dma_unmap_phys(attach->dev, sg_dma_address(sgl),
> + sg_dma_len(sgl), dir, DMA_ATTR_MMIO);
> + }
> + sg_free_table(&dma->sgt);
> +err_free_state:
> + kfree(dma->state);
> +err_free_dma:
> + kfree(dma);
> + return ERR_PTR(ret);
> +}
> +EXPORT_SYMBOL_NS_GPL(dma_buf_map, "DMA_BUF");
> +
> +/**
> + * dma_buf_unmap - unmaps the buffer
> + * @attach: [in] attachment to unmap buffer from
> + * @sgt: [in] scatterlist info of the buffer to unmap
> + * @dir: [in] direction of DMA transfer
> + *
> + * This unmaps a DMA mapping for @attach obtained by dma_buf_map().
> + */
> +void dma_buf_unmap(struct dma_buf_attachment *attach, struct sg_table *sgt,
> + enum dma_data_direction dir)
> +{
> + struct dma_buf_dma *dma = container_of(sgt, struct dma_buf_dma, sgt);
> + int i;
> +
> + dma_resv_assert_held(attach->dmabuf->resv);
> +
> + if (!dma->state) {
> + ; /* Do nothing */
> + } else if (dma_use_iova(dma->state)) {
> + dma_iova_destroy(attach->dev, dma->state, dma->size, dir,
> + DMA_ATTR_MMIO);
> + } else {
> + struct scatterlist *sgl;
> +
> + for_each_sgtable_dma_sg(sgt, sgl, i)
> + dma_unmap_phys(attach->dev, sg_dma_address(sgl),
> + sg_dma_len(sgl), dir, DMA_ATTR_MMIO);
> + }
> +
> + sg_free_table(sgt);
> + kfree(dma->state);
> + kfree(dma);
> +
> +}
> +EXPORT_SYMBOL_NS_GPL(dma_buf_unmap, "DMA_BUF");
> +
> /**
> * dma_buf_move_notify - notify attachments that DMA-buf is moving
> *
> diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
> index d58e329ac0e7..545ba27a5040 100644
> --- a/include/linux/dma-buf.h
> +++ b/include/linux/dma-buf.h
> @@ -22,6 +22,7 @@
> #include <linux/fs.h>
> #include <linux/dma-fence.h>
> #include <linux/wait.h>
> +#include <linux/pci-p2pdma.h>
>
> struct device;
> struct dma_buf;
> @@ -530,6 +531,16 @@ struct dma_buf_export_info {
> void *priv;
> };
>
> +/**
> + * struct dma_buf_phys_vec - describe continuous chunk of memory
> + * @paddr: physical address of that chunk
> + * @len: Length of this chunk
> + */
> +struct dma_buf_phys_vec {
> + phys_addr_t paddr;
> + size_t len;
> +};
> +
> /**
> * DEFINE_DMA_BUF_EXPORT_INFO - helper macro for exporters
> * @name: export-info name
> @@ -609,4 +620,11 @@ int dma_buf_vmap_unlocked(struct dma_buf *dmabuf, struct iosys_map *map);
> void dma_buf_vunmap_unlocked(struct dma_buf *dmabuf, struct iosys_map *map);
> struct dma_buf *dma_buf_iter_begin(void);
> struct dma_buf *dma_buf_iter_next(struct dma_buf *dmbuf);
> +struct sg_table *dma_buf_map(struct dma_buf_attachment *attach,
> + struct p2pdma_provider *provider,
> + struct dma_buf_phys_vec *phys_vec,
> + size_t nr_ranges, size_t size,
> + enum dma_data_direction dir);
> +void dma_buf_unmap(struct dma_buf_attachment *attach, struct sg_table *sgt,
> + enum dma_data_direction dir);
> #endif /* __DMA_BUF_H__ */
>
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [Linaro-mm-sig] [PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine
2025-11-19 13:16 ` [Linaro-mm-sig] " Christian König
@ 2025-11-19 13:25 ` Jason Gunthorpe
2025-11-19 13:42 ` Christian König
2025-11-19 13:42 ` Leon Romanovsky
1 sibling, 1 reply; 63+ messages in thread
From: Jason Gunthorpe @ 2025-11-19 13:25 UTC (permalink / raw)
To: Christian König
Cc: Leon Romanovsky, Bjorn Helgaas, Logan Gunthorpe, Jens Axboe,
Robin Murphy, Joerg Roedel, Will Deacon, Marek Szyprowski,
Andrew Morton, Jonathan Corbet, Sumit Semwal, Kees Cook,
Gustavo A. R. Silva, Ankit Agrawal, Yishai Hadas,
Shameer Kolothum, Kevin Tian, Alex Williamson, Krishnakant Jaju,
Matt Ochs, linux-pci, linux-kernel, linux-block, iommu, linux-mm,
linux-doc, linux-media, dri-devel, linaro-mm-sig, kvm,
linux-hardening, Alex Mastro, Nicolin Chen
On Wed, Nov 19, 2025 at 02:16:57PM +0100, Christian König wrote:
> > +/**
> > + * dma_buf_map - Returns the scatterlist table of the attachment from arrays
> > + * of physical vectors. This function is intended for MMIO memory only.
> > + * @attach: [in] attachment whose scatterlist is to be returned
> > + * @provider: [in] p2pdma provider
> > + * @phys_vec: [in] array of physical vectors
> > + * @nr_ranges: [in] number of entries in phys_vec array
> > + * @size: [in] total size of phys_vec
> > + * @dir: [in] direction of DMA transfer
> > + *
> > + * Returns sg_table containing the scatterlist to be returned; returns ERR_PTR
> > + * on error. May return -EINTR if it is interrupted by a signal.
> > + *
> > + * On success, the DMA addresses and lengths in the returned scatterlist are
> > + * PAGE_SIZE aligned.
> > + *
> > + * A mapping must be unmapped by using dma_buf_unmap().
> > + */
> > +struct sg_table *dma_buf_map(struct dma_buf_attachment *attach,
>
> That is clearly not a good name for this function. We already have overloaded the term *mapping* with something completely different.
>
> > + struct p2pdma_provider *provider,
> > + struct dma_buf_phys_vec *phys_vec,
> > + size_t nr_ranges, size_t size,
> > + enum dma_data_direction dir)
> > +{
> > + unsigned int nents, mapped_len = 0;
> > + struct dma_buf_dma *dma;
> > + struct scatterlist *sgl;
> > + dma_addr_t addr;
> > + size_t i;
> > + int ret;
> > +
> > + dma_resv_assert_held(attach->dmabuf->resv);
> > +
> > + if (WARN_ON(!attach || !attach->dmabuf || !provider))
> > + /* This function is supposed to work on MMIO memory only */
> > + return ERR_PTR(-EINVAL);
> > +
> > + dma = kzalloc(sizeof(*dma), GFP_KERNEL);
> > + if (!dma)
> > + return ERR_PTR(-ENOMEM);
> > +
> > + switch (pci_p2pdma_map_type(provider, attach->dev)) {
> > + case PCI_P2PDMA_MAP_BUS_ADDR:
> > + /*
> > + * There is no need in IOVA at all for this flow.
> > + */
> > + break;
> > + case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
> > + dma->state = kzalloc(sizeof(*dma->state), GFP_KERNEL);
> > + if (!dma->state) {
> > + ret = -ENOMEM;
> > + goto err_free_dma;
> > + }
> > +
> > + dma_iova_try_alloc(attach->dev, dma->state, 0, size);
>
> Oh, that is a clear no-go for the core DMA-buf code.
>
> It's intentionally up to the exporter how to create the DMA
> addresses the importer can work with.
I can't fully understand this remark?
> We could add something like a dma_buf_sg_helper.c or similar and put it in there.
Yes, the intention is this function is an "exporter helper" that an
exporter can call if it wants to help generate the scatterlist.
So your "no-go" is just about what file it is in, not anything about
how it works?
Thanks,
Jason
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [Linaro-mm-sig] [PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine
2025-11-19 13:25 ` Jason Gunthorpe
@ 2025-11-19 13:42 ` Christian König
2025-11-19 13:48 ` Leon Romanovsky
2025-11-19 19:31 ` Jason Gunthorpe
0 siblings, 2 replies; 63+ messages in thread
From: Christian König @ 2025-11-19 13:42 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Leon Romanovsky, Bjorn Helgaas, Logan Gunthorpe, Jens Axboe,
Robin Murphy, Joerg Roedel, Will Deacon, Marek Szyprowski,
Andrew Morton, Jonathan Corbet, Sumit Semwal, Kees Cook,
Gustavo A. R. Silva, Ankit Agrawal, Yishai Hadas,
Shameer Kolothum, Kevin Tian, Alex Williamson, Krishnakant Jaju,
Matt Ochs, linux-pci, linux-kernel, linux-block, iommu, linux-mm,
linux-doc, linux-media, dri-devel, linaro-mm-sig, kvm,
linux-hardening, Alex Mastro, Nicolin Chen
On 11/19/25 14:25, Jason Gunthorpe wrote:
> On Wed, Nov 19, 2025 at 02:16:57PM +0100, Christian König wrote:
>>> +/**
>>> + * dma_buf_map - Returns the scatterlist table of the attachment from arrays
>>> + * of physical vectors. This function is intended for MMIO memory only.
>>> + * @attach: [in] attachment whose scatterlist is to be returned
>>> + * @provider: [in] p2pdma provider
>>> + * @phys_vec: [in] array of physical vectors
>>> + * @nr_ranges: [in] number of entries in phys_vec array
>>> + * @size: [in] total size of phys_vec
>>> + * @dir: [in] direction of DMA transfer
>>> + *
>>> + * Returns sg_table containing the scatterlist to be returned; returns ERR_PTR
>>> + * on error. May return -EINTR if it is interrupted by a signal.
>>> + *
>>> + * On success, the DMA addresses and lengths in the returned scatterlist are
>>> + * PAGE_SIZE aligned.
>>> + *
>>> + * A mapping must be unmapped by using dma_buf_unmap().
>>> + */
>>> +struct sg_table *dma_buf_map(struct dma_buf_attachment *attach,
>>
>> That is clearly not a good name for this function. We already have overloaded the term *mapping* with something completely different.
>>
>>> + struct p2pdma_provider *provider,
>>> + struct dma_buf_phys_vec *phys_vec,
>>> + size_t nr_ranges, size_t size,
>>> + enum dma_data_direction dir)
>>> +{
>>> + unsigned int nents, mapped_len = 0;
>>> + struct dma_buf_dma *dma;
>>> + struct scatterlist *sgl;
>>> + dma_addr_t addr;
>>> + size_t i;
>>> + int ret;
>>> +
>>> + dma_resv_assert_held(attach->dmabuf->resv);
>>> +
>>> + if (WARN_ON(!attach || !attach->dmabuf || !provider))
>>> + /* This function is supposed to work on MMIO memory only */
>>> + return ERR_PTR(-EINVAL);
>>> +
>>> + dma = kzalloc(sizeof(*dma), GFP_KERNEL);
>>> + if (!dma)
>>> + return ERR_PTR(-ENOMEM);
>>> +
>>> + switch (pci_p2pdma_map_type(provider, attach->dev)) {
>>> + case PCI_P2PDMA_MAP_BUS_ADDR:
>>> + /*
>>> + * There is no need in IOVA at all for this flow.
>>> + */
>>> + break;
>>> + case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
>>> + dma->state = kzalloc(sizeof(*dma->state), GFP_KERNEL);
>>> + if (!dma->state) {
>>> + ret = -ENOMEM;
>>> + goto err_free_dma;
>>> + }
>>> +
>>> + dma_iova_try_alloc(attach->dev, dma->state, 0, size);
>>
>> Oh, that is a clear no-go for the core DMA-buf code.
>>
>> It's intentionally up to the exporter how to create the DMA
>> addresses the importer can work with.
>
> I can't fully understand this remark?
The exporter should be able to decide if it actually wants to use P2P when the transfer has to go through the host bridge (e.g. when IOMMU/bridge routing bits are enabled).
Thinking more about it, exporters can now probably call pci_p2pdma_map_type(provider, attach->dev) before calling this function, so that is probably ok.
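A rough sketch of what such a pre-check could look like in an exporter's map
callback; struct my_exporter_priv and my_strict_map_dma_buf() are illustrative
names only, while pci_p2pdma_map_type() and dma_buf_map() come from this
series:

/*
 * Illustrative only: an exporter that refuses host-bridge routed P2P
 * could check the mapping type before calling dma_buf_map().
 */
static struct sg_table *
my_strict_map_dma_buf(struct dma_buf_attachment *attach,
                      enum dma_data_direction dir)
{
        struct my_exporter_priv *priv = attach->dmabuf->priv;

        switch (pci_p2pdma_map_type(priv->provider, attach->dev)) {
        case PCI_P2PDMA_MAP_BUS_ADDR:
                break;                          /* switch-routed, allowed */
        case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
                return ERR_PTR(-EOPNOTSUPP);    /* exporter policy */
        default:
                return ERR_PTR(-EINVAL);
        }

        return dma_buf_map(attach, priv->provider, priv->phys_vec,
                           priv->nr_ranges, priv->size, dir);
}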
>> We could add something like a dma_buf_sg_helper.c or similar and put it in there.
>
> Yes, the intention is this function is an "exporter helper" that an
> exporter can call if it wants to help generate the scatterlist.
>
> So your "no-go" is just about what file it is in, not anything about
> how it works?
Yes, exactly that. Just move it into a separate file somewhere and it's probably good to go as far as I can see.
But only take that as an Acked-by; I would need at least a day (or week) of free time to wrap my head around all the technical details again. And that is something I won't have before January or even later.
Regards,
Christian.
>
> Thanks,
> Jason
^ permalink raw reply [flat|nested] 63+ messages in thread* Re: [Linaro-mm-sig] [PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine
2025-11-19 13:42 ` Christian König
@ 2025-11-19 13:48 ` Leon Romanovsky
2025-11-19 19:31 ` Jason Gunthorpe
1 sibling, 0 replies; 63+ messages in thread
From: Leon Romanovsky @ 2025-11-19 13:48 UTC (permalink / raw)
To: Christian König
Cc: Jason Gunthorpe, Bjorn Helgaas, Logan Gunthorpe, Jens Axboe,
Robin Murphy, Joerg Roedel, Will Deacon, Marek Szyprowski,
Andrew Morton, Jonathan Corbet, Sumit Semwal, Kees Cook,
Gustavo A. R. Silva, Ankit Agrawal, Yishai Hadas,
Shameer Kolothum, Kevin Tian, Alex Williamson, Krishnakant Jaju,
Matt Ochs, linux-pci, linux-kernel, linux-block, iommu, linux-mm,
linux-doc, linux-media, dri-devel, linaro-mm-sig, kvm,
linux-hardening, Alex Mastro, Nicolin Chen
On Wed, Nov 19, 2025 at 02:42:18PM +0100, Christian König wrote:
> On 11/19/25 14:25, Jason Gunthorpe wrote:
> > On Wed, Nov 19, 2025 at 02:16:57PM +0100, Christian König wrote:
> >>> +/**
> >>> + * dma_buf_map - Returns the scatterlist table of the attachment from arrays
> >>> + * of physical vectors. This funciton is intended for MMIO memory only.
> >>> + * @attach: [in] attachment whose scatterlist is to be returned
> >>> + * @provider: [in] p2pdma provider
> >>> + * @phys_vec: [in] array of physical vectors
> >>> + * @nr_ranges: [in] number of entries in phys_vec array
> >>> + * @size: [in] total size of phys_vec
> >>> + * @dir: [in] direction of DMA transfer
> >>> + *
> >>> + * Returns sg_table containing the scatterlist to be returned; returns ERR_PTR
> >>> + * on error. May return -EINTR if it is interrupted by a signal.
> >>> + *
> >>> + * On success, the DMA addresses and lengths in the returned scatterlist are
> >>> + * PAGE_SIZE aligned.
> >>> + *
> >>> + * A mapping must be unmapped by using dma_buf_unmap().
> >>> + */
> >>> +struct sg_table *dma_buf_map(struct dma_buf_attachment *attach,
> >>
> >> That is clearly not a good name for this function. We already have overloaded the term *mapping* with something completely different.
> >>
> >>> + struct p2pdma_provider *provider,
> >>> + struct dma_buf_phys_vec *phys_vec,
> >>> + size_t nr_ranges, size_t size,
> >>> + enum dma_data_direction dir)
> >>> +{
> >>> + unsigned int nents, mapped_len = 0;
> >>> + struct dma_buf_dma *dma;
> >>> + struct scatterlist *sgl;
> >>> + dma_addr_t addr;
> >>> + size_t i;
> >>> + int ret;
> >>> +
> >>> + dma_resv_assert_held(attach->dmabuf->resv);
> >>> +
> >>> + if (WARN_ON(!attach || !attach->dmabuf || !provider))
> >>> + /* This function is supposed to work on MMIO memory only */
> >>> + return ERR_PTR(-EINVAL);
> >>> +
> >>> + dma = kzalloc(sizeof(*dma), GFP_KERNEL);
> >>> + if (!dma)
> >>> + return ERR_PTR(-ENOMEM);
> >>> +
> >>> + switch (pci_p2pdma_map_type(provider, attach->dev)) {
> >>> + case PCI_P2PDMA_MAP_BUS_ADDR:
> >>> + /*
> >>> + * There is no need in IOVA at all for this flow.
> >>> + */
> >>> + break;
> >>> + case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
> >>> + dma->state = kzalloc(sizeof(*dma->state), GFP_KERNEL);
> >>> + if (!dma->state) {
> >>> + ret = -ENOMEM;
> >>> + goto err_free_dma;
> >>> + }
> >>> +
> >>> + dma_iova_try_alloc(attach->dev, dma->state, 0, size);
> >>
> >> Oh, that is a clear no-go for the core DMA-buf code.
> >>
> >> It's intentionally up to the exporter how to create the DMA
> >> addresses the importer can work with.
> >
> > I can't fully understand this remark?
>
> The exporter should be able to decide if it actually wants to use P2P when the transfer has to go through the host bridge (e.g. when IOMMU/bridge routing bits are enabled).
>
> Thinking more about it exporters can now probably call pci_p2pdma_map_type(provider, attach->dev) before calling this function so that is probably ok.
>
> >> We could add something like a dma_buf_sg_helper.c or similar and put it in there.
> >
> > Yes, the intention is this function is an "exporter helper" that an
> > exporter can call if it wants to help generate the scatterlist.
> >
> > So your "no-go" is just about what file it is in, not anything about
> > how it works?
>
> Yes, exactly that. Just move it into a separate file somewhere and it's probably good to go as far as I can see.
>
> But only take that as Acked-by, I would need at least a day (or week) of free time to wrap my head around all the technical details again. And that is something I won't have before January or even later.
If it helps, we can meet at LPC. Jason and/or I will be happy to assist.
Thanks
>
> Regards,
> Christian.
>
> >
> > Thanks,
> > Jason
>
* Re: [Linaro-mm-sig] [PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine
2025-11-19 13:42 ` Christian König
2025-11-19 13:48 ` Leon Romanovsky
@ 2025-11-19 19:31 ` Jason Gunthorpe
2025-11-19 20:54 ` Leon Romanovsky
2025-11-20 7:08 ` Christian König
1 sibling, 2 replies; 63+ messages in thread
From: Jason Gunthorpe @ 2025-11-19 19:31 UTC (permalink / raw)
To: Christian König
Cc: Leon Romanovsky, Bjorn Helgaas, Logan Gunthorpe, Jens Axboe,
Robin Murphy, Joerg Roedel, Will Deacon, Marek Szyprowski,
Andrew Morton, Jonathan Corbet, Sumit Semwal, Kees Cook,
Gustavo A. R. Silva, Ankit Agrawal, Yishai Hadas,
Shameer Kolothum, Kevin Tian, Alex Williamson, Krishnakant Jaju,
Matt Ochs, linux-pci, linux-kernel, linux-block, iommu, linux-mm,
linux-doc, linux-media, dri-devel, linaro-mm-sig, kvm,
linux-hardening, Alex Mastro, Nicolin Chen
On Wed, Nov 19, 2025 at 02:42:18PM +0100, Christian König wrote:
> >>> + case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
> >>> + dma->state = kzalloc(sizeof(*dma->state), GFP_KERNEL);
> >>> + if (!dma->state) {
> >>> + ret = -ENOMEM;
> >>> + goto err_free_dma;
> >>> + }
> >>> +
> >>> + dma_iova_try_alloc(attach->dev, dma->state, 0, size);
> >>
> >> Oh, that is a clear no-go for the core DMA-buf code.
> >>
> >> It's intentionally up to the exporter how to create the DMA
> >> addresses the importer can work with.
> >
> > I can't fully understand this remark?
>
> The exporter should be able to decide if it actually wants to use
> P2P when the transfer has to go through the host bridge (e.g. when
> IOMMU/bridge routing bits are enabled).
Sure, but this is a simplified helper for exporters that don't have
a choice about where the memory comes from.
I fully expect to see changes to this to support more use cases,
including the one above. We should do those changes along with users
making use of them so we can evaluate what works best.
> But only take that as Acked-by, I would need at least a day (or
> week) of free time to wrap my head around all the technical details
> again. And that is something I won't have before January or even
> later.
Sure, it is a lot, and I think the DRM community in general should come up
to speed on the new DMA API and how we are pushing to see P2P work
within Linux.
So thanks, we can take the Acked-by and progress here. Interested
parties can pick it up from this point when time allows.
We can also have a mini-community call to give a summary/etc on these
topics.
Thanks,
Jason
* Re: [Linaro-mm-sig] [PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine
2025-11-19 19:31 ` Jason Gunthorpe
@ 2025-11-19 20:54 ` Leon Romanovsky
2025-11-20 7:08 ` Christian König
1 sibling, 0 replies; 63+ messages in thread
From: Leon Romanovsky @ 2025-11-19 20:54 UTC (permalink / raw)
To: Jason Gunthorpe, Christian König
Cc: Bjorn Helgaas, Logan Gunthorpe, Jens Axboe, Robin Murphy,
Joerg Roedel, Will Deacon, Marek Szyprowski, Andrew Morton,
Jonathan Corbet, Sumit Semwal, Kees Cook, Gustavo A. R. Silva,
Ankit Agrawal, Yishai Hadas, Shameer Kolothum, Kevin Tian,
Alex Williamson, Krishnakant Jaju, Matt Ochs, linux-pci,
linux-kernel, linux-block, iommu, linux-mm, linux-doc,
linux-media, dri-devel, linaro-mm-sig, kvm, linux-hardening,
Alex Mastro, Nicolin Chen
On Wed, Nov 19, 2025 at 03:31:14PM -0400, Jason Gunthorpe wrote:
> On Wed, Nov 19, 2025 at 02:42:18PM +0100, Christian König wrote:
<...>
> So thanks, we can take the Acked-by and progress here. Interested
> parties can pick it up from this point when time allows.
Christian,
Can you please provide explicit Acked-by?
Thanks
* Re: [Linaro-mm-sig] [PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine
2025-11-19 19:31 ` Jason Gunthorpe
2025-11-19 20:54 ` Leon Romanovsky
@ 2025-11-20 7:08 ` Christian König
2025-11-20 7:41 ` Leon Romanovsky
2025-11-20 13:20 ` Jason Gunthorpe
1 sibling, 2 replies; 63+ messages in thread
From: Christian König @ 2025-11-20 7:08 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Leon Romanovsky, Bjorn Helgaas, Logan Gunthorpe, Jens Axboe,
Robin Murphy, Joerg Roedel, Will Deacon, Marek Szyprowski,
Andrew Morton, Jonathan Corbet, Sumit Semwal, Kees Cook,
Gustavo A. R. Silva, Ankit Agrawal, Yishai Hadas,
Shameer Kolothum, Kevin Tian, Alex Williamson, Krishnakant Jaju,
Matt Ochs, linux-pci, linux-kernel, linux-block, iommu, linux-mm,
linux-doc, linux-media, dri-devel, linaro-mm-sig, kvm,
linux-hardening, Alex Mastro, Nicolin Chen
On 11/19/25 20:31, Jason Gunthorpe wrote:
> On Wed, Nov 19, 2025 at 02:42:18PM +0100, Christian König wrote:
>
>>>>> + case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
>>>>> + dma->state = kzalloc(sizeof(*dma->state), GFP_KERNEL);
>>>>> + if (!dma->state) {
>>>>> + ret = -ENOMEM;
>>>>> + goto err_free_dma;
>>>>> + }
>>>>> +
>>>>> + dma_iova_try_alloc(attach->dev, dma->state, 0, size);
>>>>
>>>> Oh, that is a clear no-go for the core DMA-buf code.
>>>>
>>>> It's intentionally up to the exporter how to create the DMA
>>>> addresses the importer can work with.
>>>
>>> I can't fully understand this remark?
>>
>> The exporter should be able to decide if it actually wants to use
>> P2P when the transfer has to go through the host bridge (e.g. when
>> IOMMU/bridge routing bits are enabled).
>
> Sure, but this is a simplified helper for exporters that don't have
> choices where the memory comes from.
That is an extremely questionable justification for putting that in common DMA-buf code.
> I fully expet to see changes to this to support more use cases,
> including the one above. We should do those changes along with users
> making use of them so we can evaluate what works best.
Yeah, exactly that's my concern.
>> But only take that as Acked-by, I would need at least a day (or
>> week) of free time to wrap my head around all the technical details
>> again. And that is something I won't have before January or even
>> later.
>
> Sure, it is alot, and I think DRM community in general should come up
> to speed on the new DMA API and how we are pushing to see P2P work
> within Linux.
>
> So thanks, we can take the Acked-by and progress here. Interested
> parties can pick it up from this point when time allows.
Wait a second. After sleeping on it, I think my initial take that we really should not put that into common DMA-buf code holds true.
This is the use case for VFIO, but I absolutely want to keep other drivers from re-using this code until we have more experience with it.
So to move forward, I now strongly think we should keep that in VFIO until somebody else comes along and needs that helper.
Regards,
Christian.
>
> We can also have a mini-community call to give a summary/etc on these
> topics.
>
> Thanks,
> Jason
* Re: [Linaro-mm-sig] [PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine
2025-11-20 7:08 ` Christian König
@ 2025-11-20 7:41 ` Leon Romanovsky
2025-11-20 7:54 ` Christian König
2025-11-20 13:20 ` Jason Gunthorpe
1 sibling, 1 reply; 63+ messages in thread
From: Leon Romanovsky @ 2025-11-20 7:41 UTC (permalink / raw)
To: Christian König
Cc: Jason Gunthorpe, Bjorn Helgaas, Logan Gunthorpe, Jens Axboe,
Robin Murphy, Joerg Roedel, Will Deacon, Marek Szyprowski,
Andrew Morton, Jonathan Corbet, Sumit Semwal, Kees Cook,
Gustavo A. R. Silva, Ankit Agrawal, Yishai Hadas,
Shameer Kolothum, Kevin Tian, Alex Williamson, Krishnakant Jaju,
Matt Ochs, linux-pci, linux-kernel, linux-block, iommu, linux-mm,
linux-doc, linux-media, dri-devel, linaro-mm-sig, kvm,
linux-hardening, Alex Mastro, Nicolin Chen
On Thu, Nov 20, 2025 at 08:08:27AM +0100, Christian König wrote:
> On 11/19/25 20:31, Jason Gunthorpe wrote:
> > On Wed, Nov 19, 2025 at 02:42:18PM +0100, Christian König wrote:
> >
> >>>>> + case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
> >>>>> + dma->state = kzalloc(sizeof(*dma->state), GFP_KERNEL);
> >>>>> + if (!dma->state) {
> >>>>> + ret = -ENOMEM;
> >>>>> + goto err_free_dma;
> >>>>> + }
> >>>>> +
> >>>>> + dma_iova_try_alloc(attach->dev, dma->state, 0, size);
> >>>>
> >>>> Oh, that is a clear no-go for the core DMA-buf code.
> >>>>
> >>>> It's intentionally up to the exporter how to create the DMA
> >>>> addresses the importer can work with.
> >>>
> >>> I can't fully understand this remark?
> >>
> >> The exporter should be able to decide if it actually wants to use
> >> P2P when the transfer has to go through the host bridge (e.g. when
> >> IOMMU/bridge routing bits are enabled).
> >
> > Sure, but this is a simplified helper for exporters that don't have
> > choices where the memory comes from.
>
> That is extremely questionable as justification to put that in common DMA-buf code.
>
> > I fully expet to see changes to this to support more use cases,
> > including the one above. We should do those changes along with users
> > making use of them so we can evaluate what works best.
>
> Yeah, exactly that's my concern.
>
> >> But only take that as Acked-by, I would need at least a day (or
> >> week) of free time to wrap my head around all the technical details
> >> again. And that is something I won't have before January or even
> >> later.
> >
> > Sure, it is alot, and I think DRM community in general should come up
> > to speed on the new DMA API and how we are pushing to see P2P work
> > within Linux.
> >
> > So thanks, we can take the Acked-by and progress here. Interested
> > parties can pick it up from this point when time allows.
>
> Wait a second. After sleeping a night over it I think my initial take that we really should not put that into common DMA-buf code seems to hold true.
>
> This is the use case for VFIO, but I absolutely want to avoid other drivers from re-using this code until be have more experience with that.
>
> So to move forward I now strongly think we should keep that in VFIO until somebody else comes along and needs that helper.
It was put in VFIO at the beginning, but Christoph objected to it
because that would require exporting a symbol for pci_p2pdma_map_type(),
which was universally agreed to be a bad idea.
https://lore.kernel.org/all/aPYrEroyWVOvAu-5@infradead.org/
Thanks
>
> Regards,
> Christian.
>
> >
> > We can also have a mini-community call to give a summary/etc on these
> > topics.
> >
> > Thanks,
> > Jason
>
* Re: [Linaro-mm-sig] [PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine
2025-11-20 7:41 ` Leon Romanovsky
@ 2025-11-20 7:54 ` Christian König
2025-11-20 8:06 ` Leon Romanovsky
0 siblings, 1 reply; 63+ messages in thread
From: Christian König @ 2025-11-20 7:54 UTC (permalink / raw)
To: Leon Romanovsky
Cc: Jason Gunthorpe, Bjorn Helgaas, Logan Gunthorpe, Jens Axboe,
Robin Murphy, Joerg Roedel, Will Deacon, Marek Szyprowski,
Andrew Morton, Jonathan Corbet, Sumit Semwal, Kees Cook,
Gustavo A. R. Silva, Ankit Agrawal, Yishai Hadas,
Shameer Kolothum, Kevin Tian, Alex Williamson, Krishnakant Jaju,
Matt Ochs, linux-pci, linux-kernel, linux-block, iommu, linux-mm,
linux-doc, linux-media, dri-devel, linaro-mm-sig, kvm,
linux-hardening, Alex Mastro, Nicolin Chen
On 11/20/25 08:41, Leon Romanovsky wrote:
> On Thu, Nov 20, 2025 at 08:08:27AM +0100, Christian König wrote:
>> On 11/19/25 20:31, Jason Gunthorpe wrote:
>>> On Wed, Nov 19, 2025 at 02:42:18PM +0100, Christian König wrote:
>>>
>>>>>>> + case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
>>>>>>> + dma->state = kzalloc(sizeof(*dma->state), GFP_KERNEL);
>>>>>>> + if (!dma->state) {
>>>>>>> + ret = -ENOMEM;
>>>>>>> + goto err_free_dma;
>>>>>>> + }
>>>>>>> +
>>>>>>> + dma_iova_try_alloc(attach->dev, dma->state, 0, size);
>>>>>>
>>>>>> Oh, that is a clear no-go for the core DMA-buf code.
>>>>>>
>>>>>> It's intentionally up to the exporter how to create the DMA
>>>>>> addresses the importer can work with.
>>>>>
>>>>> I can't fully understand this remark?
>>>>
>>>> The exporter should be able to decide if it actually wants to use
>>>> P2P when the transfer has to go through the host bridge (e.g. when
>>>> IOMMU/bridge routing bits are enabled).
>>>
>>> Sure, but this is a simplified helper for exporters that don't have
>>> choices where the memory comes from.
>>
>> That is extremely questionable as justification to put that in common DMA-buf code.
>>
>>> I fully expet to see changes to this to support more use cases,
>>> including the one above. We should do those changes along with users
>>> making use of them so we can evaluate what works best.
>>
>> Yeah, exactly that's my concern.
>>
>>>> But only take that as Acked-by, I would need at least a day (or
>>>> week) of free time to wrap my head around all the technical details
>>>> again. And that is something I won't have before January or even
>>>> later.
>>>
>>> Sure, it is alot, and I think DRM community in general should come up
>>> to speed on the new DMA API and how we are pushing to see P2P work
>>> within Linux.
>>>
>>> So thanks, we can take the Acked-by and progress here. Interested
>>> parties can pick it up from this point when time allows.
>>
>> Wait a second. After sleeping a night over it I think my initial take that we really should not put that into common DMA-buf code seems to hold true.
>>
>> This is the use case for VFIO, but I absolutely want to avoid other drivers from re-using this code until be have more experience with that.
>>
>> So to move forward I now strongly think we should keep that in VFIO until somebody else comes along and needs that helper.
>
> It was put in VFIO at the beginning, but Christoph objected to it,
> because that will require exporting symbol for pci_p2pdma_map_type().
> which was universally agreed as not good idea.
Yeah, that is exactly what I object to here :)
We can have the helper in DMA-buf *if* pci_p2pdma_map_type() is called by drivers or is at least accessible. That's what I pointed out in the other mail before as well.
The exporter must be able to make decisions based on whether the transaction would go over the host bridge or not.
The background is that in a lot of use cases you would rather move the backing store into system memory instead of keeping it in local memory if the driver doesn't have direct access over a common upstream bridge.
Currently drivers decide that based on whether the IOMMU is enabled or not (and a few other quirks), but essentially you absolutely want a function which gives this information to exporters. For the VFIO use case it doesn't matter because you can't swap the BAR for system memory.
To unblock you, please add a big fat comment in the kerneldoc of the mapping function explaining this, and that it might be necessary for exporters to call pci_p2pdma_map_type() as well.
Regards,
Christian.
>
> https://lore.kernel.org/all/aPYrEroyWVOvAu-5@infradead.org/
>
> Thanks
>
>>
>> Regards,
>> Christian.
>>
>>>
>>> We can also have a mini-community call to give a summary/etc on these
>>> topics.
>>>
>>> Thanks,
>>> Jason
>>
* Re: [Linaro-mm-sig] [PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine
2025-11-20 7:54 ` Christian König
@ 2025-11-20 8:06 ` Leon Romanovsky
2025-11-20 8:32 ` Christian König
0 siblings, 1 reply; 63+ messages in thread
From: Leon Romanovsky @ 2025-11-20 8:06 UTC (permalink / raw)
To: Christian König
Cc: Jason Gunthorpe, Bjorn Helgaas, Logan Gunthorpe, Jens Axboe,
Robin Murphy, Joerg Roedel, Will Deacon, Marek Szyprowski,
Andrew Morton, Jonathan Corbet, Sumit Semwal, Kees Cook,
Gustavo A. R. Silva, Ankit Agrawal, Yishai Hadas,
Shameer Kolothum, Kevin Tian, Alex Williamson, Krishnakant Jaju,
Matt Ochs, linux-pci, linux-kernel, linux-block, iommu, linux-mm,
linux-doc, linux-media, dri-devel, linaro-mm-sig, kvm,
linux-hardening, Alex Mastro, Nicolin Chen
On Thu, Nov 20, 2025 at 08:54:37AM +0100, Christian König wrote:
> On 11/20/25 08:41, Leon Romanovsky wrote:
> > On Thu, Nov 20, 2025 at 08:08:27AM +0100, Christian König wrote:
> >> On 11/19/25 20:31, Jason Gunthorpe wrote:
> >>> On Wed, Nov 19, 2025 at 02:42:18PM +0100, Christian König wrote:
> >>>
> >>>>>>> + case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
> >>>>>>> + dma->state = kzalloc(sizeof(*dma->state), GFP_KERNEL);
> >>>>>>> + if (!dma->state) {
> >>>>>>> + ret = -ENOMEM;
> >>>>>>> + goto err_free_dma;
> >>>>>>> + }
> >>>>>>> +
> >>>>>>> + dma_iova_try_alloc(attach->dev, dma->state, 0, size);
> >>>>>>
> >>>>>> Oh, that is a clear no-go for the core DMA-buf code.
> >>>>>>
> >>>>>> It's intentionally up to the exporter how to create the DMA
> >>>>>> addresses the importer can work with.
> >>>>>
> >>>>> I can't fully understand this remark?
> >>>>
> >>>> The exporter should be able to decide if it actually wants to use
> >>>> P2P when the transfer has to go through the host bridge (e.g. when
> >>>> IOMMU/bridge routing bits are enabled).
> >>>
> >>> Sure, but this is a simplified helper for exporters that don't have
> >>> choices where the memory comes from.
> >>
> >> That is extremely questionable as justification to put that in common DMA-buf code.
> >>
> >>> I fully expet to see changes to this to support more use cases,
> >>> including the one above. We should do those changes along with users
> >>> making use of them so we can evaluate what works best.
> >>
> >> Yeah, exactly that's my concern.
> >>
> >>>> But only take that as Acked-by, I would need at least a day (or
> >>>> week) of free time to wrap my head around all the technical details
> >>>> again. And that is something I won't have before January or even
> >>>> later.
> >>>
> >>> Sure, it is alot, and I think DRM community in general should come up
> >>> to speed on the new DMA API and how we are pushing to see P2P work
> >>> within Linux.
> >>>
> >>> So thanks, we can take the Acked-by and progress here. Interested
> >>> parties can pick it up from this point when time allows.
> >>
> >> Wait a second. After sleeping a night over it I think my initial take that we really should not put that into common DMA-buf code seems to hold true.
> >>
> >> This is the use case for VFIO, but I absolutely want to avoid other drivers from re-using this code until be have more experience with that.
> >>
> >> So to move forward I now strongly think we should keep that in VFIO until somebody else comes along and needs that helper.
> >
> > It was put in VFIO at the beginning, but Christoph objected to it,
> > because that will require exporting symbol for pci_p2pdma_map_type().
> > which was universally agreed as not good idea.
>
> Yeah, that is exactly what I object here :)
>
> We can have the helper in DMA-buf *if* pci_p2pdma_map_type() is called by drivers or at least accessible. That's what I pointed out in the other mail before as well.
>
> The exporter must be able to make decisions based on if the transaction would go over the host bridge or not.
>
> Background is that in a lot of use cases you rather want to move the backing store into system memory instead of keeping it in local memory if the driver doesn't have direct access over a common upstream bridge.
>
> Currently drivers decide that based on if IOMMU is enabled or not (and a few other quirks), but essentially you absolutely want a function which gives this information to exporters. For the VFIO use case it doesn't matter because you can't switch the BAR for system memory.
>
> To unblock you, please add a big fat comment in the kerneldoc of the mapping explaining this and that it might be necessary for exporters to call pci_p2pdma_map_type() as well.
Thanks,
What do you think about it?
diff --git a/drivers/dma-buf/dma-buf-mapping.c b/drivers/dma-buf/dma-buf-mapping.c
index a69bb73db86d..05ec84a0157b 100644
--- a/drivers/dma-buf/dma-buf-mapping.c
+++ b/drivers/dma-buf/dma-buf-mapping.c
@@ -84,6 +84,11 @@ struct dma_buf_dma {
* PAGE_SIZE aligned.
*
* A mapping must be unmapped by using dma_buf_free_sgt().
+ *
+ * NOTE: While this function is intended for DMA-buf importers, it is critical
+ * that the DMA-buf exporter is capable of performing peer-to-peer (P2P) DMA
+ * directly between PCI devices, without routing transactions through the host
+ * bridge.
*/
struct sg_table *dma_buf_phys_vec_to_sgt(struct dma_buf_attachment *attach,
struct p2pdma_provider *provider,
>
> Regards,
> Christian.
>
> >
> > https://lore.kernel.org/all/aPYrEroyWVOvAu-5@infradead.org/
> >
> > Thanks
> >
> >>
> >> Regards,
> >> Christian.
> >>
> >>>
> >>> We can also have a mini-community call to give a summary/etc on these
> >>> topics.
> >>>
> >>> Thanks,
> >>> Jason
> >>
>
* Re: [Linaro-mm-sig] [PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine
2025-11-20 8:06 ` Leon Romanovsky
@ 2025-11-20 8:32 ` Christian König
2025-11-20 8:42 ` Leon Romanovsky
0 siblings, 1 reply; 63+ messages in thread
From: Christian König @ 2025-11-20 8:32 UTC (permalink / raw)
To: Leon Romanovsky
Cc: Jason Gunthorpe, Bjorn Helgaas, Logan Gunthorpe, Jens Axboe,
Robin Murphy, Joerg Roedel, Will Deacon, Marek Szyprowski,
Andrew Morton, Jonathan Corbet, Sumit Semwal, Kees Cook,
Gustavo A. R. Silva, Ankit Agrawal, Yishai Hadas,
Shameer Kolothum, Kevin Tian, Alex Williamson, Krishnakant Jaju,
Matt Ochs, linux-pci, linux-kernel, linux-block, iommu, linux-mm,
linux-doc, linux-media, dri-devel, linaro-mm-sig, kvm,
linux-hardening, Alex Mastro, Nicolin Chen
On 11/20/25 09:06, Leon Romanovsky wrote:
> On Thu, Nov 20, 2025 at 08:54:37AM +0100, Christian König wrote:
>> On 11/20/25 08:41, Leon Romanovsky wrote:
>>> On Thu, Nov 20, 2025 at 08:08:27AM +0100, Christian König wrote:
>>>> On 11/19/25 20:31, Jason Gunthorpe wrote:
>>>>> On Wed, Nov 19, 2025 at 02:42:18PM +0100, Christian König wrote:
>>>>>
>>>>>>>>> + case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
>>>>>>>>> + dma->state = kzalloc(sizeof(*dma->state), GFP_KERNEL);
>>>>>>>>> + if (!dma->state) {
>>>>>>>>> + ret = -ENOMEM;
>>>>>>>>> + goto err_free_dma;
>>>>>>>>> + }
>>>>>>>>> +
>>>>>>>>> + dma_iova_try_alloc(attach->dev, dma->state, 0, size);
>>>>>>>>
>>>>>>>> Oh, that is a clear no-go for the core DMA-buf code.
>>>>>>>>
>>>>>>>> It's intentionally up to the exporter how to create the DMA
>>>>>>>> addresses the importer can work with.
>>>>>>>
>>>>>>> I can't fully understand this remark?
>>>>>>
>>>>>> The exporter should be able to decide if it actually wants to use
>>>>>> P2P when the transfer has to go through the host bridge (e.g. when
>>>>>> IOMMU/bridge routing bits are enabled).
>>>>>
>>>>> Sure, but this is a simplified helper for exporters that don't have
>>>>> choices where the memory comes from.
>>>>
>>>> That is extremely questionable as justification to put that in common DMA-buf code.
>>>>
>>>>> I fully expet to see changes to this to support more use cases,
>>>>> including the one above. We should do those changes along with users
>>>>> making use of them so we can evaluate what works best.
>>>>
>>>> Yeah, exactly that's my concern.
>>>>
>>>>>> But only take that as Acked-by, I would need at least a day (or
>>>>>> week) of free time to wrap my head around all the technical details
>>>>>> again. And that is something I won't have before January or even
>>>>>> later.
>>>>>
>>>>> Sure, it is alot, and I think DRM community in general should come up
>>>>> to speed on the new DMA API and how we are pushing to see P2P work
>>>>> within Linux.
>>>>>
>>>>> So thanks, we can take the Acked-by and progress here. Interested
>>>>> parties can pick it up from this point when time allows.
>>>>
>>>> Wait a second. After sleeping a night over it I think my initial take that we really should not put that into common DMA-buf code seems to hold true.
>>>>
>>>> This is the use case for VFIO, but I absolutely want to avoid other drivers from re-using this code until be have more experience with that.
>>>>
>>>> So to move forward I now strongly think we should keep that in VFIO until somebody else comes along and needs that helper.
>>>
>>> It was put in VFIO at the beginning, but Christoph objected to it,
>>> because that will require exporting symbol for pci_p2pdma_map_type().
>>> which was universally agreed as not good idea.
>>
>> Yeah, that is exactly what I object here :)
>>
>> We can have the helper in DMA-buf *if* pci_p2pdma_map_type() is called by drivers or at least accessible. That's what I pointed out in the other mail before as well.
>>
>> The exporter must be able to make decisions based on if the transaction would go over the host bridge or not.
>>
>> Background is that in a lot of use cases you rather want to move the backing store into system memory instead of keeping it in local memory if the driver doesn't have direct access over a common upstream bridge.
>>
>> Currently drivers decide that based on if IOMMU is enabled or not (and a few other quirks), but essentially you absolutely want a function which gives this information to exporters. For the VFIO use case it doesn't matter because you can't switch the BAR for system memory.
>>
>> To unblock you, please add a big fat comment in the kerneldoc of the mapping explaining this and that it might be necessary for exporters to call pci_p2pdma_map_type() as well.
>
> Thanks,
>
> What do you think about it?
>
> diff --git a/drivers/dma-buf/dma-buf-mapping.c b/drivers/dma-buf/dma-buf-mapping.c
> index a69bb73db86d..05ec84a0157b 100644
> --- a/drivers/dma-buf/dma-buf-mapping.c
> +++ b/drivers/dma-buf/dma-buf-mapping.c
> @@ -84,6 +84,11 @@ struct dma_buf_dma {
> * PAGE_SIZE aligned.
> *
> * A mapping must be unmapped by using dma_buf_free_sgt().
> + *
> + * NOTE: While this function is intended for DMA-buf importers, it is critical
> + * that the DMA-buf exporter is capable of performing peer-to-peer (P2P) DMA
> + * directly between PCI devices, without routing transactions through the host
> + * bridge.
Well, first of all, this function is intended for exporters, not importers.
Maybe write something like: "This function is intended for exporters. If direct traffic routing is mandatory, the exporter should call pci_p2pdma_map_type() before calling this function."
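Applied to the draft hunk quoted earlier in this thread, the note would then read roughly like this (wording only, illustrative):

	 * NOTE: This function is intended for exporters. If direct traffic
	 * routing is mandatory, the exporter should call
	 * pci_p2pdma_map_type() before calling this function.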
Regards,
Christian.
> */
> struct sg_table *dma_buf_phys_vec_to_sgt(struct dma_buf_attachment *attach,
> struct p2pdma_provider *provider,
> (END)
>
>
>>
>> Regards,
>> Christian.
>>
>>>
>>> https://lore.kernel.org/all/aPYrEroyWVOvAu-5@infradead.org/
>>>
>>> Thanks
>>>
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>
>>>>> We can also have a mini-community call to give a summary/etc on these
>>>>> topics.
>>>>>
>>>>> Thanks,
>>>>> Jason
>>>>
>>
* Re: [Linaro-mm-sig] [PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine
2025-11-20 8:32 ` Christian König
@ 2025-11-20 8:42 ` Leon Romanovsky
0 siblings, 0 replies; 63+ messages in thread
From: Leon Romanovsky @ 2025-11-20 8:42 UTC (permalink / raw)
To: Christian König
Cc: Jason Gunthorpe, Bjorn Helgaas, Logan Gunthorpe, Jens Axboe,
Robin Murphy, Joerg Roedel, Will Deacon, Marek Szyprowski,
Andrew Morton, Jonathan Corbet, Sumit Semwal, Kees Cook,
Gustavo A. R. Silva, Ankit Agrawal, Yishai Hadas,
Shameer Kolothum, Kevin Tian, Alex Williamson, Krishnakant Jaju,
Matt Ochs, linux-pci, linux-kernel, linux-block, iommu, linux-mm,
linux-doc, linux-media, dri-devel, linaro-mm-sig, kvm,
linux-hardening, Alex Mastro, Nicolin Chen
On Thu, Nov 20, 2025 at 09:32:22AM +0100, Christian König wrote:
> On 11/20/25 09:06, Leon Romanovsky wrote:
> > On Thu, Nov 20, 2025 at 08:54:37AM +0100, Christian König wrote:
> >> On 11/20/25 08:41, Leon Romanovsky wrote:
> >>> On Thu, Nov 20, 2025 at 08:08:27AM +0100, Christian König wrote:
> >>>> On 11/19/25 20:31, Jason Gunthorpe wrote:
> >>>>> On Wed, Nov 19, 2025 at 02:42:18PM +0100, Christian König wrote:
> >>>>>
> >>>>>>>>> + case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
> >>>>>>>>> + dma->state = kzalloc(sizeof(*dma->state), GFP_KERNEL);
> >>>>>>>>> + if (!dma->state) {
> >>>>>>>>> + ret = -ENOMEM;
> >>>>>>>>> + goto err_free_dma;
> >>>>>>>>> + }
> >>>>>>>>> +
> >>>>>>>>> + dma_iova_try_alloc(attach->dev, dma->state, 0, size);
> >>>>>>>>
> >>>>>>>> Oh, that is a clear no-go for the core DMA-buf code.
> >>>>>>>>
> >>>>>>>> It's intentionally up to the exporter how to create the DMA
> >>>>>>>> addresses the importer can work with.
> >>>>>>>
> >>>>>>> I can't fully understand this remark?
> >>>>>>
> >>>>>> The exporter should be able to decide if it actually wants to use
> >>>>>> P2P when the transfer has to go through the host bridge (e.g. when
> >>>>>> IOMMU/bridge routing bits are enabled).
> >>>>>
> >>>>> Sure, but this is a simplified helper for exporters that don't have
> >>>>> choices where the memory comes from.
> >>>>
> >>>> That is extremely questionable as justification to put that in common DMA-buf code.
> >>>>
> >>>>> I fully expet to see changes to this to support more use cases,
> >>>>> including the one above. We should do those changes along with users
> >>>>> making use of them so we can evaluate what works best.
> >>>>
> >>>> Yeah, exactly that's my concern.
> >>>>
> >>>>>> But only take that as Acked-by, I would need at least a day (or
> >>>>>> week) of free time to wrap my head around all the technical details
> >>>>>> again. And that is something I won't have before January or even
> >>>>>> later.
> >>>>>
> >>>>> Sure, it is alot, and I think DRM community in general should come up
> >>>>> to speed on the new DMA API and how we are pushing to see P2P work
> >>>>> within Linux.
> >>>>>
> >>>>> So thanks, we can take the Acked-by and progress here. Interested
> >>>>> parties can pick it up from this point when time allows.
> >>>>
> >>>> Wait a second. After sleeping a night over it I think my initial take that we really should not put that into common DMA-buf code seems to hold true.
> >>>>
> >>>> This is the use case for VFIO, but I absolutely want to avoid other drivers from re-using this code until be have more experience with that.
> >>>>
> >>>> So to move forward I now strongly think we should keep that in VFIO until somebody else comes along and needs that helper.
> >>>
> >>> It was put in VFIO at the beginning, but Christoph objected to it,
> >>> because that will require exporting symbol for pci_p2pdma_map_type().
> >>> which was universally agreed as not good idea.
> >>
> >> Yeah, that is exactly what I object here :)
> >>
> >> We can have the helper in DMA-buf *if* pci_p2pdma_map_type() is called by drivers or at least accessible. That's what I pointed out in the other mail before as well.
> >>
> >> The exporter must be able to make decisions based on if the transaction would go over the host bridge or not.
> >>
> >> Background is that in a lot of use cases you rather want to move the backing store into system memory instead of keeping it in local memory if the driver doesn't have direct access over a common upstream bridge.
> >>
> >> Currently drivers decide that based on if IOMMU is enabled or not (and a few other quirks), but essentially you absolutely want a function which gives this information to exporters. For the VFIO use case it doesn't matter because you can't switch the BAR for system memory.
> >>
> >> To unblock you, please add a big fat comment in the kerneldoc of the mapping explaining this and that it might be necessary for exporters to call pci_p2pdma_map_type() as well.
> >
> > Thanks,
> >
> > What do you think about it?
> >
> > diff --git a/drivers/dma-buf/dma-buf-mapping.c b/drivers/dma-buf/dma-buf-mapping.c
> > index a69bb73db86d..05ec84a0157b 100644
> > --- a/drivers/dma-buf/dma-buf-mapping.c
> > +++ b/drivers/dma-buf/dma-buf-mapping.c
> > @@ -84,6 +84,11 @@ struct dma_buf_dma {
> > * PAGE_SIZE aligned.
> > *
> > * A mapping must be unmapped by using dma_buf_free_sgt().
> > + *
> > + * NOTE: While this function is intended for DMA-buf importers, it is critical
> > + * that the DMA-buf exporter is capable of performing peer-to-peer (P2P) DMA
> > + * directly between PCI devices, without routing transactions through the host
> > + * bridge.
>
> Well first of all this function is intended for exporters not importers.
>
> Maybe write something like "This function is intended for exporters. If direct traffic routing is mandatory exporter should call routing pci_p2pdma_map_type() before calling this function.".
Sure, no problem.
Thanks
>
> Regards,
> Christian.
>
> > */
> > struct sg_table *dma_buf_phys_vec_to_sgt(struct dma_buf_attachment *attach,
> > struct p2pdma_provider *provider,
> > (END)
> >
> >
> >>
> >> Regards,
> >> Christian.
> >>
> >>>
> >>> https://lore.kernel.org/all/aPYrEroyWVOvAu-5@infradead.org/
> >>>
> >>> Thanks
> >>>
> >>>>
> >>>> Regards,
> >>>> Christian.
> >>>>
> >>>>>
> >>>>> We can also have a mini-community call to give a summary/etc on these
> >>>>> topics.
> >>>>>
> >>>>> Thanks,
> >>>>> Jason
> >>>>
> >>
>
* Re: [Linaro-mm-sig] [PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine
2025-11-20 7:08 ` Christian König
2025-11-20 7:41 ` Leon Romanovsky
@ 2025-11-20 13:20 ` Jason Gunthorpe
1 sibling, 0 replies; 63+ messages in thread
From: Jason Gunthorpe @ 2025-11-20 13:20 UTC (permalink / raw)
To: Christian König
Cc: Leon Romanovsky, Bjorn Helgaas, Logan Gunthorpe, Jens Axboe,
Robin Murphy, Joerg Roedel, Will Deacon, Marek Szyprowski,
Andrew Morton, Jonathan Corbet, Sumit Semwal, Kees Cook,
Gustavo A. R. Silva, Ankit Agrawal, Yishai Hadas,
Shameer Kolothum, Kevin Tian, Alex Williamson, Krishnakant Jaju,
Matt Ochs, linux-pci, linux-kernel, linux-block, iommu, linux-mm,
linux-doc, linux-media, dri-devel, linaro-mm-sig, kvm,
linux-hardening, Alex Mastro, Nicolin Chen
On Thu, Nov 20, 2025 at 08:08:27AM +0100, Christian König wrote:
> >> The exporter should be able to decide if it actually wants to use
> >> P2P when the transfer has to go through the host bridge (e.g. when
> >> IOMMU/bridge routing bits are enabled).
> >
> > Sure, but this is a simplified helper for exporters that don't have
> > choices where the memory comes from.
>
> That is extremely questionable as justification to put that in common DMA-buf code.
FWIW we already have patches for a RDMA exporter lined up to use it as
well. That's two users already...
Jason
* Re: [Linaro-mm-sig] [PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine
2025-11-19 13:16 ` [Linaro-mm-sig] " Christian König
2025-11-19 13:25 ` Jason Gunthorpe
@ 2025-11-19 13:42 ` Leon Romanovsky
2025-11-19 14:11 ` Christian König
1 sibling, 1 reply; 63+ messages in thread
From: Leon Romanovsky @ 2025-11-19 13:42 UTC (permalink / raw)
To: Christian König
Cc: Bjorn Helgaas, Logan Gunthorpe, Jens Axboe, Robin Murphy,
Joerg Roedel, Will Deacon, Marek Szyprowski, Jason Gunthorpe,
Andrew Morton, Jonathan Corbet, Sumit Semwal, Kees Cook,
Gustavo A. R. Silva, Ankit Agrawal, Yishai Hadas,
Shameer Kolothum, Kevin Tian, Alex Williamson, Krishnakant Jaju,
Matt Ochs, linux-pci, linux-kernel, linux-block, iommu, linux-mm,
linux-doc, linux-media, dri-devel, linaro-mm-sig, kvm,
linux-hardening, Alex Mastro, Nicolin Chen
On Wed, Nov 19, 2025 at 02:16:57PM +0100, Christian König wrote:
>
>
> On 11/11/25 10:57, Leon Romanovsky wrote:
> > From: Leon Romanovsky <leonro@nvidia.com>
> >
> > Add dma_buf_map() and dma_buf_unmap() helpers to convert an array of
> > MMIO physical address ranges into scatter-gather tables with proper
> > DMA mapping.
> >
> > These common functions are a starting point and support any PCI
> > drivers creating mappings from their BAR's MMIO addresses. VFIO is one
> > case, as shortly will be RDMA. We can review existing DRM drivers to
> > refactor them separately. We hope this will evolve into routines to
> > help common DRM that include mixed CPU and MMIO mappings.
> >
> > Compared to the dma_map_resource() abuse this implementation handles
> > the complicated PCI P2P scenarios properly, especially when an IOMMU
> > is enabled:
> >
> > - Direct bus address mapping without IOVA allocation for
> > PCI_P2PDMA_MAP_BUS_ADDR, using pci_p2pdma_bus_addr_map(). This
> > happens if the IOMMU is enabled but the PCIe switch ACS flags allow
> > transactions to avoid the host bridge.
> >
> > Further, this handles the slightly obscure, case of MMIO with a
> > phys_addr_t that is different from the physical BAR programming
> > (bus offset). The phys_addr_t is converted to a dma_addr_t and
> > accommodates this effect. This enables certain real systems to
> > work, especially on ARM platforms.
> >
> > - Mapping through host bridge with IOVA allocation and DMA_ATTR_MMIO
> > attribute for MMIO memory regions (PCI_P2PDMA_MAP_THRU_HOST_BRIDGE).
> > This happens when the IOMMU is enabled and the ACS flags are forcing
> > all traffic to the IOMMU - ie for virtualization systems.
> >
> > - Cases where P2P is not supported through the host bridge/CPU. The
> > P2P subsystem is the proper place to detect this and block it.
> >
> > Helper functions fill_sg_entry() and calc_sg_nents() handle the
> > scatter-gather table construction, splitting large regions into
> > UINT_MAX-sized chunks to fit within sg->length field limits.
> >
> > Since the physical address based DMA API forbids use of the CPU list
> > of the scatterlist this will produce a mangled scatterlist that has
> > a fully zero-length and NULL'd CPU list. The list is 0 length,
> > all the struct page pointers are NULL and zero sized. This is stronger
> > and more robust than the existing mangle_sg_table() technique. It is
> > a future project to migrate DMABUF as a subsystem away from using
> > scatterlist for this data structure.
> >
> > Tested-by: Alex Mastro <amastro@fb.com>
> > Tested-by: Nicolin Chen <nicolinc@nvidia.com>
> > Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> > ---
> > drivers/dma-buf/dma-buf.c | 235 ++++++++++++++++++++++++++++++++++++++++++++++
> > include/linux/dma-buf.h | 18 ++++
> > 2 files changed, 253 insertions(+)
> >
> > diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> > index 2bcf9ceca997..cb55dff1dad5 100644
> > --- a/drivers/dma-buf/dma-buf.c
> > +++ b/drivers/dma-buf/dma-buf.c
> > @@ -1254,6 +1254,241 @@ void dma_buf_unmap_attachment_unlocked(struct dma_buf_attachment *attach,
> > }
> > EXPORT_SYMBOL_NS_GPL(dma_buf_unmap_attachment_unlocked, "DMA_BUF");
> >
> > +static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length,
> > + dma_addr_t addr)
> > +{
> > + unsigned int len, nents;
> > + int i;
> > +
> > + nents = DIV_ROUND_UP(length, UINT_MAX);
> > + for (i = 0; i < nents; i++) {
> > + len = min_t(size_t, length, UINT_MAX);
> > + length -= len;
> > + /*
> > + * DMABUF abuses scatterlist to create a scatterlist
> > + * that does not have any CPU list, only the DMA list.
> > + * Always set the page related values to NULL to ensure
> > + * importers can't use it. The phys_addr based DMA API
> > + * does not require the CPU list for mapping or unmapping.
> > + */
> > + sg_set_page(sgl, NULL, 0, 0);
> > + sg_dma_address(sgl) = addr + i * UINT_MAX;
> > + sg_dma_len(sgl) = len;
> > + sgl = sg_next(sgl);
> > + }
> > +
> > + return sgl;
> > +}
> > +
> > +static unsigned int calc_sg_nents(struct dma_iova_state *state,
> > + struct dma_buf_phys_vec *phys_vec,
> > + size_t nr_ranges, size_t size)
> > +{
> > + unsigned int nents = 0;
> > + size_t i;
> > +
> > + if (!state || !dma_use_iova(state)) {
> > + for (i = 0; i < nr_ranges; i++)
> > + nents += DIV_ROUND_UP(phys_vec[i].len, UINT_MAX);
> > + } else {
> > + /*
> > + * In IOVA case, there is only one SG entry which spans
> > + * for whole IOVA address space, but we need to make sure
> > + * that it fits sg->length, maybe we need more.
> > + */
> > + nents = DIV_ROUND_UP(size, UINT_MAX);
> > + }
> > +
> > + return nents;
> > +}
> > +
> > +/**
> > + * struct dma_buf_dma - holds DMA mapping information
> > + * @sgt: Scatter-gather table
> > + * @state: DMA IOVA state relevant in IOMMU-based DMA
> > + * @size: Total size of DMA transfer
> > + */
> > +struct dma_buf_dma {
> > + struct sg_table sgt;
> > + struct dma_iova_state *state;
> > + size_t size;
> > +};
> > +
> > +/**
> > + * dma_buf_map - Returns the scatterlist table of the attachment from arrays
> > + * of physical vectors. This funciton is intended for MMIO memory only.
> > + * @attach: [in] attachment whose scatterlist is to be returned
> > + * @provider: [in] p2pdma provider
> > + * @phys_vec: [in] array of physical vectors
> > + * @nr_ranges: [in] number of entries in phys_vec array
> > + * @size: [in] total size of phys_vec
> > + * @dir: [in] direction of DMA transfer
> > + *
> > + * Returns sg_table containing the scatterlist to be returned; returns ERR_PTR
> > + * on error. May return -EINTR if it is interrupted by a signal.
> > + *
> > + * On success, the DMA addresses and lengths in the returned scatterlist are
> > + * PAGE_SIZE aligned.
> > + *
> > + * A mapping must be unmapped by using dma_buf_unmap().
> > + */
> > +struct sg_table *dma_buf_map(struct dma_buf_attachment *attach,
>
> That is clearly not a good name for this function. We already have overloaded the term *mapping* with something completely different.
This function performs DMA mapping, so what name do you suggest instead of dma_buf_map()?
>
> > + struct p2pdma_provider *provider,
> > + struct dma_buf_phys_vec *phys_vec,
> > + size_t nr_ranges, size_t size,
> > + enum dma_data_direction dir)
> > +{
> > + unsigned int nents, mapped_len = 0;
> > + struct dma_buf_dma *dma;
> > + struct scatterlist *sgl;
> > + dma_addr_t addr;
> > + size_t i;
> > + int ret;
> > +
> > + dma_resv_assert_held(attach->dmabuf->resv);
> > +
> > + if (WARN_ON(!attach || !attach->dmabuf || !provider))
> > + /* This function is supposed to work on MMIO memory only */
> > + return ERR_PTR(-EINVAL);
> > +
> > + dma = kzalloc(sizeof(*dma), GFP_KERNEL);
> > + if (!dma)
> > + return ERR_PTR(-ENOMEM);
> > +
> > + switch (pci_p2pdma_map_type(provider, attach->dev)) {
> > + case PCI_P2PDMA_MAP_BUS_ADDR:
> > + /*
> > + * There is no need in IOVA at all for this flow.
> > + */
> > + break;
> > + case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
> > + dma->state = kzalloc(sizeof(*dma->state), GFP_KERNEL);
> > + if (!dma->state) {
> > + ret = -ENOMEM;
> > + goto err_free_dma;
> > + }
> > +
> > + dma_iova_try_alloc(attach->dev, dma->state, 0, size);
>
> Oh, that is a clear no-go for the core DMA-buf code.
>
> It's intentionally up to the exporter how to create the DMA addresses the importer can work with.
I didn't fully understand the email either. The importer needs to
configure DMA, and it supports only MMIO addresses. The exporter
controls it by asking for peer2peer.
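For context, the negotiation referred to here is the existing dma-buf peer2peer handshake: the importer opts in through dma_buf_attach_ops::allow_peer2peer and the exporter then sees the result in attach->peer2peer. A minimal exporter-side check could look like this (sketch only; the my_exp structure and its fields are made up):

	static struct sg_table *my_exp_map(struct dma_buf_attachment *attach,
					   enum dma_data_direction dir)
	{
		struct my_exp *exp = attach->dmabuf->priv;

		if (!attach->peer2peer)
			/* Importer did not opt in to MMIO/P2P addresses. */
			return ERR_PTR(-EOPNOTSUPP);

		return dma_buf_map(attach, exp->provider, exp->phys_vec,
				   exp->nr_ranges, exp->size, dir);
	}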
Thanks
* Re: [Linaro-mm-sig] [PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine
2025-11-19 13:42 ` Leon Romanovsky
@ 2025-11-19 14:11 ` Christian König
2025-11-19 14:50 ` Leon Romanovsky
2025-11-19 19:36 ` Jason Gunthorpe
0 siblings, 2 replies; 63+ messages in thread
From: Christian König @ 2025-11-19 14:11 UTC (permalink / raw)
To: Leon Romanovsky
Cc: Bjorn Helgaas, Logan Gunthorpe, Jens Axboe, Robin Murphy,
Joerg Roedel, Will Deacon, Marek Szyprowski, Jason Gunthorpe,
Andrew Morton, Jonathan Corbet, Sumit Semwal, Kees Cook,
Gustavo A. R. Silva, Ankit Agrawal, Yishai Hadas,
Shameer Kolothum, Kevin Tian, Alex Williamson, Krishnakant Jaju,
Matt Ochs, linux-pci, linux-kernel, linux-block, iommu, linux-mm,
linux-doc, linux-media, dri-devel, linaro-mm-sig, kvm,
linux-hardening, Alex Mastro, Nicolin Chen
On 11/19/25 14:42, Leon Romanovsky wrote:
> On Wed, Nov 19, 2025 at 02:16:57PM +0100, Christian König wrote:
>>
>>
>> On 11/11/25 10:57, Leon Romanovsky wrote:
>>> From: Leon Romanovsky <leonro@nvidia.com>
>>>
>>> Add dma_buf_map() and dma_buf_unmap() helpers to convert an array of
>>> MMIO physical address ranges into scatter-gather tables with proper
>>> DMA mapping.
>>>
>>> These common functions are a starting point and support any PCI
>>> drivers creating mappings from their BAR's MMIO addresses. VFIO is one
>>> case, as shortly will be RDMA. We can review existing DRM drivers to
>>> refactor them separately. We hope this will evolve into routines to
>>> help common DRM that include mixed CPU and MMIO mappings.
>>>
>>> Compared to the dma_map_resource() abuse this implementation handles
>>> the complicated PCI P2P scenarios properly, especially when an IOMMU
>>> is enabled:
>>>
>>> - Direct bus address mapping without IOVA allocation for
>>> PCI_P2PDMA_MAP_BUS_ADDR, using pci_p2pdma_bus_addr_map(). This
>>> happens if the IOMMU is enabled but the PCIe switch ACS flags allow
>>> transactions to avoid the host bridge.
>>>
>>> Further, this handles the slightly obscure, case of MMIO with a
>>> phys_addr_t that is different from the physical BAR programming
>>> (bus offset). The phys_addr_t is converted to a dma_addr_t and
>>> accommodates this effect. This enables certain real systems to
>>> work, especially on ARM platforms.
>>>
>>> - Mapping through host bridge with IOVA allocation and DMA_ATTR_MMIO
>>> attribute for MMIO memory regions (PCI_P2PDMA_MAP_THRU_HOST_BRIDGE).
>>> This happens when the IOMMU is enabled and the ACS flags are forcing
>>> all traffic to the IOMMU - ie for virtualization systems.
>>>
>>> - Cases where P2P is not supported through the host bridge/CPU. The
>>> P2P subsystem is the proper place to detect this and block it.
>>>
>>> Helper functions fill_sg_entry() and calc_sg_nents() handle the
>>> scatter-gather table construction, splitting large regions into
>>> UINT_MAX-sized chunks to fit within sg->length field limits.
>>>
>>> Since the physical address based DMA API forbids use of the CPU list
>>> of the scatterlist this will produce a mangled scatterlist that has
>>> a fully zero-length and NULL'd CPU list. The list is 0 length,
>>> all the struct page pointers are NULL and zero sized. This is stronger
>>> and more robust than the existing mangle_sg_table() technique. It is
>>> a future project to migrate DMABUF as a subsystem away from using
>>> scatterlist for this data structure.
>>>
>>> Tested-by: Alex Mastro <amastro@fb.com>
>>> Tested-by: Nicolin Chen <nicolinc@nvidia.com>
>>> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
>>> ---
>>> drivers/dma-buf/dma-buf.c | 235 ++++++++++++++++++++++++++++++++++++++++++++++
>>> include/linux/dma-buf.h | 18 ++++
>>> 2 files changed, 253 insertions(+)
>>>
>>> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
>>> index 2bcf9ceca997..cb55dff1dad5 100644
>>> --- a/drivers/dma-buf/dma-buf.c
>>> +++ b/drivers/dma-buf/dma-buf.c
>>> @@ -1254,6 +1254,241 @@ void dma_buf_unmap_attachment_unlocked(struct dma_buf_attachment *attach,
>>> }
>>> EXPORT_SYMBOL_NS_GPL(dma_buf_unmap_attachment_unlocked, "DMA_BUF");
>>>
>>> +static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length,
>>> + dma_addr_t addr)
>>> +{
>>> + unsigned int len, nents;
>>> + int i;
>>> +
>>> + nents = DIV_ROUND_UP(length, UINT_MAX);
>>> + for (i = 0; i < nents; i++) {
>>> + len = min_t(size_t, length, UINT_MAX);
>>> + length -= len;
>>> + /*
>>> + * DMABUF abuses scatterlist to create a scatterlist
>>> + * that does not have any CPU list, only the DMA list.
>>> + * Always set the page related values to NULL to ensure
>>> + * importers can't use it. The phys_addr based DMA API
>>> + * does not require the CPU list for mapping or unmapping.
>>> + */
>>> + sg_set_page(sgl, NULL, 0, 0);
>>> + sg_dma_address(sgl) = addr + i * UINT_MAX;
>>> + sg_dma_len(sgl) = len;
>>> + sgl = sg_next(sgl);
>>> + }
>>> +
>>> + return sgl;
>>> +}
>>> +
>>> +static unsigned int calc_sg_nents(struct dma_iova_state *state,
>>> + struct dma_buf_phys_vec *phys_vec,
>>> + size_t nr_ranges, size_t size)
>>> +{
>>> + unsigned int nents = 0;
>>> + size_t i;
>>> +
>>> + if (!state || !dma_use_iova(state)) {
>>> + for (i = 0; i < nr_ranges; i++)
>>> + nents += DIV_ROUND_UP(phys_vec[i].len, UINT_MAX);
>>> + } else {
>>> + /*
>>> + * In IOVA case, there is only one SG entry which spans
>>> + * for whole IOVA address space, but we need to make sure
>>> + * that it fits sg->length, maybe we need more.
>>> + */
>>> + nents = DIV_ROUND_UP(size, UINT_MAX);
>>> + }
>>> +
>>> + return nents;
>>> +}
>>> +
>>> +/**
>>> + * struct dma_buf_dma - holds DMA mapping information
>>> + * @sgt: Scatter-gather table
>>> + * @state: DMA IOVA state relevant in IOMMU-based DMA
>>> + * @size: Total size of DMA transfer
>>> + */
>>> +struct dma_buf_dma {
>>> + struct sg_table sgt;
>>> + struct dma_iova_state *state;
>>> + size_t size;
>>> +};
>>> +
>>> +/**
>>> + * dma_buf_map - Returns the scatterlist table of the attachment from arrays
>>> + * of physical vectors. This funciton is intended for MMIO memory only.
>>> + * @attach: [in] attachment whose scatterlist is to be returned
>>> + * @provider: [in] p2pdma provider
>>> + * @phys_vec: [in] array of physical vectors
>>> + * @nr_ranges: [in] number of entries in phys_vec array
>>> + * @size: [in] total size of phys_vec
>>> + * @dir: [in] direction of DMA transfer
>>> + *
>>> + * Returns sg_table containing the scatterlist to be returned; returns ERR_PTR
>>> + * on error. May return -EINTR if it is interrupted by a signal.
>>> + *
>>> + * On success, the DMA addresses and lengths in the returned scatterlist are
>>> + * PAGE_SIZE aligned.
>>> + *
>>> + * A mapping must be unmapped by using dma_buf_unmap().
>>> + */
>>> +struct sg_table *dma_buf_map(struct dma_buf_attachment *attach,
>>
>> That is clearly not a good name for this function. We already have overloaded the term *mapping* with something completely different.
>
> This function performs DMA mapping, so what name do you suggest instead of dma_buf_map()?
Something like dma_buf_phys_vec_to_sg_table(). I'm not good at naming either.
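For reference, such a rename would only change the name; the signature from the patch could stay as is, e.g. (name per the suggestion above, illustrative only):

	struct sg_table *
	dma_buf_phys_vec_to_sg_table(struct dma_buf_attachment *attach,
				     struct p2pdma_provider *provider,
				     struct dma_buf_phys_vec *phys_vec,
				     size_t nr_ranges, size_t size,
				     enum dma_data_direction dir);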
>
>>
>>> + struct p2pdma_provider *provider,
>>> + struct dma_buf_phys_vec *phys_vec,
>>> + size_t nr_ranges, size_t size,
>>> + enum dma_data_direction dir)
>>> +{
>>> + unsigned int nents, mapped_len = 0;
>>> + struct dma_buf_dma *dma;
>>> + struct scatterlist *sgl;
>>> + dma_addr_t addr;
>>> + size_t i;
>>> + int ret;
>>> +
>>> + dma_resv_assert_held(attach->dmabuf->resv);
>>> +
>>> + if (WARN_ON(!attach || !attach->dmabuf || !provider))
>>> + /* This function is supposed to work on MMIO memory only */
>>> + return ERR_PTR(-EINVAL);
>>> +
>>> + dma = kzalloc(sizeof(*dma), GFP_KERNEL);
>>> + if (!dma)
>>> + return ERR_PTR(-ENOMEM);
>>> +
>>> + switch (pci_p2pdma_map_type(provider, attach->dev)) {
>>> + case PCI_P2PDMA_MAP_BUS_ADDR:
>>> + /*
>>> + * There is no need in IOVA at all for this flow.
>>> + */
>>> + break;
>>> + case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
>>> + dma->state = kzalloc(sizeof(*dma->state), GFP_KERNEL);
>>> + if (!dma->state) {
>>> + ret = -ENOMEM;
>>> + goto err_free_dma;
>>> + }
>>> +
>>> + dma_iova_try_alloc(attach->dev, dma->state, 0, size);
>>
>> Oh, that is a clear no-go for the core DMA-buf code.
>>
>> It's intentionally up to the exporter how to create the DMA addresses the importer can work with.
>
> I didn't fully understand the email either. The importer needs to
> configure DMA and it supports only MMIO addresses. Exporter controls it
> by asking for peer2peer.
I misinterpreted the call to pci_p2pdma_map_type() here as meaning that the DMA-buf code now decides whether transactions go over the root complex or not.
But the exporter can call pci_p2pdma_map_type() even before calling this function, so that looks fine to me.
Regards,
Christian.
>
> Thanks
* Re: [Linaro-mm-sig] [PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine
2025-11-19 14:11 ` Christian König
@ 2025-11-19 14:50 ` Leon Romanovsky
2025-11-19 14:53 ` Christian König
2025-11-19 19:36 ` Jason Gunthorpe
1 sibling, 1 reply; 63+ messages in thread
From: Leon Romanovsky @ 2025-11-19 14:50 UTC (permalink / raw)
To: Christian König
Cc: Bjorn Helgaas, Logan Gunthorpe, Jens Axboe, Robin Murphy,
Joerg Roedel, Will Deacon, Marek Szyprowski, Jason Gunthorpe,
Andrew Morton, Jonathan Corbet, Sumit Semwal, Kees Cook,
Gustavo A. R. Silva, Ankit Agrawal, Yishai Hadas,
Shameer Kolothum, Kevin Tian, Alex Williamson, Krishnakant Jaju,
Matt Ochs, linux-pci, linux-kernel, linux-block, iommu, linux-mm,
linux-doc, linux-media, dri-devel, linaro-mm-sig, kvm,
linux-hardening, Alex Mastro, Nicolin Chen
On Wed, Nov 19, 2025 at 03:11:01PM +0100, Christian König wrote:
> On 11/19/25 14:42, Leon Romanovsky wrote:
> > On Wed, Nov 19, 2025 at 02:16:57PM +0100, Christian König wrote:
> >>
> >>
> >> On 11/11/25 10:57, Leon Romanovsky wrote:
> >>> From: Leon Romanovsky <leonro@nvidia.com>
> >>>
> >>> Add dma_buf_map() and dma_buf_unmap() helpers to convert an array of
> >>> MMIO physical address ranges into scatter-gather tables with proper
> >>> DMA mapping.
> >>>
> >>> These common functions are a starting point and support any PCI
> >>> drivers creating mappings from their BAR's MMIO addresses. VFIO is one
> >>> case, as shortly will be RDMA. We can review existing DRM drivers to
> >>> refactor them separately. We hope this will evolve into routines to
> >>> help common DRM that include mixed CPU and MMIO mappings.
> >>>
> >>> Compared to the dma_map_resource() abuse this implementation handles
> >>> the complicated PCI P2P scenarios properly, especially when an IOMMU
> >>> is enabled:
> >>>
> >>> - Direct bus address mapping without IOVA allocation for
> >>> PCI_P2PDMA_MAP_BUS_ADDR, using pci_p2pdma_bus_addr_map(). This
> >>> happens if the IOMMU is enabled but the PCIe switch ACS flags allow
> >>> transactions to avoid the host bridge.
> >>>
> >>> Further, this handles the slightly obscure case of MMIO with a
> >>> phys_addr_t that is different from the physical BAR programming
> >>> (bus offset). The phys_addr_t is converted to a dma_addr_t and
> >>> accommodates this effect. This enables certain real systems to
> >>> work, especially on ARM platforms.
> >>>
> >>> - Mapping through host bridge with IOVA allocation and DMA_ATTR_MMIO
> >>> attribute for MMIO memory regions (PCI_P2PDMA_MAP_THRU_HOST_BRIDGE).
> >>> This happens when the IOMMU is enabled and the ACS flags are forcing
> >>> all traffic to the IOMMU - ie for virtualization systems.
> >>>
> >>> - Cases where P2P is not supported through the host bridge/CPU. The
> >>> P2P subsystem is the proper place to detect this and block it.
> >>>
> >>> Helper functions fill_sg_entry() and calc_sg_nents() handle the
> >>> scatter-gather table construction, splitting large regions into
> >>> UINT_MAX-sized chunks to fit within sg->length field limits.
> >>>
> >>> Since the physical address based DMA API forbids use of the CPU list
> >>> of the scatterlist this will produce a mangled scatterlist that has
> >>> a fully zero-length and NULL'd CPU list. The list is 0 length,
> >>> all the struct page pointers are NULL and zero sized. This is stronger
> >>> and more robust than the existing mangle_sg_table() technique. It is
> >>> a future project to migrate DMABUF as a subsystem away from using
> >>> scatterlist for this data structure.
> >>>
> >>> Tested-by: Alex Mastro <amastro@fb.com>
> >>> Tested-by: Nicolin Chen <nicolinc@nvidia.com>
> >>> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> >>> ---
> >>> drivers/dma-buf/dma-buf.c | 235 ++++++++++++++++++++++++++++++++++++++++++++++
> >>> include/linux/dma-buf.h | 18 ++++
> >>> 2 files changed, 253 insertions(+)
> >>>
> >>> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> >>> index 2bcf9ceca997..cb55dff1dad5 100644
> >>> --- a/drivers/dma-buf/dma-buf.c
> >>> +++ b/drivers/dma-buf/dma-buf.c
> >>> @@ -1254,6 +1254,241 @@ void dma_buf_unmap_attachment_unlocked(struct dma_buf_attachment *attach,
> >>> }
> >>> EXPORT_SYMBOL_NS_GPL(dma_buf_unmap_attachment_unlocked, "DMA_BUF");
> >>>
> >>> +static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length,
> >>> + dma_addr_t addr)
> >>> +{
> >>> + unsigned int len, nents;
> >>> + int i;
> >>> +
> >>> + nents = DIV_ROUND_UP(length, UINT_MAX);
> >>> + for (i = 0; i < nents; i++) {
> >>> + len = min_t(size_t, length, UINT_MAX);
> >>> + length -= len;
> >>> + /*
> >>> + * DMABUF abuses scatterlist to create a scatterlist
> >>> + * that does not have any CPU list, only the DMA list.
> >>> + * Always set the page related values to NULL to ensure
> >>> + * importers can't use it. The phys_addr based DMA API
> >>> + * does not require the CPU list for mapping or unmapping.
> >>> + */
> >>> + sg_set_page(sgl, NULL, 0, 0);
> >>> + sg_dma_address(sgl) = addr + i * UINT_MAX;
> >>> + sg_dma_len(sgl) = len;
> >>> + sgl = sg_next(sgl);
> >>> + }
> >>> +
> >>> + return sgl;
> >>> +}
> >>> +
> >>> +static unsigned int calc_sg_nents(struct dma_iova_state *state,
> >>> + struct dma_buf_phys_vec *phys_vec,
> >>> + size_t nr_ranges, size_t size)
> >>> +{
> >>> + unsigned int nents = 0;
> >>> + size_t i;
> >>> +
> >>> + if (!state || !dma_use_iova(state)) {
> >>> + for (i = 0; i < nr_ranges; i++)
> >>> + nents += DIV_ROUND_UP(phys_vec[i].len, UINT_MAX);
> >>> + } else {
> >>> + /*
> >>> + * In IOVA case, there is only one SG entry which spans
> >>> + * for whole IOVA address space, but we need to make sure
> >>> + * that it fits sg->length, maybe we need more.
> >>> + */
> >>> + nents = DIV_ROUND_UP(size, UINT_MAX);
> >>> + }
> >>> +
> >>> + return nents;
> >>> +}
> >>> +
> >>> +/**
> >>> + * struct dma_buf_dma - holds DMA mapping information
> >>> + * @sgt: Scatter-gather table
> >>> + * @state: DMA IOVA state relevant in IOMMU-based DMA
> >>> + * @size: Total size of DMA transfer
> >>> + */
> >>> +struct dma_buf_dma {
> >>> + struct sg_table sgt;
> >>> + struct dma_iova_state *state;
> >>> + size_t size;
> >>> +};
> >>> +
> >>> +/**
> >>> + * dma_buf_map - Returns the scatterlist table of the attachment from arrays
> >>> + * of physical vectors. This function is intended for MMIO memory only.
> >>> + * @attach: [in] attachment whose scatterlist is to be returned
> >>> + * @provider: [in] p2pdma provider
> >>> + * @phys_vec: [in] array of physical vectors
> >>> + * @nr_ranges: [in] number of entries in phys_vec array
> >>> + * @size: [in] total size of phys_vec
> >>> + * @dir: [in] direction of DMA transfer
> >>> + *
> >>> + * Returns sg_table containing the scatterlist to be returned; returns ERR_PTR
> >>> + * on error. May return -EINTR if it is interrupted by a signal.
> >>> + *
> >>> + * On success, the DMA addresses and lengths in the returned scatterlist are
> >>> + * PAGE_SIZE aligned.
> >>> + *
> >>> + * A mapping must be unmapped by using dma_buf_unmap().
> >>> + */
> >>> +struct sg_table *dma_buf_map(struct dma_buf_attachment *attach,
> >>
> >> That is clearly not a good name for this function. We already have overloaded the term *mapping* with something completely different.
> >
> > This function performs DMA mapping, so what name do you suggest instead of dma_buf_map()?
>
> Something like dma_buf_phys_vec_to_sg_table(). I'm not good at naming either.
Can I call it simply dma_buf_mapping(), as I plan to put that function in a dma_buf_mapping.c
file per your request?
Regarding SG, the long term plan is to remove SG table completely, so at
least external users of DMABUF shouldn't be exposed to internal implementation
details (SG table).
Thanks
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [Linaro-mm-sig] [PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine
2025-11-19 14:50 ` Leon Romanovsky
@ 2025-11-19 14:53 ` Christian König
2025-11-19 15:41 ` Leon Romanovsky
2025-11-19 16:33 ` Leon Romanovsky
0 siblings, 2 replies; 63+ messages in thread
From: Christian König @ 2025-11-19 14:53 UTC (permalink / raw)
To: Leon Romanovsky
Cc: Bjorn Helgaas, Logan Gunthorpe, Jens Axboe, Robin Murphy,
Joerg Roedel, Will Deacon, Marek Szyprowski, Jason Gunthorpe,
Andrew Morton, Jonathan Corbet, Sumit Semwal, Kees Cook,
Gustavo A. R. Silva, Ankit Agrawal, Yishai Hadas,
Shameer Kolothum, Kevin Tian, Alex Williamson, Krishnakant Jaju,
Matt Ochs, linux-pci, linux-kernel, linux-block, iommu, linux-mm,
linux-doc, linux-media, dri-devel, linaro-mm-sig, kvm,
linux-hardening, Alex Mastro, Nicolin Chen
On 11/19/25 15:50, Leon Romanovsky wrote:
> On Wed, Nov 19, 2025 at 03:11:01PM +0100, Christian König wrote:
>> On 11/19/25 14:42, Leon Romanovsky wrote:
>>> On Wed, Nov 19, 2025 at 02:16:57PM +0100, Christian König wrote:
>>>>
>>>>
>>>> On 11/11/25 10:57, Leon Romanovsky wrote:
>>>>> From: Leon Romanovsky <leonro@nvidia.com>
>>>>>
>>>>> Add dma_buf_map() and dma_buf_unmap() helpers to convert an array of
>>>>> MMIO physical address ranges into scatter-gather tables with proper
>>>>> DMA mapping.
>>>>>
>>>>> These common functions are a starting point and support any PCI
>>>>> drivers creating mappings from their BAR's MMIO addresses. VFIO is one
>>>>> case, as shortly will be RDMA. We can review existing DRM drivers to
>>>>> refactor them separately. We hope this will evolve into routines to
>>>>> help common DRM that include mixed CPU and MMIO mappings.
>>>>>
>>>>> Compared to the dma_map_resource() abuse this implementation handles
>>>>> the complicated PCI P2P scenarios properly, especially when an IOMMU
>>>>> is enabled:
>>>>>
>>>>> - Direct bus address mapping without IOVA allocation for
>>>>> PCI_P2PDMA_MAP_BUS_ADDR, using pci_p2pdma_bus_addr_map(). This
>>>>> happens if the IOMMU is enabled but the PCIe switch ACS flags allow
>>>>> transactions to avoid the host bridge.
>>>>>
> >>>>> Further, this handles the slightly obscure case of MMIO with a
>>>>> phys_addr_t that is different from the physical BAR programming
>>>>> (bus offset). The phys_addr_t is converted to a dma_addr_t and
>>>>> accommodates this effect. This enables certain real systems to
>>>>> work, especially on ARM platforms.
>>>>>
>>>>> - Mapping through host bridge with IOVA allocation and DMA_ATTR_MMIO
>>>>> attribute for MMIO memory regions (PCI_P2PDMA_MAP_THRU_HOST_BRIDGE).
>>>>> This happens when the IOMMU is enabled and the ACS flags are forcing
>>>>> all traffic to the IOMMU - ie for virtualization systems.
>>>>>
>>>>> - Cases where P2P is not supported through the host bridge/CPU. The
>>>>> P2P subsystem is the proper place to detect this and block it.
>>>>>
>>>>> Helper functions fill_sg_entry() and calc_sg_nents() handle the
>>>>> scatter-gather table construction, splitting large regions into
>>>>> UINT_MAX-sized chunks to fit within sg->length field limits.
>>>>>
>>>>> Since the physical address based DMA API forbids use of the CPU list
>>>>> of the scatterlist this will produce a mangled scatterlist that has
>>>>> a fully zero-length and NULL'd CPU list. The list is 0 length,
>>>>> all the struct page pointers are NULL and zero sized. This is stronger
>>>>> and more robust than the existing mangle_sg_table() technique. It is
>>>>> a future project to migrate DMABUF as a subsystem away from using
>>>>> scatterlist for this data structure.
>>>>>
>>>>> Tested-by: Alex Mastro <amastro@fb.com>
>>>>> Tested-by: Nicolin Chen <nicolinc@nvidia.com>
>>>>> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
>>>>> ---
>>>>> drivers/dma-buf/dma-buf.c | 235 ++++++++++++++++++++++++++++++++++++++++++++++
>>>>> include/linux/dma-buf.h | 18 ++++
>>>>> 2 files changed, 253 insertions(+)
>>>>>
>>>>> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
>>>>> index 2bcf9ceca997..cb55dff1dad5 100644
>>>>> --- a/drivers/dma-buf/dma-buf.c
>>>>> +++ b/drivers/dma-buf/dma-buf.c
>>>>> @@ -1254,6 +1254,241 @@ void dma_buf_unmap_attachment_unlocked(struct dma_buf_attachment *attach,
>>>>> }
>>>>> EXPORT_SYMBOL_NS_GPL(dma_buf_unmap_attachment_unlocked, "DMA_BUF");
>>>>>
>>>>> +static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length,
>>>>> + dma_addr_t addr)
>>>>> +{
>>>>> + unsigned int len, nents;
>>>>> + int i;
>>>>> +
>>>>> + nents = DIV_ROUND_UP(length, UINT_MAX);
>>>>> + for (i = 0; i < nents; i++) {
>>>>> + len = min_t(size_t, length, UINT_MAX);
>>>>> + length -= len;
>>>>> + /*
>>>>> + * DMABUF abuses scatterlist to create a scatterlist
>>>>> + * that does not have any CPU list, only the DMA list.
>>>>> + * Always set the page related values to NULL to ensure
>>>>> + * importers can't use it. The phys_addr based DMA API
>>>>> + * does not require the CPU list for mapping or unmapping.
>>>>> + */
>>>>> + sg_set_page(sgl, NULL, 0, 0);
>>>>> + sg_dma_address(sgl) = addr + i * UINT_MAX;
>>>>> + sg_dma_len(sgl) = len;
>>>>> + sgl = sg_next(sgl);
>>>>> + }
>>>>> +
>>>>> + return sgl;
>>>>> +}
>>>>> +
>>>>> +static unsigned int calc_sg_nents(struct dma_iova_state *state,
>>>>> + struct dma_buf_phys_vec *phys_vec,
>>>>> + size_t nr_ranges, size_t size)
>>>>> +{
>>>>> + unsigned int nents = 0;
>>>>> + size_t i;
>>>>> +
>>>>> + if (!state || !dma_use_iova(state)) {
>>>>> + for (i = 0; i < nr_ranges; i++)
>>>>> + nents += DIV_ROUND_UP(phys_vec[i].len, UINT_MAX);
>>>>> + } else {
>>>>> + /*
>>>>> + * In IOVA case, there is only one SG entry which spans
>>>>> + * for whole IOVA address space, but we need to make sure
>>>>> + * that it fits sg->length, maybe we need more.
>>>>> + */
>>>>> + nents = DIV_ROUND_UP(size, UINT_MAX);
>>>>> + }
>>>>> +
>>>>> + return nents;
>>>>> +}
>>>>> +
>>>>> +/**
>>>>> + * struct dma_buf_dma - holds DMA mapping information
>>>>> + * @sgt: Scatter-gather table
>>>>> + * @state: DMA IOVA state relevant in IOMMU-based DMA
>>>>> + * @size: Total size of DMA transfer
>>>>> + */
>>>>> +struct dma_buf_dma {
>>>>> + struct sg_table sgt;
>>>>> + struct dma_iova_state *state;
>>>>> + size_t size;
>>>>> +};
>>>>> +
>>>>> +/**
>>>>> + * dma_buf_map - Returns the scatterlist table of the attachment from arrays
> >>>>> + * of physical vectors. This function is intended for MMIO memory only.
>>>>> + * @attach: [in] attachment whose scatterlist is to be returned
>>>>> + * @provider: [in] p2pdma provider
>>>>> + * @phys_vec: [in] array of physical vectors
>>>>> + * @nr_ranges: [in] number of entries in phys_vec array
>>>>> + * @size: [in] total size of phys_vec
>>>>> + * @dir: [in] direction of DMA transfer
>>>>> + *
>>>>> + * Returns sg_table containing the scatterlist to be returned; returns ERR_PTR
>>>>> + * on error. May return -EINTR if it is interrupted by a signal.
>>>>> + *
>>>>> + * On success, the DMA addresses and lengths in the returned scatterlist are
>>>>> + * PAGE_SIZE aligned.
>>>>> + *
>>>>> + * A mapping must be unmapped by using dma_buf_unmap().
>>>>> + */
>>>>> +struct sg_table *dma_buf_map(struct dma_buf_attachment *attach,
>>>>
>>>> That is clearly not a good name for this function. We already have overloaded the term *mapping* with something completely different.
>>>
>>> This function performs DMA mapping, so what name do you suggest instead of dma_buf_map()?
>>
>> Something like dma_buf_phys_vec_to_sg_table(). I'm not good at naming either.
>
> Can I call it simply dma_buf_mapping() as I plan to put that function in dma_buf_mapping.c
> file per-your request.
No, just completely drop the term "mapping" here. This is about phys_vector to sg_table conversion and nothing else.
That we create an IOVA mapping when the access needs to go through the root complex is an implementation detail.
>
> Regarding SG, the long term plan is to remove SG table completely, so at
> least external users of DMABUF shouldn't be exposed to internal implementation
> details (SG table).
Huh? Well, I suggested removing the sg_table, but that doesn't mean that implementations shouldn't be aware of it.
Regards,
Christian.
>
> Thanks
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [Linaro-mm-sig] [PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine
2025-11-19 14:53 ` Christian König
@ 2025-11-19 15:41 ` Leon Romanovsky
2025-11-19 16:33 ` Leon Romanovsky
1 sibling, 0 replies; 63+ messages in thread
From: Leon Romanovsky @ 2025-11-19 15:41 UTC (permalink / raw)
To: Christian König
Cc: Bjorn Helgaas, Logan Gunthorpe, Jens Axboe, Robin Murphy,
Joerg Roedel, Will Deacon, Marek Szyprowski, Jason Gunthorpe,
Andrew Morton, Jonathan Corbet, Sumit Semwal, Kees Cook,
Gustavo A. R. Silva, Ankit Agrawal, Yishai Hadas,
Shameer Kolothum, Kevin Tian, Alex Williamson, Krishnakant Jaju,
Matt Ochs, linux-pci, linux-kernel, linux-block, iommu, linux-mm,
linux-doc, linux-media, dri-devel, linaro-mm-sig, kvm,
linux-hardening, Alex Mastro, Nicolin Chen
On Wed, Nov 19, 2025 at 03:53:30PM +0100, Christian König wrote:
>
>
> On 11/19/25 15:50, Leon Romanovsky wrote:
> > On Wed, Nov 19, 2025 at 03:11:01PM +0100, Christian König wrote:
> >> On 11/19/25 14:42, Leon Romanovsky wrote:
> >>> On Wed, Nov 19, 2025 at 02:16:57PM +0100, Christian König wrote:
> >>>>
> >>>>
> >>>> On 11/11/25 10:57, Leon Romanovsky wrote:
> >>>>> From: Leon Romanovsky <leonro@nvidia.com>
> >>>>>
> >>>>> Add dma_buf_map() and dma_buf_unmap() helpers to convert an array of
> >>>>> MMIO physical address ranges into scatter-gather tables with proper
> >>>>> DMA mapping.
> >>>>>
> >>>>> These common functions are a starting point and support any PCI
> >>>>> drivers creating mappings from their BAR's MMIO addresses. VFIO is one
> >>>>> case, as shortly will be RDMA. We can review existing DRM drivers to
> >>>>> refactor them separately. We hope this will evolve into routines to
> >>>>> help common DRM that include mixed CPU and MMIO mappings.
> >>>>>
> >>>>> Compared to the dma_map_resource() abuse this implementation handles
> >>>>> the complicated PCI P2P scenarios properly, especially when an IOMMU
> >>>>> is enabled:
> >>>>>
> >>>>> - Direct bus address mapping without IOVA allocation for
> >>>>> PCI_P2PDMA_MAP_BUS_ADDR, using pci_p2pdma_bus_addr_map(). This
> >>>>> happens if the IOMMU is enabled but the PCIe switch ACS flags allow
> >>>>> transactions to avoid the host bridge.
> >>>>>
> >>>>> Further, this handles the slightly obscure case of MMIO with a
> >>>>> phys_addr_t that is different from the physical BAR programming
> >>>>> (bus offset). The phys_addr_t is converted to a dma_addr_t and
> >>>>> accommodates this effect. This enables certain real systems to
> >>>>> work, especially on ARM platforms.
> >>>>>
> >>>>> - Mapping through host bridge with IOVA allocation and DMA_ATTR_MMIO
> >>>>> attribute for MMIO memory regions (PCI_P2PDMA_MAP_THRU_HOST_BRIDGE).
> >>>>> This happens when the IOMMU is enabled and the ACS flags are forcing
> >>>>> all traffic to the IOMMU - ie for virtualization systems.
> >>>>>
> >>>>> - Cases where P2P is not supported through the host bridge/CPU. The
> >>>>> P2P subsystem is the proper place to detect this and block it.
> >>>>>
> >>>>> Helper functions fill_sg_entry() and calc_sg_nents() handle the
> >>>>> scatter-gather table construction, splitting large regions into
> >>>>> UINT_MAX-sized chunks to fit within sg->length field limits.
> >>>>>
> >>>>> Since the physical address based DMA API forbids use of the CPU list
> >>>>> of the scatterlist this will produce a mangled scatterlist that has
> >>>>> a fully zero-length and NULL'd CPU list. The list is 0 length,
> >>>>> all the struct page pointers are NULL and zero sized. This is stronger
> >>>>> and more robust than the existing mangle_sg_table() technique. It is
> >>>>> a future project to migrate DMABUF as a subsystem away from using
> >>>>> scatterlist for this data structure.
> >>>>>
> >>>>> Tested-by: Alex Mastro <amastro@fb.com>
> >>>>> Tested-by: Nicolin Chen <nicolinc@nvidia.com>
> >>>>> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
> >>>>> ---
> >>>>> drivers/dma-buf/dma-buf.c | 235 ++++++++++++++++++++++++++++++++++++++++++++++
> >>>>> include/linux/dma-buf.h | 18 ++++
> >>>>> 2 files changed, 253 insertions(+)
> >>>>>
> >>>>> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> >>>>> index 2bcf9ceca997..cb55dff1dad5 100644
> >>>>> --- a/drivers/dma-buf/dma-buf.c
> >>>>> +++ b/drivers/dma-buf/dma-buf.c
> >>>>> @@ -1254,6 +1254,241 @@ void dma_buf_unmap_attachment_unlocked(struct dma_buf_attachment *attach,
> >>>>> }
> >>>>> EXPORT_SYMBOL_NS_GPL(dma_buf_unmap_attachment_unlocked, "DMA_BUF");
> >>>>>
> >>>>> +static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length,
> >>>>> + dma_addr_t addr)
> >>>>> +{
> >>>>> + unsigned int len, nents;
> >>>>> + int i;
> >>>>> +
> >>>>> + nents = DIV_ROUND_UP(length, UINT_MAX);
> >>>>> + for (i = 0; i < nents; i++) {
> >>>>> + len = min_t(size_t, length, UINT_MAX);
> >>>>> + length -= len;
> >>>>> + /*
> >>>>> + * DMABUF abuses scatterlist to create a scatterlist
> >>>>> + * that does not have any CPU list, only the DMA list.
> >>>>> + * Always set the page related values to NULL to ensure
> >>>>> + * importers can't use it. The phys_addr based DMA API
> >>>>> + * does not require the CPU list for mapping or unmapping.
> >>>>> + */
> >>>>> + sg_set_page(sgl, NULL, 0, 0);
> >>>>> + sg_dma_address(sgl) = addr + i * UINT_MAX;
> >>>>> + sg_dma_len(sgl) = len;
> >>>>> + sgl = sg_next(sgl);
> >>>>> + }
> >>>>> +
> >>>>> + return sgl;
> >>>>> +}
> >>>>> +
> >>>>> +static unsigned int calc_sg_nents(struct dma_iova_state *state,
> >>>>> + struct dma_buf_phys_vec *phys_vec,
> >>>>> + size_t nr_ranges, size_t size)
> >>>>> +{
> >>>>> + unsigned int nents = 0;
> >>>>> + size_t i;
> >>>>> +
> >>>>> + if (!state || !dma_use_iova(state)) {
> >>>>> + for (i = 0; i < nr_ranges; i++)
> >>>>> + nents += DIV_ROUND_UP(phys_vec[i].len, UINT_MAX);
> >>>>> + } else {
> >>>>> + /*
> >>>>> + * In IOVA case, there is only one SG entry which spans
> >>>>> + * for whole IOVA address space, but we need to make sure
> >>>>> + * that it fits sg->length, maybe we need more.
> >>>>> + */
> >>>>> + nents = DIV_ROUND_UP(size, UINT_MAX);
> >>>>> + }
> >>>>> +
> >>>>> + return nents;
> >>>>> +}
> >>>>> +
> >>>>> +/**
> >>>>> + * struct dma_buf_dma - holds DMA mapping information
> >>>>> + * @sgt: Scatter-gather table
> >>>>> + * @state: DMA IOVA state relevant in IOMMU-based DMA
> >>>>> + * @size: Total size of DMA transfer
> >>>>> + */
> >>>>> +struct dma_buf_dma {
> >>>>> + struct sg_table sgt;
> >>>>> + struct dma_iova_state *state;
> >>>>> + size_t size;
> >>>>> +};
> >>>>> +
> >>>>> +/**
> >>>>> + * dma_buf_map - Returns the scatterlist table of the attachment from arrays
> >>>>> + * of physical vectors. This function is intended for MMIO memory only.
> >>>>> + * @attach: [in] attachment whose scatterlist is to be returned
> >>>>> + * @provider: [in] p2pdma provider
> >>>>> + * @phys_vec: [in] array of physical vectors
> >>>>> + * @nr_ranges: [in] number of entries in phys_vec array
> >>>>> + * @size: [in] total size of phys_vec
> >>>>> + * @dir: [in] direction of DMA transfer
> >>>>> + *
> >>>>> + * Returns sg_table containing the scatterlist to be returned; returns ERR_PTR
> >>>>> + * on error. May return -EINTR if it is interrupted by a signal.
> >>>>> + *
> >>>>> + * On success, the DMA addresses and lengths in the returned scatterlist are
> >>>>> + * PAGE_SIZE aligned.
> >>>>> + *
> >>>>> + * A mapping must be unmapped by using dma_buf_unmap().
> >>>>> + */
> >>>>> +struct sg_table *dma_buf_map(struct dma_buf_attachment *attach,
> >>>>
> >>>> That is clearly not a good name for this function. We already have overloaded the term *mapping* with something completely different.
> >>>
> >>> This function performs DMA mapping, so what name do you suggest instead of dma_buf_map()?
> >>
> >> Something like dma_buf_phys_vec_to_sg_table(). I'm not good at naming either.
> >
> > Can I call it simply dma_buf_mapping() as I plan to put that function in dma_buf_mapping.c
> > file per-your request.
>
> No, just completely drop the term "mapping" here. This is about phys_vector to sg_table conversion and nothing else.
We have both map and unmap, so a dma_buf_*_to_*() name would only be applicable to dma_buf_map().
And it is not a simple conversion; most of the logic actually handles mapping:
137 for (i = 0; i < nr_ranges; i++) {
138 if (!dma->state) {
139 addr = pci_p2pdma_bus_addr_map(provider,
140 phys_vec[i].paddr);
141 } else if (dma_use_iova(dma->state)) {
142 ret = dma_iova_link(attach->dev, dma->state,
143 phys_vec[i].paddr, 0,
144 phys_vec[i].len, dir,
145 DMA_ATTR_MMIO);
146 if (ret)
147 goto err_unmap_dma;
148
149 mapped_len += phys_vec[i].len;
150 } else {
151 addr = dma_map_phys(attach->dev, phys_vec[i].paddr,
152 phys_vec[i].len, dir,
153 DMA_ATTR_MMIO);
154 ret = dma_mapping_error(attach->dev, addr);
155 if (ret)
156 goto err_unmap_dma;
157 }
158
159 if (!dma->state || !dma_use_iova(dma->state))
160 sgl = fill_sg_entry(sgl, phys_vec[i].len, addr);
161 }
162
163 if (dma->state && dma_use_iova(dma->state)) {
164 WARN_ON_ONCE(mapped_len != size);
165 ret = dma_iova_sync(attach->dev, dma->state, 0, mapped_len);
166 if (ret)
167 goto err_unmap_dma;
168
169 sgl = fill_sg_entry(sgl, mapped_len, dma->state->addr);
170 }
The SG table conversion is only two lines (160 and 169), which are here
because of DMABUF's dependency on SG.
What about dma_buf_phys_vec_mapping()/dma_buf_phys_vec_unmapping()?
>
> That we create an IOVA mapping when the access needs to go through the root complex is an implementation detail.
>
> >
> > Regarding SG, the long term plan is to remove SG table completely, so at
> > least external users of DMABUF shouldn't be exposed to internal implementation
> > details (SG table).
>
> Hui? Well I suggested to remove the sg_table, but that doesn't mean that implementations shouldn't be aware of that.
VFIO is the first user of this interface. It doesn't care how
DMABUF internally handles the array of phys_vecs. Today it is an sg_table;
tomorrow it will be something else.
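For readers following the thread, the only thing the importer hands over is the
physical range array. A plausible shape of that structure, inferred from the
field accesses in the excerpt above (the exact definition lives in the patch's
dma-buf.h hunk, which is not quoted here), is:

	struct dma_buf_phys_vec {
		phys_addr_t paddr;	/* start of one MMIO range */
		size_t len;		/* length of the range in bytes */
	};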
Thanks
>
> Regards,
> Christian.
>
> >
> > Thanks
>
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [Linaro-mm-sig] [PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine
2025-11-19 14:53 ` Christian König
2025-11-19 15:41 ` Leon Romanovsky
@ 2025-11-19 16:33 ` Leon Romanovsky
2025-11-20 7:03 ` Christian König
1 sibling, 1 reply; 63+ messages in thread
From: Leon Romanovsky @ 2025-11-19 16:33 UTC (permalink / raw)
To: Christian König
Cc: Bjorn Helgaas, Logan Gunthorpe, Jens Axboe, Robin Murphy,
Joerg Roedel, Will Deacon, Marek Szyprowski, Jason Gunthorpe,
Andrew Morton, Jonathan Corbet, Sumit Semwal, Kees Cook,
Gustavo A. R. Silva, Ankit Agrawal, Yishai Hadas,
Shameer Kolothum, Kevin Tian, Alex Williamson, Krishnakant Jaju,
Matt Ochs, linux-pci, linux-kernel, linux-block, iommu, linux-mm,
linux-doc, linux-media, dri-devel, linaro-mm-sig, kvm,
linux-hardening, Alex Mastro, Nicolin Chen
On Wed, Nov 19, 2025 at 03:53:30PM +0100, Christian König wrote:
<...>
> >>>>> +struct sg_table *dma_buf_map(struct dma_buf_attachment *attach,
> >>>>
> >>>> That is clearly not a good name for this function. We already have overloaded the term *mapping* with something completely different.
> >>>
> >>> This function performs DMA mapping, so what name do you suggest instead of dma_buf_map()?
> >>
> >> Something like dma_buf_phys_vec_to_sg_table(). I'm not good at naming either.
> >
> > Can I call it simply dma_buf_mapping() as I plan to put that function in dma_buf_mapping.c
> > file per-your request.
>
> No, just completely drop the term "mapping" here. This is about phys_vector to sg_table conversion and nothing else.
In order to make progress, I renamed these functions to
dma_buf_phys_vec_to_sgt() and dma_buf_free_sgt(), and put everything in a dma_buf_mapping.c file.
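To make the renamed interface concrete, here is a minimal sketch of how an
exporter could use it from its attachment map/unmap paths. It assumes the
dma_buf_map() signature shown earlier carries over unchanged to
dma_buf_phys_vec_to_sgt(); the dma_buf_free_sgt() arguments and the
struct my_exporter fields are hypothetical:

	struct my_exporter {
		struct p2pdma_provider *provider;
		struct dma_buf_phys_vec *phys_vec;	/* MMIO ranges of the BAR slice */
		size_t nr_ranges;
		size_t size;				/* sum of phys_vec[i].len */
	};

	static struct sg_table *my_map_dma_buf(struct dma_buf_attachment *attach,
					       enum dma_data_direction dir)
	{
		struct my_exporter *priv = attach->dmabuf->priv;

		/* Convert the MMIO phys_vec array into a DMA-mapped sg_table. */
		return dma_buf_phys_vec_to_sgt(attach, priv->provider,
					       priv->phys_vec, priv->nr_ranges,
					       priv->size, dir);
	}

	static void my_unmap_dma_buf(struct dma_buf_attachment *attach,
				     struct sg_table *sgt,
				     enum dma_data_direction dir)
	{
		/* Undo the mapping created above (assumed counterpart). */
		dma_buf_free_sgt(attach, sgt, dir);
	}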
Thanks
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [Linaro-mm-sig] [PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine
2025-11-19 16:33 ` Leon Romanovsky
@ 2025-11-20 7:03 ` Christian König
2025-11-20 7:38 ` Leon Romanovsky
0 siblings, 1 reply; 63+ messages in thread
From: Christian König @ 2025-11-20 7:03 UTC (permalink / raw)
To: Leon Romanovsky
Cc: Bjorn Helgaas, Logan Gunthorpe, Jens Axboe, Robin Murphy,
Joerg Roedel, Will Deacon, Marek Szyprowski, Jason Gunthorpe,
Andrew Morton, Jonathan Corbet, Sumit Semwal, Kees Cook,
Gustavo A. R. Silva, Ankit Agrawal, Yishai Hadas,
Shameer Kolothum, Kevin Tian, Alex Williamson, Krishnakant Jaju,
Matt Ochs, linux-pci, linux-kernel, linux-block, iommu, linux-mm,
linux-doc, linux-media, dri-devel, linaro-mm-sig, kvm,
linux-hardening, Alex Mastro, Nicolin Chen
On 11/19/25 17:33, Leon Romanovsky wrote:
> On Wed, Nov 19, 2025 at 03:53:30PM +0100, Christian König wrote:
>
> <...>
>
>>>>>>> +struct sg_table *dma_buf_map(struct dma_buf_attachment *attach,
>>>>>>
>>>>>> That is clearly not a good name for this function. We already have overloaded the term *mapping* with something completely different.
>>>>>
>>>>> This function performs DMA mapping, so what name do you suggest instead of dma_buf_map()?
>>>>
>>>> Something like dma_buf_phys_vec_to_sg_table(). I'm not good at naming either.
>>>
>>> Can I call it simply dma_buf_mapping() as I plan to put that function in dma_buf_mapping.c
>>> file per-your request.
>>
>> No, just completely drop the term "mapping" here. This is about phys_vector to sg_table conversion and nothing else.
>
> In order to progress, I renamed these functions to be
> dma_buf_phys_vec_to_sgt() and dma_buf_free_sgt(), and put everything in dma_buf_mapping.c file.
Yeah, the problem is that I thought about it some more and came to the conclusion that this is still not sufficient for an R-b or an Acked-by.
A core concept of DMA-buf is that the exporter takes care of all the mappings and not the framework.
Calling pci_p2pdma_bus_addr_map() or dma_map_phys() from DMA-buf code is extremely questionable.
That should really be inside VFIO and not in the DMA-buf code, so to move forward I strongly suggest either moving that into VFIO or into the DMA API directly.
Regards,
Christian.
>
> Thanks
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [Linaro-mm-sig] [PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine
2025-11-20 7:03 ` Christian König
@ 2025-11-20 7:38 ` Leon Romanovsky
0 siblings, 0 replies; 63+ messages in thread
From: Leon Romanovsky @ 2025-11-20 7:38 UTC (permalink / raw)
To: Christian König
Cc: Bjorn Helgaas, Logan Gunthorpe, Jens Axboe, Robin Murphy,
Joerg Roedel, Will Deacon, Marek Szyprowski, Jason Gunthorpe,
Andrew Morton, Jonathan Corbet, Sumit Semwal, Kees Cook,
Gustavo A. R. Silva, Ankit Agrawal, Yishai Hadas,
Shameer Kolothum, Kevin Tian, Alex Williamson, Krishnakant Jaju,
Matt Ochs, linux-pci, linux-kernel, linux-block, iommu, linux-mm,
linux-doc, linux-media, dri-devel, linaro-mm-sig, kvm,
linux-hardening, Alex Mastro, Nicolin Chen
On Thu, Nov 20, 2025 at 08:03:09AM +0100, Christian König wrote:
> On 11/19/25 17:33, Leon Romanovsky wrote:
> > On Wed, Nov 19, 2025 at 03:53:30PM +0100, Christian König wrote:
> >
> > <...>
> >
> >>>>>>> +struct sg_table *dma_buf_map(struct dma_buf_attachment *attach,
> >>>>>>
> >>>>>> That is clearly not a good name for this function. We already have overloaded the term *mapping* with something completely different.
> >>>>>
> >>>>> This function performs DMA mapping, so what name do you suggest instead of dma_buf_map()?
> >>>>
> >>>> Something like dma_buf_phys_vec_to_sg_table(). I'm not good at naming either.
> >>>
> >>> Can I call it simply dma_buf_mapping() as I plan to put that function in dma_buf_mapping.c
> >>> file per-your request.
> >>
> >> No, just completely drop the term "mapping" here. This is about phys_vector to sg_table conversion and nothing else.
> >
> > In order to progress, I renamed these functions to be
> > dma_buf_phys_vec_to_sgt() and dma_buf_free_sgt(), and put everything in dma_buf_mapping.c file.
>
> Yeah, the problem is I even thought more about it and came to the conclusion that this is still not sufficient for an rb or an Ack-by.
>
> A core concept of DMA-buf is that the exporter takes care of all the mappings and not the framework.
>
> Calling pci_p2pdma_bus_addr_map(), dma_map_phys() or dma_map_phys() from DMA-buf code is extremely questionable.
>
> That should really be inside VFIO and not DMA-buf code, so to move forward I strongly suggest to either move that into VFIO or the DMA API directly.
We got the request to move this into DMABUF, and agreement on it, a long time ago, back in v5.
https://lore.kernel.org/all/aPYrEroyWVOvAu-5@infradead.org/
Thanks
>
> Regards,
> Christian.
>
> >
> > Thanks
>
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [Linaro-mm-sig] [PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine
2025-11-19 14:11 ` Christian König
2025-11-19 14:50 ` Leon Romanovsky
@ 2025-11-19 19:36 ` Jason Gunthorpe
1 sibling, 0 replies; 63+ messages in thread
From: Jason Gunthorpe @ 2025-11-19 19:36 UTC (permalink / raw)
To: Christian König
Cc: Leon Romanovsky, Bjorn Helgaas, Logan Gunthorpe, Jens Axboe,
Robin Murphy, Joerg Roedel, Will Deacon, Marek Szyprowski,
Andrew Morton, Jonathan Corbet, Sumit Semwal, Kees Cook,
Gustavo A. R. Silva, Ankit Agrawal, Yishai Hadas,
Shameer Kolothum, Kevin Tian, Alex Williamson, Krishnakant Jaju,
Matt Ochs, linux-pci, linux-kernel, linux-block, iommu, linux-mm,
linux-doc, linux-media, dri-devel, linaro-mm-sig, kvm,
linux-hardening, Alex Mastro, Nicolin Chen
On Wed, Nov 19, 2025 at 03:11:01PM +0100, Christian König wrote:
> I miss interpreted the call to pci_p2pdma_map_type() here in that
> now the DMA-buf code decides if transactions go over the root
> complex or not.
Oh, that's not it at all. I think you get it, but just to be really
clear:
This code is taking a physical address from the exporter and
determining how it MUST route inside the fabric. There is only one
single choice with no optionality.
The exporter already decided if it will go over the host bridge by
providing an address that must use a host bridge path.
> But the exporter can call pci_p2pdma_map_type() even before calling
> this function, so that looks fine to me.
Yes, the exporter needs to decide where the data is placed before it
tries to map it into the SGT.
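To restate that in code: the routing is fully determined by the provider and
the importing device. A condensed sketch of that decision, folded together from
the hunks quoted earlier in the thread (the error path for the unsupported case
is an assumption, not a quote of the patch):

	switch (pci_p2pdma_map_type(provider, attach->dev)) {
	case PCI_P2PDMA_MAP_BUS_ADDR:
		/* Traffic stays below the host bridge: translate the
		 * exporter's phys_addr straight to a bus address, no IOVA. */
		addr = pci_p2pdma_bus_addr_map(provider, phys_vec[i].paddr);
		break;
	case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
		/* Traffic is forced through the host bridge/IOMMU: link the
		 * MMIO range into an IOVA (or use dma_map_phys()) with
		 * DMA_ATTR_MMIO. */
		ret = dma_iova_link(attach->dev, dma->state,
				    phys_vec[i].paddr, 0, phys_vec[i].len,
				    dir, DMA_ATTR_MMIO);
		break;
	default:
		/* P2P between these two devices is not possible at all. */
		return ERR_PTR(-EINVAL);
	}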
Jason
^ permalink raw reply [flat|nested] 63+ messages in thread
* [PATCH v8 07/11] vfio: Export vfio device get and put registration helpers
2025-11-11 9:57 [PATCH v8 00/11] vfio/pci: Allow MMIO regions to be exported through dma-buf Leon Romanovsky
` (5 preceding siblings ...)
2025-11-11 9:57 ` [PATCH v8 06/11] dma-buf: provide phys_vec to scatter-gather mapping routine Leon Romanovsky
@ 2025-11-11 9:57 ` Leon Romanovsky
2025-11-18 7:10 ` Tian, Kevin
2025-11-11 9:57 ` [PATCH v8 08/11] vfio/pci: Share the core device pointer while invoking feature functions Leon Romanovsky
` (3 subsequent siblings)
10 siblings, 1 reply; 63+ messages in thread
From: Leon Romanovsky @ 2025-11-11 9:57 UTC (permalink / raw)
To: Bjorn Helgaas, Logan Gunthorpe, Jens Axboe, Robin Murphy,
Joerg Roedel, Will Deacon, Marek Szyprowski, Jason Gunthorpe,
Leon Romanovsky, Andrew Morton, Jonathan Corbet, Sumit Semwal,
Christian König, Kees Cook, Gustavo A. R. Silva,
Ankit Agrawal, Yishai Hadas, Shameer Kolothum, Kevin Tian,
Alex Williamson
Cc: Krishnakant Jaju, Matt Ochs, linux-pci, linux-kernel, linux-block,
iommu, linux-mm, linux-doc, linux-media, dri-devel, linaro-mm-sig,
kvm, linux-hardening, Vivek Kasireddy, Alex Mastro, Nicolin Chen,
Jason Gunthorpe
From: Vivek Kasireddy <vivek.kasireddy@intel.com>
These helpers are useful for managing additional references taken
on the device from other associated VFIO modules.
Original-patch-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
Tested-by: Alex Mastro <amastro@fb.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
drivers/vfio/vfio_main.c | 2 ++
include/linux/vfio.h | 2 ++
2 files changed, 4 insertions(+)
diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index 38c8e9350a60..9aa4a5d081e8 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -172,11 +172,13 @@ void vfio_device_put_registration(struct vfio_device *device)
if (refcount_dec_and_test(&device->refcount))
complete(&device->comp);
}
+EXPORT_SYMBOL_GPL(vfio_device_put_registration);
bool vfio_device_try_get_registration(struct vfio_device *device)
{
return refcount_inc_not_zero(&device->refcount);
}
+EXPORT_SYMBOL_GPL(vfio_device_try_get_registration);
/*
* VFIO driver API
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index eb563f538dee..217ba4ef1752 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -297,6 +297,8 @@ static inline void vfio_put_device(struct vfio_device *device)
int vfio_register_group_dev(struct vfio_device *device);
int vfio_register_emulated_iommu_dev(struct vfio_device *device);
void vfio_unregister_group_dev(struct vfio_device *device);
+bool vfio_device_try_get_registration(struct vfio_device *device);
+void vfio_device_put_registration(struct vfio_device *device);
int vfio_assign_device_set(struct vfio_device *device, void *set_id);
unsigned int vfio_device_set_open_count(struct vfio_device_set *dev_set);
--
2.51.1
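For illustration only, the usage pattern these exports enable in an associated
module looks roughly as follows; my_export()/my_release() and the surrounding
context are hypothetical, only the two helper signatures come from the patch:

	static int my_export(struct vfio_device *device)
	{
		/* Pin the registration so the vfio_device cannot complete
		 * unregistration while the exported object still exists. */
		if (!vfio_device_try_get_registration(device))
			return -ENODEV;	/* device is going away */

		/* ... create the object (e.g. a dma-buf) holding the ref ... */
		return 0;
	}

	static void my_release(struct vfio_device *device)
	{
		/* Drop the reference when the object is destroyed, letting
		 * vfio_unregister_group_dev() make progress. */
		vfio_device_put_registration(device);
	}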
^ permalink raw reply related [flat|nested] 63+ messages in thread
* RE: [PATCH v8 07/11] vfio: Export vfio device get and put registration helpers
2025-11-11 9:57 ` [PATCH v8 07/11] vfio: Export vfio device get and put registration helpers Leon Romanovsky
@ 2025-11-18 7:10 ` Tian, Kevin
0 siblings, 0 replies; 63+ messages in thread
From: Tian, Kevin @ 2025-11-18 7:10 UTC (permalink / raw)
To: Leon Romanovsky, Bjorn Helgaas, Logan Gunthorpe, Jens Axboe,
Robin Murphy, Joerg Roedel, Will Deacon, Marek Szyprowski,
Jason Gunthorpe, Andrew Morton, Jonathan Corbet, Sumit Semwal,
Christian König, Kees Cook, Gustavo A. R. Silva,
Ankit Agrawal, Yishai Hadas, Shameer Kolothum, Alex Williamson
Cc: Krishnakant Jaju, Matt Ochs, linux-pci@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
iommu@lists.linux.dev, linux-mm@kvack.org,
linux-doc@vger.kernel.org, linux-media@vger.kernel.org,
dri-devel@lists.freedesktop.org, linaro-mm-sig@lists.linaro.org,
kvm@vger.kernel.org, linux-hardening@vger.kernel.org,
Kasireddy, Vivek, Alex Mastro, Nicolin Chen, Jason Gunthorpe
> From: Leon Romanovsky <leon@kernel.org>
> Sent: Tuesday, November 11, 2025 5:58 PM
>
> From: Vivek Kasireddy <vivek.kasireddy@intel.com>
>
> These helpers are useful for managing additional references taken
> on the device from other associated VFIO modules.
>
> Original-patch-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
> Tested-by: Alex Mastro <amastro@fb.com>
> Tested-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
^ permalink raw reply [flat|nested] 63+ messages in thread
* [PATCH v8 08/11] vfio/pci: Share the core device pointer while invoking feature functions
2025-11-11 9:57 [PATCH v8 00/11] vfio/pci: Allow MMIO regions to be exported through dma-buf Leon Romanovsky
` (6 preceding siblings ...)
2025-11-11 9:57 ` [PATCH v8 07/11] vfio: Export vfio device get and put registration helpers Leon Romanovsky
@ 2025-11-11 9:57 ` Leon Romanovsky
2025-11-18 7:11 ` Tian, Kevin
2025-11-11 9:57 ` [PATCH v8 09/11] vfio/pci: Enable peer-to-peer DMA transactions by default Leon Romanovsky
` (2 subsequent siblings)
10 siblings, 1 reply; 63+ messages in thread
From: Leon Romanovsky @ 2025-11-11 9:57 UTC (permalink / raw)
To: Bjorn Helgaas, Logan Gunthorpe, Jens Axboe, Robin Murphy,
Joerg Roedel, Will Deacon, Marek Szyprowski, Jason Gunthorpe,
Leon Romanovsky, Andrew Morton, Jonathan Corbet, Sumit Semwal,
Christian König, Kees Cook, Gustavo A. R. Silva,
Ankit Agrawal, Yishai Hadas, Shameer Kolothum, Kevin Tian,
Alex Williamson
Cc: Krishnakant Jaju, Matt Ochs, linux-pci, linux-kernel, linux-block,
iommu, linux-mm, linux-doc, linux-media, dri-devel, linaro-mm-sig,
kvm, linux-hardening, Vivek Kasireddy, Alex Mastro, Nicolin Chen
From: Vivek Kasireddy <vivek.kasireddy@intel.com>
There is no need to share the main device pointer (struct vfio_device *)
with all the feature functions as they only need the core device
pointer. Therefore, extract the core device pointer once in the
caller (vfio_pci_core_ioctl_feature) and share it instead.
Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
Tested-by: Alex Mastro <amastro@fb.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
drivers/vfio/pci/vfio_pci_core.c | 30 +++++++++++++-----------------
1 file changed, 13 insertions(+), 17 deletions(-)
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 7dcf5439dedc..ca9a95716a85 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -299,11 +299,9 @@ static int vfio_pci_runtime_pm_entry(struct vfio_pci_core_device *vdev,
return 0;
}
-static int vfio_pci_core_pm_entry(struct vfio_device *device, u32 flags,
+static int vfio_pci_core_pm_entry(struct vfio_pci_core_device *vdev, u32 flags,
void __user *arg, size_t argsz)
{
- struct vfio_pci_core_device *vdev =
- container_of(device, struct vfio_pci_core_device, vdev);
int ret;
ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_SET, 0);
@@ -320,12 +318,10 @@ static int vfio_pci_core_pm_entry(struct vfio_device *device, u32 flags,
}
static int vfio_pci_core_pm_entry_with_wakeup(
- struct vfio_device *device, u32 flags,
+ struct vfio_pci_core_device *vdev, u32 flags,
struct vfio_device_low_power_entry_with_wakeup __user *arg,
size_t argsz)
{
- struct vfio_pci_core_device *vdev =
- container_of(device, struct vfio_pci_core_device, vdev);
struct vfio_device_low_power_entry_with_wakeup entry;
struct eventfd_ctx *efdctx;
int ret;
@@ -376,11 +372,9 @@ static void vfio_pci_runtime_pm_exit(struct vfio_pci_core_device *vdev)
up_write(&vdev->memory_lock);
}
-static int vfio_pci_core_pm_exit(struct vfio_device *device, u32 flags,
+static int vfio_pci_core_pm_exit(struct vfio_pci_core_device *vdev, u32 flags,
void __user *arg, size_t argsz)
{
- struct vfio_pci_core_device *vdev =
- container_of(device, struct vfio_pci_core_device, vdev);
int ret;
ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_SET, 0);
@@ -1473,11 +1467,10 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
}
EXPORT_SYMBOL_GPL(vfio_pci_core_ioctl);
-static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags,
- uuid_t __user *arg, size_t argsz)
+static int vfio_pci_core_feature_token(struct vfio_pci_core_device *vdev,
+ u32 flags, uuid_t __user *arg,
+ size_t argsz)
{
- struct vfio_pci_core_device *vdev =
- container_of(device, struct vfio_pci_core_device, vdev);
uuid_t uuid;
int ret;
@@ -1504,16 +1497,19 @@ static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags,
int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
void __user *arg, size_t argsz)
{
+ struct vfio_pci_core_device *vdev =
+ container_of(device, struct vfio_pci_core_device, vdev);
+
switch (flags & VFIO_DEVICE_FEATURE_MASK) {
case VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY:
- return vfio_pci_core_pm_entry(device, flags, arg, argsz);
+ return vfio_pci_core_pm_entry(vdev, flags, arg, argsz);
case VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP:
- return vfio_pci_core_pm_entry_with_wakeup(device, flags,
+ return vfio_pci_core_pm_entry_with_wakeup(vdev, flags,
arg, argsz);
case VFIO_DEVICE_FEATURE_LOW_POWER_EXIT:
- return vfio_pci_core_pm_exit(device, flags, arg, argsz);
+ return vfio_pci_core_pm_exit(vdev, flags, arg, argsz);
case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN:
- return vfio_pci_core_feature_token(device, flags, arg, argsz);
+ return vfio_pci_core_feature_token(vdev, flags, arg, argsz);
default:
return -ENOTTY;
}
--
2.51.1
^ permalink raw reply related [flat|nested] 63+ messages in thread
* RE: [PATCH v8 08/11] vfio/pci: Share the core device pointer while invoking feature functions
2025-11-11 9:57 ` [PATCH v8 08/11] vfio/pci: Share the core device pointer while invoking feature functions Leon Romanovsky
@ 2025-11-18 7:11 ` Tian, Kevin
0 siblings, 0 replies; 63+ messages in thread
From: Tian, Kevin @ 2025-11-18 7:11 UTC (permalink / raw)
To: Leon Romanovsky, Bjorn Helgaas, Logan Gunthorpe, Jens Axboe,
Robin Murphy, Joerg Roedel, Will Deacon, Marek Szyprowski,
Jason Gunthorpe, Andrew Morton, Jonathan Corbet, Sumit Semwal,
Christian König, Kees Cook, Gustavo A. R. Silva,
Ankit Agrawal, Yishai Hadas, Shameer Kolothum, Alex Williamson
Cc: Krishnakant Jaju, Matt Ochs, linux-pci@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
iommu@lists.linux.dev, linux-mm@kvack.org,
linux-doc@vger.kernel.org, linux-media@vger.kernel.org,
dri-devel@lists.freedesktop.org, linaro-mm-sig@lists.linaro.org,
kvm@vger.kernel.org, linux-hardening@vger.kernel.org,
Kasireddy, Vivek, Alex Mastro, Nicolin Chen
> From: Leon Romanovsky <leon@kernel.org>
> Sent: Tuesday, November 11, 2025 5:58 PM
>
> From: Vivek Kasireddy <vivek.kasireddy@intel.com>
>
> There is no need to share the main device pointer (struct vfio_device *)
> with all the feature functions as they only need the core device
> pointer. Therefore, extract the core device pointer once in the
> caller (vfio_pci_core_ioctl_feature) and share it instead.
>
> Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
> Tested-by: Alex Mastro <amastro@fb.com>
> Tested-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
^ permalink raw reply [flat|nested] 63+ messages in thread
* [PATCH v8 09/11] vfio/pci: Enable peer-to-peer DMA transactions by default
2025-11-11 9:57 [PATCH v8 00/11] vfio/pci: Allow MMIO regions to be exported through dma-buf Leon Romanovsky
` (7 preceding siblings ...)
2025-11-11 9:57 ` [PATCH v8 08/11] vfio/pci: Share the core device pointer while invoking feature functions Leon Romanovsky
@ 2025-11-11 9:57 ` Leon Romanovsky
2025-11-18 7:18 ` Tian, Kevin
2025-11-11 9:57 ` [PATCH v8 10/11] vfio/pci: Add dma-buf export support for MMIO regions Leon Romanovsky
2025-11-11 9:57 ` [PATCH v8 11/11] vfio/nvgrace: Support get_dmabuf_phys Leon Romanovsky
10 siblings, 1 reply; 63+ messages in thread
From: Leon Romanovsky @ 2025-11-11 9:57 UTC (permalink / raw)
To: Bjorn Helgaas, Logan Gunthorpe, Jens Axboe, Robin Murphy,
Joerg Roedel, Will Deacon, Marek Szyprowski, Jason Gunthorpe,
Leon Romanovsky, Andrew Morton, Jonathan Corbet, Sumit Semwal,
Christian König, Kees Cook, Gustavo A. R. Silva,
Ankit Agrawal, Yishai Hadas, Shameer Kolothum, Kevin Tian,
Alex Williamson
Cc: Krishnakant Jaju, Matt Ochs, linux-pci, linux-kernel, linux-block,
iommu, linux-mm, linux-doc, linux-media, dri-devel, linaro-mm-sig,
kvm, linux-hardening, Alex Mastro, Nicolin Chen
From: Leon Romanovsky <leonro@nvidia.com>
Make sure that all VFIO PCI devices have peer-to-peer capabilities
enabled, so that we can export their MMIO memory through DMABUF.
VFIO has always supported P2P mappings with itself. VFIO type 1
insecurely reads PFNs directly out of a VMA's PTEs and programs them
into the IOMMU allowing any two VFIO devices to perform P2P to each
other.
All existing VMMs use this capability to export P2P into a VM where
the VM could set up any kind of DMA it likes. Projects like DPDK/SPDK
are also known to make use of this, though less frequently.
As a first step to more properly integrating VFIO with the P2P
subsystem unconditionally enable P2P support for VFIO PCI devices. The
struct p2pdma_provider will act has a handle to the P2P subsystem to
do things like DMA mapping.
While real PCI devices have to support P2P (they can't even tell if an
IOVA is P2P or not) there may be fake PCI devices that may trigger
some kind of catastrophic system failure. To date VFIO has never
tripped up on such a case, but if one is discovered the plan is to add
a PCI quirk and have pcim_p2pdma_init() fail. This will fully block
the broken device throughout any users of the P2P subsystem in the
kernel.
Thus P2P through DMABUF will follow the historical VFIO model and be
unconditionally enabled by vfio-pci.
Tested-by: Alex Mastro <amastro@fb.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
drivers/vfio/pci/vfio_pci_core.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index ca9a95716a85..142b84b3f225 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -28,6 +28,7 @@
#include <linux/nospec.h>
#include <linux/sched/mm.h>
#include <linux/iommufd.h>
+#include <linux/pci-p2pdma.h>
#if IS_ENABLED(CONFIG_EEH)
#include <asm/eeh.h>
#endif
@@ -2081,6 +2082,7 @@ int vfio_pci_core_init_dev(struct vfio_device *core_vdev)
{
struct vfio_pci_core_device *vdev =
container_of(core_vdev, struct vfio_pci_core_device, vdev);
+ int ret;
vdev->pdev = to_pci_dev(core_vdev->dev);
vdev->irq_type = VFIO_PCI_NUM_IRQS;
@@ -2090,6 +2092,9 @@ int vfio_pci_core_init_dev(struct vfio_device *core_vdev)
INIT_LIST_HEAD(&vdev->dummy_resources_list);
INIT_LIST_HEAD(&vdev->ioeventfds_list);
INIT_LIST_HEAD(&vdev->sriov_pfs_item);
+ ret = pcim_p2pdma_init(vdev->pdev);
+ if (ret && ret != -EOPNOTSUPP)
+ return ret;
init_rwsem(&vdev->memory_lock);
xa_init(&vdev->ctx);
--
2.51.1
^ permalink raw reply related [flat|nested] 63+ messages in thread
* RE: [PATCH v8 09/11] vfio/pci: Enable peer-to-peer DMA transactions by default
2025-11-11 9:57 ` [PATCH v8 09/11] vfio/pci: Enable peer-to-peer DMA transactions by default Leon Romanovsky
@ 2025-11-18 7:18 ` Tian, Kevin
2025-11-18 20:10 ` Alex Williamson
2025-11-18 20:18 ` Keith Busch
0 siblings, 2 replies; 63+ messages in thread
From: Tian, Kevin @ 2025-11-18 7:18 UTC (permalink / raw)
To: Leon Romanovsky, Bjorn Helgaas, Logan Gunthorpe, Jens Axboe,
Robin Murphy, Joerg Roedel, Will Deacon, Marek Szyprowski,
Jason Gunthorpe, Andrew Morton, Jonathan Corbet, Sumit Semwal,
Christian König, Kees Cook, Gustavo A. R. Silva,
Ankit Agrawal, Yishai Hadas, Shameer Kolothum, Alex Williamson
Cc: Krishnakant Jaju, Matt Ochs, linux-pci@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
iommu@lists.linux.dev, linux-mm@kvack.org,
linux-doc@vger.kernel.org, linux-media@vger.kernel.org,
dri-devel@lists.freedesktop.org, linaro-mm-sig@lists.linaro.org,
kvm@vger.kernel.org, linux-hardening@vger.kernel.org, Alex Mastro,
Nicolin Chen
> From: Leon Romanovsky <leon@kernel.org>
> Sent: Tuesday, November 11, 2025 5:58 PM
>
> From: Leon Romanovsky <leonro@nvidia.com>
not required with only your own s-o-b
> @@ -2090,6 +2092,9 @@ int vfio_pci_core_init_dev(struct vfio_device
> *core_vdev)
> INIT_LIST_HEAD(&vdev->dummy_resources_list);
> INIT_LIST_HEAD(&vdev->ioeventfds_list);
> INIT_LIST_HEAD(&vdev->sriov_pfs_item);
> + ret = pcim_p2pdma_init(vdev->pdev);
> + if (ret && ret != -EOPNOTSUPP)
> + return ret;
Reading the commit msg, it seems -EOPNOTSUPP is only returned for fake
PCI devices; otherwise it implies a regression. Better add a comment for it?
otherwise,
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v8 09/11] vfio/pci: Enable peer-to-peer DMA transactions by default
2025-11-18 7:18 ` Tian, Kevin
@ 2025-11-18 20:10 ` Alex Williamson
2025-11-19 0:01 ` Tian, Kevin
2025-11-18 20:18 ` Keith Busch
1 sibling, 1 reply; 63+ messages in thread
From: Alex Williamson @ 2025-11-18 20:10 UTC (permalink / raw)
To: Tian, Kevin
Cc: Leon Romanovsky, Bjorn Helgaas, Logan Gunthorpe, Jens Axboe,
Robin Murphy, Joerg Roedel, Will Deacon, Marek Szyprowski,
Jason Gunthorpe, Andrew Morton, Jonathan Corbet, Sumit Semwal,
Christian König, Kees Cook, Gustavo A. R. Silva,
Ankit Agrawal, Yishai Hadas, Shameer Kolothum, Krishnakant Jaju,
Matt Ochs, linux-pci@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
iommu@lists.linux.dev, linux-mm@kvack.org,
linux-doc@vger.kernel.org, linux-media@vger.kernel.org,
dri-devel@lists.freedesktop.org, linaro-mm-sig@lists.linaro.org,
kvm@vger.kernel.org, linux-hardening@vger.kernel.org, Alex Mastro,
Nicolin Chen
On Tue, 18 Nov 2025 07:18:36 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:
> > From: Leon Romanovsky <leon@kernel.org>
> > Sent: Tuesday, November 11, 2025 5:58 PM
> >
> > From: Leon Romanovsky <leonro@nvidia.com>
>
> not required with only your own s-o-b
>
> > @@ -2090,6 +2092,9 @@ int vfio_pci_core_init_dev(struct vfio_device
> > *core_vdev)
> > INIT_LIST_HEAD(&vdev->dummy_resources_list);
> > INIT_LIST_HEAD(&vdev->ioeventfds_list);
> > INIT_LIST_HEAD(&vdev->sriov_pfs_item);
> > + ret = pcim_p2pdma_init(vdev->pdev);
> > + if (ret && ret != -EOPNOTSUPP)
> > + return ret;
>
> Reading the commit msg seems -EOPNOTSUPP is only returned for fake
> PCI devices, otherwise it implies regression. better add a comment for it?
I think the commit log is saying that if a device comes along that
can't support this, we'd quirk the init path to return -EOPNOTSUPP for
that particular device here. This path is currently used when
!CONFIG_PCI_P2PDMA to make this error non-fatal to the device init.
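That is, the stub (and, in the future, a per-device quirk) would surface
-EOPNOTSUPP roughly as sketched below; this is an assumption about the shape
of the pci-p2pdma header, not a quote of it:

	#ifdef CONFIG_PCI_P2PDMA
	int pcim_p2pdma_init(struct pci_dev *pdev);
	#else
	static inline int pcim_p2pdma_init(struct pci_dev *pdev)
	{
		/* No P2PDMA support built in (or the device is quirked out):
		 * vfio_pci_core_init_dev() treats this as "no P2P" and keeps
		 * going, while any other error stays fatal. */
		return -EOPNOTSUPP;
	}
	#endif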
I don't see a regression if such a device comes along, and while we
could survive other types of failures by disabling p2pdma here, I think
all such cases are sufficiently rare out-of-memory cases to consider them
catastrophic. Thanks,
Alex
^ permalink raw reply [flat|nested] 63+ messages in thread
* RE: [PATCH v8 09/11] vfio/pci: Enable peer-to-peer DMA transactions by default
2025-11-18 20:10 ` Alex Williamson
@ 2025-11-19 0:01 ` Tian, Kevin
0 siblings, 0 replies; 63+ messages in thread
From: Tian, Kevin @ 2025-11-19 0:01 UTC (permalink / raw)
To: Alex Williamson
Cc: Leon Romanovsky, Bjorn Helgaas, Logan Gunthorpe, Jens Axboe,
Robin Murphy, Joerg Roedel, Will Deacon, Marek Szyprowski,
Jason Gunthorpe, Andrew Morton, Jonathan Corbet, Sumit Semwal,
Christian König, Kees Cook, Gustavo A. R. Silva,
Ankit Agrawal, Yishai Hadas, Shameer Kolothum, Krishnakant Jaju,
Matt Ochs, linux-pci@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
iommu@lists.linux.dev, linux-mm@kvack.org,
linux-doc@vger.kernel.org, linux-media@vger.kernel.org,
dri-devel@lists.freedesktop.org, linaro-mm-sig@lists.linaro.org,
kvm@vger.kernel.org, linux-hardening@vger.kernel.org, Alex Mastro,
Nicolin Chen
> From: Alex Williamson <alex@shazbot.org>
> Sent: Wednesday, November 19, 2025 4:11 AM
>
> On Tue, 18 Nov 2025 07:18:36 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
>
> > > From: Leon Romanovsky <leon@kernel.org>
> > > Sent: Tuesday, November 11, 2025 5:58 PM
> > >
> > > From: Leon Romanovsky <leonro@nvidia.com>
> >
> > not required with only your own s-o-b
> >
> > > @@ -2090,6 +2092,9 @@ int vfio_pci_core_init_dev(struct vfio_device
> > > *core_vdev)
> > > INIT_LIST_HEAD(&vdev->dummy_resources_list);
> > > INIT_LIST_HEAD(&vdev->ioeventfds_list);
> > > INIT_LIST_HEAD(&vdev->sriov_pfs_item);
> > > + ret = pcim_p2pdma_init(vdev->pdev);
> > > + if (ret && ret != -EOPNOTSUPP)
> > > + return ret;
> >
> > Reading the commit msg seems -EOPNOTSUPP is only returned for fake
> > PCI devices, otherwise it implies regression. better add a comment for it?
>
> I think the commit log is saying that if a device comes along that
> can't support this, we'd quirk the init path to return -EOPNOTSUPP for
> that particular device here. This path is currently used when
> !CONFIG_PCI_P2PDMA to make this error non-fatal to the device init.
>
> I don't see a regression if such a device comes along and while we
> could survive other types of failures by disabling p2pdma here, I think
> all such cases are sufficient rare out of memory cases to consider them
> catastrophic. Thanks,
>
Ah yes, I read it inaccurately.
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH v8 09/11] vfio/pci: Enable peer-to-peer DMA transactions by default
2025-11-18 7:18 ` Tian, Kevin
2025-11-18 20:10 ` Alex Williamson
@ 2025-11-18 20:18 ` Keith Busch
2025-11-19 0:02 ` Tian, Kevin
1 sibling, 1 reply; 63+ messages in thread
From: Keith Busch @ 2025-11-18 20:18 UTC (permalink / raw)
To: Tian, Kevin
Cc: Leon Romanovsky, Bjorn Helgaas, Logan Gunthorpe, Jens Axboe,
Robin Murphy, Joerg Roedel, Will Deacon, Marek Szyprowski,
Jason Gunthorpe, Andrew Morton, Jonathan Corbet, Sumit Semwal,
Christian König, Kees Cook, Gustavo A. R. Silva,
Ankit Agrawal, Yishai Hadas, Shameer Kolothum, Alex Williamson,
Krishnakant Jaju, Matt Ochs, linux-pci@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
iommu@lists.linux.dev, linux-mm@kvack.org,
linux-doc@vger.kernel.org, linux-media@vger.kernel.org,
dri-devel@lists.freedesktop.org, linaro-mm-sig@lists.linaro.org,
kvm@vger.kernel.org, linux-hardening@vger.kernel.org, Alex Mastro,
Nicolin Chen
On Tue, Nov 18, 2025 at 07:18:36AM +0000, Tian, Kevin wrote:
> > From: Leon Romanovsky <leon@kernel.org>
> > Sent: Tuesday, November 11, 2025 5:58 PM
> >
> > From: Leon Romanovsky <leonro@nvidia.com>
>
> not required with only your own s-o-b
That's automatically appended when the sender and signer don't match.
It's not uncommon for developers to send from a kernel.org email but
sign off with a corporate account, or the other way around.
* RE: [PATCH v8 09/11] vfio/pci: Enable peer-to-peer DMA transactions by default
2025-11-18 20:18 ` Keith Busch
@ 2025-11-19 0:02 ` Tian, Kevin
2025-11-19 13:54 ` Leon Romanovsky
0 siblings, 1 reply; 63+ messages in thread
From: Tian, Kevin @ 2025-11-19 0:02 UTC (permalink / raw)
To: Keith Busch
Cc: Leon Romanovsky, Bjorn Helgaas, Logan Gunthorpe, Jens Axboe,
Robin Murphy, Joerg Roedel, Will Deacon, Marek Szyprowski,
Jason Gunthorpe, Andrew Morton, Jonathan Corbet, Sumit Semwal,
Christian König, Kees Cook, Gustavo A. R. Silva,
Ankit Agrawal, Yishai Hadas, Shameer Kolothum, Alex Williamson,
Krishnakant Jaju, Matt Ochs, linux-pci@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
iommu@lists.linux.dev, linux-mm@kvack.org,
linux-doc@vger.kernel.org, linux-media@vger.kernel.org,
dri-devel@lists.freedesktop.org, linaro-mm-sig@lists.linaro.org,
kvm@vger.kernel.org, linux-hardening@vger.kernel.org, Alex Mastro,
Nicolin Chen
> From: Keith Busch <kbusch@kernel.org>
> Sent: Wednesday, November 19, 2025 4:19 AM
>
> On Tue, Nov 18, 2025 at 07:18:36AM +0000, Tian, Kevin wrote:
> > > From: Leon Romanovsky <leon@kernel.org>
> > > Sent: Tuesday, November 11, 2025 5:58 PM
> > >
> > > From: Leon Romanovsky <leonro@nvidia.com>
> >
> > not required with only your own s-o-b
>
> That's automatically appended when the sender and signer don't match.
> It's not uncommon for developers to send from a kernel.org email but
> sign off with a corporate account, or the other way around.
Good to know.
* Re: [PATCH v8 09/11] vfio/pci: Enable peer-to-peer DMA transactions by default
2025-11-19 0:02 ` Tian, Kevin
@ 2025-11-19 13:54 ` Leon Romanovsky
0 siblings, 0 replies; 63+ messages in thread
From: Leon Romanovsky @ 2025-11-19 13:54 UTC (permalink / raw)
To: Tian, Kevin
Cc: Keith Busch, Bjorn Helgaas, Logan Gunthorpe, Jens Axboe,
Robin Murphy, Joerg Roedel, Will Deacon, Marek Szyprowski,
Jason Gunthorpe, Andrew Morton, Jonathan Corbet, Sumit Semwal,
Christian König, Kees Cook, Gustavo A. R. Silva,
Ankit Agrawal, Yishai Hadas, Shameer Kolothum, Alex Williamson,
Krishnakant Jaju, Matt Ochs, linux-pci@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
iommu@lists.linux.dev, linux-mm@kvack.org,
linux-doc@vger.kernel.org, linux-media@vger.kernel.org,
dri-devel@lists.freedesktop.org, linaro-mm-sig@lists.linaro.org,
kvm@vger.kernel.org, linux-hardening@vger.kernel.org, Alex Mastro,
Nicolin Chen
On Wed, Nov 19, 2025 at 12:02:02AM +0000, Tian, Kevin wrote:
> > From: Keith Busch <kbusch@kernel.org>
> > Sent: Wednesday, November 19, 2025 4:19 AM
> >
> > On Tue, Nov 18, 2025 at 07:18:36AM +0000, Tian, Kevin wrote:
> > > > From: Leon Romanovsky <leon@kernel.org>
> > > > Sent: Tuesday, November 11, 2025 5:58 PM
> > > >
> > > > From: Leon Romanovsky <leonro@nvidia.com>
> > >
> > > not required with only your own s-o-b
> >
> > That's automatically appended when the sender and signer don't match.
> > It's not uncommon for developers to send from a kernel.org email but
> > sign off with a corporate account, or the other way around.
>
> Good to know.
Yes. In addition, I tend to separate code authorship from my
open-source activity: the code belongs to my employer, which is why a
corporate address is used for authorship, while all emails and
communications come from my kernel.org account.
Thanks
* [PATCH v8 10/11] vfio/pci: Add dma-buf export support for MMIO regions
2025-11-11 9:57 [PATCH v8 00/11] vfio/pci: Allow MMIO regions to be exported through dma-buf Leon Romanovsky
` (8 preceding siblings ...)
2025-11-11 9:57 ` [PATCH v8 09/11] vfio/pci: Enable peer-to-peer DMA transactions by default Leon Romanovsky
@ 2025-11-11 9:57 ` Leon Romanovsky
2025-11-18 7:33 ` Tian, Kevin
2025-11-11 9:57 ` [PATCH v8 11/11] vfio/nvgrace: Support get_dmabuf_phys Leon Romanovsky
10 siblings, 1 reply; 63+ messages in thread
From: Leon Romanovsky @ 2025-11-11 9:57 UTC (permalink / raw)
To: Bjorn Helgaas, Logan Gunthorpe, Jens Axboe, Robin Murphy,
Joerg Roedel, Will Deacon, Marek Szyprowski, Jason Gunthorpe,
Leon Romanovsky, Andrew Morton, Jonathan Corbet, Sumit Semwal,
Christian König, Kees Cook, Gustavo A. R. Silva,
Ankit Agrawal, Yishai Hadas, Shameer Kolothum, Kevin Tian,
Alex Williamson
Cc: Krishnakant Jaju, Matt Ochs, linux-pci, linux-kernel, linux-block,
iommu, linux-mm, linux-doc, linux-media, dri-devel, linaro-mm-sig,
kvm, linux-hardening, Vivek Kasireddy
From: Leon Romanovsky <leonro@nvidia.com>
Add support for exporting PCI device MMIO regions through dma-buf,
enabling safe sharing of non-struct page memory with controlled
lifetime management. This allows RDMA and other subsystems to import
dma-buf FDs and build them into memory regions for PCI P2P operations.
The implementation provides a revocable attachment mechanism using
dma-buf move operations. MMIO regions are normally pinned as BARs
don't change physical addresses, but access is revoked when the VFIO
device is closed or a PCI reset is issued. This ensures kernel
self-defense against potentially hostile userspace.
Currently VFIO can take MMIO regions from the device's BAR and map
them into a PFNMAP VMA with special PTEs. This mapping type ensures
the memory cannot be used with things like pin_user_pages(), hmm, and
so on. In practice only the user process CPU and KVM can safely make
use of these VMA. When VFIO shuts down these VMAs are cleaned by
unmap_mapping_range() to prevent any UAF of the MMIO beyond driver
unbind.
However, VFIO type 1 has an insecure behavior where it uses
follow_pfnmap_*() to fish an MMIO PFN out of a VMA and program it back
into the IOMMU. This has a long history of enabling P2P DMA inside
VMs, but has serious lifetime problems by allowing a UAF of the MMIO
after the VFIO driver has been unbound.
Introduce DMABUF as a new safe way to export an FD-based handle for the
MMIO regions. This can be consumed by existing DMABUF importers like
RDMA or DRM without opening a UAF. A following series will add an
importer to iommufd to obsolete the type 1 code and allow safe
UAF-free MMIO P2P in VM cases.
DMABUF has a built in synchronous invalidation mechanism called
move_notify. VFIO keeps track of all drivers importing its MMIO and
can invoke a synchronous invalidation callback to tell the importing
drivers to DMA unmap and forget about the MMIO PFNs. This process is
called revoke. This synchronous invalidation fully prevents any
lifecycle problems. VFIO will do this before unbinding its driver
ensuring there is no UAF of the MMIO beyond the driver lifecycle.
Further, VFIO has additional behavior to block access to the MMIO
during things like Function Level Reset. This is because some poor
platforms may experience an MCE-type crash when touching the MMIO of a PCI
device that is undergoing a reset. Today this is done by using
unmap_mapping_range() on the VMAs. Extend that into the DMABUF world
and temporarily revoke the MMIO from the DMABUF importers during FLR
as well. This will more robustly prevent an errant P2P from possibly
upsetting the platform.
A DMABUF FD is a preferred handle for MMIO compared to using something
like a pgmap because:
- VFIO is supported, including its P2P feature, on archs that don't
support pgmap
- PCI devices have all sorts of BAR sizes, including ones smaller
than a section, so a pgmap cannot always be created
- It is undesirable to waste a lot of memory for struct pages,
especially for a case like a GPU with ~100GB of BAR size
- We want a synchronous revoke semantic to support FLR with light
hardware requirements
Use the P2P subsystem to help generate the DMA mapping. This is a
significant upgrade over the abuse of dma_map_resource() that has
historically been used by DMABUF exporters. Experience with an OOT
version of this patch shows that real systems do need this. This
approach deals with all the P2P scenarios:
- Non-zero PCI bus_offset
- ACS flags routing traffic to the IOMMU
- ACS flags that bypass the IOMMU - though vfio noiommu is required
to hit this.
There will be further work to formalize the revoke semantic in
DMABUF. For now this acts like a move_notify dynamic exporter where
an importer's fault handling will get a failure when it attempts to map.
This means that only fully restartable-fault-capable importers can
import the VFIO DMABUFs. A future revoke semantic should open this up
to more HW as the HW only needs to invalidate, not handle restartable
faults.
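
As a rough illustration of the intended userspace flow (a sketch only,
not part of this patch; error handling is omitted, the device fd is
assumed to be an already-open VFIO device, and the region/range values
are made up):

#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Ask VFIO for a dmabuf fd covering one 64K slice at offset 0 of BAR 0. */
static int vfio_get_bar_dmabuf(int device_fd)
{
	struct vfio_region_dma_range range = { .offset = 0, .length = 0x10000 };
	size_t sz = sizeof(struct vfio_device_feature) +
		    sizeof(struct vfio_device_feature_dma_buf) + sizeof(range);
	struct vfio_device_feature *feat = calloc(1, sz);
	struct vfio_device_feature_dma_buf *get_dma_buf = (void *)feat->data;
	int fd;

	feat->argsz = sz;
	feat->flags = VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_DMA_BUF;
	get_dma_buf->region_index = 0;			/* BAR 0 */
	get_dma_buf->open_flags = O_CLOEXEC | O_RDWR;
	get_dma_buf->nr_ranges = 1;
	memcpy(get_dma_buf->dma_ranges, &range, sizeof(range));

	fd = ioctl(device_fd, VFIO_DEVICE_FEATURE, feat);	/* >= 0: dmabuf fd */
	free(feat);
	return fd;
}

An importer such as RDMA userspace could then, for example, hand the
returned fd to ibv_reg_dmabuf_mr(); per the note above, only importers
whose fault handling can restart a failed mapping will work until the
revoke semantic is formalized.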
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
drivers/vfio/pci/Kconfig | 3 +
drivers/vfio/pci/Makefile | 1 +
drivers/vfio/pci/vfio_pci.c | 5 +
drivers/vfio/pci/vfio_pci_config.c | 22 ++-
drivers/vfio/pci/vfio_pci_core.c | 18 ++-
drivers/vfio/pci/vfio_pci_dmabuf.c | 315 +++++++++++++++++++++++++++++++++++++
drivers/vfio/pci/vfio_pci_priv.h | 23 +++
include/linux/vfio_pci_core.h | 42 +++++
include/uapi/linux/vfio.h | 28 ++++
9 files changed, 452 insertions(+), 5 deletions(-)
diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index 2b0172f54665..2b9fca00e9e8 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -55,6 +55,9 @@ config VFIO_PCI_ZDEV_KVM
To enable s390x KVM vfio-pci extensions, say Y.
+config VFIO_PCI_DMABUF
+ def_bool y if VFIO_PCI_CORE && PCI_P2PDMA && DMA_SHARED_BUFFER
+
source "drivers/vfio/pci/mlx5/Kconfig"
source "drivers/vfio/pci/hisilicon/Kconfig"
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index cf00c0a7e55c..53f59226ae01 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -2,6 +2,7 @@
vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
vfio-pci-core-$(CONFIG_VFIO_PCI_ZDEV_KVM) += vfio_pci_zdev.o
+vfio-pci-core-$(CONFIG_VFIO_PCI_DMABUF) += vfio_pci_dmabuf.o
obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o
vfio-pci-y := vfio_pci.o
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index ac10f14417f2..6d41cf26b539 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -147,6 +147,10 @@ static const struct vfio_device_ops vfio_pci_ops = {
.pasid_detach_ioas = vfio_iommufd_physical_pasid_detach_ioas,
};
+static const struct vfio_pci_device_ops vfio_pci_dev_ops = {
+ .get_dmabuf_phys = vfio_pci_core_get_dmabuf_phys,
+};
+
static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
struct vfio_pci_core_device *vdev;
@@ -161,6 +165,7 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
return PTR_ERR(vdev);
dev_set_drvdata(&pdev->dev, vdev);
+ vdev->pci_ops = &vfio_pci_dev_ops;
ret = vfio_pci_core_register_device(vdev);
if (ret)
goto out_put_vdev;
diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index 8f02f236b5b4..1f6008eabf23 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -589,10 +589,12 @@ static int vfio_basic_config_write(struct vfio_pci_core_device *vdev, int pos,
virt_mem = !!(le16_to_cpu(*virt_cmd) & PCI_COMMAND_MEMORY);
new_mem = !!(new_cmd & PCI_COMMAND_MEMORY);
- if (!new_mem)
+ if (!new_mem) {
vfio_pci_zap_and_down_write_memory_lock(vdev);
- else
+ vfio_pci_dma_buf_move(vdev, true);
+ } else {
down_write(&vdev->memory_lock);
+ }
/*
* If the user is writing mem/io enable (new_mem/io) and we
@@ -627,6 +629,8 @@ static int vfio_basic_config_write(struct vfio_pci_core_device *vdev, int pos,
*virt_cmd &= cpu_to_le16(~mask);
*virt_cmd |= cpu_to_le16(new_cmd & mask);
+ if (__vfio_pci_memory_enabled(vdev))
+ vfio_pci_dma_buf_move(vdev, false);
up_write(&vdev->memory_lock);
}
@@ -707,12 +711,16 @@ static int __init init_pci_cap_basic_perm(struct perm_bits *perm)
static void vfio_lock_and_set_power_state(struct vfio_pci_core_device *vdev,
pci_power_t state)
{
- if (state >= PCI_D3hot)
+ if (state >= PCI_D3hot) {
vfio_pci_zap_and_down_write_memory_lock(vdev);
- else
+ vfio_pci_dma_buf_move(vdev, true);
+ } else {
down_write(&vdev->memory_lock);
+ }
vfio_pci_set_power_state(vdev, state);
+ if (__vfio_pci_memory_enabled(vdev))
+ vfio_pci_dma_buf_move(vdev, false);
up_write(&vdev->memory_lock);
}
@@ -900,7 +908,10 @@ static int vfio_exp_config_write(struct vfio_pci_core_device *vdev, int pos,
if (!ret && (cap & PCI_EXP_DEVCAP_FLR)) {
vfio_pci_zap_and_down_write_memory_lock(vdev);
+ vfio_pci_dma_buf_move(vdev, true);
pci_try_reset_function(vdev->pdev);
+ if (__vfio_pci_memory_enabled(vdev))
+ vfio_pci_dma_buf_move(vdev, false);
up_write(&vdev->memory_lock);
}
}
@@ -982,7 +993,10 @@ static int vfio_af_config_write(struct vfio_pci_core_device *vdev, int pos,
if (!ret && (cap & PCI_AF_CAP_FLR) && (cap & PCI_AF_CAP_TP)) {
vfio_pci_zap_and_down_write_memory_lock(vdev);
+ vfio_pci_dma_buf_move(vdev, true);
pci_try_reset_function(vdev->pdev);
+ if (__vfio_pci_memory_enabled(vdev))
+ vfio_pci_dma_buf_move(vdev, false);
up_write(&vdev->memory_lock);
}
}
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 142b84b3f225..51a3bcc26f8b 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -287,6 +287,8 @@ static int vfio_pci_runtime_pm_entry(struct vfio_pci_core_device *vdev,
* semaphore.
*/
vfio_pci_zap_and_down_write_memory_lock(vdev);
+ vfio_pci_dma_buf_move(vdev, true);
+
if (vdev->pm_runtime_engaged) {
up_write(&vdev->memory_lock);
return -EINVAL;
@@ -370,6 +372,8 @@ static void vfio_pci_runtime_pm_exit(struct vfio_pci_core_device *vdev)
*/
down_write(&vdev->memory_lock);
__vfio_pci_runtime_pm_exit(vdev);
+ if (__vfio_pci_memory_enabled(vdev))
+ vfio_pci_dma_buf_move(vdev, false);
up_write(&vdev->memory_lock);
}
@@ -690,6 +694,8 @@ void vfio_pci_core_close_device(struct vfio_device *core_vdev)
#endif
vfio_pci_core_disable(vdev);
+ vfio_pci_dma_buf_cleanup(vdev);
+
mutex_lock(&vdev->igate);
if (vdev->err_trigger) {
eventfd_ctx_put(vdev->err_trigger);
@@ -1222,7 +1228,10 @@ static int vfio_pci_ioctl_reset(struct vfio_pci_core_device *vdev,
*/
vfio_pci_set_power_state(vdev, PCI_D0);
+ vfio_pci_dma_buf_move(vdev, true);
ret = pci_try_reset_function(vdev->pdev);
+ if (__vfio_pci_memory_enabled(vdev))
+ vfio_pci_dma_buf_move(vdev, false);
up_write(&vdev->memory_lock);
return ret;
@@ -1511,6 +1520,8 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
return vfio_pci_core_pm_exit(vdev, flags, arg, argsz);
case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN:
return vfio_pci_core_feature_token(vdev, flags, arg, argsz);
+ case VFIO_DEVICE_FEATURE_DMA_BUF:
+ return vfio_pci_core_feature_dma_buf(vdev, flags, arg, argsz);
default:
return -ENOTTY;
}
@@ -2095,6 +2106,7 @@ int vfio_pci_core_init_dev(struct vfio_device *core_vdev)
ret = pcim_p2pdma_init(vdev->pdev);
if (ret && ret != -EOPNOTSUPP)
return ret;
+ INIT_LIST_HEAD(&vdev->dmabufs);
init_rwsem(&vdev->memory_lock);
xa_init(&vdev->ctx);
@@ -2459,6 +2471,7 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
break;
}
+ vfio_pci_dma_buf_move(vdev, true);
vfio_pci_zap_bars(vdev);
}
@@ -2487,8 +2500,11 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
err_undo:
list_for_each_entry_from_reverse(vdev, &dev_set->device_list,
- vdev.dev_set_list)
+ vdev.dev_set_list) {
+ if (__vfio_pci_memory_enabled(vdev))
+ vfio_pci_dma_buf_move(vdev, false);
up_write(&vdev->memory_lock);
+ }
list_for_each_entry(vdev, &dev_set->device_list, vdev.dev_set_list)
pm_runtime_put(&vdev->pdev->dev);
diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c
new file mode 100644
index 000000000000..98ab96736935
--- /dev/null
+++ b/drivers/vfio/pci/vfio_pci_dmabuf.c
@@ -0,0 +1,315 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES.
+ */
+#include <linux/dma-buf.h>
+#include <linux/pci-p2pdma.h>
+#include <linux/dma-resv.h>
+
+#include "vfio_pci_priv.h"
+
+MODULE_IMPORT_NS("DMA_BUF");
+
+struct vfio_pci_dma_buf {
+ struct dma_buf *dmabuf;
+ struct vfio_pci_core_device *vdev;
+ struct list_head dmabufs_elm;
+ size_t size;
+ struct dma_buf_phys_vec *phys_vec;
+ struct p2pdma_provider *provider;
+ u32 nr_ranges;
+ u8 revoked : 1;
+};
+
+static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf,
+ struct dma_buf_attachment *attachment)
+{
+ struct vfio_pci_dma_buf *priv = dmabuf->priv;
+
+ if (!attachment->peer2peer)
+ return -EOPNOTSUPP;
+
+ if (priv->revoked)
+ return -ENODEV;
+
+ return 0;
+}
+
+static struct sg_table *
+vfio_pci_dma_buf_map(struct dma_buf_attachment *attachment,
+ enum dma_data_direction dir)
+{
+ struct vfio_pci_dma_buf *priv = attachment->dmabuf->priv;
+
+ dma_resv_assert_held(priv->dmabuf->resv);
+
+ if (priv->revoked)
+ return ERR_PTR(-ENODEV);
+
+ return dma_buf_map(attachment, priv->provider, priv->phys_vec,
+ priv->nr_ranges, priv->size, dir);
+}
+
+static void vfio_pci_dma_buf_unmap(struct dma_buf_attachment *attachment,
+ struct sg_table *sgt,
+ enum dma_data_direction dir)
+{
+ dma_buf_unmap(attachment, sgt, dir);
+}
+
+static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf)
+{
+ struct vfio_pci_dma_buf *priv = dmabuf->priv;
+
+ /*
+ * Either this or vfio_pci_dma_buf_cleanup() will remove from the list.
+ * The refcount prevents both.
+ */
+ if (priv->vdev) {
+ down_write(&priv->vdev->memory_lock);
+ list_del_init(&priv->dmabufs_elm);
+ up_write(&priv->vdev->memory_lock);
+ vfio_device_put_registration(&priv->vdev->vdev);
+ }
+ kfree(priv->phys_vec);
+ kfree(priv);
+}
+
+static const struct dma_buf_ops vfio_pci_dmabuf_ops = {
+ .attach = vfio_pci_dma_buf_attach,
+ .map_dma_buf = vfio_pci_dma_buf_map,
+ .unmap_dma_buf = vfio_pci_dma_buf_unmap,
+ .release = vfio_pci_dma_buf_release,
+};
+
+int vfio_pci_core_fill_phys_vec(struct dma_buf_phys_vec *phys_vec,
+ struct vfio_region_dma_range *dma_ranges,
+ size_t nr_ranges, phys_addr_t start,
+ phys_addr_t len)
+{
+ phys_addr_t max_addr;
+ unsigned int i;
+
+ max_addr = start + len;
+ for (i = 0; i < nr_ranges; i++) {
+ phys_addr_t end;
+
+ if (!dma_ranges[i].length)
+ return -EINVAL;
+
+ if (check_add_overflow(start, dma_ranges[i].offset,
+ &phys_vec[i].paddr) ||
+ check_add_overflow(phys_vec[i].paddr,
+ dma_ranges[i].length, &end))
+ return -EOVERFLOW;
+ if (end > max_addr)
+ return -EINVAL;
+
+ phys_vec[i].len = dma_ranges[i].length;
+ }
+ return 0;
+}
+EXPORT_SYMBOL_GPL(vfio_pci_core_fill_phys_vec);
+
+int vfio_pci_core_get_dmabuf_phys(struct vfio_pci_core_device *vdev,
+ struct p2pdma_provider **provider,
+ unsigned int region_index,
+ struct dma_buf_phys_vec *phys_vec,
+ struct vfio_region_dma_range *dma_ranges,
+ size_t nr_ranges)
+{
+ struct pci_dev *pdev = vdev->pdev;
+
+ *provider = pcim_p2pdma_provider(pdev, region_index);
+ if (!*provider)
+ return -EINVAL;
+
+ return vfio_pci_core_fill_phys_vec(
+ phys_vec, dma_ranges, nr_ranges,
+ pci_resource_start(pdev, region_index),
+ pci_resource_len(pdev, region_index));
+}
+EXPORT_SYMBOL_GPL(vfio_pci_core_get_dmabuf_phys);
+
+static int validate_dmabuf_input(struct vfio_device_feature_dma_buf *dma_buf,
+ struct vfio_region_dma_range *dma_ranges,
+ size_t *lengthp)
+{
+ size_t length = 0;
+ u32 i;
+
+ for (i = 0; i < dma_buf->nr_ranges; i++) {
+ u64 offset = dma_ranges[i].offset;
+ u64 len = dma_ranges[i].length;
+
+ if (!len || !PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
+ return -EINVAL;
+
+ if (check_add_overflow(length, len, &length))
+ return -EINVAL;
+ }
+
+ /*
+ * dma_iova_try_alloc() will WARN if userspace proposes a size that
+ * is too big, eg with lots of ranges.
+ */
+ if ((u64)(length) & DMA_IOVA_USE_SWIOTLB)
+ return -EINVAL;
+
+ *lengthp = length;
+ return 0;
+}
+
+int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
+ struct vfio_device_feature_dma_buf __user *arg,
+ size_t argsz)
+{
+ struct vfio_device_feature_dma_buf get_dma_buf = {};
+ struct vfio_region_dma_range *dma_ranges;
+ DEFINE_DMA_BUF_EXPORT_INFO(exp_info);
+ struct vfio_pci_dma_buf *priv;
+ size_t length;
+ int ret;
+
+ if (!vdev->pci_ops || !vdev->pci_ops->get_dmabuf_phys)
+ return -EOPNOTSUPP;
+
+ ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_GET,
+ sizeof(get_dma_buf));
+ if (ret != 1)
+ return ret;
+
+ if (copy_from_user(&get_dma_buf, arg, sizeof(get_dma_buf)))
+ return -EFAULT;
+
+ if (!get_dma_buf.nr_ranges || get_dma_buf.flags)
+ return -EINVAL;
+
+ /*
+ * For PCI the region_index is the BAR number like everything else.
+ */
+ if (get_dma_buf.region_index >= VFIO_PCI_ROM_REGION_INDEX)
+ return -ENODEV;
+
+ dma_ranges = memdup_array_user(&arg->dma_ranges, get_dma_buf.nr_ranges,
+ sizeof(*dma_ranges));
+ if (IS_ERR(dma_ranges))
+ return PTR_ERR(dma_ranges);
+
+ ret = validate_dmabuf_input(&get_dma_buf, dma_ranges, &length);
+ if (ret)
+ goto err_free_ranges;
+
+ priv = kzalloc(sizeof(*priv), GFP_KERNEL);
+ if (!priv) {
+ ret = -ENOMEM;
+ goto err_free_ranges;
+ }
+ priv->phys_vec = kcalloc(get_dma_buf.nr_ranges, sizeof(*priv->phys_vec),
+ GFP_KERNEL);
+ if (!priv->phys_vec) {
+ ret = -ENOMEM;
+ goto err_free_priv;
+ }
+
+ priv->vdev = vdev;
+ priv->nr_ranges = get_dma_buf.nr_ranges;
+ priv->size = length;
+ ret = vdev->pci_ops->get_dmabuf_phys(vdev, &priv->provider,
+ get_dma_buf.region_index,
+ priv->phys_vec, dma_ranges,
+ priv->nr_ranges);
+ if (ret)
+ goto err_free_phys;
+
+ kfree(dma_ranges);
+ dma_ranges = NULL;
+
+ if (!vfio_device_try_get_registration(&vdev->vdev)) {
+ ret = -ENODEV;
+ goto err_free_phys;
+ }
+
+ exp_info.ops = &vfio_pci_dmabuf_ops;
+ exp_info.size = priv->size;
+ exp_info.flags = get_dma_buf.open_flags;
+ exp_info.priv = priv;
+
+ priv->dmabuf = dma_buf_export(&exp_info);
+ if (IS_ERR(priv->dmabuf)) {
+ ret = PTR_ERR(priv->dmabuf);
+ goto err_dev_put;
+ }
+
+ /* dma_buf_put() now frees priv */
+ INIT_LIST_HEAD(&priv->dmabufs_elm);
+ down_write(&vdev->memory_lock);
+ dma_resv_lock(priv->dmabuf->resv, NULL);
+ priv->revoked = !__vfio_pci_memory_enabled(vdev);
+ list_add_tail(&priv->dmabufs_elm, &vdev->dmabufs);
+ dma_resv_unlock(priv->dmabuf->resv);
+ up_write(&vdev->memory_lock);
+
+ /*
+ * dma_buf_fd() consumes the reference, when the file closes the dmabuf
+ * will be released.
+ */
+ ret = dma_buf_fd(priv->dmabuf, get_dma_buf.open_flags);
+ if (ret < 0)
+ goto err_dma_buf;
+ return ret;
+
+err_dma_buf:
+ dma_buf_put(priv->dmabuf);
+err_dev_put:
+ vfio_device_put_registration(&vdev->vdev);
+err_free_phys:
+ kfree(priv->phys_vec);
+err_free_priv:
+ kfree(priv);
+err_free_ranges:
+ kfree(dma_ranges);
+ return ret;
+}
+
+void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
+{
+ struct vfio_pci_dma_buf *priv;
+ struct vfio_pci_dma_buf *tmp;
+
+ lockdep_assert_held_write(&vdev->memory_lock);
+
+ list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
+ if (!get_file_active(&priv->dmabuf->file))
+ continue;
+
+ if (priv->revoked != revoked) {
+ dma_resv_lock(priv->dmabuf->resv, NULL);
+ priv->revoked = revoked;
+ dma_buf_move_notify(priv->dmabuf);
+ dma_resv_unlock(priv->dmabuf->resv);
+ }
+ dma_buf_put(priv->dmabuf);
+ }
+}
+
+void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev)
+{
+ struct vfio_pci_dma_buf *priv;
+ struct vfio_pci_dma_buf *tmp;
+
+ down_write(&vdev->memory_lock);
+ list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
+ if (!get_file_active(&priv->dmabuf->file))
+ continue;
+
+ dma_resv_lock(priv->dmabuf->resv, NULL);
+ list_del_init(&priv->dmabufs_elm);
+ priv->vdev = NULL;
+ priv->revoked = true;
+ dma_buf_move_notify(priv->dmabuf);
+ dma_resv_unlock(priv->dmabuf->resv);
+ vfio_device_put_registration(&vdev->vdev);
+ fput(priv->dmabuf->file);
+ }
+ up_write(&vdev->memory_lock);
+}
diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
index a9972eacb293..28a405f8b97c 100644
--- a/drivers/vfio/pci/vfio_pci_priv.h
+++ b/drivers/vfio/pci/vfio_pci_priv.h
@@ -107,4 +107,27 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
}
+#ifdef CONFIG_VFIO_PCI_DMABUF
+int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
+ struct vfio_device_feature_dma_buf __user *arg,
+ size_t argsz);
+void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev);
+void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked);
+#else
+static inline int
+vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
+ struct vfio_device_feature_dma_buf __user *arg,
+ size_t argsz)
+{
+ return -ENOTTY;
+}
+static inline void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev)
+{
+}
+static inline void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev,
+ bool revoked)
+{
+}
+#endif
+
#endif
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index f541044e42a2..c9466ba323fa 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -26,6 +26,8 @@
struct vfio_pci_core_device;
struct vfio_pci_region;
+struct p2pdma_provider;
+struct dma_buf_phys_vec;
struct vfio_pci_regops {
ssize_t (*rw)(struct vfio_pci_core_device *vdev, char __user *buf,
@@ -49,9 +51,48 @@ struct vfio_pci_region {
u32 flags;
};
+struct vfio_pci_device_ops {
+ int (*get_dmabuf_phys)(struct vfio_pci_core_device *vdev,
+ struct p2pdma_provider **provider,
+ unsigned int region_index,
+ struct dma_buf_phys_vec *phys_vec,
+ struct vfio_region_dma_range *dma_ranges,
+ size_t nr_ranges);
+};
+
+#if IS_ENABLED(CONFIG_VFIO_PCI_DMABUF)
+int vfio_pci_core_fill_phys_vec(struct dma_buf_phys_vec *phys_vec,
+ struct vfio_region_dma_range *dma_ranges,
+ size_t nr_ranges, phys_addr_t start,
+ phys_addr_t len);
+int vfio_pci_core_get_dmabuf_phys(struct vfio_pci_core_device *vdev,
+ struct p2pdma_provider **provider,
+ unsigned int region_index,
+ struct dma_buf_phys_vec *phys_vec,
+ struct vfio_region_dma_range *dma_ranges,
+ size_t nr_ranges);
+#else
+static inline int
+vfio_pci_core_fill_phys_vec(struct dma_buf_phys_vec *phys_vec,
+ struct vfio_region_dma_range *dma_ranges,
+ size_t nr_ranges, phys_addr_t start,
+ phys_addr_t len)
+{
+ return -EINVAL;
+}
+static inline int vfio_pci_core_get_dmabuf_phys(
+ struct vfio_pci_core_device *vdev, struct p2pdma_provider **provider,
+ unsigned int region_index, struct dma_buf_phys_vec *phys_vec,
+ struct vfio_region_dma_range *dma_ranges, size_t nr_ranges)
+{
+ return -EOPNOTSUPP;
+}
+#endif
+
struct vfio_pci_core_device {
struct vfio_device vdev;
struct pci_dev *pdev;
+ const struct vfio_pci_device_ops *pci_ops;
void __iomem *barmap[PCI_STD_NUM_BARS];
bool bar_mmap_supported[PCI_STD_NUM_BARS];
u8 *pci_config_map;
@@ -94,6 +135,7 @@ struct vfio_pci_core_device {
struct vfio_pci_core_device *sriov_pf_core_dev;
struct notifier_block nb;
struct rw_semaphore memory_lock;
+ struct list_head dmabufs;
};
/* Will be exported for vfio pci drivers usage */
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 75100bf009ba..ac2329f24141 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -14,6 +14,7 @@
#include <linux/types.h>
#include <linux/ioctl.h>
+#include <linux/stddef.h>
#define VFIO_API_VERSION 0
@@ -1478,6 +1479,33 @@ struct vfio_device_feature_bus_master {
};
#define VFIO_DEVICE_FEATURE_BUS_MASTER 10
+/**
+ * Upon VFIO_DEVICE_FEATURE_GET create a dma_buf fd for the
+ * regions selected.
+ *
+ * open_flags are the typical flags passed to open(2), eg O_RDWR, O_CLOEXEC,
+ * etc. offset/length specify a slice of the region to create the dmabuf from.
+ * nr_ranges is the total number of (P2P DMA) ranges that comprise the dmabuf.
+ *
+ * flags should be 0.
+ *
+ * Return: The fd number on success, -1 and errno is set on failure.
+ */
+#define VFIO_DEVICE_FEATURE_DMA_BUF 11
+
+struct vfio_region_dma_range {
+ __u64 offset;
+ __u64 length;
+};
+
+struct vfio_device_feature_dma_buf {
+ __u32 region_index;
+ __u32 open_flags;
+ __u32 flags;
+ __u32 nr_ranges;
+ struct vfio_region_dma_range dma_ranges[] __counted_by(nr_ranges);
+};
+
/* -------- API for Type1 VFIO IOMMU -------- */
/**
--
2.51.1
* RE: [PATCH v8 10/11] vfio/pci: Add dma-buf export support for MMIO regions
2025-11-11 9:57 ` [PATCH v8 10/11] vfio/pci: Add dma-buf export support for MMIO regions Leon Romanovsky
@ 2025-11-18 7:33 ` Tian, Kevin
2025-11-18 14:28 ` Jason Gunthorpe
0 siblings, 1 reply; 63+ messages in thread
From: Tian, Kevin @ 2025-11-18 7:33 UTC (permalink / raw)
To: Leon Romanovsky, Bjorn Helgaas, Logan Gunthorpe, Jens Axboe,
Robin Murphy, Joerg Roedel, Will Deacon, Marek Szyprowski,
Jason Gunthorpe, Andrew Morton, Jonathan Corbet, Sumit Semwal,
Christian König, Kees Cook, Gustavo A. R. Silva,
Ankit Agrawal, Yishai Hadas, Shameer Kolothum, Alex Williamson
Cc: Krishnakant Jaju, Matt Ochs, linux-pci@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
iommu@lists.linux.dev, linux-mm@kvack.org,
linux-doc@vger.kernel.org, linux-media@vger.kernel.org,
dri-devel@lists.freedesktop.org, linaro-mm-sig@lists.linaro.org,
kvm@vger.kernel.org, linux-hardening@vger.kernel.org,
Kasireddy, Vivek
> From: Leon Romanovsky <leon@kernel.org>
> Sent: Tuesday, November 11, 2025 5:58 PM
>
> - if (!new_mem)
> + if (!new_mem) {
> vfio_pci_zap_and_down_write_memory_lock(vdev);
> - else
> + vfio_pci_dma_buf_move(vdev, true);
> + } else {
> down_write(&vdev->memory_lock);
> + }
shouldn't we notify move before zapping the bars? otherwise there is
still a small window in between where the exporter already has the
mapping cleared while the importer still keeps it...
> +static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf)
> +{
> + struct vfio_pci_dma_buf *priv = dmabuf->priv;
> +
> + /*
> + * Either this or vfio_pci_dma_buf_cleanup() will remove from the list.
> + * The refcount prevents both.
which refcount? I thought it's vdev->memory_lock preventing the race...
> + */
> + if (priv->vdev) {
> + down_write(&priv->vdev->memory_lock);
> + list_del_init(&priv->dmabufs_elm);
> + up_write(&priv->vdev->memory_lock);
> + vfio_device_put_registration(&priv->vdev->vdev);
> + }
> + kfree(priv->phys_vec);
> + kfree(priv);
> +}
[...]
> +int vfio_pci_core_fill_phys_vec(struct dma_buf_phys_vec *phys_vec,
> + struct vfio_region_dma_range *dma_ranges,
> + size_t nr_ranges, phys_addr_t start,
> + phys_addr_t len)
> +{
> + phys_addr_t max_addr;
> + unsigned int i;
> +
> + max_addr = start + len;
> + for (i = 0; i < nr_ranges; i++) {
> + phys_addr_t end;
> +
> + if (!dma_ranges[i].length)
> + return -EINVAL;
Looks redundant as there is already a check in validate_dmabuf_input().
> +
> +int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32
> flags,
> + struct vfio_device_feature_dma_buf __user
> *arg,
> + size_t argsz)
> +{
> + struct vfio_device_feature_dma_buf get_dma_buf = {};
> + struct vfio_region_dma_range *dma_ranges;
> + DEFINE_DMA_BUF_EXPORT_INFO(exp_info);
> + struct vfio_pci_dma_buf *priv;
> + size_t length;
> + int ret;
> +
> + if (!vdev->pci_ops || !vdev->pci_ops->get_dmabuf_phys)
> + return -EOPNOTSUPP;
> +
> + ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_GET,
> + sizeof(get_dma_buf));
> + if (ret != 1)
> + return ret;
> +
> + if (copy_from_user(&get_dma_buf, arg, sizeof(get_dma_buf)))
> + return -EFAULT;
> +
> + if (!get_dma_buf.nr_ranges || get_dma_buf.flags)
> + return -EINVAL;
unknown flag bits get -EOPNOTSUPP.
> +
> +void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev)
> +{
> + struct vfio_pci_dma_buf *priv;
> + struct vfio_pci_dma_buf *tmp;
> +
> + down_write(&vdev->memory_lock);
> + list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm)
> {
> + if (!get_file_active(&priv->dmabuf->file))
> + continue;
> +
> + dma_resv_lock(priv->dmabuf->resv, NULL);
> + list_del_init(&priv->dmabufs_elm);
> + priv->vdev = NULL;
> + priv->revoked = true;
> + dma_buf_move_notify(priv->dmabuf);
> + dma_resv_unlock(priv->dmabuf->resv);
> + vfio_device_put_registration(&vdev->vdev);
> + fput(priv->dmabuf->file);
dma_buf_put(priv->dmabuf), consistent with other places.
> +/**
> + * Upon VFIO_DEVICE_FEATURE_GET create a dma_buf fd for the
> + * regions selected.
s/regions/region/
> + *
> + * open_flags are the typical flags passed to open(2), eg O_RDWR,
> O_CLOEXEC,
> + * etc. offset/length specify a slice of the region to create the dmabuf from.
> + * nr_ranges is the total number of (P2P DMA) ranges that comprise the
> dmabuf.
> + *
> + * flags should be 0.
> + *
> + * Return: The fd number on success, -1 and errno is set on failure.
> + */
> +#define VFIO_DEVICE_FEATURE_DMA_BUF 11
> +
> +struct vfio_region_dma_range {
> + __u64 offset;
> + __u64 length;
> +};
> +
> +struct vfio_device_feature_dma_buf {
> + __u32 region_index;
> + __u32 open_flags;
> + __u32 flags;
Usually the 'flags' field is put at the start (following argsz, if present).
No big issues, so:
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
* Re: [PATCH v8 10/11] vfio/pci: Add dma-buf export support for MMIO regions
2025-11-18 7:33 ` Tian, Kevin
@ 2025-11-18 14:28 ` Jason Gunthorpe
2025-11-18 23:56 ` Tian, Kevin
0 siblings, 1 reply; 63+ messages in thread
From: Jason Gunthorpe @ 2025-11-18 14:28 UTC (permalink / raw)
To: Tian, Kevin
Cc: Leon Romanovsky, Bjorn Helgaas, Logan Gunthorpe, Jens Axboe,
Robin Murphy, Joerg Roedel, Will Deacon, Marek Szyprowski,
Andrew Morton, Jonathan Corbet, Sumit Semwal,
Christian König, Kees Cook, Gustavo A. R. Silva,
Ankit Agrawal, Yishai Hadas, Shameer Kolothum, Alex Williamson,
Krishnakant Jaju, Matt Ochs, linux-pci@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
iommu@lists.linux.dev, linux-mm@kvack.org,
linux-doc@vger.kernel.org, linux-media@vger.kernel.org,
dri-devel@lists.freedesktop.org, linaro-mm-sig@lists.linaro.org,
kvm@vger.kernel.org, linux-hardening@vger.kernel.org,
Kasireddy, Vivek
On Tue, Nov 18, 2025 at 07:33:23AM +0000, Tian, Kevin wrote:
> > From: Leon Romanovsky <leon@kernel.org>
> > Sent: Tuesday, November 11, 2025 5:58 PM
> >
> > - if (!new_mem)
> > + if (!new_mem) {
> > vfio_pci_zap_and_down_write_memory_lock(vdev);
> > - else
> > + vfio_pci_dma_buf_move(vdev, true);
> > + } else {
> > down_write(&vdev->memory_lock);
> > + }
>
> shouldn't we notify move before zapping the bars? otherwise there is
> still a small window in between where the exporter already has the
> mapping cleared while the importer still keeps it...
zapping the VMA and moving/revoking the DMABUF are independent
operations that can happen in any order. They affect different kinds
of users. The VMA zap prevents CPU access from userspace, the DMABUF
move prevents DMA access from devices.
The order has to be like the above because vfio_pci_dma_buf_move()
must be called under the memory lock and
vfio_pci_zap_and_down_write_memory_lock() gets the memory lock..
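
Spelled out, the pattern the hunks implement (the FLR paths especially)
looks roughly like this:

	/* sketch of the ordering the hunks use, not new code */
	vfio_pci_zap_and_down_write_memory_lock(vdev);	/* zap VMAs, take memory_lock */
	vfio_pci_dma_buf_move(vdev, true);		/* revoke importers; needs memory_lock held */
	/* ... disable memory / reset the function ... */
	if (__vfio_pci_memory_enabled(vdev))
		vfio_pci_dma_buf_move(vdev, false);	/* un-revoke only if memory came back */
	up_write(&vdev->memory_lock);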
> > +static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf)
> > +{
> > + struct vfio_pci_dma_buf *priv = dmabuf->priv;
> > +
> > + /*
> > + * Either this or vfio_pci_dma_buf_cleanup() will remove from the list.
> > + * The refcount prevents both.
>
> which refcount? I thought it's vdev->memory_lock preventing the race...
Refcount on the dmabuf
> > +int vfio_pci_core_fill_phys_vec(struct dma_buf_phys_vec *phys_vec,
> > + struct vfio_region_dma_range *dma_ranges,
> > + size_t nr_ranges, phys_addr_t start,
> > + phys_addr_t len)
> > +{
> > + phys_addr_t max_addr;
> > + unsigned int i;
> > +
> > + max_addr = start + len;
> > + for (i = 0; i < nr_ranges; i++) {
> > + phys_addr_t end;
> > +
> > + if (!dma_ranges[i].length)
> > + return -EINVAL;
>
> Looks redundant as there is already a check in validate_dmabuf_input().
Agree
> > +int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32
> > flags,
> > + struct vfio_device_feature_dma_buf __user
> > *arg,
> > + size_t argsz)
> > +{
> > + struct vfio_device_feature_dma_buf get_dma_buf = {};
> > + struct vfio_region_dma_range *dma_ranges;
> > + DEFINE_DMA_BUF_EXPORT_INFO(exp_info);
> > + struct vfio_pci_dma_buf *priv;
> > + size_t length;
> > + int ret;
> > +
> > + if (!vdev->pci_ops || !vdev->pci_ops->get_dmabuf_phys)
> > + return -EOPNOTSUPP;
> > +
> > + ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_GET,
> > + sizeof(get_dma_buf));
> > + if (ret != 1)
> > + return ret;
> > +
> > + if (copy_from_user(&get_dma_buf, arg, sizeof(get_dma_buf)))
> > + return -EFAULT;
> > +
> > + if (!get_dma_buf.nr_ranges || get_dma_buf.flags)
> > + return -EINVAL;
>
> unknown flag bits get -EOPNOTSUPP.
Agree
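
i.e. presumably split the check, something like:

	if (get_dma_buf.flags)
		return -EOPNOTSUPP;	/* reject unknown flag bits */
	if (!get_dma_buf.nr_ranges)
		return -EINVAL;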
> > +
> > +void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev)
> > +{
> > + struct vfio_pci_dma_buf *priv;
> > + struct vfio_pci_dma_buf *tmp;
> > +
> > + down_write(&vdev->memory_lock);
> > + list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm)
> > {
> > + if (!get_file_active(&priv->dmabuf->file))
> > + continue;
> > +
> > + dma_resv_lock(priv->dmabuf->resv, NULL);
> > + list_del_init(&priv->dmabufs_elm);
> > + priv->vdev = NULL;
> > + priv->revoked = true;
> > + dma_buf_move_notify(priv->dmabuf);
> > + dma_resv_unlock(priv->dmabuf->resv);
> > + vfio_device_put_registration(&vdev->vdev);
> > + fput(priv->dmabuf->file);
>
> dma_buf_put(priv->dmabuf), consistent with other places.
Someone else said this, I don't agree, the above got the get via
get_file_active() instead of a dma_buf version..
So we should pair with get_file_active() vs fput().
Christian rejected the idea of adding a dmabuf wrapper for
get_file_active(), oh well.
> > +struct vfio_device_feature_dma_buf {
> > + __u32 region_index;
> > + __u32 open_flags;
> > + __u32 flags;
>
> Usually the 'flags' field is put at the start (following argsz, if present).
Yeah, but doesn't really matter.
Thanks,
Jason
* RE: [PATCH v8 10/11] vfio/pci: Add dma-buf export support for MMIO regions
2025-11-18 14:28 ` Jason Gunthorpe
@ 2025-11-18 23:56 ` Tian, Kevin
2025-11-19 19:41 ` Jason Gunthorpe
0 siblings, 1 reply; 63+ messages in thread
From: Tian, Kevin @ 2025-11-18 23:56 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Leon Romanovsky, Bjorn Helgaas, Logan Gunthorpe, Jens Axboe,
Robin Murphy, Joerg Roedel, Will Deacon, Marek Szyprowski,
Andrew Morton, Jonathan Corbet, Sumit Semwal,
Christian König, Kees Cook, Gustavo A. R. Silva,
Ankit Agrawal, Yishai Hadas, Shameer Kolothum, Alex Williamson,
Krishnakant Jaju, Matt Ochs, linux-pci@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
iommu@lists.linux.dev, linux-mm@kvack.org,
linux-doc@vger.kernel.org, linux-media@vger.kernel.org,
dri-devel@lists.freedesktop.org, linaro-mm-sig@lists.linaro.org,
kvm@vger.kernel.org, linux-hardening@vger.kernel.org,
Kasireddy, Vivek
> From: Jason Gunthorpe <jgg@ziepe.ca>
> Sent: Tuesday, November 18, 2025 10:29 PM
>
> On Tue, Nov 18, 2025 at 07:33:23AM +0000, Tian, Kevin wrote:
> > > From: Leon Romanovsky <leon@kernel.org>
> > > Sent: Tuesday, November 11, 2025 5:58 PM
> > >
> > > - if (!new_mem)
> > > + if (!new_mem) {
> > > vfio_pci_zap_and_down_write_memory_lock(vdev);
> > > - else
> > > + vfio_pci_dma_buf_move(vdev, true);
> > > + } else {
> > > down_write(&vdev->memory_lock);
> > > + }
> >
> > shouldn't we notify move before zapping the bars? otherwise there is
> > still a small window in between where the exporter already has the
> > mapping cleared while the importer still keeps it...
>
> zapping the VMA and moving/revoking the DMABUF are independent
> operations that can happen in any order. They affect different kinds
> of users. The VMA zap prevents CPU access from userspace, the DMABUF
> move prevents DMA access from devices.
The comment was triggered by the description about UAF in the
commit msg.
>
> The order has to be like the above because vfio_pci_dma_buf_move()
> must be called under the memory lock and
> vfio_pci_zap_and_down_write_memory_lock() gets the memory lock..
make sense.
> > > + down_write(&vdev->memory_lock);
> > > + list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm)
> > > {
> > > + if (!get_file_active(&priv->dmabuf->file))
> > > + continue;
> > > +
> > > + dma_resv_lock(priv->dmabuf->resv, NULL);
> > > + list_del_init(&priv->dmabufs_elm);
> > > + priv->vdev = NULL;
> > > + priv->revoked = true;
> > > + dma_buf_move_notify(priv->dmabuf);
> > > + dma_resv_unlock(priv->dmabuf->resv);
> > > + vfio_device_put_registration(&vdev->vdev);
> > > + fput(priv->dmabuf->file);
> >
> > dma_buf_put(priv->dmabuf), consistent with other places.
>
> Someone else said this, I don't agree, the above got the get via
>
> get_file_active() instead of a dma_buf version..
>
> So we should pair with get_file_active() vs fput().
>
> Christian rejected the idea of adding a dmabuf wrapper for
> get_file_active(), oh well.
Okay then vfio_pci_dma_buf_move() should be changed. It uses
get_file_active() to pair dma_buf_put().
* Re: [PATCH v8 10/11] vfio/pci: Add dma-buf export support for MMIO regions
2025-11-18 23:56 ` Tian, Kevin
@ 2025-11-19 19:41 ` Jason Gunthorpe
2025-11-19 20:50 ` Leon Romanovsky
0 siblings, 1 reply; 63+ messages in thread
From: Jason Gunthorpe @ 2025-11-19 19:41 UTC (permalink / raw)
To: Tian, Kevin
Cc: Leon Romanovsky, Bjorn Helgaas, Logan Gunthorpe, Jens Axboe,
Robin Murphy, Joerg Roedel, Will Deacon, Marek Szyprowski,
Andrew Morton, Jonathan Corbet, Sumit Semwal,
Christian König, Kees Cook, Gustavo A. R. Silva,
Ankit Agrawal, Yishai Hadas, Shameer Kolothum, Alex Williamson,
Krishnakant Jaju, Matt Ochs, linux-pci@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
iommu@lists.linux.dev, linux-mm@kvack.org,
linux-doc@vger.kernel.org, linux-media@vger.kernel.org,
dri-devel@lists.freedesktop.org, linaro-mm-sig@lists.linaro.org,
kvm@vger.kernel.org, linux-hardening@vger.kernel.org,
Kasireddy, Vivek
On Tue, Nov 18, 2025 at 11:56:14PM +0000, Tian, Kevin wrote:
> > > > + down_write(&vdev->memory_lock);
> > > > + list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm)
> > > > {
> > > > + if (!get_file_active(&priv->dmabuf->file))
> > > > + continue;
> > > > +
> > > > + dma_resv_lock(priv->dmabuf->resv, NULL);
> > > > + list_del_init(&priv->dmabufs_elm);
> > > > + priv->vdev = NULL;
> > > > + priv->revoked = true;
> > > > + dma_buf_move_notify(priv->dmabuf);
> > > > + dma_resv_unlock(priv->dmabuf->resv);
> > > > + vfio_device_put_registration(&vdev->vdev);
> > > > + fput(priv->dmabuf->file);
> > >
> > > dma_buf_put(priv->dmabuf), consistent with other places.
> >
> > Someone else said this, I don't agree, the above got the get via
> >
> > get_file_active() instead of a dma_buf version..
> >
> > So we should pair with get_file_active() vs fput().
> >
> > Christian rejected the idea of adding a dmabuf wrapper for
> > get_file_active(), oh well.
>
> Okay then vfio_pci_dma_buf_move() should be changed. It uses
> get_file_active() to pair dma_buf_put().
Makes sense, Leon can you fix it?
Thanks,
Jason
* Re: [PATCH v8 10/11] vfio/pci: Add dma-buf export support for MMIO regions
2025-11-19 19:41 ` Jason Gunthorpe
@ 2025-11-19 20:50 ` Leon Romanovsky
0 siblings, 0 replies; 63+ messages in thread
From: Leon Romanovsky @ 2025-11-19 20:50 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Tian, Kevin, Bjorn Helgaas, Logan Gunthorpe, Jens Axboe,
Robin Murphy, Joerg Roedel, Will Deacon, Marek Szyprowski,
Andrew Morton, Jonathan Corbet, Sumit Semwal,
Christian König, Kees Cook, Gustavo A. R. Silva,
Ankit Agrawal, Yishai Hadas, Shameer Kolothum, Alex Williamson,
Krishnakant Jaju, Matt Ochs, linux-pci@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
iommu@lists.linux.dev, linux-mm@kvack.org,
linux-doc@vger.kernel.org, linux-media@vger.kernel.org,
dri-devel@lists.freedesktop.org, linaro-mm-sig@lists.linaro.org,
kvm@vger.kernel.org, linux-hardening@vger.kernel.org,
Kasireddy, Vivek
On Wed, Nov 19, 2025 at 03:41:20PM -0400, Jason Gunthorpe wrote:
> On Tue, Nov 18, 2025 at 11:56:14PM +0000, Tian, Kevin wrote:
> > > > > + down_write(&vdev->memory_lock);
> > > > > + list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm)
> > > > > {
> > > > > + if (!get_file_active(&priv->dmabuf->file))
> > > > > + continue;
> > > > > +
> > > > > + dma_resv_lock(priv->dmabuf->resv, NULL);
> > > > > + list_del_init(&priv->dmabufs_elm);
> > > > > + priv->vdev = NULL;
> > > > > + priv->revoked = true;
> > > > > + dma_buf_move_notify(priv->dmabuf);
> > > > > + dma_resv_unlock(priv->dmabuf->resv);
> > > > > + vfio_device_put_registration(&vdev->vdev);
> > > > > + fput(priv->dmabuf->file);
> > > >
> > > > dma_buf_put(priv->dmabuf), consistent with other places.
> > >
> > > Someone else said this, I don't agree, the above got the get via
> > >
> > > get_file_active() instead of a dma_buf version..
> > >
> > > So we should pair with get_file_active() vs fput().
> > >
> > > Christian rejected the idea of adding a dmabuf wrapper for
> > > get_file_active(), oh well.
> >
> > Okay then vfio_pci_dma_buf_move() should be changed. It uses
> > get_file_active() to pair dma_buf_put().
>
> Makes sense, Leon can you fix it?
Sure,
diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c
index e7511cad8e06..c67c1ca7e4bf 100644
--- a/drivers/vfio/pci/vfio_pci_dmabuf.c
+++ b/drivers/vfio/pci/vfio_pci_dmabuf.c
@@ -300,7 +300,7 @@ void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
dma_buf_move_notify(priv->dmabuf);
dma_resv_unlock(priv->dmabuf->resv);
}
- dma_buf_put(priv->dmabuf);
+ fput(priv->dmabuf->file);
}
}
>
> Thanks,
> Jason
* [PATCH v8 11/11] vfio/nvgrace: Support get_dmabuf_phys
2025-11-11 9:57 [PATCH v8 00/11] vfio/pci: Allow MMIO regions to be exported through dma-buf Leon Romanovsky
` (9 preceding siblings ...)
2025-11-11 9:57 ` [PATCH v8 10/11] vfio/pci: Add dma-buf export support for MMIO regions Leon Romanovsky
@ 2025-11-11 9:57 ` Leon Romanovsky
2025-11-18 7:34 ` Tian, Kevin
2025-11-18 7:59 ` Ankit Agrawal
10 siblings, 2 replies; 63+ messages in thread
From: Leon Romanovsky @ 2025-11-11 9:57 UTC (permalink / raw)
To: Bjorn Helgaas, Logan Gunthorpe, Jens Axboe, Robin Murphy,
Joerg Roedel, Will Deacon, Marek Szyprowski, Jason Gunthorpe,
Leon Romanovsky, Andrew Morton, Jonathan Corbet, Sumit Semwal,
Christian König, Kees Cook, Gustavo A. R. Silva,
Ankit Agrawal, Yishai Hadas, Shameer Kolothum, Kevin Tian,
Alex Williamson
Cc: Krishnakant Jaju, Matt Ochs, linux-pci, linux-kernel, linux-block,
iommu, linux-mm, linux-doc, linux-media, dri-devel, linaro-mm-sig,
kvm, linux-hardening, Alex Mastro, Nicolin Chen
From: Jason Gunthorpe <jgg@nvidia.com>
Call vfio_pci_core_fill_phys_vec() with the proper physical ranges for the
synthetic BAR 2 and BAR 4 regions. Otherwise use the normal flow based on
the PCI bar.
This demonstrates a DMABUF that follows the region info report to only
allow mapping parts of the region that are mmapable. Since the BAR is
power-of-two sized and the "CXL" region is just page aligned, there can
be a padding region at the end that is not mmapped or passed into the
DMABUF.
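
For illustration, with made-up numbers: if usemem has memlength = 96 GiB
but is exposed through a 128 GiB power-of-two BAR, the top 32 GiB is
padding. Because nvgrace_get_dmabuf_phys() below hands mem_region->memlength
(not the BAR size) to vfio_pci_core_fill_phys_vec(), any dma_range whose
offset + length reaches past 96 GiB fails the end > max_addr check and the
ioctl returns -EINVAL.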
The "CXL" ranges that are remapped into BAR 2 and BAR 4 areas are not PCI
MMIO, they actually run over the CXL-like coherent interconnect and for
the purposes of DMA behave identically to DRAM. We don't try to model this
distinction between true PCI BAR memory that takes a real PCI path and the
"CXL" memory that takes a different path in the p2p framework for now.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Alex Mastro <amastro@fb.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
drivers/vfio/pci/nvgrace-gpu/main.c | 56 +++++++++++++++++++++++++++++++++++++
1 file changed, 56 insertions(+)
diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-gpu/main.c
index e346392b72f6..1a15c928f067 100644
--- a/drivers/vfio/pci/nvgrace-gpu/main.c
+++ b/drivers/vfio/pci/nvgrace-gpu/main.c
@@ -7,6 +7,7 @@
#include <linux/vfio_pci_core.h>
#include <linux/delay.h>
#include <linux/jiffies.h>
+#include <linux/pci-p2pdma.h>
/*
* The device memory usable to the workloads running in the VM is cached
@@ -683,6 +684,54 @@ nvgrace_gpu_write(struct vfio_device *core_vdev,
return vfio_pci_core_write(core_vdev, buf, count, ppos);
}
+static int nvgrace_get_dmabuf_phys(struct vfio_pci_core_device *core_vdev,
+ struct p2pdma_provider **provider,
+ unsigned int region_index,
+ struct dma_buf_phys_vec *phys_vec,
+ struct vfio_region_dma_range *dma_ranges,
+ size_t nr_ranges)
+{
+ struct nvgrace_gpu_pci_core_device *nvdev = container_of(
+ core_vdev, struct nvgrace_gpu_pci_core_device, core_device);
+ struct pci_dev *pdev = core_vdev->pdev;
+ struct mem_region *mem_region = NULL;
+
+ if (nvdev->resmem.memlength && region_index == RESMEM_REGION_INDEX) {
+ /*
+ * The P2P properties of the non-BAR memory is the same as the
+ * BAR memory, so just use the provider for index 0. Someday
+ * when CXL gets P2P support we could create CXLish providers
+ * for the non-BAR memory.
+ */
+ mem_region = &nvdev->resmem;
+ } else if (region_index == USEMEM_REGION_INDEX) {
+ /*
+ * This is actually cachable memory and isn't treated as P2P in
+ * the chip. For now we have no way to push cachable memory
+ * through everything and the Grace HW doesn't care what caching
+ * attribute is programmed into the SMMU. So use BAR 0.
+ */
+ mem_region = &nvdev->usemem;
+ }
+
+ if (mem_region) {
+ *provider = pcim_p2pdma_provider(pdev, 0);
+ if (!*provider)
+ return -EINVAL;
+ return vfio_pci_core_fill_phys_vec(phys_vec, dma_ranges,
+ nr_ranges,
+ mem_region->memphys,
+ mem_region->memlength);
+ }
+
+ return vfio_pci_core_get_dmabuf_phys(core_vdev, provider, region_index,
+ phys_vec, dma_ranges, nr_ranges);
+}
+
+static const struct vfio_pci_device_ops nvgrace_gpu_pci_dev_ops = {
+ .get_dmabuf_phys = nvgrace_get_dmabuf_phys,
+};
+
static const struct vfio_device_ops nvgrace_gpu_pci_ops = {
.name = "nvgrace-gpu-vfio-pci",
.init = vfio_pci_core_init_dev,
@@ -703,6 +752,10 @@ static const struct vfio_device_ops nvgrace_gpu_pci_ops = {
.detach_ioas = vfio_iommufd_physical_detach_ioas,
};
+static const struct vfio_pci_device_ops nvgrace_gpu_pci_dev_core_ops = {
+ .get_dmabuf_phys = vfio_pci_core_get_dmabuf_phys,
+};
+
static const struct vfio_device_ops nvgrace_gpu_pci_core_ops = {
.name = "nvgrace-gpu-vfio-pci-core",
.init = vfio_pci_core_init_dev,
@@ -965,6 +1018,9 @@ static int nvgrace_gpu_probe(struct pci_dev *pdev,
memphys, memlength);
if (ret)
goto out_put_vdev;
+ nvdev->core_device.pci_ops = &nvgrace_gpu_pci_dev_ops;
+ } else {
+ nvdev->core_device.pci_ops = &nvgrace_gpu_pci_dev_core_ops;
}
ret = vfio_pci_core_register_device(&nvdev->core_device);
--
2.51.1
* RE: [PATCH v8 11/11] vfio/nvgrace: Support get_dmabuf_phys
2025-11-11 9:57 ` [PATCH v8 11/11] vfio/nvgrace: Support get_dmabuf_phys Leon Romanovsky
@ 2025-11-18 7:34 ` Tian, Kevin
2025-11-18 7:59 ` Ankit Agrawal
1 sibling, 0 replies; 63+ messages in thread
From: Tian, Kevin @ 2025-11-18 7:34 UTC (permalink / raw)
To: Leon Romanovsky, Bjorn Helgaas, Logan Gunthorpe, Jens Axboe,
Robin Murphy, Joerg Roedel, Will Deacon, Marek Szyprowski,
Jason Gunthorpe, Andrew Morton, Jonathan Corbet, Sumit Semwal,
Christian König, Kees Cook, Gustavo A. R. Silva,
Ankit Agrawal, Yishai Hadas, Shameer Kolothum, Alex Williamson
Cc: Krishnakant Jaju, Matt Ochs, linux-pci@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
iommu@lists.linux.dev, linux-mm@kvack.org,
linux-doc@vger.kernel.org, linux-media@vger.kernel.org,
dri-devel@lists.freedesktop.org, linaro-mm-sig@lists.linaro.org,
kvm@vger.kernel.org, linux-hardening@vger.kernel.org, Alex Mastro,
Nicolin Chen
> From: Leon Romanovsky <leon@kernel.org>
> Sent: Tuesday, November 11, 2025 5:58 PM
>
> From: Jason Gunthorpe <jgg@nvidia.com>
>
> Call vfio_pci_core_fill_phys_vec() with the proper physical ranges for the
> synthetic BAR 2 and BAR 4 regions. Otherwise use the normal flow based on
> the PCI bar.
>
> This demonstrates a DMABUF that follows the region info report to only
> allow mapping parts of the region that are mmapable. Since the BAR is
> power-of-two sized and the "CXL" region is just page aligned, there can
> be a padding region at the end that is not mmapped or passed into the
> DMABUF.
>
> The "CXL" ranges that are remapped into BAR 2 and BAR 4 areas are not PCI
> MMIO, they actually run over the CXL-like coherent interconnect and for
> the purposes of DMA behave identically to DRAM. We don't try to model this
> distinction between true PCI BAR memory that takes a real PCI path and the
> "CXL" memory that takes a different path in the p2p framework for now.
>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> Tested-by: Alex Mastro <amastro@fb.com>
> Tested-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
* Re: [PATCH v8 11/11] vfio/nvgrace: Support get_dmabuf_phys
2025-11-11 9:57 ` [PATCH v8 11/11] vfio/nvgrace: Support get_dmabuf_phys Leon Romanovsky
2025-11-18 7:34 ` Tian, Kevin
@ 2025-11-18 7:59 ` Ankit Agrawal
2025-11-18 14:30 ` Jason Gunthorpe
1 sibling, 1 reply; 63+ messages in thread
From: Ankit Agrawal @ 2025-11-18 7:59 UTC (permalink / raw)
To: Leon Romanovsky, Bjorn Helgaas, Logan Gunthorpe, Jens Axboe,
Robin Murphy, Joerg Roedel, Will Deacon, Marek Szyprowski,
Jason Gunthorpe, Andrew Morton, Jonathan Corbet, Sumit Semwal,
Christian König, Kees Cook, Gustavo A. R. Silva,
Yishai Hadas, Shameer Kolothum, Kevin Tian, Alex Williamson
Cc: Krishnakant Jaju, Matt Ochs, linux-pci@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
iommu@lists.linux.dev, linux-mm@kvack.org,
linux-doc@vger.kernel.org, linux-media@vger.kernel.org,
dri-devel@lists.freedesktop.org, linaro-mm-sig@lists.linaro.org,
kvm@vger.kernel.org, linux-hardening@vger.kernel.org, Alex Mastro,
Nicolin Chen
+ if (nvdev->resmem.memlength && region_index == RESMEM_REGION_INDEX) {
+ /*
+ * The P2P properties of the non-BAR memory is the same as the
+ * BAR memory, so just use the provider for index 0. Someday
+ * when CXL gets P2P support we could create CXLish providers
+ * for the non-BAR memory.
+ */
+ mem_region = &nvdev->resmem;
+ } else if (region_index == USEMEM_REGION_INDEX) {
+ /*
+ * This is actually cachable memory and isn't treated as P2P in
+ * the chip. For now we have no way to push cachable memory
+ * through everything and the Grace HW doesn't care what caching
+ * attribute is programmed into the SMMU. So use BAR 0.
+ */
+ mem_region = &nvdev->usemem;
+ }
+
Can we replace this with nvgrace_gpu_memregion()?
* Re: [PATCH v8 11/11] vfio/nvgrace: Support get_dmabuf_phys
2025-11-18 7:59 ` Ankit Agrawal
@ 2025-11-18 14:30 ` Jason Gunthorpe
0 siblings, 0 replies; 63+ messages in thread
From: Jason Gunthorpe @ 2025-11-18 14:30 UTC (permalink / raw)
To: Ankit Agrawal
Cc: Leon Romanovsky, Bjorn Helgaas, Logan Gunthorpe, Jens Axboe,
Robin Murphy, Joerg Roedel, Will Deacon, Marek Szyprowski,
Andrew Morton, Jonathan Corbet, Sumit Semwal,
Christian König, Kees Cook, Gustavo A. R. Silva,
Yishai Hadas, Shameer Kolothum, Kevin Tian, Alex Williamson,
Krishnakant Jaju, Matt Ochs, linux-pci@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
iommu@lists.linux.dev, linux-mm@kvack.org,
linux-doc@vger.kernel.org, linux-media@vger.kernel.org,
dri-devel@lists.freedesktop.org, linaro-mm-sig@lists.linaro.org,
kvm@vger.kernel.org, linux-hardening@vger.kernel.org, Alex Mastro,
Nicolin Chen
On Tue, Nov 18, 2025 at 07:59:20AM +0000, Ankit Agrawal wrote:
> + if (nvdev->resmem.memlength && region_index == RESMEM_REGION_INDEX) {
> + /*
> + * The P2P properties of the non-BAR memory is the same as the
> + * BAR memory, so just use the provider for index 0. Someday
> + * when CXL gets P2P support we could create CXLish providers
> + * for the non-BAR memory.
> + */
> + mem_region = &nvdev->resmem;
> + } else if (region_index == USEMEM_REGION_INDEX) {
> + /*
> + * This is actually cachable memory and isn't treated as P2P in
> + * the chip. For now we have no way to push cachable memory
> + * through everything and the Grace HW doesn't care what caching
> + * attribute is programmed into the SMMU. So use BAR 0.
> + */
> + mem_region = &nvdev->usemem;
> + }
> +
>
> Can we replace this with nvgrace_gpu_memregion()?
Yes, looks like
But we need to preserve the comments above as well somehow.
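
Assuming nvgrace_gpu_memregion() keeps roughly its current shape
(returning the synthetic mem_region for the RESMEM/USEMEM indexes, or
NULL otherwise; that signature is an assumption on my part), a sketch
that folds the two comments in could look like:

	/* sketch only */
	mem_region = nvgrace_gpu_memregion(region_index, nvdev);
	if (mem_region) {
		/*
		 * RESMEM: the P2P properties of the non-BAR memory are the
		 * same as the BAR memory, so just use the provider for
		 * index 0. USEMEM: actually cachable memory, not treated as
		 * P2P in the chip, and the Grace HW doesn't care what caching
		 * attribute is programmed into the SMMU, so BAR 0's provider
		 * works here as well.
		 */
		*provider = pcim_p2pdma_provider(pdev, 0);
		if (!*provider)
			return -EINVAL;
		return vfio_pci_core_fill_phys_vec(phys_vec, dma_ranges,
						   nr_ranges,
						   mem_region->memphys,
						   mem_region->memlength);
	}

	return vfio_pci_core_get_dmabuf_phys(core_vdev, provider, region_index,
					     phys_vec, dma_ranges, nr_ranges);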
Jason