public inbox for linux-media@vger.kernel.org
* [RFC v2 PATCH 00/10] vfio/pci: Add mmap() for DMABUFs
@ 2026-03-12 18:45 Matt Evans
  2026-03-12 18:45 ` [RFC v2 PATCH 01/10] vfio/pci: Set up VFIO barmap before creating a DMABUF Matt Evans
                   ` (10 more replies)
  0 siblings, 11 replies; 17+ messages in thread
From: Matt Evans @ 2026-03-12 18:45 UTC (permalink / raw)
  To: Alex Williamson, Leon Romanovsky, Jason Gunthorpe, Alex Mastro,
	Mahmoud Adam, David Matlack
  Cc: Björn Töpel, Sumit Semwal, Christian König,
	Kevin Tian, Ankit Agrawal, Pranjal Shrivastava, Alistair Popple,
	Vivek Kasireddy, linux-kernel, linux-media, dri-devel,
	linaro-mm-sig, kvm

Hi all,


There were various suggestions in the September 2025 thread "[TECH
TOPIC] vfio, iommufd: Enabling user space drivers to vend more
granular access to client processes" [0], and LPC discussions, around
improving the situation for multi-process userspace driver designs.
This RFC series implements some of these ideas.

(Thanks for feedback on v1!  Revised series, with changes noted
inline.)

Background: Multi-process USDs
==============================

The userspace driver scenario discussed in that thread involves a
primary process driving a PCIe function through VFIO/iommufd, which
manages the function-wide ownership/lifecycle.  The function is
designed to provide multiple distinct programming interfaces (for
example, several independent MMIO register frames in one function),
and the primary process delegates control of these interfaces to
multiple independent client processes (which do the actual work).
This scenario clearly relies on a HW design that provides appropriate
isolation between the programming interfaces.

The two key needs are:

 1.  Mechanisms to safely delegate a subset of the device MMIO
     resources to a client process without over-sharing wider access
     (or influence over whole-device activities, such as reset).

 2.  Mechanisms to allow a client process to do its own iommufd
     management w.r.t. its address space, in a way that's isolated
     from DMA relating to other clients.


mmap() of VFIO DMABUFs
======================

This RFC addresses #1 in "vfio/pci: Support mmap() of a VFIO DMABUF",
implementing the proposals in [0] to add mmap() support to the
existing VFIO DMABUF exporter.

This enables a userspace driver to define DMABUF ranges corresponding
to sub-ranges of a BAR, and grant a given client (via a shared fd)
the capability to access (only) those sub-ranges.  The VFIO device fds
would be kept private to the primary process.  All the client can do
with the shared DMABUF fd is map (or iomap via iommufd) that specific
subset of resources, so the impact of bugs/malice is contained.

 (We'll follow up on #2 separately, as a related-but-distinct problem.
  PASIDs are one way to achieve per-client isolation of DMA; another
  could be sharing of a single IOVA space via 'constrained' iommufds.)


New in v2: To achieve this, the existing VFIO BAR mmap() path is
converted to use DMABUFs behind the scenes, in "vfio/pci: Convert BAR
mmap() to use a DMABUF" plus new helper functions, as Jason/Christian
suggested in the v1 discussion [3].

This means:

 - Both regular and new DMABUF BAR mappings share the same vm_ops,
   i.e.  mmap()ing DMABUFs is a smaller change on top of the existing
   mmap().

 - The zapping of mappings occurs via vfio_pci_dma_buf_move(), and the
   vfio_pci_zap_bars() originally paired with the _move()s can go
   away.  Each DMABUF has a unique address_space.

 - It's a step towards future iommufd VFIO Type1 emulation
   implementing P2P, since iommufd can now get a DMABUF from a VA that
   it's mapping for IO; a VMA's vm_file is that of the backing
   DMABUF.


Revocation/reclaim
==================

Mapping a BAR subset is useful, but the lifetime of access granted to
a client needs to be managed well.  For example, a protocol between
the primary process and the client can indicate when the client is
done, and when it's safe to reuse the resources elsewhere, but cleanup
can't practically rely on the client's cooperation.

For robustness, we enable the driver to make the resources
guaranteed-inaccessible when it chooses, so that it can re-assign them
to other uses in future.

"vfio/pci: Permanently revoke a DMABUF on request" adds a new VFIO
device fd ioctl, VFIO_DEVICE_PCI_DMABUF_REVOKE.  This takes a DMABUF
fd parameter previously exported (from that device!) and permanently
revokes the DMABUF.  This notifies/detaches importers, zaps PTEs for
any mappings, and guarantees no future attachment/import/map/access is
possible by any means.

A primary driver process would use this operation when the client's
tenure ends to reclaim "loaned-out" MMIO interfaces, at which point
the interfaces could be safely re-used.

New in v2: the ioctl() is on the VFIO device fd, rather than the
DMABUF fd.  A DMABUF is revoked using code common to
vfio_pci_dma_buf_move(), selectively zapping mappings (after waiting
for completion of the dma_buf_invalidate_mappings() request).


BAR mapping access attributes
=============================

Inspired by Alex [Mastro] and Jason's comments in [0] and Mahmoud's
work in [1] with the goal of controlling CPU access attributes for
VFIO BAR mappings (e.g. WC), we can decorate DMABUFs with access
attributes that are then used by a mapping's PTEs.

I've proposed reserving a field in struct
vfio_device_feature_dma_buf's flags to specify an attribute for its
ranges.  Although that keeps the (UAPI) struct unchanged, it means all
ranges in a DMABUF share the same attribute.  A single attribute per
mmap() seems a logical, reasonable relation, and an application can
still create multiple DMABUFs to describe any BAR layout and mix of
attributes.


Tests
=====

(Still sharing the [RFC ONLY] userspace test/demo program for context,
not for merge.)

It illustrates and tests various map/revoke cases, but doesn't use
the existing VFIO selftest framework and relies on a (tweaked) QEMU
EDU function.  I'm (still) working on integrating the scenarios into
the existing VFIO selftests.

This code has been tested with DMABUFs of single and multiple ranges,
aliasing mmap()s, aliasing ranges across DMABUFs, vm_pgoff > 0,
revocation, shutdown/cleanup scenarios, and hugepage mappings; all
seem to work correctly.  I've also lightly tested WC mappings (by
observing that the resulting PTEs have the correct attributes...).


Fin
===

v2 is based on next-20260310 (to build on Leon's recent series
"vfio: Wait for dma-buf invalidation to complete" [2]).


Please share your thoughts!  I'd like to de-RFC if we feel this
approach is now sound.


Many thanks,


Matt



References:

[0]: https://lore.kernel.org/linux-iommu/20250918214425.2677057-1-amastro@fb.com/
[1]: https://lore.kernel.org/all/20250804104012.87915-1-mngyadam@amazon.de/
[2]: https://lore.kernel.org/linux-iommu/20260205-nocturnal-poetic-chamois-f566ad@houat/T/#m310cd07011e3a1461b6fda45e3f9b886ba76571a
[3]: https://lore.kernel.org/all/20260226202211.929005-1-mattev@meta.com/

--------------------------------------------------------------------------------
Changelog:

v2:  Respin based on the feedback/suggestions:

- Transform the existing VFIO BAR mmap path to also use DMABUFs behind
  the scenes, and then simply share that code for explicitly-mapped
  DMABUFs.

- Refactor the export itself out of vfio_pci_core_feature_dma_buf,
  into code shared by a new vfio_pci_core_mmap_prep_dmabuf helper used
  by the regular VFIO mmap to create a DMABUF.

- Revoke buffers using a VFIO device fd ioctl.

v1: https://lore.kernel.org/all/20260226202211.929005-1-mattev@meta.com/


Matt Evans (10):
  vfio/pci: Set up VFIO barmap before creating a DMABUF
  vfio/pci: Clean up DMABUFs before disabling function
  vfio/pci: Add helper to look up PFNs for DMABUFs
  vfio/pci: Add a helper to create a DMABUF for a BAR-map VMA
  vfio/pci: Convert BAR mmap() to use a DMABUF
  vfio/pci: Remove vfio_pci_zap_bars()
  vfio/pci: Support mmap() of a VFIO DMABUF
  vfio/pci: Permanently revoke a DMABUF on request
  vfio/pci: Add mmap() attributes to DMABUF feature
  [RFC ONLY] selftests: vfio: Add standalone vfio_dmabuf_mmap_test

 drivers/vfio/pci/Kconfig                      |   3 +-
 drivers/vfio/pci/Makefile                     |   3 +-
 drivers/vfio/pci/vfio_pci_config.c            |  18 +-
 drivers/vfio/pci/vfio_pci_core.c              | 123 +--
 drivers/vfio/pci/vfio_pci_dmabuf.c            | 425 +++++++--
 drivers/vfio/pci/vfio_pci_priv.h              |  46 +-
 include/uapi/linux/vfio.h                     |  42 +-
 tools/testing/selftests/vfio/Makefile         |   1 +
 .../vfio/standalone/vfio_dmabuf_mmap_test.c   | 837 ++++++++++++++++++
 9 files changed, 1339 insertions(+), 159 deletions(-)
 create mode 100644 tools/testing/selftests/vfio/standalone/vfio_dmabuf_mmap_test.c

-- 
2.47.3


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [RFC v2 PATCH 01/10] vfio/pci: Set up VFIO barmap before creating a DMABUF
  2026-03-12 18:45 [RFC v2 PATCH 00/10] vfio/pci: Add mmap() for DMABUFs Matt Evans
@ 2026-03-12 18:45 ` Matt Evans
  2026-03-12 18:46 ` [RFC v2 PATCH 02/10] vfio/pci: Clean up DMABUFs before disabling function Matt Evans
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 17+ messages in thread
From: Matt Evans @ 2026-03-12 18:45 UTC (permalink / raw)
  To: Alex Williamson, Leon Romanovsky, Jason Gunthorpe, Alex Mastro,
	Mahmoud Adam, David Matlack
  Cc: Björn Töpel, Sumit Semwal, Christian König,
	Kevin Tian, Ankit Agrawal, Pranjal Shrivastava, Alistair Popple,
	Vivek Kasireddy, linux-kernel, linux-media, dri-devel,
	linaro-mm-sig, kvm

A DMABUF exports access to BAR resources, which need to be requested
before the DMABUF is handed out.  Usually the resources are requested
when the barmap is set up as the VFIO device fd is mmap()ed, but
there's no guarantee that happens before a DMABUF is created.

Set up the barmap (and so request resources) in the DMABUF-creation
path.

Fixes: 5d74781ebc86c ("vfio/pci: Add dma-buf export support for MMIO regions")
Signed-off-by: Matt Evans <mattev@meta.com>
---
 drivers/vfio/pci/vfio_pci_dmabuf.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c
index 3a803923141b..44558cc2948e 100644
--- a/drivers/vfio/pci/vfio_pci_dmabuf.c
+++ b/drivers/vfio/pci/vfio_pci_dmabuf.c
@@ -269,6 +269,17 @@ int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
 		goto err_free_priv;
 	}
 
+	/*
+	 * Just like the vfio_pci_core_mmap() path, we need to ensure
+	 * PCI regions have been requested before returning DMABUFs
+	 * that reference them.  It's possible to create a DMABUF for
+	 * a BAR without the BAR having already been mmap()ed.  The
+	 * barmap setup requests the regions for us:
+	 */
+	ret = vfio_pci_core_setup_barmap(vdev, get_dma_buf.region_index);
+	if (ret)
+		goto err_free_phys;
+
 	priv->vdev = vdev;
 	priv->nr_ranges = get_dma_buf.nr_ranges;
 	priv->size = length;
-- 
2.47.3



* [RFC v2 PATCH 02/10] vfio/pci: Clean up DMABUFs before disabling function
  2026-03-12 18:45 [RFC v2 PATCH 00/10] vfio/pci: Add mmap() for DMABUFs Matt Evans
  2026-03-12 18:45 ` [RFC v2 PATCH 01/10] vfio/pci: Set up VFIO barmap before creating a DMABUF Matt Evans
@ 2026-03-12 18:46 ` Matt Evans
  2026-03-12 18:46 ` [RFC v2 PATCH 03/10] vfio/pci: Add helper to look up PFNs for DMABUFs Matt Evans
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 17+ messages in thread
From: Matt Evans @ 2026-03-12 18:46 UTC (permalink / raw)
  To: Alex Williamson, Leon Romanovsky, Jason Gunthorpe, Alex Mastro,
	Mahmoud Adam, David Matlack
  Cc: Björn Töpel, Sumit Semwal, Christian König,
	Kevin Tian, Ankit Agrawal, Pranjal Shrivastava, Alistair Popple,
	Vivek Kasireddy, linux-kernel, linux-media, dri-devel,
	linaro-mm-sig, kvm

On device shutdown, make vfio_pci_core_close_device() call
vfio_pci_dma_buf_cleanup() before the function is disabled via
vfio_pci_core_disable().  This ensures that all access via DMABUFs is
revoked before the function's BARs become inaccessible.

This fixes an issue where, if the function is disabled first, a tiny
window exists in which the function's MSE (Memory Space Enable) bit
is cleared yet its BARs could still be accessed via the DMABUF.  The
resources would also be freed and up for grabs by a different driver.

Fixes: 5d74781ebc86c ("vfio/pci: Add dma-buf export support for MMIO regions")
Signed-off-by: Matt Evans <mattev@meta.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index d43745fe4c84..f9ed3374d268 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -734,10 +734,10 @@ void vfio_pci_core_close_device(struct vfio_device *core_vdev)
 #if IS_ENABLED(CONFIG_EEH)
 	eeh_dev_release(vdev->pdev);
 #endif
-	vfio_pci_core_disable(vdev);
-
 	vfio_pci_dma_buf_cleanup(vdev);
 
+	vfio_pci_core_disable(vdev);
+
 	mutex_lock(&vdev->igate);
 	vfio_pci_eventfd_replace_locked(vdev, &vdev->err_trigger, NULL);
 	vfio_pci_eventfd_replace_locked(vdev, &vdev->req_trigger, NULL);
-- 
2.47.3



* [RFC v2 PATCH 03/10] vfio/pci: Add helper to look up PFNs for DMABUFs
  2026-03-12 18:45 [RFC v2 PATCH 00/10] vfio/pci: Add mmap() for DMABUFs Matt Evans
  2026-03-12 18:45 ` [RFC v2 PATCH 01/10] vfio/pci: Set up VFIO barmap before creating a DMABUF Matt Evans
  2026-03-12 18:46 ` [RFC v2 PATCH 02/10] vfio/pci: Clean up DMABUFs before disabling function Matt Evans
@ 2026-03-12 18:46 ` Matt Evans
  2026-03-12 18:46 ` [RFC v2 PATCH 04/10] vfio/pci: Add a helper to create a DMABUF for a BAR-map VMA Matt Evans
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 17+ messages in thread
From: Matt Evans @ 2026-03-12 18:46 UTC (permalink / raw)
  To: Alex Williamson, Leon Romanovsky, Jason Gunthorpe, Alex Mastro,
	Mahmoud Adam, David Matlack
  Cc: Björn Töpel, Sumit Semwal, Christian König,
	Kevin Tian, Ankit Agrawal, Pranjal Shrivastava, Alistair Popple,
	Vivek Kasireddy, linux-kernel, linux-media, dri-devel,
	linaro-mm-sig, kvm

Add a helper, vfio_pci_dma_buf_find_pfn(), which a VMA fault handler
can use to find a PFN.

This supports multi-range DMABUFs, which typically would be used to
represent scattered spans but might even represent overlapping or
aliasing spans of PFNs.

Because this is intended to be used in vfio_pci_core.c, we also need
to expose struct vfio_pci_dma_buf in the vfio_pci_priv.h header.

Signed-off-by: Matt Evans <mattev@meta.com>
---
 drivers/vfio/pci/vfio_pci_dmabuf.c | 102 +++++++++++++++++++++++++----
 drivers/vfio/pci/vfio_pci_priv.h   |  19 ++++++
 2 files changed, 108 insertions(+), 13 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c
index 44558cc2948e..63140528dbea 100644
--- a/drivers/vfio/pci/vfio_pci_dmabuf.c
+++ b/drivers/vfio/pci/vfio_pci_dmabuf.c
@@ -9,19 +9,6 @@
 
 MODULE_IMPORT_NS("DMA_BUF");
 
-struct vfio_pci_dma_buf {
-	struct dma_buf *dmabuf;
-	struct vfio_pci_core_device *vdev;
-	struct list_head dmabufs_elm;
-	size_t size;
-	struct phys_vec *phys_vec;
-	struct p2pdma_provider *provider;
-	u32 nr_ranges;
-	struct kref kref;
-	struct completion comp;
-	u8 revoked : 1;
-};
-
 static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf,
 				   struct dma_buf_attachment *attachment)
 {
@@ -106,6 +93,95 @@ static const struct dma_buf_ops vfio_pci_dmabuf_ops = {
 	.release = vfio_pci_dma_buf_release,
 };
 
+int vfio_pci_dma_buf_find_pfn(struct vfio_pci_dma_buf *vpdmabuf,
+			      struct vm_area_struct *vma,
+			      unsigned long address,
+			      unsigned int order,
+			      unsigned long *out_pfn)
+{
+	/*
+	 * Given a VMA (start, end, pgoffs) and a fault address,
+	 * search the corresponding DMABUF's phys_vec[] to find the
+	 * range representing the address's offset into the VMA, and
+	 * its PFN.
+	 *
+	 * The phys_vec[] ranges represent contiguous spans of VAs
+	 * upwards from the buffer offset 0; the actual PFNs might be
+	 * in any order, overlap/alias, etc.  Calculate an offset of
+	 * the desired page given VMA start/pgoff and address, then
+	 * search upwards from 0 to find which span contains it.
+	 *
+	 * On success, a valid PFN for a page sized by 'order' is
+	 * returned into out_pfn.
+	 *
+	 * Failure occurs if:
+	 * - The page would cross the edge of the VMA
+	 * - The page isn't entirely contained within a range
+	 * - We find a range, but the final PFN isn't aligned to the
+	 *   requested order.
+	 *
+	 * (Upon failure, the caller is expected to try again with a
+	 * smaller order; the tests above will always succeed for
+	 * order=0 as the limit case.)
+	 *
+	 * It's suboptimal if DMABUFs are created with neighbouring
+	 * ranges that are physically contiguous, since hugepages
+	 * can't straddle range boundaries.  (The construction of the
+	 * ranges vector should merge such ranges.)
+	 */
+
+	const unsigned long pagesize = PAGE_SIZE << order;
+	unsigned long rounded_page_addr = address & ~(pagesize - 1);
+	unsigned long rounded_page_end = rounded_page_addr + pagesize;
+	unsigned long buf_page_offset;
+	unsigned long buf_offset = 0;
+	unsigned int i;
+
+	if (rounded_page_addr < vma->vm_start || rounded_page_end > vma->vm_end)
+		return -EAGAIN;
+
+	if (unlikely(check_add_overflow(rounded_page_addr - vma->vm_start,
+					vma->vm_pgoff << PAGE_SHIFT, &buf_page_offset)))
+		return -EFAULT;
+
+	for (i = 0; i < vpdmabuf->nr_ranges; i++) {
+		unsigned long range_len = vpdmabuf->phys_vec[i].len;
+		unsigned long range_start = vpdmabuf->phys_vec[i].paddr;
+
+		if (buf_page_offset >= buf_offset &&
+		    buf_page_offset + pagesize <= buf_offset + range_len) {
+			/*
+			 * The faulting page is wholly contained
+			 * within the span represented by the range.
+			 * Validate PFN alignment for the order:
+			 */
+			unsigned long pfn = (range_start >> PAGE_SHIFT) +
+				((buf_page_offset - buf_offset) >> PAGE_SHIFT);
+
+			if (IS_ALIGNED(pfn, 1 << order)) {
+				*out_pfn = pfn;
+				return 0;
+			}
+			/* Retry with smaller order */
+			return -EAGAIN;
+		}
+		buf_offset += range_len;
+	}
+
+	/*
+	 * If we get here, the address fell outside of the span
+	 * represented by the (concatenated) ranges.  Setup of a
+	 * mapping must ensure that the VMA is <= the total size of
+	 * the ranges, so this should never happen.  But, if it does,
+	 * force SIGBUS for the access and warn.
+	 */
+	WARN_ONCE(1, "No range for addr 0x%lx, order %u: VMA 0x%lx-0x%lx pgoff 0x%lx, %u ranges, size 0x%zx\n",
+		  address, order, vma->vm_start, vma->vm_end, vma->vm_pgoff,
+		  vpdmabuf->nr_ranges, vpdmabuf->size);
+
+	return -EFAULT;
+}
+
 /*
  * This is a temporary "private interconnect" between VFIO DMABUF and iommufd.
  * It allows the two co-operating drivers to exchange the physical address of
diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
index 27ac280f00b9..5cc8c85a2153 100644
--- a/drivers/vfio/pci/vfio_pci_priv.h
+++ b/drivers/vfio/pci/vfio_pci_priv.h
@@ -23,6 +23,19 @@ struct vfio_pci_ioeventfd {
 	bool			test_mem;
 };
 
+struct vfio_pci_dma_buf {
+	struct dma_buf *dmabuf;
+	struct vfio_pci_core_device *vdev;
+	struct list_head dmabufs_elm;
+	size_t size;
+	struct phys_vec *phys_vec;
+	struct p2pdma_provider *provider;
+	u32 nr_ranges;
+	struct kref kref;
+	struct completion comp;
+	u8 revoked : 1;
+};
+
 bool vfio_pci_intx_mask(struct vfio_pci_core_device *vdev);
 void vfio_pci_intx_unmask(struct vfio_pci_core_device *vdev);
 
@@ -110,6 +123,12 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
 	return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
 }
 
+int vfio_pci_dma_buf_find_pfn(struct vfio_pci_dma_buf *vpdmabuf,
+			      struct vm_area_struct *vma,
+			      unsigned long address,
+			      unsigned int order,
+			      unsigned long *out_pfn);
+
 #ifdef CONFIG_VFIO_PCI_DMABUF
 int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
 				  struct vfio_device_feature_dma_buf __user *arg,
-- 
2.47.3



* [RFC v2 PATCH 04/10] vfio/pci: Add a helper to create a DMABUF for a BAR-map VMA
  2026-03-12 18:45 [RFC v2 PATCH 00/10] vfio/pci: Add mmap() for DMABUFs Matt Evans
                   ` (2 preceding siblings ...)
  2026-03-12 18:46 ` [RFC v2 PATCH 03/10] vfio/pci: Add helper to look up PFNs for DMABUFs Matt Evans
@ 2026-03-12 18:46 ` Matt Evans
  2026-03-18 20:04   ` Alex Williamson
  2026-03-12 18:46 ` [RFC v2 PATCH 05/10] vfio/pci: Convert BAR mmap() to use a DMABUF Matt Evans
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 17+ messages in thread
From: Matt Evans @ 2026-03-12 18:46 UTC (permalink / raw)
  To: Alex Williamson, Leon Romanovsky, Jason Gunthorpe, Alex Mastro,
	Mahmoud Adam, David Matlack
  Cc: Björn Töpel, Sumit Semwal, Christian König,
	Kevin Tian, Ankit Agrawal, Pranjal Shrivastava, Alistair Popple,
	Vivek Kasireddy, linux-kernel, linux-media, dri-devel,
	linaro-mm-sig, kvm

This helper, vfio_pci_core_mmap_prep_dmabuf(), creates a single-range
DMABUF for the purpose of mapping a PCI BAR.  It is used by VFIO's
ordinary mmap() path in a subsequent commit.

This function transfers ownership of the VFIO device fd to the
DMABUF, which fput()s it when the DMABUF is released.
Refactor the existing vfio_pci_core_feature_dma_buf() to split out
export code common to the two paths, VFIO_DEVICE_FEATURE_DMA_BUF and
this new VFIO_BAR mmap().

Signed-off-by: Matt Evans <mattev@meta.com>
---
 drivers/vfio/pci/vfio_pci_dmabuf.c | 131 +++++++++++++++++++++--------
 drivers/vfio/pci/vfio_pci_priv.h   |   4 +
 2 files changed, 102 insertions(+), 33 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c
index 63140528dbea..76db340ba592 100644
--- a/drivers/vfio/pci/vfio_pci_dmabuf.c
+++ b/drivers/vfio/pci/vfio_pci_dmabuf.c
@@ -82,6 +82,8 @@ static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf)
 		up_write(&priv->vdev->memory_lock);
 		vfio_device_put_registration(&priv->vdev->vdev);
 	}
+	if (priv->vfile)
+		fput(priv->vfile);
 	kfree(priv->phys_vec);
 	kfree(priv);
 }
@@ -182,6 +184,41 @@ int vfio_pci_dma_buf_find_pfn(struct vfio_pci_dma_buf *vpdmabuf,
 	return -EFAULT;
 }
 
+static int vfio_pci_dmabuf_export(struct vfio_pci_core_device *vdev,
+				  struct vfio_pci_dma_buf *priv, uint32_t flags,
+				  size_t size, bool status_ok)
+{
+	DEFINE_DMA_BUF_EXPORT_INFO(exp_info);
+
+	if (!vfio_device_try_get_registration(&vdev->vdev))
+		return -ENODEV;
+
+	exp_info.ops = &vfio_pci_dmabuf_ops;
+	exp_info.size = size;
+	exp_info.flags = flags;
+	exp_info.priv = priv;
+
+	priv->dmabuf = dma_buf_export(&exp_info);
+	if (IS_ERR(priv->dmabuf)) {
+		vfio_device_put_registration(&vdev->vdev);
+		return PTR_ERR(priv->dmabuf);
+	}
+
+	kref_init(&priv->kref);
+	init_completion(&priv->comp);
+
+	/* dma_buf_put() now frees priv */
+	INIT_LIST_HEAD(&priv->dmabufs_elm);
+	down_write(&vdev->memory_lock);
+	dma_resv_lock(priv->dmabuf->resv, NULL);
+	priv->revoked = !status_ok;
+	list_add_tail(&priv->dmabufs_elm, &vdev->dmabufs);
+	dma_resv_unlock(priv->dmabuf->resv);
+	up_write(&vdev->memory_lock);
+
+	return 0;
+}
+
 /*
  * This is a temporary "private interconnect" between VFIO DMABUF and iommufd.
  * It allows the two co-operating drivers to exchange the physical address of
@@ -300,7 +337,6 @@ int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
 {
 	struct vfio_device_feature_dma_buf get_dma_buf = {};
 	struct vfio_region_dma_range *dma_ranges;
-	DEFINE_DMA_BUF_EXPORT_INFO(exp_info);
 	struct vfio_pci_dma_buf *priv;
 	size_t length;
 	int ret;
@@ -369,46 +405,20 @@ int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
 	kfree(dma_ranges);
 	dma_ranges = NULL;
 
-	if (!vfio_device_try_get_registration(&vdev->vdev)) {
-		ret = -ENODEV;
+	ret = vfio_pci_dmabuf_export(vdev, priv, get_dma_buf.open_flags,
+				     priv->size,
+				     __vfio_pci_memory_enabled(vdev));
+	if (ret)
 		goto err_free_phys;
-	}
-
-	exp_info.ops = &vfio_pci_dmabuf_ops;
-	exp_info.size = priv->size;
-	exp_info.flags = get_dma_buf.open_flags;
-	exp_info.priv = priv;
-
-	priv->dmabuf = dma_buf_export(&exp_info);
-	if (IS_ERR(priv->dmabuf)) {
-		ret = PTR_ERR(priv->dmabuf);
-		goto err_dev_put;
-	}
-
-	kref_init(&priv->kref);
-	init_completion(&priv->comp);
-
-	/* dma_buf_put() now frees priv */
-	INIT_LIST_HEAD(&priv->dmabufs_elm);
-	down_write(&vdev->memory_lock);
-	dma_resv_lock(priv->dmabuf->resv, NULL);
-	priv->revoked = !__vfio_pci_memory_enabled(vdev);
-	list_add_tail(&priv->dmabufs_elm, &vdev->dmabufs);
-	dma_resv_unlock(priv->dmabuf->resv);
-	up_write(&vdev->memory_lock);
-
 	/*
 	 * dma_buf_fd() consumes the reference, when the file closes the dmabuf
 	 * will be released.
 	 */
 	ret = dma_buf_fd(priv->dmabuf, get_dma_buf.open_flags);
-	if (ret < 0)
-		goto err_dma_buf;
-	return ret;
+	if (ret >= 0)
+		return ret;
 
-err_dma_buf:
 	dma_buf_put(priv->dmabuf);
-err_dev_put:
 	vfio_device_put_registration(&vdev->vdev);
 err_free_phys:
 	kfree(priv->phys_vec);
@@ -419,6 +429,61 @@ int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
 	return ret;
 }
 
+int vfio_pci_core_mmap_prep_dmabuf(struct vfio_pci_core_device *vdev,
+				   struct vm_area_struct *vma,
+				   u64 phys_start,
+				   u64 pgoff,
+				   u64 req_len)
+{
+	struct vfio_pci_dma_buf *priv;
+	const unsigned int nr_ranges = 1;
+	int ret;
+
+	priv = kzalloc(sizeof(*priv), GFP_KERNEL);
+	if (!priv)
+		return -ENOMEM;
+
+	priv->phys_vec = kcalloc(nr_ranges, sizeof(*priv->phys_vec),
+				 GFP_KERNEL);
+	if (!priv->phys_vec) {
+		ret = -ENOMEM;
+		goto err_free_priv;
+	}
+
+	priv->vdev = vdev;
+	priv->nr_ranges = nr_ranges;
+	priv->size = req_len;
+	priv->phys_vec[0].paddr = phys_start + (pgoff << PAGE_SHIFT);
+	priv->phys_vec[0].len = req_len;
+
+	/*
+	 * Creates a DMABUF, adds it to vdev->dmabufs list for
+	 * tracking (meaning cleanup or revocation will zap them), and
+	 * registers with vfio_device:
+	 */
+	ret = vfio_pci_dmabuf_export(vdev, priv, O_CLOEXEC, priv->size, true);
+	if (ret)
+		goto err_free_phys;
+
+	/*
+	 * The VMA gets the DMABUF file so that other users can locate
+	 * the DMABUF via a VA.  Ownership of the original VFIO device
+	 * file being mmap()ed transfers to priv, and is put when the
+	 * DMABUF is released.
+	 */
+	priv->vfile = vma->vm_file;
+	vma->vm_file = priv->dmabuf->file;
+	vma->vm_private_data = priv;
+
+	return 0;
+
+err_free_phys:
+	kfree(priv->phys_vec);
+err_free_priv:
+	kfree(priv);
+	return ret;
+}
+
 void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
 {
 	struct vfio_pci_dma_buf *priv;
diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
index 5cc8c85a2153..5fd3a6e00a0e 100644
--- a/drivers/vfio/pci/vfio_pci_priv.h
+++ b/drivers/vfio/pci/vfio_pci_priv.h
@@ -30,6 +30,7 @@ struct vfio_pci_dma_buf {
 	size_t size;
 	struct phys_vec *phys_vec;
 	struct p2pdma_provider *provider;
+	struct file *vfile;
 	u32 nr_ranges;
 	struct kref kref;
 	struct completion comp;
@@ -128,6 +129,9 @@ int vfio_pci_dma_buf_find_pfn(struct vfio_pci_dma_buf *vpdmabuf,
 			      unsigned long address,
 			      unsigned int order,
 			      unsigned long *out_pfn);
+int vfio_pci_core_mmap_prep_dmabuf(struct vfio_pci_core_device *vdev,
+				   struct vm_area_struct *vma,
+				   u64 phys_start, u64 pgoff, u64 req_len);
 
 #ifdef CONFIG_VFIO_PCI_DMABUF
 int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
-- 
2.47.3



* [RFC v2 PATCH 05/10] vfio/pci: Convert BAR mmap() to use a DMABUF
  2026-03-12 18:45 [RFC v2 PATCH 00/10] vfio/pci: Add mmap() for DMABUFs Matt Evans
                   ` (3 preceding siblings ...)
  2026-03-12 18:46 ` [RFC v2 PATCH 04/10] vfio/pci: Add a helper to create a DMABUF for a BAR-map VMA Matt Evans
@ 2026-03-12 18:46 ` Matt Evans
  2026-03-12 18:46 ` [RFC v2 PATCH 06/10] vfio/pci: Remove vfio_pci_zap_bars() Matt Evans
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 17+ messages in thread
From: Matt Evans @ 2026-03-12 18:46 UTC (permalink / raw)
  To: Alex Williamson, Leon Romanovsky, Jason Gunthorpe, Alex Mastro,
	Mahmoud Adam, David Matlack
  Cc: Björn Töpel, Sumit Semwal, Christian König,
	Kevin Tian, Ankit Agrawal, Pranjal Shrivastava, Alistair Popple,
	Vivek Kasireddy, linux-kernel, linux-media, dri-devel,
	linaro-mm-sig, kvm

Convert the VFIO device fd fops->mmap to create a DMABUF representing
the BAR mapping, and make the VMA fault handler look up PFNs from the
corresponding DMABUF.  This supports future code mmap()ing BAR
DMABUFs, and iommufd work to support Type1 P2P.

First, vfio_pci_core_mmap() uses the new
vfio_pci_core_mmap_prep_dmabuf() helper to export a DMABUF
representing a single BAR range.  Then, the vfio_pci_mmap_huge_fault()
callback is updated to understand revoked buffers, and uses the new
vfio_pci_dma_buf_find_pfn() helper to determine the PFN for a given
fault address.

Now that the VFIO DMABUFs can be mmap()ed, vfio_pci_dma_buf_move() and
vfio_pci_dma_buf_cleanup() need to zap PTEs on revocation and cleanup
paths.

CONFIG_VFIO_PCI_CORE now unconditionally depends on
CONFIG_DMA_SHARED_BUFFER.  CONFIG_VFIO_PCI_DMABUF remains, to
conditionally include support for VFIO_DEVICE_FEATURE_DMA_BUF, and
depends on CONFIG_PCI_P2PDMA.

Signed-off-by: Matt Evans <mattev@meta.com>
---
 drivers/vfio/pci/Kconfig           |  3 +-
 drivers/vfio/pci/Makefile          |  3 +-
 drivers/vfio/pci/vfio_pci_core.c   | 73 ++++++++++++++++++++----------
 drivers/vfio/pci/vfio_pci_dmabuf.c | 14 ++++++
 drivers/vfio/pci/vfio_pci_priv.h   | 11 +----
 5 files changed, 67 insertions(+), 37 deletions(-)

diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index 1e82b44bda1a..bf5c64d1fe22 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -6,6 +6,7 @@ config VFIO_PCI_CORE
 	tristate
 	select VFIO_VIRQFD
 	select IRQ_BYPASS_MANAGER
+	select DMA_SHARED_BUFFER
 
 config VFIO_PCI_INTX
 	def_bool y if !S390
@@ -56,7 +57,7 @@ config VFIO_PCI_ZDEV_KVM
 	  To enable s390x KVM vfio-pci extensions, say Y.
 
 config VFIO_PCI_DMABUF
-	def_bool y if VFIO_PCI_CORE && PCI_P2PDMA && DMA_SHARED_BUFFER
+	def_bool y if PCI_P2PDMA
 
 source "drivers/vfio/pci/mlx5/Kconfig"
 
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index e0a0757dd1d2..bab7a33a2b31 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -1,8 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0-only
 
-vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
+vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o vfio_pci_dmabuf.o
 vfio-pci-core-$(CONFIG_VFIO_PCI_ZDEV_KVM) += vfio_pci_zdev.o
-vfio-pci-core-$(CONFIG_VFIO_PCI_DMABUF) += vfio_pci_dmabuf.o
 obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o
 
 vfio-pci-y := vfio_pci.o
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index f9ed3374d268..41224efa58d8 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1648,18 +1648,6 @@ void vfio_pci_memory_unlock_and_restore(struct vfio_pci_core_device *vdev, u16 c
 	up_write(&vdev->memory_lock);
 }
 
-static unsigned long vma_to_pfn(struct vm_area_struct *vma)
-{
-	struct vfio_pci_core_device *vdev = vma->vm_private_data;
-	int index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
-	u64 pgoff;
-
-	pgoff = vma->vm_pgoff &
-		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
-
-	return (pci_resource_start(vdev->pdev, index) >> PAGE_SHIFT) + pgoff;
-}
-
 vm_fault_t vfio_pci_vmf_insert_pfn(struct vfio_pci_core_device *vdev,
 				   struct vm_fault *vmf,
 				   unsigned long pfn,
@@ -1692,23 +1680,45 @@ static vm_fault_t vfio_pci_mmap_huge_fault(struct vm_fault *vmf,
 					   unsigned int order)
 {
 	struct vm_area_struct *vma = vmf->vma;
-	struct vfio_pci_core_device *vdev = vma->vm_private_data;
-	unsigned long addr = vmf->address & ~((PAGE_SIZE << order) - 1);
-	unsigned long pgoff = (addr - vma->vm_start) >> PAGE_SHIFT;
-	unsigned long pfn = vma_to_pfn(vma) + pgoff;
+	struct vfio_pci_dma_buf *priv = vma->vm_private_data;
+	struct vfio_pci_core_device *vdev;
+	unsigned long pfn;
 	vm_fault_t ret = VM_FAULT_FALLBACK;
+	int pres;
+
+	vdev = READ_ONCE(priv->vdev);
 
-	if (is_aligned_for_order(vma, addr, pfn, order)) {
-		scoped_guard(rwsem_read, &vdev->memory_lock)
-			ret = vfio_pci_vmf_insert_pfn(vdev, vmf, pfn, order);
+	/*
+	 * A fault might occur after vfio_pci_dma_buf_cleanup() has
+	 * revoked and destroyed the vdev's DMABUFs and cleared
+	 * priv->vdev.  After creation, priv->vdev is only ever
+	 * written during cleanup.
+	 */
+	if (!vdev)
+		return VM_FAULT_SIGBUS;
+
+	pres = vfio_pci_dma_buf_find_pfn(priv, vma, vmf->address, order, &pfn);
+
+	if (pres == 0) {
+		scoped_guard(rwsem_read, &vdev->memory_lock) {
+			/*
+			 * A buffer's revocation/unmap and status
+			 * change occurs whilst holding memory_lock,
+			 * so protects against racing faults.
+			 */
+			if (priv->revoked)
+				ret = VM_FAULT_SIGBUS;
+			else
+				ret = vfio_pci_vmf_insert_pfn(vdev, vmf, pfn, order);
+		}
+	} else if (pres != -EAGAIN) {
+		ret = VM_FAULT_SIGBUS;
 	}
 
 	dev_dbg_ratelimited(&vdev->pdev->dev,
-			   "%s(,order = %d) BAR %ld page offset 0x%lx: 0x%x\n",
-			    __func__, order,
-			    vma->vm_pgoff >>
-				(VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT),
-			    pgoff, (unsigned int)ret);
+			    "%s(order = %d) PFN 0x%lx, VA 0x%lx, pgoff 0x%lx: 0x%x\n",
+			    __func__, order, pfn, vmf->address, vma->vm_pgoff,
+			    (unsigned int)ret);
 
 	return ret;
 }
@@ -1773,7 +1783,20 @@ int vfio_pci_core_mmap(struct vfio_device *core_vdev, struct vm_area_struct *vma
 	if (ret)
 		return ret;
 
-	vma->vm_private_data = vdev;
+	/*
+	 * Create a DMABUF with a single range corresponding to this
+	 * mapping, and wire it into vma->vm_private_data.  The VMA's
+	 * vm_file becomes that of the DMABUF, and the DMABUF takes
+	 * ownership of the VFIO device file (put upon DMABUF
+	 * release).  This maintains the behaviour of a live VMA
+	 * mapping holding the VFIO device file open.
+	 */
+	ret = vfio_pci_core_mmap_prep_dmabuf(vdev, vma,
+					     pci_resource_start(pdev, index),
+					     pgoff, req_len);
+	if (ret)
+		return ret;
+
 	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
 	vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);
 
diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c
index 76db340ba592..197f50365ee1 100644
--- a/drivers/vfio/pci/vfio_pci_dmabuf.c
+++ b/drivers/vfio/pci/vfio_pci_dmabuf.c
@@ -9,6 +9,7 @@
 
 MODULE_IMPORT_NS("DMA_BUF");
 
+#ifdef CONFIG_VFIO_PCI_DMABUF
 static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf,
 				   struct dma_buf_attachment *attachment)
 {
@@ -25,6 +26,7 @@ static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf,
 
 	return 0;
 }
+#endif /* CONFIG_VFIO_PCI_DMABUF */
 
 static void vfio_pci_dma_buf_done(struct kref *kref)
 {
@@ -89,7 +91,9 @@ static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf)
 }
 
 static const struct dma_buf_ops vfio_pci_dmabuf_ops = {
+#ifdef CONFIG_VFIO_PCI_DMABUF
 	.attach = vfio_pci_dma_buf_attach,
+#endif
 	.map_dma_buf = vfio_pci_dma_buf_map,
 	.unmap_dma_buf = vfio_pci_dma_buf_unmap,
 	.release = vfio_pci_dma_buf_release,
@@ -219,6 +223,7 @@ static int vfio_pci_dmabuf_export(struct vfio_pci_core_device *vdev,
 	return 0;
 }
 
+#ifdef CONFIG_VFIO_PCI_DMABUF
 /*
  * This is a temporary "private interconnect" between VFIO DMABUF and iommufd.
  * It allows the two co-operating drivers to exchange the physical address of
@@ -428,6 +433,7 @@ int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
 	kfree(dma_ranges);
 	return ret;
 }
+#endif /* CONFIG_VFIO_PCI_DMABUF */
 
 int vfio_pci_core_mmap_prep_dmabuf(struct vfio_pci_core_device *vdev,
 				   struct vm_area_struct *vma,
@@ -490,6 +496,10 @@ void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
 	struct vfio_pci_dma_buf *tmp;
 
 	lockdep_assert_held_write(&vdev->memory_lock);
+	/*
+	 * Holding memory_lock ensures a racing VMA fault observes
+	 * priv->revoked properly.
+	 */
 
 	list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
 		if (!get_file_active(&priv->dmabuf->file))
@@ -507,6 +517,8 @@ void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
 			if (revoked) {
 				kref_put(&priv->kref, vfio_pci_dma_buf_done);
 				wait_for_completion(&priv->comp);
+				unmap_mapping_range(priv->dmabuf->file->f_mapping,
+						    0, priv->size, 1);
 			} else {
 				/*
 				 * Kref is initialize again, because when revoke
@@ -550,6 +562,8 @@ void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev)
 		dma_resv_unlock(priv->dmabuf->resv);
 		kref_put(&priv->kref, vfio_pci_dma_buf_done);
 		wait_for_completion(&priv->comp);
+		unmap_mapping_range(priv->dmabuf->file->f_mapping,
+				    0, priv->size, 1);
 		vfio_device_put_registration(&vdev->vdev);
 		fput(priv->dmabuf->file);
 	}
diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
index 5fd3a6e00a0e..37ece9b4b5bd 100644
--- a/drivers/vfio/pci/vfio_pci_priv.h
+++ b/drivers/vfio/pci/vfio_pci_priv.h
@@ -132,13 +132,13 @@ int vfio_pci_dma_buf_find_pfn(struct vfio_pci_dma_buf *vpdmabuf,
 int vfio_pci_core_mmap_prep_dmabuf(struct vfio_pci_core_device *vdev,
 				   struct vm_area_struct *vma,
 				   u64 phys_start, u64 pgoff, u64 req_len);
+void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev);
+void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked);
 
 #ifdef CONFIG_VFIO_PCI_DMABUF
 int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
 				  struct vfio_device_feature_dma_buf __user *arg,
 				  size_t argsz);
-void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev);
-void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked);
 #else
 static inline int
 vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
@@ -147,13 +147,6 @@ vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
 {
 	return -ENOTTY;
 }
-static inline void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev)
-{
-}
-static inline void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev,
-					 bool revoked)
-{
-}
 #endif
 
 #endif
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC v2 PATCH 06/10] vfio/pci: Remove vfio_pci_zap_bars()
  2026-03-12 18:45 [RFC v2 PATCH 00/10] vfio/pci: Add mmap() for DMABUFs Matt Evans
                   ` (4 preceding siblings ...)
  2026-03-12 18:46 ` [RFC v2 PATCH 05/10] vfio/pci: Convert BAR mmap() to use a DMABUF Matt Evans
@ 2026-03-12 18:46 ` Matt Evans
  2026-03-13  9:12   ` Christian König
  2026-03-12 18:46 ` [RFC v2 PATCH 07/10] vfio/pci: Support mmap() of a VFIO DMABUF Matt Evans
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 17+ messages in thread
From: Matt Evans @ 2026-03-12 18:46 UTC (permalink / raw)
  To: Alex Williamson, Leon Romanovsky, Jason Gunthorpe, Alex Mastro,
	Mahmoud Adam, David Matlack
  Cc: Björn Töpel, Sumit Semwal, Christian König,
	Kevin Tian, Ankit Agrawal, Pranjal Shrivastava, Alistair Popple,
	Vivek Kasireddy, linux-kernel, linux-media, dri-devel,
	linaro-mm-sig, kvm

vfio_pci_zap_bars() and the wrapper
vfio_pci_zap_and_down_write_memory_lock() are redundant as of
"vfio/pci: Convert BAR mmap() to use a DMABUF".  The DMABUFs used for
BAR mappings already zap PTEs via the existing
vfio_pci_dma_buf_move(), which notifies importers of changes to the
BAR space (e.g. around reset).

Remove the old functions; the various call sites that needed to zap
BARs become slightly cleaner.

Signed-off-by: Matt Evans <mattev@meta.com>
---
 drivers/vfio/pci/vfio_pci_config.c | 18 ++++++------------
 drivers/vfio/pci/vfio_pci_core.c   | 30 +++++++-----------------------
 drivers/vfio/pci/vfio_pci_priv.h   |  1 -
 3 files changed, 13 insertions(+), 36 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index b4e39253f98d..c7ed28be1104 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -590,12 +590,9 @@ static int vfio_basic_config_write(struct vfio_pci_core_device *vdev, int pos,
 		virt_mem = !!(le16_to_cpu(*virt_cmd) & PCI_COMMAND_MEMORY);
 		new_mem = !!(new_cmd & PCI_COMMAND_MEMORY);
 
-		if (!new_mem) {
-			vfio_pci_zap_and_down_write_memory_lock(vdev);
+		down_write(&vdev->memory_lock);
+		if (!new_mem)
 			vfio_pci_dma_buf_move(vdev, true);
-		} else {
-			down_write(&vdev->memory_lock);
-		}
 
 		/*
 		 * If the user is writing mem/io enable (new_mem/io) and we
@@ -712,12 +709,9 @@ static int __init init_pci_cap_basic_perm(struct perm_bits *perm)
 static void vfio_lock_and_set_power_state(struct vfio_pci_core_device *vdev,
 					  pci_power_t state)
 {
-	if (state >= PCI_D3hot) {
-		vfio_pci_zap_and_down_write_memory_lock(vdev);
+	down_write(&vdev->memory_lock);
+	if (state >= PCI_D3hot)
 		vfio_pci_dma_buf_move(vdev, true);
-	} else {
-		down_write(&vdev->memory_lock);
-	}
 
 	vfio_pci_set_power_state(vdev, state);
 	if (__vfio_pci_memory_enabled(vdev))
@@ -908,7 +902,7 @@ static int vfio_exp_config_write(struct vfio_pci_core_device *vdev, int pos,
 						 &cap);
 
 		if (!ret && (cap & PCI_EXP_DEVCAP_FLR)) {
-			vfio_pci_zap_and_down_write_memory_lock(vdev);
+			down_write(&vdev->memory_lock);
 			vfio_pci_dma_buf_move(vdev, true);
 			pci_try_reset_function(vdev->pdev);
 			if (__vfio_pci_memory_enabled(vdev))
@@ -993,7 +987,7 @@ static int vfio_af_config_write(struct vfio_pci_core_device *vdev, int pos,
 						&cap);
 
 		if (!ret && (cap & PCI_AF_CAP_FLR) && (cap & PCI_AF_CAP_TP)) {
-			vfio_pci_zap_and_down_write_memory_lock(vdev);
+			down_write(&vdev->memory_lock);
 			vfio_pci_dma_buf_move(vdev, true);
 			pci_try_reset_function(vdev->pdev);
 			if (__vfio_pci_memory_enabled(vdev))
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 41224efa58d8..9e9ad97c2f7f 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -319,7 +319,7 @@ static int vfio_pci_runtime_pm_entry(struct vfio_pci_core_device *vdev,
 	 * The vdev power related flags are protected with 'memory_lock'
 	 * semaphore.
 	 */
-	vfio_pci_zap_and_down_write_memory_lock(vdev);
+	down_write(&vdev->memory_lock);
 	vfio_pci_dma_buf_move(vdev, true);
 
 	if (vdev->pm_runtime_engaged) {
@@ -1229,7 +1229,7 @@ static int vfio_pci_ioctl_reset(struct vfio_pci_core_device *vdev,
 	if (!vdev->reset_works)
 		return -EINVAL;
 
-	vfio_pci_zap_and_down_write_memory_lock(vdev);
+	down_write(&vdev->memory_lock);
 
 	/*
 	 * This function can be invoked while the power state is non-D0. If
@@ -1613,22 +1613,6 @@ ssize_t vfio_pci_core_write(struct vfio_device *core_vdev, const char __user *bu
 }
 EXPORT_SYMBOL_GPL(vfio_pci_core_write);
 
-static void vfio_pci_zap_bars(struct vfio_pci_core_device *vdev)
-{
-	struct vfio_device *core_vdev = &vdev->vdev;
-	loff_t start = VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_BAR0_REGION_INDEX);
-	loff_t end = VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_ROM_REGION_INDEX);
-	loff_t len = end - start;
-
-	unmap_mapping_range(core_vdev->inode->i_mapping, start, len, true);
-}
-
-void vfio_pci_zap_and_down_write_memory_lock(struct vfio_pci_core_device *vdev)
-{
-	down_write(&vdev->memory_lock);
-	vfio_pci_zap_bars(vdev);
-}
-
 u16 vfio_pci_memory_lock_and_enable(struct vfio_pci_core_device *vdev)
 {
 	u16 cmd;
@@ -2487,10 +2471,11 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
 		}
 
 		/*
-		 * Take the memory write lock for each device and zap BAR
-		 * mappings to prevent the user accessing the device while in
-		 * reset.  Locking multiple devices is prone to deadlock,
-		 * runaway and unwind if we hit contention.
+		 * Take the memory write lock for each device and
+		 * revoke all DMABUFs, which will prevent any access
+		 * to the device while in reset.  Locking multiple
+		 * devices is prone to deadlock, runaway and unwind if
+		 * we hit contention.
 		 */
 		if (!down_write_trylock(&vdev->memory_lock)) {
 			ret = -EBUSY;
@@ -2498,7 +2483,6 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
 		}
 
 		vfio_pci_dma_buf_move(vdev, true);
-		vfio_pci_zap_bars(vdev);
 	}
 
 	if (!list_entry_is_head(vdev,
diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
index 37ece9b4b5bd..e201c96bbb14 100644
--- a/drivers/vfio/pci/vfio_pci_priv.h
+++ b/drivers/vfio/pci/vfio_pci_priv.h
@@ -78,7 +78,6 @@ void vfio_config_free(struct vfio_pci_core_device *vdev);
 int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev,
 			     pci_power_t state);
 
-void vfio_pci_zap_and_down_write_memory_lock(struct vfio_pci_core_device *vdev);
 u16 vfio_pci_memory_lock_and_enable(struct vfio_pci_core_device *vdev);
 void vfio_pci_memory_unlock_and_restore(struct vfio_pci_core_device *vdev,
 					u16 cmd);
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC v2 PATCH 07/10] vfio/pci: Support mmap() of a VFIO DMABUF
  2026-03-12 18:45 [RFC v2 PATCH 00/10] vfio/pci: Add mmap() for DMABUFs Matt Evans
                   ` (5 preceding siblings ...)
  2026-03-12 18:46 ` [RFC v2 PATCH 06/10] vfio/pci: Remove vfio_pci_zap_bars() Matt Evans
@ 2026-03-12 18:46 ` Matt Evans
  2026-03-12 18:46 ` [RFC v2 PATCH 08/10] vfio/pci: Permanently revoke a DMABUF on request Matt Evans
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 17+ messages in thread
From: Matt Evans @ 2026-03-12 18:46 UTC (permalink / raw)
  To: Alex Williamson, Leon Romanovsky, Jason Gunthorpe, Alex Mastro,
	Mahmoud Adam, David Matlack
  Cc: Björn Töpel, Sumit Semwal, Christian König,
	Kevin Tian, Ankit Agrawal, Pranjal Shrivastava, Alistair Popple,
	Vivek Kasireddy, linux-kernel, linux-media, dri-devel,
	linaro-mm-sig, kvm

A VFIO DMABUF can export a subset of a BAR to userspace by fd; add
support for mmap() of this fd.  This provides another route for a
process to map BARs, one in which the process can map only the
specific subset of the BAR represented by the exported DMABUF.

mmap() support enables userspace driver designs that safely delegate
access to BAR sub-ranges to other client processes by sharing a DMABUF
fd, without having to share the (omnipotent) VFIO device fd with them.

Since the main VFIO BAR mmap() path is now DMABUF-aware, this path
reuses the existing vm_ops.

Signed-off-by: Matt Evans <mattev@meta.com>
---
 drivers/vfio/pci/vfio_pci_core.c   |  2 +-
 drivers/vfio/pci/vfio_pci_dmabuf.c | 28 ++++++++++++++++++++++++++++
 drivers/vfio/pci/vfio_pci_priv.h   |  2 ++
 3 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 9e9ad97c2f7f..4f411a0b980c 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1712,7 +1712,7 @@ static vm_fault_t vfio_pci_mmap_page_fault(struct vm_fault *vmf)
 	return vfio_pci_mmap_huge_fault(vmf, 0);
 }
 
-static const struct vm_operations_struct vfio_pci_mmap_ops = {
+const struct vm_operations_struct vfio_pci_mmap_ops = {
 	.fault = vfio_pci_mmap_page_fault,
 #ifdef CONFIG_ARCH_SUPPORTS_HUGE_PFNMAP
 	.huge_fault = vfio_pci_mmap_huge_fault,
diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c
index 197f50365ee1..ab665db66904 100644
--- a/drivers/vfio/pci/vfio_pci_dmabuf.c
+++ b/drivers/vfio/pci/vfio_pci_dmabuf.c
@@ -26,6 +26,33 @@ static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf,
 
 	return 0;
 }
+
+static int vfio_pci_dma_buf_mmap(struct dma_buf *dmabuf, struct vm_area_struct *vma)
+{
+	struct vfio_pci_dma_buf *priv = dmabuf->priv;
+	u64 req_len, req_start;
+
+	if (priv->revoked)
+		return -ENODEV;
+	if ((vma->vm_flags & VM_SHARED) == 0)
+		return -EINVAL;
+
+	req_len = vma->vm_end - vma->vm_start;
+	req_start = vma->vm_pgoff << PAGE_SHIFT;
+	if (req_start + req_len > priv->size)
+		return -EINVAL;
+
+	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+	vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);
+
+	/* See comments in vfio_pci_core_mmap() re VM_ALLOW_ANY_UNCACHED. */
+	vm_flags_set(vma, VM_ALLOW_ANY_UNCACHED | VM_IO | VM_PFNMAP |
+				  VM_DONTEXPAND | VM_DONTDUMP);
+	vma->vm_private_data = priv;
+	vma->vm_ops = &vfio_pci_mmap_ops;
+
+	return 0;
+}
 #endif /* CONFIG_VFIO_PCI_DMABUF */
 
 static void vfio_pci_dma_buf_done(struct kref *kref)
@@ -93,6 +120,7 @@ static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf)
 static const struct dma_buf_ops vfio_pci_dmabuf_ops = {
 #ifdef CONFIG_VFIO_PCI_DMABUF
 	.attach = vfio_pci_dma_buf_attach,
+	.mmap = vfio_pci_dma_buf_mmap,
 #endif
 	.map_dma_buf = vfio_pci_dma_buf_map,
 	.unmap_dma_buf = vfio_pci_dma_buf_unmap,
diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
index e201c96bbb14..b16a8d22563c 100644
--- a/drivers/vfio/pci/vfio_pci_priv.h
+++ b/drivers/vfio/pci/vfio_pci_priv.h
@@ -37,6 +37,8 @@ struct vfio_pci_dma_buf {
 	u8 revoked : 1;
 };
 
+extern const struct vm_operations_struct vfio_pci_mmap_ops;
+
 bool vfio_pci_intx_mask(struct vfio_pci_core_device *vdev);
 void vfio_pci_intx_unmask(struct vfio_pci_core_device *vdev);
 
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC v2 PATCH 08/10] vfio/pci: Permanently revoke a DMABUF on request
  2026-03-12 18:45 [RFC v2 PATCH 00/10] vfio/pci: Add mmap() for DMABUFs Matt Evans
                   ` (6 preceding siblings ...)
  2026-03-12 18:46 ` [RFC v2 PATCH 07/10] vfio/pci: Support mmap() of a VFIO DMABUF Matt Evans
@ 2026-03-12 18:46 ` Matt Evans
  2026-03-12 18:46 ` [RFC v2 PATCH 09/10] vfio/pci: Add mmap() attributes to DMABUF feature Matt Evans
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 17+ messages in thread
From: Matt Evans @ 2026-03-12 18:46 UTC (permalink / raw)
  To: Alex Williamson, Leon Romanovsky, Jason Gunthorpe, Alex Mastro,
	Mahmoud Adam, David Matlack
  Cc: Björn Töpel, Sumit Semwal, Christian König,
	Kevin Tian, Ankit Agrawal, Pranjal Shrivastava, Alistair Popple,
	Vivek Kasireddy, linux-kernel, linux-media, dri-devel,
	linaro-mm-sig, kvm

Expand the VFIO DMABUF revocation state to three states: not
revoked, temporarily revoked, and permanently revoked.

The first two cover the existing transient revocation, e.g. across a
function reset; a DMABUF enters the last in response to a new
ioctl(VFIO_DEVICE_PCI_DMABUF_REVOKE) request.

This VFIO device fd ioctl passes a DMABUF by fd and requests that the
DMABUF be permanently revoked.  On success, it is guaranteed that the
buffer can never be imported/attached/mmap()ed in future, that dynamic
imports have been cleanly detached, and that all mappings have been
made inaccessible (PTEs zapped).

This is useful for lifecycle management, to reclaim VFIO PCI BAR
ranges previously delegated to a subordinate client process: The
driver process can ensure that the loaned resources are revoked when
the client is deemed "done", and exported ranges can be safely re-used
elsewhere.

Signed-off-by: Matt Evans <mattev@meta.com>
---
 drivers/vfio/pci/vfio_pci_core.c   |  16 +++-
 drivers/vfio/pci/vfio_pci_dmabuf.c | 136 ++++++++++++++++++++---------
 drivers/vfio/pci/vfio_pci_priv.h   |  14 ++-
 include/uapi/linux/vfio.h          |  30 +++++++
 4 files changed, 154 insertions(+), 42 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 4f411a0b980c..c7760dd3a5f0 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1461,6 +1461,18 @@ static int vfio_pci_ioctl_ioeventfd(struct vfio_pci_core_device *vdev,
 				  ioeventfd.fd);
 }
 
+static int vfio_pci_ioctl_dmabuf_revoke(struct vfio_pci_core_device *vdev,
+					struct vfio_pci_dmabuf_revoke __user *arg)
+{
+	unsigned long minsz = offsetofend(struct vfio_pci_dmabuf_revoke, dmabuf_fd);
+	struct vfio_pci_dmabuf_revoke revoke;
+
+	if (copy_from_user(&revoke, arg, minsz))
+		return -EFAULT;
+
+	return vfio_pci_dma_buf_revoke(vdev, revoke.dmabuf_fd);
+}
+
 long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
 			 unsigned long arg)
 {
@@ -1483,6 +1495,8 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
 		return vfio_pci_ioctl_reset(vdev, uarg);
 	case VFIO_DEVICE_SET_IRQS:
 		return vfio_pci_ioctl_set_irqs(vdev, uarg);
+	case VFIO_DEVICE_PCI_DMABUF_REVOKE:
+		return vfio_pci_ioctl_dmabuf_revoke(vdev, uarg);
 	default:
 		return -ENOTTY;
 	}
@@ -1690,7 +1704,7 @@ static vm_fault_t vfio_pci_mmap_huge_fault(struct vm_fault *vmf,
 			 * change occurs whilst holding memory_lock,
 			 * so protects against racing faults.
 			 */
-			if (priv->revoked)
+			if (priv->status != VFIO_PCI_DMABUF_OK)
 				ret = VM_FAULT_SIGBUS;
 			else
 				ret = vfio_pci_vmf_insert_pfn(vdev, vmf, pfn, order);
diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c
index ab665db66904..362207cf7e71 100644
--- a/drivers/vfio/pci/vfio_pci_dmabuf.c
+++ b/drivers/vfio/pci/vfio_pci_dmabuf.c
@@ -18,7 +18,7 @@ static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf,
 	if (!attachment->peer2peer)
 		return -EOPNOTSUPP;
 
-	if (priv->revoked)
+	if (priv->status != VFIO_PCI_DMABUF_OK)
 		return -ENODEV;
 
 	if (!dma_buf_attach_revocable(attachment))
@@ -32,7 +32,7 @@ static int vfio_pci_dma_buf_mmap(struct dma_buf *dmabuf, struct vm_area_struct *
 	struct vfio_pci_dma_buf *priv = dmabuf->priv;
 	u64 req_len, req_start;
 
-	if (priv->revoked)
+	if (priv->status != VFIO_PCI_DMABUF_OK)
 		return -ENODEV;
 	if ((vma->vm_flags & VM_SHARED) == 0)
 		return -EINVAL;
@@ -72,7 +72,7 @@ vfio_pci_dma_buf_map(struct dma_buf_attachment *attachment,
 
 	dma_resv_assert_held(priv->dmabuf->resv);
 
-	if (priv->revoked)
+	if (priv->status != VFIO_PCI_DMABUF_OK)
 		return ERR_PTR(-ENODEV);
 
 	ret = dma_buf_phys_vec_to_sgt(attachment, priv->provider,
@@ -243,7 +243,8 @@ static int vfio_pci_dmabuf_export(struct vfio_pci_core_device *vdev,
 	INIT_LIST_HEAD(&priv->dmabufs_elm);
 	down_write(&vdev->memory_lock);
 	dma_resv_lock(priv->dmabuf->resv, NULL);
-	priv->revoked = !status_ok;
+	priv->status = status_ok ? VFIO_PCI_DMABUF_OK :
+		VFIO_PCI_DMABUF_TEMP_REVOKED;
 	list_add_tail(&priv->dmabufs_elm, &vdev->dmabufs);
 	dma_resv_unlock(priv->dmabuf->resv);
 	up_write(&vdev->memory_lock);
@@ -274,7 +275,7 @@ int vfio_pci_dma_buf_iommufd_map(struct dma_buf_attachment *attachment,
 		return -EOPNOTSUPP;
 
 	priv = attachment->dmabuf->priv;
-	if (priv->revoked)
+	if (priv->status != VFIO_PCI_DMABUF_OK)
 		return -ENODEV;
 
 	/* More than one range to iommufd will require proper DMABUF support */
@@ -518,6 +519,48 @@ int vfio_pci_core_mmap_prep_dmabuf(struct vfio_pci_core_device *vdev,
 	return ret;
 }
 
+static void __vfio_pci_dma_buf_revoke(struct vfio_pci_dma_buf *priv, bool revoked,
+				      bool permanently)
+{
+	if ((priv->status == VFIO_PCI_DMABUF_PERM_REVOKED) ||
+	    (priv->status == VFIO_PCI_DMABUF_OK && !revoked) ||
+	    (priv->status == VFIO_PCI_DMABUF_TEMP_REVOKED && revoked && !permanently)) {
+		return;
+	}
+
+	dma_resv_lock(priv->dmabuf->resv, NULL);
+	if (revoked)
+		priv->status = permanently ?
+			VFIO_PCI_DMABUF_PERM_REVOKED : VFIO_PCI_DMABUF_TEMP_REVOKED;
+	dma_buf_invalidate_mappings(priv->dmabuf);
+	dma_resv_wait_timeout(priv->dmabuf->resv,
+			      DMA_RESV_USAGE_BOOKKEEP, false,
+			      MAX_SCHEDULE_TIMEOUT);
+	dma_resv_unlock(priv->dmabuf->resv);
+	if (revoked) {
+		kref_put(&priv->kref, vfio_pci_dma_buf_done);
+		wait_for_completion(&priv->comp);
+		unmap_mapping_range(priv->dmabuf->file->f_mapping,
+				    0, priv->size, 1);
+	} else {
+		/*
+		 * Kref is initialize again, because when revoke
+		 * was performed the reference counter was decreased
+		 * to zero to trigger completion.
+		 */
+		kref_init(&priv->kref);
+		/*
+		 * There is no need to wait as no mapping was
+		 * performed when the previous status was
+		 * priv->status == *REVOKED.
+		 */
+		reinit_completion(&priv->comp);
+		dma_resv_lock(priv->dmabuf->resv, NULL);
+		priv->status = VFIO_PCI_DMABUF_OK;
+		dma_resv_unlock(priv->dmabuf->resv);
+	}
+}
+
 void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
 {
 	struct vfio_pci_dma_buf *priv;
@@ -526,45 +569,13 @@ void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
 	lockdep_assert_held_write(&vdev->memory_lock);
 	/*
 	 * Holding memory_lock ensures a racing VMA fault observes
-	 * priv->revoked properly.
+	 * priv->status properly.
 	 */
 
 	list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
 		if (!get_file_active(&priv->dmabuf->file))
 			continue;
-
-		if (priv->revoked != revoked) {
-			dma_resv_lock(priv->dmabuf->resv, NULL);
-			if (revoked)
-				priv->revoked = true;
-			dma_buf_invalidate_mappings(priv->dmabuf);
-			dma_resv_wait_timeout(priv->dmabuf->resv,
-					      DMA_RESV_USAGE_BOOKKEEP, false,
-					      MAX_SCHEDULE_TIMEOUT);
-			dma_resv_unlock(priv->dmabuf->resv);
-			if (revoked) {
-				kref_put(&priv->kref, vfio_pci_dma_buf_done);
-				wait_for_completion(&priv->comp);
-				unmap_mapping_range(priv->dmabuf->file->f_mapping,
-						    0, priv->size, 1);
-			} else {
-				/*
-				 * Kref is initialize again, because when revoke
-				 * was performed the reference counter was decreased
-				 * to zero to trigger completion.
-				 */
-				kref_init(&priv->kref);
-				/*
-				 * There is no need to wait as no mapping was
-				 * performed when the previous status was
-				 * priv->revoked == true.
-				 */
-				reinit_completion(&priv->comp);
-				dma_resv_lock(priv->dmabuf->resv, NULL);
-				priv->revoked = false;
-				dma_resv_unlock(priv->dmabuf->resv);
-			}
-		}
+		__vfio_pci_dma_buf_revoke(priv, revoked, false);
 		fput(priv->dmabuf->file);
 	}
 }
@@ -582,7 +593,7 @@ void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev)
 		dma_resv_lock(priv->dmabuf->resv, NULL);
 		list_del_init(&priv->dmabufs_elm);
 		priv->vdev = NULL;
-		priv->revoked = true;
+		priv->status = VFIO_PCI_DMABUF_PERM_REVOKED;
 		dma_buf_invalidate_mappings(priv->dmabuf);
 		dma_resv_wait_timeout(priv->dmabuf->resv,
 				      DMA_RESV_USAGE_BOOKKEEP, false,
@@ -597,3 +608,48 @@ void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev)
 	}
 	up_write(&vdev->memory_lock);
 }
+
+#ifdef CONFIG_VFIO_PCI_DMABUF
+int vfio_pci_dma_buf_revoke(struct vfio_pci_core_device *vdev, int dmabuf_fd)
+{
+	struct vfio_pci_core_device *db_vdev;
+	struct dma_buf *dmabuf;
+	struct vfio_pci_dma_buf *priv;
+	int ret = 0;
+
+	dmabuf = dma_buf_get(dmabuf_fd);
+	if (IS_ERR(dmabuf))
+		return PTR_ERR(dmabuf);
+
+	/*
+	 * The DMABUF is a user-provided fd, so sanity-check it's
+	 * really a vfio_pci_dma_buf _and_ relates to the VFIO device
+	 * that it was provided for:
+	 */
+	if (dmabuf->ops != &vfio_pci_dmabuf_ops) {
+		ret = -ENODEV;
+		goto out_put_buf;
+	}
+
+	priv = dmabuf->priv;
+	db_vdev = READ_ONCE(priv->vdev);
+
+	if (!db_vdev || db_vdev != vdev) {
+		ret = -ENODEV;
+		goto out_put_buf;
+	}
+
+	scoped_guard(rwsem_read, &vdev->memory_lock) {
+		if (priv->status == VFIO_PCI_DMABUF_PERM_REVOKED) {
+			ret = -EBADFD;
+			break;
+		}
+		__vfio_pci_dma_buf_revoke(priv, true, true);
+	}
+
+ out_put_buf:
+	dma_buf_put(dmabuf);
+
+	return ret;
+}
+#endif /* CONFIG_VFIO_PCI_DMABUF */
diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
index b16a8d22563c..c5a9e06bf81a 100644
--- a/drivers/vfio/pci/vfio_pci_priv.h
+++ b/drivers/vfio/pci/vfio_pci_priv.h
@@ -23,6 +23,12 @@ struct vfio_pci_ioeventfd {
 	bool			test_mem;
 };
 
+enum vfio_pci_dma_buf_status {
+	VFIO_PCI_DMABUF_OK = 0,
+	VFIO_PCI_DMABUF_TEMP_REVOKED = 1,
+	VFIO_PCI_DMABUF_PERM_REVOKED = 2,
+};
+
 struct vfio_pci_dma_buf {
 	struct dma_buf *dmabuf;
 	struct vfio_pci_core_device *vdev;
@@ -34,7 +40,7 @@ struct vfio_pci_dma_buf {
 	u32 nr_ranges;
 	struct kref kref;
 	struct completion comp;
-	u8 revoked : 1;
+	enum vfio_pci_dma_buf_status status;
 };
 
 extern const struct vm_operations_struct vfio_pci_mmap_ops;
@@ -140,6 +146,7 @@ void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked);
 int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
 				  struct vfio_device_feature_dma_buf __user *arg,
 				  size_t argsz);
+int vfio_pci_dma_buf_revoke(struct vfio_pci_core_device *vdev, int dmabuf_fd);
 #else
 static inline int
 vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
@@ -148,6 +155,11 @@ vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
 {
 	return -ENOTTY;
 }
+static inline int vfio_pci_dma_buf_revoke(struct vfio_pci_core_device *vdev,
+					  int dmabuf_fd)
+{
+	return -ENODEV;
+}
 #endif
 
 #endif
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index bb7b89330d35..c1b3fa880aa1 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -1307,6 +1307,36 @@ struct vfio_precopy_info {
 
 #define VFIO_MIG_GET_PRECOPY_INFO _IO(VFIO_TYPE, VFIO_BASE + 21)
 
+/**
+ * VFIO_DEVICE_PCI_DMABUF_REVOKE - _IO(VFIO_TYPE, VFIO_BASE + 22)
+ *
+ * This ioctl is used on the device FD, and requests that access to
+ * the buffer corresponding to the DMABUF FD parameter be immediately
+ * and permanently revoked.  On successful return, the buffer is not
+ * accessible through any mmap() or dma-buf import.  The request fails
+ * if the buffer is pinned; otherwise, the exporter marks the buffer
+ * as inaccessible and uses the move_notify callback to inform
+ * importers of the change.  The buffer is permanently disabled, and
+ * VFIO refuses all map, mmap, attach, etc. requests.
+ *
+ * Return:
+ *
+ * 0 on success, -1 and errno set on failure:
+ *
+ *  ENODEV if the DMABUF's associated VFIO device no longer exists,
+ *         or the FD is not a DMABUF created for this device.
+ *  EINVAL if the dmabuf_fd parameter isn't a DMABUF.
+ *  EBADF if the dmabuf_fd parameter isn't a valid file descriptor.
+ *  EBADFD if the buffer has already been revoked.
+ *
+ */
+struct vfio_pci_dmabuf_revoke {
+	__u32 argsz;
+	__u32 dmabuf_fd;
+};
+
+#define VFIO_DEVICE_PCI_DMABUF_REVOKE _IO(VFIO_TYPE, VFIO_BASE + 22)
+
 /*
  * Upon VFIO_DEVICE_FEATURE_SET, allow the device to be moved into a low power
  * state with the platform-based power management.  Device use of lower power
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC v2 PATCH 09/10] vfio/pci: Add mmap() attributes to DMABUF feature
  2026-03-12 18:45 [RFC v2 PATCH 00/10] vfio/pci: Add mmap() for DMABUFs Matt Evans
                   ` (7 preceding siblings ...)
  2026-03-12 18:46 ` [RFC v2 PATCH 08/10] vfio/pci: Permanently revoke a DMABUF on request Matt Evans
@ 2026-03-12 18:46 ` Matt Evans
  2026-03-12 18:46 ` [RFC v2 PATCH 10/10] [RFC ONLY] selftests: vfio: Add standalone vfio_dmabuf_mmap_test Matt Evans
  2026-03-13  9:21 ` [RFC v2 PATCH 00/10] vfio/pci: Add mmap() for DMABUFs Christian König
  10 siblings, 0 replies; 17+ messages in thread
From: Matt Evans @ 2026-03-12 18:46 UTC (permalink / raw)
  To: Alex Williamson, Leon Romanovsky, Jason Gunthorpe, Alex Mastro,
	Mahmoud Adam, David Matlack
  Cc: Björn Töpel, Sumit Semwal, Christian König,
	Kevin Tian, Ankit Agrawal, Pranjal Shrivastava, Alistair Popple,
	Vivek Kasireddy, linux-kernel, linux-media, dri-devel,
	linaro-mm-sig, kvm

A new field is reserved in vfio_device_feature_dma_buf.flags to
request CPU-facing memory-type attributes for mmap()s of the buffer.
Add a flag, VFIO_DEVICE_FEATURE_DMA_BUF_ATTR_WC, which results in WC
PTEs for the DMABUF's BAR region.

Signed-off-by: Matt Evans <mattev@meta.com>
---
 drivers/vfio/pci/vfio_pci_dmabuf.c | 15 +++++++++++++--
 drivers/vfio/pci/vfio_pci_priv.h   |  1 +
 include/uapi/linux/vfio.h          | 12 +++++++++---
 3 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c
index 362207cf7e71..ed5b80f6911e 100644
--- a/drivers/vfio/pci/vfio_pci_dmabuf.c
+++ b/drivers/vfio/pci/vfio_pci_dmabuf.c
@@ -42,7 +42,10 @@ static int vfio_pci_dma_buf_mmap(struct dma_buf *dmabuf, struct vm_area_struct *
 	if (req_start + req_len > priv->size)
 		return -EINVAL;
 
-	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+	if (priv->attrs == VFIO_DEVICE_FEATURE_DMA_BUF_ATTR_WC)
+		vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
+	else
+		vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
 	vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);
 
 	/* See comments in vfio_pci_core_mmap() re VM_ALLOW_ANY_UNCACHED. */
@@ -343,6 +346,12 @@ static int validate_dmabuf_input(struct vfio_device_feature_dma_buf *dma_buf,
 	size_t length = 0;
 	u32 i;
 
+	if ((dma_buf->flags != 0) &&
+	    ((dma_buf->flags & ~VFIO_DEVICE_FEATURE_DMA_BUF_ATTR_MASK) ||
+	     ((dma_buf->flags & VFIO_DEVICE_FEATURE_DMA_BUF_ATTR_MASK) !=
+	      VFIO_DEVICE_FEATURE_DMA_BUF_ATTR_WC)))
+		return -EINVAL;
+
 	for (i = 0; i < dma_buf->nr_ranges; i++) {
 		u64 offset = dma_ranges[i].offset;
 		u64 len = dma_ranges[i].length;
@@ -386,7 +395,7 @@ int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
 	if (copy_from_user(&get_dma_buf, arg, sizeof(get_dma_buf)))
 		return -EFAULT;
 
-	if (!get_dma_buf.nr_ranges || get_dma_buf.flags)
+	if (!get_dma_buf.nr_ranges)
 		return -EINVAL;
 
 	/*
@@ -429,6 +438,7 @@ int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
 	priv->vdev = vdev;
 	priv->nr_ranges = get_dma_buf.nr_ranges;
 	priv->size = length;
+	priv->attrs = get_dma_buf.flags & VFIO_DEVICE_FEATURE_DMA_BUF_ATTR_MASK;
 	ret = vdev->pci_ops->get_dmabuf_phys(vdev, &priv->provider,
 					     get_dma_buf.region_index,
 					     priv->phys_vec, dma_ranges,
@@ -488,6 +498,7 @@ int vfio_pci_core_mmap_prep_dmabuf(struct vfio_pci_core_device *vdev,
 	priv->vdev = vdev;
 	priv->nr_ranges = nr_ranges;
 	priv->size = req_len;
+	priv->attrs = 0;
 	priv->phys_vec[0].paddr = phys_start + (pgoff << PAGE_SHIFT);
 	priv->phys_vec[0].len = req_len;
 
diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
index c5a9e06bf81a..562de3cc88f4 100644
--- a/drivers/vfio/pci/vfio_pci_priv.h
+++ b/drivers/vfio/pci/vfio_pci_priv.h
@@ -40,6 +40,7 @@ struct vfio_pci_dma_buf {
 	u32 nr_ranges;
 	struct kref kref;
 	struct completion comp;
+	u32 attrs;
 	enum vfio_pci_dma_buf_status status;
 };
 
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index c1b3fa880aa1..fbbe1adea533 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -1521,7 +1521,9 @@ struct vfio_device_feature_bus_master {
  * etc. offset/length specify a slice of the region to create the dmabuf from.
  * nr_ranges is the total number of (P2P DMA) ranges that comprise the dmabuf.
  *
- * flags should be 0.
+ * flags contains:
+ * - An attribute field selecting the CPU mapping type: the default (UC) is
+ *   suitable for regular MMIO; alternate attributes (such as WC) may be selected.
  *
  * Return: The fd number on success, -1 and errno is set on failure.
  */
@@ -1535,8 +1537,12 @@ struct vfio_region_dma_range {
 struct vfio_device_feature_dma_buf {
 	__u32	region_index;
 	__u32	open_flags;
-	__u32   flags;
-	__u32   nr_ranges;
+	__u32	flags;
+	/* Flags sub-field reserved for attribute enum */
+#define VFIO_DEVICE_FEATURE_DMA_BUF_ATTR_MASK		(0xf << 28)
+#define VFIO_DEVICE_FEATURE_DMA_BUF_ATTR_UC		(0 << 28)
+#define VFIO_DEVICE_FEATURE_DMA_BUF_ATTR_WC		(1 << 28)
+	__u32	nr_ranges;
 	struct vfio_region_dma_range dma_ranges[] __counted_by(nr_ranges);
 };
 
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC v2 PATCH 10/10] [RFC ONLY] selftests: vfio: Add standalone vfio_dmabuf_mmap_test
  2026-03-12 18:45 [RFC v2 PATCH 00/10] vfio/pci: Add mmap() for DMABUFs Matt Evans
                   ` (8 preceding siblings ...)
  2026-03-12 18:46 ` [RFC v2 PATCH 09/10] vfio/pci: Add mmap() attributes to DMABUF feature Matt Evans
@ 2026-03-12 18:46 ` Matt Evans
  2026-03-13  9:21 ` [RFC v2 PATCH 00/10] vfio/pci: Add mmap() for DMABUFs Christian König
  10 siblings, 0 replies; 17+ messages in thread
From: Matt Evans @ 2026-03-12 18:46 UTC (permalink / raw)
  To: Alex Williamson, Leon Romanovsky, Jason Gunthorpe, Alex Mastro,
	Mahmoud Adam, David Matlack
  Cc: Björn Töpel, Sumit Semwal, Christian König,
	Kevin Tian, Ankit Agrawal, Pranjal Shrivastava, Alistair Popple,
	Vivek Kasireddy, linux-kernel, linux-media, dri-devel,
	linaro-mm-sig, kvm

This test exercises VFIO DMABUF mmap() to userspace, including various
revocation/shutdown cases (which make the VMA inaccessible).

This is a TEMPORARY test, just to illustrate a new UAPI and
DMABUF/mmap() usage.  Since it originates from out-of-tree code, it
duplicates some of the VFIO device setup code in
.../selftests/vfio/lib.  Eventually, the tests should be folded into the
existing VFIO tests.

Signed-off-by: Matt Evans <mattev@meta.com>
---
 tools/testing/selftests/vfio/Makefile         |   1 +
 .../vfio/standalone/vfio_dmabuf_mmap_test.c   | 837 ++++++++++++++++++
 2 files changed, 838 insertions(+)
 create mode 100644 tools/testing/selftests/vfio/standalone/vfio_dmabuf_mmap_test.c

diff --git a/tools/testing/selftests/vfio/Makefile b/tools/testing/selftests/vfio/Makefile
index 8e90e409e91d..8679d96e5b92 100644
--- a/tools/testing/selftests/vfio/Makefile
+++ b/tools/testing/selftests/vfio/Makefile
@@ -12,6 +12,7 @@ TEST_GEN_PROGS += vfio_iommufd_setup_test
 TEST_GEN_PROGS += vfio_pci_device_test
 TEST_GEN_PROGS += vfio_pci_device_init_perf_test
 TEST_GEN_PROGS += vfio_pci_driver_test
+TEST_GEN_PROGS += standalone/vfio_dmabuf_mmap_test
 
 TEST_FILES += scripts/cleanup.sh
 TEST_FILES += scripts/lib.sh
diff --git a/tools/testing/selftests/vfio/standalone/vfio_dmabuf_mmap_test.c b/tools/testing/selftests/vfio/standalone/vfio_dmabuf_mmap_test.c
new file mode 100644
index 000000000000..0c087497b777
--- /dev/null
+++ b/tools/testing/selftests/vfio/standalone/vfio_dmabuf_mmap_test.c
@@ -0,0 +1,837 @@
+/*
+ * Tests for VFIO DMABUF userspace mmap()
+ *
+ * As well as the basics (mmap() a BAR resource to userspace), test
+ * shutdown/unmapping, aliasing, and DMABUF revocation scenarios.
+ *
+ * This test relies on being attached to a QEMU EDU device (for a
+ * simple known MMIO layout).  Example invocation, assuming function
+ * 0000:00:03.0 is the target:
+ *
+ *  # lspci -n -s 00:03.0
+ *  00:03.0 00ff: 1234:11e8 (rev 10)
+ *
+ *  # readlink /sys/bus/pci/devices/0000\:00\:03.0/iommu_group
+ *  ../../../../../kernel/iommu_groups/3
+ *
+ *  (if there's a driver already attached)
+ *  # echo 0000:00:03.0 > /sys/bus/pci/devices/0000:00:03.0/driver/unbind
+ *
+ *  (and, might need)
+ *  # echo 1 > /sys/module/vfio_iommu_type1/parameters/allow_unsafe_interrupts
+ *
+ *  Attach to VFIO:
+ *  # echo 1234 11e8 > /sys/bus/pci/drivers/vfio-pci/new_id
+ *
+ *  There should be only one thing in the group:
+ *  # ls /sys/bus/pci/devices/0000:00:03.0/iommu_group/devices
+ *
+ *  Then, given the above, an invocation would be:
+ *  # this_test -r 0000:00:03.0 -g 3
+ *
+ * However, note the QEMU EDU device has a very small address span of
+ * useful things in BAR0, which makes testing a non-zero BAR offset
+ * impossible.  An "extended EDU" device is supported, which just
+ * presents a large chunk of memory as a second BAR resource: this
+ * allows non-zero BAR offsets to be tested.  See below for a QEMU
+ * diff...
+ *
+ * Copyright (c) Meta Platforms, Inc. and affiliates.
+ *
+ * This software may be used and distributed according to the terms of the
+ * GNU General Public License version 2.
+ */
+
+/*
+diff --git a/hw/misc/edu.c b/hw/misc/edu.c
+index cece633e11..5f119e0642 100644
+--- a/hw/misc/edu.c
++++ b/hw/misc/edu.c
+@@ -47,6 +47,7 @@ DECLARE_INSTANCE_CHECKER(EduState, EDU,
+ struct EduState {
+     PCIDevice pdev;
+     MemoryRegion mmio;
++    MemoryRegion ram;
+ 
+     QemuThread thread;
+     QemuMutex thr_mutex;
+@@ -386,7 +387,12 @@ static void pci_edu_realize(PCIDevice *pdev, Error **errp)
+ 
+     memory_region_init_io(&edu->mmio, OBJECT(edu), &edu_mmio_ops, edu,
+                     "edu-mmio", 1 * MiB);
++    memory_region_init_ram(&edu->ram, OBJECT(edu), "edu-ram", 64 * MiB, &error_fatal);
+     pci_register_bar(pdev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY, &edu->mmio);
++    pci_register_bar(pdev, 1,
++                     PCI_BASE_ADDRESS_SPACE_MEMORY |
++                    PCI_BASE_ADDRESS_MEM_PREFETCH |
++                    PCI_BASE_ADDRESS_MEM_TYPE_64, &edu->ram);
+ }
+ 
+ static void pci_edu_uninit(PCIDevice *pdev)
+*/
+
+#include <errno.h>
+#include <inttypes.h>
+#include <fcntl.h>
+#include <limits.h>
+#include <linux/dma-buf.h>
+#include <linux/vfio.h>
+#include <setjmp.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#define ROUND_UP(x, to) (((x) + (to) - 1) & ~((to) - 1))
+#define MiB(x)		((x) * 1024ULL * 1024)
+
+#define EDU_REG_MAGIC	0x00
+#define EDU_MAGIC_VAL	0x010000edu
+#define EDU_REG_INVERT	0x04
+
+#define FAIL_IF(cond, msg...)                  \
+	do {                                   \
+		if (cond) {                    \
+			printf("\n\nFAIL:\t"); \
+			printf(msg);           \
+			exit(1);               \
+		}                              \
+	} while (0)
+
+static int vfio_setup(int groupnr, char *rid_str,
+		      struct vfio_region_info *out_mappable_regions,
+		      int nr_regions, int *out_nr_regions, int *out_vfio_cfd,
+		      int *out_vfio_devfd)
+{
+	/* Create a new container, add group to it, open device, read
+	 * resource, reset, etc.  Based on the example code in
+	 * Documentation/driver-api/vfio.rst
+	 */
+
+	int container = open("/dev/vfio/vfio", O_RDWR);
+
+	int r = ioctl(container, VFIO_GET_API_VERSION);
+
+	if (r != VFIO_API_VERSION) {
+		/* Unknown API version */
+		printf("-E- Unknown API ver %d\n", r);
+		return 1;
+	}
+
+	if (ioctl(container, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU) != 1) {
+		printf("-E- Doesn't support type 1\n");
+		return 1;
+	}
+
+	char devpath[PATH_MAX];
+
+	snprintf(devpath, PATH_MAX - 1, "/dev/vfio/%d", groupnr);
+	/* Open the group */
+	int group = open(devpath, O_RDWR);
+
+	if (group < 0) {
+		printf("-E- Can't open VFIO device (group %d)\n", groupnr);
+		return 1;
+	}
+
+	/* Test the group is viable and available */
+	struct vfio_group_status group_status = { .argsz = sizeof(
+							  group_status) };
+
+	if (ioctl(group, VFIO_GROUP_GET_STATUS, &group_status)) {
+		perror("-E- Can't get group status");
+		return 1;
+	}
+
+	if (!(group_status.flags & VFIO_GROUP_FLAGS_VIABLE)) {
+		/* Group is not viable (i.e., not all devices bound to vfio) */
+		printf("-E- Group %d is not viable!\n", groupnr);
+		return 1;
+	}
+
+	/* Add the group to the container */
+	if (ioctl(group, VFIO_GROUP_SET_CONTAINER, &container)) {
+		perror("-E- Can't add group to container");
+		return 1;
+	}
+
+	/* Enable the IOMMU model we want */
+	if (ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU)) {
+		perror("-E- Can't select T1");
+		return 1;
+	}
+
+	/* Get additional IOMMU info */
+	struct vfio_iommu_type1_info iommu_info = { .argsz = sizeof(
+							    iommu_info) };
+
+	if (ioctl(container, VFIO_IOMMU_GET_INFO, &iommu_info)) {
+		perror("-E- Can't get VFIO info");
+		return 1;
+	}
+
+	/* Get a file descriptor for the device */
+	int device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, rid_str);
+
+	if (device < 0) {
+		perror("-E- Can't get device fd");
+		return 1;
+	}
+	close(group);
+
+	/* Test and setup the device */
+	struct vfio_device_info device_info = { .argsz = sizeof(device_info) };
+
+	if (ioctl(device, VFIO_DEVICE_GET_INFO, &device_info)) {
+		perror("-E- Can't get device info");
+		return 1;
+	}
+	printf("-i- %d device regions, flags 0x%x\n", device_info.num_regions,
+	       device_info.flags);
+
+	/* Regions are BAR0-5 then ROM, config, VGA */
+	int out_region = 0;
+
+	for (int i = 0; i < device_info.num_regions; i++) {
+		struct vfio_region_info reg = { .argsz = sizeof(reg) };
+
+		reg.index = i;
+
+		if (ioctl(device, VFIO_DEVICE_GET_REGION_INFO, &reg)) {
+			/* We expect EINVAL if there's no VGA region */
+			printf("-W- Region %d: ERROR %d\n", i, errno);
+		} else {
+			printf("-i- Region %d: flags 0x%08x (%c%c%c), cap_offs %d, size 0x%llx, offs 0x%llx\n",
+			       i, reg.flags,
+			       (reg.flags & VFIO_REGION_INFO_FLAG_READ) ? 'R' :
+									  '-',
+			       (reg.flags & VFIO_REGION_INFO_FLAG_WRITE) ? 'W' :
+									   '-',
+			       (reg.flags & VFIO_REGION_INFO_FLAG_MMAP) ? 'M' :
+									  '-',
+			       reg.cap_offset, reg.size, reg.offset);
+
+			if ((reg.flags & VFIO_REGION_INFO_FLAG_MMAP) &&
+			    (out_region < nr_regions))
+				out_mappable_regions[out_region++] = reg;
+		}
+	}
+	*out_nr_regions = out_region;
+
+#ifdef THERE_ARE_NO_IRQS_YET
+	for (int i = 0; i < device_info.num_irqs; i++) {
+		struct vfio_irq_info irq = { .argsz = sizeof(irq) };
+
+		irq.index = i;
+
+		ioctl(device, VFIO_DEVICE_GET_IRQ_INFO, &irq);
+
+		/* Setup IRQs... eventfds, VFIO_DEVICE_SET_IRQS */
+	}
+#endif
+	/* Gratuitous device reset and go... */
+	if (ioctl(device, VFIO_DEVICE_RESET))
+		perror("-W- Can't reset device (continuing)");
+
+	*out_vfio_cfd = container;
+	*out_vfio_devfd = device;
+
+	return 0;
+}
+
+static int vfio_feature_present(int dev_fd, uint32_t feature)
+{
+	struct vfio_device_feature probeftr = {
+		.argsz = sizeof(probeftr),
+		.flags = VFIO_DEVICE_FEATURE_PROBE | VFIO_DEVICE_FEATURE_GET |
+			 feature,
+	};
+	return ioctl(dev_fd, VFIO_DEVICE_FEATURE, &probeftr) == 0;
+}
+
+static int vfio_create_dmabuf(int dev_fd, uint32_t region, uint64_t offset,
+			      uint64_t length)
+{
+	uint64_t ftrbuf
+		[ROUND_UP(sizeof(struct vfio_device_feature) +
+				  sizeof(struct vfio_device_feature_dma_buf) +
+				  sizeof(struct vfio_region_dma_range),
+			  8) /
+		 8];
+
+	struct vfio_device_feature *f = (struct vfio_device_feature *)ftrbuf;
+	struct vfio_device_feature_dma_buf *db =
+		(struct vfio_device_feature_dma_buf *)f->data;
+	struct vfio_region_dma_range *range =
+		(struct vfio_region_dma_range *)db->dma_ranges;
+
+	f->argsz = sizeof(ftrbuf);
+	f->flags = VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_DMA_BUF;
+	db->region_index = region;
+	db->open_flags = O_RDWR | O_CLOEXEC;
+	db->flags = 0;
+	db->nr_ranges = 1;
+	range->offset = offset;
+	range->length = length;
+
+	return ioctl(dev_fd, VFIO_DEVICE_FEATURE, &ftrbuf);
+}
+
+/* As above, but try multiple ranges in one dmabuf */
+static int vfio_create_dmabuf_dual(int dev_fd, uint32_t region,
+				   uint64_t offset0, uint64_t length0,
+				   uint64_t offset1, uint64_t length1)
+{
+	uint64_t ftrbuf
+		[ROUND_UP(sizeof(struct vfio_device_feature) +
+				  sizeof(struct vfio_device_feature_dma_buf) +
+				  (sizeof(struct vfio_region_dma_range) * 2),
+			  8) /
+		 8];
+
+	struct vfio_device_feature *f = (struct vfio_device_feature *)ftrbuf;
+	struct vfio_device_feature_dma_buf *db =
+		(struct vfio_device_feature_dma_buf *)f->data;
+	struct vfio_region_dma_range *range =
+		(struct vfio_region_dma_range *)db->dma_ranges;
+
+	f->argsz = sizeof(ftrbuf);
+	f->flags = VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_DMA_BUF;
+	db->region_index = region;
+	db->open_flags = O_RDWR | O_CLOEXEC;
+	db->flags = 0;
+	db->nr_ranges = 2;
+	range[0].offset = offset0;
+	range[0].length = length0;
+	range[1].offset = offset1;
+	range[1].length = length1;
+
+	return ioctl(dev_fd, VFIO_DEVICE_FEATURE, &ftrbuf);
+}
+
+static volatile uint32_t *mmap_resource_aligned(size_t size,
+						unsigned long align, int fd,
+						unsigned long offset)
+{
+	void *v;
+
+	if (align <= getpagesize()) {
+		v = mmap(0, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd,
+			 offset);
+		FAIL_IF(v == MAP_FAILED,
+			"Can't mmap fd %d (size 0x%lx, offset 0x%lx), %d\n", fd,
+			size, offset, errno);
+	} else {
+		size_t resv_size = size + align;
+		void *resv =
+			mmap(0, resv_size, 0, MAP_PRIVATE | MAP_ANON, -1, 0);
+		FAIL_IF(resv == MAP_FAILED,
+			"Can't mmap reservation, size 0x%lx, %d\n", resv_size,
+			errno);
+
+		uintptr_t pos = ((uintptr_t)resv + (align - 1)) & ~(align - 1);
+
+		v = mmap((void *)pos, size, PROT_READ | PROT_WRITE,
+			 MAP_SHARED | MAP_FIXED, fd, offset);
+		FAIL_IF(v == MAP_FAILED,
+			"Can't mmap-fixed fd %d (size 0x%lx, offset 0x%lx), %d\n",
+			fd, size, offset, errno);
+		madvise((void *)v, size, MADV_HUGEPAGE);
+
+		/* Tidy */
+		if (pos > (uintptr_t)resv)
+			munmap(resv, pos - (uintptr_t)resv);
+		if (pos + size < (uintptr_t)resv + resv_size)
+			munmap((void *)pos + size,
+			       (uintptr_t)resv + resv_size - (pos + size));
+	}
+
+	return (volatile uint32_t *)v;
+}
+
+static volatile uint32_t *mmap_resource(size_t size, int fd,
+					unsigned long offset)
+{
+	return mmap_resource_aligned(size, getpagesize(), fd, offset);
+}
+
+static void check_mmio(volatile uint32_t *base)
+{
+	static uint32_t magic = 0xdeadbeef;
+	uint32_t v;
+
+	printf("-i- MMIO check: ");
+
+	/* Trivial MMIO */
+	v = base[EDU_REG_MAGIC / 4];
+	FAIL_IF(v != EDU_MAGIC_VAL,
+		"Magic value %08x incorrect, BAR map bad?\n", v);
+
+	base[EDU_REG_INVERT / 4] = magic;
+	v = base[EDU_REG_INVERT / 4];
+	FAIL_IF(v != ~magic, "Inverterizer value %08x bad (should be %08x)\n",
+		v, ~magic);
+	printf("OK\n");
+
+	magic = (magic << 1) ^ (magic >> 1) ^ (magic << 7);
+}
+
+static int revoke_dmabuf(int dev_fd, int dmabuf_fd)
+{
+	struct vfio_pci_dmabuf_revoke dmabuf_rev = {
+		.argsz = sizeof(dmabuf_rev),
+		.dmabuf_fd = dmabuf_fd,
+	};
+	return ioctl(dev_fd, VFIO_DEVICE_PCI_DMABUF_REVOKE, &dmabuf_rev);
+}
+
+static jmp_buf jmpbuf;
+
+static void sighandler(int sig)
+{
+	printf("*** Signal %d ***\n", sig);
+	siglongjmp(jmpbuf, sig);
+}
+
+static void setup_signals(void)
+{
+	struct sigaction sa = {
+		.sa_handler = sighandler,
+		.sa_flags = 0,
+	};
+
+	sigaction(SIGBUS, &sa, NULL);
+}
+
+static int vfio_dmabuf_test(int groupnr, char *rid_str)
+{
+	/* Only expecting one or two regions */
+	struct vfio_region_info bar_region[2];
+	int num_regions = 0;
+	int container_fd, dev_fd;
+	int r = vfio_setup(groupnr, rid_str, &bar_region[0], 2, &num_regions,
+			   &container_fd, &dev_fd);
+
+	FAIL_IF(r, "VFIO setup failed\n");
+	FAIL_IF(!vfio_feature_present(dev_fd, VFIO_DEVICE_FEATURE_DMA_BUF),
+		"VFIO DMABUF support not available\n");
+
+	printf("-i- Container fd %d, device fd %d, and got DMA_BUF\n",
+	       container_fd, dev_fd);
+
+	setup_signals();
+
+	////////////////////////////////////////////////////////////////////////////////
+
+	/* Real basics:	 create DMABUF, and mmap it, and access MMIO through it.
+	 * Do this for 2nd BAR if present, too (just plain memory).
+	 */
+	printf("\nTEST: Create DMABUF, map it\n");
+	int bar_db_fd = vfio_create_dmabuf(dev_fd, /* region */ 0,
+					   /* offset */ 0, bar_region[0].size);
+	FAIL_IF(bar_db_fd < 0, "Can't create DMABUF, %d\n", errno);
+
+	volatile uint32_t *dbbar0 =
+		mmap_resource(bar_region[0].size, bar_db_fd, 0);
+
+	printf("-i- Mapped DMABUF BAR0 at %p+0x%llx\n", dbbar0,
+	       bar_region[0].size);
+	check_mmio(dbbar0);
+
+	/* TEST: Map the traditional VFIO one _second_; it should still work. */
+	printf("\nTEST: Map the regular VFIO BAR\n");
+	volatile uint32_t *vfiobar =
+		mmap_resource(bar_region[0].size, dev_fd, bar_region[0].offset);
+
+	printf("-i- Mapped VFIO BAR0 at %p+0x%llx\n", vfiobar,
+	       bar_region[0].size);
+	check_mmio(vfiobar);
+
+	/* Test plan:
+	 *
+	 * - Revoke the first DMABUF, check for fault
+	 * - Check VFIO BAR access still works
+	 * - Revoke first DMABUF fd again: -EBADFD
+	 * - create new DMABUF for same (previously-revoked) region: accessible
+	 *
+	 * - Create overlapping DMABUFs: map success, maps alias OK
+	 * - Create a second mapping of the second DMABUF, maps alias OK
+	 * - Destroy one by revoking through a dup()ed fd: check mapping revoked
+	 * - Check original is still accessible
+	 *
+	 * If we have a larger (>4K of accessible stuff!) second BAR resource:
+	 * - Map it, create an overlapping alias with offset != 0
+	 * - Check alias/offset is sane
+	 *
+	 * Last:
+	 * - close container_fd and dev_fd: check DMABUF mapping revoked
+	 * - try revoking a non-DMABUF fd: -EINVAL
+	 */
+
+	printf("\nTEST: Revocation of first DMABUF\n");
+	r = revoke_dmabuf(dev_fd, bar_db_fd);
+	FAIL_IF(r != 0, "Can't revoke: %d\n", errno);
+
+	if (sigsetjmp(jmpbuf, 1) == 0) {
+		// Try an access: expect BOOM
+		check_mmio(dbbar0);
+		FAIL_IF(true, "Expecting fault after revoke!\n");
+	}
+	printf("-i- Revoked OK\n");
+
+	printf("\nTEST: Access through VFIO-mapped region still works\n");
+	if (sigsetjmp(jmpbuf, 1) == 0)
+		check_mmio(vfiobar);
+	else
+		FAIL_IF(true, "Expecting VFIO-mapped BAR to still work!\n");
+
+	printf("\nTEST: Double-revoke\n");
+	r = revoke_dmabuf(dev_fd, bar_db_fd);
+	FAIL_IF(r != -1 || errno != EBADFD,
+		"Expecting 2nd revoke to give EBADFD, got %d errno %d\n", r,
+		errno);
+	printf("-i- Correctly failed second revoke\n");
+
+	printf("\nTEST: Can't mmap() revoked DMABUF\n");
+	void *dbfail = mmap(0, bar_region[0].size, PROT_READ | PROT_WRITE,
+			    MAP_SHARED, bar_db_fd, 0);
+	FAIL_IF(dbfail != MAP_FAILED, "mmap() should fail\n");
+	printf("-i- OK\n");
+
+	printf("\nTEST: Recreate new DMABUF for previously-revoked region\n");
+	int bar_db_fd_2 = vfio_create_dmabuf(
+		dev_fd, /* region */ 0, /* offset */ 0, bar_region[0].size);
+	FAIL_IF(bar_db_fd_2 < 0, "Can't create DMABUF, %d\n", errno);
+
+	volatile uint32_t *dbbar0_2 =
+		mmap_resource(bar_region[0].size, bar_db_fd_2, 0);
+
+	printf("-i- Mapped 2nd DMABUF BAR0 at %p+0x%llx\n", dbbar0_2,
+	       bar_region[0].size);
+	check_mmio(dbbar0_2);
+
+	munmap((void *)dbbar0, bar_region[0].size);
+	close(bar_db_fd);
+
+	printf("\nTEST: Create aliasing/overlapping DMABUF\n");
+	int bar_db_fd_3 = vfio_create_dmabuf(
+		dev_fd, /* region */ 0, /* offset */ 0, bar_region[0].size);
+	FAIL_IF(bar_db_fd_3 < 0, "Can't create DMABUF, %d\n", errno);
+
+	volatile uint32_t *dbbar0_3 =
+		mmap_resource(bar_region[0].size, bar_db_fd_3, 0);
+
+	printf("-i- Mapped 3rd DMABUF BAR0 at %p+0x%llx\n", dbbar0_3,
+	       bar_region[0].size);
+	check_mmio(dbbar0_3);
+
+	/* Basic aliasing check: Write value through 2nd, read back through 3rd */
+	uint32_t v;
+
+	dbbar0_2[EDU_REG_INVERT / 4] = 0xfacecace;
+	v = dbbar0_3[EDU_REG_INVERT / 4];
+	FAIL_IF(v != ~0xfacecace,
+		"Alias inverted MMIO value %08x bad (should be %08x)\n", v,
+		~0xfacecace);
+	printf("-i- Aliasing DMABUF OK\n");
+
+	printf("\nTEST: Create a double-mapping of DMABUF\n");
+	/* Create another mmap of the existing aliasing DMABUF fd */
+	volatile uint32_t *dbbar0_3_2 =
+		mmap_resource(bar_region[0].size, bar_db_fd_3, 0);
+
+	printf("-i- Mapped 3rd DMABUF BAR0 _again_ at %p+0x%llx\n", dbbar0_3_2,
+	       bar_region[0].size);
+	/* Can we see the value we wrote before? */
+	v = dbbar0_3_2[EDU_REG_INVERT / 4];
+	FAIL_IF(v != ~0xfacecace,
+		"Alias alias inverted MMIO value %08x bad (should be %08x)\n",
+		v, ~0xfacecace);
+	check_mmio(dbbar0_3_2);
+
+	printf("\nTEST: revoke aliasing DMABUF through dup()ed fd\n");
+	int dup_dbfd3 = dup(bar_db_fd_3);
+
+	r = revoke_dmabuf(dev_fd, dup_dbfd3);
+	FAIL_IF(r != 0, "Can't revoke: %d\n", errno);
+
+	/* Both of the mmap()s made should now be gone */
+	if (sigsetjmp(jmpbuf, 1) == 0) {
+		check_mmio(dbbar0_3);
+		FAIL_IF(true, "Expecting fault on 1st mmap after revoke!\n");
+	}
+
+	if (sigsetjmp(jmpbuf, 1) == 0) {
+		check_mmio(dbbar0_3_2);
+		FAIL_IF(true, "Expecting fault on 2nd mmap after revoke!\n");
+	}
+	printf("-i- Both aliasing DMABUF mappings revoked OK\n");
+
+	close(dup_dbfd3);
+	close(bar_db_fd_3);
+	munmap((void *)dbbar0_3, bar_region[0].size);
+	munmap((void *)dbbar0_3_2, bar_region[0].size);
+
+	/* And finally, although the aliasing DMABUF is gone, access
+	 * through the original one should still work:
+	 */
+	if (sigsetjmp(jmpbuf, 1) == 0)
+		check_mmio(dbbar0_2);
+	else
+		FAIL_IF(true,
+			"Expecting original DMABUF mapping to still work!\n");
+	printf("-i- Aliasing DMABUF removal OK, original still accessible\n");
+
+	/* If we're attached to a hacked/extended QEMU EDU device with
+	 * a large memory region 1 then we can test things like
+	 * offsets/aliasing.
+	 */
+	if (num_regions >= 2) {
+		printf("\nTEST: Second BAR: test overlapping+offset DMABUF\n");
+
+		printf("-i- Region 1 DMABUF: offset %llx, size %llx\n",
+		       bar_region[1].offset, bar_region[1].size);
+		int bar1_db_fd =
+			vfio_create_dmabuf(dev_fd, 1, 0, bar_region[1].size);
+
+		FAIL_IF(bar1_db_fd < 0, "Can't create DMABUF, %d\n", errno);
+
+		volatile uint32_t *dbbar1 = mmap_resource_aligned(
+			bar_region[1].size, MiB(32), bar1_db_fd, 0);
+		printf("-i- Mapped DMABUF Region 1 at %p+0x%llx\n", dbbar1,
+		       bar_region[1].size);
+
+		/* Init with known values */
+		for (unsigned long i = 0; i < (bar_region[1].size);
+		     i += getpagesize())
+			dbbar1[i / 4] = 0xca77face ^ i;
+
+		v = dbbar1[0];
+		FAIL_IF(v != 0xca77face,
+			"DB Region 1 read: Magic value %08x incorrect\n", v);
+		printf("-i- DB Region 1 read: Magic: 0x%08x\n", v);
+
+		/* TEST: Overlap/aliasing; map same BAR with a range
+		 * offset > 0.  Also test disjoint/multi-range DMABUFs
+		 * by creating a second range.  This appears as one
+		 * contiguous VA range mapped to a first BAR range
+		 * (starting from range0_offset), then skipping a few
+		 * physical pages, then a second range (starting at
+		 * range1_offset).
+		 */
+		unsigned long range0_offset = getpagesize() * 3;
+		unsigned long range1_skip_pages = 5;
+		unsigned long range1_skip = getpagesize() * range1_skip_pages;
+		unsigned long range_size =
+			(bar_region[1].size - range0_offset - range1_skip) / 2;
+		unsigned long range1_offset =
+			range0_offset + range_size + range1_skip;
+		unsigned long map_size = range_size * 2;
+
+		printf("\nTEST: Second BAR aliasing mapping, two ranges size 0x%lx:\n\t\t0x%lx-0x%lx, 0x%lx-0x%lx\n",
+		       range_size, range0_offset, range0_offset + range_size,
+		       range1_offset, range1_offset + range_size);
+
+		int bar1_2_db_fd = vfio_create_dmabuf_dual(
+			dev_fd, 1, range0_offset, range_size, range1_offset,
+			range_size);
+		FAIL_IF(bar1_2_db_fd < 0, "Can't create DMABUF, %d\n", errno);
+
+		volatile uint32_t *dbbar1_2 =
+			mmap_resource(map_size, bar1_2_db_fd, 0);
+
+		printf("-i- Mapped DMABUF Region 1 alias at %p+0x%lx\n",
+		       dbbar1_2, map_size);
+		FAIL_IF(dbbar1_2[0] != dbbar1[range0_offset / 4],
+			"slice2 value mismatch\n");
+
+		dbbar1[(range0_offset + 4) / 4] = 0xfacef00d;
+		/* Check we can see the value written above at +offset
+		 * from offset 0 of this mapping (since the DMABUF
+		 * itself is offsetted):
+		 */
+		v = dbbar1_2[4 / 4];
+		FAIL_IF(v != 0xfacef00d,
+			"DB Region 1 alias read: Magic value %08x incorrect\n",
+			v);
+		printf("-i- DB Region 1 alias read: Magic 0x%08x, OK\n", v);
+
+		/* Read back the known values across the two
+		 * sub-ranges of the dbbar1_2 mapping, accounting for
+		 * the physical pages skipped between them
+		 */
+		for (unsigned long i = 0; i < range_size; i += getpagesize()) {
+			unsigned long t = i + range0_offset;
+			uint32_t want = (0xca77face ^ t);
+
+			v = dbbar1_2[i / 4];
+			FAIL_IF(v != want,
+				"Expected %08x (got %08x) from range0 +%08lx (real %08lx)\n",
+				want, v, i, t);
+		}
+		for (unsigned long i = range_size; i < (range_size * 2);
+		     i += getpagesize()) {
+			unsigned long t = i + range1_offset - range_size;
+			uint32_t want = (0xca77face ^ t);
+
+			v = dbbar1_2[i / 4];
+			FAIL_IF(v != want,
+				"Expected %08x (got %08x) from range1 +%08lx (real %08lx)\n",
+				want, v, i, t);
+		}
+
+		printf("\nTEST: Third BAR aliasing mapping, testing mmap() non-zero offset:\n");
+
+		unsigned long smaller = range_size - 0x1000;
+		volatile uint32_t *dbbar1_3 = mmap_resource_aligned(
+			smaller, MiB(32), bar1_2_db_fd, range_size);
+		printf("-i- Mapped DMABUF Region 1 range 1 alias at %p+0x%lx\n",
+		       dbbar1_3, smaller);
+
+		for (unsigned long i = 0; i < smaller; i += getpagesize()) {
+			unsigned long t = i + range1_offset;
+			uint32_t want = (0xca77face ^ t);
+
+			v = dbbar1_3[i / 4];
+			FAIL_IF(v != want,
+				"Expected %08x (got %08x) from 3rd range1 +%08lx (real %08lx)\n",
+				want, v, i, t);
+		}
+		printf("-i- mmap offset OK\n");
+
+		/* TODO: If we can observe hugepages (mechanically,
+		 * rather than human reading debug), we can test
+		 * interesting alignment cases for the PFN search:
+		 *
+		 * - Deny hugepages at start/end of an mmap() that
+		 *   starts/ends at non-HP-aligned addresses
+		 *   (e.g. first pages are small, middle is fully
+		 *   aligned in VA and PFN so 2M, and buffer finishes
+		 *   before 2M boundary, so last pages are small).
+		 *
+		 * - Everything aligned nicely except the mmap() size
+		 *   is <2MB, so hugepage denied due to straddling
+		 *   end.
+		 *
+		 * - Buffer offsets into BAR not aligned, so no huge
+		 *   mappings even if mmap() is perfectly aligned.
+		 */
+
+		/* Check that access after DMABUF fd close still works
+		 * (VMA still holds refcount, obvs!)
+		 */
+		close(bar1_2_db_fd);
+		if (sigsetjmp(jmpbuf, 1) == 0)
+			v = dbbar1_2[0x4 / 4];
+		else
+			FAIL_IF(true,
+				"Expecting original DMABUF mapping to still work!\n");
+		printf("-i- DB Region 1 alias read 2: Magic 0x%08x, OK\n", v);
+		printf("-i- Offset check OK\n");
+	}
+
+	printf("\nTEST: Shutdown: close VFIO container/device fds, check DMABUF gone\n");
+
+	/* Final use of dev_fd: use it to try to revoke a non-DMABUF fd: */
+	r = revoke_dmabuf(dev_fd, 1);
+	FAIL_IF(r != -1 || errno != EINVAL,
+		"Expecting revoke of stdout to give EINVAL, got %d errno %d\n",
+		r, errno);
+	printf("-i- Correctly failed final revoke\n");
+
+	/* Closing all uses of dev_fd (including the VFIO BAR mmap()!)
+	 * will revoke the DMABUF; even though the DMABUF fd might
+	 * remain open, the mapping itself is zapped. Start with a
+	 * plain close (before unmapping the VFIO BAR mapping):
+	 */
+	close(dev_fd);
+	close(container_fd);
+	printf("-i- VFIO fds closed\n");
+
+	if (sigsetjmp(jmpbuf, 1) == 0)
+		check_mmio(dbbar0_2);
+	else
+		FAIL_IF(true,
+			"Expecting DMABUF mapping to still work if VFIO mapping still live!\n");
+
+	if (sigsetjmp(jmpbuf, 1) == 0)
+		check_mmio(vfiobar);
+	else
+		FAIL_IF(true,
+			"Expecting VFIO BAR mapping to still work after fd close!\n");
+
+	munmap((void *)vfiobar, bar_region[0].size);
+	printf("-i- VFIO BAR unmapped\n");
+
+	/* The final reference via VFIO should now be gone, and the
+	 * DMABUF should now be destroyed.  The mapping of it should
+	 * be inaccessible:
+	 */
+	if (sigsetjmp(jmpbuf, 1) == 0) {
+		check_mmio(dbbar0_2);
+		FAIL_IF(true,
+			"Expecting DMABUF mapping to fault after VFIO fd shutdown!\n");
+	}
+	printf("-i- DMABUF mappings inaccessible\n");
+
+	/* Ensure we can't mmap() DMABUF for closed device */
+	void *dbfail2 = mmap(0, bar_region[0].size, PROT_READ | PROT_WRITE,
+			     MAP_SHARED, bar_db_fd_2, 0);
+	FAIL_IF(dbfail2 != MAP_FAILED, "mmap() should fail\n");
+	printf("-i- Can't mmap DMABUF for closed device, OK\n");
+
+	munmap((void *)dbbar0_2, bar_region[0].size);
+	close(bar_db_fd_2);
+
+	printf("\nPASS\n");
+
+	return 0;
+}
+
+static void usage(char *me)
+{
+	printf("Usage:\t%s -g <group_number> -r <RID/BDF>\n"
+	       "\n"
+	       "\t\tGroup is found via device path, e.g. cat /sys/bus/pci/devices/0000:03:1d.0/iommu_group\n"
+	       "\t\tRID is of the form 0000:03:1d.0\n"
+	       "\n",
+	       me);
+}
+
+int main(int argc, char *argv[])
+{
+	/* Get args: IOMMU group and BDF/path */
+	int groupnr = -1;
+	char *rid_str = NULL;
+	int arg;
+
+	while ((arg = getopt(argc, argv, "g:r:h")) != -1) {
+		switch (arg) {
+		case 'g':
+			groupnr = atoi(optarg);
+			break;
+
+		case 'r':
+			rid_str = strdup(optarg);
+			break;
+		case 'h':
+		default:
+			usage(argv[0]);
+			return 1;
+		}
+	}
+
+	if (rid_str == NULL || groupnr == -1) {
+		usage(argv[0]);
+		return 1;
+	}
+
+	printf("-i- Using group number %d, RID '%s'\n", groupnr, rid_str);
+
+	return vfio_dmabuf_test(groupnr, rid_str);
+}
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [RFC v2 PATCH 06/10] vfio/pci: Remove vfio_pci_zap_bars()
  2026-03-12 18:46 ` [RFC v2 PATCH 06/10] vfio/pci: Remove vfio_pci_zap_bars() Matt Evans
@ 2026-03-13  9:12   ` Christian König
  0 siblings, 0 replies; 17+ messages in thread
From: Christian König @ 2026-03-13  9:12 UTC (permalink / raw)
  To: Matt Evans, Alex Williamson, Leon Romanovsky, Jason Gunthorpe,
	Alex Mastro, Mahmoud Adam, David Matlack
  Cc: Björn Töpel, Sumit Semwal, Kevin Tian, Ankit Agrawal,
	Pranjal Shrivastava, Alistair Popple, Vivek Kasireddy,
	linux-kernel, linux-media, dri-devel, linaro-mm-sig, kvm

On 3/12/26 19:46, Matt Evans wrote:
> vfio_pci_zap_bars() and the wrapper
> vfio_pci_zap_and_down_write_memory_lock() are redundant as of
> "vfio/pci: Convert BAR mmap() to use a DMABUF".  The DMABUFs used for
> BAR mappings already zap PTEs via the existing
> vfio_pci_dma_buf_move(), which notifies changes to the BAR space
> (e.g. around reset).
> 
> Remove the old functions, and the various points needing to zap BARs
> become slightly cleaner.

Not a full review, but it looks like you now take the DMA-buf reservation lock while holding vdev->memory_lock.

I strongly recommend enabling lockdep while testing that, just to be on the safe side that all locks are taken in a consistent order.

Regards,
Christian.

> 
> Signed-off-by: Matt Evans <mattev@meta.com>
> ---
>  drivers/vfio/pci/vfio_pci_config.c | 18 ++++++------------
>  drivers/vfio/pci/vfio_pci_core.c   | 30 +++++++-----------------------
>  drivers/vfio/pci/vfio_pci_priv.h   |  1 -
>  3 files changed, 13 insertions(+), 36 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
> index b4e39253f98d..c7ed28be1104 100644
> --- a/drivers/vfio/pci/vfio_pci_config.c
> +++ b/drivers/vfio/pci/vfio_pci_config.c
> @@ -590,12 +590,9 @@ static int vfio_basic_config_write(struct vfio_pci_core_device *vdev, int pos,
>  		virt_mem = !!(le16_to_cpu(*virt_cmd) & PCI_COMMAND_MEMORY);
>  		new_mem = !!(new_cmd & PCI_COMMAND_MEMORY);
>  
> -		if (!new_mem) {
> -			vfio_pci_zap_and_down_write_memory_lock(vdev);
> +		down_write(&vdev->memory_lock);
> +		if (!new_mem)
>  			vfio_pci_dma_buf_move(vdev, true);
> -		} else {
> -			down_write(&vdev->memory_lock);
> -		}
>  
>  		/*
>  		 * If the user is writing mem/io enable (new_mem/io) and we
> @@ -712,12 +709,9 @@ static int __init init_pci_cap_basic_perm(struct perm_bits *perm)
>  static void vfio_lock_and_set_power_state(struct vfio_pci_core_device *vdev,
>  					  pci_power_t state)
>  {
> -	if (state >= PCI_D3hot) {
> -		vfio_pci_zap_and_down_write_memory_lock(vdev);
> +	down_write(&vdev->memory_lock);
> +	if (state >= PCI_D3hot)
>  		vfio_pci_dma_buf_move(vdev, true);
> -	} else {
> -		down_write(&vdev->memory_lock);
> -	}
>  
>  	vfio_pci_set_power_state(vdev, state);
>  	if (__vfio_pci_memory_enabled(vdev))
> @@ -908,7 +902,7 @@ static int vfio_exp_config_write(struct vfio_pci_core_device *vdev, int pos,
>  						 &cap);
>  
>  		if (!ret && (cap & PCI_EXP_DEVCAP_FLR)) {
> -			vfio_pci_zap_and_down_write_memory_lock(vdev);
> +			down_write(&vdev->memory_lock);
>  			vfio_pci_dma_buf_move(vdev, true);
>  			pci_try_reset_function(vdev->pdev);
>  			if (__vfio_pci_memory_enabled(vdev))
> @@ -993,7 +987,7 @@ static int vfio_af_config_write(struct vfio_pci_core_device *vdev, int pos,
>  						&cap);
>  
>  		if (!ret && (cap & PCI_AF_CAP_FLR) && (cap & PCI_AF_CAP_TP)) {
> -			vfio_pci_zap_and_down_write_memory_lock(vdev);
> +			down_write(&vdev->memory_lock);
>  			vfio_pci_dma_buf_move(vdev, true);
>  			pci_try_reset_function(vdev->pdev);
>  			if (__vfio_pci_memory_enabled(vdev))
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 41224efa58d8..9e9ad97c2f7f 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -319,7 +319,7 @@ static int vfio_pci_runtime_pm_entry(struct vfio_pci_core_device *vdev,
>  	 * The vdev power related flags are protected with 'memory_lock'
>  	 * semaphore.
>  	 */
> -	vfio_pci_zap_and_down_write_memory_lock(vdev);
> +	down_write(&vdev->memory_lock);
>  	vfio_pci_dma_buf_move(vdev, true);
>  
>  	if (vdev->pm_runtime_engaged) {
> @@ -1229,7 +1229,7 @@ static int vfio_pci_ioctl_reset(struct vfio_pci_core_device *vdev,
>  	if (!vdev->reset_works)
>  		return -EINVAL;
>  
> -	vfio_pci_zap_and_down_write_memory_lock(vdev);
> +	down_write(&vdev->memory_lock);
>  
>  	/*
>  	 * This function can be invoked while the power state is non-D0. If
> @@ -1613,22 +1613,6 @@ ssize_t vfio_pci_core_write(struct vfio_device *core_vdev, const char __user *bu
>  }
>  EXPORT_SYMBOL_GPL(vfio_pci_core_write);
>  
> -static void vfio_pci_zap_bars(struct vfio_pci_core_device *vdev)
> -{
> -	struct vfio_device *core_vdev = &vdev->vdev;
> -	loff_t start = VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_BAR0_REGION_INDEX);
> -	loff_t end = VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_ROM_REGION_INDEX);
> -	loff_t len = end - start;
> -
> -	unmap_mapping_range(core_vdev->inode->i_mapping, start, len, true);
> -}
> -
> -void vfio_pci_zap_and_down_write_memory_lock(struct vfio_pci_core_device *vdev)
> -{
> -	down_write(&vdev->memory_lock);
> -	vfio_pci_zap_bars(vdev);
> -}
> -
>  u16 vfio_pci_memory_lock_and_enable(struct vfio_pci_core_device *vdev)
>  {
>  	u16 cmd;
> @@ -2487,10 +2471,11 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
>  		}
>  
>  		/*
> -		 * Take the memory write lock for each device and zap BAR
> -		 * mappings to prevent the user accessing the device while in
> -		 * reset.  Locking multiple devices is prone to deadlock,
> -		 * runaway and unwind if we hit contention.
> +		 * Take the memory write lock for each device and
> +		 * revoke all DMABUFs, which will prevent any access
> +		 * to the device while in reset.  Locking multiple
> +		 * devices is prone to deadlock, runaway and unwind if
> +		 * we hit contention.
>  		 */
>  		if (!down_write_trylock(&vdev->memory_lock)) {
>  			ret = -EBUSY;
> @@ -2498,7 +2483,6 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
>  		}
>  
>  		vfio_pci_dma_buf_move(vdev, true);
> -		vfio_pci_zap_bars(vdev);
>  	}
>  
>  	if (!list_entry_is_head(vdev,
> diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
> index 37ece9b4b5bd..e201c96bbb14 100644
> --- a/drivers/vfio/pci/vfio_pci_priv.h
> +++ b/drivers/vfio/pci/vfio_pci_priv.h
> @@ -78,7 +78,6 @@ void vfio_config_free(struct vfio_pci_core_device *vdev);
>  int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev,
>  			     pci_power_t state);
>  
> -void vfio_pci_zap_and_down_write_memory_lock(struct vfio_pci_core_device *vdev);
>  u16 vfio_pci_memory_lock_and_enable(struct vfio_pci_core_device *vdev);
>  void vfio_pci_memory_unlock_and_restore(struct vfio_pci_core_device *vdev,
>  					u16 cmd);


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC v2 PATCH 00/10] vfio/pci: Add mmap() for DMABUFs
  2026-03-12 18:45 [RFC v2 PATCH 00/10] vfio/pci: Add mmap() for DMABUFs Matt Evans
                   ` (9 preceding siblings ...)
  2026-03-12 18:46 ` [RFC v2 PATCH 10/10] [RFC ONLY] selftests: vfio: Add standalone vfio_dmabuf_mmap_test Matt Evans
@ 2026-03-13  9:21 ` Christian König
  2026-03-13 13:28   ` Matt Evans
  10 siblings, 1 reply; 17+ messages in thread
From: Christian König @ 2026-03-13  9:21 UTC (permalink / raw)
  To: Matt Evans, Alex Williamson, Leon Romanovsky, Jason Gunthorpe,
	Alex Mastro, Mahmoud Adam, David Matlack
  Cc: Björn Töpel, Sumit Semwal, Kevin Tian, Ankit Agrawal,
	Pranjal Shrivastava, Alistair Popple, Vivek Kasireddy,
	linux-kernel, linux-media, dri-devel, linaro-mm-sig, kvm

On 3/12/26 19:45, Matt Evans wrote:
> Hi all,
> 
> 
> There were various suggestions in the September 2025 thread "[TECH
> TOPIC] vfio, iommufd: Enabling user space drivers to vend more
> granular access to client processes" [0], and LPC discussions, around
> improving the situation for multi-process userspace driver designs.
> This RFC series implements some of these ideas.
> 
> (Thanks for feedback on v1!  Revised series, with changes noted
> inline.)
> 
> Background: Multi-process USDs
> ==============================
> 
> The userspace driver scenario discussed in that thread involves a
> primary process driving a PCIe function through VFIO/iommufd, which
> manages the function-wide ownership/lifecycle.  The function is
> designed to provide multiple distinct programming interfaces (for
> example, several independent MMIO register frames in one function),
> and the primary process delegates control of these interfaces to
> multiple independent client processes (which do the actual work).
> This scenario clearly relies on a HW design that provides appropriate
> isolation between the programming interfaces.
> 
> The two key needs are:
> 
>  1.  Mechanisms to safely delegate a subset of the device MMIO
>      resources to a client process without over-sharing wider access
>      (or influence over whole-device activities, such as reset).
> 
>  2.  Mechanisms to allow a client process to do its own iommufd
>      management w.r.t. its address space, in a way that's isolated
>      from DMA relating to other clients.
> 
> 
> mmap() of VFIO DMABUFs
> ======================
> 
> This RFC addresses #1 in "vfio/pci: Support mmap() of a VFIO DMABUF",
> implementing the proposals in [0] to add mmap() support to the
> existing VFIO DMABUF exporter.
> 
> This enables a userspace driver to define DMABUF ranges corresponding
> to sub-ranges of a BAR, and grant a given client (via a shared fd)
> the capability to access (only) those sub-ranges.  The VFIO device fds
> would be kept private to the primary process.  All the client can do
> with that fd is map (or iomap via iommufd) that specific subset of
> resources, and the impact of bugs/malice is contained.
> 
>  (We'll follow up on #2 separately, as a related-but-distinct problem.
>   PASIDs are one way to achieve per-client isolation of DMA; another
>   could be sharing of a single IOVA space via 'constrained' iommufds.)
> 
> 
> New in v2: To achieve this, the existing VFIO BAR mmap() path is
> converted to use DMABUFs behind the scenes, in "vfio/pci: Convert BAR
> mmap() to use a DMABUF" plus new helper functions, as Jason/Christian
> suggested in the v1 discussion [3].
> 
> This means:
> 
>  - Both regular and new DMABUF BAR mappings share the same vm_ops,
>    i.e.  mmap()ing DMABUFs is a smaller change on top of the existing
>    mmap().
> 
>  - The zapping of mappings occurs via vfio_pci_dma_buf_move(), and the
>    vfio_pci_zap_bars() originally paired with the _move()s can go
>    away.  Each DMABUF has a unique address_space.
> 
>  - It's a step towards future iommufd VFIO Type1 emulation
>    implementing P2P, since iommufd can now get a DMABUF from a VA that
>    it's mapping for IO; the VMAs' vm_file is that of the backing
>    DMABUF.
> 
> 
> Revocation/reclaim
> ==================
> 
> Mapping a BAR subset is useful, but the lifetime of access granted to
> a client needs to be managed well.  For example, a protocol between
> the primary process and the client can indicate when the client is
> done, and when it's safe to reuse the resources elsewhere, but cleanup
> can't practically be cooperative.
> 
> For robustness, we enable the driver to make the resources
> guaranteed-inaccessible when it chooses, so that it can re-assign them
> to other uses in future.
> 
> "vfio/pci: Permanently revoke a DMABUF on request" adds a new VFIO
> device fd ioctl, VFIO_DEVICE_PCI_DMABUF_REVOKE.  This takes a DMABUF
> fd parameter previously exported (from that device!) and permanently
> revokes the DMABUF.  This notifies/detaches importers, zaps PTEs for
> any mappings, and guarantees no future attachment/import/map/access is
> possible by any means.
> 
> A primary driver process would use this operation when the client's
> tenure ends to reclaim "loaned-out" MMIO interfaces, at which point
> the interfaces could be safely re-used.
> 
> New in v2: ioctl() on VFIO driver fd, rather than DMABUF fd.  A DMABUF
> is revoked using code common to vfio_pci_dma_buf_move(), selectively
> zapping mappings (after waiting for completion on the
> dma_buf_invalidate_mappings() request).
> 
> 
> BAR mapping access attributes
> =============================
> 
> Inspired by Alex [Mastro] and Jason's comments in [0] and Mahmoud's
> work in [1] with the goal of controlling CPU access attributes for
> VFIO BAR mappings (e.g. WC), we can decorate DMABUFs with access
> attributes that are then used by a mapping's PTEs.
> 
> I've proposed reserving a field in struct
> vfio_device_feature_dma_buf's flags to specify an attribute for its
> ranges.  Although that keeps the (UAPI) struct unchanged, it means all
> ranges in a DMABUF share the same attribute.  I feel a single
> attribute-to-mmap() relation is logical/reasonable.  An application
> can also create multiple DMABUFs to describe any BAR layout and mix of
> attributes.
> 
> 
> Tests
> =====
> 
> (Still sharing the [RFC ONLY] userspace test/demo program for context,
> not for merge.)
> 
> It illustrates & tests various map/revoke cases, but doesn't use the
> existing VFIO selftests and relies on a (tweaked) QEMU EDU function.
> I'm (still) working on integrating the scenarios into the existing
> VFIO selftests.
> 
> This code has been tested with DMABUF mappings of single/multiple
> ranges, aliasing mmap()s, aliasing ranges across DMABUFs, vm_pgoff >
> 0, revocation, and shutdown/cleanup scenarios; hugepage mappings also
> seem to work correctly.  I've also lightly tested WC mappings (by
> observing that the resulting PTEs have the correct attributes).
> 
> 
> Fin
> ===
> 
> v2 is based on next-20260310 (to build on Leon's recent series
> "vfio: Wait for dma-buf invalidation to complete" [2]).
> 
> 
> Please share your thoughts!  I'd like to de-RFC if we feel this
> approach is now fair.

I only skimmed over it, but at least offhand I couldn't find anything fundamentally wrong.

The locking order seems to change in patch #6. In general I strongly recommend enabling lockdep while testing anyway, but especially when I see such changes.

In addition to that, it might also be a good idea to have a lockdep initcall function which defines the locking order that all the VFIO code should follow.

See the function dma_resv_lockdep() for an example of how to do that. Especially with mmap() support and all the locks involved there, having something like that has proven to be good practice.
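
For illustration, a priming initcall in the style of dma_resv_lockdep() might look roughly like this. This is a sketch only: the function name is made up, and the exact set of locks worth priming for VFIO is an assumption, not existing code.

```c
/*
 * Hypothetical sketch, modeled on dma_resv_lockdep(): take throwaway
 * instances of the locks once at boot, in the intended order, so that
 * lockdep records memory_lock -> dma_resv as the only legal ordering
 * and warns about any code path that later inverts it.
 */
static int __init vfio_pci_lockdep_init(void)	/* illustrative name */
{
	struct rw_semaphore memory_lock;
	struct dma_resv obj;

	init_rwsem(&memory_lock);
	dma_resv_init(&obj);

	/* Prime the order: memory_lock is taken before the resv lock. */
	down_write(&memory_lock);
	dma_resv_lock(&obj, NULL);
	dma_resv_unlock(&obj);
	up_write(&memory_lock);

	dma_resv_fini(&obj);
	return 0;
}
subsys_initcall(vfio_pci_lockdep_init);
```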

Regards,
Christian.

> 
> 
> Many thanks,
> 
> 
> Matt
> 
> 
> 
> References:
> 
> [0]: https://lore.kernel.org/linux-iommu/20250918214425.2677057-1-amastro@fb.com/
> [1]: https://lore.kernel.org/all/20250804104012.87915-1-mngyadam@amazon.de/
> [2]: https://lore.kernel.org/linux-iommu/20260205-nocturnal-poetic-chamois-f566ad@houat/T/#m310cd07011e3a1461b6fda45e3f9b886ba76571a
> [3]: https://lore.kernel.org/all/20260226202211.929005-1-mattev@meta.com/
> 
> --------------------------------------------------------------------------------
> Changelog:
> 
> v2:  Respin based on the feedback/suggestions:
> 
> - Transform the existing VFIO BAR mmap path to also use DMABUFs behind
>   the scenes, and then simply share that code for explicitly-mapped
>   DMABUFs.
> 
> - Refactors the export itself out of vfio_pci_core_feature_dma_buf,
>   and shared by a new vfio_pci_core_mmap_prep_dmabuf helper used by
>   the regular VFIO mmap to create a DMABUF.
> 
> - Revoke buffers using a VFIO device fd ioctl
> 
> v1: https://lore.kernel.org/all/20260226202211.929005-1-mattev@meta.com/
> 
> 
> Matt Evans (10):
>   vfio/pci: Set up VFIO barmap before creating a DMABUF
>   vfio/pci: Clean up DMABUFs before disabling function
>   vfio/pci: Add helper to look up PFNs for DMABUFs
>   vfio/pci: Add a helper to create a DMABUF for a BAR-map VMA
>   vfio/pci: Convert BAR mmap() to use a DMABUF
>   vfio/pci: Remove vfio_pci_zap_bars()
>   vfio/pci: Support mmap() of a VFIO DMABUF
>   vfio/pci: Permanently revoke a DMABUF on request
>   vfio/pci: Add mmap() attributes to DMABUF feature
>   [RFC ONLY] selftests: vfio: Add standalone vfio_dmabuf_mmap_test
> 
>  drivers/vfio/pci/Kconfig                      |   3 +-
>  drivers/vfio/pci/Makefile                     |   3 +-
>  drivers/vfio/pci/vfio_pci_config.c            |  18 +-
>  drivers/vfio/pci/vfio_pci_core.c              | 123 +--
>  drivers/vfio/pci/vfio_pci_dmabuf.c            | 425 +++++++--
>  drivers/vfio/pci/vfio_pci_priv.h              |  46 +-
>  include/uapi/linux/vfio.h                     |  42 +-
>  tools/testing/selftests/vfio/Makefile         |   1 +
>  .../vfio/standalone/vfio_dmabuf_mmap_test.c   | 837 ++++++++++++++++++
>  9 files changed, 1339 insertions(+), 159 deletions(-)
>  create mode 100644 tools/testing/selftests/vfio/standalone/vfio_dmabuf_mmap_test.c
> 


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC v2 PATCH 00/10] vfio/pci: Add mmap() for DMABUFs
  2026-03-13  9:21 ` [RFC v2 PATCH 00/10] vfio/pci: Add mmap() for DMABUFs Christian König
@ 2026-03-13 13:28   ` Matt Evans
  0 siblings, 0 replies; 17+ messages in thread
From: Matt Evans @ 2026-03-13 13:28 UTC (permalink / raw)
  To: Christian König, Alex Williamson, Leon Romanovsky,
	Jason Gunthorpe, Alex Mastro, Mahmoud Adam, David Matlack
  Cc: Björn Töpel, Sumit Semwal, Kevin Tian, Ankit Agrawal,
	Pranjal Shrivastava, Alistair Popple, Vivek Kasireddy,
	linux-kernel, linux-media, dri-devel, linaro-mm-sig, kvm

Hi Christian,

On 13/03/2026 09:21, Christian König wrote:
> On 3/12/26 19:45, Matt Evans wrote:
>> Hi all,
>>
>>
>> There were various suggestions in the September 2025 thread "[TECH
>> TOPIC] vfio, iommufd: Enabling user space drivers to vend more
>> granular access to client processes" [0], and LPC discussions, around
>> improving the situation for multi-process userspace driver designs.
>> This RFC series implements some of these ideas.
>>
>> (Thanks for feedback on v1!  Revised series, with changes noted
>> inline.)
>>
>> Background: Multi-process USDs
>> ==============================
>>
>> The userspace driver scenario discussed in that thread involves a
>> primary process driving a PCIe function through VFIO/iommufd, which
>> manages the function-wide ownership/lifecycle.  The function is
>> designed to provide multiple distinct programming interfaces (for
>> example, several independent MMIO register frames in one function),
>> and the primary process delegates control of these interfaces to
>> multiple independent client processes (which do the actual work).
>> This scenario clearly relies on a HW design that provides appropriate
>> isolation between the programming interfaces.
>>
>> The two key needs are:
>>
>>  1.  Mechanisms to safely delegate a subset of the device MMIO
>>      resources to a client process without over-sharing wider access
>>      (or influence over whole-device activities, such as reset).
>>
>>  2.  Mechanisms to allow a client process to do its own iommufd
>>      management w.r.t. its address space, in a way that's isolated
>>      from DMA relating to other clients.
>>
>>
>> mmap() of VFIO DMABUFs
>> ======================
>>
>> This RFC addresses #1 in "vfio/pci: Support mmap() of a VFIO DMABUF",
>> implementing the proposals in [0] to add mmap() support to the
>> existing VFIO DMABUF exporter.
>>
>> This enables a userspace driver to define DMABUF ranges corresponding
>> to sub-ranges of a BAR, and grant a given client (via a shared fd)
>> the capability to access (only) those sub-ranges.  The VFIO device fds
>> would be kept private to the primary process.  All the client can do
>> with that fd is map (or iomap via iommufd) that specific subset of
>> resources, and the impact of bugs/malice is contained.
>>
>>  (We'll follow up on #2 separately, as a related-but-distinct problem.
>>   PASIDs are one way to achieve per-client isolation of DMA; another
>>   could be sharing of a single IOVA space via 'constrained' iommufds.)
>>
>>
>> New in v2: To achieve this, the existing VFIO BAR mmap() path is
>> converted to use DMABUFs behind the scenes, in "vfio/pci: Convert BAR
>> mmap() to use a DMABUF" plus new helper functions, as Jason/Christian
>> suggested in the v1 discussion [3].
>>
>> This means:
>>
>>  - Both regular and new DMABUF BAR mappings share the same vm_ops,
>>    i.e.  mmap()ing DMABUFs is a smaller change on top of the existing
>>    mmap().
>>
>>  - The zapping of mappings occurs via vfio_pci_dma_buf_move(), and the
>>    vfio_pci_zap_bars() originally paired with the _move()s can go
>>    away.  Each DMABUF has a unique address_space.
>>
>>  - It's a step towards future iommufd VFIO Type1 emulation
>>    implementing P2P, since iommufd can now get a DMABUF from a VA that
>>    it's mapping for IO; the VMAs' vm_file is that of the backing
>>    DMABUF.
>>
>>
>> Revocation/reclaim
>> ==================
>>
>> Mapping a BAR subset is useful, but the lifetime of access granted to
>> a client needs to be managed well.  For example, a protocol between
>> the primary process and the client can indicate when the client is
>> done, and when it's safe to reuse the resources elsewhere, but cleanup
>> can't practically be cooperative.
>>
>> For robustness, we enable the driver to make the resources
>> guaranteed-inaccessible when it chooses, so that it can re-assign them
>> to other uses in future.
>>
>> "vfio/pci: Permanently revoke a DMABUF on request" adds a new VFIO
>> device fd ioctl, VFIO_DEVICE_PCI_DMABUF_REVOKE.  This takes a DMABUF
>> fd parameter previously exported (from that device!) and permanently
>> revokes the DMABUF.  This notifies/detaches importers, zaps PTEs for
>> any mappings, and guarantees no future attachment/import/map/access is
>> possible by any means.
>>
>> A primary driver process would use this operation when the client's
>> tenure ends to reclaim "loaned-out" MMIO interfaces, at which point
>> the interfaces could be safely re-used.
>>
>> New in v2: ioctl() on VFIO driver fd, rather than DMABUF fd.  A DMABUF
>> is revoked using code common to vfio_pci_dma_buf_move(), selectively
>> zapping mappings (after waiting for completion on the
>> dma_buf_invalidate_mappings() request).
>>
>>
>> BAR mapping access attributes
>> =============================
>>
>> Inspired by Alex [Mastro] and Jason's comments in [0] and Mahmoud's
>> work in [1] with the goal of controlling CPU access attributes for
>> VFIO BAR mappings (e.g. WC), we can decorate DMABUFs with access
>> attributes that are then used by a mapping's PTEs.
>>
>> I've proposed reserving a field in struct
>> vfio_device_feature_dma_buf's flags to specify an attribute for its
>> ranges.  Although that keeps the (UAPI) struct unchanged, it means all
>> ranges in a DMABUF share the same attribute.  I feel a single
>> attribute-to-mmap() relation is logical/reasonable.  An application
>> can also create multiple DMABUFs to describe any BAR layout and mix of
>> attributes.
>>
>>
>> Tests
>> =====
>>
>> (Still sharing the [RFC ONLY] userspace test/demo program for context,
>> not for merge.)
>>
>> It illustrates & tests various map/revoke cases, but doesn't use the
>> existing VFIO selftests and relies on a (tweaked) QEMU EDU function.
>> I'm (still) working on integrating the scenarios into the existing
>> VFIO selftests.
>>
>> This code has been tested with DMABUF mappings of single/multiple
>> ranges, aliasing mmap()s, aliasing ranges across DMABUFs, vm_pgoff >
>> 0, revocation, and shutdown/cleanup scenarios; hugepage mappings also
>> seem to work correctly.  I've also lightly tested WC mappings (by
>> observing that the resulting PTEs have the correct attributes).
>>
>>
>> Fin
>> ===
>>
>> v2 is based on next-20260310 (to build on Leon's recent series
>> "vfio: Wait for dma-buf invalidation to complete" [2]).
>>
>>
>> Please share your thoughts!  I'd like to de-RFC if we feel this
>> approach is now fair.
> 
> I only skimmed over it, but at least offhand I couldn't find anything fundamentally wrong.

Thank you!

> The locking order seems to change in patch #6. In general I strongly recommend enabling lockdep while testing anyway, but especially when I see such changes.

I'll definitely +1 on testing with lockdep.

Note that patch #6 doesn't [intend to] change the locking; the naming of
the existing vfio_pci_zap_and_down_write_memory_lock() is potentially
confusing because _really_ it's
vfio_pci_down_write_memory_lock_and_zap().  Patch #6 is replacing that
with _just_ the existing down_write(&memory_lock) part.

(FWIW, lockdep's happy when running the test scenarios on this series.)

> In addition to that, it might also be a good idea to have a lockdep initcall function which defines the locking order that all the VFIO code should follow.
> 
> See the function dma_resv_lockdep() for an example of how to do that. Especially with mmap() support and all the locks involved there, having something like that has proven to be good practice.

That's a good suggestion; I'll investigate, and thanks for the pointer.
I spent time stepping through the locking particularly in the revoke
path, and automation here would be pretty useful if possible.


Thanks and regards,


Matt


> 
> Regards,
> Christian.
> 
>>
>>
>> Many thanks,
>>
>>
>> Matt
>>
>>
>>
>> References:
>>
>> [0]: https://lore.kernel.org/linux-iommu/20250918214425.2677057-1-amastro@fb.com/ 
>> [1]: https://lore.kernel.org/all/20250804104012.87915-1-mngyadam@amazon.de/ 
>> [2]: https://lore.kernel.org/linux-iommu/20260205-nocturnal-poetic-chamois-f566ad@houat/T/#m310cd07011e3a1461b6fda45e3f9b886ba76571a 
>> [3]: https://lore.kernel.org/all/20260226202211.929005-1-mattev@meta.com/ 
>>
>> --------------------------------------------------------------------------------
>> Changelog:
>>
>> v2:  Respin based on the feedback/suggestions:
>>
>> - Transform the existing VFIO BAR mmap path to also use DMABUFs behind
>>   the scenes, and then simply share that code for explicitly-mapped
>>   DMABUFs.
>>
>> - Refactors the export itself out of vfio_pci_core_feature_dma_buf,
>>   and shared by a new vfio_pci_core_mmap_prep_dmabuf helper used by
>>   the regular VFIO mmap to create a DMABUF.
>>
>> - Revoke buffers using a VFIO device fd ioctl
>>
>> v1: https://lore.kernel.org/all/20260226202211.929005-1-mattev@meta.com/ 
>>
>>
>> Matt Evans (10):
>>   vfio/pci: Set up VFIO barmap before creating a DMABUF
>>   vfio/pci: Clean up DMABUFs before disabling function
>>   vfio/pci: Add helper to look up PFNs for DMABUFs
>>   vfio/pci: Add a helper to create a DMABUF for a BAR-map VMA
>>   vfio/pci: Convert BAR mmap() to use a DMABUF
>>   vfio/pci: Remove vfio_pci_zap_bars()
>>   vfio/pci: Support mmap() of a VFIO DMABUF
>>   vfio/pci: Permanently revoke a DMABUF on request
>>   vfio/pci: Add mmap() attributes to DMABUF feature
>>   [RFC ONLY] selftests: vfio: Add standalone vfio_dmabuf_mmap_test
>>
>>  drivers/vfio/pci/Kconfig                      |   3 +-
>>  drivers/vfio/pci/Makefile                     |   3 +-
>>  drivers/vfio/pci/vfio_pci_config.c            |  18 +-
>>  drivers/vfio/pci/vfio_pci_core.c              | 123 +--
>>  drivers/vfio/pci/vfio_pci_dmabuf.c            | 425 +++++++--
>>  drivers/vfio/pci/vfio_pci_priv.h              |  46 +-
>>  include/uapi/linux/vfio.h                     |  42 +-
>>  tools/testing/selftests/vfio/Makefile         |   1 +
>>  .../vfio/standalone/vfio_dmabuf_mmap_test.c   | 837 ++++++++++++++++++
>>  9 files changed, 1339 insertions(+), 159 deletions(-)
>>  create mode 100644 tools/testing/selftests/vfio/standalone/vfio_dmabuf_mmap_test.c
>>
> 


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC v2 PATCH 04/10] vfio/pci: Add a helper to create a DMABUF for a BAR-map VMA
  2026-03-12 18:46 ` [RFC v2 PATCH 04/10] vfio/pci: Add a helper to create a DMABUF for a BAR-map VMA Matt Evans
@ 2026-03-18 20:04   ` Alex Williamson
  2026-03-23 13:25     ` Jason Gunthorpe
  2026-03-23 14:55     ` Matt Evans
  0 siblings, 2 replies; 17+ messages in thread
From: Alex Williamson @ 2026-03-18 20:04 UTC (permalink / raw)
  To: Matt Evans
  Cc: Leon Romanovsky, Jason Gunthorpe, Alex Mastro, Mahmoud Adam,
	David Matlack, Björn Töpel, Sumit Semwal,
	Christian König, Kevin Tian, Ankit Agrawal,
	Pranjal Shrivastava, Alistair Popple, Vivek Kasireddy,
	linux-kernel, linux-media, dri-devel, linaro-mm-sig, kvm, alex

On Thu, 12 Mar 2026 11:46:02 -0700
Matt Evans <mattev@meta.com> wrote:

> This helper, vfio_pci_core_mmap_prep_dmabuf(), creates a single-range
> DMABUF for the purpose of mapping a PCI BAR.  This is used in a future
> commit by VFIO's ordinary mmap() path.
> 
> This function transfers ownership of the VFIO device fd to the
> DMABUF, which fput()s when it's released.
> 
> Refactor the existing vfio_pci_core_feature_dma_buf() to split out
> export code common to the two paths, VFIO_DEVICE_FEATURE_DMA_BUF and
> this new VFIO_BAR mmap().
> 
> Signed-off-by: Matt Evans <mattev@meta.com>
> ---
>  drivers/vfio/pci/vfio_pci_dmabuf.c | 131 +++++++++++++++++++++--------
>  drivers/vfio/pci/vfio_pci_priv.h   |   4 +
>  2 files changed, 102 insertions(+), 33 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c
> index 63140528dbea..76db340ba592 100644
> --- a/drivers/vfio/pci/vfio_pci_dmabuf.c
> +++ b/drivers/vfio/pci/vfio_pci_dmabuf.c
> @@ -82,6 +82,8 @@ static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf)
>  		up_write(&priv->vdev->memory_lock);
>  		vfio_device_put_registration(&priv->vdev->vdev);
>  	}
> +	if (priv->vfile)
> +		fput(priv->vfile);
>  	kfree(priv->phys_vec);
>  	kfree(priv);
>  }
> @@ -182,6 +184,41 @@ int vfio_pci_dma_buf_find_pfn(struct vfio_pci_dma_buf *vpdmabuf,
>  	return -EFAULT;
>  }
>  
> +static int vfio_pci_dmabuf_export(struct vfio_pci_core_device *vdev,
> +				  struct vfio_pci_dma_buf *priv, uint32_t flags,
> +				  size_t size, bool status_ok)
> +{
> +	DEFINE_DMA_BUF_EXPORT_INFO(exp_info);
> +
> +	if (!vfio_device_try_get_registration(&vdev->vdev))
> +		return -ENODEV;
> +
> +	exp_info.ops = &vfio_pci_dmabuf_ops;
> +	exp_info.size = size;
> +	exp_info.flags = flags;
> +	exp_info.priv = priv;
> +
> +	priv->dmabuf = dma_buf_export(&exp_info);
> +	if (IS_ERR(priv->dmabuf)) {
> +		vfio_device_put_registration(&vdev->vdev);
> +		return PTR_ERR(priv->dmabuf);
> +	}
> +
> +	kref_init(&priv->kref);
> +	init_completion(&priv->comp);
> +
> +	/* dma_buf_put() now frees priv */
> +	INIT_LIST_HEAD(&priv->dmabufs_elm);
> +	down_write(&vdev->memory_lock);
> +	dma_resv_lock(priv->dmabuf->resv, NULL);
> +	priv->revoked = !status_ok;

Testing __vfio_pci_memory_enabled() without holding memory_lock is
invalid, so computing it outside the semaphore and passing it in as a
parameter is equally invalid.  @status_ok is stale by the time it's
used here.
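
A minimal fix along these lines (a sketch only, untested) would drop
the @status_ok parameter from vfio_pci_dmabuf_export() and sample the
state under the semaphore, as the pre-refactor code did:

```c
	/*
	 * Sketch: compute 'revoked' while holding memory_lock rather
	 * than trusting a snapshot taken by the caller before the
	 * lock was acquired.
	 */
	down_write(&vdev->memory_lock);
	dma_resv_lock(priv->dmabuf->resv, NULL);
	priv->revoked = !__vfio_pci_memory_enabled(vdev);
	list_add_tail(&priv->dmabufs_elm, &vdev->dmabufs);
	dma_resv_unlock(priv->dmabuf->resv);
	up_write(&vdev->memory_lock);
```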

> +	list_add_tail(&priv->dmabufs_elm, &vdev->dmabufs);
> +	dma_resv_unlock(priv->dmabuf->resv);
> +	up_write(&vdev->memory_lock);
> +
> +	return 0;
> +}
> +
>  /*
>   * This is a temporary "private interconnect" between VFIO DMABUF and iommufd.
>   * It allows the two co-operating drivers to exchange the physical address of
> @@ -300,7 +337,6 @@ int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
>  {
>  	struct vfio_device_feature_dma_buf get_dma_buf = {};
>  	struct vfio_region_dma_range *dma_ranges;
> -	DEFINE_DMA_BUF_EXPORT_INFO(exp_info);
>  	struct vfio_pci_dma_buf *priv;
>  	size_t length;
>  	int ret;
> @@ -369,46 +405,20 @@ int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
>  	kfree(dma_ranges);
>  	dma_ranges = NULL;
>  
> -	if (!vfio_device_try_get_registration(&vdev->vdev)) {
> -		ret = -ENODEV;
> +	ret = vfio_pci_dmabuf_export(vdev, priv, get_dma_buf.open_flags,
> +				     priv->size,
> +				     __vfio_pci_memory_enabled(vdev));
> +	if (ret)
>  		goto err_free_phys;
> -	}
> -
> -	exp_info.ops = &vfio_pci_dmabuf_ops;
> -	exp_info.size = priv->size;
> -	exp_info.flags = get_dma_buf.open_flags;
> -	exp_info.priv = priv;
> -
> -	priv->dmabuf = dma_buf_export(&exp_info);
> -	if (IS_ERR(priv->dmabuf)) {
> -		ret = PTR_ERR(priv->dmabuf);
> -		goto err_dev_put;
> -	}
> -
> -	kref_init(&priv->kref);
> -	init_completion(&priv->comp);
> -
> -	/* dma_buf_put() now frees priv */
> -	INIT_LIST_HEAD(&priv->dmabufs_elm);
> -	down_write(&vdev->memory_lock);
> -	dma_resv_lock(priv->dmabuf->resv, NULL);
> -	priv->revoked = !__vfio_pci_memory_enabled(vdev);

Tested under memory_lock.  It was correct previously.

> -	list_add_tail(&priv->dmabufs_elm, &vdev->dmabufs);
> -	dma_resv_unlock(priv->dmabuf->resv);
> -	up_write(&vdev->memory_lock);
> -
>  	/*
>  	 * dma_buf_fd() consumes the reference, when the file closes the dmabuf
>  	 * will be released.
>  	 */
>  	ret = dma_buf_fd(priv->dmabuf, get_dma_buf.open_flags);
> -	if (ret < 0)
> -		goto err_dma_buf;
> -	return ret;
> +	if (ret >= 0)
> +		return ret;
>  
> -err_dma_buf:
>  	dma_buf_put(priv->dmabuf);
> -err_dev_put:
>  	vfio_device_put_registration(&vdev->vdev);
>  err_free_phys:
>  	kfree(priv->phys_vec);
> @@ -419,6 +429,61 @@ int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
>  	return ret;
>  }
>  
> +int vfio_pci_core_mmap_prep_dmabuf(struct vfio_pci_core_device *vdev,
> +				   struct vm_area_struct *vma,
> +				   u64 phys_start,
> +				   u64 pgoff,
> +				   u64 req_len)
> +{
> +	struct vfio_pci_dma_buf *priv;
> +	const unsigned int nr_ranges = 1;
> +	int ret;
> +
> +	priv = kzalloc(sizeof(*priv), GFP_KERNEL);
> +	if (!priv)
> +		return -ENOMEM;
> +
> +	priv->phys_vec = kcalloc(nr_ranges, sizeof(*priv->phys_vec),
> +				 GFP_KERNEL);
> +	if (!priv->phys_vec) {
> +		ret = -ENOMEM;
> +		goto err_free_priv;
> +	}
> +
> +	priv->vdev = vdev;
> +	priv->nr_ranges = nr_ranges;
> +	priv->size = req_len;
> +	priv->phys_vec[0].paddr = phys_start + (pgoff << PAGE_SHIFT);
> +	priv->phys_vec[0].len = req_len;
> +
> +	/*
> +	 * Creates a DMABUF, adds it to vdev->dmabufs list for
> +	 * tracking (meaning cleanup or revocation will zap them), and
> +	 * registers with vfio_device:
> +	 */
> +	ret = vfio_pci_dmabuf_export(vdev, priv, O_CLOEXEC, priv->size, true);
> +	if (ret)
> +		goto err_free_phys;
> +
> +	/*
> +	 * The VMA gets the DMABUF file so that other users can locate
> +	 * the DMABUF via a VA.  Ownership of the original VFIO device
> +	 * file being mmap()ed transfers to priv, and is put when the
> +	 * DMABUF is released.
> +	 */
> +	priv->vfile = vma->vm_file;
> +	vma->vm_file = priv->dmabuf->file;

AIUI, this affects what the user sees in /proc/<pid>/maps, right?
Previously a memory range could be clearly associated with a specific
vfio device; now, for vfio-pci devices, I think the range is
associated with a nondescript dmabuf.  If so, is that an acceptable,
user-visible, debugging-friendly change (e.g. lsof)?  Thanks,

Alex

> +	vma->vm_private_data = priv;
> +
> +	return 0;
> +
> +err_free_phys:
> +	kfree(priv->phys_vec);
> +err_free_priv:
> +	kfree(priv);
> +	return ret;
> +}
> +
>  void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
>  {
>  	struct vfio_pci_dma_buf *priv;
> diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
> index 5cc8c85a2153..5fd3a6e00a0e 100644
> --- a/drivers/vfio/pci/vfio_pci_priv.h
> +++ b/drivers/vfio/pci/vfio_pci_priv.h
> @@ -30,6 +30,7 @@ struct vfio_pci_dma_buf {
>  	size_t size;
>  	struct phys_vec *phys_vec;
>  	struct p2pdma_provider *provider;
> +	struct file *vfile;
>  	u32 nr_ranges;
>  	struct kref kref;
>  	struct completion comp;
> @@ -128,6 +129,9 @@ int vfio_pci_dma_buf_find_pfn(struct vfio_pci_dma_buf *vpdmabuf,
>  			      unsigned long address,
>  			      unsigned int order,
>  			      unsigned long *out_pfn);
> +int vfio_pci_core_mmap_prep_dmabuf(struct vfio_pci_core_device *vdev,
> +				   struct vm_area_struct *vma,
> +				   u64 phys_start, u64 pgoff, u64 req_len);
>  
>  #ifdef CONFIG_VFIO_PCI_DMABUF
>  int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC v2 PATCH 04/10] vfio/pci: Add a helper to create a DMABUF for a BAR-map VMA
  2026-03-18 20:04   ` Alex Williamson
@ 2026-03-23 13:25     ` Jason Gunthorpe
  2026-03-23 14:55     ` Matt Evans
  1 sibling, 0 replies; 17+ messages in thread
From: Jason Gunthorpe @ 2026-03-23 13:25 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Matt Evans, Leon Romanovsky, Alex Mastro, Mahmoud Adam,
	David Matlack, Björn Töpel, Sumit Semwal,
	Christian König, Kevin Tian, Ankit Agrawal,
	Pranjal Shrivastava, Alistair Popple, Vivek Kasireddy,
	linux-kernel, linux-media, dri-devel, linaro-mm-sig, kvm

On Wed, Mar 18, 2026 at 02:04:08PM -0600, Alex Williamson wrote:

> AIUI, this affects what the user sees in /proc/<pid>/maps, right?
> Previously a memory range could be clearly associated with a specific
> vfio device, now, only for vfio-pci devices, I think the range is
> associated to a nondescript dmabuf.  

Probably

> If so, is that an acceptable, user
> visible, debugging friendly change (ex. lsof)?  Thanks,

IIRC there is a way to attach a string to the inode that lsof can
display; if we do that right it should be good enough.

Jason

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC v2 PATCH 04/10] vfio/pci: Add a helper to create a DMABUF for a BAR-map VMA
  2026-03-18 20:04   ` Alex Williamson
  2026-03-23 13:25     ` Jason Gunthorpe
@ 2026-03-23 14:55     ` Matt Evans
  1 sibling, 0 replies; 17+ messages in thread
From: Matt Evans @ 2026-03-23 14:55 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Leon Romanovsky, Jason Gunthorpe, Alex Mastro, Mahmoud Adam,
	David Matlack, Björn Töpel, Sumit Semwal,
	Christian König, Kevin Tian, Ankit Agrawal,
	Pranjal Shrivastava, Alistair Popple, Vivek Kasireddy,
	linux-kernel, linux-media, dri-devel, linaro-mm-sig, kvm

Hi Alex,

On Wed, Mar 18, 2026 at 8:04 PM Alex Williamson <alex@shazbot.org> wrote:
> On Thu, 12 Mar 2026 11:46:02 -0700
> Matt Evans <mattev@meta.com> wrote:
>
> > This helper, vfio_pci_core_mmap_prep_dmabuf(), creates a single-range
> > DMABUF for the purpose of mapping a PCI BAR.  This is used in a future
> > commit by VFIO's ordinary mmap() path.
> >
> > This function transfers ownership of the VFIO device fd to the
> > DMABUF, which fput()s when it's released.
> >
> > Refactor the existing vfio_pci_core_feature_dma_buf() to split out
> > export code common to the two paths, VFIO_DEVICE_FEATURE_DMA_BUF and
> > this new VFIO_BAR mmap().
> >
> > Signed-off-by: Matt Evans <mattev@meta.com>
> > ---
> >  drivers/vfio/pci/vfio_pci_dmabuf.c | 131 +++++++++++++++++++++--------
> >  drivers/vfio/pci/vfio_pci_priv.h   |   4 +
> >  2 files changed, 102 insertions(+), 33 deletions(-)
> >
> > diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c
> > index 63140528dbea..76db340ba592 100644
> > --- a/drivers/vfio/pci/vfio_pci_dmabuf.c
> > +++ b/drivers/vfio/pci/vfio_pci_dmabuf.c
> > @@ -82,6 +82,8 @@ static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf)
> >               up_write(&priv->vdev->memory_lock);
> >               vfio_device_put_registration(&priv->vdev->vdev);
> >       }
> > +     if (priv->vfile)
> > +             fput(priv->vfile);
> >       kfree(priv->phys_vec);
> >       kfree(priv);
> >  }
> > @@ -182,6 +184,41 @@ int vfio_pci_dma_buf_find_pfn(struct vfio_pci_dma_buf *vpdmabuf,
> >       return -EFAULT;
> >  }
> >
> > +static int vfio_pci_dmabuf_export(struct vfio_pci_core_device *vdev,
> > +                               struct vfio_pci_dma_buf *priv, uint32_t flags,
> > +                               size_t size, bool status_ok)
> > +{
> > +     DEFINE_DMA_BUF_EXPORT_INFO(exp_info);
> > +
> > +     if (!vfio_device_try_get_registration(&vdev->vdev))
> > +             return -ENODEV;
> > +
> > +     exp_info.ops = &vfio_pci_dmabuf_ops;
> > +     exp_info.size = size;
> > +     exp_info.flags = flags;
> > +     exp_info.priv = priv;
> > +
> > +     priv->dmabuf = dma_buf_export(&exp_info);
> > +     if (IS_ERR(priv->dmabuf)) {
> > +             vfio_device_put_registration(&vdev->vdev);
> > +             return PTR_ERR(priv->dmabuf);
> > +     }
> > +
> > +     kref_init(&priv->kref);
> > +     init_completion(&priv->comp);
> > +
> > +     /* dma_buf_put() now frees priv */
> > +     INIT_LIST_HEAD(&priv->dmabufs_elm);
> > +     down_write(&vdev->memory_lock);
> > +     dma_resv_lock(priv->dmabuf->resv, NULL);
> > +     priv->revoked = !status_ok;
>
> Testing __vfio_pci_memory_enabled() outside of memory_lock is
> invalid, so computing its result outside the semaphore and passing it
> in as a parameter is equally invalid.  @status_ok may be stale by the
> time it is used here.

So it is, arrrrrgh.  Thank you for that; I've found a couple of other
choice bugs in that RFC, and will resolve all of this in a repost
soon.

[snip]
> > +
> > +     /*
> > +      * The VMA gets the DMABUF file so that other users can locate
> > +      * the DMABUF via a VA.  Ownership of the original VFIO device
> > +      * file being mmap()ed transfers to priv, and is put when the
> > +      * DMABUF is released.
> > +      */
> > +     priv->vfile = vma->vm_file;
> > +     vma->vm_file = priv->dmabuf->file;
>
> AIUI, this affects what the user sees in /proc/<pid>/maps, right?
> Previously a memory range could be clearly associated with a specific
> vfio device; now, for vfio-pci devices, I think the range is
> associated with a nondescript dmabuf.  If so, is that an acceptable,
> user-visible, debugging-friendly change (e.g. lsof)?  Thanks,

(Jason, your comment noted with thanks, replying to you both here to
save electrons.)

Great question; a formatting change there is inherent to moving to a
DMABUF (which generates a "/dmabuf:" prefix to a user-defined name).
If we can accept that it changes at all, then I agree it should output
useful debug information: at least the cdev name and resource index,
and we have the opportunity to include the BDF too.  I've added this;
an example line of /proc/<pid>/maps:

    ffffb8070000-ffffbc040000 rw-s 00030000 00:0b 5
      /dmabuf:vfio0:0000:00:03.0/1

Note that the file offset used to include the resource index (at
VFIO_PCI_OFFSET_SHIFT), but this DMABUF version doesn't do that, so
I'm proposing appending a "/%u" for the index.  The line above is a
map of BAR1, offset 0x30000.  If people feel strongly about the
existing aesthetic then we could keep the index encoded in vm_pgoff,
retaining the same offset field in /proc/<pid>/maps, but it'd be less
neat to mask it back out in a few places.

The default name of a DMABUF acquired through
VFIO_DEVICE_FEATURE_DMA_BUF would still be "/dmabuf:" and I think it
should stay this way since a better name should be supplied by
userspace.  The default at least differentiates them from VFIO device
fd mappings.

Many thanks,


Matt


>
> Alex
>
> > +     vma->vm_private_data = priv;
> > +
> > +     return 0;
> > +
> > +err_free_phys:
> > +     kfree(priv->phys_vec);
> > +err_free_priv:
> > +     kfree(priv);
> > +     return ret;
> > +}
> > +
> >  void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
> >  {
> >       struct vfio_pci_dma_buf *priv;
> > diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
> > index 5cc8c85a2153..5fd3a6e00a0e 100644
> > --- a/drivers/vfio/pci/vfio_pci_priv.h
> > +++ b/drivers/vfio/pci/vfio_pci_priv.h
> > @@ -30,6 +30,7 @@ struct vfio_pci_dma_buf {
> >       size_t size;
> >       struct phys_vec *phys_vec;
> >       struct p2pdma_provider *provider;
> > +     struct file *vfile;
> >       u32 nr_ranges;
> >       struct kref kref;
> >       struct completion comp;
> > @@ -128,6 +129,9 @@ int vfio_pci_dma_buf_find_pfn(struct vfio_pci_dma_buf *vpdmabuf,
> >                             unsigned long address,
> >                             unsigned int order,
> >                             unsigned long *out_pfn);
> > +int vfio_pci_core_mmap_prep_dmabuf(struct vfio_pci_core_device *vdev,
> > +                                struct vm_area_struct *vma,
> > +                                u64 phys_start, u64 pgoff, u64 req_len);
> >
> >  #ifdef CONFIG_VFIO_PCI_DMABUF
> >  int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2026-03-23 14:56 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-03-12 18:45 [RFC v2 PATCH 00/10] vfio/pci: Add mmap() for DMABUFs Matt Evans
2026-03-12 18:45 ` [RFC v2 PATCH 01/10] vfio/pci: Set up VFIO barmap before creating a DMABUF Matt Evans
2026-03-12 18:46 ` [RFC v2 PATCH 02/10] vfio/pci: Clean up DMABUFs before disabling function Matt Evans
2026-03-12 18:46 ` [RFC v2 PATCH 03/10] vfio/pci: Add helper to look up PFNs for DMABUFs Matt Evans
2026-03-12 18:46 ` [RFC v2 PATCH 04/10] vfio/pci: Add a helper to create a DMABUF for a BAR-map VMA Matt Evans
2026-03-18 20:04   ` Alex Williamson
2026-03-23 13:25     ` Jason Gunthorpe
2026-03-23 14:55     ` Matt Evans
2026-03-12 18:46 ` [RFC v2 PATCH 05/10] vfio/pci: Convert BAR mmap() to use a DMABUF Matt Evans
2026-03-12 18:46 ` [RFC v2 PATCH 06/10] vfio/pci: Remove vfio_pci_zap_bars() Matt Evans
2026-03-13  9:12   ` Christian König
2026-03-12 18:46 ` [RFC v2 PATCH 07/10] vfio/pci: Support mmap() of a VFIO DMABUF Matt Evans
2026-03-12 18:46 ` [RFC v2 PATCH 08/10] vfio/pci: Permanently revoke a DMABUF on request Matt Evans
2026-03-12 18:46 ` [RFC v2 PATCH 09/10] vfio/pci: Add mmap() attributes to DMABUF feature Matt Evans
2026-03-12 18:46 ` [RFC v2 PATCH 10/10] [RFC ONLY] selftests: vfio: Add standalone vfio_dmabuf_mmap_test Matt Evans
2026-03-13  9:21 ` [RFC v2 PATCH 00/10] vfio/pci: Add mmap() for DMABUFs Christian König
2026-03-13 13:28   ` Matt Evans
