From: Matt Evans <mattev@meta.com>
To: Alex Williamson <alex@shazbot.org>,
Leon Romanovsky <leon@kernel.org>,
Jason Gunthorpe <jgg@nvidia.com>, Alex Mastro <amastro@fb.com>,
Mahmoud Adam <mngyadam@amazon.de>,
David Matlack <dmatlack@google.com>
Cc: "Björn Töpel" <bjorn@kernel.org>,
"Sumit Semwal" <sumit.semwal@linaro.org>,
"Christian König" <christian.koenig@amd.com>,
"Kevin Tian" <kevin.tian@intel.com>,
"Ankit Agrawal" <ankita@nvidia.com>,
"Pranjal Shrivastava" <praan@google.com>,
"Alistair Popple" <apopple@nvidia.com>,
"Vivek Kasireddy" <vivek.kasireddy@intel.com>,
linux-kernel@vger.kernel.org, linux-media@vger.kernel.org,
dri-devel@lists.freedesktop.org, linaro-mm-sig@lists.linaro.org,
kvm@vger.kernel.org
Subject: [RFC PATCH 3/7] vfio/pci: Support mmap() of a DMABUF
Date: Thu, 26 Feb 2026 12:21:59 -0800
Message-ID: <20260226202211.929005-4-mattev@meta.com>
In-Reply-To: <20260226202211.929005-1-mattev@meta.com>
A VFIO DMABUF can export a subset of a BAR to userspace by fd; add
support for mmap() of this fd. This provides another route for a
process to map BARs, but one restricted to the specific subset of the
BAR represented by the exported DMABUF.
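For illustration, a minimal sketch of the consumer side, assuming a
DMABUF fd exported from a BAR sub-range has already been obtained (the
helper name and error handling are illustrative, not part of this
patch):

#include <stddef.h>
#include <sys/mman.h>

/*
 * Map a BAR sub-range through an exported DMABUF fd. 'len' must not
 * exceed the total size of the DMABUF's ranges, and the mapping must
 * be MAP_SHARED; the mmap callback rejects anything else with
 * -EINVAL.
 */
static void *map_dmabuf(int dmabuf_fd, size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
		       dmabuf_fd, 0);

	return p == MAP_FAILED ? NULL : p;
}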
mmap() support enables userspace driver designs that safely delegate
access to BAR sub-ranges to other client processes by sharing a DMABUF
fd, without having to share the (omnipotent) VFIO device fd with them.
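As a sketch of that delegation flow, assuming a unix socket between
the owning process and the client (this is standard SCM_RIGHTS fd
passing; nothing here is specific to this patch):

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send only the DMABUF fd, never the VFIO device fd, to a client. */
static int send_dmabuf_fd(int sock, int dmabuf_fd)
{
	char data = 0;
	struct iovec iov = { .iov_base = &data, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = u.buf,
		.msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &dmabuf_fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

The client receives the fd with the matching recvmsg()/SCM_RIGHTS
dance and mmap()s it as above.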
The mmap callback installs vm_ops handlers for .fault and .huge_fault,
which resolve a PFN by searching the DMABUF's physical ranges;
DMABUFs composed of multiple ranges are therefore supported for
mmap().
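To make the lookup concrete, here is a userspace model of the search
(illustrative only; the in-kernel version below is authoritative).
For example, with ranges of lengths {0x4000, 0x2000} and an order-0
fault at buffer offset 0x5000, the walk skips past the first 0x4000
bytes, finds the offset inside the second span, and returns that
range's starting PFN + 1:

#include <stdbool.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12
#define MODEL_PAGE_SIZE (1ULL << MODEL_PAGE_SHIFT)

struct range { uint64_t paddr, len; };

/*
 * Find the PFN backing 'buf_offset' for a page of size
 * MODEL_PAGE_SIZE << order. Returns false if the page straddles a
 * range boundary or the resulting PFN is misaligned for the order;
 * the caller (like the kernel fault path) then retries with a
 * smaller order.
 */
static bool model_find_pfn(const struct range *r, unsigned int nr,
			   uint64_t buf_offset, unsigned int order,
			   uint64_t *out_pfn)
{
	uint64_t page = MODEL_PAGE_SIZE << order, base = 0;
	unsigned int i;

	for (i = 0; i < nr; i++) {
		if (buf_offset >= base &&
		    buf_offset + page <= base + r[i].len) {
			uint64_t pfn = (r[i].paddr >> MODEL_PAGE_SHIFT) +
				((buf_offset - base) >> MODEL_PAGE_SHIFT);

			if (pfn & ((1ULL << order) - 1))
				return false;
			*out_pfn = pfn;
			return true;
		}
		base += r[i].len;
	}
	return false;
}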
Signed-off-by: Matt Evans <mattev@meta.com>
---
drivers/vfio/pci/vfio_pci_dmabuf.c | 219 +++++++++++++++++++++++++++++
1 file changed, 219 insertions(+)
diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c
index 46ab64fbeb19..bebb496bd0f2 100644
--- a/drivers/vfio/pci/vfio_pci_dmabuf.c
+++ b/drivers/vfio/pci/vfio_pci_dmabuf.c
@@ -85,6 +85,209 @@ static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf)
kfree(priv);
}
+static int vfio_pci_dma_buf_find_pfn(struct device *dev,
+ struct vfio_pci_dma_buf *vpdmabuf,
+ struct vm_area_struct *vma,
+ unsigned long address,
+ unsigned int order,
+ unsigned long *out_pfn)
+{
+ /*
+ * Given a VMA (start, end, pgoff) and a fault address,
+ * search phys_vec[] to find the range representing the
+ * address's offset into the VMA (and so a PFN).
+ *
+ * The phys_vec ranges represent contiguous spans of the
+ * buffer's offset space upwards from offset 0; the actual PFNs
+ * might be in any order, overlap/alias, etc. Calculate the
+ * offset of the desired page given the VMA start/pgoff and the
+ * fault address, then search upwards from 0 to find which span
+ * contains it.
+ *
+ * On success, a valid PFN for a page sized by 'order' is
+ * returned via *out_pfn.
+ *
+ * Failure occurs if:
+ * - The page would cross the edge of the VMA
+ * - The page isn't entirely contained within a range
+ * - We find a range, but the final PFN isn't aligned to the
+ * requested order.
+ *
+ * (Upon failure, the caller is expected to try again with a
+ * smaller order; the tests above will always succeed for
+ * order=0 as the limit case.)
+ *
+ * It's suboptimal if DMABUFs are created with neighbouring
+ * ranges that are physically contiguous, since hugepages
+ * can't straddle range boundaries. (The construction of the
+ * ranges vector should merge such ranges.)
+ */
+
+ unsigned long rounded_page_addr = address & ~((PAGE_SIZE << order) - 1);
+ unsigned long rounded_page_end = rounded_page_addr + (PAGE_SIZE << order);
+ unsigned long buf_page_offset;
+ unsigned long buf_offset = 0;
+ unsigned int i;
+
+ if (rounded_page_addr < vma->vm_start || rounded_page_end > vma->vm_end)
+ return -EAGAIN;
+
+ if (unlikely(check_add_overflow(rounded_page_addr - vma->vm_start,
+ vma->vm_pgoff << PAGE_SHIFT, &buf_page_offset)))
+ return -EFAULT;
+
+ for (i = 0; i < vpdmabuf->nr_ranges; i++) {
+ unsigned long range_len = vpdmabuf->phys_vec[i].len;
+ unsigned long range_start = vpdmabuf->phys_vec[i].paddr;
+
+ if (buf_page_offset >= buf_offset &&
+ buf_page_offset + (PAGE_SIZE << order) <= buf_offset + range_len) {
+ /*
+ * The faulting page is wholly contained
+ * within the span represented by the range.
+ * Validate PFN alignment for the order:
+ */
+ unsigned long pfn = (range_start >> PAGE_SHIFT) +
+ ((buf_page_offset - buf_offset) >> PAGE_SHIFT);
+
+ if (IS_ALIGNED(pfn, 1 << order)) {
+ *out_pfn = pfn;
+ return 0;
+ }
+ /* Retry with smaller order */
+ return -EAGAIN;
+ }
+ buf_offset += range_len;
+ }
+
+ /*
+ * If we get here, the address fell outside of the span
+ * represented by the (concatenated) ranges. This should
+ * never happen, because vfio_pci_dma_buf_mmap() checks that
+ * the VMA is <= the total size of the ranges; if it somehow
+ * does, warn and force SIGBUS for the access.
+ */
+ WARN_ONCE(1, "No range for addr 0x%lx, order %d: VMA 0x%lx-0x%lx pgoff 0x%lx, %d ranges, size 0x%lx\n",
+ address, order, vma->vm_start, vma->vm_end, vma->vm_pgoff,
+ vpdmabuf->nr_ranges, vpdmabuf->size);
+
+ return -EFAULT;
+}
+
+static vm_fault_t vfio_pci_dma_buf_mmap_huge_fault(struct vm_fault *vmf,
+ unsigned int order)
+{
+ struct vm_area_struct *vma = vmf->vma;
+ struct vfio_pci_dma_buf *priv = vma->vm_private_data;
+ struct vfio_pci_core_device *vdev;
+ unsigned long pfn = 0;
+ vm_fault_t ret = VM_FAULT_FALLBACK;
+ int r;
+
+ vdev = READ_ONCE(priv->vdev);
+
+ /*
+ * A fault on an existing mmap might occur after
+ * vfio_pci_dma_buf_cleanup() has revoked and destroyed the
+ * vdev's DMABUFs and NULLed priv->vdev. After creation, vdev
+ * is only ever written during cleanup.
+ */
+ if (!vdev)
+ return VM_FAULT_SIGBUS;
+
+ r = vfio_pci_dma_buf_find_pfn(&vdev->pdev->dev, priv, vma,
+ vmf->address, order, &pfn);
+
+ if (r == 0) {
+ scoped_guard(rwsem_read, &vdev->memory_lock) {
+ /*
+ * Deal with the possibility of a fault racing with
+ * vfio_pci_dma_buf_move() revoking and then unmapping
+ * the buffer. The revocation/unmap and status change
+ * occur whilst holding memory_lock.
+ */
+ if (priv->revoked)
+ ret = VM_FAULT_SIGBUS;
+ else
+ ret = vfio_pci_vmf_insert_pfn(vdev, vmf, pfn, order);
+ }
+ } else if (r != -EAGAIN) {
+ ret = VM_FAULT_SIGBUS;
+ }
+
+ dev_dbg_ratelimited(&vdev->pdev->dev,
+ "%s(order = %d) PFN 0x%lx, VA 0x%lx, pgoff 0x%lx: 0x%x\n",
+ __func__, order, pfn, vmf->address, vma->vm_pgoff, (unsigned int)ret);
+
+ return ret;
+}
+
+static vm_fault_t vfio_pci_dma_buf_mmap_page_fault(struct vm_fault *vmf)
+{
+ return vfio_pci_dma_buf_mmap_huge_fault(vmf, 0);
+}
+
+static const struct vm_operations_struct vfio_pci_dma_buf_mmap_ops = {
+ .fault = vfio_pci_dma_buf_mmap_page_fault,
+#ifdef CONFIG_ARCH_SUPPORTS_HUGE_PFNMAP
+ .huge_fault = vfio_pci_dma_buf_mmap_huge_fault,
+#endif
+};
+
+static bool vfio_pci_dma_buf_is_mappable(struct dma_buf *dmabuf)
+{
+ struct vfio_pci_dma_buf *priv = dmabuf->priv;
+
+ /*
+ * Sanity checks at mmap() time; alignment has already been
+ * asserted by validate_dmabuf_input().
+ *
+ * Although the revoked state is transient, refuse to map a
+ * revoked buffer to flag early that something odd is going
+ * on: for example, users should not be mmap()ing a buffer
+ * that's being moved [by a user-triggered activity].
+ */
+ if (priv->revoked)
+ return false;
+
+ return true;
+}
+
+/*
+ * Similar to vfio_pci_core_mmap() for a regular VFIO device fd, but
+ * differs in the pre-checks performed and in the vm_ops installed.
+ */
+static int vfio_pci_dma_buf_mmap(struct dma_buf *dmabuf, struct vm_area_struct *vma)
+{
+ struct vfio_pci_dma_buf *priv = dmabuf->priv;
+ u64 req_len, req_start;
+
+ if (!vfio_pci_dma_buf_is_mappable(dmabuf))
+ return -ENODEV;
+ if (!(vma->vm_flags & VM_SHARED))
+ return -EINVAL;
+
+ req_len = vma->vm_end - vma->vm_start;
+ req_start = vma->vm_pgoff << PAGE_SHIFT;
+
+ if (req_start + req_len > priv->size)
+ return -EINVAL;
+
+ vma->vm_private_data = priv;
+ vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+ vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);
+
+ /*
+ * See comments in vfio_pci_core_mmap() re VM_ALLOW_ANY_UNCACHED.
+ *
+ * FIXME: get mapping attributes from dmabuf?
+ */
+ vm_flags_set(vma, VM_ALLOW_ANY_UNCACHED | VM_IO | VM_PFNMAP |
+ VM_DONTEXPAND | VM_DONTDUMP);
+ vma->vm_ops = &vfio_pci_dma_buf_mmap_ops;
+
+ return 0;
+}
+
static const struct dma_buf_ops vfio_pci_dmabuf_ops = {
.pin = vfio_pci_dma_buf_pin,
.unpin = vfio_pci_dma_buf_unpin,
@@ -92,6 +295,7 @@ static const struct dma_buf_ops vfio_pci_dmabuf_ops = {
.map_dma_buf = vfio_pci_dma_buf_map,
.unmap_dma_buf = vfio_pci_dma_buf_unmap,
.release = vfio_pci_dma_buf_release,
+ .mmap = vfio_pci_dma_buf_mmap,
};
/*
@@ -335,6 +539,11 @@ void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
struct vfio_pci_dma_buf *tmp;
lockdep_assert_held_write(&vdev->memory_lock);
+ /*
+ * Holding memory_lock ensures a racing
+ * vfio_pci_dma_buf_mmap_*_fault() observes priv->revoked
+ * properly.
+ */
list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
if (!get_file_active(&priv->dmabuf->file))
@@ -345,6 +554,14 @@ void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
priv->revoked = revoked;
dma_buf_move_notify(priv->dmabuf);
dma_resv_unlock(priv->dmabuf->resv);
+
+ /*
+ * Unmap any possible userspace mappings for a
+ * now-revoked DMABUF:
+ */
+ if (revoked)
+ unmap_mapping_range(priv->dmabuf->file->f_mapping,
+ 0, priv->size, 1);
}
fput(priv->dmabuf->file);
}
@@ -366,6 +583,8 @@ void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev)
priv->revoked = true;
dma_buf_move_notify(priv->dmabuf);
dma_resv_unlock(priv->dmabuf->resv);
+ unmap_mapping_range(priv->dmabuf->file->f_mapping,
+ 0, priv->size, 1);
vfio_device_put_registration(&vdev->vdev);
fput(priv->dmabuf->file);
}
--
2.47.3