* [PATCH v4 00/10] vfio/pci: Add mmap() for DMABUFs
@ 2026-07-01 17:12 Matt Evans
2026-07-01 17:12 ` [PATCH v4 01/10] PCI/P2PDMA: Split pool-related cleanup out of pci_p2pdma_release() Matt Evans
` (9 more replies)
0 siblings, 10 replies; 13+ messages in thread
From: Matt Evans @ 2026-07-01 17:12 UTC (permalink / raw)
To: Alex Williamson, Leon Romanovsky, Jason Gunthorpe, Alex Mastro,
Christian König, Bjorn Helgaas, Logan Gunthorpe, Kevin Tian,
Pranjal Shrivastava
Cc: Mahmoud Adam, David Matlack, Björn Töpel, Sumit Semwal,
Ankit Agrawal, Alistair Popple, Vivek Kasireddy, linux-kernel,
linux-media, dri-devel, linaro-mm-sig, kvm, linux-pci
Hi all,
This series is based on previous RFCs/discussions:
Tech topic: https://lore.kernel.org/linux-iommu/20250918214425.2677057-1-amastro@fb.com/
RFCv1: https://lore.kernel.org/all/20260226202211.929005-1-mattev@meta.com/
RFCv2: https://lore.kernel.org/kvm/20260312184613.3710705-1-mattev@meta.com/
The background/rationale is covered in more detail in the RFC cover
letters. The TL;DR is:
The goal is to enable userspace driver designs that use VFIO to export
DMABUFs representing subsets of PCI device BARs, and "vend" those
buffers from a primary process to other subordinate processes by fd.
These processes then mmap() the buffers and their access to the device
is isolated to the exported ranges. This is an improvement on sharing
the VFIO device fd to subordinate processes, which would allow
unfettered access.
This is achieved by enabling mmap() of vfio-pci DMABUFs, passed by fd
to subordinate processes. Second, a new revocation mechanism is added
to allow the primary process to forcibly revoke access to
previously-shared BAR spans, even if the subordinate processes haven't
cleanly exited.
(The related topic of safe delegation of iommufd control to the
subordinate processes is not addressed here, and is follow-up work.)
As well as isolation and revocation, another advantage to accessing a
BAR through a VMA backed by a DMABUF is that it's straightforward to
mmap() the buffer with access attributes, such as write-combining.
Feedback from the RFCs requested that, instead of creating
DMABUF-specific vm_ops and .fault paths, to go the whole way and
migrate the existing VFIO PCI BAR mmap() to be backed by a DMABUF too,
resulting in a common vm_ops and fault handler for mmap()s of both the
VFIO device and explicitly-exported DMABUFs. This will help future
iommufd emulation of VFIO Type1 peer-to-peer, making it easier to get
a DMABUF for a VFIO BAR as a DMA target.
mmap() conversion to use DMABUF underneath has been done for vfio-pci,
but not sub-drivers:
nvgrace-gpu's mmap() override path is unchanged; I kept this out of
scope for now not least because I don't have a thorough test setup
for this system. I would prefer to help the nvgrace-gpu maintainers
enable BAR mmap() DMABUFs themselves.
Notes on patches
================
PCI/P2PDMA: Split pool-related cleanup out of pci_p2pdma_release()
PCI/P2PDMA: Add CONFIG_PCI_P2PDMA_CORE
Later in the series, vfio-pci's mmap() is going to depend on
pcim_p2pdma_provider() which depended on CONFIG_PCI_P2PDMA, which
in turn depended on ZONE_DEVICE. That isn't available on 32-bit
and some archs, because they lack MEMORY_HOTPLUG and friends.
VFIO does _not_ require actual P2P to be present for basic mmap()
functionality, only for the optional CONFIG_DMA_SHARED_BUFFER
feature.
These split out p2pdma_core.c under CONFIG_PCI_P2PDMA_CORE (which
currently contains pcim_p2pdma_provider()), and an optional
CONFIG_PCI_P2PDMA which depends on ZONE_DEVICE etc. providing
P2P functionality in the existing p2pdma.c. The first splits
out pool cleanup from the release path, and the second does the
refactor/code move to the new file.
vfio/pci: Add a helper to look up PFNs for DMABUFs
vfio/pci: Add a helper to create a DMABUF for a BAR-map VMA
The first adds a DMABUF VMA fault handler helper to determine
arbitrary-sized PFNs from ranges in DMABUF. Secondly, refactor
DMABUF export for use by the existing export feature and add a
helper that creates a DMABUF corresponding to a VFIO BAR mmap()
request.
vfio/pci: Convert BAR mmap() to use a DMABUF
The vfio-pci core mmap() creates a DMABUF with the helper, and the
vm_ops fault handler uses the other helper to resolve the fault.
Because this depends on DMABUF structs/code, CONFIG_VFIO_PCI_CORE
needs to depend on CONFIG_DMA_SHARED_BUFFER. The
CONFIG_VFIO_PCI_DMABUF still conditionally enables the export
support code.
NOTE: The user mmap()s a device fd, but the resulting VMA's vm_file
becomes that of the DMABUF. The DMABUF takes ownership of the
device file and put()s it on release, which maintains the existing
behaviour of a VMA keeping the VFIO device open.
BAR zapping then happens via the existing vfio_pci_dma_buf_move()
path, which now needs to unmap PTEs in the DMABUF's address_space.
vfio/pci: Provide a user-facing name for BAR mappings
There was a request for decent debug naming in /proc/<pid>/maps
etc. comparable to the existing VFIO names: since the VMAs are
DMABUFs, they have a "dmabuf:" prefix and can't be 100% identical
to before. This is a user-visible change, but this patch at least
now gives us extra info on the BDF & BAR being mapped.
vfio/pci: Clean up BAR zap and revocation
In general (see NOTE!) the vfio_pci_zap_bars() is now obsolete,
since it unmaps PTEs in the VFIO device address_space which is now
unused. This consolidates all calls (e.g. around reset) with the
neighbouring vfio_pci_dma_buf_move()s into new functions, to
revoke/unrevoke (making the steps clearer).
NOTE: Because drivers can use their own vm_ops and override .mmap,
the core must conservatively assume an overridden .mmap might still
add PTEs to the VFIO device address_space and therefore still does
the zap. A new flag, zap_bars_on_revoke, enables the zap when
.mmap is overridden. A driver that does not need the zap can clear
this to opt-out, e.g. if the driver calls down to the common mmap
(and so uses DMABUFs).
vfio/pci: Support mmap() of a VFIO DMABUF
Adds mmap() for a DMABUF fd exported from vfio-pci.
It was a goal to keep the VFIO device fd lifetime behaviour
unchanged with respect to the DMABUFs. An application can close
all device fds, and this will revoke/clean up all DMABUFs; no
mappings or other access can be performed now. When enabling
mmap() of the DMABUFs, this means access through the VMA is also
revoked. This complicates the fault handler because whilst the
DMABUF exists, it has no guarantee that the corresponding VFIO
device is still alive. Adds synchronisation ensuring the vdev is
available before vdev->memory_lock is touched; this holds the
device registration so that even if the buffer has been cleaned up,
vdev hasn't been freed and so the lock can be safely taken.
vfio/pci: Permanently revoke a DMABUF on request
By weight, this is mostly a rename of revoked to an enum, status.
There are now 3 states for a buffer, usable and revoked
temporary/permanent. A new VFIO feature is added,
VFIO_DEVICE_FEATURE_DMA_BUF_REVOKE, which takes a DMABUF (exported
from the same device) and permanently revokes it. Thus a userspace
driver can guarantee any downstream consumers of a shared fd are
prevented from accessing a BAR range, and that range can be reused.
NOTE: This might block userspace, waiting on importers to detach.
The code doing revocation in vfio_pci_dma_buf_move() is moved, to a
common function for use by ..._move() and this new feature.
NOTE: See changelog, by request v4 added a condition to the
existing code to elide the unnecessary invalidation/sync on the
un-revoke path.)
vfio/pci: Add mmap() attributes to DMABUF feature
Adds a new VFIO feature, VFIO_DEVICE_FEATURE_DMA_BUF_MEMATTR.
After a DMABUF is exported, this feature is used to set a memory
attribute that will be used by future mmap()s of the DMABUF fd. It
doesn't affect existing maps.
The default is UC, and via the feature one can specify CPU access
as WC. The attribute is an enum/scalar rather than
bitmap/cumulative. The attributes follow a "try-fail" model where
a client can request an attribute and either succeed or fail with
ENOENT if it's unknown; if future attributes are platform-specific
then their support can be probed.
Since it's just UC/WC for now, there is no reservation or numeric
structure to the namespace yet, but we could support
system/arch-specific values in future by carving out base +
arch-specific + IMPDEF ranges.
Testing
=======
(The [RFC ONLY] userspace test program, for QEMU edu-plus, can be
found in the GitHub branch below. It at least illustrates how the
export, map, revoke, attribute, and close semantics interoperate.)
This code has been tested in mapping DMABUFs of single/multiple ranges
from multiple BARs, aliasing mmap()s, aliasing ranges across DMABUFs,
vm_pgoff > 0, revocation, shutdown/cleanup scenarios, and hugepage
mappings. I've lightly tested WC mappings also (by observing
resulting PTEs as having the correct attributes...). No regressions
observed on the VFIO selftests, or on our internal vfio-pci
applications. VFIO on i386 has been build-tested.
Dear Reviewers,
===============
I was grateful for the reviews and Reviewed-Bys on previous versions
from several of you; I've added some Reviewed-Bys.
But, various changes were also requested and I'm erring on the
conservative side: I have NOT included your Reviewed-Bys where the
patch has changed after your review (or where requested changes ended
up more than super-trivial). I hope that's okay.
End
===
This is based on v7.1.
These commits are on GitHub for easier browsing, along with
"[RFC ONLY] selftests: vfio: Add standalone vfio_dmabuf_mmap_test":
https://github.com/metamev/linux/compare/8cd9520d35a6...dev/mev/vfio-dmabuf-mmap-v4
Thanks for reading,
Matt
================================================================================
Changelog:
v4:
- Rebased on v7.1
- 1/9: Split the p2pdma.c pool release code into a new patch, making
the second patch a pure code-move exercise. Reworded the commit
messages, comment cleanups.
- 2/9 Look up PFNs helper: renamed DMABUF range search loop variables
for clarity, and simplified search loop and fallthrough/exit logic.
Moved WARN to ratelimited warnings. Rearranged pfn arithmetic to
avoid potential overflow. Clarified comments, better explanation
of vma_pgoff_adjust, spelling.
- 4/9 convert BAR mmap(): Trivial comment change, move. Used 'true'
instead of '1' for unmap_mapping_range() arg, consistent with
elsewhere in vfio-pci.
- 5/9 User-facing name for mappings: Uses kasprintf() instead.
- 6/9 Clean up BAR zap: Renamed functions to simplify/shorten names,
emphasising "revoke/unrevoke" actions. (Then internally this will
do a DMABUF revoke and possibly a PTE zap.)
NOTE: We debated ordering of a previous zap before a transition
to D0 before reset. The conclusion was the current patch is OK.
- 7/9 mmap() of a DMABUF: Added helper vfio_pci_set_vma_ops() to keep
ops struct static/local. Squashed vfio_pci_dma_buf_mmap() comments
for space/clarity.
- 8/9 DMABUF revoke: Fix typos in commit message. Implement request
to add a condition to revocation path to only invalidate/wait when
a buffer is being revoked (avoiding it on un-revoke).
NOTE: This means a (small) change to the code moved from
vfio_pci_dma_buf_move.
NOTE: Also, we discussed adding warnings for setting a state
matching the existing state; I didn't add them after all, because
these situations can occur in normal usage (e.g. a revoke of a
buffer from a device in D3, or a cleanup of a lingering
user-revoked buffer).
v3:
https://lore.kernel.org/all/20260610154327.37758-1-matt@ozlabs.org/
- Refactor p2pdma.c: split out pcim_p2pdma_provider() into a new
p2pdma_core.c under CONFIG_PCI_P2PDMA_CORE.
- vfio_pci_dma_buf_find_pfn() cleanups: Rename parameter to priv,
remove bad WARN, move unnecessary addition out of inner loop.
- vfio_pci_core_mmap_prep_dmabuf() cleanups: Remove uint32_t, remove
unnecessary const variable.
- Conversion of BAR mmap() to DMABUF: VFIO_PCI_DMABUF depends on
VFIO_PCI_CORE. vfio_pci_mmap_huge_fault(): move dev_dbg() outside
of lock (argh), remove READ_ONCE(vdev)/move priv->vdev read and
improve comment explanation.
- On revoke, BAR zap defaults to on if .mmap is overridden by a
driver (and implements an opt-out for the hisi_acc_vfio_pci driver,
which overrides mmap() with a simple wrapper that ends up using the
common DMABUF mmap() rather than custom mappings).
- Reworded commit "vfio/pci: Support mmap() of a VFIO DMABUF" message
for clarity. Reworded vfio_pci_mmap_huge_fault() comment for
accuracy (vdev validity depends on not being revoked).
Added comment in mmap() explaining belt-and-braces approach for
early detecting a map of a revoked buffer.
- Revoke now uses VFIO_DEVICE_FEATURE_DMA_BUF rather than a new
ioctl(); instead of the revoke helper taking 'revoked/permanently'
bools, it's become vfio_pci_dma_buf_set_status() taking a single
status enum. Added a READ_ONCE() for the lockless test of
priv->vdev (flags it as intentional, even if it's in practice going
to be a single-copy atomic read).
- Removed GET on vfio_pci_core_feature_dma_buf_memattr(), removed
unnecessary taking of memory_lock, fixed error return values. In
particular, removes ENOTSUPP, and uses ENOENT to indicate an
unknown attribute enum value was passed to SET. In the discussion
here,
https://lore.kernel.org/all/20260602131417.41366391@shazbot.org/
we'd agreed on EOPNOTSUPP before I realised that's already used
elsewhere. ENOENT uniquely indicates an unknown attribute.
v2:
https://lore.kernel.org/all/20260527102319.100128-1-mattev@meta.com/
- Rebase on VFIO next, picking up Alex's
vfio_pci_dma_buf_move()/vfio_pci_dma_buf_cleanup() fixes, and
dropping "vfio/pci: Fix vfio_pci_dma_buf_cleanup() double-put"
- Added "PCI/P2PDMA: Add CONFIG_PCI_P2PDMA_CORE" so that the
newly-added vfio-pci hard dependency on the P2PDMA provider instead
pulls in the _CORE variant and not the full-fat CONFIG_PCI_P2PDMA.
This means that the core of vfio-pci does not need ZONE_DEVICE, but
if it's available then enabling P2PDMA in turn enables DMABUF
export. Fixes basic VFIO operation on 32b or other platforms without
ZONE_DEVICE.
- Fixed comment inaccuracy in vfio_pci_dma_buf_revoke() and cleaned
up vdev validity test.
- vfio_pci_dma_buf_find_pfn(): use PAGE_ALIGN(), better span variable
naming, OVF check
- Made vm_pgoffs use consistent (keeping the resource index at the
top and masking where offset is used). For BAR mmap, use new
vma_pgoff_adjust to create the DMABUF with the exact mmap()ed span
instead of from the start of the BAR with an invisible portion
before the mapping.
- Added VFIO_DEVICE_FEATURE_DMA_BUF_MEMATTR to set memory attributes,
instead of using the export `flags` field.
- vfio_pci_ioctl_reset: Moved vfio_pci_zap_revoke_bars()
(effectively, vfio_pci_dma_buf_move()) back after D0 transition.
Note, if a BAR zap is needed, it's done in this function so now
happens after this D0 transition with the _move; it was done before
it at the time of the memory_lock taking.
- Minimised vfio_pci_dma_buf_mmap() (removed redundant span check),
added READ_ONCE for memattr
- Misc fixes: comment in DMABUF name generation, removed superfluous
READ_ONCE from faulthandler
v1:
https://lore.kernel.org/kvm/20260416131815.2729131-1-mattev@meta.com/
- Cleanup of the common DMABUF-aware VMA vm_ops fault handler and
export code.
- Fixed a lot of races, particularly faults racing with DMABUF
cleanup (if the VFIO device fds close, for example).
- Added nicer human-readable names for VFIO mmap() VMAs
RFCv2: Respin based on the feedback/suggestions:
https://lore.kernel.org/kvm/20260312184613.3710705-1-mattev@meta.com/
- Transform the existing VFIO BAR mmap path to also use DMABUFs
behind the scenes, and then simply share that code for
explicitly-mapped DMABUFs. Jason wanted to go that direction to
enable iommufd VFIO type 1 emulation to pick up a DMABUF for an IO
mapping.
- Revoke buffers using a VFIO device fd ioctl
RFCv1:
https://lore.kernel.org/all/20260226202211.929005-1-mattev@meta.com/
Matt Evans (10):
PCI/P2PDMA: Split pool-related cleanup out of pci_p2pdma_release()
PCI/P2PDMA: Add CONFIG_PCI_P2PDMA_CORE
vfio/pci: Add a helper to look up PFNs for DMABUFs
vfio/pci: Add a helper to create a DMABUF for a BAR-map VMA
vfio/pci: Convert BAR mmap() to use a DMABUF
vfio/pci: Provide a user-facing name for BAR mappings
vfio/pci: Clean up BAR zap and revocation
vfio/pci: Support mmap() of a VFIO DMABUF
vfio/pci: Permanently revoke a DMABUF on request
vfio/pci: Add mmap() attributes to DMABUF feature
MAINTAINERS | 2 +-
drivers/pci/Kconfig | 10 +-
drivers/pci/Makefile | 1 +
drivers/pci/p2pdma.c | 109 +---
drivers/pci/p2pdma.h | 29 +
drivers/pci/p2pdma_core.c | 118 ++++
drivers/vfio/pci/Kconfig | 5 +-
drivers/vfio/pci/Makefile | 3 +-
.../vfio/pci/hisilicon/hisi_acc_vfio_pci.c | 8 +
drivers/vfio/pci/vfio_pci_config.c | 30 +-
drivers/vfio/pci/vfio_pci_core.c | 211 +++++--
drivers/vfio/pci/vfio_pci_dmabuf.c | 568 +++++++++++++++---
drivers/vfio/pci/vfio_pci_priv.h | 63 +-
include/linux/pci-p2pdma.h | 24 +-
include/linux/pci.h | 2 +-
include/linux/vfio_pci_core.h | 1 +
include/uapi/linux/vfio.h | 47 ++
17 files changed, 960 insertions(+), 271 deletions(-)
create mode 100644 drivers/pci/p2pdma.h
create mode 100644 drivers/pci/p2pdma_core.c
--
2.50.1 (Apple Git-155)
^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH v4 01/10] PCI/P2PDMA: Split pool-related cleanup out of pci_p2pdma_release()
2026-07-01 17:12 [PATCH v4 00/10] vfio/pci: Add mmap() for DMABUFs Matt Evans
@ 2026-07-01 17:12 ` Matt Evans
2026-07-01 21:15 ` Bjorn Helgaas
2026-07-01 17:12 ` [PATCH v4 02/10] PCI/P2PDMA: Add CONFIG_PCI_P2PDMA_CORE Matt Evans
` (8 subsequent siblings)
9 siblings, 1 reply; 13+ messages in thread
From: Matt Evans @ 2026-07-01 17:12 UTC (permalink / raw)
To: Alex Williamson, Leon Romanovsky, Jason Gunthorpe, Alex Mastro,
Christian König, Bjorn Helgaas, Logan Gunthorpe, Kevin Tian,
Pranjal Shrivastava
Cc: Mahmoud Adam, David Matlack, Björn Töpel, Sumit Semwal,
Ankit Agrawal, Alistair Popple, Vivek Kasireddy, linux-kernel,
linux-media, dri-devel, linaro-mm-sig, kvm, linux-pci
Preparing for a refactor in a subsequent patch, split the pool-related
release code into a new pci_p2pdma_release_pool() function.
This allows future compile-time selection of a null implementation for
pci_p2pdma_release_pool(), when p2pdma.c is refactored into core- and
P2P-related files.
Signed-off-by: Matt Evans <matt@ozlabs.org>
---
drivers/pci/p2pdma.c | 24 ++++++++++++++----------
1 file changed, 14 insertions(+), 10 deletions(-)
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 7c898542af8d..a5a1baebc34e 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -226,6 +226,17 @@ static const struct dev_pagemap_ops p2pdma_pgmap_ops = {
.folio_free = p2pdma_folio_free,
};
+static void pci_p2pdma_release_pool(struct pci_dev *pdev,
+ struct pci_p2pdma *p2pdma)
+{
+ if (!p2pdma->pool)
+ return;
+
+ synchronize_rcu();
+ gen_pool_destroy(p2pdma->pool);
+ sysfs_remove_group(&pdev->dev.kobj, &p2pmem_group);
+}
+
static void pci_p2pdma_release(void *data)
{
struct pci_dev *pdev = data;
@@ -237,15 +248,8 @@ static void pci_p2pdma_release(void *data)
/* Flush and disable pci_alloc_p2p_mem() */
pdev->p2pdma = NULL;
- if (p2pdma->pool)
- synchronize_rcu();
+ pci_p2pdma_release_pool(pdev, p2pdma);
xa_destroy(&p2pdma->map_types);
-
- if (!p2pdma->pool)
- return;
-
- gen_pool_destroy(p2pdma->pool);
- sysfs_remove_group(&pdev->dev.kobj, &p2pmem_group);
}
/**
@@ -932,8 +936,8 @@ void *pci_alloc_p2pmem(struct pci_dev *pdev, size_t size)
struct pci_p2pdma *p2pdma;
/*
- * Pairs with synchronize_rcu() in pci_p2pdma_release() to
- * ensure pdev->p2pdma is non-NULL for the duration of the
+ * Pairs with synchronize_rcu() in pci_p2pdma_release_pool()
+ * to ensure pdev->p2pdma is non-NULL for the duration of the
* read-lock.
*/
rcu_read_lock();
--
2.50.1 (Apple Git-155)
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH v4 02/10] PCI/P2PDMA: Add CONFIG_PCI_P2PDMA_CORE
2026-07-01 17:12 [PATCH v4 00/10] vfio/pci: Add mmap() for DMABUFs Matt Evans
2026-07-01 17:12 ` [PATCH v4 01/10] PCI/P2PDMA: Split pool-related cleanup out of pci_p2pdma_release() Matt Evans
@ 2026-07-01 17:12 ` Matt Evans
2026-07-01 21:16 ` Bjorn Helgaas
2026-07-01 17:12 ` [PATCH v4 03/10] vfio/pci: Add a helper to look up PFNs for DMABUFs Matt Evans
` (7 subsequent siblings)
9 siblings, 1 reply; 13+ messages in thread
From: Matt Evans @ 2026-07-01 17:12 UTC (permalink / raw)
To: Alex Williamson, Leon Romanovsky, Jason Gunthorpe, Alex Mastro,
Christian König, Bjorn Helgaas, Logan Gunthorpe, Kevin Tian,
Pranjal Shrivastava
Cc: Mahmoud Adam, David Matlack, Björn Töpel, Sumit Semwal,
Ankit Agrawal, Alistair Popple, Vivek Kasireddy, linux-kernel,
linux-media, dri-devel, linaro-mm-sig, kvm, linux-pci
The P2PDMA code currently provides two features under the same
CONFIG_PCI_P2PDMA option:
1. Locate providers via pcim_p2pdma_provider()
2. Manage actual P2P DMA
Some drivers (such as vfio-pci) depend on (1), without having a hard
dependency on (2).
A future vfio-pci commit will rely on pcim_p2pdma_provider() always
being present. If that depended on CONFIG_PCI_P2PDMA, it would make
vfio-pci only available if CONFIG_ZONE_DEVICE is present (e.g. 64-bit
systems), even when P2P is not needed.
To resolve this, introduce CONFIG_PCI_P2PDMA_CORE and refactor the
basic provider functionality into a new p2pdma_core.c file. This is
available even if the CONFIG_PCI_P2PDMA feature is disabled (or
unavailable due to !CONFIG_ZONE_DEVICE), satisfying (1).
Then, when the original CONFIG_PCI_P2PDMA is set, drivers have access
to the additional P2P features of (2). This still depends on
CONFIG_ZONE_DEVICE.
Signed-off-by: Matt Evans <matt@ozlabs.org>
---
MAINTAINERS | 2 +-
drivers/pci/Kconfig | 10 ++--
drivers/pci/Makefile | 1 +
drivers/pci/p2pdma.c | 107 +--------------------------------
drivers/pci/p2pdma.h | 29 +++++++++
drivers/pci/p2pdma_core.c | 118 +++++++++++++++++++++++++++++++++++++
include/linux/pci-p2pdma.h | 24 ++++----
include/linux/pci.h | 2 +-
8 files changed, 171 insertions(+), 122 deletions(-)
create mode 100644 drivers/pci/p2pdma.h
create mode 100644 drivers/pci/p2pdma_core.c
diff --git a/MAINTAINERS b/MAINTAINERS
index c8d4b913f26c..713861af4484 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -20626,7 +20626,7 @@ B: https://bugzilla.kernel.org
C: irc://irc.oftc.net/linux-pci
T: git git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci.git
F: Documentation/driver-api/pci/p2pdma.rst
-F: drivers/pci/p2pdma.c
+F: drivers/pci/p2pdma*
F: include/linux/pci-p2pdma.h
PCI POWER CONTROL
diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index 33c88432b728..59d70bc84cc9 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -206,11 +206,7 @@ config PCIE_TPH
config PCI_P2PDMA
bool "PCI peer-to-peer transfer support"
depends on ZONE_DEVICE
- #
- # The need for the scatterlist DMA bus address flag means PCI P2PDMA
- # requires 64bit
- #
- depends on 64BIT
+ select PCI_P2PDMA_CORE
select GENERIC_ALLOCATOR
select NEED_SG_DMA_FLAGS
help
@@ -226,6 +222,10 @@ config PCI_P2PDMA
If unsure, say N.
+config PCI_P2PDMA_CORE
+ default n
+ bool
+
config PCI_LABEL
def_bool y if (DMI || ACPI)
select NLS
diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
index 41ebc3b9a518..0b32572d57a1 100644
--- a/drivers/pci/Makefile
+++ b/drivers/pci/Makefile
@@ -30,6 +30,7 @@ obj-$(CONFIG_PCI_SYSCALL) += syscall.o
obj-$(CONFIG_PCI_STUB) += pci-stub.o
obj-$(CONFIG_PCI_PF_STUB) += pci-pf-stub.o
obj-$(CONFIG_PCI_ECAM) += ecam.o
+obj-$(CONFIG_PCI_P2PDMA_CORE) += p2pdma_core.o
obj-$(CONFIG_PCI_P2PDMA) += p2pdma.o
obj-$(CONFIG_XEN_PCIDEV_FRONTEND) += xen-pcifront.o
obj-$(CONFIG_VGA_ARB) += vgaarb.o
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index a5a1baebc34e..50b1a7daf55c 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -21,12 +21,7 @@
#include <linux/seq_buf.h>
#include <linux/xarray.h>
-struct pci_p2pdma {
- struct gen_pool *pool;
- bool p2pmem_published;
- struct xarray map_types;
- struct p2pdma_provider mem[PCI_STD_NUM_BARS];
-};
+#include "p2pdma.h"
struct pci_p2pdma_pagemap {
struct dev_pagemap pgmap;
@@ -226,8 +221,7 @@ static const struct dev_pagemap_ops p2pdma_pgmap_ops = {
.folio_free = p2pdma_folio_free,
};
-static void pci_p2pdma_release_pool(struct pci_dev *pdev,
- struct pci_p2pdma *p2pdma)
+void pci_p2pdma_release_pool(struct pci_dev *pdev, struct pci_p2pdma *p2pdma)
{
if (!p2pdma->pool)
return;
@@ -237,103 +231,6 @@ static void pci_p2pdma_release_pool(struct pci_dev *pdev,
sysfs_remove_group(&pdev->dev.kobj, &p2pmem_group);
}
-static void pci_p2pdma_release(void *data)
-{
- struct pci_dev *pdev = data;
- struct pci_p2pdma *p2pdma;
-
- p2pdma = rcu_dereference_protected(pdev->p2pdma, 1);
- if (!p2pdma)
- return;
-
- /* Flush and disable pci_alloc_p2p_mem() */
- pdev->p2pdma = NULL;
- pci_p2pdma_release_pool(pdev, p2pdma);
- xa_destroy(&p2pdma->map_types);
-}
-
-/**
- * pcim_p2pdma_init - Initialise peer-to-peer DMA providers
- * @pdev: The PCI device to enable P2PDMA for
- *
- * This function initializes the peer-to-peer DMA infrastructure
- * for a PCI device. It allocates and sets up the necessary data
- * structures to support P2PDMA operations, including mapping type
- * tracking.
- */
-int pcim_p2pdma_init(struct pci_dev *pdev)
-{
- struct pci_p2pdma *p2p;
- int i, ret;
-
- p2p = rcu_dereference_protected(pdev->p2pdma, 1);
- if (p2p)
- return 0;
-
- p2p = devm_kzalloc(&pdev->dev, sizeof(*p2p), GFP_KERNEL);
- if (!p2p)
- return -ENOMEM;
-
- xa_init(&p2p->map_types);
- /*
- * Iterate over all standard PCI BARs and record only those that
- * correspond to MMIO regions. Skip non-memory resources (e.g. I/O
- * port BARs) since they cannot be used for peer-to-peer (P2P)
- * transactions.
- */
- for (i = 0; i < PCI_STD_NUM_BARS; i++) {
- if (!(pci_resource_flags(pdev, i) & IORESOURCE_MEM))
- continue;
-
- p2p->mem[i].owner = &pdev->dev;
- p2p->mem[i].bus_offset =
- pci_bus_address(pdev, i) - pci_resource_start(pdev, i);
- }
-
- ret = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_release, pdev);
- if (ret)
- goto out_p2p;
-
- rcu_assign_pointer(pdev->p2pdma, p2p);
- return 0;
-
-out_p2p:
- devm_kfree(&pdev->dev, p2p);
- return ret;
-}
-EXPORT_SYMBOL_GPL(pcim_p2pdma_init);
-
-/**
- * pcim_p2pdma_provider - Get peer-to-peer DMA provider
- * @pdev: The PCI device to enable P2PDMA for
- * @bar: BAR index to get provider
- *
- * This function gets peer-to-peer DMA provider for a PCI device. The lifetime
- * of the provider (and of course the MMIO) is bound to the lifetime of the
- * driver. A driver calling this function must ensure that all references to the
- * provider, and any DMA mappings created for any MMIO, are all cleaned up
- * before the driver remove() completes.
- *
- * Since P2P is almost always shared with a second driver this means some system
- * to notify, invalidate and revoke the MMIO's DMA must be in place to use this
- * function. For example a revoke can be built using DMABUF.
- */
-struct p2pdma_provider *pcim_p2pdma_provider(struct pci_dev *pdev, int bar)
-{
- struct pci_p2pdma *p2p;
-
- if (!(pci_resource_flags(pdev, bar) & IORESOURCE_MEM))
- return NULL;
-
- p2p = rcu_dereference_protected(pdev->p2pdma, 1);
- if (WARN_ON(!p2p))
- /* Someone forgot to call to pcim_p2pdma_init() before */
- return NULL;
-
- return &p2p->mem[bar];
-}
-EXPORT_SYMBOL_GPL(pcim_p2pdma_provider);
-
static int pci_p2pdma_setup_pool(struct pci_dev *pdev)
{
struct pci_p2pdma *p2pdma;
diff --git a/drivers/pci/p2pdma.h b/drivers/pci/p2pdma.h
new file mode 100644
index 000000000000..946383809981
--- /dev/null
+++ b/drivers/pci/p2pdma.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * PCI peer-to-peer DMA support.
+ */
+
+#ifndef _PCI_P2PDMA_H
+#define _PCI_P2PDMA_H
+
+#include <linux/genalloc.h>
+#include <linux/pci-p2pdma.h>
+#include <linux/xarray.h>
+
+struct pci_p2pdma {
+ struct gen_pool *pool;
+ bool p2pmem_published;
+ struct xarray map_types;
+ struct p2pdma_provider mem[PCI_STD_NUM_BARS];
+};
+
+#ifdef CONFIG_PCI_P2PDMA
+void pci_p2pdma_release_pool(struct pci_dev *pdev, struct pci_p2pdma *p2pdma);
+#else
+static inline void pci_p2pdma_release_pool(struct pci_dev *pdev,
+ struct pci_p2pdma *p2pdma)
+{
+}
+#endif
+
+#endif
diff --git a/drivers/pci/p2pdma_core.c b/drivers/pci/p2pdma_core.c
new file mode 100644
index 000000000000..bb2138bf2bc7
--- /dev/null
+++ b/drivers/pci/p2pdma_core.c
@@ -0,0 +1,118 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * PCI peer-to-peer DMA support core, providing a bare-bones
+ * pcim_p2pdma_provider() interface to drivers even if full P2PDMA
+ * isn't present. The full P2PDMA feature is in p2pdma.c (see
+ * CONFIG_PCI_P2PDMA).
+ *
+ * Copyright (c) 2016-2018, Logan Gunthorpe
+ * Copyright (c) 2016-2017, Microsemi Corporation
+ * Copyright (c) 2017, Christoph Hellwig
+ * Copyright (c) 2018, Eideticom Inc.
+ */
+
+#define pr_fmt(fmt) "pci-p2pdma: " fmt
+#include <linux/ctype.h>
+#include <linux/genalloc.h>
+#include <linux/memremap.h>
+#include <linux/pci-p2pdma.h>
+#include <linux/xarray.h>
+
+#include "p2pdma.h"
+
+static void pci_p2pdma_release(void *data)
+{
+ struct pci_dev *pdev = data;
+ struct pci_p2pdma *p2pdma;
+
+ p2pdma = rcu_dereference_protected(pdev->p2pdma, 1);
+ if (!p2pdma)
+ return;
+
+ /* Flush and disable pci_alloc_p2p_mem() */
+ pdev->p2pdma = NULL;
+ pci_p2pdma_release_pool(pdev, p2pdma);
+ xa_destroy(&p2pdma->map_types);
+}
+
+/**
+ * pcim_p2pdma_init - Initialise peer-to-peer DMA providers
+ * @pdev: The PCI device to enable P2PDMA for
+ *
+ * This function initializes the peer-to-peer DMA infrastructure
+ * for a PCI device. It allocates and sets up the necessary data
+ * structures to support P2PDMA operations, including mapping type
+ * tracking.
+ */
+int pcim_p2pdma_init(struct pci_dev *pdev)
+{
+ struct pci_p2pdma *p2p;
+ int i, ret;
+
+ p2p = rcu_dereference_protected(pdev->p2pdma, 1);
+ if (p2p)
+ return 0;
+
+ p2p = devm_kzalloc(&pdev->dev, sizeof(*p2p), GFP_KERNEL);
+ if (!p2p)
+ return -ENOMEM;
+
+ xa_init(&p2p->map_types);
+ /*
+ * Iterate over all standard PCI BARs and record only those that
+ * correspond to MMIO regions. Skip non-memory resources (e.g. I/O
+ * port BARs) since they cannot be used for peer-to-peer (P2P)
+ * transactions.
+ */
+ for (i = 0; i < PCI_STD_NUM_BARS; i++) {
+ if (!(pci_resource_flags(pdev, i) & IORESOURCE_MEM))
+ continue;
+
+ p2p->mem[i].owner = &pdev->dev;
+ p2p->mem[i].bus_offset =
+ pci_bus_address(pdev, i) - pci_resource_start(pdev, i);
+ }
+
+ ret = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_release, pdev);
+ if (ret)
+ goto out_p2p;
+
+ rcu_assign_pointer(pdev->p2pdma, p2p);
+ return 0;
+
+out_p2p:
+ devm_kfree(&pdev->dev, p2p);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(pcim_p2pdma_init);
+
+/**
+ * pcim_p2pdma_provider - Get peer-to-peer DMA provider
+ * @pdev: The PCI device to enable P2PDMA for
+ * @bar: BAR index to get provider
+ *
+ * This function gets peer-to-peer DMA provider for a PCI device. The lifetime
+ * of the provider (and of course the MMIO) is bound to the lifetime of the
+ * driver. A driver calling this function must ensure that all references to the
+ * provider, and any DMA mappings created for any MMIO, are all cleaned up
+ * before the driver remove() completes.
+ *
+ * Since P2P is almost always shared with a second driver this means some system
+ * to notify, invalidate and revoke the MMIO's DMA must be in place to use this
+ * function. For example a revoke can be built using DMABUF.
+ */
+struct p2pdma_provider *pcim_p2pdma_provider(struct pci_dev *pdev, int bar)
+{
+ struct pci_p2pdma *p2p;
+
+ if (!(pci_resource_flags(pdev, bar) & IORESOURCE_MEM))
+ return NULL;
+
+ p2p = rcu_dereference_protected(pdev->p2pdma, 1);
+ if (WARN_ON(!p2p))
+ /* Someone forgot to call to pcim_p2pdma_init() before */
+ return NULL;
+
+ return &p2p->mem[bar];
+}
+EXPORT_SYMBOL_GPL(pcim_p2pdma_provider);
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index 873de20a2247..4c42a7b2ee85 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -67,9 +67,22 @@ enum pci_p2pdma_map_type {
PCI_P2PDMA_MAP_THRU_HOST_BRIDGE,
};
-#ifdef CONFIG_PCI_P2PDMA
+#ifdef CONFIG_PCI_P2PDMA_CORE
int pcim_p2pdma_init(struct pci_dev *pdev);
struct p2pdma_provider *pcim_p2pdma_provider(struct pci_dev *pdev, int bar);
+#else
+static inline int pcim_p2pdma_init(struct pci_dev *pdev)
+{
+ return -EOPNOTSUPP;
+}
+static inline struct p2pdma_provider *pcim_p2pdma_provider(struct pci_dev *pdev,
+ int bar)
+{
+ return NULL;
+}
+#endif
+
+#ifdef CONFIG_PCI_P2PDMA
int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
u64 offset);
int pci_p2pdma_distance_many(struct pci_dev *provider, struct device **clients,
@@ -89,15 +102,6 @@ ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev,
enum pci_p2pdma_map_type pci_p2pdma_map_type(struct p2pdma_provider *provider,
struct device *dev);
#else /* CONFIG_PCI_P2PDMA */
-static inline int pcim_p2pdma_init(struct pci_dev *pdev)
-{
- return -EOPNOTSUPP;
-}
-static inline struct p2pdma_provider *pcim_p2pdma_provider(struct pci_dev *pdev,
- int bar)
-{
- return NULL;
-}
static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar,
size_t size, u64 offset)
{
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 2c4454583c11..531aec355686 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -557,7 +557,7 @@ struct pci_dev {
u16 pasid_cap; /* PASID Capability offset */
u16 pasid_features;
#endif
-#ifdef CONFIG_PCI_P2PDMA
+#ifdef CONFIG_PCI_P2PDMA_CORE
struct pci_p2pdma __rcu *p2pdma;
#endif
#ifdef CONFIG_PCI_DOE
--
2.50.1 (Apple Git-155)
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH v4 03/10] vfio/pci: Add a helper to look up PFNs for DMABUFs
2026-07-01 17:12 [PATCH v4 00/10] vfio/pci: Add mmap() for DMABUFs Matt Evans
2026-07-01 17:12 ` [PATCH v4 01/10] PCI/P2PDMA: Split pool-related cleanup out of pci_p2pdma_release() Matt Evans
2026-07-01 17:12 ` [PATCH v4 02/10] PCI/P2PDMA: Add CONFIG_PCI_P2PDMA_CORE Matt Evans
@ 2026-07-01 17:12 ` Matt Evans
2026-07-01 17:12 ` [PATCH v4 04/10] vfio/pci: Add a helper to create a DMABUF for a BAR-map VMA Matt Evans
` (6 subsequent siblings)
9 siblings, 0 replies; 13+ messages in thread
From: Matt Evans @ 2026-07-01 17:12 UTC (permalink / raw)
To: Alex Williamson, Leon Romanovsky, Jason Gunthorpe, Alex Mastro,
Christian König, Bjorn Helgaas, Logan Gunthorpe, Kevin Tian,
Pranjal Shrivastava
Cc: Mahmoud Adam, David Matlack, Björn Töpel, Sumit Semwal,
Ankit Agrawal, Alistair Popple, Vivek Kasireddy, linux-kernel,
linux-media, dri-devel, linaro-mm-sig, kvm, linux-pci
Add vfio_pci_dma_buf_find_pfn(), which a VMA fault handler can use to
find a PFN.
This supports multi-range DMABUFs, which typically would be used to
represent scattered spans but might even represent overlapping or
aliasing spans of PFNs.
Because this is intended to be used in vfio_pci_core.c, we also need
to expose the struct vfio_pci_dma_buf in the vfio_pci_priv.h header.
Signed-off-by: Matt Evans <matt@ozlabs.org>
---
drivers/vfio/pci/vfio_pci_dmabuf.c | 153 ++++++++++++++++++++++++++---
drivers/vfio/pci/vfio_pci_priv.h | 20 ++++
2 files changed, 160 insertions(+), 13 deletions(-)
diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c
index 1a177ce7de54..2e9adc3fe0f7 100644
--- a/drivers/vfio/pci/vfio_pci_dmabuf.c
+++ b/drivers/vfio/pci/vfio_pci_dmabuf.c
@@ -9,19 +9,6 @@
MODULE_IMPORT_NS("DMA_BUF");
-struct vfio_pci_dma_buf {
- struct dma_buf *dmabuf;
- struct vfio_pci_core_device *vdev;
- struct list_head dmabufs_elm;
- size_t size;
- struct phys_vec *phys_vec;
- struct p2pdma_provider *provider;
- u32 nr_ranges;
- struct kref kref;
- struct completion comp;
- u8 revoked : 1;
-};
-
static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf,
struct dma_buf_attachment *attachment)
{
@@ -106,6 +93,146 @@ static const struct dma_buf_ops vfio_pci_dmabuf_ops = {
.release = vfio_pci_dma_buf_release,
};
+int vfio_pci_dma_buf_find_pfn(struct vfio_pci_dma_buf *priv,
+ struct vm_area_struct *vma,
+ unsigned long fault_addr,
+ unsigned int order,
+ unsigned long *out_pfn)
+{
+ /*
+ * Given a VMA (start, end, pgoffs) and a fault address,
+ * search the corresponding DMABUF's phys_vec[] to find the
+ * range representing the address's offset into the VMA, and
+ * its PFN.
+ *
+ * The phys_vec[] ranges represent contiguous spans of VAs
+ * upwards from the buffer offset 0; the actual PFNs might be
+ * in any order, overlap/alias, etc. Calculate an offset of
+ * the desired page given VMA start/pgoff and address, then
+ * search upwards from 0 to find which span contains it.
+ *
+ * On success, a valid PFN for a page sized by 'order' is
+ * returned into out_pfn.
+ *
+ * Failure occurs if:
+ * - A hugepage would cross the edge of the VMA,
+ * - A hugepage isn't entirely contained within a range
+ * (including where it straddles the boundary between
+ * ranges),
+ * - We find a range, but the final PFN isn't aligned to the
+ * requested order.
+ *
+ * Upon failure, -EAGAIN is returned and the caller is
+ * expected to try again with a smaller order, which will
+ * eventually succeed (order=0 will always work).
+ *
+ * It's suboptimal if DMABUFs are created with neighbouring
+ * ranges that are physically contiguous, since hugepages
+ * can't straddle range boundaries. (The construction of the
+ * ranges should merge them in this case.)
+ *
+ * Finally, vma_pgoff_adjust is used with a DMABUF created for
+ * a VFIO BAR mmap: a BAR mapped with vm_pgoff > 0 creates a
+ * DMABUF such that byte 0 of the VMA corresponds to byte 0 of
+ * the DMABUF and byte 'vm_pgoff << PAGE_SHIFT' into the BAR.
+ * To avoid double-offsetting in this scenario, subtracting
+ * vma_pgoff_adjust from this (non-zero) vm_pgoff generates
+ * the effective offset.
+ */
+
+ const unsigned long pagesize = PAGE_SIZE << order;
+ unsigned long vma_off = ((vma->vm_pgoff - priv->vma_pgoff_adjust) <<
+ PAGE_SHIFT) & VFIO_PCI_OFFSET_MASK;
+ unsigned long rounded_page_addr = ALIGN_DOWN(fault_addr, pagesize);
+ unsigned long rounded_page_end = rounded_page_addr + pagesize;
+ unsigned long fault_offset;
+ unsigned long fault_offset_end;
+ unsigned long range_start_offset = 0;
+ unsigned int i;
+ int ret;
+
+ if (rounded_page_addr < vma->vm_start || rounded_page_end > vma->vm_end) {
+ if (order > 0)
+ return -EAGAIN;
+
+ /* A fault address outside of the VMA is absurd. */
+ dev_warn_ratelimited(
+ &priv->vdev->pdev->dev,
+ "Fault addr 0x%lx outside VMA 0x%lx-0x%lx\n",
+ fault_addr, vma->vm_start, vma->vm_end);
+ return -EFAULT;
+ }
+
+ /*
+ * fault_offset[_end] is the span within the DMABUF
+ * corresponding to the faulting page:
+ */
+ if (unlikely(check_add_overflow(rounded_page_addr - vma->vm_start,
+ vma_off, &fault_offset) ||
+ check_add_overflow(fault_offset, pagesize,
+ &fault_offset_end)))
+ return -EFAULT;
+
+ /*
+ * Iterate over ranges in the buffer, summing their lengths:
+ * range_start_offset represents the current range's starting
+ * offset in the buffer (from 0 upwards).
+ *
+ * A failure for order == 0 is unexpected, and triggers a
+ * fault/warn.
+ */
+ ret = (order == 0) ? -EFAULT : -EAGAIN;
+
+ for (i = 0; i < priv->nr_ranges; i++) {
+ size_t range_len = priv->phys_vec[i].len;
+
+ /* Early exit if range starts after the page end */
+ if (fault_offset_end <= range_start_offset)
+ break;
+
+ if (fault_offset >= range_start_offset &&
+ fault_offset_end <= range_start_offset + range_len) {
+ /*
+ * The faulting page is wholly contained
+ * within the span represented by this range,
+ * so validate PFN alignment for the order.
+ * The if() condition ensures the pfn
+ * arithmetic won't overflow.
+ */
+ unsigned long pfn =
+ ((fault_offset - range_start_offset) +
+ priv->phys_vec[i].paddr) >> PAGE_SHIFT;
+
+ if (IS_ALIGNED(pfn, 1 << order)) {
+ *out_pfn = pfn;
+ ret = 0;
+ }
+ /*
+ * Else order > 0; EAGAIN retries with smaller
+ * order
+ */
+ break;
+ }
+ range_start_offset += range_len;
+ }
+
+ if (order == 0 && ret != 0)
+ /*
+ * The address fell outside of the span represented by
+ * the (concatenated) ranges. As setup of a mapping
+ * ensures that the VMA is <= the total size of the
+ * ranges this should never happen. If it does, warn
+ * and SIGBUS.
+ */
+ dev_warn_ratelimited(
+ &priv->vdev->pdev->dev,
+ "No range for addr 0x%lx, order %d: VMA 0x%lx-0x%lx pgoff 0x%lx, %u ranges, size 0x%zx\n",
+ fault_addr, order, vma->vm_start, vma->vm_end,
+ vma->vm_pgoff, priv->nr_ranges, priv->size);
+
+ return ret;
+}
+
/*
* This is a temporary "private interconnect" between VFIO DMABUF and iommufd.
* It allows the two co-operating drivers to exchange the physical address of
diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
index fca9d0dfac90..c8f6f959056a 100644
--- a/drivers/vfio/pci/vfio_pci_priv.h
+++ b/drivers/vfio/pci/vfio_pci_priv.h
@@ -23,6 +23,20 @@ struct vfio_pci_ioeventfd {
bool test_mem;
};
+struct vfio_pci_dma_buf {
+ struct dma_buf *dmabuf;
+ struct vfio_pci_core_device *vdev;
+ struct list_head dmabufs_elm;
+ size_t size;
+ struct phys_vec *phys_vec;
+ struct p2pdma_provider *provider;
+ u32 nr_ranges;
+ struct kref kref;
+ struct completion comp;
+ unsigned long vma_pgoff_adjust;
+ u8 revoked : 1;
+};
+
bool vfio_pci_intx_mask(struct vfio_pci_core_device *vdev);
void vfio_pci_intx_unmask(struct vfio_pci_core_device *vdev);
@@ -114,6 +128,12 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
}
+int vfio_pci_dma_buf_find_pfn(struct vfio_pci_dma_buf *vpdmabuf,
+ struct vm_area_struct *vma,
+ unsigned long address,
+ unsigned int order,
+ unsigned long *out_pfn);
+
#ifdef CONFIG_VFIO_PCI_DMABUF
int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
struct vfio_device_feature_dma_buf __user *arg,
--
2.50.1 (Apple Git-155)
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH v4 04/10] vfio/pci: Add a helper to create a DMABUF for a BAR-map VMA
2026-07-01 17:12 [PATCH v4 00/10] vfio/pci: Add mmap() for DMABUFs Matt Evans
` (2 preceding siblings ...)
2026-07-01 17:12 ` [PATCH v4 03/10] vfio/pci: Add a helper to look up PFNs for DMABUFs Matt Evans
@ 2026-07-01 17:12 ` Matt Evans
2026-07-01 17:12 ` [PATCH v4 05/10] vfio/pci: Convert BAR mmap() to use a DMABUF Matt Evans
` (5 subsequent siblings)
9 siblings, 0 replies; 13+ messages in thread
From: Matt Evans @ 2026-07-01 17:12 UTC (permalink / raw)
To: Alex Williamson, Leon Romanovsky, Jason Gunthorpe, Alex Mastro,
Christian König, Bjorn Helgaas, Logan Gunthorpe, Kevin Tian,
Pranjal Shrivastava
Cc: Mahmoud Adam, David Matlack, Björn Töpel, Sumit Semwal,
Ankit Agrawal, Alistair Popple, Vivek Kasireddy, linux-kernel,
linux-media, dri-devel, linaro-mm-sig, kvm, linux-pci
This helper, vfio_pci_core_mmap_prep_dmabuf(), creates a single-range
DMABUF for the purpose of mapping a PCI BAR. This is used in a future
commit by VFIO's ordinary mmap() path.
This function transfers ownership of the VFIO device fd to the
DMABUF, which fput()s when it's released.
Refactor the existing vfio_pci_core_feature_dma_buf() to split out
export code common to the two paths, VFIO_DEVICE_FEATURE_DMA_BUF and
this new VFIO_BAR mmap().
Signed-off-by: Matt Evans <matt@ozlabs.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
---
drivers/vfio/pci/vfio_pci_dmabuf.c | 142 +++++++++++++++++++++++------
drivers/vfio/pci/vfio_pci_priv.h | 5 +
2 files changed, 117 insertions(+), 30 deletions(-)
diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c
index 2e9adc3fe0f7..9a80acc1ab19 100644
--- a/drivers/vfio/pci/vfio_pci_dmabuf.c
+++ b/drivers/vfio/pci/vfio_pci_dmabuf.c
@@ -82,6 +82,8 @@ static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf)
up_write(&priv->vdev->memory_lock);
vfio_device_put_registration(&priv->vdev->vdev);
}
+ if (priv->vfile)
+ fput(priv->vfile);
kfree(priv->phys_vec);
kfree(priv);
}
@@ -233,6 +235,45 @@ int vfio_pci_dma_buf_find_pfn(struct vfio_pci_dma_buf *priv,
return ret;
}
+/*
+ * Create a DMABUF corresponding to priv, add it to vdev->dmabufs list
+ * for tracking (meaning cleanup or revocation will zap it), and take
+ * a vfio_device registration.
+ */
+static int vfio_pci_dmabuf_export(struct vfio_pci_core_device *vdev,
+ struct vfio_pci_dma_buf *priv, u32 flags)
+{
+ DEFINE_DMA_BUF_EXPORT_INFO(exp_info);
+
+ if (!vfio_device_try_get_registration(&vdev->vdev))
+ return -ENODEV;
+
+ exp_info.ops = &vfio_pci_dmabuf_ops;
+ exp_info.size = priv->size;
+ exp_info.flags = flags;
+ exp_info.priv = priv;
+
+ priv->dmabuf = dma_buf_export(&exp_info);
+ if (IS_ERR(priv->dmabuf)) {
+ vfio_device_put_registration(&vdev->vdev);
+ return PTR_ERR(priv->dmabuf);
+ }
+
+ kref_init(&priv->kref);
+ init_completion(&priv->comp);
+
+ /* dma_buf_put() now frees priv */
+ INIT_LIST_HEAD(&priv->dmabufs_elm);
+ down_write(&vdev->memory_lock);
+ dma_resv_lock(priv->dmabuf->resv, NULL);
+ priv->revoked = !__vfio_pci_memory_enabled(vdev);
+ list_add_tail(&priv->dmabufs_elm, &vdev->dmabufs);
+ dma_resv_unlock(priv->dmabuf->resv);
+ up_write(&vdev->memory_lock);
+
+ return 0;
+}
+
/*
* This is a temporary "private interconnect" between VFIO DMABUF and iommufd.
* It allows the two co-operating drivers to exchange the physical address of
@@ -351,7 +392,6 @@ int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
{
struct vfio_device_feature_dma_buf get_dma_buf = {};
struct vfio_region_dma_range *dma_ranges;
- DEFINE_DMA_BUF_EXPORT_INFO(exp_info);
struct vfio_pci_dma_buf *priv;
size_t length;
int ret;
@@ -411,34 +451,9 @@ int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
kfree(dma_ranges);
dma_ranges = NULL;
- if (!vfio_device_try_get_registration(&vdev->vdev)) {
- ret = -ENODEV;
+ ret = vfio_pci_dmabuf_export(vdev, priv, get_dma_buf.open_flags);
+ if (ret)
goto err_free_phys;
- }
-
- exp_info.ops = &vfio_pci_dmabuf_ops;
- exp_info.size = priv->size;
- exp_info.flags = get_dma_buf.open_flags;
- exp_info.priv = priv;
-
- priv->dmabuf = dma_buf_export(&exp_info);
- if (IS_ERR(priv->dmabuf)) {
- ret = PTR_ERR(priv->dmabuf);
- goto err_dev_put;
- }
-
- kref_init(&priv->kref);
- init_completion(&priv->comp);
-
- /* dma_buf_put() now frees priv */
- INIT_LIST_HEAD(&priv->dmabufs_elm);
- down_write(&vdev->memory_lock);
- dma_resv_lock(priv->dmabuf->resv, NULL);
- priv->revoked = !__vfio_pci_memory_enabled(vdev);
- list_add_tail(&priv->dmabufs_elm, &vdev->dmabufs);
- dma_resv_unlock(priv->dmabuf->resv);
- up_write(&vdev->memory_lock);
-
/*
* dma_buf_fd() consumes the reference, when the file closes the dmabuf
* will be released.
@@ -449,8 +464,6 @@ int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
return ret;
-err_dev_put:
- vfio_device_put_registration(&vdev->vdev);
err_free_phys:
kfree(priv->phys_vec);
err_free_priv:
@@ -460,6 +473,75 @@ int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
return ret;
}
+int vfio_pci_core_mmap_prep_dmabuf(struct vfio_pci_core_device *vdev,
+ struct vm_area_struct *vma,
+ u64 phys_start, u64 req_len,
+ unsigned int res_index)
+{
+ struct vfio_pci_dma_buf *priv;
+ unsigned long vma_pgoff = vma->vm_pgoff & (VFIO_PCI_OFFSET_MASK >> PAGE_SHIFT);
+ int ret;
+
+ priv = kzalloc_obj(*priv);
+ if (!priv)
+ return -ENOMEM;
+
+ priv->phys_vec = kzalloc_obj(*priv->phys_vec);
+ if (!priv->phys_vec) {
+ ret = -ENOMEM;
+ goto err_free_priv;
+ }
+
+ /*
+ * The DMABUF begins from the mmap()'s BAR offset, i.e. the
+ * start of the VMA corresponds to byte 0 of the DMABUF and
+ * byte (vma_pgoff << PAGE_SHIFT) of the BAR.
+ *
+ * vfio_pci_dma_buf_find_pfn() reverses this offset using
+ * vma_pgoff_adjust, so that ultimately a fault's offset from
+ * the start of the _VMA_ has a consistent usage whether the
+ * VMA originates from an mmap() of the VFIO device here or a
+ * direct DMABUF mmap().
+ */
+ priv->vdev = vdev;
+ priv->size = req_len;
+ priv->nr_ranges = 1;
+ priv->vma_pgoff_adjust = vma_pgoff;
+ priv->provider = pcim_p2pdma_provider(vdev->pdev, res_index);
+ if (!priv->provider) {
+ ret = -EINVAL;
+ goto err_free_phys;
+ }
+
+ priv->phys_vec[0].paddr = phys_start + ((u64)vma_pgoff << PAGE_SHIFT);
+ priv->phys_vec[0].len = priv->size;
+
+ ret = vfio_pci_dmabuf_export(vdev, priv, O_CLOEXEC | O_RDWR);
+ if (ret)
+ goto err_free_phys;
+
+ /*
+ * Ownership of the DMABUF file transfers to the VMA so that
+ * other users can locate the DMABUF via a VA. Ownership of
+ * the original VFIO device file being mmap()ed transfers to
+ * priv, and is put when the DMABUF is released. This
+ * intentionally does not use get_file()/vma_set_file()
+ * because the references are already held, and ownership
+ * moves.
+ */
+ priv->vfile = vma->vm_file;
+ vma->vm_file = priv->dmabuf->file;
+ vma->vm_private_data = priv;
+
+ return 0;
+
+err_free_phys:
+ kfree(priv->phys_vec);
+err_free_priv:
+ kfree(priv);
+ return ret;
+}
+
void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
{
struct vfio_pci_dma_buf *priv;
diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
index c8f6f959056a..06dc0fd3e230 100644
--- a/drivers/vfio/pci/vfio_pci_priv.h
+++ b/drivers/vfio/pci/vfio_pci_priv.h
@@ -30,6 +30,7 @@ struct vfio_pci_dma_buf {
size_t size;
struct phys_vec *phys_vec;
struct p2pdma_provider *provider;
+ struct file *vfile;
u32 nr_ranges;
struct kref kref;
struct completion comp;
@@ -133,6 +134,10 @@ int vfio_pci_dma_buf_find_pfn(struct vfio_pci_dma_buf *vpdmabuf,
unsigned long address,
unsigned int order,
unsigned long *out_pfn);
+int vfio_pci_core_mmap_prep_dmabuf(struct vfio_pci_core_device *vdev,
+ struct vm_area_struct *vma,
+ u64 phys_start, u64 req_len,
+ unsigned int res_index);
#ifdef CONFIG_VFIO_PCI_DMABUF
int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
--
2.50.1 (Apple Git-155)
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH v4 05/10] vfio/pci: Convert BAR mmap() to use a DMABUF
2026-07-01 17:12 [PATCH v4 00/10] vfio/pci: Add mmap() for DMABUFs Matt Evans
` (3 preceding siblings ...)
2026-07-01 17:12 ` [PATCH v4 04/10] vfio/pci: Add a helper to create a DMABUF for a BAR-map VMA Matt Evans
@ 2026-07-01 17:12 ` Matt Evans
2026-07-01 17:12 ` [PATCH v4 06/10] vfio/pci: Provide a user-facing name for BAR mappings Matt Evans
` (4 subsequent siblings)
9 siblings, 0 replies; 13+ messages in thread
From: Matt Evans @ 2026-07-01 17:12 UTC (permalink / raw)
To: Alex Williamson, Leon Romanovsky, Jason Gunthorpe, Alex Mastro,
Christian König, Bjorn Helgaas, Logan Gunthorpe, Kevin Tian,
Pranjal Shrivastava
Cc: Mahmoud Adam, David Matlack, Björn Töpel, Sumit Semwal,
Ankit Agrawal, Alistair Popple, Vivek Kasireddy, linux-kernel,
linux-media, dri-devel, linaro-mm-sig, kvm, linux-pci
Convert the VFIO device fd fops->mmap to create a DMABUF representing
the BAR mapping, and make the VMA fault handler look up PFNs from the
corresponding DMABUF. This supports future code mmap()ing BAR
DMABUFs, and iommufd work to support Type1 P2P.
First, vfio_pci_core_mmap() uses the new
vfio_pci_core_mmap_prep_dmabuf() helper to export a DMABUF
representing a single BAR range. Then, the vfio_pci_mmap_huge_fault()
callback is updated to understand revoked buffers, and uses the new
vfio_pci_dma_buf_find_pfn() helper to determine the PFN for a given
fault address.
Now that the VFIO DMABUFs can be mmap()ed, vfio_pci_dma_buf_move()
zaps PTEs (used on the revocation and cleanup paths).
CONFIG_VFIO_PCI_CORE now unconditionally depends on
CONFIG_DMA_SHARED_BUFFER and CONFIG_PCI_P2PDMA_CORE. The
CONFIG_VFIO_PCI_DMABUF feature conditionally includes support for
VFIO_DEVICE_FEATURE_DMA_BUF, depending on the availability of
CONFIG_PCI_P2PDMA.
Signed-off-by: Matt Evans <matt@ozlabs.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
---
drivers/vfio/pci/Kconfig | 5 ++-
drivers/vfio/pci/Makefile | 3 +-
drivers/vfio/pci/vfio_pci_core.c | 66 ++++++++++++++++++------------
drivers/vfio/pci/vfio_pci_dmabuf.c | 12 ++++++
drivers/vfio/pci/vfio_pci_priv.h | 11 +----
5 files changed, 58 insertions(+), 39 deletions(-)
diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index 296bf01e185e..67a2ae1fbc04 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -6,6 +6,8 @@ config VFIO_PCI_CORE
tristate
select VFIO_VIRQFD
select IRQ_BYPASS_MANAGER
+ select PCI_P2PDMA_CORE
+ select DMA_SHARED_BUFFER
config VFIO_PCI_INTX
def_bool y if !S390
@@ -56,7 +58,8 @@ config VFIO_PCI_ZDEV_KVM
To enable s390x KVM vfio-pci extensions, say Y.
config VFIO_PCI_DMABUF
- def_bool y if VFIO_PCI_CORE && PCI_P2PDMA && DMA_SHARED_BUFFER
+ def_bool y if PCI_P2PDMA
+ depends on VFIO_PCI_CORE
source "drivers/vfio/pci/mlx5/Kconfig"
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index 6138f1bf241d..881452ea89be 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -1,8 +1,7 @@
# SPDX-License-Identifier: GPL-2.0-only
-vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
+vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o vfio_pci_dmabuf.o
vfio-pci-core-$(CONFIG_VFIO_PCI_ZDEV_KVM) += vfio_pci_zdev.o
-vfio-pci-core-$(CONFIG_VFIO_PCI_DMABUF) += vfio_pci_dmabuf.o
obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o
vfio-pci-y := vfio_pci.o
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 050e7542952e..42ea8481935f 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1683,18 +1683,6 @@ void vfio_pci_memory_unlock_and_restore(struct vfio_pci_core_device *vdev, u16 c
up_write(&vdev->memory_lock);
}
-static unsigned long vma_to_pfn(struct vm_area_struct *vma)
-{
- struct vfio_pci_core_device *vdev = vma->vm_private_data;
- int index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
- u64 pgoff;
-
- pgoff = vma->vm_pgoff &
- ((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
-
- return (pci_resource_start(vdev->pdev, index) >> PAGE_SHIFT) + pgoff;
-}
-
vm_fault_t vfio_pci_vmf_insert_pfn(struct vfio_pci_core_device *vdev,
struct vm_fault *vmf,
unsigned long pfn,
@@ -1722,23 +1710,42 @@ static vm_fault_t vfio_pci_mmap_huge_fault(struct vm_fault *vmf,
unsigned int order)
{
struct vm_area_struct *vma = vmf->vma;
- struct vfio_pci_core_device *vdev = vma->vm_private_data;
- unsigned long addr = vmf->address & ~((PAGE_SIZE << order) - 1);
- unsigned long pgoff = (addr - vma->vm_start) >> PAGE_SHIFT;
- unsigned long pfn = vma_to_pfn(vma) + pgoff;
- vm_fault_t ret = VM_FAULT_FALLBACK;
-
- if (is_aligned_for_order(vma, addr, pfn, order)) {
- scoped_guard(rwsem_read, &vdev->memory_lock)
- ret = vfio_pci_vmf_insert_pfn(vdev, vmf, pfn, order);
+ struct vfio_pci_dma_buf *priv = vma->vm_private_data;
+ struct vfio_pci_core_device *vdev;
+ unsigned long pfn = 0;
+ vm_fault_t ret = VM_FAULT_SIGBUS;
+
+ /*
+ * We can rely on the existence of both a DMABUF (priv) and
+ * the VFIO device it was exported from (vdev). This fault's
+ * VMA was established using vfio_pci_core_mmap_prep_dmabuf()
+ * which transfers ownership of the VFIO device fd to the
+ * DMABUF, and so the VFIO device is held open because the
+ * VMA's vm_file (DMABUF) is open.
+ *
+ * Since vfio_pci_dma_buf_cleanup() cannot have happened,
+ * vdev must be valid; we can take memory_lock.
+ */
+ vdev = priv->vdev;
+
+ scoped_guard(rwsem_read, &vdev->memory_lock) {
+ if (!priv->revoked) {
+ int pres = vfio_pci_dma_buf_find_pfn(priv, vma,
+ vmf->address,
+ order, &pfn);
+
+ if (pres == 0)
+ ret = vfio_pci_vmf_insert_pfn(vdev, vmf,
+ pfn, order);
+ else if (pres == -EAGAIN)
+ ret = VM_FAULT_FALLBACK;
+ }
}
dev_dbg_ratelimited(&vdev->pdev->dev,
- "%s(,order = %d) BAR %ld page offset 0x%lx: 0x%x\n",
- __func__, order,
- vma->vm_pgoff >>
- (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT),
- pgoff, (unsigned int)ret);
+ "%s(order = %d) PFN 0x%lx, VA 0x%lx, pgoff 0x%lx: 0x%x\n",
+ __func__, order, pfn, vmf->address,
+ vma->vm_pgoff, (unsigned int)ret);
return ret;
}
@@ -1803,7 +1810,12 @@ int vfio_pci_core_mmap(struct vfio_device *core_vdev, struct vm_area_struct *vma
if (ret)
return ret;
- vma->vm_private_data = vdev;
+ ret = vfio_pci_core_mmap_prep_dmabuf(vdev, vma,
+ pci_resource_start(pdev, index),
+ req_len, index);
+ if (ret)
+ return ret;
+
vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);
diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c
index 9a80acc1ab19..ca8a04d4629a 100644
--- a/drivers/vfio/pci/vfio_pci_dmabuf.c
+++ b/drivers/vfio/pci/vfio_pci_dmabuf.c
@@ -9,6 +9,7 @@
MODULE_IMPORT_NS("DMA_BUF");
+#ifdef CONFIG_VFIO_PCI_DMABUF
static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf,
struct dma_buf_attachment *attachment)
{
@@ -25,6 +26,7 @@ static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf,
return 0;
}
+#endif /* CONFIG_VFIO_PCI_DMABUF */
static void vfio_pci_dma_buf_done(struct kref *kref)
{
@@ -89,7 +91,9 @@ static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf)
}
static const struct dma_buf_ops vfio_pci_dmabuf_ops = {
+#ifdef CONFIG_VFIO_PCI_DMABUF
.attach = vfio_pci_dma_buf_attach,
+#endif
.map_dma_buf = vfio_pci_dma_buf_map,
.unmap_dma_buf = vfio_pci_dma_buf_unmap,
.release = vfio_pci_dma_buf_release,
@@ -274,6 +278,7 @@ static int vfio_pci_dmabuf_export(struct vfio_pci_core_device *vdev,
return 0;
}
+#ifdef CONFIG_VFIO_PCI_DMABUF
/*
* This is a temporary "private interconnect" between VFIO DMABUF and iommufd.
* It allows the two co-operating drivers to exchange the physical address of
@@ -472,6 +477,7 @@ int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
kfree(dma_ranges);
return ret;
}
+#endif /* CONFIG_VFIO_PCI_DMABUF */
int vfio_pci_core_mmap_prep_dmabuf(struct vfio_pci_core_device *vdev,
struct vm_area_struct *vma,
@@ -547,6 +553,10 @@ void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
struct vfio_pci_dma_buf *priv;
struct vfio_pci_dma_buf *tmp;
+ /*
+ * Holding memory_lock ensures a racing VMA fault observes
+ * priv->revoked properly.
+ */
lockdep_assert_held_write(&vdev->memory_lock);
list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
@@ -565,6 +575,8 @@ void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
if (revoked) {
kref_put(&priv->kref, vfio_pci_dma_buf_done);
wait_for_completion(&priv->comp);
+ unmap_mapping_range(priv->dmabuf->file->f_mapping,
+ 0, priv->size, true);
/*
* Re-arm the registered kref reference and the
* completion so the post-revoke state matches the
diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
index 06dc0fd3e230..d38e1b98b2e9 100644
--- a/drivers/vfio/pci/vfio_pci_priv.h
+++ b/drivers/vfio/pci/vfio_pci_priv.h
@@ -138,13 +138,13 @@ int vfio_pci_core_mmap_prep_dmabuf(struct vfio_pci_core_device *vdev,
struct vm_area_struct *vma,
u64 phys_start, u64 req_len,
unsigned int res_index);
+void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev);
+void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked);
#ifdef CONFIG_VFIO_PCI_DMABUF
int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
struct vfio_device_feature_dma_buf __user *arg,
size_t argsz);
-void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev);
-void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked);
#else
static inline int
vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
@@ -153,13 +153,6 @@ vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
{
return -ENOTTY;
}
-static inline void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev)
-{
-}
-static inline void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev,
- bool revoked)
-{
-}
#endif
#endif
--
2.50.1 (Apple Git-155)
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH v4 06/10] vfio/pci: Provide a user-facing name for BAR mappings
2026-07-01 17:12 [PATCH v4 00/10] vfio/pci: Add mmap() for DMABUFs Matt Evans
` (4 preceding siblings ...)
2026-07-01 17:12 ` [PATCH v4 05/10] vfio/pci: Convert BAR mmap() to use a DMABUF Matt Evans
@ 2026-07-01 17:12 ` Matt Evans
2026-07-01 17:12 ` [PATCH v4 07/10] vfio/pci: Clean up BAR zap and revocation Matt Evans
` (3 subsequent siblings)
9 siblings, 0 replies; 13+ messages in thread
From: Matt Evans @ 2026-07-01 17:12 UTC (permalink / raw)
To: Alex Williamson, Leon Romanovsky, Jason Gunthorpe, Alex Mastro,
Christian König, Bjorn Helgaas, Logan Gunthorpe, Kevin Tian,
Pranjal Shrivastava
Cc: Mahmoud Adam, David Matlack, Björn Töpel, Sumit Semwal,
Ankit Agrawal, Alistair Popple, Vivek Kasireddy, linux-kernel,
linux-media, dri-devel, linaro-mm-sig, kvm, linux-pci
Since converting BAR mmap()s to using DMABUFs, we lose the original
device path in /proc/<pid>/maps, lsof, etc. Generate a debug-oriented
synthetic 'filename' based on the cdev, plus BDF, plus resource index.
This applies only to BAR mappings via the VFIO device fd, as
explicitly-exported DMABUFs are named by userspace via the
DMA_BUF_SET_NAME ioctl.
Signed-off-by: Matt Evans <matt@ozlabs.org>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
---
drivers/vfio/pci/vfio_pci_dmabuf.c | 22 ++++++++++++++++++++--
1 file changed, 20 insertions(+), 2 deletions(-)
diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c
index ca8a04d4629a..08c56e2803ad 100644
--- a/drivers/vfio/pci/vfio_pci_dmabuf.c
+++ b/drivers/vfio/pci/vfio_pci_dmabuf.c
@@ -4,6 +4,7 @@
#include <linux/dma-buf-mapping.h>
#include <linux/pci-p2pdma.h>
#include <linux/dma-resv.h>
+#include <uapi/linux/dma-buf.h>
#include "vfio_pci_priv.h"
@@ -486,6 +487,7 @@ int vfio_pci_core_mmap_prep_dmabuf(struct vfio_pci_core_device *vdev,
{
struct vfio_pci_dma_buf *priv;
unsigned long vma_pgoff = vma->vm_pgoff & (VFIO_PCI_OFFSET_MASK >> PAGE_SHIFT);
+ char *bufname;
int ret;
priv = kzalloc_obj(*priv);
@@ -498,6 +500,15 @@ int vfio_pci_core_mmap_prep_dmabuf(struct vfio_pci_core_device *vdev,
goto err_free_priv;
}
+ bufname = kasprintf(GFP_KERNEL, "%s:%s/%x",
+ dev_name(&vdev->vdev.device), pci_name(vdev->pdev),
+ res_index);
+
+ if (!bufname) {
+ ret = -ENOMEM;
+ goto err_free_phys;
+ }
+
/*
* The DMABUF begins from the mmap()'s BAR offset, i.e. the
* start of the VMA corresponds to byte 0 of the DMABUF and
@@ -516,7 +527,7 @@ int vfio_pci_core_mmap_prep_dmabuf(struct vfio_pci_core_device *vdev,
priv->provider = pcim_p2pdma_provider(vdev->pdev, res_index);
if (!priv->provider) {
ret = -EINVAL;
- goto err_free_phys;
+ goto err_free_name;
}
priv->phys_vec[0].paddr = phys_start + ((u64)vma_pgoff << PAGE_SHIFT);
@@ -524,7 +535,7 @@ int vfio_pci_core_mmap_prep_dmabuf(struct vfio_pci_core_device *vdev,
ret = vfio_pci_dmabuf_export(vdev, priv, O_CLOEXEC | O_RDWR);
if (ret)
- goto err_free_phys;
+ goto err_free_name;
/*
* Ownership of the DMABUF file transfers to the VMA so that
@@ -539,8 +550,15 @@ int vfio_pci_core_mmap_prep_dmabuf(struct vfio_pci_core_device *vdev,
vma->vm_file = priv->dmabuf->file;
vma->vm_private_data = priv;
+ spin_lock(&priv->dmabuf->name_lock);
+ kfree(priv->dmabuf->name);
+ priv->dmabuf->name = bufname;
+ spin_unlock(&priv->dmabuf->name_lock);
+
return 0;
+err_free_name:
+ kfree(bufname);
err_free_phys:
kfree(priv->phys_vec);
err_free_priv:
--
2.50.1 (Apple Git-155)
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH v4 07/10] vfio/pci: Clean up BAR zap and revocation
2026-07-01 17:12 [PATCH v4 00/10] vfio/pci: Add mmap() for DMABUFs Matt Evans
` (5 preceding siblings ...)
2026-07-01 17:12 ` [PATCH v4 06/10] vfio/pci: Provide a user-facing name for BAR mappings Matt Evans
@ 2026-07-01 17:12 ` Matt Evans
2026-07-01 17:12 ` [PATCH v4 08/10] vfio/pci: Support mmap() of a VFIO DMABUF Matt Evans
` (2 subsequent siblings)
9 siblings, 0 replies; 13+ messages in thread
From: Matt Evans @ 2026-07-01 17:12 UTC (permalink / raw)
To: Alex Williamson, Leon Romanovsky, Jason Gunthorpe, Alex Mastro,
Christian König, Bjorn Helgaas, Logan Gunthorpe, Kevin Tian,
Pranjal Shrivastava
Cc: Mahmoud Adam, David Matlack, Björn Töpel, Sumit Semwal,
Ankit Agrawal, Alistair Popple, Vivek Kasireddy, linux-kernel,
linux-media, dri-devel, linaro-mm-sig, kvm, linux-pci
Previously, vfio_pci_zap_bars() (and the wrapper
vfio_pci_zap_and_down_write_memory_lock()) calls were paired with
calls to vfio_pci_dma_buf_move().
This commit replaces them with a unified new function,
vfio_pci_revoke_bars() containing both the vfio_pci_dma_buf_move() and
the unmap_mapping_range(), making it harder for callers to omit one.
It adds a wrapper, vfio_pci_lock_revoke_bars(), which takes the write
memory_lock before zapping, and adds a new vfio_pci_unrevoke_bars()
for the re-enable path.
As of "vfio/pci: Convert BAR mmap() to use a DMABUF", the zap via
unmap_mapping_range() is no longer performed for vfio-pci since the
DMABUFs used for BAR mappings already zap PTEs when the
vfio_pci_dma_buf_move() occurs.
However, it must be assumed that VFIO drivers which override the .mmap
op could create mappings _not_ backed by DMABUFs. So, the zap is
still performed on revoke if .mmap is overridden, using a new
zap_bars_on_revoke flag. A driver can explicitly opt out; the flag is
cleared by the hisi_acc_vfio_pci driver, since its .mmap just wraps
vfio_pci_core_mmap() and so still uses DMABUFs.
Signed-off-by: Matt Evans <matt@ozlabs.org>
---
.../vfio/pci/hisilicon/hisi_acc_vfio_pci.c | 8 ++
drivers/vfio/pci/vfio_pci_config.c | 30 +++-----
drivers/vfio/pci/vfio_pci_core.c | 74 ++++++++++++++-----
drivers/vfio/pci/vfio_pci_priv.h | 3 +-
include/linux/vfio_pci_core.h | 1 +
5 files changed, 77 insertions(+), 39 deletions(-)
diff --git a/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c b/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c
index bb121f635b9f..d9f26b0ce991 100644
--- a/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c
+++ b/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c
@@ -1725,6 +1725,14 @@ static int hisi_acc_vfio_pci_probe(struct pci_dev *pdev, const struct pci_device
if (ret)
goto out_put_vdev;
+ /*
+ * hisi_acc_vfio_pci_mmap() calls down to
+ * vfio_pci_core_mmap(), so BAR mappings are still
+ * DMABUF-backed. They don't require a zap on revoke, so opt
+ * out:
+ */
+ hisi_acc_vdev->core_device.zap_bars_on_revoke = false;
+
hisi_acc_vfio_debug_init(hisi_acc_vdev);
return 0;
diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index a10ed733f0e3..bc6ccb2e135c 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -590,12 +590,10 @@ static int vfio_basic_config_write(struct vfio_pci_core_device *vdev, int pos,
virt_mem = !!(le16_to_cpu(*virt_cmd) & PCI_COMMAND_MEMORY);
new_mem = !!(new_cmd & PCI_COMMAND_MEMORY);
- if (!new_mem) {
- vfio_pci_zap_and_down_write_memory_lock(vdev);
- vfio_pci_dma_buf_move(vdev, true);
- } else {
+ if (!new_mem)
+ vfio_pci_lock_revoke_bars(vdev);
+ else
down_write(&vdev->memory_lock);
- }
/*
* If the user is writing mem/io enable (new_mem/io) and we
@@ -631,7 +629,7 @@ static int vfio_basic_config_write(struct vfio_pci_core_device *vdev, int pos,
*virt_cmd |= cpu_to_le16(new_cmd & mask);
if (__vfio_pci_memory_enabled(vdev))
- vfio_pci_dma_buf_move(vdev, false);
+ vfio_pci_unrevoke_bars(vdev);
up_write(&vdev->memory_lock);
}
@@ -712,16 +710,14 @@ static int __init init_pci_cap_basic_perm(struct perm_bits *perm)
static void vfio_lock_and_set_power_state(struct vfio_pci_core_device *vdev,
pci_power_t state)
{
- if (state >= PCI_D3hot) {
- vfio_pci_zap_and_down_write_memory_lock(vdev);
- vfio_pci_dma_buf_move(vdev, true);
- } else {
+ if (state >= PCI_D3hot)
+ vfio_pci_lock_revoke_bars(vdev);
+ else
down_write(&vdev->memory_lock);
- }
vfio_pci_set_power_state(vdev, state);
if (__vfio_pci_memory_enabled(vdev))
- vfio_pci_dma_buf_move(vdev, false);
+ vfio_pci_unrevoke_bars(vdev);
up_write(&vdev->memory_lock);
}
@@ -908,11 +904,10 @@ static int vfio_exp_config_write(struct vfio_pci_core_device *vdev, int pos,
&cap);
if (!ret && (cap & PCI_EXP_DEVCAP_FLR)) {
- vfio_pci_zap_and_down_write_memory_lock(vdev);
- vfio_pci_dma_buf_move(vdev, true);
+ vfio_pci_lock_revoke_bars(vdev);
pci_try_reset_function(vdev->pdev);
if (__vfio_pci_memory_enabled(vdev))
- vfio_pci_dma_buf_move(vdev, false);
+ vfio_pci_unrevoke_bars(vdev);
up_write(&vdev->memory_lock);
}
}
@@ -993,11 +988,10 @@ static int vfio_af_config_write(struct vfio_pci_core_device *vdev, int pos,
&cap);
if (!ret && (cap & PCI_AF_CAP_FLR) && (cap & PCI_AF_CAP_TP)) {
- vfio_pci_zap_and_down_write_memory_lock(vdev);
- vfio_pci_dma_buf_move(vdev, true);
+ vfio_pci_lock_revoke_bars(vdev);
pci_try_reset_function(vdev->pdev);
if (__vfio_pci_memory_enabled(vdev))
- vfio_pci_dma_buf_move(vdev, false);
+ vfio_pci_unrevoke_bars(vdev);
up_write(&vdev->memory_lock);
}
}
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 42ea8481935f..d8cd7d752781 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -319,8 +319,7 @@ static int vfio_pci_runtime_pm_entry(struct vfio_pci_core_device *vdev,
* The vdev power related flags are protected with 'memory_lock'
* semaphore.
*/
- vfio_pci_zap_and_down_write_memory_lock(vdev);
- vfio_pci_dma_buf_move(vdev, true);
+ vfio_pci_lock_revoke_bars(vdev);
if (vdev->pm_runtime_engaged) {
up_write(&vdev->memory_lock);
@@ -406,7 +405,7 @@ static void vfio_pci_runtime_pm_exit(struct vfio_pci_core_device *vdev)
down_write(&vdev->memory_lock);
__vfio_pci_runtime_pm_exit(vdev);
if (__vfio_pci_memory_enabled(vdev))
- vfio_pci_dma_buf_move(vdev, false);
+ vfio_pci_unrevoke_bars(vdev);
up_write(&vdev->memory_lock);
}
@@ -1256,6 +1255,8 @@ static int vfio_pci_ioctl_set_irqs(struct vfio_pci_core_device *vdev,
return ret;
}
+static void vfio_pci_revoke_bars(struct vfio_pci_core_device *vdev);
+
static int vfio_pci_ioctl_reset(struct vfio_pci_core_device *vdev,
void __user *arg)
{
@@ -1264,7 +1265,7 @@ static int vfio_pci_ioctl_reset(struct vfio_pci_core_device *vdev,
if (!vdev->reset_works)
return -EINVAL;
- vfio_pci_zap_and_down_write_memory_lock(vdev);
+ down_write(&vdev->memory_lock);
/*
* This function can be invoked while the power state is non-D0. If
@@ -1274,13 +1275,18 @@ static int vfio_pci_ioctl_reset(struct vfio_pci_core_device *vdev,
* have NoSoftRst-, the reset function can cause the PCI config space
* reset without restoring the original state (saved locally in
* 'vdev->pm_save').
+ *
+ * The zap is done after making the device accessible in D0,
+ * because a DMABUF importer could access the device as part
+ * of its revocation cleanup.
*/
vfio_pci_set_power_state(vdev, PCI_D0);
- vfio_pci_dma_buf_move(vdev, true);
+ vfio_pci_revoke_bars(vdev);
+
ret = pci_try_reset_function(vdev->pdev);
if (__vfio_pci_memory_enabled(vdev))
- vfio_pci_dma_buf_move(vdev, false);
+ vfio_pci_unrevoke_bars(vdev);
up_write(&vdev->memory_lock);
return ret;
@@ -1648,20 +1654,37 @@ ssize_t vfio_pci_core_write(struct vfio_device *core_vdev, const char __user *bu
}
EXPORT_SYMBOL_GPL(vfio_pci_core_write);
-static void vfio_pci_zap_bars(struct vfio_pci_core_device *vdev)
+static void vfio_pci_revoke_bars(struct vfio_pci_core_device *vdev)
{
- struct vfio_device *core_vdev = &vdev->vdev;
- loff_t start = VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_BAR0_REGION_INDEX);
- loff_t end = VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_ROM_REGION_INDEX);
- loff_t len = end - start;
+ lockdep_assert_held_write(&vdev->memory_lock);
+ vfio_pci_dma_buf_move(vdev, true);
- unmap_mapping_range(core_vdev->inode->i_mapping, start, len, true);
+ /*
+ * If a driver could possibly create BAR mappings in the
+ * vdev's address_space, do an additional zap on revoke. See
+ * vfio_pci_core_register_device().
+ */
+ if (vdev->zap_bars_on_revoke) {
+ struct vfio_device *core_vdev = &vdev->vdev;
+ loff_t start = VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_BAR0_REGION_INDEX);
+ loff_t end = VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_ROM_REGION_INDEX);
+ loff_t len = end - start;
+
+ unmap_mapping_range(core_vdev->inode->i_mapping,
+ start, len, true);
+ }
}
-void vfio_pci_zap_and_down_write_memory_lock(struct vfio_pci_core_device *vdev)
+void vfio_pci_lock_revoke_bars(struct vfio_pci_core_device *vdev)
{
down_write(&vdev->memory_lock);
- vfio_pci_zap_bars(vdev);
+ vfio_pci_revoke_bars(vdev);
+}
+
+void vfio_pci_unrevoke_bars(struct vfio_pci_core_device *vdev)
+{
+ lockdep_assert_held_write(&vdev->memory_lock);
+ vfio_pci_dma_buf_move(vdev, false);
}
u16 vfio_pci_memory_lock_and_enable(struct vfio_pci_core_device *vdev)
@@ -2234,6 +2257,17 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
if (ret)
goto out_vf;
+ /*
+ * If a driver overrides .mmap, it has to be assumed that it
+ * might not use the DMABUF-backed core mmap; this flag
+ * enables a zap at revoke time. Drivers can opt out by
+ * clearing this flag after registration, e.g. if their .mmap
+ * override calls down to vfio_pci_core_mmap() so that maps
+ * are DMABUF-backed.
+ */
+ if (vdev->vdev.ops->mmap != vfio_pci_core_mmap)
+ vdev->zap_bars_on_revoke = true;
+
vfio_pci_probe_power_state(vdev);
/*
@@ -2509,9 +2543,10 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
}
/*
- * Take the memory write lock for each device and zap BAR
- * mappings to prevent the user accessing the device while in
- * reset. Locking multiple devices is prone to deadlock,
+ * Take the memory write lock for each device and
+ * zap/revoke BAR mappings to prevent the user (or
+ * peers) accessing the device while in reset.
+ * Locking multiple devices is prone to deadlock,
* runaway and unwind if we hit contention.
*/
if (!down_write_trylock(&vdev->memory_lock)) {
@@ -2519,8 +2554,7 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
break;
}
- vfio_pci_dma_buf_move(vdev, true);
- vfio_pci_zap_bars(vdev);
+ vfio_pci_revoke_bars(vdev);
}
if (!list_entry_is_head(vdev,
@@ -2550,7 +2584,7 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
list_for_each_entry_from_reverse(vdev, &dev_set->device_list,
vdev.dev_set_list) {
if (vdev->vdev.open_count && __vfio_pci_memory_enabled(vdev))
- vfio_pci_dma_buf_move(vdev, false);
+ vfio_pci_unrevoke_bars(vdev);
up_write(&vdev->memory_lock);
}
diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
index d38e1b98b2e9..a8671d63b9ca 100644
--- a/drivers/vfio/pci/vfio_pci_priv.h
+++ b/drivers/vfio/pci/vfio_pci_priv.h
@@ -83,7 +83,8 @@ void vfio_config_free(struct vfio_pci_core_device *vdev);
int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev,
pci_power_t state);
-void vfio_pci_zap_and_down_write_memory_lock(struct vfio_pci_core_device *vdev);
+void vfio_pci_lock_revoke_bars(struct vfio_pci_core_device *vdev);
+void vfio_pci_unrevoke_bars(struct vfio_pci_core_device *vdev);
u16 vfio_pci_memory_lock_and_enable(struct vfio_pci_core_device *vdev);
void vfio_pci_memory_unlock_and_restore(struct vfio_pci_core_device *vdev,
u16 cmd);
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index 89165b769e5c..a2f586e4fcd8 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -127,6 +127,7 @@ struct vfio_pci_core_device {
bool needs_pm_restore:1;
bool pm_intx_masked:1;
bool pm_runtime_engaged:1;
+ bool zap_bars_on_revoke:1;
struct pci_saved_state *pci_saved_state;
struct pci_saved_state *pm_save;
int ioeventfds_nr;
--
2.50.1 (Apple Git-155)
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH v4 08/10] vfio/pci: Support mmap() of a VFIO DMABUF
2026-07-01 17:12 [PATCH v4 00/10] vfio/pci: Add mmap() for DMABUFs Matt Evans
` (6 preceding siblings ...)
2026-07-01 17:12 ` [PATCH v4 07/10] vfio/pci: Clean up BAR zap and revocation Matt Evans
@ 2026-07-01 17:12 ` Matt Evans
2026-07-01 17:12 ` [PATCH v4 09/10] vfio/pci: Permanently revoke a DMABUF on request Matt Evans
2026-07-01 17:12 ` [PATCH v4 10/10] vfio/pci: Add mmap() attributes to DMABUF feature Matt Evans
9 siblings, 0 replies; 13+ messages in thread
From: Matt Evans @ 2026-07-01 17:12 UTC (permalink / raw)
To: Alex Williamson, Leon Romanovsky, Jason Gunthorpe, Alex Mastro,
Christian König, Bjorn Helgaas, Logan Gunthorpe, Kevin Tian,
Pranjal Shrivastava
Cc: Mahmoud Adam, David Matlack, Björn Töpel, Sumit Semwal,
Ankit Agrawal, Alistair Popple, Vivek Kasireddy, linux-kernel,
linux-media, dri-devel, linaro-mm-sig, kvm, linux-pci
A VFIO DMABUF can export a subset of a BAR to userspace by fd; add
support for mmap() of this fd. This provides another route for a
process to map BARs, in which the process can only map a specific
subset of a BAR represented by the exported DMABUF.
mmap() support enables userspace driver designs that safely delegate
access to BAR sub-ranges to other client processes by sharing a DMABUF
fd, without having to share the (omnipotent) VFIO device fd with them.
Since the main VFIO BAR mmap() is now DMABUF-aware, the new mmap() reuses
the existing vm_ops.
The lifecycle of an exported DMABUF remains decoupled from that of the
device fd it came from, i.e. the device fd could be closed with DMABUF
VMAs present, meaning a fault on a VMA could happen concurrently with
vfio_pci_dma_buf_cleanup().
To deal with this scenario, the fault handler now temporarily takes a
VFIO device registration to ensure the vdev remains valid, and then
vdev->memory_lock can be taken on it.
Signed-off-by: Matt Evans <matt@ozlabs.org>
---
drivers/vfio/pci/vfio_pci_core.c | 83 +++++++++++++++++++++++++++---
drivers/vfio/pci/vfio_pci_dmabuf.c | 34 ++++++++++++
drivers/vfio/pci/vfio_pci_priv.h | 1 +
3 files changed, 110 insertions(+), 8 deletions(-)
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index d8cd7d752781..cd15b934912b 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -12,6 +12,8 @@
#include <linux/aperture.h>
#include <linux/device.h>
+#include <linux/dma-buf.h>
+#include <linux/dma-resv.h>
#include <linux/eventfd.h>
#include <linux/file.h>
#include <linux/interrupt.h>
@@ -1739,19 +1741,78 @@ static vm_fault_t vfio_pci_mmap_huge_fault(struct vm_fault *vmf,
vm_fault_t ret = VM_FAULT_SIGBUS;
/*
- * We can rely on the existence of both a DMABUF (priv) and
- * the VFIO device it was exported from (vdev). This fault's
- * VMA was established using vfio_pci_core_mmap_prep_dmabuf()
- * which transfers ownership of the VFIO device fd to the
- * DMABUF, and so the VFIO device is held open because the
- * VMA's vm_file (DMABUF) is open.
+ * The only thing this can rely on is that the DMABUF relating
+ * to the VMA's vm_file exists (priv).
*
- * Since vfio_pci_dma_buf_cleanup() cannot have happened,
- * vdev must be valid; we can take memory_lock.
+ * A DMABUF for a VFIO device fd mmap() holds a reference to
+ * the original VFIO device fd, but an explicitly-exported
+ * DMABUF does not. The original fd might have closed,
+ * meaning this fault can race with
+ * vfio_pci_dma_buf_cleanup(), meaning the buffer could have
+ * been revoked (in which case priv->vdev might be NULL), and
+ * the VFIO device registration might have been dropped.
+ *
+ * With the goal of taking vdev->memory_lock in a world where
+ * vdev might not still exist:
+ *
+ * 1. Take the resv lock on the DMABUF:
+ * - If racing cleanup got in first, the buffer is revoked;
+ * stop/exit if so.
+ * - If we got in first, the buffer is not revoked so vdev is
+ * non-NULL, accessible, and cleanup _has not yet put the
+ * VFIO device registration_. So, the device refcount must
+ * be >0.
+ *
+ * 2. Take vfio_device registration (refcount guaranteed >0
+ * hereafter).
+ *
+ * 3. Unlock the DMABUF's resv lock:
+ * - A racing cleanup can now complete.
+ * - But, the device refcount >0, meaning the vfio_device
+ * (and vfio_pcie_core device vdev) have not yet been
+ * freed. vdev is accessible, even if the DMABUF has been
+ * revoked or cleanup has happened, because
+ * vfio_unregister_group_dev() can't complete.
+ *
+ * 4. Take the vdev->memory_lock
+ * - Either the DMABUF is usable, or has been cleaned up.
+ * Whichever, it can no longer change under us.
+ * - Test the DMABUF revocation status again: if it was
+ * revoked between 1 and 4 return a SIGBUS. Otherwise,
+ * return a PFN.
+ * - It's not necessary to also take the resv lock, because
+ * the status/vdev can't change while memory_lock is held.
+ *
+ * 5. Unlock, done.
*/
+
+ dma_resv_lock(priv->dmabuf->resv, NULL);
+
+ if (priv->revoked) {
+ pr_debug_ratelimited("%s VA 0x%lx, pgoff 0x%lx: DMABUF revoked/cleaned up\n",
+ __func__, vmf->address, vma->vm_pgoff);
+ dma_resv_unlock(priv->dmabuf->resv);
+ return VM_FAULT_SIGBUS;
+ }
+
+ /* If the buffer isn't revoked, vdev is valid */
vdev = priv->vdev;
+ if (!vfio_device_try_get_registration(&vdev->vdev)) {
+ /*
+ * If vdev != NULL (above), the registration should
+ * already be >0 and so this try_get should never
+ * fail.
+ */
+ dev_warn(&vdev->pdev->dev, "%s: Unexpected registration failure\n",
+ __func__);
+ dma_resv_unlock(priv->dmabuf->resv);
+ return VM_FAULT_SIGBUS;
+ }
+ dma_resv_unlock(priv->dmabuf->resv);
+
scoped_guard(rwsem_read, &vdev->memory_lock) {
+ /* Revocation status must be re-read, under memory_lock */
if (!priv->revoked) {
int pres = vfio_pci_dma_buf_find_pfn(priv, vma,
vmf->address,
@@ -1770,6 +1831,7 @@ static vm_fault_t vfio_pci_mmap_huge_fault(struct vm_fault *vmf,
__func__, order, pfn, vmf->address,
vma->vm_pgoff, (unsigned int)ret);
+ vfio_device_put_registration(&vdev->vdev);
return ret;
}
@@ -1785,6 +1847,11 @@ static const struct vm_operations_struct vfio_pci_mmap_ops = {
#endif
};
+void vfio_pci_set_vma_ops(struct vm_area_struct *vma)
+{
+ vma->vm_ops = &vfio_pci_mmap_ops;
+}
+
int vfio_pci_core_mmap(struct vfio_device *core_vdev, struct vm_area_struct *vma)
{
struct vfio_pci_core_device *vdev =
diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c
index 08c56e2803ad..50b713249341 100644
--- a/drivers/vfio/pci/vfio_pci_dmabuf.c
+++ b/drivers/vfio/pci/vfio_pci_dmabuf.c
@@ -27,6 +27,39 @@ static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf,
return 0;
}
+
+static int vfio_pci_dma_buf_mmap(struct dma_buf *dmabuf, struct vm_area_struct *vma)
+{
+ struct vfio_pci_dma_buf *priv = dmabuf->priv;
+
+ /*
+ * dma_buf_mmap_internal() has asserted that the VMA is
+ * contained within the DMABUF size before calling this.
+ *
+ * Also, if we observe that the buffer is revoked now then
+ * refuse the mmap(). This is a belt-and-braces early failure
+ * to ease debugging a revoked buffer being used. Userspace
+ * might also race an mmap() against an explicit revocation,
+ * or an action doing a temporary revoke; race scenarios are
+ * still safe because the fault handler ultimately prevents
+ * access to a revoked buffer if it isn't caught here.
+ */
+ if (READ_ONCE(priv->revoked))
+ return -ENODEV;
+ if ((vma->vm_flags & VM_SHARED) == 0)
+ return -EINVAL;
+
+ vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+ vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);
+
+ /* See comments in vfio_pci_core_mmap() re VM_ALLOW_ANY_UNCACHED. */
+ vm_flags_set(vma, VM_ALLOW_ANY_UNCACHED | VM_IO | VM_PFNMAP |
+ VM_DONTEXPAND | VM_DONTDUMP);
+ vma->vm_private_data = priv;
+ vfio_pci_set_vma_ops(vma);
+
+ return 0;
+}
#endif /* CONFIG_VFIO_PCI_DMABUF */
static void vfio_pci_dma_buf_done(struct kref *kref)
@@ -94,6 +127,7 @@ static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf)
static const struct dma_buf_ops vfio_pci_dmabuf_ops = {
#ifdef CONFIG_VFIO_PCI_DMABUF
.attach = vfio_pci_dma_buf_attach,
+ .mmap = vfio_pci_dma_buf_mmap,
#endif
.map_dma_buf = vfio_pci_dma_buf_map,
.unmap_dma_buf = vfio_pci_dma_buf_unmap,
diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
index a8671d63b9ca..c3fa35381679 100644
--- a/drivers/vfio/pci/vfio_pci_priv.h
+++ b/drivers/vfio/pci/vfio_pci_priv.h
@@ -141,6 +141,7 @@ int vfio_pci_core_mmap_prep_dmabuf(struct vfio_pci_core_device *vdev,
unsigned int res_index);
void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev);
void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked);
+void vfio_pci_set_vma_ops(struct vm_area_struct *vma);
#ifdef CONFIG_VFIO_PCI_DMABUF
int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
--
2.50.1 (Apple Git-155)
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH v4 09/10] vfio/pci: Permanently revoke a DMABUF on request
2026-07-01 17:12 [PATCH v4 00/10] vfio/pci: Add mmap() for DMABUFs Matt Evans
` (7 preceding siblings ...)
2026-07-01 17:12 ` [PATCH v4 08/10] vfio/pci: Support mmap() of a VFIO DMABUF Matt Evans
@ 2026-07-01 17:12 ` Matt Evans
2026-07-01 17:12 ` [PATCH v4 10/10] vfio/pci: Add mmap() attributes to DMABUF feature Matt Evans
9 siblings, 0 replies; 13+ messages in thread
From: Matt Evans @ 2026-07-01 17:12 UTC (permalink / raw)
To: Alex Williamson, Leon Romanovsky, Jason Gunthorpe, Alex Mastro,
Christian König, Bjorn Helgaas, Logan Gunthorpe, Kevin Tian,
Pranjal Shrivastava
Cc: Mahmoud Adam, David Matlack, Björn Töpel, Sumit Semwal,
Ankit Agrawal, Alistair Popple, Vivek Kasireddy, linux-kernel,
linux-media, dri-devel, linaro-mm-sig, kvm, linux-pci
Expand the VFIO DMABUF revocation state to three states:
Not revoked, temporarily revoked, and permanently revoked.
The first two are for existing transient revocation, e.g. across a
function reset, and the DMABUF is put into the last in response to a
new VFIO feature VFIO_DEVICE_FEATURE_DMA_BUF_REVOKE.
VFIO_DEVICE_FEATURE_DMA_BUF_REVOKE passes a DMABUF by fd and requests
that the DMABUF is permanently revoked. On success, it's guaranteed
that the buffer can never be imported/attached/mmap()ed in future,
that dynamic imports have been cleanly detached, and that all mappings
have been made inaccessible/PTEs zapped.
This is useful for lifecycle management, to reclaim VFIO PCI BAR
ranges previously delegated to a subordinate client process: by
revoking, the driver process can ensure that the loaned resources are
made inaccessible when the client is deemed "done". The original
DMABUF is defunct, and BAR resources can then be safely re-exported
for use by new clients.
Refactor the revocation code out of vfio_pci_dma_buf_move() to a
function common to move and the new feature request path. Note: this
now only calls dma_buf_invalidate_mappings()/dma_resv_wait_timeout()
on the revoke path, whereas vfio_pci_dma_buf_move() originally called
them for both revoke and (unnecessarily) un-revoke.
Signed-off-by: Matt Evans <matt@ozlabs.org>
---
drivers/vfio/pci/vfio_pci_core.c | 6 +-
drivers/vfio/pci/vfio_pci_dmabuf.c | 164 ++++++++++++++++++++++-------
drivers/vfio/pci/vfio_pci_priv.h | 19 +++-
include/uapi/linux/vfio.h | 20 ++++
4 files changed, 168 insertions(+), 41 deletions(-)
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index cd15b934912b..ebdf9ec30517 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1577,6 +1577,8 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
return vfio_pci_core_feature_token(vdev, flags, arg, argsz);
case VFIO_DEVICE_FEATURE_DMA_BUF:
return vfio_pci_core_feature_dma_buf(vdev, flags, arg, argsz);
+ case VFIO_DEVICE_FEATURE_DMA_BUF_REVOKE:
+ return vfio_pci_core_feature_dma_buf_revoke(vdev, flags, arg, argsz);
default:
return -ENOTTY;
}
@@ -1788,7 +1790,7 @@ static vm_fault_t vfio_pci_mmap_huge_fault(struct vm_fault *vmf,
dma_resv_lock(priv->dmabuf->resv, NULL);
- if (priv->revoked) {
+ if (priv->status != VFIO_PCI_DMABUF_OK) {
pr_debug_ratelimited("%s VA 0x%lx, pgoff 0x%lx: DMABUF revoked/cleaned up\n",
__func__, vmf->address, vma->vm_pgoff);
dma_resv_unlock(priv->dmabuf->resv);
@@ -1813,7 +1815,7 @@ static vm_fault_t vfio_pci_mmap_huge_fault(struct vm_fault *vmf,
scoped_guard(rwsem_read, &vdev->memory_lock) {
/* Revocation status must be re-read, under memory_lock */
- if (!priv->revoked) {
+ if (priv->status == VFIO_PCI_DMABUF_OK) {
int pres = vfio_pci_dma_buf_find_pfn(priv, vma,
vmf->address,
order, &pfn);
diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c
index 50b713249341..cfca820b767a 100644
--- a/drivers/vfio/pci/vfio_pci_dmabuf.c
+++ b/drivers/vfio/pci/vfio_pci_dmabuf.c
@@ -19,7 +19,7 @@ static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf,
if (!attachment->peer2peer)
return -EOPNOTSUPP;
- if (priv->revoked)
+ if (priv->status != VFIO_PCI_DMABUF_OK)
return -ENODEV;
if (!dma_buf_attach_revocable(attachment))
@@ -44,7 +44,7 @@ static int vfio_pci_dma_buf_mmap(struct dma_buf *dmabuf, struct vm_area_struct *
* still safe because the fault handler ultimately prevents
* access to a revoked buffer if it isn't caught here.
*/
- if (READ_ONCE(priv->revoked))
+ if (READ_ONCE(priv->status) != VFIO_PCI_DMABUF_OK)
return -ENODEV;
if ((vma->vm_flags & VM_SHARED) == 0)
return -EINVAL;
@@ -79,7 +79,7 @@ vfio_pci_dma_buf_map(struct dma_buf_attachment *attachment,
dma_resv_assert_held(priv->dmabuf->resv);
- if (priv->revoked)
+ if (priv->status != VFIO_PCI_DMABUF_OK)
return ERR_PTR(-ENODEV);
ret = dma_buf_phys_vec_to_sgt(attachment, priv->provider,
@@ -305,7 +305,8 @@ static int vfio_pci_dmabuf_export(struct vfio_pci_core_device *vdev,
INIT_LIST_HEAD(&priv->dmabufs_elm);
down_write(&vdev->memory_lock);
dma_resv_lock(priv->dmabuf->resv, NULL);
- priv->revoked = !__vfio_pci_memory_enabled(vdev);
+ priv->status = __vfio_pci_memory_enabled(vdev) ? VFIO_PCI_DMABUF_OK :
+ VFIO_PCI_DMABUF_TEMP_REVOKED;
list_add_tail(&priv->dmabufs_elm, &vdev->dmabufs);
dma_resv_unlock(priv->dmabuf->resv);
up_write(&vdev->memory_lock);
@@ -336,7 +337,7 @@ int vfio_pci_dma_buf_iommufd_map(struct dma_buf_attachment *attachment,
return -EOPNOTSUPP;
priv = attachment->dmabuf->priv;
- if (priv->revoked)
+ if (priv->status != VFIO_PCI_DMABUF_OK)
return -ENODEV;
/* More than one range to iommufd will require proper DMABUF support */
@@ -600,6 +601,59 @@ int vfio_pci_core_mmap_prep_dmabuf(struct vfio_pci_core_device *vdev,
return ret;
}
+/* Set the DMABUF's revocation status (OK or temporarily/permanently revoked) */
+static void vfio_pci_dma_buf_set_status(struct vfio_pci_dma_buf *priv,
+ enum vfio_pci_dma_buf_status new_status)
+{
+ bool was_revoked;
+
+ lockdep_assert_held_write(&priv->vdev->memory_lock);
+
+ if (priv->status == VFIO_PCI_DMABUF_PERM_REVOKED ||
+ priv->status == new_status)
+ return;
+
+ dma_resv_lock(priv->dmabuf->resv, NULL);
+ was_revoked = (priv->status == VFIO_PCI_DMABUF_TEMP_REVOKED);
+
+ if (new_status != VFIO_PCI_DMABUF_OK) {
+ priv->status = new_status; /* Temp or permanently revoked */
+
+ if (was_revoked) {
+ /*
+ * TEMP_REVOKED is being upgraded to
+ * PERM_REVOKED. The buffer is already gone,
+ * don't wait on it again.
+ */
+ dma_resv_unlock(priv->dmabuf->resv);
+ return;
+ }
+ dma_buf_invalidate_mappings(priv->dmabuf);
+ dma_resv_wait_timeout(priv->dmabuf->resv,
+ DMA_RESV_USAGE_BOOKKEEP, false,
+ MAX_SCHEDULE_TIMEOUT);
+ dma_resv_unlock(priv->dmabuf->resv);
+ kref_put(&priv->kref, vfio_pci_dma_buf_done);
+ wait_for_completion(&priv->comp);
+ unmap_mapping_range(priv->dmabuf->file->f_mapping,
+ 0, priv->size, true);
+ /*
+ * Re-arm the registered kref reference and the
+ * completion so the post-revoke state matches the
+ * post-creation state. An un-revoke followed by a
+ * new mapping needs the kref to be non-zero before
+ * kref_get(), and vfio_pci_dma_buf_cleanup()
+ * delegates its drain back through this revoke
+ * path on a possibly-already-revoked dma-buf.
+ */
+ kref_init(&priv->kref);
+ reinit_completion(&priv->comp);
+ } else {
+ priv->status = VFIO_PCI_DMABUF_OK;
+ dma_resv_unlock(priv->dmabuf->resv);
+ }
+}
+
void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
{
struct vfio_pci_dma_buf *priv;
@@ -607,45 +661,16 @@ void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
/*
* Holding memory_lock ensures a racing VMA fault observes
- * priv->revoked properly.
+ * priv->status properly.
*/
lockdep_assert_held_write(&vdev->memory_lock);
list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
if (!get_file_active(&priv->dmabuf->file))
continue;
-
- if (priv->revoked != revoked) {
- dma_resv_lock(priv->dmabuf->resv, NULL);
- if (revoked)
- priv->revoked = true;
- dma_buf_invalidate_mappings(priv->dmabuf);
- dma_resv_wait_timeout(priv->dmabuf->resv,
- DMA_RESV_USAGE_BOOKKEEP, false,
- MAX_SCHEDULE_TIMEOUT);
- dma_resv_unlock(priv->dmabuf->resv);
- if (revoked) {
- kref_put(&priv->kref, vfio_pci_dma_buf_done);
- wait_for_completion(&priv->comp);
- unmap_mapping_range(priv->dmabuf->file->f_mapping,
- 0, priv->size, true);
- /*
- * Re-arm the registered kref reference and the
- * completion so the post-revoke state matches the
- * post-creation state. An un-revoke followed by a
- * new mapping needs the kref to be non-zero before
- * kref_get(), and vfio_pci_dma_buf_cleanup()
- * delegates its drain back through this revoke
- * path on a possibly-already-revoked dma-buf.
- */
- kref_init(&priv->kref);
- reinit_completion(&priv->comp);
- } else {
- dma_resv_lock(priv->dmabuf->resv, NULL);
- priv->revoked = false;
- dma_resv_unlock(priv->dmabuf->resv);
- }
- }
+ vfio_pci_dma_buf_set_status(priv, revoked ?
+ VFIO_PCI_DMABUF_TEMP_REVOKED :
+ VFIO_PCI_DMABUF_OK);
fput(priv->dmabuf->file);
}
}
@@ -677,3 +702,66 @@ void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev)
}
up_write(&vdev->memory_lock);
}
+
+#ifdef CONFIG_VFIO_PCI_DMABUF
+int vfio_pci_core_feature_dma_buf_revoke(
+ struct vfio_pci_core_device *vdev, u32 flags,
+ struct vfio_device_feature_dma_buf_revoke __user *arg,
+ size_t argsz)
+{
+ struct vfio_device_feature_dma_buf_revoke db_revoke;
+ struct vfio_pci_dma_buf *priv;
+ struct dma_buf *dmabuf;
+ int ret;
+
+ if (!vdev->pci_ops || !vdev->pci_ops->get_dmabuf_phys)
+ return -EOPNOTSUPP;
+
+ ret = vfio_check_feature(flags, argsz,
+ VFIO_DEVICE_FEATURE_SET,
+ sizeof(db_revoke));
+ if (ret != 1)
+ return ret;
+
+ if (copy_from_user(&db_revoke, arg, sizeof(db_revoke)))
+ return -EFAULT;
+
+ dmabuf = dma_buf_get(db_revoke.dmabuf_fd);
+ if (IS_ERR(dmabuf))
+ return PTR_ERR(dmabuf);
+
+ priv = dmabuf->priv;
+ /*
+ * Sanity-check the DMABUF is really a vfio_pci_dma_buf _and_
+ * relates to the VFIO device it was provided with.
+ *
+ * If the DMABUF relates to this vdev then priv->vdev is
+ * stable because this open fd prevents cleanup.
+ *
+ * If it relates to a different vdev, reading priv->vdev might
+ * race with a concurrent cleanup on that device. But if so,
+ * it points to a non-matching vdev or NULL and is unusable
+ * either way.
+ */
+ if (dmabuf->ops != &vfio_pci_dmabuf_ops ||
+ READ_ONCE(priv->vdev) != vdev) {
+ ret = -ENODEV;
+ goto out_put_buf;
+ }
+
+ scoped_guard(rwsem_write, &vdev->memory_lock) {
+ if (priv->status == VFIO_PCI_DMABUF_PERM_REVOKED) {
+ ret = -EBADFD;
+ } else {
+ vfio_pci_dma_buf_set_status(priv,
+ VFIO_PCI_DMABUF_PERM_REVOKED);
+ ret = 0;
+ }
+ }
+
+out_put_buf:
+ dma_buf_put(dmabuf);
+
+ return ret;
+}
+#endif /* CONFIG_VFIO_PCI_DMABUF */
diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
index c3fa35381679..8741abd04461 100644
--- a/drivers/vfio/pci/vfio_pci_priv.h
+++ b/drivers/vfio/pci/vfio_pci_priv.h
@@ -23,6 +23,12 @@ struct vfio_pci_ioeventfd {
bool test_mem;
};
+enum vfio_pci_dma_buf_status {
+ VFIO_PCI_DMABUF_OK = 0,
+ VFIO_PCI_DMABUF_TEMP_REVOKED = 1,
+ VFIO_PCI_DMABUF_PERM_REVOKED = 2,
+};
+
struct vfio_pci_dma_buf {
struct dma_buf *dmabuf;
struct vfio_pci_core_device *vdev;
@@ -35,7 +41,7 @@ struct vfio_pci_dma_buf {
struct kref kref;
struct completion comp;
unsigned long vma_pgoff_adjust;
- u8 revoked : 1;
+ enum vfio_pci_dma_buf_status status;
};
bool vfio_pci_intx_mask(struct vfio_pci_core_device *vdev);
@@ -147,6 +153,10 @@ void vfio_pci_set_vma_ops(struct vm_area_struct *vma);
int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
struct vfio_device_feature_dma_buf __user *arg,
size_t argsz);
+int vfio_pci_core_feature_dma_buf_revoke(
+ struct vfio_pci_core_device *vdev, u32 flags,
+ struct vfio_device_feature_dma_buf_revoke __user *arg,
+ size_t argsz);
#else
static inline int
vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
@@ -155,6 +165,13 @@ vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
{
return -ENOTTY;
}
+static inline int vfio_pci_core_feature_dma_buf_revoke(
+ struct vfio_pci_core_device *vdev, u32 flags,
+ struct vfio_device_feature_dma_buf_revoke __user *arg,
+ size_t argsz)
+{
+ return -ENOTTY;
+}
#endif
#endif
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 5de618a3a5ee..697c0bb4b9bc 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -1534,6 +1534,26 @@ struct vfio_device_feature_dma_buf {
*/
#define VFIO_DEVICE_FEATURE_MIG_PRECOPY_INFOv2 12
+/**
+ * Given a dma_buf fd previously created by
+ * VFIO_DEVICE_FEATURE_DMA_BUF, a SET of this feature requests that
+ * access to the corresponding DMABUF is immediately and permanently
+ * revoked. On successful return, the buffer is not accessible
+ * through any mmap() or dma-buf import. The buffer is permanently
+ * disabled, and VFIO refuses all map, mmap, attach, etc. requests.
+ *
+ * Return: 0 on success, -1 and errno is set on failure:
+ *
+ * EBADF, EINVAL: dmabuf_fd is not a DMABUF fd.
+ * ENODEV: The dmabuf_fd does not match this VFIO device.
+ * EBADFD: The DMABUF is already revoked.
+ */
+#define VFIO_DEVICE_FEATURE_DMA_BUF_REVOKE 13
+
+struct vfio_device_feature_dma_buf_revoke {
+ __s32 dmabuf_fd;
+};
+
/* -------- API for Type1 VFIO IOMMU -------- */
/**
--
2.50.1 (Apple Git-155)
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH v4 10/10] vfio/pci: Add mmap() attributes to DMABUF feature
2026-07-01 17:12 [PATCH v4 00/10] vfio/pci: Add mmap() for DMABUFs Matt Evans
` (8 preceding siblings ...)
2026-07-01 17:12 ` [PATCH v4 09/10] vfio/pci: Permanently revoke a DMABUF on request Matt Evans
@ 2026-07-01 17:12 ` Matt Evans
9 siblings, 0 replies; 13+ messages in thread
From: Matt Evans @ 2026-07-01 17:12 UTC (permalink / raw)
To: Alex Williamson, Leon Romanovsky, Jason Gunthorpe, Alex Mastro,
Christian König, Bjorn Helgaas, Logan Gunthorpe, Kevin Tian,
Pranjal Shrivastava
Cc: Mahmoud Adam, David Matlack, Björn Töpel, Sumit Semwal,
Ankit Agrawal, Alistair Popple, Vivek Kasireddy, linux-kernel,
linux-media, dri-devel, linaro-mm-sig, kvm, linux-pci
A new VFIO feature, VFIO_DEVICE_FEATURE_DMA_BUF_MEMATTR, is added to
set CPU-facing memory type attributes for a DMABUF exported from
vfio-pci. These are used for subsequent mmap()s of the buffer.
There are two attributes supported:
- The default, VFIO_DEVICE_FEATURE_DMA_BUF_MEMATTR_NC
- VFIO_DEVICE_FEATURE_DMA_BUF_MEMATTR_WC, which results in WC
PTEs for the DMABUF's BAR region.
Signed-off-by: Matt Evans <matt@ozlabs.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
---
drivers/vfio/pci/vfio_pci_core.c | 2 ++
drivers/vfio/pci/vfio_pci_dmabuf.c | 57 +++++++++++++++++++++++++++++-
drivers/vfio/pci/vfio_pci_priv.h | 14 ++++++++
include/uapi/linux/vfio.h | 27 ++++++++++++++
4 files changed, 99 insertions(+), 1 deletion(-)
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index ebdf9ec30517..0059c176d627 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1579,6 +1579,8 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
return vfio_pci_core_feature_dma_buf(vdev, flags, arg, argsz);
case VFIO_DEVICE_FEATURE_DMA_BUF_REVOKE:
return vfio_pci_core_feature_dma_buf_revoke(vdev, flags, arg, argsz);
+ case VFIO_DEVICE_FEATURE_DMA_BUF_MEMATTR:
+ return vfio_pci_core_feature_dma_buf_memattr(vdev, flags, arg, argsz);
default:
return -ENOTTY;
}
diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c
index cfca820b767a..31fbad93f89b 100644
--- a/drivers/vfio/pci/vfio_pci_dmabuf.c
+++ b/drivers/vfio/pci/vfio_pci_dmabuf.c
@@ -49,7 +49,10 @@ static int vfio_pci_dma_buf_mmap(struct dma_buf *dmabuf, struct vm_area_struct *
if ((vma->vm_flags & VM_SHARED) == 0)
return -EINVAL;
- vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+ if (READ_ONCE(priv->memattr) == VFIO_DEVICE_FEATURE_DMA_BUF_MEMATTR_WC)
+ vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
+ else
+ vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);
/* See comments in vfio_pci_core_mmap() re VM_ALLOW_ANY_UNCACHED. */
@@ -482,6 +485,7 @@ int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
priv->vdev = vdev;
priv->nr_ranges = get_dma_buf.nr_ranges;
priv->size = length;
+ priv->memattr = VFIO_DEVICE_FEATURE_DMA_BUF_MEMATTR_NC;
ret = vdev->pci_ops->get_dmabuf_phys(vdev, &priv->provider,
get_dma_buf.region_index,
priv->phys_vec, dma_ranges,
@@ -759,6 +763,57 @@ int vfio_pci_core_feature_dma_buf_revoke(
}
}
+out_put_buf:
+ dma_buf_put(dmabuf);
+
+ return ret;
+}
+
+int vfio_pci_core_feature_dma_buf_memattr(
+ struct vfio_pci_core_device *vdev, u32 flags,
+ struct vfio_device_feature_dma_buf_memattr __user *arg,
+ size_t argsz)
+{
+ struct vfio_device_feature_dma_buf_memattr db_attr;
+ struct vfio_pci_dma_buf *priv;
+ struct dma_buf *dmabuf;
+ int ret;
+
+ if (!vdev->pci_ops || !vdev->pci_ops->get_dmabuf_phys)
+ return -EOPNOTSUPP;
+
+ ret = vfio_check_feature(flags, argsz,
+ VFIO_DEVICE_FEATURE_SET,
+ sizeof(db_attr));
+ if (ret != 1)
+ return ret;
+
+ if (copy_from_user(&db_attr, arg, sizeof(db_attr)))
+ return -EFAULT;
+
+ dmabuf = dma_buf_get(db_attr.dmabuf_fd);
+ if (IS_ERR(dmabuf))
+ return PTR_ERR(dmabuf);
+
+ /* Verify DMABUF: see comments in vfio_pci_dma_buf_revoke() */
+ priv = dmabuf->priv;
+ if (dmabuf->ops != &vfio_pci_dmabuf_ops ||
+ READ_ONCE(priv->vdev) != vdev) {
+ ret = -ENODEV;
+ goto out_put_buf;
+ }
+
+ switch (db_attr.memattr) {
+ case VFIO_DEVICE_FEATURE_DMA_BUF_MEMATTR_NC:
+ case VFIO_DEVICE_FEATURE_DMA_BUF_MEMATTR_WC:
+ WRITE_ONCE(priv->memattr, db_attr.memattr);
+ ret = 0;
+ break;
+
+ default:
+ ret = -ENOENT;
+ }
+
out_put_buf:
dma_buf_put(dmabuf);
diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
index 8741abd04461..b3c72df002c4 100644
--- a/drivers/vfio/pci/vfio_pci_priv.h
+++ b/drivers/vfio/pci/vfio_pci_priv.h
@@ -41,6 +41,7 @@ struct vfio_pci_dma_buf {
struct kref kref;
struct completion comp;
unsigned long vma_pgoff_adjust;
+ u32 memattr;
enum vfio_pci_dma_buf_status status;
};
@@ -157,6 +158,10 @@ int vfio_pci_core_feature_dma_buf_revoke(
struct vfio_pci_core_device *vdev, u32 flags,
struct vfio_device_feature_dma_buf_revoke __user *arg,
size_t argsz);
+int vfio_pci_core_feature_dma_buf_memattr(
+ struct vfio_pci_core_device *vdev, u32 flags,
+ struct vfio_device_feature_dma_buf_memattr __user *arg,
+ size_t argsz);
#else
static inline int
vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
@@ -165,6 +170,7 @@ vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
{
return -ENOTTY;
}
+
static inline int vfio_pci_core_feature_dma_buf_revoke(
struct vfio_pci_core_device *vdev, u32 flags,
struct vfio_device_feature_dma_buf_revoke __user *arg,
@@ -172,6 +178,14 @@ static inline int vfio_pci_core_feature_dma_buf_revoke(
{
return -ENOTTY;
}
+
+static inline int vfio_pci_core_feature_dma_buf_memattr(
+ struct vfio_pci_core_device *vdev, u32 flags,
+ struct vfio_device_feature_dma_buf_memattr __user *arg,
+ size_t argsz)
+{
+ return -ENOTTY;
+}
#endif
#endif
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 697c0bb4b9bc..ab30b89399d0 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -1554,6 +1554,33 @@ struct vfio_device_feature_dma_buf_revoke {
__s32 dmabuf_fd;
};
+/**
+ * Given a dma_buf fd previously created by
+ * VFIO_DEVICE_FEATURE_DMA_BUF, SET the memory attribute that will be
+ * used by future mmap()s of that fd. SETting a new attribute does
+ * not affect existing VMAs.
+ *
+ * The default, if no previous SET has been performed, is NC.
+ *
+ * Return: 0 on success, -1 and errno is set on failure:
+ *
+ * EBADF, EINVAL: dmabuf_fd is not a DMABUF fd.
+ * ENODEV: The dmabuf_fd does not match this VFIO device.
+ * ENOENT: The given memattr is not supported.
+ */
+#define VFIO_DEVICE_FEATURE_DMA_BUF_MEMATTR 14
+
+/* Valid memory attributes for the memattr field */
+enum vfio_device_dma_buf_memattr {
+ VFIO_DEVICE_FEATURE_DMA_BUF_MEMATTR_NC = 0, /* pgprot_noncached */
+ VFIO_DEVICE_FEATURE_DMA_BUF_MEMATTR_WC = 1, /* pgprot_writecombine */
+};
+
+struct vfio_device_feature_dma_buf_memattr {
+ __s32 dmabuf_fd;
+ __u32 memattr;
+};
+
/* -------- API for Type1 VFIO IOMMU -------- */
/**
--
2.50.1 (Apple Git-155)
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH v4 01/10] PCI/P2PDMA: Split pool-related cleanup out of pci_p2pdma_release()
2026-07-01 17:12 ` [PATCH v4 01/10] PCI/P2PDMA: Split pool-related cleanup out of pci_p2pdma_release() Matt Evans
@ 2026-07-01 21:15 ` Bjorn Helgaas
0 siblings, 0 replies; 13+ messages in thread
From: Bjorn Helgaas @ 2026-07-01 21:15 UTC (permalink / raw)
To: Matt Evans
Cc: Alex Williamson, Leon Romanovsky, Jason Gunthorpe, Alex Mastro,
Christian König, Bjorn Helgaas, Logan Gunthorpe, Kevin Tian,
Pranjal Shrivastava, Mahmoud Adam, David Matlack,
Björn Töpel, Sumit Semwal, Ankit Agrawal,
Alistair Popple, Vivek Kasireddy, linux-kernel, linux-media,
dri-devel, linaro-mm-sig, kvm, linux-pci
On Wed, Jul 01, 2026 at 06:12:13PM +0100, Matt Evans wrote:
> Preparing for a refactor in a subsequent patch, split the pool-related
> release code into a new pci_p2pdma_release_pool() function.
>
> This allows future compile-time selection of a null implementation for
> pci_p2pdma_release_pool(), when p2pdma.c is refactored into core- and
> P2P-related files.
>
> Signed-off-by: Matt Evans <matt@ozlabs.org>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
> ---
> drivers/pci/p2pdma.c | 24 ++++++++++++++----------
> 1 file changed, 14 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index 7c898542af8d..a5a1baebc34e 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -226,6 +226,17 @@ static const struct dev_pagemap_ops p2pdma_pgmap_ops = {
> .folio_free = p2pdma_folio_free,
> };
>
> +static void pci_p2pdma_release_pool(struct pci_dev *pdev,
> + struct pci_p2pdma *p2pdma)
> +{
> + if (!p2pdma->pool)
> + return;
> +
> + synchronize_rcu();
> + gen_pool_destroy(p2pdma->pool);
> + sysfs_remove_group(&pdev->dev.kobj, &p2pmem_group);
> +}
> +
> static void pci_p2pdma_release(void *data)
> {
> struct pci_dev *pdev = data;
> @@ -237,15 +248,8 @@ static void pci_p2pdma_release(void *data)
>
> /* Flush and disable pci_alloc_p2p_mem() */
> pdev->p2pdma = NULL;
> - if (p2pdma->pool)
> - synchronize_rcu();
> + pci_p2pdma_release_pool(pdev, p2pdma);
> xa_destroy(&p2pdma->map_types);
> -
> - if (!p2pdma->pool)
> - return;
> -
> - gen_pool_destroy(p2pdma->pool);
> - sysfs_remove_group(&pdev->dev.kobj, &p2pmem_group);
> }
>
> /**
> @@ -932,8 +936,8 @@ void *pci_alloc_p2pmem(struct pci_dev *pdev, size_t size)
> struct pci_p2pdma *p2pdma;
>
> /*
> - * Pairs with synchronize_rcu() in pci_p2pdma_release() to
> - * ensure pdev->p2pdma is non-NULL for the duration of the
> + * Pairs with synchronize_rcu() in pci_p2pdma_release_pool()
> + * to ensure pdev->p2pdma is non-NULL for the duration of the
> * read-lock.
> */
> rcu_read_lock();
> --
> 2.50.1 (Apple Git-155)
>
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v4 02/10] PCI/P2PDMA: Add CONFIG_PCI_P2PDMA_CORE
2026-07-01 17:12 ` [PATCH v4 02/10] PCI/P2PDMA: Add CONFIG_PCI_P2PDMA_CORE Matt Evans
@ 2026-07-01 21:16 ` Bjorn Helgaas
0 siblings, 0 replies; 13+ messages in thread
From: Bjorn Helgaas @ 2026-07-01 21:16 UTC (permalink / raw)
To: Matt Evans
Cc: Alex Williamson, Leon Romanovsky, Jason Gunthorpe, Alex Mastro,
Christian König, Bjorn Helgaas, Logan Gunthorpe, Kevin Tian,
Pranjal Shrivastava, Mahmoud Adam, David Matlack,
Björn Töpel, Sumit Semwal, Ankit Agrawal,
Alistair Popple, Vivek Kasireddy, linux-kernel, linux-media,
dri-devel, linaro-mm-sig, kvm, linux-pci
On Wed, Jul 01, 2026 at 06:12:14PM +0100, Matt Evans wrote:
> The P2PDMA code currently provides two features under the same
> CONFIG_PCI_P2PDMA option:
>
> 1. Locate providers via pcim_p2pdma_provider()
> 2. Manage actual P2P DMA
>
> Some drivers (such as vfio-pci) depend on (1), without having a hard
> dependency on (2).
>
> A future vfio-pci commit will rely on pcim_p2pdma_provider() always
> being present. If that depended on CONFIG_PCI_P2PDMA, it would make
> vfio-pci only available if CONFIG_ZONE_DEVICE is present (e.g. 64-bit
> systems), even when P2P is not needed.
>
> To resolve this, introduce CONFIG_PCI_P2PDMA_CORE and refactor the
> basic provider functionality into a new p2pdma_core.c file. This is
> available even if the CONFIG_PCI_P2PDMA feature is disabled (or
> unavailable due to !CONFIG_ZONE_DEVICE), satisfying (1).
>
> Then, when the original CONFIG_PCI_P2PDMA is set, drivers have access
> to the additional P2P features of (2). This still depends on
> CONFIG_ZONE_DEVICE.
>
> Signed-off-by: Matt Evans <matt@ozlabs.org>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
> ---
> MAINTAINERS | 2 +-
> drivers/pci/Kconfig | 10 ++--
> drivers/pci/Makefile | 1 +
> drivers/pci/p2pdma.c | 107 +--------------------------------
> drivers/pci/p2pdma.h | 29 +++++++++
> drivers/pci/p2pdma_core.c | 118 +++++++++++++++++++++++++++++++++++++
> include/linux/pci-p2pdma.h | 24 ++++----
> include/linux/pci.h | 2 +-
> 8 files changed, 171 insertions(+), 122 deletions(-)
> create mode 100644 drivers/pci/p2pdma.h
> create mode 100644 drivers/pci/p2pdma_core.c
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index c8d4b913f26c..713861af4484 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -20626,7 +20626,7 @@ B: https://bugzilla.kernel.org
> C: irc://irc.oftc.net/linux-pci
> T: git git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci.git
> F: Documentation/driver-api/pci/p2pdma.rst
> -F: drivers/pci/p2pdma.c
> +F: drivers/pci/p2pdma*
> F: include/linux/pci-p2pdma.h
>
> PCI POWER CONTROL
> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> index 33c88432b728..59d70bc84cc9 100644
> --- a/drivers/pci/Kconfig
> +++ b/drivers/pci/Kconfig
> @@ -206,11 +206,7 @@ config PCIE_TPH
> config PCI_P2PDMA
> bool "PCI peer-to-peer transfer support"
> depends on ZONE_DEVICE
> - #
> - # The need for the scatterlist DMA bus address flag means PCI P2PDMA
> - # requires 64bit
> - #
> - depends on 64BIT
> + select PCI_P2PDMA_CORE
> select GENERIC_ALLOCATOR
> select NEED_SG_DMA_FLAGS
> help
> @@ -226,6 +222,10 @@ config PCI_P2PDMA
>
> If unsure, say N.
>
> +config PCI_P2PDMA_CORE
> + default n
> + bool
> +
> config PCI_LABEL
> def_bool y if (DMI || ACPI)
> select NLS
> diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
> index 41ebc3b9a518..0b32572d57a1 100644
> --- a/drivers/pci/Makefile
> +++ b/drivers/pci/Makefile
> @@ -30,6 +30,7 @@ obj-$(CONFIG_PCI_SYSCALL) += syscall.o
> obj-$(CONFIG_PCI_STUB) += pci-stub.o
> obj-$(CONFIG_PCI_PF_STUB) += pci-pf-stub.o
> obj-$(CONFIG_PCI_ECAM) += ecam.o
> +obj-$(CONFIG_PCI_P2PDMA_CORE) += p2pdma_core.o
> obj-$(CONFIG_PCI_P2PDMA) += p2pdma.o
> obj-$(CONFIG_XEN_PCIDEV_FRONTEND) += xen-pcifront.o
> obj-$(CONFIG_VGA_ARB) += vgaarb.o
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index a5a1baebc34e..50b1a7daf55c 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -21,12 +21,7 @@
> #include <linux/seq_buf.h>
> #include <linux/xarray.h>
>
> -struct pci_p2pdma {
> - struct gen_pool *pool;
> - bool p2pmem_published;
> - struct xarray map_types;
> - struct p2pdma_provider mem[PCI_STD_NUM_BARS];
> -};
> +#include "p2pdma.h"
>
> struct pci_p2pdma_pagemap {
> struct dev_pagemap pgmap;
> @@ -226,8 +221,7 @@ static const struct dev_pagemap_ops p2pdma_pgmap_ops = {
> .folio_free = p2pdma_folio_free,
> };
>
> -static void pci_p2pdma_release_pool(struct pci_dev *pdev,
> - struct pci_p2pdma *p2pdma)
> +void pci_p2pdma_release_pool(struct pci_dev *pdev, struct pci_p2pdma *p2pdma)
> {
> if (!p2pdma->pool)
> return;
> @@ -237,103 +231,6 @@ static void pci_p2pdma_release_pool(struct pci_dev *pdev,
> sysfs_remove_group(&pdev->dev.kobj, &p2pmem_group);
> }
>
> -static void pci_p2pdma_release(void *data)
> -{
> - struct pci_dev *pdev = data;
> - struct pci_p2pdma *p2pdma;
> -
> - p2pdma = rcu_dereference_protected(pdev->p2pdma, 1);
> - if (!p2pdma)
> - return;
> -
> - /* Flush and disable pci_alloc_p2p_mem() */
> - pdev->p2pdma = NULL;
> - pci_p2pdma_release_pool(pdev, p2pdma);
> - xa_destroy(&p2pdma->map_types);
> -}
> -
> -/**
> - * pcim_p2pdma_init - Initialise peer-to-peer DMA providers
> - * @pdev: The PCI device to enable P2PDMA for
> - *
> - * This function initializes the peer-to-peer DMA infrastructure
> - * for a PCI device. It allocates and sets up the necessary data
> - * structures to support P2PDMA operations, including mapping type
> - * tracking.
> - */
> -int pcim_p2pdma_init(struct pci_dev *pdev)
> -{
> - struct pci_p2pdma *p2p;
> - int i, ret;
> -
> - p2p = rcu_dereference_protected(pdev->p2pdma, 1);
> - if (p2p)
> - return 0;
> -
> - p2p = devm_kzalloc(&pdev->dev, sizeof(*p2p), GFP_KERNEL);
> - if (!p2p)
> - return -ENOMEM;
> -
> - xa_init(&p2p->map_types);
> - /*
> - * Iterate over all standard PCI BARs and record only those that
> - * correspond to MMIO regions. Skip non-memory resources (e.g. I/O
> - * port BARs) since they cannot be used for peer-to-peer (P2P)
> - * transactions.
> - */
> - for (i = 0; i < PCI_STD_NUM_BARS; i++) {
> - if (!(pci_resource_flags(pdev, i) & IORESOURCE_MEM))
> - continue;
> -
> - p2p->mem[i].owner = &pdev->dev;
> - p2p->mem[i].bus_offset =
> - pci_bus_address(pdev, i) - pci_resource_start(pdev, i);
> - }
> -
> - ret = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_release, pdev);
> - if (ret)
> - goto out_p2p;
> -
> - rcu_assign_pointer(pdev->p2pdma, p2p);
> - return 0;
> -
> -out_p2p:
> - devm_kfree(&pdev->dev, p2p);
> - return ret;
> -}
> -EXPORT_SYMBOL_GPL(pcim_p2pdma_init);
> -
> -/**
> - * pcim_p2pdma_provider - Get peer-to-peer DMA provider
> - * @pdev: The PCI device to enable P2PDMA for
> - * @bar: BAR index to get provider
> - *
> - * This function gets peer-to-peer DMA provider for a PCI device. The lifetime
> - * of the provider (and of course the MMIO) is bound to the lifetime of the
> - * driver. A driver calling this function must ensure that all references to the
> - * provider, and any DMA mappings created for any MMIO, are all cleaned up
> - * before the driver remove() completes.
> - *
> - * Since P2P is almost always shared with a second driver this means some system
> - * to notify, invalidate and revoke the MMIO's DMA must be in place to use this
> - * function. For example a revoke can be built using DMABUF.
> - */
> -struct p2pdma_provider *pcim_p2pdma_provider(struct pci_dev *pdev, int bar)
> -{
> - struct pci_p2pdma *p2p;
> -
> - if (!(pci_resource_flags(pdev, bar) & IORESOURCE_MEM))
> - return NULL;
> -
> - p2p = rcu_dereference_protected(pdev->p2pdma, 1);
> - if (WARN_ON(!p2p))
> - /* Someone forgot to call to pcim_p2pdma_init() before */
> - return NULL;
> -
> - return &p2p->mem[bar];
> -}
> -EXPORT_SYMBOL_GPL(pcim_p2pdma_provider);
> -
> static int pci_p2pdma_setup_pool(struct pci_dev *pdev)
> {
> struct pci_p2pdma *p2pdma;
> diff --git a/drivers/pci/p2pdma.h b/drivers/pci/p2pdma.h
> new file mode 100644
> index 000000000000..946383809981
> --- /dev/null
> +++ b/drivers/pci/p2pdma.h
> @@ -0,0 +1,29 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * PCI peer-to-peer DMA support.
> + */
> +
> +#ifndef _PCI_P2PDMA_H
> +#define _PCI_P2PDMA_H
> +
> +#include <linux/genalloc.h>
> +#include <linux/pci-p2pdma.h>
> +#include <linux/xarray.h>
> +
> +struct pci_p2pdma {
> + struct gen_pool *pool;
> + bool p2pmem_published;
> + struct xarray map_types;
> + struct p2pdma_provider mem[PCI_STD_NUM_BARS];
> +};
> +
> +#ifdef CONFIG_PCI_P2PDMA
> +void pci_p2pdma_release_pool(struct pci_dev *pdev, struct pci_p2pdma *p2pdma);
> +#else
> +static inline void pci_p2pdma_release_pool(struct pci_dev *pdev,
> + struct pci_p2pdma *p2pdma)
> +{
> +}
> +#endif
> +
> +#endif
> diff --git a/drivers/pci/p2pdma_core.c b/drivers/pci/p2pdma_core.c
> new file mode 100644
> index 000000000000..bb2138bf2bc7
> --- /dev/null
> +++ b/drivers/pci/p2pdma_core.c
> @@ -0,0 +1,118 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * PCI peer-to-peer DMA support core, providing a bare-bones
> + * pcim_p2pdma_provider() interface to drivers even if full P2PDMA
> + * isn't present. The full P2PDMA feature is in p2pdma.c (see
> + * CONFIG_PCI_P2PDMA).
> + *
> + * Copyright (c) 2016-2018, Logan Gunthorpe
> + * Copyright (c) 2016-2017, Microsemi Corporation
> + * Copyright (c) 2017, Christoph Hellwig
> + * Copyright (c) 2018, Eideticom Inc.
> + */
> +
> +#define pr_fmt(fmt) "pci-p2pdma: " fmt
> +#include <linux/ctype.h>
> +#include <linux/genalloc.h>
> +#include <linux/memremap.h>
> +#include <linux/pci-p2pdma.h>
> +#include <linux/xarray.h>
> +
> +#include "p2pdma.h"
> +
> +static void pci_p2pdma_release(void *data)
> +{
> + struct pci_dev *pdev = data;
> + struct pci_p2pdma *p2pdma;
> +
> + p2pdma = rcu_dereference_protected(pdev->p2pdma, 1);
> + if (!p2pdma)
> + return;
> +
> + /* Flush and disable pci_alloc_p2p_mem() */
> + pdev->p2pdma = NULL;
> + pci_p2pdma_release_pool(pdev, p2pdma);
> + xa_destroy(&p2pdma->map_types);
> +}
> +
> +/**
> + * pcim_p2pdma_init - Initialise peer-to-peer DMA providers
> + * @pdev: The PCI device to enable P2PDMA for
> + *
> + * This function initializes the peer-to-peer DMA infrastructure
> + * for a PCI device. It allocates and sets up the necessary data
> + * structures to support P2PDMA operations, including mapping type
> + * tracking.
> + */
> +int pcim_p2pdma_init(struct pci_dev *pdev)
> +{
> + struct pci_p2pdma *p2p;
> + int i, ret;
> +
> + p2p = rcu_dereference_protected(pdev->p2pdma, 1);
> + if (p2p)
> + return 0;
> +
> + p2p = devm_kzalloc(&pdev->dev, sizeof(*p2p), GFP_KERNEL);
> + if (!p2p)
> + return -ENOMEM;
> +
> + xa_init(&p2p->map_types);
> + /*
> + * Iterate over all standard PCI BARs and record only those that
> + * correspond to MMIO regions. Skip non-memory resources (e.g. I/O
> + * port BARs) since they cannot be used for peer-to-peer (P2P)
> + * transactions.
> + */
> + for (i = 0; i < PCI_STD_NUM_BARS; i++) {
> + if (!(pci_resource_flags(pdev, i) & IORESOURCE_MEM))
> + continue;
> +
> + p2p->mem[i].owner = &pdev->dev;
> + p2p->mem[i].bus_offset =
> + pci_bus_address(pdev, i) - pci_resource_start(pdev, i);
> + }
> +
> + ret = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_release, pdev);
> + if (ret)
> + goto out_p2p;
> +
> + rcu_assign_pointer(pdev->p2pdma, p2p);
> + return 0;
> +
> +out_p2p:
> + devm_kfree(&pdev->dev, p2p);
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(pcim_p2pdma_init);
> +
> +/**
> + * pcim_p2pdma_provider - Get peer-to-peer DMA provider
> + * @pdev: The PCI device to enable P2PDMA for
> + * @bar: BAR index to get provider
> + *
> + * This function gets peer-to-peer DMA provider for a PCI device. The lifetime
> + * of the provider (and of course the MMIO) is bound to the lifetime of the
> + * driver. A driver calling this function must ensure that all references to the
> + * provider, and any DMA mappings created for any MMIO, are all cleaned up
> + * before the driver remove() completes.
> + *
> + * Since P2P is almost always shared with a second driver this means some system
> + * to notify, invalidate and revoke the MMIO's DMA must be in place to use this
> + * function. For example a revoke can be built using DMABUF.
> + */
> +struct p2pdma_provider *pcim_p2pdma_provider(struct pci_dev *pdev, int bar)
> +{
> + struct pci_p2pdma *p2p;
> +
> + if (!(pci_resource_flags(pdev, bar) & IORESOURCE_MEM))
> + return NULL;
> +
> + p2p = rcu_dereference_protected(pdev->p2pdma, 1);
> + if (WARN_ON(!p2p))
> + /* Someone forgot to call to pcim_p2pdma_init() before */
> + return NULL;
> +
> + return &p2p->mem[bar];
> +}
> +EXPORT_SYMBOL_GPL(pcim_p2pdma_provider);
> diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
> index 873de20a2247..4c42a7b2ee85 100644
> --- a/include/linux/pci-p2pdma.h
> +++ b/include/linux/pci-p2pdma.h
> @@ -67,9 +67,22 @@ enum pci_p2pdma_map_type {
> PCI_P2PDMA_MAP_THRU_HOST_BRIDGE,
> };
>
> -#ifdef CONFIG_PCI_P2PDMA
> +#ifdef CONFIG_PCI_P2PDMA_CORE
> int pcim_p2pdma_init(struct pci_dev *pdev);
> struct p2pdma_provider *pcim_p2pdma_provider(struct pci_dev *pdev, int bar);
> +#else
> +static inline int pcim_p2pdma_init(struct pci_dev *pdev)
> +{
> + return -EOPNOTSUPP;
> +}
> +static inline struct p2pdma_provider *pcim_p2pdma_provider(struct pci_dev *pdev,
> + int bar)
> +{
> + return NULL;
> +}
> +#endif
> +
> +#ifdef CONFIG_PCI_P2PDMA
> int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
> u64 offset);
> int pci_p2pdma_distance_many(struct pci_dev *provider, struct device **clients,
> @@ -89,15 +102,6 @@ ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev,
> enum pci_p2pdma_map_type pci_p2pdma_map_type(struct p2pdma_provider *provider,
> struct device *dev);
> #else /* CONFIG_PCI_P2PDMA */
> -static inline int pcim_p2pdma_init(struct pci_dev *pdev)
> -{
> - return -EOPNOTSUPP;
> -}
> -static inline struct p2pdma_provider *pcim_p2pdma_provider(struct pci_dev *pdev,
> - int bar)
> -{
> - return NULL;
> -}
> static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar,
> size_t size, u64 offset)
> {
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 2c4454583c11..531aec355686 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -557,7 +557,7 @@ struct pci_dev {
> u16 pasid_cap; /* PASID Capability offset */
> u16 pasid_features;
> #endif
> -#ifdef CONFIG_PCI_P2PDMA
> +#ifdef CONFIG_PCI_P2PDMA_CORE
> struct pci_p2pdma __rcu *p2pdma;
> #endif
> #ifdef CONFIG_PCI_DOE
> --
> 2.50.1 (Apple Git-155)
>
^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2026-07-01 21:16 UTC | newest]
Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-07-01 17:12 [PATCH v4 00/10] vfio/pci: Add mmap() for DMABUFs Matt Evans
2026-07-01 17:12 ` [PATCH v4 01/10] PCI/P2PDMA: Split pool-related cleanup out of pci_p2pdma_release() Matt Evans
2026-07-01 21:15 ` Bjorn Helgaas
2026-07-01 17:12 ` [PATCH v4 02/10] PCI/P2PDMA: Add CONFIG_PCI_P2PDMA_CORE Matt Evans
2026-07-01 21:16 ` Bjorn Helgaas
2026-07-01 17:12 ` [PATCH v4 03/10] vfio/pci: Add a helper to look up PFNs for DMABUFs Matt Evans
2026-07-01 17:12 ` [PATCH v4 04/10] vfio/pci: Add a helper to create a DMABUF for a BAR-map VMA Matt Evans
2026-07-01 17:12 ` [PATCH v4 05/10] vfio/pci: Convert BAR mmap() to use a DMABUF Matt Evans
2026-07-01 17:12 ` [PATCH v4 06/10] vfio/pci: Provide a user-facing name for BAR mappings Matt Evans
2026-07-01 17:12 ` [PATCH v4 07/10] vfio/pci: Clean up BAR zap and revocation Matt Evans
2026-07-01 17:12 ` [PATCH v4 08/10] vfio/pci: Support mmap() of a VFIO DMABUF Matt Evans
2026-07-01 17:12 ` [PATCH v4 09/10] vfio/pci: Permanently revoke a DMABUF on request Matt Evans
2026-07-01 17:12 ` [PATCH v4 10/10] vfio/pci: Add mmap() attributes to DMABUF feature Matt Evans
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.