From: <mhonap@nvidia.com>
To: <alwilliamson@nvidia.com>, <dan.j.williams@intel.com>,
<jonathan.cameron@huawei.com>, <dave.jiang@intel.com>,
<alejandro.lucero-palau@amd.com>, <dave@stgolabs.net>,
<alison.schofield@intel.com>, <vishal.l.verma@intel.com>,
<ira.weiny@intel.com>, <dmatlack@google.com>, <shuah@kernel.org>,
<jgg@ziepe.ca>, <yishaih@nvidia.com>, <skolothumtho@nvidia.com>,
<kevin.tian@intel.com>, <ankita@nvidia.com>
Cc: <vsethi@nvidia.com>, <cjia@nvidia.com>, <targupta@nvidia.com>,
<zhiw@nvidia.com>, <kjaju@nvidia.com>,
<linux-kselftest@vger.kernel.org>, <linux-kernel@vger.kernel.org>,
<linux-cxl@vger.kernel.org>, <kvm@vger.kernel.org>,
<mhonap@nvidia.com>, Alex Williamson <alex@shazbot.org>,
Jonathan Cameron <Jonathan.Cameron@huawei.com>
Subject: [PATCH v2 00/20] vfio/pci: Add CXL Type-2 device passthrough support
Date: Wed, 1 Apr 2026 20:08:57 +0530
Message-ID: <20260401143917.108413-1-mhonap@nvidia.com>
From: Manish Honap <mhonap@nvidia.com>
CXL Type-2 accelerators (e.g. CXL.mem-capable GPUs) cannot be passed
through to virtual machines with stock vfio-pci because the driver has
no concept of HDM decoder management, DPA region exposure, or component
register emulation. This series wires all of that into vfio-pci-core
behind a new optional module, CONFIG_VFIO_CXL_CORE, without requiring a
variant driver.
When a CXL Device DVSEC (Vendor ID 0x1E98, ID 0x0000) is detected at
device open time, the driver:
- Probes the HDM Decoder Capability block in the component registers
and allocates a DPA region through the CXL subsystem. On devices
where firmware has already committed a decoder, the kernel skips
allocation and re-uses the committed range.
- Builds a kernel-owned shadow of the HDM register block. The VMM
reads and writes this shadow through a dedicated COMP_REGS VFIO
region rather than touching the hardware directly. The kernel
enforces CXL 3.1 bit-field rules: reserved bits, read-only bits,
the COMMIT/COMMITTED latch, and the LOCK→0 reprogram path for
firmware-committed decoders.
- Exposes the DPA range as a second VFIO region (VFIO_REGION_SUBTYPE_CXL)
backed by the kernel-assigned HPA. PTEs are inserted lazily on first
page fault and torn down atomically under memory_lock during FLR.
- Intercepts writes to the CXL DVSEC configuration-space registers
(Control, Status, Control2, Status2, Lock, Range Base) and replays
them through a per-device vconfig shadow, enforcing RWL/RW1CS/RWO
access semantics and the CONFIG_LOCK one-shot latch.
- Returns a VFIO_DEVICE_INFO_CAP_CXL capability (id=6) carrying the
HDM register BAR index and offset, commit flags, and the indices of
the DPA and COMP_REGS regions. HDM decoder count and the HDM block
offset within COMP_REGS are derivable by the VMM from the CXL
Capability Array in the COMP_REGS region itself, so they are not
duplicated in the capability struct.
- Builds a sparse-mmap capability for the component register BAR so
VMMs can map GPU/accelerator register windows while the kernel
protects the CXL component register block. Three physical layouts
are handled: component block at the BAR end, at the start, and in
the middle.
- Provides a module parameter (disable_cxl=1) and a per-device flag
(vdev->disable_cxl) for suppressing the feature without recompiling.
- Includes selftests covering device detection, capability parsing,
region enumeration, HDM register emulation, DPA mmap with page-fault
insertion, FLR invalidation, and DVSEC register emulation.
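For readers unfamiliar with the detection step that gates everything in
the list above, the following is a minimal user-space sketch of locating
the CXL Device DVSEC in a snapshot of PCIe extended configuration space.
It is not the code from this series. The function and macro names are
hypothetical; only the DVSEC extended capability ID (0x23), the vendor ID
(0x1E98) and the DVSEC ID (0x0000) come from the PCIe/CXL definitions
referenced above.

```c
#include <stddef.h>
#include <stdint.h>

#define CFG_SPACE_SIZE       4096    /* PCIe extended config space */
#define EXT_CAP_START        0x100   /* first extended capability */
#define EXT_CAP_ID_DVSEC     0x0023  /* Designated Vendor-Specific ext cap */
#define CXL_DVSEC_VENDOR_ID  0x1E98
#define CXL_DVSEC_ID_DEVICE  0x0000

/* Read a little-endian 32-bit value from a config-space snapshot. */
static uint32_t cfg_read32(const uint8_t *cfg, size_t off)
{
	return (uint32_t)cfg[off] | ((uint32_t)cfg[off + 1] << 8) |
	       ((uint32_t)cfg[off + 2] << 16) | ((uint32_t)cfg[off + 3] << 24);
}

/*
 * Walk the extended capability list and return the offset of the CXL
 * Device DVSEC, or 0 if the device does not expose one.
 * (Hypothetical helper, for illustration only.)
 */
static size_t find_cxl_device_dvsec(const uint8_t *cfg)
{
	size_t off = EXT_CAP_START;
	/* Bound the walk so a malformed, looping list cannot hang us. */
	int budget = (CFG_SPACE_SIZE - EXT_CAP_START) / 8;

	while (off >= EXT_CAP_START && off + 12 <= CFG_SPACE_SIZE &&
	       budget-- > 0) {
		uint32_t hdr = cfg_read32(cfg, off);

		if (hdr == 0 || hdr == 0xFFFFFFFFu)
			break;

		if ((hdr & 0xFFFF) == EXT_CAP_ID_DVSEC) {
			/* DVSEC header 1: vendor ID in bits 15:0 */
			uint32_t h1 = cfg_read32(cfg, off + 4);
			/* DVSEC header 2: DVSEC ID in bits 15:0 */
			uint32_t h2 = cfg_read32(cfg, off + 8);

			if ((h1 & 0xFFFF) == CXL_DVSEC_VENDOR_ID &&
			    (h2 & 0xFFFF) == CXL_DVSEC_ID_DEVICE)
				return off;
		}

		/* Next capability offset lives in bits 31:20, dword aligned. */
		off = (hdr >> 20) & 0xFFC;
	}

	return 0;
}
```

A return value of 0 corresponds to the case where the CXL path is not
taken and the device is handled as a plain PCI function.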
The series applies on top of the cxl/next branch, at the base commit
given at the end of this cover letter, plus Alejandro's v23 Type-2
device support patches [1].
Series structure
================
Patches 1-5 extend the CXL subsystem with the APIs vfio-pci needs.
Patches 6-8 add the vfio-pci-core plumbing (UAPI, device state,
Kconfig/build).
Patches 9-15 implement the core device lifecycle: detection, HDM
emulation, media readiness, region management, DPA region, and DVSEC
emulation.
Patches 16-18 wire everything together at open/close time and
populate the VFIO ioctl paths.
Patches 19-20 add documentation and selftests.
Changes since v1
================
UAPI struct minimization (patch 6)
v1 carried hdm_count, hdm_regs_size, hdm_decoder_offset, dpa_size,
and a pad byte in vfio_device_info_cap_cxl. The four data fields are all
derivable from data the VMM already has: hdm_count and the HDM block
offset come from the CXL Capability Array in the COMP_REGS region,
hdm_regs_size is implicit in the COMP_REGS region size, and dpa_size
is the DPA region size. v2 drops them and replaces pad with
reserved[3]. The VFIO_CXL_CAP_PRECOMMITTED flag is gone; the single
VFIO_CXL_CAP_FIRMWARE_COMMITTED flag covers both the committed and
precommitted cases. VFIO_CXL_CAP_CACHE_CAPABLE is added to expose
the HDM-DB (CXL.cache) capability bit.
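To illustrate how a VMM could derive the dropped fields, here is a rough
sketch of walking the CXL Capability Array to find the HDM Decoder
Capability block and its decoder count. It is not taken from the series.
The helper name is hypothetical, and the caller is assumed to pass a
pointer to the start of the CXL.cache/mem register area (the capability
array header), wherever that sits inside the COMP_REGS region. The count
decode uses the simple doubling rule for the low encodings; larger
encodings defined by newer spec revisions would need the spec table.

```c
#include <stddef.h>
#include <stdint.h>

#define CM_CAP_HDR_ID  0x0001u  /* CXL Capability Header ID */
#define CM_CAP_ID_HDM  0x0005u  /* HDM Decoder Capability ID */

/* Read a little-endian 32-bit register from a buffer copy of the block. */
static uint32_t reg_read32(const uint8_t *regs, size_t off)
{
	return (uint32_t)regs[off] | ((uint32_t)regs[off + 1] << 8) |
	       ((uint32_t)regs[off + 2] << 16) | ((uint32_t)regs[off + 3] << 24);
}

/*
 * Locate the HDM Decoder Capability block via the capability array.
 * On success returns 0 and fills in the block offset (relative to cm)
 * and the decoded decoder count; returns -1 if no HDM block is found.
 * (Hypothetical helper, for illustration only.)
 */
static int find_hdm_block(const uint8_t *cm, size_t cm_size,
			  uint32_t *hdm_off, unsigned int *hdm_count)
{
	uint32_t hdr, entries, i;

	if (cm_size < 4)
		return -1;

	hdr = reg_read32(cm, 0);
	if ((hdr & 0xFFFF) != CM_CAP_HDR_ID)
		return -1;

	/* Array size (number of capability entries) is in bits 31:24. */
	entries = hdr >> 24;

	for (i = 1; i <= entries; i++) {
		size_t ent = (size_t)i * 4;
		uint32_t cap, off, field;

		if (ent + 4 > cm_size)
			return -1;

		cap = reg_read32(cm, ent);
		if ((cap & 0xFFFF) != CM_CAP_ID_HDM)
			continue;

		/* Capability pointer (byte offset from cm) is in bits 31:20. */
		off = cap >> 20;
		if ((size_t)off + 4 > cm_size)
			return -1;

		/* Decoder count field: bits 3:0 of the HDM capability register. */
		field = reg_read32(cm, off) & 0xF;
		*hdm_off = off;
		*hdm_count = field ? field * 2 : 1;  /* assumed low-encoding rule */
		return 0;
	}

	return -1;
}
```

With this, the only values the capability struct still has to carry are
the ones that cannot be read back from the regions themselves.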
Component BAR access: sparse mmap instead of blanket rejection (patch 17)
v1 returned size=0 for the component BAR and rejected all mmap and
r/w access to it. That broke GPU passthrough scenarios where the
device puts accelerator register windows in the same BAR as the CXL
component registers. v2 replaces the blanket rejection with a
sparse-mmap capability that advertises only the GPU register windows,
carving out the component register block. vfio_cxl_mmap_overlaps_comp_regs()
rejects only the sub-range covering [comp_reg_offset, comp_reg_offset
+ comp_reg_size); everything else in the BAR remains mappable.
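The carve-out arithmetic can be sketched as follows. This is an
illustration only, not the implementation in patch 17: the struct and
function names are hypothetical, ranges are half-open, and real code
would additionally have to respect page alignment of the advertised
areas.

```c
#include <stdbool.h>
#include <stdint.h>

struct sparse_area {
	uint64_t offset;
	uint64_t size;
};

/*
 * Compute the mappable areas of a BAR once the component register block
 * [comp_off, comp_off + comp_size) is carved out. Returns the number of
 * areas written to out[] (0, 1 or 2), or -1 if the block does not fit in
 * the BAR. Handles the block at the start, middle or end of the BAR.
 */
static int build_sparse_areas(uint64_t bar_size, uint64_t comp_off,
			      uint64_t comp_size, struct sparse_area out[2])
{
	uint64_t comp_end;
	int n = 0;

	if (comp_off > bar_size || comp_size > bar_size - comp_off)
		return -1;
	comp_end = comp_off + comp_size;

	/* Window below the component block (absent if block is at start). */
	if (comp_off > 0)
		out[n++] = (struct sparse_area){ 0, comp_off };

	/* Window above the component block (absent if block is at end). */
	if (comp_end < bar_size)
		out[n++] = (struct sparse_area){ comp_end, bar_size - comp_end };

	return n;
}

/* True if [start, start + len) intersects the component register block. */
static bool overlaps_comp_regs(uint64_t start, uint64_t len,
			       uint64_t comp_off, uint64_t comp_size)
{
	return len && comp_size &&
	       start < comp_off + comp_size && comp_off < start + len;
}
```

An mmap request is then rejected only when the overlap check is true; a
request that lies entirely inside one of the advertised areas passes.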
CXL register defines moved to uapi/cxl/cxl_regs.h (patch 3)
v1 placed the component register defines in a private header
(include/cxl/cxl_regs.h). v2 moves them to include/uapi/cxl/cxl_regs.h
so VMMs can include them directly without duplicating definitions.
HDM API simplification (patch 1)
v1 exported cxl_get_hdm_reg_info(), which returned a raw struct with
offset and size fields. v2 replaces it with cxl_get_hdm_info(), which
uses the cached count already populated by cxl_probe_component_regs()
and returns a single struct with all HDM metadata, removing the need
for callers to re-read the hardware.
cxl_await_range_active() split (patch 4)
cxl_await_media_ready() requires a CXLMDEV mailbox register, which
Type-2 accelerators may not have. v2 splits out cxl_await_range_active()
so the HDM range-active poll can be used independently of the media
ready path.
LOCK→0 transition in HDM ctrl write emulation (patch 11)
v1 did not handle the case where a guest tries to clear the LOCK bit
to reprogram a firmware-committed decoder. v2 allows this transition
and re-programs the hardware accordingly.
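As a rough model of the write path described here, the sketch below
applies a guest write to a shadow HDM decoder control register. It is a
simplification under stated assumptions and is not the code in patch 11:
the function name is hypothetical, the writable mask covers only the
IG/IW fields, LOCK, COMMIT and the target-type bit, the COMMITTED latch
is modelled as simply following COMMIT, and the hardware re-programming
that the real driver performs is omitted.

```c
#include <stdbool.h>
#include <stdint.h>

#define HDM_CTRL_IG_IW_MASK  0x000000FFu  /* bits 7:0: IG and IW fields */
#define HDM_CTRL_LOCK        (1u << 8)    /* Lock On Commit */
#define HDM_CTRL_COMMIT      (1u << 9)
#define HDM_CTRL_COMMITTED   (1u << 10)   /* read-only to the guest */
#define HDM_CTRL_COMMIT_ERR  (1u << 11)   /* read-only to the guest */
#define HDM_CTRL_TYPE        (1u << 12)   /* target device type */

#define HDM_CTRL_RW_MASK \
	(HDM_CTRL_IG_IW_MASK | HDM_CTRL_LOCK | HDM_CTRL_COMMIT | HDM_CTRL_TYPE)
#define HDM_CTRL_RO_MASK  (HDM_CTRL_COMMITTED | HDM_CTRL_COMMIT_ERR)

/*
 * Return the new shadow value after the guest writes 'val'.
 * (Hypothetical helper, for illustration only.)
 */
static uint32_t hdm_ctrl_write(uint32_t shadow, uint32_t val)
{
	bool locked = (shadow & HDM_CTRL_LOCK) && (shadow & HDM_CTRL_COMMITTED);
	uint32_t next;

	/*
	 * A locked, committed decoder ignores writes unless the guest is
	 * clearing LOCK, which is the reprogram path for firmware-committed
	 * decoders.
	 */
	if (locked && (val & HDM_CTRL_LOCK))
		return shadow;

	/* Keep read-only bits from the shadow; drop reserved bits from val. */
	next = (shadow & HDM_CTRL_RO_MASK) | (val & HDM_CTRL_RW_MASK);

	/* Assumed latch model: COMMITTED tracks COMMIT. */
	if (next & HDM_CTRL_COMMIT)
		next |= HDM_CTRL_COMMITTED;
	else
		next &= ~HDM_CTRL_COMMITTED;

	return next;
}
```

In this model a guest that wants to reprogram a firmware-committed
decoder first writes the control register with LOCK cleared, after which
ordinary writes to the decoder are accepted again.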
Component register buffer allocation (patch 11)
v1 allocated only the HDM register sub-range in the COMP_REGS buffer.
v2 allocates the full CXL_COMPONENT_REG_BLOCK_SIZE so future patches
can expose other capability blocks (e.g. RAS, CXL.cache) without a
structural change.
Register region setup split (patch 16)
v1 tied region registration to the detection/init path. v2 splits it
into explicit vfio_cxl_register_cxl_region() and
vfio_cxl_register_comp_regs_region() functions called from
vfio_pci_open_device(), which is the correct point since vconfig and
pci_config_map are valid there.
VLA fix merged into selftest (patch 20)
v1 had a separate patch 20 fixing a VLA initialisation in
vfio_pci_irq_set(). v2 folds that fix into the selftest patch to
keep the standalone CXL change count at 19 functional patches.
Reviewer feedback addressed
===========================
Dave Jiang:
- Replace open-coded bit shifts with FIELD_GET() / FIELD_PREP()
throughout the HDM emulation code.
- Rename flag from VFIO_CXL_CAP_COMMITTED / VFIO_CXL_CAP_PRECOMMITTED
to VFIO_CXL_CAP_FIRMWARE_COMMITTED; the old names were ambiguous.
- Use memremap(MEMREMAP_WB) for the DPA kernel mapping instead of
ioremap_cache(), which selects the wrong memory-type descriptor on
ARM64.
- Use __free() / DEFINE_FREE() scope helpers for CXL resource cleanup
in the region management path, replacing the open-coded error
unwind.
- Remove the unused abs_off parameter from the HDM accessor.
- Rename cxl_dvsec_control_write() to better reflect its role.
Jonathan Cameron:
- Move CXL register defines to uapi/cxl/cxl_regs.h so VMMs can
consume them without a kernel header dependency.
- Use local variables with __free() rather than struct members for
intermediate ERR_PTR returns in the region management code; avoids
ambiguity about ownership on error paths.
- The assumption that a pre-committed decoder always exists at probe
time is too restrictive for hotplug scenarios; v2 makes the
precommitted path a fast-track that falls back to dynamic allocation
when no committed decoder is found.
Alex Williamson:
- The blanket size=0 / mmap-reject approach for the component BAR
prevents VMMs from accessing GPU register windows in the same BAR.
v2 implements the sparse-mmap capability described above.
Limitations and future work
===========================
Switched topologies with more than one caching agent are not yet
supported; that is planned for a follow-on series.
RAS/ECC handling and CXL core reset integration (cxl_reset support
from Srirangan [2]) will be added in subsequent patches.
Dependencies
============
[1] CXL Type-2 device basic support (Alejandro Lucero-Palau, v23):
https://lore.kernel.org/linux-cxl/20260201155438.2664640-1-alejandro.lucero-palau@amd.com/
[2] CXL reset support for Type-2 devices (Srirangan Madhavan):
https://lore.kernel.org/linux-cxl/20260306092322.148765-1-smadhavan@nvidia.com/
Cc: Alex Williamson <alex@shazbot.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Alejandro Lucero <alejandro.lucero-palau@amd.com>
Cc: linux-cxl@vger.kernel.org
Cc: kvm@vger.kernel.org
Co-developed-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
base-commit: 3f7938b1aec7f06d5b23adca83e4542fcf027001
--
Manish Honap (20):
cxl: Add cxl_get_hdm_info() for HDM decoder metadata
cxl: Declare cxl_find_regblock and cxl_probe_component_regs in public
header
cxl: Move component/HDM register defines to uapi/cxl/cxl_regs.h
cxl: Split cxl_await_range_active() from media-ready wait
cxl: Record BIR and BAR offset in cxl_register_map
vfio: UAPI for CXL-capable PCI device assignment
vfio/pci: Add CXL state to vfio_pci_core_device
vfio/pci: Add CONFIG_VFIO_CXL_CORE and stub CXL hooks
vfio/cxl: Detect CXL DVSEC and probe HDM block
vfio/pci: Export config access helpers
vfio/cxl: Introduce HDM decoder register emulation framework
vfio/cxl: Wait for HDM ranges and create memdev
vfio/cxl: CXL region management support
vfio/cxl: DPA VFIO region with demand fault mmap and reset zap
vfio/cxl: Virtualize CXL DVSEC config writes
vfio/cxl: Register regions with VFIO layer
vfio/pci: Advertise CXL cap and sparse component BAR to userspace
vfio/cxl: Provide opt-out for CXL feature
docs: vfio-pci: Document CXL Type-2 device passthrough
selftests/vfio: Add CXL Type-2 VFIO assignment test
Documentation/driver-api/index.rst | 1 +
Documentation/driver-api/vfio-pci-cxl.rst | 382 +++
drivers/cxl/core/pci.c | 64 +-
drivers/cxl/core/regs.c | 30 +
drivers/cxl/cxl.h | 46 -
drivers/vfio/pci/Kconfig | 2 +
drivers/vfio/pci/Makefile | 1 +
drivers/vfio/pci/cxl/Kconfig | 9 +
drivers/vfio/pci/cxl/vfio_cxl_config.c | 306 ++
drivers/vfio/pci/cxl/vfio_cxl_core.c | 880 ++++++
drivers/vfio/pci/cxl/vfio_cxl_emu.c | 509 ++++
drivers/vfio/pci/cxl/vfio_cxl_priv.h | 133 +
drivers/vfio/pci/vfio_pci.c | 32 +
drivers/vfio/pci/vfio_pci_config.c | 58 +-
drivers/vfio/pci/vfio_pci_core.c | 46 +-
drivers/vfio/pci/vfio_pci_priv.h | 66 +
drivers/vfio/pci/vfio_pci_rdwr.c | 16 +-
include/cxl/cxl.h | 51 +
include/linux/vfio_pci_core.h | 10 +
include/uapi/cxl/cxl_regs.h | 160 +
include/uapi/linux/vfio.h | 86 +
tools/testing/selftests/vfio/Makefile | 1 +
.../selftests/vfio/lib/vfio_pci_device.c | 3 +-
.../selftests/vfio/vfio_cxl_type2_test.c | 2631 +++++++++++++++++
24 files changed, 5459 insertions(+), 64 deletions(-)
create mode 100644 Documentation/driver-api/vfio-pci-cxl.rst
create mode 100644 drivers/vfio/pci/cxl/Kconfig
create mode 100644 drivers/vfio/pci/cxl/vfio_cxl_config.c
create mode 100644 drivers/vfio/pci/cxl/vfio_cxl_core.c
create mode 100644 drivers/vfio/pci/cxl/vfio_cxl_emu.c
create mode 100644 drivers/vfio/pci/cxl/vfio_cxl_priv.h
create mode 100644 include/uapi/cxl/cxl_regs.h
create mode 100644 tools/testing/selftests/vfio/vfio_cxl_type2_test.c
--
2.25.1