* [PATCH v2 00/20] vfio/pci: Add CXL Type-2 device passthrough support
From: mhonap @ 2026-04-01 14:38 UTC
  To: alwilliamson, dan.j.williams, jonathan.cameron, dave.jiang,
	alejandro.lucero-palau, dave, alison.schofield, vishal.l.verma,
	ira.weiny, dmatlack, shuah, jgg, yishaih, skolothumtho,
	kevin.tian, ankita
  Cc: vsethi, cjia, targupta, zhiw, kjaju, linux-kselftest,
	linux-kernel, linux-cxl, kvm, mhonap, Alex Williamson,
	Jonathan Cameron

From: Manish Honap <mhonap@nvidia.com>

CXL Type-2 accelerators (e.g. CXL.mem-capable GPUs) cannot be passed
through to virtual machines with stock vfio-pci because the driver has
no concept of HDM decoder management, DPA region exposure, or component
register emulation.  This series wires all of that into vfio-pci-core
behind a new CONFIG_VFIO_CXL_CORE optional module, without requiring a
variant driver.

When a CXL Device DVSEC (Vendor ID 0x1E98, ID 0x0000) is detected at
device open time, the driver:

  - Probes the HDM Decoder Capability block in the component registers
    and allocates a DPA region through the CXL subsystem.  On devices
    where firmware has already committed a decoder, the kernel skips
    allocation and re-uses the committed range.

  - Builds a kernel-owned shadow of the HDM register block.  The VMM
    reads and writes this shadow through a dedicated COMP_REGS VFIO
    region rather than touching the hardware directly.  The kernel
    enforces CXL 3.1 bit-field rules: reserved bits, read-only bits,
    the COMMIT/COMMITTED latch, and the LOCK→0 reprogram path for
    firmware-committed decoders.

  - Exposes the DPA range as a second VFIO region (VFIO_REGION_SUBTYPE_CXL)
    backed by the kernel-assigned HPA.  PTEs are inserted lazily on first
    page fault and torn down atomically under memory_lock during FLR.

  - Intercepts writes to the CXL DVSEC configuration-space registers
    (Control, Status, Control2, Status2, Lock, Range Base) and replays
    them through a per-device vconfig shadow, enforcing RWL/RW1CS/RWO
    access semantics and the CONFIG_LOCK one-shot latch.

  - Returns a VFIO_DEVICE_INFO_CAP_CXL capability (id=6) carrying the
    HDM register BAR index and offset, commit flags, and the indices of
    the DPA and COMP_REGS regions.  HDM decoder count and the HDM block
    offset within COMP_REGS are derivable by the VMM from the CXL
    Capability Array in the COMP_REGS region itself, so they are not
    duplicated in the capability struct.

  - Builds a sparse-mmap capability for the component register BAR so
    VMMs can map GPU/accelerator register windows while the kernel
    protects the CXL component register block.  Three physical layouts
    are handled: component block at the BAR end, at the start, and in
    the middle.

  - Provides a module parameter (disable_cxl=1) and a per-device flag
    (vdev->disable_cxl) for suppressing the feature without recompiling.

  - Includes selftests covering device detection, capability parsing,
    region enumeration, HDM register emulation, DPA mmap with page-fault
    insertion, FLR invalidation, and DVSEC register emulation.
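
As an illustration of the access semantics mentioned for the DVSEC
vconfig shadow, the following is a minimal user-space sketch, not code
from the series.  The struct, the function names and the mask values
used in the usage note are hypothetical.  It models plain RW bits, RWL
bits (writable only while the lock is clear), RW1C bits (cleared by
writing 1), and a one-shot lock latch; RWO and the sticky part of RW1CS
are not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-register description; not the series' actual layout. */
struct vreg_desc {
	uint16_t rw_mask;	/* bits the guest may always change */
	uint16_t rwl_mask;	/* bits writable only while the lock is clear */
	uint16_t rw1c_mask;	/* bits cleared by writing 1 */
};

/* Apply a guest write of 'val' to the shadow value 'cur'. */
static uint16_t vreg_write(const struct vreg_desc *d, uint16_t cur,
			   uint16_t val, bool locked)
{
	uint16_t writable = d->rw_mask;
	uint16_t next;

	if (!locked)
		writable |= d->rwl_mask;

	/* Replace only the writable bits; all other bits keep their value. */
	next = (cur & ~writable) | (val & writable);

	/* RW1C: a written 1 clears the bit, a written 0 leaves it alone. */
	next &= ~(val & d->rw1c_mask);

	return next;
}

/* One-shot latch: once bit 0 is set, later writes cannot clear it. */
static uint16_t lock_write(uint16_t cur, uint16_t val)
{
	return cur | (val & 0x1);
}
```

With rw_mask=0x00F0, rwl_mask=0x000F and rw1c_mask=0x8000, a locked
write of 0x00AA over 0x8005 changes only bits 7:4, while the same write
with the lock clear also updates bits 3:0.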

The series applies on top of the cxl/next branch at the base commit
given at the end of this cover letter, plus Alejandro's v23 Type-2
device support patches [1].

Series structure
================

  Patches 1-5 extend the CXL subsystem with the APIs vfio-pci needs.

  Patches 6-8 add the vfio-pci-core plumbing (UAPI, device state,
  Kconfig/build).

  Patches 9-15 implement the core device lifecycle: detection, HDM
  emulation, media readiness, region management, DPA region, and DVSEC
  emulation.

  Patches 16-18 wire everything together at open/close time and
  populate the VFIO ioctl paths.

  Patches 19-20 add documentation and selftests.

Changes since v1
================

UAPI struct minimization (patch 6)

  v1 carried hdm_count, hdm_regs_size, hdm_decoder_offset, dpa_size,
  and a pad byte in vfio_device_info_cap_cxl. The four data fields are
  derivable from data the VMM already has: hdm_count and the HDM block
  offset come from the CXL Capability Array in the COMP_REGS region,
  hdm_regs_size is implicit in the COMP_REGS region size, and dpa_size
  is the DPA region size.  v2 drops them and replaces pad with
  reserved[3].  The VFIO_CXL_CAP_PRECOMMITTED flag is gone; the single
  VFIO_CXL_CAP_FIRMWARE_COMMITTED flag covers both the committed and
  precommitted cases.  VFIO_CXL_CAP_CACHE_CAPABLE is added to expose
  the HDM-DB (CXL.cache) capability bit.
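
  To make the derivation concrete, here is a hedged sketch of how a VMM
  might recover the HDM block offset and decoder count from a snapshot
  of the CXL.cache/mem register area.  The function name and macros are
  hypothetical, and where that area starts inside the COMP_REGS region
  is an assumption to check against the series' documentation.  The
  field layout (capability ID in bits 15:0, array size in bits 31:24,
  pointer in bits 31:20, HDM capability ID 5) and the count encoding
  (0 means one decoder, otherwise twice the field value) follow the
  kernel's existing defines as I understand them; larger encodings from
  newer spec revisions are not handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CM_CAP_ID(w)			((w) & 0xffffu)
#define CM_CAP_ARRAY_SIZE(w)		(((w) >> 24) & 0xffu)
#define CM_CAP_PTR(w)			(((w) >> 20) & 0xfffu)
#define CM_CAP_ID_ARRAY			0x1u
#define CM_CAP_ID_HDM			0x5u
#define HDM_CAP_DECODER_COUNT(w)	((w) & 0xfu)

/*
 * 'cm' points at a copy of the CXL.cache/mem register area, 'nwords' is
 * its length in 32-bit words.  Returns the decoder count and stores the
 * byte offset of the HDM block in *hdm_off, or returns -1 on failure.
 */
static int hdm_lookup(const uint32_t *cm, size_t nwords, uint32_t *hdm_off)
{
	uint32_t i, entries, field;

	if (nwords == 0 || CM_CAP_ID(cm[0]) != CM_CAP_ID_ARRAY)
		return -1;

	entries = CM_CAP_ARRAY_SIZE(cm[0]);
	for (i = 1; i <= entries && i < nwords; i++) {
		if (CM_CAP_ID(cm[i]) != CM_CAP_ID_HDM)
			continue;

		*hdm_off = CM_CAP_PTR(cm[i]);
		if (*hdm_off % 4 || *hdm_off / 4 >= nwords)
			return -1;

		/* First register of the HDM block is the capability register. */
		field = HDM_CAP_DECODER_COUNT(cm[*hdm_off / 4]);
		return field ? (int)field * 2 : 1;
	}

	return -1;
}
```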

Component BAR access: sparse mmap instead of blanket rejection (patch 17)

  v1 returned size=0 for the component BAR and rejected all mmap and
  r/w access to it. That broke GPU passthrough scenarios where the
  device puts accelerator register windows in the same BAR as the CXL
  component registers. v2 replaces the blanket rejection with a
  sparse-mmap capability that advertises only the GPU register windows,
  carving out the component register block.  vfio_cxl_mmap_overlaps_comp_regs()
  rejects only the sub-range covering [comp_reg_offset, comp_reg_offset
  + comp_reg_size); everything else in the BAR remains mappable.
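
  The carve-out can be pictured with the sketch below.  It is an
  illustration under stated assumptions rather than the code in patch
  17: the struct and function names are hypothetical, and page
  alignment of the resulting areas is ignored.  Given the BAR size and
  the component register block, it yields the mappable areas for the
  three layouts (block at the end, at the start, or in the middle) and
  an interval-overlap test of the kind an mmap path would apply.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct sparse_area {
	uint64_t offset;
	uint64_t size;
};

/* True if [off, off + len) intersects [comp_off, comp_off + comp_size). */
static bool overlaps_comp_regs(uint64_t off, uint64_t len,
			       uint64_t comp_off, uint64_t comp_size)
{
	return off < comp_off + comp_size && comp_off < off + len;
}

/*
 * Fill 'out' with the mappable areas of a BAR, excluding the component
 * register block.  Returns the number of areas (0..2) or -1 if the
 * block does not fit inside the BAR.
 */
static int carve_sparse_areas(uint64_t bar_size, uint64_t comp_off,
			      uint64_t comp_size, struct sparse_area out[2])
{
	uint64_t comp_end = comp_off + comp_size;
	int n = 0;

	if (comp_size == 0 || comp_end < comp_off || comp_end > bar_size)
		return -1;

	/* Window below the component block (absent when it starts the BAR). */
	if (comp_off > 0)
		out[n++] = (struct sparse_area){ 0, comp_off };

	/* Window above the component block (absent when it ends the BAR). */
	if (comp_end < bar_size)
		out[n++] = (struct sparse_area){ comp_end, bar_size - comp_end };

	return n;
}
```

  A block in the middle of the BAR produces two areas; a block at either
  edge produces one.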

CXL register defines moved to uapi/cxl/cxl_regs.h (patch 3)

  v1 placed the component register defines in a private header
  (include/cxl/cxl_regs.h). v2 moves them to include/uapi/cxl/cxl_regs.h
  so VMMs can include them directly without duplicating definitions.

HDM API simplification (patch 1)

  v1 exported cxl_get_hdm_reg_info() which returned a raw struct with
  offset and size fields. v2 replaces it with cxl_get_hdm_info() which
  uses the cached count already populated by cxl_probe_component_regs()
  and returns a single struct with all HDM metadata, removing the need
  for callers to re-read the hardware.

cxl_await_range_active() split (patch 4)

  cxl_await_media_ready() requires a CXLMDEV mailbox register, which
  Type-2 accelerators may not have.  v2 splits out cxl_await_range_active()
  so the HDM range-active poll can be used independently of the media
  ready path.

LOCK→0 transition in HDM ctrl write emulation (patch 11)

  v1 did not handle the case where a guest tries to clear the LOCK bit
  to reprogram a firmware-committed decoder. v2 allows this transition
  and re-programs the hardware accordingly.
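
  One plausible reading of the emulated control-register behaviour is
  sketched below.  This is an assumption-laden illustration, not the
  logic of patch 11: the function name is hypothetical, the emulated
  commit always succeeds, and bits above 11 are treated as reserved.
  Bit positions (IG 3:0, IW 7:4, LOCK 8, COMMIT 9, COMMITTED 10, error
  11) follow the kernel's HDM decoder control defines as I recall them.

```c
#include <assert.h>
#include <stdint.h>

#define HDM_CTRL_IG_MASK	0x0000000fu
#define HDM_CTRL_IW_MASK	0x000000f0u
#define HDM_CTRL_LOCK		(1u << 8)
#define HDM_CTRL_COMMIT		(1u << 9)
#define HDM_CTRL_COMMITTED	(1u << 10)	/* read-only to the guest */
#define HDM_CTRL_COMMIT_ERROR	(1u << 11)	/* read-only to the guest */
#define HDM_CTRL_GUEST_RW	(HDM_CTRL_IG_MASK | HDM_CTRL_IW_MASK | \
				 HDM_CTRL_LOCK | HDM_CTRL_COMMIT)

/* Apply a guest write of 'val' to the shadow control value 'cur'. */
static uint32_t hdm_ctrl_write(uint32_t cur, uint32_t val)
{
	uint32_t next;

	if (cur & HDM_CTRL_LOCK) {
		/* Locked: ignore the write unless it clears LOCK. */
		if (val & HDM_CTRL_LOCK)
			return cur;
		/* LOCK->0: drop the lock, then process the write normally. */
		cur &= ~HDM_CTRL_LOCK;
	}

	/* Reserved and read-only bits keep their shadow value. */
	next = (cur & ~HDM_CTRL_GUEST_RW) | (val & HDM_CTRL_GUEST_RW);

	/* COMMITTED latches with COMMIT; clearing COMMIT uncommits. */
	if (next & HDM_CTRL_COMMIT)
		next |= HDM_CTRL_COMMITTED;
	else
		next &= ~HDM_CTRL_COMMITTED;

	return next;
}
```

  Starting from a firmware-committed value such as 0x721, a write that
  keeps LOCK set is ignored, while a write with LOCK clear unlocks the
  decoder so that later writes can clear COMMIT and reprogram it.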

Component register buffer allocation (patch 11)

  v1 allocated only the HDM register sub-range in the COMP_REGS buffer.
  v2 allocates the full CXL_COMPONENT_REG_BLOCK_SIZE so future patches
  can expose other capability blocks (e.g. RAS, CXL.cache) without a
  structural change.

Register region setup split (patch 16)

  v1 tied region registration to the detection/init path.  v2 splits it
  into explicit vfio_cxl_register_cxl_region() and
  vfio_cxl_register_comp_regs_region() functions called from
  vfio_pci_open_device(), which is the correct point since vconfig and
  pci_config_map are valid there.

VLA fix merged into selftest (patch 20)

  v1 had a separate patch 20 fixing a VLA initialisation in
  vfio_pci_irq_set().  v2 folds that fix into the selftest patch to
  keep the standalone CXL change count at 19 functional patches.

Reviewer feedback addressed
===========================

Dave Jiang:
  - Replace open-coded bit shifts with FIELD_GET() / FIELD_PREP()
    throughout the HDM emulation code.
  - Rename flag from VFIO_CXL_CAP_COMMITTED / VFIO_CXL_CAP_PRECOMMITTED
    to VFIO_CXL_CAP_FIRMWARE_COMMITTED; the old names were ambiguous.
  - Use memremap(MEMREMAP_WB) for the DPA kernel mapping instead of
    ioremap_cache(), which selects the wrong memory-type descriptor on
    ARM64.
  - Use __free() / DEFINE_FREE() scope helpers for CXL resource cleanup
    in the region management path, replacing the open-coded error
    unwind.
  - Remove the unused abs_off parameter from the HDM accessor.
  - Rename cxl_dvsec_control_write() to better reflect its role.

Jonathan Cameron:
  - Move CXL register defines to uapi/cxl/cxl_regs.h so VMMs can
    consume them without a kernel header dependency.
  - Use local variables with __free() rather than struct members for
    intermediate ERR_PTR returns in the region management code; avoids
    ambiguity about ownership on error paths.
  - The assumption that a pre-committed decoder always exists at probe
    time is too restrictive for hotplug scenarios; v2 makes the
    precommitted path a fast-track that falls back to dynamic allocation
    when no committed decoder is found.
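
The local-variable ownership pattern can be shown outside the kernel
with the compiler's cleanup attribute, which is the mechanism the
kernel's __free() helpers are built on.  The sketch below is a
user-space analogue with hypothetical names (res_get, res_put,
setup_region, dev_state); it is not code from the series and requires
GCC or Clang.  The resource lives in a local until every step has
succeeded and is only then handed to the long-lived struct.

```c
#include <assert.h>
#include <stdlib.h>

static int live_allocs;		/* counts outstanding resources */

struct res {
	int id;
};

static struct res *res_get(int id)
{
	struct res *r = malloc(sizeof(*r));

	if (r) {
		r->id = id;
		live_allocs++;
	}
	return r;
}

static void res_put(struct res *r)
{
	if (r) {
		free(r);
		live_allocs--;
	}
}

/* Cleanup handler: receives a pointer to the annotated variable. */
static void res_cleanup(struct res **rp)
{
	res_put(*rp);
}

struct dev_state {
	struct res *region;
};

static int setup_region(struct dev_state *st, int fail_late)
{
	struct res *r __attribute__((cleanup(res_cleanup))) = res_get(1);

	if (!r)
		return -1;
	if (fail_late)
		return -2;	/* r is released automatically here */

	st->region = r;
	r = NULL;		/* ownership moved; cleanup sees NULL */
	return 0;
}
```

On the failing path nothing is left in dev_state and nothing leaks; on
success the struct member is the sole owner.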

Alex Williamson:
  - The blanket size=0 / mmap-reject approach for the component BAR
    prevents VMMs from accessing GPU register windows in the same BAR.
    v2 implements the sparse-mmap capability described above.

Limitations and future work
===========================

  Switched topologies with more than one caching agent are not yet
  supported; that is planned for a follow-on series.

  RAS/ECC handling and CXL core reset integration (cxl_reset support
  from Srirangan [2]) will be added in subsequent patches.

Dependencies
============

[1] CXL Type-2 device basic support (Alejandro Lucero-Palau, v23):
    https://lore.kernel.org/linux-cxl/20260201155438.2664640-1-alejandro.lucero-palau@amd.com/

[2] CXL reset support for Type-2 devices (Srirangan Madhavan):
    https://lore.kernel.org/linux-cxl/20260306092322.148765-1-smadhavan@nvidia.com/

Cc: Alex Williamson <alex@shazbot.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Alejandro Lucero <alejandro.lucero-palau@amd.com>
Cc: linux-cxl@vger.kernel.org
Cc: kvm@vger.kernel.org

Co-developed-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>

base-commit: 3f7938b1aec7f06d5b23adca83e4542fcf027001
--

Manish Honap (20):
  cxl: Add cxl_get_hdm_info() for HDM decoder metadata
  cxl: Declare cxl_find_regblock and cxl_probe_component_regs in public
    header
  cxl: Move component/HDM register defines to uapi/cxl/cxl_regs.h
  cxl: Split cxl_await_range_active() from media-ready wait
  cxl: Record BIR and BAR offset in cxl_register_map
  vfio: UAPI for CXL-capable PCI device assignment
  vfio/pci: Add CXL state to vfio_pci_core_device
  vfio/pci: Add CONFIG_VFIO_CXL_CORE and stub CXL hooks
  vfio/cxl: Detect CXL DVSEC and probe HDM block
  vfio/pci: Export config access helpers
  vfio/cxl: Introduce HDM decoder register emulation framework
  vfio/cxl: Wait for HDM ranges and create memdev
  vfio/cxl: CXL region management support
  vfio/cxl: DPA VFIO region with demand fault mmap and reset zap
  vfio/cxl: Virtualize CXL DVSEC config writes
  vfio/cxl: Register regions with VFIO layer
  vfio/pci: Advertise CXL cap and sparse component BAR to userspace
  vfio/cxl: Provide opt-out for CXL feature
  docs: vfio-pci: Document CXL Type-2 device passthrough
  selftests/vfio: Add CXL Type-2 VFIO assignment test

 Documentation/driver-api/index.rst            |    1 +
 Documentation/driver-api/vfio-pci-cxl.rst     |  382 +++
 drivers/cxl/core/pci.c                        |   64 +-
 drivers/cxl/core/regs.c                       |   30 +
 drivers/cxl/cxl.h                             |   46 -
 drivers/vfio/pci/Kconfig                      |    2 +
 drivers/vfio/pci/Makefile                     |    1 +
 drivers/vfio/pci/cxl/Kconfig                  |    9 +
 drivers/vfio/pci/cxl/vfio_cxl_config.c        |  306 ++
 drivers/vfio/pci/cxl/vfio_cxl_core.c          |  880 ++++++
 drivers/vfio/pci/cxl/vfio_cxl_emu.c           |  509 ++++
 drivers/vfio/pci/cxl/vfio_cxl_priv.h          |  133 +
 drivers/vfio/pci/vfio_pci.c                   |   32 +
 drivers/vfio/pci/vfio_pci_config.c            |   58 +-
 drivers/vfio/pci/vfio_pci_core.c              |   46 +-
 drivers/vfio/pci/vfio_pci_priv.h              |   66 +
 drivers/vfio/pci/vfio_pci_rdwr.c              |   16 +-
 include/cxl/cxl.h                             |   51 +
 include/linux/vfio_pci_core.h                 |   10 +
 include/uapi/cxl/cxl_regs.h                   |  160 +
 include/uapi/linux/vfio.h                     |   86 +
 tools/testing/selftests/vfio/Makefile         |    1 +
 .../selftests/vfio/lib/vfio_pci_device.c      |    3 +-
 .../selftests/vfio/vfio_cxl_type2_test.c      | 2631 +++++++++++++++++
 24 files changed, 5459 insertions(+), 64 deletions(-)
 create mode 100644 Documentation/driver-api/vfio-pci-cxl.rst
 create mode 100644 drivers/vfio/pci/cxl/Kconfig
 create mode 100644 drivers/vfio/pci/cxl/vfio_cxl_config.c
 create mode 100644 drivers/vfio/pci/cxl/vfio_cxl_core.c
 create mode 100644 drivers/vfio/pci/cxl/vfio_cxl_emu.c
 create mode 100644 drivers/vfio/pci/cxl/vfio_cxl_priv.h
 create mode 100644 include/uapi/cxl/cxl_regs.h
 create mode 100644 tools/testing/selftests/vfio/vfio_cxl_type2_test.c

--
2.25.1


Thread overview: 27+ messages
2026-04-01 14:38 [PATCH v2 00/20] vfio/pci: Add CXL Type-2 device passthrough support mhonap
2026-04-01 14:38 ` [PATCH v2 01/20] cxl: Add cxl_get_hdm_info() for HDM decoder metadata mhonap
2026-04-01 14:38 ` [PATCH v2 02/20] cxl: Declare cxl_find_regblock and cxl_probe_component_regs in public header mhonap
2026-04-01 14:39 ` [PATCH v2 03/20] cxl: Move component/HDM register defines to uapi/cxl/cxl_regs.h mhonap
2026-04-01 14:39 ` [PATCH v2 04/20] cxl: Split cxl_await_range_active() from media-ready wait mhonap
2026-04-01 14:39 ` [PATCH v2 05/20] cxl: Record BIR and BAR offset in cxl_register_map mhonap
2026-04-01 14:39 ` [PATCH v2 06/20] vfio: UAPI for CXL-capable PCI device assignment mhonap
2026-04-01 14:39 ` [PATCH v2 07/20] vfio/pci: Add CXL state to vfio_pci_core_device mhonap
2026-04-01 14:39 ` [PATCH v2 08/20] vfio/pci: Add CONFIG_VFIO_CXL_CORE and stub CXL hooks mhonap
2026-04-01 14:39 ` [PATCH v2 09/20] vfio/cxl: Detect CXL DVSEC and probe HDM block mhonap
2026-04-01 14:39 ` [PATCH v2 10/20] vfio/pci: Export config access helpers mhonap
2026-04-01 14:39 ` [PATCH v2 11/20] vfio/cxl: Introduce HDM decoder register emulation framework mhonap
2026-04-01 14:39 ` [PATCH v2 12/20] vfio/cxl: Wait for HDM ranges and create memdev mhonap
2026-04-01 14:39 ` [PATCH v2 13/20] vfio/cxl: CXL region management support mhonap
2026-04-01 14:39 ` [PATCH v2 14/20] vfio/cxl: DPA VFIO region with demand fault mmap and reset zap mhonap
2026-04-01 14:39 ` [PATCH v2 15/20] vfio/cxl: Virtualize CXL DVSEC config writes mhonap
2026-04-01 14:39 ` [PATCH v2 16/20] vfio/cxl: Register regions with VFIO layer mhonap
2026-04-03 19:35   ` Dan Williams
2026-04-04 18:53     ` Jason Gunthorpe
2026-04-04 19:36       ` Dan Williams
2026-04-06 21:22         ` Gregory Price
2026-04-06 22:05           ` Jason Gunthorpe
2026-04-06 22:10         ` Jason Gunthorpe
2026-04-01 14:39 ` [PATCH v2 17/20] vfio/pci: Advertise CXL cap and sparse component BAR to userspace mhonap
2026-04-01 14:39 ` [PATCH v2 18/20] vfio/cxl: Provide opt-out for CXL feature mhonap
2026-04-01 14:39 ` [PATCH v2 19/20] docs: vfio-pci: Document CXL Type-2 device passthrough mhonap
2026-04-01 14:39 ` [PATCH v2 20/20] selftests/vfio: Add CXL Type-2 VFIO assignment test mhonap
