From: <mhonap@nvidia.com>
To: <djbw@kernel.org>, <alex@shazbot.org>, <jgg@ziepe.ca>,
	<jic23@kernel.org>, <dave.jiang@intel.com>, <ankita@nvidia.com>,
	<alejandro.lucero-palau@amd.com>, <alison.schofield@intel.com>,
	<dave@stgolabs.net>, <dmatlack@google.com>, <gourry@gourry.net>,
	<ira.weiny@intel.com>
Cc: <cjia@nvidia.com>, <kjaju@nvidia.com>, <vsethi@nvidia.com>,
	<zhiw@nvidia.com>, <mhonap@nvidia.com>, <kvm@vger.kernel.org>,
	<linux-cxl@vger.kernel.org>, <linux-doc@vger.kernel.org>,
	<linux-kernel@vger.kernel.org>, <linux-kselftest@vger.kernel.org>
Subject: [PATCH v3 00/11] vfio/pci: Add CXL Type-2 device passthrough support
Date: Thu, 25 Jun 2026 22:23:56 +0530
Message-ID: <20260625165407.1769572-1-mhonap@nvidia.com>

From: Manish Honap <mhonap@nvidia.com>

CXL Type-2 accelerators (CXL.mem-capable GPUs and similar) cannot be
passed through to virtual machines with stock vfio-pci because the
driver has no concept of HDM decoder management, HDM region exposure,
or component register virtualization.  This series adds those three
pieces, sufficient for a guest to use the device's firmware-committed
coherent memory under UVM / ATS.

v3 is a rewrite of the v2 framework [2], responding to Dan's request
in the v2 review for "less emulation, narrower interfaces, and a
closer mapping to the spec language."  In v3, cxl-core exposes four
EXPORT_SYMBOL_GPL helpers behind an opaque handle, and vfio-pci
becomes a thin transport on top of them.
Please see "Changes since v2" and "Reviewer feedback addressed" below for
the per-area summary.

Motivation
==========

A CXL Type-2 device exposes its HDM-mapped device memory through HDM
decoders that BIOS programs and commits at boot.  To pass such a
device to a guest, vfio-pci has to do three things at once:

  1. Surface the firmware-committed HDM-mapped HPA range as a guest-
     mmappable region.

  2. Surface a CXL-spec-compliant view of the CXL Device DVSEC body,
     the HDM Decoder Capability block, and the CXL.cache/mem cap-array
     prefix, so the guest's CXL driver enumerates the same topology
     the host saw.

  3. Keep the host's committed decoder configuration intact (the
     physical decoder is never reprogrammed) while letting the guest
     observe and manage a shadow that follows the per-field write
     semantics in the spec.

The series builds on Alejandro Lucero-Palau's v28 Type-2 enabling
work [1], applied on for-7.3/cxl-type2-enabling (sfc is the in-tree
consumer today).  vfio-pci becomes the second consumer.

Architecture
============

cxl-core owns the CXL semantics.  A new file
drivers/cxl/core/passthrough.c (gated by hidden Kconfig
CXL_VFIO_PASSTHROUGH) provides four exported symbols:

    struct cxl_passthrough *
    devm_cxl_passthrough_create(struct device *dev,
                                struct cxl_dev_state *cxlds);

    int cxl_passthrough_dvsec_rw(p, off, val, sz, write);
    int cxl_passthrough_hdm_rw  (p, off, val,      write);
    int cxl_passthrough_cm_rw   (p, off, val,      write);

cxl_passthrough is an opaque handle; vfio-pci sees no cxl-internal
struct pointers.  The shadows are snapshotted at create time: the
DVSEC body from PCI config space dword by dword, the CM cap-array and
HDM block from the cxl-core MMIO mapping at cxlds->reg_map.base.
Per-field write semantics follow the spec.

CXL r4.0 8.1.3 (DVSEC):
  - LOCK is RWO.
  - CONTROL/CONTROL2 are RWL, gated on CONFIG_LOCK.
  - STATUS/STATUS2 are RW1C.
  - RANGE1 is HwInit; RANGE2 is RsvdZ.

CXL r4.0 8.2.4.20 (HDM):
  - GLOBAL_CTRL is RW.
  - Decoder CTRL implements COMMIT/COMMITTED.
  - Decoder BASE/SIZE are RWL, gated on COMMITTED or LOCK_ON_COMMIT.
  - The capability header is HwInit.
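
For reviewers who want a mental model of the access classes listed
above, they can be thought of as a masked merge of the guest's write
into the shadow.  The following is a minimal userspace sketch under
stated assumptions, not the code in passthrough.c: struct shadow_reg,
its mask fields, and shadow_write() are hypothetical names, and a
single rwo_done flag stands in for whatever per-field tracking the
real helper uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of one shadowed register (not a cxl-core type). */
struct shadow_reg {
	uint32_t val;		/* value the guest reads back */
	uint32_t rw_mask;	/* RW: always guest-writable */
	uint32_t rwl_mask;	/* RWL: writable until the lock is set */
	uint32_t rwo_mask;	/* RWO: first write sticks, later writes dropped */
	uint32_t rw1c_mask;	/* RW1C: writing 1 clears the bit */
	bool rwo_done;		/* set once the RWO field has been written */
	/* Bits covered by no mask are HwInit/RsvdZ: guest writes are dropped. */
};

/* Merge a guest write into the shadow and return the new shadow value. */
static uint32_t shadow_write(struct shadow_reg *r, uint32_t wval, bool locked)
{
	uint32_t writable = r->rw_mask;
	uint32_t v = r->val;

	/* The field classes must not overlap. */
	assert(!(r->rw_mask & r->rwl_mask) && !(r->rwo_mask & r->rw1c_mask));
	assert(!((r->rw_mask | r->rwl_mask) & (r->rwo_mask | r->rw1c_mask)));

	/* RWL bits behave like RW bits only while the lock is clear. */
	if (!locked)
		writable |= r->rwl_mask;
	v = (v & ~writable) | (wval & writable);

	/* RWO: accept the first write, then freeze the field. */
	if (r->rwo_mask && !r->rwo_done) {
		v = (v & ~r->rwo_mask) | (wval & r->rwo_mask);
		r->rwo_done = true;
	}

	/* RW1C: a 1 in the write clears the corresponding status bit. */
	v &= ~(wval & r->rw1c_mask);

	r->val = v;
	return v;
}
```

The bit layout a caller passes in the masks is up to the register
being modelled; the sketch only shows how the five classes combine in
one write.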

vfio-pci becomes a thin transport.  The new module
drivers/vfio/pci/cxl/ exposes two VFIO regions.

  VFIO_REGION_SUBTYPE_CXL (HDM region): mmappable view of the
  HDM-mapped HPA. The mmap fault handler inserts PFNs for the physical
  HPA with vmf_insert_pfn(). pread/pwrite go through the
  memremap(MEMREMAP_WB) kernel virtual address captured at bind time.

  VFIO_REGION_SUBTYPE_CXL_COMP_REGS (component register shadow):
  pread/pwrite only, dword-aligned (-EINVAL on misalignment).
  Each dword dispatches by offset to cxl_passthrough_cm_rw() or
  cxl_passthrough_hdm_rw(). No shadow state on the vfio side; cxl-core
  enforces the spec.
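
The per-dword dispatch for the COMP_REGS region can be pictured as
below.  This is an illustrative userspace sketch only: struct
comp_layout, its fields, and comp_regs_dispatch() are hypothetical,
and the sketch merely counts which helper each dword would be routed
to instead of calling cxl_passthrough_cm_rw() or
cxl_passthrough_hdm_rw().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout of the shadow region; offsets are region-relative. */
struct comp_layout {
	uint64_t hdm_off;	/* start of the HDM decoder block */
	uint64_t hdm_size;	/* length of the HDM decoder block */
	uint64_t size;		/* total size of the COMP_REGS region */
};

/*
 * Walk an access dword by dword and count how many dwords would go to the
 * CM helper and how many to the HDM helper.  Returns 0 or -EINVAL.
 */
static int comp_regs_dispatch(const struct comp_layout *l, uint64_t off,
			      size_t count, unsigned int *n_cm,
			      unsigned int *n_hdm)
{
	assert(l->hdm_off <= l->size && l->hdm_size <= l->size - l->hdm_off);

	*n_cm = 0;
	*n_hdm = 0;

	/* Offset and length must both be dword-aligned. */
	if (count == 0 || ((off | count) & 3))
		return -EINVAL;
	/* Reject accesses that run past the end of the region. */
	if (off > l->size || count > l->size - off)
		return -EINVAL;

	for (uint64_t o = off; o < off + count; o += 4) {
		if (o >= l->hdm_off && o - l->hdm_off < l->hdm_size)
			(*n_hdm)++;	/* would call the HDM helper */
		else
			(*n_cm)++;	/* would call the CM helper */
	}
	return 0;
}
```

Keeping the routing decision purely offset-based is what lets the
vfio side stay stateless.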

CXL DVSEC config-space accesses use a clipping shim in
vfio_pci_config_rw_single(). A config-space chunk that crosses the
DVSEC body boundary is split: header bytes go through the generic
perm-bits path, body bytes go through cxl_passthrough_dvsec_rw().
The shim replaces v2's approach of repointing ecap_perms[].
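
The clipping itself amounts to splitting an access at the edges of the
DVSEC body.  A minimal sketch, assuming a half-open body range
[body_start, body_end) in config space; struct cfg_chunk and
clip_cfg_access() are hypothetical names, and the real shim in
vfio_pci_config_rw_single() may be structured differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One piece of a split config-space access (hypothetical type). */
struct cfg_chunk {
	uint32_t pos;		/* config-space offset of this piece */
	uint32_t len;		/* length in bytes */
	bool in_body;		/* true: DVSEC body path, false: perm-bits path */
};

/*
 * Split [pos, pos + count) at the edges of the DVSEC body
 * [body_start, body_end).  Fills at most three chunks and returns how
 * many were produced.
 */
static int clip_cfg_access(uint32_t pos, uint32_t count, uint32_t body_start,
			   uint32_t body_end, struct cfg_chunk out[3])
{
	uint32_t end = pos + count;
	int n = 0;

	assert(body_start < body_end);

	while (pos < end) {
		bool in_body = pos >= body_start && pos < body_end;
		uint32_t stop = end;

		/* Stop at the next boundary the access would cross. */
		if (pos < body_start && body_start < stop)
			stop = body_start;
		else if (in_body && body_end < stop)
			stop = body_end;

		out[n].pos = pos;
		out[n].len = stop - pos;
		out[n].in_body = in_body;
		n++;
		pos = stop;
	}
	return n;
}
```

An access that straddles the start of the body yields a header chunk
followed by a body chunk; an access entirely inside or outside the
body yields a single chunk.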

Sparse-mmap is exposed on the component BAR so userspace can mmap the
non-component portions directly; only the CXL component register
sub-range goes through pread/pwrite emulation. The CXL sub-range is
also excluded from vfio-pci-core's request_selected_regions() set
because cxl-core's devm_cxl_probe_mem() already holds a
request_mem_region() on it; the asymmetric skip is matched by an
asymmetric release on disable().
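
As an illustration of what the sparse-mmap layout looks like from
userspace, the mmappable areas are the parts of the BAR on either side
of the component register sub-range.  The sketch below is an
assumption-laden model, not the driver code: struct mmap_area and
sparse_mmap_areas() are hypothetical, and rounding the excluded hole
out to page boundaries is my assumption about how partial pages would
be handled.

```c
#include <assert.h>
#include <stdint.h>

/* One mmappable area inside the BAR (hypothetical type). */
struct mmap_area {
	uint64_t offset;
	uint64_t size;
};

/*
 * Compute the mmappable areas of a BAR of bar_size bytes when the
 * sub-range [comp_off, comp_off + comp_size) must stay trapped.
 * Returns the number of areas written to out[] (0, 1 or 2).
 */
static int sparse_mmap_areas(uint64_t bar_size, uint64_t comp_off,
			     uint64_t comp_size, uint64_t page_size,
			     struct mmap_area out[2])
{
	uint64_t mask, hole_start, hole_end;
	int n = 0;

	/* page_size must be a power of two; the hole must fit in the BAR. */
	assert(page_size && !(page_size & (page_size - 1)));
	assert(comp_off <= bar_size && comp_size <= bar_size - comp_off);

	/* Round the hole outwards so no page mixes trapped and direct access. */
	mask = page_size - 1;
	hole_start = comp_off & ~mask;
	hole_end = (comp_off + comp_size + mask) & ~mask;
	if (hole_end > bar_size)
		hole_end = bar_size;

	if (hole_start > 0)
		out[n++] = (struct mmap_area){ .offset = 0, .size = hole_start };
	if (hole_end < bar_size)
		out[n++] = (struct mmap_area){ .offset = hole_end,
					       .size = bar_size - hole_end };
	return n;
}
```

A BAR that consists only of the component register block produces no
mmappable area at all, in which case every access goes through
pread/pwrite.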

Scope and out-of-scope
======================

In scope (anything else is rejected at create time with -EOPNOTSUPP):

  - Firmware-committed devices (HOST_FIRMWARE_COMMITTED set).
  - Single HDM decoder (hdm_count == 1).
  - No interleave (IW == 0).

Out of scope, deferred for follow-on work:

  - Multi-decoder devices and interleave.
  - Guest-driven (non-firmware-committed) HDM commit.
  - Hotplug, FLR, and sibling-function reset of CXL Type-2 devices.

Changes since v2
================

This is a rewrite, not an incremental update.  The structure of the
series changed (20 patches in v2 to 11 in v3) because v3 collapses
v2 patches 9-15 (detection, HDM emulation, media readiness, region
management, HDM region, DVSEC emulation) into one cxl-core helper
file and one vfio-pci consumer.

Framework replaced by narrow opaque-handle helpers (patches 6, 8)

  v2 carried a generic register-emulation framework split across four
  state-machine files in cxl-core.
  v3 collapses it into one file: drivers/cxl/core/passthrough.c
  exposing the four EXPORT_SYMBOL_GPL helpers above behind a struct
  cxl_passthrough opaque handle.

Shadow ownership moved into cxl-core (patches 6, 8)

  vfio-pci no longer keeps any per-field state. It forwards
  (offset, value) into cxl-core, and cxl-core enforces the spec
  (RWO, RWL, RW1C, HwInit, RsvdZ) with explicit CXL r4.0 section
  references in the switch arms.

DVSEC config-space clipping shim (patch 8)

  v2 repointed ecap_perms[] to redirect CXL DVSEC reads and writes.
  v3 keeps ecap_perms[] untouched and clips per-config-access chunks
  at the DVSEC body boundary in vfio_pci_config_rw_single(); header bytes
  go through the generic perm-bits path, body bytes go through
  cxl_passthrough_dvsec_rw(). The shim is local to the per-device
  path.

CONFIG_VFIO_PCI_CXL gates the new module (patch 7)

  v2 had a CONFIG_VFIO_CXL_CORE Kconfig stub; v3 renames it to
  CONFIG_VFIO_PCI_CXL to match the vfio-pci naming convention.
  The hidden CXL_VFIO_PASSTHROUGH selects the cxl-core helper file
  on demand. With both disabled, the cxl-core size is unchanged.

UAPI rewritten with named fields (patch 5)

  vfio_device_info_cap_cxl in v3 carries:
    flags + HOST_FIRMWARE_COMMITTED bit
    hdm_region_idx
    comp_reg_region_idx
    comp_reg_bar
    comp_reg_offset
    comp_reg_size
  The DPA terminology is renamed to HDM region throughout.
  CACHE_CAPABLE (HDM-DB indicator) is dropped; it was informational
  only in v2 with no caller, and can be re-added later by a series
  that actually plumbs CXL.cache.

Selftests trimmed (patch 9)

  v2 carried selftests for device detection, capability parsing,
  region enumeration, HDM register emulation, HDM mmap with
  page-fault insertion, FLR invalidation, and DVSEC register
  emulation. v3 keeps a smoke-test set of six focused tests:

    device_is_cxl                  GET_INFO advertises FLAGS_CXL
                                   and a populated CAP_CXL.
    hdm_region_mmap_rw             mmap one page, write+read back.
    component_bar_sparse_mmap      SPARSE_MMAP cap excludes the
                                   CXL component register sub-range.
    comp_regs_cm_cap_array_read    pread of the CM cap-array
                                   header at CXL_CM_OFFSET succeeds
                                   (CAP_ID == 1).
    dvsec_lock_byte_read           pread of the DVSEC CONFIG_LOCK
                                   byte through the clipping shim
                                   succeeds.
    hdm_decoder_commit_fsm         COMMIT / COMMITTED state machine
                                   and LOCK_ON_COMMIT behaviour.

  FLR invalidation, page-fault insertion under load, and full
  DVSEC field-by-field write coverage are deferred to a follow-on
  selftest series. The current six are the minimal set that
  exercises the kernel-side contract end-to-end.
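
  For readers unfamiliar with the behaviour that
  hdm_decoder_commit_fsm checks, a simplified model of the shadow
  decoder is sketched below.  It is one possible reading of the rules,
  not the implementation: the struct and function names are
  hypothetical, the bit positions reflect my understanding of the HDM
  Decoder Control register and should be checked against the spec, and
  LOCK_ON_COMMIT is treated as fixed from the firmware snapshot.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed HDM Decoder Control bit positions; verify against the spec. */
#define CTRL_LOCK_ON_COMMIT	(1u << 8)
#define CTRL_COMMIT		(1u << 9)
#define CTRL_COMMITTED		(1u << 10)

/* Hypothetical shadow of one decoder; no hardware is touched. */
struct decoder_shadow {
	uint32_t ctrl;
	uint32_t base;
};

static bool decoder_locked(const struct decoder_shadow *d)
{
	return (d->ctrl & CTRL_COMMITTED) && (d->ctrl & CTRL_LOCK_ON_COMMIT);
}

/* Guest write to CTRL: only COMMIT is honoured, COMMITTED mirrors it. */
static uint32_t decoder_ctrl_write(struct decoder_shadow *d, uint32_t wval)
{
	/* A committed decoder with LOCK_ON_COMMIT ignores further writes. */
	if (decoder_locked(d))
		return d->ctrl;

	d->ctrl &= ~(CTRL_COMMIT | CTRL_COMMITTED);
	if (wval & CTRL_COMMIT)
		d->ctrl |= CTRL_COMMIT | CTRL_COMMITTED;

	/* Invariant of this model: COMMITTED implies COMMIT. */
	assert(!(d->ctrl & CTRL_COMMITTED) || (d->ctrl & CTRL_COMMIT));
	return d->ctrl;
}

/* Guest write to BASE: dropped while the decoder is committed. */
static bool decoder_base_write(struct decoder_shadow *d, uint32_t wval)
{
	if (d->ctrl & CTRL_COMMITTED)
		return false;
	d->base = wval;
	return true;
}
```

  In this model a firmware-committed, locked decoder (the only case in
  scope for v3) never changes state, while an unlocked shadow can be
  uncommitted, reprogrammed, and committed again.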

cxl-core prep patches split (patches 1-4)

  v3 keeps the cxl-side enablers from v2 patches 1-4 but each as
  a standalone change so the cxl maintainer can review the helper
  API independently of the vfio consumer:

    [1/11] cxl_get_hdm_info()
    [2/11] cxl_await_range_active() split from media-ready wait
    [3/11] cxl_register_map records BIR + BAR offset
    [4/11] component/HDM register defines moved to uapi/cxl/cxl_regs.h

Reviewer feedback addressed
===========================

Dan
---

- VFIO exposes HDM/host-visible region, not raw DPA; docs/UAPI say HDM
  region, DPA only inside cxl-core where appropriate.
- One vfio-pci device = one HDM region / one decoder, no interleave;
  hdm_count != 1 → -EOPNOTSUPP.
- Global HDM on DVSEC Range Base treated as legacy; RANGE1/RANGE2
  read-only snapshot, guest writes dropped.
- No guest/kernel lock games; DVSEC LOCK and HDM LOCK_ON_COMMIT RWO,
  fixed at create from firmware snapshot.
- Opaque cxl_passthrough handle only; vfio gets HPA via memdev probe +
  layout via cxl_get_hdm_info(), rw via helpers.
- No multi-region accelerator case in v3; single region enforced,
  multi-region deferred.
- cxl_await_range_active stays in cxl-core probe; not exported, vfio does
  not call it.
- No guest LOCK→0 reprogram; guest cannot clear LOCK to remap host HPA;
  kernel uncommit tied to COMMIT, not LOCK alone.

Jason / Gregory / Dan
---------------------

- memremap(WB) + request_mem_region on HPA; conflicting direct-map/EFI use
  fails probe with -EBUSY.

Jonathan
--------

- uapi/cxl/cxl_regs.h for register defines so VMMs need no private
  kernel headers.
- __free() locals on cxl-core/passthrough error paths instead of
  struct-owned temporaries.
- No "precommitted at probe" assumption; acquire checks COMMITTED in
  HDM shadow and refuses if missing.

Dave
----

- memremap(MEMREMAP_WB) for HDM host mapping (not ioremap_cache).
- Renamed cap flag to VFIO_CXL_CAP_HOST_FIRMWARE_COMMITTED for clarity.
- __free() / DEFINE_FREE() cleanup in new passthrough.c create path.

Patch series
============

 [1/11] cxl: Add cxl_get_hdm_info() helper for HDM decoder metadata
 [2/11] cxl: Split cxl_await_range_active() from media-ready wait
 [3/11] cxl: Record BIR and BAR offset in cxl_register_map
 [4/11] cxl: Move component/HDM register defines to
        uapi/cxl/cxl_regs.h
 [5/11] vfio: UAPI for CXL Type-2 device passthrough
 [6/11] cxl: Add register-virtualization helpers for vfio Type-2
        passthrough
 [7/11] vfio/pci: Add CONFIG_VFIO_PCI_CXL with bind-time CXL Type-2
        acquisition
 [8/11] vfio/pci/cxl: Add HDM + COMP_REGS regions and DVSEC clipping
        shim
 [9/11] selftests/vfio: Add CXL Type-2 device passthrough smoke test
[10/11] docs: vfio-pci: Document CXL Type-2 device passthrough
[11/11] vfio/pci: Provide opt-out for CXL Type-2 extensions

Dependencies
============

[1] [PATCH v28 0/5] Type2 device basic support
https://lore.kernel.org/linux-cxl/20260618181806.118745-1-alejandro.lucero-palau@amd.com/

[2] Previous version of this patch series
[PATCH v2 00/20] vfio/pci: Add CXL Type-2 device passthrough support
https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com/

[3] Companion QEMU series
[RFC 0/9] QEMU: CXL Type-2 device passthrough via vfio-pci
https://lore.kernel.org/linux-cxl/20260427181235.3003865-1-mhonap@nvidia.com/

Manish Honap (11):
  cxl: Add cxl_get_hdm_info() helper for HDM decoder metadata
  cxl: Split cxl_await_range_active() from media-ready wait
  cxl: Record BIR and BAR offset in cxl_register_map
  cxl: Move component/HDM register defines to uapi/cxl/cxl_regs.h
  vfio: UAPI for CXL Type-2 device passthrough
  cxl: Add register-virtualization helpers for vfio Type-2 passthrough
  vfio/pci: Add CONFIG_VFIO_PCI_CXL with bind-time CXL Type-2
    acquisition
  vfio/pci/cxl: Add HDM + COMP_REGS regions and DVSEC clipping shim
  selftests/vfio: Add CXL Type-2 device passthrough smoke test
  docs: vfio-pci: Document CXL Type-2 device passthrough
  vfio/pci: Provide opt-out for CXL Type-2 extensions

 Documentation/driver-api/index.rst            |   1 +
 Documentation/driver-api/vfio-pci-cxl.rst     | 282 ++++++
 drivers/cxl/Kconfig                           |   7 +
 drivers/cxl/core/Makefile                     |   1 +
 drivers/cxl/core/passthrough.c                | 590 ++++++++++++
 drivers/cxl/core/pci.c                        |  70 +-
 drivers/cxl/core/regs.c                       |  35 +
 drivers/cxl/cxl.h                             |  52 +-
 drivers/vfio/pci/Kconfig                      |   2 +
 drivers/vfio/pci/Makefile                     |   1 +
 drivers/vfio/pci/cxl/Kconfig                  |  34 +
 drivers/vfio/pci/cxl/Makefile                 |   2 +
 drivers/vfio/pci/cxl/vfio_cxl_core.c          | 889 ++++++++++++++++++
 drivers/vfio/pci/cxl/vfio_cxl_priv.h          |  71 ++
 drivers/vfio/pci/vfio_pci.c                   |   9 +
 drivers/vfio/pci/vfio_pci_config.c            |  31 +
 drivers/vfio/pci/vfio_pci_core.c              |  68 +-
 drivers/vfio/pci/vfio_pci_priv.h              |  93 ++
 drivers/vfio/pci/vfio_pci_rdwr.c              |  17 +
 include/cxl/cxl.h                             |  18 +
 include/cxl/passthrough.h                     | 121 +++
 include/linux/vfio_pci_core.h                 |   8 +
 include/uapi/cxl/cxl_regs.h                   |  63 ++
 include/uapi/linux/vfio.h                     |  46 +
 tools/testing/selftests/vfio/Makefile         |   1 +
 .../selftests/vfio/lib/vfio_pci_device.c      |  11 +-
 .../selftests/vfio/vfio_cxl_type2_test.c      | 350 +++++++
 27 files changed, 2821 insertions(+), 52 deletions(-)
 create mode 100644 Documentation/driver-api/vfio-pci-cxl.rst
 create mode 100644 drivers/cxl/core/passthrough.c
 create mode 100644 drivers/vfio/pci/cxl/Kconfig
 create mode 100644 drivers/vfio/pci/cxl/Makefile
 create mode 100644 drivers/vfio/pci/cxl/vfio_cxl_core.c
 create mode 100644 drivers/vfio/pci/cxl/vfio_cxl_priv.h
 create mode 100644 include/cxl/passthrough.h
 create mode 100644 include/uapi/cxl/cxl_regs.h
 create mode 100644 tools/testing/selftests/vfio/vfio_cxl_type2_test.c

base-commit: 90cf2e0d702c8a132ccbe72e7687f33c04c14658
-- 
2.25.1

