From: <mhonap@nvidia.com>
To: <aniketa@nvidia.com>, <ankita@nvidia.com>,
	<alwilliamson@nvidia.com>, <vsethi@nvidia.com>, <jgg@nvidia.com>,
	<mochs@nvidia.com>, <skolothumtho@nvidia.com>,
	<alejandro.lucero-palau@amd.com>, <dave@stgolabs.net>,
	<jonathan.cameron@huawei.com>, <dave.jiang@intel.com>,
	<alison.schofield@intel.com>, <vishal.l.verma@intel.com>,
	<ira.weiny@intel.com>, <dan.j.williams@intel.com>, <jgg@ziepe.ca>,
	<yishaih@nvidia.com>, <kevin.tian@intel.com>
Cc: <cjia@nvidia.com>, <targupta@nvidia.com>, <zhiw@nvidia.com>,
	<kjaju@nvidia.com>, <linux-kernel@vger.kernel.org>,
	<linux-cxl@vger.kernel.org>, <kvm@vger.kernel.org>,
	<mhonap@nvidia.com>
Subject: [PATCH 18/20] docs: vfio-pci: Document CXL Type-2 device passthrough
Date: Thu, 12 Mar 2026 02:04:38 +0530
Message-ID: <20260311203440.752648-19-mhonap@nvidia.com>
In-Reply-To: <20260311203440.752648-1-mhonap@nvidia.com>

From: Manish Honap <mhonap@nvidia.com>

Add a driver-api document describing the architecture, interfaces, and
operational constraints of CXL Type-2 device passthrough via vfio-pci-core.

CXL Type-2 devices (cache-coherent accelerators such as GPUs with attached
device memory) present unique passthrough requirements not covered by the
existing vfio-pci documentation:

- The host kernel retains ownership of the HDM decoder hardware through
  the CXL subsystem, so the guest cannot program decoders directly.
- Two additional VFIO device regions expose the emulated HDM register
  state (COMP_REGS) and the DPA memory window (DPA region) to userspace.
- DVSEC configuration space writes are intercepted and virtualized so
  that the guest cannot alter host-owned CXL.io / CXL.mem enable bits.
- Device reset (FLR) is coordinated through vfio_pci_ioctl_reset(): all
  DPA PTEs are zapped before the reset and re-faulted on demand afterward.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
 Documentation/driver-api/index.rst        |   1 +
 Documentation/driver-api/vfio-pci-cxl.rst | 216 ++++++++++++++++++++++
 2 files changed, 217 insertions(+)
 create mode 100644 Documentation/driver-api/vfio-pci-cxl.rst

diff --git a/Documentation/driver-api/index.rst b/Documentation/driver-api/index.rst
index 1833e6a0687e..7ec661846f6b 100644
--- a/Documentation/driver-api/index.rst
+++ b/Documentation/driver-api/index.rst
@@ -47,6 +47,7 @@ of interest to most developers working on device drivers.
    vfio-mediated-device
    vfio
    vfio-pci-device-specific-driver-acceptance
+   vfio-pci-cxl
 
 Bus-level documentation
 =======================
diff --git a/Documentation/driver-api/vfio-pci-cxl.rst b/Documentation/driver-api/vfio-pci-cxl.rst
new file mode 100644
index 000000000000..f2cbe2fdb036
--- /dev/null
+++ b/Documentation/driver-api/vfio-pci-cxl.rst
@@ -0,0 +1,216 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================================================
+VFIO PCI CXL Type-2 Device Passthrough
+====================================================
+
+Overview
+--------
+
+CXL (Compute Express Link) Type-2 devices are cache-coherent PCIe accelerators
+and GPUs that attach their own volatile memory (Device Physical Address space,
+or DPA) to the host memory fabric via the CXL protocol.  Examples include
+GPU/accelerator cards that expose coherent device memory to the host.
+
+When such a device is passed through to a virtual machine using ``vfio-pci``,
+the kernel CXL subsystem must remain in control of the Host-managed Device
+Memory (HDM) decoders that map the device's DPA into the host physical address
+(HPA) space.  A VMM such as QEMU cannot program HDM decoders directly; instead
+it uses a set of VFIO-specific regions and UAPI extensions described here.
+
+This support is compiled in when ``CONFIG_VFIO_CXL_CORE=y``.  It can be
+disabled at module load time for all devices bound to ``vfio-pci`` with::
+
+    modprobe vfio-pci disable_cxl=1
+
+Variant drivers can disable CXL extensions for individual devices by setting
+``vdev->disable_cxl = true`` in their probe function before registration.
+
+Device Detection
+----------------
+
+CXL Type-2 detection happens automatically when ``vfio-pci`` registers a
+device that has:
+
+1. A CXL Device DVSEC capability (PCIe DVSEC Vendor ID 0x1E98, ID 0x0000).
+2. Bit 2 (Mem_Capable) set in the CXL Capability register within that DVSEC.
+3. A PCI class code that is **not** ``0x050210`` (CXL Type-3 memory device).
+4. An HDM Decoder block discoverable via the Register Locator DVSEC.
+5. A pre-committed HDM decoder (BIOS/firmware programmed) with non-zero size.
+
+On successful detection ``VFIO_DEVICE_FLAGS_CXL`` is set in
+``vfio_device_info.flags`` alongside ``VFIO_DEVICE_FLAGS_PCI``.
+
+UAPI Extensions
+---------------
+
+VFIO_DEVICE_GET_INFO Capability: VFIO_DEVICE_INFO_CAP_CXL
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When ``VFIO_DEVICE_FLAGS_CXL`` is set the device info capability chain
+contains a ``vfio_device_info_cap_cxl`` structure (cap ID 6)::
+
+    struct vfio_device_info_cap_cxl {
+        struct vfio_info_cap_header header; /* id=6, version=1 */
+        __u8  hdm_count;          /* number of HDM decoders */
+        __u8  hdm_regs_bar_index; /* PCI BAR containing component registers */
+        __u16 pad;
+        __u32 flags;              /* VFIO_CXL_CAP_* flags */
+        __u64 hdm_regs_size;      /* size in bytes of the HDM decoder block */
+        __u64 hdm_regs_offset;    /* byte offset within the BAR to HDM block */
+        __u64 dpa_size;           /* total DPA size in bytes */
+        __u32 dpa_region_index;   /* index of the DPA device region */
+        __u32 comp_regs_region_index; /* index of the COMP_REGS device region */
+    };
+
+Flags:
+
+``VFIO_CXL_CAP_COMMITTED`` (bit 0)
+    The HDM decoder was committed by the kernel CXL subsystem.
+
+``VFIO_CXL_CAP_PRECOMMITTED`` (bit 1)
+    The HDM decoder was pre-committed by host firmware/BIOS.  The VMM does
+    not need to allocate CXL HPA space; the mapping is already live.
+
+VFIO Regions
+~~~~~~~~~~~~~
+
+A CXL Type-2 device exposes two additional device regions beyond the standard
+PCI BAR regions.  Their indices are reported in ``dpa_region_index`` and
+``comp_regs_region_index`` in the capability structure.
+
+**DPA Region** (subtype ``VFIO_REGION_SUBTYPE_CXL``)
+    Flags: ``VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_WRITE |
+    VFIO_REGION_INFO_FLAG_MMAP``
+
+    Represents the device's DPA memory mapped at the kernel-assigned HPA.
+    The VMM should map this region with mmap() to expose device memory to the
+    guest.  Page faults are handled lazily; the kernel inserts PFNs on first
+    access rather than at mmap() time.  During FLR/reset all PTEs are
+    invalidated and the region becomes inaccessible until the reset completes.
+
+    Read and write access via the region file descriptor is also supported;
+    it is routed through a kernel virtual mapping established with
+    ``ioremap_cache()``.
+
+**COMP_REGS Region** (subtype ``VFIO_REGION_SUBTYPE_CXL_COMP_REGS``)
+    Flags: ``VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_WRITE``
+    (no mmap).
+
+    An emulated region, accessible only through ``read()``/``write()``, that
+    exposes the HDM decoder registers.  The kernel shadows the hardware HDM
+    register state and enforces all bit-field rules (reserved bits, read-only
+    bits, commit semantics) on every write.  Only 32-bit aligned, 32-bit wide
+    accesses are permitted, matching the hardware requirement.
+
+    The VMM uses this region to read and write HDM decoder BASE, SIZE, and
+    CTRL registers.  Setting the COMMIT bit (bit 9) in a CTRL register causes
+    the kernel to immediately set the COMMITTED bit (bit 10) in the emulated
+    shadow state, allowing the VMM to detect the transition via a
+    ``notify_change`` callback.
+
+    The component register BAR itself (``hdm_regs_bar_index``) is hidden:
+    ``VFIO_DEVICE_GET_REGION_INFO`` for that BAR index returns ``size = 0``.
+    All HDM access must go through the COMP_REGS region.
+
+Region Type Identifiers::
+
+    /* type = PCI_VENDOR_ID_CXL | VFIO_REGION_TYPE_PCI_VENDOR_TYPE (0x80001e98) */
+    #define VFIO_REGION_SUBTYPE_CXL           1   /* DPA memory region */
+    #define VFIO_REGION_SUBTYPE_CXL_COMP_REGS 2   /* HDM register region */
+
+DVSEC Configuration Space Emulation
+-------------------------------------
+
+When ``CONFIG_VFIO_CXL_CORE=y`` the kernel installs a CXL-aware write handler
+for the ``PCI_EXT_CAP_ID_DVSEC`` (0x23) extended capability entry in the vfio-pci
+configuration space permission table.  This handler runs for every device
+opened under ``vfio-pci``; for non-CXL devices it falls through to the
+hardware write path unchanged.
+
+For CXL devices, writes to the following DVSEC registers are intercepted and
+emulated in ``vdev->vconfig`` (the per-device shadow configuration space):
+
++--------------------+--------+-------------------------------------------+
+| Register           | Offset | Emulation                                 |
++====================+========+===========================================+
+| CXL Control        | 0x0c   | RWL semantics; IO_Enable forced to 1;     |
+|                    |        | locked after Lock register bit 0 is set.  |
++--------------------+--------+-------------------------------------------+
+| CXL Status         | 0x0e   | Bit 14 (Viral_Status) is RW1CS.           |
++--------------------+--------+-------------------------------------------+
+| CXL Control2       | 0x10   | Bits 0, 3 forwarded to hardware; bits     |
+|                    |        | 1 and 2 trigger subsystem actions.        |
++--------------------+--------+-------------------------------------------+
+| CXL Status2        | 0x12   | Bit 3 (RW1CS) forwarded to hardware when  |
+|                    |        | Capability3 bit 3 is set.                 |
++--------------------+--------+-------------------------------------------+
+| CXL Lock           | 0x14   | RWO; once set, Control becomes read-only  |
+|                    |        | until conventional reset.                 |
++--------------------+--------+-------------------------------------------+
+| Range Base High/Lo | varies | Stored in vconfig; Base Low [27:0]        |
+|                    |        | reserved bits cleared.                    |
++--------------------+--------+-------------------------------------------+
+
+Reads of these registers return the emulated vconfig values.  Read-only
+registers (Capability, Range Size High/Low) are also served from vconfig,
+which is seeded from hardware at device open time.
+
+FLR and Reset Behaviour
+-----------------------
+
+During Function Level Reset (FLR):
+
+1. ``vfio_cxl_zap_region_locked()`` is called under the write side of
+   ``memory_lock``.  It sets ``region_active = false`` and calls
+   ``unmap_mapping_range()`` to invalidate all DPA region PTEs.
+
+2. Any concurrent page fault or ``read()``/``write()`` on the DPA region
+   sees ``region_active = false`` and returns ``VM_FAULT_SIGBUS`` or ``-EIO``
+   respectively.
+
+3. After reset completes, ``vfio_cxl_reactivate_region()`` re-reads the HDM
+   decoder state from hardware into ``comp_reg_virt[]`` (typically all zeros
+   after FLR).  For pre-committed decoders it sets ``region_active = true``
+   only if the COMMITTED bit is set in that fresh snapshot.  The VMM may
+   re-fault into the DPA region without issuing a new ``mmap()`` call.  Each
+   newly faulted page is scrubbed via ``memset_io()`` before the PFN is
+   inserted.
+
+VMM Integration Notes
+---------------------
+
+A VMM integrating CXL Type-2 passthrough should:
+
+1. Issue ``VFIO_DEVICE_GET_INFO`` and check ``VFIO_DEVICE_FLAGS_CXL``.
+2. Walk the capability chain to find ``VFIO_DEVICE_INFO_CAP_CXL`` (id = 6).
+3. Record ``dpa_region_index``, ``comp_regs_region_index``, ``dpa_size``,
+   ``hdm_count``, ``hdm_regs_offset``, and ``hdm_regs_size``.
+4. Map the DPA region (``dpa_region_index``) with mmap() to a guest physical
+   address.  The region supports ``PROT_READ | PROT_WRITE``.
+5. Open the COMP_REGS region (``comp_regs_region_index``) and attach a
+   ``notify_change`` callback to detect COMMIT transitions.  When bit 10
+   (COMMITTED) transitions from 0 to 1 in a CTRL register read, the VMM
+   should expose the corresponding DPA range to the guest and map the
+   relevant slice of the DPA mmap.
+6. For pre-committed devices (``VFIO_CXL_CAP_PRECOMMITTED`` set) the entire
+   DPA is already mapped and the VMM need not wait for a guest COMMIT.
+7. Program the guest CXL DVSEC registers (via VFIO config space write) to
+   reflect the guest's view.  The kernel emulates all register semantics
+   including the CONFIG_LOCK one-shot latch.
+
+Kernel Configuration
+--------------------
+
+``CONFIG_VFIO_CXL_CORE`` (bool)
+    Enable CXL Type-2 passthrough support in ``vfio-pci-core``.
+    Depends on ``CONFIG_VFIO_PCI_CORE``, ``CONFIG_CXL_BUS``, and
+    ``CONFIG_CXL_MEM``.
+
+References
+----------
+
+* CXL Specification 3.1, §8.1.3 — DVSEC for CXL Devices
+* CXL Specification 3.1, §8.2.4.20 — CXL HDM Decoder Capability Structure
+* ``include/uapi/linux/vfio.h`` — ``VFIO_DEVICE_INFO_CAP_CXL``,
+  ``VFIO_REGION_SUBTYPE_CXL``, ``VFIO_REGION_SUBTYPE_CXL_COMP_REGS``
-- 
2.25.1
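
Illustrative note, not part of the patch: step 2 of the "VMM Integration
Notes" asks the VMM to walk the device info capability chain. A minimal
sketch of that walk follows, assuming the usual VFIO layout in which each
capability starts with a header whose ``next`` field is an offset from the
start of the info buffer and 0 terminates the chain. The names
example_cap_header, example_find_cap() and EXAMPLE_CAP_ID_CXL are
hypothetical; the header is mirrored locally so the sketch builds without the
patched <linux/vfio.h>, and the ID value 6 is simply the one quoted in the
document above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of struct vfio_info_cap_header from <linux/vfio.h>. */
struct example_cap_header {
	uint16_t id;
	uint16_t version;
	uint32_t next;	/* offset of next capability from buffer start; 0 = end */
};

/* Capability ID quoted in the patch; hypothetical name, provisional value. */
#define EXAMPLE_CAP_ID_CXL 6

/*
 * Walk the capability chain inside an info buffer of 'len' bytes, starting
 * at 'first_off'.  Returns the offset of the first capability whose ID
 * matches 'id', or 0 if it is absent or the chain runs out of bounds.
 */
static size_t example_find_cap(const uint8_t *info, size_t len,
			       size_t first_off, uint16_t id)
{
	struct example_cap_header hdr;
	size_t off = first_off;
	size_t hops = 0;

	if (len < sizeof(hdr))
		return 0;

	/* 'hops' bounds the walk so a looping chain cannot spin forever. */
	while (off != 0 && off <= len - sizeof(hdr) && hops++ < len) {
		memcpy(&hdr, info + off, sizeof(hdr));
		if (hdr.id == id)
			return off;
		off = hdr.next;
	}
	return 0;
}
```

In a real VMM, ``info`` would be the buffer filled by VFIO_DEVICE_GET_INFO
with a sufficiently large argsz, and ``first_off`` would be
vfio_device_info.cap_offset, which is valid when VFIO_DEVICE_FLAGS_CAPS is
set. The bounds and hop checks are a defensive choice for the sketch, not
something the patch mandates.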

