From: <mhonap@nvidia.com>
To: <djbw@kernel.org>, <alex@shazbot.org>, <jgg@ziepe.ca>,
	<jic23@kernel.org>, <dave.jiang@intel.com>, <ankita@nvidia.com>,
	<alejandro.lucero-palau@amd.com>, <alison.schofield@intel.com>,
	<dave@stgolabs.net>, <dmatlack@google.com>, <gourry@gourry.net>,
	<ira.weiny@intel.com>
Cc: <cjia@nvidia.com>, <kjaju@nvidia.com>, <vsethi@nvidia.com>,
	<zhiw@nvidia.com>, <mhonap@nvidia.com>, <kvm@vger.kernel.org>,
	<linux-cxl@vger.kernel.org>, <linux-doc@vger.kernel.org>,
	<linux-kernel@vger.kernel.org>, <linux-kselftest@vger.kernel.org>
Subject: [PATCH v3 10/11] docs: vfio-pci: Document CXL Type-2 device passthrough
Date: Thu, 25 Jun 2026 22:24:06 +0530
Message-ID: <20260625165407.1769572-11-mhonap@nvidia.com>
In-Reply-To: <20260625165407.1769572-1-mhonap@nvidia.com>

From: Manish Honap <mhonap@nvidia.com>

Capture the ownership model, bind sequence, region layout, and the
DVSEC + HDM + CM cap-array virtualization contract for vfio-pci
Type-2 device passthrough in Documentation/driver-api/vfio-pci-cxl.rst.

cxl-core owns the CXL register virtualization through
devm_cxl_passthrough_create() and the cxl_passthrough_*_rw()
helpers; vfio-pci is a transport that forwards guest reads and
writes through them.  The HDM HPA range is mapped by vfio for the
mmappable HDM region.  Topology constraints and host-bridge decoder
limitations are listed under Known limitations.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
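Illustration only, not part of the commit: a minimal user-space sketch
of the COMP_REGS "dispatch by offset" table from the new document.  It
only classifies an access; it does not call the cxl-core helpers.  The
names comp_regs_classify(), comp_regs_target and SKETCH_CM_OFFSET are
hypothetical, the 0x1000 value is assumed to match CXL_CM_OFFSET, and
restricting each access to exactly one dword is a simplification.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: same value as CXL_CM_OFFSET in the cxl headers. */
#define SKETCH_CM_OFFSET 0x1000u

enum comp_regs_target {
	COMP_REGS_ZERO_FILL,	/* reserved: reads as zero, writes dropped */
	COMP_REGS_CM,		/* would be served by cxl_passthrough_cm_rw() */
	COMP_REGS_HDM,		/* would be served by cxl_passthrough_hdm_rw() */
};

/*
 * Classify one access to the COMP_REGS region.  Returns a
 * comp_regs_target value, or -EINVAL when the access is not a single
 * aligned dword.  Assumes hdm_reg_offset >= SKETCH_CM_OFFSET.
 */
static int comp_regs_classify(uint64_t off, size_t len,
			      uint64_t hdm_reg_offset,
			      uint64_t hdm_reg_size)
{
	if (len != 4 || (off & 3))
		return -EINVAL;
	if (off < SKETCH_CM_OFFSET)
		return COMP_REGS_ZERO_FILL;
	if (off < hdm_reg_offset)
		return COMP_REGS_CM;
	if (off < hdm_reg_offset + hdm_reg_size)
		return COMP_REGS_HDM;
	return COMP_REGS_ZERO_FILL;
}
```

With an HDM block at 0x1200 of size 0x100, offset 0x1000 classifies as
the CM cap-array, 0x1200 as HDM, and 0x1300 as reserved.
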
 Documentation/driver-api/index.rst        |   1 +
 Documentation/driver-api/vfio-pci-cxl.rst | 282 ++++++++++++++++++++++
 2 files changed, 283 insertions(+)
 create mode 100644 Documentation/driver-api/vfio-pci-cxl.rst
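
Illustration only, not part of the commit: a minimal user-space model
of the per-field DVSEC write semantics tabulated in the new document
(HwInit dropped, RWL gated on CONFIG_LOCK, RW1C, RWO lock latch).  The
struct and function names are hypothetical and this is not the code in
drivers/cxl/core/passthrough.c.  It treats every bit of a register as
having the register's attribute, and assumes CONFIG_LOCK is bit 0 of
the LOCK register.

```c
#include <stdint.h>

/* Offsets from the DVSEC capability base, as listed in the table. */
enum {
	DVSEC_CAPABILITY = 0x0a,
	DVSEC_CONTROL    = 0x0c,
	DVSEC_STATUS     = 0x0e,
	DVSEC_CONTROL2   = 0x10,
	DVSEC_STATUS2    = 0x12,
	DVSEC_LOCK       = 0x14,
};

/* Assumption: CONFIG_LOCK is bit 0 of the LOCK register. */
#define SKETCH_CONFIG_LOCK 0x0001u

/* Hypothetical shadow of the 16-bit DVSEC registers. */
struct dvsec_shadow {
	uint16_t capability;
	uint16_t control;
	uint16_t status;
	uint16_t control2;
	uint16_t status2;
	uint16_t lock;
};

/* Apply one guest 16-bit write to the shadow. */
static void dvsec_shadow_write16(struct dvsec_shadow *s,
				 unsigned int off, uint16_t val)
{
	int locked = s->lock & SKETCH_CONFIG_LOCK;

	switch (off) {
	case DVSEC_CONTROL:		/* RWL: ignored once locked */
		if (!locked)
			s->control = val;
		break;
	case DVSEC_CONTROL2:		/* RWL: ignored once locked */
		if (!locked)
			s->control2 = val;
		break;
	case DVSEC_STATUS:		/* RW1C: writing 1 clears the bit */
		s->status &= (uint16_t)~val;
		break;
	case DVSEC_STATUS2:		/* RW1C */
		s->status2 &= (uint16_t)~val;
		break;
	case DVSEC_LOCK:		/* RWO: a 1 latches, never clears */
		if (val & SKETCH_CONFIG_LOCK)
			s->lock |= SKETCH_CONFIG_LOCK;
		break;
	default:			/* HwInit / RsvdZ: write dropped */
		break;
	}
}
```

After a write of 1 to LOCK, later writes to CONTROL or CONTROL2 leave
the shadow unchanged, and writing 0 to LOCK does not release it.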

diff --git a/Documentation/driver-api/index.rst b/Documentation/driver-api/index.rst
index eaf7161ff957..52f0c06a376a 100644
--- a/Documentation/driver-api/index.rst
+++ b/Documentation/driver-api/index.rst
@@ -47,6 +47,7 @@ of interest to most developers working on device drivers.
    vfio-mediated-device
    vfio
    vfio-pci-device-specific-driver-acceptance
+   vfio-pci-cxl
 
 Bus-level documentation
 =======================
diff --git a/Documentation/driver-api/vfio-pci-cxl.rst b/Documentation/driver-api/vfio-pci-cxl.rst
new file mode 100644
index 000000000000..1527b7dd85d0
--- /dev/null
+++ b/Documentation/driver-api/vfio-pci-cxl.rst
@@ -0,0 +1,282 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
+
+===========================================
+VFIO-PCI: CXL Type-2 device passthrough
+===========================================
+
+:Author: Manish Honap <mhonap@nvidia.com>
+
+Overview
+========
+
+vfio-pci-core, when built with ``CONFIG_VFIO_PCI_CXL=y``, passes a
+CXL Type-2 accelerator (CXL r4.0, HDM-D / HDM-DB) through to a KVM
+guest.  The host firmware commits the endpoint's HDM decoder before
+vfio-pci binds; the guest sees a CXL Type-2 device whose CXL.mem
+range is already programmed and locked.  The guest may inspect the
+HDM Decoder Capability block and the CXL Device DVSEC via
+spec-defined paths, and access the device's CXL.mem range as
+mmap'd memory.
+
+Scope
+=====
+
+The supported scope is intentionally narrow:
+
+* One CXL endpoint per host bridge.
+* The endpoint exposes exactly one HDM decoder (decoder 0).
+* No interleave.
+* Host firmware has committed the endpoint HDM decoder before
+  vfio-pci probes.  Devices whose HDM decoder is *uncommitted* fail
+  vfio-pci bind cleanly.
+* The host bridge is in single-RP-passthrough mode (the CXL host
+  bridge's own HDM decoder is not used; CFMWS-to-RP decode flows
+  implicitly).  This assumption is currently *not enforced* by
+  vfio-pci-core; it is a known limitation, see the Known
+  limitations section.
+
+Multi-decoder, interleave, FLR / reset state-machine integration,
+and host-bridge HDM decoder programming are explicitly out of scope.
+Any of them can be added later on top of the contract described
+below.
+
+Driver model
+============
+
+There is no dedicated ``vfio-cxl`` PCI driver.  vfio-pci is the only
+driver that binds to the host PCI device.  When built with
+``CONFIG_VFIO_PCI_CXL=y``, vfio-pci-core calls into the cxl subsystem
+to do five things at bind time:
+
+1. ``devm_cxl_dev_state_create()`` — allocate per-device CXL state
+   embedded in ``struct vfio_pci_cxl_state``.
+2. ``cxl_pci_setup_regs()`` + ``cxl_get_hdm_info()`` — probe the
+   Register Locator DVSEC and harvest the HDM block's BAR-relative
+   offset and size.
+3. ``cxl_await_range_active()`` — wait for the firmware-committed
+   range to become live.
+4. ``devm_cxl_passthrough_create()`` — snapshot the CXL Device DVSEC
+   body, the HDM Decoder block, and the CXL.cache/mem cap-array
+   prefix into shadows owned by cxl-core.  All subsequent
+   register-virtualization happens inside ``drivers/cxl/core/passthrough.c``.
+5. ``devm_cxl_probe_mem()`` — register a ``cxl_memdev``, enumerate
+   the endpoint port, and auto-attach the firmware-committed
+   region.  cxl_mem binds to the memdev as it would for any other
+   Type-2 accelerator.
+
+Ownership split
+===============
+
+Each device-visible surface is owned by exactly one subsystem:
+
+============================================  ==============================================
+Surface                                       Owner
+============================================  ==============================================
+PCI config (non-DVSEC, non-CXL)               vfio-pci-core ``vconfig`` (existing perm-bits)
+CXL Device DVSEC body                         cxl-core ``cxl_passthrough_dvsec_rw()``
+HDM Decoder Capability block                  cxl-core ``cxl_passthrough_hdm_rw()``
+CM cap-array (read-only snapshot)             cxl-core ``cxl_passthrough_cm_rw()``
+``cxl_memdev`` / endpoint port / autoregion   cxl-core ``devm_cxl_probe_mem()``
+HDM HPA range mapping                         vfio-pci ``request_mem_region`` + ``memremap``
+Sparse mmap layout for the component BAR      vfio-pci
+============================================  ==============================================
+
+The vfio side holds no shadow buffer of its own.  ``vfio_pci_cxl_state``
+caches small scalars (DVSEC offset/size, HDM offset/size, component
+BAR layout) for dispatch decisions; the actual virtualization
+semantics live in cxl-core.
+
+Bind sequence
+=============
+
+``vfio_pci_cxl_acquire()`` is called from
+``vfio_pci_core_register_device()`` at PCI bind time.  The sequence::
+
+  0. devm_cxl_dev_state_create(parent, CXL_DEVTYPE_DEVMEM, dsn,
+                               dvsec_off, vfio_pci_cxl_state, cxlds,
+                               /*mbox=*/false)
+
+  1. pcie_is_cxl() and pci_find_dvsec_capability(CXL_DEVICE)
+     -> -ENODEV if either is absent
+     -> -ENODEV if the DVSEC's MEM_CAPABLE bit is clear
+
+  2. pci_enable_device_mem()
+
+     2a. cxl_pci_setup_regs(CXL_REGLOC_RBI_COMPONENT)
+     2b. cxl_get_hdm_info() — REJECT hdm_count != 1 with -EOPNOTSUPP
+     2c. cxl_regblock_get_bar_info()
+     2d. cxl_await_range_active()
+     2e. devm_cxl_passthrough_create(&pdev->dev, &cxlds)
+
+  3. pci_disable_device()
+     Clears PCI_COMMAND_MASTER but NOT PCI_COMMAND_MEMORY (see
+     do_pci_disable_device() in drivers/pci/pci.c).  Subsequent
+     MMIO from step 4 still succeeds.
+
+  4. devm_cxl_probe_mem(&cxlds, &hpa_range)
+     Registers the memdev, enumerates the endpoint port, attaches
+     the firmware-committed autoregion.
+
+  5. request_mem_region(hpa_base, hpa_size) + memremap_wb()
+
+  6. vdev->cxl = cxl  (state published; HDM and COMP_REGS regions
+     are registered later when the VFIO fd is opened)
+
+Fail-closed semantics
+---------------------
+
+Three conditions mean "not a CXL device; caller falls back to plain
+vfio-pci": ``pcie_is_cxl()`` returns false, the CXL Device DVSEC is
+absent, or its ``MEM_CAPABLE`` bit is clear.  Each makes
+``vfio_pci_cxl_acquire()`` return ``-ENODEV``, which the caller treats
+as a silent fall-through.
+
+Any other negative errno from the bind sequence aborts the vfio-pci
+bind entirely.  The guest never sees a half-initialised CXL device.
+Once ``devm_cxl_probe_mem()`` has succeeded, the published memdev
+holds a pointer into the embedded ``cxl_dev_state``.  A later failure
+in ``vfio_cxl_map_hdm()`` therefore must not ``devm_kfree(cxl)``; the
+state stays allocated for the lifetime of the PCI device and devres
+unwinds it at pdev removal.
+
+VFIO regions exposed
+====================
+
+When the VFIO fd is first opened, ``vfio_pci_cxl_open()`` registers
+two additional regions on top of the standard vfio-pci BARs / config
+region:
+
+HDM region (``VFIO_REGION_SUBTYPE_CXL``)
+  Mappable view of the device's firmware-committed HPA range.
+
+  * ``mmap``: fault handler does
+    ``vmf_insert_pfn(vma, addr, PHYS_PFN(hpa_base + off))``.  The
+    guest gets the same backing physical memory the host sees.
+  * ``pread`` / ``pwrite``: served from the ``memremap_wb()`` kva
+    captured at bind time.
+
+COMP_REGS region (``VFIO_REGION_SUBTYPE_CXL_COMP_REGS``)
+  Shadow of the CXL component register sub-range.  ``pread`` /
+  ``pwrite`` only; ``mmap`` is intentionally not supported (the VMM
+  uses this region instead of mmapping the BAR).  Dword-aligned
+  access only; sub-dword accesses return ``-EINVAL``.
+
+  Dispatch by offset:
+
+  ============================================  =================================
+  Offset range                                  cxl-core helper
+  ============================================  =================================
+  ``< CXL_CM_OFFSET``                           zero-fill (reserved)
+  ``CXL_CM_OFFSET .. hdm_reg_offset``           ``cxl_passthrough_cm_rw()``
+  ``hdm_reg_offset .. +hdm_reg_size``           ``cxl_passthrough_hdm_rw()``
+  ``>= hdm_reg_offset + hdm_reg_size``          zero-fill (reserved)
+  ============================================  =================================
+
+DVSEC virtualization contract
+=============================
+
+The CXL Device DVSEC body is reached through the standard PCI
+config-space path.  ``vfio_pci_config_rw_single()`` clips chunks at
+the DVSEC body boundary via ``vfio_pci_cxl_config_boundary()`` and
+forwards body bytes to ``vfio_pci_cxl_config_rw()``, which in turn
+calls ``cxl_passthrough_dvsec_rw()``.
+
+Per-field write semantics (CXL r4.0 §8.1.3):
+
+============================================  ==============================================
+Field (offset from DVSEC cap base)            Spec attribute / behaviour
+============================================  ==============================================
+CAPABILITY        (0x0a)                      HwInit — writes dropped
+CONTROL           (0x0c)                      RWL — gated on DVSEC CONFIG_LOCK
+STATUS            (0x0e)                      RW1C
+CONTROL2          (0x10)                      RWL — gated on DVSEC CONFIG_LOCK
+STATUS2           (0x12)                      RW1C
+LOCK              (0x14)                      RWO — first 1-write latches CONFIG_LOCK
+Range1 SIZE_HI/LO BASE_HI/LO  (0x18..0x27)    HwInit — writes dropped
+Range2 SIZE_HI/LO BASE_HI/LO  (0x28..0x37)    RsvdZ — writes dropped
+============================================  ==============================================
+
+HDM virtualization contract
+===========================
+
+Per CXL r4.0 §8.2.4.20, on the single firmware-committed decoder:
+
+============================================  ==============================================
+Field (offset from HDM block base)            Spec attribute / behaviour
+============================================  ==============================================
+HDM Decoder Capability Header (0x00)          HwInit — writes dropped
+HDM Decoder Global Control    (0x04)          RW — shadow
+Decoder 0 BASE_LO / BASE_HI                   RWL — gated on COMMITTED or LOCK_ON_COMMIT
+Decoder 0 SIZE_LO / SIZE_HI                   RWL — same gate
+Decoder 0 CTRL                                Implements COMMIT → COMMITTED handshake; once
+                                              COMMITTED, only COMMIT toggles are honoured
+============================================  ==============================================
+
+CM cap-array
+============
+
+The CM cap-array (CXL r4.0 §8.2.4) prefix is snapshotted from the
+device's component register MMIO at bind time and served read-only
+through ``cxl_passthrough_cm_rw()``.  Guest writes to the cap-array
+are silently dropped.
+
+UAPI: CAP_CXL
+=============
+
+``VFIO_DEVICE_GET_INFO`` returns ``VFIO_DEVICE_FLAGS_CXL`` and a
+``VFIO_DEVICE_INFO_CAP_CXL`` capability::
+
+    struct vfio_device_info_cap_cxl {
+        struct vfio_info_cap_header header;
+        __u32 flags;
+        #define VFIO_CXL_CAP_HOST_FIRMWARE_COMMITTED (1 << 0)
+        __u32 hdm_region_idx;
+        __u32 comp_reg_region_idx;
+        __u32 comp_reg_bar;
+        __u32 __resv;
+        __u64 comp_reg_offset;
+        __u64 comp_reg_size;
+    };
+
+``VFIO_DEVICE_GET_REGION_INFO`` on the component BAR returns a
+``VFIO_REGION_INFO_CAP_SPARSE_MMAP`` that excludes
+``[comp_reg_offset, comp_reg_offset + comp_reg_size)`` from the
+mmappable areas.
+
+Known limitations
+=================
+
+* Host bridge HDM decoder programming is not driven by this driver.
+  The driver silently assumes single-RP-passthrough topology (the
+  CXL host bridge's own HDM decoder is not used).  Two remediations
+  are possible: either refuse to bind when the topology is not
+  single-RP-passthrough, or extend the kernel ABI so a host-bridge
+  HDM decoder programmer can attest the lock before vfio bind.  Each
+  either leaves the existing contract intact or adds a single boolean
+  to CAP_CXL.
+
+* Function-level reset (FLR) does not re-snapshot the shadows.
+  Guests that issue FLR will see stale HDM and DVSEC state after
+  the reset.
+
+* Multi-decoder devices return ``-EOPNOTSUPP`` at bind.
+
+* Hotplug while the device is held by vfio is not supported.
+
+* Raw BAR read/write into the CXL component register sub-range is
+  unsupported.  VMMs must use the COMP_REGS region.
+
+Selftest
+========
+
+``tools/testing/selftests/vfio/vfio_cxl_type2_test`` exercises the
+five surfaces:
+
+* ``device_is_cxl`` — GET_INFO returns FLAGS_CXL + CAP_CXL.
+* ``hdm_region_mmap_rw`` — mmap + read/write pattern.
+* ``component_bar_sparse_mmap`` — SPARSE_MMAP cap excludes the CXL
+  block.
+* ``comp_regs_cm_cap_array_read`` — CM cap-array header is served
+  from the cxl-core snapshot.
+* ``dvsec_lock_byte_read`` -- DVSEC config-rw clipping shim is wired.
-- 
2.25.1
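
Illustration only, not part of the commit: a minimal sketch of how a
VMM might locate the CAP_CXL capability in the buffer returned by
VFIO_DEVICE_GET_INFO by walking the VFIO capability chain.  The local
struct mirrors the layout of struct vfio_info_cap_header from
<linux/vfio.h>; find_info_cap() is a hypothetical helper, and the
numeric ID of VFIO_DEVICE_INFO_CAP_CXL is not assumed here, so the
caller passes it in.  Stopping on a non-increasing "next" offset is a
defensive choice of this sketch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of struct vfio_info_cap_header. */
struct cap_header {
	uint16_t id;
	uint16_t version;
	uint32_t next;	/* offset of the next header from the start of
			 * the info buffer; 0 terminates the chain */
};

/*
 * Walk the capability chain in an info buffer of argsz bytes, starting
 * at cap_offset.  Returns the offset of the first capability whose id
 * matches, or 0 when there is none.
 */
static uint32_t find_info_cap(const void *info, size_t argsz,
			      uint32_t cap_offset, uint16_t id)
{
	const uint8_t *buf = info;
	uint32_t off = cap_offset;

	while (off && (size_t)off + sizeof(struct cap_header) <= argsz) {
		struct cap_header hdr;

		/* memcpy avoids unaligned access into the byte buffer. */
		memcpy(&hdr, buf + off, sizeof(hdr));
		if (hdr.id == id)
			return off;
		if (hdr.next <= off)	/* end of chain or backward link */
			break;
		off = hdr.next;
	}
	return 0;
}
```

A caller would pass the returned offset back into the same buffer to
read hdm_region_idx, comp_reg_region_idx and the comp_reg_* fields.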

