From: Richard Cheng <icheng@nvidia.com>
To: mhonap@nvidia.com
Cc: djbw@kernel.org, alex@shazbot.org, jgg@ziepe.ca,
	jic23@kernel.org,  dave.jiang@intel.com, ankita@nvidia.com,
	alejandro.lucero-palau@amd.com,  alison.schofield@intel.com,
	dave@stgolabs.net, dmatlack@google.com, gourry@gourry.net,
	 ira.weiny@intel.com, cjia@nvidia.com, kjaju@nvidia.com,
	vsethi@nvidia.com,  zhiw@nvidia.com, kvm@vger.kernel.org,
	linux-cxl@vger.kernel.org,  linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org
Subject: Re: [PATCH v3 00/11] vfio/pci: Add CXL Type-2 device passthrough support
Date: Fri, 26 Jun 2026 17:16:54 +0800
Message-ID: <aj5Brc9beJhsDdJr@MWDK4CY14F>
In-Reply-To: <20260625165407.1769572-1-mhonap@nvidia.com>

On Thu, Jun 25, 2026 at 10:23:56PM +0800, mhonap@nvidia.com wrote:
> From: Manish Honap <mhonap@nvidia.com>
> 
> CXL Type-2 accelerators (CXL.mem-capable GPUs and similar) cannot be
> passed through to virtual machines with stock vfio-pci because the
> driver has no concept of HDM decoder management, HDM region exposure,
> or component register virtualization.  This series adds those three
> pieces, sufficient for a guest to use the device's firmware-committed
> coherent memory under UVM / ATS.
> 
> v3 is a rewrite of the v2 framework form, responding to Dan's request
> in the v2 review for "less emulation, narrower interfaces, and a
> closer mapping to the spec language."
> In this release, cxl-core exposes four EXPORT_SYMBOL_GPL helpers behind
> an opaque handle.  vfio-pci becomes a thin transport on top of those.
> Please see "Changes since v2" and "Reviewer feedback addressed" below for
> the per-area summary.
>

Hi Manish,

Thanks for the work. I ran some tests with your patches applied on a real
CXL Type-2 device: a GPU with a firmware-committed HDM decoder. I want to
report the result early: the acquire path works, but the first CPU access
to the mapped HDM region crashes the host.

The device BDF is 0002:81:00.0, with CXLCtl: Cache+ IO+ Mem+ and the HDM
decoder firmware-committed.

Binding the device to vfio-pci brought the CXL Type-2 path up cleanly:
"""
# modprobe vfio-pci
# echo vfio-pci > /sys/bus/pci/devices/0002:81:00.0/driver_override
# echo 0002:81:00.0 > /sys/bus/pci/drivers_probe
"""

A mem0/endpoint19/region1 appeared, and the device_is_cxl selftest passed.

When running the selftest from patch 9:
"""
# sudo ./vfio_cxl_type2_test 0002:81:00.0
ok 1 cxl_type2.device_is_cxl
#  RUN  cxl_type2.hdm_region_mmap_rw
"""
At this point, the machine hung and then crashed.

hdm_region_mmap_rw mmaps the HDM region and does a CPU read/write to it. That
access never returned. I couldn't capture dmesg or a trace before the crash.
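
For reference, the access pattern in question looks roughly like the
sketch below. This is not the selftest source: mmap_write_readback() and
mmap_write_readback_selfcheck() are hypothetical names, and the self-check
uses a temporary file as a stand-in. With a real device, fd would be the
VFIO device fd and offset the HDM region offset reported by
VFIO_DEVICE_GET_REGION_INFO.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper, not the selftest's code: map one page of `fd` at
 * `offset`, store a pattern, load it back.
 * Returns 0 on match, 1 on mismatch, -1 if mmap() fails.
 */
int mmap_write_readback(int fd, off_t offset, uint32_t pattern)
{
	long page = sysconf(_SC_PAGESIZE);
	volatile uint32_t *p;
	uint32_t got;
	void *map;

	assert(page > 0);

	map = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE, MAP_SHARED,
		   fd, offset);
	if (map == MAP_FAILED)
		return -1;

	p = map;
	p[0] = pattern;		/* CPU store into the mapping */
	got = p[0];		/* CPU load from the mapping */

	munmap(map, (size_t)page);
	return got == pattern ? 0 : 1;
}

/*
 * Stand-in check on a temporary file (assumption: /tmp is writable), so
 * the helper can be tried without a VFIO device. Returns 0 on success.
 */
int mmap_write_readback_selfcheck(void)
{
	char path[] = "/tmp/hdm-sketch-XXXXXX";
	int fd = mkstemp(path);
	int ret;

	if (fd < 0)
		return -1;
	unlink(path);
	if (ftruncate(fd, sysconf(_SC_PAGESIZE)) != 0) {
		close(fd);
		return -1;
	}
	ret = mmap_write_readback(fd, 0, 0xdeadbeefu);
	close(fd);
	return ret;
}
```

On the temporary file the store and load complete immediately; on the
device it is this store/load pair that did not come back.
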

I'm not sure whether this is a platform/FW issue or something in how the
region is mapped.
Have you exercised hdm_region_mmap_rw on your machine, or only against the
cxl_test mock?

If a guest can hang the host just by touching its mapped memory, it needs to be fixed.

Best regards,
Richard Cheng.



> Motivation
> ==========
> 
> A CXL Type-2 device exposes its HDM-mapped device memory through HDM
> decoders that BIOS programs and commits at boot.  To pass such a
> device to a guest, vfio-pci has to do three things at once:
> 
>   1. Surface the firmware-committed HDM-mapped HPA range as a guest-
>      mmappable region.
> 
>   2. Surface a CXL-spec-compliant view of the CXL Device DVSEC body,
>      the HDM Decoder Capability block, and the CXL.cache/mem cap-array
>      prefix, so the guest's CXL driver enumerates the same topology
>      the host saw.
> 
>   3. Keep the host's committed decoder configuration intact (the
>      physical decoder is never reprogrammed) while letting the guest
>      observe and manage a shadow that follows the per-field write
>      semantics in the spec.
> 
> The series builds on Alejandro Lucero-Palau's v28 work
> applied on for-7.3/cxl-type2-enabling [1] (sfc is the in-tree consumer
> today). vfio-pci becomes the second consumer.
> 
> Architecture
> ============
> 
> cxl-core owns the CXL semantics.  A new file
> drivers/cxl/core/passthrough.c (gated by hidden Kconfig
> CXL_VFIO_PASSTHROUGH) provides four exported symbols:
> 
>     struct cxl_passthrough *
>     devm_cxl_passthrough_create(struct device *dev,
>                                 struct cxl_dev_state *cxlds);
> 
>     int cxl_passthrough_dvsec_rw(p, off, val, sz, write);
>     int cxl_passthrough_hdm_rw  (p, off, val,      write);
>     int cxl_passthrough_cm_rw   (p, off, val,      write);
> 
> cxl_passthrough is an opaque handle; vfio-pci sees no cxl-internal
> struct pointers.  The shadows are snapshotted at create time: the
> DVSEC body from PCI config space dword by dword, the CM cap-array and
> HDM block from the cxl-core MMIO mapping at cxlds->reg_map.base.
> Per-field write semantics are as follows.
> 
> CXL r4.0 8.1.3 DVSEC:
> - LOCK is RWO,
> - CONTROL/CONTROL2 are RWL gated on CONFIG_LOCK,
> - STATUS/STATUS2 are RW1C,
> - RANGE1 is HwInit, RANGE2 is RsvdZ.
> 
> CXL r4.0 8.2.4.20 HDM:
> - GLOBAL_CTRL is RW,
> - decoder CTRL implements COMMIT/COMMITTED,
> - decoder BASE/SIZE are RWL gated on COMMITTED or LOCK_ON_COMMIT,
> - cap header is HwInit.
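
The attribute classes listed above can be modelled in a few lines of
user-space C. This is only a sketch under my own assumptions, not the code
in passthrough.c: the struct and function names are hypothetical, the masks
are assumed disjoint, RWO is modelled as "writable until the first write to
the register", and HwInit/RsvdZ bits are simply bits that appear in no mask
and therefore ignore guest writes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-register description: one mask per attribute class. */
struct shadow_reg_masks {
	uint32_t rw;	/* plain read-write bits */
	uint32_t rwl;	/* read-write until the gating lock is set */
	uint32_t rwo;	/* write-once bits */
	uint32_t rw1c;	/* write-1-to-clear bits */
	/* bits in no mask (HwInit, RsvdZ) ignore guest writes */
};

struct shadow_reg {
	uint32_t val;		/* current shadow value */
	uint32_t rwo_written;	/* RWO bits that were already written once */
};

/* Apply one guest write to the shadow and return the new shadow value. */
uint32_t shadow_reg_write(struct shadow_reg *r,
			  const struct shadow_reg_masks *m,
			  uint32_t wr, bool locked)
{
	uint32_t once, writable;

	assert(r && m);

	once = m->rwo & ~r->rwo_written;	/* RWO bits still open */
	writable = m->rw | once;
	if (!locked)
		writable |= m->rwl;		/* RWL bits only while unlocked */

	/* Replace writable bits, keep everything else from the shadow. */
	r->val = (r->val & ~writable) | (wr & writable);
	/* RW1C: a written 1 clears the bit, a written 0 leaves it alone. */
	r->val &= ~(wr & m->rw1c);
	/* The first write consumes all RWO bits of this register. */
	r->rwo_written |= once;

	return r->val;
}
```

With masks rw=0x0f, rwl=0xf0, rwo=0x100, rw1c=0x200 and a starting value of
0x8200, an unlocked write of 0x1a5 yields 0x83a5; a following locked write
of 0x25a only changes the RW nibble and clears the RW1C bit, giving 0x81aa.
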
> 
> vfio-pci becomes a thin transport.  The new module
> drivers/vfio/pci/cxl/ exposes two VFIO regions.
> 
>   VFIO_REGION_SUBTYPE_CXL (HDM region): mmappable view of the
>   HDM-mapped HPA. The mmap fault handler calls vmf_insert_pfn() from
>   the physical HPA. pread/pwrite go through the memremap_wb() kva
>   captured at bind time.
> 
>   VFIO_REGION_SUBTYPE_CXL_COMP_REGS (component register shadow):
>   pread/pwrite only, dword-aligned (-EINVAL on misalignment).
>   Each dword dispatches by offset to cxl_passthrough_cm_rw() or
>   cxl_passthrough_hdm_rw(). No shadow state on the vfio side; cxl-core
>   enforces the spec.
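
To check my reading of the COMP_REGS path, here is a user-space sketch of
the alignment check and per-dword dispatch described above. All names are
hypothetical, the position of the HDM block inside the region is passed in
as an assumed layout, and whether the real helpers take block-relative
offsets is a guess on my side. The callback stands in for the calls to
cxl_passthrough_cm_rw() and cxl_passthrough_hdm_rw().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum comp_target { TARGET_CM, TARGET_HDM };

/* Hypothetical layout: where the HDM block sits inside the region. */
struct comp_layout {
	uint64_t hdm_start;
	uint64_t hdm_len;
};

/* Stand-in for the per-dword cxl-core helper calls. */
typedef int (*dword_fn)(enum comp_target t, uint64_t off, void *ctx);

/* Reject misaligned accesses, then hand each dword to its target. */
int comp_regs_dispatch(const struct comp_layout *l, uint64_t pos,
		       size_t count, dword_fn fn, void *ctx)
{
	assert(l && fn);

	if ((pos | count) & 3)
		return -EINVAL;

	for (size_t done = 0; done < count; done += 4) {
		uint64_t off = pos + done;
		bool in_hdm = off >= l->hdm_start &&
			      off - l->hdm_start < l->hdm_len;
		int ret = fn(in_hdm ? TARGET_HDM : TARGET_CM,
			     in_hdm ? off - l->hdm_start : off, ctx);

		if (ret)
			return ret;
	}
	return 0;
}

/* Example callback: count how many dwords went to each target. */
struct dispatch_counts {
	unsigned int cm;
	unsigned int hdm;
};

int count_dword(enum comp_target t, uint64_t off, void *ctx)
{
	struct dispatch_counts *c = ctx;

	(void)off;
	if (t == TARGET_HDM)
		c->hdm++;
	else
		c->cm++;
	return 0;
}
```

For an assumed HDM block at 0x10..0x2f, a 16-byte access at offset 0x8
sends two dwords to the CM path and two to the HDM path, while an access at
offset 0x2 or of length 3 fails with -EINVAL.
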
> 
> CXL DVSEC config-space accesses use a clipping shim in
> vfio_pci_config_rw_single(). A config-space chunk that crosses the
> DVSEC body boundary is split: header bytes go through the generic
> perm-bits path, body bytes go through cxl_passthrough_dvsec_rw().
> The shim replaces v2's approach of repointing ecap_perms[].
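
The splitting rule can be illustrated with a small helper that reports how
many bytes of an access stay on one side of the DVSEC body boundary. This
is a sketch of the idea only, not the shim in vfio_pci_config_rw_single():
dvsec_clip_chunk() is a hypothetical name, and body_start/body_end are
assumed to be the config-space offsets of the first body byte and one past
the last.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return how many of the `count` bytes starting at `pos` lie on the same
 * side of the DVSEC body [body_start, body_end), and report which side
 * through *in_body. A caller would loop, sending in-body chunks to the
 * cxl-core helper and the rest through the generic perm-bits path.
 */
size_t dvsec_clip_chunk(uint32_t pos, size_t count,
			uint32_t body_start, uint32_t body_end,
			bool *in_body)
{
	size_t limit;

	assert(in_body && body_start <= body_end);

	if (pos < body_start) {
		/* Before the body: stop at its first byte. */
		*in_body = false;
		limit = body_start - pos;
	} else if (pos < body_end) {
		/* Inside the body: stop at its end. */
		*in_body = true;
		limit = body_end - pos;
	} else {
		/* Past the body: nothing left to clip against. */
		*in_body = false;
		limit = count;
	}
	return count < limit ? count : limit;
}
```

With an assumed body of 0x10a..0x13f, a 4-byte access at 0x108 is clipped
to 2 header bytes, and the caller then issues the remaining 2 bytes at
0x10a as a body access.
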
> 
> Sparse-mmap is exposed on the component BAR so userspace can mmap the
> non-component portions directly; only the CXL component register
> sub-range goes through pread/pwrite emulation. The CXL sub-range is
> also skipped from vfio_pci-core's request_selected_regions() set
> because cxl-core's devm_cxl_probe_mem() already holds a
> request_mem_region() on it; the asymmetric skip is matched by an
> asymmetric release on disable().
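
As I understand the sparse-mmap layout, the mmappable areas are whatever is
left of the BAR once the component register sub-range is carved out. A
user-space sketch of that computation follows; the names are hypothetical,
struct mmap_area only mirrors the offset/size pair of a sparse-mmap area,
and rounding the excluded range out to page boundaries is my assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Offset/size pair describing one mmappable area of the BAR. */
struct mmap_area {
	uint64_t offset;
	uint64_t size;
};

/*
 * Fill `out` with the areas of a BAR of `bar_size` bytes that remain
 * mmappable after excluding [comp_off, comp_off + comp_size), rounded out
 * to `page` (a power of two). Returns the number of areas, 0 to 2.
 */
int sparse_areas_excluding(uint64_t bar_size, uint64_t comp_off,
			   uint64_t comp_size, uint64_t page,
			   struct mmap_area out[2])
{
	uint64_t start, end;
	int n = 0;

	assert(page && !(page & (page - 1)));

	start = comp_off & ~(page - 1);				/* round down */
	end = (comp_off + comp_size + page - 1) & ~(page - 1);	/* round up */
	if (start > bar_size)
		start = bar_size;
	if (end > bar_size)
		end = bar_size;

	if (start > 0)
		out[n++] = (struct mmap_area){ 0, start };
	if (end < bar_size)
		out[n++] = (struct mmap_area){ end, bar_size - end };
	return n;
}
```

For a 128 KiB BAR with the component registers at 0x10100..0x102ff and
4 KiB pages, this gives two areas: 0x0..0xffff and 0x11000..0x1ffff.
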
> 
> Scope and out-of-scope
> ======================
> 
> In scope (rejected at create time with -EOPNOTSUPP otherwise):
> 
>   - Firmware-committed devices (HOST_FIRMWARE_COMMITTED set).
>   - Single HDM decoder (hdm_count == 1).
>   - No interleave (IW == 0).
> 
> Out of scope, deferred for follow-on work:
> 
>   - Multi-decoder devices and interleave.
>   - Guest-driven (non-firmware-committed) HDM commit.
>   - Hotplug, FLR, and sibling-function reset of CXL Type-2 devices.
> 
> Changes since v2
> ================
> 
> This is a rewrite, not an incremental update.  The structure of the
> series changed (20 patches in v2 to 11 in v3) because v3 collapses
> v2 patches 9-15 (detection, HDM emulation, media readiness, region
> management, HDM region, DVSEC emulation) into one cxl-core helper
> file and one vfio-pci consumer.
> 
> Framework replaced by narrow opaque-handle helpers (patches 6, 8)
> 
>   v2 carried a generic register-emulation framework split across four
>   state-machine files in cxl-core.
>   v3 collapses it into one file: drivers/cxl/core/passthrough.c
>   exposing the four EXPORT_SYMBOL_GPL helpers above behind a struct
>   cxl_passthrough opaque handle.
> 
> Shadow ownership moved into cxl-core (patches 6, 8)
> 
>   vfio-pci no longer keeps any per-field state. It forwards
>   (offset, value) into cxl-core, and cxl-core enforces the spec
>   (RWO, RWL, RW1C, HwInit, RsvdZ) with explicit CXL r4.0 section
>   references in the switch arms.
> 
> DVSEC config-space clipping shim (patch 8)
> 
>   v2 repointed ecap_perms[] to redirect CXL DVSEC reads and writes.
>   v3 keeps ecap_perms[] untouched and clips per-config-access chunks
>   at the DVSEC body boundary in vfio_pci_config_rw_single(); header bytes
>   go through the generic perm-bits path, body bytes go through
>   cxl_passthrough_dvsec_rw(). The shim is local to the per-device
>   path.
> 
> CONFIG_VFIO_PCI_CXL gates the new module (patch 7)
> 
>   v2 had a CONFIG_VFIO_CXL_CORE Kconfig stub; v3 renames it to
>   CONFIG_VFIO_PCI_CXL to match the vfio-pci naming convention.
>   The hidden CXL_VFIO_PASSTHROUGH selects the cxl-core helper file
>   on demand. With both disabled, the cxl-core size is unchanged.
> 
> UAPI rewritten with named fields (patch 5)
> 
>   vfio_device_info_cap_cxl in v3 carries:
>     flags + HOST_FIRMWARE_COMMITTED bit
>     hdm_region_idx
>     comp_reg_region_idx
>     comp_reg_bar
>     comp_reg_offset
>     comp_reg_size
>   The DPA terminology is renamed to HDM region throughout.
>   CACHE_CAPABLE (HDM-DB indicator) is dropped;
>   it was informational only in v2 with no caller, and it will be
>   re-added by a later series that actively plumbs CXL.cache.
> 
> Selftests trimmed (patch 9)
> 
>   v2 carried selftests for device detection, capability parsing,
>   region enumeration, HDM register emulation, HDM mmap with
>   page-fault insertion, FLR invalidation, and DVSEC register
>   emulation. v3 keeps a smoke-test set of six focused tests:
> 
>     device_is_cxl                  GET_INFO advertises FLAGS_CXL
>                                    and a populated CAP_CXL.
>     hdm_region_mmap_rw             mmap one page, write+read back.
>     component_bar_sparse_mmap      SPARSE_MMAP cap excludes the
>                                    CXL component register sub-range.
>     comp_regs_cm_cap_array_read    pread of the CM cap-array
>                                    header at CXL_CM_OFFSET succeeds
>                                    (CAP_ID == 1).
>     dvsec_lock_byte_read           pread of the DVSEC CONFIG_LOCK
>                                    byte through the clipping shim
>                                    succeeds.
>     hdm_decoder_commit_fsm         COMMIT / COMMITTED state machine
>                                    and LOCK_ON_COMMIT behaviour.
> 
>   FLR invalidation, page-fault insertion under load, and full
>   DVSEC field-by-field write coverage are deferred to a follow-on
>   selftest series. The current six are the minimal set that
>   exercises the kernel-side contract end-to-end.
> 
> cxl-core prep patches split (patches 1-4)
> 
>   v3 keeps the cxl-side enablers from v2 patches 1-4 but each as
>   a standalone change so the cxl maintainer can review the helper
>   API independently of the vfio consumer:
> 
>     [1/11] cxl_get_hdm_info()
>     [2/11] cxl_await_range_active() split from media-ready wait
>     [3/11] cxl_register_map records BIR + BAR offset
>     [4/11] component/HDM register defines moved to uapi/cxl/cxl_regs.h
> 
> Reviewer feedback addressed
> ===========================
> 
> Dan
> ---
> 
> - VFIO exposes HDM/host-visible region, not raw DPA; docs/UAPI say HDM
>   region, DPA only inside cxl-core where appropriate.
> - One vfio-pci device = one HDM region / one decoder, no interleave;
>   hdm_count != 1 → -EOPNOTSUPP.
> - Global HDM on DVSEC Range Base treated as legacy; RANGE1/RANGE2
>   read-only snapshot, guest writes dropped.
> - No guest/kernel lock games; DVSEC LOCK and HDM LOCK_ON_COMMIT RWO,
>   fixed at create from firmware snapshot.
> - Opaque cxl_passthrough handle only; vfio gets HPA via memdev probe +
>   layout via cxl_get_hdm_info(), rw via helpers.
> - No multi-region accelerator case in v3; single region enforced,
>   multi-region deferred.
> - cxl_await_range_active stays in cxl-core probe; not exported, vfio does
>   not call it.
> - No guest LOCK→0 reprogram; guest cannot clear LOCK to remap host HPA;
>   kernel uncommit tied to COMMIT, not LOCK alone.
> 
> Jason / Gregory / Dan
> ---------------------
> 
> - memremap(WB) + request_mem_region on HPA; conflicting direct-map/EFI use
>   fails probe with -EBUSY.
> 
> Jonathan
> --------
> 
> - uapi/cxl/cxl_regs.h for register defines so VMMs need no private
>   kernel headers.
> - __free() locals on cxl-core/passthrough error paths instead of
>   struct-owned temporaries.
> - No "precommitted at probe" assumption; acquire checks COMMITTED in
>   HDM shadow and refuses if missing.
> 
> Dave
> ----
> 
> - memremap(MEMREMAP_WB) for HDM host mapping (not ioremap_cache).
> - Renamed cap flag to VFIO_CXL_CAP_HOST_FIRMWARE_COMMITTED for clarity.
> - __free() / DEFINE_FREE() cleanup in new passthrough.c create path.
> 
> Patch series
> ============
> 
>  [1/11] cxl: Add cxl_get_hdm_info() helper for HDM decoder metadata
>  [2/11] cxl: Split cxl_await_range_active() from media-ready wait
>  [3/11] cxl: Record BIR and BAR offset in cxl_register_map
>  [4/11] cxl: Move component/HDM register defines to
>         uapi/cxl/cxl_regs.h
>  [5/11] vfio: UAPI for CXL Type-2 device passthrough
>  [6/11] cxl: Add register-virtualization helpers for vfio Type-2
>         passthrough
>  [7/11] vfio/pci: Add CONFIG_VFIO_PCI_CXL with bind-time CXL Type-2
>         acquisition
>  [8/11] vfio/pci/cxl: Add HDM + COMP_REGS regions and DVSEC clipping
>         shim
>  [9/11] selftests/vfio: Add CXL Type-2 device passthrough smoke test
> [10/11] docs: vfio-pci: Document CXL Type-2 device passthrough
> [11/11] vfio/pci: Provide opt-out for CXL Type-2 extensions
> 
> Dependencies
> ============
> 
> [1] [PATCH v28 0/5] Type2 device basic support
> https://lore.kernel.org/linux-cxl/20260618181806.118745-1-alejandro.lucero-palau@amd.com/
> 
> [2] Previous version of this patch series
> [PATCH v2 00/20] vfio/pci: Add CXL Type-2 device passthrough support
> https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com/
> 
> [3] Companion QEMU series
> [RFC 0/9] QEMU: CXL Type-2 device passthrough via vfio-pci
> https://lore.kernel.org/linux-cxl/20260427181235.3003865-1-mhonap@nvidia.com/
> 
> Manish Honap (11):
>   cxl: Add cxl_get_hdm_info() helper for HDM decoder metadata
>   cxl: Split cxl_await_range_active() from media-ready wait
>   cxl: Record BIR and BAR offset in cxl_register_map
>   cxl: Move component/HDM register defines to uapi/cxl/cxl_regs.h
>   vfio: UAPI for CXL Type-2 device passthrough
>   cxl: Add register-virtualization helpers for vfio Type-2 passthrough
>   vfio/pci: Add CONFIG_VFIO_PCI_CXL with bind-time CXL Type-2
>     acquisition
>   vfio/pci/cxl: Add HDM + COMP_REGS regions and DVSEC clipping shim
>   selftests/vfio: Add CXL Type-2 device passthrough smoke test
>   docs: vfio-pci: Document CXL Type-2 device passthrough
>   vfio/pci: Provide opt-out for CXL Type-2 extensions
> 
>  Documentation/driver-api/index.rst            |   1 +
>  Documentation/driver-api/vfio-pci-cxl.rst     | 282 ++++++
>  drivers/cxl/Kconfig                           |   7 +
>  drivers/cxl/core/Makefile                     |   1 +
>  drivers/cxl/core/passthrough.c                | 590 ++++++++++++
>  drivers/cxl/core/pci.c                        |  70 +-
>  drivers/cxl/core/regs.c                       |  35 +
>  drivers/cxl/cxl.h                             |  52 +-
>  drivers/vfio/pci/Kconfig                      |   2 +
>  drivers/vfio/pci/Makefile                     |   1 +
>  drivers/vfio/pci/cxl/Kconfig                  |  34 +
>  drivers/vfio/pci/cxl/Makefile                 |   2 +
>  drivers/vfio/pci/cxl/vfio_cxl_core.c          | 889 ++++++++++++++++++
>  drivers/vfio/pci/cxl/vfio_cxl_priv.h          |  71 ++
>  drivers/vfio/pci/vfio_pci.c                   |   9 +
>  drivers/vfio/pci/vfio_pci_config.c            |  31 +
>  drivers/vfio/pci/vfio_pci_core.c              |  68 +-
>  drivers/vfio/pci/vfio_pci_priv.h              |  93 ++
>  drivers/vfio/pci/vfio_pci_rdwr.c              |  17 +
>  include/cxl/cxl.h                             |  18 +
>  include/cxl/passthrough.h                     | 121 +++
>  include/linux/vfio_pci_core.h                 |   8 +
>  include/uapi/cxl/cxl_regs.h                   |  63 ++
>  include/uapi/linux/vfio.h                     |  46 +
>  tools/testing/selftests/vfio/Makefile         |   1 +
>  .../selftests/vfio/lib/vfio_pci_device.c      |  11 +-
>  .../selftests/vfio/vfio_cxl_type2_test.c      | 350 +++++++
>  27 files changed, 2821 insertions(+), 52 deletions(-)
>  create mode 100644 Documentation/driver-api/vfio-pci-cxl.rst
>  create mode 100644 drivers/cxl/core/passthrough.c
>  create mode 100644 drivers/vfio/pci/cxl/Kconfig
>  create mode 100644 drivers/vfio/pci/cxl/Makefile
>  create mode 100644 drivers/vfio/pci/cxl/vfio_cxl_core.c
>  create mode 100644 drivers/vfio/pci/cxl/vfio_cxl_priv.h
>  create mode 100644 include/cxl/passthrough.h
>  create mode 100644 include/uapi/cxl/cxl_regs.h
>  create mode 100644 tools/testing/selftests/vfio/vfio_cxl_type2_test.c
> 
> base-commit: 90cf2e0d702c8a132ccbe72e7687f33c04c14658
> -- 
> 2.25.1
> 
> 
