From: Richard Cheng <icheng@nvidia.com>
To: mhonap@nvidia.com
Cc: djbw@kernel.org, alex@shazbot.org, jgg@ziepe.ca,
jic23@kernel.org, dave.jiang@intel.com, ankita@nvidia.com,
alejandro.lucero-palau@amd.com, alison.schofield@intel.com,
dave@stgolabs.net, dmatlack@google.com, gourry@gourry.net,
ira.weiny@intel.com, cjia@nvidia.com, kjaju@nvidia.com,
vsethi@nvidia.com, zhiw@nvidia.com, kvm@vger.kernel.org,
linux-cxl@vger.kernel.org, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org
Subject: Re: [PATCH v3 00/11] vfio/pci: Add CXL Type-2 device passthrough support
Date: Fri, 26 Jun 2026 17:16:54 +0800
Message-ID: <aj5Brc9beJhsDdJr@MWDK4CY14F>
In-Reply-To: <20260625165407.1769572-1-mhonap@nvidia.com>
On Thu, Jun 25, 2026 at 10:23:56PM +0800, mhonap@nvidia.com wrote:
> From: Manish Honap <mhonap@nvidia.com>
>
> CXL Type-2 accelerators (CXL.mem-capable GPUs and similar) cannot be
> passed through to virtual machines with stock vfio-pci because the
> driver has no concept of HDM decoder management, HDM region exposure,
> or component register virtualization. This series adds those three
> pieces, sufficient for a guest to use the device's firmware-committed
> coherent memory under UVM / ATS.
>
> v3 is a rewrite of the v2 framework form, responding to Dan's request
> in the v2 review for "less emulation, narrower interfaces, and a
> closer mapping to the spec language."
> In this release, cxl-core exposes four EXPORT_SYMBOL_GPL helpers behind
> an opaque handle. vfio-pci becomes a thin transport on top of those.
> Please see "Changes since v2" and "Reviewer feedback addressed" below for
> the per-area summary.
>
Hi Manish,
Thanks for the work. I ran some tests with your patches applied on a real
CXL Type-2 device, a GPU with a firmware-committed HDM decoder. I want to
report the result early: the acquire path works, but the first CPU access
to the mapped HDM region crashes the host.
The device BDF is 0002:81:00.0, with CXLCtl: Cache+ IO+ Mem+ and a
firmware-committed HDM decoder.
Binding the device to vfio-pci brought the CXL Type-2 path up cleanly:
"""
# modprobe vfio-pci
# echo vfio-pci > /sys/bus/pci/devices/0002:81:00.0/driver_override
# echo 0002:81:00.0 > /sys/bus/pci/drivers_probe
"""
A mem0/endpoint19/region1 appeared, and the device_is_cxl selftest passed.
Running the selftest from patch 9:
"""
# sudo ./vfio_cxl_type2_test 0002:81:00.0
ok 1 cxl_type2.device_is_cxl
# RUN cxl_type2.hdm_region_mmap_rw
"""
At this point, the machine hung and then crashed.
hdm_region_mmap_rw mmaps the HDM region and does a CPU read/write to it. That
access never returned. I couldn't capture dmesg or a trace before the crash.
I'm not sure if this is a platform/FW issue or something in how the region
is mapped.
Have you exercised hdm_region_mmap_rw on real hardware, or only against the
cxl_test mock?
If a guest can hang the host just by touching its mapped memory, that needs
to be fixed.
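For anyone trying to reproduce this, the access pattern is roughly the following. This is only a userspace sketch of "mmap one page, write, read back", not the selftest code itself; mmap_rw_check() is a hypothetical name, and with VFIO the fd and offset would presumably come from the device fd and VFIO_DEVICE_GET_REGION_INFO for the HDM region index.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: map one page of fd at offset, write a pattern to the
 * first dword and read it back.
 * Returns 0 if the pattern reads back, 1 on mismatch, -1 if mmap() fails.
 */
int mmap_rw_check(int fd, off_t offset, uint32_t pattern)
{
	long page = sysconf(_SC_PAGESIZE);
	volatile uint32_t *p;
	void *map;
	int ret;

	map = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, offset);
	if (map == MAP_FAILED) {
		perror("mmap");
		return -1;
	}

	/* volatile so the store and the load both reach the mapping */
	p = map;
	p[0] = pattern;
	ret = (p[0] == pattern) ? 0 : 1;

	munmap(map, page);
	return ret;
}
```

On the failing machine it is the first store or load through the mapping that never returns, so the hang would sit on the p[0] lines above.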
Best regards,
Richard Cheng.
> Motivation
> ==========
>
> A CXL Type-2 device exposes its HDM-mapped device memory through HDM
> decoders that BIOS programs and commits at boot. To pass such a
> device to a guest, vfio-pci has to do three things at once:
>
> 1. Surface the firmware-committed HDM-mapped HPA range as a guest-
> mmappable region.
>
> 2. Surface a CXL-spec-compliant view of the CXL Device DVSEC body,
> the HDM Decoder Capability block, and the CXL.cache/mem cap-array
> prefix, so the guest's CXL driver enumerates the same topology
> the host saw.
>
> 3. Keep the host's committed decoder configuration intact (the
> physical decoder is never reprogrammed) while letting the guest
> observe and manage a shadow that follows the per-field write
> semantics in the spec.
>
> The series builds on Alejandro Lucero-Palau's v28 work
> applied on for-7.3/cxl-type2-enabling [1] (sfc is the in-tree consumer
> today). vfio-pci becomes the second consumer.
>
> Architecture
> ============
>
> cxl-core owns the CXL semantics. A new file
> drivers/cxl/core/passthrough.c (gated by hidden Kconfig
> CXL_VFIO_PASSTHROUGH) provides four exported symbols:
>
> struct cxl_passthrough *
> devm_cxl_passthrough_create(struct device *dev,
> struct cxl_dev_state *cxlds);
>
> int cxl_passthrough_dvsec_rw(p, off, val, sz, write);
> int cxl_passthrough_hdm_rw (p, off, val, write);
> int cxl_passthrough_cm_rw (p, off, val, write);
>
> cxl_passthrough is an opaque handle; vfio-pci sees no cxl-internal
> struct pointers. The shadows are snapshotted at create time: the
> DVSEC body from PCI config space dword by dword, the CM cap-array and
> HDM block from the cxl-core MMIO mapping at cxlds->reg_map.base.
> Per-field write semantics follow below:
> CXL r4.0 8.1.3 DVSEC:
> - LOCK is RWO,
> - CONTROL/CONTROL2 are RWL gated on CONFIG_LOCK,
> - STATUS/STATUS2 are RW1C,
> - RANGE1 is HwInit, RANGE2 is RsvdZ
> CXL r4.0 8.2.4.20 HDM:
> - GLOBAL_CTRL RW,
> - decoder CTRL implements COMMIT/COMMITTED,
> - decoder BASE/SIZE RWL gated on COMMITTED or LOCK_ON_COMMIT,
> - cap header HwInit.
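As a rough illustration of the attribute types listed above (RW, RWL, RWO, RW1C, with HwInit and RsvdZ bits simply left out of every mask), here is a small mask-based model of a shadow register write. This is a sketch under my own assumptions, not the passthrough.c code: struct shadow_reg, shadow_write() and all mask values are hypothetical, and the masks are assumed not to overlap.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical shadow register; masks are illustrative and must be disjoint. */
struct shadow_reg {
	uint32_t val;       /* current shadow value */
	uint32_t rw_mask;   /* plain read-write bits */
	uint32_t rwl_mask;  /* read-write until the lock is set */
	uint32_t rwo_mask;  /* write-once bits */
	uint32_t rw1c_mask; /* write-1-to-clear bits */
	uint32_t rwo_done;  /* write-once bits already consumed */
};

/* Apply one guest write to the shadow. Bits in no mask are never changed. */
void shadow_write(struct shadow_reg *r, uint32_t wval, int locked)
{
	uint32_t rwo_open = r->rwo_mask & ~r->rwo_done;
	uint32_t writable = r->rw_mask | rwo_open;

	if (!locked)
		writable |= r->rwl_mask;

	/* RW, unlocked RWL and not-yet-written RWO bits take the new value */
	r->val = (r->val & ~writable) | (wval & writable);

	/* RW1C: writing 1 clears the bit, writing 0 leaves it alone */
	r->val &= ~(wval & r->rw1c_mask);

	/* the first write uses up the write-once bits */
	r->rwo_done |= rwo_open;
}
```

With this model a second write after the lock is set can still change RW bits, but leaves the RWL and RWO fields at the values latched by the first write.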
>
> vfio-pci becomes a thin transport. The new module
> drivers/vfio/pci/cxl/ exposes two VFIO regions.
>
> VFIO_REGION_SUBTYPE_CXL (HDM region): mmappable view of the
> HDM-mapped HPA. The mmap fault handler calls vmf_insert_pfn() from
> the physical HPA. pread/pwrite go through the memremap_wb() kva
> captured at bind time.
>
> VFIO_REGION_SUBTYPE_CXL_COMP_REGS (component register shadow):
> pread/pwrite only, dword-aligned (-EINVAL on misalignment).
> Each dword dispatches by offset to cxl_passthrough_cm_rw() or
> cxl_passthrough_hdm_rw(). No shadow state on the vfio side; cxl-core
> enforces the spec.
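The routing described here could look something like the sketch below: reject anything that is not one aligned dword, then pick the CM or HDM helper by offset. The window offsets, enum cxl_target and comp_regs_route() are hypothetical names and values for illustration; the real layout would come from the device's cap array, and returning -EINVAL for an offset outside both windows is my assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout of the shadowed component register sub-range. */
#define SKETCH_CM_START  0x1000u /* CM cap-array window */
#define SKETCH_HDM_START 0x1400u /* HDM decoder block */
#define SKETCH_REGS_END  0x1800u

enum cxl_target { TARGET_CM, TARGET_HDM };

/*
 * Decide which helper a single dword access belongs to and compute the
 * offset relative to the start of that block.
 */
int comp_regs_route(uint64_t off, size_t len, enum cxl_target *target,
		    uint32_t *rel)
{
	/* only whole, naturally aligned dwords are accepted */
	if ((off & 3) || len != 4)
		return -EINVAL;

	if (off >= SKETCH_HDM_START && off < SKETCH_REGS_END) {
		*target = TARGET_HDM;
		*rel = (uint32_t)(off - SKETCH_HDM_START);
		return 0;
	}
	if (off >= SKETCH_CM_START && off < SKETCH_HDM_START) {
		*target = TARGET_CM;
		*rel = (uint32_t)(off - SKETCH_CM_START);
		return 0;
	}

	/* assumption: accesses outside both windows are rejected */
	return -EINVAL;
}
```

A longer pread/pwrite would then be a loop over dwords, each routed this way and forwarded to cxl_passthrough_cm_rw() or cxl_passthrough_hdm_rw().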
>
> CXL DVSEC config-space accesses use a clipping shim in
> vfio_pci_config_rw_single(). A config-space chunk that crosses the
> DVSEC body boundary is split: header bytes go through the generic
> perm-bits path, body bytes go through cxl_passthrough_dvsec_rw().
> The shim replaces v2's approach of repointing ecap_perms[].
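To make the splitting concrete, here is a sketch of clipping one config access against the DVSEC body range. It is not the vfio_pci_config_rw_single() change; struct cfg_chunk, clip_access() and the example offsets are hypothetical. Each resulting chunk is flagged for either the generic perm-bits path or the cxl_passthrough_dvsec_rw() path.

```c
#include <assert.h>
#include <stdint.h>

/* One piece of a config-space access after clipping (hypothetical type). */
struct cfg_chunk {
	uint32_t pos;  /* config-space offset of the piece */
	uint32_t len;  /* length in bytes */
	int to_cxl;    /* 1: DVSEC body path, 0: generic perm-bits path */
};

/*
 * Split [pos, pos + count) at the edges of the DVSEC body
 * [body_start, body_end). At most three chunks can result:
 * before the body, inside it, and after it. Returns the chunk count.
 */
int clip_access(uint32_t pos, uint32_t count, uint32_t body_start,
		uint32_t body_end, struct cfg_chunk out[3])
{
	int n = 0;

	while (count && n < 3) {
		int in_body = pos >= body_start && pos < body_end;
		uint32_t limit, len;

		if (in_body)
			limit = body_end;      /* stop at the end of the body */
		else if (pos < body_start)
			limit = body_start;    /* stop where the body begins */
		else
			limit = pos + count;   /* past the body: no more edges */

		len = limit - pos < count ? limit - pos : count;
		out[n].pos = pos;
		out[n].len = len;
		out[n].to_cxl = in_body;
		n++;

		pos += len;
		count -= len;
	}
	return n;
}
```

For example, with a body at [0x10a, 0x140), a 4-byte access at 0x108 becomes two 2-byte chunks: the first goes to the generic path and the second to the DVSEC helper.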
>
> Sparse-mmap is exposed on the component BAR so userspace can mmap the
> non-component portions directly; only the CXL component register
> sub-range goes through pread/pwrite emulation. The CXL sub-range is
> also skipped from vfio_pci-core's request_selected_regions() set
> because cxl-core's devm_cxl_probe_mem() already holds a
> request_mem_region() on it; the asymmetric skip is matched by an
> asymmetric release on disable().
>
> Scope and out-of-scope
> ======================
>
> In scope (rejected at create time with -EOPNOTSUPP otherwise):
>
> - Firmware-committed devices (HOST_FIRMWARE_COMMITTED set).
> - Single HDM decoder (hdm_count == 1).
> - No interleave (IW == 0).
>
> Out of scope, deferred for follow-on work:
>
> - Multi-decoder devices and interleave.
> - Guest-driven (non-firmware-committed) HDM commit.
> - Hotplug, FLR, and sibling-function reset of CXL Type-2 devices.
>
> Changes since v2
> ================
>
> This is a rewrite, not an incremental update. The structure of the
> series changed (20 patches in v2 to 11 in v3) because v3 collapses
> v2 patches 9-15 (detection, HDM emulation, media readiness, region
> management, HDM region, DVSEC emulation) into one cxl-core helper
> file and one vfio-pci consumer.
>
> Framework replaced by narrow opaque-handle helpers (patches 6, 8)
>
> v2 carried a generic register-emulation framework split across four
> state-machine files in cxl-core.
> v3 collapses it into one file: drivers/cxl/core/passthrough.c
> exposing the four EXPORT_SYMBOL_GPL helpers above behind a struct
> cxl_passthrough opaque handle.
>
> Shadow ownership moved into cxl-core (patches 6, 8)
>
> vfio-pci no longer keeps any per-field state. It forwards
> (offset, value) into cxl-core, and cxl-core enforces the spec
> (RWO, RWL, RW1C, HwInit, RsvdZ) with explicit CXL r4.0 section
> references in the switch arms.
>
> DVSEC config-space clipping shim (patch 8)
>
> v2 repointed ecap_perms[] to redirect CXL DVSEC reads and writes.
> v3 keeps ecap_perms[] untouched and clips per-config-access chunks
> at the DVSEC body boundary in vfio_pci_config_rw_single(); header bytes
> go through the generic perm-bits path, body bytes go through
> cxl_passthrough_dvsec_rw(). The shim is local to the per-device
> path.
>
> CONFIG_VFIO_PCI_CXL gates the new module (patch 7)
>
> v2 had a CONFIG_VFIO_CXL_CORE Kconfig stub; v3 renames it to
> CONFIG_VFIO_PCI_CXL to match the vfio-pci naming convention.
> The hidden CXL_VFIO_PASSTHROUGH selects the cxl-core helper file
> on demand. With both disabled, the cxl-core size is unchanged.
>
> UAPI rewritten with named fields (patch 5)
>
> vfio_device_info_cap_cxl in v3 carries:
> flags + HOST_FIRMWARE_COMMITTED bit
> hdm_region_idx
> comp_reg_region_idx
> comp_reg_bar
> comp_reg_offset
> comp_reg_size
> The DPA terminology is renamed to HDM region throughout.
> CACHE_CAPABLE (HDM-DB indicator) is dropped;
> it was informational only in v2 with no caller, and can be re-added
> later with a series that actually plumbs CXL.cache.
>
> Selftests trimmed (patch 9)
>
> v2 carried selftests for device detection, capability parsing,
> region enumeration, HDM register emulation, HDM mmap with
> page-fault insertion, FLR invalidation, and DVSEC register
> emulation. v3 keeps a smoke-test set of six focused tests:
>
> device_is_cxl GET_INFO advertises FLAGS_CXL
> and a populated CAP_CXL.
> hdm_region_mmap_rw mmap one page, write+read back.
> component_bar_sparse_mmap SPARSE_MMAP cap excludes the
> CXL component register sub-range.
> comp_regs_cm_cap_array_read pread of the CM cap-array
> header at CXL_CM_OFFSET succeeds
> (CAP_ID == 1).
> dvsec_lock_byte_read pread of the DVSEC CONFIG_LOCK
> byte through the clipping shim
> succeeds.
> hdm_decoder_commit_fsm COMMIT / COMMITTED state machine
> and LOCK_ON_COMMIT behaviour.
>
> FLR invalidation, page-fault insertion under load, and full
> DVSEC field-by-field write coverage are deferred to a follow-on
> selftest series. The current six are the minimal set that
> exercises the kernel-side contract end-to-end.
>
> cxl-core prep patches split (patches 1-4)
>
> v3 keeps the cxl-side enablers from v2 patches 1-4 but each as
> a standalone change so the cxl maintainer can review the helper
> API independently of the vfio consumer:
>
> [1/11] cxl_get_hdm_info()
> [2/11] cxl_await_range_active() split from media-ready wait
> [3/11] cxl_register_map records BIR + BAR offset
> [4/11] component/HDM register defines moved to uapi/cxl/cxl_regs.h
>
> Reviewer feedback addressed
> ===========================
>
> Dan
> ---
>
> - VFIO exposes HDM/host-visible region, not raw DPA; docs/UAPI say HDM
> region, DPA only inside cxl-core where appropriate.
> - One vfio-pci device = one HDM region / one decoder, no interleave;
> hdm_count != 1 → -EOPNOTSUPP.
> - Global HDM on DVSEC Range Base treated as legacy; RANGE1/RANGE2
> read-only snapshot, guest writes dropped.
> - No guest/kernel lock games; DVSEC LOCK and HDM LOCK_ON_COMMIT RWO,
> fixed at create from firmware snapshot.
> - Opaque cxl_passthrough handle only; vfio gets HPA via memdev probe +
> layout via cxl_get_hdm_info(), rw via helpers.
> - No multi-region accelerator case in v3; single region enforced,
> multi-region deferred.
> - cxl_await_range_active stays in cxl-core probe; not exported, vfio does
> not call it.
> - No guest LOCK→0 reprogram; guest cannot clear LOCK to remap host HPA;
> kernel uncommit tied to COMMIT, not LOCK alone.
>
> Jason / Gregory / Dan
> ---------------------
>
> - memremap(WB) + request_mem_region on HPA; conflicting direct-map/EFI use
> fails probe with -EBUSY.
>
> Jonathan
> --------
>
> - uapi/cxl/cxl_regs.h for register defines so VMMs need no private
> kernel headers.
> - __free() locals on cxl-core/passthrough error paths instead of
> struct-owned temporaries.
> - No "precommitted at probe" assumption; acquire checks COMMITTED in
> HDM shadow and refuses if missing.
>
> Dave
> ----
>
> - memremap(MEMREMAP_WB) for HDM host mapping (not ioremap_cache).
> - Renamed cap flag to VFIO_CXL_CAP_HOST_FIRMWARE_COMMITTED for clarity.
> - __free() / DEFINE_FREE() cleanup in new passthrough.c create path.
>
> Patch series
> ============
>
> [1/11] cxl: Add cxl_get_hdm_info() helper for HDM decoder metadata
> [2/11] cxl: Split cxl_await_range_active() from media-ready wait
> [3/11] cxl: Record BIR and BAR offset in cxl_register_map
> [4/11] cxl: Move component/HDM register defines to
> uapi/cxl/cxl_regs.h
> [5/11] vfio: UAPI for CXL Type-2 device passthrough
> [6/11] cxl: Add register-virtualization helpers for vfio Type-2
> passthrough
> [7/11] vfio/pci: Add CONFIG_VFIO_PCI_CXL with bind-time CXL Type-2
> acquisition
> [8/11] vfio/pci/cxl: Add HDM + COMP_REGS regions and DVSEC clipping
> shim
> [9/11] selftests/vfio: Add CXL Type-2 device passthrough smoke test
> [10/11] docs: vfio-pci: Document CXL Type-2 device passthrough
> [11/11] vfio/pci: Provide opt-out for CXL Type-2 extensions
>
> Dependencies
> ============
>
> [1] [PATCH v28 0/5] Type2 device basic support
> https://lore.kernel.org/linux-cxl/20260618181806.118745-1-alejandro.lucero-palau@amd.com/
>
> [2] Previous version of this patch series
> [PATCH v2 00/20] vfio/pci: Add CXL Type-2 device passthrough support
> https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com/
>
> [3] Companion QEMU series
> [RFC 0/9] QEMU: CXL Type-2 device passthrough via vfio-pci
> https://lore.kernel.org/linux-cxl/20260427181235.3003865-1-mhonap@nvidia.com/
>
> Manish Honap (11):
> cxl: Add cxl_get_hdm_info() helper for HDM decoder metadata
> cxl: Split cxl_await_range_active() from media-ready wait
> cxl: Record BIR and BAR offset in cxl_register_map
> cxl: Move component/HDM register defines to uapi/cxl/cxl_regs.h
> vfio: UAPI for CXL Type-2 device passthrough
> cxl: Add register-virtualization helpers for vfio Type-2 passthrough
> vfio/pci: Add CONFIG_VFIO_PCI_CXL with bind-time CXL Type-2
> acquisition
> vfio/pci/cxl: Add HDM + COMP_REGS regions and DVSEC clipping shim
> selftests/vfio: Add CXL Type-2 device passthrough smoke test
> docs: vfio-pci: Document CXL Type-2 device passthrough
> vfio/pci: Provide opt-out for CXL Type-2 extensions
>
> Documentation/driver-api/index.rst | 1 +
> Documentation/driver-api/vfio-pci-cxl.rst | 282 ++++++
> drivers/cxl/Kconfig | 7 +
> drivers/cxl/core/Makefile | 1 +
> drivers/cxl/core/passthrough.c | 590 ++++++++++++
> drivers/cxl/core/pci.c | 70 +-
> drivers/cxl/core/regs.c | 35 +
> drivers/cxl/cxl.h | 52 +-
> drivers/vfio/pci/Kconfig | 2 +
> drivers/vfio/pci/Makefile | 1 +
> drivers/vfio/pci/cxl/Kconfig | 34 +
> drivers/vfio/pci/cxl/Makefile | 2 +
> drivers/vfio/pci/cxl/vfio_cxl_core.c | 889 ++++++++++++++++++
> drivers/vfio/pci/cxl/vfio_cxl_priv.h | 71 ++
> drivers/vfio/pci/vfio_pci.c | 9 +
> drivers/vfio/pci/vfio_pci_config.c | 31 +
> drivers/vfio/pci/vfio_pci_core.c | 68 +-
> drivers/vfio/pci/vfio_pci_priv.h | 93 ++
> drivers/vfio/pci/vfio_pci_rdwr.c | 17 +
> include/cxl/cxl.h | 18 +
> include/cxl/passthrough.h | 121 +++
> include/linux/vfio_pci_core.h | 8 +
> include/uapi/cxl/cxl_regs.h | 63 ++
> include/uapi/linux/vfio.h | 46 +
> tools/testing/selftests/vfio/Makefile | 1 +
> .../selftests/vfio/lib/vfio_pci_device.c | 11 +-
> .../selftests/vfio/vfio_cxl_type2_test.c | 350 +++++++
> 27 files changed, 2821 insertions(+), 52 deletions(-)
> create mode 100644 Documentation/driver-api/vfio-pci-cxl.rst
> create mode 100644 drivers/cxl/core/passthrough.c
> create mode 100644 drivers/vfio/pci/cxl/Kconfig
> create mode 100644 drivers/vfio/pci/cxl/Makefile
> create mode 100644 drivers/vfio/pci/cxl/vfio_cxl_core.c
> create mode 100644 drivers/vfio/pci/cxl/vfio_cxl_priv.h
> create mode 100644 include/cxl/passthrough.h
> create mode 100644 include/uapi/cxl/cxl_regs.h
> create mode 100644 tools/testing/selftests/vfio/vfio_cxl_type2_test.c
>
> base-commit: 90cf2e0d702c8a132ccbe72e7687f33c04c14658
> --
> 2.25.1
>
>