public inbox for kvm@vger.kernel.org
* [RFC 0/9] QEMU: CXL Type-2 device passthrough via vfio-pci
@ 2026-04-27 18:12 mhonap
From: mhonap @ 2026-04-27 18:12 UTC
  To: alwilliamson, skolothumtho, ankita, mst, imammedo, anisinha,
	eric.auger, peter.maydell, shannon.zhaosl, jonathan.cameron,
	fan.ni, pbonzini, richard.henderson, marcel.apfelbaum, clg,
	cohuck, dan.j.williams, dave.jiang, alejandro.lucero-palau
  Cc: vsethi, cjia, targupta, zhiw, kjaju, linux-cxl, kvm, qemu-devel,
	qemu-arm, Manish Honap

From: Manish Honap <mhonap@nvidia.com>

This series adds QEMU-side support for passing CXL Type-2 devices
(GPUs and accelerators with host-managed device memory) to VMs via
vfio-pci.

It pairs with the kernel series "vfio/pci: CXL Type-2 passthrough"[1]
posted to the vfio mailing list. Patches 3-7 depend on that kernel series
and do nothing useful without it. I am new to QEMU development, so please
point me in the right direction wherever the infrastructure choices here
are not the correct ones.

Background
----------

CXL Type-2 devices expose device memory (CXL.mem) through HDM decoders.
The kernel vfio-pci driver shadows the HDM Decoder Capability registers
so userspace can observe and control decoder commits without touching
the hardware register page directly.

Without this series, the guest never sees the device memory range and
the HDM decoder goes unconfigured. The device shows up but its memory
is unreachable.

Design decisions
----------------

CXL.mem is exposed to the guest as a dedicated GPA window declared in ACPI
(CEDT/CFMWS) rather than as a PCI BAR. The HDM decoder BASE must match the
CFMWS base and remain stable, whereas BAR addresses may be reassigned during
enumeration. A separate VIRT_HIGH_CXL_MMIO window in the ARM virt memory map
carries this GPA range, independent of the existing PCIe MMIO regions.
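To make the stability and alignment constraint concrete, here is a rough C
sketch of placing such a window in a high memory map. It assumes the 256 MiB
granularity that CFMWS entries and HDM decoder base/size registers use;
place_cxl_window() and the example addresses are hypothetical and are not
the code in patch 1.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: CFMWS windows and HDM decoder base/size use 256 MiB granules. */
#define CXL_WINDOW_GRANULE (256ULL << 20)

/* Round addr up to the next multiple of align (align must be a power of two). */
static uint64_t align_up(uint64_t addr, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (addr + align - 1) & ~(align - 1);
}

/*
 * Hypothetical helper: place a CXL window of 'size' bytes at the first
 * 256 MiB boundary at or after 'prev_end'. Returns 0 and sets *base on
 * success, or -1 if size is not a non-zero multiple of the granule.
 */
static int place_cxl_window(uint64_t prev_end, uint64_t size, uint64_t *base)
{
    if (size == 0 || size % CXL_WINDOW_GRANULE != 0) {
        return -1;
    }
    *base = align_up(prev_end, CXL_WINDOW_GRANULE);
    return 0;
}
```

Because the base comes from the fixed memory map rather than from BAR
allocation, the same value can be published in CFMWS and later programmed
into the HDM decoder.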

The Component Register BAR contains two distinct ranges. Accelerator
register windows are passed through as direct hardware mmaps via
VFIO_REGION_INFO_CAP_SPARSE_MMAP. The HDM Decoder Capability block is
excluded from that sparse list by the kernel and must be intercepted by
QEMU to track decoder state. A single COMP_REGS overlay, added at priority 1
at hdm_regs_offset inside the BAR container, takes precedence over any
lower-priority hardware-backed alias at the same offset, so no per-window
aliasing is required.
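The precedence rule can be illustrated with a toy model in C. This is not
QEMU's memory API: in QEMU the overlay would presumably be added with
memory_region_add_subregion_overlap() at a higher priority than the
mmap-backed subregions. SubRegion and resolve() are hypothetical and only
model "the highest-priority subregion covering an address wins".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a subregion: [offset, offset + size) at a priority. */
typedef struct {
    const char *name;
    uint64_t offset;
    uint64_t size;
    int priority;
} SubRegion;

/*
 * Return the subregion that services 'addr' inside the container: among all
 * subregions covering it, the one with the highest priority. In this model a
 * tie keeps the earlier entry. Returns NULL if nothing covers 'addr'.
 */
static const SubRegion *resolve(const SubRegion *regs, size_t n, uint64_t addr)
{
    const SubRegion *best = NULL;

    assert(regs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        const SubRegion *r = &regs[i];
        if (addr >= r->offset && addr - r->offset < r->size) {
            if (best == NULL || r->priority > best->priority) {
                best = r;
            }
        }
    }
    return best;
}
```

With a whole-BAR hardware mapping at priority 0 and a COMP_REGS overlay at
priority 1 over the HDM block, accesses inside the HDM range resolve to the
overlay and everything else falls through to the hardware mapping.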

The guest has no way to change the host physical mapping behind the CXL.mem
window, so QEMU itself programs decoder 0 with the CFMWS base, through the
kernel's COMP_REGS shadow, at machine_done time: after all devices are
realized and before the guest starts. The machine_done notifier is
registered only for devices that the kernel reports as firmware-committed
(VFIO_CXL_CAP_FIRMWARE_COMMITTED).

The CXL.mem MemoryRegion is an mmap-based RAM-device region whose backing is
a VM_IO|VM_PFNMAP VMA. When it is added to system_memory, the VFIO
MemoryListener would attempt an IOMMU DMA mapping for it, which always
fails because pin_user_pages() refuses VM_IO pages. No IOMMU mapping is
needed for these regions: CPU access goes through KVM Stage-2 page faults,
and device DMA to RAM uses separate per-RAM-section IOMMU entries. The
listener is therefore extended to skip the mapping attempt for VFIO-owned
RAM-device regions.

pxb-cxl bridges currently have no _DSM method. Without _DSM function 5
("Ignore PCI Boot Configuration"), the OS treats the firmware's PCI
resource assignments as reassignable. On machines with firmware-committed
HDM decoders, such reassignment breaks the CXL.mem mapping, so a _DSM is
added with preserve_config=true on ARM and preserve_config=false on x86.

Known issues:
- bios-tables-test will fail because of the _DSM addition; a fix will
  follow in a later revision.
- Devices reporting VFIO_CXL_CAP_CACHE_CAPABLE need additional handling
  that is not yet implemented.
- Devices with multiple firmware-committed HDM decoders are not fully
  supported.
- Non-firmware-committed devices are not supported.
- The linux-headers sync (patch 3) is manual and temporary; once the
  kernel series is merged, it will be replaced with a script-generated
  update.

[1] https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com

Manish Honap (9):
  hw/arm/virt: Add CXL FMWS PA window for device memory
  cxl: Add preserve_config to pxb-cxl OSC method
  linux-headers: Update vfio.h for CXL Type-2 device passthrough
  hw/vfio/region: Add vfio_region_setup_with_ops() for custom region ops
  hw/vfio/pci: Add CXL Type-2 device detection and region setup
  hw/vfio/pci: Wire CXL component-register BAR with COMP_REGS overlay
  hw/vfio+cxl: Program HDM decoder 0 at machine_done for
    firmware-committed devices
  hw/arm/smmu-common: Allow pxb-cxl as SMMUv3 primary bus
  vfio/listener: Skip DMA mapping for VFIO-owned RAM-device regions

 hw/acpi/cxl-stub.c         |   2 +-
 hw/acpi/cxl.c              |   4 +-
 hw/arm/smmu-common.c       |  17 +-
 hw/arm/virt-acpi-build.c   |   5 +
 hw/arm/virt.c              |   7 +
 hw/cxl/cxl-host-stubs.c    |   2 +
 hw/cxl/cxl-host.c          |   8 +
 hw/i386/acpi-build.c       |   2 +-
 hw/pci-host/gpex-acpi.c    |  43 +++-
 hw/vfio/listener.c         |  14 ++
 hw/vfio/pci.c              | 411 +++++++++++++++++++++++++++++++++++++
 hw/vfio/pci.h              |  15 ++
 hw/vfio/region.c           |  15 +-
 hw/vfio/trace-events       |   6 +
 hw/vfio/vfio-region.h      |   3 +
 include/hw/acpi/cxl.h      |   2 +-
 include/hw/arm/virt.h      |   2 +
 include/hw/cxl/cxl_host.h  |  10 +
 include/hw/pci-host/gpex.h |   2 +
 linux-headers/linux/vfio.h |  18 ++
 20 files changed, 570 insertions(+), 18 deletions(-)

--
2.25.1

