From: <mhonap@nvidia.com>
To: <alwilliamson@nvidia.com>, <skolothumtho@nvidia.com>,
<ankita@nvidia.com>, <mst@redhat.com>, <imammedo@redhat.com>,
<anisinha@redhat.com>, <eric.auger@redhat.com>,
<peter.maydell@linaro.org>, <shannon.zhaosl@gmail.com>,
<jonathan.cameron@huawei.com>, <fan.ni@samsung.com>,
<pbonzini@redhat.com>, <richard.henderson@linaro.org>,
<marcel.apfelbaum@gmail.com>, <clg@redhat.com>,
<cohuck@redhat.com>, <dan.j.williams@intel.com>,
<dave.jiang@intel.com>, <alejandro.lucero-palau@amd.com>
Cc: <vsethi@nvidia.com>, <cjia@nvidia.com>, <targupta@nvidia.com>,
<zhiw@nvidia.com>, <kjaju@nvidia.com>,
<linux-cxl@vger.kernel.org>, <kvm@vger.kernel.org>,
<qemu-devel@nongnu.org>, <qemu-arm@nongnu.org>,
"Manish Honap" <mhonap@nvidia.com>
Subject: [RFC 0/9] QEMU: CXL Type-2 device passthrough via vfio-pci
Date: Mon, 27 Apr 2026 23:42:26 +0530
Message-ID: <20260427181235.3003865-1-mhonap@nvidia.com>
From: Manish Honap <mhonap@nvidia.com>

This series adds QEMU-side support for passing CXL Type-2 devices
(GPUs and accelerators with host-managed device memory) to VMs via
vfio-pci.

It pairs with the kernel series "vfio/pci: CXL Type-2 passthrough"[1]
posted to the vfio mailing list. Patches 3-7 need that kernel series
present to do anything useful. I am new to QEMU development, so please
bear with me and point me in the right direction wherever the
infrastructure decisions are wrong.

Background
----------
CXL Type-2 devices expose device memory (CXL.mem) through HDM decoders.
The kernel vfio-pci driver shadows the HDM Decoder Capability registers
so userspace can observe and control decoder commits without touching
the hardware register page directly.

Without this series, the guest never sees the device memory range and
the HDM decoder goes unconfigured. The device shows up but its memory
is unreachable.

Design decisions
----------------
CXL.mem is exposed to the guest as a dedicated GPA window declared in ACPI
(CEDT/CFMWS) rather than as a PCI BAR. The HDM decoder base must match the
CFMWS base and remain stable, which a BAR cannot guarantee because BAR
assignment is not stable. A separate VIRT_HIGH_CXL_MMIO window in the ARM
virt memory map carries this GPA range, independent of the existing PCIe
MMIO slots.

The Component Register BAR contains two distinct ranges. Accelerator
register windows are passed through as direct hardware mmaps via
VFIO_REGION_INFO_CAP_SPARSE_MMAP. The HDM Decoder Capability block is
excluded from that sparse list by the kernel and must be intercepted by
QEMU to track decoder state. A single priority-1 COMP_REGS overlay
placed at hdm_regs_offset inside the BAR container wins over any
hardware-backed alias at the same offset, with no per-window aliasing
required.
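
To illustrate the overlay rule described above, here is a minimal,
self-contained C model of priority-based resolution inside a container.
It is not QEMU code: ToyRegion, toy_bar, toy_resolve(), and the offsets
and sizes are hypothetical, chosen only to show why one higher-priority
overlay is enough.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a subregion of the BAR container (not QEMU's MemoryRegion). */
typedef struct {
    const char *name;
    uint64_t offset;    /* offset inside the BAR container */
    uint64_t size;      /* length in bytes */
    int priority;       /* higher priority wins where ranges overlap */
} ToyRegion;

/* Hypothetical layout: the HDM block sits at 0x1000 inside a 64 KiB BAR. */
static const ToyRegion toy_bar[] = {
    { "hw-mmap-alias",     0x0000, 0x10000, 0 },
    { "comp-regs-overlay", 0x1000, 0x0200,  1 },
};

/* Return the highest-priority region covering addr, or NULL if none does. */
static const ToyRegion *toy_resolve(const ToyRegion *r, size_t n, uint64_t addr)
{
    const ToyRegion *best = NULL;

    for (size_t i = 0; i < n; i++) {
        if (addr < r[i].offset || addr - r[i].offset >= r[i].size) {
            continue;   /* addr is outside this region */
        }
        if (best == NULL || r[i].priority > best->priority) {
            best = &r[i];
        }
    }
    return best;
}
```

With this layout an access at 0x1000 resolves to the overlay, while an
access at 0x1200 falls through to the hardware-backed alias. In QEMU the
corresponding mechanism is memory_region_add_subregion_overlap(), which
takes the priority as its last argument.
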
The guest has no mechanism to remap host physical mappings. QEMU programs
decoder 0 with the CFMWS base through the kernel's COMP_REGS shadow at
machine_done time, after all devices are realized and before the guest starts.
The notifier is registered only for devices the kernel reports as
firmware-committed (VFIO_CXL_CAP_FIRMWARE_COMMITTED).
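
As a rough sketch of what programming decoder 0 with the CFMWS base
involves, the helper below splits a 64-bit base into the two 32-bit HDM
decoder base registers. It assumes the register layout in which the low
register carries address bits 31:28 (so the base is 256 MiB aligned) and
the high register carries bits 63:32. ToyHdmBase and
toy_hdm_encode_base() are hypothetical names, and the write through the
kernel's COMP_REGS shadow is left out because that interface is defined
by the kernel series.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed granularity of the HDM decoder base: 256 MiB. */
#define TOY_HDM_ALIGN (256ULL << 20)

/* Hypothetical holder for the two base register values of one decoder. */
typedef struct {
    uint32_t base_lo;   /* address bits 31:28, lower bits zero */
    uint32_t base_hi;   /* address bits 63:32 */
} ToyHdmBase;

/* Split gpa into low/high register values; reject unaligned bases. */
static bool toy_hdm_encode_base(uint64_t gpa, ToyHdmBase *out)
{
    if (gpa & (TOY_HDM_ALIGN - 1)) {
        return false;   /* not 256 MiB aligned, cannot be encoded */
    }
    out->base_lo = (uint32_t)(gpa & 0xF0000000u);
    out->base_hi = (uint32_t)(gpa >> 32);
    return true;
}
```

Rejecting an unaligned base up front keeps a misconfigured CFMWS window
from being silently truncated when it is written to the decoder.
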
The CXL.mem MemoryRegion is a mmap-backed RAM-device region backed by a
VM_IO|VM_PFNMAP VMA. The VFIO MemoryListener would attempt an IOMMU
DMA mapping for it when it is added to system_memory, which always
fails: pin_user_pages() refuses VM_IO pages. No IOMMU mapping is needed
for these regions - CPU access goes via KVM Stage-2 page faults and
device DMA to RAM uses separate per-RAM-section IOMMU entries. The
listener is extended to skip the mapping attempt for VFIO-owned
RAM-device regions.
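
The skip condition can be sketched as a small predicate. This is a toy
model only: ToySection and its fields are hypothetical, and how the real
listener identifies a VFIO-owned region is decided by patch 9, not by
this sketch.

```c
#include <stdbool.h>

/* Hypothetical summary of a memory section seen by the listener. */
typedef struct {
    bool is_ram;          /* ordinary guest RAM */
    bool is_ram_device;   /* mmap-backed device memory (VM_IO|VM_PFNMAP) */
    bool owned_by_vfio;   /* region was created by a VFIO device */
} ToySection;

/* Decide whether the listener should attempt an IOMMU DMA mapping. */
static bool toy_should_dma_map(const ToySection *s)
{
    if (s->is_ram_device && s->owned_by_vfio) {
        /* Pinning VM_IO pages would fail, and no mapping is needed. */
        return false;
    }
    return s->is_ram || s->is_ram_device;
}
```

Ordinary guest RAM and RAM-device regions owned by other devices keep
their existing behaviour in this model; only the VFIO-owned RAM-device
case is skipped.
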
pxb-cxl bridges currently have no _DSM method. Without _DSM function 5
the OS defaults to treating the PCI configuration as reassignable. On
machines with firmware-committed HDM decoders that reassignment breaks
the CXL.mem mapping, so a _DSM is added with preserve_config=true for
ARM and preserve_config=false for x86.

Known issues:
- The bios-tables test will fail due to the _DSM addition.
  A fix will be provided in a follow-up round.
- VFIO_CXL_CAP_CACHE_CAPABLE will require additional handling.
- Devices with multiple firmware-committed HDM decoders are not fully
  supported.
- Non-firmware-committed devices are not supported.
- The linux-headers sync is manual and temporary; once the kernel series
  is merged, patch 3 will be replaced with a script-generated update.

[1] https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com

Manish Honap (9):
  hw/arm/virt: Add CXL FMWS PA window for device memory
  cxl: Add preserve_config to pxb-cxl OSC method
  linux-headers: Update vfio.h for CXL Type-2 device passthrough
  hw/vfio/region: Add vfio_region_setup_with_ops() for custom region ops
  hw/vfio/pci: Add CXL Type-2 device detection and region setup
  hw/vfio/pci: Wire CXL component-register BAR with COMP_REGS overlay
  hw/vfio+cxl: Program HDM decoder 0 at machine_done for
    firmware-committed devices
  hw/arm/smmu-common: Allow pxb-cxl as SMMUv3 primary bus
  vfio/listener: Skip DMA mapping for VFIO-owned RAM-device regions

 hw/acpi/cxl-stub.c         |   2 +-
 hw/acpi/cxl.c              |   4 +-
 hw/arm/smmu-common.c       |  17 +-
 hw/arm/virt-acpi-build.c   |   5 +
 hw/arm/virt.c              |   7 +
 hw/cxl/cxl-host-stubs.c    |   2 +
 hw/cxl/cxl-host.c          |   8 +
 hw/i386/acpi-build.c       |   2 +-
 hw/pci-host/gpex-acpi.c    |  43 +++-
 hw/vfio/listener.c         |  14 ++
 hw/vfio/pci.c              | 411 +++++++++++++++++++++++++++++++++++++
 hw/vfio/pci.h              |  15 ++
 hw/vfio/region.c           |  15 +-
 hw/vfio/trace-events       |   6 +
 hw/vfio/vfio-region.h      |   3 +
 include/hw/acpi/cxl.h      |   2 +-
 include/hw/arm/virt.h      |   2 +
 include/hw/cxl/cxl_host.h  |  10 +
 include/hw/pci-host/gpex.h |   2 +
 linux-headers/linux/vfio.h |  18 ++
 20 files changed, 570 insertions(+), 18 deletions(-)
--
2.25.1