* [RFC 0/9] QEMU: CXL Type-2 device passthrough via vfio-pci
@ 2026-04-27 18:12 mhonap
2026-04-27 18:12 ` [RFC 1/9] hw/arm/virt: Add CXL FMWS PA window for device memory mhonap
` (8 more replies)
0 siblings, 9 replies; 10+ messages in thread
From: mhonap @ 2026-04-27 18:12 UTC (permalink / raw)
To: alwilliamson, skolothumtho, ankita, mst, imammedo, anisinha,
eric.auger, peter.maydell, shannon.zhaosl, jonathan.cameron,
fan.ni, pbonzini, richard.henderson, marcel.apfelbaum, clg,
cohuck, dan.j.williams, dave.jiang, alejandro.lucero-palau
Cc: vsethi, cjia, targupta, zhiw, kjaju, linux-cxl, kvm, qemu-devel,
qemu-arm, Manish Honap
From: Manish Honap <mhonap@nvidia.com>
This series adds QEMU-side support for passing CXL Type-2 devices
(GPUs and accelerators with host-managed device memory) to VMs via
vfio-pci.
It pairs with the kernel series "vfio/pci: CXL Type-2 passthrough"[1]
posted to the vfio mailing list. Patches 3-7 require that kernel series
to do anything useful. I am new to QEMU development, so please forgive
any missteps and point me in the right direction on infrastructure
decisions.
Background
----------
CXL Type-2 devices expose device memory (CXL.mem) through HDM decoders.
The kernel vfio-pci driver shadows the HDM Decoder Capability registers
so userspace can observe and control decoder commits without touching
the hardware register page directly.
Without this series, the guest never sees the device memory range and
the HDM decoder goes unconfigured. The device shows up but its memory
is unreachable.
Design decisions
----------------
CXL.mem is exposed to the guest as a dedicated GPA window declared in ACPI
(CEDT/CFMWS) rather than a PCI BAR. The HDM decoder BASE must match the
CFMWS base and remain stable; BAR assignment is not stable. A separate
VIRT_HIGH_CXL_MMIO window in the ARM virt memory map carries this GPA range,
independent of the existing PCIe MMIO slots.
The Component Register BAR contains two distinct ranges. Accelerator
register windows are passed through as direct hardware mmaps via
VFIO_REGION_INFO_CAP_SPARSE_MMAP. The HDM Decoder Capability block is
excluded from that sparse list by the kernel and must be intercepted by
QEMU to track decoder state. A single priority-1 COMP_REGS overlay
placed at hdm_regs_offset inside the BAR container wins over any
hardware-backed alias at the same offset, with no per-window aliasing
required.
The guest has no mechanism to remap host physical mappings. QEMU programs
decoder 0 with the CFMWS base through the kernel's COMP_REGS shadow at
machine_done time, after all devices are realized and before the guest starts.
The notifier is registered only for devices the kernel reports as
firmware-committed (VFIO_CXL_CAP_FIRMWARE_COMMITTED).
The CXL.mem MemoryRegion is a mmap-backed RAM-device region backed by a
VM_IO|VM_PFNMAP VMA. The VFIO MemoryListener would attempt an IOMMU
DMA mapping for it when it is added to system_memory, which always
fails: pin_user_pages() refuses VM_IO pages. No IOMMU mapping is needed
for these regions - CPU access goes via KVM Stage-2 page faults and
device DMA to RAM uses separate per-RAM-section IOMMU entries. The
listener is extended to skip the mapping attempt for VFIO-owned
RAM-device regions.
pxb-cxl bridges had no _DSM method. Without _DSM function 5 the OS
defaults to treating PCI configuration as reassignable.
On machines with firmware-committed HDM decoders that reassignment breaks
the CXL.mem mapping, so the _DSM is added with preserve_config=true for ARM and
false for x86.
Known issues:
- The bios-tables test will fail due to the _DSM addition.
A fix will be provided in a follow-up round.
- VFIO_CXL_CAP_CACHE_CAPABLE will require additional handling.
- Devices with multiple firmware-committed HDM decoders are not fully
supported.
- Non-firmware-committed devices are not supported.
- linux-headers sync is manual and temporary; once the kernel series is
merged, this patch will be replaced with a script-generated update.
[1] https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com
Manish Honap (9):
hw/arm/virt: Add CXL FMWS PA window for device memory
cxl: Add preserve_config to pxb-cxl OSC method
linux-headers: Update vfio.h for CXL Type-2 device passthrough
hw/vfio/region: Add vfio_region_setup_with_ops() for custom region ops
hw/vfio/pci: Add CXL Type-2 device detection and region setup
hw/vfio/pci: Wire CXL component-register BAR with COMP_REGS overlay
hw/vfio+cxl: Program HDM decoder 0 at machine_done for
firmware-committed devices
hw/arm/smmu-common: Allow pxb-cxl as SMMUv3 primary bus
vfio/listener: Skip DMA mapping for VFIO-owned RAM-device regions
hw/acpi/cxl-stub.c | 2 +-
hw/acpi/cxl.c | 4 +-
hw/arm/smmu-common.c | 17 +-
hw/arm/virt-acpi-build.c | 5 +
hw/arm/virt.c | 7 +
hw/cxl/cxl-host-stubs.c | 2 +
hw/cxl/cxl-host.c | 8 +
hw/i386/acpi-build.c | 2 +-
hw/pci-host/gpex-acpi.c | 43 +++-
hw/vfio/listener.c | 14 ++
hw/vfio/pci.c | 411 +++++++++++++++++++++++++++++++++++++
hw/vfio/pci.h | 15 ++
hw/vfio/region.c | 15 +-
hw/vfio/trace-events | 6 +
hw/vfio/vfio-region.h | 3 +
include/hw/acpi/cxl.h | 2 +-
include/hw/arm/virt.h | 2 +
include/hw/cxl/cxl_host.h | 10 +
include/hw/pci-host/gpex.h | 2 +
linux-headers/linux/vfio.h | 18 ++
20 files changed, 570 insertions(+), 18 deletions(-)
--
2.25.1
^ permalink raw reply [flat|nested] 10+ messages in thread
* [RFC 1/9] hw/arm/virt: Add CXL FMWS PA window for device memory
2026-04-27 18:12 [RFC 0/9] QEMU: CXL Type-2 device passthrough via vfio-pci mhonap
@ 2026-04-27 18:12 ` mhonap
2026-04-27 18:12 ` [RFC 2/9] cxl: Add preserve_config to pxb-cxl OSC method mhonap
` (7 subsequent siblings)
8 siblings, 0 replies; 10+ messages in thread
From: mhonap @ 2026-04-27 18:12 UTC (permalink / raw)
To: alwilliamson, skolothumtho, ankita, mst, imammedo, anisinha,
eric.auger, peter.maydell, shannon.zhaosl, jonathan.cameron,
fan.ni, pbonzini, richard.henderson, marcel.apfelbaum, clg,
cohuck, dan.j.williams, dave.jiang, alejandro.lucero-palau
Cc: vsethi, cjia, targupta, zhiw, kjaju, linux-cxl, kvm, qemu-devel,
qemu-arm, Manish Honap
From: Manish Honap <mhonap@nvidia.com>
CXL VFIO passthrough needs a stable guest physical address range for
device memory (DPA) that falls inside a CFMWS entry the guest discovers
from ACPI CEDT. Without a dedicated range in the address map, the HDM
decoder has nowhere to point.
Add VIRT_HIGH_CXL_MMIO immediately after the second PCIe MMIO window.
It gets its own highmem_cxl_mmio flag in VirtMachineState rather than
sharing highmem_cxl, so the two slots are independently controllable
even though both are currently tied to CXL bridge presence.
The base and size flow through GPEXConfig.cxl_mmio to
acpi_dsdt_add_gpex(), which carves out a QWord memory descriptor in the
first CXL root bridge's _CRS. The CFMWS window is system-wide, so only
the first CXL bridge gets the descriptor - subsequent ones would
produce duplicate resource claims for the same range.
build_crs() already emits the bridge's own 64-bit ranges into crs.
The CFMWS window is a separate system-wide range, so only that window
is appended as a new QWord descriptor; the bridge ranges are not
re-emitted. A warn_report() fires if the CFMWS window overlaps any
existing bridge 64-bit range, since that would indicate an address
layout conflict.
Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
hw/arm/virt-acpi-build.c | 5 +++++
hw/arm/virt.c | 9 +++++++++
hw/pci-host/gpex-acpi.c | 40 ++++++++++++++++++++++++++++++++++++++
include/hw/arm/virt.h | 2 ++
include/hw/pci-host/gpex.h | 1 +
5 files changed, 57 insertions(+)
diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
index 591cfc993c..863e0680fb 100644
--- a/hw/arm/virt-acpi-build.c
+++ b/hw/arm/virt-acpi-build.c
@@ -176,6 +176,11 @@ static void acpi_dsdt_add_pci(Aml *scope, const MemMapEntry *memmap,
cfg.mmio64 = memmap[VIRT_HIGH_PCIE_MMIO];
}
+ if (vms->highmem_cxl) {
+ cfg.cxl_mmio.base = memmap[VIRT_HIGH_CXL_MMIO].base;
+ cfg.cxl_mmio.size = memmap[VIRT_HIGH_CXL_MMIO].size;
+ }
+
acpi_dsdt_add_gpex(scope, &cfg);
QLIST_FOREACH(bus, &vms->bus->child, sibling) {
if (pci_bus_is_cxl(bus)) {
diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index ec0d8475ca..fa07819401 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -211,6 +211,8 @@ static const MemMapEntry base_memmap[] = {
#define DEFAULT_HIGH_PCIE_MMIO_SIZE_GB 512
#define DEFAULT_HIGH_PCIE_MMIO_SIZE (DEFAULT_HIGH_PCIE_MMIO_SIZE_GB * GiB)
+#define DEFAULT_HIGH_CXL_MMIO_SIZE DEFAULT_HIGH_PCIE_MMIO_SIZE
+
/*
* Highmem IO Regions: This memory map is floating, located after the RAM.
* Each MemMapEntry base (GPA) will be dynamically computed, depending on the
@@ -237,6 +239,11 @@ static MemMapEntry extended_memmap[] = {
[VIRT_HIGH_PCIE_ECAM] = { 0x0, 256 * MiB },
/* Second PCIe window */
[VIRT_HIGH_PCIE_MMIO] = { 0x0, DEFAULT_HIGH_PCIE_MMIO_SIZE },
+ /*
+ * CXL FMWS guest PA window - separate from PCIe MMIO so the two are
+ * independently sizeable. Same default size for now.
+ */
+ [VIRT_HIGH_CXL_MMIO] = { 0x0, DEFAULT_HIGH_CXL_MMIO_SIZE },
/* Any CXL Fixed memory windows come here */
};
@@ -1724,6 +1731,7 @@ static void create_cxl_host_reg_region(VirtMachineState *vms)
vms->memmap[VIRT_CXL_HOST].size);
memory_region_add_subregion(sysmem, vms->memmap[VIRT_CXL_HOST].base, mr);
vms->highmem_cxl = true;
+ vms->highmem_cxl_mmio = true;
}
static void create_platform_bus(VirtMachineState *vms)
@@ -1897,6 +1905,7 @@ static inline bool *virt_get_high_memmap_enabled(VirtMachineState *vms,
&vms->highmem_cxl,
&vms->highmem_ecam,
&vms->highmem_mmio,
+ &vms->highmem_cxl_mmio,
};
assert(ARRAY_SIZE(extended_memmap) - VIRT_LOWMEMMAP_LAST ==
diff --git a/hw/pci-host/gpex-acpi.c b/hw/pci-host/gpex-acpi.c
index d9820f9b41..7de57bbc46 100644
--- a/hw/pci-host/gpex-acpi.c
+++ b/hw/pci-host/gpex-acpi.c
@@ -7,6 +7,7 @@
#include "hw/pci/pci_bridge.h"
#include "hw/pci/pcie_host.h"
#include "hw/acpi/cxl.h"
+#include "qemu/error-report.h"
static void acpi_dsdt_add_pci_route_table(Aml *dev, uint32_t irq,
Aml *scope, uint8_t bus_num)
@@ -108,6 +109,7 @@ void acpi_dsdt_add_gpex(Aml *scope, struct GPEXConfig *cfg)
CrsRangeSet crs_range_set;
CrsRangeEntry *entry;
int i;
+ bool first_cxl = true;
/* start to construct the tables for pxb */
crs_range_set_init(&crs_range_set);
@@ -161,6 +163,44 @@ void acpi_dsdt_add_gpex(Aml *scope, struct GPEXConfig *cfg)
*/
crs = build_crs(PCI_HOST_BRIDGE(BUS(bus)->parent), &crs_range_set,
cfg->pio.base, 0, 0, 0);
+ if (is_cxl && first_cxl && cfg->cxl_mmio.size) {
+ uint64_t cfmws_end = cfg->cxl_mmio.base +
+ cfg->cxl_mmio.size - 1;
+
+ /*
+ * The CXL Fixed Memory Window (CFMWS) is a system-wide GPA
+ * range. Only the first CXL root bridge emits the QWord
+ * descriptor; adding it to every bridge would give the OS
+ * duplicate resource claims for the same range.
+ *
+ * build_crs() has already appended the bridge's own 64-bit
+ * ranges into crs. Do not copy them again here; only append
+ * the CFMWS window itself as a new QWord descriptor.
+ *
+ * Warn if the CFMWS window overlaps any range already claimed
+ * by the bridge; in the current address layout they should be
+ * disjoint, but catch it early if the layout ever changes.
+ */
+ for (i = 0; i < crs_range_set.mem_64bit_ranges->len; i++) {
+ entry = g_ptr_array_index(crs_range_set.mem_64bit_ranges,
+ i);
+ if (entry->base <= cfmws_end &&
+ entry->limit >= cfg->cxl_mmio.base) {
+ warn_report("CXL CFMWS [0x%"PRIx64"-0x%"PRIx64"] "
+ "overlaps CXL root bridge 64-bit range "
+ "[0x%"PRIx64"-0x%"PRIx64"]",
+ cfg->cxl_mmio.base, cfmws_end,
+ entry->base, entry->limit);
+ }
+ }
+ aml_append(crs,
+ aml_qword_memory(AML_POS_DECODE, AML_MIN_FIXED,
+ AML_MAX_FIXED, AML_NON_CACHEABLE, AML_READ_WRITE,
+ 0x0000, cfg->cxl_mmio.base, cfmws_end, 0x0000,
+ cfg->cxl_mmio.size));
+ first_cxl = false;
+ }
+
aml_append(dev, aml_name_decl("_CRS", crs));
if (is_cxl) {
diff --git a/include/hw/arm/virt.h b/include/hw/arm/virt.h
index 5fcbd1c76f..88bb3c0bdf 100644
--- a/include/hw/arm/virt.h
+++ b/include/hw/arm/virt.h
@@ -91,6 +91,7 @@ enum {
VIRT_CXL_HOST,
VIRT_HIGH_PCIE_ECAM,
VIRT_HIGH_PCIE_MMIO,
+ VIRT_HIGH_CXL_MMIO,
};
typedef enum VirtIOMMUType {
@@ -147,6 +148,7 @@ struct VirtMachineState {
bool highmem;
bool highmem_compact;
bool highmem_cxl;
+ bool highmem_cxl_mmio; /* VIRT_HIGH_CXL_MMIO window; follows highmem_cxl */
bool highmem_ecam;
bool highmem_mmio;
bool highmem_redists;
diff --git a/include/hw/pci-host/gpex.h b/include/hw/pci-host/gpex.h
index 1da9c85bce..a7c2e2edf3 100644
--- a/include/hw/pci-host/gpex.h
+++ b/include/hw/pci-host/gpex.h
@@ -43,6 +43,7 @@ struct GPEXConfig {
MemMapEntry mmio32;
MemMapEntry mmio64;
MemMapEntry pio;
+ MemMapEntry cxl_mmio;
int irq;
PCIBus *bus;
bool pci_native_hotplug;
--
2.25.1
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [RFC 2/9] cxl: Add preserve_config to pxb-cxl OSC method
2026-04-27 18:12 [RFC 0/9] QEMU: CXL Type-2 device passthrough via vfio-pci mhonap
2026-04-27 18:12 ` [RFC 1/9] hw/arm/virt: Add CXL FMWS PA window for device memory mhonap
@ 2026-04-27 18:12 ` mhonap
2026-04-27 18:12 ` [RFC 3/9] linux-headers: Update vfio.h for CXL Type-2 device passthrough mhonap
` (6 subsequent siblings)
8 siblings, 0 replies; 10+ messages in thread
From: mhonap @ 2026-04-27 18:12 UTC (permalink / raw)
To: alwilliamson, skolothumtho, ankita, mst, imammedo, anisinha,
eric.auger, peter.maydell, shannon.zhaosl, jonathan.cameron,
fan.ni, pbonzini, richard.henderson, marcel.apfelbaum, clg,
cohuck, dan.j.williams, dave.jiang, alejandro.lucero-palau
Cc: vsethi, cjia, targupta, zhiw, kjaju, linux-cxl, kvm, qemu-devel,
qemu-arm, Manish Honap
From: Manish Honap <mhonap@nvidia.com>
Before this patch, pxb-cxl bridges had no _DSM method at all. When the
OS called _DSM on a CXL host bridge, ACPI returned an error and the OS
defaulted to reassigning resources across suspend/resume. On machines
where firmware pre-commits the HDM decoder, that reassignment breaks the
DPA mapping.
Wire preserve_config through GPEXConfig into build_cxl_osc_method() so
pxb-cxl host bridges get a _DSM method that signals the OS to keep
resource assignments stable when needed. The _DSM function 5 (preserve
firmware PCI configuration) is the mechanism used to convey this.
build_pci_host_bridge_dsm_method() is promoted from static to exported
so cxl.c can call it without duplicating the AML.
The x86 build_cxl_osc_method() call site passes false since x86 does
not use firmware-committed HDM decoders.
build_cxl_osc_method() is renamed to acpi_dsdt_add_cxl_host_bridge_methods().
The function now appends both the CXL _OSC method and the _DSM method,
so the old name was misleading. The new name matches the pxb-pcie
analogue acpi_dsdt_add_host_bridge_methods(), making the two root
bridge code paths symmetric. No AML change.
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
hw/acpi/cxl-stub.c | 2 +-
hw/acpi/cxl.c | 4 +++-
hw/i386/acpi-build.c | 2 +-
hw/pci-host/gpex-acpi.c | 5 +++--
include/hw/acpi/cxl.h | 2 +-
include/hw/pci-host/gpex.h | 1 +
6 files changed, 10 insertions(+), 6 deletions(-)
diff --git a/hw/acpi/cxl-stub.c b/hw/acpi/cxl-stub.c
index 15bc21076b..d7c6731975 100644
--- a/hw/acpi/cxl-stub.c
+++ b/hw/acpi/cxl-stub.c
@@ -6,7 +6,7 @@
#include "hw/acpi/aml-build.h"
#include "hw/acpi/cxl.h"
-void build_cxl_osc_method(Aml *dev)
+void acpi_dsdt_add_cxl_host_bridge_methods(Aml *dev, bool preserve_config)
{
g_assert_not_reached();
}
diff --git a/hw/acpi/cxl.c b/hw/acpi/cxl.c
index f92f7fa3d5..b32740a3e3 100644
--- a/hw/acpi/cxl.c
+++ b/hw/acpi/cxl.c
@@ -23,6 +23,7 @@
#include "hw/pci/pci_host.h"
#include "hw/cxl/cxl.h"
#include "hw/cxl/cxl_host.h"
+#include "hw/pci-host/gpex.h"
#include "hw/mem/memory-device.h"
#include "hw/acpi/acpi.h"
#include "hw/acpi/aml-build.h"
@@ -320,11 +321,12 @@ static Aml *__build_cxl_osc_method(void)
return method;
}
-void build_cxl_osc_method(Aml *dev)
+void acpi_dsdt_add_cxl_host_bridge_methods(Aml *dev, bool preserve_config)
{
aml_append(dev, aml_name_decl("SUPP", aml_int(0)));
aml_append(dev, aml_name_decl("CTRL", aml_int(0)));
aml_append(dev, aml_name_decl("SUPC", aml_int(0)));
aml_append(dev, aml_name_decl("CTRC", aml_int(0)));
aml_append(dev, __build_cxl_osc_method());
+ aml_append(dev, build_pci_host_bridge_dsm_method(preserve_config));
}
diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index f622b91b76..f66ec8ed24 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -1013,7 +1013,7 @@ build_dsdt(GArray *table_data, BIOSLinker *linker,
aml_append(aml_pkg, aml_eisaid("PNP0A08"));
aml_append(aml_pkg, aml_eisaid("PNP0A03"));
aml_append(dev, aml_name_decl("_CID", aml_pkg));
- build_cxl_osc_method(dev);
+ acpi_dsdt_add_cxl_host_bridge_methods(dev, false);
} else if (pci_bus_is_express(bus)) {
aml_append(dev, aml_name_decl("_HID", aml_eisaid("PNP0A08")));
aml_append(dev, aml_name_decl("_CID", aml_eisaid("PNP0A03")));
diff --git a/hw/pci-host/gpex-acpi.c b/hw/pci-host/gpex-acpi.c
index 7de57bbc46..247bd78152 100644
--- a/hw/pci-host/gpex-acpi.c
+++ b/hw/pci-host/gpex-acpi.c
@@ -52,7 +52,7 @@ static void acpi_dsdt_add_pci_route_table(Aml *dev, uint32_t irq,
}
}
-static Aml *build_pci_host_bridge_dsm_method(bool preserve_config)
+Aml *build_pci_host_bridge_dsm_method(bool preserve_config)
{
Aml *method = aml_method("_DSM", 4, AML_NOTSERIALIZED);
Aml *UUID, *ifctx, *ifctx1, *buf;
@@ -204,7 +204,8 @@ void acpi_dsdt_add_gpex(Aml *scope, struct GPEXConfig *cfg)
aml_append(dev, aml_name_decl("_CRS", crs));
if (is_cxl) {
- build_cxl_osc_method(dev);
+ acpi_dsdt_add_cxl_host_bridge_methods(dev,
+ cfg->preserve_config);
} else {
/* pxb bridges do not have ACPI PCI Hot-plug enabled */
acpi_dsdt_add_host_bridge_methods(dev, true,
diff --git a/include/hw/acpi/cxl.h b/include/hw/acpi/cxl.h
index 8f22c71530..6fe6c9c58d 100644
--- a/include/hw/acpi/cxl.h
+++ b/include/hw/acpi/cxl.h
@@ -24,7 +24,7 @@
void cxl_build_cedt(GArray *table_offsets, GArray *table_data,
BIOSLinker *linker, const char *oem_id,
const char *oem_table_id, CXLState *cxl_state);
-void build_cxl_osc_method(Aml *dev);
+void acpi_dsdt_add_cxl_host_bridge_methods(Aml *dev, bool preserve_config);
void build_cxl_dsm_method(Aml *dev);
#endif
diff --git a/include/hw/pci-host/gpex.h b/include/hw/pci-host/gpex.h
index a7c2e2edf3..e5c2ebef78 100644
--- a/include/hw/pci-host/gpex.h
+++ b/include/hw/pci-host/gpex.h
@@ -73,6 +73,7 @@ struct GPEXHost {
int gpex_set_irq_num(GPEXHost *s, int index, int gsi);
void acpi_dsdt_add_gpex(Aml *scope, struct GPEXConfig *cfg);
+Aml *build_pci_host_bridge_dsm_method(bool preserve_config);
void acpi_dsdt_add_gpex_host(Aml *scope, uint32_t irq);
#define PCI_HOST_PIO_BASE "x-pio-base"
--
2.25.1
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [RFC 3/9] linux-headers: Update vfio.h for CXL Type-2 device passthrough
2026-04-27 18:12 [RFC 0/9] QEMU: CXL Type-2 device passthrough via vfio-pci mhonap
2026-04-27 18:12 ` [RFC 1/9] hw/arm/virt: Add CXL FMWS PA window for device memory mhonap
2026-04-27 18:12 ` [RFC 2/9] cxl: Add preserve_config to pxb-cxl OSC method mhonap
@ 2026-04-27 18:12 ` mhonap
2026-04-27 18:12 ` [RFC 4/9] hw/vfio/region: Add vfio_region_setup_with_ops() for custom region ops mhonap
` (5 subsequent siblings)
8 siblings, 0 replies; 10+ messages in thread
From: mhonap @ 2026-04-27 18:12 UTC (permalink / raw)
To: alwilliamson, skolothumtho, ankita, mst, imammedo, anisinha,
eric.auger, peter.maydell, shannon.zhaosl, jonathan.cameron,
fan.ni, pbonzini, richard.henderson, marcel.apfelbaum, clg,
cohuck, dan.j.williams, dave.jiang, alejandro.lucero-palau
Cc: vsethi, cjia, targupta, zhiw, kjaju, linux-cxl, kvm, qemu-devel,
qemu-arm, Manish Honap
From: Manish Honap <mhonap@nvidia.com>
Sync the VFIO UAPI additions from the kernel CXL Type-2 passthrough
series.
VFIO_DEVICE_FLAGS_CXL (bit 9) marks a device as CXL Type-2 and
guarantees the capability chain includes a vfio_device_info_cap_cxl
entry (cap id 6). That capability carries the BAR index holding the
CXL component registers, flags for firmware-committed and cache-capable
devices, the byte offset to the HDM Decoder Capability block within
that BAR, and region indices for both the DPA memory region and the
Component Register shadow.
Two new region subtypes:
VFIO_REGION_SUBTYPE_CXL (1): mmappable DPA memory
VFIO_REGION_SUBTYPE_CXL_COMP_REGS (2): HDM decoder shadow, r/w only
Note: UAPI headers are normally kept in sync via
scripts/update-linux-headers.sh once upstream kernel changes merge.
This patch manually adds the CXL Type-2 additions as a temporary
measure to unblock QEMU development. It should be dropped and
replaced with a proper header sync once the kernel series is accepted.
Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
linux-headers/linux/vfio.h | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)
diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
index 720edfee7a..62cd725a39 100644
--- a/linux-headers/linux/vfio.h
+++ b/linux-headers/linux/vfio.h
@@ -215,6 +215,7 @@ struct vfio_device_info {
#define VFIO_DEVICE_FLAGS_FSL_MC (1 << 6) /* vfio-fsl-mc device */
#define VFIO_DEVICE_FLAGS_CAPS (1 << 7) /* Info supports caps */
#define VFIO_DEVICE_FLAGS_CDX (1 << 8) /* vfio-cdx device */
+#define VFIO_DEVICE_FLAGS_CXL (1 << 9) /* vfio-cxl device */
__u32 num_regions; /* Max region index + 1 */
__u32 num_irqs; /* Max IRQ index + 1 */
__u32 cap_offset; /* Offset within info struct of first cap */
@@ -257,6 +258,19 @@ struct vfio_device_info_cap_pci_atomic_comp {
__u32 reserved;
};
+#define VFIO_DEVICE_INFO_CAP_CXL 6
+struct vfio_device_info_cap_cxl {
+ struct vfio_info_cap_header header; /* id=6, version=1 */
+ __u8 hdm_regs_bar_index; /* PCI BAR containing CXL component registers */
+ __u8 reserved[3];
+ __u32 flags; /* VFIO_CXL_CAP_* flags */
+#define VFIO_CXL_CAP_FIRMWARE_COMMITTED (1 << 0)
+#define VFIO_CXL_CAP_CACHE_CAPABLE (1 << 1)
+ __u64 hdm_regs_offset; /* byte offset within BAR to CXL.mem register area */
+ __u32 dpa_region_index; /* VFIO region index for DPA memory */
+ __u32 comp_regs_region_index; /* VFIO region index for COMP_REGS */
+};
+
/**
* VFIO_DEVICE_GET_REGION_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 8,
* struct vfio_region_info)
@@ -373,6 +387,10 @@ struct vfio_region_info_cap_type {
/* sub-types for VFIO_REGION_TYPE_GFX */
#define VFIO_REGION_SUBTYPE_GFX_EDID (1)
+/* sub-types for VFIO CXL regions */
+#define VFIO_REGION_SUBTYPE_CXL (1) /* DPA memory region */
+#define VFIO_REGION_SUBTYPE_CXL_COMP_REGS (2) /* HDM register shadow */
+
/**
* struct vfio_region_gfx_edid - EDID region layout.
*
--
2.25.1
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [RFC 4/9] hw/vfio/region: Add vfio_region_setup_with_ops() for custom region ops
2026-04-27 18:12 [RFC 0/9] QEMU: CXL Type-2 device passthrough via vfio-pci mhonap
` (2 preceding siblings ...)
2026-04-27 18:12 ` [RFC 3/9] linux-headers: Update vfio.h for CXL Type-2 device passthrough mhonap
@ 2026-04-27 18:12 ` mhonap
2026-04-27 18:12 ` [RFC 5/9] hw/vfio/pci: Add CXL Type-2 device detection and region setup mhonap
` (4 subsequent siblings)
8 siblings, 0 replies; 10+ messages in thread
From: mhonap @ 2026-04-27 18:12 UTC (permalink / raw)
To: alwilliamson, skolothumtho, ankita, mst, imammedo, anisinha,
eric.auger, peter.maydell, shannon.zhaosl, jonathan.cameron,
fan.ni, pbonzini, richard.henderson, marcel.apfelbaum, clg,
cohuck, dan.j.williams, dave.jiang, alejandro.lucero-palau
Cc: vsethi, cjia, targupta, zhiw, kjaju, linux-cxl, kvm, qemu-devel,
qemu-arm, Manish Honap
From: Manish Honap <mhonap@nvidia.com>
vfio_region_setup() always initializes the region MemoryRegion with
vfio_region_ops. CXL needs custom pread/pwrite ops for the Component
Register shadow region.
Add vfio_region_setup_with_ops() which accepts a const MemoryRegionOps *
parameter. When non-NULL it is passed to memory_region_init_io(); when
NULL the existing vfio_region_ops is used. vfio_region_setup() is
retained unchanged as a thin wrapper for all existing callers.
Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
hw/vfio/region.c | 15 ++++++++++++---
hw/vfio/vfio-region.h | 3 +++
2 files changed, 15 insertions(+), 3 deletions(-)
diff --git a/hw/vfio/region.c b/hw/vfio/region.c
index 0342ca712a..9bbe758d6f 100644
--- a/hw/vfio/region.c
+++ b/hw/vfio/region.c
@@ -228,8 +228,9 @@ static int vfio_setup_region_sparse_mmaps(VFIORegion *region,
return 0;
}
-int vfio_region_setup(Object *obj, VFIODevice *vbasedev, VFIORegion *region,
- int index, const char *name, Error **errp)
+int vfio_region_setup_with_ops(Object *obj, VFIODevice *vbasedev,
+ VFIORegion *region, int index, const char *name,
+ Error **errp, const MemoryRegionOps *ops)
{
struct vfio_region_info *info = NULL;
int ret;
@@ -249,7 +250,8 @@ int vfio_region_setup(Object *obj, VFIODevice *vbasedev, VFIORegion *region,
if (region->size) {
region->mem = g_new0(MemoryRegion, 1);
- memory_region_init_io(region->mem, obj, &vfio_region_ops,
+ memory_region_init_io(region->mem, obj,
+ ops ? ops : &vfio_region_ops,
region, name, region->size);
if (!vbasedev->no_mmap &&
@@ -273,6 +275,13 @@ int vfio_region_setup(Object *obj, VFIODevice *vbasedev, VFIORegion *region,
return 0;
}
+int vfio_region_setup(Object *obj, VFIODevice *vbasedev, VFIORegion *region,
+ int index, const char *name, Error **errp)
+{
+ return vfio_region_setup_with_ops(obj, vbasedev, region, index,
+ name, errp, NULL);
+}
+
static void vfio_subregion_unmap(VFIORegion *region, int index)
{
trace_vfio_region_unmap(memory_region_name(&region->mmaps[index].mem),
diff --git a/hw/vfio/vfio-region.h b/hw/vfio/vfio-region.h
index 9b21d4ee5b..84abbec1ec 100644
--- a/hw/vfio/vfio-region.h
+++ b/hw/vfio/vfio-region.h
@@ -39,6 +39,9 @@ uint64_t vfio_region_read(void *opaque,
hwaddr addr, unsigned size);
int vfio_region_setup(Object *obj, VFIODevice *vbasedev, VFIORegion *region,
int index, const char *name, Error **errp);
+int vfio_region_setup_with_ops(Object *obj, VFIODevice *vbasedev,
+ VFIORegion *region, int index, const char *name,
+ Error **errp, const MemoryRegionOps *ops);
int vfio_region_mmap(VFIORegion *region);
void vfio_region_mmaps_set_enabled(VFIORegion *region, bool enabled);
void vfio_region_unmap(VFIORegion *region);
--
2.25.1
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [RFC 5/9] hw/vfio/pci: Add CXL Type-2 device detection and region setup
2026-04-27 18:12 [RFC 0/9] QEMU: CXL Type-2 device passthrough via vfio-pci mhonap
` (3 preceding siblings ...)
2026-04-27 18:12 ` [RFC 4/9] hw/vfio/region: Add vfio_region_setup_with_ops() for custom region ops mhonap
@ 2026-04-27 18:12 ` mhonap
2026-04-27 18:12 ` [RFC 6/9] hw/vfio/pci: Wire CXL component-register BAR with COMP_REGS overlay mhonap
` (3 subsequent siblings)
8 siblings, 0 replies; 10+ messages in thread
From: mhonap @ 2026-04-27 18:12 UTC (permalink / raw)
To: alwilliamson, skolothumtho, ankita, mst, imammedo, anisinha,
eric.auger, peter.maydell, shannon.zhaosl, jonathan.cameron,
fan.ni, pbonzini, richard.henderson, marcel.apfelbaum, clg,
cohuck, dan.j.williams, dave.jiang, alejandro.lucero-palau
Cc: vsethi, cjia, targupta, zhiw, kjaju, linux-cxl, kvm, qemu-devel,
qemu-arm, Manish Honap
From: Manish Honap <mhonap@nvidia.com>
When VFIO_DEVICE_FLAGS_CXL is set, the kernel has identified a CXL
Type-2 device and populated the capability chain with a
vfio_device_info_cap_cxl entry. Read that entry to locate the DPA
and CXL Component Register shadow regions, then call vfio_region_setup()
for each.
DPA covers the device's host-managed memory and is faulted in lazily
by the VMM. The CXL Component Register shadow gives the VMM access to
the HDM Decoder Capability block so it can intercept decoder commits
without touching the hardware register page directly.
vfio_cxl_derive_hdm_info() walks the CXL Capability Array inside the
Component Register shadow to find the HDM Decoder capability (ID 0x5)
and extracts hdm_decoder_offset and hdm_count. All reads use
le32_to_cpu() since the capability array is little-endian per the CXL
spec. Dword 0 is the array header; capability entries start at dword 1,
which is why the loop begins at i = 1.
CXL register constants are defined here using names that mirror
<linux/cxl.h> to make cross-referencing straightforward.
Add the VFIOCXL struct embedded in VFIOPCIDevice.
Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
hw/vfio/pci.c | 214 +++++++++++++++++++++++++++++++++++++++++++
hw/vfio/pci.h | 14 +++
hw/vfio/trace-events | 4 +
3 files changed, 232 insertions(+)
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index b2a07f6bb4..49ac661eb3 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -24,6 +24,7 @@
#include "hw/core/hw-error.h"
#include "hw/core/iommu.h"
+#include "hw/cxl/cxl_component.h"
#include "hw/pci/msi.h"
#include "hw/pci/msix.h"
#include "hw/pci/pci_bridge.h"
@@ -2957,6 +2958,38 @@ static VFIODeviceOps vfio_pci_ops = {
.vfio_load_config = vfio_pci_load_config,
};
+/*
+ * CXL Component Register Space constants (CXL 4.0 8.2.3).
+ */
+
+/* CXL Capability Array Header (dword 0 of COMP_REGS) */
+#define CXL_CM_CAP_HDR_ARRAY_ID 0x0001U /* expected ID value */
+#define CXL_CM_CAP_HDR_NUM_CAPS_SHIFT 24 /* bits [31:24] = num entries */
+#define CXL_CM_CAP_HDR_NUM_CAPS_MASK 0xffU
+#define CXL_CM_CAP_ENTRY_ID_MASK 0xffffU /* bits [15:0] = cap ID */
+#define CXL_CM_CAP_ENTRY_PTR_SHIFT 20 /* bits [31:20] = byte offset */
+#define CXL_CM_CAP_ENTRY_PTR_MASK 0xfffU
+#define CXL_CM_CAP_ID_HDM 0x0005U /* HDM Decoder cap ID */
+
+/* HDM Decoder Capability (HDMC) register at hdm_decoder_offset+0x00 */
+#define CXL_HDMC_DECODER_COUNT_MASK 0xfU /* bits [3:0]; 0→1, N→N*2 */
+
+/*
+ * Per-decoder register offsets from hdm_decoder_offset (CXL 4.0 Table 8-119).
+ * Decoder records begin at +0x10 and are 0x20 bytes each.
+ */
+#define CXL_HDM_DECODER0_BASE_LOW_OFFSET(i) (0x20 * (i) + 0x10)
+#define CXL_HDM_DECODER0_BASE_HIGH_OFFSET(i) (0x20 * (i) + 0x14)
+#define CXL_HDM_DECODER0_CTRL_OFFSET(i) (0x20 * (i) + 0x20)
+
+/* HDM Decoder n Control register bits (CXL 4.0 Table 8-123) */
+#define CXL_HDM_CTRL_COMMIT_LOCK (1U << 8) /* decoder locked */
+#define CXL_HDM_CTRL_COMMIT (1U << 9) /* software trigger */
+#define CXL_HDM_CTRL_COMMITTED (1U << 10) /* hardware status */
+
+/* HDM Decoder BASE_LO: bits [31:28] hold address bits [31:28] */
+#define CXL_HDM_BASE_LO_ADDR_MASK 0xF0000000U
+
bool vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp)
{
VFIODevice *vbasedev = &vdev->vbasedev;
@@ -3102,6 +3135,25 @@ void vfio_pci_put_device(VFIOPCIDevice *vdev)
{
vfio_display_finalize(vdev);
vfio_bars_finalize(vdev);
+
+ /*
+ * The DPA region is not in bars[] and must be cleaned up here.
+ * Remove it from the system address space before releasing.
+ */
+ if (vdev->cxl.dpa_in_system_mem) {
+ memory_region_del_subregion(get_system_memory(), vdev->cxl.region.mem);
+ vdev->cxl.dpa_in_system_mem = false;
+ trace_vfio_cxl_put_device(vdev->vbasedev.name);
+ }
+ if (vdev->cxl.region.mem) {
+ vfio_region_exit(&vdev->cxl.region);
+ vfio_region_finalize(&vdev->cxl.region);
+ }
+ if (vdev->cxl.comp_regs_region.mem) {
+ vfio_region_exit(&vdev->cxl.comp_regs_region);
+ vfio_region_finalize(&vdev->cxl.comp_regs_region);
+ }
+
vfio_cpr_pci_unregister_device(vdev);
g_free(vdev->emulated_config_bits);
g_free(vdev->rom);
@@ -3254,6 +3306,164 @@ void vfio_pci_register_req_notifier(VFIOPCIDevice *vdev)
}
}
+/*
+ * vfio_cxl_derive_hdm_info - read hdm_decoder_offset and hdm_count from the
+ * COMP_REGS region by traversing the CXL Capability Array.
+ *
+ * Dword 0: CXL Capability Array Header
+ * bits[31:24] = num_caps,
+ * bits[15:0] = 1.
+ * Dwords 1..N:
+ * bits[15:0] = cap ID;
+ * bits[31:20] = byte offset from region start.
+ * HDM Decoder cap ID = 0x5; its offset is hdm_decoder_offset.
+ * HDMC register at hdm_decoder_offset+0:
+ * bits[3:0] encode count (0→1, N→N*2).
+ */
+static bool vfio_cxl_derive_hdm_info(VFIODevice *vbasedev, VFIOCXL *cxl,
+ Error **errp)
+{
+ off_t base = cxl->comp_regs_region.fd_offset;
+ uint32_t hdr, num_caps, i;
+
+ if (pread(vbasedev->fd, &hdr, sizeof(hdr), base) != sizeof(hdr)) {
+ error_setg(errp, "vfio-cxl: failed to read CXL Capability Header");
+ return false;
+ }
+ hdr = le32_to_cpu(hdr);
+
+ if ((hdr & CXL_CM_CAP_ENTRY_ID_MASK) != CXL_CM_CAP_HDR_ARRAY_ID) {
+ error_setg(errp, "vfio-cxl: unexpected CXL Capability Array ID 0x%x",
+ hdr & CXL_CM_CAP_ENTRY_ID_MASK);
+ return false;
+ }
+
+ num_caps = (hdr >> CXL_CM_CAP_HDR_NUM_CAPS_SHIFT) &
+ CXL_CM_CAP_HDR_NUM_CAPS_MASK;
+
+ /*
+ * Dword 0 is the CXL Capability Array Header;
+ * capability entries start at dword 1.
+ */
+ for (i = 1; i <= num_caps; i++) {
+ uint32_t entry, cap_id;
+
+ if (pread(vbasedev->fd, &entry, sizeof(entry),
+ base + i * sizeof(entry)) != sizeof(entry)) {
+ error_setg(errp, "vfio-cxl: failed to read cap entry %u", i);
+ return false;
+ }
+ entry = le32_to_cpu(entry);
+
+ cap_id = entry & CXL_CM_CAP_ENTRY_ID_MASK;
+ if (cap_id == CXL_CM_CAP_ID_HDM) {
+ uint32_t hdmc, field;
+
+ cxl->hdm_decoder_offset = (entry >> CXL_CM_CAP_ENTRY_PTR_SHIFT) &
+ CXL_CM_CAP_ENTRY_PTR_MASK;
+
+ if (pread(vbasedev->fd, &hdmc, sizeof(hdmc),
+ base + cxl->hdm_decoder_offset) != sizeof(hdmc)) {
+ error_setg(errp, "vfio-cxl: failed to read HDMC register");
+ return false;
+ }
+ hdmc = le32_to_cpu(hdmc);
+ field = hdmc & CXL_HDMC_DECODER_COUNT_MASK;
+ cxl->hdm_count = field ? (uint8_t)(field * 2) : 1;
+ return true;
+ }
+ }
+
+ error_setg(errp, "vfio-cxl: HDM Decoder capability not found in COMP_REGS");
+ return false;
+}
+
+static bool vfio_cxl_setup(VFIOPCIDevice *vdev, Error **errp)
+{
+ VFIODevice *vbasedev = &vdev->vbasedev;
+ VFIOCXL *cxl = &vdev->cxl;
+ g_autofree struct vfio_device_info *info = NULL;
+ struct vfio_info_cap_header *hdr;
+ struct vfio_device_info_cap_cxl *cap;
+ g_autofree struct vfio_region_info *region_info = NULL;
+ g_autofree struct vfio_region_info *comp_info = NULL;
+ int ret;
+
+ if (!(vbasedev->flags & VFIO_DEVICE_FLAGS_CXL)) {
+ return true;
+ }
+
+ info = vfio_get_device_info(vbasedev->fd);
+ if (!info) {
+ error_setg(errp, "vfio-cxl: failed to get device info");
+ return false;
+ }
+
+ hdr = vfio_get_device_info_cap(info, VFIO_DEVICE_INFO_CAP_CXL);
+ if (!hdr) {
+ error_setg(errp, "vfio-cxl: CXL capability not found in device info");
+ return false;
+ }
+ cap = (void *)hdr;
+
+ if (cap->dpa_region_index == (uint32_t)-1 ||
+ cap->comp_regs_region_index == (uint32_t)-1) {
+ error_setg(errp, "vfio-cxl: kernel did not provide region indices "
+ "(dpa=%u comp=%u)",
+ cap->dpa_region_index, cap->comp_regs_region_index);
+ return false;
+ }
+
+ cxl->hdm_regs_bar_index = cap->hdm_regs_bar_index;
+ cxl->hdm_regs_offset = cap->hdm_regs_offset;
+
+ /* DPA region */
+ ret = vfio_device_get_region_info(vbasedev, cap->dpa_region_index,
+ &region_info);
+ if (ret || !region_info) {
+ error_setg(errp, "vfio-cxl: failed to get DPA region info");
+ return false;
+ }
+ ret = vfio_region_setup(OBJECT(vdev), vbasedev, &cxl->region,
+ region_info->index, "cxl-dpa", errp);
+ if (ret) {
+ error_setg(errp, "vfio-cxl: failed to set up DPA region");
+ return false;
+ }
+ cxl->dpa_size = region_info->size;
+
+ if (vfio_region_mmap(&cxl->region)) {
+ error_setg(errp, "vfio-cxl: failed to mmap DPA region for %s",
+ vbasedev->name);
+ return false;
+ }
+
+ /* COMP_REGS region (HDM decoder shadow) */
+ ret = vfio_device_get_region_info(vbasedev, cap->comp_regs_region_index,
+ &comp_info);
+ if (ret || !comp_info) {
+ error_setg(errp, "vfio-cxl: failed to get COMP_REGS region info");
+ return false;
+ }
+ ret = vfio_region_setup(OBJECT(vdev), vbasedev, &cxl->comp_regs_region,
+ comp_info->index, "cxl-comp-regs", errp);
+ if (ret) {
+ error_setg(errp, "vfio-cxl: failed to set up COMP_REGS region");
+ return false;
+ }
+ cxl->hdm_regs_size = comp_info->size;
+
+ if (!vfio_cxl_derive_hdm_info(vbasedev, cxl, errp)) {
+ return false;
+ }
+
+ trace_vfio_cxl_setup_params(vbasedev->name, cxl->hdm_regs_bar_index,
+ cxl->hdm_regs_offset, cxl->hdm_regs_size,
+ cxl->dpa_size);
+ return true;
+}
+
static void vfio_unregister_req_notifier(VFIOPCIDevice *vdev)
{
Error *err = NULL;
@@ -3508,6 +3718,10 @@ static void vfio_pci_realize(PCIDevice *pdev, Error **errp)
goto error;
}
+ if (!vfio_cxl_setup(vdev, errp)) {
+ goto error;
+ }
+
if (!vfio_pci_config_setup(vdev, errp)) {
goto error;
}
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index c3a1f53d35..f3906f0c53 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -122,6 +122,19 @@ typedef struct VFIOMSIXInfo {
OBJECT_DECLARE_SIMPLE_TYPE(VFIOPCIDevice, VFIO_PCI_DEVICE)
+typedef struct VFIOCXL {
+ uint8_t hdm_regs_bar_index;
+ uint64_t hdm_regs_offset;
+ uint64_t hdm_regs_size;
+ uint64_t hdm_decoder_offset;
+ uint8_t hdm_count;
+ uint64_t dpa_size;
+ hwaddr fmws_base; /* GPA base programmed into HDM decoder 0 */
+ bool dpa_in_system_mem;
+ VFIORegion region;
+ VFIORegion comp_regs_region;
+} VFIOCXL;
+
struct VFIOPCIDevice {
PCIDevice parent_obj;
@@ -191,6 +204,7 @@ struct VFIOPCIDevice {
VFIODisplay *dpy;
Notifier irqchip_change_notifier;
VFIOPCICPR cpr;
+ VFIOCXL cxl;
};
/* Use uin32_t for vendor & device so PCI_ANY_ID expands and cannot match hw */
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 846e3625c5..3678481a8e 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -197,3 +197,7 @@ vfio_device_get_region_info_type(const char *name, int index, uint32_t type, uin
vfio_device_reset_handler(void) ""
vfio_device_attach(const char *name, int group_id) " (%s) group %d"
vfio_device_detach(const char *name, int group_id) " (%s) group %d"
+
+# pci.c CXL Type-2 passthrough
+vfio_cxl_setup_params(const char *name, uint8_t bar, uint64_t hdm_off, uint64_t hdm_sz, uint64_t dpa_sz) " (%s) hdm_bar=%u hdm_regs_offset=0x%"PRIx64" hdm_regs_size=0x%"PRIx64" dpa_size=0x%"PRIx64
+vfio_cxl_put_device(const char *name) " (%s) removing DPA region from system memory"
--
2.25.1
* [RFC 6/9] hw/vfio/pci: Wire CXL component-register BAR with COMP_REGS overlay
2026-04-27 18:12 [RFC 0/9] QEMU: CXL Type-2 device passthrough via vfio-pci mhonap
` (4 preceding siblings ...)
2026-04-27 18:12 ` [RFC 5/9] hw/vfio/pci: Add CXL Type-2 device detection and region setup mhonap
@ 2026-04-27 18:12 ` mhonap
2026-04-27 18:12 ` [RFC 7/9] hw/vfio+cxl: Program HDM decoder 0 at machine_done for firmware-committed devices mhonap
` (2 subsequent siblings)
8 siblings, 0 replies; 10+ messages in thread
From: mhonap @ 2026-04-27 18:12 UTC (permalink / raw)
To: alwilliamson, skolothumtho, ankita, mst, imammedo, anisinha,
eric.auger, peter.maydell, shannon.zhaosl, jonathan.cameron,
fan.ni, pbonzini, richard.henderson, marcel.apfelbaum, clg,
cohuck, dan.j.williams, dave.jiang, alejandro.lucero-palau
Cc: vsethi, cjia, targupta, zhiw, kjaju, linux-cxl, kvm, qemu-devel,
qemu-arm, Manish Honap
From: Manish Honap <mhonap@nvidia.com>
The CXL Component Register BAR contains two types of ranges that need
different handling:
- Accelerator register windows: passed through as direct hardware
mmaps for performance. The kernel reports the real BAR size and
lists mmappable windows via VFIO_REGION_INFO_CAP_SPARSE_MMAP,
excluding the HDM Decoder Capability block. vfio_region_mmap()
creates hardware-backed sub-regions for each sparse area.
- HDM Decoder Capability block: guest accesses must go through
emulated ops so QEMU can observe and program decoder state. The
kernel blocks direct mmap of this range.
vfio_bar_register(): after the normal mmap path, overlay the COMP_REGS
emulation region at hdm_regs_offset with priority 1. In QEMU's
MemoryRegion model, overlapping subregions are resolved by priority;
the default is 0. Priority 1 ensures guest accesses to the HDM range
always dispatch through the emulated COMP_REGS ops regardless of any
hardware-backed sub-region at a neighbouring offset.
vfio_pci_bars_exit(): remove the COMP_REGS overlay before the normal
BAR teardown path.
Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
hw/vfio/pci.c | 26 ++++++++++++++++++++++++++
hw/vfio/trace-events | 1 +
2 files changed, 27 insertions(+)
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 49ac661eb3..0270de61d2 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -1960,6 +1960,10 @@ static void vfio_bar_register(VFIOPCIDevice *vdev, int nr)
return;
}
+ bool cxl_comp_regs_bar = (vdev->vbasedev.flags & VFIO_DEVICE_FLAGS_CXL) &&
+ nr == vdev->cxl.hdm_regs_bar_index &&
+ vdev->cxl.comp_regs_region.mem;
+
bar->mr = g_new0(MemoryRegion, 1);
name = g_strdup_printf("%s base BAR %d", vdev->vbasedev.name, nr);
memory_region_init_io(bar->mr, OBJECT(vdev), NULL, NULL, name, bar->size);
@@ -1974,6 +1978,21 @@ static void vfio_bar_register(VFIOPCIDevice *vdev, int nr)
}
}
+ if (cxl_comp_regs_bar) {
+ /*
+ * Overlay the COMP_REGS emulation at hdm_regs_offset with priority 1.
+ * The kernel excludes the HDM Decoder Capability block from the
+ * sparse-mmap list, so vfio_region_mmap() creates hardware-backed
+ * sub-regions only for accelerator register windows. The emulated
+ * COMP_REGS region sits above those at priority 1, ensuring guest
+ * accesses to the HDM range always dispatch through the emulated ops.
+ */
+ memory_region_add_subregion_overlap(bar->mr, vdev->cxl.hdm_regs_offset,
+ vdev->cxl.comp_regs_region.mem, 1);
+ trace_vfio_cxl_bar_subregion(vdev->vbasedev.name, nr,
+ vdev->cxl.hdm_regs_offset);
+ }
+
pci_register_bar(pdev, nr, bar->type, bar->mr);
}
@@ -1993,9 +2012,16 @@ void vfio_pci_bars_exit(VFIOPCIDevice *vdev)
for (i = 0; i < PCI_ROM_SLOT; i++) {
VFIOBAR *bar = &vdev->bars[i];
+ bool use_comp_regs = (vdev->vbasedev.flags & VFIO_DEVICE_FLAGS_CXL) &&
+ i == vdev->cxl.hdm_regs_bar_index &&
+ vdev->cxl.comp_regs_region.mem;
vfio_bar_quirk_exit(vdev, i);
vfio_region_exit(&bar->region);
+ if (use_comp_regs && bar->mr) {
+ memory_region_del_subregion(bar->mr,
+ vdev->cxl.comp_regs_region.mem);
+ }
if (bar->region.size) {
memory_region_del_subregion(bar->mr, bar->region.mem);
}
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 3678481a8e..3bced3cebb 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -201,3 +201,4 @@ vfio_device_detach(const char *name, int group_id) " (%s) group %d"
# pci.c CXL Type-2 passthrough
vfio_cxl_setup_params(const char *name, uint8_t bar, uint64_t hdm_off, uint64_t hdm_sz, uint64_t dpa_sz) " (%s) hdm_bar=%u hdm_regs_offset=0x%"PRIx64" hdm_regs_size=0x%"PRIx64" dpa_size=0x%"PRIx64
vfio_cxl_put_device(const char *name) " (%s) removing DPA region from system memory"
+vfio_cxl_bar_subregion(const char *name, int nr, uint64_t off) " (%s) BAR%d comp_regs overlay at BAR offset 0x%"PRIx64
--
2.25.1
* [RFC 7/9] hw/vfio+cxl: Program HDM decoder 0 at machine_done for firmware-committed devices
2026-04-27 18:12 [RFC 0/9] QEMU: CXL Type-2 device passthrough via vfio-pci mhonap
` (5 preceding siblings ...)
2026-04-27 18:12 ` [RFC 6/9] hw/vfio/pci: Wire CXL component-register BAR with COMP_REGS overlay mhonap
@ 2026-04-27 18:12 ` mhonap
2026-04-27 18:12 ` [RFC 8/9] hw/arm/smmu-common: Allow pxb-cxl as SMMUv3 primary bus mhonap
2026-04-27 18:12 ` [RFC 9/9] vfio/listener: Skip DMA mapping for VFIO-owned RAM-device regions mhonap
8 siblings, 0 replies; 10+ messages in thread
From: mhonap @ 2026-04-27 18:12 UTC (permalink / raw)
To: alwilliamson, skolothumtho, ankita, mst, imammedo, anisinha,
eric.auger, peter.maydell, shannon.zhaosl, jonathan.cameron,
fan.ni, pbonzini, richard.henderson, marcel.apfelbaum, clg,
cohuck, dan.j.williams, dave.jiang, alejandro.lucero-palau
Cc: vsethi, cjia, targupta, zhiw, kjaju, linux-cxl, kvm, qemu-devel,
qemu-arm, Manish Honap
From: Manish Honap <mhonap@nvidia.com>
setup_locked_hdm() runs as a machine_done notifier after all devices
have been realized. It programs HDM decoder 0 with the CFMWS base
address so the guest can fault into device memory from the first
instruction.
The notifier is only registered when the kernel reports the device as
firmware-committed (VFIO_CXL_CAP_FIRMWARE_COMMITTED). The host is
responsible for HDM decoder programming; the guest has no mechanism to
remap host physical address mappings.
The function uses cxl->fmws_base (set by the optional cxl-fmws-base
device property) if non-zero; otherwise it falls back to the
cxl_fmws_base global captured by cxl_fmws_set_memmap() during machine
memory-map init. If neither is set, it warns and returns without
programming anything.
If COMMIT_LOCK is set in decoder 0 CTRL at machine_done time (left-over
from a prior FLR?), it is cleared before writing BASE so the subsequent
write is not blocked. COMMIT_LOCK is re-set after programming so the
hardware enforces the committed base.
The read_region() return value is checked; a failure aborts programming
rather than leaving ctrl uninitialized. Any write_region() failure
likewise makes the notifier report an error and return, so the decoder
is never left half-programmed.
Add cxl_fmws_base as a hwaddr global in cxl-host.c (and a stub in
cxl-host-stubs.c). It is set once by cxl_fmws_set_memmap() and read
later at machine_done time.
Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
hw/cxl/cxl-host-stubs.c | 2 +
hw/cxl/cxl-host.c | 8 ++
hw/vfio/pci.c | 176 +++++++++++++++++++++++++++++++++++++-
hw/vfio/pci.h | 1 +
hw/vfio/trace-events | 1 +
include/hw/cxl/cxl_host.h | 10 +++
6 files changed, 196 insertions(+), 2 deletions(-)
diff --git a/hw/cxl/cxl-host-stubs.c b/hw/cxl/cxl-host-stubs.c
index c015baac81..0294d484c0 100644
--- a/hw/cxl/cxl-host-stubs.c
+++ b/hw/cxl/cxl-host-stubs.c
@@ -17,4 +17,6 @@ hwaddr cxl_fmws_set_memmap(hwaddr base, hwaddr max_addr)
};
void cxl_fmws_update_mmio(void) {};
+hwaddr cxl_fmws_base;
+
const MemoryRegionOps cfmws_ops;
diff --git a/hw/cxl/cxl-host.c b/hw/cxl/cxl-host.c
index a94b893e99..f7e933f452 100644
--- a/hw/cxl/cxl-host.c
+++ b/hw/cxl/cxl-host.c
@@ -429,11 +429,19 @@ void cxl_fmws_update_mmio(void)
object_child_foreach_recursive(object_get_root(), cxl_fmws_mmio_map, NULL);
}
+/*
+ * GPA base of the first CXL Fixed Memory Window region placed in the memory
+ * map by cxl_fmws_set_memmap(). Set once at machine memory-map init time.
+ */
+hwaddr cxl_fmws_base;
+
hwaddr cxl_fmws_set_memmap(hwaddr base, hwaddr max_addr)
{
GSList *cfmws_list, *iter;
CXLFixedWindow *fw;
+ cxl_fmws_base = base;
+
cfmws_list = cxl_fmws_get_all_sorted();
for (iter = cfmws_list; iter; iter = iter->next) {
fw = CXL_FMW(iter->data);
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 0270de61d2..2595229ea5 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -25,6 +25,7 @@
#include "hw/core/hw-error.h"
#include "hw/core/iommu.h"
#include "hw/cxl/cxl_component.h"
+#include "hw/cxl/cxl_host.h"
#include "hw/pci/msi.h"
#include "hw/pci/msix.h"
#include "hw/pci/pci_bridge.h"
@@ -3016,6 +3017,90 @@ static VFIODeviceOps vfio_pci_ops = {
/* HDM Decoder BASE_LO: bits [31:28] hold address bits [31:28] */
#define CXL_HDM_BASE_LO_ADDR_MASK 0xF0000000U
+static bool read_region(VFIORegion *region, uint32_t *val, uint64_t offset)
+{
+ VFIODevice *vbasedev = region->vbasedev;
+ uint32_t le_val;
+
+ if (pread(vbasedev->fd, &le_val, sizeof(le_val),
+ region->fd_offset + offset) != sizeof(le_val)) {
+ error_report("vfio-cxl: pread %s offset 0x%"PRIx64" failed: %m",
+ vbasedev->name, offset);
+ return false;
+ }
+ /* CXL registers are little-endian; convert to host byte order. */
+ *val = le32_to_cpu(le_val);
+ return true;
+}
+
+static bool write_region(VFIORegion *region, uint32_t *val, uint64_t offset)
+{
+ VFIODevice *vbasedev = region->vbasedev;
+ /* CXL registers are little-endian; convert from host byte order. */
+ uint32_t le_val = cpu_to_le32(*val);
+
+ if (pwrite(vbasedev->fd, &le_val, sizeof(le_val),
+ region->fd_offset + offset) != sizeof(le_val)) {
+ error_report("vfio-cxl: pwrite %s offset 0x%"PRIx64" failed: %m",
+ vbasedev->name, offset);
+ return false;
+ }
+ return true;
+}
+
+/*
+ * Direct pread/pwrite MemoryRegionOps for the CXL Component Register shadow.
+ *
+ * The generic vfio_region_ops routes guest MMIO through
+ * vfio_device_io_region_read() which returns EINVAL for vendor region
+ * index 10 at runtime. The same pread() issued directly via
+ * region->fd_offset works fine, as vfio_cxl_derive_hdm_info() already does.
+ *
+ * The kernel enforces 4-byte aligned, 4-byte accesses on this region;
+ * valid and impl min/max_access_size are both set to 4 to match.
+ */
+static uint64_t vfio_cxl_comp_regs_mr_read(void *opaque, hwaddr addr,
+ unsigned size)
+{
+ VFIORegion *region = opaque;
+ VFIODevice *vbasedev = region->vbasedev;
+ uint32_t val = 0xFFFFFFFFU;
+
+ if (pread(vbasedev->fd, &val, size,
+ region->fd_offset + addr) != size) {
+ error_report("vfio-cxl: %s COMP_REGS read at 0x%"HWADDR_PRIx
+ " failed: %m", vbasedev->name, addr);
+ }
+
+ val = le32_to_cpu(val);
+ trace_vfio_region_read(vbasedev->name, region->nr, addr, size, val);
+ return val;
+}
+
+static void vfio_cxl_comp_regs_mr_write(void *opaque, hwaddr addr,
+ uint64_t data, unsigned size)
+{
+ VFIORegion *region = opaque;
+ VFIODevice *vbasedev = region->vbasedev;
+ uint32_t val = cpu_to_le32((uint32_t)data);
+
+ if (pwrite(vbasedev->fd, &val, size,
+ region->fd_offset + addr) != size) {
+ error_report("vfio-cxl: %s COMP_REGS write at 0x%"HWADDR_PRIx
+ " failed: %m", vbasedev->name, addr);
+ }
+
+ trace_vfio_region_write(vbasedev->name, region->nr, addr, data, size);
+}
+
+static const MemoryRegionOps vfio_cxl_comp_regs_mr_ops = {
+ .read = vfio_cxl_comp_regs_mr_read,
+ .write = vfio_cxl_comp_regs_mr_write,
+ .endianness = DEVICE_LITTLE_ENDIAN,
+ .valid = { .min_access_size = 4, .max_access_size = 4 },
+ .impl = { .min_access_size = 4, .max_access_size = 4 },
+};
+
bool vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp)
{
VFIODevice *vbasedev = &vdev->vbasedev;
@@ -3404,6 +3489,78 @@ static bool vfio_cxl_derive_hdm_info(VFIODevice *vbasedev, VFIOCXL *cxl,
return false;
}
+/*
+ * setup_locked_hdm - machine_done notifier that programs HDM decoder 0 with
+ * the FMWS base address so the guest can access DPA through a stable GPA.
+ *
+ * Uses cxl->fmws_base (set by the optional cxl-fmws-base device property) if
+ * non-zero; otherwise falls back to the cxl_fmws_base global captured by
+ * cxl_fmws_set_memmap() during machine memory-map init. If neither is set,
+ * the notifier warns and returns without programming anything.
+ */
+static void setup_locked_hdm(Notifier *notifier, void *data)
+{
+ VFIOCXL *cxl = container_of(notifier, VFIOCXL, machine_done);
+ VFIORegion *region = &cxl->comp_regs_region;
+ MemoryRegion *sys_mem = get_system_memory();
+ uint64_t hdm_base = cxl->hdm_decoder_offset;
+ uint32_t base_lo, base_hi, ctrl;
+
+ if (!cxl->fmws_base) {
+ cxl->fmws_base = cxl_fmws_base;
+ if (!cxl->fmws_base) {
+ warn_report("vfio-cxl %s: CXL FMWS base not available",
+ region->vbasedev->name);
+ return;
+ }
+ }
+
+ if (!read_region(region, &ctrl,
+ hdm_base + CXL_HDM_DECODER0_CTRL_OFFSET(0))) {
+ error_report("vfio-cxl: %s failed to read HDM decoder 0 CTRL",
+ region->vbasedev->name);
+ return;
+ }
+
+ /*
+ * COMMIT_LOCK (bit 8) should have been cleared by the kernel when the
+ * device was opened. If it is still set in the virtual snapshot, warn
+ * and clear it here so the subsequent BASE write is not blocked.
+ */
+ if (ctrl & CXL_HDM_CTRL_COMMIT_LOCK) {
+ warn_report("vfio-cxl: COMMIT_LOCK set in HDM decoder 0 CTRL at "
+ "machine_done; clearing before programming guest GPA");
+ ctrl &= ~CXL_HDM_CTRL_COMMIT_LOCK;
+ if (!write_region(region, &ctrl,
+ hdm_base + CXL_HDM_DECODER0_CTRL_OFFSET(0))) {
+ return;
+ }
+ }
+
+ base_lo = (uint32_t)(cxl->fmws_base & CXL_HDM_BASE_LO_ADDR_MASK);
+ base_hi = (uint32_t)(cxl->fmws_base >> 32);
+ ctrl |= CXL_HDM_CTRL_COMMIT | CXL_HDM_CTRL_COMMIT_LOCK;
+
+ if (!write_region(region, &base_lo, hdm_base +
+ CXL_HDM_DECODER0_BASE_LOW_OFFSET(0)) ||
+ !write_region(region, &base_hi, hdm_base +
+ CXL_HDM_DECODER0_BASE_HIGH_OFFSET(0)) ||
+ !write_region(region, &ctrl, hdm_base +
+ CXL_HDM_DECODER0_CTRL_OFFSET(0))) {
+ error_report("vfio-cxl: %s failed to program HDM decoder 0",
+ region->vbasedev->name);
+ return;
+ }
+
+ trace_vfio_cxl_locked_hdm(region->vbasedev->name, cxl->fmws_base,
+ base_lo, base_hi, ctrl);
+
+ memory_region_transaction_begin();
+ memory_region_add_subregion(sys_mem, cxl->fmws_base, cxl->region.mem);
+ memory_region_transaction_commit();
+ cxl->dpa_in_system_mem = true;
+}
+
static bool vfio_cxl_setup(VFIOPCIDevice *vdev, Error **errp)
{
VFIODevice *vbasedev = &vdev->vbasedev;
@@ -3471,8 +3628,11 @@ static bool vfio_cxl_setup(VFIOPCIDevice *vdev, Error **errp)
error_setg(errp, "vfio-cxl: failed to get COMP_REGS region info");
return false;
}
- ret = vfio_region_setup(OBJECT(vdev), vbasedev, &cxl->comp_regs_region,
- comp_info->index, "cxl-comp-regs", errp);
+
+ ret = vfio_region_setup_with_ops(OBJECT(vdev), vbasedev,
+ &cxl->comp_regs_region,
+ comp_info->index, "cxl-comp-regs",
+ errp, &vfio_cxl_comp_regs_mr_ops);
if (ret) {
error_setg(errp, "vfio-cxl: failed to set up COMP_REGS region");
return false;
@@ -3486,6 +3646,18 @@ static bool vfio_cxl_setup(VFIOPCIDevice *vdev, Error **errp)
trace_vfio_cxl_setup_params(vbasedev->name, cxl->hdm_regs_bar_index,
cxl->hdm_regs_offset, cxl->hdm_regs_size,
cxl->dpa_size);
+
+ /*
+ * Only pre-program the HDM decoder if the kernel reported the device as
+ * firmware-committed. Non-committed devices need guest driver involvement
+ * to commit the decoder; registering the notifier for them would write an
+ * uncommitted BASE value that the hardware ignores.
+ */
+ if (cap->flags & VFIO_CXL_CAP_FIRMWARE_COMMITTED) {
+ cxl->machine_done.notify = setup_locked_hdm;
+ qemu_add_machine_init_done_notifier(&cxl->machine_done);
+ }
+
return true;
}
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index f3906f0c53..5667c6ec17 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -133,6 +133,7 @@ typedef struct VFIOCXL {
bool dpa_in_system_mem;
VFIORegion region;
VFIORegion comp_regs_region;
+ Notifier machine_done;
} VFIOCXL;
struct VFIOPCIDevice {
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 3bced3cebb..174e577837 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -202,3 +202,4 @@ vfio_device_detach(const char *name, int group_id) " (%s) group %d"
vfio_cxl_setup_params(const char *name, uint8_t bar, uint64_t hdm_off, uint64_t hdm_sz, uint64_t dpa_sz) " (%s) hdm_bar=%u hdm_regs_offset=0x%"PRIx64" hdm_regs_size=0x%"PRIx64" dpa_size=0x%"PRIx64
vfio_cxl_put_device(const char *name) " (%s) removing DPA region from system memory"
vfio_cxl_bar_subregion(const char *name, int nr, uint64_t off) " (%s) BAR%d comp_regs overlay at BAR offset 0x%"PRIx64
+vfio_cxl_locked_hdm(const char *name, uint64_t fmws, uint32_t blo, uint32_t bhi, uint32_t ctrl) " (%s) fmws_base=0x%"PRIx64" wrote decoder0 base_lo=0x%08x base_hi=0x%08x ctrl=0x%08x"
diff --git a/include/hw/cxl/cxl_host.h b/include/hw/cxl/cxl_host.h
index 21619bb748..f890a5c0b9 100644
--- a/include/hw/cxl/cxl_host.h
+++ b/include/hw/cxl/cxl_host.h
@@ -20,6 +20,16 @@ hwaddr cxl_fmws_set_memmap(hwaddr base, hwaddr max_addr);
void cxl_fmws_update_mmio(void);
GSList *cxl_fmws_get_all_sorted(void);
+/**
+ * cxl_fmws_base - GPA base of the first CXL Fixed Memory Window region.
+ *
+ * Set by cxl_fmws_set_memmap() to the base address it receives (typically
+ * ROUND_UP(highest_gpa + 1, 256 MiB) on ARM virt). Valid after the
+ * machine memory-map init callback returns, i.e. at machine_done time.
+ * Zero when no machine has called cxl_fmws_set_memmap() (stub builds).
+ */
+extern hwaddr cxl_fmws_base;
+
extern const MemoryRegionOps cfmws_ops;
#endif
--
2.25.1
* [RFC 8/9] hw/arm/smmu-common: Allow pxb-cxl as SMMUv3 primary bus
2026-04-27 18:12 [RFC 0/9] QEMU: CXL Type-2 device passthrough via vfio-pci mhonap
` (6 preceding siblings ...)
2026-04-27 18:12 ` [RFC 7/9] hw/vfio+cxl: Program HDM decoder 0 at machine_done for firmware-committed devices mhonap
@ 2026-04-27 18:12 ` mhonap
2026-04-27 18:12 ` [RFC 9/9] vfio/listener: Skip DMA mapping for VFIO-owned RAM-device regions mhonap
8 siblings, 0 replies; 10+ messages in thread
From: mhonap @ 2026-04-27 18:12 UTC (permalink / raw)
To: alwilliamson, skolothumtho, ankita, mst, imammedo, anisinha,
eric.auger, peter.maydell, shannon.zhaosl, jonathan.cameron,
fan.ni, pbonzini, richard.henderson, marcel.apfelbaum, clg,
cohuck, dan.j.williams, dave.jiang, alejandro.lucero-palau
Cc: vsethi, cjia, targupta, zhiw, kjaju, linux-cxl, kvm, qemu-devel,
qemu-arm, Manish Honap
From: Manish Honap <mhonap@nvidia.com>
The SMMUv3 primary bus check only accepted pxb-pcie as a valid root.
pxb-cxl uses the same PCIe-compatible bus implementation; rejecting it
leaves CXL devices behind it unable to reach the IOMMU.
Extend the check to also accept CXL buses so SMMUv3 translation applies
to passthrough CXL devices. Update the comment above the check to
mention pxb-cxl alongside pxb-pcie.
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
hw/arm/smmu-common.c | 17 ++++++++---------
1 file changed, 8 insertions(+), 9 deletions(-)
diff --git a/hw/arm/smmu-common.c b/hw/arm/smmu-common.c
index 58c4452b1f..eb52ea1976 100644
--- a/hw/arm/smmu-common.c
+++ b/hw/arm/smmu-common.c
@@ -963,19 +963,18 @@ static void smmu_base_realize(DeviceState *dev, Error **errp)
s->iommu_ops = &smmu_ops;
}
/*
- * We only allow default PCIe Root Complex(pcie.0) or pxb-pcie based extra
- * root complexes to be associated with SMMU.
+ * We only allow the default PCIe root complex (pcie.0) or pxb-pcie /
+ * pxb-cxl based extra root complexes to be associated with SMMU.
*/
if (pci_bus_is_express(pci_bus) && pci_bus_is_root(pci_bus) &&
object_dynamic_cast(OBJECT(pci_bus)->parent, TYPE_PCI_HOST_BRIDGE)) {
/*
- * This condition matches either the default pcie.0, pxb-pcie, or
- * pxb-cxl. For both pxb-pcie and pxb-cxl, parent_dev will be set.
- * Currently, we don't allow pxb-cxl as it requires further
- * verification. Therefore, make sure this is indeed pxb-pcie.
+ * pcie.0 has no parent_dev; pxb-pcie and pxb-cxl do. Accept both
+ * bus types explicitly so other root complexes are still rejected.
*/
if (pci_bus->parent_dev) {
- if (!object_dynamic_cast(OBJECT(pci_bus), TYPE_PXB_PCIE_BUS)) {
+ if (!object_dynamic_cast(OBJECT(pci_bus), TYPE_PXB_PCIE_BUS) &&
+ !object_dynamic_cast(OBJECT(pci_bus), TYPE_PXB_CXL_BUS)) {
goto out_err;
}
}
@@ -988,8 +987,8 @@ static void smmu_base_realize(DeviceState *dev, Error **errp)
return;
}
out_err:
- error_setg(errp, "SMMU should be attached to a default PCIe root complex"
- "(pcie.0) or a pxb-pcie based root complex");
+ error_setg(errp, "SMMU should be attached to a default PCIe root complex "
+ "(pcie.0), a pxb-pcie, or a pxb-cxl based root complex");
}
/*
--
2.25.1
* [RFC 9/9] vfio/listener: Skip DMA mapping for VFIO-owned RAM-device regions
2026-04-27 18:12 [RFC 0/9] QEMU: CXL Type-2 device passthrough via vfio-pci mhonap
` (7 preceding siblings ...)
2026-04-27 18:12 ` [RFC 8/9] hw/arm/smmu-common: Allow pxb-cxl as SMMUv3 primary bus mhonap
@ 2026-04-27 18:12 ` mhonap
8 siblings, 0 replies; 10+ messages in thread
From: mhonap @ 2026-04-27 18:12 UTC (permalink / raw)
To: alwilliamson, skolothumtho, ankita, mst, imammedo, anisinha,
eric.auger, peter.maydell, shannon.zhaosl, jonathan.cameron,
fan.ni, pbonzini, richard.henderson, marcel.apfelbaum, clg,
cohuck, dan.j.williams, dave.jiang, alejandro.lucero-palau
Cc: vsethi, cjia, targupta, zhiw, kjaju, linux-cxl, kvm, qemu-devel,
qemu-arm, Manish Honap
From: Manish Honap <mhonap@nvidia.com>
vfio_container_region_add() attempts an IOMMU DMA mapping for every
RAM section that enters the guest address space. For VFIO mmap-backed
regions (PCI BAR windows, CXL.mem regions), this mapping always fails:
the backing VMAs carry VM_IO | VM_PFNMAP flags and pin_user_pages()
refuses to pin VM_IO pages, so IOMMU_IOAS_MAP returns -EFAULT.
CPU access to these regions goes through KVM Stage-2 page faults
independently of the SMMU/IOMMU, so no IOMMU entry is required for
correct operation.
Add an early return for RAM-device sections owned by a VFIO device.
vfio_get_vfio_device(memory_region_owner(section->mr)) returns non-NULL
for any mmap subregion created by vfio_region_mmap(), since
memory_region_init_ram_device_ptr() propagates the VFIOPCIDevice owner
from the containing region. Matching on ownership covers both normal
PCI BAR windows and CXL.mem regions uniformly; non-VFIO RAM-device
regions such as NVDIMMs are unaffected and continue through the normal
mapping path.
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
hw/vfio/listener.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/hw/vfio/listener.c b/hw/vfio/listener.c
index 31c3113f8f..46cad18357 100644
--- a/hw/vfio/listener.c
+++ b/hw/vfio/listener.c
@@ -608,6 +608,20 @@ void vfio_container_region_add(VFIOContainer *bcontainer,
pgmask + 1);
return;
}
+
+ /*
+ * VFIO mmap-backed regions (e.g. CXL.mem) use VM_IO | VM_PFNMAP VMAs
+ * backed by physical device addresses. Skip vfio_container_dma_map()
+ * since no IOMMU mapping is needed for this region.
+ */
+ if (vfio_get_vfio_device(memory_region_owner(section->mr))) {
+ trace_vfio_listener_region_add_no_dma_map(
+ memory_region_name(section->mr),
+ section->offset_within_address_space,
+ int128_getlo(section->size),
+ pgmask + 1);
+ return;
+ }
}
ret = vfio_container_dma_map(bcontainer, iova, int128_get64(llsize),
--
2.25.1