* [PATCH v3 01/11] cxl: Add cxl_get_hdm_info() helper for HDM decoder metadata
2026-06-25 16:53 [PATCH v3 00/11] vfio/pci: Add CXL Type-2 device passthrough support mhonap
@ 2026-06-25 16:53 ` mhonap
2026-06-25 16:53 ` [PATCH v3 02/11] cxl: Split cxl_await_range_active() from media-ready wait mhonap
` (10 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: mhonap @ 2026-06-25 16:53 UTC
To: djbw, alex, jgg, jic23, dave.jiang, ankita,
alejandro.lucero-palau, alison.schofield, dave, dmatlack, gourry,
ira.weiny
Cc: cjia, kjaju, vsethi, zhiw, mhonap, kvm, linux-cxl, linux-doc,
linux-kernel, linux-kselftest
From: Manish Honap <mhonap@nvidia.com>
cxl_probe_component_regs() finds the HDM decoder block during device
probe and caches its location, but does not record the decoder count
and does not expose the result outside drivers/cxl/.
In-kernel cxl drivers (Type-2 accelerator drivers, vfio-cxl) need the
decoder count and the byte offset and size of the HDM block without
re-running the probe sequence.
Record decoder_cnt in rmap->count when parsing the HDM capability in
cxl_probe_component_regs(), extend struct cxl_reg_map with a count
member, and add cxl_get_hdm_info() to return offset, size, and count
from the cached map. Export under the CXL namespace.
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
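Illustration only, not part of the patch: a minimal userspace sketch of the
"record once at probe, query later" scheme described in the commit message.
It assumes a simplified stand-in for struct cxl_reg_map; every demo_* name is
hypothetical. The block length formula (0x20 * count + 0x10) is the one used
in cxl_probe_component_regs(), and the return codes follow the
cxl_get_hdm_info() body in the diff.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct cxl_reg_map; not the kernel definition. */
struct demo_reg_map {
	int valid;
	unsigned long offset;
	unsigned long size;
	uint8_t count;
};

/*
 * Length of the HDM block: 0x10 bytes of capability and global control
 * registers, followed by one 0x20-byte register set per decoder.
 */
unsigned long demo_hdm_block_len(uint8_t decoder_cnt)
{
	return 0x20UL * decoder_cnt + 0x10;
}

/* Probe-time step: cache location, size, and decoder count once. */
void demo_record_hdm(struct demo_reg_map *map, unsigned long offset,
		     uint8_t decoder_cnt)
{
	map->valid = 1;
	map->offset = offset;
	map->size = demo_hdm_block_len(decoder_cnt);
	map->count = decoder_cnt;
}

/* Query-time step: same contract as cxl_get_hdm_info() in the diff. */
int demo_get_hdm_info(const struct demo_reg_map *map, uint8_t *count,
		      uint64_t *offset, uint64_t *size)
{
	if (!count || !offset || !size)
		return -EINVAL;
	if (!map->valid)
		return -ENODEV;

	*count = map->count;
	*offset = map->offset;
	*size = map->size;
	return 0;
}
```

A caller checks the return value before using the outputs; -ENODEV only means
that no HDM decoder block was found at probe time.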
drivers/cxl/core/pci.c | 33 +++++++++++++++++++++++++++++++++
drivers/cxl/core/regs.c | 1 +
include/cxl/cxl.h | 4 ++++
3 files changed, 38 insertions(+)
diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index 2bcd683aa286..c917608c16f9 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -449,6 +449,39 @@ int cxl_hdm_decode_init(struct cxl_dev_state *cxlds, struct cxl_hdm *cxlhdm,
}
EXPORT_SYMBOL_NS_GPL(cxl_hdm_decode_init, "CXL");
+/**
+ * cxl_get_hdm_info - Get HDM decoder register block location and count
+ * @cxlds: CXL device state (must have component regs enumerated via
+ * cxl_probe_component_regs())
+ * @count: number of HDM decoders (from HDM Capability bits [3:0])
+ * @offset: byte offset of HDM decoder block within the component register BAR
+ * @size: size in bytes of the HDM decoder block
+ *
+ * Exported for cxl drivers (in-kernel accelerator drivers, vfio-cxl) that
+ * need HDM decoder metadata from the cached component-register map without
+ * re-running the probe sequence.
+ *
+ * Return: 0 on success, -EINVAL on a NULL out pointer, -ENODEV if no HDM block.
+ */
+int cxl_get_hdm_info(struct cxl_dev_state *cxlds, u8 *count,
+ resource_size_t *offset, resource_size_t *size)
+{
+ struct cxl_reg_map *hdm = &cxlds->reg_map.component_map.hdm_decoder;
+
+ if (WARN_ON(!count || !offset || !size))
+ return -EINVAL;
+
+ if (!hdm->valid)
+ return -ENODEV;
+
+ *count = hdm->count;
+ *offset = hdm->offset;
+ *size = hdm->size;
+
+ return 0;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_get_hdm_info, "CXL");
+
#define CXL_DOE_TABLE_ACCESS_REQ_CODE 0x000000ff
#define CXL_DOE_TABLE_ACCESS_REQ_CODE_READ 0
#define CXL_DOE_TABLE_ACCESS_TABLE_TYPE 0x0000ff00
diff --git a/drivers/cxl/core/regs.c b/drivers/cxl/core/regs.c
index 20c2d9fbcfe7..e828df0629d0 100644
--- a/drivers/cxl/core/regs.c
+++ b/drivers/cxl/core/regs.c
@@ -85,6 +85,7 @@ void cxl_probe_component_regs(struct device *dev, void __iomem *base,
decoder_cnt = cxl_hdm_decoder_count(hdr);
length = 0x20 * decoder_cnt + 0x10;
rmap = &map->hdm_decoder;
+ rmap->count = decoder_cnt;
break;
}
case CXL_CM_CAP_CAP_ID_RAS:
diff --git a/include/cxl/cxl.h b/include/cxl/cxl.h
index 802b143de83d..440ab09c640e 100644
--- a/include/cxl/cxl.h
+++ b/include/cxl/cxl.h
@@ -75,6 +75,7 @@ struct cxl_reg_map {
int id;
unsigned long offset;
unsigned long size;
+ u8 count;
};
struct cxl_component_reg_map {
@@ -228,4 +229,7 @@ struct cxl_memdev *devm_cxl_probe_mem(struct cxl_dev_state *cxlds,
struct range *range);
int cxl_set_capacity(struct cxl_dev_state *cxlds, u64 capacity);
+
+int cxl_get_hdm_info(struct cxl_dev_state *cxlds, u8 *count,
+ resource_size_t *offset, resource_size_t *size);
#endif /* __CXL_CXL_H__ */
--
2.25.1
* [PATCH v3 02/11] cxl: Split cxl_await_range_active() from media-ready wait
From: mhonap @ 2026-06-25 16:53 UTC
To: djbw, alex, jgg, jic23, dave.jiang, ankita,
alejandro.lucero-palau, alison.schofield, dave, dmatlack, gourry,
ira.weiny
Cc: cjia, kjaju, vsethi, zhiw, mhonap, kvm, linux-cxl, linux-doc,
linux-kernel, linux-kselftest
From: Manish Honap <mhonap@nvidia.com>
Before accessing CXL device memory after reset or power-on, the
driver must ensure media is ready. Not every CXL device implements
the CXL Memory Device register group: many Type-2 devices do not.
cxl_await_media_ready() reads cxlds->regs.memdev. Access to memdev
registers on a Type-2 device that lacks them can result in a kernel
panic.
Split the HDM DVSEC range-active poll out of cxl_await_media_ready()
into a new helper, cxl_await_range_active(). Type-2 cxl drivers
(vfio-cxl, in-kernel accelerator drivers) whose devices lack the
CXLMDEV status register call it directly. cxl_await_media_ready() now
calls cxl_await_range_active() for the DVSEC poll, then reads the
memory device status as before.
The per-range timeout from cxl_await_media_ready() (the
media_ready_timeout module parameter, 60 seconds by default) still
applies. Export under the CXL namespace.
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
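Illustration only, not part of the patch: a userspace sketch of the split
described in the commit message. The device model, the demo_* names, and the
poll counter that stands in for the timed wait are all hypothetical. The
-ENXIO return for a device without memdev registers is a choice made for this
sketch, not something the kernel code does.

```c
#include <errno.h>
#include <stdbool.h>

#define DEMO_MAX_RANGES 2

/* Hypothetical device model; field names are illustrative only. */
struct demo_dev {
	int hdm_count;                           /* ranges advertised in the DVSEC */
	int polls_until_active[DEMO_MAX_RANGES]; /* polls before a range is active */
	bool has_memdev_regs;                    /* false for many Type-2 devices */
	bool memdev_ready;                       /* models the CXLMDEV ready state */
};

/* Poll one range at most max_polls times; stands in for the timed wait. */
int demo_range_active(struct demo_dev *dev, int id, int max_polls)
{
	for (int i = 0; i < max_polls; i++) {
		if (dev->polls_until_active[id] == 0)
			return 0;
		dev->polls_until_active[id]--;
	}
	return -ETIMEDOUT;
}

/* DVSEC-only wait: usable on devices without memdev registers. */
int demo_await_range_active(struct demo_dev *dev, int max_polls)
{
	if (dev->hdm_count < 1 || dev->hdm_count > DEMO_MAX_RANGES)
		return -EINVAL;

	for (int i = 0; i < dev->hdm_count; i++) {
		int rc = demo_range_active(dev, i, max_polls);

		if (rc)
			return rc;
	}
	return 0;
}

/* Full wait: the DVSEC poll first, then the memdev status check. */
int demo_await_media_ready(struct demo_dev *dev, int max_polls)
{
	int rc = demo_await_range_active(dev, max_polls);

	if (rc)
		return rc;
	if (!dev->has_memdev_regs)
		return -ENXIO; /* sketch-only guard, see the note above */
	return dev->memdev_ready ? 0 : -EIO;
}
```

A Type-2 driver would call only the range-active helper, which never touches
the memdev status; the full helper keeps the original ordering.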
drivers/cxl/core/pci.c | 35 ++++++++++++++++++++++++++++++-----
include/cxl/cxl.h | 2 ++
2 files changed, 32 insertions(+), 5 deletions(-)
diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index c917608c16f9..c44595447bd8 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -142,16 +142,24 @@ static int cxl_dvsec_mem_range_active(struct cxl_dev_state *cxlds, int id)
return 0;
}
-/*
- * Wait up to @media_ready_timeout for the device to report memory
- * active.
+/**
+ * cxl_await_range_active - Wait for all HDM DVSEC memory ranges to be active
+ * @cxlds: CXL device state (DVSEC and HDM count must be valid)
+ *
+ * For each HDM decoder range reported in the CXL DVSEC capability, waits
+ * for the range to report MEM INFO VALID (up to 1s per range), then
+ * MEM ACTIVE (up to media_ready_timeout seconds per range, default 60s).
+ * Used by cxl_await_media_ready() and by cxl drivers that bind to Type-2
+ * devices without the memdev mailbox (e.g. vfio-cxl, accelerator drivers).
+ *
+ * Return: 0 if all ranges become valid and active, -ETIMEDOUT if a
+ * timeout occurs, or a negative errno from config read on failure.
*/
-int cxl_await_media_ready(struct cxl_dev_state *cxlds)
+int cxl_await_range_active(struct cxl_dev_state *cxlds)
{
struct pci_dev *pdev = to_pci_dev(cxlds->dev);
int d = cxlds->cxl_dvsec;
int rc, i, hdm_count;
- u64 md_status;
u16 cap;
rc = pci_read_config_word(pdev,
@@ -172,6 +180,23 @@ int cxl_await_media_ready(struct cxl_dev_state *cxlds)
return rc;
}
+ return 0;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_await_range_active, "CXL");
+
+/*
+ * Wait up to @media_ready_timeout for the device to report memory
+ * active.
+ */
+int cxl_await_media_ready(struct cxl_dev_state *cxlds)
+{
+ u64 md_status;
+ int rc;
+
+ rc = cxl_await_range_active(cxlds);
+ if (rc)
+ return rc;
+
md_status = readq(cxlds->regs.memdev + CXLMDEV_STATUS_OFFSET);
if (!CXLMDEV_READY(md_status))
return -EIO;
diff --git a/include/cxl/cxl.h b/include/cxl/cxl.h
index 440ab09c640e..3dcc034360af 100644
--- a/include/cxl/cxl.h
+++ b/include/cxl/cxl.h
@@ -232,4 +232,6 @@ int cxl_set_capacity(struct cxl_dev_state *cxlds, u64 capacity);
int cxl_get_hdm_info(struct cxl_dev_state *cxlds, u8 *count,
resource_size_t *offset, resource_size_t *size);
+
+int cxl_await_range_active(struct cxl_dev_state *cxlds);
#endif /* __CXL_CXL_H__ */
--
2.25.1
* [PATCH v3 03/11] cxl: Record BIR and BAR offset in cxl_register_map
From: mhonap @ 2026-06-25 16:53 UTC
To: djbw, alex, jgg, jic23, dave.jiang, ankita,
alejandro.lucero-palau, alison.schofield, dave, dmatlack, gourry,
ira.weiny
Cc: cjia, kjaju, vsethi, zhiw, mhonap, kvm, linux-cxl, linux-doc,
linux-kernel, linux-kselftest
From: Manish Honap <mhonap@nvidia.com>
The Register Locator DVSEC (CXL r4.0 8.1.9) describes register blocks
by BAR index (BIR) and offset within the BAR. CXL core currently
only stores the resolved HPA (resource + offset) in struct
cxl_register_map, so callers that need pci_iomap() or want to report
the BAR to userspace must reverse-engineer the BAR from the HPA.
Add bar_index and bar_offset to struct cxl_register_map and fill
them in cxl_decode_regblock() when the regblock is BAR-backed
(BIR 0-5). Add cxl_regblock_get_bar_info() so cxl drivers
(vfio-cxl, in-kernel accelerator drivers) can read the values
without touching the struct internals. Export under the CXL
namespace.
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
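Illustration only, not part of the patch: a userspace sketch of how the
recorded BAR index and offset relate to the resolved address, and of the 0xff
sentinel for register blocks that are not BAR-backed. The struct is a reduced,
hypothetical stand-in for struct cxl_register_map, and the demo_* names do not
exist in the kernel.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_BAR_NONE 0xff /* sentinel the patch uses for non-BAR blocks */

/* Reduced, hypothetical stand-in for struct cxl_register_map. */
struct demo_register_map {
	uint64_t resource;   /* resolved address: BAR start + offset */
	uint8_t bar_index;   /* 0-5 when BAR-backed, DEMO_BAR_NONE otherwise */
	uint64_t bar_offset; /* only meaningful when bar_index <= 5 */
};

/* Decode step: keep the BAR coordinates next to the resolved address. */
void demo_decode_regblock(struct demo_register_map *map, int bar,
			  uint64_t bar_start, uint64_t offset)
{
	if (bar >= 0 && bar <= 5) {
		map->bar_index = (uint8_t)bar;
		map->bar_offset = offset;
	} else {
		map->bar_index = DEMO_BAR_NONE;
		map->bar_offset = 0;
	}
	map->resource = bar_start + offset;
}

/* Accessor with the same contract as cxl_regblock_get_bar_info(). */
int demo_get_bar_info(const struct demo_register_map *map,
		      uint8_t *bar_index, uint64_t *bar_offset)
{
	if (!map || !bar_index || !bar_offset)
		return -EINVAL;
	if (map->bar_index > 5)
		return -EINVAL;

	*bar_index = map->bar_index;
	*bar_offset = map->bar_offset;
	return 0;
}
```

With both values cached, a caller no longer has to search the BARs for the
one that contains the resolved address.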
drivers/cxl/core/pci.c | 2 ++
drivers/cxl/core/regs.c | 34 ++++++++++++++++++++++++++++++++++
include/cxl/cxl.h | 12 ++++++++++++
3 files changed, 48 insertions(+)
diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index c44595447bd8..9b9b17db9ee4 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -764,6 +764,8 @@ static int cxl_rcrb_get_comp_regs(struct pci_dev *pdev,
*map = (struct cxl_register_map) {
.host = &pdev->dev,
.resource = CXL_RESOURCE_NONE,
+ .bar_index = 0xff,
+ .bar_offset = 0,
};
component_reg_phys = cxl_rcd_component_reg_phys(&pdev->dev, dport);
diff --git a/drivers/cxl/core/regs.c b/drivers/cxl/core/regs.c
index e828df0629d0..6af5739aa776 100644
--- a/drivers/cxl/core/regs.c
+++ b/drivers/cxl/core/regs.c
@@ -285,12 +285,46 @@ static bool cxl_decode_regblock(struct pci_dev *pdev, u32 reg_lo, u32 reg_hi,
return false;
}
+ if (bar >= 0 && bar <= 5) {
+ map->bar_index = (u8)bar;
+ map->bar_offset = offset;
+ } else {
+ map->bar_index = 0xff;
+ map->bar_offset = 0;
+ }
+
map->reg_type = reg_type;
map->resource = pci_resource_start(pdev, bar) + offset;
map->max_size = pci_resource_len(pdev, bar) - offset;
return true;
}
+/**
+ * cxl_regblock_get_bar_info - read BAR index and offset for a regblock
+ * @map: regblock map produced by cxl_find_regblock()
+ * @bar_index: out, PCI BAR index (0-5)
+ * @bar_offset: out, byte offset of the regblock within the BAR
+ *
+ * Exported for cxl drivers (vfio-cxl, in-kernel accelerator drivers)
+ * that need to map the regblock via pci_iomap() or report the BAR to
+ * userspace.
+ *
+ * Return: 0 on success, -EINVAL if the regblock is not BAR-backed or
+ * if any out pointer is NULL.
+ */
+int cxl_regblock_get_bar_info(const struct cxl_register_map *map,
+ u8 *bar_index, resource_size_t *bar_offset)
+{
+ if (!map || !bar_index || !bar_offset)
+ return -EINVAL;
+ if (map->bar_index > 5)
+ return -EINVAL;
+ *bar_index = map->bar_index;
+ *bar_offset = map->bar_offset;
+ return 0;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_regblock_get_bar_info, "CXL");
+
/*
* __cxl_find_regblock_instance() - Locate a register block or count instances by type / index
* Use CXL_INSTANCES_COUNT for @index if counting instances.
diff --git a/include/cxl/cxl.h b/include/cxl/cxl.h
index 3dcc034360af..3bcb71d80c91 100644
--- a/include/cxl/cxl.h
+++ b/include/cxl/cxl.h
@@ -100,9 +100,16 @@ struct cxl_pmu_reg_map {
* @resource: physical resource base of the register block
* @max_size: maximum mapping size to perform register search
* @reg_type: see enum cxl_regloc_type
+ * @bar_index: PCI BAR index (0-5) when regblock is BAR-backed; 0xff otherwise
+ * @bar_offset: offset within the BAR; only valid when bar_index <= 5
* @component_map: cxl_reg_map for component registers
* @device_map: cxl_reg_maps for device registers
* @pmu_map: cxl_reg_maps for CXL Performance Monitoring Units
+ *
+ * When the register block is described by the Register Locator DVSEC with
+ * a BAR Indicator (BIR 0-5), bar_index and bar_offset are set so callers
+ * can use pci_iomap(pdev, bar_index, size) and base + bar_offset instead
+ * of ioremap(resource).
*/
struct cxl_register_map {
struct device *host;
@@ -110,6 +117,8 @@ struct cxl_register_map {
resource_size_t resource;
resource_size_t max_size;
u8 reg_type;
+ u8 bar_index;
+ resource_size_t bar_offset;
union {
struct cxl_component_reg_map component_map;
struct cxl_device_reg_map device_map;
@@ -234,4 +243,7 @@ int cxl_get_hdm_info(struct cxl_dev_state *cxlds, u8 *count,
resource_size_t *offset, resource_size_t *size);
int cxl_await_range_active(struct cxl_dev_state *cxlds);
+
+int cxl_regblock_get_bar_info(const struct cxl_register_map *map,
+ u8 *bar_index, resource_size_t *bar_offset);
#endif /* __CXL_CXL_H__ */
--
2.25.1
* [PATCH v3 04/11] cxl: Move component/HDM register defines to uapi/cxl/cxl_regs.h
From: mhonap @ 2026-06-25 16:54 UTC
To: djbw, alex, jgg, jic23, dave.jiang, ankita,
alejandro.lucero-palau, alison.schofield, dave, dmatlack, gourry,
ira.weiny
Cc: cjia, kjaju, vsethi, zhiw, mhonap, kvm, linux-cxl, linux-doc,
linux-kernel, linux-kselftest
From: Manish Honap <mhonap@nvidia.com>
The CXL component register layout and the HDM Decoder Capability
Structure defines live in drivers/cxl/cxl.h, where userspace
consumers cannot include them without depending on kernel-only
headers. A VMM that owns a vfio-cxl COMP_REGS shadow region needs
these defines to interpret the shadow contents.
Move the spec-defined register layout, capability identifiers, and
HDM decoder field masks to a new public uapi header,
include/uapi/cxl/cxl_regs.h. Use __GENMASK() and _BITUL() (not
GENMASK() / BIT()) so the header is uapi-clean. Include
<asm/bitsperlong.h> for the __BITS_PER_LONG that __GENMASK() needs.
drivers/cxl/cxl.h now includes <uapi/cxl/cxl_regs.h>; the values
are identical, so kernel callers see no change. Static inline
helpers that use FIELD_GET stay in drivers/cxl/cxl.h.
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
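Illustration only, not part of the patch: a sketch of how a userspace
consumer might interpret a buffer read from the COMP_REGS shadow with these
defines. So that it builds without the new header, it carries local DEMO_*
copies of four values from the diff; a real consumer would include the
installed header instead. The helper names are hypothetical, and the sketch
returns raw field values rather than decoded decoder counts.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Local copies of values from the patch, for a standalone build only. */
#define DEMO_HDM_DECODER_COUNT_MASK        0x0000000fu /* bits 3:0 */
#define DEMO_HDM_DECODER_TARGET_COUNT_MASK 0x000000f0u /* bits 7:4 */
#define DEMO_HDM_DECODER0_CTRL_OFFSET(i)   (0x20 * (i) + 0x20)
#define DEMO_HDM_DECODER0_CTRL_COMMITTED   (1u << 10)

/* Read a little-endian 32-bit register from a byte buffer. */
uint32_t demo_read_le32(const uint8_t *buf, size_t off)
{
	return (uint32_t)buf[off] |
	       ((uint32_t)buf[off + 1] << 8) |
	       ((uint32_t)buf[off + 2] << 16) |
	       ((uint32_t)buf[off + 3] << 24);
}

/* Extract a field: mask it, then divide by the mask's lowest set bit. */
uint32_t demo_field_get(uint32_t val, uint32_t mask)
{
	return (val & mask) / (mask & (~mask + 1u));
}

/* True if decoder i reports COMMITTED; false if it lies outside the buffer. */
bool demo_decoder_committed(const uint8_t *hdm, size_t len, unsigned int i)
{
	size_t off = DEMO_HDM_DECODER0_CTRL_OFFSET(i);

	if (len < 4 || off > len - 4)
		return false;
	return demo_read_le32(hdm, off) & DEMO_HDM_DECODER0_CTRL_COMMITTED;
}
```

The offsets passed in are relative to the start of the HDM decoder block, as
in the macros above.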
drivers/cxl/cxl.h | 52 +++++-------------------------
include/uapi/cxl/cxl_regs.h | 63 +++++++++++++++++++++++++++++++++++++
2 files changed, 70 insertions(+), 45 deletions(-)
create mode 100644 include/uapi/cxl/cxl_regs.h
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index f43abd1903ce..583a27b6659e 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -24,51 +24,13 @@ extern const struct nvdimm_security_ops *cxl_security_ops;
* (port-driver, region-driver, nvdimm object-drivers... etc).
*/
-/* CXL 2.0 8.2.4 CXL Component Register Layout and Definition */
-#define CXL_COMPONENT_REG_BLOCK_SIZE SZ_64K
-
-/* CXL 2.0 8.2.5 CXL.cache and CXL.mem Registers*/
-#define CXL_CM_OFFSET 0x1000
-#define CXL_CM_CAP_HDR_OFFSET 0x0
-#define CXL_CM_CAP_HDR_ID_MASK GENMASK(15, 0)
-#define CM_CAP_HDR_CAP_ID 1
-#define CXL_CM_CAP_HDR_VERSION_MASK GENMASK(19, 16)
-#define CM_CAP_HDR_CAP_VERSION 1
-#define CXL_CM_CAP_HDR_CACHE_MEM_VERSION_MASK GENMASK(23, 20)
-#define CM_CAP_HDR_CACHE_MEM_VERSION 1
-#define CXL_CM_CAP_HDR_ARRAY_SIZE_MASK GENMASK(31, 24)
-#define CXL_CM_CAP_PTR_MASK GENMASK(31, 20)
-
-#define CXL_CM_CAP_CAP_ID_RAS 0x2
-#define CXL_CM_CAP_CAP_ID_HDM 0x5
-#define CXL_CM_CAP_CAP_HDM_VERSION 1
-
-/* HDM decoders CXL 2.0 8.2.5.12 CXL HDM Decoder Capability Structure */
-#define CXL_HDM_DECODER_CAP_OFFSET 0x0
-#define CXL_HDM_DECODER_COUNT_MASK GENMASK(3, 0)
-#define CXL_HDM_DECODER_TARGET_COUNT_MASK GENMASK(7, 4)
-#define CXL_HDM_DECODER_INTERLEAVE_11_8 BIT(8)
-#define CXL_HDM_DECODER_INTERLEAVE_14_12 BIT(9)
-#define CXL_HDM_DECODER_INTERLEAVE_3_6_12_WAY BIT(11)
-#define CXL_HDM_DECODER_INTERLEAVE_16_WAY BIT(12)
-#define CXL_HDM_DECODER_CTRL_OFFSET 0x4
-#define CXL_HDM_DECODER_ENABLE BIT(1)
-#define CXL_HDM_DECODER0_BASE_LOW_OFFSET(i) (0x20 * (i) + 0x10)
-#define CXL_HDM_DECODER0_BASE_HIGH_OFFSET(i) (0x20 * (i) + 0x14)
-#define CXL_HDM_DECODER0_SIZE_LOW_OFFSET(i) (0x20 * (i) + 0x18)
-#define CXL_HDM_DECODER0_SIZE_HIGH_OFFSET(i) (0x20 * (i) + 0x1c)
-#define CXL_HDM_DECODER0_CTRL_OFFSET(i) (0x20 * (i) + 0x20)
-#define CXL_HDM_DECODER0_CTRL_IG_MASK GENMASK(3, 0)
-#define CXL_HDM_DECODER0_CTRL_IW_MASK GENMASK(7, 4)
-#define CXL_HDM_DECODER0_CTRL_LOCK BIT(8)
-#define CXL_HDM_DECODER0_CTRL_COMMIT BIT(9)
-#define CXL_HDM_DECODER0_CTRL_COMMITTED BIT(10)
-#define CXL_HDM_DECODER0_CTRL_COMMIT_ERROR BIT(11)
-#define CXL_HDM_DECODER0_CTRL_HOSTONLY BIT(12)
-#define CXL_HDM_DECODER0_TL_LOW(i) (0x20 * (i) + 0x24)
-#define CXL_HDM_DECODER0_TL_HIGH(i) (0x20 * (i) + 0x28)
-#define CXL_HDM_DECODER0_SKIP_LOW(i) CXL_HDM_DECODER0_TL_LOW(i)
-#define CXL_HDM_DECODER0_SKIP_HIGH(i) CXL_HDM_DECODER0_TL_HIGH(i)
+/*
+ * Spec-defined CXL component register layout and HDM Decoder
+ * Capability Structure constants live in <uapi/cxl/cxl_regs.h> so a
+ * userspace VMM that owns a vfio-cxl COMP_REGS shadow region can
+ * consume them without depending on kernel-only headers.
+ */
+#include <uapi/cxl/cxl_regs.h>
/* HDM decoder control register constants CXL 3.0 8.2.5.19.7 */
#define CXL_DECODER_MIN_GRANULARITY 256
diff --git a/include/uapi/cxl/cxl_regs.h b/include/uapi/cxl/cxl_regs.h
new file mode 100644
index 000000000000..b284b7ad2d42
--- /dev/null
+++ b/include/uapi/cxl/cxl_regs.h
@@ -0,0 +1,63 @@
+/* SPDX-License-Identifier: GPL-2.0-only WITH Linux-syscall-note */
+/*
+ * CXL component register layout and HDM Decoder Capability Structure
+ * defines. Userspace consumers (e.g. a VMM that owns a vfio-cxl
+ * COMP_REGS shadow region) need these without kernel-only header
+ * dependencies.
+ *
+ * Spec references: CXL r4.0 sections 8.2.3 and 8.2.4.20.
+ */
+#ifndef _UAPI_CXL_REGS_H_
+#define _UAPI_CXL_REGS_H_
+
+#include <asm/bitsperlong.h> /* __BITS_PER_LONG; needed by __GENMASK() */
+#include <linux/const.h> /* _BITUL(), _BITULL() */
+#include <linux/bits.h> /* __GENMASK() */
+
+/* CXL r4.0 8.2.3 CXL Component Register Layout and Definition */
+#define CXL_COMPONENT_REG_BLOCK_SIZE 0x00010000
+
+/* CXL r4.0 8.2.4 CXL.cache and CXL.mem Registers */
+#define CXL_CM_OFFSET 0x1000
+#define CXL_CM_CAP_HDR_OFFSET 0x0
+#define CXL_CM_CAP_HDR_ID_MASK __GENMASK(15, 0)
+#define CM_CAP_HDR_CAP_ID 1
+#define CXL_CM_CAP_HDR_VERSION_MASK __GENMASK(19, 16)
+#define CM_CAP_HDR_CAP_VERSION 1
+#define CXL_CM_CAP_HDR_CACHE_MEM_VERSION_MASK __GENMASK(23, 20)
+#define CM_CAP_HDR_CACHE_MEM_VERSION 1
+#define CXL_CM_CAP_HDR_ARRAY_SIZE_MASK __GENMASK(31, 24)
+#define CXL_CM_CAP_PTR_MASK __GENMASK(31, 20)
+
+#define CXL_CM_CAP_CAP_ID_RAS 0x2
+#define CXL_CM_CAP_CAP_ID_HDM 0x5
+#define CXL_CM_CAP_CAP_HDM_VERSION 1
+
+/* HDM decoders, CXL r4.0 8.2.4.20 */
+#define CXL_HDM_DECODER_CAP_OFFSET 0x0
+#define CXL_HDM_DECODER_COUNT_MASK __GENMASK(3, 0)
+#define CXL_HDM_DECODER_TARGET_COUNT_MASK __GENMASK(7, 4)
+#define CXL_HDM_DECODER_INTERLEAVE_11_8 _BITUL(8)
+#define CXL_HDM_DECODER_INTERLEAVE_14_12 _BITUL(9)
+#define CXL_HDM_DECODER_INTERLEAVE_3_6_12_WAY _BITUL(11)
+#define CXL_HDM_DECODER_INTERLEAVE_16_WAY _BITUL(12)
+#define CXL_HDM_DECODER_CTRL_OFFSET 0x4
+#define CXL_HDM_DECODER_ENABLE _BITUL(1)
+#define CXL_HDM_DECODER0_BASE_LOW_OFFSET(i) (0x20 * (i) + 0x10)
+#define CXL_HDM_DECODER0_BASE_HIGH_OFFSET(i) (0x20 * (i) + 0x14)
+#define CXL_HDM_DECODER0_SIZE_LOW_OFFSET(i) (0x20 * (i) + 0x18)
+#define CXL_HDM_DECODER0_SIZE_HIGH_OFFSET(i) (0x20 * (i) + 0x1c)
+#define CXL_HDM_DECODER0_CTRL_OFFSET(i) (0x20 * (i) + 0x20)
+#define CXL_HDM_DECODER0_CTRL_IG_MASK __GENMASK(3, 0)
+#define CXL_HDM_DECODER0_CTRL_IW_MASK __GENMASK(7, 4)
+#define CXL_HDM_DECODER0_CTRL_LOCK _BITUL(8)
+#define CXL_HDM_DECODER0_CTRL_COMMIT _BITUL(9)
+#define CXL_HDM_DECODER0_CTRL_COMMITTED _BITUL(10)
+#define CXL_HDM_DECODER0_CTRL_COMMIT_ERROR _BITUL(11)
+#define CXL_HDM_DECODER0_CTRL_HOSTONLY _BITUL(12)
+#define CXL_HDM_DECODER0_TL_LOW(i) (0x20 * (i) + 0x24)
+#define CXL_HDM_DECODER0_TL_HIGH(i) (0x20 * (i) + 0x28)
+#define CXL_HDM_DECODER0_SKIP_LOW(i) CXL_HDM_DECODER0_TL_LOW(i)
+#define CXL_HDM_DECODER0_SKIP_HIGH(i) CXL_HDM_DECODER0_TL_HIGH(i)
+
+#endif /* _UAPI_CXL_REGS_H_ */
--
2.25.1
* [PATCH v3 05/11] vfio: UAPI for CXL Type-2 device passthrough
From: mhonap @ 2026-06-25 16:54 UTC
To: djbw, alex, jgg, jic23, dave.jiang, ankita,
alejandro.lucero-palau, alison.schofield, dave, dmatlack, gourry,
ira.weiny
Cc: cjia, kjaju, vsethi, zhiw, mhonap, kvm, linux-cxl, linux-doc,
linux-kernel, linux-kselftest
From: Manish Honap <mhonap@nvidia.com>
Add the user-visible interface that exposes a CXL Type-2 device to a
VMM through vfio-pci:

  VFIO_DEVICE_FLAGS_CXL (bit 9) on vfio_device_info::flags marks the
  device as CXL.

  VFIO_DEVICE_INFO_CAP_CXL (id 6) is the capability that carries the
  HDM-backed memory region index, the CXL component register region
  index, and the layout of the component register block within the
  containing PCI BAR.

  VFIO_REGION_SUBTYPE_CXL identifies the HDM memory region.

  VFIO_REGION_SUBTYPE_CXL_COMP_REGS identifies the CXL component
  register shadow.

Only the HOST_FIRMWARE_COMMITTED flag is exposed. Other CXL device
states stay invisible to userspace at this stage.
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
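Illustration only, not part of the patch: a sketch of how a VMM could locate
the new capability in the buffer returned by VFIO_DEVICE_GET_INFO. It follows
the existing VFIO capability chain convention, where cap_offset and each
header's next field are byte offsets from the start of the info buffer and 0
ends the chain. The header struct is a local mirror of struct
vfio_info_cap_header; demo_find_cap() and the hop limit are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_CAP_CXL 6 /* VFIO_DEVICE_INFO_CAP_CXL in the patch */

/* Local mirror of struct vfio_info_cap_header from <linux/vfio.h>. */
struct demo_cap_header {
	uint16_t id;
	uint16_t version;
	uint32_t next; /* offset of the next capability, 0 terminates */
};

/*
 * Return the byte offset of capability @id inside @info, or 0 if the
 * chain does not contain it. The hop limit guards against a looping
 * chain in a malformed buffer.
 */
size_t demo_find_cap(const uint8_t *info, size_t len, uint32_t cap_offset,
		     uint16_t id)
{
	struct demo_cap_header hdr;
	size_t off = cap_offset;

	for (int hops = 0; hops < 64; hops++) {
		if (off == 0 || len < sizeof(hdr) || off > len - sizeof(hdr))
			return 0;

		/* memcpy avoids unaligned access into the byte buffer */
		memcpy(&hdr, info + off, sizeof(hdr));
		if (hdr.id == id)
			return off;
		off = hdr.next;
	}
	return 0;
}
```

A VMM would walk the chain only when VFIO_DEVICE_FLAGS_CAPS is set, then read
the capability body at the returned offset.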
include/uapi/linux/vfio.h | 46 +++++++++++++++++++++++++++++++++++++++
1 file changed, 46 insertions(+)
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 5de618a3a5ee..3707d53c4de5 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -215,6 +215,7 @@ struct vfio_device_info {
#define VFIO_DEVICE_FLAGS_FSL_MC (1 << 6) /* vfio-fsl-mc device */
#define VFIO_DEVICE_FLAGS_CAPS (1 << 7) /* Info supports caps */
#define VFIO_DEVICE_FLAGS_CDX (1 << 8) /* vfio-cdx device */
+#define VFIO_DEVICE_FLAGS_CXL (1 << 9) /* vfio-cxl Type-2 device */
__u32 num_regions; /* Max region index + 1 */
__u32 num_irqs; /* Max IRQ index + 1 */
__u32 cap_offset; /* Offset within info struct of first cap */
@@ -257,6 +258,36 @@ struct vfio_device_info_cap_pci_atomic_comp {
__u32 reserved;
};
+/*
+ * VFIO_DEVICE_INFO capability for CXL Type-2 passthrough devices.
+ * Present when VFIO_DEVICE_FLAGS_CXL is set on vfio_device_info::flags.
+ *
+ * @flags: VFIO_CXL_CAP_HOST_FIRMWARE_COMMITTED indicates the host CXL
+ * subsystem committed the endpoint HDM decoder.
+ * @hdm_region_idx: VFIO region index for the HDM memory region
+ * (subtype VFIO_REGION_SUBTYPE_CXL).
+ * @comp_reg_region_idx: VFIO region index for the CXL Component
+ * Register shadow (subtype VFIO_REGION_SUBTYPE_CXL_COMP_REGS).
+ * @comp_reg_bar: PCI BAR index that contains the CXL component
+ * register block. Get-region-info on this BAR returns a
+ * VFIO_REGION_INFO_CAP_SPARSE_MMAP that excludes the CXL block.
+ * @comp_reg_offset: byte offset of the CXL component register block
+ * within @comp_reg_bar.
+ * @comp_reg_size: byte size of the CXL component register block.
+ */
+#define VFIO_DEVICE_INFO_CAP_CXL 6
+struct vfio_device_info_cap_cxl {
+ struct vfio_info_cap_header header;
+ __u32 flags;
+#define VFIO_CXL_CAP_HOST_FIRMWARE_COMMITTED (1 << 0)
+ __u32 hdm_region_idx;
+ __u32 comp_reg_region_idx;
+ __u32 comp_reg_bar;
+ __u32 __resv;
+ __u64 comp_reg_offset;
+ __u64 comp_reg_size;
+};
+
/**
* VFIO_DEVICE_GET_REGION_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 8,
* struct vfio_region_info)
@@ -425,6 +456,21 @@ struct vfio_region_gfx_edid {
#define VFIO_REGION_SUBTYPE_CCW_SCHIB (2)
#define VFIO_REGION_SUBTYPE_CCW_CRW (3)
+/*
+ * sub-types for VFIO_REGION_TYPE_PCI_VENDOR (vendor id 1e98 reserved
+ * for the CXL Consortium); used by vfio-cxl Type-2 device passthrough.
+ *
+ * VFIO_REGION_SUBTYPE_CXL exposes the HDM-backed device memory range
+ * as a mappable region. The range is allocated by the host CXL
+ * subsystem and the VMM is expected to mmap() it.
+ * VFIO_REGION_SUBTYPE_CXL_COMP_REGS exposes the CXL Component Register
+ * block (read-write via pread()/pwrite() only, no mmap()). The VMM
+ * reads and writes HDM Decoder Capability registers through this
+ * shadow region instead of touching hardware directly.
+ */
+#define VFIO_REGION_SUBTYPE_CXL (1)
+#define VFIO_REGION_SUBTYPE_CXL_COMP_REGS (2)
+
/* sub-types for VFIO_REGION_TYPE_MIGRATION */
#define VFIO_REGION_SUBTYPE_MIGRATION_DEPRECATED (1)
--
2.25.1
* [PATCH v3 06/11] cxl: Add register-virtualization helpers for vfio Type-2 passthrough
From: mhonap @ 2026-06-25 16:54 UTC
To: djbw, alex, jgg, jic23, dave.jiang, ankita,
alejandro.lucero-palau, alison.schofield, dave, dmatlack, gourry,
ira.weiny
Cc: cjia, kjaju, vsethi, zhiw, mhonap, kvm, linux-cxl, linux-doc,
linux-kernel, linux-kselftest
From: Manish Honap <mhonap@nvidia.com>
vfio-pci needs the CXL Device DVSEC body, the HDM Decoder Capability
block, and the CXL.cache/mem cap-array prefix to be virtualized
toward a KVM guest in a CXL-spec-compliant way.
Introduce a narrow helper API owned by cxl-core:

  struct cxl_passthrough *
  devm_cxl_passthrough_create(struct device *dev,
                              struct cxl_dev_state *cxlds);
  int cxl_passthrough_dvsec_rw(struct cxl_passthrough *p, u32 off,
                               u32 *val, size_t sz, bool write);
  int cxl_passthrough_hdm_rw(struct cxl_passthrough *p, u32 off,
                             u32 *val, bool write);
  int cxl_passthrough_cm_rw(struct cxl_passthrough *p, u32 off,
                            u32 *val, bool write);

Each helper takes a per-device mutex covering the DVSEC and HDM shadows
(the CM cap-array snapshot is immutable after create) and dispatches
by offset to a hand-written write handler that follows the spec
attribute of each field:

  DVSEC (CXL r4.0 §8.1.3): LOCK is RWO, CONTROL/CONTROL2 are RWL gated
  on CONFIG_LOCK, STATUS/STATUS2 are RW1C, RANGE1 is HwInit, and RANGE2
  is RsvdZ.

  HDM (CXL r4.0 §8.2.4.20): GLOBAL_CTRL is RW, decoder CTRL implements
  COMMIT/COMMITTED, decoder BASE/SIZE are RWL gated on COMMITTED or
  LOCK_ON_COMMIT, and the cap header is HwInit.

Writes to the CM cap-array are silently discarded because the
cap-array headers are RO per CXL r4.0 §8.2.4; the write parameter is
kept on the rw API to make the drop policy explicit at the call site.
The shadows are snapshotted at create time: the DVSEC body from PCI
config space dword-at-a-time, the CM cap-array and HDM block from
the cxl-core MMIO mapping at cxlds->reg_map.base. This preserves
firmware-committed values so the guest reads what the host BIOS
committed, while writes update the shadow per the per-field write
semantics above.
The file is gated by the hidden Kconfig CXL_VFIO_PASSTHROUGH so the
passthrough code stays out of cxl_core when no vfio consumer is configured.
Scope: firmware-committed, single-decoder, no-interleave Type-2
passthrough. Multi-decoder, interleave, and hotplug configurations
are out of scope and are rejected at create time (-EOPNOTSUPP for
hdm_count != 1).
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
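Illustration only, not part of the patch: a sketch of the three write
attributes named in the commit message, applied to a 16-bit shadow value. The
demo_* helpers and their mask parameters are hypothetical and are not the
handlers in passthrough.c. RWO is simplified here to "a set bit cannot be
cleared again", which is an assumption of the sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* RW1C: writing 1 clears the bit, writing 0 leaves it unchanged. */
uint16_t demo_write_rw1c(uint16_t shadow, uint16_t val, uint16_t rw1c_mask)
{
	return (uint16_t)(shadow & ~(val & rw1c_mask));
}

/* RWO (simplified): bits can be set but never cleared by a later write. */
uint16_t demo_write_rwo(uint16_t shadow, uint16_t val, uint16_t rwo_mask)
{
	return (uint16_t)(shadow | (val & rwo_mask));
}

/* RWL: writable while unlocked; the write is dropped once locked. */
uint16_t demo_write_rwl(uint16_t shadow, uint16_t val, uint16_t rwl_mask,
			bool locked)
{
	if (locked)
		return shadow;
	return (uint16_t)((shadow & ~rwl_mask) | (val & rwl_mask));
}
```

In this model a guest write never reaches hardware: the handler computes the
new shadow value and later reads are served from it.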
drivers/cxl/Kconfig | 7 +
drivers/cxl/core/Makefile | 1 +
drivers/cxl/core/passthrough.c | 590 +++++++++++++++++++++++++++++++++
include/cxl/passthrough.h | 121 +++++++
4 files changed, 719 insertions(+)
create mode 100644 drivers/cxl/core/passthrough.c
create mode 100644 include/cxl/passthrough.h
diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
index 80aeb0d556bd..7c874d486a9c 100644
--- a/drivers/cxl/Kconfig
+++ b/drivers/cxl/Kconfig
@@ -19,6 +19,13 @@ menuconfig CXL_BUS
if CXL_BUS
+config CXL_VFIO_PASSTHROUGH
+ bool
+ # Hidden symbol selected by VFIO_PCI_CXL to pull
+ # drivers/cxl/core/passthrough.c into cxl_core when a vfio
+ # Type-2 passthrough consumer is configured. Keep silent: no
+ # help text, no default, no user-visible prompt.
+
config CXL_PCI
tristate "PCI manageability"
default CXL_BUS
diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
index ce7213818d3c..0cc80bd35a88 100644
--- a/drivers/cxl/core/Makefile
+++ b/drivers/cxl/core/Makefile
@@ -22,3 +22,4 @@ cxl_core-$(CONFIG_CXL_EDAC_MEM_FEATURES) += edac.o
cxl_core-$(CONFIG_CXL_RAS) += ras.o
cxl_core-$(CONFIG_CXL_RAS) += ras_rch.o
cxl_core-$(CONFIG_CXL_ATL) += atl.o
+cxl_core-$(CONFIG_CXL_VFIO_PASSTHROUGH) += passthrough.o
diff --git a/drivers/cxl/core/passthrough.c b/drivers/cxl/core/passthrough.c
new file mode 100644
index 000000000000..b89829586024
--- /dev/null
+++ b/drivers/cxl/core/passthrough.c
@@ -0,0 +1,590 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright(c) 2026 NVIDIA Corporation. All rights reserved.
+ *
+ * vfio-pci Type-2 device passthrough — CXL register virtualization.
+ *
+ * Owns the CXL spec-defined virtualization semantics for the
+ * - CXL Device DVSEC capability body (CXL r4.0 §8.1.3)
+ * - HDM Decoder Capability block (CXL r4.0 §8.2.4.20)
+ * - CXL.cache/mem (CM) cap-array (CXL r4.0 §8.2.4)
+ *
+ * vfio-pci is the only caller. This file is NOT a generic emulation
+ * framework: every register the guest may touch has a hand-written
+ * write handler against the spec. Reads serve from a shadow
+ * snapshotted at create time; writes update the shadow per the spec
+ * attribute mode for that field.
+ *
+ * Scope: firmware-committed, single-decoder, no-interleave Type-2
+ * passthrough. Multi-decoder, interleave, and hotplug are
+ * out-of-scope and rejected at create time.
+ */
+
+#include <linux/bitfield.h>
+#include <linux/bitops.h>
+#include <linux/cleanup.h>
+#include <linux/device.h>
+#include <linux/export.h>
+#include <linux/io.h>
+#include <linux/mutex.h>
+#include <linux/pci.h>
+#include <linux/pci_ids.h>
+#include <linux/pci_regs.h>
+#include <linux/slab.h>
+#include <linux/types.h>
+#include <linux/unaligned.h>
+
+#include <uapi/cxl/cxl_regs.h>
+
+#include <cxlpci.h>
+#include <cxlmem.h>
+#include <cxl/cxl.h>
+#include <cxl/passthrough.h>
+
+#include "core.h"
+
+/* DVSEC CXL Device body offsets — relative to DVSEC capability start.
+ * Body begins at PCI_DVSEC_CXL_CAP (0x0a); preceding bytes are the PCI
+ * ext-cap header and DVSEC headers handled by the generic vfio
+ * perm-bits path.
+ */
+#define DVSEC_OFF_CAPABILITY PCI_DVSEC_CXL_CAP /* 0x0a, u16 */
+#define DVSEC_OFF_CONTROL PCI_DVSEC_CXL_CTRL /* 0x0c, u16 */
+#define DVSEC_OFF_STATUS 0x0e /* u16 */
+#define DVSEC_OFF_CONTROL2 0x10 /* u16 */
+#define DVSEC_OFF_STATUS2 0x12 /* u16 */
+#define DVSEC_OFF_LOCK 0x14 /* u16 */
+#define DVSEC_OFF_RANGE1_SIZE_HI 0x18 /* u32 */
+#define DVSEC_OFF_RANGE1_SIZE_LO 0x1c
+#define DVSEC_OFF_RANGE1_BASE_HI 0x20
+#define DVSEC_OFF_RANGE1_BASE_LO 0x24
+#define DVSEC_OFF_RANGE2_SIZE_HI 0x28
+#define DVSEC_OFF_RANGE2_SIZE_LO 0x2c
+#define DVSEC_OFF_RANGE2_BASE_HI 0x30
+#define DVSEC_OFF_RANGE2_BASE_LO 0x34
+#define DVSEC_BODY_END 0x38
+
+#define DVSEC_LOCK_CONFIG_LOCK BIT(0)
+
+/* HDM Decoder Capability block offsets — relative to HDM block base.
+ * Decoder N register set starts at 0x10 + N * 0x20.
+ */
+#define HDM_OFF_CAP_HEADER 0x00
+#define HDM_OFF_GLOBAL_CTRL 0x04
+#define HDM_DEC_BASE 0x10
+#define HDM_DEC_STRIDE 0x20
+#define HDM_DEC_OFF_BASE_LO(n) (HDM_DEC_BASE + (n) * HDM_DEC_STRIDE + 0x00)
+#define HDM_DEC_OFF_BASE_HI(n) (HDM_DEC_BASE + (n) * HDM_DEC_STRIDE + 0x04)
+#define HDM_DEC_OFF_SIZE_LO(n) (HDM_DEC_BASE + (n) * HDM_DEC_STRIDE + 0x08)
+#define HDM_DEC_OFF_SIZE_HI(n) (HDM_DEC_BASE + (n) * HDM_DEC_STRIDE + 0x0c)
+#define HDM_DEC_OFF_CTRL(n) (HDM_DEC_BASE + (n) * HDM_DEC_STRIDE + 0x10)
+
+/* HDM Decoder CTRL bits per CXL r4.0 §8.2.4.20.5. */
+#define HDM_CTRL_LOCK_ON_COMMIT BIT(8)
+#define HDM_CTRL_COMMIT BIT(9)
+#define HDM_CTRL_COMMITTED BIT(10)
+#define HDM_CTRL_ERR_NOT_COMMITTED BIT(11)
+
+struct cxl_passthrough {
+ struct cxl_dev_state *cxlds;
+
+ /* DVSEC body shadow. Byte-indexed by (off - PCI_DVSEC_CXL_CAP).
+ * Allocated rounded up to a dword so dword reads at the tail
+ * never overrun.
+ */
+ u8 *dvsec_shadow;
+ u16 dvsec_size; /* full DVSEC cap length, incl. headers */
+ bool dvsec_config_locked;
+
+ /* HDM block shadow. Byte-indexed; size = hdm_reg_size. */
+ u8 *hdm_shadow;
+ resource_size_t hdm_reg_size;
+
+ /* CM cap-array snapshot. Dword-indexed by (off / 4) where off
+ * is the byte offset from CXL_CM_OFFSET. Read-only after create.
+ */
+ __le32 *cm_snapshot;
+ size_t cm_snapshot_dwords;
+
+ /* Covers dvsec_shadow + dvsec_config_locked + hdm_shadow.
+ * cm_snapshot is immutable after create; no lock needed. Leaf-
+ * level: no entry point holding this mutex calls into cxl-bus or
+ * vfio.
+ */
+ struct mutex lock;
+};
+
+/* ------------------------------------------------------------------ */
+/* Snapshot helpers */
+/* ------------------------------------------------------------------ */
+
+/* Read the DVSEC body bytes [PCI_DVSEC_CXL_CAP, dvsec_size) from PCI
+ * config space into the shadow.
+ *
+ * The body starts at PCI_DVSEC_CXL_CAP (0x0a), which is word-aligned but
+ * NOT dword-aligned, and CXL r4.0 §8.1.3 places six 16-bit descriptors
+ * (CAPABILITY through LOCK) at offsets 0x0a..0x14 before any 32-bit
+ * field. Strict-alignment PCIe host bridges (e.g. ARM64 ECAM) reject
+ * misaligned dword config accesses with PCIBIOS_BAD_REGISTER_NUMBER;
+ * snapshot at the natural granularity of the body's 16-bit descriptors
+ * (2-byte stride) so every offset in the range is naturally aligned.
+ */
+static int snapshot_dvsec_body(struct cxl_passthrough *p)
+{
+ struct pci_dev *pdev = to_pci_dev(p->cxlds->dev);
+ u16 dvsec = p->cxlds->cxl_dvsec;
+ u16 off;
+ u16 word;
+ int rc;
+
+ for (off = PCI_DVSEC_CXL_CAP; off < p->dvsec_size; off += 2) {
+ rc = pci_read_config_word(pdev, dvsec + off, &word);
+ if (rc)
+ return -EIO;
+ put_unaligned_le16(word, p->dvsec_shadow +
+ (off - PCI_DVSEC_CXL_CAP));
+ }
+ return 0;
+}
+
+/* Read the CM cap-array prefix [CXL_CM_OFFSET, hdm_reg_offset) from
+ * MMIO into cm_snapshot, and the HDM block [hdm_reg_offset,
+ * hdm_reg_offset + hdm_reg_size) into hdm_shadow.
+ *
+ * @base is a short-lived kva for the component register block,
+ * established by the caller via ioremap() against cxlds->reg_map.resource.
+ * cxl_setup_regs() drops its own ioremap (clears reg_map.base) after the
+ * cap-array probe completes, so this function cannot rely on
+ * cxlds->reg_map.base being valid; the caller passes a fresh mapping
+ * here and releases it once snapshot data has been copied into the
+ * in-memory shadows.
+ */
+static void snapshot_cm_and_hdm(struct cxl_passthrough *p,
+ void __iomem *base,
+ resource_size_t hdm_off)
+{
+ size_t i;
+
+ for (i = 0; i < p->cm_snapshot_dwords; i++)
+ p->cm_snapshot[i] = cpu_to_le32(readl(base + CXL_CM_OFFSET +
+ i * 4));
+
+ for (i = 0; i < p->hdm_reg_size / 4; i++)
+ put_unaligned_le32(readl(base + hdm_off + i * 4),
+ p->hdm_shadow + i * 4);
+}
+
+/* ------------------------------------------------------------------ */
+/* devres */
+/* ------------------------------------------------------------------ */
+
+static void cxl_passthrough_release(struct device *dev, void *res)
+{
+ struct cxl_passthrough *p = *(struct cxl_passthrough **)res;
+
+ kfree(p->dvsec_shadow);
+ kfree(p->hdm_shadow);
+ kfree(p->cm_snapshot);
+ mutex_destroy(&p->lock);
+ kfree(p);
+}
+
+struct cxl_passthrough *
+devm_cxl_passthrough_create(struct device *dev, struct cxl_dev_state *cxlds)
+{
+ struct cxl_passthrough **dres;
+ struct cxl_passthrough *p;
+ struct pci_dev *pdev;
+ resource_size_t hdm_off, hdm_size;
+ size_t dvsec_shadow_size;
+ u8 hdm_count;
+ u32 hdr;
+ int rc;
+
+ /*
+ * cxl_setup_regs() releases its short-lived ioremap before returning,
+ * so reg_map.base is NULL by the time we run. Validate the persistent
+ * fields (resource address and size) instead; the local ioremap
+ * established further below covers the snapshot reads.
+ */
+ if (!dev || !cxlds || !cxlds->dev || !cxlds->cxl_dvsec ||
+ !cxlds->reg_map.resource || !cxlds->reg_map.max_size)
+ return ERR_PTR(-EINVAL);
+
+ pdev = to_pci_dev(cxlds->dev);
+
+ rc = cxl_get_hdm_info(cxlds, &hdm_count, &hdm_off, &hdm_size);
+ if (rc)
+ return ERR_PTR(rc);
+ if (hdm_count != 1 || !hdm_size || hdm_off <= CXL_CM_OFFSET ||
+ !IS_ALIGNED(hdm_size, 4))
+ return ERR_PTR(-EOPNOTSUPP);
+
+ p = kzalloc_obj(*p, GFP_KERNEL);
+ if (!p)
+ return ERR_PTR(-ENOMEM);
+
+ mutex_init(&p->lock);
+ p->cxlds = cxlds;
+ p->hdm_reg_size = hdm_size;
+
+ /* DVSEC body length from PCI ext-cap header. */
+ rc = pci_read_config_dword(pdev, cxlds->cxl_dvsec + PCI_DVSEC_HEADER1,
+ &hdr);
+ if (rc) {
+ rc = -EIO;
+ goto err;
+ }
+ p->dvsec_size = PCI_DVSEC_HEADER1_LEN(hdr);
+ if (p->dvsec_size < DVSEC_BODY_END) {
+ rc = -EINVAL;
+ goto err;
+ }
+
+ dvsec_shadow_size = round_up(p->dvsec_size - PCI_DVSEC_CXL_CAP, 4);
+ p->dvsec_shadow = kzalloc(dvsec_shadow_size, GFP_KERNEL);
+ if (!p->dvsec_shadow) {
+ rc = -ENOMEM;
+ goto err;
+ }
+
+ p->cm_snapshot_dwords = (hdm_off - CXL_CM_OFFSET) / 4;
+ p->cm_snapshot = kcalloc(p->cm_snapshot_dwords, sizeof(__le32),
+ GFP_KERNEL);
+ if (!p->cm_snapshot) {
+ rc = -ENOMEM;
+ goto err;
+ }
+
+ p->hdm_shadow = kzalloc(hdm_size, GFP_KERNEL);
+ if (!p->hdm_shadow) {
+ rc = -ENOMEM;
+ goto err;
+ }
+
+ rc = snapshot_dvsec_body(p);
+ if (rc)
+ goto err;
+
+ {
+ void __iomem *base;
+
+ /*
+ * Bind-time-only ioremap. cxl_setup_regs() has already
+ * released the cxl-core ioremap (see comment on the entry
+ * gate). Take a fresh, short-lived mapping for the
+ * snapshot, then release it; all subsequent reads serve
+ * from the in-memory shadows.
+ */
+ base = ioremap(cxlds->reg_map.resource,
+ cxlds->reg_map.max_size);
+ if (!base) {
+ rc = -ENOMEM;
+ goto err;
+ }
+ snapshot_cm_and_hdm(p, base, hdm_off);
+ iounmap(base);
+ }
+
+ dres = devres_alloc(cxl_passthrough_release, sizeof(*dres),
+ GFP_KERNEL);
+ if (!dres) {
+ rc = -ENOMEM;
+ goto err;
+ }
+ *dres = p;
+ devres_add(dev, dres);
+ return p;
+
+err:
+ kfree(p->dvsec_shadow);
+ kfree(p->cm_snapshot);
+ kfree(p->hdm_shadow);
+ mutex_destroy(&p->lock);
+ kfree(p);
+ return ERR_PTR(rc);
+}
+EXPORT_SYMBOL_NS_GPL(devm_cxl_passthrough_create, "CXL");
+
+/* ------------------------------------------------------------------ */
+/* DVSEC write semantics */
+/* ------------------------------------------------------------------ */
+
+static u16 dvsec_shadow_get_u16(struct cxl_passthrough *p, u16 off)
+{
+ return get_unaligned_le16(p->dvsec_shadow + (off - PCI_DVSEC_CXL_CAP));
+}
+
+static void dvsec_shadow_set_u16(struct cxl_passthrough *p, u16 off, u16 val)
+{
+ put_unaligned_le16(val, p->dvsec_shadow + (off - PCI_DVSEC_CXL_CAP));
+}
+
+/* Apply a write to a single DVSEC field at @off, with the field's
+ * native width (2 for descriptors, 4 for RANGE entries). @width is
+ * the field's spec width; @new is the merged value to apply. Caller
+ * holds p->lock.
+ */
+static void dvsec_apply_write(struct cxl_passthrough *p, u16 off, size_t width,
+ u32 new)
+{
+ u16 cur16;
+
+ switch (off) {
+ case DVSEC_OFF_CAPABILITY:
+ /* HwInit — drop. */
+ return;
+ case DVSEC_OFF_CONTROL:
+ case DVSEC_OFF_CONTROL2:
+ /* RWL — gated on CONFIG_LOCK. */
+ if (p->dvsec_config_locked)
+ return;
+ dvsec_shadow_set_u16(p, off, (u16)new);
+ return;
+ case DVSEC_OFF_STATUS:
+ case DVSEC_OFF_STATUS2:
+ /* RW1C — clear bits where the guest wrote 1. */
+ cur16 = dvsec_shadow_get_u16(p, off);
+ dvsec_shadow_set_u16(p, off, cur16 & ~(u16)new);
+ return;
+ case DVSEC_OFF_LOCK:
+ /* RWO — first 1-write latches CONFIG_LOCK; subsequent
+ * writes are ignored.
+ */
+ cur16 = dvsec_shadow_get_u16(p, off);
+ if (cur16 & DVSEC_LOCK_CONFIG_LOCK)
+ return;
+ if (new & DVSEC_LOCK_CONFIG_LOCK) {
+ dvsec_shadow_set_u16(p, off,
+ cur16 | DVSEC_LOCK_CONFIG_LOCK);
+ p->dvsec_config_locked = true;
+ }
+ return;
+ case DVSEC_OFF_RANGE1_SIZE_HI:
+ case DVSEC_OFF_RANGE1_SIZE_LO:
+ case DVSEC_OFF_RANGE1_BASE_HI:
+ case DVSEC_OFF_RANGE1_BASE_LO:
+ /* HwInit — drop. */
+ return;
+ case DVSEC_OFF_RANGE2_SIZE_HI:
+ case DVSEC_OFF_RANGE2_SIZE_LO:
+ case DVSEC_OFF_RANGE2_BASE_HI:
+ case DVSEC_OFF_RANGE2_BASE_LO:
+ /* RsvdZ — drop. */
+ return;
+ default:
+ /* Reserved offsets inside the modelled body: drop. */
+ (void)width;
+ return;
+ }
+}
+
+/* Map a byte offset @off inside the DVSEC body to the natural-width
+ * field that contains it: returns the field's base offset (16-bit
+ * aligned for descriptors, 32-bit aligned for RANGE entries) and width.
+ * Returns false if @off lies outside any modelled field.
+ */
+static bool dvsec_field_at(u16 off, u16 *field_off, size_t *width)
+{
+ if (off >= DVSEC_OFF_CAPABILITY && off < DVSEC_OFF_RANGE1_SIZE_HI) {
+ *field_off = ALIGN_DOWN(off, 2);
+ *width = 2;
+ return true;
+ }
+ if (off >= DVSEC_OFF_RANGE1_SIZE_HI && off < DVSEC_BODY_END) {
+ *field_off = ALIGN_DOWN(off, 4);
+ *width = 4;
+ return true;
+ }
+ return false;
+}
+
+int cxl_passthrough_dvsec_rw(struct cxl_passthrough *p, u32 off, u32 *val,
+ size_t sz, bool write)
+{
+ u8 *shadow;
+ u16 field_off;
+ size_t field_width;
+ u32 cur, merged;
+ u32 sub_shift;
+ u32 width_mask;
+
+ if (!p || !val)
+ return -EINVAL;
+ if (sz != 1 && sz != 2 && sz != 4)
+ return -EINVAL;
+ if (off < PCI_DVSEC_CXL_CAP || (u64)off + sz > p->dvsec_size)
+ return -EINVAL;
+
+ guard(mutex)(&p->lock);
+
+ shadow = p->dvsec_shadow + (off - PCI_DVSEC_CXL_CAP);
+
+ if (!write) {
+ switch (sz) {
+ case 1:
+ *val = *shadow;
+ break;
+ case 2:
+ *val = get_unaligned_le16(shadow);
+ break;
+ case 4:
+ *val = get_unaligned_le32(shadow);
+ break;
+ }
+ return 0;
+ }
+
+ if (!dvsec_field_at(off, &field_off, &field_width))
+ return 0; /* outside any modelled field: drop */
+
+ /* Merge at natural width; RW1C STATUS fields merge against zero. */
+ if (field_off == DVSEC_OFF_STATUS || field_off == DVSEC_OFF_STATUS2)
+ cur = 0;
+ else
+ cur = field_width == 2 ? dvsec_shadow_get_u16(p, field_off) :
+ get_unaligned_le32(p->dvsec_shadow + (field_off - PCI_DVSEC_CXL_CAP));
+
+ width_mask = (sz == 4) ? 0xffffffff : (sz == 2 ? 0xffff : 0xff);
+ sub_shift = (off - field_off) * 8;
+ merged = cur & ~(width_mask << sub_shift);
+ merged |= (*val & width_mask) << sub_shift;
+
+ dvsec_apply_write(p, field_off, field_width, merged);
+ return 0;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_passthrough_dvsec_rw, "CXL");
+
+/* ------------------------------------------------------------------ */
+/* HDM write semantics */
+/* ------------------------------------------------------------------ */
+
+static u32 hdm_shadow_get(struct cxl_passthrough *p, u32 off)
+{
+ return get_unaligned_le32(p->hdm_shadow + off);
+}
+
+static void hdm_shadow_set(struct cxl_passthrough *p, u32 off, u32 val)
+{
+ put_unaligned_le32(val, p->hdm_shadow + off);
+}
+
+/* Decoder index for a per-decoder register offset. */
+static u32 hdm_decoder_of(u32 off)
+{
+ return (off - HDM_DEC_BASE) / HDM_DEC_STRIDE;
+}
+
+static u32 hdm_decoder_field(u32 off)
+{
+ return (off - HDM_DEC_BASE) % HDM_DEC_STRIDE;
+}
+
+static void hdm_decoder_ctrl_write(struct cxl_passthrough *p, u32 off, u32 val)
+{
+ u32 cur = hdm_shadow_get(p, off);
+ u32 next;
+
+ /* Once COMMITTED, only the COMMIT toggle is honoured. Releasing
+ * COMMIT clears COMMITTED and Lock-on-Commit per CXL r4.0
+ * §8.2.4.20.5.
+ */
+ if (cur & HDM_CTRL_COMMITTED) {
+ next = (cur & ~HDM_CTRL_COMMIT) | (val & HDM_CTRL_COMMIT);
+ if (!(val & HDM_CTRL_COMMIT)) {
+ next &= ~HDM_CTRL_COMMITTED;
+ next &= ~HDM_CTRL_LOCK_ON_COMMIT;
+ }
+ hdm_shadow_set(p, off, next);
+ return;
+ }
+
+ next = val & ~(HDM_CTRL_COMMITTED | HDM_CTRL_ERR_NOT_COMMITTED);
+ if (val & HDM_CTRL_COMMIT)
+ next |= HDM_CTRL_COMMITTED;
+ hdm_shadow_set(p, off, next);
+}
+
+static void hdm_decoder_basesize_write(struct cxl_passthrough *p, u32 off,
+ u32 val)
+{
+ u32 n = hdm_decoder_of(off);
+ u32 ctrl = hdm_shadow_get(p, HDM_DEC_OFF_CTRL(n));
+
+ /* RWL — BASE/SIZE locked when the decoder is committed or
+ * lock-on-commit has been latched.
+ */
+ if (ctrl & (HDM_CTRL_COMMITTED | HDM_CTRL_LOCK_ON_COMMIT))
+ return;
+ hdm_shadow_set(p, off, val);
+}
+
+int cxl_passthrough_hdm_rw(struct cxl_passthrough *p, u32 off, u32 *val,
+ bool write)
+{
+ u32 field;
+
+ if (!p || !val)
+ return -EINVAL;
+ if (!IS_ALIGNED(off, 4) || (u64)off + 4 > p->hdm_reg_size)
+ return -EINVAL;
+
+ guard(mutex)(&p->lock);
+
+ if (!write) {
+ *val = hdm_shadow_get(p, off);
+ return 0;
+ }
+
+ switch (off) {
+ case HDM_OFF_CAP_HEADER:
+ /* HwInit — drop. */
+ return 0;
+ case HDM_OFF_GLOBAL_CTRL:
+ /* RW — shadow. */
+ hdm_shadow_set(p, off, *val);
+ return 0;
+ }
+
+ if (off < HDM_DEC_BASE)
+ return 0; /* gap before per-decoder regs: drop */
+
+ field = hdm_decoder_field(off);
+ switch (field) {
+ case 0x00: case 0x04: /* BASE_LO / BASE_HI */
+ case 0x08: case 0x0c: /* SIZE_LO / SIZE_HI */
+ hdm_decoder_basesize_write(p, off, *val);
+ return 0;
+ case 0x10: /* CTRL */
+ hdm_decoder_ctrl_write(p, off, *val);
+ return 0;
+ default:
+ /* TARGET_LIST_{LO,HI} and other per-decoder bytes are
+ * accepted as plain RW shadow for the firmware-committed
+ * scope; multi-decoder / interleave behaviour is
+ * out-of-scope.
+ */
+ hdm_shadow_set(p, off, *val);
+ return 0;
+ }
+}
+EXPORT_SYMBOL_NS_GPL(cxl_passthrough_hdm_rw, "CXL");
+
+/* ------------------------------------------------------------------ */
+/* CM cap-array snapshot */
+/* ------------------------------------------------------------------ */
+
+int cxl_passthrough_cm_rw(struct cxl_passthrough *p, u32 off, u32 *val,
+ bool write)
+{
+ if (!p || !val)
+ return -EINVAL;
+ if (!IS_ALIGNED(off, 4) || off / 4 >= p->cm_snapshot_dwords)
+ return -EINVAL;
+
+ if (write)
+ return 0; /* cap-array headers are RO; drop. */
+
+ *val = le32_to_cpu(p->cm_snapshot[off / 4]);
+ return 0;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_passthrough_cm_rw, "CXL");
diff --git a/include/cxl/passthrough.h b/include/cxl/passthrough.h
new file mode 100644
index 000000000000..43214b0d34f6
--- /dev/null
+++ b/include/cxl/passthrough.h
@@ -0,0 +1,121 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/* Copyright(c) 2026 NVIDIA Corporation. All rights reserved.
+ *
+ * CXL register virtualization helpers for vfio-pci Type-2 passthrough.
+ *
+ * See Documentation/driver-api/vfio-pci-cxl.rst for the ownership
+ * contract. In short: cxl-core owns the per-device DVSEC body, HDM
+ * Decoder block, and CM cap-array shadows; vfio-pci is a transport
+ * that forwards guest reads and writes through the helpers below.
+ *
+ * The helpers are not a generic emulation framework. Each register
+ * is hand-coded against CXL r4.0 §8.1.3 and §8.2.4.20. Adding a new
+ * field is "add a case", not "add a mode".
+ */
+#ifndef __CXL_PASSTHROUGH_H__
+#define __CXL_PASSTHROUGH_H__
+
+#include <linux/types.h>
+
+struct cxl_dev_state;
+struct cxl_passthrough;
+struct device;
+
+/**
+ * devm_cxl_passthrough_create - snapshot a Type-2 device's DVSEC + HDM +
+ * CM cap-array shadows and return the opaque handle the rw helpers
+ * operate on.
+ *
+ * @dev: device whose devres lifetime bounds the returned handle.
+ * @cxlds: CXL device state with cxlds->cxl_dvsec populated and
+ * cxlds->reg_map.resource and cxlds->reg_map.max_size describing
+ * the component register block. cxlds->reg_map.base is NOT
+ * required; cxl_pci_setup_regs() releases its short-lived
+ * ioremap before returning, so this helper takes a local
+ * bind-time ioremap against cxlds->reg_map.resource for the
+ * duration of the snapshot.
+ *
+ * On success the returned handle is bound to @dev's devres so unwind
+ * happens automatically when @dev is unbound. The handle must not be
+ * freed by the caller.
+ *
+ * Return: a valid &struct cxl_passthrough on success, ERR_PTR(-errno)
+ * on failure.
+ */
+struct cxl_passthrough *
+devm_cxl_passthrough_create(struct device *dev, struct cxl_dev_state *cxlds);
+
+/**
+ * cxl_passthrough_dvsec_rw - read or write the CXL Device DVSEC body shadow.
+ *
+ * @p: handle from devm_cxl_passthrough_create().
+ * @off: byte offset from the start of the DVSEC capability. Must be
+ * >= PCI_DVSEC_CXL_CAP and (off + sz) must lie inside the DVSEC.
+ * Accesses to the PCI ext-cap header bytes (off < PCI_DVSEC_CXL_CAP)
+ * are the caller's responsibility; they belong on the generic
+ * perm-bits path, not here.
+ * @val: pointer to a u32 holding the read result or the write value.
+ * The low @sz bytes of *val are the payload; upper bytes ignored
+ * for writes and zero for reads.
+ * @sz: 1, 2, or 4. Other values return -EINVAL.
+ * @write: false for read, true for write.
+ *
+ * Reads serve from the shadow. Writes update the shadow per the spec
+ * attribute mode for the addressed field (LOCK is RWO, CONTROL/CONTROL2
+ * are RWL gated on CONFIG_LOCK, STATUS/STATUS2 are RW1C, RANGE1/2 are
+ * HwInit, Reserved/RsvdZ silently consumed).
+ *
+ * Known limitation: a 4-byte write whose @off straddles a 16-bit DVSEC
+ * field boundary (CONTROL/STATUS at 0x0c/0x0e, CONTROL2/STATUS2 at
+ * 0x10/0x12) applies only the field containing the first byte of the
+ * access; the adjacent 16-bit field is not updated by the same write.
+ * Standard CXL register-access patterns issue separate 2-byte accesses
+ * to CONTROL, STATUS, CONTROL2 and STATUS2, so this corner case is
+ * documented rather than handled.
+ *
+ * Return: 0 on success; -EINVAL on out-of-range or bad size.
+ */
+int cxl_passthrough_dvsec_rw(struct cxl_passthrough *p, u32 off, u32 *val,
+ size_t sz, bool write);
+
+/**
+ * cxl_passthrough_hdm_rw - read or write the HDM Decoder block shadow.
+ *
+ * @p: handle from devm_cxl_passthrough_create().
+ * @off: byte offset from the HDM block base; must be 4-byte aligned and
+ * (off + 4) <= hdm_reg_size. Sub-dword access is not supported on
+ * HDM registers per CXL r4.0 §8.2.4.
+ * @val: pointer to a u32 holding the read result or the write value.
+ * @write: false for read, true for write.
+ *
+ * Reads serve from the shadow. Writes implement the per-decoder
+ * COMMIT/COMMITTED handshake (CTRL) and the RWL gating on BASE/SIZE
+ * imposed by COMMITTED|LOCK_ON_COMMIT. GLOBAL_CTRL is RW; the cap
+ * header is HwInit (writes dropped); other offsets in the per-decoder
+ * stride are RW shadow.
+ *
+ * Return: 0 on success; -EINVAL on misalignment or out-of-range.
+ */
+int cxl_passthrough_hdm_rw(struct cxl_passthrough *p, u32 off, u32 *val,
+ bool write);
+
+/**
+ * cxl_passthrough_cm_rw - read or write the CXL.cache/mem cap-array snapshot.
+ *
+ * @p: handle from devm_cxl_passthrough_create().
+ * @off: byte offset from CXL_CM_OFFSET (the start of the CM cap-array
+ * header in the component register block); must be 4-byte aligned
+ * and (off + 4) <= cm_snapshot_dwords * 4.
+ * @val: pointer to a u32 holding the read result; ignored on write.
+ * @write: false for read. Writes to the cap-array are silently dropped
+ * (the array headers are RO per CXL r4.0 §8.2.4); the @write
+ * parameter is present only to keep the API symmetric with the
+ * other rw helpers and to make the drop policy explicit at the
+ * call site.
+ *
+ * Return: 0 on success; -EINVAL on misalignment or out-of-range.
+ */
+int cxl_passthrough_cm_rw(struct cxl_passthrough *p, u32 off, u32 *val,
+ bool write);
+
+#endif /* __CXL_PASSTHROUGH_H__ */
--
2.25.1
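As a reading aid for the attribute modes that the cxl_passthrough_dvsec_rw() kernel-doc above lists (CONTROL is RWL gated on CONFIG_LOCK, STATUS is RW1C, LOCK is RWO), the following is a minimal user-space sketch of those three rules. It is not the kernel code: struct dvsec_model, MODEL_CONFIG_LOCK and the model_write_*() helpers are hypothetical names invented for this illustration, and the model assumes whole 16-bit writes only.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the DVSEC shadow; not the kernel structure. */
struct dvsec_model {
	uint16_t control;	/* RWL: writable until CONFIG_LOCK latches */
	uint16_t status;	/* RW1C: writing 1 to a bit clears it */
	uint16_t lock;		/* RWO: bit 0 latches on the first 1-write */
};

#define MODEL_CONFIG_LOCK 0x0001u

/* RWL: the write lands only while the lock bit is still clear. */
static void model_write_control(struct dvsec_model *m, uint16_t val)
{
	assert(m);
	if (m->lock & MODEL_CONFIG_LOCK)
		return;
	m->control = val;
}

/* RW1C: bits written as 1 are cleared, bits written as 0 are kept. */
static void model_write_status(struct dvsec_model *m, uint16_t val)
{
	assert(m);
	m->status &= (uint16_t)~val;
}

/* RWO: a 1-write sets the lock bit; nothing in this model clears it. */
static void model_write_lock(struct dvsec_model *m, uint16_t val)
{
	assert(m);
	if (val & MODEL_CONFIG_LOCK)
		m->lock |= MODEL_CONFIG_LOCK;
}
```

In this model a guest that writes CONTROL, then latches LOCK, then writes CONTROL again observes the first CONTROL value on read-back, which is the behaviour the shadow is meant to preserve.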
* [PATCH v3 07/11] vfio/pci: Add CONFIG_VFIO_PCI_CXL with bind-time CXL Type-2 acquisition
2026-06-25 16:53 [PATCH v3 00/11] vfio/pci: Add CXL Type-2 device passthrough support mhonap
` (5 preceding siblings ...)
2026-06-25 16:54 ` [PATCH v3 06/11] cxl: Add register-virtualization helpers for vfio Type-2 passthrough mhonap
@ 2026-06-25 16:54 ` mhonap
2026-06-25 16:54 ` [PATCH v3 08/11] vfio/pci/cxl: Add HDM + COMP_REGS regions and DVSEC clipping shim mhonap
` (4 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: mhonap @ 2026-06-25 16:54 UTC (permalink / raw)
To: djbw, alex, jgg, jic23, dave.jiang, ankita,
alejandro.lucero-palau, alison.schofield, dave, dmatlack, gourry,
ira.weiny
Cc: cjia, kjaju, vsethi, zhiw, mhonap, kvm, linux-cxl, linux-doc,
linux-kernel, linux-kselftest
From: Manish Honap <mhonap@nvidia.com>

Wire vfio-pci-core to acquire CXL Type-2 device state at PCI bind
and release it at PCI unbind, mirroring the existing vfio_pci_zdev_*
integration model. Four lifecycle hooks are introduced
(vfio_pci_cxl_acquire / _release / _open / _close), with
!CONFIG_VFIO_PCI_CXL stubs that return -ENODEV / 0 / 0 / no-op
respectively so vfio-pci behaviour is unchanged when
CONFIG_VFIO_PCI_CXL=n.

vfio_pci_cxl_acquire() implements the bind sequence:

  - pcie_is_cxl() and CXL Device DVSEC discovery (-ENODEV if absent
    or if MEM_CAPABLE is clear, in which case the caller falls back
    to plain vfio-pci)
  - devm_cxl_dev_state_create() with struct vfio_pci_cxl_state
    embedding cxl_dev_state at offset 0 (required by the 7-arg
    macro's static_assert in include/cxl/cxl.h)
  - pci_enable_device_mem(), cxl_pci_setup_regs(), cxl_get_hdm_info()
    (rejecting hdm_count != 1), cxl_regblock_get_bar_info(),
    cxl_await_range_active()
  - devm_cxl_passthrough_create() to snapshot the DVSEC body, HDM
    block, and CM cap-array shadows owned by cxl-core
  - pci_disable_device(), which clears PCI_COMMAND_MASTER but NOT
    PCI_COMMAND_MEMORY, so cxl-core MMIO accesses from the next step
    still succeed
  - devm_cxl_probe_mem() to register the cxl_memdev, enumerate the
    endpoint port, and attach the firmware-committed autoregion
  - request_mem_region() + memremap(..., MEMREMAP_WB) of the
    autoregion's HPA so the HDM VFIO region can serve guest accesses
    through it

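The ordering above is strictly sequential: each step runs only if every earlier step succeeded. A minimal user-space sketch of that fail-fast shape follows. It assumes nothing about the real helpers; struct bind_step, run_bind_sequence() and the step_ok() / step_busy() stubs are hypothetical names used only for illustration, and resource unwinding (handled by devres in the patch) is deliberately left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* One step of a hypothetical bind sequence; returns 0 or a negative errno. */
typedef int (*bind_step_fn)(void *ctx);

struct bind_step {
	const char *name;	/* label for diagnostics only */
	bind_step_fn run;
};

/*
 * Run the steps in order and stop at the first failure. *failed receives
 * the index of the failing step, or nsteps when every step succeeded.
 */
static int run_bind_sequence(const struct bind_step *steps, size_t nsteps,
			     void *ctx, size_t *failed)
{
	size_t i;

	assert(steps && failed);
	for (i = 0; i < nsteps; i++) {
		int rc = steps[i].run(ctx);

		if (rc) {
			*failed = i;
			return rc;
		}
	}
	*failed = nsteps;
	return 0;
}

/* Stub steps used only to exercise the runner. */
static int step_ok(void *ctx)
{
	int *count = ctx;

	(*count)++;	/* record that this step actually ran */
	return 0;
}

static int step_busy(void *ctx)
{
	(void)ctx;
	return -EBUSY;	/* e.g. the HPA range is already claimed */
}
```

Keeping the failing index alongside the errno is a design choice of the sketch, not of the patch; it makes it easy to see that later steps never run once one step has failed.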
The sequence is fail-closed for confirmed-CXL devices: -ENODEV maps
to plain vfio-pci fall-through; any other negative errno aborts the
vfio-pci bind so the guest never sees a half-initialised CXL device.

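The fail-closed rule reduces to a three-way decision on the value returned by the acquire hook. The sketch below shows that decision in isolation, under the assumption that 0 means CXL state was acquired; enum bind_outcome and classify_acquire() are hypothetical names for this illustration and do not appear in the patch.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical outcomes of a bind attempt, for illustration only. */
enum bind_outcome {
	BIND_CXL,		/* CXL state acquired, bind continues */
	BIND_PLAIN_VFIO,	/* not a CXL.mem device, plain vfio-pci */
	BIND_ABORT,		/* confirmed-CXL probe failure, bind fails */
};

/* Map the acquire hook's return value to what the caller does next. */
static enum bind_outcome classify_acquire(int rc)
{
	if (rc == 0)
		return BIND_CXL;
	if (rc == -ENODEV)
		return BIND_PLAIN_VFIO;
	return BIND_ABORT;
}
```

Only -ENODEV is treated as "not applicable"; every other error, including -EOPNOTSUPP for a multi-decoder device, lands in the abort branch.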
vfio_pci_cxl_open() / _close() are present as stable call sites for
the region-registration hooks that follow.

Selects CXL_VFIO_PASSTHROUGH so cxl-core's per-device
register-virtualization helpers (drivers/cxl/core/passthrough.c) are
built.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
drivers/vfio/pci/Kconfig | 2 +
drivers/vfio/pci/Makefile | 1 +
drivers/vfio/pci/cxl/Kconfig | 34 +++
drivers/vfio/pci/cxl/Makefile | 2 +
drivers/vfio/pci/cxl/vfio_cxl_core.c | 369 +++++++++++++++++++++++++++
drivers/vfio/pci/cxl/vfio_cxl_priv.h | 71 ++++++
drivers/vfio/pci/vfio_pci_core.c | 24 ++
drivers/vfio/pci/vfio_pci_priv.h | 21 ++
include/linux/vfio_pci_core.h | 7 +
9 files changed, 531 insertions(+)
create mode 100644 drivers/vfio/pci/cxl/Kconfig
create mode 100644 drivers/vfio/pci/cxl/Makefile
create mode 100644 drivers/vfio/pci/cxl/vfio_cxl_core.c
create mode 100644 drivers/vfio/pci/cxl/vfio_cxl_priv.h
diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index 296bf01e185e..4cd6acd36053 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -58,6 +58,8 @@ config VFIO_PCI_ZDEV_KVM
config VFIO_PCI_DMABUF
def_bool y if VFIO_PCI_CORE && PCI_P2PDMA && DMA_SHARED_BUFFER
+source "drivers/vfio/pci/cxl/Kconfig"
+
source "drivers/vfio/pci/mlx5/Kconfig"
source "drivers/vfio/pci/ism/Kconfig"
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index 6138f1bf241d..ac26e7494f0a 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -3,6 +3,7 @@
vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
vfio-pci-core-$(CONFIG_VFIO_PCI_ZDEV_KVM) += vfio_pci_zdev.o
vfio-pci-core-$(CONFIG_VFIO_PCI_DMABUF) += vfio_pci_dmabuf.o
+include $(srctree)/$(src)/cxl/Makefile
obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o
vfio-pci-y := vfio_pci.o
diff --git a/drivers/vfio/pci/cxl/Kconfig b/drivers/vfio/pci/cxl/Kconfig
new file mode 100644
index 000000000000..5d88999e1256
--- /dev/null
+++ b/drivers/vfio/pci/cxl/Kconfig
@@ -0,0 +1,34 @@
+# SPDX-License-Identifier: GPL-2.0-only
+config VFIO_PCI_CXL
+ bool "VFIO support for CXL Type-2 device passthrough"
+ depends on VFIO_PCI_CORE
+ depends on CXL_BUS
+ depends on CXL_REGION
+ depends on CXL_MEM
+ # CXL providers are tristate; refuse a builtin vfio-pci-core
+ # against modular cxl-core (would fail to link the per-device
+ # helpers in drivers/cxl/core/passthrough.c).
+ depends on CXL_BUS=y || VFIO_PCI_CORE=m
+ depends on CXL_REGION=y || VFIO_PCI_CORE=m
+ depends on CXL_MEM=y || VFIO_PCI_CORE=m
+ select CXL_VFIO_PASSTHROUGH
+ help
+ Support CXL Type-2 (HDM-D, HDM-DB) accelerator device passthrough
+ to a KVM guest. When this option is enabled, vfio-pci-core
+ probes the CXL Register Locator DVSEC at PCI bind time, acquires
+ a cxl_memdev and autoregion via devm_cxl_probe_mem(), and
+ exposes two additional VFIO regions to userspace: a mappable
+ HDM memory region for the device's HPA range, and a COMP_REGS
+ shadow region forwarding HDM Decoder Capability accesses
+ through the cxl-core register-virtualization helpers added by
+ drivers/cxl/core/passthrough.c.
+
+ Devices that do not advertise a CXL Device DVSEC fall back to
+ plain vfio-pci behaviour. Confirmed-CXL devices whose host
+ firmware did not commit an HDM decoder, or whose cxl-core probe
+ otherwise fails, do not bind to vfio-pci at all so the guest is
+ never offered a half-initialised CXL device.
+
+ Scope: firmware-committed, single-decoder, no-interleave.
+
+ Say Y to support CXL Type-2 device passthrough.
diff --git a/drivers/vfio/pci/cxl/Makefile b/drivers/vfio/pci/cxl/Makefile
new file mode 100644
index 000000000000..35e952fe1858
--- /dev/null
+++ b/drivers/vfio/pci/cxl/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+vfio-pci-core-$(CONFIG_VFIO_PCI_CXL) += cxl/vfio_cxl_core.o
diff --git a/drivers/vfio/pci/cxl/vfio_cxl_core.c b/drivers/vfio/pci/cxl/vfio_cxl_core.c
new file mode 100644
index 000000000000..42cd00bbe869
--- /dev/null
+++ b/drivers/vfio/pci/cxl/vfio_cxl_core.c
@@ -0,0 +1,369 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright(c) 2026 NVIDIA Corporation. All rights reserved.
+ *
+ * vfio-pci CXL Type-2 device passthrough — core entry points.
+ *
+ * Four lifecycle hooks are inserted into vfio-pci-core: acquire and
+ * release run at PCI bind / unbind, open and close run on VFIO fd
+ * open / close. This mirrors the existing vfio_pci_zdev_* integration
+ * model.
+ *
+ * vfio_pci_cxl_acquire() runs at PCI bind time. It performs the CXL
+ * register-locator probe and HDM decoder discovery under a brief
+ * pci_enable_device_mem() / pci_disable_device() bracket, then asks
+ * cxl-core to register a cxl_memdev and auto-attach the
+ * firmware-committed region via devm_cxl_probe_mem(). pci_disable_device()
+ * clears PCI_COMMAND_MASTER but NOT PCI_COMMAND_MEMORY (see
+ * do_pci_disable_device() in drivers/pci/pci.c), so the cxl-core
+ * MMIO accesses performed by devm_cxl_probe_mem() after the disable
+ * still succeed even with vfio-pci's PCI enable refcount returned to
+ * zero. The refcount is re-taken cleanly by vfio_pci_core_enable()
+ * at first VFIO fd open.
+ *
+ * Acquisition is fail-closed for confirmed-CXL devices. Devices that
+ * do not advertise a CXL Device DVSEC, and CXL devices whose
+ * MEM_CAPABLE bit is clear, return -ENODEV so the caller falls back
+ * to plain vfio-pci behaviour. Any other negative errno from
+ * acquire() is a confirmed-CXL probe failure (locator missing, HDM
+ * not single-decoder, range-active timeout, passthrough shadow
+ * snapshot failure, devm_cxl_probe_mem() refusal, HDM HPA range busy)
+ * and aborts the vfio-pci bind so the guest never sees a CXL device
+ * with half-initialised cxl-core state.
+ */
+
+#include <linux/bitfield.h>
+#include <linux/io.h>
+#include <linux/pci.h>
+#include <linux/range.h>
+#include <linux/vfio_pci_core.h>
+
+#include <uapi/cxl/cxl_regs.h>
+#include <uapi/linux/pci_regs.h>
+#include <uapi/linux/vfio.h>
+
+#include <cxl/cxl.h>
+#include <cxl/passthrough.h>
+#include <cxl/pci.h>
+
+#include "../vfio_pci_priv.h"
+#include "vfio_cxl_priv.h"
+
+MODULE_IMPORT_NS("CXL");
+
+#define VFIO_PCI_CXL_HDM_RES_NAME "vfio-cxl-hdm"
+
+/* ------------------------------------------------------------------ */
+/* Bind-time setup helpers */
+/* ------------------------------------------------------------------ */
+
+static struct vfio_pci_cxl_state *
+vfio_cxl_create_device_state(struct pci_dev *pdev, u16 dvsec)
+{
+ struct vfio_pci_cxl_state *cxl;
+ u32 hdr1;
+ u16 cap;
+ int rc;
+
+ cxl = devm_cxl_dev_state_create(&pdev->dev, CXL_DEVTYPE_DEVMEM,
+ pci_get_dsn(pdev), dvsec,
+ struct vfio_pci_cxl_state,
+ cxlds, false);
+ if (!cxl)
+ return ERR_PTR(-ENOMEM);
+
+ cxl->pdev = pdev;
+
+ rc = pci_read_config_dword(pdev, dvsec + PCI_DVSEC_HEADER1, &hdr1);
+ if (rc) {
+ devm_kfree(&pdev->dev, cxl);
+ return ERR_PTR(-EIO);
+ }
+ cxl->info.dvsec_offset = dvsec;
+ cxl->info.dvsec_size = PCI_DVSEC_HEADER1_LEN(hdr1);
+
+ rc = pci_read_config_word(pdev, dvsec + PCI_DVSEC_CXL_CAP, &cap);
+ if (rc) {
+ devm_kfree(&pdev->dev, cxl);
+ return ERR_PTR(-EIO);
+ }
+ if (!(cap & PCI_DVSEC_CXL_MEM_CAPABLE)) {
+ devm_kfree(&pdev->dev, cxl);
+ return ERR_PTR(-ENODEV);
+ }
+
+ return cxl;
+}
+
+static int vfio_cxl_probe_regs(struct vfio_pci_cxl_state *cxl)
+{
+ struct cxl_dev_state *cxlds = &cxl->cxlds;
+ resource_size_t hdm_off, hdm_size, bar_off;
+ u8 hdm_count, bir;
+ int rc;
+
+ if (WARN_ON_ONCE(!pci_is_enabled(cxl->pdev)))
+ return -EINVAL;
+
+ rc = cxl_pci_setup_regs(cxl->pdev, CXL_REGLOC_RBI_COMPONENT,
+ &cxlds->reg_map);
+ if (rc)
+ return rc;
+
+ rc = cxl_get_hdm_info(cxlds, &hdm_count, &hdm_off, &hdm_size);
+ if (rc)
+ return rc;
+ if (hdm_count != 1) {
+ pci_err(cxl->pdev,
+ "vfio-cxl: hdm_count=%u, only 1 supported\n",
+ hdm_count);
+ return -EOPNOTSUPP;
+ }
+
+ rc = cxl_regblock_get_bar_info(&cxlds->reg_map, &bir, &bar_off);
+ if (rc)
+ return rc;
+
+ cxl->info.hdm_count = hdm_count;
+ cxl->info.hdm_reg_offset = hdm_off;
+ cxl->info.hdm_reg_size = hdm_size;
+ cxl->info.comp_reg_bir = bir;
+ cxl->info.comp_reg_offset = bar_off;
+ cxl->info.comp_reg_size = cxlds->reg_map.max_size;
+ cxl->info.host_firmware_committed = true;
+
+ /*
+ * Range-active polls a config-space bit in the CXL DVSEC, not
+ * MMIO, so it is safe inside or outside the memory-decode
+ * bracket. Keep it here so cxlds->media_ready is set before the
+ * caller drops the PCI enable refcount.
+ */
+ rc = cxl_await_range_active(cxlds);
+ if (rc)
+ return rc;
+ cxlds->media_ready = true;
+ return 0;
+}
+
+static int vfio_cxl_create_memdev(struct vfio_pci_cxl_state *cxl)
+{
+ struct range hpa_range;
+ struct cxl_memdev *cxlmd;
+
+ /*
+ * devm_cxl_probe_mem() runs synchronously: it registers a
+ * cxl_memdev which triggers cxl_mem_probe(), endpoint port
+ * creation, and autoregion attach. Endpoint port probe reads
+ * HDM decoder MMIO via devm_cxl_setup_hdm(); the device must
+ * therefore still be memory-decoded. pci_disable_device() only
+ * clears PCI_COMMAND_MASTER (not _MEMORY), so the paired enable
+ * / disable done by the caller leaves the decode bit asserted
+ * and these reads succeed even with the vfio refcount at zero.
+ */
+ cxlmd = devm_cxl_probe_mem(&cxl->cxlds, &hpa_range);
+ if (IS_ERR(cxlmd))
+ return PTR_ERR(cxlmd);
+
+ cxl->cxlmd = cxlmd;
+ cxl->info.hpa_base = hpa_range.start;
+ cxl->info.hpa_size = range_len(&hpa_range);
+ return 0;
+}
+
+/* ------------------------------------------------------------------ */
+/* HDM HPA mapping */
+/* ------------------------------------------------------------------ */
+
+static int vfio_cxl_map_hdm(struct vfio_pci_cxl_state *cxl)
+{
+ phys_addr_t base = cxl->info.hpa_base;
+ u64 size = cxl->info.hpa_size;
+
+ if (!size)
+ return -EINVAL;
+
+ cxl->hdm_res = request_mem_region(base, size,
+ VFIO_PCI_CXL_HDM_RES_NAME);
+ if (!cxl->hdm_res) {
+ pci_err(cxl->pdev,
+ "vfio-cxl: HDM HPA %pa/0x%llx busy; check firmware mappings\n",
+ &base, size);
+ return -EBUSY;
+ }
+
+ cxl->hdm_kva = memremap(base, size, MEMREMAP_WB);
+ if (!cxl->hdm_kva) {
+ release_mem_region(base, size);
+ cxl->hdm_res = NULL;
+ return -ENOMEM;
+ }
+ return 0;
+}
+
+static void vfio_cxl_unmap_hdm(struct vfio_pci_cxl_state *cxl)
+{
+ if (cxl->hdm_kva) {
+ memunmap(cxl->hdm_kva);
+ cxl->hdm_kva = NULL;
+ }
+ if (cxl->hdm_res) {
+ release_mem_region(cxl->info.hpa_base, cxl->info.hpa_size);
+ cxl->hdm_res = NULL;
+ }
+}
+
+/* ------------------------------------------------------------------ */
+/* Lifecycle hooks */
+/* ------------------------------------------------------------------ */
+
+int vfio_pci_cxl_acquire(struct vfio_pci_core_device *vdev)
+{
+ struct pci_dev *pdev = vdev->pdev;
+ struct vfio_pci_cxl_state *cxl;
+ u16 dvsec;
+ int rc;
+
+ if (!pcie_is_cxl(pdev))
+ return -ENODEV;
+
+ dvsec = pci_find_dvsec_capability(pdev, PCI_VENDOR_ID_CXL,
+ PCI_DVSEC_CXL_DEVICE);
+ if (!dvsec)
+ return -ENODEV;
+
+ cxl = vfio_cxl_create_device_state(pdev, dvsec);
+ if (IS_ERR(cxl)) {
+ rc = PTR_ERR(cxl);
+ if (rc == -ENODEV)
+ return -ENODEV; /* MEM_CAPABLE clear: treat as non-CXL. */
+ pci_warn(pdev, "vfio-cxl: state alloc failed (%d)\n", rc);
+ return rc;
+ }
+
+ rc = pci_enable_device_mem(pdev);
+ if (rc) {
+ pci_warn(pdev, "vfio-cxl: pci_enable_device_mem failed (%d)\n",
+ rc);
+ goto err_free;
+ }
+
+ rc = vfio_cxl_probe_regs(cxl);
+ if (rc) {
+ pci_disable_device(pdev);
+ pci_warn(pdev, "vfio-cxl: register probe failed (%d)\n", rc);
+ goto err_free;
+ }
+
+ /*
+ * Allocate the cxl-core passthrough handle (DVSEC/HDM/CM
+ * shadows) BEFORE devm_cxl_probe_mem() so that a -ENOMEM or
+ * snapshot -EIO here is recoverable: devm_kfree() the
+ * containing state and let devres unwind cxlds. After
+ * devm_cxl_probe_mem() publishes the memdev, no devm_kfree() is
+ * possible because cxlmd->cxlds points into the state.
+ */
+ cxl->cxlpt = devm_cxl_passthrough_create(&pdev->dev, &cxl->cxlds);
+ if (IS_ERR(cxl->cxlpt)) {
+ rc = PTR_ERR(cxl->cxlpt);
+ cxl->cxlpt = NULL;
+ pci_disable_device(pdev);
+ pci_warn(pdev,
+ "vfio-cxl: passthrough shadow snapshot failed (%d)\n",
+ rc);
+ goto err_free;
+ }
+
+ /*
+ * Drop the PCI enable refcount before publishing the cxl_memdev:
+ * vfio_pci_core_enable() will take a fresh refcount at first VFIO
+ * fd open. PCI_COMMAND_MEMORY stays asserted (see file header).
+ */
+ pci_disable_device(pdev);
+
+ /*
+ * Populate the DPA partition tree on cxlds before
+ * devm_cxl_probe_mem() runs. The endpoint port probe will try to
+ * reserve the firmware-committed HDM decoder range as a DPA
+ * resource child of cxlds->dpa_res; without an explicit
+ * cxl_set_capacity() call dpa_res is zero-sized and the
+ * reservation fails with -EBUSY (see __cxl_dpa_reserve() in
+ * drivers/cxl/core/hdm.c). Read the decoder's SIZE from the
+ * snapshot we just took and size dpa_res to cover it.
+ */
+ {
+ u32 size_lo = 0, size_hi = 0;
+ u64 dpa_size;
+
+ cxl_passthrough_hdm_rw(cxl->cxlpt,
+ CXL_HDM_DECODER0_SIZE_LOW_OFFSET(0),
+ &size_lo, false);
+ cxl_passthrough_hdm_rw(cxl->cxlpt,
+ CXL_HDM_DECODER0_SIZE_HIGH_OFFSET(0),
+ &size_hi, false);
+ dpa_size = ((u64)size_hi << 32) | size_lo;
+
+ rc = cxl_set_capacity(&cxl->cxlds, dpa_size);
+ if (rc) {
+ pci_warn(pdev,
+ "vfio-cxl: cxl_set_capacity(0x%llx) failed (%d)\n",
+ dpa_size, rc);
+ goto err_free;
+ }
+ }
+
+ rc = vfio_cxl_create_memdev(cxl);
+ if (rc) {
+ pci_warn(pdev,
+ "vfio-cxl: memdev/region creation failed (%d)\n", rc);
+ goto err_free;
+ }
+
+ /*
+ * Once devm_cxl_probe_mem() has published a cxl_memdev that
+ * holds a pointer into cxl->cxlds, the state must NOT be
+ * devm_kfree'd. A failure from vfio_cxl_map_hdm() is reported
+ * to userspace; the state stays allocated for the lifetime of
+ * the PCI device, and devres unwinds it when the pdev is
+ * removed.
+ */
+ rc = vfio_cxl_map_hdm(cxl);
+ if (rc) {
+ pci_warn(pdev, "vfio-cxl: HDM HPA mapping failed (%d)\n", rc);
+ return rc;
+ }
+
+ vdev->cxl = cxl;
+ pci_info(pdev,
+ "vfio-cxl: acquired (hpa=%pa/0x%llx hdm@0x%llx/0x%llx BAR%u@0x%llx/0x%llx)\n",
+ &cxl->info.hpa_base, cxl->info.hpa_size,
+ cxl->info.hdm_reg_offset, cxl->info.hdm_reg_size,
+ cxl->info.comp_reg_bir,
+ cxl->info.comp_reg_offset, cxl->info.comp_reg_size);
+ return 0;
+
+err_free:
+ devm_kfree(&pdev->dev, cxl);
+ return rc;
+}
+
+void vfio_pci_cxl_release(struct vfio_pci_core_device *vdev)
+{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+
+ if (cxl)
+ vfio_cxl_unmap_hdm(cxl);
+ vdev->cxl = NULL;
+}
+
+int vfio_pci_cxl_open(struct vfio_pci_core_device *vdev)
+{
+ /*
+ * Region registration (HDM, COMP_REGS) is added by the next
+ * patch in this series. This hook exists so vfio-pci-core's
+ * fd-open path has a stable call site.
+ */
+ return 0;
+}
+
+void vfio_pci_cxl_close(struct vfio_pci_core_device *vdev)
+{
+}
diff --git a/drivers/vfio/pci/cxl/vfio_cxl_priv.h b/drivers/vfio/pci/cxl/vfio_cxl_priv.h
new file mode 100644
index 000000000000..4ce8f88f8d3d
--- /dev/null
+++ b/drivers/vfio/pci/cxl/vfio_cxl_priv.h
@@ -0,0 +1,71 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/* Copyright(c) 2026 NVIDIA Corporation. All rights reserved. */
+#ifndef __VFIO_PCI_CXL_PRIV_H__
+#define __VFIO_PCI_CXL_PRIV_H__
+
+#include <linux/pci.h>
+#include <linux/vfio_pci_core.h>
+
+#include <cxl/cxl.h>
+#include <cxl/passthrough.h>
+
+/**
+ * struct vfio_pci_cxl_state - per-device CXL Type-2 passthrough state
+ *
+ * Anchored to a vfio-pci-core device via @vdev->cxl. Allocated by
+ * devm_cxl_dev_state_create() so its lifetime is bound to the PCI
+ * device; the cxl_memdev acquired via devm_cxl_probe_mem() and the
+ * cxl_passthrough handle returned by devm_cxl_passthrough_create()
+ * are similarly devres-anchored.
+ *
+ * @cxlds: CXL device state. MUST be the first member (enforced by
+ * devm_cxl_dev_state_create()'s static_assert).
+ * @pdev: backpointer to the PCI device.
+ * @cxlmd: cxl_memdev acquired at PCI bind via devm_cxl_probe_mem().
+ * @cxlpt: register-virtualization handle owned by cxl-core; vfio
+ * forwards DVSEC config-space, COMP_REGS region, and HDM
+ * block accesses through this opaque pointer. See
+ * Documentation/driver-api/vfio-pci-cxl.rst.
+ * @info: snapshot of cxl-side metadata describing the device's CXL
+ * layout. Filled in during vfio_pci_cxl_acquire() and used
+ * by the VMM-facing helpers (CAP_CXL builder, region info,
+ * COMP_REGS dispatch boundary).
+ * @hdm_region_idx, @comp_reg_region_idx: VFIO region indices.
+ * Assigned by vfio_pci_cxl_open() when the regions are
+ * registered; zero on a device whose fd has never been
+ * opened.
+ * @hdm_res: request_mem_region cookie for the HPA range.
+ * @hdm_kva: memremap(MEMREMAP_WB) mapping of the HPA range. Used
+ * for the HDM region's pread/pwrite path. The mmap fault
+ * handler does vmf_insert_pfn from the physical HPA so the
+ * guest gets the same backing memory the host sees.
+ */
+struct vfio_pci_cxl_state {
+ /* MUST be first member - see devm_cxl_dev_state_create() macro. */
+ struct cxl_dev_state cxlds;
+
+ struct pci_dev *pdev;
+ struct cxl_memdev *cxlmd;
+ struct cxl_passthrough *cxlpt;
+
+ struct {
+ u16 dvsec_offset;
+ u16 dvsec_size;
+ phys_addr_t hpa_base;
+ u64 hpa_size;
+ u8 comp_reg_bir;
+ u64 comp_reg_offset;
+ u64 comp_reg_size;
+ u8 hdm_count;
+ u64 hdm_reg_offset;
+ u64 hdm_reg_size;
+ bool host_firmware_committed;
+ } info;
+
+ u32 hdm_region_idx;
+ u32 comp_reg_region_idx;
+ struct resource *hdm_res;
+ void *hdm_kva;
+};
+
+#endif /* __VFIO_PCI_CXL_PRIV_H__ */
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 050e7542952e..05ab4ae59157 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -602,10 +602,25 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
if (!vfio_vga_disabled() && vfio_pci_is_vga(pdev))
vdev->has_vga = true;
+ /*
+ * Register CXL VFIO regions before mapping BARs. CXL region
+ * registration only appends to vdev->region[]; it has no
+ * dependency on vdev->barmap[] being populated. Running it
+ * first means a failure here unwinds through out_free_config
+ * without leaking BAR ioremaps or selected-region requests
+ * (those are released by vfio_pci_core_disable(), which is not
+ * called for a failed open).
+ */
+ ret = vfio_pci_cxl_open(vdev);
+ if (ret)
+ goto out_free_config;
+
vfio_pci_core_map_bars(vdev);
return 0;
+out_free_config:
+ vfio_config_free(vdev);
out_free_zdev:
vfio_pci_zdev_close_device(vdev);
out_free_state:
@@ -699,6 +714,7 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
vdev->needs_reset = true;
+ vfio_pci_cxl_close(vdev);
vfio_pci_zdev_close_device(vdev);
/*
@@ -2222,6 +2238,10 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
if (ret)
goto out_vf;
+ ret = vfio_pci_cxl_acquire(vdev);
+ if (ret && ret != -ENODEV)
+ goto out_vga;
+
vfio_pci_probe_power_state(vdev);
/*
@@ -2250,6 +2270,9 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
pm_runtime_get_noresume(dev);
pm_runtime_forbid(dev);
+ vfio_pci_cxl_release(vdev);
+out_vga:
+ vfio_pci_vga_uninit(vdev);
out_vf:
vfio_pci_vf_uninit(vdev);
return ret;
@@ -2264,6 +2287,7 @@ void vfio_pci_core_unregister_device(struct vfio_pci_core_device *vdev)
vfio_pci_vf_uninit(vdev);
vfio_pci_vga_uninit(vdev);
+ vfio_pci_cxl_release(vdev);
if (!disable_idle_d3)
pm_runtime_get_noresume(&vdev->pdev->dev);
diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
index fca9d0dfac90..94bf7c6a8548 100644
--- a/drivers/vfio/pci/vfio_pci_priv.h
+++ b/drivers/vfio/pci/vfio_pci_priv.h
@@ -109,6 +109,27 @@ static inline void vfio_pci_zdev_close_device(struct vfio_pci_core_device *vdev)
{}
#endif
+#ifdef CONFIG_VFIO_PCI_CXL
+int vfio_pci_cxl_acquire(struct vfio_pci_core_device *vdev);
+void vfio_pci_cxl_release(struct vfio_pci_core_device *vdev);
+int vfio_pci_cxl_open(struct vfio_pci_core_device *vdev);
+void vfio_pci_cxl_close(struct vfio_pci_core_device *vdev);
+#else
+static inline int vfio_pci_cxl_acquire(struct vfio_pci_core_device *vdev)
+{
+ return -ENODEV;
+}
+
+static inline void vfio_pci_cxl_release(struct vfio_pci_core_device *vdev) { }
+
+static inline int vfio_pci_cxl_open(struct vfio_pci_core_device *vdev)
+{
+ return 0;
+}
+
+static inline void vfio_pci_cxl_close(struct vfio_pci_core_device *vdev) { }
+#endif
+
static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
{
return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index 89165b769e5c..541c1911e090 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -142,6 +142,13 @@ struct vfio_pci_core_device {
struct notifier_block nb;
struct rw_semaphore memory_lock;
struct list_head dmabufs;
+ /*
+ * Opaque pointer to struct vfio_pci_cxl_state (defined in
+ * drivers/vfio/pci/cxl/vfio_cxl_priv.h). Set by
+ * vfio_pci_cxl_acquire() at PCI bind; NULL on non-CXL devices
+ * and when CONFIG_VFIO_PCI_CXL=n.
+ */
+ void *cxl;
};
enum vfio_pci_io_width {
--
2.25.1
* [PATCH v3 08/11] vfio/pci/cxl: Add HDM + COMP_REGS regions and DVSEC clipping shim
2026-06-25 16:53 [PATCH v3 00/11] vfio/pci: Add CXL Type-2 device passthrough support mhonap
` (6 preceding siblings ...)
2026-06-25 16:54 ` [PATCH v3 07/11] vfio/pci: Add CONFIG_VFIO_PCI_CXL with bind-time CXL Type-2 acquisition mhonap
@ 2026-06-25 16:54 ` mhonap
2026-06-25 16:54 ` [PATCH v3 09/11] selftests/vfio: Add CXL Type-2 device passthrough smoke test mhonap
` (3 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: mhonap @ 2026-06-25 16:54 UTC (permalink / raw)
To: djbw, alex, jgg, jic23, dave.jiang, ankita,
alejandro.lucero-palau, alison.schofield, dave, dmatlack, gourry,
ira.weiny
Cc: cjia, kjaju, vsethi, zhiw, mhonap, kvm, linux-cxl, linux-doc,
linux-kernel, linux-kselftest
From: Manish Honap <mhonap@nvidia.com>
Complete the vfio-pci-core integration of CXL Type-2 device
passthrough by exposing two VFIO regions to userspace, wiring DVSEC
config-space accesses through cxl-core's register-virtualization
helpers, and reserving the CXL component register block from BAR
mmap and BAR resource claim.
HDM region (VFIO_REGION_SUBTYPE_CXL):
- mmappable view of the device's firmware-committed HPA range
- mmap fault handler calls vmf_insert_pfn() from the physical HPA
so the guest gets the same backing memory the host sees
- pread/pwrite go through the memremap(MEMREMAP_WB) kva captured at
bind time by vfio_cxl_map_hdm()
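
The HDM window checks can be sketched in plain userspace C. This is an
illustration only, not the kernel code: hdm_window_ok() and
hdm_offset_to_pa() are hypothetical names, and the sketch assumes the
region index has already been masked off the offset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Overflow-safe window check in the same shape as the mmap and
 * pread/pwrite paths: subtract on the right-hand side rather than
 * computing start + len, which could wrap.
 */
static bool hdm_window_ok(uint64_t start, uint64_t len, uint64_t region_size)
{
	if (start > region_size)
		return false;
	return len <= region_size - start;
}

/* Fault-time translation: in-region byte offset to physical address. */
static bool hdm_offset_to_pa(uint64_t hpa_base, uint64_t hpa_size,
			     uint64_t off, uint64_t *pa)
{
	assert(pa != NULL);
	if (!hpa_size || off >= hpa_size)
		return false;	/* the kernel path answers VM_FAULT_SIGBUS */
	*pa = hpa_base + off;
	return true;
}
```

A request that starts exactly at the end of the region with zero length
passes the window check; any non-zero length there is rejected.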
COMP_REGS region (VFIO_REGION_SUBTYPE_CXL_COMP_REGS):
- pread/pwrite only, dword-aligned (-EINVAL on misalignment)
- thin transport: each dword dispatches by offset to
cxl_passthrough_cm_rw() (CM cap-array snapshot) or
cxl_passthrough_hdm_rw() (HDM Decoder block). No shadow buffer
on the vfio side; all per-field semantics live in cxl-core.
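
The per-dword dispatch can be sketched as a pure classification
function. This is a userspace illustration under stated assumptions:
comp_regs_classify() and enum comp_target are hypothetical names, and
SKETCH_CM_OFFSET stands in for CXL_CM_OFFSET with an assumed value of
0x1000.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_CM_OFFSET 0x1000u	/* assumed value of CXL_CM_OFFSET */

enum comp_target { TARGET_NONE, TARGET_CM, TARGET_HDM };

/*
 * Classify one dword at block-relative offset @pos. On a CM or HDM
 * hit, *rel receives the offset relative to that sub-block; anything
 * else reads as zero and drops writes.
 */
static enum comp_target comp_regs_classify(uint64_t pos, uint64_t hdm_start,
					   uint64_t hdm_size, uint32_t *rel)
{
	uint64_t hdm_end = hdm_start + hdm_size;

	assert(rel != NULL);
	if (pos >= SKETCH_CM_OFFSET && pos < hdm_start) {
		*rel = (uint32_t)(pos - SKETCH_CM_OFFSET);
		return TARGET_CM;
	}
	if (pos >= hdm_start && pos < hdm_end) {
		*rel = (uint32_t)(pos - hdm_start);
		return TARGET_HDM;
	}
	*rel = 0;
	return TARGET_NONE;
}
```

The HDM block position used in any call is device-specific; the values
a caller passes here would come from the cached hdm_reg_offset and
hdm_reg_size.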
DVSEC config-space access:
- vfio_pci_cxl_config_boundary() clips a chunk at the CXL Device
DVSEC body edge in vfio_pci_config_rw_single() so the generic
perm-bits path handles the DVSEC header bytes and the CXL hook
handles the body bytes. The clipping shim is used instead of
re-pointing the ecap_perms[] readfn/writefn (which would mutate
a module-init static and race across multiple CXL devices).
- vfio_pci_cxl_config_rw() forwards clipped accesses to
cxl_passthrough_dvsec_rw(); cxl-core enforces the per-field
write semantics (LOCK/RWO, CONTROL/RWL, STATUS/RW1C,
RANGE1/HwInit, RANGE2/RsvdZ).
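
The clipping rule can be sketched in userspace C as follows. This is an
illustration, not the kernel function: dvsec_clip() is a hypothetical
name, and SKETCH_DVSEC_BODY_OFF assumes the DVSEC body (the CXL
capability word) starts 0x0A bytes into the DVSEC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_DVSEC_BODY_OFF 0x0Au	/* assumed PCI_DVSEC_CXL_CAP offset */

/*
 * Return the largest byte count starting at @pos that does not cross
 * the DVSEC header/body edge or the end of the body, or SIZE_MAX when
 * @pos lies past the DVSEC and no clip applies.
 */
static size_t dvsec_clip(uint32_t dvsec_off, uint32_t dvsec_size, uint64_t pos)
{
	uint64_t body_start = (uint64_t)dvsec_off + SKETCH_DVSEC_BODY_OFF;
	uint64_t body_end = (uint64_t)dvsec_off + dvsec_size;

	assert(dvsec_size >= SKETCH_DVSEC_BODY_OFF);	/* body must exist */
	if (pos < body_start)
		return (size_t)(body_start - pos);
	if (pos < body_end)
		return (size_t)(body_end - pos);
	return SIZE_MAX;
}
```

With a DVSEC at 0x100, a 4-byte access at 0x108 is clipped to 2 bytes,
so the header bytes and the body bytes are handled by separate chunks.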
GET_INFO / GET_REGION_INFO:
- VFIO_DEVICE_INFO_CAP_CXL advertises the two region indices, the
component BAR layout, and HOST_FIRMWARE_COMMITTED.
- GET_REGION_INFO on the component BAR returns a sparse-mmap cap
that excludes [comp_reg_offset, comp_reg_offset+comp_reg_size).
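
How the sparse-mmap areas fall out of the exclusion can be sketched in
userspace C. This is an illustration only: sparse_areas() and struct
sparse_area are hypothetical names, and the page size is assumed to be
4 KiB.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ull	/* assumed page size */

struct sparse_area {
	uint64_t offset;
	uint64_t size;
};

/*
 * Compute up to two mmappable areas of a BAR of @bar_len bytes that
 * avoid [comp_off, comp_off + comp_size), widening the hole outward
 * to page boundaries. Returns the number of areas written to @out.
 */
static unsigned int sparse_areas(uint64_t bar_len, uint64_t comp_off,
				 uint64_t comp_size, struct sparse_area out[2])
{
	uint64_t mask = SKETCH_PAGE_SIZE - 1;
	uint64_t before_end = comp_off & ~mask;
	uint64_t after_start = (comp_off + comp_size + mask) & ~mask;
	unsigned int n = 0;

	assert(out != NULL);
	if (before_end > 0) {
		out[n].offset = 0;
		out[n].size = before_end;
		n++;
	}
	if (after_start < bar_len) {
		out[n].offset = after_start;
		out[n].size = bar_len - after_start;
		n++;
	}
	return n;
}
```

When the component block covers the whole BAR the function yields zero
areas, which corresponds to the BAR being reported without the MMAP
flag.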
BAR resource handling:
- cxl-core holds request_mem_region() on the CXL component
register sub-range from devm_cxl_probe_mem(), so vfio-pci-core's
pci_request_selected_regions() on the full BAR would collide.
map_bars() skips the request for the component BAR (still iomaps
it; vfio holds the BAR via driver binding); disable() mirrors
the asymmetric skip.
- mmap of the component BAR refuses any range overlapping the CXL
sub-range via vfio_pci_cxl_mmap_overlaps_comp_regs().
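
The overlap test is a half-open interval intersection. A userspace
sketch, with ranges_overlap() as a hypothetical name; unlike the mmap
helper in this patch it also treats zero-length requests as
non-overlapping, and it assumes neither range wraps:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * True when [a_start, a_start + a_len) and [b_start, b_start + b_len)
 * share at least one byte. Ranges that merely touch do not overlap.
 */
static bool ranges_overlap(uint64_t a_start, uint64_t a_len,
			   uint64_t b_start, uint64_t b_len)
{
	/* Sketch assumption: no wraparound in either range. */
	assert(a_start + a_len >= a_start);
	assert(b_start + b_len >= b_start);

	if (!a_len || !b_len)
		return false;
	return a_start < b_start + b_len && a_start + a_len > b_start;
}
```

A mapping that ends exactly where the component block begins is
therefore allowed, while one byte further is refused.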
vfio_pci_cxl_open() now registers both VFIO regions; close()
unregisters them. Raw BAR rw redirect into the CXL sub-range is
intentionally not implemented: VMMs use the COMP_REGS region
directly.
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
drivers/vfio/pci/cxl/vfio_cxl_core.c | 521 ++++++++++++++++++++++++++-
drivers/vfio/pci/vfio_pci_config.c | 31 ++
drivers/vfio/pci/vfio_pci_core.c | 44 ++-
drivers/vfio/pci/vfio_pci_priv.h | 72 ++++
drivers/vfio/pci/vfio_pci_rdwr.c | 17 +
5 files changed, 679 insertions(+), 6 deletions(-)
diff --git a/drivers/vfio/pci/cxl/vfio_cxl_core.c b/drivers/vfio/pci/cxl/vfio_cxl_core.c
index 42cd00bbe869..8a00b776d7c7 100644
--- a/drivers/vfio/pci/cxl/vfio_cxl_core.c
+++ b/drivers/vfio/pci/cxl/vfio_cxl_core.c
@@ -123,12 +123,24 @@ static int vfio_cxl_probe_regs(struct vfio_pci_cxl_state *cxl)
if (rc)
return rc;
+ /*
+ * The CXL Component Register block is a fixed 64 KiB area (CXL r4.0
+ * §8.2.3). cxl_pci_setup_regs() records the remaining BAR length
+ * after the regblock offset in reg_map.max_size, which is an upper
+ * bound, not the spec-defined size. Bail if the BAR does not have
+ * room for a full component register block at the recorded offset,
+ * and publish the spec size so the UAPI, sparse-mmap exclusion, and
+ * COMP_REGS region all agree on the same window.
+ */
+ if (cxlds->reg_map.max_size < CXL_COMPONENT_REG_BLOCK_SIZE)
+ return -ENXIO;
+
cxl->info.hdm_count = hdm_count;
cxl->info.hdm_reg_offset = hdm_off;
cxl->info.hdm_reg_size = hdm_size;
cxl->info.comp_reg_bir = bir;
cxl->info.comp_reg_offset = bar_off;
- cxl->info.comp_reg_size = cxlds->reg_map.max_size;
+ cxl->info.comp_reg_size = CXL_COMPONENT_REG_BLOCK_SIZE;
cxl->info.host_firmware_committed = true;
/*
@@ -354,16 +366,515 @@ void vfio_pci_cxl_release(struct vfio_pci_core_device *vdev)
vdev->cxl = NULL;
}
+static int vfio_pci_cxl_register_hdm(struct vfio_pci_core_device *vdev);
+static int vfio_pci_cxl_register_comp_regs(struct vfio_pci_core_device *vdev);
+
int vfio_pci_cxl_open(struct vfio_pci_core_device *vdev)
{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+ int rc;
+
+ if (!cxl)
+ return 0; /* plain vfio-pci device */
+
+ rc = vfio_pci_cxl_register_comp_regs(vdev);
+ if (rc) {
+ pci_warn(vdev->pdev,
+ "vfio-cxl: COMP_REGS region register failed (%d)\n",
+ rc);
+ return rc;
+ }
+
+ rc = vfio_pci_cxl_register_hdm(vdev);
+ if (rc) {
+ pci_warn(vdev->pdev,
+ "vfio-cxl: HDM region register failed (%d)\n", rc);
+ /*
+ * COMP_REGS already registered above. vfio core does not
+ * call close_device() when open_device() returns an error,
+ * so roll back the COMP_REGS dynamic region here to avoid
+ * a leaked half-registered open state.
+ */
+ vfio_pci_cxl_close(vdev);
+ return rc;
+ }
+ return 0;
+}
+
+void vfio_pci_cxl_close(struct vfio_pci_core_device *vdev)
+{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+ unsigned int i;
+
+ if (!cxl)
+ return;
+
+ for (i = vdev->num_regions; i > 0; i--) {
+ struct vfio_pci_region *r = &vdev->region[i - 1];
+
+ if (r->data != cxl)
+ break;
+ if (r->ops->release)
+ r->ops->release(vdev, r);
+ vdev->num_regions--;
+ }
+}
+
+/* ------------------------------------------------------------------ */
+/* HDM region: mmappable view of the device's HPA range */
+/* ------------------------------------------------------------------ */
+
+static vm_fault_t hdm_region_fault(struct vm_fault *vmf)
+{
+ struct vm_area_struct *vma = vmf->vma;
+ struct vfio_pci_cxl_state *cxl = vma->vm_private_data;
+ unsigned long off = (vmf->address - vma->vm_start) +
+ (vma->vm_pgoff << PAGE_SHIFT);
+ phys_addr_t pa;
+
+ if (!cxl || !cxl->info.hpa_size)
+ return VM_FAULT_SIGBUS;
+ if (off >= cxl->info.hpa_size)
+ return VM_FAULT_SIGBUS;
+
+ pa = cxl->info.hpa_base + off;
+ return vmf_insert_pfn(vma, vmf->address, PHYS_PFN(pa));
+}
+
+static const struct vm_operations_struct hdm_region_vm_ops = {
+ .fault = hdm_region_fault,
+};
+
+static int hdm_region_mmap(struct vfio_pci_core_device *vdev,
+ struct vfio_pci_region *region,
+ struct vm_area_struct *vma)
+{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+ pgoff_t pgoff;
+ u64 req_start, req_len;
+
+ if (!cxl || !cxl->info.hpa_size)
+ return -ENODEV;
+
/*
- * Region registration (HDM, COMP_REGS) is added by the next
- * patch in this series. This hook exists so vfio-pci-core's
- * fd-open path has a stable call site.
+ * vfio_pci_core_mmap() forwards the VMA with vm_pgoff still
+ * carrying the VFIO region index in the high bits. Mask it off
+ * so req_start is the in-region offset; also overwrite vm_pgoff
+ * with the normalised value so the fault handler computes the
+ * physical address from a clean offset.
*/
+ pgoff = vma->vm_pgoff &
+ ((1ULL << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
+ req_start = (u64)pgoff << PAGE_SHIFT;
+ req_len = vma->vm_end - vma->vm_start;
+ if (req_start > cxl->info.hpa_size ||
+ req_len > cxl->info.hpa_size - req_start)
+ return -EINVAL;
+
+ vma->vm_pgoff = pgoff;
+ vm_flags_set(vma, VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP);
+ vma->vm_ops = &hdm_region_vm_ops;
+ vma->vm_private_data = cxl;
return 0;
}
-void vfio_pci_cxl_close(struct vfio_pci_core_device *vdev)
+static ssize_t hdm_region_rw(struct vfio_pci_core_device *vdev,
+ char __user *buf, size_t count,
+ loff_t *ppos, bool iswrite)
+{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+ loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
+ void *kva;
+
+ if (!cxl || !cxl->hdm_kva)
+ return -EINVAL;
+ if (pos < 0 || (u64)pos > cxl->info.hpa_size ||
+ count > cxl->info.hpa_size - (u64)pos)
+ return -EINVAL;
+
+ kva = (u8 *)cxl->hdm_kva + pos;
+ if (iswrite) {
+ if (copy_from_user(kva, buf, count))
+ return -EFAULT;
+ } else {
+ if (copy_to_user(buf, kva, count))
+ return -EFAULT;
+ }
+
+ *ppos += count;
+ return count;
+}
+
+static void hdm_region_release(struct vfio_pci_core_device *vdev,
+ struct vfio_pci_region *region)
+{
+}
+
+static const struct vfio_pci_regops vfio_pci_cxl_hdm_ops = {
+ .rw = hdm_region_rw,
+ .mmap = hdm_region_mmap,
+ .release = hdm_region_release,
+};
+
+static int vfio_pci_cxl_register_hdm(struct vfio_pci_core_device *vdev)
+{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+ u32 region_type = VFIO_REGION_TYPE_PCI_VENDOR_TYPE | PCI_VENDOR_ID_CXL;
+ u32 region_flags = VFIO_REGION_INFO_FLAG_READ |
+ VFIO_REGION_INFO_FLAG_WRITE |
+ VFIO_REGION_INFO_FLAG_MMAP;
+ int rc;
+
+ rc = vfio_pci_core_register_dev_region(vdev, region_type,
+ VFIO_REGION_SUBTYPE_CXL,
+ &vfio_pci_cxl_hdm_ops,
+ cxl->info.hpa_size,
+ region_flags, cxl);
+ if (rc)
+ return rc;
+
+ cxl->hdm_region_idx = VFIO_PCI_NUM_REGIONS + vdev->num_regions - 1;
+ return 0;
+}
+
+/* ------------------------------------------------------------------ */
+/* COMP_REGS region: thin transport to cxl-core register helpers */
+/* ------------------------------------------------------------------ */
+
+/*
+ * COMP_REGS exposes the CXL component register sub-range of the
+ * device's component BAR as a pread/pwrite-only VFIO region. Access
+ * is dword-only (4-byte aligned); sub-dword access returns -EINVAL.
+ * Each dword dispatches to one of two cxl-core rw helpers, or to none:
+ *
+ * pos < CXL_CM_OFFSET → zero-fill / drop
+ * CXL_CM_OFFSET <= pos < hdm_reg_offset → cxl_passthrough_cm_rw
+ * hdm_reg_offset <= pos < hdm_reg_offset+size → cxl_passthrough_hdm_rw
+ * pos >= hdm_reg_offset + hdm_reg_size → zero-fill / drop
+ *
+ * vfio holds no shadow buffer of its own; the per-field write
+ * semantics live entirely in cxl-core.
+ */
+static ssize_t comp_regs_rw(struct vfio_pci_core_device *vdev,
+ char __user *buf, size_t count,
+ loff_t *ppos, bool iswrite)
+{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+ loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
+ resource_size_t cm_off, hdm_start, hdm_end;
+ size_t done = 0;
+
+ if (!cxl || !cxl->cxlpt)
+ return -EINVAL;
+ if (pos < 0 || (u64)pos > cxl->info.comp_reg_size ||
+ count > cxl->info.comp_reg_size - (u64)pos)
+ return -EINVAL;
+ if (!IS_ALIGNED(pos, 4) || !IS_ALIGNED(count, 4))
+ return -EINVAL;
+
+ cm_off = CXL_CM_OFFSET;
+ hdm_start = cxl->info.hdm_reg_offset;
+ hdm_end = hdm_start + cxl->info.hdm_reg_size;
+
+ while (done < count) {
+ __le32 le = 0;
+ u32 v32 = 0;
+ int rc;
+
+ if (iswrite) {
+ if (copy_from_user(&le, buf + done, 4))
+ return done ?: -EFAULT;
+ v32 = le32_to_cpu(le);
+ }
+
+ if (pos >= cm_off && pos < hdm_start) {
+ rc = cxl_passthrough_cm_rw(cxl->cxlpt,
+ (u32)(pos - cm_off),
+ &v32, iswrite);
+ if (rc)
+ return done ?: rc;
+ } else if (pos >= hdm_start && pos < hdm_end) {
+ rc = cxl_passthrough_hdm_rw(cxl->cxlpt,
+ (u32)(pos - hdm_start),
+ &v32, iswrite);
+ if (rc)
+ return done ?: rc;
+ } else if (!iswrite) {
+ v32 = 0; /* outside modelled ranges: read 0 */
+ }
+ /* writes outside modelled ranges are silently dropped */
+
+ if (!iswrite) {
+ le = cpu_to_le32(v32);
+ if (copy_to_user(buf + done, &le, 4))
+ return done ?: -EFAULT;
+ }
+
+ pos += 4;
+ done += 4;
+ }
+
+ *ppos += done;
+ return done;
+}
+
+static void comp_regs_release(struct vfio_pci_core_device *vdev,
+ struct vfio_pci_region *region)
+{
+}
+
+static const struct vfio_pci_regops vfio_pci_cxl_comp_regs_ops = {
+ .rw = comp_regs_rw,
+ .release = comp_regs_release,
+};
+
+static int vfio_pci_cxl_register_comp_regs(struct vfio_pci_core_device *vdev)
+{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+ u32 region_type = VFIO_REGION_TYPE_PCI_VENDOR_TYPE | PCI_VENDOR_ID_CXL;
+ u32 region_flags = VFIO_REGION_INFO_FLAG_READ |
+ VFIO_REGION_INFO_FLAG_WRITE;
+ int rc;
+
+ rc = vfio_pci_core_register_dev_region(vdev, region_type,
+ VFIO_REGION_SUBTYPE_CXL_COMP_REGS,
+ &vfio_pci_cxl_comp_regs_ops,
+ cxl->info.comp_reg_size,
+ region_flags, cxl);
+ if (rc)
+ return rc;
+
+ cxl->comp_reg_region_idx = VFIO_PCI_NUM_REGIONS + vdev->num_regions - 1;
+ return 0;
+}
+
+/* ------------------------------------------------------------------ */
+/* DVSEC config-space clipping shim */
+/* ------------------------------------------------------------------ */
+
+/*
+ * vfio_pci_cxl_config_boundary - clip a config-rw chunk at the DVSEC body edge
+ *
+ * Returns the maximum byte count the caller may pass through the
+ * generic chunker without straddling the CXL Device DVSEC body
+ * boundary, or SIZE_MAX when no clip is required. Used by
+ * vfio_pci_config_rw_single() so the DVSEC header bytes stay on the
+ * generic perm-bits path and the body bytes reach the CXL hook.
+ */
+size_t vfio_pci_cxl_config_boundary(struct vfio_pci_core_device *vdev,
+ loff_t pos)
+{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+ u32 body_start, body_end;
+
+ if (!cxl)
+ return SIZE_MAX;
+
+ body_start = cxl->info.dvsec_offset + PCI_DVSEC_CXL_CAP;
+ body_end = cxl->info.dvsec_offset + cxl->info.dvsec_size;
+
+ if (pos < body_start)
+ return body_start - pos;
+ if (pos < body_end)
+ return body_end - pos;
+ return SIZE_MAX;
+}
+
+/*
+ * vfio_pci_cxl_config_rw - forward CXL DVSEC config accesses to cxl-core
+ *
+ * Returns the number of bytes processed on success, -ENOENT if the
+ * access lies entirely outside the CXL Device DVSEC body (caller
+ * takes the standard perm-bits path), or another negative errno on
+ * hard failure. vfio_pci_config_rw_single() applies
+ * vfio_pci_cxl_config_boundary() before width selection, so any
+ * access that reaches here was already clipped to lie entirely inside
+ * the DVSEC body.
+ */
+ssize_t vfio_pci_cxl_config_rw(struct vfio_pci_core_device *vdev,
+ loff_t pos, size_t count, __le32 *val,
+ bool iswrite)
{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+ u32 dvsec_off, body_start, body_end, off;
+ u32 host_val;
+ int rc;
+
+ if (!cxl || !cxl->cxlpt)
+ return -ENOENT;
+
+ dvsec_off = cxl->info.dvsec_offset;
+ body_start = dvsec_off + PCI_DVSEC_CXL_CAP;
+ body_end = dvsec_off + cxl->info.dvsec_size;
+
+ if (pos + count <= body_start || pos >= body_end)
+ return -ENOENT;
+ if (WARN_ON_ONCE(pos < body_start || pos + count > body_end))
+ return -EINVAL; /* caller failed to clip at body boundary */
+
+ off = (u32)(pos - dvsec_off);
+ host_val = iswrite ? le32_to_cpu(*val) : 0;
+
+ rc = cxl_passthrough_dvsec_rw(cxl->cxlpt, off, &host_val, count,
+ iswrite);
+ if (rc)
+ return rc;
+
+ if (!iswrite)
+ *val = cpu_to_le32(host_val);
+ return count;
+}
+
+/* ------------------------------------------------------------------ */
+/* GET_INFO / GET_REGION_INFO / mmap helpers */
+/* ------------------------------------------------------------------ */
+
+u8 vfio_pci_cxl_get_component_reg_bar(struct vfio_pci_core_device *vdev)
+{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+
+ return cxl ? cxl->info.comp_reg_bir : U8_MAX;
+}
+
+bool vfio_pci_cxl_get_comp_reg_range(struct vfio_pci_core_device *vdev,
+ size_t *start, size_t *end)
+{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+
+ if (!cxl || !cxl->info.comp_reg_size)
+ return false;
+
+ *start = cxl->info.comp_reg_offset;
+ *end = cxl->info.comp_reg_offset + cxl->info.comp_reg_size;
+ return true;
+}
+
+bool vfio_pci_cxl_mmap_overlaps_comp_regs(struct vfio_pci_core_device *vdev,
+ u64 req_start, u64 req_len)
+{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+
+ if (!cxl || !cxl->info.comp_reg_size)
+ return false;
+
+ return req_start < cxl->info.comp_reg_offset + cxl->info.comp_reg_size &&
+ req_start + req_len > cxl->info.comp_reg_offset;
+}
+
+/*
+ * vfio_pci_cxl_bar_overlaps_comp_regs - check whether a BAR-relative access
+ * overlaps the CXL component register sub-range.
+ *
+ * Returns true when @bar is the component BAR and the [@start, @start + @len)
+ * window overlaps [comp_reg_offset, comp_reg_offset + comp_reg_size). Used
+ * by the raw BAR read/write and ioeventfd paths to reject accesses that
+ * would bypass the COMP_REGS region and reach the physical component
+ * registers directly, sidestepping cxl-core's shadow and per-field write
+ * semantics.
+ */
+bool vfio_pci_cxl_bar_overlaps_comp_regs(struct vfio_pci_core_device *vdev,
+ int bar, u64 start, u64 len)
+{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+
+ if (!cxl || !cxl->info.comp_reg_size || !len)
+ return false;
+ if (bar != cxl->info.comp_reg_bir)
+ return false;
+
+ return start < cxl->info.comp_reg_offset + cxl->info.comp_reg_size &&
+ start + len > cxl->info.comp_reg_offset;
+}
+
+int vfio_pci_cxl_get_info(struct vfio_pci_core_device *vdev,
+ struct vfio_info_cap *caps)
+{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+ struct vfio_device_info_cap_cxl cap = { };
+
+ if (!cxl)
+ return 0;
+
+ cap.header.id = VFIO_DEVICE_INFO_CAP_CXL;
+ cap.header.version = 1;
+ if (cxl->info.host_firmware_committed)
+ cap.flags |= VFIO_CXL_CAP_HOST_FIRMWARE_COMMITTED;
+ cap.hdm_region_idx = cxl->hdm_region_idx;
+ cap.comp_reg_region_idx = cxl->comp_reg_region_idx;
+ cap.comp_reg_bar = cxl->info.comp_reg_bir;
+ cap.comp_reg_offset = cxl->info.comp_reg_offset;
+ cap.comp_reg_size = cxl->info.comp_reg_size;
+
+ return vfio_info_add_capability(caps, &cap.header, sizeof(cap));
+}
+
+/*
+ * Build a VFIO_REGION_INFO_CAP_SPARSE_MMAP that excludes the CXL
+ * component register block from the mmappable areas of the
+ * component BAR. Returns -ENOTTY when the request is not for the
+ * component BAR or the component BAR is not mmappable; the caller
+ * (vfio_pci_ioctl_get_region_info) then continues with the standard
+ * BAR path.
+ */
+int vfio_pci_cxl_get_region_info(struct vfio_pci_core_device *vdev,
+ struct vfio_region_info *info,
+ struct vfio_info_cap *caps)
+{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+ struct vfio_region_info_cap_sparse_mmap *sparse;
+ u64 bar_len, comp_start, comp_end;
+ u64 before_end, after_start;
+ struct vfio_region_sparse_mmap_area areas[2];
+ u32 nr_areas = 0, cap_size;
+ int ret;
+
+ if (!cxl)
+ return -ENOTTY;
+ if (info->index != cxl->info.comp_reg_bir)
+ return -ENOTTY;
+ if (!cxl->info.comp_reg_size)
+ return -ENOTTY;
+ if (!vdev->bar_mmap_supported[info->index])
+ return -ENOTTY;
+
+ bar_len = pci_resource_len(vdev->pdev, info->index);
+ comp_start = cxl->info.comp_reg_offset;
+ comp_end = comp_start + cxl->info.comp_reg_size;
+
+ before_end = round_down(comp_start, PAGE_SIZE);
+ after_start = round_up(comp_end, PAGE_SIZE);
+
+ if (before_end > 0) {
+ areas[nr_areas].offset = 0;
+ areas[nr_areas].size = before_end;
+ nr_areas++;
+ }
+ if (after_start < bar_len) {
+ areas[nr_areas].offset = after_start;
+ areas[nr_areas].size = bar_len - after_start;
+ nr_areas++;
+ }
+
+ info->offset = VFIO_PCI_INDEX_TO_OFFSET(info->index);
+ info->size = bar_len;
+ info->flags = VFIO_REGION_INFO_FLAG_READ |
+ VFIO_REGION_INFO_FLAG_WRITE;
+ if (!nr_areas)
+ return 0;
+
+ info->flags |= VFIO_REGION_INFO_FLAG_MMAP;
+
+ cap_size = struct_size(sparse, areas, nr_areas);
+ sparse = kzalloc(cap_size, GFP_KERNEL);
+ if (!sparse)
+ return -ENOMEM;
+
+ sparse->header.id = VFIO_REGION_INFO_CAP_SPARSE_MMAP;
+ sparse->header.version = 1;
+ sparse->nr_areas = nr_areas;
+ memcpy(sparse->areas, areas, nr_areas * sizeof(areas[0]));
+
+ ret = vfio_info_add_capability(caps, &sparse->header, cap_size);
+ kfree(sparse);
+ return ret;
}
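
The page rounding done by vfio_pci_cxl_get_region_info() above can be tried in isolation. The following is a minimal userspace sketch, not the kernel code: struct sparse_area, carve_sparse_areas() and the rounding helpers are hypothetical names, and the page size is passed in explicitly instead of using PAGE_SIZE.

```c
#include <assert.h>
#include <stdint.h>

/*
 * One mmappable window inside a BAR. Local illustration type that
 * carries the same offset/size pair as a sparse-mmap area.
 */
struct sparse_area {
	uint64_t offset;
	uint64_t size;
};

/* Power-of-two rounding helpers (hypothetical names). */
static uint64_t round_down_pow2(uint64_t v, uint64_t align)
{
	return v & ~(align - 1);
}

static uint64_t round_up_pow2(uint64_t v, uint64_t align)
{
	return (v + align - 1) & ~(align - 1);
}

/*
 * Fill @areas with the page-aligned windows of a BAR of @bar_len bytes
 * that do not touch [comp_off, comp_off + comp_size). Returns the
 * number of areas written: 0, 1 or 2.
 */
static unsigned int carve_sparse_areas(uint64_t bar_len, uint64_t comp_off,
				       uint64_t comp_size, uint64_t page_size,
				       struct sparse_area areas[2])
{
	uint64_t before_end, after_start;
	unsigned int n = 0;

	/* The rounding helpers only work for power-of-two page sizes. */
	assert(page_size && !(page_size & (page_size - 1)));

	/* Shrink the window before the block, grow the hole after it. */
	before_end = round_down_pow2(comp_off, page_size);
	after_start = round_up_pow2(comp_off + comp_size, page_size);

	if (before_end > 0) {
		areas[n].offset = 0;
		areas[n].size = before_end;
		n++;
	}
	if (after_start < bar_len) {
		areas[n].offset = after_start;
		areas[n].size = bar_len - after_start;
		n++;
	}
	return n;
}
```

Rounding the start down and the end up means a page that is only partly covered by the component register block is never advertised as mmappable, at the cost of hiding the unrelated bytes that share that page.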
diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index a10ed733f0e3..b9f30a33515a 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -1898,8 +1898,15 @@ ssize_t vfio_pci_config_rw_single(struct vfio_pci_core_device *vdev,
/*
* Chop accesses into aligned chunks containing no more than a
* single capability. Caller increments to the next chunk.
+ *
+ * For CXL Type-2 devices also clip at the CXL Device DVSEC body
+ * boundary so the generic perm-bits path handles the DVSEC
+ * header bytes and the CXL hook handles the body bytes; without
+ * this clip a 32-bit access at dvsec + 0x08 would span the
+ * generic Header2 word and the CXL CAPABILITY word.
*/
count = min(count, vfio_pci_cap_remaining_dword(vdev, *ppos));
+ count = min(count, vfio_pci_cxl_config_boundary(vdev, *ppos));
if (count >= 4 && !(*ppos % 4))
count = 4;
else if (count >= 2 && !(*ppos % 2))
@@ -1909,6 +1916,30 @@ ssize_t vfio_pci_config_rw_single(struct vfio_pci_core_device *vdev,
ret = count;
+ /*
+ * Give the CXL Type-2 hook first claim on this access: if the
+ * range lies inside the CXL Device DVSEC body, forward it to
+ * cxl-core's register-virtualization helpers instead of the
+ * standard perm-bits path. -ENOENT means "not for me; use the
+ * default path"; any other negative value is a hard error.
+ */
+ if (vdev->cxl) {
+ __le32 le_val = 0;
+ ssize_t cxl_ret;
+
+ if (iswrite && copy_from_user(&le_val, buf, count))
+ return -EFAULT;
+ cxl_ret = vfio_pci_cxl_config_rw(vdev, *ppos, count, &le_val,
+ iswrite);
+ if (cxl_ret >= 0) {
+ if (!iswrite && copy_to_user(buf, &le_val, count))
+ return -EFAULT;
+ return cxl_ret;
+ }
+ if (cxl_ret != -ENOENT)
+ return cxl_ret;
+ }
+
cap_id = vdev->pci_config_map[*ppos];
if (cap_id == PCI_CAP_ID_INVALID) {
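
The effect of the extra clip in vfio_pci_config_rw_single() can be sketched in userspace. This is an illustration under assumptions, not the kernel logic: body_boundary() is a hypothetical stand-in for vfio_pci_cxl_config_boundary() that only models the start of the DVSEC body (taken here to be dvsec + 0x0a, the CAPABILITY word named in the comment above), and the existing vfio_pci_cap_remaining_dword() clip is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static size_t min_sz(size_t a, size_t b)
{
	return a < b ? a : b;
}

/*
 * Bytes from @pos up to the start of the DVSEC body, or SIZE_MAX when
 * @pos is already at or past @body_start. Hypothetical helper; the
 * real boundary function may also clip at the end of the body.
 */
static size_t body_boundary(uint64_t pos, uint64_t body_start)
{
	return pos < body_start ? (size_t)(body_start - pos) : SIZE_MAX;
}

/*
 * Pick the access width for one chunk: clip at the body boundary
 * first, then choose the largest naturally aligned width (4, 2 or 1
 * bytes) that still fits.
 */
static size_t chunk_size(uint64_t pos, size_t count, uint64_t body_start)
{
	assert(count > 0);

	count = min_sz(count, body_boundary(pos, body_start));
	if (count >= 4 && !(pos % 4))
		return 4;
	if (count >= 2 && !(pos % 2))
		return 2;
	return 1;
}
```

With a DVSEC at 0x100, a 4-byte access at 0x108 is cut to 2 bytes, so the Header2 half stays on the perm-bits path and the caller's next chunk starts exactly at the CAPABILITY word.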
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 05ab4ae59157..2d2dae278d1e 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -501,6 +501,23 @@ static void vfio_pci_core_map_bars(struct vfio_pci_core_device *vdev)
if (!pci_resource_len(pdev, i))
continue;
+ /*
+ * cxl-core already holds request_mem_region() on the CXL
+ * component register sub-range of this BAR. Skip the
+ * full-BAR request so we do not collide with that
+ * sub-region; vfio still owns the BAR via the driver
+ * binding and the iomap below succeeds without a region
+ * claim.
+ */
+ if (vdev->cxl && bar == vfio_pci_cxl_get_component_reg_bar(vdev)) {
+ vdev->barmap[bar] = pci_iomap(pdev, bar, 0);
+ if (!vdev->barmap[bar]) {
+ pci_dbg(pdev, "Failed to iomap region %d\n", bar);
+ vdev->barmap[bar] = IOMEM_ERR_PTR(-ENOMEM);
+ }
+ continue;
+ }
+
if (pci_request_selected_regions(pdev, 1 << bar, "vfio")) {
pci_dbg(pdev, "Failed to reserve region %d\n", bar);
vdev->barmap[bar] = IOMEM_ERR_PTR(-EBUSY);
@@ -701,7 +718,10 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
if (IS_ERR_OR_NULL(vdev->barmap[bar]))
continue;
pci_iounmap(pdev, vdev->barmap[bar]);
- pci_release_selected_regions(pdev, 1 << bar);
+ /* Mirror the asymmetric setup-time skip in map_bars(). */
+ if (!(vdev->cxl &&
+ i == vfio_pci_cxl_get_component_reg_bar(vdev)))
+ pci_release_selected_regions(pdev, 1 << bar);
vdev->barmap[bar] = NULL;
}
@@ -1051,6 +1071,16 @@ static int vfio_pci_ioctl_get_info(struct vfio_pci_core_device *vdev,
info.num_regions = VFIO_PCI_NUM_REGIONS + vdev->num_regions;
info.num_irqs = VFIO_PCI_NUM_IRQS;
+ if (vdev->cxl) {
+ ret = vfio_pci_cxl_get_info(vdev, &caps);
+ if (ret) {
+ pci_warn(vdev->pdev,
+ "Failed to add CXL info capability\n");
+ return ret;
+ }
+ info.flags |= VFIO_DEVICE_FLAGS_CXL;
+ }
+
ret = vfio_pci_info_zdev_add_caps(vdev, &caps);
if (ret && ret != -ENODEV) {
pci_warn(vdev->pdev,
@@ -1093,6 +1123,12 @@ int vfio_pci_ioctl_get_region_info(struct vfio_device *core_vdev,
struct pci_dev *pdev = vdev->pdev;
int i, ret;
+ if (vdev->cxl) {
+ ret = vfio_pci_cxl_get_region_info(vdev, info, caps);
+ if (ret != -ENOTTY)
+ return ret;
+ }
+
switch (info->index) {
case VFIO_PCI_CONFIG_REGION_INDEX:
info->offset = VFIO_PCI_INDEX_TO_OFFSET(info->index);
@@ -1811,6 +1847,12 @@ int vfio_pci_core_mmap(struct vfio_device *core_vdev, struct vm_area_struct *vma
if (req_start + req_len > phys_len)
return -EINVAL;
+ /* Block mmap of the CXL component register block. */
+ if (vdev->cxl &&
+ index == vfio_pci_cxl_get_component_reg_bar(vdev) &&
+ vfio_pci_cxl_mmap_overlaps_comp_regs(vdev, req_start, req_len))
+ return -EINVAL;
+
/*
* Even though we don't make use of the barmap for the mmap,
* we need to request the region and the barmap tracks that.
diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
index 94bf7c6a8548..88b89da6dd5a 100644
--- a/drivers/vfio/pci/vfio_pci_priv.h
+++ b/drivers/vfio/pci/vfio_pci_priv.h
@@ -114,6 +114,23 @@ int vfio_pci_cxl_acquire(struct vfio_pci_core_device *vdev);
void vfio_pci_cxl_release(struct vfio_pci_core_device *vdev);
int vfio_pci_cxl_open(struct vfio_pci_core_device *vdev);
void vfio_pci_cxl_close(struct vfio_pci_core_device *vdev);
+size_t vfio_pci_cxl_config_boundary(struct vfio_pci_core_device *vdev,
+ loff_t pos);
+ssize_t vfio_pci_cxl_config_rw(struct vfio_pci_core_device *vdev,
+ loff_t pos, size_t count, __le32 *val,
+ bool iswrite);
+int vfio_pci_cxl_get_info(struct vfio_pci_core_device *vdev,
+ struct vfio_info_cap *caps);
+int vfio_pci_cxl_get_region_info(struct vfio_pci_core_device *vdev,
+ struct vfio_region_info *info,
+ struct vfio_info_cap *caps);
+u8 vfio_pci_cxl_get_component_reg_bar(struct vfio_pci_core_device *vdev);
+bool vfio_pci_cxl_get_comp_reg_range(struct vfio_pci_core_device *vdev,
+ size_t *start, size_t *end);
+bool vfio_pci_cxl_mmap_overlaps_comp_regs(struct vfio_pci_core_device *vdev,
+ u64 req_start, u64 req_len);
+bool vfio_pci_cxl_bar_overlaps_comp_regs(struct vfio_pci_core_device *vdev,
+ int bar, u64 start, u64 len);
#else
static inline int vfio_pci_cxl_acquire(struct vfio_pci_core_device *vdev)
{
@@ -128,6 +145,61 @@ static inline int vfio_pci_cxl_open(struct vfio_pci_core_device *vdev)
}
static inline void vfio_pci_cxl_close(struct vfio_pci_core_device *vdev) { }
+
+static inline size_t
+vfio_pci_cxl_config_boundary(struct vfio_pci_core_device *vdev, loff_t pos)
+{
+ return SIZE_MAX;
+}
+
+static inline ssize_t
+vfio_pci_cxl_config_rw(struct vfio_pci_core_device *vdev, loff_t pos,
+ size_t count, __le32 *val, bool iswrite)
+{
+ return -ENOENT;
+}
+
+static inline int
+vfio_pci_cxl_get_info(struct vfio_pci_core_device *vdev,
+ struct vfio_info_cap *caps)
+{
+ return 0;
+}
+
+static inline int
+vfio_pci_cxl_get_region_info(struct vfio_pci_core_device *vdev,
+ struct vfio_region_info *info,
+ struct vfio_info_cap *caps)
+{
+ return -ENOTTY;
+}
+
+static inline u8
+vfio_pci_cxl_get_component_reg_bar(struct vfio_pci_core_device *vdev)
+{
+ return U8_MAX;
+}
+
+static inline bool
+vfio_pci_cxl_get_comp_reg_range(struct vfio_pci_core_device *vdev,
+ size_t *start, size_t *end)
+{
+ return false;
+}
+
+static inline bool
+vfio_pci_cxl_mmap_overlaps_comp_regs(struct vfio_pci_core_device *vdev,
+ u64 req_start, u64 req_len)
+{
+ return false;
+}
+
+static inline bool
+vfio_pci_cxl_bar_overlaps_comp_regs(struct vfio_pci_core_device *vdev,
+ int bar, u64 start, u64 len)
+{
+ return false;
+}
#endif
static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
index 3bfbb879a005..a856f29a3c94 100644
--- a/drivers/vfio/pci/vfio_pci_rdwr.c
+++ b/drivers/vfio/pci/vfio_pci_rdwr.c
@@ -236,6 +236,15 @@ ssize_t vfio_pci_bar_rw(struct vfio_pci_core_device *vdev, char __user *buf,
count = min(count, (size_t)(end - pos));
+ /*
+ * Reject raw BAR access that would land inside the CXL component
+ * register sub-range. cxl-core owns the per-field shadow and
+ * spec-defined write semantics; userspace must use the dedicated
+ * COMP_REGS VFIO region for that range.
+ */
+ if (vfio_pci_cxl_bar_overlaps_comp_regs(vdev, bar, pos, count))
+ return -EINVAL;
+
if (bar == PCI_ROM_RESOURCE) {
/*
* The ROM can fill less space than the BAR, so we start the
@@ -437,6 +446,14 @@ int vfio_pci_ioeventfd(struct vfio_pci_core_device *vdev, loff_t offset,
pos >= vdev->msix_offset + vdev->msix_size))
return -EINVAL;
+ /*
+ * Disallow arming ioeventfds against the CXL component register
+ * sub-range; that area is fronted by cxl-core's shadow and must
+ * not be reached through the raw BAR map.
+ */
+ if (vfio_pci_cxl_bar_overlaps_comp_regs(vdev, bar, pos, count))
+ return -EINVAL;
+
if (count == 8)
return -EINVAL;
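
The mmap, BAR read/write and ioeventfd guards added in this patch all reduce to a half-open interval intersection test against the component register sub-range. A minimal sketch of that test follows; ranges_overlap() is a hypothetical name, and the real vfio_pci_cxl_*_overlaps_comp_regs() helpers (not shown in this hunk) may differ, for example by page-aligning the protected range for mmap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * True when [start, start + len) intersects
 * [comp_off, comp_off + comp_size). Empty ranges never overlap.
 * Callers are assumed to pass ranges that do not wrap around.
 */
static bool ranges_overlap(uint64_t start, uint64_t len,
			   uint64_t comp_off, uint64_t comp_size)
{
	/* Guard the no-wraparound assumption stated above. */
	assert(start + len >= start);
	assert(comp_off + comp_size >= comp_off);

	if (!len || !comp_size)
		return false;

	/* Each range must begin before the other one ends. */
	return start < comp_off + comp_size && comp_off < start + len;
}
```

Because both intervals are half-open, an access that ends exactly where the component register block begins is allowed, which matches the sparse-mmap areas advertised to userspace.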
--
2.25.1
* [PATCH v3 09/11] selftests/vfio: Add CXL Type-2 device passthrough smoke test
2026-06-25 16:53 [PATCH v3 00/11] vfio/pci: Add CXL Type-2 device passthrough support mhonap
` (7 preceding siblings ...)
2026-06-25 16:54 ` [PATCH v3 08/11] vfio/pci/cxl: Add HDM + COMP_REGS regions and DVSEC clipping shim mhonap
@ 2026-06-25 16:54 ` mhonap
2026-06-25 16:54 ` [PATCH v3 10/11] docs: vfio-pci: Document CXL Type-2 device passthrough mhonap
` (2 subsequent siblings)
11 siblings, 0 replies; 13+ messages in thread
From: mhonap @ 2026-06-25 16:54 UTC (permalink / raw)
To: djbw, alex, jgg, jic23, dave.jiang, ankita,
alejandro.lucero-palau, alison.schofield, dave, dmatlack, gourry,
ira.weiny
Cc: cjia, kjaju, vsethi, zhiw, mhonap, kvm, linux-cxl, linux-doc,
linux-kernel, linux-kselftest
From: Manish Honap <mhonap@nvidia.com>
Exercise the user-visible contract added by CONFIG_VFIO_PCI_CXL:
  device_is_cxl
      GET_INFO returns VFIO_DEVICE_FLAGS_CXL and a populated
      VFIO_DEVICE_INFO_CAP_CXL.

  hdm_region_mmap_rw
      mmap() one page of the HDM region, write a pattern, read it
      back. Proves the mmap fault handler's vmf_insert_pfn path and
      the firmware-committed HPA mapping.

  component_bar_sparse_mmap
      GET_REGION_INFO on the component BAR advertises a SPARSE_MMAP
      cap, and every advertised mmappable area lies outside
      [comp_reg_offset, +comp_reg_size).

  comp_regs_cm_cap_array_read
      pread() of the COMP_REGS region at CXL_CM_OFFSET returns a
      valid CM cap-array header (CAP_ID == 1, ARRAY_SIZE > 0).
      Proves the cxl_passthrough_cm_rw() dispatch is wired.

  dvsec_lock_byte_read
      pread() of the DVSEC CONFIG_LOCK byte through the config-rw
      clipping shim succeeds. Proves the cxl_passthrough_dvsec_rw()
      path is wired.

hdm_decoder_commit_fsm additionally exercises the per-decoder
COMMIT/COMMITTED state machine against the cxl-core shadow. DVSEC
LOCK latch behaviour is out of scope for this smoke test. No debugfs
dependency.
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
tools/testing/selftests/vfio/Makefile | 1 +
.../selftests/vfio/lib/vfio_pci_device.c | 11 +-
.../selftests/vfio/vfio_cxl_type2_test.c | 350 ++++++++++++++++++
3 files changed, 361 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/vfio/vfio_cxl_type2_test.c
diff --git a/tools/testing/selftests/vfio/Makefile b/tools/testing/selftests/vfio/Makefile
index 0684932d91bf..25f2a9420ef6 100644
--- a/tools/testing/selftests/vfio/Makefile
+++ b/tools/testing/selftests/vfio/Makefile
@@ -12,6 +12,7 @@ TEST_GEN_PROGS += vfio_iommufd_setup_test
TEST_GEN_PROGS += vfio_pci_device_test
TEST_GEN_PROGS += vfio_pci_device_init_perf_test
TEST_GEN_PROGS += vfio_pci_driver_test
+TEST_GEN_PROGS += vfio_cxl_type2_test
TEST_FILES += scripts/cleanup.sh
TEST_FILES += scripts/lib.sh
diff --git a/tools/testing/selftests/vfio/lib/vfio_pci_device.c b/tools/testing/selftests/vfio/lib/vfio_pci_device.c
index fc75e04ef010..d2150129d854 100644
--- a/tools/testing/selftests/vfio/lib/vfio_pci_device.c
+++ b/tools/testing/selftests/vfio/lib/vfio_pci_device.c
@@ -281,7 +281,16 @@ static void vfio_pci_device_setup(struct vfio_pci_device *device)
struct vfio_pci_bar *bar = device->bars + i;
vfio_pci_region_get(device, i, &bar->info);
- if (bar->info.flags & VFIO_REGION_INFO_FLAG_MMAP)
+ /*
+ * Skip auto-mmap when the BAR advertises region-info caps
+ * (e.g. VFIO_REGION_INFO_CAP_SPARSE_MMAP). Such BARs are
+ * only partially mmappable; the kernel rejects full-BAR
+ * mmaps and the caller must walk the sparse-area cap and
+ * mmap each advertised area separately. Tests that need
+ * access to such a BAR handle the per-area mmap themselves.
+ */
+ if ((bar->info.flags & VFIO_REGION_INFO_FLAG_MMAP) &&
+ !(bar->info.flags & VFIO_REGION_INFO_FLAG_CAPS))
vfio_pci_bar_map(device, i);
}
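
The comment above leaves the sparse-area walk to the caller. A bounds-checked walk of a VFIO capability chain might look like the sketch below. It is illustrative only: struct cap_header is a local mirror of the UAPI struct vfio_info_cap_header layout, and find_cap_offset() is a hypothetical helper, not part of the selftest library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of the UAPI struct vfio_info_cap_header layout. */
struct cap_header {
	uint16_t id;
	uint16_t version;
	uint32_t next;	/* offset of the next cap from buffer start, 0 ends the chain */
};

/*
 * Return the buffer offset of the first capability with @id, starting
 * the walk at @first (the cap_offset reported by the ioctl). Returns 0
 * when the capability is absent, the chain leaves the buffer, or the
 * chain loops.
 */
static size_t find_cap_offset(const uint8_t *buf, size_t bufsz,
			      size_t first, uint16_t id)
{
	size_t max_hops = bufsz / sizeof(struct cap_header);
	size_t off = first;
	size_t hops = 0;

	assert(buf);

	if (bufsz < sizeof(struct cap_header))
		return 0;

	while (off && off <= bufsz - sizeof(struct cap_header) &&
	       hops++ < max_hops) {
		struct cap_header hdr;

		/* memcpy avoids unaligned loads from the byte buffer. */
		memcpy(&hdr, buf + off, sizeof(hdr));
		if (hdr.id == id)
			return off;
		off = hdr.next;
	}
	return 0;
}
```

A caller would pass the buffer filled by VFIO_DEVICE_GET_REGION_INFO together with its cap_offset, look up VFIO_REGION_INFO_CAP_SPARSE_MMAP, and then mmap each advertised area separately.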
diff --git a/tools/testing/selftests/vfio/vfio_cxl_type2_test.c b/tools/testing/selftests/vfio/vfio_cxl_type2_test.c
new file mode 100644
index 000000000000..bc98a29f90ad
--- /dev/null
+++ b/tools/testing/selftests/vfio/vfio_cxl_type2_test.c
@@ -0,0 +1,350 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * vfio_cxl_type2_test - smoke + dispatch tests for CXL Type-2 device
+ * passthrough through vfio-pci.
+ *
+ * Exercises the user-visible surface gated by CONFIG_VFIO_PCI_CXL:
+ * - GET_INFO returns VFIO_DEVICE_FLAGS_CXL + a populated CAP_CXL.
+ * - The HDM-backed VFIO region can be mmap'd and read/written.
+ * - The component BAR exposes a SPARSE_MMAP cap that excludes the
+ * CXL component register sub-range.
+ * - The COMP_REGS region serves CM cap-array dwords from cxl-core's
+ * snapshot (proves the cxl_passthrough_cm_rw() path is wired).
+ * - DVSEC body reads through the config-rw clipping shim return the
+ * cxl-core shadow (proves cxl_passthrough_dvsec_rw() is wired).
+ *
+ * Usage:
+ * ./vfio_cxl_type2_test <BDF>
+ * or export VFIO_SELFTESTS_BDF=<BDF> before running. The device must
+ * be bound to vfio-pci and the kernel must have CONFIG_VFIO_PCI_CXL=y.
+ *
+ * Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES.
+ */
+
+#include <fcntl.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+
+#include <linux/pci_regs.h>
+#include <linux/sizes.h>
+#include <linux/vfio.h>
+
+#include <cxl/cxl_regs.h>
+
+#include <libvfio.h>
+
+#include "kselftest_harness.h"
+
+#define PCI_DVSEC_VENDOR_ID_CXL 0x1e98
+#define PCI_DVSEC_ID_CXL_DEVICE 0x0000
+
+/*
+ * vfio-pci's region offset packing (kernel-internal in
+ * include/linux/vfio_pci_core.h, not exposed via UAPI as of writing).
+ * Provide local definitions so the selftest builds against the bare
+ * UAPI vfio.h. The guards let a future kernel hoist these to UAPI
+ * without breaking this test.
+ */
+#ifndef VFIO_PCI_OFFSET_SHIFT
+#define VFIO_PCI_OFFSET_SHIFT 40
+#endif
+#ifndef VFIO_PCI_INDEX_TO_OFFSET
+#define VFIO_PCI_INDEX_TO_OFFSET(index) ((uint64_t)(index) << VFIO_PCI_OFFSET_SHIFT)
+#endif
+
+static const char *device_bdf;
+
+/* Find a struct vfio_device_info capability by id in a GET_INFO buffer. */
+static const struct vfio_info_cap_header *
+find_device_cap(const void *buf, size_t bufsz, uint16_t id)
+{
+ const struct vfio_device_info *info = buf;
+ const struct vfio_info_cap_header *cap;
+ size_t off = info->cap_offset;
+
+ while (off && off < bufsz) {
+ cap = (const void *)((const char *)buf + off);
+ if (cap->id == id)
+ return cap;
+ off = cap->next;
+ }
+ return NULL;
+}
+
+/* Walk PCI extended capability list for the CXL Device DVSEC. */
+static uint16_t find_cxl_dvsec(struct vfio_pci_device *dev)
+{
+ uint16_t pos = PCI_CFG_SPACE_SIZE;
+ int iter = 0;
+
+ while (pos && iter++ < 64) {
+ uint32_t hdr = vfio_pci_config_readl(dev, pos);
+ uint16_t cap_id = hdr & 0xffff;
+ uint16_t next = (hdr >> 20) & 0xffc;
+ uint32_t hdr1, hdr2;
+
+ if (cap_id == PCI_EXT_CAP_ID_DVSEC) {
+ hdr1 = vfio_pci_config_readl(dev, pos + 4);
+ hdr2 = vfio_pci_config_readl(dev, pos + 8);
+ if ((hdr1 & 0xffff) == PCI_DVSEC_VENDOR_ID_CXL &&
+ (hdr2 & 0xffff) == PCI_DVSEC_ID_CXL_DEVICE)
+ return pos;
+ }
+ pos = next;
+ }
+ return 0;
+}
+
+FIXTURE(cxl_type2) {
+ struct iommu *iommu;
+ struct vfio_pci_device *dev;
+
+ struct vfio_device_info_cap_cxl cxl_cap;
+ uint16_t dvsec_base;
+
+ uint64_t hdm_region_size;
+ uint64_t comp_regs_size;
+};
+
+FIXTURE_SETUP(cxl_type2)
+{
+ uint8_t infobuf[512] = {};
+ struct vfio_device_info *info = (void *)infobuf;
+ const struct vfio_device_info_cap_cxl *cap;
+ struct vfio_region_info ri = { .argsz = sizeof(ri) };
+
+ self->iommu = iommu_init(default_iommu_mode);
+ self->dev = vfio_pci_device_init(device_bdf, self->iommu);
+
+ info->argsz = sizeof(infobuf);
+ ASSERT_EQ(0, ioctl(self->dev->fd, VFIO_DEVICE_GET_INFO, info));
+
+ if (!(info->flags & VFIO_DEVICE_FLAGS_CXL))
+ SKIP(return, "not a CXL Type-2 device");
+
+ cap = (const void *)find_device_cap(infobuf, sizeof(infobuf),
+ VFIO_DEVICE_INFO_CAP_CXL);
+ ASSERT_NE(NULL, cap);
+ memcpy(&self->cxl_cap, cap, sizeof(*cap));
+
+ ri.index = self->cxl_cap.hdm_region_idx;
+ ASSERT_EQ(0, ioctl(self->dev->fd, VFIO_DEVICE_GET_REGION_INFO, &ri));
+ self->hdm_region_size = ri.size;
+
+ ri.argsz = sizeof(ri);
+ ri.index = self->cxl_cap.comp_reg_region_idx;
+ ASSERT_EQ(0, ioctl(self->dev->fd, VFIO_DEVICE_GET_REGION_INFO, &ri));
+ self->comp_regs_size = ri.size;
+
+ self->dvsec_base = find_cxl_dvsec(self->dev);
+}
+
+FIXTURE_TEARDOWN(cxl_type2)
+{
+ vfio_pci_device_cleanup(self->dev);
+ iommu_cleanup(self->iommu);
+}
+
+TEST_F(cxl_type2, device_is_cxl)
+{
+ const struct vfio_device_info_cap_cxl *c = &self->cxl_cap;
+
+ ASSERT_EQ(VFIO_DEVICE_INFO_CAP_CXL, c->header.id);
+ ASSERT_EQ(1, c->header.version);
+ ASSERT_NE(c->hdm_region_idx, c->comp_reg_region_idx);
+ ASSERT_GE(c->hdm_region_idx, VFIO_PCI_NUM_REGIONS);
+ ASSERT_GE(c->comp_reg_region_idx, VFIO_PCI_NUM_REGIONS);
+ ASSERT_LT(c->comp_reg_bar, PCI_STD_NUM_BARS);
+ ASSERT_GT(c->comp_reg_size, 0ULL);
+ ASSERT_EQ(c->comp_reg_size, self->comp_regs_size);
+}
+
+TEST_F(cxl_type2, hdm_region_mmap_rw)
+{
+ uint64_t off = (uint64_t)VFIO_PCI_INDEX_TO_OFFSET(
+ self->cxl_cap.hdm_region_idx);
+ uint32_t pattern = 0xdeadbeefU;
+ uint32_t readback = 0;
+ void *map;
+
+ if (self->hdm_region_size < SZ_4K)
+ SKIP(return, "HDM region < 4K");
+
+ map = mmap(NULL, SZ_4K, PROT_READ | PROT_WRITE, MAP_SHARED,
+ self->dev->fd, off);
+ ASSERT_NE(MAP_FAILED, map);
+
+ *(volatile uint32_t *)map = pattern;
+ readback = *(volatile uint32_t *)map;
+ ASSERT_EQ(pattern, readback);
+
+ ASSERT_EQ(0, munmap(map, SZ_4K));
+}
+
+TEST_F(cxl_type2, component_bar_sparse_mmap)
+{
+ const uint8_t bar = self->cxl_cap.comp_reg_bar;
+ uint8_t buf[512] = {};
+ struct vfio_region_info *ri = (void *)buf;
+ const struct vfio_region_info_cap_sparse_mmap *sp;
+ const struct vfio_info_cap_header *hdr;
+ size_t off;
+ uint32_t i;
+
+ ri->argsz = sizeof(buf);
+ ri->index = bar;
+ ASSERT_EQ(0, ioctl(self->dev->fd, VFIO_DEVICE_GET_REGION_INFO, ri));
+
+ ASSERT_TRUE(ri->flags & VFIO_REGION_INFO_FLAG_CAPS);
+ off = ri->cap_offset;
+ hdr = NULL;
+ while (off && off < sizeof(buf)) {
+ hdr = (const void *)(buf + off);
+ if (hdr->id == VFIO_REGION_INFO_CAP_SPARSE_MMAP)
+ break;
+ off = hdr->next;
+ hdr = NULL;
+ }
+ ASSERT_NE(NULL, hdr);
+ sp = (const void *)hdr;
+ ASSERT_GE(sp->nr_areas, 1U);
+ for (i = 0; i < sp->nr_areas; i++) {
+ uint64_t a_start = sp->areas[i].offset;
+ uint64_t a_end = a_start + sp->areas[i].size;
+
+ ASSERT_TRUE(a_end <= self->cxl_cap.comp_reg_offset ||
+ a_start >= self->cxl_cap.comp_reg_offset +
+ self->cxl_cap.comp_reg_size);
+ }
+}
+
+TEST_F(cxl_type2, comp_regs_cm_cap_array_read)
+{
+ uint64_t off = (uint64_t)VFIO_PCI_INDEX_TO_OFFSET(
+ self->cxl_cap.comp_reg_region_idx) + CXL_CM_OFFSET;
+ uint32_t hdr = 0;
+ uint16_t cap_id;
+ uint8_t array_size;
+
+ ASSERT_EQ((ssize_t)sizeof(hdr),
+ pread(self->dev->fd, &hdr, sizeof(hdr), off));
+
+ cap_id = hdr & CXL_CM_CAP_HDR_ID_MASK;
+ array_size = (hdr & CXL_CM_CAP_HDR_ARRAY_SIZE_MASK) >> 24;
+ ASSERT_EQ(cap_id, CM_CAP_HDR_CAP_ID);
+ ASSERT_GT(array_size, 0);
+}
+
+TEST_F(cxl_type2, dvsec_lock_byte_read)
+{
+ uint8_t v;
+
+ if (!self->dvsec_base)
+ SKIP(return, "CXL Device DVSEC not found");
+
+ v = vfio_pci_config_readb(self->dev,
+ self->dvsec_base + 0x14); /* CONFIG_LOCK */
+ /* Snapshot value is host-firmware-dependent; just assert read
+ * succeeds (no SIGBUS, no -EIO).
+ */
+ (void)v;
+}
+
+/*
+ * Exercise the per-decoder COMMIT/COMMITTED state machine in
+ * cxl_passthrough_hdm_rw() (cxl-core). Steps:
+ *
+ * - Walk the CM cap-array via COMP_REGS reads to locate the HDM block.
+ * - Read decoder 0 CTRL; for a firmware-committed Type-2 device both
+ * COMMIT (bit 9) and COMMITTED (bit 10) are expected to be set.
+ * - Release COMMIT by writing CTRL with bit 9 cleared.
+ * Expected FSM transition: COMMITTED -> 0, LOCK_ON_COMMIT (bit 8) -> 0.
+ * - Re-set COMMIT. Expected: COMMITTED -> 1 (auto-set by the handler).
+ * - Restore the original CTRL value so subsequent test runs see the
+ * firmware-committed state.
+ *
+ * The CTRL writes touch the cxl-core shadow only — they do not reach
+ * the device — so the operation is safe to run repeatedly.
+ */
+TEST_F(cxl_type2, hdm_decoder_commit_fsm)
+{
+ uint64_t comp_off = (uint64_t)VFIO_PCI_INDEX_TO_OFFSET(
+ self->cxl_cap.comp_reg_region_idx);
+ uint32_t cm_hdr = 0, entry = 0;
+ uint64_t hdm_reg_offset = 0;
+ uint64_t ctrl_off;
+ uint32_t ctrl_orig, ctrl_test;
+ uint32_t array_size;
+ uint32_t i;
+
+ /* Discover HDM block offset via CM cap-array walk. */
+ ASSERT_EQ((ssize_t)sizeof(cm_hdr),
+ pread(self->dev->fd, &cm_hdr, sizeof(cm_hdr),
+ comp_off + CXL_CM_OFFSET));
+ ASSERT_EQ(CM_CAP_HDR_CAP_ID, cm_hdr & CXL_CM_CAP_HDR_ID_MASK);
+ array_size = (cm_hdr & CXL_CM_CAP_HDR_ARRAY_SIZE_MASK) >> 24;
+ ASSERT_GT(array_size, 0);
+
+ for (i = 1; i <= array_size; i++) {
+ ASSERT_EQ((ssize_t)sizeof(entry),
+ pread(self->dev->fd, &entry, sizeof(entry),
+ comp_off + CXL_CM_OFFSET + i * 4));
+ if ((entry & CXL_CM_CAP_HDR_ID_MASK) == CXL_CM_CAP_CAP_ID_HDM) {
+ hdm_reg_offset = CXL_CM_OFFSET +
+ ((entry & CXL_CM_CAP_PTR_MASK) >> 20);
+ break;
+ }
+ }
+ ASSERT_NE(0, hdm_reg_offset);
+
+ /* Read decoder 0 CTRL. */
+ ctrl_off = comp_off + hdm_reg_offset +
+ CXL_HDM_DECODER0_CTRL_OFFSET(0);
+ ASSERT_EQ((ssize_t)sizeof(ctrl_orig),
+ pread(self->dev->fd, &ctrl_orig, sizeof(ctrl_orig),
+ ctrl_off));
+
+ /* Firmware-committed Type-2 device: COMMIT + COMMITTED both set. */
+ ASSERT_TRUE(ctrl_orig & BIT(9)); /* COMMIT */
+ ASSERT_TRUE(ctrl_orig & BIT(10)); /* COMMITTED */
+
+ /* Release COMMIT; FSM clears COMMITTED and LOCK_ON_COMMIT. */
+ ctrl_test = ctrl_orig & ~BIT(9);
+ ASSERT_EQ((ssize_t)sizeof(ctrl_test),
+ pwrite(self->dev->fd, &ctrl_test, sizeof(ctrl_test),
+ ctrl_off));
+ ASSERT_EQ((ssize_t)sizeof(ctrl_test),
+ pread(self->dev->fd, &ctrl_test, sizeof(ctrl_test),
+ ctrl_off));
+ ASSERT_FALSE(ctrl_test & BIT(9)); /* COMMIT cleared */
+ ASSERT_FALSE(ctrl_test & BIT(10)); /* COMMITTED auto-cleared */
+ ASSERT_FALSE(ctrl_test & BIT(8)); /* LOCK_ON_COMMIT auto-cleared */
+
+ /* Re-set COMMIT; FSM auto-sets COMMITTED. */
+ ctrl_test |= BIT(9);
+ ASSERT_EQ((ssize_t)sizeof(ctrl_test),
+ pwrite(self->dev->fd, &ctrl_test, sizeof(ctrl_test),
+ ctrl_off));
+ ASSERT_EQ((ssize_t)sizeof(ctrl_test),
+ pread(self->dev->fd, &ctrl_test, sizeof(ctrl_test),
+ ctrl_off));
+ ASSERT_TRUE(ctrl_test & BIT(9)); /* COMMIT */
+ ASSERT_TRUE(ctrl_test & BIT(10)); /* COMMITTED auto-set */
+
+ /* Restore the original CTRL value. */
+ ASSERT_EQ((ssize_t)sizeof(ctrl_orig),
+ pwrite(self->dev->fd, &ctrl_orig, sizeof(ctrl_orig),
+ ctrl_off));
+}
+
+int main(int argc, char *argv[])
+{
+ device_bdf = vfio_selftests_get_bdf(&argc, argv);
+ return test_harness_run(argc, argv);
+}
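
The cap-array walk used by comp_regs_cm_cap_array_read and hdm_decoder_commit_fsm can be checked without hardware. The sketch below decodes the same fields with literal shifts and masks; the helper names are hypothetical, and the field widths (ID in bits 15:0, array size in bits 31:24, pointer in bits 31:20) are assumed from the shifts used in the test above and should be verified against the CXL_CM_CAP_* definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Header dword at the start of the CXL.cache/mem capability array. */
static uint16_t cm_hdr_cap_id(uint32_t hdr)
{
	return hdr & 0xffff;
}

static uint8_t cm_hdr_array_size(uint32_t hdr)
{
	return (hdr >> 24) & 0xff;
}

/* Per-capability entry dwords that follow the header. */
static uint16_t cm_entry_cap_id(uint32_t entry)
{
	return entry & 0xffff;
}

static uint16_t cm_entry_ptr(uint32_t entry)
{
	return (entry >> 20) & 0xfff;
}

/*
 * Scan @n entry dwords for capability @id. Returns the entry's
 * pointer field (an offset relative to the start of the CM block),
 * or 0 when the capability is not listed.
 */
static uint16_t cm_find_cap(const uint32_t *entries, uint8_t n, uint16_t id)
{
	uint8_t i;

	assert(entries);

	for (i = 0; i < n; i++) {
		if (cm_entry_cap_id(entries[i]) == id)
			return cm_entry_ptr(entries[i]);
	}
	return 0;
}
```

In the selftest the entries come from pread() calls on the COMP_REGS region rather than from an in-memory array, and the returned pointer is added to CXL_CM_OFFSET to locate the HDM decoder block.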
--
2.25.1
* [PATCH v3 10/11] docs: vfio-pci: Document CXL Type-2 device passthrough
2026-06-25 16:53 [PATCH v3 00/11] vfio/pci: Add CXL Type-2 device passthrough support mhonap
` (8 preceding siblings ...)
2026-06-25 16:54 ` [PATCH v3 09/11] selftests/vfio: Add CXL Type-2 device passthrough smoke test mhonap
@ 2026-06-25 16:54 ` mhonap
2026-06-25 16:54 ` [PATCH v3 11/11] vfio/pci: Provide opt-out for CXL Type-2 extensions mhonap
2026-06-26 9:16 ` [PATCH v3 00/11] vfio/pci: Add CXL Type-2 device passthrough support Richard Cheng
11 siblings, 0 replies; 13+ messages in thread
From: mhonap @ 2026-06-25 16:54 UTC (permalink / raw)
To: djbw, alex, jgg, jic23, dave.jiang, ankita,
alejandro.lucero-palau, alison.schofield, dave, dmatlack, gourry,
ira.weiny
Cc: cjia, kjaju, vsethi, zhiw, mhonap, kvm, linux-cxl, linux-doc,
linux-kernel, linux-kselftest
From: Manish Honap <mhonap@nvidia.com>
Capture the ownership model, bind sequence, region layout, and the
DVSEC + HDM + CM cap-array virtualization contract for vfio-pci
Type-2 device passthrough in Documentation/driver-api/vfio-pci-cxl.rst.
cxl-core owns the CXL register virtualization through
devm_cxl_passthrough_create() and the cxl_passthrough_*_rw()
helpers; vfio-pci is a transport that forwards guest reads and
writes through them. The HDM HPA range is mapped by vfio for the
mmappable HDM region. Topology constraints and host-bridge decoder
limitations are listed under Known limitations.
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
Documentation/driver-api/index.rst | 1 +
Documentation/driver-api/vfio-pci-cxl.rst | 282 ++++++++++++++++++++++
2 files changed, 283 insertions(+)
create mode 100644 Documentation/driver-api/vfio-pci-cxl.rst
diff --git a/Documentation/driver-api/index.rst b/Documentation/driver-api/index.rst
index eaf7161ff957..52f0c06a376a 100644
--- a/Documentation/driver-api/index.rst
+++ b/Documentation/driver-api/index.rst
@@ -47,6 +47,7 @@ of interest to most developers working on device drivers.
vfio-mediated-device
vfio
vfio-pci-device-specific-driver-acceptance
+ vfio-pci-cxl
Bus-level documentation
=======================
diff --git a/Documentation/driver-api/vfio-pci-cxl.rst b/Documentation/driver-api/vfio-pci-cxl.rst
new file mode 100644
index 000000000000..1527b7dd85d0
--- /dev/null
+++ b/Documentation/driver-api/vfio-pci-cxl.rst
@@ -0,0 +1,282 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
+
+===========================================
+VFIO-PCI: CXL Type-2 device passthrough
+===========================================
+
+:Author: Manish Honap <mhonap@nvidia.com>
+
+Overview
+========
+
+vfio-pci-core, when built with ``CONFIG_VFIO_PCI_CXL=y``, passes a
+CXL Type-2 accelerator (CXL r4.0, HDM-D / HDM-DB) through to a KVM
+guest. The host firmware commits the endpoint's HDM decoder before
+vfio-pci binds; the guest sees a CXL Type-2 device whose CXL.mem
+range is already programmed and locked. The guest may inspect the
+HDM Decoder Capability block and DVSEC Device capability via spec-
+defined paths, and access the device's CXL.mem range as
+mmap'd memory.
+
+Scope
+=====
+
+The supported scope is intentionally narrow:
+
+* One CXL endpoint per host bridge.
+* The endpoint exposes exactly one HDM decoder (decoder 0).
+* No interleave.
+* Host firmware has committed the endpoint HDM decoder before
+ vfio-pci probes. Devices whose HDM decoder is *uncommitted* fail
+ vfio-pci bind cleanly.
+* The host bridge is in single-RP-passthrough mode (the CXL host
+ bridge's own HDM decoder is not used; CFMWS-to-RP decode flows
+ implicitly). This assumption is currently *not enforced* by
+ vfio-pci-core; it is a known limitation, see the Known
+ limitations section.
+
+Multi-decoder, interleave, FLR / reset state-machine integration,
+and host-bridge HDM decoder programming are explicitly out of scope.
+Adding any of them is additive on top of the contract described
+below.
+
+Driver model
+============
+
+There is no dedicated ``vfio-cxl`` PCI driver. vfio-pci is the only
+driver that binds to the host PCI device. When built with
+``CONFIG_VFIO_PCI_CXL=y``, vfio-pci-core calls into the cxl subsystem
+to do five things at bind time:
+
+1. ``devm_cxl_dev_state_create()`` — allocate per-device CXL state
+ embedded in ``struct vfio_pci_cxl_state``.
+2. ``cxl_pci_setup_regs()`` + ``cxl_get_hdm_info()`` — probe the
+ Register Locator DVSEC and harvest the HDM block's BAR-relative
+ offset and size.
+3. ``cxl_await_range_active()`` — wait for the firmware-committed
+ range to become live.
+4. ``devm_cxl_passthrough_create()`` — snapshot the CXL Device DVSEC
+ body, the HDM Decoder block, and the CXL.cache/mem cap-array
+ prefix into shadows owned by cxl-core. All subsequent
+ register-virtualization happens inside ``drivers/cxl/core/passthrough.c``.
+5. ``devm_cxl_probe_mem()`` — register a ``cxl_memdev``, enumerate
+ the endpoint port, and auto-attach the firmware-committed
+ region. cxl_mem binds to the memdev as it would for any other
+ Type-2 accelerator.
+
+Ownership split
+===============
+
+Each device-visible surface is owned by exactly one subsystem:
+
+============================================ ==============================================
+Surface Owner
+============================================ ==============================================
+PCI config (non-DVSEC, non-CXL) vfio-pci-core ``vconfig`` (existing perm-bits)
+CXL Device DVSEC body cxl-core ``cxl_passthrough_dvsec_rw()``
+HDM Decoder Capability block cxl-core ``cxl_passthrough_hdm_rw()``
+CM cap-array (read-only snapshot) cxl-core ``cxl_passthrough_cm_rw()``
+``cxl_memdev`` / endpoint port / autoregion cxl-core ``devm_cxl_probe_mem()``
+HDM HPA range mapping vfio-pci ``request_mem_region`` + ``memremap``
+Sparse mmap layout for the component BAR vfio-pci
+============================================ ==============================================
+
+The vfio side holds no shadow buffer of its own. ``vfio_pci_cxl_state``
+caches small scalars (DVSEC offset/size, HDM offset/size, component
+BAR layout) for dispatch decisions; the actual virtualization
+semantics live in cxl-core.
+
+Bind sequence
+=============
+
+``vfio_pci_cxl_acquire()`` is called from
+``vfio_pci_core_register_device()`` at PCI bind time. The sequence::
+
+ 0. devm_cxl_dev_state_create(parent, CXL_DEVTYPE_DEVMEM, dsn,
+ dvsec_off, vfio_pci_cxl_state, cxlds,
+ /*mbox=*/false)
+
+ 1. pcie_is_cxl() and pci_find_dvsec_capability(CXL_DEVICE)
+ -> -ENODEV if either is absent
+ -> -ENODEV if the DVSEC's MEM_CAPABLE bit is clear
+
+ 2. pci_enable_device_mem()
+
+ 2a. cxl_pci_setup_regs(CXL_REGLOC_RBI_COMPONENT)
+ 2b. cxl_get_hdm_info() — REJECT hdm_count != 1 with -EOPNOTSUPP
+ 2c. cxl_regblock_get_bar_info()
+ 2d. cxl_await_range_active()
+ 2e. devm_cxl_passthrough_create(&pdev->dev, &cxlds)
+
+ 3. pci_disable_device()
+ Clears PCI_COMMAND_MASTER but NOT PCI_COMMAND_MEMORY (see
+ do_pci_disable_device() in drivers/pci/pci.c). Subsequent
+ MMIO from step 4 still succeeds.
+
+ 4. devm_cxl_probe_mem(&cxlds, &hpa_range)
+ Registers the memdev, enumerates the endpoint port, attaches
+ the firmware-committed autoregion.
+
+ 5. request_mem_region(hpa_base, hpa_size) + memremap_wb()
+
+ 6. vdev->cxl = cxl (state published; HDM and COMP_REGS regions
+ are registered later when the VFIO fd is opened)
+
+Fail-closed semantics
+---------------------
+
+Three conditions are mapped to "not a CXL device; caller falls back to
+plain vfio-pci": ``pcie_is_cxl()`` false, DVSEC absent, ``MEM_CAPABLE``
+clear. All three return ``-ENODEV`` from
+``vfio_pci_cxl_acquire()``; the caller treats them as a silent
+fall-through.
+
+Any other negative errno from the bind sequence aborts the vfio-pci
+bind entirely. The guest never sees a half-initialised CXL device.
+Once ``devm_cxl_probe_mem()`` has succeeded the published memdev
+holds a pointer into the embedded ``cxl_dev_state``; a failure in
+``vfio_cxl_map_hdm()`` after that point cannot ``devm_kfree(cxl)``
+and leaves the state allocated for the lifetime of the PCI device
+(devres unwinds it at pdev removal).
+
+VFIO regions exposed
+====================
+
+When the VFIO fd is first opened, ``vfio_pci_cxl_open()`` registers
+two additional regions on top of the standard vfio-pci BARs / config
+region:
+
+HDM region (``VFIO_REGION_SUBTYPE_CXL``)
+ Mappable view of the device's firmware-committed HPA range.
+
+ * ``mmap``: fault handler does
+ ``vmf_insert_pfn(vma, addr, PHYS_PFN(hpa_base + off))``. The
+ guest gets the same backing physical memory the host sees.
+ * ``pread`` / ``pwrite``: served from the ``memremap_wb()`` kva
+ captured at bind time.
+
+COMP_REGS region (``VFIO_REGION_SUBTYPE_CXL_COMP_REGS``)
+ Shadow of the CXL component register sub-range. ``pread`` /
+ ``pwrite`` only; ``mmap`` is intentionally not supported (the VMM
+ uses this region instead of mmapping the BAR). Dword-aligned
+ access only; sub-dword accesses return ``-EINVAL``.
+
+ Dispatch by offset:
+
+ ============================================ =================================
+ Offset range cxl-core helper
+ ============================================ =================================
+ ``< CXL_CM_OFFSET`` zero-fill (reserved)
+ ``CXL_CM_OFFSET .. hdm_reg_offset`` ``cxl_passthrough_cm_rw()``
+ ``hdm_reg_offset .. +hdm_reg_size`` ``cxl_passthrough_hdm_rw()``
+ ``>= hdm_reg_offset + hdm_reg_size`` zero-fill (reserved)
+ ============================================ =================================
+
+DVSEC virtualization contract
+=============================
+
+The CXL Device DVSEC body is reached through the standard PCI
+config-space path. ``vfio_pci_config_rw_single()`` clips chunks at
+the DVSEC body boundary via ``vfio_pci_cxl_config_boundary()`` and
+forwards body bytes to ``vfio_pci_cxl_config_rw()``, which in turn
+calls ``cxl_passthrough_dvsec_rw()``.
+
+Per-field write semantics (CXL r4.0 §8.1.3):
+
+============================================ ==============================================
+Field (offset from DVSEC cap base) Spec attribute / behaviour
+============================================ ==============================================
+CAPABILITY (0x0a) HwInit — writes dropped
+CONTROL (0x0c) RWL — gated on DVSEC CONFIG_LOCK
+STATUS (0x0e) RW1C
+CONTROL2 (0x10) RWL — gated on DVSEC CONFIG_LOCK
+STATUS2 (0x12) RW1C
+LOCK (0x14) RWO — first 1-write latches CONFIG_LOCK
+Range1 SIZE_HI/LO BASE_HI/LO (0x18..0x27) HwInit — writes dropped
+Range2 SIZE_HI/LO BASE_HI/LO (0x28..0x37) RsvdZ — writes dropped
+============================================ ==============================================
+
+HDM virtualization contract
+===========================
+
+Per CXL r4.0 §8.2.4.20, on the single firmware-committed decoder:
+
+============================================ ==============================================
+Field (offset from HDM block base) Spec attribute / behaviour
+============================================ ==============================================
+HDM Decoder Capability Header (0x00) HwInit — writes dropped
+HDM Decoder Global Control (0x04) RW — shadow
+Decoder 0 BASE_LO / BASE_HI RWL — gated on COMMITTED or LOCK_ON_COMMIT
+Decoder 0 SIZE_LO / SIZE_HI RWL — same gate
+Decoder 0 CTRL Implements COMMIT → COMMITTED handshake; once
+ COMMITTED, only COMMIT toggles are honoured
+============================================ ==============================================
+
+CM cap-array
+============
+
+The CM cap-array (CXL r4.0 §8.2.4) prefix is snapshotted from the
+device's component register MMIO at bind time and served read-only
+through ``cxl_passthrough_cm_rw()``. Guest writes to the cap-array
+are silently dropped.
+
+UAPI: CAP_CXL
+=============
+
+``VFIO_DEVICE_GET_INFO`` returns ``VFIO_DEVICE_FLAGS_CXL`` and a
+``VFIO_DEVICE_INFO_CAP_CXL`` capability::
+
+ struct vfio_device_info_cap_cxl {
+ struct vfio_info_cap_header header;
+ __u32 flags;
+ #define VFIO_CXL_CAP_HOST_FIRMWARE_COMMITTED (1 << 0)
+ __u32 hdm_region_idx;
+ __u32 comp_reg_region_idx;
+ __u32 comp_reg_bar;
+ __u32 __resv;
+ __u64 comp_reg_offset;
+ __u64 comp_reg_size;
+ };
+
+``VFIO_DEVICE_GET_REGION_INFO`` on the component BAR returns a
+``VFIO_REGION_INFO_CAP_SPARSE_MMAP`` that excludes
+``[comp_reg_offset, comp_reg_offset + comp_reg_size)`` from the
+mmappable areas.
+
+Known limitations
+=================
+
+* Host bridge HDM decoder programming is not driven by this driver.
+ The driver silently assumes single-RP-passthrough topology (the
+ CXL host bridge's own HDM decoder is not used). Two remediations
+ are possible: either refuse to bind when the topology is not
+ single-RP-passthrough, or extend the kernel ABI so a host-bridge
+ HDM decoder programmer can attest the lock before vfio bind.
+ Either option leaves the existing contract intact or, at most,
+ adds a single boolean to CAP_CXL.
+
+* Function-level reset (FLR) does not re-snapshot the shadows.
+ Guests that issue FLR will see stale HDM and DVSEC state after
+ the reset.
+
+* Multi-decoder devices return ``-EOPNOTSUPP`` at bind.
+
+* Hotplug while the device is held by vfio is not supported.
+
+* Raw BAR read/write into the CXL component register sub-range is
+ unsupported. VMMs must use the COMP_REGS region.
+
+Selftest
+========
+
+``tools/testing/selftests/vfio/vfio_cxl_type2_test`` exercises the
+five surfaces:
+
+* ``device_is_cxl`` — GET_INFO returns FLAGS_CXL + CAP_CXL.
+* ``hdm_region_mmap_rw`` — mmap + read/write pattern.
+* ``component_bar_sparse_mmap`` — SPARSE_MMAP cap excludes the CXL
+ block.
+* ``comp_regs_cm_cap_array_read`` — CM cap-array header is served
+ from the cxl-core snapshot.
+* ``dvsec_lock_byte_read`` -- DVSEC config-rw clipping shim is wired.
--
2.25.1
* [PATCH v3 11/11] vfio/pci: Provide opt-out for CXL Type-2 extensions
2026-06-25 16:53 [PATCH v3 00/11] vfio/pci: Add CXL Type-2 device passthrough support mhonap
` (9 preceding siblings ...)
2026-06-25 16:54 ` [PATCH v3 10/11] docs: vfio-pci: Document CXL Type-2 device passthrough mhonap
@ 2026-06-25 16:54 ` mhonap
2026-06-26 9:16 ` [PATCH v3 00/11] vfio/pci: Add CXL Type-2 device passthrough support Richard Cheng
11 siblings, 0 replies; 13+ messages in thread
From: mhonap @ 2026-06-25 16:54 UTC (permalink / raw)
To: djbw, alex, jgg, jic23, dave.jiang, ankita,
alejandro.lucero-palau, alison.schofield, dave, dmatlack, gourry,
ira.weiny
Cc: cjia, kjaju, vsethi, zhiw, mhonap, kvm, linux-cxl, linux-doc,
linux-kernel, linux-kselftest
From: Manish Honap <mhonap@nvidia.com>
Add an opt-out so users can keep vfio-pci's CXL extensions out of the
path for individual devices or for an entire vfio-pci instance. The
build-time gate is CONFIG_VFIO_PCI_CXL; the runtime gates are:
- Module parameter vfio_pci.disable_cxl (bool, 0444). Setting
disable_cxl=1 at modprobe time makes vfio_pci_probe() set
vdev->disable_cxl on every device it binds.
- Variant drivers (mlx5, pds, hisi, nvgrace, xe, etc.) may set
vdev->disable_cxl=true in their own probe for per-device control
without needing the module parameter. The bit lives on
struct vfio_pci_core_device so it's reachable from any variant.
vfio_pci_cxl_acquire() consults vdev->disable_cxl as the very first
check and returns -ENODEV when set, which makes vfio-pci-core treat
the device as a plain (non-CXL) PCI passthrough — no CAP_CXL, no HDM
or COMP_REGS VFIO regions, no DVSEC clipping shim.
This mirrors the long-standing disable_denylist opt-out shape.
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
drivers/vfio/pci/cxl/vfio_cxl_core.c | 9 +++++++++
drivers/vfio/pci/vfio_pci.c | 9 +++++++++
include/linux/vfio_pci_core.h | 1 +
3 files changed, 19 insertions(+)
diff --git a/drivers/vfio/pci/cxl/vfio_cxl_core.c b/drivers/vfio/pci/cxl/vfio_cxl_core.c
index 8a00b776d7c7..905f74f4e725 100644
--- a/drivers/vfio/pci/cxl/vfio_cxl_core.c
+++ b/drivers/vfio/pci/cxl/vfio_cxl_core.c
@@ -234,6 +234,15 @@ int vfio_pci_cxl_acquire(struct vfio_pci_core_device *vdev)
u16 dvsec;
int rc;
+ /*
+ * Honour the per-device opt-out (set by vfio-pci's module
+ * parameter disable_cxl, or by a variant driver before
+ * registration). Returning -ENODEV here makes the caller
+ * treat this device as plain vfio-pci.
+ */
+ if (vdev->disable_cxl)
+ return -ENODEV;
+
if (!pcie_is_cxl(pdev))
return -ENODEV;
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 0c771064c0b8..fd226cb65d8b 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -60,6 +60,12 @@ static bool disable_denylist;
module_param(disable_denylist, bool, 0444);
MODULE_PARM_DESC(disable_denylist, "Disable use of device denylist. Disabling the denylist allows binding to devices with known errata that may lead to exploitable stability or security issues when accessed by untrusted users.");
+#if IS_ENABLED(CONFIG_VFIO_PCI_CXL)
+static bool disable_cxl;
+module_param(disable_cxl, bool, 0444);
+MODULE_PARM_DESC(disable_cxl, "Disable CXL Type-2 extensions for all devices bound to vfio-pci. Variant drivers may instead set vdev->disable_cxl in their probe for per-device control without needing this parameter.");
+#endif
+
static bool vfio_pci_dev_in_denylist(struct pci_dev *pdev)
{
switch (pdev->vendor) {
@@ -166,6 +172,9 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
return PTR_ERR(vdev);
dev_set_drvdata(&pdev->dev, vdev);
+#if IS_ENABLED(CONFIG_VFIO_PCI_CXL)
+ vdev->disable_cxl = disable_cxl;
+#endif
vdev->pci_ops = &vfio_pci_dev_ops;
ret = vfio_pci_core_register_device(vdev);
if (ret)
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index 541c1911e090..20e9599b3bd7 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -127,6 +127,7 @@ struct vfio_pci_core_device {
bool needs_pm_restore:1;
bool pm_intx_masked:1;
bool pm_runtime_engaged:1;
+ bool disable_cxl:1;
struct pci_saved_state *pci_saved_state;
struct pci_saved_state *pm_save;
int ioeventfds_nr;
--
2.25.1
* Re: [PATCH v3 00/11] vfio/pci: Add CXL Type-2 device passthrough support
2026-06-25 16:53 [PATCH v3 00/11] vfio/pci: Add CXL Type-2 device passthrough support mhonap
` (10 preceding siblings ...)
2026-06-25 16:54 ` [PATCH v3 11/11] vfio/pci: Provide opt-out for CXL Type-2 extensions mhonap
@ 2026-06-26 9:16 ` Richard Cheng
11 siblings, 0 replies; 13+ messages in thread
From: Richard Cheng @ 2026-06-26 9:16 UTC (permalink / raw)
To: mhonap
Cc: djbw, alex, jgg, jic23, dave.jiang, ankita,
alejandro.lucero-palau, alison.schofield, dave, dmatlack, gourry,
ira.weiny, cjia, kjaju, vsethi, zhiw, kvm, linux-cxl, linux-doc,
linux-kernel, linux-kselftest
On Thu, Jun 25, 2026 at 10:23:56PM +0800, mhonap@nvidia.com wrote:
> From: Manish Honap <mhonap@nvidia.com>
>
> CXL Type-2 accelerators (CXL.mem-capable GPUs and similar) cannot be
> passed through to virtual machines with stock vfio-pci because the
> driver has no concept of HDM decoder management, HDM region exposure,
> or component register virtualization. This series adds those three
> pieces, sufficient for a guest to use the device's firmware-committed
> coherent memory under UVM / ATS.
>
> v3 is a rewrite of the v2 framework form, responding to Dan's request
> in the v2 review for "less emulation, narrower interfaces, and a
> closer mapping to the spec language."
> In this release, cxl-core exposes four EXPORT_SYMBOL_GPL helpers behind
> an opaque handle. vfio-pci becomes a thin transport on top of those.
> Please see "Changes since v2" and "Reviewer feedback addressed" below for
> the per-area summary.
>
Hi Manish,
Thanks for the work. I ran some tests with your patches applied on a real
CXL Type-2 device, a GPU with a FW-committed HDM decoder. I want to
report the results early: the acquire path works, but the first CPU access
to the mapped HDM region crashes the host.
The device BDF is 0002:81:00.0, with CXLCtl: Cache+ IO+ Mem+ and a firmware-committed HDM decoder.
Binding the device to vfio-pci brought the CXL Type-2 path up cleanly:
"""
# modprobe vfio-pci
# echo vfio-pci > /sys/bus/pci/devices/0002:81:00.0/driver_override
# echo 0002:81:00.0 > /sys/bus/pci/drivers_probe
"""
A mem0/endpoint19/region1 appeared, and the device_is_cxl selftest passed.
When running the selftest from patch 9:
"""
# sudo ./vfio_cxl_type2_test 0002:81:00.0
ok 1 cxl_type2.device_is_cxl
# RUN cxl_type2.hdm_region_mmap_rw
"""
At this point, the machine hung and crashed.
hdm_region_mmap_rw mmaps the HDM region and does a CPU read/write to it. That
access never returned. I couldn't capture dmesg or a trace before it crashed.
I'm not sure if this is a platform/FW issue or something in how the region
is mapped.
Have you exercised hdm_region_mmap_rw on real hardware, or only against the cxl_test mock?
If a guest can hang the host just by touching its mapped memory, that needs to be fixed.
Best regards,
Richard Cheng.
> Motivation
> ==========
>
> A CXL Type-2 device exposes its HDM-mapped device memory through HDM
> decoders that BIOS programs and commits at boot. To pass such a
> device to a guest, vfio-pci has to do three things at once:
>
> 1. Surface the firmware-committed HDM-mapped HPA range as a guest-
> mmappable region.
>
> 2. Surface a CXL-spec-compliant view of the CXL Device DVSEC body,
> the HDM Decoder Capability block, and the CXL.cache/mem cap-array
> prefix, so the guest's CXL driver enumerates the same topology
> the host saw.
>
> 3. Keep the host's committed decoder configuration intact (the
> physical decoder is never reprogrammed) while letting the guest
> observe and manage a shadow that follows the per-field write
> semantics in the spec.
>
> The series builds on Alejandro Lucero-Palau's v28 work
> applied on for-7.3/cxl-type2-enabling [1] (sfc is the in-tree consumer
> today). vfio-pci becomes the second consumer.
>
> Architecture
> ============
>
> cxl-core owns the CXL semantics. A new file
> drivers/cxl/core/passthrough.c (gated by hidden Kconfig
> CXL_VFIO_PASSTHROUGH) provides four exported symbols:
>
> struct cxl_passthrough *
> devm_cxl_passthrough_create(struct device *dev,
> struct cxl_dev_state *cxlds);
>
> int cxl_passthrough_dvsec_rw(p, off, val, sz, write);
> int cxl_passthrough_hdm_rw (p, off, val, write);
> int cxl_passthrough_cm_rw (p, off, val, write);
>
> cxl_passthrough is an opaque handle; vfio-pci sees no cxl-internal
> struct pointers. The shadows are snapshotted at create time: the
> DVSEC body from PCI config space dword by dword, the CM cap-array and
> HDM block from the cxl-core MMIO mapping at cxlds->reg_map.base.
> Per-field write semantics follow below:
> CXL r4.0 8.1.3 DVSEC:
> - LOCK is RWO,
> - CONTROL/CONTROL2 are RWL gated on CONFIG_LOCK,
> - STATUS/STATUS2 are RW1C,
> - RANGE1 is HwInit, RANGE2 is RsvdZ
> CXL r4.0 8.2.4.20 HDM:
> - GLOBAL_CTRL RW,
> - decoder CTRL implements COMMIT/COMMITTED,
> - decoder BASE/SIZE RWL gated on COMMITTED or LOCK_ON_COMMIT,
> - cap header HwInit.
>
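To make the attribute vocabulary above concrete for anyone following the
thread, here is a minimal sketch of how RW1C, RWO and RWL writes behave on
a shadow register. This is not the code in passthrough.c: the struct, the
function names and the CONFIG_LOCK bit position are all made up for
illustration.

```c
#include <stdint.h>

/* Hypothetical shadow of three 16-bit DVSEC fields; not the series' layout. */
struct dvsec_shadow {
	uint16_t control;	/* RWL: writable until CONFIG_LOCK latches */
	uint16_t status;	/* RW1C: writing 1 clears a bit */
	uint16_t lock;		/* RWO: CONFIG_LOCK latches on the first 1-write */
};

#define SKETCH_CONFIG_LOCK	0x0001u	/* illustrative bit position */

/* RW1C: bits written as 1 are cleared, bits written as 0 are untouched. */
static void shadow_write_rw1c(uint16_t *reg, uint16_t val)
{
	*reg &= (uint16_t)~val;
}

/* RWO: the first write of 1 latches the lock; later writes cannot clear it. */
static void shadow_write_lock(struct dvsec_shadow *s, uint16_t val)
{
	if (!(s->lock & SKETCH_CONFIG_LOCK) && (val & SKETCH_CONFIG_LOCK))
		s->lock |= SKETCH_CONFIG_LOCK;
}

/* RWL: ordinary read-write until CONFIG_LOCK is set, then writes are dropped. */
static void shadow_write_rwl(struct dvsec_shadow *s, uint16_t val)
{
	if (s->lock & SKETCH_CONFIG_LOCK)
		return;
	s->control = val;
}

/* HwInit and RsvdZ: guest writes are dropped, so the shadow is not touched. */
static void shadow_write_dropped(uint16_t val)
{
	(void)val;
}
```

The point of the sketch is only that every attribute reduces to a small,
side-effect-free update of the shadow, which is what lets the physical
registers stay untouched.
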
> vfio-pci becomes a thin transport. The new module
> drivers/vfio/pci/cxl/ exposes two VFIO regions.
>
> VFIO_REGION_SUBTYPE_CXL (HDM region): mmappable view of the
> HDM-mapped HPA. The mmap fault handler calls vmf_insert_pfn() from
> the physical HPA. pread/pwrite go through the memremap_wb() kva
> captured at bind time.
>
> VFIO_REGION_SUBTYPE_CXL_COMP_REGS (component register shadow):
> pread/pwrite only, dword-aligned (-EINVAL on misalignment).
> Each dword dispatches by offset to cxl_passthrough_cm_rw() or
> cxl_passthrough_hdm_rw(). No shadow state on the vfio side; cxl-core
> enforces the spec.
>
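For reference, the per-dword dispatch described above (and tabulated in the
documentation patch) can be sketched roughly as follows. The function name,
the enum and the value chosen for the CM block offset are assumptions for
illustration; the real code calls cxl_passthrough_cm_rw() and
cxl_passthrough_hdm_rw() rather than returning a tag.

```c
#include <errno.h>
#include <stdint.h>

#define SKETCH_CM_OFFSET 0x1000u /* assumed start of the CXL.cache/mem block */

enum comp_target { TARGET_ZERO_FILL, TARGET_CM, TARGET_HDM };

/*
 * Classify one dword access at byte offset off within the COMP_REGS region.
 * hdm_off and hdm_size describe the HDM block inside the region (hypothetical
 * parameters). Returns a comp_target value, or -EINVAL if off is not
 * dword-aligned.
 */
static int comp_regs_classify(uint64_t off, uint64_t hdm_off,
			      uint64_t hdm_size)
{
	if (off & 3)
		return -EINVAL;
	if (off < SKETCH_CM_OFFSET)
		return TARGET_ZERO_FILL;	/* reserved, reads as zero */
	if (off < hdm_off)
		return TARGET_CM;		/* CM cap-array snapshot */
	if (off < hdm_off + hdm_size)
		return TARGET_HDM;		/* HDM decoder shadow */
	return TARGET_ZERO_FILL;		/* reserved, reads as zero */
}
```
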
> CXL DVSEC config-space accesses use a clipping shim in
> vfio_pci_config_rw_single(). A config-space chunk that crosses the
> DVSEC body boundary is split: header bytes go through the generic
> perm-bits path, body bytes go through cxl_passthrough_dvsec_rw().
> The shim replaces v2's approach of repointing ecap_perms[].
>
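The clipping idea reads to me roughly like the sketch below: given the
current position and the bytes left in a config access, work out how many
bytes can be handled before the access crosses into or out of the DVSEC
body. The helper name and signature are hypothetical and are not the
interface of vfio_pci_cxl_config_boundary().

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return how many of the count bytes starting at pos can be handled before
 * crossing a boundary of the DVSEC body [body_start, body_end). *in_body
 * reports whether that chunk belongs to the body (CXL helper path) or not
 * (generic perm-bits path).
 */
static size_t dvsec_clip_chunk(uint32_t pos, size_t count,
			       uint32_t body_start, uint32_t body_end,
			       bool *in_body)
{
	uint32_t limit;

	if (pos < body_start) {
		*in_body = false;
		limit = body_start;	/* stop just before the body */
	} else if (pos < body_end) {
		*in_body = true;
		limit = body_end;	/* stop at the end of the body */
	} else {
		*in_body = false;
		return count;		/* past the body: nothing to clip */
	}
	return count < (size_t)(limit - pos) ? count : (size_t)(limit - pos);
}
```

A caller would loop, advancing pos by the returned length, until count
reaches zero, so a 4-byte access that straddles the header/body boundary
becomes two 2-byte chunks on different paths.
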
> Sparse-mmap is exposed on the component BAR so userspace can mmap the
> non-component portions directly; only the CXL component register
> sub-range goes through pread/pwrite emulation. The CXL sub-range is
> also skipped from vfio-pci-core's request_selected_regions() set
> because cxl-core's devm_cxl_probe_mem() already holds a
> request_mem_region() on it; the asymmetric skip is matched by an
> asymmetric release on disable().
>
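As a sanity check on the sparse-mmap layout, this is how I picture the
mmappable areas being derived from the excluded component register
sub-range. It is a simplified sketch: the struct is a stand-in for the
UAPI area descriptor, the function name is invented, page alignment is
ignored, and the excluded range is assumed to lie fully inside the BAR.

```c
#include <stdint.h>

/* Stand-in for one sparse-mmap area (offset and size within the BAR). */
struct mmap_area {
	uint64_t offset;
	uint64_t size;
};

/*
 * Fill areas[] with the parts of a BAR of bar_size bytes that lie outside
 * [excl_off, excl_off + excl_size). Returns the number of areas (0 to 2).
 */
static int sparse_areas(uint64_t bar_size, uint64_t excl_off,
			uint64_t excl_size, struct mmap_area areas[2])
{
	uint64_t excl_end = excl_off + excl_size;
	int n = 0;

	if (excl_off > 0)		/* area below the excluded range */
		areas[n++] = (struct mmap_area){ 0, excl_off };
	if (excl_end < bar_size)	/* area above the excluded range */
		areas[n++] = (struct mmap_area){ excl_end, bar_size - excl_end };
	return n;
}
```
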
> Scope and out-of-scope
> ======================
>
> In scope (rejected at create time with -EOPNOTSUPP otherwise):
>
> - Firmware-committed devices (HOST_FIRMWARE_COMMITTED set).
> - Single HDM decoder (hdm_count == 1).
> - No interleave (IW == 0).
>
> Out of scope, deferred for follow-on work:
>
> - Multi-decoder devices and interleave.
> - Guest-driven (non-firmware-committed) HDM commit.
> - Hotplug, FLR, and sibling-function reset of CXL Type-2 devices.
>
> Changes since v2
> ================
>
> This is a rewrite, not an incremental update. The structure of the
> series changed (20 patches in v2 to 11 in v3) because v3 collapses
> v2 patches 9-15 (detection, HDM emulation, media readiness, region
> management, HDM region, DVSEC emulation) into one cxl-core helper
> file and one vfio-pci consumer.
>
> Framework replaced by narrow opaque-handle helpers (patches 6, 8)
>
> v2 carried a generic register-emulation framework split across four
> state-machine files in cxl-core.
> v3 collapses it into one file: drivers/cxl/core/passthrough.c
> exposing the four EXPORT_SYMBOL_GPL helpers above behind a struct
> cxl_passthrough opaque handle.
>
> Shadow ownership moved into cxl-core (patches 6, 8)
>
> vfio-pci no longer keeps any per-field state. It forwards
> (offset, value) into cxl-core, and cxl-core enforces the spec
> (RWO, RWL, RW1C, HwInit, RsvdZ) with explicit CXL r4.0 section
> references in the switch arms.
>
> DVSEC config-space clipping shim (patch 8)
>
> v2 repointed ecap_perms[] to redirect CXL DVSEC reads and writes.
> v3 keeps ecap_perms[] untouched and clips per-config-access chunks
> at the DVSEC body boundary in vfio_pci_config_rw_single(); header bytes
> go through the generic perm-bits path, body bytes go through
> cxl_passthrough_dvsec_rw(). The shim is local to the per-device
> path.
>
> CONFIG_VFIO_PCI_CXL gates the new module (patch 7)
>
> v2 had a CONFIG_VFIO_CXL_CORE Kconfig stub; v3 renames it to
> CONFIG_VFIO_PCI_CXL to match the vfio-pci naming convention.
> The hidden CXL_VFIO_PASSTHROUGH selects the cxl-core helper file
> on demand. With both disabled, the cxl-core size is unchanged.
>
> UAPI rewritten with named fields (patch 5)
>
> vfio_device_info_cap_cxl in v3 carries:
> flags + HOST_FIRMWARE_COMMITTED bit
> hdm_region_idx
> comp_reg_region_idx
> comp_reg_bar
> comp_reg_offset
> comp_reg_size
> The DPA terminology is renamed to HDM region throughout.
> CACHE_CAPABLE (HDM-DB indicator) is dropped;
> it was informational only in v2 with no caller, and re-adding it
> is left to a later series that actively plumbs CXL.cache.
>
> Selftests trimmed (patch 9)
>
> v2 carried selftests for device detection, capability parsing,
> region enumeration, HDM register emulation, HDM mmap with
> page-fault insertion, FLR invalidation, and DVSEC register
> emulation. v3 keeps a smoke-test set of six focused tests:
>
> device_is_cxl GET_INFO advertises FLAGS_CXL
> and a populated CAP_CXL.
> hdm_region_mmap_rw mmap one page, write+read back.
> component_bar_sparse_mmap SPARSE_MMAP cap excludes the
> CXL component register sub-range.
> comp_regs_cm_cap_array_read pread of the CM cap-array
> header at CXL_CM_OFFSET succeeds
> (CAP_ID == 1).
> dvsec_lock_byte_read pread of the DVSEC CONFIG_LOCK
> byte through the clipping shim
> succeeds.
> hdm_decoder_commit_fsm COMMIT / COMMITTED state machine
> and LOCK_ON_COMMIT behaviour.
>
> FLR invalidation, page-fault insertion under load, and full
> DVSEC field-by-field write coverage are deferred to a follow-on
> selftest series. The current six are the minimal set that
> exercises the kernel-side contract end-to-end.
>
> cxl-core prep patches split (patches 1-4)
>
> v3 keeps the cxl-side enablers from v2 patches 1-4 but each as
> a standalone change so the cxl maintainer can review the helper
> API independently of the vfio consumer:
>
> [1/11] cxl_get_hdm_info()
> [2/11] cxl_await_range_active() split from media-ready wait
> [3/11] cxl_register_map records BIR + BAR offset
> [4/11] component/HDM register defines moved to uapi/cxl/cxl_regs.h
>
> Reviewer feedback addressed
> ===========================
>
> Dan
> ---
>
> - VFIO exposes HDM/host-visible region, not raw DPA; docs/UAPI say HDM
> region, DPA only inside cxl-core where appropriate.
> - One vfio-pci device = one HDM region / one decoder, no interleave;
> hdm_count != 1 → -EOPNOTSUPP.
> - Global HDM on DVSEC Range Base treated as legacy; RANGE1/RANGE2
> read-only snapshot, guest writes dropped.
> - No guest/kernel lock games; DVSEC LOCK and HDM LOCK_ON_COMMIT RWO,
> fixed at create from firmware snapshot.
> - Opaque cxl_passthrough handle only; vfio gets HPA via memdev probe +
> layout via cxl_get_hdm_info(), rw via helpers.
> - No multi-region accelerator case in v3; single region enforced,
> multi-region deferred.
> - cxl_await_range_active stays in cxl-core probe; not exported, vfio does
> not call it.
> - No guest LOCK→0 reprogram; guest cannot clear LOCK to remap host HPA;
> kernel uncommit tied to COMMIT, not LOCK alone.
>
> Jason / Gregory / Dan
> ---------------------
>
> - memremap(WB) + request_mem_region on HPA; conflicting direct-map/EFI use
> fails probe with -EBUSY.
>
> Jonathan
> --------
>
> - uapi/cxl/cxl_regs.h for register defines so VMMs need no private
> kernel headers.
> - __free() locals on cxl-core/passthrough error paths instead of
> struct-owned temporaries.
> - No "precommitted at probe" assumption; acquire checks COMMITTED in
> HDM shadow and refuses if missing.
>
> Dave
> ----
>
> - memremap(MEMREMAP_WB) for HDM host mapping (not ioremap_cache).
> - Renamed cap flag to VFIO_CXL_CAP_HOST_FIRMWARE_COMMITTED for clarity.
> - __free() / DEFINE_FREE() cleanup in new passthrough.c create path.
>
> Patch series
> ============
>
> [1/11] cxl: Add cxl_get_hdm_info() helper for HDM decoder metadata
> [2/11] cxl: Split cxl_await_range_active() from media-ready wait
> [3/11] cxl: Record BIR and BAR offset in cxl_register_map
> [4/11] cxl: Move component/HDM register defines to
> uapi/cxl/cxl_regs.h
> [5/11] vfio: UAPI for CXL Type-2 device passthrough
> [6/11] cxl: Add register-virtualization helpers for vfio Type-2
> passthrough
> [7/11] vfio/pci: Add CONFIG_VFIO_PCI_CXL with bind-time CXL Type-2
> acquisition
> [8/11] vfio/pci/cxl: Add HDM + COMP_REGS regions and DVSEC clipping
> shim
> [9/11] selftests/vfio: Add CXL Type-2 device passthrough smoke test
> [10/11] docs: vfio-pci: Document CXL Type-2 device passthrough
> [11/11] vfio/pci: Provide opt-out for CXL Type-2 extensions
>
> Dependencies
> ============
>
> [1] [PATCH v28 0/5] Type2 device basic support
> https://lore.kernel.org/linux-cxl/20260618181806.118745-1-alejandro.lucero-palau@amd.com/
>
> [2] Previous version of this patch series
> [PATCH v2 00/20] vfio/pci: Add CXL Type-2 device passthrough support
> https://lore.kernel.org/linux-cxl/20260401143917.108413-1-mhonap@nvidia.com/
>
> [3] Companion QEMU series
> [RFC 0/9] QEMU: CXL Type-2 device passthrough via vfio-pci
> https://lore.kernel.org/linux-cxl/20260427181235.3003865-1-mhonap@nvidia.com/
>
> Manish Honap (11):
> cxl: Add cxl_get_hdm_info() helper for HDM decoder metadata
> cxl: Split cxl_await_range_active() from media-ready wait
> cxl: Record BIR and BAR offset in cxl_register_map
> cxl: Move component/HDM register defines to uapi/cxl/cxl_regs.h
> vfio: UAPI for CXL Type-2 device passthrough
> cxl: Add register-virtualization helpers for vfio Type-2 passthrough
> vfio/pci: Add CONFIG_VFIO_PCI_CXL with bind-time CXL Type-2
> acquisition
> vfio/pci/cxl: Add HDM + COMP_REGS regions and DVSEC clipping shim
> selftests/vfio: Add CXL Type-2 device passthrough smoke test
> docs: vfio-pci: Document CXL Type-2 device passthrough
> vfio/pci: Provide opt-out for CXL Type-2 extensions
>
> Documentation/driver-api/index.rst | 1 +
> Documentation/driver-api/vfio-pci-cxl.rst | 282 ++++++
> drivers/cxl/Kconfig | 7 +
> drivers/cxl/core/Makefile | 1 +
> drivers/cxl/core/passthrough.c | 590 ++++++++++++
> drivers/cxl/core/pci.c | 70 +-
> drivers/cxl/core/regs.c | 35 +
> drivers/cxl/cxl.h | 52 +-
> drivers/vfio/pci/Kconfig | 2 +
> drivers/vfio/pci/Makefile | 1 +
> drivers/vfio/pci/cxl/Kconfig | 34 +
> drivers/vfio/pci/cxl/Makefile | 2 +
> drivers/vfio/pci/cxl/vfio_cxl_core.c | 889 ++++++++++++++++++
> drivers/vfio/pci/cxl/vfio_cxl_priv.h | 71 ++
> drivers/vfio/pci/vfio_pci.c | 9 +
> drivers/vfio/pci/vfio_pci_config.c | 31 +
> drivers/vfio/pci/vfio_pci_core.c | 68 +-
> drivers/vfio/pci/vfio_pci_priv.h | 93 ++
> drivers/vfio/pci/vfio_pci_rdwr.c | 17 +
> include/cxl/cxl.h | 18 +
> include/cxl/passthrough.h | 121 +++
> include/linux/vfio_pci_core.h | 8 +
> include/uapi/cxl/cxl_regs.h | 63 ++
> include/uapi/linux/vfio.h | 46 +
> tools/testing/selftests/vfio/Makefile | 1 +
> .../selftests/vfio/lib/vfio_pci_device.c | 11 +-
> .../selftests/vfio/vfio_cxl_type2_test.c | 350 +++++++
> 27 files changed, 2821 insertions(+), 52 deletions(-)
> create mode 100644 Documentation/driver-api/vfio-pci-cxl.rst
> create mode 100644 drivers/cxl/core/passthrough.c
> create mode 100644 drivers/vfio/pci/cxl/Kconfig
> create mode 100644 drivers/vfio/pci/cxl/Makefile
> create mode 100644 drivers/vfio/pci/cxl/vfio_cxl_core.c
> create mode 100644 drivers/vfio/pci/cxl/vfio_cxl_priv.h
> create mode 100644 include/cxl/passthrough.h
> create mode 100644 include/uapi/cxl/cxl_regs.h
> create mode 100644 tools/testing/selftests/vfio/vfio_cxl_type2_test.c
>
> base-commit: 90cf2e0d702c8a132ccbe72e7687f33c04c14658
> --
> 2.25.1
>
>