public inbox for kvm@vger.kernel.org
* [PATCH v2 00/20] vfio/pci: Add CXL Type-2 device passthrough support
@ 2026-04-01 14:38 mhonap
  2026-04-01 14:38 ` [PATCH v2 01/20] cxl: Add cxl_get_hdm_info() for HDM decoder metadata mhonap
                   ` (19 more replies)
  0 siblings, 20 replies; 27+ messages in thread
From: mhonap @ 2026-04-01 14:38 UTC (permalink / raw)
  To: alwilliamson, dan.j.williams, jonathan.cameron, dave.jiang,
	alejandro.lucero-palau, dave, alison.schofield, vishal.l.verma,
	ira.weiny, dmatlack, shuah, jgg, yishaih, skolothumtho,
	kevin.tian, ankita
  Cc: vsethi, cjia, targupta, zhiw, kjaju, linux-kselftest,
	linux-kernel, linux-cxl, kvm, mhonap, Alex Williamson,
	Jonathan Cameron

From: Manish Honap <mhonap@nvidia.com>

CXL Type-2 accelerators (e.g. CXL.mem-capable GPUs) cannot be passed
through to virtual machines with stock vfio-pci because the driver has
no concept of HDM decoder management, DPA region exposure, or component
register emulation.  This series wires all of that into vfio-pci-core
behind a new optional CONFIG_VFIO_CXL_CORE module, without requiring a
variant driver.

When a CXL Device DVSEC (Vendor ID 0x1E98, ID 0x0000) is detected at
device open time, the driver:

  - Probes the HDM Decoder Capability block in the component registers
    and allocates a DPA region through the CXL subsystem.  On devices
    where firmware has already committed a decoder, the kernel skips
    allocation and re-uses the committed range.

  - Builds a kernel-owned shadow of the HDM register block.  The VMM
    reads and writes this shadow through a dedicated COMP_REGS VFIO
    region rather than touching the hardware directly.  The kernel
    enforces CXL 3.1 bit-field rules: reserved bits, read-only bits,
    the COMMIT/COMMITTED latch, and the LOCK→0 reprogram path for
    firmware-committed decoders.

  - Exposes the DPA range as a second VFIO region (VFIO_REGION_SUBTYPE_CXL)
    backed by the kernel-assigned HPA.  PTEs are inserted lazily on first
    page fault and torn down atomically under memory_lock during FLR.

  - Intercepts writes to the CXL DVSEC configuration-space registers
    (Control, Status, Control2, Status2, Lock, Range Base) and replays
    them through a per-device vconfig shadow, enforcing RWL/RW1CS/RWO
    access semantics and the CONFIG_LOCK one-shot latch.

  - Returns a VFIO_DEVICE_INFO_CAP_CXL capability (id=6) carrying the
    HDM register BAR index and offset, commit flags, and the indices of
    the DPA and COMP_REGS regions.  HDM decoder count and the HDM block
    offset within COMP_REGS are derivable by the VMM from the CXL
    Capability Array in the COMP_REGS region itself, so they are not
    duplicated in the capability struct.

  - Builds a sparse-mmap capability for the component register BAR so
    VMMs can map GPU/accelerator register windows while the kernel
    protects the CXL component register block.  Three physical layouts
    are handled: component block at the BAR end, at the start, and in
    the middle.

  - Provides a module parameter (disable_cxl=1) and a per-device flag
    (vdev->disable_cxl) for suppressing the feature without recompiling.

  - Includes selftests covering device detection, capability parsing,
    region enumeration, HDM register emulation, DPA mmap with page-fault
    insertion, FLR invalidation, and DVSEC register emulation.

The series applies on top of the cxl/next branch at the base commit
given at the end of this cover letter, plus Alejandro's v23 Type-2
device support patches [1].

Series structure
================

  Patches 1-5 extend the CXL subsystem with the APIs vfio-pci needs.

  Patches 6-8 add the vfio-pci-core plumbing (UAPI, device state,
  Kconfig/build).

  Patches 9-15 implement the core device lifecycle: detection, HDM
  emulation, media readiness, region management, DPA region, and DVSEC
  emulation.

  Patches 16-18 wire everything together at open/close time and
  populate the VFIO ioctl paths.

  Patches 19-20 add documentation and selftests.

Changes since v1
================

UAPI struct minimization (patch 6)

  v1 carried hdm_count, hdm_regs_size, hdm_decoder_offset, dpa_size,
  and a pad byte in vfio_device_info_cap_cxl. All four fields are
  derivable from data the VMM already has: hdm_count and the HDM block
  offset come from the CXL Capability Array in the COMP_REGS region,
  hdm_regs_size is implicit in the COMP_REGS region size, and dpa_size
  is the DPA region size.  v2 drops them and replaces pad with
  reserved[3].  The VFIO_CXL_CAP_PRECOMMITTED flag is gone; the single
  VFIO_CXL_CAP_FIRMWARE_COMMITTED flag covers both the committed and
  precommitted cases.  VFIO_CXL_CAP_CACHE_CAPABLE is added to expose
  the HDM-DB (CXL.cache) capability bit.

Component BAR access: sparse mmap instead of blanket rejection (patch 17)

  v1 returned size=0 for the component BAR and rejected all mmap and
  r/w access to it. That broke GPU passthrough scenarios where the
  device puts accelerator register windows in the same BAR as the CXL
  component registers. v2 replaces the blanket rejection with a
  sparse-mmap capability that advertises only the GPU register windows,
  carving out the component register block.  vfio_cxl_mmap_overlaps_comp_regs()
  rejects only the sub-range covering [comp_reg_offset, comp_reg_offset
  + comp_reg_size); everything else in the BAR remains mappable.

CXL register defines moved to uapi/cxl/cxl_regs.h (patch 3)

  v1 placed the component register defines in a private header
  (include/cxl/cxl_regs.h). v2 moves them to include/uapi/cxl/cxl_regs.h
  so VMMs can include them directly without duplicating definitions.

HDM API simplification (patch 1)

  v1 exported cxl_get_hdm_reg_info() which returned a raw struct with
  offset and size fields. v2 replaces it with cxl_get_hdm_info() which
  uses the cached count already populated by cxl_probe_component_regs()
  and returns a single struct with all HDM metadata, removing the need
  for callers to re-read the hardware.

cxl_await_range_active() split (patch 4)

  cxl_await_media_ready() requires a CXLMDEV mailbox register, which
  Type-2 accelerators may not have.  v2 splits out cxl_await_range_active()
  so the HDM range-active poll can be used independently of the media
  ready path.

LOCK→0 transition in HDM ctrl write emulation (patch 11)

  v1 did not handle the case where a guest tries to clear the LOCK bit
  to reprogram a firmware-committed decoder. v2 allows this transition
  and re-programs the hardware accordingly.

Component register buffer allocation (patch 11)

  v1 allocated only the HDM register sub-range in the COMP_REGS buffer.
  v2 allocates the full CXL_COMPONENT_REG_BLOCK_SIZE so future patches
  can expose other capability blocks (e.g. RAS, CXL.cache) without a
  structural change.

Register region setup split (patch 16)

  v1 tied region registration to the detection/init path.  v2 splits it
  into explicit vfio_cxl_register_cxl_region() and
  vfio_cxl_register_comp_regs_region() functions called from
  vfio_pci_open_device(), which is the correct point since vconfig and
  pci_config_map are valid there.

VLA fix merged into selftest (patch 20)

  v1 had a separate patch 20 fixing a VLA initialisation in
  vfio_pci_irq_set().  v2 folds that fix into the selftest patch to
  keep the standalone CXL change count at 19 functional patches.

Reviewer feedback addressed
===========================

Dave Jiang:
  - Replace open-coded bit shifts with FIELD_GET() / FIELD_PREP()
    throughout the HDM emulation code.
  - Rename flag from VFIO_CXL_CAP_COMMITTED / VFIO_CXL_CAP_PRECOMMITTED
    to VFIO_CXL_CAP_FIRMWARE_COMMITTED; the old names were ambiguous.
  - Use memremap(MEMREMAP_WB) for the DPA kernel mapping instead of
    ioremap_cache(), which selects the wrong memory-type descriptor on
    ARM64.
  - Use __free() / DEFINE_FREE() scope helpers for CXL resource cleanup
    in the region management path, replacing the open-coded error
    unwind.
  - Remove the unused abs_off parameter from the HDM accessor.
  - Rename cxl_dvsec_control_write() to better reflect its role.

Jonathan Cameron:
  - Move CXL register defines to uapi/cxl/cxl_regs.h so VMMs can
    consume them without a kernel header dependency.
  - Use local variables with __free() rather than struct members for
    intermediate ERR_PTR returns in the region management code; avoids
    ambiguity about ownership on error paths.
  - The assumption that a pre-committed decoder always exists at probe
    time is too restrictive for hotplug scenarios; v2 makes the
    precommitted path a fast-track that falls back to dynamic allocation
    when no committed decoder is found.

Alex Williamson:
  - The blanket size=0 / mmap-reject approach for the component BAR
    prevents VMMs from accessing GPU register windows in the same BAR.
    v2 implements the sparse-mmap capability described above.

Limitations and future work
===========================

  Switched topologies with more than one caching agent are not yet
  supported; that is planned for a follow-on series.

  RAS/ECC handling and CXL core reset integration (cxl_reset support
  from Srirangan [2]) will be added in subsequent patches.

Dependencies
============

[1] CXL Type-2 device basic support (Alejandro Lucero-Palau, v23):
    https://lore.kernel.org/linux-cxl/20260201155438.2664640-1-alejandro.lucero-palau@amd.com/

[2] CXL reset support for Type-2 devices (Srirangan Madhavan):
    https://lore.kernel.org/linux-cxl/20260306092322.148765-1-smadhavan@nvidia.com/

Cc: Alex Williamson <alex@shazbot.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Alejandro Lucero <alejandro.lucero-palau@amd.com>
Cc: linux-cxl@vger.kernel.org
Cc: kvm@vger.kernel.org

Co-developed-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>

base-commit: 3f7938b1aec7f06d5b23adca83e4542fcf027001
--

Manish Honap (20):
  cxl: Add cxl_get_hdm_info() for HDM decoder metadata
  cxl: Declare cxl_find_regblock and cxl_probe_component_regs in public
    header
  cxl: Move component/HDM register defines to uapi/cxl/cxl_regs.h
  cxl: Split cxl_await_range_active() from media-ready wait
  cxl: Record BIR and BAR offset in cxl_register_map
  vfio: UAPI for CXL-capable PCI device assignment
  vfio/pci: Add CXL state to vfio_pci_core_device
  vfio/pci: Add CONFIG_VFIO_CXL_CORE and stub CXL hooks
  vfio/cxl: Detect CXL DVSEC and probe HDM block
  vfio/pci: Export config access helpers
  vfio/cxl: Introduce HDM decoder register emulation framework
  vfio/cxl: Wait for HDM ranges and create memdev
  vfio/cxl: CXL region management support
  vfio/cxl: DPA VFIO region with demand fault mmap and reset zap
  vfio/cxl: Virtualize CXL DVSEC config writes
  vfio/cxl: Register regions with VFIO layer
  vfio/pci: Advertise CXL cap and sparse component BAR to userspace
  vfio/cxl: Provide opt-out for CXL feature
  docs: vfio-pci: Document CXL Type-2 device passthrough
  selftests/vfio: Add CXL Type-2 VFIO assignment test

 Documentation/driver-api/index.rst            |    1 +
 Documentation/driver-api/vfio-pci-cxl.rst     |  382 +++
 drivers/cxl/core/pci.c                        |   64 +-
 drivers/cxl/core/regs.c                       |   30 +
 drivers/cxl/cxl.h                             |   46 -
 drivers/vfio/pci/Kconfig                      |    2 +
 drivers/vfio/pci/Makefile                     |    1 +
 drivers/vfio/pci/cxl/Kconfig                  |    9 +
 drivers/vfio/pci/cxl/vfio_cxl_config.c        |  306 ++
 drivers/vfio/pci/cxl/vfio_cxl_core.c          |  880 ++++++
 drivers/vfio/pci/cxl/vfio_cxl_emu.c           |  509 ++++
 drivers/vfio/pci/cxl/vfio_cxl_priv.h          |  133 +
 drivers/vfio/pci/vfio_pci.c                   |   32 +
 drivers/vfio/pci/vfio_pci_config.c            |   58 +-
 drivers/vfio/pci/vfio_pci_core.c              |   46 +-
 drivers/vfio/pci/vfio_pci_priv.h              |   66 +
 drivers/vfio/pci/vfio_pci_rdwr.c              |   16 +-
 include/cxl/cxl.h                             |   51 +
 include/linux/vfio_pci_core.h                 |   10 +
 include/uapi/cxl/cxl_regs.h                   |  160 +
 include/uapi/linux/vfio.h                     |   86 +
 tools/testing/selftests/vfio/Makefile         |    1 +
 .../selftests/vfio/lib/vfio_pci_device.c      |    3 +-
 .../selftests/vfio/vfio_cxl_type2_test.c      | 2631 +++++++++++++++++
 24 files changed, 5459 insertions(+), 64 deletions(-)
 create mode 100644 Documentation/driver-api/vfio-pci-cxl.rst
 create mode 100644 drivers/vfio/pci/cxl/Kconfig
 create mode 100644 drivers/vfio/pci/cxl/vfio_cxl_config.c
 create mode 100644 drivers/vfio/pci/cxl/vfio_cxl_core.c
 create mode 100644 drivers/vfio/pci/cxl/vfio_cxl_emu.c
 create mode 100644 drivers/vfio/pci/cxl/vfio_cxl_priv.h
 create mode 100644 include/uapi/cxl/cxl_regs.h
 create mode 100644 tools/testing/selftests/vfio/vfio_cxl_type2_test.c

--
2.25.1


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v2 01/20] cxl: Add cxl_get_hdm_info() for HDM decoder metadata
  2026-04-01 14:38 [PATCH v2 00/20] vfio/pci: Add CXL Type-2 device passthrough support mhonap
@ 2026-04-01 14:38 ` mhonap
  2026-04-01 14:38 ` [PATCH v2 02/20] cxl: Declare cxl_find_regblock and cxl_probe_component_regs in public header mhonap
                   ` (18 subsequent siblings)
  19 siblings, 0 replies; 27+ messages in thread
From: mhonap @ 2026-04-01 14:38 UTC (permalink / raw)
  To: alwilliamson, dan.j.williams, jonathan.cameron, dave.jiang,
	alejandro.lucero-palau, dave, alison.schofield, vishal.l.verma,
	ira.weiny, dmatlack, shuah, jgg, yishaih, skolothumtho,
	kevin.tian, ankita
  Cc: vsethi, cjia, targupta, zhiw, kjaju, linux-kselftest,
	linux-kernel, linux-cxl, kvm, mhonap

From: Manish Honap <mhonap@nvidia.com>

cxl_probe_component_regs() finds the HDM decoder block during device probe
and caches its location, but does not record the decoder count and does
not expose the result outside drivers/cxl/.

vfio-cxl needs the decoder count and the byte offset and size of the HDM
block without re-running the probe sequence. Record decoder_cnt in
rmap->count when parsing the HDM capability in cxl_probe_component_regs(),
extend struct cxl_reg_map with a count member, and add cxl_get_hdm_info()
to return offset, size, and count from the cached map.

Export under the CXL namespace; stub to -EOPNOTSUPP when CONFIG_CXL_BUS
is off.

Co-developed-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
 drivers/cxl/core/pci.c  | 29 +++++++++++++++++++++++++++++
 drivers/cxl/core/regs.c |  1 +
 include/cxl/cxl.h       | 16 ++++++++++++++++
 3 files changed, 46 insertions(+)

diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index ba2d393c540a..a5147602f91f 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -449,6 +449,35 @@ int cxl_hdm_decode_init(struct cxl_dev_state *cxlds, struct cxl_hdm *cxlhdm,
 }
 EXPORT_SYMBOL_NS_GPL(cxl_hdm_decode_init, "CXL");
 
+/**
+ * cxl_get_hdm_info - Get HDM decoder register block location and count
+ * @cxlds: CXL device state (must have component regs enumerated via
+ *	   cxl_probe_component_regs())
+ * @count:  number of HDM decoders in the block (from HDM Capability bits [3:0])
+ * @offset: byte offset of HDM decoder block within the component register BAR
+ * @size:   size in bytes of the HDM decoder block
+ *
+ * Return: 0 on success, -EINVAL on NULL arguments, -ENODEV if the HDM block is absent.
+ */
+int cxl_get_hdm_info(struct cxl_dev_state *cxlds, u8 *count,
+		     resource_size_t *offset, resource_size_t *size)
+{
+	struct cxl_reg_map *hdm = &cxlds->reg_map.component_map.hdm_decoder;
+
+	if (WARN_ON(!count || !offset || !size))
+		return -EINVAL;
+
+	if (!hdm->valid)
+		return -ENODEV;
+
+	*count	= hdm->count;
+	*offset = hdm->offset;
+	*size	= hdm->size;
+
+	return 0;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_get_hdm_info, "CXL");
+
 #define CXL_DOE_TABLE_ACCESS_REQ_CODE		0x000000ff
 #define   CXL_DOE_TABLE_ACCESS_REQ_CODE_READ	0
 #define CXL_DOE_TABLE_ACCESS_TABLE_TYPE		0x0000ff00
diff --git a/drivers/cxl/core/regs.c b/drivers/cxl/core/regs.c
index 20c2d9fbcfe7..e828df0629d0 100644
--- a/drivers/cxl/core/regs.c
+++ b/drivers/cxl/core/regs.c
@@ -85,6 +85,7 @@ void cxl_probe_component_regs(struct device *dev, void __iomem *base,
 			decoder_cnt = cxl_hdm_decoder_count(hdr);
 			length = 0x20 * decoder_cnt + 0x10;
 			rmap = &map->hdm_decoder;
+			rmap->count = decoder_cnt;
 			break;
 		}
 		case CXL_CM_CAP_CAP_ID_RAS:
diff --git a/include/cxl/cxl.h b/include/cxl/cxl.h
index 50acbd13bcf8..d86faebb99b7 100644
--- a/include/cxl/cxl.h
+++ b/include/cxl/cxl.h
@@ -80,6 +80,7 @@ struct cxl_reg_map {
 	int id;
 	unsigned long offset;
 	unsigned long size;
+	u8 count;
 };
 
 struct cxl_component_reg_map {
@@ -284,4 +285,19 @@ int cxl_dpa_free(struct cxl_endpoint_decoder *cxled);
 struct cxl_region *cxl_create_region(struct cxl_root_decoder *cxlrd,
 				     struct cxl_endpoint_decoder **cxled,
 				     int ways);
+
+#ifdef CONFIG_CXL_BUS
+
+int cxl_get_hdm_info(struct cxl_dev_state *cxlds, u8 *count,
+		     resource_size_t *offset, resource_size_t *size);
+
+#else
+
+static inline
+int cxl_get_hdm_info(struct cxl_dev_state *cxlds, u8 *count,
+		     resource_size_t *offset, resource_size_t *size)
+{ return -EOPNOTSUPP; }
+
+#endif /* CONFIG_CXL_BUS */
+
 #endif /* __CXL_CXL_H__ */
-- 
2.25.1



* [PATCH v2 02/20] cxl: Declare cxl_find_regblock and cxl_probe_component_regs in public header
  2026-04-01 14:38 [PATCH v2 00/20] vfio/pci: Add CXL Type-2 device passthrough support mhonap
  2026-04-01 14:38 ` [PATCH v2 01/20] cxl: Add cxl_get_hdm_info() for HDM decoder metadata mhonap
@ 2026-04-01 14:38 ` mhonap
  2026-04-01 14:39 ` [PATCH v2 03/20] cxl: Move component/HDM register defines to uapi/cxl/cxl_regs.h mhonap
                   ` (17 subsequent siblings)
  19 siblings, 0 replies; 27+ messages in thread
From: mhonap @ 2026-04-01 14:38 UTC (permalink / raw)
  To: alwilliamson, dan.j.williams, jonathan.cameron, dave.jiang,
	alejandro.lucero-palau, dave, alison.schofield, vishal.l.verma,
	ira.weiny, dmatlack, shuah, jgg, yishaih, skolothumtho,
	kevin.tian, ankita
  Cc: vsethi, cjia, targupta, zhiw, kjaju, linux-kselftest,
	linux-kernel, linux-cxl, kvm, mhonap

From: Manish Honap <mhonap@nvidia.com>

vfio-cxl lives outside drivers/cxl/ but still needs to locate the
component register block and fill cxl_component_reg_map. Those
prototypes were stuck in the internal drivers/cxl/cxl.h.

Move the declarations to include/cxl/cxl.h next to the other
vfio-facing hooks, with stubs when CXL bus support is disabled.
Drop the duplicate prototypes from the private header.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
 drivers/cxl/cxl.h |  4 ----
 include/cxl/cxl.h | 16 ++++++++++++++++
 2 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 2b1f7d687a0e..10ddab3949ee 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -198,8 +198,6 @@ static inline int ways_to_eiw(unsigned int ways, u8 *eiw)
 #define   CXLDEV_MBOX_BG_CMD_COMMAND_VENDOR_MASK GENMASK_ULL(63, 48)
 #define CXLDEV_MBOX_PAYLOAD_OFFSET 0x20
 
-void cxl_probe_component_regs(struct device *dev, void __iomem *base,
-			      struct cxl_component_reg_map *map);
 void cxl_probe_device_regs(struct device *dev, void __iomem *base,
 			   struct cxl_device_reg_map *map);
 int cxl_map_device_regs(const struct cxl_register_map *map,
@@ -211,8 +209,6 @@ enum cxl_regloc_type;
 int cxl_count_regblock(struct pci_dev *pdev, enum cxl_regloc_type type);
 int cxl_find_regblock_instance(struct pci_dev *pdev, enum cxl_regloc_type type,
 			       struct cxl_register_map *map, unsigned int index);
-int cxl_find_regblock(struct pci_dev *pdev, enum cxl_regloc_type type,
-		      struct cxl_register_map *map);
 int cxl_setup_regs(struct cxl_register_map *map);
 struct cxl_dport;
 int cxl_dport_map_rcd_linkcap(struct pci_dev *pdev, struct cxl_dport *dport);
diff --git a/include/cxl/cxl.h b/include/cxl/cxl.h
index d86faebb99b7..8ef7915a51f7 100644
--- a/include/cxl/cxl.h
+++ b/include/cxl/cxl.h
@@ -286,17 +286,33 @@ struct cxl_region *cxl_create_region(struct cxl_root_decoder *cxlrd,
 				     struct cxl_endpoint_decoder **cxled,
 				     int ways);
 
+struct pci_dev;
+enum cxl_regloc_type;
+
 #ifdef CONFIG_CXL_BUS
 
 int cxl_get_hdm_info(struct cxl_dev_state *cxlds, u8 *count,
 		     resource_size_t *offset, resource_size_t *size);
 
+int cxl_find_regblock(struct pci_dev *pdev, enum cxl_regloc_type type,
+		      struct cxl_register_map *map);
+void cxl_probe_component_regs(struct device *dev, void __iomem *base,
+			      struct cxl_component_reg_map *map);
+
 #else
 
 static inline
 int cxl_get_hdm_info(struct cxl_dev_state *cxlds, u8 *count,
 		     resource_size_t *offset, resource_size_t *size)
 { return -EOPNOTSUPP; }
+static inline int
+cxl_find_regblock(struct pci_dev *pdev, enum cxl_regloc_type type,
+		  struct cxl_register_map *map)
+{ return -EOPNOTSUPP; }
+static inline void
+cxl_probe_component_regs(struct device *dev, void __iomem *base,
+			 struct cxl_component_reg_map *map)
+{ }
 
 #endif /* CONFIG_CXL_BUS */
 
-- 
2.25.1



* [PATCH v2 03/20] cxl: Move component/HDM register defines to uapi/cxl/cxl_regs.h
  2026-04-01 14:38 [PATCH v2 00/20] vfio/pci: Add CXL Type-2 device passthrough support mhonap
  2026-04-01 14:38 ` [PATCH v2 01/20] cxl: Add cxl_get_hdm_info() for HDM decoder metadata mhonap
  2026-04-01 14:38 ` [PATCH v2 02/20] cxl: Declare cxl_find_regblock and cxl_probe_component_regs in public header mhonap
@ 2026-04-01 14:39 ` mhonap
  2026-04-01 14:39 ` [PATCH v2 04/20] cxl: Split cxl_await_range_active() from media-ready wait mhonap
                   ` (16 subsequent siblings)
  19 siblings, 0 replies; 27+ messages in thread
From: mhonap @ 2026-04-01 14:39 UTC (permalink / raw)
  To: alwilliamson, dan.j.williams, jonathan.cameron, dave.jiang,
	alejandro.lucero-palau, dave, alison.schofield, vishal.l.verma,
	ira.weiny, dmatlack, shuah, jgg, yishaih, skolothumtho,
	kevin.tian, ankita
  Cc: vsethi, cjia, targupta, zhiw, kjaju, linux-kselftest,
	linux-kernel, linux-cxl, kvm, mhonap

From: Manish Honap <mhonap@nvidia.com>

VFIO and other code outside the CXL core needs the same offset/mask
constants the core uses for the component register block and HDM
decoders.

Pull them into a new include/uapi/cxl/cxl_regs.h
(GPL-2.0 WITH Linux-syscall-note) and include it from
include/cxl/cxl.h. Use the uapi-friendly __GENMASK helpers where
needed. Section comments in the new file reference CXL spec r4.0 numbering.

Since the SZ_64K macro is not available to userspace, the new header
spells out the literal size instead.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
 drivers/cxl/cxl.h           | 42 ---------------------------
 include/cxl/cxl.h           |  1 +
 include/uapi/cxl/cxl_regs.h | 57 +++++++++++++++++++++++++++++++++++++
 3 files changed, 58 insertions(+), 42 deletions(-)
 create mode 100644 include/uapi/cxl/cxl_regs.h

diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 10ddab3949ee..172e38d58c50 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -24,48 +24,6 @@ extern const struct nvdimm_security_ops *cxl_security_ops;
  * (port-driver, region-driver, nvdimm object-drivers... etc).
  */
 
-/* CXL 2.0 8.2.4 CXL Component Register Layout and Definition */
-#define CXL_COMPONENT_REG_BLOCK_SIZE SZ_64K
-
-/* CXL 2.0 8.2.5 CXL.cache and CXL.mem Registers*/
-#define CXL_CM_OFFSET 0x1000
-#define CXL_CM_CAP_HDR_OFFSET 0x0
-#define   CXL_CM_CAP_HDR_ID_MASK GENMASK(15, 0)
-#define     CM_CAP_HDR_CAP_ID 1
-#define   CXL_CM_CAP_HDR_VERSION_MASK GENMASK(19, 16)
-#define     CM_CAP_HDR_CAP_VERSION 1
-#define   CXL_CM_CAP_HDR_CACHE_MEM_VERSION_MASK GENMASK(23, 20)
-#define     CM_CAP_HDR_CACHE_MEM_VERSION 1
-#define   CXL_CM_CAP_HDR_ARRAY_SIZE_MASK GENMASK(31, 24)
-#define CXL_CM_CAP_PTR_MASK GENMASK(31, 20)
-
-/* HDM decoders CXL 2.0 8.2.5.12 CXL HDM Decoder Capability Structure */
-#define CXL_HDM_DECODER_CAP_OFFSET 0x0
-#define   CXL_HDM_DECODER_COUNT_MASK GENMASK(3, 0)
-#define   CXL_HDM_DECODER_TARGET_COUNT_MASK GENMASK(7, 4)
-#define   CXL_HDM_DECODER_INTERLEAVE_11_8 BIT(8)
-#define   CXL_HDM_DECODER_INTERLEAVE_14_12 BIT(9)
-#define   CXL_HDM_DECODER_INTERLEAVE_3_6_12_WAY BIT(11)
-#define   CXL_HDM_DECODER_INTERLEAVE_16_WAY BIT(12)
-#define CXL_HDM_DECODER_CTRL_OFFSET 0x4
-#define   CXL_HDM_DECODER_ENABLE BIT(1)
-#define CXL_HDM_DECODER0_BASE_LOW_OFFSET(i) (0x20 * (i) + 0x10)
-#define CXL_HDM_DECODER0_BASE_HIGH_OFFSET(i) (0x20 * (i) + 0x14)
-#define CXL_HDM_DECODER0_SIZE_LOW_OFFSET(i) (0x20 * (i) + 0x18)
-#define CXL_HDM_DECODER0_SIZE_HIGH_OFFSET(i) (0x20 * (i) + 0x1c)
-#define CXL_HDM_DECODER0_CTRL_OFFSET(i) (0x20 * (i) + 0x20)
-#define   CXL_HDM_DECODER0_CTRL_IG_MASK GENMASK(3, 0)
-#define   CXL_HDM_DECODER0_CTRL_IW_MASK GENMASK(7, 4)
-#define   CXL_HDM_DECODER0_CTRL_LOCK BIT(8)
-#define   CXL_HDM_DECODER0_CTRL_COMMIT BIT(9)
-#define   CXL_HDM_DECODER0_CTRL_COMMITTED BIT(10)
-#define   CXL_HDM_DECODER0_CTRL_COMMIT_ERROR BIT(11)
-#define   CXL_HDM_DECODER0_CTRL_HOSTONLY BIT(12)
-#define CXL_HDM_DECODER0_TL_LOW(i) (0x20 * (i) + 0x24)
-#define CXL_HDM_DECODER0_TL_HIGH(i) (0x20 * (i) + 0x28)
-#define CXL_HDM_DECODER0_SKIP_LOW(i) CXL_HDM_DECODER0_TL_LOW(i)
-#define CXL_HDM_DECODER0_SKIP_HIGH(i) CXL_HDM_DECODER0_TL_HIGH(i)
-
 /* HDM decoder control register constants CXL 3.0 8.2.5.19.7 */
 #define CXL_DECODER_MIN_GRANULARITY 256
 #define CXL_DECODER_MAX_ENCODED_IG 6
diff --git a/include/cxl/cxl.h b/include/cxl/cxl.h
index 8ef7915a51f7..f48274673b1b 100644
--- a/include/cxl/cxl.h
+++ b/include/cxl/cxl.h
@@ -9,6 +9,7 @@
 #include <linux/ioport.h>
 #include <linux/range.h>
 #include <cxl/mailbox.h>
+#include <uapi/cxl/cxl_regs.h>
 
 /**
  * enum cxl_devtype - delineate type-2 from a generic type-3 device
diff --git a/include/uapi/cxl/cxl_regs.h b/include/uapi/cxl/cxl_regs.h
new file mode 100644
index 000000000000..1a48a3805f52
--- /dev/null
+++ b/include/uapi/cxl/cxl_regs.h
@@ -0,0 +1,57 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * CXL Standard defines
+ *
+ * Hardware register offsets and bit-field masks for the CXL Component
+ * Register block, as defined by the CXL Specification r4.0.
+ */
+
+#ifndef _UAPI_CXL_REGS_H_
+#define _UAPI_CXL_REGS_H_
+
+#include <linux/const.h>   /* _BITUL(), _BITULL() */
+#include <linux/bits.h>    /* __GENMASK() */
+
+/* CXL 4.0 8.2.3 CXL Component Register Layout and Definition */
+#define CXL_COMPONENT_REG_BLOCK_SIZE 0x00010000
+
+/* CXL 4.0 8.2.4 CXL.cache and CXL.mem Registers */
+#define CXL_CM_OFFSET 0x1000
+#define CXL_CM_CAP_HDR_OFFSET 0x0
+#define   CXL_CM_CAP_HDR_ID_MASK __GENMASK(15, 0)
+#define     CM_CAP_HDR_CAP_ID 1
+#define   CXL_CM_CAP_HDR_VERSION_MASK __GENMASK(19, 16)
+#define     CM_CAP_HDR_CAP_VERSION 1
+#define   CXL_CM_CAP_HDR_CACHE_MEM_VERSION_MASK __GENMASK(23, 20)
+#define     CM_CAP_HDR_CACHE_MEM_VERSION 1
+#define   CXL_CM_CAP_HDR_ARRAY_SIZE_MASK __GENMASK(31, 24)
+#define CXL_CM_CAP_PTR_MASK __GENMASK(31, 20)
+
+/* HDM decoders CXL 4.0 8.2.4.20 CXL HDM Decoder Capability Structure */
+#define CXL_HDM_DECODER_CAP_OFFSET 0x0
+#define   CXL_HDM_DECODER_COUNT_MASK __GENMASK(3, 0)
+#define   CXL_HDM_DECODER_TARGET_COUNT_MASK __GENMASK(7, 4)
+#define   CXL_HDM_DECODER_INTERLEAVE_11_8 _BITUL(8)
+#define   CXL_HDM_DECODER_INTERLEAVE_14_12 _BITUL(9)
+#define   CXL_HDM_DECODER_INTERLEAVE_3_6_12_WAY _BITUL(11)
+#define   CXL_HDM_DECODER_INTERLEAVE_16_WAY _BITUL(12)
+#define CXL_HDM_DECODER_CTRL_OFFSET 0x4
+#define   CXL_HDM_DECODER_ENABLE _BITUL(1)
+#define CXL_HDM_DECODER0_BASE_LOW_OFFSET(i) (0x20 * (i) + 0x10)
+#define CXL_HDM_DECODER0_BASE_HIGH_OFFSET(i) (0x20 * (i) + 0x14)
+#define CXL_HDM_DECODER0_SIZE_LOW_OFFSET(i) (0x20 * (i) + 0x18)
+#define CXL_HDM_DECODER0_SIZE_HIGH_OFFSET(i) (0x20 * (i) + 0x1c)
+#define CXL_HDM_DECODER0_CTRL_OFFSET(i) (0x20 * (i) + 0x20)
+#define   CXL_HDM_DECODER0_CTRL_IG_MASK __GENMASK(3, 0)
+#define   CXL_HDM_DECODER0_CTRL_IW_MASK __GENMASK(7, 4)
+#define   CXL_HDM_DECODER0_CTRL_LOCK _BITUL(8)
+#define   CXL_HDM_DECODER0_CTRL_COMMIT _BITUL(9)
+#define   CXL_HDM_DECODER0_CTRL_COMMITTED _BITUL(10)
+#define   CXL_HDM_DECODER0_CTRL_COMMIT_ERROR _BITUL(11)
+#define   CXL_HDM_DECODER0_CTRL_HOSTONLY _BITUL(12)
+#define CXL_HDM_DECODER0_TL_LOW(i) (0x20 * (i) + 0x24)
+#define CXL_HDM_DECODER0_TL_HIGH(i) (0x20 * (i) + 0x28)
+#define CXL_HDM_DECODER0_SKIP_LOW(i) CXL_HDM_DECODER0_TL_LOW(i)
+#define CXL_HDM_DECODER0_SKIP_HIGH(i) CXL_HDM_DECODER0_TL_HIGH(i)
+
+#endif /* _UAPI_CXL_REGS_H_ */
-- 
2.25.1



* [PATCH v2 04/20] cxl: Split cxl_await_range_active() from media-ready wait
  2026-04-01 14:38 [PATCH v2 00/20] vfio/pci: Add CXL Type-2 device passthrough support mhonap
                   ` (2 preceding siblings ...)
  2026-04-01 14:39 ` [PATCH v2 03/20] cxl: Move component/HDM register defines to uapi/cxl/cxl_regs.h mhonap
@ 2026-04-01 14:39 ` mhonap
  2026-04-01 14:39 ` [PATCH v2 05/20] cxl: Record BIR and BAR offset in cxl_register_map mhonap
                   ` (15 subsequent siblings)
  19 siblings, 0 replies; 27+ messages in thread
From: mhonap @ 2026-04-01 14:39 UTC (permalink / raw)
  To: alwilliamson, dan.j.williams, jonathan.cameron, dave.jiang,
	alejandro.lucero-palau, dave, alison.schofield, vishal.l.verma,
	ira.weiny, dmatlack, shuah, jgg, yishaih, skolothumtho,
	kevin.tian, ankita
  Cc: vsethi, cjia, targupta, zhiw, kjaju, linux-kselftest,
	linux-kernel, linux-cxl, kvm, mhonap

From: Manish Honap <mhonap@nvidia.com>

Before accessing CXL device memory after reset/power-on, the driver
must ensure media is ready. Not every CXL device implements the CXL
Memory Device register group (many Type-2 devices do not).
cxl_await_media_ready() reads cxlds->regs.memdev, so calling it on a
Type-2 device that lacks those registers can result in a kernel
panic.

Split the HDM DVSEC range-active poll out of cxl_await_media_ready()
into a new function, cxl_await_range_active(). Type-2 devices often
lack the CXLMDEV status register, so they need the range check
without the memdev read. cxl_await_media_ready() now calls
cxl_await_range_active() for the DVSEC poll, then reads the memory
device status as before.

Co-developed-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
---
 drivers/cxl/core/pci.c | 35 ++++++++++++++++++++++++++++++-----
 include/cxl/cxl.h      |  3 +++
 2 files changed, 33 insertions(+), 5 deletions(-)

diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index a5147602f91f..1fbe3338a0da 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -142,16 +142,24 @@ static int cxl_dvsec_mem_range_active(struct cxl_dev_state *cxlds, int id)
 	return 0;
 }
 
-/*
- * Wait up to @media_ready_timeout for the device to report memory
- * active.
+/**
+ * cxl_await_range_active - Wait for all HDM DVSEC memory ranges to be active
+ * @cxlds: CXL device state (DVSEC and HDM count must be valid)
+ *
+ * For each HDM decoder range reported in the CXL DVSEC capability, waits for
+ * the range to report MEM INFO VALID (up to 1s per range), then MEM ACTIVE
+ * (up to media_ready_timeout seconds per range, default 60s). Used by
+ * cxl_await_media_ready() and by callers that only need range readiness
+ * without checking the memory device status register.
+ *
+ * Return: 0 if all ranges become valid and active, -ETIMEDOUT if a timeout
+ * occurs, or a negative errno if a config-space read fails.
  */
-int cxl_await_media_ready(struct cxl_dev_state *cxlds)
+int cxl_await_range_active(struct cxl_dev_state *cxlds)
 {
 	struct pci_dev *pdev = to_pci_dev(cxlds->dev);
 	int d = cxlds->cxl_dvsec;
 	int rc, i, hdm_count;
-	u64 md_status;
 	u16 cap;
 
 	rc = pci_read_config_word(pdev,
@@ -172,6 +180,23 @@ int cxl_await_media_ready(struct cxl_dev_state *cxlds)
 			return rc;
 	}
 
+	return 0;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_await_range_active, "CXL");
+
+/*
+ * Wait up to @media_ready_timeout for the device to report memory
+ * active.
+ */
+int cxl_await_media_ready(struct cxl_dev_state *cxlds)
+{
+	u64 md_status;
+	int rc;
+
+	rc = cxl_await_range_active(cxlds);
+	if (rc)
+		return rc;
+
 	md_status = readq(cxlds->regs.memdev + CXLMDEV_STATUS_OFFSET);
 	if (!CXLMDEV_READY(md_status))
 		return -EIO;
diff --git a/include/cxl/cxl.h b/include/cxl/cxl.h
index f48274673b1b..45d911735883 100644
--- a/include/cxl/cxl.h
+++ b/include/cxl/cxl.h
@@ -299,6 +299,7 @@ int cxl_find_regblock(struct pci_dev *pdev, enum cxl_regloc_type type,
 		      struct cxl_register_map *map);
 void cxl_probe_component_regs(struct device *dev, void __iomem *base,
 			      struct cxl_component_reg_map *map);
+int cxl_await_range_active(struct cxl_dev_state *cxlds);
 
 #else
 
@@ -314,6 +315,8 @@ static inline void
 cxl_probe_component_regs(struct device *dev, void __iomem *base,
 			 struct cxl_component_reg_map *map)
 { }
+static inline int cxl_await_range_active(struct cxl_dev_state *cxlds)
+{ return -EOPNOTSUPP; }
 
 #endif /* CONFIG_CXL_BUS */
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v2 05/20] cxl: Record BIR and BAR offset in cxl_register_map
  2026-04-01 14:38 [PATCH v2 00/20] vfio/pci: Add CXL Type-2 device passthrough support mhonap
                   ` (3 preceding siblings ...)
  2026-04-01 14:39 ` [PATCH v2 04/20] cxl: Split cxl_await_range_active() from media-ready wait mhonap
@ 2026-04-01 14:39 ` mhonap
  2026-04-01 14:39 ` [PATCH v2 06/20] vfio: UAPI for CXL-capable PCI device assignment mhonap
                   ` (14 subsequent siblings)
  19 siblings, 0 replies; 27+ messages in thread
From: mhonap @ 2026-04-01 14:39 UTC (permalink / raw)
  To: alwilliamson, dan.j.williams, jonathan.cameron, dave.jiang,
	alejandro.lucero-palau, dave, alison.schofield, vishal.l.verma,
	ira.weiny, dmatlack, shuah, jgg, yishaih, skolothumtho,
	kevin.tian, ankita
  Cc: vsethi, cjia, targupta, zhiw, kjaju, linux-kselftest,
	linux-kernel, linux-cxl, kvm, mhonap

From: Manish Honap <mhonap@nvidia.com>

The Register Locator DVSEC (CXL 4.0 8.1.9) describes register blocks
by BAR index (BIR) and offset within the BAR. CXL core currently only
stores the resolved HPA (resource + offset) in struct cxl_register_map,
so callers that need to use pci_iomap() or report the BAR to userspace
must reverse-engineer the BAR from the HPA.

Add bar_index and bar_offset to struct cxl_register_map and fill them
in cxl_decode_regblock() when the regblock is BAR-backed (BIR 0-5).
Add cxl_regblock_get_bar_info() so callers (e.g. vfio-cxl) can retrieve
the BAR index and offset directly and use pci_iomap() instead of
ioremap() on the HPA; it returns -EINVAL when the map is not
BAR-backed.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
 drivers/cxl/core/regs.c | 29 +++++++++++++++++++++++++++++
 include/cxl/cxl.h       | 15 +++++++++++++++
 2 files changed, 44 insertions(+)

diff --git a/drivers/cxl/core/regs.c b/drivers/cxl/core/regs.c
index e828df0629d0..43661e51230a 100644
--- a/drivers/cxl/core/regs.c
+++ b/drivers/cxl/core/regs.c
@@ -288,9 +288,37 @@ static bool cxl_decode_regblock(struct pci_dev *pdev, u32 reg_lo, u32 reg_hi,
 	map->reg_type = reg_type;
 	map->resource = pci_resource_start(pdev, bar) + offset;
 	map->max_size = pci_resource_len(pdev, bar) - offset;
+	map->bar_index = bar;
+	map->bar_offset = offset;
 	return true;
 }
 
+/**
+ * cxl_regblock_get_bar_info() - Get BAR index and offset for a BAR-backed
+ * regblock
+ * @map: Register map from cxl_find_regblock() or cxl_find_regblock_instance()
+ * @bar_index: Output BAR index (0-5). Optional, may be NULL.
+ * @bar_offset: Output offset within the BAR. Optional, may be NULL.
+ *
+ * When the register block was found via the Register Locator DVSEC and
+ * lives in a PCI BAR (BIR 0-5), this returns the BAR index and the offset
+ * within that BAR.
+ *
+ * Return: 0 if the regblock is BAR-backed (bar_index <= 5), -EINVAL otherwise.
+ */
+int cxl_regblock_get_bar_info(const struct cxl_register_map *map, u8 *bar_index,
+			      resource_size_t *bar_offset)
+{
+	if (!map || map->bar_index == 0xff)
+		return -EINVAL;
+	if (bar_index)
+		*bar_index = map->bar_index;
+	if (bar_offset)
+		*bar_offset = map->bar_offset;
+	return 0;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_regblock_get_bar_info, "CXL");
+
 /*
  * __cxl_find_regblock_instance() - Locate a register block or count instances by type / index
  * Use CXL_INSTANCES_COUNT for @index if counting instances.
@@ -309,6 +337,7 @@ static int __cxl_find_regblock_instance(struct pci_dev *pdev, enum cxl_regloc_ty
 
 	*map = (struct cxl_register_map) {
 		.host = &pdev->dev,
+		.bar_index = 0xff,
 		.resource = CXL_RESOURCE_NONE,
 	};
 
diff --git a/include/cxl/cxl.h b/include/cxl/cxl.h
index 45d911735883..52eb40352edc 100644
--- a/include/cxl/cxl.h
+++ b/include/cxl/cxl.h
@@ -106,9 +106,16 @@ struct cxl_pmu_reg_map {
  * @resource: physical resource base of the register block
  * @max_size: maximum mapping size to perform register search
  * @reg_type: see enum cxl_regloc_type
+ * @bar_index: PCI BAR index (0-5) when regblock is BAR-backed; 0xff otherwise
+ * @bar_offset: offset within the BAR; only valid when bar_index <= 5
  * @component_map: cxl_reg_map for component registers
  * @device_map: cxl_reg_maps for device registers
  * @pmu_map: cxl_reg_maps for CXL Performance Monitoring Units
+ *
+ * When the register block is described by the Register Locator DVSEC with
+ * a BAR Indicator (BIR 0-5), bar_index and bar_offset are set so callers can
+ * use pci_iomap(pdev, bar_index, size) and base + bar_offset instead of
+ * ioremap(resource).
  */
 struct cxl_register_map {
 	struct device *host;
@@ -116,6 +123,8 @@ struct cxl_register_map {
 	resource_size_t resource;
 	resource_size_t max_size;
 	u8 reg_type;
+	u8 bar_index;
+	resource_size_t bar_offset;
 	union {
 		struct cxl_component_reg_map component_map;
 		struct cxl_device_reg_map device_map;
@@ -300,6 +309,8 @@ int cxl_find_regblock(struct pci_dev *pdev, enum cxl_regloc_type type,
 void cxl_probe_component_regs(struct device *dev, void __iomem *base,
 			      struct cxl_component_reg_map *map);
 int cxl_await_range_active(struct cxl_dev_state *cxlds);
+int cxl_regblock_get_bar_info(const struct cxl_register_map *map, u8 *bar_index,
+			      resource_size_t *bar_offset);
 
 #else
 
@@ -317,6 +328,10 @@ cxl_probe_component_regs(struct device *dev, void __iomem *base,
 { }
 static inline int cxl_await_range_active(struct cxl_dev_state *cxlds)
 { return -EOPNOTSUPP; }
+static inline int
+cxl_regblock_get_bar_info(const struct cxl_register_map *map, u8 *bar_index,
+			  resource_size_t *bar_offset)
+{ return -EINVAL; }
 
 #endif /* CONFIG_CXL_BUS */
 
-- 
2.25.1



* [PATCH v2 06/20] vfio: UAPI for CXL-capable PCI device assignment
  2026-04-01 14:38 [PATCH v2 00/20] vfio/pci: Add CXL Type-2 device passthrough support mhonap
                   ` (4 preceding siblings ...)
  2026-04-01 14:39 ` [PATCH v2 05/20] cxl: Record BIR and BAR offset in cxl_register_map mhonap
@ 2026-04-01 14:39 ` mhonap
  2026-04-01 14:39 ` [PATCH v2 07/20] vfio/pci: Add CXL state to vfio_pci_core_device mhonap
                   ` (13 subsequent siblings)
  19 siblings, 0 replies; 27+ messages in thread
From: mhonap @ 2026-04-01 14:39 UTC (permalink / raw)
  To: alwilliamson, dan.j.williams, jonathan.cameron, dave.jiang,
	alejandro.lucero-palau, dave, alison.schofield, vishal.l.verma,
	ira.weiny, dmatlack, shuah, jgg, yishaih, skolothumtho,
	kevin.tian, ankita
  Cc: vsethi, cjia, targupta, zhiw, kjaju, linux-kselftest,
	linux-kernel, linux-cxl, kvm, mhonap

From: Manish Honap <mhonap@nvidia.com>

Vendor GPUs and accelerators can expose CXL.mem (HDM-D or HDM-DB)
without using PCI class code 0x0502. VMMs need a stable way to learn
DPA sizing, firmware commit state, and where the extra VFIO regions live.

Add VFIO_DEVICE_FLAGS_CXL (bit 9) and VFIO_DEVICE_INFO_CAP_CXL (cap ID 6).
The capability struct carries:

  hdm_regs_bar_index       PCI BAR containing the component register block
  hdm_regs_offset          byte offset within that BAR to the CXL.mem area
                           (comp_reg_offset + CXL_CM_OFFSET)
  dpa_region_index         VFIO region index for the DPA window
  comp_regs_region_index   VFIO region index for the emulated COMP_REGS

HDM decoder count and the HDM block offset within COMP_REGS are
intentionally absent; both are derivable from the CXL Capability Array at
COMP_REGS offset 0. Locate cap ID 0x5 (HDM) and read bits[31:20] of its
entry for the byte offset. Then read bits[3:0] of the HDM Decoder Capability
register for the count: count = (field == 0) ? 1 : field * 2.

Two flags accompany the capability:

  VFIO_CXL_CAP_FIRMWARE_COMMITTED
    A decoder covering @dpa_size bytes was programmed and committed by
    platform firmware before device open. The VMM can use the DPA region
    immediately without re-committing.

  VFIO_CXL_CAP_CACHE_CAPABLE
    The device is HDM-DB (CXL.mem + CXL.cache). HDM-DB requires a
    Write-Back Invalidation sequence before FLR to flush dirty cache
    lines; HDM-D (CXL.mem only) does not. QEMU uses this flag to
    schedule WBI and to report Back-Invalidation capability accurately
    in the virtual CXL topology. Mirrors the Cache_Capable bit from
    the CXL DVSEC Capability register.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
 include/uapi/linux/vfio.h | 86 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 86 insertions(+)

diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index ac2329f24141..fc07fc50b2e5 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -215,6 +215,16 @@ struct vfio_device_info {
 #define VFIO_DEVICE_FLAGS_FSL_MC (1 << 6)	/* vfio-fsl-mc device */
 #define VFIO_DEVICE_FLAGS_CAPS	(1 << 7)	/* Info supports caps */
 #define VFIO_DEVICE_FLAGS_CDX	(1 << 8)	/* vfio-cdx device */
+/*
+ * Vendor-specific CXL device with CXL.mem capability (HDM-D or HDM-DB
+ * decoder, PCI class code != PCI_CLASS_MEMORY_CXL).  Covers CXL Type-2
+ * accelerators and non-class-code Type-3 variants.  When set,
+ * VFIO_DEVICE_FLAGS_PCI is also set (same device is a PCI device). The
+ * capability chain (VFIO_DEVICE_FLAGS_CAPS) contains VFIO_DEVICE_INFO_CAP_CXL
+ * describing HDM decoders, region indices, decoder layout, and CXL-specific
+ * options.
+ */
+#define VFIO_DEVICE_FLAGS_CXL   (1 << 9)        /* Device supports CXL */
 	__u32	num_regions;	/* Max region index + 1 */
 	__u32	num_irqs;	/* Max IRQ index + 1 */
 	__u32   cap_offset;	/* Offset within info struct of first cap */
@@ -257,6 +267,70 @@ struct vfio_device_info_cap_pci_atomic_comp {
 	__u32 reserved;
 };
 
+/*
+ * VFIO_DEVICE_INFO_CAP_CXL - CXL Type-2 device capability
+ *
+ * Present in the device info capability chain when VFIO_DEVICE_FLAGS_CXL
+ * is set. Describes Host Managed Device Memory (HDM) layout and CXL
+ * memory options so that userspace (e.g. QEMU) can expose the CXL region
+ * and component registers correctly to the guest.
+ *
+ * The HDM decoder count and HDM decoder block offset within the COMP_REGS
+ * region are derivable from the COMP_REGS region itself.
+ *
+ * To find the HDM decoder block offset (hdm_decoder_offset), traverse the CXL
+ * Capability Array starting at COMP_REGS region offset 0:
+ *   - Dword 0 bits[31:24] (CXL_CM_CAP_HDR_ARRAY_SIZE_MASK): number of
+ *     capability entries.
+ *   - Each subsequent dword at offset (cap * 4): bits[15:0] = cap ID
+ *     (CXL_CM_CAP_HDR_ID_MASK), bits[31:20] = byte offset from COMP_REGS
+ *     start to that capability's register block (CXL_CM_CAP_PTR_MASK).
+ *   - Locate the entry with cap ID == CXL_CM_CAP_CAP_ID_HDM (0x5); the
+ *     extracted bits[31:20] value is directly the byte offset
+ *     hdm_decoder_offset (no further scaling required).
+ *
+ * To find the HDM decoder count, pread the HDM Decoder Capability register
+ * at hdm_decoder_offset + CXL_HDM_DECODER_CAP_OFFSET within the
+ * COMP_REGS region; bits[3:0] (CXL_HDM_DECODER_COUNT_MASK) encode the count
+ * using the formula: count = (field == 0) ? 1 : field * 2.
+ */
+#define VFIO_DEVICE_INFO_CAP_CXL		6
+struct vfio_device_info_cap_cxl {
+	struct vfio_info_cap_header header;
+	__u8  hdm_regs_bar_index; /* PCI BAR containing HDM registers */
+	__u8  reserved[3];
+	__u32 flags;
+/* Decoder was committed by host firmware/BIOS */
+#define VFIO_CXL_CAP_FIRMWARE_COMMITTED		(1 << 0)
+/*
+ * Device implements an HDM-DB decoder (CXL.cache + CXL.mem).  Reflects
+ * the Cache_Capable bit (bit 0) in the CXL DVSEC Capability register.
+ *
+ * When clear: HDM-D decoder (CXL.mem only, no CXL.cache).  FLR does not
+ * require a Write-Back Invalidation (WBI) sequence; the device holds no
+ * coherent copies of host memory.
+ *
+ * When set: HDM-DB decoder (CXL 3.0+).  The kernel driver does not
+ * perform Write-Back Invalidation (WBI) automatically.  The VMM must
+ * issue a WBI sequence before asserting FLR to flush dirty device cache
+ * lines and prevent coherency violations, and should advertise
+ * Back-Invalidation support in the virtual CXL topology.
+ */
+#define VFIO_CXL_CAP_CACHE_CAPABLE		(1 << 1)
+	/*
+	 * Byte offset within the BAR to the CXL.mem register area start
+	 * (= comp_reg_offset + CXL_CM_OFFSET).	 This is where the CXL
+	 * Capability Array Header lives.
+	 */
+	__u64 hdm_regs_offset;
+	/*
+	 * Region indices for the two CXL VFIO device regions.
+	 * Avoids forcing userspace to scan all regions by type/subtype.
+	 */
+	__u32  dpa_region_index;       /* VFIO_REGION_SUBTYPE_CXL */
+	__u32  comp_regs_region_index; /* VFIO_REGION_SUBTYPE_CXL_COMP_REGS */
+};
+
 /**
  * VFIO_DEVICE_GET_REGION_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 8,
  *				       struct vfio_region_info)
@@ -370,6 +444,18 @@ struct vfio_region_info_cap_type {
  */
 #define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD	(1)
 
+/* 1e98 vendor PCI sub-types (CXL Consortium) */
+/*
+ * CXL memory region. Use with region type
+ * (PCI_VENDOR_ID_CXL | VFIO_REGION_TYPE_PCI_VENDOR_TYPE).
+ * DPA memory region (fault+zap mmap)
+ */
+#define VFIO_REGION_SUBTYPE_CXL                 (1)
+/*
+ * HDM decoder register emulation region (read/write only, no mmap).
+ */
+#define VFIO_REGION_SUBTYPE_CXL_COMP_REGS       (2)
+
 /* sub-types for VFIO_REGION_TYPE_GFX */
 #define VFIO_REGION_SUBTYPE_GFX_EDID            (1)
 
-- 
2.25.1



* [PATCH v2 07/20] vfio/pci: Add CXL state to vfio_pci_core_device
  2026-04-01 14:38 [PATCH v2 00/20] vfio/pci: Add CXL Type-2 device passthrough support mhonap
                   ` (5 preceding siblings ...)
  2026-04-01 14:39 ` [PATCH v2 06/20] vfio: UAPI for CXL-capable PCI device assignment mhonap
@ 2026-04-01 14:39 ` mhonap
  2026-04-01 14:39 ` [PATCH v2 08/20] vfio/pci: Add CONFIG_VFIO_CXL_CORE and stub CXL hooks mhonap
                   ` (12 subsequent siblings)
  19 siblings, 0 replies; 27+ messages in thread
From: mhonap @ 2026-04-01 14:39 UTC (permalink / raw)
  To: alwilliamson, dan.j.williams, jonathan.cameron, dave.jiang,
	alejandro.lucero-palau, dave, alison.schofield, vishal.l.verma,
	ira.weiny, dmatlack, shuah, jgg, yishaih, skolothumtho,
	kevin.tian, ankita
  Cc: vsethi, cjia, targupta, zhiw, kjaju, linux-kselftest,
	linux-kernel, linux-cxl, kvm, mhonap

From: Manish Honap <mhonap@nvidia.com>

Add CXL-specific state to vfio_pci_core_device structure to support
CXL Type-2 device passthrough.

The new vfio_pci_cxl_state structure holds the CXL core objects and
register geometry:
- struct cxl_dev_state: embedded CXL device state (from CXL core)
- struct cxl_memdev: CXL memory device
- Root and endpoint decoders
- Component/HDM register block location (BAR index, offsets, sizes)

Key design point: The CXL state pointer is NULL for non-CXL devices,
allowing vfio-pci-core to handle both CXL and standard PCI devices
with minimal overhead.

This follows the approach of making vfio-pci-core itself CXL-aware,
rather than requiring a separate variant driver.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
 drivers/vfio/pci/cxl/vfio_cxl_priv.h | 28 ++++++++++++++++++++++++++++
 include/linux/vfio_pci_core.h        |  3 +++
 2 files changed, 31 insertions(+)
 create mode 100644 drivers/vfio/pci/cxl/vfio_cxl_priv.h

diff --git a/drivers/vfio/pci/cxl/vfio_cxl_priv.h b/drivers/vfio/pci/cxl/vfio_cxl_priv.h
new file mode 100644
index 000000000000..4cecc25db410
--- /dev/null
+++ b/drivers/vfio/pci/cxl/vfio_cxl_priv.h
@@ -0,0 +1,28 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Common infrastructure for CXL Type-2 device variant drivers
+ *
+ * Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved
+ */
+
+#ifndef __LINUX_VFIO_CXL_PRIV_H
+#define __LINUX_VFIO_CXL_PRIV_H
+
+#include <cxl/cxl.h>
+#include <linux/types.h>
+
+/* CXL device state embedded in vfio_pci_core_device */
+struct vfio_pci_cxl_state {
+	struct cxl_dev_state         cxlds;
+	struct cxl_memdev           *cxlmd;
+	struct cxl_root_decoder     *cxlrd;
+	struct cxl_endpoint_decoder *cxled;
+	resource_size_t              hdm_reg_offset;
+	size_t                       hdm_reg_size;
+	resource_size_t              comp_reg_offset;
+	size_t                       comp_reg_size;
+	u8                           hdm_count;
+	u8                           comp_reg_bar;
+};
+
+#endif /* __LINUX_VFIO_CXL_PRIV_H */
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index 1ac86896875c..cd8ed98a82a3 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -30,6 +30,7 @@ struct vfio_pci_region;
 struct p2pdma_provider;
 struct dma_buf_phys_vec;
 struct dma_buf_attachment;
+struct vfio_pci_cxl_state;
 
 struct vfio_pci_eventfd {
 	struct eventfd_ctx	*ctx;
@@ -138,6 +140,7 @@ struct vfio_pci_core_device {
 	struct mutex		ioeventfds_lock;
 	struct list_head	ioeventfds_list;
 	struct vfio_pci_vf_token	*vf_token;
+	struct vfio_pci_cxl_state *cxl;
 	struct list_head		sriov_pfs_item;
 	struct vfio_pci_core_device	*sriov_pf_core_dev;
 	struct notifier_block	nb;
-- 
2.25.1



* [PATCH v2 08/20] vfio/pci: Add CONFIG_VFIO_CXL_CORE and stub CXL hooks
  2026-04-01 14:38 [PATCH v2 00/20] vfio/pci: Add CXL Type-2 device passthrough support mhonap
                   ` (6 preceding siblings ...)
  2026-04-01 14:39 ` [PATCH v2 07/20] vfio/pci: Add CXL state to vfio_pci_core_device mhonap
@ 2026-04-01 14:39 ` mhonap
  2026-04-01 14:39 ` [PATCH v2 09/20] vfio/cxl: Detect CXL DVSEC and probe HDM block mhonap
                   ` (11 subsequent siblings)
  19 siblings, 0 replies; 27+ messages in thread
From: mhonap @ 2026-04-01 14:39 UTC (permalink / raw)
  To: alwilliamson, dan.j.williams, jonathan.cameron, dave.jiang,
	alejandro.lucero-palau, dave, alison.schofield, vishal.l.verma,
	ira.weiny, dmatlack, shuah, jgg, yishaih, skolothumtho,
	kevin.tian, ankita
  Cc: vsethi, cjia, targupta, zhiw, kjaju, linux-kselftest,
	linux-kernel, linux-cxl, kvm, mhonap

From: Manish Honap <mhonap@nvidia.com>

Introduce the Kconfig option CONFIG_VFIO_CXL_CORE and the necessary
build rules to compile CXL.mem passthrough infrastructure for
vendor-specific CXL devices into the vfio-pci-core module.  The new
option depends on VFIO_PCI_CORE, CXL_BUS and CXL_MEM.

Wire up the detection and cleanup entry-point stubs in
vfio_pci_core_register_device() and vfio_pci_core_unregister_device()
so that subsequent patches can fill in the CXL-specific logic without
touching the vfio-pci-core flow again.

The vfio_cxl_core.c file added here is an empty skeleton; the actual
CXL detection and initialization code is introduced in the following
patch to keep this build-system patch reviewable on its own.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
 drivers/vfio/pci/Kconfig             |  2 ++
 drivers/vfio/pci/Makefile            |  1 +
 drivers/vfio/pci/cxl/Kconfig         |  9 ++++++
 drivers/vfio/pci/cxl/vfio_cxl_core.c | 41 ++++++++++++++++++++++++++++
 drivers/vfio/pci/vfio_pci_core.c     |  4 +++
 drivers/vfio/pci/vfio_pci_priv.h     | 14 ++++++++++
 6 files changed, 71 insertions(+)
 create mode 100644 drivers/vfio/pci/cxl/Kconfig
 create mode 100644 drivers/vfio/pci/cxl/vfio_cxl_core.c

diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index 1e82b44bda1a..b981a7c164ca 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -68,6 +68,8 @@ source "drivers/vfio/pci/virtio/Kconfig"
 
 source "drivers/vfio/pci/nvgrace-gpu/Kconfig"
 
+source "drivers/vfio/pci/cxl/Kconfig"
+
 source "drivers/vfio/pci/qat/Kconfig"
 
 source "drivers/vfio/pci/xe/Kconfig"
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index e0a0757dd1d2..ecb0eacbc089 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -1,6 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0-only
 
 vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
+vfio-pci-core-$(CONFIG_VFIO_CXL_CORE) += cxl/vfio_cxl_core.o
 vfio-pci-core-$(CONFIG_VFIO_PCI_ZDEV_KVM) += vfio_pci_zdev.o
 vfio-pci-core-$(CONFIG_VFIO_PCI_DMABUF) += vfio_pci_dmabuf.o
 obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o
diff --git a/drivers/vfio/pci/cxl/Kconfig b/drivers/vfio/pci/cxl/Kconfig
new file mode 100644
index 000000000000..fad53300fecf
--- /dev/null
+++ b/drivers/vfio/pci/cxl/Kconfig
@@ -0,0 +1,9 @@
+config VFIO_CXL_CORE
+	bool "VFIO CXL core"
+	depends on VFIO_PCI_CORE && CXL_BUS && CXL_MEM
+	help
+	    Extends vfio-pci-core with CXL.mem passthrough for vendor-specific
+	    CXL devices (CXL_DEVTYPE_DEVMEM) that implement HDM-D or HDM-DB
+	    decoders without the standard CXL memory expander class code
+	    (PCI_CLASS_MEMORY_CXL).  Covers CXL Type-2 accelerators and
+	    non-class-code Type-3 variants (e.g. compressed memory devices).
diff --git a/drivers/vfio/pci/cxl/vfio_cxl_core.c b/drivers/vfio/pci/cxl/vfio_cxl_core.c
new file mode 100644
index 000000000000..d12afec82ecd
--- /dev/null
+++ b/drivers/vfio/pci/cxl/vfio_cxl_core.c
@@ -0,0 +1,41 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * VFIO CXL Core - CXL.mem passthrough for vendor-specific CXL devices
+ *
+ * Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved
+ *
+ * This module extends vfio-pci-core to pass through CXL.mem regions for
+ * vendor-specific CXL devices (CXL_DEVTYPE_DEVMEM) that implement HDM-D or
+ * HDM-DB decoders but do not report the standard CXL memory expander class
+ * code (PCI_CLASS_MEMORY_CXL, 0x0502).  This covers both CXL Type-2
+ * accelerators (with CXL.cache) and non-class-code Type-3 variants (e.g.
+ * compressed memory devices) which cannot be paravirtualized by the host
+ * CXL subsystem and require direct DPA region access from the guest.
+ */
+
+#include <linux/vfio_pci_core.h>
+#include <linux/pci.h>
+#include <cxl/cxl.h>
+#include <cxl/pci.h>
+
+#include "../vfio_pci_priv.h"
+#include "vfio_cxl_priv.h"
+
+/**
+ * vfio_pci_cxl_detect_and_init - Detect and initialize a vendor-specific
+ *                                CXL.mem device
+ * @vdev: VFIO PCI device
+ *
+ * Called from vfio_pci_core_register_device(). Detects CXL DVSEC capability
+ * and initializes CXL features. On failure vdev->cxl remains NULL and the
+ * device operates as a standard PCI device.
+ */
+void vfio_pci_cxl_detect_and_init(struct vfio_pci_core_device *vdev)
+{
+}
+
+void vfio_pci_cxl_cleanup(struct vfio_pci_core_device *vdev)
+{
+}
+
+MODULE_IMPORT_NS("CXL");
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 3a11e6f450f7..b7364178e23d 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -2181,6 +2181,8 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
 	if (ret)
 		goto out_vf;
 
+	vfio_pci_cxl_detect_and_init(vdev);
+
 	vfio_pci_probe_power_state(vdev);
 
 	/*
@@ -2224,6 +2226,8 @@ void vfio_pci_core_unregister_device(struct vfio_pci_core_device *vdev)
 	vfio_pci_vf_uninit(vdev);
 	vfio_pci_vga_uninit(vdev);
 
+	vfio_pci_cxl_cleanup(vdev);
+
 	if (!disable_idle_d3)
 		pm_runtime_get_noresume(&vdev->pdev->dev);
 
diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
index 27ac280f00b9..d7df5538dcde 100644
--- a/drivers/vfio/pci/vfio_pci_priv.h
+++ b/drivers/vfio/pci/vfio_pci_priv.h
@@ -133,4 +133,18 @@ static inline void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev,
 }
 #endif
 
+#if IS_ENABLED(CONFIG_VFIO_CXL_CORE)
+
+void vfio_pci_cxl_detect_and_init(struct vfio_pci_core_device *vdev);
+void vfio_pci_cxl_cleanup(struct vfio_pci_core_device *vdev);
+
+#else
+
+static inline void
+vfio_pci_cxl_detect_and_init(struct vfio_pci_core_device *vdev) { }
+static inline void
+vfio_pci_cxl_cleanup(struct vfio_pci_core_device *vdev) { }
+
+#endif /* CONFIG_VFIO_CXL_CORE */
+
 #endif
-- 
2.25.1



* [PATCH v2 09/20] vfio/cxl: Detect CXL DVSEC and probe HDM block
  2026-04-01 14:38 [PATCH v2 00/20] vfio/pci: Add CXL Type-2 device passthrough support mhonap
                   ` (7 preceding siblings ...)
  2026-04-01 14:39 ` [PATCH v2 08/20] vfio/pci: Add CONFIG_VFIO_CXL_CORE and stub CXL hooks mhonap
@ 2026-04-01 14:39 ` mhonap
  2026-04-01 14:39 ` [PATCH v2 10/20] vfio/pci: Export config access helpers mhonap
                   ` (10 subsequent siblings)
  19 siblings, 0 replies; 27+ messages in thread
From: mhonap @ 2026-04-01 14:39 UTC (permalink / raw)
  To: alwilliamson, dan.j.williams, jonathan.cameron, dave.jiang,
	alejandro.lucero-palau, dave, alison.schofield, vishal.l.verma,
	ira.weiny, dmatlack, shuah, jgg, yishaih, skolothumtho,
	kevin.tian, ankita
  Cc: vsethi, cjia, targupta, zhiw, kjaju, linux-kselftest,
	linux-kernel, linux-cxl, kvm, mhonap

From: Manish Honap <mhonap@nvidia.com>

Detect a vendor-specific CXL device at vfio-pci bind time and probe
its HDM decoder register block.

vfio_cxl_create_device_state() allocates per-device state via devm and
reads MEM_CAPABLE and CACHE_CAPABLE from the CXL DVSEC.

vfio_cxl_setup_regs() locates the component register block, temporarily
maps it, calls cxl_probe_component_regs() to find the HDM block, then
releases the mapping.

vfio_pci_cxl_detect_and_init() chains these two steps. If either fails,
vdev->cxl stays NULL and the device falls back to plain vfio-pci.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
 drivers/vfio/pci/cxl/vfio_cxl_core.c | 217 +++++++++++++++++++++++++++
 drivers/vfio/pci/cxl/vfio_cxl_priv.h |  12 ++
 2 files changed, 229 insertions(+)

diff --git a/drivers/vfio/pci/cxl/vfio_cxl_core.c b/drivers/vfio/pci/cxl/vfio_cxl_core.c
index d12afec82ecd..b1c7603590b5 100644
--- a/drivers/vfio/pci/cxl/vfio_cxl_core.c
+++ b/drivers/vfio/pci/cxl/vfio_cxl_core.c
@@ -21,6 +21,158 @@
 #include "../vfio_pci_priv.h"
 #include "vfio_cxl_priv.h"
 
+/*
+ * vfio_cxl_create_device_state - Allocate and validate CXL device state
+ *
+ * Returns a pointer to the allocated vfio_pci_cxl_state on success, or
+ * ERR_PTR on failure.  The allocation uses devm; the caller must call
+ * devm_kfree(&pdev->dev, cxl) on any subsequent setup failure to release
+ * the resource before device unbind.  Using devm_kfree() to undo a devm
+ * allocation early is explicitly supported by the devres API.
+ *
+ * The caller assigns vdev->cxl only after all setup steps succeed, preventing
+ * partially-initialised state from being visible through vdev->cxl on any
+ * failure path.
+ */
+static struct vfio_pci_cxl_state *
+vfio_cxl_create_device_state(struct pci_dev *pdev, u16 dvsec)
+{
+	struct vfio_pci_cxl_state *cxl;
+	u16 cap_word;
+	u32 hdr1;
+
+	/* Freed automatically when pdev->dev is released. */
+	cxl = devm_cxl_dev_state_create(&pdev->dev,
+					CXL_DEVTYPE_DEVMEM,
+					pdev->dev.id, dvsec,
+					struct vfio_pci_cxl_state,
+					cxlds, false);
+	if (!cxl)
+		return ERR_PTR(-ENOMEM);
+
+	pci_read_config_dword(pdev, dvsec + PCI_DVSEC_HEADER1, &hdr1);
+	cxl->dvsec_len = PCI_DVSEC_HEADER1_LEN(hdr1);
+
+	pci_read_config_word(pdev, dvsec + CXL_DVSEC_CAPABILITY_OFFSET,
+			     &cap_word);
+
+	/*
+	 * Only handle vendor devices (class != 0x0502) with Mem_Capable set.
+	 * CACHE_CAPABLE is forwarded to the VMM so it knows whether a WBI
+	 * sequence is needed before FLR.
+	 */
+	if (!FIELD_GET(CXL_DVSEC_MEM_CAPABLE, cap_word) ||
+	    (pdev->class >> 8) == PCI_CLASS_MEMORY_CXL) {
+		devm_kfree(&pdev->dev, cxl);
+		return ERR_PTR(-ENODEV);
+	}
+
+	cxl->cache_capable = FIELD_GET(CXL_DVSEC_CACHE_CAPABLE, cap_word);
+
+	return cxl;
+}
+
+static int vfio_cxl_setup_regs(struct vfio_pci_core_device *vdev,
+			       struct vfio_pci_cxl_state *cxl)
+{
+	struct cxl_register_map *map = &cxl->cxlds.reg_map;
+	resource_size_t offset, bar_offset, size;
+	struct pci_dev *pdev = vdev->pdev;
+	void __iomem *base;
+	int ret;
+	u8 count;
+	u8 bar;
+
+	if (WARN_ON_ONCE(!pci_is_enabled(pdev)))
+		return -EINVAL;
+
+	/* Find component register block via Register Locator DVSEC */
+	ret = cxl_find_regblock(pdev, CXL_REGLOC_RBI_COMPONENT, map);
+	if (ret)
+		return ret;
+
+	/*
+	 * Request the region and map.  This is a transient mapping
+	 * used only to probe register capabilities; released immediately
+	 * after cxl_probe_component_regs() returns.
+	 */
+	if (!request_mem_region(map->resource, map->max_size, "vfio-cxl-probe"))
+		return -EBUSY;
+
+	base = ioremap(map->resource, map->max_size);
+	if (!base) {
+		ret = -ENOMEM;
+		goto failed_release;
+	}
+
+	/* Probe component register capabilities */
+	cxl_probe_component_regs(&pdev->dev, base, &map->component_map);
+
+	/* Check if HDM decoder was found */
+	if (!map->component_map.hdm_decoder.valid) {
+		ret = -ENODEV;
+		goto failed_unmap;
+	}
+
+	pci_dbg(pdev, "vfio-cxl: HDM decoder at offset=0x%lx, size=0x%lx\n",
+		map->component_map.hdm_decoder.offset,
+		map->component_map.hdm_decoder.size);
+
+	/* Get HDM register info */
+	ret = cxl_get_hdm_info(&cxl->cxlds, &count, &offset, &size);
+	if (ret)
+		goto failed_unmap;
+
+	if (!count || !size) {
+		ret = -ENODEV;
+		goto failed_unmap;
+	}
+
+	cxl->hdm_count = count;
+	/*
+	 * cxl_get_hdm_info() returns rmap->offset = CXL_CM_OFFSET + <hdm_within_cm>
+	 * (see cxl_probe_component_regs() which does base += CXL_CM_OFFSET before
+	 * reading caps and stores CXL_CM_OFFSET + cap_ptr as the offset).
+	 * Subtract CXL_CM_OFFSET so hdm_reg_offset is relative to the CXL.mem
+	 * register area start, which is where comp_reg_virt[0] is anchored.
+	 * The physical BAR address for hdm_iobase is recovered by adding
+	 * CXL_CM_OFFSET back in vfio_cxl_setup_virt_regs().
+	 */
+	cxl->hdm_reg_offset = offset - CXL_CM_OFFSET;
+	cxl->hdm_reg_size = size;
+
+	ret = cxl_regblock_get_bar_info(map, &bar, &bar_offset);
+	if (ret)
+		goto failed_unmap;
+
+	cxl->comp_reg_bar = bar;
+	cxl->comp_reg_offset = bar_offset;
+	cxl->comp_reg_size = CXL_COMPONENT_REG_BLOCK_SIZE;
+
+	iounmap(base);
+	release_mem_region(map->resource, map->max_size);
+
+	return 0;
+
+failed_unmap:
+	iounmap(base);
+failed_release:
+	release_mem_region(map->resource, map->max_size);
+
+	return ret;
+}
+
+/*
+ * Free CXL state early on probe failure.  devm_kfree() on a live devres
+ * allocation removes it from the list immediately, so the normal devres
+ * teardown at unbind time won't double-free it.
+ */
+static void vfio_cxl_dev_state_free(struct pci_dev *pdev,
+				    struct vfio_pci_cxl_state *cxl)
+{
+	devm_kfree(&pdev->dev, cxl);
+}
+
 /**
  * vfio_pci_cxl_detect_and_init - Detect and initialize a vendor-specific
  *                                CXL.mem device
@@ -32,10 +184,75 @@
  */
 void vfio_pci_cxl_detect_and_init(struct vfio_pci_core_device *vdev)
 {
+	struct pci_dev *pdev = vdev->pdev;
+	struct vfio_pci_cxl_state *cxl;
+	u16 dvsec;
+	int ret;
+
+	if (!pcie_is_cxl(pdev))
+		return;
+
+	dvsec = pci_find_dvsec_capability(pdev,
+					  PCI_VENDOR_ID_CXL,
+					  PCI_DVSEC_CXL_DEVICE);
+	if (!dvsec)
+		return;
+
+	/*
+	 * CXL DVSEC found: any failure from here is a hard probe error on
+	 * a confirmed CXL-capable device, not a silent non-CXL fallback.
+	 * Warn the operator so misconfiguration is visible.
+	 */
+	cxl = vfio_cxl_create_device_state(pdev, dvsec);
+	if (IS_ERR(cxl)) {
+		if (PTR_ERR(cxl) != -ENODEV)
+			pci_warn(pdev,
+				 "vfio-cxl: CXL device state allocation failed: %ld\n",
+				 PTR_ERR(cxl));
+		return;
+	}
+
+	/*
+	 * Required for ioremap of the component register block and
+	 * calls to cxl_probe_component_regs().
+	 */
+	ret = pci_enable_device_mem(pdev);
+	if (ret) {
+		pci_warn(pdev,
+			 "vfio-cxl: pci_enable_device_mem failed: %d\n", ret);
+		goto free_cxl;
+	}
+
+	ret = vfio_cxl_setup_regs(vdev, cxl);
+	if (ret) {
+		pci_warn(pdev,
+			 "vfio-cxl: HDM register probing failed: %d\n", ret);
+		pci_disable_device(pdev);
+		goto free_cxl;
+	}
+
+	pci_disable_device(pdev);
+
+	/*
+	 * Register probing succeeded.  Assign vdev->cxl now so that
+	 * all subsequent helpers can access state via vdev->cxl.
+	 * All failure paths below clear vdev->cxl before calling
+	 * vfio_cxl_dev_state_free().
+	 */
+	vdev->cxl = cxl;
+
+	return;
+
+free_cxl:
+	vfio_cxl_dev_state_free(pdev, cxl);
 }
 
 void vfio_pci_cxl_cleanup(struct vfio_pci_core_device *vdev)
 {
+	struct vfio_pci_cxl_state *cxl = vdev->cxl;
+
+	if (!cxl)
+		return;
 }
 
 MODULE_IMPORT_NS("CXL");
diff --git a/drivers/vfio/pci/cxl/vfio_cxl_priv.h b/drivers/vfio/pci/cxl/vfio_cxl_priv.h
index 4cecc25db410..54b1f6d885aa 100644
--- a/drivers/vfio/pci/cxl/vfio_cxl_priv.h
+++ b/drivers/vfio/pci/cxl/vfio_cxl_priv.h
@@ -21,8 +21,20 @@ struct vfio_pci_cxl_state {
 	size_t                       hdm_reg_size;
 	resource_size_t              comp_reg_offset;
 	size_t                       comp_reg_size;
+	u16                          dvsec_len;
 	u8                           hdm_count;
 	u8                           comp_reg_bar;
+	bool                         cache_capable;
 };
 
+/*
+ * CXL DVSEC for CXL Devices - register offsets within the DVSEC
+ * (CXL 4.0 8.1.3).
+ * Offsets are relative to the DVSEC capability base (cxl->dvsec).
+ */
+#define CXL_DVSEC_CAPABILITY_OFFSET 0xa
+#define CXL_DVSEC_MEM_CAPABLE	    BIT(2)
+/* CXL DVSEC Capability register bit 0: device supports CXL.cache (HDM-DB) */
+#define CXL_DVSEC_CACHE_CAPABLE	    BIT(0)
+
 #endif /* __LINUX_VFIO_CXL_PRIV_H */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v2 10/20] vfio/pci: Export config access helpers
  2026-04-01 14:38 [PATCH v2 00/20] vfio/pci: Add CXL Type-2 device passthrough support mhonap
                   ` (8 preceding siblings ...)
  2026-04-01 14:39 ` [PATCH v2 09/20] vfio/cxl: Detect CXL DVSEC and probe HDM block mhonap
@ 2026-04-01 14:39 ` mhonap
  2026-04-01 14:39 ` [PATCH v2 11/20] vfio/cxl: Introduce HDM decoder register emulation framework mhonap
                   ` (9 subsequent siblings)
  19 siblings, 0 replies; 27+ messages in thread
From: mhonap @ 2026-04-01 14:39 UTC (permalink / raw)
  To: alwilliamson, dan.j.williams, jonathan.cameron, dave.jiang,
	alejandro.lucero-palau, dave, alison.schofield, vishal.l.verma,
	ira.weiny, dmatlack, shuah, jgg, yishaih, skolothumtho,
	kevin.tian, ankita
  Cc: vsethi, cjia, targupta, zhiw, kjaju, linux-kselftest,
	linux-kernel, linux-cxl, kvm, mhonap

From: Manish Honap <mhonap@nvidia.com>

Promote vfio_raw_config_write() and vfio_raw_config_read() to non-static so
that the CXL DVSEC write handler in the next patch can call them.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
 drivers/vfio/pci/vfio_pci_config.c | 12 ++++++------
 drivers/vfio/pci/vfio_pci_priv.h   |  8 ++++++++
 2 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index dc4e510e6e1b..79aaf270adb2 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -270,9 +270,9 @@ static int vfio_direct_config_read(struct vfio_pci_core_device *vdev, int pos,
 }
 
 /* Raw access skips any kind of virtualization */
-static int vfio_raw_config_write(struct vfio_pci_core_device *vdev, int pos,
-				 int count, struct perm_bits *perm,
-				 int offset, __le32 val)
+int vfio_raw_config_write(struct vfio_pci_core_device *vdev, int pos,
+			  int count, struct perm_bits *perm,
+			  int offset, __le32 val)
 {
 	int ret;
 
@@ -283,9 +283,9 @@ static int vfio_raw_config_write(struct vfio_pci_core_device *vdev, int pos,
 	return count;
 }
 
-static int vfio_raw_config_read(struct vfio_pci_core_device *vdev, int pos,
-				int count, struct perm_bits *perm,
-				int offset, __le32 *val)
+int vfio_raw_config_read(struct vfio_pci_core_device *vdev, int pos,
+			 int count, struct perm_bits *perm,
+			 int offset, __le32 *val)
 {
 	int ret;
 
diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
index d7df5538dcde..1082ba43bafe 100644
--- a/drivers/vfio/pci/vfio_pci_priv.h
+++ b/drivers/vfio/pci/vfio_pci_priv.h
@@ -37,6 +37,14 @@ int vfio_pci_set_irqs_ioctl(struct vfio_pci_core_device *vdev, uint32_t flags,
 ssize_t vfio_pci_config_rw(struct vfio_pci_core_device *vdev, char __user *buf,
 			   size_t count, loff_t *ppos, bool iswrite);
 
+int vfio_raw_config_write(struct vfio_pci_core_device *vdev, int pos,
+			  int count, struct perm_bits *perm,
+			  int offset, __le32 val);
+
+int vfio_raw_config_read(struct vfio_pci_core_device *vdev, int pos,
+			 int count, struct perm_bits *perm,
+			 int offset, __le32 *val);
+
 ssize_t vfio_pci_bar_rw(struct vfio_pci_core_device *vdev, char __user *buf,
 			size_t count, loff_t *ppos, bool iswrite);
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v2 11/20] vfio/cxl: Introduce HDM decoder register emulation framework
  2026-04-01 14:38 [PATCH v2 00/20] vfio/pci: Add CXL Type-2 device passthrough support mhonap
                   ` (9 preceding siblings ...)
  2026-04-01 14:39 ` [PATCH v2 10/20] vfio/pci: Export config access helpers mhonap
@ 2026-04-01 14:39 ` mhonap
  2026-04-01 14:39 ` [PATCH v2 12/20] vfio/cxl: Wait for HDM ranges and create memdev mhonap
                   ` (8 subsequent siblings)
  19 siblings, 0 replies; 27+ messages in thread
From: mhonap @ 2026-04-01 14:39 UTC (permalink / raw)
  To: alwilliamson, dan.j.williams, jonathan.cameron, dave.jiang,
	alejandro.lucero-palau, dave, alison.schofield, vishal.l.verma,
	ira.weiny, dmatlack, shuah, jgg, yishaih, skolothumtho,
	kevin.tian, ankita
  Cc: vsethi, cjia, targupta, zhiw, kjaju, linux-kselftest,
	linux-kernel, linux-cxl, kvm, mhonap

From: Manish Honap <mhonap@nvidia.com>

Add HDM decoder register emulation for CXL devices assigned to a guest.

New file vfio_cxl_emu.c allocates comp_reg_virt[], a shadow of the
CXL.mem register area (Capability Array Header through the end of the
HDM decoder block), snapshots it from MMIO after probe, and registers
a VFIO device region (VFIO_REGION_SUBTYPE_CXL_COMP_REGS) with
read/write ops but no mmap, so every access hits the emulated buffer
and write dispatchers.

vfio_cxl_setup_virt_regs() is called from the tail of
vfio_cxl_setup_regs(); vfio_cxl_clean_virt_regs() runs on cleanup.

HDM decoder register defines come from include/uapi/cxl/cxl_regs.h.
Bits with no hardware equivalent stay in vfio_cxl_priv.h.

hdm_decoder_n_ctrl_write() allows the guest to clear the LOCK bit.
A firmware-committed decoder arrives with LOCK=1; the guest driver
must clear it before reprogramming BASE and SIZE with the VM's GPA.
Such a write clears the bit in the shadow while preserving all other
fields.

Co-developed-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
 drivers/vfio/pci/Makefile            |   2 +-
 drivers/vfio/pci/cxl/vfio_cxl_core.c |   5 +
 drivers/vfio/pci/cxl/vfio_cxl_emu.c  | 433 +++++++++++++++++++++++++++
 drivers/vfio/pci/cxl/vfio_cxl_priv.h |  47 +++
 include/uapi/cxl/cxl_regs.h          |   5 +
 5 files changed, 491 insertions(+), 1 deletion(-)
 create mode 100644 drivers/vfio/pci/cxl/vfio_cxl_emu.c

diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index ecb0eacbc089..bef916495eae 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -1,7 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0-only
 
 vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
-vfio-pci-core-$(CONFIG_VFIO_CXL_CORE) += cxl/vfio_cxl_core.o
+vfio-pci-core-$(CONFIG_VFIO_CXL_CORE) += cxl/vfio_cxl_core.o cxl/vfio_cxl_emu.o
 vfio-pci-core-$(CONFIG_VFIO_PCI_ZDEV_KVM) += vfio_pci_zdev.o
 vfio-pci-core-$(CONFIG_VFIO_PCI_DMABUF) += vfio_pci_dmabuf.o
 obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o
diff --git a/drivers/vfio/pci/cxl/vfio_cxl_core.c b/drivers/vfio/pci/cxl/vfio_cxl_core.c
index b1c7603590b5..0b9e4419cd47 100644
--- a/drivers/vfio/pci/cxl/vfio_cxl_core.c
+++ b/drivers/vfio/pci/cxl/vfio_cxl_core.c
@@ -149,8 +149,11 @@ static int vfio_cxl_setup_regs(struct vfio_pci_core_device *vdev,
 	cxl->comp_reg_offset = bar_offset;
 	cxl->comp_reg_size = CXL_COMPONENT_REG_BLOCK_SIZE;
 
+	ret = vfio_cxl_setup_virt_regs(vdev, cxl, base);
 	iounmap(base);
 	release_mem_region(map->resource, map->max_size);
+	if (ret)
+		return ret;
 
 	return 0;
 
@@ -253,6 +256,8 @@ void vfio_pci_cxl_cleanup(struct vfio_pci_core_device *vdev)
 
 	if (!cxl)
 		return;
+
+	vfio_cxl_clean_virt_regs(cxl);
 }
 
 MODULE_IMPORT_NS("CXL");
diff --git a/drivers/vfio/pci/cxl/vfio_cxl_emu.c b/drivers/vfio/pci/cxl/vfio_cxl_emu.c
new file mode 100644
index 000000000000..6fb02253e631
--- /dev/null
+++ b/drivers/vfio/pci/cxl/vfio_cxl_emu.c
@@ -0,0 +1,433 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved
+ */
+
+#include <linux/bitops.h>
+#include <linux/vfio_pci_core.h>
+
+#include "../vfio_pci_priv.h"
+#include "vfio_cxl_priv.h"
+
+/*
+ * comp_reg_virt[] shadow layout:
+ *   Covers the full CXL.mem register area (starting at CXL_CM_OFFSET
+ *   within the component register block).  Index 0 is the CXL Capability
+ *   Array Header; the HDM decoder block starts at index
+ *   hdm_reg_offset / sizeof(__le32).
+ *
+ * Register layout within the HDM block (CXL spec 4.0 8.2.4.20 CXL HDM Decoder
+ * Capability Structure):
+ *   0x00: HDM Decoder Capability
+ *   0x04: HDM Decoder Global Control
+ *   0x08: (reserved)
+ *   0x0c: (reserved)
+ *   For each decoder N (N=0..hdm_count-1), at base 0x10 + N*0x20:
+ *     +0x00: BASE_LO
+ *     +0x04: BASE_HI
+ *     +0x08: SIZE_LO
+ *     +0x0c: SIZE_HI
+ *     +0x10: CTRL
+ *     +0x14: TARGET_LIST_LO
+ *     +0x18: TARGET_LIST_HI
+ *     +0x1c: (reserved)
+ */
+
+static inline __le32 *hdm_reg_ptr(struct vfio_pci_cxl_state *cxl, u32 hdm_off)
+{
+	/*
+	 * hdm_off is a byte offset within the HDM decoder block.
+	 * comp_reg_virt covers the CXL.mem register area starting at
+	 * CXL_CM_OFFSET within the component register block.
+	 * hdm_reg_offset is CXL.mem-relative, so adding hdm_reg_offset
+	 * gives the correct index into comp_reg_virt[].
+	 */
+	return &cxl->comp_reg_virt[(cxl->hdm_reg_offset + hdm_off) /
+				   sizeof(__le32)];
+}
+
+static ssize_t virt_hdm_rev_reg_write(struct vfio_pci_core_device *vdev,
+				      const __le32 *val32, u64 offset, u64 size)
+{
+	/* Discard writes to reserved registers. */
+	return size;
+}
+
+static ssize_t hdm_decoder_n_lo_write(struct vfio_pci_core_device *vdev,
+				      const __le32 *val32, u64 offset, u64 size)
+{
+	u32 new_val = le32_to_cpu(*val32);
+
+	if (WARN_ON_ONCE(size != CXL_REG_SIZE_DWORD))
+		return -EINVAL;
+
+	/* Bits [27:0] are reserved. */
+	new_val &= ~CXL_HDM_DECODER_BASE_LO_RESERVED_MASK;
+
+	*hdm_reg_ptr(vdev->cxl, offset) = cpu_to_le32(new_val);
+
+	return size;
+}
+
+static ssize_t hdm_decoder_global_ctrl_write(struct vfio_pci_core_device *vdev,
+					     const __le32 *val32, u64 size)
+{
+	u32 hdm_gcap;
+	u32 new_val = le32_to_cpu(*val32);
+
+	if (WARN_ON_ONCE(size != CXL_REG_SIZE_DWORD))
+		return -EINVAL;
+
+	/* Bits [31:2] are reserved. */
+	new_val &= ~CXL_HDM_DECODER_GLOBAL_CTRL_RESERVED_MASK;
+
+	/* Poison On Decode Error Enable (bit 0) is RO=0 if not supported. */
+	hdm_gcap = le32_to_cpu(*hdm_reg_ptr(vdev->cxl,
+					    CXL_HDM_DECODER_CAP_OFFSET));
+	if (!(hdm_gcap & CXL_HDM_DECODER_POISON_ON_DECODE_ERR))
+		new_val &= ~CXL_HDM_DECODER_GLOBAL_CTRL_POISON_EN_BIT;
+
+	*hdm_reg_ptr(vdev->cxl, CXL_HDM_DECODER_CTRL_OFFSET) =
+		cpu_to_le32(new_val);
+
+	return size;
+}
+
+/**
+ * hdm_decoder_n_ctrl_write - Write handler for HDM decoder CTRL register.
+ * @vdev:   VFIO PCI core device
+ * @val32:  New register value supplied by userspace (little-endian)
+ * @offset: Byte offset within the HDM block for this decoder's CTRL register
+ * @size:   Access size in bytes; must equal CXL_REG_SIZE_DWORD
+ *
+ * The COMMIT bit (bit 9) is the key: setting it requests that the decoder
+ * programming be committed.  The emulated COMMITTED bit (bit 10) mirrors COMMIT
+ * immediately to allow QEMU's notify_change to detect the transition and
+ * map/unmap the DPA MemoryRegion in the guest address space.
+ *
+ * Note: the actual hardware HDM decoder programming (writing the real
+ * BASE/SIZE with host physical addresses) happens in the QEMU notify_change
+ * callback BEFORE this write reaches the hardware.  This ordering is
+ * correct because vfio_region_write() calls notify_change() first.
+ *
+ * Return: @size on success, %-EINVAL if @size is not %CXL_REG_SIZE_DWORD.
+ */
+static ssize_t hdm_decoder_n_ctrl_write(struct vfio_pci_core_device *vdev,
+					const __le32 *val32, u64 offset, u64 size)
+{
+	u32 hdm_gcap;
+	u32 ro_mask = CXL_HDM_DECODER_CTRL_RO_BITS_MASK;
+	u32 rev_mask = CXL_HDM_DECODER_CTRL_RESERVED_MASK;
+	u32 new_val = le32_to_cpu(*val32);
+	u32 cur_val;
+
+	if (WARN_ON_ONCE(size != CXL_REG_SIZE_DWORD))
+		return -EINVAL;
+
+	cur_val = le32_to_cpu(*hdm_reg_ptr(vdev->cxl, offset));
+	if (cur_val & CXL_HDM_DECODER0_CTRL_LOCK) {
+		if (new_val & CXL_HDM_DECODER0_CTRL_LOCK)
+			return size;
+
+		/* LOCK->0 only: preserve all other bits, clear LOCK */
+		*hdm_reg_ptr(vdev->cxl, offset) = cpu_to_le32(
+			cur_val & ~CXL_HDM_DECODER0_CTRL_LOCK);
+		return size;
+	}
+
+	hdm_gcap = le32_to_cpu(*hdm_reg_ptr(vdev->cxl,
+					    CXL_HDM_DECODER_CAP_OFFSET));
+	ro_mask |= CXL_HDM_DECODER_CTRL_DEVICE_BITS_RO;
+	rev_mask |= CXL_HDM_DECODER_CTRL_DEVICE_RESERVED;
+
+	if (!(hdm_gcap & CXL_HDM_DECODER_UIO_CAPABLE))
+		rev_mask |= CXL_HDM_DECODER_CTRL_UIO_RESERVED;
+
+	new_val &= ~rev_mask;
+	cur_val &= ro_mask;
+	new_val = (new_val & ~ro_mask) | cur_val;
+
+	/*
+	 * Mirror COMMIT to COMMITTED immediately in the emulated state.
+	 */
+	if (new_val & CXL_HDM_DECODER0_CTRL_COMMIT)
+		new_val |= CXL_HDM_DECODER0_CTRL_COMMITTED;
+	else
+		new_val &= ~CXL_HDM_DECODER0_CTRL_COMMITTED;
+
+	*hdm_reg_ptr(vdev->cxl, offset) = cpu_to_le32(new_val);
+
+	return size;
+}
+
+/*
+ * Dispatch table for COMP_REGS region writes. Indexed by byte offset within
+ * the HDM decoder block. Returns the appropriate write handler.
+ *
+ * Layout:
+ *   0x00	  HDM Decoder Capability  (RO)
+ *   0x04	  HDM Global Control	  (RW with reserved masking)
+ *   0x08-0x0f	  (reserved)		  (ignored)
+ *   Per decoder N, base = 0x10 + N*0x20:
+ *     base+0x00  BASE_LO  (RW, [27:0] reserved)
+ *     base+0x04  BASE_HI  (RW)
+ *     base+0x08  SIZE_LO  (RW, [27:0] reserved)
+ *     base+0x0c  SIZE_HI  (RW)
+ *     base+0x10  CTRL	   (RW, complex rules)
+ *     base+0x14  TARGET_LIST_LO  (ignored for Type-2)
+ *     base+0x18  TARGET_LIST_HI  (ignored for Type-2)
+ *     base+0x1c  (reserved)	 (ignored)
+ */
+static ssize_t comp_regs_dispatch_write(struct vfio_pci_core_device *vdev,
+					u32 off, const __le32 *val32, u32 size)
+{
+	struct vfio_pci_cxl_state *cxl = vdev->cxl;
+	u32 dec_base, dec_off;
+
+	/* HDM Decoder Capability (0x00): RO */
+	if (off == CXL_HDM_DECODER_CAP_OFFSET)
+		return size;
+
+	/* HDM Global Control (0x04) */
+	if (off == CXL_HDM_DECODER_CTRL_OFFSET)
+		return hdm_decoder_global_ctrl_write(vdev, val32, size);
+
+	/*
+	 * Offsets 0x08-0x0f are reserved per CXL 4.0 Table 8-115.
+	 * Per-decoder registers start at 0x10, stride 0x20
+	 */
+	if (off < CXL_HDM_DECODER_FIRST_BLOCK_OFFSET)
+		return size; /* reserved gap */
+
+	dec_base = CXL_HDM_DECODER_FIRST_BLOCK_OFFSET;
+	/*
+	 * Reject accesses beyond the last implemented HDM decoder.
+	 * Without this check an out-of-bounds offset would silently
+	 * corrupt comp_reg_virt[] memory past the end of the allocation.
+	 */
+	if ((off - dec_base) / CXL_HDM_DECODER_BLOCK_STRIDE >= cxl->hdm_count)
+		return size;
+
+	dec_off = (off - dec_base) % CXL_HDM_DECODER_BLOCK_STRIDE;
+
+	switch (dec_off) {
+	case CXL_HDM_DECODER_N_BASE_LOW_OFFSET:	 /* BASE_LO */
+	case CXL_HDM_DECODER_N_SIZE_LOW_OFFSET:	 /* SIZE_LO */
+		return hdm_decoder_n_lo_write(vdev, val32, off, size);
+	case CXL_HDM_DECODER_N_BASE_HIGH_OFFSET: /* BASE_HI */
+	case CXL_HDM_DECODER_N_SIZE_HIGH_OFFSET: /* SIZE_HI */
+	{
+		/* Full 32-bit write, no reserved bits; frozen while LOCK is set */
+		u32 ctrl_off = off - dec_off + CXL_HDM_DECODER_N_CTRL_OFFSET;
+		u32 ctrl = le32_to_cpu(*hdm_reg_ptr(cxl, ctrl_off));
+
+		if (ctrl & CXL_HDM_DECODER0_CTRL_LOCK)
+			return size;
+		*hdm_reg_ptr(cxl, off) = *val32;
+		return size;
+	}
+	case CXL_HDM_DECODER_N_CTRL_OFFSET:	  /* CTRL */
+		return hdm_decoder_n_ctrl_write(vdev, val32, off, size);
+	case CXL_HDM_DECODER_N_TARGET_LIST_LOW_OFFSET:
+	case CXL_HDM_DECODER_N_TARGET_LIST_HIGH_OFFSET:
+	case CXL_HDM_DECODER_N_REV_OFFSET:
+		return virt_hdm_rev_reg_write(vdev, val32, off, size);
+	default:
+		return size;
+	}
+}
+
+/*
+ * vfio_cxl_comp_regs_rw - regops rw handler for
+ * VFIO_REGION_SUBTYPE_CXL_COMP_REGS.
+ *
+ * Reads return the emulated HDM state (comp_reg_virt[]).
+ * Writes go through comp_regs_dispatch_write() for bit-field enforcement.
+ * Only 4-byte aligned 4-byte accesses are supported (hardware requirement).
+ */
+static ssize_t vfio_cxl_comp_regs_rw(struct vfio_pci_core_device *vdev,
+				     char __user *buf, size_t count,
+				     loff_t *ppos, bool iswrite)
+{
+	struct vfio_pci_cxl_state *cxl = vdev->cxl;
+	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
+	size_t done = 0;
+
+	if (!count)
+		return 0;
+
+	/* Clamp to total region size: cap array prefix + HDM block */
+	if (pos >= cxl->hdm_reg_offset + cxl->hdm_reg_size)
+		return -EINVAL;
+	count = min(count,
+		    (size_t)(cxl->hdm_reg_offset + cxl->hdm_reg_size - pos));
+
+	while (done < count) {
+		u32 sz	 = count - done;
+		u32 off	 = pos + done;
+		__le32 v;
+
+		/* Enforce exactly 4-byte, 4-byte-aligned accesses */
+		if (sz != CXL_REG_SIZE_DWORD || (off & 0x3))
+			return done ? (ssize_t)done : -EINVAL;
+
+		if (iswrite) {
+			if (off < cxl->hdm_reg_offset) {
+				/* Cap array area is read-only; discard writes */
+				done += sizeof(v);
+				continue;
+			}
+			if (copy_from_user(&v, buf + done, sizeof(v)))
+				return done ? (ssize_t)done : -EFAULT;
+			comp_regs_dispatch_write(vdev,
+						 off - cxl->hdm_reg_offset,
+						 &v, sizeof(v));
+		} else {
+			/* Read from the shadow buffer - covers cap array and HDM */
+			v = cxl->comp_reg_virt[off / sizeof(__le32)];
+			if (copy_to_user(buf + done, &v, sizeof(v)))
+				return done ? (ssize_t)done : -EFAULT;
+		}
+		done += sizeof(v);
+	}
+
+	*ppos += done;
+	return done;
+}
+
+static void vfio_cxl_comp_regs_release(struct vfio_pci_core_device *vdev,
+				       struct vfio_pci_region *region)
+{
+	/* comp_reg_virt is freed in vfio_cxl_clean_virt_regs() */
+}
+
+static const struct vfio_pci_regops vfio_cxl_comp_regs_ops = {
+	.rw	 = vfio_cxl_comp_regs_rw,
+	.release = vfio_cxl_comp_regs_release,
+};
+
+/*
+ * vfio_cxl_setup_virt_regs - Allocate emulated HDM register state.
+ *
+ * Allocates comp_reg_virt[] as a __le32 array covering the CXL.mem register
+ * area from the Capability Array Header through the end of the HDM decoder
+ * block (hdm_reg_offset + hdm_reg_size bytes), snapshotted from hardware
+ *
+ * DVSEC state is accessed via vdev->vconfig (see the following patch).
+ */
+int vfio_cxl_setup_virt_regs(struct vfio_pci_core_device *vdev,
+			     struct vfio_pci_cxl_state *cxl,
+			     void __iomem *cap_base)
+{
+	size_t total_size, nregs, i;
+
+	if (WARN_ON(!cxl->hdm_reg_size))
+		return -EINVAL;
+
+	total_size = cxl->hdm_reg_offset + cxl->hdm_reg_size;
+
+	if (pci_resource_len(vdev->pdev, cxl->comp_reg_bar) <
+	    cxl->comp_reg_offset + CXL_CM_OFFSET + total_size)
+		return -ENODEV;
+
+	nregs = total_size / sizeof(__le32);
+	cxl->comp_reg_virt = kcalloc(nregs, sizeof(__le32), GFP_KERNEL);
+	if (!cxl->comp_reg_virt)
+		return -ENOMEM;
+
+	/*
+	 * Snapshot the CXL.mem register area from the caller's mapping.
+	 * cap_base maps the component register block from comp_reg_offset.
+	 * The CXL.mem registers start at CXL_CM_OFFSET (= 0x1000) within that
+	 * block; reading from cap_base + CXL_CM_OFFSET ensures comp_reg_virt[0]
+	 * holds the CXL Capability Array Header required by guest drivers.
+	 */
+	for (i = 0; i < nregs; i++)
+		cxl->comp_reg_virt[i] =
+			cpu_to_le32(readl(cap_base + CXL_CM_OFFSET +
+					  i * sizeof(__le32)));
+
+	/*
+	 * Establish persistent mapping; kept alive until
+	 * vfio_cxl_clean_virt_regs().
+	 */
+	cxl->hdm_iobase = ioremap(pci_resource_start(vdev->pdev,
+						     cxl->comp_reg_bar) +
+				  cxl->comp_reg_offset + CXL_CM_OFFSET +
+				  cxl->hdm_reg_offset,
+				  cxl->hdm_reg_size);
+	if (!cxl->hdm_iobase) {
+		kfree(cxl->comp_reg_virt);
+		cxl->comp_reg_virt = NULL;
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+
+/*
+ * Called with memory_lock write side held (from vfio_cxl_reactivate_region).
+ * Uses the pre-established hdm_iobase rather than calling ioremap() under
+ * the lock; ioremap() can sleep, which would deadlock on PREEMPT_RT.
+ */
+void vfio_cxl_reinit_comp_regs(struct vfio_pci_cxl_state *cxl)
+{
+	size_t i, nregs;
+	u32 n;
+
+	if (!cxl || !cxl->comp_reg_virt || !cxl->hdm_iobase)
+		return;
+
+	nregs = cxl->hdm_reg_size / sizeof(__le32);
+
+	for (i = 0; i < nregs; i++)
+		*hdm_reg_ptr(cxl, i * sizeof(__le32)) =
+			cpu_to_le32(readl(cxl->hdm_iobase +
+					  i * sizeof(__le32)));
+
+	/*
+	 * For firmware-committed decoders, clear COMMIT_LOCK (bit 8) and zero
+	 * BASE in comp_reg_virt[] so QEMU can write the correct guest GPA via
+	 * setup_locked_hdm() before guest DPA access begins.
+	 *
+	 * Check the COMMITTED bit (bit 10) directly from the freshly-snapshotted
+	 * ctrl register rather than relying on cxl->precommitted.  At probe time
+	 * this function is called before cxl->precommitted is set (it is set
+	 * after vfio_cxl_read_committed_decoder_size() succeeds), so using
+	 * cxl->precommitted here would silently skip the LOCK clearing and leave
+	 * the hardware HPA in comp_reg_virt[].
+	 */
+	for (n = 0; n < cxl->hdm_count; n++) {
+		u32 ctrl_off = CXL_HDM_DECODER_FIRST_BLOCK_OFFSET +
+			n * CXL_HDM_DECODER_BLOCK_STRIDE +
+			CXL_HDM_DECODER_N_CTRL_OFFSET;
+		u32 base_lo_off = CXL_HDM_DECODER_FIRST_BLOCK_OFFSET +
+			n * CXL_HDM_DECODER_BLOCK_STRIDE +
+			CXL_HDM_DECODER_N_BASE_LOW_OFFSET;
+		u32 base_hi_off = CXL_HDM_DECODER_FIRST_BLOCK_OFFSET +
+			n * CXL_HDM_DECODER_BLOCK_STRIDE +
+			CXL_HDM_DECODER_N_BASE_HIGH_OFFSET;
+		u32 ctrl = le32_to_cpu(*hdm_reg_ptr(cxl, ctrl_off));
+
+		if (!(ctrl & CXL_HDM_DECODER0_CTRL_COMMITTED))
+			continue;
+
+		if (ctrl & CXL_HDM_DECODER0_CTRL_LOCK) {
+			*hdm_reg_ptr(cxl, ctrl_off) =
+				cpu_to_le32(ctrl &
+					    ~CXL_HDM_DECODER0_CTRL_LOCK);
+			*hdm_reg_ptr(cxl, base_lo_off) = 0;
+			*hdm_reg_ptr(cxl, base_hi_off) = 0;
+		}
+	}
+}
+
+void vfio_cxl_clean_virt_regs(struct vfio_pci_cxl_state *cxl)
+{
+	if (cxl->hdm_iobase) {
+		iounmap(cxl->hdm_iobase);
+		cxl->hdm_iobase = NULL;
+	}
+	kfree(cxl->comp_reg_virt);
+	cxl->comp_reg_virt = NULL;
+}
diff --git a/drivers/vfio/pci/cxl/vfio_cxl_priv.h b/drivers/vfio/pci/cxl/vfio_cxl_priv.h
index 54b1f6d885aa..463a55062144 100644
--- a/drivers/vfio/pci/cxl/vfio_cxl_priv.h
+++ b/drivers/vfio/pci/cxl/vfio_cxl_priv.h
@@ -21,12 +21,53 @@ struct vfio_pci_cxl_state {
 	size_t                       hdm_reg_size;
 	resource_size_t              comp_reg_offset;
 	size_t                       comp_reg_size;
+	__le32                      *comp_reg_virt;
+	void __iomem                *hdm_iobase;
 	u16                          dvsec_len;
 	u8                           hdm_count;
 	u8                           comp_reg_bar;
 	bool                         cache_capable;
 };
 
+/* Register access sizes */
+#define CXL_REG_SIZE_WORD  2
+#define CXL_REG_SIZE_DWORD 4
+
+/* HDM Decoder - register offsets (CXL 4.0 Table 8-115) */
+#define CXL_HDM_DECODER_GLOBAL_CTRL_OFFSET        0x4
+#define CXL_HDM_DECODER_FIRST_BLOCK_OFFSET        0x10
+#define CXL_HDM_DECODER_BLOCK_STRIDE              0x20
+#define CXL_HDM_DECODER_N_BASE_LOW_OFFSET         0x0
+#define CXL_HDM_DECODER_N_BASE_HIGH_OFFSET        0x4
+#define CXL_HDM_DECODER_N_SIZE_LOW_OFFSET         0x8
+#define CXL_HDM_DECODER_N_SIZE_HIGH_OFFSET        0xc
+#define CXL_HDM_DECODER_N_CTRL_OFFSET             0x10
+#define CXL_HDM_DECODER_N_TARGET_LIST_LOW_OFFSET  0x14
+#define CXL_HDM_DECODER_N_TARGET_LIST_HIGH_OFFSET 0x18
+#define CXL_HDM_DECODER_N_REV_OFFSET              0x1c
+
+/*
+ * HDM Decoder N Control emulation masks.
+ *
+ * Single-bit hardware definitions are in <uapi/cxl/cxl_regs.h> as
+ * CXL_HDM_DECODER0_CTRL_* (bits 0-14) and CXL_HDM_DECODER_*_CAP.
+ * The masks below express emulation policy for a CXL.mem device.
+ */
+#define CXL_HDM_DECODER_CTRL_RO_BITS_MASK    (BIT(10) | BIT(11))
+#define CXL_HDM_DECODER_CTRL_RESERVED_MASK   (BIT(15) | GENMASK(31, 28))
+#define CXL_HDM_DECODER_CTRL_DEVICE_BITS_RO  BIT(12)
+#define CXL_HDM_DECODER_CTRL_DEVICE_RESERVED (GENMASK(19, 16) | GENMASK(23, 20))
+#define CXL_HDM_DECODER_CTRL_UIO_RESERVED    (BIT(14) | GENMASK(27, 24))
+/*
+ * bit 13 (BI) is RsvdP for devices without CXL.cache (Cache_Capable=0).
+ * HDM-D (CXL.mem only) decoders must not have BI set by the guest.
+ */
+#define CXL_HDM_DECODER_CTRL_BI_RESERVED          BIT(13)
+#define CXL_HDM_DECODER_BASE_LO_RESERVED_MASK     GENMASK(27, 0)
+
+#define CXL_HDM_DECODER_GLOBAL_CTRL_RESERVED_MASK GENMASK(31, 2)
+#define CXL_HDM_DECODER_GLOBAL_CTRL_POISON_EN_BIT BIT(0)
+
 /*
  * CXL DVSEC for CXL Devices - register offsets within the DVSEC
  * (CXL 4.0 8.1.3).
@@ -37,4 +78,10 @@ struct vfio_pci_cxl_state {
 /* CXL DVSEC Capability register bit 0: device supports CXL.cache (HDM-DB) */
 #define CXL_DVSEC_CACHE_CAPABLE	    BIT(0)
 
+int vfio_cxl_setup_virt_regs(struct vfio_pci_core_device *vdev,
+			     struct vfio_pci_cxl_state *cxl,
+			     void __iomem *cap_base);
+void vfio_cxl_clean_virt_regs(struct vfio_pci_cxl_state *cxl);
+void vfio_cxl_reinit_comp_regs(struct vfio_pci_cxl_state *cxl);
+
 #endif /* __LINUX_VFIO_CXL_PRIV_H */
diff --git a/include/uapi/cxl/cxl_regs.h b/include/uapi/cxl/cxl_regs.h
index 1a48a3805f52..b6fcae91d216 100644
--- a/include/uapi/cxl/cxl_regs.h
+++ b/include/uapi/cxl/cxl_regs.h
@@ -33,8 +33,13 @@
 #define   CXL_HDM_DECODER_TARGET_COUNT_MASK __GENMASK(7, 4)
 #define   CXL_HDM_DECODER_INTERLEAVE_11_8 _BITUL(8)
 #define   CXL_HDM_DECODER_INTERLEAVE_14_12 _BITUL(9)
+#define   CXL_HDM_DECODER_POISON_ON_DECODE_ERR _BITUL(10)
 #define   CXL_HDM_DECODER_INTERLEAVE_3_6_12_WAY _BITUL(11)
 #define   CXL_HDM_DECODER_INTERLEAVE_16_WAY _BITUL(12)
+#define   CXL_HDM_DECODER_UIO_CAPABLE _BITUL(13)
+#define   CXL_HDM_DECODER_UIO_COUNT_MASK __GENMASK(19, 16)
+#define   CXL_HDM_DECODER_MEMDATA_NXM _BITUL(20)
+#define   CXL_HDM_DECODER_COHERENCY_MODELS_MASK    __GENMASK(22, 21)
 #define CXL_HDM_DECODER_CTRL_OFFSET 0x4
 #define   CXL_HDM_DECODER_ENABLE _BITUL(1)
 #define CXL_HDM_DECODER0_BASE_LOW_OFFSET(i) (0x20 * (i) + 0x10)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v2 12/20] vfio/cxl: Wait for HDM ranges and create memdev
  2026-04-01 14:38 [PATCH v2 00/20] vfio/pci: Add CXL Type-2 device passthrough support mhonap
                   ` (10 preceding siblings ...)
  2026-04-01 14:39 ` [PATCH v2 11/20] vfio/cxl: Introduce HDM decoder register emulation framework mhonap
@ 2026-04-01 14:39 ` mhonap
  2026-04-01 14:39 ` [PATCH v2 13/20] vfio/cxl: CXL region management support mhonap
                   ` (7 subsequent siblings)
  19 siblings, 0 replies; 27+ messages in thread
From: mhonap @ 2026-04-01 14:39 UTC (permalink / raw)
  To: alwilliamson, dan.j.williams, jonathan.cameron, dave.jiang,
	alejandro.lucero-palau, dave, alison.schofield, vishal.l.verma,
	ira.weiny, dmatlack, shuah, jgg, yishaih, skolothumtho,
	kevin.tian, ankita
  Cc: vsethi, cjia, targupta, zhiw, kjaju, linux-kselftest,
	linux-kernel, linux-cxl, kvm, mhonap

From: Manish Honap <mhonap@nvidia.com>

After the HDM registers are mapped, call cxl_await_range_active() so we
only proceed once the DVSEC ranges report active, without touching the
memdev register group that a Type-2 device may lack.

Re-snapshot the component registers (vfio_cxl_reinit_comp_regs()) once
MEM_ACTIVE is set, so the final firmware-written values, SIZE_HIGH
included, land in comp_reg_virt[].

Read the committed decoder size from hardware, set the capacity via
cxl_set_capacity(), and register the memdev with devm_cxl_add_memdev().

Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
 drivers/vfio/pci/cxl/vfio_cxl_core.c | 56 ++++++++++++++++++++++++++++
 drivers/vfio/pci/cxl/vfio_cxl_emu.c  | 42 +++++++++++++++++++++
 drivers/vfio/pci/cxl/vfio_cxl_priv.h |  4 ++
 3 files changed, 102 insertions(+)

diff --git a/drivers/vfio/pci/cxl/vfio_cxl_core.c b/drivers/vfio/pci/cxl/vfio_cxl_core.c
index 0b9e4419cd47..02755265d530 100644
--- a/drivers/vfio/pci/cxl/vfio_cxl_core.c
+++ b/drivers/vfio/pci/cxl/vfio_cxl_core.c
@@ -165,6 +165,22 @@ static int vfio_cxl_setup_regs(struct vfio_pci_core_device *vdev,
 	return ret;
 }
 
+static int vfio_cxl_create_memdev(struct vfio_pci_cxl_state *cxl,
+				  resource_size_t capacity)
+{
+	int ret;
+
+	ret = cxl_set_capacity(&cxl->cxlds, capacity);
+	if (ret)
+		return ret;
+
+	cxl->cxlmd = devm_cxl_add_memdev(&cxl->cxlds, NULL);
+	if (IS_ERR(cxl->cxlmd))
+		return PTR_ERR(cxl->cxlmd);
+
+	return 0;
+}
+
 /*
  * Free CXL state early on probe failure.  devm_kfree() on a live devres
  * allocation removes it from the list immediately, so the normal devres
@@ -189,6 +205,7 @@ void vfio_pci_cxl_detect_and_init(struct vfio_pci_core_device *vdev)
 {
 	struct pci_dev *pdev = vdev->pdev;
 	struct vfio_pci_cxl_state *cxl;
+	resource_size_t capacity = 0;
 	u16 dvsec;
 	int ret;
 
@@ -234,8 +251,44 @@ void vfio_pci_cxl_detect_and_init(struct vfio_pci_core_device *vdev)
 		goto free_cxl;
 	}
 
+	cxl->cxlds.media_ready = !cxl_await_range_active(&cxl->cxlds);
+	if (!cxl->cxlds.media_ready) {
+		pci_warn(pdev, "CXL media not ready\n");
+		pci_disable_device(pdev);
+		goto regs_failed;
+	}
+
+	/*
+	 * Take the single authoritative HDM decoder snapshot now that
+	 * MEM_ACTIVE is confirmed and BAR memory is still enabled.  Using
+	 * readl() per-dword ensures correct MMIO serialisation and captures
+	 * the final firmware-written values for all fields including SIZE_HIGH,
+	 * which firmware commits to the BAR at MEM_ACTIVE time.
+	 */
+	vfio_cxl_reinit_comp_regs(cxl);
+
 	pci_disable_device(pdev);
 
+	capacity = vfio_cxl_read_committed_decoder_size(vdev, cxl);
+	if (capacity == 0) {
+		/*
+		 * TODO: Add handling for devices which do not have
+		 * firmware pre-committed decoders
+		 */
+		pci_info(pdev, "Uncommitted region size must be configured via sysfs before bind\n");
+		goto regs_failed;
+	}
+
+	cxl->dpa_size = capacity;
+
+	pci_dbg(pdev, "Device capacity: %llu MB\n", (unsigned long long)(capacity >> 20));
+
+	ret = vfio_cxl_create_memdev(cxl, capacity);
+	if (ret) {
+		pci_warn(pdev, "Failed to create memdev\n");
+		goto regs_failed;
+	}
+
 	/*
 	 * Register probing succeeded.  Assign vdev->cxl now so that
 	 * all subsequent helpers can access state via vdev->cxl.
@@ -246,6 +299,9 @@ void vfio_pci_cxl_detect_and_init(struct vfio_pci_core_device *vdev)
 
 	return;
 
+regs_failed:
+	vfio_cxl_clean_virt_regs(cxl);
+
 free_cxl:
 	vfio_cxl_dev_state_free(pdev, cxl);
 }
diff --git a/drivers/vfio/pci/cxl/vfio_cxl_emu.c b/drivers/vfio/pci/cxl/vfio_cxl_emu.c
index 6fb02253e631..11195e8c21d7 100644
--- a/drivers/vfio/pci/cxl/vfio_cxl_emu.c
+++ b/drivers/vfio/pci/cxl/vfio_cxl_emu.c
@@ -365,6 +365,48 @@ int vfio_cxl_setup_virt_regs(struct vfio_pci_core_device *vdev,
 	return 0;
 }
 
+/*
+ * vfio_cxl_read_committed_decoder_size - Extract committed DPA capacity from
+ *					  comp_reg_virt[].
+ *
+ * Called from probe context after vfio_cxl_reinit_comp_regs() has taken the
+ * post-MEM_ACTIVE readl() snapshot and patched SIZE_HIGH/SIZE_LOW from DVSEC.
+ * comp_reg_virt[] is already correct at this point; no hardware access needed.
+ *
+ * Returns the committed DPA capacity in bytes, or 0 if the decoder is not
+ * committed.
+ */
+resource_size_t
+vfio_cxl_read_committed_decoder_size(struct vfio_pci_core_device *vdev,
+				     struct vfio_pci_cxl_state *cxl)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	resource_size_t capacity;
+	u32 ctrl, sz_hi, sz_lo;
+
+	if (WARN_ON(!cxl || !cxl->comp_reg_virt))
+		return 0;
+
+	ctrl  = le32_to_cpu(*hdm_reg_ptr(cxl, CXL_HDM_DECODER0_CTRL_OFFSET(0)));
+	sz_hi = le32_to_cpu(*hdm_reg_ptr(cxl, CXL_HDM_DECODER0_SIZE_HIGH_OFFSET(0)));
+	sz_lo = le32_to_cpu(*hdm_reg_ptr(cxl, CXL_HDM_DECODER0_SIZE_LOW_OFFSET(0)));
+
+	if (!(ctrl & CXL_HDM_DECODER0_CTRL_COMMITTED)) {
+		pci_dbg(pdev,
+			"vfio_cxl: decoder0 not committed: ctrl=0x%08x\n",
+			ctrl);
+		return 0;
+	}
+
+	capacity = ((u64)sz_hi << 32) | (sz_lo & GENMASK(31, 28));
+
+	pci_dbg(pdev,
+		"vfio_cxl: decoder0 committed: sz_hi=0x%08x sz_lo=0x%08x capacity=0x%llx\n",
+		sz_hi, sz_lo, (unsigned long long)capacity);
+
+	return capacity;
+}
+
 /*
  * Called with memory_lock write side held (from vfio_cxl_reactivate_region).
  * Uses the pre-established hdm_iobase, no ioremap() under the lock,
diff --git a/drivers/vfio/pci/cxl/vfio_cxl_priv.h b/drivers/vfio/pci/cxl/vfio_cxl_priv.h
index 463a55062144..6359ad260bde 100644
--- a/drivers/vfio/pci/cxl/vfio_cxl_priv.h
+++ b/drivers/vfio/pci/cxl/vfio_cxl_priv.h
@@ -22,6 +22,7 @@ struct vfio_pci_cxl_state {
 	resource_size_t              comp_reg_offset;
 	size_t                       comp_reg_size;
 	__le32                      *comp_reg_virt;
+	size_t                       dpa_size;
 	void __iomem                *hdm_iobase;
 	u16                          dvsec_len;
 	u8                           hdm_count;
@@ -83,5 +84,8 @@ int vfio_cxl_setup_virt_regs(struct vfio_pci_core_device *vdev,
 			     void __iomem *cap_base);
 void vfio_cxl_clean_virt_regs(struct vfio_pci_cxl_state *cxl);
 void vfio_cxl_reinit_comp_regs(struct vfio_pci_cxl_state *cxl);
+resource_size_t
+vfio_cxl_read_committed_decoder_size(struct vfio_pci_core_device *vdev,
+				     struct vfio_pci_cxl_state *cxl);
 
 #endif /* __LINUX_VFIO_CXL_PRIV_H */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v2 13/20] vfio/cxl: CXL region management support
  2026-04-01 14:38 [PATCH v2 00/20] vfio/pci: Add CXL Type-2 device passthrough support mhonap
                   ` (11 preceding siblings ...)
  2026-04-01 14:39 ` [PATCH v2 12/20] vfio/cxl: Wait for HDM ranges and create memdev mhonap
@ 2026-04-01 14:39 ` mhonap
  2026-04-01 14:39 ` [PATCH v2 14/20] vfio/cxl: DPA VFIO region with demand fault mmap and reset zap mhonap
                   ` (6 subsequent siblings)
  19 siblings, 0 replies; 27+ messages in thread
From: mhonap @ 2026-04-01 14:39 UTC (permalink / raw)
  To: alwilliamson, dan.j.williams, jonathan.cameron, dave.jiang,
	alejandro.lucero-palau, dave, alison.schofield, vishal.l.verma,
	ira.weiny, dmatlack, shuah, jgg, yishaih, skolothumtho,
	kevin.tian, ankita
  Cc: vsethi, cjia, targupta, zhiw, kjaju, linux-kselftest,
	linux-kernel, linux-cxl, kvm, mhonap

From: Manish Honap <mhonap@nvidia.com>

Region management uses the following APIs provided by the CXL core:

CREATE_REGION flow:
1. Validate request (size, decoder availability)
2. Allocate HPA via cxl_get_hpa_freespace()
3. Allocate DPA via cxl_request_dpa()
4. Create region via cxl_create_region() - commits HDM decoder
5. Get HPA range via cxl_get_region_range()

DESTROY_REGION flow:
1. Unregister the region via cxl_unregister_region()
2. Free DPA via cxl_dpa_free()
3. Release root decoder via cxl_put_root_decoder()

Use DEFINE_FREE scope helpers so error paths unwind cleanly.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
 drivers/vfio/pci/cxl/vfio_cxl_core.c | 119 +++++++++++++++++++++++++++
 drivers/vfio/pci/cxl/vfio_cxl_priv.h |   8 ++
 2 files changed, 127 insertions(+)

diff --git a/drivers/vfio/pci/cxl/vfio_cxl_core.c b/drivers/vfio/pci/cxl/vfio_cxl_core.c
index 02755265d530..30b365b91903 100644
--- a/drivers/vfio/pci/cxl/vfio_cxl_core.c
+++ b/drivers/vfio/pci/cxl/vfio_cxl_core.c
@@ -21,6 +21,13 @@
 #include "../vfio_pci_priv.h"
 #include "vfio_cxl_priv.h"
 
+/*
+ * Scope-based cleanup wrappers for the CXL resource APIs
+ */
+DEFINE_FREE(cxl_put_root_decoder, struct cxl_root_decoder *, if (!IS_ERR_OR_NULL(_T)) cxl_put_root_decoder(_T))
+DEFINE_FREE(cxl_dpa_free, struct cxl_endpoint_decoder *, if (!IS_ERR_OR_NULL(_T)) cxl_dpa_free(_T))
+DEFINE_FREE(cxl_unregister_region, struct cxl_region *, if (!IS_ERR_OR_NULL(_T)) cxl_unregister_region(_T))
+
 /*
  * vfio_cxl_create_device_state - Allocate and validate CXL device state
  *
@@ -165,6 +172,112 @@ static int vfio_cxl_setup_regs(struct vfio_pci_core_device *vdev,
 	return ret;
 }
 
+int vfio_cxl_create_cxl_region(struct vfio_pci_cxl_state *cxl,
+			       resource_size_t size)
+{
+	resource_size_t max_size;
+
+	if (WARN_ON(cxl->precommitted))
+		return -EBUSY;
+	struct cxl_root_decoder *cxlrd __free(cxl_put_root_decoder) =
+		cxl_get_hpa_freespace(cxl->cxlmd, 1,
+				      CXL_DECODER_F_RAM | CXL_DECODER_F_TYPE2,
+				      &max_size);
+	if (IS_ERR(cxlrd))
+		return PTR_ERR(cxlrd);
+
+	/* Insufficient HPA space; cxlrd freed automatically by __free() */
+	if (max_size < size)
+		return -ENOSPC;
+
+	struct cxl_endpoint_decoder *cxled __free(cxl_dpa_free) =
+		cxl_request_dpa(cxl->cxlmd, CXL_PARTMODE_RAM, size);
+	if (IS_ERR(cxled))
+		return PTR_ERR(cxled);
+
+	struct cxl_region *region __free(cxl_unregister_region) =
+		cxl_create_region(cxlrd, &cxled, 1);
+	if (IS_ERR(region))
+		return PTR_ERR(region);
+
+	/* All operations succeeded; transfer ownership to cxl state */
+	cxl->cxlrd  = no_free_ptr(cxlrd);
+	cxl->cxled  = no_free_ptr(cxled);
+	cxl->region = no_free_ptr(region);
+
+	return 0;
+}
+
+void vfio_cxl_destroy_cxl_region(struct vfio_pci_cxl_state *cxl)
+{
+	if (!cxl->region)
+		return;
+
+	cxl_unregister_region(cxl->region);
+	cxl->region = NULL;
+
+	if (!cxl->precommitted) {
+		cxl_dpa_free(cxl->cxled);
+		cxl_put_root_decoder(cxl->cxlrd);
+	}
+
+	cxl->cxled = NULL;
+	cxl->cxlrd = NULL;
+}
+
+static int vfio_cxl_create_region_helper(struct vfio_pci_core_device *vdev,
+					 struct vfio_pci_cxl_state *cxl,
+					 resource_size_t capacity)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	struct range range;
+	int ret;
+
+	if (cxl->precommitted) {
+		struct cxl_endpoint_decoder *cxled;
+		struct cxl_region *region;
+
+		cxled = cxl_get_committed_decoder(cxl->cxlmd, &region);
+		if (IS_ERR(cxled))
+			return PTR_ERR(cxled);
+		cxl->cxled = cxled;
+		cxl->region = region;
+	} else {
+		ret = vfio_cxl_create_cxl_region(cxl, capacity);
+		if (ret)
+			return ret;
+	}
+
+	if (!cxl->region) {
+		pci_err(pdev, "Failed to create CXL region\n");
+		ret = -ENODEV;
+		goto failed;
+	}
+
+	ret = cxl_get_region_range(cxl->region, &range);
+	if (ret)
+		goto failed;
+
+	cxl->region_hpa = range.start;
+	cxl->region_size = range_len(&range);
+
+	pci_dbg(pdev, "CXL region: HPA 0x%llx size %zu MB\n",
+		(unsigned long long)cxl->region_hpa, cxl->region_size >> 20);
+
+	return 0;
+
+failed:
+	if (cxl->region) {
+		cxl_unregister_region(cxl->region);
+		cxl->region = NULL;
+	}
+
+	cxl->cxled = NULL;
+	cxl->cxlrd = NULL;
+
+	return ret;
+}
+
 static int vfio_cxl_create_memdev(struct vfio_pci_cxl_state *cxl,
 				  resource_size_t capacity)
 {
@@ -279,6 +392,7 @@ void vfio_pci_cxl_detect_and_init(struct vfio_pci_core_device *vdev)
 		goto regs_failed;
 	}
 
+	cxl->precommitted = true;
 	cxl->dpa_size = capacity;
 
 	pci_dbg(pdev, "Device capacity: %llu MB\n", (unsigned long long)(capacity >> 20));
@@ -289,6 +403,10 @@ void vfio_pci_cxl_detect_and_init(struct vfio_pci_core_device *vdev)
 		goto regs_failed;
 	}
 
+	ret = vfio_cxl_create_region_helper(vdev, cxl, capacity);
+	if (ret)
+		goto regs_failed;
+
 	/*
 	 * Register probing succeeded.  Assign vdev->cxl now so that
 	 * all subsequent helpers can access state via vdev->cxl.
@@ -314,6 +432,7 @@ void vfio_pci_cxl_cleanup(struct vfio_pci_core_device *vdev)
 		return;
 
 	vfio_cxl_clean_virt_regs(cxl);
+	vfio_cxl_destroy_cxl_region(cxl);
 }
 
 MODULE_IMPORT_NS("CXL");
diff --git a/drivers/vfio/pci/cxl/vfio_cxl_priv.h b/drivers/vfio/pci/cxl/vfio_cxl_priv.h
index 6359ad260bde..72a0d7d7e183 100644
--- a/drivers/vfio/pci/cxl/vfio_cxl_priv.h
+++ b/drivers/vfio/pci/cxl/vfio_cxl_priv.h
@@ -17,6 +17,10 @@ struct vfio_pci_cxl_state {
 	struct cxl_memdev           *cxlmd;
 	struct cxl_root_decoder     *cxlrd;
 	struct cxl_endpoint_decoder *cxled;
+	struct cxl_region	    *region;
+	resource_size_t		     region_hpa;
+	size_t			     region_size;
+	void			    *region_vaddr;
 	resource_size_t              hdm_reg_offset;
 	size_t                       hdm_reg_size;
 	resource_size_t              comp_reg_offset;
@@ -28,6 +32,7 @@ struct vfio_pci_cxl_state {
 	u8                           hdm_count;
 	u8                           comp_reg_bar;
 	bool                         cache_capable;
+	bool                         precommitted;
 };
 
 /* Register access sizes */
@@ -87,5 +92,8 @@ void vfio_cxl_reinit_comp_regs(struct vfio_pci_cxl_state *cxl);
 resource_size_t
 vfio_cxl_read_committed_decoder_size(struct vfio_pci_core_device *vdev,
 				     struct vfio_pci_cxl_state *cxl);
+int vfio_cxl_create_cxl_region(struct vfio_pci_cxl_state *cxl,
+			       resource_size_t size);
+void vfio_cxl_destroy_cxl_region(struct vfio_pci_cxl_state *cxl);
 
 #endif /* __LINUX_VFIO_CXL_PRIV_H */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v2 14/20] vfio/cxl: DPA VFIO region with demand fault mmap and reset zap
  2026-04-01 14:38 [PATCH v2 00/20] vfio/pci: Add CXL Type-2 device passthrough support mhonap
                   ` (12 preceding siblings ...)
  2026-04-01 14:39 ` [PATCH v2 13/20] vfio/cxl: CXL region management support mhonap
@ 2026-04-01 14:39 ` mhonap
  2026-04-01 14:39 ` [PATCH v2 15/20] vfio/cxl: Virtualize CXL DVSEC config writes mhonap
                   ` (5 subsequent siblings)
  19 siblings, 0 replies; 27+ messages in thread
From: mhonap @ 2026-04-01 14:39 UTC (permalink / raw)
  To: alwilliamson, dan.j.williams, jonathan.cameron, dave.jiang,
	alejandro.lucero-palau, dave, alison.schofield, vishal.l.verma,
	ira.weiny, dmatlack, shuah, jgg, yishaih, skolothumtho,
	kevin.tian, ankita
  Cc: vsethi, cjia, targupta, zhiw, kjaju, linux-kselftest,
	linux-kernel, linux-cxl, kvm, mhonap

From: Manish Honap <mhonap@nvidia.com>

Wire the CXL DPA range up as a VFIO demand-paged region so QEMU can
mmap guest device memory directly. Faults call vmf_insert_pfn() to
insert one PFN at a time rather than mapping the full range upfront.

CXL region lifecycle:
- The CXL memory region is registered with VFIO layer during
  vfio_pci_open_device
- mmap() establishes the VMA with vm_ops but inserts no PTEs
- Each guest page fault calls vfio_cxl_region_page_fault() which
  inserts a single PFN under the memory_lock read side
- On device reset, vfio_cxl_zap_region_locked() sets region_active=false
  and calls unmap_mapping_range() to invalidate all DPA PTEs atomically
  while holding memory_lock for writing
- Faults racing with reset see region_active==false and return
  VM_FAULT_SIGBUS
- vfio_cxl_reactivate_region() restores region_active after successful
  hardware reset

Also integrate the zap/reactivate calls into vfio_pci_ioctl_reset() so
that FLR correctly invalidates DPA mappings and restores them on success.

Co-developed-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
 drivers/vfio/pci/cxl/vfio_cxl_core.c | 187 +++++++++++++++++++++++++++
 drivers/vfio/pci/cxl/vfio_cxl_emu.c  |   2 +-
 drivers/vfio/pci/cxl/vfio_cxl_priv.h |   3 +
 drivers/vfio/pci/vfio_pci_core.c     |  11 ++
 drivers/vfio/pci/vfio_pci_priv.h     |   6 +
 5 files changed, 208 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/pci/cxl/vfio_cxl_core.c b/drivers/vfio/pci/cxl/vfio_cxl_core.c
index 30b365b91903..19d3dc205f99 100644
--- a/drivers/vfio/pci/cxl/vfio_cxl_core.c
+++ b/drivers/vfio/pci/cxl/vfio_cxl_core.c
@@ -435,4 +435,191 @@ void vfio_pci_cxl_cleanup(struct vfio_pci_core_device *vdev)
 	vfio_cxl_destroy_cxl_region(cxl);
 }
 
+static vm_fault_t vfio_cxl_region_vm_fault(struct vm_fault *vmf)
+{
+	struct vfio_pci_region *region = vmf->vma->vm_private_data;
+	struct vfio_pci_cxl_state *cxl = region->data;
+	unsigned long pgoff;
+	unsigned long pfn;
+
+	if (!READ_ONCE(cxl->region_active))
+		return VM_FAULT_SIGBUS;
+
+	pgoff = vmf->pgoff &
+		((1UL << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
+
+	if (pgoff >= (cxl->region_size >> PAGE_SHIFT))
+		return VM_FAULT_SIGBUS;
+
+	pfn = PHYS_PFN(cxl->region_hpa) + pgoff;
+
+	return vmf_insert_pfn(vmf->vma, vmf->address, pfn);
+}
+
+static const struct vm_operations_struct vfio_cxl_region_vm_ops = {
+	.fault = vfio_cxl_region_vm_fault,
+};
+
+static int vfio_cxl_region_mmap(struct vfio_pci_core_device *vdev,
+				struct vfio_pci_region *region,
+				struct vm_area_struct *vma)
+{
+	struct vfio_pci_cxl_state *cxl = vdev->cxl;
+	u64 req_len, pgoff, end;
+
+	if (!(region->flags & VFIO_REGION_INFO_FLAG_MMAP))
+		return -EINVAL;
+
+	if (!(region->flags & VFIO_REGION_INFO_FLAG_READ) &&
+	    (vma->vm_flags & VM_READ))
+		return -EPERM;
+
+	if (!(region->flags & VFIO_REGION_INFO_FLAG_WRITE) &&
+	    (vma->vm_flags & VM_WRITE))
+		return -EPERM;
+
+	pgoff = vma->vm_pgoff &
+		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
+
+	if (check_sub_overflow(vma->vm_end, vma->vm_start, &req_len) ||
+	    check_add_overflow(PFN_PHYS(pgoff), req_len, &end))
+		return -EOVERFLOW;
+
+	if (end > cxl->region_size)
+		return -EINVAL;
+
+	vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);
+
+	vm_flags_set(vma, VM_ALLOW_ANY_UNCACHED | VM_IO | VM_PFNMAP |
+		     VM_DONTEXPAND | VM_DONTDUMP);
+
+	vma->vm_ops = &vfio_cxl_region_vm_ops;
+	vma->vm_private_data = region;
+
+	return 0;
+}
+
+/*
+ * vfio_cxl_zap_region_locked - Mark the DPA region inactive across a reset.
+ *
+ * Must be called with vdev->memory_lock held for writing, after
+ * vfio_pci_zap_and_down_write_memory_lock() has unmapped the user PTEs.
+ * Clearing region_active ensures faults and region I/O racing with the
+ * reset fail cleanly instead of touching stale mappings.
+ */
+void vfio_cxl_zap_region_locked(struct vfio_pci_core_device *vdev)
+{
+	struct vfio_pci_cxl_state *cxl = vdev->cxl;
+
+	lockdep_assert_held_write(&vdev->memory_lock);
+
+	if (!cxl)
+		return;
+
+	WRITE_ONCE(cxl->region_active, false);
+}
+
+/*
+ * vfio_cxl_reactivate_region - Re-enable DPA region after successful reset.
+ *
+ * Must be called with vdev->memory_lock held for writing.  Re-reads the
+ * HDM decoder state from hardware (FLR cleared it) and sets region_active
+ * so that subsequent I/O to the region is permitted again.
+ */
+void vfio_cxl_reactivate_region(struct vfio_pci_core_device *vdev)
+{
+	struct vfio_pci_cxl_state *cxl = vdev->cxl;
+
+	lockdep_assert_held_write(&vdev->memory_lock);
+
+	if (!cxl)
+		return;
+	/*
+	 * Re-initialise the emulated HDM comp_reg_virt[] from hardware.
+	 * After FLR the decoder registers read as zero; mirror that in
+	 * the emulated state so QEMU sees a clean slate.
+	 */
+	vfio_cxl_reinit_comp_regs(cxl);
+
+	/*
+	 * Only re-enable the DPA mmap if the hardware has actually
+	 * re-committed decoder 0 after FLR.  Read the COMMITTED bit from the
+	 * freshly-re-snapshotted comp_reg_virt[] so we check the post-FLR
+	 * hardware state, not stale pre-reset state.
+	 *
+	 * If COMMITTED is 0 (slow firmware re-commit path), leave
+	 * region_active=false.  Guest faults will return VM_FAULT_SIGBUS
+	 * until the decoder is re-committed and the region is re-enabled.
+	 */
+	if (cxl->precommitted && cxl->comp_reg_virt) {
+		/*
+		 * Read the decoder 0 CTRL register from the shadow that
+		 * vfio_cxl_reinit_comp_regs() re-snapshotted above.
+		 */
+		u32 ctrl = le32_to_cpu(*hdm_reg_ptr(cxl,
+					    CXL_HDM_DECODER0_CTRL_OFFSET(0)));
+
+		if (ctrl & CXL_HDM_DECODER0_CTRL_COMMITTED)
+			WRITE_ONCE(cxl->region_active, true);
+	}
+}
+
+static ssize_t vfio_cxl_region_rw(struct vfio_pci_core_device *core_dev,
+				  char __user *buf, size_t count, loff_t *ppos,
+				  bool iswrite)
+{
+	unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
+	struct vfio_pci_cxl_state *cxl = core_dev->region[i].data;
+	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
+
+	if (!count || pos >= cxl->region_size)
+		return 0;
+
+	/*
+	 * Guard against access after a failed reset (region_active=false)
+	 * or a release race (region_vaddr=NULL).  Either condition means
+	 * the memremap'd window is no longer valid; touching it would produce
+	 * a Synchronous External Abort.  Return -EIO so the caller gets a
+	 * clean error rather than a kernel oops.
+	 */
+	if (!READ_ONCE(cxl->region_active) || !cxl->region_vaddr)
+		return -EIO;
+
+	count = min(count, (size_t)(cxl->region_size - pos));
+
+	if (iswrite) {
+		if (copy_from_user(cxl->region_vaddr + pos, buf, count))
+			return -EFAULT;
+	} else {
+		if (copy_to_user(buf, cxl->region_vaddr + pos, count))
+			return -EFAULT;
+	}
+
+	return count;
+}
+
+static void vfio_cxl_region_release(struct vfio_pci_core_device *vdev,
+				    struct vfio_pci_region *region)
+{
+	struct vfio_pci_cxl_state *cxl = region->data;
+
+	/*
+	 * Deactivate the region before removing user mappings so that any
+	 * fault handler racing the release returns VM_FAULT_SIGBUS rather
+	 * than inserting a PFN into an unmapped region.
+	 */
+	WRITE_ONCE(cxl->region_active, false);
+
+	if (cxl->region_vaddr) {
+		memunmap(cxl->region_vaddr);
+		cxl->region_vaddr = NULL;
+	}
+}
+
+static const struct vfio_pci_regops vfio_cxl_regops = {
+	.rw		= vfio_cxl_region_rw,
+	.mmap		= vfio_cxl_region_mmap,
+	.release	= vfio_cxl_region_release,
+};
+
 MODULE_IMPORT_NS("CXL");
diff --git a/drivers/vfio/pci/cxl/vfio_cxl_emu.c b/drivers/vfio/pci/cxl/vfio_cxl_emu.c
index 11195e8c21d7..781328a79b43 100644
--- a/drivers/vfio/pci/cxl/vfio_cxl_emu.c
+++ b/drivers/vfio/pci/cxl/vfio_cxl_emu.c
@@ -33,7 +33,7 @@
  *     +0x1c: (reserved)
  */
 
-static inline __le32 *hdm_reg_ptr(struct vfio_pci_cxl_state *cxl, u32 hdm_off)
+__le32 *hdm_reg_ptr(struct vfio_pci_cxl_state *cxl, u32 hdm_off)
 {
 	/*
 	 * hdm_off is a byte offset within the HDM decoder block.
diff --git a/drivers/vfio/pci/cxl/vfio_cxl_priv.h b/drivers/vfio/pci/cxl/vfio_cxl_priv.h
index 72a0d7d7e183..3458768445af 100644
--- a/drivers/vfio/pci/cxl/vfio_cxl_priv.h
+++ b/drivers/vfio/pci/cxl/vfio_cxl_priv.h
@@ -33,6 +33,7 @@ struct vfio_pci_cxl_state {
 	u8                           comp_reg_bar;
 	bool                         cache_capable;
 	bool                         precommitted;
+	bool                         region_active;
 };
 
 /* Register access sizes */
@@ -96,4 +97,6 @@ int vfio_cxl_create_cxl_region(struct vfio_pci_cxl_state *cxl,
 			       resource_size_t size);
 void vfio_cxl_destroy_cxl_region(struct vfio_pci_cxl_state *cxl);
 
+__le32 *hdm_reg_ptr(struct vfio_pci_cxl_state *cxl, u32 hdm_off);
+
 #endif /* __LINUX_VFIO_CXL_PRIV_H */
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index b7364178e23d..48e0274c19aa 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1223,6 +1223,9 @@ static int vfio_pci_ioctl_reset(struct vfio_pci_core_device *vdev,
 
 	vfio_pci_zap_and_down_write_memory_lock(vdev);
 
+	/* Zap CXL DPA region PTEs before hardware reset clears HDM state */
+	vfio_cxl_zap_region_locked(vdev);
+
 	/*
 	 * This function can be invoked while the power state is non-D0. If
 	 * pci_try_reset_function() has been called while the power state is
@@ -1236,6 +1239,14 @@ static int vfio_pci_ioctl_reset(struct vfio_pci_core_device *vdev,
 
 	vfio_pci_dma_buf_move(vdev, true);
 	ret = pci_try_reset_function(vdev->pdev);
+
+	/*
+	 * Re-enable DPA region if reset succeeded; fault handler will
+	 * re-insert PFNs on next access without requiring a new mmap.
+	 */
+	if (!ret)
+		vfio_cxl_reactivate_region(vdev);
+
 	if (__vfio_pci_memory_enabled(vdev))
 		vfio_pci_dma_buf_move(vdev, false);
 	up_write(&vdev->memory_lock);
diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
index 1082ba43bafe..726063b6ff70 100644
--- a/drivers/vfio/pci/vfio_pci_priv.h
+++ b/drivers/vfio/pci/vfio_pci_priv.h
@@ -145,6 +145,8 @@ static inline void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev,
 
 void vfio_pci_cxl_detect_and_init(struct vfio_pci_core_device *vdev);
 void vfio_pci_cxl_cleanup(struct vfio_pci_core_device *vdev);
+void vfio_cxl_zap_region_locked(struct vfio_pci_core_device *vdev);
+void vfio_cxl_reactivate_region(struct vfio_pci_core_device *vdev);
 
 #else
 
@@ -152,6 +154,10 @@ static inline void
 vfio_pci_cxl_detect_and_init(struct vfio_pci_core_device *vdev) { }
 static inline void
 vfio_pci_cxl_cleanup(struct vfio_pci_core_device *vdev) { }
+static inline void
+vfio_cxl_zap_region_locked(struct vfio_pci_core_device *vdev) { }
+static inline void
+vfio_cxl_reactivate_region(struct vfio_pci_core_device *vdev) { }
 
 #endif /* CONFIG_VFIO_CXL_CORE */
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v2 15/20] vfio/cxl: Virtualize CXL DVSEC config writes
  2026-04-01 14:38 [PATCH v2 00/20] vfio/pci: Add CXL Type-2 device passthrough support mhonap
                   ` (13 preceding siblings ...)
  2026-04-01 14:39 ` [PATCH v2 14/20] vfio/cxl: DPA VFIO region with demand fault mmap and reset zap mhonap
@ 2026-04-01 14:39 ` mhonap
  2026-04-01 14:39 ` [PATCH v2 16/20] vfio/cxl: Register regions with VFIO layer mhonap
                   ` (4 subsequent siblings)
  19 siblings, 0 replies; 27+ messages in thread
From: mhonap @ 2026-04-01 14:39 UTC (permalink / raw)
  To: alwilliamson, dan.j.williams, jonathan.cameron, dave.jiang,
	alejandro.lucero-palau, dave, alison.schofield, vishal.l.verma,
	ira.weiny, dmatlack, shuah, jgg, yishaih, skolothumtho,
	kevin.tian, ankita
  Cc: vsethi, cjia, targupta, zhiw, kjaju, linux-kselftest,
	linux-kernel, linux-cxl, kvm, mhonap

From: Manish Honap <mhonap@nvidia.com>

CXL devices expose CXL DVSEC registers in configuration space.  Many
of them affect device behavior, e.g. enabling CXL.io/CXL.mem/CXL.cache.
These configurations are owned by the host, so a virtualization policy
must be applied to accesses from the guest.

Introduce emulation of the CXL configuration space to handle guest
accesses to the virtual CXL DVSEC registers.

vfio-pci-core already allocates vdev->vconfig as the authoritative
virtual config space shadow. Directly use vdev->vconfig:
  - DVSEC reads return data from vdev->vconfig (already populated by
    vfio_config_init() via vfio_ecap_init())
  - DVSEC writes go through new CXL-aware write handlers that update
    vdev->vconfig in place
  - The writable DVSEC registers are marked virtual in vdev->pci_config_map

Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
 drivers/vfio/pci/Makefile              |   2 +-
 drivers/vfio/pci/cxl/vfio_cxl_config.c | 306 +++++++++++++++++++++++++
 drivers/vfio/pci/cxl/vfio_cxl_core.c   |   4 +-
 drivers/vfio/pci/cxl/vfio_cxl_priv.h   |  43 +++-
 drivers/vfio/pci/vfio_pci_config.c     |  46 +++-
 drivers/vfio/pci/vfio_pci_priv.h       |   3 +
 include/linux/vfio_pci_core.h          |   8 +-
 include/uapi/cxl/cxl_regs.h            |  98 ++++++++
 8 files changed, 498 insertions(+), 12 deletions(-)
 create mode 100644 drivers/vfio/pci/cxl/vfio_cxl_config.c

diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index bef916495eae..7c86b7845e8f 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -1,7 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0-only
 
 vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
-vfio-pci-core-$(CONFIG_VFIO_CXL_CORE) += cxl/vfio_cxl_core.o cxl/vfio_cxl_emu.o
+vfio-pci-core-$(CONFIG_VFIO_CXL_CORE) += cxl/vfio_cxl_core.o cxl/vfio_cxl_emu.o cxl/vfio_cxl_config.o
 vfio-pci-core-$(CONFIG_VFIO_PCI_ZDEV_KVM) += vfio_pci_zdev.o
 vfio-pci-core-$(CONFIG_VFIO_PCI_DMABUF) += vfio_pci_dmabuf.o
 obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o
diff --git a/drivers/vfio/pci/cxl/vfio_cxl_config.c b/drivers/vfio/pci/cxl/vfio_cxl_config.c
new file mode 100644
index 000000000000..dee521118dd4
--- /dev/null
+++ b/drivers/vfio/pci/cxl/vfio_cxl_config.c
@@ -0,0 +1,306 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * CXL DVSEC configuration space emulation for vfio-pci.
+ *
+ * Integrates into the existing vfio-pci-core ecap_perms[] framework using
+ * vdev->vconfig as the sole shadow buffer for DVSEC registers.
+ *
+ * Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved
+ */
+
+#include <linux/pci.h>
+#include <linux/vfio_pci_core.h>
+
+#include "../vfio_pci_priv.h"
+#include "vfio_cxl_priv.h"
+
+static inline u16 _cxlds_get_dvsec(struct vfio_pci_cxl_state *cxl)
+{
+	return (u16)cxl->cxlds.cxl_dvsec;
+}
+
+/* Helpers to access vdev->vconfig at a DVSEC-relative offset */
+static inline u16 dvsec_virt_read16(struct vfio_pci_core_device *vdev,
+				    u16 off)
+{
+	u16 dvsec = _cxlds_get_dvsec(vdev->cxl);
+
+	return le16_to_cpu(*(__le16 *)(vdev->vconfig + dvsec + off));
+}
+
+static inline void dvsec_virt_write16(struct vfio_pci_core_device *vdev,
+				      u16 off, u16 val)
+{
+	u16 dvsec = _cxlds_get_dvsec(vdev->cxl);
+
+	*(__le16 *)(vdev->vconfig + dvsec + off) = cpu_to_le16(val);
+}
+
+static inline u32 dvsec_virt_read32(struct vfio_pci_core_device *vdev,
+				    u16 off)
+{
+	u16 dvsec = _cxlds_get_dvsec(vdev->cxl);
+
+	return le32_to_cpu(*(__le32 *)(vdev->vconfig + dvsec + off));
+}
+
+static inline void dvsec_virt_write32(struct vfio_pci_core_device *vdev,
+				      u16 off, u32 val)
+{
+	u16 dvsec = _cxlds_get_dvsec(vdev->cxl);
+
+	*(__le32 *)(vdev->vconfig + dvsec + off) = cpu_to_le32(val);
+}
+
+/* Individual DVSEC register write handlers */
+
+static void cxl_dvsec_control_write(struct vfio_pci_core_device *vdev,
+				    u16 new_val)
+{
+	u16 lock = dvsec_virt_read16(vdev, CXL_DVSEC_LOCK_OFFSET);
+	u16 cap3 = dvsec_virt_read16(vdev, CXL_DVSEC_CAPABILITY3_OFFSET);
+	u16 rev_mask = CXL_CTRL_RESERVED_MASK;
+
+	if (lock & CXL_DVSEC_LOCK_CONFIG_LOCK)
+		return; /* CONFIG_LOCK set: read-only until conventional reset */
+
+	if (!(cap3 & CXL_DVSEC_CAP3_P2P_MEM_CAPABLE))
+		rev_mask |= CXL_CTRL_P2P_REV_MASK;
+
+	new_val &= ~rev_mask;
+	new_val |= CXL_DVSEC_CTRL_IO_ENABLE; /* IO_Enable always returns 1 */
+
+	dvsec_virt_write16(vdev, CXL_DVSEC_CONTROL_OFFSET, new_val);
+}
+
+static void cxl_dvsec_status_write(struct vfio_pci_core_device *vdev,
+				   u16 new_val)
+{
+	u16 cur_val = dvsec_virt_read16(vdev, CXL_DVSEC_STATUS_OFFSET);
+
+	/*
+	 * VIRAL_STATUS (bit 14) is the only writable bit; all others are
+	 * reserved and always zero.
+	 */
+	new_val = cur_val & ~(new_val & CXL_DVSEC_STATUS_VIRAL_STATUS);
+	dvsec_virt_write16(vdev, CXL_DVSEC_STATUS_OFFSET, new_val);
+}
+
+static void cxl_dvsec_control2_write(struct vfio_pci_core_device *vdev,
+				     u16 new_val)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	u16 dvsec = _cxlds_get_dvsec(vdev->cxl);
+	u16 abs_off = dvsec + CXL_DVSEC_CONTROL2_OFFSET;
+	u16 cap2 = dvsec_virt_read16(vdev, CXL_DVSEC_CAPABILITY2_OFFSET);
+	u16 cap3 = dvsec_virt_read16(vdev, CXL_DVSEC_CAPABILITY3_OFFSET);
+	u16 rev_mask = CXL_CTRL2_RESERVED_MASK;
+
+	if (!(cap3 & CXL_DVSEC_CAP3_VOLATILE_HDM_CONFIGURABILITY))
+		rev_mask |= CXL_CTRL2_VOLATILE_HDM_REV_MASK;
+	if (!(cap2 & CXL_DVSEC_CAP2_MOD_COMPLETION_CAPABLE))
+		rev_mask |= CXL_CTRL2_MODIFIED_COMP_REV_MASK;
+
+	new_val &= ~rev_mask;
+
+	/* Cache WBI: forward to hardware. */
+	if (new_val & CXL_DVSEC_CTRL2_INITIATE_CACHE_WBI)
+		pci_write_config_word(pdev, abs_off,
+				      CXL_DVSEC_CTRL2_INITIATE_CACHE_WBI);
+
+	/*
+	 * CXL Reset: not yet supported - do not forward to HW.
+	 * TODO: invoke CXL protocol reset via cxl subsystem
+	 */
+	if (new_val & CXL_DVSEC_CTRL2_INITIATE_CXL_RESET)
+		pci_warn(pdev, "vfio-cxl: CXL reset requested but not yet supported\n");
+
+	dvsec_virt_write16(vdev, CXL_DVSEC_CONTROL2_OFFSET,
+			   new_val & ~CXL_CTRL2_HW_BITS_MASK);
+}
+
+static void cxl_dvsec_status2_write(struct vfio_pci_core_device *vdev,
+				    u16 new_val)
+{
+	u16 cap3 = dvsec_virt_read16(vdev, CXL_DVSEC_CAPABILITY3_OFFSET);
+	u16 dvsec = _cxlds_get_dvsec(vdev->cxl);
+	u16 abs_off = dvsec + CXL_DVSEC_STATUS2_OFFSET;
+
+	/* RW1CS: write 1 to clear, but only if the capability is supported */
+	if ((cap3 & CXL_DVSEC_CAP3_VOLATILE_HDM_CONFIGURABILITY) &&
+	    (new_val & CXL_DVSEC_STATUS2_VOLATILE_HDM_PRES_ERROR))
+		pci_write_config_word(vdev->pdev, abs_off,
+				      CXL_DVSEC_STATUS2_VOLATILE_HDM_PRES_ERROR);
+	/* STATUS2 is not mirrored in vconfig - reads go to hardware */
+}
+
+static void cxl_dvsec_lock_write(struct vfio_pci_core_device *vdev,
+				 u16 new_val)
+{
+	u16 cur_val = dvsec_virt_read16(vdev, CXL_DVSEC_LOCK_OFFSET);
+
+	/* Once the LOCK bit is set it can only be cleared by conventional reset */
+	if (cur_val & CXL_DVSEC_LOCK_CONFIG_LOCK)
+		return;
+
+	new_val &= ~CXL_LOCK_RESERVED_MASK;
+	dvsec_virt_write16(vdev, CXL_DVSEC_LOCK_OFFSET, new_val);
+}
+
+static void cxl_range_base_lo_write(struct vfio_pci_core_device *vdev,
+				    u16 dvsec_off, u32 new_val)
+{
+	new_val &= ~CXL_BASE_LO_RESERVED_MASK;
+	dvsec_virt_write32(vdev, dvsec_off, new_val);
+}
+
+/**
+ * vfio_cxl_dvsec_readfn - Per-device DVSEC read handler for CXL-capable devices.
+ * @vdev:   VFIO PCI core device
+ * @pos:    Absolute byte position in PCI config space
+ * @count:  Number of bytes to read
+ * @perm:   Permission bits for this capability (passed through to fallback)
+ * @offset: Byte offset within the capability structure (passed through)
+ * @val:    Output buffer for the read value (little-endian)
+ *
+ * Called via vfio_pci_dvsec_dispatch_read() for CXL devices.  Returns shadow
+ * vconfig values for virtualized DVSEC registers (CONTROL, STATUS, CONTROL2,
+ * LOCK) so that userspace reads reflect emulated state rather than raw
+ * hardware.  All other DVSEC bytes pass through to vfio_raw_config_read().
+ *
+ * Return: @count on success, or negative error code from the fallback read.
+ */
+static int vfio_cxl_dvsec_readfn(struct vfio_pci_core_device *vdev,
+				 int pos, int count,
+				 struct perm_bits *perm,
+				 int offset, __le32 *val)
+{
+	struct vfio_pci_cxl_state *cxl = vdev->cxl;
+	u16 dvsec = cxl ? _cxlds_get_dvsec(cxl) : 0;
+	u16 dvsec_off;
+
+	if (!cxl || (u16)pos < dvsec ||
+	    (u16)pos >= dvsec + cxl->dvsec_len)
+		return vfio_raw_config_read(vdev, pos, count, perm, offset, val);
+
+	dvsec_off = (u16)pos - dvsec;
+
+	switch (dvsec_off) {
+	case CXL_DVSEC_CONTROL_OFFSET:
+	case CXL_DVSEC_STATUS_OFFSET:
+	case CXL_DVSEC_CONTROL2_OFFSET:
+	case CXL_DVSEC_LOCK_OFFSET:
+		/* Return shadow vconfig value for virtualized registers */
+		memcpy(val, vdev->vconfig + pos, count);
+		return count;
+	default:
+		return vfio_raw_config_read(vdev, pos, count,
+					    perm, offset, val);
+	}
+}
+
+/**
+ * vfio_cxl_dvsec_writefn - Per-device DVSEC write handler for CXL capable devices.
+ * @vdev:   VFIO PCI core device
+ * @pos:    Absolute byte position in PCI config space
+ * @count:  Number of bytes to write
+ * @perm:   Permission bits for this capability (passed through to fallback)
+ * @offset: Byte offset within the capability structure (passed through)
+ * @val:    Value to write (little-endian)
+ *
+ * Installed into vdev->dvsec_writefn by vfio_cxl_setup_dvsec_perms() and
+ * invoked via vfio_pci_dvsec_dispatch_write().  The vdev->cxl NULL check
+ * distinguishes CXL devices from non-CXL devices that happen to expose a
+ * DVSEC capability; the latter fall back to vfio_raw_config_write().
+ *
+ * Return: @count on success; the vfio_raw_config_write() fallback also
+ *         returns @count or a negative error code.
+ */
+static int vfio_cxl_dvsec_writefn(struct vfio_pci_core_device *vdev,
+				  int pos, int count,
+				  struct perm_bits *perm,
+				  int offset, __le32 val)
+{
+	struct vfio_pci_cxl_state *cxl = vdev->cxl;
+	u16 abs_off = (u16)pos;
+	u16 dvsec, dvsec_off;
+	u16 wval16;
+	u32 wval32;
+
+	/* Check for CXL state before deriving the DVSEC base from it */
+	if (!cxl)
+		return vfio_raw_config_write(vdev, pos, count, perm,
+					     offset, val);
+
+	dvsec = _cxlds_get_dvsec(cxl);
+	if (abs_off < dvsec || abs_off >= dvsec + cxl->dvsec_len)
+		return vfio_raw_config_write(vdev, pos, count, perm,
+					     offset, val);
+
+	dvsec_off = abs_off - dvsec;
+
+	pci_dbg(vdev->pdev,
+		"vfio_cxl: DVSEC write: abs=0x%04x dvsec_off=0x%04x count=%d raw_val=0x%08x\n",
+		abs_off, dvsec_off, count, le32_to_cpu(val));
+
+	/* Route to the appropriate per-register handler */
+	switch (dvsec_off) {
+	case CXL_DVSEC_CONTROL_OFFSET:
+		wval16 = (u16)le32_to_cpu(val);
+		cxl_dvsec_control_write(vdev, wval16);
+		break;
+	case CXL_DVSEC_STATUS_OFFSET:
+		wval16 = (u16)le32_to_cpu(val);
+		cxl_dvsec_status_write(vdev, wval16);
+		break;
+	case CXL_DVSEC_CONTROL2_OFFSET:
+		wval16 = (u16)le32_to_cpu(val);
+		cxl_dvsec_control2_write(vdev, wval16);
+		break;
+	case CXL_DVSEC_STATUS2_OFFSET:
+		wval16 = (u16)le32_to_cpu(val);
+		cxl_dvsec_status2_write(vdev, wval16);
+		break;
+	case CXL_DVSEC_LOCK_OFFSET:
+		wval16 = (u16)le32_to_cpu(val);
+		cxl_dvsec_lock_write(vdev, wval16);
+		break;
+	case CXL_DVSEC_RANGE1_BASE_HIGH_OFFSET:
+	case CXL_DVSEC_RANGE2_BASE_HIGH_OFFSET:
+		wval32 = le32_to_cpu(val);
+		dvsec_virt_write32(vdev, dvsec_off, wval32);
+		break;
+	case CXL_DVSEC_RANGE1_BASE_LOW_OFFSET:
+	case CXL_DVSEC_RANGE2_BASE_LOW_OFFSET:
+		wval32 = le32_to_cpu(val);
+		cxl_range_base_lo_write(vdev, dvsec_off, wval32);
+		break;
+	default:
+		/* RO registers: header, capability, range sizes - discard */
+		break;
+	}
+
+	return count;
+}
+
+/**
+ * vfio_cxl_setup_dvsec_perms - Install per-device CXL DVSEC read/write hooks.
+ * @vdev: VFIO PCI core device
+ *
+ * Called once per device open after vfio_config_init() has seeded vdev->vconfig
+ * from hardware.  Installs vfio_cxl_dvsec_readfn and vfio_cxl_dvsec_writefn
+ * as per-device DVSEC handlers so that the global ecap_perms[DVSEC] dispatcher
+ * routes reads and writes through CXL-aware emulation.
+ *
+ * Forces CXL.io IO_ENABLE in the CONTROL vconfig shadow at init time so the
+ * initial guest read returns the correct value before the first write.
+ */
+void vfio_cxl_setup_dvsec_perms(struct vfio_pci_core_device *vdev)
+{
+	u16 ctrl = dvsec_virt_read16(vdev, CXL_DVSEC_CONTROL_OFFSET);
+
+	vdev->dvsec_readfn  = vfio_cxl_dvsec_readfn;
+	vdev->dvsec_writefn = vfio_cxl_dvsec_writefn;
+
+	/* Force IO_ENABLE; cxl_dvsec_control_write() maintains this invariant. */
+	ctrl |= CXL_DVSEC_CTRL_IO_ENABLE;
+	dvsec_virt_write16(vdev, CXL_DVSEC_CONTROL_OFFSET, ctrl);
+}
+EXPORT_SYMBOL_GPL(vfio_cxl_setup_dvsec_perms);
diff --git a/drivers/vfio/pci/cxl/vfio_cxl_core.c b/drivers/vfio/pci/cxl/vfio_cxl_core.c
index 19d3dc205f99..a3ff90b7a22c 100644
--- a/drivers/vfio/pci/cxl/vfio_cxl_core.c
+++ b/drivers/vfio/pci/cxl/vfio_cxl_core.c
@@ -68,13 +68,13 @@ vfio_cxl_create_device_state(struct pci_dev *pdev, u16 dvsec)
 	 * CACHE_CAPABLE is forwarded to the VMM so it knows whether a WBI
 	 * sequence is needed before FLR.
 	 */
-	if (!FIELD_GET(CXL_DVSEC_MEM_CAPABLE, cap_word) ||
+	if (!FIELD_GET(CXL_DVSEC_CAP_MEM_CAPABLE, cap_word) ||
 	    (pdev->class >> 8) == PCI_CLASS_MEMORY_CXL) {
 		devm_kfree(&pdev->dev, cxl);
 		return ERR_PTR(-ENODEV);
 	}
 
-	cxl->cache_capable = FIELD_GET(CXL_DVSEC_CACHE_CAPABLE, cap_word);
+	cxl->cache_capable = FIELD_GET(CXL_DVSEC_CAP_CACHE_CAPABLE, cap_word);
 
 	return cxl;
 }
diff --git a/drivers/vfio/pci/cxl/vfio_cxl_priv.h b/drivers/vfio/pci/cxl/vfio_cxl_priv.h
index 3458768445af..b86ee691d050 100644
--- a/drivers/vfio/pci/cxl/vfio_cxl_priv.h
+++ b/drivers/vfio/pci/cxl/vfio_cxl_priv.h
@@ -76,14 +76,43 @@ struct vfio_pci_cxl_state {
 #define CXL_HDM_DECODER_GLOBAL_CTRL_POISON_EN_BIT BIT(0)
 
 /*
- * CXL DVSEC for CXL Devices - register offsets within the DVSEC
- * (CXL 4.0 8.1.3).
- * Offsets are relative to the DVSEC capability base (cxl->dvsec).
+ * DVSEC register offsets and per-bit hardware definitions live in
+ * <uapi/cxl/cxl_regs.h> as CXL_DVSEC_*.  The masks below encode
+ * emulation policy: which bits to discard, which to forward to hardware,
+ * and which to preserve in the vconfig shadow independently of raw
+ * hardware state.
  */
-#define CXL_DVSEC_CAPABILITY_OFFSET 0xa
-#define CXL_DVSEC_MEM_CAPABLE	    BIT(2)
-/* CXL DVSEC Capability register bit 0: device supports CXL.cache (HDM-DB) */
-#define CXL_DVSEC_CACHE_CAPABLE	    BIT(0)
+/* DVSEC Control (0x0C): bits 13 (RsvdP) and 15 (RsvdP) are always discarded */
+#define CXL_CTRL_RESERVED_MASK           (BIT(13) | BIT(15))
+/* bit 12 (P2P_Mem_Enable) treated as reserved if Cap3.P2P_Mem_Capable=0 */
+#define CXL_CTRL_P2P_REV_MASK            CXL_DVSEC_CTRL_P2P_MEM_ENABLE
+
+/* DVSEC Status (0x0E): bits 13:0 and 15 are RsvdZ */
+#define CXL_STATUS_RESERVED_MASK         (GENMASK(13, 0) | BIT(15))
+
+/*
+ * DVSEC Control2 (0x10) emulation masks.
+ *
+ * CXL_CTRL2_HW_BITS_MASK: bits 1 (Initiate_Cache_WBI) and 2
+ * (Initiate_CXL_Reset) always read 0 from hardware; they are write-only
+ * action triggers per CXL r4.0 8.1.3.4 Table 8-8.  Forward these to the
+ * device to trigger the hardware action; clear them from the vconfig
+ * shadow so that subsequent guest reads return 0 as the spec requires.
+ *
+ * NOTE: bit 0 (Disable_Caching) and bit 3 (CXL_Reset_Mem_Clr_Enable) are
+ * ordinary RW fields; they must be preserved in vconfig, not forwarded.
+ */
+#define CXL_CTRL2_RESERVED_MASK          GENMASK(15, 6)
+#define CXL_CTRL2_HW_BITS_MASK           (BIT(1) | BIT(2))
+/* bit 4 is RsvdP if Cap3.Volatile_HDM_Configurability=0 */
+#define CXL_CTRL2_VOLATILE_HDM_REV_MASK  CXL_DVSEC_CTRL2_DESIRED_VOLATILE_HDM
+/* bit 5 is RsvdP if Cap2.Mod_Completion_Capable=0 */
+#define CXL_CTRL2_MODIFIED_COMP_REV_MASK CXL_DVSEC_CTRL2_MOD_COMPLETION_ENABLE
+
+/* DVSEC Lock (0x14): bits 15:1 are RsvdP */
+#define CXL_LOCK_RESERVED_MASK           GENMASK(15, 1)
+
+/* DVSEC Range Base Low: bits 27:0 are reserved per Tables 8-15/8-19 */
+#define CXL_BASE_LO_RESERVED_MASK        CXL_DVSEC_RANGE_BASE_LOW_RSVD_MASK
 
 int vfio_cxl_setup_virt_regs(struct vfio_pci_core_device *vdev,
 			     struct vfio_pci_cxl_state *cxl,
diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index 79aaf270adb2..5708837a6c99 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -1085,6 +1085,49 @@ static int __init init_pci_ext_cap_pwr_perm(struct perm_bits *perm)
 	return 0;
 }
 
+/*
+ * vfio_pci_dvsec_dispatch_read - per-device DVSEC read dispatcher.
+ *
+ * Installed as ecap_perms[PCI_EXT_CAP_ID_DVSEC].readfn at module init.
+ * Calls vdev->dvsec_readfn when a shadow-read handler has been registered
+ * (e.g. by vfio_cxl_setup_dvsec_perms() for CXL Type-2 devices), otherwise
+ * falls back to vfio_raw_config_read() for hardware pass-through.
+ *
+ * This indirection allows per-device DVSEC reads from vconfig shadow
+ * without touching the global ecap_perms[] table.
+ */
+static int vfio_pci_dvsec_dispatch_read(struct vfio_pci_core_device *vdev,
+					int pos, int count,
+					struct perm_bits *perm,
+					int offset, __le32 *val)
+{
+	if (vdev->dvsec_readfn)
+		return vdev->dvsec_readfn(vdev, pos, count, perm, offset, val);
+	return vfio_raw_config_read(vdev, pos, count, perm, offset, val);
+}
+
+/*
+ * vfio_pci_dvsec_dispatch_write - per-device DVSEC write dispatcher.
+ *
+ * Installed as ecap_perms[PCI_EXT_CAP_ID_DVSEC].writefn at module init.
+ * Calls vdev->dvsec_writefn when a handler has been registered for this
+ * device (e.g. by vfio_cxl_setup_dvsec_perms() for CXL Type-2 devices),
+ * otherwise falls back to vfio_raw_config_write() so that non-CXL devices
+ * with a DVSEC capability continue to pass writes to hardware.
+ *
+ * This indirection allows per-device DVSEC handlers to be registered
+ * without touching the global ecap_perms[] table.
+ */
+static int vfio_pci_dvsec_dispatch_write(struct vfio_pci_core_device *vdev,
+					 int pos, int count,
+					 struct perm_bits *perm,
+					 int offset, __le32 val)
+{
+	if (vdev->dvsec_writefn)
+		return vdev->dvsec_writefn(vdev, pos, count, perm, offset, val);
+	return vfio_raw_config_write(vdev, pos, count, perm, offset, val);
+}
+
 /*
  * Initialize the shared permission tables
  */
@@ -1121,7 +1164,8 @@ int __init vfio_pci_init_perm_bits(void)
 	ret |= init_pci_ext_cap_err_perm(&ecap_perms[PCI_EXT_CAP_ID_ERR]);
 	ret |= init_pci_ext_cap_pwr_perm(&ecap_perms[PCI_EXT_CAP_ID_PWR]);
 	ecap_perms[PCI_EXT_CAP_ID_VNDR].writefn = vfio_raw_config_write;
-	ecap_perms[PCI_EXT_CAP_ID_DVSEC].writefn = vfio_raw_config_write;
+	ecap_perms[PCI_EXT_CAP_ID_DVSEC].readfn  = vfio_pci_dvsec_dispatch_read;
+	ecap_perms[PCI_EXT_CAP_ID_DVSEC].writefn = vfio_pci_dvsec_dispatch_write;
 
 	if (ret)
 		vfio_pci_uninit_perm_bits();
diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
index 726063b6ff70..96f8361ce6f3 100644
--- a/drivers/vfio/pci/vfio_pci_priv.h
+++ b/drivers/vfio/pci/vfio_pci_priv.h
@@ -147,6 +147,7 @@ void vfio_pci_cxl_detect_and_init(struct vfio_pci_core_device *vdev);
 void vfio_pci_cxl_cleanup(struct vfio_pci_core_device *vdev);
 void vfio_cxl_zap_region_locked(struct vfio_pci_core_device *vdev);
 void vfio_cxl_reactivate_region(struct vfio_pci_core_device *vdev);
+void vfio_cxl_setup_dvsec_perms(struct vfio_pci_core_device *vdev);
 
 #else
 
@@ -158,6 +159,8 @@ static inline void
 vfio_cxl_zap_region_locked(struct vfio_pci_core_device *vdev) { }
 static inline void
 vfio_cxl_reactivate_region(struct vfio_pci_core_device *vdev) { }
+static inline void
+vfio_cxl_setup_dvsec_perms(struct vfio_pci_core_device *vdev) { }
 
 #endif /* CONFIG_VFIO_CXL_CORE */
 
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index cd8ed98a82a3..aa159d0c8da7 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -31,7 +31,7 @@ struct p2pdma_provider;
 struct dma_buf_phys_vec;
 struct dma_buf_attachment;
 struct vfio_pci_cxl_state;
-
+struct perm_bits;
 
 struct vfio_pci_eventfd {
 	struct eventfd_ctx	*ctx;
@@ -141,6 +141,12 @@ struct vfio_pci_core_device {
 	struct list_head	ioeventfds_list;
 	struct vfio_pci_vf_token	*vf_token;
 	struct vfio_pci_cxl_state *cxl;
+	int (*dvsec_readfn)(struct vfio_pci_core_device *vdev, int pos,
+			    int count, struct perm_bits *perm,
+			    int offset, __le32 *val);
+	int (*dvsec_writefn)(struct vfio_pci_core_device *vdev, int pos,
+			     int count, struct perm_bits *perm,
+			     int offset, __le32 val);
 	struct list_head		sriov_pfs_item;
 	struct vfio_pci_core_device	*sriov_pf_core_dev;
 	struct notifier_block	nb;
diff --git a/include/uapi/cxl/cxl_regs.h b/include/uapi/cxl/cxl_regs.h
index b6fcae91d216..e9746e75e09a 100644
--- a/include/uapi/cxl/cxl_regs.h
+++ b/include/uapi/cxl/cxl_regs.h
@@ -59,4 +59,102 @@
 #define CXL_HDM_DECODER0_SKIP_LOW(i) CXL_HDM_DECODER0_TL_LOW(i)
 #define CXL_HDM_DECODER0_SKIP_HIGH(i) CXL_HDM_DECODER0_TL_HIGH(i)
 
+/*
+ * CXL r4.0 8.1.3: DVSEC for CXL Devices
+ *
+ * Register offsets are relative to the DVSEC capability base address,
+ * as discovered via PCI_EXT_CAP_ID_DVSEC with DVSEC ID 0x0.
+ * All registers in this section are 16-bit wide.
+ */
+
+/* DVSEC register offsets */
+#define CXL_DVSEC_CAPABILITY_OFFSET       0x0a
+#define CXL_DVSEC_CONTROL_OFFSET          0x0c
+#define CXL_DVSEC_STATUS_OFFSET           0x0e
+#define CXL_DVSEC_CONTROL2_OFFSET         0x10
+#define CXL_DVSEC_STATUS2_OFFSET          0x12
+#define CXL_DVSEC_LOCK_OFFSET             0x14
+#define CXL_DVSEC_CAPABILITY2_OFFSET      0x16
+#define CXL_DVSEC_RANGE1_SIZE_HIGH_OFFSET 0x18
+#define CXL_DVSEC_RANGE1_SIZE_LOW_OFFSET  0x1c
+#define CXL_DVSEC_RANGE1_BASE_HIGH_OFFSET 0x20
+#define CXL_DVSEC_RANGE1_BASE_LOW_OFFSET  0x24
+#define CXL_DVSEC_RANGE2_SIZE_HIGH_OFFSET 0x28
+#define CXL_DVSEC_RANGE2_SIZE_LOW_OFFSET  0x2c
+#define CXL_DVSEC_RANGE2_BASE_HIGH_OFFSET 0x30
+#define CXL_DVSEC_RANGE2_BASE_LOW_OFFSET  0x34
+#define CXL_DVSEC_CAPABILITY3_OFFSET      0x38
+
+/* DVSEC Range Base Low registers: bits [27:0] are reserved */
+#define CXL_DVSEC_RANGE_BASE_LOW_RSVD_MASK __GENMASK(27, 0)
+
+/* CXL r4.0 8.1.3.1 Table 8-5 DVSEC CXL Capability (offset 0x0A) */
+#define CXL_DVSEC_CAP_CACHE_CAPABLE             _BITUL(0)
+#define CXL_DVSEC_CAP_IO_CAPABLE                _BITUL(1)
+#define CXL_DVSEC_CAP_MEM_CAPABLE               _BITUL(2)
+#define CXL_DVSEC_CAP_MEM_HW_INIT_MODE          _BITUL(3)
+#define CXL_DVSEC_CAP_HDM_COUNT_MASK            __GENMASK(5, 4)
+#define CXL_DVSEC_CAP_CACHE_WBI_CAPABLE         _BITUL(6)
+#define CXL_DVSEC_CAP_CXL_RESET_CAPABLE         _BITUL(7)
+#define CXL_DVSEC_CAP_CXL_RESET_TIMEOUT_MASK    __GENMASK(10, 8)
+#define CXL_DVSEC_CAP_CXL_RESET_MEM_CLR_CAPABLE _BITUL(11)
+#define CXL_DVSEC_CAP_TSP_CAPABLE               _BITUL(12)
+#define CXL_DVSEC_CAP_MLD_CAPABLE               _BITUL(13)
+#define CXL_DVSEC_CAP_VIRAL_CAPABLE             _BITUL(14)
+#define CXL_DVSEC_CAP_PM_INIT_REPORTING_CAPABLE _BITUL(15)
+
+/* CXL r4.0 8.1.3.2 Table 8-6 DVSEC CXL Control (offset 0x0C) */
+#define CXL_DVSEC_CTRL_CACHE_ENABLE              _BITUL(0)
+#define CXL_DVSEC_CTRL_IO_ENABLE                 _BITUL(1)
+#define CXL_DVSEC_CTRL_MEM_ENABLE                _BITUL(2)
+#define CXL_DVSEC_CTRL_CACHE_SF_COVERAGE_MASK    __GENMASK(7, 3)
+#define CXL_DVSEC_CTRL_CACHE_SF_GRANULARITY_MASK __GENMASK(10, 8)
+#define CXL_DVSEC_CTRL_CACHE_CLEAN_EVICTION      _BITUL(11)
+#define CXL_DVSEC_CTRL_P2P_MEM_ENABLE            _BITUL(12)
+/* bit 13: RsvdP */
+#define CXL_DVSEC_CTRL_VIRAL_ENABLE              _BITUL(14)
+/* bit 15: RsvdP */
+
+/* CXL r4.0 8.1.3.3 Table 8-7 DVSEC CXL Status (offset 0x0E) */
+/* bits 13:0 = RsvdZ */
+#define CXL_DVSEC_STATUS_VIRAL_STATUS _BITUL(14)
+/* bit 15 = RsvdZ */
+
+/* CXL r4.0 8.1.3.4 Table 8-8 DVSEC CXL Control2 (offset 0x10) */
+#define CXL_DVSEC_CTRL2_DISABLE_CACHING          _BITUL(0)
+#define CXL_DVSEC_CTRL2_INITIATE_CACHE_WBI       _BITUL(1)
+#define CXL_DVSEC_CTRL2_INITIATE_CXL_RESET       _BITUL(2)
+#define CXL_DVSEC_CTRL2_CXL_RESET_MEM_CLR_ENABLE _BITUL(3)
+#define CXL_DVSEC_CTRL2_DESIRED_VOLATILE_HDM     _BITUL(4)
+#define CXL_DVSEC_CTRL2_MOD_COMPLETION_ENABLE    _BITUL(5)
+/* bits 15:6 = RsvdP */
+
+/* CXL r4.0 8.1.3.5 Table 8-9 DVSEC CXL Status2 (offset 0x12) */
+#define CXL_DVSEC_STATUS2_CACHE_INVALID           _BITUL(0)
+#define CXL_DVSEC_STATUS2_CXL_RESET_COMPLETE      _BITUL(1)
+#define CXL_DVSEC_STATUS2_CXL_RESET_ERROR         _BITUL(2)
+/* RW1CS; RsvdZ if Cap3.Volatile_HDM_Configurability=0 */
+#define CXL_DVSEC_STATUS2_VOLATILE_HDM_PRES_ERROR _BITUL(3)
+/* bits 14:4 = RsvdZ */
+#define CXL_DVSEC_STATUS2_PM_INIT_COMPLETION      _BITUL(15)
+
+/* CXL r4.0 8.1.3.6 Table 8-10 DVSEC CXL Lock (offset 0x14) */
+#define CXL_DVSEC_LOCK_CONFIG_LOCK _BITUL(0)
+/* bits 15:1 = RsvdP */
+
+/* CXL r4.0 8.1.3.7 Table 8-11 DVSEC CXL Capability2 (offset 0x16) */
+#define CXL_DVSEC_CAP2_CACHE_SIZE_UNIT_MASK     __GENMASK(3, 0)
+#define CXL_DVSEC_CAP2_FALLBACK_CAPABILITY_MASK __GENMASK(5, 4)
+#define CXL_DVSEC_CAP2_MOD_COMPLETION_CAPABLE   _BITUL(6)
+#define CXL_DVSEC_CAP2_NO_CLEAN_WRITEBACK       _BITUL(7)
+#define CXL_DVSEC_CAP2_CACHE_SIZE_MASK          __GENMASK(15, 8)
+
+/* CXL r4.0 8.1.3.14 Table 8-20 DVSEC CXL Capability3 (offset 0x38) */
+#define CXL_DVSEC_CAP3_DEFAULT_VOLATILE_HDM_COLD_RESET _BITUL(0)
+#define CXL_DVSEC_CAP3_DEFAULT_VOLATILE_HDM_WARM_RESET _BITUL(1)
+#define CXL_DVSEC_CAP3_DEFAULT_VOLATILE_HDM_HOT_RESET  _BITUL(2)
+#define CXL_DVSEC_CAP3_VOLATILE_HDM_CONFIGURABILITY    _BITUL(3)
+#define CXL_DVSEC_CAP3_P2P_MEM_CAPABLE                 _BITUL(4)
+/* bits 15:5 = RsvdP */
+
 #endif /* _UAPI_CXL_REGS_H_ */
-- 
2.25.1



* [PATCH v2 16/20] vfio/cxl: Register regions with VFIO layer
  2026-04-01 14:38 [PATCH v2 00/20] vfio/pci: Add CXL Type-2 device passthrough support mhonap
                   ` (14 preceding siblings ...)
  2026-04-01 14:39 ` [PATCH v2 15/20] vfio/cxl: Virtualize CXL DVSEC config writes mhonap
@ 2026-04-01 14:39 ` mhonap
  2026-04-03 19:35   ` Dan Williams
  2026-04-01 14:39 ` [PATCH v2 17/20] vfio/pci: Advertise CXL cap and sparse component BAR to userspace mhonap
                   ` (3 subsequent siblings)
  19 siblings, 1 reply; 27+ messages in thread
From: mhonap @ 2026-04-01 14:39 UTC (permalink / raw)
  To: alwilliamson, dan.j.williams, jonathan.cameron, dave.jiang,
	alejandro.lucero-palau, dave, alison.schofield, vishal.l.verma,
	ira.weiny, dmatlack, shuah, jgg, yishaih, skolothumtho,
	kevin.tian, ankita
  Cc: vsethi, cjia, targupta, zhiw, kjaju, linux-kselftest,
	linux-kernel, linux-cxl, kvm, mhonap

From: Manish Honap <mhonap@nvidia.com>

Register the DPA and component register regions with the VFIO layer.
The region index of each is cached for quick lookup.

vfio_cxl_register_cxl_region()
- memremap()s the region HPA with the WB attribute (treat CXL.mem as
  RAM, not MMIO)
- Registers VFIO_REGION_SUBTYPE_CXL
- Records dpa_region_idx

vfio_cxl_register_comp_regs_region()
- Registers VFIO_REGION_SUBTYPE_CXL_COMP_REGS with size
  hdm_reg_offset + hdm_reg_size
- Records comp_reg_region_idx

Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
 drivers/vfio/pci/cxl/vfio_cxl_core.c | 98 +++++++++++++++++++++++++++-
 drivers/vfio/pci/cxl/vfio_cxl_emu.c  | 34 ++++++++++
 drivers/vfio/pci/cxl/vfio_cxl_priv.h |  2 +
 drivers/vfio/pci/vfio_pci.c          | 23 +++++++
 drivers/vfio/pci/vfio_pci_priv.h     | 11 ++++
 5 files changed, 167 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/pci/cxl/vfio_cxl_core.c b/drivers/vfio/pci/cxl/vfio_cxl_core.c
index a3ff90b7a22c..b38a04301660 100644
--- a/drivers/vfio/pci/cxl/vfio_cxl_core.c
+++ b/drivers/vfio/pci/cxl/vfio_cxl_core.c
@@ -75,6 +75,8 @@ vfio_cxl_create_device_state(struct pci_dev *pdev, u16 dvsec)
 	}
 
 	cxl->cache_capable = FIELD_GET(CXL_DVSEC_CAP_CACHE_CAPABLE, cap_word);
+	cxl->dpa_region_idx = -1;
+	cxl->comp_reg_region_idx = -1;
 
 	return cxl;
 }
@@ -509,14 +511,19 @@ static int vfio_cxl_region_mmap(struct vfio_pci_core_device *vdev,
  */
 void vfio_cxl_zap_region_locked(struct vfio_pci_core_device *vdev)
 {
+	struct vfio_device *core_vdev = &vdev->vdev;
 	struct vfio_pci_cxl_state *cxl = vdev->cxl;
 
 	lockdep_assert_held_write(&vdev->memory_lock);
 
-	if (!cxl)
+	if (!cxl || cxl->dpa_region_idx < 0)
 		return;
 
 	WRITE_ONCE(cxl->region_active, false);
+	unmap_mapping_range(core_vdev->inode->i_mapping,
+			    VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_NUM_REGIONS +
+						     cxl->dpa_region_idx),
+			    cxl->region_size, true);
 }
 
 /*
@@ -601,6 +608,7 @@ static ssize_t vfio_cxl_region_rw(struct vfio_pci_core_device *core_dev,
 static void vfio_cxl_region_release(struct vfio_pci_core_device *vdev,
 				    struct vfio_pci_region *region)
 {
+	struct vfio_device *core_vdev = &vdev->vdev;
 	struct vfio_pci_cxl_state *cxl = region->data;
 
 	/*
@@ -610,6 +618,16 @@ static void vfio_cxl_region_release(struct vfio_pci_core_device *vdev,
 	 */
 	WRITE_ONCE(cxl->region_active, false);
 
+	/*
+	 * Remove all user mappings of the DPA region while the device is
+	 * still alive.
+	 */
+	if (cxl->dpa_region_idx >= 0)
+		unmap_mapping_range(core_vdev->inode->i_mapping,
+			    VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_NUM_REGIONS +
+						     cxl->dpa_region_idx),
+				    cxl->region_size, true);
+
 	if (cxl->region_vaddr) {
 		memunmap(cxl->region_vaddr);
 		cxl->region_vaddr = NULL;
@@ -622,4 +640,82 @@ static const struct vfio_pci_regops vfio_cxl_regops = {
 	.release	= vfio_cxl_region_release,
 };
 
+int vfio_cxl_register_cxl_region(struct vfio_pci_core_device *vdev)
+{
+	struct vfio_pci_cxl_state *cxl = vdev->cxl;
+	u32 flags;
+	int ret;
+
+	if (!cxl)
+		return -ENODEV;
+
+	if (!cxl->region || cxl->region_vaddr)
+		return -ENODEV;
+
+	/*
+	 * CXL device memory is RAM, not MMIO.  Use memremap() rather than
+	 * ioremap_cache() so the correct memory-mapping API is used.
+	 * The WB attribute matches the cache-coherent nature of CXL.mem.
+	 */
+	cxl->region_vaddr = memremap(cxl->region_hpa, cxl->region_size,
+				     MEMREMAP_WB);
+	if (!cxl->region_vaddr)
+		return -ENOMEM;
+
+	flags = VFIO_REGION_INFO_FLAG_READ |
+		VFIO_REGION_INFO_FLAG_WRITE |
+		VFIO_REGION_INFO_FLAG_MMAP;
+
+	ret = vfio_pci_core_register_dev_region(vdev,
+						PCI_VENDOR_ID_CXL |
+						VFIO_REGION_TYPE_PCI_VENDOR_TYPE,
+						VFIO_REGION_SUBTYPE_CXL,
+						&vfio_cxl_regops,
+						cxl->region_size, flags,
+						cxl);
+	if (ret) {
+		memunmap(cxl->region_vaddr);
+		cxl->region_vaddr = NULL;
+		return ret;
+	}
+
+	/*
+	 * Cache the vdev->region[] index before activating the region.
+	 * vfio_pci_core_register_dev_region() placed the new entry at
+	 * vdev->region[num_regions - 1] and incremented num_regions.
+	 * vfio_cxl_zap_region_locked() uses this to avoid scanning
+	 * vdev->region[] on every FLR.
+	 */
+	cxl->dpa_region_idx = vdev->num_regions - 1;
+
+	vfio_cxl_reinit_comp_regs(cxl);
+
+	WRITE_ONCE(cxl->region_active, true);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(vfio_cxl_register_cxl_region);
+
+/**
+ * vfio_cxl_unregister_cxl_region - Undo vfio_cxl_register_cxl_region()
+ * @vdev: VFIO PCI device
+ *
+ * Marks the DPA region inactive and resets dpa_region_idx.
+ * Does NOT touch CXL subsystem state (cxl->region, cxl->cxled, cxl->cxlrd).
+ * The caller must call vfio_cxl_destroy_cxl_region() separately to release
+ * those objects.
+ */
+void vfio_cxl_unregister_cxl_region(struct vfio_pci_core_device *vdev)
+{
+	struct vfio_pci_cxl_state *cxl = vdev->cxl;
+
+	if (!cxl || cxl->dpa_region_idx < 0)
+		return;
+
+	WRITE_ONCE(cxl->region_active, false);
+
+	cxl->dpa_region_idx = -1;
+}
+EXPORT_SYMBOL_GPL(vfio_cxl_unregister_cxl_region);
+
 MODULE_IMPORT_NS("CXL");
diff --git a/drivers/vfio/pci/cxl/vfio_cxl_emu.c b/drivers/vfio/pci/cxl/vfio_cxl_emu.c
index 781328a79b43..50d3718b101d 100644
--- a/drivers/vfio/pci/cxl/vfio_cxl_emu.c
+++ b/drivers/vfio/pci/cxl/vfio_cxl_emu.c
@@ -473,3 +473,37 @@ void vfio_cxl_clean_virt_regs(struct vfio_pci_cxl_state *cxl)
 	kfree(cxl->comp_reg_virt);
 	cxl->comp_reg_virt = NULL;
 }
+
+/*
+ * vfio_cxl_register_comp_regs_region - Register the COMP_REGS device region.
+ *
+ * Exposes the emulated HDM decoder register state as a VFIO device region
+ * with type VFIO_REGION_SUBTYPE_CXL_COMP_REGS.  QEMU attaches a
+ * notify_change callback to this region to intercept HDM COMMIT writes
+ * and map the DPA MemoryRegion at the appropriate GPA.
+ *
+ * The region is read+write only (no mmap) to ensure all accesses pass
+ * through comp_regs_dispatch_write() for proper bit-field enforcement.
+ */
+int vfio_cxl_register_comp_regs_region(struct vfio_pci_core_device *vdev)
+{
+	struct vfio_pci_cxl_state *cxl = vdev->cxl;
+	u32 flags = VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_WRITE;
+	int ret;
+
+	if (!cxl || !cxl->comp_reg_virt)
+		return -ENODEV;
+
+	ret = vfio_pci_core_register_dev_region(vdev,
+						PCI_VENDOR_ID_CXL |
+						VFIO_REGION_TYPE_PCI_VENDOR_TYPE,
+						VFIO_REGION_SUBTYPE_CXL_COMP_REGS,
+						&vfio_cxl_comp_regs_ops,
+						cxl->hdm_reg_offset +
+						cxl->hdm_reg_size, flags, cxl);
+	if (!ret)
+		cxl->comp_reg_region_idx = vdev->num_regions - 1;
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(vfio_cxl_register_comp_regs_region);
diff --git a/drivers/vfio/pci/cxl/vfio_cxl_priv.h b/drivers/vfio/pci/cxl/vfio_cxl_priv.h
index b86ee691d050..b884689a1226 100644
--- a/drivers/vfio/pci/cxl/vfio_cxl_priv.h
+++ b/drivers/vfio/pci/cxl/vfio_cxl_priv.h
@@ -28,6 +28,8 @@ struct vfio_pci_cxl_state {
 	__le32                      *comp_reg_virt;
 	size_t                       dpa_size;
 	void __iomem                *hdm_iobase;
+	int                          dpa_region_idx;
+	int                          comp_reg_region_idx;
 	u16                          dvsec_len;
 	u8                           hdm_count;
 	u8                           comp_reg_bar;
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 0c771064c0b8..22cf9ea831f9 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -120,6 +120,29 @@ static int vfio_pci_open_device(struct vfio_device *core_vdev)
 		}
 	}
 
+	if (vdev->cxl) {
+		/*
+		 * pci_config_map and vconfig are valid now (allocated by
+		 * vfio_config_init() inside vfio_pci_core_enable() above).
+		 */
+		vfio_cxl_setup_dvsec_perms(vdev);
+
+		ret = vfio_cxl_register_cxl_region(vdev);
+		if (ret) {
+			pci_warn(pdev, "Failed to setup CXL region\n");
+			vfio_pci_core_disable(vdev);
+			return ret;
+		}
+
+		ret = vfio_cxl_register_comp_regs_region(vdev);
+		if (ret) {
+			pci_warn(pdev, "Failed to register COMP_REGS region\n");
+			vfio_cxl_unregister_cxl_region(vdev);
+			vfio_pci_core_disable(vdev);
+			return ret;
+		}
+	}
+
 	vfio_pci_core_finish_enable(vdev);
 
 	return 0;
diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
index 96f8361ce6f3..ae0091d5096c 100644
--- a/drivers/vfio/pci/vfio_pci_priv.h
+++ b/drivers/vfio/pci/vfio_pci_priv.h
@@ -148,6 +148,9 @@ void vfio_pci_cxl_cleanup(struct vfio_pci_core_device *vdev);
 void vfio_cxl_zap_region_locked(struct vfio_pci_core_device *vdev);
 void vfio_cxl_reactivate_region(struct vfio_pci_core_device *vdev);
 void vfio_cxl_setup_dvsec_perms(struct vfio_pci_core_device *vdev);
+int vfio_cxl_register_cxl_region(struct vfio_pci_core_device *vdev);
+void vfio_cxl_unregister_cxl_region(struct vfio_pci_core_device *vdev);
+int  vfio_cxl_register_comp_regs_region(struct vfio_pci_core_device *vdev);
 
 #else
 
@@ -161,6 +164,14 @@ static inline void
 vfio_cxl_reactivate_region(struct vfio_pci_core_device *vdev) { }
 static inline void
 vfio_cxl_setup_dvsec_perms(struct vfio_pci_core_device *vdev) { }
+static inline int
+vfio_cxl_register_cxl_region(struct vfio_pci_core_device *vdev)
+{ return 0; }
+static inline void
+vfio_cxl_unregister_cxl_region(struct vfio_pci_core_device *vdev) { }
+static inline int
+vfio_cxl_register_comp_regs_region(struct vfio_pci_core_device *vdev)
+{ return 0; }
 
 #endif /* CONFIG_VFIO_CXL_CORE */
 
-- 
2.25.1



* [PATCH v2 17/20] vfio/pci: Advertise CXL cap and sparse component BAR to userspace
  2026-04-01 14:38 [PATCH v2 00/20] vfio/pci: Add CXL Type-2 device passthrough support mhonap
                   ` (15 preceding siblings ...)
  2026-04-01 14:39 ` [PATCH v2 16/20] vfio/cxl: Register regions with VFIO layer mhonap
@ 2026-04-01 14:39 ` mhonap
  2026-04-01 14:39 ` [PATCH v2 18/20] vfio/cxl: Provide opt-out for CXL feature mhonap
                   ` (2 subsequent siblings)
  19 siblings, 0 replies; 27+ messages in thread
From: mhonap @ 2026-04-01 14:39 UTC (permalink / raw)
  To: alwilliamson, dan.j.williams, jonathan.cameron, dave.jiang,
	alejandro.lucero-palau, dave, alison.schofield, vishal.l.verma,
	ira.weiny, dmatlack, shuah, jgg, yishaih, skolothumtho,
	kevin.tian, ankita
  Cc: vsethi, cjia, targupta, zhiw, kjaju, linux-kselftest,
	linux-kernel, linux-cxl, kvm, mhonap

From: Manish Honap <mhonap@nvidia.com>

Expose the CXL device capability through the VFIO device info ioctl and give
userspace access to the GPU/accelerator register windows in the component
BAR while protecting the CXL component register block.

vfio_cxl_get_info() fills VFIO_DEVICE_INFO_CAP_CXL with the HDM register
BAR index and byte offset, commit flags, and VFIO region indices for the
DPA and COMP_REGS regions. HDM decoder count and the HDM block offset
within COMP_REGS are not populated; both are derivable from the CXL
Capability Array in the COMP_REGS region itself.

vfio_cxl_get_region_info() handles VFIO_DEVICE_GET_REGION_INFO for the
component register BAR. It builds a sparse-mmap capability that advertises
only the GPU/accelerator register windows, carving out the CXL component
register block. Three physical layouts are handled:

  Topology A  comp block at BAR end:    one area [0, comp_reg_offset)
  Topology B  comp block at BAR start:  one area [comp_end, bar_len)
  Topology C  comp block in the middle: two areas, one on each side

vfio_cxl_mmap_overlaps_comp_regs() checks whether an mmap request overlaps
[comp_reg_offset, comp_reg_offset + comp_reg_size). vfio_pci_core_mmap()
calls it to reject access to the component register block while allowing
mmap of the GPU register windows in the sparse capability. This replaces
the earlier blanket rejection of any mmap on the component BAR index.

Hook both helpers into vfio_pci_ioctl_get_info() and
vfio_pci_ioctl_get_region_info() in vfio_pci_core.c.

The component BAR cannot be claimed exclusively since the CXL subsystem
holds persistent sub-range iomem claims during HDM decoder setup, so
pci_request_selected_regions() would fail with -EBUSY. Pass bars=0 to skip
the request and map directly via pci_iomap(). Physical ownership is still
assured by driver binding.

Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
 drivers/vfio/pci/cxl/vfio_cxl_core.c | 155 +++++++++++++++++++++++++++
 drivers/vfio/pci/vfio_pci_core.c     |  31 +++++-
 drivers/vfio/pci/vfio_pci_priv.h     |  24 +++++
 drivers/vfio/pci/vfio_pci_rdwr.c     |  16 ++-
 4 files changed, 221 insertions(+), 5 deletions(-)

diff --git a/drivers/vfio/pci/cxl/vfio_cxl_core.c b/drivers/vfio/pci/cxl/vfio_cxl_core.c
index b38a04301660..46430cbfa962 100644
--- a/drivers/vfio/pci/cxl/vfio_cxl_core.c
+++ b/drivers/vfio/pci/cxl/vfio_cxl_core.c
@@ -21,6 +21,161 @@
 #include "../vfio_pci_priv.h"
 #include "vfio_cxl_priv.h"
 
+u8 vfio_cxl_get_component_reg_bar(struct vfio_pci_core_device *vdev)
+{
+	return vdev->cxl->comp_reg_bar;
+}
+
+int vfio_cxl_get_region_info(struct vfio_pci_core_device *vdev,
+			     struct vfio_region_info *info,
+			     struct vfio_info_cap *caps)
+{
+	unsigned long minsz = offsetofend(struct vfio_region_info, offset);
+	struct vfio_region_info_cap_sparse_mmap *sparse;
+	struct vfio_pci_cxl_state *cxl = vdev->cxl;
+	resource_size_t bar_len, comp_end;
+	u32 nr_areas, cap_size;
+	int ret;
+
+	if (!cxl)
+		return -ENOTTY;
+
+	if (!info)
+		return -ENOTTY;
+
+	if (info->argsz < minsz)
+		return -EINVAL;
+
+	if (info->index != cxl->comp_reg_bar)
+		return -ENOTTY;
+
+	/*
+	 * The device state is not fully initialised;
+	 * fall through to the default BAR handler.
+	 */
+	if (!cxl->comp_reg_size)
+		return -ENOTTY;
+
+	bar_len  = pci_resource_len(vdev->pdev, info->index);
+	comp_end = cxl->comp_reg_offset + cxl->comp_reg_size;
+
+	/*
+	 * Advertise the GPU/accelerator register windows as mmappable by
+	 * carving the CXL component register block out of the BAR.  The
+	 * number of sparse areas depends on where the block sits:
+	 *
+	 *  [A] comp block at BAR end  [gpu_regs | comp_regs]:
+	 *    comp_reg_offset > 0  &&  comp_end == bar_len
+	 *    = 1 area: [0, comp_reg_offset)
+	 *
+	 *  [B] comp block at BAR start [comp_regs | gpu_regs]:
+	 *    comp_reg_offset == 0 &&  comp_end < bar_len
+	 *    = 1 area: [comp_end, bar_len)
+	 *
+	 *  [C] comp block in middle    [gpu_regs | comp_regs | gpu_regs]:
+	 *    comp_reg_offset > 0  &&  comp_end < bar_len
+	 *    = 2 areas: [0, comp_reg_offset) and [comp_end, bar_len)
+	 */
+	if (cxl->comp_reg_offset > 0 && comp_end < bar_len)
+		nr_areas = 2;
+	else
+		nr_areas = 1;
+
+	cap_size = struct_size(sparse, areas, nr_areas);
+	sparse = kzalloc(cap_size, GFP_KERNEL);
+	if (!sparse)
+		return -ENOMEM;
+
+	sparse->header.id = VFIO_REGION_INFO_CAP_SPARSE_MMAP;
+	sparse->header.version = 1;
+	sparse->nr_areas = nr_areas;
+
+	if (nr_areas == 2) {
+		/* [C]: window before and after comp block */
+		sparse->areas[0].offset = 0;
+		sparse->areas[0].size   = cxl->comp_reg_offset;
+		sparse->areas[1].offset = comp_end;
+		sparse->areas[1].size   = bar_len - comp_end;
+	} else if (cxl->comp_reg_offset == 0) {
+		/* [B]: comp block at BAR start, window follows */
+		sparse->areas[0].offset = comp_end;
+		sparse->areas[0].size   = bar_len - comp_end;
+	} else {
+		/* [A]: comp block at BAR end, window precedes */
+		sparse->areas[0].offset = 0;
+		sparse->areas[0].size   = cxl->comp_reg_offset;
+	}
+
+	ret = vfio_info_add_capability(caps, &sparse->header, cap_size);
+	kfree(sparse);
+	if (ret)
+		return ret;
+
+	info->offset = VFIO_PCI_INDEX_TO_OFFSET(info->index);
+	info->size   = bar_len;
+	info->flags  = VFIO_REGION_INFO_FLAG_READ |
+		       VFIO_REGION_INFO_FLAG_WRITE |
+		       VFIO_REGION_INFO_FLAG_MMAP;
+
+	return 0;
+}
+
+bool vfio_cxl_mmap_overlaps_comp_regs(struct vfio_pci_core_device *vdev,
+				       u64 req_start, u64 req_len)
+{
+	struct vfio_pci_cxl_state *cxl = vdev->cxl;
+
+	if (!cxl->comp_reg_size)
+		return false;
+
+	return req_start < cxl->comp_reg_offset + cxl->comp_reg_size &&
+	       req_start + req_len > cxl->comp_reg_offset;
+}
+
+int vfio_cxl_get_info(struct vfio_pci_core_device *vdev,
+		      struct vfio_info_cap *caps)
+{
+	struct vfio_pci_cxl_state *cxl = vdev->cxl;
+	struct vfio_device_info_cap_cxl cxl_cap = {0};
+
+	if (!cxl)
+		return 0;
+
+	/*
+	 * Bail out if the device is not fully initialised.
+	 */
+	if (WARN_ON(cxl->dpa_region_idx < 0 || cxl->comp_reg_region_idx < 0))
+		return -ENODEV;
+
+	/* Fill in from CXL device structure */
+	cxl_cap.header.id = VFIO_DEVICE_INFO_CAP_CXL;
+	cxl_cap.header.version = 1;
+	/*
+	 * COMP_REGS region starts at comp_reg_offset + CXL_CM_OFFSET within
+	 * the BAR.  This is the byte offset of the CXL.mem register area (where
+	 * the CXL Capability Array Header lives) within the component register
+	 * block. Userspace derives hdm_decoder_offset and hdm_count from the
+	 * COMP_REGS region itself (CXL Capability Array traversal + HDMC read).
+	 */
+	cxl_cap.hdm_regs_offset = cxl->comp_reg_offset + CXL_CM_OFFSET;
+	cxl_cap.hdm_regs_bar_index = cxl->comp_reg_bar;
+
+	if (cxl->precommitted)
+		cxl_cap.flags |= VFIO_CXL_CAP_FIRMWARE_COMMITTED;
+	if (cxl->cache_capable)
+		cxl_cap.flags |= VFIO_CXL_CAP_CACHE_CAPABLE;
+
+	/*
+	 * Populate absolute VFIO region indices so userspace can query them
+	 * directly with VFIO_DEVICE_GET_REGION_INFO.
+	 */
+	cxl_cap.dpa_region_index = VFIO_PCI_NUM_REGIONS + cxl->dpa_region_idx;
+	cxl_cap.comp_regs_region_index =
+		VFIO_PCI_NUM_REGIONS + cxl->comp_reg_region_idx;
+
+	return vfio_info_add_capability(caps, &cxl_cap.header, sizeof(cxl_cap));
+}
+
 /*
  * Scope-based cleanup wrappers for the CXL resource APIs
  */
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 48e0274c19aa..570775cc8711 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -591,7 +591,7 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
 	struct pci_dev *pdev = vdev->pdev;
 	struct vfio_pci_dummy_resource *dummy_res, *tmp;
 	struct vfio_pci_ioeventfd *ioeventfd, *ioeventfd_tmp;
-	int i, bar;
+	int i, bar, bars;
 
 	/* For needs_reset */
 	lockdep_assert_held(&vdev->vdev.dev_set->lock);
@@ -650,8 +650,10 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
 		bar = i + PCI_STD_RESOURCES;
 		if (!vdev->barmap[bar])
 			continue;
+		bars = (vdev->cxl && i == vfio_cxl_get_component_reg_bar(vdev)) ?
+			0 : (1 << bar);
 		pci_iounmap(pdev, vdev->barmap[bar]);
-		pci_release_selected_regions(pdev, 1 << bar);
+		pci_release_selected_regions(pdev, bars);
 		vdev->barmap[bar] = NULL;
 	}
 
@@ -989,6 +991,13 @@ static int vfio_pci_ioctl_get_info(struct vfio_pci_core_device *vdev,
 	if (vdev->reset_works)
 		info.flags |= VFIO_DEVICE_FLAGS_RESET;
 
+	if (vdev->cxl) {
+		ret = vfio_cxl_get_info(vdev, &caps);
+		if (ret)
+			return ret;
+		info.flags |= VFIO_DEVICE_FLAGS_CXL;
+	}
+
 	info.num_regions = VFIO_PCI_NUM_REGIONS + vdev->num_regions;
 	info.num_irqs = VFIO_PCI_NUM_IRQS;
 
@@ -1034,6 +1043,12 @@ int vfio_pci_ioctl_get_region_info(struct vfio_device *core_vdev,
 	struct pci_dev *pdev = vdev->pdev;
 	int i, ret;
 
+	if (vdev->cxl) {
+		ret = vfio_cxl_get_region_info(vdev, info, caps);
+		if (ret != -ENOTTY)
+			return ret;
+	}
+
 	switch (info->index) {
 	case VFIO_PCI_CONFIG_REGION_INDEX:
 		info->offset = VFIO_PCI_INDEX_TO_OFFSET(info->index);
@@ -1768,6 +1783,18 @@ int vfio_pci_core_mmap(struct vfio_device *core_vdev, struct vm_area_struct *vma
 	if (req_start + req_len > phys_len)
 		return -EINVAL;
 
+	/*
+	 * CXL devices: mmap is permitted for the GPU/accelerator register
+	 * windows listed in the sparse-mmap capability.  Block any request
+	 * that overlaps the CXL component register block
+	 * [comp_reg_offset, comp_reg_offset + comp_reg_size); those registers
+	 * must be accessed exclusively through the COMP_REGS device region so
+	 * that the emulation layer (notify_change) intercepts every write.
+	 */
+	if (vdev->cxl && index == vfio_cxl_get_component_reg_bar(vdev) &&
+	    vfio_cxl_mmap_overlaps_comp_regs(vdev, req_start, req_len))
+		return -EINVAL;
+
 	/*
 	 * Even though we don't make use of the barmap for the mmap,
 	 * we need to request the region and the barmap tracks that.
diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
index ae0091d5096c..2d4aadd1b35a 100644
--- a/drivers/vfio/pci/vfio_pci_priv.h
+++ b/drivers/vfio/pci/vfio_pci_priv.h
@@ -151,6 +151,14 @@ void vfio_cxl_setup_dvsec_perms(struct vfio_pci_core_device *vdev);
 int vfio_cxl_register_cxl_region(struct vfio_pci_core_device *vdev);
 void vfio_cxl_unregister_cxl_region(struct vfio_pci_core_device *vdev);
 int  vfio_cxl_register_comp_regs_region(struct vfio_pci_core_device *vdev);
+int vfio_cxl_get_info(struct vfio_pci_core_device *vdev,
+		      struct vfio_info_cap *caps);
+int vfio_cxl_get_region_info(struct vfio_pci_core_device *vdev,
+			     struct vfio_region_info *info,
+			     struct vfio_info_cap *caps);
+u8 vfio_cxl_get_component_reg_bar(struct vfio_pci_core_device *vdev);
+bool vfio_cxl_mmap_overlaps_comp_regs(struct vfio_pci_core_device *vdev,
+				      u64 req_start, u64 req_len);
 
 #else
 
@@ -172,6 +180,22 @@ vfio_cxl_unregister_cxl_region(struct vfio_pci_core_device *vdev) { }
 static inline int
 vfio_cxl_register_comp_regs_region(struct vfio_pci_core_device *vdev)
 { return 0; }
+static inline int
+vfio_cxl_get_info(struct vfio_pci_core_device *vdev,
+		  struct vfio_info_cap *caps)
+{ return -ENOTTY; }
+static inline int
+vfio_cxl_get_region_info(struct vfio_pci_core_device *vdev,
+			 struct vfio_region_info *info,
+			 struct vfio_info_cap *caps)
+{ return -ENOTTY; }
+static inline u8
+vfio_cxl_get_component_reg_bar(struct vfio_pci_core_device *vdev)
+{ return U8_MAX; }
+static inline bool
+vfio_cxl_mmap_overlaps_comp_regs(struct vfio_pci_core_device *vdev,
+				 u64 req_start, u64 req_len)
+{ return false; }
 
 #endif /* CONFIG_VFIO_CXL_CORE */
 
diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
index b38627b35c35..e95bdbdbcdb2 100644
--- a/drivers/vfio/pci/vfio_pci_rdwr.c
+++ b/drivers/vfio/pci/vfio_pci_rdwr.c
@@ -201,19 +201,29 @@ EXPORT_SYMBOL_GPL(vfio_pci_core_do_io_rw);
 int vfio_pci_core_setup_barmap(struct vfio_pci_core_device *vdev, int bar)
 {
 	struct pci_dev *pdev = vdev->pdev;
-	int ret;
+	int ret, bars;
 	void __iomem *io;
 
 	if (vdev->barmap[bar])
 		return 0;
 
-	ret = pci_request_selected_regions(pdev, 1 << bar, "vfio");
+	/*
+	 * The CXL component register BAR cannot be claimed exclusively: the
+	 * CXL subsystem holds persistent sub-range iomem claims during HDM
+	 * decoder setup. pci_request_selected_regions() for the full BAR
+	 * fails with -EBUSY. Pass bars=0 to make the request a no-op and map
+	 * directly via pci_iomap().
+	 */
+	bars = (vdev->cxl && bar == vfio_cxl_get_component_reg_bar(vdev)) ?
+		0 : (1 << bar);
+
+	ret = pci_request_selected_regions(pdev, bars, "vfio");
 	if (ret)
 		return ret;
 
 	io = pci_iomap(pdev, bar, 0);
 	if (!io) {
-		pci_release_selected_regions(pdev, 1 << bar);
+		pci_release_selected_regions(pdev, bars);
 		return -ENOMEM;
 	}
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v2 18/20] vfio/cxl: Provide opt-out for CXL feature
  2026-04-01 14:38 [PATCH v2 00/20] vfio/pci: Add CXL Type-2 device passthrough support mhonap
                   ` (16 preceding siblings ...)
  2026-04-01 14:39 ` [PATCH v2 17/20] vfio/pci: Advertise CXL cap and sparse component BAR to userspace mhonap
@ 2026-04-01 14:39 ` mhonap
  2026-04-01 14:39 ` [PATCH v2 19/20] docs: vfio-pci: Document CXL Type-2 device passthrough mhonap
  2026-04-01 14:39 ` [PATCH v2 20/20] selftests/vfio: Add CXL Type-2 VFIO assignment test mhonap
  19 siblings, 0 replies; 27+ messages in thread
From: mhonap @ 2026-04-01 14:39 UTC (permalink / raw)
  To: alwilliamson, dan.j.williams, jonathan.cameron, dave.jiang,
	alejandro.lucero-palau, dave, alison.schofield, vishal.l.verma,
	ira.weiny, dmatlack, shuah, jgg, yishaih, skolothumtho,
	kevin.tian, ankita
  Cc: vsethi, cjia, targupta, zhiw, kjaju, linux-kselftest,
	linux-kernel, linux-cxl, kvm, mhonap

From: Manish Honap <mhonap@nvidia.com>

Provide an opt-out mechanism to disable CXL support in the vfio-pci
module. The opt-out is available both at build time and at module
load time.

The build-time option CONFIG_VFIO_CXL_CORE enables or disables CXL
support in the vfio-pci module.

To disable CXL support at runtime, use the disable_cxl module
parameter. Internally this sets a per-device opt-out flag on the core
device, which a driver can also set itself before registration.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
 drivers/vfio/pci/cxl/vfio_cxl_core.c | 4 ++++
 drivers/vfio/pci/vfio_pci.c          | 9 +++++++++
 include/linux/vfio_pci_core.h        | 1 +
 3 files changed, 14 insertions(+)

diff --git a/drivers/vfio/pci/cxl/vfio_cxl_core.c b/drivers/vfio/pci/cxl/vfio_cxl_core.c
index 46430cbfa962..3ffc3e593d04 100644
--- a/drivers/vfio/pci/cxl/vfio_cxl_core.c
+++ b/drivers/vfio/pci/cxl/vfio_cxl_core.c
@@ -479,6 +479,10 @@ void vfio_pci_cxl_detect_and_init(struct vfio_pci_core_device *vdev)
 	u16 dvsec;
 	int ret;
 
+	/* Honor the user opt-out decision */
+	if (vdev->disable_cxl)
+		return;
+
 	if (!pcie_is_cxl(pdev))
 		return;
 
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 22cf9ea831f9..a6b0fb882b9f 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -60,6 +60,12 @@ static bool disable_denylist;
 module_param(disable_denylist, bool, 0444);
 MODULE_PARM_DESC(disable_denylist, "Disable use of device denylist. Disabling the denylist allows binding to devices with known errata that may lead to exploitable stability or security issues when accessed by untrusted users.");
 
+#if IS_ENABLED(CONFIG_VFIO_CXL_CORE)
+static bool disable_cxl;
+module_param(disable_cxl, bool, 0444);
+MODULE_PARM_DESC(disable_cxl, "Disable CXL Type-2 extensions for all devices bound to vfio-pci. Variant drivers may instead set vdev->disable_cxl in their probe for per-device control without needing this parameter.");
+#endif
+
 static bool vfio_pci_dev_in_denylist(struct pci_dev *pdev)
 {
 	switch (pdev->vendor) {
@@ -189,6 +195,9 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 		return PTR_ERR(vdev);
 
 	dev_set_drvdata(&pdev->dev, vdev);
+#if IS_ENABLED(CONFIG_VFIO_CXL_CORE)
+	vdev->disable_cxl = disable_cxl;
+#endif
 	vdev->pci_ops = &vfio_pci_dev_ops;
 	ret = vfio_pci_core_register_device(vdev);
 	if (ret)
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index aa159d0c8da7..48dc69df52fa 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -130,6 +130,7 @@ struct vfio_pci_core_device {
 	bool			needs_pm_restore:1;
 	bool			pm_intx_masked:1;
 	bool			pm_runtime_engaged:1;
+	bool                    disable_cxl:1;
 	struct pci_saved_state	*pci_saved_state;
 	struct pci_saved_state	*pm_save;
 	int			ioeventfds_nr;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v2 19/20] docs: vfio-pci: Document CXL Type-2 device passthrough
  2026-04-01 14:38 [PATCH v2 00/20] vfio/pci: Add CXL Type-2 device passthrough support mhonap
                   ` (17 preceding siblings ...)
  2026-04-01 14:39 ` [PATCH v2 18/20] vfio/cxl: Provide opt-out for CXL feature mhonap
@ 2026-04-01 14:39 ` mhonap
  2026-04-01 14:39 ` [PATCH v2 20/20] selftests/vfio: Add CXL Type-2 VFIO assignment test mhonap
  19 siblings, 0 replies; 27+ messages in thread
From: mhonap @ 2026-04-01 14:39 UTC (permalink / raw)
  To: alwilliamson, dan.j.williams, jonathan.cameron, dave.jiang,
	alejandro.lucero-palau, dave, alison.schofield, vishal.l.verma,
	ira.weiny, dmatlack, shuah, jgg, yishaih, skolothumtho,
	kevin.tian, ankita
  Cc: vsethi, cjia, targupta, zhiw, kjaju, linux-kselftest,
	linux-kernel, linux-cxl, kvm, mhonap

From: Manish Honap <mhonap@nvidia.com>

Add Documentation/driver-api/vfio-pci-cxl.rst describing the architecture,
VFIO interfaces, and operational constraints for CXL Type-2 (cache-coherent
accelerator) passthrough via vfio-pci-core, and link it from the driver-api
index.

The document covers:
- VFIO_DEVICE_FLAGS_CXL and VFIO_DEVICE_INFO_CAP_CXL: what the capability
  struct contains and what the FIRMWARE_COMMITTED and CACHE_CAPABLE flags mean
- How to derive hdm_decoder_offset and hdm_count from the COMP_REGS region
  by traversing the CXL Capability Array to find cap ID 0x5 and reading the
  HDM Decoder Capability register
- Topology-aware sparse mmap on the component BAR (topologies A, B, C
  covering comp block at end, start, or middle of the BAR)
- Two extra VFIO device regions: COMP_REGS for the emulated HDM register
  state and the DPA memory window
- DVSEC config write virtualization: what the guest sees vs. hardware
- FLR coordination: DPA PTEs zapped before reset, restored after

Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
 Documentation/driver-api/index.rst        |   1 +
 Documentation/driver-api/vfio-pci-cxl.rst | 382 ++++++++++++++++++++++
 2 files changed, 383 insertions(+)
 create mode 100644 Documentation/driver-api/vfio-pci-cxl.rst

diff --git a/Documentation/driver-api/index.rst b/Documentation/driver-api/index.rst
index 1833e6a0687e..7ec661846f6b 100644
--- a/Documentation/driver-api/index.rst
+++ b/Documentation/driver-api/index.rst
@@ -47,6 +47,7 @@ of interest to most developers working on device drivers.
    vfio-mediated-device
    vfio
    vfio-pci-device-specific-driver-acceptance
+   vfio-pci-cxl
 
 Bus-level documentation
 =======================
diff --git a/Documentation/driver-api/vfio-pci-cxl.rst b/Documentation/driver-api/vfio-pci-cxl.rst
new file mode 100644
index 000000000000..1256e4d33fc6
--- /dev/null
+++ b/Documentation/driver-api/vfio-pci-cxl.rst
@@ -0,0 +1,382 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=======================================
+VFIO PCI CXL Type-2 device passthrough
+=======================================
+
+Overview
+--------
+
+Type-2 CXL devices are PCIe accelerators (GPUs, compute ASICs, and similar)
+with coherent device memory on CXL.mem. Device physical address (DPA) space
+is mapped into host physical address space through HDM decoders that the
+kernel's CXL subsystem owns. A guest cannot program that hardware directly.
+
+This ``vfio-pci`` mode hands a VMM:
+
+- A read/write VFIO device region (COMP_REGS) that emulates the HDM decoder
+  register block with CXL register rules enforced in kernel code.
+- A mmappable VFIO device region (DPA) backed by the kernel-chosen host physical
+  range for device memory.
+- DVSEC config-space emulation so the guest cannot change host-owned CXL.io /
+  CXL.mem enable bits.
+
+Build with ``CONFIG_VFIO_CXL_CORE=y``. At runtime you can turn it off with::
+
+    modprobe vfio-pci disable_cxl=1
+
+or, in a variant driver, set ``vdev->disable_cxl = true`` before registration.
+
+
+Device detection
+----------------
+
+At ``vfio_pci_core_register_device()`` the driver checks for a Type-2 style
+setup. All of the following must hold:
+
+1. CXL Device DVSEC present (PCIe DVSEC Vendor ID ``0x1E98``, DVSEC ID
+   ``0x0000``).
+2. ``Mem_Capable`` (bit 2) set in the CXL Capability register inside that DVSEC.
+3. PCI class code is **not** ``0x050210`` (CXL Type-3 memory expander).
+4. An HDM Decoder capability block reachable through the Register Locator DVSEC.
+5. At least one HDM decoder committed by firmware with non-zero size.
+
+The CXL spec defines Type-2 devices as those with both ``Mem_Capable`` and
+``Cache_Capable``. This driver also accepts ``Mem_Capable``-only devices
+(``Cache_Capable=0``), which behave like Type-3-style accelerators without the
+usual class code. ``VFIO_CXL_CAP_CACHE_CAPABLE`` exposes the cache bit to
+userspace so a VMM can treat FLR differently when needed.
+
+When detection succeeds, ``VFIO_DEVICE_FLAGS_CXL`` is ORed into
+``vfio_device_info.flags`` together with ``VFIO_DEVICE_FLAGS_PCI``.
+
+.. note::
+
+   **Firmware must commit an HDM decoder before open.** The driver only
+   discovers DPA range and size from a decoder that firmware already committed.
+   Devices without that, or hot-plugged setups that never get it, are out of
+   scope for now.
+
+   Follow-up options under discussion include CXL range registers in the
+   Device DVSEC (often enough on single-decoder parts), CDAT over DOE, mailbox
+   Get Partition Info, or a future DVSEC field from the consortium for
+   base/size/NUMA without extra side channels. There is also talk of a sysfs
+   path, modeled on resizable BAR, where an orchestrator fixes the DPA window
+   before vfio-pci binds so the driver still sees a committed range.
+
+
+UAPI: VFIO_DEVICE_INFO_CAP_CXL
+------------------------------
+
+When ``VFIO_DEVICE_FLAGS_CXL`` is set, the device info capability chain
+includes a ``vfio_device_info_cap_cxl`` structure (cap ID 6, version 1)::
+
+    struct vfio_device_info_cap_cxl {
+        struct vfio_info_cap_header header; /* id=6, version=1 */
+        __u8   hdm_regs_bar_index;  /* BAR index containing component regs */
+        __u8   reserved[3];
+        __u32  flags;               /* VFIO_CXL_CAP_* flags */
+        __u64  hdm_regs_offset;     /* byte offset within the BAR to the
+                                     * CXL.mem register area start.  This
+                                     * equals comp_reg_offset + CXL_CM_OFFSET
+                                     * where CXL_CM_OFFSET = 0x1000. */
+        __u32  dpa_region_index;    /* VFIO region index for DPA memory */
+        __u32  comp_regs_region_index; /* VFIO region index for COMP_REGS */
+    };
+    /*
+     * hdm_count and hdm_decoder_offset are intentionally absent from this
+     * struct. Both are derivable from the COMP_REGS region. See the
+     * "Deriving HDM info from COMP_REGS" section below.
+     */
+
+    #define VFIO_CXL_CAP_FIRMWARE_COMMITTED  (1 << 0)
+    #define VFIO_CXL_CAP_CACHE_CAPABLE       (1 << 1)
+
+``VFIO_CXL_CAP_FIRMWARE_COMMITTED``
+    At least one HDM decoder was pre-committed by firmware. The DPA region
+    is live at device open; the VMM can map it without waiting for a guest
+    COMMIT cycle.
+
+``VFIO_CXL_CAP_CACHE_CAPABLE``
+    The device has an HDM-DB decoder (CXL.mem + CXL.cache). This mirrors the
+    ``Cache_Capable`` bit from the CXL DVSEC Capability register. The kernel
+    does not run Write-Back Invalidation (WBI) before FLR; with this flag set
+    that stays the VMM's job.
+
+DPA region size comes from ``VFIO_DEVICE_GET_REGION_INFO`` on
+``dpa_region_index``, not from this struct.
+
+
+VFIO regions
+------------
+
+A CXL device adds two device regions on top of the usual BARs. Their indices
+are in ``dpa_region_index`` and ``comp_regs_region_index``.
+
+DPA region (``VFIO_REGION_SUBTYPE_CXL``)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Flags: ``READ | WRITE | MMAP``.
+
+The backing store is the host physical range the kernel assigned for DPA. The
+kernel maps it with ``memremap(MEMREMAP_WB)`` because CXL device memory on a
+coherent link sits in the CPU cache hierarchy. That mapping is normal cached
+memory, so ``copy_to/from_user`` works without extra barriers.
+
+Page faults are lazy: PFNs are installed per page on first touch via
+``vmf_insert_pfn``. ``mmap()`` does not populate the whole region up front.
+
+Region read/write through the fd uses the same ``MEMREMAP_WB`` mapping with
+``copy_to/from_user``. ``ioread``/``iowrite`` MMIO helpers are not used on
+this path.
+
+During FLR, ``unmap_mapping_range()`` drops user PTEs and ``region_active``
+clears before the reset runs. Ongoing faults or region I/O then error instead
+of touching a dead mapping. IOMMU ATC invalidation from the zap has to finish
+before the device resets; doing it the other way around can leave an SMMU
+waiting on a device that no longer responds.
+
+After reset, the region comes back once ``COMMITTED`` shows up again in fresh
+HDM hardware state. The VMM can fault pages in again without a new ``mmap()``.
+
+COMP_REGS region (``VFIO_REGION_SUBTYPE_CXL_COMP_REGS``)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Flags: ``READ | WRITE`` (no mmap).
+
+Emulated registers for the CXL.mem slice of the component register block: the
+CXL Capability Array header at offset 0, then the HDM Decoder capability
+starting at ``hdm_decoder_offset`` (the byte offset derived by traversing the
+CXL Capability Array — see "Deriving HDM info from COMP_REGS" below).
+Region size from ``VFIO_DEVICE_GET_REGION_INFO`` covers the full capability
+array prefix plus all HDM decoder blocks.
+
+Only 32-bit, 32-bit-aligned accesses are allowed. 8- and 16-bit attempts get
+``-EINVAL``.
+
+Offsets below ``hdm_decoder_offset`` return the snapshot from device open.
+Writes there are dropped (with a WARN); the capability array stays read-only.
+
+From ``hdm_decoder_offset`` upward the kernel keeps a shadow
+(``comp_reg_virt[]``) and applies field rules:
+
+- At open, hardware HDM state is snapshotted. For firmware-committed decoders
+  the LOCK bit is cleared and BASE_HI/BASE_LO are zeroed in the shadow so the
+  VMM can program guest GPA; the host HPA is not carried in the shadow after
+  that.
+- ``COMMIT`` (bit 9 of CTRL): writing 1 sets ``COMMITTED`` (bit 10) in the
+  shadow immediately. Real hardware stays committed; the shadow tracks what
+  the guest should see.
+- When LOCK is set, writes to BASE_HI and SIZE_HI are ignored so
+  firmware-committed values survive.
+
+Region type identifiers::
+
+    /* type = PCI_VENDOR_ID_CXL | VFIO_REGION_TYPE_PCI_VENDOR_TYPE */
+    #define VFIO_REGION_SUBTYPE_CXL           1  /* DPA memory region */
+    #define VFIO_REGION_SUBTYPE_CXL_COMP_REGS 2  /* HDM register shadow */
+
+
+BAR access
+----------
+
+``VFIO_DEVICE_GET_REGION_INFO`` for ``hdm_regs_bar_index`` reports the full
+BAR size with ``READ | WRITE | MMAP`` flags and a
+``VFIO_REGION_INFO_CAP_SPARSE_MMAP`` capability listing the GPU or
+accelerator register windows — the mmappable parts of the BAR that do **not**
+contain CXL component registers.
+
+The number of sparse areas depends on where the CXL component register block
+``[comp_reg_offset, comp_reg_offset + comp_reg_size)`` sits within the BAR:
+
+* **Topology A** - component block at BAR end:
+  ``[gpu_regs | comp_regs]`` → 1 area: ``[0, comp_reg_offset)``
+
+* **Topology B** - component block at BAR start:
+  ``[comp_regs | gpu_regs]`` → 1 area: ``[comp_reg_size, bar_len)``
+
+* **Topology C** - component block in middle:
+  ``[gpu_regs | comp_regs | gpu_regs]`` → 2 areas:
+  ``[0, comp_reg_offset)`` and ``[comp_reg_offset + comp_reg_size, bar_len)``
+
+VMMs **must** iterate all ``nr_areas`` entries; do not assume a single area or
+that the first area starts at offset zero.
+
+The GPU/accelerator register windows listed in the sparse capability **are**
+physically mmappable: ``mmap()`` on the VFIO device fd at the corresponding
+BAR offset succeeds and yields a host-physical-backed mapping suitable for
+KVM stage-2 installation.
+
+The CXL component register block itself **is not** mmappable.  Any ``mmap()``
+request whose range overlaps ``[comp_reg_offset, comp_reg_offset +
+comp_reg_size)`` returns ``-EINVAL``; those registers must be accessed through
+the ``COMP_REGS`` device region.
+
+
+DVSEC configuration space emulation
+-----------------------------------
+
+With ``CONFIG_VFIO_CXL_CORE=y``, vfio-pci installs a handler for
+``PCI_EXT_CAP_ID_DVSEC`` (``0x23``) in the config access table. Non-CXL
+devices fall through as before.
+
+On CXL devices, writes to these DVSEC registers are caught and reflected in
+``vdev->vconfig`` (shadow config space):
+
++--------------------+--------+--------------------------------------------------+
+| Register           | Offset | Emulation                                        |
++====================+========+==================================================+
+| CXL Control        | +0x0c  | RWL; IO_Enable held at 1; locked when Lock       |
+|                    |        | bit 0 is set.                                    |
++--------------------+--------+--------------------------------------------------+
+| CXL Status         | +0x0e  | Bit 14 (Viral_Status) is RW1CS.                  |
++--------------------+--------+--------------------------------------------------+
+| CXL Control2       | +0x10  | Bits 1 and 2 forwarded to hardware.              |
++--------------------+--------+--------------------------------------------------+
+| CXL Status2        | +0x12  | Bit 3 forwarded when Capability3 bit 3 is set.   |
++--------------------+--------+--------------------------------------------------+
+| CXL Lock           | +0x14  | RWO; once set, Control becomes read-only until   |
+|                    |        | conventional reset.                              |
++--------------------+--------+--------------------------------------------------+
+| Range Base Hi/Lo   | varies | Stored in vconfig; Base Low [27:0] reserved bits |
+|                    |        | cleared on write.                                |
++--------------------+--------+--------------------------------------------------+
+
+Reads return the shadow. Read-only registers (Capability, Size High/Low) are
+filled from hardware at open.
+
+
+FLR and reset
+-------------
+
+FLR goes through ``vfio_pci_ioctl_reset()``. The CXL-specific part is:
+
+1. ``vfio_cxl_zap_region_locked()`` runs under the write side of
+   ``memory_lock``. It clears ``region_active`` and calls
+   ``unmap_mapping_range()`` on the DPA inode mapping so user PTEs go away.
+   Concurrent faults or fd I/O hit the inactive flag and error. IOMMU ATC must
+   drain before reset (see the DPA region notes above).
+
+2. After FLR, ``vfio_cxl_reactivate_region()`` reads HDM hardware again into
+   ``comp_reg_virt[]``. If ``COMMITTED`` is set (common when firmware left the
+   decoder committed), ``region_active`` turns back on and the VMM can refault
+   without remapping.
+
+
+Known limitations
+-----------------
+
+**Pre-committed HDM decoder required**
+    See `Device detection`_ and the note there.
+
+**CXL hot-plug not supported**
+    Slots need to be present and programmed by firmware at boot.
+
+**CXL.cache Write-Back Invalidation not implemented**
+    For HDM-DB devices (``VFIO_CXL_CAP_CACHE_CAPABLE``), the kernel does not
+    run WBI before FLR. The VMM must do it and expose Back-Invalidation in the
+    guest topology where required.
+
+
+VMM integration notes
+---------------------
+
+For a ``VFIO_CXL_CAP_FIRMWARE_COMMITTED`` device (what works today)::
+
+    /* 1. Get device info and locate the CXL cap */
+    vfio_device_get_info(fd, &dinfo);
+    assert(dinfo.flags & VFIO_DEVICE_FLAGS_CXL);
+    cxl = find_cap(&dinfo, VFIO_DEVICE_INFO_CAP_CXL);
+
+    /* 2. Get DPA and COMP_REGS region sizes */
+    get_region_info(fd, cxl->dpa_region_index, &dpa_ri);
+    get_region_info(fd, cxl->comp_regs_region_index, &comp_ri);
+
+    /* 3. Map DPA region at a guest physical address */
+    gpa_base = allocate_guest_phys(dpa_ri.size);
+    mmap(gpa_base, dpa_ri.size, PROT_READ|PROT_WRITE,
+         MAP_SHARED|MAP_FIXED, vfio_fd,
+         (off_t)cxl->dpa_region_index << VFIO_PCI_OFFSET_SHIFT);
+
+    /* 4. Derive hdm_decoder_offset from COMP_REGS (see section below) */
+    uint64_t hdm_decoder_offset = derive_hdm_offset(vfio_fd, comp_ri);
+
+    /* 5. Write guest GPA into HDM Decoder 0 BASE via COMP_REGS pwrite */
+    u32 base_hi = gpa_base >> 32;
+    comp_off = (off_t)cxl->comp_regs_region_index << VFIO_PCI_OFFSET_SHIFT;
+    pwrite(vfio_fd, &base_hi, 4,
+           comp_off + hdm_decoder_offset + CXL_HDM_DECODER0_BASE_HIGH_OFFSET);
+
+    /* 6. Build guest CXL topology using gpa_base and dpa_ri.size */
+    build_cfmws(gpa_base, dpa_ri.size);
+
+    /* 7. If CACHE_CAPABLE: issue WBI before any guest FLR */
+
+Extra detail:
+
+- DPA size is ``dpa_ri.size`` from region info.
+- ``CXL_HDM_DECODER0_BASE_HIGH_OFFSET`` lives in ``include/uapi/cxl/cxl_regs.h``.
+- On the BAR, ``mmaps[0].size`` from the sparse-mmap cap on
+  ``hdm_regs_bar_index`` splits GPU MMIO (BAR fd) from the CXL block (COMP_REGS
+  region).
+- If ``VFIO_CXL_CAP_CACHE_CAPABLE`` is set, the guest CXL topology should
+  advertise Back-Invalidation and the VMM should run WBI before FLR.
+
+
+Deriving HDM info from COMP_REGS
+---------------------------------
+
+``hdm_decoder_offset`` and ``hdm_count`` are not in ``vfio_device_info_cap_cxl``
+because both are directly readable from the ``COMP_REGS`` region.
+
+**Finding hdm_decoder_offset:**
+
+Read dwords from the COMP_REGS region starting at offset 0 (the CXL Capability
+Array).  ``comp_off`` is the VFIO file offset for the COMP_REGS region:
+``(off_t)cxl->comp_regs_region_index << VFIO_PCI_OFFSET_SHIFT``::
+
+    /* Dword 0: CXL Capability Array Header */
+    pread(fd, &hdr, 4, comp_off + 0);
+    /* bits[15:0] must be 1 (CM_CAP_HDR_CAP_ID) */
+    /* bits[31:24] = number of capability entries */
+    num_caps = (hdr >> 24) & 0xff;  /* CXL_CM_CAP_HDR_ARRAY_SIZE_MASK */
+
+    /* Walk entries at dword 1..num_caps */
+    for (i = 1; i <= num_caps; i++) {
+        pread(fd, &entry, 4, comp_off + i * 4);
+        cap_id = entry & 0xffff;           /* CXL_CM_CAP_HDR_ID_MASK */
+        if (cap_id == 0x5) {               /* CXL_CM_CAP_CAP_ID_HDM */
+            hdm_decoder_offset = (entry >> 20) & 0xfff; /* CXL_CM_CAP_PTR_MASK */
+            break;
+        }
+    }
+
+**Finding hdm_count:**
+
+Read the HDM Decoder Capability register (HDMC) at ``hdm_decoder_offset + 0``::
+
+    pread(fd, &hdmc, 4, comp_off + hdm_decoder_offset);
+    field = hdmc & 0xf;  /* CXL_HDM_DECODER_COUNT_MASK bits[3:0] */
+    hdm_count = field ? field * 2 : 1;  /* 0→1, N→N*2 decoders */
+
+All constants are in ``include/uapi/cxl/cxl_regs.h``.
+
+
+Kernel configuration
+--------------------
+
+``CONFIG_VFIO_CXL_CORE`` (bool)
+    Enables CXL Type-2 passthrough support in ``vfio-pci-core``. Requires
+    ``CONFIG_VFIO_PCI_CORE``, ``CONFIG_CXL_BUS``, and ``CONFIG_CXL_MEM``.
+
+References
+----------
+
+* CXL Specification 4.0, 8.1.3 - PCIe DVSEC for CXL Devices
+* CXL Specification 4.0, 8.2.4.20 - CXL HDM Decoder Capability Structure
+* ``include/uapi/linux/vfio.h`` - ``VFIO_DEVICE_INFO_CAP_CXL``,
+  ``VFIO_REGION_SUBTYPE_CXL``, ``VFIO_REGION_SUBTYPE_CXL_COMP_REGS``
+* ``include/uapi/cxl/cxl_regs.h`` - ``CXL_CM_OFFSET``,
+  ``CXL_CM_CAP_HDR_ARRAY_SIZE_MASK``, ``CXL_CM_CAP_HDR_ID_MASK``,
+  ``CXL_CM_CAP_PTR_MASK``, ``CXL_HDM_DECODER_COUNT_MASK``,
+  ``CXL_HDM_DECODER0_BASE_HIGH_OFFSET``
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v2 20/20] selftests/vfio: Add CXL Type-2 VFIO assignment test
  2026-04-01 14:38 [PATCH v2 00/20] vfio/pci: Add CXL Type-2 device passthrough support mhonap
                   ` (18 preceding siblings ...)
  2026-04-01 14:39 ` [PATCH v2 19/20] docs: vfio-pci: Document CXL Type-2 device passthrough mhonap
@ 2026-04-01 14:39 ` mhonap
  19 siblings, 0 replies; 27+ messages in thread
From: mhonap @ 2026-04-01 14:39 UTC (permalink / raw)
  To: alwilliamson, dan.j.williams, jonathan.cameron, dave.jiang,
	alejandro.lucero-palau, dave, alison.schofield, vishal.l.verma,
	ira.weiny, dmatlack, shuah, jgg, yishaih, skolothumtho,
	kevin.tian, ankita
  Cc: vsethi, cjia, targupta, zhiw, kjaju, linux-kselftest,
	linux-kernel, linux-cxl, kvm, mhonap

From: Manish Honap <mhonap@nvidia.com>

Add vfio_cxl_type2_test and build it from the vfio selftest Makefile. The
binary expects a PCI BDF (as argv[1] or via the VFIO_SELFTESTS_BDF
environment variable) with the device already bound to vfio-pci and
CONFIG_VFIO_CXL_CORE enabled.

It exercises:
- VFIO_DEVICE_GET_INFO,
- GET_REGION_INFO,
- VFIO_DEVICE_INFO_CAP_CXL capability list,
- sparse component-BAR vs DPA/COMP_REG regions,
- HDM decoder emulation (masks, commit, lock),
- DVSEC-backed config where the driver exposes it.

Large region read/write loops and FLR-heavy test cases are still
pending and will be revisited in the next version of this series.

vfio_pci_device_setup() skips auto-mmap for BARs that carry
sparse-mmap capabilities; those require the caller to mmap only the
windows advertised by the capability.

Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
 tools/testing/selftests/vfio/Makefile         |   1 +
 .../selftests/vfio/lib/vfio_pci_device.c      |   3 +-
 .../selftests/vfio/vfio_cxl_type2_test.c      | 920 ++++++++++++++++++
 3 files changed, 923 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/vfio/vfio_cxl_type2_test.c

diff --git a/tools/testing/selftests/vfio/Makefile b/tools/testing/selftests/vfio/Makefile
index 3c796ca99a50..2cac98302609 100644
--- a/tools/testing/selftests/vfio/Makefile
+++ b/tools/testing/selftests/vfio/Makefile
@@ -4,6 +4,7 @@ TEST_GEN_PROGS += vfio_iommufd_setup_test
 TEST_GEN_PROGS += vfio_pci_device_test
 TEST_GEN_PROGS += vfio_pci_device_init_perf_test
 TEST_GEN_PROGS += vfio_pci_driver_test
+TEST_GEN_PROGS += vfio_cxl_type2_test
 
 TEST_FILES += scripts/cleanup.sh
 TEST_FILES += scripts/lib.sh
diff --git a/tools/testing/selftests/vfio/lib/vfio_pci_device.c b/tools/testing/selftests/vfio/lib/vfio_pci_device.c
index fac4c0ecadef..98832acc31ac 100644
--- a/tools/testing/selftests/vfio/lib/vfio_pci_device.c
+++ b/tools/testing/selftests/vfio/lib/vfio_pci_device.c
@@ -257,7 +257,8 @@ static void vfio_pci_device_setup(struct vfio_pci_device *device)
 		struct vfio_pci_bar *bar = device->bars + i;
 
 		vfio_pci_region_get(device, i, &bar->info);
-		if (bar->info.flags & VFIO_REGION_INFO_FLAG_MMAP)
+		if ((bar->info.flags & VFIO_REGION_INFO_FLAG_MMAP) &&
+		    !(bar->info.flags & VFIO_REGION_INFO_FLAG_CAPS))
 			vfio_pci_bar_map(device, i);
 	}
 
diff --git a/tools/testing/selftests/vfio/vfio_cxl_type2_test.c b/tools/testing/selftests/vfio/vfio_cxl_type2_test.c
new file mode 100644
index 000000000000..272412a7b22f
--- /dev/null
+++ b/tools/testing/selftests/vfio/vfio_cxl_type2_test.c
@@ -0,0 +1,920 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * vfio_cxl_type2_test - selftests for CXL Type-2 device passthrough via vfio-pci
+ *
+ * Tests the UAPI and emulation layer introduced by CONFIG_VFIO_CXL_CORE
+ *
+ * Usage:
+ *   ./vfio_cxl_type2_test <BDF>
+ * or set the environment variable VFIO_SELFTESTS_BDF before running.
+ *
+ * The device must be a CXL Type-2 device (e.g. a GPU with coherent memory).
+ * Tests adapt automatically to firmware-committed (COMMITTED/COMMIT_LOCK set)
+ * and CONFIG_LOCK-set hardware states instead of skipping.
+ *
+ * Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved
+ */
+
+#include <fcntl.h>
+#include <stdint.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+
+#include <linux/pci_regs.h>
+#include <cxl/cxl_regs.h>
+#include <linux/sizes.h>
+#include <linux/vfio.h>
+
+#include <libvfio.h>
+
+#include "kselftest_harness.h"
+
+/* Userspace equivalents of kernel helpers not available in user headers */
+#ifndef BIT
+#define BIT(n)			(1u << (n))
+#endif
+#ifndef GENMASK
+#define GENMASK(h, l)		(((~0u) >> (31 - (h))) & ((~0u) << (l)))
+#endif
+#define VFIO_PCI_INDEX_TO_OFFSET(idx)	((uint64_t)(idx) << 40)
+
+static const char *device_bdf;
+
+/* ------------------------------------------------------------------ */
+/* CXL UAPI constants (mirrors include/uapi/linux/vfio.h)             */
+/* ------------------------------------------------------------------ */
+
+#define VFIO_DEVICE_INFO_CAP_CXL	6
+
+#define PCI_VENDOR_ID_CXL		0x1e98
+
+#ifndef VFIO_REGION_SUBTYPE_CXL
+#define VFIO_REGION_SUBTYPE_CXL		1
+#endif
+#ifndef VFIO_REGION_SUBTYPE_CXL_COMP_REGS
+#define VFIO_REGION_SUBTYPE_CXL_COMP_REGS 2
+#endif
+
+/*
+ * HDM Decoder register layout within the component register block.
+ * Offsets relative to the start of the HDM decoder capability block.
+ * The HDM decoder block begins at hdm_decoder_offset within the COMP_REGS
+ * region; add hdm_decoder_offset before indexing into the region.
+ */
+#define HDM_GLOBAL_CTRL_OFFSET		0x04
+#define HDM_DECODER_FIRST_OFFSET	0x10
+#define HDM_DECODER_STRIDE		0x20
+#define HDM_DECODER_BASE_LO		0x00
+#define HDM_DECODER_BASE_HI		0x04
+#define HDM_DECODER_SIZE_LO		0x08
+#define HDM_DECODER_SIZE_HI		0x0c
+#define HDM_DECODER_CTRL		0x10
+
+#define HDM_CTRL_COMMIT			BIT(9)
+#define HDM_CTRL_COMMITTED		BIT(10)
+#define HDM_CTRL_RESERVED_MASK		(BIT(15) | GENMASK(31, 28))
+
+#define CXL_LOCK_RESERVED_MASK GENMASK(15, 1)
+
+/* ------------------------------------------------------------------ */
+/* Helpers                                                            */
+/* ------------------------------------------------------------------ */
+
+/*
+ * Walk the vfio_device_info capability chain embedded in @buf.
+ * Returns a pointer to the capability with the given @id, or NULL.
+ */
+static const struct vfio_info_cap_header *
+find_device_cap(const void *buf, size_t bufsz, uint16_t id)
+{
+	const struct vfio_device_info *info = buf;
+	const struct vfio_info_cap_header *cap;
+
+	if (!(info->flags & VFIO_DEVICE_FLAGS_CAPS) || !info->cap_offset)
+		return NULL;
+
+	cap = (const struct vfio_info_cap_header *)
+		((const char *)buf + info->cap_offset);
+
+	while ((const char *)cap + sizeof(*cap) <=
+	       (const char *)buf + bufsz) {
+		if (cap->id == id)
+			return cap;
+		if (!cap->next)
+			return NULL;
+		cap = (const struct vfio_info_cap_header *)
+			((const char *)buf + cap->next);
+	}
+	return NULL;
+}
+
+/*
+ * Walk the vfio_region_info capability chain embedded in @buf.
+ * Returns a pointer to the capability with the given @id, or NULL.
+ * @buf must have been obtained from VFIO_DEVICE_GET_REGION_INFO with
+ * argsz large enough to hold the full capability chain.
+ */
+static const struct vfio_info_cap_header *
+find_region_cap(const void *buf, size_t bufsz, uint16_t id)
+{
+	const struct vfio_region_info *info = buf;
+	const struct vfio_info_cap_header *cap;
+
+	if (!(info->flags & VFIO_REGION_INFO_FLAG_CAPS) || !info->cap_offset)
+		return NULL;
+
+	cap = (const struct vfio_info_cap_header *)
+		((const char *)buf + info->cap_offset);
+
+	while ((const char *)cap + sizeof(*cap) <=
+	       (const char *)buf + bufsz) {
+		if (cap->id == id)
+			return cap;
+		if (!cap->next)
+			return NULL;
+		cap = (const struct vfio_info_cap_header *)
+			((const char *)buf + cap->next);
+	}
+	return NULL;
+}
+
+/*
+ * Read a 32-bit value from the COMP_REGS region at @offset (HDM-relative).
+ */
+static uint32_t comp_regs_read32(struct vfio_pci_device *dev,
+				 uint32_t region_idx, uint64_t offset)
+{
+	uint32_t val;
+	loff_t pos = (loff_t)VFIO_PCI_INDEX_TO_OFFSET(region_idx) + offset;
+	ssize_t r;
+
+	r = pread(dev->fd, &val, sizeof(val), pos);
+	if (r != sizeof(val))
+		return ~0u;
+	return val;
+}
+
+/*
+ * Write a 32-bit value to the COMP_REGS region at @offset.
+ * Returns the raw pwrite() result; callers assert it equals sizeof(val).
+ */
+static ssize_t comp_regs_write32(struct vfio_pci_device *dev,
+				 uint32_t region_idx, uint64_t offset,
+				 uint32_t val)
+{
+	loff_t pos = (loff_t)VFIO_PCI_INDEX_TO_OFFSET(region_idx) + offset;
+
+	return pwrite(dev->fd, &val, sizeof(val), pos);
+}
+
+/*
+ * HDM register accessors.
+ *
+ * The COMP_REGS region starts at the CXL component register block
+ * start (comp_reg_offset).  The HDM decoder capability block begins at
+ * hdm_decoder_offset within this region. These helpers add
+ * hdm_decoder_offset so that callers can continue to use the HDM-relative
+ * offsets defined by the macros above.
+ */
+static uint32_t hdm_regs_read32(struct vfio_pci_device *dev,
+				uint32_t region_idx,
+				uint64_t hdm_decoder_offset,
+				uint64_t hdm_off)
+{
+	return comp_regs_read32(dev, region_idx, hdm_decoder_offset + hdm_off);
+}
+
+static ssize_t hdm_regs_write32(struct vfio_pci_device *dev,
+				uint32_t region_idx,
+				uint64_t hdm_decoder_offset,
+				uint64_t hdm_off,
+				uint32_t val)
+{
+	return comp_regs_write32(dev, region_idx, hdm_decoder_offset + hdm_off, val);
+}
+
+/*
+ * Traverse the CXL Capability Array at COMP_REGS region offset 0 to find the
+ * HDM Decoder capability block offset and decoder count.
+ *
+ * COMP_REGS region layout at offset 0 (CXL Capability Array):
+ *   Dword 0 bits[31:24] (CXL_CM_CAP_HDR_ARRAY_SIZE_MASK): entry count N.
+ *   Dwords 1..N at offset (cap*4): bits[15:0] = cap ID (CXL_CM_CAP_HDR_ID_MASK),
+ *   bits[31:20] = byte offset from COMP_REGS start (CXL_CM_CAP_PTR_MASK).
+ *
+ * HDM Decoder cap ID = 0x5 (CXL_CM_CAP_CAP_ID_HDM).
+ * HDMC at hdm_decoder_offset+0 bits[3:0]: count = (field==0) ? 1 : field*2.
+ *
+ * Returns true on success; sets *hdm_off and *hdm_cnt.
+ */
+static bool find_hdm_decoder_info(struct vfio_pci_device *dev,
+				  uint32_t comp_regs_idx,
+				  uint64_t *hdm_off, uint8_t *hdm_cnt)
+{
+	uint32_t hdr, num_caps, i;
+
+	/* Read CXL Capability Array Header (dword 0) */
+	hdr = comp_regs_read32(dev, comp_regs_idx, 0);
+	if (hdr == ~0u)
+		return false;
+
+	/* Validate: bits[15:0] must be CM_CAP_HDR_CAP_ID (1) */
+	if ((hdr & 0xffff) != 1)
+		return false;
+
+	/* bits[31:24] = number of capability entries */
+	num_caps = (hdr >> 24) & 0xff;
+
+	for (i = 1; i <= num_caps; i++) {
+		uint32_t entry = comp_regs_read32(dev, comp_regs_idx, i * 4);
+		uint32_t cap_id = entry & 0xffff; /* CXL_CM_CAP_HDR_ID_MASK */
+
+		if (cap_id == 0x5) { /* CXL_CM_CAP_CAP_ID_HDM */
+			uint32_t hdmc;
+			uint32_t field;
+
+			/* bits[31:20]: byte offset from COMP_REGS start */
+			*hdm_off = (entry >> 20) & 0xfff;
+
+			/* Read HDMC register at hdm_decoder_offset + 0 */
+			hdmc = comp_regs_read32(dev, comp_regs_idx, *hdm_off);
+			if (hdmc == ~0u)
+				return false;
+
+			/* bits[3:0]: 0 = 1 decoder, N = N*2 decoders */
+			field = hdmc & 0xf;
+			*hdm_cnt = field ? (uint8_t)(field * 2) : 1;
+			return true;
+		}
+	}
+	return false;
+}
+
+/*
+ * Find the CXL DVSEC capability base in config space.
+ */
+#define PCI_DVSEC_VENDOR_ID_CXL	0x1e98
+#define PCI_DVSEC_ID_CXL_DEVICE	0x0000
+#define PCI_EXT_CAP_ID_DVSEC	0x23
+
+static uint16_t find_cxl_dvsec(struct vfio_pci_device *dev)
+{
+	uint16_t pos = PCI_CFG_SPACE_SIZE; /* 0x100 */
+	int iter = 0;
+
+	while (pos && iter++ < 64) {
+		uint32_t hdr  = vfio_pci_config_readl(dev, pos);
+		uint32_t hdr1, hdr2;
+		uint16_t cap_id	 = hdr & 0xffff;
+		uint16_t next	 = (hdr >> 20) & 0xffc;
+
+		if (cap_id == PCI_EXT_CAP_ID_DVSEC) {
+			hdr1 = vfio_pci_config_readl(dev, pos + 4);
+			hdr2 = vfio_pci_config_readl(dev, pos + 8);
+			/*
+			 * PCIe DVSEC Header 1 layout (Table 9-16):
+			 *   Bits [15: 0] = DVSEC Vendor ID
+			 *   Bits [19:16] = DVSEC Revision
+			 *   Bits [31:20] = DVSEC Length
+			 * DVSEC Header 2 layout:
+			 *   Bits [15: 0] = DVSEC ID
+			 */
+			if ((hdr1 & 0xffff) == PCI_DVSEC_VENDOR_ID_CXL &&
+			    (hdr2 & 0xffff) == PCI_DVSEC_ID_CXL_DEVICE)
+				return pos;
+		}
+		pos = next;
+	}
+	return 0;
+}
+
+/* ------------------------------------------------------------------ */
+/* Fixture                                                            */
+/* ------------------------------------------------------------------ */
+
+FIXTURE(cxl_type2) {
+	struct iommu *iommu;
+	struct vfio_pci_device *dev;
+
+	/* Filled in during FIXTURE_SETUP from the CXL cap */
+	struct vfio_device_info_cap_cxl cxl_cap;
+	uint16_t dvsec_base;
+
+	/*
+	 * Sizes derived from VFIO_DEVICE_GET_REGION_INFO at setup time.
+	 * These are not in the CXL cap struct; query the region directly.
+	 */
+	uint64_t dpa_size;       /* size of the DPA region */
+	uint64_t hdm_regs_size;  /* size of the COMP_REGS region */
+
+	/*
+	 * HDM decoder info derived from the COMP_REGS region at setup time.
+	 * hdm_count and hdm_decoder_offset are no longer in the UAPI cap struct;
+	 * they are derived by traversing the CXL Capability Array and reading
+	 * the HDM Decoder Capability register (HDMC).
+	 */
+	uint64_t hdm_decoder_offset; /* byte offset in COMP_REGS to HDM block */
+	uint8_t  hdm_count;          /* number of HDM decoders */
+
+	/* DPA mmap pointer (may be NULL if test skips mmap sub-tests) */
+	void *dpa_mmap;
+	size_t dpa_mmap_size;
+};
+
+FIXTURE_SETUP(cxl_type2)
+{
+	uint8_t infobuf[512] = {};
+	struct vfio_device_info *info = (void *)infobuf;
+	const struct vfio_device_info_cap_cxl *cap;
+
+	self->iommu = iommu_init(default_iommu_mode);
+	self->dev   = vfio_pci_device_init(device_bdf, self->iommu);
+
+	/* Query device info with space for capability chain */
+	info->argsz = sizeof(infobuf);
+	ASSERT_EQ(0, ioctl(self->dev->fd, VFIO_DEVICE_GET_INFO, info));
+
+	if (!(info->flags & VFIO_DEVICE_FLAGS_CXL)) {
+		printf("Device %s is not a CXL Type-2 device; skipping\n",
+		       device_bdf);
+		SKIP(return, "not a CXL Type-2 device");
+	}
+
+	cap = (const struct vfio_device_info_cap_cxl *)
+		find_device_cap(infobuf, sizeof(infobuf),
+				VFIO_DEVICE_INFO_CAP_CXL);
+	ASSERT_NE(NULL, cap);
+	memcpy(&self->cxl_cap, cap, sizeof(*cap));
+
+	/*
+	 * Populate dpa_size and hdm_regs_size from region queries.
+	 */
+	{
+		struct vfio_region_info ri = { .argsz = sizeof(ri) };
+
+		ri.index = cap->dpa_region_index;
+		ASSERT_EQ(0, ioctl(self->dev->fd,
+				   VFIO_DEVICE_GET_REGION_INFO, &ri));
+		self->dpa_size = ri.size;
+
+		ri.index = cap->comp_regs_region_index;
+		ASSERT_EQ(0, ioctl(self->dev->fd,
+				   VFIO_DEVICE_GET_REGION_INFO, &ri));
+		self->hdm_regs_size = ri.size;
+	}
+
+	/*
+	 * Derive hdm_decoder_offset and hdm_count from the COMP_REGS region.
+	 * These fields were removed from vfio_device_info_cap_cxl to keep the
+	 * UAPI minimal; userspace derives them via the CXL Capability Array.
+	 */
+	ASSERT_TRUE(find_hdm_decoder_info(self->dev,
+					  cap->comp_regs_region_index,
+					  &self->hdm_decoder_offset,
+					  &self->hdm_count));
+
+	self->dvsec_base    = find_cxl_dvsec(self->dev);
+	self->dpa_mmap      = MAP_FAILED;
+	self->dpa_mmap_size = 0;
+}
+
+FIXTURE_TEARDOWN(cxl_type2)
+{
+	if (self->dpa_mmap != MAP_FAILED && self->dpa_mmap_size)
+		munmap(self->dpa_mmap, self->dpa_mmap_size);
+	vfio_pci_device_cleanup(self->dev);
+	iommu_cleanup(self->iommu);
+}
+
+/* ------------------------------------------------------------------ */
+/* Tests: VFIO_DEVICE_GET_INFO                                        */
+/* ------------------------------------------------------------------ */
+
+/*
+ * CXL and PCI flags must both be set; CAPS must be set since we have a cap.
+ */
+TEST_F(cxl_type2, device_flags)
+{
+	uint8_t infobuf[512] = {};
+	struct vfio_device_info *info = (void *)infobuf;
+
+	info->argsz = sizeof(infobuf);
+	ASSERT_EQ(0, ioctl(self->dev->fd, VFIO_DEVICE_GET_INFO, info));
+
+	ASSERT_TRUE(info->flags & VFIO_DEVICE_FLAGS_CXL);
+	ASSERT_TRUE(info->flags & VFIO_DEVICE_FLAGS_PCI);
+	ASSERT_TRUE(info->flags & VFIO_DEVICE_FLAGS_CAPS);
+
+	printf("device flags: 0x%x  num_regions: %u\n",
+	       info->flags, info->num_regions);
+}
+
+/*
+ * The CXL capability must report sane HDM and DPA values.
+ * hdm_count and hdm_decoder_offset are no longer in the cap struct; they
+ * are derived from the COMP_REGS region in FIXTURE_SETUP and stored in
+ * self->hdm_count and self->hdm_decoder_offset.
+ */
+TEST_F(cxl_type2, cxl_cap_fields)
+{
+	const struct vfio_device_info_cap_cxl *c = &self->cxl_cap;
+
+	ASSERT_EQ(VFIO_DEVICE_INFO_CAP_CXL, c->header.id);
+	ASSERT_EQ(1, c->header.version);
+
+	/* Must have at least one HDM decoder (derived from HDMC bits[3:0]) */
+	ASSERT_GT(self->hdm_count, 0);
+
+	/* COMP_REGS region size must be non-zero and 4-byte aligned */
+	ASSERT_GT(self->hdm_regs_size, 0ULL);
+	ASSERT_EQ(0ULL, self->hdm_regs_size % 4);
+
+	/*
+	 * hdm_decoder_offset is derived from the CXL Capability Array.
+	 * It must be:
+	 *   - non-zero (the CXL Capability Array Header precedes the HDM block)
+	 *   - dword-aligned
+	 *   - strictly less than hdm_regs_size (HDM block fits in the region)
+	 */
+	ASSERT_GT(self->hdm_decoder_offset, 0ULL);
+	ASSERT_EQ(0ULL, self->hdm_decoder_offset % 4);
+	ASSERT_LT(self->hdm_decoder_offset, self->hdm_regs_size);
+
+	/* Region indices must not be ~0U (sentinel for "not found") */
+	ASSERT_NE(~0U, c->dpa_region_index);
+	ASSERT_NE(~0U, c->comp_regs_region_index);
+
+	/* The two regions must be distinct */
+	ASSERT_NE(c->dpa_region_index, c->comp_regs_region_index);
+
+	/*
+	 * FIRMWARE_COMMITTED: decoder was pre-programmed by firmware; DPA
+	 * region is immediately live.  dpa_size must be non-zero in this case.
+	 */
+	if (c->flags & VFIO_CXL_CAP_FIRMWARE_COMMITTED)
+		ASSERT_GT(self->dpa_size, 0ULL);
+
+	printf("hdm_count=%u dpa_size=0x%llx hdm_regs_size=0x%llx "
+	       "hdm_decoder_offset=0x%llx "
+	       "dpa_idx=%u comp_regs_idx=%u flags=0x%x "
+	       "(firmware_committed=%d cache_capable=%d)\n",
+	       self->hdm_count, (unsigned long long)self->dpa_size,
+	       (unsigned long long)self->hdm_regs_size,
+	       (unsigned long long)self->hdm_decoder_offset,
+	       c->dpa_region_index, c->comp_regs_region_index, c->flags,
+	       !!(c->flags & VFIO_CXL_CAP_FIRMWARE_COMMITTED),
+	       !!(c->flags & VFIO_CXL_CAP_CACHE_CAPABLE));
+}
+
+/* ------------------------------------------------------------------ */
+/* Tests: VFIO_DEVICE_GET_REGION_INFO                                 */
+/* ------------------------------------------------------------------ */
+
+/*
+ * The component register BAR must report its real (non-zero) size with
+ * READ/WRITE/MMAP flags and a VFIO_REGION_INFO_CAP_SPARSE_MMAP capability.
+ * The sparse areas advertise the GPU/accelerator register windows — the
+ * mmappable parts of the BAR that do NOT contain CXL component registers.
+ *
+ * Three topologies are possible depending on where comp_regs sits in the BAR:
+ *   Topology A [gpu_regs | comp_regs]      → 1 area: [0, comp_reg_offset)
+ *   Topology B [comp_regs | gpu_regs]      → 1 area: [comp_end, bar_len)
+ *   Topology C [gpu_regs | comp_regs | gpu_regs] → 2 areas
+ *
+ * In all cases each sparse area is a GPU register window; no area may overlap
+ * the CXL component register block at [comp_reg_offset, comp_reg_offset +
+ * comp_reg_size).
+ */
+TEST_F(cxl_type2, component_bar_sparse_mmap)
+{
+	struct vfio_region_info probe = { .argsz = sizeof(probe) };
+	struct vfio_region_info *reg;
+	const struct vfio_region_info_cap_sparse_mmap *sparse;
+	uint32_t bar_idx = self->cxl_cap.hdm_regs_bar_index;
+	uint64_t comp_reg_offset;
+	uint64_t total_gpu_size;
+	uint8_t *buf;
+	uint32_t needed;
+	uint32_t i;
+
+	/* First probe: learn required buffer size and basic flags */
+	probe.index = bar_idx;
+	ASSERT_EQ(0, ioctl(self->dev->fd, VFIO_DEVICE_GET_REGION_INFO, &probe));
+
+	ASSERT_GT(probe.size, 0ULL);
+	ASSERT_TRUE(probe.flags & VFIO_REGION_INFO_FLAG_READ);
+	ASSERT_TRUE(probe.flags & VFIO_REGION_INFO_FLAG_WRITE);
+	ASSERT_TRUE(probe.flags & VFIO_REGION_INFO_FLAG_MMAP);
+
+	/* Kernel must signal caps are present by expanding argsz */
+	ASSERT_GT(probe.argsz, (uint32_t)sizeof(probe));
+	needed = probe.argsz;
+
+	buf = calloc(1, needed);
+	ASSERT_NE(NULL, buf);
+	reg = (struct vfio_region_info *)buf;
+	reg->argsz = needed;
+	reg->index = bar_idx;
+	ASSERT_EQ(0, ioctl(self->dev->fd, VFIO_DEVICE_GET_REGION_INFO, reg));
+
+	/* Must carry a sparse-mmap cap */
+	sparse = (const struct vfio_region_info_cap_sparse_mmap *)
+		find_region_cap(buf, needed, VFIO_REGION_INFO_CAP_SPARSE_MMAP);
+	ASSERT_NE(NULL, sparse);
+
+	/* 1 area (topology A or B) or 2 areas (topology C); never more */
+	ASSERT_GE(sparse->nr_areas, 1U);
+	ASSERT_LE(sparse->nr_areas, 2U);
+
+	/*
+	 * comp_reg_offset = hdm_regs_offset - CXL_CM_OFFSET.
+	 * hdm_regs_offset is the BAR-relative address of the CXL.mem area
+	 * start, which sits CXL_CM_OFFSET (0x1000) bytes into the component
+	 * register block.
+	 */
+	ASSERT_GE(self->cxl_cap.hdm_regs_offset, (uint64_t)CXL_CM_OFFSET);
+	comp_reg_offset = self->cxl_cap.hdm_regs_offset - CXL_CM_OFFSET;
+
+	total_gpu_size = 0;
+	for (i = 0; i < sparse->nr_areas; i++) {
+		uint64_t area_start = sparse->areas[i].offset;
+		uint64_t area_end   = area_start + sparse->areas[i].size;
+
+		/* Each area must be non-empty and fit within the BAR */
+		ASSERT_GT(sparse->areas[i].size, 0ULL);
+		ASSERT_LE(area_end, reg->size);
+
+		/*
+		 * No sparse area may overlap the CXL component register block.
+		 * Use hdm_regs_offset as a witness point: it is comp_reg_offset
+		 * + CXL_CM_OFFSET, guaranteed inside the block.
+		 */
+		ASSERT_FALSE(area_start <= self->cxl_cap.hdm_regs_offset &&
+			     self->cxl_cap.hdm_regs_offset < area_end);
+
+		total_gpu_size += sparse->areas[i].size;
+
+		printf("  sparse area[%u]: offset=0x%llx size=0x%llx\n", i,
+		       (unsigned long long)area_start,
+		       (unsigned long long)sparse->areas[i].size);
+	}
+
+	/* GPU windows together must be strictly smaller than the full BAR */
+	ASSERT_LT(total_gpu_size, reg->size);
+
+	printf("component BAR %u: bar_size=0x%llx comp_reg_offset=0x%llx "
+	       "nr_areas=%u total_gpu=0x%llx flags=0x%x\n",
+	       bar_idx, (unsigned long long)reg->size,
+	       (unsigned long long)comp_reg_offset,
+	       sparse->nr_areas, (unsigned long long)total_gpu_size,
+	       reg->flags);
+
+	free(buf);
+}
+
+/*
+ * DPA region must be readable, writable, and mmappable.
+ * Its size must be non-zero (verified in fixture setup via self->dpa_size).
+ */
+TEST_F(cxl_type2, dpa_region_info)
+{
+	struct vfio_region_info reg = { .argsz = sizeof(reg) };
+
+	reg.index = self->cxl_cap.dpa_region_index;
+	ASSERT_EQ(0, ioctl(self->dev->fd, VFIO_DEVICE_GET_REGION_INFO, &reg));
+
+	ASSERT_EQ(self->dpa_size, reg.size);
+	ASSERT_TRUE(reg.flags & VFIO_REGION_INFO_FLAG_READ);
+	ASSERT_TRUE(reg.flags & VFIO_REGION_INFO_FLAG_WRITE);
+	ASSERT_TRUE(reg.flags & VFIO_REGION_INFO_FLAG_MMAP);
+
+	printf("DPA region: size=0x%llx offset=0x%llx flags=0x%x\n",
+	       (unsigned long long)reg.size,
+	       (unsigned long long)reg.offset, reg.flags);
+}
+
+/*
+ * COMP_REGS region must be readable and writable but not mmappable.
+ * Its size covers [comp_reg_offset, comp_reg_offset + hdm_regs_size), which
+ * includes both the CXL Capability Array prefix (hdm_decoder_offset bytes)
+ * and the HDM decoder block. Size is available in self->hdm_regs_size
+ * (populated from this same region query at fixture setup time).
+ */
+TEST_F(cxl_type2, comp_regs_region_info)
+{
+	struct vfio_region_info reg = { .argsz = sizeof(reg) };
+
+	reg.index = self->cxl_cap.comp_regs_region_index;
+	ASSERT_EQ(0, ioctl(self->dev->fd, VFIO_DEVICE_GET_REGION_INFO, &reg));
+
+	ASSERT_EQ(self->hdm_regs_size, reg.size);
+	ASSERT_TRUE(reg.flags  & VFIO_REGION_INFO_FLAG_READ);
+	ASSERT_TRUE(reg.flags  & VFIO_REGION_INFO_FLAG_WRITE);
+	ASSERT_FALSE(reg.flags & VFIO_REGION_INFO_FLAG_MMAP);
+
+	printf("COMP_REGS region: size=0x%llx offset=0x%llx flags=0x%x\n",
+	       (unsigned long long)reg.size,
+	       (unsigned long long)reg.offset, reg.flags);
+}
+
+/* ------------------------------------------------------------------ */
+/* Tests: DPA region mmap                                             */
+/* ------------------------------------------------------------------ */
+
+/*
+ * mmap() the DPA region and verify the first page can be read.
+ * The region uses lazy fault insertion so the first access triggers the
+ * vfio_cxl_region_page_fault path.
+ */
+TEST_F(cxl_type2, dpa_mmap_fault)
+{
+	struct vfio_region_info reg = { .argsz = sizeof(reg) };
+	size_t map_size;
+	void *ptr;
+	uint8_t *p;
+	uint8_t val;
+
+	reg.index = self->cxl_cap.dpa_region_index;
+	ASSERT_EQ(0, ioctl(self->dev->fd, VFIO_DEVICE_GET_REGION_INFO, &reg));
+
+	/* Map just the first 2MB or the full region, whichever is smaller */
+	map_size = (size_t)reg.size < (size_t)(2 * SZ_1M)
+		 ? (size_t)reg.size : (size_t)(2 * SZ_1M);
+
+	ptr = mmap(NULL, map_size, PROT_READ | PROT_WRITE,
+		   MAP_SHARED, self->dev->fd, (off_t)reg.offset);
+	ASSERT_NE(MAP_FAILED, ptr);
+
+	self->dpa_mmap = ptr;
+	self->dpa_mmap_size = map_size;
+
+	/* First access faults in the page via vmf_insert_pfn() */
+	p = (uint8_t *)ptr;
+	val = *p;
+
+	printf("DPA mmap: ptr=%p size=0x%zx first byte=0x%02x\n",
+	       ptr, map_size, (uint8_t)val);
+
+	/* Write a pattern and read it back */
+	*p = 0xab;
+	ASSERT_EQ(0xab, *p);
+}
+
+/*
+ * mmap() of the COMP_REGS region (no MMAP flag) must fail.
+ */
+TEST_F(cxl_type2, comp_regs_no_mmap)
+{
+	struct vfio_region_info reg = { .argsz = sizeof(reg) };
+	void *ptr;
+
+	reg.index = self->cxl_cap.comp_regs_region_index;
+	ASSERT_EQ(0, ioctl(self->dev->fd, VFIO_DEVICE_GET_REGION_INFO, &reg));
+
+	ptr = mmap(NULL, (size_t)reg.size, PROT_READ,
+		   MAP_SHARED, self->dev->fd, (off_t)reg.offset);
+	ASSERT_EQ(MAP_FAILED, ptr);
+
+	printf("mmap of COMP_REGS correctly failed (errno=%d)\n", errno);
+}
+
+/*
+ * mmap() of the CXL component register block within the component BAR must
+ * fail with EINVAL.  The kernel blocks any mmap request whose range overlaps
+ * [comp_reg_offset, comp_reg_offset + comp_reg_size) even though the BAR as
+ * a whole carries the MMAP flag (GPU windows are mmappable).
+ *
+ * hdm_regs_offset (= comp_reg_offset + CXL_CM_OFFSET) is a page-aligned
+ * address guaranteed to lie inside the component register block.
+ */
+TEST_F(cxl_type2, comp_reg_mmap_blocked)
+{
+	struct vfio_region_info bar_reg = { .argsz = sizeof(bar_reg) };
+	void *ptr;
+
+	bar_reg.index = self->cxl_cap.hdm_regs_bar_index;
+	ASSERT_EQ(0, ioctl(self->dev->fd, VFIO_DEVICE_GET_REGION_INFO, &bar_reg));
+	ASSERT_TRUE(bar_reg.flags & VFIO_REGION_INFO_FLAG_MMAP);
+
+	/*
+	 * hdm_regs_offset is page-aligned and is comp_reg_offset + CXL_CM_OFFSET
+	 * (0x1000), so it is always within the component register block.
+	 */
+	ASSERT_EQ(0ULL, self->cxl_cap.hdm_regs_offset % SZ_4K);
+
+	ptr = mmap(NULL, (size_t)SZ_4K, PROT_READ,
+		   MAP_SHARED, self->dev->fd,
+		   (off_t)(bar_reg.offset + self->cxl_cap.hdm_regs_offset));
+	ASSERT_EQ(MAP_FAILED, ptr);
+	ASSERT_EQ(EINVAL, errno);
+
+	printf("comp_reg_mmap_blocked: hdm_regs_offset=0x%llx correctly blocked "
+	       "(errno=%d)\n",
+	       (unsigned long long)self->cxl_cap.hdm_regs_offset, errno);
+}
+
+/* ------------------------------------------------------------------ */
+/* Tests: COMP_REGS region (HDM decoder emulation)                    */
+/* ------------------------------------------------------------------ */
+
+/*
+ * Reading HDM Capability (offset 0x00) must return a non-zero value
+ * consistent with at least one decoder being present.
+ * Bits [3:0] encode the HDM decoder count.
+ */
+TEST_F(cxl_type2, hdm_cap_read)
+{
+	uint32_t cap;
+	uint32_t idx = self->cxl_cap.comp_regs_region_index;
+	uint64_t hdm_off = self->hdm_decoder_offset;
+
+	cap = hdm_regs_read32(self->dev, idx, hdm_off, CXL_HDM_DECODER_CAP_OFFSET);
+	ASSERT_NE(~0u, cap);
+
+	/*
+	 * Verify the live HDMC register matches the count we derived in setup.
+	 * Encoding: bits[3:0] = 0 → 1 decoder; N → N*2 decoders.
+	 */
+	{
+		uint32_t field = cap & 0xf;
+		uint8_t expected = field ? (uint8_t)(field * 2) : 1;
+
+		ASSERT_EQ(self->hdm_count, expected);
+	}
+
+	printf("HDM Capability register: 0x%08x  decoder_count_field=%u  hdm_count=%u\n",
+	       cap, cap & 0xf, self->hdm_count);
+}
+
+/*
+ * HDM decoder COMMIT -> COMMITTED transition.
+ *
+ * On firmware-committed hardware (COMMITTED already set) the COMMIT path
+ * is not exercisable.  Instead verify the committed state is self-consistent:
+ * COMMITTED set, BASE/SIZE non-zero and large enough to cover dpa_size, and
+ * reserved bits cleared by the emulation layer.
+ *
+ * On hardware where the decoder is not yet committed, exercise the full
+ * COMMIT=1 -> COMMITTED=1 path followed by COMMIT=0 -> COMMITTED=0.
+ */
+TEST_F(cxl_type2, hdm_ctrl_commit_to_committed)
+{
+	uint32_t idx = self->cxl_cap.comp_regs_region_index;
+	uint64_t hdm_off = self->hdm_decoder_offset;
+	uint64_t base_lo_off = HDM_DECODER_FIRST_OFFSET + HDM_DECODER_BASE_LO;
+	uint64_t base_hi_off = HDM_DECODER_FIRST_OFFSET + HDM_DECODER_BASE_HI;
+	uint64_t size_lo_off = HDM_DECODER_FIRST_OFFSET + HDM_DECODER_SIZE_LO;
+	uint64_t size_hi_off = HDM_DECODER_FIRST_OFFSET + HDM_DECODER_SIZE_HI;
+	uint64_t ctrl_off    = HDM_DECODER_FIRST_OFFSET + HDM_DECODER_CTRL;
+	uint32_t ctrl_readback;
+	uint32_t base_lo, base_hi, size_lo, size_hi;
+	uint64_t dec_base, dec_size;
+
+	ctrl_readback = hdm_regs_read32(self->dev, idx, hdm_off, ctrl_off);
+
+	if (ctrl_readback & HDM_CTRL_COMMITTED) {
+		/*
+		 * Firmware-committed decoder: verify the committed state is
+		 * self-consistent.
+		 *
+		 * BASE is expected to be zero: the kernel clears BASE_LO/HI in
+		 * the shadow for firmware-committed decoders so that the host
+		 * HPA does not leak to the guest.  The VMM will write the guest
+		 * GPA into BASE before booting the VM.
+		 *
+		 * SIZE must cover at least dpa_size, and reserved bits must be
+		 * clear (the emulation scrubs them on every write).
+		 */
+		base_lo = hdm_regs_read32(self->dev, idx, hdm_off, base_lo_off);
+		base_hi = hdm_regs_read32(self->dev, idx, hdm_off, base_hi_off);
+		size_lo = hdm_regs_read32(self->dev, idx, hdm_off, size_lo_off);
+		size_hi = hdm_regs_read32(self->dev, idx, hdm_off, size_hi_off);
+		dec_base = ((uint64_t)base_hi << 32) | (base_lo & ~GENMASK(27, 0));
+		dec_size = ((uint64_t)size_hi << 32) | (size_lo & ~GENMASK(27, 0));
+
+		ASSERT_EQ(0ULL, dec_base);
+		ASSERT_GE(dec_size, self->dpa_size);
+		ASSERT_EQ(0u, ctrl_readback & HDM_CTRL_RESERVED_MASK);
+
+		printf("Decoder 0 firmware-committed: ctrl=0x%08x "
+		       "base=0x%llx (zeroed by kernel) size=0x%llx dpa_size=0x%llx\n",
+		       ctrl_readback,
+		       (unsigned long long)dec_base,
+		       (unsigned long long)dec_size,
+		       (unsigned long long)self->dpa_size);
+		return;
+	}
+
+	/* Decoder not committed: exercise COMMIT=1 -> COMMITTED=1 path */
+	ASSERT_EQ(4, hdm_regs_write32(self->dev, idx, hdm_off, base_lo_off, 0x10000000));
+	ASSERT_EQ(4, hdm_regs_write32(self->dev, idx, hdm_off, base_hi_off, 0));
+	ASSERT_EQ(4, hdm_regs_write32(self->dev, idx, hdm_off, size_lo_off, 0x10000000));
+	ASSERT_EQ(4, hdm_regs_write32(self->dev, idx, hdm_off, size_hi_off, 0));
+
+	ASSERT_EQ(4, hdm_regs_write32(self->dev, idx, hdm_off, ctrl_off, HDM_CTRL_COMMIT));
+	ctrl_readback = hdm_regs_read32(self->dev, idx, hdm_off, ctrl_off);
+	ASSERT_TRUE(ctrl_readback & HDM_CTRL_COMMITTED);
+
+	printf("HDM decoder 0 CTRL after COMMIT=1: 0x%08x (COMMITTED set)\n",
+	       ctrl_readback);
+
+	ASSERT_EQ(4, hdm_regs_write32(self->dev, idx, hdm_off, ctrl_off, 0));
+	ctrl_readback = hdm_regs_read32(self->dev, idx, hdm_off, ctrl_off);
+	ASSERT_FALSE(ctrl_readback & HDM_CTRL_COMMITTED);
+
+	printf("HDM decoder 0 CTRL after COMMIT=0: 0x%08x (COMMITTED cleared)\n",
+	       ctrl_readback);
+}
+
+/*
+ * CXL Lock (DVSEC offset 0x14):
+ *   - Reserved bits GENMASK(15,1) must be cleared.
+ *   - Once locked, CXL Control writes must be discarded.
+ *
+ * On firmware-committed hardware, CONFIG_LOCK is set by the BIOS before
+ * the OS loads.  In that case verify:
+ *   (a) the Lock reserved bits are zero, and
+ *   (b) a write to CXL Control is silently discarded by the emulation.
+ * Both are directly testable without needing to transition from unlocked
+ * to locked.
+ *
+ * On hardware where CONFIG_LOCK is not yet set, exercise the full sequence:
+ * write reserved bits (must be cleared), set CONFIG_LOCK, verify Control
+ * writes are then discarded.
+ */
+TEST_F(cxl_type2, dvsec_lock_semantics)
+{
+	uint16_t dvsec = self->dvsec_base;
+	uint16_t lock_val, ctrl_before, ctrl_after;
+
+	if (!dvsec)
+		SKIP(return, "CXL DVSEC not found in config space");
+
+	lock_val = vfio_pci_config_readw(self->dev,
+					 dvsec + CXL_DVSEC_LOCK_OFFSET);
+
+	if (lock_val & CXL_DVSEC_LOCK_CONFIG_LOCK) {
+		/*
+		 * Lock is already set: verify reserved bits are zero in the
+		 * current shadow, then verify a Control write is discarded.
+		 */
+		ASSERT_EQ(0u, lock_val & CXL_LOCK_RESERVED_MASK);
+
+		ctrl_before = vfio_pci_config_readw(self->dev,
+						    dvsec + CXL_DVSEC_CONTROL_OFFSET);
+		/* Attempt to flip CXL_Mem_Enable (bit 2) */
+		vfio_pci_config_writew(self->dev, dvsec + CXL_DVSEC_CONTROL_OFFSET,
+				       ctrl_before ^ BIT(2));
+		ctrl_after = vfio_pci_config_readw(self->dev,
+						   dvsec + CXL_DVSEC_CONTROL_OFFSET);
+		ASSERT_EQ(ctrl_before, ctrl_after);
+
+		printf("CONFIG_LOCK set: lock=0x%04x, "
+		       "Control write discarded (ctrl=0x%04x unchanged)\n",
+		       lock_val, ctrl_after);
+		return;
+	}
+
+	/* Lock is not set: exercise reserved-bit masking and lock-set sequence */
+	vfio_pci_config_writew(self->dev, dvsec + CXL_DVSEC_LOCK_OFFSET,
+			       CXL_LOCK_RESERVED_MASK);
+	lock_val = vfio_pci_config_readw(self->dev,
+					 dvsec + CXL_DVSEC_LOCK_OFFSET);
+	ASSERT_EQ(0u, lock_val & CXL_LOCK_RESERVED_MASK);
+	ASSERT_FALSE(lock_val & CXL_DVSEC_LOCK_CONFIG_LOCK);
+
+	ctrl_before = vfio_pci_config_readw(self->dev,
+					    dvsec + CXL_DVSEC_CONTROL_OFFSET);
+	vfio_pci_config_writew(self->dev, dvsec + CXL_DVSEC_LOCK_OFFSET,
+			       CXL_DVSEC_LOCK_CONFIG_LOCK);
+	lock_val = vfio_pci_config_readw(self->dev,
+					 dvsec + CXL_DVSEC_LOCK_OFFSET);
+	ASSERT_TRUE(lock_val & CXL_DVSEC_LOCK_CONFIG_LOCK);
+
+	vfio_pci_config_writew(self->dev, dvsec + CXL_DVSEC_CONTROL_OFFSET,
+			       ctrl_before ^ BIT(0));
+	ctrl_after = vfio_pci_config_readw(self->dev,
+					   dvsec + CXL_DVSEC_CONTROL_OFFSET);
+	ASSERT_EQ(ctrl_before, ctrl_after);
+
+	printf("Lock set, Control write discarded: "
+	       "lock=0x%04x ctrl_before=0x%04x ctrl_after=0x%04x\n",
+	       lock_val, ctrl_before, ctrl_after);
+}
+
+/* ------------------------------------------------------------------ */
+/* main                                                               */
+/* ------------------------------------------------------------------ */
+
+int main(int argc, char *argv[])
+{
+	device_bdf = vfio_selftests_get_bdf(&argc, argv);
+	return test_harness_run(argc, argv);
+}
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH v2 16/20] vfio/cxl: Register regions with VFIO layer
  2026-04-01 14:39 ` [PATCH v2 16/20] vfio/cxl: Register regions with VFIO layer mhonap
@ 2026-04-03 19:35   ` Dan Williams
  2026-04-04 18:53     ` Jason Gunthorpe
  0 siblings, 1 reply; 27+ messages in thread
From: Dan Williams @ 2026-04-03 19:35 UTC (permalink / raw)
  To: mhonap, alwilliamson, dan.j.williams, jonathan.cameron,
	dave.jiang, alejandro.lucero-palau, dave, alison.schofield,
	vishal.l.verma, ira.weiny, dmatlack, shuah, jgg, yishaih,
	skolothumtho, kevin.tian, ankita
  Cc: vsethi, cjia, targupta, zhiw, kjaju, linux-kselftest,
	linux-kernel, linux-cxl, kvm, mhonap

mhonap@ wrote:
> From: Manish Honap <mhonap@nvidia.com>
> 
> Register the DPA and component register regions with the VFIO layer.
> Region indices for both regions are cached for quick lookup.
> 
> vfio_cxl_register_cxl_region()
> - memremaps (WB) the region HPA (treating CXL.mem as RAM, not MMIO)
> - Registers VFIO_REGION_SUBTYPE_CXL
> - Records dpa_region_idx.
> 
> vfio_cxl_register_comp_regs_region()
> - Registers VFIO_REGION_SUBTYPE_CXL_COMP_REGS with size
>   hdm_reg_offset + hdm_reg_size
> - Records comp_reg_region_idx.
> 
> Signed-off-by: Manish Honap <mhonap@nvidia.com>
> ---
>  drivers/vfio/pci/cxl/vfio_cxl_core.c | 98 +++++++++++++++++++++++++++-
>  drivers/vfio/pci/cxl/vfio_cxl_emu.c  | 34 ++++++++++
>  drivers/vfio/pci/cxl/vfio_cxl_priv.h |  2 +
>  drivers/vfio/pci/vfio_pci.c          | 23 +++++++
>  drivers/vfio/pci/vfio_pci_priv.h     | 11 ++++
>  5 files changed, 167 insertions(+), 1 deletion(-)
[..]
> @@ -622,4 +640,82 @@ static const struct vfio_pci_regops vfio_cxl_regops = {
>  	.release	= vfio_cxl_region_release,
>  };
>  
> +int vfio_cxl_register_cxl_region(struct vfio_pci_core_device *vdev)
> +{
> +	struct vfio_pci_cxl_state *cxl = vdev->cxl;
> +	u32 flags;
> +	int ret;
> +
> +	if (!cxl)
> +		return -ENODEV;
> +
> +	if (!cxl->region || cxl->region_vaddr)
> +		return -ENODEV;
> +
> +	/*
> +	 * CXL device memory is RAM, not MMIO.  Use memremap() rather than

Right, CXL.mem is RAM, not MMIO, so I question why this is not being
mapped via a RAM mechanism. Can you explain a bit more about why reusing
vfio MMIO mapping mechanisms is suitable here and what happens when we
get to questions like guest_memfd integration, page conversions, and
large page mapping support? I am looking at proposals like Gregory's as
a way to get full MM semantics for CXL.mem but without the memory being
consumed for other purposes [1].

Maybe this is the start of the conversation for something simple, but I
would actually prefer uncached ioremap(), if only to avoid all the
questions about coherence management and the suitability of MMIO
mechanisms. ioremap() being a poor fit for CXL is a "feature" that
forces the "find something better" direction.

[1]: https://lore.kernel.org/all/20260222084842.1824063-1-gourry@gourry.net


* Re: [PATCH v2 16/20] vfio/cxl: Register regions with VFIO layer
  2026-04-03 19:35   ` Dan Williams
@ 2026-04-04 18:53     ` Jason Gunthorpe
  2026-04-04 19:36       ` Dan Williams
  0 siblings, 1 reply; 27+ messages in thread
From: Jason Gunthorpe @ 2026-04-04 18:53 UTC (permalink / raw)
  To: Dan Williams
  Cc: mhonap, alwilliamson, jonathan.cameron, dave.jiang,
	alejandro.lucero-palau, dave, alison.schofield, vishal.l.verma,
	ira.weiny, dmatlack, shuah, yishaih, skolothumtho, kevin.tian,
	ankita, vsethi, cjia, targupta, zhiw, kjaju, linux-kselftest,
	linux-kernel, linux-cxl, kvm

On Fri, Apr 03, 2026 at 12:35:45PM -0700, Dan Williams wrote:
> > +	/*
> > +	 * CXL device memory is RAM, not MMIO.  Use memremap() rather than
> 
> Right, CXL.mem is RAM, not MMIO, so I question why is this not being
> mapped via a RAM mechanism? Can you explain a bit more about why reusing
> vfio MMIO mapping mechanisms is suitable here and what happens when we
> get to questions like guest_memfd integration, page conversions, and
> large page mapping support? 

None of that is applicable to VFIO. VFIO owns the entire address space
and does not share it with the mm or anything else.

> Maybe this is the start of the conversation for something simple, but I
> would actually prefer uncached ioremap() if only to avoid all the
> coherence management and suitability of MMIO mechanisms questions.

The entire thing needs to be mmapable in a cache coherent way since
that is what the HW semantic is. If you try to do something else you
will break KVM support since it follows the VMA.

Jason


* Re: [PATCH v2 16/20] vfio/cxl: Register regions with VFIO layer
  2026-04-04 18:53     ` Jason Gunthorpe
@ 2026-04-04 19:36       ` Dan Williams
  2026-04-06 21:22         ` Gregory Price
  2026-04-06 22:10         ` Jason Gunthorpe
  0 siblings, 2 replies; 27+ messages in thread
From: Dan Williams @ 2026-04-04 19:36 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: mhonap, alwilliamson, jonathan.cameron, dave.jiang,
	alejandro.lucero-palau, dave, alison.schofield, vishal.l.verma,
	ira.weiny, dmatlack, shuah, yishaih, skolothumtho, kevin.tian,
	ankita, vsethi, cjia, targupta, zhiw, kjaju, linux-kselftest,
	linux-kernel, linux-cxl, kvm

Jason Gunthorpe wrote:
> On Fri, Apr 03, 2026 at 12:35:45PM -0700, Dan Williams wrote:
> > > +	/*
> > > +	 * CXL device memory is RAM, not MMIO.  Use memremap() rather than
> > 
> > Right, CXL.mem is RAM, not MMIO, so I question why is this not being
> > mapped via a RAM mechanism? Can you explain a bit more about why reusing
> > vfio MMIO mapping mechanisms is suitable here and what happens when we
> > get to questions like guest_memfd integration, page conversions, and
> > large page mapping support? 
> 
> None of that is applicable to VFIO. VFIO owns the entire address space
> and does not share it with the mm or anything else.

I was worried less about sharing and more about how the map eventually
gets used, but it makes sense that all of that is enabled through the
VFIO VMA.
I have more reading to do in this space.

> > Maybe this is the start of the conversation for something simple, but I
> > would actually prefer uncached ioremap() if only to avoid all the
> > coherence management and suitability of MMIO mechanisms questions.
> 
> The entire thing needs to be mmapable in a cache coherent way since
> that is what the HW semantic is. If you try to do something else you
> will break KVM support since it follows the VMA.

Then I assume it matters that memremap() sometimes silently falls back
to the direct map. The "VFIO owns" expectation needs to guard against
some helpful platform firmware mapping accelerator memory as System RAM.

At a minimum having VFIO fail to map in that case helps with the
argument I have been making that "no, EFI_CONVENTIONAL_MEMORY type +
EFI_SPECIFIC_PURPOSE flag" is not suitable for accelerators with private
CXL memory. Those want to be enforcing "EFI_RESERVED".


* Re: [PATCH v2 16/20] vfio/cxl: Register regions with VFIO layer
  2026-04-04 19:36       ` Dan Williams
@ 2026-04-06 21:22         ` Gregory Price
  2026-04-06 22:05           ` Jason Gunthorpe
  2026-04-06 22:10         ` Jason Gunthorpe
  1 sibling, 1 reply; 27+ messages in thread
From: Gregory Price @ 2026-04-06 21:22 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jason Gunthorpe, mhonap, alwilliamson, jonathan.cameron,
	dave.jiang, alejandro.lucero-palau, dave, alison.schofield,
	vishal.l.verma, ira.weiny, dmatlack, shuah, yishaih, skolothumtho,
	kevin.tian, ankita, vsethi, cjia, targupta, zhiw, kjaju,
	linux-kselftest, linux-kernel, linux-cxl, kvm

On Sat, Apr 04, 2026 at 12:36:53PM -0700, Dan Williams wrote:
> Jason Gunthorpe wrote:
> > 
> > The entire thing needs to be mmapable in a cache coherent way since
> > that is what the HW semantic is. If you try to do something else you
> > will break KVM support since it follows the VMA.
> 
> Then I assume it matters that memremap() sometimes silently falls back
> to the direct map. The "VFIO owns" expectation needs to guard against
> some helpful platform firmware mapping accelerator memory as System RAM.
> 
> At a minimum having VFIO fail to map in that case helps with the
> argument I have been making that "no, EFI_CONVENTIONAL_MEMORY type +
> EFI_SPECIFIC_PURPOSE flag" is not suitable for accelerators with private
> CXL memory. Those want to be enforcing "EFI_RESERVED".

Agree - in fact I would argue any potential user of private nodes should
have some kind of splat saying the memory should be marked reserved in
the first place, otherwise it's a firmware bug.

I need to read up a little bit on this area, but I don't see the "needs
to be mmap'able in a cache coherent way" as an argument for one
particular method or another (hotplug, private node, memremap, etc).

If this extension starts bleeding into special-casing a bunch more mm/
stuff that has to be aware of and handle special memremap'd memory,
that's when I would argue we may need to look at whether nodes apply.
But if
this just wants to pass a whole chunk from driver to guest and otherwise
the host is supposed to stay out of the way - memremap seems like an ok
start / a reasonable existing upstream solution.

~Gregory


* Re: [PATCH v2 16/20] vfio/cxl: Register regions with VFIO layer
  2026-04-06 21:22         ` Gregory Price
@ 2026-04-06 22:05           ` Jason Gunthorpe
  0 siblings, 0 replies; 27+ messages in thread
From: Jason Gunthorpe @ 2026-04-06 22:05 UTC (permalink / raw)
  To: Gregory Price
  Cc: Dan Williams, mhonap, alwilliamson, jonathan.cameron, dave.jiang,
	alejandro.lucero-palau, dave, alison.schofield, vishal.l.verma,
	ira.weiny, dmatlack, shuah, yishaih, skolothumtho, kevin.tian,
	ankita, vsethi, cjia, targupta, zhiw, kjaju, linux-kselftest,
	linux-kernel, linux-cxl, kvm

On Mon, Apr 06, 2026 at 05:22:06PM -0400, Gregory Price wrote:
> On Sat, Apr 04, 2026 at 12:36:53PM -0700, Dan Williams wrote:
> > Jason Gunthorpe wrote:
> > > 
> > > The entire thing needs to be mmapable in a cache coherent way since
> > > that is what the HW semantic is. If you try to do something else you
> > > will break KVM support since it follows the VMA.
> > 
> > Then I assume it matters that memremap() sometimes silently falls back
> > to the direct map. The "VFIO owns" expectation needs to guard against
> > some helpful platform firmware mapping accelerator memory as System RAM.
> > 
> > At a minimum having VFIO fail to map in that case helps with the
> > argument I have been making that "no, EFI_CONVENTIONAL_MEMORY type +
> > EFI_SPECIFIC_PURPOSE flag" is not suitable for accelerators with private
> > CXL memory. Those want to be enforcing "EFI_RESERVED".
> 
> Agree - in fact I would argue any potential user of private nodes should
> have some kind of splat saying the memory should be marked reserved in
> the first place, otherwise it's a firmware bug.

I would expect this to happen via the request_resource mechanism: if
the MM is using the CXL range it should be locked in the resource tree,
and vfio should request it and then fail during the initial startup
phases.

> I need to read up a little bit on this area, but i don't see the "needs
> to be mmap'able in a cache coherent way" to be an argument for one
> particular method or another (hotplug, private node, memremap, etc).

It is speaking to what VFIO has to do. It is the exclusive owner of
the physical range, it does not have struct pages, and it must be
cachable for VFIO and KVM to work - not a lot of choices here. Turn the
phys_addr_t's into large cachable special PTEs inside a VMA.

Jason


* Re: [PATCH v2 16/20] vfio/cxl: Register regions with VFIO layer
  2026-04-04 19:36       ` Dan Williams
  2026-04-06 21:22         ` Gregory Price
@ 2026-04-06 22:10         ` Jason Gunthorpe
  1 sibling, 0 replies; 27+ messages in thread
From: Jason Gunthorpe @ 2026-04-06 22:10 UTC (permalink / raw)
  To: Dan Williams
  Cc: mhonap, alwilliamson, jonathan.cameron, dave.jiang,
	alejandro.lucero-palau, dave, alison.schofield, vishal.l.verma,
	ira.weiny, dmatlack, shuah, yishaih, skolothumtho, kevin.tian,
	ankita, vsethi, cjia, targupta, zhiw, kjaju, linux-kselftest,
	linux-kernel, linux-cxl, kvm

On Sat, Apr 04, 2026 at 12:36:53PM -0700, Dan Williams wrote:

> Then I assume it matters that memremap() sometimes silently falls back
> to the direct map. The "VFIO owns" expectation needs to guard against
> some helpful platform firmware mapping accelerator memory as System RAM.

I don't think how memremap works under the covers matters to vfio; it
takes in a phys_addr_t and gives back a KVA that is cachable kernel
memory.

We just have to be mindful to not allow mismatched attributes on
virtual aliases.

> At a minimum having VFIO fail to map in that case helps with the
> argument I have been making that "no, EFI_CONVENTIONAL_MEMORY type +
> EFI_SPECIFIC_PURPOSE flag" is not suitable for accelerators with private
> CXL memory. Those want to be enforcing "EFI_RESERVED".

Certainly it should fail the request-region call if something else is
using it, so if those EFI flags plug it into the mm and it is still
plugged when VFIO starts, then it should stop.

The direct map isn't really "plugged into the mm", but it does raise
the mismatched attributes issue.

Jason


end of thread, other threads:[~2026-04-06 22:10 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-04-01 14:38 [PATCH v2 00/20] vfio/pci: Add CXL Type-2 device passthrough support mhonap
2026-04-01 14:38 ` [PATCH v2 01/20] cxl: Add cxl_get_hdm_info() for HDM decoder metadata mhonap
2026-04-01 14:38 ` [PATCH v2 02/20] cxl: Declare cxl_find_regblock and cxl_probe_component_regs in public header mhonap
2026-04-01 14:39 ` [PATCH v2 03/20] cxl: Move component/HDM register defines to uapi/cxl/cxl_regs.h mhonap
2026-04-01 14:39 ` [PATCH v2 04/20] cxl: Split cxl_await_range_active() from media-ready wait mhonap
2026-04-01 14:39 ` [PATCH v2 05/20] cxl: Record BIR and BAR offset in cxl_register_map mhonap
2026-04-01 14:39 ` [PATCH v2 06/20] vfio: UAPI for CXL-capable PCI device assignment mhonap
2026-04-01 14:39 ` [PATCH v2 07/20] vfio/pci: Add CXL state to vfio_pci_core_device mhonap
2026-04-01 14:39 ` [PATCH v2 08/20] vfio/pci: Add CONFIG_VFIO_CXL_CORE and stub CXL hooks mhonap
2026-04-01 14:39 ` [PATCH v2 09/20] vfio/cxl: Detect CXL DVSEC and probe HDM block mhonap
2026-04-01 14:39 ` [PATCH v2 10/20] vfio/pci: Export config access helpers mhonap
2026-04-01 14:39 ` [PATCH v2 11/20] vfio/cxl: Introduce HDM decoder register emulation framework mhonap
2026-04-01 14:39 ` [PATCH v2 12/20] vfio/cxl: Wait for HDM ranges and create memdev mhonap
2026-04-01 14:39 ` [PATCH v2 13/20] vfio/cxl: CXL region management support mhonap
2026-04-01 14:39 ` [PATCH v2 14/20] vfio/cxl: DPA VFIO region with demand fault mmap and reset zap mhonap
2026-04-01 14:39 ` [PATCH v2 15/20] vfio/cxl: Virtualize CXL DVSEC config writes mhonap
2026-04-01 14:39 ` [PATCH v2 16/20] vfio/cxl: Register regions with VFIO layer mhonap
2026-04-03 19:35   ` Dan Williams
2026-04-04 18:53     ` Jason Gunthorpe
2026-04-04 19:36       ` Dan Williams
2026-04-06 21:22         ` Gregory Price
2026-04-06 22:05           ` Jason Gunthorpe
2026-04-06 22:10         ` Jason Gunthorpe
2026-04-01 14:39 ` [PATCH v2 17/20] vfio/pci: Advertise CXL cap and sparse component BAR to userspace mhonap
2026-04-01 14:39 ` [PATCH v2 18/20] vfio/cxl: Provide opt-out for CXL feature mhonap
2026-04-01 14:39 ` [PATCH v2 19/20] docs: vfio-pci: Document CXL Type-2 device passthrough mhonap
2026-04-01 14:39 ` [PATCH v2 20/20] selftests/vfio: Add CXL Type-2 VFIO assignment test mhonap

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox