From: <mhonap@nvidia.com>
To: <djbw@kernel.org>, <alex@shazbot.org>, <jgg@ziepe.ca>,
<jic23@kernel.org>, <dave.jiang@intel.com>, <ankita@nvidia.com>,
<alejandro.lucero-palau@amd.com>, <alison.schofield@intel.com>,
<dave@stgolabs.net>, <dmatlack@google.com>, <gourry@gourry.net>,
<ira.weiny@intel.com>
Cc: <cjia@nvidia.com>, <kjaju@nvidia.com>, <vsethi@nvidia.com>,
<zhiw@nvidia.com>, <mhonap@nvidia.com>, <kvm@vger.kernel.org>,
<linux-cxl@vger.kernel.org>, <linux-doc@vger.kernel.org>,
<linux-kernel@vger.kernel.org>, <linux-kselftest@vger.kernel.org>
Subject: [PATCH v3 08/11] vfio/pci/cxl: Add HDM + COMP_REGS regions and DVSEC clipping shim
Date: Thu, 25 Jun 2026 22:24:04 +0530 [thread overview]
Message-ID: <20260625165407.1769572-9-mhonap@nvidia.com> (raw)
In-Reply-To: <20260625165407.1769572-1-mhonap@nvidia.com>
From: Manish Honap <mhonap@nvidia.com>
Complete the vfio-pci-core integration of CXL Type-2 device
passthrough by exposing two VFIO regions to userspace, wiring DVSEC
config-space accesses through cxl-core's register-virtualization
helpers, and reserving the CXL component register block from BAR
mmap and BAR resource claim.
HDM region (VFIO_REGION_SUBTYPE_CXL):
- mmappable view of the device's firmware-committed HPA range
- mmap fault handler calls vmf_insert_pfn() from the physical HPA
so the guest gets the same backing memory the host sees
- pread/pwrite go through the memremap_wb() kva captured at
bind time by vfio_cxl_map_hdm()
COMP_REGS region (VFIO_REGION_SUBTYPE_CXL_COMP_REGS):
- pread/pwrite only, dword-aligned (-EINVAL on misalignment)
- thin transport: each dword dispatches by offset to
cxl_passthrough_cm_rw() (CM cap-array snapshot) or
cxl_passthrough_hdm_rw() (HDM Decoder block). No shadow buffer
on the vfio side; all per-field semantics live in cxl-core.
DVSEC config-space access:
- vfio_pci_cxl_config_boundary() clips a chunk at the CXL Device
DVSEC body edge in vfio_pci_config_rw_single() so the generic
perm-bits path handles the DVSEC header bytes and the CXL hook
handles the body bytes. The clipping shim is used instead of
re-pointing the ecap_perms[] readfn/writefn (which would mutate
a module-init static and race across multiple CXL devices).
- vfio_pci_cxl_config_rw() forwards clipped accesses to
cxl_passthrough_dvsec_rw(); cxl-core enforces the per-field
write semantics (LOCK/RWO, CONTROL/RWL, STATUS/RW1C,
RANGE1/HwInit, RANGE2/RsvdZ).
GET_INFO / GET_REGION_INFO:
- VFIO_DEVICE_INFO_CAP_CXL advertises the two region indices, the
component BAR layout, and HOST_FIRMWARE_COMMITTED.
- GET_REGION_INFO on the component BAR returns a sparse-mmap cap
that excludes [comp_reg_offset, comp_reg_offset+comp_reg_size).
BAR resource handling:
- cxl-core holds request_mem_region() on the CXL component
register sub-range from devm_cxl_probe_mem(), so vfio_pci-core's
pci_request_selected_regions() on the full BAR would collide.
map_bars() skips the request for the component BAR (still iomaps
it; vfio holds the BAR via driver binding); disable() mirrors
the asymmetric skip.
- mmap of the component BAR refuses any range overlapping the CXL
sub-range via vfio_pci_cxl_mmap_overlaps_comp_regs().
vfio_pci_cxl_open() now registers both VFIO regions; close()
unregisters them. Raw BAR rw redirect into the CXL sub-range is
intentionally not implemented: VMMs use the COMP_REGS region
directly.
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
drivers/vfio/pci/cxl/vfio_cxl_core.c | 521 ++++++++++++++++++++++++++-
drivers/vfio/pci/vfio_pci_config.c | 31 ++
drivers/vfio/pci/vfio_pci_core.c | 44 ++-
drivers/vfio/pci/vfio_pci_priv.h | 72 ++++
drivers/vfio/pci/vfio_pci_rdwr.c | 17 +
5 files changed, 679 insertions(+), 6 deletions(-)
diff --git a/drivers/vfio/pci/cxl/vfio_cxl_core.c b/drivers/vfio/pci/cxl/vfio_cxl_core.c
index 42cd00bbe869..8a00b776d7c7 100644
--- a/drivers/vfio/pci/cxl/vfio_cxl_core.c
+++ b/drivers/vfio/pci/cxl/vfio_cxl_core.c
@@ -123,12 +123,24 @@ static int vfio_cxl_probe_regs(struct vfio_pci_cxl_state *cxl)
if (rc)
return rc;
+ /*
+ * The CXL Component Register block is a fixed 64 KiB area (CXL r4.0
+ * §8.2.3). cxl_pci_setup_regs() records the remaining BAR length
+ * after the regblock offset in reg_map.max_size, which is an upper
+ * bound, not the spec-defined size. Bail if the BAR does not have
+ * room for a full component register block at the recorded offset,
+ * and publish the spec size so the UAPI, sparse-mmap exclusion, and
+ * COMP_REGS region all agree on the same window.
+ */
+ if (cxlds->reg_map.max_size < CXL_COMPONENT_REG_BLOCK_SIZE)
+ return -ENXIO;
+
cxl->info.hdm_count = hdm_count;
cxl->info.hdm_reg_offset = hdm_off;
cxl->info.hdm_reg_size = hdm_size;
cxl->info.comp_reg_bir = bir;
cxl->info.comp_reg_offset = bar_off;
- cxl->info.comp_reg_size = cxlds->reg_map.max_size;
+ cxl->info.comp_reg_size = CXL_COMPONENT_REG_BLOCK_SIZE;
cxl->info.host_firmware_committed = true;
/*
@@ -354,16 +366,515 @@ void vfio_pci_cxl_release(struct vfio_pci_core_device *vdev)
vdev->cxl = NULL;
}
+static int vfio_pci_cxl_register_hdm(struct vfio_pci_core_device *vdev);
+static int vfio_pci_cxl_register_comp_regs(struct vfio_pci_core_device *vdev);
+
int vfio_pci_cxl_open(struct vfio_pci_core_device *vdev)
{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+ int rc;
+
+ if (!cxl)
+ return 0; /* plain vfio-pci device */
+
+ rc = vfio_pci_cxl_register_comp_regs(vdev);
+ if (rc) {
+ pci_warn(vdev->pdev,
+ "vfio-cxl: COMP_REGS region register failed (%d)\n",
+ rc);
+ return rc;
+ }
+
+ rc = vfio_pci_cxl_register_hdm(vdev);
+ if (rc) {
+ pci_warn(vdev->pdev,
+ "vfio-cxl: HDM region register failed (%d)\n", rc);
+ /*
+ * COMP_REGS already registered above. vfio core does not
+ * call close_device() when open_device() returns an error,
+ * so roll back the COMP_REGS dynamic region here to avoid
+ * a leaked half-registered open state.
+ */
+ vfio_pci_cxl_close(vdev);
+ return rc;
+ }
+ return 0;
+}
+
+void vfio_pci_cxl_close(struct vfio_pci_core_device *vdev)
+{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+ unsigned int i;
+
+ if (!cxl)
+ return;
+
+ for (i = vdev->num_regions; i > 0; i--) {
+ struct vfio_pci_region *r = &vdev->region[i - 1];
+
+ if (r->data != cxl)
+ break;
+ if (r->ops->release)
+ r->ops->release(vdev, r);
+ vdev->num_regions--;
+ }
+}
+
+/* ------------------------------------------------------------------ */
+/* HDM region: mmappable view of the device's HPA range */
+/* ------------------------------------------------------------------ */
+
+static vm_fault_t hdm_region_fault(struct vm_fault *vmf)
+{
+ struct vm_area_struct *vma = vmf->vma;
+ struct vfio_pci_cxl_state *cxl = vma->vm_private_data;
+ unsigned long off = (vmf->address - vma->vm_start) +
+ (vma->vm_pgoff << PAGE_SHIFT);
+ phys_addr_t pa;
+
+ if (!cxl || !cxl->info.hpa_size)
+ return VM_FAULT_SIGBUS;
+ if (off >= cxl->info.hpa_size)
+ return VM_FAULT_SIGBUS;
+
+ pa = cxl->info.hpa_base + off;
+ return vmf_insert_pfn(vma, vmf->address, PHYS_PFN(pa));
+}
+
+static const struct vm_operations_struct hdm_region_vm_ops = {
+ .fault = hdm_region_fault,
+};
+
+static int hdm_region_mmap(struct vfio_pci_core_device *vdev,
+ struct vfio_pci_region *region,
+ struct vm_area_struct *vma)
+{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+ pgoff_t pgoff;
+ u64 req_start, req_len;
+
+ if (!cxl || !cxl->info.hpa_size)
+ return -ENODEV;
+
/*
- * Region registration (HDM, COMP_REGS) is added by the next
- * patch in this series. This hook exists so vfio-pci-core's
- * fd-open path has a stable call site.
+ * vfio_pci_core_mmap() forwards the VMA with vm_pgoff still
+ * carrying the VFIO region index in the high bits. Mask it off
+ * so req_start is the in-region offset; also overwrite vm_pgoff
+ * with the normalised value so the fault handler computes the
+ * physical address from a clean offset.
*/
+ pgoff = vma->vm_pgoff &
+ ((1ULL << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
+ req_start = (u64)pgoff << PAGE_SHIFT;
+ req_len = vma->vm_end - vma->vm_start;
+ if (req_start > cxl->info.hpa_size ||
+ req_len > cxl->info.hpa_size - req_start)
+ return -EINVAL;
+
+ vma->vm_pgoff = pgoff;
+ vm_flags_set(vma, VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP);
+ vma->vm_ops = &hdm_region_vm_ops;
+ vma->vm_private_data = cxl;
return 0;
}
-void vfio_pci_cxl_close(struct vfio_pci_core_device *vdev)
+static ssize_t hdm_region_rw(struct vfio_pci_core_device *vdev,
+ char __user *buf, size_t count,
+ loff_t *ppos, bool iswrite)
+{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+ loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
+ void *kva;
+
+ if (!cxl || !cxl->hdm_kva)
+ return -EINVAL;
+ if (pos < 0 || (u64)pos > cxl->info.hpa_size ||
+ count > cxl->info.hpa_size - (u64)pos)
+ return -EINVAL;
+
+ kva = (u8 *)cxl->hdm_kva + pos;
+ if (iswrite) {
+ if (copy_from_user(kva, buf, count))
+ return -EFAULT;
+ } else {
+ if (copy_to_user(buf, kva, count))
+ return -EFAULT;
+ }
+
+ *ppos += count;
+ return count;
+}
+
+static void hdm_region_release(struct vfio_pci_core_device *vdev,
+ struct vfio_pci_region *region)
+{
+}
+
+static const struct vfio_pci_regops vfio_pci_cxl_hdm_ops = {
+ .rw = hdm_region_rw,
+ .mmap = hdm_region_mmap,
+ .release = hdm_region_release,
+};
+
+static int vfio_pci_cxl_register_hdm(struct vfio_pci_core_device *vdev)
+{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+ u32 region_type = VFIO_REGION_TYPE_PCI_VENDOR_TYPE | PCI_VENDOR_ID_CXL;
+ u32 region_flags = VFIO_REGION_INFO_FLAG_READ |
+ VFIO_REGION_INFO_FLAG_WRITE |
+ VFIO_REGION_INFO_FLAG_MMAP;
+ int rc;
+
+ rc = vfio_pci_core_register_dev_region(vdev, region_type,
+ VFIO_REGION_SUBTYPE_CXL,
+ &vfio_pci_cxl_hdm_ops,
+ cxl->info.hpa_size,
+ region_flags, cxl);
+ if (rc)
+ return rc;
+
+ cxl->hdm_region_idx = VFIO_PCI_NUM_REGIONS + vdev->num_regions - 1;
+ return 0;
+}
+
+/* ------------------------------------------------------------------ */
+/* COMP_REGS region: thin transport to cxl-core register helpers */
+/* ------------------------------------------------------------------ */
+
+/*
+ * COMP_REGS exposes the CXL component register sub-range of the
+ * device's component BAR as a pread/pwrite-only VFIO region. Access
+ * is dword-only (4-byte aligned); sub-dword access returns -EINVAL.
+ * The dispatch maps each dword to one of cxl-core's three rw helpers:
+ *
+ * pos < CXL_CM_OFFSET → zero-fill / drop
+ * CXL_CM_OFFSET <= pos < hdm_reg_offset → cxl_passthrough_cm_rw
+ * hdm_reg_offset <= pos < hdm_reg_offset+size → cxl_passthrough_hdm_rw
+ * pos >= hdm_reg_offset + hdm_reg_size → zero-fill / drop
+ *
+ * vfio holds no shadow buffer of its own; the per-field write
+ * semantics live entirely in cxl-core.
+ */
+static ssize_t comp_regs_rw(struct vfio_pci_core_device *vdev,
+ char __user *buf, size_t count,
+ loff_t *ppos, bool iswrite)
+{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+ loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
+ resource_size_t cm_off, hdm_start, hdm_end;
+ size_t done = 0;
+
+ if (!cxl || !cxl->cxlpt)
+ return -EINVAL;
+ if (pos < 0 || (u64)pos > cxl->info.comp_reg_size ||
+ count > cxl->info.comp_reg_size - (u64)pos)
+ return -EINVAL;
+ if (!IS_ALIGNED(pos, 4) || !IS_ALIGNED(count, 4))
+ return -EINVAL;
+
+ cm_off = CXL_CM_OFFSET;
+ hdm_start = cxl->info.hdm_reg_offset;
+ hdm_end = hdm_start + cxl->info.hdm_reg_size;
+
+ while (done < count) {
+ __le32 le = 0;
+ u32 v32 = 0;
+ int rc;
+
+ if (iswrite) {
+ if (copy_from_user(&le, buf + done, 4))
+ return done ?: -EFAULT;
+ v32 = le32_to_cpu(le);
+ }
+
+ if (pos >= cm_off && pos < hdm_start) {
+ rc = cxl_passthrough_cm_rw(cxl->cxlpt,
+ (u32)(pos - cm_off),
+ &v32, iswrite);
+ if (rc)
+ return done ?: rc;
+ } else if (pos >= hdm_start && pos < hdm_end) {
+ rc = cxl_passthrough_hdm_rw(cxl->cxlpt,
+ (u32)(pos - hdm_start),
+ &v32, iswrite);
+ if (rc)
+ return done ?: rc;
+ } else if (!iswrite) {
+ v32 = 0; /* outside modelled ranges: read 0 */
+ }
+ /* writes outside modelled ranges are silently dropped */
+
+ if (!iswrite) {
+ le = cpu_to_le32(v32);
+ if (copy_to_user(buf + done, &le, 4))
+ return done ?: -EFAULT;
+ }
+
+ pos += 4;
+ done += 4;
+ }
+
+ *ppos += done;
+ return done;
+}
+
+static void comp_regs_release(struct vfio_pci_core_device *vdev,
+ struct vfio_pci_region *region)
+{
+}
+
+static const struct vfio_pci_regops vfio_pci_cxl_comp_regs_ops = {
+ .rw = comp_regs_rw,
+ .release = comp_regs_release,
+};
+
+static int vfio_pci_cxl_register_comp_regs(struct vfio_pci_core_device *vdev)
+{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+ u32 region_type = VFIO_REGION_TYPE_PCI_VENDOR_TYPE | PCI_VENDOR_ID_CXL;
+ u32 region_flags = VFIO_REGION_INFO_FLAG_READ |
+ VFIO_REGION_INFO_FLAG_WRITE;
+ int rc;
+
+ rc = vfio_pci_core_register_dev_region(vdev, region_type,
+ VFIO_REGION_SUBTYPE_CXL_COMP_REGS,
+ &vfio_pci_cxl_comp_regs_ops,
+ cxl->info.comp_reg_size,
+ region_flags, cxl);
+ if (rc)
+ return rc;
+
+ cxl->comp_reg_region_idx = VFIO_PCI_NUM_REGIONS + vdev->num_regions - 1;
+ return 0;
+}
+
+/* ------------------------------------------------------------------ */
+/* DVSEC config-space clipping shim */
+/* ------------------------------------------------------------------ */
+
+/*
+ * vfio_pci_cxl_config_boundary - clip a config-rw chunk at the DVSEC body edge
+ *
+ * Returns the maximum byte count the caller may pass through the
+ * generic chunker without straddling the CXL Device DVSEC body
+ * boundary, or SIZE_MAX when no clip is required. Used by
+ * vfio_pci_config_rw_single() so the DVSEC header bytes stay on the
+ * generic perm-bits path and the body bytes reach the CXL hook.
+ */
+size_t vfio_pci_cxl_config_boundary(struct vfio_pci_core_device *vdev,
+ loff_t pos)
+{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+ u32 body_start, body_end;
+
+ if (!cxl)
+ return SIZE_MAX;
+
+ body_start = cxl->info.dvsec_offset + PCI_DVSEC_CXL_CAP;
+ body_end = cxl->info.dvsec_offset + cxl->info.dvsec_size;
+
+ if (pos < body_start)
+ return body_start - pos;
+ if (pos < body_end)
+ return body_end - pos;
+ return SIZE_MAX;
+}
+
+/*
+ * vfio_pci_cxl_config_rw - forward CXL DVSEC config accesses to cxl-core
+ *
+ * Returns the number of bytes processed on success, -ENOENT if the
+ * access lies entirely outside the CXL Device DVSEC body (caller
+ * takes the standard perm-bits path), or another negative errno on
+ * hard failure. vfio_pci_config_rw_single() applies
+ * vfio_pci_cxl_config_boundary() before width selection, so any
+ * access that reaches here was already clipped to lie entirely inside
+ * the DVSEC body.
+ */
+ssize_t vfio_pci_cxl_config_rw(struct vfio_pci_core_device *vdev,
+ loff_t pos, size_t count, __le32 *val,
+ bool iswrite)
{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+ u32 dvsec_off, body_start, body_end, off;
+ u32 host_val;
+ int rc;
+
+ if (!cxl || !cxl->cxlpt)
+ return -ENOENT;
+
+ dvsec_off = cxl->info.dvsec_offset;
+ body_start = dvsec_off + PCI_DVSEC_CXL_CAP;
+ body_end = dvsec_off + cxl->info.dvsec_size;
+
+ if (pos + count <= body_start || pos >= body_end)
+ return -ENOENT;
+ if (WARN_ON_ONCE(pos < body_start || pos + count > body_end))
+ return -EINVAL; /* caller failed to clip at body boundary */
+
+ off = (u32)(pos - dvsec_off);
+ host_val = iswrite ? le32_to_cpu(*val) : 0;
+
+ rc = cxl_passthrough_dvsec_rw(cxl->cxlpt, off, &host_val, count,
+ iswrite);
+ if (rc)
+ return rc;
+
+ if (!iswrite)
+ *val = cpu_to_le32(host_val);
+ return count;
+}
+
+/* ------------------------------------------------------------------ */
+/* GET_INFO / GET_REGION_INFO / mmap helpers */
+/* ------------------------------------------------------------------ */
+
+u8 vfio_pci_cxl_get_component_reg_bar(struct vfio_pci_core_device *vdev)
+{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+
+ return cxl ? cxl->info.comp_reg_bir : U8_MAX;
+}
+
+bool vfio_pci_cxl_get_comp_reg_range(struct vfio_pci_core_device *vdev,
+ size_t *start, size_t *end)
+{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+
+ if (!cxl || !cxl->info.comp_reg_size)
+ return false;
+
+ *start = cxl->info.comp_reg_offset;
+ *end = cxl->info.comp_reg_offset + cxl->info.comp_reg_size;
+ return true;
+}
+
+bool vfio_pci_cxl_mmap_overlaps_comp_regs(struct vfio_pci_core_device *vdev,
+ u64 req_start, u64 req_len)
+{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+
+ if (!cxl || !cxl->info.comp_reg_size)
+ return false;
+
+ return req_start < cxl->info.comp_reg_offset + cxl->info.comp_reg_size &&
+ req_start + req_len > cxl->info.comp_reg_offset;
+}
+
+/*
+ * vfio_pci_cxl_bar_overlaps_comp_regs - check whether a BAR-relative access
+ * overlaps the CXL component register sub-range.
+ *
+ * Returns true when @bar is the component BAR and the [@start, @start + @len)
+ * window overlaps [comp_reg_offset, comp_reg_offset + comp_reg_size). Used
+ * by the raw BAR read/write and ioeventfd paths to reject accesses that
+ * would bypass the COMP_REGS region and reach the physical component
+ * registers directly, sidestepping cxl-core's shadow and per-field write
+ * semantics.
+ */
+bool vfio_pci_cxl_bar_overlaps_comp_regs(struct vfio_pci_core_device *vdev,
+ int bar, u64 start, u64 len)
+{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+
+ if (!cxl || !cxl->info.comp_reg_size || !len)
+ return false;
+ if (bar != cxl->info.comp_reg_bir)
+ return false;
+
+ return start < cxl->info.comp_reg_offset + cxl->info.comp_reg_size &&
+ start + len > cxl->info.comp_reg_offset;
+}
+
+int vfio_pci_cxl_get_info(struct vfio_pci_core_device *vdev,
+ struct vfio_info_cap *caps)
+{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+ struct vfio_device_info_cap_cxl cap = { };
+
+ if (!cxl)
+ return 0;
+
+ cap.header.id = VFIO_DEVICE_INFO_CAP_CXL;
+ cap.header.version = 1;
+ if (cxl->info.host_firmware_committed)
+ cap.flags |= VFIO_CXL_CAP_HOST_FIRMWARE_COMMITTED;
+ cap.hdm_region_idx = cxl->hdm_region_idx;
+ cap.comp_reg_region_idx = cxl->comp_reg_region_idx;
+ cap.comp_reg_bar = cxl->info.comp_reg_bir;
+ cap.comp_reg_offset = cxl->info.comp_reg_offset;
+ cap.comp_reg_size = cxl->info.comp_reg_size;
+
+ return vfio_info_add_capability(caps, &cap.header, sizeof(cap));
+}
+
+/*
+ * Build a VFIO_REGION_INFO_CAP_SPARSE_MMAP that excludes the CXL
+ * component register block from the mmappable areas of the
+ * component BAR. Returns -ENOTTY when the request is not for the
+ * component BAR or the component BAR is not mmappable; the caller
+ * (vfio_pci_ioctl_get_region_info) then continues with the standard
+ * BAR path.
+ */
+int vfio_pci_cxl_get_region_info(struct vfio_pci_core_device *vdev,
+ struct vfio_region_info *info,
+ struct vfio_info_cap *caps)
+{
+ struct vfio_pci_cxl_state *cxl = vdev->cxl;
+ struct vfio_region_info_cap_sparse_mmap *sparse;
+ u64 bar_len, comp_start, comp_end;
+ u64 before_end, after_start;
+ struct vfio_region_sparse_mmap_area areas[2];
+ u32 nr_areas = 0, cap_size;
+ int ret;
+
+ if (!cxl)
+ return -ENOTTY;
+ if (info->index != cxl->info.comp_reg_bir)
+ return -ENOTTY;
+ if (!cxl->info.comp_reg_size)
+ return -ENOTTY;
+ if (!vdev->bar_mmap_supported[info->index])
+ return -ENOTTY;
+
+ bar_len = pci_resource_len(vdev->pdev, info->index);
+ comp_start = cxl->info.comp_reg_offset;
+ comp_end = comp_start + cxl->info.comp_reg_size;
+
+ before_end = round_down(comp_start, PAGE_SIZE);
+ after_start = round_up(comp_end, PAGE_SIZE);
+
+ if (before_end > 0) {
+ areas[nr_areas].offset = 0;
+ areas[nr_areas].size = before_end;
+ nr_areas++;
+ }
+ if (after_start < bar_len) {
+ areas[nr_areas].offset = after_start;
+ areas[nr_areas].size = bar_len - after_start;
+ nr_areas++;
+ }
+
+ info->offset = VFIO_PCI_INDEX_TO_OFFSET(info->index);
+ info->size = bar_len;
+ info->flags = VFIO_REGION_INFO_FLAG_READ |
+ VFIO_REGION_INFO_FLAG_WRITE;
+ if (!nr_areas)
+ return 0;
+
+ info->flags |= VFIO_REGION_INFO_FLAG_MMAP;
+
+ cap_size = struct_size(sparse, areas, nr_areas);
+ sparse = kzalloc(cap_size, GFP_KERNEL);
+ if (!sparse)
+ return -ENOMEM;
+
+ sparse->header.id = VFIO_REGION_INFO_CAP_SPARSE_MMAP;
+ sparse->header.version = 1;
+ sparse->nr_areas = nr_areas;
+ memcpy(sparse->areas, areas, nr_areas * sizeof(areas[0]));
+
+ ret = vfio_info_add_capability(caps, &sparse->header, cap_size);
+ kfree(sparse);
+ return ret;
}
diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index a10ed733f0e3..b9f30a33515a 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -1898,8 +1898,15 @@ ssize_t vfio_pci_config_rw_single(struct vfio_pci_core_device *vdev,
/*
* Chop accesses into aligned chunks containing no more than a
* single capability. Caller increments to the next chunk.
+ *
+ * For CXL Type-2 devices also clip at the CXL Device DVSEC body
+ * boundary so the generic perm-bits path handles the DVSEC
+ * header bytes and the CXL hook handles the body bytes; without
+ * this clip a 32-bit access at dvsec + 0x08 would span the
+ * generic Header2 word and the CXL CAPABILITY word.
*/
count = min(count, vfio_pci_cap_remaining_dword(vdev, *ppos));
+ count = min(count, vfio_pci_cxl_config_boundary(vdev, *ppos));
if (count >= 4 && !(*ppos % 4))
count = 4;
else if (count >= 2 && !(*ppos % 2))
@@ -1909,6 +1916,30 @@ ssize_t vfio_pci_config_rw_single(struct vfio_pci_core_device *vdev,
ret = count;
+ /*
+ * Give the CXL Type-2 hook first claim on this access: if the
+ * range lies inside the CXL Device DVSEC body, forward it to
+ * cxl-core's register-virtualization helpers instead of the
+ * standard perm-bits path. -ENOENT means "not for me; use the
+ * default path"; any other negative value is a hard error.
+ */
+ if (vdev->cxl) {
+ __le32 le_val = 0;
+ ssize_t cxl_ret;
+
+ if (iswrite && copy_from_user(&le_val, buf, count))
+ return -EFAULT;
+ cxl_ret = vfio_pci_cxl_config_rw(vdev, *ppos, count, &le_val,
+ iswrite);
+ if (cxl_ret >= 0) {
+ if (!iswrite && copy_to_user(buf, &le_val, count))
+ return -EFAULT;
+ return cxl_ret;
+ }
+ if (cxl_ret != -ENOENT)
+ return cxl_ret;
+ }
+
cap_id = vdev->pci_config_map[*ppos];
if (cap_id == PCI_CAP_ID_INVALID) {
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 05ab4ae59157..2d2dae278d1e 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -501,6 +501,23 @@ static void vfio_pci_core_map_bars(struct vfio_pci_core_device *vdev)
if (!pci_resource_len(pdev, i))
continue;
+ /*
+ * cxl-core already holds request_mem_region() on the CXL
+ * component register sub-range of this BAR. Skip the
+ * full-BAR request so we do not collide with that
+ * sub-region; vfio still owns the BAR via the driver
+ * binding and the iomap below succeeds without a region
+ * claim.
+ */
+ if (vdev->cxl && bar == vfio_pci_cxl_get_component_reg_bar(vdev)) {
+ vdev->barmap[bar] = pci_iomap(pdev, bar, 0);
+ if (!vdev->barmap[bar]) {
+ pci_dbg(pdev, "Failed to iomap region %d\n", bar);
+ vdev->barmap[bar] = IOMEM_ERR_PTR(-ENOMEM);
+ }
+ continue;
+ }
+
if (pci_request_selected_regions(pdev, 1 << bar, "vfio")) {
pci_dbg(pdev, "Failed to reserve region %d\n", bar);
vdev->barmap[bar] = IOMEM_ERR_PTR(-EBUSY);
@@ -701,7 +718,10 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
if (IS_ERR_OR_NULL(vdev->barmap[bar]))
continue;
pci_iounmap(pdev, vdev->barmap[bar]);
- pci_release_selected_regions(pdev, 1 << bar);
+ /* Mirror the asymmetric setup-time skip in map_bars(). */
+ if (!(vdev->cxl &&
+ i == vfio_pci_cxl_get_component_reg_bar(vdev)))
+ pci_release_selected_regions(pdev, 1 << bar);
vdev->barmap[bar] = NULL;
}
@@ -1051,6 +1071,16 @@ static int vfio_pci_ioctl_get_info(struct vfio_pci_core_device *vdev,
info.num_regions = VFIO_PCI_NUM_REGIONS + vdev->num_regions;
info.num_irqs = VFIO_PCI_NUM_IRQS;
+ if (vdev->cxl) {
+ ret = vfio_pci_cxl_get_info(vdev, &caps);
+ if (ret) {
+ pci_warn(vdev->pdev,
+ "Failed to add CXL info capability\n");
+ return ret;
+ }
+ info.flags |= VFIO_DEVICE_FLAGS_CXL;
+ }
+
ret = vfio_pci_info_zdev_add_caps(vdev, &caps);
if (ret && ret != -ENODEV) {
pci_warn(vdev->pdev,
@@ -1093,6 +1123,12 @@ int vfio_pci_ioctl_get_region_info(struct vfio_device *core_vdev,
struct pci_dev *pdev = vdev->pdev;
int i, ret;
+ if (vdev->cxl) {
+ ret = vfio_pci_cxl_get_region_info(vdev, info, caps);
+ if (ret != -ENOTTY)
+ return ret;
+ }
+
switch (info->index) {
case VFIO_PCI_CONFIG_REGION_INDEX:
info->offset = VFIO_PCI_INDEX_TO_OFFSET(info->index);
@@ -1811,6 +1847,12 @@ int vfio_pci_core_mmap(struct vfio_device *core_vdev, struct vm_area_struct *vma
if (req_start + req_len > phys_len)
return -EINVAL;
+ /* Block mmap of the CXL component register block. */
+ if (vdev->cxl &&
+ index == vfio_pci_cxl_get_component_reg_bar(vdev) &&
+ vfio_pci_cxl_mmap_overlaps_comp_regs(vdev, req_start, req_len))
+ return -EINVAL;
+
/*
* Even though we don't make use of the barmap for the mmap,
* we need to request the region and the barmap tracks that.
diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
index 94bf7c6a8548..88b89da6dd5a 100644
--- a/drivers/vfio/pci/vfio_pci_priv.h
+++ b/drivers/vfio/pci/vfio_pci_priv.h
@@ -114,6 +114,23 @@ int vfio_pci_cxl_acquire(struct vfio_pci_core_device *vdev);
void vfio_pci_cxl_release(struct vfio_pci_core_device *vdev);
int vfio_pci_cxl_open(struct vfio_pci_core_device *vdev);
void vfio_pci_cxl_close(struct vfio_pci_core_device *vdev);
+size_t vfio_pci_cxl_config_boundary(struct vfio_pci_core_device *vdev,
+ loff_t pos);
+ssize_t vfio_pci_cxl_config_rw(struct vfio_pci_core_device *vdev,
+ loff_t pos, size_t count, __le32 *val,
+ bool iswrite);
+int vfio_pci_cxl_get_info(struct vfio_pci_core_device *vdev,
+ struct vfio_info_cap *caps);
+int vfio_pci_cxl_get_region_info(struct vfio_pci_core_device *vdev,
+ struct vfio_region_info *info,
+ struct vfio_info_cap *caps);
+u8 vfio_pci_cxl_get_component_reg_bar(struct vfio_pci_core_device *vdev);
+bool vfio_pci_cxl_get_comp_reg_range(struct vfio_pci_core_device *vdev,
+ size_t *start, size_t *end);
+bool vfio_pci_cxl_mmap_overlaps_comp_regs(struct vfio_pci_core_device *vdev,
+ u64 req_start, u64 req_len);
+bool vfio_pci_cxl_bar_overlaps_comp_regs(struct vfio_pci_core_device *vdev,
+ int bar, u64 start, u64 len);
#else
static inline int vfio_pci_cxl_acquire(struct vfio_pci_core_device *vdev)
{
@@ -128,6 +145,61 @@ static inline int vfio_pci_cxl_open(struct vfio_pci_core_device *vdev)
}
static inline void vfio_pci_cxl_close(struct vfio_pci_core_device *vdev) { }
+
+static inline size_t
+vfio_pci_cxl_config_boundary(struct vfio_pci_core_device *vdev, loff_t pos)
+{
+ return SIZE_MAX;
+}
+
+static inline ssize_t
+vfio_pci_cxl_config_rw(struct vfio_pci_core_device *vdev, loff_t pos,
+ size_t count, __le32 *val, bool iswrite)
+{
+ return -ENOENT;
+}
+
+static inline int
+vfio_pci_cxl_get_info(struct vfio_pci_core_device *vdev,
+ struct vfio_info_cap *caps)
+{
+ return 0;
+}
+
+static inline int
+vfio_pci_cxl_get_region_info(struct vfio_pci_core_device *vdev,
+ struct vfio_region_info *info,
+ struct vfio_info_cap *caps)
+{
+ return -ENOTTY;
+}
+
+static inline u8
+vfio_pci_cxl_get_component_reg_bar(struct vfio_pci_core_device *vdev)
+{
+ return U8_MAX;
+}
+
+static inline bool
+vfio_pci_cxl_get_comp_reg_range(struct vfio_pci_core_device *vdev,
+ size_t *start, size_t *end)
+{
+ return false;
+}
+
+static inline bool
+vfio_pci_cxl_mmap_overlaps_comp_regs(struct vfio_pci_core_device *vdev,
+ u64 req_start, u64 req_len)
+{
+ return false;
+}
+
+static inline bool
+vfio_pci_cxl_bar_overlaps_comp_regs(struct vfio_pci_core_device *vdev,
+ int bar, u64 start, u64 len)
+{
+ return false;
+}
#endif
static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
index 3bfbb879a005..a856f29a3c94 100644
--- a/drivers/vfio/pci/vfio_pci_rdwr.c
+++ b/drivers/vfio/pci/vfio_pci_rdwr.c
@@ -236,6 +236,15 @@ ssize_t vfio_pci_bar_rw(struct vfio_pci_core_device *vdev, char __user *buf,
count = min(count, (size_t)(end - pos));
+ /*
+ * Reject raw BAR access that would land inside the CXL component
+ * register sub-range. cxl-core owns the per-field shadow and
+ * spec-defined write semantics; userspace must use the dedicated
+ * COMP_REGS VFIO region for that range.
+ */
+ if (vfio_pci_cxl_bar_overlaps_comp_regs(vdev, bar, pos, count))
+ return -EINVAL;
+
if (bar == PCI_ROM_RESOURCE) {
/*
* The ROM can fill less space than the BAR, so we start the
@@ -437,6 +446,14 @@ int vfio_pci_ioeventfd(struct vfio_pci_core_device *vdev, loff_t offset,
pos >= vdev->msix_offset + vdev->msix_size))
return -EINVAL;
+ /*
+ * Disallow ioeventfds arming against the CXL component register
+ * sub-range; that area is fronted by cxl-core's shadow and must
+ * not be reached through the raw BAR map.
+ */
+ if (vfio_pci_cxl_bar_overlaps_comp_regs(vdev, bar, pos, count))
+ return -EINVAL;
+
if (count == 8)
return -EINVAL;
--
2.25.1
next prev parent reply other threads:[~2026-06-25 16:56 UTC|newest]
Thread overview: 12+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-06-25 16:53 [PATCH v3 00/11] vfio/pci: Add CXL Type-2 device passthrough support mhonap
2026-06-25 16:53 ` [PATCH v3 01/11] cxl: Add cxl_get_hdm_info() helper for HDM decoder metadata mhonap
2026-06-25 16:53 ` [PATCH v3 02/11] cxl: Split cxl_await_range_active() from media-ready wait mhonap
2026-06-25 16:53 ` [PATCH v3 03/11] cxl: Record BIR and BAR offset in cxl_register_map mhonap
2026-06-25 16:54 ` [PATCH v3 04/11] cxl: Move component/HDM register defines to uapi/cxl/cxl_regs.h mhonap
2026-06-25 16:54 ` [PATCH v3 05/11] vfio: UAPI for CXL Type-2 device passthrough mhonap
2026-06-25 16:54 ` [PATCH v3 06/11] cxl: Add register-virtualization helpers for vfio Type-2 passthrough mhonap
2026-06-25 16:54 ` [PATCH v3 07/11] vfio/pci: Add CONFIG_VFIO_PCI_CXL with bind-time CXL Type-2 acquisition mhonap
2026-06-25 16:54 ` mhonap [this message]
2026-06-25 16:54 ` [PATCH v3 09/11] selftests/vfio: Add CXL Type-2 device passthrough smoke test mhonap
2026-06-25 16:54 ` [PATCH v3 10/11] docs: vfio-pci: Document CXL Type-2 device passthrough mhonap
2026-06-25 16:54 ` [PATCH v3 11/11] vfio/pci: Provide opt-out for CXL Type-2 extensions mhonap
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260625165407.1769572-9-mhonap@nvidia.com \
--to=mhonap@nvidia.com \
--cc=alejandro.lucero-palau@amd.com \
--cc=alex@shazbot.org \
--cc=alison.schofield@intel.com \
--cc=ankita@nvidia.com \
--cc=cjia@nvidia.com \
--cc=dave.jiang@intel.com \
--cc=dave@stgolabs.net \
--cc=djbw@kernel.org \
--cc=dmatlack@google.com \
--cc=gourry@gourry.net \
--cc=ira.weiny@intel.com \
--cc=jgg@ziepe.ca \
--cc=jic23@kernel.org \
--cc=kjaju@nvidia.com \
--cc=kvm@vger.kernel.org \
--cc=linux-cxl@vger.kernel.org \
--cc=linux-doc@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-kselftest@vger.kernel.org \
--cc=vsethi@nvidia.com \
--cc=zhiw@nvidia.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox