* [RFC v2 00/15] vfio: introduce vfio-cxl to support CXL type-2 accelerator passthrough
@ 2025-12-09 16:50 mhonap
2025-12-09 16:50 ` [RFC v2 01/15] cxl: factor out cxl_await_range_active() and cxl_media_ready() mhonap
` (14 more replies)
0 siblings, 15 replies; 25+ messages in thread
From: mhonap @ 2025-12-09 16:50 UTC (permalink / raw)
To: aniketa, ankita, alwilliamson, vsethi, jgg, mochs, skolothumtho,
alejandro.lucero-palau, dave, jonathan.cameron, dave.jiang,
alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams, jgg,
yishaih, kevin.tian
Cc: cjia, kwankhede, targupta, zhiw, kjaju, linux-kernel, linux-cxl,
kvm, mhonap
From: Manish Honap <mhonap@nvidia.com>
Hello all,
This is the re-spin of the VFIO-CXL patches that Zhi sent earlier[1],
rebased onto v20 of Alejandro's "CXL type-2 device support" series[3].
This patchset only modifies Zhi's RFC v1 to match the CXL type-2 device
support currently under upstream review. In the next version, I will
reorganize the series to first create a VFIO variant driver for the
QEMU CXL accelerator emulated device and then implement the features
incrementally for ease of review.
This will create a logical separation between the code required in
VFIO-CXL-CORE, in CXL-CORE, and in the variant driver. It will also
help reviewers understand the delta between CXL-specific initialization
and standard PCI initialization.
V2 changes:
===========
- Addressed all of Alex's comments on RFC v1.
- Beyond Alex's comments, this series also addresses the leftovers:
  - Introduce an emulation framework.
  - Proper CXL DVSEC configuration emulation.
  - Proper CXL MMIO BAR emulation.
  - Re-use the vfio-pci mmap ops for the CXL region.
  - Introduce a sparse map for the CXL MMIO BAR.
  - Big/little-endian considerations.
  - Correct teardown path (missing because the previous CXL type-2
    patches did not have one).
  - Refine the APIs and architecture of VFIO CXL.
  - Configurable params for HW.
  - Pre-committed CXL region.
  - PCI DOE register passthrough (for CDAT).
  - Media-ready support (SFC does not need this).
- Introduce new changes for the CXL core:
  - Teardown path of the CXL memdev.
  - Committed regions.
  - Media ready for CXL type-2 devices.
- Update the sample driver to the latest VFIO-CXL APIs.
Patchwise description:
=====================
PATCH 1-5: Expose the routines required by vfio-cxl.
PATCH 6: Introduce the preludes of vfio-cxl, including CXL device
initialization and CXL region creation, following the type-2 device
initialization flow from v20 of the type-2 series.
PATCH 7: Expose the CXL region to userspace.
PATCH 8: Add logic to discover a precommitted CXL region.
PATCH 9: Introduce the vfio-cxl read/write routines.
PATCH 10: Prepare to emulate the HDM decoder registers.
PATCH 11: Emulate the HDM decoder registers.
PATCH 12: Emulate the CXL configuration space.
PATCH 13: Tweak vfio-pci to be aware of operating on a CXL device.
PATCH 14: An example variant driver demonstrating the usage of
vfio-cxl-core from the perspective of a VFIO variant driver.
PATCH 15: NULL pointer dereference fixes found during testing.
Background:
===========
Compute Express Link (CXL) is an open standard interconnect built upon
the PCI layers to enhance the performance and efficiency of data
centers by enabling high-speed, low-latency communication between CPUs
and various types of devices such as accelerators and memory expanders.
It supports three key protocols: CXL.io as the control protocol, CXL.cache
as the cache-coherent host-device data transfer protocol, and CXL.mem as
the memory expansion protocol. CXL type-2 devices leverage all three
protocols to integrate seamlessly with host CPUs, providing a unified and
efficient interface for high-speed data transfer and memory sharing. This
integration is crucial for heterogeneous computing environments where
accelerators, such as GPUs and other specialized processors, handle
intensive workloads.
Goal:
=====
Although CXL is built upon the PCI layers, passing through a CXL type-2
device differs from a PCI device according to the CXL specification[4]:
- CXL type-2 device initialization. A CXL type-2 device requires an
  additional initialization sequence beyond the PCI device initialization.
  This sequence can be fairly complicated due to the hierarchy of register
  interfaces, so the standard type-2 driver initialization sequence
  provided by the kernel CXL core is used.
- Create a CXL region and map it to the VM. A mapping between HPA and DPA
  (device physical address) is needed to access the device memory
  directly. HDM decoders in the CXL topology must be configured level by
  level to manage the mapping. After the region is created, it needs to
  be mapped to GPA via the virtual HDM decoders configured by the VM.
- CXL reset. The CXL device reset is different from the PCI device reset;
  the CXL spec introduces a dedicated CXL reset sequence.
- Emulate the CXL DVSECs. The CXL spec defines a set of DVSEC registers
  in the configuration space for device enumeration and control (e.g.
  whether a device is CXL.mem/CXL.cache capable, and enabling/disabling
  those capabilities). They are owned by the kernel CXL core, and the VM
  must not modify them.
- Emulate the CXL MMIO registers. The CXL spec defines a set of CXL MMIO
  registers that can sit in a PCI BAR. The location of each register
  group within the BAR is indicated by the Register Locator in the CXL
  DVSECs. These registers are also owned by the kernel CXL core, and
  some of them need to be emulated.
Design:
=======
To achieve the above, vfio-cxl-core is introduced to host the common
routines that variant drivers require for device passthrough. Similar
to vfio-pci-core, vfio-cxl-core provides common vfio_device_ops
routines for the variant driver to hook, performing the CXL routines
behind them.
Besides that, several extra APIs are introduced for the variant driver
to provide the kernel CXL core with the information needed to
initialize the CXL device, e.g. the device DPA.
CXL is built upon the PCI layers, but with differences. Thus, the aim
is to re-use vfio-pci-core as much as possible, with awareness of
operating on a CXL device.
A new VFIO device region is introduced to expose the CXL region to
userspace. A new CXL VFIO device cap has also been introduced to convey
the necessary CXL device information to userspace.
Test:
=====
To test the patches and hack around, a virtual passthrough setup with
nested virtualization is used.
The host QEMU[5] emulates a CXL type-2 accel device based on Ira's
patches, with changes to emulate the HDM decoders.
While running vfio-cxl in the L1 guest, an example VFIO variant driver
is used to attach to the QEMU CXL accel device.
The L2 guest can be booted via QEMU with the vfio-cxl support in the
VFIOStub.
In the L2 guest, a dummy CXL device driver is provided to attach to the
virtual passthrough device.
The dummy CXL type-2 device driver loads successfully with the kernel
CXL core type-2 support and creates a CXL region by requesting the CXL
core to allocate HPA and DPA and configure the HDM decoders.
To make sure everyone can test the patches, the kernel configs for L1
and L2 are provided in the repos; the required kernel command-line
params and QEMU command line can be found in the demonstration video[6].
Repos:
======
QEMU host:
https://github.com/zhiwang-nvidia/qemu/tree/zhi/vfio-cxl-qemu-host
L1 Kernel:
https://github.com/mhonap-nvidia/vfio-cxl-l1-kernel-rfc-v2/tree/vfio-cxl-l1-kernel-rfc-v2
L1 QEMU:
https://github.com/zhiwang-nvidia/qemu/tree/zhi/vfio-cxl-qemu-l1-rfc
L2 Kernel:
https://github.com/zhiwang-nvidia/linux/tree/zhi/vfio-cxl-l2
Feedback expected:
==================
- Architecture level between vfio-pci-core and vfio-cxl-core.
- Variant driver requirements from more hardware vendors.
- vfio-cxl-core UABI to QEMU.
Applying patches:
=================
This patchset should be applied on top of base commit v6.18-rc2,
along with the patches from [2] and [3].
[1] https://lore.kernel.org/all/20240920223446.1908673-1-zhiw@nvidia.com/
[2] https://patchew.org/QEMU/20230517-rfc-type2-dev-v1-0-6eb2e470981b@intel.com/
[3] https://lore.kernel.org/linux-cxl/20251110153657.2706192-1-alejandro.lucero-palau@amd.com/
[4] https://computeexpresslink.org/cxl-specification/
[5] https://lore.kernel.org/linux-cxl/20251104170305.4163840-1-terry.bowman@amd.com/
[6] https://youtu.be/zlk_ecX9bxs?si=hc8P58AdhGXff3Q7
Manish Honap (15):
cxl: factor out cxl_await_range_active() and cxl_media_ready()
cxl: introduce cxl_get_hdm_reg_info()
cxl: introduce cxl_find_comp_regblock_offset()
cxl: introduce devm_cxl_del_memdev()
cxl: introduce cxl_get_committed_regions()
vfio/cxl: introduce vfio-cxl core preludes
vfio/cxl: expose CXL region to the userspace via a new VFIO device
region
vfio/cxl: discover precommitted CXL region
vfio/cxl: introduce vfio_cxl_core_{read, write}()
vfio/cxl: introduce the register emulation framework
vfio/cxl: introduce the emulation of HDM registers
vfio/cxl: introduce the emulation of CXL configuration space
vfio/pci: introduce CXL device awareness
vfio/cxl: VFIO variant driver for QEMU CXL accel device
cxl: NULL checks for CXL memory devices
drivers/cxl/core/memdev.c | 8 +-
drivers/cxl/core/pci.c | 46 +-
drivers/cxl/core/pci_drv.c | 3 +-
drivers/cxl/core/region.c | 73 +++
drivers/cxl/core/regs.c | 22 +
drivers/cxl/cxlmem.h | 3 +-
drivers/cxl/mem.c | 3 +
drivers/vfio/pci/Kconfig | 13 +
drivers/vfio/pci/Makefile | 5 +
drivers/vfio/pci/cxl-accel/Kconfig | 9 +
drivers/vfio/pci/cxl-accel/Makefile | 4 +
drivers/vfio/pci/cxl-accel/main.c | 143 +++++
drivers/vfio/pci/vfio_cxl_core.c | 695 +++++++++++++++++++++++
drivers/vfio/pci/vfio_cxl_core_emu.c | 778 ++++++++++++++++++++++++++
drivers/vfio/pci/vfio_cxl_core_priv.h | 17 +
drivers/vfio/pci/vfio_pci_core.c | 41 +-
drivers/vfio/pci/vfio_pci_rdwr.c | 11 +-
include/cxl/cxl.h | 9 +
include/linux/vfio_pci_core.h | 96 ++++
include/uapi/linux/vfio.h | 14 +
tools/testing/cxl/Kbuild | 3 +-
tools/testing/cxl/test/mock.c | 21 +-
22 files changed, 1992 insertions(+), 25 deletions(-)
create mode 100644 drivers/vfio/pci/cxl-accel/Kconfig
create mode 100644 drivers/vfio/pci/cxl-accel/Makefile
create mode 100644 drivers/vfio/pci/cxl-accel/main.c
create mode 100644 drivers/vfio/pci/vfio_cxl_core.c
create mode 100644 drivers/vfio/pci/vfio_cxl_core_emu.c
create mode 100644 drivers/vfio/pci/vfio_cxl_core_priv.h
--
2.25.1
* [RFC v2 01/15] cxl: factor out cxl_await_range_active() and cxl_media_ready()
2025-12-09 16:50 [RFC v2 00/15] vfio: introduce vfio-cxl to support CXL type-2 accelerator passthrough Hello all, mhonap
@ 2025-12-09 16:50 ` mhonap
2025-12-22 12:21 ` Jonathan Cameron
2025-12-09 16:50 ` [RFC v2 02/15] cxl: introduce cxl_get_hdm_reg_info() mhonap
` (13 subsequent siblings)
14 siblings, 1 reply; 25+ messages in thread
From: mhonap @ 2025-12-09 16:50 UTC (permalink / raw)
To: aniketa, ankita, alwilliamson, vsethi, jgg, mochs, skolothumtho,
alejandro.lucero-palau, dave, jonathan.cameron, dave.jiang,
alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams, jgg,
yishaih, kevin.tian
Cc: cjia, kwankhede, targupta, zhiw, kjaju, linux-kernel, linux-cxl,
kvm, mhonap, Li Ming
From: Zhi Wang <zhiw@nvidia.com>
Before accessing the CXL device memory after a reset or power-on, the
driver needs to ensure the device memory media is ready.
However, not every CXL device implements the CXL memory device register
group; a CXL type-2 device may not, for example. Thus, calling
cxl_await_media_ready() on such devices leads to a kernel panic. This
problem was found when testing an emulated CXL type-2 device without
CXL memory device registers.
[ 97.662720] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 97.663963] #PF: supervisor read access in kernel mode
[ 97.664860] #PF: error_code(0x0000) - not-present page
[ 97.665753] PGD 0 P4D 0
[ 97.666198] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 97.667053] CPU: 8 UID: 0 PID: 7340 Comm: qemu-system-x86 Tainted: G E 6.11.0-rc2+ #52
[ 97.668656] Tainted: [E]=UNSIGNED_MODULE
[ 97.669340] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 97.671243] RIP: 0010:cxl_await_media_ready+0x1ac/0x1d0
[ 97.672157] Code: e9 03 ff ff ff 0f b7 1d d6 80 31 01 48 8b 7d b8 89 da 48 c7 c6 60 52 c6 b0 e8 00 46 f6 ff e9 27 ff ff ff 49 8b 86 a0 00 00 00 <48> 8b 00 83 e0 0c 48 83 f8 04 0f 94 c0 0f b6 c0 8d 44 80 fb e9 0c
[ 97.675391] RSP: 0018:ffffb5bac7627c20 EFLAGS: 00010246
[ 97.676298] RAX: 0000000000000000 RBX: 000000000000003c RCX: 0000000000000000
[ 97.677527] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 97.678733] RBP: ffffb5bac7627c70 R08: 0000000000000000 R09: 0000000000000000
[ 97.679951] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 97.681144] R13: ffff9ef9028a8000 R14: ffff9ef90c1d1a28 R15: 0000000000000000
[ 97.682370] FS: 00007386aa4f3d40(0000) GS:ffff9efa77200000(0000) knlGS:0000000000000000
[ 97.683721] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 97.684703] CR2: 0000000000000000 CR3: 0000000169a14003 CR4: 0000000000770ef0
[ 97.685909] PKRU: 55555554
[ 97.686397] Call Trace:
[ 97.686819] <TASK>
[ 97.687243] ? show_regs+0x6c/0x80
[ 97.687840] ? __die+0x24/0x80
[ 97.688391] ? page_fault_oops+0x155/0x570
[ 97.689090] ? srso_alias_return_thunk+0x5/0xfbef5
[ 97.689973] ? srso_alias_return_thunk+0x5/0xfbef5
[ 97.690848] ? __vunmap_range_noflush+0x420/0x4e0
[ 97.691700] ? do_user_addr_fault+0x4b2/0x870
[ 97.692606] ? srso_alias_return_thunk+0x5/0xfbef5
[ 97.693502] ? exc_page_fault+0x82/0x1b0
[ 97.694200] ? asm_exc_page_fault+0x27/0x30
[ 97.694975] ? cxl_await_media_ready+0x1ac/0x1d0
[ 97.695816] vfio_cxl_core_enable+0x386/0x800 [vfio_cxl_core]
[ 97.696829] ? srso_alias_return_thunk+0x5/0xfbef5
[ 97.697685] cxl_open_device+0xa6/0xd0 [cxl_accel_vfio_pci]
[ 97.698673] vfio_df_open+0xcb/0xf0
[ 97.699313] vfio_group_fops_unl_ioctl+0x294/0x720
[ 97.700149] ? srso_alias_return_thunk+0x5/0xfbef5
[ 97.701011] ? srso_alias_return_thunk+0x5/0xfbef5
[ 97.701858] __x64_sys_ioctl+0xa3/0xf0
[ 97.702536] x64_sys_call+0x11ad/0x25f0
[ 97.703214] do_syscall_64+0x7e/0x170
[ 97.703878] ? srso_alias_return_thunk+0x5/0xfbef5
[ 97.704726] ? do_syscall_64+0x8a/0x170
[ 97.705425] ? srso_alias_return_thunk+0x5/0xfbef5
[ 97.706282] ? kvm_device_ioctl+0xae/0x130 [kvm]
[ 97.707135] ? srso_alias_return_thunk+0x5/0xfbef5
[ 97.708001] ? srso_alias_return_thunk+0x5/0xfbef5
[ 97.708853] ? syscall_exit_to_user_mode+0x4e/0x250
[ 97.709724] ? srso_alias_return_thunk+0x5/0xfbef5
[ 97.710609] ? do_syscall_64+0x8a/0x170
[ 97.711300] ? srso_alias_return_thunk+0x5/0xfbef5
[ 97.712132] ? exc_page_fault+0x93/0x1b0
[ 97.712839] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 97.713735] RIP: 0033:0x7386ab124ded
[ 97.714382] Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00
[ 97.717664] RSP: 002b:00007ffcda2a6480 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 97.718965] RAX: ffffffffffffffda RBX: 00006293226d9f20 RCX: 00007386ab124ded
[ 97.720222] RDX: 00006293226db730 RSI: 0000000000003b6a RDI: 0000000000000009
[ 97.721522] RBP: 00007ffcda2a64d0 R08: 00006293214e9010 R09: 0000000000000007
[ 97.722858] R10: 00006293226db730 R11: 0000000000000246 R12: 00006293226e0880
[ 97.724193] R13: 00006293226db730 R14: 00007ffcda2a7740 R15: 00006293226d94f0
[ 97.725491] </TASK>
[ 97.725883] Modules linked in: cxl_accel_vfio_pci(E) vfio_cxl_core(E) vfio_pci_core(E) snd_seq_dummy(E) snd_hrtimer(E) snd_seq(E) snd_seq_device(E) snd_timer(E) snd(E) soundcore(E) qrtr(E) intel_rapl_msr(E) intel_rapl_common(E) kvm_amd(E) ccp(E) binfmt_misc(E) kvm(E) crct10dif_pclmul(E) crc32_pclmul(E) polyval_clmulni(E) polyval_generic(E) ghash_clmulni_intel(E) sha256_ssse3(E) sha1_ssse3(E) aesni_intel(E) i2c_i801(E) crypto_simd(E) cryptd(E) i2c_smbus(E) lpc_ich(E) joydev(E) input_leds(E) mac_hid(E) serio_raw(E) msr(E) parport_pc(E) ppdev(E) lp(E) parport(E) efi_pstore(E) dmi_sysfs(E) qemu_fw_cfg(E) autofs4(E) bochs(E) e1000e(E) drm_vram_helper(E) psmouse(E) drm_ttm_helper(E) ahci(E) ttm(E) libahci(E)
[ 97.736690] CR2: 0000000000000000
[ 97.737285] ---[ end trace 0000000000000000 ]---
Factor out cxl_await_range_active() and cxl_media_ready(). Type-3
devices should call both to ensure the media is ready, while type-2
devices should only call cxl_await_range_active().
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Li Ming <ming.li@zohomail.com>
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Li Ming <ming.li@zohomail.com>
Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
drivers/cxl/core/pci.c | 18 +++++++++++-------
drivers/cxl/core/pci_drv.c | 3 +--
drivers/cxl/cxlmem.h | 3 ++-
include/cxl/cxl.h | 1 +
tools/testing/cxl/Kbuild | 3 ++-
tools/testing/cxl/test/mock.c | 21 ++++++++++++++++++---
6 files changed, 35 insertions(+), 14 deletions(-)
diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index 90a0763e72c4..a0cda2a8fdba 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -225,12 +225,11 @@ static int cxl_dvsec_mem_range_active(struct cxl_dev_state *cxlds, int id)
* Wait up to @media_ready_timeout for the device to report memory
* active.
*/
-int cxl_await_media_ready(struct cxl_dev_state *cxlds)
+int cxl_await_range_active(struct cxl_dev_state *cxlds)
{
struct pci_dev *pdev = to_pci_dev(cxlds->dev);
int d = cxlds->cxl_dvsec;
int rc, i, hdm_count;
- u64 md_status;
u16 cap;
rc = pci_read_config_word(pdev,
@@ -251,13 +250,18 @@ int cxl_await_media_ready(struct cxl_dev_state *cxlds)
return rc;
}
- md_status = readq(cxlds->regs.memdev + CXLMDEV_STATUS_OFFSET);
- if (!CXLMDEV_READY(md_status))
- return -EIO;
-
return 0;
}
-EXPORT_SYMBOL_NS_GPL(cxl_await_media_ready, "CXL");
+EXPORT_SYMBOL_NS_GPL(cxl_await_range_active, "CXL");
+
+int cxl_media_ready(struct cxl_dev_state *cxlds)
+{
+ u64 md_status;
+
+ md_status = readq(cxlds->regs.memdev + CXLMDEV_STATUS_OFFSET);
+ return CXLMDEV_READY(md_status) ? 0 : -EIO;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_media_ready, "CXL");
static int cxl_set_mem_enable(struct cxl_dev_state *cxlds, u16 val)
{
diff --git a/drivers/cxl/core/pci_drv.c b/drivers/cxl/core/pci_drv.c
index 4c767e2471b8..6e519b197f0d 100644
--- a/drivers/cxl/core/pci_drv.c
+++ b/drivers/cxl/core/pci_drv.c
@@ -899,8 +899,7 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
if (rc)
return rc;
- rc = cxl_await_media_ready(cxlds);
- if (rc == 0)
+ if (!cxl_await_range_active(cxlds) && !cxl_media_ready(cxlds))
cxlds->media_ready = true;
else
dev_warn(&pdev->dev, "Media not active (%d)\n", rc);
diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
index 918784edd23c..62ace404d681 100644
--- a/drivers/cxl/cxlmem.h
+++ b/drivers/cxl/cxlmem.h
@@ -767,7 +767,8 @@ enum {
int cxl_internal_send_cmd(struct cxl_mailbox *cxl_mbox,
struct cxl_mbox_cmd *cmd);
int cxl_dev_state_identify(struct cxl_memdev_state *mds);
-int cxl_await_media_ready(struct cxl_dev_state *cxlds);
+int cxl_await_range_active(struct cxl_dev_state *cxlds);
+int cxl_media_ready(struct cxl_dev_state *cxlds);
int cxl_enumerate_cmds(struct cxl_memdev_state *mds);
int cxl_mem_dpa_fetch(struct cxl_memdev_state *mds, struct cxl_dpa_info *info);
struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev, u64 serial,
diff --git a/include/cxl/cxl.h b/include/cxl/cxl.h
index e5d1e5a20e06..f18194b9e3e2 100644
--- a/include/cxl/cxl.h
+++ b/include/cxl/cxl.h
@@ -262,6 +262,7 @@ int cxl_map_component_regs(const struct cxl_register_map *map,
struct cxl_component_regs *regs,
unsigned long map_mask);
int cxl_set_capacity(struct cxl_dev_state *cxlds, u64 capacity);
+int cxl_await_range_active(struct cxl_dev_state *cxlds);
struct cxl_memdev *devm_cxl_add_memdev(struct device *host,
struct cxl_dev_state *cxlds,
const struct cxl_memdev_ops *ops);
diff --git a/tools/testing/cxl/Kbuild b/tools/testing/cxl/Kbuild
index d422c81cefa3..4b05b21083ad 100644
--- a/tools/testing/cxl/Kbuild
+++ b/tools/testing/cxl/Kbuild
@@ -5,7 +5,8 @@ ldflags-y += --wrap=acpi_evaluate_integer
ldflags-y += --wrap=acpi_pci_find_root
ldflags-y += --wrap=nvdimm_bus_register
ldflags-y += --wrap=devm_cxl_port_enumerate_dports
-ldflags-y += --wrap=cxl_await_media_ready
+ldflags-y += --wrap=cxl_await_range_active
+ldflags-y += --wrap=cxl_media_ready
ldflags-y += --wrap=devm_cxl_add_rch_dport
ldflags-y += --wrap=cxl_endpoint_parse_cdat
ldflags-y += --wrap=cxl_dport_init_ras_reporting
diff --git a/tools/testing/cxl/test/mock.c b/tools/testing/cxl/test/mock.c
index 92fd5c69bef3..4f1f65e50e87 100644
--- a/tools/testing/cxl/test/mock.c
+++ b/tools/testing/cxl/test/mock.c
@@ -187,7 +187,7 @@ int __wrap_devm_cxl_port_enumerate_dports(struct cxl_port *port)
}
EXPORT_SYMBOL_NS_GPL(__wrap_devm_cxl_port_enumerate_dports, "CXL");
-int __wrap_cxl_await_media_ready(struct cxl_dev_state *cxlds)
+int __wrap_cxl_await_range_active(struct cxl_dev_state *cxlds)
{
int rc, index;
struct cxl_mock_ops *ops = get_cxl_mock_ops(&index);
@@ -195,12 +195,27 @@ int __wrap_cxl_await_media_ready(struct cxl_dev_state *cxlds)
if (ops && ops->is_mock_dev(cxlds->dev))
rc = 0;
else
- rc = cxl_await_media_ready(cxlds);
+ rc = cxl_await_range_active(cxlds);
put_cxl_mock_ops(index);
return rc;
}
-EXPORT_SYMBOL_NS_GPL(__wrap_cxl_await_media_ready, "CXL");
+EXPORT_SYMBOL_NS_GPL(__wrap_cxl_await_range_active, "CXL");
+
+int __wrap_cxl_media_ready(struct cxl_dev_state *cxlds)
+{
+ int rc, index;
+ struct cxl_mock_ops *ops = get_cxl_mock_ops(&index);
+
+ if (ops && ops->is_mock_dev(cxlds->dev))
+ rc = 0;
+ else
+ rc = cxl_media_ready(cxlds);
+ put_cxl_mock_ops(index);
+
+ return rc;
+}
+EXPORT_SYMBOL_NS_GPL(__wrap_cxl_media_ready, "CXL");
struct cxl_dport *__wrap_devm_cxl_add_rch_dport(struct cxl_port *port,
struct device *dport_dev,
--
2.25.1
* [RFC v2 02/15] cxl: introduce cxl_get_hdm_reg_info()
2025-12-09 16:50 [RFC v2 00/15] vfio: introduce vfio-cxl to support CXL type-2 accelerator passthrough Hello all, mhonap
2025-12-09 16:50 ` [RFC v2 01/15] cxl: factor out cxl_await_range_active() and cxl_media_ready() mhonap
@ 2025-12-09 16:50 ` mhonap
2025-12-09 16:50 ` [RFC v2 03/15] cxl: introduce cxl_find_comp_reglock_offset() mhonap
` (12 subsequent siblings)
14 siblings, 0 replies; 25+ messages in thread
From: mhonap @ 2025-12-09 16:50 UTC (permalink / raw)
To: aniketa, ankita, alwilliamson, vsethi, jgg, mochs, skolothumtho,
alejandro.lucero-palau, dave, jonathan.cameron, dave.jiang,
alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams, jgg,
yishaih, kevin.tian
Cc: cjia, kwankhede, targupta, zhiw, kjaju, linux-kernel, linux-cxl,
kvm, mhonap
From: Zhi Wang <zhiw@nvidia.com>
The CXL core knows which CXL register groups a device has. When
initializing the device, the CXL core probes the register groups and
saves the information. The probing sequence is quite complicated.
vfio-cxl requires the HDM register information to emulate the HDM
decoder registers.
Introduce cxl_get_hdm_reg_info() for vfio-cxl to leverage the HDM
register information in the CXL core, so that it does not need to
implement its own probing sequence.
Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
drivers/cxl/core/pci.c | 28 ++++++++++++++++++++++++++++
include/cxl/cxl.h | 4 ++++
2 files changed, 32 insertions(+)
diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index a0cda2a8fdba..f998096050cf 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -532,6 +532,34 @@ int cxl_hdm_decode_init(struct cxl_dev_state *cxlds, struct cxl_hdm *cxlhdm,
}
EXPORT_SYMBOL_NS_GPL(cxl_hdm_decode_init, "CXL");
+int cxl_get_hdm_reg_info(struct cxl_dev_state *cxlds, u64 *count, u64 *offset,
+ u64 *size)
+{
+ struct cxl_component_reg_map *map =
+ &cxlds->reg_map.component_map;
+ struct pci_dev *pdev = to_pci_dev(cxlds->dev);
+ int d = cxlds->cxl_dvsec;
+ u16 cap;
+ int rc;
+
+ if (!map->hdm_decoder.valid) {
+ *count = *offset = *size = 0;
+ return 0;
+ }
+
+ *offset = map->hdm_decoder.offset;
+ *size = map->hdm_decoder.size;
+
+ rc = pci_read_config_word(pdev,
+ d + PCI_DVSEC_CXL_CAP_OFFSET, &cap);
+ if (rc)
+ return rc;
+
+ *count = FIELD_GET(PCI_DVSEC_CXL_HDM_COUNT_MASK, cap);
+ return 0;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_get_hdm_reg_info, "CXL");
+
#define CXL_DOE_TABLE_ACCESS_REQ_CODE 0x000000ff
#define CXL_DOE_TABLE_ACCESS_REQ_CODE_READ 0
#define CXL_DOE_TABLE_ACCESS_TABLE_TYPE 0x0000ff00
diff --git a/include/cxl/cxl.h b/include/cxl/cxl.h
index f18194b9e3e2..d84405afc72e 100644
--- a/include/cxl/cxl.h
+++ b/include/cxl/cxl.h
@@ -289,4 +289,8 @@ int cxl_decoder_detach(struct cxl_region *cxlr,
enum cxl_detach_mode mode);
struct range;
int cxl_get_region_range(struct cxl_region *region, struct range *range);
+
+int cxl_get_hdm_reg_info(struct cxl_dev_state *cxlds, u64 *count, u64 *offset,
+ u64 *size);
+
#endif /* __CXL_CXL_H__ */
--
2.25.1
* [RFC v2 03/15] cxl: introduce cxl_find_comp_regblock_offset()
2025-12-09 16:50 [RFC v2 00/15] vfio: introduce vfio-cxl to support CXL type-2 accelerator passthrough Hello all, mhonap
2025-12-09 16:50 ` [RFC v2 01/15] cxl: factor out cxl_await_range_active() and cxl_media_ready() mhonap
2025-12-09 16:50 ` [RFC v2 02/15] cxl: introduce cxl_get_hdm_reg_info() mhonap
@ 2025-12-09 16:50 ` mhonap
2025-12-09 16:50 ` [RFC v2 04/15] cxl: introduce devm_cxl_del_memdev() mhonap
` (11 subsequent siblings)
14 siblings, 0 replies; 25+ messages in thread
From: mhonap @ 2025-12-09 16:50 UTC (permalink / raw)
To: aniketa, ankita, alwilliamson, vsethi, jgg, mochs, skolothumtho,
alejandro.lucero-palau, dave, jonathan.cameron, dave.jiang,
alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams, jgg,
yishaih, kevin.tian
Cc: cjia, kwankhede, targupta, zhiw, kjaju, linux-kernel, linux-cxl,
kvm, mhonap
From: Zhi Wang <zhiw@nvidia.com>
The CXL core knows which CXL register groups a device has. When
initializing the device, the CXL core probes the register groups and
saves the information. The probing sequence is quite complicated.
vfio-cxl needs to handle the CXL MMIO BAR specially, e.g. to emulate
the HDM decoder registers inside the component registers. Thus, it
needs the offset of the CXL component registers within the PCI BAR
where they sit.
Introduce cxl_find_comp_regblock_offset() for vfio-cxl to leverage the
register information in the CXL core, so that it does not need to
implement its own probing sequence.
Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
drivers/cxl/core/regs.c | 22 ++++++++++++++++++++++
include/cxl/cxl.h | 2 ++
2 files changed, 24 insertions(+)
diff --git a/drivers/cxl/core/regs.c b/drivers/cxl/core/regs.c
index dcf444f1fe48..c5f31627fa20 100644
--- a/drivers/cxl/core/regs.c
+++ b/drivers/cxl/core/regs.c
@@ -345,6 +345,28 @@ static int __cxl_find_regblock_instance(struct pci_dev *pdev, enum cxl_regloc_ty
return -ENODEV;
}
+/**
+ * cxl_find_comp_regblock_offset() - Locate the offset of component
+ * register blocks
+ * @pdev: The CXL PCI device to enumerate.
+ * @offset: Enumeration output, clobbered on error
+ *
+ * Return: 0 if register block enumerated, negative error code otherwise
+ */
+int cxl_find_comp_regblock_offset(struct pci_dev *pdev, u64 *offset)
+{
+ struct cxl_register_map map;
+ int ret;
+
+ ret = cxl_find_regblock(pdev, CXL_REGLOC_RBI_COMPONENT, &map);
+ if (ret)
+ return ret;
+
+ *offset = map.resource;
+ return 0;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_find_comp_regblock_offset, "CXL");
+
/**
* cxl_find_regblock_instance() - Locate a register block by type / index
* @pdev: The CXL PCI device to enumerate.
diff --git a/include/cxl/cxl.h b/include/cxl/cxl.h
index d84405afc72e..28a39bfd74bc 100644
--- a/include/cxl/cxl.h
+++ b/include/cxl/cxl.h
@@ -7,6 +7,7 @@
#include <linux/node.h>
#include <linux/ioport.h>
+#include <linux/pci.h>
#include <linux/range.h>
#include <cxl/mailbox.h>
@@ -292,5 +293,6 @@ int cxl_get_region_range(struct cxl_region *region, struct range *range);
int cxl_get_hdm_reg_info(struct cxl_dev_state *cxlds, u64 *count, u64 *offset,
u64 *size);
+int cxl_find_comp_regblock_offset(struct pci_dev *pdev, u64 *offset);
#endif /* __CXL_CXL_H__ */
--
2.25.1
* [RFC v2 04/15] cxl: introduce devm_cxl_del_memdev()
2025-12-09 16:50 [RFC v2 00/15] vfio: introduce vfio-cxl to support CXL type-2 accelerator passthrough Hello all, mhonap
` (2 preceding siblings ...)
2025-12-09 16:50 ` [RFC v2 03/15] cxl: introduce cxl_find_comp_reglock_offset() mhonap
@ 2025-12-09 16:50 ` mhonap
2025-12-09 16:50 ` [RFC v2 05/15] cxl: introduce cxl_get_committed_regions() mhonap
` (10 subsequent siblings)
14 siblings, 0 replies; 25+ messages in thread
From: mhonap @ 2025-12-09 16:50 UTC (permalink / raw)
To: aniketa, ankita, alwilliamson, vsethi, jgg, mochs, skolothumtho,
alejandro.lucero-palau, dave, jonathan.cameron, dave.jiang,
alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams, jgg,
yishaih, kevin.tian
Cc: cjia, kwankhede, targupta, zhiw, kjaju, linux-kernel, linux-cxl,
kvm, mhonap
From: Zhi Wang <zhiw@nvidia.com>
The teardown path of the kernel CXL core heavily leverages the device
resource manager. Thus, the lifecycle of many created resources is tied
to the refcount of the parent object, and the resources are freed when
the parent object is freed.
However, this creates a gap when an external caller wants to release a
resource while keeping the parent object for a re-initialization
sequence, e.g. in vfio-cxl.
Introduce devm_cxl_del_memdev() for an external caller to destroy the
CXL memdev.
Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
drivers/cxl/core/memdev.c | 6 ++++++
include/cxl/cxl.h | 1 +
2 files changed, 7 insertions(+)
diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
index 9de2ecb2abdc..d281843fb2f4 100644
--- a/drivers/cxl/core/memdev.c
+++ b/drivers/cxl/core/memdev.c
@@ -775,6 +775,12 @@ int devm_cxl_memdev_add_or_reset(struct device *host, struct cxl_memdev *cxlmd)
}
EXPORT_SYMBOL_NS_GPL(devm_cxl_memdev_add_or_reset, "CXL");
+void devm_cxl_del_memdev(struct device *host, struct cxl_memdev *cxlmd)
+{
+ devm_release_action(host, cxl_memdev_unregister, cxlmd);
+}
+EXPORT_SYMBOL_NS_GPL(devm_cxl_del_memdev, "CXL");
+
static long __cxl_memdev_ioctl(struct cxl_memdev *cxlmd, unsigned int cmd,
unsigned long arg)
{
diff --git a/include/cxl/cxl.h b/include/cxl/cxl.h
index 28a39bfd74bc..e3bf8cf0b6d6 100644
--- a/include/cxl/cxl.h
+++ b/include/cxl/cxl.h
@@ -267,6 +267,7 @@ int cxl_await_range_active(struct cxl_dev_state *cxlds);
struct cxl_memdev *devm_cxl_add_memdev(struct device *host,
struct cxl_dev_state *cxlds,
const struct cxl_memdev_ops *ops);
+void devm_cxl_del_memdev(struct device *host, struct cxl_memdev *cxlmd);
struct cxl_port;
struct cxl_root_decoder *cxl_get_hpa_freespace(struct cxl_memdev *cxlmd,
int interleave_ways,
--
2.25.1
* [RFC v2 05/15] cxl: introduce cxl_get_committed_regions()
2025-12-09 16:50 [RFC v2 00/15] vfio: introduce vfio-cxl to support CXL type-2 accelerator passthrough Hello all, mhonap
` (3 preceding siblings ...)
2025-12-09 16:50 ` [RFC v2 04/15] cxl: introduce devm_cxl_del_memdev() mhonap
@ 2025-12-09 16:50 ` mhonap
2025-12-22 12:31 ` Jonathan Cameron
2025-12-09 16:50 ` [RFC v2 06/15] vfio/cxl: introduce vfio-cxl core preludes mhonap
` (9 subsequent siblings)
14 siblings, 1 reply; 25+ messages in thread
From: mhonap @ 2025-12-09 16:50 UTC (permalink / raw)
To: aniketa, ankita, alwilliamson, vsethi, jgg, mochs, skolothumtho,
alejandro.lucero-palau, dave, jonathan.cameron, dave.jiang,
alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams, jgg,
yishaih, kevin.tian
Cc: cjia, kwankhede, targupta, zhiw, kjaju, linux-kernel, linux-cxl,
kvm, mhonap
From: Zhi Wang <zhiw@nvidia.com>
The kernel CXL core can discover CXL regions that were configured and
committed by BIOS or firmware, respect that configuration, and create the
related kernel CXL core data structures without reconfiguring or
recommitting the regions.
However, that information is kept within the kernel CXL core. A type-2
device can be used in the same way, and a type-2 driver needs to know
about such regions before creating CXL regions of its own.
Introduce cxl_get_committed_regions() for a type-2 driver to discover the
committed regions.
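The new helper walks every device on the CXL bus, keeps the committed regions backed by the given memdev, and grows the result array with krealloc(). The same match-and-collect pattern can be sketched in userspace C (realloc() standing in for krealloc(); all structure and function names here are illustrative, not the kernel's):

```c
#include <assert.h>
#include <stdlib.h>

/* Userspace analogue of cxl_get_committed_regions(): walk all regions,
 * keep those that are committed and backed by the given memdev, and
 * grow the result array one element at a time. On allocation failure,
 * free whatever was collected so far, as the kernel code does. */
struct region { int memdev_id; int committed; };

static int collect_committed(const struct region *all, int n_all,
                             int memdev_id,
                             const struct region ***out, int *n_out)
{
    const struct region **res = NULL;
    int n = 0;

    for (int i = 0; i < n_all; i++) {
        const struct region **tmp;

        if (!all[i].committed || all[i].memdev_id != memdev_id)
            continue;

        tmp = realloc(res, sizeof(*res) * (n + 1));
        if (!tmp) {
            free(res);   /* drop the partial result on failure */
            return -1;
        }
        res = tmp;
        res[n++] = &all[i];
    }

    *out = res;
    *n_out = n;
    return 0;
}
```

Note the caller owns the returned array and must free it, mirroring the kernel API where the type-2 driver receives the krealloc'ed array.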
Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
drivers/cxl/core/region.c | 73 +++++++++++++++++++++++++++++++++++++++
include/cxl/cxl.h | 1 +
2 files changed, 74 insertions(+)
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index e89a98780e76..6c368b4641f1 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -2785,6 +2785,79 @@ int cxl_get_region_range(struct cxl_region *region, struct range *range)
}
EXPORT_SYMBOL_NS_GPL(cxl_get_region_range, "CXL");
+struct match_region_info {
+ struct cxl_memdev *cxlmd;
+ struct cxl_region **cxlrs;
+ int nr_regions;
+};
+
+static int match_region_by_device(struct device *match, void *data)
+{
+ struct match_region_info *info = data;
+ struct cxl_endpoint_decoder *cxled;
+ struct cxl_memdev *cxlmd;
+ struct cxl_region_params *p;
+ struct cxl_region *cxlr;
+ int i;
+
+ if (!is_cxl_region(match))
+ return 0;
+
+ lockdep_assert_held(&cxl_rwsem.region);
+ cxlr = to_cxl_region(match);
+ p = &cxlr->params;
+
+ if (p->state != CXL_CONFIG_COMMIT)
+ return 0;
+
+ for (i = 0; i < p->nr_targets; i++) {
+ void *cxlrs;
+
+ cxled = p->targets[i];
+ cxlmd = cxled_to_memdev(cxled);
+
+ if (info->cxlmd != cxlmd)
+ continue;
+
+ cxlrs = krealloc(info->cxlrs, sizeof(cxlr) * (info->nr_regions + 1),
+ GFP_KERNEL);
+ if (!cxlrs) {
+ kfree(info->cxlrs);
+ return -ENOMEM;
+ }
+ info->cxlrs = cxlrs;
+
+ info->cxlrs[info->nr_regions++] = cxlr;
+ }
+
+ return 0;
+}
+
+int cxl_get_committed_regions(struct cxl_memdev *cxlmd, struct cxl_region ***cxlrs, int *num)
+{
+ struct match_region_info info = {0};
+ int ret = 0;
+
+ ret = down_write_killable(&cxl_rwsem.region);
+ if (ret)
+ return ret;
+
+ info.cxlmd = cxlmd;
+
+ ret = bus_for_each_dev(&cxl_bus_type, NULL, &info, match_region_by_device);
+ if (ret) {
+ kfree(info.cxlrs);
+ } else {
+ *cxlrs = info.cxlrs;
+ *num = info.nr_regions;
+ }
+
+ up_write(&cxl_rwsem.region);
+
+ return ret;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_get_committed_regions, "CXL");
+
static ssize_t __create_region_show(struct cxl_root_decoder *cxlrd, char *buf)
{
return sysfs_emit(buf, "region%u\n", atomic_read(&cxlrd->region_id));
diff --git a/include/cxl/cxl.h b/include/cxl/cxl.h
index e3bf8cf0b6d6..0a1f245557f4 100644
--- a/include/cxl/cxl.h
+++ b/include/cxl/cxl.h
@@ -295,5 +295,6 @@ int cxl_get_region_range(struct cxl_region *region, struct range *range);
int cxl_get_hdm_reg_info(struct cxl_dev_state *cxlds, u64 *count, u64 *offset,
u64 *size);
int cxl_find_comp_regblock_offset(struct pci_dev *pdev, u64 *offset);
+int cxl_get_committed_regions(struct cxl_memdev *cxlmd, struct cxl_region ***cxlrs, int *num);
#endif /* __CXL_CXL_H__ */
--
2.25.1
* [RFC v2 06/15] vfio/cxl: introduce vfio-cxl core preludes
2025-12-09 16:50 [RFC v2 00/15] vfio: introduce vfio-cxl to support CXL type-2 accelerator passthrough Hello all, mhonap
` (4 preceding siblings ...)
2025-12-09 16:50 ` [RFC v2 05/15] cxl: introduce cxl_get_committed_regions() mhonap
@ 2025-12-09 16:50 ` mhonap
2025-12-22 13:54 ` Jonathan Cameron
2025-12-09 16:50 ` [RFC v2 07/15] vfio/cxl: expose CXL region to the userspace via a new VFIO device region mhonap
` (8 subsequent siblings)
14 siblings, 1 reply; 25+ messages in thread
From: mhonap @ 2025-12-09 16:50 UTC (permalink / raw)
To: aniketa, ankita, alwilliamson, vsethi, jgg, mochs, skolothumtho,
alejandro.lucero-palau, dave, jonathan.cameron, dave.jiang,
alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams, jgg,
yishaih, kevin.tian
Cc: cjia, kwankhede, targupta, zhiw, kjaju, linux-kernel, linux-cxl,
kvm, mhonap
From: Manish Honap <mhonap@nvidia.com>
In VFIO, common functions used by VFIO variant drivers are managed in a
set of "core" functions. E.g. vfio-pci-core provides the common
functions used by VFIO variant drivers to support PCI device
passthrough.
Although a CXL type-2 device has a PCI-compatible interface for device
configuration and programming, it still needs special handling when the
device is initialized:
- Probing the CXL DVSECs in the configuration space.
- Probing the CXL register groups implemented by the device.
- Configuring the CXL device state required by the kernel CXL core.
- Creating the CXL region.
- Special handling of the CXL MMIO BAR.
Introduce the vfio-cxl core preludes to hold all the common functions used
by VFIO variant drivers to support CXL device passthrough.
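The core embeds struct vfio_pci_core_device inside struct vfio_cxl_core_device and recovers the outer structure from VFIO callbacks with container_of() (see vfio_pci_core_to_cxl() below). The embedding trick can be sketched standalone (simplified structure names, userspace container_of definition):

```c
#include <assert.h>
#include <stddef.h>

/* Recover a pointer to the enclosing structure from a pointer to one
 * of its members, as the kernel's container_of() does. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct pci_core { int bar_count; };

struct cxl_core_dev {
    struct pci_core pci_core;  /* embedded; callbacks hand us &pci_core */
    unsigned int hdm_count;
};

/* Analogue of vfio_pci_core_to_cxl() */
static struct cxl_core_dev *pci_to_cxl(struct pci_core *pci)
{
    return container_of(pci, struct cxl_core_dev, pci_core);
}
```

This is why the CXL close/disable paths can reuse the vfio_device_ops callbacks unchanged: the inner PCI device pointer is enough to reach the CXL state.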
Co-developed-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
drivers/vfio/pci/Kconfig | 10 ++
drivers/vfio/pci/Makefile | 3 +
drivers/vfio/pci/vfio_cxl_core.c | 238 +++++++++++++++++++++++++++++++
include/linux/vfio_pci_core.h | 50 +++++++
4 files changed, 301 insertions(+)
create mode 100644 drivers/vfio/pci/vfio_cxl_core.c
diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index 2b0172f54665..2f441d118f1c 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -7,6 +7,16 @@ config VFIO_PCI_CORE
select VFIO_VIRQFD
select IRQ_BYPASS_MANAGER
+config VFIO_CXL_CORE
+ tristate "VFIO CXL core"
+ select VFIO_PCI_CORE
+ depends on CXL_BUS
+ help
+ Support for the generic PCI VFIO-CXL bus driver which can
+ connect CXL devices to the VFIO framework.
+
+ If you don't know what to do here, say N.
+
config VFIO_PCI_INTX
def_bool y if !S390
depends on VFIO_PCI_CORE
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index cf00c0a7e55c..b51221b94b0b 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -8,6 +8,9 @@ vfio-pci-y := vfio_pci.o
vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
+vfio-cxl-core-y := vfio_cxl_core.o
+obj-$(CONFIG_VFIO_CXL_CORE) += vfio-cxl-core.o
+
obj-$(CONFIG_MLX5_VFIO_PCI) += mlx5/
obj-$(CONFIG_HISI_ACC_VFIO_PCI) += hisilicon/
diff --git a/drivers/vfio/pci/vfio_cxl_core.c b/drivers/vfio/pci/vfio_cxl_core.c
new file mode 100644
index 000000000000..cf53720c0cb7
--- /dev/null
+++ b/drivers/vfio/pci/vfio_cxl_core.c
@@ -0,0 +1,238 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/device.h>
+#include <linux/eventfd.h>
+#include <linux/file.h>
+#include <linux/interrupt.h>
+#include <linux/iommu.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/notifier.h>
+#include <linux/pci.h>
+#include <linux/pm_runtime.h>
+#include <linux/slab.h>
+#include <linux/types.h>
+#include <linux/uaccess.h>
+
+#include "vfio_pci_priv.h"
+
+#define DRIVER_AUTHOR "Zhi Wang <zhiw@nvidia.com>"
+#define DRIVER_DESC "core driver for VFIO based CXL devices"
+
+/* Standard CXL-type 2 driver initialization sequence */
+static int enable_cxl(struct vfio_cxl_core_device *cxl, u16 dvsec,
+ struct vfio_cxl_dev_info *info)
+{
+ struct vfio_pci_core_device *pci = &cxl->pci_core;
+ struct vfio_cxl *cxl_core = cxl->cxl_core;
+ struct pci_dev *pdev = pci->pdev;
+ u64 offset, size, count;
+ int ret;
+
+ ret = cxl_pci_setup_regs(pdev, CXL_REGLOC_RBI_COMPONENT,
+ &cxl_core->cxlds.reg_map);
+ if (ret) {
+ pci_err(pdev, "VFIO-CXL: CXL component registers not found\n");
+ return ret;
+ }
+
+ ret = cxl_get_hdm_reg_info(&cxl_core->cxlds, &count, &offset, &size);
+ if (ret)
+ return ret;
+
+ if (WARN_ON(!count || !size))
+ return -ENODEV;
+
+ cxl->hdm_count = count;
+ cxl->hdm_reg_offset = offset;
+ cxl->hdm_reg_size = size;
+
+ if (!info->no_media_ready) {
+ ret = cxl_await_range_active(&cxl_core->cxlds);
+ if (ret)
+ return -ENODEV;
+
+ cxl_core->cxlds.media_ready = true;
+ } else {
+ /* Some devices don't have media ready support. E.g. AMD SFC. */
+ cxl_core->cxlds.media_ready = true;
+ }
+
+ if (cxl_set_capacity(&cxl_core->cxlds, SZ_256M)) {
+ pci_err(pdev, "dpa capacity setup failed\n");
+ return -ENODEV;
+ }
+
+ cxl_core->cxlmd = devm_cxl_add_memdev(&pdev->dev,
+ &cxl_core->cxlds, NULL);
+ if (IS_ERR(cxl_core->cxlmd))
+ return PTR_ERR(cxl_core->cxlmd);
+
+ cxl_core->region.noncached = info->noncached_region;
+
+ return 0;
+}
+
+static void disable_cxl(struct vfio_cxl_core_device *cxl)
+{
+ struct vfio_cxl *cxl_core = cxl->cxl_core;
+
+ WARN_ON(cxl_core->region.region);
+
+ if (!cxl->hdm_count)
+ return;
+
+ if (cxl_core->cxled) {
+ cxl_decoder_detach(NULL, cxl_core->cxled, 0, DETACH_INVALIDATE);
+ cxl_dpa_free(cxl_core->cxled);
+ }
+
+ if (cxl_core->cxlrd)
+ cxl_put_root_decoder(cxl_core->cxlrd);
+}
+
+int vfio_cxl_core_enable(struct vfio_cxl_core_device *cxl,
+ struct vfio_cxl_dev_info *info)
+{
+ struct vfio_pci_core_device *pci = &cxl->pci_core;
+ struct pci_dev *pdev = pci->pdev;
+ struct vfio_cxl *cxl_core = cxl->cxl_core;
+ u16 dvsec;
+ int ret;
+
+ dvsec = pci_find_dvsec_capability(pdev, PCI_VENDOR_ID_CXL,
+ PCI_DVSEC_CXL_DEVICE);
+ if (!dvsec)
+ return -ENODEV;
+
+ cxl_core = devm_cxl_dev_state_create(&pdev->dev, CXL_DEVTYPE_DEVMEM,
+ pdev->dev.id, dvsec, struct vfio_cxl,
+ cxlds, false);
+ if (!cxl_core) {
+ pci_err(pdev, "VFIO-CXL: CXL state creation failed");
+ return -ENOMEM;
+ }
+
+ ret = vfio_pci_core_enable(pci);
+ if (ret)
+ return ret;
+
+ ret = enable_cxl(cxl, dvsec, info);
+ if (ret)
+ goto err;
+
+ return 0;
+
+err:
+ vfio_pci_core_disable(pci);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(vfio_cxl_core_enable);
+
+void vfio_cxl_core_finish_enable(struct vfio_cxl_core_device *cxl)
+{
+ struct vfio_pci_core_device *pci = &cxl->pci_core;
+
+ vfio_pci_core_finish_enable(pci);
+}
+EXPORT_SYMBOL_GPL(vfio_cxl_core_finish_enable);
+
+static void disable_device(struct vfio_cxl_core_device *cxl)
+{
+ disable_cxl(cxl);
+}
+
+void vfio_cxl_core_disable(struct vfio_cxl_core_device *cxl)
+{
+ disable_device(cxl);
+ vfio_pci_core_disable(&cxl->pci_core);
+}
+EXPORT_SYMBOL_GPL(vfio_cxl_core_disable);
+
+void vfio_cxl_core_close_device(struct vfio_device *vdev)
+{
+ struct vfio_pci_core_device *pci =
+ container_of(vdev, struct vfio_pci_core_device, vdev);
+ struct vfio_cxl_core_device *cxl = vfio_pci_core_to_cxl(pci);
+
+ disable_device(cxl);
+ vfio_pci_core_close_device(vdev);
+}
+EXPORT_SYMBOL_GPL(vfio_cxl_core_close_device);
+
+static int get_hpa_and_request_dpa(struct vfio_cxl_core_device *cxl, u64 size)
+{
+ u64 max;
+ struct vfio_cxl *cxl_core = cxl->cxl_core;
+
+ cxl_core->cxlrd = cxl_get_hpa_freespace(cxl_core->cxlmd, 1,
+ CXL_DECODER_F_RAM |
+ CXL_DECODER_F_TYPE2,
+ &max);
+ if (IS_ERR(cxl_core->cxlrd))
+ return PTR_ERR(cxl_core->cxlrd);
+
+ if (max < size)
+ return -ENOSPC;
+
+ cxl_core->cxled = cxl_request_dpa(cxl_core->cxlmd, CXL_PARTMODE_RAM, size);
+ if (IS_ERR(cxl_core->cxled))
+ return PTR_ERR(cxl_core->cxled);
+
+ return 0;
+}
+
+int vfio_cxl_core_create_cxl_region(struct vfio_cxl_core_device *cxl, u64 size)
+{
+ struct cxl_region *region;
+ struct range range;
+ int ret;
+ struct vfio_cxl *cxl_core = cxl->cxl_core;
+
+ if (WARN_ON(cxl_core->region.region))
+ return -EEXIST;
+
+ ret = get_hpa_and_request_dpa(cxl, size);
+ if (ret)
+ return ret;
+
+ region = cxl_create_region(cxl_core->cxlrd, &cxl_core->cxled, true);
+ if (IS_ERR(region)) {
+ ret = PTR_ERR(region);
+ cxl_dpa_free(cxl_core->cxled);
+ return ret;
+ }
+
+ cxl_get_region_range(region, &range);
+
+ cxl_core->region.addr = range.start;
+ cxl_core->region.size = size;
+ cxl_core->region.region = region;
+ return 0;
+}
+EXPORT_SYMBOL_GPL(vfio_cxl_core_create_cxl_region);
+
+void vfio_cxl_core_destroy_cxl_region(struct vfio_cxl_core_device *cxl)
+{
+ struct vfio_cxl *cxl_core = cxl->cxl_core;
+
+ if (!cxl_core->region.region)
+ return;
+
+ cxl_decoder_detach(NULL, cxl_core->cxled, 0, DETACH_INVALIDATE);
+ cxl_put_root_decoder(cxl_core->cxlrd);
+ cxl_dpa_free(cxl_core->cxled);
+ cxl_core->region.region = NULL;
+}
+EXPORT_SYMBOL_GPL(vfio_cxl_core_destroy_cxl_region);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
+MODULE_IMPORT_NS("CXL");
+MODULE_SOFTDEP("pre: cxl_core cxl_port cxl_acpi cxl-mem");
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index f541044e42a2..a343b91d2580 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -15,6 +15,8 @@
#include <linux/types.h>
#include <linux/uuid.h>
#include <linux/notifier.h>
+#include <cxl/cxl.h>
+#include <cxl/pci.h>
#ifndef VFIO_PCI_CORE_H
#define VFIO_PCI_CORE_H
@@ -96,6 +98,40 @@ struct vfio_pci_core_device {
struct rw_semaphore memory_lock;
};
+struct vfio_cxl_region {
+ struct cxl_region *region;
+ u64 size;
+ u64 addr;
+ bool noncached;
+};
+
+struct vfio_cxl {
+ struct cxl_dev_state cxlds;
+ struct cxl_memdev *cxlmd;
+ struct cxl_root_decoder *cxlrd;
+ struct cxl_port *endpoint;
+ struct cxl_endpoint_decoder *cxled;
+
+ struct vfio_cxl_region region;
+};
+
+struct vfio_cxl_core_device {
+ struct vfio_pci_core_device pci_core;
+ struct vfio_cxl *cxl_core;
+
+ u32 hdm_count;
+ u64 hdm_reg_offset;
+ u64 hdm_reg_size;
+};
+
+struct vfio_cxl_dev_info {
+ unsigned long *dev_caps;
+ struct resource dpa_res;
+ struct resource ram_res;
+ bool no_media_ready;
+ bool noncached_region;
+};
+
/* Will be exported for vfio pci drivers usage */
int vfio_pci_core_register_dev_region(struct vfio_pci_core_device *vdev,
unsigned int type, unsigned int subtype,
@@ -161,4 +197,18 @@ VFIO_IOREAD_DECLARATION(32)
VFIO_IOREAD_DECLARATION(64)
#endif
+static inline struct vfio_cxl_core_device *
+vfio_pci_core_to_cxl(struct vfio_pci_core_device *pci)
+{
+ return container_of(pci, struct vfio_cxl_core_device, pci_core);
+}
+
+int vfio_cxl_core_enable(struct vfio_cxl_core_device *cxl,
+ struct vfio_cxl_dev_info *info);
+void vfio_cxl_core_finish_enable(struct vfio_cxl_core_device *cxl);
+void vfio_cxl_core_disable(struct vfio_cxl_core_device *cxl);
+void vfio_cxl_core_close_device(struct vfio_device *vdev);
+int vfio_cxl_core_create_cxl_region(struct vfio_cxl_core_device *cxl, u64 size);
+void vfio_cxl_core_destroy_cxl_region(struct vfio_cxl_core_device *cxl);
+
#endif /* VFIO_PCI_CORE_H */
--
2.25.1
* [RFC v2 07/15] vfio/cxl: expose CXL region to the userspace via a new VFIO device region
2025-12-09 16:50 [RFC v2 00/15] vfio: introduce vfio-cxl to support CXL type-2 accelerator passthrough Hello all, mhonap
` (5 preceding siblings ...)
2025-12-09 16:50 ` [RFC v2 06/15] vfio/cxl: introduce vfio-cxl core preludes mhonap
@ 2025-12-09 16:50 ` mhonap
2025-12-11 16:06 ` Dave Jiang
2025-12-22 14:00 ` Jonathan Cameron
2025-12-09 16:50 ` [RFC v2 08/15] vfio/cxl: discover precommitted CXL region mhonap
` (7 subsequent siblings)
14 siblings, 2 replies; 25+ messages in thread
From: mhonap @ 2025-12-09 16:50 UTC (permalink / raw)
To: aniketa, ankita, alwilliamson, vsethi, jgg, mochs, skolothumtho,
alejandro.lucero-palau, dave, jonathan.cameron, dave.jiang,
alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams, jgg,
yishaih, kevin.tian
Cc: cjia, kwankhede, targupta, zhiw, kjaju, linux-kernel, linux-cxl,
kvm, mhonap
From: Manish Honap <mhonap@nvidia.com>
To directly access the device memory, a CXL region is required. Creating
a CXL region requires configuring the HDM decoders along the path, mapping
the HPA access level by level until it eventually reaches the DPA in the
CXL topology.
For userspace, e.g. QEMU, to access the CXL region, the region must be
exposed via the VFIO interfaces.
Introduce a new VFIO device region and region ops to expose the created
CXL region when initializing the device in the vfio-cxl-core. Introduce a
new sub-region type for userspace to identify a CXL region.
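The region mmap handler below validates the requested window with overflow-checked arithmetic before calling remap_pfn_range(). The bounds logic can be sketched in userspace C, with the GCC/Clang overflow builtins standing in for the kernel's check_sub_overflow()/check_add_overflow() helpers (values and function name are illustrative):

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12

/* Return 1 iff the mmap request [vm_start, vm_end) at page offset
 * pgoff fits inside a region of region_size bytes, rejecting any
 * request whose intermediate arithmetic would wrap. */
static int mmap_request_ok(uint64_t vm_start, uint64_t vm_end,
                           uint64_t pgoff, uint64_t region_size)
{
    uint64_t req_len, end;

    if (__builtin_sub_overflow(vm_end, vm_start, &req_len))
        return 0;
    /* byte offset of the request within the region, plus its length */
    if (__builtin_add_overflow(pgoff << PAGE_SHIFT, req_len, &end))
        return 0;
    return end <= region_size;
}
```

The kernel version additionally computes the starting PFN (PHYS_PFN of the region base plus pgoff) under the same overflow checks before handing it to remap_pfn_range().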
Co-developed-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
drivers/vfio/pci/vfio_cxl_core.c | 122 +++++++++++++++++++++++++++++++
drivers/vfio/pci/vfio_pci_core.c | 3 +-
include/linux/vfio_pci_core.h | 5 ++
include/uapi/linux/vfio.h | 4 +
4 files changed, 133 insertions(+), 1 deletion(-)
diff --git a/drivers/vfio/pci/vfio_cxl_core.c b/drivers/vfio/pci/vfio_cxl_core.c
index cf53720c0cb7..35d95de47fa8 100644
--- a/drivers/vfio/pci/vfio_cxl_core.c
+++ b/drivers/vfio/pci/vfio_cxl_core.c
@@ -231,6 +231,128 @@ void vfio_cxl_core_destroy_cxl_region(struct vfio_cxl_core_device *cxl)
}
EXPORT_SYMBOL_GPL(vfio_cxl_core_destroy_cxl_region);
+static int vfio_cxl_region_mmap(struct vfio_pci_core_device *pci,
+ struct vfio_pci_region *region,
+ struct vm_area_struct *vma)
+{
+ struct vfio_cxl_region *cxl_region = region->data;
+ u64 req_len, pgoff, req_start, end;
+ int ret;
+
+ if (!(region->flags & VFIO_REGION_INFO_FLAG_MMAP))
+ return -EINVAL;
+
+ if (!(region->flags & VFIO_REGION_INFO_FLAG_READ) &&
+ (vma->vm_flags & VM_READ))
+ return -EPERM;
+
+ if (!(region->flags & VFIO_REGION_INFO_FLAG_WRITE) &&
+ (vma->vm_flags & VM_WRITE))
+ return -EPERM;
+
+ pgoff = vma->vm_pgoff &
+ ((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
+
+ if (check_sub_overflow(vma->vm_end, vma->vm_start, &req_len) ||
+ check_add_overflow(PHYS_PFN(cxl_region->addr), pgoff, &req_start) ||
+ check_add_overflow(PFN_PHYS(pgoff), req_len, &end))
+ return -EOVERFLOW;
+
+ if (end > cxl_region->size)
+ return -EINVAL;
+
+ if (cxl_region->noncached)
+ vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+ vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);
+
+ vm_flags_set(vma, VM_ALLOW_ANY_UNCACHED | VM_IO | VM_PFNMAP |
+ VM_DONTEXPAND | VM_DONTDUMP);
+
+ ret = remap_pfn_range(vma, vma->vm_start, req_start,
+ req_len, vma->vm_page_prot);
+ if (ret)
+ return ret;
+
+ vma->vm_pgoff = req_start;
+
+ return 0;
+}
+
+static ssize_t vfio_cxl_region_rw(struct vfio_pci_core_device *core_dev,
+ char __user *buf, size_t count, loff_t *ppos,
+ bool iswrite)
+{
+ unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
+ struct vfio_cxl_region *cxl_region = core_dev->region[i].data;
+ loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
+
+ if (!count)
+ return 0;
+
+ return vfio_pci_core_do_io_rw(core_dev, false,
+ cxl_region->vaddr,
+ (char __user *)buf, pos, count,
+ 0, 0, iswrite);
+}
+
+static void vfio_cxl_region_release(struct vfio_pci_core_device *vdev,
+ struct vfio_pci_region *region)
+{
+}
+
+static const struct vfio_pci_regops vfio_cxl_regops = {
+ .rw = vfio_cxl_region_rw,
+ .mmap = vfio_cxl_region_mmap,
+ .release = vfio_cxl_region_release,
+};
+
+int vfio_cxl_core_register_cxl_region(struct vfio_cxl_core_device *cxl)
+{
+ struct vfio_pci_core_device *pci = &cxl->pci_core;
+ struct vfio_cxl *cxl_core = cxl->cxl_core;
+ u32 flags;
+ int ret;
+
+ if (WARN_ON(!cxl_core->region.region || cxl_core->region.vaddr))
+ return -EEXIST;
+
+ cxl_core->region.vaddr = ioremap(cxl_core->region.addr, cxl_core->region.size);
+ if (!cxl_core->region.vaddr)
+ return -ENOMEM;
+
+ flags = VFIO_REGION_INFO_FLAG_READ |
+ VFIO_REGION_INFO_FLAG_WRITE |
+ VFIO_REGION_INFO_FLAG_MMAP;
+
+ ret = vfio_pci_core_register_dev_region(pci,
+ PCI_VENDOR_ID_CXL |
+ VFIO_REGION_TYPE_PCI_VENDOR_TYPE,
+ VFIO_REGION_SUBTYPE_CXL,
+ &vfio_cxl_regops,
+ cxl_core->region.size, flags,
+ &cxl_core->region);
+ if (ret) {
+ iounmap(cxl_core->region.vaddr);
+ cxl_core->region.vaddr = NULL;
+ return ret;
+ }
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(vfio_cxl_core_register_cxl_region);
+
+void vfio_cxl_core_unregister_cxl_region(struct vfio_cxl_core_device *cxl)
+{
+ struct vfio_cxl *cxl_core = cxl->cxl_core;
+
+ if (WARN_ON(!cxl_core->region.region || !cxl_core->region.vaddr))
+ return;
+
+ iounmap(cxl_core->region.vaddr);
+ cxl_core->region.vaddr = NULL;
+}
+EXPORT_SYMBOL_GPL(vfio_cxl_core_unregister_cxl_region);
+
MODULE_LICENSE("GPL");
MODULE_AUTHOR(DRIVER_AUTHOR);
MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 7dcf5439dedc..c0695b5db66d 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1698,12 +1698,13 @@ static vm_fault_t vfio_pci_mmap_page_fault(struct vm_fault *vmf)
return vfio_pci_mmap_huge_fault(vmf, 0);
}
-static const struct vm_operations_struct vfio_pci_mmap_ops = {
+const struct vm_operations_struct vfio_pci_mmap_ops = {
.fault = vfio_pci_mmap_page_fault,
#ifdef CONFIG_ARCH_SUPPORTS_HUGE_PFNMAP
.huge_fault = vfio_pci_mmap_huge_fault,
#endif
};
+EXPORT_SYMBOL_GPL(vfio_pci_mmap_ops);
int vfio_pci_core_mmap(struct vfio_device *core_vdev, struct vm_area_struct *vma)
{
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index a343b91d2580..3474835f5d65 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -102,6 +102,7 @@ struct vfio_cxl_region {
struct cxl_region *region;
u64 size;
u64 addr;
+ void *vaddr;
bool noncached;
};
@@ -203,6 +204,8 @@ vfio_pci_core_to_cxl(struct vfio_pci_core_device *pci)
return container_of(pci, struct vfio_cxl_core_device, pci_core);
}
+extern const struct vm_operations_struct vfio_pci_mmap_ops;
+
int vfio_cxl_core_enable(struct vfio_cxl_core_device *cxl,
struct vfio_cxl_dev_info *info);
void vfio_cxl_core_finish_enable(struct vfio_cxl_core_device *cxl);
@@ -210,5 +213,7 @@ void vfio_cxl_core_disable(struct vfio_cxl_core_device *cxl);
void vfio_cxl_core_close_device(struct vfio_device *vdev);
int vfio_cxl_core_create_cxl_region(struct vfio_cxl_core_device *cxl, u64 size);
void vfio_cxl_core_destroy_cxl_region(struct vfio_cxl_core_device *cxl);
+int vfio_cxl_core_register_cxl_region(struct vfio_cxl_core_device *cxl);
+void vfio_cxl_core_unregister_cxl_region(struct vfio_cxl_core_device *cxl);
#endif /* VFIO_PCI_CORE_H */
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 75100bf009ba..95be987d2ed5 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -372,6 +372,10 @@ struct vfio_region_info_cap_type {
/* sub-types for VFIO_REGION_TYPE_GFX */
#define VFIO_REGION_SUBTYPE_GFX_EDID (1)
+/* 1e98 vendor PCI sub-types */
+/* sub-type for VFIO CXL region */
+#define VFIO_REGION_SUBTYPE_CXL (1)
+
/**
* struct vfio_region_gfx_edid - EDID region layout.
*
--
2.25.1
* [RFC v2 08/15] vfio/cxl: discover precommitted CXL region
2025-12-09 16:50 [RFC v2 00/15] vfio: introduce vfio-cxl to support CXL type-2 accelerator passthrough Hello all, mhonap
` (6 preceding siblings ...)
2025-12-09 16:50 ` [RFC v2 07/15] vfio/cxl: expose CXL region to the userspace via a new VFIO device region mhonap
@ 2025-12-09 16:50 ` mhonap
2025-12-22 14:09 ` Jonathan Cameron
2025-12-09 16:50 ` [RFC v2 09/15] vfio/cxl: introduce vfio_cxl_core_{read, write}() mhonap
` (6 subsequent siblings)
14 siblings, 1 reply; 25+ messages in thread
From: mhonap @ 2025-12-09 16:50 UTC (permalink / raw)
To: aniketa, ankita, alwilliamson, vsethi, jgg, mochs, skolothumtho,
alejandro.lucero-palau, dave, jonathan.cameron, dave.jiang,
alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams, jgg,
yishaih, kevin.tian
Cc: cjia, kwankhede, targupta, zhiw, kjaju, linux-kernel, linux-cxl,
kvm, mhonap
From: Zhi Wang <zhiw@nvidia.com>
A type-2 device can have a precommitted CXL region that was configured by
BIOS. Before letting a VFIO CXL variant driver create a new CXL region,
the VFIO CXL core first needs to discover the precommitted CXL region.
Discover the precommitted CXL region when enabling CXL devices.
Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
drivers/vfio/pci/vfio_cxl_core.c | 29 +++++++++++++++++++++++++++--
include/linux/vfio_pci_core.h | 1 +
2 files changed, 28 insertions(+), 2 deletions(-)
diff --git a/drivers/vfio/pci/vfio_cxl_core.c b/drivers/vfio/pci/vfio_cxl_core.c
index 35d95de47fa8..099d35866a39 100644
--- a/drivers/vfio/pci/vfio_cxl_core.c
+++ b/drivers/vfio/pci/vfio_cxl_core.c
@@ -82,11 +82,16 @@ static void disable_cxl(struct vfio_cxl_core_device *cxl)
{
struct vfio_cxl *cxl_core = cxl->cxl_core;
- WARN_ON(cxl_core->region.region);
-
if (!cxl->hdm_count)
return;
+ if (cxl_core->region.precommitted) {
+ kfree(cxl_core->region.region);
+ cxl_core->region.region = NULL;
+ }
+
+ WARN_ON(cxl_core->region.region);
+
if (cxl_core->cxled) {
cxl_decoder_detach(NULL, cxl_core->cxled, 0, DETACH_INVALIDATE);
cxl_dpa_free(cxl_core->cxled);
@@ -96,6 +101,24 @@ static void disable_cxl(struct vfio_cxl_core_device *cxl)
cxl_put_root_decoder(cxl_core->cxlrd);
}
+static void discover_precommitted_region(struct vfio_cxl_core_device *cxl)
+{
+ struct cxl_region **cxlrs = NULL;
+ struct vfio_cxl *cxl_core = cxl->cxl_core;
+ int num, ret;
+
+ ret = cxl_get_committed_regions(cxl_core->cxlmd, &cxlrs, &num);
+ if (ret || !cxlrs) {
+ kfree(cxlrs);
+ return;
+ }
+
+ WARN_ON(num > 1);
+
+ cxl_core->region.region = cxlrs[0];
+ cxl_core->region.precommitted = true;
+ kfree(cxlrs);
+}
+
int vfio_cxl_core_enable(struct vfio_cxl_core_device *cxl,
struct vfio_cxl_dev_info *info)
{
@@ -126,6 +149,8 @@ int vfio_cxl_core_enable(struct vfio_cxl_core_device *cxl,
if (ret)
goto err;
+ discover_precommitted_region(cxl);
+
return 0;
err:
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index 3474835f5d65..7237fcaecbb6 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -104,6 +104,7 @@ struct vfio_cxl_region {
u64 addr;
void *vaddr;
bool noncached;
+ bool precommitted;
};
struct vfio_cxl {
--
2.25.1
* [RFC v2 09/15] vfio/cxl: introduce vfio_cxl_core_{read, write}()
2025-12-09 16:50 [RFC v2 00/15] vfio: introduce vfio-cxl to support CXL type-2 accelerator passthrough Hello all, mhonap
` (7 preceding siblings ...)
2025-12-09 16:50 ` [RFC v2 08/15] vfio/cxl: discover precommitted CXL region mhonap
@ 2025-12-09 16:50 ` mhonap
2025-12-09 16:50 ` [RFC v2 10/15] vfio/cxl: introduce the register emulation framework mhonap
` (5 subsequent siblings)
14 siblings, 0 replies; 25+ messages in thread
From: mhonap @ 2025-12-09 16:50 UTC (permalink / raw)
To: aniketa, ankita, alwilliamson, vsethi, jgg, mochs, skolothumtho,
alejandro.lucero-palau, dave, jonathan.cameron, dave.jiang,
alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams, jgg,
yishaih, kevin.tian
Cc: cjia, kwankhede, targupta, zhiw, kjaju, linux-kernel, linux-cxl,
kvm, mhonap
From: Zhi Wang <zhiw@nvidia.com>
The read/write callbacks in vfio_device_ops are for accessing the device
when mmap is not supported. They are also used by VFIO variant drivers
to emulate device registers.
The CXL spec describes the standard programming interface, part of which
consists of MMIO registers sitting in a PCI BAR. Some of them are emulated
when passing the CXL type-2 device to a VM. E.g. the HDM decoder registers
are emulated.
Introduce vfio_cxl_core_{read, write}() in the vfio-cxl-core to prepare
for emulating the CXL MMIO registers in the PCI BAR.
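The dispatch relies on VFIO's file-offset encoding: the high bits of *ppos select the region index and the low bits are the offset within it (VFIO_PCI_OFFSET_TO_INDEX / VFIO_PCI_OFFSET_MASK in the diff). A minimal sketch of that encoding:

```c
#include <assert.h>
#include <stdint.h>

/* VFIO packs the region index into the top bits of the file offset,
 * so a single read()/write() entry point can dispatch to per-region
 * handlers. The shift matches the kernel's VFIO_PCI_OFFSET_SHIFT. */
#define VFIO_PCI_OFFSET_SHIFT 40
#define VFIO_PCI_OFFSET_MASK  (((uint64_t)1 << VFIO_PCI_OFFSET_SHIFT) - 1)

static unsigned int offset_to_index(uint64_t ppos)
{
    return (unsigned int)(ppos >> VFIO_PCI_OFFSET_SHIFT);
}

static uint64_t offset_in_region(uint64_t ppos)
{
    return ppos & VFIO_PCI_OFFSET_MASK;
}
```

This is why exporting vfio_pci_rw() is sufficient: vfio_cxl_core_{read, write}() can funnel all accesses through it and the registered CXL region ops are reached via the index encoded in the offset.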
Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
drivers/vfio/pci/vfio_cxl_core.c | 20 ++++++++++++++++++++
drivers/vfio/pci/vfio_pci_core.c | 5 +++--
include/linux/vfio_pci_core.h | 6 ++++++
3 files changed, 29 insertions(+), 2 deletions(-)
diff --git a/drivers/vfio/pci/vfio_cxl_core.c b/drivers/vfio/pci/vfio_cxl_core.c
index 099d35866a39..460f1ee910af 100644
--- a/drivers/vfio/pci/vfio_cxl_core.c
+++ b/drivers/vfio/pci/vfio_cxl_core.c
@@ -378,6 +378,26 @@ void vfio_cxl_core_unregister_cxl_region(struct vfio_cxl_core_device *cxl)
}
EXPORT_SYMBOL_GPL(vfio_cxl_core_unregister_cxl_region);
+ssize_t vfio_cxl_core_read(struct vfio_device *core_vdev, char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ struct vfio_pci_core_device *vdev =
+ container_of(core_vdev, struct vfio_pci_core_device, vdev);
+
+ return vfio_pci_rw(vdev, buf, count, ppos, false);
+}
+EXPORT_SYMBOL_GPL(vfio_cxl_core_read);
+
+ssize_t vfio_cxl_core_write(struct vfio_device *core_vdev, const char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ struct vfio_pci_core_device *vdev =
+ container_of(core_vdev, struct vfio_pci_core_device, vdev);
+
+ return vfio_pci_rw(vdev, (char __user *)buf, count, ppos, true);
+}
+EXPORT_SYMBOL_GPL(vfio_cxl_core_write);
+
MODULE_LICENSE("GPL");
MODULE_AUTHOR(DRIVER_AUTHOR);
MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index c0695b5db66d..502880e927fc 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1520,8 +1520,8 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
}
EXPORT_SYMBOL_GPL(vfio_pci_core_ioctl_feature);
-static ssize_t vfio_pci_rw(struct vfio_pci_core_device *vdev, char __user *buf,
- size_t count, loff_t *ppos, bool iswrite)
+ssize_t vfio_pci_rw(struct vfio_pci_core_device *vdev, char __user *buf,
+ size_t count, loff_t *ppos, bool iswrite)
{
unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
int ret;
@@ -1566,6 +1566,7 @@ static ssize_t vfio_pci_rw(struct vfio_pci_core_device *vdev, char __user *buf,
pm_runtime_put(&vdev->pdev->dev);
return ret;
}
+EXPORT_SYMBOL_GPL(vfio_pci_rw);
ssize_t vfio_pci_core_read(struct vfio_device *core_vdev, char __user *buf,
size_t count, loff_t *ppos)
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index 7237fcaecbb6..a6885b48f26f 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -153,6 +153,8 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
unsigned long arg);
int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
void __user *arg, size_t argsz);
+ssize_t vfio_pci_rw(struct vfio_pci_core_device *vdev, char __user *buf,
+ size_t count, loff_t *ppos, bool iswrite);
ssize_t vfio_pci_core_read(struct vfio_device *core_vdev, char __user *buf,
size_t count, loff_t *ppos);
ssize_t vfio_pci_core_write(struct vfio_device *core_vdev, const char __user *buf,
@@ -216,5 +218,9 @@ int vfio_cxl_core_create_cxl_region(struct vfio_cxl_core_device *cxl, u64 size);
void vfio_cxl_core_destroy_cxl_region(struct vfio_cxl_core_device *cxl);
int vfio_cxl_core_register_cxl_region(struct vfio_cxl_core_device *cxl);
void vfio_cxl_core_unregister_cxl_region(struct vfio_cxl_core_device *cxl);
+ssize_t vfio_cxl_core_read(struct vfio_device *core_vdev, char __user *buf,
+ size_t count, loff_t *ppos);
+ssize_t vfio_cxl_core_write(struct vfio_device *core_vdev, const char __user *buf,
+ size_t count, loff_t *ppos);
#endif /* VFIO_PCI_CORE_H */
--
2.25.1
* [RFC v2 10/15] vfio/cxl: introduce the register emulation framework
2025-12-09 16:50 [RFC v2 00/15] vfio: introduce vfio-cxl to support CXL type-2 accelerator passthrough Hello all, mhonap
` (8 preceding siblings ...)
2025-12-09 16:50 ` [RFC v2 09/15] vfio/cxl: introduce vfio_cxl_core_{read, write}() mhonap
@ 2025-12-09 16:50 ` mhonap
2025-12-09 16:50 ` [RFC v2 11/15] vfio/cxl: introduce the emulation of HDM registers mhonap
` (4 subsequent siblings)
14 siblings, 0 replies; 25+ messages in thread
From: mhonap @ 2025-12-09 16:50 UTC (permalink / raw)
To: aniketa, ankita, alwilliamson, vsethi, jgg, mochs, skolothumtho,
alejandro.lucero-palau, dave, jonathan.cameron, dave.jiang,
alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams, jgg,
yishaih, kevin.tian
Cc: cjia, kwankhede, targupta, zhiw, kjaju, linux-kernel, linux-cxl,
kvm, mhonap
From: Manish Honap <mhonap@nvidia.com>
CXL devices have CXL DVSEC registers in their configuration space
and CXL MMIO registers in their CXL MMIO BAR.
When passing a CXL type-2 device to a VM, the CXL DVSEC registers in the
configuration space and the HDM registers in the CXL MMIO BAR must be
emulated.
Introduce an emulation framework to handle both the CXL configuration
space emulation and the CXL MMIO register emulation.
Co-developed-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
drivers/vfio/pci/Makefile | 2 +-
drivers/vfio/pci/vfio_cxl_core.c | 281 +++++++++++++++++++++++++-
drivers/vfio/pci/vfio_cxl_core_emu.c | 184 +++++++++++++++++
drivers/vfio/pci/vfio_cxl_core_priv.h | 17 ++
include/linux/vfio_pci_core.h | 29 +++
5 files changed, 509 insertions(+), 4 deletions(-)
create mode 100644 drivers/vfio/pci/vfio_cxl_core_emu.c
create mode 100644 drivers/vfio/pci/vfio_cxl_core_priv.h
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index b51221b94b0b..452b7387f9fb 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -8,7 +8,7 @@ vfio-pci-y := vfio_pci.o
vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
-vfio-cxl-core-y := vfio_cxl_core.o
+vfio-cxl-core-y := vfio_cxl_core.o vfio_cxl_core_emu.o
obj-$(CONFIG_VFIO_CXL_CORE) += vfio-cxl-core.o
obj-$(CONFIG_MLX5_VFIO_PCI) += mlx5/
diff --git a/drivers/vfio/pci/vfio_cxl_core.c b/drivers/vfio/pci/vfio_cxl_core.c
index 460f1ee910af..cb75e9f668a7 100644
--- a/drivers/vfio/pci/vfio_cxl_core.c
+++ b/drivers/vfio/pci/vfio_cxl_core.c
@@ -20,6 +20,7 @@
#include <linux/uaccess.h>
#include "vfio_pci_priv.h"
+#include "vfio_cxl_core_priv.h"
#define DRIVER_AUTHOR "Zhi Wang <zhiw@nvidia.com>"
#define DRIVER_DESC "core driver for VFIO based CXL devices"
@@ -119,6 +120,119 @@ static void discover_precommitted_region(struct vfio_cxl_core_device *cxl)
cxl_core->region.precommitted = true;
}
+static int find_bar(struct pci_dev *pdev, u64 *offset, int *bar, u64 size)
+{
+ u64 start, end, flags;
+ int index, i;
+
+ for (i = 0; i < PCI_STD_NUM_BARS; i++) {
+ index = i + PCI_STD_RESOURCES;
+ flags = pci_resource_flags(pdev, index);
+
+ start = pci_resource_start(pdev, index);
+ end = pci_resource_end(pdev, index);
+
+ if (*offset >= start && *offset + size - 1 <= end)
+ break;
+
+ if (flags & IORESOURCE_MEM_64)
+ i++;
+ }
+
+ if (i == PCI_STD_NUM_BARS)
+ return -ENODEV;
+
+ *offset = *offset - start;
+ *bar = index;
+
+ return 0;
+}
+
+static int find_comp_regs(struct vfio_cxl_core_device *cxl)
+{
+ struct vfio_pci_core_device *pci = &cxl->pci_core;
+ struct pci_dev *pdev = pci->pdev;
+ u64 offset;
+ int ret, bar;
+
+ ret = cxl_find_comp_regblock_offset(pdev, &offset);
+ if (ret)
+ return ret;
+
+ ret = find_bar(pdev, &offset, &bar, SZ_64K);
+ if (ret)
+ return ret;
+
+ cxl->comp_reg_bar = bar;
+ cxl->comp_reg_offset = offset;
+ cxl->comp_reg_size = SZ_64K;
+ return 0;
+}
+
+static void clean_virt_regs(struct vfio_cxl_core_device *cxl)
+{
+ kvfree(cxl->comp_reg_virt);
+ kvfree(cxl->config_virt);
+}
+
+static void reset_virt_regs(struct vfio_cxl_core_device *cxl)
+{
+ memcpy(cxl->config_virt, cxl->initial_config_virt, cxl->config_size);
+ memcpy(cxl->comp_reg_virt, cxl->initial_comp_reg_virt, cxl->comp_reg_size);
+}
+
+static int setup_virt_regs(struct vfio_cxl_core_device *cxl)
+{
+ struct vfio_pci_core_device *pci = &cxl->pci_core;
+ struct pci_dev *pdev = pci->pdev;
+ u64 offset = cxl->comp_reg_offset;
+ int bar = cxl->comp_reg_bar;
+ u64 size = cxl->comp_reg_size;
+ void *regs;
+ unsigned int i;
+
+ regs = kvzalloc(size * 2, GFP_KERNEL);
+ if (!regs)
+ return -ENOMEM;
+
+ cxl->comp_reg_virt = regs;
+ cxl->initial_comp_reg_virt = regs + size;
+
+ regs = ioremap(pci_resource_start(pdev, bar) + offset, size);
+ if (!regs) {
+ kvfree(cxl->comp_reg_virt);
+ return -ENOMEM;
+ }
+
+ for (i = 0; i < size; i += 4)
+ *(u32 *)(cxl->initial_comp_reg_virt + i) =
+ cpu_to_le32(readl(regs + i));
+
+ iounmap(regs);
+
+ regs = kvzalloc(pdev->cfg_size * 2, GFP_KERNEL);
+ if (!regs) {
+ kvfree(cxl->comp_reg_virt);
+ return -ENOMEM;
+ }
+
+ cxl->config_virt = regs;
+ cxl->initial_config_virt = regs + pdev->cfg_size;
+ cxl->config_size = pdev->cfg_size;
+
+ regs = cxl->initial_config_virt + cxl->dvsec;
+
+ for (i = 0; i < 0x40; i += 4) {
+ u32 val;
+
+ pci_read_config_dword(pdev, cxl->dvsec + i, &val);
+ *(u32 *)(regs + i) = cpu_to_le32(val);
+ }
+
+ reset_virt_regs(cxl);
+ return 0;
+}
+
int vfio_cxl_core_enable(struct vfio_cxl_core_device *cxl,
struct vfio_cxl_dev_info *info)
{
@@ -133,6 +247,8 @@ int vfio_cxl_core_enable(struct vfio_cxl_core_device *cxl,
if (!dvsec)
return -ENODEV;
+ cxl->dvsec = dvsec;
+
cxl_core = devm_cxl_dev_state_create(&pdev->dev, CXL_DEVTYPE_DEVMEM,
pdev->dev.id, dvsec, struct vfio_cxl,
cxlds, false);
@@ -141,20 +257,37 @@ int vfio_cxl_core_enable(struct vfio_cxl_core_device *cxl,
return -ENOMEM;
}
- ret = vfio_pci_core_enable(pci);
+ ret = find_comp_regs(cxl);
+ if (ret)
+ return -ENODEV;
+
+ ret = setup_virt_regs(cxl);
if (ret)
return ret;
+ ret = vfio_pci_core_enable(pci);
+ if (ret)
+ goto err_pci_core_enable;
+
ret = enable_cxl(cxl, dvsec, info);
if (ret)
- goto err;
+ goto err_enable_cxl;
+
+ ret = vfio_cxl_core_setup_register_emulation(cxl);
+ if (ret)
+ goto err_register_emulation;
discover_precommitted_region(cxl);
return 0;
-err:
+err_register_emulation:
+ disable_cxl(cxl);
+err_enable_cxl:
 vfio_pci_core_disable(pci);
+err_pci_core_enable:
+ clean_virt_regs(cxl);
+
return ret;
}
EXPORT_SYMBOL_GPL(vfio_cxl_core_enable);
@@ -169,7 +302,9 @@ EXPORT_SYMBOL_GPL(vfio_cxl_core_finish_enable);
static void disable_device(struct vfio_cxl_core_device *cxl)
{
+ vfio_cxl_core_clean_register_emulation(cxl);
disable_cxl(cxl);
+ clean_virt_regs(cxl);
}
void vfio_cxl_core_disable(struct vfio_cxl_core_device *cxl)
@@ -383,6 +518,20 @@ ssize_t vfio_cxl_core_read(struct vfio_device *core_vdev, char __user *buf,
{
struct vfio_pci_core_device *vdev =
container_of(core_vdev, struct vfio_pci_core_device, vdev);
+ struct vfio_cxl_core_device *cxl =
+ container_of(vdev, struct vfio_cxl_core_device, pci_core);
+ unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+
+ if (!count)
+ return 0;
+
+ if (index == VFIO_PCI_CONFIG_REGION_INDEX)
+ return vfio_cxl_core_config_rw(core_vdev, buf, count, ppos,
+ false);
+
+ if (index == cxl->comp_reg_bar)
+ return vfio_cxl_core_mmio_bar_rw(core_vdev, buf, count, ppos,
+ false);
return vfio_pci_rw(vdev, buf, count, ppos, false);
}
@@ -393,11 +542,137 @@ ssize_t vfio_cxl_core_write(struct vfio_device *core_vdev, const char __user *bu
{
struct vfio_pci_core_device *vdev =
container_of(core_vdev, struct vfio_pci_core_device, vdev);
+ struct vfio_cxl_core_device *cxl =
+ container_of(vdev, struct vfio_cxl_core_device, pci_core);
+ unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+
+ if (!count)
+ return 0;
+
+ if (index == VFIO_PCI_CONFIG_REGION_INDEX)
+ return vfio_cxl_core_config_rw(core_vdev, (char __user *)buf,
+ count, ppos, true);
+
+ if (index == cxl->comp_reg_bar)
+ return vfio_cxl_core_mmio_bar_rw(core_vdev, (char __user *)buf,
+ count, ppos, true);
return vfio_pci_rw(vdev, (char __user *)buf, count, ppos, true);
}
EXPORT_SYMBOL_GPL(vfio_cxl_core_write);
+static int comp_reg_bar_get_region_info(struct vfio_pci_core_device *pci,
+ void __user *uarg)
+{
+ struct vfio_cxl_core_device *cxl =
+ container_of(pci, struct vfio_cxl_core_device, pci_core);
+ struct pci_dev *pdev = pci->pdev;
+ unsigned long minsz = offsetofend(struct vfio_region_info, offset);
+ struct vfio_info_cap caps = { .buf = NULL, .size = 0 };
+ struct vfio_region_info_cap_sparse_mmap *sparse;
+ struct vfio_region_info info;
+ u64 start, end, len;
+ u32 size;
+ int ret;
+
+ if (copy_from_user(&info, uarg, minsz))
+ return -EFAULT;
+
+ if (info.argsz < minsz)
+ return -EINVAL;
+
+ start = pci_resource_start(pdev, cxl->comp_reg_bar);
+ end = pci_resource_end(pdev, cxl->comp_reg_bar) - start;
+ len = pci_resource_len(pdev, cxl->comp_reg_bar);
+
+ if (!cxl->comp_reg_offset ||
+ cxl->comp_reg_offset + cxl->comp_reg_size - 1 == end) {
+ size = struct_size(sparse, areas, 1);
+
+ sparse = kzalloc(size, GFP_KERNEL);
+ if (!sparse)
+ return -ENOMEM;
+
+ sparse->nr_areas = 1;
+ sparse->areas[0].offset = cxl->comp_reg_offset ? 0 : cxl->comp_reg_size;
+ sparse->areas[0].size = len - cxl->comp_reg_size;
+ } else {
+ size = struct_size(sparse, areas, 2);
+
+ sparse = kzalloc(size, GFP_KERNEL);
+ if (!sparse)
+ return -ENOMEM;
+
+ sparse->nr_areas = 2;
+
+ sparse->areas[0].offset = 0;
+ sparse->areas[0].size = cxl->comp_reg_offset;
+
+ sparse->areas[1].offset = sparse->areas[0].size + cxl->comp_reg_size;
+ sparse->areas[1].size = len - sparse->areas[0].size - cxl->comp_reg_size;
+ }
+
+ sparse->header.id = VFIO_REGION_INFO_CAP_SPARSE_MMAP;
+ sparse->header.version = 1;
+
+ ret = vfio_info_add_capability(&caps, &sparse->header, size);
+ kfree(sparse);
+ if (ret)
+ return ret;
+
+ info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+ info.size = len;
+ info.flags = VFIO_REGION_INFO_FLAG_READ |
+ VFIO_REGION_INFO_FLAG_WRITE |
+ VFIO_REGION_INFO_FLAG_MMAP;
+
+ if (caps.size) {
+ info.flags |= VFIO_REGION_INFO_FLAG_CAPS;
+ if (info.argsz < sizeof(info) + caps.size) {
+ info.argsz = sizeof(info) + caps.size;
+ info.cap_offset = 0;
+ } else {
+ vfio_info_cap_shift(&caps, sizeof(info));
+ if (copy_to_user(uarg + sizeof(info), caps.buf,
+ caps.size)) {
+ kfree(caps.buf);
+ return -EFAULT;
+ }
+ info.cap_offset = sizeof(info);
+ }
+ kfree(caps.buf);
+ }
+
+ return copy_to_user(uarg, &info, minsz) ? -EFAULT : 0;
+}
+
+long vfio_cxl_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
+ unsigned long arg)
+{
+ struct vfio_pci_core_device *pci =
+ container_of(core_vdev, struct vfio_pci_core_device, vdev);
+ struct vfio_cxl_core_device *cxl =
+ container_of(pci, struct vfio_cxl_core_device, pci_core);
+ void __user *uarg = (void __user *)arg;
+
+ if (cmd == VFIO_DEVICE_GET_REGION_INFO) {
+ struct vfio_region_info info;
+ unsigned long minsz = offsetofend(struct vfio_region_info, offset);
+
+ if (copy_from_user(&info, (void *)arg, minsz))
+ return -EFAULT;
+
+ if (info.argsz < minsz)
+ return -EINVAL;
+
+ if (info.index == cxl->comp_reg_bar)
+ return comp_reg_bar_get_region_info(pci, uarg);
+ }
+
+ return vfio_pci_core_ioctl(core_vdev, cmd, arg);
+}
+EXPORT_SYMBOL_GPL(vfio_cxl_core_ioctl);
+
MODULE_LICENSE("GPL");
MODULE_AUTHOR(DRIVER_AUTHOR);
MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/drivers/vfio/pci/vfio_cxl_core_emu.c b/drivers/vfio/pci/vfio_cxl_core_emu.c
new file mode 100644
index 000000000000..a0674bacecd7
--- /dev/null
+++ b/drivers/vfio/pci/vfio_cxl_core_emu.c
@@ -0,0 +1,184 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved
+ */
+
+#include "vfio_cxl_core_priv.h"
+
+void vfio_cxl_core_clean_register_emulation(struct vfio_cxl_core_device *cxl)
+{
+ struct list_head *pos, *n;
+
+ list_for_each_safe(pos, n, &cxl->config_regblocks_head)
+ kfree(list_entry(pos, struct vfio_emulated_regblock, list));
+ list_for_each_safe(pos, n, &cxl->mmio_regblocks_head)
+ kfree(list_entry(pos, struct vfio_emulated_regblock, list));
+}
+
+int vfio_cxl_core_setup_register_emulation(struct vfio_cxl_core_device *cxl)
+{
+ INIT_LIST_HEAD(&cxl->config_regblocks_head);
+ INIT_LIST_HEAD(&cxl->mmio_regblocks_head);
+
+ return 0;
+}
+
+static struct vfio_emulated_regblock *
+find_regblock(struct list_head *head, u64 offset, u64 size)
+{
+ struct vfio_emulated_regblock *block;
+ struct list_head *pos;
+
+ list_for_each(pos, head) {
+ block = list_entry(pos, struct vfio_emulated_regblock, list);
+
+ if (block->range.start == ALIGN_DOWN(offset,
+ range_len(&block->range)))
+ return block;
+ }
+ return NULL;
+}
+
+static ssize_t emulate_read(struct list_head *head, struct vfio_device *vdev,
+ char __user *buf, size_t count, loff_t *ppos)
+{
+ struct vfio_pci_core_device *pci =
+ container_of(vdev, struct vfio_pci_core_device, vdev);
+ struct vfio_cxl_core_device *cxl =
+ container_of(pci, struct vfio_cxl_core_device, pci_core);
+ struct vfio_emulated_regblock *block;
+ u64 pos = *ppos & VFIO_PCI_OFFSET_MASK;
+ ssize_t ret;
+ u32 v;
+
+ block = find_regblock(head, pos, count);
+ if (!block || !block->read)
+ return vfio_pci_rw(pci, buf, count, ppos, false);
+
+ if (WARN_ON_ONCE(!IS_ALIGNED(pos, range_len(&block->range))))
+ return -EINVAL;
+
+ if (count > range_len(&block->range))
+ count = range_len(&block->range);
+
+ ret = block->read(cxl, &v, pos, count);
+ if (ret < 0)
+ return ret;
+
+ if (copy_to_user(buf, &v, count))
+ return -EFAULT;
+
+ return count;
+}
+
+static ssize_t emulate_write(struct list_head *head, struct vfio_device *vdev,
+ char __user *buf, size_t count, loff_t *ppos)
+{
+ struct vfio_pci_core_device *pci =
+ container_of(vdev, struct vfio_pci_core_device, vdev);
+ struct vfio_cxl_core_device *cxl =
+ container_of(pci, struct vfio_cxl_core_device, pci_core);
+ struct vfio_emulated_regblock *block;
+ u64 pos = *ppos & VFIO_PCI_OFFSET_MASK;
+ ssize_t ret;
+ u32 v;
+
+ block = find_regblock(head, pos, count);
+ if (!block || !block->write)
+ return vfio_pci_rw(pci, buf, count, ppos, true);
+
+ if (WARN_ON_ONCE(!IS_ALIGNED(pos, range_len(&block->range))))
+ return -EINVAL;
+
+ if (count > range_len(&block->range))
+ count = range_len(&block->range);
+
+ if (copy_from_user(&v, buf, count))
+ return -EFAULT;
+
+ ret = block->write(cxl, &v, pos, count);
+ if (ret < 0)
+ return ret;
+
+ return count;
+}
+
+ssize_t vfio_cxl_core_config_rw(struct vfio_device *vdev, char __user *buf,
+ size_t count, loff_t *ppos, bool write)
+{
+ struct vfio_pci_core_device *pci =
+ container_of(vdev, struct vfio_pci_core_device, vdev);
+ struct vfio_cxl_core_device *cxl =
+ container_of(pci, struct vfio_cxl_core_device, pci_core);
+ size_t done = 0;
+ ssize_t ret = 0;
+ loff_t tmp, pos = *ppos;
+
+ while (count) {
+ tmp = pos;
+
+ if (count >= 4 && IS_ALIGNED(pos, 4))
+ ret = 4;
+ else if (count >= 2 && IS_ALIGNED(pos, 2))
+ ret = 2;
+ else
+ ret = 1;
+
+ if (write)
+ ret = emulate_write(&cxl->config_regblocks_head,
+ vdev, buf, ret, &tmp);
+ else
+ ret = emulate_read(&cxl->config_regblocks_head,
+ vdev, buf, ret, &tmp);
+ if (ret < 0)
+ return ret;
+
+ count -= ret;
+ done += ret;
+ buf += ret;
+ pos += ret;
+ }
+
+ *ppos += done;
+ return done;
+}
+
+ssize_t vfio_cxl_core_mmio_bar_rw(struct vfio_device *vdev, char __user *buf,
+ size_t count, loff_t *ppos, bool write)
+{
+ struct vfio_pci_core_device *pci =
+ container_of(vdev, struct vfio_pci_core_device, vdev);
+ struct vfio_cxl_core_device *cxl =
+ container_of(pci, struct vfio_cxl_core_device, pci_core);
+ size_t done = 0;
+ ssize_t ret = 0;
+ loff_t tmp, pos = *ppos;
+
+ while (count) {
+ tmp = pos;
+
+ if (count >= 4 && IS_ALIGNED(pos, 4))
+ ret = 4;
+ else if (count >= 2 && IS_ALIGNED(pos, 2))
+ ret = 2;
+ else
+ ret = 1;
+
+ if (write)
+ ret = emulate_write(&cxl->mmio_regblocks_head,
+ vdev, buf, ret, &tmp);
+ else
+ ret = emulate_read(&cxl->mmio_regblocks_head,
+ vdev, buf, ret, &tmp);
+ if (ret < 0)
+ return ret;
+
+ count -= ret;
+ done += ret;
+ buf += ret;
+ pos += ret;
+ }
+
+ *ppos += done;
+ return done;
+}
diff --git a/drivers/vfio/pci/vfio_cxl_core_priv.h b/drivers/vfio/pci/vfio_cxl_core_priv.h
new file mode 100644
index 000000000000..b5d96e3872d2
--- /dev/null
+++ b/drivers/vfio/pci/vfio_cxl_core_priv.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef VFIO_CXL_CORE_PRIV_H
+#define VFIO_CXL_CORE_PRIV_H
+
+#include <linux/vfio_pci_core.h>
+
+#include "vfio_pci_priv.h"
+
+void vfio_cxl_core_clean_register_emulation(struct vfio_cxl_core_device *cxl);
+int vfio_cxl_core_setup_register_emulation(struct vfio_cxl_core_device *cxl);
+
+ssize_t vfio_cxl_core_config_rw(struct vfio_device *vdev, char __user *buf,
+ size_t count, loff_t *ppos, bool write);
+ssize_t vfio_cxl_core_mmio_bar_rw(struct vfio_device *vdev, char __user *buf,
+ size_t count, loff_t *ppos, bool write);
+
+#endif
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index a6885b48f26f..12ded67c7db7 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -117,13 +117,40 @@ struct vfio_cxl {
struct vfio_cxl_region region;
};
+struct vfio_cxl_core_device;
+
+struct vfio_emulated_regblock {
+ struct range range;
+ ssize_t (*read)(struct vfio_cxl_core_device *cxl, void *buf,
+ u64 offset, u64 size);
+ ssize_t (*write)(struct vfio_cxl_core_device *cxl, void *buf,
+ u64 offset, u64 size);
+ struct list_head list;
+};
+
struct vfio_cxl_core_device {
struct vfio_pci_core_device pci_core;
struct vfio_cxl *cxl_core;
+ struct list_head config_regblocks_head;
+ struct list_head mmio_regblocks_head;
+
+ void *initial_comp_reg_virt;
+ void *comp_reg_virt;
+ u64 comp_reg_size;
+
+ void *initial_config_virt;
+ void *config_virt;
+ u64 config_size;
+
+ u16 dvsec;
+
u32 hdm_count;
u64 hdm_reg_offset;
u64 hdm_reg_size;
+
+ int comp_reg_bar;
+ u64 comp_reg_offset;
};
struct vfio_cxl_dev_info {
@@ -222,5 +249,7 @@ ssize_t vfio_cxl_core_read(struct vfio_device *core_vdev, char __user *buf,
size_t count, loff_t *ppos);
ssize_t vfio_cxl_core_write(struct vfio_device *core_vdev, const char __user *buf,
size_t count, loff_t *ppos);
+long vfio_cxl_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
+ unsigned long arg);
#endif /* VFIO_PCI_CORE_H */
--
2.25.1
* [RFC v2 11/15] vfio/cxl: introduce the emulation of HDM registers
2025-12-09 16:50 [RFC v2 00/15] vfio: introduce vfio-cxl to support CXL type-2 accelerator passthrough Hello all, mhonap
` (9 preceding siblings ...)
2025-12-09 16:50 ` [RFC v2 10/15] vfio/cxl: introduce the register emulation framework mhonap
@ 2025-12-09 16:50 ` mhonap
2025-12-11 18:13 ` Dave Jiang
2025-12-09 16:50 ` [RFC v2 12/15] vfio/cxl: introduce the emulation of CXL configuration space mhonap
` (3 subsequent siblings)
14 siblings, 1 reply; 25+ messages in thread
From: mhonap @ 2025-12-09 16:50 UTC (permalink / raw)
To: aniketa, ankita, alwilliamson, vsethi, jgg, mochs, skolothumtho,
alejandro.lucero-palau, dave, jonathan.cameron, dave.jiang,
alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams, jgg,
yishaih, kevin.tian
Cc: cjia, kwankhede, targupta, zhiw, kjaju, linux-kernel, linux-cxl,
kvm, mhonap
From: Manish Honap <mhonap@nvidia.com>
CXL devices have HDM registers in their CXL MMIO BAR. Many HDM registers
require a physical address and, under virtualization, they are owned by
the host.
Thus, the HDM registers need to be emulated so that the guest kernel CXL
core can configure the virtual HDM decoders.
Introduce the emulation of the HDM registers, which emulates the HDM
decoders.
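The core of the decoder emulation is the control-register write policy:
writes are discarded once the decoder is locked, RO bits are preserved
from the current value, and because there is no real hardware commit to
wait for, the COMMITTED bit immediately tracks the COMMIT bit. A minimal
sketch in C (bit names follow the CXL HDM decoder control register; the
function itself is illustrative, not the kernel code):

```c
#include <stdint.h>

#define HDM_CTRL_LOCK_ON_COMMIT	(1u << 8)
#define HDM_CTRL_COMMIT		(1u << 9)
#define HDM_CTRL_COMMITTED	(1u << 10)

/* Compute the new virtual register value for a guest write. */
static uint32_t hdm_ctrl_emulate_write(uint32_t cur, uint32_t new_val)
{
	/* COMMITTED and the error/type bits are read-only to the guest. */
	const uint32_t ro_mask = HDM_CTRL_COMMITTED | (1u << 11) | (1u << 12);

	if (cur & HDM_CTRL_LOCK_ON_COMMIT)
		return cur;		/* locked: discard the write */

	/* keep RO bits from the current value */
	new_val = (new_val & ~ro_mask) | (cur & ro_mask);

	/* emulate commit/de-commit: COMMITTED follows COMMIT */
	if (new_val & HDM_CTRL_COMMIT)
		new_val |= HDM_CTRL_COMMITTED;
	else
		new_val &= ~HDM_CTRL_COMMITTED;

	return new_val;
}
```

The patch applies the same pattern per decoder in
hdm_decoder_n_ctrl_write(), with additional reserved-bit masks that
depend on the global capability register.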
Co-developed-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
drivers/vfio/pci/vfio_cxl_core.c | 7 +-
drivers/vfio/pci/vfio_cxl_core_emu.c | 242 +++++++++++++++++++++++++++
include/linux/vfio_pci_core.h | 2 +
3 files changed, 248 insertions(+), 3 deletions(-)
diff --git a/drivers/vfio/pci/vfio_cxl_core.c b/drivers/vfio/pci/vfio_cxl_core.c
index cb75e9f668a7..c0bdf55997da 100644
--- a/drivers/vfio/pci/vfio_cxl_core.c
+++ b/drivers/vfio/pci/vfio_cxl_core.c
@@ -247,8 +247,6 @@ int vfio_cxl_core_enable(struct vfio_cxl_core_device *cxl,
if (!dvsec)
return -ENODEV;
- cxl->dvsec = dvsec;
-
cxl_core = devm_cxl_dev_state_create(&pdev->dev, CXL_DEVTYPE_DEVMEM,
pdev->dev.id, dvsec, struct vfio_cxl,
cxlds, false);
@@ -257,9 +255,12 @@ int vfio_cxl_core_enable(struct vfio_cxl_core_device *cxl,
return -ENOMEM;
}
+ cxl->dvsec = dvsec;
+ cxl->cxl_core = cxl_core;
+
ret = find_comp_regs(cxl);
if (ret)
- return -ENODEV;
+ return ret;
ret = setup_virt_regs(cxl);
if (ret)
diff --git a/drivers/vfio/pci/vfio_cxl_core_emu.c b/drivers/vfio/pci/vfio_cxl_core_emu.c
index a0674bacecd7..6711ff8975ef 100644
--- a/drivers/vfio/pci/vfio_cxl_core_emu.c
+++ b/drivers/vfio/pci/vfio_cxl_core_emu.c
@@ -5,6 +5,239 @@
#include "vfio_cxl_core_priv.h"
+typedef ssize_t reg_handler_t(struct vfio_cxl_core_device *cxl, void *buf,
+ u64 offset, u64 size);
+
+static struct vfio_emulated_regblock *
+new_reg_block(struct vfio_cxl_core_device *cxl, u64 offset, u64 size,
+ reg_handler_t *read, reg_handler_t *write)
+{
+ struct vfio_emulated_regblock *block;
+
+ block = kzalloc(sizeof(*block), GFP_KERNEL);
+ if (!block)
+ return ERR_PTR(-ENOMEM);
+
+ block->range.start = offset;
+ block->range.end = offset + size - 1;
+ block->read = read;
+ block->write = write;
+
+ INIT_LIST_HEAD(&block->list);
+
+ return block;
+}
+
+static int new_mmio_block(struct vfio_cxl_core_device *cxl, u64 offset, u64 size,
+ reg_handler_t *read, reg_handler_t *write)
+{
+ struct vfio_emulated_regblock *block;
+
+ block = new_reg_block(cxl, offset, size, read, write);
+ if (IS_ERR(block))
+ return PTR_ERR(block);
+
+ list_add_tail(&block->list, &cxl->mmio_regblocks_head);
+ return 0;
+}
+
+static u64 hdm_reg_base(struct vfio_cxl_core_device *cxl)
+{
+ return cxl->comp_reg_offset + cxl->hdm_reg_offset;
+}
+
+static u64 to_hdm_reg_offset(struct vfio_cxl_core_device *cxl, u64 offset)
+{
+ return offset - hdm_reg_base(cxl);
+}
+
+static void *hdm_reg_virt(struct vfio_cxl_core_device *cxl, u64 hdm_reg_offset)
+{
+ return cxl->comp_reg_virt + cxl->hdm_reg_offset + hdm_reg_offset;
+}
+
+static ssize_t virt_hdm_reg_read(struct vfio_cxl_core_device *cxl, void *buf,
+ u64 offset, u64 size)
+{
+ offset = to_hdm_reg_offset(cxl, offset);
+ memcpy(buf, hdm_reg_virt(cxl, offset), size);
+
+ return size;
+}
+
+static ssize_t virt_hdm_reg_write(struct vfio_cxl_core_device *cxl, void *buf,
+ u64 offset, u64 size)
+{
+ offset = to_hdm_reg_offset(cxl, offset);
+ memcpy(hdm_reg_virt(cxl, offset), buf, size);
+
+ return size;
+}
+
+static ssize_t virt_hdm_rev_reg_write(struct vfio_cxl_core_device *cxl,
+ void *buf, u64 offset, u64 size)
+{
+ /* Discard writes on reserved registers. */
+ return size;
+}
+
+static ssize_t hdm_decoder_n_lo_write(struct vfio_cxl_core_device *cxl,
+ void *buf, u64 offset, u64 size)
+{
+ u32 new_val = le32_to_cpu(*(u32 *)buf);
+
+ if (WARN_ON_ONCE(size != 4))
+ return -EINVAL;
+
+ /* Bit [27:0] are reserved. */
+ new_val &= ~GENMASK(27, 0);
+
+ new_val = cpu_to_le32(new_val);
+ offset = to_hdm_reg_offset(cxl, offset);
+ memcpy(hdm_reg_virt(cxl, offset), &new_val, size);
+ return size;
+}
+
+static ssize_t hdm_decoder_global_ctrl_write(struct vfio_cxl_core_device *cxl,
+ void *buf, u64 offset, u64 size)
+{
+ u32 hdm_decoder_global_cap;
+ u32 new_val = le32_to_cpu(*(u32 *)buf);
+
+ if (WARN_ON_ONCE(size != 4))
+ return -EINVAL;
+
+ /* Bit [31:2] are reserved. */
+ new_val &= ~GENMASK(31, 2);
+
+ /* Poison On Decode Error Enable bit is 0 and RO if not supported. */
+ hdm_decoder_global_cap = le32_to_cpu(*(u32 *)hdm_reg_virt(cxl, 0));
+ if (!(hdm_decoder_global_cap & BIT(10)))
+ new_val &= ~BIT(0);
+
+ new_val = cpu_to_le32(new_val);
+ offset = to_hdm_reg_offset(cxl, offset);
+ memcpy(hdm_reg_virt(cxl, offset), &new_val, size);
+ return size;
+}
+
+static ssize_t hdm_decoder_n_ctrl_write(struct vfio_cxl_core_device *cxl,
+ void *buf, u64 offset, u64 size)
+{
+ u32 hdm_decoder_global_cap;
+ u32 ro_mask, rev_mask;
+ u32 new_val = le32_to_cpu(*(u32 *)buf);
+ u32 cur_val;
+
+ if (WARN_ON_ONCE(size != 4))
+ return -EINVAL;
+
+ offset = to_hdm_reg_offset(cxl, offset);
+ cur_val = le32_to_cpu(*(u32 *)hdm_reg_virt(cxl, offset));
+
+ /* Lock on commit */
+ if (cur_val & BIT(8))
+ return size;
+
+ hdm_decoder_global_cap = le32_to_cpu(*(u32 *)hdm_reg_virt(cxl, 0));
+
+ /* RO and reserved bits in the spec */
+ ro_mask = BIT(10) | BIT(11);
+ rev_mask = BIT(15) | GENMASK(31, 28);
+
+ /* bits are not valid for devices */
+ ro_mask |= BIT(12);
+ rev_mask |= GENMASK(19, 16) | GENMASK(23, 20);
+
+ /* bits are reserved when UIO is not supported */
+ if (!(hdm_decoder_global_cap & BIT(13)))
+ rev_mask |= BIT(14) | GENMASK(27, 24);
+
+ /* clear reserved bits */
+ new_val &= ~rev_mask;
+
+ /* keep the RO bits */
+ cur_val &= ro_mask;
+ new_val &= ~ro_mask;
+ new_val |= cur_val;
+
+ /* emulate HDM decoder commit/de-commit */
+ if (new_val & BIT(9))
+ new_val |= BIT(10);
+ else
+ new_val &= ~BIT(10);
+
+ new_val = cpu_to_le32(new_val);
+ memcpy(hdm_reg_virt(cxl, offset), &new_val, size);
+ return size;
+}
+
+static int setup_mmio_emulation(struct vfio_cxl_core_device *cxl)
+{
+ u64 offset, base;
+ int ret;
+
+ base = hdm_reg_base(cxl);
+
+#define ALLOC_BLOCK(offset, size, read, write) do { \
+ ret = new_mmio_block(cxl, offset, size, read, write); \
+ if (ret) \
+ return ret; \
+ } while (0)
+
+ ALLOC_BLOCK(base + 0x4, 4,
+ virt_hdm_reg_read,
+ hdm_decoder_global_ctrl_write);
+
+ offset = base + 0x10;
+ while (offset < base + cxl->hdm_reg_size) {
+ /* HDM N BASE LOW */
+ ALLOC_BLOCK(offset, 4,
+ virt_hdm_reg_read,
+ hdm_decoder_n_lo_write);
+
+ /* HDM N BASE HIGH */
+ ALLOC_BLOCK(offset + 0x4, 4,
+ virt_hdm_reg_read,
+ virt_hdm_reg_write);
+
+ /* HDM N SIZE LOW */
+ ALLOC_BLOCK(offset + 0x8, 4,
+ virt_hdm_reg_read,
+ hdm_decoder_n_lo_write);
+
+ /* HDM N SIZE HIGH */
+ ALLOC_BLOCK(offset + 0xc, 4,
+ virt_hdm_reg_read,
+ virt_hdm_reg_write);
+
+ /* HDM N CONTROL */
+ ALLOC_BLOCK(offset + 0x10, 4,
+ virt_hdm_reg_read,
+ hdm_decoder_n_ctrl_write);
+
+ /* HDM N TARGET LIST LOW */
+ ALLOC_BLOCK(offset + 0x14, 0x4,
+ virt_hdm_reg_read,
+ virt_hdm_rev_reg_write);
+
+ /* HDM N TARGET LIST HIGH */
+ ALLOC_BLOCK(offset + 0x18, 0x4,
+ virt_hdm_reg_read,
+ virt_hdm_rev_reg_write);
+
+ /* HDM N REV */
+ ALLOC_BLOCK(offset + 0x1c, 0x4,
+ virt_hdm_reg_read,
+ virt_hdm_rev_reg_write);
+
+ offset += 0x20;
+ }
+
+#undef ALLOC_BLOCK
+ return 0;
+}
+
void vfio_cxl_core_clean_register_emulation(struct vfio_cxl_core_device *cxl)
{
struct list_head *pos, *n;
@@ -17,10 +250,19 @@ void vfio_cxl_core_clean_register_emulation(struct vfio_cxl_core_device *cxl)
int vfio_cxl_core_setup_register_emulation(struct vfio_cxl_core_device *cxl)
{
+ int ret;
+
INIT_LIST_HEAD(&cxl->config_regblocks_head);
INIT_LIST_HEAD(&cxl->mmio_regblocks_head);
+ ret = setup_mmio_emulation(cxl);
+ if (ret)
+ goto err;
+
return 0;
+err:
+ vfio_cxl_core_clean_register_emulation(cxl);
+ return ret;
}
static struct vfio_emulated_regblock *
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index 12ded67c7db7..31fd28626846 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -251,5 +251,7 @@ ssize_t vfio_cxl_core_write(struct vfio_device *core_vdev, const char __user *bu
size_t count, loff_t *ppos);
long vfio_cxl_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
unsigned long arg);
+int vfio_cxl_core_setup_register_emulation(struct vfio_cxl_core_device *cxl);
+void vfio_cxl_core_clean_register_emulation(struct vfio_cxl_core_device *cxl);
#endif /* VFIO_PCI_CORE_H */
--
2.25.1
* [RFC v2 12/15] vfio/cxl: introduce the emulation of CXL configuration space
2025-12-09 16:50 [RFC v2 00/15] vfio: introduce vfio-cxl to support CXL type-2 accelerator passthrough Hello all, mhonap
` (10 preceding siblings ...)
2025-12-09 16:50 ` [RFC v2 11/15] vfio/cxl: introduce the emulation of HDM registers mhonap
@ 2025-12-09 16:50 ` mhonap
2025-12-09 16:50 ` [RFC v2 13/15] vfio/pci: introduce CXL device awareness mhonap
` (2 subsequent siblings)
14 siblings, 0 replies; 25+ messages in thread
From: mhonap @ 2025-12-09 16:50 UTC (permalink / raw)
To: aniketa, ankita, alwilliamson, vsethi, jgg, mochs, skolothumtho,
alejandro.lucero-palau, dave, jonathan.cameron, dave.jiang,
alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams, jgg,
yishaih, kevin.tian
Cc: cjia, kwankhede, targupta, zhiw, kjaju, linux-kernel, linux-cxl,
kvm, mhonap
From: Zhi Wang <zhiw@nvidia.com>
CXL devices have CXL DVSEC registers in the configuration space.
Many of them affect the behavior of the device, e.g. enabling
CXL.io/CXL.mem/CXL.cache.
However, these configuration registers are owned by the host, and a
virtualization policy must be applied when handling accesses from the
guest.
Introduce the emulation of the CXL configuration space to handle guest
accesses to the virtual CXL configuration space.
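One recurring policy in this emulation is the RW1C (write-1-to-clear)
semantics used for the status registers: writing 1 clears the bit,
writing 0 leaves it untouched, and bits outside the RW1C mask are
handled separately. A minimal sketch in C (the function and mask name
are illustrative only; bit 14 corresponds to the RW1C bit handled in
cxl_status_write() below):

```c
#include <stdint.h>

#define CXL_STATUS_RW1C_BIT	(1u << 14)	/* illustrative name */

/*
 * Compute the new virtual status value for a guest write under RW1C
 * semantics: a written 1 clears the bit, a written 0 preserves it.
 */
static uint16_t rw1c_emulate_write(uint16_t cur, uint16_t new_val,
				   uint16_t rw1c_mask)
{
	return cur & ~(new_val & rw1c_mask);
}
```

The same idea appears as the BIT(14) branch in cxl_status_write(); RW1CS
bits additionally survive some resets, which the patch handles by
forwarding the clear to the hardware in cxl_status_2_write().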
Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
drivers/vfio/pci/vfio_cxl_core_emu.c | 340 ++++++++++++++++++++++++++-
drivers/vfio/pci/vfio_pci_config.c | 10 +-
include/linux/vfio_pci_core.h | 4 +
3 files changed, 346 insertions(+), 8 deletions(-)
diff --git a/drivers/vfio/pci/vfio_cxl_core_emu.c b/drivers/vfio/pci/vfio_cxl_core_emu.c
index 6711ff8975ef..8037737838ba 100644
--- a/drivers/vfio/pci/vfio_cxl_core_emu.c
+++ b/drivers/vfio/pci/vfio_cxl_core_emu.c
@@ -28,6 +28,334 @@ new_reg_block(struct vfio_cxl_core_device *cxl, u64 offset, u64 size,
return block;
}
+static int new_config_block(struct vfio_cxl_core_device *cxl, u64 offset,
+ u64 size, reg_handler_t *read, reg_handler_t *write)
+{
+ struct vfio_emulated_regblock *block;
+
+ block = new_reg_block(cxl, offset, size, read, write);
+ if (IS_ERR(block))
+ return PTR_ERR(block);
+
+ list_add_tail(&block->list, &cxl->config_regblocks_head);
+ return 0;
+}
+
+static ssize_t virt_config_reg_read(struct vfio_cxl_core_device *cxl, void *buf,
+ u64 offset, u64 size)
+{
+ memcpy(buf, cxl->config_virt + offset, size);
+ return size;
+}
+
+static ssize_t virt_config_reg_write(struct vfio_cxl_core_device *cxl, void *buf,
+ u64 offset, u64 size)
+{
+ memcpy(cxl->config_virt + offset, buf, size);
+ return size;
+}
+
+static ssize_t hw_config_reg_read(struct vfio_cxl_core_device *cxl, void *buf,
+ u64 offset, u64 size)
+{
+ return vfio_user_config_read(cxl->pci_core.pdev, offset, buf, size);
+}
+
+static ssize_t hw_config_reg_write(struct vfio_cxl_core_device *cxl, void *buf,
+ u64 offset, u64 size)
+{
+ __le32 write_val = *(__le32 *)buf;
+
+ return vfio_user_config_write(cxl->pci_core.pdev, offset, write_val, size);
+}
+
+static ssize_t cxl_control_write(struct vfio_cxl_core_device *cxl, void *buf,
+ u64 offset, u64 size)
+{
+ u16 lock = le16_to_cpu(*(u16 *)(cxl->config_virt + cxl->dvsec + 0x14));
+ u16 cap3 = le16_to_cpu(*(u16 *)(cxl->config_virt + cxl->dvsec + 0x38));
+ u16 new_val = le16_to_cpu(*(u16 *)buf);
+ u16 rev_mask;
+
+ if (WARN_ON_ONCE(size != 2))
+ return -EINVAL;
+
+ /* register is locked */
+ if (lock & BIT(0))
+ return size;
+
+ /* handle reserved bits in the spec */
+ rev_mask = BIT(13) | BIT(15);
+
+ /* no direct p2p cap */
+ if (!(cap3 & BIT(4)))
+ rev_mask |= BIT(12);
+
+ new_val &= ~rev_mask;
+
+ /* CXL.io is always enabled. */
+ new_val |= BIT(1);
+
+ new_val = cpu_to_le16(new_val);
+ memcpy(cxl->config_virt + offset, &new_val, size);
+ return size;
+}
+
+static ssize_t cxl_status_write(struct vfio_cxl_core_device *cxl, void *buf,
+ u64 offset, u64 size)
+{
+ u16 cur_val = le16_to_cpu(*(u16 *)(cxl->config_virt + offset));
+ u16 new_val = le16_to_cpu(*(u16 *)buf);
+ u16 rev_mask = GENMASK(13, 0) | BIT(15);
+
+ if (WARN_ON_ONCE(size != 2))
+ return -EINVAL;
+
+ /* handle reserved bits in the spec */
+ new_val &= ~rev_mask;
+
+ /* emulate RW1C bit */
+ if (new_val & BIT(14)) {
+ new_val &= ~BIT(14);
+ } else {
+ new_val &= ~BIT(14);
+ new_val |= cur_val & BIT(14);
+ }
+
+ new_val = cpu_to_le16(new_val);
+ memcpy(cxl->config_virt + offset, &new_val, size);
+ return size;
+}
+
+static ssize_t cxl_control_2_write(struct vfio_cxl_core_device *cxl, void *buf,
+ u64 offset, u64 size)
+{
+ struct pci_dev *pdev = cxl->pci_core.pdev;
+ u16 cap2 = le16_to_cpu(*(u16 *)(cxl->config_virt + cxl->dvsec + 0x16));
+ u16 cap3 = le16_to_cpu(*(u16 *)(cxl->config_virt + cxl->dvsec + 0x38));
+ u16 new_val = le16_to_cpu(*(u16 *)buf);
+ u16 rev_mask = GENMASK(15, 6) | BIT(1) | BIT(2);
+ u16 hw_bits = BIT(0) | BIT(1) | BIT(3);
+ bool initiate_cxl_reset = new_val & BIT(2);
+
+ if (WARN_ON_ONCE(size != 2))
+ return -EINVAL;
+
+ /* no desired volatile HDM state after host reset */
+ if (!(cap3 & BIT(2)))
+ rev_mask |= BIT(4);
+
+ /* no modified completion enable */
+ if (!(cap2 & BIT(6)))
+ rev_mask |= BIT(5);
+
+ /* handle reserved bits in the spec */
+ new_val &= ~rev_mask;
+
+ /* bits go to the HW */
+ hw_bits &= new_val;
+
+ /* update the virt regs */
+ new_val = cpu_to_le16(new_val);
+ memcpy(cxl->config_virt + offset, &new_val, size);
+
+ if (hw_bits)
+ pci_write_config_word(pdev, offset, hw_bits);
+
+ if (initiate_cxl_reset) {
+ /* TODO: call linux CXL reset */
+ }
+ return size;
+}
+
+static ssize_t cxl_status_2_write(struct vfio_cxl_core_device *cxl, void *buf,
+ u64 offset, u64 size)
+{
+ struct pci_dev *pdev = cxl->pci_core.pdev;
+ u16 cap3 = le16_to_cpu(*(u16 *)(cxl->config_virt + cxl->dvsec + 0x38));
+ u16 new_val = le16_to_cpu(*(u16 *)buf);
+
+ if (WARN_ON_ONCE(size != 2))
+ return -EINVAL;
+
+ /* write RW1CS if supported */
+ if ((cap3 & BIT(2)) && (new_val & BIT(3)))
+ pci_write_config_word(pdev, offset, BIT(3));
+
+ /* No need to update the virt regs, CXL status reads from the HW */
+ return size;
+}
+
+static ssize_t cxl_lock_write(struct vfio_cxl_core_device *cxl, void *buf,
+ u64 offset, u64 size)
+{
+ u16 cur_val = le16_to_cpu(*(u16 *)(cxl->config_virt + offset));
+ u16 new_val = le16_to_cpu(*(u16 *)buf);
+ u16 rev_mask = GENMASK(15, 1);
+
+ if (WARN_ON_ONCE(size != 2))
+ return -EINVAL;
+
+ /* LOCK is not allowed to be cleared unless conventional reset. */
+ if (cur_val & BIT(0))
+ return size;
+
+ /* handle reserved bits in the spec */
+ new_val &= ~rev_mask;
+
+ new_val = cpu_to_le16(new_val);
+ memcpy(cxl->config_virt + offset, &new_val, size);
+ return size;
+}
+
+static ssize_t cxl_base_lo_write(struct vfio_cxl_core_device *cxl, void *buf,
+ u64 offset, u64 size)
+{
+ u32 new_val = le32_to_cpu(*(u32 *)buf);
+ u32 rev_mask = GENMASK(27, 0);
+
+ if (WARN_ON_ONCE(size != 4))
+ return -EINVAL;
+
+ /* handle reserved bits in the spec */
+ new_val &= ~rev_mask;
+
+ new_val = cpu_to_le32(new_val);
+ memcpy(cxl->config_virt + offset, &new_val, size);
+ return size;
+}
+
+static ssize_t virt_config_reg_ro_write(struct vfio_cxl_core_device *cxl, void *buf,
+ u64 offset, u64 size)
+{
+ return size;
+}
+
+static int setup_config_emulation(struct vfio_cxl_core_device *cxl)
+{
+ u16 offset = 0;
+ int ret;
+
+#define ALLOC_BLOCK(offset, size, read, write) do { \
+ ret = new_config_block(cxl, offset, size, read, write); \
+ if (ret) \
+ return ret; \
+ } while (0)
+
+ ALLOC_BLOCK(cxl->dvsec, 4,
+ virt_config_reg_read,
+ virt_config_reg_ro_write);
+
+ ALLOC_BLOCK(cxl->dvsec + 0x4, 4,
+ virt_config_reg_read,
+ virt_config_reg_ro_write);
+
+ ALLOC_BLOCK(cxl->dvsec + 0x8, 2,
+ virt_config_reg_read,
+ virt_config_reg_ro_write);
+
+ /* CXL CAPABILITY */
+ ALLOC_BLOCK(cxl->dvsec + 0xa, 2,
+ virt_config_reg_read,
+ virt_config_reg_ro_write);
+
+ /* CXL CONTROL */
+ ALLOC_BLOCK(cxl->dvsec + 0xc, 2,
+ virt_config_reg_read,
+ cxl_control_write);
+
+ /* CXL STATUS */
+ ALLOC_BLOCK(cxl->dvsec + 0xe, 2,
+ virt_config_reg_read,
+ cxl_status_write);
+
+ /* CXL CONTROL 2 */
+ ALLOC_BLOCK(cxl->dvsec + 0x10, 2,
+ virt_config_reg_read,
+ cxl_control_2_write);
+
+ /* CXL STATUS 2 */
+ ALLOC_BLOCK(cxl->dvsec + 0x12, 2,
+ hw_config_reg_read,
+ cxl_status_2_write);
+
+ /* CXL LOCK */
+ ALLOC_BLOCK(cxl->dvsec + 0x14, 2,
+ virt_config_reg_read,
+ cxl_lock_write);
+
+ /* CXL CAPABILITY 2 */
+ ALLOC_BLOCK(cxl->dvsec + 0x16, 2,
+ virt_config_reg_read,
+ virt_config_reg_ro_write);
+
+ /* CXL RANGE 1 SIZE HIGH & LOW */
+ ALLOC_BLOCK(cxl->dvsec + 0x18, 4,
+ virt_config_reg_read,
+ virt_config_reg_ro_write);
+
+ ALLOC_BLOCK(cxl->dvsec + 0x1c, 4,
+ virt_config_reg_read,
+ virt_config_reg_ro_write);
+
+ /* CXL RANGE 1 BASE HIGH */
+ ALLOC_BLOCK(cxl->dvsec + 0x20, 4,
+ virt_config_reg_read,
+ virt_config_reg_write);
+
+ /* CXL RANGE 1 BASE LOW */
+ ALLOC_BLOCK(cxl->dvsec + 0x24, 4,
+ virt_config_reg_read,
+ cxl_base_lo_write);
+
+ /* CXL RANGE 2 SIZE HIGH & LOW */
+ ALLOC_BLOCK(cxl->dvsec + 0x28, 4,
+ virt_config_reg_read,
+ virt_config_reg_ro_write);
+
+ ALLOC_BLOCK(cxl->dvsec + 0x2c, 4,
+ virt_config_reg_read,
+ virt_config_reg_ro_write);
+
+ /* CXL RANGE 2 BASE HIGH */
+ ALLOC_BLOCK(cxl->dvsec + 0x30, 4,
+ virt_config_reg_read,
+ virt_config_reg_write);
+
+ /* CXL RANGE 2 BASE LOW */
+ ALLOC_BLOCK(cxl->dvsec + 0x34, 4,
+ virt_config_reg_read,
+ cxl_base_lo_write);
+
+ /* CXL CAPABILITY 3 */
+ ALLOC_BLOCK(cxl->dvsec + 0x38, 2,
+ virt_config_reg_read,
+ virt_config_reg_ro_write);
+
+ while ((offset = pci_find_next_ext_capability(cxl->pci_core.pdev,
+ offset,
+ PCI_EXT_CAP_ID_DOE))) {
+ ALLOC_BLOCK(offset + PCI_DOE_CTRL, 4,
+ hw_config_reg_read,
+ hw_config_reg_write);
+
+ ALLOC_BLOCK(offset + PCI_DOE_STATUS, 4,
+ hw_config_reg_read,
+ hw_config_reg_write);
+
+ ALLOC_BLOCK(offset + PCI_DOE_WRITE, 4,
+ hw_config_reg_read,
+ hw_config_reg_write);
+
+ ALLOC_BLOCK(offset + PCI_DOE_READ, 4,
+ hw_config_reg_read,
+ hw_config_reg_write);
+ }
+
+#undef ALLOC_BLOCK
+
+ return 0;
+}
+
static int new_mmio_block(struct vfio_cxl_core_device *cxl, u64 offset, u64 size,
reg_handler_t *read, reg_handler_t *write)
{
@@ -179,10 +507,10 @@ static int setup_mmio_emulation(struct vfio_cxl_core_device *cxl)
base = hdm_reg_base(cxl);
-#define ALLOC_BLOCK(offset, size, read, write) do { \
- ret = new_mmio_block(cxl, offset, size, read, write); \
- if (ret) \
- return ret; \
+#define ALLOC_BLOCK(offset, size, read, write) do { \
+ ret = new_mmio_block(cxl, offset, size, read, write); \
+ if (ret) \
+ return ret; \
} while (0)
ALLOC_BLOCK(base + 0x4, 4,
@@ -255,6 +583,10 @@ int vfio_cxl_core_setup_register_emulation(struct vfio_cxl_core_device *cxl)
INIT_LIST_HEAD(&cxl->config_regblocks_head);
INIT_LIST_HEAD(&cxl->mmio_regblocks_head);
+ ret = setup_config_emulation(cxl);
+ if (ret)
+ goto err;
+
ret = setup_mmio_emulation(cxl);
if (ret)
goto err;
diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index 8f02f236b5b4..4847d09e58b4 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -120,8 +120,8 @@ struct perm_bits {
#define NO_WRITE 0
#define ALL_WRITE 0xFFFFFFFFU
-static int vfio_user_config_read(struct pci_dev *pdev, int offset,
- __le32 *val, int count)
+int vfio_user_config_read(struct pci_dev *pdev, int offset,
+ __le32 *val, int count)
{
int ret = -EINVAL;
u32 tmp_val = 0;
@@ -150,9 +150,10 @@ static int vfio_user_config_read(struct pci_dev *pdev, int offset,
return ret;
}
+EXPORT_SYMBOL_GPL(vfio_user_config_read);
-static int vfio_user_config_write(struct pci_dev *pdev, int offset,
- __le32 val, int count)
+int vfio_user_config_write(struct pci_dev *pdev, int offset,
+ __le32 val, int count)
{
int ret = -EINVAL;
u32 tmp_val = le32_to_cpu(val);
@@ -171,6 +172,7 @@ static int vfio_user_config_write(struct pci_dev *pdev, int offset,
return ret;
}
+EXPORT_SYMBOL_GPL(vfio_user_config_write);
static int vfio_default_config_read(struct vfio_pci_core_device *vdev, int pos,
int count, struct perm_bits *perm,
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index 31fd28626846..8293910e0a96 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -201,6 +201,10 @@ ssize_t vfio_pci_core_do_io_rw(struct vfio_pci_core_device *vdev, bool test_mem,
void __iomem *io, char __user *buf,
loff_t off, size_t count, size_t x_start,
size_t x_end, bool iswrite);
+int vfio_user_config_read(struct pci_dev *pdev, int offset,
+ __le32 *val, int count);
+int vfio_user_config_write(struct pci_dev *pdev, int offset,
+ __le32 val, int count);
bool vfio_pci_core_range_intersect_range(loff_t buf_start, size_t buf_cnt,
loff_t reg_start, size_t reg_cnt,
loff_t *buf_offset,
--
2.25.1
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [RFC v2 13/15] vfio/pci: introduce CXL device awareness
2025-12-09 16:50 [RFC v2 00/15] vfio: introduce vfio-cxl to support CXL type-2 accelerator passthrough Hello all, mhonap
` (11 preceding siblings ...)
2025-12-09 16:50 ` [RFC v2 12/15] vfio/cxl: introduce the emulation of CXL configuration space mhonap
@ 2025-12-09 16:50 ` mhonap
2025-12-09 16:50 ` [RFC v2 14/15] vfio/cxl: VFIO variant driver for QEMU CXL accel device mhonap
2025-12-09 16:50 ` [RFC v2 15/15] cxl/mem: Fix NULL pointer dereference in memory device paths mhonap
14 siblings, 0 replies; 25+ messages in thread
From: mhonap @ 2025-12-09 16:50 UTC (permalink / raw)
To: aniketa, ankita, alwilliamson, vsethi, jgg, mochs, skolothumtho,
alejandro.lucero-palau, dave, jonathan.cameron, dave.jiang,
alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams, jgg,
yishaih, kevin.tian
Cc: cjia, kwankhede, targupta, zhiw, kjaju, linux-kernel, linux-cxl,
kvm, mhonap
From: Zhi Wang <zhiw@nvidia.com>
CXL device programming interfaces are built upon PCI interfaces, so
vfio-pci-core can be leveraged to handle a CXL device.
However, a CXL device also differs from a PCI device:
- No INTx support; only MSI/MSI-X is supported.
- Reset is done via CXL reset; FLR only resets CXL.io.
Introduce CXL device awareness to vfio-pci-core. Expose a new VFIO
device flag to userspace to identify the VFIO device as a CXL device.
Disable INTx support in vfio-pci-core. Disable FLR reset for the CXL
device, as the kernel CXL core doesn't support CXL reset yet. Disable
mmap support on the CXL MMIO BAR in vfio-pci-core.
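Userspace discovers the new flag and capability through VFIO_DEVICE_GET_INFO.
As a rough sketch of how a consumer might locate the CXL capability in the
returned info buffer (the struct layout below is a simplified copy of the
uapi header, and the helper name find_cxl_cap() is hypothetical, not part of
this series):

```c
#include <stdint.h>
#include <stddef.h>

/* Simplified copy of the uapi capability header layout, for illustration. */
struct vfio_info_cap_header {
	uint16_t id;
	uint16_t version;
	uint32_t next;	/* offset of next cap from start of info buffer; 0 ends chain */
};

#define VFIO_DEVICE_FLAGS_CXL		(1 << 9)
#define VFIO_DEVICE_INFO_CAP_CXL	6

/*
 * Walk the capability chain of a buffer filled by VFIO_DEVICE_GET_INFO and
 * return the CXL capability header, or NULL if the device did not report
 * one. (Hypothetical helper name.)
 */
static struct vfio_info_cap_header *
find_cxl_cap(void *info_buf, uint32_t flags, uint32_t cap_offset)
{
	uint32_t off = cap_offset;

	if (!(flags & VFIO_DEVICE_FLAGS_CXL))
		return NULL;

	while (off) {
		struct vfio_info_cap_header *hdr =
			(struct vfio_info_cap_header *)((char *)info_buf + off);

		if (hdr->id == VFIO_DEVICE_INFO_CAP_CXL)
			return hdr;
		off = hdr->next;
	}
	return NULL;
}
```

In real use the flags and cap_offset come from the struct vfio_device_info
returned by the ioctl on the device fd.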
Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
drivers/vfio/pci/vfio_cxl_core.c | 18 +++++++++++++++++
drivers/vfio/pci/vfio_pci_core.c | 33 ++++++++++++++++++++++++++++----
drivers/vfio/pci/vfio_pci_rdwr.c | 11 ++++++++---
include/linux/vfio_pci_core.h | 3 +++
include/uapi/linux/vfio.h | 10 ++++++++++
5 files changed, 68 insertions(+), 7 deletions(-)
diff --git a/drivers/vfio/pci/vfio_cxl_core.c b/drivers/vfio/pci/vfio_cxl_core.c
index c0bdf55997da..84e4f42d97de 100644
--- a/drivers/vfio/pci/vfio_cxl_core.c
+++ b/drivers/vfio/pci/vfio_cxl_core.c
@@ -25,6 +25,19 @@
#define DRIVER_AUTHOR "Zhi Wang <zhiw@nvidia.com>"
#define DRIVER_DESC "core driver for VFIO based CXL devices"
+static void init_cxl_cap(struct vfio_cxl_core_device *cxl)
+{
+ struct vfio_pci_core_device *pci = &cxl->pci_core;
+ struct vfio_device_info_cap_cxl *cap = &pci->cxl_cap;
+
+ cap->header.id = VFIO_DEVICE_INFO_CAP_CXL;
+ cap->header.version = 1;
+ cap->hdm_count = cxl->hdm_count;
+ cap->hdm_reg_offset = cxl->comp_reg_offset + cxl->hdm_reg_offset;
+ cap->hdm_reg_size = cxl->hdm_reg_size;
+ cap->hdm_reg_bar_index = cxl->comp_reg_bar;
+}
+
/* Standard CXL-type 2 driver initialization sequence */
static int enable_cxl(struct vfio_cxl_core_device *cxl, u16 dvsec,
struct vfio_cxl_dev_info *info)
@@ -74,6 +87,8 @@ static int enable_cxl(struct vfio_cxl_core_device *cxl, u16 dvsec,
if (IS_ERR(cxl_core->cxlmd))
return PTR_ERR(cxl_core->cxlmd);
+ init_cxl_cap(cxl);
+
cxl_core->region.noncached = info->noncached_region;
return 0;
@@ -266,6 +281,9 @@ int vfio_cxl_core_enable(struct vfio_cxl_core_device *cxl,
if (ret)
return ret;
+ pci->is_cxl = true;
+ pci->comp_reg_bar = cxl->comp_reg_bar;
+
ret = vfio_pci_core_enable(pci);
if (ret)
goto err_pci_core_enable;
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 502880e927fc..5f8334748841 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -483,7 +483,12 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
goto out_power;
/* If reset fails because of the device lock, fail this path entirely */
- ret = pci_try_reset_function(pdev);
+ if (!vdev->is_cxl)
+ ret = pci_try_reset_function(pdev);
+ else
+ /* TODO: CXL reset support is on-going. */
+ ret = -ENODEV;
+
if (ret == -EAGAIN)
goto out_disable_device;
@@ -618,8 +623,12 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
if (!vdev->barmap[bar])
continue;
pci_iounmap(pdev, vdev->barmap[bar]);
- pci_release_selected_regions(pdev, 1 << bar);
vdev->barmap[bar] = NULL;
+
+ if (vdev->is_cxl && bar == vdev->comp_reg_bar)
+ continue;
+
+ pci_release_selected_regions(pdev, 1 << bar);
}
list_for_each_entry_safe(dummy_res, tmp,
@@ -960,6 +969,15 @@ static int vfio_pci_ioctl_get_info(struct vfio_pci_core_device *vdev,
if (vdev->reset_works)
info.flags |= VFIO_DEVICE_FLAGS_RESET;
+ if (vdev->is_cxl) {
+ ret = vfio_info_add_capability(&caps, &vdev->cxl_cap.header,
+ sizeof(vdev->cxl_cap));
+ if (ret)
+ return ret;
+
+ info.flags |= VFIO_DEVICE_FLAGS_CXL;
+ }
+
info.num_regions = VFIO_PCI_NUM_REGIONS + vdev->num_regions;
info.num_irqs = VFIO_PCI_NUM_IRQS;
@@ -1752,14 +1770,21 @@ int vfio_pci_core_mmap(struct vfio_device *core_vdev, struct vm_area_struct *vma
* we need to request the region and the barmap tracks that.
*/
if (!vdev->barmap[index]) {
+ int bars;
+
+ if (vdev->is_cxl && vdev->comp_reg_bar == index)
+ bars = 0;
+ else
+ bars = 1 << index;
+
ret = pci_request_selected_regions(pdev,
- 1 << index, "vfio-pci");
+ bars, "vfio-pci");
if (ret)
return ret;
vdev->barmap[index] = pci_iomap(pdev, index, 0);
if (!vdev->barmap[index]) {
- pci_release_selected_regions(pdev, 1 << index);
+ pci_release_selected_regions(pdev, bars);
return -ENOMEM;
}
}
diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
index 6192788c8ba3..057cd0c69f2a 100644
--- a/drivers/vfio/pci/vfio_pci_rdwr.c
+++ b/drivers/vfio/pci/vfio_pci_rdwr.c
@@ -201,19 +201,24 @@ EXPORT_SYMBOL_GPL(vfio_pci_core_do_io_rw);
int vfio_pci_core_setup_barmap(struct vfio_pci_core_device *vdev, int bar)
{
struct pci_dev *pdev = vdev->pdev;
- int ret;
+ int bars, ret;
void __iomem *io;
if (vdev->barmap[bar])
return 0;
- ret = pci_request_selected_regions(pdev, 1 << bar, "vfio");
+ if (vdev->is_cxl && vdev->comp_reg_bar == bar)
+ bars = 0;
+ else
+ bars = 1 << bar;
+
+ ret = pci_request_selected_regions(pdev, bars, "vfio");
if (ret)
return ret;
io = pci_iomap(pdev, bar, 0);
if (!io) {
- pci_release_selected_regions(pdev, 1 << bar);
+ pci_release_selected_regions(pdev, bars);
return -ENOMEM;
}
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index 8293910e0a96..0a354c7788b3 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -82,6 +82,9 @@ struct vfio_pci_core_device {
bool needs_pm_restore:1;
bool pm_intx_masked:1;
bool pm_runtime_engaged:1;
+ bool is_cxl:1;
+ int comp_reg_bar;
+ struct vfio_device_info_cap_cxl cxl_cap;
struct pci_saved_state *pci_saved_state;
struct pci_saved_state *pm_save;
int ioeventfds_nr;
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 95be987d2ed5..0a9968cd6601 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -214,6 +214,7 @@ struct vfio_device_info {
#define VFIO_DEVICE_FLAGS_FSL_MC (1 << 6) /* vfio-fsl-mc device */
#define VFIO_DEVICE_FLAGS_CAPS (1 << 7) /* Info supports caps */
#define VFIO_DEVICE_FLAGS_CDX (1 << 8) /* vfio-cdx device */
+#define VFIO_DEVICE_FLAGS_CXL (1 << 9) /* Device supports CXL */
__u32 num_regions; /* Max region index + 1 */
__u32 num_irqs; /* Max IRQ index + 1 */
__u32 cap_offset; /* Offset within info struct of first cap */
@@ -256,6 +257,15 @@ struct vfio_device_info_cap_pci_atomic_comp {
__u32 reserved;
};
+#define VFIO_DEVICE_INFO_CAP_CXL 6
+struct vfio_device_info_cap_cxl {
+ struct vfio_info_cap_header header;
+ __u8 hdm_count;
+ __u8 hdm_reg_bar_index;
+ __u64 hdm_reg_size;
+ __u64 hdm_reg_offset;
+};
+
/**
* VFIO_DEVICE_GET_REGION_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 8,
* struct vfio_region_info)
--
2.25.1
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [RFC v2 14/15] vfio/cxl: VFIO variant driver for QEMU CXL accel device
2025-12-09 16:50 [RFC v2 00/15] vfio: introduce vfio-cxl to support CXL type-2 accelerator passthrough Hello all, mhonap
` (12 preceding siblings ...)
2025-12-09 16:50 ` [RFC v2 13/15] vfio/pci: introduce CXL device awareness mhonap
@ 2025-12-09 16:50 ` mhonap
2025-12-09 16:50 ` [RFC v2 15/15] cxl/mem: Fix NULL pointer dereference in memory device paths mhonap
14 siblings, 0 replies; 25+ messages in thread
From: mhonap @ 2025-12-09 16:50 UTC (permalink / raw)
To: aniketa, ankita, alwilliamson, vsethi, jgg, mochs, skolothumtho,
alejandro.lucero-palau, dave, jonathan.cameron, dave.jiang,
alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams, jgg,
yishaih, kevin.tian
Cc: cjia, kwankhede, targupta, zhiw, kjaju, linux-kernel, linux-cxl,
kvm, mhonap
From: Manish Honap <mhonap@nvidia.com>
To demonstrate the VFIO CXL core, a VFIO variant driver for the QEMU CXL
accel device is introduced, so that people can test the patches.
This patch is not meant to be merged.
Co-developed-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
drivers/vfio/pci/Kconfig | 2 +
drivers/vfio/pci/Makefile | 2 +
drivers/vfio/pci/cxl-accel/Kconfig | 9 ++
drivers/vfio/pci/cxl-accel/Makefile | 4 +
drivers/vfio/pci/cxl-accel/main.c | 143 ++++++++++++++++++++++++++++
5 files changed, 160 insertions(+)
create mode 100644 drivers/vfio/pci/cxl-accel/Kconfig
create mode 100644 drivers/vfio/pci/cxl-accel/Makefile
create mode 100644 drivers/vfio/pci/cxl-accel/main.c
diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index 2f441d118f1c..441ded7ea035 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -77,4 +77,6 @@ source "drivers/vfio/pci/nvgrace-gpu/Kconfig"
source "drivers/vfio/pci/qat/Kconfig"
+source "drivers/vfio/pci/cxl-accel/Kconfig"
+
endmenu
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index 452b7387f9fb..1b81d75b8ef7 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -22,3 +22,5 @@ obj-$(CONFIG_VIRTIO_VFIO_PCI) += virtio/
obj-$(CONFIG_NVGRACE_GPU_VFIO_PCI) += nvgrace-gpu/
obj-$(CONFIG_QAT_VFIO_PCI) += qat/
+
+obj-$(CONFIG_CXL_ACCEL_VFIO_PCI) += cxl-accel/
diff --git a/drivers/vfio/pci/cxl-accel/Kconfig b/drivers/vfio/pci/cxl-accel/Kconfig
new file mode 100644
index 000000000000..9a8884ded049
--- /dev/null
+++ b/drivers/vfio/pci/cxl-accel/Kconfig
@@ -0,0 +1,9 @@
+# SPDX-License-Identifier: GPL-2.0-only
+config CXL_ACCEL_VFIO_PCI
+ tristate "VFIO support for the QEMU CXL accel device"
+ select VFIO_CXL_CORE
+ help
+ VFIO support for CXL devices is needed for assigning CXL
+ devices to userspace using KVM/QEMU/etc.
+
+ If you don't know what to do here, say N.
diff --git a/drivers/vfio/pci/cxl-accel/Makefile b/drivers/vfio/pci/cxl-accel/Makefile
new file mode 100644
index 000000000000..8d0e076f405f
--- /dev/null
+++ b/drivers/vfio/pci/cxl-accel/Makefile
@@ -0,0 +1,4 @@
+# SPDX-License-Identifier: GPL-2.0-only
+
+obj-$(CONFIG_CXL_ACCEL_VFIO_PCI) += cxl-accel-vfio-pci.o
+cxl-accel-vfio-pci-y := main.o
diff --git a/drivers/vfio/pci/cxl-accel/main.c b/drivers/vfio/pci/cxl-accel/main.c
new file mode 100644
index 000000000000..3e5001ed5e2a
--- /dev/null
+++ b/drivers/vfio/pci/cxl-accel/main.c
@@ -0,0 +1,143 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved
+ */
+
+#include <linux/sizes.h>
+#include <linux/vfio_pci_core.h>
+
+struct cxl_device {
+ struct vfio_pci_core_device core_device;
+};
+
+static int cxl_open_device(struct vfio_device *vdev)
+{
+ struct vfio_cxl_core_device *cxl =
+ container_of(vdev, struct vfio_cxl_core_device, pci_core.vdev);
+ struct vfio_cxl *cxl_core = cxl->cxl_core;
+ struct vfio_cxl_dev_info info = {0};
+ int ret;
+
+ /* Driver reports the device DPA and RAM size */
+ info.dpa_res = DEFINE_RES_MEM(0, SZ_256M);
+ info.ram_res = DEFINE_RES_MEM_NAMED(0, SZ_256M, "ram");
+
+ /* Initialize the CXL device and enable the vfio-pci-core */
+ ret = vfio_cxl_core_enable(cxl, &info);
+ if (ret)
+ return ret;
+
+ vfio_cxl_core_finish_enable(cxl);
+
+ cxl_core = cxl->cxl_core;
+
+ /* No precommitted region, create one. */
+ if (!cxl_core->region.region) {
+ /*
+ * Driver can choose to create cxl region at a certain time
+ * E.g. at driver initialization or later
+ */
+ ret = vfio_cxl_core_create_cxl_region(cxl, SZ_256M);
+ if (ret)
+ goto fail_create_cxl_region;
+ }
+
+ ret = vfio_cxl_core_register_cxl_region(cxl);
+ if (ret)
+ goto fail_register_cxl_region;
+
+ return 0;
+
+fail_register_cxl_region:
+ if (cxl_core->region.region)
+ vfio_cxl_core_destroy_cxl_region(cxl);
+fail_create_cxl_region:
+ vfio_cxl_core_disable(cxl);
+ return ret;
+}
+
+static void cxl_close_device(struct vfio_device *vdev)
+{
+ struct vfio_cxl_core_device *cxl =
+ container_of(vdev, struct vfio_cxl_core_device, pci_core.vdev);
+
+ vfio_cxl_core_unregister_cxl_region(cxl);
+ vfio_cxl_core_destroy_cxl_region(cxl);
+ vfio_cxl_core_close_device(vdev);
+}
+
+static const struct vfio_device_ops cxl_core_ops = {
+ .name = "cxl-vfio-pci",
+ .init = vfio_pci_core_init_dev,
+ .release = vfio_pci_core_release_dev,
+ .open_device = cxl_open_device,
+ .close_device = cxl_close_device,
+ .ioctl = vfio_cxl_core_ioctl,
+ .device_feature = vfio_pci_core_ioctl_feature,
+ .read = vfio_cxl_core_read,
+ .write = vfio_cxl_core_write,
+ .mmap = vfio_pci_core_mmap,
+ .request = vfio_pci_core_request,
+ .match = vfio_pci_core_match,
+ .match_token_uuid = vfio_pci_core_match_token_uuid,
+ .bind_iommufd = vfio_iommufd_physical_bind,
+ .unbind_iommufd = vfio_iommufd_physical_unbind,
+ .attach_ioas = vfio_iommufd_physical_attach_ioas,
+ .detach_ioas = vfio_iommufd_physical_detach_ioas,
+};
+
+static int cxl_probe(struct pci_dev *pdev,
+ const struct pci_device_id *id)
+{
+ const struct vfio_device_ops *ops = &cxl_core_ops;
+ struct vfio_cxl_core_device *cxl_device;
+ int ret;
+
+ cxl_device = vfio_alloc_device(vfio_cxl_core_device, pci_core.vdev,
+ &pdev->dev, ops);
+ if (IS_ERR(cxl_device))
+ return PTR_ERR(cxl_device);
+
+ dev_set_drvdata(&pdev->dev, &cxl_device->pci_core);
+
+ ret = vfio_pci_core_register_device(&cxl_device->pci_core);
+ if (ret)
+ goto out_put_vdev;
+
+ return ret;
+
+out_put_vdev:
+ vfio_put_device(&cxl_device->pci_core.vdev);
+ return ret;
+}
+
+static void cxl_remove(struct pci_dev *pdev)
+{
+ struct vfio_pci_core_device *core_device = dev_get_drvdata(&pdev->dev);
+
+ vfio_pci_core_unregister_device(core_device);
+ vfio_put_device(&core_device->vdev);
+}
+
+static const struct pci_device_id cxl_vfio_pci_table[] = {
+ { PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_INTEL, 0xd94) },
+ {}
+};
+
+MODULE_DEVICE_TABLE(pci, cxl_vfio_pci_table);
+
+static struct pci_driver cxl_vfio_pci_driver = {
+ .name = KBUILD_MODNAME,
+ .id_table = cxl_vfio_pci_table,
+ .probe = cxl_probe,
+ .remove = cxl_remove,
+ .err_handler = &vfio_pci_core_err_handlers,
+ .driver_managed_dma = true,
+};
+
+module_pci_driver(cxl_vfio_pci_driver);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Zhi Wang <zhiw@nvidia.com>");
+MODULE_DESCRIPTION("VFIO variant driver for QEMU CXL accel device");
+MODULE_IMPORT_NS("CXL");
--
2.25.1
^ permalink raw reply related [flat|nested] 25+ messages in thread
* [RFC v2 15/15] cxl/mem: Fix NULL pointer dereference in memory device paths
2025-12-09 16:50 [RFC v2 00/15] vfio: introduce vfio-cxl to support CXL type-2 accelerator passthrough Hello all, mhonap
` (13 preceding siblings ...)
2025-12-09 16:50 ` [RFC v2 14/15] vfio/cxl: VFIO variant driver for QEMU CXL accel device mhonap
@ 2025-12-09 16:50 ` mhonap
14 siblings, 0 replies; 25+ messages in thread
From: mhonap @ 2025-12-09 16:50 UTC (permalink / raw)
To: aniketa, ankita, alwilliamson, vsethi, jgg, mochs, skolothumtho,
alejandro.lucero-palau, dave, jonathan.cameron, dave.jiang,
alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams, jgg,
yishaih, kevin.tian
Cc: cjia, kwankhede, targupta, zhiw, kjaju, linux-kernel, linux-cxl,
kvm, mhonap
From: Manish Honap <mhonap@nvidia.com>
Add NULL pointer validation in CXL memory device code paths that can
be triggered during error scenarios and device cleanup operations.
Two crash scenarios have been identified during VFIO-CXL testing:
1. __cxlmd_free() can be called with a NULL cxlmd pointer during
error handling paths in device probe/remove sequences. This leads
to a NULL pointer dereference when accessing cxlmd->cxlds.
2. cxl_memdev_has_poison_cmd() can receive a cxlmd where the
conversion to cxl_memdev_state via to_cxl_memdev_state() returns
NULL. This occurs when the device state hasn't been fully
initialized yet, causing a crash when test_bit() attempts to
access mds->poison.enabled_cmds.
Fix by adding defensive NULL checks:
- In __cxlmd_free(), return early if cxlmd is NULL to avoid
dereferencing an invalid pointer
- In cxl_memdev_has_poison_cmd(), validate mds before accessing
the poison.enabled_cmds bitmap
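The guarded accessor from the memdev fix can be illustrated in isolation.
A minimal userspace sketch follows, with stub types standing in for the
kernel structures (all names here are illustrative, not the real kernel
definitions):

```c
#include <stdbool.h>
#include <stddef.h>

/* Stub types standing in for the kernel structures; illustrative only. */
struct cxl_memdev_state { unsigned long enabled_cmds; };
struct cxl_dev_state   { struct cxl_memdev_state *mds; };
struct cxl_memdev      { struct cxl_dev_state *cxlds; };

/* Stand-in for to_cxl_memdev_state(): returns NULL when the memdev
 * state has not been fully initialized yet. */
static struct cxl_memdev_state *to_state(struct cxl_dev_state *cxlds)
{
	return cxlds ? cxlds->mds : NULL;
}

/* The guarded-accessor pattern from the fix: validate the derived
 * pointer before dereferencing it, defaulting to "not supported". */
static bool has_poison_cmd(struct cxl_memdev *cxlmd, int cmd)
{
	struct cxl_memdev_state *mds = to_state(cxlmd->cxlds);

	return mds ? !!(mds->enabled_cmds & (1UL << cmd)) : false;
}
```

Without the ternary guard, a partially initialized device would crash on
the first field access instead of reporting the command as unsupported.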
Signed-off-by: Manish Honap <mhonap@nvidia.com>
---
drivers/cxl/core/memdev.c | 2 +-
drivers/cxl/mem.c | 3 +++
2 files changed, 4 insertions(+), 1 deletion(-)
diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
index d281843fb2f4..eb694203a259 100644
--- a/drivers/cxl/core/memdev.c
+++ b/drivers/cxl/core/memdev.c
@@ -207,7 +207,7 @@ bool cxl_memdev_has_poison_cmd(struct cxl_memdev *cxlmd,
{
struct cxl_memdev_state *mds = to_cxl_memdev_state(cxlmd->cxlds);
- return test_bit(cmd, mds->poison.enabled_cmds);
+ return (mds) ? test_bit(cmd, mds->poison.enabled_cmds) : false;
}
static int cxl_get_poison_by_memdev(struct cxl_memdev *cxlmd)
diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
index d91d08d25bc4..d5a942ba97b2 100644
--- a/drivers/cxl/mem.c
+++ b/drivers/cxl/mem.c
@@ -188,6 +188,9 @@ static int cxl_mem_probe(struct device *dev)
static void __cxlmd_free(struct cxl_memdev *cxlmd)
{
+ if (!cxlmd)
+ return;
+
cxlmd->cxlds->cxlmd = NULL;
put_device(&cxlmd->dev);
kfree(cxlmd);
--
2.25.1
^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [RFC v2 07/15] vfio/cxl: expose CXL region to the userspace via a new VFIO device region
2025-12-09 16:50 ` [RFC v2 07/15] vfio/cxl: expose CXL region to the userspace via a new VFIO device region mhonap
@ 2025-12-11 16:06 ` Dave Jiang
2025-12-11 17:31 ` Manish Honap
2025-12-22 14:00 ` Jonathan Cameron
1 sibling, 1 reply; 25+ messages in thread
From: Dave Jiang @ 2025-12-11 16:06 UTC (permalink / raw)
To: mhonap, aniketa, ankita, alwilliamson, vsethi, jgg, mochs,
skolothumtho, alejandro.lucero-palau, dave, jonathan.cameron,
alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams, jgg,
yishaih, kevin.tian
Cc: cjia, kwankhede, targupta, zhiw, kjaju, linux-kernel, linux-cxl,
kvm
On 12/9/25 9:50 AM, mhonap@nvidia.com wrote:
> From: Manish Honap <mhonap@nvidia.com>
>
> To directly access the device memory, a CXL region is required. Creating
> a CXL region requires configuring HDM decoders along the path to map the
> HPA access level by level and eventually hit the DPA in the CXL
> topology.
>
> For the userspace, e.g. QEMU, to access the CXL region, the region is
> required to be exposed via VFIO interfaces.
>
> Introduce a new VFIO device region and region ops to expose the created
> CXL region when initializing the device in the vfio-cxl-core. Introduce a
> new sub-region type for the userspace to identify a CXL region.
>
> Co-developed-by: Zhi Wang <zhiw@nvidia.com>
> Signed-off-by: Zhi Wang <zhiw@nvidia.com>
> Signed-off-by: Manish Honap <mhonap@nvidia.com>
> ---
> drivers/vfio/pci/vfio_cxl_core.c | 122 +++++++++++++++++++++++++++++++
> drivers/vfio/pci/vfio_pci_core.c | 3 +-
> include/linux/vfio_pci_core.h | 5 ++
> include/uapi/linux/vfio.h | 4 +
> 4 files changed, 133 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/vfio/pci/vfio_cxl_core.c b/drivers/vfio/pci/vfio_cxl_core.c
> index cf53720c0cb7..35d95de47fa8 100644
> --- a/drivers/vfio/pci/vfio_cxl_core.c
> +++ b/drivers/vfio/pci/vfio_cxl_core.c
> @@ -231,6 +231,128 @@ void vfio_cxl_core_destroy_cxl_region(struct vfio_cxl_core_device *cxl)
> }
> EXPORT_SYMBOL_GPL(vfio_cxl_core_destroy_cxl_region);
>
> +static int vfio_cxl_region_mmap(struct vfio_pci_core_device *pci,
> + struct vfio_pci_region *region,
> + struct vm_area_struct *vma)
> +{
> + struct vfio_cxl_region *cxl_region = region->data;
> + u64 req_len, pgoff, req_start, end;
> + int ret;
> +
> + if (!(region->flags & VFIO_REGION_INFO_FLAG_MMAP))
> + return -EINVAL;
> +
> + if (!(region->flags & VFIO_REGION_INFO_FLAG_READ) &&
> + (vma->vm_flags & VM_READ))
> + return -EPERM;
> +
> + if (!(region->flags & VFIO_REGION_INFO_FLAG_WRITE) &&
> + (vma->vm_flags & VM_WRITE))
> + return -EPERM;
> +
> + pgoff = vma->vm_pgoff &
> + ((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
> +
> + if (check_sub_overflow(vma->vm_end, vma->vm_start, &req_len) ||
> + check_add_overflow(PHYS_PFN(cxl_region->addr), pgoff, &req_start) ||
> + check_add_overflow(PFN_PHYS(pgoff), req_len, &end))
> + return -EOVERFLOW;
> +
> + if (end > cxl_region->size)
> + return -EINVAL;
> +
> + if (cxl_region->noncached)
> + vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> + vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);
> +
> + vm_flags_set(vma, VM_ALLOW_ANY_UNCACHED | VM_IO | VM_PFNMAP |
> + VM_DONTEXPAND | VM_DONTDUMP);
> +
> + ret = remap_pfn_range(vma, vma->vm_start, req_start,
> + req_len, vma->vm_page_prot);
> + if (ret)
> + return ret;
> +
> + vma->vm_pgoff = req_start;
> +
> + return 0;
> +}
> +
> +static ssize_t vfio_cxl_region_rw(struct vfio_pci_core_device *core_dev,
> + char __user *buf, size_t count, loff_t *ppos,
> + bool iswrite)
> +{
> + unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
> + struct vfio_cxl_region *cxl_region = core_dev->region[i].data;
> + loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
> +
> + if (!count)
> + return 0;
> +
> + return vfio_pci_core_do_io_rw(core_dev, false,
> + cxl_region->vaddr,
> + (char __user *)buf, pos, count,
> + 0, 0, iswrite);
> +}
> +
> +static void vfio_cxl_region_release(struct vfio_pci_core_device *vdev,
> + struct vfio_pci_region *region)
> +{
> +}
> +
> +static const struct vfio_pci_regops vfio_cxl_regops = {
> + .rw = vfio_cxl_region_rw,
> + .mmap = vfio_cxl_region_mmap,
> + .release = vfio_cxl_region_release,
> +};
> +
> +int vfio_cxl_core_register_cxl_region(struct vfio_cxl_core_device *cxl)
> +{
> + struct vfio_pci_core_device *pci = &cxl->pci_core;
> + struct vfio_cxl *cxl_core = cxl->cxl_core;
> + u32 flags;
> + int ret;
> +
> + if (WARN_ON(!cxl_core->region.region || cxl_core->region.vaddr))
> + return -EEXIST;
> +
> + cxl_core->region.vaddr = ioremap(cxl_core->region.addr, cxl_core->region.size);
> + if (!cxl_core->region.addr)
I think you are wanting to check cxl_core->region.vaddr here right?
Also, what is the ioremap'd region for?
DJ
> + return -EFAULT;
> +
> + flags = VFIO_REGION_INFO_FLAG_READ |
> + VFIO_REGION_INFO_FLAG_WRITE |
> + VFIO_REGION_INFO_FLAG_MMAP;
> +
> + ret = vfio_pci_core_register_dev_region(pci,
> + PCI_VENDOR_ID_CXL |
> + VFIO_REGION_TYPE_PCI_VENDOR_TYPE,
> + VFIO_REGION_SUBTYPE_CXL,
> + &vfio_cxl_regops,
> + cxl_core->region.size, flags,
> + &cxl_core->region);
> + if (ret) {
> + iounmap(cxl_core->region.vaddr);
> + cxl_core->region.vaddr = NULL;
> + return ret;
> + }
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(vfio_cxl_core_register_cxl_region);
> +
> +void vfio_cxl_core_unregister_cxl_region(struct vfio_cxl_core_device *cxl)
> +{
> + struct vfio_cxl *cxl_core = cxl->cxl_core;
> +
> + if (WARN_ON(!cxl_core->region.region || !cxl_core->region.vaddr))
> + return;
> +
> + iounmap(cxl_core->region.vaddr);
> + cxl_core->region.vaddr = NULL;
> +}
> +EXPORT_SYMBOL_GPL(vfio_cxl_core_unregister_cxl_region);
> +
> MODULE_LICENSE("GPL");
> MODULE_AUTHOR(DRIVER_AUTHOR);
> MODULE_DESCRIPTION(DRIVER_DESC);
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 7dcf5439dedc..c0695b5db66d 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -1698,12 +1698,13 @@ static vm_fault_t vfio_pci_mmap_page_fault(struct vm_fault *vmf)
> return vfio_pci_mmap_huge_fault(vmf, 0);
> }
>
> -static const struct vm_operations_struct vfio_pci_mmap_ops = {
> +const struct vm_operations_struct vfio_pci_mmap_ops = {
> .fault = vfio_pci_mmap_page_fault,
> #ifdef CONFIG_ARCH_SUPPORTS_HUGE_PFNMAP
> .huge_fault = vfio_pci_mmap_huge_fault,
> #endif
> };
> +EXPORT_SYMBOL_GPL(vfio_pci_mmap_ops);
>
> int vfio_pci_core_mmap(struct vfio_device *core_vdev, struct vm_area_struct *vma)
> {
> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
> index a343b91d2580..3474835f5d65 100644
> --- a/include/linux/vfio_pci_core.h
> +++ b/include/linux/vfio_pci_core.h
> @@ -102,6 +102,7 @@ struct vfio_cxl_region {
> struct cxl_region *region;
> u64 size;
> u64 addr;
> + void *vaddr;
> bool noncached;
> };
>
> @@ -203,6 +204,8 @@ vfio_pci_core_to_cxl(struct vfio_pci_core_device *pci)
> return container_of(pci, struct vfio_cxl_core_device, pci_core);
> }
>
> +extern const struct vm_operations_struct vfio_pci_mmap_ops;
> +
> int vfio_cxl_core_enable(struct vfio_cxl_core_device *cxl,
> struct vfio_cxl_dev_info *info);
> void vfio_cxl_core_finish_enable(struct vfio_cxl_core_device *cxl);
> @@ -210,5 +213,7 @@ void vfio_cxl_core_disable(struct vfio_cxl_core_device *cxl);
> void vfio_cxl_core_close_device(struct vfio_device *vdev);
> int vfio_cxl_core_create_cxl_region(struct vfio_cxl_core_device *cxl, u64 size);
> void vfio_cxl_core_destroy_cxl_region(struct vfio_cxl_core_device *cxl);
> +int vfio_cxl_core_register_cxl_region(struct vfio_cxl_core_device *cxl);
> +void vfio_cxl_core_unregister_cxl_region(struct vfio_cxl_core_device *cxl);
>
> #endif /* VFIO_PCI_CORE_H */
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 75100bf009ba..95be987d2ed5 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -372,6 +372,10 @@ struct vfio_region_info_cap_type {
> /* sub-types for VFIO_REGION_TYPE_GFX */
> #define VFIO_REGION_SUBTYPE_GFX_EDID (1)
>
> +/* 1e98 vendor PCI sub-types */
> +/* sub-type for VFIO CXL region */
> +#define VFIO_REGION_SUBTYPE_CXL (1)
> +
> /**
> * struct vfio_region_gfx_edid - EDID region layout.
> *
^ permalink raw reply [flat|nested] 25+ messages in thread
* RE: [RFC v2 07/15] vfio/cxl: expose CXL region to the userspace via a new VFIO device region
2025-12-11 16:06 ` Dave Jiang
@ 2025-12-11 17:31 ` Manish Honap
2025-12-11 18:01 ` Dave Jiang
0 siblings, 1 reply; 25+ messages in thread
From: Manish Honap @ 2025-12-11 17:31 UTC (permalink / raw)
To: Dave Jiang, Aniket Agashe, Ankit Agrawal, Alex Williamson,
Vikram Sethi, Jason Gunthorpe, Matt Ochs, Shameer Kolothum,
alejandro.lucero-palau@amd.com, dave@stgolabs.net,
jonathan.cameron@huawei.com, alison.schofield@intel.com,
vishal.l.verma@intel.com, ira.weiny@intel.com,
dan.j.williams@intel.com, jgg@ziepe.ca, Yishai Hadas,
kevin.tian@intel.com
Cc: Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Zhi Wang,
Krishnakant Jaju, linux-kernel@vger.kernel.org,
linux-cxl@vger.kernel.org, kvm@vger.kernel.org, Manish Honap
> -----Original Message-----
> From: Dave Jiang <dave.jiang@intel.com>
> Sent: 11 December 2025 21:36
> To: Manish Honap <mhonap@nvidia.com>; Aniket Agashe
> <aniketa@nvidia.com>; Ankit Agrawal <ankita@nvidia.com>; Alex Williamson
> <alwilliamson@nvidia.com>; Vikram Sethi <vsethi@nvidia.com>; Jason
> Gunthorpe <jgg@nvidia.com>; Matt Ochs <mochs@nvidia.com>; Shameer
> Kolothum <skolothumtho@nvidia.com>; alejandro.lucero-palau@amd.com;
> dave@stgolabs.net; jonathan.cameron@huawei.com;
> alison.schofield@intel.com; vishal.l.verma@intel.com;
> ira.weiny@intel.com; dan.j.williams@intel.com; jgg@ziepe.ca; Yishai
> Hadas <yishaih@nvidia.com>; kevin.tian@intel.com
> Cc: Neo Jia <cjia@nvidia.com>; Kirti Wankhede <kwankhede@nvidia.com>;
> Tarun Gupta (SW-GPU) <targupta@nvidia.com>; Zhi Wang <zhiw@nvidia.com>;
> Krishnakant Jaju <kjaju@nvidia.com>; linux-kernel@vger.kernel.org;
> linux-cxl@vger.kernel.org; kvm@vger.kernel.org
> Subject: Re: [RFC v2 07/15] vfio/cxl: expose CXL region to the userspace
> via a new VFIO device region
>
> On 12/9/25 9:50 AM, mhonap@nvidia.com wrote:
> > From: Manish Honap <mhonap@nvidia.com>
> >
> > To directly access the device memory, a CXL region is required.
> > Creating a CXL region requires configuring the HDM decoders along the
> > path, which translate the HPA access level by level until it
> > eventually hits the DPA in the CXL topology.
> >
> > For userspace, e.g. QEMU, to access the CXL region, the region must
> > be exposed via VFIO interfaces.
> >
> > Introduce a new VFIO device region and region ops to expose the
> > created CXL region when initializing the device in the vfio-cxl-core.
> > Introduce a new sub-region type for userspace to identify a CXL
> > region.
> >
> > Co-developed-by: Zhi Wang <zhiw@nvidia.com>
> > Signed-off-by: Zhi Wang <zhiw@nvidia.com>
> > Signed-off-by: Manish Honap <mhonap@nvidia.com>
> > ---
> > drivers/vfio/pci/vfio_cxl_core.c | 122 +++++++++++++++++++++++++++++++
> > drivers/vfio/pci/vfio_pci_core.c | 3 +-
> > include/linux/vfio_pci_core.h | 5 ++
> > include/uapi/linux/vfio.h | 4 +
> > 4 files changed, 133 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/vfio/pci/vfio_cxl_core.c b/drivers/vfio/pci/vfio_cxl_core.c
> > index cf53720c0cb7..35d95de47fa8 100644
> > --- a/drivers/vfio/pci/vfio_cxl_core.c
> > +++ b/drivers/vfio/pci/vfio_cxl_core.c
> > @@ -231,6 +231,128 @@ void vfio_cxl_core_destroy_cxl_region(struct vfio_cxl_core_device *cxl)
> > }
> > EXPORT_SYMBOL_GPL(vfio_cxl_core_destroy_cxl_region);
> >
> > +static int vfio_cxl_region_mmap(struct vfio_pci_core_device *pci,
> > + struct vfio_pci_region *region,
> > + struct vm_area_struct *vma)
> > +{
> > + struct vfio_cxl_region *cxl_region = region->data;
> > + u64 req_len, pgoff, req_start, end;
> > + int ret;
> > +
> > + if (!(region->flags & VFIO_REGION_INFO_FLAG_MMAP))
> > + return -EINVAL;
> > +
> > + if (!(region->flags & VFIO_REGION_INFO_FLAG_READ) &&
> > + (vma->vm_flags & VM_READ))
> > + return -EPERM;
> > +
> > + if (!(region->flags & VFIO_REGION_INFO_FLAG_WRITE) &&
> > + (vma->vm_flags & VM_WRITE))
> > + return -EPERM;
> > +
> > + pgoff = vma->vm_pgoff &
> > + ((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
> > +
> > + if (check_sub_overflow(vma->vm_end, vma->vm_start, &req_len) ||
> > + check_add_overflow(PHYS_PFN(cxl_region->addr), pgoff, &req_start) ||
> > + check_add_overflow(PFN_PHYS(pgoff), req_len, &end))
> > + return -EOVERFLOW;
> > +
> > + if (end > cxl_region->size)
> > + return -EINVAL;
> > +
> > + if (cxl_region->noncached)
> > + vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> > + vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);
> > +
> > + vm_flags_set(vma, VM_ALLOW_ANY_UNCACHED | VM_IO | VM_PFNMAP |
> > + VM_DONTEXPAND | VM_DONTDUMP);
> > +
> > + ret = remap_pfn_range(vma, vma->vm_start, req_start,
> > + req_len, vma->vm_page_prot);
> > + if (ret)
> > + return ret;
> > +
> > + vma->vm_pgoff = req_start;
> > +
> > + return 0;
> > +}
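[As a hedged userspace sketch of the bounds logic in the mmap handler above
(GCC/Clang overflow builtins stand in for the kernel's check_sub_overflow()
and check_add_overflow(); `request_fits` is an illustrative name, not from
the patch), the checks reject requests whose arithmetic wraps or that run
past the end of the region:]

```c
#include <stdbool.h>
#include <stdint.h>

/* Does the VMA [start, end_addr) at byte offset pgoff_bytes fit in a
 * region of region_size bytes? Mirrors the -EOVERFLOW/-EINVAL checks. */
static bool request_fits(uint64_t start, uint64_t end_addr,
                         uint64_t pgoff_bytes, uint64_t region_size)
{
    uint64_t req_len, end;

    /* Detect wrapped arithmetic, as check_*_overflow() does. */
    if (__builtin_sub_overflow(end_addr, start, &req_len) ||
        __builtin_add_overflow(pgoff_bytes, req_len, &end))
        return false;   /* would be -EOVERFLOW */

    return end <= region_size;  /* past the end would be -EINVAL */
}
```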
> > +
> > +static ssize_t vfio_cxl_region_rw(struct vfio_pci_core_device *core_dev,
> > + char __user *buf, size_t count, loff_t *ppos,
> > + bool iswrite)
> > +{
> > + unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
> > + struct vfio_cxl_region *cxl_region = core_dev->region[i].data;
> > + loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
> > +
> > + if (!count)
> > + return 0;
> > +
> > + return vfio_pci_core_do_io_rw(core_dev, false,
> > + cxl_region->vaddr,
> > + (char __user *)buf, pos, count,
> > + 0, 0, iswrite);
> > +}
> > +
> > +static void vfio_cxl_region_release(struct vfio_pci_core_device *vdev,
> > + struct vfio_pci_region *region)
> > +{
> > +}
> > +
> > +static const struct vfio_pci_regops vfio_cxl_regops = {
> > + .rw = vfio_cxl_region_rw,
> > + .mmap = vfio_cxl_region_mmap,
> > + .release = vfio_cxl_region_release,
> > +};
> > +
> > +int vfio_cxl_core_register_cxl_region(struct vfio_cxl_core_device *cxl)
> > +{
> > + struct vfio_pci_core_device *pci = &cxl->pci_core;
> > + struct vfio_cxl *cxl_core = cxl->cxl_core;
> > + u32 flags;
> > + int ret;
> > +
> > + if (WARN_ON(!cxl_core->region.region || cxl_core->region.vaddr))
> > + return -EEXIST;
> > +
> > + cxl_core->region.vaddr = ioremap(cxl_core->region.addr, cxl_core->region.size);
> > + if (!cxl_core->region.addr)
>
> I think you are wanting to check cxl_core->region.vaddr here right?
Yes, you are correct. I will update this check.
>
> Also, what is the ioremap'd region for?
It is to handle read/write operations when QEMU performs I/O on the VFIO CXL device region via the read()/write() syscalls.
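[As a hedged userspace illustration of the bug being discussed (a malloc-free
stand-in for ioremap(); `struct region`, `fake_ioremap`, `map_buggy`, and
`map_fixed` are all illustrative names, not the kernel code itself): the
patch tests the input physical address, which is always non-zero, instead of
the pointer that the mapping call actually returned, so a mapping failure
goes unnoticed:]

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the vfio_cxl_region fields involved. */
struct region {
    uint64_t addr;  /* physical base, non-zero here */
    void *vaddr;    /* result of the (simulated) ioremap() */
};

/* Simulated ioremap() that can fail and return NULL. */
static void *fake_ioremap(uint64_t addr, uint64_t size, int fail)
{
    static char backing[4096];
    (void)addr; (void)size;
    return fail ? NULL : backing;
}

/* Buggy pattern: checks .addr, so a NULL mapping slips through. */
static int map_buggy(struct region *r, int fail)
{
    r->vaddr = fake_ioremap(r->addr, 4096, fail);
    if (!r->addr)   /* wrong field */
        return -1;
    return 0;
}

/* Fixed pattern: checks the pointer the mapping call returned. */
static int map_fixed(struct region *r, int fail)
{
    r->vaddr = fake_ioremap(r->addr, 4096, fail);
    if (!r->vaddr)  /* correct field */
        return -1;
    return 0;
}
```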
>
> DJ
>
> > + return -EFAULT;
> > +
> > + flags = VFIO_REGION_INFO_FLAG_READ |
> > + VFIO_REGION_INFO_FLAG_WRITE |
> > + VFIO_REGION_INFO_FLAG_MMAP;
> > +
> > + ret = vfio_pci_core_register_dev_region(pci,
> > + PCI_VENDOR_ID_CXL |
> > + VFIO_REGION_TYPE_PCI_VENDOR_TYPE,
> > + VFIO_REGION_SUBTYPE_CXL,
> > + &vfio_cxl_regops,
> > + cxl_core->region.size, flags,
> > + &cxl_core->region);
> > + if (ret) {
> > + iounmap(cxl_core->region.vaddr);
> > + cxl_core->region.vaddr = NULL;
> > + return ret;
> > + }
> > +
> > + return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(vfio_cxl_core_register_cxl_region);
> > +
> > +void vfio_cxl_core_unregister_cxl_region(struct vfio_cxl_core_device *cxl)
> > +{
> > + struct vfio_cxl *cxl_core = cxl->cxl_core;
> > +
> > + if (WARN_ON(!cxl_core->region.region || !cxl_core->region.vaddr))
> > + return;
> > +
> > + iounmap(cxl_core->region.vaddr);
> > + cxl_core->region.vaddr = NULL;
> > +}
> > +EXPORT_SYMBOL_GPL(vfio_cxl_core_unregister_cxl_region);
> > +
> > MODULE_LICENSE("GPL");
> > MODULE_AUTHOR(DRIVER_AUTHOR);
> > MODULE_DESCRIPTION(DRIVER_DESC);
> > diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> > index 7dcf5439dedc..c0695b5db66d 100644
> > --- a/drivers/vfio/pci/vfio_pci_core.c
> > +++ b/drivers/vfio/pci/vfio_pci_core.c
> > @@ -1698,12 +1698,13 @@ static vm_fault_t vfio_pci_mmap_page_fault(struct vm_fault *vmf)
> > return vfio_pci_mmap_huge_fault(vmf, 0);
> > }
> >
> > -static const struct vm_operations_struct vfio_pci_mmap_ops = {
> > +const struct vm_operations_struct vfio_pci_mmap_ops = {
> > .fault = vfio_pci_mmap_page_fault,
> > #ifdef CONFIG_ARCH_SUPPORTS_HUGE_PFNMAP
> > .huge_fault = vfio_pci_mmap_huge_fault,
> > #endif
> > };
> > +EXPORT_SYMBOL_GPL(vfio_pci_mmap_ops);
> >
> > int vfio_pci_core_mmap(struct vfio_device *core_vdev, struct vm_area_struct *vma)
> > {
> > diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
> > index a343b91d2580..3474835f5d65 100644
> > --- a/include/linux/vfio_pci_core.h
> > +++ b/include/linux/vfio_pci_core.h
> > @@ -102,6 +102,7 @@ struct vfio_cxl_region {
> > struct cxl_region *region;
> > u64 size;
> > u64 addr;
> > + void *vaddr;
> > bool noncached;
> > };
> >
> > @@ -203,6 +204,8 @@ vfio_pci_core_to_cxl(struct vfio_pci_core_device *pci)
> > return container_of(pci, struct vfio_cxl_core_device, pci_core);
> > }
> >
> > +extern const struct vm_operations_struct vfio_pci_mmap_ops;
> > +
> > int vfio_cxl_core_enable(struct vfio_cxl_core_device *cxl,
> > struct vfio_cxl_dev_info *info);
> > void vfio_cxl_core_finish_enable(struct vfio_cxl_core_device *cxl);
> > @@ -210,5 +213,7 @@ void vfio_cxl_core_disable(struct vfio_cxl_core_device *cxl);
> > void vfio_cxl_core_close_device(struct vfio_device *vdev);
> > int vfio_cxl_core_create_cxl_region(struct vfio_cxl_core_device *cxl, u64 size);
> > void vfio_cxl_core_destroy_cxl_region(struct vfio_cxl_core_device *cxl);
> > +int vfio_cxl_core_register_cxl_region(struct vfio_cxl_core_device *cxl);
> > +void vfio_cxl_core_unregister_cxl_region(struct vfio_cxl_core_device *cxl);
> >
> > #endif /* VFIO_PCI_CORE_H */
> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index 75100bf009ba..95be987d2ed5 100644
> > --- a/include/uapi/linux/vfio.h
> > +++ b/include/uapi/linux/vfio.h
> > @@ -372,6 +372,10 @@ struct vfio_region_info_cap_type {
> > /* sub-types for VFIO_REGION_TYPE_GFX */
> > #define VFIO_REGION_SUBTYPE_GFX_EDID (1)
> >
> > +/* 1e98 vendor PCI sub-types */
> > +/* sub-type for VFIO CXL region */
> > +#define VFIO_REGION_SUBTYPE_CXL (1)
> > +
> > /**
> > * struct vfio_region_gfx_edid - EDID region layout.
> > *
>
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [RFC v2 07/15] vfio/cxl: expose CXL region to the userspace via a new VFIO device region
2025-12-11 17:31 ` Manish Honap
@ 2025-12-11 18:01 ` Dave Jiang
0 siblings, 0 replies; 25+ messages in thread
From: Dave Jiang @ 2025-12-11 18:01 UTC (permalink / raw)
To: Manish Honap, Aniket Agashe, Ankit Agrawal, Alex Williamson,
Vikram Sethi, Jason Gunthorpe, Matt Ochs, Shameer Kolothum,
alejandro.lucero-palau@amd.com, dave@stgolabs.net,
jonathan.cameron@huawei.com, alison.schofield@intel.com,
vishal.l.verma@intel.com, ira.weiny@intel.com,
dan.j.williams@intel.com, jgg@ziepe.ca, Yishai Hadas,
kevin.tian@intel.com
Cc: Neo Jia, Kirti Wankhede, Tarun Gupta (SW-GPU), Zhi Wang,
Krishnakant Jaju, linux-kernel@vger.kernel.org,
linux-cxl@vger.kernel.org, kvm@vger.kernel.org
On 12/11/25 10:31 AM, Manish Honap wrote:
>
>
>> -----Original Message-----
>> From: Dave Jiang <dave.jiang@intel.com>
>> Sent: 11 December 2025 21:36
>> To: Manish Honap <mhonap@nvidia.com>; Aniket Agashe
>> <aniketa@nvidia.com>; Ankit Agrawal <ankita@nvidia.com>; Alex Williamson
>> <alwilliamson@nvidia.com>; Vikram Sethi <vsethi@nvidia.com>; Jason
>> Gunthorpe <jgg@nvidia.com>; Matt Ochs <mochs@nvidia.com>; Shameer
>> Kolothum <skolothumtho@nvidia.com>; alejandro.lucero-palau@amd.com;
>> dave@stgolabs.net; jonathan.cameron@huawei.com;
>> alison.schofield@intel.com; vishal.l.verma@intel.com;
>> ira.weiny@intel.com; dan.j.williams@intel.com; jgg@ziepe.ca; Yishai
>> Hadas <yishaih@nvidia.com>; kevin.tian@intel.com
>> Cc: Neo Jia <cjia@nvidia.com>; Kirti Wankhede <kwankhede@nvidia.com>;
>> Tarun Gupta (SW-GPU) <targupta@nvidia.com>; Zhi Wang <zhiw@nvidia.com>;
>> Krishnakant Jaju <kjaju@nvidia.com>; linux-kernel@vger.kernel.org;
>> linux-cxl@vger.kernel.org; kvm@vger.kernel.org
>> Subject: Re: [RFC v2 07/15] vfio/cxl: expose CXL region to the userspace
>> via a new VFIO device region
>>
>> On 12/9/25 9:50 AM, mhonap@nvidia.com wrote:
>>> From: Manish Honap <mhonap@nvidia.com>
>>>
>>> To directly access the device memory, a CXL region is required.
>>> Creating a CXL region requires configuring the HDM decoders along the
>>> path, which translate the HPA access level by level until it
>>> eventually hits the DPA in the CXL topology.
>>>
>>> For userspace, e.g. QEMU, to access the CXL region, the region must
>>> be exposed via VFIO interfaces.
>>>
>>> Introduce a new VFIO device region and region ops to expose the
>>> created CXL region when initializing the device in the vfio-cxl-core.
>>> Introduce a new sub-region type for userspace to identify a CXL
>>> region.
>>>
>>> Co-developed-by: Zhi Wang <zhiw@nvidia.com>
>>> Signed-off-by: Zhi Wang <zhiw@nvidia.com>
>>> Signed-off-by: Manish Honap <mhonap@nvidia.com>
>>> ---
>>> drivers/vfio/pci/vfio_cxl_core.c | 122 +++++++++++++++++++++++++++++++
>>> drivers/vfio/pci/vfio_pci_core.c | 3 +-
>>> include/linux/vfio_pci_core.h | 5 ++
>>> include/uapi/linux/vfio.h | 4 +
>>> 4 files changed, 133 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/vfio/pci/vfio_cxl_core.c b/drivers/vfio/pci/vfio_cxl_core.c
>>> index cf53720c0cb7..35d95de47fa8 100644
>>> --- a/drivers/vfio/pci/vfio_cxl_core.c
>>> +++ b/drivers/vfio/pci/vfio_cxl_core.c
>>> @@ -231,6 +231,128 @@ void vfio_cxl_core_destroy_cxl_region(struct vfio_cxl_core_device *cxl)
>>> }
>>> EXPORT_SYMBOL_GPL(vfio_cxl_core_destroy_cxl_region);
>>>
>>> +static int vfio_cxl_region_mmap(struct vfio_pci_core_device *pci,
>>> + struct vfio_pci_region *region,
>>> + struct vm_area_struct *vma)
>>> +{
>>> + struct vfio_cxl_region *cxl_region = region->data;
>>> + u64 req_len, pgoff, req_start, end;
>>> + int ret;
>>> +
>>> + if (!(region->flags & VFIO_REGION_INFO_FLAG_MMAP))
>>> + return -EINVAL;
>>> +
>>> + if (!(region->flags & VFIO_REGION_INFO_FLAG_READ) &&
>>> + (vma->vm_flags & VM_READ))
>>> + return -EPERM;
>>> +
>>> + if (!(region->flags & VFIO_REGION_INFO_FLAG_WRITE) &&
>>> + (vma->vm_flags & VM_WRITE))
>>> + return -EPERM;
>>> +
>>> + pgoff = vma->vm_pgoff &
>>> + ((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
>>> +
>>> + if (check_sub_overflow(vma->vm_end, vma->vm_start, &req_len) ||
>>> + check_add_overflow(PHYS_PFN(cxl_region->addr), pgoff, &req_start) ||
>>> + check_add_overflow(PFN_PHYS(pgoff), req_len, &end))
>>> + return -EOVERFLOW;
>>> +
>>> + if (end > cxl_region->size)
>>> + return -EINVAL;
>>> +
>>> + if (cxl_region->noncached)
>>> + vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
>>> + vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);
>>> +
>>> + vm_flags_set(vma, VM_ALLOW_ANY_UNCACHED | VM_IO | VM_PFNMAP |
>>> + VM_DONTEXPAND | VM_DONTDUMP);
>>> +
>>> + ret = remap_pfn_range(vma, vma->vm_start, req_start,
>>> + req_len, vma->vm_page_prot);
>>> + if (ret)
>>> + return ret;
>>> +
>>> + vma->vm_pgoff = req_start;
>>> +
>>> + return 0;
>>> +}
>>> +
>>> +static ssize_t vfio_cxl_region_rw(struct vfio_pci_core_device *core_dev,
>>> + char __user *buf, size_t count, loff_t *ppos,
>>> + bool iswrite)
>>> +{
>>> + unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
>>> + struct vfio_cxl_region *cxl_region = core_dev->region[i].data;
>>> + loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
>>> +
>>> + if (!count)
>>> + return 0;
>>> +
>>> + return vfio_pci_core_do_io_rw(core_dev, false,
>>> + cxl_region->vaddr,
>>> + (char __user *)buf, pos, count,
>>> + 0, 0, iswrite);
>>> +}
>>> +
>>> +static void vfio_cxl_region_release(struct vfio_pci_core_device *vdev,
>>> + struct vfio_pci_region *region)
>>> +{
>>> +}
>>> +
>>> +static const struct vfio_pci_regops vfio_cxl_regops = {
>>> + .rw = vfio_cxl_region_rw,
>>> + .mmap = vfio_cxl_region_mmap,
>>> + .release = vfio_cxl_region_release,
>>> +};
>>> +
>>> +int vfio_cxl_core_register_cxl_region(struct vfio_cxl_core_device *cxl)
>>> +{
>>> + struct vfio_pci_core_device *pci = &cxl->pci_core;
>>> + struct vfio_cxl *cxl_core = cxl->cxl_core;
>>> + u32 flags;
>>> + int ret;
>>> +
>>> + if (WARN_ON(!cxl_core->region.region || cxl_core->region.vaddr))
>>> + return -EEXIST;
>>> +
>>> + cxl_core->region.vaddr = ioremap(cxl_core->region.addr, cxl_core->region.size);
>>> + if (!cxl_core->region.addr)
>>
>> I think you are wanting to check cxl_core->region.vaddr here right?
>
> Yes, you are correct. I will update this check.
>
>>
>> Also, what is the ioremap'd region for?
>
> It is to handle read/write operations when QEMU performs I/O on the VFIO CXL device region via the read()/write() syscalls.
For the CXL device region, aren't the operations for the most part done
via the region being mmap()'d by QEMU? I understand read/write to BAR0
MMIO, but what specific operations are done via read()/write() to the
region? It may be worth mentioning in the commit log.
>
>>
>> DJ
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [RFC v2 11/15] vfio/cxl: introduce the emulation of HDM registers
2025-12-09 16:50 ` [RFC v2 11/15] vfio/cxl: introduce the emulation of HDM registers mhonap
@ 2025-12-11 18:13 ` Dave Jiang
0 siblings, 0 replies; 25+ messages in thread
From: Dave Jiang @ 2025-12-11 18:13 UTC (permalink / raw)
To: mhonap, aniketa, ankita, alwilliamson, vsethi, jgg, mochs,
skolothumtho, alejandro.lucero-palau, dave, jonathan.cameron,
alison.schofield, vishal.l.verma, ira.weiny, dan.j.williams, jgg,
yishaih, kevin.tian
Cc: cjia, kwankhede, targupta, zhiw, kjaju, linux-kernel, linux-cxl,
kvm
On 12/9/25 9:50 AM, mhonap@nvidia.com wrote:
> From: Manish Honap <mhonap@nvidia.com>
>
> CXL devices have HDM registers in their CXL MMIO BAR. Many HDM registers
> require a PA, and those are owned by the host under virtualization.
>
> Thus, the HDM registers need to be emulated so that the guest kernel
> CXL core can configure the virtual HDM decoders.
>
> Introduce the emulation of HDM registers that emulates the HDM decoders.
>
> Co-developed-by: Zhi Wang <zhiw@nvidia.com>
> Signed-off-by: Zhi Wang <zhiw@nvidia.com>
> Signed-off-by: Manish Honap <mhonap@nvidia.com>
> ---
> drivers/vfio/pci/vfio_cxl_core.c | 7 +-
> drivers/vfio/pci/vfio_cxl_core_emu.c | 242 +++++++++++++++++++++++++++
> include/linux/vfio_pci_core.h | 2 +
> 3 files changed, 248 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/vfio/pci/vfio_cxl_core.c b/drivers/vfio/pci/vfio_cxl_core.c
> index cb75e9f668a7..c0bdf55997da 100644
> --- a/drivers/vfio/pci/vfio_cxl_core.c
> +++ b/drivers/vfio/pci/vfio_cxl_core.c
> @@ -247,8 +247,6 @@ int vfio_cxl_core_enable(struct vfio_cxl_core_device *cxl,
> if (!dvsec)
> return -ENODEV;
>
> - cxl->dvsec = dvsec;
> -
> cxl_core = devm_cxl_dev_state_create(&pdev->dev, CXL_DEVTYPE_DEVMEM,
> pdev->dev.id, dvsec, struct vfio_cxl,
> cxlds, false);
> @@ -257,9 +255,12 @@ int vfio_cxl_core_enable(struct vfio_cxl_core_device *cxl,
> return -ENOMEM;
> }
>
> + cxl->dvsec = dvsec;
> + cxl->cxl_core = cxl_core;
> +
> ret = find_comp_regs(cxl);
> if (ret)
> - return -ENODEV;
> + return ret;
>
> ret = setup_virt_regs(cxl);
> if (ret)
> diff --git a/drivers/vfio/pci/vfio_cxl_core_emu.c b/drivers/vfio/pci/vfio_cxl_core_emu.c
> index a0674bacecd7..6711ff8975ef 100644
> --- a/drivers/vfio/pci/vfio_cxl_core_emu.c
> +++ b/drivers/vfio/pci/vfio_cxl_core_emu.c
> @@ -5,6 +5,239 @@
>
> #include "vfio_cxl_core_priv.h"
>
> +typedef ssize_t reg_handler_t(struct vfio_cxl_core_device *cxl, void *buf,
> + u64 offset, u64 size);
> +
> +static struct vfio_emulated_regblock *
> +new_reg_block(struct vfio_cxl_core_device *cxl, u64 offset, u64 size,
> + reg_handler_t *read, reg_handler_t *write)
> +{
> + struct vfio_emulated_regblock *block;
> +
> + block = kzalloc(sizeof(*block), GFP_KERNEL);
> + if (!block)
> + return ERR_PTR(-ENOMEM);
> +
> + block->range.start = offset;
> + block->range.end = offset + size - 1;
> + block->read = read;
> + block->write = write;
> +
> + INIT_LIST_HEAD(&block->list);
> +
> + return block;
> +}
> +
> +static int new_mmio_block(struct vfio_cxl_core_device *cxl, u64 offset, u64 size,
> + reg_handler_t *read, reg_handler_t *write)
> +{
> + struct vfio_emulated_regblock *block;
> +
> + block = new_reg_block(cxl, offset, size, read, write);
> + if (IS_ERR(block))
> + return PTR_ERR(block);
> +
> + list_add_tail(&block->list, &cxl->mmio_regblocks_head);
> + return 0;
> +}
> +
> +static u64 hdm_reg_base(struct vfio_cxl_core_device *cxl)
> +{
> + return cxl->comp_reg_offset + cxl->hdm_reg_offset;
> +}
> +
> +static u64 to_hdm_reg_offset(struct vfio_cxl_core_device *cxl, u64 offset)
> +{
> + return offset - hdm_reg_base(cxl);
> +}
> +
> +static void *hdm_reg_virt(struct vfio_cxl_core_device *cxl, u64 hdm_reg_offset)
> +{
> + return cxl->comp_reg_virt + cxl->hdm_reg_offset + hdm_reg_offset;
> +}
> +
> +static ssize_t virt_hdm_reg_read(struct vfio_cxl_core_device *cxl, void *buf,
> + u64 offset, u64 size)
> +{
> + offset = to_hdm_reg_offset(cxl, offset);
> + memcpy(buf, hdm_reg_virt(cxl, offset), size);
> +
> + return size;
> +}
> +
> +static ssize_t virt_hdm_reg_write(struct vfio_cxl_core_device *cxl, void *buf,
> + u64 offset, u64 size)
> +{
> + offset = to_hdm_reg_offset(cxl, offset);
> + memcpy(hdm_reg_virt(cxl, offset), buf, size);
> +
> + return size;
> +}
> +
> +static ssize_t virt_hdm_rev_reg_write(struct vfio_cxl_core_device *cxl,
> + void *buf, u64 offset, u64 size)
> +{
> + /* Discard writes on reserved registers. */
> + return size;
> +}
> +
> +static ssize_t hdm_decoder_n_lo_write(struct vfio_cxl_core_device *cxl,
> + void *buf, u64 offset, u64 size)
> +{
> + u32 new_val = le32_to_cpu(*(u32 *)buf);
> +
> + if (WARN_ON_ONCE(size != 4))
> + return -EINVAL;
> +
> + /* Bit [27:0] are reserved. */
> + new_val &= ~GENMASK(27, 0);
maybe define the mask
> +
> + new_val = cpu_to_le32(new_val);
> + offset = to_hdm_reg_offset(cxl, offset);
> + memcpy(hdm_reg_virt(cxl, offset), &new_val, size);
> + return size;
> +}
> +
> +static ssize_t hdm_decoder_global_ctrl_write(struct vfio_cxl_core_device *cxl,
> + void *buf, u64 offset, u64 size)
> +{
> + u32 hdm_decoder_global_cap;
> + u32 new_val = le32_to_cpu(*(u32 *)buf);
> +
> + if (WARN_ON_ONCE(size != 4))
> + return -EINVAL;
> +
> + /* Bit [31:2] are reserved. */
> + new_val &= ~GENMASK(31, 2);
same here re mask
> +
> + /* Poison On Decode Error Enable bit is 0 and RO if not support. */
> + hdm_decoder_global_cap = le32_to_cpu(*(u32 *)hdm_reg_virt(cxl, 0));
> + if (!(hdm_decoder_global_cap & BIT(10)))
> + new_val &= ~BIT(0);
Would be good to define the register bits to ease reading the code
> +
> + new_val = cpu_to_le32(new_val);
> + offset = to_hdm_reg_offset(cxl, offset);
> + memcpy(hdm_reg_virt(cxl, offset), &new_val, size);
> + return size;
> +}
> +
> +static ssize_t hdm_decoder_n_ctrl_write(struct vfio_cxl_core_device *cxl,
> + void *buf, u64 offset, u64 size)
> +{
> + u32 hdm_decoder_global_cap;
> + u32 ro_mask, rev_mask;
> + u32 new_val = le32_to_cpu(*(u32 *)buf);
> + u32 cur_val;
> +
> + if (WARN_ON_ONCE(size != 4))
> + return -EINVAL;
> +
> + offset = to_hdm_reg_offset(cxl, offset);
> + cur_val = le32_to_cpu(*(u32 *)hdm_reg_virt(cxl, offset));
> +
> + /* Lock on commit */
> + if (cur_val & BIT(8))
define bit(s). same comment for the rest of the patch.
DJ
> + return size;
> +
> + hdm_decoder_global_cap = le32_to_cpu(*(u32 *)hdm_reg_virt(cxl, 0));
> +
> + /* RO and reserved bits in the spec */
> + ro_mask = BIT(10) | BIT(11);
> + rev_mask = BIT(15) | GENMASK(31, 28);
> +
> + /* bits are not valid for devices */
> + ro_mask |= BIT(12);
> + rev_mask |= GENMASK(19, 16) | GENMASK(23, 20);
> +
> + /* bits are reserved when UIO is not supported */
> + if (!(hdm_decoder_global_cap & BIT(13)))
> + rev_mask |= BIT(14) | GENMASK(27, 24);
> +
> + /* clear reserved bits */
> + new_val &= ~rev_mask;
> +
> + /* keep the RO bits */
> + cur_val &= ro_mask;
> + new_val &= ~ro_mask;
> + new_val |= cur_val;
> +
> + /* emulate HDM decoder commit/de-commit */
> + if (new_val & BIT(9))
> + new_val |= BIT(10);
> + else
> + new_val &= ~BIT(10);
> +
> + new_val = cpu_to_le32(new_val);
> + memcpy(hdm_reg_virt(cxl, offset), &new_val, size);
> + return size;
> +}
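[As a hedged sketch of the naming Dave asks for above: the bit positions
follow the patch's own use of BIT(8)/BIT(9)/BIT(10) in the HDM Decoder n
Control register, but the macro and function names here are illustrative,
not from the patch or the CXL headers. Named bits make the lock-on-commit
and commit/committed emulation self-documenting:]

```c
#include <stdint.h>

/* Illustrative names for HDM Decoder n Control register bits. */
#define CXL_HDM_DECODER_CTRL_LOCK      (1u << 8)   /* lock on commit */
#define CXL_HDM_DECODER_CTRL_COMMIT    (1u << 9)
#define CXL_HDM_DECODER_CTRL_COMMITTED (1u << 10)  /* RO status bit */

/* Emulate decoder commit: COMMITTED tracks COMMIT. */
static uint32_t emulate_decoder_commit(uint32_t val)
{
    if (val & CXL_HDM_DECODER_CTRL_COMMIT)
        val |= CXL_HDM_DECODER_CTRL_COMMITTED;
    else
        val &= ~CXL_HDM_DECODER_CTRL_COMMITTED;
    return val;
}

/* Apply a guest write: a locked decoder silently drops the write. */
static uint32_t apply_ctrl_write(uint32_t cur, uint32_t new_val)
{
    if (cur & CXL_HDM_DECODER_CTRL_LOCK)
        return cur;
    return emulate_decoder_commit(new_val);
}
```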
> +
> +static int setup_mmio_emulation(struct vfio_cxl_core_device *cxl)
> +{
> + u64 offset, base;
> + int ret;
> +
> + base = hdm_reg_base(cxl);
> +
> +#define ALLOC_BLOCK(offset, size, read, write) do { \
> + ret = new_mmio_block(cxl, offset, size, read, write); \
> + if (ret) \
> + return ret; \
> + } while (0)
> +
> + ALLOC_BLOCK(base + 0x4, 4,
> + virt_hdm_reg_read,
> + hdm_decoder_global_ctrl_write);
> +
> + offset = base + 0x10;
> + while (offset < base + cxl->hdm_reg_size) {
> + /* HDM N BASE LOW */
> + ALLOC_BLOCK(offset, 4,
> + virt_hdm_reg_read,
> + hdm_decoder_n_lo_write);
> +
> + /* HDM N BASE HIGH */
> + ALLOC_BLOCK(offset + 0x4, 4,
> + virt_hdm_reg_read,
> + virt_hdm_reg_write);
> +
> + /* HDM N SIZE LOW */
> + ALLOC_BLOCK(offset + 0x8, 4,
> + virt_hdm_reg_read,
> + hdm_decoder_n_lo_write);
> +
> + /* HDM N SIZE HIGH */
> + ALLOC_BLOCK(offset + 0xc, 4,
> + virt_hdm_reg_read,
> + virt_hdm_reg_write);
> +
> + /* HDM N CONTROL */
> + ALLOC_BLOCK(offset + 0x10, 4,
> + virt_hdm_reg_read,
> + hdm_decoder_n_ctrl_write);
> +
> + /* HDM N TARGET LIST LOW */
> + ALLOC_BLOCK(offset + 0x14, 0x4,
> + virt_hdm_reg_read,
> + virt_hdm_rev_reg_write);
> +
> + /* HDM N TARGET LIST HIGH */
> + ALLOC_BLOCK(offset + 0x18, 0x4,
> + virt_hdm_reg_read,
> + virt_hdm_rev_reg_write);
> +
> + /* HDM N REV */
> + ALLOC_BLOCK(offset + 0x1c, 0x4,
> + virt_hdm_reg_read,
> + virt_hdm_rev_reg_write);
> +
> + offset += 0x20;
> + }
> +
> +#undef ALLOC_BLOCK
> + return 0;
> +}
> +
> void vfio_cxl_core_clean_register_emulation(struct vfio_cxl_core_device *cxl)
> {
> struct list_head *pos, *n;
> @@ -17,10 +250,19 @@ void vfio_cxl_core_clean_register_emulation(struct vfio_cxl_core_device *cxl)
>
> int vfio_cxl_core_setup_register_emulation(struct vfio_cxl_core_device *cxl)
> {
> + int ret;
> +
> INIT_LIST_HEAD(&cxl->config_regblocks_head);
> INIT_LIST_HEAD(&cxl->mmio_regblocks_head);
>
> + ret = setup_mmio_emulation(cxl);
> + if (ret)
> + goto err;
> +
> return 0;
> +err:
> + vfio_cxl_core_clean_register_emulation(cxl);
> + return ret;
> }
>
> static struct vfio_emulated_regblock *
> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
> index 12ded67c7db7..31fd28626846 100644
> --- a/include/linux/vfio_pci_core.h
> +++ b/include/linux/vfio_pci_core.h
> @@ -251,5 +251,7 @@ ssize_t vfio_cxl_core_write(struct vfio_device *core_vdev, const char __user *bu
> size_t count, loff_t *ppos);
> long vfio_cxl_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
> unsigned long arg);
> +int vfio_cxl_core_setup_register_emulation(struct vfio_cxl_core_device *cxl);
> +void vfio_cxl_core_clean_register_emulation(struct vfio_cxl_core_device *cxl);
>
> #endif /* VFIO_PCI_CORE_H */
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [RFC v2 01/15] cxl: factor out cxl_await_range_active() and cxl_media_ready()
2025-12-09 16:50 ` [RFC v2 01/15] cxl: factor out cxl_await_range_active() and cxl_media_ready() mhonap
@ 2025-12-22 12:21 ` Jonathan Cameron
0 siblings, 0 replies; 25+ messages in thread
From: Jonathan Cameron @ 2025-12-22 12:21 UTC (permalink / raw)
To: mhonap
Cc: aniketa, ankita, alwilliamson, vsethi, jgg, mochs, skolothumtho,
alejandro.lucero-palau, dave, dave.jiang, alison.schofield,
vishal.l.verma, ira.weiny, dan.j.williams, jgg, yishaih,
kevin.tian, cjia, kwankhede, targupta, zhiw, kjaju, linux-kernel,
linux-cxl, kvm, Li Ming
On Tue, 9 Dec 2025 22:20:05 +0530
mhonap@nvidia.com wrote:
> From: Zhi Wang <zhiw@nvidia.com>
>
> Before accessing the CXL device memory after reset/power-on, the driver
> needs to ensure the device memory media is ready.
>
> However, not every CXL device implements the CXL memory device register
> groups, e.g. a CXL type-2 device. Thus, calling cxl_await_media_ready()
> on such a device will lead to a kernel panic. This problem was found when
> testing an emulated CXL type-2 device without CXL memory device
> registers.
>
> [ 97.662720] BUG: kernel NULL pointer dereference, address: 0000000000000000
> [ 97.663963] #PF: supervisor read access in kernel mode
> [ 97.664860] #PF: error_code(0x0000) - not-present page
> [ 97.665753] PGD 0 P4D 0
> [ 97.666198] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
> [ 97.667053] CPU: 8 UID: 0 PID: 7340 Comm: qemu-system-x86 Tainted: G E 6.11.0-rc2+ #52
> [ 97.668656] Tainted: [E]=UNSIGNED_MODULE
> [ 97.669340] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> [ 97.671243] RIP: 0010:cxl_await_media_ready+0x1ac/0x1d0
> [ 97.672157] Code: e9 03 ff ff ff 0f b7 1d d6 80 31 01 48 8b 7d b8 89 da 48 c7 c6 60 52 c6 b0 e8 00 46 f6 ff e9 27 ff ff ff 49 8b 86 a0 00 00 00 <48> 8b 00 83 e0 0c 48 83 f8 04 0f 94 c0 0f b6 c0 8d 44 80 fb e9 0c
> [ 97.675391] RSP: 0018:ffffb5bac7627c20 EFLAGS: 00010246
> [ 97.676298] RAX: 0000000000000000 RBX: 000000000000003c RCX: 0000000000000000
> [ 97.677527] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> [ 97.678733] RBP: ffffb5bac7627c70 R08: 0000000000000000 R09: 0000000000000000
> [ 97.679951] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> [ 97.681144] R13: ffff9ef9028a8000 R14: ffff9ef90c1d1a28 R15: 0000000000000000
> [ 97.682370] FS: 00007386aa4f3d40(0000) GS:ffff9efa77200000(0000) knlGS:0000000000000000
> [ 97.683721] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 97.684703] CR2: 0000000000000000 CR3: 0000000169a14003 CR4: 0000000000770ef0
> [ 97.685909] PKRU: 55555554
> [ 97.686397] Call Trace:
> [ 97.686819] <TASK>
> [ 97.687243] ? show_regs+0x6c/0x80
> [ 97.687840] ? __die+0x24/0x80
> [ 97.688391] ? page_fault_oops+0x155/0x570
> [ 97.689090] ? srso_alias_return_thunk+0x5/0xfbef5
> [ 97.689973] ? srso_alias_return_thunk+0x5/0xfbef5
> [ 97.690848] ? __vunmap_range_noflush+0x420/0x4e0
> [ 97.691700] ? do_user_addr_fault+0x4b2/0x870
> [ 97.692606] ? srso_alias_return_thunk+0x5/0xfbef5
> [ 97.693502] ? exc_page_fault+0x82/0x1b0
> [ 97.694200] ? asm_exc_page_fault+0x27/0x30
> [ 97.694975] ? cxl_await_media_ready+0x1ac/0x1d0
> [ 97.695816] vfio_cxl_core_enable+0x386/0x800 [vfio_cxl_core]
> [ 97.696829] ? srso_alias_return_thunk+0x5/0xfbef5
> [ 97.697685] cxl_open_device+0xa6/0xd0 [cxl_accel_vfio_pci]
> [ 97.698673] vfio_df_open+0xcb/0xf0
> [ 97.699313] vfio_group_fops_unl_ioctl+0x294/0x720
> [ 97.700149] ? srso_alias_return_thunk+0x5/0xfbef5
> [ 97.701011] ? srso_alias_return_thunk+0x5/0xfbef5
> [ 97.701858] __x64_sys_ioctl+0xa3/0xf0
> [ 97.702536] x64_sys_call+0x11ad/0x25f0
> [ 97.703214] do_syscall_64+0x7e/0x170
> [ 97.703878] ? srso_alias_return_thunk+0x5/0xfbef5
> [ 97.704726] ? do_syscall_64+0x8a/0x170
> [ 97.705425] ? srso_alias_return_thunk+0x5/0xfbef5
> [ 97.706282] ? kvm_device_ioctl+0xae/0x130 [kvm]
> [ 97.707135] ? srso_alias_return_thunk+0x5/0xfbef5
> [ 97.708001] ? srso_alias_return_thunk+0x5/0xfbef5
> [ 97.708853] ? syscall_exit_to_user_mode+0x4e/0x250
> [ 97.709724] ? srso_alias_return_thunk+0x5/0xfbef5
> [ 97.710609] ? do_syscall_64+0x8a/0x170
> [ 97.711300] ? srso_alias_return_thunk+0x5/0xfbef5
> [ 97.712132] ? exc_page_fault+0x93/0x1b0
> [ 97.712839] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [ 97.713735] RIP: 0033:0x7386ab124ded
> [ 97.714382] Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00
> [ 97.717664] RSP: 002b:00007ffcda2a6480 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> [ 97.718965] RAX: ffffffffffffffda RBX: 00006293226d9f20 RCX: 00007386ab124ded
> [ 97.720222] RDX: 00006293226db730 RSI: 0000000000003b6a RDI: 0000000000000009
> [ 97.721522] RBP: 00007ffcda2a64d0 R08: 00006293214e9010 R09: 0000000000000007
> [ 97.722858] R10: 00006293226db730 R11: 0000000000000246 R12: 00006293226e0880
> [ 97.724193] R13: 00006293226db730 R14: 00007ffcda2a7740 R15: 00006293226d94f0
> [ 97.725491] </TASK>
> [ 97.725883] Modules linked in: cxl_accel_vfio_pci(E) vfio_cxl_core(E) vfio_pci_core(E) snd_seq_dummy(E) snd_hrtimer(E) snd_seq(E) snd_seq_device(E) snd_timer(E) snd(E) soundcore(E) qrtr(E) intel_rapl_msr(E) intel_rapl_common(E) kvm_amd(E) ccp(E) binfmt_misc(E) kvm(E) crct10dif_pclmul(E) crc32_pclmul(E) polyval_clmulni(E) polyval_generic(E) ghash_clmulni_intel(E) sha256_ssse3(E) sha1_ssse3(E) aesni_intel(E) i2c_i801(E) crypto_simd(E) cryptd(E) i2c_smbus(E) lpc_ich(E) joydev(E) input_leds(E) mac_hid(E) serio_raw(E) msr(E) parport_pc(E) ppdev(E) lp(E) parport(E) efi_pstore(E) dmi_sysfs(E) qemu_fw_cfg(E) autofs4(E) bochs(E) e1000e(E) drm_vram_helper(E) psmouse(E) drm_ttm_helper(E) ahci(E) ttm(E) libahci(E)
> [ 97.736690] CR2: 0000000000000000
> [ 97.737285] ---[ end trace 0000000000000000 ]---
>
> Factor out cxl_await_range_active() and cxl_media_ready(). Type-3 devices
> should call both to ensure the media is ready, while type-2 devices should
> only call cxl_await_range_active().
>
> Cc: Dave Jiang <dave.jiang@intel.com>
> Cc: Li Ming <ming.li@zohomail.com>
> Suggested-by: Dan Williams <dan.j.williams@intel.com>
> Reviewed-by: Li Ming <ming.li@zohomail.com>
> Signed-off-by: Zhi Wang <zhiw@nvidia.com>
> Signed-off-by: Manish Honap <mhonap@nvidia.com>
One bit of visual oddity inline.
> ---
> drivers/cxl/core/pci.c | 18 +++++++++++-------
> drivers/cxl/core/pci_drv.c | 3 +--
> drivers/cxl/cxlmem.h | 3 ++-
> include/cxl/cxl.h | 1 +
> tools/testing/cxl/Kbuild | 3 ++-
> tools/testing/cxl/test/mock.c | 21 ++++++++++++++++++---
> 6 files changed, 35 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index 90a0763e72c4..a0cda2a8fdba 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -225,12 +225,11 @@ static int cxl_dvsec_mem_range_active(struct cxl_dev_state *cxlds, int id)
> * Wait up to @media_ready_timeout for the device to report memory
> * active.
> */
> -int cxl_await_media_ready(struct cxl_dev_state *cxlds)
> +int cxl_await_range_active(struct cxl_dev_state *cxlds)
> {
> struct pci_dev *pdev = to_pci_dev(cxlds->dev);
> int d = cxlds->cxl_dvsec;
> int rc, i, hdm_count;
> - u64 md_status;
> u16 cap;
>
> rc = pci_read_config_word(pdev,
> @@ -251,13 +250,18 @@ int cxl_await_media_ready(struct cxl_dev_state *cxlds)
> return rc;
> }
>
> - md_status = readq(cxlds->regs.memdev + CXLMDEV_STATUS_OFFSET);
> - if (!CXLMDEV_READY(md_status))
> - return -EIO;
> -
> return 0;
> }
> -EXPORT_SYMBOL_NS_GPL(cxl_await_media_ready, "CXL");
> +EXPORT_SYMBOL_NS_GPL(cxl_await_range_active, "CXL");
> +
> +int cxl_media_ready(struct cxl_dev_state *cxlds)
> +{
> + u64 md_status;
> +
> + md_status = readq(cxlds->regs.memdev + CXLMDEV_STATUS_OFFSET);
> + return CXLMDEV_READY(md_status) ? 0 : -EIO;
See below for a suggestion that this should return a bool saying
whether the media was ready.
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_media_ready, "CXL");
>
> static int cxl_set_mem_enable(struct cxl_dev_state *cxlds, u16 val)
> {
> diff --git a/drivers/cxl/core/pci_drv.c b/drivers/cxl/core/pci_drv.c
> index 4c767e2471b8..6e519b197f0d 100644
> --- a/drivers/cxl/core/pci_drv.c
> +++ b/drivers/cxl/core/pci_drv.c
> @@ -899,8 +899,7 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> if (rc)
> return rc;
>
> - rc = cxl_await_media_ready(cxlds);
> - if (rc == 0)
> + if (!cxl_await_range_active(cxlds) && !cxl_media_ready(cxlds))
The syntax here is odd because it treats the return value of
cxl_media_ready() as a boolean. So someone naively looking at this
sees media_ready set to true when a check called cxl_media_ready()
returned false (i.e. 0). It made me blink.
I'd either use an explicit == 0 for each of these, or perhaps have
cxl_media_ready() return a bool.
> cxlds->media_ready = true;
> else
> dev_warn(&pdev->dev, "Media not active (%d)\n", rc);
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [RFC v2 05/15] cxl: introduce cxl_get_committed_regions()
2025-12-09 16:50 ` [RFC v2 05/15] cxl: introduce cxl_get_committed_regions() mhonap
@ 2025-12-22 12:31 ` Jonathan Cameron
0 siblings, 0 replies; 25+ messages in thread
From: Jonathan Cameron @ 2025-12-22 12:31 UTC (permalink / raw)
To: mhonap
Cc: aniketa, ankita, alwilliamson, vsethi, jgg, mochs, skolothumtho,
alejandro.lucero-palau, dave, dave.jiang, alison.schofield,
vishal.l.verma, ira.weiny, dan.j.williams, jgg, yishaih,
kevin.tian, cjia, kwankhede, targupta, zhiw, kjaju, linux-kernel,
linux-cxl, kvm
On Tue, 9 Dec 2025 22:20:09 +0530
mhonap@nvidia.com wrote:
> From: Zhi Wang <zhiw@nvidia.com>
>
> The kernel CXL core can discover the configured and committed CXL regions
> from BIOS or firmware, respect its configuration and create the related
> kernel CXL core data structures without configuring and committing the CXL
> region.
>
> However, that information is kept within the kernel CXL core. A type-2
> device can be used the same way, and a type-2 driver would like to know
> about it before creating its CXL regions.
>
> Introduce cxl_get_committed_regions() for a type-2 driver to discover the
> committed regions.
>
> Signed-off-by: Zhi Wang <zhiw@nvidia.com>
> Signed-off-by: Manish Honap <mhonap@nvidia.com>
A few trivial things inline.
Thanks,
Jonathan
> ---
> drivers/cxl/core/region.c | 73 +++++++++++++++++++++++++++++++++++++++
> include/cxl/cxl.h | 1 +
> 2 files changed, 74 insertions(+)
>
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index e89a98780e76..6c368b4641f1 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -2785,6 +2785,79 @@ int cxl_get_region_range(struct cxl_region *region, struct range *range)
> }
> EXPORT_SYMBOL_NS_GPL(cxl_get_region_range, "CXL");
>
> +struct match_region_info {
> + struct cxl_memdev *cxlmd;
> + struct cxl_region **cxlrs;
> + int nr_regions;
> +};
> +
> +static int match_region_by_device(struct device *match, void *data)
> +{
> + struct match_region_info *info = data;
> + struct cxl_endpoint_decoder *cxled;
> + struct cxl_memdev *cxlmd;
> + struct cxl_region_params *p;
> + struct cxl_region *cxlr;
> + int i;
> +
> + if (!is_cxl_region(match))
> + return 0;
> +
> + lockdep_assert_held(&cxl_rwsem.region);
> + cxlr = to_cxl_region(match);
> + p = &cxlr->params;
> +
> + if (p->state != CXL_CONFIG_COMMIT)
> + return 0;
> +
> + for (i = 0; i < p->nr_targets; i++) {
> + void *cxlrs;
Might be worth giving this a type.
> +
> + cxled = p->targets[i];
> + cxlmd = cxled_to_memdev(cxled);
> +
> + if (info->cxlmd != cxlmd)
> + continue;
> +
> + cxlrs = krealloc(info->cxlrs, sizeof(cxlr) * (info->nr_regions + 1),
> + GFP_KERNEL);
krealloc_array() slightly better here I think.
> + if (!cxlrs) {
> + kfree(info->cxlrs);
> + return -ENOMEM;
> + }
> + info->cxlrs = cxlrs;
> +
> + info->cxlrs[info->nr_regions++] = cxlr;
> + }
> +
> + return 0;
> +}
> +
> +int cxl_get_committed_regions(struct cxl_memdev *cxlmd, struct cxl_region ***cxlrs, int *num)
> +{
> + struct match_region_info info = {0};
> + int ret = 0;
Always set, so don't initialize here.
> +
> + ret = down_write_killable(&cxl_rwsem.region);
Look at the ACQUIRE() stuff for this.
> + if (ret)
> + return ret;
> +
> + info.cxlmd = cxlmd;
> +
> + ret = bus_for_each_dev(&cxl_bus_type, NULL, &info, match_region_by_device);
> + if (ret) {
> + kfree(info.cxlrs);
With the acquire magic above, you can return directly here.
> + } else {
> + *cxlrs = info.cxlrs;
> + *num = info.nr_regions;
> + }
> +
> + up_write(&cxl_rwsem.region);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_get_committed_regions, "CXL");
> +
> static ssize_t __create_region_show(struct cxl_root_decoder *cxlrd, char *buf)
> {
> return sysfs_emit(buf, "region%u\n", atomic_read(&cxlrd->region_id));
> diff --git a/include/cxl/cxl.h b/include/cxl/cxl.h
> index e3bf8cf0b6d6..0a1f245557f4 100644
> --- a/include/cxl/cxl.h
> +++ b/include/cxl/cxl.h
> @@ -295,5 +295,6 @@ int cxl_get_region_range(struct cxl_region *region, struct range *range);
> int cxl_get_hdm_reg_info(struct cxl_dev_state *cxlds, u64 *count, u64 *offset,
> u64 *size);
> int cxl_find_comp_regblock_offset(struct pci_dev *pdev, u64 *offset);
> +int cxl_get_committed_regions(struct cxl_memdev *cxlmd, struct cxl_region ***cxlrs, int *num);
>
> #endif /* __CXL_CXL_H__ */
* Re: [RFC v2 06/15] vfio/cxl: introduce vfio-cxl core preludes
2025-12-09 16:50 ` [RFC v2 06/15] vfio/cxl: introduce vfio-cxl core preludes mhonap
@ 2025-12-22 13:54 ` Jonathan Cameron
0 siblings, 0 replies; 25+ messages in thread
From: Jonathan Cameron @ 2025-12-22 13:54 UTC (permalink / raw)
To: mhonap
Cc: aniketa, ankita, alwilliamson, vsethi, jgg, mochs, skolothumtho,
alejandro.lucero-palau, dave, dave.jiang, alison.schofield,
vishal.l.verma, ira.weiny, dan.j.williams, jgg, yishaih,
kevin.tian, cjia, kwankhede, targupta, zhiw, kjaju, linux-kernel,
linux-cxl, kvm
On Tue, 9 Dec 2025 22:20:10 +0530
mhonap@nvidia.com wrote:
> From: Manish Honap <mhonap@nvidia.com>
>
> In VFIO, common functions used by VFIO variant drivers are managed in a
> set of "core" functions. E.g. vfio-pci-core provides the common
> functions used by VFIO variant drivers to support PCI device
> passthrough.
>
> Although a CXL type-2 device has a PCI-compatible interface for device
> configuration and programming, it still needs special handling when
> initializing the device:
>
> - Probing the CXL DVSECs in the configuration.
> - Probing the CXL register groups implemented by the device.
> - Configuring the CXL device state required by the kernel CXL core.
> - Creating the CXL region.
> - Special handling of the CXL MMIO BAR.
>
> Introduce vfio-cxl core preludes to hold all the common functions used
> by VFIO variant drivers to support CXL device passthrough.
>
> Co-developed-by: Zhi Wang <zhiw@nvidia.com>
> Signed-off-by: Zhi Wang <zhiw@nvidia.com>
> Signed-off-by: Manish Honap <mhonap@nvidia.com>
One trivial thing from a first look.
> diff --git a/drivers/vfio/pci/vfio_cxl_core.c b/drivers/vfio/pci/vfio_cxl_core.c
> new file mode 100644
> index 000000000000..cf53720c0cb7
> --- /dev/null
> +++ b/drivers/vfio/pci/vfio_cxl_core.c
> @@ -0,0 +1,238 @@
> +
> +int vfio_cxl_core_create_cxl_region(struct vfio_cxl_core_device *cxl, u64 size)
> +{
> + struct cxl_region *region;
> + struct range range;
> + int ret;
> + struct vfio_cxl *cxl_core = cxl->cxl_core;
> +
> + if (WARN_ON(cxl_core->region.region))
> + return -EEXIST;
> +
> + ret = get_hpa_and_request_dpa(cxl, size);
> + if (ret)
> + return ret;
> +
> + region = cxl_create_region(cxl_core->cxlrd, &cxl_core->cxled, true);
> + if (IS_ERR(region)) {
> + ret = PTR_ERR(region);
> + cxl_dpa_free(cxl_core->cxled);
> + return ret;
Trivial but might as well do:
return PTR_ERR(region);
and save a line.
> + }
> +
> + cxl_get_region_range(region, &range);
> +
> + cxl_core->region.addr = range.start;
> + cxl_core->region.size = size;
> + cxl_core->region.region = region;
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(vfio_cxl_core_create_cxl_region);
* Re: [RFC v2 07/15] vfio/cxl: expose CXL region to the userspace via a new VFIO device region
2025-12-09 16:50 ` [RFC v2 07/15] vfio/cxl: expose CXL region to the userspace via a new VFIO device region mhonap
2025-12-11 16:06 ` Dave Jiang
@ 2025-12-22 14:00 ` Jonathan Cameron
1 sibling, 0 replies; 25+ messages in thread
From: Jonathan Cameron @ 2025-12-22 14:00 UTC (permalink / raw)
To: mhonap
Cc: aniketa, ankita, alwilliamson, vsethi, jgg, mochs, skolothumtho,
alejandro.lucero-palau, dave, dave.jiang, alison.schofield,
vishal.l.verma, ira.weiny, dan.j.williams, jgg, yishaih,
kevin.tian, cjia, kwankhede, targupta, zhiw, kjaju, linux-kernel,
linux-cxl, kvm
On Tue, 9 Dec 2025 22:20:11 +0530
mhonap@nvidia.com wrote:
> From: Manish Honap <mhonap@nvidia.com>
>
> To directly access the device memory, a CXL region is required. Creating
> a CXL region requires configuring the HDM decoders along the path, so
> that an HPA access is mapped level by level and eventually hits the DPA
> in the CXL topology.
>
> For the userspace, e.g. QEMU, to access the CXL region, the region is
> required to be exposed via VFIO interfaces.
>
> Introduce a new VFIO device region and region ops to expose the created
> CXL region when initializing the device in the vfio-cxl-core. Introduce
> a new sub-region type for userspace to identify a CXL region.
>
> Co-developed-by: Zhi Wang <zhiw@nvidia.com>
> Signed-off-by: Zhi Wang <zhiw@nvidia.com>
> Signed-off-by: Manish Honap <mhonap@nvidia.com>
A few really minor things inline.
> ---
> drivers/vfio/pci/vfio_cxl_core.c | 122 +++++++++++++++++++++++++++++++
> drivers/vfio/pci/vfio_pci_core.c | 3 +-
> include/linux/vfio_pci_core.h | 5 ++
> include/uapi/linux/vfio.h | 4 +
> 4 files changed, 133 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/vfio/pci/vfio_cxl_core.c b/drivers/vfio/pci/vfio_cxl_core.c
> index cf53720c0cb7..35d95de47fa8 100644
> --- a/drivers/vfio/pci/vfio_cxl_core.c
> +++ b/drivers/vfio/pci/vfio_cxl_core.c
> @@ -231,6 +231,128 @@ void vfio_cxl_core_destroy_cxl_region(struct vfio_cxl_core_device *cxl)
> }
> EXPORT_SYMBOL_GPL(vfio_cxl_core_destroy_cxl_region);
>
> +static int vfio_cxl_region_mmap(struct vfio_pci_core_device *pci,
> + struct vfio_pci_region *region,
> + struct vm_area_struct *vma)
> +{
> + struct vfio_cxl_region *cxl_region = region->data;
> + u64 req_len, pgoff, req_start, end;
> + int ret;
> +
> + if (!(region->flags & VFIO_REGION_INFO_FLAG_MMAP))
> + return -EINVAL;
> +
> + if (!(region->flags & VFIO_REGION_INFO_FLAG_READ) &&
> + (vma->vm_flags & VM_READ))
> + return -EPERM;
> +
> + if (!(region->flags & VFIO_REGION_INFO_FLAG_WRITE) &&
> + (vma->vm_flags & VM_WRITE))
> + return -EPERM;
> +
> + pgoff = vma->vm_pgoff &
> + ((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
GENMASK() might be slightly easier to read and makes it really obvious
this is a simple masking operation.
> +
> + if (check_sub_overflow(vma->vm_end, vma->vm_start, &req_len) ||
> + check_add_overflow(PHYS_PFN(cxl_region->addr), pgoff, &req_start) ||
> + check_add_overflow(PFN_PHYS(pgoff), req_len, &end))
> + return -EOVERFLOW;
> +
> + if (end > cxl_region->size)
> + return -EINVAL;
> +
> + if (cxl_region->noncached)
> + vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> + vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);
> +
> + vm_flags_set(vma, VM_ALLOW_ANY_UNCACHED | VM_IO | VM_PFNMAP |
> + VM_DONTEXPAND | VM_DONTDUMP);
> +
> + ret = remap_pfn_range(vma, vma->vm_start, req_start,
> + req_len, vma->vm_page_prot);
> + if (ret)
> + return ret;
> +
> + vma->vm_pgoff = req_start;
> +
> + return 0;
> +}
> +
> +static ssize_t vfio_cxl_region_rw(struct vfio_pci_core_device *core_dev,
> + char __user *buf, size_t count, loff_t *ppos,
> + bool iswrite)
> +{
> + unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
> + struct vfio_cxl_region *cxl_region = core_dev->region[i].data;
> + loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
> +
> + if (!count)
> + return 0;
> +
> + return vfio_pci_core_do_io_rw(core_dev, false,
> + cxl_region->vaddr,
> + (char __user *)buf, pos, count,
buf is already a char __user * so not sure why you'd need a cast here.
> + 0, 0, iswrite);
> +}
* Re: [RFC v2 08/15] vfio/cxl: discover precommitted CXL region
2025-12-09 16:50 ` [RFC v2 08/15] vfio/cxl: discover precommitted CXL region mhonap
@ 2025-12-22 14:09 ` Jonathan Cameron
0 siblings, 0 replies; 25+ messages in thread
From: Jonathan Cameron @ 2025-12-22 14:09 UTC (permalink / raw)
To: mhonap
Cc: aniketa, ankita, alwilliamson, vsethi, jgg, mochs, skolothumtho,
alejandro.lucero-palau, dave, dave.jiang, alison.schofield,
vishal.l.verma, ira.weiny, dan.j.williams, jgg, yishaih,
kevin.tian, cjia, kwankhede, targupta, zhiw, kjaju, linux-kernel,
linux-cxl, kvm
On Tue, 9 Dec 2025 22:20:12 +0530
mhonap@nvidia.com wrote:
> From: Zhi Wang <zhiw@nvidia.com>
>
> A type-2 device can have a precommitted CXL region that is configured by
> the BIOS. Before letting a VFIO CXL variant driver create a new CXL
> region, the VFIO CXL core first needs to discover the precommitted CXL
> region.
This is similar to the discussion in Alejandro's type 2 series.
Before we put infrastructure in place for handling BIOS precommitting, I'd
like some discussion of why they are doing that.
There are a few possible reasons, but in at least some cases I suspect it
is a misguided attempt to set things up that the BIOS should be leaving
well alone. I'd also be curious to hear whether the decoders are locked
or not on the systems you've seen it on, i.e. can we rip it down
and start again?
I'm definitely not saying we should not support this, but I want
people to enumerate the reasons they need it.
>
> Discover the precommitted CXL region when enabling CXL devices.
>
> Signed-off-by: Zhi Wang <zhiw@nvidia.com>
> Signed-off-by: Manish Honap <mhonap@nvidia.com>
end of thread, other threads:[~2025-12-22 14:09 UTC | newest]
Thread overview: 25+ messages
2025-12-09 16:50 [RFC v2 00/15] vfio: introduce vfio-cxl to support CXL type-2 accelerator passthrough Hello all, mhonap
2025-12-09 16:50 ` [RFC v2 01/15] cxl: factor out cxl_await_range_active() and cxl_media_ready() mhonap
2025-12-22 12:21 ` Jonathan Cameron
2025-12-09 16:50 ` [RFC v2 02/15] cxl: introduce cxl_get_hdm_reg_info() mhonap
2025-12-09 16:50 ` [RFC v2 03/15] cxl: introduce cxl_find_comp_reglock_offset() mhonap
2025-12-09 16:50 ` [RFC v2 04/15] cxl: introduce devm_cxl_del_memdev() mhonap
2025-12-09 16:50 ` [RFC v2 05/15] cxl: introduce cxl_get_committed_regions() mhonap
2025-12-22 12:31 ` Jonathan Cameron
2025-12-09 16:50 ` [RFC v2 06/15] vfio/cxl: introduce vfio-cxl core preludes mhonap
2025-12-22 13:54 ` Jonathan Cameron
2025-12-09 16:50 ` [RFC v2 07/15] vfio/cxl: expose CXL region to the userspace via a new VFIO device region mhonap
2025-12-11 16:06 ` Dave Jiang
2025-12-11 17:31 ` Manish Honap
2025-12-11 18:01 ` Dave Jiang
2025-12-22 14:00 ` Jonathan Cameron
2025-12-09 16:50 ` [RFC v2 08/15] vfio/cxl: discover precommitted CXL region mhonap
2025-12-22 14:09 ` Jonathan Cameron
2025-12-09 16:50 ` [RFC v2 09/15] vfio/cxl: introduce vfio_cxl_core_{read, write}() mhonap
2025-12-09 16:50 ` [RFC v2 10/15] vfio/cxl: introduce the register emulation framework mhonap
2025-12-09 16:50 ` [RFC v2 11/15] vfio/cxl: introduce the emulation of HDM registers mhonap
2025-12-11 18:13 ` Dave Jiang
2025-12-09 16:50 ` [RFC v2 12/15] vfio/cxl: introduce the emulation of CXL configuration space mhonap
2025-12-09 16:50 ` [RFC v2 13/15] vfio/pci: introduce CXL device awareness mhonap
2025-12-09 16:50 ` [RFC v2 14/15] vfio/cxl: VFIO variant driver for QEMU CXL accel device mhonap
2025-12-09 16:50 ` [RFC v2 15/15] cxl/mem: Fix NULL pointer deference in memory device paths mhonap