From: <mhonap@nvidia.com>
To: <aniketa@nvidia.com>, <ankita@nvidia.com>,
	<alwilliamson@nvidia.com>, <vsethi@nvidia.com>, <jgg@nvidia.com>,
	<mochs@nvidia.com>, <skolothumtho@nvidia.com>,
	<alejandro.lucero-palau@amd.com>, <dave@stgolabs.net>,
	<jonathan.cameron@huawei.com>, <dave.jiang@intel.com>,
	<alison.schofield@intel.com>, <vishal.l.verma@intel.com>,
	<ira.weiny@intel.com>, <dan.j.williams@intel.com>, <jgg@ziepe.ca>,
	<yishaih@nvidia.com>, <kevin.tian@intel.com>
Cc: <cjia@nvidia.com>, <kwankhede@nvidia.com>, <targupta@nvidia.com>,
	<zhiw@nvidia.com>, <kjaju@nvidia.com>,
	<linux-kernel@vger.kernel.org>, <linux-cxl@vger.kernel.org>,
	<kvm@vger.kernel.org>, <mhonap@nvidia.com>
Subject: [RFC v2 00/15] vfio: introduce vfio-cxl to support CXL type-2 accelerator passthrough
Date: Tue, 9 Dec 2025 22:20:04 +0530
Message-ID: <20251209165019.2643142-1-mhonap@nvidia.com>

From: Manish Honap <mhonap@nvidia.com>

Hello all,

This is a re-spin of the VFIO-CXL patches Zhi sent earlier[1], rebased
onto v20 of Alejandro's "CXL type-2 device support" series[3].

This patchset only modifies Zhi's RFC v1 to align with the CXL type-2
device support currently under upstream review. In the next version, I
will reorganize the series to first create a VFIO variant driver for the
QEMU-emulated CXL accelerator device and then implement the features
incrementally for ease of review. This will create a logical separation
between the code required in vfio-cxl-core, the CXL core, and the
variant driver, and will also help reviewers understand the delta
between CXL-specific initialization and standard PCI initialization.

V2 changes:
===========

- Address all the comments from Alex on RFC V1.

- Besides addressing Alex's comments, this series also takes care of the
  leftovers:
	- Introduce an emulation framework.
	- Proper CXL DVSEC configuration emulation.
	- Proper CXL MMIO BAR emulation.
	- Re-use the PCI mmap ops in vfio-pci for the CXL region.
	- Introduce a sparse map for the CXL MMIO BAR.
	- Big/little-endian considerations.
	- Correct teardown path (missing previously because the earlier
	  CXL type-2 patches did not provide one).

- Refine the APIs and architecture of VFIO CXL.
	- Configurable parameters for hardware.
	- Pre-committed CXL regions.
	- PCI DOE register passthrough (for CDAT).
	- Media-ready support (SFC does not need this).

- Introduce new changes to the CXL core.
	- Teardown path for the CXL memdev.
	- Committed regions.
	- Media ready for CXL type-2 devices.

- Update the sample driver to the latest VFIO-CXL APIs.

Patchwise description:
======================

PATCH 1-5: Expose the necessary routines required by vfio-cxl.

PATCH 6: Introduce the preludes of vfio-cxl, including CXL device
initialization and CXL region creation, following the type-2 device
initialization flow in v20 of [3].

PATCH 7: Expose the CXL region to the userspace.

PATCH 8: Add logic to discover precommitted CXL region.

PATCH 9: Introduce vfio-cxl read/write routines.

PATCH 10: Prepare to emulate the HDM decoder registers (a rough sketch
of the emulation-framework idea follows this patch list).

PATCH 11: Emulate the HDM decoder registers.

PATCH 12: Emulate CXL configuration space.

PATCH 13: Tweak vfio-pci-core to be aware of working on a CXL device.

PATCH 14: An example variant driver to demonstrate the usage of
vfio-cxl-core from the perspective of a VFIO variant driver.

PATCH 15: NULL pointer dereference fixes found during testing.
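
PATCHes 10-12 build up a small register-emulation framework. As a rough
illustration of the idea only (the structure and function names below
are hypothetical placeholders, not the series' actual code), such a
framework maps register offsets to read/write handlers and falls back
to the underlying vfio-pci path for anything it does not intercept:

#include <linux/types.h>

/* Hypothetical emulation entry: one handled register per entry. */
struct emu_reg {
	u32 offset;			/* offset within the emulated range */
	u32 size;			/* access width in bytes */
	u64 (*read)(void *priv, u32 off);
	void (*write)(void *priv, u32 off, u64 val);
};

/*
 * Dispatch a guest access to a matching handler.  Returns true when the
 * access was emulated, false when it should fall through to the real
 * device via the vfio-pci-core read/write path.
 */
static bool emu_access(const struct emu_reg *regs, int nr, void *priv,
		       u32 off, u64 *val, bool is_write)
{
	int i;

	for (i = 0; i < nr; i++) {
		if (off < regs[i].offset ||
		    off >= regs[i].offset + regs[i].size)
			continue;
		if (is_write)
			regs[i].write(priv, off, *val);
		else
			*val = regs[i].read(priv, off);
		return true;
	}
	return false;
}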

Background:
===========

Compute Express Link (CXL) is an open standard interconnect built upon
the industry-standard PCI layers to enhance the performance and
efficiency of data centers by enabling high-speed, low-latency
communication between CPUs and various types of devices such as
accelerators and memory expanders.

It supports three key protocols: CXL.io as the control protocol, CXL.cache
as the cache-coherent host-device data transfer protocol, and CXL.mem as
the memory-expansion protocol. CXL type-2 devices leverage all three
protocols to integrate seamlessly with host CPUs, providing a unified and
efficient interface for high-speed data transfer and memory sharing. This
integration is crucial for heterogeneous computing environments where
accelerators such as GPUs and other specialized processors handle
intensive workloads.

Goal:
=====

Although CXL is built upon the PCI layers, passing through a CXL type-2
device differs from passing through a PCI device according to the CXL
specification[4]:

- CXL type-2 device initialization. A CXL type-2 device requires an
additional initialization sequence on top of the PCI device
initialization. This sequence can be fairly involved due to the
hierarchy of register interfaces, so the standard CXL type-2 driver
initialization sequence provided by the kernel CXL core is used.

- Create a CXL region and map it to the VM. A mapping between the host
physical address (HPA) and the device physical address (DPA) needs to be
created to access the device memory directly. HDM decoders in the CXL
topology need to be configured level by level to establish the mapping.
After the region is created, it needs to be mapped to GPA via the
virtual HDM decoders configured by the VM.

- CXL reset. The CXL device reset is different from the PCI device
reset; the CXL spec introduces a dedicated CXL reset sequence.

- Emulating CXL DVSECs. The CXL spec defines a set of DVSEC registers in
configuration space for device enumeration and device control (e.g.
whether a device is capable of CXL.mem/CXL.cache, and enabling or
disabling those capabilities). They are owned by the kernel CXL core,
and the VM must not modify them.

- Emulating CXL MMIO registers. The CXL spec defines a set of CXL MMIO
registers that can sit in a PCI BAR. The location of each register group
within the BAR is indicated by the Register Locator DVSEC. These
registers are also owned by the kernel CXL core, and some of them (e.g.
the HDM decoders, sketched after this list) need to be emulated.
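
For reference, below is a minimal sketch of what committing a single HDM
decoder looks like at the register level. It assumes the CXL 2.0
component-register layout (base and size split into 32-bit halves,
commit/committed bits in the decoder control register); the offsets and
bit positions are illustrative, not code from this series:

#include <linux/bits.h>
#include <linux/delay.h>
#include <linux/errno.h>
#include <linux/io.h>
#include <linux/kernel.h>

/* Decoder 0 offsets/bits per the CXL 2.0 HDM decoder layout. */
#define HDM_DEC0_BASE_LO	0x10
#define HDM_DEC0_BASE_HI	0x14
#define HDM_DEC0_SIZE_LO	0x18
#define HDM_DEC0_SIZE_HI	0x1c
#define HDM_DEC0_CTRL		0x20
#define HDM_DEC0_CTRL_COMMIT	BIT(9)
#define HDM_DEC0_CTRL_COMMITTED	BIT(10)

/* Program an HPA->DPA mapping into decoder 0 and wait for commit. */
static int commit_decoder0(void __iomem *hdm, u64 hpa_base, u64 size)
{
	int retries = 1000;

	writel(lower_32_bits(hpa_base), hdm + HDM_DEC0_BASE_LO);
	writel(upper_32_bits(hpa_base), hdm + HDM_DEC0_BASE_HI);
	writel(lower_32_bits(size), hdm + HDM_DEC0_SIZE_LO);
	writel(upper_32_bits(size), hdm + HDM_DEC0_SIZE_HI);
	writel(HDM_DEC0_CTRL_COMMIT, hdm + HDM_DEC0_CTRL);

	/* Hardware sets COMMITTED once the mapping is active. */
	while (!(readl(hdm + HDM_DEC0_CTRL) & HDM_DEC0_CTRL_COMMITTED)) {
		if (!retries--)
			return -ETIMEDOUT;
		udelay(10);
	}

	return 0;
}

vfio-cxl emulates this interface so the VM sees its own decoder commit
succeed, while the real decoders stay under host CXL core control.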

Design:
=======

To achieve the above, vfio-cxl-core is introduced to host the common
routines that a variant driver requires for device passthrough. As with
vfio-pci-core, vfio-cxl-core provides common vfio_device_ops routines
for the variant driver to hook into, performing the CXL work behind
them.

Besides that, several extra APIs are introduced for the variant driver
to provide the kernel CXL core with the information necessary to
initialize the CXL device, e.g. the device DPA. A minimal sketch of this
pattern follows.
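
As a sketch of that pattern only: a variant driver fills
struct vfio_device_ops mostly with vfio-cxl-core helpers, just as
vfio-pci variant drivers do with vfio-pci-core. Except for
vfio_cxl_core_read()/vfio_cxl_core_write() (introduced in PATCH 9), the
vfio_cxl_core_* names below are hypothetical placeholders:

#include <linux/vfio.h>
#include <linux/vfio_pci_core.h>

/*
 * Hypothetical variant driver glue; mirrors the vfio-pci-core pattern.
 * The open path is where the driver would hand the kernel CXL core the
 * device-specific information (e.g. DPA size) before enabling things.
 */
static int my_accel_open_device(struct vfio_device *vdev)
{
	/* Device-specific setup would go here (illustrative). */
	return vfio_cxl_core_open_device(vdev);	/* hypothetical */
}

static const struct vfio_device_ops my_accel_ops = {
	.name		= "my-cxl-accel",
	.open_device	= my_accel_open_device,
	.close_device	= vfio_cxl_core_close_device,	/* hypothetical */
	.read		= vfio_cxl_core_read,		/* PATCH 9 */
	.write		= vfio_cxl_core_write,		/* PATCH 9 */
	.mmap		= vfio_cxl_core_mmap,		/* hypothetical */
	.ioctl		= vfio_pci_core_ioctl,
};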

CXL is built upon the PCI layers, but with differences. Thus,
vfio-pci-core is re-used as much as possible, with added awareness of
operating on a CXL device.

A new VFIO device region is introduced to expose the CXL region to the
userspace, and a new CXL VFIO device capability has been introduced to
convey the necessary CXL device information to the userspace.
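
From userspace, the new region would be discovered like any other
device-specific region, via VFIO_DEVICE_GET_REGION_INFO and its
capability chain. A rough sketch, where VFIO_REGION_INFO_CAP_CXL is a
placeholder for the capability ID actually defined in this series' uapi
change:

#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Placeholder: the real ID comes from this series' uapi addition. */
#define VFIO_REGION_INFO_CAP_CXL	11

static int query_region(int device_fd, unsigned int index)
{
	struct vfio_region_info info = {
		.argsz = sizeof(info),
		.index = index,
	};

	if (ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &info))
		return -1;

	/*
	 * When VFIO_REGION_INFO_FLAG_CAPS is set and argsz grew, re-issue
	 * the ioctl with a larger buffer and walk the capability chain
	 * from info.cap_offset to find the CXL capability.
	 */
	if (info.flags & VFIO_REGION_INFO_FLAG_CAPS)
		printf("region %u: caps present, argsz %u\n",
		       index, info.argsz);

	return 0;
}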

Test:
=====

To test the patches and experiment with them, a virtual passthrough
setup based on nested virtualization is used.

The host QEMU[5] emulates a CXL type-2 accel device based on Ira's
patches, with changes to emulate the HDM decoders.

While running vfio-cxl in the L1 guest, an example VFIO variant driver
is used to attach to the QEMU CXL accel device.

The L2 guest can be booted via QEMU with the vfio-cxl support in the
VFIOStub.

In the L2 guest, a dummy CXL device driver is provided to attach to the
virtual passthrough device.

With the kernel CXL core type-2 support, the dummy CXL type-2 device
driver loads successfully and creates a CXL region by requesting the
CXL core to allocate HPA and DPA and to configure the HDM decoders.

To make sure everyone can test the patches, the kernel configs for L1
and L2 are provided in the repos; the required kernel command-line
parameters and the QEMU command line can be found in the demonstration
video[6].

Repos:
======

QEMU host:
https://github.com/zhiwang-nvidia/qemu/tree/zhi/vfio-cxl-qemu-host
L1 Kernel:
https://github.com/mhonap-nvidia/vfio-cxl-l1-kernel-rfc-v2/tree/vfio-cxl-l1-kernel-rfc-v2
L1 QEMU:
https://github.com/zhiwang-nvidia/qemu/tree/zhi/vfio-cxl-qemu-l1-rfc
L2 Kernel:
https://github.com/zhiwang-nvidia/linux/tree/zhi/vfio-cxl-l2


Feedback expected:
==================

- The architectural split between vfio-pci-core and vfio-cxl-core.
- Variant driver requirements from more hardware vendors.
- vfio-cxl-core UABI to QEMU.

Applying patches:
=================

This patchset should be applied on top of v6.18-rc2, along with the
patches from [2] and [3].

[1] https://lore.kernel.org/all/20240920223446.1908673-1-zhiw@nvidia.com/
[2] https://patchew.org/QEMU/20230517-rfc-type2-dev-v1-0-6eb2e470981b@intel.com/
[3] https://lore.kernel.org/linux-cxl/20251110153657.2706192-1-alejandro.lucero-palau@amd.com/
[4] https://computeexpresslink.org/cxl-specification/
[5] https://lore.kernel.org/linux-cxl/20251104170305.4163840-1-terry.bowman@amd.com/
[6] https://youtu.be/zlk_ecX9bxs?si=hc8P58AdhGXff3Q7

Manish Honap (15):
  cxl: factor out cxl_await_range_active() and cxl_media_ready()
  cxl: introduce cxl_get_hdm_reg_info()
  cxl: introduce cxl_find_comp_reglock_offset()
  cxl: introduce devm_cxl_del_memdev()
  cxl: introduce cxl_get_committed_regions()
  vfio/cxl: introduce vfio-cxl core preludes
  vfio/cxl: expose CXL region to the userspace via a new VFIO device
    region
  vfio/cxl: discover precommitted CXL region
  vfio/cxl: introduce vfio_cxl_core_{read, write}()
  vfio/cxl: introduce the register emulation framework
  vfio/cxl: introduce the emulation of HDM registers
  vfio/cxl: introduce the emulation of CXL configuration space
  vfio/pci: introduce CXL device awareness
  vfio/cxl: VFIO variant driver for QEMU CXL accel device
  cxl: NULL checks for CXL memory devices

 drivers/cxl/core/memdev.c             |   8 +-
 drivers/cxl/core/pci.c                |  46 +-
 drivers/cxl/core/pci_drv.c            |   3 +-
 drivers/cxl/core/region.c             |  73 +++
 drivers/cxl/core/regs.c               |  22 +
 drivers/cxl/cxlmem.h                  |   3 +-
 drivers/cxl/mem.c                     |   3 +
 drivers/vfio/pci/Kconfig              |  13 +
 drivers/vfio/pci/Makefile             |   5 +
 drivers/vfio/pci/cxl-accel/Kconfig    |   9 +
 drivers/vfio/pci/cxl-accel/Makefile   |   4 +
 drivers/vfio/pci/cxl-accel/main.c     | 143 +++++
 drivers/vfio/pci/vfio_cxl_core.c      | 695 +++++++++++++++++++++++
 drivers/vfio/pci/vfio_cxl_core_emu.c  | 778 ++++++++++++++++++++++++++
 drivers/vfio/pci/vfio_cxl_core_priv.h |  17 +
 drivers/vfio/pci/vfio_pci_core.c      |  41 +-
 drivers/vfio/pci/vfio_pci_rdwr.c      |  11 +-
 include/cxl/cxl.h                     |   9 +
 include/linux/vfio_pci_core.h         |  96 ++++
 include/uapi/linux/vfio.h             |  14 +
 tools/testing/cxl/Kbuild              |   3 +-
 tools/testing/cxl/test/mock.c         |  21 +-
 22 files changed, 1992 insertions(+), 25 deletions(-)
 create mode 100644 drivers/vfio/pci/cxl-accel/Kconfig
 create mode 100644 drivers/vfio/pci/cxl-accel/Makefile
 create mode 100644 drivers/vfio/pci/cxl-accel/main.c
 create mode 100644 drivers/vfio/pci/vfio_cxl_core.c
 create mode 100644 drivers/vfio/pci/vfio_cxl_core_emu.c
 create mode 100644 drivers/vfio/pci/vfio_cxl_core_priv.h

--
2.25.1


