[RFC 00/13] vfio: introduce vfio-cxl to support CXL type-2 accelerator passthrough

public inbox for kvm@vger.kernel.org
 help / color / mirror / Atom feed

* [RFC 00/13] vfio: introduce vfio-cxl to support CXL type-2 accelerator passthrough
@ 2024-09-20 22:34 Zhi Wang
  2024-09-20 22:34 ` [RFC 01/13] cxl: allow a type-2 device not to have memory device registers Zhi Wang
                   ` (16 more replies)
  0 siblings, 17 replies; 38+ messages in thread
From: Zhi Wang @ 2024-09-20 22:34 UTC (permalink / raw)
  To: kvm, linux-cxl
  Cc: alex.williamson, kevin.tian, jgg, alison.schofield,
	dan.j.williams, dave.jiang, dave, jonathan.cameron, ira.weiny,
	vishal.l.verma, alucerop, acurrid, cjia, smitra, ankita, aniketa,
	kwankhede, targupta, zhiw, zhiwang

Hi folks:

As promised in the LPC, here are all you need (patches, repos, guiding
video, kernel config) to build a environment to test the vfio-cxl-core.

Thanks so much for the discussions! Enjoy and see you in the next one.

Background
==========

Compute Express Link (CXL) is an open standard interconnect built upon
industrial PCI layers to enhance the performance and efficiency of data
centers by enabling high-speed, low-latency communication between CPUs
and various types of devices such as accelerators, memory.

It supports three key protocols: CXL.io as the control protocol, CXL.cache
as the cache-coherent host-device data transfer protocol, and CXL.mem as
memory expansion protocol. CXL Type 2 devices leverage the three protocols
to seamlessly integrate with host CPUs, providing a unified and efficient
interface for high-speed data transfer and memory sharing. This integration
is crucial for heterogeneous computing environments where accelerators,
such as GPUs, and other specialized processors, are used to handle
intensive workloads.

Goal
====

Although CXL is built upon the PCI layers, passing a CXL type-2 device can
be different than PCI devices according to CXL specification[1]:

- CXL type-2 device initialization. CXL type-2 device requires an
additional initialization sequence besides the PCI device initialization.
CXL type-2 device initialization can be pretty complicated due to its
hierarchy of register interfaces. Thus, a standard CXL type-2 driver
initialization sequence provided by the kernel CXL core is used.

- Create a CXL region and map it to the VM. A mapping between HPA and DPA
(Device PA) needs to be created to access the device memory directly. HDM
decoders in the CXL topology need to be configured level by level to
manage the mapping. After the region is created, it needs to be mapped to
GPA in the virtual HDM decoders configured by the VM.

- CXL reset. The CXL device reset is different from the PCI device reset.
A CXL reset sequence is introduced by the CXL spec.

- Emulating CXL DVSECs. CXL spec defines a set of DVSECs registers in the
configuration for device enumeration and device control. (E.g. if a device
is capable of CXL.mem CXL.cache, enable/disable capability) They are owned
by the kernel CXL core, and the VM can not modify them.

- Emulate CXL MMIO registers. CXL spec defines a set of CXL MMIO registers
that can sit in a PCI BAR. The location of register groups sit in the PCI
BAR is indicated by the register locator in the CXL DVSECs. They are also
owned by the kernel CXL core. Some of them need to be emulated.

Design
======

To achieve the purpose above, the vfio-cxl-core is introduced to host the
common routines that variant driver requires for device passthrough.
Similar with the vfio-pci-core, the vfio-cxl-core provides common
routines of vfio_device_ops for the variant driver to hook and perform the
CXL routines behind it.

Besides, several extra APIs are introduced for the variant driver to
provide the necessary information the kernel CXL core to initialize
the CXL device. E.g., Device DPA.

CXL is built upon the PCI layers but with differences. Thus, the
vfio-pci-core is aimed to be re-used as much as possible with the
awareness of operating on a CXL device.

A new VFIO device region is introduced to expose the CXL region to the
userspace. A new CXL VFIO device cap has also been introduced to convey
the necessary CXL device information to the userspace.

Patches
=======

The patches are based on the cxl-type2 support RFCv2 patchset[2]. Will
rebase them to V3 once the cxl-type2 support v3 patch review is done.

PATCH 1-3: Expose the necessary routines required by vfio-cxl.

PATCH 4: Introduce the preludes of vfio-cxl, including CXL device
initialization, CXL region creation.

PATCH 5: Expose the CXL region to the userspace.

PATCH 6-7: Prepare to emulate the HDM decoder registers.

PATCH 8: Emulate the HDM decoder registers.

PATCH 9: Tweak vfio-cxl to be aware of working on a CXL device.

PATCH 10: Tell vfio-pci-core to emulate CXL DVSECs.

PATCH 11: Expose the CXL device information that userspace needs.

PATCH 12: An example variant driver to demonstrate the usage of
vfio-cxl-core from the perspective of the VFIO variant driver.

PATCH 13: A workaround needs suggestions.

Test
====

To test the patches and hack around, a virtual passthrough with nested
virtualization approach is used.

The host QEMU emulates a CXL type-2 accel device based on Ira's patches
with the changes to emulate HDM decoders.

While running the vfio-cxl in the L1 guest, an example VFIO variant
driver is used to attach with the QEMU CXL access device.

The L2 guest can be booted via the QEMU with the vfio-cxl support in the
VFIOStub.

In the L2 guest, a dummy CXL device driver is provided to attach to the
virtual pass-thru device.

The dummy CXL type-2 device driver can successfully be loaded with the
kernel cxl core type2 support, create CXL region by requesting the CXL
core to allocate HPA and DPA and configure the HDM decoders.

To make sure everyone can test the patches, the kernel config of L1 and
L2 are provided in the repos, the required kernel command params and
qemu command line can be found from the demostration video.[5]

Repos
=====

QEMU host: https://github.com/zhiwang-nvidia/qemu/tree/zhi/vfio-cxl-qemu-host
L1 Kernel: https://github.com/zhiwang-nvidia/linux/tree/zhi/vfio-cxl-l1-kernel-rfc
L1 QEMU: https://github.com/zhiwang-nvidia/qemu/tree/zhi/vfio-cxl-qemu-l1-rfc
L2 Kernel: https://github.com/zhiwang-nvidia/linux/tree/zhi/vfio-cxl-l2

[1] https://computeexpresslink.org/cxl-specification/
[2] https://lore.kernel.org/netdev/20240715172835.24757-1-alejandro.lucero-palau@amd.com/T/
[3] https://patchew.org/QEMU/20230517-rfc-type2-dev-v1-0-6eb2e470981b@intel.com/
[4] https://youtu.be/zlk_ecX9bxs?si=hc8P58AdhGXff3Q7

Feedback expected
=================

- Archtiecture level between vfio-pci-core and vfio-cxl-core.
- Variant driver requirements from more hardware vendors.
- vfio-cxl-core UABI to QEMU.

Moving foward
=============

- Rebase the patches on top of Alejandro's PATCH v3.
- Get Ira's type-2 emulated device patch into upstream as CXL folks and RH
  folks both came to talk and expect this. I had a chat with Ira and he
  expected me to take it over. Will start a discussion in the CXL discord
  group for the desgin of V1.
- Sparse map in vfio-cxl-core.

Known issues
============

- Teardown path. Missing teardown paths have been implements in Alejandor's
  PATCH v3. It should be solved after the rebase.

- Powerdown L1 guest instead of reboot it. The QEMU reset handler is missing
  in the Ira's patch. When rebooting L1, many CXL registers are not reset.
  This will be addressed in the formal review of emulated CXL type-2 device
  support.

Zhi Wang (13):
  cxl: allow a type-2 device not to have memory device registers
  cxl: introduce cxl_get_hdm_info()
  cxl: introduce cxl_find_comp_reglock_offset()
  vfio: introduce vfio-cxl core preludes
  vfio/cxl: expose CXL region to the usersapce via a new VFIO device
    region
  vfio/pci: expose vfio_pci_rw()
  vfio/cxl: introduce vfio_cxl_core_{read, write}()
  vfio/cxl: emulate HDM decoder registers
  vfio/pci: introduce CXL device awareness
  vfio/pci: emulate CXL DVSEC registers in the configuration space
  vfio/cxl: introduce VFIO CXL device cap
  vfio/cxl: VFIO variant driver for QEMU CXL accel device
  vfio/cxl: workaround: don't take resource region when cxl is enabled.

 drivers/cxl/core/pci.c              |  28 ++
 drivers/cxl/core/regs.c             |  22 +
 drivers/cxl/cxl.h                   |   1 +
 drivers/cxl/cxlpci.h                |   3 +
 drivers/cxl/pci.c                   |  14 +-
 drivers/vfio/pci/Kconfig            |   6 +
 drivers/vfio/pci/Makefile           |   5 +
 drivers/vfio/pci/cxl-accel/Kconfig  |   6 +
 drivers/vfio/pci/cxl-accel/Makefile |   3 +
 drivers/vfio/pci/cxl-accel/main.c   | 116 +++++
 drivers/vfio/pci/vfio_cxl_core.c    | 647 ++++++++++++++++++++++++++++
 drivers/vfio/pci/vfio_pci_config.c  |  10 +
 drivers/vfio/pci/vfio_pci_core.c    |  79 +++-
 drivers/vfio/pci/vfio_pci_rdwr.c    |   8 +-
 include/linux/cxl_accel_mem.h       |   3 +
 include/linux/cxl_accel_pci.h       |   6 +
 include/linux/vfio_pci_core.h       |  53 +++
 include/uapi/linux/vfio.h           |  14 +
 18 files changed, 992 insertions(+), 32 deletions(-)
 create mode 100644 drivers/vfio/pci/cxl-accel/Kconfig
 create mode 100644 drivers/vfio/pci/cxl-accel/Makefile
 create mode 100644 drivers/vfio/pci/cxl-accel/main.c
 create mode 100644 drivers/vfio/pci/vfio_cxl_core.c

-- 
2.34.1

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [RFC 01/13] cxl: allow a type-2 device not to have memory device registers
  2024-09-20 22:34 [RFC 00/13] vfio: introduce vfio-cxl to support CXL type-2 accelerator passthrough Zhi Wang
@ 2024-09-20 22:34 ` Zhi Wang
  2024-09-23  8:01   ` Tian, Kevin
  2024-09-23 15:38   ` Dave Jiang
  2024-09-20 22:34 ` [RFC 02/13] cxl: introduce cxl_get_hdm_info() Zhi Wang
                   ` (15 subsequent siblings)
  16 siblings, 2 replies; 38+ messages in thread
From: Zhi Wang @ 2024-09-20 22:34 UTC (permalink / raw)
  To: kvm, linux-cxl
  Cc: alex.williamson, kevin.tian, jgg, alison.schofield,
	dan.j.williams, dave.jiang, dave, jonathan.cameron, ira.weiny,
	vishal.l.verma, alucerop, acurrid, cjia, smitra, ankita, aniketa,
	kwankhede, targupta, zhiw, zhiwang

CXL memory device registers provide additional information about device
memory and advanced control interface for type-3 device.

However, it is not mandatory for a type-2 device. A type-2 device can
have HDMs but not CXL memory device registers.

Allow a type-2 device not to hanve memory device register when probing
CXL registers.

Signed-off-by: Zhi Wang <zhiw@nvidia.com>
---
 drivers/cxl/pci.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index e00ce7f4d0f9..3fbee31995f1 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -529,13 +529,13 @@ int cxl_pci_accel_setup_regs(struct pci_dev *pdev, struct cxl_dev_state *cxlds)
 	int rc;
 
 	rc = cxl_pci_setup_regs(pdev, CXL_REGLOC_RBI_MEMDEV, &map,
-				cxlds->capabilities);
-	if (rc)
-		return rc;
-
-	rc = cxl_map_device_regs(&map, &cxlds->regs.device_regs);
-	if (rc)
-		return rc;
+			cxlds->capabilities);
+	if (!rc) {
+		rc = cxl_map_device_regs(&map, &cxlds->regs.device_regs);
+		if (rc)
+			dev_dbg(&pdev->dev,
+				"Failed to map device registers.\n");
+	}
 
 	rc = cxl_pci_setup_regs(pdev, CXL_REGLOC_RBI_COMPONENT,
 				&cxlds->reg_map, cxlds->capabilities);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC 02/13] cxl: introduce cxl_get_hdm_info()
  2024-09-20 22:34 [RFC 00/13] vfio: introduce vfio-cxl to support CXL type-2 accelerator passthrough Zhi Wang
  2024-09-20 22:34 ` [RFC 01/13] cxl: allow a type-2 device not to have memory device registers Zhi Wang
@ 2024-09-20 22:34 ` Zhi Wang
  2024-10-17 15:44   ` Jonathan Cameron
  2024-09-20 22:34 ` [RFC 03/13] cxl: introduce cxl_find_comp_reglock_offset() Zhi Wang
                   ` (14 subsequent siblings)
  16 siblings, 1 reply; 38+ messages in thread
From: Zhi Wang @ 2024-09-20 22:34 UTC (permalink / raw)
  To: kvm, linux-cxl
  Cc: alex.williamson, kevin.tian, jgg, alison.schofield,
	dan.j.williams, dave.jiang, dave, jonathan.cameron, ira.weiny,
	vishal.l.verma, alucerop, acurrid, cjia, smitra, ankita, aniketa,
	kwankhede, targupta, zhiw, zhiwang

CXL core has the information of what CXL register groups a device has.
When initializing the device, the CXL core probes the register groups
and saves the information. The probing sequence is quite complicated.

vfio-cxl requires the HDM register information to emualte the HDM decoder
registers.

Introduce cxl_get_hdm_info() for vfio-cxl to leverage the HDM
register information in the CXL core. Thus, it doesn't need to implement
its own probing sequence.

Signed-off-by: Zhi Wang <zhiw@nvidia.com>
---
 drivers/cxl/core/pci.c        | 28 ++++++++++++++++++++++++++++
 drivers/cxl/cxlpci.h          |  3 +++
 include/linux/cxl_accel_mem.h |  2 ++
 3 files changed, 33 insertions(+)

diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index a663e7566c48..7b6c2b6211b3 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -502,6 +502,34 @@ int cxl_hdm_decode_init(struct cxl_dev_state *cxlds, struct cxl_hdm *cxlhdm,
 }
 EXPORT_SYMBOL_NS_GPL(cxl_hdm_decode_init, CXL);
 
+int cxl_get_hdm_info(struct cxl_dev_state *cxlds, u32 *hdm_count,
+		     u64 *hdm_reg_offset, u64 *hdm_reg_size)
+{
+	struct pci_dev *pdev = to_pci_dev(cxlds->dev);
+	int d = cxlds->cxl_dvsec;
+	u16 cap;
+	int rc;
+
+	if (!cxlds->reg_map.component_map.hdm_decoder.valid) {
+		*hdm_reg_offset = *hdm_reg_size = 0;
+	} else {
+		struct cxl_component_reg_map *map =
+			&cxlds->reg_map.component_map;
+
+		*hdm_reg_offset = map->hdm_decoder.offset;
+		*hdm_reg_size = map->hdm_decoder.size;
+	}
+
+	rc = pci_read_config_word(pdev,
+				  d + CXL_DVSEC_CAP_OFFSET, &cap);
+	if (rc)
+		return rc;
+
+	*hdm_count = FIELD_GET(CXL_DVSEC_HDM_COUNT_MASK, cap);
+	return 0;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_get_hdm_info, CXL);
+
 #define CXL_DOE_TABLE_ACCESS_REQ_CODE		0x000000ff
 #define   CXL_DOE_TABLE_ACCESS_REQ_CODE_READ	0
 #define CXL_DOE_TABLE_ACCESS_TABLE_TYPE		0x0000ff00
diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
index 4da07727ab9c..8d4458f7e45b 100644
--- a/drivers/cxl/cxlpci.h
+++ b/drivers/cxl/cxlpci.h
@@ -129,4 +129,7 @@ void read_cdat_data(struct cxl_port *port);
 void cxl_cor_error_detected(struct pci_dev *pdev);
 pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
 				    pci_channel_state_t state);
+
+int cxl_get_hdm_info(struct cxl_dev_state *cxlds, u32 *hdm_count,
+		     u64 *hdm_reg_offset, u64 *hdm_reg_size);
 #endif /* __CXL_PCI_H__ */
diff --git a/include/linux/cxl_accel_mem.h b/include/linux/cxl_accel_mem.h
index 5d715eea6e91..db4182fc1936 100644
--- a/include/linux/cxl_accel_mem.h
+++ b/include/linux/cxl_accel_mem.h
@@ -55,4 +55,6 @@ struct cxl_region *cxl_create_region(struct cxl_root_decoder *cxlrd,
 int cxl_region_detach(struct cxl_endpoint_decoder *cxled);
 int cxl_accel_get_region_params(struct cxl_region *region,
 				resource_size_t *start, resource_size_t *end);
+int cxl_get_hdm_info(struct cxl_dev_state *cxlds, u32 *hdm_count,
+		     u64 *hdm_reg_offset, u64 *hdm_reg_size);
 #endif
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC 03/13] cxl: introduce cxl_find_comp_reglock_offset()
  2024-09-20 22:34 [RFC 00/13] vfio: introduce vfio-cxl to support CXL type-2 accelerator passthrough Zhi Wang
  2024-09-20 22:34 ` [RFC 01/13] cxl: allow a type-2 device not to have memory device registers Zhi Wang
  2024-09-20 22:34 ` [RFC 02/13] cxl: introduce cxl_get_hdm_info() Zhi Wang
@ 2024-09-20 22:34 ` Zhi Wang
  2024-09-20 22:34 ` [RFC 04/13] vfio: introduce vfio-cxl core preludes Zhi Wang
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 38+ messages in thread
From: Zhi Wang @ 2024-09-20 22:34 UTC (permalink / raw)
  To: kvm, linux-cxl
  Cc: alex.williamson, kevin.tian, jgg, alison.schofield,
	dan.j.williams, dave.jiang, dave, jonathan.cameron, ira.weiny,
	vishal.l.verma, alucerop, acurrid, cjia, smitra, ankita, aniketa,
	kwankhede, targupta, zhiw, zhiwang

CXL core has the information of what CXL register groups a device
has.When initializing the device, the CXL core probes the register
groups and saves the information. The probing sequence is quite
complicated.

vfio-cxl needs to handle the CXL MMIO BAR specially. E.g. emulate
the HDM decoder register inside the component registers. Thus it
requires to know the offset of the CXL component register to locate
the PCI BAR where the component register sits.

Introduce cxl_find_comp_regblock_offset() for vfio-cxl to leverage the
register information in the CXL core. Thus, it doesn't need to
implement its own probing sequence.

Signed-off-by: Zhi Wang <zhiw@nvidia.com>
---
 drivers/cxl/core/regs.c       | 22 ++++++++++++++++++++++
 drivers/cxl/cxl.h             |  1 +
 include/linux/cxl_accel_mem.h |  1 +
 3 files changed, 24 insertions(+)

diff --git a/drivers/cxl/core/regs.c b/drivers/cxl/core/regs.c
index 9d218ebe180d..7db3c8fcd66f 100644
--- a/drivers/cxl/core/regs.c
+++ b/drivers/cxl/core/regs.c
@@ -364,6 +364,28 @@ int cxl_find_regblock(struct pci_dev *pdev, enum cxl_regloc_type type,
 }
 EXPORT_SYMBOL_NS_GPL(cxl_find_regblock, CXL);
 
+/**
+ * cxl_find_comp_regblock_offset() - Locate the offset of component
+ * register blocks
+ * @pdev: The CXL PCI device to enumerate.
+ * @offset: Enumeration output, clobbered on error
+ *
+ * Return: 0 if register block enumerated, negative error code otherwise
+ */
+int cxl_find_comp_regblock_offset(struct pci_dev *pdev, u64 *offset)
+{
+	struct cxl_register_map map;
+	int ret;
+
+	ret = cxl_find_regblock(pdev, CXL_REGLOC_RBI_COMPONENT, &map);
+	if (ret)
+		return ret;
+
+	*offset = map.resource;
+	return 0;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_find_comp_regblock_offset, CXL);
+
 /**
  * cxl_count_regblock() - Count instances of a given regblock type.
  * @pdev: The CXL PCI device to enumerate.
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 5e2b5b3e8f38..33dfdc278b47 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -300,6 +300,7 @@ int cxl_find_regblock_instance(struct pci_dev *pdev, enum cxl_regloc_type type,
 			       struct cxl_register_map *map, int index);
 int cxl_find_regblock(struct pci_dev *pdev, enum cxl_regloc_type type,
 		      struct cxl_register_map *map);
+int cxl_find_comp_regblock_offset(struct pci_dev *pdev, u64 *offset);
 int cxl_setup_regs(struct cxl_register_map *map, uint8_t caps);
 struct cxl_dport;
 resource_size_t cxl_rcd_component_reg_phys(struct device *dev,
diff --git a/include/linux/cxl_accel_mem.h b/include/linux/cxl_accel_mem.h
index db4182fc1936..6f585aae7eb6 100644
--- a/include/linux/cxl_accel_mem.h
+++ b/include/linux/cxl_accel_mem.h
@@ -57,4 +57,5 @@ int cxl_accel_get_region_params(struct cxl_region *region,
 				resource_size_t *start, resource_size_t *end);
 int cxl_get_hdm_info(struct cxl_dev_state *cxlds, u32 *hdm_count,
 		     u64 *hdm_reg_offset, u64 *hdm_reg_size);
+int cxl_find_comp_regblock_offset(struct pci_dev *pdev, u64 *offset);
 #endif
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC 04/13] vfio: introduce vfio-cxl core preludes
  2024-09-20 22:34 [RFC 00/13] vfio: introduce vfio-cxl to support CXL type-2 accelerator passthrough Zhi Wang
                   ` (2 preceding siblings ...)
  2024-09-20 22:34 ` [RFC 03/13] cxl: introduce cxl_find_comp_reglock_offset() Zhi Wang
@ 2024-09-20 22:34 ` Zhi Wang
  2024-10-11 18:33   ` Alex Williamson
  2024-09-20 22:34 ` [RFC 05/13] vfio/cxl: expose CXL region to the usersapce via a new VFIO device region Zhi Wang
                   ` (12 subsequent siblings)
  16 siblings, 1 reply; 38+ messages in thread
From: Zhi Wang @ 2024-09-20 22:34 UTC (permalink / raw)
  To: kvm, linux-cxl
  Cc: alex.williamson, kevin.tian, jgg, alison.schofield,
	dan.j.williams, dave.jiang, dave, jonathan.cameron, ira.weiny,
	vishal.l.verma, alucerop, acurrid, cjia, smitra, ankita, aniketa,
	kwankhede, targupta, zhiw, zhiwang

In VFIO, common functions that used by VFIO variant drivers are managed
in a set of "core" functions. E.g. the vfio-pci-core provides the common
functions used by VFIO variant drviers to support PCI device
passhthrough.

Although the CXL type-2 device has a PCI-compatible interface for device
configuration and programming, they still needs special handlings when
initialize the device:

- Probing the CXL DVSECs in the configuration.
- Probing the CXL register groups implemented by the device.
- Configuring the CXL device state required by the kernel CXL core.
- Create the CXL region.
- Special handlings of the CXL MMIO BAR.

Introduce vfio-cxl core predules to hold all the common functions used
by VFIO variant drivers to support CXL device passthrough.

Signed-off-by: Zhi Wang <zhiw@nvidia.com>
---
 drivers/vfio/pci/Kconfig         |   4 +
 drivers/vfio/pci/Makefile        |   3 +
 drivers/vfio/pci/vfio_cxl_core.c | 264 +++++++++++++++++++++++++++++++
 include/linux/vfio_pci_core.h    |  37 +++++
 4 files changed, 308 insertions(+)
 create mode 100644 drivers/vfio/pci/vfio_cxl_core.c

diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index bf50ffa10bde..2196e79b132b 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -7,6 +7,10 @@ config VFIO_PCI_CORE
 	select VFIO_VIRQFD
 	select IRQ_BYPASS_MANAGER
 
+config VFIO_CXL_CORE
+	tristate
+	select VFIO_PCI_CORE
+
 config VFIO_PCI_MMAP
 	def_bool y if !S390
 	depends on VFIO_PCI_CORE
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index cf00c0a7e55c..b51221b94b0b 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -8,6 +8,9 @@ vfio-pci-y := vfio_pci.o
 vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
 obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
 
+vfio-cxl-core-y := vfio_cxl_core.o
+obj-$(CONFIG_VFIO_CXL_CORE) += vfio-cxl-core.o
+
 obj-$(CONFIG_MLX5_VFIO_PCI)           += mlx5/
 
 obj-$(CONFIG_HISI_ACC_VFIO_PCI) += hisilicon/
diff --git a/drivers/vfio/pci/vfio_cxl_core.c b/drivers/vfio/pci/vfio_cxl_core.c
new file mode 100644
index 000000000000..6a7859333f67
--- /dev/null
+++ b/drivers/vfio/pci/vfio_cxl_core.c
@@ -0,0 +1,264 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/device.h>
+#include <linux/eventfd.h>
+#include <linux/file.h>
+#include <linux/interrupt.h>
+#include <linux/iommu.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/notifier.h>
+#include <linux/pci.h>
+#include <linux/pm_runtime.h>
+#include <linux/slab.h>
+#include <linux/types.h>
+#include <linux/uaccess.h>
+
+#include "vfio_pci_priv.h"
+
+#define DRIVER_AUTHOR "Zhi Wang <zhiw@nvidia.com>"
+#define DRIVER_DESC "core driver for VFIO based CXL devices"
+
+static int get_hpa_and_request_dpa(struct vfio_pci_core_device *core_dev)
+{
+	struct vfio_cxl *cxl = &core_dev->cxl;
+	struct pci_dev *pdev = core_dev->pdev;
+	u64 max;
+
+	cxl->cxlrd = cxl_get_hpa_freespace(cxl->endpoint, 1,
+					   CXL_DECODER_F_RAM |
+					   CXL_DECODER_F_TYPE2,
+					   &max);
+	if (IS_ERR(cxl->cxlrd)) {
+		pci_err(pdev, "Fail to get HPA space.\n");
+		return PTR_ERR(cxl->cxlrd);
+	}
+
+	if (max < cxl->region.size) {
+		pci_err(pdev, "No enough free HPA space %llu < %llu\n",
+			max, cxl->region.size);
+		return -ENOSPC;
+	}
+
+	cxl->cxled = cxl_request_dpa(cxl->endpoint, true, cxl->region.size,
+				     cxl->region.size);
+	if (IS_ERR(cxl->cxled)) {
+		pci_err(pdev, "Fail to request DPA\n");
+		return PTR_ERR(cxl->cxled);
+	}
+
+	return 0;
+}
+
+static int create_cxl_region(struct vfio_pci_core_device *core_dev)
+{
+	struct vfio_cxl *cxl = &core_dev->cxl;
+	struct pci_dev *pdev = core_dev->pdev;
+	resource_size_t start, end;
+	int ret;
+
+	ret = cxl_accel_request_resource(cxl->cxlds, true);
+	if (ret) {
+		pci_err(pdev, "Fail to request CXL resource\n");
+		return ret;
+	}
+
+	if (!cxl_await_media_ready(cxl->cxlds)) {
+		cxl_accel_set_media_ready(cxl->cxlds);
+	} else {
+		pci_err(pdev, "CXL media is not active\n");
+		return ret;
+	}
+
+	cxl->cxlmd = devm_cxl_add_memdev(&pdev->dev, cxl->cxlds);
+	if (IS_ERR(cxl->cxlmd)) {
+		pci_err(pdev, "Fail to create CXL memdev\n");
+		return PTR_ERR(cxl->cxlmd);
+	}
+
+	cxl->endpoint = cxl_acquire_endpoint(cxl->cxlmd);
+	if (IS_ERR(cxl->endpoint)) {
+		pci_err(pdev, "Fail to acquire CXL endpoint\n");
+		return PTR_ERR(cxl->endpoint);
+	}
+
+	ret = get_hpa_and_request_dpa(core_dev);
+	if (ret)
+		goto out;
+
+	cxl->region.region = cxl_create_region(cxl->cxlrd, &cxl->cxled, 1);
+	if (IS_ERR(cxl->region.region)) {
+		ret = PTR_ERR(cxl->region.region);
+		pci_err(pdev, "Fail to create CXL region\n");
+		cxl_dpa_free(cxl->cxled);
+		goto out;
+	}
+
+	cxl_accel_get_region_params(cxl->region.region, &start, &end);
+
+	cxl->region.addr = start;
+out:
+	cxl_release_endpoint(cxl->cxlmd, cxl->endpoint);
+	return ret;
+}
+
+/* Standard CXL-type 2 driver initialization sequence */
+static int enable_cxl(struct vfio_pci_core_device *core_dev, u16 dvsec)
+{
+	struct vfio_cxl *cxl = &core_dev->cxl;
+	struct pci_dev *pdev = core_dev->pdev;
+	u32 count;
+	u64 offset, size;
+	int ret;
+
+	cxl->cxlds = cxl_accel_state_create(&pdev->dev, cxl->caps);
+	if (IS_ERR(cxl->cxlds))
+		return PTR_ERR(cxl->cxlds);
+
+	cxl_accel_set_dvsec(cxl->cxlds, dvsec);
+	cxl_accel_set_serial(cxl->cxlds, pdev->dev.id);
+
+	cxl_accel_set_resource(cxl->cxlds, cxl->dpa_res, CXL_ACCEL_RES_DPA);
+	cxl_accel_set_resource(cxl->cxlds, cxl->ram_res, CXL_ACCEL_RES_RAM);
+
+	ret = cxl_pci_accel_setup_regs(pdev, cxl->cxlds);
+	if (ret) {
+		pci_err(pdev, "Fail to setup CXL accel regs\n");
+		return ret;
+	}
+
+	ret = cxl_get_hdm_info(cxl->cxlds, &count, &offset, &size);
+	if (ret)
+		return ret;
+
+	if (!count || !size) {
+		pci_err(pdev, "Fail to find CXL HDM reg offset\n");
+		return -ENODEV;
+	}
+
+	cxl->hdm_count = count;
+	cxl->hdm_reg_offset = offset;
+	cxl->hdm_reg_size = size;
+
+	return create_cxl_region(core_dev);
+}
+
+static void disable_cxl(struct vfio_pci_core_device *core_dev)
+{
+	struct vfio_cxl *cxl = &core_dev->cxl;
+
+	if (cxl->region.region)
+		cxl_region_detach(cxl->cxled);
+
+	if (cxl->cxled)
+		cxl_dpa_free(cxl->cxled);
+}
+
+int vfio_cxl_core_enable(struct vfio_pci_core_device *core_dev)
+{
+	struct vfio_cxl *cxl = &core_dev->cxl;
+	struct pci_dev *pdev = core_dev->pdev;
+	u16 dvsec;
+	int ret;
+
+	dvsec = pci_find_dvsec_capability(pdev, PCI_VENDOR_ID_CXL,
+					  CXL_DVSEC_PCIE_DEVICE);
+	if (!dvsec)
+		return -ENODEV;
+
+	if (!cxl->region.size)
+		return -EINVAL;
+
+	ret = vfio_pci_core_enable(core_dev);
+	if (ret)
+		return ret;
+
+	ret = enable_cxl(core_dev, dvsec);
+	if (ret)
+		goto err_enable_cxl_device;
+
+	return 0;
+
+err_enable_cxl_device:
+	vfio_pci_core_disable(core_dev);
+	return ret;
+}
+EXPORT_SYMBOL(vfio_cxl_core_enable);
+
+void vfio_cxl_core_finish_enable(struct vfio_pci_core_device *core_dev)
+{
+	vfio_pci_core_finish_enable(core_dev);
+}
+EXPORT_SYMBOL(vfio_cxl_core_finish_enable);
+
+void vfio_cxl_core_close_device(struct vfio_device *vdev)
+{
+	struct vfio_pci_core_device *core_dev =
+		container_of(vdev, struct vfio_pci_core_device, vdev);
+
+	disable_cxl(core_dev);
+	vfio_pci_core_close_device(vdev);
+}
+EXPORT_SYMBOL(vfio_cxl_core_close_device);
+
+/*
+ * Configure the resource required by the kernel CXL core:
+ * device DPA and device RAM size
+ */
+void vfio_cxl_core_set_resource(struct vfio_pci_core_device *core_dev,
+				struct resource res,
+				enum accel_resource type)
+{
+	struct vfio_cxl *cxl = &core_dev->cxl;
+
+	switch (type) {
+	case CXL_ACCEL_RES_DPA:
+		cxl->dpa_size = res.end - res.start + 1;
+		cxl->dpa_res = res;
+		break;
+
+	case CXL_ACCEL_RES_RAM:
+		cxl->ram_res = res;
+		break;
+
+	default:
+		WARN(1, "invalid resource type: %d\n", type);
+		break;
+	}
+}
+EXPORT_SYMBOL(vfio_cxl_core_set_resource);
+
+/* Configure the expected CXL region size to be created */
+void vfio_cxl_core_set_region_size(struct vfio_pci_core_device *core_dev,
+				   u64 size)
+{
+	struct vfio_cxl *cxl = &core_dev->cxl;
+
+	if (WARN_ON(size > cxl->dpa_size))
+		return;
+
+	if (WARN_ON(cxl->region.region))
+		return;
+
+	cxl->region.size = size;
+}
+EXPORT_SYMBOL(vfio_cxl_core_set_region_size);
+
+/* Configure the driver cap required by the kernel CXL core */
+void vfio_cxl_core_set_driver_hdm_cap(struct vfio_pci_core_device *core_dev)
+{
+	struct vfio_cxl *cxl = &core_dev->cxl;
+
+	cxl->caps |= CXL_ACCEL_DRIVER_CAP_HDM;
+}
+EXPORT_SYMBOL(vfio_cxl_core_set_driver_hdm_cap);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
+MODULE_IMPORT_NS(CXL);
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index fbb472dd99b3..7762d4a3e825 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -15,6 +15,8 @@
 #include <linux/types.h>
 #include <linux/uuid.h>
 #include <linux/notifier.h>
+#include <linux/cxl_accel_mem.h>
+#include <linux/cxl_accel_pci.h>
 
 #ifndef VFIO_PCI_CORE_H
 #define VFIO_PCI_CORE_H
@@ -49,6 +51,31 @@ struct vfio_pci_region {
 	u32				flags;
 };
 
+struct vfio_cxl_region {
+	u64 size;
+	u64 addr;
+	struct cxl_region *region;
+};
+
+struct vfio_cxl {
+	u8 caps;
+	u64 dpa_size;
+
+	u32 hdm_count;
+	u64 hdm_reg_offset;
+	u64 hdm_reg_size;
+
+	struct cxl_dev_state *cxlds;
+	struct cxl_memdev *cxlmd;
+	struct cxl_root_decoder *cxlrd;
+	struct cxl_port *endpoint;
+	struct cxl_endpoint_decoder *cxled;
+	struct resource dpa_res;
+	struct resource ram_res;
+
+	struct vfio_cxl_region region;
+};
+
 struct vfio_pci_core_device {
 	struct vfio_device	vdev;
 	struct pci_dev		*pdev;
@@ -94,6 +121,7 @@ struct vfio_pci_core_device {
 	struct vfio_pci_core_device	*sriov_pf_core_dev;
 	struct notifier_block	nb;
 	struct rw_semaphore	memory_lock;
+	struct vfio_cxl		cxl;
 };
 
 /* Will be exported for vfio pci drivers usage */
@@ -159,4 +187,13 @@ VFIO_IOREAD_DECLARATION(32)
 VFIO_IOREAD_DECLARATION(64)
 #endif
 
+int vfio_cxl_core_enable(struct vfio_pci_core_device *core_dev);
+void vfio_cxl_core_finish_enable(struct vfio_pci_core_device *core_dev);
+void vfio_cxl_core_close_device(struct vfio_device *vdev);
+void vfio_cxl_core_set_resource(struct vfio_pci_core_device *core_dev,
+				struct resource res,
+				enum accel_resource type);
+void vfio_cxl_core_set_region_size(struct vfio_pci_core_device *core_dev,
+				   u64 size);
+void vfio_cxl_core_set_driver_hdm_cap(struct vfio_pci_core_device *core_dev);
 #endif /* VFIO_PCI_CORE_H */
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC 05/13] vfio/cxl: expose CXL region to the usersapce via a new VFIO device region
  2024-09-20 22:34 [RFC 00/13] vfio: introduce vfio-cxl to support CXL type-2 accelerator passthrough Zhi Wang
                   ` (3 preceding siblings ...)
  2024-09-20 22:34 ` [RFC 04/13] vfio: introduce vfio-cxl core preludes Zhi Wang
@ 2024-09-20 22:34 ` Zhi Wang
  2024-10-11 19:12   ` Alex Williamson
  2024-09-20 22:34 ` [RFC 06/13] vfio/pci: expose vfio_pci_rw() Zhi Wang
                   ` (11 subsequent siblings)
  16 siblings, 1 reply; 38+ messages in thread
From: Zhi Wang @ 2024-09-20 22:34 UTC (permalink / raw)
  To: kvm, linux-cxl
  Cc: alex.williamson, kevin.tian, jgg, alison.schofield,
	dan.j.williams, dave.jiang, dave, jonathan.cameron, ira.weiny,
	vishal.l.verma, alucerop, acurrid, cjia, smitra, ankita, aniketa,
	kwankhede, targupta, zhiw, zhiwang

To directly access the device memory, a CXL region is required. Creating
a CXL region requires to configure HDM decoders on the path to map the
access of HPA level by level and evetually hit the DPA in the CXL
topology.

For the usersapce, e.g. QEMU, to access the CXL region, the region is
required to be exposed via VFIO interfaces.

Introduce a new VFIO device region and region ops to expose the created
CXL region when initailize the device in the vfio-cxl-core. Introduce a
new sub region type for the userspace to identify a CXL region.

Signed-off-by: Zhi Wang <zhiw@nvidia.com>
---
 drivers/vfio/pci/vfio_cxl_core.c   | 140 ++++++++++++++++++++++++++++-
 drivers/vfio/pci/vfio_pci_config.c |   1 +
 include/linux/vfio_pci_core.h      |   1 +
 include/uapi/linux/vfio.h          |   3 +
 4 files changed, 144 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/pci/vfio_cxl_core.c b/drivers/vfio/pci/vfio_cxl_core.c
index 6a7859333f67..ffc15fd94b22 100644
--- a/drivers/vfio/pci/vfio_cxl_core.c
+++ b/drivers/vfio/pci/vfio_cxl_core.c
@@ -102,6 +102,13 @@ static int create_cxl_region(struct vfio_pci_core_device *core_dev)
 	cxl_accel_get_region_params(cxl->region.region, &start, &end);
 
 	cxl->region.addr = start;
+	cxl->region.vaddr = ioremap(start, end - start);
+	if (!cxl->region.addr) {
+		pci_err(pdev, "Fail to map CXL region\n");
+		cxl_region_detach(cxl->cxled);
+		cxl_dpa_free(cxl->cxled);
+		goto out;
+	}
 out:
 	cxl_release_endpoint(cxl->cxlmd, cxl->endpoint);
 	return ret;
@@ -152,17 +159,135 @@ static void disable_cxl(struct vfio_pci_core_device *core_dev)
 {
 	struct vfio_cxl *cxl = &core_dev->cxl;
 
-	if (cxl->region.region)
+	if (cxl->region.region) {
+		iounmap(cxl->region.vaddr);
 		cxl_region_detach(cxl->cxled);
+	}
 
 	if (cxl->cxled)
 		cxl_dpa_free(cxl->cxled);
 }
 
+static unsigned long vma_to_pfn(struct vm_area_struct *vma)
+{
+	struct vfio_pci_core_device *vdev = vma->vm_private_data;
+	struct vfio_cxl *cxl = &vdev->cxl;
+	u64 pgoff;
+
+	pgoff = vma->vm_pgoff &
+		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
+
+	return (cxl->region.addr >> PAGE_SHIFT) + pgoff;
+}
+
+static vm_fault_t vfio_cxl_mmap_fault(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	struct vfio_pci_core_device *vdev = vma->vm_private_data;
+	unsigned long pfn, pgoff = vmf->pgoff - vma->vm_pgoff;
+	unsigned long addr = vma->vm_start;
+	vm_fault_t ret = VM_FAULT_SIGBUS;
+
+	pfn = vma_to_pfn(vma);
+
+	down_read(&vdev->memory_lock);
+
+	if (vdev->pm_runtime_engaged || !__vfio_pci_memory_enabled(vdev))
+		goto out_unlock;
+
+	ret = vmf_insert_pfn(vma, vmf->address, pfn + pgoff);
+	if (ret & VM_FAULT_ERROR)
+		goto out_unlock;
+
+	for (; addr < vma->vm_end; addr += PAGE_SIZE, pfn++) {
+		if (addr == vmf->address)
+			continue;
+
+		if (vmf_insert_pfn(vma, addr, pfn) & VM_FAULT_ERROR)
+			break;
+	}
+
+out_unlock:
+	up_read(&vdev->memory_lock);
+
+	return ret;
+}
+
+static const struct vm_operations_struct vfio_cxl_mmap_ops = {
+	.fault = vfio_cxl_mmap_fault,
+};
+
+static int vfio_cxl_region_mmap(struct vfio_pci_core_device *core_dev,
+				struct vfio_pci_region *region,
+				struct vm_area_struct *vma)
+{
+	struct vfio_cxl *cxl = &core_dev->cxl;
+	u64 phys_len, req_len, pgoff, req_start;
+
+	if (!(region->flags & VFIO_REGION_INFO_FLAG_MMAP))
+		return -EINVAL;
+
+	if (!(region->flags & VFIO_REGION_INFO_FLAG_READ) &&
+	    (vma->vm_flags & VM_READ))
+		return -EPERM;
+
+	if (!(region->flags & VFIO_REGION_INFO_FLAG_WRITE) &&
+	    (vma->vm_flags & VM_WRITE))
+		return -EPERM;
+
+	phys_len = cxl->region.size;
+	req_len = vma->vm_end - vma->vm_start;
+	pgoff = vma->vm_pgoff &
+		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
+	req_start = pgoff << PAGE_SHIFT;
+
+	if (req_start + req_len > phys_len)
+		return -EINVAL;
+
+	vma->vm_private_data = core_dev;
+	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+	vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);
+
+	vm_flags_set(vma, VM_ALLOW_ANY_UNCACHED | VM_IO | VM_PFNMAP |
+			VM_DONTEXPAND | VM_DONTDUMP);
+	vma->vm_ops = &vfio_cxl_mmap_ops;
+
+	return 0;
+}
+
+static ssize_t vfio_cxl_region_rw(struct vfio_pci_core_device *core_dev,
+				  char __user *buf, size_t count, loff_t *ppos,
+				  bool iswrite)
+{
+	unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
+	struct vfio_cxl_region *cxl_region = core_dev->region[i].data;
+	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
+
+	if (!count)
+		return 0;
+
+	return vfio_pci_core_do_io_rw(core_dev, false,
+				      cxl_region->vaddr,
+				      (char __user *)buf, pos, count,
+				      0, 0, iswrite);
+}
+
+static void vfio_cxl_region_release(struct vfio_pci_core_device *vdev,
+				    struct vfio_pci_region *region)
+{
+}
+
+static const struct vfio_pci_regops vfio_cxl_regops = {
+	.rw		= vfio_cxl_region_rw,
+	.mmap		= vfio_cxl_region_mmap,
+	.release	= vfio_cxl_region_release,
+};
+
 int vfio_cxl_core_enable(struct vfio_pci_core_device *core_dev)
 {
 	struct vfio_cxl *cxl = &core_dev->cxl;
 	struct pci_dev *pdev = core_dev->pdev;
+	u32 flags;
 	u16 dvsec;
 	int ret;
 
@@ -182,8 +307,21 @@ int vfio_cxl_core_enable(struct vfio_pci_core_device *core_dev)
 	if (ret)
 		goto err_enable_cxl_device;
 
+	flags = VFIO_REGION_INFO_FLAG_READ |
+		VFIO_REGION_INFO_FLAG_WRITE |
+		VFIO_REGION_INFO_FLAG_MMAP;
+
+	ret = vfio_pci_core_register_dev_region(core_dev,
+		PCI_VENDOR_ID_CXL | VFIO_REGION_TYPE_PCI_VENDOR_TYPE,
+		VFIO_REGION_SUBTYPE_CXL, &vfio_cxl_regops,
+		cxl->region.size, flags, &cxl->region);
+	if (ret)
+		goto err_register_cxl_region;
+
 	return 0;
 
+err_register_cxl_region:
+	disable_cxl(core_dev);
 err_enable_cxl_device:
 	vfio_pci_core_disable(core_dev);
 	return ret;
diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index 97422aafaa7b..98f3ac2d305c 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -412,6 +412,7 @@ bool __vfio_pci_memory_enabled(struct vfio_pci_core_device *vdev)
 	return pdev->current_state < PCI_D3hot &&
 	       (pdev->no_command_memory || (cmd & PCI_COMMAND_MEMORY));
 }
+EXPORT_SYMBOL(__vfio_pci_memory_enabled);
 
 /*
  * Restore the *real* BARs after we detect a FLR or backdoor reset.
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index 7762d4a3e825..6523d9d1bffe 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -54,6 +54,7 @@ struct vfio_pci_region {
 struct vfio_cxl_region {
 	u64 size;
 	u64 addr;
+	void *vaddr;
 	struct cxl_region *region;
 };
 
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 2b68e6cdf190..71f766c29060 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -372,6 +372,9 @@ struct vfio_region_info_cap_type {
 /* sub-types for VFIO_REGION_TYPE_GFX */
 #define VFIO_REGION_SUBTYPE_GFX_EDID            (1)
 
+/* sub-types for VFIO CXL region */
+#define VFIO_REGION_SUBTYPE_CXL                 (1)
+
 /**
  * struct vfio_region_gfx_edid - EDID region layout.
  *
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC 06/13] vfio/pci: expose vfio_pci_rw()
  2024-09-20 22:34 [RFC 00/13] vfio: introduce vfio-cxl to support CXL type-2 accelerator passthrough Zhi Wang
                   ` (4 preceding siblings ...)
  2024-09-20 22:34 ` [RFC 05/13] vfio/cxl: expose CXL region to the usersapce via a new VFIO device region Zhi Wang
@ 2024-09-20 22:34 ` Zhi Wang
  2024-09-20 22:34 ` [RFC 07/13] vfio/cxl: introduce vfio_cxl_core_{read, write}() Zhi Wang
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 38+ messages in thread
From: Zhi Wang @ 2024-09-20 22:34 UTC (permalink / raw)
  To: kvm, linux-cxl
  Cc: alex.williamson, kevin.tian, jgg, alison.schofield,
	dan.j.williams, dave.jiang, dave, jonathan.cameron, ira.weiny,
	vishal.l.verma, alucerop, acurrid, cjia, smitra, ankita, aniketa,
	kwankhede, targupta, zhiw, zhiwang

vfio_pci_rw() is the common function for handling PCI device read and
write. A CXL device programming interface is built on top PCI
interfaces.

Expose vfio_pci_rw() for vfio-cxl-core to handle the access not
interesting for it.

Signed-off-by: Zhi Wang <zhiw@nvidia.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 5 +++--
 include/linux/vfio_pci_core.h    | 2 ++
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index ba0ce0075b2f..9373942f1acb 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1537,8 +1537,8 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
 }
 EXPORT_SYMBOL_GPL(vfio_pci_core_ioctl_feature);
 
-static ssize_t vfio_pci_rw(struct vfio_pci_core_device *vdev, char __user *buf,
-			   size_t count, loff_t *ppos, bool iswrite)
+ssize_t vfio_pci_rw(struct vfio_pci_core_device *vdev, char __user *buf,
+		    size_t count, loff_t *ppos, bool iswrite)
 {
 	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
 	int ret;
@@ -1583,6 +1583,7 @@ static ssize_t vfio_pci_rw(struct vfio_pci_core_device *vdev, char __user *buf,
 	pm_runtime_put(&vdev->pdev->dev);
 	return ret;
 }
+EXPORT_SYMBOL_GPL(vfio_pci_rw);
 
 ssize_t vfio_pci_core_read(struct vfio_device *core_vdev, char __user *buf,
 		size_t count, loff_t *ppos)
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index 6523d9d1bffe..62fa0f54a567 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -144,6 +144,8 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
 		unsigned long arg);
 int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
 				void __user *arg, size_t argsz);
+ssize_t vfio_pci_rw(struct vfio_pci_core_device *vdev, char __user *buf,
+		    size_t count, loff_t *ppos, bool iswrite);
 ssize_t vfio_pci_core_read(struct vfio_device *core_vdev, char __user *buf,
 		size_t count, loff_t *ppos);
 ssize_t vfio_pci_core_write(struct vfio_device *core_vdev, const char __user *buf,
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC 07/13] vfio/cxl: introduce vfio_cxl_core_{read, write}()
  2024-09-20 22:34 [RFC 00/13] vfio: introduce vfio-cxl to support CXL type-2 accelerator passthrough Zhi Wang
                   ` (5 preceding siblings ...)
  2024-09-20 22:34 ` [RFC 06/13] vfio/pci: expose vfio_pci_rw() Zhi Wang
@ 2024-09-20 22:34 ` Zhi Wang
  2024-09-20 22:34 ` [RFC 08/13] vfio/cxl: emulate HDM decoder registers Zhi Wang
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 38+ messages in thread
From: Zhi Wang @ 2024-09-20 22:34 UTC (permalink / raw)
  To: kvm, linux-cxl
  Cc: alex.williamson, kevin.tian, jgg, alison.schofield,
	dan.j.williams, dave.jiang, dave, jonathan.cameron, ira.weiny,
	vishal.l.verma, alucerop, acurrid, cjia, smitra, ankita, aniketa,
	kwankhede, targupta, zhiw, zhiwang

The read/write callbacks in vfio_device_ops is for accessing the device
when mmap is not support. It is also used for VFIO variant driver to
emulate the device registers.

CXL spec illusrates the standard programming interface, part of them
are MMIO registers sit in a PCI BAR. Some of them are emulated when
passing the CXL type-2 device to the VM. E.g. HDM decoder registers are
emulated.

Introduce vfio_cxl_core_{read, write}() in the vfio-cxl-core to prepare
for emulating the CXL MMIO registers in the PCI BAR.

Signed-off-by: Zhi Wang <zhiw@nvidia.com>
---
 drivers/vfio/pci/vfio_cxl_core.c | 20 ++++++++++++++++++++
 include/linux/vfio_pci_core.h    |  5 +++++
 2 files changed, 25 insertions(+)

diff --git a/drivers/vfio/pci/vfio_cxl_core.c b/drivers/vfio/pci/vfio_cxl_core.c
index ffc15fd94b22..68a935515256 100644
--- a/drivers/vfio/pci/vfio_cxl_core.c
+++ b/drivers/vfio/pci/vfio_cxl_core.c
@@ -396,6 +396,26 @@ void vfio_cxl_core_set_driver_hdm_cap(struct vfio_pci_core_device *core_dev)
 }
 EXPORT_SYMBOL(vfio_cxl_core_set_driver_hdm_cap);
 
+ssize_t vfio_cxl_core_read(struct vfio_device *core_vdev, char __user *buf,
+		size_t count, loff_t *ppos)
+{
+	struct vfio_pci_core_device *vdev =
+		container_of(core_vdev, struct vfio_pci_core_device, vdev);
+
+	return vfio_pci_rw(vdev, buf, count, ppos, false);
+}
+EXPORT_SYMBOL_GPL(vfio_cxl_core_read);
+
+ssize_t vfio_cxl_core_write(struct vfio_device *core_vdev, const char __user *buf,
+		size_t count, loff_t *ppos)
+{
+	struct vfio_pci_core_device *vdev =
+		container_of(core_vdev, struct vfio_pci_core_device, vdev);
+
+	return vfio_pci_rw(vdev, (char __user *)buf, count, ppos, true);
+}
+EXPORT_SYMBOL_GPL(vfio_cxl_core_write);
+
 MODULE_LICENSE("GPL");
 MODULE_AUTHOR(DRIVER_AUTHOR);
 MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index 62fa0f54a567..64ccdcdfa95e 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -199,4 +199,9 @@ void vfio_cxl_core_set_resource(struct vfio_pci_core_device *core_dev,
 void vfio_cxl_core_set_region_size(struct vfio_pci_core_device *core_dev,
 				   u64 size);
 void vfio_cxl_core_set_driver_hdm_cap(struct vfio_pci_core_device *core_dev);
+ssize_t vfio_cxl_core_read(struct vfio_device *core_vdev, char __user *buf,
+			   size_t count, loff_t *ppos);
+ssize_t vfio_cxl_core_write(struct vfio_device *core_vdev, const char __user *buf,
+			    size_t count, loff_t *ppos);
+
 #endif /* VFIO_PCI_CORE_H */
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC 08/13] vfio/cxl: emulate HDM decoder registers
  2024-09-20 22:34 [RFC 00/13] vfio: introduce vfio-cxl to support CXL type-2 accelerator passthrough Zhi Wang
                   ` (6 preceding siblings ...)
  2024-09-20 22:34 ` [RFC 07/13] vfio/cxl: introduce vfio_cxl_core_{read, write}() Zhi Wang
@ 2024-09-20 22:34 ` Zhi Wang
  2024-09-20 22:34 ` [RFC 09/13] vfio/pci: introduce CXL device awareness Zhi Wang
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 38+ messages in thread
From: Zhi Wang @ 2024-09-20 22:34 UTC (permalink / raw)
  To: kvm, linux-cxl
  Cc: alex.williamson, kevin.tian, jgg, alison.schofield,
	dan.j.williams, dave.jiang, dave, jonathan.cameron, ira.weiny,
	vishal.l.verma, alucerop, acurrid, cjia, smitra, ankita, aniketa,
	kwankhede, targupta, zhiw, zhiwang

To directly access the device memory, the HDM decoder registers on the
path from CXL root port to the device needs to be configured when
creating a CXL region.

However, the physical HDM decoders are owned by the kernel CXL core when
creating and configuring a CXL region. Thus the VM is forbidden to
access and configure the phsyical HDM decoder registers. The HDM decoder
register in the CXL component register group needs to be emulated.

Emulate the HDM decoder registers in the vfio-cxl-core. Locate the BAR
where the component registers sit. Take a snapshot of component
registers before initialize the CXL device. Emulate the HDM decoder
registers when VM access them from vfio_device_ops->{read, write}.

Signed-off-by: Zhi Wang <zhiw@nvidia.com>
---
 drivers/vfio/pci/vfio_cxl_core.c | 208 ++++++++++++++++++++++++++++++-
 include/linux/cxl_accel_pci.h    |   6 +
 include/linux/vfio_pci_core.h    |   5 +
 3 files changed, 216 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/pci/vfio_cxl_core.c b/drivers/vfio/pci/vfio_cxl_core.c
index 68a935515256..bbb968cb1b70 100644
--- a/drivers/vfio/pci/vfio_cxl_core.c
+++ b/drivers/vfio/pci/vfio_cxl_core.c
@@ -283,6 +283,90 @@ static const struct vfio_pci_regops vfio_cxl_regops = {
 	.release	= vfio_cxl_region_release,
 };
 
+static int find_bar(struct pci_dev *pdev, u64 *offset, int *bar, u64 size)
+{
+	u64 start, end, flags;
+	int index, i;
+
+	for (i = 0; i < PCI_STD_NUM_BARS; i++) {
+		index = i + PCI_STD_RESOURCES;
+		flags = pci_resource_flags(pdev, index);
+
+		start = pci_resource_start(pdev, index);
+		end = pci_resource_end(pdev, index);
+
+		if (*offset >= start && *offset + size - 1 <= end)
+			break;
+
+		if (flags & IORESOURCE_MEM_64)
+			i++;
+	}
+
+	if (i == PCI_STD_NUM_BARS)
+		return -ENODEV;
+
+	*offset = *offset - start;
+	*bar = index;
+
+	return 0;
+}
+
+static int find_comp_regs(struct vfio_pci_core_device *core_dev)
+{
+	struct vfio_cxl *cxl = &core_dev->cxl;
+	struct pci_dev *pdev = core_dev->pdev;
+	u64 offset;
+	int ret, bar;
+
+	ret = cxl_find_comp_regblock_offset(pdev, &offset);
+	if (ret)
+		return ret;
+
+	ret = find_bar(pdev, &offset, &bar, SZ_64K);
+	if (ret)
+		return ret;
+
+	cxl->comp_reg_bar = bar;
+	cxl->comp_reg_offset = offset;
+	cxl->comp_reg_size = SZ_64K;
+	return 0;
+}
+
+static void clean_virt_comp_regs(struct vfio_pci_core_device *core_dev)
+{
+	struct vfio_cxl *cxl = &core_dev->cxl;
+
+	kvfree(cxl->comp_reg_virt);
+}
+
+static int setup_virt_comp_regs(struct vfio_pci_core_device *core_dev)
+{
+	struct vfio_cxl *cxl = &core_dev->cxl;
+	struct pci_dev *pdev = core_dev->pdev;
+	u64 offset = cxl->comp_reg_offset;
+	int bar = cxl->comp_reg_bar;
+	u64 size = cxl->comp_reg_size;
+	void *regs;
+	unsigned int i;
+
+	cxl->comp_reg_virt = kvzalloc(size, GFP_KERNEL);
+	if (!cxl->comp_reg_virt)
+		return -ENOMEM;
+
+	regs = ioremap(pci_resource_start(pdev, bar) + offset, size);
+	if (!regs) {
+		kvfree(cxl->comp_reg_virt);
+		return -EFAULT;
+	}
+
+	for (i = 0; i < size; i += 4)
+		*(u32 *)(cxl->comp_reg_virt + i) = readl(regs + i);
+
+	iounmap(regs);
+
+	return 0;
+}
+
 int vfio_cxl_core_enable(struct vfio_pci_core_device *core_dev)
 {
 	struct vfio_cxl *cxl = &core_dev->cxl;
@@ -299,10 +383,18 @@ int vfio_cxl_core_enable(struct vfio_pci_core_device *core_dev)
 	if (!cxl->region.size)
 		return -EINVAL;
 
-	ret = vfio_pci_core_enable(core_dev);
+	ret = find_comp_regs(core_dev);
+	if (ret)
+		return ret;
+
+	ret = setup_virt_comp_regs(core_dev);
 	if (ret)
 		return ret;
 
+	ret = vfio_pci_core_enable(core_dev);
+	if (ret)
+		goto err_pci_core_enable;
+
 	ret = enable_cxl(core_dev, dvsec);
 	if (ret)
 		goto err_enable_cxl_device;
@@ -324,6 +416,8 @@ int vfio_cxl_core_enable(struct vfio_pci_core_device *core_dev)
 	disable_cxl(core_dev);
 err_enable_cxl_device:
 	vfio_pci_core_disable(core_dev);
+err_pci_core_enable:
+	clean_virt_comp_regs(core_dev);
 	return ret;
 }
 EXPORT_SYMBOL(vfio_cxl_core_enable);
@@ -341,6 +435,7 @@ void vfio_cxl_core_close_device(struct vfio_device *vdev)
 
 	disable_cxl(core_dev);
 	vfio_pci_core_close_device(vdev);
+	clean_virt_comp_regs(core_dev);
 }
 EXPORT_SYMBOL(vfio_cxl_core_close_device);
 
@@ -396,13 +491,102 @@ void vfio_cxl_core_set_driver_hdm_cap(struct vfio_pci_core_device *core_dev)
 }
 EXPORT_SYMBOL(vfio_cxl_core_set_driver_hdm_cap);
 
+static bool is_hdm_regblock(struct vfio_cxl *cxl, u64 offset, size_t count)
+{
+	return offset >= cxl->hdm_reg_offset &&
+	       offset + count < cxl->hdm_reg_offset +
+	       cxl->hdm_reg_size;
+}
+
+static void write_hdm_decoder_global(void *virt, u64 offset, u32 v)
+{
+	if (offset == 0x4)
+		*(u32 *)(virt + offset) = v & GENMASK(1, 0);
+}
+
+static void write_hdm_decoder_n(void *virt, u64 offset, u32 v)
+{
+	u32 cur, index;
+
+	index = (offset - 0x10) / 0x20;
+
+	/* HDM decoder registers are locked? */
+	cur = *(u32 *)(virt + index * 0x20 + 0x20);
+
+	if (cur & CXL_HDM_DECODER0_CTRL_LOCK &&
+	    cur & CXL_HDM_DECODER0_CTRL_COMMITTED)
+		return;
+
+	/* emulate HDM_DECODER_CTRL. */
+	if (offset == CXL_HDM_DECODER0_CTRL_OFFSET(index)) {
+		v &= ~CXL_HDM_DECODER0_CTRL_COMMIT_ERROR;
+
+		/* commit/de-commit */
+		if (v & CXL_HDM_DECODER0_CTRL_COMMIT)
+			v |= CXL_HDM_DECODER0_CTRL_COMMITTED;
+		else
+			v &= ~CXL_HDM_DECODER0_CTRL_COMMITTED;
+	}
+	*(u32 *)(virt + offset) = v;
+}
+
+static ssize_t
+emulate_hdm_regblock(struct vfio_device *vdev, char __user *buf,
+		     size_t count, loff_t *ppos, bool write)
+{
+	struct vfio_pci_core_device *core_dev =
+		container_of(vdev, struct vfio_pci_core_device, vdev);
+	struct vfio_cxl *cxl = &core_dev->cxl;
+	u64 pos = *ppos & VFIO_PCI_OFFSET_MASK;
+	void *hdm_reg_virt;
+	u64 hdm_offset;
+	u32 v;
+
+	hdm_offset = pos - cxl->hdm_reg_offset;
+	hdm_reg_virt = cxl->comp_reg_virt +
+		       (cxl->hdm_reg_offset - cxl->comp_reg_offset);
+
+	if (!write) {
+		v = *(u32 *)(hdm_reg_virt + hdm_offset);
+
+		if (copy_to_user(buf, &v, 4))
+			return -EFAULT;
+	} else {
+		if (copy_from_user(&v, buf, 4))
+			return -EFAULT;
+
+		if (hdm_offset < 0x10)
+			write_hdm_decoder_global(hdm_reg_virt, hdm_offset, v);
+		else
+			write_hdm_decoder_n(hdm_reg_virt, hdm_offset, v);
+	}
+	return count;
+}
+
 ssize_t vfio_cxl_core_read(struct vfio_device *core_vdev, char __user *buf,
 		size_t count, loff_t *ppos)
 {
 	struct vfio_pci_core_device *vdev =
 		container_of(core_vdev, struct vfio_pci_core_device, vdev);
+	struct vfio_cxl *cxl = &vdev->cxl;
+	u64 pos = *ppos & VFIO_PCI_OFFSET_MASK;
+	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+
+	if (!count)
+		return 0;
+
+	if (index != cxl->comp_reg_bar)
+		return vfio_pci_rw(vdev, buf, count, ppos, false);
+
+	if (WARN_ON_ONCE(!IS_ALIGNED(pos, 4) || count != 4))
+		return -EINVAL;
 
-	return vfio_pci_rw(vdev, buf, count, ppos, false);
+	if (is_hdm_regblock(cxl, pos, count))
+		return emulate_hdm_regblock(core_vdev, buf, count,
+					    ppos, false);
+	else
+		return vfio_pci_rw(vdev, (char __user *)buf, count,
+				   ppos, false);
 }
 EXPORT_SYMBOL_GPL(vfio_cxl_core_read);
 
@@ -411,8 +595,26 @@ ssize_t vfio_cxl_core_write(struct vfio_device *core_vdev, const char __user *bu
 {
 	struct vfio_pci_core_device *vdev =
 		container_of(core_vdev, struct vfio_pci_core_device, vdev);
+	struct vfio_cxl *cxl = &vdev->cxl;
+	u64 pos = *ppos & VFIO_PCI_OFFSET_MASK;
+	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+
+	if (!count)
+		return 0;
+
+	if (index != cxl->comp_reg_bar)
+		return vfio_pci_rw(vdev, (char __user *)buf, count, ppos,
+				   true);
+
+	if (WARN_ON_ONCE(!IS_ALIGNED(pos, 4) || count != 4))
+		return -EINVAL;
 
-	return vfio_pci_rw(vdev, (char __user *)buf, count, ppos, true);
+	if (is_hdm_regblock(cxl, pos, count))
+		return emulate_hdm_regblock(core_vdev, (char __user *)buf,
+					    count, ppos, true);
+	else
+		return vfio_pci_rw(vdev, (char __user *)buf, count, ppos,
+				   true);
 }
 EXPORT_SYMBOL_GPL(vfio_cxl_core_write);
 
diff --git a/include/linux/cxl_accel_pci.h b/include/linux/cxl_accel_pci.h
index c337ae8797e6..090f60fb9a3f 100644
--- a/include/linux/cxl_accel_pci.h
+++ b/include/linux/cxl_accel_pci.h
@@ -20,4 +20,10 @@
 #define   CXL_DVSEC_RANGE_BASE_LOW(i)	(0x24 + (i * 0x10))
 #define     CXL_DVSEC_MEM_BASE_LOW_MASK	GENMASK(31, 28)
 
+#define CXL_HDM_DECODER0_CTRL_OFFSET(i) (0x20 * (i) + 0x20)
+#define   CXL_HDM_DECODER0_CTRL_LOCK BIT(8)
+#define   CXL_HDM_DECODER0_CTRL_COMMIT BIT(9)
+#define   CXL_HDM_DECODER0_CTRL_COMMITTED BIT(10)
+#define   CXL_HDM_DECODER0_CTRL_COMMIT_ERROR BIT(11)
+
 #endif
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index 64ccdcdfa95e..9d295ca9382a 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -62,6 +62,11 @@ struct vfio_cxl {
 	u8 caps;
 	u64 dpa_size;
 
+	int comp_reg_bar;
+	u64 comp_reg_offset;
+	u64 comp_reg_size;
+	void *comp_reg_virt;
+
 	u32 hdm_count;
 	u64 hdm_reg_offset;
 	u64 hdm_reg_size;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC 09/13] vfio/pci: introduce CXL device awareness
  2024-09-20 22:34 [RFC 00/13] vfio: introduce vfio-cxl to support CXL type-2 accelerator passthrough Zhi Wang
                   ` (7 preceding siblings ...)
  2024-09-20 22:34 ` [RFC 08/13] vfio/cxl: emulate HDM decoder registers Zhi Wang
@ 2024-09-20 22:34 ` Zhi Wang
  2024-10-11 20:37   ` Alex Williamson
  2024-09-20 22:34 ` [RFC 10/13] vfio/pci: emulate CXL DVSEC registers in the configuration space Zhi Wang
                   ` (7 subsequent siblings)
  16 siblings, 1 reply; 38+ messages in thread
From: Zhi Wang @ 2024-09-20 22:34 UTC (permalink / raw)
  To: kvm, linux-cxl
  Cc: alex.williamson, kevin.tian, jgg, alison.schofield,
	dan.j.williams, dave.jiang, dave, jonathan.cameron, ira.weiny,
	vishal.l.verma, alucerop, acurrid, cjia, smitra, ankita, aniketa,
	kwankhede, targupta, zhiw, zhiwang

CXL device programming interfaces are built upon PCI interfaces. Thus
the vfio-pci-core can be leveraged to handle a CXL device.

However, CXL device also has difference with PCI devicce:

- No INTX support, only MSI/MSIX is supported.
- Resest is one via CXL reset. FLR only resets CXL.io.

Introduce the CXL device awareness to the vfio-pci-core. Expose a new
VFIO device flags to the userspace to identify the VFIO device is a CXL
device. Disable INTX support in the vfio-pci-core. Disable FLR reset for
the CXL device as the kernel CXL core hasn't support CXL reset yet.
Disable mmap support on the CXL MMIO BAR in vfio-pci-core.

Signed-off-by: Zhi Wang <zhiw@nvidia.com>
---
 drivers/vfio/pci/vfio_cxl_core.c |  8 ++++++
 drivers/vfio/pci/vfio_pci_core.c | 42 +++++++++++++++++++++-----------
 include/linux/vfio_pci_core.h    |  2 ++
 include/uapi/linux/vfio.h        |  1 +
 4 files changed, 39 insertions(+), 14 deletions(-)

diff --git a/drivers/vfio/pci/vfio_cxl_core.c b/drivers/vfio/pci/vfio_cxl_core.c
index bbb968cb1b70..d8b51f8792a2 100644
--- a/drivers/vfio/pci/vfio_cxl_core.c
+++ b/drivers/vfio/pci/vfio_cxl_core.c
@@ -391,6 +391,8 @@ int vfio_cxl_core_enable(struct vfio_pci_core_device *core_dev)
 	if (ret)
 		return ret;
 
+	vfio_pci_core_enable_cxl(core_dev);
+
 	ret = vfio_pci_core_enable(core_dev);
 	if (ret)
 		goto err_pci_core_enable;
@@ -618,6 +620,12 @@ ssize_t vfio_cxl_core_write(struct vfio_device *core_vdev, const char __user *bu
 }
 EXPORT_SYMBOL_GPL(vfio_cxl_core_write);
 
+void vfio_pci_core_enable_cxl(struct vfio_pci_core_device *core_dev)
+{
+	core_dev->has_cxl = true;
+}
+EXPORT_SYMBOL(vfio_pci_core_enable_cxl);
+
 MODULE_LICENSE("GPL");
 MODULE_AUTHOR(DRIVER_AUTHOR);
 MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 9373942f1acb..e0f23b538858 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -126,6 +126,9 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_core_device *vdev)
 		if (!(res->flags & IORESOURCE_MEM))
 			goto no_mmap;
 
+		if (vdev->has_cxl && bar == vdev->cxl.comp_reg_bar)
+			goto no_mmap;
+
 		/*
 		 * The PCI core shouldn't set up a resource with a
 		 * type but zero size. But there may be bugs that
@@ -487,10 +490,15 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
 	if (ret)
 		goto out_power;
 
-	/* If reset fails because of the device lock, fail this path entirely */
-	ret = pci_try_reset_function(pdev);
-	if (ret == -EAGAIN)
-		goto out_disable_device;
+	if (!vdev->has_cxl) {
+		/* If reset fails because of the device lock, fail this path entirely */
+		ret = pci_try_reset_function(pdev);
+		if (ret == -EAGAIN)
+			goto out_disable_device;
+	} else {
+		/* CXL Reset is missing in CXL core. FLR only resets CXL.io path. */
+		ret = -ENODEV;
+	}
 
 	vdev->reset_works = !ret;
 	pci_save_state(pdev);
@@ -498,14 +506,17 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
 	if (!vdev->pci_saved_state)
 		pci_dbg(pdev, "%s: Couldn't store saved state\n", __func__);
 
-	if (likely(!nointxmask)) {
-		if (vfio_pci_nointx(pdev)) {
-			pci_info(pdev, "Masking broken INTx support\n");
-			vdev->nointx = true;
-			pci_intx(pdev, 0);
-		} else
-			vdev->pci_2_3 = pci_intx_mask_supported(pdev);
-	}
+	if (!vdev->has_cxl) {
+		if (likely(!nointxmask)) {
+			if (vfio_pci_nointx(pdev)) {
+				pci_info(pdev, "Masking broken INTx support\n");
+				vdev->nointx = true;
+				pci_intx(pdev, 0);
+			} else
+				vdev->pci_2_3 = pci_intx_mask_supported(pdev);
+		}
+	} else
+		vdev->nointx = true; /* CXL device doesn't have INTX. */
 
 	pci_read_config_word(pdev, PCI_COMMAND, &cmd);
 	if (vdev->pci_2_3 && (cmd & PCI_COMMAND_INTX_DISABLE)) {
@@ -541,7 +552,6 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
 	if (!vfio_vga_disabled() && vfio_pci_is_vga(pdev))
 		vdev->has_vga = true;
 
-
 	return 0;
 
 out_free_zdev:
@@ -657,7 +667,8 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
 	 * Disable INTx and MSI, presumably to avoid spurious interrupts
 	 * during reset.  Stolen from pci_reset_function()
 	 */
-	pci_write_config_word(pdev, PCI_COMMAND, PCI_COMMAND_INTX_DISABLE);
+	if (!vdev->nointx)
+		pci_write_config_word(pdev, PCI_COMMAND, PCI_COMMAND_INTX_DISABLE);
 
 	/*
 	 * Try to get the locks ourselves to prevent a deadlock. The
@@ -973,6 +984,9 @@ static int vfio_pci_ioctl_get_info(struct vfio_pci_core_device *vdev,
 	if (vdev->reset_works)
 		info.flags |= VFIO_DEVICE_FLAGS_RESET;
 
+	if (vdev->has_cxl)
+		info.flags |= VFIO_DEVICE_FLAGS_CXL;
+
 	info.num_regions = VFIO_PCI_NUM_REGIONS + vdev->num_regions;
 	info.num_irqs = VFIO_PCI_NUM_IRQS;
 
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index 9d295ca9382a..e5646aad3eb3 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -113,6 +113,7 @@ struct vfio_pci_core_device {
 	bool			needs_pm_restore:1;
 	bool			pm_intx_masked:1;
 	bool			pm_runtime_engaged:1;
+	bool			has_cxl:1;
 	struct pci_saved_state	*pci_saved_state;
 	struct pci_saved_state	*pm_save;
 	int			ioeventfds_nr;
@@ -208,5 +209,6 @@ ssize_t vfio_cxl_core_read(struct vfio_device *core_vdev, char __user *buf,
 			   size_t count, loff_t *ppos);
 ssize_t vfio_cxl_core_write(struct vfio_device *core_vdev, const char __user *buf,
 			    size_t count, loff_t *ppos);
+void vfio_pci_core_enable_cxl(struct vfio_pci_core_device *core_dev);
 
 #endif /* VFIO_PCI_CORE_H */
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 71f766c29060..0895183feaac 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -214,6 +214,7 @@ struct vfio_device_info {
 #define VFIO_DEVICE_FLAGS_FSL_MC (1 << 6)	/* vfio-fsl-mc device */
 #define VFIO_DEVICE_FLAGS_CAPS	(1 << 7)	/* Info supports caps */
 #define VFIO_DEVICE_FLAGS_CDX	(1 << 8)	/* vfio-cdx device */
+#define VFIO_DEVICE_FLAGS_CXL	(1 << 9)	/* Device supports CXL support */
 	__u32	num_regions;	/* Max region index + 1 */
 	__u32	num_irqs;	/* Max IRQ index + 1 */
 	__u32   cap_offset;	/* Offset within info struct of first cap */
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC 10/13] vfio/pci: emulate CXL DVSEC registers in the configuration space
  2024-09-20 22:34 [RFC 00/13] vfio: introduce vfio-cxl to support CXL type-2 accelerator passthrough Zhi Wang
                   ` (8 preceding siblings ...)
  2024-09-20 22:34 ` [RFC 09/13] vfio/pci: introduce CXL device awareness Zhi Wang
@ 2024-09-20 22:34 ` Zhi Wang
  2024-10-11 21:02   ` Alex Williamson
  2024-09-20 22:34 ` [RFC 11/13] vfio/cxl: introduce VFIO CXL device cap Zhi Wang
                   ` (6 subsequent siblings)
  16 siblings, 1 reply; 38+ messages in thread
From: Zhi Wang @ 2024-09-20 22:34 UTC (permalink / raw)
  To: kvm, linux-cxl
  Cc: alex.williamson, kevin.tian, jgg, alison.schofield,
	dan.j.williams, dave.jiang, dave, jonathan.cameron, ira.weiny,
	vishal.l.verma, alucerop, acurrid, cjia, smitra, ankita, aniketa,
	kwankhede, targupta, zhiw, zhiwang

A CXL device has many DVSEC registers in the configuration space for
device control and enumeration. E.g. enable CXL.mem/CXL.cahce.

However, the kernel CXL core owns those registers to control the device.
Thus, the VM is forbidden to touch the physical device control registers.

Read/write the CXL DVSEC from/to the virt configuration space.

Signed-off-by: Zhi Wang <zhiw@nvidia.com>
---
 drivers/vfio/pci/vfio_pci_config.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index 98f3ac2d305c..af8c0997c796 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -1902,6 +1902,15 @@ static ssize_t vfio_config_do_rw(struct vfio_pci_core_device *vdev, char __user
 
 			perm = &ecap_perms[cap_id];
 			cap_start = vfio_find_cap_start(vdev, *ppos);
+
+			if (cap_id == PCI_EXT_CAP_ID_DVSEC) {
+				u32 dword;
+
+				memcpy(&dword, vdev->vconfig + cap_start + PCI_DVSEC_HEADER1, 4);
+
+				if (PCI_DVSEC_HEADER1_VID(dword) == PCI_VENDOR_ID_CXL)
+					perm = &virt_perms;
+			}
 		} else {
 			WARN_ON(cap_id > PCI_CAP_ID_MAX);
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC 11/13] vfio/cxl: introduce VFIO CXL device cap
  2024-09-20 22:34 [RFC 00/13] vfio: introduce vfio-cxl to support CXL type-2 accelerator passthrough Zhi Wang
                   ` (9 preceding siblings ...)
  2024-09-20 22:34 ` [RFC 10/13] vfio/pci: emulate CXL DVSEC registers in the configuration space Zhi Wang
@ 2024-09-20 22:34 ` Zhi Wang
  2024-10-11 21:14   ` Alex Williamson
  2024-09-20 22:34 ` [RFC 12/13] vfio/cxl: VFIO variant driver for QEMU CXL accel device Zhi Wang
                   ` (5 subsequent siblings)
  16 siblings, 1 reply; 38+ messages in thread
From: Zhi Wang @ 2024-09-20 22:34 UTC (permalink / raw)
  To: kvm, linux-cxl
  Cc: alex.williamson, kevin.tian, jgg, alison.schofield,
	dan.j.williams, dave.jiang, dave, jonathan.cameron, ira.weiny,
	vishal.l.verma, alucerop, acurrid, cjia, smitra, ankita, aniketa,
	kwankhede, targupta, zhiw, zhiwang

The userspace needs CXL device information, e.g. HDM decoder registers
offset to know when the VM updates the HDM decoder and re-build the
mapping between GPA in the virtual HDM decoder base registers and the
HPA of the CXL region created by the vfio-cxl-core when initialize the
CXL device.

To acheive this, a new VFIO CXL device cap is required to convey those
information to the usersapce.

Introduce a new VFIO CXL device cap to expose necessary information to
the userspace. Initialize the cap with the information filled when the
CXL device is being initialized. vfio-pci-core fills the CXL cap into
the caps returned to userapce when CXL is enabled.

Signed-off-by: Zhi Wang <zhiw@nvidia.com>
---
 drivers/vfio/pci/vfio_cxl_core.c | 15 +++++++++++++++
 drivers/vfio/pci/vfio_pci_core.c | 19 ++++++++++++++++++-
 include/linux/vfio_pci_core.h    |  1 +
 include/uapi/linux/vfio.h        | 10 ++++++++++
 4 files changed, 44 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/pci/vfio_cxl_core.c b/drivers/vfio/pci/vfio_cxl_core.c
index d8b51f8792a2..cebc444b54b7 100644
--- a/drivers/vfio/pci/vfio_cxl_core.c
+++ b/drivers/vfio/pci/vfio_cxl_core.c
@@ -367,6 +367,19 @@ static int setup_virt_comp_regs(struct vfio_pci_core_device *core_dev)
 	return 0;
 }
 
+static void init_vfio_cxl_cap(struct vfio_pci_core_device *core_dev)
+{
+	struct vfio_cxl *cxl = &core_dev->cxl;
+
+	cxl->cap.header.id = VFIO_DEVICE_INFO_CAP_CXL;
+	cxl->cap.header.version = 1;
+	cxl->cap.hdm_count = cxl->hdm_count;
+	cxl->cap.hdm_reg_offset = cxl->hdm_reg_offset;
+	cxl->cap.hdm_reg_size = cxl->hdm_reg_size;
+	cxl->cap.hdm_reg_bar_index = cxl->comp_reg_bar;
+	cxl->cap.dpa_size = cxl->dpa_size;
+}
+
 int vfio_cxl_core_enable(struct vfio_pci_core_device *core_dev)
 {
 	struct vfio_cxl *cxl = &core_dev->cxl;
@@ -401,6 +414,8 @@ int vfio_cxl_core_enable(struct vfio_pci_core_device *core_dev)
 	if (ret)
 		goto err_enable_cxl_device;
 
+	init_vfio_cxl_cap(core_dev);
+
 	flags = VFIO_REGION_INFO_FLAG_READ |
 		VFIO_REGION_INFO_FLAG_WRITE |
 		VFIO_REGION_INFO_FLAG_MMAP;
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index e0f23b538858..47e65e28a42b 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -963,6 +963,15 @@ static int vfio_pci_info_atomic_cap(struct vfio_pci_core_device *vdev,
 	return vfio_info_add_capability(caps, &cap.header, sizeof(cap));
 }
 
+static int vfio_pci_info_cxl_cap(struct vfio_pci_core_device *vdev,
+				 struct vfio_info_cap *caps)
+{
+	struct vfio_cxl *cxl = &vdev->cxl;
+
+	return vfio_info_add_capability(caps, &cxl->cap.header,
+					sizeof(cxl->cap));
+}
+
 static int vfio_pci_ioctl_get_info(struct vfio_pci_core_device *vdev,
 				   struct vfio_device_info __user *arg)
 {
@@ -984,9 +993,17 @@ static int vfio_pci_ioctl_get_info(struct vfio_pci_core_device *vdev,
 	if (vdev->reset_works)
 		info.flags |= VFIO_DEVICE_FLAGS_RESET;
 
-	if (vdev->has_cxl)
+	if (vdev->has_cxl) {
 		info.flags |= VFIO_DEVICE_FLAGS_CXL;
 
+		ret = vfio_pci_info_cxl_cap(vdev, &caps);
+		if (ret) {
+			pci_warn(vdev->pdev,
+				 "Failed to setup CXL capabilities\n");
+			return ret;
+		}
+	}
+
 	info.num_regions = VFIO_PCI_NUM_REGIONS + vdev->num_regions;
 	info.num_irqs = VFIO_PCI_NUM_IRQS;
 
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index e5646aad3eb3..d79f7a91d977 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -80,6 +80,7 @@ struct vfio_cxl {
 	struct resource ram_res;
 
 	struct vfio_cxl_region region;
+	struct vfio_device_info_cap_cxl cap;
 };
 
 struct vfio_pci_core_device {
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 0895183feaac..9a5972961280 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -257,6 +257,16 @@ struct vfio_device_info_cap_pci_atomic_comp {
 	__u32 reserved;
 };
 
+#define VFIO_DEVICE_INFO_CAP_CXL		6
+struct vfio_device_info_cap_cxl {
+	struct vfio_info_cap_header header;
+	__u8 hdm_count;
+	__u8 hdm_reg_bar_index;
+	__u64 hdm_reg_size;
+	__u64 hdm_reg_offset;
+	__u64 dpa_size;
+};
+
 /**
  * VFIO_DEVICE_GET_REGION_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 8,
  *				       struct vfio_region_info)
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC 12/13] vfio/cxl: VFIO variant driver for QEMU CXL accel device
  2024-09-20 22:34 [RFC 00/13] vfio: introduce vfio-cxl to support CXL type-2 accelerator passthrough Zhi Wang
                   ` (10 preceding siblings ...)
  2024-09-20 22:34 ` [RFC 11/13] vfio/cxl: introduce VFIO CXL device cap Zhi Wang
@ 2024-09-20 22:34 ` Zhi Wang
  2024-09-20 22:34 ` [RFC 13/13] vfio/cxl: workaround: don't take resource region when cxl is enabled Zhi Wang
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 38+ messages in thread
From: Zhi Wang @ 2024-09-20 22:34 UTC (permalink / raw)
  To: kvm, linux-cxl
  Cc: alex.williamson, kevin.tian, jgg, alison.schofield,
	dan.j.williams, dave.jiang, dave, jonathan.cameron, ira.weiny,
	vishal.l.verma, alucerop, acurrid, cjia, smitra, ankita, aniketa,
	kwankhede, targupta, zhiw, zhiwang

To demostrate the VFIO CXL core, a VFIO variant driver for QEMU CXL
accel device is introduced, so that people to test can try the patches.

This patch is not meant to be merged.

Signed-off-by: Zhi Wang <zhiw@nvidia.com>
---
 drivers/vfio/pci/Kconfig            |   2 +
 drivers/vfio/pci/Makefile           |   2 +
 drivers/vfio/pci/cxl-accel/Kconfig  |   6 ++
 drivers/vfio/pci/cxl-accel/Makefile |   3 +
 drivers/vfio/pci/cxl-accel/main.c   | 116 ++++++++++++++++++++++++++++
 5 files changed, 129 insertions(+)
 create mode 100644 drivers/vfio/pci/cxl-accel/Kconfig
 create mode 100644 drivers/vfio/pci/cxl-accel/Makefile
 create mode 100644 drivers/vfio/pci/cxl-accel/main.c

diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index 2196e79b132b..9eebce09ffa2 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -75,4 +75,6 @@ source "drivers/vfio/pci/nvgrace-gpu/Kconfig"
 
 source "drivers/vfio/pci/qat/Kconfig"
 
+source "drivers/vfio/pci/cxl-accel/Kconfig"
+
 endmenu
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index b51221b94b0b..03293b52c5e3 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -22,3 +22,5 @@ obj-$(CONFIG_VIRTIO_VFIO_PCI) += virtio/
 obj-$(CONFIG_NVGRACE_GPU_VFIO_PCI) += nvgrace-gpu/
 
 obj-$(CONFIG_QAT_VFIO_PCI) += qat/
+
+obj-$(CONFIG_CXL_ACCEL_VFIO_PCI) += cxl-accel/
diff --git a/drivers/vfio/pci/cxl-accel/Kconfig b/drivers/vfio/pci/cxl-accel/Kconfig
new file mode 100644
index 000000000000..c3c9d7ec7fa4
--- /dev/null
+++ b/drivers/vfio/pci/cxl-accel/Kconfig
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: GPL-2.0-only
+config CXL_ACCEL_VFIO_PCI
+	tristate "VFIO support for the QEMU CXL accel device"
+	select VFIO_CXL_CORE
+	help
+	  If you don't know what to do here, say N.
diff --git a/drivers/vfio/pci/cxl-accel/Makefile b/drivers/vfio/pci/cxl-accel/Makefile
new file mode 100644
index 000000000000..20f190482cc9
--- /dev/null
+++ b/drivers/vfio/pci/cxl-accel/Makefile
@@ -0,0 +1,3 @@
+# SPDX-License-Identifier: GPL-2.0-only
+obj-$(CONFIG_CXL_ACCEL_VFIO_PCI) += cxl-accel-vfio-pci.o
+cxl-accel-vfio-pci-y := main.o
diff --git a/drivers/vfio/pci/cxl-accel/main.c b/drivers/vfio/pci/cxl-accel/main.c
new file mode 100644
index 000000000000..1672fb9d9232
--- /dev/null
+++ b/drivers/vfio/pci/cxl-accel/main.c
@@ -0,0 +1,116 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved
+ */
+
+#include <linux/sizes.h>
+#include <linux/vfio_pci_core.h>
+
+struct cxl_device {
+	struct vfio_pci_core_device core_device;
+};
+
+static int cxl_open_device(struct vfio_device *vdev)
+{
+	struct vfio_pci_core_device *core_dev =
+		container_of(vdev, struct vfio_pci_core_device, vdev);
+	struct resource res;
+	int ret;
+
+	/* Provide the device infomation to the kernel CXL core.*/
+	/* Device DPA */
+	res = DEFINE_RES_MEM(0, SZ_256M);
+	vfio_cxl_core_set_resource(core_dev, res, CXL_ACCEL_RES_DPA);
+
+	/* Device RAM */
+	res = DEFINE_RES_MEM_NAMED(0, SZ_256M, "ram");
+	vfio_cxl_core_set_resource(core_dev, res, CXL_ACCEL_RES_RAM);
+
+	/* The expected size of the CXL region to be created */
+	vfio_cxl_core_set_region_size(core_dev, SZ_256M);
+	vfio_cxl_core_set_driver_hdm_cap(core_dev);
+
+	/* Initailize the CXL device and enable the vfio-pci-core */
+	ret = vfio_cxl_core_enable(core_dev);
+	if (ret)
+		return ret;
+
+	vfio_cxl_core_finish_enable(core_dev);
+
+	return 0;
+}
+
+static const struct vfio_device_ops cxl_core_ops = {
+	.name		= "cxl-vfio-pci",
+	.init		= vfio_pci_core_init_dev,
+	.release	= vfio_pci_core_release_dev,
+	.open_device	= cxl_open_device,
+	.close_device	= vfio_cxl_core_close_device,
+	.ioctl		= vfio_pci_core_ioctl,
+	.device_feature	= vfio_pci_core_ioctl_feature,
+	.read		= vfio_cxl_core_read,
+	.write		= vfio_cxl_core_write,
+	.mmap		= vfio_pci_core_mmap,
+	.request	= vfio_pci_core_request,
+	.match		= vfio_pci_core_match,
+	.bind_iommufd	= vfio_iommufd_physical_bind,
+	.unbind_iommufd	= vfio_iommufd_physical_unbind,
+	.attach_ioas	= vfio_iommufd_physical_attach_ioas,
+	.detach_ioas	= vfio_iommufd_physical_detach_ioas,
+};
+
+static int cxl_probe(struct pci_dev *pdev,
+			     const struct pci_device_id *id)
+{
+	const struct vfio_device_ops *ops = &cxl_core_ops;
+	struct cxl_device *cxl_device;
+	int ret;
+
+	cxl_device = vfio_alloc_device(cxl_device, core_device.vdev,
+				       &pdev->dev, ops);
+	if (IS_ERR(cxl_device))
+		return PTR_ERR(cxl_device);
+
+	dev_set_drvdata(&pdev->dev, &cxl_device->core_device);
+
+	ret = vfio_pci_core_register_device(&cxl_device->core_device);
+	if (ret)
+		goto out_put_vdev;
+
+	return ret;
+
+out_put_vdev:
+	vfio_put_device(&cxl_device->core_device.vdev);
+	return ret;
+}
+
+static void cxl_remove(struct pci_dev *pdev)
+{
+	struct vfio_pci_core_device *core_device = dev_get_drvdata(&pdev->dev);
+
+	vfio_pci_core_unregister_device(core_device);
+	vfio_put_device(&core_device->vdev);
+}
+
+static const struct pci_device_id cxl_vfio_pci_table[] = {
+	{ PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_INTEL, 0xd94) },
+	{}
+};
+
+MODULE_DEVICE_TABLE(pci, cxl_vfio_pci_table);
+
+static struct pci_driver cxl_vfio_pci_driver = {
+	.name = KBUILD_MODNAME,
+	.id_table = cxl_vfio_pci_table,
+	.probe = cxl_probe,
+	.remove = cxl_remove,
+	.err_handler = &vfio_pci_core_err_handlers,
+	.driver_managed_dma = true,
+};
+
+module_pci_driver(cxl_vfio_pci_driver);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Zhi Wang <zhiw@nvidia.com>");
+MODULE_DESCRIPTION("VFIO variant driver for QEMU CXL accel device");
+MODULE_IMPORT_NS(CXL);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC 13/13] vfio/cxl: workaround: don't take resource region when cxl is enabled.
  2024-09-20 22:34 [RFC 00/13] vfio: introduce vfio-cxl to support CXL type-2 accelerator passthrough Zhi Wang
                   ` (11 preceding siblings ...)
  2024-09-20 22:34 ` [RFC 12/13] vfio/cxl: VFIO variant driver for QEMU CXL accel device Zhi Wang
@ 2024-09-20 22:34 ` Zhi Wang
  2024-09-23  8:00 ` [RFC 00/13] vfio: introduce vfio-cxl to support CXL type-2 accelerator passthrough Tian, Kevin
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 38+ messages in thread
From: Zhi Wang @ 2024-09-20 22:34 UTC (permalink / raw)
  To: kvm, linux-cxl
  Cc: alex.williamson, kevin.tian, jgg, alison.schofield,
	dan.j.williams, dave.jiang, dave, jonathan.cameron, ira.weiny,
	vishal.l.verma, alucerop, acurrid, cjia, smitra, ankita, aniketa,
	kwankhede, targupta, zhiw, zhiwang

Looking for a better suggestion here.

vfio-cxl-core uses the kernel CXL core to initialize the CXL device
and the kernel CXL core has requested the resource regions when accessing
the PCI BARs. Thus, requesting resource region in vfio-pci-core always
fails.

Signed-off-by: Zhi Wang <zhiw@nvidia.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 15 +++++++++------
 drivers/vfio/pci/vfio_pci_rdwr.c |  8 +++++---
 2 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 47e65e28a42b..91f8b984b53c 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -633,7 +633,8 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
 		if (!vdev->barmap[bar])
 			continue;
 		pci_iounmap(pdev, vdev->barmap[bar]);
-		pci_release_selected_regions(pdev, 1 << bar);
+		if (!vdev->has_cxl)
+			pci_release_selected_regions(pdev, 1 << bar);
 		vdev->barmap[bar] = NULL;
 	}
 
@@ -1775,13 +1776,15 @@ int vfio_pci_core_mmap(struct vfio_device *core_vdev, struct vm_area_struct *vma
 	 * we need to request the region and the barmap tracks that.
 	 */
 	if (!vdev->barmap[index]) {
-		ret = pci_request_selected_regions(pdev,
-						   1 << index, "vfio-pci");
-		if (ret)
-			return ret;
+		if (!vdev->has_cxl) {
+			ret = pci_request_selected_regions(pdev,
+					1 << index, "vfio-pci");
+			if (ret)
+				return ret;
+		}
 
 		vdev->barmap[index] = pci_iomap(pdev, index, 0);
-		if (!vdev->barmap[index]) {
+		if (!vdev->barmap[index] && !vdev->has_cxl) {
 			pci_release_selected_regions(pdev, 1 << index);
 			return -ENOMEM;
 		}
diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
index 66b72c289284..df7b5aa078e9 100644
--- a/drivers/vfio/pci/vfio_pci_rdwr.c
+++ b/drivers/vfio/pci/vfio_pci_rdwr.c
@@ -214,9 +214,11 @@ int vfio_pci_core_setup_barmap(struct vfio_pci_core_device *vdev, int bar)
 	if (vdev->barmap[bar])
 		return 0;
 
-	ret = pci_request_selected_regions(pdev, 1 << bar, "vfio");
-	if (ret)
-		return ret;
+	if (!vdev->has_cxl) {
+		ret = pci_request_selected_regions(pdev, 1 << bar, "vfio");
+		if (ret)
+			return ret;
+	}
 
 	io = pci_iomap(pdev, bar, 0);
 	if (!io) {
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* RE: [RFC 00/13] vfio: introduce vfio-cxl to support CXL type-2 accelerator passthrough
  2024-09-20 22:34 [RFC 00/13] vfio: introduce vfio-cxl to support CXL type-2 accelerator passthrough Zhi Wang
                   ` (12 preceding siblings ...)
  2024-09-20 22:34 ` [RFC 13/13] vfio/cxl: workaround: don't take resource region when cxl is enabled Zhi Wang
@ 2024-09-23  8:00 ` Tian, Kevin
  2024-09-24  8:30   ` Zhi Wang
  2024-09-25 10:11 ` Alejandro Lucero Palau
                   ` (2 subsequent siblings)
  16 siblings, 1 reply; 38+ messages in thread
From: Tian, Kevin @ 2024-09-23  8:00 UTC (permalink / raw)
  To: Zhi Wang, kvm@vger.kernel.org, linux-cxl@vger.kernel.org
  Cc: alex.williamson@redhat.com, jgg@nvidia.com, Schofield, Alison,
	Williams, Dan J, Jiang, Dave, dave@stgolabs.net,
	jonathan.cameron@huawei.com, Weiny, Ira, Verma, Vishal L,
	alucerop@amd.com, Currid, Andy, cjia@nvidia.com,
	smitra@nvidia.com, ankita@nvidia.com, aniketa@nvidia.com,
	kwankhede@nvidia.com, targupta@nvidia.com, zhiwang@kernel.org

> From: Zhi Wang <zhiw@nvidia.com>
> Sent: Saturday, September 21, 2024 6:35 AM
> 
[...]
> - Create a CXL region and map it to the VM. A mapping between HPA and DPA
> (Device PA) needs to be created to access the device memory directly. HDM
> decoders in the CXL topology need to be configured level by level to
> manage the mapping. After the region is created, it needs to be mapped to
> GPA in the virtual HDM decoders configured by the VM.

Any time when a new address space is introduced it's worthy of more
context to help people who have no CXL background better understand
the mechanism and think any potential hole.

At a glance looks we are talking about a mapping tier:

  GPA->HPA->DPA

The location/size of HPA/DPA for a cxl region are decided and mapped
at @open_device and the HPA range is mapped to GPA at @mmap.

In addition the guest also manages a virtual HDM decoder:

  GPA->vDPA

Ideally the vDPA range selected by guest is a subset of the physical
cxl region so based on offset and vHDM the VMM may figure out
which offset in the cxl region to be mmaped for the corresponding
GPA (which in the end maps to the desired DPA).

Is this understanding correct?

btw is one cxl device only allowed to create one region? If multiple
regions are possible how will they be exposed to the guest?

> 
> - CXL reset. The CXL device reset is different from the PCI device reset.
> A CXL reset sequence is introduced by the CXL spec.
> 
> - Emulating CXL DVSECs. CXL spec defines a set of DVSECs registers in the
> configuration for device enumeration and device control. (E.g. if a device
> is capable of CXL.mem CXL.cache, enable/disable capability) They are owned
> by the kernel CXL core, and the VM can not modify them.

any side effect from emulating it purely in software (patch10), e.g. when
the guest desired configuration is different from the physical one?

> 
> - Emulate CXL MMIO registers. CXL spec defines a set of CXL MMIO registers
> that can sit in a PCI BAR. The location of register groups sit in the PCI
> BAR is indicated by the register locator in the CXL DVSECs. They are also
> owned by the kernel CXL core. Some of them need to be emulated.

ditto

> 
> In the L2 guest, a dummy CXL device driver is provided to attach to the
> virtual pass-thru device.
> 
> The dummy CXL type-2 device driver can successfully be loaded with the
> kernel cxl core type2 support, create CXL region by requesting the CXL
> core to allocate HPA and DPA and configure the HDM decoders.

It'd be good to see a real cxl device working to add confidence on
the core design.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [RFC 01/13] cxl: allow a type-2 device not to have memory device registers
  2024-09-20 22:34 ` [RFC 01/13] cxl: allow a type-2 device not to have memory device registers Zhi Wang
@ 2024-09-23  8:01   ` Tian, Kevin
  2024-09-23 15:38   ` Dave Jiang
  1 sibling, 0 replies; 38+ messages in thread
From: Tian, Kevin @ 2024-09-23  8:01 UTC (permalink / raw)
  To: Zhi Wang, kvm@vger.kernel.org, linux-cxl@vger.kernel.org
  Cc: alex.williamson@redhat.com, jgg@nvidia.com, Schofield, Alison,
	Williams, Dan J, Jiang, Dave, dave@stgolabs.net,
	jonathan.cameron@huawei.com, Weiny, Ira, Verma, Vishal L,
	alucerop@amd.com, Currid, Andy, cjia@nvidia.com,
	smitra@nvidia.com, ankita@nvidia.com, aniketa@nvidia.com,
	kwankhede@nvidia.com, targupta@nvidia.com, zhiwang@kernel.org

> From: Zhi Wang <zhiw@nvidia.com>
> Sent: Saturday, September 21, 2024 6:35 AM
> 
> CXL memory device registers provide additional information about device
> memory and advanced control interface for type-3 device.
> 
> However, it is not mandatory for a type-2 device. A type-2 device can
> have HDMs but not CXL memory device registers.
> 
> Allow a type-2 device not to hanve memory device register when probing
> CXL registers.

this is nothing vfio specific.

> 
> Signed-off-by: Zhi Wang <zhiw@nvidia.com>
> ---
>  drivers/cxl/pci.c | 14 +++++++-------
>  1 file changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index e00ce7f4d0f9..3fbee31995f1 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -529,13 +529,13 @@ int cxl_pci_accel_setup_regs(struct pci_dev *pdev,
> struct cxl_dev_state *cxlds)
>  	int rc;
> 
>  	rc = cxl_pci_setup_regs(pdev, CXL_REGLOC_RBI_MEMDEV, &map,
> -				cxlds->capabilities);
> -	if (rc)
> -		return rc;
> -
> -	rc = cxl_map_device_regs(&map, &cxlds->regs.device_regs);
> -	if (rc)
> -		return rc;
> +			cxlds->capabilities);
> +	if (!rc) {
> +		rc = cxl_map_device_regs(&map, &cxlds->regs.device_regs);
> +		if (rc)
> +			dev_dbg(&pdev->dev,
> +				"Failed to map device registers.\n");
> +	}
> 
>  	rc = cxl_pci_setup_regs(pdev, CXL_REGLOC_RBI_COMPONENT,
>  				&cxlds->reg_map, cxlds->capabilities);
> --
> 2.34.1


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 01/13] cxl: allow a type-2 device not to have memory device registers
  2024-09-20 22:34 ` [RFC 01/13] cxl: allow a type-2 device not to have memory device registers Zhi Wang
  2024-09-23  8:01   ` Tian, Kevin
@ 2024-09-23 15:38   ` Dave Jiang
  2024-09-24  8:03     ` Zhi Wang
  1 sibling, 1 reply; 38+ messages in thread
From: Dave Jiang @ 2024-09-23 15:38 UTC (permalink / raw)
  To: Zhi Wang, kvm, linux-cxl
  Cc: alex.williamson, kevin.tian, jgg, alison.schofield,
	dan.j.williams, dave, jonathan.cameron, ira.weiny, vishal.l.verma,
	alucerop, acurrid, cjia, smitra, ankita, aniketa, kwankhede,
	targupta, zhiwang



On 9/20/24 3:34 PM, Zhi Wang wrote:
> CXL memory device registers provide additional information about device
> memory and advanced control interface for type-3 device.
> 
> However, it is not mandatory for a type-2 device. A type-2 device can
> have HDMs but not CXL memory device registers.
> 
> Allow a type-2 device not to hanve memory device register when probing
> CXL registers.
> 
> Signed-off-by: Zhi Wang <zhiw@nvidia.com>
> ---
>  drivers/cxl/pci.c | 14 +++++++-------
>  1 file changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index e00ce7f4d0f9..3fbee31995f1 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -529,13 +529,13 @@ int cxl_pci_accel_setup_regs(struct pci_dev *pdev, struct cxl_dev_state *cxlds)
>  	int rc;
>  
>  	rc = cxl_pci_setup_regs(pdev, CXL_REGLOC_RBI_MEMDEV, &map,
> -				cxlds->capabilities);
> -	if (rc)
> -		return rc;
> -
> -	rc = cxl_map_device_regs(&map, &cxlds->regs.device_regs);
> -	if (rc)
> -		return rc;
> +			cxlds->capabilities);
> +	if (!rc) {

Given that device registers are mandatory for type3 devices, I don't think we should alter the current behavior where the code attempt to map device registers and fail if that doesn't exist. Maybe need to check against the capabilities that Alejandro introduced.

DJ  

> +		rc = cxl_map_device_regs(&map, &cxlds->regs.device_regs);
> +		if (rc)
> +			dev_dbg(&pdev->dev,
> +				"Failed to map device registers.\n");
> +	}
>  
>  	rc = cxl_pci_setup_regs(pdev, CXL_REGLOC_RBI_COMPONENT,
>  				&cxlds->reg_map, cxlds->capabilities);


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 01/13] cxl: allow a type-2 device not to have memory device registers
  2024-09-23 15:38   ` Dave Jiang
@ 2024-09-24  8:03     ` Zhi Wang
  0 siblings, 0 replies; 38+ messages in thread
From: Zhi Wang @ 2024-09-24  8:03 UTC (permalink / raw)
  To: Dave Jiang, kvm@vger.kernel.org, linux-cxl@vger.kernel.org
  Cc: alex.williamson@redhat.com, kevin.tian@intel.com, Jason Gunthorpe,
	alison.schofield@intel.com, dan.j.williams@intel.com,
	dave@stgolabs.net, jonathan.cameron@huawei.com,
	ira.weiny@intel.com, vishal.l.verma@intel.com, alucerop@amd.com,
	Andy Currid, Neo Jia, Surath Mitra, Ankit Agrawal, Aniket Agashe,
	Kirti Wankhede, Tarun Gupta (SW-GPU), zhiwang@kernel.org

On 23/09/2024 18.38, Dave Jiang wrote:
> External email: Use caution opening links or attachments
> 
> 
> On 9/20/24 3:34 PM, Zhi Wang wrote:
>> CXL memory device registers provide additional information about device
>> memory and advanced control interface for type-3 device.
>>
>> However, it is not mandatory for a type-2 device. A type-2 device can
>> have HDMs but not CXL memory device registers.
>>
>> Allow a type-2 device not to hanve memory device register when probing
>> CXL registers.
>>
>> Signed-off-by: Zhi Wang <zhiw@nvidia.com>
>> ---
>>   drivers/cxl/pci.c | 14 +++++++-------
>>   1 file changed, 7 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
>> index e00ce7f4d0f9..3fbee31995f1 100644
>> --- a/drivers/cxl/pci.c
>> +++ b/drivers/cxl/pci.c
>> @@ -529,13 +529,13 @@ int cxl_pci_accel_setup_regs(struct pci_dev *pdev, struct cxl_dev_state *cxlds)
>>        int rc;
>>
>>        rc = cxl_pci_setup_regs(pdev, CXL_REGLOC_RBI_MEMDEV, &map,
>> -                             cxlds->capabilities);
>> -     if (rc)
>> -             return rc;
>> -
>> -     rc = cxl_map_device_regs(&map, &cxlds->regs.device_regs);
>> -     if (rc)
>> -             return rc;
>> +                     cxlds->capabilities);
>> +     if (!rc) {
> 
> Given that device registers are mandatory for type3 devices, I don't think we should alter the current behavior where the code attempt to map device registers and fail if that doesn't exist. Maybe need to check against the capabilities that Alejandro introduced.
> 
> DJ
> 

Correct. This patch will be dropped when rebased to Alejandro's latest 
PATCHv3, which has had the checking against the capabilities.

>> +             rc = cxl_map_device_regs(&map, &cxlds->regs.device_regs);
>> +             if (rc)
>> +                     dev_dbg(&pdev->dev,
>> +                             "Failed to map device registers.\n");
>> +     }
>>
>>        rc = cxl_pci_setup_regs(pdev, CXL_REGLOC_RBI_COMPONENT,
>>                                &cxlds->reg_map, cxlds->capabilities);
> 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 00/13] vfio: introduce vfio-cxl to support CXL type-2 accelerator passthrough
  2024-09-23  8:00 ` [RFC 00/13] vfio: introduce vfio-cxl to support CXL type-2 accelerator passthrough Tian, Kevin
@ 2024-09-24  8:30   ` Zhi Wang
  2024-09-25 13:05     ` Jonathan Cameron
  2024-09-26  6:55     ` Tian, Kevin
  0 siblings, 2 replies; 38+ messages in thread
From: Zhi Wang @ 2024-09-24  8:30 UTC (permalink / raw)
  To: Tian, Kevin, kvm@vger.kernel.org, linux-cxl@vger.kernel.org
  Cc: alex.williamson@redhat.com, Jason Gunthorpe, Schofield, Alison,
	Williams, Dan J, Jiang, Dave, dave@stgolabs.net,
	jonathan.cameron@huawei.com, Weiny, Ira, Verma, Vishal L,
	alucerop@amd.com, Andy Currid, Neo Jia, Surath Mitra,
	Ankit Agrawal, Aniket Agashe, Kirti Wankhede,
	Tarun Gupta (SW-GPU), zhiwang@kernel.org

On 23/09/2024 11.00, Tian, Kevin wrote:
> External email: Use caution opening links or attachments
> 
> 
>> From: Zhi Wang <zhiw@nvidia.com>
>> Sent: Saturday, September 21, 2024 6:35 AM
>>
> [...]
>> - Create a CXL region and map it to the VM. A mapping between HPA and DPA
>> (Device PA) needs to be created to access the device memory directly. HDM
>> decoders in the CXL topology need to be configured level by level to
>> manage the mapping. After the region is created, it needs to be mapped to
>> GPA in the virtual HDM decoders configured by the VM.
> 
> Any time when a new address space is introduced it's worthy of more
> context to help people who have no CXL background better understand
> the mechanism and think any potential hole.
> 
> At a glance looks we are talking about a mapping tier:
> 
>    GPA->HPA->DPA
> 
> The location/size of HPA/DPA for a cxl region are decided and mapped
> at @open_device and the HPA range is mapped to GPA at @mmap.
> 
> In addition the guest also manages a virtual HDM decoder:
> 
>    GPA->vDPA
> 
> Ideally the vDPA range selected by guest is a subset of the physical
> cxl region so based on offset and vHDM the VMM may figure out
> which offset in the cxl region to be mmaped for the corresponding
> GPA (which in the end maps to the desired DPA).
> 
> Is this understanding correct?
> 

Yes. Many thanks to summarize this. It is a design decision from a 
discussion in the CXL discord channel.

> btw is one cxl device only allowed to create one region? If multiple
> regions are possible how will they be exposed to the guest?
>

It is not an (shouldn't be) enforced requirement from the VFIO cxl core. 
It is really requirement-driven. I am expecting what kind of use cases 
in reality that needs multiple CXL regions in the host and then passing 
multiple regions to the guest.

Presumably, the host creates one large CXL region that covers the entire 
DPA, while QEMU can virtually partition it into different regions and 
map them to different virtual CXL region if QEMU presents multiple HDM 
decoders to the guest.

>>
>> - CXL reset. The CXL device reset is different from the PCI device reset.
>> A CXL reset sequence is introduced by the CXL spec.
>>
>> - Emulating CXL DVSECs. CXL spec defines a set of DVSECs registers in the
>> configuration for device enumeration and device control. (E.g. if a device
>> is capable of CXL.mem CXL.cache, enable/disable capability) They are owned
>> by the kernel CXL core, and the VM can not modify them.
> 
> any side effect from emulating it purely in software (patch10), e.g. when
> the guest desired configuration is different from the physical one?
> 

This should be with a summary and later be decided if mediate pass 
through is needed. In this RFC, its goal is just to prevent the guest to 
modify pRegs.

>>
>> - Emulate CXL MMIO registers. CXL spec defines a set of CXL MMIO registers
>> that can sit in a PCI BAR. The location of register groups sit in the PCI
>> BAR is indicated by the register locator in the CXL DVSECs. They are also
>> owned by the kernel CXL core. Some of them need to be emulated.
> 
> ditto
> 
>>
>> In the L2 guest, a dummy CXL device driver is provided to attach to the
>> virtual pass-thru device.
>>
>> The dummy CXL type-2 device driver can successfully be loaded with the
>> kernel cxl core type2 support, create CXL region by requesting the CXL
>> core to allocate HPA and DPA and configure the HDM decoders.
> 
> It'd be good to see a real cxl device working to add confidence on
> the core design.

To leverage the opportunity of F2F discussion in LPC, I proposed this 
patchset to start the discussion and meanwhile offered an environment 
for people to try and hack around. Also patches is good base for 
discussion. We see what we will get. :)

There are devices already there and on-going. AMD's SFC (patches are 
under review) and I think they are going to be the first variant driver 
that use the core. NVIDIA's device is also coming and NVIDIA's variant 
driver is going upstream for sure. Plus this emulated device, I assume 
we will have three in-tree variant drivers talks to the CXL core.

Thanks,
Zhi.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 00/13] vfio: introduce vfio-cxl to support CXL type-2 accelerator passthrough
  2024-09-20 22:34 [RFC 00/13] vfio: introduce vfio-cxl to support CXL type-2 accelerator passthrough Zhi Wang
                   ` (13 preceding siblings ...)
  2024-09-23  8:00 ` [RFC 00/13] vfio: introduce vfio-cxl to support CXL type-2 accelerator passthrough Tian, Kevin
@ 2024-09-25 10:11 ` Alejandro Lucero Palau
  2024-09-27  7:38   ` Zhi Wang
  2024-09-27  7:38   ` Zhi Wang
  2024-10-21 10:49 ` Zhi Wang
  2024-10-30 11:56 ` Zhi Wang
  16 siblings, 2 replies; 38+ messages in thread
From: Alejandro Lucero Palau @ 2024-09-25 10:11 UTC (permalink / raw)
  To: Zhi Wang, kvm, linux-cxl
  Cc: alex.williamson, kevin.tian, jgg, alison.schofield,
	dan.j.williams, dave.jiang, dave, jonathan.cameron, ira.weiny,
	vishal.l.verma, acurrid, cjia, smitra, ankita, aniketa, kwankhede,
	targupta, zhiwang


On 9/20/24 23:34, Zhi Wang wrote:
> Hi folks:
>
> As promised in the LPC, here are all you need (patches, repos, guiding
> video, kernel config) to build a environment to test the vfio-cxl-core.
>
> Thanks so much for the discussions! Enjoy and see you in the next one.
>
> Background
> ==========
>
> Compute Express Link (CXL) is an open standard interconnect built upon
> industrial PCI layers to enhance the performance and efficiency of data
> centers by enabling high-speed, low-latency communication between CPUs
> and various types of devices such as accelerators, memory.
>
> It supports three key protocols: CXL.io as the control protocol, CXL.cache
> as the cache-coherent host-device data transfer protocol, and CXL.mem as
> memory expansion protocol. CXL Type 2 devices leverage the three protocols
> to seamlessly integrate with host CPUs, providing a unified and efficient
> interface for high-speed data transfer and memory sharing. This integration
> is crucial for heterogeneous computing environments where accelerators,
> such as GPUs, and other specialized processors, are used to handle
> intensive workloads.
>
> Goal
> ====
>
> Although CXL is built upon the PCI layers, passing a CXL type-2 device can
> be different than PCI devices according to CXL specification[1]:
>
> - CXL type-2 device initialization. CXL type-2 device requires an
> additional initialization sequence besides the PCI device initialization.
> CXL type-2 device initialization can be pretty complicated due to its
> hierarchy of register interfaces. Thus, a standard CXL type-2 driver
> initialization sequence provided by the kernel CXL core is used.
>
> - Create a CXL region and map it to the VM. A mapping between HPA and DPA
> (Device PA) needs to be created to access the device memory directly. HDM
> decoders in the CXL topology need to be configured level by level to
> manage the mapping. After the region is created, it needs to be mapped to
> GPA in the virtual HDM decoders configured by the VM.
>
> - CXL reset. The CXL device reset is different from the PCI device reset.
> A CXL reset sequence is introduced by the CXL spec.
>
> - Emulating CXL DVSECs. CXL spec defines a set of DVSECs registers in the
> configuration for device enumeration and device control. (E.g. if a device
> is capable of CXL.mem CXL.cache, enable/disable capability) They are owned
> by the kernel CXL core, and the VM can not modify them.
>
> - Emulate CXL MMIO registers. CXL spec defines a set of CXL MMIO registers
> that can sit in a PCI BAR. The location of register groups sit in the PCI
> BAR is indicated by the register locator in the CXL DVSECs. They are also
> owned by the kernel CXL core. Some of them need to be emulated.
>
> Design
> ======
>
> To achieve the purpose above, the vfio-cxl-core is introduced to host the
> common routines that variant driver requires for device passthrough.
> Similar with the vfio-pci-core, the vfio-cxl-core provides common
> routines of vfio_device_ops for the variant driver to hook and perform the
> CXL routines behind it.
>
> Besides, several extra APIs are introduced for the variant driver to
> provide the necessary information the kernel CXL core to initialize
> the CXL device. E.g., Device DPA.
>
> CXL is built upon the PCI layers but with differences. Thus, the
> vfio-pci-core is aimed to be re-used as much as possible with the
> awareness of operating on a CXL device.
>
> A new VFIO device region is introduced to expose the CXL region to the
> userspace. A new CXL VFIO device cap has also been introduced to convey
> the necessary CXL device information to the userspace.



Hi Zhi,


As you know, I was confused with this work but after looking at the 
patchset and thinking about all this, it makes sense now. FWIW, the most 
confusing point was to use the CXL core inside the VM for creating the 
region what implies commits to the CXL root switch/complex and any other 
switch in the path. I realize now it will happen but on emulated 
hardware with no implication to the real one, which was updated with any 
necessary change like those commits by the vfio cxl code in the host (L1 
VM in your tests).


The only problem I can see with this approach is the CXL initialization 
is left unconditionally to the hypervisor. I guess most of the time will 
be fine, but the driver could not be mapping/using the whole CXL mem 
always.  I know this could be awkward, but possible depending on the 
device state unrelated to CXL itself. In other words, this approach 
assumes beforehand something which could not be true. What I had in mind 
was to have VM exits for any action on CXL configuration on behalf of 
that device/driver inside the device.


This is all more problematic with CXL.cache, and I think the same 
approach can not be followed. I'm writing a document trying to share all 
my concerns about CXL.cache and DMA/IOMMU mappings, and I will cover 
this for sure. As a quick note, while DMA/IOMMU has no limitations 
regarding the amount of memory to use, it is unlikely the same can be 
done due to scarce host snoop cache resources, therefore the CXL.cache 
mappings will likely need to be explicitly done by the driver and 
approved by the CXL core (along with DMA/IOMMU), and for a driver inside 
a VM that implies VM exits.


Thanks.

Alejandro.

> Patches
> =======
>
> The patches are based on the cxl-type2 support RFCv2 patchset[2]. Will
> rebase them to V3 once the cxl-type2 support v3 patch review is done.
>
> PATCH 1-3: Expose the necessary routines required by vfio-cxl.
>
> PATCH 4: Introduce the preludes of vfio-cxl, including CXL device
> initialization, CXL region creation.
>
> PATCH 5: Expose the CXL region to the userspace.
>
> PATCH 6-7: Prepare to emulate the HDM decoder registers.
>
> PATCH 8: Emulate the HDM decoder registers.
>
> PATCH 9: Tweak vfio-cxl to be aware of working on a CXL device.
>
> PATCH 10: Tell vfio-pci-core to emulate CXL DVSECs.
>
> PATCH 11: Expose the CXL device information that userspace needs.
>
> PATCH 12: An example variant driver to demonstrate the usage of
> vfio-cxl-core from the perspective of the VFIO variant driver.
>
> PATCH 13: A workaround needs suggestions.
>
> Test
> ====
>
> To test the patches and hack around, a virtual passthrough with nested
> virtualization approach is used.
>
> The host QEMU emulates a CXL type-2 accel device based on Ira's patches
> with the changes to emulate HDM decoders.
>
> While running the vfio-cxl in the L1 guest, an example VFIO variant
> driver is used to attach with the QEMU CXL access device.
>
> The L2 guest can be booted via the QEMU with the vfio-cxl support in the
> VFIOStub.
>
> In the L2 guest, a dummy CXL device driver is provided to attach to the
> virtual pass-thru device.
>
> The dummy CXL type-2 device driver can successfully be loaded with the
> kernel cxl core type2 support, create CXL region by requesting the CXL
> core to allocate HPA and DPA and configure the HDM decoders.
>
> To make sure everyone can test the patches, the kernel config of L1 and
> L2 are provided in the repos, the required kernel command params and
> qemu command line can be found from the demostration video.[5]
>
> Repos
> =====
>
> QEMU host: https://github.com/zhiwang-nvidia/qemu/tree/zhi/vfio-cxl-qemu-host
> L1 Kernel: https://github.com/zhiwang-nvidia/linux/tree/zhi/vfio-cxl-l1-kernel-rfc
> L1 QEMU: https://github.com/zhiwang-nvidia/qemu/tree/zhi/vfio-cxl-qemu-l1-rfc
> L2 Kernel: https://github.com/zhiwang-nvidia/linux/tree/zhi/vfio-cxl-l2
>
> [1] https://computeexpresslink.org/cxl-specification/
> [2] https://lore.kernel.org/netdev/20240715172835.24757-1-alejandro.lucero-palau@amd.com/T/
> [3] https://patchew.org/QEMU/20230517-rfc-type2-dev-v1-0-6eb2e470981b@intel.com/
> [4] https://youtu.be/zlk_ecX9bxs?si=hc8P58AdhGXff3Q7
>
> Feedback expected
> =================
>
> - Archtiecture level between vfio-pci-core and vfio-cxl-core.
> - Variant driver requirements from more hardware vendors.
> - vfio-cxl-core UABI to QEMU.
>
> Moving foward
> =============
>
> - Rebase the patches on top of Alejandro's PATCH v3.
> - Get Ira's type-2 emulated device patch into upstream as CXL folks and RH
>    folks both came to talk and expect this. I had a chat with Ira and he
>    expected me to take it over. Will start a discussion in the CXL discord
>    group for the desgin of V1.
> - Sparse map in vfio-cxl-core.
>
> Known issues
> ============
>
> - Teardown path. Missing teardown paths have been implements in Alejandor's
>    PATCH v3. It should be solved after the rebase.
>
> - Powerdown L1 guest instead of reboot it. The QEMU reset handler is missing
>    in the Ira's patch. When rebooting L1, many CXL registers are not reset.
>    This will be addressed in the formal review of emulated CXL type-2 device
>    support.
>
> Zhi Wang (13):
>    cxl: allow a type-2 device not to have memory device registers
>    cxl: introduce cxl_get_hdm_info()
>    cxl: introduce cxl_find_comp_reglock_offset()
>    vfio: introduce vfio-cxl core preludes
>    vfio/cxl: expose CXL region to the usersapce via a new VFIO device
>      region
>    vfio/pci: expose vfio_pci_rw()
>    vfio/cxl: introduce vfio_cxl_core_{read, write}()
>    vfio/cxl: emulate HDM decoder registers
>    vfio/pci: introduce CXL device awareness
>    vfio/pci: emulate CXL DVSEC registers in the configuration space
>    vfio/cxl: introduce VFIO CXL device cap
>    vfio/cxl: VFIO variant driver for QEMU CXL accel device
>    vfio/cxl: workaround: don't take resource region when cxl is enabled.
>
>   drivers/cxl/core/pci.c              |  28 ++
>   drivers/cxl/core/regs.c             |  22 +
>   drivers/cxl/cxl.h                   |   1 +
>   drivers/cxl/cxlpci.h                |   3 +
>   drivers/cxl/pci.c                   |  14 +-
>   drivers/vfio/pci/Kconfig            |   6 +
>   drivers/vfio/pci/Makefile           |   5 +
>   drivers/vfio/pci/cxl-accel/Kconfig  |   6 +
>   drivers/vfio/pci/cxl-accel/Makefile |   3 +
>   drivers/vfio/pci/cxl-accel/main.c   | 116 +++++
>   drivers/vfio/pci/vfio_cxl_core.c    | 647 ++++++++++++++++++++++++++++
>   drivers/vfio/pci/vfio_pci_config.c  |  10 +
>   drivers/vfio/pci/vfio_pci_core.c    |  79 +++-
>   drivers/vfio/pci/vfio_pci_rdwr.c    |   8 +-
>   include/linux/cxl_accel_mem.h       |   3 +
>   include/linux/cxl_accel_pci.h       |   6 +
>   include/linux/vfio_pci_core.h       |  53 +++
>   include/uapi/linux/vfio.h           |  14 +
>   18 files changed, 992 insertions(+), 32 deletions(-)
>   create mode 100644 drivers/vfio/pci/cxl-accel/Kconfig
>   create mode 100644 drivers/vfio/pci/cxl-accel/Makefile
>   create mode 100644 drivers/vfio/pci/cxl-accel/main.c
>   create mode 100644 drivers/vfio/pci/vfio_cxl_core.c
>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 00/13] vfio: introduce vfio-cxl to support CXL type-2 accelerator passthrough
  2024-09-24  8:30   ` Zhi Wang
@ 2024-09-25 13:05     ` Jonathan Cameron
  2024-09-27  7:18       ` Zhi Wang
  2024-09-26  6:55     ` Tian, Kevin
  1 sibling, 1 reply; 38+ messages in thread
From: Jonathan Cameron @ 2024-09-25 13:05 UTC (permalink / raw)
  To: Zhi Wang
  Cc: Tian, Kevin, kvm@vger.kernel.org, linux-cxl@vger.kernel.org,
	alex.williamson@redhat.com, Jason Gunthorpe, Schofield, Alison,
	Williams, Dan J, Jiang, Dave, dave@stgolabs.net, Weiny, Ira,
	Verma, Vishal L, alucerop@amd.com, Andy Currid, Neo Jia,
	Surath Mitra, Ankit Agrawal, Aniket Agashe, Kirti Wankhede,
	Tarun Gupta (SW-GPU), zhiwang@kernel.org

On Tue, 24 Sep 2024 08:30:17 +0000
Zhi Wang <zhiw@nvidia.com> wrote:

> On 23/09/2024 11.00, Tian, Kevin wrote:
> > External email: Use caution opening links or attachments
> > 
> >   
> >> From: Zhi Wang <zhiw@nvidia.com>
> >> Sent: Saturday, September 21, 2024 6:35 AM
> >>  
> > [...]  
> >> - Create a CXL region and map it to the VM. A mapping between HPA and DPA
> >> (Device PA) needs to be created to access the device memory directly. HDM
> >> decoders in the CXL topology need to be configured level by level to
> >> manage the mapping. After the region is created, it needs to be mapped to
> >> GPA in the virtual HDM decoders configured by the VM.  
> > 
> > Any time when a new address space is introduced it's worthy of more
> > context to help people who have no CXL background better understand
> > the mechanism and think any potential hole.
> > 
> > At a glance looks we are talking about a mapping tier:
> > 
> >    GPA->HPA->DPA
> > 
> > The location/size of HPA/DPA for a cxl region are decided and mapped
> > at @open_device and the HPA range is mapped to GPA at @mmap.
> > 
> > In addition the guest also manages a virtual HDM decoder:
> > 
> >    GPA->vDPA
> > 
> > Ideally the vDPA range selected by guest is a subset of the physical
> > cxl region so based on offset and vHDM the VMM may figure out
> > which offset in the cxl region to be mmaped for the corresponding
> > GPA (which in the end maps to the desired DPA).
> > 
> > Is this understanding correct?
> >   
> 
> Yes. Many thanks to summarize this. It is a design decision from a 
> discussion in the CXL discord channel.
> 
> > btw is one cxl device only allowed to create one region? If multiple
> > regions are possible how will they be exposed to the guest?
> >  
> 
> It is not an (shouldn't be) enforced requirement from the VFIO cxl core. 
> It is really requirement-driven. I am expecting what kind of use cases 
> in reality that needs multiple CXL regions in the host and then passing 
> multiple regions to the guest.

Mix of back invalidate and non back invalidate supporting device memory
maybe?  A bounce region for p2p traffic would the obvious reason to do
this without paying the cost of large snoop filters. If anyone puts PMEM
on the device, then maybe mix of that at volatile. In theory you might
do separate regions for QoS reasons but seems unlikely to me...

Anyhow not an immediately problem as I don't know of any
BI capable hosts yet and doubt anyone (other than Dan) cares about PMEM :)


> 
> Presumably, the host creates one large CXL region that covers the entire 
> DPA, while QEMU can virtually partition it into different regions and 
> map them to different virtual CXL region if QEMU presents multiple HDM 
> decoders to the guest.

I'm not sure why it would do that. Can't think why you'd break up
a host region - maybe I'm missing something.

...

> >> In the L2 guest, a dummy CXL device driver is provided to attach to the
> >> virtual pass-thru device.
> >>
> >> The dummy CXL type-2 device driver can successfully be loaded with the
> >> kernel cxl core type2 support, create CXL region by requesting the CXL
> >> core to allocate HPA and DPA and configure the HDM decoders.  
> > 
> > It'd be good to see a real cxl device working to add confidence on
> > the core design.  
> 
> To leverage the opportunity of F2F discussion in LPC, I proposed this 
> patchset to start the discussion and meanwhile offered an environment 
> for people to try and hack around. Also patches is good base for 
> discussion. We see what we will get. :)
> 
> There are devices already there and on-going. AMD's SFC (patches are 
> under review) and I think they are going to be the first variant driver 
> that use the core. NVIDIA's device is also coming and NVIDIA's variant 
> driver is going upstream for sure. Plus this emulated device, I assume 
> we will have three in-tree variant drivers talks to the CXL core.
Nice.
> 
> Thanks,
> Zhi.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [RFC 00/13] vfio: introduce vfio-cxl to support CXL type-2 accelerator passthrough
  2024-09-24  8:30   ` Zhi Wang
  2024-09-25 13:05     ` Jonathan Cameron
@ 2024-09-26  6:55     ` Tian, Kevin
  1 sibling, 0 replies; 38+ messages in thread
From: Tian, Kevin @ 2024-09-26  6:55 UTC (permalink / raw)
  To: Zhi Wang, kvm@vger.kernel.org, linux-cxl@vger.kernel.org
  Cc: alex.williamson@redhat.com, Jason Gunthorpe, Schofield, Alison,
	Williams, Dan J, Jiang, Dave, dave@stgolabs.net,
	jonathan.cameron@huawei.com, Weiny, Ira, Verma, Vishal L,
	alucerop@amd.com, Currid, Andy, Neo Jia, Surath Mitra,
	Ankit Agrawal, Aniket Agashe, Kirti Wankhede,
	Tarun Gupta (SW-GPU), zhiwang@kernel.org

> From: Zhi Wang <zhiw@nvidia.com>
> Sent: Tuesday, September 24, 2024 4:30 PM
> 
> On 23/09/2024 11.00, Tian, Kevin wrote:
> > External email: Use caution opening links or attachments
> >
> >
> >> From: Zhi Wang <zhiw@nvidia.com>
> >> Sent: Saturday, September 21, 2024 6:35 AM
> >>
> > [...]
> >> - Create a CXL region and map it to the VM. A mapping between HPA and DPA
> >> (Device PA) needs to be created to access the device memory directly. HDM
> >> decoders in the CXL topology need to be configured level by level to
> >> manage the mapping. After the region is created, it needs to be mapped to
> >> GPA in the virtual HDM decoders configured by the VM.
> >
> > Any time when a new address space is introduced it's worthy of more
> > context to help people who have no CXL background better understand
> > the mechanism and think any potential hole.
> >
> > At a glance looks we are talking about a mapping tier:
> >
> >    GPA->HPA->DPA
> >
> > The location/size of HPA/DPA for a cxl region are decided and mapped
> > at @open_device and the HPA range is mapped to GPA at @mmap.
> >
> > In addition the guest also manages a virtual HDM decoder:
> >
> >    GPA->vDPA
> >
> > Ideally the vDPA range selected by guest is a subset of the physical
> > cxl region so based on offset and vHDM the VMM may figure out
> > which offset in the cxl region to be mmaped for the corresponding
> > GPA (which in the end maps to the desired DPA).
> >
> > Is this understanding correct?
> >
> 
> Yes. Many thanks to summarize this. It is a design decision from a
> discussion in the CXL discord channel.
> 
> > btw is one cxl device only allowed to create one region? If multiple
> > regions are possible how will they be exposed to the guest?
> >
> 
> It is not an (shouldn't be) enforced requirement from the VFIO cxl core.
> It is really requirement-driven. I am expecting what kind of use cases
> in reality that needs multiple CXL regions in the host and then passing
> multiple regions to the guest.
> 
> Presumably, the host creates one large CXL region that covers the entire
> DPA, while QEMU can virtually partition it into different regions and
> map them to different virtual CXL region if QEMU presents multiple HDM
> decoders to the guest.

non-cxl guys have no idea about what a region is and how it is associated
to the backing hardware resource, e.g. it's created by software then
when the virtual CXL device is composed how is that software-decided
region translated back to a set of virtual CXL hw resource enumerable
to the guest, etc.

In your description, QEMU, as the virtual platform, map the VFIO CXL
region into different virtual CXL regions. This kind of suggests regions
are created by hw, conflicting with the point having sw create it.

We need a fully picture to connect relevant knowledge points in CXL
so the proposal can be better reviewed in the VFIO side. 😊

> 
> >>
> >> - CXL reset. The CXL device reset is different from the PCI device reset.
> >> A CXL reset sequence is introduced by the CXL spec.
> >>
> >> - Emulating CXL DVSECs. CXL spec defines a set of DVSECs registers in the
> >> configuration for device enumeration and device control. (E.g. if a device
> >> is capable of CXL.mem CXL.cache, enable/disable capability) They are owned
> >> by the kernel CXL core, and the VM can not modify them.
> >
> > any side effect from emulating it purely in software (patch10), e.g. when
> > the guest desired configuration is different from the physical one?
> >
> 
> This should be with a summary and later be decided if mediate pass
> through is needed. In this RFC, its goal is just to prevent the guest to
> modify pRegs.

Look forward to that information in future posting.

> 
> >>
> >> - Emulate CXL MMIO registers. CXL spec defines a set of CXL MMIO registers
> >> that can sit in a PCI BAR. The location of register groups sit in the PCI
> >> BAR is indicated by the register locator in the CXL DVSECs. They are also
> >> owned by the kernel CXL core. Some of them need to be emulated.
> >
> > ditto
> >
> >>
> >> In the L2 guest, a dummy CXL device driver is provided to attach to the
> >> virtual pass-thru device.
> >>
> >> The dummy CXL type-2 device driver can successfully be loaded with the
> >> kernel cxl core type2 support, create CXL region by requesting the CXL
> >> core to allocate HPA and DPA and configure the HDM decoders.
> >
> > It'd be good to see a real cxl device working to add confidence on
> > the core design.
> 
> To leverage the opportunity of F2F discussion in LPC, I proposed this
> patchset to start the discussion and meanwhile offered an environment
> for people to try and hack around. Also patches is good base for
> discussion. We see what we will get. :)
> 
> There are devices already there and on-going. AMD's SFC (patches are
> under review) and I think they are going to be the first variant driver
> that use the core. NVIDIA's device is also coming and NVIDIA's variant
> driver is going upstream for sure. Plus this emulated device, I assume
> we will have three in-tree variant drivers talks to the CXL core.
> 

Yeah, this sounds a great first step!

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 00/13] vfio: introduce vfio-cxl to support CXL type-2 accelerator passthrough
  2024-09-25 13:05     ` Jonathan Cameron
@ 2024-09-27  7:18       ` Zhi Wang
  2024-10-04 11:40         ` Jonathan Cameron
  0 siblings, 1 reply; 38+ messages in thread
From: Zhi Wang @ 2024-09-27  7:18 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Tian, Kevin, kvm@vger.kernel.org, linux-cxl@vger.kernel.org,
	alex.williamson@redhat.com, Jason Gunthorpe, Schofield, Alison,
	Williams, Dan J, Jiang, Dave, dave@stgolabs.net, Weiny, Ira,
	Verma, Vishal L, alucerop@amd.com, Andy Currid, Neo Jia,
	Surath Mitra, Ankit Agrawal, Aniket Agashe, Kirti Wankhede,
	Tarun Gupta (SW-GPU), zhiwang@kernel.org

On 25/09/2024 15.05, Jonathan Cameron wrote:
> External email: Use caution opening links or attachments
> 
> 
> On Tue, 24 Sep 2024 08:30:17 +0000
> Zhi Wang <zhiw@nvidia.com> wrote:
> 
>> On 23/09/2024 11.00, Tian, Kevin wrote:
>>> External email: Use caution opening links or attachments
>>>
>>>
>>>> From: Zhi Wang <zhiw@nvidia.com>
>>>> Sent: Saturday, September 21, 2024 6:35 AM
>>>>
>>> [...]
>>>> - Create a CXL region and map it to the VM. A mapping between HPA and DPA
>>>> (Device PA) needs to be created to access the device memory directly. HDM
>>>> decoders in the CXL topology need to be configured level by level to
>>>> manage the mapping. After the region is created, it needs to be mapped to
>>>> GPA in the virtual HDM decoders configured by the VM.
>>>
>>> Any time when a new address space is introduced it's worthy of more
>>> context to help people who have no CXL background better understand
>>> the mechanism and think any potential hole.
>>>
>>> At a glance looks we are talking about a mapping tier:
>>>
>>>     GPA->HPA->DPA
>>>
>>> The location/size of HPA/DPA for a cxl region are decided and mapped
>>> at @open_device and the HPA range is mapped to GPA at @mmap.
>>>
>>> In addition the guest also manages a virtual HDM decoder:
>>>
>>>     GPA->vDPA
>>>
>>> Ideally the vDPA range selected by guest is a subset of the physical
>>> cxl region so based on offset and vHDM the VMM may figure out
>>> which offset in the cxl region to be mmaped for the corresponding
>>> GPA (which in the end maps to the desired DPA).
>>>
>>> Is this understanding correct?
>>>
>>
>> Yes. Many thanks to summarize this. It is a design decision from a
>> discussion in the CXL discord channel.
>>
>>> btw is one cxl device only allowed to create one region? If multiple
>>> regions are possible how will they be exposed to the guest?
>>>
>>
>> It is not an (shouldn't be) enforced requirement from the VFIO cxl core.
>> It is really requirement-driven. I am expecting what kind of use cases
>> in reality that needs multiple CXL regions in the host and then passing
>> multiple regions to the guest.
> 
> Mix of back invalidate and non back invalidate supporting device memory
> maybe?  A bounce region for p2p traffic would the obvious reason to do
> this without paying the cost of large snoop filters. If anyone puts PMEM
> on the device, then maybe mix of that at volatile. In theory you might
> do separate regions for QoS reasons but seems unlikely to me...
> 
> Anyhow not an immediately problem as I don't know of any
> BI capable hosts yet and doubt anyone (other than Dan) cares about PMEM :)
> 

Got it.
> 
>>
>> Presumably, the host creates one large CXL region that covers the entire
>> DPA, while QEMU can virtually partition it into different regions and
>> map them to different virtual CXL region if QEMU presents multiple HDM
>> decoders to the guest.
> 
> I'm not sure why it would do that. Can't think why you'd break up
> a host region - maybe I'm missing something.
> 

It is mostly concerning about a device can have multiple HDM decoders. 
In the current design, a large physical CXL (pCXL) region with the whole 
DPA will be passed to the userspace. Thinking that the guest will see 
the virtual multiple HDM decoders, which usually SW is asking for, the 
guest SW might create multiple virtual CXL regions. In that case QEMU 
needs to map them into different regions of the pCXL region.

> ...
> 
>>>> In the L2 guest, a dummy CXL device driver is provided to attach to the
>>>> virtual pass-thru device.
>>>>
>>>> The dummy CXL type-2 device driver can successfully be loaded with the
>>>> kernel cxl core type2 support, create CXL region by requesting the CXL
>>>> core to allocate HPA and DPA and configure the HDM decoders.
>>>
>>> It'd be good to see a real cxl device working to add confidence on
>>> the core design.
>>
>> To leverage the opportunity of F2F discussion in LPC, I proposed this
>> patchset to start the discussion and meanwhile offered an environment
>> for people to try and hack around. Also patches is good base for
>> discussion. We see what we will get. :)
>>
>> There are devices already there and on-going. AMD's SFC (patches are
>> under review) and I think they are going to be the first variant driver
>> that use the core. NVIDIA's device is also coming and NVIDIA's variant
>> driver is going upstream for sure. Plus this emulated device, I assume
>> we will have three in-tree variant drivers talks to the CXL core.
> Nice.
>>
>> Thanks,
>> Zhi.
> 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 00/13] vfio: introduce vfio-cxl to support CXL type-2 accelerator passthrough
  2024-09-25 10:11 ` Alejandro Lucero Palau
@ 2024-09-27  7:38   ` Zhi Wang
  2024-09-27  7:38   ` Zhi Wang
  1 sibling, 0 replies; 38+ messages in thread
From: Zhi Wang @ 2024-09-27  7:38 UTC (permalink / raw)
  To: Alejandro Lucero Palau, kvm@vger.kernel.org,
	linux-cxl@vger.kernel.org
  Cc: alex.williamson@redhat.com, kevin.tian@intel.com, Jason Gunthorpe,
	alison.schofield@intel.com, dan.j.williams@intel.com,
	dave.jiang@intel.com, dave@stgolabs.net,
	jonathan.cameron@huawei.com, ira.weiny@intel.com,
	vishal.l.verma@intel.com, Andy Currid, Neo Jia, Surath Mitra,
	Ankit Agrawal, Aniket Agashe, Kirti Wankhede,
	Tarun Gupta (SW-GPU), zhiwang@kernel.org

On 25/09/2024 12.11, Alejandro Lucero Palau wrote:
> External email: Use caution opening links or attachments
> 
> 
> On 9/20/24 23:34, Zhi Wang wrote:
>> Hi folks:
>>
>> As promised in the LPC, here are all you need (patches, repos, guiding
>> video, kernel config) to build a environment to test the vfio-cxl-core.
>>
>> Thanks so much for the discussions! Enjoy and see you in the next one.
>>
>> Background
>> ==========
>>
>> Compute Express Link (CXL) is an open standard interconnect built upon
>> industrial PCI layers to enhance the performance and efficiency of data
>> centers by enabling high-speed, low-latency communication between CPUs
>> and various types of devices such as accelerators, memory.
>>
>> It supports three key protocols: CXL.io as the control protocol, 
>> CXL.cache
>> as the cache-coherent host-device data transfer protocol, and CXL.mem as
>> memory expansion protocol. CXL Type 2 devices leverage the three 
>> protocols
>> to seamlessly integrate with host CPUs, providing a unified and efficient
>> interface for high-speed data transfer and memory sharing. This 
>> integration
>> is crucial for heterogeneous computing environments where accelerators,
>> such as GPUs, and other specialized processors, are used to handle
>> intensive workloads.
>>
>> Goal
>> ====
>>
>> Although CXL is built upon the PCI layers, passing a CXL type-2 device 
>> can
>> be different than PCI devices according to CXL specification[1]:
>>
>> - CXL type-2 device initialization. CXL type-2 device requires an
>> additional initialization sequence besides the PCI device initialization.
>> CXL type-2 device initialization can be pretty complicated due to its
>> hierarchy of register interfaces. Thus, a standard CXL type-2 driver
>> initialization sequence provided by the kernel CXL core is used.
>>
>> - Create a CXL region and map it to the VM. A mapping between HPA and DPA
>> (Device PA) needs to be created to access the device memory directly. HDM
>> decoders in the CXL topology need to be configured level by level to
>> manage the mapping. After the region is created, it needs to be mapped to
>> GPA in the virtual HDM decoders configured by the VM.
>>
>> - CXL reset. The CXL device reset is different from the PCI device reset.
>> A CXL reset sequence is introduced by the CXL spec.
>>
>> - Emulating CXL DVSECs. CXL spec defines a set of DVSECs registers in the
>> configuration for device enumeration and device control. (E.g. if a 
>> device
>> is capable of CXL.mem CXL.cache, enable/disable capability) They are 
>> owned
>> by the kernel CXL core, and the VM can not modify them.
>>
>> - Emulate CXL MMIO registers. CXL spec defines a set of CXL MMIO 
>> registers
>> that can sit in a PCI BAR. The location of register groups sit in the PCI
>> BAR is indicated by the register locator in the CXL DVSECs. They are also
>> owned by the kernel CXL core. Some of them need to be emulated.
>>
>> Design
>> ======
>>
>> To achieve the purpose above, the vfio-cxl-core is introduced to host the
>> common routines that variant driver requires for device passthrough.
>> Similar with the vfio-pci-core, the vfio-cxl-core provides common
>> routines of vfio_device_ops for the variant driver to hook and perform 
>> the
>> CXL routines behind it.
>>
>> Besides, several extra APIs are introduced for the variant driver to
>> provide the necessary information the kernel CXL core to initialize
>> the CXL device. E.g., Device DPA.
>>
>> CXL is built upon the PCI layers but with differences. Thus, the
>> vfio-pci-core is aimed to be re-used as much as possible with the
>> awareness of operating on a CXL device.
>>
>> A new VFIO device region is introduced to expose the CXL region to the
>> userspace. A new CXL VFIO device cap has also been introduced to convey
>> the necessary CXL device information to the userspace.
> 
> 
> 
> Hi Zhi,
> 
> 
> As you know, I was confused with this work but after looking at the
> patchset and thinking about all this, it makes sense now. FWIW, the most
> confusing point was to use the CXL core inside the VM for creating the
> region what implies commits to the CXL root switch/complex and any other
> switch in the path. I realize now it will happen but on emulated
> hardware with no implication to the real one, which was updated with any
> necessary change like those commits by the vfio cxl code in the host (L1
> VM in your tests).
> 
> 
> The only problem I can see with this approach is the CXL initialization
> is left unconditionally to the hypervisor. I guess most of the time will
> be fine, but the driver could not be mapping/using the whole CXL mem
> always.  I know this could be awkward, but possible depending on the
> device state unrelated to CXL itself. 

Will this device states be one-time on/off state or a runtime 
configuration state that a guest need to poke all the time?

There can be two paths for handling these states in a vendor-specific 
variant driver: 1) vfio_device->fops->open() path, it suits for one-time 
on/off state 2) vfio_device->fops->{read, write}(), the VM 
exit->QEMU->variant driver path. The vendor-specific driver can 
configure the HW based on the register access from the guest.

It would be nice to know more about this, like how many registers the 
vendor-specific driver would like to handle. Thus, the VFIO CXL core can 
provide common helpers.

In other words, this approach
> assumes beforehand something which could not be true. What I had in mind
> was to have VM exits for any action on CXL configuration on behalf of
> that device/driver inside the device.
> 

Initially, this was a idea from Dan. I think this would be a good topic 
for the next CXL open-source collaboration meeting. Kevn also commented 
for this.

> 
> This is all more problematic with CXL.cache, and I think the same
> approach can not be followed. I'm writing a document trying to share all
> my concerns about CXL.cache and DMA/IOMMU mappings, and I will cover
> this for sure. As a quick note, while DMA/IOMMU has no limitations
> regarding the amount of memory to use, it is unlikely the same can be
> done due to scarce host snoop cache resources, therefore the CXL.cache
> mappings will likely need to be explicitly done by the driver and
> approved by the CXL core (along with DMA/IOMMU), and for a driver inside
> a VM that implies VM exits.
> 

Good to hear. Please CCme as well. Many thanks.

> 
> Thanks.
> 
> Alejandro.
> 
>> Patches
>> =======
>>
>> The patches are based on the cxl-type2 support RFCv2 patchset[2]. Will
>> rebase them to V3 once the cxl-type2 support v3 patch review is done.
>>
>> PATCH 1-3: Expose the necessary routines required by vfio-cxl.
>>
>> PATCH 4: Introduce the preludes of vfio-cxl, including CXL device
>> initialization, CXL region creation.
>>
>> PATCH 5: Expose the CXL region to the userspace.
>>
>> PATCH 6-7: Prepare to emulate the HDM decoder registers.
>>
>> PATCH 8: Emulate the HDM decoder registers.
>>
>> PATCH 9: Tweak vfio-cxl to be aware of working on a CXL device.
>>
>> PATCH 10: Tell vfio-pci-core to emulate CXL DVSECs.
>>
>> PATCH 11: Expose the CXL device information that userspace needs.
>>
>> PATCH 12: An example variant driver to demonstrate the usage of
>> vfio-cxl-core from the perspective of the VFIO variant driver.
>>
>> PATCH 13: A workaround needs suggestions.
>>
>> Test
>> ====
>>
>> To test the patches and hack around, a virtual passthrough with nested
>> virtualization approach is used.
>>
>> The host QEMU emulates a CXL type-2 accel device based on Ira's patches
>> with the changes to emulate HDM decoders.
>>
>> While running the vfio-cxl in the L1 guest, an example VFIO variant
>> driver is used to attach with the QEMU CXL access device.
>>
>> The L2 guest can be booted via the QEMU with the vfio-cxl support in the
>> VFIOStub.
>>
>> In the L2 guest, a dummy CXL device driver is provided to attach to the
>> virtual pass-thru device.
>>
>> The dummy CXL type-2 device driver can successfully be loaded with the
>> kernel cxl core type2 support, create CXL region by requesting the CXL
>> core to allocate HPA and DPA and configure the HDM decoders.
>>
>> To make sure everyone can test the patches, the kernel config of L1 and
>> L2 are provided in the repos, the required kernel command params and
>> qemu command line can be found from the demostration video.[5]
>>
>> Repos
>> =====
>>
>> QEMU host: 
>> https://github.com/zhiwang-nvidia/qemu/tree/zhi/vfio-cxl-qemu-host
>> L1 Kernel: 
>> https://github.com/zhiwang-nvidia/linux/tree/zhi/vfio-cxl-l1-kernel-rfc
>> L1 QEMU: 
>> https://github.com/zhiwang-nvidia/qemu/tree/zhi/vfio-cxl-qemu-l1-rfc
>> L2 Kernel: https://github.com/zhiwang-nvidia/linux/tree/zhi/vfio-cxl-l2
>>
>> [1] https://computeexpresslink.org/cxl-specification/
>> [2] 
>> https://lore.kernel.org/netdev/20240715172835.24757-1-alejandro.lucero-palau@amd.com/T/
>> [3] 
>> https://patchew.org/QEMU/20230517-rfc-type2-dev-v1-0-6eb2e470981b@intel.com/
>> [4] https://youtu.be/zlk_ecX9bxs?si=hc8P58AdhGXff3Q7
>>
>> Feedback expected
>> =================
>>
>> - Archtiecture level between vfio-pci-core and vfio-cxl-core.
>> - Variant driver requirements from more hardware vendors.
>> - vfio-cxl-core UABI to QEMU.
>>
>> Moving foward
>> =============
>>
>> - Rebase the patches on top of Alejandro's PATCH v3.
>> - Get Ira's type-2 emulated device patch into upstream as CXL folks 
>> and RH
>>    folks both came to talk and expect this. I had a chat with Ira and he
>>    expected me to take it over. Will start a discussion in the CXL 
>> discord
>>    group for the desgin of V1.
>> - Sparse map in vfio-cxl-core.
>>
>> Known issues
>> ============
>>
>> - Teardown path. Missing teardown paths have been implements in 
>> Alejandor's
>>    PATCH v3. It should be solved after the rebase.
>>
>> - Powerdown L1 guest instead of reboot it. The QEMU reset handler is 
>> missing
>>    in the Ira's patch. When rebooting L1, many CXL registers are not 
>> reset.
>>    This will be addressed in the formal review of emulated CXL type-2 
>> device
>>    support.
>>
>> Zhi Wang (13):
>>    cxl: allow a type-2 device not to have memory device registers
>>    cxl: introduce cxl_get_hdm_info()
>>    cxl: introduce cxl_find_comp_reglock_offset()
>>    vfio: introduce vfio-cxl core preludes
>>    vfio/cxl: expose CXL region to the usersapce via a new VFIO device
>>      region
>>    vfio/pci: expose vfio_pci_rw()
>>    vfio/cxl: introduce vfio_cxl_core_{read, write}()
>>    vfio/cxl: emulate HDM decoder registers
>>    vfio/pci: introduce CXL device awareness
>>    vfio/pci: emulate CXL DVSEC registers in the configuration space
>>    vfio/cxl: introduce VFIO CXL device cap
>>    vfio/cxl: VFIO variant driver for QEMU CXL accel device
>>    vfio/cxl: workaround: don't take resource region when cxl is enabled.
>>
>>   drivers/cxl/core/pci.c              |  28 ++
>>   drivers/cxl/core/regs.c             |  22 +
>>   drivers/cxl/cxl.h                   |   1 +
>>   drivers/cxl/cxlpci.h                |   3 +
>>   drivers/cxl/pci.c                   |  14 +-
>>   drivers/vfio/pci/Kconfig            |   6 +
>>   drivers/vfio/pci/Makefile           |   5 +
>>   drivers/vfio/pci/cxl-accel/Kconfig  |   6 +
>>   drivers/vfio/pci/cxl-accel/Makefile |   3 +
>>   drivers/vfio/pci/cxl-accel/main.c   | 116 +++++
>>   drivers/vfio/pci/vfio_cxl_core.c    | 647 ++++++++++++++++++++++++++++
>>   drivers/vfio/pci/vfio_pci_config.c  |  10 +
>>   drivers/vfio/pci/vfio_pci_core.c    |  79 +++-
>>   drivers/vfio/pci/vfio_pci_rdwr.c    |   8 +-
>>   include/linux/cxl_accel_mem.h       |   3 +
>>   include/linux/cxl_accel_pci.h       |   6 +
>>   include/linux/vfio_pci_core.h       |  53 +++
>>   include/uapi/linux/vfio.h           |  14 +
>>   18 files changed, 992 insertions(+), 32 deletions(-)
>>   create mode 100644 drivers/vfio/pci/cxl-accel/Kconfig
>>   create mode 100644 drivers/vfio/pci/cxl-accel/Makefile
>>   create mode 100644 drivers/vfio/pci/cxl-accel/main.c
>>   create mode 100644 drivers/vfio/pci/vfio_cxl_core.c
>>


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 00/13] vfio: introduce vfio-cxl to support CXL type-2 accelerator passthrough
  2024-09-25 10:11 ` Alejandro Lucero Palau
  2024-09-27  7:38   ` Zhi Wang
@ 2024-09-27  7:38   ` Zhi Wang
  1 sibling, 0 replies; 38+ messages in thread
From: Zhi Wang @ 2024-09-27  7:38 UTC (permalink / raw)
  To: Alejandro Lucero Palau, kvm@vger.kernel.org,
	linux-cxl@vger.kernel.org
  Cc: alex.williamson@redhat.com, kevin.tian@intel.com, Jason Gunthorpe,
	alison.schofield@intel.com, dan.j.williams@intel.com,
	dave.jiang@intel.com, dave@stgolabs.net,
	jonathan.cameron@huawei.com, ira.weiny@intel.com,
	vishal.l.verma@intel.com, Andy Currid, Neo Jia, Surath Mitra,
	Ankit Agrawal, Aniket Agashe, Kirti Wankhede,
	Tarun Gupta (SW-GPU), zhiwang@kernel.org

On 25/09/2024 12.11, Alejandro Lucero Palau wrote:
> External email: Use caution opening links or attachments
> 
> 
> On 9/20/24 23:34, Zhi Wang wrote:
>> Hi folks:
>>
>> As promised in the LPC, here are all you need (patches, repos, guiding
>> video, kernel config) to build a environment to test the vfio-cxl-core.
>>
>> Thanks so much for the discussions! Enjoy and see you in the next one.
>>
>> Background
>> ==========
>>
>> Compute Express Link (CXL) is an open standard interconnect built upon
>> industrial PCI layers to enhance the performance and efficiency of data
>> centers by enabling high-speed, low-latency communication between CPUs
>> and various types of devices such as accelerators, memory.
>>
>> It supports three key protocols: CXL.io as the control protocol, 
>> CXL.cache
>> as the cache-coherent host-device data transfer protocol, and CXL.mem as
>> memory expansion protocol. CXL Type 2 devices leverage the three 
>> protocols
>> to seamlessly integrate with host CPUs, providing a unified and efficient
>> interface for high-speed data transfer and memory sharing. This 
>> integration
>> is crucial for heterogeneous computing environments where accelerators,
>> such as GPUs, and other specialized processors, are used to handle
>> intensive workloads.
>>
>> Goal
>> ====
>>
>> Although CXL is built upon the PCI layers, passing a CXL type-2 device 
>> can
>> be different than PCI devices according to CXL specification[1]:
>>
>> - CXL type-2 device initialization. CXL type-2 device requires an
>> additional initialization sequence besides the PCI device initialization.
>> CXL type-2 device initialization can be pretty complicated due to its
>> hierarchy of register interfaces. Thus, a standard CXL type-2 driver
>> initialization sequence provided by the kernel CXL core is used.
>>
>> - Create a CXL region and map it to the VM. A mapping between HPA and DPA
>> (Device PA) needs to be created to access the device memory directly. HDM
>> decoders in the CXL topology need to be configured level by level to
>> manage the mapping. After the region is created, it needs to be mapped to
>> GPA in the virtual HDM decoders configured by the VM.
>>
>> - CXL reset. The CXL device reset is different from the PCI device reset.
>> A CXL reset sequence is introduced by the CXL spec.
>>
>> - Emulating CXL DVSECs. CXL spec defines a set of DVSECs registers in the
>> configuration for device enumeration and device control. (E.g. if a 
>> device
>> is capable of CXL.mem CXL.cache, enable/disable capability) They are 
>> owned
>> by the kernel CXL core, and the VM can not modify them.
>>
>> - Emulate CXL MMIO registers. CXL spec defines a set of CXL MMIO 
>> registers
>> that can sit in a PCI BAR. The location of register groups sit in the PCI
>> BAR is indicated by the register locator in the CXL DVSECs. They are also
>> owned by the kernel CXL core. Some of them need to be emulated.
>>
>> Design
>> ======
>>
>> To achieve the purpose above, the vfio-cxl-core is introduced to host the
>> common routines that variant driver requires for device passthrough.
>> Similar with the vfio-pci-core, the vfio-cxl-core provides common
>> routines of vfio_device_ops for the variant driver to hook and perform 
>> the
>> CXL routines behind it.
>>
>> Besides, several extra APIs are introduced for the variant driver to
>> provide the necessary information the kernel CXL core to initialize
>> the CXL device. E.g., Device DPA.
>>
>> CXL is built upon the PCI layers but with differences. Thus, the
>> vfio-pci-core is aimed to be re-used as much as possible with the
>> awareness of operating on a CXL device.
>>
>> A new VFIO device region is introduced to expose the CXL region to the
>> userspace. A new CXL VFIO device cap has also been introduced to convey
>> the necessary CXL device information to the userspace.
> 
> 
> 
> Hi Zhi,
> 
> 
> As you know, I was confused with this work but after looking at the
> patchset and thinking about all this, it makes sense now. FWIW, the most
> confusing point was to use the CXL core inside the VM for creating the
> region what implies commits to the CXL root switch/complex and any other
> switch in the path. I realize now it will happen but on emulated
> hardware with no implication to the real one, which was updated with any
> necessary change like those commits by the vfio cxl code in the host (L1
> VM in your tests).
> 
> 
> The only problem I can see with this approach is the CXL initialization
> is left unconditionally to the hypervisor. I guess most of the time will
> be fine, but the driver could not be mapping/using the whole CXL mem
> always.  I know this could be awkward, but possible depending on the
> device state unrelated to CXL itself. 

Will this device states be one-time on/off state or a runtime 
configuration state that a guest need to poke all the time?

There can be two paths for handling these states in a vendor-specific 
variant driver: 1) vfio_device->fops->open() path, it suits for one-time 
on/off state 2) vfio_device->fops->{read, write}(), the VM 
exit->QEMU->variant driver path. The vendor-specific driver can 
configure the HW based on the register access from the guest.

It would be nice to know more about this, like how many registers the 
vendor-specific driver would like to handle. Thus, the VFIO CXL core can 
provide common helpers.

In other words, this approach
> assumes beforehand something which could not be true. What I had in mind
> was to have VM exits for any action on CXL configuration on behalf of
> that device/driver inside the device.
> 

Initially, this was a idea from Dan. I think this would be a good topic 
for the next CXL open-source collaboration meeting. Kevn also commented 
for this.

> 
> This is all more problematic with CXL.cache, and I think the same
> approach can not be followed. I'm writing a document trying to share all
> my concerns about CXL.cache and DMA/IOMMU mappings, and I will cover
> this for sure. As a quick note, while DMA/IOMMU has no limitations
> regarding the amount of memory to use, it is unlikely the same can be
> done due to scarce host snoop cache resources, therefore the CXL.cache
> mappings will likely need to be explicitly done by the driver and
> approved by the CXL core (along with DMA/IOMMU), and for a driver inside
> a VM that implies VM exits.
> 

Good to hear. Please CCme as well. Many thanks.

> 
> Thanks.
> 
> Alejandro.
> 
>> Patches
>> =======
>>
>> The patches are based on the cxl-type2 support RFCv2 patchset[2]. Will
>> rebase them to V3 once the cxl-type2 support v3 patch review is done.
>>
>> PATCH 1-3: Expose the necessary routines required by vfio-cxl.
>>
>> PATCH 4: Introduce the preludes of vfio-cxl, including CXL device
>> initialization, CXL region creation.
>>
>> PATCH 5: Expose the CXL region to the userspace.
>>
>> PATCH 6-7: Prepare to emulate the HDM decoder registers.
>>
>> PATCH 8: Emulate the HDM decoder registers.
>>
>> PATCH 9: Tweak vfio-cxl to be aware of working on a CXL device.
>>
>> PATCH 10: Tell vfio-pci-core to emulate CXL DVSECs.
>>
>> PATCH 11: Expose the CXL device information that userspace needs.
>>
>> PATCH 12: An example variant driver to demonstrate the usage of
>> vfio-cxl-core from the perspective of the VFIO variant driver.
>>
>> PATCH 13: A workaround needs suggestions.
>>
>> Test
>> ====
>>
>> To test the patches and hack around, a virtual passthrough with nested
>> virtualization approach is used.
>>
>> The host QEMU emulates a CXL type-2 accel device based on Ira's patches
>> with the changes to emulate HDM decoders.
>>
>> While running the vfio-cxl in the L1 guest, an example VFIO variant
>> driver is used to attach with the QEMU CXL access device.
>>
>> The L2 guest can be booted via the QEMU with the vfio-cxl support in the
>> VFIOStub.
>>
>> In the L2 guest, a dummy CXL device driver is provided to attach to the
>> virtual pass-thru device.
>>
>> The dummy CXL type-2 device driver can successfully be loaded with the
>> kernel cxl core type2 support, create CXL region by requesting the CXL
>> core to allocate HPA and DPA and configure the HDM decoders.
>>
>> To make sure everyone can test the patches, the kernel config of L1 and
>> L2 are provided in the repos, the required kernel command params and
>> qemu command line can be found from the demostration video.[5]
>>
>> Repos
>> =====
>>
>> QEMU host: 
>> https://github.com/zhiwang-nvidia/qemu/tree/zhi/vfio-cxl-qemu-host
>> L1 Kernel: 
>> https://github.com/zhiwang-nvidia/linux/tree/zhi/vfio-cxl-l1-kernel-rfc
>> L1 QEMU: 
>> https://github.com/zhiwang-nvidia/qemu/tree/zhi/vfio-cxl-qemu-l1-rfc
>> L2 Kernel: https://github.com/zhiwang-nvidia/linux/tree/zhi/vfio-cxl-l2
>>
>> [1] https://computeexpresslink.org/cxl-specification/
>> [2] 
>> https://lore.kernel.org/netdev/20240715172835.24757-1-alejandro.lucero-palau@amd.com/T/
>> [3] 
>> https://patchew.org/QEMU/20230517-rfc-type2-dev-v1-0-6eb2e470981b@intel.com/
>> [4] https://youtu.be/zlk_ecX9bxs?si=hc8P58AdhGXff3Q7
>>
>> Feedback expected
>> =================
>>
>> - Archtiecture level between vfio-pci-core and vfio-cxl-core.
>> - Variant driver requirements from more hardware vendors.
>> - vfio-cxl-core UABI to QEMU.
>>
>> Moving foward
>> =============
>>
>> - Rebase the patches on top of Alejandro's PATCH v3.
>> - Get Ira's type-2 emulated device patch into upstream as CXL folks 
>> and RH
>>    folks both came to talk and expect this. I had a chat with Ira and he
>>    expected me to take it over. Will start a discussion in the CXL 
>> discord
>>    group for the desgin of V1.
>> - Sparse map in vfio-cxl-core.
>>
>> Known issues
>> ============
>>
>> - Teardown path. Missing teardown paths have been implements in 
>> Alejandor's
>>    PATCH v3. It should be solved after the rebase.
>>
>> - Powerdown L1 guest instead of reboot it. The QEMU reset handler is 
>> missing
>>    in the Ira's patch. When rebooting L1, many CXL registers are not 
>> reset.
>>    This will be addressed in the formal review of emulated CXL type-2 
>> device
>>    support.
>>
>> Zhi Wang (13):
>>    cxl: allow a type-2 device not to have memory device registers
>>    cxl: introduce cxl_get_hdm_info()
>>    cxl: introduce cxl_find_comp_reglock_offset()
>>    vfio: introduce vfio-cxl core preludes
>>    vfio/cxl: expose CXL region to the usersapce via a new VFIO device
>>      region
>>    vfio/pci: expose vfio_pci_rw()
>>    vfio/cxl: introduce vfio_cxl_core_{read, write}()
>>    vfio/cxl: emulate HDM decoder registers
>>    vfio/pci: introduce CXL device awareness
>>    vfio/pci: emulate CXL DVSEC registers in the configuration space
>>    vfio/cxl: introduce VFIO CXL device cap
>>    vfio/cxl: VFIO variant driver for QEMU CXL accel device
>>    vfio/cxl: workaround: don't take resource region when cxl is enabled.
>>
>>   drivers/cxl/core/pci.c              |  28 ++
>>   drivers/cxl/core/regs.c             |  22 +
>>   drivers/cxl/cxl.h                   |   1 +
>>   drivers/cxl/cxlpci.h                |   3 +
>>   drivers/cxl/pci.c                   |  14 +-
>>   drivers/vfio/pci/Kconfig            |   6 +
>>   drivers/vfio/pci/Makefile           |   5 +
>>   drivers/vfio/pci/cxl-accel/Kconfig  |   6 +
>>   drivers/vfio/pci/cxl-accel/Makefile |   3 +
>>   drivers/vfio/pci/cxl-accel/main.c   | 116 +++++
>>   drivers/vfio/pci/vfio_cxl_core.c    | 647 ++++++++++++++++++++++++++++
>>   drivers/vfio/pci/vfio_pci_config.c  |  10 +
>>   drivers/vfio/pci/vfio_pci_core.c    |  79 +++-
>>   drivers/vfio/pci/vfio_pci_rdwr.c    |   8 +-
>>   include/linux/cxl_accel_mem.h       |   3 +
>>   include/linux/cxl_accel_pci.h       |   6 +
>>   include/linux/vfio_pci_core.h       |  53 +++
>>   include/uapi/linux/vfio.h           |  14 +
>>   18 files changed, 992 insertions(+), 32 deletions(-)
>>   create mode 100644 drivers/vfio/pci/cxl-accel/Kconfig
>>   create mode 100644 drivers/vfio/pci/cxl-accel/Makefile
>>   create mode 100644 drivers/vfio/pci/cxl-accel/main.c
>>   create mode 100644 drivers/vfio/pci/vfio_cxl_core.c
>>


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 00/13] vfio: introduce vfio-cxl to support CXL type-2 accelerator passthrough
  2024-09-27  7:18       ` Zhi Wang
@ 2024-10-04 11:40         ` Jonathan Cameron
  2024-10-19  5:30           ` Zhi Wang
  0 siblings, 1 reply; 38+ messages in thread
From: Jonathan Cameron @ 2024-10-04 11:40 UTC (permalink / raw)
  To: Zhi Wang
  Cc: Tian, Kevin, kvm@vger.kernel.org, linux-cxl@vger.kernel.org,
	alex.williamson@redhat.com, Jason Gunthorpe, Schofield, Alison,
	Williams, Dan J, Jiang, Dave, dave@stgolabs.net, Weiny, Ira,
	Verma, Vishal L, alucerop@amd.com, Andy Currid, Neo Jia,
	Surath Mitra, Ankit Agrawal, Aniket Agashe, Kirti Wankhede,
	Tarun Gupta (SW-GPU), zhiwang@kernel.org


> >   
> >>
> >> Presumably, the host creates one large CXL region that covers the entire
> >> DPA, while QEMU can virtually partition it into different regions and
> >> map them to different virtual CXL region if QEMU presents multiple HDM
> >> decoders to the guest.  
> > 
> > I'm not sure why it would do that. Can't think why you'd break up
> > a host region - maybe I'm missing something.
> >   
> 
> It is mostly concerning about a device can have multiple HDM decoders. 
> In the current design, a large physical CXL (pCXL) region with the whole 
> DPA will be passed to the userspace. Thinking that the guest will see 
> the virtual multiple HDM decoders, which usually SW is asking for, the 
> guest SW might create multiple virtual CXL regions. In that case QEMU 
> needs to map them into different regions of the pCXL region.

Don't let the guest see multiple HDM decoders?

There is no obvious reason why it would want them other than type
differences.

Why is it useful for a type 2 device to be setup for multiple CXL regions?
It shouldn't be a performance thing. Might be convenient for management
I guess, but the driver can layer it's own allocator etc on top of a single
region so I'm not sure I see a reason to do this...

Jonathan



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 04/13] vfio: introduce vfio-cxl core preludes
  2024-09-20 22:34 ` [RFC 04/13] vfio: introduce vfio-cxl core preludes Zhi Wang
@ 2024-10-11 18:33   ` Alex Williamson
  0 siblings, 0 replies; 38+ messages in thread
From: Alex Williamson @ 2024-10-11 18:33 UTC (permalink / raw)
  To: Zhi Wang
  Cc: kvm, linux-cxl, kevin.tian, jgg, alison.schofield, dan.j.williams,
	dave.jiang, dave, jonathan.cameron, ira.weiny, vishal.l.verma,
	alucerop, acurrid, cjia, smitra, ankita, aniketa, kwankhede,
	targupta, zhiwang

On Fri, 20 Sep 2024 15:34:37 -0700
Zhi Wang <zhiw@nvidia.com> wrote:

> In VFIO, common functions that used by VFIO variant drivers are managed
> in a set of "core" functions. E.g. the vfio-pci-core provides the common
> functions used by VFIO variant drviers to support PCI device
> passhthrough.
> 
> Although the CXL type-2 device has a PCI-compatible interface for device
> configuration and programming, they still needs special handlings when
> initialize the device:
> 
> - Probing the CXL DVSECs in the configuration.
> - Probing the CXL register groups implemented by the device.
> - Configuring the CXL device state required by the kernel CXL core.
> - Create the CXL region.
> - Special handlings of the CXL MMIO BAR.
> 
> Introduce vfio-cxl core predules to hold all the common functions used

s/predules/preludes/

> by VFIO variant drivers to support CXL device passthrough.
> 
> Signed-off-by: Zhi Wang <zhiw@nvidia.com>
> ---
>  drivers/vfio/pci/Kconfig         |   4 +
>  drivers/vfio/pci/Makefile        |   3 +
>  drivers/vfio/pci/vfio_cxl_core.c | 264 +++++++++++++++++++++++++++++++
>  include/linux/vfio_pci_core.h    |  37 +++++
>  4 files changed, 308 insertions(+)
>  create mode 100644 drivers/vfio/pci/vfio_cxl_core.c
> 
> diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
> index bf50ffa10bde..2196e79b132b 100644
> --- a/drivers/vfio/pci/Kconfig
> +++ b/drivers/vfio/pci/Kconfig
> @@ -7,6 +7,10 @@ config VFIO_PCI_CORE
>  	select VFIO_VIRQFD
>  	select IRQ_BYPASS_MANAGER
>  
> +config VFIO_CXL_CORE
> +	tristate
> +	select VFIO_PCI_CORE

I don't see anything in this series that depends on CXL Kconfigs, so it
seems this will break in randconfig when the resulting vfio-cxl variant
driver is enabled without core CXL support.

> +
>  config VFIO_PCI_MMAP
>  	def_bool y if !S390
>  	depends on VFIO_PCI_CORE
> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
> index cf00c0a7e55c..b51221b94b0b 100644
> --- a/drivers/vfio/pci/Makefile
> +++ b/drivers/vfio/pci/Makefile
> @@ -8,6 +8,9 @@ vfio-pci-y := vfio_pci.o
>  vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
>  obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
>  
> +vfio-cxl-core-y := vfio_cxl_core.o
> +obj-$(CONFIG_VFIO_CXL_CORE) += vfio-cxl-core.o
> +
>  obj-$(CONFIG_MLX5_VFIO_PCI)           += mlx5/
>  
>  obj-$(CONFIG_HISI_ACC_VFIO_PCI) += hisilicon/
> diff --git a/drivers/vfio/pci/vfio_cxl_core.c b/drivers/vfio/pci/vfio_cxl_core.c
> new file mode 100644
> index 000000000000..6a7859333f67
> --- /dev/null
> +++ b/drivers/vfio/pci/vfio_cxl_core.c
> @@ -0,0 +1,264 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved
> + */
> +
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +
> +#include <linux/device.h>
> +#include <linux/eventfd.h>
> +#include <linux/file.h>
> +#include <linux/interrupt.h>
> +#include <linux/iommu.h>
> +#include <linux/module.h>
> +#include <linux/mutex.h>
> +#include <linux/notifier.h>
> +#include <linux/pci.h>
> +#include <linux/pm_runtime.h>
> +#include <linux/slab.h>
> +#include <linux/types.h>
> +#include <linux/uaccess.h>
> +
> +#include "vfio_pci_priv.h"
> +
> +#define DRIVER_AUTHOR "Zhi Wang <zhiw@nvidia.com>"
> +#define DRIVER_DESC "core driver for VFIO based CXL devices"
> +
> +static int get_hpa_and_request_dpa(struct vfio_pci_core_device *core_dev)
> +{
> +	struct vfio_cxl *cxl = &core_dev->cxl;
> +	struct pci_dev *pdev = core_dev->pdev;
> +	u64 max;
> +
> +	cxl->cxlrd = cxl_get_hpa_freespace(cxl->endpoint, 1,
> +					   CXL_DECODER_F_RAM |
> +					   CXL_DECODER_F_TYPE2,
> +					   &max);

I don't see that this adhere to the comment in cxl_get_hpa_freespace()
that the caller needs to deal with the elevated ref count on the root
decoder.  There's no put_device() call in either the error path or
disable path.

Also, maybe this is inherent in the cxl code, but cxl->cxlrd seems
redundant to me, couldn't we refer to this as cxl->root_decoder? (or
some variant more descriptive than "rd")

Is this exclusively a type2 extension or how do you envision type1/3
devices with vfio?

> +	if (IS_ERR(cxl->cxlrd)) {
> +		pci_err(pdev, "Fail to get HPA space.\n");
> +		return PTR_ERR(cxl->cxlrd);
> +	}
> +
> +	if (max < cxl->region.size) {
> +		pci_err(pdev, "No enough free HPA space %llu < %llu\n",
> +			max, cxl->region.size);
> +		return -ENOSPC;
> +	}
> +
> +	cxl->cxled = cxl_request_dpa(cxl->endpoint, true, cxl->region.size,
> +				     cxl->region.size);

cxl->endpoint_decoder? cxl->endp_dec?

> +	if (IS_ERR(cxl->cxled)) {
> +		pci_err(pdev, "Fail to request DPA\n");
> +		return PTR_ERR(cxl->cxled);
> +	}
> +
> +	return 0;
> +}
> +
> +static int create_cxl_region(struct vfio_pci_core_device *core_dev)
> +{
> +	struct vfio_cxl *cxl = &core_dev->cxl;
> +	struct pci_dev *pdev = core_dev->pdev;
> +	resource_size_t start, end;
> +	int ret;
> +
> +	ret = cxl_accel_request_resource(cxl->cxlds, true);
> +	if (ret) {
> +		pci_err(pdev, "Fail to request CXL resource\n");
> +		return ret;
> +	}

Where is the corresponding release_resource()?

> +
> +	if (!cxl_await_media_ready(cxl->cxlds)) {
> +		cxl_accel_set_media_ready(cxl->cxlds);
> +	} else {
> +		pci_err(pdev, "CXL media is not active\n");
> +		return ret;
> +	}

We're not capturing the media ready error for this return.  I think
Jason would typically suggest a success oriented flow as:

	ret = cxl_await_media_ready(cxl->cxlds)
	if (ret) {
		pci_err(...);
		return ret;
	}
	cxl_accel_set_media_ready(cxl->cxlds);

> +
> +	cxl->cxlmd = devm_cxl_add_memdev(&pdev->dev, cxl->cxlds);
> +	if (IS_ERR(cxl->cxlmd)) {
> +		pci_err(pdev, "Fail to create CXL memdev\n");
> +		return PTR_ERR(cxl->cxlmd);
> +	}
> +
> +	cxl->endpoint = cxl_acquire_endpoint(cxl->cxlmd);
> +	if (IS_ERR(cxl->endpoint)) {
> +		pci_err(pdev, "Fail to acquire CXL endpoint\n");
> +		return PTR_ERR(cxl->endpoint);
> +	}
> +
> +	ret = get_hpa_and_request_dpa(core_dev);
> +	if (ret)
> +		goto out;
> +
> +	cxl->region.region = cxl_create_region(cxl->cxlrd, &cxl->cxled, 1);
> +	if (IS_ERR(cxl->region.region)) {
> +		ret = PTR_ERR(cxl->region.region);
> +		pci_err(pdev, "Fail to create CXL region\n");
> +		cxl_dpa_free(cxl->cxled);
> +		goto out;
> +	}
> +
> +	cxl_accel_get_region_params(cxl->region.region, &start, &end);
> +
> +	cxl->region.addr = start;
> +out:
> +	cxl_release_endpoint(cxl->cxlmd, cxl->endpoint);
> +	return ret;
> +}
> +
> +/* Standard CXL-type 2 driver initialization sequence */
> +static int enable_cxl(struct vfio_pci_core_device *core_dev, u16 dvsec)
> +{
> +	struct vfio_cxl *cxl = &core_dev->cxl;
> +	struct pci_dev *pdev = core_dev->pdev;
> +	u32 count;
> +	u64 offset, size;
> +	int ret;
> +
> +	cxl->cxlds = cxl_accel_state_create(&pdev->dev, cxl->caps);
> +	if (IS_ERR(cxl->cxlds))
> +		return PTR_ERR(cxl->cxlds);
> +
> +	cxl_accel_set_dvsec(cxl->cxlds, dvsec);
> +	cxl_accel_set_serial(cxl->cxlds, pdev->dev.id);

Doesn't seem to meet the description were cxl_device_state.serial is
described as the PCIe device serial number, not a struct device
instance number.

> +
> +	cxl_accel_set_resource(cxl->cxlds, cxl->dpa_res, CXL_ACCEL_RES_DPA);
> +	cxl_accel_set_resource(cxl->cxlds, cxl->ram_res, CXL_ACCEL_RES_RAM);
> +
> +	ret = cxl_pci_accel_setup_regs(pdev, cxl->cxlds);
> +	if (ret) {
> +		pci_err(pdev, "Fail to setup CXL accel regs\n");
> +		return ret;
> +	}
> +
> +	ret = cxl_get_hdm_info(cxl->cxlds, &count, &offset, &size);
> +	if (ret)
> +		return ret;
> +
> +	if (!count || !size) {
> +		pci_err(pdev, "Fail to find CXL HDM reg offset\n");
> +		return -ENODEV;
> +	}
> +
> +	cxl->hdm_count = count;
> +	cxl->hdm_reg_offset = offset;
> +	cxl->hdm_reg_size = size;
> +
> +	return create_cxl_region(core_dev);
> +}
> +
> +static void disable_cxl(struct vfio_pci_core_device *core_dev)
> +{
> +	struct vfio_cxl *cxl = &core_dev->cxl;
> +
> +	if (cxl->region.region)
> +		cxl_region_detach(cxl->cxled);
> +
> +	if (cxl->cxled)
> +		cxl_dpa_free(cxl->cxled);
> +}
> +
> +int vfio_cxl_core_enable(struct vfio_pci_core_device *core_dev)
> +{
> +	struct vfio_cxl *cxl = &core_dev->cxl;
> +	struct pci_dev *pdev = core_dev->pdev;
> +	u16 dvsec;
> +	int ret;
> +
> +	dvsec = pci_find_dvsec_capability(pdev, PCI_VENDOR_ID_CXL,
> +					  CXL_DVSEC_PCIE_DEVICE);
> +	if (!dvsec)
> +		return -ENODEV;
> +
> +	if (!cxl->region.size)
> +		return -EINVAL;
> +
> +	ret = vfio_pci_core_enable(core_dev);
> +	if (ret)
> +		return ret;
> +
> +	ret = enable_cxl(core_dev, dvsec);
> +	if (ret)
> +		goto err_enable_cxl_device;
> +
> +	return 0;
> +
> +err_enable_cxl_device:
> +	vfio_pci_core_disable(core_dev);
> +	return ret;
> +}
> +EXPORT_SYMBOL(vfio_cxl_core_enable);

These should all be _GPL symbols by default, right?

> +
> +void vfio_cxl_core_finish_enable(struct vfio_pci_core_device *core_dev)
> +{
> +	vfio_pci_core_finish_enable(core_dev);
> +}
> +EXPORT_SYMBOL(vfio_cxl_core_finish_enable);
> +
> +void vfio_cxl_core_close_device(struct vfio_device *vdev)
> +{
> +	struct vfio_pci_core_device *core_dev =
> +		container_of(vdev, struct vfio_pci_core_device, vdev);
> +
> +	disable_cxl(core_dev);
> +	vfio_pci_core_close_device(vdev);
> +}
> +EXPORT_SYMBOL(vfio_cxl_core_close_device);
> +
> +/*
> + * Configure the resource required by the kernel CXL core:
> + * device DPA and device RAM size
> + */
> +void vfio_cxl_core_set_resource(struct vfio_pci_core_device *core_dev,
> +				struct resource res,
> +				enum accel_resource type)
> +{
> +	struct vfio_cxl *cxl = &core_dev->cxl;
> +
> +	switch (type) {
> +	case CXL_ACCEL_RES_DPA:
> +		cxl->dpa_size = res.end - res.start + 1;
> +		cxl->dpa_res = res;
> +		break;
> +
> +	case CXL_ACCEL_RES_RAM:
> +		cxl->ram_res = res;
> +		break;
> +
> +	default:
> +		WARN(1, "invalid resource type: %d\n", type);
> +		break;
> +	}
> +}
> +EXPORT_SYMBOL(vfio_cxl_core_set_resource);

It's not obvious to me why we want to multiplex these through one
function rather than have separate functions to set the dpa and ram.
The usage in patch 12/ doesn't really dictate a multiplexed function.

> +
> +/* Configure the expected CXL region size to be created */
> +void vfio_cxl_core_set_region_size(struct vfio_pci_core_device *core_dev,
> +				   u64 size)
> +{
> +	struct vfio_cxl *cxl = &core_dev->cxl;
> +
> +	if (WARN_ON(size > cxl->dpa_size))
> +		return;
> +
> +	if (WARN_ON(cxl->region.region))
> +		return;
> +
> +	cxl->region.size = size;
> +}
> +EXPORT_SYMBOL(vfio_cxl_core_set_region_size);
> +
> +/* Configure the driver cap required by the kernel CXL core */
> +void vfio_cxl_core_set_driver_hdm_cap(struct vfio_pci_core_device *core_dev)
> +{
> +	struct vfio_cxl *cxl = &core_dev->cxl;
> +
> +	cxl->caps |= CXL_ACCEL_DRIVER_CAP_HDM;
> +}
> +EXPORT_SYMBOL(vfio_cxl_core_set_driver_hdm_cap);
> +
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR(DRIVER_AUTHOR);
> +MODULE_DESCRIPTION(DRIVER_DESC);
> +MODULE_IMPORT_NS(CXL);
> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
> index fbb472dd99b3..7762d4a3e825 100644
> --- a/include/linux/vfio_pci_core.h
> +++ b/include/linux/vfio_pci_core.h
> @@ -15,6 +15,8 @@
>  #include <linux/types.h>
>  #include <linux/uuid.h>
>  #include <linux/notifier.h>
> +#include <linux/cxl_accel_mem.h>
> +#include <linux/cxl_accel_pci.h>
>  
>  #ifndef VFIO_PCI_CORE_H
>  #define VFIO_PCI_CORE_H
> @@ -49,6 +51,31 @@ struct vfio_pci_region {
>  	u32				flags;
>  };
>  
> +struct vfio_cxl_region {
> +	u64 size;
> +	u64 addr;
> +	struct cxl_region *region;
> +};
> +
> +struct vfio_cxl {
> +	u8 caps;
> +	u64 dpa_size;
> +
> +	u32 hdm_count;

Poor packing, caps and hdm_count should at least be adjacent to leave
only a single 24-bit gap.

> +	u64 hdm_reg_offset;
> +	u64 hdm_reg_size;
> +
> +	struct cxl_dev_state *cxlds;
> +	struct cxl_memdev *cxlmd;
> +	struct cxl_root_decoder *cxlrd;
> +	struct cxl_port *endpoint;
> +	struct cxl_endpoint_decoder *cxled;
> +	struct resource dpa_res;
> +	struct resource ram_res;
> +
> +	struct vfio_cxl_region region;
> +};
> +
>  struct vfio_pci_core_device {
>  	struct vfio_device	vdev;
>  	struct pci_dev		*pdev;
> @@ -94,6 +121,7 @@ struct vfio_pci_core_device {
>  	struct vfio_pci_core_device	*sriov_pf_core_dev;
>  	struct notifier_block	nb;
>  	struct rw_semaphore	memory_lock;
> +	struct vfio_cxl		cxl;

I'd prefer we not embed a structure here that's unused for 100% of
current use cases.  Why can't we have:

struct vfio_cxl_core_device {
	struct vfio_pci_core_device	pci_core;
	struct vfio_cxl			clx;
};

Thanks,
Alex

>  };
>  
>  /* Will be exported for vfio pci drivers usage */
> @@ -159,4 +187,13 @@ VFIO_IOREAD_DECLARATION(32)
>  VFIO_IOREAD_DECLARATION(64)
>  #endif
>  
> +int vfio_cxl_core_enable(struct vfio_pci_core_device *core_dev);
> +void vfio_cxl_core_finish_enable(struct vfio_pci_core_device *core_dev);
> +void vfio_cxl_core_close_device(struct vfio_device *vdev);
> +void vfio_cxl_core_set_resource(struct vfio_pci_core_device *core_dev,
> +				struct resource res,
> +				enum accel_resource type);
> +void vfio_cxl_core_set_region_size(struct vfio_pci_core_device *core_dev,
> +				   u64 size);
> +void vfio_cxl_core_set_driver_hdm_cap(struct vfio_pci_core_device *core_dev);
>  #endif /* VFIO_PCI_CORE_H */


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 05/13] vfio/cxl: expose CXL region to the usersapce via a new VFIO device region
  2024-09-20 22:34 ` [RFC 05/13] vfio/cxl: expose CXL region to the usersapce via a new VFIO device region Zhi Wang
@ 2024-10-11 19:12   ` Alex Williamson
  0 siblings, 0 replies; 38+ messages in thread
From: Alex Williamson @ 2024-10-11 19:12 UTC (permalink / raw)
  To: Zhi Wang
  Cc: kvm, linux-cxl, kevin.tian, jgg, alison.schofield, dan.j.williams,
	dave.jiang, dave, jonathan.cameron, ira.weiny, vishal.l.verma,
	alucerop, acurrid, cjia, smitra, ankita, aniketa, kwankhede,
	targupta, zhiwang

On Fri, 20 Sep 2024 15:34:38 -0700
Zhi Wang <zhiw@nvidia.com> wrote:

> To directly access the device memory, a CXL region is required. Creating
> a CXL region requires to configure HDM decoders on the path to map the
> access of HPA level by level and evetually hit the DPA in the CXL
> topology.
> 
> For the usersapce, e.g. QEMU, to access the CXL region, the region is
> required to be exposed via VFIO interfaces.
> 
> Introduce a new VFIO device region and region ops to expose the created
> CXL region when initailize the device in the vfio-cxl-core. Introduce a
> new sub region type for the userspace to identify a CXL region.
> 
> Signed-off-by: Zhi Wang <zhiw@nvidia.com>
> ---
>  drivers/vfio/pci/vfio_cxl_core.c   | 140 ++++++++++++++++++++++++++++-
>  drivers/vfio/pci/vfio_pci_config.c |   1 +
>  include/linux/vfio_pci_core.h      |   1 +
>  include/uapi/linux/vfio.h          |   3 +
>  4 files changed, 144 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/pci/vfio_cxl_core.c b/drivers/vfio/pci/vfio_cxl_core.c
> index 6a7859333f67..ffc15fd94b22 100644
> --- a/drivers/vfio/pci/vfio_cxl_core.c
> +++ b/drivers/vfio/pci/vfio_cxl_core.c
> @@ -102,6 +102,13 @@ static int create_cxl_region(struct vfio_pci_core_device *core_dev)
>  	cxl_accel_get_region_params(cxl->region.region, &start, &end);
>  
>  	cxl->region.addr = start;
> +	cxl->region.vaddr = ioremap(start, end - start);
> +	if (!cxl->region.addr) {

This is testing addr, not the newly added vaddr.

> +		pci_err(pdev, "Fail to map CXL region\n");
> +		cxl_region_detach(cxl->cxled);
> +		cxl_dpa_free(cxl->cxled);
> +		goto out;
> +	}
>  out:
>  	cxl_release_endpoint(cxl->cxlmd, cxl->endpoint);
>  	return ret;
> @@ -152,17 +159,135 @@ static void disable_cxl(struct vfio_pci_core_device *core_dev)
>  {
>  	struct vfio_cxl *cxl = &core_dev->cxl;
>  
> -	if (cxl->region.region)
> +	if (cxl->region.region) {
> +		iounmap(cxl->region.vaddr);
>  		cxl_region_detach(cxl->cxled);
> +	}
>  
>  	if (cxl->cxled)
>  		cxl_dpa_free(cxl->cxled);
>  }
>  
> +static unsigned long vma_to_pfn(struct vm_area_struct *vma)
> +{
> +	struct vfio_pci_core_device *vdev = vma->vm_private_data;
> +	struct vfio_cxl *cxl = &vdev->cxl;
> +	u64 pgoff;
> +
> +	pgoff = vma->vm_pgoff &
> +		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
> +
> +	return (cxl->region.addr >> PAGE_SHIFT) + pgoff;
> +}
> +
> +static vm_fault_t vfio_cxl_mmap_fault(struct vm_fault *vmf)
> +{
> +	struct vm_area_struct *vma = vmf->vma;
> +	struct vfio_pci_core_device *vdev = vma->vm_private_data;
> +	unsigned long pfn, pgoff = vmf->pgoff - vma->vm_pgoff;
> +	unsigned long addr = vma->vm_start;
> +	vm_fault_t ret = VM_FAULT_SIGBUS;
> +
> +	pfn = vma_to_pfn(vma);
> +
> +	down_read(&vdev->memory_lock);
> +
> +	if (vdev->pm_runtime_engaged || !__vfio_pci_memory_enabled(vdev))
> +		goto out_unlock;
> +
> +	ret = vmf_insert_pfn(vma, vmf->address, pfn + pgoff);
> +	if (ret & VM_FAULT_ERROR)
> +		goto out_unlock;
> +
> +	for (; addr < vma->vm_end; addr += PAGE_SIZE, pfn++) {
> +		if (addr == vmf->address)
> +			continue;
> +
> +		if (vmf_insert_pfn(vma, addr, pfn) & VM_FAULT_ERROR)
> +			break;
> +	}
> +
> +out_unlock:
> +	up_read(&vdev->memory_lock);
> +
> +	return ret;
> +}

This is identical to vfio_pci_mmap_fault(), we should figure out how to
reuse it rather than duplicate it.

> +
> +static const struct vm_operations_struct vfio_cxl_mmap_ops = {
> +	.fault = vfio_cxl_mmap_fault,
> +};

This should make use of the huge_fault support similar to what we added
in vfio-pci too, right?

> +
> +static int vfio_cxl_region_mmap(struct vfio_pci_core_device *core_dev,
> +				struct vfio_pci_region *region,
> +				struct vm_area_struct *vma)
> +{
> +	struct vfio_cxl *cxl = &core_dev->cxl;
> +	u64 phys_len, req_len, pgoff, req_start;
> +
> +	if (!(region->flags & VFIO_REGION_INFO_FLAG_MMAP))
> +		return -EINVAL;
> +
> +	if (!(region->flags & VFIO_REGION_INFO_FLAG_READ) &&
> +	    (vma->vm_flags & VM_READ))
> +		return -EPERM;
> +
> +	if (!(region->flags & VFIO_REGION_INFO_FLAG_WRITE) &&
> +	    (vma->vm_flags & VM_WRITE))
> +		return -EPERM;
> +
> +	phys_len = cxl->region.size;
> +	req_len = vma->vm_end - vma->vm_start;
> +	pgoff = vma->vm_pgoff &
> +		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
> +	req_start = pgoff << PAGE_SHIFT;
> +
> +	if (req_start + req_len > phys_len)
> +		return -EINVAL;
> +
> +	vma->vm_private_data = core_dev;
> +	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);

I thought a large part of CXL was that the memory is coherent, maybe I
don't understand what we're providing access to here.

> +	vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);
> +
> +	vm_flags_set(vma, VM_ALLOW_ANY_UNCACHED | VM_IO | VM_PFNMAP |
> +			VM_DONTEXPAND | VM_DONTDUMP);
> +	vma->vm_ops = &vfio_cxl_mmap_ops;
> +
> +	return 0;
> +}
> +
> +static ssize_t vfio_cxl_region_rw(struct vfio_pci_core_device *core_dev,
> +				  char __user *buf, size_t count, loff_t *ppos,
> +				  bool iswrite)
> +{
> +	unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
> +	struct vfio_cxl_region *cxl_region = core_dev->region[i].data;
> +	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
> +
> +	if (!count)
> +		return 0;
> +
> +	return vfio_pci_core_do_io_rw(core_dev, false,
> +				      cxl_region->vaddr,
> +				      (char __user *)buf, pos, count,
> +				      0, 0, iswrite);
> +}
> +
> +static void vfio_cxl_region_release(struct vfio_pci_core_device *vdev,
> +				    struct vfio_pci_region *region)
> +{
> +}
> +
> +static const struct vfio_pci_regops vfio_cxl_regops = {
> +	.rw		= vfio_cxl_region_rw,
> +	.mmap		= vfio_cxl_region_mmap,
> +	.release	= vfio_cxl_region_release,
> +};
> +
>  int vfio_cxl_core_enable(struct vfio_pci_core_device *core_dev)
>  {
>  	struct vfio_cxl *cxl = &core_dev->cxl;
>  	struct pci_dev *pdev = core_dev->pdev;
> +	u32 flags;
>  	u16 dvsec;
>  	int ret;
>  
> @@ -182,8 +307,21 @@ int vfio_cxl_core_enable(struct vfio_pci_core_device *core_dev)
>  	if (ret)
>  		goto err_enable_cxl_device;
>  
> +	flags = VFIO_REGION_INFO_FLAG_READ |
> +		VFIO_REGION_INFO_FLAG_WRITE |
> +		VFIO_REGION_INFO_FLAG_MMAP;
> +
> +	ret = vfio_pci_core_register_dev_region(core_dev,
> +		PCI_VENDOR_ID_CXL | VFIO_REGION_TYPE_PCI_VENDOR_TYPE,
> +		VFIO_REGION_SUBTYPE_CXL, &vfio_cxl_regops,
> +		cxl->region.size, flags, &cxl->region);
> +	if (ret)
> +		goto err_register_cxl_region;
> +
>  	return 0;
>  
> +err_register_cxl_region:
> +	disable_cxl(core_dev);
>  err_enable_cxl_device:
>  	vfio_pci_core_disable(core_dev);
>  	return ret;
> diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
> index 97422aafaa7b..98f3ac2d305c 100644
> --- a/drivers/vfio/pci/vfio_pci_config.c
> +++ b/drivers/vfio/pci/vfio_pci_config.c
> @@ -412,6 +412,7 @@ bool __vfio_pci_memory_enabled(struct vfio_pci_core_device *vdev)
>  	return pdev->current_state < PCI_D3hot &&
>  	       (pdev->no_command_memory || (cmd & PCI_COMMAND_MEMORY));
>  }
> +EXPORT_SYMBOL(__vfio_pci_memory_enabled);

_GPL

There should also be a lockdep assert added if this is exported.  With
that it could also drop the underscore prefix.

>  
>  /*
>   * Restore the *real* BARs after we detect a FLR or backdoor reset.
> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
> index 7762d4a3e825..6523d9d1bffe 100644
> --- a/include/linux/vfio_pci_core.h
> +++ b/include/linux/vfio_pci_core.h
> @@ -54,6 +54,7 @@ struct vfio_pci_region {
>  struct vfio_cxl_region {
>  	u64 size;
>  	u64 addr;
> +	void *vaddr;
>  	struct cxl_region *region;
>  };
>  
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 2b68e6cdf190..71f766c29060 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -372,6 +372,9 @@ struct vfio_region_info_cap_type {
>  /* sub-types for VFIO_REGION_TYPE_GFX */
>  #define VFIO_REGION_SUBTYPE_GFX_EDID            (1)
>  
> +/* sub-types for VFIO CXL region */
> +#define VFIO_REGION_SUBTYPE_CXL                 (1)

This is a PCI vendor sub-type with vendor ID 0x1e98.  The comment
should reflect that.  Thanks,

Alex

> +
>  /**
>   * struct vfio_region_gfx_edid - EDID region layout.
>   *


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 09/13] vfio/pci: introduce CXL device awareness
  2024-09-20 22:34 ` [RFC 09/13] vfio/pci: introduce CXL device awareness Zhi Wang
@ 2024-10-11 20:37   ` Alex Williamson
  0 siblings, 0 replies; 38+ messages in thread
From: Alex Williamson @ 2024-10-11 20:37 UTC (permalink / raw)
  To: Zhi Wang
  Cc: kvm, linux-cxl, kevin.tian, jgg, alison.schofield, dan.j.williams,
	dave.jiang, dave, jonathan.cameron, ira.weiny, vishal.l.verma,
	alucerop, acurrid, cjia, smitra, ankita, aniketa, kwankhede,
	targupta, zhiwang

On Fri, 20 Sep 2024 15:34:42 -0700
Zhi Wang <zhiw@nvidia.com> wrote:

> CXL device programming interfaces are built upon PCI interfaces. Thus
> the vfio-pci-core can be leveraged to handle a CXL device.
> 
> However, CXL device also has difference with PCI devicce:
> 
> - No INTX support, only MSI/MSIX is supported.
> - Resest is one via CXL reset. FLR only resets CXL.io.
> 
> Introduce the CXL device awareness to the vfio-pci-core. Expose a new
> VFIO device flags to the userspace to identify the VFIO device is a CXL
> device. Disable INTX support in the vfio-pci-core. Disable FLR reset for
> the CXL device as the kernel CXL core hasn't support CXL reset yet.
> Disable mmap support on the CXL MMIO BAR in vfio-pci-core.

Why are we disabling mmap on the entire BAR when we have sparse mmap
support to handle disabling mmap only on a portion of the BAR, which
seems to be the case needed here.  Am I mistaken?
 
> Signed-off-by: Zhi Wang <zhiw@nvidia.com>
> ---
>  drivers/vfio/pci/vfio_cxl_core.c |  8 ++++++
>  drivers/vfio/pci/vfio_pci_core.c | 42 +++++++++++++++++++++-----------
>  include/linux/vfio_pci_core.h    |  2 ++
>  include/uapi/linux/vfio.h        |  1 +
>  4 files changed, 39 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_cxl_core.c b/drivers/vfio/pci/vfio_cxl_core.c
> index bbb968cb1b70..d8b51f8792a2 100644
> --- a/drivers/vfio/pci/vfio_cxl_core.c
> +++ b/drivers/vfio/pci/vfio_cxl_core.c
> @@ -391,6 +391,8 @@ int vfio_cxl_core_enable(struct vfio_pci_core_device *core_dev)
>  	if (ret)
>  		return ret;
>  
> +	vfio_pci_core_enable_cxl(core_dev);
> +
>  	ret = vfio_pci_core_enable(core_dev);
>  	if (ret)
>  		goto err_pci_core_enable;
> @@ -618,6 +620,12 @@ ssize_t vfio_cxl_core_write(struct vfio_device *core_vdev, const char __user *bu
>  }
>  EXPORT_SYMBOL_GPL(vfio_cxl_core_write);
>  
> +void vfio_pci_core_enable_cxl(struct vfio_pci_core_device *core_dev)
> +{
> +	core_dev->has_cxl = true;
> +}
> +EXPORT_SYMBOL(vfio_pci_core_enable_cxl);
> +
>  MODULE_LICENSE("GPL");
>  MODULE_AUTHOR(DRIVER_AUTHOR);
>  MODULE_DESCRIPTION(DRIVER_DESC);
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 9373942f1acb..e0f23b538858 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -126,6 +126,9 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_core_device *vdev)
>  		if (!(res->flags & IORESOURCE_MEM))
>  			goto no_mmap;
>  
> +		if (vdev->has_cxl && bar == vdev->cxl.comp_reg_bar)
> +			goto no_mmap;
> +
>  		/*
>  		 * The PCI core shouldn't set up a resource with a
>  		 * type but zero size. But there may be bugs that
> @@ -487,10 +490,15 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
>  	if (ret)
>  		goto out_power;
>  
> -	/* If reset fails because of the device lock, fail this path entirely */
> -	ret = pci_try_reset_function(pdev);
> -	if (ret == -EAGAIN)
> -		goto out_disable_device;
> +	if (!vdev->has_cxl) {
> +		/* If reset fails because of the device lock, fail this path entirely */
> +		ret = pci_try_reset_function(pdev);
> +		if (ret == -EAGAIN)
> +			goto out_disable_device;
> +	} else {
> +		/* CXL Reset is missing in CXL core. FLR only resets CXL.io path. */
> +		ret = -ENODEV;
> +	}

Seems like this should perhaps be a prerequisite to exposing the device
to userspace, otherwise how is the state sanitized between uses?

>  
>  	vdev->reset_works = !ret;
>  	pci_save_state(pdev);
> @@ -498,14 +506,17 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
>  	if (!vdev->pci_saved_state)
>  		pci_dbg(pdev, "%s: Couldn't store saved state\n", __func__);
>  
> -	if (likely(!nointxmask)) {
> -		if (vfio_pci_nointx(pdev)) {
> -			pci_info(pdev, "Masking broken INTx support\n");
> -			vdev->nointx = true;
> -			pci_intx(pdev, 0);
> -		} else
> -			vdev->pci_2_3 = pci_intx_mask_supported(pdev);
> -	}
> +	if (!vdev->has_cxl) {
> +		if (likely(!nointxmask)) {
> +			if (vfio_pci_nointx(pdev)) {
> +				pci_info(pdev, "Masking broken INTx support\n");
> +				vdev->nointx = true;
> +				pci_intx(pdev, 0);
> +			} else
> +				vdev->pci_2_3 = pci_intx_mask_supported(pdev);
> +		}
> +	} else
> +		vdev->nointx = true; /* CXL device doesn't have INTX. */
>  

Why do we need to do anything here?  nointx is for exposing a device
with INTx support as if it does not have INTx support.  If a CXL device
simply does not support INTx, like SR-IOV VFs, this is not necessary,
the interrupt pin register should be zero.  Is that not the case on a
CXL device?

>  	pci_read_config_word(pdev, PCI_COMMAND, &cmd);
>  	if (vdev->pci_2_3 && (cmd & PCI_COMMAND_INTX_DISABLE)) {
> @@ -541,7 +552,6 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
>  	if (!vfio_vga_disabled() && vfio_pci_is_vga(pdev))
>  		vdev->has_vga = true;
>  
> -

Gratuitous whitespace change.

>  	return 0;
>  
>  out_free_zdev:
> @@ -657,7 +667,8 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
>  	 * Disable INTx and MSI, presumably to avoid spurious interrupts
>  	 * during reset.  Stolen from pci_reset_function()
>  	 */
> -	pci_write_config_word(pdev, PCI_COMMAND, PCI_COMMAND_INTX_DISABLE);
> +	if (!vdev->nointx)
> +		pci_write_config_word(pdev, PCI_COMMAND, PCI_COMMAND_INTX_DISABLE);

No, this is not what nointx is for.  Regardless it should be a no-op on
a device that doesn't support INTx.

>  
>  	/*
>  	 * Try to get the locks ourselves to prevent a deadlock. The
> @@ -973,6 +984,9 @@ static int vfio_pci_ioctl_get_info(struct vfio_pci_core_device *vdev,
>  	if (vdev->reset_works)
>  		info.flags |= VFIO_DEVICE_FLAGS_RESET;
>  
> +	if (vdev->has_cxl)
> +		info.flags |= VFIO_DEVICE_FLAGS_CXL;
> +

So the proposal is that a vfio-cxl device will expose *both* PCI and
CXL device compatibility flags?  That means existing userspace
expecting only a PCI device will try to make use of this.  Shouldn't
the device be bound to vfio-pci rather than a vfio-cxl driver for that,
if it's even valid?

>  	info.num_regions = VFIO_PCI_NUM_REGIONS + vdev->num_regions;
>  	info.num_irqs = VFIO_PCI_NUM_IRQS;
>  
> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
> index 9d295ca9382a..e5646aad3eb3 100644
> --- a/include/linux/vfio_pci_core.h
> +++ b/include/linux/vfio_pci_core.h
> @@ -113,6 +113,7 @@ struct vfio_pci_core_device {
>  	bool			needs_pm_restore:1;
>  	bool			pm_intx_masked:1;
>  	bool			pm_runtime_engaged:1;
> +	bool			has_cxl:1;

I wonder if has_cxl is based on has_vga, I would have expected is_cxl.
A PCI device that supports VGA is a much more discrete add-on, versus
my limited understanding of CXL is that it is a CXL device where PCI is
mostly just the configuration and enumeration interface, ie. CXL.io.

>  	struct pci_saved_state	*pci_saved_state;
>  	struct pci_saved_state	*pm_save;
>  	int			ioeventfds_nr;
> @@ -208,5 +209,6 @@ ssize_t vfio_cxl_core_read(struct vfio_device *core_vdev, char __user *buf,
>  			   size_t count, loff_t *ppos);
>  ssize_t vfio_cxl_core_write(struct vfio_device *core_vdev, const char __user *buf,
>  			    size_t count, loff_t *ppos);
> +void vfio_pci_core_enable_cxl(struct vfio_pci_core_device *core_dev);
>  
>  #endif /* VFIO_PCI_CORE_H */
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 71f766c29060..0895183feaac 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -214,6 +214,7 @@ struct vfio_device_info {
>  #define VFIO_DEVICE_FLAGS_FSL_MC (1 << 6)	/* vfio-fsl-mc device */
>  #define VFIO_DEVICE_FLAGS_CAPS	(1 << 7)	/* Info supports caps */
>  #define VFIO_DEVICE_FLAGS_CDX	(1 << 8)	/* vfio-cdx device */
> +#define VFIO_DEVICE_FLAGS_CXL	(1 << 9)	/* Device supports CXL support */

Comment wording.  Thanks,

Alex

>  	__u32	num_regions;	/* Max region index + 1 */
>  	__u32	num_irqs;	/* Max IRQ index + 1 */
>  	__u32   cap_offset;	/* Offset within info struct of first cap */


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 10/13] vfio/pci: emulate CXL DVSEC registers in the configuration space
  2024-09-20 22:34 ` [RFC 10/13] vfio/pci: emulate CXL DVSEC registers in the configuration space Zhi Wang
@ 2024-10-11 21:02   ` Alex Williamson
  0 siblings, 0 replies; 38+ messages in thread
From: Alex Williamson @ 2024-10-11 21:02 UTC (permalink / raw)
  To: Zhi Wang
  Cc: kvm, linux-cxl, kevin.tian, jgg, alison.schofield, dan.j.williams,
	dave.jiang, dave, jonathan.cameron, ira.weiny, vishal.l.verma,
	alucerop, acurrid, cjia, smitra, ankita, aniketa, kwankhede,
	targupta, zhiwang

On Fri, 20 Sep 2024 15:34:43 -0700
Zhi Wang <zhiw@nvidia.com> wrote:

> A CXL device has many DVSEC registers in the configuration space for
> device control and enumeration. E.g. enable CXL.mem/CXL.cahce.
> 
> However, the kernel CXL core owns those registers to control the device.
> Thus, the VM is forbidden to touch the physical device control registers.
> 
> Read/write the CXL DVSEC from/to the virt configuration space.
> 
> Signed-off-by: Zhi Wang <zhiw@nvidia.com>
> ---
>  drivers/vfio/pci/vfio_pci_config.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
> index 98f3ac2d305c..af8c0997c796 100644
> --- a/drivers/vfio/pci/vfio_pci_config.c
> +++ b/drivers/vfio/pci/vfio_pci_config.c
> @@ -1902,6 +1902,15 @@ static ssize_t vfio_config_do_rw(struct vfio_pci_core_device *vdev, char __user
>  
>  			perm = &ecap_perms[cap_id];
>  			cap_start = vfio_find_cap_start(vdev, *ppos);
> +
> +			if (cap_id == PCI_EXT_CAP_ID_DVSEC) {
> +				u32 dword;

This should be an __le32 and we should use an le32_to_cpu before
comparison to PCI_VENDOR_ID_CXL.

> +
> +				memcpy(&dword, vdev->vconfig + cap_start + PCI_DVSEC_HEADER1, 4);
> +
> +				if (PCI_DVSEC_HEADER1_VID(dword) == PCI_VENDOR_ID_CXL)
> +					perm = &virt_perms;

We're making an assumption here that all CXL defined DVSEC capabilities
will have the same behavior.  Also, should we bother to expose an
emulated, dummy capability, or should we expect the VMM to handle
emulating it?  Doesn't the virt_perms allow the entire capability,
including headers to be writable?  Thanks,

Alex

> +			}
>  		} else {
>  			WARN_ON(cap_id > PCI_CAP_ID_MAX);
>  


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 11/13] vfio/cxl: introduce VFIO CXL device cap
  2024-09-20 22:34 ` [RFC 11/13] vfio/cxl: introduce VFIO CXL device cap Zhi Wang
@ 2024-10-11 21:14   ` Alex Williamson
  0 siblings, 0 replies; 38+ messages in thread
From: Alex Williamson @ 2024-10-11 21:14 UTC (permalink / raw)
  To: Zhi Wang
  Cc: kvm, linux-cxl, kevin.tian, jgg, alison.schofield, dan.j.williams,
	dave.jiang, dave, jonathan.cameron, ira.weiny, vishal.l.verma,
	alucerop, acurrid, cjia, smitra, ankita, aniketa, kwankhede,
	targupta, zhiwang

On Fri, 20 Sep 2024 15:34:44 -0700
Zhi Wang <zhiw@nvidia.com> wrote:

> The userspace needs CXL device information, e.g. HDM decoder registers
> offset to know when the VM updates the HDM decoder and re-build the
> mapping between GPA in the virtual HDM decoder base registers and the
> HPA of the CXL region created by the vfio-cxl-core when initialize the
> CXL device.
> 
> To acheive this, a new VFIO CXL device cap is required to convey those
> information to the usersapce.
> 
> Introduce a new VFIO CXL device cap to expose necessary information to
> the userspace. Initialize the cap with the information filled when the
> CXL device is being initialized. vfio-pci-core fills the CXL cap into
> the caps returned to userapce when CXL is enabled.
> 
> Signed-off-by: Zhi Wang <zhiw@nvidia.com>
> ---
>  drivers/vfio/pci/vfio_cxl_core.c | 15 +++++++++++++++
>  drivers/vfio/pci/vfio_pci_core.c | 19 ++++++++++++++++++-
>  include/linux/vfio_pci_core.h    |  1 +
>  include/uapi/linux/vfio.h        | 10 ++++++++++
>  4 files changed, 44 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/pci/vfio_cxl_core.c b/drivers/vfio/pci/vfio_cxl_core.c
> index d8b51f8792a2..cebc444b54b7 100644
> --- a/drivers/vfio/pci/vfio_cxl_core.c
> +++ b/drivers/vfio/pci/vfio_cxl_core.c
> @@ -367,6 +367,19 @@ static int setup_virt_comp_regs(struct vfio_pci_core_device *core_dev)
>  	return 0;
>  }
>  
> +static void init_vfio_cxl_cap(struct vfio_pci_core_device *core_dev)
> +{
> +	struct vfio_cxl *cxl = &core_dev->cxl;
> +
> +	cxl->cap.header.id = VFIO_DEVICE_INFO_CAP_CXL;
> +	cxl->cap.header.version = 1;
> +	cxl->cap.hdm_count = cxl->hdm_count;
> +	cxl->cap.hdm_reg_offset = cxl->hdm_reg_offset;
> +	cxl->cap.hdm_reg_size = cxl->hdm_reg_size;
> +	cxl->cap.hdm_reg_bar_index = cxl->comp_reg_bar;
> +	cxl->cap.dpa_size = cxl->dpa_size;
> +}
> +
>  int vfio_cxl_core_enable(struct vfio_pci_core_device *core_dev)
>  {
>  	struct vfio_cxl *cxl = &core_dev->cxl;
> @@ -401,6 +414,8 @@ int vfio_cxl_core_enable(struct vfio_pci_core_device *core_dev)
>  	if (ret)
>  		goto err_enable_cxl_device;
>  
> +	init_vfio_cxl_cap(core_dev);
> +
>  	flags = VFIO_REGION_INFO_FLAG_READ |
>  		VFIO_REGION_INFO_FLAG_WRITE |
>  		VFIO_REGION_INFO_FLAG_MMAP;
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index e0f23b538858..47e65e28a42b 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -963,6 +963,15 @@ static int vfio_pci_info_atomic_cap(struct vfio_pci_core_device *vdev,
>  	return vfio_info_add_capability(caps, &cap.header, sizeof(cap));
>  }
>  
> +static int vfio_pci_info_cxl_cap(struct vfio_pci_core_device *vdev,
> +				 struct vfio_info_cap *caps)
> +{
> +	struct vfio_cxl *cxl = &vdev->cxl;
> +
> +	return vfio_info_add_capability(caps, &cxl->cap.header,
> +					sizeof(cxl->cap));
> +}
> +
>  static int vfio_pci_ioctl_get_info(struct vfio_pci_core_device *vdev,
>  				   struct vfio_device_info __user *arg)
>  {
> @@ -984,9 +993,17 @@ static int vfio_pci_ioctl_get_info(struct vfio_pci_core_device *vdev,
>  	if (vdev->reset_works)
>  		info.flags |= VFIO_DEVICE_FLAGS_RESET;
>  
> -	if (vdev->has_cxl)
> +	if (vdev->has_cxl) {
>  		info.flags |= VFIO_DEVICE_FLAGS_CXL;
>  
> +		ret = vfio_pci_info_cxl_cap(vdev, &caps);
> +		if (ret) {
> +			pci_warn(vdev->pdev,
> +				 "Failed to setup CXL capabilities\n");
> +			return ret;
> +		}
> +	}
> +
>  	info.num_regions = VFIO_PCI_NUM_REGIONS + vdev->num_regions;
>  	info.num_irqs = VFIO_PCI_NUM_IRQS;
>  
> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
> index e5646aad3eb3..d79f7a91d977 100644
> --- a/include/linux/vfio_pci_core.h
> +++ b/include/linux/vfio_pci_core.h
> @@ -80,6 +80,7 @@ struct vfio_cxl {
>  	struct resource ram_res;
>  
>  	struct vfio_cxl_region region;
> +	struct vfio_device_info_cap_cxl cap;

We should create this dynamically on request, the device info ioctl is
not a fast path.  Thanks,

Alex

>  };
>  
>  struct vfio_pci_core_device {
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 0895183feaac..9a5972961280 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -257,6 +257,16 @@ struct vfio_device_info_cap_pci_atomic_comp {
>  	__u32 reserved;
>  };
>  
> +#define VFIO_DEVICE_INFO_CAP_CXL		6
> +struct vfio_device_info_cap_cxl {
> +	struct vfio_info_cap_header header;
> +	__u8 hdm_count;
> +	__u8 hdm_reg_bar_index;
> +	__u64 hdm_reg_size;
> +	__u64 hdm_reg_offset;
> +	__u64 dpa_size;
> +};
> +
>  /**
>   * VFIO_DEVICE_GET_REGION_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 8,
>   *				       struct vfio_region_info)


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 02/13] cxl: introduce cxl_get_hdm_info()
  2024-09-20 22:34 ` [RFC 02/13] cxl: introduce cxl_get_hdm_info() Zhi Wang
@ 2024-10-17 15:44   ` Jonathan Cameron
  2024-10-19  5:38     ` Zhi Wang
  0 siblings, 1 reply; 38+ messages in thread
From: Jonathan Cameron @ 2024-10-17 15:44 UTC (permalink / raw)
  To: Zhi Wang
  Cc: kvm, linux-cxl, alex.williamson, kevin.tian, jgg,
	alison.schofield, dan.j.williams, dave.jiang, dave, ira.weiny,
	vishal.l.verma, alucerop, acurrid, cjia, smitra, ankita, aniketa,
	kwankhede, targupta, zhiwang

On Fri, 20 Sep 2024 15:34:35 -0700
Zhi Wang <zhiw@nvidia.com> wrote:

> CXL core has the information of what CXL register groups a device has.
> When initializing the device, the CXL core probes the register groups
> and saves the information. The probing sequence is quite complicated.
> 
> vfio-cxl requires the HDM register information to emualte the HDM decoder
Hi Zhi,

I know these were a bit rushed out so I'll only comment once.
Give your patch descriptions a spell check (I always forget :)
emulate

> registers.
> 
> Introduce cxl_get_hdm_info() for vfio-cxl to leverage the HDM
> register information in the CXL core. Thus, it doesn't need to implement
> its own probing sequence.
> 
> Signed-off-by: Zhi Wang <zhiw@nvidia.com>
> ---
>  drivers/cxl/core/pci.c        | 28 ++++++++++++++++++++++++++++
>  drivers/cxl/cxlpci.h          |  3 +++
>  include/linux/cxl_accel_mem.h |  2 ++
>  3 files changed, 33 insertions(+)
> 
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index a663e7566c48..7b6c2b6211b3 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -502,6 +502,34 @@ int cxl_hdm_decode_init(struct cxl_dev_state *cxlds, struct cxl_hdm *cxlhdm,
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_hdm_decode_init, CXL);
>  
> +int cxl_get_hdm_info(struct cxl_dev_state *cxlds, u32 *hdm_count,
> +		     u64 *hdm_reg_offset, u64 *hdm_reg_size)
> +{
> +	struct pci_dev *pdev = to_pci_dev(cxlds->dev);
> +	int d = cxlds->cxl_dvsec;
> +	u16 cap;
> +	int rc;
> +
> +	if (!cxlds->reg_map.component_map.hdm_decoder.valid) {
> +		*hdm_reg_offset = *hdm_reg_size = 0;
Probably want to zero out the hdm_count as well?

> +	} else {
> +		struct cxl_component_reg_map *map =
> +			&cxlds->reg_map.component_map;
> +
> +		*hdm_reg_offset = map->hdm_decoder.offset;
> +		*hdm_reg_size = map->hdm_decoder.size;
> +	}
> +
> +	rc = pci_read_config_word(pdev,
> +				  d + CXL_DVSEC_CAP_OFFSET, &cap);
> +	if (rc)
> +		return rc;
> +
> +	*hdm_count = FIELD_GET(CXL_DVSEC_HDM_COUNT_MASK, cap);
> +	return 0;
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_get_hdm_info, CXL);

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 00/13] vfio: introduce vfio-cxl to support CXL type-2 accelerator passthrough
  2024-10-04 11:40         ` Jonathan Cameron
@ 2024-10-19  5:30           ` Zhi Wang
  2024-10-21 11:07             ` Alejandro Lucero Palau
  0 siblings, 1 reply; 38+ messages in thread
From: Zhi Wang @ 2024-10-19  5:30 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Tian, Kevin, kvm@vger.kernel.org, linux-cxl@vger.kernel.org,
	alex.williamson@redhat.com, Jason Gunthorpe, Schofield, Alison,
	Williams, Dan J, Jiang, Dave, dave@stgolabs.net, Weiny, Ira,
	Verma, Vishal L, alucerop@amd.com, Andy Currid, Neo Jia,
	Surath Mitra, Ankit Agrawal, Aniket Agashe, Kirti Wankhede,
	Tarun Gupta (SW-GPU), zhiwang@kernel.org

On 04/10/2024 14.40, Jonathan Cameron wrote:
> External email: Use caution opening links or attachments
> 
> 
>>>
>>>>
>>>> Presumably, the host creates one large CXL region that covers the entire
>>>> DPA, while QEMU can virtually partition it into different regions and
>>>> map them to different virtual CXL region if QEMU presents multiple HDM
>>>> decoders to the guest.
>>>
>>> I'm not sure why it would do that. Can't think why you'd break up
>>> a host region - maybe I'm missing something.
>>>
>>
>> It is mostly concerning about a device can have multiple HDM decoders.
>> In the current design, a large physical CXL (pCXL) region with the whole
>> DPA will be passed to the userspace. Thinking that the guest will see
>> the virtual multiple HDM decoders, which usually SW is asking for, the
>> guest SW might create multiple virtual CXL regions. In that case QEMU
>> needs to map them into different regions of the pCXL region.
> 
> Don't let the guest see multiple HDM decoders?
> 
> There is no obvious reason why it would want them other than type
> differences.
> 
> Why is it useful for a type 2 device to be setup for multiple CXL regions?
> It shouldn't be a performance thing. Might be convenient for management
> I guess, but the driver can layer it's own allocator etc on top of a single
> region so I'm not sure I see a reason to do this...
> 

Sorry for the late reply as I were confirming the this requirement with 
folks. It make sense to have only one HDM decoder for the guest CXL 
type-2 device driver. I think it is similar to efx_cxl according to the 
code. Alejandro, it would be nice you can confirm this.

Thanks,
Zhi.
> Jonathan
> 
> 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 02/13] cxl: introduce cxl_get_hdm_info()
  2024-10-17 15:44   ` Jonathan Cameron
@ 2024-10-19  5:38     ` Zhi Wang
  0 siblings, 0 replies; 38+ messages in thread
From: Zhi Wang @ 2024-10-19  5:38 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: kvm@vger.kernel.org, linux-cxl@vger.kernel.org,
	alex.williamson@redhat.com, kevin.tian@intel.com, Jason Gunthorpe,
	alison.schofield@intel.com, dan.j.williams@intel.com,
	dave.jiang@intel.com, dave@stgolabs.net, ira.weiny@intel.com,
	vishal.l.verma@intel.com, alucerop@amd.com, Andy Currid, Neo Jia,
	Surath Mitra, Ankit Agrawal, Aniket Agashe, Kirti Wankhede,
	Tarun Gupta (SW-GPU), zhiwang@kernel.org

On 17/10/2024 18.44, Jonathan Cameron wrote:
> External email: Use caution opening links or attachments
> 
> 
> On Fri, 20 Sep 2024 15:34:35 -0700
> Zhi Wang <zhiw@nvidia.com> wrote:
> 
>> CXL core has the information of what CXL register groups a device has.
>> When initializing the device, the CXL core probes the register groups
>> and saves the information. The probing sequence is quite complicated.
>>
>> vfio-cxl requires the HDM register information to emualte the HDM decoder
> Hi Zhi,
> 
> I know these were a bit rushed out so I'll only comment once.
> Give your patch descriptions a spell check (I always forget :)
> emulate
> 

Thanks for pointing it out. Will proceed. Lol. Me the same. I write a 
script of sending patches that ask for confirming every item was in the 
checklist now.

>> registers.
>>
>> Introduce cxl_get_hdm_info() for vfio-cxl to leverage the HDM
>> register information in the CXL core. Thus, it doesn't need to implement
>> its own probing sequence.
>>
>> Signed-off-by: Zhi Wang <zhiw@nvidia.com>
>> ---
>>   drivers/cxl/core/pci.c        | 28 ++++++++++++++++++++++++++++
>>   drivers/cxl/cxlpci.h          |  3 +++
>>   include/linux/cxl_accel_mem.h |  2 ++
>>   3 files changed, 33 insertions(+)
>>
>> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
>> index a663e7566c48..7b6c2b6211b3 100644
>> --- a/drivers/cxl/core/pci.c
>> +++ b/drivers/cxl/core/pci.c
>> @@ -502,6 +502,34 @@ int cxl_hdm_decode_init(struct cxl_dev_state *cxlds, struct cxl_hdm *cxlhdm,
>>   }
>>   EXPORT_SYMBOL_NS_GPL(cxl_hdm_decode_init, CXL);
>>
>> +int cxl_get_hdm_info(struct cxl_dev_state *cxlds, u32 *hdm_count,
>> +                  u64 *hdm_reg_offset, u64 *hdm_reg_size)
>> +{
>> +     struct pci_dev *pdev = to_pci_dev(cxlds->dev);
>> +     int d = cxlds->cxl_dvsec;
>> +     u16 cap;
>> +     int rc;
>> +
>> +     if (!cxlds->reg_map.component_map.hdm_decoder.valid) {
>> +             *hdm_reg_offset = *hdm_reg_size = 0;
> Probably want to zero out the hdm_count as well?
> 
>> +     } else {
>> +             struct cxl_component_reg_map *map =
>> +                     &cxlds->reg_map.component_map;
>> +
>> +             *hdm_reg_offset = map->hdm_decoder.offset;
>> +             *hdm_reg_size = map->hdm_decoder.size;
>> +     }
>> +
>> +     rc = pci_read_config_word(pdev,
>> +                               d + CXL_DVSEC_CAP_OFFSET, &cap);
>> +     if (rc)
>> +             return rc;
>> +
>> +     *hdm_count = FIELD_GET(CXL_DVSEC_HDM_COUNT_MASK, cap);
>> +     return 0;
>> +}
>> +EXPORT_SYMBOL_NS_GPL(cxl_get_hdm_info, CXL);


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 00/13] vfio: introduce vfio-cxl to support CXL type-2 accelerator passthrough
  2024-09-20 22:34 [RFC 00/13] vfio: introduce vfio-cxl to support CXL type-2 accelerator passthrough Zhi Wang
                   ` (14 preceding siblings ...)
  2024-09-25 10:11 ` Alejandro Lucero Palau
@ 2024-10-21 10:49 ` Zhi Wang
  2024-10-21 13:10   ` Alejandro Lucero Palau
  2024-10-30 11:56 ` Zhi Wang
  16 siblings, 1 reply; 38+ messages in thread
From: Zhi Wang @ 2024-10-21 10:49 UTC (permalink / raw)
  To: kvm@vger.kernel.org, linux-cxl@vger.kernel.org
  Cc: alex.williamson@redhat.com, kevin.tian@intel.com, Jason Gunthorpe,
	alison.schofield@intel.com, dan.j.williams@intel.com,
	dave.jiang@intel.com, dave@stgolabs.net,
	jonathan.cameron@huawei.com, ira.weiny@intel.com,
	vishal.l.verma@intel.com, alucerop@amd.com, Andy Currid, Neo Jia,
	Surath Mitra, Ankit Agrawal, Aniket Agashe, Kirti Wankhede,
	Tarun Gupta (SW-GPU), zhiwang@kernel.org

On 21/09/2024 1.34, Zhi Wang wrote:

Hi folks:

Thanks so much for the comments and discussions in the mail and 
collaboration meeting. Here are the update of the major opens raised and 
conclusion/next steps:

1) It is not necessary to support the multiple virtual HDM decoders for 
CXL type-2 device. (Jonathan)

Was asking SW folks around about the requirement of multiple HDM 
decoders in a CXL type-2 device driver. It seems one is enough, which is 
reasonable, because the CXL region created by the type-2 device driver 
is mostly kept for its own private use.

2) Pre-created vs post-created CXL region for the guest. 
(Dan/Kevin/Alejandro)

There has been a discussion about when to create CXL region for the guest.

Option a: The pCXL region is pre-created before VM boots. When a guest 
creates the CXL region via its virtual HDM decoder, QEMU maps the pCXL 
region to the virtual CXL region configured by the guest. Changes and 
re-configuration of the pCXL region is not expected.

Option b: The pCXL region will be (re)created when a guest creates the 
CXL region via its virtual HDM decoder. QEMU traps the HDM decoder 
commits, triggers the pCXL region creation, maps the pCXL to the virtual 
CXL region.

Alejandro (option b):
- Will write a doc to elaborate the problem of CXL.cache and why option 
b should be chosen.

Kevin (option b):
- CXL region is a SW concept, it should be controlled by the guest SW.

Dan (option a):
- Error handling when creating the pCXL region at runtime. E.g. 
Available HPA in the FWMS in the host is running out when creating the 
pCXL region
- CXL.cache might need extra handling which cannot be done at runtime. 
(Need to check Alejandro's doc)

Next step:

- Will check with Alejandro and start from his doc about CXL.cache concerns.

3) Is this exclusively a type2 extension or how do you envision type1/3
devices with vfio? (Alex)

For type-3 device passthrough, due to its nature of memory expander, CXL 
folks have decided to use either virtio-mem or another stub driver in 
QEMU to manage/map the memory to the guest.

For type-1 device, I am not aware of any type-1 device in the market. 
Dan commented in the CXL discord group:

"my understanding is that some of the CXL FPGA kits offer Type-1 flows, 
but those are for custom solutions not open-market. I am aware of some 
private deployments of such hardware, but nothing with an upstream driver."

My take is that we don't need to consider support type-1 device 
passthrough so far.

Z.

> Hi folks:
> 
> As promised in the LPC, here are all you need (patches, repos, guiding
> video, kernel config) to build a environment to test the vfio-cxl-core.
> 
> Thanks so much for the discussions! Enjoy and see you in the next one.
> 
> Background
> ==========
> 
> Compute Express Link (CXL) is an open standard interconnect built upon
> industrial PCI layers to enhance the performance and efficiency of data
> centers by enabling high-speed, low-latency communication between CPUs
> and various types of devices such as accelerators, memory.
> 
> It supports three key protocols: CXL.io as the control protocol, CXL.cache
> as the cache-coherent host-device data transfer protocol, and CXL.mem as
> memory expansion protocol. CXL Type 2 devices leverage the three protocols
> to seamlessly integrate with host CPUs, providing a unified and efficient
> interface for high-speed data transfer and memory sharing. This integration
> is crucial for heterogeneous computing environments where accelerators,
> such as GPUs, and other specialized processors, are used to handle
> intensive workloads.
> 
> Goal
> ====
> 
> Although CXL is built upon the PCI layers, passing a CXL type-2 device can
> be different than PCI devices according to CXL specification[1]:
> 
> - CXL type-2 device initialization. CXL type-2 device requires an
> additional initialization sequence besides the PCI device initialization.
> CXL type-2 device initialization can be pretty complicated due to its
> hierarchy of register interfaces. Thus, a standard CXL type-2 driver
> initialization sequence provided by the kernel CXL core is used.
> 
> - Create a CXL region and map it to the VM. A mapping between HPA and DPA
> (Device PA) needs to be created to access the device memory directly. HDM
> decoders in the CXL topology need to be configured level by level to
> manage the mapping. After the region is created, it needs to be mapped to
> GPA in the virtual HDM decoders configured by the VM.
> 
> - CXL reset. The CXL device reset is different from the PCI device reset.
> A CXL reset sequence is introduced by the CXL spec.
> 
> - Emulating CXL DVSECs. CXL spec defines a set of DVSECs registers in the
> configuration for device enumeration and device control. (E.g. if a device
> is capable of CXL.mem CXL.cache, enable/disable capability) They are owned
> by the kernel CXL core, and the VM can not modify them.
> 
> - Emulate CXL MMIO registers. CXL spec defines a set of CXL MMIO registers
> that can sit in a PCI BAR. The location of register groups sit in the PCI
> BAR is indicated by the register locator in the CXL DVSECs. They are also
> owned by the kernel CXL core. Some of them need to be emulated.
> 
> Design
> ======
> 
> To achieve the purpose above, the vfio-cxl-core is introduced to host the
> common routines that variant driver requires for device passthrough.
> Similar with the vfio-pci-core, the vfio-cxl-core provides common
> routines of vfio_device_ops for the variant driver to hook and perform the
> CXL routines behind it.
> 
> Besides, several extra APIs are introduced for the variant driver to
> provide the necessary information the kernel CXL core to initialize
> the CXL device. E.g., Device DPA.
> 
> CXL is built upon the PCI layers but with differences. Thus, the
> vfio-pci-core is aimed to be re-used as much as possible with the
> awareness of operating on a CXL device.
> 
> A new VFIO device region is introduced to expose the CXL region to the
> userspace. A new CXL VFIO device cap has also been introduced to convey
> the necessary CXL device information to the userspace.
> 
> Patches
> =======
> 
> The patches are based on the cxl-type2 support RFCv2 patchset[2]. Will
> rebase them to V3 once the cxl-type2 support v3 patch review is done.
> 
> PATCH 1-3: Expose the necessary routines required by vfio-cxl.
> 
> PATCH 4: Introduce the preludes of vfio-cxl, including CXL device
> initialization, CXL region creation.
> 
> PATCH 5: Expose the CXL region to the userspace.
> 
> PATCH 6-7: Prepare to emulate the HDM decoder registers.
> 
> PATCH 8: Emulate the HDM decoder registers.
> 
> PATCH 9: Tweak vfio-cxl to be aware of working on a CXL device.
> 
> PATCH 10: Tell vfio-pci-core to emulate CXL DVSECs.
> 
> PATCH 11: Expose the CXL device information that userspace needs.
> 
> PATCH 12: An example variant driver to demonstrate the usage of
> vfio-cxl-core from the perspective of the VFIO variant driver.
> 
> PATCH 13: A workaround needs suggestions.
> 
> Test
> ====
> 
> To test the patches and hack around, a virtual passthrough with nested
> virtualization approach is used.
> 
> The host QEMU emulates a CXL type-2 accel device based on Ira's patches
> with the changes to emulate HDM decoders.
> 
> While running the vfio-cxl in the L1 guest, an example VFIO variant
> driver is used to attach with the QEMU CXL access device.
> 
> The L2 guest can be booted via the QEMU with the vfio-cxl support in the
> VFIOStub.
> 
> In the L2 guest, a dummy CXL device driver is provided to attach to the
> virtual pass-thru device.
> 
> The dummy CXL type-2 device driver can successfully be loaded with the
> kernel cxl core type2 support, create CXL region by requesting the CXL
> core to allocate HPA and DPA and configure the HDM decoders.
> 
> To make sure everyone can test the patches, the kernel config of L1 and
> L2 are provided in the repos, the required kernel command params and
> qemu command line can be found from the demostration video.[5]
> 
> Repos
> =====
> 
> QEMU host: https://github.com/zhiwang-nvidia/qemu/tree/zhi/vfio-cxl-qemu-host
> L1 Kernel: https://github.com/zhiwang-nvidia/linux/tree/zhi/vfio-cxl-l1-kernel-rfc
> L1 QEMU: https://github.com/zhiwang-nvidia/qemu/tree/zhi/vfio-cxl-qemu-l1-rfc
> L2 Kernel: https://github.com/zhiwang-nvidia/linux/tree/zhi/vfio-cxl-l2
> 
> [1] https://computeexpresslink.org/cxl-specification/
> [2] https://lore.kernel.org/netdev/20240715172835.24757-1-alejandro.lucero-palau@amd.com/T/
> [3] https://patchew.org/QEMU/20230517-rfc-type2-dev-v1-0-6eb2e470981b@intel.com/
> [4] https://youtu.be/zlk_ecX9bxs?si=hc8P58AdhGXff3Q7
> 
> Feedback expected
> =================
> 
> - Archtiecture level between vfio-pci-core and vfio-cxl-core.
> - Variant driver requirements from more hardware vendors.
> - vfio-cxl-core UABI to QEMU.
> 
> Moving foward
> =============
> 
> - Rebase the patches on top of Alejandro's PATCH v3.
> - Get Ira's type-2 emulated device patch into upstream as CXL folks and RH
>    folks both came to talk and expect this. I had a chat with Ira and he
>    expected me to take it over. Will start a discussion in the CXL discord
>    group for the desgin of V1.
> - Sparse map in vfio-cxl-core.
> 
> Known issues
> ============
> 
> - Teardown path. Missing teardown paths have been implements in Alejandor's
>    PATCH v3. It should be solved after the rebase.
> 
> - Powerdown L1 guest instead of reboot it. The QEMU reset handler is missing
>    in the Ira's patch. When rebooting L1, many CXL registers are not reset.
>    This will be addressed in the formal review of emulated CXL type-2 device
>    support.
> 
> Zhi Wang (13):
>    cxl: allow a type-2 device not to have memory device registers
>    cxl: introduce cxl_get_hdm_info()
>    cxl: introduce cxl_find_comp_reglock_offset()
>    vfio: introduce vfio-cxl core preludes
>    vfio/cxl: expose CXL region to the usersapce via a new VFIO device
>      region
>    vfio/pci: expose vfio_pci_rw()
>    vfio/cxl: introduce vfio_cxl_core_{read, write}()
>    vfio/cxl: emulate HDM decoder registers
>    vfio/pci: introduce CXL device awareness
>    vfio/pci: emulate CXL DVSEC registers in the configuration space
>    vfio/cxl: introduce VFIO CXL device cap
>    vfio/cxl: VFIO variant driver for QEMU CXL accel device
>    vfio/cxl: workaround: don't take resource region when cxl is enabled.
> 
>   drivers/cxl/core/pci.c              |  28 ++
>   drivers/cxl/core/regs.c             |  22 +
>   drivers/cxl/cxl.h                   |   1 +
>   drivers/cxl/cxlpci.h                |   3 +
>   drivers/cxl/pci.c                   |  14 +-
>   drivers/vfio/pci/Kconfig            |   6 +
>   drivers/vfio/pci/Makefile           |   5 +
>   drivers/vfio/pci/cxl-accel/Kconfig  |   6 +
>   drivers/vfio/pci/cxl-accel/Makefile |   3 +
>   drivers/vfio/pci/cxl-accel/main.c   | 116 +++++
>   drivers/vfio/pci/vfio_cxl_core.c    | 647 ++++++++++++++++++++++++++++
>   drivers/vfio/pci/vfio_pci_config.c  |  10 +
>   drivers/vfio/pci/vfio_pci_core.c    |  79 +++-
>   drivers/vfio/pci/vfio_pci_rdwr.c    |   8 +-
>   include/linux/cxl_accel_mem.h       |   3 +
>   include/linux/cxl_accel_pci.h       |   6 +
>   include/linux/vfio_pci_core.h       |  53 +++
>   include/uapi/linux/vfio.h           |  14 +
>   18 files changed, 992 insertions(+), 32 deletions(-)
>   create mode 100644 drivers/vfio/pci/cxl-accel/Kconfig
>   create mode 100644 drivers/vfio/pci/cxl-accel/Makefile
>   create mode 100644 drivers/vfio/pci/cxl-accel/main.c
>   create mode 100644 drivers/vfio/pci/vfio_cxl_core.c
> 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 00/13] vfio: introduce vfio-cxl to support CXL type-2 accelerator passthrough
  2024-10-19  5:30           ` Zhi Wang
@ 2024-10-21 11:07             ` Alejandro Lucero Palau
  0 siblings, 0 replies; 38+ messages in thread
From: Alejandro Lucero Palau @ 2024-10-21 11:07 UTC (permalink / raw)
  To: Zhi Wang, Jonathan Cameron
  Cc: Tian, Kevin, kvm@vger.kernel.org, linux-cxl@vger.kernel.org,
	alex.williamson@redhat.com, Jason Gunthorpe, Schofield, Alison,
	Williams, Dan J, Jiang, Dave, dave@stgolabs.net, Weiny, Ira,
	Verma, Vishal L, Andy Currid, Neo Jia, Surath Mitra,
	Ankit Agrawal, Aniket Agashe, Kirti Wankhede,
	Tarun Gupta (SW-GPU), zhiwang@kernel.org


On 10/19/24 06:30, Zhi Wang wrote:
> On 04/10/2024 14.40, Jonathan Cameron wrote:
>> External email: Use caution opening links or attachments
>>
>>
>>>>> Presumably, the host creates one large CXL region that covers the entire
>>>>> DPA, while QEMU can virtually partition it into different regions and
>>>>> map them to different virtual CXL region if QEMU presents multiple HDM
>>>>> decoders to the guest.
>>>> I'm not sure why it would do that. Can't think why you'd break up
>>>> a host region - maybe I'm missing something.
>>>>
>>> It is mostly concerning about a device can have multiple HDM decoders.
>>> In the current design, a large physical CXL (pCXL) region with the whole
>>> DPA will be passed to the userspace. Thinking that the guest will see
>>> the virtual multiple HDM decoders, which usually SW is asking for, the
>>> guest SW might create multiple virtual CXL regions. In that case QEMU
>>> needs to map them into different regions of the pCXL region.
>> Don't let the guest see multiple HDM decoders?
>>
>> There is no obvious reason why it would want them other than type
>> differences.
>>
>> Why is it useful for a type 2 device to be setup for multiple CXL regions?
>> It shouldn't be a performance thing. Might be convenient for management
>> I guess, but the driver can layer it's own allocator etc on top of a single
>> region so I'm not sure I see a reason to do this...
>>
> Sorry for the late reply as I were confirming the this requirement with
> folks. It make sense to have only one HDM decoder for the guest CXL
> type-2 device driver. I think it is similar to efx_cxl according to the
> code. Alejandro, it would be nice you can confirm this.


Not sure if how sfc does this should be taken as any confirmation of 
expected behavior/needs, but we plan to have just one HDM decoder and 
one region covering it all.


But maybe it is important to make clear the distinction between an HDM 
decoder and CXL regions linked to it. In other words, I can foresee 
different regions programmed by the driver because the way coherency is 
going to be handled by the device changes for each region. So supporting 
multiple regions makes sense with just one decoder. Not sure if we 
should rely on an use case for supporting more than one region ...


> Thanks,
> Zhi.
>> Jonathan
>>
>>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 00/13] vfio: introduce vfio-cxl to support CXL type-2 accelerator passthrough
  2024-10-21 10:49 ` Zhi Wang
@ 2024-10-21 13:10   ` Alejandro Lucero Palau
  0 siblings, 0 replies; 38+ messages in thread
From: Alejandro Lucero Palau @ 2024-10-21 13:10 UTC (permalink / raw)
  To: Zhi Wang, kvm@vger.kernel.org, linux-cxl@vger.kernel.org
  Cc: alex.williamson@redhat.com, kevin.tian@intel.com, Jason Gunthorpe,
	alison.schofield@intel.com, dan.j.williams@intel.com,
	dave.jiang@intel.com, dave@stgolabs.net,
	jonathan.cameron@huawei.com, ira.weiny@intel.com,
	vishal.l.verma@intel.com, Andy Currid, Neo Jia, Surath Mitra,
	Ankit Agrawal, Aniket Agashe, Kirti Wankhede,
	Tarun Gupta (SW-GPU), zhiwang@kernel.org

Hi Zhi,


Some comments below.


On 10/21/24 11:49, Zhi Wang wrote:
> On 21/09/2024 1.34, Zhi Wang wrote:
>
> Hi folks:
>
> Thanks so much for the comments and discussions in the mail and
> collaboration meeting. Here are the update of the major opens raised and
> conclusion/next steps:
>
> 1) It is not necessary to support the multiple virtual HDM decoders for
> CXL type-2 device. (Jonathan)
>
> Was asking SW folks around about the requirement of multiple HDM
> decoders in a CXL type-2 device driver. It seems one is enough, which is
> reasonable, because the CXL region created by the type-2 device driver
> is mostly kept for its own private use.
>
> 2) Pre-created vs post-created CXL region for the guest.
> (Dan/Kevin/Alejandro)
>
> There has been a discussion about when to create CXL region for the guest.
>
> Option a: The pCXL region is pre-created before VM boots. When a guest
> creates the CXL region via its virtual HDM decoder, QEMU maps the pCXL
> region to the virtual CXL region configured by the guest. Changes and
> re-configuration of the pCXL region is not expected.
>
> Option b: The pCXL region will be (re)created when a guest creates the
> CXL region via its virtual HDM decoder. QEMU traps the HDM decoder
> commits, triggers the pCXL region creation, maps the pCXL to the virtual
> CXL region.
>
> Alejandro (option b):
> - Will write a doc to elaborate the problem of CXL.cache and why option
> b should be chosen.
>
> Kevin (option b):
> - CXL region is a SW concept, it should be controlled by the guest SW.
>
> Dan (option a):
> - Error handling when creating the pCXL region at runtime. E.g.
> Available HPA in the FWMS in the host is running out when creating the
> pCXL region


I think there is nothing option b can not do including any error 
handling. Available HPA can change, but this is not different to this 
being handled for host devices trying to get an HPA range concurrently.


> - CXL.cache might need extra handling which cannot be done at runtime.
> (Need to check Alejandro's doc)
>
> Next step:
>
> - Will check with Alejandro and start from his doc about CXL.cache concerns.


Working on it. Hopefully a first draft next week.


> 3) Is this exclusively a type2 extension or how do you envision type1/3
> devices with vfio? (Alex)
>
> For type-3 device passthrough, due to its nature of memory expander, CXL
> folks have decided to use either virtio-mem or another stub driver in
> QEMU to manage/map the memory to the guest.
>
> For type-1 device, I am not aware of any type-1 device in the market.
> Dan commented in the CXL discord group:
>
> "my understanding is that some of the CXL FPGA kits offer Type-1 flows,
> but those are for custom solutions not open-market. I am aware of some
> private deployments of such hardware, but nothing with an upstream driver."
>
> My take is that we don't need to consider support type-1 device
> passthrough so far.


I can not see a difference between Type1 and Type2 regarding CXL.cache 
support. Once we have a solution for Type2, that should be fine for Type1.

Thanks,

Alejandro


>
> Z.
>
>> Hi folks:
>>
>> As promised in the LPC, here are all you need (patches, repos, guiding
>> video, kernel config) to build a environment to test the vfio-cxl-core.
>>
>> Thanks so much for the discussions! Enjoy and see you in the next one.
>>
>> Background
>> ==========
>>
>> Compute Express Link (CXL) is an open standard interconnect built upon
>> industrial PCI layers to enhance the performance and efficiency of data
>> centers by enabling high-speed, low-latency communication between CPUs
>> and various types of devices such as accelerators, memory.
>>
>> It supports three key protocols: CXL.io as the control protocol, CXL.cache
>> as the cache-coherent host-device data transfer protocol, and CXL.mem as
>> memory expansion protocol. CXL Type 2 devices leverage the three protocols
>> to seamlessly integrate with host CPUs, providing a unified and efficient
>> interface for high-speed data transfer and memory sharing. This integration
>> is crucial for heterogeneous computing environments where accelerators,
>> such as GPUs, and other specialized processors, are used to handle
>> intensive workloads.
>>
>> Goal
>> ====
>>
>> Although CXL is built upon the PCI layers, passing a CXL type-2 device can
>> be different than PCI devices according to CXL specification[1]:
>>
>> - CXL type-2 device initialization. CXL type-2 device requires an
>> additional initialization sequence besides the PCI device initialization.
>> CXL type-2 device initialization can be pretty complicated due to its
>> hierarchy of register interfaces. Thus, a standard CXL type-2 driver
>> initialization sequence provided by the kernel CXL core is used.
>>
>> - Create a CXL region and map it to the VM. A mapping between HPA and DPA
>> (Device PA) needs to be created to access the device memory directly. HDM
>> decoders in the CXL topology need to be configured level by level to
>> manage the mapping. After the region is created, it needs to be mapped to
>> GPA in the virtual HDM decoders configured by the VM.
>>
>> - CXL reset. The CXL device reset is different from the PCI device reset.
>> A CXL reset sequence is introduced by the CXL spec.
>>
>> - Emulating CXL DVSECs. CXL spec defines a set of DVSECs registers in the
>> configuration for device enumeration and device control. (E.g. if a device
>> is capable of CXL.mem CXL.cache, enable/disable capability) They are owned
>> by the kernel CXL core, and the VM can not modify them.
>>
>> - Emulate CXL MMIO registers. CXL spec defines a set of CXL MMIO registers
>> that can sit in a PCI BAR. The location of register groups sit in the PCI
>> BAR is indicated by the register locator in the CXL DVSECs. They are also
>> owned by the kernel CXL core. Some of them need to be emulated.
>>
>> Design
>> ======
>>
>> To achieve the purpose above, the vfio-cxl-core is introduced to host the
>> common routines that variant driver requires for device passthrough.
>> Similar with the vfio-pci-core, the vfio-cxl-core provides common
>> routines of vfio_device_ops for the variant driver to hook and perform the
>> CXL routines behind it.
>>
>> Besides, several extra APIs are introduced for the variant driver to
>> provide the necessary information the kernel CXL core to initialize
>> the CXL device. E.g., Device DPA.
>>
>> CXL is built upon the PCI layers but with differences. Thus, the
>> vfio-pci-core is aimed to be re-used as much as possible with the
>> awareness of operating on a CXL device.
>>
>> A new VFIO device region is introduced to expose the CXL region to the
>> userspace. A new CXL VFIO device cap has also been introduced to convey
>> the necessary CXL device information to the userspace.
>>
>> Patches
>> =======
>>
>> The patches are based on the cxl-type2 support RFCv2 patchset[2]. Will
>> rebase them to V3 once the cxl-type2 support v3 patch review is done.
>>
>> PATCH 1-3: Expose the necessary routines required by vfio-cxl.
>>
>> PATCH 4: Introduce the preludes of vfio-cxl, including CXL device
>> initialization, CXL region creation.
>>
>> PATCH 5: Expose the CXL region to the userspace.
>>
>> PATCH 6-7: Prepare to emulate the HDM decoder registers.
>>
>> PATCH 8: Emulate the HDM decoder registers.
>>
>> PATCH 9: Tweak vfio-cxl to be aware of working on a CXL device.
>>
>> PATCH 10: Tell vfio-pci-core to emulate CXL DVSECs.
>>
>> PATCH 11: Expose the CXL device information that userspace needs.
>>
>> PATCH 12: An example variant driver to demonstrate the usage of
>> vfio-cxl-core from the perspective of the VFIO variant driver.
>>
>> PATCH 13: A workaround needs suggestions.
>>
>> Test
>> ====
>>
>> To test the patches and hack around, a virtual passthrough with nested
>> virtualization approach is used.
>>
>> The host QEMU emulates a CXL type-2 accel device based on Ira's patches
>> with the changes to emulate HDM decoders.
>>
>> While running the vfio-cxl in the L1 guest, an example VFIO variant
>> driver is used to attach with the QEMU CXL access device.
>>
>> The L2 guest can be booted via the QEMU with the vfio-cxl support in the
>> VFIOStub.
>>
>> In the L2 guest, a dummy CXL device driver is provided to attach to the
>> virtual pass-thru device.
>>
>> The dummy CXL type-2 device driver can successfully be loaded with the
>> kernel cxl core type2 support, create CXL region by requesting the CXL
>> core to allocate HPA and DPA and configure the HDM decoders.
>>
>> To make sure everyone can test the patches, the kernel config of L1 and
>> L2 are provided in the repos, the required kernel command params and
>> qemu command line can be found from the demostration video.[5]
>>
>> Repos
>> =====
>>
>> QEMU host: https://github.com/zhiwang-nvidia/qemu/tree/zhi/vfio-cxl-qemu-host
>> L1 Kernel: https://github.com/zhiwang-nvidia/linux/tree/zhi/vfio-cxl-l1-kernel-rfc
>> L1 QEMU: https://github.com/zhiwang-nvidia/qemu/tree/zhi/vfio-cxl-qemu-l1-rfc
>> L2 Kernel: https://github.com/zhiwang-nvidia/linux/tree/zhi/vfio-cxl-l2
>>
>> [1] https://computeexpresslink.org/cxl-specification/
>> [2] https://lore.kernel.org/netdev/20240715172835.24757-1-alejandro.lucero-palau@amd.com/T/
>> [3] https://patchew.org/QEMU/20230517-rfc-type2-dev-v1-0-6eb2e470981b@intel.com/
>> [4] https://youtu.be/zlk_ecX9bxs?si=hc8P58AdhGXff3Q7
>>
>> Feedback expected
>> =================
>>
>> - Archtiecture level between vfio-pci-core and vfio-cxl-core.
>> - Variant driver requirements from more hardware vendors.
>> - vfio-cxl-core UABI to QEMU.
>>
>> Moving foward
>> =============
>>
>> - Rebase the patches on top of Alejandro's PATCH v3.
>> - Get Ira's type-2 emulated device patch into upstream as CXL folks and RH
>>     folks both came to talk and expect this. I had a chat with Ira and he
>>     expected me to take it over. Will start a discussion in the CXL discord
>>     group for the desgin of V1.
>> - Sparse map in vfio-cxl-core.
>>
>> Known issues
>> ============
>>
>> - Teardown path. Missing teardown paths have been implements in Alejandor's
>>     PATCH v3. It should be solved after the rebase.
>>
>> - Powerdown L1 guest instead of reboot it. The QEMU reset handler is missing
>>     in the Ira's patch. When rebooting L1, many CXL registers are not reset.
>>     This will be addressed in the formal review of emulated CXL type-2 device
>>     support.
>>
>> Zhi Wang (13):
>>     cxl: allow a type-2 device not to have memory device registers
>>     cxl: introduce cxl_get_hdm_info()
>>     cxl: introduce cxl_find_comp_reglock_offset()
>>     vfio: introduce vfio-cxl core preludes
>>     vfio/cxl: expose CXL region to the usersapce via a new VFIO device
>>       region
>>     vfio/pci: expose vfio_pci_rw()
>>     vfio/cxl: introduce vfio_cxl_core_{read, write}()
>>     vfio/cxl: emulate HDM decoder registers
>>     vfio/pci: introduce CXL device awareness
>>     vfio/pci: emulate CXL DVSEC registers in the configuration space
>>     vfio/cxl: introduce VFIO CXL device cap
>>     vfio/cxl: VFIO variant driver for QEMU CXL accel device
>>     vfio/cxl: workaround: don't take resource region when cxl is enabled.
>>
>>    drivers/cxl/core/pci.c              |  28 ++
>>    drivers/cxl/core/regs.c             |  22 +
>>    drivers/cxl/cxl.h                   |   1 +
>>    drivers/cxl/cxlpci.h                |   3 +
>>    drivers/cxl/pci.c                   |  14 +-
>>    drivers/vfio/pci/Kconfig            |   6 +
>>    drivers/vfio/pci/Makefile           |   5 +
>>    drivers/vfio/pci/cxl-accel/Kconfig  |   6 +
>>    drivers/vfio/pci/cxl-accel/Makefile |   3 +
>>    drivers/vfio/pci/cxl-accel/main.c   | 116 +++++
>>    drivers/vfio/pci/vfio_cxl_core.c    | 647 ++++++++++++++++++++++++++++
>>    drivers/vfio/pci/vfio_pci_config.c  |  10 +
>>    drivers/vfio/pci/vfio_pci_core.c    |  79 +++-
>>    drivers/vfio/pci/vfio_pci_rdwr.c    |   8 +-
>>    include/linux/cxl_accel_mem.h       |   3 +
>>    include/linux/cxl_accel_pci.h       |   6 +
>>    include/linux/vfio_pci_core.h       |  53 +++
>>    include/uapi/linux/vfio.h           |  14 +
>>    18 files changed, 992 insertions(+), 32 deletions(-)
>>    create mode 100644 drivers/vfio/pci/cxl-accel/Kconfig
>>    create mode 100644 drivers/vfio/pci/cxl-accel/Makefile
>>    create mode 100644 drivers/vfio/pci/cxl-accel/main.c
>>    create mode 100644 drivers/vfio/pci/vfio_cxl_core.c
>>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC 00/13] vfio: introduce vfio-cxl to support CXL type-2 accelerator passthrough
  2024-09-20 22:34 [RFC 00/13] vfio: introduce vfio-cxl to support CXL type-2 accelerator passthrough Zhi Wang
                   ` (15 preceding siblings ...)
  2024-10-21 10:49 ` Zhi Wang
@ 2024-10-30 11:56 ` Zhi Wang
  16 siblings, 0 replies; 38+ messages in thread
From: Zhi Wang @ 2024-10-30 11:56 UTC (permalink / raw)
  To: kvm, linux-cxl
  Cc: alex.williamson, kevin.tian, jgg, alison.schofield,
	dan.j.williams, dave.jiang, dave, jonathan.cameron, ira.weiny,
	vishal.l.verma, alucerop, acurrid, cjia, smitra, ankita, aniketa,
	kwankhede, targupta, zhiwang

On Fri, 20 Sep 2024 15:34:33 -0700
Zhi Wang <zhiw@nvidia.com> wrote:

Hi folks:

As what Kevin and Alex pointed out that we need to discuss about the
virtualization polices for the registers that need to be handled by the
vfio-cxl core, here are my summary about the virtualization polices for
configuration space registers.

For CXL MMIO registers, I am leaning towards vfio-cxl core respecting
the BAR Virtualization ACL Register Block (8.2.6 BAR Virtualization ACL
Register Block), which shows the register ranges can be safely
passed to the guest. For registers are not in the Virtualization ACL
Register Block, or a device doesn't present a BAR Virtualization ACL
Register Block, we leave the handling the to the variant driver.
Besides, 

Feel free to comment.

Z.

----

8.1 Configuration Space Registers
=================================

8.1.3 PCIE DVSEC for CXL Devices
==================================

- DVS Header 1 (0x4)
All bits are RO, emulated as RO and initial values are from the HW.

- DVS Header 2 (0x8)
All bits are RO, emulated as RO and initial values are from the HW.

- DVSEC CXL Capability (0xa)

Overide values from HW:

Bit [5:4] RO - HDM_Count:
Support only one HDM decoder

Bit [12] RO - TSP Capable:
Not supported in v1

Other bits are RO, emulated, and inital values are from the HW.

- DVSEC CXL Control (0xc)

Bit [0] RWL - Cache_Enable:
Emulated as RO if CONFIG_LOCK bit is set, otherwise RW.

Bit [1] RO - IO_Enable:
Emulated as RO.

Bit [12:2] RWL:
Emulated as RO when CONFIG_LOCK bit is set, otherwise RW.

Bit [13] Rsvd:
Emualted as RO

Bit [14] RWL:
Emulated as RO when CONFIG_LOCK bit is set, otherwise RW.

Bit [15] Rsvd:
Emulated as RO.

- DVSEC CXL Status (0xe)

Bit [14] RW1CS - Viral_Status:
Emulate write one clear bit.

Other bits are Rsvd and emulated as RO, inital values are from the HW.

- DVSEC CXL Control 2 (0x10)

Bit [0] RW - Disable caching:
Disable the caching on HW when VM writes bit 1.

Bit [1] RW - Initiate Cache Write Back and Invalidation:
Trigger the cache writeback and invalidation via Linux CXL core, update
cache invalid bit in DVSEC CXL Status 2.

Bit [2] RW - Initiate CXL Reset:
Trigger the CXL reset via Linux CXL core, update the CXL reset complete
, CXL reset error in DVSEC CXL Status 2.

Bit [3] RW - CXL Reset Mem Clr Enable:
As a param when trigger the CXL reset via Linux CXL core.

Bit [4] Desired Volatile HDM State after Hot Reset - RWS/RO
Write the bit on the HW or via Linux CXL core.

Bit [5] Modified Completion Eanble - RW/RO
Write the bit on the HW or via Linux CXL core.

Other bits are Rsvd, emulated as RO and inital values are from the HW.

- DVSEC CXL Status 2 (0x12)
Bit [0] RO - Cache Invalid:
Updated when emulating DVSEC CXL Control 2.

Bit [1] RO - CXL Reset Complete:
Updated when emulating DVSEC CXL Control 2.

Bit [1] RO - CXL Reset Error:
Updated when emulating DVSEC CXL Control 2.

Bit [3] RW1CS/RsvdZ - Volatile HDM Preservation Error:
Read the bit from the HW.

Bit [14:4] Rsvd:
Emulated as RO and inital values are from the HW.

Bit [15] RO - Power Management Intalization Complete:
Read the bit from the HW.

DVSEC CXL Capability2 (0x16)
All bits are RO, emulated as RO and initial values are from the HW.

DVSEC CXL Range 1 Size High (0x18)
All bits are RO, emulated as RO and initial values are from the HW.

DVSEC CXL Range 1 Size Low (0x1c)
All bits are RO, emulated as RO and initial values are from the HW.

DVSEC CXL Range 1 Base High (0x20)
Emulated as RW

DVSEC CXL Range 1 Base Low (0x24)
Emulated as RW

DVSEC CXL Range 2 Size High (0x28)
All bits are RO, emulated as RO and initial values are from the HW.

DVSEC CXL Range 2 Size Low (0x2c)
All bits are RO, emulated as RO and initial values are from the HW.

DVSEC CXL Range 2 Base High (0x30)
Emulated as RW

DVSEC CXL Range 2 Base Low (0x34)
Emulated as RW

DVSEC CXL Capability 3 (0x38)
All bits are RO, emulated as RO and initial values are from the HW.

8.1.4 Non-CXL Function Map DVSEC
================================
Not supported

8.1.5 CXL Extensions DVSEC for Ports
====================================
For root port/switches, no need to support in type-2 device passthorugh

8.1.6 GPF DVSEC for CXL Port
============================
For root port/switches, no need to support in type-2 device passthorugh

8.1.7 GPF DVSEC for CXL Device
==============================

DVS Header 1 (0x4)
All bits are RO, emulated as RO and initial values are from the HW.

DVS Header 2 (0x8)
All bits are RO, emulated as RO and initial values are from the HW.

GPF Phase 2 Duration (0xa)
All bits are RO, emulated as RO and initial values are from the HW.

GPF Phase 2 Power (0xc)
All bits are RO, emulated as RO and initial values are from the HW.

8.1.8 PCIE DVSEC for Flex Bus Port
==================================
For root port/switches, no need to support in type-2 device passthorugh

8.1.9 Register Locator DVSEC
============================

DVS Header 1 (0x4)
All bits are RO, emulated as RO and initial values are from the HW.

DVS Header 2 (0x8)
All bits are RO, emulated as RO and initial values are from the HW.

Register Block 1-3 (Varies)
All bits are RO, emulated as RO and initial values are from the HW.

8.1.10 MLD DVSEC
================
Not supported. Mostly this is for type-3 device.

8.1.11 Table Access DOE
Coupled with QEMU DOE emulation

8.1.12 Memory Device Configuration Space Layout
Not supported. Mostly this is for type-3 device.

8.1.13 Switch Mailbox CCI Configuration Space Layout
Not supported. This is for switches.


> Hi folks:
> 
> As promised in the LPC, here are all you need (patches, repos, guiding
> video, kernel config) to build a environment to test the
> vfio-cxl-core.
> 
> Thanks so much for the discussions! Enjoy and see you in the next one.
> 
> Background
> ==========
> 
> Compute Express Link (CXL) is an open standard interconnect built upon
> industrial PCI layers to enhance the performance and efficiency of
> data centers by enabling high-speed, low-latency communication
> between CPUs and various types of devices such as accelerators,
> memory.
> 
> It supports three key protocols: CXL.io as the control protocol,
> CXL.cache as the cache-coherent host-device data transfer protocol,
> and CXL.mem as memory expansion protocol. CXL Type 2 devices leverage
> the three protocols to seamlessly integrate with host CPUs, providing
> a unified and efficient interface for high-speed data transfer and
> memory sharing. This integration is crucial for heterogeneous
> computing environments where accelerators, such as GPUs, and other
> specialized processors, are used to handle intensive workloads.
> 
> Goal
> ====
> 
> Although CXL is built upon the PCI layers, passing a CXL type-2
> device can be different than PCI devices according to CXL
> specification[1]:
> 
> - CXL type-2 device initialization. CXL type-2 device requires an
> additional initialization sequence besides the PCI device
> initialization. CXL type-2 device initialization can be pretty
> complicated due to its hierarchy of register interfaces. Thus, a
> standard CXL type-2 driver initialization sequence provided by the
> kernel CXL core is used.
> 
> - Create a CXL region and map it to the VM. A mapping between HPA and
> DPA (Device PA) needs to be created to access the device memory
> directly. HDM decoders in the CXL topology need to be configured
> level by level to manage the mapping. After the region is created, it
> needs to be mapped to GPA in the virtual HDM decoders configured by
> the VM.
> 
> - CXL reset. The CXL device reset is different from the PCI device
> reset. A CXL reset sequence is introduced by the CXL spec.
> 
> - Emulating CXL DVSECs. CXL spec defines a set of DVSECs registers in
> the configuration for device enumeration and device control. (E.g. if
> a device is capable of CXL.mem CXL.cache, enable/disable capability)
> They are owned by the kernel CXL core, and the VM can not modify them.
> 
> - Emulate CXL MMIO registers. CXL spec defines a set of CXL MMIO
> registers that can sit in a PCI BAR. The location of register groups
> sit in the PCI BAR is indicated by the register locator in the CXL
> DVSECs. They are also owned by the kernel CXL core. Some of them need
> to be emulated.
> 
> Design
> ======
> 
> To achieve the purpose above, the vfio-cxl-core is introduced to host
> the common routines that variant driver requires for device
> passthrough. Similar with the vfio-pci-core, the vfio-cxl-core
> provides common routines of vfio_device_ops for the variant driver to
> hook and perform the CXL routines behind it.
> 
> Besides, several extra APIs are introduced for the variant driver to
> provide the necessary information the kernel CXL core to initialize
> the CXL device. E.g., Device DPA.
> 
> CXL is built upon the PCI layers but with differences. Thus, the
> vfio-pci-core is aimed to be re-used as much as possible with the
> awareness of operating on a CXL device.
> 
> A new VFIO device region is introduced to expose the CXL region to the
> userspace. A new CXL VFIO device cap has also been introduced to
> convey the necessary CXL device information to the userspace.
> 
> Patches
> =======
> 
> The patches are based on the cxl-type2 support RFCv2 patchset[2]. Will
> rebase them to V3 once the cxl-type2 support v3 patch review is done.
> 
> PATCH 1-3: Expose the necessary routines required by vfio-cxl.
> 
> PATCH 4: Introduce the preludes of vfio-cxl, including CXL device
> initialization, CXL region creation.
> 
> PATCH 5: Expose the CXL region to the userspace.
> 
> PATCH 6-7: Prepare to emulate the HDM decoder registers.
> 
> PATCH 8: Emulate the HDM decoder registers.
> 
> PATCH 9: Tweak vfio-cxl to be aware of working on a CXL device.
> 
> PATCH 10: Tell vfio-pci-core to emulate CXL DVSECs.
> 
> PATCH 11: Expose the CXL device information that userspace needs.
> 
> PATCH 12: An example variant driver to demonstrate the usage of
> vfio-cxl-core from the perspective of the VFIO variant driver.
> 
> PATCH 13: A workaround needs suggestions.
> 
> Test
> ====
> 
> To test the patches and hack around, a virtual passthrough with nested
> virtualization approach is used.
> 
> The host QEMU emulates a CXL type-2 accel device based on Ira's
> patches with the changes to emulate HDM decoders.
> 
> While running the vfio-cxl in the L1 guest, an example VFIO variant
> driver is used to attach with the QEMU CXL access device.
> 
> The L2 guest can be booted via the QEMU with the vfio-cxl support in
> the VFIOStub.
> 
> In the L2 guest, a dummy CXL device driver is provided to attach to
> the virtual pass-thru device.
> 
> The dummy CXL type-2 device driver can successfully be loaded with the
> kernel cxl core type2 support, create CXL region by requesting the CXL
> core to allocate HPA and DPA and configure the HDM decoders.
> 
> To make sure everyone can test the patches, the kernel config of L1
> and L2 are provided in the repos, the required kernel command params
> and qemu command line can be found from the demostration video.[5]
> 
> Repos
> =====
> 
> QEMU host:
> https://github.com/zhiwang-nvidia/qemu/tree/zhi/vfio-cxl-qemu-host L1
> Kernel:
> https://github.com/zhiwang-nvidia/linux/tree/zhi/vfio-cxl-l1-kernel-rfc
> L1 QEMU:
> https://github.com/zhiwang-nvidia/qemu/tree/zhi/vfio-cxl-qemu-l1-rfc
> L2 Kernel:
> https://github.com/zhiwang-nvidia/linux/tree/zhi/vfio-cxl-l2
> 
> [1] https://computeexpresslink.org/cxl-specification/
> [2]
> https://lore.kernel.org/netdev/20240715172835.24757-1-alejandro.lucero-palau@amd.com/T/
> [3]
> https://patchew.org/QEMU/20230517-rfc-type2-dev-v1-0-6eb2e470981b@intel.com/
> [4] https://youtu.be/zlk_ecX9bxs?si=hc8P58AdhGXff3Q7
> 
> Feedback expected
> =================
> 
> - Archtiecture level between vfio-pci-core and vfio-cxl-core.
> - Variant driver requirements from more hardware vendors.
> - vfio-cxl-core UABI to QEMU.
> 
> Moving foward
> =============
> 
> - Rebase the patches on top of Alejandro's PATCH v3.
> - Get Ira's type-2 emulated device patch into upstream as CXL folks
> and RH folks both came to talk and expect this. I had a chat with Ira
> and he expected me to take it over. Will start a discussion in the
> CXL discord group for the desgin of V1.
> - Sparse map in vfio-cxl-core.
> 
> Known issues
> ============
> 
> - Teardown path. Missing teardown paths have been implements in
> Alejandor's PATCH v3. It should be solved after the rebase.
> 
> - Powerdown L1 guest instead of reboot it. The QEMU reset handler is
> missing in the Ira's patch. When rebooting L1, many CXL registers are
> not reset. This will be addressed in the formal review of emulated
> CXL type-2 device support.
> 
> Zhi Wang (13):
>   cxl: allow a type-2 device not to have memory device registers
>   cxl: introduce cxl_get_hdm_info()
>   cxl: introduce cxl_find_comp_reglock_offset()
>   vfio: introduce vfio-cxl core preludes
>   vfio/cxl: expose CXL region to the usersapce via a new VFIO device
>     region
>   vfio/pci: expose vfio_pci_rw()
>   vfio/cxl: introduce vfio_cxl_core_{read, write}()
>   vfio/cxl: emulate HDM decoder registers
>   vfio/pci: introduce CXL device awareness
>   vfio/pci: emulate CXL DVSEC registers in the configuration space
>   vfio/cxl: introduce VFIO CXL device cap
>   vfio/cxl: VFIO variant driver for QEMU CXL accel device
>   vfio/cxl: workaround: don't take resource region when cxl is
> enabled.
> 
>  drivers/cxl/core/pci.c              |  28 ++
>  drivers/cxl/core/regs.c             |  22 +
>  drivers/cxl/cxl.h                   |   1 +
>  drivers/cxl/cxlpci.h                |   3 +
>  drivers/cxl/pci.c                   |  14 +-
>  drivers/vfio/pci/Kconfig            |   6 +
>  drivers/vfio/pci/Makefile           |   5 +
>  drivers/vfio/pci/cxl-accel/Kconfig  |   6 +
>  drivers/vfio/pci/cxl-accel/Makefile |   3 +
>  drivers/vfio/pci/cxl-accel/main.c   | 116 +++++
>  drivers/vfio/pci/vfio_cxl_core.c    | 647
> ++++++++++++++++++++++++++++ drivers/vfio/pci/vfio_pci_config.c  |
> 10 + drivers/vfio/pci/vfio_pci_core.c    |  79 +++-
>  drivers/vfio/pci/vfio_pci_rdwr.c    |   8 +-
>  include/linux/cxl_accel_mem.h       |   3 +
>  include/linux/cxl_accel_pci.h       |   6 +
>  include/linux/vfio_pci_core.h       |  53 +++
>  include/uapi/linux/vfio.h           |  14 +
>  18 files changed, 992 insertions(+), 32 deletions(-)
>  create mode 100644 drivers/vfio/pci/cxl-accel/Kconfig
>  create mode 100644 drivers/vfio/pci/cxl-accel/Makefile
>  create mode 100644 drivers/vfio/pci/cxl-accel/main.c
>  create mode 100644 drivers/vfio/pci/vfio_cxl_core.c
> 


^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2024-10-30 11:56 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-09-20 22:34 [RFC 00/13] vfio: introduce vfio-cxl to support CXL type-2 accelerator passthrough Zhi Wang
2024-09-20 22:34 ` [RFC 01/13] cxl: allow a type-2 device not to have memory device registers Zhi Wang
2024-09-23  8:01   ` Tian, Kevin
2024-09-23 15:38   ` Dave Jiang
2024-09-24  8:03     ` Zhi Wang
2024-09-20 22:34 ` [RFC 02/13] cxl: introduce cxl_get_hdm_info() Zhi Wang
2024-10-17 15:44   ` Jonathan Cameron
2024-10-19  5:38     ` Zhi Wang
2024-09-20 22:34 ` [RFC 03/13] cxl: introduce cxl_find_comp_reglock_offset() Zhi Wang
2024-09-20 22:34 ` [RFC 04/13] vfio: introduce vfio-cxl core preludes Zhi Wang
2024-10-11 18:33   ` Alex Williamson
2024-09-20 22:34 ` [RFC 05/13] vfio/cxl: expose CXL region to the usersapce via a new VFIO device region Zhi Wang
2024-10-11 19:12   ` Alex Williamson
2024-09-20 22:34 ` [RFC 06/13] vfio/pci: expose vfio_pci_rw() Zhi Wang
2024-09-20 22:34 ` [RFC 07/13] vfio/cxl: introduce vfio_cxl_core_{read, write}() Zhi Wang
2024-09-20 22:34 ` [RFC 08/13] vfio/cxl: emulate HDM decoder registers Zhi Wang
2024-09-20 22:34 ` [RFC 09/13] vfio/pci: introduce CXL device awareness Zhi Wang
2024-10-11 20:37   ` Alex Williamson
2024-09-20 22:34 ` [RFC 10/13] vfio/pci: emulate CXL DVSEC registers in the configuration space Zhi Wang
2024-10-11 21:02   ` Alex Williamson
2024-09-20 22:34 ` [RFC 11/13] vfio/cxl: introduce VFIO CXL device cap Zhi Wang
2024-10-11 21:14   ` Alex Williamson
2024-09-20 22:34 ` [RFC 12/13] vfio/cxl: VFIO variant driver for QEMU CXL accel device Zhi Wang
2024-09-20 22:34 ` [RFC 13/13] vfio/cxl: workaround: don't take resource region when cxl is enabled Zhi Wang
2024-09-23  8:00 ` [RFC 00/13] vfio: introduce vfio-cxl to support CXL type-2 accelerator passthrough Tian, Kevin
2024-09-24  8:30   ` Zhi Wang
2024-09-25 13:05     ` Jonathan Cameron
2024-09-27  7:18       ` Zhi Wang
2024-10-04 11:40         ` Jonathan Cameron
2024-10-19  5:30           ` Zhi Wang
2024-10-21 11:07             ` Alejandro Lucero Palau
2024-09-26  6:55     ` Tian, Kevin
2024-09-25 10:11 ` Alejandro Lucero Palau
2024-09-27  7:38   ` Zhi Wang
2024-09-27  7:38   ` Zhi Wang
2024-10-21 10:49 ` Zhi Wang
2024-10-21 13:10   ` Alejandro Lucero Palau
2024-10-30 11:56 ` Zhi Wang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox