From: Zhi Wang <zhiw@nvidia.com>
To: Alejandro Lucero Palau <alucerop@amd.com>,
"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
"linux-cxl@vger.kernel.org" <linux-cxl@vger.kernel.org>
Cc: "alex.williamson@redhat.com" <alex.williamson@redhat.com>,
"kevin.tian@intel.com" <kevin.tian@intel.com>,
Jason Gunthorpe <jgg@nvidia.com>,
"alison.schofield@intel.com" <alison.schofield@intel.com>,
"dan.j.williams@intel.com" <dan.j.williams@intel.com>,
"dave.jiang@intel.com" <dave.jiang@intel.com>,
"dave@stgolabs.net" <dave@stgolabs.net>,
"jonathan.cameron@huawei.com" <jonathan.cameron@huawei.com>,
"ira.weiny@intel.com" <ira.weiny@intel.com>,
"vishal.l.verma@intel.com" <vishal.l.verma@intel.com>,
Andy Currid <ACurrid@nvidia.com>, Neo Jia <cjia@nvidia.com>,
Surath Mitra <smitra@nvidia.com>,
Ankit Agrawal <ankita@nvidia.com>,
Aniket Agashe <aniketa@nvidia.com>,
Kirti Wankhede <kwankhede@nvidia.com>,
"Tarun Gupta (SW-GPU)" <targupta@nvidia.com>,
"zhiwang@kernel.org" <zhiwang@kernel.org>
Subject: Re: [RFC 00/13] vfio: introduce vfio-cxl to support CXL type-2 accelerator passthrough
Date: Fri, 27 Sep 2024 07:38:17 +0000
Message-ID: <8beff9bc-9d60-4e56-ae25-b25755ecd38f@nvidia.com>
In-Reply-To: <4230fba5-030c-49ef-799e-f4138b1c9f7d@amd.com>
On 25/09/2024 12.11, Alejandro Lucero Palau wrote:
>
> On 9/20/24 23:34, Zhi Wang wrote:
>> Hi folks:
>>
>> As promised at LPC, here is everything you need (patches, repos, guiding
>> video, kernel config) to build an environment to test the vfio-cxl-core.
>>
>> Thanks so much for the discussions! Enjoy and see you in the next one.
>>
>> Background
>> ==========
>>
>> Compute Express Link (CXL) is an open standard interconnect built upon
>> the industry-standard PCI layers to enhance the performance and
>> efficiency of data centers by enabling high-speed, low-latency
>> communication between CPUs and various types of devices, such as
>> accelerators and memory expanders.
>>
>> It supports three key protocols: CXL.io as the control protocol,
>> CXL.cache as the cache-coherent host-device data transfer protocol, and
>> CXL.mem as the memory expansion protocol. CXL type-2 devices leverage
>> all three protocols to integrate seamlessly with host CPUs, providing a
>> unified and efficient interface for high-speed data transfer and memory
>> sharing. This integration is crucial for heterogeneous computing
>> environments, where accelerators such as GPUs and other specialized
>> processors are used to handle intensive workloads.
>>
>> Goal
>> ====
>>
>> Although CXL is built upon the PCI layers, passing through a CXL type-2
>> device differs from passing through a PCI device according to the CXL
>> specification[1]:
>>
>> - CXL type-2 device initialization. A CXL type-2 device requires an
>> additional initialization sequence besides the PCI device
>> initialization. This sequence can be fairly complicated due to the
>> hierarchy of register interfaces. Thus, a standard CXL type-2 driver
>> initialization sequence provided by the kernel CXL core is used.
>>
>> - Create a CXL region and map it to the VM. A mapping between HPA and
>> DPA (device physical address) needs to be created to access the device
>> memory directly. HDM decoders in the CXL topology need to be configured
>> level by level to manage the mapping. After the region is created, it
>> needs to be mapped to GPA via the virtual HDM decoders configured by
>> the VM.
>>
>> - CXL reset. The CXL device reset is different from the PCI device
>> reset. A dedicated CXL reset sequence is defined by the CXL spec.
>>
>> - Emulating CXL DVSECs. The CXL spec defines a set of DVSEC registers
>> in the configuration space for device enumeration and device control
>> (e.g., whether a device is capable of CXL.mem/CXL.cache, and
>> enabling/disabling those capabilities). They are owned by the kernel
>> CXL core, and the VM must not modify them.
>>
>> - Emulating CXL MMIO registers. The CXL spec defines a set of CXL MMIO
>> registers that can sit in a PCI BAR. The location of each register
>> group within the BAR is indicated by the register locator in the CXL
>> DVSECs. These registers are also owned by the kernel CXL core, and some
>> of them need to be emulated.
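
As a quick aside on what the HDM decoder emulation has to model: an HDM
decoder maps a window of host physical addresses onto device physical
addresses. A toy, runnable sketch of that translation (plain C; the field
names are simplified stand-ins rather than the spec-defined register
layout, and only a single non-interleaved decoder is modeled):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Toy model of a single, non-interleaved HDM decoder: it maps a
 * contiguous HPA window onto DPA starting at a programmed offset.
 * Not the spec register layout; just the address math.
 */
struct hdm_decoder {
	uint64_t hpa_base;   /* start of the HPA window */
	uint64_t size;       /* window size in bytes */
	uint64_t dpa_skip;   /* DPA offset the window starts at */
	bool committed;      /* decoder programmed and committed */
};

/* Translate an HPA to a DPA; returns false if the decoder misses. */
static bool hdm_decode(const struct hdm_decoder *d, uint64_t hpa,
		       uint64_t *dpa)
{
	if (!d->committed || hpa < d->hpa_base ||
	    hpa >= d->hpa_base + d->size)
		return false;
	*dpa = d->dpa_skip + (hpa - d->hpa_base);
	return true;
}
```

In the real topology this translation is chained level by level (host
bridge, switch, endpoint), and the virtual decoders seen by the VM add
one more, emulated, level on top.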
>>
>> Design
>> ======
>>
>> To achieve the goals above, the vfio-cxl-core is introduced to host the
>> common routines that a variant driver requires for device passthrough.
>> Similar to the vfio-pci-core, the vfio-cxl-core provides common
>> vfio_device_ops routines for the variant driver to hook into, and
>> performs the CXL routines behind them.
>>
>> Besides, several extra APIs are introduced for the variant driver to
>> provide the necessary information to the kernel CXL core to initialize
>> the CXL device, e.g., the device DPA.
>>
>> CXL is built upon the PCI layers, but with differences. Thus, the aim
>> is to re-use the vfio-pci-core as much as possible, with awareness of
>> operating on a CXL device.
>>
>> A new VFIO device region is introduced to expose the CXL region to
>> userspace. A new CXL VFIO device cap has also been introduced to convey
>> the necessary CXL device information to userspace.
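
To make the layering concrete, here is a minimal runnable model of the
hook-and-fallback pattern described above (plain C; all names are
hypothetical, loosely inspired by vfio_device_ops, and not the actual
vfio API):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical op table, standing in for something like vfio_device_ops. */
struct dev_ops {
	int (*open)(void);
	int (*reset)(void);
};

static int core_open(void)  { return 100; } /* default CXL init path */
static int core_reset(void) { return 200; } /* default CXL reset path */

static const struct dev_ops core_default_ops = {
	.open  = core_open,
	.reset = core_reset,
};

/* A variant driver overrides only the hooks it needs ... */
static int variant_reset(void)
{
	/* a vendor-specific pre-reset quirk would go here ... */
	return core_reset(); /* ... then falls back into the core routine */
}

/* Build the effective op table: variant hooks win, the core fills gaps. */
static struct dev_ops compose_ops(const struct dev_ops *variant)
{
	struct dev_ops ops = core_default_ops;
	if (variant->open)
		ops.open = variant->open;
	if (variant->reset)
		ops.reset = variant->reset;
	return ops;
}
```

The point of the model is only the shape: the variant driver hooks the
entry points it cares about and delegates the common CXL work back to
the core, exactly as vfio-pci variant drivers do with vfio-pci-core.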
>
>
>
> Hi Zhi,
>
>
> As you know, I was confused by this work, but after looking at the
> patchset and thinking about all this, it makes sense now. FWIW, the most
> confusing point was the use of the CXL core inside the VM for creating
> the region, which implies commits to the CXL root switch/complex and any
> other switch in the path. I realize now it will happen, but on emulated
> hardware with no implication for the real one, which was updated with
> any necessary changes, like those commits, by the vfio cxl code in the
> host (the L1 VM in your tests).
>
>
> The only problem I can see with this approach is that the CXL
> initialization is left unconditionally to the hypervisor. I guess most
> of the time this will be fine, but the driver might not always be
> mapping/using the whole CXL mem. I know this could be awkward, but it
> is possible, depending on device state unrelated to CXL itself.

Will these device states be a one-time on/off state, or a runtime
configuration state that a guest needs to poke all the time?

There can be two paths for handling these states in a vendor-specific
variant driver: 1) the vfio_device->fops->open() path, which suits a
one-time on/off state; 2) the vfio_device->fops->{read,write}() path,
i.e., the VM exit -> QEMU -> variant driver path. The vendor-specific
driver can configure the HW based on the register accesses from the
guest.

It would be nice to know more about this, e.g., how many registers the
vendor-specific driver would like to handle, so that the VFIO CXL core
can provide common helpers.
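
The two paths can be modeled roughly like this (plain C; the offsets,
names, and register layout are hypothetical, not a real device map):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define EMULATED_REG_OFF 0x10u /* hypothetical trapped register offset */

struct model_dev {
	bool one_time_enabled;  /* path 1: set once at open() time */
	uint32_t runtime_reg;   /* path 2: updated on each trapped write */
	uint32_t hw_reg;        /* stands in for real, passed-through HW */
};

/* Path 1: one-time on/off state, applied in the open() hook. */
static void dev_open(struct model_dev *d)
{
	d->one_time_enabled = true;
}

/*
 * Path 2: the write() hook traps accesses to one emulated register and
 * passes everything else through to the hardware.
 */
static void dev_write(struct model_dev *d, uint32_t off, uint32_t val)
{
	if (off == EMULATED_REG_OFF)
		d->runtime_reg = val;  /* emulated: never hits HW as-is */
	else
		d->hw_reg = val;       /* passthrough */
}
```

If the vendor-specific register set turns out to be small and mostly
one-time, path 1 alone may be enough; a large runtime-poked set is what
would justify common trap-and-emulate helpers in the VFIO CXL core.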

> In other words, this approach
> assumes beforehand something which could not be true. What I had in mind
> was to have VM exits for any action on CXL configuration on behalf of
> that device/driver inside the VM.
>

Initially, this was an idea from Dan. I think this would be a good topic
for the next CXL open-source collaboration meeting. Kevin also commented
on this.
>
> This is all more problematic with CXL.cache, and I think the same
> approach can not be followed. I'm writing a document trying to share all
> my concerns about CXL.cache and DMA/IOMMU mappings, and I will cover
> this for sure. As a quick note, while DMA/IOMMU has no limitations
> regarding the amount of memory to use, it is unlikely the same can be
> done due to scarce host snoop cache resources, therefore the CXL.cache
> mappings will likely need to be explicitly done by the driver and
> approved by the CXL core (along with DMA/IOMMU), and for a driver inside
> a VM that implies VM exits.
>

Good to hear. Please CC me as well. Many thanks.
>
> Thanks.
>
> Alejandro.
>
>> Patches
>> =======
>>
>> The patches are based on the cxl-type2 support RFCv2 patchset[2]. Will
>> rebase them to V3 once the cxl-type2 support v3 patch review is done.
>>
>> PATCH 1-3: Expose the necessary routines required by vfio-cxl.
>>
>> PATCH 4: Introduce the preludes of vfio-cxl, including CXL device
>> initialization, CXL region creation.
>>
>> PATCH 5: Expose the CXL region to the userspace.
>>
>> PATCH 6-7: Prepare to emulate the HDM decoder registers.
>>
>> PATCH 8: Emulate the HDM decoder registers.
>>
>> PATCH 9: Tweak vfio-cxl to be aware of working on a CXL device.
>>
>> PATCH 10: Tell vfio-pci-core to emulate CXL DVSECs.
>>
>> PATCH 11: Expose the CXL device information that userspace needs.
>>
>> PATCH 12: An example variant driver to demonstrate the usage of
>> vfio-cxl-core from the perspective of the VFIO variant driver.
>>
>> PATCH 13: A workaround that needs suggestions.
>>
>> Test
>> ====
>>
>> To test the patches and hack around, a virtual passthrough setup with
>> nested virtualization is used.
>>
>> The host QEMU emulates a CXL type-2 accel device based on Ira's
>> patches, with changes to emulate HDM decoders.
>>
>> While running vfio-cxl in the L1 guest, an example VFIO variant driver
>> is used to attach to the QEMU CXL accel device.
>>
>> The L2 guest can be booted via the QEMU with the vfio-cxl support in the
>> VFIOStub.
>>
>> In the L2 guest, a dummy CXL device driver is provided to attach to the
>> virtual passthrough device.
>>
>> The dummy CXL type-2 device driver can be successfully loaded with the
>> kernel CXL core type-2 support; it creates a CXL region by requesting
>> the CXL core to allocate HPA and DPA and to configure the HDM decoders.
>>
>> To make sure everyone can test the patches, the kernel configs of L1
>> and L2 are provided in the repos; the required kernel command-line
>> params and QEMU command line can be found in the demonstration
>> video.[4]
>>
>> Repos
>> =====
>>
>> QEMU host:
>> https://github.com/zhiwang-nvidia/qemu/tree/zhi/vfio-cxl-qemu-host
>> L1 Kernel:
>> https://github.com/zhiwang-nvidia/linux/tree/zhi/vfio-cxl-l1-kernel-rfc
>> L1 QEMU:
>> https://github.com/zhiwang-nvidia/qemu/tree/zhi/vfio-cxl-qemu-l1-rfc
>> L2 Kernel: https://github.com/zhiwang-nvidia/linux/tree/zhi/vfio-cxl-l2
>>
>> [1] https://computeexpresslink.org/cxl-specification/
>> [2]
>> https://lore.kernel.org/netdev/20240715172835.24757-1-alejandro.lucero-palau@amd.com/T/
>> [3]
>> https://patchew.org/QEMU/20230517-rfc-type2-dev-v1-0-6eb2e470981b@intel.com/
>> [4] https://youtu.be/zlk_ecX9bxs?si=hc8P58AdhGXff3Q7
>>
>> Feedback expected
>> =================
>>
>> - Architecture level between vfio-pci-core and vfio-cxl-core.
>> - Variant driver requirements from more hardware vendors.
>> - vfio-cxl-core UABI to QEMU.
>>
>> Moving forward
>> =============
>>
>> - Rebase the patches on top of Alejandro's PATCH v3.
>> - Get Ira's type-2 emulated device patch into upstream, as both the
>> CXL folks and the RH folks came to talk and expect this. I had a chat
>> with Ira and he expected me to take it over. Will start a discussion
>> in the CXL discord group for the design of V1.
>> - Sparse map in vfio-cxl-core.
>>
>> Known issues
>> ============
>>
>> - Teardown path. Missing teardown paths have been implemented in
>> Alejandro's PATCH v3. It should be solved after the rebase.
>>
>> - Power down the L1 guest instead of rebooting it. The QEMU reset
>> handler is missing in Ira's patch. When rebooting L1, many CXL
>> registers are not reset. This will be addressed in the formal review
>> of the emulated CXL type-2 device support.
>>
>> Zhi Wang (13):
>> cxl: allow a type-2 device not to have memory device registers
>> cxl: introduce cxl_get_hdm_info()
>> cxl: introduce cxl_find_comp_reglock_offset()
>> vfio: introduce vfio-cxl core preludes
>> vfio/cxl: expose CXL region to the userspace via a new VFIO device
>> region
>> vfio/pci: expose vfio_pci_rw()
>> vfio/cxl: introduce vfio_cxl_core_{read, write}()
>> vfio/cxl: emulate HDM decoder registers
>> vfio/pci: introduce CXL device awareness
>> vfio/pci: emulate CXL DVSEC registers in the configuration space
>> vfio/cxl: introduce VFIO CXL device cap
>> vfio/cxl: VFIO variant driver for QEMU CXL accel device
>> vfio/cxl: workaround: don't take resource region when cxl is enabled.
>>
>> drivers/cxl/core/pci.c | 28 ++
>> drivers/cxl/core/regs.c | 22 +
>> drivers/cxl/cxl.h | 1 +
>> drivers/cxl/cxlpci.h | 3 +
>> drivers/cxl/pci.c | 14 +-
>> drivers/vfio/pci/Kconfig | 6 +
>> drivers/vfio/pci/Makefile | 5 +
>> drivers/vfio/pci/cxl-accel/Kconfig | 6 +
>> drivers/vfio/pci/cxl-accel/Makefile | 3 +
>> drivers/vfio/pci/cxl-accel/main.c | 116 +++++
>> drivers/vfio/pci/vfio_cxl_core.c | 647 ++++++++++++++++++++++++++++
>> drivers/vfio/pci/vfio_pci_config.c | 10 +
>> drivers/vfio/pci/vfio_pci_core.c | 79 +++-
>> drivers/vfio/pci/vfio_pci_rdwr.c | 8 +-
>> include/linux/cxl_accel_mem.h | 3 +
>> include/linux/cxl_accel_pci.h | 6 +
>> include/linux/vfio_pci_core.h | 53 +++
>> include/uapi/linux/vfio.h | 14 +
>> 18 files changed, 992 insertions(+), 32 deletions(-)
>> create mode 100644 drivers/vfio/pci/cxl-accel/Kconfig
>> create mode 100644 drivers/vfio/pci/cxl-accel/Makefile
>> create mode 100644 drivers/vfio/pci/cxl-accel/main.c
>> create mode 100644 drivers/vfio/pci/vfio_cxl_core.c
>>