From: Zhi Wang <zhiw@nvidia.com>
To: Alejandro Lucero Palau <alucerop@amd.com>,
"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
"linux-cxl@vger.kernel.org" <linux-cxl@vger.kernel.org>
Cc: "alex.williamson@redhat.com" <alex.williamson@redhat.com>,
"kevin.tian@intel.com" <kevin.tian@intel.com>,
Jason Gunthorpe <jgg@nvidia.com>,
"alison.schofield@intel.com" <alison.schofield@intel.com>,
"dan.j.williams@intel.com" <dan.j.williams@intel.com>,
"dave.jiang@intel.com" <dave.jiang@intel.com>,
"dave@stgolabs.net" <dave@stgolabs.net>,
"jonathan.cameron@huawei.com" <jonathan.cameron@huawei.com>,
"ira.weiny@intel.com" <ira.weiny@intel.com>,
"vishal.l.verma@intel.com" <vishal.l.verma@intel.com>,
Andy Currid <ACurrid@nvidia.com>, Neo Jia <cjia@nvidia.com>,
Surath Mitra <smitra@nvidia.com>,
Ankit Agrawal <ankita@nvidia.com>,
Aniket Agashe <aniketa@nvidia.com>,
Kirti Wankhede <kwankhede@nvidia.com>,
"Tarun Gupta (SW-GPU)" <targupta@nvidia.com>,
"zhiwang@kernel.org" <zhiwang@kernel.org>
Subject: Re: [RFC 00/13] vfio: introduce vfio-cxl to support CXL type-2 accelerator passthrough
Date: Fri, 27 Sep 2024 07:38:17 +0000
Message-ID: <8beff9bc-9d60-4e56-ae25-b25755ecd38f@nvidia.com>
In-Reply-To: <4230fba5-030c-49ef-799e-f4138b1c9f7d@amd.com>
On 25/09/2024 12.11, Alejandro Lucero Palau wrote:
>
> On 9/20/24 23:34, Zhi Wang wrote:
>> Hi folks:
>>
>> As promised at LPC, here is everything you need (patches, repos, guiding
>> video, kernel config) to build an environment to test the vfio-cxl-core.
>>
>> Thanks so much for the discussions! Enjoy and see you in the next one.
>>
>> Background
>> ==========
>>
>> Compute Express Link (CXL) is an open standard interconnect built upon
>> the industry-standard PCI layers to enhance the performance and
>> efficiency of data centers by enabling high-speed, low-latency
>> communication between CPUs and various types of devices, such as
>> accelerators and memory expanders.
>>
>> It supports three key protocols: CXL.io as the control protocol,
>> CXL.cache as the cache-coherent host-device data transfer protocol, and
>> CXL.mem as the memory expansion protocol. CXL type-2 devices leverage
>> all three protocols to integrate seamlessly with host CPUs, providing a
>> unified and efficient interface for high-speed data transfer and memory
>> sharing. This integration is crucial for heterogeneous computing
>> environments, where accelerators such as GPUs and other specialized
>> processors are used to handle intensive workloads.
>>
>> Goal
>> ====
>>
>> Although CXL is built upon the PCI layers, passing through a CXL type-2
>> device differs from passing through a PCI device according to the CXL
>> specification[1]:
>>
>> - CXL type-2 device initialization. A CXL type-2 device requires an
>> additional initialization sequence besides the PCI device
>> initialization. This sequence can be fairly complicated due to the
>> hierarchy of register interfaces. Thus, a standard CXL type-2 driver
>> initialization sequence provided by the kernel CXL core is used.
>>
>> - Create a CXL region and map it to the VM. A mapping between HPA and
>> DPA (device physical address) needs to be created to access the device
>> memory directly. HDM decoders in the CXL topology need to be configured
>> level by level to manage the mapping. After the region is created, it
>> needs to be mapped to GPA via the virtual HDM decoders configured by
>> the VM.
>>
>> - CXL reset. The CXL device reset is different from the PCI device
>> reset. A dedicated CXL reset sequence is defined by the CXL spec.
>>
>> - Emulating CXL DVSECs. The CXL spec defines a set of DVSEC registers
>> in the configuration space for device enumeration and device control
>> (e.g., whether a device is capable of CXL.mem/CXL.cache, and
>> enabling/disabling those capabilities). They are owned by the kernel
>> CXL core, and the VM must not modify them.
>>
>> - Emulating CXL MMIO registers. The CXL spec defines a set of CXL MMIO
>> registers that can sit in a PCI BAR. The location of each register
>> group within the BAR is indicated by the register locator in the CXL
>> DVSECs. These registers are also owned by the kernel CXL core, and some
>> of them need to be emulated.
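
As a quick aside on what the HDM decoder emulation has to model: an HDM
decoder maps a window of host physical addresses onto device physical
addresses. A toy, runnable sketch of that translation (plain C; the field
names are simplified stand-ins rather than the spec-defined register
layout, and only a single non-interleaved decoder is modeled):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Toy model of a single, non-interleaved HDM decoder: it maps a
 * contiguous HPA window onto DPA starting at a programmed offset.
 * Not the spec register layout; just the address math.
 */
struct hdm_decoder {
	uint64_t hpa_base;   /* start of the HPA window */
	uint64_t size;       /* window size in bytes */
	uint64_t dpa_skip;   /* DPA offset the window starts at */
	bool committed;      /* decoder programmed and committed */
};

/* Translate an HPA to a DPA; returns false if the decoder misses. */
static bool hdm_decode(const struct hdm_decoder *d, uint64_t hpa,
		       uint64_t *dpa)
{
	if (!d->committed || hpa < d->hpa_base ||
	    hpa >= d->hpa_base + d->size)
		return false;
	*dpa = d->dpa_skip + (hpa - d->hpa_base);
	return true;
}
```

In the real topology this translation is chained level by level (host
bridge, switch, endpoint), and the virtual decoders seen by the VM add
one more, emulated, level on top.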
>>
>> Design
>> ======
>>
>> To achieve the goals above, the vfio-cxl-core is introduced to host the
>> common routines that a variant driver requires for device passthrough.
>> Similar to the vfio-pci-core, the vfio-cxl-core provides common
>> vfio_device_ops routines for the variant driver to hook into, and
>> performs the CXL routines behind them.
>>
>> Besides, several extra APIs are introduced for the variant driver to
>> provide the necessary information to the kernel CXL core to initialize
>> the CXL device, e.g., the device DPA.
>>
>> CXL is built upon the PCI layers, but with differences. Thus, the aim
>> is to re-use the vfio-pci-core as much as possible, with awareness of
>> operating on a CXL device.
>>
>> A new VFIO device region is introduced to expose the CXL region to
>> userspace. A new CXL VFIO device cap has also been introduced to convey
>> the necessary CXL device information to userspace.
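
To make the layering concrete, here is a minimal runnable model of the
hook-and-fallback pattern described above (plain C; all names are
hypothetical, loosely inspired by vfio_device_ops, and not the actual
vfio API):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical op table, standing in for something like vfio_device_ops. */
struct dev_ops {
	int (*open)(void);
	int (*reset)(void);
};

static int core_open(void)  { return 100; } /* default CXL init path */
static int core_reset(void) { return 200; } /* default CXL reset path */

static const struct dev_ops core_default_ops = {
	.open  = core_open,
	.reset = core_reset,
};

/* A variant driver overrides only the hooks it needs ... */
static int variant_reset(void)
{
	/* a vendor-specific pre-reset quirk would go here ... */
	return core_reset(); /* ... then falls back into the core routine */
}

/* Build the effective op table: variant hooks win, the core fills gaps. */
static struct dev_ops compose_ops(const struct dev_ops *variant)
{
	struct dev_ops ops = core_default_ops;
	if (variant->open)
		ops.open = variant->open;
	if (variant->reset)
		ops.reset = variant->reset;
	return ops;
}
```

The point of the model is only the shape: the variant driver hooks the
entry points it cares about and delegates the common CXL work back to
the core, exactly as vfio-pci variant drivers do with vfio-pci-core.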
>
>
>
> Hi Zhi,
>
>
> As you know, I was confused by this work, but after looking at the
> patchset and thinking about all this, it makes sense now. FWIW, the most
> confusing point was the use of the CXL core inside the VM for creating
> the region, which implies commits to the CXL root switch/complex and any
> other switch in the path. I realize now it will happen, but on emulated
> hardware with no implication for the real one, which was updated with
> any necessary changes, like those commits, by the vfio cxl code in the
> host (the L1 VM in your tests).
>
>
> The only problem I can see with this approach is that the CXL
> initialization is left unconditionally to the hypervisor. I guess most
> of the time this will be fine, but the driver might not always be
> mapping/using the whole CXL mem. I know this could be awkward, but it
> is possible, depending on device state unrelated to CXL itself.

Will these device states be a one-time on/off state, or a runtime
configuration state that a guest needs to poke all the time?

There can be two paths for handling these states in a vendor-specific
variant driver: 1) the vfio_device->fops->open() path, which suits a
one-time on/off state; 2) the vfio_device->fops->{read,write}() path,
i.e., the VM exit -> QEMU -> variant driver path. The vendor-specific
driver can configure the HW based on the register accesses from the
guest.

It would be nice to know more about this, e.g., how many registers the
vendor-specific driver would like to handle, so that the VFIO CXL core
can provide common helpers.
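
The two paths can be modeled roughly like this (plain C; the offsets,
names, and register layout are hypothetical, not a real device map):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define EMULATED_REG_OFF 0x10u /* hypothetical trapped register offset */

struct model_dev {
	bool one_time_enabled;  /* path 1: set once at open() time */
	uint32_t runtime_reg;   /* path 2: updated on each trapped write */
	uint32_t hw_reg;        /* stands in for real, passed-through HW */
};

/* Path 1: one-time on/off state, applied in the open() hook. */
static void dev_open(struct model_dev *d)
{
	d->one_time_enabled = true;
}

/*
 * Path 2: the write() hook traps accesses to one emulated register and
 * passes everything else through to the hardware.
 */
static void dev_write(struct model_dev *d, uint32_t off, uint32_t val)
{
	if (off == EMULATED_REG_OFF)
		d->runtime_reg = val;  /* emulated: never hits HW as-is */
	else
		d->hw_reg = val;       /* passthrough */
}
```

If the vendor-specific register set turns out to be small and mostly
one-time, path 1 alone may be enough; a large runtime-poked set is what
would justify common trap-and-emulate helpers in the VFIO CXL core.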

> In other words, this approach
> assumes beforehand something which could not be true. What I had in mind
> was to have VM exits for any action on CXL configuration on behalf of
> that device/driver inside the VM.
>

Initially, this was an idea from Dan. I think this would be a good topic
for the next CXL open-source collaboration meeting. Kevin also commented
on this.
>
> This is all more problematic with CXL.cache, and I think the same
> approach can not be followed. I'm writing a document trying to share all
> my concerns about CXL.cache and DMA/IOMMU mappings, and I will cover
> this for sure. As a quick note, while DMA/IOMMU has no limitations
> regarding the amount of memory to use, it is unlikely the same can be
> done due to scarce host snoop cache resources, therefore the CXL.cache
> mappings will likely need to be explicitly done by the driver and
> approved by the CXL core (along with DMA/IOMMU), and for a driver inside
> a VM that implies VM exits.
>

Good to hear. Please CC me as well. Many thanks.
>
> Thanks.
>
> Alejandro.
>
>> Patches
>> =======
>>
>> The patches are based on the cxl-type2 support RFCv2 patchset[2]. Will
>> rebase them to V3 once the cxl-type2 support v3 patch review is done.
>>
>> PATCH 1-3: Expose the necessary routines required by vfio-cxl.
>>
>> PATCH 4: Introduce the preludes of vfio-cxl, including CXL device
>> initialization, CXL region creation.
>>
>> PATCH 5: Expose the CXL region to the userspace.
>>
>> PATCH 6-7: Prepare to emulate the HDM decoder registers.
>>
>> PATCH 8: Emulate the HDM decoder registers.
>>
>> PATCH 9: Tweak vfio-cxl to be aware of working on a CXL device.
>>
>> PATCH 10: Tell vfio-pci-core to emulate CXL DVSECs.
>>
>> PATCH 11: Expose the CXL device information that userspace needs.
>>
>> PATCH 12: An example variant driver to demonstrate the usage of
>> vfio-cxl-core from the perspective of the VFIO variant driver.
>>
>> PATCH 13: A workaround that needs suggestions.
>>
>> Test
>> ====
>>
>> To test the patches and hack around, a virtual passthrough setup with
>> nested virtualization is used.
>>
>> The host QEMU emulates a CXL type-2 accel device based on Ira's
>> patches, with changes to emulate HDM decoders.
>>
>> While running vfio-cxl in the L1 guest, an example VFIO variant driver
>> is used to attach to the QEMU CXL accel device.
>>
>> The L2 guest can be booted via the QEMU with the vfio-cxl support in the
>> VFIOStub.
>>
>> In the L2 guest, a dummy CXL device driver is provided to attach to the
>> virtual passthrough device.
>>
>> The dummy CXL type-2 device driver can be successfully loaded with the
>> kernel CXL core type-2 support; it creates a CXL region by requesting
>> the CXL core to allocate HPA and DPA and to configure the HDM decoders.
>>
>> To make sure everyone can test the patches, the kernel configs of L1
>> and L2 are provided in the repos; the required kernel command-line
>> params and QEMU command line can be found in the demonstration
>> video.[4]
>>
>> Repos
>> =====
>>
>> QEMU host:
>> https://github.com/zhiwang-nvidia/qemu/tree/zhi/vfio-cxl-qemu-host
>> L1 Kernel:
>> https://github.com/zhiwang-nvidia/linux/tree/zhi/vfio-cxl-l1-kernel-rfc
>> L1 QEMU:
>> https://github.com/zhiwang-nvidia/qemu/tree/zhi/vfio-cxl-qemu-l1-rfc
>> L2 Kernel: https://github.com/zhiwang-nvidia/linux/tree/zhi/vfio-cxl-l2
>>
>> [1] https://computeexpresslink.org/cxl-specification/
>> [2]
>> https://lore.kernel.org/netdev/20240715172835.24757-1-alejandro.lucero-palau@amd.com/T/
>> [3]
>> https://patchew.org/QEMU/20230517-rfc-type2-dev-v1-0-6eb2e470981b@intel.com/
>> [4] https://youtu.be/zlk_ecX9bxs?si=hc8P58AdhGXff3Q7
>>
>> Feedback expected
>> =================
>>
>> - Architecture level between vfio-pci-core and vfio-cxl-core.
>> - Variant driver requirements from more hardware vendors.
>> - vfio-cxl-core UABI to QEMU.
>>
>> Moving forward
>> =============
>>
>> - Rebase the patches on top of Alejandro's PATCH v3.
>> - Get Ira's type-2 emulated device patch into upstream, as both the
>> CXL folks and the RH folks came to talk and expect this. I had a chat
>> with Ira and he expected me to take it over. Will start a discussion
>> in the CXL discord group for the design of V1.
>> - Sparse map in vfio-cxl-core.
>>
>> Known issues
>> ============
>>
>> - Teardown path. Missing teardown paths have been implemented in
>> Alejandro's PATCH v3. It should be solved after the rebase.
>>
>> - Power down the L1 guest instead of rebooting it. The QEMU reset
>> handler is missing in Ira's patch. When rebooting L1, many CXL
>> registers are not reset. This will be addressed in the formal review
>> of the emulated CXL type-2 device support.
>>
>> Zhi Wang (13):
>> cxl: allow a type-2 device not to have memory device registers
>> cxl: introduce cxl_get_hdm_info()
>> cxl: introduce cxl_find_comp_reglock_offset()
>> vfio: introduce vfio-cxl core preludes
>> vfio/cxl: expose CXL region to the userspace via a new VFIO device
>> region
>> vfio/pci: expose vfio_pci_rw()
>> vfio/cxl: introduce vfio_cxl_core_{read, write}()
>> vfio/cxl: emulate HDM decoder registers
>> vfio/pci: introduce CXL device awareness
>> vfio/pci: emulate CXL DVSEC registers in the configuration space
>> vfio/cxl: introduce VFIO CXL device cap
>> vfio/cxl: VFIO variant driver for QEMU CXL accel device
>> vfio/cxl: workaround: don't take resource region when cxl is enabled.
>>
>> drivers/cxl/core/pci.c | 28 ++
>> drivers/cxl/core/regs.c | 22 +
>> drivers/cxl/cxl.h | 1 +
>> drivers/cxl/cxlpci.h | 3 +
>> drivers/cxl/pci.c | 14 +-
>> drivers/vfio/pci/Kconfig | 6 +
>> drivers/vfio/pci/Makefile | 5 +
>> drivers/vfio/pci/cxl-accel/Kconfig | 6 +
>> drivers/vfio/pci/cxl-accel/Makefile | 3 +
>> drivers/vfio/pci/cxl-accel/main.c | 116 +++++
>> drivers/vfio/pci/vfio_cxl_core.c | 647 ++++++++++++++++++++++++++++
>> drivers/vfio/pci/vfio_pci_config.c | 10 +
>> drivers/vfio/pci/vfio_pci_core.c | 79 +++-
>> drivers/vfio/pci/vfio_pci_rdwr.c | 8 +-
>> include/linux/cxl_accel_mem.h | 3 +
>> include/linux/cxl_accel_pci.h | 6 +
>> include/linux/vfio_pci_core.h | 53 +++
>> include/uapi/linux/vfio.h | 14 +
>> 18 files changed, 992 insertions(+), 32 deletions(-)
>> create mode 100644 drivers/vfio/pci/cxl-accel/Kconfig
>> create mode 100644 drivers/vfio/pci/cxl-accel/Makefile
>> create mode 100644 drivers/vfio/pci/cxl-accel/main.c
>> create mode 100644 drivers/vfio/pci/vfio_cxl_core.c
>>