From: Zhi Wang <zhiw@nvidia.com>
To: Alejandro Lucero Palau <alucerop@amd.com>,
"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
"linux-cxl@vger.kernel.org" <linux-cxl@vger.kernel.org>
Cc: "alex.williamson@redhat.com" <alex.williamson@redhat.com>,
"kevin.tian@intel.com" <kevin.tian@intel.com>,
Jason Gunthorpe <jgg@nvidia.com>,
"alison.schofield@intel.com" <alison.schofield@intel.com>,
"dan.j.williams@intel.com" <dan.j.williams@intel.com>,
"dave.jiang@intel.com" <dave.jiang@intel.com>,
"dave@stgolabs.net" <dave@stgolabs.net>,
"jonathan.cameron@huawei.com" <jonathan.cameron@huawei.com>,
"ira.weiny@intel.com" <ira.weiny@intel.com>,
"vishal.l.verma@intel.com" <vishal.l.verma@intel.com>,
Andy Currid <ACurrid@nvidia.com>, Neo Jia <cjia@nvidia.com>,
Surath Mitra <smitra@nvidia.com>,
Ankit Agrawal <ankita@nvidia.com>,
Aniket Agashe <aniketa@nvidia.com>,
Kirti Wankhede <kwankhede@nvidia.com>,
"Tarun Gupta (SW-GPU)" <targupta@nvidia.com>,
"zhiwang@kernel.org" <zhiwang@kernel.org>
Subject: Re: [RFC 00/13] vfio: introduce vfio-cxl to support CXL type-2 accelerator passthrough
Date: Fri, 27 Sep 2024 07:38:17 +0000 [thread overview]
Message-ID: <8beff9bc-9d60-4e56-ae25-b25755ecd38f@nvidia.com> (raw)
In-Reply-To: <4230fba5-030c-49ef-799e-f4138b1c9f7d@amd.com>
On 25/09/2024 12.11, Alejandro Lucero Palau wrote:
>
> On 9/20/24 23:34, Zhi Wang wrote:
>> Hi folks:
>>
>> As promised at LPC, here is everything you need (patches, repos, guiding
>> video, kernel config) to build an environment to test the vfio-cxl-core.
>>
>> Thanks so much for the discussions! Enjoy and see you in the next one.
>>
>> Background
>> ==========
>>
>> Compute Express Link (CXL) is an open standard interconnect built upon
>> the industry-standard PCI layers to enhance the performance and efficiency
>> of data centers by enabling high-speed, low-latency communication between
>> CPUs and various types of devices such as accelerators and memory.
>>
>> It supports three key protocols: CXL.io as the control protocol, CXL.cache
>> as the cache-coherent host-device data transfer protocol, and CXL.mem as
>> the memory expansion protocol. CXL Type 2 devices leverage all three
>> protocols to seamlessly integrate with host CPUs, providing a unified and
>> efficient interface for high-speed data transfer and memory sharing. This
>> integration is crucial for heterogeneous computing environments where
>> accelerators, such as GPUs and other specialized processors, are used to
>> handle intensive workloads.
>>
>> Goal
>> ====
>>
>> Although CXL is built upon the PCI layers, passing through a CXL type-2
>> device can be different from passing through a PCI device according to
>> the CXL specification[1]:
>>
>> - CXL type-2 device initialization. A CXL type-2 device requires an
>> additional initialization sequence besides the PCI device initialization.
>> CXL type-2 device initialization can be pretty complicated due to its
>> hierarchy of register interfaces. Thus, a standard CXL type-2 driver
>> initialization sequence provided by the kernel CXL core is used.
>>
>> - Create a CXL region and map it to the VM. A mapping between HPA and DPA
>> (Device PA) needs to be created to access the device memory directly. HDM
>> decoders in the CXL topology need to be configured level by level to
>> manage the mapping. After the region is created, it needs to be mapped to
>> GPA in the virtual HDM decoders configured by the VM.
>>
>> - CXL reset. The CXL device reset is different from the PCI device reset.
>> A CXL reset sequence is introduced by the CXL spec.
>>
>> - Emulating CXL DVSECs. The CXL spec defines a set of DVSEC registers in
>> the configuration space for device enumeration and device control (e.g.,
>> whether a device is capable of CXL.mem/CXL.cache, and enabling/disabling
>> those capabilities). They are owned by the kernel CXL core, and the VM
>> cannot modify them.
>>
>> - Emulating CXL MMIO registers. The CXL spec defines a set of CXL MMIO
>> registers that can sit in a PCI BAR. The location of the register groups
>> within the PCI BAR is indicated by the register locator in the CXL DVSECs.
>> They are also owned by the kernel CXL core. Some of them need to be
>> emulated.
>>
>> Design
>> ======
>>
>> To achieve the purpose above, the vfio-cxl-core is introduced to host the
>> common routines that variant drivers require for device passthrough.
>> Similar to vfio-pci-core, the vfio-cxl-core provides common routines of
>> vfio_device_ops for the variant driver to hook and perform the CXL
>> routines behind it.
>>
>> Besides, several extra APIs are introduced for the variant driver to
>> provide the necessary information to the kernel CXL core to initialize
>> the CXL device, e.g., the device DPA.
>>
>> CXL is built upon the PCI layers but with differences. Thus, vfio-pci-core
>> is re-used as much as possible, with awareness of operating on a CXL
>> device.
>>
>> A new VFIO device region is introduced to expose the CXL region to the
>> userspace. A new CXL VFIO device cap has also been introduced to convey
>> the necessary CXL device information to the userspace.
>
>
>
> Hi Zhi,
>
>
> As you know, I was confused by this work, but after looking at the
> patchset and thinking about all this, it makes sense now. FWIW, the most
> confusing point was using the CXL core inside the VM for creating the
> region, which implies commits to the CXL root switch/complex and any other
> switch in the path. I realize now it will happen, but on emulated
> hardware with no impact on the real one, which was already updated with
> any necessary changes, like those commits, by the vfio-cxl code in the
> host (L1 VM in your tests).
>
>
> The only problem I can see with this approach is that the CXL
> initialization is left unconditionally to the hypervisor. I guess most of
> the time it will be fine, but the driver might not always be mapping/using
> the whole CXL mem. I know this could be awkward, but it is possible
> depending on device state unrelated to CXL itself.

Will this device state be a one-time on/off state or a runtime
configuration state that a guest needs to poke all the time?

There can be two paths for handling these states in a vendor-specific
variant driver: 1) the vfio_device_ops->open_device() path, which suits a
one-time on/off state; 2) the vfio_device_ops->{read, write}() path, i.e.,
the VM exit -> QEMU -> variant driver path, where the vendor-specific
driver can configure the HW based on the register accesses from the guest
(see the rough sketch below).

It would be nice to know more about this, like how many registers the
vendor-specific driver would like to handle, so that the VFIO CXL core can
provide common helpers.
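
To make the two paths concrete, below is a rough, untested sketch of how a
vendor-specific variant driver could hook them. It assumes the
vfio_cxl_core_{read, write}() helpers from PATCH 7 keep the same signatures
as vfio_pci_core_{read, write}(); the mydev_* names and register offsets
are invented purely for illustration.

#include <linux/vfio.h>
#include <linux/vfio_pci_core.h>

/* Hypothetical vendor registers, for illustration only. */
#define MYDEV_ONE_SHOT_CTRL	0x40	/* one-time on/off knob */
#define MYDEV_RUNTIME_CTRL	0x48	/* runtime configuration */

static int mydev_open_device(struct vfio_device *core_vdev)
{
	struct vfio_pci_core_device *vdev =
		container_of(core_vdev, struct vfio_pci_core_device, vdev);
	int ret;

	ret = vfio_pci_core_enable(vdev);
	if (ret)
		return ret;

	/*
	 * Path 1: one-time on/off state. Program MYDEV_ONE_SHOT_CTRL here,
	 * before the guest starts touching the device.
	 */

	vfio_pci_core_finish_enable(vdev);
	return 0;
}

static ssize_t mydev_write(struct vfio_device *core_vdev,
			   const char __user *buf, size_t count, loff_t *ppos)
{
	/*
	 * Path 2: the VM exit -> QEMU -> variant driver path. Snoop the
	 * region index and offset, handle vendor-owned runtime registers,
	 * then hand the access to the vfio-cxl core.
	 */
	if (VFIO_PCI_OFFSET_TO_INDEX(*ppos) == VFIO_PCI_BAR0_REGION_INDEX &&
	    (*ppos & VFIO_PCI_OFFSET_MASK) == MYDEV_RUNTIME_CTRL) {
		/* translate/emulate the guest access as needed */
	}

	return vfio_cxl_core_write(core_vdev, buf, count, ppos);
}

static const struct vfio_device_ops mydev_cxl_ops = {
	.name		= "mydev-vfio-cxl",
	.open_device	= mydev_open_device,
	.write		= mydev_write,
	/* .read, .close_device, .ioctl, .mmap, etc. as in PATCH 12 */
};

If that shape holds, the common helper in the VFIO CXL core could simply be
a table of (region, offset, handler) entries that the variant driver
registers at probe time, so each vendor does not re-implement the offset
decoding above.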

> In other words, this approach assumes beforehand something which may not
> be true. What I had in mind was to have VM exits for any action on CXL
> configuration on behalf of that device/driver inside the VM.
>
Initially, this was an idea from Dan. I think this would be a good topic
for the next CXL open-source collaboration meeting. Kevin also commented
on this.
>
> This is all more problematic with CXL.cache, and I think the same
> approach cannot be followed. I'm writing a document trying to share all
> my concerns about CXL.cache and DMA/IOMMU mappings, and I will cover
> this for sure. As a quick note, while DMA/IOMMU has no limitations
> regarding the amount of memory to use, it is unlikely the same can be
> done for CXL.cache due to scarce host snoop cache resources; therefore,
> the CXL.cache mappings will likely need to be explicitly done by the
> driver and approved by the CXL core (along with DMA/IOMMU), and for a
> driver inside a VM that implies VM exits.
>
Good to hear. Please CC me as well. Many thanks.
>
> Thanks.
>
> Alejandro.
>
>> Patches
>> =======
>>
>> The patches are based on the cxl-type2 support RFCv2 patchset[2]. Will
>> rebase them to V3 once the cxl-type2 support v3 patch review is done.
>>
>> PATCH 1-3: Expose the necessary routines required by vfio-cxl.
>>
>> PATCH 4: Introduce the preludes of vfio-cxl, including CXL device
>> initialization and CXL region creation.
>>
>> PATCH 5: Expose the CXL region to the userspace.
>>
>> PATCH 6-7: Prepare to emulate the HDM decoder registers.
>>
>> PATCH 8: Emulate the HDM decoder registers.
>>
>> PATCH 9: Tweak vfio-cxl to be aware of working on a CXL device.
>>
>> PATCH 10: Tell vfio-pci-core to emulate CXL DVSECs.
>>
>> PATCH 11: Expose the CXL device information that userspace needs.
>>
>> PATCH 12: An example variant driver to demonstrate the usage of
>> vfio-cxl-core from the perspective of the VFIO variant driver.
>>
>> PATCH 13: A workaround needs suggestions.
>>
>> Test
>> ====
>>
>> To test the patches and hack around, a virtual passthrough with nested
>> virtualization approach is used.
>>
>> The host QEMU emulates a CXL type-2 accel device based on Ira's patches
>> with the changes to emulate HDM decoders.
>>
>> While running vfio-cxl in the L1 guest, an example VFIO variant
>> driver is used to attach to the QEMU CXL accel device.
>>
>> The L2 guest can be booted via QEMU with the vfio-cxl support in the
>> VFIOStub.
>>
>> In the L2 guest, a dummy CXL device driver is provided to attach to the
>> virtual pass-thru device.
>>
>> The dummy CXL type-2 device driver can successfully be loaded with the
>> kernel CXL core type-2 support, and it creates a CXL region by requesting
>> the CXL core to allocate HPA and DPA and to configure the HDM decoders.
>>
>> To make sure everyone can test the patches, the kernel configs of L1 and
>> L2 are provided in the repos; the required kernel command params and
>> QEMU command line can be found in the demonstration video.[4]
>>
>> Repos
>> =====
>>
>> QEMU host:
>> https://github.com/zhiwang-nvidia/qemu/tree/zhi/vfio-cxl-qemu-host
>> L1 Kernel:
>> https://github.com/zhiwang-nvidia/linux/tree/zhi/vfio-cxl-l1-kernel-rfc
>> L1 QEMU:
>> https://github.com/zhiwang-nvidia/qemu/tree/zhi/vfio-cxl-qemu-l1-rfc
>> L2 Kernel: https://github.com/zhiwang-nvidia/linux/tree/zhi/vfio-cxl-l2
>>
>> [1] https://computeexpresslink.org/cxl-specification/
>> [2]
>> https://lore.kernel.org/netdev/20240715172835.24757-1-alejandro.lucero-palau@amd.com/T/
>> [3]
>> https://patchew.org/QEMU/20230517-rfc-type2-dev-v1-0-6eb2e470981b@intel.com/
>> [4] https://youtu.be/zlk_ecX9bxs?si=hc8P58AdhGXff3Q7
>>
>> Feedback expected
>> =================
>>
>> - Architecture level between vfio-pci-core and vfio-cxl-core.
>> - Variant driver requirements from more hardware vendors.
>> - vfio-cxl-core UABI to QEMU.
>>
>> Moving forward
>> ==============
>>
>> - Rebase the patches on top of Alejandro's PATCH v3.
>> - Get Ira's type-2 emulated device patch into upstream, as both the CXL
>> folks and the RH folks came to talk and expect this. I had a chat with
>> Ira and he expected me to take it over. Will start a discussion in the
>> CXL discord group for the design of V1.
>> - Sparse map in vfio-cxl-core.
>>
>> Known issues
>> ============
>>
>> - Teardown path. The missing teardown paths have been implemented in
>> Alejandro's PATCH v3. This should be solved after the rebase.
>>
>> - Power down the L1 guest instead of rebooting it. The QEMU reset handler
>> is missing in Ira's patch, so when rebooting L1, many CXL registers are
>> not reset. This will be addressed in the formal review of the emulated
>> CXL type-2 device support.
>>
>> Zhi Wang (13):
>> cxl: allow a type-2 device not to have memory device registers
>> cxl: introduce cxl_get_hdm_info()
>> cxl: introduce cxl_find_comp_reglock_offset()
>> vfio: introduce vfio-cxl core preludes
>> vfio/cxl: expose CXL region to the userspace via a new VFIO device
>> region
>> vfio/pci: expose vfio_pci_rw()
>> vfio/cxl: introduce vfio_cxl_core_{read, write}()
>> vfio/cxl: emulate HDM decoder registers
>> vfio/pci: introduce CXL device awareness
>> vfio/pci: emulate CXL DVSEC registers in the configuration space
>> vfio/cxl: introduce VFIO CXL device cap
>> vfio/cxl: VFIO variant driver for QEMU CXL accel device
>> vfio/cxl: workaround: don't take resource region when cxl is enabled.
>>
>> drivers/cxl/core/pci.c | 28 ++
>> drivers/cxl/core/regs.c | 22 +
>> drivers/cxl/cxl.h | 1 +
>> drivers/cxl/cxlpci.h | 3 +
>> drivers/cxl/pci.c | 14 +-
>> drivers/vfio/pci/Kconfig | 6 +
>> drivers/vfio/pci/Makefile | 5 +
>> drivers/vfio/pci/cxl-accel/Kconfig | 6 +
>> drivers/vfio/pci/cxl-accel/Makefile | 3 +
>> drivers/vfio/pci/cxl-accel/main.c | 116 +++++
>> drivers/vfio/pci/vfio_cxl_core.c | 647 ++++++++++++++++++++++++++++
>> drivers/vfio/pci/vfio_pci_config.c | 10 +
>> drivers/vfio/pci/vfio_pci_core.c | 79 +++-
>> drivers/vfio/pci/vfio_pci_rdwr.c | 8 +-
>> include/linux/cxl_accel_mem.h | 3 +
>> include/linux/cxl_accel_pci.h | 6 +
>> include/linux/vfio_pci_core.h | 53 +++
>> include/uapi/linux/vfio.h | 14 +
>> 18 files changed, 992 insertions(+), 32 deletions(-)
>> create mode 100644 drivers/vfio/pci/cxl-accel/Kconfig
>> create mode 100644 drivers/vfio/pci/cxl-accel/Makefile
>> create mode 100644 drivers/vfio/pci/cxl-accel/main.c
>> create mode 100644 drivers/vfio/pci/vfio_cxl_core.c
>>
Thread overview: 38+ messages
2024-09-20 22:34 [RFC 00/13] vfio: introduce vfio-cxl to support CXL type-2 accelerator passthrough Zhi Wang
2024-09-20 22:34 ` [RFC 01/13] cxl: allow a type-2 device not to have memory device registers Zhi Wang
2024-09-23 8:01 ` Tian, Kevin
2024-09-23 15:38 ` Dave Jiang
2024-09-24 8:03 ` Zhi Wang
2024-09-20 22:34 ` [RFC 02/13] cxl: introduce cxl_get_hdm_info() Zhi Wang
2024-10-17 15:44 ` Jonathan Cameron
2024-10-19 5:38 ` Zhi Wang
2024-09-20 22:34 ` [RFC 03/13] cxl: introduce cxl_find_comp_reglock_offset() Zhi Wang
2024-09-20 22:34 ` [RFC 04/13] vfio: introduce vfio-cxl core preludes Zhi Wang
2024-10-11 18:33 ` Alex Williamson
2024-09-20 22:34 ` [RFC 05/13] vfio/cxl: expose CXL region to the userspace via a new VFIO device region Zhi Wang
2024-10-11 19:12 ` Alex Williamson
2024-09-20 22:34 ` [RFC 06/13] vfio/pci: expose vfio_pci_rw() Zhi Wang
2024-09-20 22:34 ` [RFC 07/13] vfio/cxl: introduce vfio_cxl_core_{read, write}() Zhi Wang
2024-09-20 22:34 ` [RFC 08/13] vfio/cxl: emulate HDM decoder registers Zhi Wang
2024-09-20 22:34 ` [RFC 09/13] vfio/pci: introduce CXL device awareness Zhi Wang
2024-10-11 20:37 ` Alex Williamson
2024-09-20 22:34 ` [RFC 10/13] vfio/pci: emulate CXL DVSEC registers in the configuration space Zhi Wang
2024-10-11 21:02 ` Alex Williamson
2024-09-20 22:34 ` [RFC 11/13] vfio/cxl: introduce VFIO CXL device cap Zhi Wang
2024-10-11 21:14 ` Alex Williamson
2024-09-20 22:34 ` [RFC 12/13] vfio/cxl: VFIO variant driver for QEMU CXL accel device Zhi Wang
2024-09-20 22:34 ` [RFC 13/13] vfio/cxl: workaround: don't take resource region when cxl is enabled Zhi Wang
2024-09-23 8:00 ` [RFC 00/13] vfio: introduce vfio-cxl to support CXL type-2 accelerator passthrough Tian, Kevin
2024-09-24 8:30 ` Zhi Wang
2024-09-25 13:05 ` Jonathan Cameron
2024-09-27 7:18 ` Zhi Wang
2024-10-04 11:40 ` Jonathan Cameron
2024-10-19 5:30 ` Zhi Wang
2024-10-21 11:07 ` Alejandro Lucero Palau
2024-09-26 6:55 ` Tian, Kevin
2024-09-25 10:11 ` Alejandro Lucero Palau
2024-09-27 7:38 ` Zhi Wang [this message]
2024-09-27 7:38 ` Zhi Wang
2024-10-21 10:49 ` Zhi Wang
2024-10-21 13:10 ` Alejandro Lucero Palau
2024-10-30 11:56 ` Zhi Wang