From: Nicolin Chen <nicolinc@nvidia.com>
To: <peter.maydell@linaro.org>, <shannon.zhaosl@gmail.com>,
	<mst@redhat.com>,  <imammedo@redhat.com>, <anisinha@redhat.com>,
	<eric.auger@redhat.com>, <peterx@redhat.com>
Cc: <qemu-arm@nongnu.org>, <qemu-devel@nongnu.org>, <jgg@nvidia.com>,
	<shameerali.kolothum.thodi@huawei.com>, <jasowang@redhat.com>
Subject: [PATCH RFCv1 00/10] hw/arm/virt: Add multiple nested SMMUs
Date: Tue, 25 Jun 2024 17:28:27 -0700
Message-ID: <cover.1719361174.git.nicolinc@nvidia.com>

Hi all,

This is a draft solution that adds multiple nested SMMU instances to a
VM. The main goal of the series is to collect opinions and figure out
a reasonable solution that fits our needs.

I understand that there are concerns regarding this support, based on
our previous discussion:
https://lore.kernel.org/all/ZEcT%2F7erkhHDaNvD@Asurada-Nvidia/

Yet, some follow-up discussion on the kernel mailing list has shifted
the direction of regular SMMU nesting toward potentially having
multiple vSMMU instances as well:
https://lore.kernel.org/all/20240611121756.GT19897@nvidia.com/

I will summarize all the points in the following paragraphs:

[ Why do we need multiple nested SMMUs? ]
1, This is a must-have feature for NVIDIA's Grace SoC to support its
   CMDQV (an extension HW for the SMMU). It allows a command queue HW
   to be assigned dedicatedly to a VM, which then controls it via an
   mmap'd MMIO page:
   https://lore.kernel.org/all/f00da8df12a154204e53b343b2439bf31517241f.1712978213.git.nicolinc@nvidia.com/
   Each Grace SoC has 5 SMMUs (i.e. 5 CMDQVs), meaning there can be 5
   MMIO pages. If QEMU only supports one vSMMU and all passthrough
   devices attach to that one shared vSMMU, it technically cannot mmap
   all 5 MMIO pages, nor assign devices to use the corresponding pages.
2, This is optional for nested SMMU, and essentially a design choice
   between a single-vSMMU design and a multiple-vSMMU design. Here are
   the pros and cons:
   + Pros for single vSMMU design
     a) It is easy and clean, by all means.
   - Cons for single vSMMU design
     b) It can have complications if underlying pSMMUs are different.
     c) Emulated devices might have to be added to the nested SMMU,
        since "iommu=nested-smmuv3" enables nesting for the whole VM.
        This means the vSMMU instance has to act at the same time as
        both a nested SMMU and a para-virt SMMU.
     d) IOTLB inefficiency. Since devices behind different pSMMUs are
        attached to a single vSMMU, the vSMMU code traps invalidation
        commands in a shared guest CMDQ and must dispatch them
        correctly to the pSMMUs, either by broadcasting or by walking
        through a lookup table. Note that a command not tied to any
        hwpt or device still has to be broadcast (the sketch after
        this list illustrates the dispatch difference).
   + Pros for multiple vSMMU design
     e) Emulated devices can be isolated from any nested SMMU.
     f) Cache invalidation commands are always forwarded to the
        corresponding pSMMU, avoiding the overhead of the vSMMU
        walking through a lookup table or broadcasting.
     g) It adapts to CMDQV very easily.
   - Cons for multiple vSMMU design
     h) Complications in the VIRT and IORT design
     i) Difficulty supporting device hotplug
     j) Potential to run out of PCI bus numbers, as QEMU doesn't
        support multiple PCI domains.
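
To make the dispatch difference in d) and f) more concrete, here is a
minimal C sketch. Everything in it (the Cmd struct, sid_to_psmmu() and
psmmu_issue()) is hypothetical and for illustration only; the real
vSMMU code traps commands from the guest CMDQ and issues them to the
host through iommufd.

  /*
   * Hypothetical sketch (not the series' actual code): how a trapped
   * invalidation command gets dispatched in the two designs.
   */
  #include <stdio.h>

  #define NUM_PSMMU 3

  typedef struct {
      int has_sid;        /* command carries a StreamID (device-scoped)? */
      unsigned int sid;   /* guest StreamID, if any */
  } Cmd;

  /* Stand-in for a SID->pSMMU lookup-table walk */
  static int sid_to_psmmu(unsigned int sid) { return sid % NUM_PSMMU; }

  /* Stand-in for issuing the command to a physical SMMU via iommufd */
  static void psmmu_issue(int psmmu, const Cmd *c)
  {
      printf("pSMMU%d: invalidate (sid=%#x)\n", psmmu, c->sid);
  }

  /* Single-vSMMU design: one shared CMDQ for devices behind all pSMMUs */
  static void single_vsmmu_dispatch(const Cmd *c)
  {
      if (c->has_sid) {
          psmmu_issue(sid_to_psmmu(c->sid), c);  /* lookup-table walk */
      } else {
          for (int i = 0; i < NUM_PSMMU; i++) {  /* must broadcast */
              psmmu_issue(i, c);
          }
      }
  }

  /* Multi-vSMMU design: each vSMMU fronts exactly one pSMMU, so every
   * trapped command already has an unambiguous destination. */
  static void multi_vsmmu_dispatch(int own_psmmu, const Cmd *c)
  {
      psmmu_issue(own_psmmu, c);
  }

  int main(void)
  {
      Cmd dev_cmd = { .has_sid = 1, .sid = 0xee00 };
      Cmd global_cmd = { .has_sid = 0 };

      single_vsmmu_dispatch(&dev_cmd);     /* 1 command, after a lookup */
      single_vsmmu_dispatch(&global_cmd);  /* 3 commands (broadcast)    */
      multi_vsmmu_dispatch(1, &dev_cmd);   /* 1 command, no lookup      */
      return 0;
  }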

[ How is it implemented with this series? ]
 * As an experimental series, this is all done in the VIRT and ACPI code.
 * Scan the iommu sysfs nodes and build an SMMU node list (PATCH-03).
 * Create one PCIe Expander Bridge (+ one vSMMU) per pSMMU, allocating
   bus numbers downward from the top (0xFF) and reserving an interval
   of root ports for each bridge (PATCH-05). E.g. a host system with
   three pSMMUs:
   [ pcie.0 bus ]
   -----------------------------------------------------------------------------
           |                  |                   |                  |
   -----------------  ------------------  ------------------  ------------------
   | emulated devs |  | smmu_bridge.e5 |  | smmu_bridge.ee |  | smmu_bridge.f7 |
   -----------------  ------------------  ------------------  ------------------
 * Loop the vfio-pci devices over the SMMU node list, assign them
   automatically in the VIRT code to the corresponding smmu_bridges,
   and then attach them by creating root ports (PATCH-06):
   [ pcie.0 bus ]
   -----------------------------------------------------------------------------
           |                  |                   |                  |
   -----------------  ------------------  ------------------  ------------------
   | emulated devs |  | smmu_bridge.e5 |  | smmu_bridge.ee |  | smmu_bridge.f7 |
   -----------------  ------------------  ------------------  ------------------
                                                  |
                                          ----------------   -----------
                                          | root_port.ef |---| PCI dev |
                                          ----------------   -----------
 * Set the "pcie.0" root bus to iommu bypass, so that its entire ID
   space is directed to the ITS in the IORT. If a vfio-pci device
   chooses to bypass 2-stage translation, it can be added to "pcie.0"
   (PATCH-07):
     --------------build_iort: its_idmaps
     build_iort_id_mapping: input_base=0x0, id_count=0xe4ff, out_ref=0x30
 * Map the IDs of the smmu_bridges to the corresponding vSMMUs
   (PATCH-09); a sketch at the end of this section reproduces the
   bus-number and ID-mapping math:
     --------------build_iort: smmu_idmaps
     build_iort_id_mapping: input_base=0xe500, id_count=0x8ff, out_ref=0x48
     build_iort_id_mapping: input_base=0xee00, id_count=0x8ff, out_ref=0xa0
     build_iort_id_mapping: input_base=0xf700, id_count=0x8ff, out_ref=0xf8
 * Finally, "lspci -tv" in the guest looks like this:
     -+-[0000:ee]---00.0-[ef]----00.0  [vfio-pci passthrough]
      \-[0000:00]-+-00.0  Red Hat, Inc. QEMU PCIe Host bridge
                  +-01.0  Red Hat, Inc. QEMU PCIe Expander bridge
                  +-02.0  Red Hat, Inc. QEMU PCIe Expander bridge
                  +-03.0  Red Hat, Inc. QEMU PCIe Expander bridge
                  +-04.0  Red Hat, Inc. QEMU NVM Express Controller [emulated]
                  \-05.0  Intel Corporation 82540EM Gigabit Ethernet [emulated]
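
As a footnote to the numbers above: the following self-contained
sketch reproduces the bus-number allocation and the ID-mapping tables
for a host with three pSMMUs, assuming an interval of 9 bus numbers
per smmu_bridge (the bridge bus plus 8 root ports) allocated downward
from 0xFF. This is illustrative arithmetic inferred from the example
output, not the actual code in PATCH-05/09.

  #include <stdio.h>

  #define NUM_PSMMU      3
  #define BUSES_PER_SMMU 9   /* smmu_bridge bus + 8 root ports (assumed) */

  int main(void)
  {
      /* Allocate bridge bus numbers downward from the top (0xFF) */
      unsigned first_bridge_bus = 0x100 - NUM_PSMMU * BUSES_PER_SMMU;

      /* pcie.0 keeps every ID below the first bridge -> ITS idmap */
      printf("its_idmap:  input_base=0x0, id_count=%#x\n",
             first_bridge_bus * 0x100 - 1);                /* 0xe4ff */

      /* One SMMU idmap per smmu_bridge, covering its 9-bus interval */
      for (unsigned i = 0; i < NUM_PSMMU; i++) {
          unsigned bus = first_bridge_bus + i * BUSES_PER_SMMU;
          printf("smmu_idmap: input_base=%#x, id_count=%#x\n",
                 bus << 8, BUSES_PER_SMMU * 0x100 - 1);    /* 0x8ff */
      }
      return 0;
  }

Running it prints the same input_base/id_count values as the
build_iort trace above (the out_ref values are IORT node offsets and
are not reproduced here).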

[ Topics for discussion ]
 * Some of the bits can be moved to backends/iommufd.c, e.g.
     -object iommufd,id=iommufd0,[nesting=smmu3,[max-hotplugs=1]]
   I was also hoping that the vfio-pci device could take the iommufd
   BE pointer so it can redirect the PCI bus, yet that seems to be
   more complicated than I thought...
 * Possibility of adding nesting support for vfio-pci-nohotplug only?
   The kernel uAPI (even for nesting cap detection) requires a device
   handle. If a VM boots without a vfio-pci device and then gets one
   hotplugged after boot-to-console, would a vSMMU that has already
   finished a reset cycle need to sync the IDR/IIDR bits and reset
   again?

This series is on GitHub:
https://github.com/nicolinc/qemu/commits/iommufd_multi_vsmmu-rfcv1

Thanks!
Nicolin

Eric Auger (1):
  hw/arm/virt-acpi-build: Add IORT RMR regions to handle MSI nested
    binding

Nicolin Chen (9):
  hw/arm/virt: Add iommufd link to virt-machine
  hw/arm/virt: Get the number of host-level SMMUv3 instances
  hw/arm/virt: Add an SMMU_IO_LEN macro
  hw/arm/virt: Add VIRT_NESTED_SMMU
  hw/arm/virt: Assign vfio-pci devices to nested SMMUs
  hw/arm/virt: Bypass iommu for default PCI bus
  hw/arm/virt-acpi-build: Handle reserved bus number of pxb buses
  hw/arm/virt-acpi-build: Build IORT with multiple SMMU nodes
  hw/arm/virt-acpi-build: Enable ATS for nested SMMUv3

 hw/arm/virt-acpi-build.c | 144 ++++++++++++++++----
 hw/arm/virt.c            | 277 +++++++++++++++++++++++++++++++++++++--
 include/hw/arm/virt.h    |  63 +++++++++
 3 files changed, 449 insertions(+), 35 deletions(-)

-- 
2.43.0



Thread overview: 25+ messages
2024-06-26  0:28 Nicolin Chen [this message]
2024-06-26  0:28 ` [PATCH RFCv1 01/10] hw/arm/virt-acpi-build: Add IORT RMR regions to handle MSI nested binding Nicolin Chen
2024-06-26  0:28 ` [PATCH RFCv1 02/10] hw/arm/virt: Add iommufd link to virt-machine Nicolin Chen
2024-07-09  9:11   ` Eric Auger
2024-07-09 16:59     ` Nicolin Chen
2024-07-09 17:06       ` Eric Auger
2024-07-09 17:18         ` Nicolin Chen
2024-07-10  2:32           ` Duan, Zhenzhong
2024-06-26  0:28 ` [PATCH RFCv1 03/10] hw/arm/virt: Get the number of host-level SMMUv3 instances Nicolin Chen
2024-07-09  9:20   ` Eric Auger
2024-07-09 17:11     ` Nicolin Chen
2024-07-09 17:22       ` Eric Auger
2024-07-09 18:02         ` Nicolin Chen
2024-06-26  0:28 ` [PATCH RFCv1 04/10] hw/arm/virt: Add an SMMU_IO_LEN macro Nicolin Chen
2024-06-26  0:28 ` [PATCH RFCv1 05/10] hw/arm/virt: Add VIRT_NESTED_SMMU Nicolin Chen
2024-07-09 13:26   ` Eric Auger
2024-07-09 17:59     ` Nicolin Chen
2024-07-11 15:48       ` Andrea Bolognani
2024-07-11 17:57         ` Jason Gunthorpe
2024-06-26  0:28 ` [PATCH RFCv1 06/10] hw/arm/virt: Assign vfio-pci devices to nested SMMUs Nicolin Chen
2024-07-09 13:32   ` Eric Auger
2024-06-26  0:28 ` [PATCH RFCv1 07/10] hw/arm/virt: Bypass iommu for default PCI bus Nicolin Chen
2024-06-26  0:28 ` [PATCH RFCv1 08/10] hw/arm/virt-acpi-build: Handle reserved bus number of pxb buses Nicolin Chen
2024-06-26  0:28 ` [PATCH RFCv1 09/10] hw/arm/virt-acpi-build: Build IORT with multiple SMMU nodes Nicolin Chen
2024-06-26  0:28 ` [PATCH RFCv1 10/10] hw/arm/virt-acpi-build: Enable ATS for nested SMMUv3 Nicolin Chen
