[PATCH RFCv1 00/10] hw/arm/virt: Add multiple nested SMMUs

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

* [PATCH RFCv1 00/10] hw/arm/virt: Add multiple nested SMMUs
@ 2024-06-26  0:28 Nicolin Chen
  2024-06-26  0:28 ` [PATCH RFCv1 01/10] hw/arm/virt-acpi-build: Add IORT RMR regions to handle MSI nested binding Nicolin Chen
                   ` (9 more replies)
  0 siblings, 10 replies; 25+ messages in thread
From: Nicolin Chen @ 2024-06-26  0:28 UTC (permalink / raw)
  To: peter.maydell, shannon.zhaosl, mst, imammedo, anisinha,
	eric.auger, peterx
  Cc: qemu-arm, qemu-devel, jgg, shameerali.kolothum.thodi, jasowang

Hi all,

This is a draft solution adding multiple nested SMMU instances to VM.
The main goal of the series is to collect opinions, to figure out a
reasonable solution that would fit our needs.

I understood that there are concerns regarding this support, from our
previous discussion:
https://lore.kernel.org/all/ZEcT%2F7erkhHDaNvD@Asurada-Nvidia/

Yet, some followup discussion in the kernel maillist has shifted the
direction of regular SMMU nesting to potentially having multiple vSMMU
instances as well:
https://lore.kernel.org/all/20240611121756.GT19897@nvidia.com/

I will summerize all the dots in the following paragraphs:

[ Why do we need multiple nested SMMUs? ]
1, This is a must feature for NVIDIA's Grace SoC to support its CMDQV
   (an extension HW for SMMU). It allows to assgin a Command Queue HW
   dedicatedly to a VM. Then VM controls it via an mmap'd MMIO page:
   https://lore.kernel.org/all/f00da8df12a154204e53b343b2439bf31517241f.1712978213.git.nicolinc@nvidia.com/
   Each Grace SoC has 5 SMMUs (i.e. 5 CMDQVs), meaning there can be 5
   MMIO pages. If QEMU only supports one vSMMU and all passthrough
   devices attach to one shared vSMMU, it technically cannot mmap 5
   MMIO pages, nor assign devices to use corresponding pages.
2, This is optional for nested SMMU, and essentially a design choice
   between a single vSMMU design and a multiple-vSMMU design. Here're
   the pros and cons:
   + Pros for single vSMMU design
     a) It is easy and clean, by all means.
   - Cons for single vSMMU design
     b) It can have complications if underlying pSMMUs are different.
     c) Emulated devices might have to be added to the nested SMMU,
        since "iommu=nested-smmuv3" enables for the whole VM. This
        means the vSMMU instance has to act at the same time as both
        a nested SMMU and a para-virt SMMU.
     d) IOTLB inefficiency. Since devices behind different pSMMUs are
        attached to a single vSMMU, the vSMMU code traps invalidation
        commands in a shared guest CMDQ, meaning it needs to dispatch
        those commands correctly to pSMMUs, by either a broadcast or
        walking through a lookup table. Note that if a command is not
        tied to any hwpt or device, it still has to be broadcasted.
   + Pros for multiple vSMMU design
     e) Emulated devices can be isolated from any nested SMMU.
     f) Cache invalidation commands will always be forwarded to the
        corresponding pSMMU, reducing the overhead from vSMMU walking
        through a lookup table or broadcasting.
     g) It will adapt CMDQV very easily.
   - Cons for multiple vSMMU diesgn
     h) Complications in VIRT and IORT design
     i) Difficulty to support device hotplugs
     j) Potential of running out of PCI bus number, as QEMU doesn't
        support multiple PCI domains.

[ How is it implemented with this series? ]
 * As an experimental series, this is all done in VIRT and ACPI code.
 * Scan iommu sysfs nodes and build an SMMU node list (PATCH-03).
 * Create one PCIe Expander Bridge (+ one vSMMU) from the top of bus
   number (0xFF) with intervals for root-ports (PATCH-05). E.g. host
   system with three pSMMUs:
   [ pcie.0 bus ]
   -----------------------------------------------------------------------------
           |                  |                   |                  |
   -----------------  ------------------  ------------------  ------------------
   | emulated devs |  | smmu_bridge.e5 |  | smmu_bridge.ee |  | smmu_bridge.f7 |
   -----------------  ------------------  ------------------  ------------------
 * Loop vfio-pci devices against the SMMU node list and assign them
   automatically in the VIRT code to the corresponding smmu_bridges,
   and then attach them by creating a root port (PATCH-06):
   [ pcie.0 bus ]
   -----------------------------------------------------------------------------
           |                  |                   |                  |
   -----------------  ------------------  ------------------  ------------------
   | emulated devs |  | smmu_bridge.e5 |  | smmu_bridge.ee |  | smmu_bridge.f7 |
   -----------------  ------------------  ------------------  ------------------
                                                  |
                                          ----------------   -----------
                                          | root_port.ef |---| PCI dev |
                                          ----------------   -----------
 * Set the "pcie.0" root bus to iommu bypass, so its entire ID space
   will be directed to ITS in IORT. If a vfio-pci device chooses to
   bypass 2-stage translation, it can be added to "pcie.0" (PATCH-07):
     --------------build_iort: its_idmaps
     build_iort_id_mapping: input_base=0x0, id_count=0xe4ff, out_ref=0x30
 * Map IDs of smmu_bridges to corresponding vSMMUs (PATCH-09):
     --------------build_iort: smmu_idmaps
     build_iort_id_mapping: input_base=0xe500, id_count=0x8ff, out_ref=0x48
     build_iort_id_mapping: input_base=0xee00, id_count=0x8ff, out_ref=0xa0
     build_iort_id_mapping: input_base=0xf700, id_count=0x8ff, out_ref=0xf8
 * Finally, "lspci -tv" in the guest looks like this:
     -+-[0000:ee]---00.0-[ef]----00.0  [vfio-pci passthrough]
      \-[0000:00]-+-00.0  Red Hat, Inc. QEMU PCIe Host bridge
                  +-01.0  Red Hat, Inc. QEMU PCIe Expander bridge
                  +-02.0  Red Hat, Inc. QEMU PCIe Expander bridge
                  +-03.0  Red Hat, Inc. QEMU PCIe Expander bridge
                  +-04.0  Red Hat, Inc. QEMU NVM Express Controller [emulated]
                  \-05.0  Intel Corporation 82540EM Gigabit Ethernet [emulated]

[ Topics for discussion ]
 * Some of the bits can be moved to backends/iommufd.c, e.g.
     -object iommufd,id=iommufd0,[nesting=smmu3,[max-hotplugs=1]]
   And I was hoping that the vfio-pci device could take the iommufd
   BE pointer so it can redirect the PCI bus. Yet, seems to be more
   complicated than I thought...
 * Possiblity of adding nesting support for vfio-pci-nohotplug only?
   The kernel uAPI (even for nesting cap detection) requires a dev
   handler. If a VM boots without a vfio-pci and then gets a hotplug
   after boot-to-console, a vSMMU that has already finished a reset
   cycle will need to sync the idr/idrr bits and will have to reset
   again?

This series is on Github:
https://github.com/nicolinc/qemu/commits/iommufd_multi_vsmmu-rfcv1

Thanks!
Nicolin

Eric Auger (1):
  hw/arm/virt-acpi-build: Add IORT RMR regions to handle MSI nested
    binding

Nicolin Chen (9):
  hw/arm/virt: Add iommufd link to virt-machine
  hw/arm/virt: Get the number of host-level SMMUv3 instances
  hw/arm/virt: Add an SMMU_IO_LEN macro
  hw/arm/virt: Add VIRT_NESTED_SMMU
  hw/arm/virt: Assign vfio-pci devices to nested SMMUs
  hw/arm/virt: Bypass iommu for default PCI bus
  hw/arm/virt-acpi-build: Handle reserved bus number of pxb buses
  hw/arm/virt-acpi-build: Build IORT with multiple SMMU nodes
  hw/arm/virt-acpi-build: Enable ATS for nested SMMUv3

 hw/arm/virt-acpi-build.c | 144 ++++++++++++++++----
 hw/arm/virt.c            | 277 +++++++++++++++++++++++++++++++++++++--
 include/hw/arm/virt.h    |  63 +++++++++
 3 files changed, 449 insertions(+), 35 deletions(-)

-- 
2.43.0



^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH RFCv1 01/10] hw/arm/virt-acpi-build: Add IORT RMR regions to handle MSI nested binding
  2024-06-26  0:28 [PATCH RFCv1 00/10] hw/arm/virt: Add multiple nested SMMUs Nicolin Chen
@ 2024-06-26  0:28 ` Nicolin Chen
  2024-06-26  0:28 ` [PATCH RFCv1 02/10] hw/arm/virt: Add iommufd link to virt-machine Nicolin Chen
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 25+ messages in thread
From: Nicolin Chen @ 2024-06-26  0:28 UTC (permalink / raw)
  To: peter.maydell, shannon.zhaosl, mst, imammedo, anisinha,
	eric.auger, peterx
  Cc: qemu-arm, qemu-devel, jgg, shameerali.kolothum.thodi, jasowang

From: Eric Auger <eric.auger@redhat.com>

To handle SMMUv3 nested stage support it is practical to
expose the guest with reserved memory regions (RMRs)
covering the IOVAs used by the host kernel to map
physical MSI doorbells.

Those IOVAs belong to [0x8000000, 0x8100000] matching
MSI_IOVA_BASE and MSI_IOVA_LENGTH definitions in kernel
arm-smmu-v3 driver. This is the window used to allocate
IOVAs matching physical MSI doorbells.

With those RMRs, the guest is forced to use a flat mapping
for this range. Hence the assigned device is programmed
with one IOVA from this range. Stage 1, owned by the guest
has a flat mapping for this IOVA. Stage2, owned by the VMM
then enforces a mapping from this IOVA to the physical
MSI doorbell.

The creation of those RMR nodes only is relevant if nested
stage SMMU is in use, along with VFIO. As VFIO devices can be
hotplugged, all RMRs need to be created in advance. Hence
the patch introduces a new arm virt "nested-smmuv3" iommu type.

ARM DEN 0049E.b IORT specification also mandates that when
RMRs are present, the OS must preserve PCIe configuration
performed by the boot FW. So along with the RMR IORT nodes,
a _DSM function #5, as defined by PCI FIRMWARE SPECIFICATION
EVISION 3.3, chapter 4.6.5 is added to PCIe host bridge
and PCIe expander bridge objects.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
Suggested-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>

---

v2 -> v3:
- Comply with IORT E.d spec. RMR node rev = 3; IORT rev = 5
  With this spec revision, the restriction on number of
  Stream IDs that can be associated with memory ranges in an
  RMR node was removed. So no need anymore to define 1 RMR node per
  SID!
---
 hw/arm/virt-acpi-build.c | 84 +++++++++++++++++++++++++++++++++-------
 hw/arm/virt.c            |  7 +++-
 include/hw/arm/virt.h    |  7 ++++
 3 files changed, 84 insertions(+), 14 deletions(-)

diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
index 94363a6d65..d5e72800f6 100644
--- a/hw/arm/virt-acpi-build.c
+++ b/hw/arm/virt-acpi-build.c
@@ -132,6 +132,14 @@ static void acpi_dsdt_add_pci(Aml *scope, const MemMapEntry *memmap,
         .bus    = vms->bus,
     };
 
+    /*
+     * Nested SMMU requires RMRs for MSI 1-1 mapping, which
+     * require _DSM for PreservingPCI Boot Configurations
+     */
+    if (vms->iommu == VIRT_IOMMU_NESTED_SMMUV3) {
+        cfg.preserve_config = true;
+    }
+
     if (vms->highmem_mmio) {
         cfg.mmio64 = memmap[VIRT_HIGH_PCIE_MMIO];
     }
@@ -216,16 +224,16 @@ static void acpi_dsdt_add_tpm(Aml *scope, VirtMachineState *vms)
  *
  * Note that @id_count gets internally subtracted by one, following the spec.
  */
-static void build_iort_id_mapping(GArray *table_data, uint32_t input_base,
-                                  uint32_t id_count, uint32_t out_ref)
+static void
+build_iort_id_mapping(GArray *table_data, uint32_t input_base,
+                      uint32_t id_count, uint32_t out_ref, uint32_t flags)
 {
     build_append_int_noprefix(table_data, input_base, 4); /* Input base */
     /* Number of IDs - The number of IDs in the range minus one */
     build_append_int_noprefix(table_data, id_count - 1, 4);
     build_append_int_noprefix(table_data, input_base, 4); /* Output base */
     build_append_int_noprefix(table_data, out_ref, 4); /* Output Reference */
-    /* Flags */
-    build_append_int_noprefix(table_data, 0 /* Single mapping (disabled) */, 4);
+    build_append_int_noprefix(table_data, flags, 4); /* Flags */
 }
 
 struct AcpiIortIdMapping {
@@ -267,6 +275,48 @@ static int iort_idmap_compare(gconstpointer a, gconstpointer b)
     return idmap_a->input_base - idmap_b->input_base;
 }
 
+static void
+build_iort_rmr_nodes(GArray *table_data, GArray *smmu_idmaps, int smmu_offset, uint32_t *id) {
+    AcpiIortIdMapping *range;
+    int i;
+
+    for (i = 0; i < smmu_idmaps->len; i++) {
+        range = &g_array_index(smmu_idmaps, AcpiIortIdMapping, i);
+        int bdf = range->input_base;
+
+        /* Table 18 Reserved Memory Range Node */
+
+        build_append_int_noprefix(table_data, 6 /* RMR */, 1); /* Type */
+        /* Length */
+        build_append_int_noprefix(table_data, 28 + ID_MAPPING_ENTRY_SIZE + 20, 2);
+        build_append_int_noprefix(table_data, 3, 1); /* Revision */
+        build_append_int_noprefix(table_data, *id, 4); /* Identifier */
+        /* Number of ID mappings */
+        build_append_int_noprefix(table_data, 1, 4);
+        /* Reference to ID Array */
+        build_append_int_noprefix(table_data, 28, 4);
+
+        /* RMR specific data */
+
+        /* Flags */
+        build_append_int_noprefix(table_data, 0 /* Disallow remapping */, 4);
+        /* Number of Memory Range Descriptors */
+        build_append_int_noprefix(table_data, 1 , 4);
+        /* Reference to Memory Range Descriptors */
+        build_append_int_noprefix(table_data, 28 + ID_MAPPING_ENTRY_SIZE, 4);
+        build_iort_id_mapping(table_data, bdf, range->id_count, smmu_offset, 1);
+
+        /* Table 19 Memory Range Descriptor */
+
+        /* Physical Range offset */
+        build_append_int_noprefix(table_data, 0x8000000, 8);
+        /* Physical Range length */
+        build_append_int_noprefix(table_data, 0x100000, 8);
+        build_append_int_noprefix(table_data, 0, 4); /* Reserved */
+        *id += 1;
+    }
+}
+
 /*
  * Input Output Remapping Table (IORT)
  * Conforms to "IO Remapping Table System Software on ARM Platforms",
@@ -282,17 +332,19 @@ build_iort(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
     GArray *smmu_idmaps = g_array_new(false, true, sizeof(AcpiIortIdMapping));
     GArray *its_idmaps = g_array_new(false, true, sizeof(AcpiIortIdMapping));
 
-    AcpiTable table = { .sig = "IORT", .rev = 3, .oem_id = vms->oem_id,
+    AcpiTable table = { .sig = "IORT", .rev = 5, .oem_id = vms->oem_id,
                         .oem_table_id = vms->oem_table_id };
     /* Table 2 The IORT */
     acpi_table_begin(&table, table_data);
 
-    if (vms->iommu == VIRT_IOMMU_SMMUV3) {
+    if (virt_has_smmuv3(vms)) {
         AcpiIortIdMapping next_range = {0};
 
         object_child_foreach_recursive(object_get_root(),
                                        iort_host_bridges, smmu_idmaps);
 
+        nb_nodes = 3; /* RC, ITS, SMMUv3 */
+
         /* Sort the smmu idmap by input_base */
         g_array_sort(smmu_idmaps, iort_idmap_compare);
 
@@ -309,6 +361,9 @@ build_iort(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
             }
 
             next_range.input_base = idmap->input_base + idmap->id_count;
+            if (vms->iommu == VIRT_IOMMU_NESTED_SMMUV3) {
+                nb_nodes++;
+            }
         }
 
         /* Append the last RC -> ITS ID mapping */
@@ -317,7 +372,6 @@ build_iort(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
             g_array_append_val(its_idmaps, next_range);
         }
 
-        nb_nodes = 3; /* RC, ITS, SMMUv3 */
         rc_mapping_count = smmu_idmaps->len + its_idmaps->len;
     } else {
         nb_nodes = 2; /* RC, ITS */
@@ -342,7 +396,7 @@ build_iort(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
     /* GIC ITS Identifier Array */
     build_append_int_noprefix(table_data, 0 /* MADT translation_id */, 4);
 
-    if (vms->iommu == VIRT_IOMMU_SMMUV3) {
+    if (virt_has_smmuv3(vms)) {
         int irq =  vms->irqmap[VIRT_SMMU] + ARM_SPI_BASE;
 
         smmu_offset = table_data->len - table.table_offset;
@@ -372,7 +426,7 @@ build_iort(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
         build_append_int_noprefix(table_data, 0, 4);
 
         /* output IORT node is the ITS group node (the first node) */
-        build_iort_id_mapping(table_data, 0, 0x10000, IORT_NODE_OFFSET);
+        build_iort_id_mapping(table_data, 0, 0x10000, IORT_NODE_OFFSET, 0);
     }
 
     /* Table 17 Root Complex Node */
@@ -405,7 +459,7 @@ build_iort(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
     build_append_int_noprefix(table_data, 0, 3); /* Reserved */
 
     /* Output Reference */
-    if (vms->iommu == VIRT_IOMMU_SMMUV3) {
+    if (virt_has_smmuv3(vms)) {
         AcpiIortIdMapping *range;
 
         /* translated RIDs connect to SMMUv3 node: RC -> SMMUv3 -> ITS */
@@ -413,7 +467,7 @@ build_iort(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
             range = &g_array_index(smmu_idmaps, AcpiIortIdMapping, i);
             /* output IORT node is the smmuv3 node */
             build_iort_id_mapping(table_data, range->input_base,
-                                  range->id_count, smmu_offset);
+                                  range->id_count, smmu_offset, 0);
         }
 
         /* bypassed RIDs connect to ITS group node directly: RC -> ITS */
@@ -421,11 +475,15 @@ build_iort(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
             range = &g_array_index(its_idmaps, AcpiIortIdMapping, i);
             /* output IORT node is the ITS group node (the first node) */
             build_iort_id_mapping(table_data, range->input_base,
-                                  range->id_count, IORT_NODE_OFFSET);
+                                  range->id_count, IORT_NODE_OFFSET, 0);
         }
     } else {
         /* output IORT node is the ITS group node (the first node) */
-        build_iort_id_mapping(table_data, 0, 0x10000, IORT_NODE_OFFSET);
+        build_iort_id_mapping(table_data, 0, 0x10000, IORT_NODE_OFFSET, 0);
+    }
+
+    if (vms->iommu == VIRT_IOMMU_NESTED_SMMUV3) {
+        build_iort_rmr_nodes(table_data, smmu_idmaps, smmu_offset, &id);
     }
 
     acpi_table_end(linker, &table);
diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 3c93c0c0a6..78af2d2195 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -1396,7 +1396,7 @@ static void create_smmu(const VirtMachineState *vms,
     DeviceState *dev;
     MachineState *ms = MACHINE(vms);
 
-    if (vms->iommu != VIRT_IOMMU_SMMUV3 || !vms->iommu_phandle) {
+    if (!virt_has_smmuv3(vms) || !vms->iommu_phandle) {
         return;
     }
 
@@ -1578,6 +1578,7 @@ static void create_pcie(VirtMachineState *vms)
 
         switch (vms->iommu) {
         case VIRT_IOMMU_SMMUV3:
+        case VIRT_IOMMU_NESTED_SMMUV3:
             create_smmu(vms, vms->bus);
             qemu_fdt_setprop_cells(ms->fdt, nodename, "iommu-map",
                                    0x0, vms->iommu_phandle, 0x0, 0x10000);
@@ -2653,6 +2654,8 @@ static char *virt_get_iommu(Object *obj, Error **errp)
         return g_strdup("none");
     case VIRT_IOMMU_SMMUV3:
         return g_strdup("smmuv3");
+    case VIRT_IOMMU_NESTED_SMMUV3:
+        return g_strdup("nested-smmuv3");
     default:
         g_assert_not_reached();
     }
@@ -2664,6 +2667,8 @@ static void virt_set_iommu(Object *obj, const char *value, Error **errp)
 
     if (!strcmp(value, "smmuv3")) {
         vms->iommu = VIRT_IOMMU_SMMUV3;
+    } else if (!strcmp(value, "nested-smmuv3")) {
+        vms->iommu = VIRT_IOMMU_NESTED_SMMUV3;
     } else if (!strcmp(value, "none")) {
         vms->iommu = VIRT_IOMMU_NONE;
     } else {
diff --git a/include/hw/arm/virt.h b/include/hw/arm/virt.h
index bb486d36b1..7df0813e28 100644
--- a/include/hw/arm/virt.h
+++ b/include/hw/arm/virt.h
@@ -89,6 +89,7 @@ enum {
 typedef enum VirtIOMMUType {
     VIRT_IOMMU_NONE,
     VIRT_IOMMU_SMMUV3,
+    VIRT_IOMMU_NESTED_SMMUV3,
     VIRT_IOMMU_VIRTIO,
 } VirtIOMMUType;
 
@@ -209,4 +210,10 @@ static inline int virt_gicv3_redist_region_count(VirtMachineState *vms)
             vms->highmem_redists) ? 2 : 1;
 }
 
+static inline bool virt_has_smmuv3(const VirtMachineState *vms)
+{
+    return vms->iommu == VIRT_IOMMU_SMMUV3 ||
+           vms->iommu == VIRT_IOMMU_NESTED_SMMUV3;
+}
+
 #endif /* QEMU_ARM_VIRT_H */
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH RFCv1 02/10] hw/arm/virt: Add iommufd link to virt-machine
  2024-06-26  0:28 [PATCH RFCv1 00/10] hw/arm/virt: Add multiple nested SMMUs Nicolin Chen
  2024-06-26  0:28 ` [PATCH RFCv1 01/10] hw/arm/virt-acpi-build: Add IORT RMR regions to handle MSI nested binding Nicolin Chen
@ 2024-06-26  0:28 ` Nicolin Chen
  2024-07-09  9:11   ` Eric Auger
  2024-06-26  0:28 ` [PATCH RFCv1 03/10] hw/arm/virt: Get the number of host-level SMMUv3 instances Nicolin Chen
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 25+ messages in thread
From: Nicolin Chen @ 2024-06-26  0:28 UTC (permalink / raw)
  To: peter.maydell, shannon.zhaosl, mst, imammedo, anisinha,
	eric.auger, peterx
  Cc: qemu-arm, qemu-devel, jgg, shameerali.kolothum.thodi, jasowang

A nested SMMU must use iommufd ioctls to communicate with the host-level
SMMU instance for 2-stage translation support. Add an iommufd link to the
ARM virt-machine, allowing QEMU command to pass in an iommufd object.

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 hw/arm/virt.c         | 14 ++++++++++++++
 include/hw/arm/virt.h |  2 ++
 2 files changed, 16 insertions(+)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 78af2d2195..71093d7c60 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -1404,6 +1404,13 @@ static void create_smmu(const VirtMachineState *vms,
 
     object_property_set_link(OBJECT(dev), "primary-bus", OBJECT(bus),
                              &error_abort);
+
+    if (vms->iommu == VIRT_IOMMU_NESTED_SMMUV3) {
+        g_assert(vms->iommufd);
+        object_property_set_link(OBJECT(dev), "iommufd", OBJECT(vms->iommufd),
+                                 &error_abort);
+        object_property_set_bool(OBJECT(dev), "nested", true, &error_abort);
+    }
     sysbus_realize_and_unref(SYS_BUS_DEVICE(dev), &error_fatal);
     sysbus_mmio_map(SYS_BUS_DEVICE(dev), 0, base);
     for (i = 0; i < NUM_SMMU_IRQS; i++) {
@@ -3114,6 +3121,13 @@ static void virt_machine_class_init(ObjectClass *oc, void *data)
                                           "Set GIC version. "
                                           "Valid values are 2, 3, 4, host and max");
 
+    object_class_property_add_link(oc, "iommufd", TYPE_IOMMUFD_BACKEND,
+                                   offsetof(VirtMachineState, iommufd),
+                                   object_property_allow_set_link,
+                                   OBJ_PROP_LINK_STRONG);
+    object_class_property_set_description(oc, "iommufd",
+                                          "Set the IOMMUFD handler from \"-iommufd\"");
+
     object_class_property_add_str(oc, "iommu", virt_get_iommu, virt_set_iommu);
     object_class_property_set_description(oc, "iommu",
                                           "Set the IOMMU type. "
diff --git a/include/hw/arm/virt.h b/include/hw/arm/virt.h
index 7df0813e28..d5cbce1a30 100644
--- a/include/hw/arm/virt.h
+++ b/include/hw/arm/virt.h
@@ -36,6 +36,7 @@
 #include "hw/arm/boot.h"
 #include "hw/arm/bsa.h"
 #include "hw/block/flash.h"
+#include "sysemu/iommufd.h"
 #include "sysemu/kvm.h"
 #include "hw/intc/arm_gicv3_common.h"
 #include "qom/object.h"
@@ -154,6 +155,7 @@ struct VirtMachineState {
     bool dtb_randomness;
     OnOffAuto acpi;
     VirtGICType gic_version;
+    IOMMUFDBackend *iommufd;
     VirtIOMMUType iommu;
     bool default_bus_bypass_iommu;
     VirtMSIControllerType msi_controller;
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH RFCv1 02/10] hw/arm/virt: Add iommufd link to virt-machine
  2024-06-26  0:28 ` [PATCH RFCv1 02/10] hw/arm/virt: Add iommufd link to virt-machine Nicolin Chen
@ 2024-07-09  9:11   ` Eric Auger
  2024-07-09 16:59     ` Nicolin Chen
  0 siblings, 1 reply; 25+ messages in thread
From: Eric Auger @ 2024-07-09  9:11 UTC (permalink / raw)
  To: Nicolin Chen, peter.maydell, shannon.zhaosl, mst, imammedo,
	anisinha, peterx
  Cc: qemu-arm, qemu-devel, jgg, shameerali.kolothum.thodi, jasowang

Hi Nicolin,

On 6/26/24 02:28, Nicolin Chen wrote:
> A nested SMMU must use iommufd ioctls to communicate with the host-level
> SMMU instance for 2-stage translation support. Add an iommufd link to the
> ARM virt-machine, allowing QEMU command to pass in an iommufd object.
If I am not wrong vfio devices are allowed to use different iommufd's
(although there is no real benefice). So this command line wouldn't
match with that option.
Also while reading the commit msg it is not clear with the iommufd is
needed in the machine whereas the vfio iommufd BE generally calls those
ioctls.

Thanks

Eric
>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
>  hw/arm/virt.c         | 14 ++++++++++++++
>  include/hw/arm/virt.h |  2 ++
>  2 files changed, 16 insertions(+)
>
> diff --git a/hw/arm/virt.c b/hw/arm/virt.c
> index 78af2d2195..71093d7c60 100644
> --- a/hw/arm/virt.c
> +++ b/hw/arm/virt.c
> @@ -1404,6 +1404,13 @@ static void create_smmu(const VirtMachineState *vms,
>  
>      object_property_set_link(OBJECT(dev), "primary-bus", OBJECT(bus),
>                               &error_abort);
> +
> +    if (vms->iommu == VIRT_IOMMU_NESTED_SMMUV3) {
> +        g_assert(vms->iommufd);
> +        object_property_set_link(OBJECT(dev), "iommufd", OBJECT(vms->iommufd),
> +                                 &error_abort);
> +        object_property_set_bool(OBJECT(dev), "nested", true, &error_abort);

> +    }
>      sysbus_realize_and_unref(SYS_BUS_DEVICE(dev), &error_fatal);
>      sysbus_mmio_map(SYS_BUS_DEVICE(dev), 0, base);
>      for (i = 0; i < NUM_SMMU_IRQS; i++) {
> @@ -3114,6 +3121,13 @@ static void virt_machine_class_init(ObjectClass *oc, void *data)
>                                            "Set GIC version. "
>                                            "Valid values are 2, 3, 4, host and max");
>  
> +    object_class_property_add_link(oc, "iommufd", TYPE_IOMMUFD_BACKEND,
> +                                   offsetof(VirtMachineState, iommufd),
> +                                   object_property_allow_set_link,
> +                                   OBJ_PROP_LINK_STRONG);
> +    object_class_property_set_description(oc, "iommufd",
> +                                          "Set the IOMMUFD handler from \"-iommufd\"");
> +
>      object_class_property_add_str(oc, "iommu", virt_get_iommu, virt_set_iommu);
>      object_class_property_set_description(oc, "iommu",
>                                            "Set the IOMMU type. "
> diff --git a/include/hw/arm/virt.h b/include/hw/arm/virt.h
> index 7df0813e28..d5cbce1a30 100644
> --- a/include/hw/arm/virt.h
> +++ b/include/hw/arm/virt.h
> @@ -36,6 +36,7 @@
>  #include "hw/arm/boot.h"
>  #include "hw/arm/bsa.h"
>  #include "hw/block/flash.h"
> +#include "sysemu/iommufd.h"
>  #include "sysemu/kvm.h"
>  #include "hw/intc/arm_gicv3_common.h"
>  #include "qom/object.h"
> @@ -154,6 +155,7 @@ struct VirtMachineState {
>      bool dtb_randomness;
>      OnOffAuto acpi;
>      VirtGICType gic_version;
> +    IOMMUFDBackend *iommufd;
>      VirtIOMMUType iommu;
>      bool default_bus_bypass_iommu;
>      VirtMSIControllerType msi_controller;



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFCv1 02/10] hw/arm/virt: Add iommufd link to virt-machine
  2024-07-09  9:11   ` Eric Auger
@ 2024-07-09 16:59     ` Nicolin Chen
  2024-07-09 17:06       ` Eric Auger
  0 siblings, 1 reply; 25+ messages in thread
From: Nicolin Chen @ 2024-07-09 16:59 UTC (permalink / raw)
  To: Eric Auger
  Cc: peter.maydell, shannon.zhaosl, mst, imammedo, anisinha, peterx,
	qemu-arm, qemu-devel, jgg, shameerali.kolothum.thodi, jasowang

Hi Eric,

Thanks for the comments!

On Tue, Jul 09, 2024 at 11:11:56AM +0200, Eric Auger wrote:
> On 6/26/24 02:28, Nicolin Chen wrote:
> > A nested SMMU must use iommufd ioctls to communicate with the host-level
> > SMMU instance for 2-stage translation support. Add an iommufd link to the
> > ARM virt-machine, allowing QEMU command to pass in an iommufd object.

> If I am not wrong vfio devices are allowed to use different iommufd's
> (although there is no real benefice). So this command line wouldn't
> match with that option.

I think Jason's remarks highlighted that FD should be one per VM:
https://lore.kernel.org/qemu-devel/20240503141024.GE3341011@nvidia.com/

> Also while reading the commit msg it is not clear with the iommufd is
> needed in the machine whereas the vfio iommufd BE generally calls those
> ioctls.

I think I forgot to revisit it. Both intel_iommu and smmu-common
used to call iommufd_backend_connect() for counting, so there was
a need to pass in the same iommufd handler to the viommu driver.
For SMMU, since it is created in the virt code, we had to pass in
with this patch.

That being said, it looks like intel_iommu had removed that. So,
likely we don't need an extra user counting for SMMU too.

Thank you
Nicolin


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFCv1 02/10] hw/arm/virt: Add iommufd link to virt-machine
  2024-07-09 16:59     ` Nicolin Chen
@ 2024-07-09 17:06       ` Eric Auger
  2024-07-09 17:18         ` Nicolin Chen
  0 siblings, 1 reply; 25+ messages in thread
From: Eric Auger @ 2024-07-09 17:06 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: peter.maydell, shannon.zhaosl, mst, imammedo, anisinha, peterx,
	qemu-arm, qemu-devel, jgg, shameerali.kolothum.thodi, jasowang



On 7/9/24 18:59, Nicolin Chen wrote:
> Hi Eric,
>
> Thanks for the comments!
>
> On Tue, Jul 09, 2024 at 11:11:56AM +0200, Eric Auger wrote:
>> On 6/26/24 02:28, Nicolin Chen wrote:
>>> A nested SMMU must use iommufd ioctls to communicate with the host-level
>>> SMMU instance for 2-stage translation support. Add an iommufd link to the
>>> ARM virt-machine, allowing QEMU command to pass in an iommufd object.
>> If I am not wrong vfio devices are allowed to use different iommufd's
>> (although there is no real benefice). So this command line wouldn't
>> match with that option.
> I think Jason's remarks highlighted that FD should be one per VM:
> https://lore.kernel.org/qemu-devel/20240503141024.GE3341011@nvidia.com/
OK I thought this was still envisionned althought not really meaningful.
By the way, please add Yi and Zhenzhong in cc since thre problematics
are connected I think.
>
>> Also while reading the commit msg it is not clear with the iommufd is
>> needed in the machine whereas the vfio iommufd BE generally calls those
>> ioctls.
> I think I forgot to revisit it. Both intel_iommu and smmu-common
> used to call iommufd_backend_connect() for counting, so there was
> a need to pass in the same iommufd handler to the viommu driver.
> For SMMU, since it is created in the virt code, we had to pass in
> with this patch.
>
> That being said, it looks like intel_iommu had removed that. So,
> likely we don't need an extra user counting for SMMU too.
OK at least it deserves some explanation about the "why"

Thanks

Eric
>
> Thank you
> Nicolin
>



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFCv1 02/10] hw/arm/virt: Add iommufd link to virt-machine
  2024-07-09 17:06       ` Eric Auger
@ 2024-07-09 17:18         ` Nicolin Chen
  2024-07-10  2:32           ` Duan, Zhenzhong
  0 siblings, 1 reply; 25+ messages in thread
From: Nicolin Chen @ 2024-07-09 17:18 UTC (permalink / raw)
  To: Eric Auger, Zhenzhong Duan, yi.l.liu
  Cc: peter.maydell, shannon.zhaosl, mst, imammedo, anisinha, peterx,
	qemu-arm, qemu-devel, jgg, shameerali.kolothum.thodi, jasowang

On Tue, Jul 09, 2024 at 07:06:50PM +0200, Eric Auger wrote:
> On 7/9/24 18:59, Nicolin Chen wrote:
> > Hi Eric,
> >
> > Thanks for the comments!
> >
> > On Tue, Jul 09, 2024 at 11:11:56AM +0200, Eric Auger wrote:
> >> On 6/26/24 02:28, Nicolin Chen wrote:
> >>> A nested SMMU must use iommufd ioctls to communicate with the host-level
> >>> SMMU instance for 2-stage translation support. Add an iommufd link to the
> >>> ARM virt-machine, allowing QEMU command to pass in an iommufd object.
> >> If I am not wrong vfio devices are allowed to use different iommufd's
> >> (although there is no real benefice). So this command line wouldn't
> >> match with that option.
> > I think Jason's remarks highlighted that FD should be one per VM:
> > https://lore.kernel.org/qemu-devel/20240503141024.GE3341011@nvidia.com/
> OK I thought this was still envisionned althought not really meaningful.
> By the way, please add Yi and Zhenzhong in cc since thre problematics
> are connected I think.

Yea.

Yi/Zhenzhong, would you please shed some light on forwarding an
iommufd handler to the intel_iommu code? IIRC, we did that at the
beginning but removed it later?

> >> Also while reading the commit msg it is not clear with the iommufd is
> >> needed in the machine whereas the vfio iommufd BE generally calls those
> >> ioctls.
> > I think I forgot to revisit it. Both intel_iommu and smmu-common
> > used to call iommufd_backend_connect() for counting, so there was
> > a need to pass in the same iommufd handler to the viommu driver.
> > For SMMU, since it is created in the virt code, we had to pass in
> > with this patch.
> >
> > That being said, it looks like intel_iommu had removed that. So,
> > likely we don't need an extra user counting for SMMU too.
> OK at least it deserves some explanation about the "why"

Yes, I agree that the commit message isn't good enough.

Thanks
Nicolin


^ permalink raw reply	[flat|nested] 25+ messages in thread

* RE: [PATCH RFCv1 02/10] hw/arm/virt: Add iommufd link to virt-machine
  2024-07-09 17:18         ` Nicolin Chen
@ 2024-07-10  2:32           ` Duan, Zhenzhong
  0 siblings, 0 replies; 25+ messages in thread
From: Duan, Zhenzhong @ 2024-07-10  2:32 UTC (permalink / raw)
  To: Nicolin Chen, Eric Auger, Liu, Yi L
  Cc: peter.maydell@linaro.org, shannon.zhaosl@gmail.com,
	mst@redhat.com, imammedo@redhat.com, anisinha@redhat.com,
	peterx@redhat.com, qemu-arm@nongnu.org, qemu-devel@nongnu.org,
	jgg@nvidia.com, shameerali.kolothum.thodi@huawei.com,
	jasowang@redhat.com

Hi Nicolin,

>-----Original Message-----
>From: Nicolin Chen <nicolinc@nvidia.com>
>Subject: Re: [PATCH RFCv1 02/10] hw/arm/virt: Add iommufd link to virt-
>machine
>
>On Tue, Jul 09, 2024 at 07:06:50PM +0200, Eric Auger wrote:
>> On 7/9/24 18:59, Nicolin Chen wrote:
>> > Hi Eric,
>> >
>> > Thanks for the comments!
>> >
>> > On Tue, Jul 09, 2024 at 11:11:56AM +0200, Eric Auger wrote:
>> >> On 6/26/24 02:28, Nicolin Chen wrote:
>> >>> A nested SMMU must use iommufd ioctls to communicate with the
>host-level
>> >>> SMMU instance for 2-stage translation support. Add an iommufd link
>to the
>> >>> ARM virt-machine, allowing QEMU command to pass in an iommufd
>object.
>> >> If I am not wrong vfio devices are allowed to use different iommufd's
>> >> (although there is no real benefice). So this command line wouldn't
>> >> match with that option.
>> > I think Jason's remarks highlighted that FD should be one per VM:
>> > https://lore.kernel.org/qemu-
>devel/20240503141024.GE3341011@nvidia.com/
>> OK I thought this was still envisionned althought not really meaningful.
>> By the way, please add Yi and Zhenzhong in cc since thre problematics
>> are connected I think.
>
>Yea.
>
>Yi/Zhenzhong, would you please shed some light on forwarding an
>iommufd handler to the intel_iommu code? IIRC, we did that at the
>beginning but removed it later?

IOMMUFD/devid/ioas handler is packaged in HostIOMMUDeviceIOMMUFD and passed to vIOMMU, see https://github.com/yiliu1765/qemu/commit/02892a5b452382866e804c3db3bb392c8f8f500f

The whole nesting series is at https://github.com/yiliu1765/qemu/commits/zhenzhong/iommufd_nesting_rfcv2/

We gave the user flexibility to use different iommufd backends in one VM in iommufd cdev series. 
We want to be backward compatible in nesting series, the code change to support that is also trivial.

Thanks
Zhenzhong

>
>> >> Also while reading the commit msg it is not clear with the iommufd is
>> >> needed in the machine whereas the vfio iommufd BE generally calls
>those
>> >> ioctls.
>> > I think I forgot to revisit it. Both intel_iommu and smmu-common
>> > used to call iommufd_backend_connect() for counting, so there was
>> > a need to pass in the same iommufd handler to the viommu driver.
>> > For SMMU, since it is created in the virt code, we had to pass in
>> > with this patch.
>> >
>> > That being said, it looks like intel_iommu had removed that. So,
>> > likely we don't need an extra user counting for SMMU too.
>> OK at least it deserves some explanation about the "why"
>
>Yes, I agree that the commit message isn't good enough.
>
>Thanks
>Nicolin


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH RFCv1 03/10] hw/arm/virt: Get the number of host-level SMMUv3 instances
  2024-06-26  0:28 [PATCH RFCv1 00/10] hw/arm/virt: Add multiple nested SMMUs Nicolin Chen
  2024-06-26  0:28 ` [PATCH RFCv1 01/10] hw/arm/virt-acpi-build: Add IORT RMR regions to handle MSI nested binding Nicolin Chen
  2024-06-26  0:28 ` [PATCH RFCv1 02/10] hw/arm/virt: Add iommufd link to virt-machine Nicolin Chen
@ 2024-06-26  0:28 ` Nicolin Chen
  2024-07-09  9:20   ` Eric Auger
  2024-06-26  0:28 ` [PATCH RFCv1 04/10] hw/arm/virt: Add an SMMU_IO_LEN macro Nicolin Chen
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 25+ messages in thread
From: Nicolin Chen @ 2024-06-26  0:28 UTC (permalink / raw)
  To: peter.maydell, shannon.zhaosl, mst, imammedo, anisinha,
	eric.auger, peterx
  Cc: qemu-arm, qemu-devel, jgg, shameerali.kolothum.thodi, jasowang

Nested SMMUv3 feature requires the support/presence of host-level SMMUv3
instance(s). Add a helper to read the sysfs for the number of instances.
Log them in a vms list using a new struct VirtNestedSmmu.

This will be used by a following patch to assign a passthrough device to
corresponding nested SMMUv3 instance.

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 hw/arm/virt.c         | 35 +++++++++++++++++++++++++++++++++++
 include/hw/arm/virt.h |  8 ++++++++
 2 files changed, 43 insertions(+)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 71093d7c60..be3d8b0ce6 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -2668,6 +2668,37 @@ static char *virt_get_iommu(Object *obj, Error **errp)
     }
 }
 
+static int virt_get_num_nested_smmus(VirtMachineState *vms, Error **errp)
+{
+    VirtNestedSmmu *nested_smmu;
+    struct dirent *dent;
+    DIR *dir = NULL;
+    int num = 0;
+
+    dir = opendir("/sys/class/iommu");
+    if (!dir) {
+        error_setg_errno(errp, errno, "couldn't open /sys/class/iommu");
+        return 0;
+    }
+
+    while ((dent = readdir(dir))) {
+        if (!strncmp(dent->d_name, "smmu3.0x", 7)) {
+            nested_smmu = g_new0(VirtNestedSmmu, 1);
+            nested_smmu->index = num;
+            nested_smmu->smmu_node = g_strdup(dent->d_name);
+            QLIST_INSERT_HEAD(&vms->nested_smmu_list, nested_smmu, next);
+            num++;
+        }
+    }
+
+    if (num == 0) {
+        error_setg_errno(errp, errno,
+                         "couldn't find any SMMUv3 HW to setup nesting");
+    }
+
+    return num;
+}
+
 static void virt_set_iommu(Object *obj, const char *value, Error **errp)
 {
     VirtMachineState *vms = VIRT_MACHINE(obj);
@@ -2676,6 +2707,7 @@ static void virt_set_iommu(Object *obj, const char *value, Error **errp)
         vms->iommu = VIRT_IOMMU_SMMUV3;
     } else if (!strcmp(value, "nested-smmuv3")) {
         vms->iommu = VIRT_IOMMU_NESTED_SMMUV3;
+        vms->num_nested_smmus = virt_get_num_nested_smmus(vms, errp);
     } else if (!strcmp(value, "none")) {
         vms->iommu = VIRT_IOMMU_NONE;
     } else {
@@ -3232,6 +3264,9 @@ static void virt_instance_init(Object *obj)
     /* The default root bus is attached to iommu by default */
     vms->default_bus_bypass_iommu = false;
 
+    /* Default disallows nested SMMU instantiation */
+    vms->num_nested_smmus = 0;
+
     /* Default disallows RAS instantiation */
     vms->ras = false;
 
diff --git a/include/hw/arm/virt.h b/include/hw/arm/virt.h
index d5cbce1a30..e0c07527c4 100644
--- a/include/hw/arm/virt.h
+++ b/include/hw/arm/virt.h
@@ -135,6 +135,12 @@ struct VirtMachineClass {
     bool no_ns_el2_virt_timer_irq;
 };
 
+typedef struct VirtNestedSmmu {
+    int index;
+    char *smmu_node;
+    QLIST_ENTRY(VirtNestedSmmu) next;
+} VirtNestedSmmu;
+
 struct VirtMachineState {
     MachineState parent;
     Notifier machine_done;
@@ -153,6 +159,7 @@ struct VirtMachineState {
     bool ras;
     bool mte;
     bool dtb_randomness;
+    int num_nested_smmus;
     OnOffAuto acpi;
     VirtGICType gic_version;
     IOMMUFDBackend *iommufd;
@@ -178,6 +185,7 @@ struct VirtMachineState {
     char *oem_id;
     char *oem_table_id;
     bool ns_el2_virt_timer_irq;
+    QLIST_HEAD(, VirtNestedSmmu) nested_smmu_list;
 };
 
 #define VIRT_ECAM_ID(high) (high ? VIRT_HIGH_PCIE_ECAM : VIRT_PCIE_ECAM)
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH RFCv1 03/10] hw/arm/virt: Get the number of host-level SMMUv3 instances
  2024-06-26  0:28 ` [PATCH RFCv1 03/10] hw/arm/virt: Get the number of host-level SMMUv3 instances Nicolin Chen
@ 2024-07-09  9:20   ` Eric Auger
  2024-07-09 17:11     ` Nicolin Chen
  0 siblings, 1 reply; 25+ messages in thread
From: Eric Auger @ 2024-07-09  9:20 UTC (permalink / raw)
  To: Nicolin Chen, peter.maydell, shannon.zhaosl, mst, imammedo,
	anisinha, peterx
  Cc: qemu-arm, qemu-devel, jgg, shameerali.kolothum.thodi, jasowang

Hi Nicolin,

On 6/26/24 02:28, Nicolin Chen wrote:
> Nested SMMUv3 feature requires the support/presence of host-level SMMUv3
> instance(s). Add a helper to read the sysfs for the number of instances.
> Log them in a vms list using a new struct VirtNestedSmmu.
>
> This will be used by a following patch to assign a passthrough device to
> corresponding nested SMMUv3 instance.
Laterly the HostIOMMUDevice has been introduced to allow, among other
things, to pass information related to the physical IOMMU to the virtual
IOMMU.
I guess it would be well fitted to associate the viommu with its
underlying piommu.

I don't think we have such kind of host introspection in machine type.
Generally in can happen in the very device or in libvirt.

Thanks

Eric
>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
>  hw/arm/virt.c         | 35 +++++++++++++++++++++++++++++++++++
>  include/hw/arm/virt.h |  8 ++++++++
>  2 files changed, 43 insertions(+)
>
> diff --git a/hw/arm/virt.c b/hw/arm/virt.c
> index 71093d7c60..be3d8b0ce6 100644
> --- a/hw/arm/virt.c
> +++ b/hw/arm/virt.c
> @@ -2668,6 +2668,37 @@ static char *virt_get_iommu(Object *obj, Error **errp)
>      }
>  }
>  
> +static int virt_get_num_nested_smmus(VirtMachineState *vms, Error **errp)
> +{
> +    VirtNestedSmmu *nested_smmu;
> +    struct dirent *dent;
> +    DIR *dir = NULL;
> +    int num = 0;
> +
> +    dir = opendir("/sys/class/iommu");
> +    if (!dir) {
> +        error_setg_errno(errp, errno, "couldn't open /sys/class/iommu");
> +        return 0;
> +    }
> +
> +    while ((dent = readdir(dir))) {
> +        if (!strncmp(dent->d_name, "smmu3.0x", 7)) {
> +            nested_smmu = g_new0(VirtNestedSmmu, 1);
> +            nested_smmu->index = num;
> +            nested_smmu->smmu_node = g_strdup(dent->d_name);
> +            QLIST_INSERT_HEAD(&vms->nested_smmu_list, nested_smmu, next);
> +            num++;
> +        }
> +    }
> +
> +    if (num == 0) {
> +        error_setg_errno(errp, errno,
> +                         "couldn't find any SMMUv3 HW to setup nesting");
> +    }
> +
> +    return num;
> +}
> +
>  static void virt_set_iommu(Object *obj, const char *value, Error **errp)
>  {
>      VirtMachineState *vms = VIRT_MACHINE(obj);
> @@ -2676,6 +2707,7 @@ static void virt_set_iommu(Object *obj, const char *value, Error **errp)
>          vms->iommu = VIRT_IOMMU_SMMUV3;
>      } else if (!strcmp(value, "nested-smmuv3")) {
>          vms->iommu = VIRT_IOMMU_NESTED_SMMUV3;
> +        vms->num_nested_smmus = virt_get_num_nested_smmus(vms, errp);
>      } else if (!strcmp(value, "none")) {
>          vms->iommu = VIRT_IOMMU_NONE;
>      } else {
> @@ -3232,6 +3264,9 @@ static void virt_instance_init(Object *obj)
>      /* The default root bus is attached to iommu by default */
>      vms->default_bus_bypass_iommu = false;
>  
> +    /* Default disallows nested SMMU instantiation */
> +    vms->num_nested_smmus = 0;
> +
>      /* Default disallows RAS instantiation */
>      vms->ras = false;
>  
> diff --git a/include/hw/arm/virt.h b/include/hw/arm/virt.h
> index d5cbce1a30..e0c07527c4 100644
> --- a/include/hw/arm/virt.h
> +++ b/include/hw/arm/virt.h
> @@ -135,6 +135,12 @@ struct VirtMachineClass {
>      bool no_ns_el2_virt_timer_irq;
>  };
>  
> +typedef struct VirtNestedSmmu {
> +    int index;
> +    char *smmu_node;
> +    QLIST_ENTRY(VirtNestedSmmu) next;
> +} VirtNestedSmmu;
> +
>  struct VirtMachineState {
>      MachineState parent;
>      Notifier machine_done;
> @@ -153,6 +159,7 @@ struct VirtMachineState {
>      bool ras;
>      bool mte;
>      bool dtb_randomness;
> +    int num_nested_smmus;
>      OnOffAuto acpi;
>      VirtGICType gic_version;
>      IOMMUFDBackend *iommufd;
> @@ -178,6 +185,7 @@ struct VirtMachineState {
>      char *oem_id;
>      char *oem_table_id;
>      bool ns_el2_virt_timer_irq;
> +    QLIST_HEAD(, VirtNestedSmmu) nested_smmu_list;
>  };
>  
>  #define VIRT_ECAM_ID(high) (high ? VIRT_HIGH_PCIE_ECAM : VIRT_PCIE_ECAM)



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFCv1 03/10] hw/arm/virt: Get the number of host-level SMMUv3 instances
  2024-07-09  9:20   ` Eric Auger
@ 2024-07-09 17:11     ` Nicolin Chen
  2024-07-09 17:22       ` Eric Auger
  0 siblings, 1 reply; 25+ messages in thread
From: Nicolin Chen @ 2024-07-09 17:11 UTC (permalink / raw)
  To: Eric Auger
  Cc: peter.maydell, shannon.zhaosl, mst, imammedo, anisinha, peterx,
	qemu-arm, qemu-devel, jgg, shameerali.kolothum.thodi, jasowang

On Tue, Jul 09, 2024 at 11:20:16AM +0200, Eric Auger wrote:
> On 6/26/24 02:28, Nicolin Chen wrote:
> > Nested SMMUv3 feature requires the support/presence of host-level SMMUv3
> > instance(s). Add a helper to read the sysfs for the number of instances.
> > Log them in a vms list using a new struct VirtNestedSmmu.
> >
> > This will be used by a following patch to assign a passthrough device to
> > corresponding nested SMMUv3 instance.

> Laterly the HostIOMMUDevice has been introduced to allow, among other
> things, to pass information related to the physical IOMMU to the virtual
> IOMMU.
> I guess it would be well fitted to associate the viommu with its
> underlying piommu.

Wow, I missed that part -- backends/host_iommu_device. I will
see how I can fit these well with that.

> I don't think we have such kind of host introspection in machine type.
> Generally in can happen in the very device or in libvirt.

I think the biggest reason for having such an introspection in
the virt code is because of hotplug, (though it's not properly
implemented yet), as we don't know what new devices requiring
for nested translation would be joining later. So somebody has
to hold a full list.

Would you mind elaborating how the "device" or "libvirt" can
handle that?

Thanks!
Nicolin


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFCv1 03/10] hw/arm/virt: Get the number of host-level SMMUv3 instances
  2024-07-09 17:11     ` Nicolin Chen
@ 2024-07-09 17:22       ` Eric Auger
  2024-07-09 18:02         ` Nicolin Chen
  0 siblings, 1 reply; 25+ messages in thread
From: Eric Auger @ 2024-07-09 17:22 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: peter.maydell, shannon.zhaosl, mst, imammedo, anisinha, peterx,
	qemu-arm, qemu-devel, jgg, shameerali.kolothum.thodi, jasowang



On 7/9/24 19:11, Nicolin Chen wrote:
> On Tue, Jul 09, 2024 at 11:20:16AM +0200, Eric Auger wrote:
>> On 6/26/24 02:28, Nicolin Chen wrote:
>>> Nested SMMUv3 feature requires the support/presence of host-level SMMUv3
>>> instance(s). Add a helper to read the sysfs for the number of instances.
>>> Log them in a vms list using a new struct VirtNestedSmmu.
>>>
>>> This will be used by a following patch to assign a passthrough device to
>>> corresponding nested SMMUv3 instance.
>> Laterly the HostIOMMUDevice has been introduced to allow, among other
>> things, to pass information related to the physical IOMMU to the virtual
>> IOMMU.
>> I guess it would be well fitted to associate the viommu with its
>> underlying piommu.
> Wow, I missed that part -- backends/host_iommu_device. I will
> see how I can fit these well with that.
>
>> I don't think we have such kind of host introspection in machine type.
>> Generally in can happen in the very device or in libvirt.
> I think the biggest reason for having such an introspection in
> the virt code is because of hotplug, (though it's not properly
> implemented yet), as we don't know what new devices requiring
> for nested translation would be joining later. So somebody has
> to hold a full list.
>
> Would you mind elaborating how the "device" or "libvirt" can
> handle that?
If you know that on Grace you have 5 SMMU instances, can't you pre-build
a PCIe topology with 5 PXB and root ports at libvirt level.
Then when you hotplug your device you specify the corresponding slot
just as we do normally. But maybe I misunderstood the hotplug problematics.

Eric
>
> Thanks!
> Nicolin
>



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFCv1 03/10] hw/arm/virt: Get the number of host-level SMMUv3 instances
  2024-07-09 17:22       ` Eric Auger
@ 2024-07-09 18:02         ` Nicolin Chen
  0 siblings, 0 replies; 25+ messages in thread
From: Nicolin Chen @ 2024-07-09 18:02 UTC (permalink / raw)
  To: Eric Auger
  Cc: peter.maydell, shannon.zhaosl, mst, imammedo, anisinha, peterx,
	qemu-arm, qemu-devel, jgg, shameerali.kolothum.thodi, jasowang

On Tue, Jul 09, 2024 at 07:22:16PM +0200, Eric Auger wrote:
> On 7/9/24 19:11, Nicolin Chen wrote:
> > On Tue, Jul 09, 2024 at 11:20:16AM +0200, Eric Auger wrote:
> >> On 6/26/24 02:28, Nicolin Chen wrote:
> >>> Nested SMMUv3 feature requires the support/presence of host-level SMMUv3
> >>> instance(s). Add a helper to read the sysfs for the number of instances.
> >>> Log them in a vms list using a new struct VirtNestedSmmu.
> >>>
> >>> This will be used by a following patch to assign a passthrough device to
> >>> corresponding nested SMMUv3 instance.
> >> Laterly the HostIOMMUDevice has been introduced to allow, among other
> >> things, to pass information related to the physical IOMMU to the virtual
> >> IOMMU.
> >> I guess it would be well fitted to associate the viommu with its
> >> underlying piommu.
> > Wow, I missed that part -- backends/host_iommu_device. I will
> > see how I can fit these well with that.
> >
> >> I don't think we have such kind of host introspection in machine type.
> >> Generally in can happen in the very device or in libvirt.
> > I think the biggest reason for having such an introspection in
> > the virt code is because of hotplug, (though it's not properly
> > implemented yet), as we don't know what new devices requiring
> > for nested translation would be joining later. So somebody has
> > to hold a full list.
> >
> > Would you mind elaborating how the "device" or "libvirt" can
> > handle that?
> If you know that on Grace you have 5 SMMU instances, can't you pre-build
> a PCIe topology with 5 PXB and root ports at libvirt level.
> Then when you hotplug your device you specify the corresponding slot
> just as we do normally. But maybe I misunderstood the hotplug problematics.

I guess I got your point: basically, the introspection and sysfs
node matching for device assigning should happen in libvirt.

Thanks
Nicolin


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH RFCv1 04/10] hw/arm/virt: Add an SMMU_IO_LEN macro
  2024-06-26  0:28 [PATCH RFCv1 00/10] hw/arm/virt: Add multiple nested SMMUs Nicolin Chen
                   ` (2 preceding siblings ...)
  2024-06-26  0:28 ` [PATCH RFCv1 03/10] hw/arm/virt: Get the number of host-level SMMUv3 instances Nicolin Chen
@ 2024-06-26  0:28 ` Nicolin Chen
  2024-06-26  0:28 ` [PATCH RFCv1 05/10] hw/arm/virt: Add VIRT_NESTED_SMMU Nicolin Chen
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 25+ messages in thread
From: Nicolin Chen @ 2024-06-26  0:28 UTC (permalink / raw)
  To: peter.maydell, shannon.zhaosl, mst, imammedo, anisinha,
	eric.auger, peterx
  Cc: qemu-arm, qemu-devel, jgg, shameerali.kolothum.thodi, jasowang

A following patch will add a new MMIO region for nested SMMU instances.

This macro will be repeatedly used to set offsets and MMIO sizes in both
virt and virt-acpi-build.

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 hw/arm/virt.c         | 2 +-
 include/hw/arm/virt.h | 3 +++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index be3d8b0ce6..ef59c79ca3 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -170,7 +170,7 @@ static const MemMapEntry base_memmap[] = {
     [VIRT_FW_CFG] =             { 0x09020000, 0x00000018 },
     [VIRT_GPIO] =               { 0x09030000, 0x00001000 },
     [VIRT_SECURE_UART] =        { 0x09040000, 0x00001000 },
-    [VIRT_SMMU] =               { 0x09050000, 0x00020000 },
+    [VIRT_SMMU] =               { 0x09050000, SMMU_IO_LEN },
     [VIRT_PCDIMM_ACPI] =        { 0x09070000, MEMORY_HOTPLUG_IO_LEN },
     [VIRT_ACPI_GED] =           { 0x09080000, ACPI_GED_EVT_SEL_LEN },
     [VIRT_NVDIMM_ACPI] =        { 0x09090000, NVDIMM_ACPI_IO_LEN},
diff --git a/include/hw/arm/virt.h b/include/hw/arm/virt.h
index e0c07527c4..b35c4f62d7 100644
--- a/include/hw/arm/virt.h
+++ b/include/hw/arm/virt.h
@@ -48,6 +48,9 @@
 /* See Linux kernel arch/arm64/include/asm/pvclock-abi.h */
 #define PVTIME_SIZE_PER_CPU 64
 
+/* MMIO region size for SMMUv3 */
+#define SMMU_IO_LEN         (0x20000)
+
 enum {
     VIRT_FLASH,
     VIRT_MEM,
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH RFCv1 05/10] hw/arm/virt: Add VIRT_NESTED_SMMU
  2024-06-26  0:28 [PATCH RFCv1 00/10] hw/arm/virt: Add multiple nested SMMUs Nicolin Chen
                   ` (3 preceding siblings ...)
  2024-06-26  0:28 ` [PATCH RFCv1 04/10] hw/arm/virt: Add an SMMU_IO_LEN macro Nicolin Chen
@ 2024-06-26  0:28 ` Nicolin Chen
  2024-07-09 13:26   ` Eric Auger
  2024-06-26  0:28 ` [PATCH RFCv1 06/10] hw/arm/virt: Assign vfio-pci devices to nested SMMUs Nicolin Chen
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 25+ messages in thread
From: Nicolin Chen @ 2024-06-26  0:28 UTC (permalink / raw)
  To: peter.maydell, shannon.zhaosl, mst, imammedo, anisinha,
	eric.auger, peterx
  Cc: qemu-arm, qemu-devel, jgg, shameerali.kolothum.thodi, jasowang

VIRT_SMMU can be used as an emulated SMMU, i.e. iommu=smmuv3. However, the
MMIO space for VIRT_SMMU isn't large enough to accommodate multiple nested
SMMUv3 instances. Add a new VIRT_NESTED_SMMU to separate MMIO space.

Meanwhile, a nested SMMUv3 could only work with a vfio-pci device that is
physically associated to a host-level SMMUv3 instance, meaning that such a
passthrough deivce must be linked to a nesteed SMMU corresponding to the
underlying host-level SMMUv3 instance:
      HW SMMU0 <-----------------> vSMMU0 (nested)
          ^                             ^
	  |                             |
	  ----> PCI dev0 & PCI dev1 <----
      HW SMMU1 <-----------------> vSMMU1 (nested)
          ^                             ^
	  |                             |
	  ----> PCI dev2 & PCI dev3 <----

Note that, on the other hand, an emulated deivce should not be associated
to a nested SMMU.

To prepare for that, create a PCIe Expander Bridge per nested SMMU as a
docker for pci-vfio devices. It eases the QEMU code to build ID mappings
in the IORT.

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 hw/arm/virt.c         | 110 +++++++++++++++++++++++++++++++++++++-----
 include/hw/arm/virt.h |  17 +++++++
 2 files changed, 116 insertions(+), 11 deletions(-)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index ef59c79ca3..a54332fca8 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -55,6 +55,7 @@
 #include "qemu/bitops.h"
 #include "qemu/error-report.h"
 #include "qemu/module.h"
+#include "hw/pci/pci_bus.h"
 #include "hw/pci-host/gpex.h"
 #include "hw/virtio/virtio-pci.h"
 #include "hw/core/sysbus-fdt.h"
@@ -83,6 +84,7 @@
 #include "hw/virtio/virtio-md-pci.h"
 #include "hw/virtio/virtio-iommu.h"
 #include "hw/char/pl011.h"
+#include "qemu/config-file.h"
 #include "qemu/guest-random.h"
 
 static GlobalProperty arm_virt_compat[] = {
@@ -177,6 +179,7 @@ static const MemMapEntry base_memmap[] = {
     [VIRT_PVTIME] =             { 0x090a0000, 0x00010000 },
     [VIRT_SECURE_GPIO] =        { 0x090b0000, 0x00001000 },
     [VIRT_MMIO] =               { 0x0a000000, 0x00000200 },
+    [VIRT_NESTED_SMMU] =        { 0x0b000000, 0x01000000 },
     /* ...repeating for a total of NUM_VIRTIO_TRANSPORTS, each of that size */
     [VIRT_PLATFORM_BUS] =       { 0x0c000000, 0x02000000 },
     [VIRT_SECURE_MEM] =         { 0x0e000000, 0x01000000 },
@@ -221,7 +224,8 @@ static const int a15irqmap[] = {
     [VIRT_MMIO] = 16, /* ...to 16 + NUM_VIRTIO_TRANSPORTS - 1 */
     [VIRT_GIC_V2M] = 48, /* ...to 48 + NUM_GICV2M_SPIS - 1 */
     [VIRT_SMMU] = 74,    /* ...to 74 + NUM_SMMU_IRQS - 1 */
-    [VIRT_PLATFORM_BUS] = 112, /* ...to 112 + PLATFORM_BUS_NUM_IRQS -1 */
+    [VIRT_PLATFORM_BUS] = 112, /* ...to 112 + PLATFORM_BUS_NUM_IRQS -1 (179) */
+    [VIRT_NESTED_SMMU] = 200, /* Keep it at the end of list */
 };
 
 static void create_randomness(MachineState *ms, const char *node)
@@ -1383,21 +1387,19 @@ static void create_pcie_irq_map(const MachineState *ms,
                            0x7           /* PCI irq */);
 }
 
-static void create_smmu(const VirtMachineState *vms,
-                        PCIBus *bus)
+static DeviceState *_create_smmu(const VirtMachineState *vms, PCIBus *bus,
+                                 hwaddr base, hwaddr size, int irq,
+                                 uint32_t smmu_phandle)
 {
     char *node;
     const char compat[] = "arm,smmu-v3";
-    int irq =  vms->irqmap[VIRT_SMMU];
     int i;
-    hwaddr base = vms->memmap[VIRT_SMMU].base;
-    hwaddr size = vms->memmap[VIRT_SMMU].size;
     const char irq_names[] = "eventq\0priq\0cmdq-sync\0gerror";
     DeviceState *dev;
     MachineState *ms = MACHINE(vms);
 
-    if (!virt_has_smmuv3(vms) || !vms->iommu_phandle) {
-        return;
+    if (!virt_has_smmuv3(vms) || !smmu_phandle) {
+        return NULL;
     }
 
     dev = qdev_new(TYPE_ARM_SMMUV3);
@@ -1436,8 +1438,31 @@ static void create_smmu(const VirtMachineState *vms,
 
     qemu_fdt_setprop_cell(ms->fdt, node, "#iommu-cells", 1);
 
-    qemu_fdt_setprop_cell(ms->fdt, node, "phandle", vms->iommu_phandle);
+    qemu_fdt_setprop_cell(ms->fdt, node, "phandle", smmu_phandle);
     g_free(node);
+
+    return dev;
+}
+
+static DeviceState *create_smmu(const VirtMachineState *vms, PCIBus *bus)
+{
+    hwaddr base = vms->memmap[VIRT_SMMU].base;
+    hwaddr size = vms->memmap[VIRT_SMMU].size;
+    int irq = vms->irqmap[VIRT_SMMU];
+
+    return _create_smmu(vms, bus, base, size, irq, vms->iommu_phandle);
+}
+
+static DeviceState *create_nested_smmu(const VirtMachineState *vms, PCIBus *bus,
+                                       int i)
+{
+    hwaddr base = vms->memmap[VIRT_NESTED_SMMU].base + i * SMMU_IO_LEN;
+    int irq = vms->irqmap[VIRT_SMMU] + i * NUM_SMMU_IRQS;
+    hwaddr size = SMMU_IO_LEN;
+    DeviceState *dev;
+
+    dev = _create_smmu(vms, bus, base, size, irq, vms->nested_smmu_phandle[i]);
+    return dev;
 }
 
 static void create_virtio_iommu_dt_bindings(VirtMachineState *vms)
@@ -1466,6 +1491,48 @@ static void create_virtio_iommu_dt_bindings(VirtMachineState *vms)
                            bdf + 1, vms->iommu_phandle, bdf + 1, 0xffff - bdf);
 }
 
+/*
+ * FIXME this is used to reverse for hotplug devices, yet it could result in a
+ * big waste of PCI bus numbners.
+ */
+#define MAX_DEV_PER_NESTED_SMMU (4)
+
+static PCIBus *create_pcie_expander_bridge(VirtMachineState *vms, uint8_t idx)
+{
+    /* Total = PXB + MAX_DEV_PER_NESTED_SMMU x (Rootport + device) */
+    const uint8_t total_bus_nums = 1 + MAX_DEV_PER_NESTED_SMMU * 2;
+    uint8_t bus_nr = PCI_BUS_MAX -
+                     (vms->num_nested_smmus - idx) * total_bus_nums;
+
+    VirtNestedSmmu *nested_smmu = find_nested_smmu_by_index(vms, idx);
+    char *name_pxb = g_strdup_printf("pxb_for_smmu.%d", idx);
+    PCIBus *bus = vms->bus;
+    DeviceState *dev;
+
+    /* Create an expander bridge */
+    dev = qdev_new("pxb-pcie");
+    if (!qdev_set_id(dev, name_pxb, &error_fatal)) {
+        return NULL;
+    }
+
+    qdev_prop_set_uint8(dev, "bus_nr", bus_nr);
+    qdev_prop_set_uint16(dev, "numa_node", idx);
+    qdev_realize_and_unref(dev, BUS(bus), &error_fatal);
+
+    /* Get the pxb bus */
+    QLIST_FOREACH(bus, &bus->child, sibling) {
+        if (pci_bus_num(bus) == bus_nr) {
+            break;
+        }
+    }
+    g_assert(bus && pci_bus_num(bus) == bus_nr);
+    nested_smmu->pci_bus = bus;
+    nested_smmu->reserved_bus_nums = total_bus_nums;
+
+    /* Return the pxb bus */
+    return bus;
+}
+
 static void create_pcie(VirtMachineState *vms)
 {
     hwaddr base_mmio = vms->memmap[VIRT_PCIE_MMIO].base;
@@ -1580,12 +1647,33 @@ static void create_pcie(VirtMachineState *vms)
     qemu_fdt_setprop_cell(ms->fdt, nodename, "#interrupt-cells", 1);
     create_pcie_irq_map(ms, vms->gic_phandle, irq, nodename);
 
-    if (vms->iommu) {
+    /* Build PCI Expander Bridge + Root Port from the top of PCI_BUS_MAX */
+    if (vms->num_nested_smmus) {
+        /* VIRT_NESTED_SMMU must hold all vSMMUs */
+        g_assert(vms->num_nested_smmus <=
+                 vms->memmap[VIRT_NESTED_SMMU].size / SMMU_IO_LEN);
+
+        vms->nested_smmu_phandle = g_new0(uint32_t, vms->num_nested_smmus);
+
+        for (i = 0; i < vms->num_nested_smmus; i++) {
+            DeviceState *smmu_dev;
+            PCIBus *pxb_bus;
+
+            pxb_bus = create_pcie_expander_bridge(vms, i);
+            g_assert(pxb_bus);
+
+            vms->nested_smmu_phandle[i] = qemu_fdt_alloc_phandle(ms->fdt);
+            smmu_dev = create_nested_smmu(vms, pxb_bus, i);
+            g_assert(smmu_dev);
+
+            qemu_fdt_setprop_cells(ms->fdt, nodename, "iommu-map", 0x0,
+                                   vms->nested_smmu_phandle[i], 0x0, 0x10000);
+        }
+    } else if (vms->iommu) {
         vms->iommu_phandle = qemu_fdt_alloc_phandle(ms->fdt);
 
         switch (vms->iommu) {
         case VIRT_IOMMU_SMMUV3:
-        case VIRT_IOMMU_NESTED_SMMUV3:
             create_smmu(vms, vms->bus);
             qemu_fdt_setprop_cells(ms->fdt, nodename, "iommu-map",
                                    0x0, vms->iommu_phandle, 0x0, 0x10000);
diff --git a/include/hw/arm/virt.h b/include/hw/arm/virt.h
index b35c4f62d7..0a3f1ab8b5 100644
--- a/include/hw/arm/virt.h
+++ b/include/hw/arm/virt.h
@@ -63,6 +63,7 @@ enum {
     VIRT_GIC_ITS,
     VIRT_GIC_REDIST,
     VIRT_SMMU,
+    VIRT_NESTED_SMMU,
     VIRT_UART,
     VIRT_MMIO,
     VIRT_RTC,
@@ -141,6 +142,8 @@ struct VirtMachineClass {
 typedef struct VirtNestedSmmu {
     int index;
     char *smmu_node;
+    PCIBus *pci_bus;
+    int reserved_bus_nums;
     QLIST_ENTRY(VirtNestedSmmu) next;
 } VirtNestedSmmu;
 
@@ -179,6 +182,7 @@ struct VirtMachineState {
     uint32_t gic_phandle;
     uint32_t msi_phandle;
     uint32_t iommu_phandle;
+    uint32_t *nested_smmu_phandle;
     int psci_conduit;
     hwaddr highest_gpa;
     DeviceState *gic;
@@ -229,4 +233,17 @@ static inline bool virt_has_smmuv3(const VirtMachineState *vms)
            vms->iommu == VIRT_IOMMU_NESTED_SMMUV3;
 }
 
+static inline VirtNestedSmmu *
+find_nested_smmu_by_index(VirtMachineState *vms, int index)
+{
+    VirtNestedSmmu *nested_smmu;
+
+    QLIST_FOREACH(nested_smmu, &vms->nested_smmu_list, next) {
+        if (nested_smmu->index == index) {
+            return nested_smmu;
+        }
+    }
+    return NULL;
+}
+
 #endif /* QEMU_ARM_VIRT_H */
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH RFCv1 05/10] hw/arm/virt: Add VIRT_NESTED_SMMU
  2024-06-26  0:28 ` [PATCH RFCv1 05/10] hw/arm/virt: Add VIRT_NESTED_SMMU Nicolin Chen
@ 2024-07-09 13:26   ` Eric Auger
  2024-07-09 17:59     ` Nicolin Chen
  0 siblings, 1 reply; 25+ messages in thread
From: Eric Auger @ 2024-07-09 13:26 UTC (permalink / raw)
  To: Nicolin Chen, peter.maydell, shannon.zhaosl, mst, imammedo,
	anisinha, peterx
  Cc: qemu-arm, qemu-devel, jgg, shameerali.kolothum.thodi, jasowang,
	Andrea Bolognani

Hi Nicolin,

On 6/26/24 02:28, Nicolin Chen wrote:
> VIRT_SMMU can be used as an emulated SMMU, i.e. iommu=smmuv3. However, the
> MMIO space for VIRT_SMMU isn't large enough to accommodate multiple nested
> SMMUv3 instances. Add a new VIRT_NESTED_SMMU to separate MMIO space.
>
> Meanwhile, a nested SMMUv3 could only work with a vfio-pci device that is
> physically associated to a host-level SMMUv3 instance, meaning that such a
> passthrough deivce must be linked to a nesteed SMMU corresponding to the
> underlying host-level SMMUv3 instance:
>       HW SMMU0 <-----------------> vSMMU0 (nested)
>           ^                             ^
> 	  |                             |
> 	  ----> PCI dev0 & PCI dev1 <----
>       HW SMMU1 <-----------------> vSMMU1 (nested)
>           ^                             ^
> 	  |                             |
> 	  ----> PCI dev2 & PCI dev3 <----
>
> Note that, on the other hand, an emulated deivce should not be associated
> to a nested SMMU.
>
> To prepare for that, create a PCIe Expander Bridge per nested SMMU as a
> docker for pci-vfio devices. It eases the QEMU code to build ID mappings
> in the IORT.
>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
>  hw/arm/virt.c         | 110 +++++++++++++++++++++++++++++++++++++-----
>  include/hw/arm/virt.h |  17 +++++++
>  2 files changed, 116 insertions(+), 11 deletions(-)
>
> diff --git a/hw/arm/virt.c b/hw/arm/virt.c
> index ef59c79ca3..a54332fca8 100644
> --- a/hw/arm/virt.c
> +++ b/hw/arm/virt.c
> @@ -55,6 +55,7 @@
>  #include "qemu/bitops.h"
>  #include "qemu/error-report.h"
>  #include "qemu/module.h"
> +#include "hw/pci/pci_bus.h"
>  #include "hw/pci-host/gpex.h"
>  #include "hw/virtio/virtio-pci.h"
>  #include "hw/core/sysbus-fdt.h"
> @@ -83,6 +84,7 @@
>  #include "hw/virtio/virtio-md-pci.h"
>  #include "hw/virtio/virtio-iommu.h"
>  #include "hw/char/pl011.h"
> +#include "qemu/config-file.h"
>  #include "qemu/guest-random.h"
>  
>  static GlobalProperty arm_virt_compat[] = {
> @@ -177,6 +179,7 @@ static const MemMapEntry base_memmap[] = {
>      [VIRT_PVTIME] =             { 0x090a0000, 0x00010000 },
>      [VIRT_SECURE_GPIO] =        { 0x090b0000, 0x00001000 },
>      [VIRT_MMIO] =               { 0x0a000000, 0x00000200 },
> +    [VIRT_NESTED_SMMU] =        { 0x0b000000, 0x01000000 },
>      /* ...repeating for a total of NUM_VIRTIO_TRANSPORTS, each of that size */
>      [VIRT_PLATFORM_BUS] =       { 0x0c000000, 0x02000000 },
>      [VIRT_SECURE_MEM] =         { 0x0e000000, 0x01000000 },
> @@ -221,7 +224,8 @@ static const int a15irqmap[] = {
>      [VIRT_MMIO] = 16, /* ...to 16 + NUM_VIRTIO_TRANSPORTS - 1 */
>      [VIRT_GIC_V2M] = 48, /* ...to 48 + NUM_GICV2M_SPIS - 1 */
>      [VIRT_SMMU] = 74,    /* ...to 74 + NUM_SMMU_IRQS - 1 */
> -    [VIRT_PLATFORM_BUS] = 112, /* ...to 112 + PLATFORM_BUS_NUM_IRQS -1 */
> +    [VIRT_PLATFORM_BUS] = 112, /* ...to 112 + PLATFORM_BUS_NUM_IRQS -1 (179) */
> +    [VIRT_NESTED_SMMU] = 200, /* Keep it at the end of list */
>  };
>  
>  static void create_randomness(MachineState *ms, const char *node)
> @@ -1383,21 +1387,19 @@ static void create_pcie_irq_map(const MachineState *ms,
>                             0x7           /* PCI irq */);
>  }
>  
> -static void create_smmu(const VirtMachineState *vms,
> -                        PCIBus *bus)
> +static DeviceState *_create_smmu(const VirtMachineState *vms, PCIBus *bus,
> +                                 hwaddr base, hwaddr size, int irq,
> +                                 uint32_t smmu_phandle)
>  {
>      char *node;
>      const char compat[] = "arm,smmu-v3";
> -    int irq =  vms->irqmap[VIRT_SMMU];
>      int i;
> -    hwaddr base = vms->memmap[VIRT_SMMU].base;
> -    hwaddr size = vms->memmap[VIRT_SMMU].size;
>      const char irq_names[] = "eventq\0priq\0cmdq-sync\0gerror";
>      DeviceState *dev;
>      MachineState *ms = MACHINE(vms);
>  
> -    if (!virt_has_smmuv3(vms) || !vms->iommu_phandle) {
> -        return;
> +    if (!virt_has_smmuv3(vms) || !smmu_phandle) {
> +        return NULL;
>      }
>  
>      dev = qdev_new(TYPE_ARM_SMMUV3);
> @@ -1436,8 +1438,31 @@ static void create_smmu(const VirtMachineState *vms,
>  
>      qemu_fdt_setprop_cell(ms->fdt, node, "#iommu-cells", 1);
>  
> -    qemu_fdt_setprop_cell(ms->fdt, node, "phandle", vms->iommu_phandle);
> +    qemu_fdt_setprop_cell(ms->fdt, node, "phandle", smmu_phandle);
>      g_free(node);
> +
> +    return dev;
> +}
> +
> +static DeviceState *create_smmu(const VirtMachineState *vms, PCIBus *bus)
> +{
> +    hwaddr base = vms->memmap[VIRT_SMMU].base;
> +    hwaddr size = vms->memmap[VIRT_SMMU].size;
> +    int irq = vms->irqmap[VIRT_SMMU];
> +
> +    return _create_smmu(vms, bus, base, size, irq, vms->iommu_phandle);
> +}
> +
> +static DeviceState *create_nested_smmu(const VirtMachineState *vms, PCIBus *bus,
> +                                       int i)
> +{
> +    hwaddr base = vms->memmap[VIRT_NESTED_SMMU].base + i * SMMU_IO_LEN;
> +    int irq = vms->irqmap[VIRT_SMMU] + i * NUM_SMMU_IRQS;
> +    hwaddr size = SMMU_IO_LEN;
> +    DeviceState *dev;
> +
> +    dev = _create_smmu(vms, bus, base, size, irq, vms->nested_smmu_phandle[i]);
> +    return dev;
>  }
>  
>  static void create_virtio_iommu_dt_bindings(VirtMachineState *vms)
> @@ -1466,6 +1491,48 @@ static void create_virtio_iommu_dt_bindings(VirtMachineState *vms)
>                             bdf + 1, vms->iommu_phandle, bdf + 1, 0xffff - bdf);
>  }
>  
> +/*
> + * FIXME this is used to reverse for hotplug devices, yet it could result in a
> + * big waste of PCI bus numbners.
> + */
> +#define MAX_DEV_PER_NESTED_SMMU (4)
> +
> +static PCIBus *create_pcie_expander_bridge(VirtMachineState *vms, uint8_t idx)
> +{
> +    /* Total = PXB + MAX_DEV_PER_NESTED_SMMU x (Rootport + device) */
> +    const uint8_t total_bus_nums = 1 + MAX_DEV_PER_NESTED_SMMU * 2;
> +    uint8_t bus_nr = PCI_BUS_MAX -
> +                     (vms->num_nested_smmus - idx) * total_bus_nums;
> +
> +    VirtNestedSmmu *nested_smmu = find_nested_smmu_by_index(vms, idx);
> +    char *name_pxb = g_strdup_printf("pxb_for_smmu.%d", idx);
> +    PCIBus *bus = vms->bus;
> +    DeviceState *dev;
> +
> +    /* Create an expander bridge */
> +    dev = qdev_new("pxb-pcie");
> +    if (!qdev_set_id(dev, name_pxb, &error_fatal)) {
> +        return NULL;
> +    }
> +
> +    qdev_prop_set_uint8(dev, "bus_nr", bus_nr);
> +    qdev_prop_set_uint16(dev, "numa_node", idx);
> +    qdev_realize_and_unref(dev, BUS(bus), &error_fatal);
> +
> +    /* Get the pxb bus */
> +    QLIST_FOREACH(bus, &bus->child, sibling) {
> +        if (pci_bus_num(bus) == bus_nr) {
> +            break;
> +        }
> +    }
> +    g_assert(bus && pci_bus_num(bus) == bus_nr);
> +    nested_smmu->pci_bus = bus;
> +    nested_smmu->reserved_bus_nums = total_bus_nums;
> +
> +    /* Return the pxb bus */
> +    return bus;
> +}
> +
>  static void create_pcie(VirtMachineState *vms)
>  {
>      hwaddr base_mmio = vms->memmap[VIRT_PCIE_MMIO].base;
> @@ -1580,12 +1647,33 @@ static void create_pcie(VirtMachineState *vms)
>      qemu_fdt_setprop_cell(ms->fdt, nodename, "#interrupt-cells", 1);
>      create_pcie_irq_map(ms, vms->gic_phandle, irq, nodename);
>  
> -    if (vms->iommu) {
> +    /* Build PCI Expander Bridge + Root Port from the top of PCI_BUS_MAX */
> +    if (vms->num_nested_smmus) {
> +        /* VIRT_NESTED_SMMU must hold all vSMMUs */
> +        g_assert(vms->num_nested_smmus <=
> +                 vms->memmap[VIRT_NESTED_SMMU].size / SMMU_IO_LEN);
> +
> +        vms->nested_smmu_phandle = g_new0(uint32_t, vms->num_nested_smmus);
> +
> +        for (i = 0; i < vms->num_nested_smmus; i++) {
> +            DeviceState *smmu_dev;
> +            PCIBus *pxb_bus;
> +
> +            pxb_bus = create_pcie_expander_bridge(vms, i);
> +            g_assert(pxb_bus);
> +
> +            vms->nested_smmu_phandle[i] = qemu_fdt_alloc_phandle(ms->fdt);
> +            smmu_dev = create_nested_smmu(vms, pxb_bus, i);
> +            g_assert(smmu_dev);
> +
> +            qemu_fdt_setprop_cells(ms->fdt, nodename, "iommu-map", 0x0,
> +                                   vms->nested_smmu_phandle[i], 0x0, 0x10000);
I think libvirt is supposed to create the pcie bus topology instead and
it shall not be created by qemu behind the scene.
The pcie elements you create here are not visible to libvirt and I guess
they may collide with elements explicitly created by libvirt at a given
pci bdf.

I think it would make more sense to be able to attach an smmu instance
to a given pci root or pxb either by adding an iommu id to a given
pxb-pcie option

-device
pxb-pcie,bus_nr=100,id=pci.12,numa_node=0,bus=pcie.0,addr=0x3,iommu=<id>
or
adding a list of pxb ids to the iommu option. It is unfortunate the
iommu option is a machine option. the smmu being a sysbus device the
platform bus framework could be considered to dynamically allocate them
using the -device option. This has been used along with dt generation
but with ACPI this would need to be studied. However at the time the
smmu was integrated the machine option was prefered.

Maybe using the 1st option would allow to infer that if there are
different iommu ids this implies that several IOMMU instances need to be
created.

 
Adding Andrea in cc

Thanks

Eric



> +        }
> +    } else if (vms->iommu) {
>          vms->iommu_phandle = qemu_fdt_alloc_phandle(ms->fdt);
>  
>          switch (vms->iommu) {
>          case VIRT_IOMMU_SMMUV3:
> -        case VIRT_IOMMU_NESTED_SMMUV3:
>              create_smmu(vms, vms->bus);
>              qemu_fdt_setprop_cells(ms->fdt, nodename, "iommu-map",
>                                     0x0, vms->iommu_phandle, 0x0, 0x10000);
> diff --git a/include/hw/arm/virt.h b/include/hw/arm/virt.h
> index b35c4f62d7..0a3f1ab8b5 100644
> --- a/include/hw/arm/virt.h
> +++ b/include/hw/arm/virt.h
> @@ -63,6 +63,7 @@ enum {
>      VIRT_GIC_ITS,
>      VIRT_GIC_REDIST,
>      VIRT_SMMU,
> +    VIRT_NESTED_SMMU,
>      VIRT_UART,
>      VIRT_MMIO,
>      VIRT_RTC,
> @@ -141,6 +142,8 @@ struct VirtMachineClass {
>  typedef struct VirtNestedSmmu {
>      int index;
>      char *smmu_node;
> +    PCIBus *pci_bus;
> +    int reserved_bus_nums;
>      QLIST_ENTRY(VirtNestedSmmu) next;
>  } VirtNestedSmmu;
>  
> @@ -179,6 +182,7 @@ struct VirtMachineState {
>      uint32_t gic_phandle;
>      uint32_t msi_phandle;
>      uint32_t iommu_phandle;
> +    uint32_t *nested_smmu_phandle;
>      int psci_conduit;
>      hwaddr highest_gpa;
>      DeviceState *gic;
> @@ -229,4 +233,17 @@ static inline bool virt_has_smmuv3(const VirtMachineState *vms)
>             vms->iommu == VIRT_IOMMU_NESTED_SMMUV3;
>  }
>  
> +static inline VirtNestedSmmu *
> +find_nested_smmu_by_index(VirtMachineState *vms, int index)
> +{
> +    VirtNestedSmmu *nested_smmu;
> +
> +    QLIST_FOREACH(nested_smmu, &vms->nested_smmu_list, next) {
> +        if (nested_smmu->index == index) {
> +            return nested_smmu;
> +        }
> +    }
> +    return NULL;
> +}
> +
>  #endif /* QEMU_ARM_VIRT_H */



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFCv1 05/10] hw/arm/virt: Add VIRT_NESTED_SMMU
  2024-07-09 13:26   ` Eric Auger
@ 2024-07-09 17:59     ` Nicolin Chen
  2024-07-11 15:48       ` Andrea Bolognani
  0 siblings, 1 reply; 25+ messages in thread
From: Nicolin Chen @ 2024-07-09 17:59 UTC (permalink / raw)
  To: Eric Auger
  Cc: peter.maydell, shannon.zhaosl, mst, imammedo, anisinha, peterx,
	qemu-arm, qemu-devel, jgg, shameerali.kolothum.thodi, jasowang,
	Andrea Bolognani

On Tue, Jul 09, 2024 at 03:26:58PM +0200, Eric Auger wrote:

> > @@ -1580,12 +1647,33 @@ static void create_pcie(VirtMachineState *vms)
> >      qemu_fdt_setprop_cell(ms->fdt, nodename, "#interrupt-cells", 1);
> >      create_pcie_irq_map(ms, vms->gic_phandle, irq, nodename);
> >
> > -    if (vms->iommu) {
> > +    /* Build PCI Expander Bridge + Root Port from the top of PCI_BUS_MAX */
> > +    if (vms->num_nested_smmus) {
> > +        /* VIRT_NESTED_SMMU must hold all vSMMUs */
> > +        g_assert(vms->num_nested_smmus <=
> > +                 vms->memmap[VIRT_NESTED_SMMU].size / SMMU_IO_LEN);
> > +
> > +        vms->nested_smmu_phandle = g_new0(uint32_t, vms->num_nested_smmus);
> > +
> > +        for (i = 0; i < vms->num_nested_smmus; i++) {
> > +            DeviceState *smmu_dev;
> > +            PCIBus *pxb_bus;
> > +
> > +            pxb_bus = create_pcie_expander_bridge(vms, i);
> > +            g_assert(pxb_bus);
> > +
> > +            vms->nested_smmu_phandle[i] = qemu_fdt_alloc_phandle(ms->fdt);
> > +            smmu_dev = create_nested_smmu(vms, pxb_bus, i);
> > +            g_assert(smmu_dev);
> > +
> > +            qemu_fdt_setprop_cells(ms->fdt, nodename, "iommu-map", 0x0,
> > +                                   vms->nested_smmu_phandle[i], 0x0, 0x10000);
> I think libvirt is supposed to create the pcie bus topology instead and
> it shall not be created by qemu behind the scene.
> The pcie elements you create here are not visible to libvirt and I guess
> they may collide with elements explicitly created by libvirt at a given
> pci bdf.

Yea, the bdf conflict is a concern. So I allocated the bdf list
from the top of the bus number... one of the reasons of doing
this is to ease users so they don't need to deal with the over-
complicated topology. I will try libvirt and see how it goes.

> I think it would make more sense to be able to attach an smmu instance
> to a given pci root or pxb either by adding an iommu id to a given
> pxb-pcie option
> 
> -device
> pxb-pcie,bus_nr=100,id=pci.12,numa_node=0,bus=pcie.0,addr=0x3,iommu=<id>
> or
> adding a list of pxb ids to the iommu option. It is unfortunate the
> iommu option is a machine option. 

Yes. I had thought about that too, but the virt-machine code
creates all the instance at this moment...

> platform bus framework could be considered to dynamically allocate them
> using the -device option. This has been used along with dt generation
> but with ACPI this would need to be studied. However at the time the
> smmu was integrated the machine option was prefered.
> 
> Maybe using the 1st option would allow to infer that if there are
> different iommu ids this implies that several IOMMU instances need to be
> created.

Yea, I like the idea of creating iommu instance with a "-device"
string.

One more question. Let's say we have 2 smmus/pxb-buses:
  [ pxb0] <---> vSMMU0/pSMMU0 [ devA, devB, devC ]
  [ pxb1] <---> vSMMU1/pSMMU1 [ devD, devE, devF ]
How would a user know that devA/devB should be attached to pxb0
without doing like devA->pxb0 and devB->pxb1? Should QEMU just
error out until the user associate them correctly? Or they may
rely on libvirt to figure that out, i.e. moving the iommu node
matching from QEMU to libvirt?

Thanks
Nicolin


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFCv1 05/10] hw/arm/virt: Add VIRT_NESTED_SMMU
  2024-07-09 17:59     ` Nicolin Chen
@ 2024-07-11 15:48       ` Andrea Bolognani
  2024-07-11 17:57         ` Jason Gunthorpe
  0 siblings, 1 reply; 25+ messages in thread
From: Andrea Bolognani @ 2024-07-11 15:48 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Eric Auger, peter.maydell, shannon.zhaosl, mst, imammedo,
	anisinha, peterx, qemu-arm, qemu-devel, jgg,
	shameerali.kolothum.thodi, jasowang

On Tue, Jul 09, 2024 at 10:59:30AM GMT, Nicolin Chen wrote:
> On Tue, Jul 09, 2024 at 03:26:58PM +0200, Eric Auger wrote:
> > > +    /* Build PCI Expander Bridge + Root Port from the top of PCI_BUS_MAX */
> > > +    if (vms->num_nested_smmus) {
> > > +        /* VIRT_NESTED_SMMU must hold all vSMMUs */
> > > +        g_assert(vms->num_nested_smmus <=
> > > +                 vms->memmap[VIRT_NESTED_SMMU].size / SMMU_IO_LEN);
> > > +
> > > +        vms->nested_smmu_phandle = g_new0(uint32_t, vms->num_nested_smmus);
> > > +
> > > +        for (i = 0; i < vms->num_nested_smmus; i++) {
> > > +            DeviceState *smmu_dev;
> > > +            PCIBus *pxb_bus;
> > > +
> > > +            pxb_bus = create_pcie_expander_bridge(vms, i);
> > > +            g_assert(pxb_bus);
> > > +
> > > +            vms->nested_smmu_phandle[i] = qemu_fdt_alloc_phandle(ms->fdt);
> > > +            smmu_dev = create_nested_smmu(vms, pxb_bus, i);
> > > +            g_assert(smmu_dev);
> > > +
> > > +            qemu_fdt_setprop_cells(ms->fdt, nodename, "iommu-map", 0x0,
> > > +                                   vms->nested_smmu_phandle[i], 0x0, 0x10000);
>
> > I think libvirt is supposed to create the pcie bus topology instead and
> > it shall not be created by qemu behind the scene.
> > The pcie elements you create here are not visible to libvirt and I guess
> > they may collide with elements explicitly created by libvirt at a given
> > pci bdf.
>
> Yea, the bdf conflict is a concern. So I allocated the bdf list
> from the top of the bus number... one of the reasons of doing
> this is to ease users so they don't need to deal with the over-
> complicated topology. I will try libvirt and see how it goes.

libvirt developer here. Eric is absolutely right that creating PCI
controllers implicitly will not fly, as libvirt's PCI address
allocation algorithm would not know about them and would produce
incorrect output as a result.

> > I think it would make more sense to be able to attach an smmu instance
> > to a given pci root or pxb either by adding an iommu id to a given
> > pxb-pcie option
> >
> > -device
> > pxb-pcie,bus_nr=100,id=pci.12,numa_node=0,bus=pcie.0,addr=0x3,iommu=<id>
> > or
> > adding a list of pxb ids to the iommu option. It is unfortunate the
> > iommu option is a machine option.
>
> Yes. I had thought about that too, but the virt-machine code
> creates all the instance at this moment...
>
> > platform bus framework could be considered to dynamically allocate them
> > using the -device option. This has been used along with dt generation
> > but with ACPI this would need to be studied. However at the time the
> > smmu was integrated the machine option was prefered.
> >
> > Maybe using the 1st option would allow to infer that if there are
> > different iommu ids this implies that several IOMMU instances need to be
> > created.
>
> Yea, I like the idea of creating iommu instance with a "-device"
> string.

This sounds like the best option to me as well. Incidentally, it's
also how the intel-iommu device works. From libvirt's test suite:

  -device '{"driver":"intel-iommu","id":"iommu0"}'

Looks like a sensible model to follow. You could then associate each
specific IOMMU instance to a PCI controller using an attribute, as
Eric suggested above.

> One more question. Let's say we have 2 smmus/pxb-buses:
>   [ pxb0] <---> vSMMU0/pSMMU0 [ devA, devB, devC ]
>   [ pxb1] <---> vSMMU1/pSMMU1 [ devD, devE, devF ]
> How would a user know that devA/devB should be attached to pxb0
> without doing like devA->pxb0 and devB->pxb1? Should QEMU just
> error out until the user associate them correctly? Or they may
> rely on libvirt to figure that out, i.e. moving the iommu node
> matching from QEMU to libvirt?

IIUC having devices associated to the "correct" SMMU is only a
requirement for optimal performance, not pure functionality, right?
In other words, getting the association wrong will make things slower
but they will keep working.

Is a 1:1 match between pSMMU and vSMMU mandatory? On a system with 4
SMMUs, could I create a VMs with

  vSMMU0 <-> pSMMU0
  vSMMU1 <-> pSMMU1

and another one with

  vSMMU0 <-> pSMMU2
  vSMMU1 <-> pSMMU3

assuming of course that devices are assigned to the correct VM?

How is the association between vSMMU and pSMMU created anyway? Is
that something that the user can control, or is it done automatically
somehow?

-- 
Andrea Bolognani / Red Hat / Virtualization



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFCv1 05/10] hw/arm/virt: Add VIRT_NESTED_SMMU
  2024-07-11 15:48       ` Andrea Bolognani
@ 2024-07-11 17:57         ` Jason Gunthorpe
  0 siblings, 0 replies; 25+ messages in thread
From: Jason Gunthorpe @ 2024-07-11 17:57 UTC (permalink / raw)
  To: Andrea Bolognani
  Cc: Nicolin Chen, Eric Auger, peter.maydell, shannon.zhaosl, mst,
	imammedo, anisinha, peterx, qemu-arm, qemu-devel,
	shameerali.kolothum.thodi, jasowang

On Thu, Jul 11, 2024 at 08:48:10AM -0700, Andrea Bolognani wrote:

> IIUC having devices associated to the "correct" SMMU is only a
> requirement for optimal performance, not pure functionality, right?
> In other words, getting the association wrong will make things slower
> but they will keep working.

As part of declaring the iommu as a device libvirt should also specify
the iommu capabilities and features.

There are capabilities that require perfect 1:1 pIOMMU to vIOMMU
matching, or it can't work.

For instance if the pIOMMU has a direct path for invalidations then
the pIOMMU, pPCI, vIOMMU, and vPCI all must be perfectly matched
because the vIOMMU will deliver invalidations directly to a single
pIOMMU with no possibility to steer them to the correct place in
hypervisor SW.

> Is a 1:1 match between pSMMU and vSMMU mandatory? On a system with 4
> SMMUs, could I create a VMs with
> 
>   vSMMU0 <-> pSMMU0
>   vSMMU1 <-> pSMMU1
> 
> and another one with
> 
>   vSMMU0 <-> pSMMU2
>   vSMMU1 <-> pSMMU3

This is fine

> assuming of course that devices are assigned to the correct VM?

But devices on pSMMU0 cannot be assigned to the second VM in above
situations.

> How is the association between vSMMU and pSMMU created anyway? Is
> that something that the user can control, or is it done automatically
> somehow?

From a kernel side VIOMMU will be created in IOMMUFD and it will get
as input a reference to a VFIO device. The VIOMMU will be linked to
that VFIO device's pIOMMU.

Also creating a dummy vIOMMU without any devices and thus no way to
reach the pIOMMU is definately "tricky", I'm not sure exactly what is
possible there.

Jason

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH RFCv1 06/10] hw/arm/virt: Assign vfio-pci devices to nested SMMUs
  2024-06-26  0:28 [PATCH RFCv1 00/10] hw/arm/virt: Add multiple nested SMMUs Nicolin Chen
                   ` (4 preceding siblings ...)
  2024-06-26  0:28 ` [PATCH RFCv1 05/10] hw/arm/virt: Add VIRT_NESTED_SMMU Nicolin Chen
@ 2024-06-26  0:28 ` Nicolin Chen
  2024-07-09 13:32   ` Eric Auger
  2024-06-26  0:28 ` [PATCH RFCv1 07/10] hw/arm/virt: Bypass iommu for default PCI bus Nicolin Chen
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 25+ messages in thread
From: Nicolin Chen @ 2024-06-26  0:28 UTC (permalink / raw)
  To: peter.maydell, shannon.zhaosl, mst, imammedo, anisinha,
	eric.auger, peterx
  Cc: qemu-arm, qemu-devel, jgg, shameerali.kolothum.thodi, jasowang

With iommu=nested-smmuv3, there could be multiple nested SMMU instances in
the vms. A passthrough device must to look up for its iommu handler in its
sysfs node, and then link to the nested SMMU instance created for the same
iommu handler. This isn't easy to do.

Add an auto-assign piece after all vSMMU backed pxb buses are created. It
loops the existing input devices, and sets/replaces their pci bus numbers
with a newly created pcie-root-port to the pxb bus.

Note that this is not an ideal solution to handle hot plug device.

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 hw/arm/virt.c         | 110 ++++++++++++++++++++++++++++++++++++++++++
 include/hw/arm/virt.h |  13 +++++
 2 files changed, 123 insertions(+)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index a54332fca8..3610f53304 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -38,6 +38,7 @@
 #include "hw/arm/primecell.h"
 #include "hw/arm/virt.h"
 #include "hw/block/flash.h"
+#include "hw/vfio/pci.h"
 #include "hw/vfio/vfio-calxeda-xgmac.h"
 #include "hw/vfio/vfio-amd-xgbe.h"
 #include "hw/display/ramfb.h"
@@ -1491,6 +1492,112 @@ static void create_virtio_iommu_dt_bindings(VirtMachineState *vms)
                            bdf + 1, vms->iommu_phandle, bdf + 1, 0xffff - bdf);
 }
 
+static char *create_new_pcie_port(VirtNestedSmmu *nested_smmu, Error **errp)
+{
+    uint32_t port_nr = nested_smmu->pci_bus->qbus.num_children;
+    uint32_t chassis_nr = UINT8_MAX - nested_smmu->index;
+    uint32_t bus_nr = pci_bus_num(nested_smmu->pci_bus);
+    DeviceState *dev;
+    char *name_port;
+
+    /* Create a root port */
+    dev = qdev_new("pcie-root-port");
+    name_port = g_strdup_printf("smmu_bus0x%x_port%d", bus_nr, port_nr);
+
+    if (!qdev_set_id(dev, name_port, &error_fatal)) {
+        /* FIXME retry with a different port num? */
+        error_setg(errp, "Could not set pcie-root-port ID %s", name_port);
+        g_free(name_port);
+        g_free(dev);
+        return NULL;
+    }
+    qdev_prop_set_uint32(dev, "chassis", chassis_nr);
+    qdev_prop_set_uint32(dev, "slot", port_nr);
+    qdev_prop_set_uint64(dev, "io-reserve", 0);
+    qdev_realize_and_unref(dev, BUS(nested_smmu->pci_bus), &error_fatal);
+    return name_port;
+}
+
+static int assign_nested_smmu(void *opaque, QemuOpts *opts, Error **errp)
+{
+    VirtMachineState *vms = (VirtMachineState *)opaque;
+    const char *sysfsdev = qemu_opt_get(opts, "sysfsdev");
+    const char *iommufd = qemu_opt_get(opts, "iommufd");
+    const char *driver = qemu_opt_get(opts, "driver");
+    const char *host = qemu_opt_get(opts, "host");
+    const char *bus = qemu_opt_get(opts, "bus");
+    VirtNestedSmmu *nested_smmu;
+    char *link_iommu;
+    char *dir_iommu;
+    char *smmu_node;
+    char *name_port;
+    int ret = 0;
+
+    if (!iommufd || !driver) {
+        return 0;
+    }
+    if (!sysfsdev && !host) {
+        return 0;
+    }
+    if (strncmp(driver, TYPE_VFIO_PCI, strlen(TYPE_VFIO_PCI))) {
+        return 0;
+    }
+    /* If the device wants to attach to the default bus, do not reassign it */
+    if (bus && !strncmp(bus, "pcie.0", strlen(bus))) {
+        return 0;
+    }
+
+    if (sysfsdev) {
+        link_iommu = g_strdup_printf("%s/iommu", sysfsdev);
+    } else {
+        link_iommu = g_strdup_printf("/sys/bus/pci/devices/%s/iommu", host);
+    }
+
+    dir_iommu = realpath(link_iommu, NULL);
+    if (!dir_iommu) {
+        error_setg(errp, "Could not get the real path for iommu link: %s",
+                   link_iommu);
+        ret = -EINVAL;
+        goto free_link;
+    }
+
+    smmu_node = g_path_get_basename(dir_iommu);
+    if (!smmu_node) {
+        error_setg(errp, "Could not get SMMU node name for iommu at: %s",
+                   dir_iommu);
+        ret = -EINVAL;
+        goto free_dir;
+    }
+
+    nested_smmu = find_nested_smmu_by_sysfs(vms, smmu_node);
+    if (!nested_smmu) {
+        error_setg(errp, "Could not find any detected SMMU matching node: %s",
+                   smmu_node);
+        ret = -EINVAL;
+        goto free_node;
+    }
+
+    name_port = create_new_pcie_port(nested_smmu, errp);
+    if (!name_port) {
+        ret = -EBUSY;
+        goto free_node;
+    }
+
+    qemu_opt_set(opts, "bus", name_port, &error_fatal);
+    if (bus) {
+        error_report("overriding PCI bus %s to %s for device %s [%s]",
+                     bus, name_port, host, sysfsdev);
+    }
+
+free_node:
+    free(smmu_node);
+free_dir:
+    free(dir_iommu);
+free_link:
+    free(link_iommu);
+    return ret;
+}
+
 /*
  * FIXME this is used to reverse for hotplug devices, yet it could result in a
  * big waste of PCI bus numbners.
@@ -1669,6 +1776,9 @@ static void create_pcie(VirtMachineState *vms)
             qemu_fdt_setprop_cells(ms->fdt, nodename, "iommu-map", 0x0,
                                    vms->nested_smmu_phandle[i], 0x0, 0x10000);
         }
+
+        qemu_opts_foreach(qemu_find_opts("device"),
+                          assign_nested_smmu, vms, &error_fatal);
     } else if (vms->iommu) {
         vms->iommu_phandle = qemu_fdt_alloc_phandle(ms->fdt);
 
diff --git a/include/hw/arm/virt.h b/include/hw/arm/virt.h
index 0a3f1ab8b5..dfbc4bba3c 100644
--- a/include/hw/arm/virt.h
+++ b/include/hw/arm/virt.h
@@ -246,4 +246,17 @@ find_nested_smmu_by_index(VirtMachineState *vms, int index)
     return NULL;
 }
 
+static inline VirtNestedSmmu *
+find_nested_smmu_by_sysfs(VirtMachineState *vms, char *node)
+{
+    VirtNestedSmmu *nested_smmu;
+
+    QLIST_FOREACH(nested_smmu, &vms->nested_smmu_list, next) {
+        if (!strncmp(nested_smmu->smmu_node, node, strlen(node))) {
+            return nested_smmu;
+        }
+    }
+    return NULL;
+}
+
 #endif /* QEMU_ARM_VIRT_H */
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH RFCv1 06/10] hw/arm/virt: Assign vfio-pci devices to nested SMMUs
  2024-06-26  0:28 ` [PATCH RFCv1 06/10] hw/arm/virt: Assign vfio-pci devices to nested SMMUs Nicolin Chen
@ 2024-07-09 13:32   ` Eric Auger
  0 siblings, 0 replies; 25+ messages in thread
From: Eric Auger @ 2024-07-09 13:32 UTC (permalink / raw)
  To: Nicolin Chen, peter.maydell, shannon.zhaosl, mst, imammedo,
	anisinha, peterx
  Cc: qemu-arm, qemu-devel, jgg, shameerali.kolothum.thodi, jasowang,
	Andrea Bolognani



On 6/26/24 02:28, Nicolin Chen wrote:
> With iommu=nested-smmuv3, there could be multiple nested SMMU instances in
> the vms. A passthrough device must to look up for its iommu handler in its
> sysfs node, and then link to the nested SMMU instance created for the same
> iommu handler. This isn't easy to do.
>
> Add an auto-assign piece after all vSMMU backed pxb buses are created. It
> loops the existing input devices, and sets/replaces their pci bus numbers
> with a newly created pcie-root-port to the pxb bus.
Here again I don't think it is acceptable to create such topology under
the hood. Libvirt shall master the whole PCIe topology.

Eric
>
> Note that this is not an ideal solution to handle hot plug device.
>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
>  hw/arm/virt.c         | 110 ++++++++++++++++++++++++++++++++++++++++++
>  include/hw/arm/virt.h |  13 +++++
>  2 files changed, 123 insertions(+)
>
> diff --git a/hw/arm/virt.c b/hw/arm/virt.c
> index a54332fca8..3610f53304 100644
> --- a/hw/arm/virt.c
> +++ b/hw/arm/virt.c
> @@ -38,6 +38,7 @@
>  #include "hw/arm/primecell.h"
>  #include "hw/arm/virt.h"
>  #include "hw/block/flash.h"
> +#include "hw/vfio/pci.h"
>  #include "hw/vfio/vfio-calxeda-xgmac.h"
>  #include "hw/vfio/vfio-amd-xgbe.h"
>  #include "hw/display/ramfb.h"
> @@ -1491,6 +1492,112 @@ static void create_virtio_iommu_dt_bindings(VirtMachineState *vms)
>                             bdf + 1, vms->iommu_phandle, bdf + 1, 0xffff - bdf);
>  }
>  
> +static char *create_new_pcie_port(VirtNestedSmmu *nested_smmu, Error **errp)
> +{
> +    uint32_t port_nr = nested_smmu->pci_bus->qbus.num_children;
> +    uint32_t chassis_nr = UINT8_MAX - nested_smmu->index;
> +    uint32_t bus_nr = pci_bus_num(nested_smmu->pci_bus);
> +    DeviceState *dev;
> +    char *name_port;
> +
> +    /* Create a root port */
> +    dev = qdev_new("pcie-root-port");
> +    name_port = g_strdup_printf("smmu_bus0x%x_port%d", bus_nr, port_nr);
> +
> +    if (!qdev_set_id(dev, name_port, &error_fatal)) {
> +        /* FIXME retry with a different port num? */
> +        error_setg(errp, "Could not set pcie-root-port ID %s", name_port);
> +        g_free(name_port);
> +        g_free(dev);
> +        return NULL;
> +    }
> +    qdev_prop_set_uint32(dev, "chassis", chassis_nr);
> +    qdev_prop_set_uint32(dev, "slot", port_nr);
> +    qdev_prop_set_uint64(dev, "io-reserve", 0);
> +    qdev_realize_and_unref(dev, BUS(nested_smmu->pci_bus), &error_fatal);
> +    return name_port;
> +}
> +
> +static int assign_nested_smmu(void *opaque, QemuOpts *opts, Error **errp)
> +{
> +    VirtMachineState *vms = (VirtMachineState *)opaque;
> +    const char *sysfsdev = qemu_opt_get(opts, "sysfsdev");
> +    const char *iommufd = qemu_opt_get(opts, "iommufd");
> +    const char *driver = qemu_opt_get(opts, "driver");
> +    const char *host = qemu_opt_get(opts, "host");
> +    const char *bus = qemu_opt_get(opts, "bus");
> +    VirtNestedSmmu *nested_smmu;
> +    char *link_iommu;
> +    char *dir_iommu;
> +    char *smmu_node;
> +    char *name_port;
> +    int ret = 0;
> +
> +    if (!iommufd || !driver) {
> +        return 0;
> +    }
> +    if (!sysfsdev && !host) {
> +        return 0;
> +    }
> +    if (strncmp(driver, TYPE_VFIO_PCI, strlen(TYPE_VFIO_PCI))) {
> +        return 0;
> +    }
> +    /* If the device wants to attach to the default bus, do not reassign it */
> +    if (bus && !strncmp(bus, "pcie.0", strlen(bus))) {
> +        return 0;
> +    }
> +
> +    if (sysfsdev) {
> +        link_iommu = g_strdup_printf("%s/iommu", sysfsdev);
> +    } else {
> +        link_iommu = g_strdup_printf("/sys/bus/pci/devices/%s/iommu", host);
> +    }
> +
> +    dir_iommu = realpath(link_iommu, NULL);
> +    if (!dir_iommu) {
> +        error_setg(errp, "Could not get the real path for iommu link: %s",
> +                   link_iommu);
> +        ret = -EINVAL;
> +        goto free_link;
> +    }
> +
> +    smmu_node = g_path_get_basename(dir_iommu);
> +    if (!smmu_node) {
> +        error_setg(errp, "Could not get SMMU node name for iommu at: %s",
> +                   dir_iommu);
> +        ret = -EINVAL;
> +        goto free_dir;
> +    }
> +
> +    nested_smmu = find_nested_smmu_by_sysfs(vms, smmu_node);
> +    if (!nested_smmu) {
> +        error_setg(errp, "Could not find any detected SMMU matching node: %s",
> +                   smmu_node);
> +        ret = -EINVAL;
> +        goto free_node;
> +    }
> +
> +    name_port = create_new_pcie_port(nested_smmu, errp);
> +    if (!name_port) {
> +        ret = -EBUSY;
> +        goto free_node;
> +    }
> +
> +    qemu_opt_set(opts, "bus", name_port, &error_fatal);
> +    if (bus) {
> +        error_report("overriding PCI bus %s to %s for device %s [%s]",
> +                     bus, name_port, host, sysfsdev);
> +    }
> +
> +free_node:
> +    free(smmu_node);
> +free_dir:
> +    free(dir_iommu);
> +free_link:
> +    free(link_iommu);
> +    return ret;
> +}
> +
>  /*
>   * FIXME this is used to reverse for hotplug devices, yet it could result in a
>   * big waste of PCI bus numbners.
> @@ -1669,6 +1776,9 @@ static void create_pcie(VirtMachineState *vms)
>              qemu_fdt_setprop_cells(ms->fdt, nodename, "iommu-map", 0x0,
>                                     vms->nested_smmu_phandle[i], 0x0, 0x10000);
>          }
> +
> +        qemu_opts_foreach(qemu_find_opts("device"),
> +                          assign_nested_smmu, vms, &error_fatal);
>      } else if (vms->iommu) {
>          vms->iommu_phandle = qemu_fdt_alloc_phandle(ms->fdt);
>  
> diff --git a/include/hw/arm/virt.h b/include/hw/arm/virt.h
> index 0a3f1ab8b5..dfbc4bba3c 100644
> --- a/include/hw/arm/virt.h
> +++ b/include/hw/arm/virt.h
> @@ -246,4 +246,17 @@ find_nested_smmu_by_index(VirtMachineState *vms, int index)
>      return NULL;
>  }
>  
> +static inline VirtNestedSmmu *
> +find_nested_smmu_by_sysfs(VirtMachineState *vms, char *node)
> +{
> +    VirtNestedSmmu *nested_smmu;
> +
> +    QLIST_FOREACH(nested_smmu, &vms->nested_smmu_list, next) {
> +        if (!strncmp(nested_smmu->smmu_node, node, strlen(node))) {
> +            return nested_smmu;
> +        }
> +    }
> +    return NULL;
> +}
> +
>  #endif /* QEMU_ARM_VIRT_H */



^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH RFCv1 07/10] hw/arm/virt: Bypass iommu for default PCI bus
  2024-06-26  0:28 [PATCH RFCv1 00/10] hw/arm/virt: Add multiple nested SMMUs Nicolin Chen
                   ` (5 preceding siblings ...)
  2024-06-26  0:28 ` [PATCH RFCv1 06/10] hw/arm/virt: Assign vfio-pci devices to nested SMMUs Nicolin Chen
@ 2024-06-26  0:28 ` Nicolin Chen
  2024-06-26  0:28 ` [PATCH RFCv1 08/10] hw/arm/virt-acpi-build: Handle reserved bus number of pxb buses Nicolin Chen
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 25+ messages in thread
From: Nicolin Chen @ 2024-06-26  0:28 UTC (permalink / raw)
  To: peter.maydell, shannon.zhaosl, mst, imammedo, anisinha,
	eric.auger, peterx
  Cc: qemu-arm, qemu-devel, jgg, shameerali.kolothum.thodi, jasowang

Now, all passthrough devices that should benefit from the nested SMMUv3
feature are assigned to dedicated pxb buses. So, the default PCI bus can
be only used by emulated devices.

In theory, these emualted devices can be still attached to an emualted
SMMUv3 instance, yet there is no gain doing that. Set the default PCI bus
to bypass iommu, for the maximum performance.

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 hw/arm/virt.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 3610f53304..5e91dc8c3d 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -1708,7 +1708,8 @@ static void create_pcie(VirtMachineState *vms)
     }
 
     pci = PCI_HOST_BRIDGE(dev);
-    pci->bypass_iommu = vms->default_bus_bypass_iommu;
+    /* Default bus used by emulated devices does not go through nested SMMUs */
+    pci->bypass_iommu = vms->default_bus_bypass_iommu || vms->num_nested_smmus;
     vms->bus = pci->bus;
     if (vms->bus) {
         pci_init_nic_devices(pci->bus, mc->default_nic);
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH RFCv1 08/10] hw/arm/virt-acpi-build: Handle reserved bus number of pxb buses
  2024-06-26  0:28 [PATCH RFCv1 00/10] hw/arm/virt: Add multiple nested SMMUs Nicolin Chen
                   ` (6 preceding siblings ...)
  2024-06-26  0:28 ` [PATCH RFCv1 07/10] hw/arm/virt: Bypass iommu for default PCI bus Nicolin Chen
@ 2024-06-26  0:28 ` Nicolin Chen
  2024-06-26  0:28 ` [PATCH RFCv1 09/10] hw/arm/virt-acpi-build: Build IORT with multiple SMMU nodes Nicolin Chen
  2024-06-26  0:28 ` [PATCH RFCv1 10/10] hw/arm/virt-acpi-build: Enable ATS for nested SMMUv3 Nicolin Chen
  9 siblings, 0 replies; 25+ messages in thread
From: Nicolin Chen @ 2024-06-26  0:28 UTC (permalink / raw)
  To: peter.maydell, shannon.zhaosl, mst, imammedo, anisinha,
	eric.auger, peterx
  Cc: qemu-arm, qemu-devel, jgg, shameerali.kolothum.thodi, jasowang

Each pxb bus created for a nested SMMU has a reserved bus number, allowing
a hotplug device to attach to the bus in a later stage.

Read it out to apply to the id_count calculation.

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 hw/arm/virt-acpi-build.c | 28 ++++++++++++++++++++++++----
 include/hw/arm/virt.h    | 13 +++++++++++++
 2 files changed, 37 insertions(+), 4 deletions(-)

diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
index d5e72800f6..91f53f90ca 100644
--- a/hw/arm/virt-acpi-build.c
+++ b/hw/arm/virt-acpi-build.c
@@ -236,6 +236,12 @@ build_iort_id_mapping(GArray *table_data, uint32_t input_base,
     build_append_int_noprefix(table_data, flags, 4); /* Flags */
 }
 
+struct AcpiIortIdMappingVM {
+    VirtMachineState *vms;
+    GArray *smmu_idmaps;
+};
+typedef struct AcpiIortIdMappingVM AcpiIortIdMappingVM;
+
 struct AcpiIortIdMapping {
     uint32_t input_base;
     uint32_t id_count;
@@ -246,21 +252,34 @@ typedef struct AcpiIortIdMapping AcpiIortIdMapping;
 static int
 iort_host_bridges(Object *obj, void *opaque)
 {
-    GArray *idmap_blob = opaque;
+    AcpiIortIdMappingVM *idmap_vm = opaque;
+    VirtMachineState *vms = idmap_vm->vms;
 
     if (object_dynamic_cast(obj, TYPE_PCI_HOST_BRIDGE)) {
         PCIBus *bus = PCI_HOST_BRIDGE(obj)->bus;
 
         if (bus && !pci_bus_bypass_iommu(bus)) {
+            VirtNestedSmmu *nested_smmu = find_nested_smmu_by_pci_bus(vms, bus);
             int min_bus, max_bus;
 
-            pci_bus_range(bus, &min_bus, &max_bus);
+            if (vms->iommu == VIRT_IOMMU_NESTED_SMMUV3) {
+                /* PCI host bridge hehind a nested SMMU has reserved buses */
+                if (nested_smmu) {
+                    min_bus = pci_bus_num(nested_smmu->pci_bus);
+                    max_bus = min_bus + nested_smmu->reserved_bus_nums - 1;
+                } else {
+                    /* Not connected to a nested SMMU */
+                    return 0;
+                }
+            } else {
+                pci_bus_range(bus, &min_bus, &max_bus);
+            }
 
             AcpiIortIdMapping idmap = {
                 .input_base = min_bus << 8,
                 .id_count = (max_bus - min_bus + 1) << 8,
             };
-            g_array_append_val(idmap_blob, idmap);
+            g_array_append_val(idmap_vm->smmu_idmaps, idmap);
         }
     }
 
@@ -331,6 +350,7 @@ build_iort(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
     uint32_t id = 0;
     GArray *smmu_idmaps = g_array_new(false, true, sizeof(AcpiIortIdMapping));
     GArray *its_idmaps = g_array_new(false, true, sizeof(AcpiIortIdMapping));
+    AcpiIortIdMappingVM idmap_vm = { .vms = vms, .smmu_idmaps = smmu_idmaps, };
 
     AcpiTable table = { .sig = "IORT", .rev = 5, .oem_id = vms->oem_id,
                         .oem_table_id = vms->oem_table_id };
@@ -341,7 +361,7 @@ build_iort(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
         AcpiIortIdMapping next_range = {0};
 
         object_child_foreach_recursive(object_get_root(),
-                                       iort_host_bridges, smmu_idmaps);
+                                       iort_host_bridges, &idmap_vm);
 
         nb_nodes = 3; /* RC, ITS, SMMUv3 */
 
diff --git a/include/hw/arm/virt.h b/include/hw/arm/virt.h
index dfbc4bba3c..7ac392eb88 100644
--- a/include/hw/arm/virt.h
+++ b/include/hw/arm/virt.h
@@ -259,4 +259,17 @@ find_nested_smmu_by_sysfs(VirtMachineState *vms, char *node)
     return NULL;
 }
 
+static inline VirtNestedSmmu *
+find_nested_smmu_by_pci_bus(VirtMachineState *vms, PCIBus *pci_bus)
+{
+    VirtNestedSmmu *nested_smmu;
+
+    QLIST_FOREACH(nested_smmu, &vms->nested_smmu_list, next) {
+        if (nested_smmu->pci_bus == pci_bus) {
+            return nested_smmu;
+        }
+    }
+    return NULL;
+}
+
 #endif /* QEMU_ARM_VIRT_H */
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH RFCv1 09/10] hw/arm/virt-acpi-build: Build IORT with multiple SMMU nodes
  2024-06-26  0:28 [PATCH RFCv1 00/10] hw/arm/virt: Add multiple nested SMMUs Nicolin Chen
                   ` (7 preceding siblings ...)
  2024-06-26  0:28 ` [PATCH RFCv1 08/10] hw/arm/virt-acpi-build: Handle reserved bus number of pxb buses Nicolin Chen
@ 2024-06-26  0:28 ` Nicolin Chen
  2024-06-26  0:28 ` [PATCH RFCv1 10/10] hw/arm/virt-acpi-build: Enable ATS for nested SMMUv3 Nicolin Chen
  9 siblings, 0 replies; 25+ messages in thread
From: Nicolin Chen @ 2024-06-26  0:28 UTC (permalink / raw)
  To: peter.maydell, shannon.zhaosl, mst, imammedo, anisinha,
	eric.auger, peterx
  Cc: qemu-arm, qemu-devel, jgg, shameerali.kolothum.thodi, jasowang

There can be multiple PCI buses behind different SMMU nodes. And each pair
should be associated in the IORT table too when building the ID mappings.

Create multiple SMMU nodes if needed, store their offsets in an array.

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 hw/arm/virt-acpi-build.c | 36 ++++++++++++++++++++++++++----------
 1 file changed, 26 insertions(+), 10 deletions(-)

diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
index 91f53f90ca..6d8b9aea42 100644
--- a/hw/arm/virt-acpi-build.c
+++ b/hw/arm/virt-acpi-build.c
@@ -295,7 +295,7 @@ static int iort_idmap_compare(gconstpointer a, gconstpointer b)
 }
 
 static void
-build_iort_rmr_nodes(GArray *table_data, GArray *smmu_idmaps, int smmu_offset, uint32_t *id) {
+build_iort_rmr_nodes(GArray *table_data, GArray *smmu_idmaps, size_t *smmu_offset, uint32_t *id) {
     AcpiIortIdMapping *range;
     int i;
 
@@ -323,7 +323,7 @@ build_iort_rmr_nodes(GArray *table_data, GArray *smmu_idmaps, int smmu_offset, u
         build_append_int_noprefix(table_data, 1 , 4);
         /* Reference to Memory Range Descriptors */
         build_append_int_noprefix(table_data, 28 + ID_MAPPING_ENTRY_SIZE, 4);
-        build_iort_id_mapping(table_data, bdf, range->id_count, smmu_offset, 1);
+        build_iort_id_mapping(table_data, bdf, range->id_count, smmu_offset[i], 1);
 
         /* Table 19 Memory Range Descriptor */
 
@@ -345,25 +345,42 @@ static void
 build_iort(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
 {
     int i, nb_nodes, rc_mapping_count;
-    size_t node_size, smmu_offset = 0;
+    size_t node_size, *smmu_offset;
     AcpiIortIdMapping *idmap;
     uint32_t id = 0;
     GArray *smmu_idmaps = g_array_new(false, true, sizeof(AcpiIortIdMapping));
     GArray *its_idmaps = g_array_new(false, true, sizeof(AcpiIortIdMapping));
     AcpiIortIdMappingVM idmap_vm = { .vms = vms, .smmu_idmaps = smmu_idmaps, };
+    int irq_offset = NUM_SMMU_IRQS;
+    hwaddr offset = SMMU_IO_LEN;
+    int irq, num_smmus = 0;
+    hwaddr base;
 
     AcpiTable table = { .sig = "IORT", .rev = 5, .oem_id = vms->oem_id,
                         .oem_table_id = vms->oem_table_id };
     /* Table 2 The IORT */
     acpi_table_begin(&table, table_data);
 
+    if (vms->num_nested_smmus) {
+        irq = vms->irqmap[VIRT_NESTED_SMMU] + ARM_SPI_BASE;
+        base = vms->memmap[VIRT_NESTED_SMMU].base;
+        num_smmus = vms->num_nested_smmus;
+    } else if (virt_has_smmuv3(vms)) {
+        irq = vms->irqmap[VIRT_SMMU] + ARM_SPI_BASE;
+        base = vms->memmap[VIRT_SMMU].base;
+        num_smmus = 1;
+    }
+    smmu_offset = g_new0(size_t, num_smmus);
+
+    nb_nodes = 2; /* RC, ITS */
+    nb_nodes += num_smmus; /* SMMU nodes */
+
     if (virt_has_smmuv3(vms)) {
         AcpiIortIdMapping next_range = {0};
 
         object_child_foreach_recursive(object_get_root(),
                                        iort_host_bridges, &idmap_vm);
 
-        nb_nodes = 3; /* RC, ITS, SMMUv3 */
 
         /* Sort the smmu idmap by input_base */
         g_array_sort(smmu_idmaps, iort_idmap_compare);
@@ -394,7 +411,6 @@ build_iort(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
 
         rc_mapping_count = smmu_idmaps->len + its_idmaps->len;
     } else {
-        nb_nodes = 2; /* RC, ITS */
         rc_mapping_count = 1;
     }
     /* Number of IORT Nodes */
@@ -416,10 +432,9 @@ build_iort(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
     /* GIC ITS Identifier Array */
     build_append_int_noprefix(table_data, 0 /* MADT translation_id */, 4);
 
-    if (virt_has_smmuv3(vms)) {
-        int irq =  vms->irqmap[VIRT_SMMU] + ARM_SPI_BASE;
+    for (i = 0; i < num_smmus; i++) {
+        smmu_offset[i] = table_data->len - table.table_offset;
 
-        smmu_offset = table_data->len - table.table_offset;
         /* Table 9 SMMUv3 Format */
         build_append_int_noprefix(table_data, 4 /* SMMUv3 */, 1); /* Type */
         node_size =  SMMU_V3_ENTRY_SIZE + ID_MAPPING_ENTRY_SIZE;
@@ -430,12 +445,13 @@ build_iort(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
         /* Reference to ID Array */
         build_append_int_noprefix(table_data, SMMU_V3_ENTRY_SIZE, 4);
         /* Base address */
-        build_append_int_noprefix(table_data, vms->memmap[VIRT_SMMU].base, 8);
+        build_append_int_noprefix(table_data, base + i * offset, 8);
         /* Flags */
         build_append_int_noprefix(table_data, 1 /* COHACC Override */, 4);
         build_append_int_noprefix(table_data, 0, 4); /* Reserved */
         build_append_int_noprefix(table_data, 0, 8); /* VATOS address */
         /* Model */
+        irq += irq_offset;
         build_append_int_noprefix(table_data, 0 /* Generic SMMU-v3 */, 4);
         build_append_int_noprefix(table_data, irq, 4); /* Event */
         build_append_int_noprefix(table_data, irq + 1, 4); /* PRI */
@@ -487,7 +503,7 @@ build_iort(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
             range = &g_array_index(smmu_idmaps, AcpiIortIdMapping, i);
             /* output IORT node is the smmuv3 node */
             build_iort_id_mapping(table_data, range->input_base,
-                                  range->id_count, smmu_offset, 0);
+                                  range->id_count, smmu_offset[i], 0);
         }
 
         /* bypassed RIDs connect to ITS group node directly: RC -> ITS */
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH RFCv1 10/10] hw/arm/virt-acpi-build: Enable ATS for nested SMMUv3
  2024-06-26  0:28 [PATCH RFCv1 00/10] hw/arm/virt: Add multiple nested SMMUs Nicolin Chen
                   ` (8 preceding siblings ...)
  2024-06-26  0:28 ` [PATCH RFCv1 09/10] hw/arm/virt-acpi-build: Build IORT with multiple SMMU nodes Nicolin Chen
@ 2024-06-26  0:28 ` Nicolin Chen
  9 siblings, 0 replies; 25+ messages in thread
From: Nicolin Chen @ 2024-06-26  0:28 UTC (permalink / raw)
  To: peter.maydell, shannon.zhaosl, mst, imammedo, anisinha,
	eric.auger, peterx
  Cc: qemu-arm, qemu-devel, jgg, shameerali.kolothum.thodi, jasowang

For a nested SMMUv3, the ATS capaiblity is decided by the underlying HW,
and then reflected in the IDR0 register of the vSMMU.

The IORT on the other hand could allow it to be always enabled, relying
on the guest-level SMMU kernel driver to disable ATS feature if the ATS
bit isn't set in IDR0.

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 hw/arm/virt-acpi-build.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
index 6d8b9aea42..c4cf1caf22 100644
--- a/hw/arm/virt-acpi-build.c
+++ b/hw/arm/virt-acpi-build.c
@@ -485,7 +485,11 @@ build_iort(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
     /* Table 15 Memory Access Flags */
     build_append_int_noprefix(table_data, 0x3 /* CCA = CPM = DACS = 1 */, 1);
 
-    build_append_int_noprefix(table_data, 0, 4); /* ATS Attribute */
+    if (vms->iommu == VIRT_IOMMU_NESTED_SMMUV3) {
+        build_append_int_noprefix(table_data, 1, 4); /* ATS Attribute */
+    } else {
+        build_append_int_noprefix(table_data, 0, 4); /* ATS Attribute */
+    }
     /* MCFG pci_segment */
     build_append_int_noprefix(table_data, 0, 4); /* PCI Segment number */
 
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2024-07-11 17:58 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-06-26  0:28 [PATCH RFCv1 00/10] hw/arm/virt: Add multiple nested SMMUs Nicolin Chen
2024-06-26  0:28 ` [PATCH RFCv1 01/10] hw/arm/virt-acpi-build: Add IORT RMR regions to handle MSI nested binding Nicolin Chen
2024-06-26  0:28 ` [PATCH RFCv1 02/10] hw/arm/virt: Add iommufd link to virt-machine Nicolin Chen
2024-07-09  9:11   ` Eric Auger
2024-07-09 16:59     ` Nicolin Chen
2024-07-09 17:06       ` Eric Auger
2024-07-09 17:18         ` Nicolin Chen
2024-07-10  2:32           ` Duan, Zhenzhong
2024-06-26  0:28 ` [PATCH RFCv1 03/10] hw/arm/virt: Get the number of host-level SMMUv3 instances Nicolin Chen
2024-07-09  9:20   ` Eric Auger
2024-07-09 17:11     ` Nicolin Chen
2024-07-09 17:22       ` Eric Auger
2024-07-09 18:02         ` Nicolin Chen
2024-06-26  0:28 ` [PATCH RFCv1 04/10] hw/arm/virt: Add an SMMU_IO_LEN macro Nicolin Chen
2024-06-26  0:28 ` [PATCH RFCv1 05/10] hw/arm/virt: Add VIRT_NESTED_SMMU Nicolin Chen
2024-07-09 13:26   ` Eric Auger
2024-07-09 17:59     ` Nicolin Chen
2024-07-11 15:48       ` Andrea Bolognani
2024-07-11 17:57         ` Jason Gunthorpe
2024-06-26  0:28 ` [PATCH RFCv1 06/10] hw/arm/virt: Assign vfio-pci devices to nested SMMUs Nicolin Chen
2024-07-09 13:32   ` Eric Auger
2024-06-26  0:28 ` [PATCH RFCv1 07/10] hw/arm/virt: Bypass iommu for default PCI bus Nicolin Chen
2024-06-26  0:28 ` [PATCH RFCv1 08/10] hw/arm/virt-acpi-build: Handle reserved bus number of pxb buses Nicolin Chen
2024-06-26  0:28 ` [PATCH RFCv1 09/10] hw/arm/virt-acpi-build: Build IORT with multiple SMMU nodes Nicolin Chen
2024-06-26  0:28 ` [PATCH RFCv1 10/10] hw/arm/virt-acpi-build: Enable ATS for nested SMMUv3 Nicolin Chen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).