* [PATCH v5 00/32] hw/arm/virt: Add support for user-creatable accelerated SMMUv3
@ 2025-10-31 10:49 Shameer Kolothum
  2025-10-31 10:49 ` [PATCH v5 01/32] backends/iommufd: Introduce iommufd_backend_alloc_viommu Shameer Kolothum
                   ` (31 more replies)
  0 siblings, 32 replies; 148+ messages in thread
From: Shameer Kolothum @ 2025-10-31 10:49 UTC (permalink / raw)
  To: qemu-arm, qemu-devel
  Cc: eric.auger, peter.maydell, jgg, nicolinc, ddutile, berrange,
	nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
	jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu, kjaju

Hi,

Changes from v4:
  https://lore.kernel.org/qemu-devel/20250929133643.38961-1-skolothumtho@nvidia.com/

 - Addressed feedback from v4 and picked up R-by and T-by tags.
   Thanks to all!
 - Split out the _DSM fix into a separate mini series which has
   already been sent out [0].
 - Introduced a global shared address space aliasing the system
   address space, instead of directly using "address_space_memory" in the
   get_address_space() callback (Patch #6).
 - Fixed pci_find_device() returning NULL in the get_address_space()
   path (Patch #7).
 - Introduced an optional supports_address_space() callback for
   rejecting devices attached to a vIOMMU (Patch #8). This allows us
   to reject emulated endpoints when using SMMUv3 with accel=on.
 - Added BIOS table tests for the IORT revision change.
 - Added support to install vSTE based on SMMUv3 GBPA (Patch #14).
 - Factored out ID register initialization from the reset path so
   that it can be used early in the SMMUv3 accel path for HW
   compatibility checks (Patch #18).
 - GBPA-based vSTE update depends on Nicolin's kernel patch [1].
 - VFIO/IOMMUFD changes depend on Zhenzhong's patches 4/5/8 from the
   pass-through support series [2].

PATCH organization:
 1–25:  Enable accelerated SMMUv3 with a feature set based on the default
        QEMU SMMUv3, including IORT RMR-based MSI support.
 26–28: Add options for specifying the RIL, ATS, and OAS features.
 29–32: Add PASID support, including the related VFIO changes.

Tests:
Performed basic sanity tests on an NVIDIA GRACE platform with GPU device
assignments. A CUDA test application was used to verify the SVA use case.
Further tests are always welcome.

E.g. QEMU command line:

qemu-system-aarch64 -machine virt,gic-version=3,highmem-mmio-size=2T \
-cpu host -smp cpus=4 -m size=16G,slots=2,maxmem=66G -nographic \
-bios QEMU_EFI.fd -object iommufd,id=iommufd0 -enable-kvm \
-object memory-backend-ram,size=8G,id=m0 \
-object memory-backend-ram,size=8G,id=m1 \
-numa node,memdev=m0,cpus=0-3,nodeid=0 -numa node,memdev=m1,nodeid=1 \
-numa node,nodeid=2 -numa node,nodeid=3 -numa node,nodeid=4 -numa node,nodeid=5 \
-numa node,nodeid=6 -numa node,nodeid=7 -numa node,nodeid=8 -numa node,nodeid=9 \
-device pxb-pcie,id=pcie.1,bus_nr=1,bus=pcie.0 \
-device arm-smmuv3,primary-bus=pcie.1,id=smmuv3.0,accel=on,ats=on,ril=off,pasid=on,oas=48 \
-device pcie-root-port,id=pcie.port1,bus=pcie.1,chassis=1,pref64-reserve=512G \
-device vfio-pci,host=0019:06:00.0,rombar=0,id=dev0,iommufd=iommufd0,bus=pcie.port1 \
-object acpi-generic-initiator,id=gi0,pci-dev=dev0,node=2 \
...
-object acpi-generic-initiator,id=gi7,pci-dev=dev0,node=9 \
-device pxb-pcie,id=pcie.2,bus_nr=8,bus=pcie.0 \
-device arm-smmuv3,primary-bus=pcie.2,id=smmuv3.1,accel=on,ats=on,ril=off,pasid=on \
-device pcie-root-port,id=pcie.port2,bus=pcie.2,chassis=2,pref64-reserve=512G \
-device vfio-pci,host=0018:06:00.0,rombar=0,id=dev1,iommufd=iommufd0,bus=pcie.port2 \
-device virtio-blk-device,drive=fs \
-drive file=image.qcow2,index=0,media=disk,format=qcow2,if=none,id=fs \
-net none \
-nographic

A complete branch can be found here:
https://github.com/shamiali2008/qemu-master master-smmuv3-accel-v5

Please take a look and let me know your feedback.

Thanks,
Shameer

[0] https://lore.kernel.org/qemu-devel/20251022080639.243965-1-skolothumtho@nvidia.com/
[1] https://lore.kernel.org/linux-iommu/20251024040551.1711281-1-nicolinc@nvidia.com/
[2] https://lore.kernel.org/qemu-devel/20251024084349.102322-1-zhenzhong.duan@intel.com/

Details from RFCv3 Cover letter:
-------------------------------
https://lore.kernel.org/qemu-devel/20250714155941.22176-1-shameerali.kolothum.thodi@huawei.com/

This patch series introduces initial support for a user-creatable,
accelerated SMMUv3 device (-device arm-smmuv3,accel=on) in QEMU.

This is based on the user-creatable SMMUv3 device series [0].

Why this is needed:

On ARM, to enable vfio-pci pass-through devices in a VM, the host SMMUv3
must be set up in nested translation mode (Stage 1 + Stage 2), with
Stage 1 (S1) controlled by the guest and Stage 2 (S2) managed by the host.

This series introduces an optional accel property for the SMMUv3 device,
indicating that the guest will try to leverage host SMMUv3 features for
acceleration. By default, enabling accel configures the host SMMUv3 in
nested mode to support vfio-pci pass-through.

This new accelerated, user-creatable SMMUv3 device lets you:

 - Set up a VM with multiple SMMUv3s, each tied to a different physical SMMUv3
   on the host. Typically, you’d have multiple PCIe PXB root complexes in the
   VM (one per virtual NUMA node), and each of them can have its own SMMUv3.
   This setup mirrors the host's layout, where each NUMA node has its own
   SMMUv3, and helps build VMs that are better aligned with the host's NUMA
   topology.

 - Reduce invalidation broadcasts and lookups, since the host–guest SMMUv3
   association isolates devices behind different physical SMMUv3s.

 - Simplify handling of host SMMUv3s with differing feature sets.

 - Lay the groundwork for additional capabilities like vCMDQ support.
-------------------------------

Eric Auger (2):
  hw/pci-host/gpex: Allow to generate preserve boot config DSM #5
  hw/arm/virt-acpi-build: Add IORT RMR regions to handle MSI nested
    binding

Nicolin Chen (4):
  backends/iommufd: Introduce iommufd_backend_alloc_viommu
  backends/iommufd: Introduce iommufd_backend_alloc_vdev
  hw/arm/smmuv3-accel: Add set/unset_iommu_device callback
  hw/arm/smmuv3-accel: Add nested vSTE install/uninstall support

Shameer Kolothum (25):
  hw/arm/smmu-common: Factor out common helper functions and export
  hw/arm/smmu-common: Make iommu ops part of SMMUState
  hw/arm/smmuv3-accel: Introduce smmuv3 accel device
  hw/arm/smmuv3-accel: Initialize shared system address space
  hw/pci/pci: Move pci_init_bus_master() after adding device to bus
  hw/pci/pci: Add optional supports_address_space() callback
  hw/pci-bridge/pci_expander_bridge: Move TYPE_PXB_PCIE_DEV to header
  hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints
    with iommufd
  hw/arm/smmuv3: Implement get_viommu_cap() callback
  hw/arm/smmuv3-accel: Install SMMUv3 GBPA based hwpt
  hw/pci/pci: Introduce optional get_msi_address_space() callback
  hw/arm/smmuv3-accel: Make use of get_msi_address_space() callback
  hw/arm/smmuv3-accel: Add support to issue invalidation cmd to host
  hw/arm/smmuv3: Initialize ID registers early during realize()
  hw/arm/smmuv3-accel: Get host SMMUv3 hw info and validate
  hw/arm/virt: Set PCI preserve_config for accel SMMUv3
  tests/qtest/bios-tables-test: Prepare for IORT revision upgrade
  tests/qtest/bios-tables-test: Update IORT blobs after revision upgrade
  hw/arm/smmuv3: Add accel property for SMMUv3 device
  hw/arm/smmuv3-accel: Add a property to specify RIL support
  hw/arm/smmuv3-accel: Add support for ATS
  hw/arm/smmuv3-accel: Add property to specify OAS bits
  backends/iommufd: Retrieve PASID width from
    iommufd_backend_get_device_info()
  Extend get_cap() callback to support PASID
  hw/arm/smmuv3-accel: Add support for PASID enable

Yi Liu (1):
  vfio: Synthesize vPASID capability to VM

 backends/iommufd.c                            |  77 +-
 backends/trace-events                         |   2 +
 hw/arm/Kconfig                                |   5 +
 hw/arm/meson.build                            |   3 +-
 hw/arm/smmu-common.c                          |  51 +-
 hw/arm/smmuv3-accel.c                         | 763 ++++++++++++++++++
 hw/arm/smmuv3-accel.h                         |  92 +++
 hw/arm/smmuv3-internal.h                      |  29 +-
 hw/arm/smmuv3.c                               | 161 +++-
 hw/arm/trace-events                           |   6 +
 hw/arm/virt-acpi-build.c                      | 128 ++-
 hw/arm/virt.c                                 |  31 +-
 hw/i386/intel_iommu.c                         |   5 +-
 hw/pci-bridge/pci_expander_bridge.c           |   1 -
 hw/pci-host/gpex-acpi.c                       |  29 +-
 hw/pci/pci.c                                  |  44 +-
 hw/vfio/container-legacy.c                    |   8 +-
 hw/vfio/iommufd.c                             |   7 +-
 hw/vfio/pci.c                                 |  37 +
 include/hw/arm/smmu-common.h                  |   7 +
 include/hw/arm/smmuv3.h                       |   9 +
 include/hw/arm/virt.h                         |   1 +
 include/hw/iommu.h                            |   1 +
 include/hw/pci-host/gpex.h                    |   1 +
 include/hw/pci/pci.h                          |  33 +
 include/hw/pci/pci_bridge.h                   |   1 +
 include/system/host_iommu_device.h            |  17 +-
 include/system/iommufd.h                      |  29 +-
 target/arm/kvm.c                              |   2 +-
 tests/data/acpi/aarch64/virt/IORT             | Bin 128 -> 128 bytes
 tests/data/acpi/aarch64/virt/IORT.its_off     | Bin 172 -> 172 bytes
 tests/data/acpi/aarch64/virt/IORT.smmuv3-dev  | Bin 364 -> 364 bytes
 .../data/acpi/aarch64/virt/IORT.smmuv3-legacy | Bin 276 -> 276 bytes
 33 files changed, 1506 insertions(+), 74 deletions(-)
 create mode 100644 hw/arm/smmuv3-accel.c
 create mode 100644 hw/arm/smmuv3-accel.h

-- 
2.43.0




* [PATCH v5 01/32] backends/iommufd: Introduce iommufd_backend_alloc_viommu
  2025-10-31 10:49 [PATCH v5 00/32] hw/arm/virt: Add support for user-creatable accelerated SMMUv3 Shameer Kolothum
@ 2025-10-31 10:49 ` Shameer Kolothum
  2025-10-31 10:49 ` [PATCH v5 02/32] backends/iommufd: Introduce iommufd_backend_alloc_vdev Shameer Kolothum
                   ` (30 subsequent siblings)
  31 siblings, 0 replies; 148+ messages in thread
From: Shameer Kolothum @ 2025-10-31 10:49 UTC (permalink / raw)
  To: qemu-arm, qemu-devel
  Cc: eric.auger, peter.maydell, jgg, nicolinc, ddutile, berrange,
	nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
	jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu, kjaju

From: Nicolin Chen <nicolinc@nvidia.com>

Add a helper to allocate a viommu object.

Also introduce a struct IOMMUFDViommu that can be used later by vendor
IOMMU implementations.
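
For reference, a minimal caller sketch (illustrative only; the vIOMMU type
constant and the stage-2 HWPT plumbing are assumptions here, the real users
appear later in this series):

    static bool example_alloc_viommu(IOMMUFDBackend *be, uint32_t dev_id,
                                     uint32_t s2_hwpt_id, Error **errp)
    {
        uint32_t viommu_id;

        /* Allocate a vIOMMU object parented by the nesting-parent S2 HWPT */
        if (!iommufd_backend_alloc_viommu(be, dev_id,
                                          IOMMU_VIOMMU_TYPE_ARM_SMMUV3,
                                          s2_hwpt_id, &viommu_id, errp)) {
            return false;
        }
        /* viommu_id can now back vDEVICE allocation and cache invalidation */
        return true;
    }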

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
 backends/iommufd.c       | 26 ++++++++++++++++++++++++++
 backends/trace-events    |  1 +
 include/system/iommufd.h | 14 ++++++++++++++
 3 files changed, 41 insertions(+)

diff --git a/backends/iommufd.c b/backends/iommufd.c
index fdfb7c9d67..3d4a4ae736 100644
--- a/backends/iommufd.c
+++ b/backends/iommufd.c
@@ -446,6 +446,32 @@ bool iommufd_backend_invalidate_cache(IOMMUFDBackend *be, uint32_t id,
     return !ret;
 }
 
+bool iommufd_backend_alloc_viommu(IOMMUFDBackend *be, uint32_t dev_id,
+                                  uint32_t viommu_type, uint32_t hwpt_id,
+                                  uint32_t *out_viommu_id, Error **errp)
+{
+    int ret;
+    struct iommu_viommu_alloc alloc_viommu = {
+        .size = sizeof(alloc_viommu),
+        .type = viommu_type,
+        .dev_id = dev_id,
+        .hwpt_id = hwpt_id,
+    };
+
+    ret = ioctl(be->fd, IOMMU_VIOMMU_ALLOC, &alloc_viommu);
+
+    trace_iommufd_backend_alloc_viommu(be->fd, dev_id, viommu_type, hwpt_id,
+                                       alloc_viommu.out_viommu_id, ret);
+    if (ret) {
+        error_setg_errno(errp, errno, "IOMMU_VIOMMU_ALLOC failed");
+        return false;
+    }
+
+    g_assert(out_viommu_id);
+    *out_viommu_id = alloc_viommu.out_viommu_id;
+    return true;
+}
+
 bool host_iommu_device_iommufd_attach_hwpt(HostIOMMUDeviceIOMMUFD *idev,
                                            uint32_t hwpt_id, Error **errp)
 {
diff --git a/backends/trace-events b/backends/trace-events
index 56132d3fd2..01c2d9bde9 100644
--- a/backends/trace-events
+++ b/backends/trace-events
@@ -21,3 +21,4 @@ iommufd_backend_free_id(int iommufd, uint32_t id, int ret) " iommufd=%d id=%d (%
 iommufd_backend_set_dirty(int iommufd, uint32_t hwpt_id, bool start, int ret) " iommufd=%d hwpt=%u enable=%d (%d)"
 iommufd_backend_get_dirty_bitmap(int iommufd, uint32_t hwpt_id, uint64_t iova, uint64_t size, uint64_t page_size, int ret) " iommufd=%d hwpt=%u iova=0x%"PRIx64" size=0x%"PRIx64" page_size=0x%"PRIx64" (%d)"
 iommufd_backend_invalidate_cache(int iommufd, uint32_t id, uint32_t data_type, uint32_t entry_len, uint32_t entry_num, uint32_t done_num, uint64_t data_ptr, int ret) " iommufd=%d id=%u data_type=%u entry_len=%u entry_num=%u done_num=%u data_ptr=0x%"PRIx64" (%d)"
+iommufd_backend_alloc_viommu(int iommufd, uint32_t dev_id, uint32_t type, uint32_t hwpt_id, uint32_t viommu_id, int ret) " iommufd=%d type=%u dev_id=%u hwpt_id=%u viommu_id=%u (%d)"
diff --git a/include/system/iommufd.h b/include/system/iommufd.h
index a659f36a20..11b8413c3f 100644
--- a/include/system/iommufd.h
+++ b/include/system/iommufd.h
@@ -38,6 +38,16 @@ struct IOMMUFDBackend {
     /*< public >*/
 };
 
+/*
+ * Virtual IOMMU object that represents physical IOMMU's virtualization
+ * support
+ */
+typedef struct IOMMUFDViommu {
+    IOMMUFDBackend *iommufd;
+    uint32_t s2_hwpt_id; /* ID of stage 2 HWPT */
+    uint32_t viommu_id;  /* virtual IOMMU ID of allocated object */
+} IOMMUFDViommu;
+
 bool iommufd_backend_connect(IOMMUFDBackend *be, Error **errp);
 void iommufd_backend_disconnect(IOMMUFDBackend *be);
 
@@ -59,6 +69,10 @@ bool iommufd_backend_alloc_hwpt(IOMMUFDBackend *be, uint32_t dev_id,
                                 uint32_t data_type, uint32_t data_len,
                                 void *data_ptr, uint32_t *out_hwpt,
                                 Error **errp);
+bool iommufd_backend_alloc_viommu(IOMMUFDBackend *be, uint32_t dev_id,
+                                  uint32_t viommu_type, uint32_t hwpt_id,
+                                  uint32_t *out_viommu_id, Error **errp);
+
 bool iommufd_backend_set_dirty_tracking(IOMMUFDBackend *be, uint32_t hwpt_id,
                                         bool start, Error **errp);
 bool iommufd_backend_get_dirty_bitmap(IOMMUFDBackend *be, uint32_t hwpt_id,
-- 
2.43.0




* [PATCH v5 02/32] backends/iommufd: Introduce iommufd_backend_alloc_vdev
  2025-10-31 10:49 [PATCH v5 00/32] hw/arm/virt: Add support for user-creatable accelerated SMMUv3 Shameer Kolothum
  2025-10-31 10:49 ` [PATCH v5 01/32] backends/iommufd: Introduce iommufd_backend_alloc_viommu Shameer Kolothum
@ 2025-10-31 10:49 ` Shameer Kolothum
  2025-10-31 10:49 ` [PATCH v5 03/32] hw/arm/smmu-common: Factor out common helper functions and export Shameer Kolothum
                   ` (29 subsequent siblings)
  31 siblings, 0 replies; 148+ messages in thread
From: Shameer Kolothum @ 2025-10-31 10:49 UTC (permalink / raw)
  To: qemu-arm, qemu-devel
  Cc: eric.auger, peter.maydell, jgg, nicolinc, ddutile, berrange,
	nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
	jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu, kjaju

From: Nicolin Chen <nicolinc@nvidia.com>

Add a helper to allocate an iommufd device's virtual device (in user
space), one per vIOMMU instance.

While at it, introduce a struct IOMMUFDVdev for later use by vendor
IOMMU implementations.
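
A minimal caller sketch (illustrative only; the wrapper name is hypothetical,
and for SMMUv3 the virt_id passed in would be the guest StreamID):

    static bool example_alloc_vdev(IOMMUFDBackend *be, uint32_t dev_id,
                                   uint32_t viommu_id, uint32_t vsid,
                                   Error **errp)
    {
        uint32_t vdev_id;

        /* Bind the physical device to the vIOMMU, keyed by its virtual ID */
        if (!iommufd_backend_alloc_vdev(be, dev_id, viommu_id, vsid,
                                        &vdev_id, errp)) {
            return false;
        }
        /* vdev_id identifies the vDEVICE object for later teardown */
        return true;
    }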

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
 backends/iommufd.c       | 27 +++++++++++++++++++++++++++
 backends/trace-events    |  1 +
 include/system/iommufd.h | 12 ++++++++++++
 3 files changed, 40 insertions(+)

diff --git a/backends/iommufd.c b/backends/iommufd.c
index 3d4a4ae736..e68a2c934f 100644
--- a/backends/iommufd.c
+++ b/backends/iommufd.c
@@ -472,6 +472,33 @@ bool iommufd_backend_alloc_viommu(IOMMUFDBackend *be, uint32_t dev_id,
     return true;
 }
 
+bool iommufd_backend_alloc_vdev(IOMMUFDBackend *be, uint32_t dev_id,
+                                uint32_t viommu_id, uint64_t virt_id,
+                                uint32_t *out_vdev_id, Error **errp)
+{
+    int ret;
+    struct iommu_vdevice_alloc alloc_vdev = {
+        .size = sizeof(alloc_vdev),
+        .viommu_id = viommu_id,
+        .dev_id = dev_id,
+        .virt_id = virt_id,
+    };
+
+    ret = ioctl(be->fd, IOMMU_VDEVICE_ALLOC, &alloc_vdev);
+
+    trace_iommufd_backend_alloc_vdev(be->fd, dev_id, viommu_id, virt_id,
+                                     alloc_vdev.out_vdevice_id, ret);
+
+    if (ret) {
+        error_setg_errno(errp, errno, "IOMMU_VDEVICE_ALLOC failed");
+        return false;
+    }
+
+    g_assert(out_vdev_id);
+    *out_vdev_id = alloc_vdev.out_vdevice_id;
+    return true;
+}
+
 bool host_iommu_device_iommufd_attach_hwpt(HostIOMMUDeviceIOMMUFD *idev,
                                            uint32_t hwpt_id, Error **errp)
 {
diff --git a/backends/trace-events b/backends/trace-events
index 01c2d9bde9..8408dc8701 100644
--- a/backends/trace-events
+++ b/backends/trace-events
@@ -22,3 +22,4 @@ iommufd_backend_set_dirty(int iommufd, uint32_t hwpt_id, bool start, int ret) "
 iommufd_backend_get_dirty_bitmap(int iommufd, uint32_t hwpt_id, uint64_t iova, uint64_t size, uint64_t page_size, int ret) " iommufd=%d hwpt=%u iova=0x%"PRIx64" size=0x%"PRIx64" page_size=0x%"PRIx64" (%d)"
 iommufd_backend_invalidate_cache(int iommufd, uint32_t id, uint32_t data_type, uint32_t entry_len, uint32_t entry_num, uint32_t done_num, uint64_t data_ptr, int ret) " iommufd=%d id=%u data_type=%u entry_len=%u entry_num=%u done_num=%u data_ptr=0x%"PRIx64" (%d)"
 iommufd_backend_alloc_viommu(int iommufd, uint32_t dev_id, uint32_t type, uint32_t hwpt_id, uint32_t viommu_id, int ret) " iommufd=%d type=%u dev_id=%u hwpt_id=%u viommu_id=%u (%d)"
+iommufd_backend_alloc_vdev(int iommufd, uint32_t dev_id, uint32_t viommu_id, uint64_t virt_id, uint32_t vdev_id, int ret) " iommufd=%d dev_id=%u viommu_id=%u virt_id=0x%"PRIx64" vdev_id=%u (%d)"
diff --git a/include/system/iommufd.h b/include/system/iommufd.h
index 11b8413c3f..41e216c677 100644
--- a/include/system/iommufd.h
+++ b/include/system/iommufd.h
@@ -48,6 +48,14 @@ typedef struct IOMMUFDViommu {
     uint32_t viommu_id;  /* virtual IOMMU ID of allocated object */
 } IOMMUFDViommu;
 
+/*
+ * Virtual device object for a physical device bound to a vIOMMU.
+ */
+typedef struct IOMMUFDVdev {
+    uint32_t vdevice_id; /* object handle for vDevice */
+    uint32_t virt_id;  /* virtual device ID */
+} IOMMUFDVdev;
+
 bool iommufd_backend_connect(IOMMUFDBackend *be, Error **errp);
 void iommufd_backend_disconnect(IOMMUFDBackend *be);
 
@@ -73,6 +81,10 @@ bool iommufd_backend_alloc_viommu(IOMMUFDBackend *be, uint32_t dev_id,
                                   uint32_t viommu_type, uint32_t hwpt_id,
                                   uint32_t *out_viommu_id, Error **errp);
 
+bool iommufd_backend_alloc_vdev(IOMMUFDBackend *be, uint32_t dev_id,
+                                uint32_t viommu_id, uint64_t virt_id,
+                                uint32_t *out_vdev_id, Error **errp);
+
 bool iommufd_backend_set_dirty_tracking(IOMMUFDBackend *be, uint32_t hwpt_id,
                                         bool start, Error **errp);
 bool iommufd_backend_get_dirty_bitmap(IOMMUFDBackend *be, uint32_t hwpt_id,
-- 
2.43.0




* [PATCH v5 03/32] hw/arm/smmu-common: Factor out common helper functions and export
  2025-10-31 10:49 [PATCH v5 00/32] hw/arm/virt: Add support for user-creatable accelerated SMMUv3 Shameer Kolothum
  2025-10-31 10:49 ` [PATCH v5 01/32] backends/iommufd: Introduce iommufd_backend_alloc_viommu Shameer Kolothum
  2025-10-31 10:49 ` [PATCH v5 02/32] backends/iommufd: Introduce iommufd_backend_alloc_vdev Shameer Kolothum
@ 2025-10-31 10:49 ` Shameer Kolothum
  2025-10-31 10:49 ` [PATCH v5 04/32] hw/arm/smmu-common: Make iommu ops part of SMMUState Shameer Kolothum
                   ` (28 subsequent siblings)
  31 siblings, 0 replies; 148+ messages in thread
From: Shameer Kolothum @ 2025-10-31 10:49 UTC (permalink / raw)
  To: qemu-arm, qemu-devel
  Cc: eric.auger, peter.maydell, jgg, nicolinc, ddutile, berrange,
	nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
	jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu, kjaju

Factor out common helper functions and export them. Subsequent patches
adding SMMUv3 accel support will make use of them.

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
 hw/arm/smmu-common.c         | 44 +++++++++++++++++++++---------------
 include/hw/arm/smmu-common.h |  6 +++++
 2 files changed, 32 insertions(+), 18 deletions(-)

diff --git a/hw/arm/smmu-common.c b/hw/arm/smmu-common.c
index 62a7612184..59d6147ec9 100644
--- a/hw/arm/smmu-common.c
+++ b/hw/arm/smmu-common.c
@@ -847,12 +847,24 @@ SMMUPciBus *smmu_find_smmu_pcibus(SMMUState *s, uint8_t bus_num)
     return NULL;
 }
 
-static AddressSpace *smmu_find_add_as(PCIBus *bus, void *opaque, int devfn)
+void smmu_init_sdev(SMMUState *s, SMMUDevice *sdev, PCIBus *bus, int devfn)
 {
-    SMMUState *s = opaque;
-    SMMUPciBus *sbus = g_hash_table_lookup(s->smmu_pcibus_by_busptr, bus);
-    SMMUDevice *sdev;
     static unsigned int index;
+    g_autofree char *name = g_strdup_printf("%s-%d-%d", s->mrtypename, devfn,
+                                            index++);
+    sdev->smmu = s;
+    sdev->bus = bus;
+    sdev->devfn = devfn;
+
+    memory_region_init_iommu(&sdev->iommu, sizeof(sdev->iommu),
+                             s->mrtypename, OBJECT(s), name, UINT64_MAX);
+    address_space_init(&sdev->as, MEMORY_REGION(&sdev->iommu), name);
+    trace_smmu_add_mr(name);
+}
+
+SMMUPciBus *smmu_get_sbus(SMMUState *s, PCIBus *bus)
+{
+    SMMUPciBus *sbus = g_hash_table_lookup(s->smmu_pcibus_by_busptr, bus);
 
     if (!sbus) {
         sbus = g_malloc0(sizeof(SMMUPciBus) +
@@ -861,23 +873,19 @@ static AddressSpace *smmu_find_add_as(PCIBus *bus, void *opaque, int devfn)
         g_hash_table_insert(s->smmu_pcibus_by_busptr, bus, sbus);
     }
 
+    return sbus;
+}
+
+static AddressSpace *smmu_find_add_as(PCIBus *bus, void *opaque, int devfn)
+{
+    SMMUState *s = opaque;
+    SMMUPciBus *sbus = smmu_get_sbus(s, bus);
+    SMMUDevice *sdev;
+
     sdev = sbus->pbdev[devfn];
     if (!sdev) {
-        char *name = g_strdup_printf("%s-%d-%d", s->mrtypename, devfn, index++);
-
         sdev = sbus->pbdev[devfn] = g_new0(SMMUDevice, 1);
-
-        sdev->smmu = s;
-        sdev->bus = bus;
-        sdev->devfn = devfn;
-
-        memory_region_init_iommu(&sdev->iommu, sizeof(sdev->iommu),
-                                 s->mrtypename,
-                                 OBJECT(s), name, UINT64_MAX);
-        address_space_init(&sdev->as,
-                           MEMORY_REGION(&sdev->iommu), name);
-        trace_smmu_add_mr(name);
-        g_free(name);
+        smmu_init_sdev(s, sdev, bus, devfn);
     }
 
     return &sdev->as;
diff --git a/include/hw/arm/smmu-common.h b/include/hw/arm/smmu-common.h
index 80d0fecfde..d307ddd952 100644
--- a/include/hw/arm/smmu-common.h
+++ b/include/hw/arm/smmu-common.h
@@ -180,6 +180,12 @@ OBJECT_DECLARE_TYPE(SMMUState, SMMUBaseClass, ARM_SMMU)
 /* Return the SMMUPciBus handle associated to a PCI bus number */
 SMMUPciBus *smmu_find_smmu_pcibus(SMMUState *s, uint8_t bus_num);
 
+/* Return the SMMUPciBus handle associated to a PCI bus */
+SMMUPciBus *smmu_get_sbus(SMMUState *s, PCIBus *bus);
+
+/* Initialize SMMUDevice handle associated to a SMMUPciBus */
+void smmu_init_sdev(SMMUState *s, SMMUDevice *sdev, PCIBus *bus, int devfn);
+
 /* Return the stream ID of an SMMU device */
 static inline uint16_t smmu_get_sid(SMMUDevice *sdev)
 {
-- 
2.43.0




* [PATCH v5 04/32] hw/arm/smmu-common: Make iommu ops part of SMMUState
  2025-10-31 10:49 [PATCH v5 00/32] hw/arm/virt: Add support for user-creatable accelerated SMMUv3 Shameer Kolothum
                   ` (2 preceding siblings ...)
  2025-10-31 10:49 ` [PATCH v5 03/32] hw/arm/smmu-common: Factor out common helper functions and export Shameer Kolothum
@ 2025-10-31 10:49 ` Shameer Kolothum
  2025-10-31 10:49 ` [PATCH v5 05/32] hw/arm/smmuv3-accel: Introduce smmuv3 accel device Shameer Kolothum
                   ` (27 subsequent siblings)
  31 siblings, 0 replies; 148+ messages in thread
From: Shameer Kolothum @ 2025-10-31 10:49 UTC (permalink / raw)
  To: qemu-arm, qemu-devel
  Cc: eric.auger, peter.maydell, jgg, nicolinc, ddutile, berrange,
	nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
	jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu, kjaju

Make the iommu ops part of SMMUState and set them to the current default
smmu_ops. No functional change intended. This will allow the SMMUv3 accel
implementation to set different iommu ops later.

Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
 hw/arm/smmu-common.c         | 7 +++++--
 include/hw/arm/smmu-common.h | 1 +
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/hw/arm/smmu-common.c b/hw/arm/smmu-common.c
index 59d6147ec9..4d6516443e 100644
--- a/hw/arm/smmu-common.c
+++ b/hw/arm/smmu-common.c
@@ -952,6 +952,9 @@ static void smmu_base_realize(DeviceState *dev, Error **errp)
         return;
     }
 
+    if (!s->iommu_ops) {
+        s->iommu_ops = &smmu_ops;
+    }
     /*
      * We only allow default PCIe Root Complex(pcie.0) or pxb-pcie based extra
      * root complexes to be associated with SMMU.
@@ -971,9 +974,9 @@ static void smmu_base_realize(DeviceState *dev, Error **errp)
         }
 
         if (s->smmu_per_bus) {
-            pci_setup_iommu_per_bus(pci_bus, &smmu_ops, s);
+            pci_setup_iommu_per_bus(pci_bus, s->iommu_ops, s);
         } else {
-            pci_setup_iommu(pci_bus, &smmu_ops, s);
+            pci_setup_iommu(pci_bus, s->iommu_ops, s);
         }
         return;
     }
diff --git a/include/hw/arm/smmu-common.h b/include/hw/arm/smmu-common.h
index d307ddd952..eebf2f49e2 100644
--- a/include/hw/arm/smmu-common.h
+++ b/include/hw/arm/smmu-common.h
@@ -162,6 +162,7 @@ struct SMMUState {
     uint8_t bus_num;
     PCIBus *primary_bus;
     bool smmu_per_bus; /* SMMU is specific to the primary_bus */
+    const PCIIOMMUOps *iommu_ops;
 };
 
 struct SMMUBaseClass {
-- 
2.43.0




* [PATCH v5 05/32] hw/arm/smmuv3-accel: Introduce smmuv3 accel device
  2025-10-31 10:49 [PATCH v5 00/32] hw/arm/virt: Add support for user-creatable accelerated SMMUv3 Shameer Kolothum
                   ` (3 preceding siblings ...)
  2025-10-31 10:49 ` [PATCH v5 04/32] hw/arm/smmu-common: Make iommu ops part of SMMUState Shameer Kolothum
@ 2025-10-31 10:49 ` Shameer Kolothum
  2025-10-31 10:49 ` [PATCH v5 06/32] hw/arm/smmuv3-accel: Initialize shared system address space Shameer Kolothum
                   ` (26 subsequent siblings)
  31 siblings, 0 replies; 148+ messages in thread
From: Shameer Kolothum @ 2025-10-31 10:49 UTC (permalink / raw)
  To: qemu-arm, qemu-devel
  Cc: eric.auger, peter.maydell, jgg, nicolinc, ddutile, berrange,
	nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
	jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu, kjaju

Set up dedicated PCIIOMMUOps for the accel SMMUv3, since it will need
different callback handling in upcoming patches. This also adds a
CONFIG_ARM_SMMUV3_ACCEL build option so the feature can be disabled
at compile time. Because we now include CONFIG_DEVICES in the header to
check for ARM_SMMUV3_ACCEL, the meson file entry for smmuv3.c needs to
be changed to arm_ss.add.

The “accel” property isn’t user-visible yet; it will be introduced in
a later patch once all the supporting pieces are ready.

Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
 hw/arm/Kconfig          |  5 ++++
 hw/arm/meson.build      |  3 ++-
 hw/arm/smmuv3-accel.c   | 59 +++++++++++++++++++++++++++++++++++++++++
 hw/arm/smmuv3-accel.h   | 27 +++++++++++++++++++
 hw/arm/smmuv3.c         |  5 ++++
 include/hw/arm/smmuv3.h |  3 +++
 6 files changed, 101 insertions(+), 1 deletion(-)
 create mode 100644 hw/arm/smmuv3-accel.c
 create mode 100644 hw/arm/smmuv3-accel.h

diff --git a/hw/arm/Kconfig b/hw/arm/Kconfig
index b44b85f436..3a6dc122ef 100644
--- a/hw/arm/Kconfig
+++ b/hw/arm/Kconfig
@@ -12,6 +12,7 @@ config ARM_VIRT
     select ARM_GIC
     select ACPI
     select ARM_SMMUV3
+    select ARM_SMMUV3_ACCEL
     select GPIO_KEY
     select DEVICE_TREE
     select FW_CFG_DMA
@@ -628,6 +629,10 @@ config FSL_IMX8MP_EVK
 config ARM_SMMUV3
     bool
 
+config ARM_SMMUV3_ACCEL
+    bool
+    depends on ARM_SMMUV3 && IOMMUFD
+
 config FSL_IMX6UL
     bool
     default y
diff --git a/hw/arm/meson.build b/hw/arm/meson.build
index b88b5b06d7..32ec214434 100644
--- a/hw/arm/meson.build
+++ b/hw/arm/meson.build
@@ -62,7 +62,8 @@ arm_common_ss.add(when: 'CONFIG_ARMSSE', if_true: files('armsse.c'))
 arm_common_ss.add(when: 'CONFIG_FSL_IMX7', if_true: files('fsl-imx7.c', 'mcimx7d-sabre.c'))
 arm_common_ss.add(when: 'CONFIG_FSL_IMX8MP', if_true: files('fsl-imx8mp.c'))
 arm_common_ss.add(when: 'CONFIG_FSL_IMX8MP_EVK', if_true: files('imx8mp-evk.c'))
-arm_common_ss.add(when: 'CONFIG_ARM_SMMUV3', if_true: files('smmuv3.c'))
+arm_ss.add(when: 'CONFIG_ARM_SMMUV3', if_true: files('smmuv3.c'))
+arm_ss.add(when: 'CONFIG_ARM_SMMUV3_ACCEL', if_true: files('smmuv3-accel.c'))
 arm_common_ss.add(when: 'CONFIG_FSL_IMX6UL', if_true: files('fsl-imx6ul.c', 'mcimx6ul-evk.c'))
 arm_common_ss.add(when: 'CONFIG_NRF51_SOC', if_true: files('nrf51_soc.c'))
 arm_ss.add(when: 'CONFIG_XEN', if_true: files(
diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
new file mode 100644
index 0000000000..99ef0db8c4
--- /dev/null
+++ b/hw/arm/smmuv3-accel.c
@@ -0,0 +1,59 @@
+/*
+ * Copyright (c) 2025 Huawei Technologies R & D (UK) Ltd
+ * Copyright (C) 2025 NVIDIA
+ * Written by Nicolin Chen, Shameer Kolothum
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#include "qemu/osdep.h"
+
+#include "hw/arm/smmuv3.h"
+#include "smmuv3-accel.h"
+
+static SMMUv3AccelDevice *smmuv3_accel_get_dev(SMMUState *bs, SMMUPciBus *sbus,
+                                               PCIBus *bus, int devfn)
+{
+    SMMUDevice *sdev = sbus->pbdev[devfn];
+    SMMUv3AccelDevice *accel_dev;
+
+    if (sdev) {
+        return container_of(sdev, SMMUv3AccelDevice, sdev);
+    }
+
+    accel_dev = g_new0(SMMUv3AccelDevice, 1);
+    sdev = &accel_dev->sdev;
+
+    sbus->pbdev[devfn] = sdev;
+    smmu_init_sdev(bs, sdev, bus, devfn);
+    return accel_dev;
+}
+
+/*
+ * Find or add an address space for the given PCI device.
+ *
+ * If a device matching @bus and @devfn already exists, return its
+ * corresponding address space. Otherwise, create a new device entry
+ * and initialize address space for it.
+ */
+static AddressSpace *smmuv3_accel_find_add_as(PCIBus *bus, void *opaque,
+                                              int devfn)
+{
+    SMMUState *bs = opaque;
+    SMMUPciBus *sbus = smmu_get_sbus(bs, bus);
+    SMMUv3AccelDevice *accel_dev = smmuv3_accel_get_dev(bs, sbus, bus, devfn);
+    SMMUDevice *sdev = &accel_dev->sdev;
+
+    return &sdev->as;
+}
+
+static const PCIIOMMUOps smmuv3_accel_ops = {
+    .get_address_space = smmuv3_accel_find_add_as,
+};
+
+void smmuv3_accel_init(SMMUv3State *s)
+{
+    SMMUState *bs = ARM_SMMU(s);
+
+    bs->iommu_ops = &smmuv3_accel_ops;
+}
diff --git a/hw/arm/smmuv3-accel.h b/hw/arm/smmuv3-accel.h
new file mode 100644
index 0000000000..0dc6b00d35
--- /dev/null
+++ b/hw/arm/smmuv3-accel.h
@@ -0,0 +1,27 @@
+/*
+ * Copyright (c) 2025 Huawei Technologies R & D (UK) Ltd
+ * Copyright (C) 2025 NVIDIA
+ * Written by Nicolin Chen, Shameer Kolothum
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#ifndef HW_ARM_SMMUV3_ACCEL_H
+#define HW_ARM_SMMUV3_ACCEL_H
+
+#include "hw/arm/smmu-common.h"
+#include CONFIG_DEVICES
+
+typedef struct SMMUv3AccelDevice {
+    SMMUDevice sdev;
+} SMMUv3AccelDevice;
+
+#ifdef CONFIG_ARM_SMMUV3_ACCEL
+void smmuv3_accel_init(SMMUv3State *s);
+#else
+static inline void smmuv3_accel_init(SMMUv3State *s)
+{
+}
+#endif
+
+#endif /* HW_ARM_SMMUV3_ACCEL_H */
diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index bcf8af8dc7..ef991cb7d8 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -32,6 +32,7 @@
 #include "qapi/error.h"
 
 #include "hw/arm/smmuv3.h"
+#include "smmuv3-accel.h"
 #include "smmuv3-internal.h"
 #include "smmu-internal.h"
 
@@ -1882,6 +1883,10 @@ static void smmu_realize(DeviceState *d, Error **errp)
     SysBusDevice *dev = SYS_BUS_DEVICE(d);
     Error *local_err = NULL;
 
+    if (s->accel) {
+        smmuv3_accel_init(s);
+    }
+
     c->parent_realize(d, &local_err);
     if (local_err) {
         error_propagate(errp, local_err);
diff --git a/include/hw/arm/smmuv3.h b/include/hw/arm/smmuv3.h
index d183a62766..bb7076286b 100644
--- a/include/hw/arm/smmuv3.h
+++ b/include/hw/arm/smmuv3.h
@@ -63,6 +63,9 @@ struct SMMUv3State {
     qemu_irq     irq[4];
     QemuMutex mutex;
     char *stage;
+
+    /* SMMU has HW accelerator support for nested S1 + S2 */
+    bool accel;
 };
 
 typedef enum {
-- 
2.43.0




* [PATCH v5 06/32] hw/arm/smmuv3-accel: Initialize shared system address space
  2025-10-31 10:49 [PATCH v5 00/32] hw/arm/virt: Add support for user-creatable accelerated SMMUv3 Shameer Kolothum
                   ` (4 preceding siblings ...)
  2025-10-31 10:49 ` [PATCH v5 05/32] hw/arm/smmuv3-accel: Introduce smmuv3 accel device Shameer Kolothum
@ 2025-10-31 10:49 ` Shameer Kolothum
  2025-10-31 21:10   ` Nicolin Chen
                     ` (2 more replies)
  2025-10-31 10:49 ` [PATCH v5 07/32] hw/pci/pci: Move pci_init_bus_master() after adding device to bus Shameer Kolothum
                   ` (25 subsequent siblings)
  31 siblings, 3 replies; 148+ messages in thread
From: Shameer Kolothum @ 2025-10-31 10:49 UTC (permalink / raw)
  To: qemu-arm, qemu-devel
  Cc: eric.auger, peter.maydell, jgg, nicolinc, ddutile, berrange,
	nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
	jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu, kjaju

To support accelerated SMMUv3 instances, introduce a shared system-wide
AddressSpace (shared_as_sysmem) that aliases the global system memory.
This shared AddressSpace will be used in a subsequent patch for all
vfio-pci devices behind all accelerated SMMUv3 instances within a VM.

Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
 hw/arm/smmuv3-accel.c | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
index 99ef0db8c4..f62b6cf2c9 100644
--- a/hw/arm/smmuv3-accel.c
+++ b/hw/arm/smmuv3-accel.c
@@ -11,6 +11,15 @@
 #include "hw/arm/smmuv3.h"
 #include "smmuv3-accel.h"
 
+/*
+ * The root region aliases the global system memory, and shared_as_sysmem
+ * provides a shared Address Space referencing it. This Address Space is used
+ * by all vfio-pci devices behind all accelerated SMMUv3 instances within a VM.
+ */
+MemoryRegion root;
+MemoryRegion sysmem;
+static AddressSpace *shared_as_sysmem;
+
 static SMMUv3AccelDevice *smmuv3_accel_get_dev(SMMUState *bs, SMMUPciBus *sbus,
                                                PCIBus *bus, int devfn)
 {
@@ -51,9 +60,27 @@ static const PCIIOMMUOps smmuv3_accel_ops = {
     .get_address_space = smmuv3_accel_find_add_as,
 };
 
+static void smmuv3_accel_as_init(SMMUv3State *s)
+{
+
+    if (shared_as_sysmem) {
+        return;
+    }
+
+    memory_region_init(&root, OBJECT(s), "root", UINT64_MAX);
+    memory_region_init_alias(&sysmem, OBJECT(s), "smmuv3-accel-sysmem",
+                             get_system_memory(), 0,
+                             memory_region_size(get_system_memory()));
+    memory_region_add_subregion(&root, 0, &sysmem);
+
+    shared_as_sysmem = g_new0(AddressSpace, 1);
+    address_space_init(shared_as_sysmem, &root, "smmuv3-accel-as-sysmem");
+}
+
 void smmuv3_accel_init(SMMUv3State *s)
 {
     SMMUState *bs = ARM_SMMU(s);
 
     bs->iommu_ops = &smmuv3_accel_ops;
+    smmuv3_accel_as_init(s);
 }
-- 
2.43.0




* [PATCH v5 07/32] hw/pci/pci: Move pci_init_bus_master() after adding device to bus
  2025-10-31 10:49 [PATCH v5 00/32] hw/arm/virt: Add support for user-creatable accelerated SMMUv3 Shameer Kolothum
                   ` (5 preceding siblings ...)
  2025-10-31 10:49 ` [PATCH v5 06/32] hw/arm/smmuv3-accel: Initialize shared system address space Shameer Kolothum
@ 2025-10-31 10:49 ` Shameer Kolothum
  2025-11-03 13:24   ` Jonathan Cameron via
  2025-11-03 16:40   ` Eric Auger
  2025-10-31 10:49 ` [PATCH v5 08/32] hw/pci/pci: Add optional supports_address_space() callback Shameer Kolothum
                   ` (24 subsequent siblings)
  31 siblings, 2 replies; 148+ messages in thread
From: Shameer Kolothum @ 2025-10-31 10:49 UTC (permalink / raw)
  To: qemu-arm, qemu-devel
  Cc: eric.auger, peter.maydell, jgg, nicolinc, ddutile, berrange,
	nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
	jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu, kjaju

During PCI hotplug, in do_pci_register_device(), pci_init_bus_master()
is called before storing the pci_dev pointer in bus->devices[devfn].

This causes a problem if pci_init_bus_master() (via its
get_address_space() callback) attempts to retrieve the device using
pci_find_device(), since the PCI device is not yet visible on the bus.

Fix this by moving the pci_init_bus_master() call to after the device
has been added to bus->devices[devfn].

This prepares for a subsequent patch where the accel SMMUv3
get_address_space() callback retrieves the pci_dev to identify the
attached device type.

No functional change intended.

Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
 hw/pci/pci.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index c9932c87e3..9693d7f10c 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -1370,9 +1370,6 @@ static PCIDevice *do_pci_register_device(PCIDevice *pci_dev,
     pci_dev->bus_master_as.max_bounce_buffer_size =
         pci_dev->max_bounce_buffer_size;
 
-    if (phase_check(PHASE_MACHINE_READY)) {
-        pci_init_bus_master(pci_dev);
-    }
     pci_dev->irq_state = 0;
     pci_config_alloc(pci_dev);
 
@@ -1416,6 +1413,9 @@ static PCIDevice *do_pci_register_device(PCIDevice *pci_dev,
     pci_dev->config_write = config_write;
     bus->devices[devfn] = pci_dev;
     pci_dev->version_id = 2; /* Current pci device vmstate version */
+    if (phase_check(PHASE_MACHINE_READY)) {
+        pci_init_bus_master(pci_dev);
+    }
     return pci_dev;
 }
 
-- 
2.43.0




* [PATCH v5 08/32] hw/pci/pci: Add optional supports_address_space() callback
  2025-10-31 10:49 [PATCH v5 00/32] hw/arm/virt: Add support for user-creatable accelerated SMMUv3 Shameer Kolothum
                   ` (6 preceding siblings ...)
  2025-10-31 10:49 ` [PATCH v5 07/32] hw/pci/pci: Move pci_init_bus_master() after adding device to bus Shameer Kolothum
@ 2025-10-31 10:49 ` Shameer Kolothum
  2025-11-03 13:30   ` Jonathan Cameron via
  2025-11-03 16:47   ` Eric Auger
  2025-10-31 10:49 ` [PATCH v5 09/32] hw/pci-bridge/pci_expander_bridge: Move TYPE_PXB_PCIE_DEV to header Shameer Kolothum
                   ` (23 subsequent siblings)
  31 siblings, 2 replies; 148+ messages in thread
From: Shameer Kolothum @ 2025-10-31 10:49 UTC (permalink / raw)
  To: qemu-arm, qemu-devel
  Cc: eric.auger, peter.maydell, jgg, nicolinc, ddutile, berrange,
	nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
	jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu, kjaju

Introduce an optional supports_address_space() callback in PCIIOMMUOps to
allow a vIOMMU implementation to reject devices that should not be attached
to it.

Currently, get_address_space() is the first and mandatory callback into the
vIOMMU layer, which always returns an address space. For certain setups, such
as hardware accelerated vIOMMUs (e.g. ARM SMMUv3 with accel=on), attaching
emulated endpoint devices is undesirable as it may impact the behavior or
performance of VFIO passthrough devices, for example, by triggering
unnecessary invalidations on the host IOMMU.

The new callback allows a vIOMMU to check and reject unsupported devices
early during PCI device registration.

Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
 hw/pci/pci.c         | 20 ++++++++++++++++++++
 include/hw/pci/pci.h | 17 +++++++++++++++++
 2 files changed, 37 insertions(+)

diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 9693d7f10c..fa9cf5dab2 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -135,6 +135,21 @@ static void pci_set_master(PCIDevice *d, bool enable)
     d->is_master = enable; /* cache the status */
 }
 
+static bool
+pci_device_supports_iommu_address_space(PCIDevice *dev, Error **errp)
+{
+    PCIBus *bus;
+    PCIBus *iommu_bus;
+    int devfn;
+
+    pci_device_get_iommu_bus_devfn(dev, &iommu_bus, &bus, &devfn);
+    if (iommu_bus && iommu_bus->iommu_ops->supports_address_space) {
+        return iommu_bus->iommu_ops->supports_address_space(bus,
+                                iommu_bus->iommu_opaque, devfn, errp);
+    }
+    return true;
+}
+
 static void pci_init_bus_master(PCIDevice *pci_dev)
 {
     AddressSpace *dma_as = pci_device_iommu_address_space(pci_dev);
@@ -1413,6 +1428,11 @@ static PCIDevice *do_pci_register_device(PCIDevice *pci_dev,
     pci_dev->config_write = config_write;
     bus->devices[devfn] = pci_dev;
     pci_dev->version_id = 2; /* Current pci device vmstate version */
+    if (!pci_device_supports_iommu_address_space(pci_dev, errp)) {
+        do_pci_unregister_device(pci_dev);
+        bus->devices[devfn] = NULL;
+        return NULL;
+    }
     if (phase_check(PHASE_MACHINE_READY)) {
         pci_init_bus_master(pci_dev);
     }
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index cf99b5bb68..dfeba8c9bd 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -417,6 +417,23 @@ typedef struct IOMMUPRINotifier {
  * framework for a set of devices on a PCI bus.
  */
 typedef struct PCIIOMMUOps {
+    /**
+     * @supports_address_space: Optional pre-check to determine if a PCI
+     * device can have an IOMMU address space.
+     *
+     * @bus: the #PCIBus being accessed.
+     *
+     * @opaque: the data passed to pci_setup_iommu().
+     *
+     * @devfn: device and function number.
+     *
+     * @errp: pass an Error out only when returning false
+     *
+     * Returns: true if the device can be associated with an IOMMU address
+     * space, false otherwise with errp set.
+     */
+    bool (*supports_address_space)(PCIBus *bus, void *opaque, int devfn,
+                                   Error **errp);
     /**
      * @get_address_space: get the address space for a set of devices
      * on a PCI bus.
-- 
2.43.0




* [PATCH v5 09/32] hw/pci-bridge/pci_expander_bridge: Move TYPE_PXB_PCIE_DEV to header
  2025-10-31 10:49 [PATCH v5 00/32] hw/arm/virt: Add support for user-creatable accelerated SMMUv3 Shameer Kolothum
                   ` (7 preceding siblings ...)
  2025-10-31 10:49 ` [PATCH v5 08/32] hw/pci/pci: Add optional supports_address_space() callback Shameer Kolothum
@ 2025-10-31 10:49 ` Shameer Kolothum
  2025-11-03 13:30   ` Jonathan Cameron via
  2025-11-03 14:25   ` Eric Auger
  2025-10-31 10:49 ` [PATCH v5 10/32] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd Shameer Kolothum
                   ` (22 subsequent siblings)
  31 siblings, 2 replies; 148+ messages in thread
From: Shameer Kolothum @ 2025-10-31 10:49 UTC (permalink / raw)
  To: qemu-arm, qemu-devel
  Cc: eric.auger, peter.maydell, jgg, nicolinc, ddutile, berrange,
	nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
	jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu, kjaju

Move the TYPE_PXB_PCIE_DEV definition to header so that it can be
referenced by other code in subsequent patch.

Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
 hw/pci-bridge/pci_expander_bridge.c | 1 -
 include/hw/pci/pci_bridge.h         | 1 +
 2 files changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/pci-bridge/pci_expander_bridge.c b/hw/pci-bridge/pci_expander_bridge.c
index 1bcceddbc4..a8eb2d2426 100644
--- a/hw/pci-bridge/pci_expander_bridge.c
+++ b/hw/pci-bridge/pci_expander_bridge.c
@@ -48,7 +48,6 @@ struct PXBBus {
     char bus_path[8];
 };
 
-#define TYPE_PXB_PCIE_DEV "pxb-pcie"
 OBJECT_DECLARE_SIMPLE_TYPE(PXBPCIEDev, PXB_PCIE_DEV)
 
 static GList *pxb_dev_list;
diff --git a/include/hw/pci/pci_bridge.h b/include/hw/pci/pci_bridge.h
index a055fd8d32..b61360b900 100644
--- a/include/hw/pci/pci_bridge.h
+++ b/include/hw/pci/pci_bridge.h
@@ -106,6 +106,7 @@ typedef struct PXBPCIEDev {
 
 #define TYPE_PXB_PCIE_BUS "pxb-pcie-bus"
 #define TYPE_PXB_CXL_BUS "pxb-cxl-bus"
+#define TYPE_PXB_PCIE_DEV "pxb-pcie"
 #define TYPE_PXB_DEV "pxb"
 OBJECT_DECLARE_SIMPLE_TYPE(PXBDev, PXB_DEV)
 
-- 
2.43.0




* [PATCH v5 10/32] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd
  2025-10-31 10:49 [PATCH v5 00/32] hw/arm/virt: Add support for user-creatable accelerated SMMUv3 Shameer Kolothum
                   ` (8 preceding siblings ...)
  2025-10-31 10:49 ` [PATCH v5 09/32] hw/pci-bridge/pci_expander_bridge: Move TYPE_PXB_PCIE_DEV to header Shameer Kolothum
@ 2025-10-31 10:49 ` Shameer Kolothum
  2025-11-03 16:51   ` Eric Auger
  2025-10-31 10:49 ` [PATCH v5 11/32] hw/arm/smmuv3: Implement get_viommu_cap() callback Shameer Kolothum
                   ` (21 subsequent siblings)
  31 siblings, 1 reply; 148+ messages in thread
From: Shameer Kolothum @ 2025-10-31 10:49 UTC (permalink / raw)
  To: qemu-arm, qemu-devel
  Cc: eric.auger, peter.maydell, jgg, nicolinc, ddutile, berrange,
	nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
	jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu, kjaju

Accelerated SMMUv3 is only meaningful when a device can leverage the
host SMMUv3 in nested mode (S1+S2 translation). To keep the model
consistent and correct, this mode is restricted to vfio-pci endpoint
devices using the iommufd backend.

Non-endpoint emulated devices such as PCIe root ports and bridges are
also permitted so that vfio-pci devices can be attached beneath them.
All other device types are unsupported in accelerated mode.

Implement the supports_address_space() callback to reject all such
unsupported devices.

This restriction also avoids complications with IOTLB invalidations.
Some TLBI commands (e.g. CMD_TLBI_NH_ASID) lack an associated SID,
making it difficult to trace the originating device. Allowing emulated
endpoints would require invalidating both QEMU’s software IOTLB and the
host’s hardware IOTLB, which can significantly degrade performance.

For vfio-pci devices in nested mode, get_address_space() returns an
address space aliased to system address space so that the VFIO core
can set up the correct stage-2 mappings for guest RAM.

In summary:
 - vfio-pci devices (with an iommufd backend) return an address space
   aliased to the system address space.
 - bridges and root ports return the IOMMU address space.

Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
 hw/arm/smmuv3-accel.c | 66 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 65 insertions(+), 1 deletion(-)

diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
index f62b6cf2c9..550a0496fe 100644
--- a/hw/arm/smmuv3-accel.c
+++ b/hw/arm/smmuv3-accel.c
@@ -7,8 +7,13 @@
  */
 
 #include "qemu/osdep.h"
+#include "qemu/error-report.h"
 
 #include "hw/arm/smmuv3.h"
+#include "hw/pci/pci_bridge.h"
+#include "hw/pci-host/gpex.h"
+#include "hw/vfio/pci.h"
+
 #include "smmuv3-accel.h"
 
 /*
@@ -38,6 +43,41 @@ static SMMUv3AccelDevice *smmuv3_accel_get_dev(SMMUState *bs, SMMUPciBus *sbus,
     return accel_dev;
 }
 
+static bool smmuv3_accel_pdev_allowed(PCIDevice *pdev, bool *vfio_pci)
+{
+
+    if (object_dynamic_cast(OBJECT(pdev), TYPE_PCI_BRIDGE) ||
+        object_dynamic_cast(OBJECT(pdev), TYPE_PXB_PCIE_DEV) ||
+        object_dynamic_cast(OBJECT(pdev), TYPE_GPEX_ROOT_DEVICE)) {
+        return true;
+    } else if ((object_dynamic_cast(OBJECT(pdev), TYPE_VFIO_PCI))) {
+        *vfio_pci = true;
+        if (object_property_get_link(OBJECT(pdev), "iommufd", NULL)) {
+            return true;
+        }
+    }
+    return false;
+}
+
+static bool smmuv3_accel_supports_as(PCIBus *bus, void *opaque, int devfn,
+                                     Error **errp)
+{
+    PCIDevice *pdev = pci_find_device(bus, pci_bus_num(bus), devfn);
+    bool vfio_pci = false;
+
+    if (pdev && !smmuv3_accel_pdev_allowed(pdev, &vfio_pci)) {
+        if (vfio_pci) {
+            error_setg(errp, "vfio-pci endpoint devices without an iommufd "
+                       "backend not allowed when using arm-smmuv3,accel=on");
+
+        } else {
+            error_setg(errp, "Emulated endpoint devices are not allowed when "
+                       "using arm-smmuv3,accel=on");
+        }
+        return false;
+    }
+    return true;
+}
 /*
  * Find or add an address space for the given PCI device.
  *
@@ -48,15 +88,39 @@ static SMMUv3AccelDevice *smmuv3_accel_get_dev(SMMUState *bs, SMMUPciBus *sbus,
 static AddressSpace *smmuv3_accel_find_add_as(PCIBus *bus, void *opaque,
                                               int devfn)
 {
+    PCIDevice *pdev = pci_find_device(bus, pci_bus_num(bus), devfn);
     SMMUState *bs = opaque;
     SMMUPciBus *sbus = smmu_get_sbus(bs, bus);
     SMMUv3AccelDevice *accel_dev = smmuv3_accel_get_dev(bs, sbus, bus, devfn);
     SMMUDevice *sdev = &accel_dev->sdev;
+    bool vfio_pci = false;
 
-    return &sdev->as;
+    if (pdev && !smmuv3_accel_pdev_allowed(pdev, &vfio_pci)) {
+        /* Should never be here: supports_address_space() filters these out */
+        g_assert_not_reached();
+    }
+
+    /*
+     * In the accelerated mode, a vfio-pci device attached via the iommufd
+     * backend must remain in the system address space. Such a device is
+     * always translated by its physical SMMU (using either a stage-2-only
+     * STE or a nested STE), where the parent stage-2 page table is allocated
+     * by the VFIO core to back the system address space.
+     *
+     * Return the shared_as_sysmem aliased to the global system memory in this
+     * case. Sharing address_space_memory also allows devices under different
+     * vSMMU instances in the same VM to reuse a single nesting parent HWPT in
+     * the VFIO core.
+     */
+    if (vfio_pci) {
+        return shared_as_sysmem;
+    } else {
+        return &sdev->as;
+    }
 }
 
 static const PCIIOMMUOps smmuv3_accel_ops = {
+    .supports_address_space = smmuv3_accel_supports_as,
     .get_address_space = smmuv3_accel_find_add_as,
 };
 
-- 
2.43.0




* [PATCH v5 11/32] hw/arm/smmuv3: Implement get_viommu_cap() callback
  2025-10-31 10:49 [PATCH v5 00/32] hw/arm/virt: Add support for user-creatable accelerated SMMUv3 Shameer Kolothum
                   ` (9 preceding siblings ...)
  2025-10-31 10:49 ` [PATCH v5 10/32] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd Shameer Kolothum
@ 2025-10-31 10:49 ` Shameer Kolothum
  2025-11-03 16:55   ` Eric Auger
  2025-10-31 10:49 ` [PATCH v5 12/32] hw/arm/smmuv3-accel: Add set/unset_iommu_device callback Shameer Kolothum
                   ` (20 subsequent siblings)
  31 siblings, 1 reply; 148+ messages in thread
From: Shameer Kolothum @ 2025-10-31 10:49 UTC (permalink / raw)
  To: qemu-arm, qemu-devel
  Cc: eric.auger, peter.maydell, jgg, nicolinc, ddutile, berrange,
	nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
	jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu, kjaju

For accelerated SMMUv3, we need a nesting parent domain. Add the
get_viommu_flags() callback so that the VFIO core can create one.
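
The consumer side of this callback lives in the VFIO/PCI core (from the
dependent series); a minimal sketch of how such a check might look, with an
illustrative function name and using the PCIBus iommu_ops/iommu_opaque
fields:

    static bool example_wants_nesting_parent(PCIBus *iommu_bus)
    {
        uint64_t flags = 0;

        /* Ask the vIOMMU (e.g. accel SMMUv3) for its capability flags */
        if (iommu_bus->iommu_ops && iommu_bus->iommu_ops->get_viommu_flags) {
            flags = iommu_bus->iommu_ops->get_viommu_flags(iommu_bus->iommu_opaque);
        }
        return !!(flags & VIOMMU_FLAG_WANT_NESTING_PARENT);
    }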

Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
 hw/arm/smmuv3-accel.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
index 550a0496fe..a1d672208f 100644
--- a/hw/arm/smmuv3-accel.c
+++ b/hw/arm/smmuv3-accel.c
@@ -10,6 +10,7 @@
 #include "qemu/error-report.h"
 
 #include "hw/arm/smmuv3.h"
+#include "hw/iommu.h"
 #include "hw/pci/pci_bridge.h"
 #include "hw/pci-host/gpex.h"
 #include "hw/vfio/pci.h"
@@ -119,9 +120,21 @@ static AddressSpace *smmuv3_accel_find_add_as(PCIBus *bus, void *opaque,
     }
 }
 
+static uint64_t smmuv3_accel_get_viommu_flags(void *opaque)
+{
+    /*
+     * We return VIOMMU_FLAG_WANT_NESTING_PARENT to inform the VFIO core to
+     * create a nesting parent, which is required for accelerated SMMUv3
+     * support. Real HW nested support should be reported by the host SMMUv3;
+     * if it isn't, the nesting parent allocation will fail in the VFIO core.
+     */
+    return VIOMMU_FLAG_WANT_NESTING_PARENT;
+}
+
 static const PCIIOMMUOps smmuv3_accel_ops = {
     .supports_address_space = smmuv3_accel_supports_as,
     .get_address_space = smmuv3_accel_find_add_as,
+    .get_viommu_flags = smmuv3_accel_get_viommu_flags,
 };
 
 static void smmuv3_accel_as_init(SMMUv3State *s)
-- 
2.43.0




* [PATCH v5 12/32] hw/arm/smmuv3-accel: Add set/unset_iommu_device callback
  2025-10-31 10:49 [PATCH v5 00/32] hw/arm/virt: Add support for user-creatable accelerated SMMUv3 Shameer Kolothum
                   ` (10 preceding siblings ...)
  2025-10-31 10:49 ` [PATCH v5 11/32] hw/arm/smmuv3: Implement get_viommu_cap() callback Shameer Kolothum
@ 2025-10-31 10:49 ` Shameer Kolothum
  2025-10-31 22:02   ` Nicolin Chen
  2025-10-31 10:49 ` [PATCH v5 13/32] hw/arm/smmuv3-accel: Add nested vSTE install/uninstall support Shameer Kolothum
                   ` (19 subsequent siblings)
  31 siblings, 1 reply; 148+ messages in thread
From: Shameer Kolothum @ 2025-10-31 10:49 UTC (permalink / raw)
  To: qemu-arm, qemu-devel
  Cc: eric.auger, peter.maydell, jgg, nicolinc, ddutile, berrange,
	nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
	jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu, kjaju

From: Nicolin Chen <nicolinc@nvidia.com>

Implement the VFIO/PCI callbacks to attach and detach a HostIOMMUDevice
to a vSMMUv3 when accel=on:

 - set_iommu_device(): attach a HostIOMMUDevice to a vIOMMU
 - unset_iommu_device(): detach and release associated resources

In SMMUv3 accel=on mode, the guest SMMUv3 is backed by the host SMMUv3 via
IOMMUFD. A vIOMMU object (created via IOMMU_VIOMMU_ALLOC) provides a per-VM,
security-isolated handle to the physical SMMUv3. Without a vIOMMU, the
vSMMUv3 cannot relay guest operations to the host hardware nor maintain
isolation across VMs or devices. Therefore, set_iommu_device() allocates
a vIOMMU object if one does not already exist.

There are two main points to consider in this implementation:

1) VFIO core allocates and attaches an S2 HWPT that acts as the nesting
   parent for nested HWPTs (IOMMU_DOMAIN_NESTED). This parent HWPT will
   be shared across multiple vSMMU instances within a VM.

2) A device cannot attach directly to a vIOMMU. Instead, it attaches
   through a proxy nested HWPT (IOMMU_DOMAIN_NESTED). Based on the STE
   configuration, there are three types of nested HWPTs: bypass, abort,
   and translate.
    - The bypass and abort proxy HWPTs are pre-allocated. When the SMMUv3
      operates in global abort or bypass mode, as controlled by the GBPA
      register, or the guest issues a bypass or abort vSTE, we attach
      these pre-allocated nested HWPTs.
    - The translate HWPT requires a vDEVICE to be allocated first, since
      invalidations and events depend on a valid vSID.
    - The vDEVICE allocation and actual attach operations for these proxy
      HWPTs are implemented in subsequent patches.

In summary, a device placed behind a vSMMU instance must have a vSID for
a translate vSTE. The bypass and abort vSTEs are pre-allocated as proxy
nested HWPTs and are attached based on the GBPA register. The core-managed
nesting parent S2 HWPT serves as the parent for all nested HWPTs and is
intended to be shared across vSMMU instances within the same VM.

set_iommu_device():
  - Reuse an existing vIOMMU for the same physical SMMU if available.
    If not, allocate a new one using the nesting parent S2 HWPT.
  - Pre-allocate two proxy nested HWPTs (bypass and abort) under the
    vIOMMU.
  - Add the device to the vIOMMU’s device list.

unset_iommu_device():
  - Re-attach device to the nesting parent S2 HWPT.
  - Remove the device from the vIOMMU’s device list.
  - If the list is empty, free the proxy HWPTs (bypass and abort)
    and release the vIOMMU object.

Introduce SMMUv3AccelState, which holds a reference to an SMMUViommu
structure representing a virtual SMMU instance backed by an iommufd
vIOMMU object.
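
For reference, a rough picture of the resulting object relationships (the
names follow the structs added below; this is only a summary of the patch,
not additional code):

    SMMUv3State.s_accel (SMMUv3AccelState)
      -> vsmmu (SMMUViommu, one per vSMMU instance, from IOMMU_VIOMMU_ALLOC)
           viommu.s2_hwpt_id   core-managed nesting parent S2 HWPT
           bypass_hwpt_id      pre-allocated proxy nested HWPT (bypass)
           abort_hwpt_id       pre-allocated proxy nested HWPT (abort)
           device_list         one SMMUv3AccelDevice (idev, vsmmu) per endpoint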

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
 hw/arm/smmuv3-accel.c    | 150 +++++++++++++++++++++++++++++++++++++++
 hw/arm/smmuv3-accel.h    |  22 ++++++
 hw/arm/smmuv3-internal.h |   5 ++
 hw/arm/trace-events      |   4 ++
 include/hw/arm/smmuv3.h  |   1 +
 5 files changed, 182 insertions(+)

diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
index a1d672208f..d4d65299a8 100644
--- a/hw/arm/smmuv3-accel.c
+++ b/hw/arm/smmuv3-accel.c
@@ -8,6 +8,7 @@
 
 #include "qemu/osdep.h"
 #include "qemu/error-report.h"
+#include "trace.h"
 
 #include "hw/arm/smmuv3.h"
 #include "hw/iommu.h"
@@ -15,6 +16,7 @@
 #include "hw/pci-host/gpex.h"
 #include "hw/vfio/pci.h"
 
+#include "smmuv3-internal.h"
 #include "smmuv3-accel.h"
 
 /*
@@ -44,6 +46,151 @@ static SMMUv3AccelDevice *smmuv3_accel_get_dev(SMMUState *bs, SMMUPciBus *sbus,
     return accel_dev;
 }
 
+static bool
+smmuv3_accel_dev_alloc_viommu(SMMUv3AccelDevice *accel_dev,
+                              HostIOMMUDeviceIOMMUFD *idev, Error **errp)
+{
+    struct iommu_hwpt_arm_smmuv3 bypass_data = {
+        .ste = { SMMU_STE_CFG_BYPASS | SMMU_STE_VALID, 0x0ULL },
+    };
+    struct iommu_hwpt_arm_smmuv3 abort_data = {
+        .ste = { SMMU_STE_VALID, 0x0ULL },
+    };
+    SMMUDevice *sdev = &accel_dev->sdev;
+    SMMUState *bs = sdev->smmu;
+    SMMUv3State *s = ARM_SMMUV3(bs);
+    SMMUv3AccelState *s_accel = s->s_accel;
+    uint32_t s2_hwpt_id = idev->hwpt_id;
+    SMMUViommu *vsmmu;
+    uint32_t viommu_id;
+
+    if (s_accel->vsmmu) {
+        accel_dev->vsmmu = s_accel->vsmmu;
+        return true;
+    }
+
+    if (!iommufd_backend_alloc_viommu(idev->iommufd, idev->devid,
+                                      IOMMU_VIOMMU_TYPE_ARM_SMMUV3,
+                                      s2_hwpt_id, &viommu_id, errp)) {
+        return false;
+    }
+
+    vsmmu = g_new0(SMMUViommu, 1);
+    vsmmu->viommu.viommu_id = viommu_id;
+    vsmmu->viommu.s2_hwpt_id = s2_hwpt_id;
+    vsmmu->viommu.iommufd = idev->iommufd;
+
+    /*
+     * Pre-allocate HWPTs for S1 bypass and abort cases. These will be attached
+     * later for guest STEs or GBPAs that require bypass or abort configuration.
+     */
+    if (!iommufd_backend_alloc_hwpt(idev->iommufd, idev->devid, viommu_id,
+                                    0, IOMMU_HWPT_DATA_ARM_SMMUV3,
+                                    sizeof(abort_data), &abort_data,
+                                    &vsmmu->abort_hwpt_id, errp)) {
+        goto free_viommu;
+    }
+
+    if (!iommufd_backend_alloc_hwpt(idev->iommufd, idev->devid, viommu_id,
+                                    0, IOMMU_HWPT_DATA_ARM_SMMUV3,
+                                    sizeof(bypass_data), &bypass_data,
+                                    &vsmmu->bypass_hwpt_id, errp)) {
+        goto free_abort_hwpt;
+    }
+
+    vsmmu->iommufd = idev->iommufd;
+    s_accel->vsmmu = vsmmu;
+    accel_dev->vsmmu = vsmmu;
+    return true;
+
+free_abort_hwpt:
+    iommufd_backend_free_id(idev->iommufd, vsmmu->abort_hwpt_id);
+free_viommu:
+    iommufd_backend_free_id(idev->iommufd, vsmmu->viommu.viommu_id);
+    g_free(vsmmu);
+    return false;
+}
+
+static bool smmuv3_accel_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
+                                          HostIOMMUDevice *hiod, Error **errp)
+{
+    HostIOMMUDeviceIOMMUFD *idev = HOST_IOMMU_DEVICE_IOMMUFD(hiod);
+    SMMUState *bs = opaque;
+    SMMUv3State *s = ARM_SMMUV3(bs);
+    SMMUv3AccelState *s_accel = s->s_accel;
+    SMMUPciBus *sbus = smmu_get_sbus(bs, bus);
+    SMMUv3AccelDevice *accel_dev = smmuv3_accel_get_dev(bs, sbus, bus, devfn);
+    SMMUDevice *sdev = &accel_dev->sdev;
+    uint16_t sid = smmu_get_sid(sdev);
+
+    if (!idev) {
+        return true;
+    }
+
+    if (accel_dev->idev) {
+        if (accel_dev->idev != idev) {
+            error_setg(errp, "Device 0x%x already has an associated IOMMU dev",
+                       sid);
+            return false;
+        }
+        return true;
+    }
+
+    if (!smmuv3_accel_dev_alloc_viommu(accel_dev, idev, errp)) {
+        error_append_hint(errp, "Device 0x%x: Unable to alloc viommu", sid);
+        return false;
+    }
+
+    accel_dev->idev = idev;
+    QLIST_INSERT_HEAD(&s_accel->vsmmu->device_list, accel_dev, next);
+    trace_smmuv3_accel_set_iommu_device(devfn, idev->devid);
+    return true;
+}
+
+static void smmuv3_accel_unset_iommu_device(PCIBus *bus, void *opaque,
+                                            int devfn)
+{
+    SMMUState *bs = opaque;
+    SMMUv3State *s = ARM_SMMUV3(bs);
+    SMMUPciBus *sbus = g_hash_table_lookup(bs->smmu_pcibus_by_busptr, bus);
+    SMMUv3AccelDevice *accel_dev;
+    SMMUViommu *vsmmu;
+    SMMUDevice *sdev;
+    uint16_t sid;
+
+    if (!sbus) {
+        return;
+    }
+
+    sdev = sbus->pbdev[devfn];
+    if (!sdev) {
+        return;
+    }
+
+    sid = smmu_get_sid(sdev);
+    accel_dev = container_of(sdev, SMMUv3AccelDevice, sdev);
+    /* Re-attach the default s2 hwpt id */
+    if (!host_iommu_device_iommufd_attach_hwpt(accel_dev->idev,
+                                               accel_dev->idev->hwpt_id,
+                                               NULL)) {
+        error_report("Unable to attach dev 0x%x to the default HW pagetable",
+                     sid);
+    }
+
+    accel_dev->idev = NULL;
+    QLIST_REMOVE(accel_dev, next);
+    trace_smmuv3_accel_unset_iommu_device(devfn, sid);
+
+    vsmmu = s->s_accel->vsmmu;
+    if (QLIST_EMPTY(&vsmmu->device_list)) {
+        iommufd_backend_free_id(vsmmu->iommufd, vsmmu->bypass_hwpt_id);
+        iommufd_backend_free_id(vsmmu->iommufd, vsmmu->abort_hwpt_id);
+        iommufd_backend_free_id(vsmmu->iommufd, vsmmu->viommu.viommu_id);
+        g_free(vsmmu);
+        s->s_accel->vsmmu = NULL;
+    }
+}
+
 static bool smmuv3_accel_pdev_allowed(PCIDevice *pdev, bool *vfio_pci)
 {
 
@@ -135,6 +282,8 @@ static const PCIIOMMUOps smmuv3_accel_ops = {
     .supports_address_space = smmuv3_accel_supports_as,
     .get_address_space = smmuv3_accel_find_add_as,
     .get_viommu_flags = smmuv3_accel_get_viommu_flags,
+    .set_iommu_device = smmuv3_accel_set_iommu_device,
+    .unset_iommu_device = smmuv3_accel_unset_iommu_device,
 };
 
 static void smmuv3_accel_as_init(SMMUv3State *s)
@@ -160,4 +309,5 @@ void smmuv3_accel_init(SMMUv3State *s)
 
     bs->iommu_ops = &smmuv3_accel_ops;
     smmuv3_accel_as_init(s);
+    s->s_accel = g_new0(SMMUv3AccelState, 1);
 }
diff --git a/hw/arm/smmuv3-accel.h b/hw/arm/smmuv3-accel.h
index 0dc6b00d35..d81f90c32c 100644
--- a/hw/arm/smmuv3-accel.h
+++ b/hw/arm/smmuv3-accel.h
@@ -10,12 +10,34 @@
 #define HW_ARM_SMMUV3_ACCEL_H
 
 #include "hw/arm/smmu-common.h"
+#include "system/iommufd.h"
+#include <linux/iommufd.h>
 #include CONFIG_DEVICES
 
+/*
+ * Represents a virtual SMMU instance backed by an iommufd vIOMMU object.
+ * Holds references to the core iommufd vIOMMU object and to proxy HWPTs
+ * (bypass and abort) used for device attachment.
+ */
+typedef struct SMMUViommu {
+    IOMMUFDBackend *iommufd;
+    IOMMUFDViommu viommu;
+    uint32_t bypass_hwpt_id;
+    uint32_t abort_hwpt_id;
+    QLIST_HEAD(, SMMUv3AccelDevice) device_list;
+} SMMUViommu;
+
 typedef struct SMMUv3AccelDevice {
     SMMUDevice sdev;
+    HostIOMMUDeviceIOMMUFD *idev;
+    SMMUViommu *vsmmu;
+    QLIST_ENTRY(SMMUv3AccelDevice) next;
 } SMMUv3AccelDevice;
 
+typedef struct SMMUv3AccelState {
+    SMMUViommu *vsmmu;
+} SMMUv3AccelState;
+
 #ifdef CONFIG_ARM_SMMUV3_ACCEL
 void smmuv3_accel_init(SMMUv3State *s);
 #else
diff --git a/hw/arm/smmuv3-internal.h b/hw/arm/smmuv3-internal.h
index b6b7399347..03d86cfc5c 100644
--- a/hw/arm/smmuv3-internal.h
+++ b/hw/arm/smmuv3-internal.h
@@ -583,6 +583,11 @@ typedef struct CD {
     ((extract64((x)->word[7], 0, 16) << 32) |           \
      ((x)->word[6] & 0xfffffff0))
 
+#define SMMU_STE_VALID      (1ULL << 0)
+#define SMMU_STE_CFG_BYPASS (1ULL << 3)
+
+#define SMMU_GBPA_ABORT (1UL << 20)
+
 static inline int oas2bits(int oas_field)
 {
     switch (oas_field) {
diff --git a/hw/arm/trace-events b/hw/arm/trace-events
index f3386bd7ae..49c0460f30 100644
--- a/hw/arm/trace-events
+++ b/hw/arm/trace-events
@@ -66,6 +66,10 @@ smmuv3_notify_flag_del(const char *iommu) "DEL SMMUNotifier node for iommu mr=%s
 smmuv3_inv_notifiers_iova(const char *name, int asid, int vmid, uint64_t iova, uint8_t tg, uint64_t num_pages, int stage) "iommu mr=%s asid=%d vmid=%d iova=0x%"PRIx64" tg=%d num_pages=0x%"PRIx64" stage=%d"
 smmu_reset_exit(void) ""
 
+#smmuv3-accel.c
+smmuv3_accel_set_iommu_device(int devfn, uint32_t sid) "devfn=0x%x (idev devid=0x%x)"
+smmuv3_accel_unset_iommu_device(int devfn, uint32_t sid) "devfn=0x%x (idev devid=0x%x)"
+
 # strongarm.c
 strongarm_uart_update_parameters(const char *label, int speed, char parity, int data_bits, int stop_bits) "%s speed=%d parity=%c data=%d stop=%d"
 strongarm_ssp_read_underrun(void) "SSP rx underrun"
diff --git a/include/hw/arm/smmuv3.h b/include/hw/arm/smmuv3.h
index bb7076286b..e54ece2d38 100644
--- a/include/hw/arm/smmuv3.h
+++ b/include/hw/arm/smmuv3.h
@@ -66,6 +66,7 @@ struct SMMUv3State {
 
     /* SMMU has HW accelerator support for nested S1 + s2 */
     bool accel;
+    struct SMMUv3AccelState *s_accel;
 };
 
 typedef enum {
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH v5 13/32] hw/arm/smmuv3-accel: Add nested vSTE install/uninstall support
  2025-10-31 10:49 [PATCH v5 00/32] hw/arm/virt: Add support for user-creatable accelerated SMMUv3 Shameer Kolothum
                   ` (11 preceding siblings ...)
  2025-10-31 10:49 ` [PATCH v5 12/32] hw/arm/smmuv3-accel: Add set/unset_iommu_device callback Shameer Kolothum
@ 2025-10-31 10:49 ` Shameer Kolothum
  2025-10-31 23:52   ` Nicolin Chen
  2025-11-04 11:05   ` Eric Auger
  2025-10-31 10:49 ` [PATCH v5 14/32] hw/arm/smmuv3-accel: Install SMMUv3 GBPA based hwpt Shameer Kolothum
                   ` (18 subsequent siblings)
  31 siblings, 2 replies; 148+ messages in thread
From: Shameer Kolothum @ 2025-10-31 10:49 UTC (permalink / raw)
  To: qemu-arm, qemu-devel
  Cc: eric.auger, peter.maydell, jgg, nicolinc, ddutile, berrange,
	nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
	jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu, kjaju

From: Nicolin Chen <nicolinc@nvidia.com>

A device placed behind a vSMMU instance must have corresponding vSTEs
(bypass, abort, or translate) installed. The bypass and abort proxy nested
HWPTs are pre-allocated.

For the translate HWPT, a vDEVICE object is allocated and associated with
the vIOMMU for each guest device. This allows the host kernel to establish
a virtual SID to physical SID mapping, which is required for handling
invalidations and event reporting.

A translate HWPT is allocated based on the guest STE configuration and
attached to the device when the guest issues SMMU_CMD_CFGI_STE or
SMMU_CMD_CFGI_STE_RANGE, provided the STE enables S1 translation.

If the guest STE is invalid or S1 translation is disabled, the device is
attached to one of the pre-allocated ABORT or BYPASS HWPTs instead.

While at it, export both smmu_find_ste() and smmuv3_flush_config() for
use here.
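
The per-SID handling on SMMU_CMD_CFGI_STE can be summarised roughly as
follows; this is only a sketch of the logic implemented in
smmuv3_accel_install_nested_ste() below:

    /* sketch: decision taken for each SID on CFGI_STE / CFGI_STE_RANGE */
    smmu_find_ste(s, sid, &ste, &event);
    if (!STE_VALID(&ste) || !STE_CFG_S1_ENABLED(STE_CONFIG(&ste))) {
        /* attach the pre-allocated abort or bypass proxy nested HWPT */
    } else {
        /* allocate a vDEVICE (vSID mapping) if not already done, build a
         * translate nested HWPT from the masked STE words and attach it */
    }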

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
 hw/arm/smmuv3-accel.c    | 193 +++++++++++++++++++++++++++++++++++++++
 hw/arm/smmuv3-accel.h    |  23 +++++
 hw/arm/smmuv3-internal.h |  20 ++++
 hw/arm/smmuv3.c          |  18 +++-
 hw/arm/trace-events      |   2 +
 5 files changed, 253 insertions(+), 3 deletions(-)

diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
index d4d65299a8..c74e95a0ea 100644
--- a/hw/arm/smmuv3-accel.c
+++ b/hw/arm/smmuv3-accel.c
@@ -28,6 +28,191 @@ MemoryRegion root;
 MemoryRegion sysmem;
 static AddressSpace *shared_as_sysmem;
 
+static bool
+smmuv3_accel_alloc_vdev(SMMUv3AccelDevice *accel_dev, int sid, Error **errp)
+{
+    SMMUViommu *vsmmu = accel_dev->vsmmu;
+    IOMMUFDVdev *vdev;
+    uint32_t vdevice_id;
+
+    if (!accel_dev->idev || accel_dev->vdev) {
+        return true;
+    }
+
+    if (!iommufd_backend_alloc_vdev(vsmmu->iommufd, accel_dev->idev->devid,
+                                    vsmmu->viommu.viommu_id, sid,
+                                    &vdevice_id, errp)) {
+            return false;
+    }
+    if (!host_iommu_device_iommufd_attach_hwpt(accel_dev->idev,
+                                               vsmmu->bypass_hwpt_id, errp)) {
+        iommufd_backend_free_id(vsmmu->iommufd, vdevice_id);
+        return false;
+    }
+
+    vdev = g_new(IOMMUFDVdev, 1);
+    vdev->vdevice_id = vdevice_id;
+    vdev->virt_id = sid;
+    accel_dev->vdev = vdev;
+    return true;
+}
+
+static bool
+smmuv3_accel_dev_uninstall_nested_ste(SMMUv3AccelDevice *accel_dev, bool abort,
+                                      Error **errp)
+{
+    HostIOMMUDeviceIOMMUFD *idev = accel_dev->idev;
+    SMMUS1Hwpt *s1_hwpt = accel_dev->s1_hwpt;
+    uint32_t hwpt_id;
+
+    if (!s1_hwpt || !accel_dev->vsmmu) {
+        return true;
+    }
+
+    if (abort) {
+        hwpt_id = accel_dev->vsmmu->abort_hwpt_id;
+    } else {
+        hwpt_id = accel_dev->vsmmu->bypass_hwpt_id;
+    }
+
+    if (!host_iommu_device_iommufd_attach_hwpt(idev, hwpt_id, errp)) {
+        return false;
+    }
+    trace_smmuv3_accel_uninstall_nested_ste(smmu_get_sid(&accel_dev->sdev),
+                                            abort ? "abort" : "bypass",
+                                            hwpt_id);
+
+    iommufd_backend_free_id(s1_hwpt->iommufd, s1_hwpt->hwpt_id);
+    accel_dev->s1_hwpt = NULL;
+    g_free(s1_hwpt);
+    return true;
+}
+
+static bool
+smmuv3_accel_dev_install_nested_ste(SMMUv3AccelDevice *accel_dev,
+                                    uint32_t data_type, uint32_t data_len,
+                                    void *data, Error **errp)
+{
+    SMMUViommu *vsmmu = accel_dev->vsmmu;
+    SMMUS1Hwpt *s1_hwpt = accel_dev->s1_hwpt;
+    HostIOMMUDeviceIOMMUFD *idev = accel_dev->idev;
+    uint32_t flags = 0;
+
+    if (!idev || !vsmmu) {
+        error_setg(errp, "Device 0x%x has no associated IOMMU dev or vIOMMU",
+                   smmu_get_sid(&accel_dev->sdev));
+        return false;
+    }
+
+    if (s1_hwpt) {
+        if (!smmuv3_accel_dev_uninstall_nested_ste(accel_dev, true, errp)) {
+            return false;
+        }
+    }
+
+    s1_hwpt = g_new0(SMMUS1Hwpt, 1);
+    s1_hwpt->iommufd = idev->iommufd;
+    if (!iommufd_backend_alloc_hwpt(idev->iommufd, idev->devid,
+                                    vsmmu->viommu.viommu_id, flags,
+                                    data_type, data_len, data,
+                                    &s1_hwpt->hwpt_id, errp)) {
+        return false;
+    }
+
+    if (!host_iommu_device_iommufd_attach_hwpt(idev, s1_hwpt->hwpt_id, errp)) {
+        iommufd_backend_free_id(idev->iommufd, s1_hwpt->hwpt_id);
+        return false;
+    }
+    accel_dev->s1_hwpt = s1_hwpt;
+    return true;
+}
+
+bool
+smmuv3_accel_install_nested_ste(SMMUv3State *s, SMMUDevice *sdev, int sid,
+                                Error **errp)
+{
+    SMMUv3AccelDevice *accel_dev;
+    SMMUEventInfo event = {.type = SMMU_EVT_NONE, .sid = sid,
+                           .inval_ste_allowed = true};
+    struct iommu_hwpt_arm_smmuv3 nested_data = {};
+    uint64_t ste_0, ste_1;
+    uint32_t config;
+    STE ste;
+    int ret;
+
+    if (!s->accel) {
+        return true;
+    }
+
+    accel_dev = container_of(sdev, SMMUv3AccelDevice, sdev);
+    if (!accel_dev->vsmmu) {
+        return true;
+    }
+
+    if (!smmuv3_accel_alloc_vdev(accel_dev, sid, errp)) {
+        return false;
+    }
+
+    ret = smmu_find_ste(sdev->smmu, sid, &ste, &event);
+    if (ret) {
+        error_setg(errp, "Failed to find STE for Device 0x%x", sid);
+        return true;
+    }
+
+    config = STE_CONFIG(&ste);
+    if (!STE_VALID(&ste) || !STE_CFG_S1_ENABLED(config)) {
+        if (!smmuv3_accel_dev_uninstall_nested_ste(accel_dev,
+                                                   STE_CFG_ABORT(config),
+                                                   errp)) {
+            return false;
+        }
+        smmuv3_flush_config(sdev);
+        return true;
+    }
+
+    ste_0 = (uint64_t)ste.word[0] | (uint64_t)ste.word[1] << 32;
+    ste_1 = (uint64_t)ste.word[2] | (uint64_t)ste.word[3] << 32;
+    nested_data.ste[0] = cpu_to_le64(ste_0 & STE0_MASK);
+    nested_data.ste[1] = cpu_to_le64(ste_1 & STE1_MASK);
+
+    if (!smmuv3_accel_dev_install_nested_ste(accel_dev,
+                                             IOMMU_HWPT_DATA_ARM_SMMUV3,
+                                             sizeof(nested_data),
+                                             &nested_data, errp)) {
+        error_append_hint(errp, "Unable to install sid=0x%x nested STE="
+                          "0x%"PRIx64":=0x%"PRIx64"", sid,
+                          (uint64_t)le64_to_cpu(nested_data.ste[1]),
+                          (uint64_t)le64_to_cpu(nested_data.ste[0]));
+        return false;
+    }
+    trace_smmuv3_accel_install_nested_ste(sid, nested_data.ste[1],
+                                          nested_data.ste[0]);
+    return true;
+}
+
+bool smmuv3_accel_install_nested_ste_range(SMMUv3State *s, SMMUSIDRange *range,
+                                           Error **errp)
+{
+    SMMUv3AccelState *s_accel = s->s_accel;
+    SMMUv3AccelDevice *accel_dev;
+
+    if (!s_accel || !s_accel->vsmmu) {
+        return true;
+    }
+
+    QLIST_FOREACH(accel_dev, &s_accel->vsmmu->device_list, next) {
+        uint32_t sid = smmu_get_sid(&accel_dev->sdev);
+
+        if (sid >= range->start && sid <= range->end) {
+            if (!smmuv3_accel_install_nested_ste(s, &accel_dev->sdev,
+                                                 sid, errp)) {
+                return false;
+            }
+        }
+    }
+    return true;
+}
+
 static SMMUv3AccelDevice *smmuv3_accel_get_dev(SMMUState *bs, SMMUPciBus *sbus,
                                                PCIBus *bus, int devfn)
 {
@@ -154,6 +339,7 @@ static void smmuv3_accel_unset_iommu_device(PCIBus *bus, void *opaque,
     SMMUv3State *s = ARM_SMMUV3(bs);
     SMMUPciBus *sbus = g_hash_table_lookup(bs->smmu_pcibus_by_busptr, bus);
     SMMUv3AccelDevice *accel_dev;
+    IOMMUFDVdev *vdev;
     SMMUViommu *vsmmu;
     SMMUDevice *sdev;
     uint16_t sid;
@@ -182,6 +368,13 @@ static void smmuv3_accel_unset_iommu_device(PCIBus *bus, void *opaque,
     trace_smmuv3_accel_unset_iommu_device(devfn, sid);
 
     vsmmu = s->s_accel->vsmmu;
+    vdev = accel_dev->vdev;
+    if (vdev) {
+        iommufd_backend_free_id(vsmmu->iommufd, vdev->vdevice_id);
+        g_free(vdev);
+        accel_dev->vdev = NULL;
+    }
+
     if (QLIST_EMPTY(&vsmmu->device_list)) {
         iommufd_backend_free_id(vsmmu->iommufd, vsmmu->bypass_hwpt_id);
         iommufd_backend_free_id(vsmmu->iommufd, vsmmu->abort_hwpt_id);
diff --git a/hw/arm/smmuv3-accel.h b/hw/arm/smmuv3-accel.h
index d81f90c32c..73b44cd7be 100644
--- a/hw/arm/smmuv3-accel.h
+++ b/hw/arm/smmuv3-accel.h
@@ -27,9 +27,16 @@ typedef struct SMMUViommu {
     QLIST_HEAD(, SMMUv3AccelDevice) device_list;
 } SMMUViommu;
 
+typedef struct SMMUS1Hwpt {
+    IOMMUFDBackend *iommufd;
+    uint32_t hwpt_id;
+} SMMUS1Hwpt;
+
 typedef struct SMMUv3AccelDevice {
     SMMUDevice sdev;
     HostIOMMUDeviceIOMMUFD *idev;
+    SMMUS1Hwpt *s1_hwpt;
+    IOMMUFDVdev *vdev;
     SMMUViommu *vsmmu;
     QLIST_ENTRY(SMMUv3AccelDevice) next;
 } SMMUv3AccelDevice;
@@ -40,10 +47,26 @@ typedef struct SMMUv3AccelState {
 
 #ifdef CONFIG_ARM_SMMUV3_ACCEL
 void smmuv3_accel_init(SMMUv3State *s);
+bool smmuv3_accel_install_nested_ste(SMMUv3State *s, SMMUDevice *sdev, int sid,
+                                     Error **errp);
+bool smmuv3_accel_install_nested_ste_range(SMMUv3State *s, SMMUSIDRange *range,
+                                           Error **errp);
 #else
 static inline void smmuv3_accel_init(SMMUv3State *s)
 {
 }
+static inline bool
+smmuv3_accel_install_nested_ste(SMMUv3State *s, SMMUDevice *sdev, int sid,
+                                Error **errp)
+{
+    return true;
+}
+static inline bool
+smmuv3_accel_install_nested_ste_range(SMMUv3State *s, SMMUSIDRange *range,
+                                      Error **errp)
+{
+    return true;
+}
 #endif
 
 #endif /* HW_ARM_SMMUV3_ACCEL_H */
diff --git a/hw/arm/smmuv3-internal.h b/hw/arm/smmuv3-internal.h
index 03d86cfc5c..5fd88b4257 100644
--- a/hw/arm/smmuv3-internal.h
+++ b/hw/arm/smmuv3-internal.h
@@ -547,6 +547,9 @@ typedef struct CD {
     uint32_t word[16];
 } CD;
 
+int smmu_find_ste(SMMUv3State *s, uint32_t sid, STE *ste, SMMUEventInfo *event);
+void smmuv3_flush_config(SMMUDevice *sdev);
+
 /* STE fields */
 
 #define STE_VALID(x)   extract32((x)->word[0], 0, 1)
@@ -586,6 +589,23 @@ typedef struct CD {
 #define SMMU_STE_VALID      (1ULL << 0)
 #define SMMU_STE_CFG_BYPASS (1ULL << 3)
 
+#define STE0_V       MAKE_64BIT_MASK(0, 1)
+#define STE0_CONFIG  MAKE_64BIT_MASK(1, 3)
+#define STE0_S1FMT   MAKE_64BIT_MASK(4, 2)
+#define STE0_CTXPTR  MAKE_64BIT_MASK(6, 50)
+#define STE0_S1CDMAX MAKE_64BIT_MASK(59, 5)
+#define STE0_MASK    (STE0_S1CDMAX | STE0_CTXPTR | STE0_S1FMT | STE0_CONFIG | \
+                      STE0_V)
+
+#define STE1_S1DSS    MAKE_64BIT_MASK(0, 2)
+#define STE1_S1CIR    MAKE_64BIT_MASK(2, 2)
+#define STE1_S1COR    MAKE_64BIT_MASK(4, 2)
+#define STE1_S1CSH    MAKE_64BIT_MASK(6, 2)
+#define STE1_S1STALLD MAKE_64BIT_MASK(27, 1)
+#define STE1_EATS     MAKE_64BIT_MASK(28, 2)
+#define STE1_MASK     (STE1_EATS | STE1_S1STALLD | STE1_S1CSH | STE1_S1COR | \
+                       STE1_S1CIR | STE1_S1DSS)
+
 #define SMMU_GBPA_ABORT (1UL << 20)
 
 static inline int oas2bits(int oas_field)
diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index ef991cb7d8..1fd8aaa0c7 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -630,8 +630,7 @@ bad_ste:
  * Supports linear and 2-level stream table
  * Return 0 on success, -EINVAL otherwise
  */
-static int smmu_find_ste(SMMUv3State *s, uint32_t sid, STE *ste,
-                         SMMUEventInfo *event)
+int smmu_find_ste(SMMUv3State *s, uint32_t sid, STE *ste, SMMUEventInfo *event)
 {
     dma_addr_t addr, strtab_base;
     uint32_t log2size;
@@ -900,7 +899,7 @@ static SMMUTransCfg *smmuv3_get_config(SMMUDevice *sdev, SMMUEventInfo *event)
     return cfg;
 }
 
-static void smmuv3_flush_config(SMMUDevice *sdev)
+void smmuv3_flush_config(SMMUDevice *sdev)
 {
     SMMUv3State *s = sdev->smmu;
     SMMUState *bc = &s->smmu_state;
@@ -1330,6 +1329,7 @@ static int smmuv3_cmdq_consume(SMMUv3State *s)
         {
             uint32_t sid = CMD_SID(&cmd);
             SMMUDevice *sdev = smmu_find_sdev(bs, sid);
+            Error *local_err = NULL;
 
             if (CMD_SSEC(&cmd)) {
                 cmd_error = SMMU_CERROR_ILL;
@@ -1341,6 +1341,11 @@ static int smmuv3_cmdq_consume(SMMUv3State *s)
             }
 
             trace_smmuv3_cmdq_cfgi_ste(sid);
+            if (!smmuv3_accel_install_nested_ste(s, sdev, sid, &local_err)) {
+                error_report_err(local_err);
+                cmd_error = SMMU_CERROR_ILL;
+                break;
+            }
             smmuv3_flush_config(sdev);
 
             break;
@@ -1350,6 +1355,7 @@ static int smmuv3_cmdq_consume(SMMUv3State *s)
             uint32_t sid = CMD_SID(&cmd), mask;
             uint8_t range = CMD_STE_RANGE(&cmd);
             SMMUSIDRange sid_range;
+            Error *local_err = NULL;
 
             if (CMD_SSEC(&cmd)) {
                 cmd_error = SMMU_CERROR_ILL;
@@ -1361,6 +1367,12 @@ static int smmuv3_cmdq_consume(SMMUv3State *s)
             sid_range.end = sid_range.start + mask;
 
             trace_smmuv3_cmdq_cfgi_ste_range(sid_range.start, sid_range.end);
+            if (!smmuv3_accel_install_nested_ste_range(s, &sid_range,
+                                                       &local_err)) {
+                error_report_err(local_err);
+                cmd_error = SMMU_CERROR_ILL;
+                break;
+            }
             smmu_configs_inv_sid_range(bs, sid_range);
             break;
         }
diff --git a/hw/arm/trace-events b/hw/arm/trace-events
index 49c0460f30..2e0b1f8f6f 100644
--- a/hw/arm/trace-events
+++ b/hw/arm/trace-events
@@ -69,6 +69,8 @@ smmu_reset_exit(void) ""
 #smmuv3-accel.c
 smmuv3_accel_set_iommu_device(int devfn, uint32_t sid) "devfn=0x%x (idev devid=0x%x)"
 smmuv3_accel_unset_iommu_device(int devfn, uint32_t sid) "devfn=0x%x (idev devid=0x%x)"
+smmuv3_accel_install_nested_ste(uint32_t sid, uint64_t ste_1, uint64_t ste_0) "sid=%d ste=%"PRIx64":%"PRIx64
+smmuv3_accel_uninstall_nested_ste(uint32_t sid, const char *ste_cfg, uint32_t hwpt_id) "sid=%d attached %s hwpt_id=%u"
 
 # strongarm.c
 strongarm_uart_update_parameters(const char *label, int speed, char parity, int data_bits, int stop_bits) "%s speed=%d parity=%c data=%d stop=%d"
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH v5 14/32] hw/arm/smmuv3-accel: Install SMMUv3 GBPA based hwpt
  2025-10-31 10:49 [PATCH v5 00/32] hw/arm/virt: Add support for user-creatable accelerated SMMUv3 Shameer Kolothum
                   ` (12 preceding siblings ...)
  2025-10-31 10:49 ` [PATCH v5 13/32] hw/arm/smmuv3-accel: Add nested vSTE install/uninstall support Shameer Kolothum
@ 2025-10-31 10:49 ` Shameer Kolothum
  2025-11-04 13:28   ` Eric Auger
  2025-10-31 10:49 ` [PATCH v5 15/32] hw/pci/pci: Introduce optional get_msi_address_space() callback Shameer Kolothum
                   ` (17 subsequent siblings)
  31 siblings, 1 reply; 148+ messages in thread
From: Shameer Kolothum @ 2025-10-31 10:49 UTC (permalink / raw)
  To: qemu-arm, qemu-devel
  Cc: eric.auger, peter.maydell, jgg, nicolinc, ddutile, berrange,
	nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
	jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu, kjaju

When the Guest reboots or updates the GBPA, we need to attach a nested HWPT
based on the GBPA register value.
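
In other words, the selection boils down to the following (a sketch of what
smmuv3_accel_gbpa_update() below does for every device behind the vSMMU):

    /* sketch: proxy HWPT selection on a GBPA update or reset */
    hwpt_id = (s->gbpa & SMMU_GBPA_ABORT) ? vsmmu->abort_hwpt_id
                                          : vsmmu->bypass_hwpt_id;
    /* attach hwpt_id to each device on vsmmu->device_list */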

Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
 hw/arm/smmuv3-accel.c | 42 ++++++++++++++++++++++++++++++++++++++++++
 hw/arm/smmuv3-accel.h |  8 ++++++++
 hw/arm/smmuv3.c       |  2 ++
 3 files changed, 52 insertions(+)

diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
index c74e95a0ea..0573ae3772 100644
--- a/hw/arm/smmuv3-accel.c
+++ b/hw/arm/smmuv3-accel.c
@@ -479,6 +479,48 @@ static const PCIIOMMUOps smmuv3_accel_ops = {
     .unset_iommu_device = smmuv3_accel_unset_iommu_device,
 };
 
+
+/* Based on SMMUv3 GBPA configuration, attach a corresponding HWPT */
+void smmuv3_accel_gbpa_update(SMMUv3State *s)
+{
+    SMMUv3AccelDevice *accel_dev;
+    Error *local_err = NULL;
+    SMMUViommu *vsmmu;
+    uint32_t hwpt_id;
+
+    if (!s->accel || !s->s_accel->vsmmu) {
+        return;
+    }
+
+    vsmmu = s->s_accel->vsmmu;
+    /*
+     * The Linux kernel does not allow configuring GBPA MemAttr, MTCFG,
+     * ALLOCCFG, SHCFG, PRIVCFG, or INSTCFG fields for a vSTE. Host kernel
+     * has final control over these parameters. Hence, use one of the
+     * pre-allocated HWPTs depending on GBPA.ABORT value.
+     */
+    if (s->gbpa & SMMU_GBPA_ABORT) {
+        hwpt_id = vsmmu->abort_hwpt_id;
+    } else {
+        hwpt_id = vsmmu->bypass_hwpt_id;
+    }
+
+    QLIST_FOREACH(accel_dev, &vsmmu->device_list, next) {
+        if (!host_iommu_device_iommufd_attach_hwpt(accel_dev->idev, hwpt_id,
+                                                   &local_err)) {
+            error_append_hint(&local_err, "Failed to attach GBPA hwpt id %u "
+                              "for dev id %u", hwpt_id, accel_dev->idev->devid);
+            error_report_err(local_err);
+        }
+    }
+}
+
+void smmuv3_accel_reset(SMMUv3State *s)
+{
+     /* Attach a HWPT based on GBPA reset value */
+     smmuv3_accel_gbpa_update(s);
+}
+
 static void smmuv3_accel_as_init(SMMUv3State *s)
 {
 
diff --git a/hw/arm/smmuv3-accel.h b/hw/arm/smmuv3-accel.h
index 73b44cd7be..8931e83dc5 100644
--- a/hw/arm/smmuv3-accel.h
+++ b/hw/arm/smmuv3-accel.h
@@ -51,6 +51,8 @@ bool smmuv3_accel_install_nested_ste(SMMUv3State *s, SMMUDevice *sdev, int sid,
                                      Error **errp);
 bool smmuv3_accel_install_nested_ste_range(SMMUv3State *s, SMMUSIDRange *range,
                                            Error **errp);
+void smmuv3_accel_gbpa_update(SMMUv3State *s);
+void smmuv3_accel_reset(SMMUv3State *s);
 #else
 static inline void smmuv3_accel_init(SMMUv3State *s)
 {
@@ -67,6 +69,12 @@ smmuv3_accel_install_nested_ste_range(SMMUv3State *s, SMMUSIDRange *range,
 {
     return true;
 }
+static inline void smmuv3_accel_gbpa_update(SMMUv3State *s)
+{
+}
+static inline void smmuv3_accel_reset(SMMUv3State *s)
+{
+}
 #endif
 
 #endif /* HW_ARM_SMMUV3_ACCEL_H */
diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index 1fd8aaa0c7..cc32b618ed 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -1603,6 +1603,7 @@ static MemTxResult smmu_writel(SMMUv3State *s, hwaddr offset,
         if (data & R_GBPA_UPDATE_MASK) {
             /* Ignore update bit as write is synchronous. */
             s->gbpa = data & ~R_GBPA_UPDATE_MASK;
+            smmuv3_accel_gbpa_update(s);
         }
         return MEMTX_OK;
     case A_STRTAB_BASE: /* 64b */
@@ -1885,6 +1886,7 @@ static void smmu_reset_exit(Object *obj, ResetType type)
     }
 
     smmuv3_init_regs(s);
+    smmuv3_accel_reset(s);
 }
 
 static void smmu_realize(DeviceState *d, Error **errp)
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH v5 15/32] hw/pci/pci: Introduce optional get_msi_address_space() callback
  2025-10-31 10:49 [PATCH v5 00/32] hw/arm/virt: Add support for user-creatable accelerated SMMUv3 Shameer Kolothum
                   ` (13 preceding siblings ...)
  2025-10-31 10:49 ` [PATCH v5 14/32] hw/arm/smmuv3-accel: Install SMMUv3 GBPA based hwpt Shameer Kolothum
@ 2025-10-31 10:49 ` Shameer Kolothum
  2025-11-04 14:11   ` Eric Auger
  2025-10-31 10:49 ` [PATCH v5 16/32] hw/arm/smmuv3-accel: Make use of " Shameer Kolothum
                   ` (16 subsequent siblings)
  31 siblings, 1 reply; 148+ messages in thread
From: Shameer Kolothum @ 2025-10-31 10:49 UTC (permalink / raw)
  To: qemu-arm, qemu-devel
  Cc: eric.auger, peter.maydell, jgg, nicolinc, ddutile, berrange,
	nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
	jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu, kjaju

On ARM, devices behind an IOMMU have their MSI doorbell addresses
translated by the IOMMU. In nested mode, this translation happens in
two stages (gIOVA → gPA → ITS page).

In accelerated SMMUv3 mode, both stages are handled by hardware, so
get_address_space() returns the system address space so that VFIO can
set up stage-2 mappings for it.

However, QEMU/KVM also calls this callback when resolving
MSI doorbells:

  kvm_irqchip_add_msi_route()
    kvm_arch_fixup_msi_route()
      pci_device_iommu_address_space()
        get_address_space()

A VFIO device in a guest with an SMMUv3 is programmed with a gIOVA for the
MSI doorbell. This gIOVA cannot be used to set up the MSI doorbell directly;
it first has to be translated to the vITS gPA, and that doorbell translation
requires the IOMMU address space.

Add an optional get_msi_address_space() callback and use it in this
path to return the correct address space for such cases.
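
Roughly, the consumer side in kvm_arch_fixup_msi_route() then looks like the
sketch below; this is only an illustration of how the new wrapper is meant to
be used, the actual change to target/arm/kvm.c is in the diff:

    /* sketch: MSI route fixup on Arm using the new wrapper */
    AddressSpace *as = pci_device_iommu_msi_address_space(dev);

    if (as == &address_space_memory) {
        return 0;   /* doorbell is not translated by a vIOMMU */
    }
    /* otherwise resolve the gIOVA doorbell to the vITS gPA through 'as' */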

Cc: Michael S. Tsirkin <mst@redhat.com>
Suggested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
 hw/pci/pci.c         | 18 ++++++++++++++++++
 include/hw/pci/pci.h | 16 ++++++++++++++++
 target/arm/kvm.c     |  2 +-
 3 files changed, 35 insertions(+), 1 deletion(-)

diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index fa9cf5dab2..1edd711247 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -2982,6 +2982,24 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
     return &address_space_memory;
 }
 
+AddressSpace *pci_device_iommu_msi_address_space(PCIDevice *dev)
+{
+    PCIBus *bus;
+    PCIBus *iommu_bus;
+    int devfn;
+
+    pci_device_get_iommu_bus_devfn(dev, &iommu_bus, &bus, &devfn);
+    if (iommu_bus) {
+        if (iommu_bus->iommu_ops->get_msi_address_space) {
+            return iommu_bus->iommu_ops->get_msi_address_space(bus,
+                                 iommu_bus->iommu_opaque, devfn);
+        }
+        return iommu_bus->iommu_ops->get_address_space(bus,
+                                 iommu_bus->iommu_opaque, devfn);
+    }
+    return &address_space_memory;
+}
+
 int pci_iommu_init_iotlb_notifier(PCIDevice *dev, IOMMUNotifier *n,
                                   IOMMUNotify fn, void *opaque)
 {
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index dfeba8c9bd..b731443c67 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -664,6 +664,21 @@ typedef struct PCIIOMMUOps {
                             uint32_t pasid, bool priv_req, bool exec_req,
                             hwaddr addr, bool lpig, uint16_t prgi, bool is_read,
                             bool is_write);
+    /**
+     * @get_msi_address_space: get the address space for MSI doorbell address
+     * for devices
+     *
+     * Optional callback which returns a pointer to an #AddressSpace. This
+     * is required if the MSI doorbell also gets translated through a vIOMMU (e.g. ARM)
+     *
+     * @bus: the #PCIBus being accessed.
+     *
+     * @opaque: the data passed to pci_setup_iommu().
+     *
+     * @devfn: device and function number
+     */
+    AddressSpace * (*get_msi_address_space)(PCIBus *bus, void *opaque,
+                                            int devfn);
 } PCIIOMMUOps;
 
 bool pci_device_get_iommu_bus_devfn(PCIDevice *dev, PCIBus **piommu_bus,
@@ -672,6 +687,7 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
 bool pci_device_set_iommu_device(PCIDevice *dev, HostIOMMUDevice *hiod,
                                  Error **errp);
 void pci_device_unset_iommu_device(PCIDevice *dev);
+AddressSpace *pci_device_iommu_msi_address_space(PCIDevice *dev);
 
 /**
  * pci_device_get_viommu_flags: get vIOMMU flags.
diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index 0d57081e69..0df41128d0 100644
--- a/target/arm/kvm.c
+++ b/target/arm/kvm.c
@@ -1611,7 +1611,7 @@ int kvm_arm_set_irq(int cpu, int irqtype, int irq, int level)
 int kvm_arch_fixup_msi_route(struct kvm_irq_routing_entry *route,
                              uint64_t address, uint32_t data, PCIDevice *dev)
 {
-    AddressSpace *as = pci_device_iommu_address_space(dev);
+    AddressSpace *as = pci_device_iommu_msi_address_space(dev);
     hwaddr xlat, len, doorbell_gpa;
     MemoryRegionSection mrs;
     MemoryRegion *mr;
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH v5 16/32] hw/arm/smmuv3-accel: Make use of get_msi_address_space() callback
  2025-10-31 10:49 [PATCH v5 00/32] hw/arm/virt: Add support for user-creatable accelerated SMMUv3 Shameer Kolothum
                   ` (14 preceding siblings ...)
  2025-10-31 10:49 ` [PATCH v5 15/32] hw/pci/pci: Introduce optional get_msi_address_space() callback Shameer Kolothum
@ 2025-10-31 10:49 ` Shameer Kolothum
  2025-10-31 23:57   ` Nicolin Chen
  2025-10-31 10:49 ` [PATCH v5 17/32] hw/arm/smmuv3-accel: Add support to issue invalidation cmd to host Shameer Kolothum
                   ` (15 subsequent siblings)
  31 siblings, 1 reply; 148+ messages in thread
From: Shameer Kolothum @ 2025-10-31 10:49 UTC (permalink / raw)
  To: qemu-arm, qemu-devel
  Cc: eric.auger, peter.maydell, jgg, nicolinc, ddutile, berrange,
	nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
	jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu, kjaju

Implement support for the get_msi_address_space() callback and return the
IOMMU address space if the device has S1 translation enabled by the Guest.
Otherwise, return the system address space.

Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
 hw/arm/smmuv3-accel.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
index 0573ae3772..395c8175da 100644
--- a/hw/arm/smmuv3-accel.c
+++ b/hw/arm/smmuv3-accel.c
@@ -384,6 +384,26 @@ static void smmuv3_accel_unset_iommu_device(PCIBus *bus, void *opaque,
     }
 }
 
+static AddressSpace *smmuv3_accel_get_msi_as(PCIBus *bus, void *opaque,
+                                             int devfn)
+{
+    SMMUState *bs = opaque;
+    SMMUPciBus *sbus = smmu_get_sbus(bs, bus);
+    SMMUv3AccelDevice *accel_dev = smmuv3_accel_get_dev(bs, sbus, bus, devfn);
+    SMMUDevice *sdev = &accel_dev->sdev;
+
+    /*
+     * If the assigned vfio-pci dev has S1 translation enabled by Guest,
+     * return IOMMU address space for MSI translation. Otherwise, return
+     * system address space.
+     */
+    if (accel_dev->s1_hwpt) {
+        return &sdev->as;
+    } else {
+        return &address_space_memory;
+    }
+}
+
 static bool smmuv3_accel_pdev_allowed(PCIDevice *pdev, bool *vfio_pci)
 {
 
@@ -477,6 +497,7 @@ static const PCIIOMMUOps smmuv3_accel_ops = {
     .get_viommu_flags = smmuv3_accel_get_viommu_flags,
     .set_iommu_device = smmuv3_accel_set_iommu_device,
     .unset_iommu_device = smmuv3_accel_unset_iommu_device,
+    .get_msi_address_space = smmuv3_accel_get_msi_as,
 };
 
 
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH v5 17/32] hw/arm/smmuv3-accel: Add support to issue invalidation cmd to host
  2025-10-31 10:49 [PATCH v5 00/32] hw/arm/virt: Add support for user-creatable accelerated SMMUv3 Shameer Kolothum
                   ` (15 preceding siblings ...)
  2025-10-31 10:49 ` [PATCH v5 16/32] hw/arm/smmuv3-accel: Make use of " Shameer Kolothum
@ 2025-10-31 10:49 ` Shameer Kolothum
  2025-11-01  0:35   ` Nicolin Chen via
  2025-11-03 17:11   ` Eric Auger
  2025-10-31 10:49 ` [PATCH v5 18/32] hw/arm/smmuv3: Initialize ID registers early during realize() Shameer Kolothum
                   ` (14 subsequent siblings)
  31 siblings, 2 replies; 148+ messages in thread
From: Shameer Kolothum @ 2025-10-31 10:49 UTC (permalink / raw)
  To: qemu-arm, qemu-devel
  Cc: eric.auger, peter.maydell, jgg, nicolinc, ddutile, berrange,
	nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
	jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu, kjaju

Provide a helper and use it to issue the invalidation cmd to the host
SMMUv3. We only issue one cmd at a time for now.

Support for batching of commands will be added later after analysing the
impact.
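
Each affected command handler then wires the helper in with the same pattern
(a sketch of what the smmuv3.c hunks below repeat per command; sdev may be
NULL for the non-SID-based invalidations):

    /* sketch: per-command use of the helper in smmuv3_cmdq_consume() */
    if (!smmuv3_accel_issue_inv_cmd(s, &cmd, sdev, &local_err)) {
        error_report_err(local_err);
        cmd_error = SMMU_CERROR_ILL;
        break;
    }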

Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
 hw/arm/smmuv3-accel.c | 35 +++++++++++++++++++++++++++++++++++
 hw/arm/smmuv3-accel.h |  8 ++++++++
 hw/arm/smmuv3.c       | 30 ++++++++++++++++++++++++++++++
 3 files changed, 73 insertions(+)

diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
index 395c8175da..a2deda3c32 100644
--- a/hw/arm/smmuv3-accel.c
+++ b/hw/arm/smmuv3-accel.c
@@ -213,6 +213,41 @@ bool smmuv3_accel_install_nested_ste_range(SMMUv3State *s, SMMUSIDRange *range,
     return true;
 }
 
+/*
+ * This issues the invalidation cmd to the host SMMUv3.
+ * Note: sdev can be NULL for certain invalidation commands
+ * e.g., SMMU_CMD_TLBI_NH_ASID, SMMU_CMD_TLBI_NH_VA etc.
+ */
+bool smmuv3_accel_issue_inv_cmd(SMMUv3State *bs, void *cmd, SMMUDevice *sdev,
+                                Error **errp)
+{
+    SMMUv3State *s = ARM_SMMUV3(bs);
+    SMMUv3AccelState *s_accel = s->s_accel;
+    IOMMUFDViommu *viommu;
+    uint32_t entry_num = 1;
+
+    /* No vIOMMU means no VFIO/IOMMUFD devices, nothing to invalidate. */
+    if (!s_accel || !s_accel->vsmmu) {
+        return true;
+    }
+
+    /*
+     * Called for emulated bridges or root ports, but SID-based
+     * invalidations (e.g. CFGI_CD) apply only to vfio-pci endpoints
+     * with a valid vIOMMU vdev.
+     */
+    if (sdev && !container_of(sdev, SMMUv3AccelDevice, sdev)->vdev) {
+        return true;
+    }
+
+    viommu = &s_accel->vsmmu->viommu;
+    /* Single command (entry_num = 1); no need to check returned entry_num */
+    return iommufd_backend_invalidate_cache(
+                   viommu->iommufd, viommu->viommu_id,
+                   IOMMU_VIOMMU_INVALIDATE_DATA_ARM_SMMUV3,
+                   sizeof(Cmd), &entry_num, cmd, errp);
+}
+
 static SMMUv3AccelDevice *smmuv3_accel_get_dev(SMMUState *bs, SMMUPciBus *sbus,
                                                PCIBus *bus, int devfn)
 {
diff --git a/hw/arm/smmuv3-accel.h b/hw/arm/smmuv3-accel.h
index 8931e83dc5..ee79548370 100644
--- a/hw/arm/smmuv3-accel.h
+++ b/hw/arm/smmuv3-accel.h
@@ -51,6 +51,8 @@ bool smmuv3_accel_install_nested_ste(SMMUv3State *s, SMMUDevice *sdev, int sid,
                                      Error **errp);
 bool smmuv3_accel_install_nested_ste_range(SMMUv3State *s, SMMUSIDRange *range,
                                            Error **errp);
+bool smmuv3_accel_issue_inv_cmd(SMMUv3State *s, void *cmd, SMMUDevice *sdev,
+                                Error **errp);
 void smmuv3_accel_gbpa_update(SMMUv3State *s);
 void smmuv3_accel_reset(SMMUv3State *s);
 #else
@@ -69,6 +71,12 @@ smmuv3_accel_install_nested_ste_range(SMMUv3State *s, SMMUSIDRange *range,
 {
     return true;
 }
+static inline bool
+smmuv3_accel_issue_inv_cmd(SMMUv3State *s, void *cmd, SMMUDevice *sdev,
+                           Error **errp)
+{
+    return true;
+}
 static inline void smmuv3_accel_gbpa_update(SMMUv3State *s)
 {
 }
diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index cc32b618ed..15173ddc9c 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -1381,6 +1381,7 @@ static int smmuv3_cmdq_consume(SMMUv3State *s)
         {
             uint32_t sid = CMD_SID(&cmd);
             SMMUDevice *sdev = smmu_find_sdev(bs, sid);
+            Error *local_err = NULL;
 
             if (CMD_SSEC(&cmd)) {
                 cmd_error = SMMU_CERROR_ILL;
@@ -1393,11 +1394,17 @@ static int smmuv3_cmdq_consume(SMMUv3State *s)
 
             trace_smmuv3_cmdq_cfgi_cd(sid);
             smmuv3_flush_config(sdev);
+            if (!smmuv3_accel_issue_inv_cmd(s, &cmd, sdev, &local_err)) {
+                error_report_err(local_err);
+                cmd_error = SMMU_CERROR_ILL;
+                break;
+            }
             break;
         }
         case SMMU_CMD_TLBI_NH_ASID:
         {
             int asid = CMD_ASID(&cmd);
+            Error *local_err = NULL;
             int vmid = -1;
 
             if (!STAGE1_SUPPORTED(s)) {
@@ -1416,6 +1423,11 @@ static int smmuv3_cmdq_consume(SMMUv3State *s)
             trace_smmuv3_cmdq_tlbi_nh_asid(asid);
             smmu_inv_notifiers_all(&s->smmu_state);
             smmu_iotlb_inv_asid_vmid(bs, asid, vmid);
+            if (!smmuv3_accel_issue_inv_cmd(s, &cmd, NULL, &local_err)) {
+                error_report_err(local_err);
+                cmd_error = SMMU_CERROR_ILL;
+                break;
+            }
             break;
         }
         case SMMU_CMD_TLBI_NH_ALL:
@@ -1440,18 +1452,36 @@ static int smmuv3_cmdq_consume(SMMUv3State *s)
             QEMU_FALLTHROUGH;
         }
         case SMMU_CMD_TLBI_NSNH_ALL:
+        {
+            Error *local_err = NULL;
+
             trace_smmuv3_cmdq_tlbi_nsnh();
             smmu_inv_notifiers_all(&s->smmu_state);
             smmu_iotlb_inv_all(bs);
+            if (!smmuv3_accel_issue_inv_cmd(s, &cmd, NULL, &local_err)) {
+                error_report_err(local_err);
+                cmd_error = SMMU_CERROR_ILL;
+                break;
+            }
             break;
+        }
         case SMMU_CMD_TLBI_NH_VAA:
         case SMMU_CMD_TLBI_NH_VA:
+        {
+            Error *local_err = NULL;
+
             if (!STAGE1_SUPPORTED(s)) {
                 cmd_error = SMMU_CERROR_ILL;
                 break;
             }
             smmuv3_range_inval(bs, &cmd, SMMU_STAGE_1);
+            if (!smmuv3_accel_issue_inv_cmd(s, &cmd, NULL, &local_err)) {
+                error_report_err(local_err);
+                cmd_error = SMMU_CERROR_ILL;
+                break;
+            }
             break;
+        }
         case SMMU_CMD_TLBI_S12_VMALL:
         {
             int vmid = CMD_VMID(&cmd);
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH v5 18/32] hw/arm/smmuv3: Initialize ID registers early during realize()
  2025-10-31 10:49 [PATCH v5 00/32] hw/arm/virt: Add support for user-creatable accelerated SMMUv3 Shameer Kolothum
                   ` (16 preceding siblings ...)
  2025-10-31 10:49 ` [PATCH v5 17/32] hw/arm/smmuv3-accel: Add support to issue invalidation cmd to host Shameer Kolothum
@ 2025-10-31 10:49 ` Shameer Kolothum
  2025-11-01  0:24   ` Nicolin Chen
                     ` (2 more replies)
  2025-10-31 10:49 ` [PATCH v5 19/32] hw/arm/smmuv3-accel: Get host SMMUv3 hw info and validate Shameer Kolothum
                   ` (13 subsequent siblings)
  31 siblings, 3 replies; 148+ messages in thread
From: Shameer Kolothum @ 2025-10-31 10:49 UTC (permalink / raw)
  To: qemu-arm, qemu-devel
  Cc: eric.auger, peter.maydell, jgg, nicolinc, ddutile, berrange,
	nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
	jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu, kjaju

Factor out ID register init into smmuv3_init_id_regs() and call it from
realize(). This ensures the ID registers are initialized early for use in
the accelerated SMMUv3 path; a subsequent patch will make use of this.

Other registers remain initialized in smmuv3_reset().

Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
 hw/arm/smmuv3.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index 15173ddc9c..fae545f35c 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -258,7 +258,12 @@ void smmuv3_record_event(SMMUv3State *s, SMMUEventInfo *info)
     info->recorded = true;
 }
 
-static void smmuv3_init_regs(SMMUv3State *s)
+/*
+ * Called during realize(), as the ID registers will be accessed early in the
+ * SMMUv3 accel path for feature compatibility checks. The remaining registers
+ * are initialized later in smmuv3_reset().
+ */
+static void smmuv3_init_id_regs(SMMUv3State *s)
 {
     /* Based on sys property, the stages supported in smmu will be advertised.*/
     if (s->stage && !strcmp("2", s->stage)) {
@@ -298,7 +303,11 @@ static void smmuv3_init_regs(SMMUv3State *s)
     s->idr[5] = FIELD_DP32(s->idr[5], IDR5, GRAN4K, 1);
     s->idr[5] = FIELD_DP32(s->idr[5], IDR5, GRAN16K, 1);
     s->idr[5] = FIELD_DP32(s->idr[5], IDR5, GRAN64K, 1);
+    s->aidr = 0x1;
+}
 
+static void smmuv3_reset(SMMUv3State *s)
+{
     s->cmdq.base = deposit64(s->cmdq.base, 0, 5, SMMU_CMDQS);
     s->cmdq.prod = 0;
     s->cmdq.cons = 0;
@@ -310,7 +319,6 @@ static void smmuv3_init_regs(SMMUv3State *s)
 
     s->features = 0;
     s->sid_split = 0;
-    s->aidr = 0x1;
     s->cr[0] = 0;
     s->cr0ack = 0;
     s->irq_ctrl = 0;
@@ -1915,7 +1923,7 @@ static void smmu_reset_exit(Object *obj, ResetType type)
         c->parent_phases.exit(obj, type);
     }
 
-    smmuv3_init_regs(s);
+    smmuv3_reset(s);
     smmuv3_accel_reset(s);
 }
 
@@ -1947,6 +1955,7 @@ static void smmu_realize(DeviceState *d, Error **errp)
     sysbus_init_mmio(dev, &sys->iomem);
 
     smmu_init_irq(s, dev);
+    smmuv3_init_id_regs(s);
 }
 
 static const VMStateDescription vmstate_smmuv3_queue = {
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH v5 19/32] hw/arm/smmuv3-accel: Get host SMMUv3 hw info and validate
  2025-10-31 10:49 [PATCH v5 00/32] hw/arm/virt: Add support for user-creatable accelerated SMMUv3 Shameer Kolothum
                   ` (17 preceding siblings ...)
  2025-10-31 10:49 ` [PATCH v5 18/32] hw/arm/smmuv3: Initialize ID registers early during realize() Shameer Kolothum
@ 2025-10-31 10:49 ` Shameer Kolothum
  2025-11-01  0:49   ` Nicolin Chen
                     ` (2 more replies)
  2025-10-31 10:49 ` [PATCH v5 20/32] hw/pci-host/gpex: Allow to generate preserve boot config DSM #5 Shameer Kolothum
                   ` (12 subsequent siblings)
  31 siblings, 3 replies; 148+ messages in thread
From: Shameer Kolothum @ 2025-10-31 10:49 UTC (permalink / raw)
  To: qemu-arm, qemu-devel
  Cc: eric.auger, peter.maydell, jgg, nicolinc, ddutile, berrange,
	nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
	jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu, kjaju

Just before the device gets attached to the SMMUv3, make sure the QEMU
SMMUv3 features are compatible with the host SMMUv3.

Not all fields in the host SMMUv3 IDR registers are meaningful for userspace.
Only the following fields can be used:

  - IDR0: ST_LEVEL, TERM_MODEL, STALL_MODEL, TTENDIAN, CD2L, ASID16, TTF
  - IDR1: SIDSIZE, SSIDSIZE
  - IDR3: BBML, RIL
  - IDR5: VAX, GRAN64K, GRAN16K, GRAN4K

For now, the checks simply make sure the features are in sync, which is
enough to enable basic accelerated SMMUv3 support.
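
The overall flow added to the set_iommu_device() path is roughly (a sketch of
what the patch below implements):

    /* sketch: host/guest SMMUv3 compatibility check before vIOMMU allocation */
    iommufd_backend_get_device_info(idev->iommufd, idev->devid, &data_type,
                                    &info, sizeof(info), &caps, errp);
    if (data_type != IOMMU_HW_INFO_TYPE_ARM_SMMUV3) {
        /* reject: the device is not behind a host SMMUv3 */
    }
    /* then compare the selected IDR0/IDR1/IDR3/IDR5 and AIDR fields */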

Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
 hw/arm/smmuv3-accel.c | 100 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 100 insertions(+)

diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
index a2deda3c32..8b9f88dd8e 100644
--- a/hw/arm/smmuv3-accel.c
+++ b/hw/arm/smmuv3-accel.c
@@ -28,6 +28,98 @@ MemoryRegion root;
 MemoryRegion sysmem;
 static AddressSpace *shared_as_sysmem;
 
+static bool
+smmuv3_accel_check_hw_compatible(SMMUv3State *s,
+                                 struct iommu_hw_info_arm_smmuv3 *info,
+                                 Error **errp)
+{
+    /* QEMU SMMUv3 supports both linear and 2-level stream tables */
+    if (FIELD_EX32(info->idr[0], IDR0, STLEVEL) !=
+                FIELD_EX32(s->idr[0], IDR0, STLEVEL)) {
+        error_setg(errp, "Host SMMUv3 differs in Stream Table format");
+        return false;
+    }
+
+    /* QEMU SMMUv3 supports only little-endian translation table walks */
+    if (FIELD_EX32(info->idr[0], IDR0, TTENDIAN) >
+                FIELD_EX32(s->idr[0], IDR0, TTENDIAN)) {
+        error_setg(errp, "Host SMMUv3 doesn't support Little-endian "
+                   "translation table");
+        return false;
+    }
+
+    /* QEMU SMMUv3 supports only AArch64 translation table format */
+    if (FIELD_EX32(info->idr[0], IDR0, TTF) <
+                FIELD_EX32(s->idr[0], IDR0, TTF)) {
+        error_setg(errp, "Host SMMUv3 doesn't support AArch64 translation "
+                   "table format");
+        return false;
+    }
+
+    /* QEMU SMMUv3 supports SIDSIZE 16 */
+    if (FIELD_EX32(info->idr[1], IDR1, SIDSIZE) <
+                FIELD_EX32(s->idr[1], IDR1, SIDSIZE)) {
+        error_setg(errp, "Host SMMUv3 SIDSIZE not compatible");
+        return false;
+    }
+
+    /* QEMU SMMUv3 supports Range Invalidation by default */
+    if (FIELD_EX32(info->idr[3], IDR3, RIL) !=
+                FIELD_EX32(s->idr[3], IDR3, RIL)) {
+        error_setg(errp, "Host SMMUv3 doesn't support Range Invalidation");
+        return false;
+    }
+
+    /* QEMU SMMUv3 supports GRAN4K/GRAN16K/GRAN64K translation granules */
+    if (FIELD_EX32(info->idr[5], IDR5, GRAN4K) !=
+                FIELD_EX32(s->idr[5], IDR5, GRAN4K)) {
+        error_setg(errp, "Host SMMUv3 doesn't support 4K translation granule");
+        return false;
+    }
+    if (FIELD_EX32(info->idr[5], IDR5, GRAN16K) !=
+                FIELD_EX32(s->idr[5], IDR5, GRAN16K)) {
+        error_setg(errp, "Host SMMUv3 doesn't support 16K translation granule");
+        return false;
+    }
+    if (FIELD_EX32(info->idr[5], IDR5, GRAN64K) !=
+                FIELD_EX32(s->idr[5], IDR5, GRAN64K)) {
+        error_setg(errp, "Host SMMUv3 doesn't support 64K translation granule");
+        return false;
+    }
+
+    /* QEMU SMMUv3 supports architecture version 3.1 */
+    if (info->aidr < s->aidr) {
+        error_setg(errp, "Host SMMUv3 architecture version not compatible");
+        return false;
+    }
+    return true;
+}
+
+static bool
+smmuv3_accel_hw_compatible(SMMUv3State *s, HostIOMMUDeviceIOMMUFD *idev,
+                           Error **errp)
+{
+    struct iommu_hw_info_arm_smmuv3 info;
+    uint32_t data_type;
+    uint64_t caps;
+
+    if (!iommufd_backend_get_device_info(idev->iommufd, idev->devid, &data_type,
+                                         &info, sizeof(info), &caps, errp)) {
+        return false;
+    }
+
+    if (data_type != IOMMU_HW_INFO_TYPE_ARM_SMMUV3) {
+        error_setg(errp, "Wrong data type (%d) for Host SMMUv3 device info",
+                     data_type);
+        return false;
+    }
+
+    if (!smmuv3_accel_check_hw_compatible(s, &info, errp)) {
+        return false;
+    }
+    return true;
+}
+
 static bool
 smmuv3_accel_alloc_vdev(SMMUv3AccelDevice *accel_dev, int sid, Error **errp)
 {
@@ -356,6 +448,14 @@ static bool smmuv3_accel_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
         return true;
     }
 
+    /*
+     * Check the host SMMUv3 associated with the dev is compatible with the
+     * QEMU SMMUv3 accel.
+     */
+    if (!smmuv3_accel_hw_compatible(s, idev, errp)) {
+        return false;
+    }
+
     if (!smmuv3_accel_dev_alloc_viommu(accel_dev, idev, errp)) {
         error_append_hint(errp, "Device 0x%x: Unable to alloc viommu", sid);
         return false;
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH v5 20/32] hw/pci-host/gpex: Allow to generate preserve boot config DSM #5
  2025-10-31 10:49 [PATCH v5 00/32] hw/arm/virt: Add support for user-creatable accelerated SMMUv3 Shameer Kolothum
                   ` (18 preceding siblings ...)
  2025-10-31 10:49 ` [PATCH v5 19/32] hw/arm/smmuv3-accel: Get host SMMUv3 hw info and validate Shameer Kolothum
@ 2025-10-31 10:49 ` Shameer Kolothum
  2025-11-03 13:58   ` Jonathan Cameron via
  2025-10-31 10:49 ` [PATCH v5 21/32] hw/arm/virt: Set PCI preserve_config for accel SMMUv3 Shameer Kolothum
                   ` (11 subsequent siblings)
  31 siblings, 1 reply; 148+ messages in thread
From: Shameer Kolothum @ 2025-10-31 10:49 UTC (permalink / raw)
  To: qemu-arm, qemu-devel
  Cc: eric.auger, peter.maydell, jgg, nicolinc, ddutile, berrange,
	nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
	jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu, kjaju

From: Eric Auger <eric.auger@redhat.com>

Add a 'preserve_config' field in struct GPEXConfig and, if set, generate
the _DSM function #5 for preserving PCI boot configurations.

This will be used for SMMUv3 accel=on support in a subsequent patch. When
SMMUv3 acceleration (accel=on) is enabled, QEMU exposes IORT Reserved
Memory Region (RMR) nodes to support MSI doorbell translations. As per
the Arm IORT specification, using IORT RMRs mandates the presence of
_DSM function #5 so that the OS retains the firmware-assigned PCI
configuration. Hence, this patch adds conditional support for generating
_DSM #5.

According to the ACPI Specification, Revision 6.6, Section 9.1.1 -
“_DSM (Device Specific Method)”,

"
If Function Index is zero, the return is a buffer containing one bit for
each function index, starting with zero. Bit 0 indicates whether there
is support for any functions other than function 0 for the specified
UUID and Revision ID. If set to zero, no functions are supported (other
than function zero) for the specified UUID and Revision ID. If set to
one, at least one additional function is supported. For all other bits
in the buffer, a bit is set to zero to indicate if that function index
is not supported for the specific UUID and Revision ID. (For example,
bit 1 set to 0 indicates that function index 1 is not supported for the
specific UUID and Revision ID.)
"

Please refer to the PCI Firmware Specification, Revision 3.3, Section 4.6.5 —
"_DSM for Preserving PCI Boot Configurations" for Function 5 of _DSM
method.

Also, while at it, move the byte_list declaration to the top of the
function for clarity.

At the moment, _DSM #5 generation is not yet enabled; nothing sets
preserve_config yet.

The resulting AML when preserve_config=true is:

    Method (_DSM, 4, NotSerialized)
        {
            If ((Arg0 == ToUUID ("e5c937d0-3553-4d7a-9117-ea4d19c3434d")))
                {
                    If ((Arg2 == Zero))
                    {
                        Return (Buffer (One)
                        {
                             0x21
                        })
                    }

                    If ((Arg2 == 0x05))
                    {
                        Return (Zero)
                    }
                }
         ...
      }

Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Eric Auger <eric.auger@redhat.com>
[Shameer: Removed possible duplicate _DSM creations]
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
Previously, QEMU reverted an attempt to enable DSM #5 because it caused a
regression,
https://lore.kernel.org/all/20210724185234.GA2265457@roeck-us.net/.

However, in this series, we enable it selectively, only when SMMUv3 is in
accelerator mode. The devices involved in the earlier regression are not
expected in accelerated SMMUv3 use cases.
---
 hw/pci-host/gpex-acpi.c    | 29 +++++++++++++++++++++++------
 include/hw/pci-host/gpex.h |  1 +
 2 files changed, 24 insertions(+), 6 deletions(-)

diff --git a/hw/pci-host/gpex-acpi.c b/hw/pci-host/gpex-acpi.c
index 4587baeb78..d9820f9b41 100644
--- a/hw/pci-host/gpex-acpi.c
+++ b/hw/pci-host/gpex-acpi.c
@@ -51,10 +51,11 @@ static void acpi_dsdt_add_pci_route_table(Aml *dev, uint32_t irq,
     }
 }
 
-static Aml *build_pci_host_bridge_dsm_method(void)
+static Aml *build_pci_host_bridge_dsm_method(bool preserve_config)
 {
     Aml *method = aml_method("_DSM", 4, AML_NOTSERIALIZED);
     Aml *UUID, *ifctx, *ifctx1, *buf;
+    uint8_t byte_list[1] = {0};
 
     /* PCI Firmware Specification 3.0
      * 4.6.1. _DSM for PCI Express Slot Information
@@ -64,10 +65,23 @@ static Aml *build_pci_host_bridge_dsm_method(void)
     UUID = aml_touuid("E5C937D0-3553-4D7A-9117-EA4D19C3434D");
     ifctx = aml_if(aml_equal(aml_arg(0), UUID));
     ifctx1 = aml_if(aml_equal(aml_arg(2), aml_int(0)));
-    uint8_t byte_list[1] = {0};
+    if (preserve_config) {
+        /* support functions other than 0, specifically function 5 */
+        byte_list[0] = 0x21;
+    }
     buf = aml_buffer(1, byte_list);
     aml_append(ifctx1, aml_return(buf));
     aml_append(ifctx, ifctx1);
+    if (preserve_config) {
+        Aml *ifctx2 = aml_if(aml_equal(aml_arg(2), aml_int(5)));
+        /*
+         * 0 - The operating system must not ignore the PCI configuration that
+         *     firmware has done at boot time.
+         */
+        aml_append(ifctx2, aml_return(aml_int(0)));
+        aml_append(ifctx, ifctx2);
+    }
+
     aml_append(method, ifctx);
 
     byte_list[0] = 0;
@@ -77,12 +91,13 @@ static Aml *build_pci_host_bridge_dsm_method(void)
 }
 
 static void acpi_dsdt_add_host_bridge_methods(Aml *dev,
-                                              bool enable_native_pcie_hotplug)
+                                              bool enable_native_pcie_hotplug,
+                                              bool preserve_config)
 {
     /* Declare an _OSC (OS Control Handoff) method */
     aml_append(dev,
                build_pci_host_bridge_osc_method(enable_native_pcie_hotplug));
-    aml_append(dev, build_pci_host_bridge_dsm_method());
+    aml_append(dev, build_pci_host_bridge_dsm_method(preserve_config));
 }
 
 void acpi_dsdt_add_gpex(Aml *scope, struct GPEXConfig *cfg)
@@ -152,7 +167,8 @@ void acpi_dsdt_add_gpex(Aml *scope, struct GPEXConfig *cfg)
                 build_cxl_osc_method(dev);
             } else {
                 /* pxb bridges do not have ACPI PCI Hot-plug enabled */
-                acpi_dsdt_add_host_bridge_methods(dev, true);
+                acpi_dsdt_add_host_bridge_methods(dev, true,
+                                                  cfg->preserve_config);
             }
 
             aml_append(scope, dev);
@@ -227,7 +243,8 @@ void acpi_dsdt_add_gpex(Aml *scope, struct GPEXConfig *cfg)
     }
     aml_append(dev, aml_name_decl("_CRS", rbuf));
 
-    acpi_dsdt_add_host_bridge_methods(dev, cfg->pci_native_hotplug);
+    acpi_dsdt_add_host_bridge_methods(dev, cfg->pci_native_hotplug,
+                                      cfg->preserve_config);
 
     Aml *dev_res0 = aml_device("%s", "RES0");
     aml_append(dev_res0, aml_name_decl("_HID", aml_string("PNP0C02")));
diff --git a/include/hw/pci-host/gpex.h b/include/hw/pci-host/gpex.h
index feaf827474..7eea16e728 100644
--- a/include/hw/pci-host/gpex.h
+++ b/include/hw/pci-host/gpex.h
@@ -46,6 +46,7 @@ struct GPEXConfig {
     int         irq;
     PCIBus      *bus;
     bool        pci_native_hotplug;
+    bool        preserve_config;
 };
 
 typedef struct GPEXIrq GPEXIrq;
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH v5 21/32] hw/arm/virt: Set PCI preserve_config for accel SMMUv3
  2025-10-31 10:49 [PATCH v5 00/32] hw/arm/virt: Add support for user-creatable accelerated SMMUv3 Shameer Kolothum
                   ` (19 preceding siblings ...)
  2025-10-31 10:49 ` [PATCH v5 20/32] hw/pci-host/gpex: Allow to generate preserve boot config DSM #5 Shameer Kolothum
@ 2025-10-31 10:49 ` Shameer Kolothum
  2025-11-03 14:58   ` Eric Auger
  2025-10-31 10:49 ` [PATCH v5 22/32] tests/qtest/bios-tables-test: Prepare for IORT revision upgrade Shameer Kolothum
                   ` (10 subsequent siblings)
  31 siblings, 1 reply; 148+ messages in thread
From: Shameer Kolothum @ 2025-10-31 10:49 UTC (permalink / raw)
  To: qemu-arm, qemu-devel
  Cc: eric.auger, peter.maydell, jgg, nicolinc, ddutile, berrange,
	nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
	jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu, kjaju

Introduce a new pci_preserve_config field in the virt machine state which
allows the generation of _DSM #5. This field is only set if an accel
SMMUv3 is instantiated.

In a subsequent patch, SMMUv3 accel mode will make use of IORT RMR nodes
to enable nested translation of MSI doorbell addresses. IORT RMR requires
_DSM #5 to be set for the PCI host bridge so that the Guest kernel
preserves the PCI boot configuration.

Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
 hw/arm/virt-acpi-build.c | 8 ++++++++
 hw/arm/virt.c            | 4 ++++
 include/hw/arm/virt.h    | 1 +
 3 files changed, 13 insertions(+)

diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
index 8bb6b60515..d51da6e27d 100644
--- a/hw/arm/virt-acpi-build.c
+++ b/hw/arm/virt-acpi-build.c
@@ -163,6 +163,14 @@ static void acpi_dsdt_add_pci(Aml *scope, const MemMapEntry *memmap,
         .pci_native_hotplug = !acpi_pcihp,
     };
 
+    /*
+     * Accel SMMU requires RMRs for MSI 1-1 mapping, which require _DSM for
+     * preserving PCI Boot Configurations
+     */
+    if (vms->pci_preserve_config) {
+        cfg.preserve_config = true;
+    }
+
     if (vms->highmem_mmio) {
         cfg.mmio64 = memmap[VIRT_HIGH_PCIE_MMIO];
     }
diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 175023897a..8a347a6e39 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -3091,6 +3091,10 @@ static void virt_machine_device_plug_cb(HotplugHandler *hotplug_dev,
             }
 
             create_smmuv3_dev_dtb(vms, dev, bus);
+            if (object_property_find(OBJECT(dev), "accel") &&
+                object_property_get_bool(OBJECT(dev), "accel", &error_abort)) {
+                vms->pci_preserve_config = true;
+            }
         }
     }
 
diff --git a/include/hw/arm/virt.h b/include/hw/arm/virt.h
index 04a09af354..60db5d40b2 100644
--- a/include/hw/arm/virt.h
+++ b/include/hw/arm/virt.h
@@ -182,6 +182,7 @@ struct VirtMachineState {
     bool ns_el2_virt_timer_irq;
     CXLState cxl_devices_state;
     bool legacy_smmuv3_present;
+    bool pci_preserve_config;
 };
 
 #define VIRT_ECAM_ID(high) (high ? VIRT_HIGH_PCIE_ECAM : VIRT_PCIE_ECAM)
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH v5 22/32] tests/qtest/bios-tables-test: Prepare for IORT revision upgrade
  2025-10-31 10:49 [PATCH v5 00/32] hw/arm/virt: Add support for user-creatable accelerated SMMUv3 Shameer Kolothum
                   ` (20 preceding siblings ...)
  2025-10-31 10:49 ` [PATCH v5 21/32] hw/arm/virt: Set PCI preserve_config for accel SMMUv3 Shameer Kolothum
@ 2025-10-31 10:49 ` Shameer Kolothum
  2025-11-03 14:48   ` Jonathan Cameron via
  2025-11-03 14:59   ` Eric Auger
  2025-10-31 10:49 ` [PATCH v5 23/32] hw/arm/virt-acpi-build: Add IORT RMR regions to handle MSI nested binding Shameer Kolothum
                   ` (9 subsequent siblings)
  31 siblings, 2 replies; 148+ messages in thread
From: Shameer Kolothum @ 2025-10-31 10:49 UTC (permalink / raw)
  To: qemu-arm, qemu-devel
  Cc: eric.auger, peter.maydell, jgg, nicolinc, ddutile, berrange,
	nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
	jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu, kjaju

A subsequent patch will upgrade the IORT revision to 5 to add support
for IORT RMR nodes.

Add the affected IORT blobs to the allowed-diff list for the
bios-tables tests.

Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
 tests/qtest/bios-tables-test-allowed-diff.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/tests/qtest/bios-tables-test-allowed-diff.h b/tests/qtest/bios-tables-test-allowed-diff.h
index dfb8523c8b..3279638ad0 100644
--- a/tests/qtest/bios-tables-test-allowed-diff.h
+++ b/tests/qtest/bios-tables-test-allowed-diff.h
@@ -1 +1,5 @@
 /* List of comma-separated changed AML files to ignore */
+"tests/data/acpi/aarch64/virt/IORT",
+"tests/data/acpi/aarch64/virt/IORT.its_off",
+"tests/data/acpi/aarch64/virt/IORT.smmuv3-legacy",
+"tests/data/acpi/aarch64/virt/IORT.smmuv3-dev",
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH v5 23/32] hw/arm/virt-acpi-build: Add IORT RMR regions to handle MSI nested binding
  2025-10-31 10:49 [PATCH v5 00/32] hw/arm/virt: Add support for user-creatable accelerated SMMUv3 Shameer Kolothum
                   ` (21 preceding siblings ...)
  2025-10-31 10:49 ` [PATCH v5 22/32] tests/qtest/bios-tables-test: Prepare for IORT revision upgrade Shameer Kolothum
@ 2025-10-31 10:49 ` Shameer Kolothum
  2025-11-03 14:53   ` Jonathan Cameron via
  2025-10-31 10:49 ` [PATCH v5 24/32] tests/qtest/bios-tables-test: Update IORT blobs after revision upgrade Shameer Kolothum
                   ` (8 subsequent siblings)
  31 siblings, 1 reply; 148+ messages in thread
From: Shameer Kolothum @ 2025-10-31 10:49 UTC (permalink / raw)
  To: qemu-arm, qemu-devel
  Cc: eric.auger, peter.maydell, jgg, nicolinc, ddutile, berrange,
	nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
	jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu, kjaju

From: Eric Auger <eric.auger@redhat.com>

To handle SMMUv3 accel=on mode (which configures the host SMMUv3 in nested
mode), it is practical to expose to the guest reserved memory regions
(RMRs) covering the IOVAs used by the host kernel to map physical MSI
doorbells.

Those IOVAs fall within [0x8000000, 0x8100000], matching the MSI_IOVA_BASE
and MSI_IOVA_LENGTH definitions in the kernel arm-smmu-v3 driver. This is
the window used to allocate IOVAs matching physical MSI doorbells.

With those RMRs, the guest is forced to use a flat mapping for this range.
Hence the assigned device is programmed with one IOVA from this range.
Stage 1, owned by the guest, has a flat mapping for this IOVA. Stage 2,
owned by the VMM, then enforces a mapping from this IOVA to the physical
MSI doorbell.
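
As a concrete example (the exact addresses are only illustrative within the
reserved window): if the host kernel picked IOVA 0x8000000 for a device's
MSI doorbell, the RMR forces the guest Stage-1 tables to identity-map
0x8000000 -> 0x8000000, and the VMM's Stage-2 maps 0x8000000 to the
physical doorbell (e.g. GITS_TRANSLATER), so the device's MSI writes reach
the real doorbell.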

The creation of those RMR nodes is only relevant if a nested-stage SMMU is
in use, along with VFIO. As VFIO devices can be hotplugged, all RMRs need
to be created in advance.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
Suggested-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
 hw/arm/virt-acpi-build.c | 112 ++++++++++++++++++++++++++++++++++++---
 1 file changed, 104 insertions(+), 8 deletions(-)

diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
index d51da6e27d..097a48cc83 100644
--- a/hw/arm/virt-acpi-build.c
+++ b/hw/arm/virt-acpi-build.c
@@ -256,6 +256,29 @@ static void acpi_dsdt_add_tpm(Aml *scope, VirtMachineState *vms)
 #define ROOT_COMPLEX_ENTRY_SIZE 36
 #define IORT_NODE_OFFSET 48
 
+#define IORT_RMR_NUM_ID_MAPPINGS     1
+#define IORT_RMR_NUM_MEM_RANGE_DESC  1
+#define IORT_RMR_COMMON_HEADER_SIZE  28
+#define IORT_RMR_MEM_RANGE_DESC_SIZE 20
+
+/*
+ * IORT RMR flags:
+ *   Bit[0] = 0  Disallow remapping of reserved ranges
+ *   Bit[1] = 0  Unprivileged access
+ *   Bits[9:2] = 0x00 Device nGnRnE memory
+ */
+#define IORT_RMR_FLAGS  0
+
+/*
+ * MSI doorbell IOVA window used by the host kernel SMMUv3 driver.
+ * Described in IORT RMR nodes to reserve the IOVA range where the host
+ * kernel maps physical MSI doorbells for devices. This ensures guests
+ * preserve a flat mapping for the MSI doorbell in nested SMMUv3 (accel=on)
+ * configurations.
+ */
+#define MSI_IOVA_BASE   0x8000000
+#define MSI_IOVA_LENGTH 0x100000
+
 /*
  * Append an ID mapping entry as described by "Table 4 ID mapping format" in
  * "IO Remapping Table System Software on ARM Platforms", Chapter 3.
@@ -264,7 +287,8 @@ static void acpi_dsdt_add_tpm(Aml *scope, VirtMachineState *vms)
  * Note that @id_count gets internally subtracted by one, following the spec.
  */
 static void build_iort_id_mapping(GArray *table_data, uint32_t input_base,
-                                  uint32_t id_count, uint32_t out_ref)
+                                  uint32_t id_count, uint32_t out_ref,
+                                  uint32_t flags)
 {
     build_append_int_noprefix(table_data, input_base, 4); /* Input base */
     /* Number of IDs - The number of IDs in the range minus one */
@@ -272,7 +296,7 @@ static void build_iort_id_mapping(GArray *table_data, uint32_t input_base,
     build_append_int_noprefix(table_data, input_base, 4); /* Output base */
     build_append_int_noprefix(table_data, out_ref, 4); /* Output Reference */
     /* Flags */
-    build_append_int_noprefix(table_data, 0 /* Single mapping (disabled) */, 4);
+    build_append_int_noprefix(table_data, flags, 4);
 }
 
 struct AcpiIortIdMapping {
@@ -320,6 +344,7 @@ typedef struct AcpiIortSMMUv3Dev {
     GArray *rc_smmu_idmaps;
     /* Offset of the SMMUv3 IORT Node relative to the start of the IORT */
     size_t offset;
+    bool accel;
 } AcpiIortSMMUv3Dev;
 
 /*
@@ -374,6 +399,9 @@ static int iort_smmuv3_devices(Object *obj, void *opaque)
     }
 
     bus = PCI_BUS(object_property_get_link(obj, "primary-bus", &error_abort));
+    if (object_property_find(obj, "accel")) {
+        sdev.accel = object_property_get_bool(obj, "accel", &error_abort);
+    }
     pbus = PLATFORM_BUS_DEVICE(vms->platform_bus_dev);
     sbdev = SYS_BUS_DEVICE(obj);
     sdev.base = platform_bus_get_mmio_addr(pbus, sbdev, 0);
@@ -447,10 +475,70 @@ static void create_rc_its_idmaps(GArray *its_idmaps, GArray *smmuv3_devs)
     }
 }
 
+static void
+build_iort_rmr_nodes(GArray *table_data, GArray *smmuv3_devices, uint32_t *id)
+{
+    AcpiIortSMMUv3Dev *sdev;
+    AcpiIortIdMapping *idmap;
+    int i;
+
+    for (i = 0; i < smmuv3_devices->len; i++) {
+        uint16_t rmr_len;
+        int bdf;
+
+        sdev = &g_array_index(smmuv3_devices, AcpiIortSMMUv3Dev, i);
+        if (!sdev->accel) {
+            continue;
+        }
+
+        /*
+         * Spec reference: Arm IO Remapping Table (IORT), ARM DEN 0049E.d,
+         * Section 3.1.1.5 "Reserved Memory Range node"
+         */
+        idmap = &g_array_index(sdev->rc_smmu_idmaps, AcpiIortIdMapping, 0);
+        bdf = idmap->input_base;
+        rmr_len = IORT_RMR_COMMON_HEADER_SIZE
+                 + (IORT_RMR_NUM_ID_MAPPINGS * ID_MAPPING_ENTRY_SIZE)
+                 + (IORT_RMR_NUM_MEM_RANGE_DESC * IORT_RMR_MEM_RANGE_DESC_SIZE);
+
+        /* Table 18 Reserved Memory Range Node */
+        build_append_int_noprefix(table_data, 6 /* RMR */, 1); /* Type */
+        /* Length */
+        build_append_int_noprefix(table_data, rmr_len, 2);
+        build_append_int_noprefix(table_data, 3, 1); /* Revision */
+        build_append_int_noprefix(table_data, (*id)++, 4); /* Identifier */
+        /* Number of ID mappings */
+        build_append_int_noprefix(table_data, IORT_RMR_NUM_ID_MAPPINGS, 4);
+        /* Reference to ID Array */
+        build_append_int_noprefix(table_data, IORT_RMR_COMMON_HEADER_SIZE, 4);
+
+        /* RMR specific data */
+
+        /* Flags */
+        build_append_int_noprefix(table_data, IORT_RMR_FLAGS, 4);
+        /* Number of Memory Range Descriptors */
+        build_append_int_noprefix(table_data, IORT_RMR_NUM_MEM_RANGE_DESC, 4);
+        /* Reference to Memory Range Descriptors */
+        build_append_int_noprefix(table_data, IORT_RMR_COMMON_HEADER_SIZE +
+                        (IORT_RMR_NUM_ID_MAPPINGS * ID_MAPPING_ENTRY_SIZE), 4);
+        build_iort_id_mapping(table_data, bdf, idmap->id_count, sdev->offset,
+                              1);
+
+        /* Table 19 Memory Range Descriptor */
+
+        /* Physical Range offset */
+        build_append_int_noprefix(table_data, MSI_IOVA_BASE, 8);
+        /* Physical Range length */
+        build_append_int_noprefix(table_data, MSI_IOVA_LENGTH, 8);
+        build_append_int_noprefix(table_data, 0, 4); /* Reserved */
+        *id += 1;
+    }
+}
+
 /*
  * Input Output Remapping Table (IORT)
  * Conforms to "IO Remapping Table System Software on ARM Platforms",
- * Document number: ARM DEN 0049E.b, Feb 2021
+ * Document number: ARM DEN 0049E.d, Feb 2022
  */
 static void
 build_iort(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
@@ -464,7 +552,7 @@ build_iort(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
     GArray *smmuv3_devs = g_array_new(false, true, sizeof(AcpiIortSMMUv3Dev));
     GArray *rc_its_idmaps = g_array_new(false, true, sizeof(AcpiIortIdMapping));
 
-    AcpiTable table = { .sig = "IORT", .rev = 3, .oem_id = vms->oem_id,
+    AcpiTable table = { .sig = "IORT", .rev = 5, .oem_id = vms->oem_id,
                         .oem_table_id = vms->oem_table_id };
     /* Table 2 The IORT */
     acpi_table_begin(&table, table_data);
@@ -490,6 +578,13 @@ build_iort(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
             nb_nodes++; /* ITS */
             rc_mapping_count += rc_its_idmaps->len;
         }
+        /* Calculate RMR nodes required. One per SMMUv3 with accelerated mode */
+        for (i = 0; i < num_smmus; i++) {
+            sdev = &g_array_index(smmuv3_devs, AcpiIortSMMUv3Dev, i);
+            if (sdev->accel) {
+                nb_nodes++;
+            }
+        }
     } else {
         if (vms->its) {
             nb_nodes = 2; /* RC and ITS */
@@ -562,7 +657,7 @@ build_iort(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
         /* Array of ID mappings */
         if (smmu_mapping_count) {
             /* Output IORT node is the ITS Group node (the first node). */
-            build_iort_id_mapping(table_data, 0, 0x10000, IORT_NODE_OFFSET);
+            build_iort_id_mapping(table_data, 0, 0x10000, IORT_NODE_OFFSET, 0);
         }
     }
 
@@ -614,7 +709,7 @@ build_iort(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
                                        AcpiIortIdMapping, j);
                 /* Output IORT node is the SMMUv3 node. */
                 build_iort_id_mapping(table_data, range->input_base,
-                                      range->id_count, sdev->offset);
+                                      range->id_count, sdev->offset, 0);
             }
         }
 
@@ -627,7 +722,7 @@ build_iort(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
                 range = &g_array_index(rc_its_idmaps, AcpiIortIdMapping, i);
                 /* Output IORT node is the ITS Group node (the first node). */
                 build_iort_id_mapping(table_data, range->input_base,
-                                      range->id_count, IORT_NODE_OFFSET);
+                                      range->id_count, IORT_NODE_OFFSET, 0);
             }
         }
     } else {
@@ -636,9 +731,10 @@ build_iort(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
          * SMMU: RC -> ITS.
          * Output IORT node is the ITS Group node (the first node).
          */
-        build_iort_id_mapping(table_data, 0, 0x10000, IORT_NODE_OFFSET);
+        build_iort_id_mapping(table_data, 0, 0x10000, IORT_NODE_OFFSET, 0);
     }
 
+    build_iort_rmr_nodes(table_data, smmuv3_devs, &id);
     acpi_table_end(linker, &table);
     g_array_free(rc_its_idmaps, true);
     for (i = 0; i < num_smmus; i++) {
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH v5 24/32] tests/qtest/bios-tables-test: Update IORT blobs after revision upgrade
  2025-10-31 10:49 [PATCH v5 00/32] hw/arm/virt: Add support for user-creatable accelerated SMMUv3 Shameer Kolothum
                   ` (22 preceding siblings ...)
  2025-10-31 10:49 ` [PATCH v5 23/32] hw/arm/virt-acpi-build: Add IORT RMR regions to handle MSI nested binding Shameer Kolothum
@ 2025-10-31 10:49 ` Shameer Kolothum
  2025-11-03 14:54   ` Jonathan Cameron via
  2025-11-03 15:01   ` Eric Auger
  2025-10-31 10:49 ` [PATCH v5 25/32] hw/arm/smmuv3: Add accel property for SMMUv3 device Shameer Kolothum
                   ` (7 subsequent siblings)
  31 siblings, 2 replies; 148+ messages in thread
From: Shameer Kolothum @ 2025-10-31 10:49 UTC (permalink / raw)
  To: qemu-arm, qemu-devel
  Cc: eric.auger, peter.maydell, jgg, nicolinc, ddutile, berrange,
	nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
	jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu, kjaju

Update the reference IORT blobs after the revision upgrade for RMR node
support. This affects the aarch64 'virt' IORT tests.

The IORT diff is the same for all the tests:

 /*
  * Intel ACPI Component Architecture
  * AML/ASL+ Disassembler version 20230628 (64-bit version)
  * Copyright (c) 2000 - 2023 Intel Corporation
  *
- * Disassembly of tests/data/acpi/aarch64/virt/IORT, Mon Oct 20 14:42:41 2025
+ * Disassembly of /tmp/aml-B4ZRE3, Mon Oct 20 14:42:41 2025
  *
  * ACPI Data Table [IORT]
  *
  * Format: [HexOffset DecimalOffset ByteLength]  FieldName : FieldValue (in hex)
  */

 [000h 0000 004h]                   Signature : "IORT"    [IO Remapping Table]
 [004h 0004 004h]                Table Length : 00000080
-[008h 0008 001h]                    Revision : 03
-[009h 0009 001h]                    Checksum : B3
+[008h 0008 001h]                    Revision : 05
+[009h 0009 001h]                    Checksum : B1
 [00Ah 0010 006h]                      Oem ID : "BOCHS "
 [010h 0016 008h]                Oem Table ID : "BXPC    "
 [018h 0024 004h]                Oem Revision : 00000001
 [01Ch 0028 004h]             Asl Compiler ID : "BXPC"
 [020h 0032 004h]       Asl Compiler Revision : 00000001
 ...

Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
 tests/data/acpi/aarch64/virt/IORT               | Bin 128 -> 128 bytes
 tests/data/acpi/aarch64/virt/IORT.its_off       | Bin 172 -> 172 bytes
 tests/data/acpi/aarch64/virt/IORT.smmuv3-dev    | Bin 364 -> 364 bytes
 tests/data/acpi/aarch64/virt/IORT.smmuv3-legacy | Bin 276 -> 276 bytes
 tests/qtest/bios-tables-test-allowed-diff.h     |   4 ----
 5 files changed, 4 deletions(-)

diff --git a/tests/data/acpi/aarch64/virt/IORT b/tests/data/acpi/aarch64/virt/IORT
index 7efd0ce8a6b3928efa7e1373f688ab4c5f50543b..a234aae4c2d04668d34313836d32ca20e19c0880 100644
GIT binary patch
delta 18
ZcmZo*Y+&T_^bZPYU|?Wi-8hk}3;-#Q1d;#%

delta 18
ZcmZo*Y+&T_^bZPYU|?Wi-aL`33;-#O1d;#%

diff --git a/tests/data/acpi/aarch64/virt/IORT.its_off b/tests/data/acpi/aarch64/virt/IORT.its_off
index c10da4e61dd00e7eb062558a2735d49ca0b20620..0cf52b52f671637bf4dbc9e0fc80c3c73d0b01d3 100644
GIT binary patch
delta 18
ZcmZ3(xQ3C-(?2L=4FdxM>(q%{ivTdM1ttIh

delta 18
ZcmZ3(xQ3C-(?2L=4FdxM^Yn>aivTdK1ttIh

diff --git a/tests/data/acpi/aarch64/virt/IORT.smmuv3-dev b/tests/data/acpi/aarch64/virt/IORT.smmuv3-dev
index 67be268f62afbf2d9459540984da5e9340afdaaa..43a15fe2bf6cc650ffcbceff86919ea892928c0e 100644
GIT binary patch
delta 19
acmaFE^oEJc(?2LAhmnDS^~6T5Bt`%|fCYU3

delta 19
acmaFE^oEJc(?2LAhmnDS`P4?PBt`%|eg%C1

diff --git a/tests/data/acpi/aarch64/virt/IORT.smmuv3-legacy b/tests/data/acpi/aarch64/virt/IORT.smmuv3-legacy
index 41981a449fc306b80cccd87ddec3c593a8d72c07..5779d0e225a62b9cd70bebbacb7fd1e519c9e3c4 100644
GIT binary patch
delta 19
acmbQjG=+)F(?2Lggpq-P)oUXc7b5^FiUXej

delta 19
acmbQjG=+)F(?2Lggpq-P*=Hjc7b5^Fhy$Mh

diff --git a/tests/qtest/bios-tables-test-allowed-diff.h b/tests/qtest/bios-tables-test-allowed-diff.h
index 3279638ad0..dfb8523c8b 100644
--- a/tests/qtest/bios-tables-test-allowed-diff.h
+++ b/tests/qtest/bios-tables-test-allowed-diff.h
@@ -1,5 +1 @@
 /* List of comma-separated changed AML files to ignore */
-"tests/data/acpi/aarch64/virt/IORT",
-"tests/data/acpi/aarch64/virt/IORT.its_off",
-"tests/data/acpi/aarch64/virt/IORT.smmuv3-legacy",
-"tests/data/acpi/aarch64/virt/IORT.smmuv3-dev",
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH v5 25/32] hw/arm/smmuv3: Add accel property for SMMUv3 device
  2025-10-31 10:49 [PATCH v5 00/32] hw/arm/virt: Add support for user-creatable accelerated SMMUv3 Shameer Kolothum
                   ` (23 preceding siblings ...)
  2025-10-31 10:49 ` [PATCH v5 24/32] tests/qtest/bios-tables-test: Update IORT blobs after revision upgrade Shameer Kolothum
@ 2025-10-31 10:49 ` Shameer Kolothum
  2025-11-03 14:56   ` Jonathan Cameron via
  2025-10-31 10:49 ` [PATCH v5 26/32] hw/arm/smmuv3-accel: Add a property to specify RIL support Shameer Kolothum
                   ` (6 subsequent siblings)
  31 siblings, 1 reply; 148+ messages in thread
From: Shameer Kolothum @ 2025-10-31 10:49 UTC (permalink / raw)
  To: qemu-arm, qemu-devel
  Cc: eric.auger, peter.maydell, jgg, nicolinc, ddutile, berrange,
	nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
	jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu, kjaju

Introduce an “accel” property to enable accelerator mode.

Live migration is currently unsupported when accelerator mode is enabled,
so a migration blocker is added.

Because this mode relies on IORT RMR for MSI support, accelerator mode is
not supported for device tree boot.

Also, in the accelerated SMMUv3 case, the host SMMUv3 is configured in nested
mode (S1 + S2), and the guest owns the Stage-1 page table. Therefore, we
expose only Stage-1 to the guest to ensure it uses the correct page table
format.
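
A minimal usage sketch (the pcie.1 expander bridge name and device id are
illustrative) would be:

  -device arm-smmuv3,primary-bus=pcie.1,id=smmu0,accel=on

combined with ACPI firmware boot, since device tree boot is rejected when
accel=on.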

Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
 hw/arm/smmuv3.c          | 28 ++++++++++++++++++++++++++++
 hw/arm/virt-acpi-build.c |  4 +---
 hw/arm/virt.c            | 31 ++++++++++++++++++++++---------
 include/hw/arm/smmuv3.h  |  1 +
 4 files changed, 52 insertions(+), 12 deletions(-)

diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index fae545f35c..f040e6b91e 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -20,6 +20,7 @@
 #include "qemu/bitops.h"
 #include "hw/irq.h"
 #include "hw/sysbus.h"
+#include "migration/blocker.h"
 #include "migration/vmstate.h"
 #include "hw/qdev-properties.h"
 #include "hw/qdev-core.h"
@@ -1927,6 +1928,17 @@ static void smmu_reset_exit(Object *obj, ResetType type)
     smmuv3_accel_reset(s);
 }
 
+static bool smmu_validate_property(SMMUv3State *s, Error **errp)
+{
+#ifndef CONFIG_ARM_SMMUV3_ACCEL
+    if (s->accel) {
+        error_setg(errp, "accel=on support not compiled in");
+        return false;
+    }
+#endif
+    return true;
+}
+
 static void smmu_realize(DeviceState *d, Error **errp)
 {
     SMMUState *sys = ARM_SMMU(d);
@@ -1935,8 +1947,17 @@ static void smmu_realize(DeviceState *d, Error **errp)
     SysBusDevice *dev = SYS_BUS_DEVICE(d);
     Error *local_err = NULL;
 
+    if (!smmu_validate_property(s, errp)) {
+        return;
+    }
+
     if (s->accel) {
         smmuv3_accel_init(s);
+        error_setg(&s->migration_blocker, "Migration not supported with SMMUv3 "
+                   "accelerator mode enabled");
+        if (migrate_add_blocker(&s->migration_blocker, errp) < 0) {
+            return;
+        }
     }
 
     c->parent_realize(d, &local_err);
@@ -2035,6 +2056,7 @@ static const Property smmuv3_properties[] = {
      * Defaults to stage 1
      */
     DEFINE_PROP_STRING("stage", SMMUv3State, stage),
+    DEFINE_PROP_BOOL("accel", SMMUv3State, accel, false),
 };
 
 static void smmuv3_instance_init(Object *obj)
@@ -2056,6 +2078,12 @@ static void smmuv3_class_init(ObjectClass *klass, const void *data)
     device_class_set_props(dc, smmuv3_properties);
     dc->hotpluggable = false;
     dc->user_creatable = true;
+
+    object_class_property_set_description(klass,
+                                          "accel",
+                                          "Enable SMMUv3 accelerator support."
+                                          "Allows host SMMUv3 to be configured "
+                                          "in nested mode for vfio-pci dev assignment");
 }
 
 static int smmuv3_notify_flag_changed(IOMMUMemoryRegion *iommu,
diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
index 097a48cc83..6106ad1b6e 100644
--- a/hw/arm/virt-acpi-build.c
+++ b/hw/arm/virt-acpi-build.c
@@ -399,9 +399,7 @@ static int iort_smmuv3_devices(Object *obj, void *opaque)
     }
 
     bus = PCI_BUS(object_property_get_link(obj, "primary-bus", &error_abort));
-    if (object_property_find(obj, "accel")) {
-        sdev.accel = object_property_get_bool(obj, "accel", &error_abort);
-    }
+    sdev.accel = object_property_get_bool(obj, "accel", &error_abort);
     pbus = PLATFORM_BUS_DEVICE(vms->platform_bus_dev);
     sbdev = SYS_BUS_DEVICE(obj);
     sdev.base = platform_bus_get_mmio_addr(pbus, sbdev, 0);
diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 8a347a6e39..2498e3beff 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -1488,8 +1488,8 @@ static void create_smmuv3_dt_bindings(const VirtMachineState *vms, hwaddr base,
     g_free(node);
 }
 
-static void create_smmuv3_dev_dtb(VirtMachineState *vms,
-                                  DeviceState *dev, PCIBus *bus)
+static void create_smmuv3_dev_dtb(VirtMachineState *vms, DeviceState *dev,
+                                  PCIBus *bus, Error **errp)
 {
     PlatformBusDevice *pbus = PLATFORM_BUS_DEVICE(vms->platform_bus_dev);
     SysBusDevice *sbdev = SYS_BUS_DEVICE(dev);
@@ -1497,10 +1497,15 @@ static void create_smmuv3_dev_dtb(VirtMachineState *vms,
     hwaddr base = platform_bus_get_mmio_addr(pbus, sbdev, 0);
     MachineState *ms = MACHINE(vms);
 
-    if (!(vms->bootinfo.firmware_loaded && virt_is_acpi_enabled(vms)) &&
-        strcmp("pcie.0", bus->qbus.name)) {
-        warn_report("SMMUv3 device only supported with pcie.0 for DT");
-        return;
+    if (!(vms->bootinfo.firmware_loaded && virt_is_acpi_enabled(vms))) {
+        if (object_property_get_bool(OBJECT(dev), "accel", &error_abort)) {
+            error_setg(errp, "SMMUv3 with accel=on not supported for DT");
+            return;
+        }
+        if (strcmp("pcie.0", bus->qbus.name)) {
+            warn_report("SMMUv3 device only supported with pcie.0 for DT");
+            return;
+        }
     }
     base += vms->memmap[VIRT_PLATFORM_BUS].base;
     irq += vms->irqmap[VIRT_PLATFORM_BUS];
@@ -3090,9 +3095,17 @@ static void virt_machine_device_plug_cb(HotplugHandler *hotplug_dev,
                 return;
             }
 
-            create_smmuv3_dev_dtb(vms, dev, bus);
-            if (object_property_find(OBJECT(dev), "accel") &&
-                object_property_get_bool(OBJECT(dev), "accel", &error_abort)) {
+            create_smmuv3_dev_dtb(vms, dev, bus, errp);
+            if (object_property_get_bool(OBJECT(dev), "accel", &error_abort)) {
+                g_autofree char *stage = NULL;
+                stage = object_property_get_str(OBJECT(dev), "stage",
+                                                &error_fatal);
+                /* If no stage specified, SMMUv3 will default to stage 1 */
+                if (*stage && strcmp("1", stage)) {
+                    error_setg(errp, "Only stage1 is supported for SMMUV3 with "
+                               "accel=on");
+                    return;
+                }
                 vms->pci_preserve_config = true;
             }
         }
diff --git a/include/hw/arm/smmuv3.h b/include/hw/arm/smmuv3.h
index e54ece2d38..6b9c27a9c4 100644
--- a/include/hw/arm/smmuv3.h
+++ b/include/hw/arm/smmuv3.h
@@ -67,6 +67,7 @@ struct SMMUv3State {
     /* SMMU has HW accelerator support for nested S1 + s2 */
     bool accel;
     struct SMMUv3AccelState *s_accel;
+    Error *migration_blocker;
 };
 
 typedef enum {
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH v5 26/32] hw/arm/smmuv3-accel: Add a property to specify RIL support
  2025-10-31 10:49 [PATCH v5 00/32] hw/arm/virt: Add support for user-creatable accelerated SMMUv3 Shameer Kolothum
                   ` (24 preceding siblings ...)
  2025-10-31 10:49 ` [PATCH v5 25/32] hw/arm/smmuv3: Add accel property for SMMUv3 device Shameer Kolothum
@ 2025-10-31 10:49 ` Shameer Kolothum
  2025-11-03 15:07   ` Eric Auger
  2025-11-04  9:38   ` Eric Auger
  2025-10-31 10:50 ` [PATCH v5 27/32] hw/arm/smmuv3-accel: Add support for ATS Shameer Kolothum
                   ` (5 subsequent siblings)
  31 siblings, 2 replies; 148+ messages in thread
From: Shameer Kolothum @ 2025-10-31 10:49 UTC (permalink / raw)
  To: qemu-arm, qemu-devel
  Cc: eric.auger, peter.maydell, jgg, nicolinc, ddutile, berrange,
	nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
	jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu, kjaju

Currently QEMU SMMUv3 has RIL (range invalidation) support by default. But
if accelerated mode is enabled, the RIL setting has to match what the host
SMMUv3 supports.

Add a property so that the user can specify this.
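
For example, on a host SMMUv3 without range invalidation support, the
device would need to be created with accel=on,ril=off so that the IDR3.RIL
check in smmuv3_accel_check_hw_compatible() passes; with the default
ril=on such a host is rejected.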

Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
 hw/arm/smmuv3-accel.c   | 15 +++++++++++++--
 hw/arm/smmuv3-accel.h   |  4 ++++
 hw/arm/smmuv3.c         | 12 ++++++++++++
 include/hw/arm/smmuv3.h |  1 +
 4 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
index 8b9f88dd8e..35298350cb 100644
--- a/hw/arm/smmuv3-accel.c
+++ b/hw/arm/smmuv3-accel.c
@@ -63,10 +63,10 @@ smmuv3_accel_check_hw_compatible(SMMUv3State *s,
         return false;
     }
 
-    /* QEMU SMMUv3 supports Range Invalidation by default */
+    /* User can disable QEMU SMMUv3 Range Invalidation support */
     if (FIELD_EX32(info->idr[3], IDR3, RIL) !=
                 FIELD_EX32(s->idr[3], IDR3, RIL)) {
-        error_setg(errp, "Host SMMUv3 doesn't support Range Invalidation");
+        error_setg(errp, "Host SMMUv3 differs in Range Invalidation support");
         return false;
     }
 
@@ -635,6 +635,17 @@ static const PCIIOMMUOps smmuv3_accel_ops = {
     .get_msi_address_space = smmuv3_accel_get_msi_as,
 };
 
+void smmuv3_accel_idr_override(SMMUv3State *s)
+{
+    if (!s->accel) {
+        return;
+    }
+
+    /* By default QEMU SMMUv3 has RIL. Update IDR3 if user has disabled it */
+    if (!s->ril) {
+        s->idr[3] = FIELD_DP32(s->idr[3], IDR3, RIL, 0);
+    }
+}
 
 /* Based on SMMUv3 GBPA configuration, attach a corresponding HWPT */
 void smmuv3_accel_gbpa_update(SMMUv3State *s)
diff --git a/hw/arm/smmuv3-accel.h b/hw/arm/smmuv3-accel.h
index ee79548370..4f5b672712 100644
--- a/hw/arm/smmuv3-accel.h
+++ b/hw/arm/smmuv3-accel.h
@@ -55,6 +55,7 @@ bool smmuv3_accel_issue_inv_cmd(SMMUv3State *s, void *cmd, SMMUDevice *sdev,
                                 Error **errp);
 void smmuv3_accel_gbpa_update(SMMUv3State *s);
 void smmuv3_accel_reset(SMMUv3State *s);
+void smmuv3_accel_idr_override(SMMUv3State *s);
 #else
 static inline void smmuv3_accel_init(SMMUv3State *s)
 {
@@ -83,6 +84,9 @@ static inline void smmuv3_accel_gbpa_update(SMMUv3State *s)
 static inline void smmuv3_accel_reset(SMMUv3State *s)
 {
 }
+static inline void smmuv3_accel_idr_override(SMMUv3State *s)
+{
+}
 #endif
 
 #endif /* HW_ARM_SMMUV3_ACCEL_H */
diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index f040e6b91e..b9d96f5762 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -305,6 +305,7 @@ static void smmuv3_init_id_regs(SMMUv3State *s)
     s->idr[5] = FIELD_DP32(s->idr[5], IDR5, GRAN16K, 1);
     s->idr[5] = FIELD_DP32(s->idr[5], IDR5, GRAN64K, 1);
     s->aidr = 0x1;
+    smmuv3_accel_idr_override(s);
 }
 
 static void smmuv3_reset(SMMUv3State *s)
@@ -1936,6 +1937,13 @@ static bool smmu_validate_property(SMMUv3State *s, Error **errp)
         return false;
     }
 #endif
+    if (!s->accel) {
+        if (!s->ril) {
+            error_setg(errp, "ril can only be disabled if accel=on");
+            return false;
+        }
+        return true;
+    }
     return true;
 }
 
@@ -2057,6 +2065,8 @@ static const Property smmuv3_properties[] = {
      */
     DEFINE_PROP_STRING("stage", SMMUv3State, stage),
     DEFINE_PROP_BOOL("accel", SMMUv3State, accel, false),
+    /* RIL can be turned off for accel cases */
+    DEFINE_PROP_BOOL("ril", SMMUv3State, ril, true),
 };
 
 static void smmuv3_instance_init(Object *obj)
@@ -2084,6 +2094,8 @@ static void smmuv3_class_init(ObjectClass *klass, const void *data)
                                           "Enable SMMUv3 accelerator support."
                                           "Allows host SMMUv3 to be configured "
                                           "in nested mode for vfio-pci dev assignment");
+    object_class_property_set_description(klass, "ril",
+        "Disable range invalidation support (for accel=on)");
 }
 
 static int smmuv3_notify_flag_changed(IOMMUMemoryRegion *iommu,
diff --git a/include/hw/arm/smmuv3.h b/include/hw/arm/smmuv3.h
index 6b9c27a9c4..95202c2757 100644
--- a/include/hw/arm/smmuv3.h
+++ b/include/hw/arm/smmuv3.h
@@ -68,6 +68,7 @@ struct SMMUv3State {
     bool accel;
     struct SMMUv3AccelState *s_accel;
     Error *migration_blocker;
+    bool ril;
 };
 
 typedef enum {
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH v5 27/32] hw/arm/smmuv3-accel: Add support for ATS
  2025-10-31 10:49 [PATCH v5 00/32] hw/arm/virt: Add support for user-creatable accelerated SMMUv3 Shameer Kolothum
                   ` (25 preceding siblings ...)
  2025-10-31 10:49 ` [PATCH v5 26/32] hw/arm/smmuv3-accel: Add a property to specify RIL support Shameer Kolothum
@ 2025-10-31 10:50 ` Shameer Kolothum
  2025-11-04 14:22   ` Eric Auger
  2025-10-31 10:50 ` [PATCH v5 28/32] hw/arm/smmuv3-accel: Add property to specify OAS bits Shameer Kolothum
                   ` (4 subsequent siblings)
  31 siblings, 1 reply; 148+ messages in thread
From: Shameer Kolothum @ 2025-10-31 10:50 UTC (permalink / raw)
  To: qemu-arm, qemu-devel
  Cc: eric.auger, peter.maydell, jgg, nicolinc, ddutile, berrange,
	nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
	jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu, kjaju

QEMU SMMUv3 does not enable ATS (Address Translation Services) by default.
When accelerated mode is enabled and the host SMMUv3 supports ATS, it can
be useful to report ATS capability to the guest so it can take advantage
of it if the device also supports ATS.

Note: ATS support cannot be reliably detected from the host SMMUv3 IDR
registers alone, as firmware ACPI IORT tables may override them. The
user must therefore ensure the host supports ATS before enabling it.
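
For example, with accel=on,ats=on the guest sees IDR0.ATS set and the IORT
root complex node reports ATS Attribute = 1, so the guest kernel may enable
ATS on an assigned endpoint that also supports it. Setting ats=on without
accel=on is rejected at realize time.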

Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
 hw/arm/smmuv3-accel.c    |  4 ++++
 hw/arm/smmuv3.c          | 25 ++++++++++++++++++++++++-
 hw/arm/virt-acpi-build.c | 10 ++++++++--
 include/hw/arm/smmuv3.h  |  1 +
 4 files changed, 37 insertions(+), 3 deletions(-)

diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
index 35298350cb..5b0ef3804a 100644
--- a/hw/arm/smmuv3-accel.c
+++ b/hw/arm/smmuv3-accel.c
@@ -645,6 +645,10 @@ void smmuv3_accel_idr_override(SMMUv3State *s)
     if (!s->ril) {
         s->idr[3] = FIELD_DP32(s->idr[3], IDR3, RIL, 0);
     }
+    /* QEMU SMMUv3 has no ATS. Update IDR0 if user has enabled it */
+    if (s->ats) {
+        s->idr[0] = FIELD_DP32(s->idr[0], IDR0, ATS, 1); /* ATS */
+    }
 }
 
 /* Based on SMUUv3 GBPA configuration, attach a corresponding HWPT */
diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index b9d96f5762..d95279a733 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -1517,13 +1517,28 @@ static int smmuv3_cmdq_consume(SMMUv3State *s)
              */
             smmuv3_range_inval(bs, &cmd, SMMU_STAGE_2);
             break;
+        case SMMU_CMD_ATC_INV:
+        {
+            SMMUDevice *sdev = smmu_find_sdev(bs, CMD_SID(&cmd));
+            Error *local_err = NULL;
+
+            if (!sdev) {
+                break;
+            }
+
+            if (!smmuv3_accel_issue_inv_cmd(s, &cmd, sdev, &local_err)) {
+                error_report_err(local_err);
+                cmd_error = SMMU_CERROR_ILL;
+                break;
+            }
+            break;
+        }
         case SMMU_CMD_TLBI_EL3_ALL:
         case SMMU_CMD_TLBI_EL3_VA:
         case SMMU_CMD_TLBI_EL2_ALL:
         case SMMU_CMD_TLBI_EL2_ASID:
         case SMMU_CMD_TLBI_EL2_VA:
         case SMMU_CMD_TLBI_EL2_VAA:
-        case SMMU_CMD_ATC_INV:
         case SMMU_CMD_PRI_RESP:
         case SMMU_CMD_RESUME:
         case SMMU_CMD_STALL_TERM:
@@ -1942,6 +1957,10 @@ static bool smmu_validate_property(SMMUv3State *s, Error **errp)
             error_setg(errp, "ril can only be disabled if accel=on");
             return false;
         }
+        if (s->ats) {
+            error_setg(errp, "ats can only be enabled if accel=on");
+            return false;
+        }
         return true;
     }
     return true;
@@ -2067,6 +2086,7 @@ static const Property smmuv3_properties[] = {
     DEFINE_PROP_BOOL("accel", SMMUv3State, accel, false),
     /* RIL can be turned off for accel cases */
     DEFINE_PROP_BOOL("ril", SMMUv3State, ril, true),
+    DEFINE_PROP_BOOL("ats", SMMUv3State, ats, false),
 };
 
 static void smmuv3_instance_init(Object *obj)
@@ -2096,6 +2116,9 @@ static void smmuv3_class_init(ObjectClass *klass, const void *data)
                                           "in nested mode for vfio-pci dev assignment");
     object_class_property_set_description(klass, "ril",
         "Disable range invalidation support (for accel=on)");
+    object_class_property_set_description(klass, "ats",
+        "Enable/disable ATS support (for accel=on). Please ensure host "
+        "platform has ATS support before enabling this");
 }
 
 static int smmuv3_notify_flag_changed(IOMMUMemoryRegion *iommu,
diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
index 6106ad1b6e..1b0d0a2029 100644
--- a/hw/arm/virt-acpi-build.c
+++ b/hw/arm/virt-acpi-build.c
@@ -345,6 +345,7 @@ typedef struct AcpiIortSMMUv3Dev {
     /* Offset of the SMMUv3 IORT Node relative to the start of the IORT */
     size_t offset;
     bool accel;
+    bool ats;
 } AcpiIortSMMUv3Dev;
 
 /*
@@ -400,6 +401,7 @@ static int iort_smmuv3_devices(Object *obj, void *opaque)
 
     bus = PCI_BUS(object_property_get_link(obj, "primary-bus", &error_abort));
     sdev.accel = object_property_get_bool(obj, "accel", &error_abort);
+    sdev.ats = object_property_get_bool(obj, "ats", &error_abort);
     pbus = PLATFORM_BUS_DEVICE(vms->platform_bus_dev);
     sbdev = SYS_BUS_DEVICE(obj);
     sdev.base = platform_bus_get_mmio_addr(pbus, sbdev, 0);
@@ -544,6 +546,7 @@ build_iort(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
     int i, nb_nodes, rc_mapping_count;
     AcpiIortSMMUv3Dev *sdev;
     size_t node_size;
+    bool ats_needed = false;
     int num_smmus = 0;
     uint32_t id = 0;
     int rc_smmu_idmaps_len = 0;
@@ -579,6 +582,9 @@ build_iort(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
         /* Calculate RMR nodes required. One per SMMUv3 with accelerated mode */
         for (i = 0; i < num_smmus; i++) {
             sdev = &g_array_index(smmuv3_devs, AcpiIortSMMUv3Dev, i);
+            if (sdev->ats) {
+                ats_needed = true;
+            }
             if (sdev->accel) {
                 nb_nodes++;
             }
@@ -678,8 +684,8 @@ build_iort(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
     build_append_int_noprefix(table_data, 0, 2); /* Reserved */
     /* Table 15 Memory Access Flags */
     build_append_int_noprefix(table_data, 0x3 /* CCA = CPM = DACS = 1 */, 1);
-
-    build_append_int_noprefix(table_data, 0, 4); /* ATS Attribute */
+    /* ATS Attribute */
+    build_append_int_noprefix(table_data, (ats_needed ? 1 : 0), 4);
     /* MCFG pci_segment */
     build_append_int_noprefix(table_data, 0, 4); /* PCI Segment number */
 
diff --git a/include/hw/arm/smmuv3.h b/include/hw/arm/smmuv3.h
index 95202c2757..5fd5ec7b49 100644
--- a/include/hw/arm/smmuv3.h
+++ b/include/hw/arm/smmuv3.h
@@ -69,6 +69,7 @@ struct SMMUv3State {
     struct SMMUv3AccelState *s_accel;
     Error *migration_blocker;
     bool ril;
+    bool ats;
 };
 
 typedef enum {
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH v5 28/32] hw/arm/smmuv3-accel: Add property to specify OAS bits
  2025-10-31 10:49 [PATCH v5 00/32] hw/arm/virt: Add support for user-creatable accelerated SMMUv3 Shameer Kolothum
                   ` (26 preceding siblings ...)
  2025-10-31 10:50 ` [PATCH v5 27/32] hw/arm/smmuv3-accel: Add support for ATS Shameer Kolothum
@ 2025-10-31 10:50 ` Shameer Kolothum
  2025-11-04 14:35   ` Eric Auger
  2025-10-31 10:50 ` [PATCH v5 29/32] backends/iommufd: Retrieve PASID width from iommufd_backend_get_device_info() Shameer Kolothum
                   ` (3 subsequent siblings)
  31 siblings, 1 reply; 148+ messages in thread
From: Shameer Kolothum @ 2025-10-31 10:50 UTC (permalink / raw)
  To: qemu-arm, qemu-devel
  Cc: eric.auger, peter.maydell, jgg, nicolinc, ddutile, berrange,
	nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
	jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu, kjaju

QEMU SMMUv3 currently sets the output address size (OAS) to 44 bits. With
accelerator mode enabled, a guest device may use SVA where CPU page tables
are shared with SMMUv3, requiring OAS at least equal to the CPU OAS. Add
a user option to set this.
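
For instance, on a host where the CPUs (and hence the SVA page tables) use
a 48-bit physical address space, oas=48 would be required; the hardware
compatibility check then rejects the configuration if the host SMMUv3
IDR5.OAS is smaller than the value requested here.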

Note: Linux kernel docs currently state the OAS field in the IDR register
is not meaningful for users. However, it looks like we do need this
information here.

Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
 hw/arm/smmuv3-accel.c    | 22 ++++++++++++++++++++++
 hw/arm/smmuv3-internal.h |  3 ++-
 hw/arm/smmuv3.c          | 16 +++++++++++++++-
 include/hw/arm/smmuv3.h  |  1 +
 4 files changed, 40 insertions(+), 2 deletions(-)

diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
index 5b0ef3804a..c46510150e 100644
--- a/hw/arm/smmuv3-accel.c
+++ b/hw/arm/smmuv3-accel.c
@@ -28,6 +28,12 @@ MemoryRegion root;
 MemoryRegion sysmem;
 static AddressSpace *shared_as_sysmem;
 
+static int smmuv3_oas_bits(uint32_t oas)
+{
+    static const int map[] = { 32, 36, 40, 42, 44, 48, 52, 56 };
+    return (oas < ARRAY_SIZE(map)) ? map[oas] : -EINVAL;
+}
+
 static bool
 smmuv3_accel_check_hw_compatible(SMMUv3State *s,
                                  struct iommu_hw_info_arm_smmuv3 *info,
@@ -70,6 +76,18 @@ smmuv3_accel_check_hw_compatible(SMMUv3State *s,
         return false;
     }
 
+    /*
+     * TODO: The Linux kernel documentation says the OAS field is not
+     * meaningful to user space, but OAS still needs to be compatible for
+     * accelerator support. Please check.
+     */
+    if (FIELD_EX32(info->idr[5], IDR5, OAS) <
+                FIELD_EX32(s->idr[5], IDR5, OAS)) {
+        error_setg(errp, "Host SMMUv3 OAS(%d) bits not compatible",
+                   smmuv3_oas_bits(FIELD_EX32(info->idr[5], IDR5, OAS)));
+        return false;
+    }
+
     /* QEMU SMMUv3 supports GRAN4K/GRAN16K/GRAN64K translation granules */
     if (FIELD_EX32(info->idr[5], IDR5, GRAN4K) !=
                 FIELD_EX32(s->idr[5], IDR5, GRAN4K)) {
@@ -649,6 +667,10 @@ void smmuv3_accel_idr_override(SMMUv3State *s)
     if (s->ats) {
         s->idr[0] = FIELD_DP32(s->idr[0], IDR0, ATS, 1); /* ATS */
     }
+    /* QEMU SMMUv3 has OAS set to 44 bits. Update IDR5 if user set it to 48 */
+    if (s->oas == 48) {
+        s->idr[5] = FIELD_DP32(s->idr[5], IDR5, OAS, SMMU_IDR5_OAS_48);
+    }
 }
 
 /* Based on SMUUv3 GBPA configuration, attach a corresponding HWPT */
diff --git a/hw/arm/smmuv3-internal.h b/hw/arm/smmuv3-internal.h
index 5fd88b4257..cfc5897569 100644
--- a/hw/arm/smmuv3-internal.h
+++ b/hw/arm/smmuv3-internal.h
@@ -111,7 +111,8 @@ REG32(IDR5,                0x14)
      FIELD(IDR5, VAX,        10, 2);
      FIELD(IDR5, STALL_MAX,  16, 16);
 
-#define SMMU_IDR5_OAS 4
+#define SMMU_IDR5_OAS_44 4
+#define SMMU_IDR5_OAS_48 5
 
 REG32(IIDR,                0x18)
 REG32(AIDR,                0x1c)
diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index d95279a733..c4d28a3786 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -299,7 +299,8 @@ static void smmuv3_init_id_regs(SMMUv3State *s)
     s->idr[3] = FIELD_DP32(s->idr[3], IDR3, RIL, 1);
     s->idr[3] = FIELD_DP32(s->idr[3], IDR3, BBML, 2);
 
-    s->idr[5] = FIELD_DP32(s->idr[5], IDR5, OAS, SMMU_IDR5_OAS); /* 44 bits */
+    /* OAS: 44 bits */
+    s->idr[5] = FIELD_DP32(s->idr[5], IDR5, OAS, SMMU_IDR5_OAS_44);
     /* 4K, 16K and 64K granule support */
     s->idr[5] = FIELD_DP32(s->idr[5], IDR5, GRAN4K, 1);
     s->idr[5] = FIELD_DP32(s->idr[5], IDR5, GRAN16K, 1);
@@ -1961,6 +1962,15 @@ static bool smmu_validate_property(SMMUv3State *s, Error **errp)
             error_setg(errp, "ats can only be enabled if accel=on");
             return false;
         }
+        if (s->oas != 44) {
+            error_setg(errp, "OAS can only be set to 44 bits if accel=off");
+            return false;
+        }
+        return true;
+    }
+
+    if (s->oas != 44 && s->oas != 48) {
+        error_setg(errp, "OAS can only be set to 44 or 48 bits");
         return false;
     }
     return true;
@@ -2087,6 +2097,7 @@ static const Property smmuv3_properties[] = {
     /* RIL can be turned off for accel cases */
     DEFINE_PROP_BOOL("ril", SMMUv3State, ril, true),
     DEFINE_PROP_BOOL("ats", SMMUv3State, ats, false),
+    DEFINE_PROP_UINT8("oas", SMMUv3State, oas, 44),
 };
 
 static void smmuv3_instance_init(Object *obj)
@@ -2119,6 +2130,9 @@ static void smmuv3_class_init(ObjectClass *klass, const void *data)
     object_class_property_set_description(klass, "ats",
         "Enable/disable ATS support (for accel=on). Please ensure host "
         "platform has ATS support before enabling this");
+    object_class_property_set_description(klass, "oas",
+        "Specify Output Address Size (for accel =on). Supported values "
+        "are 44 or 48 bits. Defaults to 44 bits");
 }
 
 static int smmuv3_notify_flag_changed(IOMMUMemoryRegion *iommu,
diff --git a/include/hw/arm/smmuv3.h b/include/hw/arm/smmuv3.h
index 5fd5ec7b49..e4226b66f3 100644
--- a/include/hw/arm/smmuv3.h
+++ b/include/hw/arm/smmuv3.h
@@ -70,6 +70,7 @@ struct SMMUv3State {
     Error *migration_blocker;
     bool ril;
     bool ats;
+    uint8_t oas;
 };
 
 typedef enum {
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH v5 29/32] backends/iommufd: Retrieve PASID width from iommufd_backend_get_device_info()
  2025-10-31 10:49 [PATCH v5 00/32] hw/arm/virt: Add support for user-creatable accelerated SMMUv3 Shameer Kolothum
                   ` (27 preceding siblings ...)
  2025-10-31 10:50 ` [PATCH v5 28/32] hw/arm/smmuv3-accel: Add property to specify OAS bits Shameer Kolothum
@ 2025-10-31 10:50 ` Shameer Kolothum
  2025-10-31 10:50 ` [PATCH v5 30/32] Extend get_cap() callback to support PASID Shameer Kolothum
                   ` (2 subsequent siblings)
  31 siblings, 0 replies; 148+ messages in thread
From: Shameer Kolothum @ 2025-10-31 10:50 UTC (permalink / raw)
  To: qemu-arm, qemu-devel
  Cc: eric.auger, peter.maydell, jgg, nicolinc, ddutile, berrange,
	nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
	jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu, kjaju

Retrieve PASID width from iommufd_backend_get_device_info() and store it
in HostIOMMUDeviceCaps for later use.
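
(out_max_pasid_log2 gives the PASID width in bits; for example, a value of
20 means PASIDs up to 2^20 - 1 can be used, which is the maximum the PCIe
PASID capability allows.)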

Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
 backends/iommufd.c                 | 6 +++++-
 hw/arm/smmuv3-accel.c              | 3 ++-
 hw/vfio/iommufd.c                  | 7 +++++--
 include/system/host_iommu_device.h | 3 +++
 include/system/iommufd.h           | 3 ++-
 5 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/backends/iommufd.c b/backends/iommufd.c
index e68a2c934f..6381f9664b 100644
--- a/backends/iommufd.c
+++ b/backends/iommufd.c
@@ -388,7 +388,8 @@ bool iommufd_backend_get_dirty_bitmap(IOMMUFDBackend *be,
 
 bool iommufd_backend_get_device_info(IOMMUFDBackend *be, uint32_t devid,
                                      uint32_t *type, void *data, uint32_t len,
-                                     uint64_t *caps, Error **errp)
+                                     uint64_t *caps, uint8_t *max_pasid_log2,
+                                     Error **errp)
 {
     struct iommu_hw_info info = {
         .size = sizeof(info),
@@ -407,6 +408,9 @@ bool iommufd_backend_get_device_info(IOMMUFDBackend *be, uint32_t devid,
     g_assert(caps);
     *caps = info.out_capabilities;
 
+    if (max_pasid_log2) {
+        *max_pasid_log2 = info.out_max_pasid_log2;
+    }
     return true;
 }
 
diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
index c46510150e..caa4a1d82d 100644
--- a/hw/arm/smmuv3-accel.c
+++ b/hw/arm/smmuv3-accel.c
@@ -122,7 +122,8 @@ smmuv3_accel_hw_compatible(SMMUv3State *s, HostIOMMUDeviceIOMMUFD *idev,
     uint64_t caps;
 
     if (!iommufd_backend_get_device_info(idev->iommufd, idev->devid, &data_type,
-                                         &info, sizeof(info), &caps, errp)) {
+                                         &info, sizeof(info), &caps, NULL,
+                                         errp)) {
         return false;
     }
 
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index 2ab52723c6..212970e2e2 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -366,7 +366,8 @@ static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
      * instead.
      */
     if (!iommufd_backend_get_device_info(vbasedev->iommufd, vbasedev->devid,
-                                         &type, NULL, 0, &hw_caps, errp)) {
+                                         &type, NULL, 0, &hw_caps, NULL,
+                                         errp)) {
         return false;
     }
 
@@ -901,19 +902,21 @@ static bool hiod_iommufd_vfio_realize(HostIOMMUDevice *hiod, void *opaque,
     HostIOMMUDeviceCaps *caps = &hiod->caps;
     VendorCaps *vendor_caps = &caps->vendor_caps;
     enum iommu_hw_info_type type;
+    uint8_t max_pasid_log2;
     uint64_t hw_caps;
 
     hiod->agent = opaque;
 
     if (!iommufd_backend_get_device_info(vdev->iommufd, vdev->devid, &type,
                                          vendor_caps, sizeof(*vendor_caps),
-                                         &hw_caps, errp)) {
+                                         &hw_caps, &max_pasid_log2, errp)) {
         return false;
     }
 
     hiod->name = g_strdup(vdev->name);
     caps->type = type;
     caps->hw_caps = hw_caps;
+    caps->max_pasid_log2 = max_pasid_log2;
 
     idev = HOST_IOMMU_DEVICE_IOMMUFD(hiod);
     idev->iommufd = vdev->iommufd;
diff --git a/include/system/host_iommu_device.h b/include/system/host_iommu_device.h
index ab849a4a82..bfb2b60478 100644
--- a/include/system/host_iommu_device.h
+++ b/include/system/host_iommu_device.h
@@ -30,6 +30,8 @@ typedef union VendorCaps {
  * @hw_caps: host platform IOMMU capabilities (e.g. on IOMMUFD this represents
  *           the @out_capabilities value returned from IOMMU_GET_HW_INFO ioctl)
  *
+ * @max_pasid_log2: width of PASIDs supported by host IOMMU device
+ *
  * @vendor_caps: host platform IOMMU vendor specific capabilities (e.g. on
  *               IOMMUFD this represents a user-space buffer filled by kernel
  *               with host IOMMU @type specific hardware information data)
@@ -37,6 +39,7 @@ typedef union VendorCaps {
 typedef struct HostIOMMUDeviceCaps {
     uint32_t type;
     uint64_t hw_caps;
+    uint8_t max_pasid_log2;
     VendorCaps vendor_caps;
 } HostIOMMUDeviceCaps;
 #endif
diff --git a/include/system/iommufd.h b/include/system/iommufd.h
index 41e216c677..aa78bf1e1d 100644
--- a/include/system/iommufd.h
+++ b/include/system/iommufd.h
@@ -71,7 +71,8 @@ int iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas_id,
                               hwaddr iova, uint64_t size);
 bool iommufd_backend_get_device_info(IOMMUFDBackend *be, uint32_t devid,
                                      uint32_t *type, void *data, uint32_t len,
-                                     uint64_t *caps, Error **errp);
+                                     uint64_t *caps, uint8_t *max_pasid_log2,
+                                     Error **errp);
 bool iommufd_backend_alloc_hwpt(IOMMUFDBackend *be, uint32_t dev_id,
                                 uint32_t pt_id, uint32_t flags,
                                 uint32_t data_type, uint32_t data_len,
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH v5 30/32] Extend get_cap() callback to support PASID
  2025-10-31 10:49 [PATCH v5 00/32] hw/arm/virt: Add support for user-creatable accelerated SMMUv3 Shameer Kolothum
                   ` (28 preceding siblings ...)
  2025-10-31 10:50 ` [PATCH v5 29/32] backends/iommufd: Retrieve PASID width from iommufd_backend_get_device_info() Shameer Kolothum
@ 2025-10-31 10:50 ` Shameer Kolothum
  2025-11-03 14:58   ` Jonathan Cameron via
  2025-11-06  8:45   ` Eric Auger
  2025-10-31 10:50 ` [PATCH v5 31/32] vfio: Synthesize vPASID capability to VM Shameer Kolothum
  2025-10-31 10:50 ` [PATCH v5 32/32] hw/arm/smmuv3-accel: Add support for PASID enable Shameer Kolothum
  31 siblings, 2 replies; 148+ messages in thread
From: Shameer Kolothum @ 2025-10-31 10:50 UTC (permalink / raw)
  To: qemu-arm, qemu-devel
  Cc: eric.auger, peter.maydell, jgg, nicolinc, ddutile, berrange,
	nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
	jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu, kjaju

Modify the get_cap() callback so that it returns the capability value
via an output uint64_t parameter, and add support for querying the
generic IOMMU HW capability info and max_pasid_log2 (PASID width).

Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
 backends/iommufd.c                 | 18 +++++++++++++++---
 hw/i386/intel_iommu.c              |  5 +++--
 hw/vfio/container-legacy.c         |  8 ++++++--
 include/system/host_iommu_device.h | 14 ++++++++++----
 4 files changed, 34 insertions(+), 11 deletions(-)

diff --git a/backends/iommufd.c b/backends/iommufd.c
index 6381f9664b..392f9cf2a8 100644
--- a/backends/iommufd.c
+++ b/backends/iommufd.c
@@ -523,19 +523,31 @@ bool host_iommu_device_iommufd_detach_hwpt(HostIOMMUDeviceIOMMUFD *idev,
     return idevc->detach_hwpt(idev, errp);
 }
 
-static int hiod_iommufd_get_cap(HostIOMMUDevice *hiod, int cap, Error **errp)
+static int hiod_iommufd_get_cap(HostIOMMUDevice *hiod, int cap,
+                                uint64_t *out_cap, Error **errp)
 {
     HostIOMMUDeviceCaps *caps = &hiod->caps;
 
+    g_assert(out_cap);
+
     switch (cap) {
     case HOST_IOMMU_DEVICE_CAP_IOMMU_TYPE:
-        return caps->type;
+        *out_cap = caps->type;
+        break;
     case HOST_IOMMU_DEVICE_CAP_AW_BITS:
-        return vfio_device_get_aw_bits(hiod->agent);
+        *out_cap = vfio_device_get_aw_bits(hiod->agent);
+        break;
+    case HOST_IOMMU_DEVICE_CAP_GENERIC_HW:
+        *out_cap = caps->hw_caps;
+        break;
+    case HOST_IOMMU_DEVICE_CAP_MAX_PASID_LOG2:
+        *out_cap = caps->max_pasid_log2;
+        break;
     default:
         error_setg(errp, "%s: unsupported capability %x", hiod->name, cap);
         return -EINVAL;
     }
+    return 0;
 }
 
 static void hiod_iommufd_class_init(ObjectClass *oc, const void *data)
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 6a168d5107..91d0d643ea 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -4444,6 +4444,7 @@ static bool vtd_check_hiod(IntelIOMMUState *s, HostIOMMUDevice *hiod,
                            Error **errp)
 {
     HostIOMMUDeviceClass *hiodc = HOST_IOMMU_DEVICE_GET_CLASS(hiod);
+    uint64_t out_cap;
     int ret;
 
     if (!hiodc->get_cap) {
@@ -4452,11 +4453,11 @@ static bool vtd_check_hiod(IntelIOMMUState *s, HostIOMMUDevice *hiod,
     }
 
     /* Common checks */
-    ret = hiodc->get_cap(hiod, HOST_IOMMU_DEVICE_CAP_AW_BITS, errp);
+    ret = hiodc->get_cap(hiod, HOST_IOMMU_DEVICE_CAP_AW_BITS, &out_cap, errp);
     if (ret < 0) {
         return false;
     }
-    if (s->aw_bits > ret) {
+    if (s->aw_bits > out_cap) {
         error_setg(errp, "aw-bits %d > host aw-bits %d", s->aw_bits, ret);
         return false;
     }
diff --git a/hw/vfio/container-legacy.c b/hw/vfio/container-legacy.c
index a3615d7b5d..ac8370bd4b 100644
--- a/hw/vfio/container-legacy.c
+++ b/hw/vfio/container-legacy.c
@@ -1197,15 +1197,19 @@ static bool hiod_legacy_vfio_realize(HostIOMMUDevice *hiod, void *opaque,
 }
 
 static int hiod_legacy_vfio_get_cap(HostIOMMUDevice *hiod, int cap,
-                                    Error **errp)
+                                    uint64_t *out_cap, Error **errp)
 {
+    g_assert(out_cap);
+
     switch (cap) {
     case HOST_IOMMU_DEVICE_CAP_AW_BITS:
-        return vfio_device_get_aw_bits(hiod->agent);
+        *out_cap = vfio_device_get_aw_bits(hiod->agent);
+        break;
     default:
         error_setg(errp, "%s: unsupported capability %x", hiod->name, cap);
         return -EINVAL;
     }
+    return 0;
 }
 
 static GList *
diff --git a/include/system/host_iommu_device.h b/include/system/host_iommu_device.h
index bfb2b60478..f89dbafd9e 100644
--- a/include/system/host_iommu_device.h
+++ b/include/system/host_iommu_device.h
@@ -94,13 +94,15 @@ struct HostIOMMUDeviceClass {
      *
      * @cap: capability to check.
      *
+     * @out_cap: 0 if a @cap is unsupported or else 1 or some positive
+     * value for some special @cap, i.e., HOST_IOMMU_DEVICE_CAP_AW_BITS.
+     *
      * @errp: pass an Error out when fails to query capability.
      *
-     * Returns: <0 on failure, 0 if a @cap is unsupported, or else
-     * 1 or some positive value for some special @cap,
-     * i.e., HOST_IOMMU_DEVICE_CAP_AW_BITS.
+     * Returns: <0 on failure, 0 on success.
      */
-    int (*get_cap)(HostIOMMUDevice *hiod, int cap, Error **errp);
+    int (*get_cap)(HostIOMMUDevice *hiod, int cap, uint64_t *out_cap,
+                   Error **errp);
     /**
      * @get_iova_ranges: Return the list of usable iova_ranges along with
      * @hiod Host IOMMU device
@@ -123,6 +125,10 @@ struct HostIOMMUDeviceClass {
  */
 #define HOST_IOMMU_DEVICE_CAP_IOMMU_TYPE        0
 #define HOST_IOMMU_DEVICE_CAP_AW_BITS           1
+/* Generic IOMMU HW capability info */
+#define HOST_IOMMU_DEVICE_CAP_GENERIC_HW        2
+/* PASID width */
+#define HOST_IOMMU_DEVICE_CAP_MAX_PASID_LOG2    3
 
 #define HOST_IOMMU_DEVICE_CAP_AW_BITS_MAX       64
 #endif
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH v5 31/32] vfio: Synthesize vPASID capability to VM
  2025-10-31 10:49 [PATCH v5 00/32] hw/arm/virt: Add support for user-creatable accelerated SMMUv3 Shameer Kolothum
                   ` (29 preceding siblings ...)
  2025-10-31 10:50 ` [PATCH v5 30/32] Extend get_cap() callback to support PASID Shameer Kolothum
@ 2025-10-31 10:50 ` Shameer Kolothum
  2025-11-03 15:00   ` Jonathan Cameron via
  2025-11-06 13:55   ` Eric Auger
  2025-10-31 10:50 ` [PATCH v5 32/32] hw/arm/smmuv3-accel: Add support for PASID enable Shameer Kolothum
  31 siblings, 2 replies; 148+ messages in thread
From: Shameer Kolothum @ 2025-10-31 10:50 UTC (permalink / raw)
  To: qemu-arm, qemu-devel
  Cc: eric.auger, peter.maydell, jgg, nicolinc, ddutile, berrange,
	nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
	jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu, kjaju

From: Yi Liu <yi.l.liu@intel.com>

If the user wants to expose the PASID capability in the vIOMMU, VFIO
also reports the PASID cap for this device, provided the underlying
hardware supports it as well.

As a start, this puts the vPASID cap in the last 8 bytes of the vconfig
space, in the hope that it does not conflict with any existing capability
or hidden registers. For devices that do have hidden registers, the user
should figure out a proper offset for the vPASID cap; this may require a
user-configurable option and is left here as a future extension. There is
more discussion on the mechanism for finding a proper offset at:

https://lore.kernel.org/kvm/BN9PR11MB5276318969A212AD0649C7BE8CBE2@BN9PR11MB5276.namprd11.prod.outlook.com/

Since we add a check to ensure the vIOMMU supports PASID, only devices
under those vIOMMUs can synthesize the vPASID capability. This gives
users control over which devices expose vPASID.

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
 hw/vfio/pci.c      | 37 +++++++++++++++++++++++++++++++++++++
 include/hw/iommu.h |  1 +
 2 files changed, 38 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 06b06afc2b..2054eac897 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -24,6 +24,7 @@
 #include <sys/ioctl.h>
 
 #include "hw/hw.h"
+#include "hw/iommu.h"
 #include "hw/pci/msi.h"
 #include "hw/pci/msix.h"
 #include "hw/pci/pci_bridge.h"
@@ -2500,7 +2501,12 @@ static int vfio_setup_rebar_ecap(VFIOPCIDevice *vdev, uint16_t pos)
 
 static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
 {
+    HostIOMMUDevice *hiod = vdev->vbasedev.hiod;
+    HostIOMMUDeviceClass *hiodc = HOST_IOMMU_DEVICE_GET_CLASS(hiod);
     PCIDevice *pdev = PCI_DEVICE(vdev);
+    uint64_t max_pasid_log2 = 0;
+    bool pasid_cap_added = false;
+    uint64_t hw_caps;
     uint32_t header;
     uint16_t cap_id, next, size;
     uint8_t cap_ver;
@@ -2578,12 +2584,43 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
                 pcie_add_capability(pdev, cap_id, cap_ver, next, size);
             }
             break;
+        case PCI_EXT_CAP_ID_PASID:
+             pasid_cap_added = true;
+             /* fallthrough */
         default:
             pcie_add_capability(pdev, cap_id, cap_ver, next, size);
         }
 
     }
 
+#ifdef CONFIG_IOMMUFD
+    /*
+     * Although we check for PCI_EXT_CAP_ID_PASID above, the Linux VFIO
+     * framework currently hides this capability. Try to retrieve it
+     * through alternative kernel interfaces (e.g. IOMMUFD APIs).
+     */
+    if (!pasid_cap_added && hiodc->get_cap) {
+        hiodc->get_cap(hiod, HOST_IOMMU_DEVICE_CAP_GENERIC_HW, &hw_caps, NULL);
+        hiodc->get_cap(hiod, HOST_IOMMU_DEVICE_CAP_MAX_PASID_LOG2,
+                       &max_pasid_log2, NULL);
+    }
+
+    /*
+     * If supported, adds the PASID capability in the end of the PCIe config
+     * space. TODO: Add option for enabling pasid at a safe offset.
+     */
+    if (max_pasid_log2 && (pci_device_get_viommu_flags(pdev) &
+                           VIOMMU_FLAG_PASID_SUPPORTED)) {
+        bool exec_perm = (hw_caps & IOMMU_HW_CAP_PCI_PASID_EXEC) ? true : false;
+        bool priv_mod = (hw_caps & IOMMU_HW_CAP_PCI_PASID_PRIV) ? true : false;
+
+        pcie_pasid_init(pdev, PCIE_CONFIG_SPACE_SIZE - PCI_EXT_CAP_PASID_SIZEOF,
+                        max_pasid_log2, exec_perm, priv_mod);
+        /* PASID capability is fully emulated by QEMU */
+        memset(vdev->emulated_config_bits + pdev->exp.pasid_cap, 0xff, 8);
+    }
+#endif
+
     /* Cleanup chain head ID if necessary */
     if (pci_get_word(pdev->config + PCI_CONFIG_SPACE_SIZE) == 0xFFFF) {
         pci_set_word(pdev->config + PCI_CONFIG_SPACE_SIZE, 0);
diff --git a/include/hw/iommu.h b/include/hw/iommu.h
index 9b8bb94fc2..9635770bee 100644
--- a/include/hw/iommu.h
+++ b/include/hw/iommu.h
@@ -20,6 +20,7 @@
 enum viommu_flags {
     /* vIOMMU needs nesting parent HWPT to create nested HWPT */
     VIOMMU_FLAG_WANT_NESTING_PARENT = BIT_ULL(0),
+    VIOMMU_FLAG_PASID_SUPPORTED = BIT_ULL(1),
 };
 
 #endif /* HW_IOMMU_H */
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 148+ messages in thread

* [PATCH v5 32/32] hw/arm/smmuv3-accel: Add support for PASID enable
  2025-10-31 10:49 [PATCH v5 00/32] hw/arm/virt: Add support for user-creatable accelerated SMMUv3 Shameer Kolothum
                   ` (30 preceding siblings ...)
  2025-10-31 10:50 ` [PATCH v5 31/32] vfio: Synthesize vPASID capability to VM Shameer Kolothum
@ 2025-10-31 10:50 ` Shameer Kolothum
  2025-11-06 16:46   ` Eric Auger
  31 siblings, 1 reply; 148+ messages in thread
From: Shameer Kolothum @ 2025-10-31 10:50 UTC (permalink / raw)
  To: qemu-arm, qemu-devel
  Cc: eric.auger, peter.maydell, jgg, nicolinc, ddutile, berrange,
	nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
	jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu, kjaju

QEMU SMMUv3 currently forces SSID (Substream ID) to zero. One key use case
for accelerated mode is Shared Virtual Addressing (SVA), which requires
SSID support so the guest can maintain multiple context descriptors per
substream ID.

Provide an option for the user to enable PASID support. An SSIDSIZE of 16
is currently used as the default.

Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
 hw/arm/smmuv3-accel.c    | 23 ++++++++++++++++++++++-
 hw/arm/smmuv3-internal.h |  1 +
 hw/arm/smmuv3.c          | 10 +++++++++-
 include/hw/arm/smmuv3.h  |  1 +
 4 files changed, 33 insertions(+), 2 deletions(-)

diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
index caa4a1d82d..1f206be8e4 100644
--- a/hw/arm/smmuv3-accel.c
+++ b/hw/arm/smmuv3-accel.c
@@ -68,6 +68,12 @@ smmuv3_accel_check_hw_compatible(SMMUv3State *s,
         error_setg(errp, "Host SMMUv3 SIDSIZE not compatible");
         return false;
     }
+    /* If user enables PASID support(pasid=on), QEMU sets SSIDSIZE to 16 */
+    if (FIELD_EX32(info->idr[1], IDR1, SSIDSIZE) <
+                FIELD_EX32(s->idr[1], IDR1, SSIDSIZE)) {
+        error_setg(errp, "Host SMMUv3 SSIDSIZE not compatible");
+        return false;
+    }
 
     /* User can disable QEMU SMMUv3 Range Invalidation support */
     if (FIELD_EX32(info->idr[3], IDR3, RIL) !=
@@ -642,7 +648,14 @@ static uint64_t smmuv3_accel_get_viommu_flags(void *opaque)
      * The real HW nested support should be reported from host SMMUv3 and if
      * it doesn't, the nesting parent allocation will fail anyway in VFIO core.
      */
-    return VIOMMU_FLAG_WANT_NESTING_PARENT;
+    uint64_t flags = VIOMMU_FLAG_WANT_NESTING_PARENT;
+    SMMUState *bs = opaque;
+    SMMUv3State *s = ARM_SMMUV3(bs);
+
+    if (s->pasid) {
+        flags |= VIOMMU_FLAG_PASID_SUPPORTED;
+    }
+    return flags;
 }
 
 static const PCIIOMMUOps smmuv3_accel_ops = {
@@ -672,6 +685,14 @@ void smmuv3_accel_idr_override(SMMUv3State *s)
     if (s->oas == 48) {
         s->idr[5] = FIELD_DP32(s->idr[5], IDR5, OAS, SMMU_IDR5_OAS_48);
     }
+
+    /*
+     * By default QEMU SMMUv3 has no PASID(SSID) support. Update IDR1 if user
+     * has enabled it.
+     */
+    if (s->pasid) {
+        s->idr[1] = FIELD_DP32(s->idr[1], IDR1, SSIDSIZE, SMMU_IDR1_SSIDSIZE);
+    }
 }
 
 /* Based on SMMUv3 GBPA configuration, attach a corresponding HWPT */
diff --git a/hw/arm/smmuv3-internal.h b/hw/arm/smmuv3-internal.h
index cfc5897569..2e0d8d538b 100644
--- a/hw/arm/smmuv3-internal.h
+++ b/hw/arm/smmuv3-internal.h
@@ -81,6 +81,7 @@ REG32(IDR1,                0x4)
     FIELD(IDR1, ECMDQ,        31, 1)
 
 #define SMMU_IDR1_SIDSIZE 16
+#define SMMU_IDR1_SSIDSIZE 16
 #define SMMU_CMDQS   19
 #define SMMU_EVENTQS 19
 
diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index c4d28a3786..e1140fe087 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -611,7 +611,8 @@ static int decode_ste(SMMUv3State *s, SMMUTransCfg *cfg,
         }
     }
 
-    if (STE_S1CDMAX(ste) != 0) {
+    /* If pasid enabled, we report SSIDSIZE = 16 */
+    if (!FIELD_EX32(s->idr[1], IDR1, SSIDSIZE) && STE_S1CDMAX(ste) != 0) {
         qemu_log_mask(LOG_UNIMP,
                       "SMMUv3 does not support multiple context descriptors yet\n");
         goto bad_ste;
@@ -1966,6 +1967,10 @@ static bool smmu_validate_property(SMMUv3State *s, Error **errp)
             error_setg(errp, "OAS can only be set to 44 bits if accel=off");
             return false;
         }
+        if (s->pasid) {
+            error_setg(errp, "pasid can only be enabled if accel=on");
+            return false;
+        }
         return false;
     }
 
@@ -2098,6 +2103,7 @@ static const Property smmuv3_properties[] = {
     DEFINE_PROP_BOOL("ril", SMMUv3State, ril, true),
     DEFINE_PROP_BOOL("ats", SMMUv3State, ats, false),
     DEFINE_PROP_UINT8("oas", SMMUv3State, oas, 44),
+    DEFINE_PROP_BOOL("pasid", SMMUv3State, pasid, false),
 };
 
 static void smmuv3_instance_init(Object *obj)
@@ -2133,6 +2139,8 @@ static void smmuv3_class_init(ObjectClass *klass, const void *data)
     object_class_property_set_description(klass, "oas",
         "Specify Output Address Size (for accel =on). Supported values "
         "are 44 or 48 bits. Defaults to 44 bits");
+    object_class_property_set_description(klass, "pasid",
+        "Enable/disable PASID support (for accel=on)");
 }
 
 static int smmuv3_notify_flag_changed(IOMMUMemoryRegion *iommu,
diff --git a/include/hw/arm/smmuv3.h b/include/hw/arm/smmuv3.h
index e4226b66f3..ee0b5ed74f 100644
--- a/include/hw/arm/smmuv3.h
+++ b/include/hw/arm/smmuv3.h
@@ -71,6 +71,7 @@ struct SMMUv3State {
     bool ril;
     bool ats;
     uint8_t oas;
+    bool pasid;
 };
 
 typedef enum {
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 06/32] hw/arm/smmuv3-accel: Initialize shared system address space
  2025-10-31 10:49 ` [PATCH v5 06/32] hw/arm/smmuv3-accel: Initialize shared system address space Shameer Kolothum
@ 2025-10-31 21:10   ` Nicolin Chen
  2025-11-03 14:17     ` Shameer Kolothum
  2025-11-03 13:12   ` Jonathan Cameron via
  2025-11-03 13:39   ` Philippe Mathieu-Daudé
  2 siblings, 1 reply; 148+ messages in thread
From: Nicolin Chen @ 2025-10-31 21:10 UTC (permalink / raw)
  To: Shameer Kolothum
  Cc: qemu-arm, qemu-devel, eric.auger, peter.maydell, jgg, ddutile,
	berrange, nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
	jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu, kjaju

On Fri, Oct 31, 2025 at 10:49:39AM +0000, Shameer Kolothum wrote:
> To support accelerated SMMUv3 instances, introduce a shared system-wide
> AddressSpace (shared_as_sysmem) that aliases the global system memory.
> This shared AddressSpace will be used in a subsequent patch for all
> vfio-pci devices behind all accelerated SMMUv3 instances within a VM.
> 
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>

Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>

> +/*
> + * The root region aliases the global system memory, and shared_as_sysmem
> + * provides a shared Address Space referencing it. This Address Space is used
> + * by all vfio-pci devices behind all accelerated SMMUv3 instances within a VM.
> + */
> +MemoryRegion root;
> +MemoryRegion sysmem;
> +static AddressSpace *shared_as_sysmem;

static MemoryRegion root, sysmem; ?


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 12/32] hw/arm/smmuv3-accel: Add set/unset_iommu_device callback
  2025-10-31 10:49 ` [PATCH v5 12/32] hw/arm/smmuv3-accel: Add set/unset_iommu_device callback Shameer Kolothum
@ 2025-10-31 22:02   ` Nicolin Chen
  2025-10-31 22:08     ` Nicolin Chen
  2025-11-03 14:19     ` Shameer Kolothum
  0 siblings, 2 replies; 148+ messages in thread
From: Nicolin Chen @ 2025-10-31 22:02 UTC (permalink / raw)
  To: Shameer Kolothum
  Cc: qemu-arm, qemu-devel, eric.auger, peter.maydell, jgg, ddutile,
	berrange, nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
	jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu, kjaju

On Fri, Oct 31, 2025 at 10:49:45AM +0000, Shameer Kolothum wrote:
> +static bool
> +smmuv3_accel_dev_alloc_viommu(SMMUv3AccelDevice *accel_dev,
> +                              HostIOMMUDeviceIOMMUFD *idev, Error **errp)

Let's make it simply do alloc() on s_accel:

static bool smmuv3_accel_alloc_viommu(SMMUv3AccelState *s_accel,
                                      HostIOMMUDeviceIOMMUFD *idev,
                                      Error **errp)

Then..

> +    SMMUDevice *sdev = &accel_dev->sdev;
> +    SMMUState *bs = sdev->smmu;
> +    SMMUv3State *s = ARM_SMMUV3(bs);
> +    SMMUv3AccelState *s_accel = s->s_accel;

Drop these.

> +    if (s_accel->vsmmu) {
> +        accel_dev->vsmmu = s_accel->vsmmu;
> +        return true;
> +    }

And this.

> +static bool smmuv3_accel_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
> +                                          HostIOMMUDevice *hiod, Error **errp)
[...]
> +    if (!smmuv3_accel_dev_alloc_viommu(accel_dev, idev, errp)) {
> +        error_append_hint(errp, "Device 0x%x: Unable to alloc viommu", sid);
> +        return false;
> +    }

And here:

    if (!s_accel->vsmmu && !smmuv3_accel_alloc_viommu(s_accel, idev, errp)) {
        error_append_hint(errp, "Device 0x%x: Unable to alloc viommu", sid);
        return false;
    }

    accel_dev->idev = idev;
    accel_dev->vsmmu = s_accel->vsmmu;

Feels slightly cleaner.

> +/*
> + * Represents a virtual SMMU instance backed by an iommufd vIOMMU object.
> + * Holds references to the core iommufd vIOMMU object and to proxy HWPTs

I read "reference" as a pointer, yet...

> + * (bypass and abort) used for device attachment.
> + */
> +typedef struct SMMUViommu {
> +    IOMMUFDBackend *iommufd;
> +    IOMMUFDViommu viommu;
> +    uint32_t bypass_hwpt_id;
> +    uint32_t abort_hwpt_id;

...viommu is embedded by value and the two HWPTs are plain IDs.

So this would read more accurately as:

/*
 * Represents a virtual SMMU instance backed by an iommufd vIOMMU object.
 * Holds bypass and abort proxy HWPT ids used for device attachment.
 */

> +typedef struct SMMUv3AccelState {
> +    SMMUViommu *vsmmu;
> +} SMMUv3AccelState;

Hmm, maybe we don't need another layer of structure. Every access
or validation to s_accel is for s_accel->vsmmu.

e.g.
    if (!s_accel || !s_accel->vsmmu) {
could be
    if (!s_accel) {

So, let's try merging them into one. Feel free to keep whichever one
you prefer.

Nic


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 12/32] hw/arm/smmuv3-accel: Add set/unset_iommu_device callback
  2025-10-31 22:02   ` Nicolin Chen
@ 2025-10-31 22:08     ` Nicolin Chen
  2025-11-03 14:19     ` Shameer Kolothum
  1 sibling, 0 replies; 148+ messages in thread
From: Nicolin Chen @ 2025-10-31 22:08 UTC (permalink / raw)
  To: Shameer Kolothum
  Cc: qemu-arm, qemu-devel, eric.auger, peter.maydell, jgg, ddutile,
	berrange, nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
	jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu, kjaju

On Fri, Oct 31, 2025 at 03:02:08PM -0700, Nicolin Chen wrote:
> On Fri, Oct 31, 2025 at 10:49:45AM +0000, Shameer Kolothum wrote:
> > +static bool smmuv3_accel_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
> > +                                          HostIOMMUDevice *hiod, Error **errp)
> [...]
> > +    if (!smmuv3_accel_dev_alloc_viommu(accel_dev, idev, errp)) {
> > +        error_append_hint(errp, "Device 0x%x: Unable to alloc viommu", sid);
> > +        return false;
> > +    }
> 
> And here:
> 
>     if (!s_accel->vsmmu && !smmuv3_accel_alloc_viommu(s_accel, idev, errp)) {
>         error_append_hint(errp, "Device 0x%x: Unable to alloc viommu", sid);
>         return false;
>     }
> 
>     accel_dev->idev = idev;
>     accel_dev->vsmmu = s_accel->vsmmu;
> 
> Feels slightly cleaner.

Also, because we set accel_dev->vsmmu under the hood, we missed
"accel_dev->vsmmu = NULL" in the revert path and unset().


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 13/32] hw/arm/smmuv3-accel: Add nested vSTE install/uninstall support
  2025-10-31 10:49 ` [PATCH v5 13/32] hw/arm/smmuv3-accel: Add nested vSTE install/uninstall support Shameer Kolothum
@ 2025-10-31 23:52   ` Nicolin Chen
  2025-11-01  0:20     ` Nicolin Chen
  2025-11-03 15:11     ` Shameer Kolothum
  2025-11-04 11:05   ` Eric Auger
  1 sibling, 2 replies; 148+ messages in thread
From: Nicolin Chen @ 2025-10-31 23:52 UTC (permalink / raw)
  To: Shameer Kolothum
  Cc: qemu-arm, qemu-devel, eric.auger, peter.maydell, jgg, ddutile,
	berrange, nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
	jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu, kjaju

On Fri, Oct 31, 2025 at 10:49:46AM +0000, Shameer Kolothum wrote:
> +static bool
> +smmuv3_accel_alloc_vdev(SMMUv3AccelDevice *accel_dev, int sid, Error **errp)
> +{
> +    SMMUViommu *vsmmu = accel_dev->vsmmu;
> +    IOMMUFDVdev *vdev;
> +    uint32_t vdevice_id;
> +
> +    if (!accel_dev->idev || accel_dev->vdev) {
> +        return true;
> +    }

We probably don't need to check !accel_dev->idev. It should have
been blocked by its caller, which does block !accel_dev->vsmmu.
Once we fix the missing "accel_dev->vsmmu = NULL", it should work.

> +
> +    if (!iommufd_backend_alloc_vdev(vsmmu->iommufd, accel_dev->idev->devid,
> +                                    vsmmu->viommu.viommu_id, sid,
> +                                    &vdevice_id, errp)) {
> +            return false;
> +    }
> +    if (!host_iommu_device_iommufd_attach_hwpt(accel_dev->idev,
> +                                               vsmmu->bypass_hwpt_id, errp)) {
> +        iommufd_backend_free_id(vsmmu->iommufd, vdevice_id);
> +        return false;
> +    }

This should check SMMUEN bit?

Linux driver (as an example) seems to set CMDQEN and install all
the default bypass STEs, before SMMUEN=1.

In this case, the target hwpt here should follow guest's GBPA.

> +static bool
> +smmuv3_accel_dev_uninstall_nested_ste(SMMUv3AccelDevice *accel_dev, bool abort,
> +                                      Error **errp)
> +{
> +    HostIOMMUDeviceIOMMUFD *idev = accel_dev->idev;
> +    SMMUS1Hwpt *s1_hwpt = accel_dev->s1_hwpt;
> +    uint32_t hwpt_id;
> +
> +    if (!s1_hwpt || !accel_dev->vsmmu) {
> +        return true;
> +    }
> +
> +    if (abort) {
> +        hwpt_id = accel_dev->vsmmu->abort_hwpt_id;
> +    } else {
> +        hwpt_id = accel_dev->vsmmu->bypass_hwpt_id;
> +    }

This should probably check SMMUEN/GBPA as well.

Likely we need "enabled" and "gbpa_abort" flags in SMMUState.

> +static bool
> +smmuv3_accel_dev_install_nested_ste(SMMUv3AccelDevice *accel_dev,
> +                                    uint32_t data_type, uint32_t data_len,
> +                                    void *data, Error **errp)
> +{
> +    SMMUViommu *vsmmu = accel_dev->vsmmu;
> +    SMMUS1Hwpt *s1_hwpt = accel_dev->s1_hwpt;
> +    HostIOMMUDeviceIOMMUFD *idev = accel_dev->idev;
> +    uint32_t flags = 0;
> +
> +    if (!idev || !vsmmu) {
> +        error_setg(errp, "Device 0x%x has no associated IOMMU dev or vIOMMU",
> +                   smmu_get_sid(&accel_dev->sdev));
> +        return false;
> +    }
> +
> +    if (s1_hwpt) {
> +        if (!smmuv3_accel_dev_uninstall_nested_ste(accel_dev, true, errp)) {
> +            return false;
> +        }
> +    }

I think we could have some improvements here.

The current flow is:
    (attached to s1_hwpt1)
    attach to bypass/abort_hwpt // no issue though.
    free s1_hwpt1
    alloc s2_hwpt2
    attach to s2_hwpt2

It could have been a flow like replace() in the kernel:
    (attached to s1_hwpt1)
    alloc s2_hwpt2
    attach to s2_hwpt2 /* skipping bypass/abort */
    free s1_hwpt

> +smmuv3_accel_install_nested_ste(SMMUv3State *s, SMMUDevice *sdev, int sid,\
[...]
> +    config = STE_CONFIG(&ste);
> +    if (!STE_VALID(&ste) || !STE_CFG_S1_ENABLED(config)) {
> +        if (!smmuv3_accel_dev_uninstall_nested_ste(accel_dev,
> +                                                   STE_CFG_ABORT(config),

This smmuv3_accel_uninstall_nested_ste() feels a bit redundant now.

Perhaps we could try something like this:

#define accel_dev_to_smmuv3(dev) ARM_SMMUV3(&dev->sdev.smmu)

static bool smmuv3_accel_dev_install_nested_ste(SMMUv3AccelDevice *accel_dev,
                                                int sid, STE *ste)
{
    SMMUv3State *s = accel_dev_to_smmuv3(accel_dev);
    HostIOMMUDeviceIOMMUFD *idev = accel_dev->idev;
    uint32_t config = STE_CONFIG(ste);
    SMMUS1Hwpt *s1_hwpt = NULL;
    uint64_t ste_0, ste_1;
    uint32_t hwpt_id = 0;

    if (!s->enabled) {
        if (s->gbpa_abort) {
            hwpt_id = accel_dev->vsmmu->abort_hwpt_id;
        } else {
            hwpt_id = accel_dev->vsmmu->bypass_hwpt_id;
        }
    } else {
        if (!STE_VALID(ste) || STE_CFG_ABORT(config)) {
            hwpt_id = accel_dev->vsmmu->abort_hwpt_id;
        } else if (STE_CFG_BYPASS(config)) {
            hwpt_id = accel_dev->vsmmu->bypass_hwpt_id;
        } else {
            // FIXME handle STE_CFG_S2_ENABLED()
        }
    }

    if (!hwpt_id) {
        uint64_t ste_0 = (uint64_t)ste->word[0] | (uint64_t)ste->word[1] << 32;
        uint64_t ste_1 = (uint64_t)ste->word[2] | (uint64_t)ste->word[3] << 32;
        struct iommu_hwpt_arm_smmuv3 nested_data = {
            .ste[2] = {
                cpu_to_le64(ste_0 & STE0_MASK),
                cpu_to_le64(ste_1 & STE1_MASK),
            },
        };

        trace_smmuv3_accel_install_nested_ste(sid, nested_data.ste[1],
                                              nested_data.ste[0]);
        s1_hwpt = g_new0(SMMUS1Hwpt, 1);
	[...]
	iommufd_backend_alloc_hwpt(..., &s1_hwpt->hwpt_id);
        hwpt_id = s1_hwpt->hwpt_id;
    }

    host_iommu_device_iommufd_attach_hwpt(.., hwpt_id);

    if (accel_dev->s1_hwpt) {
        iommufd_backend_free_id(idev->iommufd, accel_dev->s1_hwpt->hwpt_id);
    }
    accel_dev->s1_hwpt = s1_hwpt;
    return true;
}

> +bool smmuv3_accel_install_nested_ste_range(SMMUv3State *s, SMMUSIDRange *range,
> +                                           Error **errp)
> +{
> +    SMMUv3AccelState *s_accel = s->s_accel;
> +    SMMUv3AccelDevice *accel_dev;
> +
> +    if (!s_accel || !s_accel->vsmmu) {
> +        return true;
> +    }
> +
> +    QLIST_FOREACH(accel_dev, &s_accel->vsmmu->device_list, next) {
> +        uint32_t sid = smmu_get_sid(&accel_dev->sdev);
> +
> +        if (sid >= range->start && sid <= range->end) {
> +            if (!smmuv3_accel_install_nested_ste(s, &accel_dev->sdev,
> +                                                 sid, errp)) {
> +                return false;
> +            }
> +        }

This is a bit tricky..

I think CFGI_STE_RANGE shouldn't stop in the middle if one of the
STEs fails.

That being said, HW doesn't seem to propagate C_BAD_STE during a
CFGI_STE or CFGI_STE_RANGE, IIUIC. It reports a C_BAD_STE event when
a transaction starts. If we want to perfectly mimic the hardware,
we'd have to push the bad STE down to the HW, which would trigger a
C_BAD_STE vevent to be forwarded via the vEVENTQ.
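
For illustration, the "don't stop in the middle" idea could look roughly
like this (names and list layout follow the quoted patch; this is only a
sketch of the suggestion, not a tested change):

    bool ok = true;

    QLIST_FOREACH(accel_dev, &s_accel->vsmmu->device_list, next) {
        uint32_t sid = smmu_get_sid(&accel_dev->sdev);
        Error *local_err = NULL;

        if (sid < range->start || sid > range->end) {
            continue;
        }
        if (!smmuv3_accel_install_nested_ste(s, &accel_dev->sdev, sid,
                                             &local_err)) {
            /* Remember the failure, but keep walking the SID range */
            error_report_err(local_err);
            ok = false;
        }
    }
    return ok;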

Nicolin


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 16/32] hw/arm/smmuv3-accel: Make use of get_msi_address_space() callback
  2025-10-31 10:49 ` [PATCH v5 16/32] hw/arm/smmuv3-accel: Make use of " Shameer Kolothum
@ 2025-10-31 23:57   ` Nicolin Chen
  2025-11-03 15:19     ` Shameer Kolothum
  0 siblings, 1 reply; 148+ messages in thread
From: Nicolin Chen @ 2025-10-31 23:57 UTC (permalink / raw)
  To: Shameer Kolothum
  Cc: qemu-arm, qemu-devel, eric.auger, peter.maydell, jgg, ddutile,
	berrange, nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
	jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu, kjaju

On Fri, Oct 31, 2025 at 10:49:49AM +0000, Shameer Kolothum wrote:
> +static AddressSpace *smmuv3_accel_get_msi_as(PCIBus *bus, void *opaque,
> +                                             int devfn)
> +{
> +    SMMUState *bs = opaque;
> +    SMMUPciBus *sbus = smmu_get_sbus(bs, bus);
> +    SMMUv3AccelDevice *accel_dev = smmuv3_accel_get_dev(bs, sbus, bus, devfn);
> +    SMMUDevice *sdev = &accel_dev->sdev;
> +
> +    /*
> +     * If the assigned vfio-pci dev has S1 translation enabled by Guest,
> +     * return IOMMU address space for MSI translation. Otherwise, return
> +     * system address space.
> +     */
> +    if (accel_dev->s1_hwpt) {
> +        return &sdev->as;
> +    } else {
> +        return &address_space_memory;

Should we use the global shared_as? Or is this on purpose to align
with the "&address_space_memory" in kvm_arch_fixup_msi_route()?

Nicolin


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 13/32] hw/arm/smmuv3-accel: Add nested vSTE install/uninstall support
  2025-10-31 23:52   ` Nicolin Chen
@ 2025-11-01  0:20     ` Nicolin Chen
  2025-11-03 15:11     ` Shameer Kolothum
  1 sibling, 0 replies; 148+ messages in thread
From: Nicolin Chen @ 2025-11-01  0:20 UTC (permalink / raw)
  To: Shameer Kolothum
  Cc: qemu-arm, qemu-devel, eric.auger, peter.maydell, jgg, ddutile,
	berrange, nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
	jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu, kjaju

On Fri, Oct 31, 2025 at 04:52:50PM -0700, Nicolin Chen wrote:
> On Fri, Oct 31, 2025 at 10:49:46AM +0000, Shameer Kolothum wrote:
> > +smmuv3_accel_install_nested_ste(SMMUv3State *s, SMMUDevice *sdev, int sid,\
> [...]
> > +    config = STE_CONFIG(&ste);
> > +    if (!STE_VALID(&ste) || !STE_CFG_S1_ENABLED(config)) {
> > +        if (!smmuv3_accel_dev_uninstall_nested_ste(accel_dev,
> > +                                                   STE_CFG_ABORT(config),
> 
> This smmuv3_accel_uninstall_nested_ste() feels a bit redundant now.
> 
> Perhaps we could try something like this:
> 
> #define accel_dev_to_smmuv3(dev) ARM_SMMUV3(&dev->sdev.smmu)

Oops. This should be:

#define accel_dev_to_smmuv3(dev) ARM_SMMUV3((SMMUState *)dev->sdev.smmu)

But it doesn't seem to be very useful anyway. So, feel free to
keep it inline or just drop it :)

Nicolin


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 18/32] hw/arm/smmuv3: Initialize ID registers early during realize()
  2025-10-31 10:49 ` [PATCH v5 18/32] hw/arm/smmuv3: Initialize ID registers early during realize() Shameer Kolothum
@ 2025-11-01  0:24   ` Nicolin Chen
  2025-11-03 13:57   ` Jonathan Cameron via
  2025-11-03 15:11   ` Eric Auger
  2 siblings, 0 replies; 148+ messages in thread
From: Nicolin Chen @ 2025-11-01  0:24 UTC (permalink / raw)
  To: Shameer Kolothum
  Cc: qemu-arm, qemu-devel, eric.auger, peter.maydell, jgg, ddutile,
	berrange, nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
	jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu, kjaju

On Fri, Oct 31, 2025 at 10:49:51AM +0000, Shameer Kolothum wrote:
> Factor out ID register init into smmuv3_init_id_regs() and call it from
> realize(). This ensures ID registers are initialized early for use in the
> accelerated SMMUv3 path and will be utilized in subsequent patch.
> 
> Other registers remain initialized in smmuv3_reset().
> 
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>

Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 17/32] hw/arm/smmuv3-accel: Add support to issue invalidation cmd to host
  2025-10-31 10:49 ` [PATCH v5 17/32] hw/arm/smmuv3-accel: Add support to issue invalidation cmd to host Shameer Kolothum
@ 2025-11-01  0:35   ` Nicolin Chen via
  2025-11-03 15:28     ` Shameer Kolothum
  2025-11-03 17:11   ` Eric Auger
  1 sibling, 1 reply; 148+ messages in thread
From: Nicolin Chen via @ 2025-11-01  0:35 UTC (permalink / raw)
  To: Shameer Kolothum
  Cc: qemu-arm, qemu-devel, eric.auger, peter.maydell, jgg, ddutile,
	berrange, nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
	jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu, kjaju

On Fri, Oct 31, 2025 at 10:49:50AM +0000, Shameer Kolothum wrote:
> Provide a helper and use that to issue the invalidation cmd to host SMMUv3.
> We only issue one cmd at a time for now.
> 
> Support for batching of commands will be added later after analysing the
> impact.
> 
> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>

I think I have given my tag in v4.. anyway..

Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>

>          case SMMU_CMD_TLBI_NH_VAA:
>          case SMMU_CMD_TLBI_NH_VA:
> +        {
> +            Error *local_err = NULL;
> +
>              if (!STAGE1_SUPPORTED(s)) {
>                  cmd_error = SMMU_CERROR_ILL;
>                  break;
>              }
>              smmuv3_range_inval(bs, &cmd, SMMU_STAGE_1);
> +            if (!smmuv3_accel_issue_inv_cmd(s, &cmd, NULL, &local_err)) {
> +                error_report_err(local_err);
> +                cmd_error = SMMU_CERROR_ILL;
> +                break;
> +            }
>              break;
> +        }

The local_err isn't used anywhere except by error_report_err() itself,
so it could be moved into smmuv3_accel_issue_inv_cmd().
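
i.e. the caller would shrink to something like this (sketch only; this
assumes the helper drops its errp parameter and calls error_report_err()
on its own local Error before returning false):

            smmuv3_range_inval(bs, &cmd, SMMU_STAGE_1);
            if (!smmuv3_accel_issue_inv_cmd(s, &cmd, NULL)) {
                cmd_error = SMMU_CERROR_ILL;
                break;
            }
            break;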

Nicolin


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 19/32] hw/arm/smmuv3-accel: Get host SMMUv3 hw info and validate
  2025-10-31 10:49 ` [PATCH v5 19/32] hw/arm/smmuv3-accel: Get host SMMUv3 hw info and validate Shameer Kolothum
@ 2025-11-01  0:49   ` Nicolin Chen
  2025-11-01 14:20   ` Zhangfei Gao
  2025-11-03 14:47   ` Jonathan Cameron via
  2 siblings, 0 replies; 148+ messages in thread
From: Nicolin Chen @ 2025-11-01  0:49 UTC (permalink / raw)
  To: Shameer Kolothum
  Cc: qemu-arm, qemu-devel, eric.auger, peter.maydell, jgg, ddutile,
	berrange, nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
	jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu, kjaju

On Fri, Oct 31, 2025 at 10:49:52AM +0000, Shameer Kolothum wrote:
> Just before the device gets attached to the SMMUv3, make sure QEMU SMMUv3
> features are compatible with the host SMMUv3.
> 
> Not all fields in the host SMMUv3 IDR registers are meaningful for userspace.
> Only the following fields can be used:
> 
>   - IDR0: ST_LEVEL, TERM_MODEL, STALL_MODEL, TTENDIAN, CD2L, ASID16, TTF
>   - IDR1: SIDSIZE, SSIDSIZE
>   - IDR3: BBML, RIL
>   - IDR5: VAX, GRAN64K, GRAN16K, GRAN4K
> 
> For now, the check is to make sure the features are in sync to enable
> basic accelerated SMMUv3 support.

Note that SSIDSIZE will be added in the follow-up PASID support.

> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>

Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>

> diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
> index a2deda3c32..8b9f88dd8e 100644
> --- a/hw/arm/smmuv3-accel.c
> +++ b/hw/arm/smmuv3-accel.c
> @@ -28,6 +28,98 @@ MemoryRegion root;
>  MemoryRegion sysmem;
>  static AddressSpace *shared_as_sysmem;
>  
> +static bool
> +smmuv3_accel_check_hw_compatible(SMMUv3State *s,

Maybe rename to:
    SMMUv3State *smmu

then...

> +                                 struct iommu_hw_info_arm_smmuv3 *info,
> +                                 Error **errp)
> +{
> +    /* QEMU SMMUv3 supports both linear and 2-level stream tables */
> +    if (FIELD_EX32(info->idr[0], IDR0, STLEVEL) !=
> +                FIELD_EX32(s->idr[0], IDR0, STLEVEL)) {

this looks nicer:

    if (FIELD_EX32(info->idr[0], IDR0, STLEVEL) !=
        FIELD_EX32(smmu->idr[0], IDR0, STLEVEL)) {

> +static bool
> +smmuv3_accel_hw_compatible(SMMUv3State *s, HostIOMMUDeviceIOMMUFD *idev,
> +                           Error **errp)
> +{
> +    struct iommu_hw_info_arm_smmuv3 info;
> +    uint32_t data_type;
> +    uint64_t caps;
> +
> +    if (!iommufd_backend_get_device_info(idev->iommufd, idev->devid, &data_type,
> +                                         &info, sizeof(info), &caps, errp)) {
> +        return false;
> +    }
> +
> +    if (data_type != IOMMU_HW_INFO_TYPE_ARM_SMMUV3) {
> +        error_setg(errp, "Wrong data type (%d) for Host SMMUv3 device info",
> +                     data_type);
> +        return false;
> +    }
> +
> +    if (!smmuv3_accel_check_hw_compatible(s, &info, errp)) {

Nit: it doesn't seem to be necessary to have a wrapper?

Nicolin


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 19/32] hw/arm/smmuv3-accel: Get host SMMUv3 hw info and validate
  2025-10-31 10:49 ` [PATCH v5 19/32] hw/arm/smmuv3-accel: Get host SMMUv3 hw info and validate Shameer Kolothum
  2025-11-01  0:49   ` Nicolin Chen
@ 2025-11-01 14:20   ` Zhangfei Gao
  2025-11-03 15:42     ` Shameer Kolothum
  2025-11-03 14:47   ` Jonathan Cameron via
  2 siblings, 1 reply; 148+ messages in thread
From: Zhangfei Gao @ 2025-11-01 14:20 UTC (permalink / raw)
  To: Shameer Kolothum
  Cc: qemu-arm, qemu-devel, eric.auger, peter.maydell, jgg, nicolinc,
	ddutile, berrange, nathanc, mochs, smostafa, wangzhou1,
	jiangkunkun, jonathan.cameron, zhenzhong.duan, yi.l.liu, kjaju

Hi, Shameer

On Fri, 31 Oct 2025 at 18:54, Shameer Kolothum <skolothumtho@nvidia.com> wrote:
>
> Just before the device gets attached to the SMMUv3, make sure QEMU SMMUv3
> features are compatible with the host SMMUv3.
>
> Not all fields in the host SMMUv3 IDR registers are meaningful for userspace.
> Only the following fields can be used:
>
>   - IDR0: ST_LEVEL, TERM_MODEL, STALL_MODEL, TTENDIAN, CD2L, ASID16, TTF
>   - IDR1: SIDSIZE, SSIDSIZE
>   - IDR3: BBML, RIL
>   - IDR5: VAX, GRAN64K, GRAN16K, GRAN4K
>
> For now, the check is to make sure the features are in sync to enable
> basic accelerated SMMUv3 support.
>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
> ---
>  hw/arm/smmuv3-accel.c | 100 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 100 insertions(+)
>
> diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
> index a2deda3c32..8b9f88dd8e 100644
> --- a/hw/arm/smmuv3-accel.c
> +++ b/hw/arm/smmuv3-accel.c
> @@ -28,6 +28,98 @@ MemoryRegion root;
>  MemoryRegion sysmem;
>  static AddressSpace *shared_as_sysmem;
>
> +static bool
> +smmuv3_accel_check_hw_compatible(SMMUv3State *s,
> +                                 struct iommu_hw_info_arm_smmuv3 *info,
> +                                 Error **errp)
> +{

> +    /* QEMU SMMUv3 supports architecture version 3.1 */
> +    if (info->aidr < s->aidr) {
> +        error_setg(errp, "Host SMMUv3 architecture version not compatible");
> +        return false;
> +    }

Why do we have this requirement?
We have an SMMUv3 at architecture version 3.0, where info->aidr = 0,
and QEMU fails to boot here.

Thanks


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 06/32] hw/arm/smmuv3-accel: Initialize shared system address space
  2025-10-31 10:49 ` [PATCH v5 06/32] hw/arm/smmuv3-accel: Initialize shared system address space Shameer Kolothum
  2025-10-31 21:10   ` Nicolin Chen
@ 2025-11-03 13:12   ` Jonathan Cameron via
  2025-11-03 15:53     ` Shameer Kolothum
  2025-11-03 13:39   ` Philippe Mathieu-Daudé
  2 siblings, 1 reply; 148+ messages in thread
From: Jonathan Cameron via @ 2025-11-03 13:12 UTC (permalink / raw)
  To: Shameer Kolothum
  Cc: qemu-arm, qemu-devel, eric.auger, peter.maydell, jgg, nicolinc,
	ddutile, berrange, nathanc, mochs, smostafa, wangzhou1,
	jiangkunkun, zhangfei.gao, zhenzhong.duan, yi.l.liu, kjaju

On Fri, 31 Oct 2025 10:49:39 +0000
Shameer Kolothum <skolothumtho@nvidia.com> wrote:

> To support accelerated SMMUv3 instances, introduce a shared system-wide
> AddressSpace (shared_as_sysmem) that aliases the global system memory.
> This shared AddressSpace will be used in a subsequent patch for all
> vfio-pci devices behind all accelerated SMMUv3 instances within a VM.
No problem with the patch, but perhaps this description could mention
something about 'why' this address space is a useful thing to have?

> 
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>



^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 07/32] hw/pci/pci: Move pci_init_bus_master() after adding device to bus
  2025-10-31 10:49 ` [PATCH v5 07/32] hw/pci/pci: Move pci_init_bus_master() after adding device to bus Shameer Kolothum
@ 2025-11-03 13:24   ` Jonathan Cameron via
  2025-11-03 16:40   ` Eric Auger
  1 sibling, 0 replies; 148+ messages in thread
From: Jonathan Cameron via @ 2025-11-03 13:24 UTC (permalink / raw)
  To: Shameer Kolothum
  Cc: qemu-arm, qemu-devel, eric.auger, peter.maydell, jgg, nicolinc,
	ddutile, berrange, nathanc, mochs, smostafa, wangzhou1,
	jiangkunkun, zhangfei.gao, zhenzhong.duan, yi.l.liu, kjaju

On Fri, 31 Oct 2025 10:49:40 +0000
Shameer Kolothum <skolothumtho@nvidia.com> wrote:

> During PCI hotplug, in do_pci_register_device(), pci_init_bus_master()
> is called before storing the pci_dev pointer in bus->devices[devfn].
> 
> This causes a problem if pci_init_bus_master() (via its
> get_address_space() callback) attempts to retrieve the device using
> pci_find_device(), since the PCI device is not yet visible on the bus.
> 
> Fix this by moving the pci_init_bus_master() call to after the device
> has been added to bus->devices[devfn].
> 
> This prepares for a subsequent patch where the accel SMMUv3
> get_address_space() callback retrieves the pci_dev to identify the
> attached device type.
> 
> No functional change intended.
> 
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
Seems harmless.

Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>

> ---
>  hw/pci/pci.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index c9932c87e3..9693d7f10c 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -1370,9 +1370,6 @@ static PCIDevice *do_pci_register_device(PCIDevice *pci_dev,
>      pci_dev->bus_master_as.max_bounce_buffer_size =
>          pci_dev->max_bounce_buffer_size;
>  
> -    if (phase_check(PHASE_MACHINE_READY)) {
> -        pci_init_bus_master(pci_dev);
> -    }
>      pci_dev->irq_state = 0;
>      pci_config_alloc(pci_dev);
>  
> @@ -1416,6 +1413,9 @@ static PCIDevice *do_pci_register_device(PCIDevice *pci_dev,
>      pci_dev->config_write = config_write;
>      bus->devices[devfn] = pci_dev;
>      pci_dev->version_id = 2; /* Current pci device vmstate version */
> +    if (phase_check(PHASE_MACHINE_READY)) {
> +        pci_init_bus_master(pci_dev);
> +    }
>      return pci_dev;
>  }
>  



^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 08/32] hw/pci/pci: Add optional supports_address_space() callback
  2025-10-31 10:49 ` [PATCH v5 08/32] hw/pci/pci: Add optional supports_address_space() callback Shameer Kolothum
@ 2025-11-03 13:30   ` Jonathan Cameron via
  2025-11-03 16:47   ` Eric Auger
  1 sibling, 0 replies; 148+ messages in thread
From: Jonathan Cameron via @ 2025-11-03 13:30 UTC (permalink / raw)
  To: Shameer Kolothum
  Cc: qemu-arm, qemu-devel, eric.auger, peter.maydell, jgg, nicolinc,
	ddutile, berrange, nathanc, mochs, smostafa, wangzhou1,
	jiangkunkun, zhangfei.gao, zhenzhong.duan, yi.l.liu, kjaju

On Fri, 31 Oct 2025 10:49:41 +0000
Shameer Kolothum <skolothumtho@nvidia.com> wrote:

> Introduce an optional supports_address_space() callback in PCIIOMMUOps to
> allow a vIOMMU implementation to reject devices that should not be attached
> to it.
> 
> Currently, get_address_space() is the first and mandatory callback into the
> vIOMMU layer, which always returns an address space. For certain setups, such
> as hardware accelerated vIOMMUs (e.g. ARM SMMUv3 with accel=on), attaching
> emulated endpoint devices is undesirable as it may impact the behavior or
> performance of VFIO passthrough devices, for example, by triggering
> unnecessary invalidations on the host IOMMU.
> 
> The new callback allows a vIOMMU to check and reject unsupported devices
> early during PCI device registration.
LGTM
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>

> 
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
> ---
>  hw/pci/pci.c         | 20 ++++++++++++++++++++
>  include/hw/pci/pci.h | 17 +++++++++++++++++
>  2 files changed, 37 insertions(+)
> 
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index 9693d7f10c..fa9cf5dab2 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -135,6 +135,21 @@ static void pci_set_master(PCIDevice *d, bool enable)
>      d->is_master = enable; /* cache the status */
>  }
>  
> +static bool
> +pci_device_supports_iommu_address_space(PCIDevice *dev, Error **errp)
> +{
> +    PCIBus *bus;
> +    PCIBus *iommu_bus;
> +    int devfn;
> +
> +    pci_device_get_iommu_bus_devfn(dev, &iommu_bus, &bus, &devfn);
> +    if (iommu_bus && iommu_bus->iommu_ops->supports_address_space) {
> +        return iommu_bus->iommu_ops->supports_address_space(bus,
> +                                iommu_bus->iommu_opaque, devfn, errp);
> +    }
> +    return true;
> +}
> +
>  static void pci_init_bus_master(PCIDevice *pci_dev)
>  {
>      AddressSpace *dma_as = pci_device_iommu_address_space(pci_dev);
> @@ -1413,6 +1428,11 @@ static PCIDevice *do_pci_register_device(PCIDevice *pci_dev,
>      pci_dev->config_write = config_write;
>      bus->devices[devfn] = pci_dev;
>      pci_dev->version_id = 2; /* Current pci device vmstate version */
> +    if (!pci_device_supports_iommu_address_space(pci_dev, errp)) {
> +        do_pci_unregister_device(pci_dev);
> +        bus->devices[devfn] = NULL;
> +        return NULL;
> +    }
>      if (phase_check(PHASE_MACHINE_READY)) {
>          pci_init_bus_master(pci_dev);
>      }
> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> index cf99b5bb68..dfeba8c9bd 100644
> --- a/include/hw/pci/pci.h
> +++ b/include/hw/pci/pci.h
> @@ -417,6 +417,23 @@ typedef struct IOMMUPRINotifier {
>   * framework for a set of devices on a PCI bus.
>   */
>  typedef struct PCIIOMMUOps {
> +    /**
> +     * @supports_address_space: Optional pre-check to determine if a PCI
> +     * device can have an IOMMU address space.
> +     *
> +     * @bus: the #PCIBus being accessed.
> +     *
> +     * @opaque: the data passed to pci_setup_iommu().
> +     *
> +     * @devfn: device and function number.
> +     *
> +     * @errp: pass an Error out only when return false
> +     *
> +     * Returns: true if the device can be associated with an IOMMU address
> +     * space, false otherwise with errp set.
> +     */
> +    bool (*supports_address_space)(PCIBus *bus, void *opaque, int devfn,
> +                                   Error **errp);
>      /**
>       * @get_address_space: get the address space for a set of devices
>       * on a PCI bus.



^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 09/32] hw/pci-bridge/pci_expander_bridge: Move TYPE_PXB_PCIE_DEV to header
  2025-10-31 10:49 ` [PATCH v5 09/32] hw/pci-bridge/pci_expander_bridge: Move TYPE_PXB_PCIE_DEV to header Shameer Kolothum
@ 2025-11-03 13:30   ` Jonathan Cameron via
  2025-11-03 14:25   ` Eric Auger
  1 sibling, 0 replies; 148+ messages in thread
From: Jonathan Cameron via @ 2025-11-03 13:30 UTC (permalink / raw)
  To: Shameer Kolothum
  Cc: qemu-arm, qemu-devel, eric.auger, peter.maydell, jgg, nicolinc,
	ddutile, berrange, nathanc, mochs, smostafa, wangzhou1,
	jiangkunkun, zhangfei.gao, zhenzhong.duan, yi.l.liu, kjaju

On Fri, 31 Oct 2025 10:49:42 +0000
Shameer Kolothum <skolothumtho@nvidia.com> wrote:

> Move the TYPE_PXB_PCIE_DEV definition to header so that it can be
> referenced by other code in subsequent patch.
> 
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 06/32] hw/arm/smmuv3-accel: Initialize shared system address space
  2025-10-31 10:49 ` [PATCH v5 06/32] hw/arm/smmuv3-accel: Initialize shared system address space Shameer Kolothum
  2025-10-31 21:10   ` Nicolin Chen
  2025-11-03 13:12   ` Jonathan Cameron via
@ 2025-11-03 13:39   ` Philippe Mathieu-Daudé
  2025-11-03 16:30     ` Eric Auger
  2 siblings, 1 reply; 148+ messages in thread
From: Philippe Mathieu-Daudé @ 2025-11-03 13:39 UTC (permalink / raw)
  To: Shameer Kolothum, qemu-arm, qemu-devel
  Cc: eric.auger, peter.maydell, jgg, nicolinc, ddutile, berrange,
	nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
	jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu, kjaju,
	Peter Xu, Mark Cave-Ayland

Hi,

On 31/10/25 11:49, Shameer Kolothum wrote:
> To support accelerated SMMUv3 instances, introduce a shared system-wide
> AddressSpace (shared_as_sysmem) that aliases the global system memory.
> This shared AddressSpace will be used in a subsequent patch for all
> vfio-pci devices behind all accelerated SMMUv3 instances within a VM.
> 
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
> ---
>   hw/arm/smmuv3-accel.c | 27 +++++++++++++++++++++++++++
>   1 file changed, 27 insertions(+)
> 
> diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
> index 99ef0db8c4..f62b6cf2c9 100644
> --- a/hw/arm/smmuv3-accel.c
> +++ b/hw/arm/smmuv3-accel.c
> @@ -11,6 +11,15 @@
>   #include "hw/arm/smmuv3.h"
>   #include "smmuv3-accel.h"
>   
> +/*
> + * The root region aliases the global system memory, and shared_as_sysmem
> + * provides a shared Address Space referencing it. This Address Space is used
> + * by all vfio-pci devices behind all accelerated SMMUv3 instances within a VM.
> + */
> +MemoryRegion root;
> +MemoryRegion sysmem;

Why can't we store that in SMMUv3State?

> +static AddressSpace *shared_as_sysmem;

FYI we have object_resolve_type_unambiguous() to check whether an
instance exists only once (singleton).

> +
>   static SMMUv3AccelDevice *smmuv3_accel_get_dev(SMMUState *bs, SMMUPciBus *sbus,
>                                                  PCIBus *bus, int devfn)
>   {
> @@ -51,9 +60,27 @@ static const PCIIOMMUOps smmuv3_accel_ops = {
>       .get_address_space = smmuv3_accel_find_add_as,
>   };
>   
> +static void smmuv3_accel_as_init(SMMUv3State *s)
> +{
> +
> +    if (shared_as_sysmem) {
> +        return;
> +    }
> +
> +    memory_region_init(&root, OBJECT(s), "root", UINT64_MAX);
> +    memory_region_init_alias(&sysmem, OBJECT(s), "smmuv3-accel-sysmem",
> +                             get_system_memory(), 0,
> +                             memory_region_size(get_system_memory()));
> +    memory_region_add_subregion(&root, 0, &sysmem);
> +
> +    shared_as_sysmem = g_new0(AddressSpace, 1);
> +    address_space_init(shared_as_sysmem, &root, "smmuv3-accel-as-sysmem");
> +}
> +
>   void smmuv3_accel_init(SMMUv3State *s)
>   {
>       SMMUState *bs = ARM_SMMU(s);
>   
>       bs->iommu_ops = &smmuv3_accel_ops;
> +    smmuv3_accel_as_init(s);
>   }
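As a side note on the FYI above, typical usage of that helper looks roughly
like this (illustrative only, assuming the current
Object *object_resolve_type_unambiguous(const char *typename, Error **errp)
signature):

    /* fails (sets errp) unless exactly one instance of the type exists */
    Object *obj = object_resolve_type_unambiguous(TYPE_ARM_SMMUV3, errp);
    if (!obj) {
        return false;
    }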



^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 18/32] hw/arm/smmuv3: Initialize ID registers early during realize()
  2025-10-31 10:49 ` [PATCH v5 18/32] hw/arm/smmuv3: Initialize ID registers early during realize() Shameer Kolothum
  2025-11-01  0:24   ` Nicolin Chen
@ 2025-11-03 13:57   ` Jonathan Cameron via
  2025-11-03 15:11   ` Eric Auger
  2 siblings, 0 replies; 148+ messages in thread
From: Jonathan Cameron via @ 2025-11-03 13:57 UTC (permalink / raw)
  To: Shameer Kolothum
  Cc: qemu-arm, qemu-devel, eric.auger, peter.maydell, jgg, nicolinc,
	ddutile, berrange, nathanc, mochs, smostafa, wangzhou1,
	jiangkunkun, zhangfei.gao, zhenzhong.duan, yi.l.liu, kjaju

On Fri, 31 Oct 2025 10:49:51 +0000
Shameer Kolothum <skolothumtho@nvidia.com> wrote:

> Factor out ID register init into smmuv3_init_id_regs() and call it from
> realize(). This ensures ID registers are initialized early for use in the
> accelerated SMMUv3 path and will be utilized in a subsequent patch.
> 
> Other registers remain initialized in smmuv3_reset().
> 
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 20/32] hw/pci-host/gpex: Allow to generate preserve boot config DSM #5
  2025-10-31 10:49 ` [PATCH v5 20/32] hw/pci-host/gpex: Allow to generate preserve boot config DSM #5 Shameer Kolothum
@ 2025-11-03 13:58   ` Jonathan Cameron via
  0 siblings, 0 replies; 148+ messages in thread
From: Jonathan Cameron via @ 2025-11-03 13:58 UTC (permalink / raw)
  To: Shameer Kolothum
  Cc: qemu-arm, qemu-devel, eric.auger, peter.maydell, jgg, nicolinc,
	ddutile, berrange, nathanc, mochs, smostafa, wangzhou1,
	jiangkunkun, zhangfei.gao, zhenzhong.duan, yi.l.liu, kjaju

On Fri, 31 Oct 2025 10:49:53 +0000
Shameer Kolothum <skolothumtho@nvidia.com> wrote:

> From: Eric Auger <eric.auger@redhat.com>
> 
> Add a 'preserve_config' field in struct GPEXConfig and, if set, generate
> the _DSM function #5 for preserving PCI boot configurations.
> 
> This will be used for SMMUv3 accel=on support in a subsequent patch. When
> SMMUv3 acceleration (accel=on) is enabled, QEMU exposes IORT Reserved
> Memory Region (RMR) nodes to support MSI doorbell translations. As per
> the Arm IORT specification, using IORT RMRs mandates the presence of
> _DSM function #5 so that the OS retains the firmware-assigned PCI
> configuration. Hence, this patch adds conditional support for generating
> _DSM #5.
> 
> According to the ACPI Specification, Revision 6.6, Section 9.1.1 -
> “_DSM (Device Specific Method)”,
> 
> "
> If Function Index is zero, the return is a buffer containing one bit for
> each function index, starting with zero. Bit 0 indicates whether there
> is support for any functions other than function 0 for the specified
> UUID and Revision ID. If set to zero, no functions are supported (other
> than function zero) for the specified UUID and Revision ID. If set to
> one, at least one additional function is supported. For all other bits
> in the buffer, a bit is set to zero to indicate if that function index
> is not supported for the specific UUID and Revision ID. (For example,
> bit 1 set to 0 indicates that function index 1 is not supported for the
> specific UUID and Revision ID.)
> "
> 
> Please refer to the PCI Firmware Specification, Revision 3.3, Section 4.6.5 —
> "_DSM for Preserving PCI Boot Configurations" for Function 5 of the _DSM
> method.
> 
> Also, while at it, move the byte_list declaration to the top of the
> function for clarity.
> 
> At the moment, DSM generation is not yet enabled.
> 
> The resulting AML when preserve_config=true is:
> 
>     Method (_DSM, 4, NotSerialized)
>         {
>             If ((Arg0 == ToUUID ("e5c937d0-3553-4d7a-9117-ea4d19c3434d")))
>                 {
>                     If ((Arg2 == Zero))
>                     {
>                         Return (Buffer (One)
>                         {
>                              0x21
>                         })
>                     }
> 
>                     If ((Arg2 == 0x05))
>                     {
>                         Return (Zero)
>                     }
>                 }
>          ...
>       }
> 
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Signed-off-by: Eric Auger <eric.auger@redhat.com>
> [Shameer: Removed possible duplicate _DSM creations]
> Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>

> ---
> Previously, QEMU reverted an attempt to enable DSM #5 because it caused a
> regression,
> https://lore.kernel.org/all/20210724185234.GA2265457@roeck-us.net/.
> 
> However, in this series, we enable it selectively, only when SMMUv3 is in
> accelerator mode. The devices involved in the earlier regression are not
> expected in accelerated SMMUv3 use cases.
> ---
>  hw/pci-host/gpex-acpi.c    | 29 +++++++++++++++++++++++------
>  include/hw/pci-host/gpex.h |  1 +
>  2 files changed, 24 insertions(+), 6 deletions(-)
> 
> diff --git a/hw/pci-host/gpex-acpi.c b/hw/pci-host/gpex-acpi.c
> index 4587baeb78..d9820f9b41 100644
> --- a/hw/pci-host/gpex-acpi.c
> +++ b/hw/pci-host/gpex-acpi.c
> @@ -51,10 +51,11 @@ static void acpi_dsdt_add_pci_route_table(Aml *dev, uint32_t irq,
>      }
>  }
>  
> -static Aml *build_pci_host_bridge_dsm_method(void)
> +static Aml *build_pci_host_bridge_dsm_method(bool preserve_config)
>  {
>      Aml *method = aml_method("_DSM", 4, AML_NOTSERIALIZED);
>      Aml *UUID, *ifctx, *ifctx1, *buf;
> +    uint8_t byte_list[1] = {0};
>  
>      /* PCI Firmware Specification 3.0
>       * 4.6.1. _DSM for PCI Express Slot Information
> @@ -64,10 +65,23 @@ static Aml *build_pci_host_bridge_dsm_method(void)
>      UUID = aml_touuid("E5C937D0-3553-4D7A-9117-EA4D19C3434D");
>      ifctx = aml_if(aml_equal(aml_arg(0), UUID));
>      ifctx1 = aml_if(aml_equal(aml_arg(2), aml_int(0)));
> -    uint8_t byte_list[1] = {0};
> +    if (preserve_config) {
> +        /* support functions other than 0, specifically function 5 */
> +        byte_list[0] = 0x21;
> +    }
>      buf = aml_buffer(1, byte_list);
>      aml_append(ifctx1, aml_return(buf));
>      aml_append(ifctx, ifctx1);
> +    if (preserve_config) {
> +        Aml *ifctx2 = aml_if(aml_equal(aml_arg(2), aml_int(5)));
> +        /*
> +         * 0 - The operating system must not ignore the PCI configuration that
> +         *     firmware has done at boot time.
> +         */
> +        aml_append(ifctx2, aml_return(aml_int(0)));
> +        aml_append(ifctx, ifctx2);
> +    }
> +
>      aml_append(method, ifctx);
>  
>      byte_list[0] = 0;
> @@ -77,12 +91,13 @@ static Aml *build_pci_host_bridge_dsm_method(void)
>  }
>  
>  static void acpi_dsdt_add_host_bridge_methods(Aml *dev,
> -                                              bool enable_native_pcie_hotplug)
> +                                              bool enable_native_pcie_hotplug,
> +                                              bool preserve_config)
>  {
>      /* Declare an _OSC (OS Control Handoff) method */
>      aml_append(dev,
>                 build_pci_host_bridge_osc_method(enable_native_pcie_hotplug));
> -    aml_append(dev, build_pci_host_bridge_dsm_method());
> +    aml_append(dev, build_pci_host_bridge_dsm_method(preserve_config));
>  }
>  
>  void acpi_dsdt_add_gpex(Aml *scope, struct GPEXConfig *cfg)
> @@ -152,7 +167,8 @@ void acpi_dsdt_add_gpex(Aml *scope, struct GPEXConfig *cfg)
>                  build_cxl_osc_method(dev);
>              } else {
>                  /* pxb bridges do not have ACPI PCI Hot-plug enabled */
> -                acpi_dsdt_add_host_bridge_methods(dev, true);
> +                acpi_dsdt_add_host_bridge_methods(dev, true,
> +                                                  cfg->preserve_config);
>              }
>  
>              aml_append(scope, dev);
> @@ -227,7 +243,8 @@ void acpi_dsdt_add_gpex(Aml *scope, struct GPEXConfig *cfg)
>      }
>      aml_append(dev, aml_name_decl("_CRS", rbuf));
>  
> -    acpi_dsdt_add_host_bridge_methods(dev, cfg->pci_native_hotplug);
> +    acpi_dsdt_add_host_bridge_methods(dev, cfg->pci_native_hotplug,
> +                                      cfg->preserve_config);
>  
>      Aml *dev_res0 = aml_device("%s", "RES0");
>      aml_append(dev_res0, aml_name_decl("_HID", aml_string("PNP0C02")));
> diff --git a/include/hw/pci-host/gpex.h b/include/hw/pci-host/gpex.h
> index feaf827474..7eea16e728 100644
> --- a/include/hw/pci-host/gpex.h
> +++ b/include/hw/pci-host/gpex.h
> @@ -46,6 +46,7 @@ struct GPEXConfig {
>      int         irq;
>      PCIBus      *bus;
>      bool        pci_native_hotplug;
> +    bool        preserve_config;
>  };
>  
>  typedef struct GPEXIrq GPEXIrq;
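A note on the 0x21 value used above: in the function-0 return buffer, bit 0
means "at least one function other than 0 is supported" and bit 5 means
"function index 5 is supported", so 0x21 (binary 0010 0001) advertises exactly
function 5 on top of function 0. Purely as an illustration:

    /* illustrative check of the bitmap encoding, not code from the patch */
    assert((0x21 & 0x01) && (0x21 & 0x20));   /* bits 0 and 5 are set */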



^ permalink raw reply	[flat|nested] 148+ messages in thread

* RE: [PATCH v5 06/32] hw/arm/smmuv3-accel: Initialize shared system address space
  2025-10-31 21:10   ` Nicolin Chen
@ 2025-11-03 14:17     ` Shameer Kolothum
  0 siblings, 0 replies; 148+ messages in thread
From: Shameer Kolothum @ 2025-11-03 14:17 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: qemu-arm@nongnu.org, qemu-devel@nongnu.org, eric.auger@redhat.com,
	peter.maydell@linaro.org, Jason Gunthorpe, ddutile@redhat.com,
	berrange@redhat.com, Nathan Chen, Matt Ochs, smostafa@google.com,
	wangzhou1@hisilicon.com, jiangkunkun@huawei.com,
	jonathan.cameron@huawei.com, zhangfei.gao@linaro.org,
	zhenzhong.duan@intel.com, yi.l.liu@intel.com, Krishnakant Jaju



> -----Original Message-----
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: 31 October 2025 21:11
> To: Shameer Kolothum <skolothumtho@nvidia.com>
> Cc: qemu-arm@nongnu.org; qemu-devel@nongnu.org;
> eric.auger@redhat.com; peter.maydell@linaro.org; Jason Gunthorpe
> <jgg@nvidia.com>; ddutile@redhat.com; berrange@redhat.com; Nathan
> Chen <nathanc@nvidia.com>; Matt Ochs <mochs@nvidia.com>;
> smostafa@google.com; wangzhou1@hisilicon.com;
> jiangkunkun@huawei.com; jonathan.cameron@huawei.com;
> zhangfei.gao@linaro.org; zhenzhong.duan@intel.com; yi.l.liu@intel.com;
> Krishnakant Jaju <kjaju@nvidia.com>
> Subject: Re: [PATCH v5 06/32] hw/arm/smmuv3-accel: Initialize shared system
> address space
> 
> On Fri, Oct 31, 2025 at 10:49:39AM +0000, Shameer Kolothum wrote:
> > To support accelerated SMMUv3 instances, introduce a shared
> > system-wide AddressSpace (shared_as_sysmem) that aliases the global
> > system memory.
> > This shared AddressSpace will be used in a subsequent patch for all
> > vfio-pci devices behind all accelerated SMMUv3 instances within a VM.
> >
> > Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
> 
> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
> 
> > +/*
> > + * The root region aliases the global system memory, and shared_as_sysmem
> > + * provides a shared Address Space referencing it. This Address Space is used
> > + * by all vfio-pci devices behind all accelerated SMMUv3 instances within a VM.
> > + */
> > +MemoryRegion root;
> > +MemoryRegion sysmem;
> > +static AddressSpace *shared_as_sysmem;
> 
> static MemoryRegion root, sysmem; ?

Sure.

Thanks,
Shameer


^ permalink raw reply	[flat|nested] 148+ messages in thread

* RE: [PATCH v5 12/32] hw/arm/smmuv3-accel: Add set/unset_iommu_device callback
  2025-10-31 22:02   ` Nicolin Chen
  2025-10-31 22:08     ` Nicolin Chen
@ 2025-11-03 14:19     ` Shameer Kolothum
  1 sibling, 0 replies; 148+ messages in thread
From: Shameer Kolothum @ 2025-11-03 14:19 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: qemu-arm@nongnu.org, qemu-devel@nongnu.org, eric.auger@redhat.com,
	peter.maydell@linaro.org, Jason Gunthorpe, ddutile@redhat.com,
	berrange@redhat.com, Nathan Chen, Matt Ochs, smostafa@google.com,
	wangzhou1@hisilicon.com, jiangkunkun@huawei.com,
	jonathan.cameron@huawei.com, zhangfei.gao@linaro.org,
	zhenzhong.duan@intel.com, yi.l.liu@intel.com, Krishnakant Jaju



> -----Original Message-----
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: 31 October 2025 22:02
> To: Shameer Kolothum <skolothumtho@nvidia.com>
> Cc: qemu-arm@nongnu.org; qemu-devel@nongnu.org;
> eric.auger@redhat.com; peter.maydell@linaro.org; Jason Gunthorpe
> <jgg@nvidia.com>; ddutile@redhat.com; berrange@redhat.com; Nathan
> Chen <nathanc@nvidia.com>; Matt Ochs <mochs@nvidia.com>;
> smostafa@google.com; wangzhou1@hisilicon.com;
> jiangkunkun@huawei.com; jonathan.cameron@huawei.com;
> zhangfei.gao@linaro.org; zhenzhong.duan@intel.com; yi.l.liu@intel.com;
> Krishnakant Jaju <kjaju@nvidia.com>
> Subject: Re: [PATCH v5 12/32] hw/arm/smmuv3-accel: Add
> set/unset_iommu_device callback
> 
> On Fri, Oct 31, 2025 at 10:49:45AM +0000, Shameer Kolothum wrote:
> > +static bool
> > +smmuv3_accel_dev_alloc_viommu(SMMUv3AccelDevice *accel_dev,
> > +                              HostIOMMUDeviceIOMMUFD *idev, Error **errp)
> 
> Let's make it simply do alloc() on s_accel:
> 
> static bool smmuv3_accel_alloc_viommu(SMMUv3AccelState *s_accel,
>                                       HostIOMMUDeviceIOMMUFD *idev,
>                                       Error **errp)
> 
> Then..
> 
> > +    SMMUDevice *sdev = &accel_dev->sdev;
> > +    SMMUState *bs = sdev->smmu;
> > +    SMMUv3State *s = ARM_SMMUV3(bs);
> > +    SMMUv3AccelState *s_accel = s->s_accel;
> 
> Drop these.
> 
> > +    if (s_accel->vsmmu) {
> > +        accel_dev->vsmmu = s_accel->vsmmu;
> > +        return true;
> > +    }
> 
> And this.
> 
> > +static bool smmuv3_accel_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
> > +                                          HostIOMMUDevice *hiod, Error **errp)
> [...]
> > +    if (!smmuv3_accel_dev_alloc_viommu(accel_dev, idev, errp)) {
> > +        error_append_hint(errp, "Device 0x%x: Unable to alloc viommu", sid);
> > +        return false;
> > +    }
> 
> And here:
> 
>     if (!s_accel->vsmmu && !smmuv3_accel_alloc_viommu(s_accel, idev, errp)) {
>         error_append_hint(errp, "Device 0x%x: Unable to alloc viommu", sid);
>         return false;
>     }
> 
>     accel_dev->idev = idev;
>     accel_dev->vsmmu = s_accel->vsmmu;
> 
> Feels slightly cleaner.
> 
> > +/*
> > + * Represents a virtual SMMU instance backed by an iommufd vIOMMU object.
> > + * Holds references to the core iommufd vIOMMU object and to proxy HWPTs
> 
> I read "reference" as a pointer, yet...
> 
> > + * (bypass and abort) used for device attachment.
> > + */
> > +typedef struct SMMUViommu {
> > +    IOMMUFDBackend *iommufd;
> > +    IOMMUFDViommu viommu;
> > +    uint32_t bypass_hwpt_id;
> > +    uint32_t abort_hwpt_id;
> 
> ...viommu is a containment and two HWPTs are IDs.
> 
> So, it'd sound more accurate, being:
> 
> /*
>  * Represents a virtual SMMU instance backed by an iommufd vIOMMU object.
>  * Holds bypass and abort proxy HWPT ids used for device attachment.
>  */
> 
> > +typedef struct SMMUv3AccelState {
> > +    SMMUViommu *vsmmu;
> > +} SMMUv3AccelState;
> 
> Hmm, maybe we don't need another layer of structure. Every access or
> validation to s_accel is for s_accel->vsmmu.
> 
> e.g.
>     if (!s_accel || !s_accel->vsmmu) {
> could be
>     if (!s_accel) {
> 
> So, let's try merging them into one. And feel free to keep whichever one you prefer.

Looks sensible to me. I will take a look during next revision.

Thanks,
Shameer
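
For reference, folding the suggestions above together, the call site could end
up looking roughly like this (an illustrative consolidation of the snippets
quoted above, not code from the series):

    if (!s_accel->vsmmu &&
        !smmuv3_accel_alloc_viommu(s_accel, idev, errp)) {
        error_append_hint(errp, "Device 0x%x: Unable to alloc viommu", sid);
        return false;
    }

    accel_dev->idev = idev;
    accel_dev->vsmmu = s_accel->vsmmu;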


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 09/32] hw/pci-bridge/pci_expander_bridge: Move TYPE_PXB_PCIE_DEV to header
  2025-10-31 10:49 ` [PATCH v5 09/32] hw/pci-bridge/pci_expander_bridge: Move TYPE_PXB_PCIE_DEV to header Shameer Kolothum
  2025-11-03 13:30   ` Jonathan Cameron via
@ 2025-11-03 14:25   ` Eric Auger
  1 sibling, 0 replies; 148+ messages in thread
From: Eric Auger @ 2025-11-03 14:25 UTC (permalink / raw)
  To: Shameer Kolothum, qemu-arm, qemu-devel
  Cc: peter.maydell, jgg, nicolinc, ddutile, berrange, nathanc, mochs,
	smostafa, wangzhou1, jiangkunkun, jonathan.cameron, zhangfei.gao,
	zhenzhong.duan, yi.l.liu, kjaju



On 10/31/25 11:49 AM, Shameer Kolothum wrote:
> Move the TYPE_PXB_PCIE_DEV definition to a header so that it can be
> referenced by other code in a subsequent patch.
>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>

Eric
> ---
>  hw/pci-bridge/pci_expander_bridge.c | 1 -
>  include/hw/pci/pci_bridge.h         | 1 +
>  2 files changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/hw/pci-bridge/pci_expander_bridge.c b/hw/pci-bridge/pci_expander_bridge.c
> index 1bcceddbc4..a8eb2d2426 100644
> --- a/hw/pci-bridge/pci_expander_bridge.c
> +++ b/hw/pci-bridge/pci_expander_bridge.c
> @@ -48,7 +48,6 @@ struct PXBBus {
>      char bus_path[8];
>  };
>  
> -#define TYPE_PXB_PCIE_DEV "pxb-pcie"
>  OBJECT_DECLARE_SIMPLE_TYPE(PXBPCIEDev, PXB_PCIE_DEV)
>  
>  static GList *pxb_dev_list;
> diff --git a/include/hw/pci/pci_bridge.h b/include/hw/pci/pci_bridge.h
> index a055fd8d32..b61360b900 100644
> --- a/include/hw/pci/pci_bridge.h
> +++ b/include/hw/pci/pci_bridge.h
> @@ -106,6 +106,7 @@ typedef struct PXBPCIEDev {
>  
>  #define TYPE_PXB_PCIE_BUS "pxb-pcie-bus"
>  #define TYPE_PXB_CXL_BUS "pxb-cxl-bus"
> +#define TYPE_PXB_PCIE_DEV "pxb-pcie"
>  #define TYPE_PXB_DEV "pxb"
>  OBJECT_DECLARE_SIMPLE_TYPE(PXBDev, PXB_DEV)
>  



^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 19/32] hw/arm/smmuv3-accel: Get host SMMUv3 hw info and validate
  2025-10-31 10:49 ` [PATCH v5 19/32] hw/arm/smmuv3-accel: Get host SMMUv3 hw info and validate Shameer Kolothum
  2025-11-01  0:49   ` Nicolin Chen
  2025-11-01 14:20   ` Zhangfei Gao
@ 2025-11-03 14:47   ` Jonathan Cameron via
  2 siblings, 0 replies; 148+ messages in thread
From: Jonathan Cameron via @ 2025-11-03 14:47 UTC (permalink / raw)
  To: Shameer Kolothum
  Cc: qemu-arm, qemu-devel, eric.auger, peter.maydell, jgg, nicolinc,
	ddutile, berrange, nathanc, mochs, smostafa, wangzhou1,
	jiangkunkun, zhangfei.gao, zhenzhong.duan, yi.l.liu, kjaju

On Fri, 31 Oct 2025 10:49:52 +0000
Shameer Kolothum <skolothumtho@nvidia.com> wrote:

> Just before the device gets attached to the SMMUv3, make sure QEMU SMMUv3
> features are compatible with the host SMMUv3.
> 
> Not all fields in the host SMMUv3 IDR registers are meaningful for userspace.
> Only the following fields can be used:
> 
>   - IDR0: ST_LEVEL, TERM_MODEL, STALL_MODEL, TTENDIAN, CD2L, ASID16, TTF
>   - IDR1: SIDSIZE, SSIDSIZE
>   - IDR3: BBML, RIL
>   - IDR5: VAX, GRAN64K, GRAN16K, GRAN4K
> 
> For now, the check is to make sure the features are in sync to enable
> basic accelerated SMMUv3 support.
> 
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
LGTM
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 22/32] tests/qtest/bios-tables-test: Prepare for IORT revision upgrade
  2025-10-31 10:49 ` [PATCH v5 22/32] tests/qtest/bios-tables-test: Prepare for IORT revision upgrade Shameer Kolothum
@ 2025-11-03 14:48   ` Jonathan Cameron via
  2025-11-03 14:59   ` Eric Auger
  1 sibling, 0 replies; 148+ messages in thread
From: Jonathan Cameron via @ 2025-11-03 14:48 UTC (permalink / raw)
  To: Shameer Kolothum
  Cc: qemu-arm, qemu-devel, eric.auger, peter.maydell, jgg, nicolinc,
	ddutile, berrange, nathanc, mochs, smostafa, wangzhou1,
	jiangkunkun, zhangfei.gao, zhenzhong.duan, yi.l.liu, kjaju

On Fri, 31 Oct 2025 10:49:55 +0000
Shameer Kolothum <skolothumtho@nvidia.com> wrote:

> A subsequent patch will upgrade the IORT revision to 5 to add support
> for IORT RMR nodes.
> 
> Add the affected IORT blobs to the allowed-diff list for the bios-tables
> tests.
> 
FWIW
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>

> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
> ---
>  tests/qtest/bios-tables-test-allowed-diff.h | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/tests/qtest/bios-tables-test-allowed-diff.h b/tests/qtest/bios-tables-test-allowed-diff.h
> index dfb8523c8b..3279638ad0 100644
> --- a/tests/qtest/bios-tables-test-allowed-diff.h
> +++ b/tests/qtest/bios-tables-test-allowed-diff.h
> @@ -1 +1,5 @@
>  /* List of comma-separated changed AML files to ignore */
> +"tests/data/acpi/aarch64/virt/IORT",
> +"tests/data/acpi/aarch64/virt/IORT.its_off",
> +"tests/data/acpi/aarch64/virt/IORT.smmuv3-legacy",
> +"tests/data/acpi/aarch64/virt/IORT.smmuv3-dev",



^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 23/32] hw/arm/virt-acpi-build: Add IORT RMR regions to handle MSI nested binding
  2025-10-31 10:49 ` [PATCH v5 23/32] hw/arm/virt-acpi-build: Add IORT RMR regions to handle MSI nested binding Shameer Kolothum
@ 2025-11-03 14:53   ` Jonathan Cameron via
  2025-11-03 15:43     ` Shameer Kolothum
  0 siblings, 1 reply; 148+ messages in thread
From: Jonathan Cameron via @ 2025-11-03 14:53 UTC (permalink / raw)
  To: Shameer Kolothum
  Cc: qemu-arm, qemu-devel, eric.auger, peter.maydell, jgg, nicolinc,
	ddutile, berrange, nathanc, mochs, smostafa, wangzhou1,
	jiangkunkun, zhangfei.gao, zhenzhong.duan, yi.l.liu, kjaju

On Fri, 31 Oct 2025 10:49:56 +0000
Shameer Kolothum <skolothumtho@nvidia.com> wrote:

> From: Eric Auger <eric.auger@redhat.com>
> 
> > To handle SMMUv3 accel=on mode (which configures the host SMMUv3 in nested
> > mode), it is practical to expose to the guest reserved memory regions
> > (RMRs) covering the IOVAs used by the host kernel to map physical MSI
> > doorbells.
> 
> Those IOVAs belong to [0x8000000, 0x8100000] matching MSI_IOVA_BASE and
> > MSI_IOVA_LENGTH definitions in the kernel arm-smmu-v3 driver. This is the
> window used to allocate IOVAs matching physical MSI doorbells.
> 
> With those RMRs, the guest is forced to use a flat mapping for this range.
> Hence the assigned device is programmed with one IOVA from this range.
> > Stage 1, owned by the guest, has a flat mapping for this IOVA. Stage 2,
> > owned by the VMM, then enforces a mapping from this IOVA to the physical
> > MSI doorbell.
> 
> The creation of those RMR nodes is only relevant if nested stage SMMU is
> in use, along with VFIO. As VFIO devices can be hotplugged, all RMRs need
> to be created in advance.
> 
> Signed-off-by: Eric Auger <eric.auger@redhat.com>
> Suggested-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>

One small question inline on the id increment.

With that tidied up.
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>

> @@ -447,10 +475,70 @@ static void create_rc_its_idmaps(GArray *its_idmaps, GArray *smmuv3_devs)
>      }
>  }
>  
> +static void
> +build_iort_rmr_nodes(GArray *table_data, GArray *smmuv3_devices, uint32_t *id)
> +{
> +    AcpiIortSMMUv3Dev *sdev;
> +    AcpiIortIdMapping *idmap;
> +    int i;
> +
> +    for (i = 0; i < smmuv3_devices->len; i++) {
> +        uint16_t rmr_len;
> +        int bdf;
> +
> +        sdev = &g_array_index(smmuv3_devices, AcpiIortSMMUv3Dev, i);
> +        if (!sdev->accel) {
> +            continue;
> +        }
> +
> +        /*
> +         * Spec reference:Arm IO Remapping Table(IORT), ARM DEN 0049E.d,
> +         * Section 3.1.1.5 "Reserved Memory Range node"
> +         */
> +        idmap = &g_array_index(sdev->rc_smmu_idmaps, AcpiIortIdMapping, 0);
> +        bdf = idmap->input_base;
> +        rmr_len = IORT_RMR_COMMON_HEADER_SIZE
> +                 + (IORT_RMR_NUM_ID_MAPPINGS * ID_MAPPING_ENTRY_SIZE)
> +                 + (IORT_RMR_NUM_MEM_RANGE_DESC * IORT_RMR_MEM_RANGE_DESC_SIZE);
> +
> +        /* Table 18 Reserved Memory Range Node */
> +        build_append_int_noprefix(table_data, 6 /* RMR */, 1); /* Type */
> +        /* Length */
> +        build_append_int_noprefix(table_data, rmr_len, 2);
> +        build_append_int_noprefix(table_data, 3, 1); /* Revision */
> +        build_append_int_noprefix(table_data, (*id)++, 4); /* Identifier */
So *id is incremented here and...
> +        /* Number of ID mappings */
> +        build_append_int_noprefix(table_data, IORT_RMR_NUM_ID_MAPPINGS, 4);
> +        /* Reference to ID Array */
> +        build_append_int_noprefix(table_data, IORT_RMR_COMMON_HEADER_SIZE, 4);
> +
> +        /* RMR specific data */
> +
> +        /* Flags */
> +        build_append_int_noprefix(table_data, IORT_RMR_FLAGS, 4);
> +        /* Number of Memory Range Descriptors */
> +        build_append_int_noprefix(table_data, IORT_RMR_NUM_MEM_RANGE_DESC, 4);
> +        /* Reference to Memory Range Descriptors */
> +        build_append_int_noprefix(table_data, IORT_RMR_COMMON_HEADER_SIZE +
> +                        (IORT_RMR_NUM_ID_MAPPINGS * ID_MAPPING_ENTRY_SIZE), 4);
> +        build_iort_id_mapping(table_data, bdf, idmap->id_count, sdev->offset,
> +                              1);
> +
> +        /* Table 19 Memory Range Descriptor */
> +
> +        /* Physical Range offset */
> +        build_append_int_noprefix(table_data, MSI_IOVA_BASE, 8);
> +        /* Physical Range length */
> +        build_append_int_noprefix(table_data, MSI_IOVA_LENGTH, 8);
> +        build_append_int_noprefix(table_data, 0, 4); /* Reserved */
> +        *id += 1;
here. Why this second one? Perhaps a comment if this is intended.

> +    }
> +}
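
On the identifier question above: unless the intent is to deliberately reserve
an extra identifier per node, the usual pattern is a single post-increment at
the point where the Identifier field is emitted, e.g. (illustrative only):

    /* Identifier - one per RMR node */
    build_append_int_noprefix(table_data, (*id)++, 4);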


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 24/32] tests/qtest/bios-tables-test: Update IORT blobs after revision upgrade
  2025-10-31 10:49 ` [PATCH v5 24/32] tests/qtest/bios-tables-test: Update IORT blobs after revision upgrade Shameer Kolothum
@ 2025-11-03 14:54   ` Jonathan Cameron via
  2025-11-03 15:01   ` Eric Auger
  1 sibling, 0 replies; 148+ messages in thread
From: Jonathan Cameron via @ 2025-11-03 14:54 UTC (permalink / raw)
  To: Shameer Kolothum
  Cc: qemu-arm, qemu-devel, eric.auger, peter.maydell, jgg, nicolinc,
	ddutile, berrange, nathanc, mochs, smostafa, wangzhou1,
	jiangkunkun, zhangfei.gao, zhenzhong.duan, yi.l.liu, kjaju

On Fri, 31 Oct 2025 10:49:57 +0000
Shameer Kolothum <skolothumtho@nvidia.com> wrote:

> Update the reference IORT blobs after revision upgrade for RMR node
> support. This affects the aarch64 'virt' IORT tests.
> 
> IORT diff is the same for all the tests:
> 
>  /*
>   * Intel ACPI Component Architecture
>   * AML/ASL+ Disassembler version 20230628 (64-bit version)
>   * Copyright (c) 2000 - 2023 Intel Corporation
>   *
> - * Disassembly of tests/data/acpi/aarch64/virt/IORT, Mon Oct 20 14:42:41 2025
> + * Disassembly of /tmp/aml-B4ZRE3, Mon Oct 20 14:42:41 2025
>   *
>   * ACPI Data Table [IORT]
>   *
>   * Format: [HexOffset DecimalOffset ByteLength]  FieldName : FieldValue (in hex)
>   */
> 
>  [000h 0000 004h]                   Signature : "IORT"    [IO Remapping Table]
>  [004h 0004 004h]                Table Length : 00000080
> -[008h 0008 001h]                    Revision : 03
> -[009h 0009 001h]                    Checksum : B3
> +[008h 0008 001h]                    Revision : 05
> +[009h 0009 001h]                    Checksum : B1
>  [00Ah 0010 006h]                      Oem ID : "BOCHS "
>  [010h 0016 008h]                Oem Table ID : "BXPC    "
>  [018h 0024 004h]                Oem Revision : 00000001
>  [01Ch 0028 004h]             Asl Compiler ID : "BXPC"
>  [020h 0032 004h]       Asl Compiler Revision : 00000001
>  ...
> 
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
FWIW given trivial change.
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 25/32] hw/arm/smmuv3: Add accel property for SMMUv3 device
  2025-10-31 10:49 ` [PATCH v5 25/32] hw/arm/smmuv3: Add accel property for SMMUv3 device Shameer Kolothum
@ 2025-11-03 14:56   ` Jonathan Cameron via
  0 siblings, 0 replies; 148+ messages in thread
From: Jonathan Cameron via @ 2025-11-03 14:56 UTC (permalink / raw)
  To: Shameer Kolothum
  Cc: qemu-arm, qemu-devel, eric.auger, peter.maydell, jgg, nicolinc,
	ddutile, berrange, nathanc, mochs, smostafa, wangzhou1,
	jiangkunkun, zhangfei.gao, zhenzhong.duan, yi.l.liu, kjaju

On Fri, 31 Oct 2025 10:49:58 +0000
Shameer Kolothum <skolothumtho@nvidia.com> wrote:

> Introduce an “accel” property to enable accelerator mode.
> 
> Live migration is currently unsupported when accelerator mode is enabled,
> so a migration blocker is added.
> 
> Because this mode relies on IORT RMR for MSI support, accelerator mode is
> not supported for device tree boot.
> 
> Also, in the accelerated SMMUv3 case, the host SMMUv3 is configured in nested
> mode (S1 + S2), and the guest owns the Stage-1 page table. Therefore, we
> expose only Stage-1 to the guest to ensure it uses the correct page table
> format.
> 
> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
> Reviewed-by: Eric Auger <eric.auger@redhat.com>
> Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
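
For readers unfamiliar with the migration-blocker pattern mentioned in the
commit message, it generally looks like the sketch below (illustrative, assuming
the current migrate_add_blocker(Error **reasonp, Error **errp) signature; not
the exact code from the patch):

    error_setg(&s->migration_blocker,
               "smmuv3 accel=on does not support live migration yet");
    if (migrate_add_blocker(&s->migration_blocker, errp) < 0) {
        return;
    }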


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 21/32] hw/arm/virt: Set PCI preserve_config for accel SMMUv3
  2025-10-31 10:49 ` [PATCH v5 21/32] hw/arm/virt: Set PCI preserve_config for accel SMMUv3 Shameer Kolothum
@ 2025-11-03 14:58   ` Eric Auger
  2025-11-03 15:03     ` Eric Auger via
  2025-11-03 16:01     ` Shameer Kolothum
  0 siblings, 2 replies; 148+ messages in thread
From: Eric Auger @ 2025-11-03 14:58 UTC (permalink / raw)
  To: Shameer Kolothum, qemu-arm, qemu-devel
  Cc: peter.maydell, jgg, nicolinc, ddutile, berrange, nathanc, mochs,
	smostafa, wangzhou1, jiangkunkun, jonathan.cameron, zhangfei.gao,
	zhenzhong.duan, yi.l.liu, kjaju



On 10/31/25 11:49 AM, Shameer Kolothum wrote:
> Introduce a new pci_preserve_config field in the virt machine state which
> allows the generation of DSM #5. This field is only set if an accel SMMU
> is instantiated.
>
> In a subsequent patch, SMMUv3 accel mode will make use of IORT RMR nodes
> to enable nested translation of MSI doorbell addresses. IORT RMR requires
> _DSM #5 to be set for the PCI host bridge so that the Guest kernel
> preserves the PCI boot configuration.
>
> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
> ---
>  hw/arm/virt-acpi-build.c | 8 ++++++++
>  hw/arm/virt.c            | 4 ++++
>  include/hw/arm/virt.h    | 1 +
>  3 files changed, 13 insertions(+)
>
> diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
> index 8bb6b60515..d51da6e27d 100644
> --- a/hw/arm/virt-acpi-build.c
> +++ b/hw/arm/virt-acpi-build.c
> @@ -163,6 +163,14 @@ static void acpi_dsdt_add_pci(Aml *scope, const MemMapEntry *memmap,
>          .pci_native_hotplug = !acpi_pcihp,
>      };
>  
> +    /*
> +     * Accel SMMU requires RMRs for MSI 1-1 mapping, which require _DSM for
> +     * preserving PCI Boot Configurations
As suggested in v4, you can be more precise and explicitly state

_DSM function 5 (Ignore PCI Boot Configuration)

> +     */
> +    if (vms->pci_preserve_config) {
> +        cfg.preserve_config = true;
> +    }
> +
>      if (vms->highmem_mmio) {
>          cfg.mmio64 = memmap[VIRT_HIGH_PCIE_MMIO];
>      }
> diff --git a/hw/arm/virt.c b/hw/arm/virt.c
> index 175023897a..8a347a6e39 100644
> --- a/hw/arm/virt.c
> +++ b/hw/arm/virt.c
> @@ -3091,6 +3091,10 @@ static void virt_machine_device_plug_cb(HotplugHandler *hotplug_dev,
>              }
>  
>              create_smmuv3_dev_dtb(vms, dev, bus);
> +            if (object_property_find(OBJECT(dev), "accel") &&
why do you need to test

object_property_find(OBJECT(dev), "accel")?

> +                object_property_get_bool(OBJECT(dev), "accel", &error_abort)) {
> +                vms->pci_preserve_config = true;
> +            }
>          }
>      }
>  
> diff --git a/include/hw/arm/virt.h b/include/hw/arm/virt.h
> index 04a09af354..60db5d40b2 100644
> --- a/include/hw/arm/virt.h
> +++ b/include/hw/arm/virt.h
> @@ -182,6 +182,7 @@ struct VirtMachineState {
>      bool ns_el2_virt_timer_irq;
>      CXLState cxl_devices_state;
>      bool legacy_smmuv3_present;
> +    bool pci_preserve_config;
>  };
>  
>  #define VIRT_ECAM_ID(high) (high ? VIRT_HIGH_PCIE_ECAM : VIRT_PCIE_ECAM)
With those changes takin into account
Reviewed-by: Eric Auger <eric.auger@redhat.com>

Eric



^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 30/32] Extend get_cap() callback to support PASID
  2025-10-31 10:50 ` [PATCH v5 30/32] Extend get_cap() callback to support PASID Shameer Kolothum
@ 2025-11-03 14:58   ` Jonathan Cameron via
  2025-11-06  8:45   ` Eric Auger
  1 sibling, 0 replies; 148+ messages in thread
From: Jonathan Cameron via @ 2025-11-03 14:58 UTC (permalink / raw)
  To: Shameer Kolothum
  Cc: qemu-arm, qemu-devel, eric.auger, peter.maydell, jgg, nicolinc,
	ddutile, berrange, nathanc, mochs, smostafa, wangzhou1,
	jiangkunkun, zhangfei.gao, zhenzhong.duan, yi.l.liu, kjaju

On Fri, 31 Oct 2025 10:50:03 +0000
Shameer Kolothum <skolothumtho@nvidia.com> wrote:

> Modify the get_cap() callback so that it can return the capability via an
> output uint64_t parameter, and add support for generic IOMMU HW capability
> info and max_pasid_log2 (PASID width).
> 
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
LGTM
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
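
For orientation, the reshaped callback described above would have a shape along
these lines (illustrative only; the exact parameter names in the series may
differ):

    /* returns 0 on success and fills *out with the requested capability */
    int (*get_cap)(HostIOMMUDevice *hiod, int cap, uint64_t *out, Error **errp);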


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 22/32] tests/qtest/bios-tables-test: Prepare for IORT revision upgrade
  2025-10-31 10:49 ` [PATCH v5 22/32] tests/qtest/bios-tables-test: Prepare for IORT revision upgrade Shameer Kolothum
  2025-11-03 14:48   ` Jonathan Cameron via
@ 2025-11-03 14:59   ` Eric Auger
  1 sibling, 0 replies; 148+ messages in thread
From: Eric Auger @ 2025-11-03 14:59 UTC (permalink / raw)
  To: Shameer Kolothum, qemu-arm, qemu-devel
  Cc: peter.maydell, jgg, nicolinc, ddutile, berrange, nathanc, mochs,
	smostafa, wangzhou1, jiangkunkun, jonathan.cameron, zhangfei.gao,
	zhenzhong.duan, yi.l.liu, kjaju



On 10/31/25 11:49 AM, Shameer Kolothum wrote:
> A subsequent patch will upgrade the IORT revision to 5 to add support
> for IORT RMR nodes.
>
> Add the affected IORT blobs to the allowed-diff list for the bios-tables
> tests.
>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
> ---
>  tests/qtest/bios-tables-test-allowed-diff.h | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/tests/qtest/bios-tables-test-allowed-diff.h b/tests/qtest/bios-tables-test-allowed-diff.h
> index dfb8523c8b..3279638ad0 100644
> --- a/tests/qtest/bios-tables-test-allowed-diff.h
> +++ b/tests/qtest/bios-tables-test-allowed-diff.h
> @@ -1 +1,5 @@
>  /* List of comma-separated changed AML files to ignore */
> +"tests/data/acpi/aarch64/virt/IORT",
> +"tests/data/acpi/aarch64/virt/IORT.its_off",
> +"tests/data/acpi/aarch64/virt/IORT.smmuv3-legacy",
> +"tests/data/acpi/aarch64/virt/IORT.smmuv3-dev",

Reviewed-by: Eric Auger <eric.auger@redhat.com>

Eric



^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 31/32] vfio: Synthesize vPASID capability to VM
  2025-10-31 10:50 ` [PATCH v5 31/32] vfio: Synthesize vPASID capability to VM Shameer Kolothum
@ 2025-11-03 15:00   ` Jonathan Cameron via
  2025-11-06 13:55   ` Eric Auger
  1 sibling, 0 replies; 148+ messages in thread
From: Jonathan Cameron via @ 2025-11-03 15:00 UTC (permalink / raw)
  To: Shameer Kolothum
  Cc: qemu-arm, qemu-devel, eric.auger, peter.maydell, jgg, nicolinc,
	ddutile, berrange, nathanc, mochs, smostafa, wangzhou1,
	jiangkunkun, zhangfei.gao, zhenzhong.duan, yi.l.liu, kjaju

On Fri, 31 Oct 2025 10:50:04 +0000
Shameer Kolothum <skolothumtho@nvidia.com> wrote:

> From: Yi Liu <yi.l.liu@intel.com>
> 
> If the user wants to expose the PASID capability in the vIOMMU, then VFIO will
> also report the PASID cap for this device if the underlying hardware supports
> it as well.
> 
> As a start, this chooses to put the vPASID cap in the last 8 bytes of the
> vconfig space. This is a choice made in the hope of not conflicting with any
> existing cap or hidden registers. For devices that have hidden registers, the
> user should figure out a proper offset for the vPASID cap. This may require an
> option for the user to configure it. Here we leave it as a future extension.
> There are more discussions on the mechanism of finding the proper offset.
> 
> https://lore.kernel.org/kvm/BN9PR11MB5276318969A212AD0649C7BE8CBE2@BN9PR11MB5276.namprd11.prod.outlook.com/
> 
> Since we add a check to ensure the vIOMMU supports PASID, only devices
> under those vIOMMUs can synthesize the vPASID capability. This gives
> users control over which devices expose vPASID.
> 
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
Whilst not particularly keen on this hack, I can't see a better solution.
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
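
A small arithmetic note on "the last 8 bytes": the PASID extended capability is
8 bytes long, so in a 4 KiB PCIe config space that corresponds to offset 0xff8
(illustrative, assuming the usual PCIE_CONFIG_SPACE_SIZE of 0x1000):

    uint16_t pasid_cap_offset = PCIE_CONFIG_SPACE_SIZE - 8;   /* 0xff8 */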




^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 24/32] tests/qtest/bios-tables-test: Update IORT blobs after revision upgrade
  2025-10-31 10:49 ` [PATCH v5 24/32] tests/qtest/bios-tables-test: Update IORT blobs after revision upgrade Shameer Kolothum
  2025-11-03 14:54   ` Jonathan Cameron via
@ 2025-11-03 15:01   ` Eric Auger
  1 sibling, 0 replies; 148+ messages in thread
From: Eric Auger @ 2025-11-03 15:01 UTC (permalink / raw)
  To: Shameer Kolothum, qemu-arm, qemu-devel
  Cc: peter.maydell, jgg, nicolinc, ddutile, berrange, nathanc, mochs,
	smostafa, wangzhou1, jiangkunkun, jonathan.cameron, zhangfei.gao,
	zhenzhong.duan, yi.l.liu, kjaju



On 10/31/25 11:49 AM, Shameer Kolothum wrote:
> Update the reference IORT blobs after revision upgrade for RMR node
> support. This affects the aarch64 'virt' IORT tests.
>
> IORT diff is the same for all the tests:
>
>  /*
>   * Intel ACPI Component Architecture
>   * AML/ASL+ Disassembler version 20230628 (64-bit version)
>   * Copyright (c) 2000 - 2023 Intel Corporation
>   *
> - * Disassembly of tests/data/acpi/aarch64/virt/IORT, Mon Oct 20 14:42:41 2025
> + * Disassembly of /tmp/aml-B4ZRE3, Mon Oct 20 14:42:41 2025
>   *
>   * ACPI Data Table [IORT]
>   *
>   * Format: [HexOffset DecimalOffset ByteLength]  FieldName : FieldValue (in hex)
>   */
>
>  [000h 0000 004h]                   Signature : "IORT"    [IO Remapping Table]
>  [004h 0004 004h]                Table Length : 00000080
> -[008h 0008 001h]                    Revision : 03
> -[009h 0009 001h]                    Checksum : B3
> +[008h 0008 001h]                    Revision : 05
> +[009h 0009 001h]                    Checksum : B1
>  [00Ah 0010 006h]                      Oem ID : "BOCHS "
>  [010h 0016 008h]                Oem Table ID : "BXPC    "
>  [018h 0024 004h]                Oem Revision : 00000001
>  [01Ch 0028 004h]             Asl Compiler ID : "BXPC"
>  [020h 0032 004h]       Asl Compiler Revision : 00000001
>  ...
>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>

Eric
> ---
>  tests/data/acpi/aarch64/virt/IORT               | Bin 128 -> 128 bytes
>  tests/data/acpi/aarch64/virt/IORT.its_off       | Bin 172 -> 172 bytes
>  tests/data/acpi/aarch64/virt/IORT.smmuv3-dev    | Bin 364 -> 364 bytes
>  tests/data/acpi/aarch64/virt/IORT.smmuv3-legacy | Bin 276 -> 276 bytes
>  tests/qtest/bios-tables-test-allowed-diff.h     |   4 ----
>  5 files changed, 4 deletions(-)
>
> diff --git a/tests/data/acpi/aarch64/virt/IORT b/tests/data/acpi/aarch64/virt/IORT
> index 7efd0ce8a6b3928efa7e1373f688ab4c5f50543b..a234aae4c2d04668d34313836d32ca20e19c0880 100644
> GIT binary patch
> delta 18
> ZcmZo*Y+&T_^bZPYU|?Wi-8hk}3;-#Q1d;#%
>
> delta 18
> ZcmZo*Y+&T_^bZPYU|?Wi-aL`33;-#O1d;#%
>
> diff --git a/tests/data/acpi/aarch64/virt/IORT.its_off b/tests/data/acpi/aarch64/virt/IORT.its_off
> index c10da4e61dd00e7eb062558a2735d49ca0b20620..0cf52b52f671637bf4dbc9e0fc80c3c73d0b01d3 100644
> GIT binary patch
> delta 18
> ZcmZ3(xQ3C-(?2L=4FdxM>(q%{ivTdM1ttIh
>
> delta 18
> ZcmZ3(xQ3C-(?2L=4FdxM^Yn>aivTdK1ttIh
>
> diff --git a/tests/data/acpi/aarch64/virt/IORT.smmuv3-dev b/tests/data/acpi/aarch64/virt/IORT.smmuv3-dev
> index 67be268f62afbf2d9459540984da5e9340afdaaa..43a15fe2bf6cc650ffcbceff86919ea892928c0e 100644
> GIT binary patch
> delta 19
> acmaFE^oEJc(?2LAhmnDS^~6T5Bt`%|fCYU3
>
> delta 19
> acmaFE^oEJc(?2LAhmnDS`P4?PBt`%|eg%C1
>
> diff --git a/tests/data/acpi/aarch64/virt/IORT.smmuv3-legacy b/tests/data/acpi/aarch64/virt/IORT.smmuv3-legacy
> index 41981a449fc306b80cccd87ddec3c593a8d72c07..5779d0e225a62b9cd70bebbacb7fd1e519c9e3c4 100644
> GIT binary patch
> delta 19
> acmbQjG=+)F(?2Lggpq-P)oUXc7b5^FiUXej
>
> delta 19
> acmbQjG=+)F(?2Lggpq-P*=Hjc7b5^Fhy$Mh
>
> diff --git a/tests/qtest/bios-tables-test-allowed-diff.h b/tests/qtest/bios-tables-test-allowed-diff.h
> index 3279638ad0..dfb8523c8b 100644
> --- a/tests/qtest/bios-tables-test-allowed-diff.h
> +++ b/tests/qtest/bios-tables-test-allowed-diff.h
> @@ -1,5 +1 @@
>  /* List of comma-separated changed AML files to ignore */
> -"tests/data/acpi/aarch64/virt/IORT",
> -"tests/data/acpi/aarch64/virt/IORT.its_off",
> -"tests/data/acpi/aarch64/virt/IORT.smmuv3-legacy",
> -"tests/data/acpi/aarch64/virt/IORT.smmuv3-dev",



^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 21/32] hw/arm/virt: Set PCI preserve_config for accel SMMUv3
  2025-11-03 14:58   ` Eric Auger
@ 2025-11-03 15:03     ` Eric Auger via
  2025-11-03 16:01     ` Shameer Kolothum
  1 sibling, 0 replies; 148+ messages in thread
From: Eric Auger via @ 2025-11-03 15:03 UTC (permalink / raw)
  To: Shameer Kolothum, qemu-arm, qemu-devel
  Cc: peter.maydell, jgg, nicolinc, ddutile, berrange, nathanc, mochs,
	smostafa, wangzhou1, jiangkunkun, jonathan.cameron, zhangfei.gao,
	zhenzhong.duan, yi.l.liu, kjaju



On 11/3/25 3:58 PM, Eric Auger wrote:
> 
> 
> On 10/31/25 11:49 AM, Shameer Kolothum wrote:
>> Introduce a new pci_preserve_config field in the virt machine state which
>> allows the generation of DSM #5. This field is only set if an accel SMMU
>> is instantiated.
>>
>> In a subsequent patch, SMMUv3 accel mode will make use of IORT RMR nodes
>> to enable nested translation of MSI doorbell addresses. IORT RMR requires
>> _DSM #5 to be set for the PCI host bridge so that the Guest kernel
>> preserves the PCI boot configuration.
>>
>> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
>> Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
>> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
>> ---
>>  hw/arm/virt-acpi-build.c | 8 ++++++++
>>  hw/arm/virt.c            | 4 ++++
>>  include/hw/arm/virt.h    | 1 +
>>  3 files changed, 13 insertions(+)
>>
>> diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
>> index 8bb6b60515..d51da6e27d 100644
>> --- a/hw/arm/virt-acpi-build.c
>> +++ b/hw/arm/virt-acpi-build.c
>> @@ -163,6 +163,14 @@ static void acpi_dsdt_add_pci(Aml *scope, const MemMapEntry *memmap,
>>          .pci_native_hotplug = !acpi_pcihp,
>>      };
>>  
>> +    /*
>> +     * Accel SMMU requires RMRs for MSI 1-1 mapping, which require _DSM for
>> +     * preserving PCI Boot Configurations
> as suggested in v4 you can be more precise and explictly state
> 
> _DSM function 5 (Ignore PCI Boot Configuration)
> 
>> +     */
>> +    if (vms->pci_preserve_config) {
>> +        cfg.preserve_config = true;
>> +    }
>> +
>>      if (vms->highmem_mmio) {
>>          cfg.mmio64 = memmap[VIRT_HIGH_PCIE_MMIO];
>>      }
>> diff --git a/hw/arm/virt.c b/hw/arm/virt.c
>> index 175023897a..8a347a6e39 100644
>> --- a/hw/arm/virt.c
>> +++ b/hw/arm/virt.c
>> @@ -3091,6 +3091,10 @@ static void virt_machine_device_plug_cb(HotplugHandler *hotplug_dev,
>>              }
>>  
>>              create_smmuv3_dev_dtb(vms, dev, bus);
>> +            if (object_property_find(OBJECT(dev), "accel") &&
> why do you need to test
> 
> object_property_find(OBJECT(dev), "accel")?

Hum, because at that moment it does not exist yet. So you can remove it
in 25/32 I think

Eric
> 
>> +                object_property_get_bool(OBJECT(dev), "accel", &error_abort)) {
>> +                vms->pci_preserve_config = true;
>> +            }
>>          }
>>      }
>>  
>> diff --git a/include/hw/arm/virt.h b/include/hw/arm/virt.h
>> index 04a09af354..60db5d40b2 100644
>> --- a/include/hw/arm/virt.h
>> +++ b/include/hw/arm/virt.h
>> @@ -182,6 +182,7 @@ struct VirtMachineState {
>>      bool ns_el2_virt_timer_irq;
>>      CXLState cxl_devices_state;
>>      bool legacy_smmuv3_present;
>> +    bool pci_preserve_config;
>>  };
>>  
>>  #define VIRT_ECAM_ID(high) (high ? VIRT_HIGH_PCIE_ECAM : VIRT_PCIE_ECAM)
With those changes taken into account
> Reviewed-by: Eric Auger <eric.auger@redhat.com>
> 
> Eric
> 



^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 26/32] hw/arm/smmuv3-accel: Add a property to specify RIL support
  2025-10-31 10:49 ` [PATCH v5 26/32] hw/arm/smmuv3-accel: Add a property to specify RIL support Shameer Kolothum
@ 2025-11-03 15:07   ` Eric Auger
  2025-11-03 16:08     ` Shameer Kolothum
  2025-11-04  9:38   ` Eric Auger
  1 sibling, 1 reply; 148+ messages in thread
From: Eric Auger @ 2025-11-03 15:07 UTC (permalink / raw)
  To: Shameer Kolothum, qemu-arm, qemu-devel
  Cc: peter.maydell, jgg, nicolinc, ddutile, berrange, nathanc, mochs,
	smostafa, wangzhou1, jiangkunkun, jonathan.cameron, zhangfei.gao,
	zhenzhong.duan, yi.l.liu, kjaju

Hi Shameer,

On 10/31/25 11:49 AM, Shameer Kolothum wrote:
> Currently QEMU SMMUv3 has RIL support by default. But if accelerated mode
> is enabled, RIL has to be compatible with host SMMUv3 support.
>
> Add a property so that the user can specify this.
>
> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>

I have not seen any reply on
https://lore.kernel.org/all/b6105534-4a17-4700-bb0b-e961babd10bb@redhat.com/

I guess you chose to restrict RIL to accel only. About AIDR consistency
check, did you have a look?

Eric


> ---
>  hw/arm/smmuv3-accel.c   | 15 +++++++++++++--
>  hw/arm/smmuv3-accel.h   |  4 ++++
>  hw/arm/smmuv3.c         | 12 ++++++++++++
>  include/hw/arm/smmuv3.h |  1 +
>  4 files changed, 30 insertions(+), 2 deletions(-)
>
> diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
> index 8b9f88dd8e..35298350cb 100644
> --- a/hw/arm/smmuv3-accel.c
> +++ b/hw/arm/smmuv3-accel.c
> @@ -63,10 +63,10 @@ smmuv3_accel_check_hw_compatible(SMMUv3State *s,
>          return false;
>      }
>  
> -    /* QEMU SMMUv3 supports Range Invalidation by default */
> +    /* User can disable QEMU SMMUv3 Range Invalidation support */
>      if (FIELD_EX32(info->idr[3], IDR3, RIL) !=
>                  FIELD_EX32(s->idr[3], IDR3, RIL)) {
> -        error_setg(errp, "Host SMMUv3 doesn't support Range Invalidation");
> +        error_setg(errp, "Host SMMUv3 differs in Range Invalidation support");
>          return false;
>      }
>  
> @@ -635,6 +635,17 @@ static const PCIIOMMUOps smmuv3_accel_ops = {
>      .get_msi_address_space = smmuv3_accel_get_msi_as,
>  };
>  
> +void smmuv3_accel_idr_override(SMMUv3State *s)
> +{
> +    if (!s->accel) {
> +        return;
> +    }
> +
> +    /* By default QEMU SMMUv3 has RIL. Update IDR3 if user has disabled it */
> +    if (!s->ril) {
> +        s->idr[3] = FIELD_DP32(s->idr[3], IDR3, RIL, 0);
> +    }
> +}
>  
>  /* Based on SMUUv3 GBPA configuration, attach a corresponding HWPT */
>  void smmuv3_accel_gbpa_update(SMMUv3State *s)
> diff --git a/hw/arm/smmuv3-accel.h b/hw/arm/smmuv3-accel.h
> index ee79548370..4f5b672712 100644
> --- a/hw/arm/smmuv3-accel.h
> +++ b/hw/arm/smmuv3-accel.h
> @@ -55,6 +55,7 @@ bool smmuv3_accel_issue_inv_cmd(SMMUv3State *s, void *cmd, SMMUDevice *sdev,
>                                  Error **errp);
>  void smmuv3_accel_gbpa_update(SMMUv3State *s);
>  void smmuv3_accel_reset(SMMUv3State *s);
> +void smmuv3_accel_idr_override(SMMUv3State *s);
>  #else
>  static inline void smmuv3_accel_init(SMMUv3State *s)
>  {
> @@ -83,6 +84,9 @@ static inline void smmuv3_accel_gbpa_update(SMMUv3State *s)
>  static inline void smmuv3_accel_reset(SMMUv3State *s)
>  {
>  }
> +static inline void smmuv3_accel_idr_override(SMMUv3State *s)
> +{
> +}
>  #endif
>  
>  #endif /* HW_ARM_SMMUV3_ACCEL_H */
> diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
> index f040e6b91e..b9d96f5762 100644
> --- a/hw/arm/smmuv3.c
> +++ b/hw/arm/smmuv3.c
> @@ -305,6 +305,7 @@ static void smmuv3_init_id_regs(SMMUv3State *s)
>      s->idr[5] = FIELD_DP32(s->idr[5], IDR5, GRAN16K, 1);
>      s->idr[5] = FIELD_DP32(s->idr[5], IDR5, GRAN64K, 1);
>      s->aidr = 0x1;
> +    smmuv3_accel_idr_override(s);
>  }
>  
>  static void smmuv3_reset(SMMUv3State *s)
> @@ -1936,6 +1937,13 @@ static bool smmu_validate_property(SMMUv3State *s, Error **errp)
>          return false;
>      }
>  #endif
> +    if (!s->accel) {
> +        if (!s->ril) {
> +            error_setg(errp, "ril can only be disabled if accel=on");
> +            return false;
> +        }
> +        return false;
> +    }
>      return true;
>  }
>  
> @@ -2057,6 +2065,8 @@ static const Property smmuv3_properties[] = {
>       */
>      DEFINE_PROP_STRING("stage", SMMUv3State, stage),
>      DEFINE_PROP_BOOL("accel", SMMUv3State, accel, false),
> +    /* RIL can be turned off for accel cases */
> +    DEFINE_PROP_BOOL("ril", SMMUv3State, ril, true),
>  };
>  
>  static void smmuv3_instance_init(Object *obj)
> @@ -2084,6 +2094,8 @@ static void smmuv3_class_init(ObjectClass *klass, const void *data)
>                                            "Enable SMMUv3 accelerator support."
>                                            "Allows host SMMUv3 to be configured "
>                                            "in nested mode for vfio-pci dev assignment");
> +    object_class_property_set_description(klass, "ril",
> +        "Disable range invalidation support (for accel=on)");
>  }
>  
>  static int smmuv3_notify_flag_changed(IOMMUMemoryRegion *iommu,
> diff --git a/include/hw/arm/smmuv3.h b/include/hw/arm/smmuv3.h
> index 6b9c27a9c4..95202c2757 100644
> --- a/include/hw/arm/smmuv3.h
> +++ b/include/hw/arm/smmuv3.h
> @@ -68,6 +68,7 @@ struct SMMUv3State {
>      bool accel;
>      struct SMMUv3AccelState *s_accel;
>      Error *migration_blocker;
> +    bool ril;
>  };
>  
>  typedef enum {
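
One observation on the smmu_validate_property() hunk quoted above: with
accel=off and ril left at its default of true, the inner check is skipped but
the unconditional "return false" is still reached without errp being set, which
would appear to fail validation for every non-accel SMMUv3. If the intent is
only to reject ril=off without accel, that block would presumably end with
"return true" instead (illustrative reading, worth confirming):

    if (!s->accel) {
        if (!s->ril) {
            error_setg(errp, "ril can only be disabled if accel=on");
            return false;
        }
        return true;
    }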



^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 18/32] hw/arm/smmuv3: Initialize ID registers early during realize()
  2025-10-31 10:49 ` [PATCH v5 18/32] hw/arm/smmuv3: Initialize ID registers early during realize() Shameer Kolothum
  2025-11-01  0:24   ` Nicolin Chen
  2025-11-03 13:57   ` Jonathan Cameron via
@ 2025-11-03 15:11   ` Eric Auger
  2 siblings, 0 replies; 148+ messages in thread
From: Eric Auger @ 2025-11-03 15:11 UTC (permalink / raw)
  To: Shameer Kolothum, qemu-arm, qemu-devel
  Cc: peter.maydell, jgg, nicolinc, ddutile, berrange, nathanc, mochs,
	smostafa, wangzhou1, jiangkunkun, jonathan.cameron, zhangfei.gao,
	zhenzhong.duan, yi.l.liu, kjaju



On 10/31/25 11:49 AM, Shameer Kolothum wrote:
> Factor out ID register init into smmuv3_init_id_regs() and call it from
> realize(). This ensures ID registers are initialized early for use in the
> accelerated SMMUv3 path and will be utilized in a subsequent patch.
>
> Other registers remain initialized in smmuv3_reset().
>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>

Eric
> ---
>  hw/arm/smmuv3.c | 15 ++++++++++++---
>  1 file changed, 12 insertions(+), 3 deletions(-)
>
> diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
> index 15173ddc9c..fae545f35c 100644
> --- a/hw/arm/smmuv3.c
> +++ b/hw/arm/smmuv3.c
> @@ -258,7 +258,12 @@ void smmuv3_record_event(SMMUv3State *s, SMMUEventInfo *info)
>      info->recorded = true;
>  }
>  
> -static void smmuv3_init_regs(SMMUv3State *s)
> +/*
> + * Called during realize(), as the ID registers will be accessed early in the
> + * SMMUv3 accel path for feature compatibility checks. The remaining registers
> + * are initialized later in smmuv3_reset().
> + */
> +static void smmuv3_init_id_regs(SMMUv3State *s)
>  {
>      /* Based on sys property, the stages supported in smmu will be advertised.*/
>      if (s->stage && !strcmp("2", s->stage)) {
> @@ -298,7 +303,11 @@ static void smmuv3_init_regs(SMMUv3State *s)
>      s->idr[5] = FIELD_DP32(s->idr[5], IDR5, GRAN4K, 1);
>      s->idr[5] = FIELD_DP32(s->idr[5], IDR5, GRAN16K, 1);
>      s->idr[5] = FIELD_DP32(s->idr[5], IDR5, GRAN64K, 1);
> +    s->aidr = 0x1;
> +}
>  
> +static void smmuv3_reset(SMMUv3State *s)
> +{
>      s->cmdq.base = deposit64(s->cmdq.base, 0, 5, SMMU_CMDQS);
>      s->cmdq.prod = 0;
>      s->cmdq.cons = 0;
> @@ -310,7 +319,6 @@ static void smmuv3_init_regs(SMMUv3State *s)
>  
>      s->features = 0;
>      s->sid_split = 0;
> -    s->aidr = 0x1;
>      s->cr[0] = 0;
>      s->cr0ack = 0;
>      s->irq_ctrl = 0;
> @@ -1915,7 +1923,7 @@ static void smmu_reset_exit(Object *obj, ResetType type)
>          c->parent_phases.exit(obj, type);
>      }
>  
> -    smmuv3_init_regs(s);
> +    smmuv3_reset(s);
>      smmuv3_accel_reset(s);
>  }
>  
> @@ -1947,6 +1955,7 @@ static void smmu_realize(DeviceState *d, Error **errp)
>      sysbus_init_mmio(dev, &sys->iomem);
>  
>      smmu_init_irq(s, dev);
> +    smmuv3_init_id_regs(s);
>  }
>  
>  static const VMStateDescription vmstate_smmuv3_queue = {



^ permalink raw reply	[flat|nested] 148+ messages in thread

* RE: [PATCH v5 13/32] hw/arm/smmuv3-accel: Add nested vSTE install/uninstall support
  2025-10-31 23:52   ` Nicolin Chen
  2025-11-01  0:20     ` Nicolin Chen
@ 2025-11-03 15:11     ` Shameer Kolothum
  2025-11-03 17:32       ` Nicolin Chen
  1 sibling, 1 reply; 148+ messages in thread
From: Shameer Kolothum @ 2025-11-03 15:11 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: qemu-arm@nongnu.org, qemu-devel@nongnu.org, eric.auger@redhat.com,
	peter.maydell@linaro.org, Jason Gunthorpe, ddutile@redhat.com,
	berrange@redhat.com, Nathan Chen, Matt Ochs, smostafa@google.com,
	wangzhou1@hisilicon.com, jiangkunkun@huawei.com,
	jonathan.cameron@huawei.com, zhangfei.gao@linaro.org,
	zhenzhong.duan@intel.com, yi.l.liu@intel.com, Krishnakant Jaju



> -----Original Message-----
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: 31 October 2025 23:53
> To: Shameer Kolothum <skolothumtho@nvidia.com>
> Cc: qemu-arm@nongnu.org; qemu-devel@nongnu.org;
> eric.auger@redhat.com; peter.maydell@linaro.org; Jason Gunthorpe
> <jgg@nvidia.com>; ddutile@redhat.com; berrange@redhat.com; Nathan
> Chen <nathanc@nvidia.com>; Matt Ochs <mochs@nvidia.com>;
> smostafa@google.com; wangzhou1@hisilicon.com;
> jiangkunkun@huawei.com; jonathan.cameron@huawei.com;
> zhangfei.gao@linaro.org; zhenzhong.duan@intel.com; yi.l.liu@intel.com;
> Krishnakant Jaju <kjaju@nvidia.com>
> Subject: Re: [PATCH v5 13/32] hw/arm/smmuv3-accel: Add nested vSTE
> install/uninstall support
> 
> On Fri, Oct 31, 2025 at 10:49:46AM +0000, Shameer Kolothum wrote:
> > +static bool
> > +smmuv3_accel_alloc_vdev(SMMUv3AccelDevice *accel_dev, int sid, Error
> **errp)
> > +{
> > +    SMMUViommu *vsmmu = accel_dev->vsmmu;
> > +    IOMMUFDVdev *vdev;
> > +    uint32_t vdevice_id;
> > +
> > +    if (!accel_dev->idev || accel_dev->vdev) {
> > +        return true;
> > +    }
> 
> We probably don't need to check !accel_dev->dev. It should have
> been blocked by its caller, which does block !accel_dev->vsmmu.
> Once we fix the missing "accel_dev->vsmmu NULL", it should work.

Ok.

> 
> > +
> > +    if (!iommufd_backend_alloc_vdev(vsmmu->iommufd, accel_dev->idev-
> >devid,
> > +                                    vsmmu->viommu.viommu_id, sid,
> > +                                    &vdevice_id, errp)) {
> > +            return false;
> > +    }
> > +    if (!host_iommu_device_iommufd_attach_hwpt(accel_dev->idev,
> > +                                               vsmmu->bypass_hwpt_id, errp)) {
> > +        iommufd_backend_free_id(vsmmu->iommufd, vdevice_id);
> > +        return false;
> > +    }
> 
> This should check SMMUEN bit?
> 
> Linux driver (as an example) seems to set CMDQEN and install all
> the default bypass STEs, before SMMUEN=1.

Yeah. For RMR I think.
 
> In this case, the target hwpt here should follow guest's GBPA.
>
> > +static bool
> > +smmuv3_accel_dev_uninstall_nested_ste(SMMUv3AccelDevice *accel_dev,
> bool abort,
> > +                                      Error **errp)
> > +{
> > +    HostIOMMUDeviceIOMMUFD *idev = accel_dev->idev;
> > +    SMMUS1Hwpt *s1_hwpt = accel_dev->s1_hwpt;
> > +    uint32_t hwpt_id;
> > +
> > +    if (!s1_hwpt || !accel_dev->vsmmu) {
> > +        return true;
> > +    }
> > +
> > +    if (abort) {
> > +        hwpt_id = accel_dev->vsmmu->abort_hwpt_id;
> > +    } else {
> > +        hwpt_id = accel_dev->vsmmu->bypass_hwpt_id;
> > +    }
> 
> This should probably check SMMUEN/GBPA as well.
> 
> Likely we need "enabled" and "gbpa_abort" flags in SMMUState.
> 
> > +static bool
> > +smmuv3_accel_dev_install_nested_ste(SMMUv3AccelDevice *accel_dev,
> > +                                    uint32_t data_type, uint32_t data_len,
> > +                                    void *data, Error **errp)
> > +{
> > +    SMMUViommu *vsmmu = accel_dev->vsmmu;
> > +    SMMUS1Hwpt *s1_hwpt = accel_dev->s1_hwpt;
> > +    HostIOMMUDeviceIOMMUFD *idev = accel_dev->idev;
> > +    uint32_t flags = 0;
> > +
> > +    if (!idev || !vsmmu) {
> > +        error_setg(errp, "Device 0x%x has no associated IOMMU dev or
> vIOMMU",
> > +                   smmu_get_sid(&accel_dev->sdev));
> > +        return false;
> > +    }
> > +
> > +    if (s1_hwpt) {
> > +        if (!smmuv3_accel_dev_uninstall_nested_ste(accel_dev, true, errp)) {
> > +            return false;
> > +        }
> > +    }
> 
> I think we could have some improvements here.
> 
> The current flow is:
>     (attached to s1_hwpt1)
>     attach to bypass/abort_hwpt // no issue though.
>     free s1_hwpt1
>     alloc s2_hwpt2
>     attach to s2_hwpt2
> 
> It could have been a flow like replace() in the kernel:
>     (attached to s1_hwpt1)
>     alloc s2_hwpt2
>     attach to s2_hwpt2 /* skipping bypass/abort */
>     free s1_hwpt

Not sure I get the above. You mean in this _install_nested_ste() path
we have a case where we need to allocate an S2 HWPT and attach?

> > +smmuv3_accel_install_nested_ste(SMMUv3State *s, SMMUDevice *sdev,
> int sid,\
> [...]
> > +    config = STE_CONFIG(&ste);
> > +    if (!STE_VALID(&ste) || !STE_CFG_S1_ENABLED(config)) {
> > +        if (!smmuv3_accel_dev_uninstall_nested_ste(accel_dev,
> > +                                                   STE_CFG_ABORT(config),
> 
> This smmuv3_accel_uninstall_nested_ste() feels a bit redundant now.

Agree. It crossed my mind too.

> 
> Perhaps we could try something like this:
> 
> #define accel_dev_to_smmuv3(dev) ARM_SMMUV3(&dev->sdev.smmu)
> 
> static bool smmuv3_accel_dev_install_nested_ste(SMMUv3AccelDevice
> *accel_dev,
>                                                 int sid, STE *ste)
> {
>     SMMUv3State *s = accel_dev_to_smmuv3(accel_dev);
>     HostIOMMUDeviceIOMMUFD *idev = accel_dev->idev;
>     uint32_t config = STE_CONFIG(ste);
>     SMMUS1Hwpt *s1_hwpt = NULL;
>     uint64_t ste_0, ste_1;
>     uint32_t hwpt_id = 0;
> 
>     if (!s->enabled) {
>         if (s->gbpa_abort) {
>             hwpt_id = accel_dev->vsmmu->abort_hwpt_id;
>         } else {
>             hwpt_id = accel_dev->vsmmu->bypass_hwpt_id;
>         }
>     } else {
>         if (!STE_VALID(ste) || STE_CFG_ABORT(config)) {
>             hwpt_id = accel_dev->vsmmu->abort_hwpt_id;
>         } else if (STE_CFG_BYPASS(config)) {
>             hwpt_id = accel_dev->vsmmu->bypass_hwpt_id;
>         } else {
>             // FIXME handle STE_CFG_S2_ENABLED()
>         }
>     }
> 
>     if (!hwpt_id) {
>         uint64_t ste_0 = (uint64_t)ste->word[0] | (uint64_t)ste->word[1] << 32;
>         uint64_t ste_1 = (uint64_t)ste->word[2] | (uint64_t)ste->word[3] << 32;
>         struct iommu_hwpt_arm_smmuv3 nested_data = {
>             .ste[2] = {
>                 cpu_to_le64(ste_0 & STE0_MASK),
>                 cpu_to_le64(ste_1 & STE1_MASK),
>             },
>         };
> 
>         trace_smmuv3_accel_install_nested_ste(sid, nested_data.ste[1],
>                                               nested_data.ste[0]);
>         s1_hwpt = g_new0(SMMUS1Hwpt, 1);
> 	[...]
> 	iommufd_backend_alloc_hwpt(..., &s1_hwpt->hwpt_id);
>         hwpt_id = s1_hwpt->hwpt_id;
>     }
> 
>     host_iommu_device_iommufd_attach_hwpt(.., hwpt_id);
> 
>     if (accel_dev->s1_hwpt) {
>         iommufd_backend_free_id(idev->iommufd, accel_dev->s1_hwpt-
> >hwpt_id);
>     }
>     accel_dev->s1_hwpt = s1_hwpt;
>     return true;
> }

Ok. I will take a look at this.

> > +bool smmuv3_accel_install_nested_ste_range(SMMUv3State *s,
> SMMUSIDRange *range,
> > +                                           Error **errp)
> > +{
> > +    SMMUv3AccelState *s_accel = s->s_accel;
> > +    SMMUv3AccelDevice *accel_dev;
> > +
> > +    if (!s_accel || !s_accel->vsmmu) {
> > +        return true;
> > +    }
> > +
> > +    QLIST_FOREACH(accel_dev, &s_accel->vsmmu->device_list, next) {
> > +        uint32_t sid = smmu_get_sid(&accel_dev->sdev);
> > +
> > +        if (sid >= range->start && sid <= range->end) {
> > +            if (!smmuv3_accel_install_nested_ste(s, &accel_dev->sdev,
> > +                                                 sid, errp)) {
> > +                return false;
> > +            }
> > +        }
> 
> This is a bit tricky..
> 
> I think CFGI_STE_RANGE shouldn't stop in the middle, if one of the
> STEs fails.

True.

> That being said, HW doesn't seem to propagate C_BAD_STE during a
> CFGI_STE or CFGI_STE_RANGE, IIUIC. It reports C_BAD_STE event when
> a transaction starts. If we want to perfectly mimic the hardware,
> we'd have to set up a bad STE down to the HW, which will trigger a
> C_BAD_STE vevent to be forwarded by vEVENTQ.

I don't think we need to mimic that behaviour. We could return an event
to the guest from here if required, or just have error_report().
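
Something along these lines, maybe (untested sketch reusing the loop from
this patch; whether we error_report() or queue a guest event is TBD):

    QLIST_FOREACH(accel_dev, &s_accel->vsmmu->device_list, next) {
        uint32_t sid = smmu_get_sid(&accel_dev->sdev);
        Error *local_err = NULL;

        if (sid < range->start || sid > range->end) {
            continue;
        }
        /* Report the failing STE and keep walking the SID range */
        if (!smmuv3_accel_install_nested_ste(s, &accel_dev->sdev, sid,
                                             &local_err)) {
            error_report_err(local_err);
        }
    }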

Thanks,
Shameer


^ permalink raw reply	[flat|nested] 148+ messages in thread

* RE: [PATCH v5 16/32] hw/arm/smmuv3-accel: Make use of get_msi_address_space() callback
  2025-10-31 23:57   ` Nicolin Chen
@ 2025-11-03 15:19     ` Shameer Kolothum
  2025-11-03 17:34       ` Nicolin Chen
  0 siblings, 1 reply; 148+ messages in thread
From: Shameer Kolothum @ 2025-11-03 15:19 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: qemu-arm@nongnu.org, qemu-devel@nongnu.org, eric.auger@redhat.com,
	peter.maydell@linaro.org, Jason Gunthorpe, ddutile@redhat.com,
	berrange@redhat.com, Nathan Chen, Matt Ochs, smostafa@google.com,
	wangzhou1@hisilicon.com, jiangkunkun@huawei.com,
	jonathan.cameron@huawei.com, zhangfei.gao@linaro.org,
	zhenzhong.duan@intel.com, yi.l.liu@intel.com, Krishnakant Jaju



> -----Original Message-----
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: 31 October 2025 23:58
> To: Shameer Kolothum <skolothumtho@nvidia.com>
> Cc: qemu-arm@nongnu.org; qemu-devel@nongnu.org;
> eric.auger@redhat.com; peter.maydell@linaro.org; Jason Gunthorpe
> <jgg@nvidia.com>; ddutile@redhat.com; berrange@redhat.com; Nathan
> Chen <nathanc@nvidia.com>; Matt Ochs <mochs@nvidia.com>;
> smostafa@google.com; wangzhou1@hisilicon.com;
> jiangkunkun@huawei.com; jonathan.cameron@huawei.com;
> zhangfei.gao@linaro.org; zhenzhong.duan@intel.com; yi.l.liu@intel.com;
> Krishnakant Jaju <kjaju@nvidia.com>
> Subject: Re: [PATCH v5 16/32] hw/arm/smmuv3-accel: Make use of
> get_msi_address_space() callback
> 
> On Fri, Oct 31, 2025 at 10:49:49AM +0000, Shameer Kolothum wrote:
> > +static AddressSpace *smmuv3_accel_get_msi_as(PCIBus *bus, void
> *opaque,
> > +                                             int devfn)
> > +{
> > +    SMMUState *bs = opaque;
> > +    SMMUPciBus *sbus = smmu_get_sbus(bs, bus);
> > +    SMMUv3AccelDevice *accel_dev = smmuv3_accel_get_dev(bs, sbus, bus,
> devfn);
> > +    SMMUDevice *sdev = &accel_dev->sdev;
> > +
> > +    /*
> > +     * If the assigned vfio-pci dev has S1 translation enabled by Guest,
> > +     * return IOMMU address space for MSI translation. Otherwise, return
> > +     * system address space.
> > +     */
> > +    if (accel_dev->s1_hwpt) {
> > +        return &sdev->as;
> > +    } else {
> > +        return &address_space_memory;
> 
> Should we use the global shared_as? Or is this on purpose to align
> with the "&address_space_memory" in kvm_arch_fixup_msi_route()?

Yes, that's on purpose.

Another way to handle it, if "address_space_memory" is a complete no-no,
is to return NULL here and handle it in pci_device_iommu_msi_address_space().

I like the current approach. We could possibly update the doc for
get_msi_address_space() in the previous patch to make it clear that
"&address_space_memory" should be returned if no MSI translation is required.
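
Something like this, perhaps (rough wording only; the callback signature is
assumed to match how it is used in this patch):

    /**
     * @get_msi_address_space: get the address space used for translating
     * MSI writes from devices on this bus.
     *
     * Return &address_space_memory if the device's MSIs need no vIOMMU
     * translation; otherwise return the device's IOMMU address space.
     */
    AddressSpace *(*get_msi_address_space)(PCIBus *bus, void *opaque,
                                           int devfn);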

Thanks,
Shameer






^ permalink raw reply	[flat|nested] 148+ messages in thread

* RE: [PATCH v5 17/32] hw/arm/smmuv3-accel: Add support to issue invalidation cmd to host
  2025-11-01  0:35   ` Nicolin Chen via
@ 2025-11-03 15:28     ` Shameer Kolothum
  2025-11-03 17:43       ` Nicolin Chen
  0 siblings, 1 reply; 148+ messages in thread
From: Shameer Kolothum @ 2025-11-03 15:28 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: qemu-arm@nongnu.org, qemu-devel@nongnu.org, eric.auger@redhat.com,
	peter.maydell@linaro.org, Jason Gunthorpe, ddutile@redhat.com,
	berrange@redhat.com, Nathan Chen, Matt Ochs, smostafa@google.com,
	wangzhou1@hisilicon.com, jiangkunkun@huawei.com,
	jonathan.cameron@huawei.com, zhangfei.gao@linaro.org,
	zhenzhong.duan@intel.com, yi.l.liu@intel.com, Krishnakant Jaju



> -----Original Message-----
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: 01 November 2025 00:35
> To: Shameer Kolothum <skolothumtho@nvidia.com>
> Cc: qemu-arm@nongnu.org; qemu-devel@nongnu.org;
> eric.auger@redhat.com; peter.maydell@linaro.org; Jason Gunthorpe
> <jgg@nvidia.com>; ddutile@redhat.com; berrange@redhat.com; Nathan
> Chen <nathanc@nvidia.com>; Matt Ochs <mochs@nvidia.com>;
> smostafa@google.com; wangzhou1@hisilicon.com;
> jiangkunkun@huawei.com; jonathan.cameron@huawei.com;
> zhangfei.gao@linaro.org; zhenzhong.duan@intel.com; yi.l.liu@intel.com;
> Krishnakant Jaju <kjaju@nvidia.com>
> Subject: Re: [PATCH v5 17/32] hw/arm/smmuv3-accel: Add support to issue
> invalidation cmd to host
> 
> On Fri, Oct 31, 2025 at 10:49:50AM +0000, Shameer Kolothum wrote:
> > Provide a helper and use that to issue the invalidation cmd to host SMMUv3.
> > We only issue one cmd at a time for now.
> >
> > Support for batching of commands will be added later after analysing the
> > impact.
> >
> > Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> > Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
> > Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
> 
> I think I have given my tag in v4.. anyway..

Sorry I missed that.
> 
> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>

Thanks.

> >          case SMMU_CMD_TLBI_NH_VAA:
> >          case SMMU_CMD_TLBI_NH_VA:
> > +        {
> > +            Error *local_err = NULL;
> > +
> >              if (!STAGE1_SUPPORTED(s)) {
> >                  cmd_error = SMMU_CERROR_ILL;
> >                  break;
> >              }
> >              smmuv3_range_inval(bs, &cmd, SMMU_STAGE_1);
> > +            if (!smmuv3_accel_issue_inv_cmd(s, &cmd, NULL, &local_err)) {
> > +                error_report_err(local_err);
> > +                cmd_error = SMMU_CERROR_ILL;
> > +                break;
> > +            }
> >              break;
> > +        }
> 
> The local_err isn't used anywhere but by the error_report_err()
> alone. So, it could be moved into smmuv3_accel_issue_inv_cmd().

Though that is true, it follows the same pattern as the
smmuv3_accel_install_nested_ste()/_range() functions. The general
idea is that we pass the errp to the accel functions and report or
propagate the error from here.

Thanks,
Shameer


^ permalink raw reply	[flat|nested] 148+ messages in thread

* RE: [PATCH v5 19/32] hw/arm/smmuv3-accel: Get host SMMUv3 hw info and validate
  2025-11-01 14:20   ` Zhangfei Gao
@ 2025-11-03 15:42     ` Shameer Kolothum
  2025-11-03 17:16       ` Eric Auger
  0 siblings, 1 reply; 148+ messages in thread
From: Shameer Kolothum @ 2025-11-03 15:42 UTC (permalink / raw)
  To: Zhangfei Gao
  Cc: qemu-arm@nongnu.org, qemu-devel@nongnu.org, eric.auger@redhat.com,
	peter.maydell@linaro.org, Jason Gunthorpe, Nicolin Chen,
	ddutile@redhat.com, berrange@redhat.com, Nathan Chen, Matt Ochs,
	smostafa@google.com, wangzhou1@hisilicon.com,
	jiangkunkun@huawei.com, jonathan.cameron@huawei.com,
	zhenzhong.duan@intel.com, yi.l.liu@intel.com, Krishnakant Jaju



> -----Original Message-----
> From: Zhangfei Gao <zhangfei.gao@linaro.org>
> Sent: 01 November 2025 14:20
> To: Shameer Kolothum <skolothumtho@nvidia.com>
> Cc: qemu-arm@nongnu.org; qemu-devel@nongnu.org;
> eric.auger@redhat.com; peter.maydell@linaro.org; Jason Gunthorpe
> <jgg@nvidia.com>; Nicolin Chen <nicolinc@nvidia.com>;
> ddutile@redhat.com; berrange@redhat.com; Nathan Chen
> <nathanc@nvidia.com>; Matt Ochs <mochs@nvidia.com>;
> smostafa@google.com; wangzhou1@hisilicon.com;
> jiangkunkun@huawei.com; jonathan.cameron@huawei.com;
> zhenzhong.duan@intel.com; yi.l.liu@intel.com; Krishnakant Jaju
> <kjaju@nvidia.com>
> Subject: Re: [PATCH v5 19/32] hw/arm/smmuv3-accel: Get host SMMUv3 hw
> info and validate
> 
> Hi, Shameer
> 
> On Fri, 31 Oct 2025 at 18:54, Shameer Kolothum
> <skolothumtho@nvidia.com> wrote:
> >
> > Just before the device gets attached to the SMMUv3, make sure QEMU
> > SMMUv3 features are compatible with the host SMMUv3.
> >
> > Not all fields in the host SMMUv3 IDR registers are meaningful for
> userspace.
> > Only the following fields can be used:
> >
> >   - IDR0: ST_LEVEL, TERM_MODEL, STALL_MODEL, TTENDIAN, CD2L, ASID16,
> TTF
> >   - IDR1: SIDSIZE, SSIDSIZE
> >   - IDR3: BBML, RIL
> >   - IDR5: VAX, GRAN64K, GRAN16K, GRAN4K
> >
> > For now, the check is to make sure the features are in sync to enable
> > basic accelerated SMMUv3 support.
> >
> > Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
> > ---
> >  hw/arm/smmuv3-accel.c | 100
> > ++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 100 insertions(+)
> >
> > diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c index
> > a2deda3c32..8b9f88dd8e 100644
> > --- a/hw/arm/smmuv3-accel.c
> > +++ b/hw/arm/smmuv3-accel.c
> > @@ -28,6 +28,98 @@ MemoryRegion root;
> >  MemoryRegion sysmem;
> >  static AddressSpace *shared_as_sysmem;
> >
> > +static bool
> > +smmuv3_accel_check_hw_compatible(SMMUv3State *s,
> > +                                 struct iommu_hw_info_arm_smmuv3 *info,
> > +                                 Error **errp) {
> 
> > +    /* QEMU SMMUv3 supports architecture version 3.1 */
> > +    if (info->aidr < s->aidr) {
> > +        error_setg(errp, "Host SMMUv3 architecture version not compatible");
> > +        return false;
> > +    }
> 
> Why has this requirement?

Right. That was added based on a comment from Eric here,
https://lore.kernel.org/all/b6105534-4a17-4700-bb0b-e961babd10bb@redhat.com/

> We have SMMUv3 version 3.0 and info->aidr = 0.
> and qemu fails to boot here.

Hmm.. It is true that there is hardware out there which implements a cross
section of features from different architecture revisions.

Since we are checking the ID registers that matter here individually anyway,
I am not sure whether we should reject hosts with an AIDR mismatch or just
warn the user.
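
e.g. something like this instead of the hard failure (untested; keeps the
individual IDR field checks as the real gate):

    if (info->aidr < s->aidr) {
        warn_report("Host SMMUv3 AIDR (0x%x) is older than the emulated one "
                    "(0x%x); relying on individual IDR feature checks",
                    info->aidr, s->aidr);
    }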

Thanks,
Shameer




^ permalink raw reply	[flat|nested] 148+ messages in thread

* RE: [PATCH v5 23/32] hw/arm/virt-acpi-build: Add IORT RMR regions to handle MSI nested binding
  2025-11-03 14:53   ` Jonathan Cameron via
@ 2025-11-03 15:43     ` Shameer Kolothum
  0 siblings, 0 replies; 148+ messages in thread
From: Shameer Kolothum @ 2025-11-03 15:43 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: qemu-arm@nongnu.org, qemu-devel@nongnu.org, eric.auger@redhat.com,
	peter.maydell@linaro.org, Jason Gunthorpe, Nicolin Chen,
	ddutile@redhat.com, berrange@redhat.com, Nathan Chen, Matt Ochs,
	smostafa@google.com, wangzhou1@hisilicon.com,
	jiangkunkun@huawei.com, zhangfei.gao@linaro.org,
	zhenzhong.duan@intel.com, yi.l.liu@intel.com, Krishnakant Jaju



> -----Original Message-----
> From: Jonathan Cameron <jonathan.cameron@huawei.com>
> Sent: 03 November 2025 14:54
> To: Shameer Kolothum <skolothumtho@nvidia.com>
> Cc: qemu-arm@nongnu.org; qemu-devel@nongnu.org;
> eric.auger@redhat.com; peter.maydell@linaro.org; Jason Gunthorpe
> <jgg@nvidia.com>; Nicolin Chen <nicolinc@nvidia.com>; ddutile@redhat.com;
> berrange@redhat.com; Nathan Chen <nathanc@nvidia.com>; Matt Ochs
> <mochs@nvidia.com>; smostafa@google.com; wangzhou1@hisilicon.com;
> jiangkunkun@huawei.com; zhangfei.gao@linaro.org;
> zhenzhong.duan@intel.com; yi.l.liu@intel.com; Krishnakant Jaju
> <kjaju@nvidia.com>
> Subject: Re: [PATCH v5 23/32] hw/arm/virt-acpi-build: Add IORT RMR regions
> to handle MSI nested binding
> 
> On Fri, 31 Oct 2025 10:49:56 +0000
> Shameer Kolothum <skolothumtho@nvidia.com> wrote:
> 
> > From: Eric Auger <eric.auger@redhat.com>
> >
> > To handle SMMUv3 accel=on mode(which configures the host SMMUv3 in
> > nested mode), it is practical to expose the guest with reserved memory
> > regions
> > (RMRs) covering the IOVAs used by the host kernel to map physical MSI
> > doorbells.
> >
> > Those IOVAs belong to [0x8000000, 0x8100000] matching MSI_IOVA_BASE
> > and MSI_IOVA_LENGTH definitions in kernel arm-smmu-v3 driver. This is
> > the window used to allocate IOVAs matching physical MSI doorbells.
> >
> > With those RMRs, the guest is forced to use a flat mapping for this range.
> > Hence the assigned device is programmed with one IOVA from this range.
> > Stage 1, owned by the guest has a flat mapping for this IOVA. Stage2,
> > owned by the VMM then enforces a mapping from this IOVA to the
> > physical MSI doorbell.
> >
> > The creation of those RMR nodes is only relevant if nested stage SMMU
> > is in use, along with VFIO. As VFIO devices can be hotplugged, all
> > RMRs need to be created in advance.
> >
> > Signed-off-by: Eric Auger <eric.auger@redhat.com>
> > Suggested-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> > Signed-off-by: Shameer Kolothum
> <shameerali.kolothum.thodi@huawei.com>
> > Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
> > Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
> 
> One small question inline on the id increment.
> 
> With that tidied up.
> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> 
> > @@ -447,10 +475,70 @@ static void create_rc_its_idmaps(GArray
> *its_idmaps, GArray *smmuv3_devs)
> >      }
> >  }
> >
> > +static void
> > +build_iort_rmr_nodes(GArray *table_data, GArray *smmuv3_devices,
> > +uint32_t *id) {
> > +    AcpiIortSMMUv3Dev *sdev;
> > +    AcpiIortIdMapping *idmap;
> > +    int i;
> > +
> > +    for (i = 0; i < smmuv3_devices->len; i++) {
> > +        uint16_t rmr_len;
> > +        int bdf;
> > +
> > +        sdev = &g_array_index(smmuv3_devices, AcpiIortSMMUv3Dev, i);
> > +        if (!sdev->accel) {
> > +            continue;
> > +        }
> > +
> > +        /*
> > +         * Spec reference:Arm IO Remapping Table(IORT), ARM DEN 0049E.d,
> > +         * Section 3.1.1.5 "Reserved Memory Range node"
> > +         */
> > +        idmap = &g_array_index(sdev->rc_smmu_idmaps, AcpiIortIdMapping,
> 0);
> > +        bdf = idmap->input_base;
> > +        rmr_len = IORT_RMR_COMMON_HEADER_SIZE
> > +                 + (IORT_RMR_NUM_ID_MAPPINGS * ID_MAPPING_ENTRY_SIZE)
> > +                 + (IORT_RMR_NUM_MEM_RANGE_DESC *
> > + IORT_RMR_MEM_RANGE_DESC_SIZE);
> > +
> > +        /* Table 18 Reserved Memory Range Node */
> > +        build_append_int_noprefix(table_data, 6 /* RMR */, 1); /* Type */
> > +        /* Length */
> > +        build_append_int_noprefix(table_data, rmr_len, 2);
> > +        build_append_int_noprefix(table_data, 3, 1); /* Revision */
> > +        build_append_int_noprefix(table_data, (*id)++, 4); /*
> > + Identifier */
> So *id is incremented here and...
> > +        /* Number of ID mappings */
> > +        build_append_int_noprefix(table_data,
> IORT_RMR_NUM_ID_MAPPINGS, 4);
> > +        /* Reference to ID Array */
> > +        build_append_int_noprefix(table_data,
> > + IORT_RMR_COMMON_HEADER_SIZE, 4);
> > +
> > +        /* RMR specific data */
> > +
> > +        /* Flags */
> > +        build_append_int_noprefix(table_data, IORT_RMR_FLAGS, 4);
> > +        /* Number of Memory Range Descriptors */
> > +        build_append_int_noprefix(table_data,
> IORT_RMR_NUM_MEM_RANGE_DESC, 4);
> > +        /* Reference to Memory Range Descriptors */
> > +        build_append_int_noprefix(table_data,
> IORT_RMR_COMMON_HEADER_SIZE +
> > +                        (IORT_RMR_NUM_ID_MAPPINGS *
> ID_MAPPING_ENTRY_SIZE), 4);
> > +        build_iort_id_mapping(table_data, bdf, idmap->id_count, sdev->offset,
> > +                              1);
> > +
> > +        /* Table 19 Memory Range Descriptor */
> > +
> > +        /* Physical Range offset */
> > +        build_append_int_noprefix(table_data, MSI_IOVA_BASE, 8);
> > +        /* Physical Range length */
> > +        build_append_int_noprefix(table_data, MSI_IOVA_LENGTH, 8);
> > +        build_append_int_noprefix(table_data, 0, 4); /* Reserved */
> > +        *id += 1;
> here. Why this second one? Perhaps a comment if this is intended.

Oops. I forgot to delete that one.

Thanks,
Shameer


^ permalink raw reply	[flat|nested] 148+ messages in thread

* RE: [PATCH v5 06/32] hw/arm/smmuv3-accel: Initialize shared system address space
  2025-11-03 13:12   ` Jonathan Cameron via
@ 2025-11-03 15:53     ` Shameer Kolothum
  0 siblings, 0 replies; 148+ messages in thread
From: Shameer Kolothum @ 2025-11-03 15:53 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: qemu-arm@nongnu.org, qemu-devel@nongnu.org, eric.auger@redhat.com,
	peter.maydell@linaro.org, Jason Gunthorpe, Nicolin Chen,
	ddutile@redhat.com, berrange@redhat.com, Nathan Chen, Matt Ochs,
	smostafa@google.com, wangzhou1@hisilicon.com,
	jiangkunkun@huawei.com, zhangfei.gao@linaro.org,
	zhenzhong.duan@intel.com, yi.l.liu@intel.com, Krishnakant Jaju



> -----Original Message-----
> From: Jonathan Cameron <jonathan.cameron@huawei.com>
> Sent: 03 November 2025 13:12
> To: Shameer Kolothum <skolothumtho@nvidia.com>
> Cc: qemu-arm@nongnu.org; qemu-devel@nongnu.org;
> eric.auger@redhat.com; peter.maydell@linaro.org; Jason Gunthorpe
> <jgg@nvidia.com>; Nicolin Chen <nicolinc@nvidia.com>; ddutile@redhat.com;
> berrange@redhat.com; Nathan Chen <nathanc@nvidia.com>; Matt Ochs
> <mochs@nvidia.com>; smostafa@google.com; wangzhou1@hisilicon.com;
> jiangkunkun@huawei.com; zhangfei.gao@linaro.org;
> zhenzhong.duan@intel.com; yi.l.liu@intel.com; Krishnakant Jaju
> <kjaju@nvidia.com>
> Subject: Re: [PATCH v5 06/32] hw/arm/smmuv3-accel: Initialize shared
> system address space
> 
> On Fri, 31 Oct 2025 10:49:39 +0000
> Shameer Kolothum <skolothumtho@nvidia.com> wrote:
> 
> > To support accelerated SMMUv3 instances, introduce a shared
> > system-wide AddressSpace (shared_as_sysmem) that aliases the global
> system memory.
> > This shared AddressSpace will be used in a subsequent patch for all
> > vfio-pci devices behind all accelerated SMMUv3 instances within a VM.
> No problem with the patch, but perhaps this description could mention
> something about 'why' this address space is useful thing to have?

Yeah. The "why" is missing. Will add.
> >
> > Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>

Thanks,
Shameer


^ permalink raw reply	[flat|nested] 148+ messages in thread

* RE: [PATCH v5 21/32] hw/arm/virt: Set PCI preserve_config for accel SMMUv3
  2025-11-03 14:58   ` Eric Auger
  2025-11-03 15:03     ` Eric Auger via
@ 2025-11-03 16:01     ` Shameer Kolothum
  1 sibling, 0 replies; 148+ messages in thread
From: Shameer Kolothum @ 2025-11-03 16:01 UTC (permalink / raw)
  To: eric.auger@redhat.com, qemu-arm@nongnu.org, qemu-devel@nongnu.org
  Cc: peter.maydell@linaro.org, Jason Gunthorpe, Nicolin Chen,
	ddutile@redhat.com, berrange@redhat.com, Nathan Chen, Matt Ochs,
	smostafa@google.com, wangzhou1@hisilicon.com,
	jiangkunkun@huawei.com, jonathan.cameron@huawei.com,
	zhangfei.gao@linaro.org, zhenzhong.duan@intel.com,
	yi.l.liu@intel.com, Krishnakant Jaju



> -----Original Message-----
> From: Eric Auger <eric.auger@redhat.com>
> Sent: 03 November 2025 14:58
> To: Shameer Kolothum <skolothumtho@nvidia.com>; qemu-
> arm@nongnu.org; qemu-devel@nongnu.org
> Cc: peter.maydell@linaro.org; Jason Gunthorpe <jgg@nvidia.com>; Nicolin
> Chen <nicolinc@nvidia.com>; ddutile@redhat.com; berrange@redhat.com;
> Nathan Chen <nathanc@nvidia.com>; Matt Ochs <mochs@nvidia.com>;
> smostafa@google.com; wangzhou1@hisilicon.com;
> jiangkunkun@huawei.com; jonathan.cameron@huawei.com;
> zhangfei.gao@linaro.org; zhenzhong.duan@intel.com; yi.l.liu@intel.com;
> Krishnakant Jaju <kjaju@nvidia.com>
> Subject: Re: [PATCH v5 21/32] hw/arm/virt: Set PCI preserve_config for accel
> SMMUv3
> 
> On 10/31/25 11:49 AM, Shameer Kolothum wrote:
> > Introduce a new pci_preserve_config field in virt machine state which
> > allows  the generation of DSM #5. This field is only set if accel SMMU
> > is instantiated.
> >
> > In a subsequent patch, SMMUv3 accel mode will make use of IORT RMR
> > nodes to enable nested translation of MSI doorbell addresses. IORT RMR
> > requires _DSM #5 to be set for the PCI host bridge so that the Guest
> > kernel preserves the PCI boot configuration.
> >
> > Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> > Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
> > Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
> > ---
> >  hw/arm/virt-acpi-build.c | 8 ++++++++
> >  hw/arm/virt.c            | 4 ++++
> >  include/hw/arm/virt.h    | 1 +
> >  3 files changed, 13 insertions(+)
> >
> > diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c index
> > 8bb6b60515..d51da6e27d 100644
> > --- a/hw/arm/virt-acpi-build.c
> > +++ b/hw/arm/virt-acpi-build.c
> > @@ -163,6 +163,14 @@ static void acpi_dsdt_add_pci(Aml *scope, const
> MemMapEntry *memmap,
> >          .pci_native_hotplug = !acpi_pcihp,
> >      };
> >
> > +    /*
> > +     * Accel SMMU requires RMRs for MSI 1-1 mapping, which require _DSM
> for
> > +     * preserving PCI Boot Configurations
> as suggested in v4 you can be more precise and explicitly state
> 
> _DSM function 5 (Ignore PCI Boot Configuration)

Ok. Will update.
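
Maybe something along these lines (wording to be polished):

    /*
     * Accel SMMUv3 requires IORT RMRs for MSI 1-1 mapping. RMRs require the
     * host bridge to implement _DSM function 5 (Ignore PCI Boot
     * Configuration) and return 0, so that the guest kernel preserves the
     * PCI boot configuration.
     */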

> 
> > +     */
> > +    if (vms->pci_preserve_config) {
> > +        cfg.preserve_config = true;
> > +    }
> > +
> >      if (vms->highmem_mmio) {
> >          cfg.mmio64 = memmap[VIRT_HIGH_PCIE_MMIO];
> >      }
> > diff --git a/hw/arm/virt.c b/hw/arm/virt.c index
> > 175023897a..8a347a6e39 100644
> > --- a/hw/arm/virt.c
> > +++ b/hw/arm/virt.c
> > @@ -3091,6 +3091,10 @@ static void
> virt_machine_device_plug_cb(HotplugHandler *hotplug_dev,
> >              }
> >
> >              create_smmuv3_dev_dtb(vms, dev, bus);
> > +            if (object_property_find(OBJECT(dev), "accel") &&
> why do you need to test
> 
> object_property_find(OBJECT(dev), "accel")?

Looks like you probably found the reason now 😉.

Anyway, for the record: we don't have the "accel" property yet, it is
only introduced in 25/32. Without the above check the "make check"
tests report an error. The check will be removed once we introduce the
property.
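
Once the "accel" property is in (25/32), the idea is that the above simply
becomes something like:

            if (object_property_get_bool(OBJECT(dev), "accel", &error_abort)) {
                vms->pci_preserve_config = true;
            }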

> 
> > +                object_property_get_bool(OBJECT(dev), "accel", &error_abort)) {
> > +                vms->pci_preserve_config = true;
> > +            }
> >          }
> >      }
> >
> > diff --git a/include/hw/arm/virt.h b/include/hw/arm/virt.h index
> > 04a09af354..60db5d40b2 100644
> > --- a/include/hw/arm/virt.h
> > +++ b/include/hw/arm/virt.h
> > @@ -182,6 +182,7 @@ struct VirtMachineState {
> >      bool ns_el2_virt_timer_irq;
> >      CXLState cxl_devices_state;
> >      bool legacy_smmuv3_present;
> > +    bool pci_preserve_config;
> >  };
> >
> >  #define VIRT_ECAM_ID(high) (high ? VIRT_HIGH_PCIE_ECAM :
> > VIRT_PCIE_ECAM)
> With those changes takin into account
> Reviewed-by: Eric Auger <eric.auger@redhat.com>

Thanks,
Shameer

^ permalink raw reply	[flat|nested] 148+ messages in thread

* RE: [PATCH v5 26/32] hw/arm/smmuv3-accel: Add a property to specify RIL support
  2025-11-03 15:07   ` Eric Auger
@ 2025-11-03 16:08     ` Shameer Kolothum
  2025-11-03 16:25       ` Eric Auger
  0 siblings, 1 reply; 148+ messages in thread
From: Shameer Kolothum @ 2025-11-03 16:08 UTC (permalink / raw)
  To: eric.auger@redhat.com, qemu-arm@nongnu.org, qemu-devel@nongnu.org
  Cc: peter.maydell@linaro.org, Jason Gunthorpe, Nicolin Chen,
	ddutile@redhat.com, berrange@redhat.com, Nathan Chen, Matt Ochs,
	smostafa@google.com, wangzhou1@hisilicon.com,
	jiangkunkun@huawei.com, jonathan.cameron@huawei.com,
	zhangfei.gao@linaro.org, zhenzhong.duan@intel.com,
	yi.l.liu@intel.com, Krishnakant Jaju

Hi Eric,

> -----Original Message-----
> From: Eric Auger <eric.auger@redhat.com>
> Sent: 03 November 2025 15:07
> To: Shameer Kolothum <skolothumtho@nvidia.com>; qemu-
> arm@nongnu.org; qemu-devel@nongnu.org
> Cc: peter.maydell@linaro.org; Jason Gunthorpe <jgg@nvidia.com>; Nicolin
> Chen <nicolinc@nvidia.com>; ddutile@redhat.com; berrange@redhat.com;
> Nathan Chen <nathanc@nvidia.com>; Matt Ochs <mochs@nvidia.com>;
> smostafa@google.com; wangzhou1@hisilicon.com;
> jiangkunkun@huawei.com; jonathan.cameron@huawei.com;
> zhangfei.gao@linaro.org; zhenzhong.duan@intel.com; yi.l.liu@intel.com;
> Krishnakant Jaju <kjaju@nvidia.com>
> Subject: Re: [PATCH v5 26/32] hw/arm/smmuv3-accel: Add a property to
> specify RIL support
> 
> Hi Shameer,
> 
> On 10/31/25 11:49 AM, Shameer Kolothum wrote:
> > Currently QEMU SMMUv3 has RIL support by default. But if accelerated
> > mode is enabled, RIL has to be compatible with host SMMUv3 support.
> >
> > Add a property so that the user can specify this.
> >
> > Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> > Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
> > Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
> 
> I have not seen any reply on
> https://lore.kernel.org/all/b6105534-4a17-4700-bb0b-
> e961babd10bb@redhat.com/

Sorry, looks like I missed replying to that.
 
> I guess you chose to restrict RIL to accel only.

Yes. I have updated the description. 

>  About AIDR consistency check,
> did you have a look?

I have added that check in patch #19. But Zhangfei has reported a problem
with that, as his hardware reports AIDR = 0. Please take a look at that
discussion.

Thanks,
Shameer
 

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 26/32] hw/arm/smmuv3-accel: Add a property to specify RIL support
  2025-11-03 16:08     ` Shameer Kolothum
@ 2025-11-03 16:25       ` Eric Auger
  0 siblings, 0 replies; 148+ messages in thread
From: Eric Auger @ 2025-11-03 16:25 UTC (permalink / raw)
  To: Shameer Kolothum, qemu-arm@nongnu.org, qemu-devel@nongnu.org
  Cc: peter.maydell@linaro.org, Jason Gunthorpe, Nicolin Chen,
	ddutile@redhat.com, berrange@redhat.com, Nathan Chen, Matt Ochs,
	smostafa@google.com, wangzhou1@hisilicon.com,
	jiangkunkun@huawei.com, jonathan.cameron@huawei.com,
	zhangfei.gao@linaro.org, zhenzhong.duan@intel.com,
	yi.l.liu@intel.com, Krishnakant Jaju



On 11/3/25 5:08 PM, Shameer Kolothum wrote:
> Hi Eric,
>
>> -----Original Message-----
>> From: Eric Auger <eric.auger@redhat.com>
>> Sent: 03 November 2025 15:07
>> To: Shameer Kolothum <skolothumtho@nvidia.com>; qemu-
>> arm@nongnu.org; qemu-devel@nongnu.org
>> Cc: peter.maydell@linaro.org; Jason Gunthorpe <jgg@nvidia.com>; Nicolin
>> Chen <nicolinc@nvidia.com>; ddutile@redhat.com; berrange@redhat.com;
>> Nathan Chen <nathanc@nvidia.com>; Matt Ochs <mochs@nvidia.com>;
>> smostafa@google.com; wangzhou1@hisilicon.com;
>> jiangkunkun@huawei.com; jonathan.cameron@huawei.com;
>> zhangfei.gao@linaro.org; zhenzhong.duan@intel.com; yi.l.liu@intel.com;
>> Krishnakant Jaju <kjaju@nvidia.com>
>> Subject: Re: [PATCH v5 26/32] hw/arm/smmuv3-accel: Add a property to
>> specify RIL support
>>
>> External email: Use caution opening links or attachments
>>
>>
>> Hi Shameer,
>>
>> On 10/31/25 11:49 AM, Shameer Kolothum wrote:
>>> Currently QEMU SMMUv3 has RIL support by default. But if accelerated
>>> mode is enabled, RIL has to be compatible with host SMMUv3 support.
>>>
>>> Add a property so that the user can specify this.
>>>
>>> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
>>> Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
>>> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
>> I have not seen any reply on
>> https://lore.kernel.org/all/b6105534-4a17-4700-bb0b-
>> e961babd10bb@redhat.com/
> Sorry, looks like I missed replying to that.
>  
>> I guess you chose to restrict RIL to accel only.
> Yes. I have updated the description. 
>
>  About AIDR consistency check,
>> did you have a look?
> I have added that check in patch #19. But Zhangfei has reported a problem
> with that, as his hardware reports AIDR = 0. Please take a look at that
> discussion.

OK, thanks. I still think it may be relevant to support disabling RIL for
non-accel mode, but this can be added later on.

feel free to add my

Reviewed-by: Eric Auger <eric.auger@redhat.com>

Eric

>
> Thanks,
> Shameer
>  



^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 06/32] hw/arm/smmuv3-accel: Initialize shared system address space
  2025-11-03 13:39   ` Philippe Mathieu-Daudé
@ 2025-11-03 16:30     ` Eric Auger
  0 siblings, 0 replies; 148+ messages in thread
From: Eric Auger @ 2025-11-03 16:30 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé, Shameer Kolothum, qemu-arm,
	qemu-devel
  Cc: peter.maydell, jgg, nicolinc, ddutile, berrange, nathanc, mochs,
	smostafa, wangzhou1, jiangkunkun, jonathan.cameron, zhangfei.gao,
	zhenzhong.duan, yi.l.liu, kjaju, Peter Xu, Mark Cave-Ayland



On 11/3/25 2:39 PM, Philippe Mathieu-Daudé wrote:
> Hi,
>
> On 31/10/25 11:49, Shameer Kolothum wrote:
>> To support accelerated SMMUv3 instances, introduce a shared system-wide
>> AddressSpace (shared_as_sysmem) that aliases the global system memory.
>> This shared AddressSpace will be used in a subsequent patch for all
>> vfio-pci devices behind all accelerated SMMUv3 instances within a VM.
>>
>> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
>> ---
>>   hw/arm/smmuv3-accel.c | 27 +++++++++++++++++++++++++++
>>   1 file changed, 27 insertions(+)
>>
>> diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
>> index 99ef0db8c4..f62b6cf2c9 100644
>> --- a/hw/arm/smmuv3-accel.c
>> +++ b/hw/arm/smmuv3-accel.c
>> @@ -11,6 +11,15 @@
>>   #include "hw/arm/smmuv3.h"
>>   #include "smmuv3-accel.h"
>>   +/*
>> + * The root region aliases the global system memory, and
>> shared_as_sysmem
>> + * provides a shared Address Space referencing it. This Address
>> Space is used
>> + * by all vfio-pci devices behind all accelerated SMMUv3 instances
>> within a VM.
>> + */
>> +MemoryRegion root;
>> +MemoryRegion sysmem;
>
> Why can't we store that in SMMUv3State?
>
>> +static AddressSpace *shared_as_sysmem;

We will have several instances of SMMUv3State which all share the same
AddressSpace, hence the choice of having a global.

Eric
>
> FYI we have object_resolve_type_unambiguous() to check whether an
> instance exists only once (singleton).
>
>> +
>>   static SMMUv3AccelDevice *smmuv3_accel_get_dev(SMMUState *bs,
>> SMMUPciBus *sbus,
>>                                                  PCIBus *bus, int devfn)
>>   {
>> @@ -51,9 +60,27 @@ static const PCIIOMMUOps smmuv3_accel_ops = {
>>       .get_address_space = smmuv3_accel_find_add_as,
>>   };
>>   +static void smmuv3_accel_as_init(SMMUv3State *s)
>> +{
>> +
>> +    if (shared_as_sysmem) {
>> +        return;
>> +    }
>> +
>> +    memory_region_init(&root, OBJECT(s), "root", UINT64_MAX);
>> +    memory_region_init_alias(&sysmem, OBJECT(s), "smmuv3-accel-sysmem",
>> +                             get_system_memory(), 0,
>> +                             memory_region_size(get_system_memory()));
>> +    memory_region_add_subregion(&root, 0, &sysmem);
>> +
>> +    shared_as_sysmem = g_new0(AddressSpace, 1);
>> +    address_space_init(shared_as_sysmem, &root,
>> "smmuv3-accel-as-sysmem");
>> +}
>> +
>>   void smmuv3_accel_init(SMMUv3State *s)
>>   {
>>       SMMUState *bs = ARM_SMMU(s);
>>         bs->iommu_ops = &smmuv3_accel_ops;
>> +    smmuv3_accel_as_init(s);
>>   }
>



^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 07/32] hw/pci/pci: Move pci_init_bus_master() after adding device to bus
  2025-10-31 10:49 ` [PATCH v5 07/32] hw/pci/pci: Move pci_init_bus_master() after adding device to bus Shameer Kolothum
  2025-11-03 13:24   ` Jonathan Cameron via
@ 2025-11-03 16:40   ` Eric Auger
  1 sibling, 0 replies; 148+ messages in thread
From: Eric Auger @ 2025-11-03 16:40 UTC (permalink / raw)
  To: Shameer Kolothum, qemu-arm, qemu-devel
  Cc: peter.maydell, jgg, nicolinc, ddutile, berrange, nathanc, mochs,
	smostafa, wangzhou1, jiangkunkun, jonathan.cameron, zhangfei.gao,
	zhenzhong.duan, yi.l.liu, kjaju



On 10/31/25 11:49 AM, Shameer Kolothum wrote:
> During PCI hotplug, in do_pci_register_device(), pci_init_bus_master()
> is called before storing the pci_dev pointer in bus->devices[devfn].
>
> This causes a problem if pci_init_bus_master() (via its
> get_address_space() callback) attempts to retrieve the device using
> pci_find_device(), since the PCI device is not yet visible on the bus.
>
> Fix this by moving the pci_init_bus_master() call to after the device
> has been added to bus->devices[devfn].
>
> This prepares for a subsequent patch where the accel SMMUv3
> get_address_space() callback retrieves the pci_dev to identify the
> attached device type.
>
> No functional change intended.
>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>

Eric
> ---
>  hw/pci/pci.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index c9932c87e3..9693d7f10c 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -1370,9 +1370,6 @@ static PCIDevice *do_pci_register_device(PCIDevice *pci_dev,
>      pci_dev->bus_master_as.max_bounce_buffer_size =
>          pci_dev->max_bounce_buffer_size;
>  
> -    if (phase_check(PHASE_MACHINE_READY)) {
> -        pci_init_bus_master(pci_dev);
> -    }
>      pci_dev->irq_state = 0;
>      pci_config_alloc(pci_dev);
>  
> @@ -1416,6 +1413,9 @@ static PCIDevice *do_pci_register_device(PCIDevice *pci_dev,
>      pci_dev->config_write = config_write;
>      bus->devices[devfn] = pci_dev;
>      pci_dev->version_id = 2; /* Current pci device vmstate version */
> +    if (phase_check(PHASE_MACHINE_READY)) {
> +        pci_init_bus_master(pci_dev);
> +    }
>      return pci_dev;
>  }
>  



^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 08/32] hw/pci/pci: Add optional supports_address_space() callback
  2025-10-31 10:49 ` [PATCH v5 08/32] hw/pci/pci: Add optional supports_address_space() callback Shameer Kolothum
  2025-11-03 13:30   ` Jonathan Cameron via
@ 2025-11-03 16:47   ` Eric Auger
  1 sibling, 0 replies; 148+ messages in thread
From: Eric Auger @ 2025-11-03 16:47 UTC (permalink / raw)
  To: Shameer Kolothum, qemu-arm, qemu-devel
  Cc: peter.maydell, jgg, nicolinc, ddutile, berrange, nathanc, mochs,
	smostafa, wangzhou1, jiangkunkun, jonathan.cameron, zhangfei.gao,
	zhenzhong.duan, yi.l.liu, kjaju



On 10/31/25 11:49 AM, Shameer Kolothum wrote:
> Introduce an optional supports_address_space() callback in PCIIOMMUOps to
> allow a vIOMMU implementation to reject devices that should not be attached
> to it.
>
> Currently, get_address_space() is the first and mandatory callback into the
> vIOMMU layer, which always returns an address space. For certain setups, such
> as hardware accelerated vIOMMUs (e.g. ARM SMMUv3 with accel=on), attaching
> emulated endpoint devices is undesirable as it may impact the behavior or
> performance of VFIO passthrough devices, for example, by triggering
> unnecessary invalidations on the host IOMMU.
>
> The new callback allows a vIOMMU to check and reject unsupported devices
> early during PCI device registration.
>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>

Looks a reasonable solution to me
Reviewed-by: Eric Auger <eric.auger@redhat.com>

Eric

> ---
>  hw/pci/pci.c         | 20 ++++++++++++++++++++
>  include/hw/pci/pci.h | 17 +++++++++++++++++
>  2 files changed, 37 insertions(+)
>
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index 9693d7f10c..fa9cf5dab2 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -135,6 +135,21 @@ static void pci_set_master(PCIDevice *d, bool enable)
>      d->is_master = enable; /* cache the status */
>  }
>  
> +static bool
> +pci_device_supports_iommu_address_space(PCIDevice *dev, Error **errp)
> +{
> +    PCIBus *bus;
> +    PCIBus *iommu_bus;
> +    int devfn;
> +
> +    pci_device_get_iommu_bus_devfn(dev, &iommu_bus, &bus, &devfn);
> +    if (iommu_bus && iommu_bus->iommu_ops->supports_address_space) {
> +        return iommu_bus->iommu_ops->supports_address_space(bus,
> +                                iommu_bus->iommu_opaque, devfn, errp);
> +    }
> +    return true;
> +}
> +
>  static void pci_init_bus_master(PCIDevice *pci_dev)
>  {
>      AddressSpace *dma_as = pci_device_iommu_address_space(pci_dev);
> @@ -1413,6 +1428,11 @@ static PCIDevice *do_pci_register_device(PCIDevice *pci_dev,
>      pci_dev->config_write = config_write;
>      bus->devices[devfn] = pci_dev;
>      pci_dev->version_id = 2; /* Current pci device vmstate version */
> +    if (!pci_device_supports_iommu_address_space(pci_dev, errp)) {
> +        do_pci_unregister_device(pci_dev);
> +        bus->devices[devfn] = NULL;
> +        return NULL;
> +    }
>      if (phase_check(PHASE_MACHINE_READY)) {
>          pci_init_bus_master(pci_dev);
>      }
> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> index cf99b5bb68..dfeba8c9bd 100644
> --- a/include/hw/pci/pci.h
> +++ b/include/hw/pci/pci.h
> @@ -417,6 +417,23 @@ typedef struct IOMMUPRINotifier {
>   * framework for a set of devices on a PCI bus.
>   */
>  typedef struct PCIIOMMUOps {
> +    /**
> +     * @supports_address_space: Optional pre-check to determine if a PCI
> +     * device can have an IOMMU address space.
> +     *
> +     * @bus: the #PCIBus being accessed.
> +     *
> +     * @opaque: the data passed to pci_setup_iommu().
> +     *
> +     * @devfn: device and function number.
> +     *
> +     * @errp: pass an Error out only when return false
> +     *
> +     * Returns: true if the device can be associated with an IOMMU address
> +     * space, false otherwise with errp set.
> +     */
> +    bool (*supports_address_space)(PCIBus *bus, void *opaque, int devfn,
> +                                   Error **errp);
>      /**
>       * @get_address_space: get the address space for a set of devices
>       * on a PCI bus.



^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 10/32] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd
  2025-10-31 10:49 ` [PATCH v5 10/32] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd Shameer Kolothum
@ 2025-11-03 16:51   ` Eric Auger
  0 siblings, 0 replies; 148+ messages in thread
From: Eric Auger @ 2025-11-03 16:51 UTC (permalink / raw)
  To: Shameer Kolothum, qemu-arm, qemu-devel
  Cc: peter.maydell, jgg, nicolinc, ddutile, berrange, nathanc, mochs,
	smostafa, wangzhou1, jiangkunkun, jonathan.cameron, zhangfei.gao,
	zhenzhong.duan, yi.l.liu, kjaju



On 10/31/25 11:49 AM, Shameer Kolothum wrote:
> Accelerated SMMUv3 is only meaningful when a device can leverage the
> host SMMUv3 in nested mode (S1+S2 translation). To keep the model
> consistent and correct, this mode is restricted to vfio-pci endpoint
> devices using the iommufd backend.
>
> Non-endpoint emulated devices such as PCIe root ports and bridges are
> also permitted so that vfio-pci devices can be attached beneath them.
s/beneath them/downstream?
> All other device types are unsupported in accelerated mode.
>
> Implement supports_address_space() callaback to reject all such
callback
> unsupported devices.
>
> This restriction also avoids complications with IOTLB invalidations.
> Some TLBI commands (e.g. CMD_TLBI_NH_ASID) lack an associated SID,
> making it difficult to trace the originating device. Allowing emulated
> endpoints would require invalidating both QEMU’s software IOTLB and the
> host’s hardware IOTLB, which can significantly degrade performance.
>
> For vfio-pci devices in nested mode, get_address_space() returns an
> address space aliased to system address space so that the VFIO core
> can set up the correct stage-2 mappings for guest RAM.
>
> In summary:
>  - vfio-pci devices(with iommufd as backend) return an address space
>    aliased to system address space.
>  - bridges and root ports return the IOMMU address space.
>
> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
> ---
>  hw/arm/smmuv3-accel.c | 66 ++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 65 insertions(+), 1 deletion(-)
>
> diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
> index f62b6cf2c9..550a0496fe 100644
> --- a/hw/arm/smmuv3-accel.c
> +++ b/hw/arm/smmuv3-accel.c
> @@ -7,8 +7,13 @@
>   */
>  
>  #include "qemu/osdep.h"
> +#include "qemu/error-report.h"
>  
>  #include "hw/arm/smmuv3.h"
> +#include "hw/pci/pci_bridge.h"
> +#include "hw/pci-host/gpex.h"
> +#include "hw/vfio/pci.h"
> +
>  #include "smmuv3-accel.h"
>  
>  /*
> @@ -38,6 +43,41 @@ static SMMUv3AccelDevice *smmuv3_accel_get_dev(SMMUState *bs, SMMUPciBus *sbus,
>      return accel_dev;
>  }
>  
> +static bool smmuv3_accel_pdev_allowed(PCIDevice *pdev, bool *vfio_pci)
> +{
> +
> +    if (object_dynamic_cast(OBJECT(pdev), TYPE_PCI_BRIDGE) ||
> +        object_dynamic_cast(OBJECT(pdev), TYPE_PXB_PCIE_DEV) ||
> +        object_dynamic_cast(OBJECT(pdev), TYPE_GPEX_ROOT_DEVICE)) {
> +        return true;
> +    } else if ((object_dynamic_cast(OBJECT(pdev), TYPE_VFIO_PCI))) {
> +        *vfio_pci = true;
> +        if (object_property_get_link(OBJECT(pdev), "iommufd", NULL)) {
> +            return true;
> +        }
> +    }
> +    return false;
> +}
> +
> +static bool smmuv3_accel_supports_as(PCIBus *bus, void *opaque, int devfn,
> +                                     Error **errp)
> +{
> +    PCIDevice *pdev = pci_find_device(bus, pci_bus_num(bus), devfn);
> +    bool vfio_pci = false;
> +
> +    if (pdev && !smmuv3_accel_pdev_allowed(pdev, &vfio_pci)) {
> +        if (vfio_pci) {
> +            error_setg(errp, "vfio-pci endpoint devices without an iommufd "
> +                       "backend not allowed when using arm-smmuv3,accel=on");
> +
> +        } else {
> +            error_setg(errp, "Emulated endpoint devices are not allowed when "
> +                       "using arm-smmuv3,accel=on");
> +        }
> +        return false;
> +    }
> +    return true;
> +}
>  /*
>   * Find or add an address space for the given PCI device.
>   *
> @@ -48,15 +88,39 @@ static SMMUv3AccelDevice *smmuv3_accel_get_dev(SMMUState *bs, SMMUPciBus *sbus,
>  static AddressSpace *smmuv3_accel_find_add_as(PCIBus *bus, void *opaque,
>                                                int devfn)
>  {
> +    PCIDevice *pdev = pci_find_device(bus, pci_bus_num(bus), devfn);
>      SMMUState *bs = opaque;
>      SMMUPciBus *sbus = smmu_get_sbus(bs, bus);
>      SMMUv3AccelDevice *accel_dev = smmuv3_accel_get_dev(bs, sbus, bus, devfn);
>      SMMUDevice *sdev = &accel_dev->sdev;
> +    bool vfio_pci = false;
>  
> -    return &sdev->as;
> +    if (pdev && !smmuv3_accel_pdev_allowed(pdev, &vfio_pci)) {
> +        /* Should never be here: supports_address_space() filters these out */
> +        g_assert_not_reached();
> +    }
> +
> +    /*
> +     * In the accelerated mode, a vfio-pci device attached via the iommufd
> +     * backend must remain in the system address space. Such a device is
> +     * always translated by its physical SMMU (using either a stage-2-only
> +     * STE or a nested STE), where the parent stage-2 page table is allocated
> +     * by the VFIO core to back the system address space.
> +     *
> +     * Return the shared_as_sysmem aliased to the global system memory in this
> +     * case. Sharing address_space_memory also allows devices under different
> +     * vSMMU instances in the same VM to reuse a single nesting parent HWPT in
> +     * the VFIO core.
> +     */
> +    if (vfio_pci) {
> +        return shared_as_sysmem;
> +    } else {
> +        return &sdev->as;
> +    }
>  }
>  
>  static const PCIIOMMUOps smmuv3_accel_ops = {
> +    .supports_address_space = smmuv3_accel_supports_as,
>      .get_address_space = smmuv3_accel_find_add_as,
>  };
>  
Besides

Reviewed-by: Eric Auger <eric.auger@redhat.com>

Eric



^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 11/32] hw/arm/smmuv3: Implement get_viommu_cap() callback
  2025-10-31 10:49 ` [PATCH v5 11/32] hw/arm/smmuv3: Implement get_viommu_cap() callback Shameer Kolothum
@ 2025-11-03 16:55   ` Eric Auger
  0 siblings, 0 replies; 148+ messages in thread
From: Eric Auger @ 2025-11-03 16:55 UTC (permalink / raw)
  To: Shameer Kolothum, qemu-arm, qemu-devel
  Cc: peter.maydell, jgg, nicolinc, ddutile, berrange, nathanc, mochs,
	smostafa, wangzhou1, jiangkunkun, jonathan.cameron, zhangfei.gao,
	zhenzhong.duan, yi.l.liu, kjaju



On 10/31/25 11:49 AM, Shameer Kolothum wrote:
> For accelerated SMMUv3, we need nested parent domain creation. Add the
> callback support so that VFIO can create a nested parent.
>
> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>

Eric
> ---
>  hw/arm/smmuv3-accel.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
>
> diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
> index 550a0496fe..a1d672208f 100644
> --- a/hw/arm/smmuv3-accel.c
> +++ b/hw/arm/smmuv3-accel.c
> @@ -10,6 +10,7 @@
>  #include "qemu/error-report.h"
>  
>  #include "hw/arm/smmuv3.h"
> +#include "hw/iommu.h"
>  #include "hw/pci/pci_bridge.h"
>  #include "hw/pci-host/gpex.h"
>  #include "hw/vfio/pci.h"
> @@ -119,9 +120,21 @@ static AddressSpace *smmuv3_accel_find_add_as(PCIBus *bus, void *opaque,
>      }
>  }
>  
> +static uint64_t smmuv3_accel_get_viommu_flags(void *opaque)
> +{
> +    /*
> +     * We return VIOMMU_FLAG_WANT_NESTING_PARENT to inform VFIO core to create a
> +     * nesting parent which is required for accelerated SMMUv3 support.
> +     * The real HW nested support should be reported from host SMMUv3 and if
> +     * it doesn't, the nesting parent allocation will fail anyway in VFIO core.
> +     */
> +    return VIOMMU_FLAG_WANT_NESTING_PARENT;
> +}
> +
>  static const PCIIOMMUOps smmuv3_accel_ops = {
>      .supports_address_space = smmuv3_accel_supports_as,
>      .get_address_space = smmuv3_accel_find_add_as,
> +    .get_viommu_flags = smmuv3_accel_get_viommu_flags,
>  };
>  
>  static void smmuv3_accel_as_init(SMMUv3State *s)



^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 17/32] hw/arm/smmuv3-accel: Add support to issue invalidation cmd to host
  2025-10-31 10:49 ` [PATCH v5 17/32] hw/arm/smmuv3-accel: Add support to issue invalidation cmd to host Shameer Kolothum
  2025-11-01  0:35   ` Nicolin Chen via
@ 2025-11-03 17:11   ` Eric Auger
  1 sibling, 0 replies; 148+ messages in thread
From: Eric Auger @ 2025-11-03 17:11 UTC (permalink / raw)
  To: Shameer Kolothum, qemu-arm, qemu-devel
  Cc: peter.maydell, jgg, nicolinc, ddutile, berrange, nathanc, mochs,
	smostafa, wangzhou1, jiangkunkun, jonathan.cameron, zhangfei.gao,
	zhenzhong.duan, yi.l.liu, kjaju



On 10/31/25 11:49 AM, Shameer Kolothum wrote:
> Provide a helper and use that to issue the invalidation cmd to host SMMUv3.
> We only issue one cmd at a time for now.
>
> Support for batching of commands will be added later after analysing the
> impact.
>
> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>

Eric
> ---
>  hw/arm/smmuv3-accel.c | 35 +++++++++++++++++++++++++++++++++++
>  hw/arm/smmuv3-accel.h |  8 ++++++++
>  hw/arm/smmuv3.c       | 30 ++++++++++++++++++++++++++++++
>  3 files changed, 73 insertions(+)
>
> diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
> index 395c8175da..a2deda3c32 100644
> --- a/hw/arm/smmuv3-accel.c
> +++ b/hw/arm/smmuv3-accel.c
> @@ -213,6 +213,41 @@ bool smmuv3_accel_install_nested_ste_range(SMMUv3State *s, SMMUSIDRange *range,
>      return true;
>  }
>  
> +/*
> + * This issues the invalidation cmd to the host SMMUv3.
> + * Note: sdev can be NULL for certain invalidation commands
> + * e.g., SMMU_CMD_TLBI_NH_ASID, SMMU_CMD_TLBI_NH_VA etc.
> + */
> +bool smmuv3_accel_issue_inv_cmd(SMMUv3State *bs, void *cmd, SMMUDevice *sdev,
> +                                Error **errp)
> +{
> +    SMMUv3State *s = ARM_SMMUV3(bs);
> +    SMMUv3AccelState *s_accel = s->s_accel;
> +    IOMMUFDViommu *viommu;
> +    uint32_t entry_num = 1;
> +
> +    /* No vIOMMU means no VFIO/IOMMUFD devices, nothing to invalidate. */
> +    if (!s_accel || !s_accel->vsmmu) {
> +        return true;
> +    }
> +
> +    /*
> +     * Called for emulated bridges or root ports, but SID-based
> +     * invalidations (e.g. CFGI_CD) apply only to vfio-pci endpoints
> +     * with a valid vIOMMU vdev.
> +     */
> +    if (sdev && !container_of(sdev, SMMUv3AccelDevice, sdev)->vdev) {
> +        return true;
> +    }
> +
> +    viommu = &s_accel->vsmmu->viommu;
> +    /* Single command (entry_num = 1); no need to check returned entry_num */
> +    return iommufd_backend_invalidate_cache(
> +                   viommu->iommufd, viommu->viommu_id,
> +                   IOMMU_VIOMMU_INVALIDATE_DATA_ARM_SMMUV3,
> +                   sizeof(Cmd), &entry_num, cmd, errp);
> +}
> +
>  static SMMUv3AccelDevice *smmuv3_accel_get_dev(SMMUState *bs, SMMUPciBus *sbus,
>                                                 PCIBus *bus, int devfn)
>  {
> diff --git a/hw/arm/smmuv3-accel.h b/hw/arm/smmuv3-accel.h
> index 8931e83dc5..ee79548370 100644
> --- a/hw/arm/smmuv3-accel.h
> +++ b/hw/arm/smmuv3-accel.h
> @@ -51,6 +51,8 @@ bool smmuv3_accel_install_nested_ste(SMMUv3State *s, SMMUDevice *sdev, int sid,
>                                       Error **errp);
>  bool smmuv3_accel_install_nested_ste_range(SMMUv3State *s, SMMUSIDRange *range,
>                                             Error **errp);
> +bool smmuv3_accel_issue_inv_cmd(SMMUv3State *s, void *cmd, SMMUDevice *sdev,
> +                                Error **errp);
>  void smmuv3_accel_gbpa_update(SMMUv3State *s);
>  void smmuv3_accel_reset(SMMUv3State *s);
>  #else
> @@ -69,6 +71,12 @@ smmuv3_accel_install_nested_ste_range(SMMUv3State *s, SMMUSIDRange *range,
>  {
>      return true;
>  }
> +static inline bool
> +smmuv3_accel_issue_inv_cmd(SMMUv3State *s, void *cmd, SMMUDevice *sdev,
> +                           Error **errp)
> +{
> +    return true;
> +}
>  static inline void smmuv3_accel_gbpa_update(SMMUv3State *s)
>  {
>  }
> diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
> index cc32b618ed..15173ddc9c 100644
> --- a/hw/arm/smmuv3.c
> +++ b/hw/arm/smmuv3.c
> @@ -1381,6 +1381,7 @@ static int smmuv3_cmdq_consume(SMMUv3State *s)
>          {
>              uint32_t sid = CMD_SID(&cmd);
>              SMMUDevice *sdev = smmu_find_sdev(bs, sid);
> +            Error *local_err = NULL;
>  
>              if (CMD_SSEC(&cmd)) {
>                  cmd_error = SMMU_CERROR_ILL;
> @@ -1393,11 +1394,17 @@ static int smmuv3_cmdq_consume(SMMUv3State *s)
>  
>              trace_smmuv3_cmdq_cfgi_cd(sid);
>              smmuv3_flush_config(sdev);
> +            if (!smmuv3_accel_issue_inv_cmd(s, &cmd, sdev, &local_err)) {
> +                error_report_err(local_err);
> +                cmd_error = SMMU_CERROR_ILL;
> +                break;
> +            }
>              break;
>          }
>          case SMMU_CMD_TLBI_NH_ASID:
>          {
>              int asid = CMD_ASID(&cmd);
> +            Error *local_err = NULL;
>              int vmid = -1;
>  
>              if (!STAGE1_SUPPORTED(s)) {
> @@ -1416,6 +1423,11 @@ static int smmuv3_cmdq_consume(SMMUv3State *s)
>              trace_smmuv3_cmdq_tlbi_nh_asid(asid);
>              smmu_inv_notifiers_all(&s->smmu_state);
>              smmu_iotlb_inv_asid_vmid(bs, asid, vmid);
> +            if (!smmuv3_accel_issue_inv_cmd(s, &cmd, NULL, &local_err)) {
> +                error_report_err(local_err);
> +                cmd_error = SMMU_CERROR_ILL;
> +                break;
> +            }
>              break;
>          }
>          case SMMU_CMD_TLBI_NH_ALL:
> @@ -1440,18 +1452,36 @@ static int smmuv3_cmdq_consume(SMMUv3State *s)
>              QEMU_FALLTHROUGH;
>          }
>          case SMMU_CMD_TLBI_NSNH_ALL:
> +        {
> +            Error *local_err = NULL;
> +
>              trace_smmuv3_cmdq_tlbi_nsnh();
>              smmu_inv_notifiers_all(&s->smmu_state);
>              smmu_iotlb_inv_all(bs);
> +            if (!smmuv3_accel_issue_inv_cmd(s, &cmd, NULL, &local_err)) {
> +                error_report_err(local_err);
> +                cmd_error = SMMU_CERROR_ILL;
> +                break;
> +            }
>              break;
> +        }
>          case SMMU_CMD_TLBI_NH_VAA:
>          case SMMU_CMD_TLBI_NH_VA:
> +        {
> +            Error *local_err = NULL;
> +
>              if (!STAGE1_SUPPORTED(s)) {
>                  cmd_error = SMMU_CERROR_ILL;
>                  break;
>              }
>              smmuv3_range_inval(bs, &cmd, SMMU_STAGE_1);
> +            if (!smmuv3_accel_issue_inv_cmd(s, &cmd, NULL, &local_err)) {
> +                error_report_err(local_err);
> +                cmd_error = SMMU_CERROR_ILL;
> +                break;
> +            }
>              break;
> +        }
>          case SMMU_CMD_TLBI_S12_VMALL:
>          {
>              int vmid = CMD_VMID(&cmd);



^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 19/32] hw/arm/smmuv3-accel: Get host SMMUv3 hw info and validate
  2025-11-03 15:42     ` Shameer Kolothum
@ 2025-11-03 17:16       ` Eric Auger
  0 siblings, 0 replies; 148+ messages in thread
From: Eric Auger @ 2025-11-03 17:16 UTC (permalink / raw)
  To: Shameer Kolothum, Zhangfei Gao
  Cc: qemu-arm@nongnu.org, qemu-devel@nongnu.org,
	peter.maydell@linaro.org, Jason Gunthorpe, Nicolin Chen,
	ddutile@redhat.com, berrange@redhat.com, Nathan Chen, Matt Ochs,
	smostafa@google.com, wangzhou1@hisilicon.com,
	jiangkunkun@huawei.com, jonathan.cameron@huawei.com,
	zhenzhong.duan@intel.com, yi.l.liu@intel.com, Krishnakant Jaju



On 11/3/25 4:42 PM, Shameer Kolothum wrote:
>
>> -----Original Message-----
>> From: Zhangfei Gao <zhangfei.gao@linaro.org>
>> Sent: 01 November 2025 14:20
>> To: Shameer Kolothum <skolothumtho@nvidia.com>
>> Cc: qemu-arm@nongnu.org; qemu-devel@nongnu.org;
>> eric.auger@redhat.com; peter.maydell@linaro.org; Jason Gunthorpe
>> <jgg@nvidia.com>; Nicolin Chen <nicolinc@nvidia.com>;
>> ddutile@redhat.com; berrange@redhat.com; Nathan Chen
>> <nathanc@nvidia.com>; Matt Ochs <mochs@nvidia.com>;
>> smostafa@google.com; wangzhou1@hisilicon.com;
>> jiangkunkun@huawei.com; jonathan.cameron@huawei.com;
>> zhenzhong.duan@intel.com; yi.l.liu@intel.com; Krishnakant Jaju
>> <kjaju@nvidia.com>
>> Subject: Re: [PATCH v5 19/32] hw/arm/smmuv3-accel: Get host SMMUv3 hw
>> info and validate
>>
>> External email: Use caution opening links or attachments
>>
>>
>> Hi, Shameer
>>
>> On Fri, 31 Oct 2025 at 18:54, Shameer Kolothum
>> <skolothumtho@nvidia.com> wrote:
>>> Just before the device gets attached to the SMMUv3, make sure QEMU
>>> SMMUv3 features are compatible with the host SMMUv3.
>>>
>>> Not all fields in the host SMMUv3 IDR registers are meaningful for
>> userspace.
>>> Only the following fields can be used:
>>>
>>>   - IDR0: ST_LEVEL, TERM_MODEL, STALL_MODEL, TTENDIAN, CD2L, ASID16,
>> TTF
>>>   - IDR1: SIDSIZE, SSIDSIZE
>>>   - IDR3: BBML, RIL
>>>   - IDR5: VAX, GRAN64K, GRAN16K, GRAN4K
>>>
>>> For now, the check is to make sure the features are in sync to enable
>>> basic accelerated SMMUv3 support.
>>>
>>> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
>>> ---
>>>  hw/arm/smmuv3-accel.c | 100
>>> ++++++++++++++++++++++++++++++++++++++++++
>>>  1 file changed, 100 insertions(+)
>>>
>>> diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c index
>>> a2deda3c32..8b9f88dd8e 100644
>>> --- a/hw/arm/smmuv3-accel.c
>>> +++ b/hw/arm/smmuv3-accel.c
>>> @@ -28,6 +28,98 @@ MemoryRegion root;
>>>  MemoryRegion sysmem;
>>>  static AddressSpace *shared_as_sysmem;
>>>
>>> +static bool
>>> +smmuv3_accel_check_hw_compatible(SMMUv3State *s,
>>> +                                 struct iommu_hw_info_arm_smmuv3 *info,
>>> +                                 Error **errp) {
>>> +    /* QEMU SMMUv3 supports architecture version 3.1 */
>>> +    if (info->aidr < s->aidr) {
>>> +        error_setg(errp, "Host SMMUv3 architecture version not compatible");
>>> +        return false;
>>> +    }
>> Why has this requirement?
> Right. That was added based on a comment from Eric here,
> https://lore.kernel.org/all/b6105534-4a17-4700-bb0b-e961babd10bb@redhat.com/
>
>> We have SMMUv3 version 3.0 and info->aidr = 0.
>> and qemu fails to boot here.
> Hmm.. It is true that there is hardware out there which implements a cross
> section of features from different architecture revisions.
>
> Since we are checking the ID registers that matters here individually anyway,
> I am not sure whether we should restrict those with AIDR mismatch or just
> warn the user.
OK. Just maybe document in the commit msg that it is irrelevant to check
AIDR, for that reason.

With that commit msg update + removal of AIDR code feel free to take my
Reviewed-by: Eric Auger <eric.auger@redhat.com>


Eric
>
> Thanks,
> Shameer
>
>
>



^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 13/32] hw/arm/smmuv3-accel: Add nested vSTE install/uninstall support
  2025-11-03 15:11     ` Shameer Kolothum
@ 2025-11-03 17:32       ` Nicolin Chen
  0 siblings, 0 replies; 148+ messages in thread
From: Nicolin Chen @ 2025-11-03 17:32 UTC (permalink / raw)
  To: Shameer Kolothum
  Cc: qemu-arm@nongnu.org, qemu-devel@nongnu.org, eric.auger@redhat.com,
	peter.maydell@linaro.org, Jason Gunthorpe, ddutile@redhat.com,
	berrange@redhat.com, Nathan Chen, Matt Ochs, smostafa@google.com,
	wangzhou1@hisilicon.com, jiangkunkun@huawei.com,
	jonathan.cameron@huawei.com, zhangfei.gao@linaro.org,
	zhenzhong.duan@intel.com, yi.l.liu@intel.com, Krishnakant Jaju

On Mon, Nov 03, 2025 at 07:11:18AM -0800, Shameer Kolothum wrote:
> > > +    if (s1_hwpt) {
> > > +        if (!smmuv3_accel_dev_uninstall_nested_ste(accel_dev, true, errp)) {
> > > +            return false;
> > > +        }
> > > +    }
> > 
> > I think we could have some improvements here.
> > 
> > The current flow is:
> >     (attached to s1_hwpt1)
> >     attach to bypass/abort_hwpt // no issue though.
> >     free s1_hwpt1
> >     alloc s2_hwpt2
> >     attach to s2_hwpt2
> > 
> > It could have been a flow like replace() in the kernel:
> >     (attached to s1_hwpt1)
> >     alloc s2_hwpt2
> >     attach to s2_hwpt2 /* skipping bypass/abort */
> >     free s1_hwpt
> 
> Not sure I get the above, you mean in this _install_nested_ste() path,
> we have a case where we need to alloc an S2 HWPT and attach?

Oh no. s1_hwpt1 and s1_hwpt2. The point is that we don't really
need that bypass/abort attachment (i.e. the uninstall function)
when switching between two nested hwpts. The sample code that I
shared should cover this already.
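Roughly, with the helpers already used in this series (just a sketch; old_hwpt_id,
new_hwpt_id and nested_data below are placeholders, and error rollback is elided):

    /* device is attached to the old nested hwpt; switch directly to the new one */
    if (!iommufd_backend_alloc_hwpt(idev->iommufd, idev->devid,
                                    vsmmu->viommu.viommu_id, 0,
                                    IOMMU_HWPT_DATA_ARM_SMMUV3,
                                    sizeof(nested_data), &nested_data,
                                    &new_hwpt_id, errp)) {
        return false;
    }
    if (!host_iommu_device_iommufd_attach_hwpt(idev, new_hwpt_id, errp)) {
        iommufd_backend_free_id(idev->iommufd, new_hwpt_id);
        return false;
    }
    /* free the previous nested hwpt only after the device has moved off it */
    iommufd_backend_free_id(idev->iommufd, old_hwpt_id);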

> > That being said, HW doesn't seem to propagate C_BAD_STE during a
> > CFGI_STE or CFGI_STE_RANGE, IIUIC. It reports C_BAD_STE event when
> > a transaction starts. If we want to perfectly mimic the hardware,
> > we'd have to set up a bad STE down to the HW, which will trigger a
> > C_BAD_STE vevent to be forwarded by vEVENTQ.
> 
> I don't think we need to mimic that behaviour. We could return an event
> from here to Guest if required or just have error_report().

I am not sure about that. Reporting event on CMD_CFGI_STE doesn't
sound like a correct behavior following the HW spec.

error_report() would be fine. But we might need to leave a FIXME.

Nicolin


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 16/32] hw/arm/smmuv3-accel: Make use of get_msi_address_space() callback
  2025-11-03 15:19     ` Shameer Kolothum
@ 2025-11-03 17:34       ` Nicolin Chen
  0 siblings, 0 replies; 148+ messages in thread
From: Nicolin Chen @ 2025-11-03 17:34 UTC (permalink / raw)
  To: Shameer Kolothum
  Cc: qemu-arm@nongnu.org, qemu-devel@nongnu.org, eric.auger@redhat.com,
	peter.maydell@linaro.org, Jason Gunthorpe, ddutile@redhat.com,
	berrange@redhat.com, Nathan Chen, Matt Ochs, smostafa@google.com,
	wangzhou1@hisilicon.com, jiangkunkun@huawei.com,
	jonathan.cameron@huawei.com, zhangfei.gao@linaro.org,
	zhenzhong.duan@intel.com, yi.l.liu@intel.com, Krishnakant Jaju

On Mon, Nov 03, 2025 at 07:19:08AM -0800, Shameer Kolothum wrote:
> I like the current approach. Possibly can update the doc for get_msi_address_space()
> In previous patch to make it clear that "&address_space_memory" should be 
> returned if no msi translation is required.

Yea. That'd be clearer.
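Something along these lines in the callback documentation, for instance (just a
sketch; the exact wording and the callback signature below are assumptions):

    /*
     * Return the address space used to translate MSI writes for the
     * device at @devfn. Return &address_space_memory if no MSI
     * translation is required.
     */
    AddressSpace * (*get_msi_address_space)(PCIBus *bus, void *opaque,
                                            int devfn);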

Nicolin


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 17/32] hw/arm/smmuv3-accel: Add support to issue invalidation cmd to host
  2025-11-03 15:28     ` Shameer Kolothum
@ 2025-11-03 17:43       ` Nicolin Chen
  2025-11-03 18:17         ` Shameer Kolothum
  0 siblings, 1 reply; 148+ messages in thread
From: Nicolin Chen @ 2025-11-03 17:43 UTC (permalink / raw)
  To: Shameer Kolothum
  Cc: qemu-arm@nongnu.org, qemu-devel@nongnu.org, eric.auger@redhat.com,
	peter.maydell@linaro.org, Jason Gunthorpe, ddutile@redhat.com,
	berrange@redhat.com, Nathan Chen, Matt Ochs, smostafa@google.com,
	wangzhou1@hisilicon.com, jiangkunkun@huawei.com,
	jonathan.cameron@huawei.com, zhangfei.gao@linaro.org,
	zhenzhong.duan@intel.com, yi.l.liu@intel.com, Krishnakant Jaju

On Mon, Nov 03, 2025 at 07:28:14AM -0800, Shameer Kolothum wrote:
> > >          case SMMU_CMD_TLBI_NH_VAA:
> > >          case SMMU_CMD_TLBI_NH_VA:
> > > +        {
> > > +            Error *local_err = NULL;
> > > +
> > >              if (!STAGE1_SUPPORTED(s)) {
> > >                  cmd_error = SMMU_CERROR_ILL;
> > >                  break;
> > >              }
> > >              smmuv3_range_inval(bs, &cmd, SMMU_STAGE_1);
> > > +            if (!smmuv3_accel_issue_inv_cmd(s, &cmd, NULL, &local_err)) {
> > > +                error_report_err(local_err);
> > > +                cmd_error = SMMU_CERROR_ILL;
> > > +                break;
> > > +            }
> > >              break;
> > > +        }
> > 
> > The local_err isn't used anywhere but by the error_report_err()
> > alone. So, it could be moved into smmuv3_accel_issue_inv_cmd().
> 
> Though that is true, it is following the same pattern as 
> smmuv3_accel_install_nested_ste()/_range()  functions.

We could drop the one in smmuv3_accel_install_nested_ste() too.

> The general
> idea is, we will pass the errp to accel functions and report or propagate
> from here.

But there is no "errp" in smmuv3_cmdq_consume() to propagate these
local_errs further? It ends at the error_report_err().

If we only get local_err and print them, why not just print them
inside the _accel functions?

Nicolin


^ permalink raw reply	[flat|nested] 148+ messages in thread

* RE: [PATCH v5 17/32] hw/arm/smmuv3-accel: Add support to issue invalidation cmd to host
  2025-11-03 17:43       ` Nicolin Chen
@ 2025-11-03 18:17         ` Shameer Kolothum
  2025-11-03 18:51           ` Nicolin Chen
  0 siblings, 1 reply; 148+ messages in thread
From: Shameer Kolothum @ 2025-11-03 18:17 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: qemu-arm@nongnu.org, qemu-devel@nongnu.org, eric.auger@redhat.com,
	peter.maydell@linaro.org, Jason Gunthorpe, ddutile@redhat.com,
	berrange@redhat.com, Nathan Chen, Matt Ochs, smostafa@google.com,
	wangzhou1@hisilicon.com, jiangkunkun@huawei.com,
	jonathan.cameron@huawei.com, zhangfei.gao@linaro.org,
	zhenzhong.duan@intel.com, yi.l.liu@intel.com, Krishnakant Jaju



> -----Original Message-----
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: 03 November 2025 17:44
> To: Shameer Kolothum <skolothumtho@nvidia.com>
> Cc: qemu-arm@nongnu.org; qemu-devel@nongnu.org;
> eric.auger@redhat.com; peter.maydell@linaro.org; Jason Gunthorpe
> <jgg@nvidia.com>; ddutile@redhat.com; berrange@redhat.com; Nathan
> Chen <nathanc@nvidia.com>; Matt Ochs <mochs@nvidia.com>;
> smostafa@google.com; wangzhou1@hisilicon.com;
> jiangkunkun@huawei.com; jonathan.cameron@huawei.com;
> zhangfei.gao@linaro.org; zhenzhong.duan@intel.com; yi.l.liu@intel.com;
> Krishnakant Jaju <kjaju@nvidia.com>
> Subject: Re: [PATCH v5 17/32] hw/arm/smmuv3-accel: Add support to issue
> invalidation cmd to host
> 
> >
> > Though that is true, it is following the same pattern as
> > smmuv3_accel_install_nested_ste()/_range()  functions.
> 
> We could drop the one in smmuv3_accel_install_nested_ste() too.
> 
> > The general
> > idea is, we will pass the errp to accel functions and report or
> > propagate from here.
> 
> But there is no "errp" in smmuv3_cmdq_consume() to propagate the these
> local_errs further? It ends at the error_report_err().
> 
> If we only get local_err and print them, why not just print them inside the
> _accel functions?

Right, we don't propagate errors now. But in future it might come in
handy. I would personally keep the error propagation facility if possible.
Also, this was added as per Eric's comment on RFC v3.

https://lore.kernel.org/qemu-devel/41ceadf1-07de-4c8a-8935-d709ac7cf6bc@redhat.com/

Thanks,
Shameer


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 17/32] hw/arm/smmuv3-accel: Add support to issue invalidation cmd to host
  2025-11-03 18:17         ` Shameer Kolothum
@ 2025-11-03 18:51           ` Nicolin Chen
  2025-11-04  8:55             ` Eric Auger
  0 siblings, 1 reply; 148+ messages in thread
From: Nicolin Chen @ 2025-11-03 18:51 UTC (permalink / raw)
  To: Shameer Kolothum, eric.auger@redhat.com
  Cc: qemu-arm@nongnu.org, qemu-devel@nongnu.org,
	peter.maydell@linaro.org, Jason Gunthorpe, ddutile@redhat.com,
	berrange@redhat.com, Nathan Chen, Matt Ochs, smostafa@google.com,
	wangzhou1@hisilicon.com, jiangkunkun@huawei.com,
	jonathan.cameron@huawei.com, zhangfei.gao@linaro.org,
	zhenzhong.duan@intel.com, yi.l.liu@intel.com, Krishnakant Jaju

On Mon, Nov 03, 2025 at 10:17:20AM -0800, Shameer Kolothum wrote:
> > > The general
> > > idea is, we will pass the errp to accel functions and report or
> > > propagate from here.
> > 
> > But there is no "errp" in smmuv3_cmdq_consume() to propagate the these
> > local_errs further? It ends at the error_report_err().
> > 
> > If we only get local_err and print them, why not just print them inside the
> > _accel functions?
> 
> Right, we don’t propagate error now. But in future it might come
> handy. I would personally keep the error propagation facility if possible.

smmuv3_cmdq_consume() is called in smmu_writel() only. Where do we
plan to propagate that in the future?

> Also, this was added as per Eric's comment on RFC v3.
>
> https://lore.kernel.org/qemu-devel/41ceadf1-07de-4c8a-8935-d709ac7cf6bc@redhat.com/

If only we had a top-level function that does error_report_err() in one
place... Duplicating error_report_err(local_err) doesn't look clean
to me.

Maybe smmu_writel() could do:
{
+   Error *errp = NULL;

    switch (offset) {
    case A_XXX:
        smmuv3_cmdq_consume(..., errp);
+       return MEMTX_OK;
-       break;
    ...
    case A_YYY:
        smmuv3_cmdq_consume(..., errp);
+       return MEMTX_OK;
-       break;
    }
+   error_report_err(errp);
+   return MEMTX_OK;
}

Any better idea, Eric?

Nicolin


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 17/32] hw/arm/smmuv3-accel: Add support to issue invalidation cmd to host
  2025-11-03 18:51           ` Nicolin Chen
@ 2025-11-04  8:55             ` Eric Auger
  2025-11-04 16:41               ` Nicolin Chen
  0 siblings, 1 reply; 148+ messages in thread
From: Eric Auger @ 2025-11-04  8:55 UTC (permalink / raw)
  To: Nicolin Chen, Shameer Kolothum
  Cc: qemu-arm@nongnu.org, qemu-devel@nongnu.org,
	peter.maydell@linaro.org, Jason Gunthorpe, ddutile@redhat.com,
	berrange@redhat.com, Nathan Chen, Matt Ochs, smostafa@google.com,
	wangzhou1@hisilicon.com, jiangkunkun@huawei.com,
	jonathan.cameron@huawei.com, zhangfei.gao@linaro.org,
	zhenzhong.duan@intel.com, yi.l.liu@intel.com, Krishnakant Jaju

Hi,

On 11/3/25 7:51 PM, Nicolin Chen wrote:
> On Mon, Nov 03, 2025 at 10:17:20AM -0800, Shameer Kolothum wrote:
>>>> The general
>>>> idea is, we will pass the errp to accel functions and report or
>>>> propagate from here.
>>> But there is no "errp" in smmuv3_cmdq_consume() to propagate the these
>>> local_errs further? It ends at the error_report_err().
>>>
>>> If we only get local_err and print them, why not just print them inside the
>>> _accel functions?
>> Right, we don’t propagate error now. But in future it might come
>> handy. I would personally keep the error propagation facility if possible.
> smmuv3_cmdq_consume() is called in smmu_writel() only. Where do we
> plan to propagate that in the future?
>
>> Also, this was added as per Eric's comment on RFC v3.
>>
>> https://lore.kernel.org/qemu-devel/41ceadf1-07de-4c8a-8935-d709ac7cf6bc@redhat.com/
> If only we have a top function that does error_report_err() in one
> place.. Duplicating error_report_err(local_err) doesn't look clean
> to me.
>
> Maybe smmu_writel() could do:
> {
> +   Error *errp = NULL;
>
>     switch (offset) {
>     case A_XXX:
>         smmuv3_cmdq_consume(..., errp);
> +       return MEMTX_OK;
> -       break;
>     ...
>     case A_YYY:
>         smmuv3_cmdq_consume(..., errp);
> +       return MEMTX_OK;
> -       break;
>     }
> +   error_report_err(errp);
> +   return MEMTX_OK;
> }
>
> Any better idea, Eric?

Can't we move local_err outside of the case blocks and handle it after the switch?

 if (cmd_error) {
   if (local_err) {
      error_report_err(local_err);
   }
../..  
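For reference, the tail of the command loop would then roughly look like this
(the trace/cmdq-error lines are the existing ones in smmuv3_cmdq_consume();
only the local_err handling is new, so treat it as a sketch):

    Error *local_err = NULL;   /* declared once, before the switch */

    switch (type) {
    ...
    }

    if (cmd_error) {
        if (local_err) {
            error_report_err(local_err);
        }
        trace_smmuv3_cmdq_consume_error(smmu_cmd_string(type), cmd_error);
        smmu_write_cmdq_err(s, cmd_error);
        break;
    }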

Eric
 
>
> Nicolin
>



^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 26/32] hw/arm/smmuv3-accel: Add a property to specify RIL support
  2025-10-31 10:49 ` [PATCH v5 26/32] hw/arm/smmuv3-accel: Add a property to specify RIL support Shameer Kolothum
  2025-11-03 15:07   ` Eric Auger
@ 2025-11-04  9:38   ` Eric Auger
  1 sibling, 0 replies; 148+ messages in thread
From: Eric Auger @ 2025-11-04  9:38 UTC (permalink / raw)
  To: Shameer Kolothum, qemu-arm, qemu-devel
  Cc: peter.maydell, jgg, nicolinc, ddutile, berrange, nathanc, mochs,
	smostafa, wangzhou1, jiangkunkun, jonathan.cameron, zhangfei.gao,
	zhenzhong.duan, yi.l.liu, kjaju



On 10/31/25 11:49 AM, Shameer Kolothum wrote:
> Currently QEMU SMMUv3 has RIL support by default. But if accelerated mode
> is enabled, RIL has to be compatible with host SMMUv3 support.
>
> Add a property so that the user can specify this.
>
> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
> ---
>  hw/arm/smmuv3-accel.c   | 15 +++++++++++++--
>  hw/arm/smmuv3-accel.h   |  4 ++++
>  hw/arm/smmuv3.c         | 12 ++++++++++++
>  include/hw/arm/smmuv3.h |  1 +
>  4 files changed, 30 insertions(+), 2 deletions(-)
>
> diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
> index 8b9f88dd8e..35298350cb 100644
> --- a/hw/arm/smmuv3-accel.c
> +++ b/hw/arm/smmuv3-accel.c
> @@ -63,10 +63,10 @@ smmuv3_accel_check_hw_compatible(SMMUv3State *s,
>          return false;
>      }
>  
> -    /* QEMU SMMUv3 supports Range Invalidation by default */
> +    /* User can disable QEMU SMMUv3 Range Invalidation support */
>      if (FIELD_EX32(info->idr[3], IDR3, RIL) !=
>                  FIELD_EX32(s->idr[3], IDR3, RIL)) {
> -        error_setg(errp, "Host SMMUv3 doesn't support Range Invalidation");
> +        error_setg(errp, "Host SMMUv3 differs in Range Invalidation support");
>          return false;
>      }
>  
> @@ -635,6 +635,17 @@ static const PCIIOMMUOps smmuv3_accel_ops = {
>      .get_msi_address_space = smmuv3_accel_get_msi_as,
>  };
>  
> +void smmuv3_accel_idr_override(SMMUv3State *s)
> +{
> +    if (!s->accel) {
> +        return;
> +    }
> +
> +    /* By default QEMU SMMUv3 has RIL. Update IDR3 if user has disabled it */
> +    if (!s->ril) {
> +        s->idr[3] = FIELD_DP32(s->idr[3], IDR3, RIL, 0);
Can't you directly set

s->idr[3] = FIELD_DP32(s->idr[3], IDR3, RIL, s->ril);

Eric

> +    }
> +}
>  
>  /* Based on SMUUv3 GBPA configuration, attach a corresponding HWPT */
>  void smmuv3_accel_gbpa_update(SMMUv3State *s)
> diff --git a/hw/arm/smmuv3-accel.h b/hw/arm/smmuv3-accel.h
> index ee79548370..4f5b672712 100644
> --- a/hw/arm/smmuv3-accel.h
> +++ b/hw/arm/smmuv3-accel.h
> @@ -55,6 +55,7 @@ bool smmuv3_accel_issue_inv_cmd(SMMUv3State *s, void *cmd, SMMUDevice *sdev,
>                                  Error **errp);
>  void smmuv3_accel_gbpa_update(SMMUv3State *s);
>  void smmuv3_accel_reset(SMMUv3State *s);
> +void smmuv3_accel_idr_override(SMMUv3State *s);
>  #else
>  static inline void smmuv3_accel_init(SMMUv3State *s)
>  {
> @@ -83,6 +84,9 @@ static inline void smmuv3_accel_gbpa_update(SMMUv3State *s)
>  static inline void smmuv3_accel_reset(SMMUv3State *s)
>  {
>  }
> +static inline void smmuv3_accel_idr_override(SMMUv3State *s)
> +{
> +}
>  #endif
>  
>  #endif /* HW_ARM_SMMUV3_ACCEL_H */
> diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
> index f040e6b91e..b9d96f5762 100644
> --- a/hw/arm/smmuv3.c
> +++ b/hw/arm/smmuv3.c
> @@ -305,6 +305,7 @@ static void smmuv3_init_id_regs(SMMUv3State *s)
>      s->idr[5] = FIELD_DP32(s->idr[5], IDR5, GRAN16K, 1);
>      s->idr[5] = FIELD_DP32(s->idr[5], IDR5, GRAN64K, 1);
>      s->aidr = 0x1;
> +    smmuv3_accel_idr_override(s);
>  }
>  
>  static void smmuv3_reset(SMMUv3State *s)
> @@ -1936,6 +1937,13 @@ static bool smmu_validate_property(SMMUv3State *s, Error **errp)
>          return false;
>      }
>  #endif
> +    if (!s->accel) {
> +        if (!s->ril) {
> +            error_setg(errp, "ril can only be disabled if accel=on");
> +            return false;
> +        }
> +        return false;
this one is wrong. It should return true because it is a valid and
default config.
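
i.e. something like (just a sketch of the fixed check):

    if (!s->accel && !s->ril) {
        error_setg(errp, "ril can only be disabled if accel=on");
        return false;
    }
    return true;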

Eric
> +    }
>      return true;
>  }
>  
> @@ -2057,6 +2065,8 @@ static const Property smmuv3_properties[] = {
>       */
>      DEFINE_PROP_STRING("stage", SMMUv3State, stage),
>      DEFINE_PROP_BOOL("accel", SMMUv3State, accel, false),
> +    /* RIL can be turned off for accel cases */
> +    DEFINE_PROP_BOOL("ril", SMMUv3State, ril, true),
>  };
>  
>  static void smmuv3_instance_init(Object *obj)
> @@ -2084,6 +2094,8 @@ static void smmuv3_class_init(ObjectClass *klass, const void *data)
>                                            "Enable SMMUv3 accelerator support."
>                                            "Allows host SMMUv3 to be configured "
>                                            "in nested mode for vfio-pci dev assignment");
> +    object_class_property_set_description(klass, "ril",
> +        "Disable range invalidation support (for accel=on)");
>  }
>  
>  static int smmuv3_notify_flag_changed(IOMMUMemoryRegion *iommu,
> diff --git a/include/hw/arm/smmuv3.h b/include/hw/arm/smmuv3.h
> index 6b9c27a9c4..95202c2757 100644
> --- a/include/hw/arm/smmuv3.h
> +++ b/include/hw/arm/smmuv3.h
> @@ -68,6 +68,7 @@ struct SMMUv3State {
>      bool accel;
>      struct SMMUv3AccelState *s_accel;
>      Error *migration_blocker;
> +    bool ril;
>  };
>  
>  typedef enum {



^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 13/32] hw/arm/smmuv3-accel: Add nested vSTE install/uninstall support
  2025-10-31 10:49 ` [PATCH v5 13/32] hw/arm/smmuv3-accel: Add nested vSTE install/uninstall support Shameer Kolothum
  2025-10-31 23:52   ` Nicolin Chen
@ 2025-11-04 11:05   ` Eric Auger
  2025-11-04 12:26     ` Shameer Kolothum
  1 sibling, 1 reply; 148+ messages in thread
From: Eric Auger @ 2025-11-04 11:05 UTC (permalink / raw)
  To: Shameer Kolothum, qemu-arm, qemu-devel
  Cc: peter.maydell, jgg, nicolinc, ddutile, berrange, nathanc, mochs,
	smostafa, wangzhou1, jiangkunkun, jonathan.cameron, zhangfei.gao,
	zhenzhong.duan, yi.l.liu, kjaju

Hi Shameer,

On 10/31/25 11:49 AM, Shameer Kolothum wrote:
> From: Nicolin Chen <nicolinc@nvidia.com>
>
> A device placed behind a vSMMU instance must have corresponding vSTEs
> (bypass, abort, or translate) installed. The bypass and abort proxy nested
> HWPTs are pre-allocated.
>
> For translat HWPT, a vDEVICE object is allocated and associated with the
> vIOMMU for each guest device. This allows the host kernel to establish a
> virtual SID to physical SID mapping, which is required for handling
> invalidations and event reporting.
>
> A translate HWPT is allocated based on the guest STE configuration and
> attached to the device when the guest issues SMMU_CMD_CFGI_STE or
> SMMU_CMD_CFGI_STE_RANGE, provided the STE enables S1 translation.
>
> If the guest STE is invalid or S1 translation is disabled, the device is
> attached to one of the pre-allocated ABORT or BYPASS HWPTs instead.
>
> While at it, export both smmu_find_ste() and smmuv3_flush_config() for
> use here.
>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
> ---
>  hw/arm/smmuv3-accel.c    | 193 +++++++++++++++++++++++++++++++++++++++
>  hw/arm/smmuv3-accel.h    |  23 +++++
>  hw/arm/smmuv3-internal.h |  20 ++++
>  hw/arm/smmuv3.c          |  18 +++-
>  hw/arm/trace-events      |   2 +
>  5 files changed, 253 insertions(+), 3 deletions(-)
>
> diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
> index d4d65299a8..c74e95a0ea 100644
> --- a/hw/arm/smmuv3-accel.c
> +++ b/hw/arm/smmuv3-accel.c
> @@ -28,6 +28,191 @@ MemoryRegion root;
>  MemoryRegion sysmem;
>  static AddressSpace *shared_as_sysmem;
>  
> +static bool
> +smmuv3_accel_alloc_vdev(SMMUv3AccelDevice *accel_dev, int sid, Error **errp)
> +{
> +    SMMUViommu *vsmmu = accel_dev->vsmmu;
> +    IOMMUFDVdev *vdev;
> +    uint32_t vdevice_id;
> +
> +    if (!accel_dev->idev || accel_dev->vdev) {
> +        return true;
> +    }
> +
> +    if (!iommufd_backend_alloc_vdev(vsmmu->iommufd, accel_dev->idev->devid,
> +                                    vsmmu->viommu.viommu_id, sid,
> +                                    &vdevice_id, errp)) {
> +            return false;
> +    }
> +    if (!host_iommu_device_iommufd_attach_hwpt(accel_dev->idev,
> +                                               vsmmu->bypass_hwpt_id, errp)) {
> +        iommufd_backend_free_id(vsmmu->iommufd, vdevice_id);
> +        return false;
> +    }
> +
> +    vdev = g_new(IOMMUFDVdev, 1);
> +    vdev->vdevice_id = vdevice_id;
> +    vdev->virt_id = sid;
> +    accel_dev->vdev = vdev;
> +    return true;
> +}
> +
> +static bool
> +smmuv3_accel_dev_uninstall_nested_ste(SMMUv3AccelDevice *accel_dev, bool abort,
> +                                      Error **errp)
> +{
> +    HostIOMMUDeviceIOMMUFD *idev = accel_dev->idev;
> +    SMMUS1Hwpt *s1_hwpt = accel_dev->s1_hwpt;
> +    uint32_t hwpt_id;
> +
> +    if (!s1_hwpt || !accel_dev->vsmmu) {
> +        return true;
> +    }
> +
> +    if (abort) {
> +        hwpt_id = accel_dev->vsmmu->abort_hwpt_id;
> +    } else {
> +        hwpt_id = accel_dev->vsmmu->bypass_hwpt_id;
> +    }
> +
> +    if (!host_iommu_device_iommufd_attach_hwpt(idev, hwpt_id, errp)) {
> +        return false;
> +    }
> +    trace_smmuv3_accel_uninstall_nested_ste(smmu_get_sid(&accel_dev->sdev),
> +                                            abort ? "abort" : "bypass",
> +                                            hwpt_id);
> +
> +    iommufd_backend_free_id(s1_hwpt->iommufd, s1_hwpt->hwpt_id);
> +    accel_dev->s1_hwpt = NULL;
> +    g_free(s1_hwpt);
> +    return true;
> +}
> +
> +static bool
> +smmuv3_accel_dev_install_nested_ste(SMMUv3AccelDevice *accel_dev,
> +                                    uint32_t data_type, uint32_t data_len,
> +                                    void *data, Error **errp)
the name is very close to the caller function, i.e.
smmuv3_accel_install_nested_ste(), which also takes an sdev.
I would rename it to smmuv3_accel_install_hwpt() or something similar.
> +{
> +    SMMUViommu *vsmmu = accel_dev->vsmmu;
> +    SMMUS1Hwpt *s1_hwpt = accel_dev->s1_hwpt;
> +    HostIOMMUDeviceIOMMUFD *idev = accel_dev->idev;
> +    uint32_t flags = 0;
> +
> +    if (!idev || !vsmmu) {
> +        error_setg(errp, "Device 0x%x has no associated IOMMU dev or vIOMMU",
> +                   smmu_get_sid(&accel_dev->sdev));
> +        return false;
> +    }
> +
> +    if (s1_hwpt) {
> +        if (!smmuv3_accel_dev_uninstall_nested_ste(accel_dev, true, errp)) {
> +            return false;
> +        }
> +    }
> +
> +    s1_hwpt = g_new0(SMMUS1Hwpt, 1);
> +    s1_hwpt->iommufd = idev->iommufd;
> +    if (!iommufd_backend_alloc_hwpt(idev->iommufd, idev->devid,
> +                                    vsmmu->viommu.viommu_id, flags,
> +                                    data_type, data_len, data,
> +                                    &s1_hwpt->hwpt_id, errp)) {
> +        return false;
> +    }
> +
> +    if (!host_iommu_device_iommufd_attach_hwpt(idev, s1_hwpt->hwpt_id, errp)) {
> +        iommufd_backend_free_id(idev->iommufd, s1_hwpt->hwpt_id);
> +        return false;
> +    }
> +    accel_dev->s1_hwpt = s1_hwpt;
> +    return true;
> +}
> +
> +bool
> +smmuv3_accel_install_nested_ste(SMMUv3State *s, SMMUDevice *sdev, int sid,
> +                                Error **errp)
> +{
> +    SMMUv3AccelDevice *accel_dev;
> +    SMMUEventInfo event = {.type = SMMU_EVT_NONE, .sid = sid,
> +                           .inval_ste_allowed = true};
> +    struct iommu_hwpt_arm_smmuv3 nested_data = {};
> +    uint64_t ste_0, ste_1;
> +    uint32_t config;
> +    STE ste;
> +    int ret;
> +
> +    if (!s->accel) {
don't you want to check !s->vsmmu as well, as is done in
smmuv3_accel_install_nested_ste_range()?
> +        return true;
> +    }
> +
> +    accel_dev = container_of(sdev, SMMUv3AccelDevice, sdev);
> +    if (!accel_dev->vsmmu) {
> +        return true;
> +    }
> +
> +    if (!smmuv3_accel_alloc_vdev(accel_dev, sid, errp)) {
> +        return false;
> +    }
> +
> +    ret = smmu_find_ste(sdev->smmu, sid, &ste, &event);
> +    if (ret) {
> +        error_setg(errp, "Failed to find STE for Device 0x%x", sid);
> +        return true;
returning true while setting errp looks wrong to me.
> +    }
> +
> +    config = STE_CONFIG(&ste);
> +    if (!STE_VALID(&ste) || !STE_CFG_S1_ENABLED(config)) {
> +        if (!smmuv3_accel_dev_uninstall_nested_ste(accel_dev,
> +                                                   STE_CFG_ABORT(config),
> +                                                   errp)) {
> +            return false;
> +        }
> +        smmuv3_flush_config(sdev);
> +        return true;
> +    }
> +
> +    ste_0 = (uint64_t)ste.word[0] | (uint64_t)ste.word[1] << 32;
> +    ste_1 = (uint64_t)ste.word[2] | (uint64_t)ste.word[3] << 32;
> +    nested_data.ste[0] = cpu_to_le64(ste_0 & STE0_MASK);
> +    nested_data.ste[1] = cpu_to_le64(ste_1 & STE1_MASK);
> +
> +    if (!smmuv3_accel_dev_install_nested_ste(accel_dev,
> +                                             IOMMU_HWPT_DATA_ARM_SMMUV3,
> +                                             sizeof(nested_data),
> +                                             &nested_data, errp)) {
> +        error_append_hint(errp, "Unable to install sid=0x%x nested STE="
> +                          "0x%"PRIx64":=0x%"PRIx64"", sid,
nit: why ":=" between both 64b?
> +                          (uint64_t)le64_to_cpu(nested_data.ste[1]),
> +                          (uint64_t)le64_to_cpu(nested_data.ste[0]));
> +        return false;
in the various failure cases here, do we need to free the vdev?
> +    }
> +    trace_smmuv3_accel_install_nested_ste(sid, nested_data.ste[1],
> +                                          nested_data.ste[0]);
> +    return true;
> +}
> +
> +bool smmuv3_accel_install_nested_ste_range(SMMUv3State *s, SMMUSIDRange *range,
> +                                           Error **errp)
> +{
> +    SMMUv3AccelState *s_accel = s->s_accel;
> +    SMMUv3AccelDevice *accel_dev;
> +
> +    if (!s_accel || !s_accel->vsmmu) {
> +        return true;
> +    }
> +
> +    QLIST_FOREACH(accel_dev, &s_accel->vsmmu->device_list, next) {
> +        uint32_t sid = smmu_get_sid(&accel_dev->sdev);
> +
> +        if (sid >= range->start && sid <= range->end) {
> +            if (!smmuv3_accel_install_nested_ste(s, &accel_dev->sdev,
> +                                                 sid, errp)) {
> +                return false;
> +            }
> +        }
> +    }
> +    return true;
> +}
> +
>  static SMMUv3AccelDevice *smmuv3_accel_get_dev(SMMUState *bs, SMMUPciBus *sbus,
>                                                 PCIBus *bus, int devfn)
>  {
> @@ -154,6 +339,7 @@ static void smmuv3_accel_unset_iommu_device(PCIBus *bus, void *opaque,
>      SMMUv3State *s = ARM_SMMUV3(bs);
>      SMMUPciBus *sbus = g_hash_table_lookup(bs->smmu_pcibus_by_busptr, bus);
>      SMMUv3AccelDevice *accel_dev;
> +    IOMMUFDVdev *vdev;
>      SMMUViommu *vsmmu;
>      SMMUDevice *sdev;
>      uint16_t sid;
> @@ -182,6 +368,13 @@ static void smmuv3_accel_unset_iommu_device(PCIBus *bus, void *opaque,
>      trace_smmuv3_accel_unset_iommu_device(devfn, sid);
>  
>      vsmmu = s->s_accel->vsmmu;
> +    vdev = accel_dev->vdev;
> +    if (vdev) {
> +        iommufd_backend_free_id(vsmmu->iommufd, vdev->vdevice_id);
> +        g_free(vdev);
> +        accel_dev->vdev = NULL;
> +    }
> +
>      if (QLIST_EMPTY(&vsmmu->device_list)) {
>          iommufd_backend_free_id(vsmmu->iommufd, vsmmu->bypass_hwpt_id);
>          iommufd_backend_free_id(vsmmu->iommufd, vsmmu->abort_hwpt_id);
> diff --git a/hw/arm/smmuv3-accel.h b/hw/arm/smmuv3-accel.h
> index d81f90c32c..73b44cd7be 100644
> --- a/hw/arm/smmuv3-accel.h
> +++ b/hw/arm/smmuv3-accel.h
> @@ -27,9 +27,16 @@ typedef struct SMMUViommu {
>      QLIST_HEAD(, SMMUv3AccelDevice) device_list;
>  } SMMUViommu;
>  
> +typedef struct SMMUS1Hwpt {
> +    IOMMUFDBackend *iommufd;
> +    uint32_t hwpt_id;
> +} SMMUS1Hwpt;
> +
>  typedef struct SMMUv3AccelDevice {
>      SMMUDevice sdev;
>      HostIOMMUDeviceIOMMUFD *idev;
> +    SMMUS1Hwpt *s1_hwpt;
> +    IOMMUFDVdev *vdev;
>      SMMUViommu *vsmmu;
>      QLIST_ENTRY(SMMUv3AccelDevice) next;
>  } SMMUv3AccelDevice;
> @@ -40,10 +47,26 @@ typedef struct SMMUv3AccelState {
>  
>  #ifdef CONFIG_ARM_SMMUV3_ACCEL
>  void smmuv3_accel_init(SMMUv3State *s);
> +bool smmuv3_accel_install_nested_ste(SMMUv3State *s, SMMUDevice *sdev, int sid,
> +                                     Error **errp);
> +bool smmuv3_accel_install_nested_ste_range(SMMUv3State *s, SMMUSIDRange *range,
> +                                           Error **errp);
>  #else
>  static inline void smmuv3_accel_init(SMMUv3State *s)
>  {
>  }
> +static inline bool
> +smmuv3_accel_install_nested_ste(SMMUv3State *s, SMMUDevice *sdev, int sid,
> +                                Error **errp)
> +{
> +    return true;
> +}
> +static inline bool
> +smmuv3_accel_install_nested_ste_range(SMMUv3State *s, SMMUSIDRange *range,
> +                                      Error **errp)
> +{
> +    return true;
> +}
>  #endif
>  
>  #endif /* HW_ARM_SMMUV3_ACCEL_H */
> diff --git a/hw/arm/smmuv3-internal.h b/hw/arm/smmuv3-internal.h
> index 03d86cfc5c..5fd88b4257 100644
> --- a/hw/arm/smmuv3-internal.h
> +++ b/hw/arm/smmuv3-internal.h
> @@ -547,6 +547,9 @@ typedef struct CD {
>      uint32_t word[16];
>  } CD;
>  
> +int smmu_find_ste(SMMUv3State *s, uint32_t sid, STE *ste, SMMUEventInfo *event);
> +void smmuv3_flush_config(SMMUDevice *sdev);
> +
>  /* STE fields */
>  
>  #define STE_VALID(x)   extract32((x)->word[0], 0, 1)
> @@ -586,6 +589,23 @@ typedef struct CD {
>  #define SMMU_STE_VALID      (1ULL << 0)
>  #define SMMU_STE_CFG_BYPASS (1ULL << 3)
>  
> +#define STE0_V       MAKE_64BIT_MASK(0, 1)
> +#define STE0_CONFIG  MAKE_64BIT_MASK(1, 3)
> +#define STE0_S1FMT   MAKE_64BIT_MASK(4, 2)
> +#define STE0_CTXPTR  MAKE_64BIT_MASK(6, 50)
> +#define STE0_S1CDMAX MAKE_64BIT_MASK(59, 5)
> +#define STE0_MASK    (STE0_S1CDMAX | STE0_CTXPTR | STE0_S1FMT | STE0_CONFIG | \
> +                      STE0_V)
> +
> +#define STE1_S1DSS    MAKE_64BIT_MASK(0, 2)
> +#define STE1_S1CIR    MAKE_64BIT_MASK(2, 2)
> +#define STE1_S1COR    MAKE_64BIT_MASK(4, 2)
> +#define STE1_S1CSH    MAKE_64BIT_MASK(6, 2)
> +#define STE1_S1STALLD MAKE_64BIT_MASK(27, 1)
> +#define STE1_EATS     MAKE_64BIT_MASK(28, 2)
> +#define STE1_MASK     (STE1_EATS | STE1_S1STALLD | STE1_S1CSH | STE1_S1COR | \
> +                       STE1_S1CIR | STE1_S1DSS)
> +
>  #define SMMU_GBPA_ABORT (1UL << 20)
>  
>  static inline int oas2bits(int oas_field)
> diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
> index ef991cb7d8..1fd8aaa0c7 100644
> --- a/hw/arm/smmuv3.c
> +++ b/hw/arm/smmuv3.c
> @@ -630,8 +630,7 @@ bad_ste:
>   * Supports linear and 2-level stream table
>   * Return 0 on success, -EINVAL otherwise
>   */
> -static int smmu_find_ste(SMMUv3State *s, uint32_t sid, STE *ste,
> -                         SMMUEventInfo *event)
> +int smmu_find_ste(SMMUv3State *s, uint32_t sid, STE *ste, SMMUEventInfo *event)
>  {
>      dma_addr_t addr, strtab_base;
>      uint32_t log2size;
> @@ -900,7 +899,7 @@ static SMMUTransCfg *smmuv3_get_config(SMMUDevice *sdev, SMMUEventInfo *event)
>      return cfg;
>  }
>  
> -static void smmuv3_flush_config(SMMUDevice *sdev)
> +void smmuv3_flush_config(SMMUDevice *sdev)
>  {
>      SMMUv3State *s = sdev->smmu;
>      SMMUState *bc = &s->smmu_state;
> @@ -1330,6 +1329,7 @@ static int smmuv3_cmdq_consume(SMMUv3State *s)
>          {
>              uint32_t sid = CMD_SID(&cmd);
>              SMMUDevice *sdev = smmu_find_sdev(bs, sid);
> +            Error *local_err = NULL;
>  
>              if (CMD_SSEC(&cmd)) {
>                  cmd_error = SMMU_CERROR_ILL;
> @@ -1341,6 +1341,11 @@ static int smmuv3_cmdq_consume(SMMUv3State *s)
>              }
>  
>              trace_smmuv3_cmdq_cfgi_ste(sid);
> +            if (!smmuv3_accel_install_nested_ste(s, sdev, sid, &local_err)) {
> +                error_report_err(local_err);
> +                cmd_error = SMMU_CERROR_ILL;
> +                break;
> +            }
>              smmuv3_flush_config(sdev);
>  
>              break;
> @@ -1350,6 +1355,7 @@ static int smmuv3_cmdq_consume(SMMUv3State *s)
>              uint32_t sid = CMD_SID(&cmd), mask;
>              uint8_t range = CMD_STE_RANGE(&cmd);
>              SMMUSIDRange sid_range;
> +            Error *local_err = NULL;
>  
>              if (CMD_SSEC(&cmd)) {
>                  cmd_error = SMMU_CERROR_ILL;
> @@ -1361,6 +1367,12 @@ static int smmuv3_cmdq_consume(SMMUv3State *s)
>              sid_range.end = sid_range.start + mask;
>  
>              trace_smmuv3_cmdq_cfgi_ste_range(sid_range.start, sid_range.end);
> +            if (!smmuv3_accel_install_nested_ste_range(s, &sid_range,
> +                                                       &local_err)) {
> +                error_report_err(local_err);
> +                cmd_error = SMMU_CERROR_ILL;
> +                break;
> +            }
>              smmu_configs_inv_sid_range(bs, sid_range);
>              break;
>          }
> diff --git a/hw/arm/trace-events b/hw/arm/trace-events
> index 49c0460f30..2e0b1f8f6f 100644
> --- a/hw/arm/trace-events
> +++ b/hw/arm/trace-events
> @@ -69,6 +69,8 @@ smmu_reset_exit(void) ""
>  #smmuv3-accel.c
>  smmuv3_accel_set_iommu_device(int devfn, uint32_t sid) "devfn=0x%x (idev devid=0x%x)"
>  smmuv3_accel_unset_iommu_device(int devfn, uint32_t sid) "devfn=0x%x (idev devid=0x%x)"
> +smmuv3_accel_install_nested_ste(uint32_t sid, uint64_t ste_1, uint64_t ste_0) "sid=%d ste=%"PRIx64":%"PRIx64
> +smmuv3_accel_uninstall_nested_ste(uint32_t sid, const char *ste_cfg, uint32_t hwpt_id) "sid=%d attached %s hwpt_id=%u"
>  
>  # strongarm.c
>  strongarm_uart_update_parameters(const char *label, int speed, char parity, int data_bits, int stop_bits) "%s speed=%d parity=%c data=%d stop=%d"
Eric



^ permalink raw reply	[flat|nested] 148+ messages in thread

* RE: [PATCH v5 13/32] hw/arm/smmuv3-accel: Add nested vSTE install/uninstall support
  2025-11-04 11:05   ` Eric Auger
@ 2025-11-04 12:26     ` Shameer Kolothum
  2025-11-04 13:30       ` Eric Auger
  2025-11-04 16:48       ` Nicolin Chen
  0 siblings, 2 replies; 148+ messages in thread
From: Shameer Kolothum @ 2025-11-04 12:26 UTC (permalink / raw)
  To: eric.auger@redhat.com, qemu-arm@nongnu.org, qemu-devel@nongnu.org
  Cc: peter.maydell@linaro.org, Jason Gunthorpe, Nicolin Chen,
	ddutile@redhat.com, berrange@redhat.com, Nathan Chen, Matt Ochs,
	smostafa@google.com, wangzhou1@hisilicon.com,
	jiangkunkun@huawei.com, jonathan.cameron@huawei.com,
	zhangfei.gao@linaro.org, zhenzhong.duan@intel.com,
	yi.l.liu@intel.com, Krishnakant Jaju

Hi Eric,

> -----Original Message-----
> From: Eric Auger <eric.auger@redhat.com>
> Sent: 04 November 2025 11:06
> To: Shameer Kolothum <skolothumtho@nvidia.com>; qemu-
> arm@nongnu.org; qemu-devel@nongnu.org
> Cc: peter.maydell@linaro.org; Jason Gunthorpe <jgg@nvidia.com>; Nicolin
> Chen <nicolinc@nvidia.com>; ddutile@redhat.com; berrange@redhat.com;
> Nathan Chen <nathanc@nvidia.com>; Matt Ochs <mochs@nvidia.com>;
> smostafa@google.com; wangzhou1@hisilicon.com;
> jiangkunkun@huawei.com; jonathan.cameron@huawei.com;
> zhangfei.gao@linaro.org; zhenzhong.duan@intel.com; yi.l.liu@intel.com;
> Krishnakant Jaju <kjaju@nvidia.com>
> Subject: Re: [PATCH v5 13/32] hw/arm/smmuv3-accel: Add nested vSTE
> install/uninstall support
> 
> External email: Use caution opening links or attachments
> 
> 
> Hi Shameer,
> 
> On 10/31/25 11:49 AM, Shameer Kolothum wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> >
> > A device placed behind a vSMMU instance must have corresponding vSTEs
> > (bypass, abort, or translate) installed. The bypass and abort proxy
> > nested HWPTs are pre-allocated.
> >
> > For translat HWPT, a vDEVICE object is allocated and associated with
> > the vIOMMU for each guest device. This allows the host kernel to
> > establish a virtual SID to physical SID mapping, which is required for
> > handling invalidations and event reporting.
> >
> > An translate HWPT is allocated based on the guest STE configuration
> > and attached to the device when the guest issues SMMU_CMD_CFGI_STE or
> > SMMU_CMD_CFGI_STE_RANGE, provided the STE enables S1 translation.
> >
> > If the guest STE is invalid or S1 translation is disabled, the device
> > is attached to one of the pre-allocated ABORT or BYPASS HWPTs instead.
> >
> > While at it, export both smmu_find_ste() and smmuv3_flush_config() for
> > use here.
> >
> > Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> > Signed-off-by: Shameer Kolothum
> <shameerali.kolothum.thodi@huawei.com>
> > Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> > Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
> > ---
> >  hw/arm/smmuv3-accel.c    | 193
> +++++++++++++++++++++++++++++++++++++++
> >  hw/arm/smmuv3-accel.h    |  23 +++++
> >  hw/arm/smmuv3-internal.h |  20 ++++
> >  hw/arm/smmuv3.c          |  18 +++-
> >  hw/arm/trace-events      |   2 +
> >  5 files changed, 253 insertions(+), 3 deletions(-)
> >
> > diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c index
> > d4d65299a8..c74e95a0ea 100644
> > --- a/hw/arm/smmuv3-accel.c
> > +++ b/hw/arm/smmuv3-accel.c
> > @@ -28,6 +28,191 @@ MemoryRegion root;  MemoryRegion sysmem;
> static
> > AddressSpace *shared_as_sysmem;
> >
> > +static bool
> > +smmuv3_accel_alloc_vdev(SMMUv3AccelDevice *accel_dev, int sid, Error
> > +**errp) {
> > +    SMMUViommu *vsmmu = accel_dev->vsmmu;
> > +    IOMMUFDVdev *vdev;
> > +    uint32_t vdevice_id;
> > +
> > +    if (!accel_dev->idev || accel_dev->vdev) {
> > +        return true;
> > +    }
> > +
> > +    if (!iommufd_backend_alloc_vdev(vsmmu->iommufd, accel_dev->idev-
> >devid,
> > +                                    vsmmu->viommu.viommu_id, sid,
> > +                                    &vdevice_id, errp)) {
> > +            return false;
> > +    }
> > +    if (!host_iommu_device_iommufd_attach_hwpt(accel_dev->idev,
> > +                                               vsmmu->bypass_hwpt_id, errp)) {
> > +        iommufd_backend_free_id(vsmmu->iommufd, vdevice_id);
> > +        return false;
> > +    }
> > +
> > +    vdev = g_new(IOMMUFDVdev, 1);
> > +    vdev->vdevice_id = vdevice_id;
> > +    vdev->virt_id = sid;
> > +    accel_dev->vdev = vdev;
> > +    return true;
> > +}
> > +
> > +static bool
> > +smmuv3_accel_dev_uninstall_nested_ste(SMMUv3AccelDevice *accel_dev, bool abort,
> > +                                      Error **errp)
> > +{
> > +    HostIOMMUDeviceIOMMUFD *idev = accel_dev->idev;
> > +    SMMUS1Hwpt *s1_hwpt = accel_dev->s1_hwpt;
> > +    uint32_t hwpt_id;
> > +
> > +    if (!s1_hwpt || !accel_dev->vsmmu) {
> > +        return true;
> > +    }
> > +
> > +    if (abort) {
> > +        hwpt_id = accel_dev->vsmmu->abort_hwpt_id;
> > +    } else {
> > +        hwpt_id = accel_dev->vsmmu->bypass_hwpt_id;
> > +    }
> > +
> > +    if (!host_iommu_device_iommufd_attach_hwpt(idev, hwpt_id, errp)) {
> > +        return false;
> > +    }
> > +    trace_smmuv3_accel_uninstall_nested_ste(smmu_get_sid(&accel_dev-
> >sdev),
> > +                                            abort ? "abort" : "bypass",
> > +                                            hwpt_id);
> > +
> > +    iommufd_backend_free_id(s1_hwpt->iommufd, s1_hwpt->hwpt_id);
> > +    accel_dev->s1_hwpt = NULL;
> > +    g_free(s1_hwpt);
> > +    return true;
> > +}
> > +
> > +static bool
> > +smmuv3_accel_dev_install_nested_ste(SMMUv3AccelDevice *accel_dev,
> > +                                    uint32_t data_type, uint32_t data_len,
> > +                                    void *data, Error **errp)
> the name is very close to the caller function, ie.
> smmuv3_accel_install_nested_ste which also takes a sdev.
> I would rename to smmuv3_accel_install_hwpt() or something alike

This one is going to change a bit based on Nicolin's feedback on taking 
care of SMMUEN/GBPA values.
https://lore.kernel.org/all/aQVLzfaxxSfw1HBL@Asurada-Nvidia/

Probably smmuv3_accel_attach_nested_hwpt() suits better considering
that’s what it finally ends up doing.

> > +{
> > +    SMMUViommu *vsmmu = accel_dev->vsmmu;
> > +    SMMUS1Hwpt *s1_hwpt = accel_dev->s1_hwpt;
> > +    HostIOMMUDeviceIOMMUFD *idev = accel_dev->idev;
> > +    uint32_t flags = 0;
> > +
> > +    if (!idev || !vsmmu) {
> > +        error_setg(errp, "Device 0x%x has no associated IOMMU dev or
> vIOMMU",
> > +                   smmu_get_sid(&accel_dev->sdev));
> > +        return false;
> > +    }
> > +
> > +    if (s1_hwpt) {
> > +        if (!smmuv3_accel_dev_uninstall_nested_ste(accel_dev, true, errp)) {
> > +            return false;
> > +        }
> > +    }
> > +
> > +    s1_hwpt = g_new0(SMMUS1Hwpt, 1);
> > +    s1_hwpt->iommufd = idev->iommufd;
> > +    if (!iommufd_backend_alloc_hwpt(idev->iommufd, idev->devid,
> > +                                    vsmmu->viommu.viommu_id, flags,
> > +                                    data_type, data_len, data,
> > +                                    &s1_hwpt->hwpt_id, errp)) {
> > +        return false;
> > +    }
> > +
> > +    if (!host_iommu_device_iommufd_attach_hwpt(idev, s1_hwpt-
> >hwpt_id, errp)) {
> > +        iommufd_backend_free_id(idev->iommufd, s1_hwpt->hwpt_id);
> > +        return false;
> > +    }
> > +    accel_dev->s1_hwpt = s1_hwpt;
> > +    return true;
> > +}
> > +
> > +bool
> > +smmuv3_accel_install_nested_ste(SMMUv3State *s, SMMUDevice *sdev,
> int sid,
> > +                                Error **errp) {
> > +    SMMUv3AccelDevice *accel_dev;
> > +    SMMUEventInfo event = {.type = SMMU_EVT_NONE, .sid = sid,
> > +                           .inval_ste_allowed = true};
> > +    struct iommu_hwpt_arm_smmuv3 nested_data = {};
> > +    uint64_t ste_0, ste_1;
> > +    uint32_t config;
> > +    STE ste;
> > +    int ret;
> > +
> > +    if (!s->accel) {
> don't you want to check !s->vsmmu as well done in
> smmuv3_accel_install_nested_ste_range()

Nicolin has a suggestion to merge struct SMMUViommu and
SMMUv3AccelState into one and avoid the extra layering. I will
attempt that, and all these checks might change as a result.
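
(For context, the merged structure would roughly look like the sketch below;
the exact field set is an assumption based on the current SMMUViommu /
SMMUv3AccelState split:)

    typedef struct SMMUv3AccelState {
        IOMMUFDViommu viommu;
        IOMMUFDBackend *iommufd;
        uint32_t bypass_hwpt_id;
        uint32_t abort_hwpt_id;
        QLIST_HEAD(, SMMUv3AccelDevice) device_list;
    } SMMUv3AccelState;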

> > +        return true;
> > +    }
> > +
> > +    accel_dev = container_of(sdev, SMMUv3AccelDevice, sdev);
> > +    if (!accel_dev->vsmmu) {
> > +        return true;
> > +    }
> > +
> > +    if (!smmuv3_accel_alloc_vdev(accel_dev, sid, errp)) {
> > +        return false;
> > +    }
> > +
> > +    ret = smmu_find_ste(sdev->smmu, sid, &ste, &event);
> > +    if (ret) {
> > +        error_setg(errp, "Failed to find STE for Device 0x%x", sid);
> > +        return true;
> returning true while setting errp looks wrong to me.

Right, will just return true from here. I am not sure under what circumstances
we would hit this though.

> +    }
> +
> +    config = STE_CONFIG(&ste);
> +    if (!STE_VALID(&ste) || !STE_CFG_S1_ENABLED(config)) {
> +        if (!smmuv3_accel_dev_uninstall_nested_ste(accel_dev,
> +                                                   STE_CFG_ABORT(config),
> +                                                   errp)) {
> +            return false;
> +        }
> +        smmuv3_flush_config(sdev);
> +        return true;
> +    }
> +
> +    ste_0 = (uint64_t)ste.word[0] | (uint64_t)ste.word[1] << 32;
> +    ste_1 = (uint64_t)ste.word[2] | (uint64_t)ste.word[3] << 32;
> +    nested_data.ste[0] = cpu_to_le64(ste_0 & STE0_MASK);
> +    nested_data.ste[1] = cpu_to_le64(ste_1 & STE1_MASK);
> +
> +    if (!smmuv3_accel_dev_install_nested_ste(accel_dev,
> +                                             IOMMU_HWPT_DATA_ARM_SMMUV3,
> +                                             sizeof(nested_data),
> +                                             &nested_data, errp)) {
> +        error_append_hint(errp, "Unable to install sid=0x%x nested STE="
> +                          "0x%"PRIx64":=0x%"PRIx64"", sid,
nit: why ":=" between both 64b?
> +                          (uint64_t)le64_to_cpu(nested_data.ste[1]),
> +                          (uint64_t)le64_to_cpu(nested_data.ste[0]));
> +        return false;
in case of various failure cases, do we need to free the vdev?

I don't think we need to free the vdev corresponding to this vSID on failures
here. I think the association between vSID and host SID can remain
intact even if the nested HWPT alloc/attach fails for whatever reason.

Thanks,
Shameer

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 14/32] hw/arm/smmuv3-accel: Install SMMUv3 GBPA based hwpt
  2025-10-31 10:49 ` [PATCH v5 14/32] hw/arm/smmuv3-accel: Install SMMUv3 GBPA based hwpt Shameer Kolothum
@ 2025-11-04 13:28   ` Eric Auger
  0 siblings, 0 replies; 148+ messages in thread
From: Eric Auger @ 2025-11-04 13:28 UTC (permalink / raw)
  To: Shameer Kolothum, qemu-arm, qemu-devel
  Cc: peter.maydell, jgg, nicolinc, ddutile, berrange, nathanc, mochs,
	smostafa, wangzhou1, jiangkunkun, jonathan.cameron, zhangfei.gao,
	zhenzhong.duan, yi.l.liu, kjaju

Hi Shameer,

On 10/31/25 11:49 AM, Shameer Kolothum wrote:
> When the Guest reboots or updates the GBPA we need to attach a nested HWPT
> based on the GBPA register values.
In practice you only take into account the GBPA.ABORT bit. Also remind what
the latter does, i.e.
"attach a nested HWPT based on the new GBPA.ABORT bit, which either
aborts all incoming transactions or bypasses them"
>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
> ---
>  hw/arm/smmuv3-accel.c | 42 ++++++++++++++++++++++++++++++++++++++++++
>  hw/arm/smmuv3-accel.h |  8 ++++++++
>  hw/arm/smmuv3.c       |  2 ++
>  3 files changed, 52 insertions(+)
>
> diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
> index c74e95a0ea..0573ae3772 100644
> --- a/hw/arm/smmuv3-accel.c
> +++ b/hw/arm/smmuv3-accel.c
> @@ -479,6 +479,48 @@ static const PCIIOMMUOps smmuv3_accel_ops = {
>      .unset_iommu_device = smmuv3_accel_unset_iommu_device,
>  };
>  
> +
> +/* Based on SMUUv3 GBPA configuration, attach a corresponding HWPT */
> +void smmuv3_accel_gbpa_update(SMMUv3State *s)
> +{
> +    SMMUv3AccelDevice *accel_dev;
> +    Error *local_err = NULL;
> +    SMMUViommu *vsmmu;
> +    uint32_t hwpt_id;
> +
> +    if (!s->accel || !s->s_accel->vsmmu) {
> +        return;
> +    }
> +
> +    vsmmu = s->s_accel->vsmmu;
> +    /*
> +     * The Linux kernel does not allow configuring GBPA MemAttr, MTCFG,
> +     * ALLOCCFG, SHCFG, PRIVCFG, or INSTCFG fields for a vSTE. Host kernel
I think in general we should avoid making any assumptions about Linux
kernel capabilities.
> +     * has final control over these parameters. Hence, use one of the
That seems contradictory to the above statement.
> +     * pre-allocated HWPTs depending on GBPA.ABORT value.
I would remove the comment
> +     */
> +    if (s->gbpa & SMMU_GBPA_ABORT) {
> +        hwpt_id = vsmmu->abort_hwpt_id;
> +    } else {
> +        hwpt_id = vsmmu->bypass_hwpt_id;
> +    }
> +
> +    QLIST_FOREACH(accel_dev, &vsmmu->device_list, next) {
> +        if (!host_iommu_device_iommufd_attach_hwpt(accel_dev->idev, hwpt_id,
> +                                                   &local_err)) {
> +            error_append_hint(&local_err, "Failed to attach GBPA hwpt id %u "
> +                              "for dev id %u", hwpt_id, accel_dev->idev->devid);
> +            error_report_err(local_err);
> +        }
> +    }
> +}
> +
> +void smmuv3_accel_reset(SMMUv3State *s)
> +{
> +     /* Attach a HWPT based on GBPA reset value */
> +     smmuv3_accel_gbpa_update(s);
> +}
> +
>  static void smmuv3_accel_as_init(SMMUv3State *s)
>  {
>  
> diff --git a/hw/arm/smmuv3-accel.h b/hw/arm/smmuv3-accel.h
> index 73b44cd7be..8931e83dc5 100644
> --- a/hw/arm/smmuv3-accel.h
> +++ b/hw/arm/smmuv3-accel.h
> @@ -51,6 +51,8 @@ bool smmuv3_accel_install_nested_ste(SMMUv3State *s, SMMUDevice *sdev, int sid,
>                                       Error **errp);
>  bool smmuv3_accel_install_nested_ste_range(SMMUv3State *s, SMMUSIDRange *range,
>                                             Error **errp);
> +void smmuv3_accel_gbpa_update(SMMUv3State *s);
> +void smmuv3_accel_reset(SMMUv3State *s);
>  #else
>  static inline void smmuv3_accel_init(SMMUv3State *s)
>  {
> @@ -67,6 +69,12 @@ smmuv3_accel_install_nested_ste_range(SMMUv3State *s, SMMUSIDRange *range,
>  {
>      return true;
>  }
> +static inline void smmuv3_accel_gbpa_update(SMMUv3State *s)
> +{
> +}
> +static inline void smmuv3_accel_reset(SMMUv3State *s)
> +{
> +}
>  #endif
>  
>  #endif /* HW_ARM_SMMUV3_ACCEL_H */
> diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
> index 1fd8aaa0c7..cc32b618ed 100644
> --- a/hw/arm/smmuv3.c
> +++ b/hw/arm/smmuv3.c
> @@ -1603,6 +1603,7 @@ static MemTxResult smmu_writel(SMMUv3State *s, hwaddr offset,
>          if (data & R_GBPA_UPDATE_MASK) {
>              /* Ignore update bit as write is synchronous. */
>              s->gbpa = data & ~R_GBPA_UPDATE_MASK;
> +            smmuv3_accel_gbpa_update(s);
>          }
>          return MEMTX_OK;
>      case A_STRTAB_BASE: /* 64b */
> @@ -1885,6 +1886,7 @@ static void smmu_reset_exit(Object *obj, ResetType type)
>      }
>  
>      smmuv3_init_regs(s);
> +    smmuv3_accel_reset(s);
>  }
>  
>  static void smmu_realize(DeviceState *d, Error **errp)
Besides

Reviewed-by: Eric Auger <eric.auger@redhat.com>

Eric
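
For context: the register write that reaches smmu_writel() above is the
architected GBPA update sequence. A minimal guest-side sketch, assuming the
architected bit positions (UPDATE is bit 31, ABORT is bit 20) and not taken
from this series:

    #include <stdint.h>

    #define GBPA_UPDATE  (1U << 31)
    #define GBPA_ABORT   (1U << 20)

    /*
     * Ask the SMMU to abort all incoming transactions while SMMUEN is clear.
     * The write with UPDATE set is what smmu_writel() traps and, with this
     * patch, what ends up calling smmuv3_accel_gbpa_update().
     */
    static void gbpa_set_abort(volatile uint32_t *gbpa)
    {
        *gbpa = GBPA_UPDATE | GBPA_ABORT;
        while (*gbpa & GBPA_UPDATE) {
            /* wait for the (v)SMMU to acknowledge the update */
        }
    }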



^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 13/32] hw/arm/smmuv3-accel: Add nested vSTE install/uninstall support
  2025-11-04 12:26     ` Shameer Kolothum
@ 2025-11-04 13:30       ` Eric Auger
  2025-11-04 16:48       ` Nicolin Chen
  1 sibling, 0 replies; 148+ messages in thread
From: Eric Auger @ 2025-11-04 13:30 UTC (permalink / raw)
  To: Shameer Kolothum, qemu-arm@nongnu.org, qemu-devel@nongnu.org
  Cc: peter.maydell@linaro.org, Jason Gunthorpe, Nicolin Chen,
	ddutile@redhat.com, berrange@redhat.com, Nathan Chen, Matt Ochs,
	smostafa@google.com, wangzhou1@hisilicon.com,
	jiangkunkun@huawei.com, jonathan.cameron@huawei.com,
	zhangfei.gao@linaro.org, zhenzhong.duan@intel.com,
	yi.l.liu@intel.com, Krishnakant Jaju



On 11/4/25 1:26 PM, Shameer Kolothum wrote:
> Hi Eric,
>
>> -----Original Message-----
>> From: Eric Auger <eric.auger@redhat.com>
>> Sent: 04 November 2025 11:06
>> To: Shameer Kolothum <skolothumtho@nvidia.com>; qemu-
>> arm@nongnu.org; qemu-devel@nongnu.org
>> Cc: peter.maydell@linaro.org; Jason Gunthorpe <jgg@nvidia.com>; Nicolin
>> Chen <nicolinc@nvidia.com>; ddutile@redhat.com; berrange@redhat.com;
>> Nathan Chen <nathanc@nvidia.com>; Matt Ochs <mochs@nvidia.com>;
>> smostafa@google.com; wangzhou1@hisilicon.com;
>> jiangkunkun@huawei.com; jonathan.cameron@huawei.com;
>> zhangfei.gao@linaro.org; zhenzhong.duan@intel.com; yi.l.liu@intel.com;
>> Krishnakant Jaju <kjaju@nvidia.com>
>> Subject: Re: [PATCH v5 13/32] hw/arm/smmuv3-accel: Add nested vSTE
>> install/uninstall support
>>
>> External email: Use caution opening links or attachments
>>
>>
>> Hi Shameer,
>>
>> On 10/31/25 11:49 AM, Shameer Kolothum wrote:
>>> From: Nicolin Chen <nicolinc@nvidia.com>
>>>
>>> A device placed behind a vSMMU instance must have corresponding vSTEs
>>> (bypass, abort, or translate) installed. The bypass and abort proxy
>>> nested HWPTs are pre-allocated.
>>>
>>> For the translate HWPT, a vDEVICE object is allocated and associated with
>>> the vIOMMU for each guest device. This allows the host kernel to
>>> establish a virtual SID to physical SID mapping, which is required for
>>> handling invalidations and event reporting.
>>>
>>> A translate HWPT is allocated based on the guest STE configuration
>>> and attached to the device when the guest issues SMMU_CMD_CFGI_STE or
>>> SMMU_CMD_CFGI_STE_RANGE, provided the STE enables S1 translation.
>>>
>>> If the guest STE is invalid or S1 translation is disabled, the device
>>> is attached to one of the pre-allocated ABORT or BYPASS HWPTs instead.
>>>
>>> While at it, export both smmu_find_ste() and smmuv3_flush_config() for
>>> use here.
>>>
>>> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
>>> Signed-off-by: Shameer Kolothum
>> <shameerali.kolothum.thodi@huawei.com>
>>> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
>>> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
>>> ---
>>>  hw/arm/smmuv3-accel.c    | 193
>> +++++++++++++++++++++++++++++++++++++++
>>>  hw/arm/smmuv3-accel.h    |  23 +++++
>>>  hw/arm/smmuv3-internal.h |  20 ++++
>>>  hw/arm/smmuv3.c          |  18 +++-
>>>  hw/arm/trace-events      |   2 +
>>>  5 files changed, 253 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c index
>>> d4d65299a8..c74e95a0ea 100644
>>> --- a/hw/arm/smmuv3-accel.c
>>> +++ b/hw/arm/smmuv3-accel.c
>>> @@ -28,6 +28,191 @@ MemoryRegion root;  MemoryRegion sysmem;
>> static
>>> AddressSpace *shared_as_sysmem;
>>>
>>> +static bool
>>> +smmuv3_accel_alloc_vdev(SMMUv3AccelDevice *accel_dev, int sid, Error
>>> +**errp) {
>>> +    SMMUViommu *vsmmu = accel_dev->vsmmu;
>>> +    IOMMUFDVdev *vdev;
>>> +    uint32_t vdevice_id;
>>> +
>>> +    if (!accel_dev->idev || accel_dev->vdev) {
>>> +        return true;
>>> +    }
>>> +
>>> +    if (!iommufd_backend_alloc_vdev(vsmmu->iommufd, accel_dev->idev-
>>> devid,
>>> +                                    vsmmu->viommu.viommu_id, sid,
>>> +                                    &vdevice_id, errp)) {
>>> +            return false;
>>> +    }
>>> +    if (!host_iommu_device_iommufd_attach_hwpt(accel_dev->idev,
>>> +                                               vsmmu->bypass_hwpt_id, errp)) {
>>> +        iommufd_backend_free_id(vsmmu->iommufd, vdevice_id);
>>> +        return false;
>>> +    }
>>> +
>>> +    vdev = g_new(IOMMUFDVdev, 1);
>>> +    vdev->vdevice_id = vdevice_id;
>>> +    vdev->virt_id = sid;
>>> +    accel_dev->vdev = vdev;
>>> +    return true;
>>> +}
>>> +
>>> +static bool
>>> +smmuv3_accel_dev_uninstall_nested_ste(SMMUv3AccelDevice *accel_dev,
>> bool abort,
>>> +                                      Error **errp)
>>> +{
>>> +    HostIOMMUDeviceIOMMUFD *idev = accel_dev->idev;
>>> +    SMMUS1Hwpt *s1_hwpt = accel_dev->s1_hwpt;
>>> +    uint32_t hwpt_id;
>>> +
>>> +    if (!s1_hwpt || !accel_dev->vsmmu) {
>>> +        return true;
>>> +    }
>>> +
>>> +    if (abort) {
>>> +        hwpt_id = accel_dev->vsmmu->abort_hwpt_id;
>>> +    } else {
>>> +        hwpt_id = accel_dev->vsmmu->bypass_hwpt_id;
>>> +    }
>>> +
>>> +    if (!host_iommu_device_iommufd_attach_hwpt(idev, hwpt_id, errp)) {
>>> +        return false;
>>> +    }
>>> +    trace_smmuv3_accel_uninstall_nested_ste(smmu_get_sid(&accel_dev-
>>> sdev),
>>> +                                            abort ? "abort" : "bypass",
>>> +                                            hwpt_id);
>>> +
>>> +    iommufd_backend_free_id(s1_hwpt->iommufd, s1_hwpt->hwpt_id);
>>> +    accel_dev->s1_hwpt = NULL;
>>> +    g_free(s1_hwpt);
>>> +    return true;
>>> +}
>>> +
>>> +static bool
>>> +smmuv3_accel_dev_install_nested_ste(SMMUv3AccelDevice *accel_dev,
>>> +                                    uint32_t data_type, uint32_t data_len,
>>> +                                    void *data, Error **errp)
>> the name is very close to that of the caller function, i.e.
>> smmuv3_accel_install_nested_ste(), which also takes an sdev.
>> I would rename it to smmuv3_accel_install_hwpt() or something similar
> This one is going to change a bit based on Nicolin's feedback on taking 
> care of SMMUEN/GBPA values.
> https://lore.kernel.org/all/aQVLzfaxxSfw1HBL@Asurada-Nvidia/

OK
>
> Probably smmuv3_accel_attach_nested_hwpt() suits better considering
> that’s what it finally ends up doing.
>
>>> +{
>>> +    SMMUViommu *vsmmu = accel_dev->vsmmu;
>>> +    SMMUS1Hwpt *s1_hwpt = accel_dev->s1_hwpt;
>>> +    HostIOMMUDeviceIOMMUFD *idev = accel_dev->idev;
>>> +    uint32_t flags = 0;
>>> +
>>> +    if (!idev || !vsmmu) {
>>> +        error_setg(errp, "Device 0x%x has no associated IOMMU dev or
>> vIOMMU",
>>> +                   smmu_get_sid(&accel_dev->sdev));
>>> +        return false;
>>> +    }
>>> +
>>> +    if (s1_hwpt) {
>>> +        if (!smmuv3_accel_dev_uninstall_nested_ste(accel_dev, true, errp)) {
>>> +            return false;
>>> +        }
>>> +    }
>>> +
>>> +    s1_hwpt = g_new0(SMMUS1Hwpt, 1);
>>> +    s1_hwpt->iommufd = idev->iommufd;
>>> +    if (!iommufd_backend_alloc_hwpt(idev->iommufd, idev->devid,
>>> +                                    vsmmu->viommu.viommu_id, flags,
>>> +                                    data_type, data_len, data,
>>> +                                    &s1_hwpt->hwpt_id, errp)) {
>>> +        return false;
>>> +    }
>>> +
>>> +    if (!host_iommu_device_iommufd_attach_hwpt(idev, s1_hwpt-
>>> hwpt_id, errp)) {
>>> +        iommufd_backend_free_id(idev->iommufd, s1_hwpt->hwpt_id);
>>> +        return false;
>>> +    }
>>> +    accel_dev->s1_hwpt = s1_hwpt;
>>> +    return true;
>>> +}
>>> +
>>> +bool
>>> +smmuv3_accel_install_nested_ste(SMMUv3State *s, SMMUDevice *sdev,
>> int sid,
>>> +                                Error **errp) {
>>> +    SMMUv3AccelDevice *accel_dev;
>>> +    SMMUEventInfo event = {.type = SMMU_EVT_NONE, .sid = sid,
>>> +                           .inval_ste_allowed = true};
>>> +    struct iommu_hwpt_arm_smmuv3 nested_data = {};
>>> +    uint64_t ste_0, ste_1;
>>> +    uint32_t config;
>>> +    STE ste;
>>> +    int ret;
>>> +
>>> +    if (!s->accel) {
>> don't you want to check !s->vsmmu as well done in
>> smmuv3_accel_install_nested_ste_range()
> Nicolin has a suggestion to merge struct SMMUViommu and
> SMMUv3AccelState into one and avoid the extra layering. I will
> attempt that and all these checks might change as a result.

interesting idea indeed.
>
>>> +        return true;
>>> +    }
>>> +
>>> +    accel_dev = container_of(sdev, SMMUv3AccelDevice, sdev);
>>> +    if (!accel_dev->vsmmu) {
>>> +        return true;
>>> +    }
>>> +
>>> +    if (!smmuv3_accel_alloc_vdev(accel_dev, sid, errp)) {
>>> +        return false;
>>> +    }
>>> +
>>> +    ret = smmu_find_ste(sdev->smmu, sid, &ste, &event);
>>> +    if (ret) {
>>> +        error_setg(errp, "Failed to find STE for Device 0x%x", sid);
>>> +        return true;
>> returning true while setting errp looks wrong to me.
> Right, will just return true from here. I am not sure under what circumstances
> we will hit this though.
>
>> +    }
>> +
>> +    config = STE_CONFIG(&ste);
>> +    if (!STE_VALID(&ste) || !STE_CFG_S1_ENABLED(config)) {
>> +        if (!smmuv3_accel_dev_uninstall_nested_ste(accel_dev,
>> +                                                   STE_CFG_ABORT(config),
>> +                                                   errp)) {
>> +            return false;
>> +        }
>> +        smmuv3_flush_config(sdev);
>> +        return true;
>> +    }
>> +
>> +    ste_0 = (uint64_t)ste.word[0] | (uint64_t)ste.word[1] << 32;
>> +    ste_1 = (uint64_t)ste.word[2] | (uint64_t)ste.word[3] << 32;
>> +    nested_data.ste[0] = cpu_to_le64(ste_0 & STE0_MASK);
>> +    nested_data.ste[1] = cpu_to_le64(ste_1 & STE1_MASK);
>> +
>> +    if (!smmuv3_accel_dev_install_nested_ste(accel_dev,
>> +                                             IOMMU_HWPT_DATA_ARM_SMMUV3,
>> +                                             sizeof(nested_data),
>> +                                             &nested_data, errp)) {
>> +        error_append_hint(errp, "Unable to install sid=0x%x nested STE="
>> +                          "0x%"PRIx64":=0x%"PRIx64"", sid,
> nit: why ":=" between both 64b?
>> +                          (uint64_t)le64_to_cpu(nested_data.ste[1]),
>> +                          (uint64_t)le64_to_cpu(nested_data.ste[0]));
>> +        return false;
> in the various failure cases, do we need to free the vdev?
>
> I don't think we need to free the vdev corresponding to this vSID on failures
> here. I think the association between vSID and host SID can remain
> intact even if the nested HWPT alloc/attach fails for whatever reason.

OK then worth a comment

Eric
>
> Thanks,
> Shameer
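
As a strawman for that comment (illustrative wording only, not from the
series), something along these lines above the failure returns in the
install path would capture the point:

    /*
     * Note: accel_dev->vdev is deliberately kept on failure. The vSID to
     * host SID association set up by smmuv3_accel_alloc_vdev() stays valid
     * and can be reused when the guest reissues SMMU_CMD_CFGI_STE.
     */
    return false;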



^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional get_msi_address_space() callback
  2025-10-31 10:49 ` [PATCH v5 15/32] hw/pci/pci: Introduce optional get_msi_address_space() callback Shameer Kolothum
@ 2025-11-04 14:11   ` Eric Auger
  2025-11-04 14:20     ` Jason Gunthorpe
  2025-11-04 14:37     ` Shameer Kolothum
  0 siblings, 2 replies; 148+ messages in thread
From: Eric Auger @ 2025-11-04 14:11 UTC (permalink / raw)
  To: Shameer Kolothum, qemu-arm, qemu-devel
  Cc: peter.maydell, jgg, nicolinc, ddutile, berrange, nathanc, mochs,
	smostafa, wangzhou1, jiangkunkun, jonathan.cameron, zhangfei.gao,
	zhenzhong.duan, yi.l.liu, kjaju

Hi Shameer, Nicolin,

On 10/31/25 11:49 AM, Shameer Kolothum wrote:
> On ARM, devices behind an IOMMU have their MSI doorbell addresses
> translated by the IOMMU. In nested mode, this translation happens in
> two stages (gIOVA → gPA → ITS page).
>
> In accelerated SMMUv3 mode, both stages are handled by hardware, so
> get_address_space() returns the system address space so that VFIO
> can setup stage-2 mappings for system address space.

Sorry but I still don't catch the above. Can you explain (most probably
again) why this is a requirement to return the system as so that VFIO
can setup stage-2 mappings for system address space. I am sorry for
insisting (at the risk of being stubborn or dumb) but I fail to
understand the requirement. As far as I remember the way I integrated it
at the old times did not require that change:
https://lore.kernel.org/all/20210411120912.15770-1-eric.auger@redhat.com/
I used a vfio_prereg_listener to force the S2 mapping.

What has changed that now forces us to have this gymnastics?


>
> However, QEMU/KVM also calls this callback when resolving
> MSI doorbells:
>
>   kvm_irqchip_add_msi_route()
>     kvm_arch_fixup_msi_route()
>       pci_device_iommu_address_space()
>         get_address_space()
>
> A VFIO device in the guest behind an SMMUv3 is programmed with a gIOVA for
> the MSI doorbell. This gIOVA can't be used to set up the MSI doorbell
> directly. It needs to be translated to the vITS gPA. In order to do the
> doorbell translation it needs the IOMMU address space.
>
> Add an optional get_msi_address_space() callback and use it in this
> path to return the correct address space for such cases.
>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Suggested-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> Reviewed-by Nicolin Chen <nicolinc@nvidia.com>
> Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
> ---
>  hw/pci/pci.c         | 18 ++++++++++++++++++
>  include/hw/pci/pci.h | 16 ++++++++++++++++
>  target/arm/kvm.c     |  2 +-
>  3 files changed, 35 insertions(+), 1 deletion(-)
>
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index fa9cf5dab2..1edd711247 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -2982,6 +2982,24 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
>      return &address_space_memory;
>  }
>  
> +AddressSpace *pci_device_iommu_msi_address_space(PCIDevice *dev)
> +{
> +    PCIBus *bus;
> +    PCIBus *iommu_bus;
> +    int devfn;
> +
> +    pci_device_get_iommu_bus_devfn(dev, &iommu_bus, &bus, &devfn);
> +    if (iommu_bus) {
> +        if (iommu_bus->iommu_ops->get_msi_address_space) {
> +            return iommu_bus->iommu_ops->get_msi_address_space(bus,
> +                                 iommu_bus->iommu_opaque, devfn);
> +        }
> +        return iommu_bus->iommu_ops->get_address_space(bus,
> +                                 iommu_bus->iommu_opaque, devfn);
> +    }
> +    return &address_space_memory;
> +}
> +
>  int pci_iommu_init_iotlb_notifier(PCIDevice *dev, IOMMUNotifier *n,
>                                    IOMMUNotify fn, void *opaque)
>  {
> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> index dfeba8c9bd..b731443c67 100644
> --- a/include/hw/pci/pci.h
> +++ b/include/hw/pci/pci.h
> @@ -664,6 +664,21 @@ typedef struct PCIIOMMUOps {
>                              uint32_t pasid, bool priv_req, bool exec_req,
>                              hwaddr addr, bool lpig, uint16_t prgi, bool is_read,
>                              bool is_write);
> +    /**
> +     * @get_msi_address_space: get the address space for MSI doorbell address
> +     * for devices
> +     *
> +     * Optional callback which returns a pointer to an #AddressSpace. This
> +     * is required if MSI doorbell also gets translated through vIOMMU(eg: ARM)
> +     *
> +     * @bus: the #PCIBus being accessed.
> +     *
> +     * @opaque: the data passed to pci_setup_iommu().
> +     *
> +     * @devfn: device and function number
> +     */
> +    AddressSpace * (*get_msi_address_space)(PCIBus *bus, void *opaque,
> +                                            int devfn);
>  } PCIIOMMUOps;
>  
>  bool pci_device_get_iommu_bus_devfn(PCIDevice *dev, PCIBus **piommu_bus,
> @@ -672,6 +687,7 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
>  bool pci_device_set_iommu_device(PCIDevice *dev, HostIOMMUDevice *hiod,
>                                   Error **errp);
>  void pci_device_unset_iommu_device(PCIDevice *dev);
> +AddressSpace *pci_device_iommu_msi_address_space(PCIDevice *dev);
>  
>  /**
>   * pci_device_get_viommu_flags: get vIOMMU flags.
> diff --git a/target/arm/kvm.c b/target/arm/kvm.c
> index 0d57081e69..0df41128d0 100644
> --- a/target/arm/kvm.c
> +++ b/target/arm/kvm.c
> @@ -1611,7 +1611,7 @@ int kvm_arm_set_irq(int cpu, int irqtype, int irq, int level)
>  int kvm_arch_fixup_msi_route(struct kvm_irq_routing_entry *route,
>                               uint64_t address, uint32_t data, PCIDevice *dev)
>  {
> -    AddressSpace *as = pci_device_iommu_address_space(dev);
> +    AddressSpace *as = pci_device_iommu_msi_address_space(dev);
>      hwaddr xlat, len, doorbell_gpa;
>      MemoryRegionSection mrs;
>      MemoryRegion *mr;

Eric
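
To illustrate the vIOMMU side that the pci.c change above expects, here is a
rough sketch only; the helper names (smmuv3_accel_find_add_as(),
shared_as_sysmem) are placeholders and not the series' actual code:

    /*
     * DMA from a device behind the accelerated SMMUv3 is mapped by VFIO at
     * stage-2, so get_address_space() returns (an alias of) the system
     * address space. MSI doorbell resolution, however, still needs the
     * stage-1 view so that kvm_arch_fixup_msi_route() can translate the
     * gIOVA the guest programmed into the vITS doorbell gPA.
     */
    static AddressSpace *smmuv3_accel_get_address_space(PCIBus *bus,
                                                        void *opaque, int devfn)
    {
        return shared_as_sysmem;                    /* stage-2 / system view */
    }

    static AddressSpace *smmuv3_accel_get_msi_address_space(PCIBus *bus,
                                                             void *opaque,
                                                             int devfn)
    {
        return smmuv3_accel_find_add_as(bus, opaque, devfn);  /* IOMMU view */
    }

    static const PCIIOMMUOps smmuv3_accel_ops = {
        .get_address_space     = smmuv3_accel_get_address_space,
        .get_msi_address_space = smmuv3_accel_get_msi_address_space,
        /* set_iommu_device / unset_iommu_device as in the series */
    };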



^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional get_msi_address_space() callback
  2025-11-04 14:11   ` Eric Auger
@ 2025-11-04 14:20     ` Jason Gunthorpe
  2025-11-04 14:42       ` Shameer Kolothum
  2025-11-04 14:37     ` Shameer Kolothum
  1 sibling, 1 reply; 148+ messages in thread
From: Jason Gunthorpe @ 2025-11-04 14:20 UTC (permalink / raw)
  To: Eric Auger
  Cc: Shameer Kolothum, qemu-arm, qemu-devel, peter.maydell, nicolinc,
	ddutile, berrange, nathanc, mochs, smostafa, wangzhou1,
	jiangkunkun, jonathan.cameron, zhangfei.gao, zhenzhong.duan,
	yi.l.liu, kjaju

On Tue, Nov 04, 2025 at 03:11:55PM +0100, Eric Auger wrote:
> > However, QEMU/KVM also calls this callback when resolving
> > MSI doorbells:
> >
> >   kvm_irqchip_add_msi_route()
> >     kvm_arch_fixup_msi_route()
> >       pci_device_iommu_address_space()
> >         get_address_space()
> >
> > VFIO device in the guest with a SMMUv3 is programmed with a gIOVA for
> > MSI doorbell. This gIOVA can't be used to setup the MSI doorbell
> > directly. This needs to be translated to vITS gPA. In order to do the
> > doorbell transalation it needs IOMMU address space.

Why does qemu do anything with the msi address? It is opaque and qemu
cannot determine anything meaningful from it. I expect it to ignore it?

Jason


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 27/32] hw/arm/smmuv3-accel: Add support for ATS
  2025-10-31 10:50 ` [PATCH v5 27/32] hw/arm/smmuv3-accel: Add support for ATS Shameer Kolothum
@ 2025-11-04 14:22   ` Eric Auger
  0 siblings, 0 replies; 148+ messages in thread
From: Eric Auger @ 2025-11-04 14:22 UTC (permalink / raw)
  To: Shameer Kolothum, qemu-arm, qemu-devel
  Cc: peter.maydell, jgg, nicolinc, ddutile, berrange, nathanc, mochs,
	smostafa, wangzhou1, jiangkunkun, jonathan.cameron, zhangfei.gao,
	zhenzhong.duan, yi.l.liu, kjaju



On 10/31/25 11:50 AM, Shameer Kolothum wrote:
> QEMU SMMUv3 does not enable ATS (Address Translation Services) by default.
> When accelerated mode is enabled and the host SMMUv3 supports ATS, it can
> be useful to report ATS capability to the guest so it can take advantage
> of it if the device also supports ATS.
>
> Note: ATS support cannot be reliably detected from the host SMMUv3 IDR
> registers alone, as firmware ACPI IORT tables may override them. The
> user must therefore ensure the support before enabling it.

Please add a note that this is partial ATS support, made possible because
emulated devices cannot be plugged onto an accel SMMU (i.e. we do not support
ATS translation requests, for instance).
>
> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
> ---
>  hw/arm/smmuv3-accel.c    |  4 ++++
>  hw/arm/smmuv3.c          | 25 ++++++++++++++++++++++++-
>  hw/arm/virt-acpi-build.c | 10 ++++++++--
>  include/hw/arm/smmuv3.h  |  1 +
>  4 files changed, 37 insertions(+), 3 deletions(-)
>
> diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
> index 35298350cb..5b0ef3804a 100644
> --- a/hw/arm/smmuv3-accel.c
> +++ b/hw/arm/smmuv3-accel.c
> @@ -645,6 +645,10 @@ void smmuv3_accel_idr_override(SMMUv3State *s)
>      if (!s->ril) {
>          s->idr[3] = FIELD_DP32(s->idr[3], IDR3, RIL, 0);
>      }
> +    /* QEMU SMMUv3 has no ATS. Update IDR0 if user has enabled it */
"advertise ats if opt-on by property?"
> +    if (s->ats) {
> +        s->idr[0] = FIELD_DP32(s->idr[0], IDR0, ATS, 1); /* ATS */
use s->ats directly?
> +    }
>  }
>  
> +/* Based on SMMUv3 GBPA configuration, attach a corresponding HWPT */
> diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
> index b9d96f5762..d95279a733 100644
> --- a/hw/arm/smmuv3.c
> +++ b/hw/arm/smmuv3.c
> @@ -1517,13 +1517,28 @@ static int smmuv3_cmdq_consume(SMMUv3State *s)
>               */
>              smmuv3_range_inval(bs, &cmd, SMMU_STAGE_2);
>              break;
> +        case SMMU_CMD_ATC_INV:
> +        {
> +            SMMUDevice *sdev = smmu_find_sdev(bs, CMD_SID(&cmd));
> +            Error *local_err = NULL;
> +
> +            if (!sdev) {
> +                break;
> +            }
> +
> +            if (!smmuv3_accel_issue_inv_cmd(s, &cmd, sdev, &local_err)) {
> +                error_report_err(local_err);
> +                cmd_error = SMMU_CERROR_ILL;
> +                break;
> +            }
> +            break;
> +        }
>          case SMMU_CMD_TLBI_EL3_ALL:
>          case SMMU_CMD_TLBI_EL3_VA:
>          case SMMU_CMD_TLBI_EL2_ALL:
>          case SMMU_CMD_TLBI_EL2_ASID:
>          case SMMU_CMD_TLBI_EL2_VA:
>          case SMMU_CMD_TLBI_EL2_VAA:
> -        case SMMU_CMD_ATC_INV:
>          case SMMU_CMD_PRI_RESP:
>          case SMMU_CMD_RESUME:
>          case SMMU_CMD_STALL_TERM:
> @@ -1942,6 +1957,10 @@ static bool smmu_validate_property(SMMUv3State *s, Error **errp)
>              error_setg(errp, "ril can only be disabled if accel=on");
>              return false;
>          }
> +        if (s->ats) {
> +            error_setg(errp, "ats can only be enabled if accel=on");
> +            return false;
> +        }
>          return false;
>      }
>      return true;
> @@ -2067,6 +2086,7 @@ static const Property smmuv3_properties[] = {
>      DEFINE_PROP_BOOL("accel", SMMUv3State, accel, false),
>      /* RIL can be turned off for accel cases */
>      DEFINE_PROP_BOOL("ril", SMMUv3State, ril, true),
> +    DEFINE_PROP_BOOL("ats", SMMUv3State, ats, false),
>  };
>  
>  static void smmuv3_instance_init(Object *obj)
> @@ -2096,6 +2116,9 @@ static void smmuv3_class_init(ObjectClass *klass, const void *data)
>                                            "in nested mode for vfio-pci dev assignment");
>      object_class_property_set_description(klass, "ril",
>          "Disable range invalidation support (for accel=on)");
> +    object_class_property_set_description(klass, "ats",
> +        "Enable/disable ATS support (for accel=on). Please ensure host "
> +        "platform has ATS support before enabling this");
>  }
>  
>  static int smmuv3_notify_flag_changed(IOMMUMemoryRegion *iommu,
> diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
> index 6106ad1b6e..1b0d0a2029 100644
> --- a/hw/arm/virt-acpi-build.c
> +++ b/hw/arm/virt-acpi-build.c
> @@ -345,6 +345,7 @@ typedef struct AcpiIortSMMUv3Dev {
>      /* Offset of the SMMUv3 IORT Node relative to the start of the IORT */
>      size_t offset;
>      bool accel;
> +    bool ats;
>  } AcpiIortSMMUv3Dev;
>  
>  /*
> @@ -400,6 +401,7 @@ static int iort_smmuv3_devices(Object *obj, void *opaque)
>  
>      bus = PCI_BUS(object_property_get_link(obj, "primary-bus", &error_abort));
>      sdev.accel = object_property_get_bool(obj, "accel", &error_abort);
> +    sdev.ats = object_property_get_bool(obj, "ats", &error_abort);
>      pbus = PLATFORM_BUS_DEVICE(vms->platform_bus_dev);
>      sbdev = SYS_BUS_DEVICE(obj);
>      sdev.base = platform_bus_get_mmio_addr(pbus, sbdev, 0);
> @@ -544,6 +546,7 @@ build_iort(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
>      int i, nb_nodes, rc_mapping_count;
>      AcpiIortSMMUv3Dev *sdev;
>      size_t node_size;
> +    bool ats_needed = false;
>      int num_smmus = 0;
>      uint32_t id = 0;
>      int rc_smmu_idmaps_len = 0;
> @@ -579,6 +582,9 @@ build_iort(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
>          /* Calculate RMR nodes required. One per SMMUv3 with accelerated mode */
>          for (i = 0; i < num_smmus; i++) {
>              sdev = &g_array_index(smmuv3_devs, AcpiIortSMMUv3Dev, i);
> +            if (sdev->ats) {
> +                ats_needed = true;
> +            }
>              if (sdev->accel) {
>                  nb_nodes++;
>              }
> @@ -678,8 +684,8 @@ build_iort(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
>      build_append_int_noprefix(table_data, 0, 2); /* Reserved */
>      /* Table 15 Memory Access Flags */
>      build_append_int_noprefix(table_data, 0x3 /* CCA = CPM = DACS = 1 */, 1);
> -
> -    build_append_int_noprefix(table_data, 0, 4); /* ATS Attribute */
> +    /* ATS Attribute */
> +    build_append_int_noprefix(table_data, (ats_needed ? 1 : 0), 4);
can't you use ats_needed directly?
>      /* MCFG pci_segment */
>      build_append_int_noprefix(table_data, 0, 4); /* PCI Segment number */
>  
> diff --git a/include/hw/arm/smmuv3.h b/include/hw/arm/smmuv3.h
> index 95202c2757..5fd5ec7b49 100644
> --- a/include/hw/arm/smmuv3.h
> +++ b/include/hw/arm/smmuv3.h
> @@ -69,6 +69,7 @@ struct SMMUv3State {
>      struct SMMUv3AccelState *s_accel;
>      Error *migration_blocker;
>      bool ril;
> +    bool ats;
>  };
>  
>  typedef enum {
Eric



^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 28/32] hw/arm/smmuv3-accel: Add property to specify OAS bits
  2025-10-31 10:50 ` [PATCH v5 28/32] hw/arm/smmuv3-accel: Add property to specify OAS bits Shameer Kolothum
@ 2025-11-04 14:35   ` Eric Auger
  2025-11-04 14:50     ` Jason Gunthorpe
  0 siblings, 1 reply; 148+ messages in thread
From: Eric Auger @ 2025-11-04 14:35 UTC (permalink / raw)
  To: Shameer Kolothum, qemu-arm, qemu-devel
  Cc: peter.maydell, jgg, nicolinc, ddutile, berrange, nathanc, mochs,
	smostafa, wangzhou1, jiangkunkun, jonathan.cameron, zhangfei.gao,
	zhenzhong.duan, yi.l.liu, kjaju



On 10/31/25 11:50 AM, Shameer Kolothum wrote:
> QEMU SMMUv3 currently sets the output address size (OAS) to 44 bits. With
> accelerator mode enabled, a guest device may use SVA where CPU page tables
> are shared with SMMUv3, requiring OAS at least equal to the CPU OAS. Add
> a user option to set this.
>
> Note: Linux kernel docs currently state the OAS field in the IDR register
> is not meaningful for users. But looks like we need this information.
I would explain why this is actually needed instead of quoting the Linux
kernel docs. I guess in practice the vSMMU can't advertise an OAS
greater than the one supported by the host SMMU, otherwise the guest might
fail to use the range exposed by the vSMMU.
>
> Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
> ---
>  hw/arm/smmuv3-accel.c    | 22 ++++++++++++++++++++++
>  hw/arm/smmuv3-internal.h |  3 ++-
>  hw/arm/smmuv3.c          | 16 +++++++++++++++-
>  include/hw/arm/smmuv3.h  |  1 +
>  4 files changed, 40 insertions(+), 2 deletions(-)
>
> diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
> index 5b0ef3804a..c46510150e 100644
> --- a/hw/arm/smmuv3-accel.c
> +++ b/hw/arm/smmuv3-accel.c
> @@ -28,6 +28,12 @@ MemoryRegion root;
>  MemoryRegion sysmem;
>  static AddressSpace *shared_as_sysmem;
>  
> +static int smmuv3_oas_bits(uint32_t oas)
> +{
> +    static const int map[] = { 32, 36, 40, 42, 44, 48, 52, 56 };
> +    return (oas < ARRAY_SIZE(map)) ? map[oas] : -EINVAL;
> +}
> +
>  static bool
>  smmuv3_accel_check_hw_compatible(SMMUv3State *s,
>                                   struct iommu_hw_info_arm_smmuv3 *info,
> @@ -70,6 +76,18 @@ smmuv3_accel_check_hw_compatible(SMMUv3State *s,
>          return false;
>      }
>  
> +    /*
> +     * TODO: OAS is not something Linux kernel doc says meaningful for user.
> +     * But looks like OAS needs to be compatible for accelerator support. Please
> +     * check.
> I would remove that comment. Either it is required or it is not.
> +     */
> +    if (FIELD_EX32(info->idr[5], IDR5, OAS) <
> +                FIELD_EX32(s->idr[5], IDR5, OAS)) {
> +        error_setg(errp, "Host SMMUv3 OAS(%d) bits not compatible",
> +                   smmuv3_oas_bits(FIELD_EX32(info->idr[5], IDR5, OAS)));
let's be more explicit then and say

> Host SMMUv3 OAS (%d bits) is less than the OAS advertised by the vSMMU (%d bits)



> +        return false;
> +    }
> +
>      /* QEMU SMMUv3 supports GRAN4K/GRAN16K/GRAN64K translation granules */
>      if (FIELD_EX32(info->idr[5], IDR5, GRAN4K) !=
>                  FIELD_EX32(s->idr[5], IDR5, GRAN4K)) {
> @@ -649,6 +667,10 @@ void smmuv3_accel_idr_override(SMMUv3State *s)
>      if (s->ats) {
>          s->idr[0] = FIELD_DP32(s->idr[0], IDR0, ATS, 1); /* ATS */
>      }
> +    /* QEMU SMMUv3 has OAS set 44. Update IDR5 if user has it set to 48 bits*/
vSMMUv3 advertises by default a 44 bit wide OAS
> +    if (s->oas == 48) {
> +        s->idr[5] = FIELD_DP32(s->idr[5], IDR5, OAS, SMMU_IDR5_OAS_48);
> +    }
>  }
>  
>  /* Based on SMUUv3 GBPA configuration, attach a corresponding HWPT */
> diff --git a/hw/arm/smmuv3-internal.h b/hw/arm/smmuv3-internal.h
> index 5fd88b4257..cfc5897569 100644
> --- a/hw/arm/smmuv3-internal.h
> +++ b/hw/arm/smmuv3-internal.h
> @@ -111,7 +111,8 @@ REG32(IDR5,                0x14)
>       FIELD(IDR5, VAX,        10, 2);
>       FIELD(IDR5, STALL_MAX,  16, 16);
>  
> -#define SMMU_IDR5_OAS 4
> +#define SMMU_IDR5_OAS_44 4
> +#define SMMU_IDR5_OAS_48 5
>  
>  REG32(IIDR,                0x18)
>  REG32(AIDR,                0x1c)
> diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
> index d95279a733..c4d28a3786 100644
> --- a/hw/arm/smmuv3.c
> +++ b/hw/arm/smmuv3.c
> @@ -299,7 +299,8 @@ static void smmuv3_init_id_regs(SMMUv3State *s)
>      s->idr[3] = FIELD_DP32(s->idr[3], IDR3, RIL, 1);
>      s->idr[3] = FIELD_DP32(s->idr[3], IDR3, BBML, 2);
>  
> -    s->idr[5] = FIELD_DP32(s->idr[5], IDR5, OAS, SMMU_IDR5_OAS); /* 44 bits */
> +    /* OAS: 44 bits */
> +    s->idr[5] = FIELD_DP32(s->idr[5], IDR5, OAS, SMMU_IDR5_OAS_44);
>      /* 4K, 16K and 64K granule support */
>      s->idr[5] = FIELD_DP32(s->idr[5], IDR5, GRAN4K, 1);
>      s->idr[5] = FIELD_DP32(s->idr[5], IDR5, GRAN16K, 1);
> @@ -1961,6 +1962,15 @@ static bool smmu_validate_property(SMMUv3State *s, Error **errp)
>              error_setg(errp, "ats can only be enabled if accel=on");
>              return false;
>          }
> +        if (s->oas != 44) {
> +            error_setg(errp, "OAS can only be set to 44 bits if accel=off");
> +            return false;
> +        }
> +        return false;
> +    }
> +
> +    if (s->oas != 44 && s->oas != 48) {
> +        error_setg(errp, "OAS can only be set to 44 or 48 bits");
>          return false;
>      }
>      return true;
> @@ -2087,6 +2097,7 @@ static const Property smmuv3_properties[] = {
>      /* RIL can be turned off for accel cases */
>      DEFINE_PROP_BOOL("ril", SMMUv3State, ril, true),
>      DEFINE_PROP_BOOL("ats", SMMUv3State, ats, false),
> +    DEFINE_PROP_UINT8("oas", SMMUv3State, oas, 44),
>  };
>  
>  static void smmuv3_instance_init(Object *obj)
> @@ -2119,6 +2130,9 @@ static void smmuv3_class_init(ObjectClass *klass, const void *data)
>      object_class_property_set_description(klass, "ats",
>          "Enable/disable ATS support (for accel=on). Please ensure host "
>          "platform has ATS support before enabling this");
> +    object_class_property_set_description(klass, "oas",
> +        "Specify Output Address Size (for accel =on). Supported values "
> +        "are 44 or 48 bits. Defaults to 44 bits");
>  }
>  
>  static int smmuv3_notify_flag_changed(IOMMUMemoryRegion *iommu,
> diff --git a/include/hw/arm/smmuv3.h b/include/hw/arm/smmuv3.h
> index 5fd5ec7b49..e4226b66f3 100644
> --- a/include/hw/arm/smmuv3.h
> +++ b/include/hw/arm/smmuv3.h
> @@ -70,6 +70,7 @@ struct SMMUv3State {
>      Error *migration_blocker;
>      bool ril;
>      bool ats;
> +    uint8_t oas;
>  };
>  
>  typedef enum {



^ permalink raw reply	[flat|nested] 148+ messages in thread

* RE: [PATCH v5 15/32] hw/pci/pci: Introduce optional get_msi_address_space() callback
  2025-11-04 14:11   ` Eric Auger
  2025-11-04 14:20     ` Jason Gunthorpe
@ 2025-11-04 14:37     ` Shameer Kolothum
  2025-11-04 14:44       ` Eric Auger
  1 sibling, 1 reply; 148+ messages in thread
From: Shameer Kolothum @ 2025-11-04 14:37 UTC (permalink / raw)
  To: eric.auger@redhat.com, qemu-arm@nongnu.org, qemu-devel@nongnu.org
  Cc: peter.maydell@linaro.org, Jason Gunthorpe, Nicolin Chen,
	ddutile@redhat.com, berrange@redhat.com, Nathan Chen, Matt Ochs,
	smostafa@google.com, wangzhou1@hisilicon.com,
	jiangkunkun@huawei.com, jonathan.cameron@huawei.com,
	zhangfei.gao@linaro.org, zhenzhong.duan@intel.com,
	yi.l.liu@intel.com, Krishnakant Jaju

Hi Eric,

> Hi Shameer, Nicolin,
> 
> On 10/31/25 11:49 AM, Shameer Kolothum wrote:
> > On ARM, devices behind an IOMMU have their MSI doorbell addresses
> > translated by the IOMMU. In nested mode, this translation happens in
> > two stages (gIOVA → gPA → ITS page).
> >
> > In accelerated SMMUv3 mode, both stages are handled by hardware, so
> > get_address_space() returns the system address space so that VFIO
> > can setup stage-2 mappings for system address space.
> 
> Sorry but I still don't catch the above. Can you explain (most probably
> again) why this is a requirement to return the system as so that VFIO
> can setup stage-2 mappings for system address space. I am sorry for
> insisting (at the risk of being stubborn or dumb) but I fail to
> understand the requirement. As far as I remember the way I integrated it
> at the old times did not require that change:
> https://lore.kernel.org/all/20210411120912.15770-1-
> eric.auger@redhat.com/
> I used a vfio_prereg_listener to force the S2 mapping.

Yes I remember that.

> 
> What has changed that forces us now to have this gym

This approach achieves the same outcome, but through a 
different mechanism. Returning the system address space
here ensures that VFIO sets up the Stage-2 mappings for 
devices behind the accelerated SMMUv3.

I think this makes sense because, in the accelerated case, the
device is no longer managed by QEMU’s SMMUv3 model. The
guest owns the Stage-1 context, and the host (VFIO) is responsible
for establishing the Stage-2 mappings accordingly. 

Do you see any issues with this approach?

Thanks,
Shameer

^ permalink raw reply	[flat|nested] 148+ messages in thread

* RE: [PATCH v5 15/32] hw/pci/pci: Introduce optional get_msi_address_space() callback
  2025-11-04 14:20     ` Jason Gunthorpe
@ 2025-11-04 14:42       ` Shameer Kolothum
  2025-11-04 14:51         ` Jason Gunthorpe
  0 siblings, 1 reply; 148+ messages in thread
From: Shameer Kolothum @ 2025-11-04 14:42 UTC (permalink / raw)
  To: Jason Gunthorpe, Eric Auger
  Cc: qemu-arm@nongnu.org, qemu-devel@nongnu.org,
	peter.maydell@linaro.org, Nicolin Chen, ddutile@redhat.com,
	berrange@redhat.com, Nathan Chen, Matt Ochs, smostafa@google.com,
	wangzhou1@hisilicon.com, jiangkunkun@huawei.com,
	jonathan.cameron@huawei.com, zhangfei.gao@linaro.org,
	zhenzhong.duan@intel.com, yi.l.liu@intel.com, Krishnakant Jaju



> On Tue, Nov 04, 2025 at 03:11:55PM +0100, Eric Auger wrote:
> > > However, QEMU/KVM also calls this callback when resolving
> > > MSI doorbells:
> > >
> > >   kvm_irqchip_add_msi_route()
> > >     kvm_arch_fixup_msi_route()
> > >       pci_device_iommu_address_space()
> > >         get_address_space()
> > >
> > > VFIO device in the guest with a SMMUv3 is programmed with a gIOVA for
> > > MSI doorbell. This gIOVA can't be used to setup the MSI doorbell
> > > directly. This needs to be translated to vITS gPA. In order to do the
> > > doorbell transalation it needs IOMMU address space.
> 
> Why does qemu do anything with the msi address? It is opaque and qemu
> cannot determine anything meaningful from it. I expect it to ignore it?

I am afraid not. The guest MSI table write gets trapped, and QEMU then configures the
doorbell (this is where this patch comes in handy) and sets up the KVM
routing etc.

Thanks,
Shameer
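
Roughly, what that trap path does with the returned address space (a
simplified, from-memory sketch of the existing kvm_arch_fixup_msi_route()
flow, with error handling and locking omitted):

    /*
     * 'as' is what pci_device_iommu_msi_address_space() returned. If it is
     * just the system memory AS there is nothing to fix up; otherwise the
     * guest-programmed doorbell address (a gIOVA) is translated through the
     * vIOMMU so that KVM is given the vITS doorbell gPA instead.
     */
    if (as == &address_space_memory) {
        return 0;
    }
    mr = address_space_translate(as, address, &xlat, &len, true,
                                 MEMTXATTRS_UNSPECIFIED);
    mrs = memory_region_find(mr, xlat, 1);
    doorbell_gpa = mrs.offset_within_address_space;
    route->u.msi.address_lo = doorbell_gpa;
    route->u.msi.address_hi = doorbell_gpa >> 32;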



^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional get_msi_address_space() callback
  2025-11-04 14:37     ` Shameer Kolothum
@ 2025-11-04 14:44       ` Eric Auger
  2025-11-04 15:14         ` Shameer Kolothum
  0 siblings, 1 reply; 148+ messages in thread
From: Eric Auger @ 2025-11-04 14:44 UTC (permalink / raw)
  To: Shameer Kolothum, qemu-arm@nongnu.org, qemu-devel@nongnu.org
  Cc: peter.maydell@linaro.org, Jason Gunthorpe, Nicolin Chen,
	ddutile@redhat.com, berrange@redhat.com, Nathan Chen, Matt Ochs,
	smostafa@google.com, wangzhou1@hisilicon.com,
	jiangkunkun@huawei.com, jonathan.cameron@huawei.com,
	zhangfei.gao@linaro.org, zhenzhong.duan@intel.com,
	yi.l.liu@intel.com, Krishnakant Jaju



On 11/4/25 3:37 PM, Shameer Kolothum wrote:
> Hi Eric,
>
>> Hi Shameer, Nicolin,
>>
>> On 10/31/25 11:49 AM, Shameer Kolothum wrote:
>>> On ARM, devices behind an IOMMU have their MSI doorbell addresses
>>> translated by the IOMMU. In nested mode, this translation happens in
>>> two stages (gIOVA → gPA → ITS page).
>>>
>>> In accelerated SMMUv3 mode, both stages are handled by hardware, so
>>> get_address_space() returns the system address space so that VFIO
>>> can setup stage-2 mappings for system address space.
>> Sorry but I still don't catch the above. Can you explain (most probably
>> again) why this is a requirement to return the system as so that VFIO
>> can setup stage-2 mappings for system address space. I am sorry for
>> insisting (at the risk of being stubborn or dumb) but I fail to
>> understand the requirement. As far as I remember the way I integrated it
>> at the old times did not require that change:
>> https://lore.kernel.org/all/20210411120912.15770-1-
>> eric.auger@redhat.com/
>> I used a vfio_prereg_listener to force the S2 mapping.
> Yes I remember that.
>
>> What has changed that forces us now to have this gym
> This approach achieves the same outcome, but through a 
> different mechanism. Returning the system address space
> here ensures that VFIO sets up the Stage-2 mappings for 
> devices behind the accelerated SMMUv3.
>
> I think, this makes sense because, in the accelerated case, the
> device is no longer managed by QEMU’s SMMUv3 model. The
On the other hand, as we discussed on v4, by returning the system AS you
pretend there is no translation in place, which is not true. Now we use
an alias for it, but that has not really removed its usage. It also forces
us to hack around the MSI mapping and introduce a new PCIIOMMUOps callback.
Have you assessed the feasibility of using vfio_prereg_listener to force the
S2 mapping? Is it simply not relevant anymore, or could it be used also
with the iommufd backend integration?

Eric
> guest owns the Stage-1 context, and the host (VFIO) is responsible
> for establishing the Stage-2 mappings accordingly. 
>
> Do you see any issues with this approach?
>
> Thanks,
> Shameer



^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 28/32] hw/arm/smmuv3-accel: Add property to specify OAS bits
  2025-11-04 14:35   ` Eric Auger
@ 2025-11-04 14:50     ` Jason Gunthorpe
  2025-11-06  7:54       ` Eric Auger
  0 siblings, 1 reply; 148+ messages in thread
From: Jason Gunthorpe @ 2025-11-04 14:50 UTC (permalink / raw)
  To: Eric Auger
  Cc: Shameer Kolothum, qemu-arm, qemu-devel, peter.maydell, nicolinc,
	ddutile, berrange, nathanc, mochs, smostafa, wangzhou1,
	jiangkunkun, jonathan.cameron, zhangfei.gao, zhenzhong.duan,
	yi.l.liu, kjaju

On Tue, Nov 04, 2025 at 03:35:42PM +0100, Eric Auger wrote:
> > +    /*
> > +     * TODO: OAS is not something Linux kernel doc says meaningful for user.
> > +     * But looks like OAS needs to be compatible for accelerator support. Please
> > +     * check.
> would remove that comment. Either it is requested or not.
> > +     */
> > +    if (FIELD_EX32(info->idr[5], IDR5, OAS) <
> > +                FIELD_EX32(s->idr[5], IDR5, OAS)) {
> > +        error_setg(errp, "Host SMMUv3 OAS(%d) bits not compatible",
> > +                   smmuv3_oas_bits(FIELD_EX32(info->idr[5], IDR5, OAS)));
> let's be more explicit then and say
> 
> Host SMMUv3 OAS (%d bits) is less that OAS bits advertised by SMMU (%d)

It isn't OAS that is being checked here, this is now effectively the IPA size.
OAS is for use by the hypervisor.

When the guest looks at the vSMMU the "OAS" it sees is the IPS
supported by the HW.

Aside from the raw HW limit, it also shouldn't exceed the configured
size of the S2 HWPT.

So the above should refer to this detail because it is a bit subtle
that OAS and IPS are often the same. See "3.4 Address sizes"

* IAS reflects the maximum usable IPA of an implementation that is
  generated by stage 1 and input to stage 2:

- This term is defined to illustrate the handling of intermediate
  addresses in this section and is not a configurable parameter.

- The maximum usable IPA size of an SMMU is defined in terms of other SMMU implementation choices,
  as:
    IAS = MAX((SMMU_IDR0.TTF[0] == 1 ? 40 : 0), (SMMU_IDR0.TTF[1] == 1 ? OAS : 0));

- An IPA of 40 bits is required to support of AArch32 LPAE translations, and AArch64 limits the
maximum IPA size to the maximum PA size. Otherwise, when AArch32 LPAE is not implemented, the
IPA size equals OAS, the PA size, and might be smaller than 40 bits.

- The purpose of definition of the IAS term is to abstract away from these implementation variables.
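
A concrete reading of that formula, with assumed values for illustration: on
an AArch64-only implementation IAS collapses to OAS, so the guest-visible
"OAS" is effectively the bound on the usable IPA:

    #include <stdio.h>
    #define MAX(a, b) ((a) > (b) ? (a) : (b))

    int main(void)
    {
        /* assumed: no AArch32 LPAE (TTF[0] = 0), AArch64 tables (TTF[1] = 1) */
        int ttf0 = 0, ttf1 = 1, oas = 48;
        int ias = MAX(ttf0 ? 40 : 0, ttf1 ? oas : 0);
        printf("IAS = %d\n", ias);   /* -> 48: IAS tracks OAS */
        return 0;
    }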

Jason


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional get_msi_address_space() callback
  2025-11-04 14:42       ` Shameer Kolothum
@ 2025-11-04 14:51         ` Jason Gunthorpe
  2025-11-04 14:58           ` Shameer Kolothum
  0 siblings, 1 reply; 148+ messages in thread
From: Jason Gunthorpe @ 2025-11-04 14:51 UTC (permalink / raw)
  To: Shameer Kolothum
  Cc: Eric Auger, qemu-arm@nongnu.org, qemu-devel@nongnu.org,
	peter.maydell@linaro.org, Nicolin Chen, ddutile@redhat.com,
	berrange@redhat.com, Nathan Chen, Matt Ochs, smostafa@google.com,
	wangzhou1@hisilicon.com, jiangkunkun@huawei.com,
	jonathan.cameron@huawei.com, zhangfei.gao@linaro.org,
	zhenzhong.duan@intel.com, yi.l.liu@intel.com, Krishnakant Jaju

On Tue, Nov 04, 2025 at 02:42:57PM +0000, Shameer Kolothum wrote:
> > On Tue, Nov 04, 2025 at 03:11:55PM +0100, Eric Auger wrote:
> > > > However, QEMU/KVM also calls this callback when resolving
> > > > MSI doorbells:
> > > >
> > > >   kvm_irqchip_add_msi_route()
> > > >     kvm_arch_fixup_msi_route()
> > > >       pci_device_iommu_address_space()
> > > >         get_address_space()
> > > >
> > > > VFIO device in the guest with a SMMUv3 is programmed with a gIOVA for
> > > > MSI doorbell. This gIOVA can't be used to setup the MSI doorbell
> > > > directly. This needs to be translated to vITS gPA. In order to do the
> > > > doorbell transalation it needs IOMMU address space.
> > 
> > Why does qemu do anything with the msi address? It is opaque and qemu
> > cannot determine anything meaningful from it. I expect it to ignore it?
> 
> I am afraid not. Guest MSI table write gets trapped and it then configures the 
> doorbell( this is where this patch comes handy) and sets up the KVM 
> routing etc.

Sure it is trapped, but nothing should be looking at the MSI address
from the guest, it is meaningless and wrong information. Just ignore
it.

Jason


^ permalink raw reply	[flat|nested] 148+ messages in thread

* RE: [PATCH v5 15/32] hw/pci/pci: Introduce optional get_msi_address_space() callback
  2025-11-04 14:51         ` Jason Gunthorpe
@ 2025-11-04 14:58           ` Shameer Kolothum
  2025-11-04 15:12             ` Jason Gunthorpe
  0 siblings, 1 reply; 148+ messages in thread
From: Shameer Kolothum @ 2025-11-04 14:58 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Eric Auger, qemu-arm@nongnu.org, qemu-devel@nongnu.org,
	peter.maydell@linaro.org, Nicolin Chen, ddutile@redhat.com,
	berrange@redhat.com, Nathan Chen, Matt Ochs, smostafa@google.com,
	wangzhou1@hisilicon.com, jiangkunkun@huawei.com,
	jonathan.cameron@huawei.com, zhangfei.gao@linaro.org,
	zhenzhong.duan@intel.com, yi.l.liu@intel.com, Krishnakant Jaju



> On Tue, Nov 04, 2025 at 02:42:57PM +0000, Shameer Kolothum wrote:
> > > On Tue, Nov 04, 2025 at 03:11:55PM +0100, Eric Auger wrote:
> > > > > However, QEMU/KVM also calls this callback when resolving
> > > > > MSI doorbells:
> > > > >
> > > > >   kvm_irqchip_add_msi_route()
> > > > >     kvm_arch_fixup_msi_route()
> > > > >       pci_device_iommu_address_space()
> > > > >         get_address_space()
> > > > >
> > > > > VFIO device in the guest with a SMMUv3 is programmed with a gIOVA
> for
> > > > > MSI doorbell. This gIOVA can't be used to setup the MSI doorbell
> > > > > directly. This needs to be translated to vITS gPA. In order to do the
> > > > > doorbell transalation it needs IOMMU address space.
> > >
> > > Why does qemu do anything with the msi address? It is opaque and qemu
> > > cannot determine anything meaningful from it. I expect it to ignore it?
> >
> > I am afraid not. Guest MSI table write gets trapped and it then configures the
> > doorbell( this is where this patch comes handy) and sets up the KVM
> > routing etc.
> 
> Sure it is trapped, but nothing should be looking at the MSI address
> from the guest, it is meaningless and wrong information. Just ignore
> it.

Hmm... we need to set up the doorbell address correctly. If we don't do
the translation here, it will use the guest IOVA address. Remember,
we are using the IORT RMR identity mapping to get MSI working.

See this discussion here,

https://lore.kernel.org/qemu-devel/CH3PR12MB754810AE8D308630041F9AFEABF2A@CH3PR12MB7548.namprd12.prod.outlook.com/

Thanks,
Shameer


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional get_msi_address_space() callback
  2025-11-04 14:58           ` Shameer Kolothum
@ 2025-11-04 15:12             ` Jason Gunthorpe
  2025-11-04 15:20               ` Shameer Kolothum
  0 siblings, 1 reply; 148+ messages in thread
From: Jason Gunthorpe @ 2025-11-04 15:12 UTC (permalink / raw)
  To: Shameer Kolothum
  Cc: Eric Auger, qemu-arm@nongnu.org, qemu-devel@nongnu.org,
	peter.maydell@linaro.org, Nicolin Chen, ddutile@redhat.com,
	berrange@redhat.com, Nathan Chen, Matt Ochs, smostafa@google.com,
	wangzhou1@hisilicon.com, jiangkunkun@huawei.com,
	jonathan.cameron@huawei.com, zhangfei.gao@linaro.org,
	zhenzhong.duan@intel.com, yi.l.liu@intel.com, Krishnakant Jaju

On Tue, Nov 04, 2025 at 02:58:44PM +0000, Shameer Kolothum wrote:
> > Sure it is trapped, but nothing should be looking at the MSI address
> > from the guest, it is meaningless and wrong information. Just ignore
> > it.
> 
> Hmm.. we need to setup the doorbell address correctly. 

> If we don't do the translation here, it will use the Guest IOVA
> address. Remember, we are using the IORT RMR identity mapping to get
> MSI working.

Either you use the RMR value, which is forced by the kernel into the
physical MSI through iommufd and kernel ignores anything qemu
does. So fully ignore the guest's vMSI address.

Eventually qemu should transfer the unchanged guest vMSI address
directly to the kernel, but we haven't figured that out yet.

Jason


^ permalink raw reply	[flat|nested] 148+ messages in thread

* RE: [PATCH v5 15/32] hw/pci/pci: Introduce optional get_msi_address_space() callback
  2025-11-04 14:44       ` Eric Auger
@ 2025-11-04 15:14         ` Shameer Kolothum
  2025-11-04 16:01           ` Eric Auger
  2025-11-05  8:56           ` Eric Auger
  0 siblings, 2 replies; 148+ messages in thread
From: Shameer Kolothum @ 2025-11-04 15:14 UTC (permalink / raw)
  To: eric.auger@redhat.com, qemu-arm@nongnu.org, qemu-devel@nongnu.org
  Cc: peter.maydell@linaro.org, Jason Gunthorpe, Nicolin Chen,
	ddutile@redhat.com, berrange@redhat.com, Nathan Chen, Matt Ochs,
	smostafa@google.com, wangzhou1@hisilicon.com,
	jiangkunkun@huawei.com, jonathan.cameron@huawei.com,
	zhangfei.gao@linaro.org, zhenzhong.duan@intel.com,
	yi.l.liu@intel.com, Krishnakant Jaju



> On 11/4/25 3:37 PM, Shameer Kolothum wrote:
> > Hi Eric,
> >
> >>
> >> On 10/31/25 11:49 AM, Shameer Kolothum wrote:
> >>> On ARM, devices behind an IOMMU have their MSI doorbell addresses
> >>> translated by the IOMMU. In nested mode, this translation happens in
> >>> two stages (gIOVA → gPA → ITS page).
> >>>
> >>> In accelerated SMMUv3 mode, both stages are handled by hardware, so
> >>> get_address_space() returns the system address space so that VFIO
> >>> can setup stage-2 mappings for system address space.
> >> Sorry but I still don't catch the above. Can you explain (most probably
> >> again) why this is a requirement to return the system as so that VFIO
> >> can setup stage-2 mappings for system address space. I am sorry for
> >> insisting (at the risk of being stubborn or dumb) but I fail to
> >> understand the requirement. As far as I remember the way I integrated it
> >> at the old times did not require that change:
> >> https://lore.kernel.org/all/20210411120912.15770-1-
> >> eric.auger@redhat.com/
> >> I used a vfio_prereg_listener to force the S2 mapping.
> > Yes I remember that.
> >
> >> What has changed that forces us now to have this gym
> > This approach achieves the same outcome, but through a
> > different mechanism. Returning the system address space
> > here ensures that VFIO sets up the Stage-2 mappings for
> > devices behind the accelerated SMMUv3.
> >
> > I think, this makes sense because, in the accelerated case, the
> > device is no longer managed by QEMU’s SMMUv3 model. The
> On the other hand, as we discussed on v4 by returning system as you
> pretend there is no translation in place which is not true. Now we use
> an alias for it but it has not really removed its usage. Also it forces
> use to hack around the MSI mapping and introduce new PCIIOMMUOps.
> Have
> you assessed the feasability of using vfio_prereg_listener to force the
> S2 mapping. Is it simply not relevant anymore or could it be used also
> with the iommufd be integration? Eric

IIUC, the prereg_listener mechanism just enables us to set up the S2
mappings. For MSI, in your version, I see that smmu_find_add_as()
always returns the IOMMU AS. How is that supposed to work if the guest
has an S1 bypass mode STE for the device?

Thanks,
Shameer



^ permalink raw reply	[flat|nested] 148+ messages in thread

* RE: [PATCH v5 15/32] hw/pci/pci: Introduce optional get_msi_address_space() callback
  2025-11-04 15:12             ` Jason Gunthorpe
@ 2025-11-04 15:20               ` Shameer Kolothum
  2025-11-04 15:35                 ` Jason Gunthorpe
  0 siblings, 1 reply; 148+ messages in thread
From: Shameer Kolothum @ 2025-11-04 15:20 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Eric Auger, qemu-arm@nongnu.org, qemu-devel@nongnu.org,
	peter.maydell@linaro.org, Nicolin Chen, ddutile@redhat.com,
	berrange@redhat.com, Nathan Chen, Matt Ochs, smostafa@google.com,
	wangzhou1@hisilicon.com, jiangkunkun@huawei.com,
	jonathan.cameron@huawei.com, zhangfei.gao@linaro.org,
	zhenzhong.duan@intel.com, yi.l.liu@intel.com, Krishnakant Jaju



> -----Original Message-----
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: 04 November 2025 15:13
> To: Shameer Kolothum <skolothumtho@nvidia.com>
> Cc: Eric Auger <eric.auger@redhat.com>; qemu-arm@nongnu.org; qemu-
> devel@nongnu.org; peter.maydell@linaro.org; Nicolin Chen
> <nicolinc@nvidia.com>; ddutile@redhat.com; berrange@redhat.com; Nathan
> Chen <nathanc@nvidia.com>; Matt Ochs <mochs@nvidia.com>;
> smostafa@google.com; wangzhou1@hisilicon.com;
> jiangkunkun@huawei.com; jonathan.cameron@huawei.com;
> zhangfei.gao@linaro.org; zhenzhong.duan@intel.com; yi.l.liu@intel.com;
> Krishnakant Jaju <kjaju@nvidia.com>
> Subject: Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional
> get_msi_address_space() callback
> 
> On Tue, Nov 04, 2025 at 02:58:44PM +0000, Shameer Kolothum wrote:
> > > Sure it is trapped, but nothing should be looking at the MSI address
> > > from the guest, it is meaningless and wrong information. Just ignore
> > > it.
> >
> > Hmm.. we need to setup the doorbell address correctly.
> 
> > If we don't do the translation here, it will use the Guest IOVA
> > address. Remember, we are using the IORT RMR identity mapping to get
> > MSI working.
> 
> Either you use the RMR value, which is forced by the kernel into the
> physical MSI through iommufd and kernel ignores anything qemu
> does. So fully ignore the guest's vMSI address.

Well, we are sort of trying to do the same through this patch here.
But avoiding a "translation" completely would involve some changes to
the QEMU PCI subsystem. I think this is the least intrusive path I can
think of for now. And this is mostly a one-time setup.

Thanks,
Shameer



^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional get_msi_address_space() callback
  2025-11-04 15:20               ` Shameer Kolothum
@ 2025-11-04 15:35                 ` Jason Gunthorpe
  2025-11-04 17:11                   ` Nicolin Chen
  0 siblings, 1 reply; 148+ messages in thread
From: Jason Gunthorpe @ 2025-11-04 15:35 UTC (permalink / raw)
  To: Shameer Kolothum
  Cc: Eric Auger, qemu-arm@nongnu.org, qemu-devel@nongnu.org,
	peter.maydell@linaro.org, Nicolin Chen, ddutile@redhat.com,
	berrange@redhat.com, Nathan Chen, Matt Ochs, smostafa@google.com,
	wangzhou1@hisilicon.com, jiangkunkun@huawei.com,
	jonathan.cameron@huawei.com, zhangfei.gao@linaro.org,
	zhenzhong.duan@intel.com, yi.l.liu@intel.com, Krishnakant Jaju

On Tue, Nov 04, 2025 at 03:20:59PM +0000, Shameer Kolothum wrote:
> > On Tue, Nov 04, 2025 at 02:58:44PM +0000, Shameer Kolothum wrote:
> > > > Sure it is trapped, but nothing should be looking at the MSI address
> > > > from the guest, it is meaningless and wrong information. Just ignore
> > > > it.
> > >
> > > Hmm.. we need to setup the doorbell address correctly.
> > 
> > > If we don't do the translation here, it will use the Guest IOVA
> > > address. Remember, we are using the IORT RMR identity mapping to get
> > > MSI working.
> > 
> > Either you use the RMR value, which is forced by the kernel into the
> > physical MSI through iommufd and kernel ignores anything qemu
> > does. So fully ignore the guest's vMSI address.
> 
> Well, we are sort of trying to do the same through this patch here. 
> But to avoid a "translation" completely it will involve some changes to
> Qemu pci subsystem. I think this is the least intrusive path I can think
> of now. And this is a one time setup mostly.

It should be explained in the commit message that the translation is
pointless. I'm not sure about this; any translation seems risky
because it could fail. The guest can use any IOVA for MSI, and none of
them may fail.

Jason


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional get_msi_address_space() callback
  2025-11-04 15:14         ` Shameer Kolothum
@ 2025-11-04 16:01           ` Eric Auger
  2025-11-04 17:47             ` Nicolin Chen
  2025-11-04 19:08             ` Shameer Kolothum
  2025-11-05  8:56           ` Eric Auger
  1 sibling, 2 replies; 148+ messages in thread
From: Eric Auger @ 2025-11-04 16:01 UTC (permalink / raw)
  To: Shameer Kolothum, qemu-arm@nongnu.org, qemu-devel@nongnu.org
  Cc: peter.maydell@linaro.org, Jason Gunthorpe, Nicolin Chen,
	ddutile@redhat.com, berrange@redhat.com, Nathan Chen, Matt Ochs,
	smostafa@google.com, wangzhou1@hisilicon.com,
	jiangkunkun@huawei.com, jonathan.cameron@huawei.com,
	zhangfei.gao@linaro.org, zhenzhong.duan@intel.com,
	yi.l.liu@intel.com, Krishnakant Jaju



On 11/4/25 4:14 PM, Shameer Kolothum wrote:
>
>> -----Original Message-----
>> From: Eric Auger <eric.auger@redhat.com>
>> Sent: 04 November 2025 14:44
>> To: Shameer Kolothum <skolothumtho@nvidia.com>; qemu-
>> arm@nongnu.org; qemu-devel@nongnu.org
>> Cc: peter.maydell@linaro.org; Jason Gunthorpe <jgg@nvidia.com>; Nicolin
>> Chen <nicolinc@nvidia.com>; ddutile@redhat.com; berrange@redhat.com;
>> Nathan Chen <nathanc@nvidia.com>; Matt Ochs <mochs@nvidia.com>;
>> smostafa@google.com; wangzhou1@hisilicon.com;
>> jiangkunkun@huawei.com; jonathan.cameron@huawei.com;
>> zhangfei.gao@linaro.org; zhenzhong.duan@intel.com; yi.l.liu@intel.com;
>> Krishnakant Jaju <kjaju@nvidia.com>
>> Subject: Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional
>> get_msi_address_space() callback
>>
>> External email: Use caution opening links or attachments
>>
>>
>> On 11/4/25 3:37 PM, Shameer Kolothum wrote:
>>> Hi Eric,
>>>
>>>> -----Original Message-----
>>>> From: Eric Auger <eric.auger@redhat.com>
>>>> Sent: 04 November 2025 14:12
>>>> To: Shameer Kolothum <skolothumtho@nvidia.com>; qemu-
>>>> arm@nongnu.org; qemu-devel@nongnu.org
>>>> Cc: peter.maydell@linaro.org; Jason Gunthorpe <jgg@nvidia.com>; Nicolin
>>>> Chen <nicolinc@nvidia.com>; ddutile@redhat.com; berrange@redhat.com;
>>>> Nathan Chen <nathanc@nvidia.com>; Matt Ochs <mochs@nvidia.com>;
>>>> smostafa@google.com; wangzhou1@hisilicon.com;
>>>> jiangkunkun@huawei.com; jonathan.cameron@huawei.com;
>>>> zhangfei.gao@linaro.org; zhenzhong.duan@intel.com; yi.l.liu@intel.com;
>>>> Krishnakant Jaju <kjaju@nvidia.com>
>>>> Subject: Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional
>>>> get_msi_address_space() callback
>>>>
>>>> External email: Use caution opening links or attachments
>>>>
>>>>
>>>> Hi Shameer, Nicolin,
>>>>
>>>> On 10/31/25 11:49 AM, Shameer Kolothum wrote:
>>>>> On ARM, devices behind an IOMMU have their MSI doorbell addresses
>>>>> translated by the IOMMU. In nested mode, this translation happens in
>>>>> two stages (gIOVA → gPA → ITS page).
>>>>>
>>>>> In accelerated SMMUv3 mode, both stages are handled by hardware, so
>>>>> get_address_space() returns the system address space so that VFIO
>>>>> can setup stage-2 mappings for system address space.
>>>> Sorry but I still don't catch the above. Can you explain (most probably
>>>> again) why this is a requirement to return the system as so that VFIO
>>>> can setup stage-2 mappings for system address space. I am sorry for
>>>> insisting (at the risk of being stubborn or dumb) but I fail to
>>>> understand the requirement. As far as I remember the way I integrated it
>>>> at the old times did not require that change:
>>>> https://lore.kernel.org/all/20210411120912.15770-1-
>>>> eric.auger@redhat.com/
>>>> I used a vfio_prereg_listener to force the S2 mapping.
>>> Yes I remember that.
>>>
>>>> What has changed that forces us now to have this gym
>>> This approach achieves the same outcome, but through a
>>> different mechanism. Returning the system address space
>>> here ensures that VFIO sets up the Stage-2 mappings for
>>> devices behind the accelerated SMMUv3.
>>>
>>> I think, this makes sense because, in the accelerated case, the
>>> device is no longer managed by QEMU’s SMMUv3 model. The
>> On the other hand, as we discussed on v4 by returning system as you
>> pretend there is no translation in place which is not true. Now we use
>> an alias for it but it has not really removed its usage. Also it forces
>> use to hack around the MSI mapping and introduce new PCIIOMMUOps.
>> Have
>> you assessed the feasability of using vfio_prereg_listener to force the
>> S2 mapping. Is it simply not relevant anymore or could it be used also
>> with the iommufd be integration? Eric
> IIUC, the prereg_listener mechanism just enables us to setup the s2
> mappings. For MSI, In your version, I see that smmu_find_add_as()
> always returns IOMMU as. How is that supposed to work if the Guest
> has s1 bypass mode STE for the device?

I need to delve into it again as I forgot the details. Will come back to
you ...

Eric
>
> Thanks,
> Shameer
>
>



^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 17/32] hw/arm/smmuv3-accel: Add support to issue invalidation cmd to host
  2025-11-04  8:55             ` Eric Auger
@ 2025-11-04 16:41               ` Nicolin Chen
  0 siblings, 0 replies; 148+ messages in thread
From: Nicolin Chen @ 2025-11-04 16:41 UTC (permalink / raw)
  To: Eric Auger
  Cc: Shameer Kolothum, qemu-arm@nongnu.org, qemu-devel@nongnu.org,
	peter.maydell@linaro.org, Jason Gunthorpe, ddutile@redhat.com,
	berrange@redhat.com, Nathan Chen, Matt Ochs, smostafa@google.com,
	wangzhou1@hisilicon.com, jiangkunkun@huawei.com,
	jonathan.cameron@huawei.com, zhangfei.gao@linaro.org,
	zhenzhong.duan@intel.com, yi.l.liu@intel.com, Krishnakant Jaju

On Tue, Nov 04, 2025 at 09:55:46AM +0100, Eric Auger wrote:
> On 11/3/25 7:51 PM, Nicolin Chen wrote:
> > On Mon, Nov 03, 2025 at 10:17:20AM -0800, Shameer Kolothum wrote:
> >>>> The general
> >>>> idea is, we will pass the errp to accel functions and report or
> >>>> propagate from here.
> >>> But there is no "errp" in smmuv3_cmdq_consume() to propagate the these
> >>> local_errs further? It ends at the error_report_err().
> >>>
> >>> If we only get local_err and print them, why not just print them inside the
> >>> _accel functions?
> >> Right, we don’t propagate error now. But in future it might come
> >> handy. I would personally keep the error propagation facility if possible.
> > smmuv3_cmdq_consume() is called in smmu_writel() only. Where do we
> > plan to propagate that in the future?
> >
> >> Also, this was added as per Eric's comment on RFC v3.
> >>
> >> https://lore.kernel.org/qemu-devel/41ceadf1-07de-4c8a-8935-d709ac7cf6bc@redhat.com/
> > If only we have a top function that does error_report_err() in one
> > place.. Duplicating error_report_err(local_err) doesn't look clean
> > to me.
> >
> > Maybe smmu_writel() could do:
> > {
> > +   Error *errp = NULL;
> >
> >     switch (offset) {
> >     case A_XXX:
> >         smmuv3_cmdq_consume(..., &errp);
> > +       return MEMTX_OK;
> > -       break;
> >     ...
> >     case A_YYY:
> >         smmuv3_cmdq_consume(..., &errp);
> > +       return MEMTX_OK;
> > -       break;
> >     }
> > +   if (errp) {
> > +       error_report_err(errp);
> > +   }
> > +   return MEMTX_OK;
> > }
> >
> > Any better idea, Eric?
> 
> Can't we move local_err outside of case block and after the switch,
> 
>  if (cmd_error) {
>    if (local_err) {
>       error_report_err(local_err);
>    }
> ../..  

I see Shameer's vEVENTQ patch (WIP) also has an errp that will end
up in an error_report_err() in smmu_writel() for A_EVENTQ_BASE.

So, it seems cleaner to do it in the top function? I am fine
with adding it in the cmdq function for now and moving it later, though.

Thanks
Nicolin
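
For illustration only, here is a minimal, self-contained sketch of the
"report once in the top-level write handler" pattern being discussed.
The Error type and helpers are simplified stand-ins for QEMU's
qapi/error.h API, and the register names and consume function are
placeholders rather than the actual SMMUv3 code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for QEMU's Error object. */
typedef struct Error { char *msg; } Error;

static void error_setg_stub(Error **errp, const char *msg)
{
    Error *err = malloc(sizeof(*err));
    err->msg = strdup(msg);
    *errp = err;
}

/* Mirrors error_report_err() semantics: print, then free. */
static void error_report_err_stub(Error *err)
{
    fprintf(stderr, "error: %s\n", err->msg);
    free(err->msg);
    free(err);
}

enum { REG_CMDQ_PROD, REG_EVENTQ_BASE };

/* Placeholder for smmuv3_cmdq_consume(..., errp); fails on purpose. */
static void cmdq_consume(Error **errp)
{
    error_setg_stub(errp, "CMD_CFGI_STE failed on the host");
}

/* Placeholder for an EVENTQ-related setup path; succeeds. */
static void eventq_setup(Error **errp)
{
    (void)errp;
}

/* Top-level write handler: every case funnels into one report site. */
static void reg_write(int offset)
{
    Error *local_err = NULL;

    switch (offset) {
    case REG_CMDQ_PROD:
        cmdq_consume(&local_err);
        break;
    case REG_EVENTQ_BASE:
        eventq_setup(&local_err);
        break;
    default:
        break;
    }

    if (local_err) {
        error_report_err_stub(local_err);
    }
}

int main(void)
{
    reg_write(REG_CMDQ_PROD);   /* reported once, here */
    reg_write(REG_EVENTQ_BASE); /* nothing to report */
    return 0;
}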


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 13/32] hw/arm/smmuv3-accel: Add nested vSTE install/uninstall support
  2025-11-04 12:26     ` Shameer Kolothum
  2025-11-04 13:30       ` Eric Auger
@ 2025-11-04 16:48       ` Nicolin Chen
  1 sibling, 0 replies; 148+ messages in thread
From: Nicolin Chen @ 2025-11-04 16:48 UTC (permalink / raw)
  To: Shameer Kolothum
  Cc: eric.auger@redhat.com, qemu-arm@nongnu.org, qemu-devel@nongnu.org,
	peter.maydell@linaro.org, Jason Gunthorpe, ddutile@redhat.com,
	berrange@redhat.com, Nathan Chen, Matt Ochs, smostafa@google.com,
	wangzhou1@hisilicon.com, jiangkunkun@huawei.com,
	jonathan.cameron@huawei.com, zhangfei.gao@linaro.org,
	zhenzhong.duan@intel.com, yi.l.liu@intel.com, Krishnakant Jaju

On Tue, Nov 04, 2025 at 04:26:56AM -0800, Shameer Kolothum wrote:
> > > +static bool
> > > +smmuv3_accel_dev_install_nested_ste(SMMUv3AccelDevice *accel_dev,
> > > +                                    uint32_t data_type, uint32_t data_len,
> > > +                                    void *data, Error **errp)
> > the name is very close to the caller function, ie.
> > smmuv3_accel_install_nested_ste which also takes a sdev.
> > I would rename to smmuv3_accel_install_hwpt() or something alike
> 
> This one is going to change a bit based on Nicolin's feedback on taking 
> care of SMMUEN/GBPA values.
> 
> Probably smmuv3_accel_attach_nested_hwpt() suits better considering
> that’s what it finally ends up doing.

Eric is right, because the current version passes in the hwpt data for
allocation.

Yet, my new version passes in the STE, so the naming could keep "ste", IMHO.

Thanks
Nicolin


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional get_msi_address_space() callback
  2025-11-04 15:35                 ` Jason Gunthorpe
@ 2025-11-04 17:11                   ` Nicolin Chen
  2025-11-04 17:41                     ` Jason Gunthorpe
  0 siblings, 1 reply; 148+ messages in thread
From: Nicolin Chen @ 2025-11-04 17:11 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Shameer Kolothum, Eric Auger, qemu-arm@nongnu.org,
	qemu-devel@nongnu.org, peter.maydell@linaro.org,
	ddutile@redhat.com, berrange@redhat.com, Nathan Chen, Matt Ochs,
	smostafa@google.com, wangzhou1@hisilicon.com,
	jiangkunkun@huawei.com, jonathan.cameron@huawei.com,
	zhangfei.gao@linaro.org, zhenzhong.duan@intel.com,
	yi.l.liu@intel.com, Krishnakant Jaju

On Tue, Nov 04, 2025 at 11:35:35AM -0400, Jason Gunthorpe wrote:
> On Tue, Nov 04, 2025 at 03:20:59PM +0000, Shameer Kolothum wrote:
> > > On Tue, Nov 04, 2025 at 02:58:44PM +0000, Shameer Kolothum wrote:
> > > > > Sure it is trapped, but nothing should be looking at the MSI address
> > > > > from the guest, it is meaningless and wrong information. Just ignore
> > > > > it.
> > > >
> > > > Hmm.. we need to setup the doorbell address correctly.
> > > 
> > > > If we don't do the translation here, it will use the Guest IOVA
> > > > address. Remember, we are using the IORT RMR identity mapping to get
> > > > MSI working.
> > > 
> > > Either you use the RMR value, which is forced by the kernel into the
> > > physical MSI through iommufd and kernel ignores anything qemu
> > > does. So fully ignore the guest's vMSI address.
> > 
> > Well, we are sort of trying to do the same through this patch here. 
> > But to avoid a "translation" completely it will involve some changes to
> > Qemu pci subsystem. I think this is the least intrusive path I can think
> > of now. And this is a one time setup mostly.
> 
> Should be explained in the commit message that the translation is
> pointless. I'm not sure about this, any translation seems risky
> because it could fail. The guest can use any IOVA for MSI and none may
> fail.

In the current design of KVM in QEMU, it does a generic translation
from gIOVA->gPA for the doorbell location to inject the IRQ, whether the
VM has an accelerated IOMMU or an emulated IOMMU.

In the accelerated case, this translation is pointless for the
underlying SMMU HW. But the IRQ injection routine still stands.

We could have invented something like get_msi_physical_address, but
the vPCI device is programmed with a gIOVA for MSI. So it makes sense
for the VMM to follow that gIOVA? Even if the gIOVA is a wrong address,
I think the VMM shouldn't correct it, since real HW wouldn't.

Thanks
Nicolin


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional get_msi_address_space() callback
  2025-11-04 17:11                   ` Nicolin Chen
@ 2025-11-04 17:41                     ` Jason Gunthorpe
  2025-11-04 17:57                       ` Nicolin Chen
  2025-11-05 17:32                       ` Eric Auger
  0 siblings, 2 replies; 148+ messages in thread
From: Jason Gunthorpe @ 2025-11-04 17:41 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Shameer Kolothum, Eric Auger, qemu-arm@nongnu.org,
	qemu-devel@nongnu.org, peter.maydell@linaro.org,
	ddutile@redhat.com, berrange@redhat.com, Nathan Chen, Matt Ochs,
	smostafa@google.com, wangzhou1@hisilicon.com,
	jiangkunkun@huawei.com, jonathan.cameron@huawei.com,
	zhangfei.gao@linaro.org, zhenzhong.duan@intel.com,
	yi.l.liu@intel.com, Krishnakant Jaju

On Tue, Nov 04, 2025 at 09:11:55AM -0800, Nicolin Chen wrote:
> On Tue, Nov 04, 2025 at 11:35:35AM -0400, Jason Gunthorpe wrote:
> > On Tue, Nov 04, 2025 at 03:20:59PM +0000, Shameer Kolothum wrote:
> > > > On Tue, Nov 04, 2025 at 02:58:44PM +0000, Shameer Kolothum wrote:
> > > > > > Sure it is trapped, but nothing should be looking at the MSI address
> > > > > > from the guest, it is meaningless and wrong information. Just ignore
> > > > > > it.
> > > > >
> > > > > Hmm.. we need to setup the doorbell address correctly.
> > > > 
> > > > > If we don't do the translation here, it will use the Guest IOVA
> > > > > address. Remember, we are using the IORT RMR identity mapping to get
> > > > > MSI working.
> > > > 
> > > > Either you use the RMR value, which is forced by the kernel into the
> > > > physical MSI through iommufd and kernel ignores anything qemu
> > > > does. So fully ignore the guest's vMSI address.
> > > 
> > > Well, we are sort of trying to do the same through this patch here. 
> > > But to avoid a "translation" completely it will involve some changes to
> > > Qemu pci subsystem. I think this is the least intrusive path I can think
> > > of now. And this is a one time setup mostly.
> > 
> > Should be explained in the commit message that the translation is
> > pointless. I'm not sure about this, any translation seems risky
> > because it could fail. The guest can use any IOVA for MSI and none may
> > fail.
> 
> In the current design of KVM in QEMU, it does a generic translation
> from gIOVA->gPA for the doorbell location to inject IRQ, whether VM
> has an accelerated IOMMU or an emulated IOMMU.

And what happens if the translation fails because there is no mapping?
It should be ignored for this case and not ignored for others.

Jason


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional get_msi_address_space() callback
  2025-11-04 16:01           ` Eric Auger
@ 2025-11-04 17:47             ` Nicolin Chen
  2025-11-05  7:47               ` Eric Auger
  2025-11-04 19:08             ` Shameer Kolothum
  1 sibling, 1 reply; 148+ messages in thread
From: Nicolin Chen @ 2025-11-04 17:47 UTC (permalink / raw)
  To: Eric Auger
  Cc: Shameer Kolothum, qemu-arm@nongnu.org, qemu-devel@nongnu.org,
	peter.maydell@linaro.org, Jason Gunthorpe, ddutile@redhat.com,
	berrange@redhat.com, Nathan Chen, Matt Ochs, smostafa@google.com,
	wangzhou1@hisilicon.com, jiangkunkun@huawei.com,
	jonathan.cameron@huawei.com, zhangfei.gao@linaro.org,
	zhenzhong.duan@intel.com, yi.l.liu@intel.com, Krishnakant Jaju

On Tue, Nov 04, 2025 at 05:01:57PM +0100, Eric Auger wrote:
> >>>> On 10/31/25 11:49 AM, Shameer Kolothum wrote:
> >>>>> On ARM, devices behind an IOMMU have their MSI doorbell addresses
> >>>>> translated by the IOMMU. In nested mode, this translation happens in
> >>>>> two stages (gIOVA → gPA → ITS page).
> >>>>>
> >>>>> In accelerated SMMUv3 mode, both stages are handled by hardware, so
> >>>>> get_address_space() returns the system address space so that VFIO
> >>>>> can setup stage-2 mappings for system address space.
> >>>> Sorry but I still don't catch the above. Can you explain (most probably
> >>>> again) why this is a requirement to return the system as so that VFIO
> >>>> can setup stage-2 mappings for system address space. I am sorry for
> >>>> insisting (at the risk of being stubborn or dumb) but I fail to
> >>>> understand the requirement. As far as I remember the way I integrated it
> >>>> at the old times did not require that change:
> >>>> https://lore.kernel.org/all/20210411120912.15770-1-
> >>>> eric.auger@redhat.com/
> >>>> I used a vfio_prereg_listener to force the S2 mapping.
> >>> Yes I remember that.
> >>>
> >>>> What has changed that forces us now to have this gym
> >>> This approach achieves the same outcome, but through a
> >>> different mechanism. Returning the system address space
> >>> here ensures that VFIO sets up the Stage-2 mappings for
> >>> devices behind the accelerated SMMUv3.
> >>>
> >>> I think, this makes sense because, in the accelerated case, the
> >>> device is no longer managed by QEMU’s SMMUv3 model. The
> >> On the other hand, as we discussed on v4 by returning system as you
> >> pretend there is no translation in place which is not true. Now we use
> >> an alias for it but it has not really removed its usage. Also it forces
> >> use to hack around the MSI mapping and introduce new PCIIOMMUOps.
> >> Have
> >> you assessed the feasability of using vfio_prereg_listener to force the
> >> S2 mapping. Is it simply not relevant anymore or could it be used also
> >> with the iommufd be integration? Eric
> > IIUC, the prereg_listener mechanism just enables us to setup the s2
> > mappings. For MSI, In your version, I see that smmu_find_add_as()
> > always returns IOMMU as. How is that supposed to work if the Guest
> > has s1 bypass mode STE for the device?
> 
> I need to delve into it again as I forgot the details. Will come back to
> you ...

We aligned with Intel previously about this system address space.
You might know these very well, but here is the breakdown:

1. VFIO core has a container that manages an HWPT. By default, it
   allocates a stage-1 normal HWPT, unless the vIOMMU requests a
   nesting parent HWPT for the accelerated cases.
2. VFIO core adds a listener for that HWPT and sets up a handler
   vfio_container_region_add(), where it checks whether the memory
   region is an IOMMU region or not.
   a. In case of a !IOMMU AS (i.e. the system address space), it treats
      the address space as a RAM region, and handles all stage-2
      mappings for the core-allocated nesting parent HWPT.
   b. In case of an IOMMU AS (i.e. a translation type), it sets up
      the IOTLB notifier and translation replay while bypassing
      the listener for the RAM region.

In the accelerated case, we need the stage-2 mappings to match the
nesting parent HWPT. So, returning the system address space, or an
alias of it, tells the VFIO core to take the 2.a path.

If we take the 2.b path by returning an IOMMU AS in smmu_find_add_as(),
the VFIO core would no longer listen to the RAM region for us, i.e. no
stage-2 HWPT nor mappings. The vIOMMU would have to allocate a nesting
parent and manage the stage-2 mappings by adding a listener in its
own code, which would largely duplicate the core code.

-------------- so far this works for Intel and ARM--------------

3. On ARM, the vPCI device is programmed with a gIOVA, so KVM has to
   follow what the vPCI device was told in order to inject vIRQs. This
   requires a translation at the nested stage-1 address space. Note
   that the vSMMU in this case doesn't manage translation, as it
   doesn't need to. But there is no other sane way for KVM to know the
   vITS page corresponding to the given gIOVA. So, we invented
   the get_msi_address_space op.

(3) makes sense because there is a complication in the MSI path, which
does a 2-stage translation on ARM, and KVM must follow the stage-1
input address, leaving us no choice but to have two address spaces.

Thanks
Nicolin
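
For illustration only, a minimal, self-contained sketch of the split
described above. The types are trivial stand-ins for QEMU's
AddressSpace/PCIBus/PCIIOMMUOps, and the two callbacks only mimic the
intent of the get_address_space()/get_msi_address_space() pair proposed
in this series; the real callback signatures and the actual patch code
may differ.

#include <stdbool.h>
#include <stdio.h>

/* Trivial stand-ins for QEMU's AddressSpace and PCIBus. */
typedef struct AddressSpace { const char *name; bool is_iommu; } AddressSpace;
typedef struct PCIBus PCIBus;

/* Alias of the system address space: the VFIO listener treats it as
 * plain RAM (path 2.a above) and installs the stage-2 mappings on the
 * nesting parent HWPT. */
static AddressSpace accel_sysmem_alias = { "smmuv3-accel-sysmem", false };

/* Per-device IOMMU address space: only consulted to resolve the MSI
 * doorbell gIOVA -> gPA for KVM's routing (point 3 above). */
static AddressSpace accel_iommu_as = { "smmuv3-accel-iommu", true };

/* Reduced PCIIOMMUOps: signatures simplified for the sketch. */
typedef struct PCIIOMMUOps {
    AddressSpace *(*get_address_space)(PCIBus *bus, void *opaque, int devfn);
    AddressSpace *(*get_msi_address_space)(PCIBus *bus, void *opaque, int devfn);
} PCIIOMMUOps;

static AddressSpace *accel_get_address_space(PCIBus *bus, void *opaque, int devfn)
{
    (void)bus; (void)opaque; (void)devfn;
    return &accel_sysmem_alias;     /* lets VFIO set up stage 2 */
}

static AddressSpace *accel_get_msi_address_space(PCIBus *bus, void *opaque, int devfn)
{
    (void)bus; (void)opaque; (void)devfn;
    return &accel_iommu_as;         /* lets KVM translate the doorbell */
}

static const PCIIOMMUOps accel_ops = {
    .get_address_space     = accel_get_address_space,
    .get_msi_address_space = accel_get_msi_address_space,
};

int main(void)
{
    AddressSpace *dma = accel_ops.get_address_space(NULL, NULL, 0);
    AddressSpace *msi = accel_ops.get_msi_address_space(NULL, NULL, 0);

    /* VFIO's region_add handler keys off "is this an IOMMU region?",
     * which is what selects path 2.a vs 2.b. */
    printf("DMA AS: %s (iommu=%d)\n", dma->name, dma->is_iommu);
    printf("MSI AS: %s (iommu=%d)\n", msi->name, msi->is_iommu);
    return 0;
}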


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional get_msi_address_space() callback
  2025-11-04 17:41                     ` Jason Gunthorpe
@ 2025-11-04 17:57                       ` Nicolin Chen
  2025-11-04 18:09                         ` Jason Gunthorpe
  2025-11-05 17:32                       ` Eric Auger
  1 sibling, 1 reply; 148+ messages in thread
From: Nicolin Chen @ 2025-11-04 17:57 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Shameer Kolothum, Eric Auger, qemu-arm@nongnu.org,
	qemu-devel@nongnu.org, peter.maydell@linaro.org,
	ddutile@redhat.com, berrange@redhat.com, Nathan Chen, Matt Ochs,
	smostafa@google.com, wangzhou1@hisilicon.com,
	jiangkunkun@huawei.com, jonathan.cameron@huawei.com,
	zhangfei.gao@linaro.org, zhenzhong.duan@intel.com,
	yi.l.liu@intel.com, Krishnakant Jaju

On Tue, Nov 04, 2025 at 01:41:52PM -0400, Jason Gunthorpe wrote:
> On Tue, Nov 04, 2025 at 09:11:55AM -0800, Nicolin Chen wrote:
> > On Tue, Nov 04, 2025 at 11:35:35AM -0400, Jason Gunthorpe wrote:
> > > On Tue, Nov 04, 2025 at 03:20:59PM +0000, Shameer Kolothum wrote:
> > > > > On Tue, Nov 04, 2025 at 02:58:44PM +0000, Shameer Kolothum wrote:
> > > > > > > Sure it is trapped, but nothing should be looking at the MSI address
> > > > > > > from the guest, it is meaningless and wrong information. Just ignore
> > > > > > > it.
> > > > > >
> > > > > > Hmm.. we need to setup the doorbell address correctly.
> > > > > 
> > > > > > If we don't do the translation here, it will use the Guest IOVA
> > > > > > address. Remember, we are using the IORT RMR identity mapping to get
> > > > > > MSI working.
> > > > > 
> > > > > Either you use the RMR value, which is forced by the kernel into the
> > > > > physical MSI through iommufd and kernel ignores anything qemu
> > > > > does. So fully ignore the guest's vMSI address.
> > > > 
> > > > Well, we are sort of trying to do the same through this patch here. 
> > > > But to avoid a "translation" completely it will involve some changes to
> > > > Qemu pci subsystem. I think this is the least intrusive path I can think
> > > > of now. And this is a one time setup mostly.
> > > 
> > > Should be explained in the commit message that the translation is
> > > pointless. I'm not sure about this, any translation seems risky
> > > because it could fail. The guest can use any IOVA for MSI and none may
> > > fail.
> > 
> > In the current design of KVM in QEMU, it does a generic translation
> > from gIOVA->gPA for the doorbell location to inject IRQ, whether VM
> > has an accelerated IOMMU or an emulated IOMMU.
> 
> And what happens if the translation fails because there is no mapping?
> It should be ignored for this case and not ignored for others.

It errors out and does no injection. IOW, yea, "ignored".

Nicolin


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional get_msi_address_space() callback
  2025-11-04 17:57                       ` Nicolin Chen
@ 2025-11-04 18:09                         ` Jason Gunthorpe
  2025-11-04 18:44                           ` Nicolin Chen
  0 siblings, 1 reply; 148+ messages in thread
From: Jason Gunthorpe @ 2025-11-04 18:09 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Shameer Kolothum, Eric Auger, qemu-arm@nongnu.org,
	qemu-devel@nongnu.org, peter.maydell@linaro.org,
	ddutile@redhat.com, berrange@redhat.com, Nathan Chen, Matt Ochs,
	smostafa@google.com, wangzhou1@hisilicon.com,
	jiangkunkun@huawei.com, jonathan.cameron@huawei.com,
	zhangfei.gao@linaro.org, zhenzhong.duan@intel.com,
	yi.l.liu@intel.com, Krishnakant Jaju

On Tue, Nov 04, 2025 at 09:57:53AM -0800, Nicolin Chen wrote:
> On Tue, Nov 04, 2025 at 01:41:52PM -0400, Jason Gunthorpe wrote:
> > On Tue, Nov 04, 2025 at 09:11:55AM -0800, Nicolin Chen wrote:
> > > On Tue, Nov 04, 2025 at 11:35:35AM -0400, Jason Gunthorpe wrote:
> > > > On Tue, Nov 04, 2025 at 03:20:59PM +0000, Shameer Kolothum wrote:
> > > > > > On Tue, Nov 04, 2025 at 02:58:44PM +0000, Shameer Kolothum wrote:
> > > > > > > > Sure it is trapped, but nothing should be looking at the MSI address
> > > > > > > > from the guest, it is meaningless and wrong information. Just ignore
> > > > > > > > it.
> > > > > > >
> > > > > > > Hmm.. we need to setup the doorbell address correctly.
> > > > > > 
> > > > > > > If we don't do the translation here, it will use the Guest IOVA
> > > > > > > address. Remember, we are using the IORT RMR identity mapping to get
> > > > > > > MSI working.
> > > > > > 
> > > > > > Either you use the RMR value, which is forced by the kernel into the
> > > > > > physical MSI through iommufd and kernel ignores anything qemu
> > > > > > does. So fully ignore the guest's vMSI address.
> > > > > 
> > > > > Well, we are sort of trying to do the same through this patch here. 
> > > > > But to avoid a "translation" completely it will involve some changes to
> > > > > Qemu pci subsystem. I think this is the least intrusive path I can think
> > > > > of now. And this is a one time setup mostly.
> > > > 
> > > > Should be explained in the commit message that the translation is
> > > > pointless. I'm not sure about this, any translation seems risky
> > > > because it could fail. The guest can use any IOVA for MSI and none may
> > > > fail.
> > > 
> > > In the current design of KVM in QEMU, it does a generic translation
> > > from gIOVA->gPA for the doorbell location to inject IRQ, whether VM
> > > has an accelerated IOMMU or an emulated IOMMU.
> > 
> > And what happens if the translation fails because there is no mapping?
> > It should be ignored for this case and not ignored for others.
> 
> It errors out and does no injection. IOW, yea, "ignored".

"does no injection" does not sound like ignored to me..

Jason


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional get_msi_address_space() callback
  2025-11-04 18:09                         ` Jason Gunthorpe
@ 2025-11-04 18:44                           ` Nicolin Chen
  2025-11-04 18:56                             ` Jason Gunthorpe
  0 siblings, 1 reply; 148+ messages in thread
From: Nicolin Chen @ 2025-11-04 18:44 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Shameer Kolothum, Eric Auger, qemu-arm@nongnu.org,
	qemu-devel@nongnu.org, peter.maydell@linaro.org,
	ddutile@redhat.com, berrange@redhat.com, Nathan Chen, Matt Ochs,
	smostafa@google.com, wangzhou1@hisilicon.com,
	jiangkunkun@huawei.com, jonathan.cameron@huawei.com,
	zhangfei.gao@linaro.org, zhenzhong.duan@intel.com,
	yi.l.liu@intel.com, Krishnakant Jaju

On Tue, Nov 04, 2025 at 02:09:28PM -0400, Jason Gunthorpe wrote:
> On Tue, Nov 04, 2025 at 09:57:53AM -0800, Nicolin Chen wrote:
> > On Tue, Nov 04, 2025 at 01:41:52PM -0400, Jason Gunthorpe wrote:
> > > On Tue, Nov 04, 2025 at 09:11:55AM -0800, Nicolin Chen wrote:
> > > > On Tue, Nov 04, 2025 at 11:35:35AM -0400, Jason Gunthorpe wrote:
> > > > > On Tue, Nov 04, 2025 at 03:20:59PM +0000, Shameer Kolothum wrote:
> > > > > > > On Tue, Nov 04, 2025 at 02:58:44PM +0000, Shameer Kolothum wrote:
> > > > > > > > > Sure it is trapped, but nothing should be looking at the MSI address
> > > > > > > > > from the guest, it is meaningless and wrong information. Just ignore
> > > > > > > > > it.
> > > > > > > >
> > > > > > > > Hmm.. we need to setup the doorbell address correctly.
> > > > > > > 
> > > > > > > > If we don't do the translation here, it will use the Guest IOVA
> > > > > > > > address. Remember, we are using the IORT RMR identity mapping to get
> > > > > > > > MSI working.
> > > > > > > 
> > > > > > > Either you use the RMR value, which is forced by the kernel into the
> > > > > > > physical MSI through iommufd and kernel ignores anything qemu
> > > > > > > does. So fully ignore the guest's vMSI address.
> > > > > > 
> > > > > > Well, we are sort of trying to do the same through this patch here. 
> > > > > > But to avoid a "translation" completely it will involve some changes to
> > > > > > Qemu pci subsystem. I think this is the least intrusive path I can think
> > > > > > of now. And this is a one time setup mostly.
> > > > > 
> > > > > Should be explained in the commit message that the translation is
> > > > > pointless. I'm not sure about this, any translation seems risky
> > > > > because it could fail. The guest can use any IOVA for MSI and none may
> > > > > fail.
> > > > 
> > > > In the current design of KVM in QEMU, it does a generic translation
> > > > from gIOVA->gPA for the doorbell location to inject IRQ, whether VM
> > > > has an accelerated IOMMU or an emulated IOMMU.
> > > 
> > > And what happens if the translation fails because there is no mapping?
> > > It should be ignored for this case and not ignored for others.
> > 
> > It errors out and does no injection. IOW, yea, "ignored".
> 
> "does no injection" does not sound like ignored to me..

Sorry. I think I've missed your point.

The hardware path is programmed with a RMR-ed sw_msi in the host
via VFIO's PCI IRQ, ignoring the gIOVA and vITS in the guest VM,
even if the vPCI is programmed with a wrong gIOVA that could not
be translated.

KVM would always get the IRQ from HW, since the HW is programmed
correctly. But if gIOVA->vITS is not mapped, i.e. gIOVA is given
incorrectly, it can't inject the IRQ.

(Perhaps the vSMMU in this case should report F_TRANSLATION for the device.)

What was the meaning of "ignore" in your remarks?

Thanks
Nicolin


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional get_msi_address_space() callback
  2025-11-04 18:44                           ` Nicolin Chen
@ 2025-11-04 18:56                             ` Jason Gunthorpe
  2025-11-04 19:31                               ` Nicolin Chen
  0 siblings, 1 reply; 148+ messages in thread
From: Jason Gunthorpe @ 2025-11-04 18:56 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Shameer Kolothum, Eric Auger, qemu-arm@nongnu.org,
	qemu-devel@nongnu.org, peter.maydell@linaro.org,
	ddutile@redhat.com, berrange@redhat.com, Nathan Chen, Matt Ochs,
	smostafa@google.com, wangzhou1@hisilicon.com,
	jiangkunkun@huawei.com, jonathan.cameron@huawei.com,
	zhangfei.gao@linaro.org, zhenzhong.duan@intel.com,
	yi.l.liu@intel.com, Krishnakant Jaju

On Tue, Nov 04, 2025 at 10:44:27AM -0800, Nicolin Chen wrote:
> The hardware path is programmed with a RMR-ed sw_msi in the host
> via VFIO's PCI IRQ, ignoring the gIOVA and vITS in the guest VM,
> even if the vPCI is programmed with a wrong gIOVA that could not
> be translated.

Yes
 
> KVM would always get the IRQ from HW, since the HW is programmed
> correctly. But if gIOVA->vITS is not mapped, i.e. gIOVA is given
> incorrectly, it can't inject the IRQ.

But this is a software interrupt, and I think it should still just
ignore vMSI's address and assume it is mapped to a legal ITS
page. There is just no way to validate it.

Even SW MSI shouldn't fail because the vMSI has some weird IOVA in it
that isn't mapped in the S2. That's wrong and is something the guest
is permitted to do.

Jason


^ permalink raw reply	[flat|nested] 148+ messages in thread

* RE: [PATCH v5 15/32] hw/pci/pci: Introduce optional get_msi_address_space() callback
  2025-11-04 16:01           ` Eric Auger
  2025-11-04 17:47             ` Nicolin Chen
@ 2025-11-04 19:08             ` Shameer Kolothum
  1 sibling, 0 replies; 148+ messages in thread
From: Shameer Kolothum @ 2025-11-04 19:08 UTC (permalink / raw)
  To: eric.auger@redhat.com, qemu-arm@nongnu.org, qemu-devel@nongnu.org
  Cc: peter.maydell@linaro.org, Jason Gunthorpe, Nicolin Chen,
	ddutile@redhat.com, berrange@redhat.com, Nathan Chen, Matt Ochs,
	smostafa@google.com, wangzhou1@hisilicon.com,
	jiangkunkun@huawei.com, jonathan.cameron@huawei.com,
	zhangfei.gao@linaro.org, zhenzhong.duan@intel.com,
	yi.l.liu@intel.com, Krishnakant Jaju



> -----Original Message-----
> From: Eric Auger <eric.auger@redhat.com>
> Sent: 04 November 2025 16:02
> To: Shameer Kolothum <skolothumtho@nvidia.com>; qemu-
> arm@nongnu.org; qemu-devel@nongnu.org
> Cc: peter.maydell@linaro.org; Jason Gunthorpe <jgg@nvidia.com>; Nicolin
> Chen <nicolinc@nvidia.com>; ddutile@redhat.com; berrange@redhat.com;
> Nathan Chen <nathanc@nvidia.com>; Matt Ochs <mochs@nvidia.com>;
> smostafa@google.com; wangzhou1@hisilicon.com;
> jiangkunkun@huawei.com; jonathan.cameron@huawei.com;
> zhangfei.gao@linaro.org; zhenzhong.duan@intel.com; yi.l.liu@intel.com;
> Krishnakant Jaju <kjaju@nvidia.com>
> Subject: Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional
> get_msi_address_space() callback
> 
> External email: Use caution opening links or attachments
> 
> 
> On 11/4/25 4:14 PM, Shameer Kolothum wrote:
> >
> >> -----Original Message-----
> >> From: Eric Auger <eric.auger@redhat.com>
> >> Sent: 04 November 2025 14:44
> >> To: Shameer Kolothum <skolothumtho@nvidia.com>; qemu-
> arm@nongnu.org;
> >> qemu-devel@nongnu.org
> >> Cc: peter.maydell@linaro.org; Jason Gunthorpe <jgg@nvidia.com>;
> >> Nicolin Chen <nicolinc@nvidia.com>; ddutile@redhat.com;
> >> berrange@redhat.com; Nathan Chen <nathanc@nvidia.com>; Matt Ochs
> >> <mochs@nvidia.com>; smostafa@google.com; wangzhou1@hisilicon.com;
> >> jiangkunkun@huawei.com; jonathan.cameron@huawei.com;
> >> zhangfei.gao@linaro.org; zhenzhong.duan@intel.com;
> >> yi.l.liu@intel.com; Krishnakant Jaju <kjaju@nvidia.com>
> >> Subject: Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional
> >> get_msi_address_space() callback
> >>
> >> External email: Use caution opening links or attachments
> >>
> >>
> >> On 11/4/25 3:37 PM, Shameer Kolothum wrote:
> >>> Hi Eric,
> >>>
> >>>> -----Original Message-----
> >>>> From: Eric Auger <eric.auger@redhat.com>
> >>>> Sent: 04 November 2025 14:12
> >>>> To: Shameer Kolothum <skolothumtho@nvidia.com>; qemu-
> >>>> arm@nongnu.org; qemu-devel@nongnu.org
> >>>> Cc: peter.maydell@linaro.org; Jason Gunthorpe <jgg@nvidia.com>;
> >>>> Nicolin Chen <nicolinc@nvidia.com>; ddutile@redhat.com;
> >>>> berrange@redhat.com; Nathan Chen <nathanc@nvidia.com>; Matt
> Ochs
> >>>> <mochs@nvidia.com>; smostafa@google.com;
> wangzhou1@hisilicon.com;
> >>>> jiangkunkun@huawei.com; jonathan.cameron@huawei.com;
> >>>> zhangfei.gao@linaro.org; zhenzhong.duan@intel.com;
> >>>> yi.l.liu@intel.com; Krishnakant Jaju <kjaju@nvidia.com>
> >>>> Subject: Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional
> >>>> get_msi_address_space() callback
> >>>>
> >>>> External email: Use caution opening links or attachments
> >>>>
> >>>>
> >>>> Hi Shameer, Nicolin,
> >>>>
> >>>> On 10/31/25 11:49 AM, Shameer Kolothum wrote:
> >>>>> On ARM, devices behind an IOMMU have their MSI doorbell addresses
> >>>>> translated by the IOMMU. In nested mode, this translation happens
> >>>>> in two stages (gIOVA → gPA → ITS page).
> >>>>>
> >>>>> In accelerated SMMUv3 mode, both stages are handled by hardware,
> >>>>> so
> >>>>> get_address_space() returns the system address space so that VFIO
> >>>>> can setup stage-2 mappings for system address space.
> >>>> Sorry but I still don't catch the above. Can you explain (most
> >>>> probably
> >>>> again) why this is a requirement to return the system as so that
> >>>> VFIO can setup stage-2 mappings for system address space. I am
> >>>> sorry for insisting (at the risk of being stubborn or dumb) but I
> >>>> fail to understand the requirement. As far as I remember the way I
> >>>> integrated it at the old times did not require that change:
> >>>> https://lore.kernel.org/all/20210411120912.15770-1-
> >>>> eric.auger@redhat.com/
> >>>> I used a vfio_prereg_listener to force the S2 mapping.
> >>> Yes I remember that.
> >>>
> >>>> What has changed that forces us now to have this gym
> >>> This approach achieves the same outcome, but through a different
> >>> mechanism. Returning the system address space here ensures that VFIO
> >>> sets up the Stage-2 mappings for devices behind the accelerated
> >>> SMMUv3.
> >>>
> >>> I think, this makes sense because, in the accelerated case, the
> >>> device is no longer managed by QEMU’s SMMUv3 model. The
> >> On the other hand, as we discussed on v4 by returning system as you
> >> pretend there is no translation in place which is not true. Now we
> >> use an alias for it but it has not really removed its usage. Also it
> >> forces use to hack around the MSI mapping and introduce new
> PCIIOMMUOps.
> >> Have
> >> you assessed the feasability of using vfio_prereg_listener to force
> >> the
> >> S2 mapping. Is it simply not relevant anymore or could it be used
> >> also with the iommufd be integration? Eric
> > IIUC, the prereg_listener mechanism just enables us to setup the s2
> > mappings. For MSI, In your version, I see that smmu_find_add_as()
> > always returns IOMMU as. How is that supposed to work if the Guest has
> > s1 bypass mode STE for the device?
> 
> I need to delve into it again as I forgot the details. Will come back to you ...

I think the BYPASS case will work anyway, as in the smmuv3_translate()
fn we check the STE config (SMMU_TRANS_BYPASS) and it will just
return the same address back.

So we can do the same here in get_msi_address_space() and always
return the IOMMU AS. And that completely avoids &address_space_memory
in SMMUv3-accel, if that's the concern.

Thanks,
Shameer
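
For illustration only, a small self-contained sketch of the bypass
point above, with placeholder types rather than QEMU's SMMUv3 model:
when the STE config says bypass, a translate-style helper simply hands
the input address back, so always returning the IOMMU AS for MSI still
gives the right doorbell.

#include <stdint.h>
#include <stdio.h>

typedef uint64_t hwaddr;

typedef enum {
    SMMU_TRANS_BYPASS,   /* STE says stage 1 is bypassed */
    SMMU_TRANS_S1,       /* STE points at a stage-1 CD/page table */
} TransCfg;

/* Placeholder for a guest stage-1 walk (smmu_ptw()-like). */
static hwaddr stage1_walk(hwaddr giova)
{
    return giova ^ 0xfff000;       /* fake gIOVA -> gPA mapping */
}

static hwaddr msi_doorbell_translate(TransCfg cfg, hwaddr giova)
{
    if (cfg == SMMU_TRANS_BYPASS) {
        return giova;              /* identity: same address back */
    }
    return stage1_walk(giova);     /* otherwise walk the guest S1 table */
}

int main(void)
{
    hwaddr giova = 0x8000000;

    printf("bypass: 0x%llx\n",
           (unsigned long long)msi_doorbell_translate(SMMU_TRANS_BYPASS, giova));
    printf("s1:     0x%llx\n",
           (unsigned long long)msi_doorbell_translate(SMMU_TRANS_S1, giova));
    return 0;
}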






^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional get_msi_address_space() callback
  2025-11-04 18:56                             ` Jason Gunthorpe
@ 2025-11-04 19:31                               ` Nicolin Chen
  2025-11-04 19:35                                 ` Jason Gunthorpe
  0 siblings, 1 reply; 148+ messages in thread
From: Nicolin Chen @ 2025-11-04 19:31 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Shameer Kolothum, Eric Auger, qemu-arm@nongnu.org,
	qemu-devel@nongnu.org, peter.maydell@linaro.org,
	ddutile@redhat.com, berrange@redhat.com, Nathan Chen, Matt Ochs,
	smostafa@google.com, wangzhou1@hisilicon.com,
	jiangkunkun@huawei.com, jonathan.cameron@huawei.com,
	zhangfei.gao@linaro.org, zhenzhong.duan@intel.com,
	yi.l.liu@intel.com, Krishnakant Jaju

On Tue, Nov 04, 2025 at 02:56:51PM -0400, Jason Gunthorpe wrote:
> On Tue, Nov 04, 2025 at 10:44:27AM -0800, Nicolin Chen wrote:
> > KVM would always get the IRQ from HW, since the HW is programmed
> > correctly. But if gIOVA->vITS is not mapped, i.e. gIOVA is given
> > incorrectly, it can't inject the IRQ.
> 
> But this is a software interrupt, and I think it should still just
> ignore vMSI's address and assume it is mapped to a legal ITS
> page. There is just no way to validate it.
>
> Even SW MSI shouldn't fail because the vMSI has some weird IOVA in it
> that isn't mapped in the S2. That's wrong and is something the guest
> is permitted to do.

Hmm, that feels like a self-correction? But in a baremetal case,
if HW is programmed with a weird IOVA, interrupt would not work,
right?

Thanks
Nicolin


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional get_msi_address_space() callback
  2025-11-04 19:31                               ` Nicolin Chen
@ 2025-11-04 19:35                                 ` Jason Gunthorpe
  2025-11-04 19:43                                   ` Nicolin Chen
  2025-11-04 19:46                                   ` Shameer Kolothum
  0 siblings, 2 replies; 148+ messages in thread
From: Jason Gunthorpe @ 2025-11-04 19:35 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Shameer Kolothum, Eric Auger, qemu-arm@nongnu.org,
	qemu-devel@nongnu.org, peter.maydell@linaro.org,
	ddutile@redhat.com, berrange@redhat.com, Nathan Chen, Matt Ochs,
	smostafa@google.com, wangzhou1@hisilicon.com,
	jiangkunkun@huawei.com, jonathan.cameron@huawei.com,
	zhangfei.gao@linaro.org, zhenzhong.duan@intel.com,
	yi.l.liu@intel.com, Krishnakant Jaju

On Tue, Nov 04, 2025 at 11:31:50AM -0800, Nicolin Chen wrote:
> On Tue, Nov 04, 2025 at 02:56:51PM -0400, Jason Gunthorpe wrote:
> > On Tue, Nov 04, 2025 at 10:44:27AM -0800, Nicolin Chen wrote:
> > > KVM would always get the IRQ from HW, since the HW is programmed
> > > correctly. But if gIOVA->vITS is not mapped, i.e. gIOVA is given
> > > incorrectly, it can't inject the IRQ.
> > 
> > But this is a software interrupt, and I think it should still just
> > ignore vMSI's address and assume it is mapped to a legal ITS
> > page. There is just no way to validate it.
> >
> > Even SW MSI shouldn't fail because the vMSI has some weird IOVA in it
> > that isn't mapped in the S2. That's wrong and is something the guest
> > is permitted to do.
> 
> Hmm, that feels like a self-correction? But in a baremetal case,
> if HW is programmed with a weird IOVA, interrupt would not work,
> right?

Right, but qemu has no way to duplicate that behavior unless it walks
the full s1 and s2 page tables, which we have said it isn't going to
do.

So it should probably just ignore this check and assume the IOVA is
set properly, exactly the same as if it was HW injected using the RMR.

Jason


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional get_msi_address_space() callback
  2025-11-04 19:35                                 ` Jason Gunthorpe
@ 2025-11-04 19:43                                   ` Nicolin Chen
  2025-11-04 19:45                                     ` Jason Gunthorpe
  2025-11-04 19:46                                   ` Shameer Kolothum
  1 sibling, 1 reply; 148+ messages in thread
From: Nicolin Chen @ 2025-11-04 19:43 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Shameer Kolothum, Eric Auger, qemu-arm@nongnu.org,
	qemu-devel@nongnu.org, peter.maydell@linaro.org,
	ddutile@redhat.com, berrange@redhat.com, Nathan Chen, Matt Ochs,
	smostafa@google.com, wangzhou1@hisilicon.com,
	jiangkunkun@huawei.com, jonathan.cameron@huawei.com,
	zhangfei.gao@linaro.org, zhenzhong.duan@intel.com,
	yi.l.liu@intel.com, Krishnakant Jaju

On Tue, Nov 04, 2025 at 03:35:21PM -0400, Jason Gunthorpe wrote:
> On Tue, Nov 04, 2025 at 11:31:50AM -0800, Nicolin Chen wrote:
> > On Tue, Nov 04, 2025 at 02:56:51PM -0400, Jason Gunthorpe wrote:
> > > On Tue, Nov 04, 2025 at 10:44:27AM -0800, Nicolin Chen wrote:
> > > > KVM would always get the IRQ from HW, since the HW is programmed
> > > > correctly. But if gIOVA->vITS is not mapped, i.e. gIOVA is given
> > > > incorrectly, it can't inject the IRQ.
> > > 
> > > But this is a software interrupt, and I think it should still just
> > > ignore vMSI's address and assume it is mapped to a legal ITS
> > > page. There is just no way to validate it.
> > >
> > > Even SW MSI shouldn't fail because the vMSI has some weird IOVA in it
> > > that isn't mapped in the S2. That's wrong and is something the guest
> > > is permitted to do.
> > 
> > Hmm, that feels like a self-correction? But in a baremetal case,
> > if HW is programmed with a weird IOVA, interrupt would not work,
> > right?
> 
> Right, but qemu has no way to duplicate that behavior unless it walks
> the full s1 and s2 page tables, which we have said it isn't going to
> do.

I think it could.

The stage-1 page table is in the guest RAM. And vSMMU has already
implemented the logic to walk through a guest page table. What KVM
has already been doing today is to ask vSMMU to translate that.

What we haven't implemented today is, if gIOVA is a weird one that
isn't translatable, vSMMU should trigger an F_TRANSLATION event as
the real HW does.

> So it should probably just ignore this check and assume the IOVA is
> set properly, exactly the same as if it was HW injected using the RMR.

Hmm, I am not sure about that, especially considering our plan to
support the true 2-stage mapping: gIOVA->vITS->pITS :-/

Thanks
Nicolin


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional get_msi_address_space() callback
  2025-11-04 19:43                                   ` Nicolin Chen
@ 2025-11-04 19:45                                     ` Jason Gunthorpe
  2025-11-04 19:59                                       ` Nicolin Chen
  0 siblings, 1 reply; 148+ messages in thread
From: Jason Gunthorpe @ 2025-11-04 19:45 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Shameer Kolothum, Eric Auger, qemu-arm@nongnu.org,
	qemu-devel@nongnu.org, peter.maydell@linaro.org,
	ddutile@redhat.com, berrange@redhat.com, Nathan Chen, Matt Ochs,
	smostafa@google.com, wangzhou1@hisilicon.com,
	jiangkunkun@huawei.com, jonathan.cameron@huawei.com,
	zhangfei.gao@linaro.org, zhenzhong.duan@intel.com,
	yi.l.liu@intel.com, Krishnakant Jaju

On Tue, Nov 04, 2025 at 11:43:07AM -0800, Nicolin Chen wrote:
> > Right, but qemu has no way to duplicate that behavior unless it walks
> > the full s1 and s2 page tables, which we have said it isn't going to
> > do.
> 
> I think it could.
> 
> The stage-1 page table is in the guest RAM. And vSMMU has already
> implemented the logic to walk through a guest page table. What KVM
> has already been doing today is to ask vSMMU to translate that.

No, we can't. The existing vsmmu code could do it because it mediated
the invalidation path. As soon as you have something like vcmdq the
hypervisor cannot walk the page tables.

> > So it should probably just ignore this check and assume the IOVA is
> > set properly, exactly the same as if it was HW injected using the RMR.
> 
> Hmm, I am not sure about that, especially considering our plan to
> support the true 2-stage mapping: gIOVA->vITS->pITS :-/

In true 2-stage mode the HW path will work perfectly, and the SW path
will remain deficient in not checking for invalid configuration.

I don't see another sensible choice.

Jason


^ permalink raw reply	[flat|nested] 148+ messages in thread

* RE: [PATCH v5 15/32] hw/pci/pci: Introduce optional get_msi_address_space() callback
  2025-11-04 19:35                                 ` Jason Gunthorpe
  2025-11-04 19:43                                   ` Nicolin Chen
@ 2025-11-04 19:46                                   ` Shameer Kolothum
  2025-11-05 12:52                                     ` Jason Gunthorpe
  1 sibling, 1 reply; 148+ messages in thread
From: Shameer Kolothum @ 2025-11-04 19:46 UTC (permalink / raw)
  To: Jason Gunthorpe, Nicolin Chen
  Cc: Eric Auger, qemu-arm@nongnu.org, qemu-devel@nongnu.org,
	peter.maydell@linaro.org, ddutile@redhat.com, berrange@redhat.com,
	Nathan Chen, Matt Ochs, smostafa@google.com,
	wangzhou1@hisilicon.com, jiangkunkun@huawei.com,
	jonathan.cameron@huawei.com, zhangfei.gao@linaro.org,
	zhenzhong.duan@intel.com, yi.l.liu@intel.com, Krishnakant Jaju



> -----Original Message-----
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: 04 November 2025 19:35
> To: Nicolin Chen <nicolinc@nvidia.com>
> Cc: Shameer Kolothum <skolothumtho@nvidia.com>; Eric Auger
> <eric.auger@redhat.com>; qemu-arm@nongnu.org; qemu-
> devel@nongnu.org; peter.maydell@linaro.org; ddutile@redhat.com;
> berrange@redhat.com; Nathan Chen <nathanc@nvidia.com>; Matt Ochs
> <mochs@nvidia.com>; smostafa@google.com; wangzhou1@hisilicon.com;
> jiangkunkun@huawei.com; jonathan.cameron@huawei.com;
> zhangfei.gao@linaro.org; zhenzhong.duan@intel.com; yi.l.liu@intel.com;
> Krishnakant Jaju <kjaju@nvidia.com>
> Subject: Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional
> get_msi_address_space() callback
> 
> On Tue, Nov 04, 2025 at 11:31:50AM -0800, Nicolin Chen wrote:
> > On Tue, Nov 04, 2025 at 02:56:51PM -0400, Jason Gunthorpe wrote:
> > > On Tue, Nov 04, 2025 at 10:44:27AM -0800, Nicolin Chen wrote:
> > > > KVM would always get the IRQ from HW, since the HW is programmed
> > > > correctly. But if gIOVA->vITS is not mapped, i.e. gIOVA is given
> > > > incorrectly, it can't inject the IRQ.
> > >
> > > But this is a software interrupt, and I think it should still just
> > > ignore vMSI's address and assume it is mapped to a legal ITS
> > > page. There is just no way to validate it.
> > >
> > > Even SW MSI shouldn't fail because the vMSI has some weird IOVA in it
> > > that isn't mapped in the S2. That's wrong and is something the guest
> > > is permitted to do.
> >
> > Hmm, that feels like a self-correction? But in a baremetal case,
> > if HW is programmed with a weird IOVA, interrupt would not work,
> > right?
> 
> Right, but qemu has no way to duplicate that behavior unless it walks
> the full s1 and s2 page tables, which we have said it isn't going to
> do.
> So it should probably just ignore this check and assume the IOVA is
> set properly, exactly the same as if it was HW injected using the RMR.

TBH, I am a bit lost here. Anyway, this is my understanding.

If we ignore this and don't return the correct doorbell (gPA) here,
QEMU will end up invoking KVM_SET_GSI_ROUTING with the wrong doorbell,
which sets up the in-kernel vGIC IRQ routing information. And when the
HW raises the IRQ, KVM can't inject it properly.

Thanks,
Shameer
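
For illustration only, a small self-contained sketch of the dependency
described above. The gIOVA-to-gPA lookup is a made-up placeholder; the
routing-entry layout, KVM_MSI_VALID_DEVID and KVM_SET_GSI_ROUTING come
from the real KVM UAPI (linux/kvm.h). This only shows how the
translated doorbell would land in the route, not QEMU's actual code
path.

#include <linux/kvm.h>
#include <stdint.h>
#include <stdio.h>

/* Placeholder for the gIOVA -> gPA lookup done through the vSMMU AS. */
static uint64_t translate_msi_doorbell(uint64_t giova)
{
    return 0x08080000ULL + (giova & 0xfff);  /* pretend this is the vITS page */
}

static void fill_msi_route(struct kvm_irq_routing_entry *e,
                           uint32_t gsi, uint32_t devid,
                           uint64_t giova, uint32_t data)
{
    uint64_t doorbell = translate_msi_doorbell(giova);

    e->gsi = gsi;
    e->type = KVM_IRQ_ROUTING_MSI;
    e->flags = KVM_MSI_VALID_DEVID;          /* required for GICv3 ITS */
    e->u.msi.address_lo = (uint32_t)doorbell;
    e->u.msi.address_hi = (uint32_t)(doorbell >> 32);
    e->u.msi.data = data;
    e->u.msi.devid = devid;
    /* A real VMM would now issue ioctl(vm_fd, KVM_SET_GSI_ROUTING, ...)
     * with this entry; a wrong doorbell here means KVM cannot inject
     * the IRQ when the HW raises it. */
}

int main(void)
{
    struct kvm_irq_routing_entry e = { 0 };

    fill_msi_route(&e, /*gsi=*/5, /*devid=*/0x10000,
                   /*giova=*/0xffff0040ULL, /*data=*/0x1);
    printf("gsi %u -> doorbell 0x%x%08x\n",
           e.gsi, e.u.msi.address_hi, e.u.msi.address_lo);
    return 0;
}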

 




^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional get_msi_address_space() callback
  2025-11-04 19:45                                     ` Jason Gunthorpe
@ 2025-11-04 19:59                                       ` Nicolin Chen
  0 siblings, 0 replies; 148+ messages in thread
From: Nicolin Chen @ 2025-11-04 19:59 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Shameer Kolothum, Eric Auger, qemu-arm@nongnu.org,
	qemu-devel@nongnu.org, peter.maydell@linaro.org,
	ddutile@redhat.com, berrange@redhat.com, Nathan Chen, Matt Ochs,
	smostafa@google.com, wangzhou1@hisilicon.com,
	jiangkunkun@huawei.com, jonathan.cameron@huawei.com,
	zhangfei.gao@linaro.org, zhenzhong.duan@intel.com,
	yi.l.liu@intel.com, Krishnakant Jaju

On Tue, Nov 04, 2025 at 03:45:52PM -0400, Jason Gunthorpe wrote:
> On Tue, Nov 04, 2025 at 11:43:07AM -0800, Nicolin Chen wrote:
> > > Right, but qemu has no way to duplicate that behavior unless it walks
> > > the full s1 and s2 page tables, which we have said it isn't going to
> > > do.
> > 
> > I think it could.
> > 
> > The stage-1 page table is in the guest RAM. And vSMMU has already
> > implemented the logic to walk through a guest page table. What KVM
> > has already been doing today is to ask vSMMU to translate that.
> 
> No, we can't. The existing vsmmu code could do it because it mediated
> the invalidation path. As soon as you have something like vcmdq the
> hypervisor cannot walk the page tables.

Hmm? It does walk through the page table (not the invalidation path):
https://github.com/qemu/qemu/blob/master/hw/arm/smmu-common.c#L444

And VCMDQ can work with that. We've tested it.

Nicolin


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional get_msi_address_space() callback
  2025-11-04 17:47             ` Nicolin Chen
@ 2025-11-05  7:47               ` Eric Auger
  2025-11-05 19:30                 ` Nicolin Chen
  0 siblings, 1 reply; 148+ messages in thread
From: Eric Auger @ 2025-11-05  7:47 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Shameer Kolothum, qemu-arm@nongnu.org, qemu-devel@nongnu.org,
	peter.maydell@linaro.org, Jason Gunthorpe, ddutile@redhat.com,
	berrange@redhat.com, Nathan Chen, Matt Ochs, smostafa@google.com,
	wangzhou1@hisilicon.com, jiangkunkun@huawei.com,
	jonathan.cameron@huawei.com, zhangfei.gao@linaro.org,
	zhenzhong.duan@intel.com, yi.l.liu@intel.com, Krishnakant Jaju

Hi Nicolin,

On 11/4/25 6:47 PM, Nicolin Chen wrote:
> On Tue, Nov 04, 2025 at 05:01:57PM +0100, Eric Auger wrote:
>>>>>> On 10/31/25 11:49 AM, Shameer Kolothum wrote:
>>>>>>> On ARM, devices behind an IOMMU have their MSI doorbell addresses
>>>>>>> translated by the IOMMU. In nested mode, this translation happens in
>>>>>>> two stages (gIOVA → gPA → ITS page).
>>>>>>>
>>>>>>> In accelerated SMMUv3 mode, both stages are handled by hardware, so
>>>>>>> get_address_space() returns the system address space so that VFIO
>>>>>>> can setup stage-2 mappings for system address space.
>>>>>> Sorry but I still don't catch the above. Can you explain (most probably
>>>>>> again) why this is a requirement to return the system as so that VFIO
>>>>>> can setup stage-2 mappings for system address space. I am sorry for
>>>>>> insisting (at the risk of being stubborn or dumb) but I fail to
>>>>>> understand the requirement. As far as I remember the way I integrated it
>>>>>> at the old times did not require that change:
>>>>>> https://lore.kernel.org/all/20210411120912.15770-1-
>>>>>> eric.auger@redhat.com/
>>>>>> I used a vfio_prereg_listener to force the S2 mapping.
>>>>> Yes I remember that.
>>>>>
>>>>>> What has changed that forces us now to have this gym
>>>>> This approach achieves the same outcome, but through a
>>>>> different mechanism. Returning the system address space
>>>>> here ensures that VFIO sets up the Stage-2 mappings for
>>>>> devices behind the accelerated SMMUv3.
>>>>>
>>>>> I think, this makes sense because, in the accelerated case, the
>>>>> device is no longer managed by QEMU’s SMMUv3 model. The
>>>> On the other hand, as we discussed on v4, by returning the system AS you
>>>> pretend there is no translation in place, which is not true. Now we use
>>>> an alias for it but it has not really removed its usage. Also it forces
>>>> us to hack around the MSI mapping and introduce new PCIIOMMUOps. Have
>>>> you assessed the feasibility of using vfio_prereg_listener to force the
>>>> S2 mapping? Is it simply not relevant anymore or could it be used also
>>>> with the iommufd BE integration? Eric
>>> IIUC, the prereg_listener mechanism just enables us to setup the s2
>>> mappings. For MSI, In your version, I see that smmu_find_add_as()
>>> always returns IOMMU as. How is that supposed to work if the Guest
>>> has s1 bypass mode STE for the device?
>> I need to delve into it again as I forgot the details. Will come back to
>> you ...
> We aligned with Intel previously about this system address space.
> You might know these very well, yet here are the breakdowns:
>
> 1. VFIO core has a container that manages an HWPT. By default, it
>    allocates a stage-1 normal HWPT, unless vIOMMU requests for a
You may clarify that this stage-1 normal HWPT is used to map GPA to HPA (so
it eventually implements stage 2).
>    nesting parent HWPT for accelerated cases.
> 2. VFIO core adds a listener for that HWPT and sets up a handler
>    vfio_container_region_add() where it checks the memory region
>    whether it is iommu or not.
>    a. In case of !IOMMU as (i.e. system address space), it treats
>       the address space as a RAM region, and handles all stage-2
>       mappings for the core allocated nesting parent HWPT.
>    b. In case of IOMMU as (i.e. a translation type) it sets up
>       the IOTLB notifier and translation replay while bypassing
>       the listener for RAM region.
yes S1+S2 are combined through vfio_iommu_map_notify()
>
> In an accelerated case, we need stage-2 mappings to match with the
> nesting parent HWPT. So, returning system address space or an alias
> of that notifies the vfio core to take the 2.a path.
>
> If we take 2.b path by returning IOMMU as in smmu_find_add_as, the
> VFIO core would no longer listen to the RAM region for us, i.e. no
> stage-2 HWPT nor mappings. vIOMMU would have to allocate a nesting
except if you change the VFIO common.c as I did in the past to force the S2
mapping in the nested config.
See
https://lore.kernel.org/all/20210411120912.15770-16-eric.auger@redhat.com/
and vfio_prereg_listener()
Again I do not say this is the right way to do it, but using the system
address space is not the "only" implementation choice I think, and it needs
to be properly justified, especially as it has at least 2 side effects:
- somehow abusing the semantics of the returned address space and pretending
there is no IOMMU translation in place, and
- also impacting the way MSIs are handled (introduction of a new
PCIIOMMUOps).
This kind of explanation you wrote is absolutely needed in the commit msg,
I think, for reviewers to understand the design choice.

Eric
> parent and manage the stage-2 mappings by adding a listener in its
> own code, which is largely duplicated with the core code.
>
> -------------- so far this works for Intel and ARM--------------
>
> 3. On ARM, vPCI device is programmed with gIOVA, so KVM has to
>    follow what the vPCI is told to inject vIRQs. This requires
>    a translation at the nested stage-1 address space. Note that
>    vSMMU in this case doesn't manage translation as it doesn't
>    need to. But there is no other sane way for KVM to know the
>    vITS page corresponding to the given gIOVA. So, we invented
>    the get_msi_address_space op.
>
> (3) makes sense because there is a complication in the MSI that
> does a 2-stage translation on ARM and KVM must follow the stage-1
> input address, leaving us no choice to have two address spaces.
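
To make the 2.a/2.b split above concrete, here is a minimal, self-contained
sketch of that dispatch. The types and helpers are local stand-ins, not
QEMU's real vfio_container_region_add() code:

#include <stdbool.h>
#include <stdio.h>

/* Stand-in for a MemoryRegionSection: only the property we dispatch on. */
typedef struct {
    bool is_iommu;
    const char *name;
} Section;

/* 2.b: IOMMU address space -> IOTLB notifier + translation replay only. */
static void setup_iotlb_notifier(const Section *s)
{
    printf("2.b: register IOTLB notifier for %s, no RAM mapping\n", s->name);
}

/* 2.a: system address space (or an alias of it) -> map guest RAM into the
 * stage-2 / nesting-parent HWPT. */
static void map_into_stage2_hwpt(const Section *s)
{
    printf("2.a: stage-2 map of %s into the nesting parent HWPT\n", s->name);
}

static void listener_region_add(const Section *s)
{
    if (s->is_iommu) {
        setup_iotlb_notifier(s);
    } else {
        map_into_stage2_hwpt(s);
    }
}

int main(void)
{
    Section ram  = { .is_iommu = false, .name = "guest RAM (system AS alias)" };
    Section smmu = { .is_iommu = true,  .name = "vSMMU IOMMU AS" };

    listener_region_add(&ram);   /* accelerated case: system AS returned */
    listener_region_add(&smmu);  /* emulated case: IOMMU AS returned */
    return 0;
}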
>
> Thanks
> Nicolin
>



^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional get_msi_address_space() callback
  2025-11-04 15:14         ` Shameer Kolothum
  2025-11-04 16:01           ` Eric Auger
@ 2025-11-05  8:56           ` Eric Auger
  2025-11-05 11:41             ` Shameer Kolothum
  1 sibling, 1 reply; 148+ messages in thread
From: Eric Auger @ 2025-11-05  8:56 UTC (permalink / raw)
  To: Shameer Kolothum, qemu-arm@nongnu.org, qemu-devel@nongnu.org
  Cc: peter.maydell@linaro.org, Jason Gunthorpe, Nicolin Chen,
	ddutile@redhat.com, berrange@redhat.com, Nathan Chen, Matt Ochs,
	smostafa@google.com, wangzhou1@hisilicon.com,
	jiangkunkun@huawei.com, jonathan.cameron@huawei.com,
	zhangfei.gao@linaro.org, zhenzhong.duan@intel.com,
	yi.l.liu@intel.com, Krishnakant Jaju

Hi Shameer,

On 11/4/25 4:14 PM, Shameer Kolothum wrote:
>
>> -----Original Message-----
>> From: Eric Auger <eric.auger@redhat.com>
>> Sent: 04 November 2025 14:44
>> To: Shameer Kolothum <skolothumtho@nvidia.com>; qemu-
>> arm@nongnu.org; qemu-devel@nongnu.org
>> Cc: peter.maydell@linaro.org; Jason Gunthorpe <jgg@nvidia.com>; Nicolin
>> Chen <nicolinc@nvidia.com>; ddutile@redhat.com; berrange@redhat.com;
>> Nathan Chen <nathanc@nvidia.com>; Matt Ochs <mochs@nvidia.com>;
>> smostafa@google.com; wangzhou1@hisilicon.com;
>> jiangkunkun@huawei.com; jonathan.cameron@huawei.com;
>> zhangfei.gao@linaro.org; zhenzhong.duan@intel.com; yi.l.liu@intel.com;
>> Krishnakant Jaju <kjaju@nvidia.com>
>> Subject: Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional
>> get_msi_address_space() callback
>>
>> On 11/4/25 3:37 PM, Shameer Kolothum wrote:
>>> Hi Eric,
>>>
>>>> -----Original Message-----
>>>> From: Eric Auger <eric.auger@redhat.com>
>>>> Sent: 04 November 2025 14:12
>>>> To: Shameer Kolothum <skolothumtho@nvidia.com>; qemu-
>>>> arm@nongnu.org; qemu-devel@nongnu.org
>>>> Cc: peter.maydell@linaro.org; Jason Gunthorpe <jgg@nvidia.com>; Nicolin
>>>> Chen <nicolinc@nvidia.com>; ddutile@redhat.com; berrange@redhat.com;
>>>> Nathan Chen <nathanc@nvidia.com>; Matt Ochs <mochs@nvidia.com>;
>>>> smostafa@google.com; wangzhou1@hisilicon.com;
>>>> jiangkunkun@huawei.com; jonathan.cameron@huawei.com;
>>>> zhangfei.gao@linaro.org; zhenzhong.duan@intel.com; yi.l.liu@intel.com;
>>>> Krishnakant Jaju <kjaju@nvidia.com>
>>>> Subject: Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional
>>>> get_msi_address_space() callback
>>>>
>>>> Hi Shameer, Nicolin,
>>>>
>>>> On 10/31/25 11:49 AM, Shameer Kolothum wrote:
>>>>> On ARM, devices behind an IOMMU have their MSI doorbell addresses
>>>>> translated by the IOMMU. In nested mode, this translation happens in
>>>>> two stages (gIOVA → gPA → ITS page).
>>>>>
>>>>> In accelerated SMMUv3 mode, both stages are handled by hardware, so
>>>>> get_address_space() returns the system address space so that VFIO
>>>>> can setup stage-2 mappings for system address space.
>>>> Sorry but I still don't catch the above. Can you explain (most probably
>>>> again) why this is a requirement to return the system as so that VFIO
>>>> can setup stage-2 mappings for system address space. I am sorry for
>>>> insisting (at the risk of being stubborn or dumb) but I fail to
>>>> understand the requirement. As far as I remember the way I integrated it
>>>> at the old times did not require that change:
>>>> https://lore.kernel.org/all/20210411120912.15770-1-
>>>> eric.auger@redhat.com/
>>>> I used a vfio_prereg_listener to force the S2 mapping.
>>> Yes I remember that.
>>>
>>>> What has changed that forces us now to have this gymnastics?
>>> This approach achieves the same outcome, but through a
>>> different mechanism. Returning the system address space
>>> here ensures that VFIO sets up the Stage-2 mappings for
>>> devices behind the accelerated SMMUv3.
>>>
>>> I think, this makes sense because, in the accelerated case, the
>>> device is no longer managed by QEMU’s SMMUv3 model. The
>> On the other hand, as we discussed on v4, by returning the system AS you
>> pretend there is no translation in place, which is not true. Now we use
>> an alias for it but it has not really removed its usage. Also it forces
>> us to hack around the MSI mapping and introduce new PCIIOMMUOps. Have
>> you assessed the feasibility of using vfio_prereg_listener to force the
>> S2 mapping? Is it simply not relevant anymore or could it be used also
>> with the iommufd BE integration? Eric
> IIUC, the prereg_listener mechanism just enables us to setup the s2
> mappings. For MSI, In your version, I see that smmu_find_add_as()
> always returns IOMMU as. How is that supposed to work if the Guest
> has s1 bypass mode STE for the device?
In kvm_arch_fixup_msi_route(), since we have as != &address_space_memory in
my case, we proceed with the actual translation of the doorbell gIOVA
using address_space_translate(). I guess if S1 is in bypass mode
you get the flat translation, no?

Eric
>
> Thanks,
> Shameer
>
>



^ permalink raw reply	[flat|nested] 148+ messages in thread

* RE: [PATCH v5 15/32] hw/pci/pci: Introduce optional get_msi_address_space() callback
  2025-11-05  8:56           ` Eric Auger
@ 2025-11-05 11:41             ` Shameer Kolothum
  2025-11-05 17:25               ` Eric Auger
  0 siblings, 1 reply; 148+ messages in thread
From: Shameer Kolothum @ 2025-11-05 11:41 UTC (permalink / raw)
  To: eric.auger@redhat.com, qemu-arm@nongnu.org, qemu-devel@nongnu.org
  Cc: peter.maydell@linaro.org, Jason Gunthorpe, Nicolin Chen,
	ddutile@redhat.com, berrange@redhat.com, Nathan Chen, Matt Ochs,
	smostafa@google.com, wangzhou1@hisilicon.com,
	jiangkunkun@huawei.com, jonathan.cameron@huawei.com,
	zhangfei.gao@linaro.org, zhenzhong.duan@intel.com,
	yi.l.liu@intel.com, Krishnakant Jaju


Hi Eric,

> -----Original Message-----
> From: Eric Auger <eric.auger@redhat.com>
> Sent: 05 November 2025 08:57
> To: Shameer Kolothum <skolothumtho@nvidia.com>; qemu-
> arm@nongnu.org; qemu-devel@nongnu.org
> Cc: peter.maydell@linaro.org; Jason Gunthorpe <jgg@nvidia.com>; Nicolin
> Chen <nicolinc@nvidia.com>; ddutile@redhat.com; berrange@redhat.com;
> Nathan Chen <nathanc@nvidia.com>; Matt Ochs <mochs@nvidia.com>;
> smostafa@google.com; wangzhou1@hisilicon.com;
> jiangkunkun@huawei.com; jonathan.cameron@huawei.com;
> zhangfei.gao@linaro.org; zhenzhong.duan@intel.com; yi.l.liu@intel.com;
> Krishnakant Jaju <kjaju@nvidia.com>
> Subject: Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional
> get_msi_address_space() callback
[...]
> > IIUC, the prereg_listener mechanism just enables us to setup the s2
> > mappings. For MSI, In your version, I see that smmu_find_add_as()
> > always returns IOMMU as. How is that supposed to work if the Guest
> > has s1 bypass mode STE for the device?
> in kvm_arch_fixup_msi_route(), as we have as != &address_space_memory in
> my case, we proceed with the actual translation for the doorbell gIOVA
> using address_space_translate(). I  guess if the S1 is in bypass mode
> you get the flat translation, no?

Yes, I noted that and replied as well.

Again, coming back to kvm_arch_fixup_msi_route(), I see that this was introduced
as part of your "ARM SMMUv3 Emulation Support" series here,
https://lore.kernel.org/qemu-devel/1523518688-26674-12-git-send-email-eric.auger@redhat.com/

The VFIO support was not there at that time. I am trying to understand why
we need this MSI translation for vfio-pci in this accelerated case. My understanding
was that this is to set up the KVM MSI routing via the KVM_SET_GSI_ROUTING ioctl.

Is that right?

Thanks,
Shameer




^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional get_msi_address_space() callback
  2025-11-04 19:46                                   ` Shameer Kolothum
@ 2025-11-05 12:52                                     ` Jason Gunthorpe
  0 siblings, 0 replies; 148+ messages in thread
From: Jason Gunthorpe @ 2025-11-05 12:52 UTC (permalink / raw)
  To: Shameer Kolothum
  Cc: Nicolin Chen, Eric Auger, qemu-arm@nongnu.org,
	qemu-devel@nongnu.org, peter.maydell@linaro.org,
	ddutile@redhat.com, berrange@redhat.com, Nathan Chen, Matt Ochs,
	smostafa@google.com, wangzhou1@hisilicon.com,
	jiangkunkun@huawei.com, jonathan.cameron@huawei.com,
	zhangfei.gao@linaro.org, zhenzhong.duan@intel.com,
	yi.l.liu@intel.com, Krishnakant Jaju

On Tue, Nov 04, 2025 at 07:46:46PM +0000, Shameer Kolothum wrote:

> If we ignore and don't return the correct doorbell (gPA) here, 
> Qemu will end up invoking KVM_SET_GSI_ROUTING with wrong doorbell
> which sets up the in-kernel vgic irq routing information. And when HW
> raises the IRQ, KVM can't inject it properly.

That cannot be true.

Again, there is no way for qemu to put something meaningful into the
'struct kvm_irq_routing_msi' address_lo/hi. It cannot walk the page
tables so it just ends up with some random meaningless guest IOVA.

Qemu MUST ignore the vMSI's address information.

So either the kernel ignores address_lo/high

OR qemu should match the vPCI device to its single vGIC and always put
in the address_lo/high the kernel expects.

It should never, ever use the value from the guest once nesting is
enabled, and it should never be trying to translate the vMSI through
some S2, or any other, address space.

Translation is OK for non-nesting only.
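
A minimal, self-contained sketch of the second option (always put in the
doorbell QEMU knows about and ignore the guest-programmed address). The
struct below only mirrors the address_lo/hi/data fields of the Linux
kvm_irq_routing_msi UAPI, and the doorbell constant is an example value,
not the real virt machine layout:

#include <stdint.h>
#include <stdio.h>

/* Local stand-in for the address_lo/hi/data fields of kvm_irq_routing_msi. */
struct msi_route {
    uint32_t address_lo;
    uint32_t address_hi;
    uint32_t data;
};

/* Example vITS GITS_TRANSLATER GPA; the real value comes from the machine
 * memory map, not from here. */
#define EXAMPLE_VITS_DOORBELL_GPA 0x08090040ULL

static void fixup_msi_route(struct msi_route *r, uint64_t guest_msi_addr,
                            uint32_t data)
{
    /* The guest-programmed address is a gIOVA that cannot (and must not)
     * be walked in the accelerated case, so it is ignored here. */
    (void)guest_msi_addr;

    r->address_lo = (uint32_t)EXAMPLE_VITS_DOORBELL_GPA;
    r->address_hi = (uint32_t)(EXAMPLE_VITS_DOORBELL_GPA >> 32);
    r->data = data;
}

int main(void)
{
    struct msi_route r;

    fixup_msi_route(&r, 0xdeadbeef000ULL /* meaningless gIOVA */, 42);
    printf("MSI route: addr_hi=%#x addr_lo=%#x data=%u\n",
           r.address_hi, r.address_lo, r.data);
    return 0;
}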

Jason


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional get_msi_address_space() callback
  2025-11-05 11:41             ` Shameer Kolothum
@ 2025-11-05 17:25               ` Eric Auger
  2025-11-05 18:10                 ` Jason Gunthorpe
  0 siblings, 1 reply; 148+ messages in thread
From: Eric Auger @ 2025-11-05 17:25 UTC (permalink / raw)
  To: Shameer Kolothum, qemu-arm@nongnu.org, qemu-devel@nongnu.org
  Cc: peter.maydell@linaro.org, Jason Gunthorpe, Nicolin Chen,
	ddutile@redhat.com, berrange@redhat.com, Nathan Chen, Matt Ochs,
	smostafa@google.com, wangzhou1@hisilicon.com,
	jiangkunkun@huawei.com, jonathan.cameron@huawei.com,
	zhangfei.gao@linaro.org, zhenzhong.duan@intel.com,
	yi.l.liu@intel.com, Krishnakant Jaju



On 11/5/25 12:41 PM, Shameer Kolothum wrote:
> Hi Eric,
>
>> -----Original Message-----
>> From: Eric Auger <eric.auger@redhat.com>
>> Sent: 05 November 2025 08:57
>> To: Shameer Kolothum <skolothumtho@nvidia.com>; qemu-
>> arm@nongnu.org; qemu-devel@nongnu.org
>> Cc: peter.maydell@linaro.org; Jason Gunthorpe <jgg@nvidia.com>; Nicolin
>> Chen <nicolinc@nvidia.com>; ddutile@redhat.com; berrange@redhat.com;
>> Nathan Chen <nathanc@nvidia.com>; Matt Ochs <mochs@nvidia.com>;
>> smostafa@google.com; wangzhou1@hisilicon.com;
>> jiangkunkun@huawei.com; jonathan.cameron@huawei.com;
>> zhangfei.gao@linaro.org; zhenzhong.duan@intel.com; yi.l.liu@intel.com;
>> Krishnakant Jaju <kjaju@nvidia.com>
>> Subject: Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional
>> get_msi_address_space() callback
> [...]
>>> IIUC, the prereg_listener mechanism just enables us to setup the s2
>>> mappings. For MSI, In your version, I see that smmu_find_add_as()
>>> always returns IOMMU as. How is that supposed to work if the Guest
>>> has s1 bypass mode STE for the device?
>> in kvm_arch_fixup_msi_route(), as we have as != &address_space_memory in
>> my case, we proceed with the actual translation for the doorbell gIOVA
>> using address_space_translate(). I  guess if the S1 is in bypass mode
>> you get the flat translation, no?
> Yes, I noted that and replied as well.
>
> Again, coming back to kvm_arch_fixup_msi_route(), I see that this was introduced
> as part of your " ARM SMMUv3 Emulation Support" here,
> https://lore.kernel.org/qemu-devel/1523518688-26674-12-git-send-email-eric.auger@redhat.com/
>
> The VFIO support was not there at that time. I am trying to understand why
> we need this MSI translation for vfio-pci in this accelerated case. My understanding
> was that this is to setup the KVM MSI routings via KVM_SET_GSI_ROUTING ioctl.

Yes, that's correct. This was first needed for vhost integration. And
obviously this is also needed for VFIO.

It allows the vhost irqfd to trigger a GSI that will be routed by KVM to the
actual guest doorbell. On top of that it registers the guest PCI BDF for
GICv2m or GICv3 MSI translation setup.
If the guest doorbell address is wrong because it was not properly translated,
vgic_msi_to_its() will fail to identify the ITS to inject the MSI into.
See kernel kvm/vgic/vgic-its.c vgic_msi_to_its and vgic_its_inject_msi
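
For illustration only, a self-contained sketch of the lookup described above
(local stand-ins, not the kernel's vgic code): the routed MSI address is
matched against each vITS doorbell GPA, so a wrong or untranslated address
finds no ITS and the injection fails:

#include <stdint.h>
#include <stdio.h>

struct vits {
    uint64_t doorbell_gpa;   /* GITS_TRANSLATER GPA of this vITS */
    const char *name;
};

/* Stand-in for vgic_msi_to_its(): find the ITS whose doorbell matches the
 * MSI address carried by the routing entry. */
static const struct vits *msi_to_its(const struct vits *its, int n,
                                     uint64_t msi_addr)
{
    for (int i = 0; i < n; i++) {
        if (its[i].doorbell_gpa == msi_addr) {
            return &its[i];
        }
    }
    return NULL;   /* wrong/untranslated address: injection fails */
}

int main(void)
{
    struct vits its[] = { { 0x08090040ULL, "its0" } };   /* example GPA */

    const struct vits *hit  = msi_to_its(its, 1, 0x08090040ULL);
    const struct vits *miss = msi_to_its(its, 1, 0xdead0000ULL);

    printf("correct doorbell -> %s\n", hit ? hit->name : "no ITS, MSI lost");
    printf("guest gIOVA      -> %s\n", miss ? miss->name : "no ITS, MSI lost");
    return 0;
}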

Eric
>
> Is that right?
>
> Thanks,
> Shameer
>
>
>



^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional get_msi_address_space() callback
  2025-11-04 17:41                     ` Jason Gunthorpe
  2025-11-04 17:57                       ` Nicolin Chen
@ 2025-11-05 17:32                       ` Eric Auger
  1 sibling, 0 replies; 148+ messages in thread
From: Eric Auger @ 2025-11-05 17:32 UTC (permalink / raw)
  To: Jason Gunthorpe, Nicolin Chen
  Cc: Shameer Kolothum, qemu-arm@nongnu.org, qemu-devel@nongnu.org,
	peter.maydell@linaro.org, ddutile@redhat.com, berrange@redhat.com,
	Nathan Chen, Matt Ochs, smostafa@google.com,
	wangzhou1@hisilicon.com, jiangkunkun@huawei.com,
	jonathan.cameron@huawei.com, zhangfei.gao@linaro.org,
	zhenzhong.duan@intel.com, yi.l.liu@intel.com, Krishnakant Jaju



On 11/4/25 6:41 PM, Jason Gunthorpe wrote:
> On Tue, Nov 04, 2025 at 09:11:55AM -0800, Nicolin Chen wrote:
>> On Tue, Nov 04, 2025 at 11:35:35AM -0400, Jason Gunthorpe wrote:
>>> On Tue, Nov 04, 2025 at 03:20:59PM +0000, Shameer Kolothum wrote:
>>>>> On Tue, Nov 04, 2025 at 02:58:44PM +0000, Shameer Kolothum wrote:
>>>>>>> Sure it is trapped, but nothing should be looking at the MSI address
>>>>>>> from the guest, it is meaningless and wrong information. Just ignore
>>>>>>> it.
>>>>>> Hmm.. we need to setup the doorbell address correctly.
>>>>>> If we don't do the translation here, it will use the Guest IOVA
>>>>>> address. Remember, we are using the IORT RMR identity mapping to get
>>>>>> MSI working.
>>>>> Either you use the RMR value, which is forced by the kernel into the
>>>>> physical MSI through iommufd and kernel ignores anything qemu
>>>>> does. So fully ignore the guest's vMSI address.
>>>> Well, we are sort of trying to do the same through this patch here. 
>>>> But to avoid a "translation" completely it will involve some changes to
>>>> Qemu pci subsystem. I think this is the least intrusive path I can think
>>>> of now. And this is a one time setup mostly.
>>> Should be explained in the commit message that the translation is
>>> pointless. I'm not sure about this, any translation seems risky
>>> because it could fail. The guest can use any IOVA for MSI and none may
>>> fail.
In general the translation is not pointless (I mean when RMRs are not
applied). When a vhost device (virtio-net for instance) is protected
by the SMMU, vhost triggers irqfds upon which a GSI is injected into the vgic.
The latter does the irq_routing mapping and this GSI is associated with an
MSI address/data. If the MSI address is wrong, i.e. not corresponding to
the vITS GPA doorbell, kernel kvm/vgic/vgic-its.c vgic_its_trigger_msi
will fail to inject the MSI into the guest since
vgic_msi_to_its/__vgic_doorbell_to_its will fail to find the ITS
instance to inject into.

Thanks

Eric
>> In the current design of KVM in QEMU, it does a generic translation
>> from gIOVA->gPA for the doorbell location to inject IRQ, whether VM
>> has an accelerated IOMMU or an emulated IOMMU.
> And what happens if the translation fails because there is no mapping?
> It should be ignored for this case and not ignored for others.
>
> Jason
>



^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional get_msi_address_space() callback
  2025-11-05 17:25               ` Eric Auger
@ 2025-11-05 18:10                 ` Jason Gunthorpe
  2025-11-05 18:33                   ` Nicolin Chen
  2025-11-05 18:33                   ` Shameer Kolothum
  0 siblings, 2 replies; 148+ messages in thread
From: Jason Gunthorpe @ 2025-11-05 18:10 UTC (permalink / raw)
  To: Eric Auger
  Cc: Shameer Kolothum, qemu-arm@nongnu.org, qemu-devel@nongnu.org,
	peter.maydell@linaro.org, Nicolin Chen, ddutile@redhat.com,
	berrange@redhat.com, Nathan Chen, Matt Ochs, smostafa@google.com,
	wangzhou1@hisilicon.com, jiangkunkun@huawei.com,
	jonathan.cameron@huawei.com, zhangfei.gao@linaro.org,
	zhenzhong.duan@intel.com, yi.l.liu@intel.com, Krishnakant Jaju

On Wed, Nov 05, 2025 at 06:25:05PM +0100, Eric Auger wrote:
> if the guest doorbell address is wrong because not properly translated,
> vgic_msi_to_its() will fail to identify the ITS to inject the MSI in.
> See kernel kvm/vgic/vgic-its.c vgic_msi_to_its and
> vgic_its_inject_msi

Which has been exactly my point to Nicolin. There is no way to
"properly translate" the vMSI address in a HW accelerated SMMU
emulation.

The vMSI address must only be used for some future non-RMR HW only
path.

To keep this flow working qemu must ignore the IOVA from the guest and
always replace it with its own idea of what the correct ITS address is
for KVM to work. It means we don't correctly emulate guest
misconfiguration of the MSI address.

Thus it should never be "translated" in this configuration, that's a
broken idea when working with the HW accelerated vSMMU.

Jason


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional get_msi_address_space() callback
  2025-11-05 18:10                 ` Jason Gunthorpe
@ 2025-11-05 18:33                   ` Nicolin Chen
  2025-11-05 18:58                     ` Jason Gunthorpe
  2025-11-05 18:33                   ` Shameer Kolothum
  1 sibling, 1 reply; 148+ messages in thread
From: Nicolin Chen @ 2025-11-05 18:33 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Eric Auger, Shameer Kolothum, qemu-arm@nongnu.org,
	qemu-devel@nongnu.org, peter.maydell@linaro.org,
	ddutile@redhat.com, berrange@redhat.com, Nathan Chen, Matt Ochs,
	smostafa@google.com, wangzhou1@hisilicon.com,
	jiangkunkun@huawei.com, jonathan.cameron@huawei.com,
	zhangfei.gao@linaro.org, zhenzhong.duan@intel.com,
	yi.l.liu@intel.com, Krishnakant Jaju

On Wed, Nov 05, 2025 at 02:10:49PM -0400, Jason Gunthorpe wrote:
> On Wed, Nov 05, 2025 at 06:25:05PM +0100, Eric Auger wrote:
> > if the guest doorbell address is wrong because not properly translated,
> > vgic_msi_to_its() will fail to identify the ITS to inject the MSI in.
> > See kernel kvm/vgic/vgic-its.c vgic_msi_to_its and
> > vgic_its_inject_msi
> 
> Which has been exactly my point to Nicolin. There is no way to
> "properly translate" the vMSI address in a HW accelerated SMMU
> emulation.

Hmm, I still can't connect the dots here. QEMU knows where the
guest CD table is to get the stage-1 translation table to walk
through. We could choose to not let it walk through. Yet, why?

Asking this to know what we should justify for the patch in a
different direction.

> The vMSI address must only be used for some future non-RMR HW only
> path.
> 
> To keep this flow working qemu must ignore the IOVA from the guest and
> always replace it with its own idea of what the correct ITS address is
> for KVM to work. It means we don't correctly emulate guest
> misconfiguration of the MSI address.

That is an alternative in my mind, to simplify things, especially
as we are having a discussion, on the other side, about
selecting a correct (QEMU) address space depending on whether the
vIOMMU needs a stage-1 translation or not. This MSI translate
thing indeed makes the whole narrative more complicated.

We could use a different PCI op to forward the vITS physical
address to the KVM layer, bypassing the translation pathway.

Thanks
Nicolin


^ permalink raw reply	[flat|nested] 148+ messages in thread

* RE: [PATCH v5 15/32] hw/pci/pci: Introduce optional get_msi_address_space() callback
  2025-11-05 18:10                 ` Jason Gunthorpe
  2025-11-05 18:33                   ` Nicolin Chen
@ 2025-11-05 18:33                   ` Shameer Kolothum
  1 sibling, 0 replies; 148+ messages in thread
From: Shameer Kolothum @ 2025-11-05 18:33 UTC (permalink / raw)
  To: Jason Gunthorpe, Eric Auger
  Cc: qemu-arm@nongnu.org, qemu-devel@nongnu.org,
	peter.maydell@linaro.org, Nicolin Chen, ddutile@redhat.com,
	berrange@redhat.com, Nathan Chen, Matt Ochs, smostafa@google.com,
	wangzhou1@hisilicon.com, jiangkunkun@huawei.com,
	jonathan.cameron@huawei.com, zhangfei.gao@linaro.org,
	zhenzhong.duan@intel.com, yi.l.liu@intel.com, Krishnakant Jaju



> -----Original Message-----
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: 05 November 2025 18:11
> To: Eric Auger <eric.auger@redhat.com>
> Cc: Shameer Kolothum <skolothumtho@nvidia.com>; qemu-
> arm@nongnu.org; qemu-devel@nongnu.org; peter.maydell@linaro.org;
> Nicolin Chen <nicolinc@nvidia.com>; ddutile@redhat.com;
> berrange@redhat.com; Nathan Chen <nathanc@nvidia.com>; Matt Ochs
> <mochs@nvidia.com>; smostafa@google.com; wangzhou1@hisilicon.com;
> jiangkunkun@huawei.com; jonathan.cameron@huawei.com;
> zhangfei.gao@linaro.org; zhenzhong.duan@intel.com; yi.l.liu@intel.com;
> Krishnakant Jaju <kjaju@nvidia.com>
> Subject: Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional
> get_msi_address_space() callback
> 
> On Wed, Nov 05, 2025 at 06:25:05PM +0100, Eric Auger wrote:
> > if the guest doorbell address is wrong because not properly translated,
> > vgic_msi_to_its() will fail to identify the ITS to inject the MSI in.
> > See kernel kvm/vgic/vgic-its.c vgic_msi_to_its and
> > vgic_its_inject_msi
> 
> Which has been exactly my point to Nicolin. There is no way to
> "properly translate" the vMSI address in a HW accelerated SMMU
> emulation.
> 
> The vMSI address must only be used for some future non-RMR HW only
> path.
> 
> To keep this flow working qemu must ignore the IOVA from the guest and
> always replace it with its own idea of what the correct ITS address is
> for KVM to work. It means we don't correctly emulate guest
> misconfiguration of the MSI address.
> 
> Thus it should never be "translated" in this configuration, that's a
> broken idea when working with the HW accelerated vSMMU.

Ah.. I get it now. You are not questioning the flow here but the
"translate" part. Agreed, it is not safe to use smmuv3_translate()
in the HW accelerated case. We somehow need to hook into this
path and provide a correct ITS address for KVM.

Hmm.... need to see how to do that in the least invasive way.

Thanks,
Shameer




^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional get_msi_address_space() callback
  2025-11-05 18:33                   ` Nicolin Chen
@ 2025-11-05 18:58                     ` Jason Gunthorpe
  2025-11-05 19:33                       ` Nicolin Chen
  2025-11-06  7:42                       ` Eric Auger
  0 siblings, 2 replies; 148+ messages in thread
From: Jason Gunthorpe @ 2025-11-05 18:58 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Eric Auger, Shameer Kolothum, qemu-arm@nongnu.org,
	qemu-devel@nongnu.org, peter.maydell@linaro.org,
	ddutile@redhat.com, berrange@redhat.com, Nathan Chen, Matt Ochs,
	smostafa@google.com, wangzhou1@hisilicon.com,
	jiangkunkun@huawei.com, jonathan.cameron@huawei.com,
	zhangfei.gao@linaro.org, zhenzhong.duan@intel.com,
	yi.l.liu@intel.com, Krishnakant Jaju

On Wed, Nov 05, 2025 at 10:33:08AM -0800, Nicolin Chen wrote:
> On Wed, Nov 05, 2025 at 02:10:49PM -0400, Jason Gunthorpe wrote:
> > On Wed, Nov 05, 2025 at 06:25:05PM +0100, Eric Auger wrote:
> > > if the guest doorbell address is wrong because not properly translated,
> > > vgic_msi_to_its() will fail to identify the ITS to inject the MSI in.
> > > See kernel kvm/vgic/vgic-its.c vgic_msi_to_its and
> > > vgic_its_inject_msi
> > 
> > Which has been exactly my point to Nicolin. There is no way to
> > "properly translate" the vMSI address in a HW accelerated SMMU
> > emulation.
> 
> Hmm, I still can't connect the dots here. QEMU knows where the
> guest CD table is to get the stage-1 translation table to walk
> through. We could choose to not let it walk through. Yet, why?

You cannot walk any tables in guest memory without fully trapping all
invalidation on all command queues. Like real HW qemu needs to fence
its walks with any concurrent invalidate & sync to ensure it doesn't
walk into a UAF situation.

Since we can't trap or mediate vCMDQ the walking simply cannot be
done.

Thus, the general principle of the HW accelerated vSMMU is that it
NEVER walks any of these guest tables for any reason.

Thus, we cannot do anything with vMSI address beyond program it
directly into a real PCI device so it undergoes real HW translation.

Jason


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional get_msi_address_space() callback
  2025-11-05  7:47               ` Eric Auger
@ 2025-11-05 19:30                 ` Nicolin Chen
  0 siblings, 0 replies; 148+ messages in thread
From: Nicolin Chen @ 2025-11-05 19:30 UTC (permalink / raw)
  To: Eric Auger
  Cc: Shameer Kolothum, qemu-arm@nongnu.org, qemu-devel@nongnu.org,
	peter.maydell@linaro.org, Jason Gunthorpe, ddutile@redhat.com,
	berrange@redhat.com, Nathan Chen, Matt Ochs, smostafa@google.com,
	wangzhou1@hisilicon.com, jiangkunkun@huawei.com,
	jonathan.cameron@huawei.com, zhangfei.gao@linaro.org,
	zhenzhong.duan@intel.com, yi.l.liu@intel.com, Krishnakant Jaju

Hi Eric,

On Wed, Nov 05, 2025 at 08:47:56AM +0100, Eric Auger wrote:
> > We aligned with Intel previously about this system address space.
> > You might know these very well, yet here are the breakdowns:
> >
> > 1. VFIO core has a container that manages an HWPT. By default, it
> >    allocates a stage-1 normal HWPT, unless vIOMMU requests for a

> You may clarify that this stage-1 normal HWPT is used to map GPA to HPA (so
> it eventually implements stage 2).

Functionally, that would work. But it is not as clean as creating
an S2 parent hwpt from the beginning, right?

> >    nesting parent HWPT for accelerated cases.
> > 2. VFIO core adds a listener for that HWPT and sets up a handler
> >    vfio_container_region_add() where it checks the memory region
> >    whether it is iommu or not.
> >    a. In case of !IOMMU as (i.e. system address space), it treats
> >       the address space as a RAM region, and handles all stage-2
> >       mappings for the core allocated nesting parent HWPT.
> >    b. In case of IOMMU as (i.e. a translation type) it sets up
> >       the IOTLB notifier and translation replay while bypassing
> >       the listener for RAM region.

> yes S1+S2 are combined through vfio_iommu_map_notify()

But that map/unmap notifier is useless in the accelerated mode:
we don't need that translation code from the emulated mode (MSI
is likely to bypass translation as well); and we don't need the
emulated IOTLB either since there is no page table walk-through.

Also, S1 and S2 are separated following the iommufd design. In this
regard, letting the core manage the S2 hwpt and mappings while the
vIOMMU handles the S1 hwpt allocation/attach/invalidation can
look much cleaner.
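
As an illustration of that split (conceptual stand-ins only, not the real
iommufd UAPI or QEMU code): the VFIO core owns the nesting-parent (S2) HWPT
and its guest-RAM mappings, while the vIOMMU only allocates/attaches the
nested S1 HWPT and forwards guest invalidations:

#include <stdint.h>
#include <stdio.h>

typedef uint32_t hwpt_id_t;

/* --- owned by the VFIO core (listener on guest RAM, system AS path) --- */
static hwpt_id_t alloc_nesting_parent_s2(void)
{
    printf("core: allocate nesting-parent (S2) HWPT, map guest RAM GPA->HPA\n");
    return 1;
}

/* --- owned by the vIOMMU (per device, driven by guest STE/CD updates) --- */
static hwpt_id_t alloc_nested_s1(hwpt_id_t s2_parent)
{
    printf("vIOMMU: allocate nested S1 HWPT on top of parent %u\n", s2_parent);
    return 2;
}

static void forward_guest_invalidation(hwpt_id_t s1)
{
    printf("vIOMMU: forward guest TLBI to S1 HWPT %u\n", s1);
}

int main(void)
{
    hwpt_id_t s2 = alloc_nesting_parent_s2();   /* 2.a path */
    hwpt_id_t s1 = alloc_nested_s1(s2);         /* accel vSMMU, guest S1 */

    forward_guest_invalidation(s1);
    return 0;
}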

> > In an accelerated case, we need stage-2 mappings to match with the
> > nesting parent HWPT. So, returning system address space or an alias
> > of that notifies the vfio core to take the 2.a path.
> >
> > If we take 2.b path by returning IOMMU as in smmu_find_add_as, the
> > VFIO core would no longer listen to the RAM region for us, i.e. no
> > stage-2 HWPT nor mappings. vIOMMU would have to allocate a nesting

> except if you change the VFIO common.c as I did in the past to force the S2
> mapping in the nested config.
> See
> https://lore.kernel.org/all/20210411120912.15770-16-eric.auger@redhat.com/
> and vfio_prereg_listener()

Yea, I remember that. But that's somewhat duplicated IMHO. The
VFIO core already registers a listener on guest RAM for system
address space. Having another set of vfio_prereg_listener does
not feel optimal.

> Again I do not say this is the right way to do it, but using the system address
> space is not the "only" implementation choice I think

Oh, neither do I mean that's the "only" way. Sorry I did not
make this clear.

I had studied your vfio_prereg_listener approach and studied
Intel's approach using the system address space, and settled on
this "cleaner" way that works for both architectures.

> and it needs to be
> properly justified, especially as it has at least 2 side effects:
> - somehow abusing the semantics of the returned address space and pretending
> there is no IOMMU translation in place and

Perhaps we shall say "there is no emulated translation" :)

> - also impacting the way MSIs are handled (introduction of a new
> PCIIOMMUOps).

That is a solid point. Yet I think it's less confusing now per
Jason's remarks -- we will bypass the translation pathway for
MSI in accelerated mode.

> This kind of explanation you wrote is absolutely needed in the commit
> msg for reviewers to understand the design choice I think.

Sure. My bad that I didn't explain it well in the first place.

Thanks
Nicolin


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional get_msi_address_space() callback
  2025-11-05 18:58                     ` Jason Gunthorpe
@ 2025-11-05 19:33                       ` Nicolin Chen
  2025-11-06  7:42                       ` Eric Auger
  1 sibling, 0 replies; 148+ messages in thread
From: Nicolin Chen @ 2025-11-05 19:33 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Eric Auger, Shameer Kolothum, qemu-arm@nongnu.org,
	qemu-devel@nongnu.org, peter.maydell@linaro.org,
	ddutile@redhat.com, berrange@redhat.com, Nathan Chen, Matt Ochs,
	smostafa@google.com, wangzhou1@hisilicon.com,
	jiangkunkun@huawei.com, jonathan.cameron@huawei.com,
	zhangfei.gao@linaro.org, zhenzhong.duan@intel.com,
	yi.l.liu@intel.com, Krishnakant Jaju

On Wed, Nov 05, 2025 at 02:58:16PM -0400, Jason Gunthorpe wrote:
> On Wed, Nov 05, 2025 at 10:33:08AM -0800, Nicolin Chen wrote:
> > On Wed, Nov 05, 2025 at 02:10:49PM -0400, Jason Gunthorpe wrote:
> > > On Wed, Nov 05, 2025 at 06:25:05PM +0100, Eric Auger wrote:
> > > > if the guest doorbell address is wrong because not properly translated,
> > > > vgic_msi_to_its() will fail to identify the ITS to inject the MSI in.
> > > > See kernel kvm/vgic/vgic-its.c vgic_msi_to_its and
> > > > vgic_its_inject_msi
> > > 
> > > Which has been exactly my point to Nicolin. There is no way to
> > > "properly translate" the vMSI address in a HW accelerated SMMU
> > > emulation.
> > 
> > Hmm, I still can't connect the dots here. QEMU knows where the
> > guest CD table is to get the stage-1 translation table to walk
> > through. We could choose to not let it walk through. Yet, why?
> 
> You cannot walk any tables in guest memory without fully trapping all
> invalidation on all command queues. Like real HW qemu needs to fence
> its walks with any concurrent invalidate & sync to ensure it doesn't
> walk into a UAF situation.
> 
> Since we can't trap or mediate vCMDQ the walking simply cannot be
> done.
> 
> Thus, the general principle of the HW accelerated vSMMU is that it
> NEVER walks any of these guest tables for any reason.
>
> Thus, we cannot do anything with vMSI address beyond program it
> directly into a real PCI device so it undergoes real HW translation.

It's clear to me now. Thanks for the elaboration!

Nicolin


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional get_msi_address_space() callback
  2025-11-05 18:58                     ` Jason Gunthorpe
  2025-11-05 19:33                       ` Nicolin Chen
@ 2025-11-06  7:42                       ` Eric Auger
  2025-11-06 11:48                         ` Shameer Kolothum
  2025-11-06 14:32                         ` Jason Gunthorpe
  1 sibling, 2 replies; 148+ messages in thread
From: Eric Auger @ 2025-11-06  7:42 UTC (permalink / raw)
  To: Jason Gunthorpe, Nicolin Chen
  Cc: Shameer Kolothum, qemu-arm@nongnu.org, qemu-devel@nongnu.org,
	peter.maydell@linaro.org, ddutile@redhat.com, berrange@redhat.com,
	Nathan Chen, Matt Ochs, smostafa@google.com,
	wangzhou1@hisilicon.com, jiangkunkun@huawei.com,
	jonathan.cameron@huawei.com, zhangfei.gao@linaro.org,
	zhenzhong.duan@intel.com, yi.l.liu@intel.com, Krishnakant Jaju



On 11/5/25 7:58 PM, Jason Gunthorpe wrote:
> On Wed, Nov 05, 2025 at 10:33:08AM -0800, Nicolin Chen wrote:
>> On Wed, Nov 05, 2025 at 02:10:49PM -0400, Jason Gunthorpe wrote:
>>> On Wed, Nov 05, 2025 at 06:25:05PM +0100, Eric Auger wrote:
>>>> if the guest doorbell address is wrong because not properly translated,
>>>> vgic_msi_to_its() will fail to identify the ITS to inject the MSI in.
>>>> See kernel kvm/vgic/vgic-its.c vgic_msi_to_its and
>>>> vgic_its_inject_msi
>>> Which has been exactly my point to Nicolin. There is no way to
>>> "properly translate" the vMSI address in a HW accelerated SMMU
>>> emulation.
>> Hmm, I still can't connect the dots here. QEMU knows where the
>> guest CD table is to get the stage-1 translation table to walk
>> through. We could choose to not let it walk through. Yet, why?
> You cannot walk any tables in guest memory without fully trapping all
> invalidation on all command queues. Like real HW qemu needs to fence
> its walks with any concurrent invalidate & sync to ensure it doesn't
> walk into a UAF situation.
But at the moment we do trap IOTLB invalidates, so logically we can still
do the translation in that config. The problem you describe will show up
with vCMDQ, which is not part of this series.
>
> Since we can't trap or mediate vCMDQ the walking simply cannot be
> done.
>
> Thus, the general principle of the HW accelerated vSMMU is that it
> NEVER walks any of these guest tables for any reason.
>
> Thus, we cannot do anything with vMSI address beyond program it
> directly into a real PCI device so it undergoes real HW translation.
But anyway you need to provide KVM with valid info about the guest doorbell
so that the latter can set up irqfd GSI routing and also program the ITS
translation tables. At the moment we have a single vITS in qemu so maybe
we can cheat.

Eric
>
> Jason
>



^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 28/32] hw/arm/smmuv3-accel: Add property to specify OAS bits
  2025-11-04 14:50     ` Jason Gunthorpe
@ 2025-11-06  7:54       ` Eric Auger
  0 siblings, 0 replies; 148+ messages in thread
From: Eric Auger @ 2025-11-06  7:54 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Shameer Kolothum, qemu-arm, qemu-devel, peter.maydell, nicolinc,
	ddutile, berrange, nathanc, mochs, smostafa, wangzhou1,
	jiangkunkun, jonathan.cameron, zhangfei.gao, zhenzhong.duan,
	yi.l.liu, kjaju

Hi Jason,

On 11/4/25 3:50 PM, Jason Gunthorpe wrote:
> On Tue, Nov 04, 2025 at 03:35:42PM +0100, Eric Auger wrote:
>>> +    /*
>>> +     * TODO: OAS is not something Linux kernel doc says meaningful for user.
>>> +     * But looks like OAS needs to be compatible for accelerator support. Please
>>> +     * check.
>> would remove that comment. Either it is requested or not.
>>> +     */
>>> +    if (FIELD_EX32(info->idr[5], IDR5, OAS) <
>>> +                FIELD_EX32(s->idr[5], IDR5, OAS)) {
>>> +        error_setg(errp, "Host SMMUv3 OAS(%d) bits not compatible",
>>> +                   smmuv3_oas_bits(FIELD_EX32(info->idr[5], IDR5, OAS)));
>> let's be more explicit then and say
>>
>> Host SMMUv3 OAS (%d bits) is less than the OAS bits advertised by the SMMU (%d)
> It isn't OAS that is being checked here, this is now IPA. OAS is for
> use by the hypervisor.
>
> When the guest looks at the vSMMU the "OAS" it sees is the IPS
> supported by the HW.
>
> Aside from the raw HW limit, it also shouldn't exceed the configured
> size of the S2 HWPT.
>
> So the above should refer to this detail because it is a bit subtle
> that OAS and IPS are often the same. See "3.4 Address sizes"
>
> * IAS reflects the maximum usable IPA of an implementation that is
>   generated by stage 1 and input to stage 2:
>
> - This term is defined to illustrate the handling of intermediate
>   addresses in this section and is not a configurable parameter.
>
> - The maximum usable IPA size of an SMMU is defined in terms of other SMMU implementation choices,
>   as:
>     IAS = MAX(SMMU_IDR0.TTF[0]==1 ? 40 : 0, SMMU_IDR0.TTF[1]==1 ? OAS : 0)
>
> - An IPA of 40 bits is required to support AArch32 LPAE translations, and AArch64 limits the
> maximum IPA size to the maximum PA size. Otherwise, when AArch32 LPAE is not implemented, the
> IPA size equals OAS, the PA size, and might be smaller than 40 bits.
>
> - The purpose of definition of the IAS term is to abstract away from these implementation variables.
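
A tiny self-contained illustration of the IAS definition quoted above (the
values passed in main() are assumed examples, not taken from real hardware):

#include <stdio.h>

/* IAS = MAX(TTF[0] ? 40 : 0, TTF[1] ? OAS : 0) per SMMUv3 "3.4 Address sizes" */
static unsigned ias(unsigned ttf0, unsigned ttf1, unsigned oas)
{
    unsigned a = ttf0 ? 40 : 0;   /* AArch32 LPAE requires a 40-bit IPA */
    unsigned b = ttf1 ? oas : 0;  /* AArch64: max IPA size equals OAS */
    return a > b ? a : b;
}

int main(void)
{
    printf("IAS(ttf0=0, ttf1=1, oas=44) = %u\n", ias(0, 1, 44));  /* 44 */
    printf("IAS(ttf0=1, ttf1=0, oas=44) = %u\n", ias(1, 0, 44));  /* 40 */
    return 0;
}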
Thank you for the clarification and pointer. I fully agree.
Maybe we can rephrase the error msg as:

"Host SMMUv3 OAS (%d bits) is less than the physical SMMU maximum usable IPA (%d)"

which is more accurate, even though in practice here we assimilate max IPA to
OAS (the TTF[1]==1 case).

Eric

>
> Jason
>



^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 30/32] Extend get_cap() callback to support PASID
  2025-10-31 10:50 ` [PATCH v5 30/32] Extend get_cap() callback to support PASID Shameer Kolothum
  2025-11-03 14:58   ` Jonathan Cameron via
@ 2025-11-06  8:45   ` Eric Auger
  1 sibling, 0 replies; 148+ messages in thread
From: Eric Auger @ 2025-11-06  8:45 UTC (permalink / raw)
  To: Shameer Kolothum, qemu-arm, qemu-devel
  Cc: peter.maydell, jgg, nicolinc, ddutile, berrange, nathanc, mochs,
	smostafa, wangzhou1, jiangkunkun, jonathan.cameron, zhangfei.gao,
	zhenzhong.duan, yi.l.liu, kjaju



On 10/31/25 11:50 AM, Shameer Kolothum wrote:
> Modify get_cap() callback so that it can return cap via an output
> uint64_t param. And add support for generic iommu hw capability
> info and max_pasid_log2(pasid width).
>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
> ---
>  backends/iommufd.c                 | 18 +++++++++++++++---
>  hw/i386/intel_iommu.c              |  5 +++--
>  hw/vfio/container-legacy.c         |  8 ++++++--
>  include/system/host_iommu_device.h | 14 ++++++++++----
>  4 files changed, 34 insertions(+), 11 deletions(-)
>
> diff --git a/backends/iommufd.c b/backends/iommufd.c
> index 6381f9664b..392f9cf2a8 100644
> --- a/backends/iommufd.c
> +++ b/backends/iommufd.c
> @@ -523,19 +523,31 @@ bool host_iommu_device_iommufd_detach_hwpt(HostIOMMUDeviceIOMMUFD *idev,
>      return idevc->detach_hwpt(idev, errp);
>  }
>  
> -static int hiod_iommufd_get_cap(HostIOMMUDevice *hiod, int cap, Error **errp)
> +static int hiod_iommufd_get_cap(HostIOMMUDevice *hiod, int cap,
s/cap/capid? to avoid confusion with actual cap value?
> +                                uint64_t *out_cap, Error **errp)
>  {
>      HostIOMMUDeviceCaps *caps = &hiod->caps;
>  
> +    g_assert(out_cap);
> +
>      switch (cap) {
>      case HOST_IOMMU_DEVICE_CAP_IOMMU_TYPE:
> -        return caps->type;
> +        *out_cap = caps->type;
> +        break;
>      case HOST_IOMMU_DEVICE_CAP_AW_BITS:
> -        return vfio_device_get_aw_bits(hiod->agent);
> +        *out_cap = vfio_device_get_aw_bits(hiod->agent);
> +        break;
> +    case HOST_IOMMU_DEVICE_CAP_GENERIC_HW:
> +        *out_cap = caps->hw_caps;
> +        break;
> +    case HOST_IOMMU_DEVICE_CAP_MAX_PASID_LOG2:
> +        *out_cap = caps->max_pasid_log2;
> +        break;
>      default:
>          error_setg(errp, "%s: unsupported capability %x", hiod->name, cap);
>          return -EINVAL;
>      }
> +    return 0;
>  }
>  
>  static void hiod_iommufd_class_init(ObjectClass *oc, const void *data)
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 6a168d5107..91d0d643ea 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -4444,6 +4444,7 @@ static bool vtd_check_hiod(IntelIOMMUState *s, HostIOMMUDevice *hiod,
>                             Error **errp)
>  {
>      HostIOMMUDeviceClass *hiodc = HOST_IOMMU_DEVICE_GET_CLASS(hiod);
> +    uint64_t out_cap;
>      int ret;
>  
>      if (!hiodc->get_cap) {
> @@ -4452,11 +4453,11 @@ static bool vtd_check_hiod(IntelIOMMUState *s, HostIOMMUDevice *hiod,
>      }
>  
>      /* Common checks */
> -    ret = hiodc->get_cap(hiod, HOST_IOMMU_DEVICE_CAP_AW_BITS, errp);
> +    ret = hiodc->get_cap(hiod, HOST_IOMMU_DEVICE_CAP_AW_BITS, &out_cap, errp);
>      if (ret < 0) {
>          return false;
>      }
> -    if (s->aw_bits > ret) {
> +    if (s->aw_bits > out_cap) {
>          error_setg(errp, "aw-bits %d > host aw-bits %d", s->aw_bits, ret);
you also need to replace ret with out_cap here and also use 0x%PRIx64
>          return false;
>      }
> diff --git a/hw/vfio/container-legacy.c b/hw/vfio/container-legacy.c
> index a3615d7b5d..ac8370bd4b 100644
> --- a/hw/vfio/container-legacy.c
> +++ b/hw/vfio/container-legacy.c
> @@ -1197,15 +1197,19 @@ static bool hiod_legacy_vfio_realize(HostIOMMUDevice *hiod, void *opaque,
>  }
>  
>  static int hiod_legacy_vfio_get_cap(HostIOMMUDevice *hiod, int cap,
> -                                    Error **errp)
> +                                    uint64_t *out_cap, Error **errp)
>  {
> +    g_assert(out_cap);
> +
>      switch (cap) {
>      case HOST_IOMMU_DEVICE_CAP_AW_BITS:
> -        return vfio_device_get_aw_bits(hiod->agent);
> +        *out_cap = vfio_device_get_aw_bits(hiod->agent);
> +        break;
>      default:
>          error_setg(errp, "%s: unsupported capability %x", hiod->name, cap);
>          return -EINVAL;
>      }
> +    return 0;
>  }
>  
>  static GList *
> diff --git a/include/system/host_iommu_device.h b/include/system/host_iommu_device.h
> index bfb2b60478..f89dbafd9e 100644
> --- a/include/system/host_iommu_device.h
> +++ b/include/system/host_iommu_device.h
> @@ -94,13 +94,15 @@ struct HostIOMMUDeviceClass {
>       *
>       * @cap: capability to check.
>       *
> +     * @out_cap: 0 if a @cap is unsupported or else 1 or some positive
> +     * value for some special @cap, i.e., HOST_IOMMU_DEVICE_CAP_AW_BITS.
this does not match what is done in the impl. In case @cap is not
supported you do not zero it.
> +     *
>       * @errp: pass an Error out when fails to query capability.
>       *
> -     * Returns: <0 on failure, 0 if a @cap is unsupported, or else
> -     * 1 or some positive value for some special @cap,
I would rather say the capability value
> -     * i.e., HOST_IOMMU_DEVICE_CAP_AW_BITS.
> +     * Returns: <0 on failure, 0 on success.
<0 if cap is not supported
>       */
> -    int (*get_cap)(HostIOMMUDevice *hiod, int cap, Error **errp);
> +    int (*get_cap)(HostIOMMUDevice *hiod, int cap, uint64_t *out_cap,
> +                   Error **errp);
>      /**
>       * @get_iova_ranges: Return the list of usable iova_ranges along with
>       * @hiod Host IOMMU device
> @@ -123,6 +125,10 @@ struct HostIOMMUDeviceClass {
>   */
>  #define HOST_IOMMU_DEVICE_CAP_IOMMU_TYPE        0
>  #define HOST_IOMMU_DEVICE_CAP_AW_BITS           1
> +/* Generic IOMMU HW capability info */
> +#define HOST_IOMMU_DEVICE_CAP_GENERIC_HW        2
> +/* PASID width */
> +#define HOST_IOMMU_DEVICE_CAP_MAX_PASID_LOG2    3
>  
>  #define HOST_IOMMU_DEVICE_CAP_AW_BITS_MAX       64
>  #endif
Eric



^ permalink raw reply	[flat|nested] 148+ messages in thread

* RE: [PATCH v5 15/32] hw/pci/pci: Introduce optional get_msi_address_space() callback
  2025-11-06  7:42                       ` Eric Auger
@ 2025-11-06 11:48                         ` Shameer Kolothum
  2025-11-06 17:04                           ` Eric Auger
  2025-11-06 14:32                         ` Jason Gunthorpe
  1 sibling, 1 reply; 148+ messages in thread
From: Shameer Kolothum @ 2025-11-06 11:48 UTC (permalink / raw)
  To: eric.auger@redhat.com, Jason Gunthorpe, Nicolin Chen
  Cc: qemu-arm@nongnu.org, qemu-devel@nongnu.org,
	peter.maydell@linaro.org, ddutile@redhat.com, berrange@redhat.com,
	Nathan Chen, Matt Ochs, smostafa@google.com,
	wangzhou1@hisilicon.com, jiangkunkun@huawei.com,
	jonathan.cameron@huawei.com, zhangfei.gao@linaro.org,
	zhenzhong.duan@intel.com, yi.l.liu@intel.com, Krishnakant Jaju

> -----Original Message-----
> From: Eric Auger <eric.auger@redhat.com>
> Sent: 06 November 2025 07:43
> To: Jason Gunthorpe <jgg@nvidia.com>; Nicolin Chen <nicolinc@nvidia.com>
> Cc: Shameer Kolothum <skolothumtho@nvidia.com>; qemu-
> arm@nongnu.org; qemu-devel@nongnu.org; peter.maydell@linaro.org;
> ddutile@redhat.com; berrange@redhat.com; Nathan Chen
> <nathanc@nvidia.com>; Matt Ochs <mochs@nvidia.com>;
> smostafa@google.com; wangzhou1@hisilicon.com;
> jiangkunkun@huawei.com; jonathan.cameron@huawei.com;
> zhangfei.gao@linaro.org; zhenzhong.duan@intel.com; yi.l.liu@intel.com;
> Krishnakant Jaju <kjaju@nvidia.com>
> Subject: Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional
> get_msi_address_space() callback
> 
> On 11/5/25 7:58 PM, Jason Gunthorpe wrote:
> > On Wed, Nov 05, 2025 at 10:33:08AM -0800, Nicolin Chen wrote:
> >> On Wed, Nov 05, 2025 at 02:10:49PM -0400, Jason Gunthorpe wrote:
> >>> On Wed, Nov 05, 2025 at 06:25:05PM +0100, Eric Auger wrote:
> >>>> if the guest doorbell address is wrong because not properly translated,
> >>>> vgic_msi_to_its() will fail to identify the ITS to inject the MSI in.
> >>>> See kernel kvm/vgic/vgic-its.c vgic_msi_to_its and
> >>>> vgic_its_inject_msi
> >>> Which has been exactly my point to Nicolin. There is no way to
> >>> "properly translate" the vMSI address in a HW accelerated SMMU
> >>> emulation.
> >> Hmm, I still can't connect the dots here. QEMU knows where the
> >> guest CD table is to get the stage-1 translation table to walk
> >> through. We could choose to not let it walk through. Yet, why?
> > You cannot walk any tables in guest memory without fully trapping all
> > invalidation on all command queues. Like real HW qemu needs to fence
> > its walks with any concurrent invalidate & sync to ensure it doesn't
> > walk into a UAF situation.
> But at the moment we do trap IOTLB invalidates so logically we can still
> do the translate in that config. The problem you describe will show up
> with vCMDQ which is not part of this series.
> >
> > Since we can't trap or mediate vCMDQ the walking simply cannot be
> > done.
> >
> > Thus, the general principle of the HW accelerated vSMMU is that it
> > NEVER walks any of these guest tables for any reason.
> >
> > Thus, we cannot do anything with vMSI address beyond program it
> > directly into a real PCI device so it undergoes real HW translation.
> But anyway you need to provide KVM a valid info about the guest doorbell
> for this latter to setup irqfd gsi routing and also program ITS
> translation tables. At the moment we have a single vITS in qemu so maybe
> we can cheat.

I have tried to address the “translate” issue below. This introduces a new
get_msi_address() callback to retrieve the MSI doorbell address directly
from the vIOMMU, so we can drop the existing get_msi_address_space() logic.
Please take a look and let me know your thoughts.

Thanks,
Shameer

---
 hw/arm/smmuv3-accel.c   | 10 ++++++++++
 hw/arm/smmuv3.c         |  1 +
 hw/arm/virt.c           |  4 ++++
 hw/pci/pci.c            | 17 +++++++++++++++++
 include/hw/arm/smmuv3.h |  1 +
 include/hw/pci/pci.h    | 15 +++++++++++++++
 target/arm/kvm.c        | 14 ++++++++++++--
 7 files changed, 60 insertions(+), 2 deletions(-)

diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
index e6c81c4786..8b2a45a915 100644
--- a/hw/arm/smmuv3-accel.c
+++ b/hw/arm/smmuv3-accel.c
@@ -667,6 +667,15 @@ static void smmuv3_accel_unset_iommu_device(PCIBus *bus, void *opaque,
     }
 }

+static uint64_t smmuv3_accel_get_msi_address(PCIBus *bus, void *opaque,
+                                             int devfn)
+{
+    SMMUState *bs = opaque;
+    SMMUv3State *s = ARM_SMMUV3(bs);
+
+    g_assert(s->msi_doorbell);
+    return s->msi_doorbell;
+}
 static AddressSpace *smmuv3_accel_get_msi_as(PCIBus *bus, void *opaque,
                                              int devfn)
 {
@@ -788,6 +797,7 @@ static const PCIIOMMUOps smmuv3_accel_ops = {
     .set_iommu_device = smmuv3_accel_set_iommu_device,
     .unset_iommu_device = smmuv3_accel_unset_iommu_device,
     .get_msi_address_space = smmuv3_accel_get_msi_as,
+    .get_msi_address = smmuv3_accel_get_msi_address,
 };

 void smmuv3_accel_idr_override(SMMUv3State *s)
diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index 43d297698b..3f2ee8bcce 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -2120,6 +2120,7 @@ static const Property smmuv3_properties[] = {
     DEFINE_PROP_BOOL("ats", SMMUv3State, ats, false),
     DEFINE_PROP_UINT8("oas", SMMUv3State, oas, 44),
     DEFINE_PROP_BOOL("pasid", SMMUv3State, pasid, false),
+    DEFINE_PROP_UINT64("msi-doorbell", SMMUv3State, msi_doorbell, 0),
 };

 static void smmuv3_instance_init(Object *obj)
diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 2498e3beff..d2dcb89235 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -3097,6 +3097,8 @@ static void virt_machine_device_plug_cb(HotplugHandler *hotplug_dev,

             create_smmuv3_dev_dtb(vms, dev, bus, errp);
             if (object_property_get_bool(OBJECT(dev), "accel", &error_abort)) {
+                hwaddr db_start = base_memmap[VIRT_GIC_ITS].base +
+                                  ITS_TRANS_SIZE + GITS_TRANSLATER;
                 char *stage;
                 stage = object_property_get_str(OBJECT(dev), "stage",
                                                 &error_fatal);
@@ -3107,6 +3109,8 @@ static void virt_machine_device_plug_cb(HotplugHandler *hotplug_dev,
                     return;
                 }
                 vms->pci_preserve_config = true;
+                object_property_set_uint(OBJECT(dev), "msi-doorbell", db_start,
+                                         &error_abort);
             }
         }
     }
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 1edd711247..45e79a3c23 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -2982,6 +2982,23 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
     return &address_space_memory;
 }

+bool pci_device_iommu_msi_direct_address(PCIDevice *dev, hwaddr *out_doorbell)
+{
+    PCIBus *bus;
+    PCIBus *iommu_bus;
+    int devfn;
+
+    pci_device_get_iommu_bus_devfn(dev, &iommu_bus, &bus, &devfn);
+    if (iommu_bus) {
+        if (iommu_bus->iommu_ops->get_msi_address) {
+            *out_doorbell = iommu_bus->iommu_ops->get_msi_address(bus,
+                                 iommu_bus->iommu_opaque, devfn);
+            return true;
+        }
+    }
+    return false;
+}
+
 AddressSpace *pci_device_iommu_msi_address_space(PCIDevice *dev)
 {
     PCIBus *bus;
diff --git a/include/hw/arm/smmuv3.h b/include/hw/arm/smmuv3.h
index ee0b5ed74f..f50d8c72bd 100644
--- a/include/hw/arm/smmuv3.h
+++ b/include/hw/arm/smmuv3.h
@@ -72,6 +72,7 @@ struct SMMUv3State {
     bool ats;
     uint8_t oas;
     bool pasid;
+    uint64_t msi_doorbell;
 };

 typedef enum {
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index b731443c67..e1709b0bfe 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -679,6 +679,20 @@ typedef struct PCIIOMMUOps {
      */
     AddressSpace * (*get_msi_address_space)(PCIBus *bus, void *opaque,
                                             int devfn);
+    /**
+     * @get_msi_address: get the address of MSI doorbell for the device
+     * on a PCI bus.
+     *
+     * Optional callback, if implemented must return a valid MSI doorbell
+     * address.
+     *
+     * @bus: the #PCIBus being accessed.
+     *
+     * @opaque: the data passed to pci_setup_iommu().
+     *
+     * @devfn: device and function number
+     */
+    uint64_t (*get_msi_address)(PCIBus *bus, void *opaque, int devfn);
 } PCIIOMMUOps;

 bool pci_device_get_iommu_bus_devfn(PCIDevice *dev, PCIBus **piommu_bus,
@@ -688,6 +702,7 @@ bool pci_device_set_iommu_device(PCIDevice *dev, HostIOMMUDevice *hiod,
                                  Error **errp);
 void pci_device_unset_iommu_device(PCIDevice *dev);
 AddressSpace *pci_device_iommu_msi_address_space(PCIDevice *dev);
+bool pci_device_iommu_msi_direct_address(PCIDevice *dev, hwaddr *out_doorbell);

 /**
  * pci_device_get_viommu_flags: get vIOMMU flags.
diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index 0df41128d0..8d4d2be0bc 100644
--- a/target/arm/kvm.c
+++ b/target/arm/kvm.c
@@ -1611,35 +1611,45 @@ int kvm_arm_set_irq(int cpu, int irqtype, int irq, int level)
 int kvm_arch_fixup_msi_route(struct kvm_irq_routing_entry *route,
                              uint64_t address, uint32_t data, PCIDevice *dev)
 {
-    AddressSpace *as = pci_device_iommu_msi_address_space(dev);
+    AddressSpace *as;
     hwaddr xlat, len, doorbell_gpa;
     MemoryRegionSection mrs;
     MemoryRegion *mr;

+    /* Check if there is a direct msi address available */
+    if (pci_device_iommu_msi_direct_address(dev, &doorbell_gpa)) {
+        goto set_doorbell;
+    }
+
+    as = pci_device_iommu_msi_address_space(dev);
     if (as == &address_space_memory) {
         return 0;
     }

     /* MSI doorbell address is translated by an IOMMU */

-    RCU_READ_LOCK_GUARD();
+    rcu_read_lock();

     mr = address_space_translate(as, address, &xlat, &len, true,
                                  MEMTXATTRS_UNSPECIFIED);

     if (!mr) {
+        rcu_read_unlock();
         return 1;
     }

     mrs = memory_region_find(mr, xlat, 1);

     if (!mrs.mr) {
+        rcu_read_unlock();
         return 1;
     }

     doorbell_gpa = mrs.offset_within_address_space;
     memory_region_unref(mrs.mr);
+    rcu_read_unlock();

+set_doorbell:
     route->u.msi.address_lo = doorbell_gpa;
     route->u.msi.address_hi = doorbell_gpa >> 32;
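
For illustration only (not part of the diff): with the usual virt memmap
values, which I am assuming here (VIRT_GIC_ITS at 0x08080000, 64KiB ITS
frames, GITS_TRANSLATER at offset 0x40 of the translation frame), the
doorbell computed in virt_machine_device_plug_cb() above works out to:

    hwaddr db_start = 0x08080000   /* base_memmap[VIRT_GIC_ITS].base */
                    + 0x00010000   /* ITS_TRANS_SIZE */
                    + 0x00000040;  /* GITS_TRANSLATER */
    /* => 0x08090040, the GPA of the vITS GITS_TRANSLATER register */

That is the GPA vgic_msi_to_its() in the kernel needs to identify the
ITS, so kvm_arch_fixup_msi_route() can program it directly without any
SMMU translation of @address.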

--







^ permalink raw reply related	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 31/32] vfio: Synthesize vPASID capability to VM
  2025-10-31 10:50 ` [PATCH v5 31/32] vfio: Synthesize vPASID capability to VM Shameer Kolothum
  2025-11-03 15:00   ` Jonathan Cameron via
@ 2025-11-06 13:55   ` Eric Auger
  2025-11-06 14:27     ` Shameer Kolothum
  1 sibling, 1 reply; 148+ messages in thread
From: Eric Auger @ 2025-11-06 13:55 UTC (permalink / raw)
  To: Shameer Kolothum, qemu-arm, qemu-devel
  Cc: peter.maydell, jgg, nicolinc, ddutile, berrange, nathanc, mochs,
	smostafa, wangzhou1, jiangkunkun, jonathan.cameron, zhangfei.gao,
	zhenzhong.duan, yi.l.liu, kjaju


Hi Shameer,
On 10/31/25 11:50 AM, Shameer Kolothum wrote:
> From: Yi Liu <yi.l.liu@intel.com>
>
> If user wants to expose PASID capability in vIOMMU, then VFIO would also
need to report?
> report the PASID cap for this device if the underlying hardware supports
> it as well.
>
> As a start, this chooses to put the vPASID cap in the last 8 bytes of the
> vconfig space. This is a choice in the good hope of no conflict with any
> existing cap or hidden registers. For the devices that has hidden registers,
> user should figure out a proper offset for the vPASID cap. This may require
> an option for user to config it. Here we leave it as a future extension.
> There are more discussions on the mechanism of finding the proper offset.
>
> https://lore.kernel.org/kvm/BN9PR11MB5276318969A212AD0649C7BE8CBE2@BN9PR11MB5276.namprd11.prod.outlook.com/
>
> Since we add a check to ensure the vIOMMU supports PASID, only devices
> under those vIOMMUs can synthesize the vPASID capability. This gives
> users control over which devices expose vPASID.
>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
> ---
>  hw/vfio/pci.c      | 37 +++++++++++++++++++++++++++++++++++++
>  include/hw/iommu.h |  1 +
>  2 files changed, 38 insertions(+)
>
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 06b06afc2b..2054eac897 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -24,6 +24,7 @@
>  #include <sys/ioctl.h>
>  
>  #include "hw/hw.h"
> +#include "hw/iommu.h"
>  #include "hw/pci/msi.h"
>  #include "hw/pci/msix.h"
>  #include "hw/pci/pci_bridge.h"
> @@ -2500,7 +2501,12 @@ static int vfio_setup_rebar_ecap(VFIOPCIDevice *vdev, uint16_t pos)
>  
>  static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
>  {
> +    HostIOMMUDevice *hiod = vdev->vbasedev.hiod;
> +    HostIOMMUDeviceClass *hiodc = HOST_IOMMU_DEVICE_GET_CLASS(hiod);
>      PCIDevice *pdev = PCI_DEVICE(vdev);
> +    uint64_t max_pasid_log2 = 0;
> +    bool pasid_cap_added = false;
> +    uint64_t hw_caps;
>      uint32_t header;
>      uint16_t cap_id, next, size;
>      uint8_t cap_ver;
> @@ -2578,12 +2584,43 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
>                  pcie_add_capability(pdev, cap_id, cap_ver, next, size);
>              }
>              break;
> +        case PCI_EXT_CAP_ID_PASID:
> +             pasid_cap_added = true;
> +             /* fallthrough */
>          default:
>              pcie_add_capability(pdev, cap_id, cap_ver, next, size);
>          }
>  
>      }
>  
> +#ifdef CONFIG_IOMMUFD
> +    /*
> +     * Although we check for PCI_EXT_CAP_ID_PASID above, the Linux VFIO
> +     * framework currently hides this capability. Try to retrieve it
> +     * through alternative kernel interfaces (e.g. IOMMUFD APIs).
I don't catch this sentence. When are you supposed to read the above
PCI_EXT_CAP_ID_PASID cap id then?
> +     */
> +    if (!pasid_cap_added && hiodc->get_cap) {
> +        hiodc->get_cap(hiod, HOST_IOMMU_DEVICE_CAP_GENERIC_HW, &hw_caps, NULL);
> +        hiodc->get_cap(hiod, HOST_IOMMU_DEVICE_CAP_MAX_PASID_LOG2,
> +                       &max_pasid_log2, NULL);
> +    }
> +
> +    /*
> +     * If supported, adds the PASID capability in the end of the PCIe config
> +     * space. TODO: Add option for enabling pasid at a safe offset.
> +     */
> +    if (max_pasid_log2 && (pci_device_get_viommu_flags(pdev) &
> +                           VIOMMU_FLAG_PASID_SUPPORTED)) {
> +        bool exec_perm = (hw_caps & IOMMU_HW_CAP_PCI_PASID_EXEC) ? true : false;
can't you direct set exec_perm = (hw_caps & IOMMU_HW_CAP_PCI_PASID_EXEC);
> +        bool priv_mod = (hw_caps & IOMMU_HW_CAP_PCI_PASID_PRIV) ? true : false;
> +
> +        pcie_pasid_init(pdev, PCIE_CONFIG_SPACE_SIZE - PCI_EXT_CAP_PASID_SIZEOF,
> +                        max_pasid_log2, exec_perm, priv_mod);
> +        /* PASID capability is fully emulated by QEMU */
> +        memset(vdev->emulated_config_bits + pdev->exp.pasid_cap, 0xff, 8);
> +    }
> +#endif
> +
>      /* Cleanup chain head ID if necessary */
>      if (pci_get_word(pdev->config + PCI_CONFIG_SPACE_SIZE) == 0xFFFF) {
>          pci_set_word(pdev->config + PCI_CONFIG_SPACE_SIZE, 0);
> diff --git a/include/hw/iommu.h b/include/hw/iommu.h
> index 9b8bb94fc2..9635770bee 100644
> --- a/include/hw/iommu.h
> +++ b/include/hw/iommu.h
> @@ -20,6 +20,7 @@
>  enum viommu_flags {
>      /* vIOMMU needs nesting parent HWPT to create nested HWPT */
>      VIOMMU_FLAG_WANT_NESTING_PARENT = BIT_ULL(0),
> +    VIOMMU_FLAG_PASID_SUPPORTED = BIT_ULL(1),
>  };
>  
>  #endif /* HW_IOMMU_H */
Thanks

Eric



^ permalink raw reply	[flat|nested] 148+ messages in thread

* RE: [PATCH v5 31/32] vfio: Synthesize vPASID capability to VM
  2025-11-06 13:55   ` Eric Auger
@ 2025-11-06 14:27     ` Shameer Kolothum
  2025-11-06 15:44       ` Eric Auger
  0 siblings, 1 reply; 148+ messages in thread
From: Shameer Kolothum @ 2025-11-06 14:27 UTC (permalink / raw)
  To: eric.auger@redhat.com, qemu-arm@nongnu.org, qemu-devel@nongnu.org
  Cc: peter.maydell@linaro.org, Jason Gunthorpe, Nicolin Chen,
	ddutile@redhat.com, berrange@redhat.com, Nathan Chen, Matt Ochs,
	smostafa@google.com, wangzhou1@hisilicon.com,
	jiangkunkun@huawei.com, jonathan.cameron@huawei.com,
	zhangfei.gao@linaro.org, zhenzhong.duan@intel.com,
	yi.l.liu@intel.com, Krishnakant Jaju



> -----Original Message-----
> From: Eric Auger <eric.auger@redhat.com>
> Sent: 06 November 2025 13:56
> To: Shameer Kolothum <skolothumtho@nvidia.com>; qemu-
> arm@nongnu.org; qemu-devel@nongnu.org
> Cc: peter.maydell@linaro.org; Jason Gunthorpe <jgg@nvidia.com>; Nicolin
> Chen <nicolinc@nvidia.com>; ddutile@redhat.com; berrange@redhat.com;
> Nathan Chen <nathanc@nvidia.com>; Matt Ochs <mochs@nvidia.com>;
> smostafa@google.com; wangzhou1@hisilicon.com;
> jiangkunkun@huawei.com; jonathan.cameron@huawei.com;
> zhangfei.gao@linaro.org; zhenzhong.duan@intel.com; yi.l.liu@intel.com;
> Krishnakant Jaju <kjaju@nvidia.com>
> Subject: Re: [PATCH v5 31/32] vfio: Synthesize vPASID capability to VM
> 
> External email: Use caution opening links or attachments
> 
> 
> Hi Shameer,
> On 10/31/25 11:50 AM, Shameer Kolothum wrote:
> > From: Yi Liu <yi.l.liu@intel.com>
> >
> > If user wants to expose PASID capability in vIOMMU, then VFIO would also
> need to report?
> > report the PASID cap for this device if the underlying hardware supports
> > it as well.
> >
> > As a start, this chooses to put the vPASID cap in the last 8 bytes of the
> > vconfig space. This is a choice in the good hope of no conflict with any
> > existing cap or hidden registers. For the devices that has hidden registers,
> > user should figure out a proper offset for the vPASID cap. This may require
> > an option for user to config it. Here we leave it as a future extension.
> > There are more discussions on the mechanism of finding the proper offset.
> >
> >
> > https://lore.kernel.org/kvm/BN9PR11MB5276318969A212AD0649C7BE8CBE2@BN9PR11MB5276.namprd11.prod.outlook.com/
> >
> > Since we add a check to ensure the vIOMMU supports PASID, only devices
> > under those vIOMMUs can synthesize the vPASID capability. This gives
> > users control over which devices expose vPASID.
> >
> > Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> > Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
> > Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
> > ---
> >  hw/vfio/pci.c      | 37 +++++++++++++++++++++++++++++++++++++
> >  include/hw/iommu.h |  1 +
> >  2 files changed, 38 insertions(+)
> >
> > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> > index 06b06afc2b..2054eac897 100644
> > --- a/hw/vfio/pci.c
> > +++ b/hw/vfio/pci.c
> > @@ -24,6 +24,7 @@
> >  #include <sys/ioctl.h>
> >
> >  #include "hw/hw.h"
> > +#include "hw/iommu.h"
> >  #include "hw/pci/msi.h"
> >  #include "hw/pci/msix.h"
> >  #include "hw/pci/pci_bridge.h"
> > @@ -2500,7 +2501,12 @@ static int vfio_setup_rebar_ecap(VFIOPCIDevice
> *vdev, uint16_t pos)
> >
> >  static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
> >  {
> > +    HostIOMMUDevice *hiod = vdev->vbasedev.hiod;
> > +    HostIOMMUDeviceClass *hiodc =
> HOST_IOMMU_DEVICE_GET_CLASS(hiod);
> >      PCIDevice *pdev = PCI_DEVICE(vdev);
> > +    uint64_t max_pasid_log2 = 0;
> > +    bool pasid_cap_added = false;
> > +    uint64_t hw_caps;
> >      uint32_t header;
> >      uint16_t cap_id, next, size;
> >      uint8_t cap_ver;
> > @@ -2578,12 +2584,43 @@ static void vfio_add_ext_cap(VFIOPCIDevice
> *vdev)
> >                  pcie_add_capability(pdev, cap_id, cap_ver, next, size);
> >              }
> >              break;
> > +        case PCI_EXT_CAP_ID_PASID:
> > +             pasid_cap_added = true;
> > +             /* fallthrough */
> >          default:
> >              pcie_add_capability(pdev, cap_id, cap_ver, next, size);
> >          }
> >
> >      }
> >
> > +#ifdef CONFIG_IOMMUFD
> > +    /*
> > +     * Although we check for PCI_EXT_CAP_ID_PASID above, the Linux VFIO
> > +     * framework currently hides this capability. Try to retrieve it
> > +     * through alternative kernel interfaces (e.g. IOMMUFD APIs).
> I don't catch this sentence. When are you supposed to read the above
> PCI_EXT_CAP_ID_PASID cap id then?

That’s to make it future proof in case VFIO relaxes that. If that happens,
the code above will, by default, add the CAP and we may end up with a
duplicate at the offset below.
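
Condensed, the intent of the hunk is roughly this (just a sketch of the
code above, not new logic):

    bool pasid_cap_added = false;     /* set by the cap-copy loop */
    uint64_t max_pasid_log2 = 0;
    ...
    case PCI_EXT_CAP_ID_PASID:
        pasid_cap_added = true;       /* VFIO exposed the real cap */
        /* fallthrough */
    ...
    /* Query IOMMUFD only if VFIO did not expose the cap itself, so
     * max_pasid_log2 stays 0 and nothing is synthesized below. */
    if (!pasid_cap_added && hiodc->get_cap) {
        hiodc->get_cap(hiod, HOST_IOMMU_DEVICE_CAP_MAX_PASID_LOG2,
                       &max_pasid_log2, NULL);
    }
    if (max_pasid_log2 && ...) {
        pcie_pasid_init(pdev, ...);   /* synthesize at the fixed offset */
    }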

> > +     */
> > +    if (!pasid_cap_added && hiodc->get_cap) {
> > +        hiodc->get_cap(hiod, HOST_IOMMU_DEVICE_CAP_GENERIC_HW,
> &hw_caps, NULL);
> > +        hiodc->get_cap(hiod, HOST_IOMMU_DEVICE_CAP_MAX_PASID_LOG2,
> > +                       &max_pasid_log2, NULL);
> > +    }
> > +
> > +    /*
> > +     * If supported, adds the PASID capability in the end of the PCIe config
> > +     * space. TODO: Add option for enabling pasid at a safe offset.
> > +     */
> > +    if (max_pasid_log2 && (pci_device_get_viommu_flags(pdev) &
> > +                           VIOMMU_FLAG_PASID_SUPPORTED)) {
> > +        bool exec_perm = (hw_caps & IOMMU_HW_CAP_PCI_PASID_EXEC) ?
> true : false;
> can't you direct set exec_perm = (hw_caps &
> IOMMU_HW_CAP_PCI_PASID_EXEC);

True 😊

Thanks,
Shameer

^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional get_msi_address_space() callback
  2025-11-06  7:42                       ` Eric Auger
  2025-11-06 11:48                         ` Shameer Kolothum
@ 2025-11-06 14:32                         ` Jason Gunthorpe
  2025-11-06 15:47                           ` Eric Auger
  1 sibling, 1 reply; 148+ messages in thread
From: Jason Gunthorpe @ 2025-11-06 14:32 UTC (permalink / raw)
  To: Eric Auger
  Cc: Nicolin Chen, Shameer Kolothum, qemu-arm@nongnu.org,
	qemu-devel@nongnu.org, peter.maydell@linaro.org,
	ddutile@redhat.com, berrange@redhat.com, Nathan Chen, Matt Ochs,
	smostafa@google.com, wangzhou1@hisilicon.com,
	jiangkunkun@huawei.com, jonathan.cameron@huawei.com,
	zhangfei.gao@linaro.org, zhenzhong.duan@intel.com,
	yi.l.liu@intel.com, Krishnakant Jaju

On Thu, Nov 06, 2025 at 08:42:31AM +0100, Eric Auger wrote:
> 
> 
> On 11/5/25 7:58 PM, Jason Gunthorpe wrote:
> > On Wed, Nov 05, 2025 at 10:33:08AM -0800, Nicolin Chen wrote:
> >> On Wed, Nov 05, 2025 at 02:10:49PM -0400, Jason Gunthorpe wrote:
> >>> On Wed, Nov 05, 2025 at 06:25:05PM +0100, Eric Auger wrote:
> >>>> if the guest doorbell address is wrong because not properly translated,
> >>>> vgic_msi_to_its() will fail to identify the ITS to inject the MSI in.
> >>>> See kernel kvm/vgic/vgic-its.c vgic_msi_to_its and
> >>>> vgic_its_inject_msi
> >>> Which has been exactly my point to Nicolin. There is no way to
> >>> "properly translate" the vMSI address in a HW accelerated SMMU
> >>> emulation.
> >> Hmm, I still can't connect the dots here. QEMU knows where the
> >> guest CD table is to get the stage-1 translation table to walk
> >> through. We could choose to not let it walk through. Yet, why?
> > You cannot walk any tables in guest memory without fully trapping all
> > invalidation on all command queues. Like real HW qemu needs to fence
> > its walks with any concurrent invalidate & sync to ensure it doesn't
> > walk into a UAF situation.
> But at the moment we do trap IOTLB invalidates so logically we can still
> do the translate in that config. The problem you describe will show up
> with vCMDQ which is not part of this series.

This is why I said:

> > Thus, the general principle of the HW accelerated vSMMU is that it
> > NEVER walks any of these guest tables for any reason.

It would make no sense to add table walking then have to figure out
how to rip it out.

> But anyway you need to provide KVM a valid info about the guest doorbell
> for this latter to setup irqfd gsi routing and also program ITS
> translation tables. At the moment we have a single vITS in qemu so maybe
> we can cheat.

qemu should always know what VITS is linked to a pci device to tell
kvm whatever it needs, even if there are more than one.

Jason


^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 31/32] vfio: Synthesize vPASID capability to VM
  2025-11-06 14:27     ` Shameer Kolothum
@ 2025-11-06 15:44       ` Eric Auger
  0 siblings, 0 replies; 148+ messages in thread
From: Eric Auger @ 2025-11-06 15:44 UTC (permalink / raw)
  To: Shameer Kolothum, qemu-arm@nongnu.org, qemu-devel@nongnu.org
  Cc: peter.maydell@linaro.org, Jason Gunthorpe, Nicolin Chen,
	ddutile@redhat.com, berrange@redhat.com, Nathan Chen, Matt Ochs,
	smostafa@google.com, wangzhou1@hisilicon.com,
	jiangkunkun@huawei.com, jonathan.cameron@huawei.com,
	zhangfei.gao@linaro.org, zhenzhong.duan@intel.com,
	yi.l.liu@intel.com, Krishnakant Jaju



On 11/6/25 3:27 PM, Shameer Kolothum wrote:
>
>> -----Original Message-----
>> From: Eric Auger <eric.auger@redhat.com>
>> Sent: 06 November 2025 13:56
>> To: Shameer Kolothum <skolothumtho@nvidia.com>; qemu-
>> arm@nongnu.org; qemu-devel@nongnu.org
>> Cc: peter.maydell@linaro.org; Jason Gunthorpe <jgg@nvidia.com>; Nicolin
>> Chen <nicolinc@nvidia.com>; ddutile@redhat.com; berrange@redhat.com;
>> Nathan Chen <nathanc@nvidia.com>; Matt Ochs <mochs@nvidia.com>;
>> smostafa@google.com; wangzhou1@hisilicon.com;
>> jiangkunkun@huawei.com; jonathan.cameron@huawei.com;
>> zhangfei.gao@linaro.org; zhenzhong.duan@intel.com; yi.l.liu@intel.com;
>> Krishnakant Jaju <kjaju@nvidia.com>
>> Subject: Re: [PATCH v5 31/32] vfio: Synthesize vPASID capability to VM
>>
>> External email: Use caution opening links or attachments
>>
>>
>> Hi Shameer,
>> On 10/31/25 11:50 AM, Shameer Kolothum wrote:
>>> From: Yi Liu <yi.l.liu@intel.com>
>>>
>>> If user wants to expose PASID capability in vIOMMU, then VFIO would also
>> need to report?
>>> report the PASID cap for this device if the underlying hardware supports
>>> it as well.
>>>
>>> As a start, this chooses to put the vPASID cap in the last 8 bytes of the
>>> vconfig space. This is a choice in the good hope of no conflict with any
>>> existing cap or hidden registers. For the devices that has hidden registers,
>>> user should figure out a proper offset for the vPASID cap. This may require
>>> an option for user to config it. Here we leave it as a future extension.
>>> There are more discussions on the mechanism of finding the proper offset.
>>>
>>>
>>> https://lore.kernel.org/kvm/BN9PR11MB5276318969A212AD0649C7BE8CBE2@BN9PR11MB5276.namprd11.prod.outlook.com/
>>> Since we add a check to ensure the vIOMMU supports PASID, only devices
>>> under those vIOMMUs can synthesize the vPASID capability. This gives
>>> users control over which devices expose vPASID.
>>>
>>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>>> Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
>>> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
>>> ---
>>>  hw/vfio/pci.c      | 37 +++++++++++++++++++++++++++++++++++++
>>>  include/hw/iommu.h |  1 +
>>>  2 files changed, 38 insertions(+)
>>>
>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>> index 06b06afc2b..2054eac897 100644
>>> --- a/hw/vfio/pci.c
>>> +++ b/hw/vfio/pci.c
>>> @@ -24,6 +24,7 @@
>>>  #include <sys/ioctl.h>
>>>
>>>  #include "hw/hw.h"
>>> +#include "hw/iommu.h"
>>>  #include "hw/pci/msi.h"
>>>  #include "hw/pci/msix.h"
>>>  #include "hw/pci/pci_bridge.h"
>>> @@ -2500,7 +2501,12 @@ static int vfio_setup_rebar_ecap(VFIOPCIDevice
>> *vdev, uint16_t pos)
>>>  static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
>>>  {
>>> +    HostIOMMUDevice *hiod = vdev->vbasedev.hiod;
>>> +    HostIOMMUDeviceClass *hiodc =
>> HOST_IOMMU_DEVICE_GET_CLASS(hiod);
>>>      PCIDevice *pdev = PCI_DEVICE(vdev);
>>> +    uint64_t max_pasid_log2 = 0;
>>> +    bool pasid_cap_added = false;
>>> +    uint64_t hw_caps;
>>>      uint32_t header;
>>>      uint16_t cap_id, next, size;
>>>      uint8_t cap_ver;
>>> @@ -2578,12 +2584,43 @@ static void vfio_add_ext_cap(VFIOPCIDevice
>> *vdev)
>>>                  pcie_add_capability(pdev, cap_id, cap_ver, next, size);
>>>              }
>>>              break;
>>> +        case PCI_EXT_CAP_ID_PASID:
>>> +             pasid_cap_added = true;
>>> +             /* fallthrough */
>>>          default:
>>>              pcie_add_capability(pdev, cap_id, cap_ver, next, size);
>>>          }
>>>
>>>      }
>>>
>>> +#ifdef CONFIG_IOMMUFD
>>> +    /*
>>> +     * Although we check for PCI_EXT_CAP_ID_PASID above, the Linux VFIO
>>> +     * framework currently hides this capability. Try to retrieve it
>>> +     * through alternative kernel interfaces (e.g. IOMMUFD APIs).
>> I don't catch this sentence. When are you supposed to read the above
>> PCI_EXT_CAP_ID_PASID cap id then?
> That’s to make it future proof in case VFIO relaxes that. If that happens,
> the code above will, by default, add the CAP and we may end up with a
> duplicate at the offset below.
OK thanks for the clarification. Then I would move the comment about
VFIO kernel code currently hiding the extended cap along with

+             pasid_cap_added = true;

and explain it is added to make it future proof in case VFIO relaxes that
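
For instance, something like this (the comment wording is only a
suggestion):

    case PCI_EXT_CAP_ID_PASID:
        /*
         * The Linux VFIO framework currently hides this capability, so
         * we normally never get here. Keep the flag anyway so we do not
         * synthesize a duplicate PASID cap below if VFIO relaxes that.
         */
        pasid_cap_added = true;
        /* fallthrough */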

Eric

>
>>> +     */
>>> +    if (!pasid_cap_added && hiodc->get_cap) {
>>> +        hiodc->get_cap(hiod, HOST_IOMMU_DEVICE_CAP_GENERIC_HW,
>> &hw_caps, NULL);
>>> +        hiodc->get_cap(hiod, HOST_IOMMU_DEVICE_CAP_MAX_PASID_LOG2,
>>> +                       &max_pasid_log2, NULL);
>>> +    }
>>> +
>>> +    /*
>>> +     * If supported, adds the PASID capability in the end of the PCIe config
>>> +     * space. TODO: Add option for enabling pasid at a safe offset.
>>> +     */
>>> +    if (max_pasid_log2 && (pci_device_get_viommu_flags(pdev) &
>>> +                           VIOMMU_FLAG_PASID_SUPPORTED)) {
>>> +        bool exec_perm = (hw_caps & IOMMU_HW_CAP_PCI_PASID_EXEC) ?
>> true : false;
>> can't you direct set exec_perm = (hw_caps &
>> IOMMU_HW_CAP_PCI_PASID_EXEC);
> True 😊
>
> Thanks,
> Shameer



^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional get_msi_address_space() callback
  2025-11-06 14:32                         ` Jason Gunthorpe
@ 2025-11-06 15:47                           ` Eric Auger
  0 siblings, 0 replies; 148+ messages in thread
From: Eric Auger @ 2025-11-06 15:47 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Nicolin Chen, Shameer Kolothum, qemu-arm@nongnu.org,
	qemu-devel@nongnu.org, peter.maydell@linaro.org,
	ddutile@redhat.com, berrange@redhat.com, Nathan Chen, Matt Ochs,
	smostafa@google.com, wangzhou1@hisilicon.com,
	jiangkunkun@huawei.com, jonathan.cameron@huawei.com,
	zhangfei.gao@linaro.org, zhenzhong.duan@intel.com,
	yi.l.liu@intel.com, Krishnakant Jaju



On 11/6/25 3:32 PM, Jason Gunthorpe wrote:
> On Thu, Nov 06, 2025 at 08:42:31AM +0100, Eric Auger wrote:
>>
>> On 11/5/25 7:58 PM, Jason Gunthorpe wrote:
>>> On Wed, Nov 05, 2025 at 10:33:08AM -0800, Nicolin Chen wrote:
>>>> On Wed, Nov 05, 2025 at 02:10:49PM -0400, Jason Gunthorpe wrote:
>>>>> On Wed, Nov 05, 2025 at 06:25:05PM +0100, Eric Auger wrote:
>>>>>> if the guest doorbell address is wrong because not properly translated,
>>>>>> vgic_msi_to_its() will fail to identify the ITS to inject the MSI in.
>>>>>> See kernel kvm/vgic/vgic-its.c vgic_msi_to_its and
>>>>>> vgic_its_inject_msi
>>>>> Which has been exactly my point to Nicolin. There is no way to
>>>>> "properly translate" the vMSI address in a HW accelerated SMMU
>>>>> emulation.
>>>> Hmm, I still can't connect the dots here. QEMU knows where the
>>>> guest CD table is to get the stage-1 translation table to walk
>>>> through. We could choose to not let it walk through. Yet, why?
>>> You cannot walk any tables in guest memory without fully trapping all
>>> invalidation on all command queues. Like real HW qemu needs to fence
>>> its walks with any concurrent invalidate & sync to ensure it doesn't
>>> walk into a UAF situation.
>> But at the moment we do trap IOTLB invalidates so logically we can still
>> do the translate in that config. The problem you describe will show up
>> with vCMDQ which is not part of this series.
> This is why I said:
>
>>> Thus, the general principle of the HW accelerated vSMMU is that it
>>> NEVER walks any of these guest tables for any reason.
> It would make no sense to add table walking then have to figure out
> how to rip it out.

understood. Though strictly speaking you are not adding it as it is
already there ;-)
>
>> But anyway you need to provide KVM a valid info about the guest doorbell
>> for this latter to setup irqfd gsi routing and also program ITS
>> translation tables. At the moment we have a single vITS in qemu so maybe
>> we can cheat.
> qemu should always know what VITS is linked to a pci device to tell
> kvm whatever it needs, even if there are more than one.
Yeah we can work in that direction instead. But this could be worked on
later on, along with the vCMDQ series as well ;-)

Eric
>
> Jason
>



^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 32/32] hw/arm/smmuv3-accel: Add support for PASID enable
  2025-10-31 10:50 ` [PATCH v5 32/32] hw/arm/smmuv3-accel: Add support for PASID enable Shameer Kolothum
@ 2025-11-06 16:46   ` Eric Auger
  0 siblings, 0 replies; 148+ messages in thread
From: Eric Auger @ 2025-11-06 16:46 UTC (permalink / raw)
  To: Shameer Kolothum, qemu-arm, qemu-devel
  Cc: peter.maydell, jgg, nicolinc, ddutile, berrange, nathanc, mochs,
	smostafa, wangzhou1, jiangkunkun, jonathan.cameron, zhangfei.gao,
	zhenzhong.duan, yi.l.liu, kjaju

Hi Shameer,

On 10/31/25 11:50 AM, Shameer Kolothum wrote:
> QEMU SMMUv3 currently forces SSID (Substream ID) to zero. One key use case
> for accelerated mode is Shared Virtual Addressing (SVA), which requires
> SSID support so the guest can maintain multiple context descriptors per
> substream ID.
>
> Provide an option for user to enable PASID support. A SSIDSIZE of 16
> is currently used as default.
>
> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
> ---
>  hw/arm/smmuv3-accel.c    | 23 ++++++++++++++++++++++-
>  hw/arm/smmuv3-internal.h |  1 +
>  hw/arm/smmuv3.c          | 10 +++++++++-
>  include/hw/arm/smmuv3.h  |  1 +
>  4 files changed, 33 insertions(+), 2 deletions(-)
>
> diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
> index caa4a1d82d..1f206be8e4 100644
> --- a/hw/arm/smmuv3-accel.c
> +++ b/hw/arm/smmuv3-accel.c
> @@ -68,6 +68,12 @@ smmuv3_accel_check_hw_compatible(SMMUv3State *s,
>          error_setg(errp, "Host SMMUv3 SIDSIZE not compatible");
>          return false;
>      }
> +    /* If user enables PASID support(pasid=on), QEMU sets SSIDSIZE to 16 */
> +    if (FIELD_EX32(info->idr[1], IDR1, SSIDSIZE) <
> +                FIELD_EX32(s->idr[1], IDR1, SSIDSIZE)) {
> +        error_setg(errp, "Host SMMUv3 SSIDSIZE not compatible");
> +        return false;
> +    }
>  
>      /* User can disable QEMU SMMUv3 Range Invalidation support */
>      if (FIELD_EX32(info->idr[3], IDR3, RIL) !=
> @@ -642,7 +648,14 @@ static uint64_t smmuv3_accel_get_viommu_flags(void *opaque)
>       * The real HW nested support should be reported from host SMMUv3 and if
>       * it doesn't, the nesting parent allocation will fail anyway in VFIO core.
>       */
> -    return VIOMMU_FLAG_WANT_NESTING_PARENT;
> +    uint64_t flags = VIOMMU_FLAG_WANT_NESTING_PARENT;
> +    SMMUState *bs = opaque;
> +    SMMUv3State *s = ARM_SMMUV3(bs);
> +
> +    if (s->pasid) {
> +        flags |= VIOMMU_FLAG_PASID_SUPPORTED;
> +    }
> +    return flags;
>  }
>  
>  static const PCIIOMMUOps smmuv3_accel_ops = {
> @@ -672,6 +685,14 @@ void smmuv3_accel_idr_override(SMMUv3State *s)
>      if (s->oas == 48) {
>          s->idr[5] = FIELD_DP32(s->idr[5], IDR5, OAS, SMMU_IDR5_OAS_48);
>      }
> +
> +    /*
> +     * By default QEMU SMMUv3 has no PASID(SSID) support. Update IDR1 if user
> +     * has enabled it.
> +     */
> +    if (s->pasid) {
> +        s->idr[1] = FIELD_DP32(s->idr[1], IDR1, SSIDSIZE, SMMU_IDR1_SSIDSIZE);
> +    }
>  }
>  
>  /* Based on SMUUv3 GBPA configuration, attach a corresponding HWPT */
> diff --git a/hw/arm/smmuv3-internal.h b/hw/arm/smmuv3-internal.h
> index cfc5897569..2e0d8d538b 100644
> --- a/hw/arm/smmuv3-internal.h
> +++ b/hw/arm/smmuv3-internal.h
> @@ -81,6 +81,7 @@ REG32(IDR1,                0x4)
>      FIELD(IDR1, ECMDQ,        31, 1)
>  
>  #define SMMU_IDR1_SIDSIZE 16
> +#define SMMU_IDR1_SSIDSIZE 16
>  #define SMMU_CMDQS   19
>  #define SMMU_EVENTQS 19
>  
> diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
> index c4d28a3786..e1140fe087 100644
> --- a/hw/arm/smmuv3.c
> +++ b/hw/arm/smmuv3.c
> @@ -611,7 +611,8 @@ static int decode_ste(SMMUv3State *s, SMMUTransCfg *cfg,
>          }
>      }
>  
> -    if (STE_S1CDMAX(ste) != 0) {
> +    /* If pasid enabled, we report SSIDSIZE = 16 */
> +    if (!FIELD_EX32(s->idr[1], IDR1, SSIDSIZE) && STE_S1CDMAX(ste) != 0) {
can't you directly check s->pasid instead of decoding the IDR1?
>          qemu_log_mask(LOG_UNIMP,
>                        "SMMUv3 does not support multiple context descriptors yet\n");
you may suggest adding the pasid= option then.
>          goto bad_ste;
> @@ -1966,6 +1967,10 @@ static bool smmu_validate_property(SMMUv3State *s, Error **errp)
>              error_setg(errp, "OAS can only be set to 44 bits if accel=off");
>              return false;
>          }
> +        if (s->pasid) {
> +            error_setg(errp, "pasid can only be enabled if accel=on");
> +            return false;
> +        }
>          return false;
>      }
>  
> @@ -2098,6 +2103,7 @@ static const Property smmuv3_properties[] = {
>      DEFINE_PROP_BOOL("ril", SMMUv3State, ril, true),
>      DEFINE_PROP_BOOL("ats", SMMUv3State, ats, false),
>      DEFINE_PROP_UINT8("oas", SMMUv3State, oas, 44),
> +    DEFINE_PROP_BOOL("pasid", SMMUv3State, pasid, false),
>  };
>  
>  static void smmuv3_instance_init(Object *obj)
> @@ -2133,6 +2139,8 @@ static void smmuv3_class_init(ObjectClass *klass, const void *data)
>      object_class_property_set_description(klass, "oas",
>          "Specify Output Address Size (for accel =on). Supported values "
>          "are 44 or 48 bits. Defaults to 44 bits");
> +    object_class_property_set_description(klass, "pasid",
> +        "Enable/disable PASID support (for accel=on)");
>  }
>  
>  static int smmuv3_notify_flag_changed(IOMMUMemoryRegion *iommu,
> diff --git a/include/hw/arm/smmuv3.h b/include/hw/arm/smmuv3.h
> index e4226b66f3..ee0b5ed74f 100644
> --- a/include/hw/arm/smmuv3.h
> +++ b/include/hw/arm/smmuv3.h
> @@ -71,6 +71,7 @@ struct SMMUv3State {
>      bool ril;
>      bool ats;
>      uint8_t oas;
> +    bool pasid;
>  };
>  
>  typedef enum {
Otherwise looks good to me
Reviewed-by: Eric Auger <eric.auger@redhat.com>

Eric



^ permalink raw reply	[flat|nested] 148+ messages in thread

* Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional get_msi_address_space() callback
  2025-11-06 11:48                         ` Shameer Kolothum
@ 2025-11-06 17:04                           ` Eric Auger
  2025-11-07 10:27                             ` Shameer Kolothum
  0 siblings, 1 reply; 148+ messages in thread
From: Eric Auger @ 2025-11-06 17:04 UTC (permalink / raw)
  To: Shameer Kolothum, Jason Gunthorpe, Nicolin Chen
  Cc: qemu-arm@nongnu.org, qemu-devel@nongnu.org,
	peter.maydell@linaro.org, ddutile@redhat.com, berrange@redhat.com,
	Nathan Chen, Matt Ochs, smostafa@google.com,
	wangzhou1@hisilicon.com, jiangkunkun@huawei.com,
	jonathan.cameron@huawei.com, zhangfei.gao@linaro.org,
	zhenzhong.duan@intel.com, yi.l.liu@intel.com, Krishnakant Jaju

Hi Shameer,

On 11/6/25 12:48 PM, Shameer Kolothum wrote:
>> -----Original Message-----
>> From: Eric Auger <eric.auger@redhat.com>
>> Sent: 06 November 2025 07:43
>> To: Jason Gunthorpe <jgg@nvidia.com>; Nicolin Chen <nicolinc@nvidia.com>
>> Cc: Shameer Kolothum <skolothumtho@nvidia.com>; qemu-
>> arm@nongnu.org; qemu-devel@nongnu.org; peter.maydell@linaro.org;
>> ddutile@redhat.com; berrange@redhat.com; Nathan Chen
>> <nathanc@nvidia.com>; Matt Ochs <mochs@nvidia.com>;
>> smostafa@google.com; wangzhou1@hisilicon.com;
>> jiangkunkun@huawei.com; jonathan.cameron@huawei.com;
>> zhangfei.gao@linaro.org; zhenzhong.duan@intel.com; yi.l.liu@intel.com;
>> Krishnakant Jaju <kjaju@nvidia.com>
>> Subject: Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional
>> get_msi_address_space() callback
>>
>> External email: Use caution opening links or attachments
>>
>>
>> On 11/5/25 7:58 PM, Jason Gunthorpe wrote:
>>> On Wed, Nov 05, 2025 at 10:33:08AM -0800, Nicolin Chen wrote:
>>>> On Wed, Nov 05, 2025 at 02:10:49PM -0400, Jason Gunthorpe wrote:
>>>>> On Wed, Nov 05, 2025 at 06:25:05PM +0100, Eric Auger wrote:
>>>>>> if the guest doorbell address is wrong because not properly translated,
>>>>>> vgic_msi_to_its() will fail to identify the ITS to inject the MSI in.
>>>>>> See kernel kvm/vgic/vgic-its.c vgic_msi_to_its and
>>>>>> vgic_its_inject_msi
>>>>> Which has been exactly my point to Nicolin. There is no way to
>>>>> "properly translate" the vMSI address in a HW accelerated SMMU
>>>>> emulation.
>>>> Hmm, I still can't connect the dots here. QEMU knows where the
>>>> guest CD table is to get the stage-1 translation table to walk
>>>> through. We could choose to not let it walk through. Yet, why?
>>> You cannot walk any tables in guest memory without fully trapping all
>>> invalidation on all command queues. Like real HW qemu needs to fence
>>> its walks with any concurrent invalidate & sync to ensure it doesn't
>>> walk into a UAF situation.
>> But at the moment we do trap IOTLB invalidates so logically we can still
>> do the translate in that config. The problem you describe will show up
>> with vCMDQ which is not part of this series.
>>> Since we can't trap or mediate vCMDQ the walking simply cannot be
>>> done.
>>>
>>> Thus, the general principle of the HW accelerated vSMMU is that it
>>> NEVER walks any of these guest tables for any reason.
>>>
>>> Thus, we cannot do anything with vMSI address beyond program it
>>> directly into a real PCI device so it undergoes real HW translation.
>> But anyway you need to provide KVM a valid info about the guest doorbell
>> for this latter to setup irqfd gsi routing and also program ITS
>> translation tables. At the moment we have a single vITS in qemu so maybe
>> we can cheat.
> I have tried to address the “translate” issue below. This introduces a new
> get_msi_address() callback to retrieve the MSI doorbell address directly
> from the vIOMMU, so we can drop the existing get_msi_address_space() logic.
> Please take a look and let me know your thoughts.
>
> Thanks,
> Shameer
>
> ---
>  hw/arm/smmuv3-accel.c   | 10 ++++++++++
>  hw/arm/smmuv3.c         |  1 +
>  hw/arm/virt.c           |  4 ++++
>  hw/pci/pci.c            | 17 +++++++++++++++++
>  include/hw/arm/smmuv3.h |  1 +
>  include/hw/pci/pci.h    | 15 +++++++++++++++
>  target/arm/kvm.c        | 14 ++++++++++++--
>  7 files changed, 60 insertions(+), 2 deletions(-)
>
> diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
> index e6c81c4786..8b2a45a915 100644
> --- a/hw/arm/smmuv3-accel.c
> +++ b/hw/arm/smmuv3-accel.c
> @@ -667,6 +667,15 @@ static void smmuv3_accel_unset_iommu_device(PCIBus *bus, void *opaque,
>      }
>  }
>
> +static uint64_t smmuv3_accel_get_msi_address(PCIBus *bus, void *opaque,
> +                                             int devfn)
> +{
> +    SMMUState *bs = opaque;
> +    SMMUv3State *s = ARM_SMMUV3(bs);
> +
> +    g_assert(s->msi_doorbell);
> +    return s->msi_doorbell;
> +}
>  static AddressSpace *smmuv3_accel_get_msi_as(PCIBus *bus, void *opaque,
>                                               int devfn)
>  {
> @@ -788,6 +797,7 @@ static const PCIIOMMUOps smmuv3_accel_ops = {
>      .set_iommu_device = smmuv3_accel_set_iommu_device,
>      .unset_iommu_device = smmuv3_accel_unset_iommu_device,
>      .get_msi_address_space = smmuv3_accel_get_msi_as,
to be removed then
> +    .get_msi_address = smmuv3_accel_get_msi_address,
>  };
>
>  void smmuv3_accel_idr_override(SMMUv3State *s)
> diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
> index 43d297698b..3f2ee8bcce 100644
> --- a/hw/arm/smmuv3.c
> +++ b/hw/arm/smmuv3.c
> @@ -2120,6 +2120,7 @@ static const Property smmuv3_properties[] = {
>      DEFINE_PROP_BOOL("ats", SMMUv3State, ats, false),
>      DEFINE_PROP_UINT8("oas", SMMUv3State, oas, 44),
>      DEFINE_PROP_BOOL("pasid", SMMUv3State, pasid, false),
> +    DEFINE_PROP_UINT64("msi-doorbell", SMMUv3State, msi_doorbell, 0),
>  };
>
>  static void smmuv3_instance_init(Object *obj)
> diff --git a/hw/arm/virt.c b/hw/arm/virt.c
> index 2498e3beff..d2dcb89235 100644
> --- a/hw/arm/virt.c
> +++ b/hw/arm/virt.c
> @@ -3097,6 +3097,8 @@ static void virt_machine_device_plug_cb(HotplugHandler *hotplug_dev,
>
>              create_smmuv3_dev_dtb(vms, dev, bus, errp);
>              if (object_property_get_bool(OBJECT(dev), "accel", &error_abort)) {
> +                hwaddr db_start = base_memmap[VIRT_GIC_ITS].base +
> +                                  ITS_TRANS_SIZE + GITS_TRANSLATER;
there are still use cases where the target is the GICv2M doorbell, so at
least you would need to add some logic to switch between both
>                  char *stage;
>                  stage = object_property_get_str(OBJECT(dev), "stage",
>                                                  &error_fatal);
> @@ -3107,6 +3109,8 @@ static void virt_machine_device_plug_cb(HotplugHandler *hotplug_dev,
>                      return;
>                  }
>                  vms->pci_preserve_config = true;
> +                object_property_set_uint(OBJECT(dev), "msi-doorbell", db_start,
> +                                         &error_abort);
>              }
>          }
>      }
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index 1edd711247..45e79a3c23 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -2982,6 +2982,23 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
>      return &address_space_memory;
>  }
>
> +bool pci_device_iommu_msi_direct_address(PCIDevice *dev, hwaddr *out_doorbell)
> +{
> +    PCIBus *bus;
> +    PCIBus *iommu_bus;
> +    int devfn;
> +
> +    pci_device_get_iommu_bus_devfn(dev, &iommu_bus, &bus, &devfn);
> +    if (iommu_bus) {
> +        if (iommu_bus->iommu_ops->get_msi_address) {
> +            *out_doorbell = iommu_bus->iommu_ops->get_msi_address(bus,
> +                                 iommu_bus->iommu_opaque, devfn);
> +            return true;
> +        }
> +    }
> +    return false;
> +}
> +
>  AddressSpace *pci_device_iommu_msi_address_space(PCIDevice *dev)
>  {
>      PCIBus *bus;
> diff --git a/include/hw/arm/smmuv3.h b/include/hw/arm/smmuv3.h
> index ee0b5ed74f..f50d8c72bd 100644
> --- a/include/hw/arm/smmuv3.h
> +++ b/include/hw/arm/smmuv3.h
> @@ -72,6 +72,7 @@ struct SMMUv3State {
>      bool ats;
>      uint8_t oas;
>      bool pasid;
> +    uint64_t msi_doorbell;
>  };
>
>  typedef enum {
> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> index b731443c67..e1709b0bfe 100644
> --- a/include/hw/pci/pci.h
> +++ b/include/hw/pci/pci.h
> @@ -679,6 +679,20 @@ typedef struct PCIIOMMUOps {
>       */
>      AddressSpace * (*get_msi_address_space)(PCIBus *bus, void *opaque,
>                                              int devfn);
> +    /**
> +     * @get_msi_address: get the address of MSI doorbell for the device
(gpa) address
> +     * on a PCI bus.
> +     *
> +     * Optional callback, if implemented must return a valid MSI doorbell
> +     * address.
> +     *
> +     * @bus: the #PCIBus being accessed.
> +     *
> +     * @opaque: the data passed to pci_setup_iommu().
> +     *
> +     * @devfn: device and function number
> +     */
> +    uint64_t (*get_msi_address)(PCIBus *bus, void *opaque, int devfn);
>  } PCIIOMMUOps;
>
>  bool pci_device_get_iommu_bus_devfn(PCIDevice *dev, PCIBus **piommu_bus,
> @@ -688,6 +702,7 @@ bool pci_device_set_iommu_device(PCIDevice *dev, HostIOMMUDevice *hiod,
>                                   Error **errp);
>  void pci_device_unset_iommu_device(PCIDevice *dev);
>  AddressSpace *pci_device_iommu_msi_address_space(PCIDevice *dev);
> +bool pci_device_iommu_msi_direct_address(PCIDevice *dev, hwaddr *out_doorbell);
>
>  /**
>   * pci_device_get_viommu_flags: get vIOMMU flags.
> diff --git a/target/arm/kvm.c b/target/arm/kvm.c
> index 0df41128d0..8d4d2be0bc 100644
> --- a/target/arm/kvm.c
> +++ b/target/arm/kvm.c
> @@ -1611,35 +1611,45 @@ int kvm_arm_set_irq(int cpu, int irqtype, int irq, int level)
>  int kvm_arch_fixup_msi_route(struct kvm_irq_routing_entry *route,
>                               uint64_t address, uint32_t data, PCIDevice *dev)
>  {
> -    AddressSpace *as = pci_device_iommu_msi_address_space(dev);
> +    AddressSpace *as;
>      hwaddr xlat, len, doorbell_gpa;
>      MemoryRegionSection mrs;
>      MemoryRegion *mr;
>
> +    /* Check if there is a direct msi address available */
> +    if (pci_device_iommu_msi_direct_address(dev, &doorbell_gpa)) {
> +        goto set_doorbell;
> +    }
> +
> +    as = pci_device_iommu_msi_address_space(dev);
logically this should be after the test below (i.e. meaning we have an
IOMMU). But this means that you would use an AS which is not
address_space_memory.

This works but it is not neat either because it totally ignores the
@address. So you have to build a solid commit msg to explain to readers
why this is needed ;-)
>      if (as == &address_space_memory) {
>          return 0;
>      }
>
>      /* MSI doorbell address is translated by an IOMMU */
>
> -    RCU_READ_LOCK_GUARD();
> +    rcu_read_lock();
>
>      mr = address_space_translate(as, address, &xlat, &len, true,
>                                   MEMTXATTRS_UNSPECIFIED);
>
>      if (!mr) {
> +        rcu_read_unlock();
>          return 1;
>      }
>
>      mrs = memory_region_find(mr, xlat, 1);
>
>      if (!mrs.mr) {
> +        rcu_read_unlock();
>          return 1;
>      }
>
>      doorbell_gpa = mrs.offset_within_address_space;
>      memory_region_unref(mrs.mr);
> +    rcu_read_unlock();
>
> +set_doorbell:
>      route->u.msi.address_lo = doorbell_gpa;
>      route->u.msi.address_hi = doorbell_gpa >> 32;
>
> --
>
>
>
>
>
>
Thanks

Eric



^ permalink raw reply	[flat|nested] 148+ messages in thread

* RE: [PATCH v5 15/32] hw/pci/pci: Introduce optional get_msi_address_space() callback
  2025-11-06 17:04                           ` Eric Auger
@ 2025-11-07 10:27                             ` Shameer Kolothum
  0 siblings, 0 replies; 148+ messages in thread
From: Shameer Kolothum @ 2025-11-07 10:27 UTC (permalink / raw)
  To: eric.auger@redhat.com, Jason Gunthorpe, Nicolin Chen
  Cc: qemu-arm@nongnu.org, qemu-devel@nongnu.org,
	peter.maydell@linaro.org, ddutile@redhat.com, berrange@redhat.com,
	Nathan Chen, Matt Ochs, smostafa@google.com,
	wangzhou1@hisilicon.com, jiangkunkun@huawei.com,
	jonathan.cameron@huawei.com, zhangfei.gao@linaro.org,
	zhenzhong.duan@intel.com, yi.l.liu@intel.com, Krishnakant Jaju


Hi Eric,

> -----Original Message-----
> From: Eric Auger <eric.auger@redhat.com>
> Sent: 06 November 2025 17:05
> To: Shameer Kolothum <skolothumtho@nvidia.com>; Jason Gunthorpe
> <jgg@nvidia.com>; Nicolin Chen <nicolinc@nvidia.com>
> Cc: qemu-arm@nongnu.org; qemu-devel@nongnu.org;
> peter.maydell@linaro.org; ddutile@redhat.com; berrange@redhat.com;
> Nathan Chen <nathanc@nvidia.com>; Matt Ochs <mochs@nvidia.com>;
> smostafa@google.com; wangzhou1@hisilicon.com;
> jiangkunkun@huawei.com; jonathan.cameron@huawei.com;
> zhangfei.gao@linaro.org; zhenzhong.duan@intel.com; yi.l.liu@intel.com;
> Krishnakant Jaju <kjaju@nvidia.com>
> Subject: Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional
> get_msi_address_space() callback
> 
[...] 
> 
> > I have tried to address the “translate” issue below. This introduces a new
> > get_msi_address() callback to retrieve the MSI doorbell address directly
> > from the vIOMMU, so we can drop the existing get_msi_address_space()
> logic.
> > Please take a look and let me know your thoughts.
> >
> > Thanks,
> > Shameer
> >
> > ---
> >  hw/arm/smmuv3-accel.c   | 10 ++++++++++
> >  hw/arm/smmuv3.c         |  1 +
> >  hw/arm/virt.c           |  4 ++++
> >  hw/pci/pci.c            | 17 +++++++++++++++++
> >  include/hw/arm/smmuv3.h |  1 +
> >  include/hw/pci/pci.h    | 15 +++++++++++++++
> >  target/arm/kvm.c        | 14 ++++++++++++--
> >  7 files changed, 60 insertions(+), 2 deletions(-)
> >
> > diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
> > index e6c81c4786..8b2a45a915 100644
> > --- a/hw/arm/smmuv3-accel.c
> > +++ b/hw/arm/smmuv3-accel.c
> > @@ -667,6 +667,15 @@ static void
> smmuv3_accel_unset_iommu_device(PCIBus *bus, void *opaque,
> >      }
> >  }
> >
> > +static uint64_t smmuv3_accel_get_msi_address(PCIBus *bus, void
> *opaque,
> > +                                             int devfn)
> > +{
> > +    SMMUState *bs = opaque;
> > +    SMMUv3State *s = ARM_SMMUV3(bs);
> > +
> > +    g_assert(s->msi_doorbell);
> > +    return s->msi_doorbell;
> > +}
> >  static AddressSpace *smmuv3_accel_get_msi_as(PCIBus *bus, void
> *opaque,
> >                                               int devfn)
> >  {
> > @@ -788,6 +797,7 @@ static const PCIIOMMUOps smmuv3_accel_ops = {
> >      .set_iommu_device = smmuv3_accel_set_iommu_device,
> >      .unset_iommu_device = smmuv3_accel_unset_iommu_device,
> >      .get_msi_address_space = smmuv3_accel_get_msi_as,
> to be removed then

Yes, of course. Will drop that.

> > +    .get_msi_address = smmuv3_accel_get_msi_address,
> >  };
> >
> >  void smmuv3_accel_idr_override(SMMUv3State *s)
> > diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
> > index 43d297698b..3f2ee8bcce 100644
> > --- a/hw/arm/smmuv3.c
> > +++ b/hw/arm/smmuv3.c
> > @@ -2120,6 +2120,7 @@ static const Property smmuv3_properties[] = {
> >      DEFINE_PROP_BOOL("ats", SMMUv3State, ats, false),
> >      DEFINE_PROP_UINT8("oas", SMMUv3State, oas, 44),
> >      DEFINE_PROP_BOOL("pasid", SMMUv3State, pasid, false),
> > +    DEFINE_PROP_UINT64("msi-doorbell", SMMUv3State, msi_doorbell, 0),
> >  };
> >
> >  static void smmuv3_instance_init(Object *obj)
> > diff --git a/hw/arm/virt.c b/hw/arm/virt.c
> > index 2498e3beff..d2dcb89235 100644
> > --- a/hw/arm/virt.c
> > +++ b/hw/arm/virt.c
> > @@ -3097,6 +3097,8 @@ static void
> virt_machine_device_plug_cb(HotplugHandler *hotplug_dev,
> >
> >              create_smmuv3_dev_dtb(vms, dev, bus, errp);
> >              if (object_property_get_bool(OBJECT(dev), "accel", &error_abort)) {
> > +                hwaddr db_start = base_memmap[VIRT_GIC_ITS].base +
> > +                                  ITS_TRANS_SIZE + GITS_TRANSLATER;
> there are still use cases where the target is the GICv2M doorbell, so at
> least you would need to add some logic to switch between both

But with KVM, virt doesn't support GICv2, right?
That reminds me we should probably add a check that KVM is enabled
for the SMMUv3 accel=on case.
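
Something along these lines in smmu_validate_property(), maybe (sketch
only; the exact field name and error wording are guesses):

    if (s->accel && !kvm_enabled()) {
        error_setg(errp, "SMMUv3 accel=on requires KVM");
        return false;
    }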

> >                  char *stage;
> >                  stage = object_property_get_str(OBJECT(dev), "stage",
> >                                                  &error_fatal);
> > @@ -3107,6 +3109,8 @@ static void
> virt_machine_device_plug_cb(HotplugHandler *hotplug_dev,
> >                      return;
> >                  }
> >                  vms->pci_preserve_config = true;
> > +                object_property_set_uint(OBJECT(dev), "msi-doorbell", db_start,
> > +                                         &error_abort);
> >              }
> >          }
> >      }
> > diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> > index 1edd711247..45e79a3c23 100644
> > --- a/hw/pci/pci.c
> > +++ b/hw/pci/pci.c
> > @@ -2982,6 +2982,23 @@ AddressSpace
> *pci_device_iommu_address_space(PCIDevice *dev)
> >      return &address_space_memory;
> >  }
> >
> > +bool pci_device_iommu_msi_direct_address(PCIDevice *dev, hwaddr
> *out_doorbell)
> > +{
> > +    PCIBus *bus;
> > +    PCIBus *iommu_bus;
> > +    int devfn;
> > +
> > +    pci_device_get_iommu_bus_devfn(dev, &iommu_bus, &bus, &devfn);
> > +    if (iommu_bus) {
> > +        if (iommu_bus->iommu_ops->get_msi_address) {
> > +            *out_doorbell = iommu_bus->iommu_ops->get_msi_address(bus,
> > +                                 iommu_bus->iommu_opaque, devfn);
> > +            return true;
> > +        }
> > +    }
> > +    return false;
> > +}
> > +
> >  AddressSpace *pci_device_iommu_msi_address_space(PCIDevice *dev)
> >  {
> >      PCIBus *bus;
> > diff --git a/include/hw/arm/smmuv3.h b/include/hw/arm/smmuv3.h
> > index ee0b5ed74f..f50d8c72bd 100644
> > --- a/include/hw/arm/smmuv3.h
> > +++ b/include/hw/arm/smmuv3.h
> > @@ -72,6 +72,7 @@ struct SMMUv3State {
> >      bool ats;
> >      uint8_t oas;
> >      bool pasid;
> > +    uint64_t msi_doorbell;
> >  };
> >
> >  typedef enum {
> > diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> > index b731443c67..e1709b0bfe 100644
> > --- a/include/hw/pci/pci.h
> > +++ b/include/hw/pci/pci.h
> > @@ -679,6 +679,20 @@ typedef struct PCIIOMMUOps {
> >       */
> >      AddressSpace * (*get_msi_address_space)(PCIBus *bus, void *opaque,
> >                                              int devfn);
> > +    /**
> > +     * @get_msi_address: get the address of MSI doorbell for the device
> (gpa) address
> > +     * on a PCI bus.
> > +     *
> > +     * Optional callback, if implemented must return a valid MSI doorbell
> > +     * address.
> > +     *
> > +     * @bus: the #PCIBus being accessed.
> > +     *
> > +     * @opaque: the data passed to pci_setup_iommu().
> > +     *
> > +     * @devfn: device and function number
> > +     */
> > +    uint64_t (*get_msi_address)(PCIBus *bus, void *opaque, int devfn);
> >  } PCIIOMMUOps;
> >
> >  bool pci_device_get_iommu_bus_devfn(PCIDevice *dev, PCIBus
> **piommu_bus,
> > @@ -688,6 +702,7 @@ bool pci_device_set_iommu_device(PCIDevice *dev,
> HostIOMMUDevice *hiod,
> >                                   Error **errp);
> >  void pci_device_unset_iommu_device(PCIDevice *dev);
> >  AddressSpace *pci_device_iommu_msi_address_space(PCIDevice *dev);
> > +bool pci_device_iommu_msi_direct_address(PCIDevice *dev, hwaddr
> *out_doorbell);
> >
> >  /**
> >   * pci_device_get_viommu_flags: get vIOMMU flags.
> > diff --git a/target/arm/kvm.c b/target/arm/kvm.c
> > index 0df41128d0..8d4d2be0bc 100644
> > --- a/target/arm/kvm.c
> > +++ b/target/arm/kvm.c
> > @@ -1611,35 +1611,45 @@ int kvm_arm_set_irq(int cpu, int irqtype, int irq,
> int level)
> >  int kvm_arch_fixup_msi_route(struct kvm_irq_routing_entry *route,
> >                               uint64_t address, uint32_t data, PCIDevice *dev)
> >  {
> > -    AddressSpace *as = pci_device_iommu_msi_address_space(dev);
> > +    AddressSpace *as;
> >      hwaddr xlat, len, doorbell_gpa;
> >      MemoryRegionSection mrs;
> >      MemoryRegion *mr;
> >
> > +    /* Check if there is a direct msi address available */
> > +    if (pci_device_iommu_msi_direct_address(dev, &doorbell_gpa)) {
> > +        goto set_doorbell;
> > +    }
> > +
> > +    as = pci_device_iommu_msi_address_space(dev);
> logically this should be after the test below (i.e. meaning we have an
> IOMMU). But this means that you would use an AS which is not
> address_space_memory.

Ok. I will move it then.

> 
> This works but it is not neat either because it totally ignores the
> @address. So you have to build a solid commit msg to explain to readers
> why this is needed ;-)

Sure. I will try to do a solid one explaining why we don’t need @address for 
this path😊.

Thanks,
Shameer

^ permalink raw reply	[flat|nested] 148+ messages in thread

end of thread, other threads:[~2025-11-07 10:27 UTC | newest]

Thread overview: 148+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-10-31 10:49 [PATCH v5 00/32] hw/arm/virt: Add support for user-creatable accelerated SMMUv3 Shameer Kolothum
2025-10-31 10:49 ` [PATCH v5 01/32] backends/iommufd: Introduce iommufd_backend_alloc_viommu Shameer Kolothum
2025-10-31 10:49 ` [PATCH v5 02/32] backends/iommufd: Introduce iommufd_backend_alloc_vdev Shameer Kolothum
2025-10-31 10:49 ` [PATCH v5 03/32] hw/arm/smmu-common: Factor out common helper functions and export Shameer Kolothum
2025-10-31 10:49 ` [PATCH v5 04/32] hw/arm/smmu-common: Make iommu ops part of SMMUState Shameer Kolothum
2025-10-31 10:49 ` [PATCH v5 05/32] hw/arm/smmuv3-accel: Introduce smmuv3 accel device Shameer Kolothum
2025-10-31 10:49 ` [PATCH v5 06/32] hw/arm/smmuv3-accel: Initialize shared system address space Shameer Kolothum
2025-10-31 21:10   ` Nicolin Chen
2025-11-03 14:17     ` Shameer Kolothum
2025-11-03 13:12   ` Jonathan Cameron via
2025-11-03 15:53     ` Shameer Kolothum
2025-11-03 13:39   ` Philippe Mathieu-Daudé
2025-11-03 16:30     ` Eric Auger
2025-10-31 10:49 ` [PATCH v5 07/32] hw/pci/pci: Move pci_init_bus_master() after adding device to bus Shameer Kolothum
2025-11-03 13:24   ` Jonathan Cameron via
2025-11-03 16:40   ` Eric Auger
2025-10-31 10:49 ` [PATCH v5 08/32] hw/pci/pci: Add optional supports_address_space() callback Shameer Kolothum
2025-11-03 13:30   ` Jonathan Cameron via
2025-11-03 16:47   ` Eric Auger
2025-10-31 10:49 ` [PATCH v5 09/32] hw/pci-bridge/pci_expander_bridge: Move TYPE_PXB_PCIE_DEV to header Shameer Kolothum
2025-11-03 13:30   ` Jonathan Cameron via
2025-11-03 14:25   ` Eric Auger
2025-10-31 10:49 ` [PATCH v5 10/32] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd Shameer Kolothum
2025-11-03 16:51   ` Eric Auger
2025-10-31 10:49 ` [PATCH v5 11/32] hw/arm/smmuv3: Implement get_viommu_cap() callback Shameer Kolothum
2025-11-03 16:55   ` Eric Auger
2025-10-31 10:49 ` [PATCH v5 12/32] hw/arm/smmuv3-accel: Add set/unset_iommu_device callback Shameer Kolothum
2025-10-31 22:02   ` Nicolin Chen
2025-10-31 22:08     ` Nicolin Chen
2025-11-03 14:19     ` Shameer Kolothum
2025-10-31 10:49 ` [PATCH v5 13/32] hw/arm/smmuv3-accel: Add nested vSTE install/uninstall support Shameer Kolothum
2025-10-31 23:52   ` Nicolin Chen
2025-11-01  0:20     ` Nicolin Chen
2025-11-03 15:11     ` Shameer Kolothum
2025-11-03 17:32       ` Nicolin Chen
2025-11-04 11:05   ` Eric Auger
2025-11-04 12:26     ` Shameer Kolothum
2025-11-04 13:30       ` Eric Auger
2025-11-04 16:48       ` Nicolin Chen
2025-10-31 10:49 ` [PATCH v5 14/32] hw/arm/smmuv3-accel: Install SMMUv3 GBPA based hwpt Shameer Kolothum
2025-11-04 13:28   ` Eric Auger
2025-10-31 10:49 ` [PATCH v5 15/32] hw/pci/pci: Introduce optional get_msi_address_space() callback Shameer Kolothum
2025-11-04 14:11   ` Eric Auger
2025-11-04 14:20     ` Jason Gunthorpe
2025-11-04 14:42       ` Shameer Kolothum
2025-11-04 14:51         ` Jason Gunthorpe
2025-11-04 14:58           ` Shameer Kolothum
2025-11-04 15:12             ` Jason Gunthorpe
2025-11-04 15:20               ` Shameer Kolothum
2025-11-04 15:35                 ` Jason Gunthorpe
2025-11-04 17:11                   ` Nicolin Chen
2025-11-04 17:41                     ` Jason Gunthorpe
2025-11-04 17:57                       ` Nicolin Chen
2025-11-04 18:09                         ` Jason Gunthorpe
2025-11-04 18:44                           ` Nicolin Chen
2025-11-04 18:56                             ` Jason Gunthorpe
2025-11-04 19:31                               ` Nicolin Chen
2025-11-04 19:35                                 ` Jason Gunthorpe
2025-11-04 19:43                                   ` Nicolin Chen
2025-11-04 19:45                                     ` Jason Gunthorpe
2025-11-04 19:59                                       ` Nicolin Chen
2025-11-04 19:46                                   ` Shameer Kolothum
2025-11-05 12:52                                     ` Jason Gunthorpe
2025-11-05 17:32                       ` Eric Auger
2025-11-04 14:37     ` Shameer Kolothum
2025-11-04 14:44       ` Eric Auger
2025-11-04 15:14         ` Shameer Kolothum
2025-11-04 16:01           ` Eric Auger
2025-11-04 17:47             ` Nicolin Chen
2025-11-05  7:47               ` Eric Auger
2025-11-05 19:30                 ` Nicolin Chen
2025-11-04 19:08             ` Shameer Kolothum
2025-11-05  8:56           ` Eric Auger
2025-11-05 11:41             ` Shameer Kolothum
2025-11-05 17:25               ` Eric Auger
2025-11-05 18:10                 ` Jason Gunthorpe
2025-11-05 18:33                   ` Nicolin Chen
2025-11-05 18:58                     ` Jason Gunthorpe
2025-11-05 19:33                       ` Nicolin Chen
2025-11-06  7:42                       ` Eric Auger
2025-11-06 11:48                         ` Shameer Kolothum
2025-11-06 17:04                           ` Eric Auger
2025-11-07 10:27                             ` Shameer Kolothum
2025-11-06 14:32                         ` Jason Gunthorpe
2025-11-06 15:47                           ` Eric Auger
2025-11-05 18:33                   ` Shameer Kolothum
2025-10-31 10:49 ` [PATCH v5 16/32] hw/arm/smmuv3-accel: Make use of " Shameer Kolothum
2025-10-31 23:57   ` Nicolin Chen
2025-11-03 15:19     ` Shameer Kolothum
2025-11-03 17:34       ` Nicolin Chen
2025-10-31 10:49 ` [PATCH v5 17/32] hw/arm/smmuv3-accel: Add support to issue invalidation cmd to host Shameer Kolothum
2025-11-01  0:35   ` Nicolin Chen via
2025-11-03 15:28     ` Shameer Kolothum
2025-11-03 17:43       ` Nicolin Chen
2025-11-03 18:17         ` Shameer Kolothum
2025-11-03 18:51           ` Nicolin Chen
2025-11-04  8:55             ` Eric Auger
2025-11-04 16:41               ` Nicolin Chen
2025-11-03 17:11   ` Eric Auger
2025-10-31 10:49 ` [PATCH v5 18/32] hw/arm/smmuv3: Initialize ID registers early during realize() Shameer Kolothum
2025-11-01  0:24   ` Nicolin Chen
2025-11-03 13:57   ` Jonathan Cameron via
2025-11-03 15:11   ` Eric Auger
2025-10-31 10:49 ` [PATCH v5 19/32] hw/arm/smmuv3-accel: Get host SMMUv3 hw info and validate Shameer Kolothum
2025-11-01  0:49   ` Nicolin Chen
2025-11-01 14:20   ` Zhangfei Gao
2025-11-03 15:42     ` Shameer Kolothum
2025-11-03 17:16       ` Eric Auger
2025-11-03 14:47   ` Jonathan Cameron via
2025-10-31 10:49 ` [PATCH v5 20/32] hw/pci-host/gpex: Allow to generate preserve boot config DSM #5 Shameer Kolothum
2025-11-03 13:58   ` Jonathan Cameron via
2025-10-31 10:49 ` [PATCH v5 21/32] hw/arm/virt: Set PCI preserve_config for accel SMMUv3 Shameer Kolothum
2025-11-03 14:58   ` Eric Auger
2025-11-03 15:03     ` Eric Auger via
2025-11-03 16:01     ` Shameer Kolothum
2025-10-31 10:49 ` [PATCH v5 22/32] tests/qtest/bios-tables-test: Prepare for IORT revison upgrade Shameer Kolothum
2025-11-03 14:48   ` Jonathan Cameron via
2025-11-03 14:59   ` Eric Auger
2025-10-31 10:49 ` [PATCH v5 23/32] hw/arm/virt-acpi-build: Add IORT RMR regions to handle MSI nested binding Shameer Kolothum
2025-11-03 14:53   ` Jonathan Cameron via
2025-11-03 15:43     ` Shameer Kolothum
2025-10-31 10:49 ` [PATCH v5 24/32] tests/qtest/bios-tables-test: Update IORT blobs after revision upgrade Shameer Kolothum
2025-11-03 14:54   ` Jonathan Cameron via
2025-11-03 15:01   ` Eric Auger
2025-10-31 10:49 ` [PATCH v5 25/32] hw/arm/smmuv3: Add accel property for SMMUv3 device Shameer Kolothum
2025-11-03 14:56   ` Jonathan Cameron via
2025-10-31 10:49 ` [PATCH v5 26/32] hw/arm/smmuv3-accel: Add a property to specify RIL support Shameer Kolothum
2025-11-03 15:07   ` Eric Auger
2025-11-03 16:08     ` Shameer Kolothum
2025-11-03 16:25       ` Eric Auger
2025-11-04  9:38   ` Eric Auger
2025-10-31 10:50 ` [PATCH v5 27/32] hw/arm/smmuv3-accel: Add support for ATS Shameer Kolothum
2025-11-04 14:22   ` Eric Auger
2025-10-31 10:50 ` [PATCH v5 28/32] hw/arm/smmuv3-accel: Add property to specify OAS bits Shameer Kolothum
2025-11-04 14:35   ` Eric Auger
2025-11-04 14:50     ` Jason Gunthorpe
2025-11-06  7:54       ` Eric Auger
2025-10-31 10:50 ` [PATCH v5 29/32] backends/iommufd: Retrieve PASID width from iommufd_backend_get_device_info() Shameer Kolothum
2025-10-31 10:50 ` [PATCH v5 30/32] Extend get_cap() callback to support PASID Shameer Kolothum
2025-11-03 14:58   ` Jonathan Cameron via
2025-11-06  8:45   ` Eric Auger
2025-10-31 10:50 ` [PATCH v5 31/32] vfio: Synthesize vPASID capability to VM Shameer Kolothum
2025-11-03 15:00   ` Jonathan Cameron via
2025-11-06 13:55   ` Eric Auger
2025-11-06 14:27     ` Shameer Kolothum
2025-11-06 15:44       ` Eric Auger
2025-10-31 10:50 ` [PATCH v5 32/32] hw/arm/smmuv3-accel: Add support for PASID enable Shameer Kolothum
2025-11-06 16:46   ` Eric Auger

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).