* [PATCH v4 01/27] backends/iommufd: Introduce iommufd_backend_alloc_viommu
2025-09-29 13:36 [PATCH v4 00/27] hw/arm/virt: Add support for user-creatable accelerated SMMUv3 Shameer Kolothum
@ 2025-09-29 13:36 ` Shameer Kolothum
2025-09-29 15:35 ` Jonathan Cameron via
2025-10-17 12:21 ` Eric Auger
2025-09-29 13:36 ` [PATCH v4 02/27] backends/iommufd: Introduce iommufd_vdev_alloc Shameer Kolothum
` (26 subsequent siblings)
27 siblings, 2 replies; 118+ messages in thread
From: Shameer Kolothum @ 2025-09-29 13:36 UTC (permalink / raw)
To: qemu-arm, qemu-devel
Cc: eric.auger, peter.maydell, jgg, nicolinc, ddutile, berrange,
nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu,
shameerkolothum
From: Nicolin Chen <nicolinc@nvidia.com>
Add a helper to allocate a viommu object.
Also introduce a struct IOMMUFDViommu that can be used later by vendor
IOMMU implementations.
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
backends/iommufd.c | 26 ++++++++++++++++++++++++++
backends/trace-events | 1 +
include/system/iommufd.h | 14 ++++++++++++++
3 files changed, 41 insertions(+)
diff --git a/backends/iommufd.c b/backends/iommufd.c
index 2a33c7ab0b..7b2e5ace2d 100644
--- a/backends/iommufd.c
+++ b/backends/iommufd.c
@@ -446,6 +446,32 @@ bool iommufd_backend_invalidate_cache(IOMMUFDBackend *be, uint32_t id,
return !ret;
}
+bool iommufd_backend_alloc_viommu(IOMMUFDBackend *be, uint32_t dev_id,
+ uint32_t viommu_type, uint32_t hwpt_id,
+ uint32_t *out_viommu_id, Error **errp)
+{
+ int ret;
+ struct iommu_viommu_alloc alloc_viommu = {
+ .size = sizeof(alloc_viommu),
+ .type = viommu_type,
+ .dev_id = dev_id,
+ .hwpt_id = hwpt_id,
+ };
+
+ ret = ioctl(be->fd, IOMMU_VIOMMU_ALLOC, &alloc_viommu);
+
+ trace_iommufd_backend_alloc_viommu(be->fd, dev_id, viommu_type, hwpt_id,
+ alloc_viommu.out_viommu_id, ret);
+ if (ret) {
+ error_setg_errno(errp, errno, "IOMMU_VIOMMU_ALLOC failed");
+ return false;
+ }
+
+ g_assert(out_viommu_id);
+ *out_viommu_id = alloc_viommu.out_viommu_id;
+ return true;
+}
+
bool host_iommu_device_iommufd_attach_hwpt(HostIOMMUDeviceIOMMUFD *idev,
uint32_t hwpt_id, Error **errp)
{
diff --git a/backends/trace-events b/backends/trace-events
index 56132d3fd2..01c2d9bde9 100644
--- a/backends/trace-events
+++ b/backends/trace-events
@@ -21,3 +21,4 @@ iommufd_backend_free_id(int iommufd, uint32_t id, int ret) " iommufd=%d id=%d (%
iommufd_backend_set_dirty(int iommufd, uint32_t hwpt_id, bool start, int ret) " iommufd=%d hwpt=%u enable=%d (%d)"
iommufd_backend_get_dirty_bitmap(int iommufd, uint32_t hwpt_id, uint64_t iova, uint64_t size, uint64_t page_size, int ret) " iommufd=%d hwpt=%u iova=0x%"PRIx64" size=0x%"PRIx64" page_size=0x%"PRIx64" (%d)"
iommufd_backend_invalidate_cache(int iommufd, uint32_t id, uint32_t data_type, uint32_t entry_len, uint32_t entry_num, uint32_t done_num, uint64_t data_ptr, int ret) " iommufd=%d id=%u data_type=%u entry_len=%u entry_num=%u done_num=%u data_ptr=0x%"PRIx64" (%d)"
+iommufd_backend_alloc_viommu(int iommufd, uint32_t dev_id, uint32_t type, uint32_t hwpt_id, uint32_t viommu_id, int ret) " iommufd=%d type=%u dev_id=%u hwpt_id=%u viommu_id=%u (%d)"
diff --git a/include/system/iommufd.h b/include/system/iommufd.h
index c9c72ffc45..dfe1dc2850 100644
--- a/include/system/iommufd.h
+++ b/include/system/iommufd.h
@@ -38,6 +38,16 @@ struct IOMMUFDBackend {
/*< public >*/
};
+/*
+ * Virtual IOMMU object that respresents physical IOMMU's virtualization
+ * support
+ */
+typedef struct IOMMUFDViommu {
+ IOMMUFDBackend *iommufd;
+ uint32_t s2_hwpt_id; /* Id of stage 2 HWPT */
+ uint32_t viommu_id; /* virtual IOMMU ID of allocated object */
+} IOMMUFDViommu;
+
bool iommufd_backend_connect(IOMMUFDBackend *be, Error **errp);
void iommufd_backend_disconnect(IOMMUFDBackend *be);
@@ -59,6 +69,10 @@ bool iommufd_backend_alloc_hwpt(IOMMUFDBackend *be, uint32_t dev_id,
uint32_t data_type, uint32_t data_len,
void *data_ptr, uint32_t *out_hwpt,
Error **errp);
+bool iommufd_backend_alloc_viommu(IOMMUFDBackend *be, uint32_t dev_id,
+ uint32_t viommu_type, uint32_t hwpt_id,
+ uint32_t *out_viommu_id, Error **errp);
+
bool iommufd_backend_set_dirty_tracking(IOMMUFDBackend *be, uint32_t hwpt_id,
bool start, Error **errp);
bool iommufd_backend_get_dirty_bitmap(IOMMUFDBackend *be, uint32_t hwpt_id,
--
2.43.0
^ permalink raw reply related [flat|nested] 118+ messages in thread
* Re: [PATCH v4 01/27] backends/iommufd: Introduce iommufd_backend_alloc_viommu
2025-09-29 13:36 ` [PATCH v4 01/27] backends/iommufd: Introduce iommufd_backend_alloc_viommu Shameer Kolothum
@ 2025-09-29 15:35 ` Jonathan Cameron via
2025-10-17 12:21 ` Eric Auger
1 sibling, 0 replies; 118+ messages in thread
From: Jonathan Cameron via @ 2025-09-29 15:35 UTC (permalink / raw)
To: Shameer Kolothum
Cc: qemu-arm, qemu-devel, eric.auger, peter.maydell, jgg, nicolinc,
ddutile, berrange, nathanc, mochs, smostafa, wangzhou1,
jiangkunkun, zhangfei.gao, zhenzhong.duan, yi.l.liu,
shameerkolothum
On Mon, 29 Sep 2025 14:36:17 +0100
Shameer Kolothum <skolothumtho@nvidia.com> wrote:
> From: Nicolin Chen <nicolinc@nvidia.com>
>
> Add a helper to allocate a viommu object.
>
> Also introduce a struct IOMMUFDViommu that can be used later by vendor
> IOMMU implementations.
>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> Reviewed-by: Eric Auger <eric.auger@redhat.com>
> Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
Some finest quality triviality inline.
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> diff --git a/backends/trace-events b/backends/trace-events
> index 56132d3fd2..01c2d9bde9 100644
> --- a/backends/trace-events
> +++ b/backends/trace-events
> @@ -21,3 +21,4 @@ iommufd_backend_free_id(int iommufd, uint32_t id, int ret) " iommufd=%d id=%d (%
> iommufd_backend_set_dirty(int iommufd, uint32_t hwpt_id, bool start, int ret) " iommufd=%d hwpt=%u enable=%d (%d)"
> iommufd_backend_get_dirty_bitmap(int iommufd, uint32_t hwpt_id, uint64_t iova, uint64_t size, uint64_t page_size, int ret) " iommufd=%d hwpt=%u iova=0x%"PRIx64" size=0x%"PRIx64" page_size=0x%"PRIx64" (%d)"
> iommufd_backend_invalidate_cache(int iommufd, uint32_t id, uint32_t data_type, uint32_t entry_len, uint32_t entry_num, uint32_t done_num, uint64_t data_ptr, int ret) " iommufd=%d id=%u data_type=%u entry_len=%u entry_num=%u done_num=%u data_ptr=0x%"PRIx64" (%d)"
> +iommufd_backend_alloc_viommu(int iommufd, uint32_t dev_id, uint32_t type, uint32_t hwpt_id, uint32_t viommu_id, int ret) " iommufd=%d type=%u dev_id=%u hwpt_id=%u viommu_id=%u (%d)"
> diff --git a/include/system/iommufd.h b/include/system/iommufd.h
> index c9c72ffc45..dfe1dc2850 100644
> --- a/include/system/iommufd.h
> +++ b/include/system/iommufd.h
> @@ -38,6 +38,16 @@ struct IOMMUFDBackend {
> /*< public >*/
> };
>
> +/*
> + * Virtual IOMMU object that respresents physical IOMMU's virtualization
> + * support
> + */
> +typedef struct IOMMUFDViommu {
> + IOMMUFDBackend *iommufd;
> + uint32_t s2_hwpt_id; /* Id of stage 2 HWPT */
Id or ID? I'd go with ID.
> + uint32_t viommu_id; /* virtual IOMMU ID of allocated object */
> +} IOMMUFDViommu;
* Re: [PATCH v4 01/27] backends/iommufd: Introduce iommufd_backend_alloc_viommu
2025-09-29 13:36 ` [PATCH v4 01/27] backends/iommufd: Introduce iommufd_backend_alloc_viommu Shameer Kolothum
2025-09-29 15:35 ` Jonathan Cameron via
@ 2025-10-17 12:21 ` Eric Auger
1 sibling, 0 replies; 118+ messages in thread
From: Eric Auger @ 2025-10-17 12:21 UTC (permalink / raw)
To: Shameer Kolothum, qemu-arm, qemu-devel
Cc: peter.maydell, jgg, nicolinc, ddutile, berrange, nathanc, mochs,
smostafa, wangzhou1, jiangkunkun, jonathan.cameron, zhangfei.gao,
zhenzhong.duan, yi.l.liu, shameerkolothum
On 9/29/25 3:36 PM, Shameer Kolothum wrote:
> From: Nicolin Chen <nicolinc@nvidia.com>
>
> Add a helper to allocate a viommu object.
>
> Also introduce a struct IOMMUFDViommu that can be used later by vendor
> IOMMU implementations.
>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> Reviewed-by: Eric Auger <eric.auger@redhat.com>
> Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
> ---
> backends/iommufd.c | 26 ++++++++++++++++++++++++++
> backends/trace-events | 1 +
> include/system/iommufd.h | 14 ++++++++++++++
> 3 files changed, 41 insertions(+)
>
> diff --git a/backends/iommufd.c b/backends/iommufd.c
> index 2a33c7ab0b..7b2e5ace2d 100644
> --- a/backends/iommufd.c
> +++ b/backends/iommufd.c
> @@ -446,6 +446,32 @@ bool iommufd_backend_invalidate_cache(IOMMUFDBackend *be, uint32_t id,
> return !ret;
> }
>
> +bool iommufd_backend_alloc_viommu(IOMMUFDBackend *be, uint32_t dev_id,
> + uint32_t viommu_type, uint32_t hwpt_id,
> + uint32_t *out_viommu_id, Error **errp)
> +{
> + int ret;
> + struct iommu_viommu_alloc alloc_viommu = {
> + .size = sizeof(alloc_viommu),
> + .type = viommu_type,
> + .dev_id = dev_id,
> + .hwpt_id = hwpt_id,
> + };
> +
> + ret = ioctl(be->fd, IOMMU_VIOMMU_ALLOC, &alloc_viommu);
> +
> + trace_iommufd_backend_alloc_viommu(be->fd, dev_id, viommu_type, hwpt_id,
> + alloc_viommu.out_viommu_id, ret);
> + if (ret) {
> + error_setg_errno(errp, errno, "IOMMU_VIOMMU_ALLOC failed");
> + return false;
> + }
> +
> + g_assert(out_viommu_id);
> + *out_viommu_id = alloc_viommu.out_viommu_id;
> + return true;
> +}
> +
> bool host_iommu_device_iommufd_attach_hwpt(HostIOMMUDeviceIOMMUFD *idev,
> uint32_t hwpt_id, Error **errp)
> {
> diff --git a/backends/trace-events b/backends/trace-events
> index 56132d3fd2..01c2d9bde9 100644
> --- a/backends/trace-events
> +++ b/backends/trace-events
> @@ -21,3 +21,4 @@ iommufd_backend_free_id(int iommufd, uint32_t id, int ret) " iommufd=%d id=%d (%
> iommufd_backend_set_dirty(int iommufd, uint32_t hwpt_id, bool start, int ret) " iommufd=%d hwpt=%u enable=%d (%d)"
> iommufd_backend_get_dirty_bitmap(int iommufd, uint32_t hwpt_id, uint64_t iova, uint64_t size, uint64_t page_size, int ret) " iommufd=%d hwpt=%u iova=0x%"PRIx64" size=0x%"PRIx64" page_size=0x%"PRIx64" (%d)"
> iommufd_backend_invalidate_cache(int iommufd, uint32_t id, uint32_t data_type, uint32_t entry_len, uint32_t entry_num, uint32_t done_num, uint64_t data_ptr, int ret) " iommufd=%d id=%u data_type=%u entry_len=%u entry_num=%u done_num=%u data_ptr=0x%"PRIx64" (%d)"
> +iommufd_backend_alloc_viommu(int iommufd, uint32_t dev_id, uint32_t type, uint32_t hwpt_id, uint32_t viommu_id, int ret) " iommufd=%d type=%u dev_id=%u hwpt_id=%u viommu_id=%u (%d)"
> diff --git a/include/system/iommufd.h b/include/system/iommufd.h
> index c9c72ffc45..dfe1dc2850 100644
> --- a/include/system/iommufd.h
> +++ b/include/system/iommufd.h
> @@ -38,6 +38,16 @@ struct IOMMUFDBackend {
> /*< public >*/
> };
>
> +/*
> + * Virtual IOMMU object that respresents physical IOMMU's virtualization
represents
Eric
> + * support
> + */
> +typedef struct IOMMUFDViommu {
> + IOMMUFDBackend *iommufd;
> + uint32_t s2_hwpt_id; /* Id of stage 2 HWPT */
> + uint32_t viommu_id; /* virtual IOMMU ID of allocated object */
> +} IOMMUFDViommu;
> +
> bool iommufd_backend_connect(IOMMUFDBackend *be, Error **errp);
> void iommufd_backend_disconnect(IOMMUFDBackend *be);
>
> @@ -59,6 +69,10 @@ bool iommufd_backend_alloc_hwpt(IOMMUFDBackend *be, uint32_t dev_id,
> uint32_t data_type, uint32_t data_len,
> void *data_ptr, uint32_t *out_hwpt,
> Error **errp);
> +bool iommufd_backend_alloc_viommu(IOMMUFDBackend *be, uint32_t dev_id,
> + uint32_t viommu_type, uint32_t hwpt_id,
> + uint32_t *out_hwpt, Error **errp);
> +
> bool iommufd_backend_set_dirty_tracking(IOMMUFDBackend *be, uint32_t hwpt_id,
> bool start, Error **errp);
> bool iommufd_backend_get_dirty_bitmap(IOMMUFDBackend *be, uint32_t hwpt_id,
* [PATCH v4 02/27] backends/iommufd: Introduce iommufd_vdev_alloc
2025-09-29 13:36 [PATCH v4 00/27] hw/arm/virt: Add support for user-creatable accelerated SMMUv3 Shameer Kolothum
2025-09-29 13:36 ` [PATCH v4 01/27] backends/iommufd: Introduce iommufd_backend_alloc_viommu Shameer Kolothum
@ 2025-09-29 13:36 ` Shameer Kolothum
2025-09-29 15:40 ` Jonathan Cameron via
2025-09-29 17:52 ` Nicolin Chen
2025-09-29 13:36 ` [PATCH v4 03/27] hw/arm/smmu-common: Factor out common helper functions and export Shameer Kolothum
` (25 subsequent siblings)
27 siblings, 2 replies; 118+ messages in thread
From: Shameer Kolothum @ 2025-09-29 13:36 UTC (permalink / raw)
To: qemu-arm, qemu-devel
Cc: eric.auger, peter.maydell, jgg, nicolinc, ddutile, berrange,
nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu,
shameerkolothum
From: Nicolin Chen <nicolinc@nvidia.com>
Add a helper to allocate an iommufd device's virtual device (in user
space), one per vIOMMU instance.
While at it, introduce a struct IOMMUFDVdev for later use by vendor
IOMMU implementations.
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
backends/iommufd.c | 27 +++++++++++++++++++++++++++
backends/trace-events | 1 +
include/system/iommufd.h | 12 ++++++++++++
3 files changed, 40 insertions(+)
diff --git a/backends/iommufd.c b/backends/iommufd.c
index 7b2e5ace2d..d3029d4658 100644
--- a/backends/iommufd.c
+++ b/backends/iommufd.c
@@ -472,6 +472,33 @@ bool iommufd_backend_alloc_viommu(IOMMUFDBackend *be, uint32_t dev_id,
return true;
}
+bool iommufd_backend_alloc_vdev(IOMMUFDBackend *be, uint32_t dev_id,
+ uint32_t viommu_id, uint64_t virt_id,
+ uint32_t *out_vdev_id, Error **errp)
+{
+ int ret;
+ struct iommu_vdevice_alloc alloc_vdev = {
+ .size = sizeof(alloc_vdev),
+ .viommu_id = viommu_id,
+ .dev_id = dev_id,
+ .virt_id = virt_id,
+ };
+
+ ret = ioctl(be->fd, IOMMU_VDEVICE_ALLOC, &alloc_vdev);
+
+ trace_iommufd_backend_alloc_vdev(be->fd, dev_id, viommu_id, virt_id,
+ alloc_vdev.out_vdevice_id, ret);
+
+ if (ret) {
+ error_setg_errno(errp, errno, "IOMMU_VDEVICE_ALLOC failed");
+ return false;
+ }
+
+ g_assert(out_vdev_id);
+ *out_vdev_id = alloc_vdev.out_vdevice_id;
+ return true;
+}
+
bool host_iommu_device_iommufd_attach_hwpt(HostIOMMUDeviceIOMMUFD *idev,
uint32_t hwpt_id, Error **errp)
{
diff --git a/backends/trace-events b/backends/trace-events
index 01c2d9bde9..8408dc8701 100644
--- a/backends/trace-events
+++ b/backends/trace-events
@@ -22,3 +22,4 @@ iommufd_backend_set_dirty(int iommufd, uint32_t hwpt_id, bool start, int ret) "
iommufd_backend_get_dirty_bitmap(int iommufd, uint32_t hwpt_id, uint64_t iova, uint64_t size, uint64_t page_size, int ret) " iommufd=%d hwpt=%u iova=0x%"PRIx64" size=0x%"PRIx64" page_size=0x%"PRIx64" (%d)"
iommufd_backend_invalidate_cache(int iommufd, uint32_t id, uint32_t data_type, uint32_t entry_len, uint32_t entry_num, uint32_t done_num, uint64_t data_ptr, int ret) " iommufd=%d id=%u data_type=%u entry_len=%u entry_num=%u done_num=%u data_ptr=0x%"PRIx64" (%d)"
iommufd_backend_alloc_viommu(int iommufd, uint32_t dev_id, uint32_t type, uint32_t hwpt_id, uint32_t viommu_id, int ret) " iommufd=%d type=%u dev_id=%u hwpt_id=%u viommu_id=%u (%d)"
+iommufd_backend_alloc_vdev(int iommufd, uint32_t dev_id, uint32_t viommu_id, uint64_t virt_id, uint32_t vdev_id, int ret) " iommufd=%d dev_id=%u viommu_id=%u virt_id=0x%"PRIx64" vdev_id=%u (%d)"
diff --git a/include/system/iommufd.h b/include/system/iommufd.h
index dfe1dc2850..e852193f35 100644
--- a/include/system/iommufd.h
+++ b/include/system/iommufd.h
@@ -48,6 +48,14 @@ typedef struct IOMMUFDViommu {
uint32_t viommu_id; /* virtual IOMMU ID of allocated object */
} IOMMUFDViommu;
+/*
+ * Virtual device object for a physical device bound to a vIOMMU.
+ */
+typedef struct IOMMUFDVdev {
+ uint32_t vdev_id; /* Virtual device ID */
+ uint32_t dev_id; /* Physical device ID */
+} IOMMUFDVdev;
+
bool iommufd_backend_connect(IOMMUFDBackend *be, Error **errp);
void iommufd_backend_disconnect(IOMMUFDBackend *be);
@@ -73,6 +81,10 @@ bool iommufd_backend_alloc_viommu(IOMMUFDBackend *be, uint32_t dev_id,
uint32_t viommu_type, uint32_t hwpt_id,
uint32_t *out_viommu_id, Error **errp);
+bool iommufd_backend_alloc_vdev(IOMMUFDBackend *be, uint32_t dev_id,
+ uint32_t viommu_id, uint64_t virt_id,
+ uint32_t *out_vdev_id, Error **errp);
+
bool iommufd_backend_set_dirty_tracking(IOMMUFDBackend *be, uint32_t hwpt_id,
bool start, Error **errp);
bool iommufd_backend_get_dirty_bitmap(IOMMUFDBackend *be, uint32_t hwpt_id,
--
2.43.0
* Re: [PATCH v4 02/27] backends/iommufd: Introduce iommufd_vdev_alloc
2025-09-29 13:36 ` [PATCH v4 02/27] backends/iommufd: Introduce iommufd_vdev_alloc Shameer Kolothum
@ 2025-09-29 15:40 ` Jonathan Cameron via
2025-09-29 17:52 ` Nicolin Chen
1 sibling, 0 replies; 118+ messages in thread
From: Jonathan Cameron via @ 2025-09-29 15:40 UTC (permalink / raw)
To: Shameer Kolothum
Cc: qemu-arm, qemu-devel, eric.auger, peter.maydell, jgg, nicolinc,
ddutile, berrange, nathanc, mochs, smostafa, wangzhou1,
jiangkunkun, zhangfei.gao, zhenzhong.duan, yi.l.liu,
shameerkolothum
On Mon, 29 Sep 2025 14:36:18 +0100
Shameer Kolothum <skolothumtho@nvidia.com> wrote:
> From: Nicolin Chen <nicolinc@nvidia.com>
>
> Add a helper to allocate an iommufd device's virtual device (in the user
> space) per a viommu instance.
>
> While at it, introduce a struct IOMMUFDVdev for later use by vendor
> IOMMU implementations.
>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> Reviewed-by: Eric Auger <eric.auger@redhat.com>
> Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
A theme emerging. See below.
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> diff --git a/include/system/iommufd.h b/include/system/iommufd.h
> index dfe1dc2850..e852193f35 100644
> --- a/include/system/iommufd.h
> +++ b/include/system/iommufd.h
> @@ -48,6 +48,14 @@ typedef struct IOMMUFDViommu {
> uint32_t viommu_id; /* virtual IOMMU ID of allocated object */
> } IOMMUFDViommu;
>
> +/*
> + * Virtual device object for a physical device bind to a vIOMMU.
> + */
> +typedef struct IOMMUFDVdev {
> + uint32_t vdev_id; /* Virtual device ID */
> + uint32_t dev_id; /* Physical device ID */
Spacing and capitalization are a bit inconsistent... The previous patch had a lowercase v in
similar comments.
> +} IOMMUFDVdev;
> +
> bool iommufd_backend_connect(IOMMUFDBackend *be, Error **errp);
> void iommufd_backend_disconnect(IOMMUFDBackend *be);
>
> @@ -73,6 +81,10 @@ bool iommufd_backend_alloc_viommu(IOMMUFDBackend *be, uint32_t dev_id,
> uint32_t viommu_type, uint32_t hwpt_id,
> uint32_t *out_hwpt, Error **errp);
>
> +bool iommufd_backend_alloc_vdev(IOMMUFDBackend *be, uint32_t dev_id,
> + uint32_t viommu_id, uint64_t virt_id,
> + uint32_t *out_vdev_id, Error **errp);
> +
> bool iommufd_backend_set_dirty_tracking(IOMMUFDBackend *be, uint32_t hwpt_id,
> bool start, Error **errp);
> bool iommufd_backend_get_dirty_bitmap(IOMMUFDBackend *be, uint32_t hwpt_id,
* Re: [PATCH v4 02/27] backends/iommufd: Introduce iommufd_vdev_alloc
2025-09-29 13:36 ` [PATCH v4 02/27] backends/iommufd: Introduce iommufd_vdev_alloc Shameer Kolothum
2025-09-29 15:40 ` Jonathan Cameron via
@ 2025-09-29 17:52 ` Nicolin Chen
2025-09-30 8:14 ` Shameer Kolothum
1 sibling, 1 reply; 118+ messages in thread
From: Nicolin Chen @ 2025-09-29 17:52 UTC (permalink / raw)
To: Shameer Kolothum
Cc: qemu-arm, qemu-devel, eric.auger, peter.maydell, jgg, ddutile,
berrange, nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu,
shameerkolothum
On Mon, Sep 29, 2025 at 02:36:18PM +0100, Shameer Kolothum wrote:
> From: Nicolin Chen <nicolinc@nvidia.com>
>
> Add a helper to allocate an iommufd device's virtual device (in the user
> space) per a viommu instance.
>
> While at it, introduce a struct IOMMUFDVdev for later use by vendor
> IOMMU implementations.
>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> Reviewed-by: Eric Auger <eric.auger@redhat.com>
> Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
> ---
> backends/iommufd.c | 27 +++++++++++++++++++++++++++
> backends/trace-events | 1 +
> include/system/iommufd.h | 12 ++++++++++++
> 3 files changed, 40 insertions(+)
>
> diff --git a/backends/iommufd.c b/backends/iommufd.c
> index 7b2e5ace2d..d3029d4658 100644
> --- a/backends/iommufd.c
> +++ b/backends/iommufd.c
> @@ -472,6 +472,33 @@ bool iommufd_backend_alloc_viommu(IOMMUFDBackend *be, uint32_t dev_id,
> return true;
> }
>
> +bool iommufd_backend_alloc_vdev(IOMMUFDBackend *be, uint32_t dev_id,
> + uint32_t viommu_id, uint64_t virt_id,
> + uint32_t *out_vdev_id, Error **errp)
The function name in the subject is now mismatched and should be
updated.
Nicolin
* RE: [PATCH v4 02/27] backends/iommufd: Introduce iommufd_vdev_alloc
2025-09-29 17:52 ` Nicolin Chen
@ 2025-09-30 8:14 ` Shameer Kolothum
0 siblings, 0 replies; 118+ messages in thread
From: Shameer Kolothum @ 2025-09-30 8:14 UTC (permalink / raw)
To: Nicolin Chen
Cc: qemu-arm@nongnu.org, qemu-devel@nongnu.org, eric.auger@redhat.com,
peter.maydell@linaro.org, Jason Gunthorpe, ddutile@redhat.com,
berrange@redhat.com, Nathan Chen, Matt Ochs, smostafa@google.com,
wangzhou1@hisilicon.com, jiangkunkun@huawei.com,
jonathan.cameron@huawei.com, zhangfei.gao@linaro.org,
zhenzhong.duan@intel.com, yi.l.liu@intel.com,
shameerkolothum@gmail.com
> -----Original Message-----
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: 29 September 2025 18:52
> To: Shameer Kolothum <skolothumtho@nvidia.com>
> Cc: qemu-arm@nongnu.org; qemu-devel@nongnu.org;
> eric.auger@redhat.com; peter.maydell@linaro.org; Jason Gunthorpe
> <jgg@nvidia.com>; ddutile@redhat.com; berrange@redhat.com; Nathan
> Chen <nathanc@nvidia.com>; Matt Ochs <mochs@nvidia.com>;
> smostafa@google.com; wangzhou1@hisilicon.com;
> jiangkunkun@huawei.com; jonathan.cameron@huawei.com;
> zhangfei.gao@linaro.org; zhenzhong.duan@intel.com; yi.l.liu@intel.com;
> shameerkolothum@gmail.com
> Subject: Re: [PATCH v4 02/27] backends/iommufd: Introduce
> iommufd_vdev_alloc
>
> On Mon, Sep 29, 2025 at 02:36:18PM +0100, Shameer Kolothum wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > +bool iommufd_backend_alloc_vdev(IOMMUFDBackend *be, uint32_t
> dev_id,
> > + uint32_t viommu_id, uint64_t virt_id,
> > + uint32_t *out_vdev_id, Error **errp)
>
> The function name in the subject is now mismatched and should be updated.
Oops..missed that.
Thanks,
Shameer
* [PATCH v4 03/27] hw/arm/smmu-common: Factor out common helper functions and export
2025-09-29 13:36 [PATCH v4 00/27] hw/arm/virt: Add support for user-creatable accelerated SMMUv3 Shameer Kolothum
2025-09-29 13:36 ` [PATCH v4 01/27] backends/iommufd: Introduce iommufd_backend_alloc_viommu Shameer Kolothum
2025-09-29 13:36 ` [PATCH v4 02/27] backends/iommufd: Introduce iommufd_vdev_alloc Shameer Kolothum
@ 2025-09-29 13:36 ` Shameer Kolothum
2025-09-29 15:43 ` Jonathan Cameron via
2025-09-29 13:36 ` [PATCH v4 04/27] hw/arm/smmu-common:Make iommu ops part of SMMUState Shameer Kolothum
` (24 subsequent siblings)
27 siblings, 1 reply; 118+ messages in thread
From: Shameer Kolothum @ 2025-09-29 13:36 UTC (permalink / raw)
To: qemu-arm, qemu-devel
Cc: eric.auger, peter.maydell, jgg, nicolinc, ddutile, berrange,
nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu,
shameerkolothum
Subsequent patches for smmuv3 accel support will make use of this.
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
hw/arm/smmu-common.c | 44 +++++++++++++++++++++---------------
include/hw/arm/smmu-common.h | 6 +++++
2 files changed, 32 insertions(+), 18 deletions(-)
diff --git a/hw/arm/smmu-common.c b/hw/arm/smmu-common.c
index 62a7612184..59d6147ec9 100644
--- a/hw/arm/smmu-common.c
+++ b/hw/arm/smmu-common.c
@@ -847,12 +847,24 @@ SMMUPciBus *smmu_find_smmu_pcibus(SMMUState *s, uint8_t bus_num)
return NULL;
}
-static AddressSpace *smmu_find_add_as(PCIBus *bus, void *opaque, int devfn)
+void smmu_init_sdev(SMMUState *s, SMMUDevice *sdev, PCIBus *bus, int devfn)
{
- SMMUState *s = opaque;
- SMMUPciBus *sbus = g_hash_table_lookup(s->smmu_pcibus_by_busptr, bus);
- SMMUDevice *sdev;
static unsigned int index;
+ g_autofree char *name = g_strdup_printf("%s-%d-%d", s->mrtypename, devfn,
+ index++);
+ sdev->smmu = s;
+ sdev->bus = bus;
+ sdev->devfn = devfn;
+
+ memory_region_init_iommu(&sdev->iommu, sizeof(sdev->iommu),
+ s->mrtypename, OBJECT(s), name, UINT64_MAX);
+ address_space_init(&sdev->as, MEMORY_REGION(&sdev->iommu), name);
+ trace_smmu_add_mr(name);
+}
+
+SMMUPciBus *smmu_get_sbus(SMMUState *s, PCIBus *bus)
+{
+ SMMUPciBus *sbus = g_hash_table_lookup(s->smmu_pcibus_by_busptr, bus);
if (!sbus) {
sbus = g_malloc0(sizeof(SMMUPciBus) +
@@ -861,23 +873,19 @@ static AddressSpace *smmu_find_add_as(PCIBus *bus, void *opaque, int devfn)
g_hash_table_insert(s->smmu_pcibus_by_busptr, bus, sbus);
}
+ return sbus;
+}
+
+static AddressSpace *smmu_find_add_as(PCIBus *bus, void *opaque, int devfn)
+{
+ SMMUState *s = opaque;
+ SMMUPciBus *sbus = smmu_get_sbus(s, bus);
+ SMMUDevice *sdev;
+
sdev = sbus->pbdev[devfn];
if (!sdev) {
- char *name = g_strdup_printf("%s-%d-%d", s->mrtypename, devfn, index++);
-
sdev = sbus->pbdev[devfn] = g_new0(SMMUDevice, 1);
-
- sdev->smmu = s;
- sdev->bus = bus;
- sdev->devfn = devfn;
-
- memory_region_init_iommu(&sdev->iommu, sizeof(sdev->iommu),
- s->mrtypename,
- OBJECT(s), name, UINT64_MAX);
- address_space_init(&sdev->as,
- MEMORY_REGION(&sdev->iommu), name);
- trace_smmu_add_mr(name);
- g_free(name);
+ smmu_init_sdev(s, sdev, bus, devfn);
}
return &sdev->as;
diff --git a/include/hw/arm/smmu-common.h b/include/hw/arm/smmu-common.h
index 80d0fecfde..c6f899e403 100644
--- a/include/hw/arm/smmu-common.h
+++ b/include/hw/arm/smmu-common.h
@@ -180,6 +180,12 @@ OBJECT_DECLARE_TYPE(SMMUState, SMMUBaseClass, ARM_SMMU)
/* Return the SMMUPciBus handle associated to a PCI bus number */
SMMUPciBus *smmu_find_smmu_pcibus(SMMUState *s, uint8_t bus_num);
+/* Return the SMMUPciBus handle associated to a PCI bus */
+SMMUPciBus *smmu_get_sbus(SMMUState *s, PCIBus *bus);
+
+/* Initialize SMMUDevice handle associated to a SMMUPCIBus */
+void smmu_init_sdev(SMMUState *s, SMMUDevice *sdev, PCIBus *bus, int devfn);
+
/* Return the stream ID of an SMMU device */
static inline uint16_t smmu_get_sid(SMMUDevice *sdev)
{
--
2.43.0
* Re: [PATCH v4 03/27] hw/arm/smmu-common: Factor out common helper functions and export
2025-09-29 13:36 ` [PATCH v4 03/27] hw/arm/smmu-common: Factor out common helper functions and export Shameer Kolothum
@ 2025-09-29 15:43 ` Jonathan Cameron via
0 siblings, 0 replies; 118+ messages in thread
From: Jonathan Cameron via @ 2025-09-29 15:43 UTC (permalink / raw)
To: Shameer Kolothum
Cc: qemu-arm, qemu-devel, eric.auger, peter.maydell, jgg, nicolinc,
ddutile, berrange, nathanc, mochs, smostafa, wangzhou1,
jiangkunkun, zhangfei.gao, zhenzhong.duan, yi.l.liu,
shameerkolothum
On Mon, 29 Sep 2025 14:36:19 +0100
Shameer Kolothum <skolothumtho@nvidia.com> wrote:
> Subsequent patches for smmuv3 accel support will make use of this.
>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> Reviewed-by: Eric Auger <eric.auger@redhat.com>
> Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
One trivial
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> diff --git a/include/hw/arm/smmu-common.h b/include/hw/arm/smmu-common.h
> index 80d0fecfde..c6f899e403 100644
> --- a/include/hw/arm/smmu-common.h
> +++ b/include/hw/arm/smmu-common.h
> @@ -180,6 +180,12 @@ OBJECT_DECLARE_TYPE(SMMUState, SMMUBaseClass, ARM_SMMU)
> /* Return the SMMUPciBus handle associated to a PCI bus number */
> SMMUPciBus *smmu_find_smmu_pcibus(SMMUState *s, uint8_t bus_num);
>
> +/* Return the SMMUPciBus handle associated to a PCI bus */
> +SMMUPciBus *smmu_get_sbus(SMMUState *s, PCIBus *bus);
> +
> +/* Initialize SMMUDevice handle associated to a SMMUPCIBus */
"Pci", assuming the intent is to match the type name SMMUPciBus.
> +void smmu_init_sdev(SMMUState *s, SMMUDevice *sdev, PCIBus *bus, int devfn);
> +
> /* Return the stream ID of an SMMU device */
> static inline uint16_t smmu_get_sid(SMMUDevice *sdev)
> {
* [PATCH v4 04/27] hw/arm/smmu-common:Make iommu ops part of SMMUState
2025-09-29 13:36 [PATCH v4 00/27] hw/arm/virt: Add support for user-creatable accelerated SMMUv3 Shameer Kolothum
` (2 preceding siblings ...)
2025-09-29 13:36 ` [PATCH v4 03/27] hw/arm/smmu-common: Factor out common helper functions and export Shameer Kolothum
@ 2025-09-29 13:36 ` Shameer Kolothum
2025-09-29 15:45 ` Jonathan Cameron via
` (2 more replies)
2025-09-29 13:36 ` [PATCH v4 05/27] hw/arm/smmuv3-accel: Introduce smmuv3 accel device Shameer Kolothum
` (23 subsequent siblings)
27 siblings, 3 replies; 118+ messages in thread
From: Shameer Kolothum @ 2025-09-29 13:36 UTC (permalink / raw)
To: qemu-arm, qemu-devel
Cc: eric.auger, peter.maydell, jgg, nicolinc, ddutile, berrange,
nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu,
shameerkolothum
And set to the current default smmu_ops. No functional change intended.
This will allow SMMUv3 accel implementation to set a different iommu ops
later.
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
hw/arm/smmu-common.c | 7 +++++--
include/hw/arm/smmu-common.h | 1 +
2 files changed, 6 insertions(+), 2 deletions(-)
diff --git a/hw/arm/smmu-common.c b/hw/arm/smmu-common.c
index 59d6147ec9..4d6516443e 100644
--- a/hw/arm/smmu-common.c
+++ b/hw/arm/smmu-common.c
@@ -952,6 +952,9 @@ static void smmu_base_realize(DeviceState *dev, Error **errp)
return;
}
+ if (!s->iommu_ops) {
+ s->iommu_ops = &smmu_ops;
+ }
/*
* We only allow default PCIe Root Complex(pcie.0) or pxb-pcie based extra
* root complexes to be associated with SMMU.
@@ -971,9 +974,9 @@ static void smmu_base_realize(DeviceState *dev, Error **errp)
}
if (s->smmu_per_bus) {
- pci_setup_iommu_per_bus(pci_bus, &smmu_ops, s);
+ pci_setup_iommu_per_bus(pci_bus, s->iommu_ops, s);
} else {
- pci_setup_iommu(pci_bus, &smmu_ops, s);
+ pci_setup_iommu(pci_bus, s->iommu_ops, s);
}
return;
}
diff --git a/include/hw/arm/smmu-common.h b/include/hw/arm/smmu-common.h
index c6f899e403..75b83b2b4a 100644
--- a/include/hw/arm/smmu-common.h
+++ b/include/hw/arm/smmu-common.h
@@ -162,6 +162,7 @@ struct SMMUState {
uint8_t bus_num;
PCIBus *primary_bus;
bool smmu_per_bus; /* SMMU is specific to the primary_bus */
+ const PCIIOMMUOps *iommu_ops;
};
struct SMMUBaseClass {
--
2.43.0
* Re: [PATCH v4 04/27] hw/arm/smmu-common:Make iommu ops part of SMMUState
2025-09-29 13:36 ` [PATCH v4 04/27] hw/arm/smmu-common:Make iommu ops part of SMMUState Shameer Kolothum
@ 2025-09-29 15:45 ` Jonathan Cameron via
2025-09-29 21:53 ` Nicolin Chen via
2025-10-01 16:11 ` Eric Auger
2 siblings, 0 replies; 118+ messages in thread
From: Jonathan Cameron via @ 2025-09-29 15:45 UTC (permalink / raw)
To: Shameer Kolothum
Cc: qemu-arm, qemu-devel, eric.auger, peter.maydell, jgg, nicolinc,
ddutile, berrange, nathanc, mochs, smostafa, wangzhou1,
jiangkunkun, zhangfei.gao, zhenzhong.duan, yi.l.liu,
shameerkolothum
On Mon, 29 Sep 2025 14:36:20 +0100
Shameer Kolothum <skolothumtho@nvidia.com> wrote:
Space between : and Make
I'd repeat the patch title bit of the sentence in here just to make
it more readable.
> And set to the current default smmu_ops. No functional change intended.
> This will allow SMMUv3 accel implementation to set a different iommu ops
> later.
>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
* Re: [PATCH v4 04/27] hw/arm/smmu-common:Make iommu ops part of SMMUState
2025-09-29 13:36 ` [PATCH v4 04/27] hw/arm/smmu-common:Make iommu ops part of SMMUState Shameer Kolothum
2025-09-29 15:45 ` Jonathan Cameron via
@ 2025-09-29 21:53 ` Nicolin Chen via
2025-10-01 16:11 ` Eric Auger
2 siblings, 0 replies; 118+ messages in thread
From: Nicolin Chen via @ 2025-09-29 21:53 UTC (permalink / raw)
To: Shameer Kolothum
Cc: qemu-arm, qemu-devel, eric.auger, peter.maydell, jgg, ddutile,
berrange, nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu,
shameerkolothum
On Mon, Sep 29, 2025 at 02:36:20PM +0100, Shameer Kolothum wrote:
> And set to the current default smmu_ops. No functional change intended.
> This will allow SMMUv3 accel implementation to set a different iommu ops
> later.
>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
* Re: [PATCH v4 04/27] hw/arm/smmu-common:Make iommu ops part of SMMUState
2025-09-29 13:36 ` [PATCH v4 04/27] hw/arm/smmu-common:Make iommu ops part of SMMUState Shameer Kolothum
2025-09-29 15:45 ` Jonathan Cameron via
2025-09-29 21:53 ` Nicolin Chen via
@ 2025-10-01 16:11 ` Eric Auger
2 siblings, 0 replies; 118+ messages in thread
From: Eric Auger @ 2025-10-01 16:11 UTC (permalink / raw)
To: Shameer Kolothum, qemu-arm, qemu-devel
Cc: peter.maydell, jgg, nicolinc, ddutile, berrange, nathanc, mochs,
smostafa, wangzhou1, jiangkunkun, jonathan.cameron, zhangfei.gao,
zhenzhong.duan, yi.l.liu, shameerkolothum
On 9/29/25 3:36 PM, Shameer Kolothum wrote:
> And set to the current default smmu_ops. No functional change intended.
> This will allow SMMUv3 accel implementation to set a different iommu ops
> later.
>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Eric
> ---
> hw/arm/smmu-common.c | 7 +++++--
> include/hw/arm/smmu-common.h | 1 +
> 2 files changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/hw/arm/smmu-common.c b/hw/arm/smmu-common.c
> index 59d6147ec9..4d6516443e 100644
> --- a/hw/arm/smmu-common.c
> +++ b/hw/arm/smmu-common.c
> @@ -952,6 +952,9 @@ static void smmu_base_realize(DeviceState *dev, Error **errp)
> return;
> }
>
> + if (!s->iommu_ops) {
> + s->iommu_ops = &smmu_ops;
> + }
> /*
> * We only allow default PCIe Root Complex(pcie.0) or pxb-pcie based extra
> * root complexes to be associated with SMMU.
> @@ -971,9 +974,9 @@ static void smmu_base_realize(DeviceState *dev, Error **errp)
> }
>
> if (s->smmu_per_bus) {
> - pci_setup_iommu_per_bus(pci_bus, &smmu_ops, s);
> + pci_setup_iommu_per_bus(pci_bus, s->iommu_ops, s);
> } else {
> - pci_setup_iommu(pci_bus, &smmu_ops, s);
> + pci_setup_iommu(pci_bus, s->iommu_ops, s);
> }
> return;
> }
> diff --git a/include/hw/arm/smmu-common.h b/include/hw/arm/smmu-common.h
> index c6f899e403..75b83b2b4a 100644
> --- a/include/hw/arm/smmu-common.h
> +++ b/include/hw/arm/smmu-common.h
> @@ -162,6 +162,7 @@ struct SMMUState {
> uint8_t bus_num;
> PCIBus *primary_bus;
> bool smmu_per_bus; /* SMMU is specific to the primary_bus */
> + const PCIIOMMUOps *iommu_ops;
> };
>
> struct SMMUBaseClass {
* [PATCH v4 05/27] hw/arm/smmuv3-accel: Introduce smmuv3 accel device
2025-09-29 13:36 [PATCH v4 00/27] hw/arm/virt: Add support for user-creatable accelerated SMMUv3 Shameer Kolothum
` (3 preceding siblings ...)
2025-09-29 13:36 ` [PATCH v4 04/27] hw/arm/smmu-common:Make iommu ops part of SMMUState Shameer Kolothum
@ 2025-09-29 13:36 ` Shameer Kolothum
2025-09-29 15:53 ` Jonathan Cameron via
` (2 more replies)
2025-09-29 13:36 ` [PATCH v4 06/27] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd Shameer Kolothum
` (22 subsequent siblings)
27 siblings, 3 replies; 118+ messages in thread
From: Shameer Kolothum @ 2025-09-29 13:36 UTC (permalink / raw)
To: qemu-arm, qemu-devel
Cc: eric.auger, peter.maydell, jgg, nicolinc, ddutile, berrange,
nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu,
shameerkolothum
Set up dedicated PCIIOMMUOps for the accel SMMUv3, since it will need
different callback handling in upcoming patches. This also adds a
CONFIG_ARM_SMMUV3_ACCEL build option so the feature can be disabled
at compile time. Because we now include CONFIG_DEVICES in the header to
check for ARM_SMMUV3_ACCEL, the meson file entry for smmuv3.c needs to
be changed as well.
The “accel” property isn’t user visible yet, it will be introduced in
a later patch once all the supporting pieces are ready.
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
hw/arm/Kconfig | 5 ++++
hw/arm/meson.build | 3 ++-
hw/arm/smmuv3-accel.c | 52 +++++++++++++++++++++++++++++++++++++++++
hw/arm/smmuv3-accel.h | 27 +++++++++++++++++++++
hw/arm/smmuv3.c | 5 ++++
include/hw/arm/smmuv3.h | 3 +++
6 files changed, 94 insertions(+), 1 deletion(-)
create mode 100644 hw/arm/smmuv3-accel.c
create mode 100644 hw/arm/smmuv3-accel.h
diff --git a/hw/arm/Kconfig b/hw/arm/Kconfig
index 3baa6c6c74..157c0f3517 100644
--- a/hw/arm/Kconfig
+++ b/hw/arm/Kconfig
@@ -12,6 +12,7 @@ config ARM_VIRT
select ARM_GIC
select ACPI
select ARM_SMMUV3
+ select ARM_SMMUV3_ACCEL
select GPIO_KEY
select DEVICE_TREE
select FW_CFG_DMA
@@ -625,6 +626,10 @@ config FSL_IMX8MP_EVK
config ARM_SMMUV3
bool
+config ARM_SMMUV3_ACCEL
+ bool
+ depends on ARM_SMMUV3 && IOMMUFD
+
config FSL_IMX6UL
bool
default y
diff --git a/hw/arm/meson.build b/hw/arm/meson.build
index dc68391305..bcb27c0bf6 100644
--- a/hw/arm/meson.build
+++ b/hw/arm/meson.build
@@ -61,7 +61,8 @@ arm_common_ss.add(when: 'CONFIG_ARMSSE', if_true: files('armsse.c'))
arm_common_ss.add(when: 'CONFIG_FSL_IMX7', if_true: files('fsl-imx7.c', 'mcimx7d-sabre.c'))
arm_common_ss.add(when: 'CONFIG_FSL_IMX8MP', if_true: files('fsl-imx8mp.c'))
arm_common_ss.add(when: 'CONFIG_FSL_IMX8MP_EVK', if_true: files('imx8mp-evk.c'))
-arm_common_ss.add(when: 'CONFIG_ARM_SMMUV3', if_true: files('smmuv3.c'))
+arm_ss.add(when: 'CONFIG_ARM_SMMUV3', if_true: files('smmuv3.c'))
+arm_ss.add(when: 'CONFIG_ARM_SMMUV3_ACCEL', if_true: files('smmuv3-accel.c'))
arm_common_ss.add(when: 'CONFIG_FSL_IMX6UL', if_true: files('fsl-imx6ul.c', 'mcimx6ul-evk.c'))
arm_common_ss.add(when: 'CONFIG_NRF51_SOC', if_true: files('nrf51_soc.c'))
arm_ss.add(when: 'CONFIG_XEN', if_true: files(
diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
new file mode 100644
index 0000000000..79f1713be6
--- /dev/null
+++ b/hw/arm/smmuv3-accel.c
@@ -0,0 +1,52 @@
+/*
+ * Copyright (c) 2025 Huawei Technologies R & D (UK) Ltd
+ * Copyright (C) 2025 NVIDIA
+ * Written by Nicolin Chen, Shameer Kolothum
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#include "qemu/osdep.h"
+
+#include "hw/arm/smmuv3.h"
+#include "smmuv3-accel.h"
+
+static SMMUv3AccelDevice *smmuv3_accel_get_dev(SMMUState *bs, SMMUPciBus *sbus,
+ PCIBus *bus, int devfn)
+{
+ SMMUDevice *sdev = sbus->pbdev[devfn];
+ SMMUv3AccelDevice *accel_dev;
+
+ if (sdev) {
+ return container_of(sdev, SMMUv3AccelDevice, sdev);
+ }
+
+ accel_dev = g_new0(SMMUv3AccelDevice, 1);
+ sdev = &accel_dev->sdev;
+
+ sbus->pbdev[devfn] = sdev;
+ smmu_init_sdev(bs, sdev, bus, devfn);
+ return accel_dev;
+}
+
+static AddressSpace *smmuv3_accel_find_add_as(PCIBus *bus, void *opaque,
+ int devfn)
+{
+ SMMUState *bs = opaque;
+ SMMUPciBus *sbus = smmu_get_sbus(bs, bus);
+ SMMUv3AccelDevice *accel_dev = smmuv3_accel_get_dev(bs, sbus, bus, devfn);
+ SMMUDevice *sdev = &accel_dev->sdev;
+
+ return &sdev->as;
+}
+
+static const PCIIOMMUOps smmuv3_accel_ops = {
+ .get_address_space = smmuv3_accel_find_add_as,
+};
+
+void smmuv3_accel_init(SMMUv3State *s)
+{
+ SMMUState *bs = ARM_SMMU(s);
+
+ bs->iommu_ops = &smmuv3_accel_ops;
+}
diff --git a/hw/arm/smmuv3-accel.h b/hw/arm/smmuv3-accel.h
new file mode 100644
index 0000000000..70da16960f
--- /dev/null
+++ b/hw/arm/smmuv3-accel.h
@@ -0,0 +1,27 @@
+/*
+ * Copyright (c) 2025 Huawei Technologies R & D (UK) Ltd
+ * Copyright (C) 2025 NVIDIA
+ * Written by Nicolin Chen, Shameer Kolothum
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#ifndef HW_ARM_SMMUV3_ACCEL_H
+#define HW_ARM_SMMUV3_ACCEL_H
+
+#include "hw/arm/smmu-common.h"
+#include CONFIG_DEVICES
+
+typedef struct SMMUv3AccelDevice {
+ SMMUDevice sdev;
+} SMMUv3AccelDevice;
+
+#ifdef CONFIG_ARM_SMMUV3_ACCEL
+void smmuv3_accel_init(SMMUv3State *s);
+#else
+static inline void smmuv3_accel_init(SMMUv3State *s)
+{
+}
+#endif
+
+#endif /* HW_ARM_SMMUV3_ACCEL_H */
diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index bcf8af8dc7..ef991cb7d8 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -32,6 +32,7 @@
#include "qapi/error.h"
#include "hw/arm/smmuv3.h"
+#include "smmuv3-accel.h"
#include "smmuv3-internal.h"
#include "smmu-internal.h"
@@ -1882,6 +1883,10 @@ static void smmu_realize(DeviceState *d, Error **errp)
SysBusDevice *dev = SYS_BUS_DEVICE(d);
Error *local_err = NULL;
+ if (s->accel) {
+ smmuv3_accel_init(s);
+ }
+
c->parent_realize(d, &local_err);
if (local_err) {
error_propagate(errp, local_err);
diff --git a/include/hw/arm/smmuv3.h b/include/hw/arm/smmuv3.h
index d183a62766..bb7076286b 100644
--- a/include/hw/arm/smmuv3.h
+++ b/include/hw/arm/smmuv3.h
@@ -63,6 +63,9 @@ struct SMMUv3State {
qemu_irq irq[4];
QemuMutex mutex;
char *stage;
+
+ /* SMMU has HW accelerator support for nested S1 + s2 */
+ bool accel;
};
typedef enum {
--
2.43.0
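The header's stub pattern can be sketched as follows. This is a hypothetical, simplified illustration of the compile-time gating (the types and the ops placeholder are stand-ins, not QEMU's real definitions): when CONFIG_ARM_SMMUV3_ACCEL is not defined, smmuv3_accel_init() collapses to an empty inline stub, so smmu_realize() needs no #ifdef at the call site.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for SMMUv3State (hypothetical fields). */
typedef struct SMMUv3State {
    bool accel;
    const void *iommu_ops;
} SMMUv3State;

static const int smmuv3_accel_ops_stub; /* stands in for smmuv3_accel_ops */

#ifdef CONFIG_ARM_SMMUV3_ACCEL
static inline void smmuv3_accel_init(SMMUv3State *s)
{
    /* accel build: install the dedicated ops */
    s->iommu_ops = &smmuv3_accel_ops_stub;
}
#else
static inline void smmuv3_accel_init(SMMUv3State *s)
{
    (void)s; /* accel support compiled out: nothing to do */
}
#endif

/* Caller is identical in both configurations. */
static void smmu_realize(SMMUv3State *s)
{
    if (s->accel) {
        smmuv3_accel_init(s);
    }
}
```

Compiled without -DCONFIG_ARM_SMMUV3_ACCEL, the stub path leaves iommu_ops untouched; with it defined, the same call site installs the accel ops.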
* Re: [PATCH v4 05/27] hw/arm/smmuv3-accel: Introduce smmuv3 accel device
2025-09-29 13:36 ` [PATCH v4 05/27] hw/arm/smmuv3-accel: Introduce smmuv3 accel device Shameer Kolothum
@ 2025-09-29 15:53 ` Jonathan Cameron via
2025-09-29 22:24 ` Nicolin Chen
2025-10-01 16:25 ` Eric Auger
2 siblings, 0 replies; 118+ messages in thread
From: Jonathan Cameron via @ 2025-09-29 15:53 UTC (permalink / raw)
To: Shameer Kolothum
Cc: qemu-arm, qemu-devel, eric.auger, peter.maydell, jgg, nicolinc,
ddutile, berrange, nathanc, mochs, smostafa, wangzhou1,
jiangkunkun, zhangfei.gao, zhenzhong.duan, yi.l.liu,
shameerkolothum
On Mon, 29 Sep 2025 14:36:21 +0100
Shameer Kolothum <skolothumtho@nvidia.com> wrote:
> Set up dedicated PCIIOMMUOps for the accel SMMUv3, since it will need
> different callback handling in upcoming patches. This also adds a
> CONFIG_ARM_SMMUV3_ACCEL build option so the feature can be disabled
> at compile time. Because we now include CONFIG_DEVICES in the header to
> check for ARM_SMMUV3_ACCEL, the meson file entry for smmuv3.c needs to
> be changed as well.
>
> The “accel” property isn’t user visible yet, it will be introduced in
> a later patch once all the supporting pieces are ready.
>
> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
All the code looks fine, just comment stuff.
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> diff --git a/include/hw/arm/smmuv3.h b/include/hw/arm/smmuv3.h
> index d183a62766..bb7076286b 100644
> --- a/include/hw/arm/smmuv3.h
> +++ b/include/hw/arm/smmuv3.h
> @@ -63,6 +63,9 @@ struct SMMUv3State {
> qemu_irq irq[4];
> QemuMutex mutex;
> char *stage;
> +
> + /* SMMU has HW accelerator support for nested S1 + s2 */
s1 + s2 or S1 + S2 (probably the second)
> + bool accel;
> };
>
> typedef enum {
^ permalink raw reply [flat|nested] 118+ messages in thread* Re: [PATCH v4 05/27] hw/arm/smmuv3-accel: Introduce smmuv3 accel device
2025-09-29 13:36 ` [PATCH v4 05/27] hw/arm/smmuv3-accel: Introduce smmuv3 accel device Shameer Kolothum
2025-09-29 15:53 ` Jonathan Cameron via
@ 2025-09-29 22:24 ` Nicolin Chen
2025-10-01 16:25 ` Eric Auger
2 siblings, 0 replies; 118+ messages in thread
From: Nicolin Chen @ 2025-09-29 22:24 UTC (permalink / raw)
To: Shameer Kolothum
Cc: qemu-arm, qemu-devel, eric.auger, peter.maydell, jgg, ddutile,
berrange, nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu,
shameerkolothum
On Mon, Sep 29, 2025 at 02:36:21PM +0100, Shameer Kolothum wrote:
> Because we now include CONFIG_DEVICES in the header to
> check for ARM_SMMUV3_ACCEL, the meson file entry for smmuv3.c needs to
> be changed as well.
The reasoning isn't very clear. Let's make a note here that only
arm_ss via hw_arch can include CONFIG_DEVICES.
> The “accel” property isn’t user visible yet, it will be introduced in
Let's use standard " and '.
Thanks
Nicolin
* Re: [PATCH v4 05/27] hw/arm/smmuv3-accel: Introduce smmuv3 accel device
2025-09-29 13:36 ` [PATCH v4 05/27] hw/arm/smmuv3-accel: Introduce smmuv3 accel device Shameer Kolothum
2025-09-29 15:53 ` Jonathan Cameron via
2025-09-29 22:24 ` Nicolin Chen
@ 2025-10-01 16:25 ` Eric Auger
2 siblings, 0 replies; 118+ messages in thread
From: Eric Auger @ 2025-10-01 16:25 UTC (permalink / raw)
To: Shameer Kolothum, qemu-arm, qemu-devel
Cc: peter.maydell, jgg, nicolinc, ddutile, berrange, nathanc, mochs,
smostafa, wangzhou1, jiangkunkun, jonathan.cameron, zhangfei.gao,
zhenzhong.duan, yi.l.liu, shameerkolothum
Hi Shameer,
On 9/29/25 3:36 PM, Shameer Kolothum wrote:
> Set up dedicated PCIIOMMUOps for the accel SMMUv3, since it will need
> different callback handling in upcoming patches. This also adds a
> CONFIG_ARM_SMMUV3_ACCEL build option so the feature can be disabled
> at compile time. Because we now include CONFIG_DEVICES in the header to
> check for ARM_SMMUV3_ACCEL, the meson file entry for smmuv3.c needs to
> be changed as well.
>
> The “accel” property isn’t user visible yet, it will be introduced in
> a later patch once all the supporting pieces are ready.
>
> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Eric
> ---
> hw/arm/Kconfig | 5 ++++
> hw/arm/meson.build | 3 ++-
> hw/arm/smmuv3-accel.c | 52 +++++++++++++++++++++++++++++++++++++++++
> hw/arm/smmuv3-accel.h | 27 +++++++++++++++++++++
> hw/arm/smmuv3.c | 5 ++++
> include/hw/arm/smmuv3.h | 3 +++
> 6 files changed, 94 insertions(+), 1 deletion(-)
> create mode 100644 hw/arm/smmuv3-accel.c
> create mode 100644 hw/arm/smmuv3-accel.h
>
> diff --git a/hw/arm/Kconfig b/hw/arm/Kconfig
> index 3baa6c6c74..157c0f3517 100644
> --- a/hw/arm/Kconfig
> +++ b/hw/arm/Kconfig
> @@ -12,6 +12,7 @@ config ARM_VIRT
> select ARM_GIC
> select ACPI
> select ARM_SMMUV3
> + select ARM_SMMUV3_ACCEL
> select GPIO_KEY
> select DEVICE_TREE
> select FW_CFG_DMA
> @@ -625,6 +626,10 @@ config FSL_IMX8MP_EVK
> config ARM_SMMUV3
> bool
>
> +config ARM_SMMUV3_ACCEL
> + bool
> + depends on ARM_SMMUV3 && IOMMUFD
> +
> config FSL_IMX6UL
> bool
> default y
> diff --git a/hw/arm/meson.build b/hw/arm/meson.build
> index dc68391305..bcb27c0bf6 100644
> --- a/hw/arm/meson.build
> +++ b/hw/arm/meson.build
> @@ -61,7 +61,8 @@ arm_common_ss.add(when: 'CONFIG_ARMSSE', if_true: files('armsse.c'))
> arm_common_ss.add(when: 'CONFIG_FSL_IMX7', if_true: files('fsl-imx7.c', 'mcimx7d-sabre.c'))
> arm_common_ss.add(when: 'CONFIG_FSL_IMX8MP', if_true: files('fsl-imx8mp.c'))
> arm_common_ss.add(when: 'CONFIG_FSL_IMX8MP_EVK', if_true: files('imx8mp-evk.c'))
> -arm_common_ss.add(when: 'CONFIG_ARM_SMMUV3', if_true: files('smmuv3.c'))
> +arm_ss.add(when: 'CONFIG_ARM_SMMUV3', if_true: files('smmuv3.c'))
> +arm_ss.add(when: 'CONFIG_ARM_SMMUV3_ACCEL', if_true: files('smmuv3-accel.c'))
> arm_common_ss.add(when: 'CONFIG_FSL_IMX6UL', if_true: files('fsl-imx6ul.c', 'mcimx6ul-evk.c'))
> arm_common_ss.add(when: 'CONFIG_NRF51_SOC', if_true: files('nrf51_soc.c'))
> arm_ss.add(when: 'CONFIG_XEN', if_true: files(
> diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
> new file mode 100644
> index 0000000000..79f1713be6
> --- /dev/null
> +++ b/hw/arm/smmuv3-accel.c
> @@ -0,0 +1,52 @@
> +/*
> + * Copyright (c) 2025 Huawei Technologies R & D (UK) Ltd
> + * Copyright (C) 2025 NVIDIA
> + * Written by Nicolin Chen, Shameer Kolothum
> + *
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + */
> +
> +#include "qemu/osdep.h"
> +
> +#include "hw/arm/smmuv3.h"
> +#include "smmuv3-accel.h"
> +
> +static SMMUv3AccelDevice *smmuv3_accel_get_dev(SMMUState *bs, SMMUPciBus *sbus,
> + PCIBus *bus, int devfn)
> +{
> + SMMUDevice *sdev = sbus->pbdev[devfn];
> + SMMUv3AccelDevice *accel_dev;
> +
> + if (sdev) {
> + return container_of(sdev, SMMUv3AccelDevice, sdev);
> + }
> +
> + accel_dev = g_new0(SMMUv3AccelDevice, 1);
> + sdev = &accel_dev->sdev;
> +
> + sbus->pbdev[devfn] = sdev;
> + smmu_init_sdev(bs, sdev, bus, devfn);
> + return accel_dev;
> +}
> +
> +static AddressSpace *smmuv3_accel_find_add_as(PCIBus *bus, void *opaque,
> + int devfn)
> +{
> + SMMUState *bs = opaque;
> + SMMUPciBus *sbus = smmu_get_sbus(bs, bus);
> + SMMUv3AccelDevice *accel_dev = smmuv3_accel_get_dev(bs, sbus, bus, devfn);
> + SMMUDevice *sdev = &accel_dev->sdev;
> +
> + return &sdev->as;
> +}
> +
> +static const PCIIOMMUOps smmuv3_accel_ops = {
> + .get_address_space = smmuv3_accel_find_add_as,
> +};
> +
> +void smmuv3_accel_init(SMMUv3State *s)
> +{
> + SMMUState *bs = ARM_SMMU(s);
> +
> + bs->iommu_ops = &smmuv3_accel_ops;
> +}
> diff --git a/hw/arm/smmuv3-accel.h b/hw/arm/smmuv3-accel.h
> new file mode 100644
> index 0000000000..70da16960f
> --- /dev/null
> +++ b/hw/arm/smmuv3-accel.h
> @@ -0,0 +1,27 @@
> +/*
> + * Copyright (c) 2025 Huawei Technologies R & D (UK) Ltd
> + * Copyright (C) 2025 NVIDIA
> + * Written by Nicolin Chen, Shameer Kolothum
> + *
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + */
> +
> +#ifndef HW_ARM_SMMUV3_ACCEL_H
> +#define HW_ARM_SMMUV3_ACCEL_H
> +
> +#include "hw/arm/smmu-common.h"
> +#include CONFIG_DEVICES
> +
> +typedef struct SMMUv3AccelDevice {
> + SMMUDevice sdev;
> +} SMMUv3AccelDevice;
> +
> +#ifdef CONFIG_ARM_SMMUV3_ACCEL
> +void smmuv3_accel_init(SMMUv3State *s);
> +#else
> +static inline void smmuv3_accel_init(SMMUv3State *s)
> +{
> +}
> +#endif
> +
> +#endif /* HW_ARM_SMMUV3_ACCEL_H */
> diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
> index bcf8af8dc7..ef991cb7d8 100644
> --- a/hw/arm/smmuv3.c
> +++ b/hw/arm/smmuv3.c
> @@ -32,6 +32,7 @@
> #include "qapi/error.h"
>
> #include "hw/arm/smmuv3.h"
> +#include "smmuv3-accel.h"
> #include "smmuv3-internal.h"
> #include "smmu-internal.h"
>
> @@ -1882,6 +1883,10 @@ static void smmu_realize(DeviceState *d, Error **errp)
> SysBusDevice *dev = SYS_BUS_DEVICE(d);
> Error *local_err = NULL;
>
> + if (s->accel) {
> + smmuv3_accel_init(s);
> + }
> +
> c->parent_realize(d, &local_err);
> if (local_err) {
> error_propagate(errp, local_err);
> diff --git a/include/hw/arm/smmuv3.h b/include/hw/arm/smmuv3.h
> index d183a62766..bb7076286b 100644
> --- a/include/hw/arm/smmuv3.h
> +++ b/include/hw/arm/smmuv3.h
> @@ -63,6 +63,9 @@ struct SMMUv3State {
> qemu_irq irq[4];
> QemuMutex mutex;
> char *stage;
> +
> + /* SMMU has HW accelerator support for nested S1 + s2 */
> + bool accel;
> };
>
> typedef enum {
* [PATCH v4 06/27] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd
2025-09-29 13:36 [PATCH v4 00/27] hw/arm/virt: Add support for user-creatable accelerated SMMUv3 Shameer Kolothum
` (4 preceding siblings ...)
2025-09-29 13:36 ` [PATCH v4 05/27] hw/arm/smmuv3-accel: Introduce smmuv3 accel device Shameer Kolothum
@ 2025-09-29 13:36 ` Shameer Kolothum
2025-09-29 16:08 ` Jonathan Cameron via
` (3 more replies)
2025-09-29 13:36 ` [PATCH v4 07/27] hw/arm/smmuv3: Implement get_viommu_cap() callback Shameer Kolothum
` (21 subsequent siblings)
27 siblings, 4 replies; 118+ messages in thread
From: Shameer Kolothum @ 2025-09-29 13:36 UTC (permalink / raw)
To: qemu-arm, qemu-devel
Cc: eric.auger, peter.maydell, jgg, nicolinc, ddutile, berrange,
nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu,
shameerkolothum
Accelerated SMMUv3 is only useful when the device can take advantage of
the host's SMMUv3 in nested mode. To keep things simple and correct, we
only allow this feature for vfio-pci endpoint devices that use the iommufd
backend. We also allow non-endpoint emulated devices like PCI bridges and
root ports, so that users can plug in these vfio-pci devices. We can only
enforce this if devices are cold plugged. For hotplug cases, give appropriate
warnings.
Another reason for this limit is to avoid problems with IOTLB
invalidations. Some commands (e.g., CMD_TLBI_NH_ASID) lack an associated
SID, making it difficult to trace the originating device. If we allowed
emulated endpoint devices, QEMU would have to invalidate both its own
software IOTLB and the host's hardware IOTLB, which could slow things
down.
Since vfio-pci devices in nested mode rely on the host SMMUv3's nested
translation (S1+S2), their get_address_space() callback must return the
system address space so that VFIO core can setup correct S2 mappings
for guest RAM.
So in short:
- vfio-pci devices(with iommufd as backend) return the system address
space.
- bridges and root ports return the IOMMU address space.
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
hw/arm/smmuv3-accel.c | 68 ++++++++++++++++++++++++++++-
hw/pci-bridge/pci_expander_bridge.c | 1 -
include/hw/pci/pci_bridge.h | 1 +
3 files changed, 68 insertions(+), 2 deletions(-)
diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
index 79f1713be6..44410cfb2a 100644
--- a/hw/arm/smmuv3-accel.c
+++ b/hw/arm/smmuv3-accel.c
@@ -7,8 +7,13 @@
*/
#include "qemu/osdep.h"
+#include "qemu/error-report.h"
#include "hw/arm/smmuv3.h"
+#include "hw/pci/pci_bridge.h"
+#include "hw/pci-host/gpex.h"
+#include "hw/vfio/pci.h"
+
#include "smmuv3-accel.h"
static SMMUv3AccelDevice *smmuv3_accel_get_dev(SMMUState *bs, SMMUPciBus *sbus,
@@ -29,15 +34,76 @@ static SMMUv3AccelDevice *smmuv3_accel_get_dev(SMMUState *bs, SMMUPciBus *sbus,
return accel_dev;
}
+static bool smmuv3_accel_pdev_allowed(PCIDevice *pdev, bool *vfio_pci)
+{
+
+ if (object_dynamic_cast(OBJECT(pdev), TYPE_PCI_BRIDGE) ||
+ object_dynamic_cast(OBJECT(pdev), TYPE_PXB_PCIE_DEV) ||
+ object_dynamic_cast(OBJECT(pdev), TYPE_GPEX_ROOT_DEVICE)) {
+ return true;
+ } else if ((object_dynamic_cast(OBJECT(pdev), TYPE_VFIO_PCI))) {
+ *vfio_pci = true;
+ if (object_property_get_link(OBJECT(pdev), "iommufd", NULL)) {
+ return true;
+ }
+ }
+ return false;
+}
+
static AddressSpace *smmuv3_accel_find_add_as(PCIBus *bus, void *opaque,
int devfn)
{
+ PCIDevice *pdev = pci_find_device(bus, pci_bus_num(bus), devfn);
SMMUState *bs = opaque;
SMMUPciBus *sbus = smmu_get_sbus(bs, bus);
SMMUv3AccelDevice *accel_dev = smmuv3_accel_get_dev(bs, sbus, bus, devfn);
SMMUDevice *sdev = &accel_dev->sdev;
+ bool vfio_pci = false;
+
+ if (pdev && !smmuv3_accel_pdev_allowed(pdev, &vfio_pci)) {
+ if (DEVICE(pdev)->hotplugged) {
+ if (vfio_pci) {
+ warn_report("Hot plugging a vfio-pci device (%s) without "
+ "iommufd as backend is not supported", pdev->name);
+ } else {
+ warn_report("Hot plugging an emulated device %s with "
+ "accelerated SMMUv3. This will bring down "
+ "performace", pdev->name);
+ }
+ /*
+ * Both cases, we will return IOMMU address space. For hotplugged
+ * vfio-pci dev without iommufd as backend, it will fail later in
+ * smmuv3_notify_flag_changed() with "requires iommu MAP notifier"
+ * error message.
+ */
+ return &sdev->as;
+ } else {
+ error_report("Device(%s) not allowed. Only PCIe root complex "
+ "devices or PCI bridge devices or vfio-pci endpoint "
+ "devices with iommufd as backend is allowed with "
+ "arm-smmuv3,accel=on", pdev->name);
+ exit(1);
+ }
+ }
- return &sdev->as;
+ /*
+ * We return the system address for vfio-pci devices(with iommufd as
+ * backend) so that the VFIO core can set up Stage-2 (S2) mappings for
+ * guest RAM. This is needed because, in the accelerated SMMUv3 case,
+ * the host SMMUv3 runs in nested (S1 + S2) mode where the guest
+ * manages its own S1 page tables while the host manages S2.
+ *
+ * We are using the global &address_space_memory here, as this will ensure
+ * same system address space pointer for all devices behind the accelerated
+ * SMMUv3s in a VM. That way VFIO/iommufd can reuse a single IOAS ID in
+ * iommufd_cdev_attach(), allowing the Stage-2 page tables to be shared
+ * within the VM instead of duplicating them for every SMMUv3 instance.
+ */
+ if (vfio_pci) {
+ return &address_space_memory;
+ } else {
+ return &sdev->as;
+ }
}
static const PCIIOMMUOps smmuv3_accel_ops = {
diff --git a/hw/pci-bridge/pci_expander_bridge.c b/hw/pci-bridge/pci_expander_bridge.c
index 1bcceddbc4..a8eb2d2426 100644
--- a/hw/pci-bridge/pci_expander_bridge.c
+++ b/hw/pci-bridge/pci_expander_bridge.c
@@ -48,7 +48,6 @@ struct PXBBus {
char bus_path[8];
};
-#define TYPE_PXB_PCIE_DEV "pxb-pcie"
OBJECT_DECLARE_SIMPLE_TYPE(PXBPCIEDev, PXB_PCIE_DEV)
static GList *pxb_dev_list;
diff --git a/include/hw/pci/pci_bridge.h b/include/hw/pci/pci_bridge.h
index a055fd8d32..b61360b900 100644
--- a/include/hw/pci/pci_bridge.h
+++ b/include/hw/pci/pci_bridge.h
@@ -106,6 +106,7 @@ typedef struct PXBPCIEDev {
#define TYPE_PXB_PCIE_BUS "pxb-pcie-bus"
#define TYPE_PXB_CXL_BUS "pxb-cxl-bus"
+#define TYPE_PXB_PCIE_DEV "pxb-pcie"
#define TYPE_PXB_DEV "pxb"
OBJECT_DECLARE_SIMPLE_TYPE(PXBDev, PXB_DEV)
--
2.43.0
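The address-space routing the patch describes (system AS for iommufd-backed vfio-pci, IOMMU AS for bridges/root ports) can be sketched as below. All types and the enum are hypothetical simplifications; the real code classifies devices with object_dynamic_cast() and, for a disallowed cold-plugged device, error_report()s and exits rather than returning NULL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum {
    DEV_BRIDGE,       /* PCI bridge / pxb-pcie / root port */
    DEV_VFIO_PCI,     /* vfio-pci endpoint */
    DEV_EMULATED_EP,  /* emulated endpoint (not allowed) */
} DevKind;

typedef struct { int dummy; } AddressSpace;

static AddressSpace address_space_memory; /* global system AS */

typedef struct {
    DevKind kind;
    bool has_iommufd;       /* vfio-pci backed by iommufd? */
    AddressSpace iommu_as;  /* per-device IOMMU AS */
} FakePCIDevice;

/* Sketch of the get_address_space() decision; NULL marks the
 * disallowed cases that the real code rejects. */
static AddressSpace *accel_find_add_as(FakePCIDevice *pdev)
{
    switch (pdev->kind) {
    case DEV_BRIDGE:
        /* bridges/root ports keep the IOMMU address space */
        return &pdev->iommu_as;
    case DEV_VFIO_PCI:
        if (pdev->has_iommufd) {
            /* system AS, so VFIO can install S2 mappings for guest RAM */
            return &address_space_memory;
        }
        return NULL;
    default:
        return NULL;
    }
}
```

Returning the one global &address_space_memory for every iommufd-backed vfio-pci device is what lets iommufd reuse a single IOAS (shared S2 tables) across all accelerated SMMUv3 instances in the VM.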
* Re: [PATCH v4 06/27] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd
2025-09-29 13:36 ` [PATCH v4 06/27] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd Shameer Kolothum
@ 2025-09-29 16:08 ` Jonathan Cameron via
2025-09-30 8:03 ` Shameer Kolothum
2025-09-30 0:11 ` Nicolin Chen
` (2 subsequent siblings)
3 siblings, 1 reply; 118+ messages in thread
From: Jonathan Cameron via @ 2025-09-29 16:08 UTC (permalink / raw)
To: Shameer Kolothum
Cc: qemu-arm, qemu-devel, eric.auger, peter.maydell, jgg, nicolinc,
ddutile, berrange, nathanc, mochs, smostafa, wangzhou1,
jiangkunkun, zhangfei.gao, zhenzhong.duan, yi.l.liu,
shameerkolothum
On Mon, 29 Sep 2025 14:36:22 +0100
Shameer Kolothum <skolothumtho@nvidia.com> wrote:
> Accelerated SMMUv3 is only useful when the device can take advantage of
> the host's SMMUv3 in nested mode. To keep things simple and correct, we
> only allow this feature for vfio-pci endpoint devices that use the iommufd
> backend. We also allow non-endpoint emulated devices like PCI bridges and
> root ports, so that users can plug in these vfio-pci devices. We can only
> enforce this if devices are cold plugged. For hotplug cases, give appropriate
> warnings.
>
> Another reason for this limit is to avoid problems with IOTLB
> invalidations. Some commands (e.g., CMD_TLBI_NH_ASID) lack an associated
> SID, making it difficult to trace the originating device. If we allowed
> emulated endpoint devices, QEMU would have to invalidate both its own
> software IOTLB and the host's hardware IOTLB, which could slow things
> down.
>
> Since vfio-pci devices in nested mode rely on the host SMMUv3's nested
> translation (S1+S2), their get_address_space() callback must return the
> system address space so that VFIO core can setup correct S2 mappings
> for guest RAM.
>
> So in short:
> - vfio-pci devices(with iommufd as backend) return the system address
> space.
> - bridges and root ports return the IOMMU address space.
>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
One question that really applies to earlier patch and an even more trivial
comment on a comment than the earlier ones ;)
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> ---
> hw/arm/smmuv3-accel.c | 68 ++++++++++++++++++++++++++++-
> hw/pci-bridge/pci_expander_bridge.c | 1 -
> include/hw/pci/pci_bridge.h | 1 +
> 3 files changed, 68 insertions(+), 2 deletions(-)
>
> diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
> index 79f1713be6..44410cfb2a 100644
> --- a/hw/arm/smmuv3-accel.c
> +++ b/hw/arm/smmuv3-accel.c
> static AddressSpace *smmuv3_accel_find_add_as(PCIBus *bus, void *opaque,
I should have noticed this in previous patch...
What does add stand for here? This name is not particularly clear to me.
> int devfn)
> {
> + PCIDevice *pdev = pci_find_device(bus, pci_bus_num(bus), devfn);
> SMMUState *bs = opaque;
> SMMUPciBus *sbus = smmu_get_sbus(bs, bus);
> SMMUv3AccelDevice *accel_dev = smmuv3_accel_get_dev(bs, sbus, bus, devfn);
> SMMUDevice *sdev = &accel_dev->sdev;
> + bool vfio_pci = false;
> +
> + if (pdev && !smmuv3_accel_pdev_allowed(pdev, &vfio_pci)) {
> + if (DEVICE(pdev)->hotplugged) {
> + if (vfio_pci) {
> + warn_report("Hot plugging a vfio-pci device (%s) without "
> + "iommufd as backend is not supported", pdev->name);
> + } else {
> + warn_report("Hot plugging an emulated device %s with "
> + "accelerated SMMUv3. This will bring down "
> + "performace", pdev->name);
> + }
> + /*
> + * Both cases, we will return IOMMU address space. For hotplugged
> + * vfio-pci dev without iommufd as backend, it will fail later in
> + * smmuv3_notify_flag_changed() with "requires iommu MAP notifier"
> + * error message.
> + */
> + return &sdev->as;
> + } else {
> + error_report("Device(%s) not allowed. Only PCIe root complex "
> + "devices or PCI bridge devices or vfio-pci endpoint "
> + "devices with iommufd as backend is allowed with "
> + "arm-smmuv3,accel=on", pdev->name);
> + exit(1);
> + }
> + }
>
> - return &sdev->as;
> + /*
> + * We return the system address for vfio-pci devices(with iommufd as
> + * backend) so that the VFIO core can set up Stage-2 (S2) mappings for
> + * guest RAM. This is needed because, in the accelerated SMMUv3 case,
> + * the host SMMUv3 runs in nested (S1 + S2) mode where the guest
> + * manages its own S1 page tables while the host manages S2.
> + *
> + * We are using the global &address_space_memory here, as this will ensure
> + * same system address space pointer for all devices behind the accelerated
> + * SMMUv3s in a VM. That way VFIO/iommufd can reuse a single IOAS ID in
> + * iommufd_cdev_attach(), allowing the Stage-2 page tables to be shared
> + * within the VM instead of duplicating them for every SMMUv3 instance.
These two paragraphs definitely not wrapping to same line length.
Nice to tidy that up.
> + */
> + if (vfio_pci) {
> + return &address_space_memory;
> + } else {
> + return &sdev->as;
> + }
> }
>
> static const PCIIOMMUOps smmuv3_accel_ops = {
>
2025-09-29 16:08 ` Jonathan Cameron via
@ 2025-09-30 8:03 ` Shameer Kolothum
2025-10-01 16:38 ` Eric Auger
0 siblings, 1 reply; 118+ messages in thread
From: Shameer Kolothum @ 2025-09-30 8:03 UTC (permalink / raw)
To: Jonathan Cameron
Cc: qemu-arm@nongnu.org, qemu-devel@nongnu.org, eric.auger@redhat.com,
peter.maydell@linaro.org, Jason Gunthorpe, Nicolin Chen,
ddutile@redhat.com, berrange@redhat.com, Nathan Chen, Matt Ochs,
smostafa@google.com, wangzhou1@hisilicon.com,
jiangkunkun@huawei.com, zhangfei.gao@linaro.org,
zhenzhong.duan@intel.com, yi.l.liu@intel.com,
shameerkolothum@gmail.com
> -----Original Message-----
> From: Jonathan Cameron <jonathan.cameron@huawei.com>
> Sent: 29 September 2025 17:09
> To: Shameer Kolothum <skolothumtho@nvidia.com>
> Cc: qemu-arm@nongnu.org; qemu-devel@nongnu.org;
> eric.auger@redhat.com; peter.maydell@linaro.org; Jason Gunthorpe
> <jgg@nvidia.com>; Nicolin Chen <nicolinc@nvidia.com>; ddutile@redhat.com;
> berrange@redhat.com; Nathan Chen <nathanc@nvidia.com>; Matt Ochs
> <mochs@nvidia.com>; smostafa@google.com; wangzhou1@hisilicon.com;
> jiangkunkun@huawei.com; zhangfei.gao@linaro.org;
> zhenzhong.duan@intel.com; yi.l.liu@intel.com;
> shameerkolothum@gmail.com
> Subject: Re: [PATCH v4 06/27] hw/arm/smmuv3-accel: Restrict accelerated
> SMMUv3 to vfio-pci endpoints with iommufd
>
> External email: Use caution opening links or attachments
>
>
> On Mon, 29 Sep 2025 14:36:22 +0100
> Shameer Kolothum <skolothumtho@nvidia.com> wrote:
>
> > Accelerated SMMUv3 is only useful when the device can take advantage of
> > the host's SMMUv3 in nested mode. To keep things simple and correct, we
> > only allow this feature for vfio-pci endpoint devices that use the iommufd
> > backend. We also allow non-endpoint emulated devices like PCI bridges and
> > root ports, so that users can plug in these vfio-pci devices. We can only
> > enforce this if devices are cold plugged. For hotplug cases, give appropriate
> > warnings.
> >
> > Another reason for this limit is to avoid problems with IOTLB
> > invalidations. Some commands (e.g., CMD_TLBI_NH_ASID) lack an
> associated
> > SID, making it difficult to trace the originating device. If we allowed
> > emulated endpoint devices, QEMU would have to invalidate both its own
> > software IOTLB and the host's hardware IOTLB, which could slow things
> > down.
> >
> > Since vfio-pci devices in nested mode rely on the host SMMUv3's nested
> > translation (S1+S2), their get_address_space() callback must return the
> > system address space so that VFIO core can setup correct S2 mappings
> > for guest RAM.
> >
> > So in short:
> > - vfio-pci devices(with iommufd as backend) return the system address
> > space.
> > - bridges and root ports return the IOMMU address space.
> >
> > Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
> One question that really applies to earlier patch and an even more trivial
> comment on a comment than the earlier ones ;)
>
> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
>
> > ---
> > hw/arm/smmuv3-accel.c | 68 ++++++++++++++++++++++++++++-
> > hw/pci-bridge/pci_expander_bridge.c | 1 -
> > include/hw/pci/pci_bridge.h | 1 +
> > 3 files changed, 68 insertions(+), 2 deletions(-)
> >
> > diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
> > index 79f1713be6..44410cfb2a 100644
> > --- a/hw/arm/smmuv3-accel.c
> > +++ b/hw/arm/smmuv3-accel.c
>
> > static AddressSpace *smmuv3_accel_find_add_as(PCIBus *bus, void
> *opaque,
>
> I should have noticed this in previous patch...
> What does add stand for here? This name is not particularly clear to me.
Good question 😊.
I believe the name comes from the smmu-common.c implementation of
get_address_space:
static const PCIIOMMUOps smmu_ops = {
.get_address_space = smmu_find_add_as,
};
Looking at it again, that version allocates a new MR and creates a
new address space per sdev, so perhaps "add" referred to the address
space creation.
This callback here originally did something similar but no longer does.
So, I think it’s better to just rename it to smmuv3_accel_get_as()
Thanks,
Shameer
* Re: [PATCH v4 06/27] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd
2025-09-30 8:03 ` Shameer Kolothum
@ 2025-10-01 16:38 ` Eric Auger
2025-10-02 8:16 ` Shameer Kolothum
0 siblings, 1 reply; 118+ messages in thread
From: Eric Auger @ 2025-10-01 16:38 UTC (permalink / raw)
To: Shameer Kolothum, Jonathan Cameron
Cc: qemu-arm@nongnu.org, qemu-devel@nongnu.org,
peter.maydell@linaro.org, Jason Gunthorpe, Nicolin Chen,
ddutile@redhat.com, berrange@redhat.com, Nathan Chen, Matt Ochs,
smostafa@google.com, wangzhou1@hisilicon.com,
jiangkunkun@huawei.com, zhangfei.gao@linaro.org,
zhenzhong.duan@intel.com, yi.l.liu@intel.com,
shameerkolothum@gmail.com
On 9/30/25 10:03 AM, Shameer Kolothum wrote:
>
>> -----Original Message-----
>> From: Jonathan Cameron <jonathan.cameron@huawei.com>
>> Sent: 29 September 2025 17:09
>> To: Shameer Kolothum <skolothumtho@nvidia.com>
>> Cc: qemu-arm@nongnu.org; qemu-devel@nongnu.org;
>> eric.auger@redhat.com; peter.maydell@linaro.org; Jason Gunthorpe
>> <jgg@nvidia.com>; Nicolin Chen <nicolinc@nvidia.com>; ddutile@redhat.com;
>> berrange@redhat.com; Nathan Chen <nathanc@nvidia.com>; Matt Ochs
>> <mochs@nvidia.com>; smostafa@google.com; wangzhou1@hisilicon.com;
>> jiangkunkun@huawei.com; zhangfei.gao@linaro.org;
>> zhenzhong.duan@intel.com; yi.l.liu@intel.com;
>> shameerkolothum@gmail.com
>> Subject: Re: [PATCH v4 06/27] hw/arm/smmuv3-accel: Restrict accelerated
>> SMMUv3 to vfio-pci endpoints with iommufd
>>
>>
>> On Mon, 29 Sep 2025 14:36:22 +0100
>> Shameer Kolothum <skolothumtho@nvidia.com> wrote:
>>
>>> Accelerated SMMUv3 is only useful when the device can take advantage of
>>> the host's SMMUv3 in nested mode. To keep things simple and correct, we
>>> only allow this feature for vfio-pci endpoint devices that use the iommufd
>>> backend. We also allow non-endpoint emulated devices like PCI bridges and
>>> root ports, so that users can plug in these vfio-pci devices. We can only
>>> enforce this if devices are cold plugged. For hotplug cases, give appropriate
>>> warnings.
>>>
>>> Another reason for this limit is to avoid problems with IOTLB
>>> invalidations. Some commands (e.g., CMD_TLBI_NH_ASID) lack an
>> associated
>>> SID, making it difficult to trace the originating device. If we allowed
>>> emulated endpoint devices, QEMU would have to invalidate both its own
>>> software IOTLB and the host's hardware IOTLB, which could slow things
>>> down.
>>>
>>> Since vfio-pci devices in nested mode rely on the host SMMUv3's nested
>>> translation (S1+S2), their get_address_space() callback must return the
>>> system address space so that VFIO core can setup correct S2 mappings
>>> for guest RAM.
>>>
>>> So in short:
>>> - vfio-pci devices(with iommufd as backend) return the system address
>>> space.
>>> - bridges and root ports return the IOMMU address space.
>>>
>>> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
>> One question that really applies to earlier patch and an even more trivial
>> comment on a comment than the earlier ones ;)
>>
>> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
>>
>>> ---
>>> hw/arm/smmuv3-accel.c | 68 ++++++++++++++++++++++++++++-
>>> hw/pci-bridge/pci_expander_bridge.c | 1 -
>>> include/hw/pci/pci_bridge.h | 1 +
>>> 3 files changed, 68 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
>>> index 79f1713be6..44410cfb2a 100644
>>> --- a/hw/arm/smmuv3-accel.c
>>> +++ b/hw/arm/smmuv3-accel.c
>>> static AddressSpace *smmuv3_accel_find_add_as(PCIBus *bus, void
>> *opaque,
>>
>> I should have noticed this in previous patch...
>> What does add stand for here? This name is not particularly clear to me.
> Good question 😊.
>
> I believe the name comes from the smmu-common.c implementation of
> get_address_space:
>
> static const PCIIOMMUOps smmu_ops = {
> .get_address_space = smmu_find_add_as,
> };
> Looking at it again, that version allocates a new MR and creates a
> new address space per sdev, so perhaps "add" referred to the address
> space creation.
this stems from the original terminology used in intel-iommu.c
(vtd_find_add_as)
the smmu-common code looks for a registered device corresponding to @bus
and @devfn (this is the 'find'). If it exists it returns it, otherwise
it allocates a bus and SMMUDevice object according to what exists and
initializes the AddressSpace (this is the 'add').
>
> This callback here originally did something similar but no longer does.
I don't get why it does not do something similar anymore?
> So, I think it’s better to just rename it to smmuv3_accel_get_as()
Well I would prefer we keep the original terminology to match other
viommu code. Except of course if I misunderstood the existing code.
Thanks
Eric
>
> Thanks,
> Shameer
* RE: [PATCH v4 06/27] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd
2025-10-01 16:38 ` Eric Auger
@ 2025-10-02 8:16 ` Shameer Kolothum
0 siblings, 0 replies; 118+ messages in thread
From: Shameer Kolothum @ 2025-10-02 8:16 UTC (permalink / raw)
To: eric.auger@redhat.com, Jonathan Cameron
Cc: qemu-arm@nongnu.org, qemu-devel@nongnu.org,
peter.maydell@linaro.org, Jason Gunthorpe, Nicolin Chen,
ddutile@redhat.com, berrange@redhat.com, Nathan Chen, Matt Ochs,
smostafa@google.com, wangzhou1@hisilicon.com,
jiangkunkun@huawei.com, zhangfei.gao@linaro.org,
zhenzhong.duan@intel.com, yi.l.liu@intel.com,
shameerkolothum@gmail.com
> -----Original Message-----
> From: Eric Auger <eric.auger@redhat.com>
> Sent: 01 October 2025 17:39
> To: Shameer Kolothum <skolothumtho@nvidia.com>; Jonathan Cameron
> <jonathan.cameron@huawei.com>
> Cc: qemu-arm@nongnu.org; qemu-devel@nongnu.org;
> peter.maydell@linaro.org; Jason Gunthorpe <jgg@nvidia.com>; Nicolin Chen
> <nicolinc@nvidia.com>; ddutile@redhat.com; berrange@redhat.com; Nathan
> Chen <nathanc@nvidia.com>; Matt Ochs <mochs@nvidia.com>;
> smostafa@google.com; wangzhou1@hisilicon.com;
> jiangkunkun@huawei.com; zhangfei.gao@linaro.org;
> zhenzhong.duan@intel.com; yi.l.liu@intel.com;
> shameerkolothum@gmail.com
> Subject: Re: [PATCH v4 06/27] hw/arm/smmuv3-accel: Restrict accelerated
> SMMUv3 to vfio-pci endpoints with iommufd
>
>
> On 9/30/25 10:03 AM, Shameer Kolothum wrote:
> >
> >> -----Original Message-----
> >> From: Jonathan Cameron <jonathan.cameron@huawei.com>
> >> Sent: 29 September 2025 17:09
> >> To: Shameer Kolothum <skolothumtho@nvidia.com>
> >> Cc: qemu-arm@nongnu.org; qemu-devel@nongnu.org;
> >> eric.auger@redhat.com; peter.maydell@linaro.org; Jason Gunthorpe
> >> <jgg@nvidia.com>; Nicolin Chen <nicolinc@nvidia.com>;
> ddutile@redhat.com;
> >> berrange@redhat.com; Nathan Chen <nathanc@nvidia.com>; Matt Ochs
> >> <mochs@nvidia.com>; smostafa@google.com; wangzhou1@hisilicon.com;
> >> jiangkunkun@huawei.com; zhangfei.gao@linaro.org;
> >> zhenzhong.duan@intel.com; yi.l.liu@intel.com;
> >> shameerkolothum@gmail.com
> >> Subject: Re: [PATCH v4 06/27] hw/arm/smmuv3-accel: Restrict accelerated
> >> SMMUv3 to vfio-pci endpoints with iommufd
> >>
> >>
> >> On Mon, 29 Sep 2025 14:36:22 +0100
> >> Shameer Kolothum <skolothumtho@nvidia.com> wrote:
> >>
> >>> Accelerated SMMUv3 is only useful when the device can take advantage
> of
> >>> the host's SMMUv3 in nested mode. To keep things simple and correct, we
> >>> only allow this feature for vfio-pci endpoint devices that use the iommufd
> >>> backend. We also allow non-endpoint emulated devices like PCI bridges
> and
> >>> root ports, so that users can plug in these vfio-pci devices. We can only
> >>> enforce this if devices are cold plugged. For hotplug cases, give appropriate
> >>> warnings.
> >>>
> >>> Another reason for this limit is to avoid problems with IOTLB
> >>> invalidations. Some commands (e.g., CMD_TLBI_NH_ASID) lack an
> >> associated
> >>> SID, making it difficult to trace the originating device. If we allowed
> >>> emulated endpoint devices, QEMU would have to invalidate both its own
> >>> software IOTLB and the host's hardware IOTLB, which could slow things
> >>> down.
> >>>
> >>> Since vfio-pci devices in nested mode rely on the host SMMUv3's nested
> >>> translation (S1+S2), their get_address_space() callback must return the
> >>> system address space so that VFIO core can setup correct S2 mappings
> >>> for guest RAM.
> >>>
> >>> So in short:
> >>> - vfio-pci devices(with iommufd as backend) return the system address
> >>> space.
> >>> - bridges and root ports return the IOMMU address space.
> >>>
> >>> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
> >> One question that really applies to earlier patch and an even more trivial
> >> comment on a comment than the earlier ones ;)
> >>
> >> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> >>
> >>> ---
> >>> hw/arm/smmuv3-accel.c | 68
> ++++++++++++++++++++++++++++-
> >>> hw/pci-bridge/pci_expander_bridge.c | 1 -
> >>> include/hw/pci/pci_bridge.h | 1 +
> >>> 3 files changed, 68 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
> >>> index 79f1713be6..44410cfb2a 100644
> >>> --- a/hw/arm/smmuv3-accel.c
> >>> +++ b/hw/arm/smmuv3-accel.c
> >>> static AddressSpace *smmuv3_accel_find_add_as(PCIBus *bus, void
> >> *opaque,
> >>
> >> I should have noticed this in previous patch...
> >> What does add stand for here? This name is not particularly clear to me.
> > Good question 😊.
> >
> > I believe the name comes from the smmu-common.c implementation of
> > get_address_space:
> >
> > static const PCIIOMMUOps smmu_ops = {
> > .get_address_space = smmu_find_add_as,
> > };
> > Looking at it again, that version allocates a new MR and creates a
> > new address space per sdev, so perhaps "add" referred to the address
> > space creation.
> this stems from the original terminology used in intel-iommu.c
> (vtd_find_add_as)
>
> the smmu-common code looks for a registered device corresponding to @bus
> and @devfn (this is the 'find'). If it exists it returns it, otherwise
> it allocates a bus and SMMUDevice object according to what exists and
> initializes the AddressSpace (this is the 'add').
Agree.
>
> >
> > This callback here originally did something similar but no longer does.
> I don't get why it does not do something similar anymore?
Ok. It does all of the "find" and "add" steps described above.
But for a vfio-pci dev with the IOMMUFD backend, it now returns the global
&address_space_memory. Previously we were creating a separate address
space pointing to system memory for each such device.
We could argue that, in general, what the function does is "get" the
appropriate address space for the device, so we could simply call it
get_dev_address_space().
> > So, I think it’s better to just rename it to smmuv3_accel_get_as()
> Well I would prefer we keep the original terminology to match other
> viommu code. Except of course if I misunderstood the existing code.
Ok. I will keep the same then with some comment to explain that "find"
and "add" part.
Thanks,
Shameer
* Re: [PATCH v4 06/27] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd
2025-09-29 13:36 ` [PATCH v4 06/27] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd Shameer Kolothum
2025-09-29 16:08 ` Jonathan Cameron via
@ 2025-09-30 0:11 ` Nicolin Chen
2025-10-02 7:29 ` Shameer Kolothum
2025-10-01 17:32 ` Eric Auger
2025-10-20 16:31 ` Eric Auger
3 siblings, 1 reply; 118+ messages in thread
From: Nicolin Chen @ 2025-09-30 0:11 UTC (permalink / raw)
To: Shameer Kolothum
Cc: qemu-arm, qemu-devel, eric.auger, peter.maydell, jgg, ddutile,
berrange, nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu,
shameerkolothum
On Mon, Sep 29, 2025 at 02:36:22PM +0100, Shameer Kolothum wrote:
> Accelerated SMMUv3 is only useful when the device can take advantage of
> the host's SMMUv3 in nested mode. To keep things simple and correct, we
> only allow this feature for vfio-pci endpoint devices that use the iommufd
> backend. We also allow non-endpoint emulated devices like PCI bridges and
> root ports, so that users can plug in these vfio-pci devices. We can only
> enforce this if devices are cold plugged. For hotplug cases, give appropriate
> warnings.
>
> Another reason for this limit is to avoid problems with IOTLB
> invalidations. Some commands (e.g., CMD_TLBI_NH_ASID) lack an associated
> SID, making it difficult to trace the originating device. If we allowed
> emulated endpoint devices, QEMU would have to invalidate both its own
> software IOTLB and the host's hardware IOTLB, which could slow things
> down.
>
> Since vfio-pci devices in nested mode rely on the host SMMUv3's nested
> translation (S1+S2), their get_address_space() callback must return the
> system address space so that VFIO core can setup correct S2 mappings
> for guest RAM.
>
> So in short:
> - vfio-pci devices(with iommufd as backend) return the system address
> space.
> - bridges and root ports return the IOMMU address space.
>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
With some nits:
> + /*
> + * We return the system address for vfio-pci devices(with iommufd as
> + * backend) so that the VFIO core can set up Stage-2 (S2) mappings for
> + * guest RAM. This is needed because, in the accelerated SMMUv3 case,
> + * the host SMMUv3 runs in nested (S1 + S2) mode where the guest
> + * manages its own S1 page tables while the host manages S2.
> + *
> + * We are using the global &address_space_memory here, as this will ensure
> + * same system address space pointer for all devices behind the accelerated
> + * SMMUv3s in a VM. That way VFIO/iommufd can reuse a single IOAS ID in
> + * iommufd_cdev_attach(), allowing the Stage-2 page tables to be shared
> + * within the VM instead of duplicating them for every SMMUv3 instance.
> + */
> + if (vfio_pci) {
> + return &address_space_memory;
How about:
/*
* In the accelerated case, a vfio-pci device passed through via the iommufd
* backend must stay in the system address space, as it is always translated
* by its physical SMMU (using a stage-2-only STE or a nested STE), in which
* case the stage-2 nesting parent page table is allocated by the vfio core,
* backing up the system address space.
*
* So, return the system address space via the global address_space_memory.
* The shared address_space_memory also allows devices under different vSMMU
* instances in a VM to reuse a single nesting parent HWPT in the vfio core.
*/
?
And I think this would be clearer by having get_viommu_flags() in
this patch.
> diff --git a/hw/pci-bridge/pci_expander_bridge.c b/hw/pci-bridge/pci_expander_bridge.c
> index 1bcceddbc4..a8eb2d2426 100644
> --- a/hw/pci-bridge/pci_expander_bridge.c
> +++ b/hw/pci-bridge/pci_expander_bridge.c
> @@ -48,7 +48,6 @@ struct PXBBus {
> char bus_path[8];
> };
>
> -#define TYPE_PXB_PCIE_DEV "pxb-pcie"
> OBJECT_DECLARE_SIMPLE_TYPE(PXBPCIEDev, PXB_PCIE_DEV)
>
> static GList *pxb_dev_list;
> diff --git a/include/hw/pci/pci_bridge.h b/include/hw/pci/pci_bridge.h
> index a055fd8d32..b61360b900 100644
> --- a/include/hw/pci/pci_bridge.h
> +++ b/include/hw/pci/pci_bridge.h
> @@ -106,6 +106,7 @@ typedef struct PXBPCIEDev {
>
> #define TYPE_PXB_PCIE_BUS "pxb-pcie-bus"
> #define TYPE_PXB_CXL_BUS "pxb-cxl-bus"
> +#define TYPE_PXB_PCIE_DEV "pxb-pcie"
> #define TYPE_PXB_DEV "pxb"
> OBJECT_DECLARE_SIMPLE_TYPE(PXBDev, PXB_DEV)
Maybe this can be a patch itself.
Nicolin
* RE: [PATCH v4 06/27] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd
2025-09-30 0:11 ` Nicolin Chen
@ 2025-10-02 7:29 ` Shameer Kolothum
0 siblings, 0 replies; 118+ messages in thread
From: Shameer Kolothum @ 2025-10-02 7:29 UTC (permalink / raw)
To: Nicolin Chen
Cc: qemu-arm@nongnu.org, qemu-devel@nongnu.org, eric.auger@redhat.com,
peter.maydell@linaro.org, Jason Gunthorpe, ddutile@redhat.com,
berrange@redhat.com, Nathan Chen, Matt Ochs, smostafa@google.com,
wangzhou1@hisilicon.com, jiangkunkun@huawei.com,
jonathan.cameron@huawei.com, zhangfei.gao@linaro.org,
zhenzhong.duan@intel.com, yi.l.liu@intel.com,
shameerkolothum@gmail.com
Hi Nicolin,
> -----Original Message-----
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: 30 September 2025 01:11
> To: Shameer Kolothum <skolothumtho@nvidia.com>
> Cc: qemu-arm@nongnu.org; qemu-devel@nongnu.org;
> eric.auger@redhat.com; peter.maydell@linaro.org; Jason Gunthorpe
> <jgg@nvidia.com>; ddutile@redhat.com; berrange@redhat.com; Nathan
> Chen <nathanc@nvidia.com>; Matt Ochs <mochs@nvidia.com>;
> smostafa@google.com; wangzhou1@hisilicon.com;
> jiangkunkun@huawei.com; jonathan.cameron@huawei.com;
> zhangfei.gao@linaro.org; zhenzhong.duan@intel.com; yi.l.liu@intel.com;
> shameerkolothum@gmail.com
> Subject: Re: [PATCH v4 06/27] hw/arm/smmuv3-accel: Restrict accelerated
> SMMUv3 to vfio-pci endpoints with iommufd
>
> On Mon, Sep 29, 2025 at 02:36:22PM +0100, Shameer Kolothum wrote:
> > Accelerated SMMUv3 is only useful when the device can take advantage of
> > the host's SMMUv3 in nested mode. To keep things simple and correct, we
> > only allow this feature for vfio-pci endpoint devices that use the iommufd
> > backend. We also allow non-endpoint emulated devices like PCI bridges and
> > root ports, so that users can plug in these vfio-pci devices. We can only
> > enforce this if devices are cold plugged. For hotplug cases, give appropriate
> > warnings.
> >
> > Another reason for this limit is to avoid problems with IOTLB
> > invalidations. Some commands (e.g., CMD_TLBI_NH_ASID) lack an
> associated
> > SID, making it difficult to trace the originating device. If we allowed
> > emulated endpoint devices, QEMU would have to invalidate both its own
> > software IOTLB and the host's hardware IOTLB, which could slow things
> > down.
> >
> > Since vfio-pci devices in nested mode rely on the host SMMUv3's nested
> > translation (S1+S2), their get_address_space() callback must return the
> > system address space so that VFIO core can setup correct S2 mappings
> > for guest RAM.
> >
> > So in short:
> > - vfio-pci devices(with iommufd as backend) return the system address
> > space.
> > - bridges and root ports return the IOMMU address space.
> >
> > Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
>
> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
>
> With some nits:
>
> > + /*
> > + * We return the system address for vfio-pci devices(with iommufd as
> > + * backend) so that the VFIO core can set up Stage-2 (S2) mappings for
> > + * guest RAM. This is needed because, in the accelerated SMMUv3 case,
> > + * the host SMMUv3 runs in nested (S1 + S2) mode where the guest
> > + * manages its own S1 page tables while the host manages S2.
> > + *
> > + * We are using the global &address_space_memory here, as this will
> ensure
> > + * same system address space pointer for all devices behind the
> accelerated
> > + * SMMUv3s in a VM. That way VFIO/iommufd can reuse a single IOAS ID
> in
> > + * iommufd_cdev_attach(), allowing the Stage-2 page tables to be
> shared
> > + * within the VM instead of duplicating them for every SMMUv3
> instance.
> > + */
> > + if (vfio_pci) {
> > + return &address_space_memory;
>
> How about:
>
> /*
> * In the accelerated case, a vfio-pci device passed through via the iommufd
> * backend must stay in the system address space, as it is always translated
> * by its physical SMMU (using a stage-2-only STE or a nested STE), in which
> * case the stage-2 nesting parent page table is allocated by the vfio core,
> * backing up the system address space.
> *
> * So, return the system address space via the global
> address_space_memory.
> * The shared address_space_memory also allows devices under different
> vSMMU
> * instances in a VM to reuse a single nesting parent HWPT in the vfio core.
> */
> ?
Ok. I will go through the descriptions and comments in this series again and
will try to improve it.
Thanks,
Shameer
* Re: [PATCH v4 06/27] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd
2025-09-29 13:36 ` [PATCH v4 06/27] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd Shameer Kolothum
2025-09-29 16:08 ` Jonathan Cameron via
2025-09-30 0:11 ` Nicolin Chen
@ 2025-10-01 17:32 ` Eric Auger
2025-10-02 9:30 ` Shameer Kolothum
2025-10-20 16:31 ` Eric Auger
3 siblings, 1 reply; 118+ messages in thread
From: Eric Auger @ 2025-10-01 17:32 UTC (permalink / raw)
To: Shameer Kolothum, qemu-arm, qemu-devel
Cc: peter.maydell, jgg, nicolinc, ddutile, berrange, nathanc, mochs,
smostafa, wangzhou1, jiangkunkun, jonathan.cameron, zhangfei.gao,
zhenzhong.duan, yi.l.liu, shameerkolothum
Hi Shameer,
On 9/29/25 3:36 PM, Shameer Kolothum wrote:
> Accelerated SMMUv3 is only useful when the device can take advantage of
> the host's SMMUv3 in nested mode. To keep things simple and correct, we
> only allow this feature for vfio-pci endpoint devices that use the iommufd
> backend. We also allow non-endpoint emulated devices like PCI bridges and
> root ports, so that users can plug in these vfio-pci devices. We can only
> enforce this if devices are cold plugged. For hotplug cases, give appropriate
"We can only enforce this if devices are cold plugged": I don't really understand that statement. you do checks when the device is hotplugged too. For emulated device you eventually allow them but you could decide to reject them?
> warnings.
>
> Another reason for this limit is to avoid problems with IOTLB
> invalidations. Some commands (e.g., CMD_TLBI_NH_ASID) lack an associated
> SID, making it difficult to trace the originating device. If we allowed
> emulated endpoint devices, QEMU would have to invalidate both its own
> software IOTLB and the host's hardware IOTLB, which could slow things
> down.
>
> Since vfio-pci devices in nested mode rely on the host SMMUv3's nested
> translation (S1+S2), their get_address_space() callback must return the
> system address space so that VFIO core can setup correct S2 mappings
> for guest RAM.
>
> So in short:
> - vfio-pci devices(with iommufd as backend) return the system address
> space.
> - bridges and root ports return the IOMMU address space.
>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
> ---
> hw/arm/smmuv3-accel.c | 68 ++++++++++++++++++++++++++++-
> hw/pci-bridge/pci_expander_bridge.c | 1 -
> include/hw/pci/pci_bridge.h | 1 +
> 3 files changed, 68 insertions(+), 2 deletions(-)
>
> diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
> index 79f1713be6..44410cfb2a 100644
> --- a/hw/arm/smmuv3-accel.c
> +++ b/hw/arm/smmuv3-accel.c
> @@ -7,8 +7,13 @@
> */
>
> #include "qemu/osdep.h"
> +#include "qemu/error-report.h"
>
> #include "hw/arm/smmuv3.h"
> +#include "hw/pci/pci_bridge.h"
> +#include "hw/pci-host/gpex.h"
> +#include "hw/vfio/pci.h"
> +
> #include "smmuv3-accel.h"
>
> static SMMUv3AccelDevice *smmuv3_accel_get_dev(SMMUState *bs, SMMUPciBus *sbus,
> @@ -29,15 +34,76 @@ static SMMUv3AccelDevice *smmuv3_accel_get_dev(SMMUState *bs, SMMUPciBus *sbus,
> return accel_dev;
> }
>
> +static bool smmuv3_accel_pdev_allowed(PCIDevice *pdev, bool *vfio_pci)
> +{
> +
> + if (object_dynamic_cast(OBJECT(pdev), TYPE_PCI_BRIDGE) ||
> + object_dynamic_cast(OBJECT(pdev), TYPE_PXB_PCIE_DEV) ||
> + object_dynamic_cast(OBJECT(pdev), TYPE_GPEX_ROOT_DEVICE)) {
> + return true;
> + } else if ((object_dynamic_cast(OBJECT(pdev), TYPE_VFIO_PCI))) {
> + *vfio_pci = true;
> + if (object_property_get_link(OBJECT(pdev), "iommufd", NULL)) {
> + return true;
> + }
> + }
> + return false;
> +}
> +
> static AddressSpace *smmuv3_accel_find_add_as(PCIBus *bus, void *opaque,
> int devfn)
> {
> + PCIDevice *pdev = pci_find_device(bus, pci_bus_num(bus), devfn);
> SMMUState *bs = opaque;
> SMMUPciBus *sbus = smmu_get_sbus(bs, bus);
> SMMUv3AccelDevice *accel_dev = smmuv3_accel_get_dev(bs, sbus, bus, devfn);
> SMMUDevice *sdev = &accel_dev->sdev;
> + bool vfio_pci = false;
> +
> + if (pdev && !smmuv3_accel_pdev_allowed(pdev, &vfio_pci)) {
> + if (DEVICE(pdev)->hotplugged) {
> + if (vfio_pci) {
> + warn_report("Hot plugging a vfio-pci device (%s) without "
> + "iommufd as backend is not supported", pdev->name);
with accelerated SMMUv3.
why don't we return NULL and properly handle this in the caller. May be
worth adding an errp to get_address_space(). I know this is cumbersome
though.
> + } else {
> + warn_report("Hot plugging an emulated device %s with "
> + "accelerated SMMUv3. This will bring down "
> + "performace", pdev->name);
performance
> + }
> + /*
> + * Both cases, we will return IOMMU address space. For hotplugged
In both cases?
> + * vfio-pci dev without iommufd as backend, it will fail later in
> + * smmuv3_notify_flag_changed() with "requires iommu MAP notifier"
> + * error message.
> + */
> + return &sdev->as;
> + } else {
> + error_report("Device(%s) not allowed. Only PCIe root complex "
device (%s)
> + "devices or PCI bridge devices or vfio-pci endpoint "
> + "devices with iommufd as backend is allowed with "
> + "arm-smmuv3,accel=on", pdev->name);
> + exit(1);
> + }
> + }
>
> - return &sdev->as;
> + /*
> + * We return the system address for vfio-pci devices(with iommufd as
> + * backend) so that the VFIO core can set up Stage-2 (S2) mappings for
> + * guest RAM. This is needed because, in the accelerated SMMUv3 case,
> + * the host SMMUv3 runs in nested (S1 + S2) mode where the guest
> + * manages its own S1 page tables while the host manages S2.
> + *
> + * We are using the global &address_space_memory here, as this will ensure
> + * same system address space pointer for all devices behind the accelerated
> + * SMMUv3s in a VM. That way VFIO/iommufd can reuse a single IOAS ID in
> + * iommufd_cdev_attach(), allowing the Stage-2 page tables to be shared
> + * within the VM instead of duplicating them for every SMMUv3 instance.
> + */
> + if (vfio_pci) {
> + return &address_space_memory;
> + } else {
> + return &sdev->as;
> + }
> }
>
> static const PCIIOMMUOps smmuv3_accel_ops = {
> diff --git a/hw/pci-bridge/pci_expander_bridge.c b/hw/pci-bridge/pci_expander_bridge.c
> index 1bcceddbc4..a8eb2d2426 100644
> --- a/hw/pci-bridge/pci_expander_bridge.c
> +++ b/hw/pci-bridge/pci_expander_bridge.c
> @@ -48,7 +48,6 @@ struct PXBBus {
> char bus_path[8];
> };
>
> -#define TYPE_PXB_PCIE_DEV "pxb-pcie"
> OBJECT_DECLARE_SIMPLE_TYPE(PXBPCIEDev, PXB_PCIE_DEV)
>
> static GList *pxb_dev_list;
> diff --git a/include/hw/pci/pci_bridge.h b/include/hw/pci/pci_bridge.h
> index a055fd8d32..b61360b900 100644
> --- a/include/hw/pci/pci_bridge.h
> +++ b/include/hw/pci/pci_bridge.h
> @@ -106,6 +106,7 @@ typedef struct PXBPCIEDev {
>
> #define TYPE_PXB_PCIE_BUS "pxb-pcie-bus"
> #define TYPE_PXB_CXL_BUS "pxb-cxl-bus"
> +#define TYPE_PXB_PCIE_DEV "pxb-pcie"
I agree with Nicolin, you shall rather move that change in a seperate patch.
> #define TYPE_PXB_DEV "pxb"
> OBJECT_DECLARE_SIMPLE_TYPE(PXBDev, PXB_DEV)
>
Thanks
Eric
* RE: [PATCH v4 06/27] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd
2025-10-01 17:32 ` Eric Auger
@ 2025-10-02 9:30 ` Shameer Kolothum
2025-10-17 12:47 ` Eric Auger
0 siblings, 1 reply; 118+ messages in thread
From: Shameer Kolothum @ 2025-10-02 9:30 UTC (permalink / raw)
To: eric.auger@redhat.com, qemu-arm@nongnu.org, qemu-devel@nongnu.org
Cc: peter.maydell@linaro.org, Jason Gunthorpe, Nicolin Chen,
ddutile@redhat.com, berrange@redhat.com, Nathan Chen, Matt Ochs,
smostafa@google.com, wangzhou1@hisilicon.com,
jiangkunkun@huawei.com, jonathan.cameron@huawei.com,
zhangfei.gao@linaro.org, zhenzhong.duan@intel.com,
yi.l.liu@intel.com, shameerkolothum@gmail.com
Hi Eric,
> -----Original Message-----
> From: Eric Auger <eric.auger@redhat.com>
> Sent: 01 October 2025 18:32
> To: Shameer Kolothum <skolothumtho@nvidia.com>; qemu-
> arm@nongnu.org; qemu-devel@nongnu.org
> Cc: peter.maydell@linaro.org; Jason Gunthorpe <jgg@nvidia.com>; Nicolin
> Chen <nicolinc@nvidia.com>; ddutile@redhat.com; berrange@redhat.com;
> Nathan Chen <nathanc@nvidia.com>; Matt Ochs <mochs@nvidia.com>;
> smostafa@google.com; wangzhou1@hisilicon.com;
> jiangkunkun@huawei.com; jonathan.cameron@huawei.com;
> zhangfei.gao@linaro.org; zhenzhong.duan@intel.com; yi.l.liu@intel.com;
> shameerkolothum@gmail.com
> Subject: Re: [PATCH v4 06/27] hw/arm/smmuv3-accel: Restrict accelerated
> SMMUv3 to vfio-pci endpoints with iommufd
>
> External email: Use caution opening links or attachments
>
> Hi Shameer,
>
> On 9/29/25 3:36 PM, Shameer Kolothum wrote:
> > Accelerated SMMUv3 is only useful when the device can take advantage
> > of the host's SMMUv3 in nested mode. To keep things simple and
> > correct, we only allow this feature for vfio-pci endpoint devices that
> > use the iommufd backend. We also allow non-endpoint emulated devices
> > like PCI bridges and root ports, so that users can plug in these
> > vfio-pci devices. We can only enforce this if devices are cold
> > plugged. For hotplug cases, give appropriate
>
> "We can only enforce this if devices are cold plugged": I don't really
> understand that statement.
By "enforce" here I meant that we can prevent the user from starting a Guest
with a non "vfio-pci/iommufd dev" in the accel=on case.
> you do checks when the device is hotplugged too.
> For emulated device you eventually allow them but you could decide to reject
> them?
Currently get_address_space() is a "Mandatory callback which returns a pointer
to an #AddressSpace". Changing that and propagating an error all the way, as
you said below, is not that straightforward. At present we warn the user
appropriately for both the vfio-pci without iommufd and the emulated device
hotplug cases. Perhaps, if required, the error handling can be taken up as a clean-up series
later?
Also, I think I need to explain the emulated device hotplug case a bit more. This
is something I realised later during the tests.
Unfortunately, the hotplug scenario for emulated devices behaves differently.
What I’ve noticed is that the hotplug handler’s call path to get_address_space()
differs from cold-plug cases.
In the emulated device hotplug case, the pdev is NULL for below:
PCIDevice *pdev = pci_find_device(bus, pci_bus_num(bus), devfn);
Here’s what seems to be happening:

do_pci_register_device() {
    ....
    if (phase_check(PHASE_MACHINE_READY)) {
        pci_init_bus_master(pci_dev);
        pci_device_iommu_address_space() --> get_address_space()
    }
    ....
    bus->devices[devfn] = pci_dev; // happens only after the above call.
}
For vfio-pci hotplug, we’re fine, since the vfio layer calls get_address_space()
again, with a valid pdev.
For cold-plug cases, the if (phase_check(PHASE_MACHINE_READY)) check is
false, and the call path looks like this:
pcibus_machine_done()
    pci_init_bus_master(pci_dev);
        pci_device_iommu_address_space() --> get_address_space()
By then we have a valid pdev.
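To make the ordering asymmetry concrete, here is a toy, self-contained C model (all names hypothetical, not QEMU code): a device table that is only populated after the address-space callback has run makes the lookup return NULL in the hotplug-style order, mirroring what pci_find_device() sees above.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEVFN 8

/* Toy stand-in for the structures involved; not real QEMU types. */
typedef struct ToyDevice { int devfn; } ToyDevice;

static ToyDevice *bus_devices[MAX_DEVFN]; /* models bus->devices[] */

/* Models pci_find_device(): only sees devices already registered. */
static ToyDevice *toy_find_device(int devfn)
{
    return bus_devices[devfn];
}

/* Models the get_address_space() callback doing a lookup by devfn. */
static ToyDevice *toy_get_address_space(int devfn)
{
    return toy_find_device(devfn);
}

/* Hotplug-like order: the callback runs BEFORE registration. */
static ToyDevice *register_hotplug(ToyDevice *dev)
{
    ToyDevice *seen = toy_get_address_space(dev->devfn); /* NULL here */
    bus_devices[dev->devfn] = dev;                       /* too late  */
    return seen;
}

/* Cold-plug-like order: registration first, callback afterwards. */
static ToyDevice *register_coldplug(ToyDevice *dev)
{
    bus_devices[dev->devfn] = dev;
    return toy_get_address_space(dev->devfn);            /* valid ptr */
}
```

In this model register_hotplug() observes NULL while register_coldplug() observes the device; moving the registration before the callback, as in the reordering options discussed here, makes both paths behave the same.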
I’m not sure there’s an easy fix here. One option could be to modify
get_address_space() to take pci_dev as input. Or we could change the
call path order above.
(See my below reply to emulated dev warn_report() case as well)
Please let me know your thoughts.
> > warnings.
> >
[...]
> > +
> > + if (pdev && !smmuv3_accel_pdev_allowed(pdev, &vfio_pci)) {
> > + if (DEVICE(pdev)->hotplugged) {
> > + if (vfio_pci) {
> > + warn_report("Hot plugging a vfio-pci device (%s) without "
> > + "iommufd as backend is not supported",
> > + pdev->name);
> with accelerated SMMUv3.
>
> why don't we return NULL and properly handle this in the caller. May be worth
> adding an errp to get_address_space(). I know this is cumbersome though.
See above reply on propagating err from this callback.
> > + } else {
> > + warn_report("Hot plugging an emulated device %s with "
> > + "accelerated SMMUv3. This will bring down "
> > + "performace", pdev->name);
> performance
> > + }
As I mentioned above, since pdev in the emulated dev hotplug case is NULL,
we will not hit the above warning.
> > + /*
> > + * Both cases, we will return IOMMU address space. For
> > + hotplugged
> In both cases?
Yes, since we can't return NULL here. However, as done here, we will inform
the user appropriately.
> > + * vfio-pci dev without iommufd as backend, it will fail later in
> > + * smmuv3_notify_flag_changed() with "requires iommu MAP
> notifier"
[...]
> > +#define TYPE_PXB_PCIE_DEV "pxb-pcie"
> I agree with Nicolin, you shall rather move that change in a separate patch.
I thought of mentioning this change in the commit log (which I missed) and
avoiding a separate patch just for this. But if you guys feel strongly, I will
have a separate one.
Thanks,
Shameer
* Re: [PATCH v4 06/27] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd
2025-10-02 9:30 ` Shameer Kolothum
@ 2025-10-17 12:47 ` Eric Auger
2025-10-17 13:15 ` Shameer Kolothum
0 siblings, 1 reply; 118+ messages in thread
From: Eric Auger @ 2025-10-17 12:47 UTC (permalink / raw)
To: Shameer Kolothum, qemu-arm@nongnu.org, qemu-devel@nongnu.org
Cc: peter.maydell@linaro.org, Jason Gunthorpe, Nicolin Chen,
ddutile@redhat.com, berrange@redhat.com, Nathan Chen, Matt Ochs,
smostafa@google.com, wangzhou1@hisilicon.com,
jiangkunkun@huawei.com, jonathan.cameron@huawei.com,
zhangfei.gao@linaro.org, zhenzhong.duan@intel.com,
yi.l.liu@intel.com, shameerkolothum@gmail.com
Hi Shameer,
On 10/2/25 11:30 AM, Shameer Kolothum wrote:
> Hi Eric,
>
>> -----Original Message-----
>> From: Eric Auger <eric.auger@redhat.com>
>> Sent: 01 October 2025 18:32
>> To: Shameer Kolothum <skolothumtho@nvidia.com>; qemu-
>> arm@nongnu.org; qemu-devel@nongnu.org
>> Cc: peter.maydell@linaro.org; Jason Gunthorpe <jgg@nvidia.com>; Nicolin
>> Chen <nicolinc@nvidia.com>; ddutile@redhat.com; berrange@redhat.com;
>> Nathan Chen <nathanc@nvidia.com>; Matt Ochs <mochs@nvidia.com>;
>> smostafa@google.com; wangzhou1@hisilicon.com;
>> jiangkunkun@huawei.com; jonathan.cameron@huawei.com;
>> zhangfei.gao@linaro.org; zhenzhong.duan@intel.com; yi.l.liu@intel.com;
>> shameerkolothum@gmail.com
>> Subject: Re: [PATCH v4 06/27] hw/arm/smmuv3-accel: Restrict accelerated
>> SMMUv3 to vfio-pci endpoints with iommufd
>>
>> External email: Use caution opening links or attachments
>>
>> Hi Shameer,
>>
>> On 9/29/25 3:36 PM, Shameer Kolothum wrote:
>>> Accelerated SMMUv3 is only useful when the device can take advantage
>>> of the host's SMMUv3 in nested mode. To keep things simple and
>>> correct, we only allow this feature for vfio-pci endpoint devices that
>>> use the iommufd backend. We also allow non-endpoint emulated devices
>>> like PCI bridges and root ports, so that users can plug in these
>>> vfio-pci devices. We can only enforce this if devices are cold
>>> plugged. For hotplug cases, give appropriate
>> "We can only enforce this if devices are cold plugged": I don't really
>> understand that statement.
> By "enforce" here I meant, we can prevent user from starting a Guest
> with a non "vfio-pci/iommufd dev" with accel=on case.
Ah OK I misread the code. I thought you were also exiting in case of
hotplug but you only issue a warn_report.
From a user point of view, the assigned device will succeed attachment
but won't work. Will we get subsequent messages? I understand the pain
of propagating the error but if the user experience is bad I think it
should weight over ?
>
> you do checks when the device is hotplugged too.
>> For emulated device you eventually allow them but you could decide to reject
>> them?
> Currently get_address_space() is a " Mandatory callback which returns a pointer
> to an #AddressSpace". Changing that and propagating an error all the way, as
> you said below, is not that straightforward. At present we warn the user
> appropriately for both vfio-pci without iommufd and emulated device hot plug
> cases. Perhaps, if required, the error handling can be taken up as a clean-up series
> later?
>
> Also, I think I need to explain the emulated device hotplug case a bit more. This
> is something I realised later during the tests.
>
> Unfortunately, the hotplug scenario for emulated devices behaves differently.
> What I’ve noticed is that the hotplug handler’s call path to get_address_space()
> differs from cold-plug cases.
>
> In the emulated device hotplug case, the pdev is NULL for below:
> PCIDevice *pdev = pci_find_device(bus, pci_bus_num(bus), devfn);
>
> Here’s what seems to be happening:
>
> do_pci_register_device() {
>     ....
>     if (phase_check(PHASE_MACHINE_READY)) {
>         pci_init_bus_master(pci_dev);
>         pci_device_iommu_address_space() --> get_address_space()
>     }
>     ....
>     bus->devices[devfn] = pci_dev; // happens only after the above call.
> }
>
> For vfio-pci hotplug, we’re fine, since the vfio layer calls get_address_space()
> again, with a valid pdev.
>
> For cold-plug cases, the if (phase_check(PHASE_MACHINE_READY)) check is
> false, and the call path looks like this:
>
> pcibus_machine_done()
>     pci_init_bus_master(pci_dev);
>         pci_device_iommu_address_space() --> get_address_space()
>
> By then we have a valid pdev.
>
> I’m not sure there’s an easy fix here. One option could be to modify
> get_address_space() to take pci_dev as input. Or we could change the
> call path order above.
>
> (See my below reply to emulated dev warn_report() case as well)
>
> Please let me know your thoughts.
Can't you move the assignment of bus->devices[devfn] before the call and
unset it in case of failure?
Or if you propagate errors from
get_address_space() you could retry the call later?
Eric
>
>>> warnings.
>>>
> [...]
>
>>> +
>>> + if (pdev && !smmuv3_accel_pdev_allowed(pdev, &vfio_pci)) {
>>> + if (DEVICE(pdev)->hotplugged) {
>>> + if (vfio_pci) {
>>> + warn_report("Hot plugging a vfio-pci device (%s) without "
>>> + "iommufd as backend is not supported",
>>> + pdev->name);
>> with accelerated SMMUv3.
>>
>> why don't we return NULL and properly handle this in the caller. May be worth
>> adding an errp to get_address_space(). I know this is cumbersome though.
> See above reply on propagating err from this callback.
>
>>> + } else {
>>> + warn_report("Hot plugging an emulated device %s with "
>>> + "accelerated SMMUv3. This will bring down "
>>> + "performace", pdev->name);
>> performance
>>> + }
> As I mentioned above, since the pdev for emulated dev hotplug case is NULL,
> we will not hit the above warning.
>
>>> + /*
>>> + * Both cases, we will return IOMMU address space. For
>>> + hotplugged
>> In both cases?
> Yes, since we can't return NULL here. However, as done here, we will inform
> the user appropriately.
>
>>> + * vfio-pci dev without iommufd as backend, it will fail later in
>>> + * smmuv3_notify_flag_changed() with "requires iommu MAP
>> notifier"
> [...]
>
>>> +#define TYPE_PXB_PCIE_DEV "pxb-pcie"
>> I agree with Nicolin, you shall rather move that change in a separate patch.
> I thought of mentioning this change in the commit log(which I missed) and
> avoiding a separate patch just for this. But if you guys feel strongly, I will
> have a separate one.
>
> Thanks,
> Shameer
* RE: [PATCH v4 06/27] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd
2025-10-17 12:47 ` Eric Auger
@ 2025-10-17 13:15 ` Shameer Kolothum
2025-10-17 17:19 ` Eric Auger
0 siblings, 1 reply; 118+ messages in thread
From: Shameer Kolothum @ 2025-10-17 13:15 UTC (permalink / raw)
To: eric.auger@redhat.com, qemu-arm@nongnu.org, qemu-devel@nongnu.org
Cc: peter.maydell@linaro.org, Jason Gunthorpe, Nicolin Chen,
ddutile@redhat.com, berrange@redhat.com, Nathan Chen, Matt Ochs,
smostafa@google.com, wangzhou1@hisilicon.com,
jiangkunkun@huawei.com, jonathan.cameron@huawei.com,
zhangfei.gao@linaro.org, zhenzhong.duan@intel.com,
yi.l.liu@intel.com, shameerkolothum@gmail.com
Hi Eric,
> -----Original Message-----
> From: Eric Auger <eric.auger@redhat.com>
> Sent: 17 October 2025 13:47
> To: Shameer Kolothum <skolothumtho@nvidia.com>; qemu-
> arm@nongnu.org; qemu-devel@nongnu.org
> Cc: peter.maydell@linaro.org; Jason Gunthorpe <jgg@nvidia.com>; Nicolin
> Chen <nicolinc@nvidia.com>; ddutile@redhat.com; berrange@redhat.com;
> Nathan Chen <nathanc@nvidia.com>; Matt Ochs <mochs@nvidia.com>;
> smostafa@google.com; wangzhou1@hisilicon.com;
> jiangkunkun@huawei.com; jonathan.cameron@huawei.com;
> zhangfei.gao@linaro.org; zhenzhong.duan@intel.com; yi.l.liu@intel.com;
> shameerkolothum@gmail.com
> Subject: Re: [PATCH v4 06/27] hw/arm/smmuv3-accel: Restrict accelerated
> SMMUv3 to vfio-pci endpoints with iommufd
>
> External email: Use caution opening links or attachments
>
>
> Hi Shameer,
>
> On 10/2/25 11:30 AM, Shameer Kolothum wrote:
> > Hi Eric,
> >
> >> -----Original Message-----
> >> From: Eric Auger <eric.auger@redhat.com>
> >> Sent: 01 October 2025 18:32
> >> To: Shameer Kolothum <skolothumtho@nvidia.com>; qemu-
> >> arm@nongnu.org; qemu-devel@nongnu.org
> >> Cc: peter.maydell@linaro.org; Jason Gunthorpe <jgg@nvidia.com>; Nicolin
> >> Chen <nicolinc@nvidia.com>; ddutile@redhat.com; berrange@redhat.com;
> >> Nathan Chen <nathanc@nvidia.com>; Matt Ochs <mochs@nvidia.com>;
> >> smostafa@google.com; wangzhou1@hisilicon.com;
> >> jiangkunkun@huawei.com; jonathan.cameron@huawei.com;
> >> zhangfei.gao@linaro.org; zhenzhong.duan@intel.com; yi.l.liu@intel.com;
> >> shameerkolothum@gmail.com
> >> Subject: Re: [PATCH v4 06/27] hw/arm/smmuv3-accel: Restrict accelerated
> >> SMMUv3 to vfio-pci endpoints with iommufd
> >>
> >> External email: Use caution opening links or attachments
> >>
> >> Hi Shameer,
> >>
> >> On 9/29/25 3:36 PM, Shameer Kolothum wrote:
> >>> Accelerated SMMUv3 is only useful when the device can take advantage
> >>> of the host's SMMUv3 in nested mode. To keep things simple and
> >>> correct, we only allow this feature for vfio-pci endpoint devices that
> >>> use the iommufd backend. We also allow non-endpoint emulated devices
> >>> like PCI bridges and root ports, so that users can plug in these
> >>> vfio-pci devices. We can only enforce this if devices are cold
> >>> plugged. For hotplug cases, give appropriate
> >> "We can only enforce this if devices are cold plugged": I don't really
> >> understand that statement.
> > By "enforce" here I meant, we can prevent user from starting a Guest
> > with a non "vfio-pci/iommufd dev" with accel=one case.
> Ah OK I misread the code. I thought you were also exiting in case of
> hotplug but you only issue a warn_report.
> From a user point of view, the assigned device will succeed attachment
> but won't work. Will we get subsequent messages?
It will work. But as the warning says, it may degrade performance, especially
if the SMMUv3 has other vfio-pci devices, because the TLB invalidations
from the emulated one will be issued to the host SMMUv3 as well.
> I understand the pain
> of propagating the error but if the user experience is bad I think it
> should weight over ?
I am not against it. But can be taken up as a separate one if required.
> >
> > Please let me know your thoughts.
> Can't you move the assignment of bus->devices[devfn] before the call and
> unset it in case of failure?
>
> Or if you propagate errors from
>
> get_address_space() you could retry the call later?
For now, I have a fix like below that seems to do the job.
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index c9932c87e3..9693d7f10c 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -1370,9 +1370,6 @@ static PCIDevice *do_pci_register_device(PCIDevice *pci_dev,
pci_dev->bus_master_as.max_bounce_buffer_size =
pci_dev->max_bounce_buffer_size;
- if (phase_check(PHASE_MACHINE_READY)) {
- pci_init_bus_master(pci_dev);
- }
pci_dev->irq_state = 0;
pci_config_alloc(pci_dev);
@@ -1416,6 +1413,9 @@ static PCIDevice *do_pci_register_device(PCIDevice *pci_dev,
pci_dev->config_write = config_write;
bus->devices[devfn] = pci_dev;
pci_dev->version_id = 2; /* Current pci device vmstate version */
+ if (phase_check(PHASE_MACHINE_READY)) {
+ pci_init_bus_master(pci_dev);
+ }
return pci_dev;
}
Thanks,
Shameer
* Re: [PATCH v4 06/27] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd
2025-10-17 13:15 ` Shameer Kolothum
@ 2025-10-17 17:19 ` Eric Auger
0 siblings, 0 replies; 118+ messages in thread
From: Eric Auger @ 2025-10-17 17:19 UTC (permalink / raw)
To: Shameer Kolothum, qemu-arm@nongnu.org, qemu-devel@nongnu.org
Cc: peter.maydell@linaro.org, Jason Gunthorpe, Nicolin Chen,
ddutile@redhat.com, berrange@redhat.com, Nathan Chen, Matt Ochs,
smostafa@google.com, wangzhou1@hisilicon.com,
jiangkunkun@huawei.com, jonathan.cameron@huawei.com,
zhangfei.gao@linaro.org, zhenzhong.duan@intel.com,
yi.l.liu@intel.com, shameerkolothum@gmail.com
On 10/17/25 3:15 PM, Shameer Kolothum wrote:
> Hi Eric,
>
>> -----Original Message-----
>> From: Eric Auger <eric.auger@redhat.com>
>> Sent: 17 October 2025 13:47
>> To: Shameer Kolothum <skolothumtho@nvidia.com>; qemu-
>> arm@nongnu.org; qemu-devel@nongnu.org
>> Cc: peter.maydell@linaro.org; Jason Gunthorpe <jgg@nvidia.com>; Nicolin
>> Chen <nicolinc@nvidia.com>; ddutile@redhat.com; berrange@redhat.com;
>> Nathan Chen <nathanc@nvidia.com>; Matt Ochs <mochs@nvidia.com>;
>> smostafa@google.com; wangzhou1@hisilicon.com;
>> jiangkunkun@huawei.com; jonathan.cameron@huawei.com;
>> zhangfei.gao@linaro.org; zhenzhong.duan@intel.com; yi.l.liu@intel.com;
>> shameerkolothum@gmail.com
>> Subject: Re: [PATCH v4 06/27] hw/arm/smmuv3-accel: Restrict accelerated
>> SMMUv3 to vfio-pci endpoints with iommufd
>>
>> External email: Use caution opening links or attachments
>>
>>
>> Hi Shameer,
>>
>> On 10/2/25 11:30 AM, Shameer Kolothum wrote:
>>> Hi Eric,
>>>
>>>> -----Original Message-----
>>>> From: Eric Auger <eric.auger@redhat.com>
>>>> Sent: 01 October 2025 18:32
>>>> To: Shameer Kolothum <skolothumtho@nvidia.com>; qemu-
>>>> arm@nongnu.org; qemu-devel@nongnu.org
>>>> Cc: peter.maydell@linaro.org; Jason Gunthorpe <jgg@nvidia.com>; Nicolin
>>>> Chen <nicolinc@nvidia.com>; ddutile@redhat.com; berrange@redhat.com;
>>>> Nathan Chen <nathanc@nvidia.com>; Matt Ochs <mochs@nvidia.com>;
>>>> smostafa@google.com; wangzhou1@hisilicon.com;
>>>> jiangkunkun@huawei.com; jonathan.cameron@huawei.com;
>>>> zhangfei.gao@linaro.org; zhenzhong.duan@intel.com; yi.l.liu@intel.com;
>>>> shameerkolothum@gmail.com
>>>> Subject: Re: [PATCH v4 06/27] hw/arm/smmuv3-accel: Restrict accelerated
>>>> SMMUv3 to vfio-pci endpoints with iommufd
>>>>
>>>> External email: Use caution opening links or attachments
>>>>
>>>> Hi Shameer,
>>>>
>>>> On 9/29/25 3:36 PM, Shameer Kolothum wrote:
>>>>> Accelerated SMMUv3 is only useful when the device can take advantage
>>>>> of the host's SMMUv3 in nested mode. To keep things simple and
>>>>> correct, we only allow this feature for vfio-pci endpoint devices that
>>>>> use the iommufd backend. We also allow non-endpoint emulated devices
>>>>> like PCI bridges and root ports, so that users can plug in these
>>>>> vfio-pci devices. We can only enforce this if devices are cold
>>>>> plugged. For hotplug cases, give appropriate
>>>> "We can only enforce this if devices are cold plugged": I don't really
>>>> understand that statement.
>>> By "enforce" here I meant, we can prevent user from starting a Guest
>>> with a non "vfio-pci/iommufd dev" with accel=on case.
>> Ah OK I misread the code. I thought you were also exiting in case of
>> hotplug but you only issue a warn_report.
>> From a user point of view, the assigned device will succeed attachment
>> but won't work. Will we get subsequent messages?
> It will work. But as the warning says, it may degrade the performance especially
> if the SMMUv3 has other vfio-pci devices, because the TLB invalidations
> from emulated one will be issued to host SMMUv3 as well.
>
>> I understand the pain
>> of propagating the error but if the user experience is bad I think it
>> should weight over ?
> I am not against it. But can be taken up as a separate one if required.
>
>>> Please let me know your thoughts.
>> Can't you move the assignment of bus->devices[devfn] before the call and
>> unset it in case of failure?
>>
>> Or if you propagate errors from
>>
>> get_address_space() you could retry the call later?
> For now, I have a fix like below that seems to do the
> job.
>
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index c9932c87e3..9693d7f10c 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -1370,9 +1370,6 @@ static PCIDevice *do_pci_register_device(PCIDevice *pci_dev,
> pci_dev->bus_master_as.max_bounce_buffer_size =
> pci_dev->max_bounce_buffer_size;
>
> - if (phase_check(PHASE_MACHINE_READY)) {
> - pci_init_bus_master(pci_dev);
> - }
> pci_dev->irq_state = 0;
> pci_config_alloc(pci_dev);
>
> @@ -1416,6 +1413,9 @@ static PCIDevice *do_pci_register_device(PCIDevice *pci_dev,
> pci_dev->config_write = config_write;
> bus->devices[devfn] = pci_dev;
> pci_dev->version_id = 2; /* Current pci device vmstate version */
> + if (phase_check(PHASE_MACHINE_READY)) {
> + pci_init_bus_master(pci_dev);
> + }
> return pci_dev;
OK worth putting it in a separate patch to allow finer review of PCI
maintainers.
Thanks
Eric
> }
>
> Thanks,
> Shameer
* Re: [PATCH v4 06/27] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd
2025-09-29 13:36 ` [PATCH v4 06/27] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd Shameer Kolothum
` (2 preceding siblings ...)
2025-10-01 17:32 ` Eric Auger
@ 2025-10-20 16:31 ` Eric Auger
2025-10-20 18:25 ` Nicolin Chen
3 siblings, 1 reply; 118+ messages in thread
From: Eric Auger @ 2025-10-20 16:31 UTC (permalink / raw)
To: Shameer Kolothum, qemu-arm, qemu-devel
Cc: peter.maydell, jgg, nicolinc, ddutile, berrange, nathanc, mochs,
smostafa, wangzhou1, jiangkunkun, jonathan.cameron, zhangfei.gao,
zhenzhong.duan, yi.l.liu, shameerkolothum
On 9/29/25 3:36 PM, Shameer Kolothum wrote:
> Accelerated SMMUv3 is only useful when the device can take advantage of
> the host's SMMUv3 in nested mode. To keep things simple and correct, we
> only allow this feature for vfio-pci endpoint devices that use the iommufd
> backend. We also allow non-endpoint emulated devices like PCI bridges and
> root ports, so that users can plug in these vfio-pci devices. We can only
> enforce this if devices are cold plugged. For hotplug cases, give appropriate
> warnings.
>
> Another reason for this limit is to avoid problems with IOTLB
> invalidations. Some commands (e.g., CMD_TLBI_NH_ASID) lack an associated
> SID, making it difficult to trace the originating device. If we allowed
> emulated endpoint devices, QEMU would have to invalidate both its own
> software IOTLB and the host's hardware IOTLB, which could slow things
> down.
>
> Since vfio-pci devices in nested mode rely on the host SMMUv3's nested
> translation (S1+S2), their get_address_space() callback must return the
> system address space so that VFIO core can setup correct S2 mappings
> for guest RAM.
>
> So in short:
> - vfio-pci devices(with iommufd as backend) return the system address
> space.
> - bridges and root ports return the IOMMU address space.
>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
> ---
> hw/arm/smmuv3-accel.c | 68 ++++++++++++++++++++++++++++-
> hw/pci-bridge/pci_expander_bridge.c | 1 -
> include/hw/pci/pci_bridge.h | 1 +
> 3 files changed, 68 insertions(+), 2 deletions(-)
>
> diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
> index 79f1713be6..44410cfb2a 100644
> --- a/hw/arm/smmuv3-accel.c
> +++ b/hw/arm/smmuv3-accel.c
> @@ -7,8 +7,13 @@
> */
>
> #include "qemu/osdep.h"
> +#include "qemu/error-report.h"
>
> #include "hw/arm/smmuv3.h"
> +#include "hw/pci/pci_bridge.h"
> +#include "hw/pci-host/gpex.h"
> +#include "hw/vfio/pci.h"
> +
> #include "smmuv3-accel.h"
>
> static SMMUv3AccelDevice *smmuv3_accel_get_dev(SMMUState *bs, SMMUPciBus *sbus,
> @@ -29,15 +34,76 @@ static SMMUv3AccelDevice *smmuv3_accel_get_dev(SMMUState *bs, SMMUPciBus *sbus,
> return accel_dev;
> }
>
> +static bool smmuv3_accel_pdev_allowed(PCIDevice *pdev, bool *vfio_pci)
> +{
> +
> + if (object_dynamic_cast(OBJECT(pdev), TYPE_PCI_BRIDGE) ||
> + object_dynamic_cast(OBJECT(pdev), TYPE_PXB_PCIE_DEV) ||
> + object_dynamic_cast(OBJECT(pdev), TYPE_GPEX_ROOT_DEVICE)) {
> + return true;
> + } else if ((object_dynamic_cast(OBJECT(pdev), TYPE_VFIO_PCI))) {
> + *vfio_pci = true;
> + if (object_property_get_link(OBJECT(pdev), "iommufd", NULL)) {
> + return true;
> + }
> + }
> + return false;
> +}
> +
> static AddressSpace *smmuv3_accel_find_add_as(PCIBus *bus, void *opaque,
> int devfn)
> {
> + PCIDevice *pdev = pci_find_device(bus, pci_bus_num(bus), devfn);
> SMMUState *bs = opaque;
> SMMUPciBus *sbus = smmu_get_sbus(bs, bus);
> SMMUv3AccelDevice *accel_dev = smmuv3_accel_get_dev(bs, sbus, bus, devfn);
> SMMUDevice *sdev = &accel_dev->sdev;
> + bool vfio_pci = false;
> +
> + if (pdev && !smmuv3_accel_pdev_allowed(pdev, &vfio_pci)) {
> + if (DEVICE(pdev)->hotplugged) {
> + if (vfio_pci) {
> + warn_report("Hot plugging a vfio-pci device (%s) without "
> + "iommufd as backend is not supported", pdev->name);
> + } else {
> + warn_report("Hot plugging an emulated device %s with "
> + "accelerated SMMUv3. This will bring down "
> + "performace", pdev->name);
> + }
> + /*
> + * Both cases, we will return IOMMU address space. For hotplugged
> + * vfio-pci dev without iommufd as backend, it will fail later in
> + * smmuv3_notify_flag_changed() with "requires iommu MAP notifier"
> + * error message.
> + */
> + return &sdev->as;
> + } else {
> + error_report("Device(%s) not allowed. Only PCIe root complex "
> + "devices or PCI bridge devices or vfio-pci endpoint "
> + "devices with iommufd as backend is allowed with "
> + "arm-smmuv3,accel=on", pdev->name);
> + exit(1);
> + }
> + }
>
> - return &sdev->as;
> + /*
> + * We return the system address for vfio-pci devices(with iommufd as
> + * backend) so that the VFIO core can set up Stage-2 (S2) mappings for
> + * guest RAM. This is needed because, in the accelerated SMMUv3 case,
> + * the host SMMUv3 runs in nested (S1 + S2) mode where the guest
> + * manages its own S1 page tables while the host manages S2.
> + *
> + * We are using the global &address_space_memory here, as this will ensure
> + * same system address space pointer for all devices behind the accelerated
> + * SMMUv3s in a VM. That way VFIO/iommufd can reuse a single IOAS ID in
> + * iommufd_cdev_attach(), allowing the Stage-2 page tables to be shared
> + * within the VM instead of duplicating them for every SMMUv3 instance.
> + */
> + if (vfio_pci) {
> + return &address_space_memory;
From that comment one understands the need of a single and common AS.
However it is not obvious why it shall be
&address_space_memory and not an AS created on purpose.
Eric
> + } else {
> + return &sdev->as;
> + }
> }
>
> static const PCIIOMMUOps smmuv3_accel_ops = {
> diff --git a/hw/pci-bridge/pci_expander_bridge.c b/hw/pci-bridge/pci_expander_bridge.c
> index 1bcceddbc4..a8eb2d2426 100644
> --- a/hw/pci-bridge/pci_expander_bridge.c
> +++ b/hw/pci-bridge/pci_expander_bridge.c
> @@ -48,7 +48,6 @@ struct PXBBus {
> char bus_path[8];
> };
>
> -#define TYPE_PXB_PCIE_DEV "pxb-pcie"
> OBJECT_DECLARE_SIMPLE_TYPE(PXBPCIEDev, PXB_PCIE_DEV)
>
> static GList *pxb_dev_list;
> diff --git a/include/hw/pci/pci_bridge.h b/include/hw/pci/pci_bridge.h
> index a055fd8d32..b61360b900 100644
> --- a/include/hw/pci/pci_bridge.h
> +++ b/include/hw/pci/pci_bridge.h
> @@ -106,6 +106,7 @@ typedef struct PXBPCIEDev {
>
> #define TYPE_PXB_PCIE_BUS "pxb-pcie-bus"
> #define TYPE_PXB_CXL_BUS "pxb-cxl-bus"
> +#define TYPE_PXB_PCIE_DEV "pxb-pcie"
> #define TYPE_PXB_DEV "pxb"
> OBJECT_DECLARE_SIMPLE_TYPE(PXBDev, PXB_DEV)
>
* Re: [PATCH v4 06/27] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd
2025-10-20 16:31 ` Eric Auger
@ 2025-10-20 18:25 ` Nicolin Chen
2025-10-20 18:59 ` Shameer Kolothum
0 siblings, 1 reply; 118+ messages in thread
From: Nicolin Chen @ 2025-10-20 18:25 UTC (permalink / raw)
To: Eric Auger
Cc: Shameer Kolothum, qemu-arm, qemu-devel, peter.maydell, jgg,
ddutile, berrange, nathanc, mochs, smostafa, wangzhou1,
jiangkunkun, jonathan.cameron, zhangfei.gao, zhenzhong.duan,
yi.l.liu, shameerkolothum
On Mon, Oct 20, 2025 at 06:31:38PM +0200, Eric Auger wrote:
> On 9/29/25 3:36 PM, Shameer Kolothum wrote:
> > + /*
> > + * We return the system address for vfio-pci devices(with iommufd as
> > + * backend) so that the VFIO core can set up Stage-2 (S2) mappings for
> > + * guest RAM. This is needed because, in the accelerated SMMUv3 case,
> > + * the host SMMUv3 runs in nested (S1 + S2) mode where the guest
> > + * manages its own S1 page tables while the host manages S2.
> > + *
> > + * We are using the global &address_space_memory here, as this will ensure
> > + * same system address space pointer for all devices behind the accelerated
> > + * SMMUv3s in a VM. That way VFIO/iommufd can reuse a single IOAS ID in
> > + * iommufd_cdev_attach(), allowing the Stage-2 page tables to be shared
> > + * within the VM instead of duplicating them for every SMMUv3 instance.
> > + */
> > + if (vfio_pci) {
> > + return &address_space_memory;
> From that comment one understands the need of a single and common AS.
> However it is not obvious why it shall be
>
> &address_space_memory and not an AS created on purpose.
We tried creating an AS, but it was not straightforward to share
across vSMMU instances, as most of the structures are per vSMMU.
Only SMMUv3Class seems to be shared across vSMMU instances, but
it doesn't seem to be a good place to hold an AS pointer either.
The global @address_space_memory is provisioned as the system AS,
so it's easy to use.
Perhaps we could add a couple of lines to the comments.
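As a rough illustration of why returning one shared pointer is enough (a hypothetical, self-contained C sketch, not QEMU/VFIO code): if every vSMMU instance's callback hands back the same global object, a consumer that caches state keyed on address-space identity, the way iommufd reuses an IOAS per address space, ends up with a single entry.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for an AddressSpace; not a real QEMU type. */
typedef struct ToyAS { const char *name; } ToyAS;

/* Models the single global &address_space_memory. */
static ToyAS toy_system_as = { "system" };

/* Every vSMMU instance's callback returns the same shared pointer. */
static ToyAS *vsmmu_get_as(void)
{
    return &toy_system_as;
}

/* Models a consumer caching one IOAS per distinct address space. */
static int ioas_count;
static ToyAS *cached_as;

static int attach_device(ToyAS *as)
{
    if (as != cached_as) {   /* pointer identity decides reuse */
        cached_as = as;
        ioas_count++;        /* would allocate a new IOAS here */
    }
    return ioas_count;
}
```

Pointer identity is the whole mechanism in this sketch; a per-SMMU AS would make the `as != cached_as` test true for each instance and duplicate the Stage-2 tables.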
Thanks
Nicolin
* RE: [PATCH v4 06/27] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd
2025-10-20 18:25 ` Nicolin Chen
@ 2025-10-20 18:59 ` Shameer Kolothum
2025-10-21 15:28 ` Eric Auger
0 siblings, 1 reply; 118+ messages in thread
From: Shameer Kolothum @ 2025-10-20 18:59 UTC (permalink / raw)
To: Nicolin Chen, Eric Auger
Cc: qemu-arm@nongnu.org, qemu-devel@nongnu.org,
peter.maydell@linaro.org, Jason Gunthorpe, ddutile@redhat.com,
berrange@redhat.com, Nathan Chen, Matt Ochs, smostafa@google.com,
wangzhou1@hisilicon.com, jiangkunkun@huawei.com,
jonathan.cameron@huawei.com, zhangfei.gao@linaro.org,
zhenzhong.duan@intel.com, yi.l.liu@intel.com,
shameerkolothum@gmail.com
> -----Original Message-----
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: 20 October 2025 19:25
> To: Eric Auger <eric.auger@redhat.com>
> Cc: Shameer Kolothum <skolothumtho@nvidia.com>; qemu-
> arm@nongnu.org; qemu-devel@nongnu.org; peter.maydell@linaro.org;
> Jason Gunthorpe <jgg@nvidia.com>; ddutile@redhat.com;
> berrange@redhat.com; Nathan Chen <nathanc@nvidia.com>; Matt Ochs
> <mochs@nvidia.com>; smostafa@google.com; wangzhou1@hisilicon.com;
> jiangkunkun@huawei.com; jonathan.cameron@huawei.com;
> zhangfei.gao@linaro.org; zhenzhong.duan@intel.com; yi.l.liu@intel.com;
> shameerkolothum@gmail.com
> Subject: Re: [PATCH v4 06/27] hw/arm/smmuv3-accel: Restrict accelerated
> SMMUv3 to vfio-pci endpoints with iommufd
>
> On Mon, Oct 20, 2025 at 06:31:38PM +0200, Eric Auger wrote:
> > On 9/29/25 3:36 PM, Shameer Kolothum wrote:
> > > + /*
> > > + * We return the system address for vfio-pci devices(with iommufd as
> > > + * backend) so that the VFIO core can set up Stage-2 (S2) mappings for
> > > + * guest RAM. This is needed because, in the accelerated SMMUv3
> case,
> > > + * the host SMMUv3 runs in nested (S1 + S2) mode where the guest
> > > + * manages its own S1 page tables while the host manages S2.
> > > + *
> > > + * We are using the global &address_space_memory here, as this will
> ensure
> > > + * same system address space pointer for all devices behind the
> accelerated
> > > + * SMMUv3s in a VM. That way VFIO/iommufd can reuse a single IOAS
> ID in
> > > + * iommufd_cdev_attach(), allowing the Stage-2 page tables to be
> shared
> > > + * within the VM instead of duplicating them for every SMMUv3
> instance.
> > > + */
> > > + if (vfio_pci) {
> > > + return &address_space_memory;
> > From that comment one understands the need of a single and common AS.
> > However it is not obvious why it shall be
> >
> > &address_space_memory and not an AS created on purpose.
>
> We tried creating an AS, but it was not straightforward to share across vSMMU
> instances, as most of the structures are per vSMMU.
>
> Only SMMUv3Class seems to be shared across vSMMU instances, but it
> doesn't seem to be the right place to hold an AS pointer either.
>
> The global @address_space_memory is provisioned as the system AS, so it's
> easy to use.
We had discussed this previously here,
https://lore.kernel.org/qemu-devel/aJKn650gOGQh2whD@Asurada-Nvidia/
Thanks,
Shameer
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [PATCH v4 06/27] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd
2025-10-20 18:59 ` Shameer Kolothum
@ 2025-10-21 15:28 ` Eric Auger
0 siblings, 0 replies; 118+ messages in thread
From: Eric Auger @ 2025-10-21 15:28 UTC (permalink / raw)
To: Shameer Kolothum, Nicolin Chen
Cc: qemu-arm@nongnu.org, qemu-devel@nongnu.org,
peter.maydell@linaro.org, Jason Gunthorpe, ddutile@redhat.com,
berrange@redhat.com, Nathan Chen, Matt Ochs, smostafa@google.com,
wangzhou1@hisilicon.com, jiangkunkun@huawei.com,
jonathan.cameron@huawei.com, zhangfei.gao@linaro.org,
zhenzhong.duan@intel.com, yi.l.liu@intel.com,
shameerkolothum@gmail.com
On 10/20/25 8:59 PM, Shameer Kolothum wrote:
>
>> -----Original Message-----
>> From: Nicolin Chen <nicolinc@nvidia.com>
>> Sent: 20 October 2025 19:25
>> To: Eric Auger <eric.auger@redhat.com>
>> Cc: Shameer Kolothum <skolothumtho@nvidia.com>; qemu-
>> arm@nongnu.org; qemu-devel@nongnu.org; peter.maydell@linaro.org;
>> Jason Gunthorpe <jgg@nvidia.com>; ddutile@redhat.com;
>> berrange@redhat.com; Nathan Chen <nathanc@nvidia.com>; Matt Ochs
>> <mochs@nvidia.com>; smostafa@google.com; wangzhou1@hisilicon.com;
>> jiangkunkun@huawei.com; jonathan.cameron@huawei.com;
>> zhangfei.gao@linaro.org; zhenzhong.duan@intel.com; yi.l.liu@intel.com;
>> shameerkolothum@gmail.com
>> Subject: Re: [PATCH v4 06/27] hw/arm/smmuv3-accel: Restrict accelerated
>> SMMUv3 to vfio-pci endpoints with iommufd
>>
>> On Mon, Oct 20, 2025 at 06:31:38PM +0200, Eric Auger wrote:
>>> On 9/29/25 3:36 PM, Shameer Kolothum wrote:
>>>> + /*
>>>> + * We return the system address for vfio-pci devices(with iommufd as
>>>> + * backend) so that the VFIO core can set up Stage-2 (S2) mappings for
>>>> + * guest RAM. This is needed because, in the accelerated SMMUv3
>> case,
>>>> + * the host SMMUv3 runs in nested (S1 + S2) mode where the guest
>>>> + * manages its own S1 page tables while the host manages S2.
>>>> + *
>>>> + * We are using the global &address_space_memory here, as this will
>> ensure
>>>> + * same system address space pointer for all devices behind the
>> accelerated
>>>> + * SMMUv3s in a VM. That way VFIO/iommufd can reuse a single IOAS
>> ID in
>>>> + * iommufd_cdev_attach(), allowing the Stage-2 page tables to be
>> shared
>>>> + * within the VM instead of duplicating them for every SMMUv3
>> instance.
>>>> + */
>>>> + if (vfio_pci) {
>>>> + return &address_space_memory;
>>> From that comment one understands the need of a single and common AS.
>>> However it is not obvious why it shall be
>>>
>>> &address_space_memory and not an AS created on purpose.
>> We tried creating an AS, but it was not straightforward to share across vSMMU
>> instances, as most of the structures are per vSMMU.
>>
>> Only SMMUv3Class seems to be shared across vSMMU instances, but it
>> doesn't seem to be the right place to hold an AS pointer either.
>>
>> The global @address_space_memory is provisioned as the system AS, so it's
>> easy to use.
> We had discussed this previously here,
> https://lore.kernel.org/qemu-devel/aJKn650gOGQh2whD@Asurada-Nvidia/
Thank you for the pointer. I definitely missed that thread. It seems
Peter was not very keen on using address_space_memory either. Why
can't we use a global variable in smmuv3-accel.c? In hw/vfio/container.c
there is a list of VFIOAddressSpace, for instance.
Is there anything wrong with doing that?
Thanks
Eric
>
> Thanks,
> Shameer
>
^ permalink raw reply [flat|nested] 118+ messages in thread
* [PATCH v4 07/27] hw/arm/smmuv3: Implement get_viommu_cap() callback
2025-09-29 13:36 [PATCH v4 00/27] hw/arm/virt: Add support for user-creatable accelerated SMMUv3 Shameer Kolothum
` (5 preceding siblings ...)
2025-09-29 13:36 ` [PATCH v4 06/27] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd Shameer Kolothum
@ 2025-09-29 13:36 ` Shameer Kolothum
2025-09-29 16:13 ` Jonathan Cameron via
2025-10-01 17:36 ` Eric Auger
2025-09-29 13:36 ` [PATCH v4 08/27] hw/arm/smmuv3-accel: Add set/unset_iommu_device callback Shameer Kolothum
` (20 subsequent siblings)
27 siblings, 2 replies; 118+ messages in thread
From: Shameer Kolothum @ 2025-09-29 13:36 UTC (permalink / raw)
To: qemu-arm, qemu-devel
Cc: eric.auger, peter.maydell, jgg, nicolinc, ddutile, berrange,
nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu,
shameerkolothum
For accelerated SMMUv3, we need nested parent domain creation. Add the
callback support so that VFIO can create a nested parent.
In the accelerated SMMUv3 case, the host SMMUv3 is configured in nested
mode (S1 + S2), and the guest owns the Stage-1 page table. Therefore, we
expose only Stage-1 to the guest to ensure it uses the correct page-table
format.
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
hw/arm/smmuv3-accel.c | 13 +++++++++++++
hw/arm/virt.c | 13 +++++++++++++
2 files changed, 26 insertions(+)
diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
index 44410cfb2a..6b0e512d86 100644
--- a/hw/arm/smmuv3-accel.c
+++ b/hw/arm/smmuv3-accel.c
@@ -10,6 +10,7 @@
#include "qemu/error-report.h"
#include "hw/arm/smmuv3.h"
+#include "hw/iommu.h"
#include "hw/pci/pci_bridge.h"
#include "hw/pci-host/gpex.h"
#include "hw/vfio/pci.h"
@@ -106,8 +107,20 @@ static AddressSpace *smmuv3_accel_find_add_as(PCIBus *bus, void *opaque,
}
}
+static uint64_t smmuv3_accel_get_viommu_flags(void *opaque)
+{
+ /*
+ * We return VIOMMU_FLAG_WANT_NESTING_PARENT to inform VFIO core to create a
+ * nesting parent which is required for accelerated SMMUv3 support.
+ * The real HW nested support should be reported from host SMMUv3 and if
+ * it doesn't, the nesting parent allocation will fail anyway in VFIO core.
+ */
+ return VIOMMU_FLAG_WANT_NESTING_PARENT;
+}
+
static const PCIIOMMUOps smmuv3_accel_ops = {
.get_address_space = smmuv3_accel_find_add_as,
+ .get_viommu_flags = smmuv3_accel_get_viommu_flags,
};
void smmuv3_accel_init(SMMUv3State *s)
diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 02209fadcf..b533b0556e 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -3073,6 +3073,19 @@ static void virt_machine_device_plug_cb(HotplugHandler *hotplug_dev,
return;
}
+ if (object_property_get_bool(OBJECT(dev), "accel", &error_abort)) {
+ char *stage;
+
+ stage = object_property_get_str(OBJECT(dev), "stage",
+ &error_fatal);
+ /* If no stage specified, SMMUv3 will default to stage 1 */
+ if (*stage && strcmp("1", stage)) {
+ error_setg(errp, "Only stage1 is supported for SMMUV3 with "
+ "accel=on");
+ return;
+ }
+ }
+
create_smmuv3_dev_dtb(vms, dev, bus);
}
}
--
2.43.0
^ permalink raw reply related [flat|nested] 118+ messages in thread
* Re: [PATCH v4 07/27] hw/arm/smmuv3: Implement get_viommu_cap() callback
2025-09-29 13:36 ` [PATCH v4 07/27] hw/arm/smmuv3: Implement get_viommu_cap() callback Shameer Kolothum
@ 2025-09-29 16:13 ` Jonathan Cameron via
2025-10-01 17:36 ` Eric Auger
1 sibling, 0 replies; 118+ messages in thread
From: Jonathan Cameron via @ 2025-09-29 16:13 UTC (permalink / raw)
To: Shameer Kolothum
Cc: qemu-arm, qemu-devel, eric.auger, peter.maydell, jgg, nicolinc,
ddutile, berrange, nathanc, mochs, smostafa, wangzhou1,
jiangkunkun, zhangfei.gao, zhenzhong.duan, yi.l.liu,
shameerkolothum
On Mon, 29 Sep 2025 14:36:23 +0100
Shameer Kolothum <skolothumtho@nvidia.com> wrote:
> For accelerated SMMUv3, we need nested parent domain creation. Add the
> callback support so that VFIO can create a nested parent.
>
> In the accelerated SMMUv3 case, the host SMMUv3 is configured in nested
> mode (S1 + S2), and the guest owns the Stage-1 page table. Therefore, we
> expose only Stage-1 to the guest to ensure it uses the correct page-table
> format.
>
> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [PATCH v4 07/27] hw/arm/smmuv3: Implement get_viommu_cap() callback
2025-09-29 13:36 ` [PATCH v4 07/27] hw/arm/smmuv3: Implement get_viommu_cap() callback Shameer Kolothum
2025-09-29 16:13 ` Jonathan Cameron via
@ 2025-10-01 17:36 ` Eric Auger
2025-10-02 9:38 ` Shameer Kolothum
2025-10-02 9:39 ` Jonathan Cameron via
1 sibling, 2 replies; 118+ messages in thread
From: Eric Auger @ 2025-10-01 17:36 UTC (permalink / raw)
To: Shameer Kolothum, qemu-arm, qemu-devel
Cc: peter.maydell, jgg, nicolinc, ddutile, berrange, nathanc, mochs,
smostafa, wangzhou1, jiangkunkun, jonathan.cameron, zhangfei.gao,
zhenzhong.duan, yi.l.liu, shameerkolothum
Hi Shameer,
On 9/29/25 3:36 PM, Shameer Kolothum wrote:
> For accelerated SMMUv3, we need nested parent domain creation. Add the
> callback support so that VFIO can create a nested parent.
>
> In the accelerated SMMUv3 case, the host SMMUv3 is configured in nested
> mode (S1 + S2), and the guest owns the Stage-1 page table. Therefore, we
> expose only Stage-1 to the guest to ensure it uses the correct page-table
> format.
>
> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
I wonder if you should keep both. I don't know the usage, though, but
it's worth checking.
> ---
> hw/arm/smmuv3-accel.c | 13 +++++++++++++
> hw/arm/virt.c | 13 +++++++++++++
> 2 files changed, 26 insertions(+)
>
> diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
> index 44410cfb2a..6b0e512d86 100644
> --- a/hw/arm/smmuv3-accel.c
> +++ b/hw/arm/smmuv3-accel.c
> @@ -10,6 +10,7 @@
> #include "qemu/error-report.h"
>
> #include "hw/arm/smmuv3.h"
> +#include "hw/iommu.h"
> #include "hw/pci/pci_bridge.h"
> #include "hw/pci-host/gpex.h"
> #include "hw/vfio/pci.h"
> @@ -106,8 +107,20 @@ static AddressSpace *smmuv3_accel_find_add_as(PCIBus *bus, void *opaque,
> }
> }
>
> +static uint64_t smmuv3_accel_get_viommu_flags(void *opaque)
> +{
> + /*
> + * We return VIOMMU_FLAG_WANT_NESTING_PARENT to inform VFIO core to create a
> + * nesting parent which is required for accelerated SMMUv3 support.
> + * The real HW nested support should be reported from host SMMUv3 and if
> + * it doesn't, the nesting parent allocation will fail anyway in VFIO core.
> + */
> + return VIOMMU_FLAG_WANT_NESTING_PARENT;
> +}
> +
> static const PCIIOMMUOps smmuv3_accel_ops = {
> .get_address_space = smmuv3_accel_find_add_as,
> + .get_viommu_flags = smmuv3_accel_get_viommu_flags,
> };
>
> void smmuv3_accel_init(SMMUv3State *s)
> diff --git a/hw/arm/virt.c b/hw/arm/virt.c
> index 02209fadcf..b533b0556e 100644
> --- a/hw/arm/virt.c
> +++ b/hw/arm/virt.c
> @@ -3073,6 +3073,19 @@ static void virt_machine_device_plug_cb(HotplugHandler *hotplug_dev,
> return;
> }
>
> + if (object_property_get_bool(OBJECT(dev), "accel", &error_abort)) {
This looks unrelated to the get_viommu_flags() addition; to me this
shall be put in a separate patch or squashed into the patch that exposes
the accel prop.
Thanks
Eric
> + char *stage;
> +
> + stage = object_property_get_str(OBJECT(dev), "stage",
> + &error_fatal);
> + /* If no stage specified, SMMUv3 will default to stage 1 */
> + if (*stage && strcmp("1", stage)) {
> + error_setg(errp, "Only stage1 is supported for SMMUV3 with "
> + "accel=on");
> + return;
> + }
> + }
> +
> create_smmuv3_dev_dtb(vms, dev, bus);
> }
> }
^ permalink raw reply [flat|nested] 118+ messages in thread* RE: [PATCH v4 07/27] hw/arm/smmuv3: Implement get_viommu_cap() callback
2025-10-01 17:36 ` Eric Auger
@ 2025-10-02 9:38 ` Shameer Kolothum
2025-10-02 12:31 ` Eric Auger
2025-10-02 9:39 ` Jonathan Cameron via
1 sibling, 1 reply; 118+ messages in thread
From: Shameer Kolothum @ 2025-10-02 9:38 UTC (permalink / raw)
To: eric.auger@redhat.com, qemu-arm@nongnu.org, qemu-devel@nongnu.org
Cc: peter.maydell@linaro.org, Jason Gunthorpe, Nicolin Chen,
ddutile@redhat.com, berrange@redhat.com, Nathan Chen, Matt Ochs,
smostafa@google.com, wangzhou1@hisilicon.com,
jiangkunkun@huawei.com, jonathan.cameron@huawei.com,
zhangfei.gao@linaro.org, zhenzhong.duan@intel.com,
yi.l.liu@intel.com, shameerkolothum@gmail.com
> -----Original Message-----
> From: Eric Auger <eric.auger@redhat.com>
> Sent: 01 October 2025 18:37
> To: Shameer Kolothum <skolothumtho@nvidia.com>; qemu-
> arm@nongnu.org; qemu-devel@nongnu.org
> Cc: peter.maydell@linaro.org; Jason Gunthorpe <jgg@nvidia.com>; Nicolin
> Chen <nicolinc@nvidia.com>; ddutile@redhat.com; berrange@redhat.com;
> Nathan Chen <nathanc@nvidia.com>; Matt Ochs <mochs@nvidia.com>;
> smostafa@google.com; wangzhou1@hisilicon.com;
> jiangkunkun@huawei.com; jonathan.cameron@huawei.com;
> zhangfei.gao@linaro.org; zhenzhong.duan@intel.com; yi.l.liu@intel.com;
> shameerkolothum@gmail.com
> Subject: Re: [PATCH v4 07/27] hw/arm/smmuv3: Implement
> get_viommu_cap() callback
>
> Hi Shameer,
>
> On 9/29/25 3:36 PM, Shameer Kolothum wrote:
> > For accelerated SMMUv3, we need nested parent domain creation. Add the
> > callback support so that VFIO can create a nested parent.
> >
> > In the accelerated SMMUv3 case, the host SMMUv3 is configured in nested
> > mode (S1 + S2), and the guest owns the Stage-1 page table. Therefore, we
> > expose only Stage-1 to the guest to ensure it uses the correct page-table
> > format.
> >
> > Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
> > Signed-off-by: Shameer Kolothum
> <shameerali.kolothum.thodi@huawei.com>
> > Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
> I wonder if you should keep both. I don't know the usage, though, but
> it's worth checking.
Hmm.. I don't know either for sure. What I followed here (I will double check)
is that for all the patches I carried over from v3, I kept all the S-o-b tags.
That seems to be the right thing to do, and IIRC I have seen it done that way
previously as well.
> > ---
> > hw/arm/smmuv3-accel.c | 13 +++++++++++++
> > hw/arm/virt.c | 13 +++++++++++++
> > 2 files changed, 26 insertions(+)
> >
> > diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
> > index 44410cfb2a..6b0e512d86 100644
> > --- a/hw/arm/smmuv3-accel.c
> > +++ b/hw/arm/smmuv3-accel.c
> > @@ -10,6 +10,7 @@
> > #include "qemu/error-report.h"
> >
> > #include "hw/arm/smmuv3.h"
> > +#include "hw/iommu.h"
> > #include "hw/pci/pci_bridge.h"
> > #include "hw/pci-host/gpex.h"
> > #include "hw/vfio/pci.h"
> > @@ -106,8 +107,20 @@ static AddressSpace
> *smmuv3_accel_find_add_as(PCIBus *bus, void *opaque,
> > }
> > }
> >
> > +static uint64_t smmuv3_accel_get_viommu_flags(void *opaque)
> > +{
> > + /*
> > + * We return VIOMMU_FLAG_WANT_NESTING_PARENT to inform VFIO
> core to create a
> > + * nesting parent which is required for accelerated SMMUv3 support.
> > + * The real HW nested support should be reported from host SMMUv3
> and if
> > + * it doesn't, the nesting parent allocation will fail anyway in VFIO core.
> > + */
> > + return VIOMMU_FLAG_WANT_NESTING_PARENT;
> > +}
> > +
> > static const PCIIOMMUOps smmuv3_accel_ops = {
> > .get_address_space = smmuv3_accel_find_add_as,
> > + .get_viommu_flags = smmuv3_accel_get_viommu_flags,
> > };
> >
> > void smmuv3_accel_init(SMMUv3State *s)
> > diff --git a/hw/arm/virt.c b/hw/arm/virt.c
> > index 02209fadcf..b533b0556e 100644
> > --- a/hw/arm/virt.c
> > +++ b/hw/arm/virt.c
> > @@ -3073,6 +3073,19 @@ static void
> virt_machine_device_plug_cb(HotplugHandler *hotplug_dev,
> > return;
> > }
> >
> > + if (object_property_get_bool(OBJECT(dev), "accel", &error_abort)) {
> This looks unrelated to the get_viommu_flags() addition and to me this
> shall be put in a separate patch of squashed in the patch that exposes
> the accel prop Thanks Eric
But my thought process was: without this, we can't say the vIOMMU will support
the nesting parent. But then the flag seems to indicate that the vIOMMU "wants"
a nesting parent. So I guess we can move it to a later patch.
Thanks,
Shameer
^ permalink raw reply [flat|nested] 118+ messages in thread* Re: [PATCH v4 07/27] hw/arm/smmuv3: Implement get_viommu_cap() callback
2025-10-02 9:38 ` Shameer Kolothum
@ 2025-10-02 12:31 ` Eric Auger
0 siblings, 0 replies; 118+ messages in thread
From: Eric Auger @ 2025-10-02 12:31 UTC (permalink / raw)
To: Shameer Kolothum, qemu-arm@nongnu.org, qemu-devel@nongnu.org
Cc: peter.maydell@linaro.org, Jason Gunthorpe, Nicolin Chen,
ddutile@redhat.com, berrange@redhat.com, Nathan Chen, Matt Ochs,
smostafa@google.com, wangzhou1@hisilicon.com,
jiangkunkun@huawei.com, jonathan.cameron@huawei.com,
zhangfei.gao@linaro.org, zhenzhong.duan@intel.com,
yi.l.liu@intel.com, shameerkolothum@gmail.com
On 10/2/25 11:38 AM, Shameer Kolothum wrote:
>
>> -----Original Message-----
>> From: Eric Auger <eric.auger@redhat.com>
>> Sent: 01 October 2025 18:37
>> To: Shameer Kolothum <skolothumtho@nvidia.com>; qemu-
>> arm@nongnu.org; qemu-devel@nongnu.org
>> Cc: peter.maydell@linaro.org; Jason Gunthorpe <jgg@nvidia.com>; Nicolin
>> Chen <nicolinc@nvidia.com>; ddutile@redhat.com; berrange@redhat.com;
>> Nathan Chen <nathanc@nvidia.com>; Matt Ochs <mochs@nvidia.com>;
>> smostafa@google.com; wangzhou1@hisilicon.com;
>> jiangkunkun@huawei.com; jonathan.cameron@huawei.com;
>> zhangfei.gao@linaro.org; zhenzhong.duan@intel.com; yi.l.liu@intel.com;
>> shameerkolothum@gmail.com
>> Subject: Re: [PATCH v4 07/27] hw/arm/smmuv3: Implement
>> get_viommu_cap() callback
>>
>> Hi Shameer,
>>
>> On 9/29/25 3:36 PM, Shameer Kolothum wrote:
>>> For accelerated SMMUv3, we need nested parent domain creation. Add the
>>> callback support so that VFIO can create a nested parent.
>>>
>>> In the accelerated SMMUv3 case, the host SMMUv3 is configured in nested
>>> mode (S1 + S2), and the guest owns the Stage-1 page table. Therefore, we
>>> expose only Stage-1 to the guest to ensure it uses the correct page-table
>>> format.
>>>
>>> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
>>> Signed-off-by: Shameer Kolothum
>> <shameerali.kolothum.thodi@huawei.com>
>>> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
>> I wonder if you should keep both. I don't know the usage, though, but
>> it's worth checking.
> Hmm.. I don't know either for sure. What I followed here (I will double check)
> is that for all the patches I carried over from v3, I kept all the S-o-b tags.
> That seems to be the right thing to do, and IIRC I have seen it done that way
> previously as well.
>
>>> ---
>>> hw/arm/smmuv3-accel.c | 13 +++++++++++++
>>> hw/arm/virt.c | 13 +++++++++++++
>>> 2 files changed, 26 insertions(+)
>>>
>>> diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
>>> index 44410cfb2a..6b0e512d86 100644
>>> --- a/hw/arm/smmuv3-accel.c
>>> +++ b/hw/arm/smmuv3-accel.c
>>> @@ -10,6 +10,7 @@
>>> #include "qemu/error-report.h"
>>>
>>> #include "hw/arm/smmuv3.h"
>>> +#include "hw/iommu.h"
>>> #include "hw/pci/pci_bridge.h"
>>> #include "hw/pci-host/gpex.h"
>>> #include "hw/vfio/pci.h"
>>> @@ -106,8 +107,20 @@ static AddressSpace
>> *smmuv3_accel_find_add_as(PCIBus *bus, void *opaque,
>>> }
>>> }
>>>
>>> +static uint64_t smmuv3_accel_get_viommu_flags(void *opaque)
>>> +{
>>> + /*
>>> + * We return VIOMMU_FLAG_WANT_NESTING_PARENT to inform VFIO
>> core to create a
>>> + * nesting parent which is required for accelerated SMMUv3 support.
>>> + * The real HW nested support should be reported from host SMMUv3
>> and if
>>> + * it doesn't, the nesting parent allocation will fail anyway in VFIO core.
>>> + */
>>> + return VIOMMU_FLAG_WANT_NESTING_PARENT;
>>> +}
>>> +
>>> static const PCIIOMMUOps smmuv3_accel_ops = {
>>> .get_address_space = smmuv3_accel_find_add_as,
>>> + .get_viommu_flags = smmuv3_accel_get_viommu_flags,
>>> };
>>>
>>> void smmuv3_accel_init(SMMUv3State *s)
>>> diff --git a/hw/arm/virt.c b/hw/arm/virt.c
>>> index 02209fadcf..b533b0556e 100644
>>> --- a/hw/arm/virt.c
>>> +++ b/hw/arm/virt.c
>>> @@ -3073,6 +3073,19 @@ static void
>> virt_machine_device_plug_cb(HotplugHandler *hotplug_dev,
>>> return;
>>> }
>>>
>>> + if (object_property_get_bool(OBJECT(dev), "accel", &error_abort)) {
>> This looks unrelated to the get_viommu_flags() addition; to me this
>> shall be put in a separate patch or squashed into the patch that exposes
>> the accel prop.
>> Thanks
>> Eric
> But my thought process was: without this, we can't say the vIOMMU will support
> the nesting parent. But then the flag seems to indicate that the vIOMMU "wants"
> a nesting parent. So I guess we can move it to a later patch.
Yes that's my understanding too
Eric
>
> Thanks,
> Shameer
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [PATCH v4 07/27] hw/arm/smmuv3: Implement get_viommu_cap() callback
2025-10-01 17:36 ` Eric Auger
2025-10-02 9:38 ` Shameer Kolothum
@ 2025-10-02 9:39 ` Jonathan Cameron via
1 sibling, 0 replies; 118+ messages in thread
From: Jonathan Cameron via @ 2025-10-02 9:39 UTC (permalink / raw)
To: Eric Auger
Cc: Shameer Kolothum, qemu-arm, qemu-devel, peter.maydell, jgg,
nicolinc, ddutile, berrange, nathanc, mochs, smostafa, wangzhou1,
jiangkunkun, zhangfei.gao, zhenzhong.duan, yi.l.liu,
shameerkolothum
On Wed, 1 Oct 2025 19:36:47 +0200
Eric Auger <eric.auger@redhat.com> wrote:
> Hi Shameer,
>
> On 9/29/25 3:36 PM, Shameer Kolothum wrote:
> > For accelerated SMMUv3, we need nested parent domain creation. Add the
> > callback support so that VFIO can create a nested parent.
> >
> > In the accelerated SMMUv3 case, the host SMMUv3 is configured in nested
> > mode (S1 + S2), and the guest owns the Stage-1 page table. Therefore, we
> > expose only Stage-1 to the guest to ensure it uses the correct page-table
> > format.
> >
> > Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
> > Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> > Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
> I wonder if you should keep both. I don't know the usage, though, but
> it's worth checking.
I've never found any clear guidance on what to do in this case.
Given the low cost of a bonus SoB, which is valid since the "Nvidia" Shameer
has taken the code from the "Huawei" Shameer and moved it forward,
I'd be tempted to keep it unless anyone feels strongly about it.
Jonathan
^ permalink raw reply [flat|nested] 118+ messages in thread
* [PATCH v4 08/27] hw/arm/smmuv3-accel: Add set/unset_iommu_device callback
2025-09-29 13:36 [PATCH v4 00/27] hw/arm/virt: Add support for user-creatable accelerated SMMUv3 Shameer Kolothum
` (6 preceding siblings ...)
2025-09-29 13:36 ` [PATCH v4 07/27] hw/arm/smmuv3: Implement get_viommu_cap() callback Shameer Kolothum
@ 2025-09-29 13:36 ` Shameer Kolothum
2025-09-29 16:25 ` Jonathan Cameron via
` (2 more replies)
2025-09-29 13:36 ` [PATCH v4 09/27] hw/arm/smmuv3-accel: Support nested STE install/uninstall support Shameer Kolothum
` (19 subsequent siblings)
27 siblings, 3 replies; 118+ messages in thread
From: Shameer Kolothum @ 2025-09-29 13:36 UTC (permalink / raw)
To: qemu-arm, qemu-devel
Cc: eric.auger, peter.maydell, jgg, nicolinc, ddutile, berrange,
nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu,
shameerkolothum
From: Nicolin Chen <nicolinc@nvidia.com>
Implement a set_iommu_device callback:
- If an existing vIOMMU is found, reuse that.
- Else, allocate a vIOMMU with the nested parent S2 HWPT allocated by VFIO.
  Although iommufd's vIOMMU model supports nested translation by
  encapsulating an S2 nesting parent HWPT, devices cannot attach to this
  parent HWPT directly. So two proxy nested HWPTs (bypass and abort) are
  allocated to handle device attachments.
- And add the device to the vIOMMU device list.
Also add an unset_iommu_device callback to unwind/clean up the above.
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
hw/arm/smmuv3-accel.c | 150 ++++++++++++++++++++++++++++++++++++++++
hw/arm/smmuv3-accel.h | 17 +++++
hw/arm/trace-events | 4 ++
include/hw/arm/smmuv3.h | 1 +
4 files changed, 172 insertions(+)
diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
index 6b0e512d86..81fa738f6f 100644
--- a/hw/arm/smmuv3-accel.c
+++ b/hw/arm/smmuv3-accel.c
@@ -8,6 +8,7 @@
#include "qemu/osdep.h"
#include "qemu/error-report.h"
+#include "trace.h"
#include "hw/arm/smmuv3.h"
#include "hw/iommu.h"
@@ -17,6 +18,9 @@
#include "smmuv3-accel.h"
+#define SMMU_STE_VALID (1ULL << 0)
+#define SMMU_STE_CFG_BYPASS (1ULL << 3)
+
static SMMUv3AccelDevice *smmuv3_accel_get_dev(SMMUState *bs, SMMUPciBus *sbus,
PCIBus *bus, int devfn)
{
@@ -35,6 +39,149 @@ static SMMUv3AccelDevice *smmuv3_accel_get_dev(SMMUState *bs, SMMUPciBus *sbus,
return accel_dev;
}
+static bool
+smmuv3_accel_dev_alloc_viommu(SMMUv3AccelDevice *accel_dev,
+ HostIOMMUDeviceIOMMUFD *idev, Error **errp)
+{
+ struct iommu_hwpt_arm_smmuv3 bypass_data = {
+ .ste = { SMMU_STE_CFG_BYPASS | SMMU_STE_VALID, 0x0ULL },
+ };
+ struct iommu_hwpt_arm_smmuv3 abort_data = {
+ .ste = { SMMU_STE_VALID, 0x0ULL },
+ };
+ SMMUDevice *sdev = &accel_dev->sdev;
+ SMMUState *bs = sdev->smmu;
+ SMMUv3State *s = ARM_SMMUV3(bs);
+ SMMUv3AccelState *s_accel = s->s_accel;
+ uint32_t s2_hwpt_id = idev->hwpt_id;
+ SMMUViommu *viommu;
+ uint32_t viommu_id;
+
+ if (s_accel->viommu) {
+ accel_dev->viommu = s_accel->viommu;
+ return true;
+ }
+
+ if (!iommufd_backend_alloc_viommu(idev->iommufd, idev->devid,
+ IOMMU_VIOMMU_TYPE_ARM_SMMUV3,
+ s2_hwpt_id, &viommu_id, errp)) {
+ return false;
+ }
+
+ viommu = g_new0(SMMUViommu, 1);
+ viommu->core.viommu_id = viommu_id;
+ viommu->core.s2_hwpt_id = s2_hwpt_id;
+ viommu->core.iommufd = idev->iommufd;
+
+ if (!iommufd_backend_alloc_hwpt(idev->iommufd, idev->devid,
+ viommu->core.viommu_id, 0,
+ IOMMU_HWPT_DATA_ARM_SMMUV3,
+ sizeof(abort_data), &abort_data,
+ &viommu->abort_hwpt_id, errp)) {
+ goto free_viommu;
+ }
+
+ if (!iommufd_backend_alloc_hwpt(idev->iommufd, idev->devid,
+ viommu->core.viommu_id, 0,
+ IOMMU_HWPT_DATA_ARM_SMMUV3,
+ sizeof(bypass_data), &bypass_data,
+ &viommu->bypass_hwpt_id, errp)) {
+ goto free_abort_hwpt;
+ }
+
+ viommu->iommufd = idev->iommufd;
+
+ s_accel->viommu = viommu;
+ accel_dev->viommu = viommu;
+ return true;
+
+free_abort_hwpt:
+ iommufd_backend_free_id(idev->iommufd, viommu->abort_hwpt_id);
+free_viommu:
+ iommufd_backend_free_id(idev->iommufd, viommu->core.viommu_id);
+ g_free(viommu);
+ return false;
+}
+
+static bool smmuv3_accel_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
+ HostIOMMUDevice *hiod, Error **errp)
+{
+ HostIOMMUDeviceIOMMUFD *idev = HOST_IOMMU_DEVICE_IOMMUFD(hiod);
+ SMMUState *bs = opaque;
+ SMMUv3State *s = ARM_SMMUV3(bs);
+ SMMUv3AccelState *s_accel = s->s_accel;
+ SMMUPciBus *sbus = smmu_get_sbus(bs, bus);
+ SMMUv3AccelDevice *accel_dev = smmuv3_accel_get_dev(bs, sbus, bus, devfn);
+ SMMUDevice *sdev = &accel_dev->sdev;
+ uint16_t sid = smmu_get_sid(sdev);
+
+ if (!idev) {
+ return true;
+ }
+
+ if (accel_dev->idev) {
+ if (accel_dev->idev != idev) {
+ error_setg(errp, "Device 0x%x already has an associated IOMMU dev",
+ sid);
+ return false;
+ }
+ return true;
+ }
+
+ if (!smmuv3_accel_dev_alloc_viommu(accel_dev, idev, errp)) {
+ error_setg(errp, "Device 0x%x: Unable to alloc viommu", sid);
+ return false;
+ }
+
+ accel_dev->idev = idev;
+ QLIST_INSERT_HEAD(&s_accel->viommu->device_list, accel_dev, next);
+ trace_smmuv3_accel_set_iommu_device(devfn, sid);
+ return true;
+}
+
+static void smmuv3_accel_unset_iommu_device(PCIBus *bus, void *opaque,
+ int devfn)
+{
+ SMMUState *bs = opaque;
+ SMMUv3State *s = ARM_SMMUV3(bs);
+ SMMUPciBus *sbus = g_hash_table_lookup(bs->smmu_pcibus_by_busptr, bus);
+ SMMUv3AccelDevice *accel_dev;
+ SMMUViommu *viommu;
+ SMMUDevice *sdev;
+ uint16_t sid;
+
+ if (!sbus) {
+ return;
+ }
+
+ sdev = sbus->pbdev[devfn];
+ if (!sdev) {
+ return;
+ }
+
+ sid = smmu_get_sid(sdev);
+ accel_dev = container_of(sdev, SMMUv3AccelDevice, sdev);
+ if (!host_iommu_device_iommufd_attach_hwpt(accel_dev->idev,
+ accel_dev->idev->hwpt_id,
+ NULL)) {
+ error_report("Unable to attach dev 0x%x to the default HW pagetable",
+ sid);
+ }
+
+ accel_dev->idev = NULL;
+ QLIST_REMOVE(accel_dev, next);
+ trace_smmuv3_accel_unset_iommu_device(devfn, sid);
+
+ viommu = s->s_accel->viommu;
+ if (QLIST_EMPTY(&viommu->device_list)) {
+ iommufd_backend_free_id(viommu->iommufd, viommu->bypass_hwpt_id);
+ iommufd_backend_free_id(viommu->iommufd, viommu->abort_hwpt_id);
+ iommufd_backend_free_id(viommu->iommufd, viommu->core.viommu_id);
+ g_free(viommu);
+ s->s_accel->viommu = NULL;
+ }
+}
+
static bool smmuv3_accel_pdev_allowed(PCIDevice *pdev, bool *vfio_pci)
{
@@ -121,6 +268,8 @@ static uint64_t smmuv3_accel_get_viommu_flags(void *opaque)
static const PCIIOMMUOps smmuv3_accel_ops = {
.get_address_space = smmuv3_accel_find_add_as,
.get_viommu_flags = smmuv3_accel_get_viommu_flags,
+ .set_iommu_device = smmuv3_accel_set_iommu_device,
+ .unset_iommu_device = smmuv3_accel_unset_iommu_device,
};
void smmuv3_accel_init(SMMUv3State *s)
@@ -128,4 +277,5 @@ void smmuv3_accel_init(SMMUv3State *s)
SMMUState *bs = ARM_SMMU(s);
bs->iommu_ops = &smmuv3_accel_ops;
+ s->s_accel = g_new0(SMMUv3AccelState, 1);
}
diff --git a/hw/arm/smmuv3-accel.h b/hw/arm/smmuv3-accel.h
index 70da16960f..3c8506d1e6 100644
--- a/hw/arm/smmuv3-accel.h
+++ b/hw/arm/smmuv3-accel.h
@@ -10,12 +10,29 @@
#define HW_ARM_SMMUV3_ACCEL_H
#include "hw/arm/smmu-common.h"
+#include "system/iommufd.h"
+#include <linux/iommufd.h>
#include CONFIG_DEVICES
+typedef struct SMMUViommu {
+ IOMMUFDBackend *iommufd;
+ IOMMUFDViommu core;
+ uint32_t bypass_hwpt_id;
+ uint32_t abort_hwpt_id;
+ QLIST_HEAD(, SMMUv3AccelDevice) device_list;
+} SMMUViommu;
+
typedef struct SMMUv3AccelDevice {
SMMUDevice sdev;
+ HostIOMMUDeviceIOMMUFD *idev;
+ SMMUViommu *viommu;
+ QLIST_ENTRY(SMMUv3AccelDevice) next;
} SMMUv3AccelDevice;
+typedef struct SMMUv3AccelState {
+ SMMUViommu *viommu;
+} SMMUv3AccelState;
+
#ifdef CONFIG_ARM_SMMUV3_ACCEL
void smmuv3_accel_init(SMMUv3State *s);
#else
diff --git a/hw/arm/trace-events b/hw/arm/trace-events
index f3386bd7ae..86370d448a 100644
--- a/hw/arm/trace-events
+++ b/hw/arm/trace-events
@@ -66,6 +66,10 @@ smmuv3_notify_flag_del(const char *iommu) "DEL SMMUNotifier node for iommu mr=%s
smmuv3_inv_notifiers_iova(const char *name, int asid, int vmid, uint64_t iova, uint8_t tg, uint64_t num_pages, int stage) "iommu mr=%s asid=%d vmid=%d iova=0x%"PRIx64" tg=%d num_pages=0x%"PRIx64" stage=%d"
smmu_reset_exit(void) ""
+#smmuv3-accel.c
+smmuv3_accel_set_iommu_device(int devfn, uint32_t sid) "devfn=0x%x (sid=0x%x)"
+smmuv3_accel_unset_iommu_device(int devfn, uint32_t sid) "devfn=0x%x (sid=0x%x)"
+
# strongarm.c
strongarm_uart_update_parameters(const char *label, int speed, char parity, int data_bits, int stop_bits) "%s speed=%d parity=%c data=%d stop=%d"
strongarm_ssp_read_underrun(void) "SSP rx underrun"
diff --git a/include/hw/arm/smmuv3.h b/include/hw/arm/smmuv3.h
index bb7076286b..5f3e9089a7 100644
--- a/include/hw/arm/smmuv3.h
+++ b/include/hw/arm/smmuv3.h
@@ -66,6 +66,7 @@ struct SMMUv3State {
/* SMMU has HW accelerator support for nested S1 + s2 */
bool accel;
+ struct SMMUv3AccelState *s_accel;
};
typedef enum {
--
2.43.0
^ permalink raw reply related [flat|nested] 118+ messages in thread
* Re: [PATCH v4 08/27] hw/arm/smmuv3-accel: Add set/unset_iommu_device callback
2025-09-29 13:36 ` [PATCH v4 08/27] hw/arm/smmuv3-accel: Add set/unset_iommu_device callback Shameer Kolothum
@ 2025-09-29 16:25 ` Jonathan Cameron via
2025-09-30 8:13 ` Shameer Kolothum
2025-10-02 6:52 ` Eric Auger
2025-10-17 12:23 ` Eric Auger
2 siblings, 1 reply; 118+ messages in thread
From: Jonathan Cameron via @ 2025-09-29 16:25 UTC (permalink / raw)
To: Shameer Kolothum
Cc: qemu-arm, qemu-devel, eric.auger, peter.maydell, jgg, nicolinc,
ddutile, berrange, nathanc, mochs, smostafa, wangzhou1,
jiangkunkun, zhangfei.gao, zhenzhong.duan, yi.l.liu,
shameerkolothum
On Mon, 29 Sep 2025 14:36:24 +0100
Shameer Kolothum <skolothumtho@nvidia.com> wrote:
> From: Nicolin Chen <nicolinc@nvidia.com>
>
> Implement a set_iommu_device callback:
> -If found an existing viommu reuse that.
> -Else,
> Allocate a vIOMMU with the nested parent S2 hwpt allocated by VFIO.
> Though, iommufd’s vIOMMU model supports nested translation by
> encapsulating a S2 nesting parent HWPT, devices cannot attach to this
> parent HWPT directly. So two proxy nested HWPTs (bypass and abort) are
> allocated to handle device attachments.
> -And add the dev to viommu device list
>
> Also add an unset_iommu_device to unwind/cleanup above.
>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
Triviality follows.
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> diff --git a/hw/arm/smmuv3-accel.h b/hw/arm/smmuv3-accel.h
> index 70da16960f..3c8506d1e6 100644
> --- a/hw/arm/smmuv3-accel.h
> +++ b/hw/arm/smmuv3-accel.h
> @@ -10,12 +10,29 @@
> #define HW_ARM_SMMUV3_ACCEL_H
>
> #include "hw/arm/smmu-common.h"
> +#include "system/iommufd.h"
> +#include <linux/iommufd.h>
> #include CONFIG_DEVICES
>
> +typedef struct SMMUViommu {
> + IOMMUFDBackend *iommufd;
> + IOMMUFDViommu core;
> + uint32_t bypass_hwpt_id;
> + uint32_t abort_hwpt_id;
> + QLIST_HEAD(, SMMUv3AccelDevice) device_list;
Does that need a forward declaration of SMMUv3AccelDevice?
> +} SMMUViommu;
> +
> typedef struct SMMUv3AccelDevice {
> SMMUDevice sdev;
I didn't comment on this earlier as I thought there would be something added that
aligned with the above, but it seems not. So why the double space?
> + HostIOMMUDeviceIOMMUFD *idev;
> + SMMUViommu *viommu;
> + QLIST_ENTRY(SMMUv3AccelDevice) next;
> } SMMUv3AccelDevice;
>
> +typedef struct SMMUv3AccelState {
> + SMMUViommu *viommu;
> +} SMMUv3AccelState;
> strongarm_uart_update_parameters(const char *label, int speed, char parity, int data_bits, int stop_bits) "%s speed=%d parity=%c data=%d stop=%d"
> strongarm_ssp_read_underrun(void) "SSP rx underrun"
> diff --git a/include/hw/arm/smmuv3.h b/include/hw/arm/smmuv3.h
> index bb7076286b..5f3e9089a7 100644
> --- a/include/hw/arm/smmuv3.h
> +++ b/include/hw/arm/smmuv3.h
> @@ -66,6 +66,7 @@ struct SMMUv3State {
>
> /* SMMU has HW accelerator support for nested S1 + s2 */
> bool accel;
> + struct SMMUv3AccelState *s_accel;
Seems like a bonus space before *
> };
>
> typedef enum {
^ permalink raw reply [flat|nested] 118+ messages in thread
* RE: [PATCH v4 08/27] hw/arm/smmuv3-accel: Add set/unset_iommu_device callback
2025-09-29 16:25 ` Jonathan Cameron via
@ 2025-09-30 8:13 ` Shameer Kolothum
0 siblings, 0 replies; 118+ messages in thread
From: Shameer Kolothum @ 2025-09-30 8:13 UTC (permalink / raw)
To: Jonathan Cameron
Cc: qemu-arm@nongnu.org, qemu-devel@nongnu.org, eric.auger@redhat.com,
peter.maydell@linaro.org, Jason Gunthorpe, Nicolin Chen,
ddutile@redhat.com, berrange@redhat.com, Nathan Chen, Matt Ochs,
smostafa@google.com, wangzhou1@hisilicon.com,
jiangkunkun@huawei.com, zhangfei.gao@linaro.org,
zhenzhong.duan@intel.com, yi.l.liu@intel.com,
shameerkolothum@gmail.com
> -----Original Message-----
> From: Jonathan Cameron <jonathan.cameron@huawei.com>
> Sent: 29 September 2025 17:26
> To: Shameer Kolothum <skolothumtho@nvidia.com>
> Cc: qemu-arm@nongnu.org; qemu-devel@nongnu.org;
> eric.auger@redhat.com; peter.maydell@linaro.org; Jason Gunthorpe
> <jgg@nvidia.com>; Nicolin Chen <nicolinc@nvidia.com>; ddutile@redhat.com;
> berrange@redhat.com; Nathan Chen <nathanc@nvidia.com>; Matt Ochs
> <mochs@nvidia.com>; smostafa@google.com; wangzhou1@hisilicon.com;
> jiangkunkun@huawei.com; zhangfei.gao@linaro.org;
> zhenzhong.duan@intel.com; yi.l.liu@intel.com;
> shameerkolothum@gmail.com
> Subject: Re: [PATCH v4 08/27] hw/arm/smmuv3-accel: Add
> set/unset_iommu_device callback
>
> #include CONFIG_DEVICES
> >
> > +typedef struct SMMUViommu {
> > + IOMMUFDBackend *iommufd;
> > + IOMMUFDViommu core;
> > + uint32_t bypass_hwpt_id;
> > + uint32_t abort_hwpt_id;
> > + QLIST_HEAD(, SMMUv3AccelDevice) device_list;
>
> Does that need a forward declaration of SMMUv3AccelDevice?
I haven't seen any compiler warnings or errors, and it looks like
QLIST_HEAD expands to something like:
struct {
struct SMMUv3AccelDevice *lh_first;
}
So should be fine, I guess.
Thanks,
Shameer
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [PATCH v4 08/27] hw/arm/smmuv3-accel: Add set/unset_iommu_device callback
2025-09-29 13:36 ` [PATCH v4 08/27] hw/arm/smmuv3-accel: Add set/unset_iommu_device callback Shameer Kolothum
2025-09-29 16:25 ` Jonathan Cameron via
@ 2025-10-02 6:52 ` Eric Auger
2025-10-02 11:34 ` Shameer Kolothum
2025-10-17 12:23 ` Eric Auger
2 siblings, 1 reply; 118+ messages in thread
From: Eric Auger @ 2025-10-02 6:52 UTC (permalink / raw)
To: Shameer Kolothum, qemu-arm, qemu-devel
Cc: peter.maydell, jgg, nicolinc, ddutile, berrange, nathanc, mochs,
smostafa, wangzhou1, jiangkunkun, jonathan.cameron, zhangfei.gao,
zhenzhong.duan, yi.l.liu, shameerkolothum
Hi Shameer,
On 9/29/25 3:36 PM, Shameer Kolothum wrote:
> From: Nicolin Chen <nicolinc@nvidia.com>
the original intent of this callback is
* @set_iommu_device: attach a HostIOMMUDevice to a vIOMMU
*
* Optional callback, if not implemented in vIOMMU, then vIOMMU can't
* retrieve host information from the associated HostIOMMUDevice.
*
The implementation below goes way beyond the simple "attachment" of the
HostIOMMUDevice to the vIOMMU: allocation of a vIOMMU, allocation of two
HWPTs, and creation of a new SMMUv3AccelState.
>
> Implement a set_iommu_device callback:
> -If found an existing viommu reuse that.
I think you need to document why you need a vIOMMU object.
> -Else,
> Allocate a vIOMMU with the nested parent S2 hwpt allocated by VFIO.
> Though, iommufd’s vIOMMU model supports nested translation by
> encapsulating a S2 nesting parent HWPT, devices cannot attach to this
> parent HWPT directly. So two proxy nested HWPTs (bypass and abort) are
> allocated to handle device attachments.
"devices cannot attach to this parent HWPT directly". Why? It is not clear to me what those hwpt are used for compared to the original one. Why are they mandated? To me this deserves some additional explanations. If they are s2 ones, I would use an s2 prefix too.
> -And add the dev to viommu device list
this is the initial objective of the callback
>
> Also add an unset_iommu_device to unwind/cleanup above.
I think you should document the introduction of SMMUv3AccelState. It
currently contains a single member; do you plan to add others?
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
> ---
> hw/arm/smmuv3-accel.c | 150 ++++++++++++++++++++++++++++++++++++++++
> hw/arm/smmuv3-accel.h | 17 +++++
> hw/arm/trace-events | 4 ++
> include/hw/arm/smmuv3.h | 1 +
> 4 files changed, 172 insertions(+)
>
> diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
> index 6b0e512d86..81fa738f6f 100644
> --- a/hw/arm/smmuv3-accel.c
> +++ b/hw/arm/smmuv3-accel.c
> @@ -8,6 +8,7 @@
>
> #include "qemu/osdep.h"
> #include "qemu/error-report.h"
> +#include "trace.h"
>
> #include "hw/arm/smmuv3.h"
> #include "hw/iommu.h"
> @@ -17,6 +18,9 @@
>
> #include "smmuv3-accel.h"
>
> +#define SMMU_STE_VALID (1ULL << 0)
> +#define SMMU_STE_CFG_BYPASS (1ULL << 3)
I would rather put that in smmuv3-internal.h where you have other STE
macros. Look for "/* STE fields */
> +
> static SMMUv3AccelDevice *smmuv3_accel_get_dev(SMMUState *bs, SMMUPciBus *sbus,
> PCIBus *bus, int devfn)
> {
> @@ -35,6 +39,149 @@ static SMMUv3AccelDevice *smmuv3_accel_get_dev(SMMUState *bs, SMMUPciBus *sbus,
> return accel_dev;
> }
>
> +static bool
> +smmuv3_accel_dev_alloc_viommu(SMMUv3AccelDevice *accel_dev,
> + HostIOMMUDeviceIOMMUFD *idev, Error **errp)
> +{
> + struct iommu_hwpt_arm_smmuv3 bypass_data = {
> + .ste = { SMMU_STE_CFG_BYPASS | SMMU_STE_VALID, 0x0ULL },
> + };
> + struct iommu_hwpt_arm_smmuv3 abort_data = {
> + .ste = { SMMU_STE_VALID, 0x0ULL },
> + };
> + SMMUDevice *sdev = &accel_dev->sdev;
> + SMMUState *bs = sdev->smmu;
> + SMMUv3State *s = ARM_SMMUV3(bs);
> + SMMUv3AccelState *s_accel = s->s_accel;
> + uint32_t s2_hwpt_id = idev->hwpt_id;
> + SMMUViommu *viommu;
> + uint32_t viommu_id;
> +
> + if (s_accel->viommu) {
> + accel_dev->viommu = s_accel->viommu;
> + return true;
> + }
> +
> + if (!iommufd_backend_alloc_viommu(idev->iommufd, idev->devid,
> + IOMMU_VIOMMU_TYPE_ARM_SMMUV3,
> + s2_hwpt_id, &viommu_id, errp)) {
> + return false;
> + }
> +
> + viommu = g_new0(SMMUViommu, 1);
> + viommu->core.viommu_id = viommu_id;
> + viommu->core.s2_hwpt_id = s2_hwpt_id;
> + viommu->core.iommufd = idev->iommufd;
> +
> + if (!iommufd_backend_alloc_hwpt(idev->iommufd, idev->devid,
> + viommu->core.viommu_id, 0,
> + IOMMU_HWPT_DATA_ARM_SMMUV3,
> + sizeof(abort_data), &abort_data,
> + &viommu->abort_hwpt_id, errp)) {
> + goto free_viommu;
> + }
> +
> + if (!iommufd_backend_alloc_hwpt(idev->iommufd, idev->devid,
> + viommu->core.viommu_id, 0,
> + IOMMU_HWPT_DATA_ARM_SMMUV3,
> + sizeof(bypass_data), &bypass_data,
> + &viommu->bypass_hwpt_id, errp)) {
> + goto free_abort_hwpt;
> + }
> +
> + viommu->iommufd = idev->iommufd;
> +
> + s_accel->viommu = viommu;
> + accel_dev->viommu = viommu;
> + return true;
> +
> +free_abort_hwpt:
> + iommufd_backend_free_id(idev->iommufd, viommu->abort_hwpt_id);
> +free_viommu:
> + iommufd_backend_free_id(idev->iommufd, viommu->core.viommu_id);
> + g_free(viommu);
> + return false;
> +}
> +
> +static bool smmuv3_accel_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
> + HostIOMMUDevice *hiod, Error **errp)
> +{
> + HostIOMMUDeviceIOMMUFD *idev = HOST_IOMMU_DEVICE_IOMMUFD(hiod);
> + SMMUState *bs = opaque;
> + SMMUv3State *s = ARM_SMMUV3(bs);
> + SMMUv3AccelState *s_accel = s->s_accel;
> + SMMUPciBus *sbus = smmu_get_sbus(bs, bus);
> + SMMUv3AccelDevice *accel_dev = smmuv3_accel_get_dev(bs, sbus, bus, devfn);
you are using smmuv3_accel_get_dev() but logically the function was
already called once before in smmuv3_accel_find_add_as()? Meaning the
add/allocate part of the function is not something that should happen
here. Shouldn't we add a new helper that does

SMMUPciBus *sbus = g_hash_table_lookup(bs->smmu_pcibus_by_busptr, bus);
if (!sbus) {
    return;
}
sdev = sbus->pbdev[devfn];
if (!sdev) {
    return;
}

that would be used by both set() and unset()?
> + SMMUDevice *sdev = &accel_dev->sdev;
> + uint16_t sid = smmu_get_sid(sdev);
> +
> + if (!idev) {
> + return true;
> + }
> +
> + if (accel_dev->idev) {
> + if (accel_dev->idev != idev) {
> + error_setg(errp, "Device 0x%x already has an associated IOMMU dev",
> + sid);
> + return false;
> + }
> + return true;
> + }
> +
> + if (!smmuv3_accel_dev_alloc_viommu(accel_dev, idev, errp)) {
> + error_setg(errp, "Device 0x%x: Unable to alloc viommu", sid);
> + return false;
> + }
> +
> + accel_dev->idev = idev;
> + QLIST_INSERT_HEAD(&s_accel->viommu->device_list, accel_dev, next);
> + trace_smmuv3_accel_set_iommu_device(devfn, sid);
> + return true;
> +}
> +
> +static void smmuv3_accel_unset_iommu_device(PCIBus *bus, void *opaque,
> + int devfn)
> +{
> + SMMUState *bs = opaque;
> + SMMUv3State *s = ARM_SMMUV3(bs);
> + SMMUPciBus *sbus = g_hash_table_lookup(bs->smmu_pcibus_by_busptr, bus);
> + SMMUv3AccelDevice *accel_dev;
> + SMMUViommu *viommu;
> + SMMUDevice *sdev;
> + uint16_t sid;
> +
> + if (!sbus) {
> + return;
> + }
> +
> + sdev = sbus->pbdev[devfn];
> + if (!sdev) {
> + return;
> + }
> +
> + sid = smmu_get_sid(sdev);
> + accel_dev = container_of(sdev, SMMUv3AccelDevice, sdev);
> + if (!host_iommu_device_iommufd_attach_hwpt(accel_dev->idev,
> + accel_dev->idev->hwpt_id,
> + NULL)) {
> + error_report("Unable to attach dev 0x%x to the default HW pagetable",
> + sid);
> + }
> +
> + accel_dev->idev = NULL;
> + QLIST_REMOVE(accel_dev, next);
> + trace_smmuv3_accel_unset_iommu_device(devfn, sid);
> +
> + viommu = s->s_accel->viommu;
> + if (QLIST_EMPTY(&viommu->device_list)) {
> + iommufd_backend_free_id(viommu->iommufd, viommu->bypass_hwpt_id);
> + iommufd_backend_free_id(viommu->iommufd, viommu->abort_hwpt_id);
> + iommufd_backend_free_id(viommu->iommufd, viommu->core.viommu_id);
> + g_free(viommu);
> + s->s_accel->viommu = NULL;
> + }
> +}
> +
> static bool smmuv3_accel_pdev_allowed(PCIDevice *pdev, bool *vfio_pci)
> {
>
> @@ -121,6 +268,8 @@ static uint64_t smmuv3_accel_get_viommu_flags(void *opaque)
> static const PCIIOMMUOps smmuv3_accel_ops = {
> .get_address_space = smmuv3_accel_find_add_as,
> .get_viommu_flags = smmuv3_accel_get_viommu_flags,
> + .set_iommu_device = smmuv3_accel_set_iommu_device,
> + .unset_iommu_device = smmuv3_accel_unset_iommu_device,
> };
>
> void smmuv3_accel_init(SMMUv3State *s)
> @@ -128,4 +277,5 @@ void smmuv3_accel_init(SMMUv3State *s)
> SMMUState *bs = ARM_SMMU(s);
>
> bs->iommu_ops = &smmuv3_accel_ops;
> + s->s_accel = g_new0(SMMUv3AccelState, 1);
> }
> diff --git a/hw/arm/smmuv3-accel.h b/hw/arm/smmuv3-accel.h
> index 70da16960f..3c8506d1e6 100644
> --- a/hw/arm/smmuv3-accel.h
> +++ b/hw/arm/smmuv3-accel.h
> @@ -10,12 +10,29 @@
> #define HW_ARM_SMMUV3_ACCEL_H
>
> #include "hw/arm/smmu-common.h"
> +#include "system/iommufd.h"
> +#include <linux/iommufd.h>
> #include CONFIG_DEVICES
>
> +typedef struct SMMUViommu {
would deserve to be documented with explanation of what it does abstract
> + IOMMUFDBackend *iommufd;
> + IOMMUFDViommu core;
> + uint32_t bypass_hwpt_id;
> + uint32_t abort_hwpt_id;
> + QLIST_HEAD(, SMMUv3AccelDevice) device_list;
> +} SMMUViommu;
> +
> typedef struct SMMUv3AccelDevice {
> SMMUDevice sdev;
> + HostIOMMUDeviceIOMMUFD *idev;
> + SMMUViommu *viommu;
> + QLIST_ENTRY(SMMUv3AccelDevice) next;
> } SMMUv3AccelDevice;
>
> +typedef struct SMMUv3AccelState {
> + SMMUViommu *viommu;
> +} SMMUv3AccelState;
> +
> #ifdef CONFIG_ARM_SMMUV3_ACCEL
> void smmuv3_accel_init(SMMUv3State *s);
> #else
> diff --git a/hw/arm/trace-events b/hw/arm/trace-events
> index f3386bd7ae..86370d448a 100644
> --- a/hw/arm/trace-events
> +++ b/hw/arm/trace-events
> @@ -66,6 +66,10 @@ smmuv3_notify_flag_del(const char *iommu) "DEL SMMUNotifier node for iommu mr=%s
> smmuv3_inv_notifiers_iova(const char *name, int asid, int vmid, uint64_t iova, uint8_t tg, uint64_t num_pages, int stage) "iommu mr=%s asid=%d vmid=%d iova=0x%"PRIx64" tg=%d num_pages=0x%"PRIx64" stage=%d"
> smmu_reset_exit(void) ""
>
> +# smmuv3-accel.c
> +smmuv3_accel_set_iommu_device(int devfn, uint32_t sid) "devfn=0x%x (sid=0x%x)"
> +smmuv3_accel_unset_iommu_device(int devfn, uint32_t sid) "devfn=0x%x (sid=0x%x)"
> +
> # strongarm.c
> strongarm_uart_update_parameters(const char *label, int speed, char parity, int data_bits, int stop_bits) "%s speed=%d parity=%c data=%d stop=%d"
> strongarm_ssp_read_underrun(void) "SSP rx underrun"
> diff --git a/include/hw/arm/smmuv3.h b/include/hw/arm/smmuv3.h
> index bb7076286b..5f3e9089a7 100644
> --- a/include/hw/arm/smmuv3.h
> +++ b/include/hw/arm/smmuv3.h
> @@ -66,6 +66,7 @@ struct SMMUv3State {
>
> /* SMMU has HW accelerator support for nested S1 + s2 */
> bool accel;
> + struct SMMUv3AccelState *s_accel;
> };
>
> typedef enum {
Thanks
Eric
^ permalink raw reply [flat|nested] 118+ messages in thread
* RE: [PATCH v4 08/27] hw/arm/smmuv3-accel: Add set/unset_iommu_device callback
2025-10-02 6:52 ` Eric Auger
@ 2025-10-02 11:34 ` Shameer Kolothum
2025-10-02 16:44 ` Nicolin Chen
0 siblings, 1 reply; 118+ messages in thread
From: Shameer Kolothum @ 2025-10-02 11:34 UTC (permalink / raw)
To: eric.auger@redhat.com, qemu-arm@nongnu.org, qemu-devel@nongnu.org
Cc: peter.maydell@linaro.org, Jason Gunthorpe, Nicolin Chen,
ddutile@redhat.com, berrange@redhat.com, Nathan Chen, Matt Ochs,
smostafa@google.com, wangzhou1@hisilicon.com,
jiangkunkun@huawei.com, jonathan.cameron@huawei.com,
zhangfei.gao@linaro.org, zhenzhong.duan@intel.com,
yi.l.liu@intel.com, shameerkolothum@gmail.com
> -----Original Message-----
> From: Eric Auger <eric.auger@redhat.com>
> Sent: 02 October 2025 07:53
> To: Shameer Kolothum <skolothumtho@nvidia.com>; qemu-
> arm@nongnu.org; qemu-devel@nongnu.org
> Cc: peter.maydell@linaro.org; Jason Gunthorpe <jgg@nvidia.com>; Nicolin
> Chen <nicolinc@nvidia.com>; ddutile@redhat.com; berrange@redhat.com;
> Nathan Chen <nathanc@nvidia.com>; Matt Ochs <mochs@nvidia.com>;
> smostafa@google.com; wangzhou1@hisilicon.com;
> jiangkunkun@huawei.com; jonathan.cameron@huawei.com;
> zhangfei.gao@linaro.org; zhenzhong.duan@intel.com; yi.l.liu@intel.com;
> shameerkolothum@gmail.com
> Subject: Re: [PATCH v4 08/27] hw/arm/smmuv3-accel: Add
> set/unset_iommu_device callback
>
>
>
> Hi Shameer,
>
> On 9/29/25 3:36 PM, Shameer Kolothum wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> the original intent of this callback is
>
> * @set_iommu_device: attach a HostIOMMUDevice to a vIOMMU
> *
> * Optional callback, if not implemented in vIOMMU, then vIOMMU can't
> * retrieve host information from the associated HostIOMMUDevice.
> *
>
> The implementation below goes way behond the simple "attachment" of the
> HostIOMMUDevice to the vIOMMU.
> allocation of a vIOMMU; allocation of 2 HWPTs, creation of a new
> SMMUv3AccelState
Sure. It does do all of this. Will update.
> >
> > Implement a set_iommu_device callback:
> > -If found an existing viommu reuse that.
> I think you need to document why you need a vIOMMU object.
> > -Else,
> > Allocate a vIOMMU with the nested parent S2 hwpt allocated by VFIO.
> > Though, iommufd’s vIOMMU model supports nested translation by
> > encapsulating a S2 nesting parent HWPT, devices cannot attach to this
> > parent HWPT directly. So two proxy nested HWPTs (bypass and abort) are
> > allocated to handle device attachments.
>
> "devices cannot attach to this parent HWPT directly". Why? It is not clear to
> me what those hwpt are used for compared to the original one. Why are they
> mandated? To me this deserves some additional explanations. If they are s2
> ones, I would use an s2 prefix too.
Ok. This needs some rephrasing.
The idea is, we cannot attach a domain to the SMMUv3 for this device yet.
We need a vDEVICE object (which will link vSID to pSID) for attach. Please see
Patch #10.
Here we just allocate two domains (bypass and abort) for later attach based on
the guest's request.
These are not S2-only HWPTs per se. They are of type IOMMU_DOMAIN_NESTED.
From kernel doc:
#define __IOMMU_DOMAIN_NESTED (1U << 6) /* User-managed address space nested
on a stage-2 translation */
I will update the comment which will clarify this better.
@Nicolin, please correct me if my above understanding is not right.
>
> > -And add the dev to viommu device list
> this is the initial objective of the callback
> >
> > Also add an unset_iommu_device to unwind/cleanup above.
>
> I think you shall document the introduction of SMMUv3AccelState.It
> currently contains a single member, do you plan to add others.
Ok. I think for this series we only have one member for now. But when
we add support for vEVENTQ and vCMDQ, this will have more.
> > Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> > Signed-off-by: Shameer Kolothum
> <shameerali.kolothum.thodi@huawei.com
> > Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
> > ---
> > hw/arm/smmuv3-accel.c | 150
> ++++++++++++++++++++++++++++++++++++++++
> > hw/arm/smmuv3-accel.h | 17 +++++
> > hw/arm/trace-events | 4 ++
> > include/hw/arm/smmuv3.h | 1 +
> > 4 files changed, 172 insertions(+)
> >
> > diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
> > index 6b0e512d86..81fa738f6f 100644
> > --- a/hw/arm/smmuv3-accel.c
> > +++ b/hw/arm/smmuv3-accel.c
> > @@ -8,6 +8,7 @@
> >
> > #include "qemu/osdep.h"
> > #include "qemu/error-report.h"
> > +#include "trace.h"
> >
> > #include "hw/arm/smmuv3.h"
> > #include "hw/iommu.h"
> > @@ -17,6 +18,9 @@
> >
> > #include "smmuv3-accel.h"
> >
> > +#define SMMU_STE_VALID (1ULL << 0)
> > +#define SMMU_STE_CFG_BYPASS (1ULL << 3)
> I would rather put that in smmuv3-internal.h where you have other STE
> macros. Look for "/* STE fields */
Ok.
> > +
> > static SMMUv3AccelDevice *smmuv3_accel_get_dev(SMMUState *bs,
> SMMUPciBus *sbus,
> > PCIBus *bus, int devfn)
> > {
> > @@ -35,6 +39,149 @@ static SMMUv3AccelDevice
> *smmuv3_accel_get_dev(SMMUState *bs, SMMUPciBus *sbus,
> > return accel_dev;
> > }
> >
> > +static bool
> > +smmuv3_accel_dev_alloc_viommu(SMMUv3AccelDevice *accel_dev,
> > + HostIOMMUDeviceIOMMUFD *idev, Error **errp)
> > +{
> > + struct iommu_hwpt_arm_smmuv3 bypass_data = {
> > + .ste = { SMMU_STE_CFG_BYPASS | SMMU_STE_VALID, 0x0ULL },
> > + };
> > + struct iommu_hwpt_arm_smmuv3 abort_data = {
> > + .ste = { SMMU_STE_VALID, 0x0ULL },
> > + };
> > + SMMUDevice *sdev = &accel_dev->sdev;
> > + SMMUState *bs = sdev->smmu;
> > + SMMUv3State *s = ARM_SMMUV3(bs);
> > + SMMUv3AccelState *s_accel = s->s_accel;
> > + uint32_t s2_hwpt_id = idev->hwpt_id;
> > + SMMUViommu *viommu;
> > + uint32_t viommu_id;
> > +
> > + if (s_accel->viommu) {
> > + accel_dev->viommu = s_accel->viommu;
> > + return true;
> > + }
> > +
> > + if (!iommufd_backend_alloc_viommu(idev->iommufd, idev->devid,
> > + IOMMU_VIOMMU_TYPE_ARM_SMMUV3,
> > + s2_hwpt_id, &viommu_id, errp)) {
> > + return false;
> > + }
> > +
> > + viommu = g_new0(SMMUViommu, 1);
> > + viommu->core.viommu_id = viommu_id;
> > + viommu->core.s2_hwpt_id = s2_hwpt_id;
> > + viommu->core.iommufd = idev->iommufd;
> > +
> > + if (!iommufd_backend_alloc_hwpt(idev->iommufd, idev->devid,
> > + viommu->core.viommu_id, 0,
> > + IOMMU_HWPT_DATA_ARM_SMMUV3,
> > + sizeof(abort_data), &abort_data,
> > + &viommu->abort_hwpt_id, errp)) {
> > + goto free_viommu;
> > + }
> > +
> > + if (!iommufd_backend_alloc_hwpt(idev->iommufd, idev->devid,
> > + viommu->core.viommu_id, 0,
> > + IOMMU_HWPT_DATA_ARM_SMMUV3,
> > + sizeof(bypass_data), &bypass_data,
> > + &viommu->bypass_hwpt_id, errp)) {
> > + goto free_abort_hwpt;
> > + }
> > +
> > + viommu->iommufd = idev->iommufd;
> > +
> > + s_accel->viommu = viommu;
> > + accel_dev->viommu = viommu;
> > + return true;
> > +
> > +free_abort_hwpt:
> > + iommufd_backend_free_id(idev->iommufd, viommu->abort_hwpt_id);
> > +free_viommu:
> > + iommufd_backend_free_id(idev->iommufd, viommu->core.viommu_id);
> > + g_free(viommu);
> > + return false;
> > +}
> > +
> > +static bool smmuv3_accel_set_iommu_device(PCIBus *bus, void *opaque,
> int devfn,
> > + HostIOMMUDevice *hiod, Error **errp)
> > +{
> > + HostIOMMUDeviceIOMMUFD *idev =
> HOST_IOMMU_DEVICE_IOMMUFD(hiod);
> > + SMMUState *bs = opaque;
> > + SMMUv3State *s = ARM_SMMUV3(bs);
> > + SMMUv3AccelState *s_accel = s->s_accel;
> > + SMMUPciBus *sbus = smmu_get_sbus(bs, bus);
> > + SMMUv3AccelDevice *accel_dev = smmuv3_accel_get_dev(bs, sbus,
> bus, devfn);
> you are using smmuv3_accel_get_dev() but logically the function was
> already called once before in smmuv3_accel_find_add_as()? Meaning the
> add/allocate part of the function is not something that should happen
> here. Shouldn't we add a new helper that does
>
> SMMUPciBus *sbus = g_hash_table_lookup(bs->smmu_pcibus_by_busptr, bus);
> if (!sbus) {
>     return;
> }
> sdev = sbus->pbdev[devfn];
> if (!sdev) {
>     return;
> }
>
> that would be used by both set() and unset()?
Hmm.. I am not sure I quite follow the need for that.
The second smmuv3_accel_get_dev() call will just retrieve the
SMMUDevice *sdev = sbus->pbdev[devfn];
and then return immediately:

if (sdev) {
    return container_of(sdev, SMMUv3AccelDevice, sdev);
}

Do we really need a separate helper for this?
Thanks,
Shameer
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [PATCH v4 08/27] hw/arm/smmuv3-accel: Add set/unset_iommu_device callback
2025-10-02 11:34 ` Shameer Kolothum
@ 2025-10-02 16:44 ` Nicolin Chen
2025-10-02 18:35 ` Jason Gunthorpe
2025-10-17 12:06 ` Eric Auger
0 siblings, 2 replies; 118+ messages in thread
From: Nicolin Chen @ 2025-10-02 16:44 UTC (permalink / raw)
To: Shameer Kolothum
Cc: eric.auger@redhat.com, qemu-arm@nongnu.org, qemu-devel@nongnu.org,
peter.maydell@linaro.org, Jason Gunthorpe, ddutile@redhat.com,
berrange@redhat.com, Nathan Chen, Matt Ochs, smostafa@google.com,
wangzhou1@hisilicon.com, jiangkunkun@huawei.com,
jonathan.cameron@huawei.com, zhangfei.gao@linaro.org,
zhenzhong.duan@intel.com, yi.l.liu@intel.com,
shameerkolothum@gmail.com
On Thu, Oct 02, 2025 at 04:34:17AM -0700, Shameer Kolothum wrote:
> > > Implement a set_iommu_device callback:
> > > -If found an existing viommu reuse that.
> > I think you need to document why you need a vIOMMU object.
> > > -Else,
> > > Allocate a vIOMMU with the nested parent S2 hwpt allocated by VFIO.
> > > Though, iommufd’s vIOMMU model supports nested translation by
> > > encapsulating a S2 nesting parent HWPT, devices cannot attach to this
> > > parent HWPT directly. So two proxy nested HWPTs (bypass and abort) are
> > > allocated to handle device attachments.
> >
> > "devices cannot attach to this parent HWPT directly". Why? It is not clear to
> > me what those hwpt are used for compared to the original one. Why are they
> > mandated? To me this deserves some additional explanations. If they are s2
> > ones, I would use an s2 prefix too.
>
> Ok. This needs some rephrasing.
>
> The idea is, we cannot yet attach a domain to the SMMUv3 for this device yet.
> We need a vDEVICE object (which will link vSID to pSID) for attach. Please see
> Patch #10.
>
> Here we just allocate two domains(bypass or abort) for later attach based on
> Guest request.
>
> These are not S2 only HWPT per se. They are of type IOMMU_DOMAIN_NESTED.
>
> From kernel doc:
>
> #define __IOMMU_DOMAIN_NESTED (1U << 6) /* User-managed address space nested
> on a stage-2 translation */
There are a couple of things going on here:
1) We should not attach directly to the S2 HWPT that eventually
will be shared across vSMMU instances. In other words, an S2
HWPT will not be attachable, as it lacks a tie to an SMMU
instance and has no VMID at all. Instead, each vIOMMU
object allocated using this S2 HWPT will hold the VMID.
2) A device cannot attach to a vIOMMU directly but has to attach
through a proxy nested HWPT (IOMMU_DOMAIN_NESTED). To attach
to an IOMMU_DOMAIN_NESTED, a vDEVICE must be allocated with a
given vSID.
This might sound a bit complicated but I think it makes sense from
a VM perspective, as a device that's behind a vSMMU should have a
guest-level SID and its corresponding STE: if the device is working
in the S2-only mode (physically), there must be a guest-level STE
configuring to the S1-BYPASS mode, where the "bypass" proxy HWPT
will be picked for attachment.
So, for rephrasing, I think it would be nicer to say something like:
"
A device that is put behind a vSMMU instance must have a vSID and its
corresponding vSTEs (bypass/abort/translate). Pre-allocate the bypass
and abort vSTEs as two proxy nested HWPTs for the device to attach to
a vIOMMU.
Note that the core-managed nesting parent HWPT should not be attached
directly when using the iommufd's vIOMMU model. This is also because
we want that nesting parent HWPT to be reused eventually across vSMMU
instances in the same VM.
"
Nicolin
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [PATCH v4 08/27] hw/arm/smmuv3-accel: Add set/unset_iommu_device callback
2025-10-02 16:44 ` Nicolin Chen
@ 2025-10-02 18:35 ` Jason Gunthorpe
2025-10-17 12:06 ` Eric Auger
1 sibling, 0 replies; 118+ messages in thread
From: Jason Gunthorpe @ 2025-10-02 18:35 UTC (permalink / raw)
To: Nicolin Chen
Cc: Shameer Kolothum, eric.auger@redhat.com, qemu-arm@nongnu.org,
qemu-devel@nongnu.org, peter.maydell@linaro.org,
ddutile@redhat.com, berrange@redhat.com, Nathan Chen, Matt Ochs,
smostafa@google.com, wangzhou1@hisilicon.com,
jiangkunkun@huawei.com, jonathan.cameron@huawei.com,
zhangfei.gao@linaro.org, zhenzhong.duan@intel.com,
yi.l.liu@intel.com, shameerkolothum@gmail.com
On Thu, Oct 02, 2025 at 09:44:18AM -0700, Nicolin Chen wrote:
> On Thu, Oct 02, 2025 at 04:34:17AM -0700, Shameer Kolothum wrote:
> > > > Implement a set_iommu_device callback:
> > > > -If found an existing viommu reuse that.
> > > I think you need to document why you need a vIOMMU object.
> > > > -Else,
> > > > Allocate a vIOMMU with the nested parent S2 hwpt allocated by VFIO.
> > > > Though, iommufd’s vIOMMU model supports nested translation by
> > > > encapsulating a S2 nesting parent HWPT, devices cannot attach to this
> > > > parent HWPT directly. So two proxy nested HWPTs (bypass and abort) are
> > > > allocated to handle device attachments.
> > >
> > > "devices cannot attach to this parent HWPT directly". Why? It is not clear to
> > > me what those hwpt are used for compared to the original one. Why are they
> > > mandated? To me this deserves some additional explanations. If they are s2
> > > ones, I would use an s2 prefix too.
> >
> > Ok. This needs some rephrasing.
> >
> > The idea is, we cannot yet attach a domain to the SMMUv3 for this device yet.
> > We need a vDEVICE object (which will link vSID to pSID) for attach. Please see
> > Patch #10.
> >
> > Here we just allocate two domains(bypass or abort) for later attach based on
> > Guest request.
> >
> > These are not S2 only HWPT per se. They are of type IOMMU_DOMAIN_NESTED.
> >
> > From kernel doc:
> >
> > #define __IOMMU_DOMAIN_NESTED (1U << 6) /* User-managed address space nested
> > on a stage-2 translation */
>
> There are a couple of things going on here:
> 1) We should not attach directly to the S2 HWPT that eventually
> will be shared across vSMMU instances. In other word, an S2
> HWPT will not be attachable for lacking of its tie to an SMMU
> instance and not having a VMID at all. Instead, each vIOMMU
> object allocated using this S2 HWPT will hold the VMID.
>
> 2) A device cannot attach to a vIOMMU directly but has to attach
> through a proxy nested HWPT (IOMMU_DOMAIN_NESTED). To attach
> to an IOMMU_DOMAIN_NESTED, a vDEVICE must be allocated with a
> given vSID.
>
> This might sound a bit complicated but I think it makes sense from
> a VM perspective, as a device that's behind a vSMMU should have a
> guest-level SID and its corresponding STE: if the device is working
> in the S2-only mode (physically), there must be a guest-level STE
> configuring to the S1-BYPASS mode, where the "bypass" proxy HWPT
> will be picked for attachment.
>
> So, for rephrasing, I think it would nicer to say something like:
>
> "
> A device that is put behind a vSMMU instance must have a vSID and its
> corresponding vSTEs (bypass/abort/translate). Pre-allocate the bypass
> and abort vSTEs as two proxy nested HWPTs for the device to attach to
> a vIOMMU.
>
> Note that the core-managed nesting parent HWPT should not be attached
> directly when using the iommufd's vIOMMU model. This is also because
> we want that nesting parent HWPT to be reused eventually across vSMMU
> instances in the same VM.
> "
This all seems correct to me
Thanks,
Jason
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [PATCH v4 08/27] hw/arm/smmuv3-accel: Add set/unset_iommu_device callback
2025-10-02 16:44 ` Nicolin Chen
2025-10-02 18:35 ` Jason Gunthorpe
@ 2025-10-17 12:06 ` Eric Auger
1 sibling, 0 replies; 118+ messages in thread
From: Eric Auger @ 2025-10-17 12:06 UTC (permalink / raw)
To: Nicolin Chen, Shameer Kolothum
Cc: qemu-arm@nongnu.org, qemu-devel@nongnu.org,
peter.maydell@linaro.org, Jason Gunthorpe, ddutile@redhat.com,
berrange@redhat.com, Nathan Chen, Matt Ochs, smostafa@google.com,
wangzhou1@hisilicon.com, jiangkunkun@huawei.com,
jonathan.cameron@huawei.com, zhangfei.gao@linaro.org,
zhenzhong.duan@intel.com, yi.l.liu@intel.com,
shameerkolothum@gmail.com
On 10/2/25 6:44 PM, Nicolin Chen wrote:
> On Thu, Oct 02, 2025 at 04:34:17AM -0700, Shameer Kolothum wrote:
>>>> Implement a set_iommu_device callback:
>>>> - If an existing viommu is found, reuse it.
>>> I think you need to document why you need a vIOMMU object.
>>>> -Else,
>>>> Allocate a vIOMMU with the nested parent S2 hwpt allocated by VFIO.
>>>> Although iommufd's vIOMMU model supports nested translation by
>>>> encapsulating an S2 nesting parent HWPT, devices cannot attach to this
>>>> parent HWPT directly. So two proxy nested HWPTs (bypass and abort) are
>>>> allocated to handle device attachments.
>>> "devices cannot attach to this parent HWPT directly". Why? It is not clear to
>>> me what those hwpt are used for compared to the original one. Why are they
>>> mandated? To me this deserves some additional explanations. If they are s2
>>> ones, I would use an s2 prefix too.
>> Ok. This needs some rephrasing.
>>
>> The idea is, we cannot yet attach a domain to the SMMUv3 for this device.
>> We need a vDEVICE object (which will link vSID to pSID) for attach. Please see
>> Patch #10.
>>
>> Here we just allocate two domains (bypass and abort) for a later attach based
>> on the guest's request.
>>
>> These are not S2-only HWPTs per se. They are of type IOMMU_DOMAIN_NESTED.
>>
>> From kernel doc:
>>
>> #define __IOMMU_DOMAIN_NESTED (1U << 6) /* User-managed address space nested
>> on a stage-2 translation */
> There are a couple of things going on here:
> 1) We should not attach directly to the S2 HWPT that eventually
> will be shared across vSMMU instances. In other words, an S2
> HWPT will not be attachable, as it lacks a tie to an SMMU
> instance and has no VMID at all. Instead, each vIOMMU
> object allocated using this S2 HWPT will hold the VMID.
>
> 2) A device cannot attach to a vIOMMU directly but has to attach
> through a proxy nested HWPT (IOMMU_DOMAIN_NESTED). To attach
> to an IOMMU_DOMAIN_NESTED, a vDEVICE must be allocated with a
> given vSID.
>
> This might sound a bit complicated but I think it makes sense from
> a VM perspective, as a device that's behind a vSMMU should have a
> guest-level SID and its corresponding STE: if the device is working
> in the S2-only mode (physically), there must be a guest-level STE
> configured to the S1-BYPASS mode, where the "bypass" proxy HWPT
> will be picked for attachment.
>
> So, for rephrasing, I think it would be nicer to say something like:
>
> "
> A device that is put behind a vSMMU instance must have a vSID and its
> corresponding vSTEs (bypass/abort/translate). Pre-allocate the bypass
> and abort vSTEs as two proxy nested HWPTs for the device to attach to
> a vIOMMU.
>
> Note that the core-managed nesting parent HWPT should not be attached
> directly when using the iommufd's vIOMMU model. This is also because
> we want that nesting parent HWPT to be reused eventually across vSMMU
> instances in the same VM.
> "
I would add 1) and 2) in the commit msg as well. This definitely helps
in understanding the whole setup.
Eric
>
> Nicolin
>
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [PATCH v4 08/27] hw/arm/smmuv3-accel: Add set/unset_iommu_device callback
2025-09-29 13:36 ` [PATCH v4 08/27] hw/arm/smmuv3-accel: Add set/unset_iommu_device callback Shameer Kolothum
2025-09-29 16:25 ` Jonathan Cameron via
2025-10-02 6:52 ` Eric Auger
@ 2025-10-17 12:23 ` Eric Auger
2 siblings, 0 replies; 118+ messages in thread
From: Eric Auger @ 2025-10-17 12:23 UTC (permalink / raw)
To: Shameer Kolothum, qemu-arm, qemu-devel
Cc: peter.maydell, jgg, nicolinc, ddutile, berrange, nathanc, mochs,
smostafa, wangzhou1, jiangkunkun, jonathan.cameron, zhangfei.gao,
zhenzhong.duan, yi.l.liu, shameerkolothum
On 9/29/25 3:36 PM, Shameer Kolothum wrote:
> From: Nicolin Chen <nicolinc@nvidia.com>
>
> Implement a set_iommu_device callback:
> - If an existing vIOMMU is found, reuse it.
> - Else, allocate a vIOMMU with the nested parent S2 HWPT allocated by
>   VFIO. Although iommufd's vIOMMU model supports nested translation by
>   encapsulating an S2 nesting parent HWPT, devices cannot attach to this
>   parent HWPT directly, so two proxy nested HWPTs (bypass and abort) are
>   allocated to handle device attachments.
> - And add the device to the vIOMMU device list.
>
> Also add an unset_iommu_device to unwind/cleanup above.
>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
> ---
> hw/arm/smmuv3-accel.c | 150 ++++++++++++++++++++++++++++++++++++++++
> hw/arm/smmuv3-accel.h | 17 +++++
> hw/arm/trace-events | 4 ++
> include/hw/arm/smmuv3.h | 1 +
> 4 files changed, 172 insertions(+)
>
> diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
> index 6b0e512d86..81fa738f6f 100644
> --- a/hw/arm/smmuv3-accel.c
> +++ b/hw/arm/smmuv3-accel.c
> @@ -8,6 +8,7 @@
>
> #include "qemu/osdep.h"
> #include "qemu/error-report.h"
> +#include "trace.h"
>
> #include "hw/arm/smmuv3.h"
> #include "hw/iommu.h"
> @@ -17,6 +18,9 @@
>
> #include "smmuv3-accel.h"
>
> +#define SMMU_STE_VALID (1ULL << 0)
> +#define SMMU_STE_CFG_BYPASS (1ULL << 3)
> +
> static SMMUv3AccelDevice *smmuv3_accel_get_dev(SMMUState *bs, SMMUPciBus *sbus,
> PCIBus *bus, int devfn)
> {
> @@ -35,6 +39,149 @@ static SMMUv3AccelDevice *smmuv3_accel_get_dev(SMMUState *bs, SMMUPciBus *sbus,
> return accel_dev;
> }
>
> +static bool
> +smmuv3_accel_dev_alloc_viommu(SMMUv3AccelDevice *accel_dev,
> + HostIOMMUDeviceIOMMUFD *idev, Error **errp)
> +{
> + struct iommu_hwpt_arm_smmuv3 bypass_data = {
> + .ste = { SMMU_STE_CFG_BYPASS | SMMU_STE_VALID, 0x0ULL },
> + };
> + struct iommu_hwpt_arm_smmuv3 abort_data = {
> + .ste = { SMMU_STE_VALID, 0x0ULL },
> + };
> + SMMUDevice *sdev = &accel_dev->sdev;
> + SMMUState *bs = sdev->smmu;
> + SMMUv3State *s = ARM_SMMUV3(bs);
> + SMMUv3AccelState *s_accel = s->s_accel;
> + uint32_t s2_hwpt_id = idev->hwpt_id;
> + SMMUViommu *viommu;
> + uint32_t viommu_id;
> +
> + if (s_accel->viommu) {
> + accel_dev->viommu = s_accel->viommu;
> + return true;
> + }
> +
> + if (!iommufd_backend_alloc_viommu(idev->iommufd, idev->devid,
> + IOMMU_VIOMMU_TYPE_ARM_SMMUV3,
> + s2_hwpt_id, &viommu_id, errp)) {
> + return false;
> + }
> +
> + viommu = g_new0(SMMUViommu, 1);
> + viommu->core.viommu_id = viommu_id;
> + viommu->core.s2_hwpt_id = s2_hwpt_id;
> + viommu->core.iommufd = idev->iommufd;
> +
> + if (!iommufd_backend_alloc_hwpt(idev->iommufd, idev->devid,
> + viommu->core.viommu_id, 0,
> + IOMMU_HWPT_DATA_ARM_SMMUV3,
> + sizeof(abort_data), &abort_data,
> + &viommu->abort_hwpt_id, errp)) {
> + goto free_viommu;
> + }
> +
> + if (!iommufd_backend_alloc_hwpt(idev->iommufd, idev->devid,
> + viommu->core.viommu_id, 0,
> + IOMMU_HWPT_DATA_ARM_SMMUV3,
> + sizeof(bypass_data), &bypass_data,
> + &viommu->bypass_hwpt_id, errp)) {
> + goto free_abort_hwpt;
> + }
> +
> + viommu->iommufd = idev->iommufd;
> +
> + s_accel->viommu = viommu;
> + accel_dev->viommu = viommu;
> + return true;
> +
> +free_abort_hwpt:
> + iommufd_backend_free_id(idev->iommufd, viommu->abort_hwpt_id);
> +free_viommu:
> + iommufd_backend_free_id(idev->iommufd, viommu->core.viommu_id);
> + g_free(viommu);
> + return false;
> +}
> +
> +static bool smmuv3_accel_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
> + HostIOMMUDevice *hiod, Error **errp)
> +{
> + HostIOMMUDeviceIOMMUFD *idev = HOST_IOMMU_DEVICE_IOMMUFD(hiod);
> + SMMUState *bs = opaque;
> + SMMUv3State *s = ARM_SMMUV3(bs);
> + SMMUv3AccelState *s_accel = s->s_accel;
> + SMMUPciBus *sbus = smmu_get_sbus(bs, bus);
> + SMMUv3AccelDevice *accel_dev = smmuv3_accel_get_dev(bs, sbus, bus, devfn);
> + SMMUDevice *sdev = &accel_dev->sdev;
> + uint16_t sid = smmu_get_sid(sdev);
> +
> + if (!idev) {
> + return true;
> + }
> +
> + if (accel_dev->idev) {
> + if (accel_dev->idev != idev) {
> + error_setg(errp, "Device 0x%x already has an associated IOMMU dev",
> + sid);
> + return false;
> + }
> + return true;
> + }
> +
> + if (!smmuv3_accel_dev_alloc_viommu(accel_dev, idev, errp)) {
> + error_setg(errp, "Device 0x%x: Unable to alloc viommu", sid);
> + return false;
> + }
> +
> + accel_dev->idev = idev;
> + QLIST_INSERT_HEAD(&s_accel->viommu->device_list, accel_dev, next);
> + trace_smmuv3_accel_set_iommu_device(devfn, sid);
> + return true;
> +}
> +
> +static void smmuv3_accel_unset_iommu_device(PCIBus *bus, void *opaque,
> + int devfn)
> +{
> + SMMUState *bs = opaque;
> + SMMUv3State *s = ARM_SMMUV3(bs);
> + SMMUPciBus *sbus = g_hash_table_lookup(bs->smmu_pcibus_by_busptr, bus);
> + SMMUv3AccelDevice *accel_dev;
> + SMMUViommu *viommu;
> + SMMUDevice *sdev;
> + uint16_t sid;
> +
> + if (!sbus) {
> + return;
> + }
> +
> + sdev = sbus->pbdev[devfn];
> + if (!sdev) {
> + return;
> + }
> +
> + sid = smmu_get_sid(sdev);
> + accel_dev = container_of(sdev, SMMUv3AccelDevice, sdev);
> + if (!host_iommu_device_iommufd_attach_hwpt(accel_dev->idev,
> + accel_dev->idev->hwpt_id,
> + NULL)) {
> + error_report("Unable to attach dev 0x%x to the default HW pagetable",
> + sid);
> + }
> +
> + accel_dev->idev = NULL;
> + QLIST_REMOVE(accel_dev, next);
> + trace_smmuv3_accel_unset_iommu_device(devfn, sid);
> +
> + viommu = s->s_accel->viommu;
> + if (QLIST_EMPTY(&viommu->device_list)) {
> + iommufd_backend_free_id(viommu->iommufd, viommu->bypass_hwpt_id);
> + iommufd_backend_free_id(viommu->iommufd, viommu->abort_hwpt_id);
> + iommufd_backend_free_id(viommu->iommufd, viommu->core.viommu_id);
> + g_free(viommu);
> + s->s_accel->viommu = NULL;
> + }
> +}
> +
> static bool smmuv3_accel_pdev_allowed(PCIDevice *pdev, bool *vfio_pci)
> {
>
> @@ -121,6 +268,8 @@ static uint64_t smmuv3_accel_get_viommu_flags(void *opaque)
> static const PCIIOMMUOps smmuv3_accel_ops = {
> .get_address_space = smmuv3_accel_find_add_as,
> .get_viommu_flags = smmuv3_accel_get_viommu_flags,
> + .set_iommu_device = smmuv3_accel_set_iommu_device,
> + .unset_iommu_device = smmuv3_accel_unset_iommu_device,
> };
>
> void smmuv3_accel_init(SMMUv3State *s)
> @@ -128,4 +277,5 @@ void smmuv3_accel_init(SMMUv3State *s)
> SMMUState *bs = ARM_SMMU(s);
>
> bs->iommu_ops = &smmuv3_accel_ops;
> + s->s_accel = g_new0(SMMUv3AccelState, 1);
> }
> diff --git a/hw/arm/smmuv3-accel.h b/hw/arm/smmuv3-accel.h
> index 70da16960f..3c8506d1e6 100644
> --- a/hw/arm/smmuv3-accel.h
> +++ b/hw/arm/smmuv3-accel.h
> @@ -10,12 +10,29 @@
> #define HW_ARM_SMMUV3_ACCEL_H
>
> #include "hw/arm/smmu-common.h"
> +#include "system/iommufd.h"
> +#include <linux/iommufd.h>
> #include CONFIG_DEVICES
>
> +typedef struct SMMUViommu {
> + IOMMUFDBackend *iommufd;
> + IOMMUFDViommu core;
Could we avoid using too generic field names like "core"? In the rest of
the code it is then difficult to understand what the field corresponds to.
Maybe "viommu"?
> + uint32_t bypass_hwpt_id;
> + uint32_t abort_hwpt_id;
> + QLIST_HEAD(, SMMUv3AccelDevice) device_list;
> +} SMMUViommu;
> +
> typedef struct SMMUv3AccelDevice {
> SMMUDevice sdev;
> + HostIOMMUDeviceIOMMUFD *idev;
Same here. "hdev" would at least refer to a host device. Or does it
correspond to some kernel terminology?
Eric
> + SMMUViommu *viommu;
> + QLIST_ENTRY(SMMUv3AccelDevice) next;
> } SMMUv3AccelDevice;
>
> +typedef struct SMMUv3AccelState {
> + SMMUViommu *viommu;
> +} SMMUv3AccelState;
> +
> #ifdef CONFIG_ARM_SMMUV3_ACCEL
> void smmuv3_accel_init(SMMUv3State *s);
> #else
> diff --git a/hw/arm/trace-events b/hw/arm/trace-events
> index f3386bd7ae..86370d448a 100644
> --- a/hw/arm/trace-events
> +++ b/hw/arm/trace-events
> @@ -66,6 +66,10 @@ smmuv3_notify_flag_del(const char *iommu) "DEL SMMUNotifier node for iommu mr=%s
> smmuv3_inv_notifiers_iova(const char *name, int asid, int vmid, uint64_t iova, uint8_t tg, uint64_t num_pages, int stage) "iommu mr=%s asid=%d vmid=%d iova=0x%"PRIx64" tg=%d num_pages=0x%"PRIx64" stage=%d"
> smmu_reset_exit(void) ""
>
> +#smmuv3-accel.c
> +smmuv3_accel_set_iommu_device(int devfn, uint32_t sid) "devfn=0x%x (sid=0x%x)"
> +smmuv3_accel_unset_iommu_device(int devfn, uint32_t sid) "devfn=0x%x (sid=0x%x)"
> +
> # strongarm.c
> strongarm_uart_update_parameters(const char *label, int speed, char parity, int data_bits, int stop_bits) "%s speed=%d parity=%c data=%d stop=%d"
> strongarm_ssp_read_underrun(void) "SSP rx underrun"
> diff --git a/include/hw/arm/smmuv3.h b/include/hw/arm/smmuv3.h
> index bb7076286b..5f3e9089a7 100644
> --- a/include/hw/arm/smmuv3.h
> +++ b/include/hw/arm/smmuv3.h
> @@ -66,6 +66,7 @@ struct SMMUv3State {
>
> /* SMMU has HW accelerator support for nested S1 + s2 */
> bool accel;
> + struct SMMUv3AccelState *s_accel;
> };
>
> typedef enum {
^ permalink raw reply [flat|nested] 118+ messages in thread
* [PATCH v4 09/27] hw/arm/smmuv3-accel: Support nested STE install/uninstall support
2025-09-29 13:36 [PATCH v4 00/27] hw/arm/virt: Add support for user-creatable accelerated SMMUv3 Shameer Kolothum
` (7 preceding siblings ...)
2025-09-29 13:36 ` [PATCH v4 08/27] hw/arm/smmuv3-accel: Add set/unset_iommu_device callback Shameer Kolothum
@ 2025-09-29 13:36 ` Shameer Kolothum
2025-09-29 16:41 ` Jonathan Cameron via
2025-10-02 10:04 ` Eric Auger
2025-09-29 13:36 ` [PATCH v4 10/27] hw/arm/smmuv3-accel: Allocate a vDEVICE object for device Shameer Kolothum
` (18 subsequent siblings)
27 siblings, 2 replies; 118+ messages in thread
From: Shameer Kolothum @ 2025-09-29 13:36 UTC (permalink / raw)
To: qemu-arm, qemu-devel
Cc: eric.auger, peter.maydell, jgg, nicolinc, ddutile, berrange,
nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu,
shameerkolothum
From: Nicolin Chen <nicolinc@nvidia.com>
Allocate an S1 HWPT for the guest S1 stage and attach it to the
pass-through VFIO device. This is invoked when the guest issues
SMMU_CMD_CFGI_STE/STE_RANGE.
While at it, we are also exporting both smmu_find_ste() and
smmuv3_flush_config() from smmuv3.c for use here.
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
hw/arm/smmuv3-accel.c | 164 +++++++++++++++++++++++++++++++++++++++
hw/arm/smmuv3-accel.h | 22 ++++++
hw/arm/smmuv3-internal.h | 3 +
hw/arm/smmuv3.c | 18 ++++-
hw/arm/trace-events | 1 +
5 files changed, 205 insertions(+), 3 deletions(-)
diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
index 81fa738f6f..5c3825cecd 100644
--- a/hw/arm/smmuv3-accel.c
+++ b/hw/arm/smmuv3-accel.c
@@ -17,10 +17,174 @@
#include "hw/vfio/pci.h"
#include "smmuv3-accel.h"
+#include "smmuv3-internal.h"
#define SMMU_STE_VALID (1ULL << 0)
#define SMMU_STE_CFG_BYPASS (1ULL << 3)
+#define STE0_V MAKE_64BIT_MASK(0, 1)
+#define STE0_CONFIG MAKE_64BIT_MASK(1, 3)
+#define STE0_S1FMT MAKE_64BIT_MASK(4, 2)
+#define STE0_CTXPTR MAKE_64BIT_MASK(6, 50)
+#define STE0_S1CDMAX MAKE_64BIT_MASK(59, 5)
+#define STE0_MASK (STE0_S1CDMAX | STE0_CTXPTR | STE0_S1FMT | STE0_CONFIG | \
+ STE0_V)
+
+#define STE1_S1DSS MAKE_64BIT_MASK(0, 2)
+#define STE1_S1CIR MAKE_64BIT_MASK(2, 2)
+#define STE1_S1COR MAKE_64BIT_MASK(4, 2)
+#define STE1_S1CSH MAKE_64BIT_MASK(6, 2)
+#define STE1_S1STALLD MAKE_64BIT_MASK(27, 1)
+#define STE1_ETS MAKE_64BIT_MASK(28, 2)
+#define STE1_MASK (STE1_ETS | STE1_S1STALLD | STE1_S1CSH | STE1_S1COR | \
+ STE1_S1CIR | STE1_S1DSS)
+
+static bool
+smmuv3_accel_dev_uninstall_nested_ste(SMMUv3AccelDevice *accel_dev, bool abort,
+ Error **errp)
+{
+ HostIOMMUDeviceIOMMUFD *idev = accel_dev->idev;
+ SMMUS1Hwpt *s1_hwpt = accel_dev->s1_hwpt;
+ uint32_t hwpt_id;
+
+ if (!s1_hwpt || !accel_dev->viommu) {
+ return true;
+ }
+
+ if (abort) {
+ hwpt_id = accel_dev->viommu->abort_hwpt_id;
+ } else {
+ hwpt_id = accel_dev->viommu->bypass_hwpt_id;
+ }
+
+ if (!host_iommu_device_iommufd_attach_hwpt(idev, hwpt_id, errp)) {
+ return false;
+ }
+
+ iommufd_backend_free_id(s1_hwpt->iommufd, s1_hwpt->hwpt_id);
+ accel_dev->s1_hwpt = NULL;
+ g_free(s1_hwpt);
+ return true;
+}
+
+static bool
+smmuv3_accel_dev_install_nested_ste(SMMUv3AccelDevice *accel_dev,
+ uint32_t data_type, uint32_t data_len,
+ void *data, Error **errp)
+{
+ SMMUViommu *viommu = accel_dev->viommu;
+ SMMUS1Hwpt *s1_hwpt = accel_dev->s1_hwpt;
+ HostIOMMUDeviceIOMMUFD *idev = accel_dev->idev;
+ uint32_t flags = 0;
+
+ if (!idev || !viommu) {
+ error_setg(errp, "Device 0x%x has no associated IOMMU dev or vIOMMU",
+ smmu_get_sid(&accel_dev->sdev));
+ return false;
+ }
+
+ if (s1_hwpt) {
+ if (!smmuv3_accel_dev_uninstall_nested_ste(accel_dev, true, errp)) {
+ return false;
+ }
+ }
+
+ s1_hwpt = g_new0(SMMUS1Hwpt, 1);
+ s1_hwpt->iommufd = idev->iommufd;
+ if (!iommufd_backend_alloc_hwpt(idev->iommufd, idev->devid,
+ viommu->core.viommu_id, flags, data_type,
+ data_len, data, &s1_hwpt->hwpt_id, errp)) {
+ return false;
+ }
+
+ if (!host_iommu_device_iommufd_attach_hwpt(idev, s1_hwpt->hwpt_id, errp)) {
+ iommufd_backend_free_id(idev->iommufd, s1_hwpt->hwpt_id);
+ return false;
+ }
+ accel_dev->s1_hwpt = s1_hwpt;
+ return true;
+}
+
+bool
+smmuv3_accel_install_nested_ste(SMMUv3State *s, SMMUDevice *sdev, int sid,
+ Error **errp)
+{
+ SMMUv3AccelDevice *accel_dev;
+ SMMUEventInfo event = {.type = SMMU_EVT_NONE, .sid = sid,
+ .inval_ste_allowed = true};
+ struct iommu_hwpt_arm_smmuv3 nested_data = {};
+ uint64_t ste_0, ste_1;
+ uint32_t config;
+ STE ste;
+ int ret;
+
+ if (!s->accel) {
+ return true;
+ }
+
+ accel_dev = container_of(sdev, SMMUv3AccelDevice, sdev);
+ if (!accel_dev->viommu) {
+ return true;
+ }
+
+ ret = smmu_find_ste(sdev->smmu, sid, &ste, &event);
+ if (ret) {
+ error_setg(errp, "Failed to find STE for Device 0x%x", sid);
+ return true;
+ }
+
+ config = STE_CONFIG(&ste);
+ if (!STE_VALID(&ste) || !STE_CFG_S1_ENABLED(config)) {
+ if (!smmuv3_accel_dev_uninstall_nested_ste(accel_dev,
+ STE_CFG_ABORT(config),
+ errp)) {
+ return false;
+ }
+ smmuv3_flush_config(sdev);
+ return true;
+ }
+
+ ste_0 = (uint64_t)ste.word[0] | (uint64_t)ste.word[1] << 32;
+ ste_1 = (uint64_t)ste.word[2] | (uint64_t)ste.word[3] << 32;
+ nested_data.ste[0] = cpu_to_le64(ste_0 & STE0_MASK);
+ nested_data.ste[1] = cpu_to_le64(ste_1 & STE1_MASK);
+
+ if (!smmuv3_accel_dev_install_nested_ste(accel_dev,
+ IOMMU_HWPT_DATA_ARM_SMMUV3,
+ sizeof(nested_data),
+ &nested_data, errp)) {
+ error_setg(errp, "Unable to install nested STE=%16LX:%16LX, sid=0x%x,"
+ "ret=%d", nested_data.ste[1], nested_data.ste[0], sid, ret);
+ return false;
+ }
+ trace_smmuv3_accel_install_nested_ste(sid, nested_data.ste[1],
+ nested_data.ste[0]);
+ return true;
+}
+
+bool smmuv3_accel_install_nested_ste_range(SMMUv3State *s, SMMUSIDRange *range,
+ Error **errp)
+{
+ SMMUv3AccelState *s_accel = s->s_accel;
+ SMMUv3AccelDevice *accel_dev;
+
+ if (!s_accel || !s_accel->viommu) {
+ return true;
+ }
+
+ QLIST_FOREACH(accel_dev, &s_accel->viommu->device_list, next) {
+ uint32_t sid = smmu_get_sid(&accel_dev->sdev);
+
+ if (sid >= range->start && sid <= range->end) {
+ if (!smmuv3_accel_install_nested_ste(s, &accel_dev->sdev,
+ sid, errp)) {
+ return false;
+ }
+ }
+ }
+ return true;
+}
+
static SMMUv3AccelDevice *smmuv3_accel_get_dev(SMMUState *bs, SMMUPciBus *sbus,
PCIBus *bus, int devfn)
{
diff --git a/hw/arm/smmuv3-accel.h b/hw/arm/smmuv3-accel.h
index 3c8506d1e6..f631443b09 100644
--- a/hw/arm/smmuv3-accel.h
+++ b/hw/arm/smmuv3-accel.h
@@ -22,9 +22,15 @@ typedef struct SMMUViommu {
QLIST_HEAD(, SMMUv3AccelDevice) device_list;
} SMMUViommu;
+typedef struct SMMUS1Hwpt {
+ IOMMUFDBackend *iommufd;
+ uint32_t hwpt_id;
+} SMMUS1Hwpt;
+
typedef struct SMMUv3AccelDevice {
SMMUDevice sdev;
HostIOMMUDeviceIOMMUFD *idev;
+ SMMUS1Hwpt *s1_hwpt;
SMMUViommu *viommu;
QLIST_ENTRY(SMMUv3AccelDevice) next;
} SMMUv3AccelDevice;
@@ -35,10 +41,26 @@ typedef struct SMMUv3AccelState {
#ifdef CONFIG_ARM_SMMUV3_ACCEL
void smmuv3_accel_init(SMMUv3State *s);
+bool smmuv3_accel_install_nested_ste(SMMUv3State *s, SMMUDevice *sdev, int sid,
+ Error **errp);
+bool smmuv3_accel_install_nested_ste_range(SMMUv3State *s, SMMUSIDRange *range,
+ Error **errp);
#else
static inline void smmuv3_accel_init(SMMUv3State *s)
{
}
+static inline bool
+smmuv3_accel_install_nested_ste(SMMUv3State *s, SMMUDevice *sdev, int sid,
+ Error **errp)
+{
+ return true;
+}
+static inline bool
+smmuv3_accel_install_nested_ste_range(SMMUv3State *s, SMMUSIDRange *range,
+ Error **errp)
+{
+ return true;
+}
#endif
#endif /* HW_ARM_SMMUV3_ACCEL_H */
diff --git a/hw/arm/smmuv3-internal.h b/hw/arm/smmuv3-internal.h
index b6b7399347..b0dfa9465c 100644
--- a/hw/arm/smmuv3-internal.h
+++ b/hw/arm/smmuv3-internal.h
@@ -547,6 +547,9 @@ typedef struct CD {
uint32_t word[16];
} CD;
+int smmu_find_ste(SMMUv3State *s, uint32_t sid, STE *ste, SMMUEventInfo *event);
+void smmuv3_flush_config(SMMUDevice *sdev);
+
/* STE fields */
#define STE_VALID(x) extract32((x)->word[0], 0, 1)
diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index ef991cb7d8..1fd8aaa0c7 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -630,8 +630,7 @@ bad_ste:
* Supports linear and 2-level stream table
* Return 0 on success, -EINVAL otherwise
*/
-static int smmu_find_ste(SMMUv3State *s, uint32_t sid, STE *ste,
- SMMUEventInfo *event)
+int smmu_find_ste(SMMUv3State *s, uint32_t sid, STE *ste, SMMUEventInfo *event)
{
dma_addr_t addr, strtab_base;
uint32_t log2size;
@@ -900,7 +899,7 @@ static SMMUTransCfg *smmuv3_get_config(SMMUDevice *sdev, SMMUEventInfo *event)
return cfg;
}
-static void smmuv3_flush_config(SMMUDevice *sdev)
+void smmuv3_flush_config(SMMUDevice *sdev)
{
SMMUv3State *s = sdev->smmu;
SMMUState *bc = &s->smmu_state;
@@ -1330,6 +1329,7 @@ static int smmuv3_cmdq_consume(SMMUv3State *s)
{
uint32_t sid = CMD_SID(&cmd);
SMMUDevice *sdev = smmu_find_sdev(bs, sid);
+ Error *local_err = NULL;
if (CMD_SSEC(&cmd)) {
cmd_error = SMMU_CERROR_ILL;
@@ -1341,6 +1341,11 @@ static int smmuv3_cmdq_consume(SMMUv3State *s)
}
trace_smmuv3_cmdq_cfgi_ste(sid);
+ if (!smmuv3_accel_install_nested_ste(s, sdev, sid, &local_err)) {
+ error_report_err(local_err);
+ cmd_error = SMMU_CERROR_ILL;
+ break;
+ }
smmuv3_flush_config(sdev);
break;
@@ -1350,6 +1355,7 @@ static int smmuv3_cmdq_consume(SMMUv3State *s)
uint32_t sid = CMD_SID(&cmd), mask;
uint8_t range = CMD_STE_RANGE(&cmd);
SMMUSIDRange sid_range;
+ Error *local_err = NULL;
if (CMD_SSEC(&cmd)) {
cmd_error = SMMU_CERROR_ILL;
@@ -1361,6 +1367,12 @@ static int smmuv3_cmdq_consume(SMMUv3State *s)
sid_range.end = sid_range.start + mask;
trace_smmuv3_cmdq_cfgi_ste_range(sid_range.start, sid_range.end);
+ if (!smmuv3_accel_install_nested_ste_range(s, &sid_range,
+ &local_err)) {
+ error_report_err(local_err);
+ cmd_error = SMMU_CERROR_ILL;
+ break;
+ }
smmu_configs_inv_sid_range(bs, sid_range);
break;
}
diff --git a/hw/arm/trace-events b/hw/arm/trace-events
index 86370d448a..3b1e9bf083 100644
--- a/hw/arm/trace-events
+++ b/hw/arm/trace-events
@@ -69,6 +69,7 @@ smmu_reset_exit(void) ""
#smmuv3-accel.c
smmuv3_accel_set_iommu_device(int devfn, uint32_t sid) "devfn=0x%x (sid=0x%x)"
smmuv3_accel_unset_iommu_device(int devfn, uint32_t sid) "devfn=0x%x (sid=0x%x)"
+smmuv3_accel_install_nested_ste(uint32_t sid, uint64_t ste_1, uint64_t ste_0) "sid=%d ste=%"PRIx64":%"PRIx64
# strongarm.c
strongarm_uart_update_parameters(const char *label, int speed, char parity, int data_bits, int stop_bits) "%s speed=%d parity=%c data=%d stop=%d"
--
2.43.0
^ permalink raw reply related [flat|nested] 118+ messages in thread
* Re: [PATCH v4 09/27] hw/arm/smmuv3-accel: Support nested STE install/uninstall support
2025-09-29 13:36 ` [PATCH v4 09/27] hw/arm/smmuv3-accel: Support nested STE install/uninstall support Shameer Kolothum
@ 2025-09-29 16:41 ` Jonathan Cameron via
2025-10-02 10:04 ` Eric Auger
1 sibling, 0 replies; 118+ messages in thread
From: Jonathan Cameron via @ 2025-09-29 16:41 UTC (permalink / raw)
To: Shameer Kolothum
Cc: qemu-arm, qemu-devel, eric.auger, peter.maydell, jgg, nicolinc,
ddutile, berrange, nathanc, mochs, smostafa, wangzhou1,
jiangkunkun, zhangfei.gao, zhenzhong.duan, yi.l.liu,
shameerkolothum
On Mon, 29 Sep 2025 14:36:25 +0100
Shameer Kolothum <skolothumtho@nvidia.com> wrote:
> From: Nicolin Chen <nicolinc@nvidia.com>
>
> Allocates a s1 HWPT for the Guest s1 stage and attaches that to the
S1
> pass-through vfio device. This will be invoked when Guest issues
> SMMU_CMD_CFGI_STE/STE_RANGE.
>
> While at it, we are also exporting both smmu_find_ste() and
> smmuv3_flush_config() from smmuv3.c for use here.
>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
Whilst I'm getting a bit out of my comfort zone for this review and don't
have time to dig into the details/specs, the code is in a good state, so
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [PATCH v4 09/27] hw/arm/smmuv3-accel: Support nested STE install/uninstall support
2025-09-29 13:36 ` [PATCH v4 09/27] hw/arm/smmuv3-accel: Support nested STE install/uninstall support Shameer Kolothum
2025-09-29 16:41 ` Jonathan Cameron via
@ 2025-10-02 10:04 ` Eric Auger
2025-10-02 12:08 ` Shameer Kolothum
1 sibling, 1 reply; 118+ messages in thread
From: Eric Auger @ 2025-10-02 10:04 UTC (permalink / raw)
To: Shameer Kolothum, qemu-arm, qemu-devel
Cc: peter.maydell, jgg, nicolinc, ddutile, berrange, nathanc, mochs,
smostafa, wangzhou1, jiangkunkun, jonathan.cameron, zhangfei.gao,
zhenzhong.duan, yi.l.liu, shameerkolothum
Hi Shameer,
On 9/29/25 3:36 PM, Shameer Kolothum wrote:
> From: Nicolin Chen <nicolinc@nvidia.com>
>
> Allocates a s1 HWPT for the Guest s1 stage and attaches that to the
> pass-through vfio device. This will be invoked when Guest issues
> SMMU_CMD_CFGI_STE/STE_RANGE.
On set, both the alloc and the attachment are done. On unset, you should
explain the gymnastics related to the config/abort hwpt. Those are S1
hwpts, right? I think this should be reflected in the name to make it
clearer. In the previous patch I didn't really understand that.
> While at it, we are also exporting both smmu_find_ste() and
> smmuv3_flush_config() from smmuv3.c for use here.
>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
> ---
> hw/arm/smmuv3-accel.c | 164 +++++++++++++++++++++++++++++++++++++++
> hw/arm/smmuv3-accel.h | 22 ++++++
> hw/arm/smmuv3-internal.h | 3 +
> hw/arm/smmuv3.c | 18 ++++-
> hw/arm/trace-events | 1 +
> 5 files changed, 205 insertions(+), 3 deletions(-)
>
> diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
> index 81fa738f6f..5c3825cecd 100644
> --- a/hw/arm/smmuv3-accel.c
> +++ b/hw/arm/smmuv3-accel.c
> @@ -17,10 +17,174 @@
> #include "hw/vfio/pci.h"
>
> #include "smmuv3-accel.h"
> +#include "smmuv3-internal.h"
>
> #define SMMU_STE_VALID (1ULL << 0)
> #define SMMU_STE_CFG_BYPASS (1ULL << 3)
>
> +#define STE0_V MAKE_64BIT_MASK(0, 1)
> +#define STE0_CONFIG MAKE_64BIT_MASK(1, 3)
> +#define STE0_S1FMT MAKE_64BIT_MASK(4, 2)
> +#define STE0_CTXPTR MAKE_64BIT_MASK(6, 50)
> +#define STE0_S1CDMAX MAKE_64BIT_MASK(59, 5)
> +#define STE0_MASK (STE0_S1CDMAX | STE0_CTXPTR | STE0_S1FMT | STE0_CONFIG | \
> + STE0_V)
> +
> +#define STE1_S1DSS MAKE_64BIT_MASK(0, 2)
> +#define STE1_S1CIR MAKE_64BIT_MASK(2, 2)
> +#define STE1_S1COR MAKE_64BIT_MASK(4, 2)
> +#define STE1_S1CSH MAKE_64BIT_MASK(6, 2)
> +#define STE1_S1STALLD MAKE_64BIT_MASK(27, 1)
> +#define STE1_ETS MAKE_64BIT_MASK(28, 2)
this is EATS
> +#define STE1_MASK (STE1_ETS | STE1_S1STALLD | STE1_S1CSH | STE1_S1COR | \
> + STE1_S1CIR | STE1_S1DSS)
I would move all that stuff into smmuv3-internal.h too.
> +
> +static bool
> +smmuv3_accel_dev_uninstall_nested_ste(SMMUv3AccelDevice *accel_dev, bool abort,
> + Error **errp)
> +{
> + HostIOMMUDeviceIOMMUFD *idev = accel_dev->idev;
> + SMMUS1Hwpt *s1_hwpt = accel_dev->s1_hwpt;
> + uint32_t hwpt_id;
> +
> + if (!s1_hwpt || !accel_dev->viommu) {
> + return true;
> + }
> +
> + if (abort) {
> + hwpt_id = accel_dev->viommu->abort_hwpt_id;
> + } else {
> + hwpt_id = accel_dev->viommu->bypass_hwpt_id;
> + }
> +
> + if (!host_iommu_device_iommufd_attach_hwpt(idev, hwpt_id, errp)) {
> + return false;
> + }
I think you should add a trace point for uninstall and specify which hwpt
we use (abort or bypass). This might be useful for debugging.
> +
> + iommufd_backend_free_id(s1_hwpt->iommufd, s1_hwpt->hwpt_id);
> + accel_dev->s1_hwpt = NULL;
> + g_free(s1_hwpt);
> + return true;
> +}
> +
> +static bool
> +smmuv3_accel_dev_install_nested_ste(SMMUv3AccelDevice *accel_dev,
> + uint32_t data_type, uint32_t data_len,
> + void *data, Error **errp)
> +{
> + SMMUViommu *viommu = accel_dev->viommu;
> + SMMUS1Hwpt *s1_hwpt = accel_dev->s1_hwpt;
> + HostIOMMUDeviceIOMMUFD *idev = accel_dev->idev;
> + uint32_t flags = 0;
> +
> + if (!idev || !viommu) {
> + error_setg(errp, "Device 0x%x has no associated IOMMU dev or vIOMMU",
> + smmu_get_sid(&accel_dev->sdev));
> + return false;
> + }
> +
> + if (s1_hwpt) {
> + if (!smmuv3_accel_dev_uninstall_nested_ste(accel_dev, true, errp)) {
> + return false;
> + }
> + }
> +
> + s1_hwpt = g_new0(SMMUS1Hwpt, 1);
> + s1_hwpt->iommufd = idev->iommufd;
> + if (!iommufd_backend_alloc_hwpt(idev->iommufd, idev->devid,
> + viommu->core.viommu_id, flags, data_type,
> + data_len, data, &s1_hwpt->hwpt_id, errp)) {
> + return false;
> + }
> +
> + if (!host_iommu_device_iommufd_attach_hwpt(idev, s1_hwpt->hwpt_id, errp)) {
> + iommufd_backend_free_id(idev->iommufd, s1_hwpt->hwpt_id);
> + return false;
> + }
> + accel_dev->s1_hwpt = s1_hwpt;
> + return true;
> +}
> +
> +bool
> +smmuv3_accel_install_nested_ste(SMMUv3State *s, SMMUDevice *sdev, int sid,
> + Error **errp)
> +{
> + SMMUv3AccelDevice *accel_dev;
> + SMMUEventInfo event = {.type = SMMU_EVT_NONE, .sid = sid,
> + .inval_ste_allowed = true};
> + struct iommu_hwpt_arm_smmuv3 nested_data = {};
> + uint64_t ste_0, ste_1;
> + uint32_t config;
> + STE ste;
> + int ret;
> +
> + if (!s->accel) {
> + return true;
> + }
> +
> + accel_dev = container_of(sdev, SMMUv3AccelDevice, sdev);
> + if (!accel_dev->viommu) {
> + return true;
> + }
> +
> + ret = smmu_find_ste(sdev->smmu, sid, &ste, &event);
> + if (ret) {
> + error_setg(errp, "Failed to find STE for Device 0x%x", sid);
> + return true;
> + }
> +
> + config = STE_CONFIG(&ste);
> + if (!STE_VALID(&ste) || !STE_CFG_S1_ENABLED(config)) {
> + if (!smmuv3_accel_dev_uninstall_nested_ste(accel_dev,
> + STE_CFG_ABORT(config),
> + errp)) {
> + return false;
> + }
> + smmuv3_flush_config(sdev);
> + return true;
> + }
> +
> + ste_0 = (uint64_t)ste.word[0] | (uint64_t)ste.word[1] << 32;
> + ste_1 = (uint64_t)ste.word[2] | (uint64_t)ste.word[3] << 32;
> + nested_data.ste[0] = cpu_to_le64(ste_0 & STE0_MASK);
> + nested_data.ste[1] = cpu_to_le64(ste_1 & STE1_MASK);
> +
> + if (!smmuv3_accel_dev_install_nested_ste(accel_dev,
> + IOMMU_HWPT_DATA_ARM_SMMUV3,
> + sizeof(nested_data),
> + &nested_data, errp)) {
> + error_setg(errp, "Unable to install nested STE=%16LX:%16LX, sid=0x%x,"
don't you need to use PRIx64 instead?
also I suggest to put the SID first.
> + "ret=%d", nested_data.ste[1], nested_data.ste[0], sid, ret);
> + return false;
> + }
> + trace_smmuv3_accel_install_nested_ste(sid, nested_data.ste[1],
> + nested_data.ste[0]);
> + return true;
> +}
> +
> +bool smmuv3_accel_install_nested_ste_range(SMMUv3State *s, SMMUSIDRange *range,
> + Error **errp)
> +{
> + SMMUv3AccelState *s_accel = s->s_accel;
> + SMMUv3AccelDevice *accel_dev;
> +
> + if (!s_accel || !s_accel->viommu) {
> + return true;
> + }
> +
> + QLIST_FOREACH(accel_dev, &s_accel->viommu->device_list, next) {
> + uint32_t sid = smmu_get_sid(&accel_dev->sdev);
> +
> + if (sid >= range->start && sid <= range->end) {
> + if (!smmuv3_accel_install_nested_ste(s, &accel_dev->sdev,
> + sid, errp)) {
> + return false;
> + }
> + }
> + }
> + return true;
> +}
> +
> static SMMUv3AccelDevice *smmuv3_accel_get_dev(SMMUState *bs, SMMUPciBus *sbus,
> PCIBus *bus, int devfn)
> {
> diff --git a/hw/arm/smmuv3-accel.h b/hw/arm/smmuv3-accel.h
> index 3c8506d1e6..f631443b09 100644
> --- a/hw/arm/smmuv3-accel.h
> +++ b/hw/arm/smmuv3-accel.h
> @@ -22,9 +22,15 @@ typedef struct SMMUViommu {
> QLIST_HEAD(, SMMUv3AccelDevice) device_list;
> } SMMUViommu;
>
> +typedef struct SMMUS1Hwpt {
> + IOMMUFDBackend *iommufd;
> + uint32_t hwpt_id;
> +} SMMUS1Hwpt;
> +
> typedef struct SMMUv3AccelDevice {
> SMMUDevice sdev;
> HostIOMMUDeviceIOMMUFD *idev;
> + SMMUS1Hwpt *s1_hwpt;
> SMMUViommu *viommu;
> QLIST_ENTRY(SMMUv3AccelDevice) next;
> } SMMUv3AccelDevice;
> @@ -35,10 +41,26 @@ typedef struct SMMUv3AccelState {
>
> #ifdef CONFIG_ARM_SMMUV3_ACCEL
> void smmuv3_accel_init(SMMUv3State *s);
> +bool smmuv3_accel_install_nested_ste(SMMUv3State *s, SMMUDevice *sdev, int sid,
> + Error **errp);
> +bool smmuv3_accel_install_nested_ste_range(SMMUv3State *s, SMMUSIDRange *range,
> + Error **errp);
> #else
> static inline void smmuv3_accel_init(SMMUv3State *s)
> {
> }
> +static inline bool
> +smmuv3_accel_install_nested_ste(SMMUv3State *s, SMMUDevice *sdev, int sid,
> + Error **errp)
> +{
> + return true;
> +}
> +static inline bool
> +smmuv3_accel_install_nested_ste_range(SMMUv3State *s, SMMUSIDRange *range,
> + Error **errp)
> +{
> + return true;
> +}
> #endif
>
> #endif /* HW_ARM_SMMUV3_ACCEL_H */
> diff --git a/hw/arm/smmuv3-internal.h b/hw/arm/smmuv3-internal.h
> index b6b7399347..b0dfa9465c 100644
> --- a/hw/arm/smmuv3-internal.h
> +++ b/hw/arm/smmuv3-internal.h
> @@ -547,6 +547,9 @@ typedef struct CD {
> uint32_t word[16];
> } CD;
>
> +int smmu_find_ste(SMMUv3State *s, uint32_t sid, STE *ste, SMMUEventInfo *event);
> +void smmuv3_flush_config(SMMUDevice *sdev);
> +
> /* STE fields */
>
> #define STE_VALID(x) extract32((x)->word[0], 0, 1)
> diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
> index ef991cb7d8..1fd8aaa0c7 100644
> --- a/hw/arm/smmuv3.c
> +++ b/hw/arm/smmuv3.c
> @@ -630,8 +630,7 @@ bad_ste:
> * Supports linear and 2-level stream table
> * Return 0 on success, -EINVAL otherwise
> */
> -static int smmu_find_ste(SMMUv3State *s, uint32_t sid, STE *ste,
> - SMMUEventInfo *event)
> +int smmu_find_ste(SMMUv3State *s, uint32_t sid, STE *ste, SMMUEventInfo *event)
> {
> dma_addr_t addr, strtab_base;
> uint32_t log2size;
> @@ -900,7 +899,7 @@ static SMMUTransCfg *smmuv3_get_config(SMMUDevice *sdev, SMMUEventInfo *event)
> return cfg;
> }
>
> -static void smmuv3_flush_config(SMMUDevice *sdev)
> +void smmuv3_flush_config(SMMUDevice *sdev)
> {
> SMMUv3State *s = sdev->smmu;
> SMMUState *bc = &s->smmu_state;
> @@ -1330,6 +1329,7 @@ static int smmuv3_cmdq_consume(SMMUv3State *s)
> {
> uint32_t sid = CMD_SID(&cmd);
> SMMUDevice *sdev = smmu_find_sdev(bs, sid);
> + Error *local_err = NULL;
>
> if (CMD_SSEC(&cmd)) {
> cmd_error = SMMU_CERROR_ILL;
> @@ -1341,6 +1341,11 @@ static int smmuv3_cmdq_consume(SMMUv3State *s)
> }
>
> trace_smmuv3_cmdq_cfgi_ste(sid);
> + if (!smmuv3_accel_install_nested_ste(s, sdev, sid, &local_err)) {
> + error_report_err(local_err);
> + cmd_error = SMMU_CERROR_ILL;
> + break;
> + }
> smmuv3_flush_config(sdev);
>
> break;
> @@ -1350,6 +1355,7 @@ static int smmuv3_cmdq_consume(SMMUv3State *s)
> uint32_t sid = CMD_SID(&cmd), mask;
> uint8_t range = CMD_STE_RANGE(&cmd);
> SMMUSIDRange sid_range;
> + Error *local_err = NULL;
>
> if (CMD_SSEC(&cmd)) {
> cmd_error = SMMU_CERROR_ILL;
> @@ -1361,6 +1367,12 @@ static int smmuv3_cmdq_consume(SMMUv3State *s)
> sid_range.end = sid_range.start + mask;
>
> trace_smmuv3_cmdq_cfgi_ste_range(sid_range.start, sid_range.end);
> + if (!smmuv3_accel_install_nested_ste_range(s, &sid_range,
> + &local_err)) {
> + error_report_err(local_err);
> + cmd_error = SMMU_CERROR_ILL;
> + break;
> + }
> smmu_configs_inv_sid_range(bs, sid_range);
> break;
> }
> diff --git a/hw/arm/trace-events b/hw/arm/trace-events
> index 86370d448a..3b1e9bf083 100644
> --- a/hw/arm/trace-events
> +++ b/hw/arm/trace-events
> @@ -69,6 +69,7 @@ smmu_reset_exit(void) ""
> #smmuv3-accel.c
> smmuv3_accel_set_iommu_device(int devfn, uint32_t sid) "devfn=0x%x (sid=0x%x)"
> smmuv3_accel_unset_iommu_device(int devfn, uint32_t sid) "devfn=0x%x (sid=0x%x)"
> +smmuv3_accel_install_nested_ste(uint32_t sid, uint64_t ste_1, uint64_t ste_0) "sid=%d ste=%"PRIx64":%"PRIx64
>
> # strongarm.c
> strongarm_uart_update_parameters(const char *label, int speed, char parity, int data_bits, int stop_bits) "%s speed=%d parity=%c data=%d stop=%d"
Thanks
Eric
^ permalink raw reply [flat|nested] 118+ messages in thread
* RE: [PATCH v4 09/27] hw/arm/smmuv3-accel: Support nested STE install/uninstall support
2025-10-02 10:04 ` Eric Auger
@ 2025-10-02 12:08 ` Shameer Kolothum
2025-10-02 12:27 ` Eric Auger
0 siblings, 1 reply; 118+ messages in thread
From: Shameer Kolothum @ 2025-10-02 12:08 UTC (permalink / raw)
To: eric.auger@redhat.com, qemu-arm@nongnu.org, qemu-devel@nongnu.org
Cc: peter.maydell@linaro.org, Jason Gunthorpe, Nicolin Chen,
ddutile@redhat.com, berrange@redhat.com, Nathan Chen, Matt Ochs,
smostafa@google.com, wangzhou1@hisilicon.com,
jiangkunkun@huawei.com, jonathan.cameron@huawei.com,
zhangfei.gao@linaro.org, zhenzhong.duan@intel.com,
yi.l.liu@intel.com, shameerkolothum@gmail.com
> -----Original Message-----
> From: Eric Auger <eric.auger@redhat.com>
> Sent: 02 October 2025 11:05
> To: Shameer Kolothum <skolothumtho@nvidia.com>; qemu-
> arm@nongnu.org; qemu-devel@nongnu.org
> Cc: peter.maydell@linaro.org; Jason Gunthorpe <jgg@nvidia.com>; Nicolin
> Chen <nicolinc@nvidia.com>; ddutile@redhat.com; berrange@redhat.com;
> Nathan Chen <nathanc@nvidia.com>; Matt Ochs <mochs@nvidia.com>;
> smostafa@google.com; wangzhou1@hisilicon.com;
> jiangkunkun@huawei.com; jonathan.cameron@huawei.com;
> zhangfei.gao@linaro.org; zhenzhong.duan@intel.com; yi.l.liu@intel.com;
> shameerkolothum@gmail.com
> Subject: Re: [PATCH v4 09/27] hw/arm/smmuv3-accel: Support nested STE
> install/uninstall support
>
> External email: Use caution opening links or attachments
>
>
> Hi Shameer,
>
> On 9/29/25 3:36 PM, Shameer Kolothum wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> >
> > Allocates a s1 HWPT for the Guest s1 stage and attaches that to the
> > pass-through vfio device. This will be invoked when Guest issues
> > SMMU_CMD_CFGI_STE/STE_RANGE.
On set, both alloc + attachment are done. On unset you shall explain the
gymnastics related to the config/abort hwpt. Those are S1 hwpts, right? I think
this shall be reflected in the name to make it clearer? In the previous patch
I didn't really understand that.
Ok. There are three HWPTs in play here.
BYPASS HWPT
ABORT HWPT
S1 HWPT --> This is when Guest has a valid S1 (STE_VALID && STE_CFG_S1_ENABLED)
In the previous patch we allocate a common BYPASS and ABORT HWPT for all devices
in a vIOMMU. We reuse those here in this patch and attach one if the Guest
requests an S1 bypass or abort case.
The S1 HWPT is allocated as and when the Guest has a valid STE with a context
descriptor, and we use that for attachment.
Whether we can call them S1 HWPT only, I am not sure, because I think that
during the alloc() call the kernel allocates a nested HWPT (IOMMU_DOMAIN_NESTED)
which uses the Guest S1 nested on an S2 HWPT.
Anyway, I will rephrase the comments and variable names to make it clear.
Thanks,
Shameer
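The three-way HWPT selection described above can be sketched as follows. This is an illustrative sketch only: the enum and the `select_hwpt()` helper are hypothetical and stand in for the actual QEMU/kernel attach logic, which reuses the shared BYPASS/ABORT HWPTs allocated per vIOMMU and allocates a per-device nested S1 HWPT on demand.

```c
#include <stdbool.h>

/* Hypothetical tags for the three HWPTs discussed above. */
enum hwpt_kind {
    HWPT_BYPASS,  /* shared per-vIOMMU: STE invalid/bypass, no S1 */
    HWPT_ABORT,   /* shared per-vIOMMU: STE requests abort */
    HWPT_S1,      /* per-device: guest S1 nested on the host S2 HWPT */
};

/*
 * A valid STE with Stage-1 enabled gets the per-device nested S1 HWPT;
 * otherwise the device is attached to the shared ABORT or BYPASS HWPT.
 * The boolean parameters mirror STE_VALID, STE_CFG_S1_ENABLED and
 * STE_CFG_ABORT checks, but are simplified here for illustration.
 */
static enum hwpt_kind select_hwpt(bool ste_valid, bool s1_enabled,
                                  bool abort_cfg)
{
    if (ste_valid && s1_enabled) {
        return HWPT_S1;
    }
    return abort_cfg ? HWPT_ABORT : HWPT_BYPASS;
}
```

The sketch only captures which HWPT a device ends up attached to, not the alloc/free ordering handled by the install/uninstall helpers in the patch.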
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [PATCH v4 09/27] hw/arm/smmuv3-accel: Support nested STE install/uninstall support
2025-10-02 12:08 ` Shameer Kolothum
@ 2025-10-02 12:27 ` Eric Auger
0 siblings, 0 replies; 118+ messages in thread
From: Eric Auger @ 2025-10-02 12:27 UTC (permalink / raw)
To: Shameer Kolothum, qemu-arm@nongnu.org, qemu-devel@nongnu.org
Cc: peter.maydell@linaro.org, Jason Gunthorpe, Nicolin Chen,
ddutile@redhat.com, berrange@redhat.com, Nathan Chen, Matt Ochs,
smostafa@google.com, wangzhou1@hisilicon.com,
jiangkunkun@huawei.com, jonathan.cameron@huawei.com,
zhangfei.gao@linaro.org, zhenzhong.duan@intel.com,
yi.l.liu@intel.com, shameerkolothum@gmail.com
Hi Shameer,
On 10/2/25 2:08 PM, Shameer Kolothum wrote:
>
>> -----Original Message-----
>> From: Eric Auger <eric.auger@redhat.com>
>> Sent: 02 October 2025 11:05
>> To: Shameer Kolothum <skolothumtho@nvidia.com>; qemu-
>> arm@nongnu.org; qemu-devel@nongnu.org
>> Cc: peter.maydell@linaro.org; Jason Gunthorpe <jgg@nvidia.com>; Nicolin
>> Chen <nicolinc@nvidia.com>; ddutile@redhat.com; berrange@redhat.com;
>> Nathan Chen <nathanc@nvidia.com>; Matt Ochs <mochs@nvidia.com>;
>> smostafa@google.com; wangzhou1@hisilicon.com;
>> jiangkunkun@huawei.com; jonathan.cameron@huawei.com;
>> zhangfei.gao@linaro.org; zhenzhong.duan@intel.com; yi.l.liu@intel.com;
>> shameerkolothum@gmail.com
>> Subject: Re: [PATCH v4 09/27] hw/arm/smmuv3-accel: Support nested STE
>> install/uninstall support
>>
>> External email: Use caution opening links or attachments
>>
>>
>> Hi Shameer,
>>
>> On 9/29/25 3:36 PM, Shameer Kolothum wrote:
>>> From: Nicolin Chen <nicolinc@nvidia.com>
>>>
>>> Allocates a s1 HWPT for the Guest s1 stage and attaches that to the
>>> pass-through vfio device. This will be invoked when Guest issues
>>> SMMU_CMD_CFGI_STE/STE_RANGE.
>> ON set both alloc + attachment are done. On unset you shall explain the
>> gym related to config/abort hwpt. Those are S1 hwpt, right? I think this
>> shall be reflected in the name to make it clearer? In the previous patch
>> I didn't really understand that.
> Ok. There are three HWPTs in play here.
>
> BYPASS HWPT
> ABORT HWPT
> S1 HWPT --> This is when Guest has a valid S1 (STE_VALID && STE_CFG_S1_ENABLED)
>
> In previous patch we allocate a common BYPASS and ABORT HWPT for all devices
> in a vIOMMU. We reuse that here in this patch and attach if Guest request a S1
> bypass or abort case.
>
> The S1 HWPT is allocated as and when the Guest has a valid STE with context
> descriptor and use that for attachment.
>
> Whether we can call them S1 HWPT only, I am not sure. Because, I think,
> during alloc() call the kernel allocates a Nested HWPT(IOMMU_DOMAIN_NESTED)
> which uses a Guest S1 nested on a S2 HWPT.
the role of the BYPASS HWPT and the ABORT HWPT must be better explained, I think.
Same in the previous patch. I understand they abstract stage 1 in abort or
bypass. I think we shall better explain what HWPT hierarchy we are
putting in place, referring to the kernel uapi (and not the kernel internal
implementation).
Thanks
Eric
>
> Anyway, I will rephrase the comments and variable names to make it clear.
>
> Thanks,
> Shameer
^ permalink raw reply [flat|nested] 118+ messages in thread
* [PATCH v4 10/27] hw/arm/smmuv3-accel: Allocate a vDEVICE object for device
2025-09-29 13:36 [PATCH v4 00/27] hw/arm/virt: Add support for user-creatable accelerated SMMUv3 Shameer Kolothum
` (8 preceding siblings ...)
2025-09-29 13:36 ` [PATCH v4 09/27] hw/arm/smmuv3-accel: Support nested STE install/uninstall support Shameer Kolothum
@ 2025-09-29 13:36 ` Shameer Kolothum
2025-09-29 16:42 ` Jonathan Cameron via
2025-10-17 13:08 ` Eric Auger
2025-09-29 13:36 ` [PATCH v4 11/27] hw/pci/pci: Introduce optional get_msi_address_space() callback Shameer Kolothum
` (17 subsequent siblings)
27 siblings, 2 replies; 118+ messages in thread
From: Shameer Kolothum @ 2025-09-29 13:36 UTC (permalink / raw)
To: qemu-arm, qemu-devel
Cc: eric.auger, peter.maydell, jgg, nicolinc, ddutile, berrange,
nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu,
shameerkolothum
From: Nicolin Chen <nicolinc@nvidia.com>
Allocate and associate a vDEVICE object for the Guest device with the
vIOMMU. This will help the host kernel to make a virtual SID --> physical
SID mapping. Since we pass the raw invalidation commands(eg: CMD_CFGI_CD)
from Guest directly to host kernel, this provides a way to retrieve the
correct physical SID.
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
hw/arm/smmuv3-accel.c | 41 +++++++++++++++++++++++++++++++++++++++++
hw/arm/smmuv3-accel.h | 1 +
2 files changed, 42 insertions(+)
diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
index 5c3825cecd..790887ac31 100644
--- a/hw/arm/smmuv3-accel.c
+++ b/hw/arm/smmuv3-accel.c
@@ -39,6 +39,35 @@
#define STE1_MASK (STE1_ETS | STE1_S1STALLD | STE1_S1CSH | STE1_S1COR | \
STE1_S1CIR | STE1_S1DSS)
+static bool
+smmuv3_accel_alloc_vdev(SMMUv3AccelDevice *accel_dev, int sid, Error **errp)
+{
+ SMMUViommu *viommu = accel_dev->viommu;
+ IOMMUFDVdev *vdev;
+ uint32_t vdev_id;
+
+ if (!accel_dev->idev || accel_dev->vdev) {
+ return true;
+ }
+
+ if (!iommufd_backend_alloc_vdev(viommu->iommufd, accel_dev->idev->devid,
+ viommu->core.viommu_id, sid,
+ &vdev_id, errp)) {
+ return false;
+ }
+ if (!host_iommu_device_iommufd_attach_hwpt(accel_dev->idev,
+ viommu->bypass_hwpt_id, errp)) {
+ iommufd_backend_free_id(viommu->iommufd, vdev_id);
+ return false;
+ }
+
+ vdev = g_new(IOMMUFDVdev, 1);
+ vdev->vdev_id = vdev_id;
+ vdev->dev_id = sid;
+ accel_dev->vdev = vdev;
+ return true;
+}
+
static bool
smmuv3_accel_dev_uninstall_nested_ste(SMMUv3AccelDevice *accel_dev, bool abort,
Error **errp)
@@ -127,6 +156,10 @@ smmuv3_accel_install_nested_ste(SMMUv3State *s, SMMUDevice *sdev, int sid,
return true;
}
+ if (!smmuv3_accel_alloc_vdev(accel_dev, sid, errp)) {
+ return false;
+ }
+
ret = smmu_find_ste(sdev->smmu, sid, &ste, &event);
if (ret) {
error_setg(errp, "Failed to find STE for Device 0x%x", sid);
@@ -311,6 +344,7 @@ static void smmuv3_accel_unset_iommu_device(PCIBus *bus, void *opaque,
SMMUPciBus *sbus = g_hash_table_lookup(bs->smmu_pcibus_by_busptr, bus);
SMMUv3AccelDevice *accel_dev;
SMMUViommu *viommu;
+ IOMMUFDVdev *vdev;
SMMUDevice *sdev;
uint16_t sid;
@@ -337,6 +371,13 @@ static void smmuv3_accel_unset_iommu_device(PCIBus *bus, void *opaque,
trace_smmuv3_accel_unset_iommu_device(devfn, sid);
viommu = s->s_accel->viommu;
+ vdev = accel_dev->vdev;
+ if (vdev) {
+ iommufd_backend_free_id(viommu->iommufd, vdev->vdev_id);
+ g_free(vdev);
+ accel_dev->vdev = NULL;
+ }
+
if (QLIST_EMPTY(&viommu->device_list)) {
iommufd_backend_free_id(viommu->iommufd, viommu->bypass_hwpt_id);
iommufd_backend_free_id(viommu->iommufd, viommu->abort_hwpt_id);
diff --git a/hw/arm/smmuv3-accel.h b/hw/arm/smmuv3-accel.h
index f631443b09..6242614c00 100644
--- a/hw/arm/smmuv3-accel.h
+++ b/hw/arm/smmuv3-accel.h
@@ -31,6 +31,7 @@ typedef struct SMMUv3AccelDevice {
SMMUDevice sdev;
HostIOMMUDeviceIOMMUFD *idev;
SMMUS1Hwpt *s1_hwpt;
+ IOMMUFDVdev *vdev;
SMMUViommu *viommu;
QLIST_ENTRY(SMMUv3AccelDevice) next;
} SMMUv3AccelDevice;
--
2.43.0
^ permalink raw reply related [flat|nested] 118+ messages in thread
* Re: [PATCH v4 10/27] hw/arm/smmuv3-accel: Allocate a vDEVICE object for device
2025-09-29 13:36 ` [PATCH v4 10/27] hw/arm/smmuv3-accel: Allocate a vDEVICE object for device Shameer Kolothum
@ 2025-09-29 16:42 ` Jonathan Cameron via
2025-10-17 13:08 ` Eric Auger
1 sibling, 0 replies; 118+ messages in thread
From: Jonathan Cameron via @ 2025-09-29 16:42 UTC (permalink / raw)
To: Shameer Kolothum
Cc: qemu-arm, qemu-devel, eric.auger, peter.maydell, jgg, nicolinc,
ddutile, berrange, nathanc, mochs, smostafa, wangzhou1,
jiangkunkun, zhangfei.gao, zhenzhong.duan, yi.l.liu,
shameerkolothum
On Mon, 29 Sep 2025 14:36:26 +0100
Shameer Kolothum <skolothumtho@nvidia.com> wrote:
> From: Nicolin Chen <nicolinc@nvidia.com>
>
> Allocate and associate a vDEVICE object for the Guest device with the
> vIOMMU. This will help the host kernel to make a virtual SID --> physical
> SID mapping. Since we pass the raw invalidation commands(eg: CMD_CFGI_CD)
> from Guest directly to host kernel, this provides a way to retrieve the
> correct physical SID.
>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [PATCH v4 10/27] hw/arm/smmuv3-accel: Allocate a vDEVICE object for device
2025-09-29 13:36 ` [PATCH v4 10/27] hw/arm/smmuv3-accel: Allocate a vDEVICE object for device Shameer Kolothum
2025-09-29 16:42 ` Jonathan Cameron via
@ 2025-10-17 13:08 ` Eric Auger
1 sibling, 0 replies; 118+ messages in thread
From: Eric Auger @ 2025-10-17 13:08 UTC (permalink / raw)
To: Shameer Kolothum, qemu-arm, qemu-devel
Cc: peter.maydell, jgg, nicolinc, ddutile, berrange, nathanc, mochs,
smostafa, wangzhou1, jiangkunkun, jonathan.cameron, zhangfei.gao,
zhenzhong.duan, yi.l.liu, shameerkolothum
Hi Shameer,
On 9/29/25 3:36 PM, Shameer Kolothum wrote:
> From: Nicolin Chen <nicolinc@nvidia.com>
>
> Allocate and associate a vDEVICE object for the Guest device with the
> vIOMMU. This will help the host kernel to make a virtual SID --> physical
> SID mapping. Since we pass the raw invalidation commands(eg: CMD_CFGI_CD)
> from Guest directly to host kernel, this provides a way to retrieve the
> correct physical SID.
>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
> ---
> hw/arm/smmuv3-accel.c | 41 +++++++++++++++++++++++++++++++++++++++++
> hw/arm/smmuv3-accel.h | 1 +
> 2 files changed, 42 insertions(+)
>
> diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
> index 5c3825cecd..790887ac31 100644
> --- a/hw/arm/smmuv3-accel.c
> +++ b/hw/arm/smmuv3-accel.c
> @@ -39,6 +39,35 @@
> #define STE1_MASK (STE1_ETS | STE1_S1STALLD | STE1_S1CSH | STE1_S1COR | \
> STE1_S1CIR | STE1_S1DSS)
>
> +static bool
> +smmuv3_accel_alloc_vdev(SMMUv3AccelDevice *accel_dev, int sid, Error **errp)
> +{
> + SMMUViommu *viommu = accel_dev->viommu;
> + IOMMUFDVdev *vdev;
> + uint32_t vdev_id;
> +
> + if (!accel_dev->idev || accel_dev->vdev) {
> + return true;
> + }
> +
> + if (!iommufd_backend_alloc_vdev(viommu->iommufd, accel_dev->idev->devid,
> + viommu->core.viommu_id, sid,
> + &vdev_id, errp)) {
> + return false;
> + }
> + if (!host_iommu_device_iommufd_attach_hwpt(accel_dev->idev,
> + viommu->bypass_hwpt_id, errp)) {
> + iommufd_backend_free_id(viommu->iommufd, vdev_id);
> + return false;
> + }
> +
> + vdev = g_new(IOMMUFDVdev, 1);
> + vdev->vdev_id = vdev_id;
> + vdev->dev_id = sid;
That's confusing to me; it should be virt_id and not dev_id, which usually
refers to an iommu object id.
It would be nice in general to stick to the kernel uapi terminology. For
instance vdev_id shall rather be vdevice_id, although in that case it is
understandable.
+bool iommufd_backend_alloc_vdev(IOMMUFDBackend *be, uint32_t dev_id,
+ uint32_t viommu_id, uint64_t virt_id,
+ uint32_t *out_vdev_id, Error **errp)
* struct iommu_vdevice_alloc - ioctl(IOMMU_VDEVICE_ALLOC)
* @size: sizeof(struct iommu_vdevice_alloc)
* @viommu_id: vIOMMU ID to associate with the virtual device
* @dev_id: The physical device to allocate a virtual instance on the vIOMMU
* @out_vdevice_id: Object handle for the vDevice. Pass to IOMMU_DESTORY
* @virt_id: Virtual device ID per vIOMMU, e.g. vSID of ARM SMMUv3,
vDeviceID
* of AMD IOMMU, and vRID of Intel VT-d
> + accel_dev->vdev = vdev;
> + return true;
> +}
> +
> static bool
> smmuv3_accel_dev_uninstall_nested_ste(SMMUv3AccelDevice *accel_dev, bool abort,
> Error **errp)
> @@ -127,6 +156,10 @@ smmuv3_accel_install_nested_ste(SMMUv3State *s, SMMUDevice *sdev, int sid,
> return true;
> }
>
> + if (!smmuv3_accel_alloc_vdev(accel_dev, sid, errp)) {
> + return false;
> + }
> +
> ret = smmu_find_ste(sdev->smmu, sid, &ste, &event);
> if (ret) {
> error_setg(errp, "Failed to find STE for Device 0x%x", sid);
> @@ -311,6 +344,7 @@ static void smmuv3_accel_unset_iommu_device(PCIBus *bus, void *opaque,
> SMMUPciBus *sbus = g_hash_table_lookup(bs->smmu_pcibus_by_busptr, bus);
> SMMUv3AccelDevice *accel_dev;
> SMMUViommu *viommu;
> + IOMMUFDVdev *vdev;
> SMMUDevice *sdev;
> uint16_t sid;
>
> @@ -337,6 +371,13 @@ static void smmuv3_accel_unset_iommu_device(PCIBus *bus, void *opaque,
> trace_smmuv3_accel_unset_iommu_device(devfn, sid);
>
> viommu = s->s_accel->viommu;
> + vdev = accel_dev->vdev;
> + if (vdev) {
> + iommufd_backend_free_id(viommu->iommufd, vdev->vdev_id);
> + g_free(vdev);
> + accel_dev->vdev = NULL;
> + }
> +
> if (QLIST_EMPTY(&viommu->device_list)) {
> iommufd_backend_free_id(viommu->iommufd, viommu->bypass_hwpt_id);
> iommufd_backend_free_id(viommu->iommufd, viommu->abort_hwpt_id);
> diff --git a/hw/arm/smmuv3-accel.h b/hw/arm/smmuv3-accel.h
> index f631443b09..6242614c00 100644
> --- a/hw/arm/smmuv3-accel.h
> +++ b/hw/arm/smmuv3-accel.h
> @@ -31,6 +31,7 @@ typedef struct SMMUv3AccelDevice {
> SMMUDevice sdev;
> HostIOMMUDeviceIOMMUFD *idev;
> SMMUS1Hwpt *s1_hwpt;
> + IOMMUFDVdev *vdev;
> SMMUViommu *viommu;
> QLIST_ENTRY(SMMUv3AccelDevice) next;
> } SMMUv3AccelDevice;
Thanks
Eric
^ permalink raw reply [flat|nested] 118+ messages in thread
* [PATCH v4 11/27] hw/pci/pci: Introduce optional get_msi_address_space() callback
2025-09-29 13:36 [PATCH v4 00/27] hw/arm/virt: Add support for user-creatable accelerated SMMUv3 Shameer Kolothum
` (9 preceding siblings ...)
2025-09-29 13:36 ` [PATCH v4 10/27] hw/arm/smmuv3-accel: Allocate a vDEVICE object for device Shameer Kolothum
@ 2025-09-29 13:36 ` Shameer Kolothum
2025-09-29 16:48 ` Jonathan Cameron via
` (2 more replies)
2025-09-29 13:36 ` [PATCH v4 12/27] hw/arm/smmuv3-accel: Make use of " Shameer Kolothum
` (16 subsequent siblings)
27 siblings, 3 replies; 118+ messages in thread
From: Shameer Kolothum @ 2025-09-29 13:36 UTC (permalink / raw)
To: qemu-arm, qemu-devel
Cc: eric.auger, peter.maydell, jgg, nicolinc, ddutile, berrange,
nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu,
shameerkolothum
On ARM, when a device is behind an IOMMU, its MSI doorbell address is
subject to translation by the IOMMU. This behavior affects vfio-pci
passthrough devices assigned to guests using an accelerated SMMUv3.
In this setup, we configure the host SMMUv3 in nested mode, where
VFIO sets up the Stage-2 (S2) mappings for guest RAM, while the guest
controls Stage-1 (S1). To allow VFIO to correctly configure S2 mappings,
we currently return the system address space via the get_address_space()
callback for vfio-pci devices.
However, QEMU/KVM also uses this same callback path when resolving the
address space for MSI doorbells:
kvm_irqchip_add_msi_route()
kvm_arch_fixup_msi_route()
pci_device_iommu_address_space()
get_address_space()
This will cause the device to be configured with the wrong MSI doorbell
address if it returns the system address space.
Introduce an optional get_msi_address_space() callback and use that in
the above path if available. This will enable IOMMU implementations to
make use of this if required.
Suggested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
hw/pci/pci.c | 19 +++++++++++++++++++
include/hw/pci/pci.h | 16 ++++++++++++++++
target/arm/kvm.c | 2 +-
3 files changed, 36 insertions(+), 1 deletion(-)
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 1315ef13ea..6f9e1616dd 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -2964,6 +2964,25 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
return &address_space_memory;
}
+AddressSpace *pci_device_iommu_msi_address_space(PCIDevice *dev)
+{
+ PCIBus *bus;
+ PCIBus *iommu_bus;
+ int devfn;
+
+ pci_device_get_iommu_bus_devfn(dev, &iommu_bus, &bus, &devfn);
+ if (iommu_bus) {
+ if (iommu_bus->iommu_ops->get_msi_address_space) {
+ return iommu_bus->iommu_ops->get_msi_address_space(bus,
+ iommu_bus->iommu_opaque, devfn);
+ } else {
+ return iommu_bus->iommu_ops->get_address_space(bus,
+ iommu_bus->iommu_opaque, devfn);
+ }
+ }
+ return &address_space_memory;
+}
+
int pci_iommu_init_iotlb_notifier(PCIDevice *dev, IOMMUNotifier *n,
IOMMUNotify fn, void *opaque)
{
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index c54f2b53ae..0d3b351903 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -652,6 +652,21 @@ typedef struct PCIIOMMUOps {
uint32_t pasid, bool priv_req, bool exec_req,
hwaddr addr, bool lpig, uint16_t prgi, bool is_read,
bool is_write);
+ /**
+ * @get_msi_address_space: get the address space for MSI doorbell address
+ * for devices
+ *
+ * Optional callback which returns a pointer to an #AddressSpace. This
+ * is required if MSI doorbell also gets translated through IOMMU(eg: ARM)
+ *
+ * @bus: the #PCIBus being accessed.
+ *
+ * @opaque: the data passed to pci_setup_iommu().
+ *
+ * @devfn: device and function number
+ */
+ AddressSpace * (*get_msi_address_space)(PCIBus *bus, void *opaque,
+ int devfn);
} PCIIOMMUOps;
bool pci_device_get_iommu_bus_devfn(PCIDevice *dev, PCIBus **piommu_bus,
@@ -660,6 +675,7 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
bool pci_device_set_iommu_device(PCIDevice *dev, HostIOMMUDevice *hiod,
Error **errp);
void pci_device_unset_iommu_device(PCIDevice *dev);
+AddressSpace *pci_device_iommu_msi_address_space(PCIDevice *dev);
/**
* pci_device_get_viommu_flags: get vIOMMU flags.
diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index b8a1c071f5..10eb8655c6 100644
--- a/target/arm/kvm.c
+++ b/target/arm/kvm.c
@@ -1611,7 +1611,7 @@ int kvm_arm_set_irq(int cpu, int irqtype, int irq, int level)
int kvm_arch_fixup_msi_route(struct kvm_irq_routing_entry *route,
uint64_t address, uint32_t data, PCIDevice *dev)
{
- AddressSpace *as = pci_device_iommu_address_space(dev);
+ AddressSpace *as = pci_device_iommu_msi_address_space(dev);
hwaddr xlat, len, doorbell_gpa;
MemoryRegionSection mrs;
MemoryRegion *mr;
--
2.43.0
^ permalink raw reply related [flat|nested] 118+ messages in thread
* Re: [PATCH v4 11/27] hw/pci/pci: Introduce optional get_msi_address_space() callback
2025-09-29 13:36 ` [PATCH v4 11/27] hw/pci/pci: Introduce optional get_msi_address_space() callback Shameer Kolothum
@ 2025-09-29 16:48 ` Jonathan Cameron via
2025-10-16 22:30 ` Nicolin Chen
2025-10-20 16:21 ` Eric Auger
2 siblings, 0 replies; 118+ messages in thread
From: Jonathan Cameron via @ 2025-09-29 16:48 UTC (permalink / raw)
To: Shameer Kolothum
Cc: qemu-arm, qemu-devel, eric.auger, peter.maydell, jgg, nicolinc,
ddutile, berrange, nathanc, mochs, smostafa, wangzhou1,
jiangkunkun, zhangfei.gao, zhenzhong.duan, yi.l.liu,
shameerkolothum
On Mon, 29 Sep 2025 14:36:27 +0100
Shameer Kolothum <skolothumtho@nvidia.com> wrote:
> On ARM, when a device is behind an IOMMU, its MSI doorbell address is
> subject to translation by the IOMMU. This behavior affects vfio-pci
> passthrough devices assigned to guests using an accelerated SMMUv3.
>
> In this setup, we configure the host SMMUv3 in nested mode, where
> VFIO sets up the Stage-2 (S2) mappings for guest RAM, while the guest
> controls Stage-1 (S1). To allow VFIO to correctly configure S2 mappings,
> we currently return the system address space via the get_address_space()
> callback for vfio-pci devices.
>
> However, QEMU/KVM also uses this same callback path when resolving the
> address space for MSI doorbells:
>
> kvm_irqchip_add_msi_route()
> kvm_arch_fixup_msi_route()
> pci_device_iommu_address_space()
> get_address_space()
>
> This will cause the device to be configured with wrong MSI doorbell
> address if it return the system address space.
>
> Introduce an optional get_msi_address_space() callback and use that in
> the above path if available. This will enable IOMMU implementations to
> make use of this if required.
Extra space before required.
>
> Suggested-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
one comment inline. Either way
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> ---
> hw/pci/pci.c | 19 +++++++++++++++++++
> include/hw/pci/pci.h | 16 ++++++++++++++++
> target/arm/kvm.c | 2 +-
> 3 files changed, 36 insertions(+), 1 deletion(-)
>
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index 1315ef13ea..6f9e1616dd 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -2964,6 +2964,25 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
> return &address_space_memory;
> }
>
> +AddressSpace *pci_device_iommu_msi_address_space(PCIDevice *dev)
> +{
> + PCIBus *bus;
> + PCIBus *iommu_bus;
> + int devfn;
> +
> + pci_device_get_iommu_bus_devfn(dev, &iommu_bus, &bus, &devfn);
> + if (iommu_bus) {
> + if (iommu_bus->iommu_ops->get_msi_address_space) {
> + return iommu_bus->iommu_ops->get_msi_address_space(bus,
> + iommu_bus->iommu_opaque, devfn);
> + } else {
Not important so up to you.
I see the 'else' as unnecessary here both because you returned above and
because it's kind of the natural default - i.e. what we did before the
new callback.
> + return iommu_bus->iommu_ops->get_address_space(bus,
> + iommu_bus->iommu_opaque, devfn);
> + }
> + }
> + return &address_space_memory;
> +}
> +
> int pci_iommu_init_iotlb_notifier(PCIDevice *dev, IOMMUNotifier *n,
> IOMMUNotify fn, void *opaque)
> {
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [PATCH v4 11/27] hw/pci/pci: Introduce optional get_msi_address_space() callback
2025-09-29 13:36 ` [PATCH v4 11/27] hw/pci/pci: Introduce optional get_msi_address_space() callback Shameer Kolothum
2025-09-29 16:48 ` Jonathan Cameron via
@ 2025-10-16 22:30 ` Nicolin Chen
2025-10-20 16:14 ` Eric Auger
2025-10-20 16:21 ` Eric Auger
2 siblings, 1 reply; 118+ messages in thread
From: Nicolin Chen @ 2025-10-16 22:30 UTC (permalink / raw)
To: Shameer Kolothum
Cc: qemu-arm, qemu-devel, eric.auger, peter.maydell, jgg, ddutile,
berrange, nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu,
shameerkolothum
On Mon, Sep 29, 2025 at 02:36:27PM +0100, Shameer Kolothum wrote:
> On ARM, when a device is behind an IOMMU, its MSI doorbell address is
> subject to translation by the IOMMU. This behavior affects vfio-pci
> passthrough devices assigned to guests using an accelerated SMMUv3.
>
> In this setup, we configure the host SMMUv3 in nested mode, where
> VFIO sets up the Stage-2 (S2) mappings for guest RAM, while the guest
> controls Stage-1 (S1). To allow VFIO to correctly configure S2 mappings,
> we currently return the system address space via the get_address_space()
> callback for vfio-pci devices.
>
> However, QEMU/KVM also uses this same callback path when resolving the
> address space for MSI doorbells:
>
> kvm_irqchip_add_msi_route()
> kvm_arch_fixup_msi_route()
> pci_device_iommu_address_space()
> get_address_space()
>
> This will cause the device to be configured with wrong MSI doorbell
> address if it return the system address space.
I think it'd be nicer to elaborate why a wrong address will be returned:
--------------------------------------------------------------------------
On ARM, a device behind an IOMMU requires translation for its MSI doorbell
address. When HW nested translation is enabled, the translation will also
happen in two stages: gIOVA => gPA => ITS page.
In the accelerated SMMUv3 mode, both stages are translated by the HW. So,
get_address_space() returns the system address space for stage-2 mappings,
as the smmuv3-accel model doesn't involve in either stage.
On the other hand, this callback is also invoked by QEMU/KVM:
kvm_irqchip_add_msi_route()
kvm_arch_fixup_msi_route()
pci_device_iommu_address_space()
get_address_space()
What KVM wants is to translate an MSI doorbell gIOVA to a vITS page (gPA),
so as to inject IRQs to the guest VM. And it expected get_address_space()
to return the address space for stage-1 mappings instead. Apparently, this
is broken.
Introduce an optional get_msi_address_space() callback and use that in the
above path.
--------------------------------------------------------------------------
> @@ -652,6 +652,21 @@ typedef struct PCIIOMMUOps {
> uint32_t pasid, bool priv_req, bool exec_req,
> hwaddr addr, bool lpig, uint16_t prgi, bool is_read,
> bool is_write);
> + /**
> + * @get_msi_address_space: get the address space for MSI doorbell address
> + * for devices
+ * @get_msi_address_space: get the address space to translate MSI doorbell
+ * address for a device
> + *
> + * Optional callback which returns a pointer to an #AddressSpace. This
> + * is required if MSI doorbell also gets translated through IOMMU(eg: ARM)
through vIOMMU (e.g. ARM).
With these,
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
* Re: [PATCH v4 11/27] hw/pci/pci: Introduce optional get_msi_address_space() callback
2025-10-16 22:30 ` Nicolin Chen
@ 2025-10-20 16:14 ` Eric Auger
2025-10-20 18:00 ` Nicolin Chen
0 siblings, 1 reply; 118+ messages in thread
From: Eric Auger @ 2025-10-20 16:14 UTC (permalink / raw)
To: Nicolin Chen, Shameer Kolothum
Cc: qemu-arm, qemu-devel, peter.maydell, jgg, ddutile, berrange,
nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu,
shameerkolothum
Hi Nicolin, Shameer,
On 10/17/25 12:30 AM, Nicolin Chen wrote:
> On Mon, Sep 29, 2025 at 02:36:27PM +0100, Shameer Kolothum wrote:
>> On ARM, when a device is behind an IOMMU, its MSI doorbell address is
>> subject to translation by the IOMMU. This behavior affects vfio-pci
>> passthrough devices assigned to guests using an accelerated SMMUv3.
>>
>> In this setup, we configure the host SMMUv3 in nested mode, where
>> VFIO sets up the Stage-2 (S2) mappings for guest RAM, while the guest
>> controls Stage-1 (S1). To allow VFIO to correctly configure S2 mappings,
>> we currently return the system address space via the get_address_space()
>> callback for vfio-pci devices.
>>
>> However, QEMU/KVM also uses this same callback path when resolving the
>> address space for MSI doorbells:
>>
>> kvm_irqchip_add_msi_route()
>> kvm_arch_fixup_msi_route()
>> pci_device_iommu_address_space()
>> get_address_space()
>>
>> This will cause the device to be configured with wrong MSI doorbell
>> address if it return the system address space.
> I think it'd be nicer to elaborate why a wrong address will be returned:
>
> --------------------------------------------------------------------------
> On ARM, a device behind an IOMMU requires translation for its MSI doorbell
> address. When HW nested translation is enabled, the translation will also
> happen in two stages: gIOVA => gPA => ITS page.
>
> In the accelerated SMMUv3 mode, both stages are translated by the HW. So,
> get_address_space() returns the system address space for stage-2 mappings,
> as the smmuv3-accel model doesn't involve in either stage.
I don't understand "doesn't involve in either stage". It is still not
obvious to me that for an HW accelerated nested IOMMU get_address_space()
shall return the system address space. I think this deserves to be
explained and maybe documented along with the callback.
>
> On the other hand, this callback is also invoked by QEMU/KVM:
>
> kvm_irqchip_add_msi_route()
> kvm_arch_fixup_msi_route()
> pci_device_iommu_address_space()
> get_address_space()
>
> What KVM wants is to translate an MSI doorbell gIOVA to a vITS page (gPA),
> so as to inject IRQs to the guest VM. And it expected get_address_space()
> to return the address space for stage-1 mappings instead. Apparently, this
> is broken.
"Apparently this is broken". Please clarify what is broken. Definitely, if
pci_device_iommu_address_space(dev) returns @address_space_memory, no
translation is attempted.
kvm_arch_fixup_msi_route() was introduced by
https://lore.kernel.org/all/1523518688-26674-12-git-send-email-eric.auger@redhat.com/
This relies on the vIOMMU translate callback which is supposed to be bypassed in general with VFIO devices. Isn't it needed only for emulated devices?
Maybe you and Shameer discussed that in a previous thread. Might be worth adding the link to this discussion.
Thanks
Eric
>
> Introduce an optional get_msi_address_space() callback and use that in the
> above path.
> --------------------------------------------------------------------------
>
>> @@ -652,6 +652,21 @@ typedef struct PCIIOMMUOps {
>> uint32_t pasid, bool priv_req, bool exec_req,
>> hwaddr addr, bool lpig, uint16_t prgi, bool is_read,
>> bool is_write);
>> + /**
>> + * @get_msi_address_space: get the address space for MSI doorbell address
>> + * for devices
> + * @get_msi_address_space: get the address space to translate MSI doorbell
> + * address for a device
>
>> + *
>> + * Optional callback which returns a pointer to an #AddressSpace. This
>> + * is required if MSI doorbell also gets translated through IOMMU(eg: ARM)
> through vIOMMU (e.g. ARM).
>
> With these,
>
> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
>
* Re: [PATCH v4 11/27] hw/pci/pci: Introduce optional get_msi_address_space() callback
2025-10-20 16:14 ` Eric Auger
@ 2025-10-20 18:00 ` Nicolin Chen
2025-10-21 16:26 ` Eric Auger
0 siblings, 1 reply; 118+ messages in thread
From: Nicolin Chen @ 2025-10-20 18:00 UTC (permalink / raw)
To: Eric Auger
Cc: Shameer Kolothum, qemu-arm, qemu-devel, peter.maydell, jgg,
ddutile, berrange, nathanc, mochs, smostafa, wangzhou1,
jiangkunkun, jonathan.cameron, zhangfei.gao, zhenzhong.duan,
yi.l.liu, shameerkolothum
On Mon, Oct 20, 2025 at 06:14:33PM +0200, Eric Auger wrote:
> >> This will cause the device to be configured with wrong MSI doorbell
> >> address if it return the system address space.
> >
> > I think it'd be nicer to elaborate why a wrong address will be returned:
> >
> > --------------------------------------------------------------------------
> > On ARM, a device behind an IOMMU requires translation for its MSI doorbell
> > address. When HW nested translation is enabled, the translation will also
> > happen in two stages: gIOVA => gPA => ITS page.
> >
> > In the accelerated SMMUv3 mode, both stages are translated by the HW. So,
> > get_address_space() returns the system address space for stage-2 mappings,
> > as the smmuv3-accel model doesn't involve in either stage.
> I don't understand "doesn't involve in either stage". This is still not
> obious to me that for an HW accelerated nested IOMMU get_address_space()
> shall return the system address space. I think this deserves to be
> explained and maybe documented along with the callback.
get_address_space() is used by pci_device_iommu_address_space(),
which is for attach or translation.
In QEMU, we have an "iommu" type of memory region, to represent
the address space providing the stage-1 translation.
In accel case excluding MSI, there is no need of "emulated iommu
translation" since HW/host SMMU takes care of both stages. Thus,
the system address is returned for get_address_space(), to avoid
stage-1 translation and to also allow VFIO devices to attach to
the system address space that the VFIO core will monitor to take
care of stage-2 mappings.
> > On the other hand, this callback is also invoked by QEMU/KVM:
> >
> > kvm_irqchip_add_msi_route()
> > kvm_arch_fixup_msi_route()
> > pci_device_iommu_address_space()
> > get_address_space()
> >
> > What KVM wants is to translate an MSI doorbell gIOVA to a vITS page (gPA),
> > so as to inject IRQs to the guest VM. And it expected get_address_space()
> > to return the address space for stage-1 mappings instead. Apparently, this
> > is broken.
> "Apparently this is broken". Please clarify what is broken. Definitively if
>
> pci_device_iommu_address_space(dev) retruns @adress_system_memory no
> translation is attempted.
Hmm, I thought my writing was clear:
- pci_device_iommu_address_space() returns the system address
space that can't do a stage-1 translation.
- KVM/MSI pathway requires an address space that can do a stage-1
translation.
> kvm_arch_fixup_msi_route() was introduced by
> https://lore.kernel.org/all/1523518688-26674-12-git-send-email-eric.auger@redhat.com/
>
> This relies on the vIOMMU translate callback which is supposed to be bypassed in general with VFIO devices. Isn't needed only for emulated devices?
Not only for emulated devices.
This KVM function needs the translation for the IRQ injection for
VFIO devices as well.
Although we use RMR for the underlying HW to bypass the stage-1, the
translation for gIOVA=>vITS page (VIRT_GIC_ITS) still exists at
the guest level. FWIW, it just doesn't have the stage-2 mapping
because HW never uses the "gIOVA" but a hard-coded SW_MSI address.
In the meantime, a VFIO device in the guest is programmed with a
gIOVA for MSI doorbell. This gIOVA can't be used for KVM code to
inject IRQs. It needs the gPA (i.e. VIRT_GIC_ITS). So, it needs a
translation address space to do that.
Hope this is clear now.
Thanks
Nicolin
* Re: [PATCH v4 11/27] hw/pci/pci: Introduce optional get_msi_address_space() callback
2025-10-20 18:00 ` Nicolin Chen
@ 2025-10-21 16:26 ` Eric Auger
2025-10-21 18:56 ` Nicolin Chen
0 siblings, 1 reply; 118+ messages in thread
From: Eric Auger @ 2025-10-21 16:26 UTC (permalink / raw)
To: Nicolin Chen
Cc: Shameer Kolothum, qemu-arm, qemu-devel, peter.maydell, jgg,
ddutile, berrange, nathanc, mochs, smostafa, wangzhou1,
jiangkunkun, jonathan.cameron, zhangfei.gao, zhenzhong.duan,
yi.l.liu, shameerkolothum
Hi Nicolin,
On 10/20/25 8:00 PM, Nicolin Chen wrote:
> On Mon, Oct 20, 2025 at 06:14:33PM +0200, Eric Auger wrote:
>>>> This will cause the device to be configured with wrong MSI doorbell
>>>> address if it return the system address space.
>>> I think it'd be nicer to elaborate why a wrong address will be returned:
>>>
>>> --------------------------------------------------------------------------
>>> On ARM, a device behind an IOMMU requires translation for its MSI doorbell
>>> address. When HW nested translation is enabled, the translation will also
>>> happen in two stages: gIOVA => gPA => ITS page.
>>>
>>> In the accelerated SMMUv3 mode, both stages are translated by the HW. So,
>>> get_address_space() returns the system address space for stage-2 mappings,
>>> as the smmuv3-accel model doesn't involve in either stage.
>> I don't understand "doesn't involve in either stage". This is still not
>> obious to me that for an HW accelerated nested IOMMU get_address_space()
>> shall return the system address space. I think this deserves to be
>> explained and maybe documented along with the callback.
> get_address_space() is used by pci_device_iommu_address_space(),
> which is for attach or translation.
>
> In QEMU, we have an "iommu" type of memory region, to represent
> the address space providing the stage-1 translation.
>
> In accel case excluding MSI, there is no need of "emulated iommu
> translation" since HW/host SMMU takes care of both stages. Thus,
> the system address is returned for get_address_space(), to avoid
> stage-1 translation and to also allow VFIO devices to attach to
> the system address space that the VFIO core will monitor to take
> care of stage-2 mappings.
but in general if you set as output 'as' the address_space_memory it
rather means you have no translation in place. This is what I am not
convinced about.
You say it aims at:
- avoiding stage-1 translation
- allowing VFIO devices to attach to the system address space that the
VFIO core will monitor to take care of stage-2 mappings.
Can you achieve the same goals with a proper address space?
>
>>> On the other hand, this callback is also invoked by QEMU/KVM:
>>>
>>> kvm_irqchip_add_msi_route()
>>> kvm_arch_fixup_msi_route()
>>> pci_device_iommu_address_space()
>>> get_address_space()
>>>
>>> What KVM wants is to translate an MSI doorbell gIOVA to a vITS page (gPA),
>>> so as to inject IRQs to the guest VM. And it expected get_address_space()
>>> to return the address space for stage-1 mappings instead. Apparently, this
>>> is broken.
>> "Apparently this is broken". Please clarify what is broken. Definitively if
>>
>> pci_device_iommu_address_space(dev) retruns @adress_system_memory no
>> translation is attempted.
> Hmm, I thought my writing was clear:
> - pci_device_iommu_address_space() returns the system address
> space that can't do a stage-1 translation.
> - KVM/MSI pathway requires an adress space that can do a stage-1
> translation.
Understood, although I am not sure using the system address space is the
best choice. But I may not be the best person to decide about this.
>
>> kvm_arch_fixup_msi_route() was introduced by
>> https://lore.kernel.org/all/1523518688-26674-12-git-send-email-eric.auger@redhat.com/
>>
>> This relies on the vIOMMU translate callback which is supposed to be bypassed in general with VFIO devices. Isn't needed only for emulated devices?
> Not only for emulated devices.
>
> This KVM function needs the translation for the IRQ injection for
> VFIO devices as well.
understood.
>
> Although we use RMR for underlying HW to bypass the stage-1, the
> translation for gIOVA=>vITS page (VIRT_GIC_ITS) still exists in
> the guest level. FWIW, it's just doesn't have the stage-2 mapping
> because HW never uses the "gIOVA" but a hard-coded SW_MSI address.
>
> In the meantime, a VFIO device in the guest is programmed with a
> gIOVA for MSI doorbell. This gIOVA can't be used for KVM code to
> inject IRQs. It needs the gPA (i.e. VIRT_GIC_ITS). So, it needs a
> translation address space to do that.
>
> Hope this is clear now.
OK. I understand the need, but I am unsure using the system address space
is the right choice.
Eric
>
> Thanks
> Nicolin
>
* Re: [PATCH v4 11/27] hw/pci/pci: Introduce optional get_msi_address_space() callback
2025-10-21 16:26 ` Eric Auger
@ 2025-10-21 18:56 ` Nicolin Chen
2025-10-22 16:25 ` Eric Auger
0 siblings, 1 reply; 118+ messages in thread
From: Nicolin Chen @ 2025-10-21 18:56 UTC (permalink / raw)
To: Eric Auger
Cc: Shameer Kolothum, qemu-arm, qemu-devel, peter.maydell, jgg,
ddutile, berrange, nathanc, mochs, smostafa, wangzhou1,
jiangkunkun, jonathan.cameron, zhangfei.gao, zhenzhong.duan,
yi.l.liu, shameerkolothum
On Tue, Oct 21, 2025 at 06:26:39PM +0200, Eric Auger wrote:
> Hi Nicolin,
>
> On 10/20/25 8:00 PM, Nicolin Chen wrote:
> > On Mon, Oct 20, 2025 at 06:14:33PM +0200, Eric Auger wrote:
> >>>> This will cause the device to be configured with wrong MSI doorbell
> >>>> address if it return the system address space.
> >>> I think it'd be nicer to elaborate why a wrong address will be returned:
> >>>
> >>> --------------------------------------------------------------------------
> >>> On ARM, a device behind an IOMMU requires translation for its MSI doorbell
> >>> address. When HW nested translation is enabled, the translation will also
> >>> happen in two stages: gIOVA => gPA => ITS page.
> >>>
> >>> In the accelerated SMMUv3 mode, both stages are translated by the HW. So,
> >>> get_address_space() returns the system address space for stage-2 mappings,
> >>> as the smmuv3-accel model doesn't involve in either stage.
> >> I don't understand "doesn't involve in either stage". This is still not
> >> obious to me that for an HW accelerated nested IOMMU get_address_space()
> >> shall return the system address space. I think this deserves to be
> >> explained and maybe documented along with the callback.
> > get_address_space() is used by pci_device_iommu_address_space(),
> > which is for attach or translation.
> >
> > In QEMU, we have an "iommu" type of memory region, to represent
> > the address space providing the stage-1 translation.
> >
> > In accel case excluding MSI, there is no need of "emulated iommu
> > translation" since HW/host SMMU takes care of both stages. Thus,
> > the system address is returned for get_address_space(), to avoid
> > stage-1 translation and to also allow VFIO devices to attach to
> > the system address space that the VFIO core will monitor to take
> > care of stage-2 mappings.
> but in general if you set as output 'as' the system_address_memory it
> rather means you have no translation in place. This is what I am not
> convinced about.
You mean you are not convinced about "no translation"?
> you say it aims at
> - avoiding stage-1 translation - allow VFIO devices to attach to the
> system address space that the VFIO core will monitor to take care of
> stage-2 mappings. Can you achieve the same goals with a proper address
> space?
Would you please define "proper"?
The disagreement is seemingly about using system address space or
even address_space_memory, IIUIC.
To our purpose here, so long as the vfio core can setup a proper
listener to monitor the guest physical address space, we are fine
with any alternative.
The system address space just seems to be the simplest one. FWIW,
kvm_arch_fixup_msi_route() also checks in the beginning:
if (as == &address_space_memory)
So, returning @address_space_memory seems to be straightforward?
I think I also need some education to understand why we need
an indirect address space that will eventually be routed back to
address_space_memory?
Thanks
Nicolin
* Re: [PATCH v4 11/27] hw/pci/pci: Introduce optional get_msi_address_space() callback
2025-10-21 18:56 ` Nicolin Chen
@ 2025-10-22 16:25 ` Eric Auger
2025-10-22 16:56 ` Shameer Kolothum
0 siblings, 1 reply; 118+ messages in thread
From: Eric Auger @ 2025-10-22 16:25 UTC (permalink / raw)
To: Nicolin Chen
Cc: Shameer Kolothum, qemu-arm, qemu-devel, peter.maydell, jgg,
ddutile, berrange, nathanc, mochs, smostafa, wangzhou1,
jiangkunkun, jonathan.cameron, zhangfei.gao, zhenzhong.duan,
yi.l.liu, shameerkolothum
Hi Nicolin,
On 10/21/25 8:56 PM, Nicolin Chen wrote:
> On Tue, Oct 21, 2025 at 06:26:39PM +0200, Eric Auger wrote:
>> Hi Nicolin,
>>
>> On 10/20/25 8:00 PM, Nicolin Chen wrote:
>>> On Mon, Oct 20, 2025 at 06:14:33PM +0200, Eric Auger wrote:
>>>>>> This will cause the device to be configured with wrong MSI doorbell
>>>>>> address if it return the system address space.
>>>>> I think it'd be nicer to elaborate why a wrong address will be returned:
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> On ARM, a device behind an IOMMU requires translation for its MSI doorbell
>>>>> address. When HW nested translation is enabled, the translation will also
>>>>> happen in two stages: gIOVA => gPA => ITS page.
>>>>>
>>>>> In the accelerated SMMUv3 mode, both stages are translated by the HW. So,
>>>>> get_address_space() returns the system address space for stage-2 mappings,
>>>>> as the smmuv3-accel model doesn't involve in either stage.
>>>> I don't understand "doesn't involve in either stage". This is still not
>>>> obious to me that for an HW accelerated nested IOMMU get_address_space()
>>>> shall return the system address space. I think this deserves to be
>>>> explained and maybe documented along with the callback.
>>> get_address_space() is used by pci_device_iommu_address_space(),
>>> which is for attach or translation.
>>>
>>> In QEMU, we have an "iommu" type of memory region, to represent
>>> the address space providing the stage-1 translation.
>>>
>>> In accel case excluding MSI, there is no need of "emulated iommu
>>> translation" since HW/host SMMU takes care of both stages. Thus,
>>> the system address is returned for get_address_space(), to avoid
>>> stage-1 translation and to also allow VFIO devices to attach to
>>> the system address space that the VFIO core will monitor to take
>>> care of stage-2 mappings.
>> but in general if you set as output 'as' the system_address_memory it
>> rather means you have no translation in place. This is what I am not
>> convinced about.
> You mean you are not convinced about "no translation"?
I am not convinced about the choice of using address_space_memory.
>
>> you say it aims at
>> - avoiding stage-1 translation - allow VFIO devices to attach to the
>> system address space that the VFIO core will monitor to take care of
>> stage-2 mappings. Can you achieve the same goals with a proper address
>> space?
> Would you please define "proper"?
an address space different from address_space_memory
>
> The disagreement is seemingly about using system address space or
> even address_space_memory, IIUIC.
Yes my doubt is about:

smmuv3_accel_find_add_as():

    /*
     * We are using the global &address_space_memory here, as this will ensure
     * same system address space pointer for all devices behind the accelerated
     * SMMUv3s in a VM. That way VFIO/iommufd can reuse a single IOAS ID in
     * iommufd_cdev_attach(), allowing the Stage-2 page tables to be shared
     * within the VM instead of duplicating them for every SMMUv3 instance.
     */
    if (vfio_pci) {
        return &address_space_memory;
I think it would be cleaner to have an AddressSpace allocated on
purpose to support the VFIO accel use case, if possible.
To me returning address_space_memory pretends we are not doing any
translation. I understand it is "easy" to reuse that one but I wonder
whether it is in the spirit of the get_address_space callback.
I would rather allocate a dedicated (shared) AddressSpace to support the
VFIO accel case. That's my suggestion.
>
> To our purpose here, so long as the vfio core can setup a proper
> listener to monitor the guest physical address space, we are fine
> with any alternative.
>
> The system address space just seems to be the simplest one. FWIW,
> kvm_arch_fixup_msi_route() also checks in the beginning:
> if (as == &address_space_memory)
>
> So, returning @address_space_memory seems to be straightforward?
>
> I think I also need some education to understand why do we need
> an indirect address space that eventually will be routed back to
> address_space_memory?
Well I am not an expert of AddressSpaces either. Reading hw/pci/pci.h
and the get_address_space() callback API doc comment, I understand this is
the output address space for the PCI device. If you return
address_space_memory, to me this means there is no translation in place.
By the way, this was the interpretation of kvm_arch_fixup_msi_route() on ARM:

    AddressSpace *as = pci_device_iommu_address_space(dev);

    if (as == &address_space_memory) {
        return 0;
    }

    /* MSI doorbell address is translated by an IOMMU */
Note: I am currently out of the office so I am not able to reply as fast as you may wish.
Thanks
Eric
>
> Thanks
> Nicolin
>
* RE: [PATCH v4 11/27] hw/pci/pci: Introduce optional get_msi_address_space() callback
2025-10-22 16:25 ` Eric Auger
@ 2025-10-22 16:56 ` Shameer Kolothum
0 siblings, 0 replies; 118+ messages in thread
From: Shameer Kolothum @ 2025-10-22 16:56 UTC (permalink / raw)
To: eric.auger@redhat.com, Nicolin Chen
Cc: qemu-arm@nongnu.org, qemu-devel@nongnu.org,
peter.maydell@linaro.org, Jason Gunthorpe, ddutile@redhat.com,
berrange@redhat.com, Nathan Chen, Matt Ochs, smostafa@google.com,
wangzhou1@hisilicon.com, jiangkunkun@huawei.com,
jonathan.cameron@huawei.com, zhangfei.gao@linaro.org,
zhenzhong.duan@intel.com, yi.l.liu@intel.com,
shameerkolothum@gmail.com
> -----Original Message-----
> From: Eric Auger <eric.auger@redhat.com>
> Sent: 22 October 2025 17:25
> To: Nicolin Chen <nicolinc@nvidia.com>
> Cc: Shameer Kolothum <skolothumtho@nvidia.com>; qemu-
> arm@nongnu.org; qemu-devel@nongnu.org; peter.maydell@linaro.org;
> Jason Gunthorpe <jgg@nvidia.com>; ddutile@redhat.com;
> berrange@redhat.com; Nathan Chen <nathanc@nvidia.com>; Matt Ochs
> <mochs@nvidia.com>; smostafa@google.com; wangzhou1@hisilicon.com;
> jiangkunkun@huawei.com; jonathan.cameron@huawei.com;
> zhangfei.gao@linaro.org; zhenzhong.duan@intel.com; yi.l.liu@intel.com;
> shameerkolothum@gmail.com
> Subject: Re: [PATCH v4 11/27] hw/pci/pci: Introduce optional
> get_msi_address_space() callback
>
>
> Hi Nicolin,
>
> On 10/21/25 8:56 PM, Nicolin Chen wrote:
> > On Tue, Oct 21, 2025 at 06:26:39PM +0200, Eric Auger wrote:
> >> Hi Nicolin,
> >>
> >> On 10/20/25 8:00 PM, Nicolin Chen wrote:
> >>> On Mon, Oct 20, 2025 at 06:14:33PM +0200, Eric Auger wrote:
> >>>>>> This will cause the device to be configured with wrong MSI doorbell
> >>>>>> address if it return the system address space.
> >>>>> I think it'd be nicer to elaborate why a wrong address will be returned:
> >>>>>
> >>>>> --------------------------------------------------------------------------
> >>>>> On ARM, a device behind an IOMMU requires translation for its MSI
> doorbell
> >>>>> address. When HW nested translation is enabled, the translation will
> also
> >>>>> happen in two stages: gIOVA => gPA => ITS page.
> >>>>>
> >>>>> In the accelerated SMMUv3 mode, both stages are translated by the
> HW. So,
> >>>>> get_address_space() returns the system address space for stage-2
> mappings,
> >>>>> as the smmuv3-accel model doesn't involve in either stage.
> >>>> I don't understand "doesn't involve in either stage". This is still not
> >>>> obious to me that for an HW accelerated nested IOMMU
> get_address_space()
> >>>> shall return the system address space. I think this deserves to be
> >>>> explained and maybe documented along with the callback.
> >>> get_address_space() is used by pci_device_iommu_address_space(),
> >>> which is for attach or translation.
> >>>
> >>> In QEMU, we have an "iommu" type of memory region, to represent
> >>> the address space providing the stage-1 translation.
> >>>
> >>> In accel case excluding MSI, there is no need of "emulated iommu
> >>> translation" since HW/host SMMU takes care of both stages. Thus,
> >>> the system address is returned for get_address_space(), to avoid
> >>> stage-1 translation and to also allow VFIO devices to attach to
> >>> the system address space that the VFIO core will monitor to take
> >>> care of stage-2 mappings.
> >> but in general if you set as output 'as' the system_address_memory it
> >> rather means you have no translation in place. This is what I am not
> >> convinced about.
> > You mean you are not convinced about "no translation"?
> I am not convinced about the choice of using address_space_memory.
> >
> >> you say it aims at
> >> - avoiding stage-1 translation - allow VFIO devices to attach to the
> >> system address space that the VFIO core will monitor to take care of
> >> stage-2 mappings. Can you achieve the same goals with a proper address
> >> space?
> > Would you please define "proper"?
> an address space different from address_space_memory
> >
> > The disagreement is seemingly about using system address space or
> > even address_space_memory, IIUIC.
> Yes my doubt is about:
>
> smmuv3_accel_find_add_as():
>
>     /*
>      * We are using the global &address_space_memory here, as this will ensure
>      * same system address space pointer for all devices behind the accelerated
>      * SMMUv3s in a VM. That way VFIO/iommufd can reuse a single IOAS ID in
>      * iommufd_cdev_attach(), allowing the Stage-2 page tables to be shared
>      * within the VM instead of duplicating them for every SMMUv3 instance.
>      */
>     if (vfio_pci) {
>         return &address_space_memory;
>
> I think it would be cleaner to a have an AddressSpace allocated on
> purpose to support the VFIO accel use case, if possible.
> To me returning address_space_memory pretends we are not doing any
> translation. I understand it is "easy" to reuse that one but I wonder it
> is the spirit of the get_address_space callback.
>
> I would rather allocate a dedicated (shared) AddressSpace to support the
> VFIO accel case. That's my suggestion.
Ok. I will give it a go with the "global variable in smmu-accel.c" route for a
separate shared address space that you suggested earlier in patch #6 thread.
Thanks,
Shameer
* Re: [PATCH v4 11/27] hw/pci/pci: Introduce optional get_msi_address_space() callback
2025-09-29 13:36 ` [PATCH v4 11/27] hw/pci/pci: Introduce optional get_msi_address_space() callback Shameer Kolothum
2025-09-29 16:48 ` Jonathan Cameron via
2025-10-16 22:30 ` Nicolin Chen
@ 2025-10-20 16:21 ` Eric Auger
2 siblings, 0 replies; 118+ messages in thread
From: Eric Auger @ 2025-10-20 16:21 UTC (permalink / raw)
To: Shameer Kolothum, qemu-arm, qemu-devel
Cc: peter.maydell, jgg, nicolinc, ddutile, berrange, nathanc, mochs,
smostafa, wangzhou1, jiangkunkun, jonathan.cameron, zhangfei.gao,
zhenzhong.duan, yi.l.liu, shameerkolothum
Hi Shameer
On 9/29/25 3:36 PM, Shameer Kolothum wrote:
> On ARM, when a device is behind an IOMMU, its MSI doorbell address is
> subject to translation by the IOMMU. This behavior affects vfio-pci
> passthrough devices assigned to guests using an accelerated SMMUv3.
>
> In this setup, we configure the host SMMUv3 in nested mode, where
> VFIO sets up the Stage-2 (S2) mappings for guest RAM, while the guest
> controls Stage-1 (S1). To allow VFIO to correctly configure S2 mappings,
> we currently return the system address space via the get_address_space()
> callback for vfio-pci devices.
>
> However, QEMU/KVM also uses this same callback path when resolving the
> address space for MSI doorbells:
>
> kvm_irqchip_add_msi_route()
> kvm_arch_fixup_msi_route()
> pci_device_iommu_address_space()
> get_address_space()
>
> This will cause the device to be configured with wrong MSI doorbell
> address if it return the system address space.
returns
> Introduce an optional get_msi_address_space() callback and use that in
> the above path if available. This will enable IOMMU implementations to
> make use of this if required.
if required
> Suggested-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
> ---
> hw/pci/pci.c | 19 +++++++++++++++++++
> include/hw/pci/pci.h | 16 ++++++++++++++++
> target/arm/kvm.c | 2 +-
> 3 files changed, 36 insertions(+), 1 deletion(-)
>
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index 1315ef13ea..6f9e1616dd 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -2964,6 +2964,25 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
> return &address_space_memory;
> }
>
> +AddressSpace *pci_device_iommu_msi_address_space(PCIDevice *dev)
> +{
> + PCIBus *bus;
> + PCIBus *iommu_bus;
> + int devfn;
> +
> + pci_device_get_iommu_bus_devfn(dev, &iommu_bus, &bus, &devfn);
> + if (iommu_bus) {
> + if (iommu_bus->iommu_ops->get_msi_address_space) {
> + return iommu_bus->iommu_ops->get_msi_address_space(bus,
> + iommu_bus->iommu_opaque, devfn);
See my reply to Nicolin's comment. From a high-level point of view, the
semantics of get_msi_address_space versus get_address_space do not look
very clear. I have the impression that for the HW nested implementation you
were forced to return the system address space through get_address_space,
although there is a protecting IOMMU, and you need another callback to
return a proper IOMMU address space for MSIs. This is still unclear and
looks hacky to me at this point. I think we need to get the semantics of
get_msi_address_space vs get_address_space more solid, and you need to
explain why get_address_space is mandated to return the system address
space in our case.
Maybe you explained that earlier in some thread but I fail to find that
info again in the commit messages/comments, and I think this is important.
> + } else {
> + return iommu_bus->iommu_ops->get_address_space(bus,
> + iommu_bus->iommu_opaque, devfn);
> + }
> + }
> + return &address_space_memory;
> +}
> +
> int pci_iommu_init_iotlb_notifier(PCIDevice *dev, IOMMUNotifier *n,
> IOMMUNotify fn, void *opaque)
> {
> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> index c54f2b53ae..0d3b351903 100644
> --- a/include/hw/pci/pci.h
> +++ b/include/hw/pci/pci.h
> @@ -652,6 +652,21 @@ typedef struct PCIIOMMUOps {
> uint32_t pasid, bool priv_req, bool exec_req,
> hwaddr addr, bool lpig, uint16_t prgi, bool is_read,
> bool is_write);
> + /**
> + * @get_msi_address_space: get the address space for MSI doorbell address
> + * for devices
> + *
> + * Optional callback which returns a pointer to an #AddressSpace. This
> + * is required if MSI doorbell also gets translated through IOMMU(eg: ARM)
IOMMU (
> + *
> + * @bus: the #PCIBus being accessed.
> + *
> + * @opaque: the data passed to pci_setup_iommu().
> + *
> + * @devfn: device and function number
> + */
> + AddressSpace * (*get_msi_address_space)(PCIBus *bus, void *opaque,
> + int devfn);
> } PCIIOMMUOps;
>
> bool pci_device_get_iommu_bus_devfn(PCIDevice *dev, PCIBus **piommu_bus,
> @@ -660,6 +675,7 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
> bool pci_device_set_iommu_device(PCIDevice *dev, HostIOMMUDevice *hiod,
> Error **errp);
> void pci_device_unset_iommu_device(PCIDevice *dev);
> +AddressSpace *pci_device_iommu_msi_address_space(PCIDevice *dev);
>
> /**
> * pci_device_get_viommu_flags: get vIOMMU flags.
> diff --git a/target/arm/kvm.c b/target/arm/kvm.c
> index b8a1c071f5..10eb8655c6 100644
> --- a/target/arm/kvm.c
> +++ b/target/arm/kvm.c
> @@ -1611,7 +1611,7 @@ int kvm_arm_set_irq(int cpu, int irqtype, int irq, int level)
> int kvm_arch_fixup_msi_route(struct kvm_irq_routing_entry *route,
> uint64_t address, uint32_t data, PCIDevice *dev)
> {
> - AddressSpace *as = pci_device_iommu_address_space(dev);
> + AddressSpace *as = pci_device_iommu_msi_address_space(dev);
> hwaddr xlat, len, doorbell_gpa;
> MemoryRegionSection mrs;
> MemoryRegion *mr;
Thanks
Eric
^ permalink raw reply [flat|nested] 118+ messages in thread
* [PATCH v4 12/27] hw/arm/smmuv3-accel: Make use of get_msi_address_space() callback
2025-09-29 13:36 [PATCH v4 00/27] hw/arm/virt: Add support for user-creatable accelerated SMMUv3 Shameer Kolothum
` (10 preceding siblings ...)
2025-09-29 13:36 ` [PATCH v4 11/27] hw/pci/pci: Introduce optional get_msi_address_space() callback Shameer Kolothum
@ 2025-09-29 13:36 ` Shameer Kolothum
2025-09-29 16:51 ` Jonathan Cameron via
` (2 more replies)
2025-09-29 13:36 ` [PATCH v4 13/27] hw/arm/smmuv3-accel: Add support to issue invalidation cmd to host Shameer Kolothum
` (15 subsequent siblings)
27 siblings, 3 replies; 118+ messages in thread
From: Shameer Kolothum @ 2025-09-29 13:36 UTC (permalink / raw)
To: qemu-arm, qemu-devel
Cc: eric.auger, peter.maydell, jgg, nicolinc, ddutile, berrange,
nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu,
shameerkolothum
Here we return the IOMMU address space if the device has S1 translation
enabled by the guest. Otherwise, return the system address space.
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
hw/arm/smmuv3-accel.c | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)
diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
index 790887ac31..f4e01fba6d 100644
--- a/hw/arm/smmuv3-accel.c
+++ b/hw/arm/smmuv3-accel.c
@@ -387,6 +387,26 @@ static void smmuv3_accel_unset_iommu_device(PCIBus *bus, void *opaque,
}
}
+static AddressSpace *smmuv3_accel_find_msi_as(PCIBus *bus, void *opaque,
+ int devfn)
+{
+ SMMUState *bs = opaque;
+ SMMUPciBus *sbus = smmu_get_sbus(bs, bus);
+ SMMUv3AccelDevice *accel_dev = smmuv3_accel_get_dev(bs, sbus, bus, devfn);
+ SMMUDevice *sdev = &accel_dev->sdev;
+
+ /*
+ * If the assigned vfio-pci dev has S1 translation enabled by
+ * Guest, return IOMMU address space for MSI translation.
+ * Otherwise, return system address space.
+ */
+ if (accel_dev->s1_hwpt) {
+ return &sdev->as;
+ } else {
+ return &address_space_memory;
+ }
+}
+
static bool smmuv3_accel_pdev_allowed(PCIDevice *pdev, bool *vfio_pci)
{
@@ -475,6 +495,7 @@ static const PCIIOMMUOps smmuv3_accel_ops = {
.get_viommu_flags = smmuv3_accel_get_viommu_flags,
.set_iommu_device = smmuv3_accel_set_iommu_device,
.unset_iommu_device = smmuv3_accel_unset_iommu_device,
+ .get_msi_address_space = smmuv3_accel_find_msi_as,
};
void smmuv3_accel_init(SMMUv3State *s)
--
2.43.0
^ permalink raw reply related [flat|nested] 118+ messages in thread

* Re: [PATCH v4 12/27] hw/arm/smmuv3-accel: Make use of get_msi_address_space() callback
2025-09-29 13:36 ` [PATCH v4 12/27] hw/arm/smmuv3-accel: Make use of " Shameer Kolothum
@ 2025-09-29 16:51 ` Jonathan Cameron via
2025-10-02 7:33 ` Shameer Kolothum
2025-10-16 23:28 ` Nicolin Chen
2025-10-20 16:43 ` Eric Auger
2 siblings, 1 reply; 118+ messages in thread
From: Jonathan Cameron via @ 2025-09-29 16:51 UTC (permalink / raw)
To: Shameer Kolothum
Cc: qemu-arm, qemu-devel, eric.auger, peter.maydell, jgg, nicolinc,
ddutile, berrange, nathanc, mochs, smostafa, wangzhou1,
jiangkunkun, zhangfei.gao, zhenzhong.duan, yi.l.liu,
shameerkolothum
On Mon, 29 Sep 2025 14:36:28 +0100
Shameer Kolothum <skolothumtho@nvidia.com> wrote:
> Here we return the IOMMU address space if the device has S1 translation
> enabled by Guest. Otherwise return system address space.
>
> Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
Naming question inline.
> ---
> hw/arm/smmuv3-accel.c | 21 +++++++++++++++++++++
> 1 file changed, 21 insertions(+)
>
> diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
> index 790887ac31..f4e01fba6d 100644
> --- a/hw/arm/smmuv3-accel.c
> +++ b/hw/arm/smmuv3-accel.c
> @@ -387,6 +387,26 @@ static void smmuv3_accel_unset_iommu_device(PCIBus *bus, void *opaque,
> }
> }
>
> +static AddressSpace *smmuv3_accel_find_msi_as(PCIBus *bus, void *opaque,
Why find rather than get for naming?
> + int devfn)
> +{
> + SMMUState *bs = opaque;
> + SMMUPciBus *sbus = smmu_get_sbus(bs, bus);
> + SMMUv3AccelDevice *accel_dev = smmuv3_accel_get_dev(bs, sbus, bus, devfn);
> + SMMUDevice *sdev = &accel_dev->sdev;
> +
> + /*
> + * If the assigned vfio-pci dev has S1 translation enabled by
> + * Guest, return IOMMU address space for MSI translation.
> + * Otherwise, return system address space.
> + */
> + if (accel_dev->s1_hwpt) {
> + return &sdev->as;
> + } else {
> + return &address_space_memory;
> + }
> +}
> +
> static bool smmuv3_accel_pdev_allowed(PCIDevice *pdev, bool *vfio_pci)
> {
>
> @@ -475,6 +495,7 @@ static const PCIIOMMUOps smmuv3_accel_ops = {
> .get_viommu_flags = smmuv3_accel_get_viommu_flags,
> .set_iommu_device = smmuv3_accel_set_iommu_device,
> .unset_iommu_device = smmuv3_accel_unset_iommu_device,
> + .get_msi_address_space = smmuv3_accel_find_msi_as,
> };
>
> void smmuv3_accel_init(SMMUv3State *s)
^ permalink raw reply [flat|nested] 118+ messages in thread

* RE: [PATCH v4 12/27] hw/arm/smmuv3-accel: Make use of get_msi_address_space() callback
2025-09-29 16:51 ` Jonathan Cameron via
@ 2025-10-02 7:33 ` Shameer Kolothum
0 siblings, 0 replies; 118+ messages in thread
From: Shameer Kolothum @ 2025-10-02 7:33 UTC (permalink / raw)
To: Jonathan Cameron
Cc: qemu-arm@nongnu.org, qemu-devel@nongnu.org, eric.auger@redhat.com,
peter.maydell@linaro.org, Jason Gunthorpe, Nicolin Chen,
ddutile@redhat.com, berrange@redhat.com, Nathan Chen, Matt Ochs,
smostafa@google.com, wangzhou1@hisilicon.com,
jiangkunkun@huawei.com, zhangfei.gao@linaro.org,
zhenzhong.duan@intel.com, yi.l.liu@intel.com,
shameerkolothum@gmail.com
> -----Original Message-----
> From: Jonathan Cameron <jonathan.cameron@huawei.com>
> Sent: 29 September 2025 17:51
> To: Shameer Kolothum <skolothumtho@nvidia.com>
> Cc: qemu-arm@nongnu.org; qemu-devel@nongnu.org;
> eric.auger@redhat.com; peter.maydell@linaro.org; Jason Gunthorpe
> <jgg@nvidia.com>; Nicolin Chen <nicolinc@nvidia.com>; ddutile@redhat.com;
> berrange@redhat.com; Nathan Chen <nathanc@nvidia.com>; Matt Ochs
> <mochs@nvidia.com>; smostafa@google.com; wangzhou1@hisilicon.com;
> jiangkunkun@huawei.com; zhangfei.gao@linaro.org;
> zhenzhong.duan@intel.com; yi.l.liu@intel.com;
> shameerkolothum@gmail.com
> Subject: Re: [PATCH v4 12/27] hw/arm/smmuv3-accel: Make use of
> get_msi_address_space() callback
>
>
>
> On Mon, 29 Sep 2025 14:36:28 +0100
> Shameer Kolothum <skolothumtho@nvidia.com> wrote:
>
> > Here we return the IOMMU address space if the device has S1 translation
> > enabled by Guest. Otherwise return system address space.
> >
> > Signed-off-by: Shameer Kolothum
> <shameerali.kolothum.thodi@huawei.com>
> > Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
> Naming question inline.
>
> > ---
> > hw/arm/smmuv3-accel.c | 21 +++++++++++++++++++++
> > 1 file changed, 21 insertions(+)
> >
> > diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
> > index 790887ac31..f4e01fba6d 100644
> > --- a/hw/arm/smmuv3-accel.c
> > +++ b/hw/arm/smmuv3-accel.c
> > @@ -387,6 +387,26 @@ static void
> smmuv3_accel_unset_iommu_device(PCIBus *bus, void *opaque,
> > }
> > }
> >
> > +static AddressSpace *smmuv3_accel_find_msi_as(PCIBus *bus, void
> *opaque,
>
> Why find rather than get for naming?
I just followed the get_address_space() convention. Will revisit.
Thanks,
Shameer
>
> > + int devfn)
> > +{
> > + SMMUState *bs = opaque;
> > + SMMUPciBus *sbus = smmu_get_sbus(bs, bus);
> > + SMMUv3AccelDevice *accel_dev = smmuv3_accel_get_dev(bs, sbus, bus,
> devfn);
> > + SMMUDevice *sdev = &accel_dev->sdev;
> > +
> > + /*
> > + * If the assigned vfio-pci dev has S1 translation enabled by
> > + * Guest, return IOMMU address space for MSI translation.
> > + * Otherwise, return system address space.
> > + */
> > + if (accel_dev->s1_hwpt) {
> > + return &sdev->as;
> > + } else {
> > + return &address_space_memory;
> > + }
> > +}
> > +
> > static bool smmuv3_accel_pdev_allowed(PCIDevice *pdev, bool *vfio_pci)
> > {
> >
> > @@ -475,6 +495,7 @@ static const PCIIOMMUOps smmuv3_accel_ops = {
> > .get_viommu_flags = smmuv3_accel_get_viommu_flags,
> > .set_iommu_device = smmuv3_accel_set_iommu_device,
> > .unset_iommu_device = smmuv3_accel_unset_iommu_device,
> > + .get_msi_address_space = smmuv3_accel_find_msi_as,
> > };
> >
> > void smmuv3_accel_init(SMMUv3State *s)
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [PATCH v4 12/27] hw/arm/smmuv3-accel: Make use of get_msi_address_space() callback
2025-09-29 13:36 ` [PATCH v4 12/27] hw/arm/smmuv3-accel: Make use of " Shameer Kolothum
2025-09-29 16:51 ` Jonathan Cameron via
@ 2025-10-16 23:28 ` Nicolin Chen
2025-10-20 16:43 ` Eric Auger
2 siblings, 0 replies; 118+ messages in thread
From: Nicolin Chen @ 2025-10-16 23:28 UTC (permalink / raw)
To: Shameer Kolothum
Cc: qemu-arm, qemu-devel, eric.auger, peter.maydell, jgg, ddutile,
berrange, nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu,
shameerkolothum
On Mon, Sep 29, 2025 at 02:36:28PM +0100, Shameer Kolothum wrote:
> Here we return the IOMMU address space if the device has S1 translation
> enabled by Guest. Otherwise return system address space.
>
> Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
Apart from the naming that Jonathan pointed out,
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [PATCH v4 12/27] hw/arm/smmuv3-accel: Make use of get_msi_address_space() callback
2025-09-29 13:36 ` [PATCH v4 12/27] hw/arm/smmuv3-accel: Make use of " Shameer Kolothum
2025-09-29 16:51 ` Jonathan Cameron via
2025-10-16 23:28 ` Nicolin Chen
@ 2025-10-20 16:43 ` Eric Auger
2025-10-21 8:15 ` Shameer Kolothum
2 siblings, 1 reply; 118+ messages in thread
From: Eric Auger @ 2025-10-20 16:43 UTC (permalink / raw)
To: Shameer Kolothum, qemu-arm, qemu-devel
Cc: peter.maydell, jgg, nicolinc, ddutile, berrange, nathanc, mochs,
smostafa, wangzhou1, jiangkunkun, jonathan.cameron, zhangfei.gao,
zhenzhong.duan, yi.l.liu, shameerkolothum
Hi Shameer,
On 9/29/25 3:36 PM, Shameer Kolothum wrote:
> Here we return the IOMMU address space if the device has S1 translation
> enabled by Guest. Otherwise return system address space.
>
> Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
> ---
> hw/arm/smmuv3-accel.c | 21 +++++++++++++++++++++
> 1 file changed, 21 insertions(+)
>
> diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
> index 790887ac31..f4e01fba6d 100644
> --- a/hw/arm/smmuv3-accel.c
> +++ b/hw/arm/smmuv3-accel.c
> @@ -387,6 +387,26 @@ static void smmuv3_accel_unset_iommu_device(PCIBus *bus, void *opaque,
> }
> }
>
> +static AddressSpace *smmuv3_accel_find_msi_as(PCIBus *bus, void *opaque,
> + int devfn)
> +{
> + SMMUState *bs = opaque;
> + SMMUPciBus *sbus = smmu_get_sbus(bs, bus);
> + SMMUv3AccelDevice *accel_dev = smmuv3_accel_get_dev(bs, sbus, bus, devfn);
> + SMMUDevice *sdev = &accel_dev->sdev;
> +
> + /*
> + * If the assigned vfio-pci dev has S1 translation enabled by
> + * Guest, return IOMMU address space for MSI translation.
> + * Otherwise, return system address space.
> + */
> + if (accel_dev->s1_hwpt) {
> + return &sdev->as;
> + } else {
> + return &address_space_memory;
> + }
At the moment I don't understand this code either. In the case of an
emulated device it returns address_space_memory, whereas I would have
expected the opposite. I definitely need to trace things ;-)
Thanks
Eric
> +}
> +
> static bool smmuv3_accel_pdev_allowed(PCIDevice *pdev, bool *vfio_pci)
> {
>
> @@ -475,6 +495,7 @@ static const PCIIOMMUOps smmuv3_accel_ops = {
> .get_viommu_flags = smmuv3_accel_get_viommu_flags,
> .set_iommu_device = smmuv3_accel_set_iommu_device,
> .unset_iommu_device = smmuv3_accel_unset_iommu_device,
> + .get_msi_address_space = smmuv3_accel_find_msi_as,
> };
>
> void smmuv3_accel_init(SMMUv3State *s)
^ permalink raw reply [flat|nested] 118+ messages in thread

* RE: [PATCH v4 12/27] hw/arm/smmuv3-accel: Make use of get_msi_address_space() callback
2025-10-20 16:43 ` Eric Auger
@ 2025-10-21 8:15 ` Shameer Kolothum
2025-10-21 16:16 ` Eric Auger
0 siblings, 1 reply; 118+ messages in thread
From: Shameer Kolothum @ 2025-10-21 8:15 UTC (permalink / raw)
To: eric.auger@redhat.com, qemu-arm@nongnu.org, qemu-devel@nongnu.org
Cc: peter.maydell@linaro.org, Jason Gunthorpe, Nicolin Chen,
ddutile@redhat.com, berrange@redhat.com, Nathan Chen, Matt Ochs,
smostafa@google.com, wangzhou1@hisilicon.com,
jiangkunkun@huawei.com, jonathan.cameron@huawei.com,
zhangfei.gao@linaro.org, zhenzhong.duan@intel.com,
yi.l.liu@intel.com, shameerkolothum@gmail.com
Hi Eric,
> -----Original Message-----
> From: Eric Auger <eric.auger@redhat.com>
> Sent: 20 October 2025 17:44
> To: Shameer Kolothum <skolothumtho@nvidia.com>; qemu-
> arm@nongnu.org; qemu-devel@nongnu.org
> Cc: peter.maydell@linaro.org; Jason Gunthorpe <jgg@nvidia.com>; Nicolin
> Chen <nicolinc@nvidia.com>; ddutile@redhat.com; berrange@redhat.com;
> Nathan Chen <nathanc@nvidia.com>; Matt Ochs <mochs@nvidia.com>;
> smostafa@google.com; wangzhou1@hisilicon.com;
> jiangkunkun@huawei.com; jonathan.cameron@huawei.com;
> zhangfei.gao@linaro.org; zhenzhong.duan@intel.com; yi.l.liu@intel.com;
> shameerkolothum@gmail.com
> Subject: Re: [PATCH v4 12/27] hw/arm/smmuv3-accel: Make use of
> get_msi_address_space() callback
>
>
>
> Hi Shameer,
>
> On 9/29/25 3:36 PM, Shameer Kolothum wrote:
> > Here we return the IOMMU address space if the device has S1 translation
> > enabled by Guest. Otherwise return system address space.
> >
> > Signed-off-by: Shameer Kolothum
> <shameerali.kolothum.thodi@huawei.com>
> > Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
> > ---
> > hw/arm/smmuv3-accel.c | 21 +++++++++++++++++++++
> > 1 file changed, 21 insertions(+)
> >
> > diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
> > index 790887ac31..f4e01fba6d 100644
> > --- a/hw/arm/smmuv3-accel.c
> > +++ b/hw/arm/smmuv3-accel.c
> > @@ -387,6 +387,26 @@ static void
> smmuv3_accel_unset_iommu_device(PCIBus *bus, void *opaque,
> > }
> > }
> >
> > +static AddressSpace *smmuv3_accel_find_msi_as(PCIBus *bus, void
> *opaque,
> > + int devfn)
> > +{
> > + SMMUState *bs = opaque;
> > + SMMUPciBus *sbus = smmu_get_sbus(bs, bus);
> > + SMMUv3AccelDevice *accel_dev = smmuv3_accel_get_dev(bs, sbus,
> bus, devfn);
> > + SMMUDevice *sdev = &accel_dev->sdev;
> > +
> > + /*
> > + * If the assigned vfio-pci dev has S1 translation enabled by
> > + * Guest, return IOMMU address space for MSI translation.
> > + * Otherwise, return system address space.
> > + */
> > + if (accel_dev->s1_hwpt) {
> > + return &sdev->as;
> > + } else {
> > + return &address_space_memory;
> > + }
> At the moment I don't understand this code either. In case of emulated
> device it then returns address_space_memory whereas I would have
> expected the opposite. I definitively need to trace things ;-)
We have,
[VIRT_GIC_ITS] = { 0x08080000, 0x00020000 },
I added a few prints in kvm_arch_fixup_msi_route() so that it may help
to understand how the translation of MSI doorbell is performed here.
If we return the IOMMU address space (&sdev->as) here,

kvm_arch_fixup_msi_route: MSI IOVA=0xffbf0040 msi_addr_lo=0xffbf0040 msi_addr_hi=0x0
kvm_arch_fixup_msi_route: Translated doorbell_gpa= 0x8090040
kvm_arch_fixup_msi_route: ret:MSI IOVA=0xffbf0040 translated: msi_addr_lo=0x8090040 msi_addr_hi=0x0

It gets the correct vITS GPA after the translation through address_space_translate().

Since the host uses the (MSI_IOVA_BASE, MSI_IOVA_LENGTH) range for the ITS
doorbell mapping, and the IORT RMR guarantees an identity mapping for that
range, it all works fine.

Now, suppose we return the system address space (&address_space_memory):

kvm_arch_fixup_msi_route: MSI IOVA=0xffbf0040 msi_addr_lo 0xffbf0040 msi_addr_hi 0x0
kvm_arch_fixup_msi_route: address_space_memory, nothing to do, return

And the device doorbell gets configured with the gIOVA 0xffbf0040 instead of
the vITS GPA, as Nicolin explained in the other thread.
Hope this helps.
Thanks,
Shameer
^ permalink raw reply [flat|nested] 118+ messages in thread

* Re: [PATCH v4 12/27] hw/arm/smmuv3-accel: Make use of get_msi_address_space() callback
2025-10-21 8:15 ` Shameer Kolothum
@ 2025-10-21 16:16 ` Eric Auger
0 siblings, 0 replies; 118+ messages in thread
From: Eric Auger @ 2025-10-21 16:16 UTC (permalink / raw)
To: Shameer Kolothum, qemu-arm@nongnu.org, qemu-devel@nongnu.org
Cc: peter.maydell@linaro.org, Jason Gunthorpe, Nicolin Chen,
ddutile@redhat.com, berrange@redhat.com, Nathan Chen, Matt Ochs,
smostafa@google.com, wangzhou1@hisilicon.com,
jiangkunkun@huawei.com, jonathan.cameron@huawei.com,
zhangfei.gao@linaro.org, zhenzhong.duan@intel.com,
yi.l.liu@intel.com, shameerkolothum@gmail.com
Hi Shameer,
On 10/21/25 10:15 AM, Shameer Kolothum wrote:
> Hi Eric,
>
>> -----Original Message-----
>> From: Eric Auger <eric.auger@redhat.com>
>> Sent: 20 October 2025 17:44
>> To: Shameer Kolothum <skolothumtho@nvidia.com>; qemu-
>> arm@nongnu.org; qemu-devel@nongnu.org
>> Cc: peter.maydell@linaro.org; Jason Gunthorpe <jgg@nvidia.com>; Nicolin
>> Chen <nicolinc@nvidia.com>; ddutile@redhat.com; berrange@redhat.com;
>> Nathan Chen <nathanc@nvidia.com>; Matt Ochs <mochs@nvidia.com>;
>> smostafa@google.com; wangzhou1@hisilicon.com;
>> jiangkunkun@huawei.com; jonathan.cameron@huawei.com;
>> zhangfei.gao@linaro.org; zhenzhong.duan@intel.com; yi.l.liu@intel.com;
>> shameerkolothum@gmail.com
>> Subject: Re: [PATCH v4 12/27] hw/arm/smmuv3-accel: Make use of
>> get_msi_address_space() callback
>>
>> External email: Use caution opening links or attachments
>>
>>
>> Hi Shameer,
>>
>> On 9/29/25 3:36 PM, Shameer Kolothum wrote:
>>> Here we return the IOMMU address space if the device has S1 translation
>>> enabled by Guest. Otherwise return system address space.
>>>
>>> Signed-off-by: Shameer Kolothum
>> <shameerali.kolothum.thodi@huawei.com>
>>> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
>>> ---
>>> hw/arm/smmuv3-accel.c | 21 +++++++++++++++++++++
>>> 1 file changed, 21 insertions(+)
>>>
>>> diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
>>> index 790887ac31..f4e01fba6d 100644
>>> --- a/hw/arm/smmuv3-accel.c
>>> +++ b/hw/arm/smmuv3-accel.c
>>> @@ -387,6 +387,26 @@ static void
>> smmuv3_accel_unset_iommu_device(PCIBus *bus, void *opaque,
>>> }
>>> }
>>>
>>> +static AddressSpace *smmuv3_accel_find_msi_as(PCIBus *bus, void
>> *opaque,
>>> + int devfn)
>>> +{
>>> + SMMUState *bs = opaque;
>>> + SMMUPciBus *sbus = smmu_get_sbus(bs, bus);
>>> + SMMUv3AccelDevice *accel_dev = smmuv3_accel_get_dev(bs, sbus,
>> bus, devfn);
>>> + SMMUDevice *sdev = &accel_dev->sdev;
>>> +
>>> + /*
>>> + * If the assigned vfio-pci dev has S1 translation enabled by
>>> + * Guest, return IOMMU address space for MSI translation.
>>> + * Otherwise, return system address space.
>>> + */
>>> + if (accel_dev->s1_hwpt) {
>>> + return &sdev->as;
>>> + } else {
>>> + return &address_space_memory;
>>> + }
>> At the moment I don't understand this code either. In case of emulated
>> device it then returns address_space_memory whereas I would have
>> expected the opposite. I definitively need to trace things ;-)
Thank you for the traces!
> We have,
> [VIRT_GIC_ITS] = { 0x08080000, 0x00020000 },
>
> I added a few prints in kvm_arch_fixup_msi_route() so that it may help
> to understand how the translation of MSI doorbell is performed here.
>
> If we return IOMMU addr space(&sdev->as) here,
>
> kvm_arch_fixup_msi_route: MSI IOVA=0xffbf0040 msi_addr_lo=0xffbf0040 msi_addr_hi=0x0
so this is the gIOVA
> kvm_arch_fixup_msi_route: Translated doorbell_gpa= 0x8090040
> kvm_arch_fixup_msi_route: ret:MSI IOVA=0xffbf0040 translated: msi_addr_lo=0x8090040 msi_addr_hi=0x0
>
> It gets the correct vITS gpA address after the translation through address_space_translate().
I agree it needs to be translated into the vITS doorbell reg.
>
> Since host uses the (MSI_IOVA_BASE, MSI_IOVA_LENGTH) for ITS doorbell mapping
> and using IORT RMR we make sure there is an identity mapping for that range, it all
> works fine.
>
> Now, suppose if we return system addr space(&address_space_memory):
>
> kvm_arch_fixup_msi_route: MSI IOVA=0xffbf0040 msi_addr_lo 0xffbf0040 msi_addr_hi 0x0
> kvm_arch_fixup_msi_route: address_space_memory, nothing to do, return
>
> And the device doorbell gets configured with gIOVA 0xffbf0040 instead of the vITS gPA
> as Nicolin explained in the other thread.
I agree that for MSI support you must rely on the IOMMU MR translate
function, even for VFIO devices.
Thanks
Eric
>
> Hope this helps.
>
> Thanks,
> Shameer
>
^ permalink raw reply [flat|nested] 118+ messages in thread
* [PATCH v4 13/27] hw/arm/smmuv3-accel: Add support to issue invalidation cmd to host
2025-09-29 13:36 [PATCH v4 00/27] hw/arm/virt: Add support for user-creatable accelerated SMMUv3 Shameer Kolothum
` (11 preceding siblings ...)
2025-09-29 13:36 ` [PATCH v4 12/27] hw/arm/smmuv3-accel: Make use of " Shameer Kolothum
@ 2025-09-29 13:36 ` Shameer Kolothum
2025-09-29 16:53 ` Jonathan Cameron via
2025-10-16 22:59 ` Nicolin Chen via
2025-09-29 13:36 ` [PATCH v4 14/27] hw/arm/smmuv3-accel: Get host SMMUv3 hw info and validate Shameer Kolothum
` (14 subsequent siblings)
27 siblings, 2 replies; 118+ messages in thread
From: Shameer Kolothum @ 2025-09-29 13:36 UTC (permalink / raw)
To: qemu-arm, qemu-devel
Cc: eric.auger, peter.maydell, jgg, nicolinc, ddutile, berrange,
nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu,
shameerkolothum
Provide a helper and use it to issue invalidation commands to the host
SMMUv3. We only issue one command at a time for now.
Support for batching of commands will be added later, after analysing the
impact.
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
hw/arm/smmuv3-accel.c | 38 ++++++++++++++++++++++++++++++++++++++
hw/arm/smmuv3-accel.h | 8 ++++++++
hw/arm/smmuv3.c | 30 ++++++++++++++++++++++++++++++
3 files changed, 76 insertions(+)
diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
index f4e01fba6d..9ad8595ce2 100644
--- a/hw/arm/smmuv3-accel.c
+++ b/hw/arm/smmuv3-accel.c
@@ -218,6 +218,44 @@ bool smmuv3_accel_install_nested_ste_range(SMMUv3State *s, SMMUSIDRange *range,
return true;
}
+/*
+ * This issues the invalidation cmd to the host SMMUv3.
+ * Note: sdev can be NULL for certain invalidation commands
+ * e.g., SMMU_CMD_TLBI_NH_ASID, SMMU_CMD_TLBI_NH_VA etc.
+ */
+bool smmuv3_accel_issue_inv_cmd(SMMUv3State *bs, void *cmd, SMMUDevice *sdev,
+ Error **errp)
+{
+ SMMUv3State *s = ARM_SMMUV3(bs);
+ SMMUv3AccelState *s_accel = s->s_accel;
+ IOMMUFDViommu *viommu_core;
+ uint32_t entry_num = 1;
+
+ if (!s->accel || !s_accel->viommu) {
+ return true;
+ }
+
+ /*
+ * We may end up here for any emulated PCI bridge or root port type devices.
+ * However, passing invalidation commands with sid (eg: CFGI_CD) to host
+ * SMMUv3 only matters for vfio-pci endpoint devices. Hence check that if
+ * sdev is valid.
+ */
+ if (sdev) {
+ SMMUv3AccelDevice *accel_dev = container_of(sdev, SMMUv3AccelDevice,
+ sdev);
+ if (!accel_dev->vdev) {
+ return true;
+ }
+ }
+
+ viommu_core = &s_accel->viommu->core;
+ return iommufd_backend_invalidate_cache(
+ viommu_core->iommufd, viommu_core->viommu_id,
+ IOMMU_VIOMMU_INVALIDATE_DATA_ARM_SMMUV3,
+ sizeof(Cmd), &entry_num, cmd, errp);
+}
+
static SMMUv3AccelDevice *smmuv3_accel_get_dev(SMMUState *bs, SMMUPciBus *sbus,
PCIBus *bus, int devfn)
{
diff --git a/hw/arm/smmuv3-accel.h b/hw/arm/smmuv3-accel.h
index 6242614c00..3bdba47616 100644
--- a/hw/arm/smmuv3-accel.h
+++ b/hw/arm/smmuv3-accel.h
@@ -46,6 +46,8 @@ bool smmuv3_accel_install_nested_ste(SMMUv3State *s, SMMUDevice *sdev, int sid,
Error **errp);
bool smmuv3_accel_install_nested_ste_range(SMMUv3State *s, SMMUSIDRange *range,
Error **errp);
+bool smmuv3_accel_issue_inv_cmd(SMMUv3State *s, void *cmd, SMMUDevice *sdev,
+ Error **errp);
#else
static inline void smmuv3_accel_init(SMMUv3State *s)
{
@@ -62,6 +64,12 @@ smmuv3_accel_install_nested_ste_range(SMMUv3State *s, SMMUSIDRange *range,
{
return true;
}
+static inline bool
+smmuv3_accel_issue_inv_cmd(SMMUv3State *s, void *cmd, SMMUDevice *sdev,
+ Error **errp)
+{
+ return true;
+}
#endif
#endif /* HW_ARM_SMMUV3_ACCEL_H */
diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index 1fd8aaa0c7..3963bdc87f 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -1381,6 +1381,7 @@ static int smmuv3_cmdq_consume(SMMUv3State *s)
{
uint32_t sid = CMD_SID(&cmd);
SMMUDevice *sdev = smmu_find_sdev(bs, sid);
+ Error *local_err = NULL;
if (CMD_SSEC(&cmd)) {
cmd_error = SMMU_CERROR_ILL;
@@ -1393,11 +1394,17 @@ static int smmuv3_cmdq_consume(SMMUv3State *s)
trace_smmuv3_cmdq_cfgi_cd(sid);
smmuv3_flush_config(sdev);
+ if (!smmuv3_accel_issue_inv_cmd(s, &cmd, sdev, &local_err)) {
+ error_report_err(local_err);
+ cmd_error = SMMU_CERROR_ILL;
+ break;
+ }
break;
}
case SMMU_CMD_TLBI_NH_ASID:
{
int asid = CMD_ASID(&cmd);
+ Error *local_err = NULL;
int vmid = -1;
if (!STAGE1_SUPPORTED(s)) {
@@ -1416,6 +1423,11 @@ static int smmuv3_cmdq_consume(SMMUv3State *s)
trace_smmuv3_cmdq_tlbi_nh_asid(asid);
smmu_inv_notifiers_all(&s->smmu_state);
smmu_iotlb_inv_asid_vmid(bs, asid, vmid);
+ if (!smmuv3_accel_issue_inv_cmd(s, &cmd, NULL, &local_err)) {
+ error_report_err(local_err);
+ cmd_error = SMMU_CERROR_ILL;
+ break;
+ }
break;
}
case SMMU_CMD_TLBI_NH_ALL:
@@ -1440,18 +1452,36 @@ static int smmuv3_cmdq_consume(SMMUv3State *s)
QEMU_FALLTHROUGH;
}
case SMMU_CMD_TLBI_NSNH_ALL:
+ {
+ Error *local_err = NULL;
+
trace_smmuv3_cmdq_tlbi_nsnh();
smmu_inv_notifiers_all(&s->smmu_state);
smmu_iotlb_inv_all(bs);
+ if (!smmuv3_accel_issue_inv_cmd(s, &cmd, NULL, &local_err)) {
+ error_report_err(local_err);
+ cmd_error = SMMU_CERROR_ILL;
+ break;
+ }
break;
+ }
case SMMU_CMD_TLBI_NH_VAA:
case SMMU_CMD_TLBI_NH_VA:
+ {
+ Error *local_err = NULL;
+
if (!STAGE1_SUPPORTED(s)) {
cmd_error = SMMU_CERROR_ILL;
break;
}
smmuv3_range_inval(bs, &cmd, SMMU_STAGE_1);
+ if (!smmuv3_accel_issue_inv_cmd(s, &cmd, NULL, &local_err)) {
+ error_report_err(local_err);
+ cmd_error = SMMU_CERROR_ILL;
+ break;
+ }
break;
+ }
case SMMU_CMD_TLBI_S12_VMALL:
{
int vmid = CMD_VMID(&cmd);
--
2.43.0
^ permalink raw reply related [flat|nested] 118+ messages in thread

* Re: [PATCH v4 13/27] hw/arm/smmuv3-accel: Add support to issue invalidation cmd to host
2025-09-29 13:36 ` [PATCH v4 13/27] hw/arm/smmuv3-accel: Add support to issue invalidation cmd to host Shameer Kolothum
@ 2025-09-29 16:53 ` Jonathan Cameron via
2025-10-16 22:59 ` Nicolin Chen via
1 sibling, 0 replies; 118+ messages in thread
From: Jonathan Cameron via @ 2025-09-29 16:53 UTC (permalink / raw)
To: Shameer Kolothum
Cc: qemu-arm, qemu-devel, eric.auger, peter.maydell, jgg, nicolinc,
ddutile, berrange, nathanc, mochs, smostafa, wangzhou1,
jiangkunkun, zhangfei.gao, zhenzhong.duan, yi.l.liu,
shameerkolothum
On Mon, 29 Sep 2025 14:36:29 +0100
Shameer Kolothum <skolothumtho@nvidia.com> wrote:
> Provide a helper and use that to issue the invalidation cmd to host SMMUv3.
> We only issue one cmd at a time for now.
>
> Support for batching of commands will be added later after analysing the
> impact.
>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
LGTM
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [PATCH v4 13/27] hw/arm/smmuv3-accel: Add support to issue invalidation cmd to host
2025-09-29 13:36 ` [PATCH v4 13/27] hw/arm/smmuv3-accel: Add support to issue invalidation cmd to host Shameer Kolothum
2025-09-29 16:53 ` Jonathan Cameron via
@ 2025-10-16 22:59 ` Nicolin Chen via
1 sibling, 0 replies; 118+ messages in thread
From: Nicolin Chen via @ 2025-10-16 22:59 UTC (permalink / raw)
To: Shameer Kolothum
Cc: qemu-arm, qemu-devel, eric.auger, peter.maydell, jgg, ddutile,
berrange, nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu,
shameerkolothum
On Mon, Sep 29, 2025 at 02:36:29PM +0100, Shameer Kolothum wrote:
> Provide a helper and use that to issue the invalidation cmd to host SMMUv3.
> We only issue one cmd at a time for now.
>
> Support for batching of commands will be added later after analysing the
> impact.
>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
> ---
> hw/arm/smmuv3-accel.c | 38 ++++++++++++++++++++++++++++++++++++++
> hw/arm/smmuv3-accel.h | 8 ++++++++
> hw/arm/smmuv3.c | 30 ++++++++++++++++++++++++++++++
> 3 files changed, 76 insertions(+)
>
> diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
> index f4e01fba6d..9ad8595ce2 100644
> --- a/hw/arm/smmuv3-accel.c
> +++ b/hw/arm/smmuv3-accel.c
> @@ -218,6 +218,44 @@ bool smmuv3_accel_install_nested_ste_range(SMMUv3State *s, SMMUSIDRange *range,
> return true;
> }
>
> +/*
> + * This issues the invalidation cmd to the host SMMUv3.
> + * Note: sdev can be NULL for certain invalidation commands
> + * e.g., SMMU_CMD_TLBI_NH_ASID, SMMU_CMD_TLBI_NH_VA etc.
* Note a TLBI command is passed in with a NULL sdev.
> + */
> +bool smmuv3_accel_issue_inv_cmd(SMMUv3State *bs, void *cmd, SMMUDevice *sdev,
> + Error **errp)
> +{
> + SMMUv3State *s = ARM_SMMUV3(bs);
> + SMMUv3AccelState *s_accel = s->s_accel;
> + IOMMUFDViommu *viommu_core;
> + uint32_t entry_num = 1;
> +
> + if (!s->accel || !s_accel->viommu) {
if (!accel || !s_accel->viommu) {
> + /*
> + * We may end up here for any emulated PCI bridge or root port type devices.
> + * However, passing invalidation commands with sid (eg: CFGI_CD) to host
> + * SMMUv3 only matters for vfio-pci endpoint devices. Hence check that if
> + * sdev is valid.
> + */
I think we should use "allowed" over "matters".
> + if (sdev) {
> + SMMUv3AccelDevice *accel_dev = container_of(sdev, SMMUv3AccelDevice,
> + sdev);
> + if (!accel_dev->vdev) {
> + return true;
> + }
> + }
And we could simplify with:
/*
* Only a device associated with the vIOMMU (by having a valid vdev) is
* allowed to flush its device cache
*/
if (sdev && !container_of(sdev, SMMUv3AccelDevice, sdev)->vdev) {
return true;
}
> @@ -1440,18 +1452,36 @@ static int smmuv3_cmdq_consume(SMMUv3State *s)
> QEMU_FALLTHROUGH;
> }
> case SMMU_CMD_TLBI_NSNH_ALL:
> + {
> + Error *local_err = NULL;
> +
> trace_smmuv3_cmdq_tlbi_nsnh();
> smmu_inv_notifiers_all(&s->smmu_state);
> smmu_iotlb_inv_all(bs);
> + if (!smmuv3_accel_issue_inv_cmd(s, &cmd, NULL, &local_err)) {
> + error_report_err(local_err);
> + cmd_error = SMMU_CERROR_ILL;
> + break;
> + }
> break;
> + }
local_err is not used but only printed. It'd be cleaner to move the
error reporting inside smmuv3_accel_issue_inv_cmd(). Then you would not
need to add "{}" nor have an "Error *" parameter in the helper.
Only cmd_error should stay.
With these,
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
^ permalink raw reply [flat|nested] 118+ messages in thread
* [PATCH v4 14/27] hw/arm/smmuv3-accel: Get host SMMUv3 hw info and validate
2025-09-29 13:36 [PATCH v4 00/27] hw/arm/virt: Add support for user-creatable accelerated SMMUv3 Shameer Kolothum
` (12 preceding siblings ...)
2025-09-29 13:36 ` [PATCH v4 13/27] hw/arm/smmuv3-accel: Add support to issue invalidation cmd to host Shameer Kolothum
@ 2025-09-29 13:36 ` Shameer Kolothum
2025-10-01 12:56 ` Jonathan Cameron via
2025-09-29 13:36 ` [PATCH v4 15/27] acpi/gpex: Fix PCI Express Slot Information function 0 returned value Shameer Kolothum
` (13 subsequent siblings)
27 siblings, 1 reply; 118+ messages in thread
From: Shameer Kolothum @ 2025-09-29 13:36 UTC (permalink / raw)
To: qemu-arm, qemu-devel
Cc: eric.auger, peter.maydell, jgg, nicolinc, ddutile, berrange,
nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu,
shameerkolothum
Just before the device gets attached to the SMMUv3, make sure QEMU SMMUv3
features are compatible with the host SMMUv3.
Not all fields in the host SMMUv3 IDR registers are meaningful for userspace.
Only the following fields can be used:
- IDR0: ST_LEVEL, TERM_MODEL, STALL_MODEL, TTENDIAN, CD2L, ASID16, TTF
- IDR1: SIDSIZE, SSIDSIZE
- IDR3: BBML, RIL
- IDR5: VAX, GRAN64K, GRAN16K, GRAN4K
For now, the check is to make sure the features are in sync to enable
basic accelerated SMMUv3 support.
One other related change: move smmuv3_init_regs() to smmu_realize() so
that the registers are initialized early enough for the check mentioned above.
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
hw/arm/smmuv3-accel.c | 98 +++++++++++++++++++++++++++++++++++++++++++
hw/arm/smmuv3.c | 4 +-
2 files changed, 100 insertions(+), 2 deletions(-)
diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
index 9ad8595ce2..defeddbd8c 100644
--- a/hw/arm/smmuv3-accel.c
+++ b/hw/arm/smmuv3-accel.c
@@ -39,6 +39,96 @@
#define STE1_MASK (STE1_ETS | STE1_S1STALLD | STE1_S1CSH | STE1_S1COR | \
STE1_S1CIR | STE1_S1DSS)
+static bool
+smmuv3_accel_check_hw_compatible(SMMUv3State *s,
+ struct iommu_hw_info_arm_smmuv3 *info,
+ Error **errp)
+{
+ uint32_t val;
+
+ /*
+ * QEMU SMMUv3 supports both linear and 2-level stream tables.
+ */
+ val = FIELD_EX32(info->idr[0], IDR0, STLEVEL);
+ if (val != FIELD_EX32(s->idr[0], IDR0, STLEVEL)) {
+ s->idr[0] = FIELD_DP32(s->idr[0], IDR0, STLEVEL, val);
+ error_setg(errp, "Host SUMMUv3 differs in Stream Table format");
+ return false;
+ }
+
+ /* QEMU SMMUv3 supports only little-endian translation table walks */
+ val = FIELD_EX32(info->idr[0], IDR0, TTENDIAN);
+ if (!val && val > FIELD_EX32(s->idr[0], IDR0, TTENDIAN)) {
+ error_setg(errp, "Host SUMMUv3 doesn't support Little-endian "
+ "translation table");
+ return false;
+ }
+
+ /* QEMU SMMUv3 supports only AArch64 translation table format */
+ val = FIELD_EX32(info->idr[0], IDR0, TTF);
+ if (val < FIELD_EX32(s->idr[0], IDR0, TTF)) {
+ error_setg(errp, "Host SUMMUv3 deosn't support Arch64 Translation "
+ "table format");
+ return false;
+ }
+
+ /* QEMU SMMUv3 supports SIDSIZE 16 */
+ val = FIELD_EX32(info->idr[1], IDR1, SIDSIZE);
+ if (val < FIELD_EX32(s->idr[1], IDR1, SIDSIZE)) {
+ error_setg(errp, "Host SUMMUv3 SIDSIZE not compatible");
+ return false;
+ }
+
+ /* QEMU SMMUv3 supports Range Invalidation by default */
+ val = FIELD_EX32(info->idr[3], IDR3, RIL);
+ if (val != FIELD_EX32(s->idr[3], IDR3, RIL)) {
+ error_setg(errp, "Host SUMMUv3 deosn't support Range Invalidation");
+ return false;
+ }
+
+ val = FIELD_EX32(info->idr[5], IDR5, GRAN4K);
+ if (val != FIELD_EX32(s->idr[5], IDR5, GRAN4K)) {
+ error_setg(errp, "Host SMMUv3 doesn't support 64K translation granule");
+ return false;
+ }
+ val = FIELD_EX32(info->idr[5], IDR5, GRAN16K);
+ if (val != FIELD_EX32(s->idr[5], IDR5, GRAN16K)) {
+ error_setg(errp, "Host SMMUv3 doesn't support 16K translation granule");
+ return false;
+ }
+ val = FIELD_EX32(info->idr[5], IDR5, GRAN64K);
+ if (val != FIELD_EX32(s->idr[5], IDR5, GRAN64K)) {
+ error_setg(errp, "Host SMMUv3 doesn't support 16K translation granule");
+ return false;
+ }
+ return true;
+}
+
+static bool
+smmuv3_accel_hw_compatible(SMMUv3State *s, HostIOMMUDeviceIOMMUFD *idev,
+ Error **errp)
+{
+ struct iommu_hw_info_arm_smmuv3 info;
+ uint32_t data_type;
+ uint64_t caps;
+
+ if (!iommufd_backend_get_device_info(idev->iommufd, idev->devid, &data_type,
+ &info, sizeof(info), &caps, errp)) {
+ return false;
+ }
+
+ if (data_type != IOMMU_HW_INFO_TYPE_ARM_SMMUV3) {
+ error_setg(errp, "Wrong data type (%d) for Host SMMUv3 device info",
+ data_type);
+ return false;
+ }
+
+ if (!smmuv3_accel_check_hw_compatible(s, &info, errp)) {
+ return false;
+ }
+ return true;
+}
+
static bool
smmuv3_accel_alloc_vdev(SMMUv3AccelDevice *accel_dev, int sid, Error **errp)
{
@@ -363,6 +453,14 @@ static bool smmuv3_accel_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
return true;
}
+ /*
+ * Check the host SMMUv3 associated with the dev is compatible with the
+ * QEMU SMMUv3 accel.
+ */
+ if (!smmuv3_accel_hw_compatible(s, idev, errp)) {
+ return false;
+ }
+
if (!smmuv3_accel_dev_alloc_viommu(accel_dev, idev, errp)) {
error_setg(errp, "Device 0x%x: Unable to alloc viommu", sid);
return false;
diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index 3963bdc87f..5830cf5a03 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -1913,8 +1913,6 @@ static void smmu_reset_exit(Object *obj, ResetType type)
if (c->parent_phases.exit) {
c->parent_phases.exit(obj, type);
}
-
- smmuv3_init_regs(s);
}
static void smmu_realize(DeviceState *d, Error **errp)
@@ -1945,6 +1943,8 @@ static void smmu_realize(DeviceState *d, Error **errp)
sysbus_init_mmio(dev, &sys->iomem);
smmu_init_irq(s, dev);
+
+ smmuv3_init_regs(s);
}
static const VMStateDescription vmstate_smmuv3_queue = {
--
2.43.0
^ permalink raw reply related [flat|nested] 118+ messages in thread
* Re: [PATCH v4 14/27] hw/arm/smmuv3-accel: Get host SMMUv3 hw info and validate
2025-09-29 13:36 ` [PATCH v4 14/27] hw/arm/smmuv3-accel: Get host SMMUv3 hw info and validate Shameer Kolothum
@ 2025-10-01 12:56 ` Jonathan Cameron via
2025-10-02 7:37 ` Shameer Kolothum
0 siblings, 1 reply; 118+ messages in thread
From: Jonathan Cameron via @ 2025-10-01 12:56 UTC (permalink / raw)
To: Shameer Kolothum
Cc: qemu-arm, qemu-devel, eric.auger, peter.maydell, jgg, nicolinc,
ddutile, berrange, nathanc, mochs, smostafa, wangzhou1,
jiangkunkun, zhangfei.gao, zhenzhong.duan, yi.l.liu,
shameerkolothum
On Mon, 29 Sep 2025 14:36:30 +0100
Shameer Kolothum <skolothumtho@nvidia.com> wrote:
> Just before the device gets attached to the SMMUv3, make sure QEMU SMMUv3
> features are compatible with the host SMMUv3.
>
> Not all fields in the host SMMUv3 IDR registers are meaningful for userspace.
> Only the following fields can be used:
>
> - IDR0: ST_LEVEL, TERM_MODEL, STALL_MODEL, TTENDIAN, CD2L, ASID16, TTF
> - IDR1: SIDSIZE, SSIDSIZE
> - IDR3: BBML, RIL
> - IDR5: VAX, GRAN64K, GRAN16K, GRAN4K
>
> For now, the check is to make sure the features are in sync to enable
> basic accelerated SMMUv3 support.
>
> One other related change: move smmuv3_init_regs() to smmu_realize() so
> that the registers are initialized early enough for the check mentioned above.
>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
Hi Shameer,
Back to this series...
Various things in the checks in here.
Jonathan
> ---
> hw/arm/smmuv3-accel.c | 98 +++++++++++++++++++++++++++++++++++++++++++
> hw/arm/smmuv3.c | 4 +-
> 2 files changed, 100 insertions(+), 2 deletions(-)
>
> diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
> index 9ad8595ce2..defeddbd8c 100644
> --- a/hw/arm/smmuv3-accel.c
> +++ b/hw/arm/smmuv3-accel.c
> @@ -39,6 +39,96 @@
> #define STE1_MASK (STE1_ETS | STE1_S1STALLD | STE1_S1CSH | STE1_S1COR | \
> STE1_S1CIR | STE1_S1DSS)
>
> +static bool
> +smmuv3_accel_check_hw_compatible(SMMUv3State *s,
> + struct iommu_hw_info_arm_smmuv3 *info,
> + Error **errp)
> +{
> + uint32_t val;
> +
> + /*
> + * QEMU SMMUv3 supports both linear and 2-level stream tables.
> + */
A single-line comment would be more consistent with the ones below, and it looks to be under 80 chars.
> + val = FIELD_EX32(info->idr[0], IDR0, STLEVEL);
> + if (val != FIELD_EX32(s->idr[0], IDR0, STLEVEL)) {
> + s->idr[0] = FIELD_DP32(s->idr[0], IDR0, STLEVEL, val);
This seems a rather odd side effect to have. Perhaps add a comment on why
it makes sense to change s->idr[0] in the error path?
> + error_setg(errp, "Host SUMMUv3 differs in Stream Table format");
> + return false;
> + }
> +
> + /* QEMU SMMUv3 supports only little-endian translation table walks */
> + val = FIELD_EX32(info->idr[0], IDR0, TTENDIAN);
> + if (!val && val > FIELD_EX32(s->idr[0], IDR0, TTENDIAN)) {
This is a weird check. || maybe?
Otherwise if !val is true, then val is not likely to be greater than anything.
> + error_setg(errp, "Host SUMMUv3 doesn't support Little-endian "
> + "translation table");
> + return false;
> + }
> +
> + /* QEMU SMMUv3 supports only AArch64 translation table format */
> + val = FIELD_EX32(info->idr[0], IDR0, TTF);
> + if (val < FIELD_EX32(s->idr[0], IDR0, TTF)) {
> + error_setg(errp, "Host SUMMUv3 deosn't support Arch64 Translation "
Spell check the messages. doesn't.
> + "table format");
> + return false;
> + }
> +
> + /* QEMU SMMUv3 supports SIDSIZE 16 */
> + val = FIELD_EX32(info->idr[1], IDR1, SIDSIZE);
> + if (val < FIELD_EX32(s->idr[1], IDR1, SIDSIZE)) {
Why not
if (FIELD_EX32(info->idr[1], IDR1, SIDSIZE) <
FIELD_EX32(s->idr[1], IDR1, SIDSIZE))
I.e., why does the local variable make sense when the value is only used
once? To me, if anything, this form is slightly easier to read. You could
even align the operands so it's obvious it's comparing one field.
if (FIELD_EX32(info->idr[1], IDR1, SIDSIZE) <
FIELD_EX32(s->idr[1], IDR1, SIDSIZE))
> + error_setg(errp, "Host SUMMUv3 SIDSIZE not compatible");
> + return false;
> + }
> +
> + /* QEMU SMMUv3 supports Range Invalidation by default */
> + val = FIELD_EX32(info->idr[3], IDR3, RIL);
> + if (val != FIELD_EX32(s->idr[3], IDR3, RIL)) {
> + error_setg(errp, "Host SUMMUv3 deosn't support Range Invalidation");
doesn't.
> + return false;
> + }
> +
> + val = FIELD_EX32(info->idr[5], IDR5, GRAN4K);
> + if (val != FIELD_EX32(s->idr[5], IDR5, GRAN4K)) {
> + error_setg(errp, "Host SMMUv3 doesn't support 64K translation granule");
That doesn't smell like it's checking 64K
> + return false;
> + }
> + val = FIELD_EX32(info->idr[5], IDR5, GRAN16K);
> + if (val != FIELD_EX32(s->idr[5], IDR5, GRAN16K)) {
> + error_setg(errp, "Host SMMUv3 doesn't support 16K translation granule");
> + return false;
> + }
> + val = FIELD_EX32(info->idr[5], IDR5, GRAN64K);
> + if (val != FIELD_EX32(s->idr[5], IDR5, GRAN64K)) {
> + error_setg(errp, "Host SMMUv3 doesn't support 16K translation granule");
Nor does this one seem likely to be checking 16K.
> + return false;
> + }
> + return true;
> +}
> +
> +static bool
> +smmuv3_accel_hw_compatible(SMMUv3State *s, HostIOMMUDeviceIOMMUFD *idev,
> + Error **errp)
> +{
> + struct iommu_hw_info_arm_smmuv3 info;
> + uint32_t data_type;
> + uint64_t caps;
> +
> + if (!iommufd_backend_get_device_info(idev->iommufd, idev->devid, &data_type,
> + &info, sizeof(info), &caps, errp)) {
> + return false;
> + }
> +
> + if (data_type != IOMMU_HW_INFO_TYPE_ARM_SMMUV3) {
> + error_setg(errp, "Wrong data type (%d) for Host SMMUv3 device info",
> + data_type);
Alignment looks off.
> + return false;
> + }
> +
> + if (!smmuv3_accel_check_hw_compatible(s, &info, errp)) {
> + return false;
> + }
> + return true;
> +}
> +
^ permalink raw reply [flat|nested] 118+ messages in thread
* RE: [PATCH v4 14/27] hw/arm/smmuv3-accel: Get host SMMUv3 hw info and validate
2025-10-01 12:56 ` Jonathan Cameron via
@ 2025-10-02 7:37 ` Shameer Kolothum
2025-10-02 9:54 ` Jonathan Cameron via
0 siblings, 1 reply; 118+ messages in thread
From: Shameer Kolothum @ 2025-10-02 7:37 UTC (permalink / raw)
To: Jonathan Cameron
Cc: qemu-arm@nongnu.org, qemu-devel@nongnu.org, eric.auger@redhat.com,
peter.maydell@linaro.org, Jason Gunthorpe, Nicolin Chen,
ddutile@redhat.com, berrange@redhat.com, Nathan Chen, Matt Ochs,
smostafa@google.com, wangzhou1@hisilicon.com,
jiangkunkun@huawei.com, zhangfei.gao@linaro.org,
zhenzhong.duan@intel.com, yi.l.liu@intel.com,
shameerkolothum@gmail.com
> -----Original Message-----
> From: Jonathan Cameron <jonathan.cameron@huawei.com>
> Sent: 01 October 2025 13:56
> To: Shameer Kolothum <skolothumtho@nvidia.com>
> Cc: qemu-arm@nongnu.org; qemu-devel@nongnu.org;
> eric.auger@redhat.com; peter.maydell@linaro.org; Jason Gunthorpe
> <jgg@nvidia.com>; Nicolin Chen <nicolinc@nvidia.com>; ddutile@redhat.com;
> berrange@redhat.com; Nathan Chen <nathanc@nvidia.com>; Matt Ochs
> <mochs@nvidia.com>; smostafa@google.com; wangzhou1@hisilicon.com;
> jiangkunkun@huawei.com; zhangfei.gao@linaro.org;
> zhenzhong.duan@intel.com; yi.l.liu@intel.com;
> shameerkolothum@gmail.com
> Subject: Re: [PATCH v4 14/27] hw/arm/smmuv3-accel: Get host SMMUv3 hw
> info and validate
>
> External email: Use caution opening links or attachments
>
>
> On Mon, 29 Sep 2025 14:36:30 +0100
> Shameer Kolothum <skolothumtho@nvidia.com> wrote:
>
> > Just before the device gets attached to the SMMUv3, make sure QEMU
> SMMUv3
> > features are compatible with the host SMMUv3.
> >
> > Not all fields in the host SMMUv3 IDR registers are meaningful for userspace.
> > Only the following fields can be used:
> >
> > - IDR0: ST_LEVEL, TERM_MODEL, STALL_MODEL, TTENDIAN, CD2L, ASID16,
> TTF
> > - IDR1: SIDSIZE, SSIDSIZE
> > - IDR3: BBML, RIL
> > - IDR5: VAX, GRAN64K, GRAN16K, GRAN4K
> >
> > For now, the check is to make sure the features are in sync to enable
> > basic accelerated SMMUv3 support.
> >
> > One other related change: move smmuv3_init_regs() to smmu_realize() so
> > that the registers are initialized early enough for the check mentioned
> > above.
> >
> > Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
>
> Hi Shameer,
>
> Back to this series...
>
> Various things in the checks in here.
Thanks for going through the entire series. I will address the comments
in the next revision.
By the way, you seem to have missed patch #20 though.
Thanks,
Shameer
>
> > ---
> > hw/arm/smmuv3-accel.c | 98
> +++++++++++++++++++++++++++++++++++++++++++
> > hw/arm/smmuv3.c | 4 +-
> > 2 files changed, 100 insertions(+), 2 deletions(-)
> >
> > diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
> > index 9ad8595ce2..defeddbd8c 100644
> > --- a/hw/arm/smmuv3-accel.c
> > +++ b/hw/arm/smmuv3-accel.c
> > @@ -39,6 +39,96 @@
> > #define STE1_MASK (STE1_ETS | STE1_S1STALLD | STE1_S1CSH |
> STE1_S1COR | \
> > STE1_S1CIR | STE1_S1DSS)
> >
> > +static bool
> > +smmuv3_accel_check_hw_compatible(SMMUv3State *s,
> > + struct iommu_hw_info_arm_smmuv3 *info,
> > + Error **errp)
> > +{
> > + uint32_t val;
> > +
> > + /*
> > + * QEMU SMMUv3 supports both linear and 2-level stream tables.
> > + */
>
> A single-line comment would be more consistent with the ones below, and it
> looks to be under 80 chars.
>
> > + val = FIELD_EX32(info->idr[0], IDR0, STLEVEL);
> > + if (val != FIELD_EX32(s->idr[0], IDR0, STLEVEL)) {
> > + s->idr[0] = FIELD_DP32(s->idr[0], IDR0, STLEVEL, val);
>
> This seems a rather odd side effect to have. Perhaps add a comment on why
> it makes sense to change s->idr[0] in the error path?
>
> > + error_setg(errp, "Host SUMMUv3 differs in Stream Table format");
> > + return false;
> > + }
> > +
> > + /* QEMU SMMUv3 supports only little-endian translation table walks */
> > + val = FIELD_EX32(info->idr[0], IDR0, TTENDIAN);
> > + if (!val && val > FIELD_EX32(s->idr[0], IDR0, TTENDIAN)) {
>
> This is a weird check. || maybe?
>
> Otherwise if !val is true, then val is not likely to be greater than anything.
>
> > + error_setg(errp, "Host SUMMUv3 doesn't support Little-endian "
> > + "translation table");
> > + return false;
> > + }
> > +
> > + /* QEMU SMMUv3 supports only AArch64 translation table format */
> > + val = FIELD_EX32(info->idr[0], IDR0, TTF);
> > + if (val < FIELD_EX32(s->idr[0], IDR0, TTF)) {
> > + error_setg(errp, "Host SUMMUv3 deosn't support Arch64 Translation "
>
> Spell check the messages. doesn't.
>
> > + "table format");
> > + return false;
> > + }
> > +
> > + /* QEMU SMMUv3 supports SIDSIZE 16 */
> > + val = FIELD_EX32(info->idr[1], IDR1, SIDSIZE);
> > + if (val < FIELD_EX32(s->idr[1], IDR1, SIDSIZE)) {
>
> Why not
> if (FIELD_EX32(info->idr[1], IDR1, SIDSIZE) <
> FIELD_EX32(s->idr[1], IDR1, SIDSIZE))
> I.e., why does the local variable make sense when the value is only used
> once? To me, if anything, this form is slightly easier to read. You could
> even align the operands so it's obvious it's comparing one field.
>
> if (FIELD_EX32(info->idr[1], IDR1, SIDSIZE) <
> FIELD_EX32(s->idr[1], IDR1, SIDSIZE))
>
> > + error_setg(errp, "Host SUMMUv3 SIDSIZE not compatible");
> > + return false;
> > + }
> > +
> > + /* QEMU SMMUv3 supports Range Invalidation by default */
> > + val = FIELD_EX32(info->idr[3], IDR3, RIL);
> > + if (val != FIELD_EX32(s->idr[3], IDR3, RIL)) {
> > + error_setg(errp, "Host SUMMUv3 deosn't support Range
> Invalidation");
>
> doesn't.
>
> > + return false;
> > + }
> > +
> > + val = FIELD_EX32(info->idr[5], IDR5, GRAN4K);
> > + if (val != FIELD_EX32(s->idr[5], IDR5, GRAN4K)) {
> > + error_setg(errp, "Host SMMUv3 doesn't support 64K translation
> granule");
> That doesn't smell like it's checking 64K
> > + return false;
> > + }
> > + val = FIELD_EX32(info->idr[5], IDR5, GRAN16K);
> > + if (val != FIELD_EX32(s->idr[5], IDR5, GRAN16K)) {
> > + error_setg(errp, "Host SMMUv3 doesn't support 16K translation
> granule");
> > + return false;
> > + }
> > + val = FIELD_EX32(info->idr[5], IDR5, GRAN64K);
> > + if (val != FIELD_EX32(s->idr[5], IDR5, GRAN64K)) {
> > + error_setg(errp, "Host SMMUv3 doesn't support 16K translation
> granule");
> Nor does this one seem likely to be checking 16K.
> > + return false;
> > + }
> > + return true;
> > +}
> > +
> > +static bool
> > +smmuv3_accel_hw_compatible(SMMUv3State *s,
> HostIOMMUDeviceIOMMUFD *idev,
> > + Error **errp)
> > +{
> > + struct iommu_hw_info_arm_smmuv3 info;
> > + uint32_t data_type;
> > + uint64_t caps;
> > +
> > + if (!iommufd_backend_get_device_info(idev->iommufd, idev->devid,
> &data_type,
> > + &info, sizeof(info), &caps, errp)) {
> > + return false;
> > + }
> > +
> > + if (data_type != IOMMU_HW_INFO_TYPE_ARM_SMMUV3) {
> > + error_setg(errp, "Wrong data type (%d) for Host SMMUv3 device info",
> > + data_type);
>
> Alignment looks off.
>
> > + return false;
> > + }
> > +
> > + if (!smmuv3_accel_check_hw_compatible(s, &info, errp)) {
> > + return false;
> > + }
> > + return true;
> > +}
> > +
>
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [PATCH v4 14/27] hw/arm/smmuv3-accel: Get host SMMUv3 hw info and validate
2025-10-02 7:37 ` Shameer Kolothum
@ 2025-10-02 9:54 ` Jonathan Cameron via
0 siblings, 0 replies; 118+ messages in thread
From: Jonathan Cameron via @ 2025-10-02 9:54 UTC (permalink / raw)
To: Shameer Kolothum
Cc: qemu-arm@nongnu.org, qemu-devel@nongnu.org, eric.auger@redhat.com,
peter.maydell@linaro.org, Jason Gunthorpe, Nicolin Chen,
ddutile@redhat.com, berrange@redhat.com, Nathan Chen, Matt Ochs,
smostafa@google.com, wangzhou1@hisilicon.com,
jiangkunkun@huawei.com, zhangfei.gao@linaro.org,
zhenzhong.duan@intel.com, yi.l.liu@intel.com,
shameerkolothum@gmail.com
On Thu, 2 Oct 2025 07:37:46 +0000
Shameer Kolothum <skolothumtho@nvidia.com> wrote:
> > -----Original Message-----
> > From: Jonathan Cameron <jonathan.cameron@huawei.com>
> > Sent: 01 October 2025 13:56
> > To: Shameer Kolothum <skolothumtho@nvidia.com>
> > Cc: qemu-arm@nongnu.org; qemu-devel@nongnu.org;
> > eric.auger@redhat.com; peter.maydell@linaro.org; Jason Gunthorpe
> > <jgg@nvidia.com>; Nicolin Chen <nicolinc@nvidia.com>; ddutile@redhat.com;
> > berrange@redhat.com; Nathan Chen <nathanc@nvidia.com>; Matt Ochs
> > <mochs@nvidia.com>; smostafa@google.com; wangzhou1@hisilicon.com;
> > jiangkunkun@huawei.com; zhangfei.gao@linaro.org;
> > zhenzhong.duan@intel.com; yi.l.liu@intel.com;
> > shameerkolothum@gmail.com
> > Subject: Re: [PATCH v4 14/27] hw/arm/smmuv3-accel: Get host SMMUv3 hw
> > info and validate
> >
> >
> >
> > On Mon, 29 Sep 2025 14:36:30 +0100
> > Shameer Kolothum <skolothumtho@nvidia.com> wrote:
> >
> > > Just before the device gets attached to the SMMUv3, make sure QEMU
> > SMMUv3
> > > features are compatible with the host SMMUv3.
> > >
> > > Not all fields in the host SMMUv3 IDR registers are meaningful for userspace.
> > > Only the following fields can be used:
> > >
> > > - IDR0: ST_LEVEL, TERM_MODEL, STALL_MODEL, TTENDIAN, CD2L, ASID16,
> > TTF
> > > - IDR1: SIDSIZE, SSIDSIZE
> > > - IDR3: BBML, RIL
> > > - IDR5: VAX, GRAN64K, GRAN16K, GRAN4K
> > >
> > > For now, the check is to make sure the features are in sync to enable
> > > basic accelerated SMMUv3 support.
> > >
> > > One other related change: move smmuv3_init_regs() to smmu_realize() so
> > > that the registers are initialized early enough for the check mentioned
> > > above.
> > >
> > > Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
> >
> > Hi Shameer,
> >
> > Back to this series...
> >
> > Various things in the checks in here.
>
> Thanks for going through the entire series. I will address the comments
> in the next revision.
>
> Between, you seem to have missed patch #20 though.
I should, but don't, know enough about migration blockers to comment on that one!
>
> Thanks,
> Shameer
^ permalink raw reply [flat|nested] 118+ messages in thread
* [PATCH v4 15/27] acpi/gpex: Fix PCI Express Slot Information function 0 returned value
2025-09-29 13:36 [PATCH v4 00/27] hw/arm/virt: Add support for user-creatable accelerated SMMUv3 Shameer Kolothum
` (13 preceding siblings ...)
2025-09-29 13:36 ` [PATCH v4 14/27] hw/arm/smmuv3-accel: Get host SMMUv3 hw info and validate Shameer Kolothum
@ 2025-09-29 13:36 ` Shameer Kolothum
2025-10-01 12:59 ` Jonathan Cameron via
2025-09-29 13:36 ` [PATCH v4 16/27] hw/pci-host/gpex: Allow to generate preserve boot config DSM #5 Shameer Kolothum
` (12 subsequent siblings)
27 siblings, 1 reply; 118+ messages in thread
From: Shameer Kolothum @ 2025-09-29 13:36 UTC (permalink / raw)
To: qemu-arm, qemu-devel
Cc: eric.auger, peter.maydell, jgg, nicolinc, ddutile, berrange,
nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu,
shameerkolothum
From: Eric Auger <eric.auger@redhat.com>
At the moment we do not support any function other than function 0. So,
according to the ACPI spec's "_DSM (Device Specific Method)" description,
bit 0 should rather be 0, meaning no function other than function 0 is
supported.
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
hw/pci-host/gpex-acpi.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/hw/pci-host/gpex-acpi.c b/hw/pci-host/gpex-acpi.c
index 952a0ace19..4587baeb78 100644
--- a/hw/pci-host/gpex-acpi.c
+++ b/hw/pci-host/gpex-acpi.c
@@ -64,7 +64,7 @@ static Aml *build_pci_host_bridge_dsm_method(void)
UUID = aml_touuid("E5C937D0-3553-4D7A-9117-EA4D19C3434D");
ifctx = aml_if(aml_equal(aml_arg(0), UUID));
ifctx1 = aml_if(aml_equal(aml_arg(2), aml_int(0)));
- uint8_t byte_list[1] = {1};
+ uint8_t byte_list[1] = {0};
buf = aml_buffer(1, byte_list);
aml_append(ifctx1, aml_return(buf));
aml_append(ifctx, ifctx1);
--
2.43.0
^ permalink raw reply related [flat|nested] 118+ messages in thread
* Re: [PATCH v4 15/27] acpi/gpex: Fix PCI Express Slot Information function 0 returned value
2025-09-29 13:36 ` [PATCH v4 15/27] acpi/gpex: Fix PCI Express Slot Information function 0 returned value Shameer Kolothum
@ 2025-10-01 12:59 ` Jonathan Cameron via
2025-10-02 7:39 ` Shameer Kolothum
0 siblings, 1 reply; 118+ messages in thread
From: Jonathan Cameron via @ 2025-10-01 12:59 UTC (permalink / raw)
To: Shameer Kolothum
Cc: qemu-arm, qemu-devel, eric.auger, peter.maydell, jgg, nicolinc,
ddutile, berrange, nathanc, mochs, smostafa, wangzhou1,
jiangkunkun, zhangfei.gao, zhenzhong.duan, yi.l.liu,
shameerkolothum
On Mon, 29 Sep 2025 14:36:31 +0100
Shameer Kolothum <skolothumtho@nvidia.com> wrote:
> From: Eric Auger <eric.auger@redhat.com>
>
> At the moment we do not support any function other than function 0. So,
> according to the ACPI spec's "_DSM (Device Specific Method)" description,
> bit 0 should rather be 0, meaning no function other than function 0 is
> supported.
>
> Signed-off-by: Eric Auger <eric.auger@redhat.com>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
Given the description, why not yank this to the front of the series and get
it upstreamed quicker?
Also, a Fixes tag seems appropriate?
Doesn't this show up in some of the tables tests?
Please include the relevant chunk of AML as well, as the QEMU AML generation
code isn't exactly easy to check against the spec. Probably +CC at least
Michael Tsirkin on the next version of this patch.
J
> ---
> hw/pci-host/gpex-acpi.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/hw/pci-host/gpex-acpi.c b/hw/pci-host/gpex-acpi.c
> index 952a0ace19..4587baeb78 100644
> --- a/hw/pci-host/gpex-acpi.c
> +++ b/hw/pci-host/gpex-acpi.c
> @@ -64,7 +64,7 @@ static Aml *build_pci_host_bridge_dsm_method(void)
> UUID = aml_touuid("E5C937D0-3553-4D7A-9117-EA4D19C3434D");
> ifctx = aml_if(aml_equal(aml_arg(0), UUID));
> ifctx1 = aml_if(aml_equal(aml_arg(2), aml_int(0)));
> - uint8_t byte_list[1] = {1};
> + uint8_t byte_list[1] = {0};
> buf = aml_buffer(1, byte_list);
> aml_append(ifctx1, aml_return(buf));
> aml_append(ifctx, ifctx1);
^ permalink raw reply [flat|nested] 118+ messages in thread
* RE: [PATCH v4 15/27] acpi/gpex: Fix PCI Express Slot Information function 0 returned value
2025-10-01 12:59 ` Jonathan Cameron via
@ 2025-10-02 7:39 ` Shameer Kolothum
2025-10-21 15:32 ` Eric Auger
0 siblings, 1 reply; 118+ messages in thread
From: Shameer Kolothum @ 2025-10-02 7:39 UTC (permalink / raw)
To: Jonathan Cameron
Cc: qemu-arm@nongnu.org, qemu-devel@nongnu.org, eric.auger@redhat.com,
peter.maydell@linaro.org, Jason Gunthorpe, Nicolin Chen,
ddutile@redhat.com, berrange@redhat.com, Nathan Chen, Matt Ochs,
smostafa@google.com, wangzhou1@hisilicon.com,
jiangkunkun@huawei.com, zhangfei.gao@linaro.org,
zhenzhong.duan@intel.com, yi.l.liu@intel.com,
shameerkolothum@gmail.com
> -----Original Message-----
> From: Jonathan Cameron <jonathan.cameron@huawei.com>
> Sent: 01 October 2025 13:59
> To: Shameer Kolothum <skolothumtho@nvidia.com>
> Cc: qemu-arm@nongnu.org; qemu-devel@nongnu.org;
> eric.auger@redhat.com; peter.maydell@linaro.org; Jason Gunthorpe
> <jgg@nvidia.com>; Nicolin Chen <nicolinc@nvidia.com>; ddutile@redhat.com;
> berrange@redhat.com; Nathan Chen <nathanc@nvidia.com>; Matt Ochs
> <mochs@nvidia.com>; smostafa@google.com; wangzhou1@hisilicon.com;
> jiangkunkun@huawei.com; zhangfei.gao@linaro.org;
> zhenzhong.duan@intel.com; yi.l.liu@intel.com;
> shameerkolothum@gmail.com
> Subject: Re: [PATCH v4 15/27] acpi/gpex: Fix PCI Express Slot Information
> function 0 returned value
>
>
>
> On Mon, 29 Sep 2025 14:36:31 +0100
> Shameer Kolothum <skolothumtho@nvidia.com> wrote:
>
> > From: Eric Auger <eric.auger@redhat.com>
> >
> > At the moment we do not support any function other than function 0. So,
> > according to the ACPI spec's "_DSM (Device Specific Method)" description,
> > bit 0 should rather be 0, meaning no function other than function 0 is
> > supported.
> >
> > Signed-off-by: Eric Auger <eric.auger@redhat.com>
> > Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
> Given description, why not yank this to the front and get it upstreamed
> quicker.
> Also, a fixes tag seems appropriate?
>
> Doesn't this show up in some of the tables tests?
Possibly. I will double check that.
>
> Please include relevant chunk of AML as well as qemu AML generation code
> isn't
> exactly easy to check against the spec. Probably +CC at least Michael Tsirkin
> on next version of this patch.
Ok.
Thanks,
Shameer
> J
>
> > ---
> > hw/pci-host/gpex-acpi.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/hw/pci-host/gpex-acpi.c b/hw/pci-host/gpex-acpi.c
> > index 952a0ace19..4587baeb78 100644
> > --- a/hw/pci-host/gpex-acpi.c
> > +++ b/hw/pci-host/gpex-acpi.c
> > @@ -64,7 +64,7 @@ static Aml *build_pci_host_bridge_dsm_method(void)
> > UUID = aml_touuid("E5C937D0-3553-4D7A-9117-EA4D19C3434D");
> > ifctx = aml_if(aml_equal(aml_arg(0), UUID));
> > ifctx1 = aml_if(aml_equal(aml_arg(2), aml_int(0)));
> > - uint8_t byte_list[1] = {1};
> > + uint8_t byte_list[1] = {0};
> > buf = aml_buffer(1, byte_list);
> > aml_append(ifctx1, aml_return(buf));
> > aml_append(ifctx, ifctx1);
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [PATCH v4 15/27] acpi/gpex: Fix PCI Express Slot Information function 0 returned value
2025-10-02 7:39 ` Shameer Kolothum
@ 2025-10-21 15:32 ` Eric Auger
0 siblings, 0 replies; 118+ messages in thread
From: Eric Auger @ 2025-10-21 15:32 UTC (permalink / raw)
To: Shameer Kolothum, Jonathan Cameron
Cc: qemu-arm@nongnu.org, qemu-devel@nongnu.org,
peter.maydell@linaro.org, Jason Gunthorpe, Nicolin Chen,
ddutile@redhat.com, berrange@redhat.com, Nathan Chen, Matt Ochs,
smostafa@google.com, wangzhou1@hisilicon.com,
jiangkunkun@huawei.com, zhangfei.gao@linaro.org,
zhenzhong.duan@intel.com, yi.l.liu@intel.com,
shameerkolothum@gmail.com
On 10/2/25 9:39 AM, Shameer Kolothum wrote:
>
>> -----Original Message-----
>> From: Jonathan Cameron <jonathan.cameron@huawei.com>
>> Sent: 01 October 2025 13:59
>> To: Shameer Kolothum <skolothumtho@nvidia.com>
>> Cc: qemu-arm@nongnu.org; qemu-devel@nongnu.org;
>> eric.auger@redhat.com; peter.maydell@linaro.org; Jason Gunthorpe
>> <jgg@nvidia.com>; Nicolin Chen <nicolinc@nvidia.com>; ddutile@redhat.com;
>> berrange@redhat.com; Nathan Chen <nathanc@nvidia.com>; Matt Ochs
>> <mochs@nvidia.com>; smostafa@google.com; wangzhou1@hisilicon.com;
>> jiangkunkun@huawei.com; zhangfei.gao@linaro.org;
>> zhenzhong.duan@intel.com; yi.l.liu@intel.com;
>> shameerkolothum@gmail.com
>> Subject: Re: [PATCH v4 15/27] acpi/gpex: Fix PCI Express Slot Information
>> function 0 returned value
>>
>>
>>
>> On Mon, 29 Sep 2025 14:36:31 +0100
>> Shameer Kolothum <skolothumtho@nvidia.com> wrote:
>>
>>> From: Eric Auger <eric.auger@redhat.com>
>>>
>>> At the moment we do not support any function other than function 0. So,
>>> according to the ACPI spec's "_DSM (Device Specific Method)" description,
>>> bit 0 should rather be 0, meaning no function other than function 0 is
>>> supported.
>>>
>>> Signed-off-by: Eric Auger <eric.auger@redhat.com>
>>> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
>> Given the description, why not yank this to the front and get it upstreamed
>> quicker?
I agree with Jonathan. This can be sent as a prerequisite patch/series with
potential updates to the ACPI table tests.
Thanks
Eric
>> Also, a fixes tag seems appropriate?
>>
>> Doesn't this show up in some of the tables tests?
> Possibly. I will double check that.
>
>> Please include the relevant chunk of AML as well, as QEMU AML generation
>> code isn't exactly easy to check against the spec. Probably +CC at least
>> Michael Tsirkin on the next version of this patch.
> Ok.
>
> Thanks,
> Shameer
>
>
>> J
>>
>>> ---
>>> hw/pci-host/gpex-acpi.c | 2 +-
>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/hw/pci-host/gpex-acpi.c b/hw/pci-host/gpex-acpi.c
>>> index 952a0ace19..4587baeb78 100644
>>> --- a/hw/pci-host/gpex-acpi.c
>>> +++ b/hw/pci-host/gpex-acpi.c
>>> @@ -64,7 +64,7 @@ static Aml *build_pci_host_bridge_dsm_method(void)
>>> UUID = aml_touuid("E5C937D0-3553-4D7A-9117-EA4D19C3434D");
>>> ifctx = aml_if(aml_equal(aml_arg(0), UUID));
>>> ifctx1 = aml_if(aml_equal(aml_arg(2), aml_int(0)));
>>> - uint8_t byte_list[1] = {1};
>>> + uint8_t byte_list[1] = {0};
>>> buf = aml_buffer(1, byte_list);
>>> aml_append(ifctx1, aml_return(buf));
>>> aml_append(ifctx, ifctx1);
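The fix above hinges on the encoding of the Arg2 == 0 "supported functions"
byte: bit 0 advertises that at least one function other than function 0 is
supported, and bit N (N >= 1) advertises function N. A minimal sketch of that
encoding (the helper name is illustrative, not QEMU code) shows why a bridge
supporting only function 0 must return 0:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative sketch (not QEMU code) of the _DSM(Arg2 == 0) capabilities
 * byte per the ACPI spec: bit 0 means "functions other than 0 exist",
 * bit N (1 <= N <= 7) means function N is supported.
 */
static uint8_t dsm_capabilities(const int *funcs, int nfuncs)
{
    uint8_t byte = 0;

    for (int i = 0; i < nfuncs; i++) {
        if (funcs[i] >= 1 && funcs[i] <= 7) {
            byte |= 1u << funcs[i]; /* advertise function N */
            byte |= 1u;             /* bit 0: other functions exist */
        }
    }
    return byte;
}
```

With only function 0 implemented, the byte is 0, so the previous value of 1
wrongly advertised that additional functions exist.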
^ permalink raw reply [flat|nested] 118+ messages in thread
* [PATCH v4 16/27] hw/pci-host/gpex: Allow to generate preserve boot config DSM #5
2025-09-29 13:36 [PATCH v4 00/27] hw/arm/virt: Add support for user-creatable accelerated SMMUv3 Shameer Kolothum
` (14 preceding siblings ...)
2025-09-29 13:36 ` [PATCH v4 15/27] acpi/gpex: Fix PCI Express Slot Information function 0 returned value Shameer Kolothum
@ 2025-09-29 13:36 ` Shameer Kolothum
2025-10-01 13:05 ` Jonathan Cameron via
2025-09-29 13:36 ` [PATCH v4 17/27] hw/arm/virt: Set PCI preserve_config for accel SMMUv3 Shameer Kolothum
` (11 subsequent siblings)
27 siblings, 1 reply; 118+ messages in thread
From: Shameer Kolothum @ 2025-09-29 13:36 UTC (permalink / raw)
To: qemu-arm, qemu-devel
Cc: eric.auger, peter.maydell, jgg, nicolinc, ddutile, berrange,
nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu,
shameerkolothum
From: Eric Auger <eric.auger@redhat.com>
Add a 'preserve_config' field in struct GPEXConfig and if set, generate the
DSM #5 for preserving PCI boot configurations. For SMMUv3 accel=on support,
we are making use of IORT RMRs in a subsequent patch and that requires the
DSM #5.
At the moment the DSM generation is not yet enabled.
Signed-off-by: Eric Auger <eric.auger@redhat.com>
[Shameer: Removed possible duplicate _DSM creations]
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
Previously, QEMU reverted an attempt to enable DSM #5 because it caused a
regression,
https://lore.kernel.org/all/20210724185234.GA2265457@roeck-us.net/.
However, in this series, we enable it selectively, only when SMMUv3 is in
accelerator mode. The devices involved in the earlier regression are not
expected in accelerated SMMUv3 use cases.
---
hw/pci-host/gpex-acpi.c | 29 +++++++++++++++++++++++------
include/hw/pci-host/gpex.h | 1 +
2 files changed, 24 insertions(+), 6 deletions(-)
diff --git a/hw/pci-host/gpex-acpi.c b/hw/pci-host/gpex-acpi.c
index 4587baeb78..e3825ed0b1 100644
--- a/hw/pci-host/gpex-acpi.c
+++ b/hw/pci-host/gpex-acpi.c
@@ -51,10 +51,11 @@ static void acpi_dsdt_add_pci_route_table(Aml *dev, uint32_t irq,
}
}
-static Aml *build_pci_host_bridge_dsm_method(void)
+static Aml *build_pci_host_bridge_dsm_method(bool preserve_config)
{
Aml *method = aml_method("_DSM", 4, AML_NOTSERIALIZED);
Aml *UUID, *ifctx, *ifctx1, *buf;
+ uint8_t byte_list[1] = {0};
/* PCI Firmware Specification 3.0
* 4.6.1. _DSM for PCI Express Slot Information
@@ -64,10 +65,23 @@ static Aml *build_pci_host_bridge_dsm_method(void)
UUID = aml_touuid("E5C937D0-3553-4D7A-9117-EA4D19C3434D");
ifctx = aml_if(aml_equal(aml_arg(0), UUID));
ifctx1 = aml_if(aml_equal(aml_arg(2), aml_int(0)));
- uint8_t byte_list[1] = {0};
+ if (preserve_config) {
+ /* support for function 0 and function 5 */
+ byte_list[0] = 0x21;
+ }
buf = aml_buffer(1, byte_list);
aml_append(ifctx1, aml_return(buf));
aml_append(ifctx, ifctx1);
+ if (preserve_config) {
+ Aml *ifctx2 = aml_if(aml_equal(aml_arg(2), aml_int(5)));
+ /*
+ * 0 - The operating system must not ignore the PCI configuration that
+ * firmware has done at boot time.
+ */
+ aml_append(ifctx2, aml_return(aml_int(0)));
+ aml_append(ifctx, ifctx2);
+ }
+
aml_append(method, ifctx);
byte_list[0] = 0;
@@ -77,12 +91,13 @@ static Aml *build_pci_host_bridge_dsm_method(void)
}
static void acpi_dsdt_add_host_bridge_methods(Aml *dev,
- bool enable_native_pcie_hotplug)
+ bool enable_native_pcie_hotplug,
+ bool preserve_config)
{
/* Declare an _OSC (OS Control Handoff) method */
aml_append(dev,
build_pci_host_bridge_osc_method(enable_native_pcie_hotplug));
- aml_append(dev, build_pci_host_bridge_dsm_method());
+ aml_append(dev, build_pci_host_bridge_dsm_method(preserve_config));
}
void acpi_dsdt_add_gpex(Aml *scope, struct GPEXConfig *cfg)
@@ -152,7 +167,8 @@ void acpi_dsdt_add_gpex(Aml *scope, struct GPEXConfig *cfg)
build_cxl_osc_method(dev);
} else {
/* pxb bridges do not have ACPI PCI Hot-plug enabled */
- acpi_dsdt_add_host_bridge_methods(dev, true);
+ acpi_dsdt_add_host_bridge_methods(dev, true,
+ cfg->preserve_config);
}
aml_append(scope, dev);
@@ -227,7 +243,8 @@ void acpi_dsdt_add_gpex(Aml *scope, struct GPEXConfig *cfg)
}
aml_append(dev, aml_name_decl("_CRS", rbuf));
- acpi_dsdt_add_host_bridge_methods(dev, cfg->pci_native_hotplug);
+ acpi_dsdt_add_host_bridge_methods(dev, cfg->pci_native_hotplug,
+ cfg->preserve_config);
Aml *dev_res0 = aml_device("%s", "RES0");
aml_append(dev_res0, aml_name_decl("_HID", aml_string("PNP0C02")));
diff --git a/include/hw/pci-host/gpex.h b/include/hw/pci-host/gpex.h
index feaf827474..7eea16e728 100644
--- a/include/hw/pci-host/gpex.h
+++ b/include/hw/pci-host/gpex.h
@@ -46,6 +46,7 @@ struct GPEXConfig {
int irq;
PCIBus *bus;
bool pci_native_hotplug;
+ bool preserve_config;
};
typedef struct GPEXIrq GPEXIrq;
--
2.43.0
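For the record, the 0x21 capabilities byte in this patch decomposes as bit 0
(functions other than 0 exist) plus bit 5 (function 5 supported), and function
5 itself returns 0 to tell the OS to keep firmware's PCI boot configuration.
A sketch of that encoding (names are illustrative, not QEMU API):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BIT(n) (1u << (n))

/* Illustrative sketch, not QEMU API: capabilities byte for _DSM(Arg2 == 0). */
static uint8_t host_bridge_dsm_caps(bool preserve_config)
{
    /* bit 0: other functions exist; bit 5: function 5 supported */
    return preserve_config ? (uint8_t)(BIT(0) | BIT(5)) : 0;
}

/* Function 5: 0 means the OS must not ignore firmware's PCI boot config. */
static int host_bridge_dsm_func5(void)
{
    return 0;
}
```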
^ permalink raw reply related	[flat|nested] 118+ messages in thread
* Re: [PATCH v4 16/27] hw/pci-host/gpex: Allow to generate preserve boot config DSM #5
2025-09-29 13:36 ` [PATCH v4 16/27] hw/pci-host/gpex: Allow to generate preserve boot config DSM #5 Shameer Kolothum
@ 2025-10-01 13:05 ` Jonathan Cameron via
0 siblings, 0 replies; 118+ messages in thread
From: Jonathan Cameron via @ 2025-10-01 13:05 UTC (permalink / raw)
To: Shameer Kolothum
Cc: qemu-arm, qemu-devel, eric.auger, peter.maydell, jgg, nicolinc,
ddutile, berrange, nathanc, mochs, smostafa, wangzhou1,
jiangkunkun, zhangfei.gao, zhenzhong.duan, yi.l.liu,
shameerkolothum
On Mon, 29 Sep 2025 14:36:32 +0100
Shameer Kolothum <skolothumtho@nvidia.com> wrote:
> From: Eric Auger <eric.auger@redhat.com>
>
> Add a 'preserve_config' field in struct GPEXConfig and if set, generate the
> DSM #5 for preserving PCI boot configurations. For SMMUV3 accel=on support,
> we are making use of IORT RMRs in a subsequent patch and that requires the
> DSM #5.
>
> At the moment the DSM generation is not yet enabled.
>
> Signed-off-by: Eric Auger <eric.auger@redhat.com>
> [Shameer: Removed possible duplicate _DSM creations]
> Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
Throw an AML blob in the patch description as it's easier to check that
against the spec. Add a specific spec reference as well.
> ---
> Previously, QEMU reverted an attempt to enable DSM #5 because it caused a
> regression,
> https://lore.kernel.org/all/20210724185234.GA2265457@roeck-us.net/.
>
> However, in this series, we enable it selectively, only when SMMUv3 is in
> accelerator mode. The devices involved in the earlier regression are not
> expected in accelerated SMMUv3 use cases.
> ---
> hw/pci-host/gpex-acpi.c | 29 +++++++++++++++++++++++------
> include/hw/pci-host/gpex.h | 1 +
> 2 files changed, 24 insertions(+), 6 deletions(-)
>
> diff --git a/hw/pci-host/gpex-acpi.c b/hw/pci-host/gpex-acpi.c
> index 4587baeb78..e3825ed0b1 100644
> --- a/hw/pci-host/gpex-acpi.c
> +++ b/hw/pci-host/gpex-acpi.c
> @@ -51,10 +51,11 @@ static void acpi_dsdt_add_pci_route_table(Aml *dev, uint32_t irq,
> }
> }
>
> -static Aml *build_pci_host_bridge_dsm_method(void)
> +static Aml *build_pci_host_bridge_dsm_method(bool preserve_config)
> {
> Aml *method = aml_method("_DSM", 4, AML_NOTSERIALIZED);
> Aml *UUID, *ifctx, *ifctx1, *buf;
> + uint8_t byte_list[1] = {0};
The inline declaration is a bit odd, but I'm not seeing a specific reason to
change that here. Perhaps call out the change as some 'other cleanup' in the
patch description if you want to make it anyway.
>
> /* PCI Firmware Specification 3.0
> * 4.6.1. _DSM for PCI Express Slot Information
> @@ -64,10 +65,23 @@ static Aml *build_pci_host_bridge_dsm_method(void)
> UUID = aml_touuid("E5C937D0-3553-4D7A-9117-EA4D19C3434D");
> ifctx = aml_if(aml_equal(aml_arg(0), UUID));
> ifctx1 = aml_if(aml_equal(aml_arg(2), aml_int(0)));
> - uint8_t byte_list[1] = {0};
> + if (preserve_config) {
> + /* support for function 0 and function 5 */
> + byte_list[0] = 0x21;
Change the comment to reflect the fix in previous patch as otherwise
it sounds like bit(0) means function 0 is supported.
/* support functions other than 0, specifically function 5 */
> + }
> buf = aml_buffer(1, byte_list);
> aml_append(ifctx1, aml_return(buf));
> aml_append(ifctx, ifctx1);
> + if (preserve_config) {
> + Aml *ifctx2 = aml_if(aml_equal(aml_arg(2), aml_int(5)));
> + /*
> + * 0 - The operating system must not ignore the PCI configuration that
> + * firmware has done at boot time.
> + */
> + aml_append(ifctx2, aml_return(aml_int(0)));
> + aml_append(ifctx, ifctx2);
> + }
> +
> aml_append(method, ifctx);
>
> byte_list[0] = 0;
> @@ -77,12 +91,13 @@ static Aml *build_pci_host_bridge_dsm_method(void)
> }
^ permalink raw reply [flat|nested] 118+ messages in thread
* [PATCH v4 17/27] hw/arm/virt: Set PCI preserve_config for accel SMMUv3
2025-09-29 13:36 [PATCH v4 00/27] hw/arm/virt: Add support for user-creatable accelerated SMMUv3 Shameer Kolothum
` (15 preceding siblings ...)
2025-09-29 13:36 ` [PATCH v4 16/27] hw/pci-host/gpex: Allow to generate preserve boot config DSM #5 Shameer Kolothum
@ 2025-09-29 13:36 ` Shameer Kolothum
2025-10-01 13:06 ` Jonathan Cameron via
2025-09-29 13:36 ` [PATCH v4 18/27] hw/arm/virt-acpi-build: Add IORT RMR regions to handle MSI nested binding Shameer Kolothum
` (10 subsequent siblings)
27 siblings, 1 reply; 118+ messages in thread
From: Shameer Kolothum @ 2025-09-29 13:36 UTC (permalink / raw)
To: qemu-arm, qemu-devel
Cc: eric.auger, peter.maydell, jgg, nicolinc, ddutile, berrange,
nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu,
shameerkolothum
In a subsequent patch, SMMUv3 accel mode will make use of IORT RMR nodes
to enable nested translation of MSI doorbell addresses. IORT RMR requires
_DSM #5 to be set for the PCI host bridge so that the Guest kernel preserves
the PCI boot configuration.
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
hw/arm/virt-acpi-build.c | 8 ++++++++
hw/arm/virt.c | 4 ++++
include/hw/arm/virt.h | 1 +
3 files changed, 13 insertions(+)
diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
index 96830f7c4e..fd03b7f6a9 100644
--- a/hw/arm/virt-acpi-build.c
+++ b/hw/arm/virt-acpi-build.c
@@ -163,6 +163,14 @@ static void acpi_dsdt_add_pci(Aml *scope, const MemMapEntry *memmap,
.pci_native_hotplug = !acpi_pcihp,
};
+ /*
+ * Accel SMMU requires RMRs for MSI 1-1 mapping, which
+ * require _DSM for preserving PCI Boot Configurations
+ */
+ if (vms->pci_preserve_config) {
+ cfg.preserve_config = true;
+ }
+
if (vms->highmem_mmio) {
cfg.mmio64 = memmap[VIRT_HIGH_PCIE_MMIO];
}
diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index b533b0556e..6467d7cfc8 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -3087,6 +3087,10 @@ static void virt_machine_device_plug_cb(HotplugHandler *hotplug_dev,
}
create_smmuv3_dev_dtb(vms, dev, bus);
+ if (object_property_get_bool(OBJECT(dev), "accel", &error_abort) &&
+ !vms->pci_preserve_config) {
+ vms->pci_preserve_config = true;
+ }
}
}
diff --git a/include/hw/arm/virt.h b/include/hw/arm/virt.h
index ea2cff05b0..00287941a9 100644
--- a/include/hw/arm/virt.h
+++ b/include/hw/arm/virt.h
@@ -180,6 +180,7 @@ struct VirtMachineState {
bool ns_el2_virt_timer_irq;
CXLState cxl_devices_state;
bool legacy_smmuv3_present;
+ bool pci_preserve_config;
};
#define VIRT_ECAM_ID(high) (high ? VIRT_HIGH_PCIE_ECAM : VIRT_PCIE_ECAM)
--
2.43.0
^ permalink raw reply related	[flat|nested] 118+ messages in thread
* Re: [PATCH v4 17/27] hw/arm/virt: Set PCI preserve_config for accel SMMUv3
2025-09-29 13:36 ` [PATCH v4 17/27] hw/arm/virt: Set PCI preserve_config for accel SMMUv3 Shameer Kolothum
@ 2025-10-01 13:06 ` Jonathan Cameron via
0 siblings, 0 replies; 118+ messages in thread
From: Jonathan Cameron via @ 2025-10-01 13:06 UTC (permalink / raw)
To: Shameer Kolothum
Cc: qemu-arm, qemu-devel, eric.auger, peter.maydell, jgg, nicolinc,
ddutile, berrange, nathanc, mochs, smostafa, wangzhou1,
jiangkunkun, zhangfei.gao, zhenzhong.duan, yi.l.liu,
shameerkolothum
On Mon, 29 Sep 2025 14:36:33 +0100
Shameer Kolothum <skolothumtho@nvidia.com> wrote:
> In a subsequent patch, SMMUv3 accel mode will make use of IORT RMR nodes
> to enable nested translation of MSI doorbell addresses. IORT RMR requires
> _DSM #5 to be set for the PCI host bridge so that the Guest kernel preserves
> the PCI boot configuration.
>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
One trivial thing inline. Otherwise LGTM
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> ---
> hw/arm/virt-acpi-build.c | 8 ++++++++
> hw/arm/virt.c | 4 ++++
> include/hw/arm/virt.h | 1 +
> 3 files changed, 13 insertions(+)
>
> diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
> index 96830f7c4e..fd03b7f6a9 100644
> --- a/hw/arm/virt-acpi-build.c
> +++ b/hw/arm/virt-acpi-build.c
> @@ -163,6 +163,14 @@ static void acpi_dsdt_add_pci(Aml *scope, const MemMapEntry *memmap,
> .pci_native_hotplug = !acpi_pcihp,
> };
>
> + /*
> + * Accel SMMU requires RMRs for MSI 1-1 mapping, which
> + * require _DSM for preserving PCI Boot Configurations
Early wrapping. I'd go nearer 80 chars.
> + */
> + if (vms->pci_preserve_config) {
> + cfg.preserve_config = true;
> + }
> +
> if (vms->highmem_mmio) {
> cfg.mmio64 = memmap[VIRT_HIGH_PCIE_MMIO];
> }
> diff --git a/hw/arm/virt.c b/hw/arm/virt.c
> index b533b0556e..6467d7cfc8 100644
> --- a/hw/arm/virt.c
> +++ b/hw/arm/virt.c
> @@ -3087,6 +3087,10 @@ static void virt_machine_device_plug_cb(HotplugHandler *hotplug_dev,
> }
>
> create_smmuv3_dev_dtb(vms, dev, bus);
> + if (object_property_get_bool(OBJECT(dev), "accel", &error_abort) &&
> + !vms->pci_preserve_config) {
> + vms->pci_preserve_config = true;
> + }
> }
> }
>
> diff --git a/include/hw/arm/virt.h b/include/hw/arm/virt.h
> index ea2cff05b0..00287941a9 100644
> --- a/include/hw/arm/virt.h
> +++ b/include/hw/arm/virt.h
> @@ -180,6 +180,7 @@ struct VirtMachineState {
> bool ns_el2_virt_timer_irq;
> CXLState cxl_devices_state;
> bool legacy_smmuv3_present;
> + bool pci_preserve_config;
> };
>
> #define VIRT_ECAM_ID(high) (high ? VIRT_HIGH_PCIE_ECAM : VIRT_PCIE_ECAM)
^ permalink raw reply [flat|nested] 118+ messages in thread
* [PATCH v4 18/27] hw/arm/virt-acpi-build: Add IORT RMR regions to handle MSI nested binding
2025-09-29 13:36 [PATCH v4 00/27] hw/arm/virt: Add support for user-creatable accelerated SMMUv3 Shameer Kolothum
` (16 preceding siblings ...)
2025-09-29 13:36 ` [PATCH v4 17/27] hw/arm/virt: Set PCI preserve_config for accel SMMUv3 Shameer Kolothum
@ 2025-09-29 13:36 ` Shameer Kolothum
2025-10-01 13:30 ` Jonathan Cameron via
2025-09-29 13:36 ` [PATCH v4 19/27] hw/arm/smmuv3-accel: Install S1 bypass hwpt on reset Shameer Kolothum
` (9 subsequent siblings)
27 siblings, 1 reply; 118+ messages in thread
From: Shameer Kolothum @ 2025-09-29 13:36 UTC (permalink / raw)
To: qemu-arm, qemu-devel
Cc: eric.auger, peter.maydell, jgg, nicolinc, ddutile, berrange,
nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu,
shameerkolothum
From: Eric Auger <eric.auger@redhat.com>
To handle SMMUv3 nested stage support, it is practical to expose to the
guest reserved memory regions (RMRs) covering the IOVAs used by the host
kernel to map physical MSI doorbells.
Those IOVAs belong to [0x8000000, 0x8100000] matching MSI_IOVA_BASE and
MSI_IOVA_LENGTH definitions in kernel arm-smmu-v3 driver. This is the
window used to allocate IOVAs matching physical MSI doorbells.
With those RMRs, the guest is forced to use a flat mapping for this range.
Hence the assigned device is programmed with one IOVA from this range.
Stage 1, owned by the guest, has a flat mapping for this IOVA. Stage 2,
owned by the VMM, then enforces a mapping from this IOVA to the physical
MSI doorbell.
The creation of those RMR nodes is only relevant if nested stage SMMU is
in use, along with VFIO. As VFIO devices can be hotplugged, all RMRs need
to be created in advance.
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Suggested-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
hw/arm/virt-acpi-build.c | 75 ++++++++++++++++++++++++++++++++++++----
1 file changed, 68 insertions(+), 7 deletions(-)
diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
index fd03b7f6a9..d0c1e10019 100644
--- a/hw/arm/virt-acpi-build.c
+++ b/hw/arm/virt-acpi-build.c
@@ -264,7 +264,8 @@ static void acpi_dsdt_add_tpm(Aml *scope, VirtMachineState *vms)
* Note that @id_count gets internally subtracted by one, following the spec.
*/
static void build_iort_id_mapping(GArray *table_data, uint32_t input_base,
- uint32_t id_count, uint32_t out_ref)
+ uint32_t id_count, uint32_t out_ref,
+ uint32_t flags)
{
build_append_int_noprefix(table_data, input_base, 4); /* Input base */
/* Number of IDs - The number of IDs in the range minus one */
@@ -272,7 +273,7 @@ static void build_iort_id_mapping(GArray *table_data, uint32_t input_base,
build_append_int_noprefix(table_data, input_base, 4); /* Output base */
build_append_int_noprefix(table_data, out_ref, 4); /* Output Reference */
/* Flags */
- build_append_int_noprefix(table_data, 0 /* Single mapping (disabled) */, 4);
+ build_append_int_noprefix(table_data, flags, 4);
}
struct AcpiIortIdMapping {
@@ -320,6 +321,7 @@ typedef struct AcpiIortSMMUv3Dev {
GArray *rc_smmu_idmaps;
/* Offset of the SMMUv3 IORT Node relative to the start of the IORT */
size_t offset;
+ bool accel;
} AcpiIortSMMUv3Dev;
/*
@@ -374,6 +376,7 @@ static int iort_smmuv3_devices(Object *obj, void *opaque)
}
bus = PCI_BUS(object_property_get_link(obj, "primary-bus", &error_abort));
+ sdev.accel = object_property_get_bool(obj, "accel", &error_abort);
pbus = PLATFORM_BUS_DEVICE(vms->platform_bus_dev);
sbdev = SYS_BUS_DEVICE(obj);
sdev.base = platform_bus_get_mmio_addr(pbus, sbdev, 0);
@@ -447,6 +450,56 @@ static void create_rc_its_idmaps(GArray *its_idmaps, GArray *smmuv3_devs)
}
}
+static void
+build_iort_rmr_nodes(GArray *table_data, GArray *smmuv3_devices, uint32_t *id)
+{
+ AcpiIortSMMUv3Dev *sdev;
+ AcpiIortIdMapping *idmap;
+ int i;
+
+ for (i = 0; i < smmuv3_devices->len; i++) {
+ sdev = &g_array_index(smmuv3_devices, AcpiIortSMMUv3Dev, i);
+ idmap = &g_array_index(sdev->rc_smmu_idmaps, AcpiIortIdMapping, 0);
+ int bdf = idmap->input_base;
+
+ if (!sdev->accel) {
+ continue;
+ }
+
+ /* Table 18 Reserved Memory Range Node */
+ build_append_int_noprefix(table_data, 6 /* RMR */, 1); /* Type */
+ /* Length */
+ build_append_int_noprefix(table_data, 28 + ID_MAPPING_ENTRY_SIZE + 20,
+ 2);
+ build_append_int_noprefix(table_data, 3, 1); /* Revision */
+ build_append_int_noprefix(table_data, *id, 4); /* Identifier */
+ /* Number of ID mappings */
+ build_append_int_noprefix(table_data, 1, 4);
+ /* Reference to ID Array */
+ build_append_int_noprefix(table_data, 28, 4);
+
+ /* RMR specific data */
+
+ /* Flags */
+ build_append_int_noprefix(table_data, 0 /* Disallow remapping */, 4);
+ /* Number of Memory Range Descriptors */
+ build_append_int_noprefix(table_data, 1 , 4);
+ /* Reference to Memory Range Descriptors */
+ build_append_int_noprefix(table_data, 28 + ID_MAPPING_ENTRY_SIZE, 4);
+ build_iort_id_mapping(table_data, bdf, idmap->id_count, sdev->offset,
+ 1);
+
+ /* Table 19 Memory Range Descriptor */
+
+ /* Physical Range offset */
+ build_append_int_noprefix(table_data, 0x8000000, 8);
+ /* Physical Range length */
+ build_append_int_noprefix(table_data, 0x100000, 8);
+ build_append_int_noprefix(table_data, 0, 4); /* Reserved */
+ *id += 1;
+ }
+}
+
/*
* Input Output Remapping Table (IORT)
* Conforms to "IO Remapping Table System Software on ARM Platforms",
@@ -464,7 +517,7 @@ build_iort(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
GArray *smmuv3_devs = g_array_new(false, true, sizeof(AcpiIortSMMUv3Dev));
GArray *rc_its_idmaps = g_array_new(false, true, sizeof(AcpiIortIdMapping));
- AcpiTable table = { .sig = "IORT", .rev = 3, .oem_id = vms->oem_id,
+ AcpiTable table = { .sig = "IORT", .rev = 5, .oem_id = vms->oem_id,
.oem_table_id = vms->oem_table_id };
/* Table 2 The IORT */
acpi_table_begin(&table, table_data);
@@ -490,6 +543,13 @@ build_iort(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
nb_nodes++; /* ITS */
rc_mapping_count += rc_its_idmaps->len;
}
+ /* Calculate RMR nodes required. One per SMMUv3 with accelerated mode */
+ for (i = 0; i < num_smmus; i++) {
+ sdev = &g_array_index(smmuv3_devs, AcpiIortSMMUv3Dev, i);
+ if (sdev->accel) {
+ nb_nodes++;
+ }
+ }
} else {
if (vms->its) {
nb_nodes = 2; /* RC and ITS */
@@ -562,7 +622,7 @@ build_iort(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
/* Array of ID mappings */
if (smmu_mapping_count) {
/* Output IORT node is the ITS Group node (the first node). */
- build_iort_id_mapping(table_data, 0, 0x10000, IORT_NODE_OFFSET);
+ build_iort_id_mapping(table_data, 0, 0x10000, IORT_NODE_OFFSET, 0);
}
}
@@ -614,7 +674,7 @@ build_iort(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
AcpiIortIdMapping, j);
/* Output IORT node is the SMMUv3 node. */
build_iort_id_mapping(table_data, range->input_base,
- range->id_count, sdev->offset);
+ range->id_count, sdev->offset, 0);
}
}
@@ -627,7 +687,7 @@ build_iort(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
range = &g_array_index(rc_its_idmaps, AcpiIortIdMapping, i);
/* Output IORT node is the ITS Group node (the first node). */
build_iort_id_mapping(table_data, range->input_base,
- range->id_count, IORT_NODE_OFFSET);
+ range->id_count, IORT_NODE_OFFSET, 0);
}
}
} else {
@@ -636,9 +696,10 @@ build_iort(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
* SMMU: RC -> ITS.
* Output IORT node is the ITS Group node (the first node).
*/
- build_iort_id_mapping(table_data, 0, 0x10000, IORT_NODE_OFFSET);
+ build_iort_id_mapping(table_data, 0, 0x10000, IORT_NODE_OFFSET, 0);
}
+ build_iort_rmr_nodes(table_data, smmuv3_devs, &id);
acpi_table_end(linker, &table);
g_array_free(rc_its_idmaps, true);
for (i = 0; i < num_smmus; i++) {
--
2.43.0
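The RMR descriptor above hard-codes the MSI IOVA window from the commit
message. A small sketch of the window arithmetic, with constants mirroring
MSI_IOVA_BASE/MSI_IOVA_LENGTH in the kernel arm-smmu-v3 driver (the helper
itself is illustrative, not QEMU code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mirrors MSI_IOVA_BASE/MSI_IOVA_LENGTH in the kernel arm-smmu-v3 driver. */
#define MSI_IOVA_BASE   0x8000000ULL
#define MSI_IOVA_LENGTH 0x100000ULL

/* Illustrative helper: does an IOVA fall in the reserved MSI window? */
static bool iova_in_msi_window(uint64_t iova)
{
    return iova >= MSI_IOVA_BASE && iova < MSI_IOVA_BASE + MSI_IOVA_LENGTH;
}
```

Because the RMR forces a flat Stage 1 mapping over exactly this range, any
IOVA the host allocates for an MSI doorbell translates 1:1 at Stage 1 and is
then remapped to the physical doorbell at Stage 2.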
^ permalink raw reply related	[flat|nested] 118+ messages in thread
* Re: [PATCH v4 18/27] hw/arm/virt-acpi-build: Add IORT RMR regions to handle MSI nested binding
2025-09-29 13:36 ` [PATCH v4 18/27] hw/arm/virt-acpi-build: Add IORT RMR regions to handle MSI nested binding Shameer Kolothum
@ 2025-10-01 13:30 ` Jonathan Cameron via
0 siblings, 0 replies; 118+ messages in thread
From: Jonathan Cameron via @ 2025-10-01 13:30 UTC (permalink / raw)
To: Shameer Kolothum
Cc: qemu-arm, qemu-devel, eric.auger, peter.maydell, jgg, nicolinc,
ddutile, berrange, nathanc, mochs, smostafa, wangzhou1,
jiangkunkun, zhangfei.gao, zhenzhong.duan, yi.l.liu,
shameerkolothum
On Mon, 29 Sep 2025 14:36:34 +0100
Shameer Kolothum <skolothumtho@nvidia.com> wrote:
> From: Eric Auger <eric.auger@redhat.com>
>
> To handle SMMUv3 nested stage support it is practical to expose the guest
> with reserved memory regions (RMRs) covering the IOVAs used by the host
> kernel to map physical MSI doorbells.
>
> Those IOVAs belong to [0x8000000, 0x8100000] matching MSI_IOVA_BASE and
> MSI_IOVA_LENGTH definitions in kernel arm-smmu-v3 driver. This is the
> window used to allocate IOVAs matching physical MSI doorbells.
>
> With those RMRs, the guest is forced to use a flat mapping for this range.
> Hence the assigned device is programmed with one IOVA from this range.
> Stage 1, owned by the guest has a flat mapping for this IOVA. Stage2,
> owned by the VMM then enforces a mapping from this IOVA to the physical
> MSI doorbell.
>
> The creation of those RMR nodes is only relevant if nested stage SMMU is
> in use, along with VFIO. As VFIO devices can be hotplugged, all RMRs need
> to be created in advance.
>
> Signed-off-by: Eric Auger <eric.auger@redhat.com>
> Suggested-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
Various comments inline on references, and on making the code a little more
resilient and obvious wrt the things that there happen to be one of here but
which the spec allows multiples of.
> ---
> hw/arm/virt-acpi-build.c | 75 ++++++++++++++++++++++++++++++++++++----
> 1 file changed, 68 insertions(+), 7 deletions(-)
>
> diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
> index fd03b7f6a9..d0c1e10019 100644
> --- a/hw/arm/virt-acpi-build.c
> +++ b/hw/arm/virt-acpi-build.c
> @@ -447,6 +450,56 @@ static void create_rc_its_idmaps(GArray *its_idmaps, GArray *smmuv3_devs)
> }
> }
>
> +static void
> +build_iort_rmr_nodes(GArray *table_data, GArray *smmuv3_devices, uint32_t *id)
> +{
> + AcpiIortSMMUv3Dev *sdev;
> + AcpiIortIdMapping *idmap;
> + int i;
> +
> + for (i = 0; i < smmuv3_devices->len; i++) {
> + sdev = &g_array_index(smmuv3_devices, AcpiIortSMMUv3Dev, i);
> + idmap = &g_array_index(sdev->rc_smmu_idmaps, AcpiIortIdMapping, 0);
> + int bdf = idmap->input_base;
> +
> + if (!sdev->accel) {
> + continue;
> + }
> +
> + /* Table 18 Reserved Memory Range Node */
Reference a spec version somewhere. Table numbers tend to change over time.
Table 18 in the E.g version of the spec is the Root Complex Node, for example.
This is now table 19.
> + build_append_int_noprefix(table_data, 6 /* RMR */, 1); /* Type */
> + /* Length */
> + build_append_int_noprefix(table_data, 28 + ID_MAPPING_ENTRY_SIZE + 20,
Add something to justify that + 20 (which I think is the size of the Memory
Range Descriptor?). The +28 is to the start of the bit after the RMR specific
data, so that is kind of fine. As below, I'd add a constant for the number of
ID mapping entries.
> + 2);
> + build_append_int_noprefix(table_data, 3, 1); /* Revision */
> + build_append_int_noprefix(table_data, *id, 4); /* Identifier */
> + /* Number of ID mappings */
> + build_append_int_noprefix(table_data, 1, 4);
I'd define a constant for the number of these and use it to help build the size.
Sure it is 1, but even so that would make the logic of placement simpler I think.
> + /* Reference to ID Array */
> + build_append_int_noprefix(table_data, 28, 4);
> +
> + /* RMR specific data */
> +
> + /* Flags */
> + build_append_int_noprefix(table_data, 0 /* Disallow remapping */, 4);
Whilst we are disallowing remapping as this says, we are also saying a few
other things, as there are more bits in this flags field, such as that the
memory is nGnRnE.
> + /* Number of Memory Range Descriptors */
> + build_append_int_noprefix(table_data, 1 , 4);
I'd use a constant for this as well so that can use it in the size generation
and here.
> + /* Reference to Memory Range Descriptors */
> + build_append_int_noprefix(table_data, 28 + ID_MAPPING_ENTRY_SIZE, 4);
> + build_iort_id_mapping(table_data, bdf, idmap->id_count, sdev->offset,
> + 1);
> +
> + /* Table 19 Memory Range Descriptor */
As above numbers changed, so specific spec version in the references.
It's 20 in the spec I downloaded today.
> +
> + /* Physical Range offset */
> + build_append_int_noprefix(table_data, 0x8000000, 8);
Can we get these from any defines? Feels like making these values match in a
number of places is necessary, so we shouldn't really hard-code them here.
> + /* Physical Range length */
> + build_append_int_noprefix(table_data, 0x100000, 8);
> + build_append_int_noprefix(table_data, 0, 4); /* Reserved */
> + *id += 1;
Trivial but why not
(*id)++;
better yet, do it where it's used rather than leaving it down here.
That way, if in future multiple IDs are added for some reason, the increments
will go with the calls that add them.
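One caveat with that shorthand: postfix ++ binds tighter than unary *, so
*id++ would advance the pointer rather than the pointed-to identifier; the
parenthesised form (*id)++ is needed. A minimal demonstration (helper names
are illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* (*id)++ increments the pointed-to value. */
static uint32_t bump_value(uint32_t *id)
{
    (*id)++;
    return *id;
}

/* *id++ evaluates *id and then advances the POINTER, leaving the value alone. */
static uint32_t *bump_pointer(uint32_t *id)
{
    *id++;
    return id;
}
```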
> + }
> +}
> +
> /*
> * Input Output Remapping Table (IORT)
> * Conforms to "IO Remapping Table System Software on ARM Platforms",
> @@ -464,7 +517,7 @@ build_iort(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
> GArray *smmuv3_devs = g_array_new(false, true, sizeof(AcpiIortSMMUv3Dev));
> GArray *rc_its_idmaps = g_array_new(false, true, sizeof(AcpiIortIdMapping));
>
> - AcpiTable table = { .sig = "IORT", .rev = 3, .oem_id = vms->oem_id,
> + AcpiTable table = { .sig = "IORT", .rev = 5, .oem_id = vms->oem_id,
> .oem_table_id = vms->oem_table_id };
Seems to be missing the BIOS table test updates that will break with that version uptick.
Probably add a doc reference here so we can keep aligned. The spec E.g has reached revision 7
whilst this work has been ongoing.
Jonathan
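The named-constant suggestion can be sketched as below. The 28-byte fixed
part, the 20-byte ID mapping entry (QEMU's ID_MAPPING_ENTRY_SIZE, assumed to
be 20 here) and the 20-byte Memory Range Descriptor follow the IORT RMR node
layout; the constant and helper names are illustrative, not QEMU's:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative constants for the IORT RMR node layout (names not QEMU's). */
#define RMR_NODE_HEADER_SIZE    28 /* fixed part up to the RMR-specific data */
#define ID_MAPPING_ENTRY_SIZE   20 /* one ID mapping entry (5 x 4 bytes) */
#define MEM_RANGE_DESC_SIZE     20 /* base (8) + length (8) + reserved (4) */
#define RMR_NUM_ID_MAPPINGS      1
#define RMR_NUM_MEM_RANGE_DESCS  1

static uint16_t rmr_node_length(void)
{
    return RMR_NODE_HEADER_SIZE +
           RMR_NUM_ID_MAPPINGS * ID_MAPPING_ENTRY_SIZE +
           RMR_NUM_MEM_RANGE_DESCS * MEM_RANGE_DESC_SIZE;
}

/* The Memory Range Descriptors start right after the ID mapping array. */
static uint32_t rmr_mem_range_desc_offset(void)
{
    return RMR_NODE_HEADER_SIZE +
           RMR_NUM_ID_MAPPINGS * ID_MAPPING_ENTRY_SIZE;
}
```

This matches the 28 + ID_MAPPING_ENTRY_SIZE + 20 expression in the patch
while making each term self-documenting.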
^ permalink raw reply [flat|nested] 118+ messages in thread
* [PATCH v4 19/27] hw/arm/smmuv3-accel: Install S1 bypass hwpt on reset
2025-09-29 13:36 [PATCH v4 00/27] hw/arm/virt: Add support for user-creatable accelerated SMMUv3 Shameer Kolothum
` (17 preceding siblings ...)
2025-09-29 13:36 ` [PATCH v4 18/27] hw/arm/virt-acpi-build: Add IORT RMR regions to handle MSI nested binding Shameer Kolothum
@ 2025-09-29 13:36 ` Shameer Kolothum
2025-10-01 13:32 ` Jonathan Cameron via
2025-10-16 23:19 ` Nicolin Chen
2025-09-29 13:36 ` [PATCH v4 20/27] hw/arm/smmuv3: Add accel property for SMMUv3 device Shameer Kolothum
` (8 subsequent siblings)
27 siblings, 2 replies; 118+ messages in thread
From: Shameer Kolothum @ 2025-09-29 13:36 UTC (permalink / raw)
To: qemu-arm, qemu-devel
Cc: eric.auger, peter.maydell, jgg, nicolinc, ddutile, berrange,
nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu,
shameerkolothum
When the guest reboots with devices in nested mode (S1 + S2), any QEMU/UEFI
access to those devices can fail because S1 translation is not valid during
the reboot. For example, a passthrough NVMe device may hold GRUB boot info
that UEFI tries to read during the reboot.
Set S1 to bypass mode during reset to avoid such failures.
Reported-by: Matthew R. Ochs <mochs@nvidia.com>
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
hw/arm/smmuv3-accel.c | 29 +++++++++++++++++++++++++++++
hw/arm/smmuv3-accel.h | 4 ++++
hw/arm/smmuv3.c | 1 +
3 files changed, 34 insertions(+)
diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
index defeddbd8c..8396053a6c 100644
--- a/hw/arm/smmuv3-accel.c
+++ b/hw/arm/smmuv3-accel.c
@@ -634,6 +634,35 @@ static const PCIIOMMUOps smmuv3_accel_ops = {
.get_msi_address_space = smmuv3_accel_find_msi_as,
};
+/*
+ * If the guest reboots and devices are configured for S1+S2, Stage1 must
+ * be switched to bypass. Otherwise, QEMU/UEFI may fail when accessing a
+ * device, e.g. when UEFI retrieves boot partition information from an
+ * assigned vfio-pci NVMe device.
+ */
+void smmuv3_accel_attach_bypass_hwpt(SMMUv3State *s)
+{
+ SMMUv3AccelDevice *accel_dev;
+ SMMUViommu *viommu;
+
+ if (!s->accel || !s->s_accel->viommu) {
+ return;
+ }
+
+ viommu = s->s_accel->viommu;
+ QLIST_FOREACH(accel_dev, &viommu->device_list, next) {
+ if (!accel_dev->vdev) {
+ continue;
+ }
+ if (!host_iommu_device_iommufd_attach_hwpt(accel_dev->idev,
+ viommu->bypass_hwpt_id,
+ NULL)) {
+ error_report("Failed to install bypass hwpt id %u for dev id %u",
+ viommu->bypass_hwpt_id, accel_dev->idev->devid);
+ }
+ }
+}
+
void smmuv3_accel_init(SMMUv3State *s)
{
SMMUState *bs = ARM_SMMU(s);
diff --git a/hw/arm/smmuv3-accel.h b/hw/arm/smmuv3-accel.h
index 3bdba47616..75f858e34a 100644
--- a/hw/arm/smmuv3-accel.h
+++ b/hw/arm/smmuv3-accel.h
@@ -48,6 +48,7 @@ bool smmuv3_accel_install_nested_ste_range(SMMUv3State *s, SMMUSIDRange *range,
Error **errp);
bool smmuv3_accel_issue_inv_cmd(SMMUv3State *s, void *cmd, SMMUDevice *sdev,
Error **errp);
+void smmuv3_accel_attach_bypass_hwpt(SMMUv3State *s);
#else
static inline void smmuv3_accel_init(SMMUv3State *s)
{
@@ -70,6 +71,9 @@ smmuv3_accel_issue_inv_cmd(SMMUv3State *s, void *cmd, SMMUDevice *sdev,
{
return true;
}
+static inline void smmuv3_accel_attach_bypass_hwpt(SMMUv3State *s)
+{
+}
#endif
#endif /* HW_ARM_SMMUV3_ACCEL_H */
diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index 5830cf5a03..94b2bbc374 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -1913,6 +1913,7 @@ static void smmu_reset_exit(Object *obj, ResetType type)
if (c->parent_phases.exit) {
c->parent_phases.exit(obj, type);
}
+ smmuv3_accel_attach_bypass_hwpt(s);
}
static void smmu_realize(DeviceState *d, Error **errp)
--
2.43.0
* Re: [PATCH v4 19/27] hw/arm/smmuv3-accel: Install S1 bypass hwpt on reset
2025-09-29 13:36 ` [PATCH v4 19/27] hw/arm/smmuv3-accel: Install S1 bypass hwpt on reset Shameer Kolothum
@ 2025-10-01 13:32 ` Jonathan Cameron via
2025-10-16 23:19 ` Nicolin Chen
1 sibling, 0 replies; 118+ messages in thread
From: Jonathan Cameron via @ 2025-10-01 13:32 UTC (permalink / raw)
To: Shameer Kolothum
Cc: qemu-arm, qemu-devel, eric.auger, peter.maydell, jgg, nicolinc,
ddutile, berrange, nathanc, mochs, smostafa, wangzhou1,
jiangkunkun, zhangfei.gao, zhenzhong.duan, yi.l.liu,
shameerkolothum
On Mon, 29 Sep 2025 14:36:35 +0100
Shameer Kolothum <skolothumtho@nvidia.com> wrote:
> When the guest reboots with devices in nested mode (S1 + S2), any QEMU/UEFI
> access to those devices can fail because S1 translation is not valid during
> the reboot. For example, a passthrough NVMe device may hold GRUB boot info
> that UEFI tries to read during the reboot.
>
> Set S1 to bypass mode during reset to avoid such failures.
>
> Reported-by: Matthew R. Ochs <mochs@nvidia.com>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
Seems reasonable.
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
* Re: [PATCH v4 19/27] hw/arm/smmuv3-accel: Install S1 bypass hwpt on reset
2025-09-29 13:36 ` [PATCH v4 19/27] hw/arm/smmuv3-accel: Install S1 bypass hwpt on reset Shameer Kolothum
2025-10-01 13:32 ` Jonathan Cameron via
@ 2025-10-16 23:19 ` Nicolin Chen
2025-10-20 14:22 ` Shameer Kolothum
1 sibling, 1 reply; 118+ messages in thread
From: Nicolin Chen @ 2025-10-16 23:19 UTC (permalink / raw)
To: Shameer Kolothum
Cc: qemu-arm, qemu-devel, eric.auger, peter.maydell, jgg, ddutile,
berrange, nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu,
shameerkolothum
On Mon, Sep 29, 2025 at 02:36:35PM +0100, Shameer Kolothum wrote:
> When the guest reboots with devices in nested mode (S1 + S2), any QEMU/UEFI
> access to those devices can fail because S1 translation is not valid during
> the reboot. For example, a passthrough NVMe device may hold GRUB boot info
> that UEFI tries to read during the reboot.
>
> Set S1 to bypass mode during reset to avoid such failures.
GBPA is set to bypass on reset so I think it's fine. Yet, maybe the
code should check that.
> Reported-by: Matthew R. Ochs <mochs@nvidia.com>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
> ---
> hw/arm/smmuv3-accel.c | 29 +++++++++++++++++++++++++++++
> hw/arm/smmuv3-accel.h | 4 ++++
> hw/arm/smmuv3.c | 1 +
> 3 files changed, 34 insertions(+)
>
> diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
> index defeddbd8c..8396053a6c 100644
> --- a/hw/arm/smmuv3-accel.c
> +++ b/hw/arm/smmuv3-accel.c
> @@ -634,6 +634,35 @@ static const PCIIOMMUOps smmuv3_accel_ops = {
> .get_msi_address_space = smmuv3_accel_find_msi_as,
> };
>
> +/*
> + * If the guest reboots and devices are configured for S1+S2, Stage1 must
> + * be switched to bypass. Otherwise, QEMU/UEFI may fail when accessing a
> + * device, e.g. when UEFI retrieves boot partition information from an
> + * assigned vfio-pci NVMe device.
> + */
> +void smmuv3_accel_attach_bypass_hwpt(SMMUv3State *s)
We could rename it to something like smmuv3_accel_reset().
Nicolin
* RE: [PATCH v4 19/27] hw/arm/smmuv3-accel: Install S1 bypass hwpt on reset
2025-10-16 23:19 ` Nicolin Chen
@ 2025-10-20 14:22 ` Shameer Kolothum
0 siblings, 0 replies; 118+ messages in thread
From: Shameer Kolothum @ 2025-10-20 14:22 UTC (permalink / raw)
To: Nicolin Chen
Cc: qemu-arm@nongnu.org, qemu-devel@nongnu.org, eric.auger@redhat.com,
peter.maydell@linaro.org, Jason Gunthorpe, ddutile@redhat.com,
berrange@redhat.com, Nathan Chen, Matt Ochs, smostafa@google.com,
wangzhou1@hisilicon.com, jiangkunkun@huawei.com,
jonathan.cameron@huawei.com, zhangfei.gao@linaro.org,
zhenzhong.duan@intel.com, yi.l.liu@intel.com,
shameerkolothum@gmail.com
> -----Original Message-----
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: 17 October 2025 00:20
> To: Shameer Kolothum <skolothumtho@nvidia.com>
> Cc: qemu-arm@nongnu.org; qemu-devel@nongnu.org;
> eric.auger@redhat.com; peter.maydell@linaro.org; Jason Gunthorpe
> <jgg@nvidia.com>; ddutile@redhat.com; berrange@redhat.com; Nathan
> Chen <nathanc@nvidia.com>; Matt Ochs <mochs@nvidia.com>;
> smostafa@google.com; wangzhou1@hisilicon.com;
> jiangkunkun@huawei.com; jonathan.cameron@huawei.com;
> zhangfei.gao@linaro.org; zhenzhong.duan@intel.com; yi.l.liu@intel.com;
> shameerkolothum@gmail.com
> Subject: Re: [PATCH v4 19/27] hw/arm/smmuv3-accel: Install S1 bypass hwpt
> on reset
>
> On Mon, Sep 29, 2025 at 02:36:35PM +0100, Shameer Kolothum wrote:
> > When the guest reboots with devices in nested mode (S1 + S2), any
> > QEMU/UEFI access to those devices can fail because S1 translation is
> > not valid during the reboot. For example, a passthrough NVMe device
> > may hold GRUB boot info that UEFI tries to read during the reboot.
> >
> > Set S1 to bypass mode during reset to avoid such failures.
>
> GBPA is set to bypass on reset so I think it's fine. Yet, maybe the code should
> check that.
Looking at it again, I think it doesn't now, as I moved smmuv3_init_regs() to
smmu_realize() in patch #14 and it is not in the smmu_reset_exit() path anymore.
I need to carve out the IDR init separately. I will do that in v5.
> > Reported-by: Matthew R. Ochs <mochs@nvidia.com>
> > Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
> > ---
> > hw/arm/smmuv3-accel.c | 29 +++++++++++++++++++++++++++++
> > hw/arm/smmuv3-accel.h | 4 ++++
> > hw/arm/smmuv3.c | 1 +
> > 3 files changed, 34 insertions(+)
> >
> > diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c index
> > defeddbd8c..8396053a6c 100644
> > --- a/hw/arm/smmuv3-accel.c
> > +++ b/hw/arm/smmuv3-accel.c
> > @@ -634,6 +634,35 @@ static const PCIIOMMUOps smmuv3_accel_ops = {
> > .get_msi_address_space = smmuv3_accel_find_msi_as, };
> >
> > +/*
> > + * If the guest reboots and devices are configured for S1+S2, Stage1
> > +must
> > + * be switched to bypass. Otherwise, QEMU/UEFI may fail when
> > +accessing a
> > + * device, e.g. when UEFI retrieves boot partition information from
> > +an
> > + * assigned vfio-pci NVMe device.
> > + */
> > +void smmuv3_accel_attach_bypass_hwpt(SMMUv3State *s)
>
> We could rename it to something like smmuv3_accel_reset().
Makes sense.
Thanks,
Shameer
* [PATCH v4 20/27] hw/arm/smmuv3: Add accel property for SMMUv3 device
2025-09-29 13:36 [PATCH v4 00/27] hw/arm/virt: Add support for user-creatable accelerated SMMUv3 Shameer Kolothum
` (18 preceding siblings ...)
2025-09-29 13:36 ` [PATCH v4 19/27] hw/arm/smmuv3-accel: Install S1 bypass hwpt on reset Shameer Kolothum
@ 2025-09-29 13:36 ` Shameer Kolothum
2025-10-16 21:48 ` Nicolin Chen
2025-09-29 13:36 ` [PATCH v4 21/27] hw/arm/smmuv3-accel: Add a property to specify RIL support Shameer Kolothum
` (7 subsequent siblings)
27 siblings, 1 reply; 118+ messages in thread
From: Shameer Kolothum @ 2025-09-29 13:36 UTC (permalink / raw)
To: qemu-arm, qemu-devel
Cc: eric.auger, peter.maydell, jgg, nicolinc, ddutile, berrange,
nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu,
shameerkolothum
Introduce an “accel” property to enable accelerator mode.
Live migration is currently unsupported when accelerator mode is enabled,
so a migration blocker is added.
Because this mode relies on IORT RMR for MSI support, accelerator mode is
not supported for device tree boot.
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
hw/arm/smmuv3.c | 28 ++++++++++++++++++++++++++++
hw/arm/virt.c | 19 ++++++++++++-------
include/hw/arm/smmuv3.h | 1 +
3 files changed, 41 insertions(+), 7 deletions(-)
diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index 94b2bbc374..a0f704fc35 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -20,6 +20,7 @@
#include "qemu/bitops.h"
#include "hw/irq.h"
#include "hw/sysbus.h"
+#include "migration/blocker.h"
#include "migration/vmstate.h"
#include "hw/qdev-properties.h"
#include "hw/qdev-core.h"
@@ -1916,6 +1917,17 @@ static void smmu_reset_exit(Object *obj, ResetType type)
smmuv3_accel_attach_bypass_hwpt(s);
}
+static bool smmu_validate_property(SMMUv3State *s, Error **errp)
+{
+#ifndef CONFIG_ARM_SMMUV3_ACCEL
+ if (s->accel) {
+ error_setg(errp, "accel=on support not compiled in");
+ return false;
+ }
+#endif
+ return true;
+}
+
static void smmu_realize(DeviceState *d, Error **errp)
{
SMMUState *sys = ARM_SMMU(d);
@@ -1924,8 +1936,17 @@ static void smmu_realize(DeviceState *d, Error **errp)
SysBusDevice *dev = SYS_BUS_DEVICE(d);
Error *local_err = NULL;
+ if (!smmu_validate_property(s, errp)) {
+ return;
+ }
+
if (s->accel) {
smmuv3_accel_init(s);
+ error_setg(&s->migration_blocker, "Migration not supported with SMMUv3 "
+ "accelerator mode enabled");
+ if (migrate_add_blocker(&s->migration_blocker, errp) < 0) {
+ return;
+ }
}
c->parent_realize(d, &local_err);
@@ -2025,6 +2046,7 @@ static const Property smmuv3_properties[] = {
* Defaults to stage 1
*/
DEFINE_PROP_STRING("stage", SMMUv3State, stage),
+ DEFINE_PROP_BOOL("accel", SMMUv3State, accel, false),
};
static void smmuv3_instance_init(Object *obj)
@@ -2046,6 +2068,12 @@ static void smmuv3_class_init(ObjectClass *klass, const void *data)
device_class_set_props(dc, smmuv3_properties);
dc->hotpluggable = false;
dc->user_creatable = true;
+
+ object_class_property_set_description(klass,
+ "accel",
+ "Enable SMMUv3 accelerator support."
+ "Allows host SMMUv3 to be configured "
+ "in nested mode for vfio-pci dev assignment");
}
static int smmuv3_notify_flag_changed(IOMMUMemoryRegion *iommu,
diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 6467d7cfc8..6b789fd7b5 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -1474,8 +1474,8 @@ static void create_smmuv3_dt_bindings(const VirtMachineState *vms, hwaddr base,
g_free(node);
}
-static void create_smmuv3_dev_dtb(VirtMachineState *vms,
- DeviceState *dev, PCIBus *bus)
+static void create_smmuv3_dev_dtb(VirtMachineState *vms, DeviceState *dev,
+ PCIBus *bus, Error **errp)
{
PlatformBusDevice *pbus = PLATFORM_BUS_DEVICE(vms->platform_bus_dev);
SysBusDevice *sbdev = SYS_BUS_DEVICE(dev);
@@ -1483,10 +1483,15 @@ static void create_smmuv3_dev_dtb(VirtMachineState *vms,
hwaddr base = platform_bus_get_mmio_addr(pbus, sbdev, 0);
MachineState *ms = MACHINE(vms);
- if (!(vms->bootinfo.firmware_loaded && virt_is_acpi_enabled(vms)) &&
- strcmp("pcie.0", bus->qbus.name)) {
- warn_report("SMMUv3 device only supported with pcie.0 for DT");
- return;
+ if (!(vms->bootinfo.firmware_loaded && virt_is_acpi_enabled(vms))) {
+ if (object_property_get_bool(OBJECT(dev), "accel", &error_abort)) {
+ error_setg(errp, "SMMUv3 with accel=on not supported for DT");
+ return;
+ }
+ if (strcmp("pcie.0", bus->qbus.name)) {
+ warn_report("SMMUv3 device only supported with pcie.0 for DT");
+ return;
+ }
}
base += vms->memmap[VIRT_PLATFORM_BUS].base;
irq += vms->irqmap[VIRT_PLATFORM_BUS];
@@ -3086,7 +3091,7 @@ static void virt_machine_device_plug_cb(HotplugHandler *hotplug_dev,
}
}
- create_smmuv3_dev_dtb(vms, dev, bus);
+ create_smmuv3_dev_dtb(vms, dev, bus, errp);
if (object_property_get_bool(OBJECT(dev), "accel", &error_abort) &&
!vms->pci_preserve_config) {
vms->pci_preserve_config = true;
diff --git a/include/hw/arm/smmuv3.h b/include/hw/arm/smmuv3.h
index 5f3e9089a7..874cbaaf32 100644
--- a/include/hw/arm/smmuv3.h
+++ b/include/hw/arm/smmuv3.h
@@ -67,6 +67,7 @@ struct SMMUv3State {
/* SMMU has HW accelerator support for nested S1 + s2 */
bool accel;
struct SMMUv3AccelState *s_accel;
+ Error *migration_blocker;
};
typedef enum {
--
2.43.0
* Re: [PATCH v4 20/27] hw/arm/smmuv3: Add accel property for SMMUv3 device
2025-09-29 13:36 ` [PATCH v4 20/27] hw/arm/smmuv3: Add accel property for SMMUv3 device Shameer Kolothum
@ 2025-10-16 21:48 ` Nicolin Chen
0 siblings, 0 replies; 118+ messages in thread
From: Nicolin Chen @ 2025-10-16 21:48 UTC (permalink / raw)
To: Shameer Kolothum
Cc: qemu-arm, qemu-devel, eric.auger, peter.maydell, jgg, ddutile,
berrange, nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu,
shameerkolothum
On Mon, Sep 29, 2025 at 02:36:36PM +0100, Shameer Kolothum wrote:
> Introduce an “accel” property to enable accelerator mode.
Looks better with ASCII quotation marks: "accel".
> Live migration is currently unsupported when accelerator mode is enabled,
> so a migration blocker is added.
>
> Because this mode relies on IORT RMR for MSI support, accelerator mode is
> not supported for device tree boot.
>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
> @@ -67,6 +67,7 @@ struct SMMUv3State {
> /* SMMU has HW accelerator support for nested S1 + s2 */
> bool accel;
> struct SMMUv3AccelState *s_accel;
> + Error *migration_blocker;
No need of double space.
* [PATCH v4 21/27] hw/arm/smmuv3-accel: Add a property to specify RIL support
2025-09-29 13:36 [PATCH v4 00/27] hw/arm/virt: Add support for user-creatable accelerated SMMUv3 Shameer Kolothum
` (19 preceding siblings ...)
2025-09-29 13:36 ` [PATCH v4 20/27] hw/arm/smmuv3: Add accel property for SMMUv3 device Shameer Kolothum
@ 2025-09-29 13:36 ` Shameer Kolothum
2025-10-01 13:39 ` Jonathan Cameron via
2025-10-17 8:48 ` Zhangfei Gao
2025-09-29 13:36 ` [PATCH v4 22/27] hw/arm/smmuv3-accel: Add support for ATS Shameer Kolothum
` (6 subsequent siblings)
27 siblings, 2 replies; 118+ messages in thread
From: Shameer Kolothum @ 2025-09-29 13:36 UTC (permalink / raw)
To: qemu-arm, qemu-devel
Cc: eric.auger, peter.maydell, jgg, nicolinc, ddutile, berrange,
nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu,
shameerkolothum
Currently QEMU SMMUv3 has RIL support by default. But if accelerated mode
is enabled, RIL has to be compatible with host SMMUv3 support.
Add a property so that the user can specify this.
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
hw/arm/smmuv3-accel.c | 16 ++++++++++++++--
hw/arm/smmuv3-accel.h | 4 ++++
hw/arm/smmuv3.c | 13 +++++++++++++
include/hw/arm/smmuv3.h | 1 +
4 files changed, 32 insertions(+), 2 deletions(-)
diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
index 8396053a6c..e8607b253e 100644
--- a/hw/arm/smmuv3-accel.c
+++ b/hw/arm/smmuv3-accel.c
@@ -79,10 +79,10 @@ smmuv3_accel_check_hw_compatible(SMMUv3State *s,
return false;
}
- /* QEMU SMMUv3 supports Range Invalidation by default */
+ /* User can override QEMU SMMUv3 Range Invalidation support */
val = FIELD_EX32(info->idr[3], IDR3, RIL);
if (val != FIELD_EX32(s->idr[3], IDR3, RIL)) {
- error_setg(errp, "Host SUMMUv3 deosn't support Range Invalidation");
+ error_setg(errp, "Host SMMUv3 differs in Range Invalidation support");
return false;
}
@@ -634,6 +634,18 @@ static const PCIIOMMUOps smmuv3_accel_ops = {
.get_msi_address_space = smmuv3_accel_find_msi_as,
};
+void smmuv3_accel_idr_override(SMMUv3State *s)
+{
+ if (!s->accel) {
+ return;
+ }
+
+ /* By default QEMU SMMUv3 has RIL. Update IDR3 if user has disabled it */
+ if (!s->ril) {
+ s->idr[3] = FIELD_DP32(s->idr[3], IDR3, RIL, 0);
+ }
+}
+
/*
* If the guest reboots and devices are configured for S1+S2, Stage1 must
* be switched to bypass. Otherwise, QEMU/UEFI may fail when accessing a
diff --git a/hw/arm/smmuv3-accel.h b/hw/arm/smmuv3-accel.h
index 75f858e34a..357d8352c5 100644
--- a/hw/arm/smmuv3-accel.h
+++ b/hw/arm/smmuv3-accel.h
@@ -49,6 +49,7 @@ bool smmuv3_accel_install_nested_ste_range(SMMUv3State *s, SMMUSIDRange *range,
bool smmuv3_accel_issue_inv_cmd(SMMUv3State *s, void *cmd, SMMUDevice *sdev,
Error **errp);
void smmuv3_accel_attach_bypass_hwpt(SMMUv3State *s);
+void smmuv3_accel_idr_override(SMMUv3State *s);
#else
static inline void smmuv3_accel_init(SMMUv3State *s)
{
@@ -74,6 +75,9 @@ smmuv3_accel_issue_inv_cmd(SMMUv3State *s, void *cmd, SMMUDevice *sdev,
static inline void smmuv3_accel_attach_bypass_hwpt(SMMUv3State *s)
{
}
+static inline void smmuv3_accel_idr_override(SMMUv3State *s)
+{
+}
#endif
#endif /* HW_ARM_SMMUV3_ACCEL_H */
diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index a0f704fc35..0f3a61646a 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -300,6 +300,8 @@ static void smmuv3_init_regs(SMMUv3State *s)
s->idr[5] = FIELD_DP32(s->idr[5], IDR5, GRAN16K, 1);
s->idr[5] = FIELD_DP32(s->idr[5], IDR5, GRAN64K, 1);
+ smmuv3_accel_idr_override(s);
+
s->cmdq.base = deposit64(s->cmdq.base, 0, 5, SMMU_CMDQS);
s->cmdq.prod = 0;
s->cmdq.cons = 0;
@@ -1925,6 +1927,13 @@ static bool smmu_validate_property(SMMUv3State *s, Error **errp)
return false;
}
#endif
+ if (s->accel) {
+ return true;
+ }
+ if (!s->ril) {
+ error_setg(errp, "ril can only be disabled if accel=on");
+ return false;
+ }
return true;
}
@@ -2047,6 +2056,8 @@ static const Property smmuv3_properties[] = {
*/
DEFINE_PROP_STRING("stage", SMMUv3State, stage),
DEFINE_PROP_BOOL("accel", SMMUv3State, accel, false),
+ /* RIL can be turned off for accel cases */
+ DEFINE_PROP_BOOL("ril", SMMUv3State, ril, true),
};
static void smmuv3_instance_init(Object *obj)
@@ -2074,6 +2085,8 @@ static void smmuv3_class_init(ObjectClass *klass, const void *data)
"Enable SMMUv3 accelerator support."
"Allows host SMMUv3 to be configured "
"in nested mode for vfio-pci dev assignment");
+ object_class_property_set_description(klass, "ril",
+ "Enable/disable range invalidation support");
}
static int smmuv3_notify_flag_changed(IOMMUMemoryRegion *iommu,
diff --git a/include/hw/arm/smmuv3.h b/include/hw/arm/smmuv3.h
index 874cbaaf32..c555ea684e 100644
--- a/include/hw/arm/smmuv3.h
+++ b/include/hw/arm/smmuv3.h
@@ -68,6 +68,7 @@ struct SMMUv3State {
bool accel;
struct SMMUv3AccelState *s_accel;
Error *migration_blocker;
+ bool ril;
};
typedef enum {
--
2.43.0
* Re: [PATCH v4 21/27] hw/arm/smmuv3-accel: Add a property to specify RIL support
2025-09-29 13:36 ` [PATCH v4 21/27] hw/arm/smmuv3-accel: Add a property to specify RIL support Shameer Kolothum
@ 2025-10-01 13:39 ` Jonathan Cameron via
2025-10-17 8:48 ` Zhangfei Gao
1 sibling, 0 replies; 118+ messages in thread
From: Jonathan Cameron via @ 2025-10-01 13:39 UTC (permalink / raw)
To: Shameer Kolothum
Cc: qemu-arm, qemu-devel, eric.auger, peter.maydell, jgg, nicolinc,
ddutile, berrange, nathanc, mochs, smostafa, wangzhou1,
jiangkunkun, zhangfei.gao, zhenzhong.duan, yi.l.liu,
shameerkolothum
On Mon, 29 Sep 2025 14:36:37 +0100
Shameer Kolothum <skolothumtho@nvidia.com> wrote:
> Currently QEMU SMMUv3 has RIL support by default. But if accelerated mode
> is enabled, RIL has to be compatible with host SMMUv3 support.
>
> Add a property so that the user can specify this.
>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
One trivial comment inline.
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> ---
> hw/arm/smmuv3-accel.c | 16 ++++++++++++++--
> hw/arm/smmuv3-accel.h | 4 ++++
> hw/arm/smmuv3.c | 13 +++++++++++++
> include/hw/arm/smmuv3.h | 1 +
> 4 files changed, 32 insertions(+), 2 deletions(-)
>
> diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
> index 8396053a6c..e8607b253e 100644
> --- a/hw/arm/smmuv3-accel.c
> +++ b/hw/arm/smmuv3-accel.c
> diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
> index a0f704fc35..0f3a61646a 100644
> --- a/hw/arm/smmuv3.c
> +++ b/hw/arm/smmuv3.c
> @@ -300,6 +300,8 @@ static void smmuv3_init_regs(SMMUv3State *s)
> s->idr[5] = FIELD_DP32(s->idr[5], IDR5, GRAN16K, 1);
> s->idr[5] = FIELD_DP32(s->idr[5], IDR5, GRAN64K, 1);
>
> + smmuv3_accel_idr_override(s);
> +
> s->cmdq.base = deposit64(s->cmdq.base, 0, 5, SMMU_CMDQS);
> s->cmdq.prod = 0;
> s->cmdq.cons = 0;
> @@ -1925,6 +1927,13 @@ static bool smmu_validate_property(SMMUv3State *s, Error **errp)
> return false;
> }
> #endif
> + if (s->accel) {
> + return true;
> + }
Feels to me that an early exit here is going to be slightly odd if other properties
are added later, as they'll have to be added at a specific point in the function. Perhaps:
if (!s->accel) {
if (!s->ril) {
....
}
}
return true;
is going to be easier to extend.
> + if (!s->ril) {
> + error_setg(errp, "ril can only be disabled if accel=on");
> + return false;
> + }
> return true;
> }
* Re: [PATCH v4 21/27] hw/arm/smmuv3-accel: Add a property to specify RIL support
2025-09-29 13:36 ` [PATCH v4 21/27] hw/arm/smmuv3-accel: Add a property to specify RIL support Shameer Kolothum
2025-10-01 13:39 ` Jonathan Cameron via
@ 2025-10-17 8:48 ` Zhangfei Gao
2025-10-17 9:40 ` Shameer Kolothum
1 sibling, 1 reply; 118+ messages in thread
From: Zhangfei Gao @ 2025-10-17 8:48 UTC (permalink / raw)
To: Shameer Kolothum
Cc: qemu-arm, qemu-devel, eric.auger, peter.maydell, jgg, nicolinc,
ddutile, berrange, nathanc, mochs, smostafa, wangzhou1,
jiangkunkun, jonathan.cameron, zhenzhong.duan, yi.l.liu,
shameerkolothum
On Mon, 29 Sept 2025 at 21:40, Shameer Kolothum <skolothumtho@nvidia.com> wrote:
>
> Currently QEMU SMMUv3 has RIL support by default. But if accelerated mode
> is enabled, RIL has to be compatible with host SMMUv3 support.
>
> Add a property so that the user can specify this.
>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
If ril=off is not specified, the guest kernel will not boot up, is
this expected?
Fail with log:
qemu-system-aarch64: -device
vfio-pci,host=0000:75:00.1,bus=pcie.0,iommufd=iommufd0:
vfio 0000:75:00.1: Failed to set vIOMMU: Host SMMUv3 differs in
Range Invalidation support
Thanks
* RE: [PATCH v4 21/27] hw/arm/smmuv3-accel: Add a property to specify RIL support
2025-10-17 8:48 ` Zhangfei Gao
@ 2025-10-17 9:40 ` Shameer Kolothum
2025-10-17 14:05 ` Zhangfei Gao
0 siblings, 1 reply; 118+ messages in thread
From: Shameer Kolothum @ 2025-10-17 9:40 UTC (permalink / raw)
To: Zhangfei Gao
Cc: qemu-arm@nongnu.org, qemu-devel@nongnu.org, eric.auger@redhat.com,
peter.maydell@linaro.org, Jason Gunthorpe, Nicolin Chen,
ddutile@redhat.com, berrange@redhat.com, Nathan Chen, Matt Ochs,
smostafa@google.com, wangzhou1@hisilicon.com,
jiangkunkun@huawei.com, jonathan.cameron@huawei.com,
zhenzhong.duan@intel.com, yi.l.liu@intel.com,
shameerkolothum@gmail.com
> -----Original Message-----
> From: Zhangfei Gao <zhangfei.gao@linaro.org>
> Sent: 17 October 2025 09:49
> To: Shameer Kolothum <skolothumtho@nvidia.com>
> Cc: qemu-arm@nongnu.org; qemu-devel@nongnu.org;
> eric.auger@redhat.com; peter.maydell@linaro.org; Jason Gunthorpe
> <jgg@nvidia.com>; Nicolin Chen <nicolinc@nvidia.com>; ddutile@redhat.com;
> berrange@redhat.com; Nathan Chen <nathanc@nvidia.com>; Matt Ochs
> <mochs@nvidia.com>; smostafa@google.com; wangzhou1@hisilicon.com;
> jiangkunkun@huawei.com; jonathan.cameron@huawei.com;
> zhenzhong.duan@intel.com; yi.l.liu@intel.com;
> shameerkolothum@gmail.com
> Subject: Re: [PATCH v4 21/27] hw/arm/smmuv3-accel: Add a property to
> specify RIL support
>
> External email: Use caution opening links or attachments
>
>
> On Mon, 29 Sept 2025 at 21:40, Shameer Kolothum
> <skolothumtho@nvidia.com> wrote:
> >
> > Currently QEMU SMMUv3 has RIL support by default. But if accelerated
> > mode is enabled, RIL has to be compatible with host SMMUv3 support.
> >
> > Add a property so that the user can specify this.
> >
> > Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
>
> If ril=off is not specified, the guest kernel will not boot up, is this expected?
>
> Fail with log:
> qemu-system-aarch64: -device
> vfio-pci,host=0000:75:00.1,bus=pcie.0,iommufd=iommufd0:
> vfio 0000:75:00.1: Failed to set vIOMMU: Host SMMUv3 differs in Range
> Invalidation support
It will, if the host SMMUv3 doesn't have RIL. Please check that.
This is because when a device is attached to the vSMMU, a compatibility check
is performed to ensure that the SMMUv3 features visible to the guest are compatible
with the host SMMUv3 it is tied to. By default, QEMU SMMUv3 reports RIL to the guest.
The "ril" option is provided so that the user can specify this in case of incompatibility.
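The compatibility check being described can be sketched as follows; extract32 mimics QEMU's FIELD_EX32 helper, and the RIL bit position used here is an assumption for illustration, not necessarily the real IDR3 layout:

```c
#include <stdbool.h>
#include <stdint.h>

/* Minimal stand-in for QEMU's FIELD_EX32: extract a bit field. */
static uint32_t extract32(uint32_t val, int start, int len)
{
    return (val >> start) & ((1u << len) - 1);
}

/* Assumed bit position of IDR3.RIL, for illustration only. */
#define IDR3_RIL_SHIFT 10

/*
 * Compatibility rule from the patch: the RIL bit advertised to the
 * guest must match what the host SMMUv3 reports; a mismatch in either
 * direction fails the device attach.
 */
static bool ril_compatible(uint32_t host_idr3, uint32_t guest_idr3)
{
    return extract32(host_idr3, IDR3_RIL_SHIFT, 1) ==
           extract32(guest_idr3, IDR3_RIL_SHIFT, 1);
}
```

So on a host without RIL, the check only passes once the user turns the guest-visible bit off with ril=off.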
Thanks,
Shameer
* Re: [PATCH v4 21/27] hw/arm/smmuv3-accel: Add a property to specify RIL support
2025-10-17 9:40 ` Shameer Kolothum
@ 2025-10-17 14:05 ` Zhangfei Gao
0 siblings, 0 replies; 118+ messages in thread
From: Zhangfei Gao @ 2025-10-17 14:05 UTC (permalink / raw)
To: Shameer Kolothum
Cc: qemu-arm@nongnu.org, qemu-devel@nongnu.org, eric.auger@redhat.com,
peter.maydell@linaro.org, Jason Gunthorpe, Nicolin Chen,
ddutile@redhat.com, berrange@redhat.com, Nathan Chen, Matt Ochs,
smostafa@google.com, wangzhou1@hisilicon.com,
jiangkunkun@huawei.com, jonathan.cameron@huawei.com,
zhenzhong.duan@intel.com, yi.l.liu@intel.com,
shameerkolothum@gmail.com
On Fri, 17 Oct 2025 at 17:40, Shameer Kolothum <skolothumtho@nvidia.com> wrote:
>
>
>
> > -----Original Message-----
> > From: Zhangfei Gao <zhangfei.gao@linaro.org>
> > Sent: 17 October 2025 09:49
> > To: Shameer Kolothum <skolothumtho@nvidia.com>
> > Cc: qemu-arm@nongnu.org; qemu-devel@nongnu.org;
> > eric.auger@redhat.com; peter.maydell@linaro.org; Jason Gunthorpe
> > <jgg@nvidia.com>; Nicolin Chen <nicolinc@nvidia.com>; ddutile@redhat.com;
> > berrange@redhat.com; Nathan Chen <nathanc@nvidia.com>; Matt Ochs
> > <mochs@nvidia.com>; smostafa@google.com; wangzhou1@hisilicon.com;
> > jiangkunkun@huawei.com; jonathan.cameron@huawei.com;
> > zhenzhong.duan@intel.com; yi.l.liu@intel.com;
> > shameerkolothum@gmail.com
> > Subject: Re: [PATCH v4 21/27] hw/arm/smmuv3-accel: Add a property to
> > specify RIL support
> >
> > External email: Use caution opening links or attachments
> >
> >
> > On Mon, 29 Sept 2025 at 21:40, Shameer Kolothum
> > <skolothumtho@nvidia.com> wrote:
> > >
> > > Currently QEMU SMMUv3 has RIL support by default. But if accelerated
> > > mode is enabled, RIL has to be compatible with host SMMUv3 support.
> > >
> > > Add a property so that the user can specify this.
> > >
> > > Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
> >
> > If ril=off is not specified, the guest kernel will not boot up, is this expected?
> >
> > Fail with log:
> > qemu-system-aarch64: -device
> > vfio-pci,host=0000:75:00.1,bus=pcie.0,iommufd=iommufd0:
> > vfio 0000:75:00.1: Failed to set vIOMMU: Host SMMUv3 differs in Range
> > Invalidation support
>
> It will, if the host SMMUv3 doesn't have RIL. Please check that.
Yes, the host SMMUv3 doesn't have RIL in my case.
> This is because when a device is attached to the vSMMU, a compatibility check
> is performed to ensure that the SMMUv3 features visible to the guest are compatible
> with the host SMMUv3 it is tied to. By default, QEMU SMMUv3 reports RIL to the guest.
OK, got it, using ioctl to get host info and check the compatibility.
>
> The "ril" option is provided so that user can specify this in case incompatibility.
OK, Thanks
* [PATCH v4 22/27] hw/arm/smmuv3-accel: Add support for ATS
2025-09-29 13:36 [PATCH v4 00/27] hw/arm/virt: Add support for user-creatable accelerated SMMUv3 Shameer Kolothum
` (20 preceding siblings ...)
2025-09-29 13:36 ` [PATCH v4 21/27] hw/arm/smmuv3-accel: Add a property to specify RIL support Shameer Kolothum
@ 2025-09-29 13:36 ` Shameer Kolothum
2025-10-01 13:43 ` Jonathan Cameron via
2025-09-29 13:36 ` [PATCH v4 23/27] hw/arm/smmuv3-accel: Add property to specify OAS bits Shameer Kolothum
` (5 subsequent siblings)
27 siblings, 1 reply; 118+ messages in thread
From: Shameer Kolothum @ 2025-09-29 13:36 UTC (permalink / raw)
To: qemu-arm, qemu-devel
Cc: eric.auger, peter.maydell, jgg, nicolinc, ddutile, berrange,
nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu,
shameerkolothum
QEMU SMMUv3 does not enable ATS (Address Translation Services) by default.
When accelerated mode is enabled and the host SMMUv3 supports ATS, it can
be useful to report ATS capability to the guest so it can take advantage
of it if the device also supports ATS.
Note: ATS support cannot be reliably detected from the host SMMUv3 IDR
registers alone, as firmware ACPI IORT tables may override them. The
user must therefore verify that the host supports ATS before enabling it.
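The IDR override this patch performs can be sketched with a minimal deposit helper; deposit32 mimics QEMU's FIELD_DP32, and the bit positions are illustrative assumptions rather than the spec register layout:

```c
#include <stdint.h>

/* Minimal stand-in for QEMU's FIELD_DP32: deposit a bit field. */
static uint32_t deposit32(uint32_t val, int start, int len, uint32_t field)
{
    uint32_t mask = ((1u << len) - 1) << start;
    return (val & ~mask) | ((field << start) & mask);
}

/* Assumed bit positions, for illustration only. */
#define IDR0_ATS_SHIFT 10
#define IDR3_RIL_SHIFT 10

/*
 * Mirrors smmuv3_accel_idr_override(): clear RIL in IDR3 if the user
 * disabled it, and set ATS in IDR0 if the user enabled it.  All other
 * ID register bits are left untouched.
 */
static void idr_override(uint32_t idr[4], int ril, int ats)
{
    if (!ril) {
        idr[3] = deposit32(idr[3], IDR3_RIL_SHIFT, 1, 0);
    }
    if (ats) {
        idr[0] = deposit32(idr[0], IDR0_ATS_SHIFT, 1, 1);
    }
}
```

The deposit is a read-modify-write of a single field, so repeated overrides compose without clobbering neighbouring ID register bits.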
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
hw/arm/smmuv3-accel.c | 4 ++++
hw/arm/smmuv3.c | 25 ++++++++++++++++++++++++-
hw/arm/virt-acpi-build.c | 10 ++++++++--
include/hw/arm/smmuv3.h | 1 +
4 files changed, 37 insertions(+), 3 deletions(-)
diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
index e8607b253e..eee54316bf 100644
--- a/hw/arm/smmuv3-accel.c
+++ b/hw/arm/smmuv3-accel.c
@@ -644,6 +644,10 @@ void smmuv3_accel_idr_override(SMMUv3State *s)
if (!s->ril) {
s->idr[3] = FIELD_DP32(s->idr[3], IDR3, RIL, 0);
}
+ /* QEMU SMMUv3 has no ATS. Update IDR0 if user has enabled it */
+ if (s->ats) {
+ s->idr[0] = FIELD_DP32(s->idr[0], IDR0, ATS, 1); /* ATS */
+ }
}
/*
diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index 0f3a61646a..77d46a9cd6 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -1510,13 +1510,28 @@ static int smmuv3_cmdq_consume(SMMUv3State *s)
*/
smmuv3_range_inval(bs, &cmd, SMMU_STAGE_2);
break;
+ case SMMU_CMD_ATC_INV:
+ {
+ SMMUDevice *sdev = smmu_find_sdev(bs, CMD_SID(&cmd));
+ Error *local_err = NULL;
+
+ if (!sdev) {
+ break;
+ }
+
+ if (!smmuv3_accel_issue_inv_cmd(s, &cmd, sdev, &local_err)) {
+ error_report_err(local_err);
+ cmd_error = SMMU_CERROR_ILL;
+ break;
+ }
+ break;
+ }
case SMMU_CMD_TLBI_EL3_ALL:
case SMMU_CMD_TLBI_EL3_VA:
case SMMU_CMD_TLBI_EL2_ALL:
case SMMU_CMD_TLBI_EL2_ASID:
case SMMU_CMD_TLBI_EL2_VA:
case SMMU_CMD_TLBI_EL2_VAA:
- case SMMU_CMD_ATC_INV:
case SMMU_CMD_PRI_RESP:
case SMMU_CMD_RESUME:
case SMMU_CMD_STALL_TERM:
@@ -1934,6 +1949,10 @@ static bool smmu_validate_property(SMMUv3State *s, Error **errp)
error_setg(errp, "ril can only be disabled if accel=on");
return false;
}
+ if (s->ats) {
+ error_setg(errp, "ats can only be enabled if accel=on");
+ return false;
+ }
return true;
}
@@ -2058,6 +2077,7 @@ static const Property smmuv3_properties[] = {
DEFINE_PROP_BOOL("accel", SMMUv3State, accel, false),
/* RIL can be turned off for accel cases */
DEFINE_PROP_BOOL("ril", SMMUv3State, ril, true),
+ DEFINE_PROP_BOOL("ats", SMMUv3State, ats, false),
};
static void smmuv3_instance_init(Object *obj)
@@ -2087,6 +2107,9 @@ static void smmuv3_class_init(ObjectClass *klass, const void *data)
"in nested mode for vfio-pci dev assignment");
object_class_property_set_description(klass, "ril",
"Enable/disable range invalidation support");
+ object_class_property_set_description(klass, "ats",
+ "Enable/disable ATS support. Please ensure host platform has ATS "
+ "support before enabling this");
}
static int smmuv3_notify_flag_changed(IOMMUMemoryRegion *iommu,
diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
index d0c1e10019..a53f0229b8 100644
--- a/hw/arm/virt-acpi-build.c
+++ b/hw/arm/virt-acpi-build.c
@@ -322,6 +322,7 @@ typedef struct AcpiIortSMMUv3Dev {
/* Offset of the SMMUv3 IORT Node relative to the start of the IORT */
size_t offset;
bool accel;
+ bool ats;
} AcpiIortSMMUv3Dev;
/*
@@ -377,6 +378,7 @@ static int iort_smmuv3_devices(Object *obj, void *opaque)
bus = PCI_BUS(object_property_get_link(obj, "primary-bus", &error_abort));
sdev.accel = object_property_get_bool(obj, "accel", &error_abort);
+ sdev.ats = object_property_get_bool(obj, "ats", &error_abort);
pbus = PLATFORM_BUS_DEVICE(vms->platform_bus_dev);
sbdev = SYS_BUS_DEVICE(obj);
sdev.base = platform_bus_get_mmio_addr(pbus, sbdev, 0);
@@ -511,6 +513,7 @@ build_iort(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
int i, nb_nodes, rc_mapping_count;
AcpiIortSMMUv3Dev *sdev;
size_t node_size;
+ bool ats_needed = false;
int num_smmus = 0;
uint32_t id = 0;
int rc_smmu_idmaps_len = 0;
@@ -546,6 +549,9 @@ build_iort(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
/* Calculate RMR nodes required. One per SMMUv3 with accelerated mode */
for (i = 0; i < num_smmus; i++) {
sdev = &g_array_index(smmuv3_devs, AcpiIortSMMUv3Dev, i);
+ if (sdev->ats) {
+ ats_needed = true;
+ }
if (sdev->accel) {
nb_nodes++;
}
@@ -645,8 +651,8 @@ build_iort(GArray *table_data, BIOSLinker *linker, VirtMachineState *vms)
build_append_int_noprefix(table_data, 0, 2); /* Reserved */
/* Table 15 Memory Access Flags */
build_append_int_noprefix(table_data, 0x3 /* CCA = CPM = DACS = 1 */, 1);
-
- build_append_int_noprefix(table_data, 0, 4); /* ATS Attribute */
+ /* ATS Attribute */
+ build_append_int_noprefix(table_data, (ats_needed ? 1 : 0), 4);
/* MCFG pci_segment */
build_append_int_noprefix(table_data, 0, 4); /* PCI Segment number */
diff --git a/include/hw/arm/smmuv3.h b/include/hw/arm/smmuv3.h
index c555ea684e..6f07baa144 100644
--- a/include/hw/arm/smmuv3.h
+++ b/include/hw/arm/smmuv3.h
@@ -69,6 +69,7 @@ struct SMMUv3State {
struct SMMUv3AccelState *s_accel;
Error *migration_blocker;
bool ril;
+ bool ats;
};
typedef enum {
--
2.43.0
^ permalink raw reply related [flat|nested] 118+ messages in thread
* Re: [PATCH v4 22/27] hw/arm/smmuv3-accel: Add support for ATS
2025-09-29 13:36 ` [PATCH v4 22/27] hw/arm/smmuv3-accel: Add support for ATS Shameer Kolothum
@ 2025-10-01 13:43 ` Jonathan Cameron via
0 siblings, 0 replies; 118+ messages in thread
From: Jonathan Cameron via @ 2025-10-01 13:43 UTC (permalink / raw)
To: Shameer Kolothum
Cc: qemu-arm, qemu-devel, eric.auger, peter.maydell, jgg, nicolinc,
ddutile, berrange, nathanc, mochs, smostafa, wangzhou1,
jiangkunkun, zhangfei.gao, zhenzhong.duan, yi.l.liu,
shameerkolothum
On Mon, 29 Sep 2025 14:36:38 +0100
Shameer Kolothum <skolothumtho@nvidia.com> wrote:
> QEMU SMMUv3 does not enable ATS (Address Translation Services) by default.
> When accelerated mode is enabled and the host SMMUv3 supports ATS, it can
> be useful to report ATS capability to the guest so it can take advantage
> of it if the device also supports ATS.
>
> Note: ATS support cannot be reliably detected from the host SMMUv3 IDR
> registers alone, as firmware ACPI IORT tables may override them. The
> user must therefore verify that the host supports ATS before enabling it.
>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
Trivial stuff only.
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Feels like there should be some host mechanism we could query
for support but if not I guess the 'don't set it wrong' comment
is the best we can do.
> ---
> hw/arm/smmuv3-accel.c | 4 ++++
> hw/arm/smmuv3.c | 25 ++++++++++++++++++++++++-
> hw/arm/virt-acpi-build.c | 10 ++++++++--
> include/hw/arm/smmuv3.h | 1 +
> 4 files changed, 37 insertions(+), 3 deletions(-)
>
> diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
> index e8607b253e..eee54316bf 100644
> --- a/hw/arm/smmuv3-accel.c
> +++ b/hw/arm/smmuv3-accel.c
> @@ -644,6 +644,10 @@ void smmuv3_accel_idr_override(SMMUv3State *s)
> if (!s->ril) {
> s->idr[3] = FIELD_DP32(s->idr[3], IDR3, RIL, 0);
> }
> + /* QEMU SMMUv3 has no ATS. Update IDR0 if user has enabled it */
> + if (s->ats) {
> + s->idr[0] = FIELD_DP32(s->idr[0], IDR0, ATS, 1); /* ATS */
Not sure the comment adds anything given the field name!
> + }
> }
>
> /*
> diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
> index 0f3a61646a..77d46a9cd6 100644
> --- a/hw/arm/smmuv3.c
> +++ b/hw/arm/smmuv3.c
> @@ -1510,13 +1510,28 @@ static int smmuv3_cmdq_consume(SMMUv3State *s)
> */
> smmuv3_range_inval(bs, &cmd, SMMU_STAGE_2);
> break;
> + case SMMU_CMD_ATC_INV:
> + {
> + SMMUDevice *sdev = smmu_find_sdev(bs, CMD_SID(&cmd));
> + Error *local_err = NULL;
> +
> + if (!sdev) {
> + break;
> + }
> +
> + if (!smmuv3_accel_issue_inv_cmd(s, &cmd, sdev, &local_err)) {
> + error_report_err(local_err);
> + cmd_error = SMMU_CERROR_ILL;
> + break;
> + }
> + break;
> + }
> case SMMU_CMD_TLBI_EL3_ALL:
> case SMMU_CMD_TLBI_EL3_VA:
> case SMMU_CMD_TLBI_EL2_ALL:
> case SMMU_CMD_TLBI_EL2_ASID:
> case SMMU_CMD_TLBI_EL2_VA:
> case SMMU_CMD_TLBI_EL2_VAA:
> - case SMMU_CMD_ATC_INV:
> case SMMU_CMD_PRI_RESP:
> case SMMU_CMD_RESUME:
> case SMMU_CMD_STALL_TERM:
> @@ -1934,6 +1949,10 @@ static bool smmu_validate_property(SMMUv3State *s, Error **errp)
> error_setg(errp, "ril can only be disabled if accel=on");
> return false;
> }
> + if (s->ats) {
> + error_setg(errp, "ats can only be enabled if accel=on");
> + return false;
Comment in previous patch follow through here...
> + }
> return true;
> }
> diff --git a/hw/arm/virt-acpi-build.c b/hw/arm/virt-acpi-build.c
> index d0c1e10019..a53f0229b8 100644
> --- a/hw/arm/virt-acpi-build.c
> +++ b/hw/arm/virt-acpi-build.c
* [PATCH v4 23/27] hw/arm/smmuv3-accel: Add property to specify OAS bits
2025-09-29 13:36 [PATCH v4 00/27] hw/arm/virt: Add support for user-creatable accelerated SMMUv3 Shameer Kolothum
` (21 preceding siblings ...)
2025-09-29 13:36 ` [PATCH v4 22/27] hw/arm/smmuv3-accel: Add support for ATS Shameer Kolothum
@ 2025-09-29 13:36 ` Shameer Kolothum
2025-10-01 13:46 ` Jonathan Cameron via
2025-09-29 13:36 ` [PATCH v4 24/27] backends/iommufd: Retrieve PASID width from iommufd_backend_get_device_info() Shameer Kolothum
` (4 subsequent siblings)
27 siblings, 1 reply; 118+ messages in thread
From: Shameer Kolothum @ 2025-09-29 13:36 UTC (permalink / raw)
To: qemu-arm, qemu-devel
Cc: eric.auger, peter.maydell, jgg, nicolinc, ddutile, berrange,
nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu,
shameerkolothum
QEMU SMMUv3 currently sets the output address size (OAS) to 44 bits. With
accelerator mode enabled, a guest device may use SVA where CPU page tables
are shared with SMMUv3, requiring OAS at least equal to the CPU OAS. Add
a user option to set this.
Note: Linux kernel docs currently state that the OAS field in the IDR
register is not meaningful to users. However, it appears this information
is needed for accelerated mode.
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
hw/arm/smmuv3-accel.c | 15 +++++++++++++++
hw/arm/smmuv3-internal.h | 3 ++-
hw/arm/smmuv3.c | 15 ++++++++++++++-
include/hw/arm/smmuv3.h | 1 +
4 files changed, 32 insertions(+), 2 deletions(-)
diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
index eee54316bf..ba37d690ad 100644
--- a/hw/arm/smmuv3-accel.c
+++ b/hw/arm/smmuv3-accel.c
@@ -86,6 +86,17 @@ smmuv3_accel_check_hw_compatible(SMMUv3State *s,
return false;
}
+ /*
+ * ToDo: OAS is not something Linux kernel doc says meaningful for user.
+ * But looks like OAS needs to be compatibe for accelerator support. Please
+ * check.
+ */
+ val = FIELD_EX32(info->idr[5], IDR5, OAS);
+ if (val < FIELD_EX32(s->idr[5], IDR5, OAS)) {
+ error_setg(errp, "Host SMMUv3 OAS not compatible");
+ return false;
+ }
+
val = FIELD_EX32(info->idr[5], IDR5, GRAN4K);
if (val != FIELD_EX32(s->idr[5], IDR5, GRAN4K)) {
error_setg(errp, "Host SMMUv3 doesn't support 64K translation granule");
@@ -648,6 +659,10 @@ void smmuv3_accel_idr_override(SMMUv3State *s)
if (s->ats) {
s->idr[0] = FIELD_DP32(s->idr[0], IDR0, ATS, 1); /* ATS */
}
+ /* QEMU SMMUv3 defaults to a 44-bit OAS. Update IDR5 if the user set 48 bits */
+ if (s->oas == 48) {
+ s->idr[5] = FIELD_DP32(s->idr[5], IDR5, OAS, SMMU_IDR5_OAS_48);
+ }
}
/*
diff --git a/hw/arm/smmuv3-internal.h b/hw/arm/smmuv3-internal.h
index b0dfa9465c..910a34e05b 100644
--- a/hw/arm/smmuv3-internal.h
+++ b/hw/arm/smmuv3-internal.h
@@ -111,7 +111,8 @@ REG32(IDR5, 0x14)
FIELD(IDR5, VAX, 10, 2);
FIELD(IDR5, STALL_MAX, 16, 16);
-#define SMMU_IDR5_OAS 4
+#define SMMU_IDR5_OAS_44 4
+#define SMMU_IDR5_OAS_48 5
REG32(IIDR, 0x18)
REG32(AIDR, 0x1c)
diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index 77d46a9cd6..7c391ab711 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -294,7 +294,8 @@ static void smmuv3_init_regs(SMMUv3State *s)
s->idr[3] = FIELD_DP32(s->idr[3], IDR3, RIL, 1);
s->idr[3] = FIELD_DP32(s->idr[3], IDR3, BBML, 2);
- s->idr[5] = FIELD_DP32(s->idr[5], IDR5, OAS, SMMU_IDR5_OAS); /* 44 bits */
+ /* OAS: 44 bits */
+ s->idr[5] = FIELD_DP32(s->idr[5], IDR5, OAS, SMMU_IDR5_OAS_44);
/* 4K, 16K and 64K granule support */
s->idr[5] = FIELD_DP32(s->idr[5], IDR5, GRAN4K, 1);
s->idr[5] = FIELD_DP32(s->idr[5], IDR5, GRAN16K, 1);
@@ -1943,6 +1944,10 @@ static bool smmu_validate_property(SMMUv3State *s, Error **errp)
}
#endif
if (s->accel) {
+ if (s->oas != 44 && s->oas != 48) {
+ error_setg(errp, "oas can only be set to 44 or 48 bits");
+ return false;
+ }
return true;
}
if (!s->ril) {
@@ -1953,6 +1958,10 @@ static bool smmu_validate_property(SMMUv3State *s, Error **errp)
error_setg(errp, "ats can only be enabled if accel=on");
return false;
}
+ if (s->oas != 44) {
+ error_setg(errp, "oas can only be set to 44 bits if accel=off");
+ return false;
+ }
return true;
}
@@ -2078,6 +2087,7 @@ static const Property smmuv3_properties[] = {
/* RIL can be turned off for accel cases */
DEFINE_PROP_BOOL("ril", SMMUv3State, ril, true),
DEFINE_PROP_BOOL("ats", SMMUv3State, ats, false),
+ DEFINE_PROP_UINT8("oas", SMMUv3State, oas, 44),
};
static void smmuv3_instance_init(Object *obj)
@@ -2110,6 +2120,9 @@ static void smmuv3_class_init(ObjectClass *klass, const void *data)
object_class_property_set_description(klass, "ats",
"Enable/disable ATS support. Please ensure host platform has ATS "
"support before enabling this");
+ object_class_property_set_description(klass, "oas",
"Specify Output Address Size. Supported values are 44 or 48 bits. "
"Defaults to 44 bits");
}
static int smmuv3_notify_flag_changed(IOMMUMemoryRegion *iommu,
diff --git a/include/hw/arm/smmuv3.h b/include/hw/arm/smmuv3.h
index 6f07baa144..d3788b2d85 100644
--- a/include/hw/arm/smmuv3.h
+++ b/include/hw/arm/smmuv3.h
@@ -70,6 +70,7 @@ struct SMMUv3State {
Error *migration_blocker;
bool ril;
bool ats;
+ uint8_t oas;
};
typedef enum {
--
2.43.0
* Re: [PATCH v4 23/27] hw/arm/smmuv3-accel: Add property to specify OAS bits
2025-09-29 13:36 ` [PATCH v4 23/27] hw/arm/smmuv3-accel: Add property to specify OAS bits Shameer Kolothum
@ 2025-10-01 13:46 ` Jonathan Cameron via
0 siblings, 0 replies; 118+ messages in thread
From: Jonathan Cameron via @ 2025-10-01 13:46 UTC (permalink / raw)
To: Shameer Kolothum
Cc: qemu-arm, qemu-devel, eric.auger, peter.maydell, jgg, nicolinc,
ddutile, berrange, nathanc, mochs, smostafa, wangzhou1,
jiangkunkun, zhangfei.gao, zhenzhong.duan, yi.l.liu,
shameerkolothum
On Mon, 29 Sep 2025 14:36:39 +0100
Shameer Kolothum <skolothumtho@nvidia.com> wrote:
> QEMU SMMUv3 currently sets the output address size (OAS) to 44 bits. With
> accelerator mode enabled, a guest device may use SVA where CPU page tables
> are shared with SMMUv3, requiring OAS at least equal to the CPU OAS. Add
> a user option to set this.
>
> Note: Linux kernel docs currently state that the OAS field in the IDR
> register is not meaningful to users. However, it appears this information
> is needed for accelerated mode.
So is there a kernel documentation fix pending? :)
Mind you I think we should ensure this is true anyway in QEMU as some other
OS might do weird things if it's not.
Maybe we should just raise the default QEMU uses (with compat stuff for older
qemu) and not worry about an exposed control for this?
>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
> ---
> hw/arm/smmuv3-accel.c | 15 +++++++++++++++
> hw/arm/smmuv3-internal.h | 3 ++-
> hw/arm/smmuv3.c | 15 ++++++++++++++-
> include/hw/arm/smmuv3.h | 1 +
> 4 files changed, 32 insertions(+), 2 deletions(-)
>
> diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
> index eee54316bf..ba37d690ad 100644
> --- a/hw/arm/smmuv3-accel.c
> +++ b/hw/arm/smmuv3-accel.c
> @@ -86,6 +86,17 @@ smmuv3_accel_check_hw_compatible(SMMUv3State *s,
> return false;
> }
>
> + /*
> + * ToDo: OAS is not something Linux kernel doc says meaningful for user.
> + * But looks like OAS needs to be compatibe for accelerator support. Please
spell check this too..
> + * check.
> + */
* [PATCH v4 24/27] backends/iommufd: Retrieve PASID width from iommufd_backend_get_device_info()
2025-09-29 13:36 [PATCH v4 00/27] hw/arm/virt: Add support for user-creatable accelerated SMMUv3 Shameer Kolothum
` (22 preceding siblings ...)
2025-09-29 13:36 ` [PATCH v4 23/27] hw/arm/smmuv3-accel: Add property to specify OAS bits Shameer Kolothum
@ 2025-09-29 13:36 ` Shameer Kolothum
2025-10-01 13:50 ` Jonathan Cameron via
2025-09-29 13:36 ` [PATCH v4 25/27] backends/iommufd: Add a callback helper to retrieve PASID support Shameer Kolothum
` (3 subsequent siblings)
27 siblings, 1 reply; 118+ messages in thread
From: Shameer Kolothum @ 2025-09-29 13:36 UTC (permalink / raw)
To: qemu-arm, qemu-devel
Cc: eric.auger, peter.maydell, jgg, nicolinc, ddutile, berrange,
nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu,
shameerkolothum
And store it in HostIOMMUDeviceCaps for later use.
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
backends/iommufd.c | 6 +++++-
hw/arm/smmuv3-accel.c | 3 ++-
hw/vfio/iommufd.c | 7 +++++--
include/system/host_iommu_device.h | 2 ++
include/system/iommufd.h | 3 ++-
5 files changed, 16 insertions(+), 5 deletions(-)
diff --git a/backends/iommufd.c b/backends/iommufd.c
index d3029d4658..023e67bc46 100644
--- a/backends/iommufd.c
+++ b/backends/iommufd.c
@@ -388,7 +388,8 @@ bool iommufd_backend_get_dirty_bitmap(IOMMUFDBackend *be,
bool iommufd_backend_get_device_info(IOMMUFDBackend *be, uint32_t devid,
uint32_t *type, void *data, uint32_t len,
- uint64_t *caps, Error **errp)
+ uint64_t *caps, uint8_t *pasid_log2,
+ Error **errp)
{
struct iommu_hw_info info = {
.size = sizeof(info),
@@ -407,6 +408,9 @@ bool iommufd_backend_get_device_info(IOMMUFDBackend *be, uint32_t devid,
g_assert(caps);
*caps = info.out_capabilities;
+ if (pasid_log2) {
+ *pasid_log2 = info.out_max_pasid_log2;
+ }
return true;
}
diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
index ba37d690ad..283d36e6cd 100644
--- a/hw/arm/smmuv3-accel.c
+++ b/hw/arm/smmuv3-accel.c
@@ -124,7 +124,8 @@ smmuv3_accel_hw_compatible(SMMUv3State *s, HostIOMMUDeviceIOMMUFD *idev,
uint64_t caps;
if (!iommufd_backend_get_device_info(idev->iommufd, idev->devid, &data_type,
- &info, sizeof(info), &caps, errp)) {
+ &info, sizeof(info), &caps, NULL,
+ errp)) {
return false;
}
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index 525df30ed1..89aa1b76a8 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -366,7 +366,8 @@ static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
* instead.
*/
if (!iommufd_backend_get_device_info(vbasedev->iommufd, vbasedev->devid,
- &type, NULL, 0, &hw_caps, errp)) {
+ &type, NULL, 0, &hw_caps, NULL,
+ errp)) {
return false;
}
@@ -901,19 +902,21 @@ static bool hiod_iommufd_vfio_realize(HostIOMMUDevice *hiod, void *opaque,
HostIOMMUDeviceCaps *caps = &hiod->caps;
VendorCaps *vendor_caps = &caps->vendor_caps;
enum iommu_hw_info_type type;
+ uint8_t pasid_log2;
uint64_t hw_caps;
hiod->agent = opaque;
if (!iommufd_backend_get_device_info(vdev->iommufd, vdev->devid, &type,
vendor_caps, sizeof(*vendor_caps),
- &hw_caps, errp)) {
+ &hw_caps, &pasid_log2, errp)) {
return false;
}
hiod->name = g_strdup(vdev->name);
caps->type = type;
caps->hw_caps = hw_caps;
+ caps->max_pasid_log2 = pasid_log2;
idev = HOST_IOMMU_DEVICE_IOMMUFD(hiod);
idev->iommufd = vdev->iommufd;
diff --git a/include/system/host_iommu_device.h b/include/system/host_iommu_device.h
index ab849a4a82..c6a2a3899a 100644
--- a/include/system/host_iommu_device.h
+++ b/include/system/host_iommu_device.h
@@ -29,6 +29,7 @@ typedef union VendorCaps {
*
* @hw_caps: host platform IOMMU capabilities (e.g. on IOMMUFD this represents
* the @out_capabilities value returned from IOMMU_GET_HW_INFO ioctl)
+ * @max_pasid_log2: width of PASIDs supported by host IOMMU device
*
* @vendor_caps: host platform IOMMU vendor specific capabilities (e.g. on
* IOMMUFD this represents a user-space buffer filled by kernel
@@ -37,6 +38,7 @@ typedef union VendorCaps {
typedef struct HostIOMMUDeviceCaps {
uint32_t type;
uint64_t hw_caps;
+ uint8_t max_pasid_log2;
VendorCaps vendor_caps;
} HostIOMMUDeviceCaps;
#endif
diff --git a/include/system/iommufd.h b/include/system/iommufd.h
index e852193f35..d3efcffc45 100644
--- a/include/system/iommufd.h
+++ b/include/system/iommufd.h
@@ -71,7 +71,8 @@ int iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas_id,
hwaddr iova, ram_addr_t size);
bool iommufd_backend_get_device_info(IOMMUFDBackend *be, uint32_t devid,
uint32_t *type, void *data, uint32_t len,
- uint64_t *caps, Error **errp);
+ uint64_t *caps, uint8_t *pasid_log2,
+ Error **errp);
bool iommufd_backend_alloc_hwpt(IOMMUFDBackend *be, uint32_t dev_id,
uint32_t pt_id, uint32_t flags,
uint32_t data_type, uint32_t data_len,
--
2.43.0
* Re: [PATCH v4 24/27] backends/iommufd: Retrieve PASID width from iommufd_backend_get_device_info()
2025-09-29 13:36 ` [PATCH v4 24/27] backends/iommufd: Retrieve PASID width from iommufd_backend_get_device_info() Shameer Kolothum
@ 2025-10-01 13:50 ` Jonathan Cameron via
0 siblings, 0 replies; 118+ messages in thread
From: Jonathan Cameron via @ 2025-10-01 13:50 UTC (permalink / raw)
To: Shameer Kolothum
Cc: qemu-arm, qemu-devel, eric.auger, peter.maydell, jgg, nicolinc,
ddutile, berrange, nathanc, mochs, smostafa, wangzhou1,
jiangkunkun, zhangfei.gao, zhenzhong.duan, yi.l.liu,
shameerkolothum
On Mon, 29 Sep 2025 14:36:40 +0100
Shameer Kolothum <skolothumtho@nvidia.com> wrote:
Bring the bit of the description in the title down here as well.
Depending on what tools people use for browsing git it might
end up in very different places on their screen.
> And store it in HostIOMMUDeviceCaps for later use.
>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
Trivial comment inline.
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> ---
> backends/iommufd.c | 6 +++++-
> hw/arm/smmuv3-accel.c | 3 ++-
> hw/vfio/iommufd.c | 7 +++++--
> include/system/host_iommu_device.h | 2 ++
> include/system/iommufd.h | 3 ++-
> 5 files changed, 16 insertions(+), 5 deletions(-)
...
> diff --git a/include/system/host_iommu_device.h b/include/system/host_iommu_device.h
> index ab849a4a82..c6a2a3899a 100644
> --- a/include/system/host_iommu_device.h
> +++ b/include/system/host_iommu_device.h
> @@ -29,6 +29,7 @@ typedef union VendorCaps {
> *
> * @hw_caps: host platform IOMMU capabilities (e.g. on IOMMUFD this represents
> * the @out_capabilities value returned from IOMMU_GET_HW_INFO ioctl)
Blank line here to match local style.
> + * @max_pasid_log2: width of PASIDs supported by host IOMMU device
> *
> * @vendor_caps: host platform IOMMU vendor specific capabilities (e.g. on
> * IOMMUFD this represents a user-space buffer filled by kernel
> @@ -37,6 +38,7 @@ typedef union VendorCaps {
> typedef struct HostIOMMUDeviceCaps {
> uint32_t type;
> uint64_t hw_caps;
> + uint8_t max_pasid_log2;
> VendorCaps vendor_caps;
> } HostIOMMUDeviceCaps;
> #endif
> diff --git a/include/system/iommufd.h b/include/system/iommufd.h
> index e852193f35..d3efcffc45 100644
> --- a/include/system/iommufd.h
> +++ b/include/system/iommufd.h
> @@ -71,7 +71,8 @@ int iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas_id,
> hwaddr iova, ram_addr_t size);
> bool iommufd_backend_get_device_info(IOMMUFDBackend *be, uint32_t devid,
> uint32_t *type, void *data, uint32_t len,
> - uint64_t *caps, Error **errp);
> + uint64_t *caps, uint8_t *pasid_log2,
> + Error **errp);
> bool iommufd_backend_alloc_hwpt(IOMMUFDBackend *be, uint32_t dev_id,
> uint32_t pt_id, uint32_t flags,
> uint32_t data_type, uint32_t data_len,
* [PATCH v4 25/27] backends/iommufd: Add a callback helper to retrieve PASID support
2025-09-29 13:36 [PATCH v4 00/27] hw/arm/virt: Add support for user-creatable accelerated SMMUv3 Shameer Kolothum
` (23 preceding siblings ...)
2025-09-29 13:36 ` [PATCH v4 24/27] backends/iommufd: Retrieve PASID width from iommufd_backend_get_device_info() Shameer Kolothum
@ 2025-09-29 13:36 ` Shameer Kolothum
2025-10-01 13:52 ` Jonathan Cameron via
2025-09-29 13:36 ` [PATCH v4 26/27] vfio: Synthesize vPASID capability to VM Shameer Kolothum
` (2 subsequent siblings)
27 siblings, 1 reply; 118+ messages in thread
From: Shameer Kolothum @ 2025-09-29 13:36 UTC (permalink / raw)
To: qemu-arm, qemu-devel
Cc: eric.auger, peter.maydell, jgg, nicolinc, ddutile, berrange,
nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu,
shameerkolothum
A subsequent patch will use this to add a PASID capability for assigned devices.
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
backends/iommufd.c | 9 +++++++++
include/system/host_iommu_device.h | 12 ++++++++++++
2 files changed, 21 insertions(+)
diff --git a/backends/iommufd.c b/backends/iommufd.c
index 023e67bc46..0ff46a5747 100644
--- a/backends/iommufd.c
+++ b/backends/iommufd.c
@@ -523,6 +523,14 @@ bool host_iommu_device_iommufd_detach_hwpt(HostIOMMUDeviceIOMMUFD *idev,
return idevc->detach_hwpt(idev, errp);
}
+static uint8_t hiod_iommufd_get_pasid(HostIOMMUDevice *hiod, uint64_t *hw_caps)
+{
+ HostIOMMUDeviceCaps *caps = &hiod->caps;
+
+ *hw_caps = caps->hw_caps;
+ return caps->max_pasid_log2;
+}
+
static int hiod_iommufd_get_cap(HostIOMMUDevice *hiod, int cap, Error **errp)
{
HostIOMMUDeviceCaps *caps = &hiod->caps;
@@ -543,6 +551,7 @@ static void hiod_iommufd_class_init(ObjectClass *oc, const void *data)
HostIOMMUDeviceClass *hioc = HOST_IOMMU_DEVICE_CLASS(oc);
hioc->get_cap = hiod_iommufd_get_cap;
+ hioc->get_pasid = hiod_iommufd_get_pasid;
};
static const TypeInfo types[] = {
diff --git a/include/system/host_iommu_device.h b/include/system/host_iommu_device.h
index c6a2a3899a..3773c54977 100644
--- a/include/system/host_iommu_device.h
+++ b/include/system/host_iommu_device.h
@@ -115,6 +115,18 @@ struct HostIOMMUDeviceClass {
* @hiod: handle to the host IOMMU device
*/
uint64_t (*get_page_size_mask)(HostIOMMUDevice *hiod);
+ /**
+ * @get_pasid: Get PASID support information along this
+ * @hiod Host IOMMU device
+ * Optional callback. If not implemented, PASID not supported
+ *
+ * @hiod: handle to the host IOMMU device
+ *
+ * @out_hw_caps: Output the generic iommu capability info which includes
+ * device PASID CAP info
+ * Returns the width of PASIDs. Zero means no PASID support
+ */
+ uint8_t (*get_pasid)(HostIOMMUDevice *hiod, uint64_t *out_hw_caps);
};
/*
--
2.43.0
* Re: [PATCH v4 25/27] backends/iommufd: Add a callback helper to retrieve PASID support
2025-09-29 13:36 ` [PATCH v4 25/27] backends/iommufd: Add a callback helper to retrieve PASID support Shameer Kolothum
@ 2025-10-01 13:52 ` Jonathan Cameron via
0 siblings, 0 replies; 118+ messages in thread
From: Jonathan Cameron via @ 2025-10-01 13:52 UTC (permalink / raw)
To: Shameer Kolothum
Cc: qemu-arm, qemu-devel, eric.auger, peter.maydell, jgg, nicolinc,
ddutile, berrange, nathanc, mochs, smostafa, wangzhou1,
jiangkunkun, zhangfei.gao, zhenzhong.duan, yi.l.liu,
shameerkolothum
On Mon, 29 Sep 2025 14:36:41 +0100
Shameer Kolothum <skolothumtho@nvidia.com> wrote:
> A subsequent patch will use this to add a PASID capability for assigned devices.
>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
Trivial stuff.
> ---
> backends/iommufd.c | 9 +++++++++
> include/system/host_iommu_device.h | 12 ++++++++++++
> 2 files changed, 21 insertions(+)
>
> diff --git a/backends/iommufd.c b/backends/iommufd.c
> index 023e67bc46..0ff46a5747 100644
> --- a/backends/iommufd.c
> +++ b/backends/iommufd.c
> @@ -523,6 +523,14 @@ bool host_iommu_device_iommufd_detach_hwpt(HostIOMMUDeviceIOMMUFD *idev,
> return idevc->detach_hwpt(idev, errp);
> }
>
> +static uint8_t hiod_iommufd_get_pasid(HostIOMMUDevice *hiod, uint64_t *hw_caps)
> +{
> + HostIOMMUDeviceCaps *caps = &hiod->caps;
> +
> + *hw_caps = caps->hw_caps;
> + return caps->max_pasid_log2;
> +}
> +
> static int hiod_iommufd_get_cap(HostIOMMUDevice *hiod, int cap, Error **errp)
> {
> HostIOMMUDeviceCaps *caps = &hiod->caps;
> @@ -543,6 +551,7 @@ static void hiod_iommufd_class_init(ObjectClass *oc, const void *data)
> HostIOMMUDeviceClass *hioc = HOST_IOMMU_DEVICE_CLASS(oc);
>
> hioc->get_cap = hiod_iommufd_get_cap;
> + hioc->get_pasid = hiod_iommufd_get_pasid;
> };
>
> static const TypeInfo types[] = {
> diff --git a/include/system/host_iommu_device.h b/include/system/host_iommu_device.h
> index c6a2a3899a..3773c54977 100644
> --- a/include/system/host_iommu_device.h
> +++ b/include/system/host_iommu_device.h
> @@ -115,6 +115,18 @@ struct HostIOMMUDeviceClass {
> * @hiod: handle to the host IOMMU device
> */
> uint64_t (*get_page_size_mask)(HostIOMMUDevice *hiod);
> + /**
> + * @get_pasid: Get PASID support information along this
> + * @hiod Host IOMMU device
> + * Optional callback. If not implemented, PASID not supported
> + *
> + * @hiod: handle to the host IOMMU device
> + *
> + * @out_hw_caps: Output the generic iommu capability info which includes
> + * device PASID CAP info
Blank line here to match local style.
> + * Returns the width of PASIDs. Zero means no PASID support
* Returns: width of PASIDs. ...
to match other comments in this file. I only checked a few of them
so maybe this style is there somewhere as well.
> + */
> + uint8_t (*get_pasid)(HostIOMMUDevice *hiod, uint64_t *out_hw_caps);
> };
>
> /*
* [PATCH v4 26/27] vfio: Synthesize vPASID capability to VM
2025-09-29 13:36 [PATCH v4 00/27] hw/arm/virt: Add support for user-creatable accelerated SMMUv3 Shameer Kolothum
` (24 preceding siblings ...)
2025-09-29 13:36 ` [PATCH v4 25/27] backends/iommufd: Add a callback helper to retrieve PASID support Shameer Kolothum
@ 2025-09-29 13:36 ` Shameer Kolothum
2025-10-01 13:58 ` Jonathan Cameron via
2025-09-29 13:36 ` [PATCH v4 27/27] hw.arm/smmuv3: Add support for PASID enable Shameer Kolothum
2025-10-17 6:25 ` [PATCH v4 00/27] hw/arm/virt: Add support for user-creatable accelerated SMMUv3 Zhangfei Gao
27 siblings, 1 reply; 118+ messages in thread
From: Shameer Kolothum @ 2025-09-29 13:36 UTC (permalink / raw)
To: qemu-arm, qemu-devel
Cc: eric.auger, peter.maydell, jgg, nicolinc, ddutile, berrange,
nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu,
shameerkolothum
From: Yi Liu <yi.l.liu@intel.com>
If the user wants to expose the PASID capability in the vIOMMU, VFIO will
also report the PASID cap for this device, provided the underlying hardware
supports it.
As a start, the vPASID cap is placed in the last 8 bytes of the vconfig
space, in the hope that it does not conflict with any existing capability
or hidden registers. For devices that have hidden registers, the user
should determine a proper offset for the vPASID cap; this may require a
user-configurable option, which is left as a future extension. There is
further discussion on the mechanism for finding a proper offset:
https://lore.kernel.org/kvm/BN9PR11MB5276318969A212AD0649C7BE8CBE2@BN9PR11MB5276.namprd11.prod.outlook.com/
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
hw/vfio/pci.c | 31 +++++++++++++++++++++++++++++++
include/hw/iommu.h | 1 +
2 files changed, 32 insertions(+)
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 5b022da19e..f54ebd0111 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -24,6 +24,7 @@
#include <sys/ioctl.h>
#include "hw/hw.h"
+#include "hw/iommu.h"
#include "hw/pci/msi.h"
#include "hw/pci/msix.h"
#include "hw/pci/pci_bridge.h"
@@ -2500,7 +2501,12 @@ static int vfio_setup_rebar_ecap(VFIOPCIDevice *vdev, uint16_t pos)
static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
{
+ HostIOMMUDevice *hiod = vdev->vbasedev.hiod;
+ HostIOMMUDeviceClass *hiodc = HOST_IOMMU_DEVICE_GET_CLASS(hiod);
PCIDevice *pdev = PCI_DEVICE(vdev);
+ uint8_t max_pasid_log2 = 0;
+ bool pasid_cap_added = false;
+ uint64_t hw_caps;
uint32_t header;
uint16_t cap_id, next, size;
uint8_t cap_ver;
@@ -2578,12 +2584,37 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
pcie_add_capability(pdev, cap_id, cap_ver, next, size);
}
break;
+ case PCI_EXT_CAP_ID_PASID:
+ pasid_cap_added = true;
+ /* fallthrough */
default:
pcie_add_capability(pdev, cap_id, cap_ver, next, size);
}
}
+ /*
+ * If PCI_EXT_CAP_ID_PASID not present, try to get information from the host
+ */
+ if (!pasid_cap_added && hiodc->get_pasid) {
+ max_pasid_log2 = hiodc->get_pasid(hiod, &hw_caps);
+ }
+
+ /*
+ * If supported, adds the PASID capability in the end of the PCIE config
+ * space. TODO: Add option for enabling pasid at a safe offset.
+ */
+ if (max_pasid_log2 && (pci_device_get_viommu_flags(pdev) &
+ VIOMMU_FLAG_PASID_SUPPORTED)) {
+ bool exec_perm = (hw_caps & IOMMU_HW_CAP_PCI_PASID_EXEC) ? true : false;
+ bool priv_mod = (hw_caps & IOMMU_HW_CAP_PCI_PASID_PRIV) ? true : false;
+
+ pcie_pasid_init(pdev, PCIE_CONFIG_SPACE_SIZE - PCI_EXT_CAP_PASID_SIZEOF,
+ max_pasid_log2, exec_perm, priv_mod);
+ /* PASID capability is fully emulated by QEMU */
+ memset(vdev->emulated_config_bits + pdev->exp.pasid_cap, 0xff, 8);
+ }
+
/* Cleanup chain head ID if necessary */
if (pci_get_word(pdev->config + PCI_CONFIG_SPACE_SIZE) == 0xFFFF) {
pci_set_word(pdev->config + PCI_CONFIG_SPACE_SIZE, 0);
diff --git a/include/hw/iommu.h b/include/hw/iommu.h
index 65d652950a..52e7f0cd96 100644
--- a/include/hw/iommu.h
+++ b/include/hw/iommu.h
@@ -14,6 +14,7 @@
enum {
/* Nesting parent HWPT will be reused by vIOMMU to create nested HWPT */
VIOMMU_FLAG_WANT_NESTING_PARENT = BIT_ULL(0),
+ VIOMMU_FLAG_PASID_SUPPORTED = BIT_ULL(1),
};
#endif /* HW_IOMMU_H */
--
2.43.0
^ permalink raw reply related [flat|nested] 118+ messages in thread
* Re: [PATCH v4 26/27] vfio: Synthesize vPASID capability to VM
2025-09-29 13:36 ` [PATCH v4 26/27] vfio: Synthesize vPASID capability to VM Shameer Kolothum
@ 2025-10-01 13:58 ` Jonathan Cameron via
2025-10-02 8:03 ` Shameer Kolothum
0 siblings, 1 reply; 118+ messages in thread
From: Jonathan Cameron via @ 2025-10-01 13:58 UTC (permalink / raw)
To: Shameer Kolothum
Cc: qemu-arm, qemu-devel, eric.auger, peter.maydell, jgg, nicolinc,
ddutile, berrange, nathanc, mochs, smostafa, wangzhou1,
jiangkunkun, zhangfei.gao, zhenzhong.duan, yi.l.liu,
shameerkolothum
On Mon, 29 Sep 2025 14:36:42 +0100
Shameer Kolothum <skolothumtho@nvidia.com> wrote:
> From: Yi Liu <yi.l.liu@intel.com>
>
> If user wants to expose PASID capability in vIOMMU, then VFIO would also
> report the PASID cap for this device if the underlying hardware supports
> it as well.
>
> As a start, this chooses to put the vPASID cap in the last 8 bytes of the
> vconfig space. This is a choice in the good hope of no conflict with any
> existing cap or hidden registers. For the devices that has hidden registers,
> user should figure out a proper offset for the vPASID cap. This may require
> an option for user to config it. Here we leave it as a future extension.
> There are more discussions on the mechanism of finding the proper offset.
>
> https://lore.kernel.org/kvm/BN9PR11MB5276318969A212AD0649C7BE8CBE2@BN9PR11MB5276.namprd11.prod.outlook.com/
>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
> ---
> hw/vfio/pci.c | 31 +++++++++++++++++++++++++++++++
> include/hw/iommu.h | 1 +
> 2 files changed, 32 insertions(+)
>
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 5b022da19e..f54ebd0111 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -24,6 +24,7 @@
> #include <sys/ioctl.h>
>
> #include "hw/hw.h"
> +#include "hw/iommu.h"
> #include "hw/pci/msi.h"
> #include "hw/pci/msix.h"
> #include "hw/pci/pci_bridge.h"
> @@ -2500,7 +2501,12 @@ static int vfio_setup_rebar_ecap(VFIOPCIDevice *vdev, uint16_t pos)
>
> static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
> {
> + HostIOMMUDevice *hiod = vdev->vbasedev.hiod;
> + HostIOMMUDeviceClass *hiodc = HOST_IOMMU_DEVICE_GET_CLASS(hiod);
> PCIDevice *pdev = PCI_DEVICE(vdev);
> + uint8_t max_pasid_log2 = 0;
> + bool pasid_cap_added = false;
> + uint64_t hw_caps;
> uint32_t header;
> uint16_t cap_id, next, size;
> uint8_t cap_ver;
> @@ -2578,12 +2584,37 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
> pcie_add_capability(pdev, cap_id, cap_ver, next, size);
> }
> break;
> + case PCI_EXT_CAP_ID_PASID:
> + pasid_cap_added = true;
> + /* fallthrough */
> default:
> pcie_add_capability(pdev, cap_id, cap_ver, next, size);
> }
>
> }
>
> + /*
> + * If PCI_EXT_CAP_ID_PASID not present, try to get information from the host
Say why it might or might not be present...
> + */
> + if (!pasid_cap_added && hiodc->get_pasid) {
> + max_pasid_log2 = hiodc->get_pasid(hiod, &hw_caps);
> + }
> +
> + /*
> + * If supported, adds the PASID capability in the end of the PCIE config
> + * space. TODO: Add option for enabling pasid at a safe offset.
What are you thinking needs doing to make it safe? If it's at the end and there
is space isn't that enough?
> + */
> + if (max_pasid_log2 && (pci_device_get_viommu_flags(pdev) &
> + VIOMMU_FLAG_PASID_SUPPORTED)) {
> + bool exec_perm = (hw_caps & IOMMU_HW_CAP_PCI_PASID_EXEC) ? true : false;
> + bool priv_mod = (hw_caps & IOMMU_HW_CAP_PCI_PASID_PRIV) ? true : false;
> +
> + pcie_pasid_init(pdev, PCIE_CONFIG_SPACE_SIZE - PCI_EXT_CAP_PASID_SIZEOF,
> + max_pasid_log2, exec_perm, priv_mod);
> + /* PASID capability is fully emulated by QEMU */
> + memset(vdev->emulated_config_bits + pdev->exp.pasid_cap, 0xff, 8);
> + }
> +
> /* Cleanup chain head ID if necessary */
> if (pci_get_word(pdev->config + PCI_CONFIG_SPACE_SIZE) == 0xFFFF) {
> pci_set_word(pdev->config + PCI_CONFIG_SPACE_SIZE, 0);
^ permalink raw reply [flat|nested] 118+ messages in thread
* RE: [PATCH v4 26/27] vfio: Synthesize vPASID capability to VM
2025-10-01 13:58 ` Jonathan Cameron via
@ 2025-10-02 8:03 ` Shameer Kolothum
2025-10-02 9:58 ` Jonathan Cameron via
0 siblings, 1 reply; 118+ messages in thread
From: Shameer Kolothum @ 2025-10-02 8:03 UTC (permalink / raw)
To: Jonathan Cameron
Cc: qemu-arm@nongnu.org, qemu-devel@nongnu.org, eric.auger@redhat.com,
peter.maydell@linaro.org, Jason Gunthorpe, Nicolin Chen,
ddutile@redhat.com, berrange@redhat.com, Nathan Chen, Matt Ochs,
smostafa@google.com, wangzhou1@hisilicon.com,
jiangkunkun@huawei.com, zhangfei.gao@linaro.org,
zhenzhong.duan@intel.com, yi.l.liu@intel.com,
shameerkolothum@gmail.com
> -----Original Message-----
> From: Jonathan Cameron <jonathan.cameron@huawei.com>
> Sent: 01 October 2025 14:58
> To: Shameer Kolothum <skolothumtho@nvidia.com>
> Cc: qemu-arm@nongnu.org; qemu-devel@nongnu.org;
> eric.auger@redhat.com; peter.maydell@linaro.org; Jason Gunthorpe
> <jgg@nvidia.com>; Nicolin Chen <nicolinc@nvidia.com>; ddutile@redhat.com;
> berrange@redhat.com; Nathan Chen <nathanc@nvidia.com>; Matt Ochs
> <mochs@nvidia.com>; smostafa@google.com; wangzhou1@hisilicon.com;
> jiangkunkun@huawei.com; zhangfei.gao@linaro.org;
> zhenzhong.duan@intel.com; yi.l.liu@intel.com;
> shameerkolothum@gmail.com
> Subject: Re: [PATCH v4 26/27] vfio: Synthesize vPASID capability to VM
>
> On Mon, 29 Sep 2025 14:36:42 +0100
> Shameer Kolothum <skolothumtho@nvidia.com> wrote:
>
> > From: Yi Liu <yi.l.liu@intel.com>
> >
> > If user wants to expose PASID capability in vIOMMU, then VFIO would also
> > report the PASID cap for this device if the underlying hardware supports
> > it as well.
> >
> > As a start, this chooses to put the vPASID cap in the last 8 bytes of the
> > vconfig space. This is a choice in the good hope of no conflict with any
> > existing cap or hidden registers. For the devices that has hidden registers,
> > user should figure out a proper offset for the vPASID cap. This may require
> > an option for user to config it. Here we leave it as a future extension.
> > There are more discussions on the mechanism of finding the proper offset.
> >
> >
> > https://lore.kernel.org/kvm/BN9PR11MB5276318969A212AD0649C7BE8CBE2@BN9PR11MB5276.namprd11.prod.outlook.com/
> >
> > Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> > Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
> > ---
> > hw/vfio/pci.c | 31 +++++++++++++++++++++++++++++++
> > include/hw/iommu.h | 1 +
> > 2 files changed, 32 insertions(+)
> >
> > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> > index 5b022da19e..f54ebd0111 100644
> > --- a/hw/vfio/pci.c
> > +++ b/hw/vfio/pci.c
> > @@ -24,6 +24,7 @@
> > #include <sys/ioctl.h>
> >
> > #include "hw/hw.h"
> > +#include "hw/iommu.h"
> > #include "hw/pci/msi.h"
> > #include "hw/pci/msix.h"
> > #include "hw/pci/pci_bridge.h"
> > @@ -2500,7 +2501,12 @@ static int vfio_setup_rebar_ecap(VFIOPCIDevice
> *vdev, uint16_t pos)
> >
> > static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
> > {
> > + HostIOMMUDevice *hiod = vdev->vbasedev.hiod;
> > + HostIOMMUDeviceClass *hiodc =
> HOST_IOMMU_DEVICE_GET_CLASS(hiod);
> > PCIDevice *pdev = PCI_DEVICE(vdev);
> > + uint8_t max_pasid_log2 = 0;
> > + bool pasid_cap_added = false;
> > + uint64_t hw_caps;
> > uint32_t header;
> > uint16_t cap_id, next, size;
> > uint8_t cap_ver;
> > @@ -2578,12 +2584,37 @@ static void vfio_add_ext_cap(VFIOPCIDevice
> *vdev)
> > pcie_add_capability(pdev, cap_id, cap_ver, next, size);
> > }
> > break;
> > + case PCI_EXT_CAP_ID_PASID:
> > + pasid_cap_added = true;
> > + /* fallthrough */
> > default:
> > pcie_add_capability(pdev, cap_id, cap_ver, next, size);
> > }
> >
> > }
> >
> > + /*
> > + * If PCI_EXT_CAP_ID_PASID not present, try to get information from the
> host
>
> Say why it might or might not be present...
>
> > + */
> > + if (!pasid_cap_added && hiodc->get_pasid) {
> > + max_pasid_log2 = hiodc->get_pasid(hiod, &hw_caps);
> > + }
> > +
> > + /*
> > + * If supported, adds the PASID capability in the end of the PCIE config
> > + * space. TODO: Add option for enabling pasid at a safe offset.
>
> What are you thinking needs doing to make it safe? If it's at the end and there
> is space isn't that enough?
That is based on this discussion thread (mentioned in commit log as well)
https://lore.kernel.org/kvm/BN9PR11MB5276318969A212AD0649C7BE8CBE2@BN9PR11MB5276.namprd11.prod.outlook.com/
" - Some devices are known to place registers in configuration space,
outside of the capability chains, which historically makes it
difficult to place a purely virtual capability without potentially
masking such hidden registers."
However, in this series we're trying to limit the impact by only placing the PASID
capability for devices that are behind the vIOMMU and where the user has explicitly
enabled PASID support for vIOMMU.
Thanks,
Shameer
>
> > + */
> > + if (max_pasid_log2 && (pci_device_get_viommu_flags(pdev) &
> > + VIOMMU_FLAG_PASID_SUPPORTED)) {
> > + bool exec_perm = (hw_caps & IOMMU_HW_CAP_PCI_PASID_EXEC) ?
> true : false;
> > + bool priv_mod = (hw_caps & IOMMU_HW_CAP_PCI_PASID_PRIV) ?
> true : false;
> > +
> > + pcie_pasid_init(pdev, PCIE_CONFIG_SPACE_SIZE -
> PCI_EXT_CAP_PASID_SIZEOF,
> > + max_pasid_log2, exec_perm, priv_mod);
> > + /* PASID capability is fully emulated by QEMU */
> > + memset(vdev->emulated_config_bits + pdev->exp.pasid_cap, 0xff, 8);
> > + }
> > +
> > /* Cleanup chain head ID if necessary */
> > if (pci_get_word(pdev->config + PCI_CONFIG_SPACE_SIZE) == 0xFFFF) {
> > pci_set_word(pdev->config + PCI_CONFIG_SPACE_SIZE, 0);
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [PATCH v4 26/27] vfio: Synthesize vPASID capability to VM
2025-10-02 8:03 ` Shameer Kolothum
@ 2025-10-02 9:58 ` Jonathan Cameron via
0 siblings, 0 replies; 118+ messages in thread
From: Jonathan Cameron via @ 2025-10-02 9:58 UTC (permalink / raw)
To: Shameer Kolothum
Cc: qemu-arm@nongnu.org, qemu-devel@nongnu.org, eric.auger@redhat.com,
peter.maydell@linaro.org, Jason Gunthorpe, Nicolin Chen,
ddutile@redhat.com, berrange@redhat.com, Nathan Chen, Matt Ochs,
smostafa@google.com, wangzhou1@hisilicon.com,
jiangkunkun@huawei.com, zhangfei.gao@linaro.org,
zhenzhong.duan@intel.com, yi.l.liu@intel.com,
shameerkolothum@gmail.com
On Thu, 2 Oct 2025 08:03:09 +0000
Shameer Kolothum <skolothumtho@nvidia.com> wrote:
> > -----Original Message-----
> > From: Jonathan Cameron <jonathan.cameron@huawei.com>
> > Sent: 01 October 2025 14:58
> > To: Shameer Kolothum <skolothumtho@nvidia.com>
> > Cc: qemu-arm@nongnu.org; qemu-devel@nongnu.org;
> > eric.auger@redhat.com; peter.maydell@linaro.org; Jason Gunthorpe
> > <jgg@nvidia.com>; Nicolin Chen <nicolinc@nvidia.com>; ddutile@redhat.com;
> > berrange@redhat.com; Nathan Chen <nathanc@nvidia.com>; Matt Ochs
> > <mochs@nvidia.com>; smostafa@google.com; wangzhou1@hisilicon.com;
> > jiangkunkun@huawei.com; zhangfei.gao@linaro.org;
> > zhenzhong.duan@intel.com; yi.l.liu@intel.com;
> > shameerkolothum@gmail.com
> > Subject: Re: [PATCH v4 26/27] vfio: Synthesize vPASID capability to VM
> >
> > On Mon, 29 Sep 2025 14:36:42 +0100
> > Shameer Kolothum <skolothumtho@nvidia.com> wrote:
> >
> > > From: Yi Liu <yi.l.liu@intel.com>
> > >
> > > If user wants to expose PASID capability in vIOMMU, then VFIO would also
> > > report the PASID cap for this device if the underlying hardware supports
> > > it as well.
> > >
> > > As a start, this chooses to put the vPASID cap in the last 8 bytes of the
> > > vconfig space. This is a choice in the good hope of no conflict with any
> > > existing cap or hidden registers. For the devices that has hidden registers,
> > > user should figure out a proper offset for the vPASID cap. This may require
> > > an option for user to config it. Here we leave it as a future extension.
> > > There are more discussions on the mechanism of finding the proper offset.
> > >
> > >
> > > https://lore.kernel.org/kvm/BN9PR11MB5276318969A212AD0649C7BE8CBE2@BN9PR11MB5276.namprd11.prod.outlook.com/
> > >
> > > Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> > > Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
> > > ---
> > > hw/vfio/pci.c | 31 +++++++++++++++++++++++++++++++
> > > include/hw/iommu.h | 1 +
> > > 2 files changed, 32 insertions(+)
> > >
> > > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> > > index 5b022da19e..f54ebd0111 100644
> > > --- a/hw/vfio/pci.c
> > > +++ b/hw/vfio/pci.c
> > > @@ -24,6 +24,7 @@
> > > #include <sys/ioctl.h>
> > >
> > > #include "hw/hw.h"
> > > +#include "hw/iommu.h"
> > > #include "hw/pci/msi.h"
> > > #include "hw/pci/msix.h"
> > > #include "hw/pci/pci_bridge.h"
> > > @@ -2500,7 +2501,12 @@ static int vfio_setup_rebar_ecap(VFIOPCIDevice
> > *vdev, uint16_t pos)
> > >
> > > static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
> > > {
> > > + HostIOMMUDevice *hiod = vdev->vbasedev.hiod;
> > > + HostIOMMUDeviceClass *hiodc =
> > HOST_IOMMU_DEVICE_GET_CLASS(hiod);
> > > PCIDevice *pdev = PCI_DEVICE(vdev);
> > > + uint8_t max_pasid_log2 = 0;
> > > + bool pasid_cap_added = false;
> > > + uint64_t hw_caps;
> > > uint32_t header;
> > > uint16_t cap_id, next, size;
> > > uint8_t cap_ver;
> > > @@ -2578,12 +2584,37 @@ static void vfio_add_ext_cap(VFIOPCIDevice
> > *vdev)
> > > pcie_add_capability(pdev, cap_id, cap_ver, next, size);
> > > }
> > > break;
> > > + case PCI_EXT_CAP_ID_PASID:
> > > + pasid_cap_added = true;
> > > + /* fallthrough */
> > > default:
> > > pcie_add_capability(pdev, cap_id, cap_ver, next, size);
> > > }
> > >
> > > }
> > >
> > > + /*
> > > + * If PCI_EXT_CAP_ID_PASID not present, try to get information from the
> > host
> >
> > Say why it might or might not be present...
> >
> > > + */
> > > + if (!pasid_cap_added && hiodc->get_pasid) {
> > > + max_pasid_log2 = hiodc->get_pasid(hiod, &hw_caps);
> > > + }
> > > +
> > > + /*
> > > + * If supported, adds the PASID capability in the end of the PCIE config
> > > + * space. TODO: Add option for enabling pasid at a safe offset.
> >
> > What are you thinking needs doing to make it safe? If it's at the end and there
> > is space isn't that enough?
>
> That is based on this discussion thread (mentioned in commit log as well)
> https://lore.kernel.org/kvm/BN9PR11MB5276318969A212AD0649C7BE8CBE2@BN9PR11MB5276.namprd11.prod.outlook.com/
>
>
> " - Some devices are known to place registers in configuration space,
> outside of the capability chains, which historically makes it
> difficult to place a purely virtual capability without potentially
> masking such hidden registers."
Yuk. I know this is sometimes done to chicken-bit certain capabilities. Some of the
CXL ones only surface in root ports (on one platform anyway) if the link is trained up
as CXL. The registers are there anyway; it is just the chain pointers that are edited.
If anything truly hidden is needed for operation then, in my view, they are on
their own!
>
> However, in this series we're trying to limit the impact by only placing the PASID
> capability for devices that are behind the vIOMMU and where the user has explicitly
> enabled PASID support for vIOMMU.
Makes sense. So this TODO is more of a "do it if anyone ever needs it".
Jonathan
>
> Thanks,
> Shameer
>
>
> >
> > > + */
> > > + if (max_pasid_log2 && (pci_device_get_viommu_flags(pdev) &
> > > + VIOMMU_FLAG_PASID_SUPPORTED)) {
> > > + bool exec_perm = (hw_caps & IOMMU_HW_CAP_PCI_PASID_EXEC) ?
> > true : false;
> > > + bool priv_mod = (hw_caps & IOMMU_HW_CAP_PCI_PASID_PRIV) ?
> > true : false;
> > > +
> > > + pcie_pasid_init(pdev, PCIE_CONFIG_SPACE_SIZE -
> > PCI_EXT_CAP_PASID_SIZEOF,
> > > + max_pasid_log2, exec_perm, priv_mod);
> > > + /* PASID capability is fully emulated by QEMU */
> > > + memset(vdev->emulated_config_bits + pdev->exp.pasid_cap, 0xff, 8);
> > > + }
> > > +
> > > /* Cleanup chain head ID if necessary */
> > > if (pci_get_word(pdev->config + PCI_CONFIG_SPACE_SIZE) == 0xFFFF) {
> > > pci_set_word(pdev->config + PCI_CONFIG_SPACE_SIZE, 0);
>
>
^ permalink raw reply [flat|nested] 118+ messages in thread
* [PATCH v4 27/27] hw.arm/smmuv3: Add support for PASID enable
2025-09-29 13:36 [PATCH v4 00/27] hw/arm/virt: Add support for user-creatable accelerated SMMUv3 Shameer Kolothum
` (25 preceding siblings ...)
2025-09-29 13:36 ` [PATCH v4 26/27] vfio: Synthesize vPASID capability to VM Shameer Kolothum
@ 2025-09-29 13:36 ` Shameer Kolothum
2025-10-01 14:01 ` Jonathan Cameron via
2025-10-17 6:25 ` [PATCH v4 00/27] hw/arm/virt: Add support for user-creatable accelerated SMMUv3 Zhangfei Gao
27 siblings, 1 reply; 118+ messages in thread
From: Shameer Kolothum @ 2025-09-29 13:36 UTC (permalink / raw)
To: qemu-arm, qemu-devel
Cc: eric.auger, peter.maydell, jgg, nicolinc, ddutile, berrange,
nathanc, mochs, smostafa, wangzhou1, jiangkunkun,
jonathan.cameron, zhangfei.gao, zhenzhong.duan, yi.l.liu,
shameerkolothum
QEMU SMMUv3 currently forces SSID (Substream ID) to zero. One key use case
for accelerated mode is Shared Virtual Addressing (SVA), which requires
SSID support so the guest can maintain multiple context descriptors per
substream ID.
Provide an option for the user to enable PASID support. An SSIDSIZE of 16
is currently used as the default.
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
hw/arm/smmuv3-accel.c | 24 +++++++++++++++++++++++-
hw/arm/smmuv3-internal.h | 1 +
hw/arm/smmuv3.c | 8 +++++++-
include/hw/arm/smmuv3.h | 1 +
4 files changed, 32 insertions(+), 2 deletions(-)
diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
index 283d36e6cd..0de9598dcb 100644
--- a/hw/arm/smmuv3-accel.c
+++ b/hw/arm/smmuv3-accel.c
@@ -79,6 +79,13 @@ smmuv3_accel_check_hw_compatible(SMMUv3State *s,
return false;
}
+ /* If user enables PASID support (pasid=on), QEMU sets SSIDSIZE to 16 */
+ val = FIELD_EX32(info->idr[1], IDR1, SSIDSIZE);
+ if (val < FIELD_EX32(s->idr[1], IDR1, SSIDSIZE)) {
+ error_setg(errp, "Host SMMUv3 SSIDSIZE not compatible");
+ return false;
+ }
+
/* User can override QEMU SMMUv3 Range Invalidation support */
val = FIELD_EX32(info->idr[3], IDR3, RIL);
if (val != FIELD_EX32(s->idr[3], IDR3, RIL)) {
@@ -635,7 +642,14 @@ static uint64_t smmuv3_accel_get_viommu_flags(void *opaque)
* The real HW nested support should be reported from host SMMUv3 and if
* it doesn't, the nesting parent allocation will fail anyway in VFIO core.
*/
- return VIOMMU_FLAG_WANT_NESTING_PARENT;
+ uint64_t flags = VIOMMU_FLAG_WANT_NESTING_PARENT;
+ SMMUState *bs = opaque;
+ SMMUv3State *s = ARM_SMMUV3(bs);
+
+ if (s->pasid) {
+ flags |= VIOMMU_FLAG_PASID_SUPPORTED;
+ }
+ return flags;
}
static const PCIIOMMUOps smmuv3_accel_ops = {
@@ -664,6 +678,14 @@ void smmuv3_accel_idr_override(SMMUv3State *s)
if (s->oas == 48) {
s->idr[5] = FIELD_DP32(s->idr[5], IDR5, OAS, SMMU_IDR5_OAS_48);
}
+
+ /*
+ * By default QEMU SMMUv3 has no PASID (SSID) support. Update IDR1 if user
+ * has enabled it.
+ */
+ if (s->pasid) {
+ s->idr[1] = FIELD_DP32(s->idr[1], IDR1, SSIDSIZE, SMMU_IDR1_SSIDSIZE);
+ }
}
/*
diff --git a/hw/arm/smmuv3-internal.h b/hw/arm/smmuv3-internal.h
index 910a34e05b..38e9da245b 100644
--- a/hw/arm/smmuv3-internal.h
+++ b/hw/arm/smmuv3-internal.h
@@ -81,6 +81,7 @@ REG32(IDR1, 0x4)
FIELD(IDR1, ECMDQ, 31, 1)
#define SMMU_IDR1_SIDSIZE 16
+#define SMMU_IDR1_SSIDSIZE 16
#define SMMU_CMDQS 19
#define SMMU_EVENTQS 19
diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index 7c391ab711..f7a1635ec7 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -604,7 +604,8 @@ static int decode_ste(SMMUv3State *s, SMMUTransCfg *cfg,
}
}
- if (STE_S1CDMAX(ste) != 0) {
+ /* If pasid enabled, we report SSIDSIZE = 16 */
+ if (!FIELD_EX32(s->idr[1], IDR1, SSIDSIZE) && STE_S1CDMAX(ste) != 0) {
qemu_log_mask(LOG_UNIMP,
"SMMUv3 does not support multiple context descriptors yet\n");
goto bad_ste;
@@ -1962,6 +1963,10 @@ static bool smmu_validate_property(SMMUv3State *s, Error **errp)
error_setg(errp, "oas can only be set to 44 bits if accel=off");
return false;
}
+ if (s->pasid) {
+ error_setg(errp, "pasid can only be enabled if accel=on");
+ return false;
+ }
return true;
}
@@ -2088,6 +2093,7 @@ static const Property smmuv3_properties[] = {
DEFINE_PROP_BOOL("ril", SMMUv3State, ril, true),
DEFINE_PROP_BOOL("ats", SMMUv3State, ats, false),
DEFINE_PROP_UINT8("oas", SMMUv3State, oas, 44),
+ DEFINE_PROP_BOOL("pasid", SMMUv3State, pasid, false),
};
static void smmuv3_instance_init(Object *obj)
diff --git a/include/hw/arm/smmuv3.h b/include/hw/arm/smmuv3.h
index d3788b2d85..3781b79fc8 100644
--- a/include/hw/arm/smmuv3.h
+++ b/include/hw/arm/smmuv3.h
@@ -71,6 +71,7 @@ struct SMMUv3State {
bool ril;
bool ats;
uint8_t oas;
+ bool pasid;
};
typedef enum {
--
2.43.0
^ permalink raw reply related [flat|nested] 118+ messages in thread
* Re: [PATCH v4 27/27] hw.arm/smmuv3: Add support for PASID enable
2025-09-29 13:36 ` [PATCH v4 27/27] hw.arm/smmuv3: Add support for PASID enable Shameer Kolothum
@ 2025-10-01 14:01 ` Jonathan Cameron via
0 siblings, 0 replies; 118+ messages in thread
From: Jonathan Cameron via @ 2025-10-01 14:01 UTC (permalink / raw)
To: Shameer Kolothum
Cc: qemu-arm, qemu-devel, eric.auger, peter.maydell, jgg, nicolinc,
ddutile, berrange, nathanc, mochs, smostafa, wangzhou1,
jiangkunkun, zhangfei.gao, zhenzhong.duan, yi.l.liu,
shameerkolothum
On Mon, 29 Sep 2025 14:36:43 +0100
Shameer Kolothum <skolothumtho@nvidia.com> wrote:
> QEMU SMMUv3 currently forces SSID (Substream ID) to zero. One key use case
> for accelerated mode is Shared Virtual Addressing (SVA), which requires
> SSID support so the guest can maintain multiple context descriptors per
> substream ID.
>
> Provide an option for user to enable PASID support. A SSIDSIZE of 16
> is currently used as default.
>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
Some follow through from comments in earlier patches, but in general LGTM
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
^ permalink raw reply [flat|nested] 118+ messages in thread
* Re: [PATCH v4 00/27] hw/arm/virt: Add support for user-creatable accelerated SMMUv3
2025-09-29 13:36 [PATCH v4 00/27] hw/arm/virt: Add support for user-creatable accelerated SMMUv3 Shameer Kolothum
` (26 preceding siblings ...)
2025-09-29 13:36 ` [PATCH v4 27/27] hw.arm/smmuv3: Add support for PASID enable Shameer Kolothum
@ 2025-10-17 6:25 ` Zhangfei Gao
2025-10-17 9:43 ` Shameer Kolothum
27 siblings, 1 reply; 118+ messages in thread
From: Zhangfei Gao @ 2025-10-17 6:25 UTC (permalink / raw)
To: Shameer Kolothum
Cc: qemu-arm, qemu-devel, eric.auger, peter.maydell, jgg, nicolinc,
ddutile, berrange, nathanc, mochs, smostafa, wangzhou1,
jiangkunkun, jonathan.cameron, zhenzhong.duan, yi.l.liu,
shameerkolothum
Hi, Shameer
On Mon, 29 Sept 2025 at 21:39, Shameer Kolothum <skolothumtho@nvidia.com> wrote:
>
> Hi,
>
> Changes from RFCv3:
>
> -Removed RFC tag as we have the user-creatable SMMUv3 series now applied[0]
> -Addressed feedback from RFCv3. Thanks to all! (I believe I have addressed
> all comments, apologies if I missed any)
> -Removed dependency on “at least one cold-plugged vfio-pci device.” The
> accelerated SMMUv3 features are now initialized based on QEMU SMMUv3
> defaults, and each time a device is attached, the host SMMUv3 info is
> retrieved and features are cross-checked.
> -Includes IORT RMR support to enable MSI doorbell address translation.
> Thanks to Eric, this is based on his earlier attempt on DSM #5 and
> IORT RMR support.
> -Added optional properties (like ATS, RIL, etc.) for the user to override
> the default QEMU SMMUv3 features.
> -Deferred batched invalidation of commands for now. This series supports
> basic single in-order command issuing to the host. Batched support will
> be added as a follow up series.
> -Includes synthesizing PASID capability for the assigned vfio-pci device.
> Thanks to Yi’s effort, this is based on his out-of-tree patches.
> -Added a migration blocker for now. Plan is to enable migration support
> later.
> -Has dependency (patches 4/5/8) on Zhenzhong's pass-through support series[1]
>
> PATCH organization:
> 1–20: Enables accelerated SMMUv3 with features based on default QEMU SMMUv3,
> including IORT RMR based MSI support.
> 21–23: Adds options for specifying RIL, ATS, and OAS features.
> 24–27: Adds PASID support, including VFIO changes.
>
> Tests:
> Performed basic sanity tests on an NVIDIA GRACE platform with GPU device
> assignments. A CUDA test application was used to verify the SVA use case.
> Further tests are always welcome.
>
> Eg: Qemu Cmd line:
>
> qemu-system-aarch64 -machine virt,gic-version=3,highmem-mmio-size=2T \
> -cpu host -smp cpus=4 -m size=16G,slots=2,maxmem=66G -nographic \
> -bios QEMU_EFI.fd -object iommufd,id=iommufd0 -enable-kvm \
> -object memory-backend-ram,size=8G,id=m0 \
> -object memory-backend-ram,size=8G,id=m1 \
> -numa node,memdev=m0,cpus=0-3,nodeid=0 -numa node,memdev=m1,nodeid=1 \
> -numa node,nodeid=2 -numa node,nodeid=3 -numa node,nodeid=4 -numa node,nodeid=5 \
> -numa node,nodeid=6 -numa node,nodeid=7 -numa node,nodeid=8 -numa node,nodeid=9 \
> -device pxb-pcie,id=pcie.1,bus_nr=1,bus=pcie.0 \
> -device arm-smmuv3,primary-bus=pcie.1,id=smmuv3.0,accel=on,ats=on,ril=off,pasid=on,oas=48 \
> -device pcie-root-port,id=pcie.port1,bus=pcie.1,chassis=1,pref64-reserve=512G,id=dev0 \
> -device vfio-pci,host=0019:06:00.0,rombar=0,id=dev0,iommufd=iommufd0,bus=pcie.port1 \
> -object acpi-generic-initiator,id=gi0,pci-dev=dev0,node=2 \
> ...
> -object acpi-generic-initiator,id=gi7,pci-dev=dev0,node=9 \
> -device pxb-pcie,id=pcie.2,bus_nr=8,bus=pcie.0 \
> -device arm-smmuv3,primary-bus=pcie.2,id=smmuv3.1,accel=on,ats=on,ril=off,pasid=on \
> -device pcie-root-port,id=pcie.port2,bus=pcie.2,chassis=2,pref64-reserve=512G \
> -device vfio-pci,host=0018:06:00.0,rombar=0,id=dev1,iommufd=iommufd0,bus=pcie.port2 \
> -device virtio-blk-device,drive=fs \
> -drive file=image.qcow2,index=0,media=disk,format=qcow2,if=none,id=fs \
> -net none \
> -nographic
>
> A complete branch can be found here,
> https://github.com/shamiali2008/qemu-master smmuv3-accel-v4
I have tested this series with stall enabled.
https://github.com/Linaro/qemu/pull/new/10.1.50-wip
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
By the way, the stall feature requires some additional patches,
including page fault handling.
Shall we handle that after this series?
Thanks
^ permalink raw reply [flat|nested] 118+ messages in thread
* RE: [PATCH v4 00/27] hw/arm/virt: Add support for user-creatable accelerated SMMUv3
2025-10-17 6:25 ` [PATCH v4 00/27] hw/arm/virt: Add support for user-creatable accelerated SMMUv3 Zhangfei Gao
@ 2025-10-17 9:43 ` Shameer Kolothum
0 siblings, 0 replies; 118+ messages in thread
From: Shameer Kolothum @ 2025-10-17 9:43 UTC (permalink / raw)
To: Zhangfei Gao
Cc: qemu-arm@nongnu.org, qemu-devel@nongnu.org, eric.auger@redhat.com,
peter.maydell@linaro.org, Jason Gunthorpe, Nicolin Chen,
ddutile@redhat.com, berrange@redhat.com, Nathan Chen, Matt Ochs,
smostafa@google.com, wangzhou1@hisilicon.com,
jiangkunkun@huawei.com, jonathan.cameron@huawei.com,
zhenzhong.duan@intel.com, yi.l.liu@intel.com,
shameerkolothum@gmail.com
Hi Zhangfei,
> -----Original Message-----
> From: Zhangfei Gao <zhangfei.gao@linaro.org>
> Sent: 17 October 2025 07:25
> To: Shameer Kolothum <skolothumtho@nvidia.com>
> Cc: qemu-arm@nongnu.org; qemu-devel@nongnu.org;
> eric.auger@redhat.com; peter.maydell@linaro.org; Jason Gunthorpe
> <jgg@nvidia.com>; Nicolin Chen <nicolinc@nvidia.com>;
> ddutile@redhat.com; berrange@redhat.com; Nathan Chen
> <nathanc@nvidia.com>; Matt Ochs <mochs@nvidia.com>;
> smostafa@google.com; wangzhou1@hisilicon.com;
> jiangkunkun@huawei.com; jonathan.cameron@huawei.com;
> zhenzhong.duan@intel.com; yi.l.liu@intel.com;
> shameerkolothum@gmail.com
> Subject: Re: [PATCH v4 00/27] hw/arm/virt: Add support for user-creatable
> accelerated SMMUv3
>
> I have tested this series with stall enabled.
> Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Thanks for that.
> By the way, the stall feature requires some additional patches, including page
> fault handling.
> Shall we handle that after this series?
Yes. I am working on v5 of the series addressing comments/feedback received so far.
STALL can be enabled as a follow up series as it is not that straightforward 😊
Thanks,
Shameer
^ permalink raw reply [flat|nested] 118+ messages in thread