qemu-devel.nongnu.org archive mirror
* [PATCH rfcv3 00/21] intel_iommu: Enable stage-1 translation for passthrough device
@ 2025-05-21 11:14 Zhenzhong Duan
  2025-05-21 11:14 ` [PATCH rfcv3 01/21] backends/iommufd: Add a helper to invalidate user-managed HWPT Zhenzhong Duan
                   ` (21 more replies)
  0 siblings, 22 replies; 63+ messages in thread
From: Zhenzhong Duan @ 2025-05-21 11:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan

Hi,

Per Jason Wang's suggestion, the iommufd nesting series[1] was split into
an "Enable stage-1 translation for emulated device" series and an
"Enable stage-1 translation for passthrough device" series.

This series is the second part, focusing on passthrough devices. We don't
shadow the guest page table for passthrough devices, but instead pass the
stage-1 page table to the host side to construct a nested domain. There
were earlier efforts to enable this feature; see [2] for details.

The key design is to utilize the dual-stage IOMMU translation
(also known as IOMMU nested translation) capability of the host IOMMU.
As the diagram below shows, the guest I/O page table pointer in GPA
(guest physical address) is passed to the host and used to perform
the stage-1 address translation. Along with it, modifications to
present mappings in the guest I/O page table must be followed
by an IOTLB invalidation.

        .-------------.  .---------------------------.
        |   vIOMMU    |  | Guest I/O page table      |
        |             |  '---------------------------'
        .----------------/
        | PASID Entry |--- PASID cache flush --+
        '-------------'                        |
        |             |                        V
        |             |           I/O page table pointer in GPA
        '-------------'
    Guest
    ------| Shadow |---------------------------|--------
          v        v                           v
    Host
        .-------------.  .------------------------.
        |   pIOMMU    |  | Stage1 for GIOVA->GPA  |
        |             |  '------------------------'
        .----------------/  |
        | PASID Entry |     V (Nested xlate)
        '----------------\.--------------------------------------.
        |             |   | Stage2 for GPA->HPA, unmanaged domain|
        |             |   '--------------------------------------'
        '-------------'
For historical reasons, different revisions of the VT-d spec use
different naming, where:
 - Stage1 = First stage = First level = flts
 - Stage2 = Second stage = Second level = slts
<Intel VT-d Nested translation>

There are some interactions between VFIO and the vIOMMU:
* The vIOMMU registers PCIIOMMUOps [set|unset]_iommu_device with the PCI
  subsystem. VFIO calls them to register/unregister a HostIOMMUDevice
  instance with the vIOMMU at vfio device realize time.
* The vIOMMU calls the HostIOMMUDeviceIOMMUFD interface [at|de]tach_hwpt
  to bind/unbind a device to/from IOMMUFD backed domains, nested or not.

See below diagram:

        VFIO Device                                 Intel IOMMU
    .-----------------.                         .-------------------.
    |                 |                         |                   |
    |       .---------|PCIIOMMUOps              |.-------------.    |
    |       | IOMMUFD |(set_iommu_device)       || Host IOMMU  |    |
    |       | Device  |------------------------>|| Device list |    |
    |       .---------|(unset_iommu_device)     |.-------------.    |
    |                 |                         |       |           |
    |                 |                         |       V           |
    |       .---------|  HostIOMMUDeviceIOMMUFD |  .-------------.  |
    |       | IOMMUFD |            (attach_hwpt)|  | Host IOMMU  |  |
    |       | link    |<------------------------|  |   Device    |  |
    |       .---------|            (detach_hwpt)|  .-------------.  |
    |                 |                         |       |           |
    |                 |                         |       ...         |
    .-----------------.                         .-------------------.

Based on Yi's suggestion, this design shares ioas/hwpt whenever possible
and creates new ones on demand; it also supports multiple iommufd objects
and ERRATA_772415.

E.g., within one guest's scope, a stage-2 page table can be shared by
different devices if there is no conflict and the devices link to the same
iommufd object, i.e., devices under the same host IOMMU can share the same
stage-2 page table. If there is a conflict, i.e., one device runs in
non cache coherency (non-CC) mode unlike the others, it requires a
separate stage-2 page table in non-CC mode.

The SPR platform has ERRATA_772415, which requires that there be no
read-only mappings in the stage-2 page table. This series supports
creating a VTDIOASContainer with no read-only mappings. In the rare case
where some IOMMUs on a multi-IOMMU host have ERRATA_772415 and others do
not, this design still works.

See below example diagram for a full view:

      IntelIOMMUState
             |
             V
    .------------------.    .------------------.    .-------------------.
    | VTDIOASContainer |--->| VTDIOASContainer |--->| VTDIOASContainer  |-->...
    | (iommufd0,RW&RO) |    | (iommufd1,RW&RO) |    | (iommufd0,only RW)|
    .------------------.    .------------------.    .-------------------.
             |                       |                              |
             |                       .-->...                        |
             V                                                      V
      .-------------------.    .-------------------.          .---------------.
      |   VTDS2Hwpt(CC)   |--->| VTDS2Hwpt(non-CC) |-->...    | VTDS2Hwpt(CC) |-->...
      .-------------------.    .-------------------.          .---------------.
          |            |               |                            |
          |            |               |                            |
    .-----------.  .-----------.  .------------.              .------------.
    | IOMMUFD   |  | IOMMUFD   |  | IOMMUFD    |              | IOMMUFD    |
    | Device(CC)|  | Device(CC)|  | Device     |              | Device(CC) |
    | (iommufd0)|  | (iommufd0)|  | (non-CC)   |              | (errata)   |
    |           |  |           |  | (iommufd0) |              | (iommufd0) |
    .-----------.  .-----------.  .------------.              .------------.

This series is also prerequisite work for vSVA, i.e., sharing a
guest application's address space with passthrough devices.

To enable stage-1 translation, only "x-scalable-mode=on,x-flts=on" needs
to be added, i.e. -device intel-iommu,x-scalable-mode=on,x-flts=on...

A passthrough device should use the iommufd backend to work with stage-1
translation, i.e. -object iommufd,id=iommufd0 -device vfio-pci,iommufd=iommufd0,...

If the host doesn't support nested translation, QEMU will fail with an
error reporting the missing support.
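Putting the two together, a complete invocation could look like the
following (the host device address and elided options are illustrative):

    qemu-system-x86_64 -M q35,accel=kvm \
        -device intel-iommu,x-scalable-mode=on,x-flts=on \
        -object iommufd,id=iommufd0 \
        -device vfio-pci,host=0000:3b:00.0,iommufd=iommufd0 \
        ...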

Test done:
- VFIO devices hotplug/unplug
- different VFIO devices linked to different iommufds
- vhost net device ping test

Fault reporting isn't supported in this series; we presume the guest
kernel always constructs correct stage-1 page tables for passthrough
devices. For emulated devices, the emulation code already provides
stage-1 fault injection.

PATCH1-6:  Add HWPT-based nesting infrastructure support
PATCH7-8:  Some cleanup work
PATCH9:    cap/ecap related compatibility check between vIOMMU and Host IOMMU
PATCH10-20: Implement stage-1 page table for passthrough device
PATCH21:   Enable stage-1 translation for passthrough device

QEMU code can be found at [3].

TODO:
- RAM discard
- dirty tracking on stage-2 page table
- Fault report to guest when HW Stage-1 faults

[1] https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg02740.html
[2] https://patchwork.kernel.org/project/kvm/cover/20210302203827.437645-1-yi.l.liu@intel.com/
[3] https://github.com/yiliu1765/qemu/tree/zhenzhong/iommufd_nesting_rfcv3

Thanks
Zhenzhong

Changelog:
rfcv3:
- s/hwpt_id/id in iommufd_backend_invalidate_cache()'s parameter (Shameer)
- hide vtd vendor specific caps in a wrapper union (Eric, Nicolin)
- simplify return value check of get_cap() (Eric)
- drop realize_late (Cedric, Eric)
- split patch13:intel_iommu: Add PASID cache management infrastructure (Eric)
- s/vtd_pasid_cache_reset/vtd_pasid_cache_reset_locked (Eric)
- s/vtd_pe_get_domain_id/vtd_pe_get_did (Eric)
- refine comments (Eric, Donald)

rfcv2:
- Drop VTDPASIDAddressSpace and use VTDAddressSpace (Eric, Liuyi)
- Move HWPT uAPI patches ahead(patch1-8) so arm nesting could easily rebase
- add two cleanup patches(patch9-10)
- VFIO passes iommufd/devid/hwpt_id to vIOMMU instead of iommufd/devid/ioas_id
- add vtd_as_[from|to]_iommu_pasid() helper to translate between vtd_as and
  iommu pasid, this is important for dropping VTDPASIDAddressSpace


Yi Liu (3):
  intel_iommu: Replay pasid binds after context cache invalidation
  intel_iommu: Propagate PASID-based iotlb invalidation to host
  intel_iommu: Refresh pasid bind when either SRTP or TE bit is changed

Zhenzhong Duan (18):
  backends/iommufd: Add a helper to invalidate user-managed HWPT
  vfio/iommufd: Add properties and handlers to
    TYPE_HOST_IOMMU_DEVICE_IOMMUFD
  vfio/iommufd: Initialize iommufd specific members in
    HostIOMMUDeviceIOMMUFD
  vfio/iommufd: Implement [at|de]tach_hwpt handlers
  vfio/iommufd: Save vendor specific device info
  iommufd: Implement query of host VTD IOMMU's capability
  intel_iommu: Rename vtd_ce_get_rid2pasid_entry to
    vtd_ce_get_pasid_entry
  intel_iommu: Optimize context entry cache utilization
  intel_iommu: Check for compatibility with IOMMUFD backed device when
    x-flts=on
  intel_iommu: Introduce a new structure VTDHostIOMMUDevice
  intel_iommu: Introduce two helpers vtd_as_from/to_iommu_pasid_locked
  intel_iommu: Handle PASID entry removing and updating
  intel_iommu: Handle PASID entry adding
  intel_iommu: Introduce a new pasid cache invalidation type FORCE_RESET
  intel_iommu: Bind/unbind guest page table to host
  intel_iommu: ERRATA_772415 workaround
  intel_iommu: Bypass replay in stage-1 page table mode
  intel_iommu: Enable host device when x-flts=on in scalable mode

 hw/i386/intel_iommu_internal.h     |   56 +
 include/hw/i386/intel_iommu.h      |   33 +-
 include/system/host_iommu_device.h |   32 +
 include/system/iommufd.h           |   54 +
 backends/iommufd.c                 |   94 +-
 hw/i386/intel_iommu.c              | 1670 ++++++++++++++++++++++++----
 hw/vfio/iommufd.c                  |   40 +
 backends/trace-events              |    1 +
 hw/i386/trace-events               |   13 +
 9 files changed, 1791 insertions(+), 202 deletions(-)

-- 
2.34.1




* [PATCH rfcv3 01/21] backends/iommufd: Add a helper to invalidate user-managed HWPT
  2025-05-21 11:14 [PATCH rfcv3 00/21] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
@ 2025-05-21 11:14 ` Zhenzhong Duan
  2025-05-21 11:14 ` [PATCH rfcv3 02/21] vfio/iommufd: Add properties and handlers to TYPE_HOST_IOMMU_DEVICE_IOMMUFD Zhenzhong Duan
                   ` (20 subsequent siblings)
  21 siblings, 0 replies; 63+ messages in thread
From: Zhenzhong Duan @ 2025-05-21 11:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan

This helper passes cache invalidation requests from the guest to the host
to invalidate the stage-1 page table cache in host hardware.

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 include/system/iommufd.h |  4 ++++
 backends/iommufd.c       | 33 +++++++++++++++++++++++++++++++++
 backends/trace-events    |  1 +
 3 files changed, 38 insertions(+)

diff --git a/include/system/iommufd.h b/include/system/iommufd.h
index cbab75bfbf..5399519626 100644
--- a/include/system/iommufd.h
+++ b/include/system/iommufd.h
@@ -61,6 +61,10 @@ bool iommufd_backend_get_dirty_bitmap(IOMMUFDBackend *be, uint32_t hwpt_id,
                                       uint64_t iova, ram_addr_t size,
                                       uint64_t page_size, uint64_t *data,
                                       Error **errp);
+bool iommufd_backend_invalidate_cache(IOMMUFDBackend *be, uint32_t id,
+                                      uint32_t data_type, uint32_t entry_len,
+                                      uint32_t *entry_num, void *data_ptr,
+                                      Error **errp);
 
 #define TYPE_HOST_IOMMU_DEVICE_IOMMUFD TYPE_HOST_IOMMU_DEVICE "-iommufd"
 #endif
diff --git a/backends/iommufd.c b/backends/iommufd.c
index b73f75cd0b..c8788a6438 100644
--- a/backends/iommufd.c
+++ b/backends/iommufd.c
@@ -311,6 +311,39 @@ bool iommufd_backend_get_device_info(IOMMUFDBackend *be, uint32_t devid,
     return true;
 }
 
+bool iommufd_backend_invalidate_cache(IOMMUFDBackend *be, uint32_t id,
+                                      uint32_t data_type, uint32_t entry_len,
+                                      uint32_t *entry_num, void *data_ptr,
+                                      Error **errp)
+{
+    int ret, fd = be->fd;
+    uint32_t total_entries = *entry_num;
+    struct iommu_hwpt_invalidate cache = {
+        .size = sizeof(cache),
+        .hwpt_id = id,
+        .data_type = data_type,
+        .entry_len = entry_len,
+        .entry_num = total_entries,
+        .data_uptr = (uintptr_t)data_ptr,
+    };
+
+    ret = ioctl(fd, IOMMU_HWPT_INVALIDATE, &cache);
+    trace_iommufd_backend_invalidate_cache(fd, id, data_type, entry_len,
+                                           total_entries, cache.entry_num,
+                                           (uintptr_t)data_ptr,
+                                           ret ? errno : 0);
+    if (ret) {
+        *entry_num = cache.entry_num;
+        error_setg_errno(errp, errno, "IOMMU_HWPT_INVALIDATE failed:"
+                         " total %d entries, processed %d entries",
+                         total_entries, cache.entry_num);
+    } else {
+        g_assert(total_entries == cache.entry_num);
+    }
+
+    return !ret;
+}
+
 static int hiod_iommufd_get_cap(HostIOMMUDevice *hiod, int cap, Error **errp)
 {
     HostIOMMUDeviceCaps *caps = &hiod->caps;
diff --git a/backends/trace-events b/backends/trace-events
index 40811a3162..7278214ea5 100644
--- a/backends/trace-events
+++ b/backends/trace-events
@@ -18,3 +18,4 @@ iommufd_backend_alloc_hwpt(int iommufd, uint32_t dev_id, uint32_t pt_id, uint32_
 iommufd_backend_free_id(int iommufd, uint32_t id, int ret) " iommufd=%d id=%d (%d)"
 iommufd_backend_set_dirty(int iommufd, uint32_t hwpt_id, bool start, int ret) " iommufd=%d hwpt=%u enable=%d (%d)"
 iommufd_backend_get_dirty_bitmap(int iommufd, uint32_t hwpt_id, uint64_t iova, uint64_t size, uint64_t page_size, int ret) " iommufd=%d hwpt=%u iova=0x%"PRIx64" size=0x%"PRIx64" page_size=0x%"PRIx64" (%d)"
+iommufd_backend_invalidate_cache(int iommufd, uint32_t id, uint32_t data_type, uint32_t entry_len, uint32_t entry_num, uint32_t done_num, uint64_t data_ptr, int ret) " iommufd=%d id=%u data_type=%u entry_len=%u entry_num=%u done_num=%u data_ptr=0x%"PRIx64" (%d)"
-- 
2.34.1




* [PATCH rfcv3 02/21] vfio/iommufd: Add properties and handlers to TYPE_HOST_IOMMU_DEVICE_IOMMUFD
  2025-05-21 11:14 [PATCH rfcv3 00/21] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
  2025-05-21 11:14 ` [PATCH rfcv3 01/21] backends/iommufd: Add a helper to invalidate user-managed HWPT Zhenzhong Duan
@ 2025-05-21 11:14 ` Zhenzhong Duan
  2025-05-21 11:14 ` [PATCH rfcv3 03/21] vfio/iommufd: Initialize iommufd specific members in HostIOMMUDeviceIOMMUFD Zhenzhong Duan
                   ` (19 subsequent siblings)
  21 siblings, 0 replies; 63+ messages in thread
From: Zhenzhong Duan @ 2025-05-21 11:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan

Enhance the HostIOMMUDeviceIOMMUFD object with 3 new members specific
to the iommufd backend, plus 2 new class functions.

The iommufd backend members are the IOMMUFD handle, devid and hwpt_id.
The IOMMUFD handle and devid are used to allocate/free an ioas and hwpt.
hwpt_id is used to re-attach an IOMMUFD backed device to its default
hwpt, the one created by the VFIO subsystem, i.e., when the vIOMMU is
disabled by the guest. These properties are initialized after attachment.

The 2 new class functions are [at|de]tach_hwpt(). They are used to
attach/detach a hwpt. VFIO and VDPA can have different implementations,
so the implementation lives in a sub-class, e.g. HostIOMMUDeviceIOMMUFDVFIO,
instead of HostIOMMUDeviceIOMMUFD itself.

Add two wrappers host_iommu_device_iommufd_[at|de]tach_hwpt to wrap
the two class functions.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 include/system/iommufd.h | 50 ++++++++++++++++++++++++++++++++++++++++
 backends/iommufd.c       | 22 ++++++++++++++++++
 2 files changed, 72 insertions(+)

diff --git a/include/system/iommufd.h b/include/system/iommufd.h
index 5399519626..a704575662 100644
--- a/include/system/iommufd.h
+++ b/include/system/iommufd.h
@@ -67,4 +67,54 @@ bool iommufd_backend_invalidate_cache(IOMMUFDBackend *be, uint32_t id,
                                       Error **errp);
 
 #define TYPE_HOST_IOMMU_DEVICE_IOMMUFD TYPE_HOST_IOMMU_DEVICE "-iommufd"
+OBJECT_DECLARE_TYPE(HostIOMMUDeviceIOMMUFD, HostIOMMUDeviceIOMMUFDClass,
+                    HOST_IOMMU_DEVICE_IOMMUFD)
+
+/* Overload of the host IOMMU device for the iommufd backend */
+struct HostIOMMUDeviceIOMMUFD {
+    HostIOMMUDevice parent_obj;
+
+    IOMMUFDBackend *iommufd;
+    uint32_t devid;
+    uint32_t hwpt_id;
+};
+
+struct HostIOMMUDeviceIOMMUFDClass {
+    HostIOMMUDeviceClass parent_class;
+
+    /**
+     * @attach_hwpt: attach host IOMMU device to IOMMUFD hardware page table.
+     * VFIO and VDPA device can have different implementation.
+     *
+     * Mandatory callback.
+     *
+     * @idev: host IOMMU device backed by IOMMUFD backend.
+     *
+     * @hwpt_id: ID of IOMMUFD hardware page table.
+     *
+     * @errp: pass an Error out when attachment fails.
+     *
+     * Returns: true on success, false on failure.
+     */
+    bool (*attach_hwpt)(HostIOMMUDeviceIOMMUFD *idev, uint32_t hwpt_id,
+                        Error **errp);
+    /**
+     * @detach_hwpt: detach host IOMMU device from IOMMUFD hardware page table.
+     * VFIO and VDPA device can have different implementation.
+     *
+     * Mandatory callback.
+     *
+     * @idev: host IOMMU device backed by IOMMUFD backend.
+     *
+     * @errp: pass an Error out when detachment fails.
+     *
+     * Returns: true on success, false on failure.
+     */
+    bool (*detach_hwpt)(HostIOMMUDeviceIOMMUFD *idev, Error **errp);
+};
+
+bool host_iommu_device_iommufd_attach_hwpt(HostIOMMUDeviceIOMMUFD *idev,
+                                           uint32_t hwpt_id, Error **errp);
+bool host_iommu_device_iommufd_detach_hwpt(HostIOMMUDeviceIOMMUFD *idev,
+                                           Error **errp);
 #endif
diff --git a/backends/iommufd.c b/backends/iommufd.c
index c8788a6438..b114fb08e7 100644
--- a/backends/iommufd.c
+++ b/backends/iommufd.c
@@ -344,6 +344,26 @@ bool iommufd_backend_invalidate_cache(IOMMUFDBackend *be, uint32_t id,
     return !ret;
 }
 
+bool host_iommu_device_iommufd_attach_hwpt(HostIOMMUDeviceIOMMUFD *idev,
+                                           uint32_t hwpt_id, Error **errp)
+{
+    HostIOMMUDeviceIOMMUFDClass *idevc =
+        HOST_IOMMU_DEVICE_IOMMUFD_GET_CLASS(idev);
+
+    g_assert(idevc->attach_hwpt);
+    return idevc->attach_hwpt(idev, hwpt_id, errp);
+}
+
+bool host_iommu_device_iommufd_detach_hwpt(HostIOMMUDeviceIOMMUFD *idev,
+                                           Error **errp)
+{
+    HostIOMMUDeviceIOMMUFDClass *idevc =
+        HOST_IOMMU_DEVICE_IOMMUFD_GET_CLASS(idev);
+
+    g_assert(idevc->detach_hwpt);
+    return idevc->detach_hwpt(idev, errp);
+}
+
 static int hiod_iommufd_get_cap(HostIOMMUDevice *hiod, int cap, Error **errp)
 {
     HostIOMMUDeviceCaps *caps = &hiod->caps;
@@ -382,6 +402,8 @@ static const TypeInfo types[] = {
     }, {
         .name = TYPE_HOST_IOMMU_DEVICE_IOMMUFD,
         .parent = TYPE_HOST_IOMMU_DEVICE,
+        .instance_size = sizeof(HostIOMMUDeviceIOMMUFD),
+        .class_size = sizeof(HostIOMMUDeviceIOMMUFDClass),
         .class_init = hiod_iommufd_class_init,
         .abstract = true,
     }
-- 
2.34.1




* [PATCH rfcv3 03/21] vfio/iommufd: Initialize iommufd specific members in HostIOMMUDeviceIOMMUFD
  2025-05-21 11:14 [PATCH rfcv3 00/21] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
  2025-05-21 11:14 ` [PATCH rfcv3 01/21] backends/iommufd: Add a helper to invalidate user-managed HWPT Zhenzhong Duan
  2025-05-21 11:14 ` [PATCH rfcv3 02/21] vfio/iommufd: Add properties and handlers to TYPE_HOST_IOMMU_DEVICE_IOMMUFD Zhenzhong Duan
@ 2025-05-21 11:14 ` Zhenzhong Duan
  2025-05-21 11:14 ` [PATCH rfcv3 04/21] vfio/iommufd: Implement [at|de]tach_hwpt handlers Zhenzhong Duan
                   ` (18 subsequent siblings)
  21 siblings, 0 replies; 63+ messages in thread
From: Zhenzhong Duan @ 2025-05-21 11:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan

There are three iommufd specific members in HostIOMMUDeviceIOMMUFD
that need to be initialized after attachment; they will all be used
by the vIOMMU.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/vfio/iommufd.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index af1c7ab10a..5fde2b633a 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -814,6 +814,7 @@ static bool hiod_iommufd_vfio_realize(HostIOMMUDevice *hiod, void *opaque,
                                       Error **errp)
 {
     VFIODevice *vdev = opaque;
+    HostIOMMUDeviceIOMMUFD *idev;
     HostIOMMUDeviceCaps *caps = &hiod->caps;
     enum iommu_hw_info_type type;
     union {
@@ -833,6 +834,11 @@ static bool hiod_iommufd_vfio_realize(HostIOMMUDevice *hiod, void *opaque,
     caps->type = type;
     caps->hw_caps = hw_caps;
 
+    idev = HOST_IOMMU_DEVICE_IOMMUFD(hiod);
+    idev->iommufd = vdev->iommufd;
+    idev->devid = vdev->devid;
+    idev->hwpt_id = vdev->hwpt->hwpt_id;
+
     return true;
 }
 
-- 
2.34.1




* [PATCH rfcv3 04/21] vfio/iommufd: Implement [at|de]tach_hwpt handlers
  2025-05-21 11:14 [PATCH rfcv3 00/21] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
                   ` (2 preceding siblings ...)
  2025-05-21 11:14 ` [PATCH rfcv3 03/21] vfio/iommufd: Initialize iommufd specific members in HostIOMMUDeviceIOMMUFD Zhenzhong Duan
@ 2025-05-21 11:14 ` Zhenzhong Duan
  2025-05-21 11:14 ` [PATCH rfcv3 05/21] vfio/iommufd: Save vendor specific device info Zhenzhong Duan
                   ` (17 subsequent siblings)
  21 siblings, 0 replies; 63+ messages in thread
From: Zhenzhong Duan @ 2025-05-21 11:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan

Implement the [at|de]tach_hwpt handlers in the VFIO subsystem. The vIOMMU
uses them to attach to or detach from a hwpt on the host side.

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/vfio/iommufd.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index 5fde2b633a..d661737c17 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -810,6 +810,24 @@ static void vfio_iommu_iommufd_class_init(ObjectClass *klass, const void *data)
     vioc->query_dirty_bitmap = iommufd_query_dirty_bitmap;
 };
 
+static bool
+host_iommu_device_iommufd_vfio_attach_hwpt(HostIOMMUDeviceIOMMUFD *idev,
+                                           uint32_t hwpt_id, Error **errp)
+{
+    VFIODevice *vbasedev = HOST_IOMMU_DEVICE(idev)->agent;
+
+    return !iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt_id, errp);
+}
+
+static bool
+host_iommu_device_iommufd_vfio_detach_hwpt(HostIOMMUDeviceIOMMUFD *idev,
+                                           Error **errp)
+{
+    VFIODevice *vbasedev = HOST_IOMMU_DEVICE(idev)->agent;
+
+    return iommufd_cdev_detach_ioas_hwpt(vbasedev, errp);
+}
+
 static bool hiod_iommufd_vfio_realize(HostIOMMUDevice *hiod, void *opaque,
                                       Error **errp)
 {
@@ -864,10 +882,14 @@ hiod_iommufd_vfio_get_page_size_mask(HostIOMMUDevice *hiod)
 static void hiod_iommufd_vfio_class_init(ObjectClass *oc, const void *data)
 {
     HostIOMMUDeviceClass *hiodc = HOST_IOMMU_DEVICE_CLASS(oc);
+    HostIOMMUDeviceIOMMUFDClass *idevc = HOST_IOMMU_DEVICE_IOMMUFD_CLASS(oc);
 
     hiodc->realize = hiod_iommufd_vfio_realize;
     hiodc->get_iova_ranges = hiod_iommufd_vfio_get_iova_ranges;
     hiodc->get_page_size_mask = hiod_iommufd_vfio_get_page_size_mask;
+
+    idevc->attach_hwpt = host_iommu_device_iommufd_vfio_attach_hwpt;
+    idevc->detach_hwpt = host_iommu_device_iommufd_vfio_detach_hwpt;
 };
 
 static const TypeInfo types[] = {
-- 
2.34.1




* [PATCH rfcv3 05/21] vfio/iommufd: Save vendor specific device info
  2025-05-21 11:14 [PATCH rfcv3 00/21] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
                   ` (3 preceding siblings ...)
  2025-05-21 11:14 ` [PATCH rfcv3 04/21] vfio/iommufd: Implement [at|de]tach_hwpt handlers Zhenzhong Duan
@ 2025-05-21 11:14 ` Zhenzhong Duan
  2025-05-21 21:57   ` Nicolin Chen
  2025-05-26 12:15   ` Cédric Le Goater
  2025-05-21 11:14 ` [PATCH rfcv3 06/21] iommufd: Implement query of host VTD IOMMU's capability Zhenzhong Duan
                   ` (16 subsequent siblings)
  21 siblings, 2 replies; 63+ messages in thread
From: Zhenzhong Duan @ 2025-05-21 11:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan

Some device information returned by ioctl(IOMMU_GET_HW_INFO) is vendor
specific. Save it all in a newly defined structure mirroring the vendor
IOMMU's structure, so that get_cap() can then query that information for
capabilities.

We can't use the vendor IOMMU's structure directly because it is defined
in linux/iommufd.h, which breaks the build on Windows.

Suggested-by: Eric Auger <eric.auger@redhat.com>
Suggested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 include/system/host_iommu_device.h | 12 ++++++++++++
 hw/vfio/iommufd.c                  | 12 ++++++++++++
 2 files changed, 24 insertions(+)

diff --git a/include/system/host_iommu_device.h b/include/system/host_iommu_device.h
index 809cced4ba..908bfe32c7 100644
--- a/include/system/host_iommu_device.h
+++ b/include/system/host_iommu_device.h
@@ -15,6 +15,17 @@
 #include "qom/object.h"
 #include "qapi/error.h"
 
+/* This is mirror of struct iommu_hw_info_vtd */
+typedef struct Vtd_Caps {
+    uint32_t flags;
+    uint64_t cap_reg;
+    uint64_t ecap_reg;
+} Vtd_Caps;
+
+typedef union VendorCaps {
+    Vtd_Caps vtd;
+} VendorCaps;
+
 /**
  * struct HostIOMMUDeviceCaps - Define host IOMMU device capabilities.
  *
@@ -26,6 +37,7 @@
 typedef struct HostIOMMUDeviceCaps {
     uint32_t type;
     uint64_t hw_caps;
+    VendorCaps vendor_caps;
 } HostIOMMUDeviceCaps;
 
 #define TYPE_HOST_IOMMU_DEVICE "host-iommu-device"
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index d661737c17..5c740222e5 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -834,6 +834,7 @@ static bool hiod_iommufd_vfio_realize(HostIOMMUDevice *hiod, void *opaque,
     VFIODevice *vdev = opaque;
     HostIOMMUDeviceIOMMUFD *idev;
     HostIOMMUDeviceCaps *caps = &hiod->caps;
+    VendorCaps *vendor_caps = &caps->vendor_caps;
     enum iommu_hw_info_type type;
     union {
         struct iommu_hw_info_vtd vtd;
@@ -852,6 +853,17 @@ static bool hiod_iommufd_vfio_realize(HostIOMMUDevice *hiod, void *opaque,
     caps->type = type;
     caps->hw_caps = hw_caps;
 
+    switch (type) {
+    case IOMMU_HW_INFO_TYPE_INTEL_VTD:
+        vendor_caps->vtd.flags = data.vtd.flags;
+        vendor_caps->vtd.cap_reg = data.vtd.cap_reg;
+        vendor_caps->vtd.ecap_reg = data.vtd.ecap_reg;
+        break;
+    case IOMMU_HW_INFO_TYPE_ARM_SMMUV3:
+    case IOMMU_HW_INFO_TYPE_NONE:
+        break;
+    }
+
     idev = HOST_IOMMU_DEVICE_IOMMUFD(hiod);
     idev->iommufd = vdev->iommufd;
     idev->devid = vdev->devid;
-- 
2.34.1




* [PATCH rfcv3 06/21] iommufd: Implement query of host VTD IOMMU's capability
  2025-05-21 11:14 [PATCH rfcv3 00/21] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
                   ` (4 preceding siblings ...)
  2025-05-21 11:14 ` [PATCH rfcv3 05/21] vfio/iommufd: Save vendor specific device info Zhenzhong Duan
@ 2025-05-21 11:14 ` Zhenzhong Duan
  2025-05-21 11:14 ` [PATCH rfcv3 07/21] intel_iommu: Rename vtd_ce_get_rid2pasid_entry to vtd_ce_get_pasid_entry Zhenzhong Duan
                   ` (15 subsequent siblings)
  21 siblings, 0 replies; 63+ messages in thread
From: Zhenzhong Duan @ 2025-05-21 11:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan, Paolo Bonzini, Richard Henderson, Eduardo Habkost,
	Marcel Apfelbaum

Implement queries of HOST_IOMMU_DEVICE_CAP_[NESTING|FS1GP|ERRATA] for an
IOMMUFD backed host VT-d IOMMU device.

Querying these capabilities is not supported for the legacy backend
because there is no plan to support nesting with host devices backed by
the legacy backend.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/i386/intel_iommu_internal.h     |  1 +
 include/system/host_iommu_device.h |  7 ++++++
 backends/iommufd.c                 | 39 ++++++++++++++++++++++++++++--
 3 files changed, 45 insertions(+), 2 deletions(-)

diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index e8b211e8b0..2cda744786 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -191,6 +191,7 @@
 #define VTD_ECAP_PT                 (1ULL << 6)
 #define VTD_ECAP_SC                 (1ULL << 7)
 #define VTD_ECAP_MHMV               (15ULL << 20)
+#define VTD_ECAP_NEST               (1ULL << 26)
 #define VTD_ECAP_SRS                (1ULL << 31)
 #define VTD_ECAP_PASID              (1ULL << 40)
 #define VTD_ECAP_SMTS               (1ULL << 43)
diff --git a/include/system/host_iommu_device.h b/include/system/host_iommu_device.h
index 908bfe32c7..30da88789d 100644
--- a/include/system/host_iommu_device.h
+++ b/include/system/host_iommu_device.h
@@ -33,6 +33,10 @@ typedef union VendorCaps {
  *
  * @hw_caps: host platform IOMMU capabilities (e.g. on IOMMUFD this represents
  *           the @out_capabilities value returned from IOMMU_GET_HW_INFO ioctl)
+ *
+ * @vendor_caps: host platform IOMMU vendor specific capabilities (e.g. on
+ *               IOMMUFD this represents extracted content from data_uptr
+ *               buffer returned from IOMMU_GET_HW_INFO ioctl)
  */
 typedef struct HostIOMMUDeviceCaps {
     uint32_t type;
@@ -117,6 +121,9 @@ struct HostIOMMUDeviceClass {
  */
 #define HOST_IOMMU_DEVICE_CAP_IOMMU_TYPE        0
 #define HOST_IOMMU_DEVICE_CAP_AW_BITS           1
+#define HOST_IOMMU_DEVICE_CAP_NESTING           2
+#define HOST_IOMMU_DEVICE_CAP_FS1GP             3
+#define HOST_IOMMU_DEVICE_CAP_ERRATA            4
 
 #define HOST_IOMMU_DEVICE_CAP_AW_BITS_MAX       64
 #endif
diff --git a/backends/iommufd.c b/backends/iommufd.c
index b114fb08e7..d91c1eb8b8 100644
--- a/backends/iommufd.c
+++ b/backends/iommufd.c
@@ -21,6 +21,7 @@
 #include "hw/vfio/vfio-device.h"
 #include <sys/ioctl.h>
 #include <linux/iommufd.h>
+#include "hw/i386/intel_iommu_internal.h"
 
 static void iommufd_backend_init(Object *obj)
 {
@@ -364,6 +365,41 @@ bool host_iommu_device_iommufd_detach_hwpt(HostIOMMUDeviceIOMMUFD *idev,
     return idevc->detach_hwpt(idev, errp);
 }
 
+static int hiod_iommufd_get_vtd_cap(HostIOMMUDevice *hiod, int cap,
+                                    Error **errp)
+{
+    Vtd_Caps *caps = &hiod->caps.vendor_caps.vtd;
+
+    switch (cap) {
+    case HOST_IOMMU_DEVICE_CAP_NESTING:
+        return !!(caps->ecap_reg & VTD_ECAP_NEST);
+    case HOST_IOMMU_DEVICE_CAP_FS1GP:
+        return !!(caps->cap_reg & VTD_CAP_FS1GP);
+    case HOST_IOMMU_DEVICE_CAP_ERRATA:
+        return caps->flags & IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17;
+    default:
+        error_setg(errp, "%s: unsupported capability %x", hiod->name, cap);
+        return -EINVAL;
+    }
+}
+
+static int hiod_iommufd_get_vendor_cap(HostIOMMUDevice *hiod, int cap,
+                                       Error **errp)
+{
+    enum iommu_hw_info_type type = hiod->caps.type;
+
+    switch (type) {
+    case IOMMU_HW_INFO_TYPE_INTEL_VTD:
+        return hiod_iommufd_get_vtd_cap(hiod, cap, errp);
+    case IOMMU_HW_INFO_TYPE_ARM_SMMUV3:
+    case IOMMU_HW_INFO_TYPE_NONE:
+        break;
+    }
+
+    error_setg(errp, "%s: unsupported capability type %x", hiod->name, type);
+    return -EINVAL;
+}
+
 static int hiod_iommufd_get_cap(HostIOMMUDevice *hiod, int cap, Error **errp)
 {
     HostIOMMUDeviceCaps *caps = &hiod->caps;
@@ -374,8 +410,7 @@ static int hiod_iommufd_get_cap(HostIOMMUDevice *hiod, int cap, Error **errp)
     case HOST_IOMMU_DEVICE_CAP_AW_BITS:
         return vfio_device_get_aw_bits(hiod->agent);
     default:
-        error_setg(errp, "%s: unsupported capability %x", hiod->name, cap);
-        return -EINVAL;
+        return hiod_iommufd_get_vendor_cap(hiod, cap, errp);
     }
 }
 
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH rfcv3 07/21] intel_iommu: Rename vtd_ce_get_rid2pasid_entry to vtd_ce_get_pasid_entry
  2025-05-21 11:14 [PATCH rfcv3 00/21] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
                   ` (5 preceding siblings ...)
  2025-05-21 11:14 ` [PATCH rfcv3 06/21] iommufd: Implement query of host VTD IOMMU's capability Zhenzhong Duan
@ 2025-05-21 11:14 ` Zhenzhong Duan
  2025-05-21 11:14 ` [PATCH rfcv3 08/21] intel_iommu: Optimize context entry cache utilization Zhenzhong Duan
                   ` (14 subsequent siblings)
  21 siblings, 0 replies; 63+ messages in thread
From: Zhenzhong Duan @ 2025-05-21 11:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan, Paolo Bonzini, Richard Henderson, Eduardo Habkost,
	Marcel Apfelbaum

In the early days, vtd_ce_get_rid2pasid_entry() was used to get the pasid
entry of rid2pasid; it was later extended to get any pasid entry. The new
name vtd_ce_get_pasid_entry better matches what the function actually does.

No functional change intended.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Clément Mathieu--Drif<clement.mathieu--drif@eviden.com>
---
 hw/i386/intel_iommu.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 69d72ad35c..f0b1f90eff 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -944,7 +944,7 @@ static int vtd_get_pe_from_pasid_table(IntelIOMMUState *s,
     return 0;
 }
 
-static int vtd_ce_get_rid2pasid_entry(IntelIOMMUState *s,
+static int vtd_ce_get_pasid_entry(IntelIOMMUState *s,
                                       VTDContextEntry *ce,
                                       VTDPASIDEntry *pe,
                                       uint32_t pasid)
@@ -1025,7 +1025,7 @@ static uint32_t vtd_get_iova_level(IntelIOMMUState *s,
     VTDPASIDEntry pe;
 
     if (s->root_scalable) {
-        vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
+        vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
         if (s->flts) {
             return VTD_PE_GET_FL_LEVEL(&pe);
         } else {
@@ -1048,7 +1048,7 @@ static uint32_t vtd_get_iova_agaw(IntelIOMMUState *s,
     VTDPASIDEntry pe;
 
     if (s->root_scalable) {
-        vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
+        vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
         return 30 + ((pe.val[0] >> 2) & VTD_SM_PASID_ENTRY_AW) * 9;
     }
 
@@ -1116,7 +1116,7 @@ static dma_addr_t vtd_get_iova_pgtbl_base(IntelIOMMUState *s,
     VTDPASIDEntry pe;
 
     if (s->root_scalable) {
-        vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
+        vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
         if (s->flts) {
             return pe.val[2] & VTD_SM_PASID_ENTRY_FLPTPTR;
         } else {
@@ -1522,7 +1522,7 @@ static int vtd_ce_rid2pasid_check(IntelIOMMUState *s,
      * has valid rid2pasid setting, which includes valid
      * rid2pasid field and corresponding pasid entry setting
      */
-    return vtd_ce_get_rid2pasid_entry(s, ce, &pe, PCI_NO_PASID);
+    return vtd_ce_get_pasid_entry(s, ce, &pe, PCI_NO_PASID);
 }
 
 /* Map a device to its corresponding domain (context-entry) */
@@ -1611,7 +1611,7 @@ static uint16_t vtd_get_domain_id(IntelIOMMUState *s,
     VTDPASIDEntry pe;
 
     if (s->root_scalable) {
-        vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
+        vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
         return VTD_SM_PASID_ENTRY_DID(pe.val[1]);
     }
 
@@ -1687,7 +1687,7 @@ static bool vtd_dev_pt_enabled(IntelIOMMUState *s, VTDContextEntry *ce,
     int ret;
 
     if (s->root_scalable) {
-        ret = vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
+        ret = vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
         if (ret) {
             /*
              * This error is guest triggerable. We should assumt PT
-- 
2.34.1




* [PATCH rfcv3 08/21] intel_iommu: Optimize context entry cache utilization
  2025-05-21 11:14 [PATCH rfcv3 00/21] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
                   ` (6 preceding siblings ...)
  2025-05-21 11:14 ` [PATCH rfcv3 07/21] intel_iommu: Rename vtd_ce_get_rid2pasid_entry to vtd_ce_get_pasid_entry Zhenzhong Duan
@ 2025-05-21 11:14 ` Zhenzhong Duan
  2025-05-21 11:14 ` [PATCH rfcv3 09/21] intel_iommu: Check for compatibility with IOMMUFD backed device when x-flts=on Zhenzhong Duan
                   ` (13 subsequent siblings)
  21 siblings, 0 replies; 63+ messages in thread
From: Zhenzhong Duan @ 2025-05-21 11:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan, Marcel Apfelbaum, Paolo Bonzini,
	Richard Henderson, Eduardo Habkost

There are many call sites that reference the context entry by calling
vtd_dev_to_context_entry(), which traverses the DMAR table.

In most cases we can use the cached context entry in
vtd_as->context_cache_entry, except when that entry is stale. Currently only
global and domain context invalidation make it stale.

So introduce a helper function, vtd_as_to_context_entry(), which fetches from
the cache before falling back to vtd_dev_to_context_entry().

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/i386/intel_iommu.c | 36 +++++++++++++++++++++++-------------
 1 file changed, 23 insertions(+), 13 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index f0b1f90eff..a2f3250724 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -1597,6 +1597,22 @@ static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
     return 0;
 }
 
+static int vtd_as_to_context_entry(VTDAddressSpace *vtd_as, VTDContextEntry *ce)
+{
+    IntelIOMMUState *s = vtd_as->iommu_state;
+    uint8_t bus_num = pci_bus_num(vtd_as->bus);
+    uint8_t devfn = vtd_as->devfn;
+    VTDContextCacheEntry *cc_entry = &vtd_as->context_cache_entry;
+
+    /* Try to fetch context-entry from cache first */
+    if (cc_entry->context_cache_gen == s->context_cache_gen) {
+        *ce = cc_entry->context_entry;
+        return 0;
+    } else {
+        return vtd_dev_to_context_entry(s, bus_num, devfn, ce);
+    }
+}
+
 static int vtd_sync_shadow_page_hook(const IOMMUTLBEvent *event,
                                      void *private)
 {
@@ -1649,9 +1665,7 @@ static int vtd_address_space_sync(VTDAddressSpace *vtd_as)
         return 0;
     }
 
-    ret = vtd_dev_to_context_entry(vtd_as->iommu_state,
-                                   pci_bus_num(vtd_as->bus),
-                                   vtd_as->devfn, &ce);
+    ret = vtd_as_to_context_entry(vtd_as, &ce);
     if (ret) {
         if (ret == -VTD_FR_CONTEXT_ENTRY_P) {
             /*
@@ -1710,8 +1724,7 @@ static bool vtd_as_pt_enabled(VTDAddressSpace *as)
     assert(as);
 
     s = as->iommu_state;
-    if (vtd_dev_to_context_entry(s, pci_bus_num(as->bus), as->devfn,
-                                 &ce)) {
+    if (vtd_as_to_context_entry(as, &ce)) {
         /*
          * Possibly failed to parse the context entry for some reason
          * (e.g., during init, or any guest configuration errors on
@@ -2435,8 +2448,7 @@ static void vtd_iotlb_domain_invalidate(IntelIOMMUState *s, uint16_t domain_id)
     vtd_iommu_unlock(s);
 
     QLIST_FOREACH(vtd_as, &s->vtd_as_with_notifiers, next) {
-        if (!vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
-                                      vtd_as->devfn, &ce) &&
+        if (!vtd_as_to_context_entry(vtd_as, &ce) &&
             domain_id == vtd_get_domain_id(s, &ce, vtd_as->pasid)) {
             vtd_address_space_sync(vtd_as);
         }
@@ -2458,8 +2470,7 @@ static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
     hwaddr size = (1 << am) * VTD_PAGE_SIZE;
 
     QLIST_FOREACH(vtd_as, &(s->vtd_as_with_notifiers), next) {
-        ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
-                                       vtd_as->devfn, &ce);
+        ret = vtd_as_to_context_entry(vtd_as, &ce);
         if (!ret && domain_id == vtd_get_domain_id(s, &ce, vtd_as->pasid)) {
             uint32_t rid2pasid = PCI_NO_PASID;
 
@@ -2966,8 +2977,7 @@ static void vtd_piotlb_pasid_invalidate(IntelIOMMUState *s,
     vtd_iommu_unlock(s);
 
     QLIST_FOREACH(vtd_as, &s->vtd_as_with_notifiers, next) {
-        if (!vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
-                                      vtd_as->devfn, &ce) &&
+        if (!vtd_as_to_context_entry(vtd_as, &ce) &&
             domain_id == vtd_get_domain_id(s, &ce, vtd_as->pasid)) {
             uint32_t rid2pasid = VTD_CE_GET_RID2PASID(&ce);
 
@@ -4146,7 +4156,7 @@ static void vtd_report_ir_illegal_access(VTDAddressSpace *vtd_as,
     assert(vtd_as->pasid != PCI_NO_PASID);
 
     /* Try out best to fetch FPD, we can't do anything more */
-    if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {
+    if (vtd_as_to_context_entry(vtd_as, &ce) == 0) {
         is_fpd_set = ce.lo & VTD_CONTEXT_ENTRY_FPD;
         if (!is_fpd_set && s->root_scalable) {
             vtd_ce_get_pasid_fpd(s, &ce, &is_fpd_set, vtd_as->pasid);
@@ -4506,7 +4516,7 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
     /* replay is protected by BQL, page walk will re-setup it safely */
     iova_tree_remove(vtd_as->iova_tree, map);
 
-    if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {
+    if (vtd_as_to_context_entry(vtd_as, &ce) == 0) {
         trace_vtd_replay_ce_valid(s->root_scalable ? "scalable mode" :
                                   "legacy mode",
                                   bus_n, PCI_SLOT(vtd_as->devfn),
-- 
2.34.1




* [PATCH rfcv3 09/21] intel_iommu: Check for compatibility with IOMMUFD backed device when x-flts=on
  2025-05-21 11:14 [PATCH rfcv3 00/21] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
                   ` (7 preceding siblings ...)
  2025-05-21 11:14 ` [PATCH rfcv3 08/21] intel_iommu: Optimize context entry cache utilization Zhenzhong Duan
@ 2025-05-21 11:14 ` Zhenzhong Duan
  2025-05-21 11:14 ` [PATCH rfcv3 10/21] intel_iommu: Introduce a new structure VTDHostIOMMUDevice Zhenzhong Duan
                   ` (12 subsequent siblings)
  21 siblings, 0 replies; 63+ messages in thread
From: Zhenzhong Duan @ 2025-05-21 11:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan, Marcel Apfelbaum, Paolo Bonzini,
	Richard Henderson, Eduardo Habkost

When the vIOMMU is configured with x-flts=on in scalable mode, the stage-1
page table is passed to the host to construct a nested page table. We need to
check the compatibility of some critical IOMMU capabilities between the
vIOMMU and the host IOMMU to ensure that the guest stage-1 page table can be
used by the host.

For instance, if the vIOMMU supports stage-1 1GB huge page mapping but the
host does not, then this IOMMUFD-backed device should fail to attach.

Declare an enum type host_iommu_device_iommu_hw_info_type aliased to
iommu_hw_info_type, which comes from the iommufd header file. This avoids a
build failure on Windows, which doesn't support iommufd.

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 include/system/host_iommu_device.h | 13 +++++++++++
 hw/i386/intel_iommu.c              | 36 ++++++++++++++++++++++++++++++
 2 files changed, 49 insertions(+)

diff --git a/include/system/host_iommu_device.h b/include/system/host_iommu_device.h
index 30da88789d..38070aff09 100644
--- a/include/system/host_iommu_device.h
+++ b/include/system/host_iommu_device.h
@@ -125,5 +125,18 @@ struct HostIOMMUDeviceClass {
 #define HOST_IOMMU_DEVICE_CAP_FS1GP             3
 #define HOST_IOMMU_DEVICE_CAP_ERRATA            4
 
+/**
+ * enum host_iommu_device_iommu_hw_info_type - IOMMU Hardware Info Types
+ * @HOST_IOMMU_DEVICE_IOMMU_HW_INFO_TYPE_NONE: Used by the drivers that do not
+ *                                             report hardware info
+ * @HOST_IOMMU_DEVICE_IOMMU_HW_INFO_TYPE_INTEL_VTD: Intel VT-d iommu info type
+ *
+ * This is alias to enum iommu_hw_info_type but for general purpose.
+ */
+enum host_iommu_device_iommu_hw_info_type {
+    HOST_IOMMU_DEVICE_IOMMU_HW_INFO_TYPE_NONE,
+    HOST_IOMMU_DEVICE_IOMMU_HW_INFO_TYPE_INTEL_VTD,
+};
+
 #define HOST_IOMMU_DEVICE_CAP_AW_BITS_MAX       64
 #endif
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index a2f3250724..dc839037cf 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -39,6 +39,7 @@
 #include "kvm/kvm_i386.h"
 #include "migration/vmstate.h"
 #include "trace.h"
+#include "system/iommufd.h"
 
 /* context entry operations */
 #define VTD_CE_GET_RID2PASID(ce) \
@@ -4361,6 +4362,41 @@ static bool vtd_check_hiod(IntelIOMMUState *s, HostIOMMUDevice *hiod,
         return true;
     }
 
+    /* Remaining checks are all stage-1 translation specific */
+    if (!object_dynamic_cast(OBJECT(hiod), TYPE_HOST_IOMMU_DEVICE_IOMMUFD)) {
+        error_setg(errp, "Need IOMMUFD backend when x-flts=on");
+        return false;
+    }
+
+    /*
+     * HOST_IOMMU_DEVICE_CAP_IOMMU_TYPE should be supported by different
+     * backend devices, either VFIO or VDPA.
+     */
+    ret = hiodc->get_cap(hiod, HOST_IOMMU_DEVICE_CAP_IOMMU_TYPE, errp);
+    assert(ret >= 0);
+    if (ret != HOST_IOMMU_DEVICE_IOMMU_HW_INFO_TYPE_INTEL_VTD) {
+        error_setg(errp, "Incompatible host platform IOMMU type %d", ret);
+        return false;
+    }
+
+    /*
+     * HOST_IOMMU_DEVICE_CAP_NESTING/FS1GP are VTD vendor specific
+     * capabilities, so get_cap() should never fail on them now that
+     * HOST_IOMMU_DEVICE_IOMMU_HW_INFO_TYPE_INTEL_VTD type check passed
+     * above.
+     */
+    ret = hiodc->get_cap(hiod, HOST_IOMMU_DEVICE_CAP_NESTING, errp);
+    if (ret != 1) {
+        error_setg(errp, "Host IOMMU doesn't support nested translation");
+        return false;
+    }
+
+    ret = hiodc->get_cap(hiod, HOST_IOMMU_DEVICE_CAP_FS1GP, errp);
+    if (s->fs1gp && ret != 1) {
+        error_setg(errp, "Stage-1 1GB huge page is unsupported by host IOMMU");
+        return false;
+    }
+
     error_setg(errp, "host device is uncompatible with stage-1 translation");
     return false;
 }
-- 
2.34.1




* [PATCH rfcv3 10/21] intel_iommu: Introduce a new structure VTDHostIOMMUDevice
  2025-05-21 11:14 [PATCH rfcv3 00/21] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
                   ` (8 preceding siblings ...)
  2025-05-21 11:14 ` [PATCH rfcv3 09/21] intel_iommu: Check for compatibility with IOMMUFD backed device when x-flts=on Zhenzhong Duan
@ 2025-05-21 11:14 ` Zhenzhong Duan
  2025-05-21 11:14 ` [PATCH rfcv3 11/21] intel_iommu: Introduce two helpers vtd_as_from/to_iommu_pasid_locked Zhenzhong Duan
                   ` (11 subsequent siblings)
  21 siblings, 0 replies; 63+ messages in thread
From: Zhenzhong Duan @ 2025-05-21 11:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan, Marcel Apfelbaum, Paolo Bonzini,
	Richard Henderson, Eduardo Habkost

Introduce a new structure, VTDHostIOMMUDevice, which replaces
HostIOMMUDevice as the value stored in the hash table.

It holds a reference to the HostIOMMUDevice and the IntelIOMMUState, and
also carries the BDF information which will be used in future patches.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/i386/intel_iommu_internal.h |  7 +++++++
 include/hw/i386/intel_iommu.h  |  2 +-
 hw/i386/intel_iommu.c          | 14 ++++++++++++--
 3 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 2cda744786..18bc22fc72 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -28,6 +28,7 @@
 #ifndef HW_I386_INTEL_IOMMU_INTERNAL_H
 #define HW_I386_INTEL_IOMMU_INTERNAL_H
 #include "hw/i386/intel_iommu.h"
+#include "system/host_iommu_device.h"
 
 /*
  * Intel IOMMU register specification
@@ -608,4 +609,10 @@ typedef struct VTDRootEntry VTDRootEntry;
 /* Bits to decide the offset for each level */
 #define VTD_LEVEL_BITS           9
 
+typedef struct VTDHostIOMMUDevice {
+    IntelIOMMUState *iommu_state;
+    PCIBus *bus;
+    uint8_t devfn;
+    HostIOMMUDevice *hiod;
+} VTDHostIOMMUDevice;
 #endif
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index e95477e855..50f9b27a45 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -295,7 +295,7 @@ struct IntelIOMMUState {
     /* list of registered notifiers */
     QLIST_HEAD(, VTDAddressSpace) vtd_as_with_notifiers;
 
-    GHashTable *vtd_host_iommu_dev;             /* HostIOMMUDevice */
+    GHashTable *vtd_host_iommu_dev;             /* VTDHostIOMMUDevice */
 
     /* interrupt remapping */
     bool intr_enabled;              /* Whether guest enabled IR */
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index dc839037cf..b2ea109c7c 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -281,7 +281,10 @@ static gboolean vtd_hiod_equal(gconstpointer v1, gconstpointer v2)
 
 static void vtd_hiod_destroy(gpointer v)
 {
-    object_unref(v);
+    VTDHostIOMMUDevice *vtd_hiod = v;
+
+    object_unref(vtd_hiod->hiod);
+    g_free(vtd_hiod);
 }
 
 static gboolean vtd_hash_remove_by_domain(gpointer key, gpointer value,
@@ -4405,6 +4408,7 @@ static bool vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
                                      HostIOMMUDevice *hiod, Error **errp)
 {
     IntelIOMMUState *s = opaque;
+    VTDHostIOMMUDevice *vtd_hiod;
     struct vtd_as_key key = {
         .bus = bus,
         .devfn = devfn,
@@ -4421,6 +4425,12 @@ static bool vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
         return false;
     }
 
+    vtd_hiod = g_malloc0(sizeof(VTDHostIOMMUDevice));
+    vtd_hiod->bus = bus;
+    vtd_hiod->devfn = (uint8_t)devfn;
+    vtd_hiod->iommu_state = s;
+    vtd_hiod->hiod = hiod;
+
     if (!vtd_check_hiod(s, hiod, errp)) {
         vtd_iommu_unlock(s);
         return false;
@@ -4431,7 +4441,7 @@ static bool vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
     new_key->devfn = devfn;
 
     object_ref(hiod);
-    g_hash_table_insert(s->vtd_host_iommu_dev, new_key, hiod);
+    g_hash_table_insert(s->vtd_host_iommu_dev, new_key, vtd_hiod);
 
     vtd_iommu_unlock(s);
 
-- 
2.34.1




* [PATCH rfcv3 11/21] intel_iommu: Introduce two helpers vtd_as_from/to_iommu_pasid_locked
  2025-05-21 11:14 [PATCH rfcv3 00/21] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
                   ` (9 preceding siblings ...)
  2025-05-21 11:14 ` [PATCH rfcv3 10/21] intel_iommu: Introduce a new structure VTDHostIOMMUDevice Zhenzhong Duan
@ 2025-05-21 11:14 ` Zhenzhong Duan
  2025-05-21 11:14 ` [PATCH rfcv3 12/21] intel_iommu: Handle PASID entry removing and updating Zhenzhong Duan
                   ` (10 subsequent siblings)
  21 siblings, 0 replies; 63+ messages in thread
From: Zhenzhong Duan @ 2025-05-21 11:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan, Paolo Bonzini, Richard Henderson, Eduardo Habkost,
	Marcel Apfelbaum

We already have vtd_find_add_as() to find an AS from BDF+pasid, but that
pasid is passed from the PCI subsystem. A PCI device supports two request
types: Requests-without-PASID and Requests-with-PASID. A Request-without-PASID
doesn't include a PASID TLP prefix; the IOMMU fetches rid_pasid from the
context entry and uses it as the IOMMU's pasid to index the pasid table.

So we need to translate between PCI's pasid and the IOMMU's pasid,
specifically for Requests-without-PASID, e.g., PCI_NO_PASID(-1) <-> rid_pasid.
For Requests-with-PASID, PCI's pasid and the IOMMU's pasid have the same
value.

vtd_as_from_iommu_pasid_locked() translates from BDF+iommu_pasid to the
vtd_as which contains PCI's pasid in vtd_as->pasid.

vtd_as_to_iommu_pasid_locked() translates from BDF+vtd_as->pasid to iommu_pasid.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/i386/intel_iommu.c | 50 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 50 insertions(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index b2ea109c7c..a9c0bd5021 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -1617,6 +1617,56 @@ static int vtd_as_to_context_entry(VTDAddressSpace *vtd_as, VTDContextEntry *ce)
     }
 }
 
+static inline int vtd_as_to_iommu_pasid_locked(VTDAddressSpace *vtd_as,
+                                               uint32_t *pasid)
+{
+    VTDContextEntry ce;
+    int ret;
+
+    ret = vtd_as_to_context_entry(vtd_as, &ce);
+    if (ret) {
+        return ret;
+    }
+
+    /* Translate to iommu pasid if PCI_NO_PASID */
+    if (vtd_as->pasid == PCI_NO_PASID) {
+        *pasid = VTD_CE_GET_RID2PASID(&ce);
+    } else {
+        *pasid = vtd_as->pasid;
+    }
+
+    return 0;
+}
+
+static gboolean vtd_find_as_by_sid_and_iommu_pasid(gpointer key, gpointer value,
+                                                   gpointer user_data)
+{
+    VTDAddressSpace *vtd_as = (VTDAddressSpace *)value;
+    struct vtd_as_raw_key *target = (struct vtd_as_raw_key *)user_data;
+    uint16_t sid = PCI_BUILD_BDF(pci_bus_num(vtd_as->bus), vtd_as->devfn);
+    uint32_t pasid;
+
+    if (vtd_as_to_iommu_pasid_locked(vtd_as, &pasid)) {
+        return false;
+    }
+
+    return (pasid == target->pasid) && (sid == target->sid);
+}
+
+/* Translate iommu pasid to vtd_as */
+static inline
+VTDAddressSpace *vtd_as_from_iommu_pasid_locked(IntelIOMMUState *s,
+                                                uint16_t sid, uint32_t pasid)
+{
+    struct vtd_as_raw_key key = {
+        .sid = sid,
+        .pasid = pasid
+    };
+
+    return g_hash_table_find(s->vtd_address_spaces,
+                             vtd_find_as_by_sid_and_iommu_pasid, &key);
+}
+
 static int vtd_sync_shadow_page_hook(const IOMMUTLBEvent *event,
                                      void *private)
 {
-- 
2.34.1




* [PATCH rfcv3 12/21] intel_iommu: Handle PASID entry removing and updating
  2025-05-21 11:14 [PATCH rfcv3 00/21] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
                   ` (10 preceding siblings ...)
  2025-05-21 11:14 ` [PATCH rfcv3 11/21] intel_iommu: Introduce two helpers vtd_as_from/to_iommu_pasid_locked Zhenzhong Duan
@ 2025-05-21 11:14 ` Zhenzhong Duan
  2025-05-21 11:14 ` [PATCH rfcv3 13/21] intel_iommu: Handle PASID entry adding Zhenzhong Duan
                   ` (9 subsequent siblings)
  21 siblings, 0 replies; 63+ messages in thread
From: Zhenzhong Duan @ 2025-05-21 11:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan, Yi Sun, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, Marcel Apfelbaum

This adds a new entry, VTDPASIDCacheEntry, in VTDAddressSpace to cache the
pasid entry and track PASID usage, and to prepare for future PASID-tagged
DMA address translation support in the vIOMMU.

The VTDAddressSpace of PCI_NO_PASID is allocated when the device is plugged
and never freed. For other pasids, a VTDAddressSpace instance is
created/destroyed as the guest pasid entry is set up/destroyed for
passthrough devices, while for emulated devices a VTDAddressSpace instance is
created during PASID-tagged DMA translation and destroyed on guest PASID
cache invalidation. This patch focuses on PASID cache management for
passthrough devices, as there are no PASID-capable emulated devices yet.

When the guest modifies a PASID entry, QEMU captures the guest pasid-selective
pasid cache invalidation and allocates or removes a VTDAddressSpace instance
according to the invalidation reason:

    a) a present pasid entry moved to non-present
    b) a present pasid entry changed to another present entry
    c) a non-present pasid entry moved to present

This patch handles a) and b); a following patch will handle c).

The vIOMMU emulator figures out the reason by fetching the latest guest pasid
entry and comparing it with the PASID cache.

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/i386/intel_iommu_internal.h |  26 ++++
 include/hw/i386/intel_iommu.h  |   6 +
 hw/i386/intel_iommu.c          | 252 +++++++++++++++++++++++++++++++--
 hw/i386/trace-events           |   3 +
 4 files changed, 277 insertions(+), 10 deletions(-)

diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 18bc22fc72..82b84db80f 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -315,6 +315,7 @@ typedef enum VTDFaultReason {
                                   * request while disabled */
     VTD_FR_IR_SID_ERR = 0x26,   /* Invalid Source-ID */
 
+    VTD_FR_RTADDR_INV_TTM = 0x31,  /* Invalid TTM in RTADDR */
     /* PASID directory entry access failure */
     VTD_FR_PASID_DIR_ACCESS_ERR = 0x50,
     /* The Present(P) field of pasid directory entry is 0 */
@@ -492,6 +493,15 @@ typedef union VTDInvDesc VTDInvDesc;
 #define VTD_INV_DESC_PIOTLB_RSVD_VAL0     0xfff000000000f1c0ULL
 #define VTD_INV_DESC_PIOTLB_RSVD_VAL1     0xf80ULL
 
+#define VTD_INV_DESC_PASIDC_G          (3ULL << 4)
+#define VTD_INV_DESC_PASIDC_PASID(val) (((val) >> 32) & 0xfffffULL)
+#define VTD_INV_DESC_PASIDC_DID(val)   (((val) >> 16) & VTD_DOMAIN_ID_MASK)
+#define VTD_INV_DESC_PASIDC_RSVD_VAL0  0xfff000000000f1c0ULL
+
+#define VTD_INV_DESC_PASIDC_DSI        (0ULL << 4)
+#define VTD_INV_DESC_PASIDC_PASID_SI   (1ULL << 4)
+#define VTD_INV_DESC_PASIDC_GLOBAL     (3ULL << 4)
+
 /* Information about page-selective IOTLB invalidate */
 struct VTDIOTLBPageInvInfo {
     uint16_t domain_id;
@@ -552,6 +562,21 @@ typedef struct VTDRootEntry VTDRootEntry;
 #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw)  (0x1e0ULL | ~VTD_HAW_MASK(aw))
 #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1      0xffffffffffe00000ULL
 
+typedef enum VTDPCInvType {
+    /* pasid cache invalidation rely on guest PASID entry */
+    VTD_PASID_CACHE_GLOBAL_INV, /* pasid cache global invalidation */
+    VTD_PASID_CACHE_DOMSI,      /* pasid cache domain selective invalidation */
+    VTD_PASID_CACHE_PASIDSI,    /* pasid cache pasid selective invalidation */
+} VTDPCInvType;
+
+typedef struct VTDPASIDCacheInfo {
+    VTDPCInvType type;
+    uint16_t domain_id;
+    uint32_t pasid;
+    PCIBus *bus;
+    uint16_t devfn;
+} VTDPASIDCacheInfo;
+
 /* PASID Table Related Definitions */
 #define VTD_PASID_DIR_BASE_ADDR_MASK  (~0xfffULL)
 #define VTD_PASID_TABLE_BASE_ADDR_MASK (~0xfffULL)
@@ -563,6 +588,7 @@ typedef struct VTDRootEntry VTDRootEntry;
 #define VTD_PASID_TABLE_BITS_MASK     (0x3fULL)
 #define VTD_PASID_TABLE_INDEX(pasid)  ((pasid) & VTD_PASID_TABLE_BITS_MASK)
 #define VTD_PASID_ENTRY_FPD           (1ULL << 1) /* Fault Processing Disable */
+#define VTD_PASID_TBL_ENTRY_NUM       (1ULL << 6)
 
 /* PASID Granular Translation Type Mask */
 #define VTD_PASID_ENTRY_P              1ULL
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 50f9b27a45..fbc9da903a 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -95,6 +95,11 @@ struct VTDPASIDEntry {
     uint64_t val[8];
 };
 
+typedef struct VTDPASIDCacheEntry {
+    struct VTDPASIDEntry pasid_entry;
+    bool cache_filled;
+} VTDPASIDCacheEntry;
+
 struct VTDAddressSpace {
     PCIBus *bus;
     uint8_t devfn;
@@ -107,6 +112,7 @@ struct VTDAddressSpace {
     MemoryRegion iommu_ir_fault; /* Interrupt region for catching fault */
     IntelIOMMUState *iommu_state;
     VTDContextCacheEntry context_cache_entry;
+    VTDPASIDCacheEntry pasid_cache_entry;
     QLIST_ENTRY(VTDAddressSpace) next;
     /* Superset of notifier flags that this address space has */
     IOMMUNotifierFlag notifier_flags;
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index a9c0bd5021..0c6587735e 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -825,6 +825,11 @@ static inline bool vtd_pe_type_check(IntelIOMMUState *s, VTDPASIDEntry *pe)
     }
 }
 
+static inline uint16_t vtd_pe_get_did(VTDPASIDEntry *pe)
+{
+    return VTD_SM_PASID_ENTRY_DID((pe)->val[1]);
+}
+
 static inline bool vtd_pdire_present(VTDPASIDDirEntry *pdire)
 {
     return pdire->val & 1;
@@ -3104,6 +3109,236 @@ static bool vtd_process_piotlb_desc(IntelIOMMUState *s,
     return true;
 }
 
+static inline int vtd_dev_get_pe_from_pasid(VTDAddressSpace *vtd_as,
+                                            uint32_t pasid, VTDPASIDEntry *pe)
+{
+    IntelIOMMUState *s = vtd_as->iommu_state;
+    VTDContextEntry ce;
+    int ret;
+
+    if (!s->root_scalable) {
+        return -VTD_FR_RTADDR_INV_TTM;
+    }
+
+    ret = vtd_as_to_context_entry(vtd_as, &ce);
+    if (ret) {
+        return ret;
+    }
+
+    return vtd_ce_get_pasid_entry(s, &ce, pe, pasid);
+}
+
+static bool vtd_pasid_entry_compare(VTDPASIDEntry *p1, VTDPASIDEntry *p2)
+{
+    return !memcmp(p1, p2, sizeof(*p1));
+}
+
+/*
+ * This function fills in the pasid entry cached in vtd_as. The
+ * caller of this function should hold iommu_lock.
+ */
+static void vtd_fill_pe_in_cache(IntelIOMMUState *s, VTDAddressSpace *vtd_as,
+                                 VTDPASIDEntry *pe)
+{
+    VTDPASIDCacheEntry *pc_entry = &vtd_as->pasid_cache_entry;
+
+    if (vtd_pasid_entry_compare(pe, &pc_entry->pasid_entry)) {
+        /* No need to go further as the cached pasid entry is up to date */
+        return;
+    }
+
+    pc_entry->pasid_entry = *pe;
+    pc_entry->cache_filled = true;
+    /*
+     * TODO: send pasid bind to host for passthru devices
+     */
+}
+
+/*
+ * This function is used to clear cached pasid entry in vtd_as
+ * instances. Caller of this function should hold iommu_lock.
+ */
+static gboolean vtd_flush_pasid(gpointer key, gpointer value,
+                                gpointer user_data)
+{
+    VTDPASIDCacheInfo *pc_info = user_data;
+    VTDAddressSpace *vtd_as = value;
+    IntelIOMMUState *s = vtd_as->iommu_state;
+    VTDPASIDCacheEntry *pc_entry = &vtd_as->pasid_cache_entry;
+    VTDPASIDEntry pe;
+    uint16_t did;
+    uint32_t pasid;
+    int ret;
+
+    /* Only replay a filled pasid cache entry for a passthrough device */
+    if (!pc_entry->cache_filled) {
+        return false;
+    }
+    did = vtd_pe_get_did(&pc_entry->pasid_entry);
+
+    if (vtd_as_to_iommu_pasid_locked(vtd_as, &pasid)) {
+        goto remove;
+    }
+
+    switch (pc_info->type) {
+    case VTD_PASID_CACHE_PASIDSI:
+        if (pc_info->pasid != pasid) {
+            return false;
+        }
+        /* Fall through */
+    case VTD_PASID_CACHE_DOMSI:
+        if (pc_info->domain_id != did) {
+            return false;
+        }
+        /* Fall through */
+    case VTD_PASID_CACHE_GLOBAL_INV:
+        break;
+    default:
+        error_report("invalid pc_info->type");
+        abort();
+    }
+
+    /*
+     * A pasid cache invalidation may indicate a modification from
+     * one present pasid entry to another present entry. To cover
+     * such a case, the vIOMMU emulator needs to fetch the latest
+     * guest pasid entry, compare it with the cached one, then update
+     * the pasid cache and send pasid bind/unbind to the host.
+     */
+    ret = vtd_dev_get_pe_from_pasid(vtd_as, pasid, &pe);
+    if (ret) {
+        /*
+         * No valid pasid entry in guest memory, e.g. the pasid entry
+         * was modified to be either all-zero or non-present. Either
+         * case means the existing pasid cache entry should be removed.
+         */
+        goto remove;
+    }
+
+    vtd_fill_pe_in_cache(s, vtd_as, &pe);
+    return false;
+
+remove:
+    /*
+     * TODO: send pasid unbind to host for passthru devices
+     */
+    pc_entry->cache_filled = false;
+
+    /*
+     * Don't remove the address space of PCI_NO_PASID, which is created
+     * by the PCI subsystem.
+     */
+    if (vtd_as->pasid == PCI_NO_PASID) {
+        return false;
+    }
+    return true;
+}
+
+/*
+ * This function syncs the pasid bindings between guest and host.
+ * It includes updating the pasid cache in the vIOMMU and updating
+ * the pasid bindings per the guest's latest pasid entry presence.
+ */
+static void vtd_pasid_cache_sync(IntelIOMMUState *s,
+                                 VTDPASIDCacheInfo *pc_info)
+{
+    if (!s->flts || !s->root_scalable || !s->dmar_enabled) {
+        return;
+    }
+
+    /*
+     * Regarding a pasid cache invalidation, e.g. a PSI,
+     * it could be any of the cases below:
+     * a) a present pasid entry moved to non-present
+     * b) a present pasid entry modified to another present entry
+     * c) a non-present pasid entry moved to present
+     *
+     * Different invalidation granularities may affect different device
+     * and pasid scopes. But for each invalidation granularity, two
+     * steps are needed to sync the host and guest pasid bindings.
+     *
+     * Here is the handling of a PSI:
+     * 1) loop all the existing vtd_as instances to update them
+     *    according to the latest guest pasid entry in the pasid table.
+     *    This makes sure affected existing vtd_as instances cache
+     *    the latest pasid entries. Also, during the loop, the host
+     *    should be notified if needed, e.g. for a pasid unbind or
+     *    update. This should cover cases a) and b).
+     *
+     * 2) loop all devices to cover case c)
+     *    - For devices which are backed by HostIOMMUDeviceIOMMUFD instances,
+     *      we loop them and check if guest pasid entry exists. If yes,
+     *      it is case c), we update the pasid cache and also notify
+     *      host.
+     *    - For devices which are not backed by HostIOMMUDeviceIOMMUFD,
+     *      it is not necessary to create pasid cache at this phase since
+     *      it could be created when vIOMMU does DMA address translation.
+     *      This is not yet implemented since there is no emulated
+     *      pasid-capable devices today. If we have such devices in
+     *      future, the pasid cache shall be created there.
+     * Other granularities follow the same steps, just with different scopes.
+     *
+     */
+
+    vtd_iommu_lock(s);
+    /*
+     * Step 1: loop all the existing vtd_as instances for pasid unbind and
+     * update.
+     */
+    g_hash_table_foreach_remove(s->vtd_address_spaces, vtd_flush_pasid,
+                                pc_info);
+    vtd_iommu_unlock(s);
+
+    /* TODO: Step 2: loop all the existing vtd_hiod instances for pasid bind. */
+}
+
+static bool vtd_process_pasid_desc(IntelIOMMUState *s,
+                                   VTDInvDesc *inv_desc)
+{
+    uint16_t domain_id;
+    uint32_t pasid;
+    VTDPASIDCacheInfo pc_info;
+    uint64_t mask[4] = {VTD_INV_DESC_PASIDC_RSVD_VAL0, VTD_INV_DESC_ALL_ONE,
+                        VTD_INV_DESC_ALL_ONE, VTD_INV_DESC_ALL_ONE};
+
+    if (!vtd_inv_desc_reserved_check(s, inv_desc, mask, true,
+                                     __func__, "pasid cache inv")) {
+        return false;
+    }
+
+    domain_id = VTD_INV_DESC_PASIDC_DID(inv_desc->val[0]);
+    pasid = VTD_INV_DESC_PASIDC_PASID(inv_desc->val[0]);
+
+    switch (inv_desc->val[0] & VTD_INV_DESC_PASIDC_G) {
+    case VTD_INV_DESC_PASIDC_DSI:
+        trace_vtd_pasid_cache_dsi(domain_id);
+        pc_info.type = VTD_PASID_CACHE_DOMSI;
+        pc_info.domain_id = domain_id;
+        break;
+
+    case VTD_INV_DESC_PASIDC_PASID_SI:
+        /* PASID selective implies a DID selective */
+        trace_vtd_pasid_cache_psi(domain_id, pasid);
+        pc_info.type = VTD_PASID_CACHE_PASIDSI;
+        pc_info.domain_id = domain_id;
+        pc_info.pasid = pasid;
+        break;
+
+    case VTD_INV_DESC_PASIDC_GLOBAL:
+        trace_vtd_pasid_cache_gsi();
+        pc_info.type = VTD_PASID_CACHE_GLOBAL_INV;
+        break;
+
+    default:
+        error_report_once("invalid-inv-granu-in-pc_inv_desc hi: 0x%" PRIx64
+                  " lo: 0x%" PRIx64, inv_desc->val[1], inv_desc->val[0]);
+        return false;
+    }
+
+    vtd_pasid_cache_sync(s, &pc_info);
+    return true;
+}
+
 static bool vtd_process_inv_iec_desc(IntelIOMMUState *s,
                                      VTDInvDesc *inv_desc)
 {
@@ -3265,6 +3500,13 @@ static bool vtd_process_inv_desc(IntelIOMMUState *s)
         }
         break;
 
+    case VTD_INV_DESC_PC:
+        trace_vtd_inv_desc("pasid-cache", inv_desc.val[1], inv_desc.val[0]);
+        if (!vtd_process_pasid_desc(s, &inv_desc)) {
+            return false;
+        }
+        break;
+
     case VTD_INV_DESC_PIOTLB:
         trace_vtd_inv_desc("p-iotlb", inv_desc.val[1], inv_desc.val[0]);
         if (!vtd_process_piotlb_desc(s, &inv_desc)) {
@@ -3300,16 +3542,6 @@ static bool vtd_process_inv_desc(IntelIOMMUState *s)
         }
         break;
 
-    /*
-     * TODO: the entity of below two cases will be implemented in future series.
-     * To make guest (which integrates scalable mode support patch set in
-     * iommu driver) work, just return true is enough so far.
-     */
-    case VTD_INV_DESC_PC:
-        if (s->scalable_mode) {
-            break;
-        }
-    /* fallthrough */
     default:
         error_report_once("%s: invalid inv desc: hi=%"PRIx64", lo=%"PRIx64
                           " (unknown type)", __func__, inv_desc.hi,
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index ac9e1a10aa..ae5bbfcdc0 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -24,6 +24,9 @@ vtd_inv_qi_head(uint16_t head) "read head %d"
 vtd_inv_qi_tail(uint16_t head) "write tail %d"
 vtd_inv_qi_fetch(void) ""
 vtd_context_cache_reset(void) ""
+vtd_pasid_cache_gsi(void) ""
+vtd_pasid_cache_dsi(uint16_t domain) "Domain selective PC invalidation domain 0x%"PRIx16
+vtd_pasid_cache_psi(uint16_t domain, uint32_t pasid) "PASID selective PC invalidation domain 0x%"PRIx16" pasid 0x%"PRIx32
 vtd_re_not_present(uint8_t bus) "Root entry bus %"PRIu8" not present"
 vtd_ce_not_present(uint8_t bus, uint8_t devfn) "Context entry bus %"PRIu8" devfn %"PRIu8" not present"
 vtd_iotlb_page_hit(uint16_t sid, uint64_t addr, uint64_t slpte, uint16_t domain) "IOTLB page hit sid 0x%"PRIx16" iova 0x%"PRIx64" slpte 0x%"PRIx64" domain 0x%"PRIx16
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH rfcv3 13/21] intel_iommu: Handle PASID entry adding
  2025-05-21 11:14 [PATCH rfcv3 00/21] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
                   ` (11 preceding siblings ...)
  2025-05-21 11:14 ` [PATCH rfcv3 12/21] intel_iommu: Handle PASID entry removing and updating Zhenzhong Duan
@ 2025-05-21 11:14 ` Zhenzhong Duan
  2025-05-21 11:14 ` [PATCH rfcv3 14/21] intel_iommu: Introduce a new pasid cache invalidation type FORCE_RESET Zhenzhong Duan
                   ` (8 subsequent siblings)
  21 siblings, 0 replies; 63+ messages in thread
From: Zhenzhong Duan @ 2025-05-21 11:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan, Yi Sun, Marcel Apfelbaum, Paolo Bonzini,
	Richard Henderson, Eduardo Habkost

When the guest modifies a PASID entry, QEMU captures the guest pasid-selective
pasid cache invalidation and allocates or removes a VTDAddressSpace instance
per the invalidation reason:

    a) a present pasid entry moved to non-present
    b) a present pasid entry modified to another present entry
    c) a non-present pasid entry moved to present

This patch handles case c).

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/i386/intel_iommu_internal.h |   1 +
 hw/i386/intel_iommu.c          | 167 ++++++++++++++++++++++++++++++++-
 2 files changed, 167 insertions(+), 1 deletion(-)

diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 82b84db80f..4f6d9e9036 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -558,6 +558,7 @@ typedef struct VTDRootEntry VTDRootEntry;
 #define VTD_CTX_ENTRY_LEGACY_SIZE     16
 #define VTD_CTX_ENTRY_SCALABLE_SIZE   32
 
+#define VTD_SM_CONTEXT_ENTRY_PDTS(val)      (((val) >> 9) & 0x7)
 #define VTD_SM_CONTEXT_ENTRY_RID2PASID_MASK 0xfffff
 #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw)  (0x1e0ULL | ~VTD_HAW_MASK(aw))
 #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1      0xffffffffffe00000ULL
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 0c6587735e..8d9076216c 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -825,6 +825,11 @@ static inline bool vtd_pe_type_check(IntelIOMMUState *s, VTDPASIDEntry *pe)
     }
 }
 
+static inline uint32_t vtd_sm_ce_get_pdt_entry_num(VTDContextEntry *ce)
+{
+    return 1U << (VTD_SM_CONTEXT_ENTRY_PDTS(ce->val[0]) + 7);
+}
+
 static inline uint16_t vtd_pe_get_did(VTDPASIDEntry *pe)
 {
     return VTD_SM_PASID_ENTRY_DID((pe)->val[1]);
@@ -3234,6 +3239,157 @@ remove:
     return true;
 }
 
+static void vtd_sm_pasid_table_walk_one(IntelIOMMUState *s,
+                                        dma_addr_t pt_base,
+                                        int start,
+                                        int end,
+                                        VTDPASIDCacheInfo *info)
+{
+    VTDPASIDEntry pe;
+    int pasid = start;
+    int pasid_next;
+
+    while (pasid < end) {
+        pasid_next = pasid + 1;
+
+        if (!vtd_get_pe_in_pasid_leaf_table(s, pasid, pt_base, &pe)
+            && vtd_pe_present(&pe)) {
+            int bus_n = pci_bus_num(info->bus), devfn = info->devfn;
+            uint16_t sid = PCI_BUILD_BDF(bus_n, devfn);
+            VTDAddressSpace *vtd_as;
+
+            vtd_iommu_lock(s);
+            /*
+             * When indexed by rid2pasid, vtd_as should already have
+             * been created, e.g., by the PCI subsystem. For any other
+             * iommu pasid, vtd_as is created dynamically; that pasid
+             * is the same as PCI's pasid, so it is passed to
+             * vtd_find_add_as().
+             */
+            vtd_as = vtd_as_from_iommu_pasid_locked(s, sid, pasid);
+            if (!vtd_as) {
+                vtd_iommu_unlock(s);
+                vtd_as = vtd_find_add_as(s, info->bus, devfn, pasid);
+            }
+            vtd_iommu_unlock(s);
+
+            if ((info->type == VTD_PASID_CACHE_DOMSI ||
+                 info->type == VTD_PASID_CACHE_PASIDSI) &&
+                !(info->domain_id == vtd_pe_get_did(&pe))) {
+                /*
+                 * VTD_PASID_CACHE_DOMSI and VTD_PASID_CACHE_PASIDSI
+                 * require a domain ID check. If the check fails,
+                 * go to the next pasid.
+                 */
+                pasid = pasid_next;
+                continue;
+            }
+            vtd_fill_pe_in_cache(s, vtd_as, &pe);
+        }
+        pasid = pasid_next;
+    }
+}
+
+/*
+ * Currently, the VT-d scalable mode pasid table is a two-level table.
+ * This function loops over a range of PASIDs in a given pasid
+ * directory to identify the pasid configuration in the guest.
+ */
+static void vtd_sm_pasid_table_walk(IntelIOMMUState *s,
+                                    dma_addr_t pdt_base,
+                                    int start,
+                                    int end,
+                                    VTDPASIDCacheInfo *info)
+{
+    VTDPASIDDirEntry pdire;
+    int pasid = start;
+    int pasid_next;
+    dma_addr_t pt_base;
+
+    while (pasid < end) {
+        pasid_next = ((end - pasid) > VTD_PASID_TBL_ENTRY_NUM) ?
+                      (pasid + VTD_PASID_TBL_ENTRY_NUM) : end;
+        if (!vtd_get_pdire_from_pdir_table(pdt_base, pasid, &pdire)
+            && vtd_pdire_present(&pdire)) {
+            pt_base = pdire.val & VTD_PASID_TABLE_BASE_ADDR_MASK;
+            vtd_sm_pasid_table_walk_one(s, pt_base, pasid, pasid_next, info);
+        }
+        pasid = pasid_next;
+    }
+}
+
+static void vtd_replay_pasid_bind_for_dev(IntelIOMMUState *s,
+                                          int start, int end,
+                                          VTDPASIDCacheInfo *info)
+{
+    VTDContextEntry ce;
+    VTDAddressSpace *vtd_as;
+
+    vtd_as = vtd_find_add_as(s, info->bus, info->devfn, PCI_NO_PASID);
+
+    if (!vtd_as_to_context_entry(vtd_as, &ce)) {
+        uint32_t max_pasid;
+
+        max_pasid = vtd_sm_ce_get_pdt_entry_num(&ce) * VTD_PASID_TBL_ENTRY_NUM;
+        if (end > max_pasid) {
+            end = max_pasid;
+        }
+        vtd_sm_pasid_table_walk(s,
+                                VTD_CE_GET_PASID_DIR_TABLE(&ce),
+                                start,
+                                end,
+                                info);
+    }
+}
+
+/*
+ * This function replays the guest pasid bindings to the host by
+ * walking the guest PASID table. This ensures the host has the
+ * latest guest pasid bindings.
+ */
+static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
+                                            VTDPASIDCacheInfo *pc_info)
+{
+    VTDHostIOMMUDevice *vtd_hiod;
+    int start = 0, end = 1; /* only rid2pasid is supported */
+    VTDPASIDCacheInfo walk_info;
+    GHashTableIter as_it;
+
+    switch (pc_info->type) {
+    case VTD_PASID_CACHE_PASIDSI:
+        start = pc_info->pasid;
+        end = pc_info->pasid + 1;
+        /*
+         * PASID selective invalidation is within domain,
+         * thus fall through.
+         */
+    case VTD_PASID_CACHE_DOMSI:
+    case VTD_PASID_CACHE_GLOBAL_INV:
+        /* loop all assigned devices */
+        break;
+    default:
+        error_report("invalid pc_info->type for replay");
+        abort();
+    }
+
+    /*
+     * This replay only needs to care about the devices which are
+     * backed by a host IOMMU; for such devices, their vtd_hiod
+     * instances are in s->vtd_host_iommu_dev. For devices which are
+     * not backed by a host IOMMU, it is not necessary to replay the
+     * bindings since their cache can be re-created during future
+     * DMA address translation. Access to vtd_host_iommu_dev is
+     * already protected by the BQL, so no iommu lock is needed here.
+     */
+    walk_info = *pc_info;
+    g_hash_table_iter_init(&as_it, s->vtd_host_iommu_dev);
+    while (g_hash_table_iter_next(&as_it, NULL, (void **)&vtd_hiod)) {
+        /* bus|devfn fields are not identical with pc_info */
+        walk_info.bus = vtd_hiod->bus;
+        walk_info.devfn = vtd_hiod->devfn;
+        vtd_replay_pasid_bind_for_dev(s, start, end, &walk_info);
+    }
+}
+
 /*
  * This function syncs the pasid bindings between guest and host.
  * It includes updating the pasid cache in vIOMMU and updating the
@@ -3289,7 +3445,16 @@ static void vtd_pasid_cache_sync(IntelIOMMUState *s,
                                 pc_info);
     vtd_iommu_unlock(s);
 
-    /* TODO: Step 2: loop all the existing vtd_hiod instances for pasid bind. */
+    /*
+     * Step 2: loop all the existing vtd_hiod instances for pasid bind.
+     * Ideally, all devices would need to be looped to find any new
+     * PASID binding related to the PASID cache invalidation request.
+     * But it is enough to loop the devices which are backed by a host
+     * IOMMU. For devices backed by the vIOMMU (a.k.a. emulated
+     * devices), if a new PASID appears on them, their vtd_as instances
+     * can be created during future vIOMMU DMA translation.
+     */
+    vtd_replay_guest_pasid_bindings(s, pc_info);
 }
 
 static bool vtd_process_pasid_desc(IntelIOMMUState *s,
-- 
2.34.1




* [PATCH rfcv3 14/21] intel_iommu: Introduce a new pasid cache invalidation type FORCE_RESET
  2025-05-21 11:14 [PATCH rfcv3 00/21] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
                   ` (12 preceding siblings ...)
  2025-05-21 11:14 ` [PATCH rfcv3 13/21] intel_iommu: Handle PASID entry adding Zhenzhong Duan
@ 2025-05-21 11:14 ` Zhenzhong Duan
  2025-05-21 11:14 ` [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to host Zhenzhong Duan
                   ` (7 subsequent siblings)
  21 siblings, 0 replies; 63+ messages in thread
From: Zhenzhong Duan @ 2025-05-21 11:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan, Marcel Apfelbaum, Paolo Bonzini,
	Richard Henderson, Eduardo Habkost

FORCE_RESET differs from GLOBAL_INV: while GLOBAL_INV updates the pasid
cache if the underlying pasid entry is still valid, FORCE_RESET drops all
the pasid caches.

FORCE_RESET isn't a VT-d spec defined invalidation type for the pasid
cache; it is only used internally during system level reset.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/i386/intel_iommu_internal.h |  2 ++
 hw/i386/intel_iommu.c          | 28 ++++++++++++++++++++++++++++
 hw/i386/trace-events           |  1 +
 3 files changed, 31 insertions(+)

diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 4f6d9e9036..5e5583d94a 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -564,6 +564,8 @@ typedef struct VTDRootEntry VTDRootEntry;
 #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1      0xffffffffffe00000ULL
 
 typedef enum VTDPCInvType {
+    /* Force reset all */
+    VTD_PASID_CACHE_FORCE_RESET = 0,
     /* pasid cache invalidation rely on guest PASID entry */
     VTD_PASID_CACHE_GLOBAL_INV, /* pasid cache global invalidation */
     VTD_PASID_CACHE_DOMSI,      /* pasid cache domain selective invalidation */
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 8d9076216c..050b0d3ca2 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -86,6 +86,8 @@ struct vtd_iotlb_key {
 static void vtd_address_space_refresh_all(IntelIOMMUState *s);
 static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
 
+static void vtd_pasid_cache_reset_locked(IntelIOMMUState *s);
+
 static void vtd_panic_require_caching_mode(void)
 {
     error_report("We need to set caching-mode=on for intel-iommu to enable "
@@ -390,6 +392,7 @@ static void vtd_reset_caches(IntelIOMMUState *s)
     vtd_iommu_lock(s);
     vtd_reset_iotlb_locked(s);
     vtd_reset_context_cache_locked(s);
+    vtd_pasid_cache_reset_locked(s);
     vtd_iommu_unlock(s);
 }
 
@@ -3186,6 +3189,8 @@ static gboolean vtd_flush_pasid(gpointer key, gpointer value,
     }
 
     switch (pc_info->type) {
+    case VTD_PASID_CACHE_FORCE_RESET:
+        goto remove;
     case VTD_PASID_CACHE_PASIDSI:
         if (pc_info->pasid != pasid) {
             return false;
@@ -3239,6 +3244,26 @@ remove:
     return true;
 }
 
+/* Caller of this function should hold iommu_lock */
+static void vtd_pasid_cache_reset_locked(IntelIOMMUState *s)
+{
+    VTDPASIDCacheInfo pc_info;
+
+    trace_vtd_pasid_cache_reset();
+
+    pc_info.type = VTD_PASID_CACHE_FORCE_RESET;
+
+    /*
+     * Resetting the pasid cache is a big hammer, so use
+     * g_hash_table_foreach_remove which will free the vtd_as
+     * instances, and use VTD_PASID_CACHE_FORCE_RESET to ensure all
+     * the vtd_as instances are dropped. Meanwhile the change is
+     * passed to the host if HostIOMMUDeviceIOMMUFD is available.
+     */
+    g_hash_table_foreach_remove(s->vtd_address_spaces,
+                                vtd_flush_pasid, &pc_info);
+}
+
 static void vtd_sm_pasid_table_walk_one(IntelIOMMUState *s,
                                         dma_addr_t pt_base,
                                         int start,
@@ -3366,6 +3391,9 @@ static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
     case VTD_PASID_CACHE_GLOBAL_INV:
         /* loop all assigned devices */
         break;
+    case VTD_PASID_CACHE_FORCE_RESET:
+        /* For force reset, no need to replay any further */
+        return;
     default:
         error_report("invalid pc_info->type for replay");
         abort();
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index ae5bbfcdc0..c8a936eb46 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -24,6 +24,7 @@ vtd_inv_qi_head(uint16_t head) "read head %d"
 vtd_inv_qi_tail(uint16_t head) "write tail %d"
 vtd_inv_qi_fetch(void) ""
 vtd_context_cache_reset(void) ""
+vtd_pasid_cache_reset(void) ""
 vtd_pasid_cache_gsi(void) ""
 vtd_pasid_cache_dsi(uint16_t domain) "Domain selective PC invalidation domain 0x%"PRIx16
 vtd_pasid_cache_psi(uint16_t domain, uint32_t pasid) "PASID selective PC invalidation domain 0x%"PRIx16" pasid 0x%"PRIx32
-- 
2.34.1




* [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to host
  2025-05-21 11:14 [PATCH rfcv3 00/21] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
                   ` (13 preceding siblings ...)
  2025-05-21 11:14 ` [PATCH rfcv3 14/21] intel_iommu: Introduce a new pasid cache invalidation type FORCE_RESET Zhenzhong Duan
@ 2025-05-21 11:14 ` Zhenzhong Duan
  2025-05-21 22:49   ` Nicolin Chen
  2025-05-21 11:14 ` [PATCH rfcv3 16/21] intel_iommu: ERRATA_772415 workaround Zhenzhong Duan
                   ` (6 subsequent siblings)
  21 siblings, 1 reply; 63+ messages in thread
From: Zhenzhong Duan @ 2025-05-21 11:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan, Yi Sun, Marcel Apfelbaum, Paolo Bonzini,
	Richard Henderson, Eduardo Habkost

This patch captures guest PASID table entry modifications and propagates
the changes to the host to attach a hwpt whose type is determined by the
guest PGTT configuration.

When PGTT is Pass-through (100b), the hwpt on the host side is a stage-2
page table (GPA->HPA). When PGTT is First-stage Translation only (001b),
the hwpt on the host side is a nested page table.

The guest page table is configured as a stage-1 page table (gIOVA->GPA)
whose translation result further goes through the host VT-d stage-2
page table (GPA->HPA) under nested translation mode. This is the key
to supporting gIOVA over a stage-1 page table for Intel VT-d in a
virtualization environment.

A stage-2 page table can be shared by different devices if there is
no conflict and the devices link to the same iommufd object, i.e. devices
under the same host IOMMU can share the same stage-2 page table. If there
is a conflict, i.e. one device runs in non cache coherency mode while
the others do not, it requires a separate stage-2 page table in
non-CC mode.

See below example diagram:

      IntelIOMMUState
             |
             V
    .------------------.    .------------------.
    | VTDIOASContainer |--->| VTDIOASContainer |--->...
    |    (iommufd0)    |    |    (iommufd1)    |
    .------------------.    .------------------.
             |                       |
             |                       .-->...
             V
      .-------------------.    .-------------------.
      |   VTDS2Hwpt(CC)   |--->| VTDS2Hwpt(non-CC) |-->...
      .-------------------.    .-------------------.
          |            |               |
          |            |               |
    .-----------.  .-----------.  .------------.
    | IOMMUFD   |  | IOMMUFD   |  | IOMMUFD    |
    | Device(CC)|  | Device(CC)|  | Device     |
    | (iommufd0)|  | (iommufd0)|  | (non-CC)   |
    |           |  |           |  | (iommufd0) |
    .-----------.  .-----------.  .------------.

Co-Authored-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/i386/intel_iommu_internal.h |  11 +
 include/hw/i386/intel_iommu.h  |  24 ++
 hw/i386/intel_iommu.c          | 581 +++++++++++++++++++++++++++++++--
 hw/i386/trace-events           |   8 +
 4 files changed, 604 insertions(+), 20 deletions(-)

diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 5e5583d94a..e76f43bb8f 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -563,6 +563,13 @@ typedef struct VTDRootEntry VTDRootEntry;
 #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw)  (0x1e0ULL | ~VTD_HAW_MASK(aw))
 #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1      0xffffffffffe00000ULL
 
+typedef enum VTDPASIDOp {
+    VTD_PASID_BIND,
+    VTD_PASID_UPDATE,
+    VTD_PASID_UNBIND,
+    VTD_OP_NUM
+} VTDPASIDOp;
+
 typedef enum VTDPCInvType {
     /* Force reset all */
     VTD_PASID_CACHE_FORCE_RESET = 0,
@@ -578,6 +585,7 @@ typedef struct VTDPASIDCacheInfo {
     uint32_t pasid;
     PCIBus *bus;
     uint16_t devfn;
+    bool error_happened;
 } VTDPASIDCacheInfo;
 
 /* PASID Table Related Definitions */
@@ -606,6 +614,9 @@ typedef struct VTDPASIDCacheInfo {
 
 #define VTD_SM_PASID_ENTRY_FLPM          3ULL
 #define VTD_SM_PASID_ENTRY_FLPTPTR       (~0xfffULL)
+#define VTD_SM_PASID_ENTRY_SRE_BIT(val)  (!!((val) & 1ULL))
+#define VTD_SM_PASID_ENTRY_WPE_BIT(val)  (!!(((val) >> 4) & 1ULL))
+#define VTD_SM_PASID_ENTRY_EAFE_BIT(val) (!!(((val) >> 7) & 1ULL))
 
 /* First Level Paging Structure */
 /* Masks for First Level Paging Entry */
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index fbc9da903a..594281c1d3 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -100,10 +100,32 @@ typedef struct VTDPASIDCacheEntry {
     bool cache_filled;
 } VTDPASIDCacheEntry;
 
+typedef struct VTDIOASContainer {
+    struct IOMMUFDBackend *iommufd;
+    uint32_t ioas_id;
+    MemoryListener listener;
+    QLIST_HEAD(, VTDS2Hwpt) s2_hwpt_list;
+    QLIST_ENTRY(VTDIOASContainer) next;
+    Error *error;
+} VTDIOASContainer;
+
+typedef struct VTDS2Hwpt {
+    uint32_t users;
+    uint32_t hwpt_id;
+    VTDIOASContainer *container;
+    QLIST_ENTRY(VTDS2Hwpt) next;
+} VTDS2Hwpt;
+
+typedef struct VTDHwpt {
+    uint32_t hwpt_id;
+    VTDS2Hwpt *s2_hwpt;
+} VTDHwpt;
+
 struct VTDAddressSpace {
     PCIBus *bus;
     uint8_t devfn;
     uint32_t pasid;
+    VTDHwpt hwpt;
     AddressSpace as;
     IOMMUMemoryRegion iommu;
     MemoryRegion root;          /* The root container of the device */
@@ -303,6 +325,8 @@ struct IntelIOMMUState {
 
     GHashTable *vtd_host_iommu_dev;             /* VTDHostIOMMUDevice */
 
+    QLIST_HEAD(, VTDIOASContainer) containers;
+
     /* interrupt remapping */
     bool intr_enabled;              /* Whether guest enabled IR */
     dma_addr_t intr_root;           /* Interrupt remapping table pointer */
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 050b0d3ca2..3269a66ac7 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -20,6 +20,7 @@
  */
 
 #include "qemu/osdep.h"
+#include CONFIG_DEVICES /* CONFIG_IOMMUFD */
 #include "qemu/error-report.h"
 #include "qemu/main-loop.h"
 #include "qapi/error.h"
@@ -40,6 +41,9 @@
 #include "migration/vmstate.h"
 #include "trace.h"
 #include "system/iommufd.h"
+#ifdef CONFIG_IOMMUFD
+#include <linux/iommufd.h>
+#endif
 
 /* context entry operations */
 #define VTD_CE_GET_RID2PASID(ce) \
@@ -838,11 +842,40 @@ static inline uint16_t vtd_pe_get_did(VTDPASIDEntry *pe)
     return VTD_SM_PASID_ENTRY_DID((pe)->val[1]);
 }
 
+static inline dma_addr_t vtd_pe_get_flpt_base(VTDPASIDEntry *pe)
+{
+    return pe->val[2] & VTD_SM_PASID_ENTRY_FLPTPTR;
+}
+
+static inline uint32_t vtd_pe_get_fl_aw(VTDPASIDEntry *pe)
+{
+    return 48 + ((pe->val[2] >> 2) & VTD_SM_PASID_ENTRY_FLPM) * 9;
+}
+
+static inline bool vtd_pe_pgtt_is_pt(VTDPASIDEntry *pe)
+{
+    return (VTD_PE_GET_TYPE(pe) == VTD_SM_PASID_ENTRY_PT);
+}
+
+/* check if pgtt is first stage translation */
+static inline bool vtd_pe_pgtt_is_flt(VTDPASIDEntry *pe)
+{
+    return (VTD_PE_GET_TYPE(pe) == VTD_SM_PASID_ENTRY_FLT);
+}
+
 static inline bool vtd_pdire_present(VTDPASIDDirEntry *pdire)
 {
     return pdire->val & 1;
 }
 
+static inline void pasid_cache_info_set_error(VTDPASIDCacheInfo *pc_info)
+{
+    if (pc_info->error_happened) {
+        return;
+    }
+    pc_info->error_happened = true;
+}
+
 /**
  * Caller of this function should check present bit if wants
  * to use pdir entry for further usage except for fpd bit check.
@@ -1776,7 +1809,7 @@ static bool vtd_dev_pt_enabled(IntelIOMMUState *s, VTDContextEntry *ce,
              */
             return false;
         }
-        return (VTD_PE_GET_TYPE(&pe) == VTD_SM_PASID_ENTRY_PT);
+        return vtd_pe_pgtt_is_pt(&pe);
     }
 
     return (vtd_ce_get_type(ce) == VTD_CONTEXT_TT_PASS_THROUGH);
@@ -2403,6 +2436,497 @@ static void vtd_context_global_invalidate(IntelIOMMUState *s)
     vtd_iommu_replay_all(s);
 }
 
+#ifdef CONFIG_IOMMUFD
+static bool iommufd_listener_skipped_section(MemoryRegionSection *section)
+{
+    return !memory_region_is_ram(section->mr) ||
+           memory_region_is_protected(section->mr) ||
+           /*
+            * Sizing an enabled 64-bit BAR can cause spurious mappings to
+            * addresses in the upper part of the 64-bit address space.  These
+            * are never accessed by the CPU and beyond the address width of
+            * some IOMMU hardware.  TODO: VFIO should tell us the IOMMU width.
+            */
+           section->offset_within_address_space & (1ULL << 63);
+}
+
+static void iommufd_listener_region_add_s2domain(MemoryListener *listener,
+                                                 MemoryRegionSection *section)
+{
+    VTDIOASContainer *container = container_of(listener,
+                                               VTDIOASContainer, listener);
+    IOMMUFDBackend *iommufd = container->iommufd;
+    uint32_t ioas_id = container->ioas_id;
+    hwaddr iova;
+    Int128 llend, llsize;
+    void *vaddr;
+    Error *err = NULL;
+    int ret;
+
+    if (iommufd_listener_skipped_section(section)) {
+        return;
+    }
+    iova = REAL_HOST_PAGE_ALIGN(section->offset_within_address_space);
+    llend = int128_make64(section->offset_within_address_space);
+    llend = int128_add(llend, section->size);
+    llend = int128_and(llend, int128_exts64(qemu_real_host_page_mask()));
+    llsize = int128_sub(llend, int128_make64(iova));
+    vaddr = memory_region_get_ram_ptr(section->mr) +
+            section->offset_within_region +
+            (iova - section->offset_within_address_space);
+
+    memory_region_ref(section->mr);
+
+    ret = iommufd_backend_map_dma(iommufd, ioas_id, iova, int128_get64(llsize),
+                                  vaddr, section->readonly);
+    if (!ret) {
+        return;
+    }
+
+    error_setg(&err,
+               "iommufd_listener_region_add_s2domain(%p, 0x%"HWADDR_PRIx", "
+               "0x%"HWADDR_PRIx", %p) = %d (%s)",
+               container, iova, int128_get64(llsize), vaddr, ret,
+               strerror(-ret));
+
+    if (memory_region_is_ram_device(section->mr)) {
+        /* Allow unexpected mappings not to be fatal for RAM devices */
+        error_report_err(err);
+        return;
+    }
+
+    if (!container->error) {
+        error_propagate_prepend(&container->error, err, "Region %s: ",
+                                memory_region_name(section->mr));
+    } else {
+        error_free(err);
+    }
+}
+
+static void iommufd_listener_region_del_s2domain(MemoryListener *listener,
+                                                 MemoryRegionSection *section)
+{
+    VTDIOASContainer *container = container_of(listener,
+                                               VTDIOASContainer, listener);
+    IOMMUFDBackend *iommufd = container->iommufd;
+    uint32_t ioas_id = container->ioas_id;
+    hwaddr iova;
+    Int128 llend, llsize;
+    int ret;
+
+    if (iommufd_listener_skipped_section(section)) {
+        return;
+    }
+    iova = REAL_HOST_PAGE_ALIGN(section->offset_within_address_space);
+    llend = int128_make64(section->offset_within_address_space);
+    llend = int128_add(llend, section->size);
+    llend = int128_and(llend, int128_exts64(qemu_real_host_page_mask()));
+    llsize = int128_sub(llend, int128_make64(iova));
+
+    ret = iommufd_backend_unmap_dma(iommufd, ioas_id,
+                                    iova, int128_get64(llsize));
+    if (ret) {
+        error_report("iommufd_listener_region_del_s2domain(%p, "
+                     "0x%"HWADDR_PRIx", 0x%"HWADDR_PRIx") = %d (%s)",
+                     container, iova, int128_get64(llsize), ret,
+                     strerror(-ret));
+    }
+
+    memory_region_unref(section->mr);
+}
+
+static const MemoryListener iommufd_s2domain_memory_listener = {
+    .name = "iommufd_s2domain",
+    .priority = 1000,
+    .region_add = iommufd_listener_region_add_s2domain,
+    .region_del = iommufd_listener_region_del_s2domain,
+};
+
+static void vtd_init_s1_hwpt_data(struct iommu_hwpt_vtd_s1 *vtd,
+                                  VTDPASIDEntry *pe)
+{
+    memset(vtd, 0, sizeof(*vtd));
+
+    vtd->flags =  (VTD_SM_PASID_ENTRY_SRE_BIT(pe->val[2]) ?
+                                        IOMMU_VTD_S1_SRE : 0) |
+                  (VTD_SM_PASID_ENTRY_WPE_BIT(pe->val[2]) ?
+                                        IOMMU_VTD_S1_WPE : 0) |
+                  (VTD_SM_PASID_ENTRY_EAFE_BIT(pe->val[2]) ?
+                                        IOMMU_VTD_S1_EAFE : 0);
+    vtd->addr_width = vtd_pe_get_fl_aw(pe);
+    vtd->pgtbl_addr = (uint64_t)vtd_pe_get_flpt_base(pe);
+}
+
+static int vtd_create_s1_hwpt(HostIOMMUDeviceIOMMUFD *idev,
+                              VTDS2Hwpt *s2_hwpt, VTDHwpt *hwpt,
+                              VTDPASIDEntry *pe, Error **errp)
+{
+    struct iommu_hwpt_vtd_s1 vtd;
+    uint32_t hwpt_id, s2_hwpt_id = s2_hwpt->hwpt_id;
+
+    vtd_init_s1_hwpt_data(&vtd, pe);
+
+    if (!iommufd_backend_alloc_hwpt(idev->iommufd, idev->devid,
+                                    s2_hwpt_id, 0, IOMMU_HWPT_DATA_VTD_S1,
+                                    sizeof(vtd), &vtd, &hwpt_id, errp)) {
+        return -EINVAL;
+    }
+
+    hwpt->hwpt_id = hwpt_id;
+
+    return 0;
+}
+
+static void vtd_destroy_s1_hwpt(HostIOMMUDeviceIOMMUFD *idev, VTDHwpt *hwpt)
+{
+    iommufd_backend_free_id(idev->iommufd, hwpt->hwpt_id);
+}
+
+static VTDS2Hwpt *vtd_ioas_container_get_s2_hwpt(VTDIOASContainer *container,
+                                                 uint32_t hwpt_id)
+{
+    VTDS2Hwpt *s2_hwpt;
+
+    QLIST_FOREACH(s2_hwpt, &container->s2_hwpt_list, next) {
+        if (s2_hwpt->hwpt_id == hwpt_id) {
+            return s2_hwpt;
+        }
+    }
+
+    s2_hwpt = g_malloc0(sizeof(*s2_hwpt));
+
+    s2_hwpt->hwpt_id = hwpt_id;
+    s2_hwpt->container = container;
+    QLIST_INSERT_HEAD(&container->s2_hwpt_list, s2_hwpt, next);
+
+    return s2_hwpt;
+}
+
+static void vtd_ioas_container_put_s2_hwpt(VTDS2Hwpt *s2_hwpt)
+{
+    VTDIOASContainer *container = s2_hwpt->container;
+
+    if (s2_hwpt->users) {
+        return;
+    }
+
+    QLIST_REMOVE(s2_hwpt, next);
+    iommufd_backend_free_id(container->iommufd, s2_hwpt->hwpt_id);
+    g_free(s2_hwpt);
+}
+
+static void vtd_ioas_container_destroy(VTDIOASContainer *container)
+{
+    if (!QLIST_EMPTY(&container->s2_hwpt_list)) {
+        return;
+    }
+
+    QLIST_REMOVE(container, next);
+    memory_listener_unregister(&container->listener);
+    iommufd_backend_free_id(container->iommufd, container->ioas_id);
+    g_free(container);
+}
+
+static int vtd_device_attach_hwpt(VTDHostIOMMUDevice *vtd_hiod,
+                                  uint32_t pasid, VTDPASIDEntry *pe,
+                                  VTDS2Hwpt *s2_hwpt, VTDHwpt *hwpt,
+                                  Error **errp)
+{
+    HostIOMMUDeviceIOMMUFD *idev = HOST_IOMMU_DEVICE_IOMMUFD(vtd_hiod->hiod);
+    int ret;
+
+    if (vtd_pe_pgtt_is_flt(pe)) {
+        ret = vtd_create_s1_hwpt(idev, s2_hwpt, hwpt, pe, errp);
+        if (ret) {
+            return ret;
+        }
+    } else {
+        hwpt->hwpt_id = s2_hwpt->hwpt_id;
+    }
+
+    ret = !host_iommu_device_iommufd_attach_hwpt(idev, hwpt->hwpt_id, errp);
+    trace_vtd_device_attach_hwpt(idev->devid, pasid, hwpt->hwpt_id, ret);
+    if (ret) {
+        if (vtd_pe_pgtt_is_flt(pe)) {
+            vtd_destroy_s1_hwpt(idev, hwpt);
+        }
+        error_report("devid %d pasid %d failed to attach hwpt %d",
+                     idev->devid, pasid, hwpt->hwpt_id);
+        hwpt->hwpt_id = 0;
+        return ret;
+    }
+
+    s2_hwpt->users++;
+    hwpt->s2_hwpt = s2_hwpt;
+
+    return 0;
+}
+
+static void vtd_device_detach_hwpt(VTDHostIOMMUDevice *vtd_hiod,
+                                   uint32_t pasid, VTDPASIDEntry *pe,
+                                   VTDHwpt *hwpt, Error **errp)
+{
+    HostIOMMUDeviceIOMMUFD *idev = HOST_IOMMU_DEVICE_IOMMUFD(vtd_hiod->hiod);
+    int ret;
+
+    if (vtd_hiod->iommu_state->dmar_enabled) {
+        ret = !host_iommu_device_iommufd_detach_hwpt(idev, errp);
+        trace_vtd_device_detach_hwpt(idev->devid, pasid, ret);
+    } else {
+        ret = !host_iommu_device_iommufd_attach_hwpt(idev, idev->hwpt_id, errp);
+        trace_vtd_device_reattach_def_hwpt(idev->devid, pasid, idev->hwpt_id,
+                                           ret);
+    }
+
+    if (ret) {
+        error_report("devid %d pasid %d failed to detach hwpt %d",
+                     idev->devid, pasid, hwpt->hwpt_id);
+    }
+
+    if (vtd_pe_pgtt_is_flt(pe)) {
+        vtd_destroy_s1_hwpt(idev, hwpt);
+    }
+
+    hwpt->s2_hwpt->users--;
+    hwpt->s2_hwpt = NULL;
+    hwpt->hwpt_id = 0;
+}
+
+static int vtd_device_attach_container(VTDHostIOMMUDevice *vtd_hiod,
+                                       VTDIOASContainer *container,
+                                       uint32_t pasid, VTDPASIDEntry *pe,
+                                       VTDHwpt *hwpt, Error **errp)
+{
+    HostIOMMUDeviceIOMMUFD *idev = HOST_IOMMU_DEVICE_IOMMUFD(vtd_hiod->hiod);
+    IOMMUFDBackend *iommufd = idev->iommufd;
+    VTDS2Hwpt *s2_hwpt;
+    uint32_t s2_hwpt_id;
+    Error *err = NULL;
+    int ret;
+
+    /* try to attach to an existing hwpt in this container */
+    QLIST_FOREACH(s2_hwpt, &container->s2_hwpt_list, next) {
+        ret = vtd_device_attach_hwpt(vtd_hiod, pasid, pe, s2_hwpt, hwpt, &err);
+        if (ret) {
+            const char *msg = error_get_pretty(err);
+
+            trace_vtd_device_fail_attach_existing_hwpt(msg);
+            error_free(err);
+            err = NULL;
+        } else {
+            goto found_hwpt;
+        }
+    }
+
+    if (!iommufd_backend_alloc_hwpt(iommufd, idev->devid, container->ioas_id,
+                                    IOMMU_HWPT_ALLOC_NEST_PARENT,
+                                    IOMMU_HWPT_DATA_NONE, 0, NULL,
+                                    &s2_hwpt_id, errp)) {
+        return -EINVAL;
+    }
+
+    s2_hwpt = vtd_ioas_container_get_s2_hwpt(container, s2_hwpt_id);
+
+    /* Attach vtd device to a new allocated hwpt within iommufd */
+    ret = vtd_device_attach_hwpt(vtd_hiod, pasid, pe, s2_hwpt, hwpt, errp);
+    if (ret) {
+        goto err_attach_hwpt;
+    }
+
+found_hwpt:
+    trace_vtd_device_attach_container(iommufd->fd, idev->devid, pasid,
+                                      container->ioas_id, hwpt->hwpt_id);
+    return 0;
+
+err_attach_hwpt:
+    vtd_ioas_container_put_s2_hwpt(s2_hwpt);
+    return ret;
+}
+
+static void vtd_device_detach_container(VTDHostIOMMUDevice *vtd_hiod,
+                                        uint32_t pasid, VTDPASIDEntry *pe,
+                                        VTDHwpt *hwpt, Error **errp)
+{
+    HostIOMMUDeviceIOMMUFD *idev = HOST_IOMMU_DEVICE_IOMMUFD(vtd_hiod->hiod);
+    IOMMUFDBackend *iommufd = idev->iommufd;
+    VTDS2Hwpt *s2_hwpt = hwpt->s2_hwpt;
+
+    trace_vtd_device_detach_container(iommufd->fd, idev->devid, pasid);
+    vtd_device_detach_hwpt(vtd_hiod, pasid, pe, hwpt, errp);
+    vtd_ioas_container_put_s2_hwpt(s2_hwpt);
+}
+
+static int vtd_device_attach_iommufd(VTDHostIOMMUDevice *vtd_hiod,
+                                     uint32_t pasid, VTDPASIDEntry *pe,
+                                     VTDHwpt *hwpt, Error **errp)
+{
+    HostIOMMUDeviceIOMMUFD *idev = HOST_IOMMU_DEVICE_IOMMUFD(vtd_hiod->hiod);
+    IOMMUFDBackend *iommufd = idev->iommufd;
+    IntelIOMMUState *s = vtd_hiod->iommu_state;
+    VTDIOASContainer *container;
+    Error *err = NULL;
+    uint32_t ioas_id;
+    int ret;
+
+    /* try to attach to an existing container in this space */
+    QLIST_FOREACH(container, &s->containers, next) {
+        if (container->iommufd != iommufd) {
+            continue;
+        }
+
+        if (vtd_device_attach_container(vtd_hiod, container, pasid, pe, hwpt,
+                                        &err)) {
+            const char *msg = error_get_pretty(err);
+
+            trace_vtd_device_fail_attach_existing_container(msg);
+            error_free(err);
+            err = NULL;
+        } else {
+            return 0;
+        }
+    }
+
+    /* Need to allocate a new dedicated container */
+    ret = iommufd_backend_alloc_ioas(iommufd, &ioas_id, errp);
+    if (ret < 0) {
+        return ret;
+    }
+
+    trace_vtd_device_alloc_ioas(iommufd->fd, ioas_id);
+
+    container = g_malloc0(sizeof(*container));
+    container->iommufd = iommufd;
+    container->ioas_id = ioas_id;
+    QLIST_INIT(&container->s2_hwpt_list);
+
+    if (vtd_device_attach_container(vtd_hiod, container, pasid, pe, hwpt,
+                                    errp)) {
+        goto err_attach_container;
+    }
+
+    container->listener = iommufd_s2domain_memory_listener;
+    memory_listener_register(&container->listener, &address_space_memory);
+
+    if (container->error) {
+        ret = -1;
+        error_propagate_prepend(errp, container->error,
+                                "memory listener initialization failed: ");
+        goto err_listener_register;
+    }
+
+    QLIST_INSERT_HEAD(&s->containers, container, next);
+
+    return 0;
+
+err_listener_register:
+    vtd_device_detach_container(vtd_hiod, pasid, pe, hwpt, errp);
+err_attach_container:
+    iommufd_backend_free_id(iommufd, container->ioas_id);
+    g_free(container);
+    return ret;
+}
+
+static void vtd_device_detach_iommufd(VTDHostIOMMUDevice *vtd_hiod,
+                                      uint32_t pasid, VTDPASIDEntry *pe,
+                                      VTDHwpt *hwpt, Error **errp)
+{
+    VTDIOASContainer *container = hwpt->s2_hwpt->container;
+
+    vtd_device_detach_container(vtd_hiod, pasid, pe, hwpt, errp);
+    vtd_ioas_container_destroy(container);
+}
+
+static int vtd_device_attach_pgtbl(VTDHostIOMMUDevice *vtd_hiod,
+                                   VTDAddressSpace *vtd_as, VTDPASIDEntry *pe)
+{
+    /*
+     * Only FLT and PT are handled here: the host only accepts guest FLT
+     * under nesting, and with pe->pgtt == PT the PASID is set up with the
+     * GPA (stage-2) page table. Any other PGTT type must fail.
+     */
+    if (!vtd_pe_pgtt_is_flt(pe) && !vtd_pe_pgtt_is_pt(pe)) {
+        return -EINVAL;
+    }
+
+    /* Should fail if the FLPT base is 0 */
+    if (vtd_pe_pgtt_is_flt(pe) && !vtd_pe_get_flpt_base(pe)) {
+        return -EINVAL;
+    }
+
+    return vtd_device_attach_iommufd(vtd_hiod, vtd_as->pasid, pe,
+                                     &vtd_as->hwpt, &error_abort);
+}
+
+static int vtd_device_detach_pgtbl(VTDHostIOMMUDevice *vtd_hiod,
+                                   VTDAddressSpace *vtd_as)
+{
+    VTDPASIDEntry *cached_pe = vtd_as->pasid_cache_entry.cache_filled ?
+                       &vtd_as->pasid_cache_entry.pasid_entry : NULL;
+
+    if (!cached_pe ||
+        (!vtd_pe_pgtt_is_flt(cached_pe) && !vtd_pe_pgtt_is_pt(cached_pe))) {
+        return 0;
+    }
+
+    vtd_device_detach_iommufd(vtd_hiod, vtd_as->pasid, cached_pe,
+                              &vtd_as->hwpt, &error_abort);
+
+    return 0;
+}
+
+/**
+ * Caller should hold iommu_lock.
+ */
+static int vtd_bind_guest_pasid(VTDAddressSpace *vtd_as,
+                                VTDPASIDEntry *pe, VTDPASIDOp op)
+{
+    IntelIOMMUState *s = vtd_as->iommu_state;
+    VTDHostIOMMUDevice *vtd_hiod;
+    int devfn = vtd_as->devfn;
+    int ret = -EINVAL;
+    struct vtd_as_key key = {
+        .bus = vtd_as->bus,
+        .devfn = devfn,
+    };
+
+    vtd_hiod = g_hash_table_lookup(s->vtd_host_iommu_dev, &key);
+    if (!vtd_hiod || !vtd_hiod->hiod) {
+        /* means no need to go further, e.g. for emulated devices */
+        return 0;
+    }
+
+    if (vtd_as->pasid != PCI_NO_PASID) {
+        error_report("Non-rid_pasid %d not supported yet", vtd_as->pasid);
+        return ret;
+    }
+
+    switch (op) {
+    case VTD_PASID_UPDATE:
+    case VTD_PASID_BIND:
+    {
+        ret = vtd_device_attach_pgtbl(vtd_hiod, vtd_as, pe);
+        break;
+    }
+    case VTD_PASID_UNBIND:
+    {
+        ret = vtd_device_detach_pgtbl(vtd_hiod, vtd_as);
+        break;
+    }
+    default:
+        error_report_once("Unknown VTDPASIDOp");
+        break;
+    }
+
+    return ret;
+}
+#else
+static int vtd_bind_guest_pasid(VTDAddressSpace *vtd_as,
+                                VTDPASIDEntry *pe, VTDPASIDOp op)
+{
+    return 0;
+}
+#endif
+
 /* Do a context-cache device-selective invalidation.
  * @func_mask: FM field after shifting
  */
@@ -3145,21 +3669,27 @@ static bool vtd_pasid_entry_compare(VTDPASIDEntry *p1, VTDPASIDEntry *p2)
  * This function fills in the pasid entry in &vtd_as. Caller
  * of this function should hold iommu_lock.
  */
-static void vtd_fill_pe_in_cache(IntelIOMMUState *s, VTDAddressSpace *vtd_as,
-                                 VTDPASIDEntry *pe)
+static int vtd_fill_pe_in_cache(IntelIOMMUState *s, VTDAddressSpace *vtd_as,
+                                VTDPASIDEntry *pe)
 {
     VTDPASIDCacheEntry *pc_entry = &vtd_as->pasid_cache_entry;
+    int ret;
 
-    if (vtd_pasid_entry_compare(pe, &pc_entry->pasid_entry)) {
-        /* No need to go further as cached pasid entry is latest */
-        return;
+    if (pc_entry->cache_filled) {
+        if (vtd_pasid_entry_compare(pe, &pc_entry->pasid_entry)) {
+            /* No need to go further as cached pasid entry is latest */
+            return 0;
+        }
+        ret = vtd_bind_guest_pasid(vtd_as, pe, VTD_PASID_UPDATE);
+    } else {
+        ret = vtd_bind_guest_pasid(vtd_as, pe, VTD_PASID_BIND);
     }
 
-    pc_entry->pasid_entry = *pe;
-    pc_entry->cache_filled = true;
-    /*
-     * TODO: send pasid bind to host for passthru devices
-     */
+    if (!ret) {
+        pc_entry->pasid_entry = *pe;
+        pc_entry->cache_filled = true;
+    }
+    return ret;
 }
 
 /*
@@ -3225,14 +3755,20 @@ static gboolean vtd_flush_pasid(gpointer key, gpointer value,
         goto remove;
     }
 
-    vtd_fill_pe_in_cache(s, vtd_as, &pe);
+    if (vtd_fill_pe_in_cache(s, vtd_as, &pe)) {
+        pasid_cache_info_set_error(pc_info);
+    }
     return false;
 
 remove:
-    /*
-     * TODO: send pasid unbind to host for passthru devices
-     */
-    pc_entry->cache_filled = false;
+    if (pc_entry->cache_filled) {
+        if (vtd_bind_guest_pasid(vtd_as, NULL, VTD_PASID_UNBIND)) {
+            pasid_cache_info_set_error(pc_info);
+            return false;
+        } else {
+            pc_entry->cache_filled = false;
+        }
+    }
 
     /*
      * Don't remove address space of PCI_NO_PASID which is created by PCI
@@ -3247,7 +3783,7 @@ remove:
 /* Caller of this function should hold iommu_lock */
 static void vtd_pasid_cache_reset_locked(IntelIOMMUState *s)
 {
-    VTDPASIDCacheInfo pc_info;
+    VTDPASIDCacheInfo pc_info = { .error_happened = false, };
 
     trace_vtd_pasid_cache_reset();
 
@@ -3308,7 +3844,9 @@ static void vtd_sm_pasid_table_walk_one(IntelIOMMUState *s,
                 pasid = pasid_next;
                 continue;
             }
-            vtd_fill_pe_in_cache(s, vtd_as, &pe);
+            if (vtd_fill_pe_in_cache(s, vtd_as, &pe)) {
+                pasid_cache_info_set_error(info);
+            }
         }
         pasid = pasid_next;
     }
@@ -3416,6 +3954,9 @@ static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
         walk_info.devfn = vtd_hiod->devfn;
         vtd_replay_pasid_bind_for_dev(s, start, end, &walk_info);
     }
+    if (walk_info.error_happened) {
+        pasid_cache_info_set_error(pc_info);
+    }
 }
 
 /*
@@ -3488,9 +4029,9 @@ static void vtd_pasid_cache_sync(IntelIOMMUState *s,
 static bool vtd_process_pasid_desc(IntelIOMMUState *s,
                                    VTDInvDesc *inv_desc)
 {
+    VTDPASIDCacheInfo pc_info = { .error_happened = false, };
     uint16_t domain_id;
     uint32_t pasid;
-    VTDPASIDCacheInfo pc_info;
     uint64_t mask[4] = {VTD_INV_DESC_PASIDC_RSVD_VAL0, VTD_INV_DESC_ALL_ONE,
                         VTD_INV_DESC_ALL_ONE, VTD_INV_DESC_ALL_ONE};
 
@@ -3529,7 +4070,7 @@ static bool vtd_process_pasid_desc(IntelIOMMUState *s,
     }
 
     vtd_pasid_cache_sync(s, &pc_info);
-    return true;
+    return !pc_info.error_happened;
 }
 
 static bool vtd_process_inv_iec_desc(IntelIOMMUState *s,
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index c8a936eb46..de903a0033 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -73,6 +73,14 @@ vtd_warn_invalid_qi_tail(uint16_t tail) "tail 0x%"PRIx16
 vtd_warn_ir_vector(uint16_t sid, int index, int vec, int target) "sid 0x%"PRIx16" index %d vec %d (should be: %d)"
 vtd_warn_ir_trigger(uint16_t sid, int index, int trig, int target) "sid 0x%"PRIx16" index %d trigger %d (should be: %d)"
 vtd_reset_exit(void) ""
+vtd_device_attach_hwpt(uint32_t dev_id, uint32_t pasid, uint32_t hwpt_id, int ret) "dev_id %d pasid %d hwpt_id %d, ret: %d"
+vtd_device_detach_hwpt(uint32_t dev_id, uint32_t pasid, int ret) "dev_id %d pasid %d ret: %d"
+vtd_device_reattach_def_hwpt(uint32_t dev_id, uint32_t pasid, uint32_t hwpt_id, int ret) "dev_id %d pasid %d hwpt_id %d, ret: %d"
+vtd_device_fail_attach_existing_hwpt(const char *msg) " %s"
+vtd_device_attach_container(int fd, uint32_t dev_id, uint32_t pasid, uint32_t ioas_id, uint32_t hwpt_id) "iommufd %d dev_id %d pasid %d ioas_id %d hwpt_id %d"
+vtd_device_detach_container(int fd, uint32_t dev_id, uint32_t pasid) "iommufd %d dev_id %d pasid %d"
+vtd_device_fail_attach_existing_container(const char *msg) " %s"
+vtd_device_alloc_ioas(int fd, uint32_t ioas_id) "iommufd %d ioas_id %d"
 
 # amd_iommu.c
 amdvi_evntlog_fail(uint64_t addr, uint32_t head) "error: fail to write at addr 0x%"PRIx64" +  offset 0x%"PRIx32
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH rfcv3 16/21] intel_iommu: ERRATA_772415 workaround
  2025-05-21 11:14 [PATCH rfcv3 00/21] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
                   ` (14 preceding siblings ...)
  2025-05-21 11:14 ` [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to host Zhenzhong Duan
@ 2025-05-21 11:14 ` Zhenzhong Duan
  2025-05-21 11:14 ` [PATCH rfcv3 17/21] intel_iommu: Replay pasid binds after context cache invalidation Zhenzhong Duan
                   ` (5 subsequent siblings)
  21 siblings, 0 replies; 63+ messages in thread
From: Zhenzhong Duan @ 2025-05-21 11:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan, Marcel Apfelbaum, Paolo Bonzini,
	Richard Henderson, Eduardo Habkost

On a system affected by ERRATA_772415, IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17
is reported by IOMMU_DEVICE_GET_HW_INFO. Due to this errata, even a read-only
range mapped in the stage-2 page table can still be written.

Reference: 4th Gen Intel Xeon Processor Scalable Family Specification
Update [0], Errata Details, SPR17.

[0] https://edc.intel.com/content/www/us/en/design/products-and-solutions/processors-and-chipsets/eagle-stream/sapphire-rapids-specification-update

We utilize the newly added IOMMUFD container/ioas/hwpt management framework
in VTD. Add a check so that a dedicated VTDIOASContainer holding only RW
mappings is created for a device with ERRATA_772415; that container then
serves as the backend for such a device. See the diagram below for details:

      IntelIOMMUState
             |
             V
    .------------------.    .------------------.    .-------------------.
    | VTDIOASContainer |--->| VTDIOASContainer |--->| VTDIOASContainer  |-->...
    | (iommufd0,RW&RO) |    | (iommufd1,RW&RO) |    | (iommufd0,only RW)|
    .------------------.    .------------------.    .-------------------.
             |                       |                              |
             |                       .-->...                        |
             V                                                      V
      .-------------------.    .-------------------.          .---------------.
      |   VTDS2Hwpt(CC)   |--->| VTDS2Hwpt(non-CC) |-->...    | VTDS2Hwpt(CC) |-->...
      .-------------------.    .-------------------.          .---------------.
          |            |               |                            |
          |            |               |                            |
    .-----------.  .-----------.  .------------.              .------------.
    | IOMMUFD   |  | IOMMUFD   |  | IOMMUFD    |              | IOMMUFD    |
    | Device(CC)|  | Device(CC)|  | Device     |              | Device(CC) |
    | (iommufd0)|  | (iommufd0)|  | (non-CC)   |              | (errata)   |
    |           |  |           |  | (iommufd0) |              | (iommufd0) |
    .-----------.  .-----------.  .------------.              .------------.

Also change vtd_check_hiod() to take a VTDHostIOMMUDevice pointer so the
errata value can be saved.

Suggested-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/i386/intel_iommu_internal.h |  1 +
 include/hw/i386/intel_iommu.h  |  1 +
 hw/i386/intel_iommu.c          | 25 +++++++++++++++++--------
 3 files changed, 19 insertions(+), 8 deletions(-)

diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index e76f43bb8f..75d840f9fe 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -654,5 +654,6 @@ typedef struct VTDHostIOMMUDevice {
     PCIBus *bus;
     uint8_t devfn;
     HostIOMMUDevice *hiod;
+    uint32_t errata;
 } VTDHostIOMMUDevice;
 #endif
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 594281c1d3..9b156dc32e 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -103,6 +103,7 @@ typedef struct VTDPASIDCacheEntry {
 typedef struct VTDIOASContainer {
     struct IOMMUFDBackend *iommufd;
     uint32_t ioas_id;
+    uint32_t errata;
     MemoryListener listener;
     QLIST_HEAD(, VTDS2Hwpt) s2_hwpt_list;
     QLIST_ENTRY(VTDIOASContainer) next;
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 3269a66ac7..9ffc2a8ffc 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -2437,7 +2437,8 @@ static void vtd_context_global_invalidate(IntelIOMMUState *s)
 }
 
 #ifdef CONFIG_IOMMUFD
-static bool iommufd_listener_skipped_section(MemoryRegionSection *section)
+static bool iommufd_listener_skipped_section(VTDIOASContainer *container,
+                                             MemoryRegionSection *section)
 {
     return !memory_region_is_ram(section->mr) ||
            memory_region_is_protected(section->mr) ||
@@ -2447,7 +2448,8 @@ static bool iommufd_listener_skipped_section(MemoryRegionSection *section)
             * are never accessed by the CPU and beyond the address width of
             * some IOMMU hardware.  TODO: VFIO should tell us the IOMMU width.
             */
-           section->offset_within_address_space & (1ULL << 63);
+           section->offset_within_address_space & (1ULL << 63) ||
+           (container->errata && section->readonly);
 }
 
 static void iommufd_listener_region_add_s2domain(MemoryListener *listener,
@@ -2463,7 +2465,7 @@ static void iommufd_listener_region_add_s2domain(MemoryListener *listener,
     Error *err = NULL;
     int ret;
 
-    if (iommufd_listener_skipped_section(section)) {
+    if (iommufd_listener_skipped_section(container, section)) {
         return;
     }
     iova = REAL_HOST_PAGE_ALIGN(section->offset_within_address_space);
@@ -2514,7 +2516,7 @@ static void iommufd_listener_region_del_s2domain(MemoryListener *listener,
     Int128 llend, llsize;
     int ret;
 
-    if (iommufd_listener_skipped_section(section)) {
+    if (iommufd_listener_skipped_section(container, section)) {
         return;
     }
     iova = REAL_HOST_PAGE_ALIGN(section->offset_within_address_space);
@@ -2770,7 +2772,8 @@ static int vtd_device_attach_iommufd(VTDHostIOMMUDevice *vtd_hiod,
 
     /* try to attach to an existing container in this space */
     QLIST_FOREACH(container, &s->containers, next) {
-        if (container->iommufd != iommufd) {
+        if (container->iommufd != iommufd ||
+            container->errata != vtd_hiod->errata) {
             continue;
         }
 
@@ -2797,6 +2800,7 @@ static int vtd_device_attach_iommufd(VTDHostIOMMUDevice *vtd_hiod,
     container = g_malloc0(sizeof(*container));
     container->iommufd = iommufd;
     container->ioas_id = ioas_id;
+    container->errata = vtd_hiod->errata;
     QLIST_INIT(&container->s2_hwpt_list);
 
     if (vtd_device_attach_container(vtd_hiod, container, pasid, pe, hwpt,
@@ -5355,9 +5359,10 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus,
     return vtd_dev_as;
 }
 
-static bool vtd_check_hiod(IntelIOMMUState *s, HostIOMMUDevice *hiod,
+static bool vtd_check_hiod(IntelIOMMUState *s, VTDHostIOMMUDevice *vtd_hiod,
                            Error **errp)
 {
+    HostIOMMUDevice *hiod = vtd_hiod->hiod;
     HostIOMMUDeviceClass *hiodc = HOST_IOMMU_DEVICE_GET_CLASS(hiod);
     int ret;
 
@@ -5399,7 +5404,7 @@ static bool vtd_check_hiod(IntelIOMMUState *s, HostIOMMUDevice *hiod,
     }
 
     /*
-     * HOST_IOMMU_DEVICE_CAP_NESTING/FS1GP are VTD vendor specific
+     * HOST_IOMMU_DEVICE_CAP_NESTING/FS1GP/ERRATA are VTD vendor specific
      * capabilities, so get_cap() should never fail on them now that
      * HOST_IOMMU_DEVICE_IOMMU_HW_INFO_TYPE_INTEL_VTD type check passed
      * above.
@@ -5416,6 +5421,9 @@ static bool vtd_check_hiod(IntelIOMMUState *s, HostIOMMUDevice *hiod,
         return false;
     }
 
+    ret = hiodc->get_cap(hiod, HOST_IOMMU_DEVICE_CAP_ERRATA, errp);
+    vtd_hiod->errata = ret;
+
     error_setg(errp, "host device is uncompatible with stage-1 translation");
     return false;
 }
@@ -5447,7 +5455,8 @@ static bool vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
     vtd_hiod->iommu_state = s;
     vtd_hiod->hiod = hiod;
 
-    if (!vtd_check_hiod(s, hiod, errp)) {
+    if (!vtd_check_hiod(s, vtd_hiod, errp)) {
+        g_free(vtd_hiod);
         vtd_iommu_unlock(s);
         return false;
     }
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH rfcv3 17/21] intel_iommu: Replay pasid binds after context cache invalidation
  2025-05-21 11:14 [PATCH rfcv3 00/21] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
                   ` (15 preceding siblings ...)
  2025-05-21 11:14 ` [PATCH rfcv3 16/21] intel_iommu: ERRATA_772415 workaround Zhenzhong Duan
@ 2025-05-21 11:14 ` Zhenzhong Duan
  2025-05-21 11:14 ` [PATCH rfcv3 18/21] intel_iommu: Propagate PASID-based iotlb invalidation to host Zhenzhong Duan
                   ` (4 subsequent siblings)
  21 siblings, 0 replies; 63+ messages in thread
From: Zhenzhong Duan @ 2025-05-21 11:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng, Yi Sun,
	Zhenzhong Duan, Paolo Bonzini, Richard Henderson, Eduardo Habkost,
	Marcel Apfelbaum

From: Yi Liu <yi.l.liu@intel.com>

This replays guest pasid attachments after a context cache invalidation,
as a safety measure. Strictly speaking, the programmer should issue a
pasid cache invalidation with the proper granularity after issuing a
context cache invalidation.

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/i386/intel_iommu_internal.h |  1 +
 hw/i386/intel_iommu.c          | 51 ++++++++++++++++++++++++++++++++--
 hw/i386/trace-events           |  1 +
 3 files changed, 51 insertions(+), 2 deletions(-)

diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 75d840f9fe..198726b48f 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -575,6 +575,7 @@ typedef enum VTDPCInvType {
     VTD_PASID_CACHE_FORCE_RESET = 0,
     /* pasid cache invalidation rely on guest PASID entry */
     VTD_PASID_CACHE_GLOBAL_INV, /* pasid cache global invalidation */
+    VTD_PASID_CACHE_DEVSI,      /* pasid cache device selective invalidation */
     VTD_PASID_CACHE_DOMSI,      /* pasid cache domain selective invalidation */
     VTD_PASID_CACHE_PASIDSI,    /* pasid cache pasid selective invalidation */
 } VTDPCInvType;
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 9ffc2a8ffc..d686d0ee1a 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -91,6 +91,10 @@ static void vtd_address_space_refresh_all(IntelIOMMUState *s);
 static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
 
 static void vtd_pasid_cache_reset_locked(IntelIOMMUState *s);
+static void vtd_pasid_cache_sync(IntelIOMMUState *s,
+                                 VTDPASIDCacheInfo *pc_info);
+static void vtd_pasid_cache_devsi(IntelIOMMUState *s,
+                                  PCIBus *bus, uint16_t devfn);
 
 static void vtd_panic_require_caching_mode(void)
 {
@@ -2417,6 +2421,8 @@ static void vtd_iommu_replay_all(IntelIOMMUState *s)
 
 static void vtd_context_global_invalidate(IntelIOMMUState *s)
 {
+    VTDPASIDCacheInfo pc_info = { .error_happened = false, };
+
     trace_vtd_inv_desc_cc_global();
     /* Protects context cache */
     vtd_iommu_lock(s);
@@ -2434,6 +2440,9 @@ static void vtd_context_global_invalidate(IntelIOMMUState *s)
      * VT-d emulation codes.
      */
     vtd_iommu_replay_all(s);
+
+    pc_info.type = VTD_PASID_CACHE_GLOBAL_INV;
+    vtd_pasid_cache_sync(s, &pc_info);
 }
 
 #ifdef CONFIG_IOMMUFD
@@ -2989,6 +2998,21 @@ static void vtd_context_device_invalidate(IntelIOMMUState *s,
              * happened.
              */
             vtd_address_space_sync(vtd_as);
+            /*
+             * Per spec, a context flush should also be followed by a
+             * PASID cache and iotlb flush. For a device selective
+             * context cache invalidation:
+             * if (emulated_device)
+             *    invalidate pasid cache and pasid-based iotlb
+             * else if (assigned_device)
+             *    check if the device has been bound to any pasid
+             *    invoke pasid_unbind for each bound pasid
+             * Here, vtd_pasid_cache_devsi() invalidates the pasid
+             * caches, while for piotlb in QEMU we don't have it yet,
+             * so no handling. For an assigned device, the host iommu
+             * driver flushes the piotlb when a pasid unbind is passed
+             * down to it.
+             */
+            vtd_pasid_cache_devsi(s, vtd_as->bus, devfn);
         }
     }
 }
@@ -3737,6 +3761,11 @@ static gboolean vtd_flush_pasid(gpointer key, gpointer value,
         /* Fall through */
     case VTD_PASID_CACHE_GLOBAL_INV:
         break;
+    case VTD_PASID_CACHE_DEVSI:
+        if (pc_info->bus != vtd_as->bus || pc_info->devfn != vtd_as->devfn) {
+            return false;
+        }
+        break;
     default:
         error_report("invalid pc_info->type");
         abort();
@@ -3933,6 +3962,11 @@ static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
     case VTD_PASID_CACHE_GLOBAL_INV:
         /* loop all assigned devices */
         break;
+    case VTD_PASID_CACHE_DEVSI:
+        walk_info.bus = pc_info->bus;
+        walk_info.devfn = pc_info->devfn;
+        vtd_replay_pasid_bind_for_dev(s, start, end, &walk_info);
+        return;
     case VTD_PASID_CACHE_FORCE_RESET:
         /* For force reset, no need to go further replay */
         return;
@@ -3968,8 +4002,7 @@ static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
  * It includes updating the pasid cache in vIOMMU and updating the
  * pasid bindings per guest's latest pasid entry presence.
  */
-static void vtd_pasid_cache_sync(IntelIOMMUState *s,
-                                 VTDPASIDCacheInfo *pc_info)
+static void vtd_pasid_cache_sync(IntelIOMMUState *s, VTDPASIDCacheInfo *pc_info)
 {
     if (!s->flts || !s->root_scalable || !s->dmar_enabled) {
         return;
@@ -4030,6 +4063,20 @@ static void vtd_pasid_cache_sync(IntelIOMMUState *s,
     vtd_replay_guest_pasid_bindings(s, pc_info);
 }
 
+static void vtd_pasid_cache_devsi(IntelIOMMUState *s,
+                                  PCIBus *bus, uint16_t devfn)
+{
+    VTDPASIDCacheInfo pc_info = { .error_happened = false, };
+
+    trace_vtd_pasid_cache_devsi(devfn);
+
+    pc_info.type = VTD_PASID_CACHE_DEVSI;
+    pc_info.bus = bus;
+    pc_info.devfn = devfn;
+
+    vtd_pasid_cache_sync(s, &pc_info);
+}
+
 static bool vtd_process_pasid_desc(IntelIOMMUState *s,
                                    VTDInvDesc *inv_desc)
 {
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index de903a0033..f001b820d9 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -28,6 +28,7 @@ vtd_pasid_cache_reset(void) ""
 vtd_pasid_cache_gsi(void) ""
 vtd_pasid_cache_dsi(uint16_t domain) "Domain selective PC invalidation domain 0x%"PRIx16
 vtd_pasid_cache_psi(uint16_t domain, uint32_t pasid) "PASID selective PC invalidation domain 0x%"PRIx16" pasid 0x%"PRIx32
+vtd_pasid_cache_devsi(uint16_t devfn) "Dev selective PC invalidation dev: 0x%"PRIx16
 vtd_re_not_present(uint8_t bus) "Root entry bus %"PRIu8" not present"
 vtd_ce_not_present(uint8_t bus, uint8_t devfn) "Context entry bus %"PRIu8" devfn %"PRIu8" not present"
 vtd_iotlb_page_hit(uint16_t sid, uint64_t addr, uint64_t slpte, uint16_t domain) "IOTLB page hit sid 0x%"PRIx16" iova 0x%"PRIx64" slpte 0x%"PRIx64" domain 0x%"PRIx16
-- 
2.34.1




* [PATCH rfcv3 18/21] intel_iommu: Propagate PASID-based iotlb invalidation to host
  2025-05-21 11:14 [PATCH rfcv3 00/21] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
                   ` (16 preceding siblings ...)
  2025-05-21 11:14 ` [PATCH rfcv3 17/21] intel_iommu: Replay pasid binds after context cache invalidation Zhenzhong Duan
@ 2025-05-21 11:14 ` Zhenzhong Duan
  2025-05-21 11:14 ` [PATCH rfcv3 19/21] intel_iommu: Refresh pasid bind when either SRTP or TE bit is changed Zhenzhong Duan
                   ` (3 subsequent siblings)
  21 siblings, 0 replies; 63+ messages in thread
From: Zhenzhong Duan @ 2025-05-21 11:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng, Yi Sun,
	Zhenzhong Duan, Marcel Apfelbaum, Paolo Bonzini,
	Richard Henderson, Eduardo Habkost

From: Yi Liu <yi.l.liu@intel.com>

This traps guest PASID-based iotlb invalidation requests and propagates
them to the host.

Intel VT-d 3.0 supports nested translation at PASID granularity. Guest SVA
support can be implemented by configuring nested translation for a specific
PASID. This is also known as dual-stage DMA translation.

Under such a configuration, the guest owns the GVA->GPA translation, which
is configured as the stage-1 page table on the host side for a specific
pasid, while the host owns the GPA->HPA translation. Since the guest owns
the stage-1 translation table, piotlb invalidations must be propagated to
the host, because the host IOMMU caches stage-1 page table mappings during
DMA address translation.

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/i386/intel_iommu_internal.h |   6 ++
 hw/i386/intel_iommu.c          | 118 ++++++++++++++++++++++++++++++++-
 2 files changed, 122 insertions(+), 2 deletions(-)

diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 198726b48f..e4552ff9bd 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -589,6 +589,12 @@ typedef struct VTDPASIDCacheInfo {
     bool error_happened;
 } VTDPASIDCacheInfo;
 
+typedef struct VTDPIOTLBInvInfo {
+    uint16_t domain_id;
+    uint32_t pasid;
+    struct iommu_hwpt_vtd_s1_invalidate *inv_data;
+} VTDPIOTLBInvInfo;
+
 /* PASID Table Related Definitions */
 #define VTD_PASID_DIR_BASE_ADDR_MASK  (~0xfffULL)
 #define VTD_PASID_TABLE_BASE_ADDR_MASK (~0xfffULL)
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index d686d0ee1a..bb21060d7e 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -2932,12 +2932,110 @@ static int vtd_bind_guest_pasid(VTDAddressSpace *vtd_as,
 
     return ret;
 }
+
+/*
+ * Caller of this function should hold iommu_lock.
+ */
+static void vtd_invalidate_piotlb(VTDAddressSpace *vtd_as,
+                                  struct iommu_hwpt_vtd_s1_invalidate *cache)
+{
+    VTDHostIOMMUDevice *vtd_hiod;
+    HostIOMMUDeviceIOMMUFD *idev;
+    VTDHwpt *hwpt = &vtd_as->hwpt;
+    int devfn = vtd_as->devfn;
+    struct vtd_as_key key = {
+        .bus = vtd_as->bus,
+        .devfn = devfn,
+    };
+    IntelIOMMUState *s = vtd_as->iommu_state;
+    uint32_t entry_num = 1; /* Only implement one request for simplicity */
+    Error *err;
+
+    if (!hwpt) {
+        return;
+    }
+
+    vtd_hiod = g_hash_table_lookup(s->vtd_host_iommu_dev, &key);
+    if (!vtd_hiod || !vtd_hiod->hiod) {
+        return;
+    }
+    idev = HOST_IOMMU_DEVICE_IOMMUFD(vtd_hiod->hiod);
+
+    if (!iommufd_backend_invalidate_cache(idev->iommufd, hwpt->hwpt_id,
+                                          IOMMU_HWPT_INVALIDATE_DATA_VTD_S1,
+                                          sizeof(*cache), &entry_num, cache,
+                                          &err)) {
+        error_report_err(err);
+    }
+}
+
+/*
+ * Per-entry callback for iterating the s->vtd_address_spaces table
+ * with VTDPIOTLBInvInfo as the filter. It propagates the piotlb
+ * invalidation to the host. The caller of this function should
+ * hold iommu_lock.
+ */
+static void vtd_flush_pasid_iotlb(gpointer key, gpointer value,
+                                  gpointer user_data)
+{
+    VTDPIOTLBInvInfo *piotlb_info = user_data;
+    VTDAddressSpace *vtd_as = value;
+    VTDPASIDCacheEntry *pc_entry = &vtd_as->pasid_cache_entry;
+    uint32_t pasid;
+    uint16_t did;
+
+    /* Replay only fills the pasid entry cache for passthrough devices */
+    if (!pc_entry->cache_filled ||
+        !vtd_pe_pgtt_is_flt(&pc_entry->pasid_entry)) {
+        return;
+    }
+
+    if (vtd_as_to_iommu_pasid_locked(vtd_as, &pasid)) {
+        return;
+    }
+
+    did = vtd_pe_get_did(&pc_entry->pasid_entry);
+
+    if (piotlb_info->domain_id == did && piotlb_info->pasid == pasid) {
+        vtd_invalidate_piotlb(vtd_as, piotlb_info->inv_data);
+    }
+}
+
+static void vtd_flush_pasid_iotlb_all(IntelIOMMUState *s,
+                                      uint16_t domain_id, uint32_t pasid,
+                                      hwaddr addr, uint64_t npages, bool ih)
+{
+    struct iommu_hwpt_vtd_s1_invalidate cache_info = { 0 };
+    VTDPIOTLBInvInfo piotlb_info;
+
+    cache_info.addr = addr;
+    cache_info.npages = npages;
+    cache_info.flags = ih ? IOMMU_VTD_INV_FLAGS_LEAF : 0;
+
+    piotlb_info.domain_id = domain_id;
+    piotlb_info.pasid = pasid;
+    piotlb_info.inv_data = &cache_info;
+
+    /*
+     * Loop over all the vtd_as instances in s->vtd_address_spaces
+     * to find the affected devices, since a piotlb invalidation
+     * should check the pasid cache from an architectural point of
+     * view.
+     */
+    g_hash_table_foreach(s->vtd_address_spaces,
+                         vtd_flush_pasid_iotlb, &piotlb_info);
+}
 #else
 static int vtd_bind_guest_pasid(VTDAddressSpace *vtd_as,
                                 VTDPASIDEntry *pe, VTDPASIDOp op)
 {
     return 0;
 }
+
+static void vtd_flush_pasid_iotlb_all(IntelIOMMUState *s,
+                                      uint16_t domain_id, uint32_t pasid,
+                                      hwaddr addr, uint64_t npages, bool ih)
+{
+}
 #endif
 
 /* Do a context-cache device-selective invalidation.
@@ -3591,6 +3689,13 @@ static void vtd_piotlb_pasid_invalidate(IntelIOMMUState *s,
     info.pasid = pasid;
 
     vtd_iommu_lock(s);
+    /*
+     * Loop over all the vtd_as instances in s->vtd_address_spaces
+     * to find the affected devices, since a piotlb invalidation
+     * should check the pasid cache from an architectural point of
+     * view.
+     */
+    vtd_flush_pasid_iotlb_all(s, domain_id, pasid, 0, (uint64_t)-1, 0);
+
     g_hash_table_foreach_remove(s->iotlb, vtd_hash_remove_by_pasid,
                                 &info);
     vtd_iommu_unlock(s);
@@ -3613,7 +3718,8 @@ static void vtd_piotlb_pasid_invalidate(IntelIOMMUState *s,
 }
 
 static void vtd_piotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
-                                       uint32_t pasid, hwaddr addr, uint8_t am)
+                                       uint32_t pasid, hwaddr addr, uint8_t am,
+                                       bool ih)
 {
     VTDIOTLBPageInvInfo info;
 
@@ -3623,6 +3729,13 @@ static void vtd_piotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
     info.mask = ~((1 << am) - 1);
 
     vtd_iommu_lock(s);
+    /*
+     * Loop over all the vtd_as instances in s->vtd_address_spaces
+     * to find the affected devices, since a piotlb invalidation
+     * should check the pasid cache from an architectural point of
+     * view.
+     */
+    vtd_flush_pasid_iotlb_all(s, domain_id, pasid, addr, 1 << am, ih);
+
     g_hash_table_foreach_remove(s->iotlb,
                                 vtd_hash_remove_by_page_piotlb, &info);
     vtd_iommu_unlock(s);
@@ -3656,7 +3769,8 @@ static bool vtd_process_piotlb_desc(IntelIOMMUState *s,
     case VTD_INV_DESC_PIOTLB_PSI_IN_PASID:
         am = VTD_INV_DESC_PIOTLB_AM(inv_desc->val[1]);
         addr = (hwaddr) VTD_INV_DESC_PIOTLB_ADDR(inv_desc->val[1]);
-        vtd_piotlb_page_invalidate(s, domain_id, pasid, addr, am);
+        vtd_piotlb_page_invalidate(s, domain_id, pasid, addr, am,
+                                   VTD_INV_DESC_PIOTLB_IH(inv_desc->val[1]));
         break;
 
     default:
-- 
2.34.1




* [PATCH rfcv3 19/21] intel_iommu: Refresh pasid bind when either SRTP or TE bit is changed
  2025-05-21 11:14 [PATCH rfcv3 00/21] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
                   ` (17 preceding siblings ...)
  2025-05-21 11:14 ` [PATCH rfcv3 18/21] intel_iommu: Propagate PASID-based iotlb invalidation to host Zhenzhong Duan
@ 2025-05-21 11:14 ` Zhenzhong Duan
  2025-05-21 11:14 ` [PATCH rfcv3 20/21] intel_iommu: Bypass replay in stage-1 page table mode Zhenzhong Duan
                   ` (2 subsequent siblings)
  21 siblings, 0 replies; 63+ messages in thread
From: Zhenzhong Duan @ 2025-05-21 11:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan, Marcel Apfelbaum, Paolo Bonzini,
	Richard Henderson, Eduardo Habkost

From: Yi Liu <yi.l.liu@intel.com>

When either the 'Set Root Table Pointer' or 'Translation Enable' bit is
changed, the pasid bindings on the host side become stale and need to be
updated.

Introduce a helper function vtd_refresh_pasid_bind() for that purpose.

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/i386/intel_iommu.c | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index bb21060d7e..b8f3b8effa 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -89,6 +89,7 @@ struct vtd_iotlb_key {
 
 static void vtd_address_space_refresh_all(IntelIOMMUState *s);
 static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
+static void vtd_refresh_pasid_bind(IntelIOMMUState *s);
 
 static void vtd_pasid_cache_reset_locked(IntelIOMMUState *s);
 static void vtd_pasid_cache_sync(IntelIOMMUState *s,
@@ -3362,6 +3363,7 @@ static void vtd_handle_gcmd_srtp(IntelIOMMUState *s)
     vtd_set_clear_mask_long(s, DMAR_GSTS_REG, 0, VTD_GSTS_RTPS);
     vtd_reset_caches(s);
     vtd_address_space_refresh_all(s);
+    vtd_refresh_pasid_bind(s);
 }
 
 /* Set Interrupt Remap Table Pointer */
@@ -3396,6 +3398,7 @@ static void vtd_handle_gcmd_te(IntelIOMMUState *s, bool en)
 
     vtd_reset_caches(s);
     vtd_address_space_refresh_all(s);
+    vtd_refresh_pasid_bind(s);
 }
 
 /* Handle Interrupt Remap Enable/Disable */
@@ -4111,6 +4114,26 @@ static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
     }
 }
 
+static void vtd_refresh_pasid_bind(IntelIOMMUState *s)
+{
+    VTDPASIDCacheInfo pc_info = { .error_happened = false,
+                                  .type = VTD_PASID_CACHE_GLOBAL_INV };
+
+    /*
+     * Pasid bindings should be replayed only when dmar is enabled;
+     * otherwise there is no need to replay.
+     */
+    if (!s->dmar_enabled) {
+        return;
+    }
+
+    if (!s->flts || !s->root_scalable) {
+        return;
+    }
+
+    vtd_replay_guest_pasid_bindings(s, &pc_info);
+}
+
 /*
  * This function syncs the pasid bindings between guest and host.
  * It includes updating the pasid cache in vIOMMU and updating the
-- 
2.34.1




* [PATCH rfcv3 20/21] intel_iommu: Bypass replay in stage-1 page table mode
  2025-05-21 11:14 [PATCH rfcv3 00/21] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
                   ` (18 preceding siblings ...)
  2025-05-21 11:14 ` [PATCH rfcv3 19/21] intel_iommu: Refresh pasid bind when either SRTP or TE bit is changed Zhenzhong Duan
@ 2025-05-21 11:14 ` Zhenzhong Duan
  2025-05-21 11:14 ` [PATCH rfcv3 21/21] intel_iommu: Enable host device when x-flts=on in scalable mode Zhenzhong Duan
  2025-05-26 12:19 ` [PATCH rfcv3 00/21] intel_iommu: Enable stage-1 translation for passthrough device Cédric Le Goater
  21 siblings, 0 replies; 63+ messages in thread
From: Zhenzhong Duan @ 2025-05-21 11:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan, Marcel Apfelbaum, Paolo Bonzini,
	Richard Henderson, Eduardo Habkost

VFIO utilizes replay to set up the initial shadow IOMMU mappings.
But when a stage-1 page table is configured, it is passed to the
host to construct a nested page table, so no replay is needed.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/i386/intel_iommu.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index b8f3b8effa..e7c662f609 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -5768,6 +5768,14 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
     VTDContextEntry ce;
     DMAMap map = { .iova = 0, .size = HWADDR_MAX };
 
+    /*
+     * Replay on a stage-1 page table is meaningless, as the stage-1
+     * page table is passed through to the host to construct the
+     * nested page table.
+     */
+    if (s->flts && s->root_scalable) {
+        return;
+    }
+
     /* replay is protected by BQL, page walk will re-setup it safely */
     iova_tree_remove(vtd_as->iova_tree, map);
 
-- 
2.34.1




* [PATCH rfcv3 21/21] intel_iommu: Enable host device when x-flts=on in scalable mode
  2025-05-21 11:14 [PATCH rfcv3 00/21] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
                   ` (19 preceding siblings ...)
  2025-05-21 11:14 ` [PATCH rfcv3 20/21] intel_iommu: Bypass replay in stage-1 page table mode Zhenzhong Duan
@ 2025-05-21 11:14 ` Zhenzhong Duan
  2025-05-26 12:19 ` [PATCH rfcv3 00/21] intel_iommu: Enable stage-1 translation for passthrough device Cédric Le Goater
  21 siblings, 0 replies; 63+ messages in thread
From: Zhenzhong Duan @ 2025-05-21 11:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan, Paolo Bonzini, Richard Henderson, Eduardo Habkost,
	Marcel Apfelbaum

Now that all the infrastructure for running a passthrough device with
stage-1 translation is in place, enable it.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/i386/intel_iommu.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index e7c662f609..c64bd9506e 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -5608,8 +5608,7 @@ static bool vtd_check_hiod(IntelIOMMUState *s, VTDHostIOMMUDevice *vtd_hiod,
     ret = hiodc->get_cap(hiod, HOST_IOMMU_DEVICE_CAP_ERRATA, errp);
     vtd_hiod->errata = ret;
 
-    error_setg(errp, "host device is uncompatible with stage-1 translation");
-    return false;
+    return true;
 }
 
 static bool vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
-- 
2.34.1




* Re: [PATCH rfcv3 05/21] vfio/iommufd: Save vendor specific device info
  2025-05-21 11:14 ` [PATCH rfcv3 05/21] vfio/iommufd: Save vendor specific device info Zhenzhong Duan
@ 2025-05-21 21:57   ` Nicolin Chen
  2025-05-22  9:21     ` Duan, Zhenzhong
  2025-05-26 12:15   ` Cédric Le Goater
  1 sibling, 1 reply; 63+ messages in thread
From: Nicolin Chen @ 2025-05-21 21:57 UTC (permalink / raw)
  To: Zhenzhong Duan
  Cc: qemu-devel, alex.williamson, clg, eric.auger, mst, jasowang,
	peterx, ddutile, jgg, shameerali.kolothum.thodi, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng

On Wed, May 21, 2025 at 07:14:35PM +0800, Zhenzhong Duan wrote:
> @@ -852,6 +853,17 @@ static bool hiod_iommufd_vfio_realize(HostIOMMUDevice *hiod, void *opaque,
>      caps->type = type;
>      caps->hw_caps = hw_caps;
>  
> +    switch (type) {
> +    case IOMMU_HW_INFO_TYPE_INTEL_VTD:
> +        vendor_caps->vtd.flags = data.vtd.flags;
> +        vendor_caps->vtd.cap_reg = data.vtd.cap_reg;
> +        vendor_caps->vtd.ecap_reg = data.vtd.ecap_reg;
> +        break;
> +    case IOMMU_HW_INFO_TYPE_ARM_SMMUV3:
> +    case IOMMU_HW_INFO_TYPE_NONE:

Should this be a part of hiod_iommufd_get_vendor_cap() in backends?

Thanks
Nicolin



* Re: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to host
  2025-05-21 11:14 ` [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to host Zhenzhong Duan
@ 2025-05-21 22:49   ` Nicolin Chen
  2025-05-22  6:50     ` Duan, Zhenzhong
  2025-05-23  6:22     ` Yi Liu
  0 siblings, 2 replies; 63+ messages in thread
From: Nicolin Chen @ 2025-05-21 22:49 UTC (permalink / raw)
  To: Zhenzhong Duan
  Cc: qemu-devel, alex.williamson, clg, eric.auger, mst, jasowang,
	peterx, ddutile, jgg, shameerali.kolothum.thodi, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng, Yi Sun,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost

On Wed, May 21, 2025 at 07:14:45PM +0800, Zhenzhong Duan wrote:
> +static const MemoryListener iommufd_s2domain_memory_listener = {
> +    .name = "iommufd_s2domain",
> +    .priority = 1000,
> +    .region_add = iommufd_listener_region_add_s2domain,
> +    .region_del = iommufd_listener_region_del_s2domain,
> +};

Would you mind elaborating when and how vtd does all S2 mappings?

On ARM, the default vfio_memory_listener could capture the entire
guest RAM and add to the address space. So what we do is basically
reusing the vfio_memory_listener:
https://lore.kernel.org/qemu-devel/20250311141045.66620-13-shameerali.kolothum.thodi@huawei.com/
 
The thing is that when a VFIO device is attached to the container
upon a nesting configuration, the ->get_address_space op should
return the system address space as S1 nested HWPT isn't allocated
yet. Then all the iommu as routines in vfio_listener_region_add()
would be skipped, ending up with mapping the guest RAM in S2 HWPT
correctly. Not until the S1 nested HWPT is allocated by the guest
OS (after guest boots), can the ->get_address_space op return the
iommu address space.

With this address space shift, S2 mappings can be simply captured
and done by vfio_memory_listener. Then, such an s2domain listener
would be largely redundant.

So the second question is:
Does vtd have to own this iommufd_s2domain_memory_listener? IOW,
does vtd_host_dma_iommu() have to return the iommu address space
all the time?

> +static int vtd_create_s1_hwpt(HostIOMMUDeviceIOMMUFD *idev,
> +                              VTDS2Hwpt *s2_hwpt, VTDHwpt *hwpt,
> +                              VTDPASIDEntry *pe, Error **errp)
> +{
> +    struct iommu_hwpt_vtd_s1 vtd;
> +    uint32_t hwpt_id, s2_hwpt_id = s2_hwpt->hwpt_id;
> +
> +    vtd_init_s1_hwpt_data(&vtd, pe);
> +
> +    if (!iommufd_backend_alloc_hwpt(idev->iommufd, idev->devid,
> +                                    s2_hwpt_id, 0, IOMMU_HWPT_DATA_VTD_S1,
> +                                    sizeof(vtd), &vtd, &hwpt_id, errp)) {
> +        return -EINVAL;
> +    }
> +
> +    hwpt->hwpt_id = hwpt_id;
> +
> +    return 0;
> +}
> +
> +static void vtd_destroy_s1_hwpt(HostIOMMUDeviceIOMMUFD *idev, VTDHwpt *hwpt)
> +{
> +    iommufd_backend_free_id(idev->iommufd, hwpt->hwpt_id);
> +}

I think you did some substantial work to isolate the get_hw_info
part inside the iommufd backend code, which looks nice and clean
as the vIOMMU code simply does iodc->get_cap().

However, that then makes these direct raw backend function calls
very awkward :-/

In my view, the way to make sense is either:
* We don't do any isolation, but just call raw backend functions
  in vIOMMU code
* We do every isolation, and never call raw backend functions in
  vIOMMU code
?

Thanks
Nicolin



* RE: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to host
  2025-05-21 22:49   ` Nicolin Chen
@ 2025-05-22  6:50     ` Duan, Zhenzhong
  2025-05-22 19:29       ` Nicolin Chen
  2025-05-23  6:22     ` Yi Liu
  1 sibling, 1 reply; 63+ messages in thread
From: Duan, Zhenzhong @ 2025-05-22  6:50 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: qemu-devel@nongnu.org, alex.williamson@redhat.com, clg@redhat.com,
	eric.auger@redhat.com, mst@redhat.com, jasowang@redhat.com,
	peterx@redhat.com, ddutile@redhat.com, jgg@nvidia.com,
	shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
	clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
	Peng, Chao P, Yi Sun, Marcel Apfelbaum, Paolo Bonzini,
	Richard Henderson, Eduardo Habkost



>-----Original Message-----
>From: Nicolin Chen <nicolinc@nvidia.com>
>Subject: Re: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to
>host
>
>On Wed, May 21, 2025 at 07:14:45PM +0800, Zhenzhong Duan wrote:
>> +static const MemoryListener iommufd_s2domain_memory_listener = {
>> +    .name = "iommufd_s2domain",
>> +    .priority = 1000,
>> +    .region_add = iommufd_listener_region_add_s2domain,
>> +    .region_del = iommufd_listener_region_del_s2domain,
>> +};
>
>Would you mind elaborating When and how vtd does all S2 mappings?

When the guest triggers a pasid cache invalidation, vIOMMU will attach the
device to the stage-2 page table if the guest's PGTT=PT, or to a nested
page table if PGTT=Stage1. All these page tables are dynamically created
during attach. We don't use VFIO's shadow page table. The S2 mappings are
also created during attach. See:

vtd_device_attach_iommufd()
{
...
    vtd_device_attach_container();
    container->listener = iommufd_s2domain_memory_listener;
    memory_listener_register(&container->listener, &address_space_memory);
...
}

>
>On ARM, the default vfio_memory_listener could capture the entire
>guest RAM and add to the address space. So what we do is basically
>reusing the vfio_memory_listener:
>https://lore.kernel.org/qemu-devel/20250311141045.66620-13-
>shameerali.kolothum.thodi@huawei.com/
>
>The thing is that when a VFIO device is attached to the container
>upon a nesting configuration, the ->get_address_space op should
>return the system address space as S1 nested HWPT isn't allocated
>yet. Then all the iommu as routines in vfio_listener_region_add()
>would be skipped, ending up with mapping the guest RAM in S2 HWPT
>correctly. Not until the S1 nested HWPT is allocated by the guest
>OS (after guest boots), can the ->get_address_space op return the
>iommu address space.

When the S1 hwpt is allocated by the guest, who will notify VFIO to call
the ->get_address_space op again to get the iommu address space?

>
>With this address space shift, S2 mappings can be simply captured
>and done by vfio_memory_listener. Then, such an s2domain listener
>would be largely redundant.

I didn't get how ARM SMMU supports switching address spaces. Will VFIO call
->get_address_space() twice, once to get the system address space and once
for the iommu address space?

>
>So the second question is:
>Does vtd have to own this iommufd_s2domain_memory_listener? IOW,
>does vtd_host_dma_iommu() have to return the iommu address space
>all the time?

Vtd only supports returning a fixed address space; under that address space,
either the system memory region or the iommu memory region can be enabled.

>
>> +static int vtd_create_s1_hwpt(HostIOMMUDeviceIOMMUFD *idev,
>> +                              VTDS2Hwpt *s2_hwpt, VTDHwpt *hwpt,
>> +                              VTDPASIDEntry *pe, Error **errp)
>> +{
>> +    struct iommu_hwpt_vtd_s1 vtd;
>> +    uint32_t hwpt_id, s2_hwpt_id = s2_hwpt->hwpt_id;
>> +
>> +    vtd_init_s1_hwpt_data(&vtd, pe);
>> +
>> +    if (!iommufd_backend_alloc_hwpt(idev->iommufd, idev->devid,
>> +                                    s2_hwpt_id, 0, IOMMU_HWPT_DATA_VTD_S1,
>> +                                    sizeof(vtd), &vtd, &hwpt_id, errp)) {
>> +        return -EINVAL;
>> +    }
>> +
>> +    hwpt->hwpt_id = hwpt_id;
>> +
>> +    return 0;
>> +}
>> +
>> +static void vtd_destroy_s1_hwpt(HostIOMMUDeviceIOMMUFD *idev,
>VTDHwpt *hwpt)
>> +{
>> +    iommufd_backend_free_id(idev->iommufd, hwpt->hwpt_id);
>> +}
>
>I think you did some substantial work to isolate the get_hw_info
>part inside the iommufd backend code, which looks nice and clean
>as the vIOMMU code simply does iodc->get_cap().
>
>However, that then makes these direct raw backend function calls
>very awkward :-/
>
>In my view, the way to make sense is either:
>* We don't do any isolation, but just call raw backend functions
>  in vIOMMU code
>* We do every isolation, and never call raw backend functions in
>  vIOMMU code

The iommufd backend functions are general-purpose for all modules, including
vIOMMU. I don't think we are blocking vIOMMU from calling raw backend functions.
We just provide a general interface for querying capabilities; not all capabilities
come from the iommufd get_hw_info result, e.g., aw_bits.
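A minimal sketch of such a query interface, mixing a capability derived from iommufd hw_info with one that never passes through get_hw_info. The cap IDs, flag bit, and field names are made up for illustration:

```c
/* Sketch: one get_cap()-style entry point over capabilities from two
 * different sources. Illustrative names, not the QEMU HostIOMMUDevice API. */
#include <assert.h>
#include <errno.h>
#include <stdint.h>

enum { CAP_NESTING = 0, CAP_AW_BITS = 1 }; /* hypothetical cap IDs */

#define HW_CAP_NESTING (1ULL << 0)         /* hypothetical hw_caps bit */

typedef struct {
    uint64_t hw_caps; /* filled from the iommufd get_hw_info result */
    uint8_t  aw_bits; /* filled from the QEMU command line, not iommufd */
} CapsModel;

static int model_get_cap(const CapsModel *caps, int cap)
{
    switch (cap) {
    case CAP_NESTING:
        return !!(caps->hw_caps & HW_CAP_NESTING);
    case CAP_AW_BITS:
        return caps->aw_bits;
    default:
        return -EINVAL;
    }
}
```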

Zhenzhong




^ permalink raw reply	[flat|nested] 63+ messages in thread

* RE: [PATCH rfcv3 05/21] vfio/iommufd: Save vendor specific device info
  2025-05-21 21:57   ` Nicolin Chen
@ 2025-05-22  9:21     ` Duan, Zhenzhong
  2025-05-22 19:35       ` Nicolin Chen
  0 siblings, 1 reply; 63+ messages in thread
From: Duan, Zhenzhong @ 2025-05-22  9:21 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: qemu-devel@nongnu.org, alex.williamson@redhat.com, clg@redhat.com,
	eric.auger@redhat.com, mst@redhat.com, jasowang@redhat.com,
	peterx@redhat.com, ddutile@redhat.com, jgg@nvidia.com,
	shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
	clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
	Peng, Chao P



>-----Original Message-----
>From: Nicolin Chen <nicolinc@nvidia.com>
>Subject: Re: [PATCH rfcv3 05/21] vfio/iommufd: Save vendor specific device info
>
>On Wed, May 21, 2025 at 07:14:35PM +0800, Zhenzhong Duan wrote:
>> @@ -852,6 +853,17 @@ static bool
>hiod_iommufd_vfio_realize(HostIOMMUDevice *hiod, void *opaque,
>>      caps->type = type;
>>      caps->hw_caps = hw_caps;
>>
>> +    switch (type) {
>> +    case IOMMU_HW_INFO_TYPE_INTEL_VTD:
>> +        vendor_caps->vtd.flags = data.vtd.flags;
>> +        vendor_caps->vtd.cap_reg = data.vtd.cap_reg;
>> +        vendor_caps->vtd.ecap_reg = data.vtd.ecap_reg;
>> +        break;
>> +    case IOMMU_HW_INFO_TYPE_ARM_SMMUV3:
>> +    case IOMMU_HW_INFO_TYPE_NONE:
>
>Should this be a part of hiod_iommufd_get_vendor_cap() in backends?

Made following	adjustments which save raw data in VendorCaps,
let me know if it matches your thought.

diff --git a/include/system/host_iommu_device.h b/include/system/host_iommu_device.h
index 38070aff09..14cda4fdc3 100644
--- a/include/system/host_iommu_device.h
+++ b/include/system/host_iommu_device.h
@@ -14,16 +14,12 @@

 #include "qom/object.h"
 #include "qapi/error.h"
-
-/* This is mirror of struct iommu_hw_info_vtd */
-typedef struct Vtd_Caps {
-    uint32_t flags;
-    uint64_t cap_reg;
-    uint64_t ecap_reg;
-} Vtd_Caps;
+#ifdef CONFIG_LINUX
+#include "linux/iommufd.h"

 typedef union VendorCaps {
-    Vtd_Caps vtd;
+    struct iommu_hw_info_vtd vtd;
+    struct iommu_hw_info_arm_smmuv3 smmuv3;
 } VendorCaps;

 /**
@@ -43,6 +39,7 @@ typedef struct HostIOMMUDeviceCaps {
     uint64_t hw_caps;
     VendorCaps vendor_caps;
 } HostIOMMUDeviceCaps;
+#endif

 #define TYPE_HOST_IOMMU_DEVICE "host-iommu-device"
 OBJECT_DECLARE_TYPE(HostIOMMUDevice, HostIOMMUDeviceClass, HOST_IOMMU_DEVICE)
@@ -54,7 +51,9 @@ struct HostIOMMUDevice {
     void *agent; /* pointer to agent device, ie. VFIO or VDPA device */
     PCIBus *aliased_bus;
     int aliased_devfn;
+#ifdef CONFIG_LINUX
     HostIOMMUDeviceCaps caps;
+#endif
 };

 /**
diff --git a/backends/iommufd.c b/backends/iommufd.c
index d91c1eb8b8..63209659f3 100644
--- a/backends/iommufd.c
+++ b/backends/iommufd.c
@@ -368,7 +368,7 @@ bool host_iommu_device_iommufd_detach_hwpt(HostIOMMUDeviceIOMMUFD *idev,
 static int hiod_iommufd_get_vtd_cap(HostIOMMUDevice *hiod, int cap,
                                     Error **errp)
 {
-    Vtd_Caps *caps = &hiod->caps.vendor_caps.vtd;
+    struct iommu_hw_info_vtd *caps = &hiod->caps.vendor_caps.vtd;

     switch (cap) {
     case HOST_IOMMU_DEVICE_CAP_NESTING:
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index 5c740222e5..fbf47cab09 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -836,15 +836,12 @@ static bool hiod_iommufd_vfio_realize(HostIOMMUDevice *hiod, void *opaque,
     HostIOMMUDeviceCaps *caps = &hiod->caps;
     VendorCaps *vendor_caps = &caps->vendor_caps;
     enum iommu_hw_info_type type;
-    union {
-        struct iommu_hw_info_vtd vtd;
-    } data;
     uint64_t hw_caps;

     hiod->agent = opaque;

-    if (!iommufd_backend_get_device_info(vdev->iommufd, vdev->devid,
-                                         &type, &data, sizeof(data),
+    if (!iommufd_backend_get_device_info(vdev->iommufd, vdev->devid, &type,
+                                         vendor_caps, sizeof(*vendor_caps),
                                          &hw_caps, errp)) {
         return false;
     }
@@ -853,17 +850,6 @@ static bool hiod_iommufd_vfio_realize(HostIOMMUDevice *hiod, void *opaque,
     caps->type = type;
     caps->hw_caps = hw_caps;

-    switch (type) {
-    case IOMMU_HW_INFO_TYPE_INTEL_VTD:
-        vendor_caps->vtd.flags = data.vtd.flags;
-        vendor_caps->vtd.cap_reg = data.vtd.cap_reg;
-        vendor_caps->vtd.ecap_reg = data.vtd.ecap_reg;
-        break;
-    case IOMMU_HW_INFO_TYPE_ARM_SMMUV3:
-    case IOMMU_HW_INFO_TYPE_NONE:
-        break;
-    }
-
     idev = HOST_IOMMU_DEVICE_IOMMUFD(hiod);
     idev->iommufd = vdev->iommufd;
     idev->devid = vdev->devid;



* Re: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to host
  2025-05-22  6:50     ` Duan, Zhenzhong
@ 2025-05-22 19:29       ` Nicolin Chen
  2025-05-23  6:26         ` Yi Liu
  2025-05-26  3:34         ` Duan, Zhenzhong
  0 siblings, 2 replies; 63+ messages in thread
From: Nicolin Chen @ 2025-05-22 19:29 UTC (permalink / raw)
  To: Duan, Zhenzhong
  Cc: qemu-devel@nongnu.org, alex.williamson@redhat.com, clg@redhat.com,
	eric.auger@redhat.com, mst@redhat.com, jasowang@redhat.com,
	peterx@redhat.com, ddutile@redhat.com, jgg@nvidia.com,
	shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
	clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
	Peng, Chao P, Yi Sun, Marcel Apfelbaum, Paolo Bonzini,
	Richard Henderson, Eduardo Habkost

On Thu, May 22, 2025 at 06:50:42AM +0000, Duan, Zhenzhong wrote:
> 
> 
> >-----Original Message-----
> >From: Nicolin Chen <nicolinc@nvidia.com>
> >Subject: Re: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to
> >host
> >
> >On Wed, May 21, 2025 at 07:14:45PM +0800, Zhenzhong Duan wrote:
> >> +static const MemoryListener iommufd_s2domain_memory_listener = {
> >> +    .name = "iommufd_s2domain",
> >> +    .priority = 1000,
> >> +    .region_add = iommufd_listener_region_add_s2domain,
> >> +    .region_del = iommufd_listener_region_del_s2domain,
> >> +};
> >
> >Would you mind elaborating When and how vtd does all S2 mappings?
> 
> When the guest triggers a PASID cache invalidation, vIOMMU will attach the device
> to the stage-2 page table if the guest's PGTT=PT, or to a nested page table if PGTT=Stage1.
> All these page tables are dynamically created during attach. We don't use
> VFIO's shadow page table. The S2 mappings are also created during attach.

OK. That I can understand.

Next question: what does VTD actually map onto the S2 page table?
The entire guest RAM? Or just a part of that?

On ARM, the VFIO listener would capture the entire RAM, and map it
on S2 page table. I wonder if VTD would do the same.

> >On ARM, the default vfio_memory_listener could capture the entire
> >guest RAM and add to the address space. So what we do is basically
> >reusing the vfio_memory_listener:
> >https://lore.kernel.org/qemu-devel/20250311141045.66620-13-
> >shameerali.kolothum.thodi@huawei.com/
> >
> >The thing is that when a VFIO device is attached to the container
> >upon a nesting configuration, the ->get_address_space op should
> >return the system address space as S1 nested HWPT isn't allocated
> >yet. Then all the iommu as routines in vfio_listener_region_add()
> >would be skipped, ending up with mapping the guest RAM in S2 HWPT
> >correctly. Not until the S1 nested HWPT is allocated by the guest
> >OS (after guest boots), can the ->get_address_space op return the
> >iommu address space.
> 
> When S1 hwpt is allocated by guest, who will notify VFIO to call
> ->get_address_space op() again to get iommu address space?

Hmm, would you please elaborate on why VFIO needs to call that again?

I can see that VFIO creates the MAP/UNMAP notifiers for an iommu address
space. However, a device operating in the nested translation mode
should go through the IOMMU HW for these two:
 - the S1 page table (MAP) will be created by the guest OS
 - S1 invalidations (UNMAP) will be issued by the guest OS, and then
   trapped by QEMU and forwarded via the HWPT uAPI to the host kernel

As you mentioned, there is no need for a shadow page table in this
mode. What else does VT-d need from an iommu address space?
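The trapped-UNMAP path can be sketched as repackaging a guest IOTLB invalidation into an entry for the IOMMU_HWPT_INVALIDATE uAPI. The struct below is a local mirror of what the kernel's iommu_hwpt_vtd_s1_invalidate layout looks like; check linux/iommufd.h for the authoritative definition:

```c
/* Sketch: turn a trapped guest S1 IOTLB invalidation into an uAPI entry.
 * Local mirror of the assumed iommu_hwpt_vtd_s1_invalidate layout. */
#include <assert.h>
#include <stdint.h>

#define INV_FLAGS_LEAF (1u << 0) /* mirrors IOMMU_VTD_INV_FLAGS_LEAF */

struct vtd_s1_inv_entry {
    uint64_t addr;
    uint64_t npages;
    uint32_t flags;
    uint32_t reserved;
};

/* VT-d IOTLB descriptors encode the range as 2^am pages starting at addr;
 * the ih bit means only leaf entries need invalidating. */
static void fill_s1_inv_entry(struct vtd_s1_inv_entry *e,
                              uint64_t addr, unsigned am, int ih)
{
    e->addr = addr;
    e->npages = 1ULL << am;
    e->flags = ih ? INV_FLAGS_LEAF : 0;
    e->reserved = 0;
}
```

The filled entries would then be batched and handed to the invalidation helper over the host's HWPT, rather than being shadowed in QEMU.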

On ARM, the only reason that we shift address space, is for KVM to
inject MSI, as it only has the gIOVA and requires the iommu address
space to translate that to gPA. Refer to kvm_arch_fixup_msi_route()
in target/arm/kvm.c where it calls pci_device_iommu_address_space.

> >With this address space shift, S2 mappings can be simply captured
> >and done by vfio_memory_listener. Then, such an s2domain listener
> >would be largely redundant.
> 
> I didn't get how arm smmu supports switching address space, will VFIO call
> ->get_address_space() twice, once to get system address space and the other
> for iommu address space?

The set_iommu_device() attaches the device to a stage-2 page table
by default, indicating that the device works in the S1 passthrough
mode (for VT-d, that's PGTT=PT) at VM creation. And this is where
the system address space should be returned by get_address_space().

If the guest kernel sets an S1 Translate mode for the device (for
VT-d, that's PGTT=Stage1), QEMU would trap that and allocate an S1
HWPT for the device to attach to. Starting from here, get_address_space()
can return the iommu address space -- on ARM, we only need it for
KVM to translate MSI.

If the guest kernel sets an S1 Bypass mode for the device (for VT-d,
that's PGTT=PT), the device would continue to stay in the system AS,
i.e. get_address_space() wouldn't need to switch.
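The mode-dependent switch described above reduces to a small decision. This toy model (enum and function names are illustrative, not actual SMMUv3 code; the modes borrow the VT-d PGTT naming) captures it:

```c
/* Toy of the ARM-side address-space choice: system AS until the guest
 * programs an S1 Translate mode, then the iommu AS (needed so KVM can
 * translate gIOVA MSI doorbells). Illustrative names only. */
#include <assert.h>

typedef enum { PGTT_PT, PGTT_STAGE1 } GuestMode;
typedef enum { AS_SYSTEM, AS_IOMMU } WhichAS;

static WhichAS model_get_address_space(GuestMode mode)
{
    /* Bypass/passthrough: stay in the system AS so the VFIO listener maps
     * guest RAM into the S2 HWPT; Translate: expose the iommu AS. */
    return mode == PGTT_STAGE1 ? AS_IOMMU : AS_SYSTEM;
}
```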

> >> +static int vtd_create_s1_hwpt(HostIOMMUDeviceIOMMUFD *idev,
> >> +                              VTDS2Hwpt *s2_hwpt, VTDHwpt *hwpt,
> >> +                              VTDPASIDEntry *pe, Error **errp)
> >> +{
> >> +    struct iommu_hwpt_vtd_s1 vtd;
> >> +    uint32_t hwpt_id, s2_hwpt_id = s2_hwpt->hwpt_id;
> >> +
> >> +    vtd_init_s1_hwpt_data(&vtd, pe);
> >> +
> >> +    if (!iommufd_backend_alloc_hwpt(idev->iommufd, idev->devid,
> >> +                                    s2_hwpt_id, 0, IOMMU_HWPT_DATA_VTD_S1,
> >> +                                    sizeof(vtd), &vtd, &hwpt_id, errp)) {
> >> +        return -EINVAL;
> >> +    }
> >> +
> >> +    hwpt->hwpt_id = hwpt_id;
> >> +
> >> +    return 0;
> >> +}
> >> +
> >> +static void vtd_destroy_s1_hwpt(HostIOMMUDeviceIOMMUFD *idev,
> >VTDHwpt *hwpt)
> >> +{
> >> +    iommufd_backend_free_id(idev->iommufd, hwpt->hwpt_id);
> >> +}
> >
> >I think you did some substantial work to isolate the get_hw_info
> >part inside the iommufd backend code, which looks nice and clean
> >as the vIOMMU code simply does iodc->get_cap().
> >
> >However, that then makes these direct raw backend function calls
> >very awkward :-/
> >
> >In my view, the way to make sense is either:
> >* We don't do any isolation, but just call raw backend functions
> >  in vIOMMU code
> >* We do every isolation, and never call raw backend functions in
> >  vIOMMU code
> 
> Iommufd backend functions are general for all modules usage including
> vIOMMU. I think we are not blocking vIOMMU calling raw backend functions.

Well, I am not against doing that. Calling the raw iommufd APIs is
easier :)

> We just provide a general interface for querying capabilities, not all capabilities
> are from iommufd get_hw_info result, e.g., aw_bits.

The question is why we need a general interface for get_hw_info(), yet
not for the other iommufd APIs.

IIRC, Cedric suggested a general interface for get_hw_info so the
vIOMMU code can be unaware of the underlying source (it doesn't matter
whether it's from iommufd or from vfio).

I see aw_bits comes from the QEMU command line. Mind elaborating?

Thanks
Nicolin



* Re: [PATCH rfcv3 05/21] vfio/iommufd: Save vendor specific device info
  2025-05-22  9:21     ` Duan, Zhenzhong
@ 2025-05-22 19:35       ` Nicolin Chen
  0 siblings, 0 replies; 63+ messages in thread
From: Nicolin Chen @ 2025-05-22 19:35 UTC (permalink / raw)
  To: Duan, Zhenzhong
  Cc: qemu-devel@nongnu.org, alex.williamson@redhat.com, clg@redhat.com,
	eric.auger@redhat.com, mst@redhat.com, jasowang@redhat.com,
	peterx@redhat.com, ddutile@redhat.com, jgg@nvidia.com,
	shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
	clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
	Peng, Chao P

On Thu, May 22, 2025 at 09:21:04AM +0000, Duan, Zhenzhong wrote:
> 
> 
> >-----Original Message-----
> >From: Nicolin Chen <nicolinc@nvidia.com>
> >Subject: Re: [PATCH rfcv3 05/21] vfio/iommufd: Save vendor specific device info
> >
> >On Wed, May 21, 2025 at 07:14:35PM +0800, Zhenzhong Duan wrote:
> >> @@ -852,6 +853,17 @@ static bool
> >hiod_iommufd_vfio_realize(HostIOMMUDevice *hiod, void *opaque,
> >>      caps->type = type;
> >>      caps->hw_caps = hw_caps;
> >>
> >> +    switch (type) {
> >> +    case IOMMU_HW_INFO_TYPE_INTEL_VTD:
> >> +        vendor_caps->vtd.flags = data.vtd.flags;
> >> +        vendor_caps->vtd.cap_reg = data.vtd.cap_reg;
> >> +        vendor_caps->vtd.ecap_reg = data.vtd.ecap_reg;
> >> +        break;
> >> +    case IOMMU_HW_INFO_TYPE_ARM_SMMUV3:
> >> +    case IOMMU_HW_INFO_TYPE_NONE:
> >
> >Should this be a part of hiod_iommufd_get_vendor_cap() in backends?
> 
> Made following	adjustments which save raw data in VendorCaps,
> let me know if it matches your thought.

Yea, LGTM. The point is that we keep all vendor structure decoding
inside the backend, so VFIO wouldn't need to care about the types
or what's inside the data.

Thanks
Nicolin



* Re: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to host
  2025-05-21 22:49   ` Nicolin Chen
  2025-05-22  6:50     ` Duan, Zhenzhong
@ 2025-05-23  6:22     ` Yi Liu
  2025-05-23  6:52       ` Duan, Zhenzhong
  2025-05-23 21:12       ` Nicolin Chen
  1 sibling, 2 replies; 63+ messages in thread
From: Yi Liu @ 2025-05-23  6:22 UTC (permalink / raw)
  To: Nicolin Chen, Zhenzhong Duan
  Cc: qemu-devel, alex.williamson, clg, eric.auger, mst, jasowang,
	peterx, ddutile, jgg, shameerali.kolothum.thodi, joao.m.martins,
	clement.mathieu--drif, kevin.tian, chao.p.peng, Yi Sun,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost

Hey Nic,

On 2025/5/22 06:49, Nicolin Chen wrote:
> On Wed, May 21, 2025 at 07:14:45PM +0800, Zhenzhong Duan wrote:
>> +static const MemoryListener iommufd_s2domain_memory_listener = {
>> +    .name = "iommufd_s2domain",
>> +    .priority = 1000,
>> +    .region_add = iommufd_listener_region_add_s2domain,
>> +    .region_del = iommufd_listener_region_del_s2domain,
>> +};
> 
> Would you mind elaborating When and how vtd does all S2 mappings?
> 
> On ARM, the default vfio_memory_listener could capture the entire
> guest RAM and add to the address space. So what we do is basically
> reusing the vfio_memory_listener:
> https://lore.kernel.org/qemu-devel/20250311141045.66620-13-shameerali.kolothum.thodi@huawei.com/

In concept, yes, all the guest RAM. But due to an erratum, we need
to skip the RO mappings.

> The thing is that when a VFIO device is attached to the container
> upon a nesting configuration, the ->get_address_space op should
> return the system address space as S1 nested HWPT isn't allocated
> yet. Then all the iommu as routines in vfio_listener_region_add()
> would be skipped, ending up with mapping the guest RAM in S2 HWPT
> correctly. Not until the S1 nested HWPT is allocated by the guest
> OS (after guest boots), can the ->get_address_space op return the
> iommu address space.

This seems a bit different between the ARM and VT-d emulations. The VT-d
emulation code returns the iommu address space regardless of what
translation mode the guest configured. But the MR of the address space
has two overlapped subregions: one is nodmar, the other is iommu.
As the naming shows, nodmar is aliased to the system MR. Before
the guest enables the iommu and sets PGTT to a non-PT mode (e.g. S1 or S2),
the effective MR alias is nodmar, hence the mappings this address
space holds are the GPA mappings in the beginning. If the guest sets PGTT
to S2, then the iommu MR is enabled, hence they are gIOVA mappings
accordingly. So in the VT-d emulation, the address space switch is really
an MR alias switch.

In this series, we mainly want to support the S1 translation type for the
guest. It is based on nested translation, which needs an S2 domain that holds
the GPA mappings. Besides the S1 translation type, PT is also supported. Both
types need an S2 domain which already holds GPA mappings. So we have
this internal listener. Also, we want to skip RO mappings on S2, so that's
another reason for it.  @Zhenzhong, perhaps it can be described in the
commit message why an internal listener is introduced.
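A sketch of the filtering such an internal listener would do, using simplified stand-ins for MemoryRegionSection (field and function names are invented for illustration); the RO skip models the erratum workaround mentioned above:

```c
/* Sketch: decide whether a memory section gets mapped into the S2 domain.
 * Simplified stand-in for a region_add callback's filtering logic. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t size;
    bool is_ram;    /* RAM-backed, as opposed to MMIO or an iommu MR */
    bool readonly;
} SectionModel;

static bool s2domain_should_map(const SectionModel *sec)
{
    if (!sec->is_ram || sec->size == 0) {
        return false; /* only RAM-backed regions are mapped on S2 */
    }
    if (sec->readonly) {
        return false; /* erratum workaround: skip RO mappings on S2 */
    }
    return true;
}
```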

> 
> With this address space shift, S2 mappings can be simply captured
> and done by vfio_memory_listener. Then, such an s2domain listener
> would be largely redundant.

hope above addressed your question.

> So the second question is:
> Does vtd have to own this iommufd_s2domain_memory_listener? IOW,

Yes, based on the current design. When guest PGTT==PT, we attach the device
to the S2 hwpt; when it goes to S1, we attach it to an S1 hwpt whose
parent is the aforementioned S2 hwpt. This S2 hwpt is always there
for use.

> does vtd_host_dma_iommu() have to return the iommu address space
> all the time?

yes, all the time.

-- 
Regards,
Yi Liu



* Re: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to host
  2025-05-22 19:29       ` Nicolin Chen
@ 2025-05-23  6:26         ` Yi Liu
  2025-05-26  3:34         ` Duan, Zhenzhong
  1 sibling, 0 replies; 63+ messages in thread
From: Yi Liu @ 2025-05-23  6:26 UTC (permalink / raw)
  To: Nicolin Chen, Duan, Zhenzhong
  Cc: qemu-devel@nongnu.org, alex.williamson@redhat.com, clg@redhat.com,
	eric.auger@redhat.com, mst@redhat.com, jasowang@redhat.com,
	peterx@redhat.com, ddutile@redhat.com, jgg@nvidia.com,
	shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
	clement.mathieu--drif@eviden.com, Tian, Kevin, Peng, Chao P,
	Yi Sun, Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost

On 2025/5/23 03:29, Nicolin Chen wrote:
> On Thu, May 22, 2025 at 06:50:42AM +0000, Duan, Zhenzhong wrote:
>>
>>
>>> -----Original Message-----
>>> From: Nicolin Chen <nicolinc@nvidia.com>
>>> Subject: Re: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to
>>> host
>>>
>>> On Wed, May 21, 2025 at 07:14:45PM +0800, Zhenzhong Duan wrote:
>>>> +static const MemoryListener iommufd_s2domain_memory_listener = {
>>>> +    .name = "iommufd_s2domain",
>>>> +    .priority = 1000,
>>>> +    .region_add = iommufd_listener_region_add_s2domain,
>>>> +    .region_del = iommufd_listener_region_del_s2domain,
>>>> +};
>>>
>>> Would you mind elaborating When and how vtd does all S2 mappings?
>>
>> When guest trigger pasid cache invalidation, vIOMMU will attach device
>> to stage2 page table if guest's PGTT=PT or nested page table if PGTT=Stage1.
>> All these page tables are dynamically created during attach. We don't use
>> VFIO's shadow page table. The S2 mappings are also created during attach.
> 
> OK. That I can understand.
> 
> Next question: what does VTD actually map onto the S2 page table?
> The entire guest RAM? Or just a part of that?
> 
> On ARM, the VFIO listener would capture the entire RAM, and map it
> on S2 page table. I wonder if VTD would do the same.
> 
>>> On ARM, the default vfio_memory_listener could capture the entire
>>> guest RAM and add to the address space. So what we do is basically
>>> reusing the vfio_memory_listener:
>>> https://lore.kernel.org/qemu-devel/20250311141045.66620-13-
>>> shameerali.kolothum.thodi@huawei.com/
>>>
>>> The thing is that when a VFIO device is attached to the container
>>> upon a nesting configuration, the ->get_address_space op should
>>> return the system address space as S1 nested HWPT isn't allocated
>>> yet. Then all the iommu as routines in vfio_listener_region_add()
>>> would be skipped, ending up with mapping the guest RAM in S2 HWPT
>>> correctly. Not until the S1 nested HWPT is allocated by the guest
>>> OS (after guest boots), can the ->get_address_space op return the
>>> iommu address space.
>>
>> When S1 hwpt is allocated by guest, who will notify VFIO to call
>> ->get_address_space op() again to get iommu address space?
> 
> Hmm, would you please elaborate why VFIO needs to call that again?
> 
> I can see VFIO create the MAP/UNMAP notifiers for an iommu address
> space. However, the device operating in the nested translation mode
> should go through IOMMU HW for these two:
>   - S1 page table (MAP) will be created by the guest OS
>   - S1 invalidation (UNMAP) will be issued by the guest OS, and then
>     trapped by QEMU to forward to the HWPT uAPI to the host kernel.
> 
> As you mentioned, there is no need of a shadow page table in this
> mode. What else does VT-d need from an iommu address space?
> 
> On ARM, the only reason that we shift address space, is for KVM to
> inject MSI, as it only has the gIOVA and requires the iommu address
> space to translate that to gPA. Refer to kvm_arch_fixup_msi_route()
> in target/arm/kvm.c where it calls pci_device_iommu_address_space.
> 
>>> With this address space shift, S2 mappings can be simply captured
>>> and done by vfio_memory_listener. Then, such an s2domain listener
>>> would be largely redundant.
>>
>> I didn't get how arm smmu supports switching address space, will VFIO call
>> ->get_address_space() twice, once to get system address space and the other
>> for iommu address space?
> 
> The set_iommu_device() attaches the device to an stage2 page table

Hmmm, I'm not sure if this is accurate. I think set_iommu_device()
just acts as setting a handle for this particular device on the vIOMMU
side. It has no idea about the address space or page table.

> by default, indicating that the device works in the S1 passthrough
> mode (for VTD, that's PGTT=PT) at VM creation. And this is where
> the system address space should be returned by get_address_space().
> 
> If the guest kernel sets an S1 Translate mode for the device (for
> VTD, that's PGTT=Stage1), QEMU would trap that and allocate an S1
> HWPT for device to attach. Starting from here, get_address_space()
> can return the iommu address space -- on ARM, we only need it for
> KVM to translate MSI.
> 

Refer to my last reply. This seems to be different between the ARM and VT-d
emulations.

-- 
Regards,
Yi Liu



* RE: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to host
  2025-05-23  6:22     ` Yi Liu
@ 2025-05-23  6:52       ` Duan, Zhenzhong
  2025-05-23 21:12       ` Nicolin Chen
  1 sibling, 0 replies; 63+ messages in thread
From: Duan, Zhenzhong @ 2025-05-23  6:52 UTC (permalink / raw)
  To: Liu, Yi L, Nicolin Chen
  Cc: qemu-devel@nongnu.org, alex.williamson@redhat.com, clg@redhat.com,
	eric.auger@redhat.com, mst@redhat.com, jasowang@redhat.com,
	peterx@redhat.com, ddutile@redhat.com, jgg@nvidia.com,
	shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
	clement.mathieu--drif@eviden.com, Tian, Kevin, Peng, Chao P,
	Yi Sun, Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost



>-----Original Message-----
>From: Liu, Yi L <yi.l.liu@intel.com>
>Subject: Re: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to
>host
>
>Hey Nic,
>
>On 2025/5/22 06:49, Nicolin Chen wrote:
>> On Wed, May 21, 2025 at 07:14:45PM +0800, Zhenzhong Duan wrote:
>>> +static const MemoryListener iommufd_s2domain_memory_listener = {
>>> +    .name = "iommufd_s2domain",
>>> +    .priority = 1000,
>>> +    .region_add = iommufd_listener_region_add_s2domain,
>>> +    .region_del = iommufd_listener_region_del_s2domain,
>>> +};
>>
>> Would you mind elaborating When and how vtd does all S2 mappings?
>>
>> On ARM, the default vfio_memory_listener could capture the entire
>> guest RAM and add to the address space. So what we do is basically
>> reusing the vfio_memory_listener:
>> https://lore.kernel.org/qemu-devel/20250311141045.66620-13-
>shameerali.kolothum.thodi@huawei.com/
>
>in concept yes, all the guest ram. but due to an errata, we need
>to skip the RO mappings.
>
>> The thing is that when a VFIO device is attached to the container
>> upon a nesting configuration, the ->get_address_space op should
>> return the system address space as S1 nested HWPT isn't allocated
>> yet. Then all the iommu as routines in vfio_listener_region_add()
>> would be skipped, ending up with mapping the guest RAM in S2 HWPT
>> correctly. Not until the S1 nested HWPT is allocated by the guest
>> OS (after guest boots), can the ->get_address_space op return the
>> iommu address space.
>
>This seems a bit different between ARM and VT-d emulation. The VT-d
>emulation code returns the iommu address space regardless of what
>translation mode guest configured. But the MR of the address space
>has two overlapped subregions, one is nodmar, another one is iommu.
>As the naming shows, the nodmar is aliased to the system MR. And before
>the guest enables iommu and set PGTT to a non-PT mode (e.g. S1 or S2),
>the effective MR alias is the nodmar, hence the mapping this address
>space holds are the GPA mappings in the beginning. If guest set PGTT to S2,
>then the iommu MR is enabled, hence the mapping is gIOVA mappings
>accordingly. So in VT-d emulation, the address space switch is more the MR
>alias switching.
>
>In this series, we mainly want to support S1 translation type for guest.
>And it is based on nested translation, which needs a S2 domain that holds
>the GPA mappings. Besides S1 translation type, PT is also supported. Both
>the two types need a S2 domain which already holds GPA mappings. So we have
>this internal listener. Also, we want to skip RO mappings on S2, so that's
>another reason for it.  @Zhenzhong, perhaps, it can be described in the
>commit message why an internal listener is introduced.

Thanks Yi for the accurate explanation. Sure, I will add comments for the internal listener.

BRs,
Zhenzhong

>
>>
>> With this address space shift, S2 mappings can be simply captured
>> and done by vfio_memory_listener. Then, such an s2domain listener
>> would be largely redundant.
>
>hope above addressed your question.
>
>> So the second question is:
>> Does vtd have to own this iommufd_s2domain_memory_listener? IOW,
>
>yes based on the current design. when guest GPTT==PT, attach device
>to S2 hwpt, when it goes to S1, then attach it to a S1 hwpt whose
>parent is the aforementioned S2 hwpt. This S2 hwpt is always there
>for use.
>
>> does vtd_host_dma_iommu() have to return the iommu address space
>> all the time?
>
>yes, all the time.
>
>--
>Regards,
>Yi Liu


* Re: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to host
  2025-05-23  6:22     ` Yi Liu
  2025-05-23  6:52       ` Duan, Zhenzhong
@ 2025-05-23 21:12       ` Nicolin Chen
  2025-05-26  3:46         ` Duan, Zhenzhong
  2025-05-26  7:24         ` Yi Liu
  1 sibling, 2 replies; 63+ messages in thread
From: Nicolin Chen @ 2025-05-23 21:12 UTC (permalink / raw)
  To: Yi Liu
  Cc: Zhenzhong Duan, qemu-devel, alex.williamson, clg, eric.auger, mst,
	jasowang, peterx, ddutile, jgg, shameerali.kolothum.thodi,
	joao.m.martins, clement.mathieu--drif, kevin.tian, chao.p.peng,
	Yi Sun, Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost

Hey,

Thanks for the reply.

Just want to say that I am asking a lot in order to understand why VT-d is
different from ARM, so as to decide whether ARM should follow VT-d in
implementing a separate listener or just use the VFIO listener.

On Fri, May 23, 2025 at 02:22:15PM +0800, Yi Liu wrote:
> Hey Nic,
> 
> On 2025/5/22 06:49, Nicolin Chen wrote:
> > On Wed, May 21, 2025 at 07:14:45PM +0800, Zhenzhong Duan wrote:
> > > +static const MemoryListener iommufd_s2domain_memory_listener = {
> > > +    .name = "iommufd_s2domain",
> > > +    .priority = 1000,
> > > +    .region_add = iommufd_listener_region_add_s2domain,
> > > +    .region_del = iommufd_listener_region_del_s2domain,
> > > +};
> > 
> > Would you mind elaborating When and how vtd does all S2 mappings?
> > 
> > On ARM, the default vfio_memory_listener could capture the entire
> > guest RAM and add to the address space. So what we do is basically
> > reusing the vfio_memory_listener:
> > https://lore.kernel.org/qemu-devel/20250311141045.66620-13-shameerali.kolothum.thodi@huawei.com/
> 
> in concept yes, all the guest ram. but due to an errata, we need
> to skip the RO mappings.

Mind elaborating on what the RO mappings are? Can those exist within
the range of the RAM?

> > The thing is that when a VFIO device is attached to the container
> > upon a nesting configuration, the ->get_address_space op should
> > return the system address space as S1 nested HWPT isn't allocated
> > yet. Then all the iommu as routines in vfio_listener_region_add()
> > would be skipped, ending up with mapping the guest RAM in S2 HWPT
> > correctly. Not until the S1 nested HWPT is allocated by the guest
> > OS (after guest boots), can the ->get_address_space op return the
> > iommu address space.
> 
> This seems a bit different between ARM and VT-d emulation. The VT-d
> emulation code returns the iommu address space regardless of what
> translation mode guest configured. But the MR of the address space
> has two overlapped subregions, one is nodmar, another one is iommu.
> As the naming shows, the nodmar is aliased to the system MR.

OK. But why two overlapped subregions vs. two separate ASs?

> And before
> the guest enables iommu and set PGTT to a non-PT mode (e.g. S1 or S2),
> the effective MR alias is the nodmar, hence the mapping this address
> space holds are the GPA mappings in the beginning.

I think this is the same on ARM, where get_address_space() may return
the system address space. And for VT-d, it actually returns the range
of the system address space (just through a sub-MR of an iommu AS),
right?

> If guest set PGTT to S2,
> then the iommu MR is enabled, hence the mapping is gIOVA mappings
> accordingly. So in VT-d emulation, the address space switch is more the MR
> alias switching.

Zhenzhong said that there is no shadow page table for the nesting
setup, i.e. gIOVA=>gPA mappings are entirely done by the guest OS.

Then, why does VT-d need to switch to the iommu MR here?

> In this series, we mainly want to support S1 translation type for guest.
> And it is based on nested translation, which needs a S2 domain that holds
> the GPA mappings. Besides S1 translation type, PT is also supported. Both
> the two types need a S2 domain which already holds GPA mappings. So we have
> this internal listener.

Hmm, the reasoning behind the last "so" doesn't sound sufficient. The VFIO
listener could do the same...

> Also, we want to skip RO mappings on S2, so that's
> another reason for it.  @Zhenzhong, perhaps, it can be described in the
> commit message why an internal listener is introduced.

OK. I think that can be a good reason to have an internal listener,
but only if VFIO can't skip the RO mappings.

> > So the second question is:
> > Does vtd have to own this iommufd_s2domain_memory_listener? IOW,
> 
> yes based on the current design. when guest GPTT==PT, attach device
> to S2 hwpt, when it goes to S1, then attach it to a S1 hwpt whose
> parent is the aforementioned S2 hwpt. This S2 hwpt is always there
> for use.

ARM is doing the same thing. And the exact point "this S2 hwpt is
always there for use" has been telling me that the device can just
stay in the S2 address space (system), since the guest kernel will
take care of the S1 address space (iommu).

Overall, the questions here have been two-fold:

1.Why does VT-d need an internal listener?

  I can see the (only) reason is for the RO mappings.

  Yet, is there anything that we can do to the VFIO listener to
  bypass these RO mappings?

2.Why not return the system AS all the time when nesting is on?
  Why switch to the iommu AS when device attaches to S1 HWPT?

  For ARM, MSI requires a translation so it has to; but MSI on
  VT-d doesn't. So, I couldn't see why VT-d needs to return the
  iommu AS via get_address_space().

  However, combining with question-1, my gut feeling is that VT-d
  needs to skip RO mappings for the errata, while the VFIO listener
  can't do that. So, VT-d has to create its own listener. And to
  avoid duplicated mappings on the same address space, it has to
  bypass the VFIO listener by working around it with an IOMMU AS.
  Is this correct?

Thanks
Nic


^ permalink raw reply	[flat|nested] 63+ messages in thread

* RE: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to host
  2025-05-22 19:29       ` Nicolin Chen
  2025-05-23  6:26         ` Yi Liu
@ 2025-05-26  3:34         ` Duan, Zhenzhong
  1 sibling, 0 replies; 63+ messages in thread
From: Duan, Zhenzhong @ 2025-05-26  3:34 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: qemu-devel@nongnu.org, alex.williamson@redhat.com, clg@redhat.com,
	eric.auger@redhat.com, mst@redhat.com, jasowang@redhat.com,
	peterx@redhat.com, ddutile@redhat.com, jgg@nvidia.com,
	shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
	clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
	Peng, Chao P, Yi Sun, Marcel Apfelbaum, Paolo Bonzini,
	Richard Henderson, Eduardo Habkost



>-----Original Message-----
>From: Nicolin Chen <nicolinc@nvidia.com>
...
>> >I think you did some substantial work to isolate the get_hw_info
>> >part inside the iommufd backend code, which looks nice and clean
>> >as the vIOMMU code simply does iodc->get_cap().
>> >
>> >However, that then makes these direct raw backend function calls
>> >very awkward :-/
>> >
>> >In my view, the way to make sense is either:
>> >* We don't do any isolation, but just call raw backend functions
>> >  in vIOMMU code
>> >* We do every isolation, and never call raw backend functions in
>> >  vIOMMU code
>>
>> Iommufd backend functions are general for all modules usage including
>> vIOMMU. I think we are not blocking vIOMMU calling raw backend functions.
>
>Well, I am not against doing that. Calling the raw iommufd APIs is
>easier :)
>
>> We just provide a general interface for querying capabilities, not all capabilities
>> are from iommufd get_hw_info result, e.g., aw_bits.
>
>Question is why we need a general interface for get_hw_info(), yet
>not for other iommufd APIs.

Because the other iommufd APIs are already abstract and hide vendor
differences, but get_hw_info() is not: it returns raw vendor data.
This forces every consumer of that raw vendor data to recognize the
vendor-specific data format. For example, when virtio-iommu wants to know
a host iommu capability, it would have to parse the raw data from
get_hw_info(). We hide this difference in the backend so virtio-iommu
can just call .get_cap(HOST_IOMMU_DEVICE_CAP_*), no matter what the
underlying platform is: VT-d, SMMUv3, etc.
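To make the shape of this concrete, here is a minimal sketch of such a vendor-hiding get_cap() (the type names, CAP constant, and SAGAW decoding are illustrative stand-ins for QEMU's HOST_IOMMU_DEVICE_CAP_* machinery, not the actual code):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative mirror of the vendor-specific data kept by the backend. */
typedef struct {
    uint64_t cap_reg;            /* raw VT-d CAP_REG value */
} VtdCaps;

typedef union {
    VtdCaps vtd;                 /* other vendors would be added here */
} VendorCaps;

enum { IOMMU_TYPE_VTD, IOMMU_TYPE_SMMUV3 };
enum { CAP_AW_BITS };            /* stand-in for HOST_IOMMU_DEVICE_CAP_* */

typedef struct {
    int type;
    VendorCaps vendor;
} HostIommuDev;

/*
 * Vendor differences are resolved here, so a consumer like virtio-iommu
 * never has to parse raw hardware registers itself.
 */
static int get_cap(const HostIommuDev *dev, int cap)
{
    if (cap == CAP_AW_BITS && dev->type == IOMMU_TYPE_VTD) {
        /* CAP_REG.SAGAW is bits 12:8; bit 3 => 5-level, bit 2 => 4-level */
        return ((dev->vendor.vtd.cap_reg >> 8) & 0x8) ? 57 : 48;
    }
    return -1;                   /* unknown capability or vendor */
}
```

An SMMUv3 branch would decode its own IDR mirror the same way; callers only ever see the abstract capability value.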

>
>IIRC, Cédric suggested a general interface for get_hw_info so the
>vIOMMU code can be unaware of the underlying source (doesn't matter
>if it's from iommufd or from vfio).
>
>I see aw_bits from QEMU command line. Mind elaborating?

You mean aw_bits of the virtual intel-iommu; it defines the IOVA address width
of intel-iommu. We need to check the host intel-iommu's aw_bits against the
virtual intel-iommu's to determine compatibility.
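The check being described reduces to comparing the two widths; a sketch (vtd_check_aw_bits() is an illustrative helper, not the series' actual function):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * A virtual intel-iommu configured with aw_bits=N can only work on a
 * host whose IOMMU supports an IOVA address width of at least N bits.
 */
static bool vtd_check_aw_bits(uint32_t host_aw_bits, uint32_t vtd_aw_bits)
{
    /* Scalable-mode VT-d uses 48-bit (4-level) or 57-bit (5-level) IOVA */
    if (vtd_aw_bits != 48 && vtd_aw_bits != 57) {
        return false;
    }
    return host_aw_bits >= vtd_aw_bits;
}
```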

Thanks
Zhenzhong



* RE: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to host
  2025-05-23 21:12       ` Nicolin Chen
@ 2025-05-26  3:46         ` Duan, Zhenzhong
  2025-05-26  7:24         ` Yi Liu
  1 sibling, 0 replies; 63+ messages in thread
From: Duan, Zhenzhong @ 2025-05-26  3:46 UTC (permalink / raw)
  To: Nicolin Chen, Liu, Yi L
  Cc: qemu-devel@nongnu.org, alex.williamson@redhat.com, clg@redhat.com,
	eric.auger@redhat.com, mst@redhat.com, jasowang@redhat.com,
	peterx@redhat.com, ddutile@redhat.com, jgg@nvidia.com,
	shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
	clement.mathieu--drif@eviden.com, Tian, Kevin, Peng, Chao P,
	Yi Sun, Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost



>-----Original Message-----
>From: Nicolin Chen <nicolinc@nvidia.com>
>Subject: Re: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to
...
>> yes based on the current design. when guest PGTT==PT, attach device
>> to S2 hwpt, when it goes to S1, then attach it to a S1 hwpt whose
>> parent is the aforementioned S2 hwpt. This S2 hwpt is always there
>> for use.
>
>ARM is doing the same thing. And the exact point "this S2 hwpt is
>always there for use" has been telling me that the device can just
>stay at the S2 address space (system), since the guest kernel will
>take care of the S1 address space (iommu).
>
>Overall, the questions here have been two-fold:
>
>1.Why does VT-d need an internal listener?
>
>  I can see the (only) reason is for the RO mappings.

It's not the only reason. Another reason is that we want to support the
case where a VFIO device and an emulated device are under the same group,
e.g., under a PCIe-to-PCI bridge.

In fact, .get_address_space() returns the AS for the group rather
than for the device, see pci_device_get_iommu_bus_devfn().

>
>  Yet, is there anything that we can do to the VFIO listener to
>  bypass these RO mappings?
>
>2.Why not return the system AS all the time when nesting is on?
>  Why switch to the iommu AS when device attaches to S1 HWPT?

An emulated device wants the iommu AS while a VFIO device wants the
system AS; just curious, how do you handle this case in your series?

Thanks
Zhenzhong




* Re: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to host
  2025-05-23 21:12       ` Nicolin Chen
  2025-05-26  3:46         ` Duan, Zhenzhong
@ 2025-05-26  7:24         ` Yi Liu
  2025-05-26 17:35           ` Nicolin Chen
  1 sibling, 1 reply; 63+ messages in thread
From: Yi Liu @ 2025-05-26  7:24 UTC (permalink / raw)
  To: Nicolin Chen, Peter Xu
  Cc: Zhenzhong Duan, qemu-devel, alex.williamson, clg, eric.auger, mst,
	jasowang, peterx, ddutile, jgg, shameerali.kolothum.thodi,
	joao.m.martins, clement.mathieu--drif, kevin.tian, chao.p.peng,
	Yi Sun, Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost

On 2025/5/24 05:12, Nicolin Chen wrote:
> Hey,
> 
> Thanks for the reply.
> 
> Just want to say that I am asking a lot to understand why VT-d is
> different than ARM, so as to decide whether ARM should follow VT-d
> implementing a separate listener or just use the VFIO listener.
> 
> On Fri, May 23, 2025 at 02:22:15PM +0800, Yi Liu wrote:
>> Hey Nic,
>>
>> On 2025/5/22 06:49, Nicolin Chen wrote:
>>> On Wed, May 21, 2025 at 07:14:45PM +0800, Zhenzhong Duan wrote:
>>>> +static const MemoryListener iommufd_s2domain_memory_listener = {
>>>> +    .name = "iommufd_s2domain",
>>>> +    .priority = 1000,
>>>> +    .region_add = iommufd_listener_region_add_s2domain,
>>>> +    .region_del = iommufd_listener_region_del_s2domain,
>>>> +};
>>>
>>> Would you mind elaborating when and how vtd does all S2 mappings?
>>>
>>> On ARM, the default vfio_memory_listener could capture the entire
>>> guest RAM and add to the address space. So what we do is basically
>>> reusing the vfio_memory_listener:
>>> https://lore.kernel.org/qemu-devel/20250311141045.66620-13-shameerali.kolothum.thodi@huawei.com/
>>
>> In concept, yes: all the guest RAM. But due to an errata, we need
>> to skip the RO mappings.
> 
> Mind elaborating on what the RO mappings are? Can those exist within
> the range of the RAM?

Below are the RO regions when booting a Q35 machine (a PCIe-capable and
vIOMMU-capable platform) with 4GB memory. For the bios and rom
regions, this looks reasonable. I'm not quite sure yet why there is RO RAM,
but it seems to be a fact we need to face.

vfio_listener_region_add, section->mr->name: pc.bios, iova: fffc0000, size: 40000, vaddr: 7fb314200000, RO
vfio_listener_region_add, section->mr->name: pc.rom, iova: c0000, size: 20000, vaddr: 7fb206c00000, RO
vfio_listener_region_add, section->mr->name: pc.bios, iova: e0000, size: 20000, vaddr: 7fb314220000, RO
vfio_listener_region_add, section->mr->name: pc.rom, iova: d8000, size: 8000, vaddr: 7fb206c18000, RO
vfio_listener_region_add, section->mr->name: pc.bios, iova: e0000, size: 10000, vaddr: 7fb314220000, RO
vfio_listener_region_add, section->mr->name: vga.rom, iova: febc0000, size: 10000, vaddr: 7fb205800000, RO
vfio_listener_region_add, section->mr->name: virtio-net-pci.rom, iova: feb80000, size: 40000, vaddr: 7fb205600000, RO
vfio_listener_region_add, section->mr->name: pc.ram, iova: c0000, size: b000, vaddr: 7fb207ec0000, RO
vfio_listener_region_add, section->mr->name: pc.ram, iova: ce000, size: a000, vaddr: 7fb207ece000, RO
vfio_listener_region_add, section->mr->name: pc.ram, iova: f0000, size: 10000, vaddr: 7fb207ef0000, RO
vfio_listener_region_add, section->mr->name: pc.ram, iova: ce000, size: 1a000, vaddr: 7fb207ece000, RO
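A stage-2 listener that skips such regions essentially filters each section on its read-only attribute; a mock sketch of the idea (not the actual iommufd_listener_region_add_s2domain() code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Minimal stand-in for the MemoryRegionSection a listener receives. */
typedef struct {
    uint64_t iova;
    uint64_t size;
    bool readonly;
} Section;

/*
 * On an ERRATA_772415 host, RO sections (pc.bios, pc.rom, the RO slices
 * of pc.ram above, ...) must not be mapped into the nested parent S2
 * domain; on other hosts they are mapped normally.
 */
static bool s2_should_map(const Section *s, bool errata_772415)
{
    if (errata_772415 && s->readonly) {
        return false;
    }
    return s->size != 0;
}
```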

>>> The thing is that when a VFIO device is attached to the container
>>> upon a nesting configuration, the ->get_address_space op should
>>> return the system address space as S1 nested HWPT isn't allocated
>>> yet. Then all the iommu as routines in vfio_listener_region_add()
>>> would be skipped, ending up with mapping the guest RAM in S2 HWPT
>>> correctly. Not until the S1 nested HWPT is allocated by the guest
>>> OS (after guest boots), can the ->get_address_space op return the
>>> iommu address space.
>>
>> This seems a bit different between ARM and VT-d emulation. The VT-d
>> emulation code returns the iommu address space regardless of what
>> translation mode guest configured. But the MR of the address space
>> has two overlapped subregions, one is nodmar, another one is iommu.
>> As the naming shows, the nodmar is aliased to the system MR.
> 
> OK. But why two overlapped subregions vs. two separate ASs?

TBH, I don't have the exact reason for it. +Cc Peter in case he
remembers.

IMHO, at least for vfio devices, I can see only one get_address_space()
call. So even if there were two ASs, how would vfio be notified when the
AS changed? Since vIOMMU is the source of map/unmap requests, it looks fine
to always return the iommu AS and handle the AS switch by switching the
enabled subregions according to the guest vIOMMU translation types.
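This subregion switching can be modeled as two overlapping MRs where only the enabled one takes effect; a mock of the concept, not QEMU's actual MemoryRegion code:

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { MR_NODMAR, MR_IOMMU } MrKind;

typedef struct {
    /* true once the guest programs PGTT to a translating mode */
    bool iommu_enabled;
} VtdAs;

/*
 * The AS returned by get_address_space() never changes; what changes is
 * which of the two overlapped subregions is currently enabled, so the
 * effective mappings flip between GPA (nodmar) and gIOVA (iommu).
 */
static MrKind vtd_effective_mr(const VtdAs *as)
{
    return as->iommu_enabled ? MR_IOMMU : MR_NODMAR;
}
```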

> 
>> And before
>> the guest enables iommu and sets PGTT to a non-PT mode (e.g. S1 or S2),
>> the effective MR alias is the nodmar, hence the mapping this address
>> space holds are the GPA mappings in the beginning.
> 
> I think this is the same on ARM, where get_address_space() may return
> the system address space. And for VT-d, it actually returns the range
> of the system address space (just through a sub MR of an iommu AS),
> right?

hmmm, I'm not quite getting why it is similar. As I replied, the VT-d
emulation code returns iommu AS in get_address_space(). I didn't see
where it returns address_space_memory (the system address space).

> 
>> If the guest sets PGTT to S2,
>> then the iommu MR is enabled, hence the mappings are gIOVA mappings
>> accordingly. So in VT-d emulation, the address space switch is more of an
>> MR alias switch.
> 
> Zhenzhong said that there is no shadow page table for the nesting
> setup, i.e. gIOVA=>gPA mappings are entirely done by the guest OS.
> 
> Then, why does VT-d need to switch to the iommu MR here?

What I described in the prior email is the general idea of the AS switching
before this series. Nesting, just like PT, for sure does not need this
switching.

>> In this series, we mainly want to support S1 translation type for guest.
>> And it is based on nested translation, which needs a S2 domain that holds
>> the GPA mappings. Besides S1 translation type, PT is also supported. Both
>> the two types need a S2 domain which already holds GPA mappings. So we have
>> this internal listener.
> 
> Hmm, the reasoning behind the last "so" doesn't sound sufficient. The VFIO
> listener could do the same...

yes. I just realized that RO mappings should be allowed for the normal
S2 domains. Only the nested parent S2 domain should skip the RO mappings.

> 
>> Also, we want to skip RO mappings on S2, so that's
>> another reason for it.  @Zhenzhong, perhaps, it can be described in the
>> commit message why an internal listener is introduced.
> 
> OK. I think that can be a good reason to have an internal listener,
> only if VFIO can't skip the RO mappings.
> 
>>> So the second question is:
>>> Does vtd have to own this iommufd_s2domain_memory_listener? IOW,
>>
>> yes based on the current design. when guest PGTT==PT, attach device
>> to S2 hwpt, when it goes to S1, then attach it to a S1 hwpt whose
>> parent is the aforementioned S2 hwpt. This S2 hwpt is always there
>> for use.
> 
> ARM is doing the same thing. And the exact point "this S2 hwpt is
> always there for use" has been telling me that the device can just
> stay at the S2 address space (system), since the guest kernel will
> take care of the S1 address space (iommu).
> 
> Overall, the questions here have been two-fold:
> 
> 1.Why does VT-d need an internal listener?
> 
>    I can see the (only) reason is for the RO mappings.
> 
>    Yet, is there anything that we can do to the VFIO listener to
>    bypass these RO mappings?
> 
> 2.Why not return the system AS all the time when nesting is on?
>    Why switch to the iommu AS when device attaches to S1 HWPT?

No switch happens if we are going to set up nesting.

Just got a question on the ARM side. IIUC, the ARM emulation code returns
the system address space in the get_address_space() op before the guest
enables vIOMMU, hence the IOAS on the vfio side is a GPA IOAS. When the
guest enables vIOMMU, the emulation returns the iommu address space; hence,
the vfio side needs to switch to a gIOVA IOAS? My question is: if the guest
is setting up S1 translation, and the emulation code figures out it is
going to set up nested translation, will the get_address_space() op return
the iommu address space as well? If so, where is the GPA IOAS located? In
this series, the VT-d emulation code actually has an internal GPA IOAS
which skips RO mappings.

-- 
Regards,
Yi Liu



* Re: [PATCH rfcv3 05/21] vfio/iommufd: Save vendor specific device info
  2025-05-21 11:14 ` [PATCH rfcv3 05/21] vfio/iommufd: Save vendor specific device info Zhenzhong Duan
  2025-05-21 21:57   ` Nicolin Chen
@ 2025-05-26 12:15   ` Cédric Le Goater
  2025-05-27  2:12     ` Duan, Zhenzhong
  1 sibling, 1 reply; 63+ messages in thread
From: Cédric Le Goater @ 2025-05-26 12:15 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, eric.auger, mst, jasowang, peterx, ddutile, jgg,
	nicolinc, shameerali.kolothum.thodi, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng

Hello Zhenzhong,

On 5/21/25 13:14, Zhenzhong Duan wrote:
> Some device information returned by ioctl(IOMMU_GET_HW_INFO) is vendor
> specific. Save it all in a newly defined structure mirroring that vendor
> IOMMU's structure, so that get_cap() can query that information for
> capabilities.
> 
> We can't use the vendor IOMMU's structures directly because they are in
> linux/iommufd.h, which breaks the build on Windows.
> 
> Suggested-by: Eric Auger <eric.auger@redhat.com>
> Suggested-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>   include/system/host_iommu_device.h | 12 ++++++++++++
>   hw/vfio/iommufd.c                  | 12 ++++++++++++
>   2 files changed, 24 insertions(+)
> 
> diff --git a/include/system/host_iommu_device.h b/include/system/host_iommu_device.h
> index 809cced4ba..908bfe32c7 100644
> --- a/include/system/host_iommu_device.h
> +++ b/include/system/host_iommu_device.h
> @@ -15,6 +15,17 @@
>   #include "qom/object.h"
>   #include "qapi/error.h"
>   
> +/* This is mirror of struct iommu_hw_info_vtd */
> +typedef struct Vtd_Caps {

please name the struct VtdCaps instead.


Thanks,

C.



> +    uint32_t flags;
> +    uint64_t cap_reg;
> +    uint64_t ecap_reg;
> +} Vtd_Caps;
> +
> +typedef union VendorCaps {
> +    Vtd_Caps vtd;
> +} VendorCaps;
> +
>   /**
>    * struct HostIOMMUDeviceCaps - Define host IOMMU device capabilities.
>    *
> @@ -26,6 +37,7 @@
>   typedef struct HostIOMMUDeviceCaps {
>       uint32_t type;
>       uint64_t hw_caps;
> +    VendorCaps vendor_caps;
>   } HostIOMMUDeviceCaps;
>   
>   #define TYPE_HOST_IOMMU_DEVICE "host-iommu-device"
> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
> index d661737c17..5c740222e5 100644
> --- a/hw/vfio/iommufd.c
> +++ b/hw/vfio/iommufd.c
> @@ -834,6 +834,7 @@ static bool hiod_iommufd_vfio_realize(HostIOMMUDevice *hiod, void *opaque,
>       VFIODevice *vdev = opaque;
>       HostIOMMUDeviceIOMMUFD *idev;
>       HostIOMMUDeviceCaps *caps = &hiod->caps;
> +    VendorCaps *vendor_caps = &caps->vendor_caps;
>       enum iommu_hw_info_type type;
>       union {
>           struct iommu_hw_info_vtd vtd;
> @@ -852,6 +853,17 @@ static bool hiod_iommufd_vfio_realize(HostIOMMUDevice *hiod, void *opaque,
>       caps->type = type;
>       caps->hw_caps = hw_caps;
>   
> +    switch (type) {
> +    case IOMMU_HW_INFO_TYPE_INTEL_VTD:
> +        vendor_caps->vtd.flags = data.vtd.flags;
> +        vendor_caps->vtd.cap_reg = data.vtd.cap_reg;
> +        vendor_caps->vtd.ecap_reg = data.vtd.ecap_reg;
> +        break;
> +    case IOMMU_HW_INFO_TYPE_ARM_SMMUV3:
> +    case IOMMU_HW_INFO_TYPE_NONE:
> +        break;
> +    }
> +
>       idev = HOST_IOMMU_DEVICE_IOMMUFD(hiod);
>       idev->iommufd = vdev->iommufd;
>       idev->devid = vdev->devid;




* Re: [PATCH rfcv3 00/21] intel_iommu: Enable stage-1 translation for passthrough device
  2025-05-21 11:14 [PATCH rfcv3 00/21] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
                   ` (20 preceding siblings ...)
  2025-05-21 11:14 ` [PATCH rfcv3 21/21] intel_iommu: Enable host device when x-flts=on in scalable mode Zhenzhong Duan
@ 2025-05-26 12:19 ` Cédric Le Goater
  2025-05-27  2:16   ` Duan, Zhenzhong
  21 siblings, 1 reply; 63+ messages in thread
From: Cédric Le Goater @ 2025-05-26 12:19 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, eric.auger, mst, jasowang, peterx, ddutile, jgg,
	nicolinc, shameerali.kolothum.thodi, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng

On 5/21/25 13:14, Zhenzhong Duan wrote:
> Hi,
> 
> Per Jason Wang's suggestion, iommufd nesting series[1] is split into
> "Enable stage-1 translation for emulated device" series and
> "Enable stage-1 translation for passthrough device" series.
> 
> This series is the 2nd part, focusing on passthrough devices. We don't do
> shadowing of the guest page table for passthrough devices but pass the stage-1
> page table to the host side to construct a nested domain. There was some
> effort to enable this feature in earlier days, see [2] for details.
> 
> The key design is to utilize the dual-stage IOMMU translation
> (also known as IOMMU nested translation) capability in the host IOMMU.
> As the below diagram shows, the guest I/O page table pointer in GPA
> (guest physical address) is passed to the host and used to perform
> the stage-1 address translation. Along with it, modifications to
> present mappings in the guest I/O page table should be followed
> by an IOTLB invalidation.
> 
>          .-------------.  .---------------------------.
>          |   vIOMMU    |  | Guest I/O page table      |
>          |             |  '---------------------------'
>          .----------------/
>          | PASID Entry |--- PASID cache flush --+
>          '-------------'                        |
>          |             |                        V
>          |             |           I/O page table pointer in GPA
>          '-------------'
>      Guest
>      ------| Shadow |---------------------------|--------
>            v        v                           v
>      Host
>          .-------------.  .------------------------.
>          |   pIOMMU    |  | Stage1 for GIOVA->GPA  |
>          |             |  '------------------------'
>          .----------------/  |
>          | PASID Entry |     V (Nested xlate)
>          '----------------\.--------------------------------------.
>          |             |   | Stage2 for GPA->HPA, unmanaged domain|
>          |             |   '--------------------------------------'
>          '-------------'
> For historical reasons, there are different namings in different VT-d spec
> revisions, where:
>   - Stage1 = First stage = First level = flts
>   - Stage2 = Second stage = Second level = slts
> <Intel VT-d Nested translation>
> 
> There are some interactions between VFIO and vIOMMU:
> * vIOMMU registers PCIIOMMUOps [set|unset]_iommu_device with the PCI
>    subsystem. VFIO calls them to register/unregister a HostIOMMUDevice
>    instance with the vIOMMU at the vfio device realize stage.
> * vIOMMU calls the HostIOMMUDeviceIOMMUFD interface [at|de]tach_hwpt
>    to bind/unbind a device to IOMMUFD backed domains, either nested
>    domains or not.
> 
> See below diagram:
> 
>          VFIO Device                                 Intel IOMMU
>      .-----------------.                         .-------------------.
>      |                 |                         |                   |
>      |       .---------|PCIIOMMUOps              |.-------------.    |
>      |       | IOMMUFD |(set_iommu_device)       || Host IOMMU  |    |
>      |       | Device  |------------------------>|| Device list |    |
>      |       .---------|(unset_iommu_device)     |.-------------.    |
>      |                 |                         |       |           |
>      |                 |                         |       V           |
>      |       .---------|  HostIOMMUDeviceIOMMUFD |  .-------------.  |
>      |       | IOMMUFD |            (attach_hwpt)|  | Host IOMMU  |  |
>      |       | link    |<------------------------|  |   Device    |  |
>      |       .---------|            (detach_hwpt)|  .-------------.  |
>      |                 |                         |       |           |
>      |                 |                         |       ...         |
>      .-----------------.                         .-------------------.
> 
> Based on Yi's suggestion, this design is optimal in sharing ioas/hwpt
> whenever possible, creating new ones on demand; it also supports multiple
> iommufd objects and ERRATA_772415.
> 
> E.g., within one guest's scope, a stage-2 page table can be shared by different
> devices if there is no conflict and the devices link to the same iommufd object,
> i.e. devices under the same host IOMMU can share the same stage-2 page table. If
> there is a conflict, i.e. one device is in non cache coherency mode while the
> others are not, it requires a separate stage-2 page table in non-CC mode.
> 
> The SPR platform has ERRATA_772415, which requires that there be no readonly
> mappings in the stage-2 page table. This series supports creating a
> VTDIOASContainer with no readonly mappings. If there is a rare case where some
> IOMMUs on a multi-IOMMU host have ERRATA_772415 and others don't, this
> design can still survive.
> 
> See below example diagram for a full view:
> 
>        IntelIOMMUState
>               |
>               V
>      .------------------.    .------------------.    .-------------------.
>      | VTDIOASContainer |--->| VTDIOASContainer |--->| VTDIOASContainer  |-->...
>      | (iommufd0,RW&RO) |    | (iommufd1,RW&RO) |    | (iommufd0,only RW)|
>      .------------------.    .------------------.    .-------------------.
>               |                       |                              |
>               |                       .-->...                        |
>               V                                                      V
>        .-------------------.    .-------------------.          .---------------.
>        |   VTDS2Hwpt(CC)   |--->| VTDS2Hwpt(non-CC) |-->...    | VTDS2Hwpt(CC) |-->...
>        .-------------------.    .-------------------.          .---------------.
>            |            |               |                            |
>            |            |               |                            |
>      .-----------.  .-----------.  .------------.              .------------.
>      | IOMMUFD   |  | IOMMUFD   |  | IOMMUFD    |              | IOMMUFD    |
>      | Device(CC)|  | Device(CC)|  | Device     |              | Device(CC) |
>      | (iommufd0)|  | (iommufd0)|  | (non-CC)   |              | (errata)   |
>      |           |  |           |  | (iommufd0) |              | (iommufd0) |
>      .-----------.  .-----------.  .------------.              .------------.
> 
> This series is also a prerequisite work for vSVA, i.e. Sharing
> guest application address space with passthrough devices.
> 
> To enable stage-1 translation, you only need to add "x-scalable-mode=on,x-flts=on",
> i.e. -device intel-iommu,x-scalable-mode=on,x-flts=on...
> 
> Passthrough devices should use the iommufd backend to work with stage-1 translation,
> i.e. -object iommufd,id=iommufd0 -device vfio-pci,iommufd=iommufd0,...
> 
> If the host doesn't support nested translation, QEMU will fail and report
> that it is unsupported.
> 
> Test done:
> - VFIO devices hotplug/unplug
> - different VFIO devices linked to different iommufds
> - vhost net device ping test
> 
> Fault reporting isn't supported in this series; we presume the guest kernel
> always constructs correct S1 page tables for passthrough devices. For emulated
> devices, the emulation code already provides S1 fault injection.
> 
> PATCH1-6:  Add HWPT-based nesting infrastructure support

The first 6 patches are all VFIO or IOMMUFD related. They are
mostly additions and I didn't see anything wrong. They could
be merged in advance through the VFIO tree.

Thanks,

C.




> PATCH7-8:  Some cleanup work
> PATCH9:    cap/ecap related compatibility check between vIOMMU and Host IOMMU
> PATCH10-20:Implement stage-1 page table for passthrough device
> PATCH21:   Enable stage-1 translation for passthrough device
> 
> Qemu code can be found at [3]
> 
> TODO:
> - RAM discard
> - dirty tracking on stage-2 page table
> - Fault report to guest when HW Stage-1 faults
> 
> [1] https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg02740.html
> [2] https://patchwork.kernel.org/project/kvm/cover/20210302203827.437645-1-yi.l.liu@intel.com/
> [3] https://github.com/yiliu1765/qemu/tree/zhenzhong/iommufd_nesting_rfcv3
> 
> Thanks
> Zhenzhong
> 
> Changelog:
> rfcv3:
> - s/hwpt_id/id in iommufd_backend_invalidate_cache()'s parameter (Shameer)
> - hide vtd vendor specific caps in a wrapper union (Eric, Nicolin)
> - simplify return value check of get_cap() (Eric)
> - drop realize_late (Cedric, Eric)
> - split patch13:intel_iommu: Add PASID cache management infrastructure (Eric)
> - s/vtd_pasid_cache_reset/vtd_pasid_cache_reset_locked (Eric)
> - s/vtd_pe_get_domain_id/vtd_pe_get_did (Eric)
> - refine comments (Eric, Donald)
> 
> rfcv2:
> - Drop VTDPASIDAddressSpace and use VTDAddressSpace (Eric, Liuyi)
> - Move HWPT uAPI patches ahead(patch1-8) so arm nesting could easily rebase
> - add two cleanup patches(patch9-10)
> - VFIO passes iommufd/devid/hwpt_id to vIOMMU instead of iommufd/devid/ioas_id
> - add vtd_as_[from|to]_iommu_pasid() helper to translate between vtd_as and
>    iommu pasid, this is important for dropping VTDPASIDAddressSpace
> 
> 
> Yi Liu (3):
>    intel_iommu: Replay pasid binds after context cache invalidation
>    intel_iommu: Propagate PASID-based iotlb invalidation to host
>    intel_iommu: Refresh pasid bind when either SRTP or TE bit is changed
> 
> Zhenzhong Duan (18):
>    backends/iommufd: Add a helper to invalidate user-managed HWPT
>    vfio/iommufd: Add properties and handlers to
>      TYPE_HOST_IOMMU_DEVICE_IOMMUFD
>    vfio/iommufd: Initialize iommufd specific members in
>      HostIOMMUDeviceIOMMUFD
>    vfio/iommufd: Implement [at|de]tach_hwpt handlers
>    vfio/iommufd: Save vendor specific device info
>    iommufd: Implement query of host VTD IOMMU's capability
>    intel_iommu: Rename vtd_ce_get_rid2pasid_entry to
>      vtd_ce_get_pasid_entry
>    intel_iommu: Optimize context entry cache utilization
>    intel_iommu: Check for compatibility with IOMMUFD backed device when
>      x-flts=on
>    intel_iommu: Introduce a new structure VTDHostIOMMUDevice
>    intel_iommu: Introduce two helpers vtd_as_from/to_iommu_pasid_locked
>    intel_iommu: Handle PASID entry removing and updating
>    intel_iommu: Handle PASID entry adding
>    intel_iommu: Introduce a new pasid cache invalidation type FORCE_RESET
>    intel_iommu: Bind/unbind guest page table to host
>    intel_iommu: ERRATA_772415 workaround
>    intel_iommu: Bypass replay in stage-1 page table mode
>    intel_iommu: Enable host device when x-flts=on in scalable mode
> 
>   hw/i386/intel_iommu_internal.h     |   56 +
>   include/hw/i386/intel_iommu.h      |   33 +-
>   include/system/host_iommu_device.h |   32 +
>   include/system/iommufd.h           |   54 +
>   backends/iommufd.c                 |   94 +-
>   hw/i386/intel_iommu.c              | 1670 ++++++++++++++++++++++++----
>   hw/vfio/iommufd.c                  |   40 +
>   backends/trace-events              |    1 +
>   hw/i386/trace-events               |   13 +
>   9 files changed, 1791 insertions(+), 202 deletions(-)
> 




* Re: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to host
  2025-05-26  7:24         ` Yi Liu
@ 2025-05-26 17:35           ` Nicolin Chen
  2025-05-28  7:12             ` Duan, Zhenzhong
  0 siblings, 1 reply; 63+ messages in thread
From: Nicolin Chen @ 2025-05-26 17:35 UTC (permalink / raw)
  To: Yi Liu
  Cc: Peter Xu, Zhenzhong Duan, qemu-devel, alex.williamson, clg,
	eric.auger, mst, jasowang, ddutile, jgg,
	shameerali.kolothum.thodi, joao.m.martins, clement.mathieu--drif,
	kevin.tian, chao.p.peng, Yi Sun, Marcel Apfelbaum, Paolo Bonzini,
	Richard Henderson, Eduardo Habkost

OK. Let me clarify this at the top as I see the gap here now:

First, the vSMMU model is based on Zhenzhong's older series that
keeps an ioas_id in the HostIOMMUDeviceIOMMUFD structure, which
now it only keeps an hwpt_id in this RFCv3 series. This ioas_id
is allocated when a passthrough cdev attaches to a VFIO container.

Second, the vSMMU model reuses the default IOAS via that ioas_id.
Since the VFIO container doesn't allocate a nesting parent S2 HWPT
(maybe it could?), the vSMMU allocates another S2 HWPT in the
vIOMMU code.

Third, the vSMMU model, for invalidation efficiency and HW Queue
support, isolates all emulated devices out of the nesting-enabled
vSMMU instance, suggested by Jason. So, only passthrough devices
would use the nesting-enabled vSMMU instance, meaning there is no
need of IOMMU_NOTIFIER_IOTLB_EVENTS:
 - MAP is not needed as there is no shadow page table. QEMU only
   traps the page table pointer and forwards it to host kernel.
 - UNMAP is not needed as QEMU only traps invalidation requests
   and forwards them to host kernel.

(let's forget about the "address space switch" for MSI for now.)

So, in the vSMMU model, there is actually no need for the iommu
AS. And there is only one IOAS in the VM instance allocated by the
VFIO container. And this IOAS manages the GPA->PA mappings. So,
get_address_space() returns the system AS for passthrough devices.

On the other hand, the VT-d model is a bit different. It's a giant
vIOMMU for all devices (either passthrough or emulated). For all
emulated devices, it needs IOMMU_NOTIFIER_IOTLB_EVENTS, i.e. the
iommu address space returned via get_address_space().

That being said, IOMMU_NOTIFIER_IOTLB_EVENTS should not be needed
for passthrough devices, right?

IIUIC, in the VT-d model, a passthrough device also gets attached
to the VFIO container via iommufd_cdev_attach, allocating an IOAS.
But it returns the iommu address space, treating them like those
emulated devices, although the underlying MR of the returned IOMMU
AS is backed by a nodmar MR (that is essentially a system AS).

This seems to completely ignore the default IOAS owned by the VFIO
container, because it needs to bypass those RO mappings(?)

Then for passthrough devices, the VT-d model allocates an internal
IOAS that further requires an internal S2 listener, which seems a
large duplication of what the VFIO container already does..

So, here are things that I want us to conclude:
 1) Since the VFIO container already has an IOAS for a passthrough
    device, and IOMMU_NOTIFIER_IOTLB_EVENTS doesn't seem to be needed,
    why not set up this default IOAS to manage gPA=>PA mappings by
    returning the system AS via get_address_space() for passthrough
    devices?

    I get that the VT-d model might have some concern against this,
    as the default listener would map those RO regions. Yet, maybe
    the right approach is to figure out a way to bypass RO regions
    in the core vs. duplicating another ioas_alloc()/map() and S2
    listener?

 2) If (1) makes sense, I think we can further simplify the routine
    by allocating a nesting parent HWPT in iommufd_cdev_attach(),
    as long as the attaching device is identified as "passthrough"
    and there is "iommufd" in its "-device" string?

    After all, IOMMU_HWPT_ALLOC_NEST_PARENT is a common flag.
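Proposal 2) could be sketched roughly as below. The helper name and its
inputs are hypothetical; only IOMMU_HWPT_ALLOC_NEST_PARENT itself comes
from the iommufd uAPI, and its value is mirrored here for the sketch
(the thread itself notes that linux/iommufd.h can't be included
portably):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mirrored from the iommufd uAPI for this sketch only (see the
 * discussion in patch 05 about mirroring instead of including
 * linux/iommufd.h). */
#define IOMMU_HWPT_ALLOC_NEST_PARENT (1u << 0)

/* Hypothetical decision helper for iommufd_cdev_attach(): allocate
 * the S2 HWPT as a nesting parent only when the attaching device is
 * a passthrough device that was given an "iommufd" backend on its
 * "-device" line. */
static uint32_t attach_hwpt_alloc_flags(bool is_passthrough,
                                        bool has_iommufd_backend)
{
    return (is_passthrough && has_iommufd_backend)
               ? IOMMU_HWPT_ALLOC_NEST_PARENT : 0;
}
```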

On Mon, May 26, 2025 at 03:24:50PM +0800, Yi Liu wrote:
> vfio_listener_region_add, section->mr->name: pc.bios, iova: fffc0000, size:
> 40000, vaddr: 7fb314200000, RO
> vfio_listener_region_add, section->mr->name: pc.rom, iova: c0000, size:
> 20000, vaddr: 7fb206c00000, RO
..
> vfio_listener_region_add, section->mr->name: pc.ram, iova: ce000, size:
> 1a000, vaddr: 7fb207ece000, RO

OK. They look like memory carveouts for FWs. "iova" is gPA right?

And they can be in the range of a guest RAM..

Mind elaborating why they shouldn't be mapped onto nesting parent
S2?

> IMHO. At least for vfio devices, I can see only one get_address_space()
> call. So even there are two ASs, how should the vfio be notified when the
> AS changed? Since vIOMMU is the source of map/umap requests, it looks fine
> to always return iommu AS and handle the AS switch by switching the enabled
> subregions according to the guest vIOMMU translation types.

No, VFIO doesn't get notified when the AS changes.

The vSMMU model wants VFIO to stay in the system AS since the VFIO
container manages the S2 mappings for guest PA.

The "switch" in the vSMMU model is only needed by KVM for MSI doorbell
translation. Thinking about it carefully, maybe it shouldn't switch
the AS, because VFIO might be confused if it somehow calls
get_address_space again in the future..

Thanks
Nic



* RE: [PATCH rfcv3 05/21] vfio/iommufd: Save vendor specific device info
  2025-05-26 12:15   ` Cédric Le Goater
@ 2025-05-27  2:12     ` Duan, Zhenzhong
  0 siblings, 0 replies; 63+ messages in thread
From: Duan, Zhenzhong @ 2025-05-27  2:12 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, eric.auger@redhat.com, mst@redhat.com,
	jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
	jgg@nvidia.com, nicolinc@nvidia.com,
	shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
	clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
	Peng, Chao P

Hi Cédric,

>-----Original Message-----
>From: Cédric Le Goater <clg@redhat.com>
>Subject: Re: [PATCH rfcv3 05/21] vfio/iommufd: Save vendor specific device info
>
>Hello Zhenzhong,
>
>On 5/21/25 13:14, Zhenzhong Duan wrote:
>> Some device information returned by ioctl(IOMMU_GET_HW_INFO) are vendor
>> specific. Save them all in a new defined structure mirroring that vendor
>> IOMMU's structure, then get_cap() can query those information for
>> capability.
>>
>> We can't use the vendor IOMMU's structure directly because they are in
>> linux/iommufd.h which breaks build on windows.
>>
>> Suggested-by: Eric Auger <eric.auger@redhat.com>
>> Suggested-by: Nicolin Chen <nicolinc@nvidia.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>>   include/system/host_iommu_device.h | 12 ++++++++++++
>>   hw/vfio/iommufd.c                  | 12 ++++++++++++
>>   2 files changed, 24 insertions(+)
>>
>> diff --git a/include/system/host_iommu_device.h
>b/include/system/host_iommu_device.h
>> index 809cced4ba..908bfe32c7 100644
>> --- a/include/system/host_iommu_device.h
>> +++ b/include/system/host_iommu_device.h
>> @@ -15,6 +15,17 @@
>>   #include "qom/object.h"
>>   #include "qapi/error.h"
>>
>> +/* This is mirror of struct iommu_hw_info_vtd */
>> +typedef struct Vtd_Caps {
>
>please name the struct VtdCaps instead.

Will do.

Thanks
Zhenzhong


* RE: [PATCH rfcv3 00/21] intel_iommu: Enable stage-1 translation for passthrough device
  2025-05-26 12:19 ` [PATCH rfcv3 00/21] intel_iommu: Enable stage-1 translation for passthrough device Cédric Le Goater
@ 2025-05-27  2:16   ` Duan, Zhenzhong
  0 siblings, 0 replies; 63+ messages in thread
From: Duan, Zhenzhong @ 2025-05-27  2:16 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, eric.auger@redhat.com, mst@redhat.com,
	jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
	jgg@nvidia.com, nicolinc@nvidia.com,
	shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
	clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
	Peng, Chao P



>-----Original Message-----
>From: Cédric Le Goater <clg@redhat.com>
>Subject: Re: [PATCH rfcv3 00/21] intel_iommu: Enable stage-1 translation for
>passthrough device
>
>On 5/21/25 13:14, Zhenzhong Duan wrote:
>> Hi,
>>
>> Per Jason Wang's suggestion, iommufd nesting series[1] is split into
>> "Enable stage-1 translation for emulated device" series and
>> "Enable stage-1 translation for passthrough device" series.
>>
>> This series is 2nd part focusing on passthrough device. We don't do
>> shadowing of guest page table for passthrough device but pass stage-1
>> page table to host side to construct a nested domain. There was some
>> effort to enable this feature in old days, see [2] for details.
>>
>> The key design is to utilize the dual-stage IOMMU translation
>> (also known as IOMMU nested translation) capability in host IOMMU.
>> As the below diagram shows, guest I/O page table pointer in GPA
>> (guest physical address) is passed to host and be used to perform
>> the stage-1 address translation. Along with it, modifications to
>> present mappings in the guest I/O page table should be followed
>> with an IOTLB invalidation.
>>
>>          .-------------.  .---------------------------.
>>          |   vIOMMU    |  | Guest I/O page table      |
>>          |             |  '---------------------------'
>>          .----------------/
>>          | PASID Entry |--- PASID cache flush --+
>>          '-------------'                        |
>>          |             |                        V
>>          |             |           I/O page table pointer in GPA
>>          '-------------'
>>      Guest
>>      ------| Shadow |---------------------------|--------
>>            v        v                           v
>>      Host
>>          .-------------.  .------------------------.
>>          |   pIOMMU    |  | Stage1 for GIOVA->GPA  |
>>          |             |  '------------------------'
>>          .----------------/  |
>>          | PASID Entry |     V (Nested xlate)
>>          '----------------\.--------------------------------------.
>>          |             |   | Stage2 for GPA->HPA, unmanaged domain|
>>          |             |   '--------------------------------------'
>>          '-------------'
>> For history reason, there are different namings in different VTD spec rev,
>> Where:
>>   - Stage1 = First stage = First level = flts
>>   - Stage2 = Second stage = Second level = slts
>> <Intel VT-d Nested translation>
>>
>> There are some interactions between VFIO and vIOMMU
>> * vIOMMU registers PCIIOMMUOps [set|unset]_iommu_device to PCI
>>    subsystem. VFIO calls them to register/unregister HostIOMMUDevice
>>    instance to vIOMMU at vfio device realize stage.
>> * vIOMMU calls HostIOMMUDeviceIOMMUFD interface [at|de]tach_hwpt
>>    to bind/unbind device to IOMMUFD backed domains, either nested
>>    domain or not.
>>
>> See below diagram:
>>
>>          VFIO Device                                 Intel IOMMU
>>      .-----------------.                         .-------------------.
>>      |                 |                         |                   |
>>      |       .---------|PCIIOMMUOps              |.-------------.    |
>>      |       | IOMMUFD |(set_iommu_device)       || Host IOMMU  |    |
>>      |       | Device  |------------------------>|| Device list |    |
>>      |       .---------|(unset_iommu_device)     |.-------------.    |
>>      |                 |                         |       |           |
>>      |                 |                         |       V           |
>>      |       .---------|  HostIOMMUDeviceIOMMUFD |  .-------------.  |
>>      |       | IOMMUFD |            (attach_hwpt)|  | Host IOMMU  |  |
>>      |       | link    |<------------------------|  |   Device    |  |
>>      |       .---------|            (detach_hwpt)|  .-------------.  |
>>      |                 |                         |       |           |
>>      |                 |                         |       ...         |
>>      .-----------------.                         .-------------------.
>>
>> Based on Yi's suggestion, this design is optimal in sharing ioas/hwpt
>> whenever possible and create new one on demand, also supports multiple
>> iommufd objects and ERRATA_772415.
>>
>> E.g., Under one guest's scope, Stage-2 page table could be shared by different
>> devices if there is no conflict and devices link to same iommufd object,
>> i.e. devices under same host IOMMU can share same stage-2 page table. If
>there
>> is conflict, i.e. there is one device under non cache coherency mode which is
>> different from others, it requires a separate stage-2 page table in non-CC mode.
>>
>> SPR platform has ERRATA_772415 which requires no readonly mappings
>> in stage-2 page table. This series supports creating VTDIOASContainer
>> with no readonly mappings. If there is a rare case that some IOMMUs
>> on a multiple IOMMU host have ERRATA_772415 and others not, this
>> design can still survive.
>>
>> See below example diagram for a full view:
>>
>>        IntelIOMMUState
>>               |
>>               V
>>      .------------------.    .------------------.    .-------------------.
>>      | VTDIOASContainer |--->| VTDIOASContainer |--->| VTDIOASContainer  |--
>>...
>>      | (iommufd0,RW&RO) |    | (iommufd1,RW&RO) |    | (iommufd0,only RW)|
>>      .------------------.    .------------------.    .-------------------.
>>               |                       |                              |
>>               |                       .-->...                        |
>>               V                                                      V
>>        .-------------------.    .-------------------.          .---------------.
>>        |   VTDS2Hwpt(CC)   |--->| VTDS2Hwpt(non-CC) |-->...    | VTDS2Hwpt(CC) |-
>->...
>>        .-------------------.    .-------------------.          .---------------.
>>            |            |               |                            |
>>            |            |               |                            |
>>      .-----------.  .-----------.  .------------.              .------------.
>>      | IOMMUFD   |  | IOMMUFD   |  | IOMMUFD    |              | IOMMUFD    |
>>      | Device(CC)|  | Device(CC)|  | Device     |              | Device(CC) |
>>      | (iommufd0)|  | (iommufd0)|  | (non-CC)   |              | (errata)   |
>>      |           |  |           |  | (iommufd0) |              | (iommufd0) |
>>      .-----------.  .-----------.  .------------.              .------------.
>>
>> This series is also a prerequisite work for vSVA, i.e. Sharing
>> guest application address space with passthrough devices.
>>
>> To enable stage-1 translation, only need to add "x-scalable-mode=on,x-flts=on".
>> i.e. -device intel-iommu,x-scalable-mode=on,x-flts=on...
>>
>> Passthrough device should use iommufd backend to work with stage-1
>translation.
>> i.e. -object iommufd,id=iommufd0 -device vfio-pci,iommufd=iommufd0,...
>>
>> If host doesn't support nested translation, qemu will fail with an unsupported
>> report.
>>
>> Test done:
>> - VFIO devices hotplug/unplug
>> - different VFIO devices linked to different iommufds
>> - vhost net device ping test
>>
>> Fault report isn't supported in this series, we presume guest kernel always
>> construct correct S1 page table for passthrough device. For emulated devices,
>> the emulation code already provided S1 fault injection.
>>
>> PATCH1-6:  Add HWPT-based nesting infrastructure support
>
>The first 6 patches are all VFIO or IOMMUFD related. They are
>mostly  additions and I didn't see anything wrong. They could
>be merged in advance through the VFIO tree.

OK, I'll send a prerequisite series containing only the first 6 patches
with suggested changes recently.

Thanks
Zhenzhong


* RE: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to host
  2025-05-26 17:35           ` Nicolin Chen
@ 2025-05-28  7:12             ` Duan, Zhenzhong
  2025-06-12 12:53               ` Yi Liu
  2025-06-16  5:47               ` Nicolin Chen
  0 siblings, 2 replies; 63+ messages in thread
From: Duan, Zhenzhong @ 2025-05-28  7:12 UTC (permalink / raw)
  To: Nicolin Chen, Liu, Yi L
  Cc: Peter Xu, qemu-devel@nongnu.org, alex.williamson@redhat.com,
	clg@redhat.com, eric.auger@redhat.com, mst@redhat.com,
	jasowang@redhat.com, ddutile@redhat.com, jgg@nvidia.com,
	shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
	clement.mathieu--drif@eviden.com, Tian, Kevin, Peng, Chao P,
	Yi Sun, Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost



>-----Original Message-----
>From: Nicolin Chen <nicolinc@nvidia.com>
>Subject: Re: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to
>host
>
>OK. Let me clarify this at the top as I see the gap here now:
>
>First, the vSMMU model is based on Zhenzhong's older series that
>keeps an ioas_id in the HostIOMMUDeviceIOMMUFD structure, which
>now it only keeps an hwpt_id in this RFCv3 series. This ioas_id
>is allocated when a passthrough cdev attaches to a VFIO container.
>
>Second, the vSMMU model reuses the default IOAS via that ioas_id.
>Since the VFIO container doesn't allocate a nesting parent S2 HWPT
>(maybe it could?), so the vSMMU allocates another S2 HWPT in the
>vIOMMU code.
>
>Third, the vSMMU model, for invalidation efficiency and HW Queue
>support, isolates all emulated devices out of the nesting-enabled
>vSMMU instance, suggested by Jason. So, only passthrough devices
>would use the nesting-enabled vSMMU instance, meaning there is no
>need of IOMMU_NOTIFIER_IOTLB_EVENTS:

I see. Then you need to check whether there is an emulated device under the nesting-enabled vSMMU and fail if there is.

> - MAP is not needed as there is no shadow page table. QEMU only
>   traps the page table pointer and forwards it to host kernel.
> - UNMAP is not needed as QEMU only traps invalidation requests
>   and forwards them to host kernel.
>
>(let's forget about the "address space switch" for MSI for now.)
>
>So, in the vSMMU model, there is actually no need for the iommu
>AS. And there is only one IOAS in the VM instance allocated by the
>VFIO container. And this IOAS manages the GPA->PA mappings. So,
>get_address_space() returns the system AS for passthrough devices.
>
>On the other hand, the VT-d model is a bit different. It's a giant
>vIOMMU for all devices (either passthrough or emualted). For all
>emulated devices, it needs IOMMU_NOTIFIER_IOTLB_EVENTS, i.e. the
>iommu address space returned via get_address_space().
>
>That being said, IOMMU_NOTIFIER_IOTLB_EVENTS should not be needed
>for passthrough devices, right?

No. Even if x-flts=on is configured on the QEMU command line, that only
means the virtual VT-d supports stage-1 translation; the guest can still
choose to run in legacy mode (stage-2), e.g., with the kernel command
line intel_iommu=on,sm_off.

So before the guest runs, we don't know which kind of page table
(stage-1 or stage-2) the guest will use for this VFIO device. We
therefore have to use the iommu AS to catch stage-2 MAP events
if the guest chooses stage-2.
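This mode-dependency can be modeled in a small sketch (enum and helper
names are illustrative, not QEMU API): with x-flts=on the vIOMMU only
advertises stage-1 capability, and until the guest programs a mode the
device must stay in the iommu AS so MAP/UNMAP shadowing remains
possible.

```c
#include <assert.h>
#include <stdbool.h>

/* The guest-selected translation mode is unknown before boot; only a
 * guest that actually uses stage-1 lets QEMU skip page-table
 * shadowing. */
typedef enum {
    GUEST_MODE_UNKNOWN,       /* guest hasn't programmed the vIOMMU yet */
    GUEST_MODE_STAGE1,        /* scalable mode, S1 table passed to host */
    GUEST_MODE_LEGACY_STAGE2, /* legacy mode, QEMU must shadow via MAP */
} GuestIommuMode;

static bool needs_map_unmap_notifiers(GuestIommuMode mode)
{
    /* Only a confirmed stage-1 guest can do without MAP/UNMAP events;
     * unknown must be treated conservatively, hence the iommu AS. */
    return mode != GUEST_MODE_STAGE1;
}
```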

>
>IIUIC, in the VT-d model, a passthrough device also gets attached
>to the VFIO container via iommufd_cdev_attach, allocating an IOAS.
>But it returns the iommu address space, treating them like those
>emulated devices, although the underlying MR of the returned IOMMU
>AS is backed by a nodmar MR (that is essentially a system AS).
>
>This seems to completely ignore the default IOAS owned by the VFIO
>container, because it needs to bypass those RO mappings(?)
>
>Then for passthrough devices, the VT-d model allocates an internal
>IOAS that further requires an internal S2 listener, which seems an
>large duplication of what the VFIO container already does..
>
>So, here are things that I want us to conclude:
> 1) Since the VFIO container already has an IOAS for a passthrough
>    device, and IOMMU_NOTIFIER_IOTLB_EVENTS isn't seemingly needed,
>    why not setup this default IOAS to manage gPA=>PA mappings by
>    returning the system AS via get_address_space() for passthrough
>    devices?
>
>    I got that the VT-d model might have some concern against this,
>    as the default listener would map those RO regions. Yet, maybe
>    the right approach is to figure out a way to bypass RO regions
>    in the core v.s. duplicating another ioas_alloc()/map() and S2
>    listener?
>
> 2) If (1) makes sense, I think we can further simplify the routine
>    by allocating a nesting parent HWPT in iommufd_cdev_attach(),
>    as long as the attaching device is identified as "passthrough"
>    and there is "iommufd" in its "-device" string?
>
>    After all, IOMMU_HWPT_ALLOC_NEST_PARENT is a common flag.
>
>On Mon, May 26, 2025 at 03:24:50PM +0800, Yi Liu wrote:
>> vfio_listener_region_add, section->mr->name: pc.bios, iova: fffc0000, size:
>> 40000, vaddr: 7fb314200000, RO
>> vfio_listener_region_add, section->mr->name: pc.rom, iova: c0000, size:
>> 20000, vaddr: 7fb206c00000, RO
>..
>> vfio_listener_region_add, section->mr->name: pc.ram, iova: ce000, size:
>> 1a000, vaddr: 7fb207ece000, RO
>
>OK. They look like memory carveouts for FWs. "iova" is gPA right?
>
>And they can be in the range of a guest RAM..
>
>Mind elaborating why they shouldn't be mapped onto nesting parent
>S2?
>
>> IMHO. At least for vfio devices, I can see only one get_address_space()
>> call. So even there are two ASs, how should the vfio be notified when the
>> AS changed? Since vIOMMU is the source of map/umap requests, it looks fine
>> to always return iommu AS and handle the AS switch by switching the enabled
>> subregions according to the guest vIOMMU translation types.
>
>No, VFIO doesn't get notified when the AS changes.
>
>The vSMMU model wants VFIO to stay in the system AS since the VFIO
>container manages the S2 mappings for guest PA.
>
>The "switch" in vSMMU model is only needed by KVM for MSI doorbell
>translation. By thinking it carefully, maybe it shouldn't switch AS
>because VFIO might be confused if it somehow does get_address_space
>again in the future..
>
>Thanks
>Nic



* Re: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to host
  2025-05-28  7:12             ` Duan, Zhenzhong
@ 2025-06-12 12:53               ` Yi Liu
  2025-06-12 14:06                 ` Shameerali Kolothum Thodi via
                                   ` (2 more replies)
  2025-06-16  5:47               ` Nicolin Chen
  1 sibling, 3 replies; 63+ messages in thread
From: Yi Liu @ 2025-06-12 12:53 UTC (permalink / raw)
  To: Duan, Zhenzhong, Nicolin Chen
  Cc: Peter Xu, qemu-devel@nongnu.org, alex.williamson@redhat.com,
	clg@redhat.com, eric.auger@redhat.com, mst@redhat.com,
	jasowang@redhat.com, ddutile@redhat.com, jgg@nvidia.com,
	shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
	clement.mathieu--drif@eviden.com, Tian, Kevin, Peng, Chao P,
	Yi Sun, Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost

On 2025/5/28 15:12, Duan, Zhenzhong wrote:
> 
> 
>> -----Original Message-----
>> From: Nicolin Chen <nicolinc@nvidia.com>
>> Subject: Re: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to
>> host
>>
>> OK. Let me clarify this at the top as I see the gap here now:
>>
>> First, the vSMMU model is based on Zhenzhong's older series that
>> keeps an ioas_id in the HostIOMMUDeviceIOMMUFD structure, which
>> now it only keeps an hwpt_id in this RFCv3 series. This ioas_id
>> is allocated when a passthrough cdev attaches to a VFIO container.
>>
>> Second, the vSMMU model reuses the default IOAS via that ioas_id.
>> Since the VFIO container doesn't allocate a nesting parent S2 HWPT
>> (maybe it could?), so the vSMMU allocates another S2 HWPT in the
>> vIOMMU code.
>>
>> Third, the vSMMU model, for invalidation efficiency and HW Queue
>> support, isolates all emulated devices out of the nesting-enabled
>> vSMMU instance, suggested by Jason. So, only passthrough devices
>> would use the nesting-enabled vSMMU instance, meaning there is no
>> need of IOMMU_NOTIFIER_IOTLB_EVENTS:
> 
> I see, then you need to check if there is emulated device under nesting-enabled vSMMU and fail if there is.
> 
>> - MAP is not needed as there is no shadow page table. QEMU only
>>    traps the page table pointer and forwards it to host kernel.
>> - UNMAP is not needed as QEMU only traps invalidation requests
>>    and forwards them to host kernel.
>>
>> (let's forget about the "address space switch" for MSI for now.)
>>
>> So, in the vSMMU model, there is actually no need for the iommu
>> AS. And there is only one IOAS in the VM instance allocated by the
>> VFIO container. And this IOAS manages the GPA->PA mappings. So,
>> get_address_space() returns the system AS for passthrough devices.
>>
>> On the other hand, the VT-d model is a bit different. It's a giant
>> vIOMMU for all devices (either passthrough or emualted). For all
>> emulated devices, it needs IOMMU_NOTIFIER_IOTLB_EVENTS, i.e. the
>> iommu address space returned via get_address_space().
>>
>> That being said, IOMMU_NOTIFIER_IOTLB_EVENTS should not be needed
>> for passthrough devices, right?
> 
> No, even if x-flts=on is configured in QEMU cmdline, that only mean virtual vtd
> supports stage-1 translation, guest still can choose to run in legacy mode(stage2),
> e.g., with kernel cmdline intel_iommu=on,sm_off
> 
> So before guest run, we don't know which kind of page table either stage1 or stage2
> for this VFIO device by guest. So we have to use iommu AS to catch stage2's MAP event
> if guest choose stage2.

@Zhenzhong, if the guest decides to use legacy mode, then the vIOMMU
should switch the MRs of the device's AS; the IOAS created by the VFIO
container would then use IOMMU_NOTIFIER_IOTLB_EVENTS since the MR is
switched to the IOMMU MR. So it should be able to support shadowing the
guest I/O page table. Hence, this should not be a problem.

@Nicolin, I think your major point is making the VFIO container IOAS a
GPA IOAS (always return the system AS in the get_address_space op) and
reusing it when setting up nested translation. Is that right? I think
it should work if:
1) we can let the vfio memory listener filter out the RO pages per the
    vIOMMU's request. But I don't want the get_address_space op to
    always return the system AS, for the reason mentioned by Zhenzhong
    above.
2) we can disallow mixing emulated and passthru devices behind the same
    pcie-pci bridge[1]. For emulated devices, the AS should switch to
    the iommu MR, while for passthru devices, the AS needs to stick
    with the system MR so that the VFIO container IOAS can be kept as a
    GPA IOAS. To support this, letting the AS switch to the iommu MR
    while having a separate GPA IOAS is needed. This separate GPA IOAS
    can be shared by all the passthru devices.

[1] 
https://lore.kernel.org/all/SJ0PR11MB6744E2BA00BBE677B2B49BE99265A@SJ0PR11MB6744.namprd11.prod.outlook.com/#t

So basically, we are ok with your idea. But we should decide if it is
necessary to support the topology in 2). I think this is a general
question. TBH, I don't have much information to judge if it is
valuable. Perhaps, let's hear from more people.
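Point 1) above can be sketched as a tiny predicate (the helper is
hypothetical, not existing QEMU code): the vfio memory listener would
skip a read-only section whenever the vIOMMU asks for RO filtering,
e.g. for ERRATA_772415, where the S2 page table must not contain
read-only mappings.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical filter for the VFIO memory listener: map a section
 * into the S2 unless it is read-only AND the vIOMMU requested that
 * RO mappings be excluded (ERRATA_772415 case). */
static bool vfio_listener_section_wanted(bool section_readonly,
                                         bool viommu_filters_ro)
{
    return !(section_readonly && viommu_filters_ro);
}
```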

>>
>> IIUIC, in the VT-d model, a passthrough device also gets attached
>> to the VFIO container via iommufd_cdev_attach, allocating an IOAS.
>> But it returns the iommu address space, treating them like those
>> emulated devices, although the underlying MR of the returned IOMMU
>> AS is backed by a nodmar MR (that is essentially a system AS).
>>
>> This seems to completely ignore the default IOAS owned by the VFIO
>> container, because it needs to bypass those RO mappings(?)
>>
>> Then for passthrough devices, the VT-d model allocates an internal
>> IOAS that further requires an internal S2 listener, which seems an
>> large duplication of what the VFIO container already does..
>>
>> So, here are things that I want us to conclude:
>> 1) Since the VFIO container already has an IOAS for a passthrough
>>     device, and IOMMU_NOTIFIER_IOTLB_EVENTS isn't seemingly needed,
>>     why not setup this default IOAS to manage gPA=>PA mappings by
>>     returning the system AS via get_address_space() for passthrough
>>     devices?
>>
>>     I got that the VT-d model might have some concern against this,
>>     as the default listener would map those RO regions. Yet, maybe
>>     the right approach is to figure out a way to bypass RO regions
>>     in the core v.s. duplicating another ioas_alloc()/map() and S2
>>     listener?
>>
>> 2) If (1) makes sense, I think we can further simplify the routine
>>     by allocating a nesting parent HWPT in iommufd_cdev_attach(),
>>     as long as the attaching device is identified as "passthrough"
>>     and there is "iommufd" in its "-device" string?
>>
>>     After all, IOMMU_HWPT_ALLOC_NEST_PARENT is a common flag.
>>
>> On Mon, May 26, 2025 at 03:24:50PM +0800, Yi Liu wrote:
>>> vfio_listener_region_add, section->mr->name: pc.bios, iova: fffc0000, size:
>>> 40000, vaddr: 7fb314200000, RO
>>> vfio_listener_region_add, section->mr->name: pc.rom, iova: c0000, size:
>>> 20000, vaddr: 7fb206c00000, RO
>> ..
>>> vfio_listener_region_add, section->mr->name: pc.ram, iova: ce000, size:
>>> 1a000, vaddr: 7fb207ece000, RO
>>
>> OK. They look like memory carveouts for FWs. "iova" is gPA right?
>>
>> And they can be in the range of a guest RAM..
>>
>> Mind elaborating why they shouldn't be mapped onto nesting parent
>> S2?

@Nicolin, It's due to ERRATA_772415.

>>> IMHO. At least for vfio devices, I can see only one get_address_space()
>>> call. So even there are two ASs, how should the vfio be notified when the
>>> AS changed? Since vIOMMU is the source of map/umap requests, it looks fine
>>> to always return iommu AS and handle the AS switch by switching the enabled
>>> subregions according to the guest vIOMMU translation types.
>>
>> No, VFIO doesn't get notified when the AS changes.
>>
>> The vSMMU model wants VFIO to stay in the system AS since the VFIO
>> container manages the S2 mappings for guest PA.
>>
>> The "switch" in vSMMU model is only needed by KVM for MSI doorbell
>> translation. By thinking it carefully, maybe it shouldn't switch AS
>> because VFIO might be confused if it somehow does get_address_space
>> again in the future..

@Nicolin, I don't quite get the detailed logic for the MSI stuff on
SMMU. But I agree with the last sentence. get_address_space should
return a consistent AS.

-- 
Regards,
Yi Liu



* RE: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to host
  2025-06-12 12:53               ` Yi Liu
@ 2025-06-12 14:06                 ` Shameerali Kolothum Thodi via
  2025-06-16  6:04                   ` Nicolin Chen
  2025-06-16  3:24                 ` Duan, Zhenzhong
  2025-06-16  5:59                 ` Nicolin Chen
  2 siblings, 1 reply; 63+ messages in thread
From: Shameerali Kolothum Thodi via @ 2025-06-12 14:06 UTC (permalink / raw)
  To: Yi Liu, Duan, Zhenzhong, Nicolin Chen
  Cc: Peter Xu, qemu-devel@nongnu.org, alex.williamson@redhat.com,
	clg@redhat.com, eric.auger@redhat.com, mst@redhat.com,
	jasowang@redhat.com, ddutile@redhat.com, jgg@nvidia.com,
	joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
	Tian,  Kevin, Peng, Chao P, Yi Sun, Marcel Apfelbaum,
	Paolo Bonzini, Richard Henderson, Eduardo Habkost



> -----Original Message-----
> From: Yi Liu <yi.l.liu@intel.com>
> Sent: Thursday, June 12, 2025 1:54 PM
> To: Duan, Zhenzhong <zhenzhong.duan@intel.com>; Nicolin Chen
> <nicolinc@nvidia.com>
> Cc: Peter Xu <peterx@redhat.com>; qemu-devel@nongnu.org;
> alex.williamson@redhat.com; clg@redhat.com; eric.auger@redhat.com;
> mst@redhat.com; jasowang@redhat.com; ddutile@redhat.com;
> jgg@nvidia.com; Shameerali Kolothum Thodi
> <shameerali.kolothum.thodi@huawei.com>; joao.m.martins@oracle.com;
> clement.mathieu--drif@eviden.com; Tian, Kevin <kevin.tian@intel.com>;
> Peng, Chao P <chao.p.peng@intel.com>; Yi Sun <yi.y.sun@linux.intel.com>;
> Marcel Apfelbaum <marcel.apfelbaum@gmail.com>; Paolo Bonzini
> <pbonzini@redhat.com>; Richard Henderson
> <richard.henderson@linaro.org>; Eduardo Habkost
> <eduardo@habkost.net>
> Subject: Re: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page
> table to host
 
> >> The "switch" in vSMMU model is only needed by KVM for MSI doorbell
> >> translation. By thinking it carefully, maybe it shouldn't switch AS
> >> because VFIO might be confused if it somehow does get_address_space
> >> again in the future..
> 
> @Nicolin, not quite get the detailed logic for the MSI stuff on SMMU. But I
> agree with the last sentence. get_address_space should return a consistent
> AS.

I think it is because, in the Arm world, the MSI doorbell address is
translated by an IOMMU. Hence, if the guest device is behind the IOMMU,
the IOMMU AS needs to be returned in:

kvm_irqchip_add_msi_route()
 kvm_arch_fixup_msi_route()
   pci_device_iommu_address_space()  --> .get_address_space()  --> At this point we now return the IOMMU AS.

If not, the device will be configured with a wrong MSI doorbell address.
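A simplified model of that fixup step (not real QEMU code; the
function names and the fixed-offset "translation" are purely
illustrative): the doorbell address written into the KVM route must be
the IOMMU-translated output, not the guest-programmed IOVA, whenever
the device sits behind the vIOMMU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t hwaddr;

/* Stand-in for a vSMMU translation of the doorbell IOVA; a real walk
 * would consult the guest stage-1 table, here it's a fixed offset. */
static hwaddr toy_iommu_translate(hwaddr iova)
{
    return iova + 0x1000;
}

/* Model of the MSI route fixup: translate only for devices behind
 * the vIOMMU, otherwise the address is already a usable gPA. */
static hwaddr fixup_msi_doorbell(hwaddr addr, bool behind_iommu)
{
    return behind_iommu ? toy_iommu_translate(addr) : addr;
}
```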

Nicolin, you seem to suggest we could avoid this switching and always
return the system AS. Does that mean we handle this KVM/MSI case
separately? Could you please detail the idea?

Thanks,
Shameer



* RE: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to host
  2025-06-12 12:53               ` Yi Liu
  2025-06-12 14:06                 ` Shameerali Kolothum Thodi via
@ 2025-06-16  3:24                 ` Duan, Zhenzhong
  2025-06-16  6:34                   ` Nicolin Chen
  2025-06-16  5:59                 ` Nicolin Chen
  2 siblings, 1 reply; 63+ messages in thread
From: Duan, Zhenzhong @ 2025-06-16  3:24 UTC (permalink / raw)
  To: Liu, Yi L, Nicolin Chen
  Cc: Peter Xu, qemu-devel@nongnu.org, alex.williamson@redhat.com,
	clg@redhat.com, eric.auger@redhat.com, mst@redhat.com,
	jasowang@redhat.com, ddutile@redhat.com, jgg@nvidia.com,
	shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
	clement.mathieu--drif@eviden.com, Tian, Kevin, Peng, Chao P,
	Yi Sun, Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost



>-----Original Message-----
>From: Liu, Yi L <yi.l.liu@intel.com>
>Subject: Re: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to
>host
>
>On 2025/5/28 15:12, Duan, Zhenzhong wrote:
>>
>>
>>> -----Original Message-----
>>> From: Nicolin Chen <nicolinc@nvidia.com>
>>> Subject: Re: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table
>to
>>> host
>>>
>>> OK. Let me clarify this at the top as I see the gap here now:
>>>
>>> First, the vSMMU model is based on Zhenzhong's older series that
>>> keeps an ioas_id in the HostIOMMUDeviceIOMMUFD structure, which
>>> now it only keeps an hwpt_id in this RFCv3 series. This ioas_id
>>> is allocated when a passthrough cdev attaches to a VFIO container.
>>>
>>> Second, the vSMMU model reuses the default IOAS via that ioas_id.
>>> Since the VFIO container doesn't allocate a nesting parent S2 HWPT
>>> (maybe it could?), so the vSMMU allocates another S2 HWPT in the
>>> vIOMMU code.
>>>
>>> Third, the vSMMU model, for invalidation efficiency and HW Queue
>>> support, isolates all emulated devices out of the nesting-enabled
>>> vSMMU instance, suggested by Jason. So, only passthrough devices
>>> would use the nesting-enabled vSMMU instance, meaning there is no
>>> need of IOMMU_NOTIFIER_IOTLB_EVENTS:
>>
>> I see, then you need to check if there is emulated device under nesting-enabled
>vSMMU and fail if there is.
>>
>>> - MAP is not needed as there is no shadow page table. QEMU only
>>>    traps the page table pointer and forwards it to host kernel.
>>> - UNMAP is not needed as QEMU only traps invalidation requests
>>>    and forwards them to host kernel.
>>>
>>> (let's forget about the "address space switch" for MSI for now.)
>>>
>>> So, in the vSMMU model, there is actually no need for the iommu
>>> AS. And there is only one IOAS in the VM instance allocated by the
>>> VFIO container. And this IOAS manages the GPA->PA mappings. So,
>>> get_address_space() returns the system AS for passthrough devices.
>>>
>>> On the other hand, the VT-d model is a bit different. It's a giant
>>> vIOMMU for all devices (either passthrough or emualted). For all
>>> emulated devices, it needs IOMMU_NOTIFIER_IOTLB_EVENTS, i.e. the
>>> iommu address space returned via get_address_space().
>>>
>>> That being said, IOMMU_NOTIFIER_IOTLB_EVENTS should not be needed
>>> for passthrough devices, right?
>>
>> No, even if x-flts=on is configured in QEMU cmdline, that only mean virtual vtd
>> supports stage-1 translation, guest still can choose to run in legacy
>mode(stage2),
>> e.g., with kernel cmdline intel_iommu=on,sm_off
>>
>> So before guest run, we don't know which kind of page table either stage1 or
>stage2
>> for this VFIO device by guest. So we have to use iommu AS to catch stage2's
>MAP event
>> if guest choose stage2.
>
>@Zheznzhong, if guest decides to use legacy mode then vIOMMU should switch
>the MRs of the device's AS, hence the IOAS created by VFIO container would
>be switched to using the IOMMU_NOTIFIER_IOTLB_EVENTS since the MR is
>switched to IOMMU MR. So it should be able to support shadowing the guest
>IO page table. Hence, this should not be a problem.
>
>@Nicolin, I think your major point is making the VFIO container IOAS as a
>GPA IOAS (always return system AS in get_address_space op) and reusing it
>when setting nested translation. Is it? I think it should work if:
>1) we can let the vfio memory listener filter out the RO pages per vIOMMU's
>    request. But I don't want the get_address_space op always return system
>    AS as the reason mentioned by Zhenzhong above.
>2) we can disallow emulated/passthru devices behind the same pcie-pci
>    bridge[1]. For emulated devices, AS should switch to iommu MR, while for
>    passthru devices, it needs the AS stick with the system MR hence be able
>    to keep the VFIO container IOAS as a GPA IOAS. To support this, let AS
>    switch to iommu MR and have a separate GPA IOAS is needed. This separate
>    GPA IOAS can be shared by all the passthru devices.
>
>[1]
>https://lore.kernel.org/all/SJ0PR11MB6744E2BA00BBE677B2B49BE99265A@SJ0
>PR11MB6744.namprd11.prod.outlook.com/#t
>
>So basically, we are ok with your idea. But we should decide if it is
>necessary to support the topology in 2). I think this is a general
>question. TBH. I don't have much information to judge if it is valuable.
>Perhaps, let's hear from more people.

Hi @Liu, Yi L @Nicolin Chen, for emulated/passthru devices behind the same pcie-pci bridge, I have an idea: add a new PCI callback:

AddressSpace * (*get_address_space_extend)(PCIBus *bus, void *opaque, int devfn, bool accel_dev);

which passes in the real bus/devfn and a new param accel_dev, which is true for a VFIO device.
VT-d implements this callback and returns a separate AS for a VFIO device if it's under a pcie-pci bridge and flts=on;
otherwise it falls back to calling .get_address_space(). This way emulated devices and passthrough devices behind the same pcie-pci bridge can have different ASes.

If the above idea is acceptable, then the only obstacle is ERRATA_772415. Maybe we can let VFIO check this errata and bypass RO mappings from the beginning?
Or we could just block running this VFIO device with flts=on if ERRATA_772415 is present and suggest running with flts=off?

Thanks
Zhenzhong



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to host
  2025-05-28  7:12             ` Duan, Zhenzhong
  2025-06-12 12:53               ` Yi Liu
@ 2025-06-16  5:47               ` Nicolin Chen
  2025-06-16  8:15                 ` Duan, Zhenzhong
  1 sibling, 1 reply; 63+ messages in thread
From: Nicolin Chen @ 2025-06-16  5:47 UTC (permalink / raw)
  To: Duan, Zhenzhong
  Cc: Liu, Yi L, Peter Xu, qemu-devel@nongnu.org,
	alex.williamson@redhat.com, clg@redhat.com, eric.auger@redhat.com,
	mst@redhat.com, jasowang@redhat.com, ddutile@redhat.com,
	jgg@nvidia.com, shameerali.kolothum.thodi@huawei.com,
	joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
	Tian, Kevin, Peng, Chao P, Yi Sun, Marcel Apfelbaum,
	Paolo Bonzini, Richard Henderson, Eduardo Habkost

Sorry for the late reply.

On Wed, May 28, 2025 at 07:12:25AM +0000, Duan, Zhenzhong wrote:
> >Third, the vSMMU model, for invalidation efficiency and HW Queue
> >support, isolates all emulated devices out of the nesting-enabled
> >vSMMU instance, suggested by Jason. So, only passthrough devices
> >would use the nesting-enabled vSMMU instance, meaning there is no
> >need of IOMMU_NOTIFIER_IOTLB_EVENTS:
> 
> I see, then you need to check if there is emulated device under nesting-enabled vSMMU and fail if there is.

Shameer is working on a multi-vSMMU model in QEMU. This gives
each VM different instances to attach devices to. And we do not plan
to support emulated devices on a nesting-enabled vSMMU instance,
which is a bit different from the VT-d model.

> >On the other hand, the VT-d model is a bit different. It's a giant
> >vIOMMU for all devices (either passthrough or emualted). For all
> >emulated devices, it needs IOMMU_NOTIFIER_IOTLB_EVENTS, i.e. the
> >iommu address space returned via get_address_space().
> >
> >That being said, IOMMU_NOTIFIER_IOTLB_EVENTS should not be needed
> >for passthrough devices, right?
> 
> No, even if x-flts=on is configured in QEMU cmdline, that only mean virtual vtd
> supports stage-1 translation, guest still can choose to run in legacy mode(stage2),
> e.g., with kernel cmdline intel_iommu=on,sm_off
> 
> So before guest run, we don't know which kind of page table either stage1 or stage2
> for this VFIO device by guest. So we have to use iommu AS to catch stage2's MAP event
> if guest choose stage2.

IIUIC, the guest kernel cmdline can switch the mode between
stage-1 (nesting) and stage-2 (legacy/emulated VT-d), right?

Thanks
Nicolin


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to host
  2025-06-12 12:53               ` Yi Liu
  2025-06-12 14:06                 ` Shameerali Kolothum Thodi via
  2025-06-16  3:24                 ` Duan, Zhenzhong
@ 2025-06-16  5:59                 ` Nicolin Chen
  2025-06-16  7:38                   ` Yi Liu
  2 siblings, 1 reply; 63+ messages in thread
From: Nicolin Chen @ 2025-06-16  5:59 UTC (permalink / raw)
  To: Yi Liu
  Cc: Duan, Zhenzhong, Peter Xu, qemu-devel@nongnu.org,
	alex.williamson@redhat.com, clg@redhat.com, eric.auger@redhat.com,
	mst@redhat.com, jasowang@redhat.com, ddutile@redhat.com,
	jgg@nvidia.com, shameerali.kolothum.thodi@huawei.com,
	joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
	Tian, Kevin, Peng, Chao P, Yi Sun, Marcel Apfelbaum,
	Paolo Bonzini, Richard Henderson, Eduardo Habkost

On Thu, Jun 12, 2025 at 08:53:40PM +0800, Yi Liu wrote:
> > > That being said, IOMMU_NOTIFIER_IOTLB_EVENTS should not be needed
> > > for passthrough devices, right?
> > 
> > No, even if x-flts=on is configured in QEMU cmdline, that only mean virtual vtd
> > supports stage-1 translation, guest still can choose to run in legacy mode(stage2),
> > e.g., with kernel cmdline intel_iommu=on,sm_off
> > 
> > So before guest run, we don't know which kind of page table either stage1 or stage2
> > for this VFIO device by guest. So we have to use iommu AS to catch stage2's MAP event
> > if guest choose stage2.
> 
> @Zheznzhong, if guest decides to use legacy mode then vIOMMU should switch
> the MRs of the device's AS, hence the IOAS created by VFIO container would
> be switched to using the IOMMU_NOTIFIER_IOTLB_EVENTS since the MR is
> switched to IOMMU MR. So it should be able to support shadowing the guest
> IO page table. Hence, this should not be a problem.
> 
> @Nicolin, I think your major point is making the VFIO container IOAS as a
> GPA IOAS (always return system AS in get_address_space op) and reusing it
> when setting nested translation. Is it? I think it should work if:
> 1) we can let the vfio memory listener filter out the RO pages per vIOMMU's
>    request.

Yes.

> But I don't want the get_address_space op always return system
>    AS as the reason mentioned by Zhenzhong above.

So, you mean the VT-d model would need a runtime notification to
switch the address space of the VFIO ioas?

TBH, I am still unclear how many cases the VT-d model would need
support here :-/

> 2) we can disallow emulated/passthru devices behind the same pcie-pci
>    bridge[1]. For emulated devices, AS should switch to iommu MR, while for
>    passthru devices, it needs the AS stick with the system MR hence be able
>    to keep the VFIO container IOAS as a GPA IOAS. To support this, let AS
>    switch to iommu MR and have a separate GPA IOAS is needed. This separate
>    GPA IOAS can be shared by all the passthru devices.

Yea, ARM is doing in a similar way.

> So basically, we are ok with your idea. But we should decide if it is
> necessary to support the topology in 2). I think this is a general
> question. TBH. I don't have much information to judge if it is valuable.
> Perhaps, let's hear from more people.

I would be okay if VT-d decides to move on with its own listener,
if that turns out to be the better option. But for ARM, I'd
like to see us reuse the VFIO container IOAS.

Thanks
Nicolin


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to host
  2025-06-12 14:06                 ` Shameerali Kolothum Thodi via
@ 2025-06-16  6:04                   ` Nicolin Chen
  0 siblings, 0 replies; 63+ messages in thread
From: Nicolin Chen @ 2025-06-16  6:04 UTC (permalink / raw)
  To: Shameerali Kolothum Thodi
  Cc: Yi Liu, Duan, Zhenzhong, Peter Xu, qemu-devel@nongnu.org,
	alex.williamson@redhat.com, clg@redhat.com, eric.auger@redhat.com,
	mst@redhat.com, jasowang@redhat.com, ddutile@redhat.com,
	jgg@nvidia.com, joao.m.martins@oracle.com,
	clement.mathieu--drif@eviden.com, Tian,  Kevin, Peng, Chao P,
	Yi Sun, Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost

On Thu, Jun 12, 2025 at 02:06:15PM +0000, Shameerali Kolothum Thodi wrote:
> > >> The "switch" in vSMMU model is only needed by KVM for MSI doorbell
> > >> translation. By thinking it carefully, maybe it shouldn't switch AS
> > >> because VFIO might be confused if it somehow does get_address_space
> > >> again in the future..
> > 
> > @Nicolin, not quite get the detailed logic for the MSI stuff on SMMU. But I
> > agree with the last sentence. get_address_space should return a consistent
> > AS.
> 
> I think it is because, in ARM world the MSI doorbell address is translated by
> an IOMMU. Hence, if the Guest device is behind IOMMU, it needs to return
> the IOMMU AS in,
> 
> kvm_irqchip_add_msi_route()
>  kvm_arch_fixup_msi_route()
>    pci_device_iommu_address_space()  --> .get_address_space()  -->At this point we now return IOMMU AS.
> 
> If not the device will be configured with a  wrong MSI doorbell address.

Yes. The KVM code on ARM needs to translate the MSI location from
gIOVA to gPA, because the MSI doorbell on ARM is behind the IOMMU.

> Nicolin, you seems to suggest we could avoid this switching and always return
> System AS. Does that mean we handle this KVM/MSI case separately?
> Could you please detail out the idea?

We could add one of the following ops:
    get_msi_address_space
    get_msi_address/translate_msi_iova

Thanks
Nicolin


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to host
  2025-06-16  3:24                 ` Duan, Zhenzhong
@ 2025-06-16  6:34                   ` Nicolin Chen
  2025-06-16  8:54                     ` Duan, Zhenzhong
  0 siblings, 1 reply; 63+ messages in thread
From: Nicolin Chen @ 2025-06-16  6:34 UTC (permalink / raw)
  To: Duan, Zhenzhong
  Cc: Liu, Yi L, Peter Xu, qemu-devel@nongnu.org,
	alex.williamson@redhat.com, clg@redhat.com, eric.auger@redhat.com,
	mst@redhat.com, jasowang@redhat.com, ddutile@redhat.com,
	jgg@nvidia.com, shameerali.kolothum.thodi@huawei.com,
	joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
	Tian, Kevin, Peng, Chao P, Yi Sun, Marcel Apfelbaum,
	Paolo Bonzini, Richard Henderson, Eduardo Habkost

On Mon, Jun 16, 2025 at 03:24:06AM +0000, Duan, Zhenzhong wrote:
> Hi @Liu, Yi L @Nicolin Chen, for emulated/passthru devices
> behind the same pcie-pci bridge, I think of an idea, adding
> a new PCI callback:
> 
> AddressSpace * (*get_address_space_extend)(PCIBus *bus, 
> void *opaque, int devfn, bool accel_dev);
>
> which pass in real bus/devfn and a new param accel_dev which
> is true for vfio device.

Just =y for all vfio (passthrough) devices?

ARM tentatively does this for get_address_space using Shameer's
trick to detect if the device is a passthrough VFIO one:

    PCIDevice *pdev = pci_find_device(bus, pci_bus_num(bus), devfn);
    bool has_iommufd = !!object_property_find(OBJECT(pdev), "iommufd");

    if (smmu->nested && ... && has_iommufd) {
        return &sdev->as_sysmem;
    }

So, I guess "accel_dev" could be just:
    !!object_property_find(OBJECT(pdev), "iommufd")
?

> Vtd implements this callback and return separate AS for vfio
> device if it's under an pcie-pci bridge and flts=on;
> otherwise it fallback to call .get_address_space(). This way
> emulated devices and passthru devices behind the same pcie-pci
> bridge can have different AS.

Again, if the "vfio-device" tag with the "iommufd" property is enough
to identify devices and separate their address spaces, perhaps the
existing get_address_space is enough.

> If above idea is acceptable, then only obstacle is ERRATA_772415,
> maybe we can let VFIO check this errata and bypass RO mapping from
> beginning?

Yes. There can be some communication between vIOMMU and the VFIO
core.

> Or we just block this VFIO device running with flts=on if
> ERRATA_772415 and suggesting running with flts=off?

That sounds like a simpler solution, so long as nobody complains
about this limitation :)

Thanks
Nicolin


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to host
  2025-06-16  5:59                 ` Nicolin Chen
@ 2025-06-16  7:38                   ` Yi Liu
  2025-06-17  3:22                     ` Nicolin Chen
  0 siblings, 1 reply; 63+ messages in thread
From: Yi Liu @ 2025-06-16  7:38 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Duan, Zhenzhong, Peter Xu, qemu-devel@nongnu.org,
	alex.williamson@redhat.com, clg@redhat.com, eric.auger@redhat.com,
	mst@redhat.com, jasowang@redhat.com, ddutile@redhat.com,
	jgg@nvidia.com, shameerali.kolothum.thodi@huawei.com,
	joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
	Tian, Kevin, Peng, Chao P, Yi Sun, Marcel Apfelbaum,
	Paolo Bonzini, Richard Henderson, Eduardo Habkost

On 2025/6/16 13:59, Nicolin Chen wrote:
> On Thu, Jun 12, 2025 at 08:53:40PM +0800, Yi Liu wrote:
>>>> That being said, IOMMU_NOTIFIER_IOTLB_EVENTS should not be needed
>>>> for passthrough devices, right?
>>>
>>> No, even if x-flts=on is configured in QEMU cmdline, that only mean virtual vtd
>>> supports stage-1 translation, guest still can choose to run in legacy mode(stage2),
>>> e.g., with kernel cmdline intel_iommu=on,sm_off
>>>
>>> So before guest run, we don't know which kind of page table either stage1 or stage2
>>> for this VFIO device by guest. So we have to use iommu AS to catch stage2's MAP event
>>> if guest choose stage2.
>>
>> @Zheznzhong, if guest decides to use legacy mode then vIOMMU should switch
>> the MRs of the device's AS, hence the IOAS created by VFIO container would
>> be switched to using the IOMMU_NOTIFIER_IOTLB_EVENTS since the MR is
>> switched to IOMMU MR. So it should be able to support shadowing the guest
>> IO page table. Hence, this should not be a problem.
>>
>> @Nicolin, I think your major point is making the VFIO container IOAS as a
>> GPA IOAS (always return system AS in get_address_space op) and reusing it
>> when setting nested translation. Is it? I think it should work if:
>> 1) we can let the vfio memory listener filter out the RO pages per vIOMMU's
>>     request.
> 
> Yes.
> 
>> But I don't want the get_address_space op always return system
>>     AS as the reason mentioned by Zhenzhong above.
> 
> So, you mean the VT-d model would need a runtime notification to
> switch the address space of the VFIO ioas?

It's not a notification. It's done by switching the AS. Details can be
found in vtd_switch_address_space().

> TBH, I am still unclear how many cases the VT-d model would need
> support here :-/
 >
>> 2) we can disallow emulated/passthru devices behind the same pcie-pci
>>     bridge[1]. For emulated devices, AS should switch to iommu MR, while for
>>     passthru devices, it needs the AS stick with the system MR hence be able
>>     to keep the VFIO container IOAS as a GPA IOAS. To support this, let AS
>>     switch to iommu MR and have a separate GPA IOAS is needed. This separate
>>     GPA IOAS can be shared by all the passthru devices.
> 
> Yea, ARM is doing in a similar way.
> 
>> So basically, we are ok with your idea. But we should decide if it is
>> necessary to support the topology in 2). I think this is a general
>> question. TBH. I don't have much information to judge if it is valuable.
>> Perhaps, let's hear from more people.
> 
> I would be okay if VT-d decides to move on with its own listener,
> if it turns out to be the relatively better case. But for ARM, I'd
> like to see we can reuse the VFIO container IOAS.

I haven't seen a problem with this part so far. Have you seen any?

-- 
Regards,
Yi Liu


^ permalink raw reply	[flat|nested] 63+ messages in thread

* RE: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to host
  2025-06-16  5:47               ` Nicolin Chen
@ 2025-06-16  8:15                 ` Duan, Zhenzhong
  2025-06-17  3:14                   ` Nicolin Chen
  0 siblings, 1 reply; 63+ messages in thread
From: Duan, Zhenzhong @ 2025-06-16  8:15 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Liu, Yi L, Peter Xu, qemu-devel@nongnu.org,
	alex.williamson@redhat.com, clg@redhat.com, eric.auger@redhat.com,
	mst@redhat.com, jasowang@redhat.com, ddutile@redhat.com,
	jgg@nvidia.com, shameerali.kolothum.thodi@huawei.com,
	joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
	Tian, Kevin, Peng, Chao P, Yi Sun, Marcel Apfelbaum,
	Paolo Bonzini, Richard Henderson, Eduardo Habkost



>-----Original Message-----
>From: Nicolin Chen <nicolinc@nvidia.com>
>Subject: Re: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to
>host
>
>Sorry for a late reply.
>
>On Wed, May 28, 2025 at 07:12:25AM +0000, Duan, Zhenzhong wrote:
>> >Third, the vSMMU model, for invalidation efficiency and HW Queue
>> >support, isolates all emulated devices out of the nesting-enabled
>> >vSMMU instance, suggested by Jason. So, only passthrough devices
>> >would use the nesting-enabled vSMMU instance, meaning there is no
>> >need of IOMMU_NOTIFIER_IOTLB_EVENTS:
>>
>> I see, then you need to check if there is emulated device under nesting-enabled
>vSMMU and fail if there is.
>
>Shameer is working on a multi-vSMMU model in the QEMU. This gives
>each VM different instances to attach devices. And we do not plan
>to support emulated devices on an nesting enabled vSMMU instance,
>which is a bit different than the VT-d model.

I see.

>
>> >On the other hand, the VT-d model is a bit different. It's a giant
>> >vIOMMU for all devices (either passthrough or emualted). For all
>> >emulated devices, it needs IOMMU_NOTIFIER_IOTLB_EVENTS, i.e. the
>> >iommu address space returned via get_address_space().
>> >
>> >That being said, IOMMU_NOTIFIER_IOTLB_EVENTS should not be needed
>> >for passthrough devices, right?
>>
>> No, even if x-flts=on is configured in QEMU cmdline, that only mean virtual vtd
>> supports stage-1 translation, guest still can choose to run in legacy
>mode(stage2),
>> e.g., with kernel cmdline intel_iommu=on,sm_off
>>
>> So before guest run, we don't know which kind of page table either stage1 or
>stage2
>> for this VFIO device by guest. So we have to use iommu AS to catch stage2's
>MAP event
>> if guest choose stage2.
>
>IIUIC, the guest kernel cmdline can switch the mode between the
>stage1 (nesting) and stage2 (legacy/emulated VT-d), right?

Right. E.g., with a kexec from "intel_iommu=on,sm_on" to "intel_iommu=on,sm_off",
the first kernel will run in scalable mode and use stage-1 (nesting), and the second kernel will run in legacy mode and use stage-2.

Zhenzhong


^ permalink raw reply	[flat|nested] 63+ messages in thread

* RE: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to host
  2025-06-16  6:34                   ` Nicolin Chen
@ 2025-06-16  8:54                     ` Duan, Zhenzhong
  2025-06-16  9:36                       ` Yi Liu
  0 siblings, 1 reply; 63+ messages in thread
From: Duan, Zhenzhong @ 2025-06-16  8:54 UTC (permalink / raw)
  To: Nicolin Chen, Liu, Yi L
  Cc: Peter Xu, qemu-devel@nongnu.org, alex.williamson@redhat.com,
	clg@redhat.com, eric.auger@redhat.com, mst@redhat.com,
	jasowang@redhat.com, ddutile@redhat.com, jgg@nvidia.com,
	shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
	clement.mathieu--drif@eviden.com, Tian, Kevin, Peng, Chao P,
	Yi Sun, Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost



>-----Original Message-----
>From: Nicolin Chen <nicolinc@nvidia.com>
>Subject: Re: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to
>host
>
>On Mon, Jun 16, 2025 at 03:24:06AM +0000, Duan, Zhenzhong wrote:
>> Hi @Liu, Yi L @Nicolin Chen, for emulated/passthru devices
>> behind the same pcie-pci bridge, I think of an idea, adding
>> a new PCI callback:
>>
>> AddressSpace * (*get_address_space_extend)(PCIBus *bus,
>> void *opaque, int devfn, bool accel_dev);
>>
>> which pass in real bus/devfn and a new param accel_dev which
>> is true for vfio device.
>
>Just =y for all vfio (passthrough) devices?
>
>ARM tentatively does this for get_address_space using Shameer's
>trick to detect if the device is a passthrough VFIO one:
>
>    PCIDevice *pdev = pci_find_device(bus, pci_bus_num(bus), devfn);
>    bool has_iommufd = !!object_property_find(OBJECT(pdev), "iommufd");
>
>    if (smmu->nested && ... && has_iommufd) {
>        return &sdev->as_sysmem;
>    }
>
>So, I guess "accel_dev" could be just:
>    !!object_property_find(OBJECT(pdev), "iommufd")
>?

You are right, we don't need the accel_dev param. The below should work:

object_dynamic_cast(OBJECT(hiod), TYPE_HOST_IOMMU_DEVICE_IOMMUFD)

>
>> Vtd implements this callback and return separate AS for vfio
>> device if it's under an pcie-pci bridge and flts=on;
>> otherwise it fallback to call .get_address_space(). This way
>> emulated devices and passthru devices behind the same pcie-pci
>> bridge can have different AS.
>
>Again, if "vfio-device" tag with "iommufd" property is enough to
>identify devices to separate their address spaces, perhaps the
>existing get_address_space is enough.

We need get_address_space_extend() to pass the real BDF.
get_address_space passes the group's BDF, which makes pci_find_device return the wrong device.

>
>> If above idea is acceptable, then only obstacle is ERRATA_772415,
>> maybe we can let VFIO check this errata and bypass RO mapping from
>> beginning?
>
>Yes. There can be some communication between vIOMMU and the VFIO
>core.
>
>> Or we just block this VFIO device running with flts=on if
>> ERRATA_772415 and suggesting running with flts=off?
>
>That sounds like a simpler solution, so long as nobody complains
>about this limitation :)

I plan to apply this simpler solution unless there is an objection, because
I don't want to bring complexity to VFIO just for an errata. I remember
ERRATA_772415 exists only on old SPR; @Liu, Yi L can correct me if I'm wrong.

We could also introduce a new vtd option ignore_errata to force ignoring the errata,
with a warning message, in case users care more about performance than security.

With these two solutions, vtd can reuse the VFIO container's IOAS and HWPT just
like ARM, so we are aligned on this part.

Comments welcome! If there is no objection, I will apply these solutions in v2.

Thanks
Zhenzhong


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to host
  2025-06-16  8:54                     ` Duan, Zhenzhong
@ 2025-06-16  9:36                       ` Yi Liu
  2025-06-16 10:16                         ` Duan, Zhenzhong
  0 siblings, 1 reply; 63+ messages in thread
From: Yi Liu @ 2025-06-16  9:36 UTC (permalink / raw)
  To: Duan, Zhenzhong, Nicolin Chen
  Cc: Peter Xu, qemu-devel@nongnu.org, alex.williamson@redhat.com,
	clg@redhat.com, eric.auger@redhat.com, mst@redhat.com,
	jasowang@redhat.com, ddutile@redhat.com, jgg@nvidia.com,
	shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
	clement.mathieu--drif@eviden.com, Tian, Kevin, Peng, Chao P,
	Yi Sun, Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost

On 2025/6/16 16:54, Duan, Zhenzhong wrote:
> 
> 
>> -----Original Message-----
>> From: Nicolin Chen <nicolinc@nvidia.com>
>> Subject: Re: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to
>> host
>>
>> On Mon, Jun 16, 2025 at 03:24:06AM +0000, Duan, Zhenzhong wrote:
>>> Hi @Liu, Yi L @Nicolin Chen, for emulated/passthru devices
>>> behind the same pcie-pci bridge, I think of an idea, adding
>>> a new PCI callback:
>>>
>>> AddressSpace * (*get_address_space_extend)(PCIBus *bus,
>>> void *opaque, int devfn, bool accel_dev);
>>>
>>> which pass in real bus/devfn and a new param accel_dev which
>>> is true for vfio device.
>>
>> Just =y for all vfio (passthrough) devices?
>>

TBH, it's a bit hacky to me conceptually. It may be cleaner to detect
and block such a topology.

BTW, @Nic, I suppose nested vSMMUv3 does not have this concern, since
you will put the passthrough devices under a separate vIOMMU, which should
ensure that the emulated devices won't share an AS with passthrough devices,
right?

>> ARM tentatively does this for get_address_space using Shameer's
>> trick to detect if the device is a passthrough VFIO one:
>>
>>     PCIDevice *pdev = pci_find_device(bus, pci_bus_num(bus), devfn);
>>     bool has_iommufd = !!object_property_find(OBJECT(pdev), "iommufd");
>>
>>     if (smmu->nested && ... && has_iommufd) {
>>         return &sdev->as_sysmem;
>>     }
>>
>> So, I guess "accel_dev" could be just:
>>     !!object_property_find(OBJECT(pdev), "iommufd")
>> ?
> 
> You are right, we don't need param accel_dev. Below should work:
> 
> object_dynamic_cast(OBJECT(hiod), TYPE_HOST_IOMMU_DEVICE_IOMMUFD)
> 
>>
>>> Vtd implements this callback and return separate AS for vfio
>>> device if it's under an pcie-pci bridge and flts=on;
>>> otherwise it fallback to call .get_address_space(). This way
>>> emulated devices and passthru devices behind the same pcie-pci
>>> bridge can have different AS.
>>
>> Again, if "vfio-device" tag with "iommufd" property is enough to
>> identify devices to separate their address spaces, perhaps the
>> existing get_address_space is enough.
> 
> We need get_address_space_extend() to pass real BDF.
> get_address_space pass group's BDF which made pci_find_device return wrong device.
>
>>
>>> If above idea is acceptable, then only obstacle is ERRATA_772415,
>>> maybe we can let VFIO check this errata and bypass RO mapping from
>>> beginning?
>>
>> Yes. There can be some communication between vIOMMU and the VFIO
>> core.
>>
>>> Or we just block this VFIO device running with flts=on if
>>> ERRATA_772415 and suggesting running with flts=off?
>>
>> That sounds like a simpler solution, so long as nobody complains
>> about this limitation :)
> 
> I plan to apply this simpler solution except there is objection, because
> I don't want to bring complexity to VFIO just for an Errata. I remember
> ERRATA_772415 exists only on old SPR, @Liu, Yi L can correct me if I'm wrong.

Hmmm, I'm fine with passing some info to VFIO to let it skip RO mappings.
Is there any other info that VFIO needs to get from the vIOMMU? I hope we
start adding such a mechanism with a normal requirement. :)

-- 
Regards,
Yi Liu


^ permalink raw reply	[flat|nested] 63+ messages in thread

* RE: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to host
  2025-06-16  9:36                       ` Yi Liu
@ 2025-06-16 10:16                         ` Duan, Zhenzhong
  2025-06-17  7:04                           ` Yi Liu
  0 siblings, 1 reply; 63+ messages in thread
From: Duan, Zhenzhong @ 2025-06-16 10:16 UTC (permalink / raw)
  To: Liu, Yi L, Nicolin Chen
  Cc: Peter Xu, qemu-devel@nongnu.org, alex.williamson@redhat.com,
	clg@redhat.com, eric.auger@redhat.com, mst@redhat.com,
	jasowang@redhat.com, ddutile@redhat.com, jgg@nvidia.com,
	shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
	clement.mathieu--drif@eviden.com, Tian, Kevin, Peng, Chao P,
	Yi Sun, Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost



>-----Original Message-----
>From: Liu, Yi L <yi.l.liu@intel.com>
>Subject: Re: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to
>host
>
>On 2025/6/16 16:54, Duan, Zhenzhong wrote:
>>
>>
>>> -----Original Message-----
>>> From: Nicolin Chen <nicolinc@nvidia.com>
>>> Subject: Re: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table
>to
>>> host
>>>
>>> On Mon, Jun 16, 2025 at 03:24:06AM +0000, Duan, Zhenzhong wrote:
>>>> Hi @Liu, Yi L @Nicolin Chen, for emulated/passthru devices
>>>> behind the same pcie-pci bridge, I think of an idea, adding
>>>> a new PCI callback:
>>>>
>>>> AddressSpace * (*get_address_space_extend)(PCIBus *bus,
>>>> void *opaque, int devfn, bool accel_dev);
>>>>
>>>> which pass in real bus/devfn and a new param accel_dev which
>>>> is true for vfio device.
>>>
>>> Just =y for all vfio (passthrough) devices?
>>>
>
>TBH. It's a bit hacky to me in concept. It may be more cleaner to detect
>and block such topology.

OK, then we don't need get_address_space_extend(). Will do in v2.
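
The detect-and-block idea could look roughly like the standalone sketch below. This is plain C modeling the check, not QEMU code: the `Dev` struct and function names are made up, and `has_iommufd` stands in for the `object_property_find(OBJECT(pdev), "iommufd")` test mentioned in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a PCI device sitting behind a pcie-pci bridge. */
typedef struct {
    bool has_iommufd; /* models object_property_find(OBJECT(pdev), "iommufd") */
} Dev;

/*
 * Reject a bridge that mixes emulated and passthrough (iommufd-backed)
 * devices, instead of trying to give them different address spaces.
 */
static bool bus_topology_ok(const Dev *devs, size_t n)
{
    bool saw_passthru = false, saw_emulated = false;

    for (size_t i = 0; i < n; i++) {
        if (devs[i].has_iommufd) {
            saw_passthru = true;
        } else {
            saw_emulated = true;
        }
    }
    /* A mixed bus is the topology being blocked. */
    return !(saw_passthru && saw_emulated);
}
```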

>
>BTW. @Nic, I suppose nesting vSMMUv3 does not have this concern since
>you will put the passthru devices under a separate vIOMMU which should
>ensure that the emulated devices won't share AS with passthrough device.
>right?
>
>>> ARM tentatively does this for get_address_space using Shameer's
>>> trick to detect if the device is a passthrough VFIO one:
>>>
>>>     PCIDevice *pdev = pci_find_device(bus, pci_bus_num(bus), devfn);
>>>     bool has_iommufd = !!object_property_find(OBJECT(pdev), "iommufd");
>>>
>>>     if (smmu->nested && ... && has_iommufd) {
>>>         return &sdev->as_sysmem;
>>>     }
>>>
>>> So, I guess "accel_dev" could be just:
>>>     !!object_property_find(OBJECT(pdev), "iommufd")
>>> ?
>>
>> You are right, we don't need param accel_dev. Below should work:
>>
>> object_dynamic_cast(OBJECT(hiod), TYPE_HOST_IOMMU_DEVICE_IOMMUFD)
>>
>>>
>>>> Vtd implements this callback and return separate AS for vfio
>>>> device if it's under an pcie-pci bridge and flts=on;
>>>> otherwise it fallback to call .get_address_space(). This way
>>>> emulated devices and passthru devices behind the same pcie-pci
>>>> bridge can have different AS.
>>>
>>> Again, if "vfio-device" tag with "iommufd" property is enough to
>>> identify devices to separate their address spaces, perhaps the
>>> existing get_address_space is enough.
>>
>> We need get_address_space_extend() to pass real BDF.
>> get_address_space pass group's BDF which made pci_find_device return wrong
>device.
>>
>>>
>>>> If above idea is acceptable, then only obstacle is ERRATA_772415,
>>>> maybe we can let VFIO check this errata and bypass RO mapping from
>>>> beginning?
>>>
>>> Yes. There can be some communication between vIOMMU and the VFIO
>>> core.
>>>
>>>> Or we just block this VFIO device running with flts=on if
>>>> ERRATA_772415 and suggesting running with flts=off?
>>>
>>> That sounds like a simpler solution, so long as nobody complains
>>> about this limitation :)
>>
>> I plan to apply this simpler solution except there is objection, because
>> I don't want to bring complexity to VFIO just for an Errata. I remember
>> ERRATA_772415 exists only on old SPR, @Liu, Yi L can correct me if I'm wrong.
>
>hmmm. I'm fine to pass some info to vfio hence let vfio skip RO mappings.
>Is there other info that VFIO needs to get from vIOMMU? Hope start adding
>such mechanism with normal requirement. :)

I can think of ERRATA_772415 and the NESTED capability. NESTED would be used for
creating the VFIO default HWPT in stage-2 mode.
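
As a rough model, the info handed from the vIOMMU to the VFIO core could be a small capability struct like the one below. The struct and function names here are illustrative, not an existing QEMU API; `HWPT_FLAG_NEST_PARENT` merely models the kernel's `IOMMU_HWPT_ALLOC_NEST_PARENT` flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical capability info a vIOMMU could hand to the VFIO core. */
typedef struct {
    bool errata_772415;  /* skip read-only mappings on affected parts */
    bool nested;         /* host IOMMU supports nested translation */
} VIOMMUCaps;

#define HWPT_FLAG_NEST_PARENT (1u << 0) /* models IOMMU_HWPT_ALLOC_NEST_PARENT */

/* Should the VFIO memory listener skip a read-only mapping? */
static bool vfio_skip_ro_mapping(const VIOMMUCaps *caps, bool map_readonly)
{
    return caps->errata_772415 && map_readonly;
}

/* Flags for allocating the VFIO default stage-2 HWPT. */
static uint32_t vfio_default_hwpt_flags(const VIOMMUCaps *caps)
{
    return caps->nested ? HWPT_FLAG_NEST_PARENT : 0;
}
```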

Thanks
Zhenzhong

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to host
  2025-06-16  8:15                 ` Duan, Zhenzhong
@ 2025-06-17  3:14                   ` Nicolin Chen
  2025-06-17 12:37                     ` Jason Gunthorpe
  0 siblings, 1 reply; 63+ messages in thread
From: Nicolin Chen @ 2025-06-17  3:14 UTC (permalink / raw)
  To: Duan, Zhenzhong
  Cc: Liu, Yi L, Peter Xu, qemu-devel@nongnu.org,
	alex.williamson@redhat.com, clg@redhat.com, eric.auger@redhat.com,
	mst@redhat.com, jasowang@redhat.com, ddutile@redhat.com,
	jgg@nvidia.com, shameerali.kolothum.thodi@huawei.com,
	joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
	Tian, Kevin, Peng, Chao P, Yi Sun, Marcel Apfelbaum,
	Paolo Bonzini, Richard Henderson, Eduardo Habkost

On Mon, Jun 16, 2025 at 08:15:11AM +0000, Duan, Zhenzhong wrote:
> >IIUIC, the guest kernel cmdline can switch the mode between the
> >stage1 (nesting) and stage2 (legacy/emulated VT-d), right?
> 
> Right. E.g., kexec from "intel_iommu=on,sm_on" to "intel_iommu=on,sm_off",
> Then first kernel will run in scalable mode and use stage1(nesting) and
> second kernel will run in legacy mode and use stage2.

In scalable mode, guest kernel has a stage1 (nested) domain and
host kernel has a stage2 (nesting parent) domain. In this case,
the VFIO container IOAS could be the system AS corresponding to
the kernel-managed stage2 domain.

In legacy mode, guest kernel has a stage2 (normal) domain while
host kernel has a stage2 (shadow) domain? In this case, the VFIO
container IOAS should be the iommu AS corresponding to the kernel
guest-level stage2 domain (or should it be shadow)?

The ARM model that Shameer is proposing only allows a nested SMMU
when such a legacy mode is off. This simplifies a lot of things.
But the difficulty with the VT-d model is that it has to rely on a
guest boot command at runtime...

Nicolin


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to host
  2025-06-16  7:38                   ` Yi Liu
@ 2025-06-17  3:22                     ` Nicolin Chen
  2025-06-17  6:48                       ` Yi Liu
  0 siblings, 1 reply; 63+ messages in thread
From: Nicolin Chen @ 2025-06-17  3:22 UTC (permalink / raw)
  To: Yi Liu
  Cc: Duan, Zhenzhong, Peter Xu, qemu-devel@nongnu.org,
	alex.williamson@redhat.com, clg@redhat.com, eric.auger@redhat.com,
	mst@redhat.com, jasowang@redhat.com, ddutile@redhat.com,
	jgg@nvidia.com, shameerali.kolothum.thodi@huawei.com,
	joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
	Tian, Kevin, Peng, Chao P, Yi Sun, Marcel Apfelbaum,
	Paolo Bonzini, Richard Henderson, Eduardo Habkost

On Mon, Jun 16, 2025 at 03:38:26PM +0800, Yi Liu wrote:
> On 2025/6/16 13:59, Nicolin Chen wrote:
> > On Thu, Jun 12, 2025 at 08:53:40PM +0800, Yi Liu wrote:
> > > > > That being said, IOMMU_NOTIFIER_IOTLB_EVENTS should not be needed
> > > > > for passthrough devices, right?
> > > > 
> > > > No, even if x-flts=on is configured in QEMU cmdline, that only mean virtual vtd
> > > > supports stage-1 translation, guest still can choose to run in legacy mode(stage2),
> > > > e.g., with kernel cmdline intel_iommu=on,sm_off
> > > > 
> > > > So before guest run, we don't know which kind of page table either stage1 or stage2
> > > > for this VFIO device by guest. So we have to use iommu AS to catch stage2's MAP event
> > > > if guest choose stage2.
> > > 
> > > @Zheznzhong, if guest decides to use legacy mode then vIOMMU should switch
> > > the MRs of the device's AS, hence the IOAS created by VFIO container would
> > > be switched to using the IOMMU_NOTIFIER_IOTLB_EVENTS since the MR is
> > > switched to IOMMU MR. So it should be able to support shadowing the guest
> > > IO page table. Hence, this should not be a problem.
> > > 
> > > @Nicolin, I think your major point is making the VFIO container IOAS as a
> > > GPA IOAS (always return system AS in get_address_space op) and reusing it
> > > when setting nested translation. Is it? I think it should work if:
> > > 1) we can let the vfio memory listener filter out the RO pages per vIOMMU's
> > >     request.
> > 
> > Yes.
> > 
> > > But I don't want the get_address_space op always return system
> > >     AS as the reason mentioned by Zhenzhong above.
> > 
> > So, you mean the VT-d model would need a runtime notification to
> > switch the address space of the VFIO ioas?
> 
> It's not a notification. It's done by switching AS. Detail can be found
> in vtd_switch_address_space().

OK. I got confused about the "switch", thinking that was about
the get_address_space() call.

> > TBH, I am still unclear how many cases the VT-d model would need
> > support here :-/
> >
> > > 2) we can disallow emulated/passthru devices behind the same pcie-pci
> > >     bridge[1]. For emulated devices, AS should switch to iommu MR, while for
> > >     passthru devices, it needs the AS stick with the system MR hence be able
> > >     to keep the VFIO container IOAS as a GPA IOAS. To support this, let AS
> > >     switch to iommu MR and have a separate GPA IOAS is needed. This separate
> > >     GPA IOAS can be shared by all the passthru devices.
> > 
> > Yea, ARM is doing in a similar way.
> > 
> > > So basically, we are ok with your idea. But we should decide if it is
> > > necessary to support the topology in 2). I think this is a general
> > > question. TBH. I don't have much information to judge if it is valuable.
> > > Perhaps, let's hear from more people.
> > 
> > I would be okay if VT-d decides to move on with its own listener,
> > if it turns out to be the relatively better case. But for ARM, I'd
> > like to see we can reuse the VFIO container IOAS.
> 
> I didn't see a problem so far on this part. Have you seen any?

Probably no functional problem with that internal listener. ARM
could work using one like that as well. The only problem is code
duplication. It's not ideal for everybody to have an internal S2
listener while wasting the VFIO one.

But given that VT-d has more complicated use cases like runtime
guest-level configuration that switches between nesting and non-
nesting modes, perhaps having an internal listener is a better
idea?

Thanks
Nicolin


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to host
  2025-06-17  3:22                     ` Nicolin Chen
@ 2025-06-17  6:48                       ` Yi Liu
  0 siblings, 0 replies; 63+ messages in thread
From: Yi Liu @ 2025-06-17  6:48 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Duan, Zhenzhong, Peter Xu, qemu-devel@nongnu.org,
	alex.williamson@redhat.com, clg@redhat.com, eric.auger@redhat.com,
	mst@redhat.com, jasowang@redhat.com, ddutile@redhat.com,
	jgg@nvidia.com, shameerali.kolothum.thodi@huawei.com,
	joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
	Tian, Kevin, Peng, Chao P, Yi Sun, Marcel Apfelbaum,
	Paolo Bonzini, Richard Henderson, Eduardo Habkost

On 2025/6/17 11:22, Nicolin Chen wrote:
> On Mon, Jun 16, 2025 at 03:38:26PM +0800, Yi Liu wrote:
>> On 2025/6/16 13:59, Nicolin Chen wrote:
>>> On Thu, Jun 12, 2025 at 08:53:40PM +0800, Yi Liu wrote:
>>>>>> That being said, IOMMU_NOTIFIER_IOTLB_EVENTS should not be needed
>>>>>> for passthrough devices, right?
>>>>>
>>>>> No, even if x-flts=on is configured in QEMU cmdline, that only mean virtual vtd
>>>>> supports stage-1 translation, guest still can choose to run in legacy mode(stage2),
>>>>> e.g., with kernel cmdline intel_iommu=on,sm_off
>>>>>
>>>>> So before guest run, we don't know which kind of page table either stage1 or stage2
>>>>> for this VFIO device by guest. So we have to use iommu AS to catch stage2's MAP event
>>>>> if guest choose stage2.
>>>>
>>>> @Zheznzhong, if guest decides to use legacy mode then vIOMMU should switch
>>>> the MRs of the device's AS, hence the IOAS created by VFIO container would
>>>> be switched to using the IOMMU_NOTIFIER_IOTLB_EVENTS since the MR is
>>>> switched to IOMMU MR. So it should be able to support shadowing the guest
>>>> IO page table. Hence, this should not be a problem.
>>>>
>>>> @Nicolin, I think your major point is making the VFIO container IOAS as a
>>>> GPA IOAS (always return system AS in get_address_space op) and reusing it
>>>> when setting nested translation. Is it? I think it should work if:
>>>> 1) we can let the vfio memory listener filter out the RO pages per vIOMMU's
>>>>      request.
>>>
>>> Yes.
>>>
>>>> But I don't want the get_address_space op always return system
>>>>      AS as the reason mentioned by Zhenzhong above.
>>>
>>> So, you mean the VT-d model would need a runtime notification to
>>> switch the address space of the VFIO ioas?
>>
>> It's not a notification. It's done by switching AS. Detail can be found
>> in vtd_switch_address_space().
> 
> OK. I got confused about the "switch", thinking that was about
> the get_address_space() call.

yeah, not that call. All the magic is in the MR enable/disable. This
switches to the IOMMU MR, so vfio_listener_region_add() will see that the
MR is an IOMMU MR and register an IOMMU notifier.
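
The MR flip described here can be boiled down to a toy model like the following. Names are illustrative, not the real QEMU API: `switch_address_space()` stands in for `vtd_switch_address_space()` flipping the enabled region, and `listener_registers_notifier()` stands in for the relevant branch in `vfio_listener_region_add()`.

```c
#include <assert.h>
#include <stdbool.h>

/* A device AS contains two overlapping regions; exactly one is active. */
typedef enum { MR_SYSMEM, MR_IOMMU } ActiveMR;

/* Models vtd_switch_address_space(): when guest DMA translation is
 * enabled, the IOMMU MR is enabled and the sysmem alias disabled. */
static ActiveMR switch_address_space(bool dmar_enabled)
{
    return dmar_enabled ? MR_IOMMU : MR_SYSMEM;
}

/* Models the part of vfio_listener_region_add() that matters here:
 * an IOMMU notifier is registered only when the region is an IOMMU MR. */
static bool listener_registers_notifier(ActiveMR mr)
{
    return mr == MR_IOMMU;
}
```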

>>> TBH, I am still unclear how many cases the VT-d model would need
>>> support here :-/
>>>
>>>> 2) we can disallow emulated/passthru devices behind the same pcie-pci
>>>>      bridge[1]. For emulated devices, AS should switch to iommu MR, while for
>>>>      passthru devices, it needs the AS stick with the system MR hence be able
>>>>      to keep the VFIO container IOAS as a GPA IOAS. To support this, let AS
>>>>      switch to iommu MR and have a separate GPA IOAS is needed. This separate
>>>>      GPA IOAS can be shared by all the passthru devices.
>>>
>>> Yea, ARM is doing in a similar way.
>>>
>>>> So basically, we are ok with your idea. But we should decide if it is
>>>> necessary to support the topology in 2). I think this is a general
>>>> question. TBH. I don't have much information to judge if it is valuable.
>>>> Perhaps, let's hear from more people.
>>>
>>> I would be okay if VT-d decides to move on with its own listener,
>>> if it turns out to be the relatively better case. But for ARM, I'd
>>> like to see we can reuse the VFIO container IOAS.
>>
>> I didn't see a problem so far on this part. Have you seen any?
> 
> Probably no functional problem with that internal listener. ARM
> could work using one like that as well. The only problem is code
> duplication. It's not ideal for everybody to have an internal S2
> listener while wasting the VFIO one.
> 
> But given that VT-d has more complicated use cases like runtime
> guest-level configuration that switches between nesting and non-
> nesting modes, perhaps having an internal listener is a better
> idea?

I noticed there is quite a bit of duplication now (container/ioas/hwpt). Let's
see if anyone wants to put emulated and passthru devices under
the same PCI bridge. If not, let's avoid duplicating code.

-- 
Regards,
Yi Liu


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to host
  2025-06-16 10:16                         ` Duan, Zhenzhong
@ 2025-06-17  7:04                           ` Yi Liu
  0 siblings, 0 replies; 63+ messages in thread
From: Yi Liu @ 2025-06-17  7:04 UTC (permalink / raw)
  To: Duan, Zhenzhong, Nicolin Chen
  Cc: Peter Xu, qemu-devel@nongnu.org, alex.williamson@redhat.com,
	clg@redhat.com, eric.auger@redhat.com, mst@redhat.com,
	jasowang@redhat.com, ddutile@redhat.com, jgg@nvidia.com,
	shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
	clement.mathieu--drif@eviden.com, Tian, Kevin, Peng, Chao P,
	Yi Sun, Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost

On 2025/6/16 18:16, Duan, Zhenzhong wrote:
> 
> 
>> -----Original Message-----
>> From: Liu, Yi L <yi.l.liu@intel.com>
>> Subject: Re: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to
>> host
>>
>> On 2025/6/16 16:54, Duan, Zhenzhong wrote:
>>>
>>>
>>>> -----Original Message-----
>>>> From: Nicolin Chen <nicolinc@nvidia.com>
>>>> Subject: Re: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table
>> to
>>>> host
>>>>
>>>> On Mon, Jun 16, 2025 at 03:24:06AM +0000, Duan, Zhenzhong wrote:
>>>>> Hi @Liu, Yi L @Nicolin Chen, for emulated/passthru devices
>>>>> behind the same pcie-pci bridge, I think of an idea, adding
>>>>> a new PCI callback:
>>>>>
>>>>> AddressSpace * (*get_address_space_extend)(PCIBus *bus,
>>>>> void *opaque, int devfn, bool accel_dev);
>>>>>
>>>>> which pass in real bus/devfn and a new param accel_dev which
>>>>> is true for vfio device.
>>>>
>>>> Just =y for all vfio (passthrough) devices?
>>>>
>>
>> TBH. It's a bit hacky to me in concept. It may be more cleaner to detect
>> and block such topology.
> 
> OK, then we don't need get_address_space_extend(). Will do in v2.
> 
>>
>> BTW. @Nic, I suppose nesting vSMMUv3 does not have this concern since
>> you will put the passthru devices under a separate vIOMMU which should
>> ensure that the emulated devices won't share AS with passthrough device.
>> right?
>>
>>>> ARM tentatively does this for get_address_space using Shameer's
>>>> trick to detect if the device is a passthrough VFIO one:
>>>>
>>>>      PCIDevice *pdev = pci_find_device(bus, pci_bus_num(bus), devfn);
>>>>      bool has_iommufd = !!object_property_find(OBJECT(pdev), "iommufd");
>>>>
>>>>      if (smmu->nested && ... && has_iommufd) {
>>>>          return &sdev->as_sysmem;
>>>>      }
>>>>
>>>> So, I guess "accel_dev" could be just:
>>>>      !!object_property_find(OBJECT(pdev), "iommufd")
>>>> ?
>>>
>>> You are right, we don't need param accel_dev. Below should work:
>>>
>>> object_dynamic_cast(OBJECT(hiod), TYPE_HOST_IOMMU_DEVICE_IOMMUFD)
>>>
>>>>
>>>>> Vtd implements this callback and return separate AS for vfio
>>>>> device if it's under an pcie-pci bridge and flts=on;
>>>>> otherwise it fallback to call .get_address_space(). This way
>>>>> emulated devices and passthru devices behind the same pcie-pci
>>>>> bridge can have different AS.
>>>>
>>>> Again, if "vfio-device" tag with "iommufd" property is enough to
>>>> identify devices to separate their address spaces, perhaps the
>>>> existing get_address_space is enough.
>>>
>>> We need get_address_space_extend() to pass real BDF.
>>> get_address_space pass group's BDF which made pci_find_device return wrong
>> device.
>>>
>>>>
>>>>> If above idea is acceptable, then only obstacle is ERRATA_772415,
>>>>> maybe we can let VFIO check this errata and bypass RO mapping from
>>>>> beginning?
>>>>
>>>> Yes. There can be some communication between vIOMMU and the VFIO
>>>> core.
>>>>
>>>>> Or we just block this VFIO device running with flts=on if
>>>>> ERRATA_772415 and suggesting running with flts=off?
>>>>
>>>> That sounds like a simpler solution, so long as nobody complains
>>>> about this limitation :)
>>>
>>> I plan to apply this simpler solution except there is objection, because
>>> I don't want to bring complexity to VFIO just for an Errata. I remember
>>> ERRATA_772415 exists only on old SPR, @Liu, Yi L can correct me if I'm wrong.
>>
>> hmmm. I'm fine to pass some info to vfio hence let vfio skip RO mappings.
>> Is there other info that VFIO needs to get from vIOMMU? Hope start adding
>> such mechanism with normal requirement. :)
> 
> I can think of ERRATA_772415 and NESTED capability. NESTED used for creating
> VFIO default HWPT in stage2 mode.

yeah. NESTED should be a hard requirement. VFIO should allocate the HWPT with
the nested_parent flag.

-- 
Regards,
Yi Liu


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to host
  2025-06-17  3:14                   ` Nicolin Chen
@ 2025-06-17 12:37                     ` Jason Gunthorpe
  2025-06-17 13:03                       ` Yi Liu
  0 siblings, 1 reply; 63+ messages in thread
From: Jason Gunthorpe @ 2025-06-17 12:37 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Duan, Zhenzhong, Liu, Yi L, Peter Xu, qemu-devel@nongnu.org,
	alex.williamson@redhat.com, clg@redhat.com, eric.auger@redhat.com,
	mst@redhat.com, jasowang@redhat.com, ddutile@redhat.com,
	shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
	clement.mathieu--drif@eviden.com, Tian, Kevin, Peng, Chao P,
	Yi Sun, Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost

On Mon, Jun 16, 2025 at 08:14:27PM -0700, Nicolin Chen wrote:
> On Mon, Jun 16, 2025 at 08:15:11AM +0000, Duan, Zhenzhong wrote:
> > >IIUIC, the guest kernel cmdline can switch the mode between the
> > >stage1 (nesting) and stage2 (legacy/emulated VT-d), right?
> > 
> > Right. E.g., kexec from "intel_iommu=on,sm_on" to "intel_iommu=on,sm_off",
> > Then first kernel will run in scalable mode and use stage1(nesting) and
> > second kernel will run in legacy mode and use stage2.
> 
> In scalable mode, guest kernel has a stage1 (nested) domain and
> host kernel has a stage2 (nesting parent) domain. In this case,
> the VFIO container IOAS could be the system AS corresponding to
> the kernel-managed stage2 domain.
> 
> In legacy mode, guest kernel has a stage2 (normal) domain while
> host kernel has a stage2 (shadow) domain? In this case, the VFIO
> container IOAS should be the iommu AS corresponding to the kernel
> guest-level stage2 domain (or should it be shadow)?

What you want is to disable HW support for legacy mode in qemu so that
the kernel rejects sm_off operation.

The HW spec is really goofy: we get an ecap_slts, but it only applies
to a PASID table entry (scalable mode). So the HW always has to support
the second stage for legacy mode but can turn it off for PASID?

IMHO the intention was to allow the VMM to not support shadowing, but
it seems the execution was mangled.

I suggest fixing the Linux driver to refuse to run in sm_on mode if
the HW supports scalable mode and ecap_slts = false. That may not be
100% spec compliant but it seems like a reasonable approach.

> The ARM model that Shameer is proposing only allows a nested SMMU
> when such a legacy mode is off. This simplifies a lot of things.
> But the difficulty of the VT-d model is that it has to rely on a
> guest bootcmd during runtime..

ARM is cleaner because it doesn't have these drivers issues. qemu can
reliably say not to use the S2 and all the existing guest kernels will
obey that.

AMD has the same issues, BTW, arguably even worse as I didn't notice
any way to specify if the v1 page table is supported :\

Jason


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to host
  2025-06-17 12:37                     ` Jason Gunthorpe
@ 2025-06-17 13:03                       ` Yi Liu
  2025-06-17 13:11                         ` Jason Gunthorpe
  0 siblings, 1 reply; 63+ messages in thread
From: Yi Liu @ 2025-06-17 13:03 UTC (permalink / raw)
  To: Jason Gunthorpe, Nicolin Chen
  Cc: Duan, Zhenzhong, Peter Xu, qemu-devel@nongnu.org,
	alex.williamson@redhat.com, clg@redhat.com, eric.auger@redhat.com,
	mst@redhat.com, jasowang@redhat.com, ddutile@redhat.com,
	shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
	clement.mathieu--drif@eviden.com, Tian, Kevin, Peng, Chao P,
	Yi Sun, Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost

On 2025/6/17 20:37, Jason Gunthorpe wrote:
> On Mon, Jun 16, 2025 at 08:14:27PM -0700, Nicolin Chen wrote:
>> On Mon, Jun 16, 2025 at 08:15:11AM +0000, Duan, Zhenzhong wrote:
>>>> IIUIC, the guest kernel cmdline can switch the mode between the
>>>> stage1 (nesting) and stage2 (legacy/emulated VT-d), right?
>>>
>>> Right. E.g., kexec from "intel_iommu=on,sm_on" to "intel_iommu=on,sm_off",
>>> Then first kernel will run in scalable mode and use stage1(nesting) and
>>> second kernel will run in legacy mode and use stage2.
>>
>> In scalable mode, guest kernel has a stage1 (nested) domain and
>> host kernel has a stage2 (nesting parent) domain. In this case,
>> the VFIO container IOAS could be the system AS corresponding to
>> the kernel-managed stage2 domain.
>>
>> In legacy mode, guest kernel has a stage2 (normal) domain while
>> host kernel has a stage2 (shadow) domain? In this case, the VFIO
>> container IOAS should be the iommu AS corresponding to the kernel
>> guest-level stage2 domain (or should it be shadow)?
> 
> What you want is to disable HW support for legacy mode in qemu so the
> kernel rejects sm_off operation.

that can be the future. :)

> The HW spec is really goofy, we get an ecap_slts but it only applies
> to a PASID table entry (scalable mode). So the HW has to support
> second stage for legacy always but can turn it off for PASID?

yes. Legacy mode (a page table following the second-stage format) is
supported anyway.

> IMHO the intention was to allow the VMM to not support shadowing, but
> it seems the execution was mangled.
> 
> I suggest fixing the Linux driver to refuse to run in sm_on mode if
> the HW supports scalable mode and ecap_slts = false. That may not be
> 100% spec compliant but it seems like a reasonable approach.

Running sm_on with only ecap_flts==true is what we want here. We want
the guest to use a stage-1 page table so that it can be used by HW under
nested translation mode, and this page table is only available in sm_on
mode.

If we want to drop legacy mode usage in a virtualization environment, we
might let the Linux iommu driver refuse to run in legacy mode while ecap_slts
is false. I suppose HW is going to advertise both ecap_slts and ecap_flts. So
this will just keep the guest from using legacy mode.

But this is not necessary so far. As the discussion here is going, we intend
to reuse the GPA HWPT allocated by the VFIO container as well.[1] This is now
aligned with Nic and Shameer.

[1] 
https://lore.kernel.org/qemu-devel/b3d31287-4de5-4e0e-a81b-99f82edd5bcc@intel.com/
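
The mode negotiation being debated above could be sketched as the toy function below. This models the *proposed* hardening (refuse legacy mode when only stage-1 is advertised), not current driver behavior; the enum and function names are made up, while `ecap_flts`/`ecap_slts` mirror the VT-d capability bits.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { MODE_REJECT, MODE_LEGACY, MODE_SCALABLE } VtdMode;

/* Toy model: which mode does a guest end up in, given what the vIOMMU
 * advertises and whether the guest requested scalable mode (sm_on)? */
static VtdMode guest_mode(bool sm_requested, bool ecap_flts, bool ecap_slts)
{
    if (sm_requested) {
        /* Scalable mode needs at least one stage to be supported. */
        return (ecap_flts || ecap_slts) ? MODE_SCALABLE : MODE_REJECT;
    }
    /* Proposed hardening: refuse legacy mode when the vIOMMU advertises
     * stage-1 only (ecap_slts == false), to stay in the nesting setup. */
    return ecap_slts ? MODE_LEGACY : MODE_REJECT;
}
```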

>> The ARM model that Shameer is proposing only allows a nested SMMU
>> when such a legacy mode is off. This simplifies a lot of things.
>> But the difficulty of the VT-d model is that it has to rely on a
>> guest bootcmd during runtime..
> 
> ARM is cleaner because it doesn't have these drivers issues. qemu can
> reliably say not to use the S2 and all the existing guest kernels will
> obey that.

Out of curiosity, does SMMU have a legacy mode, or does a given version of SMMU
support only either the legacy mode or the newer mode?

> AMD has the same issues, BTW, arguably even worse as I didn't notice
> any way to specify if the v1 page table is supported :\
> 
> Jason

-- 
Regards,
Yi Liu


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to host
  2025-06-17 13:03                       ` Yi Liu
@ 2025-06-17 13:11                         ` Jason Gunthorpe
  2025-06-18  2:51                           ` Duan, Zhenzhong
  2025-06-18  3:40                           ` Yi Liu
  0 siblings, 2 replies; 63+ messages in thread
From: Jason Gunthorpe @ 2025-06-17 13:11 UTC (permalink / raw)
  To: Yi Liu
  Cc: Nicolin Chen, Duan, Zhenzhong, Peter Xu, qemu-devel@nongnu.org,
	alex.williamson@redhat.com, clg@redhat.com, eric.auger@redhat.com,
	mst@redhat.com, jasowang@redhat.com, ddutile@redhat.com,
	shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
	clement.mathieu--drif@eviden.com, Tian, Kevin, Peng, Chao P,
	Yi Sun, Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost

On Tue, Jun 17, 2025 at 09:03:32PM +0800, Yi Liu wrote:
> > I suggest fixing the Linux driver to refuse to run in sm_on mode if
> > the HW supports scalable mode and ecap_slts = false. That may not be
> > 100% spec compliant but it seems like a reasonable approach.
> 
> running sm_on with only ecap_flts==true is what we want here. We want
> the guest use stage-1 page table hence it can be used by hw under the
> nested translation mode. While this page table is only available in sm_on
> mode.
> 
> If we want to drop the legacy mode usage in virtualization environment, we
> might let linux iommu driver refuse running legacy mode while ecap_slts is
> false. I suppose HW is going to advertise both ecap_slts and ecap_flts. So
> this will just let guest get rid of using legacy mode.
> 
> But this is not necessary so far. As the discussion going here, we intend
> to reuse the GPA HWPT allocated by VFIO container as well.[1] This is now
> aligned with Nic and Shameer.

I think it is an issue; nobody really wants to accidentally start
supporting and using shadow mode just because the VM is misconfigured.

What is desirable is to make this automatic and ensure we stay in the
nesting configuration only.

> > ARM is cleaner because it doesn't have these drivers issues. qemu can
> > reliably say not to use the S2 and all the existing guest kernels will
> > obey that.
> 
> out of curious, does SMMU have legacy mode or a given version of SMMU
> only supports either legacy mode or newer mode?

The SMMUv3 spec started out with definitions for S1 and S2 as well as
capability bits for them at day 0. So it never had this backward-
compatibility problem where we want to remove something that was
a mandatory part of the specification.

Jason


^ permalink raw reply	[flat|nested] 63+ messages in thread

* RE: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to host
  2025-06-17 13:11                         ` Jason Gunthorpe
@ 2025-06-18  2:51                           ` Duan, Zhenzhong
  2025-06-18  3:40                           ` Yi Liu
  1 sibling, 0 replies; 63+ messages in thread
From: Duan, Zhenzhong @ 2025-06-18  2:51 UTC (permalink / raw)
  To: Jason Gunthorpe, Liu, Yi L
  Cc: Nicolin Chen, Peter Xu, qemu-devel@nongnu.org,
	alex.williamson@redhat.com, clg@redhat.com, eric.auger@redhat.com,
	mst@redhat.com, jasowang@redhat.com, ddutile@redhat.com,
	shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
	clement.mathieu--drif@eviden.com, Tian, Kevin, Peng, Chao P,
	Yi Sun, Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost



>-----Original Message-----
>From: Jason Gunthorpe <jgg@nvidia.com>
>Subject: Re: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to
>host
>
>On Tue, Jun 17, 2025 at 09:03:32PM +0800, Yi Liu wrote:
>> > I suggest fixing the Linux driver to refuse to run in sm_on mode if
>> > the HW supports scalable mode and ecap_slts = false. That may not be
>> > 100% spec compliant but it seems like a reasonable approach.
>>
>> running sm_on with only ecap_flts==true is what we want here. We want
>> the guest use stage-1 page table hence it can be used by hw under the
>> nested translation mode. While this page table is only available in sm_on
>> mode.
>>
>> If we want to drop the legacy mode usage in virtualization environment, we
>> might let linux iommu driver refuse running legacy mode while ecap_slts is
>> false. I suppose HW is going to advertise both ecap_slts and ecap_flts. So
>> this will just let guest get rid of using legacy mode.
>>
>> But this is not necessary so far. As the discussion going here, we intend
>> to reuse the GPA HWPT allocated by VFIO container as well.[1] This is now
>> aligned with Nic and Shameer.
>
>I think it is an issue, nobody really wants to accidently start
>supporting and using shadow mode just because the VM is misconfigured.
>
>What is desirable is to make this automatic and ensure we stay in the
>nesting configuration only.

This would break QEMU's backward compatibility. Current QEMU supports legacy
mode when ecap_flts==true, but a newer QEMU would not?

Thanks
Zhenzhong


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to host
  2025-06-17 13:11                         ` Jason Gunthorpe
  2025-06-18  2:51                           ` Duan, Zhenzhong
@ 2025-06-18  3:40                           ` Yi Liu
  2025-06-18 11:43                             ` Jason Gunthorpe
  1 sibling, 1 reply; 63+ messages in thread
From: Yi Liu @ 2025-06-18  3:40 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Nicolin Chen, Duan, Zhenzhong, Peter Xu, qemu-devel@nongnu.org,
	alex.williamson@redhat.com, clg@redhat.com, eric.auger@redhat.com,
	mst@redhat.com, jasowang@redhat.com, ddutile@redhat.com,
	shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
	clement.mathieu--drif@eviden.com, Tian, Kevin, Peng, Chao P,
	Yi Sun, Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost

On 2025/6/17 21:11, Jason Gunthorpe wrote:
> On Tue, Jun 17, 2025 at 09:03:32PM +0800, Yi Liu wrote:
>>> I suggest fixing the Linux driver to refuse to run in sm_on mode if
>>> the HW supports scalable mode and ecap_slts = false. That may not be
>>> 100% spec compliant but it seems like a reasonable approach.
>>
>> running sm_on with only ecap_flts==true is what we want here. We want
>> the guest use stage-1 page table hence it can be used by hw under the
>> nested translation mode. While this page table is only available in sm_on
>> mode.
>>
>> If we want to drop legacy mode usage in a virtualization environment, we
>> might let the Linux iommu driver refuse to run in legacy mode while ecap_slts
>> is false. I suppose HW is going to advertise both ecap_slts and ecap_flts, so
>> this would just let the guest get rid of using legacy mode.
>>
>> But this is not necessary so far. As the discussion here goes, we intend
>> to reuse the GPA HWPT allocated by VFIO container as well.[1] This is now
>> aligned with Nic and Shameer.
> 
> I think it is an issue; nobody really wants to accidentally start
> supporting and using shadow mode just because the VM is misconfigured.

Hmm, the Intel IOMMU driver defaults to sm_on since v5.15, so if the guest
configures sm_off, that means it wants it. For kernels < 5.15, yes, the guest
will use legacy mode if sm_on has not been configured explicitly. So this
does not seem to be an issue.

Actually, as I explained in the first hunk of [1], there is no issue with
the legacy mode support. :)

[1] 
https://lore.kernel.org/qemu-devel/20250521111452.3316354-1-zhenzhong.duan@intel.com/T/#m4c8fa70742001d4c22b3c297e240a2151d2c617f

> What is desirable is to make this automatic and ensure we stay in the
> nesting configuration only.

Yes, once QEMU supports a nested-translation-based vIOMMU, it's better to use
sm_on instead of legacy mode. I think for kernels >= 5.15 this automation
has already been achieved, since sm is on by default.

>>> ARM is cleaner because it doesn't have these drivers issues. qemu can
>>> reliably say not to use the S2 and all the existing guest kernels will
>>> obey that.
>>
>> Out of curiosity, does SMMU have a legacy mode, or does a given version of
>> SMMU support only either the legacy mode or the newer mode?
> 
> The SMMUv3 spec started out with definitions for S1 and S2, as well as
> capability bits for them, at day 0. So it never had this backward
> compatibility problem where we want to remove something that was
> a mandatory part of the specification.

Got it. Yes, it's all about backward compatibility support.

Regards,
Yi Liu


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to host
  2025-06-18  3:40                           ` Yi Liu
@ 2025-06-18 11:43                             ` Jason Gunthorpe
  0 siblings, 0 replies; 63+ messages in thread
From: Jason Gunthorpe @ 2025-06-18 11:43 UTC (permalink / raw)
  To: Yi Liu
  Cc: Nicolin Chen, Duan, Zhenzhong, Peter Xu, qemu-devel@nongnu.org,
	alex.williamson@redhat.com, clg@redhat.com, eric.auger@redhat.com,
	mst@redhat.com, jasowang@redhat.com, ddutile@redhat.com,
	shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
	clement.mathieu--drif@eviden.com, Tian, Kevin, Peng, Chao P,
	Yi Sun, Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost

On Wed, Jun 18, 2025 at 11:40:38AM +0800, Yi Liu wrote:

> Actually, as I explained in the first hunk of [1], there is no issue with
> the legacy mode support. :)
> 
> [1] https://lore.kernel.org/qemu-devel/20250521111452.3316354-1-zhenzhong.duan@intel.com/T/#m4c8fa70742001d4c22b3c297e240a2151d2c617f

My feeling is that it is undesirable to have the shadowing code in the
VMM at all, as it increases the attack surface/complexity/etc.

There should be a way to fully inhibit legacy mode, and if that means
old kernels don't work that's just how it is.

Jason


^ permalink raw reply	[flat|nested] 63+ messages in thread

end of thread, other threads:[~2025-06-18 11:44 UTC | newest]

Thread overview: 63+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-05-21 11:14 [PATCH rfcv3 00/21] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
2025-05-21 11:14 ` [PATCH rfcv3 01/21] backends/iommufd: Add a helper to invalidate user-managed HWPT Zhenzhong Duan
2025-05-21 11:14 ` [PATCH rfcv3 02/21] vfio/iommufd: Add properties and handlers to TYPE_HOST_IOMMU_DEVICE_IOMMUFD Zhenzhong Duan
2025-05-21 11:14 ` [PATCH rfcv3 03/21] vfio/iommufd: Initialize iommufd specific members in HostIOMMUDeviceIOMMUFD Zhenzhong Duan
2025-05-21 11:14 ` [PATCH rfcv3 04/21] vfio/iommufd: Implement [at|de]tach_hwpt handlers Zhenzhong Duan
2025-05-21 11:14 ` [PATCH rfcv3 05/21] vfio/iommufd: Save vendor specific device info Zhenzhong Duan
2025-05-21 21:57   ` Nicolin Chen
2025-05-22  9:21     ` Duan, Zhenzhong
2025-05-22 19:35       ` Nicolin Chen
2025-05-26 12:15   ` Cédric Le Goater
2025-05-27  2:12     ` Duan, Zhenzhong
2025-05-21 11:14 ` [PATCH rfcv3 06/21] iommufd: Implement query of host VTD IOMMU's capability Zhenzhong Duan
2025-05-21 11:14 ` [PATCH rfcv3 07/21] intel_iommu: Rename vtd_ce_get_rid2pasid_entry to vtd_ce_get_pasid_entry Zhenzhong Duan
2025-05-21 11:14 ` [PATCH rfcv3 08/21] intel_iommu: Optimize context entry cache utilization Zhenzhong Duan
2025-05-21 11:14 ` [PATCH rfcv3 09/21] intel_iommu: Check for compatibility with IOMMUFD backed device when x-flts=on Zhenzhong Duan
2025-05-21 11:14 ` [PATCH rfcv3 10/21] intel_iommu: Introduce a new structure VTDHostIOMMUDevice Zhenzhong Duan
2025-05-21 11:14 ` [PATCH rfcv3 11/21] intel_iommu: Introduce two helpers vtd_as_from/to_iommu_pasid_locked Zhenzhong Duan
2025-05-21 11:14 ` [PATCH rfcv3 12/21] intel_iommu: Handle PASID entry removing and updating Zhenzhong Duan
2025-05-21 11:14 ` [PATCH rfcv3 13/21] intel_iommu: Handle PASID entry adding Zhenzhong Duan
2025-05-21 11:14 ` [PATCH rfcv3 14/21] intel_iommu: Introduce a new pasid cache invalidation type FORCE_RESET Zhenzhong Duan
2025-05-21 11:14 ` [PATCH rfcv3 15/21] intel_iommu: Bind/unbind guest page table to host Zhenzhong Duan
2025-05-21 22:49   ` Nicolin Chen
2025-05-22  6:50     ` Duan, Zhenzhong
2025-05-22 19:29       ` Nicolin Chen
2025-05-23  6:26         ` Yi Liu
2025-05-26  3:34         ` Duan, Zhenzhong
2025-05-23  6:22     ` Yi Liu
2025-05-23  6:52       ` Duan, Zhenzhong
2025-05-23 21:12       ` Nicolin Chen
2025-05-26  3:46         ` Duan, Zhenzhong
2025-05-26  7:24         ` Yi Liu
2025-05-26 17:35           ` Nicolin Chen
2025-05-28  7:12             ` Duan, Zhenzhong
2025-06-12 12:53               ` Yi Liu
2025-06-12 14:06                 ` Shameerali Kolothum Thodi via
2025-06-16  6:04                   ` Nicolin Chen
2025-06-16  3:24                 ` Duan, Zhenzhong
2025-06-16  6:34                   ` Nicolin Chen
2025-06-16  8:54                     ` Duan, Zhenzhong
2025-06-16  9:36                       ` Yi Liu
2025-06-16 10:16                         ` Duan, Zhenzhong
2025-06-17  7:04                           ` Yi Liu
2025-06-16  5:59                 ` Nicolin Chen
2025-06-16  7:38                   ` Yi Liu
2025-06-17  3:22                     ` Nicolin Chen
2025-06-17  6:48                       ` Yi Liu
2025-06-16  5:47               ` Nicolin Chen
2025-06-16  8:15                 ` Duan, Zhenzhong
2025-06-17  3:14                   ` Nicolin Chen
2025-06-17 12:37                     ` Jason Gunthorpe
2025-06-17 13:03                       ` Yi Liu
2025-06-17 13:11                         ` Jason Gunthorpe
2025-06-18  2:51                           ` Duan, Zhenzhong
2025-06-18  3:40                           ` Yi Liu
2025-06-18 11:43                             ` Jason Gunthorpe
2025-05-21 11:14 ` [PATCH rfcv3 16/21] intel_iommu: ERRATA_772415 workaround Zhenzhong Duan
2025-05-21 11:14 ` [PATCH rfcv3 17/21] intel_iommu: Replay pasid binds after context cache invalidation Zhenzhong Duan
2025-05-21 11:14 ` [PATCH rfcv3 18/21] intel_iommu: Propagate PASID-based iotlb invalidation to host Zhenzhong Duan
2025-05-21 11:14 ` [PATCH rfcv3 19/21] intel_iommu: Refresh pasid bind when either SRTP or TE bit is changed Zhenzhong Duan
2025-05-21 11:14 ` [PATCH rfcv3 20/21] intel_iommu: Bypass replay in stage-1 page table mode Zhenzhong Duan
2025-05-21 11:14 ` [PATCH rfcv3 21/21] intel_iommu: Enable host device when x-flts=on in scalable mode Zhenzhong Duan
2025-05-26 12:19 ` [PATCH rfcv3 00/21] intel_iommu: Enable stage-1 translation for passthrough device Cédric Le Goater
2025-05-27  2:16   ` Duan, Zhenzhong

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).