* [PATCH rfcv2 00/20] intel_iommu: Enable stage-1 translation for passthrough device
@ 2025-02-19  8:22 Zhenzhong Duan
  2025-02-19  8:22 ` [PATCH rfcv2 01/20] backends/iommufd: Add helpers for invalidating user-managed HWPT Zhenzhong Duan
                   ` (21 more replies)
  0 siblings, 22 replies; 68+ messages in thread
From: Zhenzhong Duan @ 2025-02-19  8:22 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, jgg,
	nicolinc, shameerali.kolothum.thodi, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan

Hi,

Per Jason Wang's suggestion, iommufd nesting series[1] is split into
"Enable stage-1 translation for emulated device" series and
"Enable stage-1 translation for passthrough device" series.

This series is the second part, focusing on passthrough devices. Instead
of shadowing the guest page table for passthrough devices, we pass the
stage-1 page table to the host side to construct a nested domain. There
was an earlier effort to enable this feature; see [2] for details.

The key design is to utilize the dual-stage IOMMU translation
(also known as IOMMU nested translation) capability in the host IOMMU.
As the diagram below shows, the guest I/O page table pointer in GPA
(guest physical address) is passed to the host and used to perform the
stage-1 address translation. Along with that, modifications to present
mappings in the guest I/O page table must be followed by an IOTLB
invalidation (see the propagation sketch after the diagram below).

        .-------------.  .---------------------------.
        |   vIOMMU    |  | Guest I/O page table      |
        |             |  '---------------------------'
        .----------------/
        | PASID Entry |--- PASID cache flush --+
        '-------------'                        |
        |             |                        V
        |             |           I/O page table pointer in GPA
        '-------------'
    Guest
    ------| Shadow |---------------------------|--------
          v        v                           v
    Host
        .-------------.  .------------------------.
        |   pIOMMU    |  |  FS for GIOVA->GPA     |
        |             |  '------------------------'
        .----------------/  |
        | PASID Entry |     V (Nested xlate)
        '----------------\.----------------------------------.
        |             |   | SS for GPA->HPA, unmanaged domain|
        |             |   '----------------------------------'
        '-------------'
Where:
 - FS = First stage page tables
 - SS = Second stage page tables
<Intel VT-d Nested translation>
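
For reference, the sketch below shows how such a guest IOTLB invalidation
could be forwarded to the host through the helper added in patch 1. It is
illustrative only and assumes the Intel VT-d stage-1 invalidation uAPI
from <linux/iommufd.h>; the function name propagate_guest_iotlb_inv and
its parameters (values taken from the guest invalidation descriptor and
the nested hwpt the device is attached to) are placeholders, not code in
this series.

    /* Sketch: propagate one guest stage-1 IOTLB invalidation to the host */
    static void propagate_guest_iotlb_inv(IOMMUFDBackend *iommufd,
                                          uint32_t hwpt_id, uint64_t iova,
                                          uint64_t npages, bool leaf)
    {
        struct iommu_hwpt_vtd_s1_invalidate inv = {
            .addr = iova,
            .npages = npages,
            .flags = leaf ? IOMMU_VTD_INV_FLAGS_LEAF : 0,
        };
        uint32_t entry_num = 1;    /* one invalidation entry */

        iommufd_backend_invalidate_cache(iommufd, hwpt_id,
                                         IOMMU_HWPT_INVALIDATE_DATA_VTD_S1,
                                         sizeof(inv), &entry_num, &inv);
    }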

There are two interactions between VFIO and the vIOMMU:
* The vIOMMU registers PCIIOMMUOps [set|unset]_iommu_device with the PCI
  subsystem. VFIO calls them to register/unregister a HostIOMMUDevice
  instance with the vIOMMU at VFIO device realize time (see the sketch
  after the diagram below).
* The vIOMMU calls the HostIOMMUDeviceIOMMUFD interface [at|de]tach_hwpt
  to bind/unbind a device to/from an IOMMUFD backed domain, either a
  nested domain or not.

See below diagram:

        VFIO Device                                 Intel IOMMU
    .-----------------.                         .-------------------.
    |                 |                         |                   |
    |       .---------|PCIIOMMUOps              |.-------------.    |
    |       | IOMMUFD |(set_iommu_device)       || Host IOMMU  |    |
    |       | Device  |------------------------>|| Device list |    |
    |       .---------|(unset_iommu_device)     |.-------------.    |
    |                 |                         |       |           |
    |                 |                         |       V           |
    |       .---------|  HostIOMMUDeviceIOMMUFD |  .-------------.  |
    |       | IOMMUFD |            (attach_hwpt)|  | Host IOMMU  |  |
    |       | link    |<------------------------|  |   Device    |  |
    |       .---------|            (detach_hwpt)|  .-------------.  |
    |                 |                         |       |           |
    |                 |                         |       ...         |
    .-----------------.                         .-------------------.
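
As a concrete reference, the registration and hand-over look roughly like
the minimal sketch below, following QEMU's PCIIOMMUOps; the VT-d handlers
named here are the ones added later in this series, and the VFIO snippet
is a simplified approximation of the realize path:

    /* vIOMMU side: expose the callbacks to the PCI subsystem */
    static PCIIOMMUOps vtd_iommu_ops = {
        .get_address_space = vtd_host_dma_iommu,
        .set_iommu_device = vtd_dev_set_iommu_device,
        .unset_iommu_device = vtd_dev_unset_iommu_device,
    };

    /* VFIO side: hand the HostIOMMUDevice over at device realize time */
    if (!pci_device_set_iommu_device(pdev, vbasedev->hiod, errp)) {
        return false;
    }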

Based on Yi's suggestion, this design shares ioas/hwpt whenever possible
and creates new ones only on demand; it also supports multiple iommufd
objects and ERRATA_772415.

For example, a stage-2 page table can be shared by different devices if
there is no conflict and the devices link to the same iommufd object,
i.e. devices under the same host IOMMU can share the same stage-2 page
table. If there is a conflict, e.g. one device runs in non-cache-coherency
mode while the others do not, that device requires a separate stage-2
page table in non-CC mode.

The SPR platform has ERRATA_772415, which requires that the stage-2 page
table contain no read-only mappings. This series supports creating a
VTDIOASContainer without read-only mappings. Even in the rare case where
some IOMMUs on a multi-IOMMU host have ERRATA_772415 and others do not,
this design still works.

See below example diagram for a full view:

      IntelIOMMUState
             |
             V
    .------------------.    .------------------.    .-------------------.
    | VTDIOASContainer |--->| VTDIOASContainer |--->| VTDIOASContainer  |-->...
    | (iommufd0,RW&RO) |    | (iommufd1,RW&RO) |    | (iommufd0,RW only)|
    .------------------.    .------------------.    .-------------------.
             |                       |                              |
             |                       .-->...                        |
             V                                                      V
      .-------------------.    .-------------------.          .---------------.
      |   VTDS2Hwpt(CC)   |--->| VTDS2Hwpt(non-CC) |-->...    | VTDS2Hwpt(CC) |-->...
      .-------------------.    .-------------------.          .---------------.
          |            |               |                            |
          |            |               |                            |
    .-----------.  .-----------.  .------------.              .------------.
    | IOMMUFD   |  | IOMMUFD   |  | IOMMUFD    |              | IOMMUFD    |
    | Device(CC)|  | Device(CC)|  | Device     |              | Device(CC) |
    | (iommufd0)|  | (iommufd0)|  | (non-CC)   |              | (errata)   |
    |           |  |           |  | (iommufd0) |              | (iommufd0) |
    .-----------.  .-----------.  .------------.              .------------.
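
To make the diagram above more concrete, a rough C sketch of how these
objects could relate is shown below. The field names are purely
illustrative assumptions (using QEMU's queue.h list macros); the real
VTDIOASContainer/VTDS2Hwpt definitions are introduced later in the series.

    /* Illustrative only, not the actual definitions */
    typedef struct VTDS2Hwpt {
        uint32_t hwpt_id;                  /* stage-2 hwpt from iommufd */
        bool cache_coherency;              /* CC vs non-CC devices */
        uint32_t users;                    /* attached devices */
        QLIST_ENTRY(VTDS2Hwpt) next;
    } VTDS2Hwpt;

    typedef struct VTDIOASContainer {
        IOMMUFDBackend *iommufd;           /* owning iommufd object */
        uint32_t ioas_id;
        bool errata;                       /* ERRATA_772415: no RO mappings */
        QLIST_HEAD(, VTDS2Hwpt) s2_hwpts;  /* shared on demand */
        QLIST_ENTRY(VTDIOASContainer) next;
    } VTDIOASContainer;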

This series is also prerequisite work for vSVA, i.e. sharing a guest
application's address space with passthrough devices.

To enable stage-1 translation, add "x-scalable-mode=on,x-flts=on" to the
vIOMMU, i.e. -device intel-iommu,x-scalable-mode=on,x-flts=on,...

A passthrough device must use the iommufd backend to work with stage-1
translation, i.e. -object iommufd,id=iommufd0 -device vfio-pci,iommufd=iommufd0,...

If the host doesn't support nested translation, QEMU fails and reports
the unsupported configuration.
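
Putting the two together, a minimal example command line could look like
the following (the machine options, device BDF and remaining options are
placeholders):

    qemu-system-x86_64 -M q35,kernel-irqchip=split -enable-kvm -m 4G \
        -device intel-iommu,x-scalable-mode=on,x-flts=on \
        -object iommufd,id=iommufd0 \
        -device vfio-pci,host=0000:3b:00.0,iommufd=iommufd0 \
        ...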

Test done:
- VFIO devices hotplug/unplug
- different VFIO devices linked to different iommufds
- vhost net device ping test

PATCH1-8:  Add HWPT-based nesting infrastructure support
PATCH9-10: Some cleanup work
PATCH11:   cap/ecap related compatibility check between vIOMMU and Host IOMMU
PATCH12-19: Implement stage-1 page table for passthrough device
PATCH20:   Enable stage-1 translation for passthrough device

QEMU code can be found at [3].

TODO:
- RAM discard
- dirty tracking on stage-2 page table

[1] https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg02740.html
[2] https://patchwork.kernel.org/project/kvm/cover/20210302203827.437645-1-yi.l.liu@intel.com/
[3] https://github.com/yiliu1765/qemu/tree/zhenzhong/iommufd_nesting_rfcv2

Thanks
Zhenzhong

Changelog:
rfcv2:
- Drop VTDPASIDAddressSpace and use VTDAddressSpace (Eric, Liuyi)
- Move HWPT uAPI patches ahead (patch1-8) so ARM nesting can easily rebase
- Add two cleanup patches (patch9-10)
- VFIO passes iommufd/devid/hwpt_id to vIOMMU instead of iommufd/devid/ioas_id
- Add vtd_as_[from|to]_iommu_pasid() helpers to translate between vtd_as and
  iommu pasid, which is important for dropping VTDPASIDAddressSpace

Yi Liu (3):
  intel_iommu: Replay pasid binds after context cache invalidation
  intel_iommu: Propagate PASID-based iotlb invalidation to host
  intel_iommu: Refresh pasid bind when either SRTP or TE bit is changed

Zhenzhong Duan (17):
  backends/iommufd: Add helpers for invalidating user-managed HWPT
  vfio/iommufd: Add properties and handlers to
    TYPE_HOST_IOMMU_DEVICE_IOMMUFD
  HostIOMMUDevice: Introduce realize_late callback
  vfio/iommufd: Implement HostIOMMUDeviceClass::realize_late() handler
  vfio/iommufd: Implement [at|de]tach_hwpt handlers
  host_iommu_device: Define two new capabilities
    HOST_IOMMU_DEVICE_CAP_[NESTING|FS1GP]
  iommufd: Implement query of HOST_IOMMU_DEVICE_CAP_[NESTING|FS1GP]
  iommufd: Implement query of HOST_IOMMU_DEVICE_CAP_ERRATA
  intel_iommu: Rename vtd_ce_get_rid2pasid_entry to
    vtd_ce_get_pasid_entry
  intel_iommu: Optimize context entry cache utilization
  intel_iommu: Check for compatibility with IOMMUFD backed device when
    x-flts=on
  intel_iommu: Introduce a new structure VTDHostIOMMUDevice
  intel_iommu: Add PASID cache management infrastructure
  intel_iommu: Bind/unbind guest page table to host
  intel_iommu: ERRATA_772415 workaround
  intel_iommu: Bypass replay in stage-1 page table mode
  intel_iommu: Enable host device when x-flts=on in scalable mode

 hw/i386/intel_iommu_internal.h     |   56 +
 include/hw/i386/intel_iommu.h      |   33 +-
 include/system/host_iommu_device.h |   40 +
 include/system/iommufd.h           |   53 +
 backends/iommufd.c                 |   58 +
 hw/i386/intel_iommu.c              | 1660 ++++++++++++++++++++++++----
 hw/vfio/common.c                   |   17 +-
 hw/vfio/iommufd.c                  |   48 +
 backends/trace-events              |    1 +
 hw/i386/trace-events               |   13 +
 10 files changed, 1776 insertions(+), 203 deletions(-)

-- 
2.34.1




* [PATCH rfcv2 01/20] backends/iommufd: Add helpers for invalidating user-managed HWPT
  2025-02-19  8:22 [PATCH rfcv2 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
@ 2025-02-19  8:22 ` Zhenzhong Duan
  2025-02-20 16:47   ` Eric Auger
  2025-02-24 10:03   ` Shameerali Kolothum Thodi via
  2025-02-19  8:22 ` [PATCH rfcv2 02/20] vfio/iommufd: Add properties and handlers to TYPE_HOST_IOMMU_DEVICE_IOMMUFD Zhenzhong Duan
                   ` (20 subsequent siblings)
  21 siblings, 2 replies; 68+ messages in thread
From: Zhenzhong Duan @ 2025-02-19  8:22 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, jgg,
	nicolinc, shameerali.kolothum.thodi, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 include/system/iommufd.h |  3 +++
 backends/iommufd.c       | 30 ++++++++++++++++++++++++++++++
 backends/trace-events    |  1 +
 3 files changed, 34 insertions(+)

diff --git a/include/system/iommufd.h b/include/system/iommufd.h
index cbab75bfbf..5d02e9d148 100644
--- a/include/system/iommufd.h
+++ b/include/system/iommufd.h
@@ -61,6 +61,9 @@ bool iommufd_backend_get_dirty_bitmap(IOMMUFDBackend *be, uint32_t hwpt_id,
                                       uint64_t iova, ram_addr_t size,
                                       uint64_t page_size, uint64_t *data,
                                       Error **errp);
+int iommufd_backend_invalidate_cache(IOMMUFDBackend *be, uint32_t hwpt_id,
+                                     uint32_t data_type, uint32_t entry_len,
+                                     uint32_t *entry_num, void *data_ptr);
 
 #define TYPE_HOST_IOMMU_DEVICE_IOMMUFD TYPE_HOST_IOMMU_DEVICE "-iommufd"
 #endif
diff --git a/backends/iommufd.c b/backends/iommufd.c
index d57da44755..fc32aad5cb 100644
--- a/backends/iommufd.c
+++ b/backends/iommufd.c
@@ -311,6 +311,36 @@ bool iommufd_backend_get_device_info(IOMMUFDBackend *be, uint32_t devid,
     return true;
 }
 
+int iommufd_backend_invalidate_cache(IOMMUFDBackend *be, uint32_t hwpt_id,
+                                     uint32_t data_type, uint32_t entry_len,
+                                     uint32_t *entry_num, void *data_ptr)
+{
+    int ret, fd = be->fd;
+    struct iommu_hwpt_invalidate cache = {
+        .size = sizeof(cache),
+        .hwpt_id = hwpt_id,
+        .data_type = data_type,
+        .entry_len = entry_len,
+        .entry_num = *entry_num,
+        .data_uptr = (uintptr_t)data_ptr,
+    };
+
+    ret = ioctl(fd, IOMMU_HWPT_INVALIDATE, &cache);
+
+    trace_iommufd_backend_invalidate_cache(fd, hwpt_id, data_type, entry_len,
+                                           *entry_num, cache.entry_num,
+                                           (uintptr_t)data_ptr, ret);
+    if (ret) {
+        *entry_num = cache.entry_num;
+        error_report("IOMMU_HWPT_INVALIDATE failed: %s", strerror(errno));
+        ret = -errno;
+    } else {
+        g_assert(*entry_num == cache.entry_num);
+    }
+
+    return ret;
+}
+
 static int hiod_iommufd_get_cap(HostIOMMUDevice *hiod, int cap, Error **errp)
 {
     HostIOMMUDeviceCaps *caps = &hiod->caps;
diff --git a/backends/trace-events b/backends/trace-events
index 40811a3162..5a23db6c8a 100644
--- a/backends/trace-events
+++ b/backends/trace-events
@@ -18,3 +18,4 @@ iommufd_backend_alloc_hwpt(int iommufd, uint32_t dev_id, uint32_t pt_id, uint32_
 iommufd_backend_free_id(int iommufd, uint32_t id, int ret) " iommufd=%d id=%d (%d)"
 iommufd_backend_set_dirty(int iommufd, uint32_t hwpt_id, bool start, int ret) " iommufd=%d hwpt=%u enable=%d (%d)"
 iommufd_backend_get_dirty_bitmap(int iommufd, uint32_t hwpt_id, uint64_t iova, uint64_t size, uint64_t page_size, int ret) " iommufd=%d hwpt=%u iova=0x%"PRIx64" size=0x%"PRIx64" page_size=0x%"PRIx64" (%d)"
+iommufd_backend_invalidate_cache(int iommufd, uint32_t hwpt_id, uint32_t data_type, uint32_t entry_len, uint32_t entry_num, uint32_t done_num, uint64_t data_ptr, int ret) " iommufd=%d hwpt_id=%u data_type=%u entry_len=%u entry_num=%u done_num=%u data_ptr=0x%"PRIx64" (%d)"
-- 
2.34.1




* [PATCH rfcv2 02/20] vfio/iommufd: Add properties and handlers to TYPE_HOST_IOMMU_DEVICE_IOMMUFD
  2025-02-19  8:22 [PATCH rfcv2 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
  2025-02-19  8:22 ` [PATCH rfcv2 01/20] backends/iommufd: Add helpers for invalidating user-managed HWPT Zhenzhong Duan
@ 2025-02-19  8:22 ` Zhenzhong Duan
  2025-02-20 17:42   ` Eric Auger
  2025-02-19  8:22 ` [PATCH rfcv2 03/20] HostIOMMUDevice: Introduce realize_late callback Zhenzhong Duan
                   ` (19 subsequent siblings)
  21 siblings, 1 reply; 68+ messages in thread
From: Zhenzhong Duan @ 2025-02-19  8:22 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, jgg,
	nicolinc, shameerali.kolothum.thodi, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan

The newly added properties are the IOMMUFD handle, devid and hwpt_id.
The IOMMUFD handle and devid are used to allocate/free ioas and hwpt.
hwpt_id is used to re-attach an IOMMUFD backed device to its default
VFIO sub-system created hwpt, i.e., when the vIOMMU is disabled by the
guest. These properties are initialized in the .realize_late() handler.

The newly added handlers are [at|de]tach_hwpt. They are used to
attach/detach hwpt. VFIO and VDPA attach and detach in different ways,
so the implementation lives in the sub-classes instead of
HostIOMMUDeviceIOMMUFD.

Add two wrappers host_iommu_device_iommufd_[at|de]tach_hwpt to
wrap the two handlers.

This is a prerequisite patch for the following ones.
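
A minimal usage sketch of the two wrappers from a hypothetical vIOMMU
caller is shown below; s1_hwpt_id and the surrounding helper names are
illustrative, only the host_iommu_device_iommufd_* calls come from this
patch:

    /* Illustrative only: move a device onto a nested (stage-1 backed) hwpt */
    static bool attach_nested_hwpt(HostIOMMUDeviceIOMMUFD *idev,
                                   uint32_t s1_hwpt_id, Error **errp)
    {
        return host_iommu_device_iommufd_attach_hwpt(idev, s1_hwpt_id, errp);
    }

    /* Illustrative only: fall back to the default hwpt recorded in idev */
    static bool attach_default_hwpt(HostIOMMUDeviceIOMMUFD *idev, Error **errp)
    {
        return host_iommu_device_iommufd_detach_hwpt(idev, errp) &&
               host_iommu_device_iommufd_attach_hwpt(idev, idev->hwpt_id, errp);
    }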

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 include/system/iommufd.h | 50 ++++++++++++++++++++++++++++++++++++++++
 backends/iommufd.c       | 22 ++++++++++++++++++
 2 files changed, 72 insertions(+)

diff --git a/include/system/iommufd.h b/include/system/iommufd.h
index 5d02e9d148..a871601df5 100644
--- a/include/system/iommufd.h
+++ b/include/system/iommufd.h
@@ -66,4 +66,54 @@ int iommufd_backend_invalidate_cache(IOMMUFDBackend *be, uint32_t hwpt_id,
                                      uint32_t *entry_num, void *data_ptr);
 
 #define TYPE_HOST_IOMMU_DEVICE_IOMMUFD TYPE_HOST_IOMMU_DEVICE "-iommufd"
+OBJECT_DECLARE_TYPE(HostIOMMUDeviceIOMMUFD, HostIOMMUDeviceIOMMUFDClass,
+                    HOST_IOMMU_DEVICE_IOMMUFD)
+
+/* Abstract of host IOMMU device with iommufd backend */
+struct HostIOMMUDeviceIOMMUFD {
+    HostIOMMUDevice parent_obj;
+
+    IOMMUFDBackend *iommufd;
+    uint32_t devid;
+    uint32_t hwpt_id;
+};
+
+struct HostIOMMUDeviceIOMMUFDClass {
+    HostIOMMUDeviceClass parent_class;
+
+    /**
+     * @attach_hwpt: attach host IOMMU device to IOMMUFD hardware page table.
+     * VFIO and VDPA device can have different implementation.
+     *
+     * Mandatory callback.
+     *
+     * @idev: host IOMMU device backed by IOMMUFD backend.
+     *
+     * @hwpt_id: ID of IOMMUFD hardware page table.
+     *
+     * @errp: pass an Error out when attachment fails.
+     *
+     * Returns: true on success, false on failure.
+     */
+    bool (*attach_hwpt)(HostIOMMUDeviceIOMMUFD *idev, uint32_t hwpt_id,
+                        Error **errp);
+    /**
+     * @detach_hwpt: detach host IOMMU device from IOMMUFD hardware page table.
+     * VFIO and VDPA device can have different implementation.
+     *
+     * Mandatory callback.
+     *
+     * @idev: host IOMMU device backed by IOMMUFD backend.
+     *
+     * @errp: pass an Error out when attachment fails.
+     *
+     * Returns: true on success, false on failure.
+     */
+    bool (*detach_hwpt)(HostIOMMUDeviceIOMMUFD *idev, Error **errp);
+};
+
+bool host_iommu_device_iommufd_attach_hwpt(HostIOMMUDeviceIOMMUFD *idev,
+                                           uint32_t hwpt_id, Error **errp);
+bool host_iommu_device_iommufd_detach_hwpt(HostIOMMUDeviceIOMMUFD *idev,
+                                           Error **errp);
 #endif
diff --git a/backends/iommufd.c b/backends/iommufd.c
index fc32aad5cb..574f330c27 100644
--- a/backends/iommufd.c
+++ b/backends/iommufd.c
@@ -341,6 +341,26 @@ int iommufd_backend_invalidate_cache(IOMMUFDBackend *be, uint32_t hwpt_id,
     return ret;
 }
 
+bool host_iommu_device_iommufd_attach_hwpt(HostIOMMUDeviceIOMMUFD *idev,
+                                           uint32_t hwpt_id, Error **errp)
+{
+    HostIOMMUDeviceIOMMUFDClass *idevc =
+        HOST_IOMMU_DEVICE_IOMMUFD_GET_CLASS(idev);
+
+    g_assert(idevc->attach_hwpt);
+    return idevc->attach_hwpt(idev, hwpt_id, errp);
+}
+
+bool host_iommu_device_iommufd_detach_hwpt(HostIOMMUDeviceIOMMUFD *idev,
+                                           Error **errp)
+{
+    HostIOMMUDeviceIOMMUFDClass *idevc =
+        HOST_IOMMU_DEVICE_IOMMUFD_GET_CLASS(idev);
+
+    g_assert(idevc->detach_hwpt);
+    return idevc->detach_hwpt(idev, errp);
+}
+
 static int hiod_iommufd_get_cap(HostIOMMUDevice *hiod, int cap, Error **errp)
 {
     HostIOMMUDeviceCaps *caps = &hiod->caps;
@@ -379,6 +399,8 @@ static const TypeInfo types[] = {
     }, {
         .name = TYPE_HOST_IOMMU_DEVICE_IOMMUFD,
         .parent = TYPE_HOST_IOMMU_DEVICE,
+        .instance_size = sizeof(HostIOMMUDeviceIOMMUFD),
+        .class_size = sizeof(HostIOMMUDeviceIOMMUFDClass),
         .class_init = hiod_iommufd_class_init,
         .abstract = true,
     }
-- 
2.34.1




* [PATCH rfcv2 03/20] HostIOMMUDevice: Introduce realize_late callback
  2025-02-19  8:22 [PATCH rfcv2 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
  2025-02-19  8:22 ` [PATCH rfcv2 01/20] backends/iommufd: Add helpers for invalidating user-managed HWPT Zhenzhong Duan
  2025-02-19  8:22 ` [PATCH rfcv2 02/20] vfio/iommufd: Add properties and handlers to TYPE_HOST_IOMMU_DEVICE_IOMMUFD Zhenzhong Duan
@ 2025-02-19  8:22 ` Zhenzhong Duan
  2025-02-20 17:48   ` Eric Auger
  2025-04-07 11:19   ` Cédric Le Goater
  2025-02-19  8:22 ` [PATCH rfcv2 04/20] vfio/iommufd: Implement HostIOMMUDeviceClass::realize_late() handler Zhenzhong Duan
                   ` (18 subsequent siblings)
  21 siblings, 2 replies; 68+ messages in thread
From: Zhenzhong Duan @ 2025-02-19  8:22 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, jgg,
	nicolinc, shameerali.kolothum.thodi, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan

Currently we have a realize() callback which is called before attachment.
But some elements, e.g., hwpt_id, are not ready before attachment. So we
need a realize_late() callback to further initialize them.

Currently, this callback is only useful for the iommufd backend. For the
legacy backend nothing needs to be initialized after attachment.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 include/system/host_iommu_device.h | 17 +++++++++++++++++
 hw/vfio/common.c                   | 17 ++++++++++++++---
 2 files changed, 31 insertions(+), 3 deletions(-)

diff --git a/include/system/host_iommu_device.h b/include/system/host_iommu_device.h
index 809cced4ba..df782598f2 100644
--- a/include/system/host_iommu_device.h
+++ b/include/system/host_iommu_device.h
@@ -66,6 +66,23 @@ struct HostIOMMUDeviceClass {
      * Returns: true on success, false on failure.
      */
     bool (*realize)(HostIOMMUDevice *hiod, void *opaque, Error **errp);
+    /**
+     * @realize_late: initialize host IOMMU device instance after attachment,
+     *                some elements e.g., ioas are ready only after attachment.
+     *                This callback initializes them.
+     *
+     * Optional callback.
+     *
+     * @hiod: pointer to a host IOMMU device instance.
+     *
+     * @opaque: pointer to agent device of this host IOMMU device,
+     *          e.g., VFIO base device or VDPA device.
+     *
+     * @errp: pass an Error out when realize fails.
+     *
+     * Returns: true on success, false on failure.
+     */
+    bool (*realize_late)(HostIOMMUDevice *hiod, void *opaque, Error **errp);
     /**
      * @get_cap: check if a host IOMMU device capability is supported.
      *
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index abbdc56b6d..e198b1e5a2 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1550,6 +1550,7 @@ bool vfio_attach_device(char *name, VFIODevice *vbasedev,
     const VFIOIOMMUClass *ops =
         VFIO_IOMMU_CLASS(object_class_by_name(TYPE_VFIO_IOMMU_LEGACY));
     HostIOMMUDevice *hiod = NULL;
+    HostIOMMUDeviceClass *hiod_ops = NULL;
 
     if (vbasedev->iommufd) {
         ops = VFIO_IOMMU_CLASS(object_class_by_name(TYPE_VFIO_IOMMU_IOMMUFD));
@@ -1560,16 +1561,26 @@ bool vfio_attach_device(char *name, VFIODevice *vbasedev,
 
     if (!vbasedev->mdev) {
         hiod = HOST_IOMMU_DEVICE(object_new(ops->hiod_typename));
+        hiod_ops = HOST_IOMMU_DEVICE_GET_CLASS(hiod);
         vbasedev->hiod = hiod;
     }
 
     if (!ops->attach_device(name, vbasedev, as, errp)) {
-        object_unref(hiod);
-        vbasedev->hiod = NULL;
-        return false;
+        goto err_attach;
+    }
+
+    if (hiod_ops && hiod_ops->realize_late &&
+        !hiod_ops->realize_late(hiod, vbasedev, errp)) {
+        ops->detach_device(vbasedev);
+        goto err_attach;
     }
 
     return true;
+
+err_attach:
+    object_unref(hiod);
+    vbasedev->hiod = NULL;
+    return false;
 }
 
 void vfio_detach_device(VFIODevice *vbasedev)
-- 
2.34.1




* [PATCH rfcv2 04/20] vfio/iommufd: Implement HostIOMMUDeviceClass::realize_late() handler
  2025-02-19  8:22 [PATCH rfcv2 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
                   ` (2 preceding siblings ...)
  2025-02-19  8:22 ` [PATCH rfcv2 03/20] HostIOMMUDevice: Introduce realize_late callback Zhenzhong Duan
@ 2025-02-19  8:22 ` Zhenzhong Duan
  2025-02-20 18:07   ` Eric Auger
  2025-02-19  8:22 ` [PATCH rfcv2 05/20] vfio/iommufd: Implement [at|de]tach_hwpt handlers Zhenzhong Duan
                   ` (17 subsequent siblings)
  21 siblings, 1 reply; 68+ messages in thread
From: Zhenzhong Duan @ 2025-02-19  8:22 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, jgg,
	nicolinc, shameerali.kolothum.thodi, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan

There are three iommufd related elements: the iommufd handle, devid and
hwpt_id. hwpt_id is ready only after VFIO device attachment. The device id
and iommufd handle are ready before attachment, but since they are all
iommufd related, initialize them together with hwpt_id in realize_late().

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/vfio/iommufd.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index df61edffc0..53639bf88b 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -828,6 +828,19 @@ static bool hiod_iommufd_vfio_realize(HostIOMMUDevice *hiod, void *opaque,
     return true;
 }
 
+static bool hiod_iommufd_vfio_realize_late(HostIOMMUDevice *hiod, void *opaque,
+                                           Error **errp)
+{
+    VFIODevice *vdev = opaque;
+    HostIOMMUDeviceIOMMUFD *idev = HOST_IOMMU_DEVICE_IOMMUFD(hiod);
+
+    idev->iommufd = vdev->iommufd;
+    idev->devid = vdev->devid;
+    idev->hwpt_id = vdev->hwpt->hwpt_id;
+
+    return true;
+}
+
 static GList *
 hiod_iommufd_vfio_get_iova_ranges(HostIOMMUDevice *hiod)
 {
@@ -852,6 +865,7 @@ static void hiod_iommufd_vfio_class_init(ObjectClass *oc, void *data)
     HostIOMMUDeviceClass *hiodc = HOST_IOMMU_DEVICE_CLASS(oc);
 
     hiodc->realize = hiod_iommufd_vfio_realize;
+    hiodc->realize_late = hiod_iommufd_vfio_realize_late;
     hiodc->get_iova_ranges = hiod_iommufd_vfio_get_iova_ranges;
     hiodc->get_page_size_mask = hiod_iommufd_vfio_get_page_size_mask;
 };
-- 
2.34.1




* [PATCH rfcv2 05/20] vfio/iommufd: Implement [at|de]tach_hwpt handlers
  2025-02-19  8:22 [PATCH rfcv2 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
                   ` (3 preceding siblings ...)
  2025-02-19  8:22 ` [PATCH rfcv2 04/20] vfio/iommufd: Implement HostIOMMUDeviceClass::realize_late() handler Zhenzhong Duan
@ 2025-02-19  8:22 ` Zhenzhong Duan
  2025-02-20 18:13   ` Eric Auger
  2025-02-19  8:22 ` [PATCH rfcv2 06/20] host_iommu_device: Define two new capabilities HOST_IOMMU_DEVICE_CAP_[NESTING|FS1GP] Zhenzhong Duan
                   ` (16 subsequent siblings)
  21 siblings, 1 reply; 68+ messages in thread
From: Zhenzhong Duan @ 2025-02-19  8:22 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, jgg,
	nicolinc, shameerali.kolothum.thodi, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan

Implement the [at|de]tach_hwpt handlers in the VFIO subsystem. The vIOMMU
uses them to attach to or detach from a hwpt on the host side.

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/vfio/iommufd.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index 53639bf88b..175c4fe1f4 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -802,6 +802,24 @@ static void vfio_iommu_iommufd_class_init(ObjectClass *klass, void *data)
     vioc->query_dirty_bitmap = iommufd_query_dirty_bitmap;
 };
 
+static bool
+host_iommu_device_iommufd_vfio_attach_hwpt(HostIOMMUDeviceIOMMUFD *idev,
+                                           uint32_t hwpt_id, Error **errp)
+{
+    VFIODevice *vbasedev = HOST_IOMMU_DEVICE(idev)->agent;
+
+    return !iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt_id, errp);
+}
+
+static bool
+host_iommu_device_iommufd_vfio_detach_hwpt(HostIOMMUDeviceIOMMUFD *idev,
+                                           Error **errp)
+{
+    VFIODevice *vbasedev = HOST_IOMMU_DEVICE(idev)->agent;
+
+    return iommufd_cdev_detach_ioas_hwpt(vbasedev, errp);
+}
+
 static bool hiod_iommufd_vfio_realize(HostIOMMUDevice *hiod, void *opaque,
                                       Error **errp)
 {
@@ -863,11 +881,15 @@ hiod_iommufd_vfio_get_page_size_mask(HostIOMMUDevice *hiod)
 static void hiod_iommufd_vfio_class_init(ObjectClass *oc, void *data)
 {
     HostIOMMUDeviceClass *hiodc = HOST_IOMMU_DEVICE_CLASS(oc);
+    HostIOMMUDeviceIOMMUFDClass *idevc = HOST_IOMMU_DEVICE_IOMMUFD_CLASS(oc);
 
     hiodc->realize = hiod_iommufd_vfio_realize;
     hiodc->realize_late = hiod_iommufd_vfio_realize_late;
     hiodc->get_iova_ranges = hiod_iommufd_vfio_get_iova_ranges;
     hiodc->get_page_size_mask = hiod_iommufd_vfio_get_page_size_mask;
+
+    idevc->attach_hwpt = host_iommu_device_iommufd_vfio_attach_hwpt;
+    idevc->detach_hwpt = host_iommu_device_iommufd_vfio_detach_hwpt;
 };
 
 static const TypeInfo types[] = {
-- 
2.34.1




* [PATCH rfcv2 06/20] host_iommu_device: Define two new capabilities HOST_IOMMU_DEVICE_CAP_[NESTING|FS1GP]
  2025-02-19  8:22 [PATCH rfcv2 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
                   ` (4 preceding siblings ...)
  2025-02-19  8:22 ` [PATCH rfcv2 05/20] vfio/iommufd: Implement [at|de]tach_hwpt handlers Zhenzhong Duan
@ 2025-02-19  8:22 ` Zhenzhong Duan
  2025-02-20 18:41   ` Eric Auger
  2025-02-19  8:22 ` [PATCH rfcv2 07/20] iommufd: Implement query of HOST_IOMMU_DEVICE_CAP_[NESTING|FS1GP] Zhenzhong Duan
                   ` (15 subsequent siblings)
  21 siblings, 1 reply; 68+ messages in thread
From: Zhenzhong Duan @ 2025-02-19  8:22 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, jgg,
	nicolinc, shameerali.kolothum.thodi, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 include/system/host_iommu_device.h | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/include/system/host_iommu_device.h b/include/system/host_iommu_device.h
index df782598f2..18f8b5e5cf 100644
--- a/include/system/host_iommu_device.h
+++ b/include/system/host_iommu_device.h
@@ -22,10 +22,16 @@
  *
  * @hw_caps: host platform IOMMU capabilities (e.g. on IOMMUFD this represents
  *           the @out_capabilities value returned from IOMMU_GET_HW_INFO ioctl)
+ *
+ * @nesting: nesting page table support.
+ *
+ * @fs1gp: first stage (a.k.a. Stage-1) 1GB huge page support.
  */
 typedef struct HostIOMMUDeviceCaps {
     uint32_t type;
     uint64_t hw_caps;
+    bool nesting;
+    bool fs1gp;
 } HostIOMMUDeviceCaps;
 
 #define TYPE_HOST_IOMMU_DEVICE "host-iommu-device"
@@ -122,6 +128,8 @@ struct HostIOMMUDeviceClass {
  */
 #define HOST_IOMMU_DEVICE_CAP_IOMMU_TYPE        0
 #define HOST_IOMMU_DEVICE_CAP_AW_BITS           1
+#define HOST_IOMMU_DEVICE_CAP_NESTING           2
+#define HOST_IOMMU_DEVICE_CAP_FS1GP             3
 
 #define HOST_IOMMU_DEVICE_CAP_AW_BITS_MAX       64
 #endif
-- 
2.34.1




* [PATCH rfcv2 07/20] iommufd: Implement query of HOST_IOMMU_DEVICE_CAP_[NESTING|FS1GP]
  2025-02-19  8:22 [PATCH rfcv2 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
                   ` (5 preceding siblings ...)
  2025-02-19  8:22 ` [PATCH rfcv2 06/20] host_iommu_device: Define two new capabilities HOST_IOMMU_DEVICE_CAP_[NESTING|FS1GP] Zhenzhong Duan
@ 2025-02-19  8:22 ` Zhenzhong Duan
  2025-02-20 19:00   ` Eric Auger
  2025-02-19  8:22 ` [PATCH rfcv2 08/20] iommufd: Implement query of HOST_IOMMU_DEVICE_CAP_ERRATA Zhenzhong Duan
                   ` (14 subsequent siblings)
  21 siblings, 1 reply; 68+ messages in thread
From: Zhenzhong Duan @ 2025-02-19  8:22 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, jgg,
	nicolinc, shameerali.kolothum.thodi, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan, Marcel Apfelbaum, Paolo Bonzini,
	Richard Henderson, Eduardo Habkost

Implement query of HOST_IOMMU_DEVICE_CAP_[NESTING|FS1GP] for IOMMUFD
backed host IOMMU device.

Query of these two capabilities is not supported for the legacy backend
because there is no plan to support nesting with a legacy backend backed
host device.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/i386/intel_iommu_internal.h |  1 +
 backends/iommufd.c             |  4 ++++
 hw/vfio/iommufd.c              | 11 +++++++++++
 3 files changed, 16 insertions(+)

diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index e8b211e8b0..2cda744786 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -191,6 +191,7 @@
 #define VTD_ECAP_PT                 (1ULL << 6)
 #define VTD_ECAP_SC                 (1ULL << 7)
 #define VTD_ECAP_MHMV               (15ULL << 20)
+#define VTD_ECAP_NEST               (1ULL << 26)
 #define VTD_ECAP_SRS                (1ULL << 31)
 #define VTD_ECAP_PASID              (1ULL << 40)
 #define VTD_ECAP_SMTS               (1ULL << 43)
diff --git a/backends/iommufd.c b/backends/iommufd.c
index 574f330c27..0a1a40cbba 100644
--- a/backends/iommufd.c
+++ b/backends/iommufd.c
@@ -370,6 +370,10 @@ static int hiod_iommufd_get_cap(HostIOMMUDevice *hiod, int cap, Error **errp)
         return caps->type;
     case HOST_IOMMU_DEVICE_CAP_AW_BITS:
         return vfio_device_get_aw_bits(hiod->agent);
+    case HOST_IOMMU_DEVICE_CAP_NESTING:
+        return caps->nesting;
+    case HOST_IOMMU_DEVICE_CAP_FS1GP:
+        return caps->fs1gp;
     default:
         error_setg(errp, "%s: unsupported capability %x", hiod->name, cap);
         return -EINVAL;
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index 175c4fe1f4..df6a12d200 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -26,6 +26,7 @@
 #include "qemu/chardev_open.h"
 #include "pci.h"
 #include "exec/ram_addr.h"
+#include "hw/i386/intel_iommu_internal.h"
 
 static int iommufd_cdev_map(const VFIOContainerBase *bcontainer, hwaddr iova,
                             ram_addr_t size, void *vaddr, bool readonly)
@@ -843,6 +844,16 @@ static bool hiod_iommufd_vfio_realize(HostIOMMUDevice *hiod, void *opaque,
     caps->type = type;
     caps->hw_caps = hw_caps;
 
+    switch (type) {
+    case IOMMU_HW_INFO_TYPE_INTEL_VTD:
+        caps->nesting = !!(data.vtd.ecap_reg & VTD_ECAP_NEST);
+        caps->fs1gp = !!(data.vtd.cap_reg & VTD_CAP_FS1GP);
+        break;
+    case IOMMU_HW_INFO_TYPE_ARM_SMMUV3:
+    case IOMMU_HW_INFO_TYPE_NONE:
+        break;
+    }
+
     return true;
 }
 
-- 
2.34.1




* [PATCH rfcv2 08/20] iommufd: Implement query of HOST_IOMMU_DEVICE_CAP_ERRATA
  2025-02-19  8:22 [PATCH rfcv2 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
                   ` (6 preceding siblings ...)
  2025-02-19  8:22 ` [PATCH rfcv2 07/20] iommufd: Implement query of HOST_IOMMU_DEVICE_CAP_[NESTING|FS1GP] Zhenzhong Duan
@ 2025-02-19  8:22 ` Zhenzhong Duan
  2025-02-20 18:55   ` Eric Auger
  2025-02-19  8:22 ` [PATCH rfcv2 09/20] intel_iommu: Rename vtd_ce_get_rid2pasid_entry to vtd_ce_get_pasid_entry Zhenzhong Duan
                   ` (13 subsequent siblings)
  21 siblings, 1 reply; 68+ messages in thread
From: Zhenzhong Duan @ 2025-02-19  8:22 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, jgg,
	nicolinc, shameerali.kolothum.thodi, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan

Implement query of HOST_IOMMU_DEVICE_CAP_ERRATA for IOMMUFD
backed host IOMMU device.

Query of this capability is not supported for the legacy backend
because there is no plan to support nesting with a legacy backend
backed host device.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 include/system/host_iommu_device.h | 2 ++
 backends/iommufd.c                 | 2 ++
 hw/vfio/iommufd.c                  | 1 +
 3 files changed, 5 insertions(+)

diff --git a/include/system/host_iommu_device.h b/include/system/host_iommu_device.h
index 18f8b5e5cf..250600fc1d 100644
--- a/include/system/host_iommu_device.h
+++ b/include/system/host_iommu_device.h
@@ -32,6 +32,7 @@ typedef struct HostIOMMUDeviceCaps {
     uint64_t hw_caps;
     bool nesting;
     bool fs1gp;
+    uint32_t errata;
 } HostIOMMUDeviceCaps;
 
 #define TYPE_HOST_IOMMU_DEVICE "host-iommu-device"
@@ -130,6 +131,7 @@ struct HostIOMMUDeviceClass {
 #define HOST_IOMMU_DEVICE_CAP_AW_BITS           1
 #define HOST_IOMMU_DEVICE_CAP_NESTING           2
 #define HOST_IOMMU_DEVICE_CAP_FS1GP             3
+#define HOST_IOMMU_DEVICE_CAP_ERRATA            4
 
 #define HOST_IOMMU_DEVICE_CAP_AW_BITS_MAX       64
 #endif
diff --git a/backends/iommufd.c b/backends/iommufd.c
index 0a1a40cbba..3c23caef96 100644
--- a/backends/iommufd.c
+++ b/backends/iommufd.c
@@ -374,6 +374,8 @@ static int hiod_iommufd_get_cap(HostIOMMUDevice *hiod, int cap, Error **errp)
         return caps->nesting;
     case HOST_IOMMU_DEVICE_CAP_FS1GP:
         return caps->fs1gp;
+    case HOST_IOMMU_DEVICE_CAP_ERRATA:
+        return caps->errata;
     default:
         error_setg(errp, "%s: unsupported capability %x", hiod->name, cap);
         return -EINVAL;
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index df6a12d200..58bff030e1 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -848,6 +848,7 @@ static bool hiod_iommufd_vfio_realize(HostIOMMUDevice *hiod, void *opaque,
     case IOMMU_HW_INFO_TYPE_INTEL_VTD:
         caps->nesting = !!(data.vtd.ecap_reg & VTD_ECAP_NEST);
         caps->fs1gp = !!(data.vtd.cap_reg & VTD_CAP_FS1GP);
+        caps->errata = data.vtd.flags & IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17;
         break;
     case IOMMU_HW_INFO_TYPE_ARM_SMMUV3:
     case IOMMU_HW_INFO_TYPE_NONE:
-- 
2.34.1




* [PATCH rfcv2 09/20] intel_iommu: Rename vtd_ce_get_rid2pasid_entry to vtd_ce_get_pasid_entry
  2025-02-19  8:22 [PATCH rfcv2 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
                   ` (7 preceding siblings ...)
  2025-02-19  8:22 ` [PATCH rfcv2 08/20] iommufd: Implement query of HOST_IOMMU_DEVICE_CAP_ERRATA Zhenzhong Duan
@ 2025-02-19  8:22 ` Zhenzhong Duan
  2025-02-21  6:39   ` CLEMENT MATHIEU--DRIF
  2025-02-21 10:11   ` Eric Auger
  2025-02-19  8:22 ` [PATCH rfcv2 10/20] intel_iommu: Optimize context entry cache utilization Zhenzhong Duan
                   ` (12 subsequent siblings)
  21 siblings, 2 replies; 68+ messages in thread
From: Zhenzhong Duan @ 2025-02-19  8:22 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, jgg,
	nicolinc, shameerali.kolothum.thodi, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan, Marcel Apfelbaum, Paolo Bonzini,
	Richard Henderson, Eduardo Habkost

Originally, vtd_ce_get_rid2pasid_entry() was used to get the pasid entry of
rid2pasid only; it was later extended to handle any pasid. The new name
vtd_ce_get_pasid_entry better matches what the function does.

No functional change intended.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/i386/intel_iommu.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 7fde0603bf..df5fb30bc8 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -944,7 +944,7 @@ static int vtd_get_pe_from_pasid_table(IntelIOMMUState *s,
     return 0;
 }
 
-static int vtd_ce_get_rid2pasid_entry(IntelIOMMUState *s,
+static int vtd_ce_get_pasid_entry(IntelIOMMUState *s,
                                       VTDContextEntry *ce,
                                       VTDPASIDEntry *pe,
                                       uint32_t pasid)
@@ -1025,7 +1025,7 @@ static uint32_t vtd_get_iova_level(IntelIOMMUState *s,
     VTDPASIDEntry pe;
 
     if (s->root_scalable) {
-        vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
+        vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
         if (s->flts) {
             return VTD_PE_GET_FL_LEVEL(&pe);
         } else {
@@ -1048,7 +1048,7 @@ static uint32_t vtd_get_iova_agaw(IntelIOMMUState *s,
     VTDPASIDEntry pe;
 
     if (s->root_scalable) {
-        vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
+        vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
         return 30 + ((pe.val[0] >> 2) & VTD_SM_PASID_ENTRY_AW) * 9;
     }
 
@@ -1116,7 +1116,7 @@ static dma_addr_t vtd_get_iova_pgtbl_base(IntelIOMMUState *s,
     VTDPASIDEntry pe;
 
     if (s->root_scalable) {
-        vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
+        vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
         if (s->flts) {
             return pe.val[2] & VTD_SM_PASID_ENTRY_FLPTPTR;
         } else {
@@ -1522,7 +1522,7 @@ static int vtd_ce_rid2pasid_check(IntelIOMMUState *s,
      * has valid rid2pasid setting, which includes valid
      * rid2pasid field and corresponding pasid entry setting
      */
-    return vtd_ce_get_rid2pasid_entry(s, ce, &pe, PCI_NO_PASID);
+    return vtd_ce_get_pasid_entry(s, ce, &pe, PCI_NO_PASID);
 }
 
 /* Map a device to its corresponding domain (context-entry) */
@@ -1611,7 +1611,7 @@ static uint16_t vtd_get_domain_id(IntelIOMMUState *s,
     VTDPASIDEntry pe;
 
     if (s->root_scalable) {
-        vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
+        vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
         return VTD_SM_PASID_ENTRY_DID(pe.val[1]);
     }
 
@@ -1687,7 +1687,7 @@ static bool vtd_dev_pt_enabled(IntelIOMMUState *s, VTDContextEntry *ce,
     int ret;
 
     if (s->root_scalable) {
-        ret = vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
+        ret = vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
         if (ret) {
             /*
              * This error is guest triggerable. We should assumt PT
-- 
2.34.1




* [PATCH rfcv2 10/20] intel_iommu: Optimize context entry cache utilization
  2025-02-19  8:22 [PATCH rfcv2 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
                   ` (8 preceding siblings ...)
  2025-02-19  8:22 ` [PATCH rfcv2 09/20] intel_iommu: Rename vtd_ce_get_rid2pasid_entry to vtd_ce_get_pasid_entry Zhenzhong Duan
@ 2025-02-19  8:22 ` Zhenzhong Duan
  2025-02-21 10:00   ` Eric Auger
  2025-02-19  8:22 ` [PATCH rfcv2 11/20] intel_iommu: Check for compatibility with IOMMUFD backed device when x-flts=on Zhenzhong Duan
                   ` (11 subsequent siblings)
  21 siblings, 1 reply; 68+ messages in thread
From: Zhenzhong Duan @ 2025-02-19  8:22 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, jgg,
	nicolinc, shameerali.kolothum.thodi, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan, Paolo Bonzini, Richard Henderson, Eduardo Habkost,
	Marcel Apfelbaum

There are many call sites referencing the context entry by calling
vtd_dev_to_context_entry(), which traverses the DMAR table.

In most cases we can use the cached context entry in
vtd_as->context_cache_entry unless it is stale. Currently only global and
domain context invalidation makes it stale.

So introduce a helper function vtd_as_to_context_entry() to fetch from the
cache before falling back to vtd_dev_to_context_entry().

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/i386/intel_iommu.c | 36 +++++++++++++++++++++++-------------
 1 file changed, 23 insertions(+), 13 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index df5fb30bc8..7709f55be5 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -1597,6 +1597,22 @@ static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
     return 0;
 }
 
+static int vtd_as_to_context_entry(VTDAddressSpace *vtd_as, VTDContextEntry *ce)
+{
+    IntelIOMMUState *s = vtd_as->iommu_state;
+    uint8_t bus_num = pci_bus_num(vtd_as->bus);
+    uint8_t devfn = vtd_as->devfn;
+    VTDContextCacheEntry *cc_entry = &vtd_as->context_cache_entry;
+
+    /* Try to fetch context-entry from cache first */
+    if (cc_entry->context_cache_gen == s->context_cache_gen) {
+        *ce = cc_entry->context_entry;
+        return 0;
+    } else {
+        return vtd_dev_to_context_entry(s, bus_num, devfn, ce);
+    }
+}
+
 static int vtd_sync_shadow_page_hook(const IOMMUTLBEvent *event,
                                      void *private)
 {
@@ -1649,9 +1665,7 @@ static int vtd_address_space_sync(VTDAddressSpace *vtd_as)
         return 0;
     }
 
-    ret = vtd_dev_to_context_entry(vtd_as->iommu_state,
-                                   pci_bus_num(vtd_as->bus),
-                                   vtd_as->devfn, &ce);
+    ret = vtd_as_to_context_entry(vtd_as, &ce);
     if (ret) {
         if (ret == -VTD_FR_CONTEXT_ENTRY_P) {
             /*
@@ -1710,8 +1724,7 @@ static bool vtd_as_pt_enabled(VTDAddressSpace *as)
     assert(as);
 
     s = as->iommu_state;
-    if (vtd_dev_to_context_entry(s, pci_bus_num(as->bus), as->devfn,
-                                 &ce)) {
+    if (vtd_as_to_context_entry(as, &ce)) {
         /*
          * Possibly failed to parse the context entry for some reason
          * (e.g., during init, or any guest configuration errors on
@@ -2443,8 +2456,7 @@ static void vtd_iotlb_domain_invalidate(IntelIOMMUState *s, uint16_t domain_id)
     vtd_iommu_unlock(s);
 
     QLIST_FOREACH(vtd_as, &s->vtd_as_with_notifiers, next) {
-        if (!vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
-                                      vtd_as->devfn, &ce) &&
+        if (!vtd_as_to_context_entry(vtd_as, &ce) &&
             domain_id == vtd_get_domain_id(s, &ce, vtd_as->pasid)) {
             vtd_address_space_sync(vtd_as);
         }
@@ -2466,8 +2478,7 @@ static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
     hwaddr size = (1 << am) * VTD_PAGE_SIZE;
 
     QLIST_FOREACH(vtd_as, &(s->vtd_as_with_notifiers), next) {
-        ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
-                                       vtd_as->devfn, &ce);
+        ret = vtd_as_to_context_entry(vtd_as, &ce);
         if (!ret && domain_id == vtd_get_domain_id(s, &ce, vtd_as->pasid)) {
             uint32_t rid2pasid = PCI_NO_PASID;
 
@@ -2974,8 +2985,7 @@ static void vtd_piotlb_pasid_invalidate(IntelIOMMUState *s,
     vtd_iommu_unlock(s);
 
     QLIST_FOREACH(vtd_as, &s->vtd_as_with_notifiers, next) {
-        if (!vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
-                                      vtd_as->devfn, &ce) &&
+        if (!vtd_as_to_context_entry(vtd_as, &ce) &&
             domain_id == vtd_get_domain_id(s, &ce, vtd_as->pasid)) {
             uint32_t rid2pasid = VTD_CE_GET_RID2PASID(&ce);
 
@@ -4154,7 +4164,7 @@ static void vtd_report_ir_illegal_access(VTDAddressSpace *vtd_as,
     assert(vtd_as->pasid != PCI_NO_PASID);
 
     /* Try out best to fetch FPD, we can't do anything more */
-    if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {
+    if (vtd_as_to_context_entry(vtd_as, &ce) == 0) {
         is_fpd_set = ce.lo & VTD_CONTEXT_ENTRY_FPD;
         if (!is_fpd_set && s->root_scalable) {
             vtd_ce_get_pasid_fpd(s, &ce, &is_fpd_set, vtd_as->pasid);
@@ -4491,7 +4501,7 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
     /* replay is protected by BQL, page walk will re-setup it safely */
     iova_tree_remove(vtd_as->iova_tree, map);
 
-    if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {
+    if (vtd_as_to_context_entry(vtd_as, &ce) == 0) {
         trace_vtd_replay_ce_valid(s->root_scalable ? "scalable mode" :
                                   "legacy mode",
                                   bus_n, PCI_SLOT(vtd_as->devfn),
-- 
2.34.1




* [PATCH rfcv2 11/20] intel_iommu: Check for compatibility with IOMMUFD backed device when x-flts=on
  2025-02-19  8:22 [PATCH rfcv2 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
                   ` (9 preceding siblings ...)
  2025-02-19  8:22 ` [PATCH rfcv2 10/20] intel_iommu: Optimize context entry cache utilization Zhenzhong Duan
@ 2025-02-19  8:22 ` Zhenzhong Duan
  2025-02-21 12:49   ` Eric Auger
  2025-02-19  8:22 ` [PATCH rfcv2 12/20] intel_iommu: Introduce a new structure VTDHostIOMMUDevice Zhenzhong Duan
                   ` (10 subsequent siblings)
  21 siblings, 1 reply; 68+ messages in thread
From: Zhenzhong Duan @ 2025-02-19  8:22 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, jgg,
	nicolinc, shameerali.kolothum.thodi, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan, Paolo Bonzini, Richard Henderson, Eduardo Habkost,
	Marcel Apfelbaum

When the vIOMMU is configured with x-flts=on in scalable mode, the stage-1
page table is passed to the host to construct a nested page table. We need
to check compatibility of some critical IOMMU capabilities between the
vIOMMU and the host IOMMU to ensure the guest stage-1 page table can be
used by the host.

For instance, if the vIOMMU supports stage-1 1GB huge page mapping but the
host does not, then this IOMMUFD backed device should fail to attach.

Declare an enum type host_iommu_device_iommu_hw_info_type aliased to
iommu_hw_info_type, which comes from the iommufd header file. This avoids
a build failure on Windows, which doesn't support iommufd.

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 include/system/host_iommu_device.h | 13 ++++++++++++
 hw/i386/intel_iommu.c              | 34 ++++++++++++++++++++++++++++++
 2 files changed, 47 insertions(+)

diff --git a/include/system/host_iommu_device.h b/include/system/host_iommu_device.h
index 250600fc1d..aa3885d7ee 100644
--- a/include/system/host_iommu_device.h
+++ b/include/system/host_iommu_device.h
@@ -133,5 +133,18 @@ struct HostIOMMUDeviceClass {
 #define HOST_IOMMU_DEVICE_CAP_FS1GP             3
 #define HOST_IOMMU_DEVICE_CAP_ERRATA            4
 
+/**
+ * enum host_iommu_device_iommu_hw_info_type - IOMMU Hardware Info Types
+ * @HOST_IOMMU_DEVICE_IOMMU_HW_INFO_TYPE_NONE: Used by the drivers that do not
+ *                                             report hardware info
+ * @HOST_IOMMU_DEVICE_IOMMU_HW_INFO_TYPE_INTEL_VTD: Intel VT-d iommu info type
+ *
+ * This is alias to enum iommu_hw_info_type but for general purpose.
+ */
+enum host_iommu_device_iommu_hw_info_type {
+    HOST_IOMMU_DEVICE_IOMMU_HW_INFO_TYPE_NONE,
+    HOST_IOMMU_DEVICE_IOMMU_HW_INFO_TYPE_INTEL_VTD,
+};
+
 #define HOST_IOMMU_DEVICE_CAP_AW_BITS_MAX       64
 #endif
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 7709f55be5..9de60e607d 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -39,6 +39,7 @@
 #include "kvm/kvm_i386.h"
 #include "migration/vmstate.h"
 #include "trace.h"
+#include "system/iommufd.h"
 
 /* context entry operations */
 #define VTD_CE_GET_RID2PASID(ce) \
@@ -4346,6 +4347,39 @@ static bool vtd_check_hiod(IntelIOMMUState *s, HostIOMMUDevice *hiod,
         return true;
     }
 
+    /* Remaining checks are all stage-1 translation specific */
+    if (!object_dynamic_cast(OBJECT(hiod), TYPE_HOST_IOMMU_DEVICE_IOMMUFD)) {
+        error_setg(errp, "Need IOMMUFD backend when x-flts=on");
+        return false;
+    }
+
+    ret = hiodc->get_cap(hiod, HOST_IOMMU_DEVICE_CAP_IOMMU_TYPE, errp);
+    if (ret < 0) {
+        return false;
+    }
+    if (ret != HOST_IOMMU_DEVICE_IOMMU_HW_INFO_TYPE_INTEL_VTD) {
+        error_setg(errp, "Incompatible host platform IOMMU type %d", ret);
+        return false;
+    }
+
+    ret = hiodc->get_cap(hiod, HOST_IOMMU_DEVICE_CAP_NESTING, errp);
+    if (ret < 0) {
+        return false;
+    }
+    if (ret != 1) {
+        error_setg(errp, "Host IOMMU doesn't support nested translation");
+        return false;
+    }
+
+    ret = hiodc->get_cap(hiod, HOST_IOMMU_DEVICE_CAP_FS1GP, errp);
+    if (ret < 0) {
+        return false;
+    }
+    if (s->fs1gp && ret != 1) {
+        error_setg(errp, "Stage-1 1GB huge page is unsupported by host IOMMU");
+        return false;
+    }
+
     error_setg(errp, "host device is uncompatible with stage-1 translation");
     return false;
 }
-- 
2.34.1




* [PATCH rfcv2 12/20] intel_iommu: Introduce a new structure VTDHostIOMMUDevice
  2025-02-19  8:22 [PATCH rfcv2 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
                   ` (10 preceding siblings ...)
  2025-02-19  8:22 ` [PATCH rfcv2 11/20] intel_iommu: Check for compatibility with IOMMUFD backed device when x-flts=on Zhenzhong Duan
@ 2025-02-19  8:22 ` Zhenzhong Duan
  2025-02-21 13:03   ` Eric Auger
  2025-02-19  8:22 ` [PATCH rfcv2 13/20] intel_iommu: Add PASID cache management infrastructure Zhenzhong Duan
                   ` (9 subsequent siblings)
  21 siblings, 1 reply; 68+ messages in thread
From: Zhenzhong Duan @ 2025-02-19  8:22 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, jgg,
	nicolinc, shameerali.kolothum.thodi, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan, Marcel Apfelbaum, Paolo Bonzini,
	Richard Henderson, Eduardo Habkost

Introduce a new structure VTDHostIOMMUDevice, which replaces
HostIOMMUDevice as the value stored in the hash table.

It includes references to the HostIOMMUDevice and IntelIOMMUState, and
also the BDF information which will be used in future patches.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/i386/intel_iommu_internal.h |  7 +++++++
 include/hw/i386/intel_iommu.h  |  2 +-
 hw/i386/intel_iommu.c          | 14 ++++++++++++--
 3 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 2cda744786..18bc22fc72 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -28,6 +28,7 @@
 #ifndef HW_I386_INTEL_IOMMU_INTERNAL_H
 #define HW_I386_INTEL_IOMMU_INTERNAL_H
 #include "hw/i386/intel_iommu.h"
+#include "system/host_iommu_device.h"
 
 /*
  * Intel IOMMU register specification
@@ -608,4 +609,10 @@ typedef struct VTDRootEntry VTDRootEntry;
 /* Bits to decide the offset for each level */
 #define VTD_LEVEL_BITS           9
 
+typedef struct VTDHostIOMMUDevice {
+    IntelIOMMUState *iommu_state;
+    PCIBus *bus;
+    uint8_t devfn;
+    HostIOMMUDevice *hiod;
+} VTDHostIOMMUDevice;
 #endif
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index e95477e855..50f9b27a45 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -295,7 +295,7 @@ struct IntelIOMMUState {
     /* list of registered notifiers */
     QLIST_HEAD(, VTDAddressSpace) vtd_as_with_notifiers;
 
-    GHashTable *vtd_host_iommu_dev;             /* HostIOMMUDevice */
+    GHashTable *vtd_host_iommu_dev;             /* VTDHostIOMMUDevice */
 
     /* interrupt remapping */
     bool intr_enabled;              /* Whether guest enabled IR */
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 9de60e607d..fafa199f52 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -281,7 +281,10 @@ static gboolean vtd_hiod_equal(gconstpointer v1, gconstpointer v2)
 
 static void vtd_hiod_destroy(gpointer v)
 {
-    object_unref(v);
+    VTDHostIOMMUDevice *vtd_hiod = v;
+
+    object_unref(vtd_hiod->hiod);
+    g_free(vtd_hiod);
 }
 
 static gboolean vtd_hash_remove_by_domain(gpointer key, gpointer value,
@@ -4388,6 +4391,7 @@ static bool vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
                                      HostIOMMUDevice *hiod, Error **errp)
 {
     IntelIOMMUState *s = opaque;
+    VTDHostIOMMUDevice *vtd_hiod;
     struct vtd_as_key key = {
         .bus = bus,
         .devfn = devfn,
@@ -4404,6 +4408,12 @@ static bool vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
         return false;
     }
 
+    vtd_hiod = g_malloc0(sizeof(VTDHostIOMMUDevice));
+    vtd_hiod->bus = bus;
+    vtd_hiod->devfn = (uint8_t)devfn;
+    vtd_hiod->iommu_state = s;
+    vtd_hiod->hiod = hiod;
+
     if (!vtd_check_hiod(s, hiod, errp)) {
         vtd_iommu_unlock(s);
         return false;
@@ -4414,7 +4424,7 @@ static bool vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
     new_key->devfn = devfn;
 
     object_ref(hiod);
-    g_hash_table_insert(s->vtd_host_iommu_dev, new_key, hiod);
+    g_hash_table_insert(s->vtd_host_iommu_dev, new_key, vtd_hiod);
 
     vtd_iommu_unlock(s);
 
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH rfcv2 13/20] intel_iommu: Add PASID cache management infrastructure
  2025-02-19  8:22 [PATCH rfcv2 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
                   ` (11 preceding siblings ...)
  2025-02-19  8:22 ` [PATCH rfcv2 12/20] intel_iommu: Introduce a new structure VTDHostIOMMUDevice Zhenzhong Duan
@ 2025-02-19  8:22 ` Zhenzhong Duan
  2025-02-21 17:02   ` Eric Auger
  2025-02-19  8:22 ` [PATCH rfcv2 14/20] intel_iommu: Bind/unbind guest page table to host Zhenzhong Duan
                   ` (8 subsequent siblings)
  21 siblings, 1 reply; 68+ messages in thread
From: Zhenzhong Duan @ 2025-02-19  8:22 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, jgg,
	nicolinc, shameerali.kolothum.thodi, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan, Yi Sun, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost, Marcel Apfelbaum

This adds a new entry VTDPASIDCacheEntry in VTDAddressSpace to cache the
pasid entry, track PASID usage and prepare for future PASID-tagged DMA
address translation support in vIOMMU.

The VTDAddressSpace for PCI_NO_PASID is allocated when the device is plugged
and never freed. For other pasids, a VTDAddressSpace instance is
created/destroyed as the guest pasid entry is set up/destroyed for
passthrough devices, while for emulated devices a VTDAddressSpace instance
is created during PASID-tagged DMA translation and destroyed on guest PASID
cache invalidation. This patch focuses on PASID cache management for
passthrough devices, as there are no PASID-capable emulated devices yet.

When the guest modifies a PASID entry, QEMU captures the guest pasid
selective pasid cache invalidation and allocates or removes a VTDAddressSpace
instance according to the invalidation reason:

    *) a present pasid entry moved to non-present
    *) a present pasid entry to be a present entry
    *) a non-present pasid entry moved to present

The vIOMMU emulator figures out the reason by fetching the latest guest pasid
entry and comparing it with the PASID cache, as sketched below.
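
For illustration only (not part of the patch), a condensed sketch of that
comparison as implemented by vtd_flush_pasid() in the diff below (variable
names as in the patch):

    VTDPASIDEntry pe;

    if (vtd_dev_get_pe_from_pasid(vtd_as, pasid, &pe)) {
        /* guest entry is now non-present: drop the cached copy */
        pc_entry->cache_filled = false;
    } else {
        /* guest entry is present (new or updated): refresh the cache */
        vtd_fill_pe_in_cache(s, vtd_as, &pe);
    }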

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/i386/intel_iommu_internal.h |  29 ++
 include/hw/i386/intel_iommu.h  |   6 +
 hw/i386/intel_iommu.c          | 484 ++++++++++++++++++++++++++++++++-
 hw/i386/trace-events           |   4 +
 4 files changed, 513 insertions(+), 10 deletions(-)

diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 18bc22fc72..632fda2853 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -315,6 +315,7 @@ typedef enum VTDFaultReason {
                                   * request while disabled */
     VTD_FR_IR_SID_ERR = 0x26,   /* Invalid Source-ID */
 
+    VTD_FR_RTADDR_INV_TTM = 0x31,  /* Invalid TTM in RTADDR */
     /* PASID directory entry access failure */
     VTD_FR_PASID_DIR_ACCESS_ERR = 0x50,
     /* The Present(P) field of pasid directory entry is 0 */
@@ -492,6 +493,15 @@ typedef union VTDInvDesc VTDInvDesc;
 #define VTD_INV_DESC_PIOTLB_RSVD_VAL0     0xfff000000000f1c0ULL
 #define VTD_INV_DESC_PIOTLB_RSVD_VAL1     0xf80ULL
 
+#define VTD_INV_DESC_PASIDC_G          (3ULL << 4)
+#define VTD_INV_DESC_PASIDC_PASID(val) (((val) >> 32) & 0xfffffULL)
+#define VTD_INV_DESC_PASIDC_DID(val)   (((val) >> 16) & VTD_DOMAIN_ID_MASK)
+#define VTD_INV_DESC_PASIDC_RSVD_VAL0  0xfff000000000f1c0ULL
+
+#define VTD_INV_DESC_PASIDC_DSI        (0ULL << 4)
+#define VTD_INV_DESC_PASIDC_PASID_SI   (1ULL << 4)
+#define VTD_INV_DESC_PASIDC_GLOBAL     (3ULL << 4)
+
 /* Information about page-selective IOTLB invalidate */
 struct VTDIOTLBPageInvInfo {
     uint16_t domain_id;
@@ -548,10 +558,28 @@ typedef struct VTDRootEntry VTDRootEntry;
 #define VTD_CTX_ENTRY_LEGACY_SIZE     16
 #define VTD_CTX_ENTRY_SCALABLE_SIZE   32
 
+#define VTD_SM_CONTEXT_ENTRY_PDTS(val)      (((val) >> 9) & 0x7)
 #define VTD_SM_CONTEXT_ENTRY_RID2PASID_MASK 0xfffff
 #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw)  (0x1e0ULL | ~VTD_HAW_MASK(aw))
 #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1      0xffffffffffe00000ULL
 
+typedef enum VTDPCInvType {
+    /* force reset all */
+    VTD_PASID_CACHE_FORCE_RESET = 0,
+    /* pasid cache invalidation rely on guest PASID entry */
+    VTD_PASID_CACHE_GLOBAL_INV,
+    VTD_PASID_CACHE_DOMSI,
+    VTD_PASID_CACHE_PASIDSI,
+} VTDPCInvType;
+
+typedef struct VTDPASIDCacheInfo {
+    VTDPCInvType type;
+    uint16_t domain_id;
+    uint32_t pasid;
+    PCIBus *bus;
+    uint16_t devfn;
+} VTDPASIDCacheInfo;
+
 /* PASID Table Related Definitions */
 #define VTD_PASID_DIR_BASE_ADDR_MASK  (~0xfffULL)
 #define VTD_PASID_TABLE_BASE_ADDR_MASK (~0xfffULL)
@@ -563,6 +591,7 @@ typedef struct VTDRootEntry VTDRootEntry;
 #define VTD_PASID_TABLE_BITS_MASK     (0x3fULL)
 #define VTD_PASID_TABLE_INDEX(pasid)  ((pasid) & VTD_PASID_TABLE_BITS_MASK)
 #define VTD_PASID_ENTRY_FPD           (1ULL << 1) /* Fault Processing Disable */
+#define VTD_PASID_TBL_ENTRY_NUM       (1ULL << 6)
 
 /* PASID Granular Translation Type Mask */
 #define VTD_PASID_ENTRY_P              1ULL
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 50f9b27a45..fbc9da903a 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -95,6 +95,11 @@ struct VTDPASIDEntry {
     uint64_t val[8];
 };
 
+typedef struct VTDPASIDCacheEntry {
+    struct VTDPASIDEntry pasid_entry;
+    bool cache_filled;
+} VTDPASIDCacheEntry;
+
 struct VTDAddressSpace {
     PCIBus *bus;
     uint8_t devfn;
@@ -107,6 +112,7 @@ struct VTDAddressSpace {
     MemoryRegion iommu_ir_fault; /* Interrupt region for catching fault */
     IntelIOMMUState *iommu_state;
     VTDContextCacheEntry context_cache_entry;
+    VTDPASIDCacheEntry pasid_cache_entry;
     QLIST_ENTRY(VTDAddressSpace) next;
     /* Superset of notifier flags that this address space has */
     IOMMUNotifierFlag notifier_flags;
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index fafa199f52..b8f3b85803 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -86,6 +86,8 @@ struct vtd_iotlb_key {
 static void vtd_address_space_refresh_all(IntelIOMMUState *s);
 static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
 
+static void vtd_pasid_cache_reset(IntelIOMMUState *s);
+
 static void vtd_panic_require_caching_mode(void)
 {
     error_report("We need to set caching-mode=on for intel-iommu to enable "
@@ -390,6 +392,7 @@ static void vtd_reset_caches(IntelIOMMUState *s)
     vtd_iommu_lock(s);
     vtd_reset_iotlb_locked(s);
     vtd_reset_context_cache_locked(s);
+    vtd_pasid_cache_reset(s);
     vtd_iommu_unlock(s);
 }
 
@@ -825,6 +828,16 @@ static inline bool vtd_pe_type_check(IntelIOMMUState *s, VTDPASIDEntry *pe)
     }
 }
 
+static inline uint32_t vtd_sm_ce_get_pdt_entry_num(VTDContextEntry *ce)
+{
+    return 1U << (VTD_SM_CONTEXT_ENTRY_PDTS(ce->val[0]) + 7);
+}
+
+static inline uint16_t vtd_pe_get_domain_id(VTDPASIDEntry *pe)
+{
+    return VTD_SM_PASID_ENTRY_DID((pe)->val[1]);
+}
+
 static inline bool vtd_pdire_present(VTDPASIDDirEntry *pdire)
 {
     return pdire->val & 1;
@@ -1617,6 +1630,54 @@ static int vtd_as_to_context_entry(VTDAddressSpace *vtd_as, VTDContextEntry *ce)
     }
 }
 
+/* Translate to iommu pasid if PCI_NO_PASID */
+static int vtd_as_to_iommu_pasid(VTDAddressSpace *vtd_as, uint32_t *pasid)
+{
+    VTDContextEntry ce;
+    int ret;
+
+    ret = vtd_as_to_context_entry(vtd_as, &ce);
+    if (ret) {
+        return ret;
+    }
+
+    if (vtd_as->pasid == PCI_NO_PASID) {
+        *pasid = VTD_CE_GET_RID2PASID(&ce);
+    } else {
+        *pasid = vtd_as->pasid;
+    }
+
+    return 0;
+}
+
+static gboolean vtd_find_as_by_sid_and_iommu_pasid(gpointer key, gpointer value,
+                                                   gpointer user_data)
+{
+    VTDAddressSpace *vtd_as = (VTDAddressSpace *)value;
+    struct vtd_as_raw_key *target = (struct vtd_as_raw_key *)user_data;
+    uint16_t sid = PCI_BUILD_BDF(pci_bus_num(vtd_as->bus), vtd_as->devfn);
+    uint32_t pasid;
+
+    if (vtd_as_to_iommu_pasid(vtd_as, &pasid)) {
+        return false;
+    }
+
+    return (pasid == target->pasid) && (sid == target->sid);
+}
+
+/* Translate iommu pasid to vtd_as */
+static VTDAddressSpace *vtd_as_from_iommu_pasid(IntelIOMMUState *s,
+                                                uint16_t sid, uint32_t pasid)
+{
+    struct vtd_as_raw_key key = {
+        .sid = sid,
+        .pasid = pasid
+    };
+
+    return g_hash_table_find(s->vtd_address_spaces,
+                             vtd_find_as_by_sid_and_iommu_pasid, &key);
+}
+
 static int vtd_sync_shadow_page_hook(const IOMMUTLBEvent *event,
                                      void *private)
 {
@@ -3062,6 +3123,412 @@ static bool vtd_process_piotlb_desc(IntelIOMMUState *s,
     return true;
 }
 
+static inline int vtd_dev_get_pe_from_pasid(VTDAddressSpace *vtd_as,
+                                            uint32_t pasid, VTDPASIDEntry *pe)
+{
+    IntelIOMMUState *s = vtd_as->iommu_state;
+    VTDContextEntry ce;
+    int ret;
+
+    if (!s->root_scalable) {
+        return -VTD_FR_RTADDR_INV_TTM;
+    }
+
+    ret = vtd_as_to_context_entry(vtd_as, &ce);
+    if (ret) {
+        return ret;
+    }
+
+    return vtd_ce_get_pasid_entry(s, &ce, pe, pasid);
+}
+
+static bool vtd_pasid_entry_compare(VTDPASIDEntry *p1, VTDPASIDEntry *p2)
+{
+    return !memcmp(p1, p2, sizeof(*p1));
+}
+
+/*
+ * This function fills in the pasid entry in &vtd_as. Caller
+ * of this function should hold iommu_lock.
+ */
+static void vtd_fill_pe_in_cache(IntelIOMMUState *s, VTDAddressSpace *vtd_as,
+                                 VTDPASIDEntry *pe)
+{
+    VTDPASIDCacheEntry *pc_entry = &vtd_as->pasid_cache_entry;
+
+    if (vtd_pasid_entry_compare(pe, &pc_entry->pasid_entry)) {
+        /* No need to go further as cached pasid entry is latest */
+        return;
+    }
+
+    pc_entry->pasid_entry = *pe;
+    pc_entry->cache_filled = true;
+    /*
+     * TODO: send pasid bind to host for passthru devices
+     */
+}
+
+/*
+ * This function is used to clear cached pasid entry in vtd_as
+ * instances. Caller of this function should hold iommu_lock.
+ */
+static gboolean vtd_flush_pasid(gpointer key, gpointer value,
+                                gpointer user_data)
+{
+    VTDPASIDCacheInfo *pc_info = user_data;
+    VTDAddressSpace *vtd_as = value;
+    IntelIOMMUState *s = vtd_as->iommu_state;
+    VTDPASIDCacheEntry *pc_entry = &vtd_as->pasid_cache_entry;
+    VTDPASIDEntry pe;
+    uint16_t did;
+    uint32_t pasid;
+    int ret;
+
+    /* Replay only fills the pasid entry cache for passthrough devices */
+    if (!pc_entry->cache_filled) {
+        return false;
+    }
+    did = vtd_pe_get_domain_id(&pc_entry->pasid_entry);
+
+    if (vtd_as_to_iommu_pasid(vtd_as, &pasid)) {
+        goto remove;
+    }
+
+    switch (pc_info->type) {
+    case VTD_PASID_CACHE_FORCE_RESET:
+        goto remove;
+    case VTD_PASID_CACHE_PASIDSI:
+        if (pc_info->pasid != pasid) {
+            return false;
+        }
+        /* Fall through */
+    case VTD_PASID_CACHE_DOMSI:
+        if (pc_info->domain_id != did) {
+            return false;
+        }
+        /* Fall through */
+    case VTD_PASID_CACHE_GLOBAL_INV:
+        break;
+    default:
+        error_report("invalid pc_info->type");
+        abort();
+    }
+
+    /*
+     * A pasid cache invalidation may indicate a present-to-present
+     * pasid entry modification. To cover such a case, the vIOMMU
+     * emulator needs to fetch the latest guest pasid entry, check
+     * the cached pasid entry, then update the pasid cache and send
+     * pasid bind/unbind to the host properly.
+     */
+    ret = vtd_dev_get_pe_from_pasid(vtd_as, pasid, &pe);
+    if (ret) {
+        /*
+         * No valid pasid entry in guest memory. e.g. pasid entry
+         * was modified to be either all-zero or non-present. Either
+         * case means existing pasid cache should be removed.
+         */
+        goto remove;
+    }
+
+    vtd_fill_pe_in_cache(s, vtd_as, &pe);
+    return false;
+
+remove:
+    /*
+     * TODO: send pasid unbind to host for passthru devices
+     */
+    pc_entry->cache_filled = false;
+
+    /*
+     * Don't remove address space of PCI_NO_PASID which is created by PCI
+     * sub-system.
+     */
+    if (vtd_as->pasid == PCI_NO_PASID) {
+        return false;
+    }
+    return true;
+}
+
+/* Caller of this function should hold iommu_lock */
+static void vtd_pasid_cache_reset(IntelIOMMUState *s)
+{
+    VTDPASIDCacheInfo pc_info;
+
+    trace_vtd_pasid_cache_reset();
+
+    pc_info.type = VTD_PASID_CACHE_FORCE_RESET;
+
+    /*
+     * Resetting the pasid cache is a big hammer, so use
+     * g_hash_table_foreach_remove, which will free the
+     * vtd_as instances. As part of this big hammer,
+     * use VTD_PASID_CACHE_FORCE_RESET to ensure all
+     * vtd_as instances are dropped; meanwhile the
+     * change will be passed to the host if
+     * HostIOMMUDeviceIOMMUFD is available.
+     */
+    g_hash_table_foreach_remove(s->vtd_address_spaces,
+                                vtd_flush_pasid, &pc_info);
+}
+
+/* Caller of this function should hold iommu_lock. */
+static void vtd_sm_pasid_table_walk_one(IntelIOMMUState *s,
+                                        dma_addr_t pt_base,
+                                        int start,
+                                        int end,
+                                        VTDPASIDCacheInfo *info)
+{
+    VTDPASIDEntry pe;
+    int pasid = start;
+    int pasid_next;
+
+    while (pasid < end) {
+        pasid_next = pasid + 1;
+
+        if (!vtd_get_pe_in_pasid_leaf_table(s, pasid, pt_base, &pe)
+            && vtd_pe_present(&pe)) {
+            int bus_n = pci_bus_num(info->bus), devfn = info->devfn;
+            uint16_t sid = PCI_BUILD_BDF(bus_n, devfn);
+            VTDAddressSpace *vtd_as;
+
+            vtd_as = vtd_as_from_iommu_pasid(s, sid, pasid);
+            if (!vtd_as) {
+                vtd_as = vtd_find_add_as(s, info->bus, devfn, pasid);
+            }
+
+            if ((info->type == VTD_PASID_CACHE_DOMSI ||
+                 info->type == VTD_PASID_CACHE_PASIDSI) &&
+                !(info->domain_id == vtd_pe_get_domain_id(&pe))) {
+                /*
+                 * VTD_PASID_CACHE_DOMSI and VTD_PASID_CACHE_PASIDSI
+                 * require a domain ID check. If the domain ID check fails,
+                 * go to next pasid.
+                 */
+                pasid = pasid_next;
+                continue;
+            }
+            vtd_fill_pe_in_cache(s, vtd_as, &pe);
+        }
+        pasid = pasid_next;
+    }
+}
+
+/*
+ * Currently, the VT-d scalable mode pasid table is a two-level table;
+ * this function loops over a range of PASIDs in a given pasid table
+ * to identify the pasid configuration in the guest.
+ * Caller of this function should hold iommu_lock.
+ */
+static void vtd_sm_pasid_table_walk(IntelIOMMUState *s,
+                                    dma_addr_t pdt_base,
+                                    int start,
+                                    int end,
+                                    VTDPASIDCacheInfo *info)
+{
+    VTDPASIDDirEntry pdire;
+    int pasid = start;
+    int pasid_next;
+    dma_addr_t pt_base;
+
+    while (pasid < end) {
+        pasid_next = ((end - pasid) > VTD_PASID_TBL_ENTRY_NUM) ?
+                      (pasid + VTD_PASID_TBL_ENTRY_NUM) : end;
+        if (!vtd_get_pdire_from_pdir_table(pdt_base, pasid, &pdire)
+            && vtd_pdire_present(&pdire)) {
+            pt_base = pdire.val & VTD_PASID_TABLE_BASE_ADDR_MASK;
+            vtd_sm_pasid_table_walk_one(s, pt_base, pasid, pasid_next, info);
+        }
+        pasid = pasid_next;
+    }
+}
+
+static void vtd_replay_pasid_bind_for_dev(IntelIOMMUState *s,
+                                          int start, int end,
+                                          VTDPASIDCacheInfo *info)
+{
+    VTDContextEntry ce;
+    VTDAddressSpace *vtd_as;
+
+    vtd_as = vtd_find_add_as(s, info->bus, info->devfn, PCI_NO_PASID);
+
+    if (!vtd_as_to_context_entry(vtd_as, &ce)) {
+        uint32_t max_pasid;
+
+        max_pasid = vtd_sm_ce_get_pdt_entry_num(&ce) * VTD_PASID_TBL_ENTRY_NUM;
+        if (end > max_pasid) {
+            end = max_pasid;
+        }
+        vtd_sm_pasid_table_walk(s,
+                                VTD_CE_GET_PASID_DIR_TABLE(&ce),
+                                start,
+                                end,
+                                info);
+    }
+}
+
+/*
+ * This function replays the guest pasid bindings to the host by
+ * walking the guest PASID table. This ensures the host has the
+ * latest guest pasid bindings. Caller should hold iommu_lock.
+ */
+static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
+                                            VTDPASIDCacheInfo *pc_info)
+{
+    VTDHostIOMMUDevice *vtd_hiod;
+    int start = 0, end = 1; /* only rid2pasid is supported */
+    VTDPASIDCacheInfo walk_info;
+    GHashTableIter as_it;
+
+    switch (pc_info->type) {
+    case VTD_PASID_CACHE_PASIDSI:
+        start = pc_info->pasid;
+        end = pc_info->pasid + 1;
+        /*
+         * PASID selective invalidation is within domain,
+         * thus fall through.
+         */
+    case VTD_PASID_CACHE_DOMSI:
+    case VTD_PASID_CACHE_GLOBAL_INV:
+        /* loop all assigned devices */
+        break;
+    case VTD_PASID_CACHE_FORCE_RESET:
+        /* For force reset, no need to go further replay */
+        return;
+    default:
+        error_report("invalid pc_info->type for replay");
+        abort();
+    }
+
+    /*
+     * In this replay, we only need to care about the devices which
+     * are backed by host IOMMU. For such devices, their vtd_hiod
+     * instances are in the s->vtd_host_iommu_dev. For devices which
+     * are not backed by host IOMMU, it is not necessary to replay
+     * the bindings since their cache could be re-created in the future
+     * DMA address translation.
+     */
+    walk_info = *pc_info;
+    g_hash_table_iter_init(&as_it, s->vtd_host_iommu_dev);
+    while (g_hash_table_iter_next(&as_it, NULL, (void **)&vtd_hiod)) {
+        /* bus|devfn fields are not identical with pc_info */
+        walk_info.bus = vtd_hiod->bus;
+        walk_info.devfn = vtd_hiod->devfn;
+        vtd_replay_pasid_bind_for_dev(s, start, end, &walk_info);
+    }
+}
+
+/*
+ * This function syncs the pasid bindings between guest and host.
+ * It includes updating the pasid cache in vIOMMU and updating the
+ * pasid bindings per guest's latest pasid entry presence.
+ */
+static void vtd_pasid_cache_sync(IntelIOMMUState *s,
+                                 VTDPASIDCacheInfo *pc_info)
+{
+    if (!s->flts || !s->root_scalable || !s->dmar_enabled) {
+        return;
+    }
+
+    /*
+     * Regarding a pasid cache invalidation, e.g. a PSI,
+     * it could be any of the cases below:
+     * a) a present pasid entry moved to non-present
+     * b) a present pasid entry to be a present entry
+     * c) a non-present pasid entry moved to present
+     *
+     * Different invalidation granularity may affect different device
+     * scope and pasid scope. But for each invalidation granularity,
+     * it needs to do two steps to sync host and guest pasid binding.
+     *
+     * Here is the handling of a PSI:
+     * 1) loop all the existing vtd_as instances to update them
+     *    according to the latest guest pasid entry in pasid table.
+     *    this will make sure affected existing vtd_as instances
+     *    cached the latest pasid entries. Also, during the loop, the
+     *    host should be notified if needed. e.g. pasid unbind or pasid
+     *    update. Should be able to cover case a) and case b).
+     *
+     * 2) loop all devices to cover case c)
+     *    - For devices which are backed by HostIOMMUDeviceIOMMUFD instances,
+     *      we loop them and check if guest pasid entry exists. If yes,
+     *      it is case c), we update the pasid cache and also notify
+     *      host.
+     *    - For devices which are not backed by HostIOMMUDeviceIOMMUFD,
+     *      it is not necessary to create pasid cache at this phase since
+     *      it could be created when vIOMMU does DMA address translation.
+     *      This is not yet implemented since there is no emulated
+     *      pasid-capable devices today. If we have such devices in
+     *      future, the pasid cache shall be created there.
+     * Other granularities follow the same steps, just with a different scope.
+     *
+     */
+
+    vtd_iommu_lock(s);
+    /* Step 1: loop all the existing vtd_as instances */
+    g_hash_table_foreach_remove(s->vtd_address_spaces,
+                                vtd_flush_pasid, pc_info);
+
+    /*
+     * Step 2: loop all the existing vtd_hiod instances.
+     * Ideally, we would need to loop all devices to find if there is any
+     * new PASID binding related to the PASID cache invalidation request.
+     * But it is enough to loop the devices which are backed by host
+     * IOMMU. For devices backed by vIOMMU (a.k.a emulated devices),
+     * if new PASID happened on them, their vtd_as instance could
+     * be created during future vIOMMU DMA translation.
+     */
+    vtd_replay_guest_pasid_bindings(s, pc_info);
+    vtd_iommu_unlock(s);
+}
+
+static bool vtd_process_pasid_desc(IntelIOMMUState *s,
+                                   VTDInvDesc *inv_desc)
+{
+    uint16_t domain_id;
+    uint32_t pasid;
+    VTDPASIDCacheInfo pc_info;
+    uint64_t mask[4] = {VTD_INV_DESC_PASIDC_RSVD_VAL0, VTD_INV_DESC_ALL_ONE,
+                        VTD_INV_DESC_ALL_ONE, VTD_INV_DESC_ALL_ONE};
+
+    if (!vtd_inv_desc_reserved_check(s, inv_desc, mask, true,
+                                     __func__, "pasid cache inv")) {
+        return false;
+    }
+
+    domain_id = VTD_INV_DESC_PASIDC_DID(inv_desc->val[0]);
+    pasid = VTD_INV_DESC_PASIDC_PASID(inv_desc->val[0]);
+
+    switch (inv_desc->val[0] & VTD_INV_DESC_PASIDC_G) {
+    case VTD_INV_DESC_PASIDC_DSI:
+        trace_vtd_pasid_cache_dsi(domain_id);
+        pc_info.type = VTD_PASID_CACHE_DOMSI;
+        pc_info.domain_id = domain_id;
+        break;
+
+    case VTD_INV_DESC_PASIDC_PASID_SI:
+        /* PASID selective implies a DID selective */
+        trace_vtd_pasid_cache_psi(domain_id, pasid);
+        pc_info.type = VTD_PASID_CACHE_PASIDSI;
+        pc_info.domain_id = domain_id;
+        pc_info.pasid = pasid;
+        break;
+
+    case VTD_INV_DESC_PASIDC_GLOBAL:
+        trace_vtd_pasid_cache_gsi();
+        pc_info.type = VTD_PASID_CACHE_GLOBAL_INV;
+        break;
+
+    default:
+        error_report_once("invalid-inv-granu-in-pc_inv_desc hi: 0x%" PRIx64
+                  " lo: 0x%" PRIx64, inv_desc->val[1], inv_desc->val[0]);
+        return false;
+    }
+
+    vtd_pasid_cache_sync(s, &pc_info);
+    return true;
+}
+
 static bool vtd_process_inv_iec_desc(IntelIOMMUState *s,
                                      VTDInvDesc *inv_desc)
 {
@@ -3223,6 +3690,13 @@ static bool vtd_process_inv_desc(IntelIOMMUState *s)
         }
         break;
 
+    case VTD_INV_DESC_PC:
+        trace_vtd_inv_desc("pasid-cache", inv_desc.val[1], inv_desc.val[0]);
+        if (!vtd_process_pasid_desc(s, &inv_desc)) {
+            return false;
+        }
+        break;
+
     case VTD_INV_DESC_PIOTLB:
         trace_vtd_inv_desc("p-iotlb", inv_desc.val[1], inv_desc.val[0]);
         if (!vtd_process_piotlb_desc(s, &inv_desc)) {
@@ -3258,16 +3732,6 @@ static bool vtd_process_inv_desc(IntelIOMMUState *s)
         }
         break;
 
-    /*
-     * TODO: the entity of below two cases will be implemented in future series.
-     * To make guest (which integrates scalable mode support patch set in
-     * iommu driver) work, just return true is enough so far.
-     */
-    case VTD_INV_DESC_PC:
-        if (s->scalable_mode) {
-            break;
-        }
-    /* fallthrough */
     default:
         error_report_once("%s: invalid inv desc: hi=%"PRIx64", lo=%"PRIx64
                           " (unknown type)", __func__, inv_desc.hi,
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index 53c02d7ac8..a26b38b52c 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -24,6 +24,10 @@ vtd_inv_qi_head(uint16_t head) "read head %d"
 vtd_inv_qi_tail(uint16_t head) "write tail %d"
 vtd_inv_qi_fetch(void) ""
 vtd_context_cache_reset(void) ""
+vtd_pasid_cache_gsi(void) ""
+vtd_pasid_cache_reset(void) ""
+vtd_pasid_cache_dsi(uint16_t domain) "Domain selective PC invalidation domain 0x%"PRIx16
+vtd_pasid_cache_psi(uint16_t domain, uint32_t pasid) "PASID selective PC invalidation domain 0x%"PRIx16" pasid 0x%"PRIx32
 vtd_re_not_present(uint8_t bus) "Root entry bus %"PRIu8" not present"
 vtd_ce_not_present(uint8_t bus, uint8_t devfn) "Context entry bus %"PRIu8" devfn %"PRIu8" not present"
 vtd_iotlb_page_hit(uint16_t sid, uint64_t addr, uint64_t slpte, uint16_t domain) "IOTLB page hit sid 0x%"PRIx16" iova 0x%"PRIx64" slpte 0x%"PRIx64" domain 0x%"PRIx16
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH rfcv2 14/20] intel_iommu: Bind/unbind guest page table to host
  2025-02-19  8:22 [PATCH rfcv2 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
                   ` (12 preceding siblings ...)
  2025-02-19  8:22 ` [PATCH rfcv2 13/20] intel_iommu: Add PASID cache management infrastructure Zhenzhong Duan
@ 2025-02-19  8:22 ` Zhenzhong Duan
  2025-02-19  8:22 ` [PATCH rfcv2 15/20] intel_iommu: ERRATA_772415 workaround Zhenzhong Duan
                   ` (7 subsequent siblings)
  21 siblings, 0 replies; 68+ messages in thread
From: Zhenzhong Duan @ 2025-02-19  8:22 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, jgg,
	nicolinc, shameerali.kolothum.thodi, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan, Yi Sun, Marcel Apfelbaum, Paolo Bonzini,
	Richard Henderson, Eduardo Habkost

This captures guest PASID table entry modifications and propagates the
changes to the host to attach a hwpt whose type is determined by the
guest PGTT configuration.

When PGTT is Pass-through (100b), the hwpt on the host side is a stage-2
page table (GPA->HPA). When PGTT is First-stage Translation only (001b),
the hwpt on the host side is a nested page table, as sketched below.
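
For illustration only, the corresponding choice in vtd_device_attach_hwpt()
from the diff below:

    if (vtd_pe_pgtt_is_flt(pe)) {
        /* FLT (001b): allocate a nested stage-1 hwpt on top of the S2 hwpt */
        ret = vtd_create_s1_hwpt(idev, s2_hwpt, hwpt, pe, errp);
    } else {
        /* PT (100b): reuse the stage-2 hwpt directly */
        hwpt->hwpt_id = s2_hwpt->hwpt_id;
    }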

The guest page table is configured as a stage-1 page table (gIOVA->GPA)
whose translation result further goes through the host VT-d stage-2
page table (GPA->HPA) under nested translation mode. This is the key to
supporting gIOVA over a stage-1 page table for Intel VT-d in a
virtualization environment.

A stage-2 page table can be shared by different devices if there is no
conflict and the devices link to the same iommufd object, i.e. devices
under the same host IOMMU can share the same stage-2 page table. If there
is a conflict, i.e. one device is in non-cache-coherency mode while the
others are not, it requires a separate stage-2 page table in non-CC mode
(a condensed sketch of this sharing logic follows the diagram).

See below example diagram:

      IntelIOMMUState
             |
             V
    .------------------.    .------------------.
    | VTDIOASContainer |--->| VTDIOASContainer |--->...
    |    (iommufd0)    |    |    (iommufd1)    |
    .------------------.    .------------------.
             |                       |
             |                       .-->...
             V
      .-------------------.    .-------------------.
      |   VTDS2Hwpt(CC)   |--->| VTDS2Hwpt(non-CC) |-->...
      .-------------------.    .-------------------.
          |            |               |
          |            |               |
    .-----------.  .-----------.  .------------.
    | IOMMUFD   |  | IOMMUFD   |  | IOMMUFD    |
    | Device(CC)|  | Device(CC)|  | Device     |
    | (iommufd0)|  | (iommufd0)|  | (non-CC)   |
    |           |  |           |  | (iommufd0) |
    .-----------.  .-----------.  .------------.
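
For illustration only, a condensed sketch of the sharing logic in
vtd_device_attach_iommufd() from the diff below: containers are keyed by
the iommufd object, and within a container each existing stage-2 hwpt is
tried in turn; if none can be attached (e.g. a CC/non-CC mismatch), a new
nesting-parent hwpt is allocated and shared from then on.

    /* 1) find a container bound to the same iommufd object */
    QLIST_FOREACH(container, &s->containers, next) {
        if (container->iommufd != iommufd) {
            continue;
        }
        /* 2) inside it, try existing stage-2 hwpts, else allocate one */
        if (!vtd_device_attach_container(vtd_hiod, container, pasid, pe,
                                         hwpt, &err)) {
            return 0;
        }
        error_free(err);
        err = NULL;
    }
    /* 3) no suitable container: allocate a dedicated ioas and container */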

Co-Authored-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/i386/intel_iommu_internal.h |  11 +
 include/hw/i386/intel_iommu.h  |  24 ++
 hw/i386/intel_iommu.c          | 581 +++++++++++++++++++++++++++++++--
 hw/i386/trace-events           |   8 +
 4 files changed, 604 insertions(+), 20 deletions(-)

diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 632fda2853..23b7e236b0 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -563,6 +563,13 @@ typedef struct VTDRootEntry VTDRootEntry;
 #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw)  (0x1e0ULL | ~VTD_HAW_MASK(aw))
 #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1      0xffffffffffe00000ULL
 
+typedef enum VTDPASIDOp {
+    VTD_PASID_BIND,
+    VTD_PASID_UPDATE,
+    VTD_PASID_UNBIND,
+    VTD_OP_NUM
+} VTDPASIDOp;
+
 typedef enum VTDPCInvType {
     /* force reset all */
     VTD_PASID_CACHE_FORCE_RESET = 0,
@@ -578,6 +585,7 @@ typedef struct VTDPASIDCacheInfo {
     uint32_t pasid;
     PCIBus *bus;
     uint16_t devfn;
+    bool error_happened;
 } VTDPASIDCacheInfo;
 
 /* PASID Table Related Definitions */
@@ -606,6 +614,9 @@ typedef struct VTDPASIDCacheInfo {
 
 #define VTD_SM_PASID_ENTRY_FLPM          3ULL
 #define VTD_SM_PASID_ENTRY_FLPTPTR       (~0xfffULL)
+#define VTD_SM_PASID_ENTRY_SRE_BIT(val)  (!!((val) & 1ULL))
+#define VTD_SM_PASID_ENTRY_WPE_BIT(val)  (!!(((val) >> 4) & 1ULL))
+#define VTD_SM_PASID_ENTRY_EAFE_BIT(val) (!!(((val) >> 7) & 1ULL))
 
 /* First Level Paging Structure */
 /* Masks for First Level Paging Entry */
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index fbc9da903a..594281c1d3 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -100,10 +100,32 @@ typedef struct VTDPASIDCacheEntry {
     bool cache_filled;
 } VTDPASIDCacheEntry;
 
+typedef struct VTDIOASContainer {
+    struct IOMMUFDBackend *iommufd;
+    uint32_t ioas_id;
+    MemoryListener listener;
+    QLIST_HEAD(, VTDS2Hwpt) s2_hwpt_list;
+    QLIST_ENTRY(VTDIOASContainer) next;
+    Error *error;
+} VTDIOASContainer;
+
+typedef struct VTDS2Hwpt {
+    uint32_t users;
+    uint32_t hwpt_id;
+    VTDIOASContainer *container;
+    QLIST_ENTRY(VTDS2Hwpt) next;
+} VTDS2Hwpt;
+
+typedef struct VTDHwpt {
+    uint32_t hwpt_id;
+    VTDS2Hwpt *s2_hwpt;
+} VTDHwpt;
+
 struct VTDAddressSpace {
     PCIBus *bus;
     uint8_t devfn;
     uint32_t pasid;
+    VTDHwpt hwpt;
     AddressSpace as;
     IOMMUMemoryRegion iommu;
     MemoryRegion root;          /* The root container of the device */
@@ -303,6 +325,8 @@ struct IntelIOMMUState {
 
     GHashTable *vtd_host_iommu_dev;             /* VTDHostIOMMUDevice */
 
+    QLIST_HEAD(, VTDIOASContainer) containers;
+
     /* interrupt remapping */
     bool intr_enabled;              /* Whether guest enabled IR */
     dma_addr_t intr_root;           /* Interrupt remapping table pointer */
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index b8f3b85803..e36ac44110 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -20,6 +20,7 @@
  */
 
 #include "qemu/osdep.h"
+#include CONFIG_DEVICES /* CONFIG_IOMMUFD */
 #include "qemu/error-report.h"
 #include "qemu/main-loop.h"
 #include "qapi/error.h"
@@ -40,6 +41,9 @@
 #include "migration/vmstate.h"
 #include "trace.h"
 #include "system/iommufd.h"
+#ifdef CONFIG_IOMMUFD
+#include <linux/iommufd.h>
+#endif
 
 /* context entry operations */
 #define VTD_CE_GET_RID2PASID(ce) \
@@ -838,11 +842,40 @@ static inline uint16_t vtd_pe_get_domain_id(VTDPASIDEntry *pe)
     return VTD_SM_PASID_ENTRY_DID((pe)->val[1]);
 }
 
+static inline dma_addr_t vtd_pe_get_flpt_base(VTDPASIDEntry *pe)
+{
+    return pe->val[2] & VTD_SM_PASID_ENTRY_FLPTPTR;
+}
+
+static inline uint32_t vtd_pe_get_fl_aw(VTDPASIDEntry *pe)
+{
+    return 48 + ((pe->val[2] >> 2) & VTD_SM_PASID_ENTRY_FLPM) * 9;
+}
+
+static inline bool vtd_pe_pgtt_is_pt(VTDPASIDEntry *pe)
+{
+    return (VTD_PE_GET_TYPE(pe) == VTD_SM_PASID_ENTRY_PT);
+}
+
+/* check if pgtt is first stage translation */
+static inline bool vtd_pe_pgtt_is_flt(VTDPASIDEntry *pe)
+{
+    return (VTD_PE_GET_TYPE(pe) == VTD_SM_PASID_ENTRY_FLT);
+}
+
 static inline bool vtd_pdire_present(VTDPASIDDirEntry *pdire)
 {
     return pdire->val & 1;
 }
 
+static inline void pasid_cache_info_set_error(VTDPASIDCacheInfo *pc_info)
+{
+    if (pc_info->error_happened) {
+        return;
+    }
+    pc_info->error_happened = true;
+}
+
 /**
  * Caller of this function should check present bit if wants
  * to use pdir entry for further usage except for fpd bit check.
@@ -1774,7 +1807,7 @@ static bool vtd_dev_pt_enabled(IntelIOMMUState *s, VTDContextEntry *ce,
              */
             return false;
         }
-        return (VTD_PE_GET_TYPE(&pe) == VTD_SM_PASID_ENTRY_PT);
+        return vtd_pe_pgtt_is_pt(&pe);
     }
 
     return (vtd_ce_get_type(ce) == VTD_CONTEXT_TT_PASS_THROUGH);
@@ -2409,6 +2442,497 @@ static void vtd_context_global_invalidate(IntelIOMMUState *s)
     vtd_iommu_replay_all(s);
 }
 
+#ifdef CONFIG_IOMMUFD
+static bool iommufd_listener_skipped_section(MemoryRegionSection *section)
+{
+    return !memory_region_is_ram(section->mr) ||
+           memory_region_is_protected(section->mr) ||
+           /*
+            * Sizing an enabled 64-bit BAR can cause spurious mappings to
+            * addresses in the upper part of the 64-bit address space.  These
+            * are never accessed by the CPU and beyond the address width of
+            * some IOMMU hardware.  TODO: VFIO should tell us the IOMMU width.
+            */
+           section->offset_within_address_space & (1ULL << 63);
+}
+
+static void iommufd_listener_region_add_s2domain(MemoryListener *listener,
+                                                 MemoryRegionSection *section)
+{
+    VTDIOASContainer *container = container_of(listener,
+                                               VTDIOASContainer, listener);
+    IOMMUFDBackend *iommufd = container->iommufd;
+    uint32_t ioas_id = container->ioas_id;
+    hwaddr iova;
+    Int128 llend, llsize;
+    void *vaddr;
+    Error *err = NULL;
+    int ret;
+
+    if (iommufd_listener_skipped_section(section)) {
+        return;
+    }
+    iova = REAL_HOST_PAGE_ALIGN(section->offset_within_address_space);
+    llend = int128_make64(section->offset_within_address_space);
+    llend = int128_add(llend, section->size);
+    llend = int128_and(llend, int128_exts64(qemu_real_host_page_mask()));
+    llsize = int128_sub(llend, int128_make64(iova));
+    vaddr = memory_region_get_ram_ptr(section->mr) +
+            section->offset_within_region +
+            (iova - section->offset_within_address_space);
+
+    memory_region_ref(section->mr);
+
+    ret = iommufd_backend_map_dma(iommufd, ioas_id, iova, int128_get64(llsize),
+                                  vaddr, section->readonly);
+    if (!ret) {
+        return;
+    }
+
+    error_setg(&err,
+               "iommufd_listener_region_add_s2domain(%p, 0x%"HWADDR_PRIx", "
+               "0x%"HWADDR_PRIx", %p) = %d (%s)",
+               container, iova, int128_get64(llsize), vaddr, ret,
+               strerror(-ret));
+
+    if (memory_region_is_ram_device(section->mr)) {
+        /* Allow unexpected mappings not to be fatal for RAM devices */
+        error_report_err(err);
+        return;
+    }
+
+    if (!container->error) {
+        error_propagate_prepend(&container->error, err, "Region %s: ",
+                                memory_region_name(section->mr));
+    } else {
+        error_free(err);
+    }
+}
+
+static void iommufd_listener_region_del_s2domain(MemoryListener *listener,
+                                                 MemoryRegionSection *section)
+{
+    VTDIOASContainer *container = container_of(listener,
+                                               VTDIOASContainer, listener);
+    IOMMUFDBackend *iommufd = container->iommufd;
+    uint32_t ioas_id = container->ioas_id;
+    hwaddr iova;
+    Int128 llend, llsize;
+    int ret;
+
+    if (iommufd_listener_skipped_section(section)) {
+        return;
+    }
+    iova = REAL_HOST_PAGE_ALIGN(section->offset_within_address_space);
+    llend = int128_make64(section->offset_within_address_space);
+    llend = int128_add(llend, section->size);
+    llend = int128_and(llend, int128_exts64(qemu_real_host_page_mask()));
+    llsize = int128_sub(llend, int128_make64(iova));
+
+    ret = iommufd_backend_unmap_dma(iommufd, ioas_id,
+                                    iova, int128_get64(llsize));
+    if (ret) {
+        error_report("iommufd_listener_region_del_s2domain(%p, "
+                     "0x%"HWADDR_PRIx", 0x%"HWADDR_PRIx") = %d (%s)",
+                     container, iova, int128_get64(llsize), ret,
+                     strerror(-ret));
+    }
+
+    memory_region_unref(section->mr);
+}
+
+static const MemoryListener iommufd_s2domain_memory_listener = {
+    .name = "iommufd_s2domain",
+    .priority = 1000,
+    .region_add = iommufd_listener_region_add_s2domain,
+    .region_del = iommufd_listener_region_del_s2domain,
+};
+
+static void vtd_init_s1_hwpt_data(struct iommu_hwpt_vtd_s1 *vtd,
+                                  VTDPASIDEntry *pe)
+{
+    memset(vtd, 0, sizeof(*vtd));
+
+    vtd->flags =  (VTD_SM_PASID_ENTRY_SRE_BIT(pe->val[2]) ?
+                                        IOMMU_VTD_S1_SRE : 0) |
+                  (VTD_SM_PASID_ENTRY_WPE_BIT(pe->val[2]) ?
+                                        IOMMU_VTD_S1_WPE : 0) |
+                  (VTD_SM_PASID_ENTRY_EAFE_BIT(pe->val[2]) ?
+                                        IOMMU_VTD_S1_EAFE : 0);
+    vtd->addr_width = vtd_pe_get_fl_aw(pe);
+    vtd->pgtbl_addr = (uint64_t)vtd_pe_get_flpt_base(pe);
+}
+
+static int vtd_create_s1_hwpt(HostIOMMUDeviceIOMMUFD *idev,
+                              VTDS2Hwpt *s2_hwpt, VTDHwpt *hwpt,
+                              VTDPASIDEntry *pe, Error **errp)
+{
+    struct iommu_hwpt_vtd_s1 vtd;
+    uint32_t hwpt_id, s2_hwpt_id = s2_hwpt->hwpt_id;
+
+    vtd_init_s1_hwpt_data(&vtd, pe);
+
+    if (!iommufd_backend_alloc_hwpt(idev->iommufd, idev->devid,
+                                    s2_hwpt_id, 0, IOMMU_HWPT_DATA_VTD_S1,
+                                    sizeof(vtd), &vtd, &hwpt_id, errp)) {
+        return -EINVAL;
+    }
+
+    hwpt->hwpt_id = hwpt_id;
+
+    return 0;
+}
+
+static void vtd_destroy_s1_hwpt(HostIOMMUDeviceIOMMUFD *idev, VTDHwpt *hwpt)
+{
+    iommufd_backend_free_id(idev->iommufd, hwpt->hwpt_id);
+}
+
+static VTDS2Hwpt *vtd_ioas_container_get_s2_hwpt(VTDIOASContainer *container,
+                                                 uint32_t hwpt_id)
+{
+    VTDS2Hwpt *s2_hwpt;
+
+    QLIST_FOREACH(s2_hwpt, &container->s2_hwpt_list, next) {
+        if (s2_hwpt->hwpt_id == hwpt_id) {
+            return s2_hwpt;
+        }
+    }
+
+    s2_hwpt = g_malloc0(sizeof(*s2_hwpt));
+
+    s2_hwpt->hwpt_id = hwpt_id;
+    s2_hwpt->container = container;
+    QLIST_INSERT_HEAD(&container->s2_hwpt_list, s2_hwpt, next);
+
+    return s2_hwpt;
+}
+
+static void vtd_ioas_container_put_s2_hwpt(VTDS2Hwpt *s2_hwpt)
+{
+    VTDIOASContainer *container = s2_hwpt->container;
+
+    if (s2_hwpt->users) {
+        return;
+    }
+
+    QLIST_REMOVE(s2_hwpt, next);
+    iommufd_backend_free_id(container->iommufd, s2_hwpt->hwpt_id);
+    g_free(s2_hwpt);
+}
+
+static void vtd_ioas_container_destroy(VTDIOASContainer *container)
+{
+    if (!QLIST_EMPTY(&container->s2_hwpt_list)) {
+        return;
+    }
+
+    QLIST_REMOVE(container, next);
+    memory_listener_unregister(&container->listener);
+    iommufd_backend_free_id(container->iommufd, container->ioas_id);
+    g_free(container);
+}
+
+static int vtd_device_attach_hwpt(VTDHostIOMMUDevice *vtd_hiod,
+                                  uint32_t pasid, VTDPASIDEntry *pe,
+                                  VTDS2Hwpt *s2_hwpt, VTDHwpt *hwpt,
+                                  Error **errp)
+{
+    HostIOMMUDeviceIOMMUFD *idev = HOST_IOMMU_DEVICE_IOMMUFD(vtd_hiod->hiod);
+    int ret;
+
+    if (vtd_pe_pgtt_is_flt(pe)) {
+        ret = vtd_create_s1_hwpt(idev, s2_hwpt, hwpt, pe, errp);
+        if (ret) {
+            return ret;
+        }
+    } else {
+        hwpt->hwpt_id = s2_hwpt->hwpt_id;
+    }
+
+    ret = !host_iommu_device_iommufd_attach_hwpt(idev, hwpt->hwpt_id, errp);
+    trace_vtd_device_attach_hwpt(idev->devid, pasid, hwpt->hwpt_id, ret);
+    if (ret) {
+        if (vtd_pe_pgtt_is_flt(pe)) {
+            vtd_destroy_s1_hwpt(idev, hwpt);
+        }
+        error_report("devid %d pasid %d failed to attach hwpt %d",
+                     idev->devid, pasid, hwpt->hwpt_id);
+        hwpt->hwpt_id = 0;
+        return ret;
+    }
+
+    s2_hwpt->users++;
+    hwpt->s2_hwpt = s2_hwpt;
+
+    return 0;
+}
+
+static void vtd_device_detach_hwpt(VTDHostIOMMUDevice *vtd_hiod,
+                                   uint32_t pasid, VTDPASIDEntry *pe,
+                                   VTDHwpt *hwpt, Error **errp)
+{
+    HostIOMMUDeviceIOMMUFD *idev = HOST_IOMMU_DEVICE_IOMMUFD(vtd_hiod->hiod);
+    int ret;
+
+    if (vtd_hiod->iommu_state->dmar_enabled) {
+        ret = !host_iommu_device_iommufd_detach_hwpt(idev, errp);
+        trace_vtd_device_detach_hwpt(idev->devid, pasid, ret);
+    } else {
+        ret = !host_iommu_device_iommufd_attach_hwpt(idev, idev->hwpt_id, errp);
+        trace_vtd_device_reattach_def_hwpt(idev->devid, pasid, idev->hwpt_id,
+                                           ret);
+    }
+
+    if (ret) {
+        error_report("devid %d pasid %d failed to attach hwpt %d",
+                     idev->devid, pasid, hwpt->hwpt_id);
+    }
+
+    if (vtd_pe_pgtt_is_flt(pe)) {
+        vtd_destroy_s1_hwpt(idev, hwpt);
+    }
+
+    hwpt->s2_hwpt->users--;
+    hwpt->s2_hwpt = NULL;
+    hwpt->hwpt_id = 0;
+}
+
+static int vtd_device_attach_container(VTDHostIOMMUDevice *vtd_hiod,
+                                       VTDIOASContainer *container,
+                                       uint32_t pasid, VTDPASIDEntry *pe,
+                                       VTDHwpt *hwpt, Error **errp)
+{
+    HostIOMMUDeviceIOMMUFD *idev = HOST_IOMMU_DEVICE_IOMMUFD(vtd_hiod->hiod);
+    IOMMUFDBackend *iommufd = idev->iommufd;
+    VTDS2Hwpt *s2_hwpt;
+    uint32_t s2_hwpt_id;
+    Error *err = NULL;
+    int ret;
+
+    /* try to attach to an existing hwpt in this container */
+    QLIST_FOREACH(s2_hwpt, &container->s2_hwpt_list, next) {
+        ret = vtd_device_attach_hwpt(vtd_hiod, pasid, pe, s2_hwpt, hwpt, &err);
+        if (ret) {
+            const char *msg = error_get_pretty(err);
+
+            trace_vtd_device_fail_attach_existing_hwpt(msg);
+            error_free(err);
+            err = NULL;
+        } else {
+            goto found_hwpt;
+        }
+    }
+
+    if (!iommufd_backend_alloc_hwpt(iommufd, idev->devid, container->ioas_id,
+                                    IOMMU_HWPT_ALLOC_NEST_PARENT,
+                                    IOMMU_HWPT_DATA_NONE, 0, NULL,
+                                    &s2_hwpt_id, errp)) {
+        return -EINVAL;
+    }
+
+    s2_hwpt = vtd_ioas_container_get_s2_hwpt(container, s2_hwpt_id);
+
+    /* Attach vtd device to a new allocated hwpt within iommufd */
+    ret = vtd_device_attach_hwpt(vtd_hiod, pasid, pe, s2_hwpt, hwpt, errp);
+    if (ret) {
+        goto err_attach_hwpt;
+    }
+
+found_hwpt:
+    trace_vtd_device_attach_container(iommufd->fd, idev->devid, pasid,
+                                      container->ioas_id, hwpt->hwpt_id);
+    return 0;
+
+err_attach_hwpt:
+    vtd_ioas_container_put_s2_hwpt(s2_hwpt);
+    return ret;
+}
+
+static void vtd_device_detach_container(VTDHostIOMMUDevice *vtd_hiod,
+                                        uint32_t pasid, VTDPASIDEntry *pe,
+                                        VTDHwpt *hwpt, Error **errp)
+{
+    HostIOMMUDeviceIOMMUFD *idev = HOST_IOMMU_DEVICE_IOMMUFD(vtd_hiod->hiod);
+    IOMMUFDBackend *iommufd = idev->iommufd;
+    VTDS2Hwpt *s2_hwpt = hwpt->s2_hwpt;
+
+    trace_vtd_device_detach_container(iommufd->fd, idev->devid, pasid);
+    vtd_device_detach_hwpt(vtd_hiod, pasid, pe, hwpt, errp);
+    vtd_ioas_container_put_s2_hwpt(s2_hwpt);
+}
+
+static int vtd_device_attach_iommufd(VTDHostIOMMUDevice *vtd_hiod,
+                                     uint32_t pasid, VTDPASIDEntry *pe,
+                                     VTDHwpt *hwpt, Error **errp)
+{
+    HostIOMMUDeviceIOMMUFD *idev = HOST_IOMMU_DEVICE_IOMMUFD(vtd_hiod->hiod);
+    IOMMUFDBackend *iommufd = idev->iommufd;
+    IntelIOMMUState *s = vtd_hiod->iommu_state;
+    VTDIOASContainer *container;
+    Error *err = NULL;
+    uint32_t ioas_id;
+    int ret;
+
+    /* try to attach to an existing container in this space */
+    QLIST_FOREACH(container, &s->containers, next) {
+        if (container->iommufd != iommufd) {
+            continue;
+        }
+
+        if (vtd_device_attach_container(vtd_hiod, container, pasid, pe, hwpt,
+                                        &err)) {
+            const char *msg = error_get_pretty(err);
+
+            trace_vtd_device_fail_attach_existing_container(msg);
+            error_free(err);
+            err = NULL;
+        } else {
+            return 0;
+        }
+    }
+
+    /* Need to allocate a new dedicated container */
+    ret = iommufd_backend_alloc_ioas(iommufd, &ioas_id, errp);
+    if (ret < 0) {
+        return ret;
+    }
+
+    trace_vtd_device_alloc_ioas(iommufd->fd, ioas_id);
+
+    container = g_malloc0(sizeof(*container));
+    container->iommufd = iommufd;
+    container->ioas_id = ioas_id;
+    QLIST_INIT(&container->s2_hwpt_list);
+
+    if (vtd_device_attach_container(vtd_hiod, container, pasid, pe, hwpt,
+                                    errp)) {
+        goto err_attach_container;
+    }
+
+    container->listener = iommufd_s2domain_memory_listener;
+    memory_listener_register(&container->listener, &address_space_memory);
+
+    if (container->error) {
+        ret = -1;
+        error_propagate_prepend(errp, container->error,
+                                "memory listener initialization failed: ");
+        goto err_listener_register;
+    }
+
+    QLIST_INSERT_HEAD(&s->containers, container, next);
+
+    return 0;
+
+err_listener_register:
+    vtd_device_detach_container(vtd_hiod, pasid, pe, hwpt, errp);
+err_attach_container:
+    iommufd_backend_free_id(iommufd, container->ioas_id);
+    g_free(container);
+    return ret;
+}
+
+static void vtd_device_detach_iommufd(VTDHostIOMMUDevice *vtd_hiod,
+                                      uint32_t pasid, VTDPASIDEntry *pe,
+                                      VTDHwpt *hwpt, Error **errp)
+{
+    VTDIOASContainer *container = hwpt->s2_hwpt->container;
+
+    vtd_device_detach_container(vtd_hiod, pasid, pe, hwpt, errp);
+    vtd_ioas_container_destroy(container);
+}
+
+static int vtd_device_attach_pgtbl(VTDHostIOMMUDevice *vtd_hiod,
+                                   VTDAddressSpace *vtd_as, VTDPASIDEntry *pe)
+{
+    /*
+     * If pe->pgtt is FLT, go ahead to do the bind as the host only
+     * accepts guest FLT under nesting. If pe->pgtt is PT, set up the
+     * pasid with the GPA page table. Otherwise, return failure.
+     */
+    if (!vtd_pe_pgtt_is_flt(pe) && !vtd_pe_pgtt_is_pt(pe)) {
+        return -EINVAL;
+    }
+
+    /* Should fail if the FLPT base is 0 */
+    if (vtd_pe_pgtt_is_flt(pe) && !vtd_pe_get_flpt_base(pe)) {
+        return -EINVAL;
+    }
+
+    return vtd_device_attach_iommufd(vtd_hiod, vtd_as->pasid, pe,
+                                     &vtd_as->hwpt, &error_abort);
+}
+
+static int vtd_device_detach_pgtbl(VTDHostIOMMUDevice *vtd_hiod,
+                                   VTDAddressSpace *vtd_as)
+{
+    VTDPASIDEntry *cached_pe = vtd_as->pasid_cache_entry.cache_filled ?
+                       &vtd_as->pasid_cache_entry.pasid_entry : NULL;
+
+    if (!cached_pe ||
+        (!vtd_pe_pgtt_is_flt(cached_pe) && !vtd_pe_pgtt_is_pt(cached_pe))) {
+        return 0;
+    }
+
+    vtd_device_detach_iommufd(vtd_hiod, vtd_as->pasid, cached_pe,
+                              &vtd_as->hwpt, &error_abort);
+
+    return 0;
+}
+
+/**
+ * Caller should hold iommu_lock.
+ */
+static int vtd_bind_guest_pasid(VTDAddressSpace *vtd_as,
+                                VTDPASIDEntry *pe, VTDPASIDOp op)
+{
+    IntelIOMMUState *s = vtd_as->iommu_state;
+    VTDHostIOMMUDevice *vtd_hiod;
+    int devfn = vtd_as->devfn;
+    int ret = -EINVAL;
+    struct vtd_as_key key = {
+        .bus = vtd_as->bus,
+        .devfn = devfn,
+    };
+
+    vtd_hiod = g_hash_table_lookup(s->vtd_host_iommu_dev, &key);
+    if (!vtd_hiod || !vtd_hiod->hiod) {
+        /* means no need to go further, e.g. for emulated devices */
+        return 0;
+    }
+
+    if (vtd_as->pasid != PCI_NO_PASID) {
+        error_report("Non-rid_pasid %d not supported yet", vtd_as->pasid);
+        return ret;
+    }
+
+    switch (op) {
+    case VTD_PASID_UPDATE:
+    case VTD_PASID_BIND:
+    {
+        ret = vtd_device_attach_pgtbl(vtd_hiod, vtd_as, pe);
+        break;
+    }
+    case VTD_PASID_UNBIND:
+    {
+        ret = vtd_device_detach_pgtbl(vtd_hiod, vtd_as);
+        break;
+    }
+    default:
+        error_report_once("Unknown VTDPASIDOp");
+        break;
+    }
+
+    return ret;
+}
+#else
+static int vtd_bind_guest_pasid(VTDAddressSpace *vtd_as,
+                                VTDPASIDEntry *pe, VTDPASIDOp op)
+{
+    return 0;
+}
+#endif
+
 /* Do a context-cache device-selective invalidation.
  * @func_mask: FM field after shifting
  */
@@ -3151,21 +3675,27 @@ static bool vtd_pasid_entry_compare(VTDPASIDEntry *p1, VTDPASIDEntry *p2)
  * This function fills in the pasid entry in &vtd_as. Caller
  * of this function should hold iommu_lock.
  */
-static void vtd_fill_pe_in_cache(IntelIOMMUState *s, VTDAddressSpace *vtd_as,
-                                 VTDPASIDEntry *pe)
+static int vtd_fill_pe_in_cache(IntelIOMMUState *s, VTDAddressSpace *vtd_as,
+                                VTDPASIDEntry *pe)
 {
     VTDPASIDCacheEntry *pc_entry = &vtd_as->pasid_cache_entry;
+    int ret;
 
-    if (vtd_pasid_entry_compare(pe, &pc_entry->pasid_entry)) {
-        /* No need to go further as cached pasid entry is latest */
-        return;
+    if (pc_entry->cache_filled) {
+        if (vtd_pasid_entry_compare(pe, &pc_entry->pasid_entry)) {
+            /* No need to go further as cached pasid entry is latest */
+            return 0;
+        }
+        ret = vtd_bind_guest_pasid(vtd_as, pe, VTD_PASID_UPDATE);
+    } else {
+        ret = vtd_bind_guest_pasid(vtd_as, pe, VTD_PASID_BIND);
     }
 
-    pc_entry->pasid_entry = *pe;
-    pc_entry->cache_filled = true;
-    /*
-     * TODO: send pasid bind to host for passthru devices
-     */
+    if (!ret) {
+        pc_entry->pasid_entry = *pe;
+        pc_entry->cache_filled = true;
+    }
+    return ret;
 }
 
 /*
@@ -3231,14 +3761,20 @@ static gboolean vtd_flush_pasid(gpointer key, gpointer value,
         goto remove;
     }
 
-    vtd_fill_pe_in_cache(s, vtd_as, &pe);
+    if (vtd_fill_pe_in_cache(s, vtd_as, &pe)) {
+        pasid_cache_info_set_error(pc_info);
+    }
     return false;
 
 remove:
-    /*
-     * TODO: send pasid unbind to host for passthru devices
-     */
-    pc_entry->cache_filled = false;
+    if (pc_entry->cache_filled) {
+        if (vtd_bind_guest_pasid(vtd_as, NULL, VTD_PASID_UNBIND)) {
+            pasid_cache_info_set_error(pc_info);
+            return false;
+        } else {
+            pc_entry->cache_filled = false;
+        }
+    }
 
     /*
      * Don't remove address space of PCI_NO_PASID which is created by PCI
@@ -3253,7 +3789,7 @@ remove:
 /* Caller of this function should hold iommu_lock */
 static void vtd_pasid_cache_reset(IntelIOMMUState *s)
 {
-    VTDPASIDCacheInfo pc_info;
+    VTDPASIDCacheInfo pc_info = { .error_happened = false, };
 
     trace_vtd_pasid_cache_reset();
 
@@ -3308,7 +3844,9 @@ static void vtd_sm_pasid_table_walk_one(IntelIOMMUState *s,
                 pasid = pasid_next;
                 continue;
             }
-            vtd_fill_pe_in_cache(s, vtd_as, &pe);
+            if (vtd_fill_pe_in_cache(s, vtd_as, &pe)) {
+                pasid_cache_info_set_error(info);
+            }
         }
         pasid = pasid_next;
     }
@@ -3416,6 +3954,9 @@ static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
         walk_info.devfn = vtd_hiod->devfn;
         vtd_replay_pasid_bind_for_dev(s, start, end, &walk_info);
     }
+    if (walk_info.error_happened) {
+        pasid_cache_info_set_error(pc_info);
+    }
 }
 
 /*
@@ -3485,9 +4026,9 @@ static void vtd_pasid_cache_sync(IntelIOMMUState *s,
 static bool vtd_process_pasid_desc(IntelIOMMUState *s,
                                    VTDInvDesc *inv_desc)
 {
+    VTDPASIDCacheInfo pc_info = { .error_happened = false, };
     uint16_t domain_id;
     uint32_t pasid;
-    VTDPASIDCacheInfo pc_info;
     uint64_t mask[4] = {VTD_INV_DESC_PASIDC_RSVD_VAL0, VTD_INV_DESC_ALL_ONE,
                         VTD_INV_DESC_ALL_ONE, VTD_INV_DESC_ALL_ONE};
 
@@ -3526,7 +4067,7 @@ static bool vtd_process_pasid_desc(IntelIOMMUState *s,
     }
 
     vtd_pasid_cache_sync(s, &pc_info);
-    return true;
+    return !pc_info.error_happened;
 }
 
 static bool vtd_process_inv_iec_desc(IntelIOMMUState *s,
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index a26b38b52c..22559eb787 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -72,6 +72,14 @@ vtd_frr_new(int index, uint64_t hi, uint64_t lo) "index %d high 0x%"PRIx64" low
 vtd_warn_invalid_qi_tail(uint16_t tail) "tail 0x%"PRIx16
 vtd_warn_ir_vector(uint16_t sid, int index, int vec, int target) "sid 0x%"PRIx16" index %d vec %d (should be: %d)"
 vtd_warn_ir_trigger(uint16_t sid, int index, int trig, int target) "sid 0x%"PRIx16" index %d trigger %d (should be: %d)"
+vtd_device_attach_hwpt(uint32_t dev_id, uint32_t pasid, uint32_t hwpt_id, int ret) "dev_id %d pasid %d hwpt_id %d, ret: %d"
+vtd_device_detach_hwpt(uint32_t dev_id, uint32_t pasid, int ret) "dev_id %d pasid %d ret: %d"
+vtd_device_reattach_def_hwpt(uint32_t dev_id, uint32_t pasid, uint32_t hwpt_id, int ret) "dev_id %d pasid %d hwpt_id %d, ret: %d"
+vtd_device_fail_attach_existing_hwpt(const char *msg) " %s"
+vtd_device_attach_container(int fd, uint32_t dev_id, uint32_t pasid, uint32_t ioas_id, uint32_t hwpt_id) "iommufd %d dev_id %d pasid %d ioas_id %d hwpt_id %d"
+vtd_device_detach_container(int fd, uint32_t dev_id, uint32_t pasid) "iommufd %d dev_id %d pasid %d"
+vtd_device_fail_attach_existing_container(const char *msg) " %s"
+vtd_device_alloc_ioas(int fd, uint32_t ioas_id) "iommufd %d ioas_id %d"
 
 # amd_iommu.c
 amdvi_evntlog_fail(uint64_t addr, uint32_t head) "error: fail to write at addr 0x%"PRIx64" +  offset 0x%"PRIx32
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH rfcv2 15/20] intel_iommu: ERRATA_772415 workaround
  2025-02-19  8:22 [PATCH rfcv2 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
                   ` (13 preceding siblings ...)
  2025-02-19  8:22 ` [PATCH rfcv2 14/20] intel_iommu: Bind/unbind guest page table to host Zhenzhong Duan
@ 2025-02-19  8:22 ` Zhenzhong Duan
  2025-02-19  8:22 ` [PATCH rfcv2 16/20] intel_iommu: Replay pasid binds after context cache invalidation Zhenzhong Duan
                   ` (6 subsequent siblings)
  21 siblings, 0 replies; 68+ messages in thread
From: Zhenzhong Duan @ 2025-02-19  8:22 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, jgg,
	nicolinc, shameerali.kolothum.thodi, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan, Paolo Bonzini, Richard Henderson, Eduardo Habkost,
	Marcel Apfelbaum

On a system affected by ERRATA_772415, IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17
is reported by IOMMU_DEVICE_GET_HW_INFO. Due to this erratum, even a readonly
range mapped in the stage-2 page table could still be written.

Reference: 4th Gen Intel Xeon Processor Scalable Family Specification
Update, Errata Details, SPR17 [0].

[0] https://edc.intel.com/content/www/us/en/design/products-and-solutions/processors-and-chipsets/eagle-stream/sapphire-rapids-specification-update

We utilize the newly added IOMMUFD container/ioas/hwpt management framework
in VT-d. Add a check so that a separate VTDIOASContainer is created to hold
only the writable (RW) mappings; that VTDIOASContainer is then used as the
backend for devices affected by ERRATA_772415. See the diagram below for
details:

      IntelIOMMUState
             |
             V
    .------------------.    .------------------.    .-------------------.
    | VTDIOASContainer |--->| VTDIOASContainer |--->| VTDIOASContainer  |-->...
    | (iommufd0,RW&RO) |    | (iommufd1,RW&RO) |    | (iommufd0,RW only)|
    .------------------.    .------------------.    .-------------------.
             |                       |                              |
             |                       .-->...                        |
             V                                                      V
      .-------------------.    .-------------------.          .---------------.
      |   VTDS2Hwpt(CC)   |--->| VTDS2Hwpt(non-CC) |-->...    | VTDS2Hwpt(CC) |-->...
      .-------------------.    .-------------------.          .---------------.
          |            |               |                            |
          |            |               |                            |
    .-----------.  .-----------.  .------------.              .------------.
    | IOMMUFD   |  | IOMMUFD   |  | IOMMUFD    |              | IOMMUFD    |
    | Device(CC)|  | Device(CC)|  | Device     |              | Device(CC) |
    | (iommufd0)|  | (iommufd0)|  | (non-CC)   |              | (errata)   |
    |           |  |           |  | (iommufd0) |              | (iommufd0) |
    .-----------.  .-----------.  .------------.              .------------.

Also change vtd_check_hiod() to take a VTDHostIOMMUDevice pointer so the
errata flag can be saved there.
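
In code terms, the effect of the errata flag reduces to the two checks in the
diff below; a simplified sketch (not the literal patch code):

    /*
     * Sketch only: an ERRATA_772415 container skips readonly sections, and a
     * device may only join a container whose errata setting matches its own.
     */
    static bool errata_skip_section(VTDIOASContainer *container,
                                    MemoryRegionSection *section)
    {
        /* existing RAM/protection/address-width checks elided */
        return container->errata && section->readonly;
    }

    static bool errata_container_match(VTDIOASContainer *container,
                                       IOMMUFDBackend *iommufd,
                                       VTDHostIOMMUDevice *vtd_hiod)
    {
        return container->iommufd == iommufd &&
               container->errata == vtd_hiod->errata;
    }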

Suggested-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/i386/intel_iommu_internal.h |  1 +
 include/hw/i386/intel_iommu.h  |  1 +
 hw/i386/intel_iommu.c          | 26 +++++++++++++++++++-------
 3 files changed, 21 insertions(+), 7 deletions(-)

diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 23b7e236b0..8558781af8 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -654,5 +654,6 @@ typedef struct VTDHostIOMMUDevice {
     PCIBus *bus;
     uint8_t devfn;
     HostIOMMUDevice *hiod;
+    uint32_t errata;
 } VTDHostIOMMUDevice;
 #endif
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 594281c1d3..9b156dc32e 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -103,6 +103,7 @@ typedef struct VTDPASIDCacheEntry {
 typedef struct VTDIOASContainer {
     struct IOMMUFDBackend *iommufd;
     uint32_t ioas_id;
+    uint32_t errata;
     MemoryListener listener;
     QLIST_HEAD(, VTDS2Hwpt) s2_hwpt_list;
     QLIST_ENTRY(VTDIOASContainer) next;
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index e36ac44110..dae1716629 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -2443,7 +2443,8 @@ static void vtd_context_global_invalidate(IntelIOMMUState *s)
 }
 
 #ifdef CONFIG_IOMMUFD
-static bool iommufd_listener_skipped_section(MemoryRegionSection *section)
+static bool iommufd_listener_skipped_section(VTDIOASContainer *container,
+                                             MemoryRegionSection *section)
 {
     return !memory_region_is_ram(section->mr) ||
            memory_region_is_protected(section->mr) ||
@@ -2453,7 +2454,8 @@ static bool iommufd_listener_skipped_section(MemoryRegionSection *section)
             * are never accessed by the CPU and beyond the address width of
             * some IOMMU hardware.  TODO: VFIO should tell us the IOMMU width.
             */
-           section->offset_within_address_space & (1ULL << 63);
+           section->offset_within_address_space & (1ULL << 63) ||
+           (container->errata && section->readonly);
 }
 
 static void iommufd_listener_region_add_s2domain(MemoryListener *listener,
@@ -2469,7 +2471,7 @@ static void iommufd_listener_region_add_s2domain(MemoryListener *listener,
     Error *err = NULL;
     int ret;
 
-    if (iommufd_listener_skipped_section(section)) {
+    if (iommufd_listener_skipped_section(container, section)) {
         return;
     }
     iova = REAL_HOST_PAGE_ALIGN(section->offset_within_address_space);
@@ -2520,7 +2522,7 @@ static void iommufd_listener_region_del_s2domain(MemoryListener *listener,
     Int128 llend, llsize;
     int ret;
 
-    if (iommufd_listener_skipped_section(section)) {
+    if (iommufd_listener_skipped_section(container, section)) {
         return;
     }
     iova = REAL_HOST_PAGE_ALIGN(section->offset_within_address_space);
@@ -2776,7 +2778,8 @@ static int vtd_device_attach_iommufd(VTDHostIOMMUDevice *vtd_hiod,
 
     /* try to attach to an existing container in this space */
     QLIST_FOREACH(container, &s->containers, next) {
-        if (container->iommufd != iommufd) {
+        if (container->iommufd != iommufd ||
+            container->errata != vtd_hiod->errata) {
             continue;
         }
 
@@ -2803,6 +2806,7 @@ static int vtd_device_attach_iommufd(VTDHostIOMMUDevice *vtd_hiod,
     container = g_malloc0(sizeof(*container));
     container->iommufd = iommufd;
     container->ioas_id = ioas_id;
+    container->errata = vtd_hiod->errata;
     QLIST_INIT(&container->s2_hwpt_list);
 
     if (vtd_device_attach_container(vtd_hiod, container, pasid, pe, hwpt,
@@ -5329,9 +5333,10 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus,
     return vtd_dev_as;
 }
 
-static bool vtd_check_hiod(IntelIOMMUState *s, HostIOMMUDevice *hiod,
+static bool vtd_check_hiod(IntelIOMMUState *s, VTDHostIOMMUDevice *vtd_hiod,
                            Error **errp)
 {
+    HostIOMMUDevice *hiod = vtd_hiod->hiod;
     HostIOMMUDeviceClass *hiodc = HOST_IOMMU_DEVICE_GET_CLASS(hiod);
     int ret;
 
@@ -5388,6 +5393,12 @@ static bool vtd_check_hiod(IntelIOMMUState *s, HostIOMMUDevice *hiod,
         return false;
     }
 
+    ret = hiodc->get_cap(hiod, HOST_IOMMU_DEVICE_CAP_ERRATA, errp);
+    if (ret < 0) {
+        return false;
+    }
+    vtd_hiod->errata = ret;
+
     error_setg(errp, "host device is uncompatible with stage-1 translation");
     return false;
 }
@@ -5419,7 +5430,8 @@ static bool vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
     vtd_hiod->iommu_state = s;
     vtd_hiod->hiod = hiod;
 
-    if (!vtd_check_hiod(s, hiod, errp)) {
+    if (!vtd_check_hiod(s, vtd_hiod, errp)) {
+        g_free(vtd_hiod);
         vtd_iommu_unlock(s);
         return false;
     }
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH rfcv2 16/20] intel_iommu: Replay pasid binds after context cache invalidation
  2025-02-19  8:22 [PATCH rfcv2 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
                   ` (14 preceding siblings ...)
  2025-02-19  8:22 ` [PATCH rfcv2 15/20] intel_iommu: ERRATA_772415 workaround Zhenzhong Duan
@ 2025-02-19  8:22 ` Zhenzhong Duan
  2025-02-19  8:22 ` [PATCH rfcv2 17/20] intel_iommu: Propagate PASID-based iotlb invalidation to host Zhenzhong Duan
                   ` (5 subsequent siblings)
  21 siblings, 0 replies; 68+ messages in thread
From: Zhenzhong Duan @ 2025-02-19  8:22 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, jgg,
	nicolinc, shameerali.kolothum.thodi, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng, Yi Sun,
	Zhenzhong Duan, Marcel Apfelbaum, Paolo Bonzini,
	Richard Henderson, Eduardo Habkost

From: Yi Liu <yi.l.liu@intel.com>

This replays guest pasid attachments after a context cache invalidation.
This is done for safety: strictly speaking, the guest programmer should
issue a pasid cache invalidation with proper granularity after issuing a
context cache invalidation, but we do not rely on that.
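
The device-selective path added below has roughly the following shape
(simplified sketch using the function names from the diff, not the literal
patch code):

    static void example_context_devsi(IntelIOMMUState *s,
                                      VTDAddressSpace *vtd_as, uint16_t devfn)
    {
        /* existing: resync the shadow mappings/iotlb for this device */
        vtd_address_space_sync(vtd_as);
        /* new: replay the pasid bindings for this device only */
        vtd_pasid_cache_devsi(s, vtd_as->bus, devfn);
    }

The global context cache invalidation path similarly issues a
VTD_PASID_CACHE_GLOBAL_INV sync after replaying the address spaces.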

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/i386/intel_iommu_internal.h |  1 +
 hw/i386/intel_iommu.c          | 51 ++++++++++++++++++++++++++++++++--
 hw/i386/trace-events           |  1 +
 3 files changed, 51 insertions(+), 2 deletions(-)

diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 8558781af8..8f7be7f123 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -575,6 +575,7 @@ typedef enum VTDPCInvType {
     VTD_PASID_CACHE_FORCE_RESET = 0,
     /* pasid cache invalidation rely on guest PASID entry */
     VTD_PASID_CACHE_GLOBAL_INV,
+    VTD_PASID_CACHE_DEVSI,
     VTD_PASID_CACHE_DOMSI,
     VTD_PASID_CACHE_PASIDSI,
 } VTDPCInvType;
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index dae1716629..e7376ba6a7 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -91,6 +91,10 @@ static void vtd_address_space_refresh_all(IntelIOMMUState *s);
 static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
 
 static void vtd_pasid_cache_reset(IntelIOMMUState *s);
+static void vtd_pasid_cache_sync(IntelIOMMUState *s,
+                                 VTDPASIDCacheInfo *pc_info);
+static void vtd_pasid_cache_devsi(IntelIOMMUState *s,
+                                  PCIBus *bus, uint16_t devfn);
 
 static void vtd_panic_require_caching_mode(void)
 {
@@ -2423,6 +2427,8 @@ static void vtd_iommu_replay_all(IntelIOMMUState *s)
 
 static void vtd_context_global_invalidate(IntelIOMMUState *s)
 {
+    VTDPASIDCacheInfo pc_info = { .error_happened = false, };
+
     trace_vtd_inv_desc_cc_global();
     /* Protects context cache */
     vtd_iommu_lock(s);
@@ -2440,6 +2446,9 @@ static void vtd_context_global_invalidate(IntelIOMMUState *s)
      * VT-d emulation codes.
      */
     vtd_iommu_replay_all(s);
+
+    pc_info.type = VTD_PASID_CACHE_GLOBAL_INV;
+    vtd_pasid_cache_sync(s, &pc_info);
 }
 
 #ifdef CONFIG_IOMMUFD
@@ -2995,6 +3004,21 @@ static void vtd_context_device_invalidate(IntelIOMMUState *s,
              * happened.
              */
             vtd_address_space_sync(vtd_as);
+            /*
+             * Per spec, a context flush should also be followed by a PASID
+             * cache and iotlb flush. For a device selective context cache
+             * invalidation:
+             * if (emulated_device)
+             *    invalidate pasid cache and pasid-based iotlb
+             * else if (assigned_device)
+             *    check if the device has been bound to any pasid
+             *    invoke pasid_unbind for each bound pasid
+             * Here, vtd_pasid_cache_devsi() invalidates the pasid caches;
+             * QEMU has no pasid-based iotlb (piotlb) yet, so there is no
+             * handling for it. For an assigned device, the host iommu driver
+             * flushes the piotlb when a pasid unbind is passed down to it.
+             */
+             vtd_pasid_cache_devsi(s, vtd_as->bus, devfn);
         }
     }
 }
@@ -3743,6 +3767,11 @@ static gboolean vtd_flush_pasid(gpointer key, gpointer value,
         /* Fall through */
     case VTD_PASID_CACHE_GLOBAL_INV:
         break;
+    case VTD_PASID_CACHE_DEVSI:
+        if (pc_info->bus != vtd_as->bus || pc_info->devfn != vtd_as->devfn) {
+            return false;
+        }
+        break;
     default:
         error_report("invalid pc_info->type");
         abort();
@@ -3934,6 +3963,11 @@ static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
     case VTD_PASID_CACHE_GLOBAL_INV:
         /* loop all assigned devices */
         break;
+    case VTD_PASID_CACHE_DEVSI:
+        walk_info.bus = pc_info->bus;
+        walk_info.devfn = pc_info->devfn;
+        vtd_replay_pasid_bind_for_dev(s, start, end, &walk_info);
+        return;
     case VTD_PASID_CACHE_FORCE_RESET:
         /* For force reset, no need to go further replay */
         return;
@@ -3968,8 +4002,7 @@ static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
  * It includes updating the pasid cache in vIOMMU and updating the
  * pasid bindings per guest's latest pasid entry presence.
  */
-static void vtd_pasid_cache_sync(IntelIOMMUState *s,
-                                 VTDPASIDCacheInfo *pc_info)
+static void vtd_pasid_cache_sync(IntelIOMMUState *s, VTDPASIDCacheInfo *pc_info)
 {
     if (!s->flts || !s->root_scalable || !s->dmar_enabled) {
         return;
@@ -4027,6 +4060,20 @@ static void vtd_pasid_cache_sync(IntelIOMMUState *s,
     vtd_iommu_unlock(s);
 }
 
+static void vtd_pasid_cache_devsi(IntelIOMMUState *s,
+                                  PCIBus *bus, uint16_t devfn)
+{
+    VTDPASIDCacheInfo pc_info = { .error_happened = false, };
+
+    trace_vtd_pasid_cache_devsi(devfn);
+
+    pc_info.type = VTD_PASID_CACHE_DEVSI;
+    pc_info.bus = bus;
+    pc_info.devfn = devfn;
+
+    vtd_pasid_cache_sync(s, &pc_info);
+}
+
 static bool vtd_process_pasid_desc(IntelIOMMUState *s,
                                    VTDInvDesc *inv_desc)
 {
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index 22559eb787..e520133f18 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -28,6 +28,7 @@ vtd_pasid_cache_gsi(void) ""
 vtd_pasid_cache_reset(void) ""
 vtd_pasid_cache_dsi(uint16_t domain) "Domain slective PC invalidation domain 0x%"PRIx16
 vtd_pasid_cache_psi(uint16_t domain, uint32_t pasid) "PASID slective PC invalidation domain 0x%"PRIx16" pasid 0x%"PRIx32
+vtd_pasid_cache_devsi(uint16_t devfn) "Dev selective PC invalidation dev: 0x%"PRIx16
 vtd_re_not_present(uint8_t bus) "Root entry bus %"PRIu8" not present"
 vtd_ce_not_present(uint8_t bus, uint8_t devfn) "Context entry bus %"PRIu8" devfn %"PRIu8" not present"
 vtd_iotlb_page_hit(uint16_t sid, uint64_t addr, uint64_t slpte, uint16_t domain) "IOTLB page hit sid 0x%"PRIx16" iova 0x%"PRIx64" slpte 0x%"PRIx64" domain 0x%"PRIx16
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH rfcv2 17/20] intel_iommu: Propagate PASID-based iotlb invalidation to host
  2025-02-19  8:22 [PATCH rfcv2 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
                   ` (15 preceding siblings ...)
  2025-02-19  8:22 ` [PATCH rfcv2 16/20] intel_iommu: Replay pasid binds after context cache invalidation Zhenzhong Duan
@ 2025-02-19  8:22 ` Zhenzhong Duan
  2025-02-19  8:22 ` [PATCH rfcv2 18/20] intel_iommu: Refresh pasid bind when either SRTP or TE bit is changed Zhenzhong Duan
                   ` (4 subsequent siblings)
  21 siblings, 0 replies; 68+ messages in thread
From: Zhenzhong Duan @ 2025-02-19  8:22 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, jgg,
	nicolinc, shameerali.kolothum.thodi, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng, Yi Sun,
	Zhenzhong Duan, Paolo Bonzini, Richard Henderson, Eduardo Habkost,
	Marcel Apfelbaum

From: Yi Liu <yi.l.liu@intel.com>

This traps the guest PASID-based iotlb invalidation request and propagates it
to the host.

Intel VT-d 3.0 supports nested translation at PASID granularity. Guest SVA
support could be implemented by configuring nested translation on a specific
PASID. This is also known as dual-stage DMA translation.

Under such a configuration, the guest owns the GVA->GPA translation, which is
configured as the stage-1 page table on the host side for a specific pasid,
while the host owns the GPA->HPA translation. As the guest owns the stage-1
translation table, piotlb invalidations must be propagated to the host, since
the host IOMMU caches first-stage page table mappings during DMA address
translation.
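
The propagation path boils down to one IOMMU_HWPT_INVALIDATE request per
affected device, built from the guest descriptor; a simplified sketch of the
helper added below (not the literal patch code):

    static void example_propagate_piotlb(IOMMUFDBackend *be, uint32_t hwpt_id,
                                         hwaddr addr, uint64_t npages, bool ih)
    {
        struct iommu_hwpt_vtd_s1_invalidate cache = {
            .addr = addr,
            .npages = npages,
            .flags = ih ? IOMMU_VTD_INV_FLAGS_LEAF : 0,
        };
        uint32_t entry_num = 1; /* one invalidation entry for simplicity */

        /* helper introduced in patch 01 of this series */
        if (iommufd_backend_invalidate_cache(be, hwpt_id,
                                             IOMMU_HWPT_INVALIDATE_DATA_VTD_S1,
                                             sizeof(cache), &entry_num,
                                             &cache)) {
            error_report("piotlb invalidation propagation failed");
        }
    }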

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/i386/intel_iommu_internal.h |   6 ++
 hw/i386/intel_iommu.c          | 116 ++++++++++++++++++++++++++++++++-
 2 files changed, 120 insertions(+), 2 deletions(-)

diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 8f7be7f123..630394a8c3 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -589,6 +589,12 @@ typedef struct VTDPASIDCacheInfo {
     bool error_happened;
 } VTDPASIDCacheInfo;
 
+typedef struct VTDPIOTLBInvInfo {
+    uint16_t domain_id;
+    uint32_t pasid;
+    struct iommu_hwpt_vtd_s1_invalidate *inv_data;
+} VTDPIOTLBInvInfo;
+
 /* PASID Table Related Definitions */
 #define VTD_PASID_DIR_BASE_ADDR_MASK  (~0xfffULL)
 #define VTD_PASID_TABLE_BASE_ADDR_MASK (~0xfffULL)
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index e7376ba6a7..8f7fb473f5 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -2938,12 +2938,108 @@ static int vtd_bind_guest_pasid(VTDAddressSpace *vtd_as,
 
     return ret;
 }
+
+/*
+ * Caller of this function should hold iommu_lock.
+ */
+static void vtd_invalidate_piotlb(VTDAddressSpace *vtd_as,
+                                  struct iommu_hwpt_vtd_s1_invalidate *cache)
+{
+    VTDHostIOMMUDevice *vtd_hiod;
+    HostIOMMUDeviceIOMMUFD *idev;
+    VTDHwpt *hwpt = &vtd_as->hwpt;
+    int devfn = vtd_as->devfn;
+    struct vtd_as_key key = {
+        .bus = vtd_as->bus,
+        .devfn = devfn,
+    };
+    IntelIOMMUState *s = vtd_as->iommu_state;
+    uint32_t entry_num = 1; /* Only implement one request for simplicity */
+
+    if (!hwpt) {
+        return;
+    }
+
+    vtd_hiod = g_hash_table_lookup(s->vtd_host_iommu_dev, &key);
+    if (!vtd_hiod || !vtd_hiod->hiod) {
+        return;
+    }
+    idev = HOST_IOMMU_DEVICE_IOMMUFD(vtd_hiod->hiod);
+
+    if (iommufd_backend_invalidate_cache(idev->iommufd, hwpt->hwpt_id,
+                                         IOMMU_HWPT_INVALIDATE_DATA_VTD_S1,
+                                         sizeof(*cache), &entry_num, cache)) {
+        error_report("Cache flush failed, entry_num %d", entry_num);
+    }
+}
+
+/*
+ * This function is a loop function for the s->vtd_address_spaces
+ * list with VTDPIOTLBInvInfo as execution filter. It propagates
+ * the piotlb invalidation to host. Caller of this function
+ * should hold iommu_lock.
+ */
+static void vtd_flush_pasid_iotlb(gpointer key, gpointer value,
+                                  gpointer user_data)
+{
+    VTDPIOTLBInvInfo *piotlb_info = user_data;
+    VTDAddressSpace *vtd_as = value;
+    VTDPASIDCacheEntry *pc_entry = &vtd_as->pasid_cache_entry;
+    uint32_t pasid;
+    uint16_t did;
+
+    /* Replay only fill pasid entry cache for passthrough device */
+    if (!pc_entry->cache_filled ||
+        !vtd_pe_pgtt_is_flt(&pc_entry->pasid_entry)) {
+        return;
+    }
+
+    if (vtd_as_to_iommu_pasid(vtd_as, &pasid)) {
+        return;
+    }
+
+    did = vtd_pe_get_domain_id(&pc_entry->pasid_entry);
+
+    if (piotlb_info->domain_id == did && piotlb_info->pasid == pasid) {
+        vtd_invalidate_piotlb(vtd_as, piotlb_info->inv_data);
+    }
+}
+
+static void vtd_flush_pasid_iotlb_all(IntelIOMMUState *s,
+                                      uint16_t domain_id, uint32_t pasid,
+                                      hwaddr addr, uint64_t npages, bool ih)
+{
+    struct iommu_hwpt_vtd_s1_invalidate cache_info = { 0 };
+    VTDPIOTLBInvInfo piotlb_info;
+
+    cache_info.addr = addr;
+    cache_info.npages = npages;
+    cache_info.flags = ih ? IOMMU_VTD_INV_FLAGS_LEAF : 0;
+
+    piotlb_info.domain_id = domain_id;
+    piotlb_info.pasid = pasid;
+    piotlb_info.inv_data = &cache_info;
+
+    /*
+     * Here loops all the vtd_as instances in s->vtd_address_spaces
+     * to find out the affected devices since piotlb invalidation
+     * should check pasid cache per architecture point of view.
+     */
+    g_hash_table_foreach(s->vtd_address_spaces,
+                         vtd_flush_pasid_iotlb, &piotlb_info);
+}
 #else
 static int vtd_bind_guest_pasid(VTDAddressSpace *vtd_as,
                                 VTDPASIDEntry *pe, VTDPASIDOp op)
 {
     return 0;
 }
+
+static void vtd_flush_pasid_iotlb_all(IntelIOMMUState *s,
+                                      uint16_t domain_id, uint32_t pasid,
+                                      hwaddr addr, uint64_t npages, bool ih)
+{
+}
 #endif
 
 /* Do a context-cache device-selective invalidation.
@@ -3597,6 +3693,13 @@ static void vtd_piotlb_pasid_invalidate(IntelIOMMUState *s,
     info.pasid = pasid;
 
     vtd_iommu_lock(s);
+    /*
+     * Here loops all the vtd_as instances in s->vtd_as
+     * to find out the affected devices since piotlb invalidation
+     * should check pasid cache per architecture point of view.
+     */
+    vtd_flush_pasid_iotlb_all(s, domain_id, pasid, 0, (uint64_t)-1, 0);
+
     g_hash_table_foreach_remove(s->iotlb, vtd_hash_remove_by_pasid,
                                 &info);
     vtd_iommu_unlock(s);
@@ -3619,7 +3722,8 @@ static void vtd_piotlb_pasid_invalidate(IntelIOMMUState *s,
 }
 
 static void vtd_piotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
-                                       uint32_t pasid, hwaddr addr, uint8_t am)
+                                       uint32_t pasid, hwaddr addr, uint8_t am,
+                                       bool ih)
 {
     VTDIOTLBPageInvInfo info;
 
@@ -3629,6 +3733,13 @@ static void vtd_piotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
     info.mask = ~((1 << am) - 1);
 
     vtd_iommu_lock(s);
+    /*
+     * Here loops all the vtd_as instances in s->vtd_as
+     * to find out the affected devices since piotlb invalidation
+     * should check pasid cache per architecture point of view.
+     */
+    vtd_flush_pasid_iotlb_all(s, domain_id, pasid, addr, 1 << am, ih);
+
     g_hash_table_foreach_remove(s->iotlb,
                                 vtd_hash_remove_by_page_piotlb, &info);
     vtd_iommu_unlock(s);
@@ -3662,7 +3773,8 @@ static bool vtd_process_piotlb_desc(IntelIOMMUState *s,
     case VTD_INV_DESC_PIOTLB_PSI_IN_PASID:
         am = VTD_INV_DESC_PIOTLB_AM(inv_desc->val[1]);
         addr = (hwaddr) VTD_INV_DESC_PIOTLB_ADDR(inv_desc->val[1]);
-        vtd_piotlb_page_invalidate(s, domain_id, pasid, addr, am);
+        vtd_piotlb_page_invalidate(s, domain_id, pasid, addr, am,
+                                   VTD_INV_DESC_PIOTLB_IH(inv_desc->val[1]));
         break;
 
     default:
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH rfcv2 18/20] intel_iommu: Refresh pasid bind when either SRTP or TE bit is changed
  2025-02-19  8:22 [PATCH rfcv2 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
                   ` (16 preceding siblings ...)
  2025-02-19  8:22 ` [PATCH rfcv2 17/20] intel_iommu: Propagate PASID-based iotlb invalidation to host Zhenzhong Duan
@ 2025-02-19  8:22 ` Zhenzhong Duan
  2025-02-19  8:22 ` [PATCH rfcv2 19/20] intel_iommu: Bypass replay in stage-1 page table mode Zhenzhong Duan
                   ` (3 subsequent siblings)
  21 siblings, 0 replies; 68+ messages in thread
From: Zhenzhong Duan @ 2025-02-19  8:22 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, jgg,
	nicolinc, shameerali.kolothum.thodi, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan, Paolo Bonzini, Richard Henderson, Eduardo Habkost,
	Marcel Apfelbaum

From: Yi Liu <yi.l.liu@intel.com>

When either the 'Set Root Table Pointer' or the 'Translation Enable' bit is
changed, the pasid bindings on the host side become stale and need to be
updated.

Introduce a helper function vtd_refresh_pasid_bind() for that purpose.

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/i386/intel_iommu.c | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 8f7fb473f5..225e332132 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -89,6 +89,7 @@ struct vtd_iotlb_key {
 
 static void vtd_address_space_refresh_all(IntelIOMMUState *s);
 static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
+static void vtd_refresh_pasid_bind(IntelIOMMUState *s);
 
 static void vtd_pasid_cache_reset(IntelIOMMUState *s);
 static void vtd_pasid_cache_sync(IntelIOMMUState *s,
@@ -3366,6 +3367,7 @@ static void vtd_handle_gcmd_srtp(IntelIOMMUState *s)
     vtd_set_clear_mask_long(s, DMAR_GSTS_REG, 0, VTD_GSTS_RTPS);
     vtd_reset_caches(s);
     vtd_address_space_refresh_all(s);
+    vtd_refresh_pasid_bind(s);
 }
 
 /* Set Interrupt Remap Table Pointer */
@@ -3400,6 +3402,7 @@ static void vtd_handle_gcmd_te(IntelIOMMUState *s, bool en)
 
     vtd_reset_caches(s);
     vtd_address_space_refresh_all(s);
+    vtd_refresh_pasid_bind(s);
 }
 
 /* Handle Interrupt Remap Enable/Disable */
@@ -4109,6 +4112,28 @@ static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
     }
 }
 
+static void vtd_refresh_pasid_bind(IntelIOMMUState *s)
+{
+    VTDPASIDCacheInfo pc_info = { .error_happened = false,
+                                  .type = VTD_PASID_CACHE_GLOBAL_INV };
+
+    /*
+     * Only when dmar is enabled should pasid bindings be replayed;
+     * otherwise there is no need to replay.
+     */
+    if (!s->dmar_enabled) {
+        return;
+    }
+
+    if (!s->flts || !s->root_scalable) {
+        return;
+    }
+
+    vtd_iommu_lock(s);
+    vtd_replay_guest_pasid_bindings(s, &pc_info);
+    vtd_iommu_unlock(s);
+}
+
 /*
  * This function syncs the pasid bindings between guest and host.
  * It includes updating the pasid cache in vIOMMU and updating the
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH rfcv2 19/20] intel_iommu: Bypass replay in stage-1 page table mode
  2025-02-19  8:22 [PATCH rfcv2 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
                   ` (17 preceding siblings ...)
  2025-02-19  8:22 ` [PATCH rfcv2 18/20] intel_iommu: Refresh pasid bind when either SRTP or TE bit is changed Zhenzhong Duan
@ 2025-02-19  8:22 ` Zhenzhong Duan
  2025-02-19  8:22 ` [PATCH rfcv2 20/20] intel_iommu: Enable host device when x-flts=on in scalable mode Zhenzhong Duan
                   ` (2 subsequent siblings)
  21 siblings, 0 replies; 68+ messages in thread
From: Zhenzhong Duan @ 2025-02-19  8:22 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, jgg,
	nicolinc, shameerali.kolothum.thodi, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan, Marcel Apfelbaum, Paolo Bonzini,
	Richard Henderson, Eduardo Habkost

VFIO uses replay to set up the initial shadow IOMMU mappings.
But when a stage-1 page table is configured, it is passed to the
host to construct the nested page table, so no replay is needed.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/i386/intel_iommu.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 225e332132..e4b83cbe50 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -5743,6 +5743,14 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
     VTDContextEntry ce;
     DMAMap map = { .iova = 0, .size = HWADDR_MAX };
 
+    /*
+     * Replay on a stage-1 page table is meaningless as the stage-1 page
+     * table is passed through to the host to construct the nested page table
+     */
+    if (s->flts && s->root_scalable) {
+        return;
+    }
+
     /* replay is protected by BQL, page walk will re-setup it safely */
     iova_tree_remove(vtd_as->iova_tree, map);
 
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH rfcv2 20/20] intel_iommu: Enable host device when x-flts=on in scalable mode
  2025-02-19  8:22 [PATCH rfcv2 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
                   ` (18 preceding siblings ...)
  2025-02-19  8:22 ` [PATCH rfcv2 19/20] intel_iommu: Bypass replay in stage-1 page table mode Zhenzhong Duan
@ 2025-02-19  8:22 ` Zhenzhong Duan
  2025-02-20 19:03 ` [PATCH rfcv2 00/20] intel_iommu: Enable stage-1 translation for passthrough device Eric Auger
  2025-04-05  3:01 ` Donald Dutile
  21 siblings, 0 replies; 68+ messages in thread
From: Zhenzhong Duan @ 2025-02-19  8:22 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, jgg,
	nicolinc, shameerali.kolothum.thodi, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan, Paolo Bonzini, Richard Henderson, Eduardo Habkost,
	Marcel Apfelbaum

Now that all the infrastructure for running a passthrough device with
stage-1 translation is in place, enable it.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/i386/intel_iommu.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index e4b83cbe50..908c28f9be 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -5583,8 +5583,7 @@ static bool vtd_check_hiod(IntelIOMMUState *s, VTDHostIOMMUDevice *vtd_hiod,
     }
     vtd_hiod->errata = ret;
 
-    error_setg(errp, "host device is uncompatible with stage-1 translation");
-    return false;
+    return true;
 }
 
 static bool vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 68+ messages in thread

* Re: [PATCH rfcv2 01/20] backends/iommufd: Add helpers for invalidating user-managed HWPT
  2025-02-19  8:22 ` [PATCH rfcv2 01/20] backends/iommufd: Add helpers for invalidating user-managed HWPT Zhenzhong Duan
@ 2025-02-20 16:47   ` Eric Auger
  2025-02-28  2:26     ` Duan, Zhenzhong
  2025-02-24 10:03   ` Shameerali Kolothum Thodi via
  1 sibling, 1 reply; 68+ messages in thread
From: Eric Auger @ 2025-02-20 16:47 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, mst, jasowang, peterx, jgg, nicolinc,
	shameerali.kolothum.thodi, joao.m.martins, clement.mathieu--drif,
	kevin.tian, yi.l.liu, chao.p.peng

Hi Zhenzhong,


On 2/19/25 9:22 AM, Zhenzhong Duan wrote:
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Regarding the title: there is only a single helper here. A short commit
message may also help the reader.
> ---
>  include/system/iommufd.h |  3 +++
>  backends/iommufd.c       | 30 ++++++++++++++++++++++++++++++
>  backends/trace-events    |  1 +
>  3 files changed, 34 insertions(+)
>
> diff --git a/include/system/iommufd.h b/include/system/iommufd.h
> index cbab75bfbf..5d02e9d148 100644
> --- a/include/system/iommufd.h
> +++ b/include/system/iommufd.h
> @@ -61,6 +61,9 @@ bool iommufd_backend_get_dirty_bitmap(IOMMUFDBackend *be, uint32_t hwpt_id,
>                                        uint64_t iova, ram_addr_t size,
>                                        uint64_t page_size, uint64_t *data,
>                                        Error **errp);
> +int iommufd_backend_invalidate_cache(IOMMUFDBackend *be, uint32_t hwpt_id,
> +                                     uint32_t data_type, uint32_t entry_len,
> +                                     uint32_t *entry_num, void *data_ptr);
>  
>  #define TYPE_HOST_IOMMU_DEVICE_IOMMUFD TYPE_HOST_IOMMU_DEVICE "-iommufd"
>  #endif
> diff --git a/backends/iommufd.c b/backends/iommufd.c
> index d57da44755..fc32aad5cb 100644
> --- a/backends/iommufd.c
> +++ b/backends/iommufd.c
> @@ -311,6 +311,36 @@ bool iommufd_backend_get_device_info(IOMMUFDBackend *be, uint32_t devid,
>      return true;
>  }
>  
> +int iommufd_backend_invalidate_cache(IOMMUFDBackend *be, uint32_t hwpt_id,
> +                                     uint32_t data_type, uint32_t entry_len,
> +                                     uint32_t *entry_num, void *data_ptr)
> +{
> +    int ret, fd = be->fd;
> +    struct iommu_hwpt_invalidate cache = {
> +        .size = sizeof(cache),
> +        .hwpt_id = hwpt_id,
> +        .data_type = data_type,
> +        .entry_len = entry_len,
> +        .entry_num = *entry_num,
> +        .data_uptr = (uintptr_t)data_ptr,
> +    };
> +
> +    ret = ioctl(fd, IOMMU_HWPT_INVALIDATE, &cache);
> +
> +    trace_iommufd_backend_invalidate_cache(fd, hwpt_id, data_type, entry_len,
> +                                           *entry_num, cache.entry_num,
> +                                           (uintptr_t)data_ptr, ret);
> +    if (ret) {
> +        *entry_num = cache.entry_num;
> +        error_report("IOMMU_HWPT_INVALIDATE failed: %s", strerror(errno));
nit: you may report *entry_num also.
Wouldn't it be useful to have an Error *errp passed to the function?
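
For illustration, an errp-based variant of the helper could look roughly like
this (a sketch of the suggestion only, not code from the series):

    bool iommufd_backend_invalidate_cache(IOMMUFDBackend *be, uint32_t hwpt_id,
                                          uint32_t data_type, uint32_t entry_len,
                                          uint32_t *entry_num, void *data_ptr,
                                          Error **errp)
    {
        struct iommu_hwpt_invalidate cache = {
            .size = sizeof(cache),
            .hwpt_id = hwpt_id,
            .data_type = data_type,
            .entry_len = entry_len,
            .entry_num = *entry_num,
            .data_uptr = (uintptr_t)data_ptr,
        };
        int ret = ioctl(be->fd, IOMMU_HWPT_INVALIDATE, &cache);

        *entry_num = cache.entry_num;
        if (ret) {
            error_setg_errno(errp, errno,
                             "IOMMU_HWPT_INVALIDATE failed, entry_num %d",
                             *entry_num);
            return false;
        }
        return true;
    }
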
> +        ret = -errno;
> +    } else {
> +        g_assert(*entry_num == cache.entry_num);
> +    }
> +
> +    return ret;
> +}
> +
>  static int hiod_iommufd_get_cap(HostIOMMUDevice *hiod, int cap, Error **errp)
>  {
>      HostIOMMUDeviceCaps *caps = &hiod->caps;
> diff --git a/backends/trace-events b/backends/trace-events
> index 40811a3162..5a23db6c8a 100644
> --- a/backends/trace-events
> +++ b/backends/trace-events
> @@ -18,3 +18,4 @@ iommufd_backend_alloc_hwpt(int iommufd, uint32_t dev_id, uint32_t pt_id, uint32_
>  iommufd_backend_free_id(int iommufd, uint32_t id, int ret) " iommufd=%d id=%d (%d)"
>  iommufd_backend_set_dirty(int iommufd, uint32_t hwpt_id, bool start, int ret) " iommufd=%d hwpt=%u enable=%d (%d)"
>  iommufd_backend_get_dirty_bitmap(int iommufd, uint32_t hwpt_id, uint64_t iova, uint64_t size, uint64_t page_size, int ret) " iommufd=%d hwpt=%u iova=0x%"PRIx64" size=0x%"PRIx64" page_size=0x%"PRIx64" (%d)"
> +iommufd_backend_invalidate_cache(int iommufd, uint32_t hwpt_id, uint32_t data_type, uint32_t entry_len, uint32_t entry_num, uint32_t done_num, uint64_t data_ptr, int ret) " iommufd=%d hwpt_id=%u data_type=%u entry_len=%u entry_num=%u done_num=%u data_ptr=0x%"PRIx64" (%d)"
Eric



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH rfcv2 02/20] vfio/iommufd: Add properties and handlers to TYPE_HOST_IOMMU_DEVICE_IOMMUFD
  2025-02-19  8:22 ` [PATCH rfcv2 02/20] vfio/iommufd: Add properties and handlers to TYPE_HOST_IOMMU_DEVICE_IOMMUFD Zhenzhong Duan
@ 2025-02-20 17:42   ` Eric Auger
  2025-02-28  5:39     ` Duan, Zhenzhong
  0 siblings, 1 reply; 68+ messages in thread
From: Eric Auger @ 2025-02-20 17:42 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, mst, jasowang, peterx, jgg, nicolinc,
	shameerali.kolothum.thodi, joao.m.martins, clement.mathieu--drif,
	kevin.tian, yi.l.liu, chao.p.peng

Hi Zhenzhong,


On 2/19/25 9:22 AM, Zhenzhong Duan wrote:
> New added properties include IOMMUFD handle, devid and hwpt_id.
a property generally has another meaning in QEMU (PROP*).

I would rather say you enhance the HostIOMMUDeviceIOMMUFD object with 3 new
members specific to the iommufd BE, plus 2 new class functions.


> IOMMUFD handle and devid are used to allocate/free ioas and hwpt.
> hwpt_id is used to re-attach IOMMUFD backed device to its default
> VFIO sub-system created hwpt, i.e., when vIOMMU is disabled by
> guest. These properties are initialized in .realize_late() handler.
realize_late does not exist yet
>
> New added handlers include [at|de]tach_hwpt. They are used to
> attach/detach hwpt. VFIO and VDPA have different way to attach
> and detach, so implementation will be in sub-class instead of
> HostIOMMUDeviceIOMMUFD.
this is tricky to follow ...
>
> Add two wrappers host_iommu_device_iommufd_[at|de]tach_hwpt to
> wrap the two handlers.
>
> This is a prerequisite patch for following ones.
would get rid of that sentence as it does not help much
>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>  include/system/iommufd.h | 50 ++++++++++++++++++++++++++++++++++++++++
>  backends/iommufd.c       | 22 ++++++++++++++++++
>  2 files changed, 72 insertions(+)
>
> diff --git a/include/system/iommufd.h b/include/system/iommufd.h
> index 5d02e9d148..a871601df5 100644
> --- a/include/system/iommufd.h
> +++ b/include/system/iommufd.h
> @@ -66,4 +66,54 @@ int iommufd_backend_invalidate_cache(IOMMUFDBackend *be, uint32_t hwpt_id,
>                                       uint32_t *entry_num, void *data_ptr);
>  
>  #define TYPE_HOST_IOMMU_DEVICE_IOMMUFD TYPE_HOST_IOMMU_DEVICE "-iommufd"
> +OBJECT_DECLARE_TYPE(HostIOMMUDeviceIOMMUFD, HostIOMMUDeviceIOMMUFDClass,
> +                    HOST_IOMMU_DEVICE_IOMMUFD)
> +
> +/* Abstract of host IOMMU device with iommufd backend */
specialization/overload of the host IOMMU device for the iommufd BE?
> +struct HostIOMMUDeviceIOMMUFD {
> +    HostIOMMUDevice parent_obj;
> +
> +    IOMMUFDBackend *iommufd;
> +    uint32_t devid;
> +    uint32_t hwpt_id;
> +};
> +
> +struct HostIOMMUDeviceIOMMUFDClass {
> +    HostIOMMUDeviceClass parent_class;
> +
> +    /**
> +     * @attach_hwpt: attach host IOMMU device to IOMMUFD hardware page table.
> +     * VFIO and VDPA device can have different implementation.
> +     *
> +     * Mandatory callback.
> +     *
> +     * @idev: host IOMMU device backed by IOMMUFD backend.
> +     *
> +     * @hwpt_id: ID of IOMMUFD hardware page table.
> +     *
> +     * @errp: pass an Error out when attachment fails.
> +     *
> +     * Returns: true on success, false on failure.
> +     */
> +    bool (*attach_hwpt)(HostIOMMUDeviceIOMMUFD *idev, uint32_t hwpt_id,
> +                        Error **errp);
> +    /**
> +     * @detach_hwpt: detach host IOMMU device from IOMMUFD hardware page table.
> +     * VFIO and VDPA device can have different implementation.
> +     *
> +     * Mandatory callback.
> +     *
> +     * @idev: host IOMMU device backed by IOMMUFD backend.
> +     *
> +     * @errp: pass an Error out when attachment fails.
> +     *
> +     * Returns: true on success, false on failure.
> +     */
> +    bool (*detach_hwpt)(HostIOMMUDeviceIOMMUFD *idev, Error **errp);
> +};
> +
> +bool host_iommu_device_iommufd_attach_hwpt(HostIOMMUDeviceIOMMUFD *idev,
> +                                           uint32_t hwpt_id, Error **errp);
> +bool host_iommu_device_iommufd_detach_hwpt(HostIOMMUDeviceIOMMUFD *idev,
> +                                           Error **errp);
>  #endif
> diff --git a/backends/iommufd.c b/backends/iommufd.c
> index fc32aad5cb..574f330c27 100644
> --- a/backends/iommufd.c
> +++ b/backends/iommufd.c
> @@ -341,6 +341,26 @@ int iommufd_backend_invalidate_cache(IOMMUFDBackend *be, uint32_t hwpt_id,
>      return ret;
>  }
>  
> +bool host_iommu_device_iommufd_attach_hwpt(HostIOMMUDeviceIOMMUFD *idev,
> +                                           uint32_t hwpt_id, Error **errp)
> +{
> +    HostIOMMUDeviceIOMMUFDClass *idevc =
> +        HOST_IOMMU_DEVICE_IOMMUFD_GET_CLASS(idev);
> +
> +    g_assert(idevc->attach_hwpt);
> +    return idevc->attach_hwpt(idev, hwpt_id, errp);
> +}
> +
> +bool host_iommu_device_iommufd_detach_hwpt(HostIOMMUDeviceIOMMUFD *idev,
> +                                           Error **errp)
> +{
> +    HostIOMMUDeviceIOMMUFDClass *idevc =
> +        HOST_IOMMU_DEVICE_IOMMUFD_GET_CLASS(idev);
> +
> +    g_assert(idevc->detach_hwpt);
> +    return idevc->detach_hwpt(idev, errp);
> +}
> +
>  static int hiod_iommufd_get_cap(HostIOMMUDevice *hiod, int cap, Error **errp)
>  {
>      HostIOMMUDeviceCaps *caps = &hiod->caps;
> @@ -379,6 +399,8 @@ static const TypeInfo types[] = {
>      }, {
>          .name = TYPE_HOST_IOMMU_DEVICE_IOMMUFD,
>          .parent = TYPE_HOST_IOMMU_DEVICE,
> +        .instance_size = sizeof(HostIOMMUDeviceIOMMUFD),
> +        .class_size = sizeof(HostIOMMUDeviceIOMMUFDClass),
>          .class_init = hiod_iommufd_class_init,
>          .abstract = true,
>      }
Thanks

Eric



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH rfcv2 03/20] HostIOMMUDevice: Introduce realize_late callback
  2025-02-19  8:22 ` [PATCH rfcv2 03/20] HostIOMMUDevice: Introduce realize_late callback Zhenzhong Duan
@ 2025-02-20 17:48   ` Eric Auger
  2025-02-28  8:16     ` Duan, Zhenzhong
  2025-04-07 11:19   ` Cédric Le Goater
  1 sibling, 1 reply; 68+ messages in thread
From: Eric Auger @ 2025-02-20 17:48 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, mst, jasowang, peterx, jgg, nicolinc,
	shameerali.kolothum.thodi, joao.m.martins, clement.mathieu--drif,
	kevin.tian, yi.l.liu, chao.p.peng




On 2/19/25 9:22 AM, Zhenzhong Duan wrote:
> Currently we have realize() callback which is called before attachment.
> But there are still some elements e.g., hwpt_id is not ready before
> attachment. So we need a realize_late() callback to further initialize
> them.
From the description it is not obvious why realize() could not simply have
been called after the attach. Could you remind the reader of the reason?

Thanks

Eric
>
> Currently, this callback is only useful for iommufd backend. For legacy
> backend nothing needs to be initialized after attachment.
>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>  include/system/host_iommu_device.h | 17 +++++++++++++++++
>  hw/vfio/common.c                   | 17 ++++++++++++++---
>  2 files changed, 31 insertions(+), 3 deletions(-)
>
> diff --git a/include/system/host_iommu_device.h b/include/system/host_iommu_device.h
> index 809cced4ba..df782598f2 100644
> --- a/include/system/host_iommu_device.h
> +++ b/include/system/host_iommu_device.h
> @@ -66,6 +66,23 @@ struct HostIOMMUDeviceClass {
>       * Returns: true on success, false on failure.
>       */
>      bool (*realize)(HostIOMMUDevice *hiod, void *opaque, Error **errp);
> +    /**
> +     * @realize_late: initialize host IOMMU device instance after attachment,
> +     *                some elements e.g., ioas are ready only after attachment.
> +     *                This callback initialize them.
> +     *
> +     * Optional callback.
> +     *
> +     * @hiod: pointer to a host IOMMU device instance.
> +     *
> +     * @opaque: pointer to agent device of this host IOMMU device,
> +     *          e.g., VFIO base device or VDPA device.
> +     *
> +     * @errp: pass an Error out when realize fails.
> +     *
> +     * Returns: true on success, false on failure.
> +     */
> +    bool (*realize_late)(HostIOMMUDevice *hiod, void *opaque, Error **errp);
>      /**
>       * @get_cap: check if a host IOMMU device capability is supported.
>       *
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index abbdc56b6d..e198b1e5a2 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -1550,6 +1550,7 @@ bool vfio_attach_device(char *name, VFIODevice *vbasedev,
>      const VFIOIOMMUClass *ops =
>          VFIO_IOMMU_CLASS(object_class_by_name(TYPE_VFIO_IOMMU_LEGACY));
>      HostIOMMUDevice *hiod = NULL;
> +    HostIOMMUDeviceClass *hiod_ops = NULL;
>  
>      if (vbasedev->iommufd) {
>          ops = VFIO_IOMMU_CLASS(object_class_by_name(TYPE_VFIO_IOMMU_IOMMUFD));
> @@ -1560,16 +1561,26 @@ bool vfio_attach_device(char *name, VFIODevice *vbasedev,
>  
>      if (!vbasedev->mdev) {
>          hiod = HOST_IOMMU_DEVICE(object_new(ops->hiod_typename));
> +        hiod_ops = HOST_IOMMU_DEVICE_GET_CLASS(hiod);
>          vbasedev->hiod = hiod;
>      }
>  
>      if (!ops->attach_device(name, vbasedev, as, errp)) {
> -        object_unref(hiod);
> -        vbasedev->hiod = NULL;
> -        return false;
> +        goto err_attach;
> +    }
> +
> +    if (hiod_ops && hiod_ops->realize_late &&
> +        !hiod_ops->realize_late(hiod, vbasedev, errp)) {
> +        ops->detach_device(vbasedev);
> +        goto err_attach;
>      }
>  
>      return true;
> +
> +err_attach:
> +    object_unref(hiod);
> +    vbasedev->hiod = NULL;
> +    return false;
>  }
>  
>  void vfio_detach_device(VFIODevice *vbasedev)



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH rfcv2 04/20] vfio/iommufd: Implement HostIOMMUDeviceClass::realize_late() handler
  2025-02-19  8:22 ` [PATCH rfcv2 04/20] vfio/iommufd: Implement HostIOMMUDeviceClass::realize_late() handler Zhenzhong Duan
@ 2025-02-20 18:07   ` Eric Auger
  2025-02-28  8:23     ` Duan, Zhenzhong
  0 siblings, 1 reply; 68+ messages in thread
From: Eric Auger @ 2025-02-20 18:07 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, mst, jasowang, peterx, jgg, nicolinc,
	shameerali.kolothum.thodi, joao.m.martins, clement.mathieu--drif,
	kevin.tian, yi.l.liu, chao.p.peng




On 2/19/25 9:22 AM, Zhenzhong Duan wrote:
> There are three iommufd related elements iommufd handle, devid and

There are three iommufd-specific members in HostIOMMUDeviceIOMMUFD that
need to be initialized after attach, in realize_late() ...

> hwpt_id. hwpt_id is ready only after VFIO device attachment. Device
> id and iommufd handle are ready before attachment, but they are all
> iommufd related stuff, initialize them together with hwpt_id.
>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>  hw/vfio/iommufd.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
>
> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
> index df61edffc0..53639bf88b 100644
> --- a/hw/vfio/iommufd.c
> +++ b/hw/vfio/iommufd.c
> @@ -828,6 +828,19 @@ static bool hiod_iommufd_vfio_realize(HostIOMMUDevice *hiod, void *opaque,
>      return true;
>  }
>  
> +static bool hiod_iommufd_vfio_realize_late(HostIOMMUDevice *hiod, void *opaque,
> +                                           Error **errp)
> +{
> +    VFIODevice *vdev = opaque;
> +    HostIOMMUDeviceIOMMUFD *idev = HOST_IOMMU_DEVICE_IOMMUFD(hiod);
> +
> +    idev->iommufd = vdev->iommufd;
> +    idev->devid = vdev->devid;
> +    idev->hwpt_id = vdev->hwpt->hwpt_id;
> +
> +    return true;
> +}
> +
>  static GList *
>  hiod_iommufd_vfio_get_iova_ranges(HostIOMMUDevice *hiod)
>  {
> @@ -852,6 +865,7 @@ static void hiod_iommufd_vfio_class_init(ObjectClass *oc, void *data)
>      HostIOMMUDeviceClass *hiodc = HOST_IOMMU_DEVICE_CLASS(oc);
>  
>      hiodc->realize = hiod_iommufd_vfio_realize;
> +    hiodc->realize_late = hiod_iommufd_vfio_realize_late;
>      hiodc->get_iova_ranges = hiod_iommufd_vfio_get_iova_ranges;
>      hiodc->get_page_size_mask = hiod_iommufd_vfio_get_page_size_mask;
>  };



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH rfcv2 05/20] vfio/iommufd: Implement [at|de]tach_hwpt handlers
  2025-02-19  8:22 ` [PATCH rfcv2 05/20] vfio/iommufd: Implement [at|de]tach_hwpt handlers Zhenzhong Duan
@ 2025-02-20 18:13   ` Eric Auger
  2025-02-28  8:24     ` Duan, Zhenzhong
  0 siblings, 1 reply; 68+ messages in thread
From: Eric Auger @ 2025-02-20 18:13 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, mst, jasowang, peterx, jgg, nicolinc,
	shameerali.kolothum.thodi, joao.m.martins, clement.mathieu--drif,
	kevin.tian, yi.l.liu, chao.p.peng




On 2/19/25 9:22 AM, Zhenzhong Duan wrote:
> Implement [at|de]tach_hwpt handlers in VFIO subsystem. vIOMMU
> utilizes them to attach to or detach from hwpt on host side.
>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>  hw/vfio/iommufd.c | 22 ++++++++++++++++++++++
>  1 file changed, 22 insertions(+)
>
> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
> index 53639bf88b..175c4fe1f4 100644
> --- a/hw/vfio/iommufd.c
> +++ b/hw/vfio/iommufd.c
> @@ -802,6 +802,24 @@ static void vfio_iommu_iommufd_class_init(ObjectClass *klass, void *data)
>      vioc->query_dirty_bitmap = iommufd_query_dirty_bitmap;
>  };
>  
> +static bool
Can't we return an integer instead? This looks more standard to me.

Eric
> +host_iommu_device_iommufd_vfio_attach_hwpt(HostIOMMUDeviceIOMMUFD *idev,
> +                                           uint32_t hwpt_id, Error **errp)
> +{
> +    VFIODevice *vbasedev = HOST_IOMMU_DEVICE(idev)->agent;
> +
> +    return !iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt_id, errp);
> +}
> +
> +static bool
> +host_iommu_device_iommufd_vfio_detach_hwpt(HostIOMMUDeviceIOMMUFD *idev,
> +                                           Error **errp)
> +{
> +    VFIODevice *vbasedev = HOST_IOMMU_DEVICE(idev)->agent;
> +
> +    return iommufd_cdev_detach_ioas_hwpt(vbasedev, errp);
> +}
> +
>  static bool hiod_iommufd_vfio_realize(HostIOMMUDevice *hiod, void *opaque,
>                                        Error **errp)
>  {
> @@ -863,11 +881,15 @@ hiod_iommufd_vfio_get_page_size_mask(HostIOMMUDevice *hiod)
>  static void hiod_iommufd_vfio_class_init(ObjectClass *oc, void *data)
>  {
>      HostIOMMUDeviceClass *hiodc = HOST_IOMMU_DEVICE_CLASS(oc);
> +    HostIOMMUDeviceIOMMUFDClass *idevc = HOST_IOMMU_DEVICE_IOMMUFD_CLASS(oc);
>  
>      hiodc->realize = hiod_iommufd_vfio_realize;
>      hiodc->realize_late = hiod_iommufd_vfio_realize_late;
>      hiodc->get_iova_ranges = hiod_iommufd_vfio_get_iova_ranges;
>      hiodc->get_page_size_mask = hiod_iommufd_vfio_get_page_size_mask;
> +
> +    idevc->attach_hwpt = host_iommu_device_iommufd_vfio_attach_hwpt;
> +    idevc->detach_hwpt = host_iommu_device_iommufd_vfio_detach_hwpt;
>  };
>  
>  static const TypeInfo types[] = {



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH rfcv2 06/20] host_iommu_device: Define two new capabilities HOST_IOMMU_DEVICE_CAP_[NESTING|FS1GP]
  2025-02-19  8:22 ` [PATCH rfcv2 06/20] host_iommu_device: Define two new capabilities HOST_IOMMU_DEVICE_CAP_[NESTING|FS1GP] Zhenzhong Duan
@ 2025-02-20 18:41   ` Eric Auger
  2025-02-20 18:44     ` Eric Auger
  2025-02-28  8:29     ` Duan, Zhenzhong
  0 siblings, 2 replies; 68+ messages in thread
From: Eric Auger @ 2025-02-20 18:41 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, mst, jasowang, peterx, jgg, nicolinc,
	shameerali.kolothum.thodi, joao.m.martins, clement.mathieu--drif,
	kevin.tian, yi.l.liu, chao.p.peng

Hi Zhenzhong,


On 2/19/25 9:22 AM, Zhenzhong Duan wrote:
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>  include/system/host_iommu_device.h | 8 ++++++++
>  1 file changed, 8 insertions(+)
>
> diff --git a/include/system/host_iommu_device.h b/include/system/host_iommu_device.h
> index df782598f2..18f8b5e5cf 100644
> --- a/include/system/host_iommu_device.h
> +++ b/include/system/host_iommu_device.h
> @@ -22,10 +22,16 @@
>   *
>   * @hw_caps: host platform IOMMU capabilities (e.g. on IOMMUFD this represents
>   *           the @out_capabilities value returned from IOMMU_GET_HW_INFO ioctl)
> + *
> + * @nesting: nesting page table support.
> + *
> + * @fs1gp: first stage(a.k.a, Stage-1) 1GB huge page support.
>   */
>  typedef struct HostIOMMUDeviceCaps {
>      uint32_t type;
>      uint64_t hw_caps;
> +    bool nesting;
> +    bool fs1gp;
This looks quite vtd-specific, doesn't it? Shouldn't we hide this in a
vendor-specific cap struct?
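
Something along these lines, for instance (a hypothetical sketch, not a
concrete proposal):

    typedef struct HostIOMMUDeviceCaps {
        uint32_t type;
        uint64_t hw_caps;
        union {
            struct {
                bool nesting;
                bool fs1gp;
            } vtd;          /* Intel VT-d specific bits */
        } vendor_caps;
    } HostIOMMUDeviceCaps;
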
>  } HostIOMMUDeviceCaps;
>  
>  #define TYPE_HOST_IOMMU_DEVICE "host-iommu-device"
> @@ -122,6 +128,8 @@ struct HostIOMMUDeviceClass {
>   */
>  #define HOST_IOMMU_DEVICE_CAP_IOMMU_TYPE        0
>  #define HOST_IOMMU_DEVICE_CAP_AW_BITS           1
> +#define HOST_IOMMU_DEVICE_CAP_NESTING           2
> +#define HOST_IOMMU_DEVICE_CAP_FS1GP             3
>  
>  #define HOST_IOMMU_DEVICE_CAP_AW_BITS_MAX       64
>  #endif

Maybe you could introduce the associated implementation of
hiod_iommufd_get_cap in this patch too?

Eric



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH rfcv2 06/20] host_iommu_device: Define two new capabilities HOST_IOMMU_DEVICE_CAP_[NESTING|FS1GP]
  2025-02-20 18:41   ` Eric Auger
@ 2025-02-20 18:44     ` Eric Auger
  2025-02-28  8:29     ` Duan, Zhenzhong
  1 sibling, 0 replies; 68+ messages in thread
From: Eric Auger @ 2025-02-20 18:44 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, mst, jasowang, peterx, jgg, nicolinc,
	shameerali.kolothum.thodi, joao.m.martins, clement.mathieu--drif,
	kevin.tian, yi.l.liu, chao.p.peng



On 2/20/25 7:41 PM, Eric Auger wrote:
> Hi Zhenzhong,
> 
> 
> On 2/19/25 9:22 AM, Zhenzhong Duan wrote:
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>>  include/system/host_iommu_device.h | 8 ++++++++
>>  1 file changed, 8 insertions(+)
>>
>> diff --git a/include/system/host_iommu_device.h b/include/system/host_iommu_device.h
>> index df782598f2..18f8b5e5cf 100644
>> --- a/include/system/host_iommu_device.h
>> +++ b/include/system/host_iommu_device.h
>> @@ -22,10 +22,16 @@
>>   *
>>   * @hw_caps: host platform IOMMU capabilities (e.g. on IOMMUFD this represents
>>   *           the @out_capabilities value returned from IOMMU_GET_HW_INFO ioctl)
>> + *
>> + * @nesting: nesting page table support.
>> + *
>> + * @fs1gp: first stage(a.k.a, Stage-1) 1GB huge page support.
>>   */
>>  typedef struct HostIOMMUDeviceCaps {
>>      uint32_t type;
>>      uint64_t hw_caps;
>> +    bool nesting;
>> +    bool fs1gp;
> This looks quite vtd-specific, doesn't it? Shouldn't we hide this in a
> vendor-specific cap struct?
>>  } HostIOMMUDeviceCaps;
>>  
>>  #define TYPE_HOST_IOMMU_DEVICE "host-iommu-device"
>> @@ -122,6 +128,8 @@ struct HostIOMMUDeviceClass {
>>   */
>>  #define HOST_IOMMU_DEVICE_CAP_IOMMU_TYPE        0
>>  #define HOST_IOMMU_DEVICE_CAP_AW_BITS           1
>> +#define HOST_IOMMU_DEVICE_CAP_NESTING           2
>> +#define HOST_IOMMU_DEVICE_CAP_FS1GP             3
>>  
>>  #define HOST_IOMMU_DEVICE_CAP_AW_BITS_MAX       64
>>  #endif
> 
> Maybe you could introduce the associated implementation of
> hiod_iommufd_get_cap in this patch too?
ignore this last comment :(

Eric
> 
> Eric



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH rfcv2 08/20] iommufd: Implement query of HOST_IOMMU_DEVICE_CAP_ERRATA
  2025-02-19  8:22 ` [PATCH rfcv2 08/20] iommufd: Implement query of HOST_IOMMU_DEVICE_CAP_ERRATA Zhenzhong Duan
@ 2025-02-20 18:55   ` Eric Auger
  2025-02-28  8:31     ` Duan, Zhenzhong
  0 siblings, 1 reply; 68+ messages in thread
From: Eric Auger @ 2025-02-20 18:55 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, mst, jasowang, peterx, jgg, nicolinc,
	shameerali.kolothum.thodi, joao.m.martins, clement.mathieu--drif,
	kevin.tian, yi.l.liu, chao.p.peng




On 2/19/25 9:22 AM, Zhenzhong Duan wrote:
> Implement query of HOST_IOMMU_DEVICE_CAP_ERRATA for IOMMUFD
> backed host IOMMU device.
>
> Query on this capability is not supported for legacy backend
> because there is no plan to support nesting with leacy backend
legacy
> backed host device.
>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>  include/system/host_iommu_device.h | 2 ++
>  backends/iommufd.c                 | 2 ++
>  hw/vfio/iommufd.c                  | 1 +
>  3 files changed, 5 insertions(+)
>
> diff --git a/include/system/host_iommu_device.h b/include/system/host_iommu_device.h
> index 18f8b5e5cf..250600fc1d 100644
> --- a/include/system/host_iommu_device.h
> +++ b/include/system/host_iommu_device.h
> @@ -32,6 +32,7 @@ typedef struct HostIOMMUDeviceCaps {
>      uint64_t hw_caps;
>      bool nesting;
>      bool fs1gp;
> +    uint32_t errata;
to be consistent, you may want to introduce this alongside the 2 others.
This is also not usable by other IOMMUs.

Eric
>  } HostIOMMUDeviceCaps;
>  
>  #define TYPE_HOST_IOMMU_DEVICE "host-iommu-device"
> @@ -130,6 +131,7 @@ struct HostIOMMUDeviceClass {
>  #define HOST_IOMMU_DEVICE_CAP_AW_BITS           1
>  #define HOST_IOMMU_DEVICE_CAP_NESTING           2
>  #define HOST_IOMMU_DEVICE_CAP_FS1GP             3
> +#define HOST_IOMMU_DEVICE_CAP_ERRATA            4
>  
>  #define HOST_IOMMU_DEVICE_CAP_AW_BITS_MAX       64
>  #endif
> diff --git a/backends/iommufd.c b/backends/iommufd.c
> index 0a1a40cbba..3c23caef96 100644
> --- a/backends/iommufd.c
> +++ b/backends/iommufd.c
> @@ -374,6 +374,8 @@ static int hiod_iommufd_get_cap(HostIOMMUDevice *hiod, int cap, Error **errp)
>          return caps->nesting;
>      case HOST_IOMMU_DEVICE_CAP_FS1GP:
>          return caps->fs1gp;
> +    case HOST_IOMMU_DEVICE_CAP_ERRATA:
> +        return caps->errata;
>      default:
>          error_setg(errp, "%s: unsupported capability %x", hiod->name, cap);
>          return -EINVAL;
> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
> index df6a12d200..58bff030e1 100644
> --- a/hw/vfio/iommufd.c
> +++ b/hw/vfio/iommufd.c
> @@ -848,6 +848,7 @@ static bool hiod_iommufd_vfio_realize(HostIOMMUDevice *hiod, void *opaque,
>      case IOMMU_HW_INFO_TYPE_INTEL_VTD:
>          caps->nesting = !!(data.vtd.ecap_reg & VTD_ECAP_NEST);
>          caps->fs1gp = !!(data.vtd.cap_reg & VTD_CAP_FS1GP);
> +        caps->errata = data.vtd.flags & IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17;
>          break;
>      case IOMMU_HW_INFO_TYPE_ARM_SMMUV3:
>      case IOMMU_HW_INFO_TYPE_NONE:



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH rfcv2 07/20] iommufd: Implement query of HOST_IOMMU_DEVICE_CAP_[NESTING|FS1GP]
  2025-02-19  8:22 ` [PATCH rfcv2 07/20] iommufd: Implement query of HOST_IOMMU_DEVICE_CAP_[NESTING|FS1GP] Zhenzhong Duan
@ 2025-02-20 19:00   ` Eric Auger
  2025-02-28  8:32     ` Duan, Zhenzhong
  0 siblings, 1 reply; 68+ messages in thread
From: Eric Auger @ 2025-02-20 19:00 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, mst, jasowang, peterx, jgg, nicolinc,
	shameerali.kolothum.thodi, joao.m.martins, clement.mathieu--drif,
	kevin.tian, yi.l.liu, chao.p.peng, Marcel Apfelbaum,
	Paolo Bonzini, Richard Henderson, Eduardo Habkost




On 2/19/25 9:22 AM, Zhenzhong Duan wrote:
> Implement query of HOST_IOMMU_DEVICE_CAP_[NESTING|FS1GP] for IOMMUFD
> backed host IOMMU device.
>
> Query on these two capabilities is not supported for legacy backend
> because there is no plan to support nesting with leacy backend backed
> host device.
>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>  hw/i386/intel_iommu_internal.h |  1 +
>  backends/iommufd.c             |  4 ++++
>  hw/vfio/iommufd.c              | 11 +++++++++++
>  3 files changed, 16 insertions(+)
>
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> index e8b211e8b0..2cda744786 100644
> --- a/hw/i386/intel_iommu_internal.h
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -191,6 +191,7 @@
>  #define VTD_ECAP_PT                 (1ULL << 6)
>  #define VTD_ECAP_SC                 (1ULL << 7)
>  #define VTD_ECAP_MHMV               (15ULL << 20)
> +#define VTD_ECAP_NEST               (1ULL << 26)
>  #define VTD_ECAP_SRS                (1ULL << 31)
>  #define VTD_ECAP_PASID              (1ULL << 40)
>  #define VTD_ECAP_SMTS               (1ULL << 43)
> diff --git a/backends/iommufd.c b/backends/iommufd.c
> index 574f330c27..0a1a40cbba 100644
> --- a/backends/iommufd.c
> +++ b/backends/iommufd.c
> @@ -370,6 +370,10 @@ static int hiod_iommufd_get_cap(HostIOMMUDevice *hiod, int cap, Error **errp)
>          return caps->type;
>      case HOST_IOMMU_DEVICE_CAP_AW_BITS:
>          return vfio_device_get_aw_bits(hiod->agent);
> +    case HOST_IOMMU_DEVICE_CAP_NESTING:
> +        return caps->nesting;
> +    case HOST_IOMMU_DEVICE_CAP_FS1GP:
> +        return caps->fs1gp;
this is vtd specific, so those caps shouldn't be returned for other iommus, no?

Eric
>      default:
>          error_setg(errp, "%s: unsupported capability %x", hiod->name, cap);
>          return -EINVAL;
> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
> index 175c4fe1f4..df6a12d200 100644
> --- a/hw/vfio/iommufd.c
> +++ b/hw/vfio/iommufd.c
> @@ -26,6 +26,7 @@
>  #include "qemu/chardev_open.h"
>  #include "pci.h"
>  #include "exec/ram_addr.h"
> +#include "hw/i386/intel_iommu_internal.h"
>  
>  static int iommufd_cdev_map(const VFIOContainerBase *bcontainer, hwaddr iova,
>                              ram_addr_t size, void *vaddr, bool readonly)
> @@ -843,6 +844,16 @@ static bool hiod_iommufd_vfio_realize(HostIOMMUDevice *hiod, void *opaque,
>      caps->type = type;
>      caps->hw_caps = hw_caps;
>  
> +    switch (type) {
> +    case IOMMU_HW_INFO_TYPE_INTEL_VTD:
> +        caps->nesting = !!(data.vtd.ecap_reg & VTD_ECAP_NEST);
> +        caps->fs1gp = !!(data.vtd.cap_reg & VTD_CAP_FS1GP);
> +        break;
> +    case IOMMU_HW_INFO_TYPE_ARM_SMMUV3:
> +    case IOMMU_HW_INFO_TYPE_NONE:
> +        break;
> +    }
> +
>      return true;
>  }
>  
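
One possible way to address the concern above would be to gate the VT-d-only
caps on the reported hardware-info type; a rough sketch, where the names and
error convention are assumptions rather than the series' code:

    /*
     * Sketch: reject VT-d-specific capability queries when the device did
     * not report the VT-d hardware-info type. Names are illustrative.
     */
    #include <errno.h>
    #include <stdbool.h>

    enum example_hw_info_type { EXAMPLE_TYPE_NONE, EXAMPLE_TYPE_INTEL_VTD };

    struct example_caps {
        enum example_hw_info_type type;
        bool nesting;
        bool fs1gp;
    };

    int example_get_vtd_cap(const struct example_caps *caps, bool want_fs1gp)
    {
        if (caps->type != EXAMPLE_TYPE_INTEL_VTD) {
            return -EINVAL;         /* cap not applicable to this IOMMU */
        }
        return want_fs1gp ? caps->fs1gp : caps->nesting;
    }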



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH rfcv2 00/20] intel_iommu: Enable stage-1 translation for passthrough device
  2025-02-19  8:22 [PATCH rfcv2 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
                   ` (19 preceding siblings ...)
  2025-02-19  8:22 ` [PATCH rfcv2 20/20] intel_iommu: Enable host device when x-flts=on in scalable mode Zhenzhong Duan
@ 2025-02-20 19:03 ` Eric Auger
  2025-02-21  6:08   ` Duan, Zhenzhong
  2025-04-05  3:01 ` Donald Dutile
  21 siblings, 1 reply; 68+ messages in thread
From: Eric Auger @ 2025-02-20 19:03 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, mst, jasowang, peterx, jgg, nicolinc,
	shameerali.kolothum.thodi, joao.m.martins, clement.mathieu--drif,
	kevin.tian, yi.l.liu, chao.p.peng


Hi Zhenzhong

On 2/19/25 9:22 AM, Zhenzhong Duan wrote:
> Hi,
>
> Per Jason Wang's suggestion, iommufd nesting series[1] is split into
> "Enable stage-1 translation for emulated device" series and
> "Enable stage-1 translation for passthrough device" series.
>
> This series is 2nd part focusing on passthrough device. We don't do
> shadowing of guest page table for passthrough device but pass stage-1
> page table to host side to construct a nested domain. There was some
> effort to enable this feature in old days, see [2] for details.
>
> The key design is to utilize the dual-stage IOMMU translation
> (also known as IOMMU nested translation) capability in host IOMMU.
> As the below diagram shows, guest I/O page table pointer in GPA
> (guest physical address) is passed to host and be used to perform
s/be/is
> the stage-1 address translation. Along with it, modifications to
> present mappings in the guest I/O page table should be followed
> with an IOTLB invalidation.
>
>         .-------------.  .---------------------------.
>         |   vIOMMU    |  | Guest I/O page table      |
>         |             |  '---------------------------'
>         .----------------/
>         | PASID Entry |--- PASID cache flush --+
>         '-------------'                        |
>         |             |                        V
>         |             |           I/O page table pointer in GPA
>         '-------------'
>     Guest
>     ------| Shadow |---------------------------|--------
>           v        v                           v
>     Host
>         .-------------.  .------------------------.
>         |   pIOMMU    |  |  FS for GIOVA->GPA     |
>         |             |  '------------------------'
>         .----------------/  |
>         | PASID Entry |     V (Nested xlate)
>         '----------------\.----------------------------------.
>         |             |   | SS for GPA->HPA, unmanaged domain|
>         |             |   '----------------------------------'
>         '-------------'
> Where:
>  - FS = First stage page tables
>  - SS = Second stage page tables
> <Intel VT-d Nested translation>
>
> There are some interactions between VFIO and vIOMMU
> * vIOMMU registers PCIIOMMUOps [set|unset]_iommu_device to PCI
>   subsystem. VFIO calls them to register/unregister HostIOMMUDevice
>   instance to vIOMMU at vfio device realize stage.
> * vIOMMU calls HostIOMMUDeviceIOMMUFD interface [at|de]tach_hwpt
>   to bind/unbind device to IOMMUFD backed domains, either nested
>   domain or not.
>
> See below diagram:
>
>         VFIO Device                                 Intel IOMMU
>     .-----------------.                         .-------------------.
>     |                 |                         |                   |
>     |       .---------|PCIIOMMUOps              |.-------------.    |
>     |       | IOMMUFD |(set_iommu_device)       || Host IOMMU  |    |
>     |       | Device  |------------------------>|| Device list |    |
>     |       .---------|(unset_iommu_device)     |.-------------.    |
>     |                 |                         |       |           |
>     |                 |                         |       V           |
>     |       .---------|  HostIOMMUDeviceIOMMUFD |  .-------------.  |
>     |       | IOMMUFD |            (attach_hwpt)|  | Host IOMMU  |  |
>     |       | link    |<------------------------|  |   Device    |  |
>     |       .---------|            (detach_hwpt)|  .-------------.  |
>     |                 |                         |       |           |
>     |                 |                         |       ...         |
>     .-----------------.                         .-------------------.
>
> Based on Yi's suggestion, this design is optimal in sharing ioas/hwpt
> whenever possible and create new one on demand, also supports multiple
> iommufd objects and ERRATA_772415.
>
> E.g., Stage-2 page table could be shared by different devices if there
> is no conflict and devices link to same iommufd object, i.e. devices
> under same host IOMMU can share same stage-2 page table. If there is
> conflict, i.e. there is one device under non cache coherency mode
> which is different from others, it requires a separate stage-2 page
> table in non-CC mode.
>
> SPR platform has ERRATA_772415 which requires no readonly mappings
> in stage-2 page table. This series supports creating VTDIOASContainer
> with no readonly mappings. If there is a rare case that some IOMMUs
> on a multiple IOMMU host have ERRATA_772415 and others not, this
> design can still survive.
>
> See below example diagram for a full view:
>
>       IntelIOMMUState
>              |
>              V
>     .------------------.    .------------------.    .-------------------.
>     | VTDIOASContainer |--->| VTDIOASContainer |--->| VTDIOASContainer  |-->...
>     | (iommufd0,RW&RO) |    | (iommufd1,RW&RO) |    | (iommufd0,RW only)|
>     .------------------.    .------------------.    .-------------------.
>              |                       |                              |
>              |                       .-->...                        |
>              V                                                      V
>       .-------------------.    .-------------------.          .---------------.
>       |   VTDS2Hwpt(CC)   |--->| VTDS2Hwpt(non-CC) |-->...    | VTDS2Hwpt(CC) |-->...
>       .-------------------.    .-------------------.          .---------------.
>           |            |               |                            |
>           |            |               |                            |
>     .-----------.  .-----------.  .------------.              .------------.
>     | IOMMUFD   |  | IOMMUFD   |  | IOMMUFD    |              | IOMMUFD    |
>     | Device(CC)|  | Device(CC)|  | Device     |              | Device(CC) |
>     | (iommufd0)|  | (iommufd0)|  | (non-CC)   |              | (errata)   |
>     |           |  |           |  | (iommufd0) |              | (iommufd0) |
>     .-----------.  .-----------.  .------------.              .------------.
>
> This series is also a prerequisite work for vSVA, i.e. Sharing
> guest application address space with passthrough devices.
>
> To enable stage-1 translation, only need to add "x-scalable-mode=on,x-flts=on".
> i.e. -device intel-iommu,x-scalable-mode=on,x-flts=on...
>
> Passthrough device should use iommufd backend to work with stage-1 translation.
> i.e. -object iommufd,id=iommufd0 -device vfio-pci,iommufd=iommufd0,...
>
> If host doesn't support nested translation, qemu will fail with an unsupported
> report.

you're not mentioning lack of error reporting from HW S1 faults to
guests. Are there other deps missing?

Eric
>
> Test done:
> - VFIO devices hotplug/unplug
> - different VFIO devices linked to different iommufds
> - vhost net device ping test
>
> PATCH1-8:  Add HWPT-based nesting infrastructure support
> PATCH9-10: Some cleanup work
> PATCH11:   cap/ecap related compatibility check between vIOMMU and Host IOMMU
> PATCH12-19:Implement stage-1 page table for passthrough device
> PATCH20:   Enable stage-1 translation for passthrough device
>
> Qemu code can be found at [3]
>
> TODO:
> - RAM discard
> - dirty tracking on stage-2 page table
>
> [1] https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg02740.html
> [2] https://patchwork.kernel.org/project/kvm/cover/20210302203827.437645-1-yi.l.liu@intel.com/
> [3] https://github.com/yiliu1765/qemu/tree/zhenzhong/iommufd_nesting_rfcv2
>
> Thanks
> Zhenzhong
>
> Changelog:
> rfcv2:
> - Drop VTDPASIDAddressSpace and use VTDAddressSpace (Eric, Liuyi)
> - Move HWPT uAPI patches ahead(patch1-8) so arm nesting could easily rebase
> - add two cleanup patches(patch9-10)
> - VFIO passes iommufd/devid/hwpt_id to vIOMMU instead of iommufd/devid/ioas_id
> - add vtd_as_[from|to]_iommu_pasid() helper to translate between vtd_as and
>   iommu pasid, this is important for dropping VTDPASIDAddressSpace
>
> Yi Liu (3):
>   intel_iommu: Replay pasid binds after context cache invalidation
>   intel_iommu: Propagate PASID-based iotlb invalidation to host
>   intel_iommu: Refresh pasid bind when either SRTP or TE bit is changed
>
> Zhenzhong Duan (17):
>   backends/iommufd: Add helpers for invalidating user-managed HWPT
>   vfio/iommufd: Add properties and handlers to
>     TYPE_HOST_IOMMU_DEVICE_IOMMUFD
>   HostIOMMUDevice: Introduce realize_late callback
>   vfio/iommufd: Implement HostIOMMUDeviceClass::realize_late() handler
>   vfio/iommufd: Implement [at|de]tach_hwpt handlers
>   host_iommu_device: Define two new capabilities
>     HOST_IOMMU_DEVICE_CAP_[NESTING|FS1GP]
>   iommufd: Implement query of HOST_IOMMU_DEVICE_CAP_[NESTING|FS1GP]
>   iommufd: Implement query of HOST_IOMMU_DEVICE_CAP_ERRATA
>   intel_iommu: Rename vtd_ce_get_rid2pasid_entry to
>     vtd_ce_get_pasid_entry
>   intel_iommu: Optimize context entry cache utilization
>   intel_iommu: Check for compatibility with IOMMUFD backed device when
>     x-flts=on
>   intel_iommu: Introduce a new structure VTDHostIOMMUDevice
>   intel_iommu: Add PASID cache management infrastructure
>   intel_iommu: Bind/unbind guest page table to host
>   intel_iommu: ERRATA_772415 workaround
>   intel_iommu: Bypass replay in stage-1 page table mode
>   intel_iommu: Enable host device when x-flts=on in scalable mode
>
>  hw/i386/intel_iommu_internal.h     |   56 +
>  include/hw/i386/intel_iommu.h      |   33 +-
>  include/system/host_iommu_device.h |   40 +
>  include/system/iommufd.h           |   53 +
>  backends/iommufd.c                 |   58 +
>  hw/i386/intel_iommu.c              | 1660 ++++++++++++++++++++++++----
>  hw/vfio/common.c                   |   17 +-
>  hw/vfio/iommufd.c                  |   48 +
>  backends/trace-events              |    1 +
>  hw/i386/trace-events               |   13 +
>  10 files changed, 1776 insertions(+), 203 deletions(-)
>



^ permalink raw reply	[flat|nested] 68+ messages in thread

* RE: [PATCH rfcv2 00/20] intel_iommu: Enable stage-1 translation for passthrough device
  2025-02-20 19:03 ` [PATCH rfcv2 00/20] intel_iommu: Enable stage-1 translation for passthrough device Eric Auger
@ 2025-02-21  6:08   ` Duan, Zhenzhong
  0 siblings, 0 replies; 68+ messages in thread
From: Duan, Zhenzhong @ 2025-02-21  6:08 UTC (permalink / raw)
  To: eric.auger@redhat.com, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
	jasowang@redhat.com, peterx@redhat.com, jgg@nvidia.com,
	nicolinc@nvidia.com, shameerali.kolothum.thodi@huawei.com,
	joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
	Tian, Kevin, Liu, Yi L, Peng, Chao P

Hi Eric,

>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: Re: [PATCH rfcv2 00/20] intel_iommu: Enable stage-1 translation for
>passthrough device
>
>
>Hi Zhenzhong
>
>On 2/19/25 9:22 AM, Zhenzhong Duan wrote:
>> Hi,
>>
>> Per Jason Wang's suggestion, iommufd nesting series[1] is split into
>> "Enable stage-1 translation for emulated device" series and
>> "Enable stage-1 translation for passthrough device" series.
>>
>> This series is 2nd part focusing on passthrough device. We don't do
>> shadowing of guest page table for passthrough device but pass stage-1
>> page table to host side to construct a nested domain. There was some
>> effort to enable this feature in old days, see [2] for details.
>>
>> The key design is to utilize the dual-stage IOMMU translation
>> (also known as IOMMU nested translation) capability in host IOMMU.
>> As the below diagram shows, guest I/O page table pointer in GPA
>> (guest physical address) is passed to host and be used to perform
>s/be/is
>> the stage-1 address translation. Along with it, modifications to
>> present mappings in the guest I/O page table should be followed
>> with an IOTLB invalidation.
>>
>>         .-------------.  .---------------------------.
>>         |   vIOMMU    |  | Guest I/O page table      |
>>         |             |  '---------------------------'
>>         .----------------/
>>         | PASID Entry |--- PASID cache flush --+
>>         '-------------'                        |
>>         |             |                        V
>>         |             |           I/O page table pointer in GPA
>>         '-------------'
>>     Guest
>>     ------| Shadow |---------------------------|--------
>>           v        v                           v
>>     Host
>>         .-------------.  .------------------------.
>>         |   pIOMMU    |  |  FS for GIOVA->GPA     |
>>         |             |  '------------------------'
>>         .----------------/  |
>>         | PASID Entry |     V (Nested xlate)
>>         '----------------\.----------------------------------.
>>         |             |   | SS for GPA->HPA, unmanaged domain|
>>         |             |   '----------------------------------'
>>         '-------------'
>> Where:
>>  - FS = First stage page tables
>>  - SS = Second stage page tables
>> <Intel VT-d Nested translation>
>>
>> There are some interactions between VFIO and vIOMMU
>> * vIOMMU registers PCIIOMMUOps [set|unset]_iommu_device to PCI
>>   subsystem. VFIO calls them to register/unregister HostIOMMUDevice
>>   instance to vIOMMU at vfio device realize stage.
>> * vIOMMU calls HostIOMMUDeviceIOMMUFD interface [at|de]tach_hwpt
>>   to bind/unbind device to IOMMUFD backed domains, either nested
>>   domain or not.
>>
>> See below diagram:
>>
>>         VFIO Device                                 Intel IOMMU
>>     .-----------------.                         .-------------------.
>>     |                 |                         |                   |
>>     |       .---------|PCIIOMMUOps              |.-------------.    |
>>     |       | IOMMUFD |(set_iommu_device)       || Host IOMMU  |    |
>>     |       | Device  |------------------------>|| Device list |    |
>>     |       .---------|(unset_iommu_device)     |.-------------.    |
>>     |                 |                         |       |           |
>>     |                 |                         |       V           |
>>     |       .---------|  HostIOMMUDeviceIOMMUFD |  .-------------.  |
>>     |       | IOMMUFD |            (attach_hwpt)|  | Host IOMMU  |  |
>>     |       | link    |<------------------------|  |   Device    |  |
>>     |       .---------|            (detach_hwpt)|  .-------------.  |
>>     |                 |                         |       |           |
>>     |                 |                         |       ...         |
>>     .-----------------.                         .-------------------.
>>
>> Based on Yi's suggestion, this design is optimal in sharing ioas/hwpt
>> whenever possible and create new one on demand, also supports multiple
>> iommufd objects and ERRATA_772415.
>>
>> E.g., Stage-2 page table could be shared by different devices if there
>> is no conflict and devices link to same iommufd object, i.e. devices
>> under same host IOMMU can share same stage-2 page table. If there is
>> conflict, i.e. there is one device under non cache coherency mode
>> which is different from others, it requires a separate stage-2 page
>> table in non-CC mode.
>>
>> SPR platform has ERRATA_772415 which requires no readonly mappings
>> in stage-2 page table. This series supports creating VTDIOASContainer
>> with no readonly mappings. If there is a rare case that some IOMMUs
>> on a multiple IOMMU host have ERRATA_772415 and others not, this
>> design can still survive.
>>
>> See below example diagram for a full view:
>>
>>       IntelIOMMUState
>>              |
>>              V
>>     .------------------.    .------------------.    .-------------------.
>>     | VTDIOASContainer |--->| VTDIOASContainer |--->| VTDIOASContainer  |-->...
>>     | (iommufd0,RW&RO) |    | (iommufd1,RW&RO) |    | (iommufd0,RW only)|
>>     .------------------.    .------------------.    .-------------------.
>>              |                       |                              |
>>              |                       .-->...                        |
>>              V                                                      V
>>       .-------------------.    .-------------------.          .---------------.
>>       |   VTDS2Hwpt(CC)   |--->| VTDS2Hwpt(non-CC) |-->...    | VTDS2Hwpt(CC) |-->...
>>       .-------------------.    .-------------------.          .---------------.
>>           |            |               |                            |
>>           |            |               |                            |
>>     .-----------.  .-----------.  .------------.              .------------.
>>     | IOMMUFD   |  | IOMMUFD   |  | IOMMUFD    |              | IOMMUFD    |
>>     | Device(CC)|  | Device(CC)|  | Device     |              | Device(CC) |
>>     | (iommufd0)|  | (iommufd0)|  | (non-CC)   |              | (errata)   |
>>     |           |  |           |  | (iommufd0) |              | (iommufd0) |
>>     .-----------.  .-----------.  .------------.              .------------.
>>
>> This series is also a prerequisite work for vSVA, i.e. Sharing
>> guest application address space with passthrough devices.
>>
>> To enable stage-1 translation, only need to add "x-scalable-mode=on,x-flts=on".
>> i.e. -device intel-iommu,x-scalable-mode=on,x-flts=on...
>>
>> Passthrough device should use iommufd backend to work with stage-1
>translation.
>> i.e. -object iommufd,id=iommufd0 -device vfio-pci,iommufd=iommufd0,...
>>
>> If host doesn't support nested translation, qemu will fail with an unsupported
>> report.
>
>you're not mentioning lack of error reporting from HW S1 faults to
>guests. Are there other deps missing?

Good question, this will come in a future series. The plan is:

1) vtd nesting
2) pasid support
3) PRQ support (this includes S1 fault passing)

So to play with this series, we have to presume the guest kernel always
constructs a correct S1 page table for the passthrough device; for emulated
devices, the emulation code already provides S1 fault injection.

Thanks
Zhenzhong

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH rfcv2 09/20] intel_iommu: Rename vtd_ce_get_rid2pasid_entry to vtd_ce_get_pasid_entry
  2025-02-19  8:22 ` [PATCH rfcv2 09/20] intel_iommu: Rename vtd_ce_get_rid2pasid_entry to vtd_ce_get_pasid_entry Zhenzhong Duan
@ 2025-02-21  6:39   ` CLEMENT MATHIEU--DRIF
  2025-02-21 10:11   ` Eric Auger
  1 sibling, 0 replies; 68+ messages in thread
From: CLEMENT MATHIEU--DRIF @ 2025-02-21  6:39 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, eric.auger@redhat.com,
	mst@redhat.com, jasowang@redhat.com, peterx@redhat.com,
	jgg@nvidia.com, nicolinc@nvidia.com,
	shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
	kevin.tian@intel.com, yi.l.liu@intel.com, chao.p.peng@intel.com,
	Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
	Eduardo Habkost

Reviewed-by: Clément Mathieu--Drif<clement.mathieu--drif@eviden.com>

On 19/02/2025 09:22, Zhenzhong Duan wrote:
> 
> In early days vtd_ce_get_rid2pasid_entry() is used to get pasid entry of
> rid2pasid, then extend to any pasid. So a new name vtd_ce_get_pasid_entry
> is better to match its functions.
> 
> No functional change intended.
> 
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>   hw/i386/intel_iommu.c | 14 +++++++-------
>   1 file changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 7fde0603bf..df5fb30bc8 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -944,7 +944,7 @@ static int vtd_get_pe_from_pasid_table(IntelIOMMUState *s,
>       return 0;
>   }
> 
> -static int vtd_ce_get_rid2pasid_entry(IntelIOMMUState *s,
> +static int vtd_ce_get_pasid_entry(IntelIOMMUState *s,
>                                         VTDContextEntry *ce,
>                                         VTDPASIDEntry *pe,
>                                         uint32_t pasid)
> @@ -1025,7 +1025,7 @@ static uint32_t vtd_get_iova_level(IntelIOMMUState *s,
>       VTDPASIDEntry pe;
> 
>       if (s->root_scalable) {
> -        vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
> +        vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
>           if (s->flts) {
>               return VTD_PE_GET_FL_LEVEL(&pe);
>           } else {
> @@ -1048,7 +1048,7 @@ static uint32_t vtd_get_iova_agaw(IntelIOMMUState *s,
>       VTDPASIDEntry pe;
> 
>       if (s->root_scalable) {
> -        vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
> +        vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
>           return 30 + ((pe.val[0] >> 2) & VTD_SM_PASID_ENTRY_AW) * 9;
>       }
> 
> @@ -1116,7 +1116,7 @@ static dma_addr_t vtd_get_iova_pgtbl_base(IntelIOMMUState *s,
>       VTDPASIDEntry pe;
> 
>       if (s->root_scalable) {
> -        vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
> +        vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
>           if (s->flts) {
>               return pe.val[2] & VTD_SM_PASID_ENTRY_FLPTPTR;
>           } else {
> @@ -1522,7 +1522,7 @@ static int vtd_ce_rid2pasid_check(IntelIOMMUState *s,
>        * has valid rid2pasid setting, which includes valid
>        * rid2pasid field and corresponding pasid entry setting
>        */
> -    return vtd_ce_get_rid2pasid_entry(s, ce, &pe, PCI_NO_PASID);
> +    return vtd_ce_get_pasid_entry(s, ce, &pe, PCI_NO_PASID);
>   }
> 
>   /* Map a device to its corresponding domain (context-entry) */
> @@ -1611,7 +1611,7 @@ static uint16_t vtd_get_domain_id(IntelIOMMUState *s,
>       VTDPASIDEntry pe;
> 
>       if (s->root_scalable) {
> -        vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
> +        vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
>           return VTD_SM_PASID_ENTRY_DID(pe.val[1]);
>       }
> 
> @@ -1687,7 +1687,7 @@ static bool vtd_dev_pt_enabled(IntelIOMMUState *s, VTDContextEntry *ce,
>       int ret;
> 
>       if (s->root_scalable) {
> -        ret = vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
> +        ret = vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
>           if (ret) {
>               /*
>                * This error is guest triggerable. We should assumt PT
> --
> 2.34.1
> 

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH rfcv2 10/20] intel_iommu: Optimize context entry cache utilization
  2025-02-19  8:22 ` [PATCH rfcv2 10/20] intel_iommu: Optimize context entry cache utilization Zhenzhong Duan
@ 2025-02-21 10:00   ` Eric Auger
  2025-02-28  8:34     ` Duan, Zhenzhong
  0 siblings, 1 reply; 68+ messages in thread
From: Eric Auger @ 2025-02-21 10:00 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, mst, jasowang, peterx, jgg, nicolinc,
	shameerali.kolothum.thodi, joao.m.martins, clement.mathieu--drif,
	kevin.tian, yi.l.liu, chao.p.peng, Paolo Bonzini,
	Richard Henderson, Eduardo Habkost, Marcel Apfelbaum

Hi Zhenzhong,

On 2/19/25 9:22 AM, Zhenzhong Duan wrote:
> There are many call sites referencing context entry by calling
> vtd_as_to_context_entry() which will traverse the DMAR table.
didn't you mean vtd_dev_to_context_entry instead?
>
> In most cases we can use cached context entry in vtd_as->context_cache_entry
> except it's stale. Currently only global and domain context invalidation
> stales it.
s/states/stale

Eric
>
> So introduce a helper function vtd_as_to_context_entry() to fetch from cache
> before trying with vtd_dev_to_context_entry().
>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>  hw/i386/intel_iommu.c | 36 +++++++++++++++++++++++-------------
>  1 file changed, 23 insertions(+), 13 deletions(-)
>
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index df5fb30bc8..7709f55be5 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -1597,6 +1597,22 @@ static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
>      return 0;
>  }
>  
> +static int vtd_as_to_context_entry(VTDAddressSpace *vtd_as, VTDContextEntry *ce)
> +{
> +    IntelIOMMUState *s = vtd_as->iommu_state;
> +    uint8_t bus_num = pci_bus_num(vtd_as->bus);
> +    uint8_t devfn = vtd_as->devfn;
> +    VTDContextCacheEntry *cc_entry = &vtd_as->context_cache_entry;
> +
> +    /* Try to fetch context-entry from cache first */
> +    if (cc_entry->context_cache_gen == s->context_cache_gen) {
> +        *ce = cc_entry->context_entry;
> +        return 0;
> +    } else {
> +        return vtd_dev_to_context_entry(s, bus_num, devfn, ce);
> +    }
> +}
> +
>  static int vtd_sync_shadow_page_hook(const IOMMUTLBEvent *event,
>                                       void *private)
>  {
> @@ -1649,9 +1665,7 @@ static int vtd_address_space_sync(VTDAddressSpace *vtd_as)
>          return 0;
>      }
>  
> -    ret = vtd_dev_to_context_entry(vtd_as->iommu_state,
> -                                   pci_bus_num(vtd_as->bus),
> -                                   vtd_as->devfn, &ce);
> +    ret = vtd_as_to_context_entry(vtd_as, &ce);
>      if (ret) {
>          if (ret == -VTD_FR_CONTEXT_ENTRY_P) {
>              /*
> @@ -1710,8 +1724,7 @@ static bool vtd_as_pt_enabled(VTDAddressSpace *as)
>      assert(as);
>  
>      s = as->iommu_state;
> -    if (vtd_dev_to_context_entry(s, pci_bus_num(as->bus), as->devfn,
> -                                 &ce)) {
> +    if (vtd_as_to_context_entry(as, &ce)) {
>          /*
>           * Possibly failed to parse the context entry for some reason
>           * (e.g., during init, or any guest configuration errors on
> @@ -2443,8 +2456,7 @@ static void vtd_iotlb_domain_invalidate(IntelIOMMUState *s, uint16_t domain_id)
>      vtd_iommu_unlock(s);
>  
>      QLIST_FOREACH(vtd_as, &s->vtd_as_with_notifiers, next) {
> -        if (!vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
> -                                      vtd_as->devfn, &ce) &&
> +        if (!vtd_as_to_context_entry(vtd_as, &ce) &&
>              domain_id == vtd_get_domain_id(s, &ce, vtd_as->pasid)) {
>              vtd_address_space_sync(vtd_as);
>          }
> @@ -2466,8 +2478,7 @@ static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
>      hwaddr size = (1 << am) * VTD_PAGE_SIZE;
>  
>      QLIST_FOREACH(vtd_as, &(s->vtd_as_with_notifiers), next) {
> -        ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
> -                                       vtd_as->devfn, &ce);
> +        ret = vtd_as_to_context_entry(vtd_as, &ce);
>          if (!ret && domain_id == vtd_get_domain_id(s, &ce, vtd_as->pasid)) {
>              uint32_t rid2pasid = PCI_NO_PASID;
>  
> @@ -2974,8 +2985,7 @@ static void vtd_piotlb_pasid_invalidate(IntelIOMMUState *s,
>      vtd_iommu_unlock(s);
>  
>      QLIST_FOREACH(vtd_as, &s->vtd_as_with_notifiers, next) {
> -        if (!vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
> -                                      vtd_as->devfn, &ce) &&
> +        if (!vtd_as_to_context_entry(vtd_as, &ce) &&
>              domain_id == vtd_get_domain_id(s, &ce, vtd_as->pasid)) {
>              uint32_t rid2pasid = VTD_CE_GET_RID2PASID(&ce);
>  
> @@ -4154,7 +4164,7 @@ static void vtd_report_ir_illegal_access(VTDAddressSpace *vtd_as,
>      assert(vtd_as->pasid != PCI_NO_PASID);
>  
>      /* Try out best to fetch FPD, we can't do anything more */
> -    if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {
> +    if (vtd_as_to_context_entry(vtd_as, &ce) == 0) {
>          is_fpd_set = ce.lo & VTD_CONTEXT_ENTRY_FPD;
>          if (!is_fpd_set && s->root_scalable) {
>              vtd_ce_get_pasid_fpd(s, &ce, &is_fpd_set, vtd_as->pasid);
> @@ -4491,7 +4501,7 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
>      /* replay is protected by BQL, page walk will re-setup it safely */
>      iova_tree_remove(vtd_as->iova_tree, map);
>  
> -    if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {
> +    if (vtd_as_to_context_entry(vtd_as, &ce) == 0) {
>          trace_vtd_replay_ce_valid(s->root_scalable ? "scalable mode" :
>                                    "legacy mode",
>                                    bus_n, PCI_SLOT(vtd_as->devfn),
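
The helper above follows the usual generation-counter caching scheme; here
is a standalone sketch of just that idea, with purely illustrative names:

    /*
     * Sketch: a cached entry is valid only while its generation matches
     * the global one. Bumping the global generation invalidates every
     * cached entry at once; on a miss the caller re-walks the table and
     * refills the cache.
     */
    #include <stdbool.h>
    #include <stdint.h>

    struct example_cache_entry {
        uint32_t gen;
        uint64_t value;
    };

    bool example_cache_lookup(const struct example_cache_entry *e,
                              uint32_t global_gen, uint64_t *out)
    {
        if (e->gen == global_gen) {
            *out = e->value;        /* hit: reuse the cached copy */
            return true;
        }
        return false;               /* stale: fall back to the slow path */
    }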



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH rfcv2 09/20] intel_iommu: Rename vtd_ce_get_rid2pasid_entry to vtd_ce_get_pasid_entry
  2025-02-19  8:22 ` [PATCH rfcv2 09/20] intel_iommu: Rename vtd_ce_get_rid2pasid_entry to vtd_ce_get_pasid_entry Zhenzhong Duan
  2025-02-21  6:39   ` CLEMENT MATHIEU--DRIF
@ 2025-02-21 10:11   ` Eric Auger
  2025-02-28  8:47     ` Duan, Zhenzhong
  1 sibling, 1 reply; 68+ messages in thread
From: Eric Auger @ 2025-02-21 10:11 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, mst, jasowang, peterx, jgg, nicolinc,
	shameerali.kolothum.thodi, joao.m.martins, clement.mathieu--drif,
	kevin.tian, yi.l.liu, chao.p.peng, Marcel Apfelbaum,
	Paolo Bonzini, Richard Henderson, Eduardo Habkost

Hi Zhenzhong,


On 2/19/25 9:22 AM, Zhenzhong Duan wrote:
> In early days vtd_ce_get_rid2pasid_entry() is used to get pasid entry of
is/was
> rid2pasid, then extend to any pasid. So a new name vtd_ce_get_pasid_entry
then it was extended to get any pasid entry?
> is better to match its functions.
to match what it actually does?

I do not know the vtd spec very well, so I searched for rid2pasid and did
not find any reference. I think I understand what the pasid entry from the
pasid table is, though, so the renaming does make sense to me.

Eric
>
> No functional change intended.
>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>  hw/i386/intel_iommu.c | 14 +++++++-------
>  1 file changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 7fde0603bf..df5fb30bc8 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -944,7 +944,7 @@ static int vtd_get_pe_from_pasid_table(IntelIOMMUState *s,
>      return 0;
>  }
>  
> -static int vtd_ce_get_rid2pasid_entry(IntelIOMMUState *s,
> +static int vtd_ce_get_pasid_entry(IntelIOMMUState *s,
>                                        VTDContextEntry *ce,
>                                        VTDPASIDEntry *pe,
>                                        uint32_t pasid)
> @@ -1025,7 +1025,7 @@ static uint32_t vtd_get_iova_level(IntelIOMMUState *s,
>      VTDPASIDEntry pe;
>  
>      if (s->root_scalable) {
> -        vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
> +        vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
>          if (s->flts) {
>              return VTD_PE_GET_FL_LEVEL(&pe);
>          } else {
> @@ -1048,7 +1048,7 @@ static uint32_t vtd_get_iova_agaw(IntelIOMMUState *s,
>      VTDPASIDEntry pe;
>  
>      if (s->root_scalable) {
> -        vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
> +        vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
>          return 30 + ((pe.val[0] >> 2) & VTD_SM_PASID_ENTRY_AW) * 9;
>      }
>  
> @@ -1116,7 +1116,7 @@ static dma_addr_t vtd_get_iova_pgtbl_base(IntelIOMMUState *s,
>      VTDPASIDEntry pe;
>  
>      if (s->root_scalable) {
> -        vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
> +        vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
>          if (s->flts) {
>              return pe.val[2] & VTD_SM_PASID_ENTRY_FLPTPTR;
>          } else {
> @@ -1522,7 +1522,7 @@ static int vtd_ce_rid2pasid_check(IntelIOMMUState *s,
>       * has valid rid2pasid setting, which includes valid
>       * rid2pasid field and corresponding pasid entry setting
>       */
> -    return vtd_ce_get_rid2pasid_entry(s, ce, &pe, PCI_NO_PASID);
> +    return vtd_ce_get_pasid_entry(s, ce, &pe, PCI_NO_PASID);
>  }
>  
>  /* Map a device to its corresponding domain (context-entry) */
> @@ -1611,7 +1611,7 @@ static uint16_t vtd_get_domain_id(IntelIOMMUState *s,
>      VTDPASIDEntry pe;
>  
>      if (s->root_scalable) {
> -        vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
> +        vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
>          return VTD_SM_PASID_ENTRY_DID(pe.val[1]);
>      }
>  
> @@ -1687,7 +1687,7 @@ static bool vtd_dev_pt_enabled(IntelIOMMUState *s, VTDContextEntry *ce,
>      int ret;
>  
>      if (s->root_scalable) {
> -        ret = vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
> +        ret = vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
>          if (ret) {
>              /*
>               * This error is guest triggerable. We should assumt PT



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH rfcv2 11/20] intel_iommu: Check for compatibility with IOMMUFD backed device when x-flts=on
  2025-02-19  8:22 ` [PATCH rfcv2 11/20] intel_iommu: Check for compatibility with IOMMUFD backed device when x-flts=on Zhenzhong Duan
@ 2025-02-21 12:49   ` Eric Auger
  2025-02-21 14:18     ` Eric Auger
  2025-02-28  8:57     ` Duan, Zhenzhong
  0 siblings, 2 replies; 68+ messages in thread
From: Eric Auger @ 2025-02-21 12:49 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, mst, jasowang, peterx, jgg, nicolinc,
	shameerali.kolothum.thodi, joao.m.martins, clement.mathieu--drif,
	kevin.tian, yi.l.liu, chao.p.peng, Paolo Bonzini,
	Richard Henderson, Eduardo Habkost, Marcel Apfelbaum

Hi Zhenzhong,


On 2/19/25 9:22 AM, Zhenzhong Duan wrote:
> When vIOMMU is configured x-flts=on in scalable mode, stage-1 page table
> is passed to host to construct nested page table. We need to check
> compatibility of some critical IOMMU capabilities between vIOMMU and
> host IOMMU to ensure guest stage-1 page table could be used by host.
>
> For instance, vIOMMU supports stage-1 1GB huge page mapping, but host
> does not, then this IOMMUFD backed device should be failed.
is this 1GB huge page mapping a requirement for SIOV?
>
> Declare an enum type host_iommu_device_iommu_hw_info_type aliased to
> iommu_hw_info_type which come from iommufd header file. This can avoid
s/come/comes
> build failure on windows which doesn't support iommufd.
>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>  include/system/host_iommu_device.h | 13 ++++++++++++
>  hw/i386/intel_iommu.c              | 34 ++++++++++++++++++++++++++++++
>  2 files changed, 47 insertions(+)
>
> diff --git a/include/system/host_iommu_device.h b/include/system/host_iommu_device.h
> index 250600fc1d..aa3885d7ee 100644
> --- a/include/system/host_iommu_device.h
> +++ b/include/system/host_iommu_device.h
> @@ -133,5 +133,18 @@ struct HostIOMMUDeviceClass {
>  #define HOST_IOMMU_DEVICE_CAP_FS1GP             3
>  #define HOST_IOMMU_DEVICE_CAP_ERRATA            4
>  
> +/**
> + * enum host_iommu_device_iommu_hw_info_type - IOMMU Hardware Info Types
> + * @HOST_IOMMU_DEVICE_IOMMU_HW_INFO_TYPE_NONE: Used by the drivers that do not
> + *                                             report hardware info
> + * @HOST_IOMMU_DEVICE_IOMMU_HW_INFO_TYPE_INTEL_VTD: Intel VT-d iommu info type
> + *
> + * This is alias to enum iommu_hw_info_type but for general purpose.
> + */
> +enum host_iommu_device_iommu_hw_info_type {
> +    HOST_IOMMU_DEVICE_IOMMU_HW_INFO_TYPE_NONE,
> +    HOST_IOMMU_DEVICE_IOMMU_HW_INFO_TYPE_INTEL_VTD,
> +};
> +
>  #define HOST_IOMMU_DEVICE_CAP_AW_BITS_MAX       64
>  #endif
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 7709f55be5..9de60e607d 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -39,6 +39,7 @@
>  #include "kvm/kvm_i386.h"
>  #include "migration/vmstate.h"
>  #include "trace.h"
> +#include "system/iommufd.h"
>  
>  /* context entry operations */
>  #define VTD_CE_GET_RID2PASID(ce) \
> @@ -4346,6 +4347,39 @@ static bool vtd_check_hiod(IntelIOMMUState *s, HostIOMMUDevice *hiod,
>          return true;
>      }
>  
> +    /* Remaining checks are all stage-1 translation specific */
> +    if (!object_dynamic_cast(OBJECT(hiod), TYPE_HOST_IOMMU_DEVICE_IOMMUFD)) {
> +        error_setg(errp, "Need IOMMUFD backend when x-flts=on");
> +        return false;
> +    }
> +
> +    ret = hiodc->get_cap(hiod, HOST_IOMMU_DEVICE_CAP_IOMMU_TYPE, errp);
> +    if (ret < 0) {
> +        return false;
Can't you simply rely on the check below?
> +    }
> +    if (ret != HOST_IOMMU_DEVICE_IOMMU_HW_INFO_TYPE_INTEL_VTD) {
> +        error_setg(errp, "Incompatible host platform IOMMU type %d", ret);
> +        return false;
> +    }
> +
> +    ret = hiodc->get_cap(hiod, HOST_IOMMU_DEVICE_CAP_NESTING, errp);
> +    if (ret < 0) {
> +        return false;
> +    }
same here
> +    if (ret != 1) {
> +        error_setg(errp, "Host IOMMU doesn't support nested translation");
> +        return false;
> +    }
> +
> +    ret = hiodc->get_cap(hiod, HOST_IOMMU_DEVICE_CAP_FS1GP, errp);
> +    if (ret < 0) {
> +        return false;
> +    }
> +    if (s->fs1gp && ret != 1) {
looking in the vtd spec I don't find FS1GP. Is it the same as FL1GP?
Maybe I am not looking at the correct spec though. Why do you need to check
both ret and fs1gp?
Also, why do you need a member to store the cap? It looks like FL1GP can
only take a 0 or 1 value?
> +        error_setg(errp, "Stage-1 1GB huge page is unsupported by host IOMMU");
> +        return false;
> +    }
> +
>      error_setg(errp, "host device is uncompatible with stage-1 translation");
>      return false;
>  }
Eric



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH rfcv2 12/20] intel_iommu: Introduce a new structure VTDHostIOMMUDevice
  2025-02-19  8:22 ` [PATCH rfcv2 12/20] intel_iommu: Introduce a new structure VTDHostIOMMUDevice Zhenzhong Duan
@ 2025-02-21 13:03   ` Eric Auger
  2025-02-28  8:58     ` Duan, Zhenzhong
  0 siblings, 1 reply; 68+ messages in thread
From: Eric Auger @ 2025-02-21 13:03 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, mst, jasowang, peterx, jgg, nicolinc,
	shameerali.kolothum.thodi, joao.m.martins, clement.mathieu--drif,
	kevin.tian, yi.l.liu, chao.p.peng, Marcel Apfelbaum,
	Paolo Bonzini, Richard Henderson, Eduardo Habkost




On 2/19/25 9:22 AM, Zhenzhong Duan wrote:
> Introduce a new structure VTDHostIOMMUDevice which replaces
> HostIOMMUDevice to be stored in hash table.
>
> It includes a reference to HostIOMMUDevice and IntelIOMMUState,
> also includes BDF information which will be used in future
> patches.
>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>  hw/i386/intel_iommu_internal.h |  7 +++++++
>  include/hw/i386/intel_iommu.h  |  2 +-
>  hw/i386/intel_iommu.c          | 14 ++++++++++++--
>  3 files changed, 20 insertions(+), 3 deletions(-)
>
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> index 2cda744786..18bc22fc72 100644
> --- a/hw/i386/intel_iommu_internal.h
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -28,6 +28,7 @@
>  #ifndef HW_I386_INTEL_IOMMU_INTERNAL_H
>  #define HW_I386_INTEL_IOMMU_INTERNAL_H
>  #include "hw/i386/intel_iommu.h"
> +#include "system/host_iommu_device.h"
>  
>  /*
>   * Intel IOMMU register specification
> @@ -608,4 +609,10 @@ typedef struct VTDRootEntry VTDRootEntry;
>  /* Bits to decide the offset for each level */
>  #define VTD_LEVEL_BITS           9
>  
> +typedef struct VTDHostIOMMUDevice {
> +    IntelIOMMUState *iommu_state;
> +    PCIBus *bus;
> +    uint8_t devfn;
Just to make sure: the parent HostIOMMUDevice has aliased_bus and
aliased_devfn. Can you explain why you need both the aliased and
non-aliased info?

Thanks

Eric

> +    HostIOMMUDevice *hiod;
> +} VTDHostIOMMUDevice;
>  #endif
> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> index e95477e855..50f9b27a45 100644
> --- a/include/hw/i386/intel_iommu.h
> +++ b/include/hw/i386/intel_iommu.h
> @@ -295,7 +295,7 @@ struct IntelIOMMUState {
>      /* list of registered notifiers */
>      QLIST_HEAD(, VTDAddressSpace) vtd_as_with_notifiers;
>  
> -    GHashTable *vtd_host_iommu_dev;             /* HostIOMMUDevice */
> +    GHashTable *vtd_host_iommu_dev;             /* VTDHostIOMMUDevice */
>  
>      /* interrupt remapping */
>      bool intr_enabled;              /* Whether guest enabled IR */
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 9de60e607d..fafa199f52 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -281,7 +281,10 @@ static gboolean vtd_hiod_equal(gconstpointer v1, gconstpointer v2)
>  
>  static void vtd_hiod_destroy(gpointer v)
>  {
> -    object_unref(v);
> +    VTDHostIOMMUDevice *vtd_hiod = v;
> +
> +    object_unref(vtd_hiod->hiod);
> +    g_free(vtd_hiod);
>  }
>  
>  static gboolean vtd_hash_remove_by_domain(gpointer key, gpointer value,
> @@ -4388,6 +4391,7 @@ static bool vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
>                                       HostIOMMUDevice *hiod, Error **errp)
>  {
>      IntelIOMMUState *s = opaque;
> +    VTDHostIOMMUDevice *vtd_hiod;
>      struct vtd_as_key key = {
>          .bus = bus,
>          .devfn = devfn,
> @@ -4404,6 +4408,12 @@ static bool vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
>          return false;
>      }
>  
> +    vtd_hiod = g_malloc0(sizeof(VTDHostIOMMUDevice));
> +    vtd_hiod->bus = bus;
> +    vtd_hiod->devfn = (uint8_t)devfn;
> +    vtd_hiod->iommu_state = s;
> +    vtd_hiod->hiod = hiod;
> +
>      if (!vtd_check_hiod(s, hiod, errp)) {
>          vtd_iommu_unlock(s);
>          return false;
> @@ -4414,7 +4424,7 @@ static bool vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
>      new_key->devfn = devfn;
>  
>      object_ref(hiod);
> -    g_hash_table_insert(s->vtd_host_iommu_dev, new_key, hiod);
> +    g_hash_table_insert(s->vtd_host_iommu_dev, new_key, vtd_hiod);
>  
>      vtd_iommu_unlock(s);
>  
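
For context, the bus/devfn kept in VTDHostIOMMUDevice is typically folded
into a 16-bit source-id later in the series (PCI_BUILD_BDF in the PASID
cache patch); the standard PCI encoding is simply bus in the high byte and
devfn in the low byte, as in this trivial sketch:

    #include <stdint.h>

    /* Standard PCI requester-id (source-id) packing, mirroring
     * PCI_BUILD_BDF: bus number in the high byte, device/function in the
     * low byte. */
    static inline uint16_t example_build_sid(uint8_t bus, uint8_t devfn)
    {
        return (uint16_t)(((uint16_t)bus << 8) | devfn);
    }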



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH rfcv2 11/20] intel_iommu: Check for compatibility with IOMMUFD backed device when x-flts=on
  2025-02-21 12:49   ` Eric Auger
@ 2025-02-21 14:18     ` Eric Auger
  2025-02-28  8:57     ` Duan, Zhenzhong
  1 sibling, 0 replies; 68+ messages in thread
From: Eric Auger @ 2025-02-21 14:18 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, mst, jasowang, peterx, jgg, nicolinc,
	shameerali.kolothum.thodi, joao.m.martins, clement.mathieu--drif,
	kevin.tian, yi.l.liu, chao.p.peng, Paolo Bonzini,
	Richard Henderson, Eduardo Habkost, Marcel Apfelbaum



On 2/21/25 1:49 PM, Eric Auger wrote:
> Hi Zhenzhong,
> 
> 
> On 2/19/25 9:22 AM, Zhenzhong Duan wrote:
>> When vIOMMU is configured x-flts=on in scalable mode, stage-1 page table
>> is passed to host to construct nested page table. We need to check
>> compatibility of some critical IOMMU capabilities between vIOMMU and
>> host IOMMU to ensure guest stage-1 page table could be used by host.
>>
>> For instance, vIOMMU supports stage-1 1GB huge page mapping, but host
>> does not, then this IOMMUFD backed device should be failed.
> is this 1GB huge page mapping a requirement for SIOV?
>>
>> Declare an enum type host_iommu_device_iommu_hw_info_type aliased to
>> iommu_hw_info_type which come from iommufd header file. This can avoid
> s/come/comes
>> build failure on windows which doesn't support iommufd.
>>
>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>>  include/system/host_iommu_device.h | 13 ++++++++++++
>>  hw/i386/intel_iommu.c              | 34 ++++++++++++++++++++++++++++++
>>  2 files changed, 47 insertions(+)
>>
>> diff --git a/include/system/host_iommu_device.h b/include/system/host_iommu_device.h
>> index 250600fc1d..aa3885d7ee 100644
>> --- a/include/system/host_iommu_device.h
>> +++ b/include/system/host_iommu_device.h
>> @@ -133,5 +133,18 @@ struct HostIOMMUDeviceClass {
>>  #define HOST_IOMMU_DEVICE_CAP_FS1GP             3
>>  #define HOST_IOMMU_DEVICE_CAP_ERRATA            4
>>  
>> +/**
>> + * enum host_iommu_device_iommu_hw_info_type - IOMMU Hardware Info Types
>> + * @HOST_IOMMU_DEVICE_IOMMU_HW_INFO_TYPE_NONE: Used by the drivers that do not
>> + *                                             report hardware info
>> + * @HOST_IOMMU_DEVICE_IOMMU_HW_INFO_TYPE_INTEL_VTD: Intel VT-d iommu info type
>> + *
>> + * This is alias to enum iommu_hw_info_type but for general purpose.
>> + */
>> +enum host_iommu_device_iommu_hw_info_type {
>> +    HOST_IOMMU_DEVICE_IOMMU_HW_INFO_TYPE_NONE,
>> +    HOST_IOMMU_DEVICE_IOMMU_HW_INFO_TYPE_INTEL_VTD,
>> +};
>> +
>>  #define HOST_IOMMU_DEVICE_CAP_AW_BITS_MAX       64
>>  #endif
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index 7709f55be5..9de60e607d 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -39,6 +39,7 @@
>>  #include "kvm/kvm_i386.h"
>>  #include "migration/vmstate.h"
>>  #include "trace.h"
>> +#include "system/iommufd.h"
>>  
>>  /* context entry operations */
>>  #define VTD_CE_GET_RID2PASID(ce) \
>> @@ -4346,6 +4347,39 @@ static bool vtd_check_hiod(IntelIOMMUState *s, HostIOMMUDevice *hiod,
>>          return true;
>>      }
>>  
>> +    /* Remaining checks are all stage-1 translation specific */
>> +    if (!object_dynamic_cast(OBJECT(hiod), TYPE_HOST_IOMMU_DEVICE_IOMMUFD)) {
>> +        error_setg(errp, "Need IOMMUFD backend when x-flts=on");
>> +        return false;
>> +    }
>> +
>> +    ret = hiodc->get_cap(hiod, HOST_IOMMU_DEVICE_CAP_IOMMU_TYPE, errp);
>> +    if (ret < 0) {
>> +        return false;
> Can't you simply rely on the check below?
>> +    }
>> +    if (ret != HOST_IOMMU_DEVICE_IOMMU_HW_INFO_TYPE_INTEL_VTD) {
>> +        error_setg(errp, "Incompatible host platform IOMMU type %d", ret);
>> +        return false;
>> +    }
>> +
>> +    ret = hiodc->get_cap(hiod, HOST_IOMMU_DEVICE_CAP_NESTING, errp);
>> +    if (ret < 0) {
>> +        return false;
>> +    }
> same here
>> +    if (ret != 1) {
>> +        error_setg(errp, "Host IOMMU doesn't support nested translation");
>> +        return false;
>> +    }
>> +
>> +    ret = hiodc->get_cap(hiod, HOST_IOMMU_DEVICE_CAP_FS1GP, errp);
>> +    if (ret < 0) {
>> +        return false;
>> +    }
>> +    if (s->fs1gp && ret != 1) {
> looking in the vtd spec I don't find FS1GP. Is it the same as FL1GP?
I am now looking at the spec rev from June 22 and it seems it has been
renamed. So please ignore this comment.

Eric
> Maybe I am not looking at the correct spec though. Why do you need to check
> both ret and fs1gp?
> Also, why do you need a member to store the cap? It looks like FL1GP can
> only take a 0 or 1 value?
>> +        error_setg(errp, "Stage-1 1GB huge page is unsupported by host IOMMU");
>> +        return false;
>> +    }
>> +
>>      error_setg(errp, "host device is uncompatible with stage-1 translation");
>>      return false;
>>  }
> Eric



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH rfcv2 13/20] intel_iommu: Add PASID cache management infrastructure
  2025-02-19  8:22 ` [PATCH rfcv2 13/20] intel_iommu: Add PASID cache management infrastructure Zhenzhong Duan
@ 2025-02-21 17:02   ` Eric Auger
  2025-02-28  9:35     ` Duan, Zhenzhong
  0 siblings, 1 reply; 68+ messages in thread
From: Eric Auger @ 2025-02-21 17:02 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, mst, jasowang, peterx, jgg, nicolinc,
	shameerali.kolothum.thodi, joao.m.martins, clement.mathieu--drif,
	kevin.tian, yi.l.liu, chao.p.peng, Yi Sun, Paolo Bonzini,
	Richard Henderson, Eduardo Habkost, Marcel Apfelbaum




On 2/19/25 9:22 AM, Zhenzhong Duan wrote:
> This adds an new entry VTDPASIDCacheEntry in VTDAddressSpace to cache the
> pasid entry and track PASID usage and future PASID tagged DMA address
> translation support in vIOMMU.
>
> VTDAddressSpace of PCI_NO_PASID is allocated when device is plugged and
> never freed. For other pasid, VTDAddressSpace instance is created/destroyed
> per the guest pasid entry set up/destroy for passthrough devices. While for
> emulated devices, VTDAddressSpace instance is created in the PASID tagged DMA
> translation and be destroyed per guest PASID cache invalidation. This focuses
> on the PASID cache management for passthrough devices as there is no PASID
> capable emulated devices yet.
>
> When guest modifies a PASID entry, QEMU will capture the guest pasid selective
> pasid cache invalidation, allocate or remove a VTDAddressSpace instance per the
> invalidation reasons:
>
>     *) a present pasid entry moved to non-present
>     *) a present pasid entry to be a present entry
>     *) a non-present pasid entry moved to present
>
> vIOMMU emulator could figure out the reason by fetching latest guest pasid entry
> and compare it with the PASID cache.
>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>  hw/i386/intel_iommu_internal.h |  29 ++
>  include/hw/i386/intel_iommu.h  |   6 +
>  hw/i386/intel_iommu.c          | 484 ++++++++++++++++++++++++++++++++-
Don't you have a way to split this patch? It has a huge change set and
this is really heavy to digest at once (at least for me).
>  hw/i386/trace-events           |   4 +
>  4 files changed, 513 insertions(+), 10 deletions(-)
>
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> index 18bc22fc72..632fda2853 100644
> --- a/hw/i386/intel_iommu_internal.h
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -315,6 +315,7 @@ typedef enum VTDFaultReason {
>                                    * request while disabled */
>      VTD_FR_IR_SID_ERR = 0x26,   /* Invalid Source-ID */
>  
> +    VTD_FR_RTADDR_INV_TTM = 0x31,  /* Invalid TTM in RTADDR */
>      /* PASID directory entry access failure */
>      VTD_FR_PASID_DIR_ACCESS_ERR = 0x50,
>      /* The Present(P) field of pasid directory entry is 0 */
> @@ -492,6 +493,15 @@ typedef union VTDInvDesc VTDInvDesc;
>  #define VTD_INV_DESC_PIOTLB_RSVD_VAL0     0xfff000000000f1c0ULL
>  #define VTD_INV_DESC_PIOTLB_RSVD_VAL1     0xf80ULL
>  
> +#define VTD_INV_DESC_PASIDC_G          (3ULL << 4)
> +#define VTD_INV_DESC_PASIDC_PASID(val) (((val) >> 32) & 0xfffffULL)
> +#define VTD_INV_DESC_PASIDC_DID(val)   (((val) >> 16) & VTD_DOMAIN_ID_MASK)
> +#define VTD_INV_DESC_PASIDC_RSVD_VAL0  0xfff000000000f1c0ULL
> +
> +#define VTD_INV_DESC_PASIDC_DSI        (0ULL << 4)
> +#define VTD_INV_DESC_PASIDC_PASID_SI   (1ULL << 4)
> +#define VTD_INV_DESC_PASIDC_GLOBAL     (3ULL << 4)
> +
>  /* Information about page-selective IOTLB invalidate */
>  struct VTDIOTLBPageInvInfo {
>      uint16_t domain_id;
> @@ -548,10 +558,28 @@ typedef struct VTDRootEntry VTDRootEntry;
>  #define VTD_CTX_ENTRY_LEGACY_SIZE     16
>  #define VTD_CTX_ENTRY_SCALABLE_SIZE   32
>  
> +#define VTD_SM_CONTEXT_ENTRY_PDTS(val)      (((val) >> 9) & 0x7)
>  #define VTD_SM_CONTEXT_ENTRY_RID2PASID_MASK 0xfffff
>  #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw)  (0x1e0ULL | ~VTD_HAW_MASK(aw))
>  #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1      0xffffffffffe00000ULL
>  
> +typedef enum VTDPCInvType {
> +    /* force reset all */
> +    VTD_PASID_CACHE_FORCE_RESET = 0,
> +    /* pasid cache invalidation rely on guest PASID entry */
> +    VTD_PASID_CACHE_GLOBAL_INV,
> +    VTD_PASID_CACHE_DOMSI,
> +    VTD_PASID_CACHE_PASIDSI,
> +} VTDPCInvType;
> +
> +typedef struct VTDPASIDCacheInfo {
> +    VTDPCInvType type;
> +    uint16_t domain_id;
> +    uint32_t pasid;
> +    PCIBus *bus;
> +    uint16_t devfn;
> +} VTDPASIDCacheInfo;
> +
>  /* PASID Table Related Definitions */
>  #define VTD_PASID_DIR_BASE_ADDR_MASK  (~0xfffULL)
>  #define VTD_PASID_TABLE_BASE_ADDR_MASK (~0xfffULL)
> @@ -563,6 +591,7 @@ typedef struct VTDRootEntry VTDRootEntry;
>  #define VTD_PASID_TABLE_BITS_MASK     (0x3fULL)
>  #define VTD_PASID_TABLE_INDEX(pasid)  ((pasid) & VTD_PASID_TABLE_BITS_MASK)
>  #define VTD_PASID_ENTRY_FPD           (1ULL << 1) /* Fault Processing Disable */
> +#define VTD_PASID_TBL_ENTRY_NUM       (1ULL << 6)
>  
>  /* PASID Granular Translation Type Mask */
>  #define VTD_PASID_ENTRY_P              1ULL
> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> index 50f9b27a45..fbc9da903a 100644
> --- a/include/hw/i386/intel_iommu.h
> +++ b/include/hw/i386/intel_iommu.h
> @@ -95,6 +95,11 @@ struct VTDPASIDEntry {
>      uint64_t val[8];
>  };
>  
> +typedef struct VTDPASIDCacheEntry {
> +    struct VTDPASIDEntry pasid_entry;
> +    bool cache_filled;
> +} VTDPASIDCacheEntry;
> +
>  struct VTDAddressSpace {
>      PCIBus *bus;
>      uint8_t devfn;
> @@ -107,6 +112,7 @@ struct VTDAddressSpace {
>      MemoryRegion iommu_ir_fault; /* Interrupt region for catching fault */
>      IntelIOMMUState *iommu_state;
>      VTDContextCacheEntry context_cache_entry;
> +    VTDPASIDCacheEntry pasid_cache_entry;
>      QLIST_ENTRY(VTDAddressSpace) next;
>      /* Superset of notifier flags that this address space has */
>      IOMMUNotifierFlag notifier_flags;
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index fafa199f52..b8f3b85803 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -86,6 +86,8 @@ struct vtd_iotlb_key {
>  static void vtd_address_space_refresh_all(IntelIOMMUState *s);
>  static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
>  
> +static void vtd_pasid_cache_reset(IntelIOMMUState *s);
use the _locked suffix to be consistent with the others and emphasize that
the lock is held?
> +
>  static void vtd_panic_require_caching_mode(void)
>  {
>      error_report("We need to set caching-mode=on for intel-iommu to enable "
> @@ -390,6 +392,7 @@ static void vtd_reset_caches(IntelIOMMUState *s)
>      vtd_iommu_lock(s);
>      vtd_reset_iotlb_locked(s);
>      vtd_reset_context_cache_locked(s);
> +    vtd_pasid_cache_reset(s);
>      vtd_iommu_unlock(s);
>  }
>  
> @@ -825,6 +828,16 @@ static inline bool vtd_pe_type_check(IntelIOMMUState *s, VTDPASIDEntry *pe)
>      }
>  }
>  
> +static inline uint32_t vtd_sm_ce_get_pdt_entry_num(VTDContextEntry *ce)
> +{
> +    return 1U << (VTD_SM_CONTEXT_ENTRY_PDTS(ce->val[0]) + 7);
> +}
> +
> +static inline uint16_t vtd_pe_get_domain_id(VTDPASIDEntry *pe)

nit: vtd_pe_get_did, as the field is named DID?

> +{
> +    return VTD_SM_PASID_ENTRY_DID((pe)->val[1]);
> +}
> +
>  static inline bool vtd_pdire_present(VTDPASIDDirEntry *pdire)
>  {
>      return pdire->val & 1;
> @@ -1617,6 +1630,54 @@ static int vtd_as_to_context_entry(VTDAddressSpace *vtd_as, VTDContextEntry *ce)
>      }
>  }
>  
> +/* Translate to iommu pasid if PCI_NO_PASID */
I don't really get the comment above. At best, shouldn't it be put in the
(vtd_as->pasid == PCI_NO_PASID) block?

What you call "iommu pasid" is the value set in RID_PASID, right? Is "iommu pasid" conventional terminology?

> +static int vtd_as_to_iommu_pasid(VTDAddressSpace *vtd_as, uint32_t *pasid)
> +{
> +    VTDContextEntry ce;
> +    int ret;
> +
> +    ret = vtd_as_to_context_entry(vtd_as, &ce);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    if (vtd_as->pasid == PCI_NO_PASID) {
> +        *pasid = VTD_CE_GET_RID2PASID(&ce);
This is called RID_PASID in the spec. I think it would be easier for the
reader if we had a direct match with the named fields so that we can
easily search the spec.
> +    } else {
> +        *pasid = vtd_as->pasid;
> +    }
> +
> +    return 0;
> +}
> +
> +static gboolean vtd_find_as_by_sid_and_iommu_pasid(gpointer key, gpointer value,
> +                                                   gpointer user_data)
why iommu_pasid and not directly pasid?
> +{
> +    VTDAddressSpace *vtd_as = (VTDAddressSpace *)value;
> +    struct vtd_as_raw_key *target = (struct vtd_as_raw_key *)user_data;
> +    uint16_t sid = PCI_BUILD_BDF(pci_bus_num(vtd_as->bus), vtd_as->devfn);
> +    uint32_t pasid;
> +
> +    if (vtd_as_to_iommu_pasid(vtd_as, &pasid)) {
> +        return false;
> +    }
> +
> +    return (pasid == target->pasid) && (sid == target->sid);
> +}
> +
> +/* Translate iommu pasid to vtd_as */
> +static VTDAddressSpace *vtd_as_from_iommu_pasid(IntelIOMMUState *s,
> +                                                uint16_t sid, uint32_t pasid)
> +{
> +    struct vtd_as_raw_key key = {
> +        .sid = sid,
> +        .pasid = pasid
> +    };
> +
> +    return g_hash_table_find(s->vtd_address_spaces,
> +                             vtd_find_as_by_sid_and_iommu_pasid, &key);
> +}
> +
>  static int vtd_sync_shadow_page_hook(const IOMMUTLBEvent *event,
>                                       void *private)
>  {
> @@ -3062,6 +3123,412 @@ static bool vtd_process_piotlb_desc(IntelIOMMUState *s,
>      return true;
>  }
>  
> +static inline int vtd_dev_get_pe_from_pasid(VTDAddressSpace *vtd_as,
> +                                            uint32_t pasid, VTDPASIDEntry *pe)
does "pe" means pasid entry? It is not obvious for a dummy reader like
me. May be worth a comment at least once.
> +{
> +    IntelIOMMUState *s = vtd_as->iommu_state;
> +    VTDContextEntry ce;
> +    int ret;
> +
> +    if (!s->root_scalable) {
> +        return -VTD_FR_RTADDR_INV_TTM;
> +    }
> +
> +    ret = vtd_as_to_context_entry(vtd_as, &ce);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    return vtd_ce_get_pasid_entry(s, &ce, pe, pasid);
> +}
> +
> +static bool vtd_pasid_entry_compare(VTDPASIDEntry *p1, VTDPASIDEntry *p2)
> +{
> +    return !memcmp(p1, p2, sizeof(*p1));
> +}
> +
> +/*
> + * This function fills in the pasid entry in &vtd_as. Caller
> + * of this function should hold iommu_lock.
> + */
> +static void vtd_fill_pe_in_cache(IntelIOMMUState *s, VTDAddressSpace *vtd_as,
> +                                 VTDPASIDEntry *pe)
it seems some other functions use the _locked suffix
> +{
> +    VTDPASIDCacheEntry *pc_entry = &vtd_as->pasid_cache_entry;
> +
> +    if (vtd_pasid_entry_compare(pe, &pc_entry->pasid_entry)) {
> +        /* No need to go further as cached pasid entry is latest */
> +        return;
> +    }
> +
> +    pc_entry->pasid_entry = *pe;
> +    pc_entry->cache_filled = true;
> +    /*
> +     * TODO: send pasid bind to host for passthru devices
> +     */
what does it mean?
> +}
> +
> +/*
> + * This function is used to clear cached pasid entry in vtd_as
> + * instances. Caller of this function should hold iommu_lock.
> + */
> +static gboolean vtd_flush_pasid(gpointer key, gpointer value,
> +                                gpointer user_data)
> +{
> +    VTDPASIDCacheInfo *pc_info = user_data;
> +    VTDAddressSpace *vtd_as = value;
> +    IntelIOMMUState *s = vtd_as->iommu_state;
> +    VTDPASIDCacheEntry *pc_entry = &vtd_as->pasid_cache_entry;
> +    VTDPASIDEntry pe;
> +    uint16_t did;
> +    uint32_t pasid;
> +    int ret;
> +
> +    /* Replay only fill pasid entry cache for passthrough device */
filled

Eric

> +    if (!pc_entry->cache_filled) {
> +        return false;
> +    }
> +    did = vtd_pe_get_domain_id(&pc_entry->pasid_entry);
> +
> +    if (vtd_as_to_iommu_pasid(vtd_as, &pasid)) {
> +        goto remove;
> +    }
> +
> +    switch (pc_info->type) {
> +    case VTD_PASID_CACHE_FORCE_RESET:
> +        goto remove;
> +    case VTD_PASID_CACHE_PASIDSI:
> +        if (pc_info->pasid != pasid) {
> +            return false;
> +        }
> +        /* Fall through */
> +    case VTD_PASID_CACHE_DOMSI:
> +        if (pc_info->domain_id != did) {
> +            return false;
> +        }
> +        /* Fall through */
> +    case VTD_PASID_CACHE_GLOBAL_INV:
> +        break;
> +    default:
> +        error_report("invalid pc_info->type");
> +        abort();
> +    }
> +
> +    /*
> +     * pasid cache invalidation may indicate a present pasid
> +     * entry to present pasid entry modification. To cover such
> +     * case, vIOMMU emulator needs to fetch latest guest pasid
> +     * entry and check cached pasid entry, then update pasid
> +     * cache and send pasid bind/unbind to host properly.
> +     */
> +    ret = vtd_dev_get_pe_from_pasid(vtd_as, pasid, &pe);
> +    if (ret) {
> +        /*
> +         * No valid pasid entry in guest memory. e.g. pasid entry
> +         * was modified to be either all-zero or non-present. Either
> +         * case means existing pasid cache should be removed.
> +         */
> +        goto remove;
> +    }
> +
> +    vtd_fill_pe_in_cache(s, vtd_as, &pe);
> +    return false;
> +
> +remove:
> +    /*
> +     * TODO: send pasid unbind to host for passthru devices
> +     */
> +    pc_entry->cache_filled = false;
> +
> +    /*
> +     * Don't remove address space of PCI_NO_PASID which is created by PCI
> +     * sub-system.
> +     */
> +    if (vtd_as->pasid == PCI_NO_PASID) {
> +        return false;
> +    }
> +    return true;
> +}
> +
> +/* Caller of this function should hold iommu_lock */
> +static void vtd_pasid_cache_reset(IntelIOMMUState *s)
> +{
> +    VTDPASIDCacheInfo pc_info;
> +
> +    trace_vtd_pasid_cache_reset();
> +
> +    pc_info.type = VTD_PASID_CACHE_FORCE_RESET;
> +
> +    /*
> +     * Reset pasid cache is a big hammer, so use
> +     * g_hash_table_foreach_remove which will free
> +     * the vtd_as instances. Also, as a big
> +     * hammer, use VTD_PASID_CACHE_FORCE_RESET to
> +     * ensure all the vtd_as instances are
> +     * dropped, meanwhile the change will be passed
> +     * to host if HostIOMMUDeviceIOMMUFD is available.
> +     */
> +    g_hash_table_foreach_remove(s->vtd_address_spaces,
> +                                vtd_flush_pasid, &pc_info);
> +}
> +
> +/* Caller of this function should hold iommu_lock. */
> +static void vtd_sm_pasid_table_walk_one(IntelIOMMUState *s,
> +                                        dma_addr_t pt_base,
> +                                        int start,
> +                                        int end,
> +                                        VTDPASIDCacheInfo *info)
> +{
> +    VTDPASIDEntry pe;
> +    int pasid = start;
> +    int pasid_next;
> +
> +    while (pasid < end) {
> +        pasid_next = pasid + 1;
> +
> +        if (!vtd_get_pe_in_pasid_leaf_table(s, pasid, pt_base, &pe)
> +            && vtd_pe_present(&pe)) {
> +            int bus_n = pci_bus_num(info->bus), devfn = info->devfn;
> +            uint16_t sid = PCI_BUILD_BDF(bus_n, devfn);
> +            VTDAddressSpace *vtd_as;
> +
> +            vtd_as = vtd_as_from_iommu_pasid(s, sid, pasid);
> +            if (!vtd_as) {
> +                vtd_as = vtd_find_add_as(s, info->bus, devfn, pasid);
> +            }
> +
> +            if ((info->type == VTD_PASID_CACHE_DOMSI ||
> +                 info->type == VTD_PASID_CACHE_PASIDSI) &&
> +                !(info->domain_id == vtd_pe_get_domain_id(&pe))) {
> +                /*
> +                 * VTD_PASID_CACHE_DOMSI and VTD_PASID_CACHE_PASIDSI
> +                 * requires domain ID check. If domain Id check fail,
> +                 * go to next pasid.
> +                 */
> +                pasid = pasid_next;
> +                continue;
> +            }
> +            vtd_fill_pe_in_cache(s, vtd_as, &pe);
> +        }
> +        pasid = pasid_next;
> +    }
> +}
> +
> +/*
> + * Currently, VT-d scalable mode pasid table is a two level table,
> + * this function aims to loop a range of PASIDs in a given pasid
> + * table to identify the pasid config in guest.
> + * Caller of this function should hold iommu_lock.
> + */
> +static void vtd_sm_pasid_table_walk(IntelIOMMUState *s,
> +                                    dma_addr_t pdt_base,
> +                                    int start,
> +                                    int end,
> +                                    VTDPASIDCacheInfo *info)
> +{
> +    VTDPASIDDirEntry pdire;
> +    int pasid = start;
> +    int pasid_next;
> +    dma_addr_t pt_base;
> +
> +    while (pasid < end) {
> +        pasid_next = ((end - pasid) > VTD_PASID_TBL_ENTRY_NUM) ?
> +                      (pasid + VTD_PASID_TBL_ENTRY_NUM) : end;
> +        if (!vtd_get_pdire_from_pdir_table(pdt_base, pasid, &pdire)
> +            && vtd_pdire_present(&pdire)) {
> +            pt_base = pdire.val & VTD_PASID_TABLE_BASE_ADDR_MASK;
> +            vtd_sm_pasid_table_walk_one(s, pt_base, pasid, pasid_next, info);
> +        }
> +        pasid = pasid_next;
> +    }
> +}
> +
> +static void vtd_replay_pasid_bind_for_dev(IntelIOMMUState *s,
> +                                          int start, int end,
> +                                          VTDPASIDCacheInfo *info)
> +{
> +    VTDContextEntry ce;
> +    VTDAddressSpace *vtd_as;
> +
> +    vtd_as = vtd_find_add_as(s, info->bus, info->devfn, PCI_NO_PASID);
> +
> +    if (!vtd_as_to_context_entry(vtd_as, &ce)) {
> +        uint32_t max_pasid;
> +
> +        max_pasid = vtd_sm_ce_get_pdt_entry_num(&ce) * VTD_PASID_TBL_ENTRY_NUM;
> +        if (end > max_pasid) {
> +            end = max_pasid;
> +        }
> +        vtd_sm_pasid_table_walk(s,
> +                                VTD_CE_GET_PASID_DIR_TABLE(&ce),
> +                                start,
> +                                end,
> +                                info);
> +    }
> +}
> +
> +/*
> + * This function replay the guest pasid bindings to hosts by
> + * walking the guest PASID table. This ensures host will have
> + * latest guest pasid bindings. Caller should hold iommu_lock.
> + */
> +static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
> +                                            VTDPASIDCacheInfo *pc_info)
> +{
> +    VTDHostIOMMUDevice *vtd_hiod;
> +    int start = 0, end = 1; /* only rid2pasid is supported */
> +    VTDPASIDCacheInfo walk_info;
> +    GHashTableIter as_it;
> +
> +    switch (pc_info->type) {
> +    case VTD_PASID_CACHE_PASIDSI:
> +        start = pc_info->pasid;
> +        end = pc_info->pasid + 1;
> +        /*
> +         * PASID selective invalidation is within domain,
> +         * thus fall through.
> +         */
> +    case VTD_PASID_CACHE_DOMSI:
> +    case VTD_PASID_CACHE_GLOBAL_INV:
> +        /* loop all assigned devices */
> +        break;
> +    case VTD_PASID_CACHE_FORCE_RESET:
> +        /* For force reset, no need to go further replay */
> +        return;
> +    default:
> +        error_report("invalid pc_info->type for replay");
> +        abort();
> +    }
> +
> +    /*
> +     * In this replay, only needs to care about the devices which
> +     * are backed by host IOMMU. For such devices, their vtd_hiod
> +     * instances are in the s->vtd_host_iommu_dev. For devices which
> +     * are not backed by host IOMMU, it is not necessary to replay
> +     * the bindings since their cache could be re-created in the future
> +     * DMA address translation.
> +     */
> +    walk_info = *pc_info;
> +    g_hash_table_iter_init(&as_it, s->vtd_host_iommu_dev);
> +    while (g_hash_table_iter_next(&as_it, NULL, (void **)&vtd_hiod)) {
> +        /* bus|devfn fields are not identical with pc_info */
> +        walk_info.bus = vtd_hiod->bus;
> +        walk_info.devfn = vtd_hiod->devfn;
> +        vtd_replay_pasid_bind_for_dev(s, start, end, &walk_info);
> +    }
> +}
> +
> +/*
> + * This function syncs the pasid bindings between guest and host.
> + * It includes updating the pasid cache in vIOMMU and updating the
> + * pasid bindings per guest's latest pasid entry presence.
> + */
> +static void vtd_pasid_cache_sync(IntelIOMMUState *s,
> +                                 VTDPASIDCacheInfo *pc_info)
> +{
> +    if (!s->flts || !s->root_scalable || !s->dmar_enabled) {
> +        return;
> +    }
> +
> +    /*
> +     * Regards to a pasid cache invalidation, e.g. a PSI.
> +     * it could be either cases of below:
> +     * a) a present pasid entry moved to non-present
> +     * b) a present pasid entry to be a present entry
> +     * c) a non-present pasid entry moved to present
> +     *
> +     * Different invalidation granularity may affect different device
> +     * scope and pasid scope. But for each invalidation granularity,
> +     * it needs to do two steps to sync host and guest pasid binding.
> +     *
> +     * Here is the handling of a PSI:
> +     * 1) loop all the existing vtd_as instances to update them
> +     *    according to the latest guest pasid entry in pasid table.
> +     *    this will make sure affected existing vtd_as instances
> +     *    cached the latest pasid entries. Also, during the loop, the
> +     *    host should be notified if needed. e.g. pasid unbind or pasid
> +     *    update. Should be able to cover case a) and case b).
> +     *
> +     * 2) loop all devices to cover case c)
> +     *    - For devices which are backed by HostIOMMUDeviceIOMMUFD instances,
> +     *      we loop them and check if guest pasid entry exists. If yes,
> +     *      it is case c), we update the pasid cache and also notify
> +     *      host.
> +     *    - For devices which are not backed by HostIOMMUDeviceIOMMUFD,
> +     *      it is not necessary to create pasid cache at this phase since
> +     *      it could be created when vIOMMU does DMA address translation.
> +     *      This is not yet implemented since there is no emulated
> +     *      pasid-capable devices today. If we have such devices in
> +     *      future, the pasid cache shall be created there.
> +     * Other granularity follow the same steps, just with different scope
> +     *
> +     */
> +
> +    vtd_iommu_lock(s);
> +    /* Step 1: loop all the existing vtd_as instances */
> +    g_hash_table_foreach_remove(s->vtd_address_spaces,
> +                                vtd_flush_pasid, pc_info);
> +
> +    /*
> +     * Step 2: loop all the existing vtd_hiod instances.
> +     * Ideally, needs to loop all devices to find if there is any new
> +     * PASID binding regards to the PASID cache invalidation request.
> +     * But it is enough to loop the devices which are backed by host
> +     * IOMMU. For devices backed by vIOMMU (a.k.a emulated devices),
> +     * if new PASID happened on them, their vtd_as instance could
> +     * be created during future vIOMMU DMA translation.
> +     */
> +    vtd_replay_guest_pasid_bindings(s, pc_info);
> +    vtd_iommu_unlock(s);
> +}
> +
> +static bool vtd_process_pasid_desc(IntelIOMMUState *s,
> +                                   VTDInvDesc *inv_desc)
> +{
> +    uint16_t domain_id;
> +    uint32_t pasid;
> +    VTDPASIDCacheInfo pc_info;
> +    uint64_t mask[4] = {VTD_INV_DESC_PASIDC_RSVD_VAL0, VTD_INV_DESC_ALL_ONE,
> +                        VTD_INV_DESC_ALL_ONE, VTD_INV_DESC_ALL_ONE};
> +
> +    if (!vtd_inv_desc_reserved_check(s, inv_desc, mask, true,
> +                                     __func__, "pasid cache inv")) {
> +        return false;
> +    }
> +
> +    domain_id = VTD_INV_DESC_PASIDC_DID(inv_desc->val[0]);
> +    pasid = VTD_INV_DESC_PASIDC_PASID(inv_desc->val[0]);
> +
> +    switch (inv_desc->val[0] & VTD_INV_DESC_PASIDC_G) {
> +    case VTD_INV_DESC_PASIDC_DSI:
> +        trace_vtd_pasid_cache_dsi(domain_id);
> +        pc_info.type = VTD_PASID_CACHE_DOMSI;
> +        pc_info.domain_id = domain_id;
> +        break;
> +
> +    case VTD_INV_DESC_PASIDC_PASID_SI:
> +        /* PASID selective implies a DID selective */
> +        trace_vtd_pasid_cache_psi(domain_id, pasid);
> +        pc_info.type = VTD_PASID_CACHE_PASIDSI;
> +        pc_info.domain_id = domain_id;
> +        pc_info.pasid = pasid;
> +        break;
> +
> +    case VTD_INV_DESC_PASIDC_GLOBAL:
> +        trace_vtd_pasid_cache_gsi();
> +        pc_info.type = VTD_PASID_CACHE_GLOBAL_INV;
> +        break;
> +
> +    default:
> +        error_report_once("invalid-inv-granu-in-pc_inv_desc hi: 0x%" PRIx64
> +                  " lo: 0x%" PRIx64, inv_desc->val[1], inv_desc->val[0]);
> +        return false;
> +    }
> +
> +    vtd_pasid_cache_sync(s, &pc_info);
> +    return true;
> +}
> +
>  static bool vtd_process_inv_iec_desc(IntelIOMMUState *s,
>                                       VTDInvDesc *inv_desc)
>  {
> @@ -3223,6 +3690,13 @@ static bool vtd_process_inv_desc(IntelIOMMUState *s)
>          }
>          break;
>  
> +    case VTD_INV_DESC_PC:
> +        trace_vtd_inv_desc("pasid-cache", inv_desc.val[1], inv_desc.val[0]);
> +        if (!vtd_process_pasid_desc(s, &inv_desc)) {
> +            return false;
> +        }
> +        break;
> +
>      case VTD_INV_DESC_PIOTLB:
>          trace_vtd_inv_desc("p-iotlb", inv_desc.val[1], inv_desc.val[0]);
>          if (!vtd_process_piotlb_desc(s, &inv_desc)) {
> @@ -3258,16 +3732,6 @@ static bool vtd_process_inv_desc(IntelIOMMUState *s)
>          }
>          break;
>  
> -    /*
> -     * TODO: the entity of below two cases will be implemented in future series.
> -     * To make guest (which integrates scalable mode support patch set in
> -     * iommu driver) work, just return true is enough so far.
> -     */
> -    case VTD_INV_DESC_PC:
> -        if (s->scalable_mode) {
> -            break;
> -        }
> -    /* fallthrough */
>      default:
>          error_report_once("%s: invalid inv desc: hi=%"PRIx64", lo=%"PRIx64
>                            " (unknown type)", __func__, inv_desc.hi,
> diff --git a/hw/i386/trace-events b/hw/i386/trace-events
> index 53c02d7ac8..a26b38b52c 100644
> --- a/hw/i386/trace-events
> +++ b/hw/i386/trace-events
> @@ -24,6 +24,10 @@ vtd_inv_qi_head(uint16_t head) "read head %d"
>  vtd_inv_qi_tail(uint16_t head) "write tail %d"
>  vtd_inv_qi_fetch(void) ""
>  vtd_context_cache_reset(void) ""
> +vtd_pasid_cache_gsi(void) ""
> +vtd_pasid_cache_reset(void) ""
> +vtd_pasid_cache_dsi(uint16_t domain) "Domain slective PC invalidation domain 0x%"PRIx16
> +vtd_pasid_cache_psi(uint16_t domain, uint32_t pasid) "PASID slective PC invalidation domain 0x%"PRIx16" pasid 0x%"PRIx32
>  vtd_re_not_present(uint8_t bus) "Root entry bus %"PRIu8" not present"
>  vtd_ce_not_present(uint8_t bus, uint8_t devfn) "Context entry bus %"PRIu8" devfn %"PRIu8" not present"
>  vtd_iotlb_page_hit(uint16_t sid, uint64_t addr, uint64_t slpte, uint16_t domain) "IOTLB page hit sid 0x%"PRIx16" iova 0x%"PRIx64" slpte 0x%"PRIx64" domain 0x%"PRIx16




* RE: [PATCH rfcv2 01/20] backends/iommufd: Add helpers for invalidating user-managed HWPT
  2025-02-19  8:22 ` [PATCH rfcv2 01/20] backends/iommufd: Add helpers for invalidating user-managed HWPT Zhenzhong Duan
  2025-02-20 16:47   ` Eric Auger
@ 2025-02-24 10:03   ` Shameerali Kolothum Thodi via
  2025-02-28  9:36     ` Duan, Zhenzhong
  1 sibling, 1 reply; 68+ messages in thread
From: Shameerali Kolothum Thodi via @ 2025-02-24 10:03 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, eric.auger@redhat.com,
	mst@redhat.com, jasowang@redhat.com, peterx@redhat.com,
	jgg@nvidia.com, nicolinc@nvidia.com, joao.m.martins@oracle.com,
	clement.mathieu--drif@eviden.com, kevin.tian@intel.com,
	yi.l.liu@intel.com, chao.p.peng@intel.com

Hi Zhenzhong,

> -----Original Message-----
> From: Zhenzhong Duan <zhenzhong.duan@intel.com>
> Sent: Wednesday, February 19, 2025 8:22 AM
> To: qemu-devel@nongnu.org
> Cc: alex.williamson@redhat.com; clg@redhat.com; eric.auger@redhat.com;
> mst@redhat.com; jasowang@redhat.com; peterx@redhat.com;
> jgg@nvidia.com; nicolinc@nvidia.com; Shameerali Kolothum Thodi
> <shameerali.kolothum.thodi@huawei.com>; joao.m.martins@oracle.com;
> clement.mathieu--drif@eviden.com; kevin.tian@intel.com;
> yi.l.liu@intel.com; chao.p.peng@intel.com; Zhenzhong Duan
> <zhenzhong.duan@intel.com>
> Subject: [PATCH rfcv2 01/20] backends/iommufd: Add helpers for
> invalidating user-managed HWPT
> 
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>  include/system/iommufd.h |  3 +++
>  backends/iommufd.c       | 30 ++++++++++++++++++++++++++++++
>  backends/trace-events    |  1 +
>  3 files changed, 34 insertions(+)
> 
> diff --git a/include/system/iommufd.h b/include/system/iommufd.h
> index cbab75bfbf..5d02e9d148 100644
> --- a/include/system/iommufd.h
> +++ b/include/system/iommufd.h
> @@ -61,6 +61,9 @@ bool
> iommufd_backend_get_dirty_bitmap(IOMMUFDBackend *be, uint32_t
> hwpt_id,
>                                        uint64_t iova, ram_addr_t size,
>                                        uint64_t page_size, uint64_t *data,
>                                        Error **errp);
> +int iommufd_backend_invalidate_cache(IOMMUFDBackend *be, uint32_t
> hwpt_id,
> +                                     uint32_t data_type, uint32_t entry_len,
> +                                     uint32_t *entry_num, void *data_ptr);
> 
>  #define TYPE_HOST_IOMMU_DEVICE_IOMMUFD
> TYPE_HOST_IOMMU_DEVICE "-iommufd"
>  #endif
> diff --git a/backends/iommufd.c b/backends/iommufd.c
> index d57da44755..fc32aad5cb 100644
> --- a/backends/iommufd.c
> +++ b/backends/iommufd.c
> @@ -311,6 +311,36 @@ bool
> iommufd_backend_get_device_info(IOMMUFDBackend *be, uint32_t devid,
>      return true;
>  }
> 
> +int iommufd_backend_invalidate_cache(IOMMUFDBackend *be, uint32_t
> hwpt_id,

Nit: As per the struct iommu_hwpt_invalidate documentation, this can be an ID of
a nested HWPT or a vIOMMU. It may be better to rename this to just "id".

Thanks,
Shameer



* RE: [PATCH rfcv2 01/20] backends/iommufd: Add helpers for invalidating user-managed HWPT
  2025-02-20 16:47   ` Eric Auger
@ 2025-02-28  2:26     ` Duan, Zhenzhong
  0 siblings, 0 replies; 68+ messages in thread
From: Duan, Zhenzhong @ 2025-02-28  2:26 UTC (permalink / raw)
  To: eric.auger@redhat.com, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
	jasowang@redhat.com, peterx@redhat.com, jgg@nvidia.com,
	nicolinc@nvidia.com, shameerali.kolothum.thodi@huawei.com,
	joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
	Tian, Kevin, Liu, Yi L, Peng, Chao P

Hi Eric,

>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: Re: [PATCH rfcv2 01/20] backends/iommufd: Add helpers for invalidating
>user-managed HWPT
>
>Hi Zhenzhong,
>
>
>On 2/19/25 9:22 AM, Zhenzhong Duan wrote:
>> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>in the title, there is only a single helper here. a small commit msg may
>help the reader

Sure, will do.

>> ---
>>  include/system/iommufd.h |  3 +++
>>  backends/iommufd.c       | 30 ++++++++++++++++++++++++++++++
>>  backends/trace-events    |  1 +
>>  3 files changed, 34 insertions(+)
>>
>> diff --git a/include/system/iommufd.h b/include/system/iommufd.h
>> index cbab75bfbf..5d02e9d148 100644
>> --- a/include/system/iommufd.h
>> +++ b/include/system/iommufd.h
>> @@ -61,6 +61,9 @@ bool
>iommufd_backend_get_dirty_bitmap(IOMMUFDBackend *be, uint32_t hwpt_id,
>>                                        uint64_t iova, ram_addr_t size,
>>                                        uint64_t page_size, uint64_t *data,
>>                                        Error **errp);
>> +int iommufd_backend_invalidate_cache(IOMMUFDBackend *be, uint32_t
>hwpt_id,
>> +                                     uint32_t data_type, uint32_t entry_len,
>> +                                     uint32_t *entry_num, void *data_ptr);
>>
>>  #define TYPE_HOST_IOMMU_DEVICE_IOMMUFD
>TYPE_HOST_IOMMU_DEVICE "-iommufd"
>>  #endif
>> diff --git a/backends/iommufd.c b/backends/iommufd.c
>> index d57da44755..fc32aad5cb 100644
>> --- a/backends/iommufd.c
>> +++ b/backends/iommufd.c
>> @@ -311,6 +311,36 @@ bool
>iommufd_backend_get_device_info(IOMMUFDBackend *be, uint32_t devid,
>>      return true;
>>  }
>>
>> +int iommufd_backend_invalidate_cache(IOMMUFDBackend *be, uint32_t
>hwpt_id,
>> +                                     uint32_t data_type, uint32_t entry_len,
>> +                                     uint32_t *entry_num, void *data_ptr)
>> +{
>> +    int ret, fd = be->fd;
>> +    struct iommu_hwpt_invalidate cache = {
>> +        .size = sizeof(cache),
>> +        .hwpt_id = hwpt_id,
>> +        .data_type = data_type,
>> +        .entry_len = entry_len,
>> +        .entry_num = *entry_num,
>> +        .data_uptr = (uintptr_t)data_ptr,
>> +    };
>> +
>> +    ret = ioctl(fd, IOMMU_HWPT_INVALIDATE, &cache);
>> +
>> +    trace_iommufd_backend_invalidate_cache(fd, hwpt_id, data_type,
>entry_len,
>> +                                           *entry_num, cache.entry_num,
>> +                                           (uintptr_t)data_ptr, ret);
>> +    if (ret) {
>> +        *entry_num = cache.entry_num;
>> +        error_report("IOMMU_HWPT_INVALIDATE failed: %s", strerror(errno));
>nit: you may report *entry_num also.
>Wouldn't it be useful to have an Error *errp passed to the function

Will do.

Thanks
Zhenzhong

>> +        ret = -errno;
>> +    } else {
>> +        g_assert(*entry_num == cache.entry_num);
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>>  static int hiod_iommufd_get_cap(HostIOMMUDevice *hiod, int cap, Error
>**errp)
>>  {
>>      HostIOMMUDeviceCaps *caps = &hiod->caps;
>> diff --git a/backends/trace-events b/backends/trace-events
>> index 40811a3162..5a23db6c8a 100644
>> --- a/backends/trace-events
>> +++ b/backends/trace-events
>> @@ -18,3 +18,4 @@ iommufd_backend_alloc_hwpt(int iommufd, uint32_t
>dev_id, uint32_t pt_id, uint32_
>>  iommufd_backend_free_id(int iommufd, uint32_t id, int ret) " iommufd=%d
>id=%d (%d)"
>>  iommufd_backend_set_dirty(int iommufd, uint32_t hwpt_id, bool start, int ret)
>" iommufd=%d hwpt=%u enable=%d (%d)"
>>  iommufd_backend_get_dirty_bitmap(int iommufd, uint32_t hwpt_id, uint64_t
>iova, uint64_t size, uint64_t page_size, int ret) " iommufd=%d hwpt=%u
>iova=0x%"PRIx64" size=0x%"PRIx64" page_size=0x%"PRIx64" (%d)"
>> +iommufd_backend_invalidate_cache(int iommufd, uint32_t hwpt_id, uint32_t
>data_type, uint32_t entry_len, uint32_t entry_num, uint32_t done_num, uint64_t
>data_ptr, int ret) " iommufd=%d hwpt_id=%u data_type=%u entry_len=%u
>entry_num=%u done_num=%u data_ptr=0x%"PRIx64" (%d)"
>Eric




* RE: [PATCH rfcv2 02/20] vfio/iommufd: Add properties and handlers to TYPE_HOST_IOMMU_DEVICE_IOMMUFD
  2025-02-20 17:42   ` Eric Auger
@ 2025-02-28  5:39     ` Duan, Zhenzhong
  0 siblings, 0 replies; 68+ messages in thread
From: Duan, Zhenzhong @ 2025-02-28  5:39 UTC (permalink / raw)
  To: eric.auger@redhat.com, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
	jasowang@redhat.com, peterx@redhat.com, jgg@nvidia.com,
	nicolinc@nvidia.com, shameerali.kolothum.thodi@huawei.com,
	joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
	Tian, Kevin, Liu, Yi L, Peng, Chao P

Hi Eric,

>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: Re: [PATCH rfcv2 02/20] vfio/iommufd: Add properties and handlers to
>TYPE_HOST_IOMMU_DEVICE_IOMMUFD
>
>Hi Zhenzhong,
>
>
>On 2/19/25 9:22 AM, Zhenzhong Duan wrote:
>> New added properties include IOMMUFD handle, devid and hwpt_id.
>a property generally has an other meaning in qemu (PROP*).
>
>I would rather say you enhance HostIOMMUDeviceIOMMUFD object with 3 new
>members, specific to the iommufd BE + 2 new class functions.

Will do.

>
>
>> IOMMUFD handle and devid are used to allocate/free ioas and hwpt.
>> hwpt_id is used to re-attach IOMMUFD backed device to its default
>> VFIO sub-system created hwpt, i.e., when vIOMMU is disabled by
>> guest. These properties are initialized in .realize_late() handler.
>realize_late does not exist yet
>>
>> New added handlers include [at|de]tach_hwpt. They are used to
>> attach/detach hwpt. VFIO and VDPA have different way to attach
>> and detach, so implementation will be in sub-class instead of
>> HostIOMMUDeviceIOMMUFD.
>this is tricky to follow ...

I mean implementing [at|de]tach_hwpt in e.g., HostIOMMUDeviceIOMMUFDVFIO.

>>
>> Add two wrappers host_iommu_device_iommufd_[at|de]tach_hwpt to
>> wrap the two handlers.
>>
>> This is a prerequisite patch for following ones.
>would get rid of that sentence as it does not help much

Sure.

Thanks
Zhenzhong




* RE: [PATCH rfcv2 03/20] HostIOMMUDevice: Introduce realize_late callback
  2025-02-20 17:48   ` Eric Auger
@ 2025-02-28  8:16     ` Duan, Zhenzhong
  2025-03-06 15:53       ` Eric Auger
  0 siblings, 1 reply; 68+ messages in thread
From: Duan, Zhenzhong @ 2025-02-28  8:16 UTC (permalink / raw)
  To: eric.auger@redhat.com, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
	jasowang@redhat.com, peterx@redhat.com, jgg@nvidia.com,
	nicolinc@nvidia.com, shameerali.kolothum.thodi@huawei.com,
	joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
	Tian, Kevin, Liu, Yi L, Peng, Chao P



>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: Re: [PATCH rfcv2 03/20] HostIOMMUDevice: Introduce realize_late
>callback
>
>
>
>
>On 2/19/25 9:22 AM, Zhenzhong Duan wrote:
>> Currently we have realize() callback which is called before attachment.
>> But there are still some elements e.g., hwpt_id is not ready before
>> attachment. So we need a realize_late() callback to further initialize
>> them.
>from the description it is not obvious why the realize() could not have
>been called after the attach. Could you remind the reader what is the
>reason?

Sure, will rephrase as below:

" HostIOMMUDevice provides some elements to vIOMMU, but there are some which
are only ready after attachment, e.g., hwpt_id.

Before creating and attaching to a new hwpt with the IOMMU dirty tracking
capability, we have to call realize() to learn whether the hardware IOMMU
supports dirty tracking.

So moving realize() after attach() will not work here; we need a new callback,
realize_late(), to further initialize those elements.

Currently, this callback is only useful for the iommufd backend. For the legacy
backend nothing needs to be initialized after attachment. "
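
For illustration, a minimal standalone sketch of the ordering constraint
above (the stub types, values and function bodies are hypothetical, not the
actual QEMU/VFIO code):

#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in for the host IOMMU device state built up in two phases. */
typedef struct {
    bool dirty_tracking;   /* capability known at realize() time */
    uint32_t hwpt_id;      /* only known once the device is attached */
} FakeHostDev;

/* Phase 1: query capabilities needed to decide how to attach. */
static void dev_realize(FakeHostDev *d)
{
    d->dirty_tracking = true;           /* pretend the HW supports it */
}

/* Attachment produces the hwpt; its id does not exist before this. */
static uint32_t dev_attach(const FakeHostDev *d)
{
    return d->dirty_tracking ? 42 : 7;  /* hypothetical hwpt ids */
}

/* Phase 2: record results that only exist after attachment. */
static void dev_realize_late(FakeHostDev *d, uint32_t hwpt_id)
{
    d->hwpt_id = hwpt_id;
}

int main(void)
{
    FakeHostDev dev = {0};
    dev_realize(&dev);                  /* before attach */
    uint32_t hwpt = dev_attach(&dev);   /* attach creates the hwpt */
    dev_realize_late(&dev, hwpt);       /* after attach */
    printf("hwpt_id=%" PRIu32 "\n", dev.hwpt_id);
    return 0;
}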

Thanks
Zhenzhong



* RE: [PATCH rfcv2 04/20] vfio/iommufd: Implement HostIOMMUDeviceClass::realize_late() handler
  2025-02-20 18:07   ` Eric Auger
@ 2025-02-28  8:23     ` Duan, Zhenzhong
  0 siblings, 0 replies; 68+ messages in thread
From: Duan, Zhenzhong @ 2025-02-28  8:23 UTC (permalink / raw)
  To: eric.auger@redhat.com, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
	jasowang@redhat.com, peterx@redhat.com, jgg@nvidia.com,
	nicolinc@nvidia.com, shameerali.kolothum.thodi@huawei.com,
	joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
	Tian, Kevin, Liu, Yi L, Peng, Chao P



>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: Re: [PATCH rfcv2 04/20] vfio/iommufd: Implement
>HostIOMMUDeviceClass::realize_late() handler
>
>
>
>
>On 2/19/25 9:22 AM, Zhenzhong Duan wrote:
>> There are three iommufd related elements iommufd handle, devid and
>
>There are three iommufd specific members in HostIOMMUDevice
>IOMMUFD that need to be initialized after attach on realize_late() ...

Will do.

Thanks
Zhenzhong

>
>> hwpt_id. hwpt_id is ready only after VFIO device attachment. Device
>> id and iommufd handle are ready before attachment, but they are all
>> iommufd related stuff, initialize them together with hwpt_id.
>>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>>  hw/vfio/iommufd.c | 14 ++++++++++++++
>>  1 file changed, 14 insertions(+)
>>
>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>> index df61edffc0..53639bf88b 100644
>> --- a/hw/vfio/iommufd.c
>> +++ b/hw/vfio/iommufd.c
>> @@ -828,6 +828,19 @@ static bool
>hiod_iommufd_vfio_realize(HostIOMMUDevice *hiod, void *opaque,
>>      return true;
>>  }
>>
>> +static bool hiod_iommufd_vfio_realize_late(HostIOMMUDevice *hiod, void
>*opaque,
>> +                                           Error **errp)
>> +{
>> +    VFIODevice *vdev = opaque;
>> +    HostIOMMUDeviceIOMMUFD *idev =
>HOST_IOMMU_DEVICE_IOMMUFD(hiod);
>> +
>> +    idev->iommufd = vdev->iommufd;
>> +    idev->devid = vdev->devid;
>> +    idev->hwpt_id = vdev->hwpt->hwpt_id;
>> +
>> +    return true;
>> +}
>> +
>>  static GList *
>>  hiod_iommufd_vfio_get_iova_ranges(HostIOMMUDevice *hiod)
>>  {
>> @@ -852,6 +865,7 @@ static void hiod_iommufd_vfio_class_init(ObjectClass
>*oc, void *data)
>>      HostIOMMUDeviceClass *hiodc = HOST_IOMMU_DEVICE_CLASS(oc);
>>
>>      hiodc->realize = hiod_iommufd_vfio_realize;
>> +    hiodc->realize_late = hiod_iommufd_vfio_realize_late;
>>      hiodc->get_iova_ranges = hiod_iommufd_vfio_get_iova_ranges;
>>      hiodc->get_page_size_mask = hiod_iommufd_vfio_get_page_size_mask;
>>  };



* RE: [PATCH rfcv2 05/20] vfio/iommufd: Implement [at|de]tach_hwpt handlers
  2025-02-20 18:13   ` Eric Auger
@ 2025-02-28  8:24     ` Duan, Zhenzhong
  2025-03-06 15:56       ` Eric Auger
  0 siblings, 1 reply; 68+ messages in thread
From: Duan, Zhenzhong @ 2025-02-28  8:24 UTC (permalink / raw)
  To: eric.auger@redhat.com, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
	jasowang@redhat.com, peterx@redhat.com, jgg@nvidia.com,
	nicolinc@nvidia.com, shameerali.kolothum.thodi@huawei.com,
	joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
	Tian, Kevin, Liu, Yi L, Peng, Chao P



>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: Re: [PATCH rfcv2 05/20] vfio/iommufd: Implement [at|de]tach_hwpt
>handlers
>
>
>
>
>On 2/19/25 9:22 AM, Zhenzhong Duan wrote:
>> Implement [at|de]tach_hwpt handlers in VFIO subsystem. vIOMMU
>> utilizes them to attach to or detach from hwpt on host side.
>>
>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>>  hw/vfio/iommufd.c | 22 ++++++++++++++++++++++
>>  1 file changed, 22 insertions(+)
>>
>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>> index 53639bf88b..175c4fe1f4 100644
>> --- a/hw/vfio/iommufd.c
>> +++ b/hw/vfio/iommufd.c
>> @@ -802,6 +802,24 @@ static void
>vfio_iommu_iommufd_class_init(ObjectClass *klass, void *data)
>>      vioc->query_dirty_bitmap = iommufd_query_dirty_bitmap;
>>  };
>>
>> +static bool
>can't we return an integer instead. This looks more standard to me

I can do that, but I remember VFIO honors bool return values
whenever possible. We have had cleanup patches to make all functions
return bool where possible. Do we really want to return int for only these
two functions?

Thanks
Zhenzhong

>
>Eric
>>
>+host_iommu_device_iommufd_vfio_attach_hwpt(HostIOMMUDeviceIOMMUFD
>*idev,
>> +                                           uint32_t hwpt_id, Error **errp)
>> +{
>> +    VFIODevice *vbasedev = HOST_IOMMU_DEVICE(idev)->agent;
>> +
>> +    return !iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt_id, errp);
>> +}
>> +
>> +static bool
>>
>+host_iommu_device_iommufd_vfio_detach_hwpt(HostIOMMUDeviceIOMMUF
>D *idev,
>> +                                           Error **errp)
>> +{
>> +    VFIODevice *vbasedev = HOST_IOMMU_DEVICE(idev)->agent;
>> +
>> +    return iommufd_cdev_detach_ioas_hwpt(vbasedev, errp);
>> +}
>> +
>>  static bool hiod_iommufd_vfio_realize(HostIOMMUDevice *hiod, void
>*opaque,
>>                                        Error **errp)
>>  {
>> @@ -863,11 +881,15 @@
>hiod_iommufd_vfio_get_page_size_mask(HostIOMMUDevice *hiod)
>>  static void hiod_iommufd_vfio_class_init(ObjectClass *oc, void *data)
>>  {
>>      HostIOMMUDeviceClass *hiodc = HOST_IOMMU_DEVICE_CLASS(oc);
>> +    HostIOMMUDeviceIOMMUFDClass *idevc =
>HOST_IOMMU_DEVICE_IOMMUFD_CLASS(oc);
>>
>>      hiodc->realize = hiod_iommufd_vfio_realize;
>>      hiodc->realize_late = hiod_iommufd_vfio_realize_late;
>>      hiodc->get_iova_ranges = hiod_iommufd_vfio_get_iova_ranges;
>>      hiodc->get_page_size_mask = hiod_iommufd_vfio_get_page_size_mask;
>> +
>> +    idevc->attach_hwpt = host_iommu_device_iommufd_vfio_attach_hwpt;
>> +    idevc->detach_hwpt = host_iommu_device_iommufd_vfio_detach_hwpt;
>>  };
>>
>>  static const TypeInfo types[] = {




* RE: [PATCH rfcv2 06/20] host_iommu_device: Define two new capabilities HOST_IOMMU_DEVICE_CAP_[NESTING|FS1GP]
  2025-02-20 18:41   ` Eric Auger
  2025-02-20 18:44     ` Eric Auger
@ 2025-02-28  8:29     ` Duan, Zhenzhong
  2025-03-06 15:59       ` Eric Auger
  1 sibling, 1 reply; 68+ messages in thread
From: Duan, Zhenzhong @ 2025-02-28  8:29 UTC (permalink / raw)
  To: eric.auger@redhat.com, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
	jasowang@redhat.com, peterx@redhat.com, jgg@nvidia.com,
	nicolinc@nvidia.com, shameerali.kolothum.thodi@huawei.com,
	joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
	Tian, Kevin, Liu, Yi L, Peng, Chao P

Hi Eric,

>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: Re: [PATCH rfcv2 06/20] host_iommu_device: Define two new
>capabilities HOST_IOMMU_DEVICE_CAP_[NESTING|FS1GP]
>
>Hi Zhenzhong,
>
>
>On 2/19/25 9:22 AM, Zhenzhong Duan wrote:
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>>  include/system/host_iommu_device.h | 8 ++++++++
>>  1 file changed, 8 insertions(+)
>>
>> diff --git a/include/system/host_iommu_device.h
>b/include/system/host_iommu_device.h
>> index df782598f2..18f8b5e5cf 100644
>> --- a/include/system/host_iommu_device.h
>> +++ b/include/system/host_iommu_device.h
>> @@ -22,10 +22,16 @@
>>   *
>>   * @hw_caps: host platform IOMMU capabilities (e.g. on IOMMUFD this
>represents
>>   *           the @out_capabilities value returned from IOMMU_GET_HW_INFO
>ioctl)
>> + *
>> + * @nesting: nesting page table support.
>> + *
>> + * @fs1gp: first stage(a.k.a, Stage-1) 1GB huge page support.
>>   */
>>  typedef struct HostIOMMUDeviceCaps {
>>      uint32_t type;
>>      uint64_t hw_caps;
>> +    bool nesting;
>> +    bool fs1gp;
>this looks quite vtd specific, isn't it? Shouldn't we hide this is a
>vendor specific cap struct?

Yes? I guess ARM hw could also provide nesting support at least?

There are some reasons I prefer a flattened struct even if some
elements may be vendor specific:
1. If a vendor doesn't support a capability that another vendor does,
the corresponding element is simply zero by default.
2. A vendor-specific element may become generic in the future,
and we won't need to update the structure when that happens.
3. vIOMMU calls get_cap() to query whether a capability is supported,
so a vIOMMU never queries a vendor-specific capability it doesn't
recognize. Even if that happens, zero is returned, hinting no support.
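
To illustrate point 3, here is a small self-contained sketch (simplified,
hypothetical names, not the actual QEMU code) of how a flattened caps struct
makes an unset vendor-specific capability read back as 0:

#include <stdbool.h>
#include <stdio.h>

/* Simplified stand-in for a flattened HostIOMMUDeviceCaps. */
typedef struct {
    bool nesting;   /* could be supported by several vendors */
    bool fs1gp;     /* VT-d specific; stays false for other vendors */
} FakeCaps;

enum { FAKE_CAP_NESTING, FAKE_CAP_FS1GP };

/* get_cap() analogue: capabilities a vendor never fills in simply
 * report 0 (unsupported); only a truly unknown cap is an error. */
static int fake_get_cap(const FakeCaps *caps, int cap)
{
    switch (cap) {
    case FAKE_CAP_NESTING:
        return caps->nesting;
    case FAKE_CAP_FS1GP:
        return caps->fs1gp;
    default:
        return -1;
    }
}

int main(void)
{
    FakeCaps arm_like = { .nesting = true };   /* fs1gp left zero */
    printf("fs1gp: %d\n", fake_get_cap(&arm_like, FAKE_CAP_FS1GP)); /* 0 */
    return 0;
}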

Thanks
Zhenzhong



* RE: [PATCH rfcv2 08/20] iommufd: Implement query of HOST_IOMMU_DEVICE_CAP_ERRATA
  2025-02-20 18:55   ` Eric Auger
@ 2025-02-28  8:31     ` Duan, Zhenzhong
  0 siblings, 0 replies; 68+ messages in thread
From: Duan, Zhenzhong @ 2025-02-28  8:31 UTC (permalink / raw)
  To: eric.auger@redhat.com, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
	jasowang@redhat.com, peterx@redhat.com, jgg@nvidia.com,
	nicolinc@nvidia.com, shameerali.kolothum.thodi@huawei.com,
	joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
	Tian, Kevin, Liu, Yi L, Peng, Chao P



>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: Re: [PATCH rfcv2 08/20] iommufd: Implement query of
>HOST_IOMMU_DEVICE_CAP_ERRATA
>
>
>
>
>On 2/19/25 9:22 AM, Zhenzhong Duan wrote:
>> Implement query of HOST_IOMMU_DEVICE_CAP_ERRATA for IOMMUFD
>> backed host IOMMU device.
>>
>> Query on this capability is not supported for legacy backend
>> because there is no plan to support nesting with leacy backend
>legacy
>> backed host device.
>>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>>  include/system/host_iommu_device.h | 2 ++
>>  backends/iommufd.c                 | 2 ++
>>  hw/vfio/iommufd.c                  | 1 +
>>  3 files changed, 5 insertions(+)
>>
>> diff --git a/include/system/host_iommu_device.h
>b/include/system/host_iommu_device.h
>> index 18f8b5e5cf..250600fc1d 100644
>> --- a/include/system/host_iommu_device.h
>> +++ b/include/system/host_iommu_device.h
>> @@ -32,6 +32,7 @@ typedef struct HostIOMMUDeviceCaps {
>>      uint64_t hw_caps;
>>      bool nesting;
>>      bool fs1gp;
>> +    uint32_t errata;
>to be consistent with the others yu may have introduced this alongside
>with the 2 others?
>This is also not usable by other IOMMUs.

Yes, this is a vendor-specific element. Will merge it after confirming that
nesting and fs1gp are vendor specific too.

Thanks
Zhenzhong

>
>Eric
>>  } HostIOMMUDeviceCaps;
>>
>>  #define TYPE_HOST_IOMMU_DEVICE "host-iommu-device"
>> @@ -130,6 +131,7 @@ struct HostIOMMUDeviceClass {
>>  #define HOST_IOMMU_DEVICE_CAP_AW_BITS           1
>>  #define HOST_IOMMU_DEVICE_CAP_NESTING           2
>>  #define HOST_IOMMU_DEVICE_CAP_FS1GP             3
>> +#define HOST_IOMMU_DEVICE_CAP_ERRATA            4
>>
>>  #define HOST_IOMMU_DEVICE_CAP_AW_BITS_MAX       64
>>  #endif
>> diff --git a/backends/iommufd.c b/backends/iommufd.c
>> index 0a1a40cbba..3c23caef96 100644
>> --- a/backends/iommufd.c
>> +++ b/backends/iommufd.c
>> @@ -374,6 +374,8 @@ static int hiod_iommufd_get_cap(HostIOMMUDevice
>*hiod, int cap, Error **errp)
>>          return caps->nesting;
>>      case HOST_IOMMU_DEVICE_CAP_FS1GP:
>>          return caps->fs1gp;
>> +    case HOST_IOMMU_DEVICE_CAP_ERRATA:
>> +        return caps->errata;
>>      default:
>>          error_setg(errp, "%s: unsupported capability %x", hiod->name, cap);
>>          return -EINVAL;
>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>> index df6a12d200..58bff030e1 100644
>> --- a/hw/vfio/iommufd.c
>> +++ b/hw/vfio/iommufd.c
>> @@ -848,6 +848,7 @@ static bool
>hiod_iommufd_vfio_realize(HostIOMMUDevice *hiod, void *opaque,
>>      case IOMMU_HW_INFO_TYPE_INTEL_VTD:
>>          caps->nesting = !!(data.vtd.ecap_reg & VTD_ECAP_NEST);
>>          caps->fs1gp = !!(data.vtd.cap_reg & VTD_CAP_FS1GP);
>> +        caps->errata = data.vtd.flags &
>IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17;
>>          break;
>>      case IOMMU_HW_INFO_TYPE_ARM_SMMUV3:
>>      case IOMMU_HW_INFO_TYPE_NONE:




* RE: [PATCH rfcv2 07/20] iommufd: Implement query of HOST_IOMMU_DEVICE_CAP_[NESTING|FS1GP]
  2025-02-20 19:00   ` Eric Auger
@ 2025-02-28  8:32     ` Duan, Zhenzhong
  0 siblings, 0 replies; 68+ messages in thread
From: Duan, Zhenzhong @ 2025-02-28  8:32 UTC (permalink / raw)
  To: eric.auger@redhat.com, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
	jasowang@redhat.com, peterx@redhat.com, jgg@nvidia.com,
	nicolinc@nvidia.com, shameerali.kolothum.thodi@huawei.com,
	joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
	Tian, Kevin, Liu, Yi L, Peng, Chao P, Marcel Apfelbaum,
	Paolo Bonzini, Richard Henderson, Eduardo Habkost



>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: Re: [PATCH rfcv2 07/20] iommufd: Implement query of
>HOST_IOMMU_DEVICE_CAP_[NESTING|FS1GP]
>
>
>
>
>On 2/19/25 9:22 AM, Zhenzhong Duan wrote:
>> Implement query of HOST_IOMMU_DEVICE_CAP_[NESTING|FS1GP] for
>IOMMUFD
>> backed host IOMMU device.
>>
>> Query on these two capabilities is not supported for legacy backend
>> because there is no plan to support nesting with leacy backend backed
>> host device.
>>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>>  hw/i386/intel_iommu_internal.h |  1 +
>>  backends/iommufd.c             |  4 ++++
>>  hw/vfio/iommufd.c              | 11 +++++++++++
>>  3 files changed, 16 insertions(+)
>>
>> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
>> index e8b211e8b0..2cda744786 100644
>> --- a/hw/i386/intel_iommu_internal.h
>> +++ b/hw/i386/intel_iommu_internal.h
>> @@ -191,6 +191,7 @@
>>  #define VTD_ECAP_PT                 (1ULL << 6)
>>  #define VTD_ECAP_SC                 (1ULL << 7)
>>  #define VTD_ECAP_MHMV               (15ULL << 20)
>> +#define VTD_ECAP_NEST               (1ULL << 26)
>>  #define VTD_ECAP_SRS                (1ULL << 31)
>>  #define VTD_ECAP_PASID              (1ULL << 40)
>>  #define VTD_ECAP_SMTS               (1ULL << 43)
>> diff --git a/backends/iommufd.c b/backends/iommufd.c
>> index 574f330c27..0a1a40cbba 100644
>> --- a/backends/iommufd.c
>> +++ b/backends/iommufd.c
>> @@ -370,6 +370,10 @@ static int hiod_iommufd_get_cap(HostIOMMUDevice
>*hiod, int cap, Error **errp)
>>          return caps->type;
>>      case HOST_IOMMU_DEVICE_CAP_AW_BITS:
>>          return vfio_device_get_aw_bits(hiod->agent);
>> +    case HOST_IOMMU_DEVICE_CAP_NESTING:
>> +        return caps->nesting;
>> +    case HOST_IOMMU_DEVICE_CAP_FS1GP:
>> +        return caps->fs1gp;
>this is vtd specific so those caps shouldn't be return for other iommus, no?

A vIOMMU should not query a CAP it doesn't recognize; even if that happens,
zero is returned, hinting that the CAP isn't supported for this vIOMMU.

Thanks
Zhenzhong

>
>Eric
>>      default:
>>          error_setg(errp, "%s: unsupported capability %x", hiod->name, cap);
>>          return -EINVAL;
>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>> index 175c4fe1f4..df6a12d200 100644
>> --- a/hw/vfio/iommufd.c
>> +++ b/hw/vfio/iommufd.c
>> @@ -26,6 +26,7 @@
>>  #include "qemu/chardev_open.h"
>>  #include "pci.h"
>>  #include "exec/ram_addr.h"
>> +#include "hw/i386/intel_iommu_internal.h"
>>
>>  static int iommufd_cdev_map(const VFIOContainerBase *bcontainer, hwaddr
>iova,
>>                              ram_addr_t size, void *vaddr, bool readonly)
>> @@ -843,6 +844,16 @@ static bool
>hiod_iommufd_vfio_realize(HostIOMMUDevice *hiod, void *opaque,
>>      caps->type = type;
>>      caps->hw_caps = hw_caps;
>>
>> +    switch (type) {
>> +    case IOMMU_HW_INFO_TYPE_INTEL_VTD:
>> +        caps->nesting = !!(data.vtd.ecap_reg & VTD_ECAP_NEST);
>> +        caps->fs1gp = !!(data.vtd.cap_reg & VTD_CAP_FS1GP);
>> +        break;
>> +    case IOMMU_HW_INFO_TYPE_ARM_SMMUV3:
>> +    case IOMMU_HW_INFO_TYPE_NONE:
>> +        break;
>> +    }
>> +
>>      return true;
>>  }
>>




* RE: [PATCH rfcv2 10/20] intel_iommu: Optimize context entry cache utilization
  2025-02-21 10:00   ` Eric Auger
@ 2025-02-28  8:34     ` Duan, Zhenzhong
  0 siblings, 0 replies; 68+ messages in thread
From: Duan, Zhenzhong @ 2025-02-28  8:34 UTC (permalink / raw)
  To: eric.auger@redhat.com, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
	jasowang@redhat.com, peterx@redhat.com, jgg@nvidia.com,
	nicolinc@nvidia.com, shameerali.kolothum.thodi@huawei.com,
	joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
	Tian, Kevin, Liu, Yi L, Peng, Chao P, Paolo Bonzini,
	Richard Henderson, Eduardo Habkost, Marcel Apfelbaum

Hi Eric,

>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: Re: [PATCH rfcv2 10/20] intel_iommu: Optimize context entry cache
>utilization
>
>Hi Zhenzhong,
>
>On 2/19/25 9:22 AM, Zhenzhong Duan wrote:
>> There are many call sites referencing context entry by calling
>> vtd_as_to_context_entry() which will traverse the DMAR table.
>didn't you mean vtd_dev_to_context_entry? Instead

Good catch, will fix.

>>
>> In most cases we can use cached context entry in vtd_as->context_cache_entry
>> except it's stale. Currently only global and domain context invalidation
>> stales it.
>s/states/stale

Will fix.

Thanks
Zhenzhong




* RE: [PATCH rfcv2 09/20] intel_iommu: Rename vtd_ce_get_rid2pasid_entry to vtd_ce_get_pasid_entry
  2025-02-21 10:11   ` Eric Auger
@ 2025-02-28  8:47     ` Duan, Zhenzhong
  0 siblings, 0 replies; 68+ messages in thread
From: Duan, Zhenzhong @ 2025-02-28  8:47 UTC (permalink / raw)
  To: eric.auger@redhat.com, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
	jasowang@redhat.com, peterx@redhat.com, jgg@nvidia.com,
	nicolinc@nvidia.com, shameerali.kolothum.thodi@huawei.com,
	joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
	Tian, Kevin, Liu, Yi L, Peng, Chao P, Marcel Apfelbaum,
	Paolo Bonzini, Richard Henderson, Eduardo Habkost

Hi Eric,

>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: Re: [PATCH rfcv2 09/20] intel_iommu: Rename
>vtd_ce_get_rid2pasid_entry to vtd_ce_get_pasid_entry
>
>Hi Zhenzhong,
>
>
>On 2/19/25 9:22 AM, Zhenzhong Duan wrote:
>> In early days vtd_ce_get_rid2pasid_entry() is used to get pasid entry of
>is/was

Will do.

>> rid2pasid, then extend to any pasid. So a new name vtd_ce_get_pasid_entry
>then it was extended to get any pasid entry?

Will do.

>> is better to match its functions.
>to match what it actually does?

Yes, will do. 
>
>I do not know the vtd spec very well so I searched for rid2pasid and I
>did not find any reference. I think I understand what is the pasid entry
>from the pasid table though so the renaming does make sense to me.

In the spec it's named RID_PASID; copying part of the description:

"Requests-without-PASID processed through this scalable-mode
context entry are treated as Requests-with-PASID with PASID value
specified in this field. ExecuteRequested field is treated as 0 for
such requests."

Thanks
Zhenzhong




* RE: [PATCH rfcv2 11/20] intel_iommu: Check for compatibility with IOMMUFD backed device when x-flts=on
  2025-02-21 12:49   ` Eric Auger
  2025-02-21 14:18     ` Eric Auger
@ 2025-02-28  8:57     ` Duan, Zhenzhong
  1 sibling, 0 replies; 68+ messages in thread
From: Duan, Zhenzhong @ 2025-02-28  8:57 UTC (permalink / raw)
  To: eric.auger@redhat.com, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
	jasowang@redhat.com, peterx@redhat.com, jgg@nvidia.com,
	nicolinc@nvidia.com, shameerali.kolothum.thodi@huawei.com,
	joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
	Tian, Kevin, Liu, Yi L, Peng, Chao P, Paolo Bonzini,
	Richard Henderson, Eduardo Habkost, Marcel Apfelbaum



>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: Re: [PATCH rfcv2 11/20] intel_iommu: Check for compatibility with
>IOMMUFD backed device when x-flts=on
>
>Hi Zhenzhong,
>
>
>On 2/19/25 9:22 AM, Zhenzhong Duan wrote:
>> When vIOMMU is configured x-flts=on in scalable mode, stage-1 page table
>> is passed to host to construct nested page table. We need to check
>> compatibility of some critical IOMMU capabilities between vIOMMU and
>> host IOMMU to ensure guest stage-1 page table could be used by host.
>>
>> For instance, vIOMMU supports stage-1 1GB huge page mapping, but host
>> does not, then this IOMMUFD backed device should be failed.
>is this 1GB huge page mapping a requiring for SIOV?

No, but if the guest has configured that support while the host doesn't support
it, the VFIO device plug should fail.

>>
>> Declare an enum type host_iommu_device_iommu_hw_info_type aliased to
>> iommu_hw_info_type which come from iommufd header file. This can avoid
>s/come/comes

Will do.

>> build failure on windows which doesn't support iommufd.
>>
>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>>  include/system/host_iommu_device.h | 13 ++++++++++++
>>  hw/i386/intel_iommu.c              | 34 ++++++++++++++++++++++++++++++
>>  2 files changed, 47 insertions(+)
>>
>> diff --git a/include/system/host_iommu_device.h
>b/include/system/host_iommu_device.h
>> index 250600fc1d..aa3885d7ee 100644
>> --- a/include/system/host_iommu_device.h
>> +++ b/include/system/host_iommu_device.h
>> @@ -133,5 +133,18 @@ struct HostIOMMUDeviceClass {
>>  #define HOST_IOMMU_DEVICE_CAP_FS1GP             3
>>  #define HOST_IOMMU_DEVICE_CAP_ERRATA            4
>>
>> +/**
>> + * enum host_iommu_device_iommu_hw_info_type - IOMMU Hardware Info
>Types
>> + * @HOST_IOMMU_DEVICE_IOMMU_HW_INFO_TYPE_NONE: Used by the
>drivers that do not
>> + *                                             report hardware info
>> + * @HOST_IOMMU_DEVICE_IOMMU_HW_INFO_TYPE_INTEL_VTD: Intel VT-d
>iommu info type
>> + *
>> + * This is alias to enum iommu_hw_info_type but for general purpose.
>> + */
>> +enum host_iommu_device_iommu_hw_info_type {
>> +    HOST_IOMMU_DEVICE_IOMMU_HW_INFO_TYPE_NONE,
>> +    HOST_IOMMU_DEVICE_IOMMU_HW_INFO_TYPE_INTEL_VTD,
>> +};
>> +
>>  #define HOST_IOMMU_DEVICE_CAP_AW_BITS_MAX       64
>>  #endif
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index 7709f55be5..9de60e607d 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -39,6 +39,7 @@
>>  #include "kvm/kvm_i386.h"
>>  #include "migration/vmstate.h"
>>  #include "trace.h"
>> +#include "system/iommufd.h"
>>
>>  /* context entry operations */
>>  #define VTD_CE_GET_RID2PASID(ce) \
>> @@ -4346,6 +4347,39 @@ static bool vtd_check_hiod(IntelIOMMUState *s,
>HostIOMMUDevice *hiod,
>>          return true;
>>      }
>>
>> +    /* Remaining checks are all stage-1 translation specific */
>> +    if (!object_dynamic_cast(OBJECT(hiod),
>TYPE_HOST_IOMMU_DEVICE_IOMMUFD)) {
>> +        error_setg(errp, "Need IOMMUFD backend when x-flts=on");
>> +        return false;
>> +    }
>> +
>> +    ret = hiodc->get_cap(hiod, HOST_IOMMU_DEVICE_CAP_IOMMU_TYPE,
>errp);
>> +    if (ret < 0) {
>> +        return false;
>Can't you simply rely on the check below?

I think not, the code below would overwrite errp.

>> +    }
>> +    if (ret != HOST_IOMMU_DEVICE_IOMMU_HW_INFO_TYPE_INTEL_VTD) {
>> +        error_setg(errp, "Incompatible host platform IOMMU type %d", ret);
>> +        return false;
>> +    }
>> +
>> +    ret = hiodc->get_cap(hiod, HOST_IOMMU_DEVICE_CAP_NESTING, errp);
>> +    if (ret < 0) {
>> +        return false;
>> +    }
>same heere
>> +    if (ret != 1) {
>> +        error_setg(errp, "Host IOMMU doesn't support nested translation");
>> +        return false;
>> +    }
>> +
>> +    ret = hiodc->get_cap(hiod, HOST_IOMMU_DEVICE_CAP_FS1GP, errp);
>> +    if (ret < 0) {
>> +        return false;
>> +    }
>> +    if (s->fs1gp && ret != 1) {
>looking in the vtd spec I don't find FS1GP. Is it the same as FL1GP?
Yes.

>Maybe I am not looking the correct spec though. Why do you need to check
>both ret and fs1gp

ret < 0 means an error happened, e.g., vIOMMU queried an unrecognized cap.
0 or 1 means no error, and indicates unsupported vs. supported FS1GP respectively.

>Even why do you need a member to store the cap? Looks FL1GP can only
>take 0 or 1 value?

You mean s->fs1gp? That's the user configuration for the vIOMMU.
We need to check the user's FS1GP config against the host's FS1GP to ensure compatibility.

Yes, FS1GP takes only 0 or 1; aw_bits can have other values.
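
In other words, the intent of that hunk is (restating the quoted code, with the
return-value semantics spelled out in comments):

    ret = hiodc->get_cap(hiod, HOST_IOMMU_DEVICE_CAP_FS1GP, errp);
    if (ret < 0) {
        /* the query itself failed (e.g. unrecognized cap); errp is already set */
        return false;
    }
    if (s->fs1gp && ret != 1) {
        /* query succeeded: 0 = host lacks FS1GP, 1 = host supports it */
        error_setg(errp, "Stage-1 1GB huge page is unsupported by host IOMMU");
        return false;
    }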

Thanks
Zhenzhong

>> +        error_setg(errp, "Stage-1 1GB huge page is unsupported by host IOMMU");
>> +        return false;
>> +    }
>> +
>>      error_setg(errp, "host device is uncompatible with stage-1 translation");
>>      return false;
>>  }
>Eric



^ permalink raw reply	[flat|nested] 68+ messages in thread

* RE: [PATCH rfcv2 12/20] intel_iommu: Introduce a new structure VTDHostIOMMUDevice
  2025-02-21 13:03   ` Eric Auger
@ 2025-02-28  8:58     ` Duan, Zhenzhong
  0 siblings, 0 replies; 68+ messages in thread
From: Duan, Zhenzhong @ 2025-02-28  8:58 UTC (permalink / raw)
  To: eric.auger@redhat.com, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
	jasowang@redhat.com, peterx@redhat.com, jgg@nvidia.com,
	nicolinc@nvidia.com, shameerali.kolothum.thodi@huawei.com,
	joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
	Tian, Kevin, Liu, Yi L, Peng, Chao P, Marcel Apfelbaum,
	Paolo Bonzini, Richard Henderson, Eduardo Habkost



>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: Re: [PATCH rfcv2 12/20] intel_iommu: Introduce a new structure
>VTDHostIOMMUDevice
>
>
>
>
>On 2/19/25 9:22 AM, Zhenzhong Duan wrote:
>> Introduce a new structure VTDHostIOMMUDevice which replaces
>> HostIOMMUDevice to be stored in hash table.
>>
>> It includes a reference to HostIOMMUDevice and IntelIOMMUState,
>> also includes BDF information which will be used in future
>> patches.
>>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>>  hw/i386/intel_iommu_internal.h |  7 +++++++
>>  include/hw/i386/intel_iommu.h  |  2 +-
>>  hw/i386/intel_iommu.c          | 14 ++++++++++++--
>>  3 files changed, 20 insertions(+), 3 deletions(-)
>>
>> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
>> index 2cda744786..18bc22fc72 100644
>> --- a/hw/i386/intel_iommu_internal.h
>> +++ b/hw/i386/intel_iommu_internal.h
>> @@ -28,6 +28,7 @@
>>  #ifndef HW_I386_INTEL_IOMMU_INTERNAL_H
>>  #define HW_I386_INTEL_IOMMU_INTERNAL_H
>>  #include "hw/i386/intel_iommu.h"
>> +#include "system/host_iommu_device.h"
>>
>>  /*
>>   * Intel IOMMU register specification
>> @@ -608,4 +609,10 @@ typedef struct VTDRootEntry VTDRootEntry;
>>  /* Bits to decide the offset for each level */
>>  #define VTD_LEVEL_BITS           9
>>
>> +typedef struct VTDHostIOMMUDevice {
>> +    IntelIOMMUState *iommu_state;
>> +    PCIBus *bus;
>> +    uint8_t devfn;
>Just to make sure the parent
>
>HostIOMMUDevice has aliased_bus and aliased_devfn. Can you explain why do
>you need both aliased and non aliased info?

Virtual vtd only needs the non-aliased BDF; it uses the non-aliased BDF to index
HostIOMMUDevice to do attachment/detachment.

I remember virtio-iommu needs the aliased BDF to rebuild reserved regions
for the aliased IOMMUDevice.

Thanks
Zhenzhong


^ permalink raw reply	[flat|nested] 68+ messages in thread

* RE: [PATCH rfcv2 13/20] intel_iommu: Add PASID cache management infrastructure
  2025-02-21 17:02   ` Eric Auger
@ 2025-02-28  9:35     ` Duan, Zhenzhong
  0 siblings, 0 replies; 68+ messages in thread
From: Duan, Zhenzhong @ 2025-02-28  9:35 UTC (permalink / raw)
  To: eric.auger@redhat.com, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
	jasowang@redhat.com, peterx@redhat.com, jgg@nvidia.com,
	nicolinc@nvidia.com, shameerali.kolothum.thodi@huawei.com,
	joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
	Tian, Kevin, Liu, Yi L, Peng, Chao P, Yi Sun, Paolo Bonzini,
	Richard Henderson, Eduardo Habkost, Marcel Apfelbaum



>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: Re: [PATCH rfcv2 13/20] intel_iommu: Add PASID cache management
>infrastructure
>
>
>
>
>On 2/19/25 9:22 AM, Zhenzhong Duan wrote:
>> This adds an new entry VTDPASIDCacheEntry in VTDAddressSpace to cache the
>> pasid entry and track PASID usage and future PASID tagged DMA address
>> translation support in vIOMMU.
>>
>> VTDAddressSpace of PCI_NO_PASID is allocated when device is plugged and
>> never freed. For other pasid, VTDAddressSpace instance is created/destroyed
>> per the guest pasid entry set up/destroy for passthrough devices. While for
>> emulated devices, VTDAddressSpace instance is created in the PASID tagged
>DMA
>> translation and be destroyed per guest PASID cache invalidation. This focuses
>> on the PASID cache management for passthrough devices as there is no PASID
>> capable emulated devices yet.
>>
>> When guest modifies a PASID entry, QEMU will capture the guest pasid
>selective
>> pasid cache invalidation, allocate or remove a VTDAddressSpace instance per
>the
>> invalidation reasons:
>>
>>     *) a present pasid entry moved to non-present
>>     *) a present pasid entry to be a present entry
>>     *) a non-present pasid entry moved to present
>>
>> vIOMMU emulator could figure out the reason by fetching latest guest pasid
>entry
>> and compare it with the PASID cache.
>>
>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>>  hw/i386/intel_iommu_internal.h |  29 ++
>>  include/hw/i386/intel_iommu.h  |   6 +
>>  hw/i386/intel_iommu.c          | 484 ++++++++++++++++++++++++++++++++-
>Don't you have ways to split this patch. It has a huge change set and
>this is really heavy to digest at once (at least for me).

Sure, will try.

>>  hw/i386/trace-events           |   4 +
>>  4 files changed, 513 insertions(+), 10 deletions(-)
>>
>> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
>> index 18bc22fc72..632fda2853 100644
>> --- a/hw/i386/intel_iommu_internal.h
>> +++ b/hw/i386/intel_iommu_internal.h
>> @@ -315,6 +315,7 @@ typedef enum VTDFaultReason {
>>                                    * request while disabled */
>>      VTD_FR_IR_SID_ERR = 0x26,   /* Invalid Source-ID */
>>
>> +    VTD_FR_RTADDR_INV_TTM = 0x31,  /* Invalid TTM in RTADDR */
>>      /* PASID directory entry access failure */
>>      VTD_FR_PASID_DIR_ACCESS_ERR = 0x50,
>>      /* The Present(P) field of pasid directory entry is 0 */
>> @@ -492,6 +493,15 @@ typedef union VTDInvDesc VTDInvDesc;
>>  #define VTD_INV_DESC_PIOTLB_RSVD_VAL0     0xfff000000000f1c0ULL
>>  #define VTD_INV_DESC_PIOTLB_RSVD_VAL1     0xf80ULL
>>
>> +#define VTD_INV_DESC_PASIDC_G          (3ULL << 4)
>> +#define VTD_INV_DESC_PASIDC_PASID(val) (((val) >> 32) & 0xfffffULL)
>> +#define VTD_INV_DESC_PASIDC_DID(val)   (((val) >> 16) &
>VTD_DOMAIN_ID_MASK)
>> +#define VTD_INV_DESC_PASIDC_RSVD_VAL0  0xfff000000000f1c0ULL
>> +
>> +#define VTD_INV_DESC_PASIDC_DSI        (0ULL << 4)
>> +#define VTD_INV_DESC_PASIDC_PASID_SI   (1ULL << 4)
>> +#define VTD_INV_DESC_PASIDC_GLOBAL     (3ULL << 4)
>> +
>>  /* Information about page-selective IOTLB invalidate */
>>  struct VTDIOTLBPageInvInfo {
>>      uint16_t domain_id;
>> @@ -548,10 +558,28 @@ typedef struct VTDRootEntry VTDRootEntry;
>>  #define VTD_CTX_ENTRY_LEGACY_SIZE     16
>>  #define VTD_CTX_ENTRY_SCALABLE_SIZE   32
>>
>> +#define VTD_SM_CONTEXT_ENTRY_PDTS(val)      (((val) >> 9) & 0x7)
>>  #define VTD_SM_CONTEXT_ENTRY_RID2PASID_MASK 0xfffff
>>  #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw)  (0x1e0ULL |
>~VTD_HAW_MASK(aw))
>>  #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1      0xffffffffffe00000ULL
>>
>> +typedef enum VTDPCInvType {
>> +    /* force reset all */
>> +    VTD_PASID_CACHE_FORCE_RESET = 0,
>> +    /* pasid cache invalidation rely on guest PASID entry */
>> +    VTD_PASID_CACHE_GLOBAL_INV,
>> +    VTD_PASID_CACHE_DOMSI,
>> +    VTD_PASID_CACHE_PASIDSI,
>> +} VTDPCInvType;
>> +
>> +typedef struct VTDPASIDCacheInfo {
>> +    VTDPCInvType type;
>> +    uint16_t domain_id;
>> +    uint32_t pasid;
>> +    PCIBus *bus;
>> +    uint16_t devfn;
>> +} VTDPASIDCacheInfo;
>> +
>>  /* PASID Table Related Definitions */
>>  #define VTD_PASID_DIR_BASE_ADDR_MASK  (~0xfffULL)
>>  #define VTD_PASID_TABLE_BASE_ADDR_MASK (~0xfffULL)
>> @@ -563,6 +591,7 @@ typedef struct VTDRootEntry VTDRootEntry;
>>  #define VTD_PASID_TABLE_BITS_MASK     (0x3fULL)
>>  #define VTD_PASID_TABLE_INDEX(pasid)  ((pasid) &
>VTD_PASID_TABLE_BITS_MASK)
>>  #define VTD_PASID_ENTRY_FPD           (1ULL << 1) /* Fault Processing Disable
>*/
>> +#define VTD_PASID_TBL_ENTRY_NUM       (1ULL << 6)
>>
>>  /* PASID Granular Translation Type Mask */
>>  #define VTD_PASID_ENTRY_P              1ULL
>> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
>> index 50f9b27a45..fbc9da903a 100644
>> --- a/include/hw/i386/intel_iommu.h
>> +++ b/include/hw/i386/intel_iommu.h
>> @@ -95,6 +95,11 @@ struct VTDPASIDEntry {
>>      uint64_t val[8];
>>  };
>>
>> +typedef struct VTDPASIDCacheEntry {
>> +    struct VTDPASIDEntry pasid_entry;
>> +    bool cache_filled;
>> +} VTDPASIDCacheEntry;
>> +
>>  struct VTDAddressSpace {
>>      PCIBus *bus;
>>      uint8_t devfn;
>> @@ -107,6 +112,7 @@ struct VTDAddressSpace {
>>      MemoryRegion iommu_ir_fault; /* Interrupt region for catching fault */
>>      IntelIOMMUState *iommu_state;
>>      VTDContextCacheEntry context_cache_entry;
>> +    VTDPASIDCacheEntry pasid_cache_entry;
>>      QLIST_ENTRY(VTDAddressSpace) next;
>>      /* Superset of notifier flags that this address space has */
>>      IOMMUNotifierFlag notifier_flags;
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index fafa199f52..b8f3b85803 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -86,6 +86,8 @@ struct vtd_iotlb_key {
>>  static void vtd_address_space_refresh_all(IntelIOMMUState *s);
>>  static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier
>*n);
>>
>> +static void vtd_pasid_cache_reset(IntelIOMMUState *s);
>use _locked suffix to be consistent with the others and emphases the
>lock is held?

Will do.

>> +
>>  static void vtd_panic_require_caching_mode(void)
>>  {
>>      error_report("We need to set caching-mode=on for intel-iommu to enable "
>> @@ -390,6 +392,7 @@ static void vtd_reset_caches(IntelIOMMUState *s)
>>      vtd_iommu_lock(s);
>>      vtd_reset_iotlb_locked(s);
>>      vtd_reset_context_cache_locked(s);
>> +    vtd_pasid_cache_reset(s);
>>      vtd_iommu_unlock(s);
>>  }
>>
>> @@ -825,6 +828,16 @@ static inline bool
>vtd_pe_type_check(IntelIOMMUState *s, VTDPASIDEntry *pe)
>>      }
>>  }
>>
>> +static inline uint32_t vtd_sm_ce_get_pdt_entry_num(VTDContextEntry *ce)
>> +{
>> +    return 1U << (VTD_SM_CONTEXT_ENTRY_PDTS(ce->val[0]) + 7);
>> +}
>> +
>> +static inline uint16_t vtd_pe_get_domain_id(VTDPASIDEntry *pe)
>
>nit: vtd_pe_get_did as the filed is named DID?

Will do.

>
>> +{
>> +    return VTD_SM_PASID_ENTRY_DID((pe)->val[1]);
>> +}
>> +
>>  static inline bool vtd_pdire_present(VTDPASIDDirEntry *pdire)
>>  {
>>      return pdire->val & 1;
>> @@ -1617,6 +1630,54 @@ static int
>vtd_as_to_context_entry(VTDAddressSpace *vtd_as, VTDContextEntry *ce)
>>      }
>>  }
>>
>> +/* Translate to iommu pasid if PCI_NO_PASID */
>I don't really get the comment above. Ay best, shouldn't it be put in

Will do. It's PCI's pasid value vs. the guest's pasid value, which I call the iommu pasid.

>
>(vtd_as->pasid == PCI_NO_PASID) block?
>What you call "iommu pasid" is the value set in RID_PASID, right? Is "iommu
>pasid" a conventional terminology?

The PCI subsystem uses PCI_NO_PASID(-1) to represent a Request-without-PASID,
but the guest doesn't recognize PCI_NO_PASID(-1), so the vIOMMU has to translate
PCI_NO_PASID(-1) to a normal pasid value (pasid >= 0). I call it the iommu pasid
to distinguish it from PCI's pasid; it's not conventional terminology.

>
>> +static int vtd_as_to_iommu_pasid(VTDAddressSpace *vtd_as, uint32_t *pasid)
>> +{
>> +    VTDContextEntry ce;
>> +    int ret;
>> +
>> +    ret = vtd_as_to_context_entry(vtd_as, &ce);
>> +    if (ret) {
>> +        return ret;
>> +    }
>> +
>> +    if (vtd_as->pasid == PCI_NO_PASID) {
>> +        *pasid = VTD_CE_GET_RID2PASID(&ce);
>This is called RID_PASID in the spec. I think it would be easier for the
>reader if could have a direct match with named fields so that we can
>easily seach the spec.

I agree with you about the match, but rid2pasid has been widely used in intel_iommu.c
for a long time. I'd like to hear more voices before renaming.

>> +    } else {
>> +        *pasid = vtd_as->pasid;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +static gboolean vtd_find_as_by_sid_and_iommu_pasid(gpointer key, gpointer
>value,
>> +                                                   gpointer user_data)
>why iommu_pasid and not directly pasid?

Because vtd_as->pasid is PCI's pasid, it needs to be transformed into iommu pasid.

>> +{
>> +    VTDAddressSpace *vtd_as = (VTDAddressSpace *)value;
>> +    struct vtd_as_raw_key *target = (struct vtd_as_raw_key *)user_data;
>> +    uint16_t sid = PCI_BUILD_BDF(pci_bus_num(vtd_as->bus), vtd_as->devfn);
>> +    uint32_t pasid;
>> +
>> +    if (vtd_as_to_iommu_pasid(vtd_as, &pasid)) {
>> +        return false;
>> +    }
>> +
>> +    return (pasid == target->pasid) && (sid == target->sid);
>> +}
>> +
>> +/* Translate iommu pasid to vtd_as */
>> +static VTDAddressSpace *vtd_as_from_iommu_pasid(IntelIOMMUState *s,
>> +                                                uint16_t sid, uint32_t pasid)
>> +{
>> +    struct vtd_as_raw_key key = {
>> +        .sid = sid,
>> +        .pasid = pasid
>> +    };
>> +
>> +    return g_hash_table_find(s->vtd_address_spaces,
>> +                             vtd_find_as_by_sid_and_iommu_pasid, &key);
>> +}
>> +
>>  static int vtd_sync_shadow_page_hook(const IOMMUTLBEvent *event,
>>                                       void *private)
>>  {
>> @@ -3062,6 +3123,412 @@ static bool
>vtd_process_piotlb_desc(IntelIOMMUState *s,
>>      return true;
>>  }
>>
>> +static inline int vtd_dev_get_pe_from_pasid(VTDAddressSpace *vtd_as,
>> +                                            uint32_t pasid, VTDPASIDEntry *pe)
>does "pe" means pasid entry? It is not obvious for a dummy reader like
>me. May be worth a comment at least once.

Maybe the VTDPASIDEntry type name gives enough of a hint?

>> +{
>> +    IntelIOMMUState *s = vtd_as->iommu_state;
>> +    VTDContextEntry ce;
>> +    int ret;
>> +
>> +    if (!s->root_scalable) {
>> +        return -VTD_FR_RTADDR_INV_TTM;
>> +    }
>> +
>> +    ret = vtd_as_to_context_entry(vtd_as, &ce);
>> +    if (ret) {
>> +        return ret;
>> +    }
>> +
>> +    return vtd_ce_get_pasid_entry(s, &ce, pe, pasid);
>> +}
>> +
>> +static bool vtd_pasid_entry_compare(VTDPASIDEntry *p1, VTDPASIDEntry *p2)
>> +{
>> +    return !memcmp(p1, p2, sizeof(*p1));
>> +}
>> +
>> +/*
>> + * This function fills in the pasid entry in &vtd_as. Caller
>> + * of this function should hold iommu_lock.
>> + */
>> +static void vtd_fill_pe_in_cache(IntelIOMMUState *s, VTDAddressSpace
>*vtd_as,
>> +                                 VTDPASIDEntry *pe)
>seems some other functions used the _locked suffix

Yes, vtd_pasid_cache_reset_locked() does, so there is no need to add the suffix here, right?

>> +{
>> +    VTDPASIDCacheEntry *pc_entry = &vtd_as->pasid_cache_entry;
>> +
>> +    if (vtd_pasid_entry_compare(pe, &pc_entry->pasid_entry)) {
>> +        /* No need to go further as cached pasid entry is latest */
>> +        return;
>> +    }
>> +
>> +    pc_entry->pasid_entry = *pe;
>> +    pc_entry->cache_filled = true;
>> +    /*
>> +     * TODO: send pasid bind to host for passthru devices
>> +     */
>what does it mean?

Here it means a new entry is found, and we need to send the pasid<->hwpt binding
to the host.
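
For context, a rough sketch of what that bind might eventually look like, reusing
the HostIOMMUDeviceIOMMUFD attach_hwpt callback introduced earlier in this series;
the function name, the vtd_create_s1_hwpt() helper and the GET_CLASS macro usage
are assumptions for illustration, not code from this patch:

    /*
     * Sketch only: bind the guest stage-1 table behind this pasid entry to
     * the host by attaching the device to a nested HWPT.  The HWPT
     * allocation helper below is hypothetical.
     */
    static bool vtd_bind_pe_to_host(HostIOMMUDeviceIOMMUFD *idev,
                                    VTDPASIDEntry *pe, Error **errp)
    {
        HostIOMMUDeviceIOMMUFDClass *idevc =
            HOST_IOMMU_DEVICE_IOMMUFD_GET_CLASS(idev);
        uint32_t hwpt_id;

        if (!vtd_create_s1_hwpt(idev, pe, &hwpt_id, errp)) { /* hypothetical */
            return false;
        }
        return idevc->attach_hwpt(idev, hwpt_id, errp);
    }
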

>> +}
>> +
>> +/*
>> + * This function is used to clear cached pasid entry in vtd_as
>> + * instances. Caller of this function should hold iommu_lock.
>> + */
>> +static gboolean vtd_flush_pasid(gpointer key, gpointer value,
>> +                                gpointer user_data)
>> +{
>> +    VTDPASIDCacheInfo *pc_info = user_data;
>> +    VTDAddressSpace *vtd_as = value;
>> +    IntelIOMMUState *s = vtd_as->iommu_state;
>> +    VTDPASIDCacheEntry *pc_entry = &vtd_as->pasid_cache_entry;
>> +    VTDPASIDEntry pe;
>> +    uint16_t did;
>> +    uint32_t pasid;
>> +    int ret;
>> +
>> +    /* Replay only fill pasid entry cache for passthrough device */
>filled

Will do.

Thanks
Zhenzhong

^ permalink raw reply	[flat|nested] 68+ messages in thread

* RE: [PATCH rfcv2 01/20] backends/iommufd: Add helpers for invalidating user-managed HWPT
  2025-02-24 10:03   ` Shameerali Kolothum Thodi via
@ 2025-02-28  9:36     ` Duan, Zhenzhong
  0 siblings, 0 replies; 68+ messages in thread
From: Duan, Zhenzhong @ 2025-02-28  9:36 UTC (permalink / raw)
  To: Shameerali Kolothum Thodi, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, eric.auger@redhat.com,
	mst@redhat.com, jasowang@redhat.com, peterx@redhat.com,
	jgg@nvidia.com, nicolinc@nvidia.com, joao.m.martins@oracle.com,
	clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
	Peng, Chao P

Hi Shameer,

>-----Original Message-----
>From: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
>Subject: RE: [PATCH rfcv2 01/20] backends/iommufd: Add helpers for invalidating
>user-managed HWPT
>
>Hi Zhenzhong,
>
>> -----Original Message-----
>> From: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> Sent: Wednesday, February 19, 2025 8:22 AM
>> To: qemu-devel@nongnu.org
>> Cc: alex.williamson@redhat.com; clg@redhat.com; eric.auger@redhat.com;
>> mst@redhat.com; jasowang@redhat.com; peterx@redhat.com;
>> jgg@nvidia.com; nicolinc@nvidia.com; Shameerali Kolothum Thodi
>> <shameerali.kolothum.thodi@huawei.com>; joao.m.martins@oracle.com;
>> clement.mathieu--drif@eviden.com; kevin.tian@intel.com;
>> yi.l.liu@intel.com; chao.p.peng@intel.com; Zhenzhong Duan
>> <zhenzhong.duan@intel.com>
>> Subject: [PATCH rfcv2 01/20] backends/iommufd: Add helpers for
>> invalidating user-managed HWPT
>>
>> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>>  include/system/iommufd.h |  3 +++
>>  backends/iommufd.c       | 30 ++++++++++++++++++++++++++++++
>>  backends/trace-events    |  1 +
>>  3 files changed, 34 insertions(+)
>>
>> diff --git a/include/system/iommufd.h b/include/system/iommufd.h
>> index cbab75bfbf..5d02e9d148 100644
>> --- a/include/system/iommufd.h
>> +++ b/include/system/iommufd.h
>> @@ -61,6 +61,9 @@ bool
>> iommufd_backend_get_dirty_bitmap(IOMMUFDBackend *be, uint32_t
>> hwpt_id,
>>                                        uint64_t iova, ram_addr_t size,
>>                                        uint64_t page_size, uint64_t *data,
>>                                        Error **errp);
>> +int iommufd_backend_invalidate_cache(IOMMUFDBackend *be, uint32_t
>> hwpt_id,
>> +                                     uint32_t data_type, uint32_t entry_len,
>> +                                     uint32_t *entry_num, void *data_ptr);
>>
>>  #define TYPE_HOST_IOMMU_DEVICE_IOMMUFD
>> TYPE_HOST_IOMMU_DEVICE "-iommufd"
>>  #endif
>> diff --git a/backends/iommufd.c b/backends/iommufd.c
>> index d57da44755..fc32aad5cb 100644
>> --- a/backends/iommufd.c
>> +++ b/backends/iommufd.c
>> @@ -311,6 +311,36 @@ bool
>> iommufd_backend_get_device_info(IOMMUFDBackend *be, uint32_t devid,
>>      return true;
>>  }
>>
>> +int iommufd_backend_invalidate_cache(IOMMUFDBackend *be, uint32_t
>> hwpt_id,
>
>Nit: As per struct iommu_hwpt_invalidate documentation this can be an ID of
>Nested HWPT or vIOMMU.  May be better to rename this just to id.

Sure, will do.

Thanks
Zhenzhong


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH rfcv2 03/20] HostIOMMUDevice: Introduce realize_late callback
  2025-02-28  8:16     ` Duan, Zhenzhong
@ 2025-03-06 15:53       ` Eric Auger
  0 siblings, 0 replies; 68+ messages in thread
From: Eric Auger @ 2025-03-06 15:53 UTC (permalink / raw)
  To: Duan, Zhenzhong, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
	jasowang@redhat.com, peterx@redhat.com, jgg@nvidia.com,
	nicolinc@nvidia.com, shameerali.kolothum.thodi@huawei.com,
	joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
	Tian, Kevin, Liu, Yi L, Peng, Chao P


Hi Zhenzhong,

On 2/28/25 9:16 AM, Duan, Zhenzhong wrote:
>
>> -----Original Message-----
>> From: Eric Auger <eric.auger@redhat.com>
>> Subject: Re: [PATCH rfcv2 03/20] HostIOMMUDevice: Introduce realize_late
>> callback
>>
>>
>>
>>
>> On 2/19/25 9:22 AM, Zhenzhong Duan wrote:
>>> Currently we have realize() callback which is called before attachment.
>>> But there are still some elements e.g., hwpt_id is not ready before
>>> attachment. So we need a realize_late() callback to further initialize
>>> them.
> >from the description it is not obvious why the realize() could not have
>> been called after the attach. Could you remind the reader what is the
>> reason?
> Sure, will rephrase as below:
>
> " HostIOMMUDevice provides some elements to vIOMMU, but there are some which
> are ready after attachment, e.g., hwpt_id.
>
> Before create and attach to a new hwpt with IOMMU dirty tracking capability,
> we have to call realize() to get if hardware IOMMU supports dirty tracking
> capability.
>
> So moving realize() after attach() will not work here, we need a new callback
> realize_late() to further initialize those elements.
>
> Currently, this callback is only useful for iommufd backend. For legacy
> backend nothing needs to be initialized after attachment. "

OK this helps me

Thanks

Eric
>
> Thanks
> Zhenzhong
>



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH rfcv2 05/20] vfio/iommufd: Implement [at|de]tach_hwpt handlers
  2025-02-28  8:24     ` Duan, Zhenzhong
@ 2025-03-06 15:56       ` Eric Auger
  0 siblings, 0 replies; 68+ messages in thread
From: Eric Auger @ 2025-03-06 15:56 UTC (permalink / raw)
  To: Duan, Zhenzhong, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
	jasowang@redhat.com, peterx@redhat.com, jgg@nvidia.com,
	nicolinc@nvidia.com, shameerali.kolothum.thodi@huawei.com,
	joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
	Tian, Kevin, Liu, Yi L, Peng, Chao P




On 2/28/25 9:24 AM, Duan, Zhenzhong wrote:
>
>> -----Original Message-----
>> From: Eric Auger <eric.auger@redhat.com>
>> Subject: Re: [PATCH rfcv2 05/20] vfio/iommufd: Implement [at|de]tach_hwpt
>> handlers
>>
>>
>>
>>
>> On 2/19/25 9:22 AM, Zhenzhong Duan wrote:
>>> Implement [at|de]tach_hwpt handlers in VFIO subsystem. vIOMMU
>>> utilizes them to attach to or detach from hwpt on host side.
>>>
>>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>> ---
>>>  hw/vfio/iommufd.c | 22 ++++++++++++++++++++++
>>>  1 file changed, 22 insertions(+)
>>>
>>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>>> index 53639bf88b..175c4fe1f4 100644
>>> --- a/hw/vfio/iommufd.c
>>> +++ b/hw/vfio/iommufd.c
>>> @@ -802,6 +802,24 @@ static void
>> vfio_iommu_iommufd_class_init(ObjectClass *klass, void *data)
>>>      vioc->query_dirty_bitmap = iommufd_query_dirty_bitmap;
>>>  };
>>>
>>> +static bool
>> can't we return an integer instead. This looks more standard to me
> I can do that, but I remember VFIO honors bool return value
> whenever possible. We had ever cleanup patches to make all functions
> return bool when possible. Do we really want to return int for only these
> two functions?
I now remember those patches from Cédric. As I mentioned earlier, I have
not found in the errp doc that this is a requirement, but nevertheless
ignore this comment then ;-)

Eric
>
> Thanks
> Zhenzhong
>
>> Eric
>> +host_iommu_device_iommufd_vfio_attach_hwpt(HostIOMMUDeviceIOMMUFD
>> *idev,
>>> +                                           uint32_t hwpt_id, Error **errp)
>>> +{
>>> +    VFIODevice *vbasedev = HOST_IOMMU_DEVICE(idev)->agent;
>>> +
>>> +    return !iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt_id, errp);
>>> +}
>>> +
>>> +static bool
>>>
>> +host_iommu_device_iommufd_vfio_detach_hwpt(HostIOMMUDeviceIOMMUF
>> D *idev,
>>> +                                           Error **errp)
>>> +{
>>> +    VFIODevice *vbasedev = HOST_IOMMU_DEVICE(idev)->agent;
>>> +
>>> +    return iommufd_cdev_detach_ioas_hwpt(vbasedev, errp);
>>> +}
>>> +
>>>  static bool hiod_iommufd_vfio_realize(HostIOMMUDevice *hiod, void
>> *opaque,
>>>                                        Error **errp)
>>>  {
>>> @@ -863,11 +881,15 @@
>> hiod_iommufd_vfio_get_page_size_mask(HostIOMMUDevice *hiod)
>>>  static void hiod_iommufd_vfio_class_init(ObjectClass *oc, void *data)
>>>  {
>>>      HostIOMMUDeviceClass *hiodc = HOST_IOMMU_DEVICE_CLASS(oc);
>>> +    HostIOMMUDeviceIOMMUFDClass *idevc =
>> HOST_IOMMU_DEVICE_IOMMUFD_CLASS(oc);
>>>      hiodc->realize = hiod_iommufd_vfio_realize;
>>>      hiodc->realize_late = hiod_iommufd_vfio_realize_late;
>>>      hiodc->get_iova_ranges = hiod_iommufd_vfio_get_iova_ranges;
>>>      hiodc->get_page_size_mask = hiod_iommufd_vfio_get_page_size_mask;
>>> +
>>> +    idevc->attach_hwpt = host_iommu_device_iommufd_vfio_attach_hwpt;
>>> +    idevc->detach_hwpt = host_iommu_device_iommufd_vfio_detach_hwpt;
>>>  };
>>>
>>>  static const TypeInfo types[] = {



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH rfcv2 06/20] host_iommu_device: Define two new capabilities HOST_IOMMU_DEVICE_CAP_[NESTING|FS1GP]
  2025-02-28  8:29     ` Duan, Zhenzhong
@ 2025-03-06 15:59       ` Eric Auger
  2025-03-06 19:45         ` Nicolin Chen
  0 siblings, 1 reply; 68+ messages in thread
From: Eric Auger @ 2025-03-06 15:59 UTC (permalink / raw)
  To: Duan, Zhenzhong, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
	jasowang@redhat.com, peterx@redhat.com, jgg@nvidia.com,
	nicolinc@nvidia.com, shameerali.kolothum.thodi@huawei.com,
	joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
	Tian, Kevin, Liu, Yi L, Peng, Chao P




On 2/28/25 9:29 AM, Duan, Zhenzhong wrote:
> Hi Eric,
>
>> -----Original Message-----
>> From: Eric Auger <eric.auger@redhat.com>
>> Subject: Re: [PATCH rfcv2 06/20] host_iommu_device: Define two new
>> capabilities HOST_IOMMU_DEVICE_CAP_[NESTING|FS1GP]
>>
>> Hi Zhenzhong,
>>
>>
>> On 2/19/25 9:22 AM, Zhenzhong Duan wrote:
>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>> ---
>>>  include/system/host_iommu_device.h | 8 ++++++++
>>>  1 file changed, 8 insertions(+)
>>>
>>> diff --git a/include/system/host_iommu_device.h
>> b/include/system/host_iommu_device.h
>>> index df782598f2..18f8b5e5cf 100644
>>> --- a/include/system/host_iommu_device.h
>>> +++ b/include/system/host_iommu_device.h
>>> @@ -22,10 +22,16 @@
>>>   *
>>>   * @hw_caps: host platform IOMMU capabilities (e.g. on IOMMUFD this
>> represents
>>>   *           the @out_capabilities value returned from IOMMU_GET_HW_INFO
>> ioctl)
>>> + *
>>> + * @nesting: nesting page table support.
>>> + *
>>> + * @fs1gp: first stage(a.k.a, Stage-1) 1GB huge page support.
>>>   */
>>>  typedef struct HostIOMMUDeviceCaps {
>>>      uint32_t type;
>>>      uint64_t hw_caps;
>>> +    bool nesting;
>>> +    bool fs1gp;
>> this looks quite vtd specific, isn't it? Shouldn't we hide this is a
>> vendor specific cap struct?
> Yes? I guess ARM hw could also provide nesting support at least
> There are some reasons I perfer a flatten struct even if some
> Elements may be vendor specific.
> 1. If a vendor doesn't support an capability for other vendor,
> corresponding element should be zero by default.
> 2. An element vendor specific may become generic in future
> and we don't need to update the structure when that happens.
> 3. vIOMMU calls get_cap() to query if a capability is supported,
> so a vIOMMU never query a vendor specific capability it doesn't
> recognize. Even if that happens, zero is returned hinting no support.
I will let others comment, but in general this is frowned upon and unions
are preferred at least.
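
As an illustration of that direction, one possible shape (a sketch only, not an
agreed design) keeps the generic fields and pushes the vendor-specific bits into
a union selected by @type:

    typedef struct HostIOMMUDeviceCaps {
        uint32_t type;       /* enum iommu_hw_info_type */
        uint64_t hw_caps;
        union {
            struct {
                bool nesting;
                bool fs1gp;
            } vtd;
            /* room for e.g. an smmuv3 member later */
        } vendor;
    } HostIOMMUDeviceCaps;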

Eric
>
> Thanks
> Zhenzhong
>



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH rfcv2 06/20] host_iommu_device: Define two new capabilities HOST_IOMMU_DEVICE_CAP_[NESTING|FS1GP]
  2025-03-06 15:59       ` Eric Auger
@ 2025-03-06 19:45         ` Nicolin Chen
  2025-03-10  3:48           ` Duan, Zhenzhong
  0 siblings, 1 reply; 68+ messages in thread
From: Nicolin Chen @ 2025-03-06 19:45 UTC (permalink / raw)
  To: Eric Auger
  Cc: Duan, Zhenzhong, qemu-devel@nongnu.org,
	alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
	jasowang@redhat.com, peterx@redhat.com, jgg@nvidia.com,
	shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
	clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
	Peng, Chao P

On Thu, Mar 06, 2025 at 04:59:39PM +0100, Eric Auger wrote:
> >>> +++ b/include/system/host_iommu_device.h
> >>> @@ -22,10 +22,16 @@
> >>>   *
> >>>   * @hw_caps: host platform IOMMU capabilities (e.g. on IOMMUFD this
> >> represents
> >>>   *           the @out_capabilities value returned from IOMMU_GET_HW_INFO
> >> ioctl)
> >>> + *
> >>> + * @nesting: nesting page table support.
> >>> + *
> >>> + * @fs1gp: first stage(a.k.a, Stage-1) 1GB huge page support.
> >>>   */
> >>>  typedef struct HostIOMMUDeviceCaps {
> >>>      uint32_t type;
> >>>      uint64_t hw_caps;
> >>> +    bool nesting;
> >>> +    bool fs1gp;
> >> this looks quite vtd specific, isn't it? Shouldn't we hide this is a
> >> vendor specific cap struct?
> > Yes? I guess ARM hw could also provide nesting support at least
> > There are some reasons I perfer a flatten struct even if some
> > Elements may be vendor specific.
> > 1. If a vendor doesn't support an capability for other vendor,
> > corresponding element should be zero by default.
> > 2. An element vendor specific may become generic in future
> > and we don't need to update the structure when that happens.
> > 3. vIOMMU calls get_cap() to query if a capability is supported,
> > so a vIOMMU never query a vendor specific capability it doesn't
> > recognize. Even if that happens, zero is returned hinting no support.
> I will let others comment but in general this is frown upon and unions
> are prefered at least.

Yeah, it feels odd to me that we stuff vendor-specific things into
the public structure.

It's okay if we want to store in HostIOMMUDeviceCaps the vendor
specific data pointer (opaque), just for convenience.

I think we can have another PCIIOMMUOps op for vendor code to
run iommufd_backend_get_device_info() that returns the hw_caps
for the core code to read.

Or perhaps the vendor code can just return a HWPT directly? If
IOMMU_HW_CAP_DIRTY_TRACKING is set in the hw_caps, the vendor
code can allocate a HWPT for that. And if vendor code detects
the "nesting" cap in vendor struct, then return a nest_parent
HWPT. And returning NULL can let core code allocate a default
HWPT (or just attach the device to IOAS for auto domain/hwpt).

I am also hoping that this can handle a shared S2 nest_parent
HWPT case. Could the core container structure or so store the
HWPT?

Thanks
Nicolin


^ permalink raw reply	[flat|nested] 68+ messages in thread

* RE: [PATCH rfcv2 06/20] host_iommu_device: Define two new capabilities HOST_IOMMU_DEVICE_CAP_[NESTING|FS1GP]
  2025-03-06 19:45         ` Nicolin Chen
@ 2025-03-10  3:48           ` Duan, Zhenzhong
  0 siblings, 0 replies; 68+ messages in thread
From: Duan, Zhenzhong @ 2025-03-10  3:48 UTC (permalink / raw)
  To: Nicolin Chen, Eric Auger
  Cc: qemu-devel@nongnu.org, alex.williamson@redhat.com, clg@redhat.com,
	mst@redhat.com, jasowang@redhat.com, peterx@redhat.com,
	jgg@nvidia.com, shameerali.kolothum.thodi@huawei.com,
	joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
	Tian, Kevin, Liu, Yi L, Peng, Chao P

Hi Eric, Nicolin,

>-----Original Message-----
>From: Nicolin Chen <nicolinc@nvidia.com>
>Subject: Re: [PATCH rfcv2 06/20] host_iommu_device: Define two new
>capabilities HOST_IOMMU_DEVICE_CAP_[NESTING|FS1GP]
>
>On Thu, Mar 06, 2025 at 04:59:39PM +0100, Eric Auger wrote:
>> >>> +++ b/include/system/host_iommu_device.h
>> >>> @@ -22,10 +22,16 @@
>> >>>   *
>> >>>   * @hw_caps: host platform IOMMU capabilities (e.g. on IOMMUFD this
>> >> represents
>> >>>   *           the @out_capabilities value returned from
>IOMMU_GET_HW_INFO
>> >> ioctl)
>> >>> + *
>> >>> + * @nesting: nesting page table support.
>> >>> + *
>> >>> + * @fs1gp: first stage(a.k.a, Stage-1) 1GB huge page support.
>> >>>   */
>> >>>  typedef struct HostIOMMUDeviceCaps {
>> >>>      uint32_t type;
>> >>>      uint64_t hw_caps;
>> >>> +    bool nesting;
>> >>> +    bool fs1gp;
>> >> this looks quite vtd specific, isn't it? Shouldn't we hide this is a
>> >> vendor specific cap struct?
>> > Yes? I guess ARM hw could also provide nesting support at least
>> > There are some reasons I perfer a flatten struct even if some
>> > Elements may be vendor specific.
>> > 1. If a vendor doesn't support an capability for other vendor,
>> > corresponding element should be zero by default.
>> > 2. An element vendor specific may become generic in future
>> > and we don't need to update the structure when that happens.
>> > 3. vIOMMU calls get_cap() to query if a capability is supported,
>> > so a vIOMMU never query a vendor specific capability it doesn't
>> > recognize. Even if that happens, zero is returned hinting no support.
>> I will let others comment but in general this is frown upon and unions
>> are prefered at least.
>
>Yea, it feels odd to me that we stuff vendor specific thing in
>the public structure.
>
>It's okay if we want to store in HostIOMMUDeviceCaps the vendor
>specific data pointer (opaque), just for convenience.
>
>I think we can have another PCIIOMMUOps op for vendor code to
>run iommufd_backend_get_device_info() that returns the hw_caps
>for the core code to read.
>
>Or perhaps the vendor code can just return a HWPT directly? If
>IOMMU_HW_CAP_DIRTY_TRACKING is set in the hw_caps, the vendor
>code can allocate a HWPT for that. And if vendor code detects
>the "nesting" cap in vendor struct, then return a nest_parent
>HWPT. And returning NULL can let core code allocate a default
>HWPT (or just attach the device to IOAS for auto domain/hwpt).
>
>I am also hoping that this can handle a shared S2 nest_parent
>HWPT case. Could the core container structure or so store the
>HWPT?

Thanks for your suggestions. It is becoming clearer to me.
I'll update the code and send a new version after I finish a higher
priority task at my company.

Thanks
Zhenzhong


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH rfcv2 00/20] intel_iommu: Enable stage-1 translation for passthrough device
  2025-02-19  8:22 [PATCH rfcv2 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
                   ` (20 preceding siblings ...)
  2025-02-20 19:03 ` [PATCH rfcv2 00/20] intel_iommu: Enable stage-1 translation for passthrough device Eric Auger
@ 2025-04-05  3:01 ` Donald Dutile
  2025-05-19  8:37   ` Duan, Zhenzhong
  21 siblings, 1 reply; 68+ messages in thread
From: Donald Dutile @ 2025-04-05  3:01 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, jgg,
	nicolinc, shameerali.kolothum.thodi, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng

Zhenzhong,

Hi!
Eric asked me to review this series.
Since it's rather late since you posted, I'll summarize my review feedback at the bottom.

- Don

On 2/19/25 3:22 AM, Zhenzhong Duan wrote:
> Hi,
> 
> Per Jason Wang's suggestion, iommufd nesting series[1] is split into
> "Enable stage-1 translation for emulated device" series and
> "Enable stage-1 translation for passthrough device" series.
> 
> This series is 2nd part focusing on passthrough device. We don't do
> shadowing of guest page table for passthrough device but pass stage-1
> page table to host side to construct a nested domain. There was some
> effort to enable this feature in old days, see [2] for details.
> 
> The key design is to utilize the dual-stage IOMMU translation
> (also known as IOMMU nested translation) capability in host IOMMU.
> As the below diagram shows, guest I/O page table pointer in GPA
> (guest physical address) is passed to host and be used to perform
> the stage-1 address translation. Along with it, modifications to
> present mappings in the guest I/O page table should be followed
> with an IOTLB invalidation.
> 
>          .-------------.  .---------------------------.
>          |   vIOMMU    |  | Guest I/O page table      |
>          |             |  '---------------------------'
>          .----------------/
>          | PASID Entry |--- PASID cache flush --+
>          '-------------'                        |
>          |             |                        V
>          |             |           I/O page table pointer in GPA
>          '-------------'
>      Guest
>      ------| Shadow |---------------------------|--------
>            v        v                           v
>      Host
>          .-------------.  .------------------------.
>          |   pIOMMU    |  |  FS for GIOVA->GPA     |
>          |             |  '------------------------'
>          .----------------/  |
>          | PASID Entry |     V (Nested xlate)
>          '----------------\.----------------------------------.
>          |             |   | SS for GPA->HPA, unmanaged domain|
>          |             |   '----------------------------------'
>          '-------------'
> Where:
>   - FS = First stage page tables
>   - SS = Second stage page tables
> <Intel VT-d Nested translation>
> 
I'd prefer the use of 's1' for stage1/First stage, and 's2' for stage2/second stage.
We don't need different terms for the same technology in the iommu/iommufd space(s).

> There are some interactions between VFIO and vIOMMU
> * vIOMMU registers PCIIOMMUOps [set|unset]_iommu_device to PCI
>    subsystem. VFIO calls them to register/unregister HostIOMMUDevice
>    instance to vIOMMU at vfio device realize stage.
> * vIOMMU calls HostIOMMUDeviceIOMMUFD interface [at|de]tach_hwpt
>    to bind/unbind device to IOMMUFD backed domains, either nested
>    domain or not.
> 
> See below diagram:
> 
>          VFIO Device                                 Intel IOMMU
>      .-----------------.                         .-------------------.
>      |                 |                         |                   |
>      |       .---------|PCIIOMMUOps              |.-------------.    |
>      |       | IOMMUFD |(set_iommu_device)       || Host IOMMU  |    |
>      |       | Device  |------------------------>|| Device list |    |
>      |       .---------|(unset_iommu_device)     |.-------------.    |
>      |                 |                         |       |           |
>      |                 |                         |       V           |
>      |       .---------|  HostIOMMUDeviceIOMMUFD |  .-------------.  |
>      |       | IOMMUFD |            (attach_hwpt)|  | Host IOMMU  |  |
>      |       | link    |<------------------------|  |   Device    |  |
>      |       .---------|            (detach_hwpt)|  .-------------.  |
>      |                 |                         |       |           |
>      |                 |                         |       ...         |
>      .-----------------.                         .-------------------.
> 
> Based on Yi's suggestion, this design is optimal in sharing ioas/hwpt
> whenever possible and create new one on demand, also supports multiple
> iommufd objects and ERRATA_772415.
> 
> E.g., Stage-2 page table could be shared by different devices if there
> is no conflict and devices link to same iommufd object, i.e. devices
> under same host IOMMU can share same stage-2 page table. If there is
and 'devices under the same guest'.
Different guests can't be sharing the same stage-2 page tables.

> conflict, i.e. there is one device under non cache coherency mode
> which is different from others, it requires a separate stage-2 page
> table in non-CC mode.
> 
> SPR platform has ERRATA_772415 which requires no readonly mappings
> in stage-2 page table. This series supports creating VTDIOASContainer
> with no readonly mappings. If there is a rare case that some IOMMUs
> on a multiple IOMMU host have ERRATA_772415 and others not, this
> design can still survive.
> 
> See below example diagram for a full view:
> 
>        IntelIOMMUState
>               |
>               V
>      .------------------.    .------------------.    .-------------------.
>      | VTDIOASContainer |--->| VTDIOASContainer |--->| VTDIOASContainer  |-->...
>      | (iommufd0,RW&RO) |    | (iommufd1,RW&RO) |    | (iommufd0,RW only)|
>      .------------------.    .------------------.    .-------------------.
>               |                       |                              |
>               |                       .-->...                        |
>               V                                                      V
>        .-------------------.    .-------------------.          .---------------.
>        |   VTDS2Hwpt(CC)   |--->| VTDS2Hwpt(non-CC) |-->...    | VTDS2Hwpt(CC) |-->...
>        .-------------------.    .-------------------.          .---------------.
>            |            |               |                            |
>            |            |               |                            |
>      .-----------.  .-----------.  .------------.              .------------.
>      | IOMMUFD   |  | IOMMUFD   |  | IOMMUFD    |              | IOMMUFD    |
>      | Device(CC)|  | Device(CC)|  | Device     |              | Device(CC) |
>      | (iommufd0)|  | (iommufd0)|  | (non-CC)   |              | (errata)   |
>      |           |  |           |  | (iommufd0) |              | (iommufd0) |
>      .-----------.  .-----------.  .------------.              .------------.
> 
> This series is also a prerequisite work for vSVA, i.e. Sharing
> guest application address space with passthrough devices.
> 
> To enable stage-1 translation, only need to add "x-scalable-mode=on,x-flts=on".
> i.e. -device intel-iommu,x-scalable-mode=on,x-flts=on...
> 
> Passthrough device should use iommufd backend to work with stage-1 translation.
> i.e. -object iommufd,id=iommufd0 -device vfio-pci,iommufd=iommufd0,...
> 
> If host doesn't support nested translation, qemu will fail with an unsupported
> report.
> 
> Test done:
> - VFIO devices hotplug/unplug
> - different VFIO devices linked to different iommufds
> - vhost net device ping test
> 
> PATCH1-8:  Add HWPT-based nesting infrastructure support
> PATCH9-10: Some cleanup work
> PATCH11:   cap/ecap related compatibility check between vIOMMU and Host IOMMU
> PATCH12-19:Implement stage-1 page table for passthrough device
> PATCH20:   Enable stage-1 translation for passthrough device
> 
> Qemu code can be found at [3]
> 
> TODO:
> - RAM discard
> - dirty tracking on stage-2 page table
> 
> [1] https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg02740.html
> [2] https://patchwork.kernel.org/project/kvm/cover/20210302203827.437645-1-yi.l.liu@intel.com/
> [3] https://github.com/yiliu1765/qemu/tree/zhenzhong/iommufd_nesting_rfcv2
> 
> Thanks
> Zhenzhong
> 
> Changelog:
> rfcv2:
> - Drop VTDPASIDAddressSpace and use VTDAddressSpace (Eric, Liuyi)
> - Move HWPT uAPI patches ahead(patch1-8) so arm nesting could easily rebase
> - add two cleanup patches(patch9-10)
> - VFIO passes iommufd/devid/hwpt_id to vIOMMU instead of iommufd/devid/ioas_id
> - add vtd_as_[from|to]_iommu_pasid() helper to translate between vtd_as and
>    iommu pasid, this is important for dropping VTDPASIDAddressSpace
> 
> Yi Liu (3):
>    intel_iommu: Replay pasid binds after context cache invalidation
>    intel_iommu: Propagate PASID-based iotlb invalidation to host
>    intel_iommu: Refresh pasid bind when either SRTP or TE bit is changed
> 
> Zhenzhong Duan (17):
>    backends/iommufd: Add helpers for invalidating user-managed HWPT
>    vfio/iommufd: Add properties and handlers to
>      TYPE_HOST_IOMMU_DEVICE_IOMMUFD
>    HostIOMMUDevice: Introduce realize_late callback
>    vfio/iommufd: Implement HostIOMMUDeviceClass::realize_late() handler
>    vfio/iommufd: Implement [at|de]tach_hwpt handlers
>    host_iommu_device: Define two new capabilities
>      HOST_IOMMU_DEVICE_CAP_[NESTING|FS1GP]
>    iommufd: Implement query of HOST_IOMMU_DEVICE_CAP_[NESTING|FS1GP]
>    iommufd: Implement query of HOST_IOMMU_DEVICE_CAP_ERRATA
>    intel_iommu: Rename vtd_ce_get_rid2pasid_entry to
>      vtd_ce_get_pasid_entry
>    intel_iommu: Optimize context entry cache utilization
>    intel_iommu: Check for compatibility with IOMMUFD backed device when
>      x-flts=on
>    intel_iommu: Introduce a new structure VTDHostIOMMUDevice
>    intel_iommu: Add PASID cache management infrastructure
>    intel_iommu: Bind/unbind guest page table to host
>    intel_iommu: ERRATA_772415 workaround
>    intel_iommu: Bypass replay in stage-1 page table mode
>    intel_iommu: Enable host device when x-flts=on in scalable mode
> 
>   hw/i386/intel_iommu_internal.h     |   56 +
>   include/hw/i386/intel_iommu.h      |   33 +-
>   include/system/host_iommu_device.h |   40 +
>   include/system/iommufd.h           |   53 +
>   backends/iommufd.c                 |   58 +
>   hw/i386/intel_iommu.c              | 1660 ++++++++++++++++++++++++----
>   hw/vfio/common.c                   |   17 +-
>   hw/vfio/iommufd.c                  |   48 +
>   backends/trace-events              |    1 +
>   hw/i386/trace-events               |   13 +
>   10 files changed, 1776 insertions(+), 203 deletions(-)
> 
Relative to the patches:
Patch 1: As Eric alluded to, a proper description for the patch should be written, and the title should change 'helpers' to 'helper'

Patch 2:
(1) Introduces the 'realize_late()' interface, but leaves the reader wondering 'ah, why? what?' ... after reading farther down the series, I learn more about realize_late(), but more on that later...
(2) For my education, can you provide ptrs to VFIO & VDPA code paths that demonstrate the need for different [at|de]tach_<>_hwpt()

Patch 3: Why can't the realize() be moved to after attach?  Isn't realize() supposed to indicate 'all is set up and the object can now be used' -- apologies for what could be a dumb question, as that's my understanding of realize().  If the argument is such that there need to be two steps, how does the first realize() that puts the object into a usable state <somehow> wait until realize_late()?

Patch 4: Shouldn't the current/existing realize callback just be overwritten with the later one, when this is needed?

Patch 5: no issues.

Patch 6: ewww -- fs1gp ... we use underscores all over the place for multi-word elements; so how about 's1_1g_pg'
          -- in how many places is that really used such that multiple underscores are an issue?

Patch 7: intel-iommu-specific callbacks in the common vfio & iommufd-backend code; nack. This won't compile w/intel-iommu included with iommufd... I think backend, intel-iommu hw-caps should provide the generic 'caps' boolean-type values/states; ... and maybe they should be extracted via vfio? .... like
      case HOST_IOMMU_DEVICE_CAP_AW_BITS:
          return vfio_device_get_aw_bits(hiod->agent);

Patch 8: Again, VTD-specific code in IOMMUFD is a nack; again, maybe via vfio, or a direct call into an iommu-device-cap api.

Patch 9: no issues.

Patch 10: "except it's stale"  likely "except when it's entry is stale" ?
           Did you ever put some tracing in to capture avg hits in cache? ... if so, add as a comment.
           Otherwise, looks good.

Patch 11: Apologies, I don't know what 'flts' stands for, and why it is relative to 2-stage mapping, or SIOV.  Could you add verbage to explain the use of it, as the rest of this patch doesn't make any sense to me without the background.
The patch introduces hw-info-type (none or intel), and then goes on to add a large set of checks; seems like the caps & this checking should go together (split for each cap; put all caps together & the check...).

Patch 12: Why isn't HostIOMMUDevice extended to have another iommu-specific element, opaque in HostIOMMUDevice, but set to the specific IOMMU in use?   e.g. void *hostiommustate;

Patch 13: Isn't PASID just an extension/addition of BDF id? and doesn't each PASID have its own address space?
So, why isn't it handled via a unique AS cache like 'any other device'?  Maybe I'm thinking too much of SMMU StreamID, which can be of varying length, depending on subsystem support.  I see what appears to be sid+pasid calls to do the AS lookups; hmm, so maybe this is the generalized BDF+pasid AS lookup?  if so, maybe a better description stating this transition to a wider stream-id would set the code context better.
As for the rest of the (400 intel-iommu) code, I'm not that in-depth in intel-iommu to determine if it's all correct or not.

Patch 14: Define PGTT; the second paragraph seems self-contradictory -- it says it uses a 2-stage page table in each case, but it implies it should be different.  At 580 lines of code changes, you win! ;-)

Patch 15: Read-only and Read/write areas have different IOMMUFDs?  Is that an intel-iommu requirement?
           At least this intel-iommu-errata code is only in hw/i386/<> modules.

Patch 16: Looks reasonable.  What does the 'SI' mean after "CACHE_DEV", "CACHE_DOM" & "CACHE_PASID" ? -- stream-invalidation?

Patch 17: Looks reasonable.

Patch 18: Looks good.

Patch 19: No comment; believe AlexW should weigh in on this change.

Patch 20: Looks good.

  




^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH rfcv2 03/20] HostIOMMUDevice: Introduce realize_late callback
  2025-02-19  8:22 ` [PATCH rfcv2 03/20] HostIOMMUDevice: Introduce realize_late callback Zhenzhong Duan
  2025-02-20 17:48   ` Eric Auger
@ 2025-04-07 11:19   ` Cédric Le Goater
  2025-04-08  8:00     ` Cédric Le Goater
  1 sibling, 1 reply; 68+ messages in thread
From: Cédric Le Goater @ 2025-04-07 11:19 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, eric.auger, mst, jasowang, peterx, jgg, nicolinc,
	shameerali.kolothum.thodi, joao.m.martins, clement.mathieu--drif,
	kevin.tian, yi.l.liu, chao.p.peng

On 2/19/25 09:22, Zhenzhong Duan wrote:
> Currently we have realize() callback which is called before attachment.
> But there are still some elements e.g., hwpt_id is not ready before
> attachment. So we need a realize_late() callback to further initialize
> them.

The relation between objects HostIOMMUDevice and VFIOIOMMU is starting
to look too complex for me.

I think it makes sense to realize HostIOMMUDevice after the device
is attached. Can't we move :

         hiod = HOST_IOMMU_DEVICE(object_new(ops->hiod_typename));
         vbasedev->hiod = hiod;

under ->attach_device() and also the call :

     if (!vfio_device_hiod_realize(vbasedev, errp)) {

later in the ->attach_device() patch ?

hiod_legacy_vfio_realize() doesn't do much. We might need to rework
hiod_iommufd_vfio_realize() which queries the iommufd hw caps, later
used by intel-iommu.

Anyway, it is good time to cleanup our interfaces before adding more.

Thanks,

C.



     
> Currently, this callback is only useful for iommufd backend. For legacy
> backend nothing needs to be initialized after attachment.
> 
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>   include/system/host_iommu_device.h | 17 +++++++++++++++++
>   hw/vfio/common.c                   | 17 ++++++++++++++---
>   2 files changed, 31 insertions(+), 3 deletions(-)
> 
> diff --git a/include/system/host_iommu_device.h b/include/system/host_iommu_device.h
> index 809cced4ba..df782598f2 100644
> --- a/include/system/host_iommu_device.h
> +++ b/include/system/host_iommu_device.h
> @@ -66,6 +66,23 @@ struct HostIOMMUDeviceClass {
>        * Returns: true on success, false on failure.
>        */
>       bool (*realize)(HostIOMMUDevice *hiod, void *opaque, Error **errp);
> +    /**
> +     * @realize_late: initialize host IOMMU device instance after attachment,
> +     *                some elements e.g., ioas are ready only after attachment.
> +     *                This callback initialize them.
> +     *
> +     * Optional callback.
> +     *
> +     * @hiod: pointer to a host IOMMU device instance.
> +     *
> +     * @opaque: pointer to agent device of this host IOMMU device,
> +     *          e.g., VFIO base device or VDPA device.
> +     *
> +     * @errp: pass an Error out when realize fails.
> +     *
> +     * Returns: true on success, false on failure.
> +     */
> +    bool (*realize_late)(HostIOMMUDevice *hiod, void *opaque, Error **errp);
>       /**
>        * @get_cap: check if a host IOMMU device capability is supported.
>        *
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index abbdc56b6d..e198b1e5a2 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -1550,6 +1550,7 @@ bool vfio_attach_device(char *name, VFIODevice *vbasedev,
>       const VFIOIOMMUClass *ops =
>           VFIO_IOMMU_CLASS(object_class_by_name(TYPE_VFIO_IOMMU_LEGACY));
>       HostIOMMUDevice *hiod = NULL;
> +    HostIOMMUDeviceClass *hiod_ops = NULL;
>   
>       if (vbasedev->iommufd) {
>           ops = VFIO_IOMMU_CLASS(object_class_by_name(TYPE_VFIO_IOMMU_IOMMUFD));
> @@ -1560,16 +1561,26 @@ bool vfio_attach_device(char *name, VFIODevice *vbasedev,
>   
>       if (!vbasedev->mdev) {
>           hiod = HOST_IOMMU_DEVICE(object_new(ops->hiod_typename));
> +        hiod_ops = HOST_IOMMU_DEVICE_GET_CLASS(hiod);
>           vbasedev->hiod = hiod;
>       }
>   
>       if (!ops->attach_device(name, vbasedev, as, errp)) {
> -        object_unref(hiod);
> -        vbasedev->hiod = NULL;
> -        return false;
> +        goto err_attach;
> +    }
> +
> +    if (hiod_ops && hiod_ops->realize_late &&
> +        !hiod_ops->realize_late(hiod, vbasedev, errp)) {
> +        ops->detach_device(vbasedev);
> +        goto err_attach;
>       }
>   
>       return true;
> +
> +err_attach:
> +    object_unref(hiod);
> +    vbasedev->hiod = NULL;
> +    return false;
>   }
>   
>   void vfio_detach_device(VFIODevice *vbasedev)



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH rfcv2 03/20] HostIOMMUDevice: Introduce realize_late callback
  2025-04-07 11:19   ` Cédric Le Goater
@ 2025-04-08  8:00     ` Cédric Le Goater
  2025-04-09  8:27       ` Duan, Zhenzhong
  0 siblings, 1 reply; 68+ messages in thread
From: Cédric Le Goater @ 2025-04-08  8:00 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, eric.auger, mst, jasowang, peterx, jgg, nicolinc,
	shameerali.kolothum.thodi, joao.m.martins, clement.mathieu--drif,
	kevin.tian, yi.l.liu, chao.p.peng

On 4/7/25 13:19, Cédric Le Goater wrote:
> On 2/19/25 09:22, Zhenzhong Duan wrote:
>> Currently we have realize() callback which is called before attachment.
>> But there are still some elements e.g., hwpt_id is not ready before
>> attachment. So we need a realize_late() callback to further initialize
>> them.
> 
> The relation between objects HostIOMMUDevice and VFIOIOMMU is starting
> to look too complex for me.
> 
> I think it makes sense to realize HostIOMMUDevice after the device
> is attached. Can't we move :
> 
>          hiod = HOST_IOMMU_DEVICE(object_new(ops->hiod_typename));
>          vbasedev->hiod = hiod;
> 
> under ->attach_device() and also the call :
> 
>      if (!vfio_device_hiod_realize(vbasedev, errp)) {
> 
> later in the ->attach_device() patch ?
> 
> hiod_legacy_vfio_realize() doesn't do much. We might need to rework
> hiod_iommufd_vfio_realize() which queries the iommufd hw caps, later
> used by intel-iommu.

The only dependency I see on the IOMMUFD HostIOMMUDevice when attaching
the device to the container is in iommufd_cdev_autodomains_get(). The
flags for IOMMU_HWPT_ALLOC depend on the HW capability of the IOMMUFD
backend, and we rely on hiod_iommufd_vfio_realize() having already
queried the iommufd kernel device.

Since this is not a hot path, I don't think it is a problem to add
a redundant call to iommufd_backend_get_device_info() in
iommufd_cdev_autodomains_get() and avoid the IOMMUFD HostIOMMUDevice
dependency. With that we can move the HostIOMMUDevice creation and
realize sequence to the end of the device attach sequence.

I think this makes the code cleaner when it comes to using the
vbasedev->hiod pointer too.
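
Something like this untested sketch, details and error reporting trimmed,
just to show the ordering I mean:

    bool vfio_attach_device(char *name, VFIODevice *vbasedev,
                            AddressSpace *as, Error **errp)
    {
        const VFIOIOMMUClass *ops =
            VFIO_IOMMU_CLASS(object_class_by_name(TYPE_VFIO_IOMMU_LEGACY));

        if (vbasedev->iommufd) {
            ops = VFIO_IOMMU_CLASS(object_class_by_name(TYPE_VFIO_IOMMU_IOMMUFD));
        }

        if (!ops->attach_device(name, vbasedev, as, errp)) {
            return false;
        }

        /* create and realize the HostIOMMUDevice only once attached */
        if (!vbasedev->mdev) {
            vbasedev->hiod = HOST_IOMMU_DEVICE(object_new(ops->hiod_typename));

            if (!vfio_device_hiod_realize(vbasedev, errp)) {
                object_unref(vbasedev->hiod);
                vbasedev->hiod = NULL;
                ops->detach_device(vbasedev);
                return false;
            }
        }

        return true;
    }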

> Anyway, it is good time to cleanup our interfaces before adding more.

On that topic, I think

    iommufd_cdev_attach_ioas_hwpt
    iommufd_cdev_detach_ioas_hwpt

belong to IOMMUFD backend.


Thanks,

C.

  




* RE: [PATCH rfcv2 03/20] HostIOMMUDevice: Introduce realize_late callback
  2025-04-08  8:00     ` Cédric Le Goater
@ 2025-04-09  8:27       ` Duan, Zhenzhong
  2025-04-09  9:58         ` Cédric Le Goater
  0 siblings, 1 reply; 68+ messages in thread
From: Duan, Zhenzhong @ 2025-04-09  8:27 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, eric.auger@redhat.com, mst@redhat.com,
	jasowang@redhat.com, peterx@redhat.com, jgg@nvidia.com,
	nicolinc@nvidia.com, shameerali.kolothum.thodi@huawei.com,
	joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
	Tian, Kevin, Liu, Yi L, Peng, Chao P



>-----Original Message-----
>From: Cédric Le Goater <clg@redhat.com>
>Subject: Re: [PATCH rfcv2 03/20] HostIOMMUDevice: Introduce realize_late
>callback
>
>On 4/7/25 13:19, Cédric Le Goater wrote:
>> On 2/19/25 09:22, Zhenzhong Duan wrote:
>>> Currently we have realize() callback which is called before attachment.
>>> But there are still some elements e.g., hwpt_id is not ready before
>>> attachment. So we need a realize_late() callback to further initialize
>>> them.
>>
>> The relation between objects HostIOMMUDevice and VFIOIOMMU is starting
>> to look too complex for me.

Agree.

>>
>> I think it makes sense to realize HostIOMMUDevice after the device
>> is attached. Can't we move :
>>
>>          hiod = HOST_IOMMU_DEVICE(object_new(ops->hiod_typename));
>>          vbasedev->hiod = hiod;
>>
>> under ->attach_device() and also the call :
>>
>>      if (!vfio_device_hiod_realize(vbasedev, errp)) {
>>
>> later in the ->attach_device() patch ?
>>
>> hiod_legacy_vfio_realize() doesn't do much. We might need to rework
>> hiod_iommufd_vfio_realize() which queries the iommufd hw caps, later
>> used by intel-iommu.
>
>The only dependency I see on the IOMMUFD HostIOMMUDevice when attaching
>the device to the container is in iommufd_cdev_autodomains_get(). The
>flags for IOMMU_HWPT_ALLOC depends on the HW capability of the IOMMFD
>backend and we rely on hiod_iommufd_vfio_realize() to have done the
>query on the iommufd kernel device before.
>
>Since this is not a hot path, I don't think it is a problem to add
>a redundant call to iommufd_backend_get_device_info() in
>iommufd_cdev_autodomains_get() and avoid the IOMMUFD HostIOMMUDevice
>dependency. With that we can move the HostIOMMUDevice creation and
>realize sequence at the end of the device attach sequence.

Yes.

>
>I think this makes the code cleaner when it comes to using the
>vbasedev->hiod pointer too.
>
>> Anyway, it is good time to cleanup our interfaces before adding more.

OK, let me think about this further and write some patches to move .realize() after .attach_device().
will be based on vfio-next.

>
>On that topic, I think
>
>    iommufd_cdev_attach_ioas_hwpt
>    iommufd_cdev_detach_ioas_hwpt
>
>belong to IOMMUFD backend.

They are operations on a VFIODevice, while backends/iommufd.c is for operations on an IOMMUFDBackend.
Do we need to move iommufd_cdev_attach/detach_ioas_hwpt to backends/iommufd.c, which is VFIODevice agnostic?

Thanks
Zhenzhong


* Re: [PATCH rfcv2 03/20] HostIOMMUDevice: Introduce realize_late callback
  2025-04-09  8:27       ` Duan, Zhenzhong
@ 2025-04-09  9:58         ` Cédric Le Goater
  0 siblings, 0 replies; 68+ messages in thread
From: Cédric Le Goater @ 2025-04-09  9:58 UTC (permalink / raw)
  To: Duan, Zhenzhong, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, eric.auger@redhat.com, mst@redhat.com,
	jasowang@redhat.com, peterx@redhat.com, jgg@nvidia.com,
	nicolinc@nvidia.com, shameerali.kolothum.thodi@huawei.com,
	joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
	Tian, Kevin, Liu, Yi L, Peng, Chao P

On 4/9/25 10:27, Duan, Zhenzhong wrote:
> 
> 
>> -----Original Message-----
>> From: Cédric Le Goater <clg@redhat.com>
>> Subject: Re: [PATCH rfcv2 03/20] HostIOMMUDevice: Introduce realize_late
>> callback
>>
>> On 4/7/25 13:19, Cédric Le Goater wrote:
>>> On 2/19/25 09:22, Zhenzhong Duan wrote:
>>>> Currently we have realize() callback which is called before attachment.
>>>> But there are still some elements e.g., hwpt_id is not ready before
>>>> attachment. So we need a realize_late() callback to further initialize
>>>> them.
>>>
>>> The relation between objects HostIOMMUDevice and VFIOIOMMU is starting
>>> to look too complex for me.
> 
> Agree.
> 
>>>
>>> I think it makes sense to realize HostIOMMUDevice after the device
>>> is attached. Can't we move :
>>>
>>>           hiod = HOST_IOMMU_DEVICE(object_new(ops->hiod_typename));
>>>           vbasedev->hiod = hiod;
>>>
>>> under ->attach_device() and also the call :
>>>
>>>       if (!vfio_device_hiod_realize(vbasedev, errp)) {
>>>
>>> later in the ->attach_device() patch ?
>>>
>>> hiod_legacy_vfio_realize() doesn't do much. We might need to rework
>>> hiod_iommufd_vfio_realize() which queries the iommufd hw caps, later
>>> used by intel-iommu.
>>
>> The only dependency I see on the IOMMUFD HostIOMMUDevice when attaching
>> the device to the container is in iommufd_cdev_autodomains_get(). The
>> flags for IOMMU_HWPT_ALLOC depends on the HW capability of the IOMMFD
>> backend and we rely on hiod_iommufd_vfio_realize() to have done the
>> query on the iommufd kernel device before.
>>
>> Since this is not a hot path, I don't think it is a problem to add
>> a redundant call to iommufd_backend_get_device_info() in
>> iommufd_cdev_autodomains_get() and avoid the IOMMUFD HostIOMMUDevice
>> dependency. With that we can move the HostIOMMUDevice creation and
>> realize sequence at the end of the device attach sequence.
> 
> Yes.
> 
>>
>> I think this makes the code cleaner when it comes to using the
>> vbasedev->hiod pointer too.
>>
>>> Anyway, it is good time to cleanup our interfaces before adding more.
> 
> OK, let me think about this further and write some patches to move .realize() after .attach_device().
> will be based on vfio-next.

I just updated the vfio-next branch with what should be in the next PR
for QEMU 10.1.

> 
>>
>> On that topic, I think
>>
>>     iommufd_cdev_attach_ioas_hwpt
>>     iommufd_cdev_detach_ioas_hwpt
>>
>> belong to IOMMUFD backend.
> 
> They are operation on VFIODevice, backends/iommufd.c are for operation on IOMMUFDBackend,
> Do we need to move iommufd_cdev_attach/detach_ioas_hwpt to backends/iommufd.c which is VFIODevice agnostic?

My mistake. I was confused with

   int iommufd = vbasedev->iommufd->fd

and thought we could simply replace the 'VFIODevice *' parameter with an
'IOMMUFDBackend *be' parameter, but this is not the case.
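
To spell out the confusion, the prototype below is only what I had
imagined, it is not existing code:

    /* what I had wrongly imagined (illustrative only): */
    bool iommufd_backend_attach_hwpt(IOMMUFDBackend *be, uint32_t hwpt_id,
                                     Error **errp);

    /* ... but iommufd_cdev_attach_ioas_hwpt() needs per-device state
     * (e.g. the device id the kernel knows about), not just
     * vbasedev->iommufd->fd, so the 'VFIODevice *vbasedev' parameter
     * has to stay and the helper remains on the VFIO side. */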

Thanks,

C.




* RE: [PATCH rfcv2 00/20] intel_iommu: Enable stage-1 translation for passthrough device
  2025-04-05  3:01 ` Donald Dutile
@ 2025-05-19  8:37   ` Duan, Zhenzhong
  2025-05-19 15:39     ` Donald Dutile
  0 siblings, 1 reply; 68+ messages in thread
From: Duan, Zhenzhong @ 2025-05-19  8:37 UTC (permalink / raw)
  To: Donald Dutile, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, eric.auger@redhat.com,
	mst@redhat.com, jasowang@redhat.com, peterx@redhat.com,
	jgg@nvidia.com, nicolinc@nvidia.com,
	shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
	clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
	Peng, Chao P

Hi Donald,

>-----Original Message-----
>From: Donald Dutile <ddutile@redhat.com>
>Subject: Re: [PATCH rfcv2 00/20] intel_iommu: Enable stage-1 translation for
>passthrough device
>
>Zhenzhong,
>
>Hi!
>Eric asked me to review this series.
>Since it's rather late since you posted will summarize review feedback
>below/bottom.
>
>- Don
>
>On 2/19/25 3:22 AM, Zhenzhong Duan wrote:
>> Hi,
>>
>> Per Jason Wang's suggestion, iommufd nesting series[1] is split into
>> "Enable stage-1 translation for emulated device" series and
>> "Enable stage-1 translation for passthrough device" series.
>>
>> This series is 2nd part focusing on passthrough device. We don't do
>> shadowing of guest page table for passthrough device but pass stage-1
>> page table to host side to construct a nested domain. There was some
>> effort to enable this feature in old days, see [2] for details.
>>
>> The key design is to utilize the dual-stage IOMMU translation
>> (also known as IOMMU nested translation) capability in host IOMMU.
>> As the below diagram shows, guest I/O page table pointer in GPA
>> (guest physical address) is passed to host and be used to perform
>> the stage-1 address translation. Along with it, modifications to
>> present mappings in the guest I/O page table should be followed
>> with an IOTLB invalidation.
>>
>>          .-------------.  .---------------------------.
>>          |   vIOMMU    |  | Guest I/O page table      |
>>          |             |  '---------------------------'
>>          .----------------/
>>          | PASID Entry |--- PASID cache flush --+
>>          '-------------'                        |
>>          |             |                        V
>>          |             |           I/O page table pointer in GPA
>>          '-------------'
>>      Guest
>>      ------| Shadow |---------------------------|--------
>>            v        v                           v
>>      Host
>>          .-------------.  .------------------------.
>>          |   pIOMMU    |  |  FS for GIOVA->GPA     |
>>          |             |  '------------------------'
>>          .----------------/  |
>>          | PASID Entry |     V (Nested xlate)
>>          '----------------\.----------------------------------.
>>          |             |   | SS for GPA->HPA, unmanaged domain|
>>          |             |   '----------------------------------'
>>          '-------------'
>> Where:
>>   - FS = First stage page tables
>>   - SS = Second stage page tables
>> <Intel VT-d Nested translation>
>>
>I'd prefer the use of 's1' for stage1/First stage, and 's2' for stage2/second stage.
>We don't need different terms for the same technology in the iommu/iommufd
>space(s).

OK, then I'd like to use stage1 and stage2 everywhere, which is more verbose.

>
>> There are some interactions between VFIO and vIOMMU
>> * vIOMMU registers PCIIOMMUOps [set|unset]_iommu_device to PCI
>>    subsystem. VFIO calls them to register/unregister HostIOMMUDevice
>>    instance to vIOMMU at vfio device realize stage.
>> * vIOMMU calls HostIOMMUDeviceIOMMUFD interface [at|de]tach_hwpt
>>    to bind/unbind device to IOMMUFD backed domains, either nested
>>    domain or not.
>>
>> See below diagram:
>>
>>          VFIO Device                                 Intel IOMMU
>>      .-----------------.                         .-------------------.
>>      |                 |                         |                   |
>>      |       .---------|PCIIOMMUOps              |.-------------.    |
>>      |       | IOMMUFD |(set_iommu_device)       || Host IOMMU  |    |
>>      |       | Device  |------------------------>|| Device list |    |
>>      |       .---------|(unset_iommu_device)     |.-------------.    |
>>      |                 |                         |       |           |
>>      |                 |                         |       V           |
>>      |       .---------|  HostIOMMUDeviceIOMMUFD |  .-------------.  |
>>      |       | IOMMUFD |            (attach_hwpt)|  | Host IOMMU  |  |
>>      |       | link    |<------------------------|  |   Device    |  |
>>      |       .---------|            (detach_hwpt)|  .-------------.  |
>>      |                 |                         |       |           |
>>      |                 |                         |       ...         |
>>      .-----------------.                         .-------------------.
>>
>> Based on Yi's suggestion, this design is optimal in sharing ioas/hwpt
>> whenever possible and create new one on demand, also supports multiple
>> iommufd objects and ERRATA_772415.
>>
>> E.g., Stage-2 page table could be shared by different devices if there
>> is no conflict and devices link to same iommufd object, i.e. devices
>> under same host IOMMU can share same stage-2 page table. If there is
>and 'devices under the same guest'.
>Different guests cant be sharing the same stage-2 page tables.

Yes, will update.

>
>> conflict, i.e. there is one device under non cache coherency mode
>> which is different from others, it requires a separate stage-2 page
>> table in non-CC mode.
>>
>> SPR platform has ERRATA_772415 which requires no readonly mappings
>> in stage-2 page table. This series supports creating VTDIOASContainer
>> with no readonly mappings. If there is a rare case that some IOMMUs
>> on a multiple IOMMU host have ERRATA_772415 and others not, this
>> design can still survive.
>>
>> See below example diagram for a full view:
>>
>>        IntelIOMMUState
>>               |
>>               V
>>      .------------------.    .------------------.    .-------------------.
>>      | VTDIOASContainer |--->| VTDIOASContainer |--->| VTDIOASContainer  |--
>>...
>>      | (iommufd0,RW&RO) |    | (iommufd1,RW&RO) |    | (iommufd0,RW only)|
>>      .------------------.    .------------------.    .-------------------.
>>               |                       |                              |
>>               |                       .-->...                        |
>>               V                                                      V
>>        .-------------------.    .-------------------.          .---------------.
>>        |   VTDS2Hwpt(CC)   |--->| VTDS2Hwpt(non-CC) |-->...    | VTDS2Hwpt(CC) |-
>->...
>>        .-------------------.    .-------------------.          .---------------.
>>            |            |               |                            |
>>            |            |               |                            |
>>      .-----------.  .-----------.  .------------.              .------------.
>>      | IOMMUFD   |  | IOMMUFD   |  | IOMMUFD    |              | IOMMUFD    |
>>      | Device(CC)|  | Device(CC)|  | Device     |              | Device(CC) |
>>      | (iommufd0)|  | (iommufd0)|  | (non-CC)   |              | (errata)   |
>>      |           |  |           |  | (iommufd0) |              | (iommufd0) |
>>      .-----------.  .-----------.  .------------.              .------------.
>>
>> This series is also a prerequisite work for vSVA, i.e. Sharing
>> guest application address space with passthrough devices.
>>
>> To enable stage-1 translation, only need to add "x-scalable-mode=on,x-flts=on".
>> i.e. -device intel-iommu,x-scalable-mode=on,x-flts=on...
>>
>> Passthrough device should use iommufd backend to work with stage-1
>translation.
>> i.e. -object iommufd,id=iommufd0 -device vfio-pci,iommufd=iommufd0,...
>>
>> If host doesn't support nested translation, qemu will fail with an unsupported
>> report.
>>
>> Test done:
>> - VFIO devices hotplug/unplug
>> - different VFIO devices linked to different iommufds
>> - vhost net device ping test
>>
>> PATCH1-8:  Add HWPT-based nesting infrastructure support
>> PATCH9-10: Some cleanup work
>> PATCH11:   cap/ecap related compatibility check between vIOMMU and Host
>IOMMU
>> PATCH12-19:Implement stage-1 page table for passthrough device
>> PATCH20:   Enable stage-1 translation for passthrough device
>>
>> Qemu code can be found at [3]
>>
>> TODO:
>> - RAM discard
>> - dirty tracking on stage-2 page table
>>
>> [1] https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg02740.html
>> [2] https://patchwork.kernel.org/project/kvm/cover/20210302203827.437645-
>1-yi.l.liu@intel.com/
>> [3]
>https://github.com/yiliu1765/qemu/tree/zhenzhong/iommufd_nesting_rfcv2
>>
>> Thanks
>> Zhenzhong
>>
>> Changelog:
>> rfcv2:
>> - Drop VTDPASIDAddressSpace and use VTDAddressSpace (Eric, Liuyi)
>> - Move HWPT uAPI patches ahead(patch1-8) so arm nesting could easily rebase
>> - add two cleanup patches(patch9-10)
>> - VFIO passes iommufd/devid/hwpt_id to vIOMMU instead of
>iommufd/devid/ioas_id
>> - add vtd_as_[from|to]_iommu_pasid() helper to translate between vtd_as and
>>    iommu pasid, this is important for dropping VTDPASIDAddressSpace
>>
>> Yi Liu (3):
>>    intel_iommu: Replay pasid binds after context cache invalidation
>>    intel_iommu: Propagate PASID-based iotlb invalidation to host
>>    intel_iommu: Refresh pasid bind when either SRTP or TE bit is changed
>>
>> Zhenzhong Duan (17):
>>    backends/iommufd: Add helpers for invalidating user-managed HWPT
>>    vfio/iommufd: Add properties and handlers to
>>      TYPE_HOST_IOMMU_DEVICE_IOMMUFD
>>    HostIOMMUDevice: Introduce realize_late callback
>>    vfio/iommufd: Implement HostIOMMUDeviceClass::realize_late() handler
>>    vfio/iommufd: Implement [at|de]tach_hwpt handlers
>>    host_iommu_device: Define two new capabilities
>>      HOST_IOMMU_DEVICE_CAP_[NESTING|FS1GP]
>>    iommufd: Implement query of HOST_IOMMU_DEVICE_CAP_[NESTING|FS1GP]
>>    iommufd: Implement query of HOST_IOMMU_DEVICE_CAP_ERRATA
>>    intel_iommu: Rename vtd_ce_get_rid2pasid_entry to
>>      vtd_ce_get_pasid_entry
>>    intel_iommu: Optimize context entry cache utilization
>>    intel_iommu: Check for compatibility with IOMMUFD backed device when
>>      x-flts=on
>>    intel_iommu: Introduce a new structure VTDHostIOMMUDevice
>>    intel_iommu: Add PASID cache management infrastructure
>>    intel_iommu: Bind/unbind guest page table to host
>>    intel_iommu: ERRATA_772415 workaround
>>    intel_iommu: Bypass replay in stage-1 page table mode
>>    intel_iommu: Enable host device when x-flts=on in scalable mode
>>
>>   hw/i386/intel_iommu_internal.h     |   56 +
>>   include/hw/i386/intel_iommu.h      |   33 +-
>>   include/system/host_iommu_device.h |   40 +
>>   include/system/iommufd.h           |   53 +
>>   backends/iommufd.c                 |   58 +
>>   hw/i386/intel_iommu.c              | 1660 ++++++++++++++++++++++++----
>>   hw/vfio/common.c                   |   17 +-
>>   hw/vfio/iommufd.c                  |   48 +
>>   backends/trace-events              |    1 +
>>   hw/i386/trace-events               |   13 +
>>   10 files changed, 1776 insertions(+), 203 deletions(-)
>>
>Relative to the patches:
>Patch 1: As Eric eluded to, a proper description for the patch should be written,
>and the title should change 'helpers' to 'helper'

This is addressed in [1]

[1] https://github.com/yiliu1765/qemu/commits/zhenzhong/iommufd_nesting_rfcv3.wip/

>
>Patch 2:
>(1) Introduce 'realize_late()' interface, but leave the reader wondering 'ah, why?
>what?' ... after reading farther down the series, I learn more about realize_late(),
>but more on that later...

realize_late() has been removed in [1]

>(2) For my education, can you provide ptrs to VFIO & VPDA code paths that
>demonstrate the need for different [at|de]tach_<>_hwpt()

Google helped me find this link: https://lkml.iu.edu/2309.2/08079.html
and specifically this commit: https://gitlab.com/lulu6/gitlabqemutmp/-/commit/354870ff6bd9bac80c9fc5c7f944331cb24b0331

>
>Patch 3: Why can't the realize() be moved to after attach?  isn't realize() suppose
>to indicate 'all is setup and object can now be used' -- apologies for what could
>be a dumb question, as that's my understanding of realize().  If the argument is
>such that there needs to be two steps, how does the first realize() that put the
>object into a used state <somehow> wait until realize_late()?
>
>Patch 4: Shouldn't the current/existing realize callback just be overwritten with
>the later one, when this is needed?

realize_late() has been removed in [1]

>
>Patch 5: no issues.
>
>Patch 6: ewww -- fs1gp ... we use underlines all over the place for multi-word
>elements; so how about 's1_1g_pg'
>           -- how many places is that really used that multiple underlines is an issue?

This is VT-d specific, so we follow the VT-d spec's naming so that we can easily find its definition by searching for fs1gp.

>
>Patch 7: intel-iommu-specific callbacks in the common vfio & iommufd-backend
>code; nack. This won't compile w/intel-iommu included with iommufd... I think
>backend, intel-iommu hw-caps should provide the generic 'caps' boolean-type
>values/states; ... and maybe they should be extracted via vfio? .... like
>      case HOST_IOMMU_DEVICE_CAP_AW_BITS:
>          return vfio_device_get_aw_bits(hiod->agent);
>
>Patch 8: Again, VTD-specific code in IOMMUFD is a nack; again, maybe via vfio,
>or a direct call into an iommu-device-cap api.

Patch 7&8 addressed in [1]

>
>Patch 9: no issues.
>
>Patch 10: "except it's stale"  likely "except when it's entry is stale" ?

addressed in [1]

>           Did you ever put some tracing in to capture avg hits in cache? ... if so, add
>as a comment.
>           Otherwise, looks good.
>
>Patch 11: Apologies, I don't know what 'flts' stands for, and why it is relative to 2-
>stage mapping, or SIOV.  Could you add verbage to explain the use of it, as the
>rest of this patch doesn't make any sense to me without the background.
>The patch introduces hw-info-type (none or intel), and then goes on to add a
>large set of checks; seems like the caps & this checking should go together (split
>for each cap; put all caps together & the check...).

OK, will do. There are some explanations in the cover letter.
For historical reasons, the old VT-d spec defined stage-1 as 'first level' and later switched to 'first stage'.

>
>Patch 12: Why isn't HostIOMMUDevice extended to have another iommu-specif
>element, opaque in HostIOMMUDevice, but set to specific IOMMU in use?   e.g.
>void *hostiommustate;

Yes, that's possible, but we want to make a generic interface between VFIO/VDPA and vIOMMU.

>
>Patch 13: Isn't PASID just an extension/addition of BDF id? and doesn't each
>PASID have its own address space?

Yes, it is.

>So, why isn't it handle via a uniqe AS cache like 'any other device'?  Maybe I'm
>thinking too SMMU-StreamID, which can be varying length, depending on
>subsystem support.  I see what appears to be sid+pasid calls to do the AS lookups;
>hmm, so maybe this is the generalized BDF+pasid AS lookup?  if so, maybe a
>better description stating this transition to a wider stream-id would set the code
>context better.

Sorry, I don't quite get this.

>As for the rest of the (400 intel-iommu) code, I'm not that in-depth in intel-iommu
>to determine if its all correct or not.
>
>Patch 14: Define PGTT; the second paragraph seem self-contradicting -- it says it
>uses a 2-stage page table in each case, but it implies it should be different.  At 580
>lines of code changes, you win! ;-)

Whether the host side uses nested translation or only a stage-2 page table depends on the PGTT setting in the guest.
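
Roughly, in pseudo code (the names here are only illustrative, not the
actual macros in the patch):

    /* the guest's PGTT field in the pasid entry decides what we ask the
     * host for */
    if (pgtt == FIRST_STAGE) {
        /* bind the guest stage-1 page table; host side uses a nested hwpt */
    } else {
        /* e.g. pass-through: the device uses the shared host stage-2 hwpt only */
    }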

>
>Patch 15: Read-only and Read/write areas have different IOMMUFDs?  is that an
>intel-iommu requriement?
>           At least this intel-iommu-errata code is only in hw/i386/<> modules.

No. With ERRATA_772415, read-only areas should not be mapped, so we allocate a new VTDIOASContainer to hold only the read/write area mappings.
We can use the same IOMMUFD for different VTDIOASContainers.
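
Roughly (illustrative only, the field names may differ from the patch):

    /* a VTDIOASContainer created for an ERRATA_772415 device never maps
     * read-only sections into its stage-2 */
    if (container->errata && section->readonly) {
        return;   /* leave read-only ranges unmapped */
    }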

>
>Patch 16: Looks reasonable.  What does the 'SI' mean after "CACHE_DEV",
>"CACHE_DOM" & "CACHE_PASID" ? -- stream-invalidation?

VTD_PASID_CACHE_DEVSI stands for 'pasid cache device selective invalidation',
and VTD_PASID_CACHE_DOMSI means 'pasid cache domain selective invalidation'.

Thanks
Zhenzhong



* Re: [PATCH rfcv2 00/20] intel_iommu: Enable stage-1 translation for passthrough device
  2025-05-19  8:37   ` Duan, Zhenzhong
@ 2025-05-19 15:39     ` Donald Dutile
  2025-05-20  9:13       ` Duan, Zhenzhong
  0 siblings, 1 reply; 68+ messages in thread
From: Donald Dutile @ 2025-05-19 15:39 UTC (permalink / raw)
  To: Duan, Zhenzhong, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, eric.auger@redhat.com,
	mst@redhat.com, jasowang@redhat.com, peterx@redhat.com,
	jgg@nvidia.com, nicolinc@nvidia.com,
	shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
	clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
	Peng, Chao P

Hey Zhenzhong,
Thanks for feedback. replies below.
- Don

On 5/19/25 4:37 AM, Duan, Zhenzhong wrote:
> Hi Donald,
> 
>> -----Original Message-----
>> From: Donald Dutile <ddutile@redhat.com>
>> Subject: Re: [PATCH rfcv2 00/20] intel_iommu: Enable stage-1 translation for
>> passthrough device
>>
>> Zhenzhong,
>>
>> Hi!
>> Eric asked me to review this series.
>> Since it's rather late since you posted will summarize review feedback
>> below/bottom.
>>
>> - Don
>>
>> On 2/19/25 3:22 AM, Zhenzhong Duan wrote:
>>> Hi,
>>>
>>> Per Jason Wang's suggestion, iommufd nesting series[1] is split into
>>> "Enable stage-1 translation for emulated device" series and
>>> "Enable stage-1 translation for passthrough device" series.
>>>
>>> This series is 2nd part focusing on passthrough device. We don't do
>>> shadowing of guest page table for passthrough device but pass stage-1
>>> page table to host side to construct a nested domain. There was some
>>> effort to enable this feature in old days, see [2] for details.
>>>
>>> The key design is to utilize the dual-stage IOMMU translation
>>> (also known as IOMMU nested translation) capability in host IOMMU.
>>> As the below diagram shows, guest I/O page table pointer in GPA
>>> (guest physical address) is passed to host and be used to perform
>>> the stage-1 address translation. Along with it, modifications to
>>> present mappings in the guest I/O page table should be followed
>>> with an IOTLB invalidation.
>>>
>>>           .-------------.  .---------------------------.
>>>           |   vIOMMU    |  | Guest I/O page table      |
>>>           |             |  '---------------------------'
>>>           .----------------/
>>>           | PASID Entry |--- PASID cache flush --+
>>>           '-------------'                        |
>>>           |             |                        V
>>>           |             |           I/O page table pointer in GPA
>>>           '-------------'
>>>       Guest
>>>       ------| Shadow |---------------------------|--------
>>>             v        v                           v
>>>       Host
>>>           .-------------.  .------------------------.
>>>           |   pIOMMU    |  |  FS for GIOVA->GPA     |
>>>           |             |  '------------------------'
>>>           .----------------/  |
>>>           | PASID Entry |     V (Nested xlate)
>>>           '----------------\.----------------------------------.
>>>           |             |   | SS for GPA->HPA, unmanaged domain|
>>>           |             |   '----------------------------------'
>>>           '-------------'
>>> Where:
>>>    - FS = First stage page tables
>>>    - SS = Second stage page tables
>>> <Intel VT-d Nested translation>
>>>
>> I'd prefer the use of 's1' for stage1/First stage, and 's2' for stage2/second stage.
>> We don't need different terms for the same technology in the iommu/iommufd
>> space(s).
> 
> OK, then I'd like to use stage1 and stage2 everywhere which is more verbose.
> 
Your choice; in other kernel & qemu code I've seen, 's1' and 's2' are used quite frequently,
which is why I recommended it -- call it plagiarizing! ;-)

>>
>>> There are some interactions between VFIO and vIOMMU
>>> * vIOMMU registers PCIIOMMUOps [set|unset]_iommu_device to PCI
>>>     subsystem. VFIO calls them to register/unregister HostIOMMUDevice
>>>     instance to vIOMMU at vfio device realize stage.
>>> * vIOMMU calls HostIOMMUDeviceIOMMUFD interface [at|de]tach_hwpt
>>>     to bind/unbind device to IOMMUFD backed domains, either nested
>>>     domain or not.
>>>
>>> See below diagram:
>>>
>>>           VFIO Device                                 Intel IOMMU
>>>       .-----------------.                         .-------------------.
>>>       |                 |                         |                   |
>>>       |       .---------|PCIIOMMUOps              |.-------------.    |
>>>       |       | IOMMUFD |(set_iommu_device)       || Host IOMMU  |    |
>>>       |       | Device  |------------------------>|| Device list |    |
>>>       |       .---------|(unset_iommu_device)     |.-------------.    |
>>>       |                 |                         |       |           |
>>>       |                 |                         |       V           |
>>>       |       .---------|  HostIOMMUDeviceIOMMUFD |  .-------------.  |
>>>       |       | IOMMUFD |            (attach_hwpt)|  | Host IOMMU  |  |
>>>       |       | link    |<------------------------|  |   Device    |  |
>>>       |       .---------|            (detach_hwpt)|  .-------------.  |
>>>       |                 |                         |       |           |
>>>       |                 |                         |       ...         |
>>>       .-----------------.                         .-------------------.
>>>
>>> Based on Yi's suggestion, this design is optimal in sharing ioas/hwpt
>>> whenever possible and create new one on demand, also supports multiple
>>> iommufd objects and ERRATA_772415.
>>>
>>> E.g., Stage-2 page table could be shared by different devices if there
>>> is no conflict and devices link to same iommufd object, i.e. devices
>>> under same host IOMMU can share same stage-2 page table. If there is
>> and 'devices under the same guest'.
>> Different guests cant be sharing the same stage-2 page tables.
> 
> Yes, will update.
> 
Thanks.

>>
>>> conflict, i.e. there is one device under non cache coherency mode
>>> which is different from others, it requires a separate stage-2 page
>>> table in non-CC mode.
>>>
>>> SPR platform has ERRATA_772415 which requires no readonly mappings
>>> in stage-2 page table. This series supports creating VTDIOASContainer
>>> with no readonly mappings. If there is a rare case that some IOMMUs
>>> on a multiple IOMMU host have ERRATA_772415 and others not, this
>>> design can still survive.
>>>
>>> See below example diagram for a full view:
>>>
>>>         IntelIOMMUState
>>>                |
>>>                V
>>>       .------------------.    .------------------.    .-------------------.
>>>       | VTDIOASContainer |--->| VTDIOASContainer |--->| VTDIOASContainer  |--
>>> ...
>>>       | (iommufd0,RW&RO) |    | (iommufd1,RW&RO) |    | (iommufd0,RW only)|
>>>       .------------------.    .------------------.    .-------------------.
>>>                |                       |                              |
>>>                |                       .-->...                        |
>>>                V                                                      V
>>>         .-------------------.    .-------------------.          .---------------.
>>>         |   VTDS2Hwpt(CC)   |--->| VTDS2Hwpt(non-CC) |-->...    | VTDS2Hwpt(CC) |-
>> ->...
>>>         .-------------------.    .-------------------.          .---------------.
>>>             |            |               |                            |
>>>             |            |               |                            |
>>>       .-----------.  .-----------.  .------------.              .------------.
>>>       | IOMMUFD   |  | IOMMUFD   |  | IOMMUFD    |              | IOMMUFD    |
>>>       | Device(CC)|  | Device(CC)|  | Device     |              | Device(CC) |
>>>       | (iommufd0)|  | (iommufd0)|  | (non-CC)   |              | (errata)   |
>>>       |           |  |           |  | (iommufd0) |              | (iommufd0) |
>>>       .-----------.  .-----------.  .------------.              .------------.
>>>
>>> This series is also a prerequisite work for vSVA, i.e. Sharing
>>> guest application address space with passthrough devices.
>>>
>>> To enable stage-1 translation, only need to add "x-scalable-mode=on,x-flts=on".
>>> i.e. -device intel-iommu,x-scalable-mode=on,x-flts=on...
>>>
>>> Passthrough device should use iommufd backend to work with stage-1
>> translation.
>>> i.e. -object iommufd,id=iommufd0 -device vfio-pci,iommufd=iommufd0,...
>>>
>>> If host doesn't support nested translation, qemu will fail with an unsupported
>>> report.
>>>
>>> Test done:
>>> - VFIO devices hotplug/unplug
>>> - different VFIO devices linked to different iommufds
>>> - vhost net device ping test
>>>
>>> PATCH1-8:  Add HWPT-based nesting infrastructure support
>>> PATCH9-10: Some cleanup work
>>> PATCH11:   cap/ecap related compatibility check between vIOMMU and Host
>> IOMMU
>>> PATCH12-19:Implement stage-1 page table for passthrough device
>>> PATCH20:   Enable stage-1 translation for passthrough device
>>>
>>> Qemu code can be found at [3]
>>>
>>> TODO:
>>> - RAM discard
>>> - dirty tracking on stage-2 page table
>>>
>>> [1] https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg02740.html
>>> [2] https://patchwork.kernel.org/project/kvm/cover/20210302203827.437645-
>> 1-yi.l.liu@intel.com/
>>> [3]
>> https://github.com/yiliu1765/qemu/tree/zhenzhong/iommufd_nesting_rfcv2
>>>
>>> Thanks
>>> Zhenzhong
>>>
>>> Changelog:
>>> rfcv2:
>>> - Drop VTDPASIDAddressSpace and use VTDAddressSpace (Eric, Liuyi)
>>> - Move HWPT uAPI patches ahead(patch1-8) so arm nesting could easily rebase
>>> - add two cleanup patches(patch9-10)
>>> - VFIO passes iommufd/devid/hwpt_id to vIOMMU instead of
>> iommufd/devid/ioas_id
>>> - add vtd_as_[from|to]_iommu_pasid() helper to translate between vtd_as and
>>>     iommu pasid, this is important for dropping VTDPASIDAddressSpace
>>>
>>> Yi Liu (3):
>>>     intel_iommu: Replay pasid binds after context cache invalidation
>>>     intel_iommu: Propagate PASID-based iotlb invalidation to host
>>>     intel_iommu: Refresh pasid bind when either SRTP or TE bit is changed
>>>
>>> Zhenzhong Duan (17):
>>>     backends/iommufd: Add helpers for invalidating user-managed HWPT
>>>     vfio/iommufd: Add properties and handlers to
>>>       TYPE_HOST_IOMMU_DEVICE_IOMMUFD
>>>     HostIOMMUDevice: Introduce realize_late callback
>>>     vfio/iommufd: Implement HostIOMMUDeviceClass::realize_late() handler
>>>     vfio/iommufd: Implement [at|de]tach_hwpt handlers
>>>     host_iommu_device: Define two new capabilities
>>>       HOST_IOMMU_DEVICE_CAP_[NESTING|FS1GP]
>>>     iommufd: Implement query of HOST_IOMMU_DEVICE_CAP_[NESTING|FS1GP]
>>>     iommufd: Implement query of HOST_IOMMU_DEVICE_CAP_ERRATA
>>>     intel_iommu: Rename vtd_ce_get_rid2pasid_entry to
>>>       vtd_ce_get_pasid_entry
>>>     intel_iommu: Optimize context entry cache utilization
>>>     intel_iommu: Check for compatibility with IOMMUFD backed device when
>>>       x-flts=on
>>>     intel_iommu: Introduce a new structure VTDHostIOMMUDevice
>>>     intel_iommu: Add PASID cache management infrastructure
>>>     intel_iommu: Bind/unbind guest page table to host
>>>     intel_iommu: ERRATA_772415 workaround
>>>     intel_iommu: Bypass replay in stage-1 page table mode
>>>     intel_iommu: Enable host device when x-flts=on in scalable mode
>>>
>>>    hw/i386/intel_iommu_internal.h     |   56 +
>>>    include/hw/i386/intel_iommu.h      |   33 +-
>>>    include/system/host_iommu_device.h |   40 +
>>>    include/system/iommufd.h           |   53 +
>>>    backends/iommufd.c                 |   58 +
>>>    hw/i386/intel_iommu.c              | 1660 ++++++++++++++++++++++++----
>>>    hw/vfio/common.c                   |   17 +-
>>>    hw/vfio/iommufd.c                  |   48 +
>>>    backends/trace-events              |    1 +
>>>    hw/i386/trace-events               |   13 +
>>>    10 files changed, 1776 insertions(+), 203 deletions(-)
>>>
>> Relative to the patches:
>> Patch 1: As Eric eluded to, a proper description for the patch should be written,
>> and the title should change 'helpers' to 'helper'
> 
> This is addressed in [1]
> 
> [1] https://github.com/yiliu1765/qemu/commits/zhenzhong/iommufd_nesting_rfcv3.wip/
> 
ok.
>>
>> Patch 2:
>> (1) Introduce 'realize_late()' interface, but leave the reader wondering 'ah, why?
>> what?' ... after reading farther down the series, I learn more about realize_late(),
>> but more on that later...
> 
> realize_late() has been removed in [1]
> 
Yes, I saw. :)

>> (2) For my education, can you provide ptrs to VFIO & VPDA code paths that
>> demonstrate the need for different [at|de]tach_<>_hwpt()
> 
> Google help me find this link https://lkml.iu.edu/2309.2/08079.html
> specially https://gitlab.com/lulu6/gitlabqemutmp/-/commit/354870ff6bd9bac80c9fc5c7f944331cb24b0331
> 
thanks for vdpa ptrs; I have my diff work cut out for me, but heh, I asked for it! :)

>>
>> Patch 3: Why can't the realize() be moved to after attach?  isn't realize() suppose
>> to indicate 'all is setup and object can now be used' -- apologies for what could
>> be a dumb question, as that's my understanding of realize().  If the argument is
>> such that there needs to be two steps, how does the first realize() that put the
>> object into a used state <somehow> wait until realize_late()?
>>
>> Patch 4: Shouldn't the current/existing realize callback just be overwritten with
>> the later one, when this is needed?
> 
> realize_late() has been removed in [1]
> 
+1

>>
>> Patch 5: no issues.
>>
>> Patch 6: ewww -- fs1gp ... we use underlines all over the place for multi-word
>> elements; so how about 's1_1g_pg'
>>            -- how many places is that really used that multiple underlines is an issue?
> 
> This is vtd specific, so to follow VTD spec's naming so that we can easily find its definition by searching fs1gp.
> 
>>
ok.  You had mentioned earlier that you used VTD nomenclature, so I should have concluded the same here. /my bad.

>> Patch 7: intel-iommu-specific callbacks in the common vfio & iommufd-backend
>> code; nack. This won't compile w/intel-iommu included with iommufd... I think
>> backend, intel-iommu hw-caps should provide the generic 'caps' boolean-type
>> values/states; ... and maybe they should be extracted via vfio? .... like
>>       case HOST_IOMMU_DEVICE_CAP_AW_BITS:
>>           return vfio_device_get_aw_bits(hiod->agent);
>>
>> Patch 8: Again, VTD-specific code in IOMMUFD is a nack; again, maybe via vfio,
>> or a direct call into an iommu-device-cap api.
> 
> Patch 7&8 addressed in [1]
> 
+1
>>
>> Patch 9: no issues.
>>
>> Patch 10: "except it's stale"  likely "except when it's entry is stale" ?
> 
> addressed in [1]
> 
+1
>>            Did you ever put some tracing in to capture avg hits in cache? ... if so, add
>> as a comment.
>>            Otherwise, looks good.
>>
>> Patch 11: Apologies, I don't know what 'flts' stands for, and why it is relative to 2-
>> stage mapping, or SIOV.  Could you add verbage to explain the use of it, as the
>> rest of this patch doesn't make any sense to me without the background.
>> The patch introduces hw-info-type (none or intel), and then goes on to add a
>> large set of checks; seems like the caps & this checking should go together (split
>> for each cap; put all caps together & the check...).
> 
> OK, will do. There are some explanations in cover-letter.
> For history reason, old vtd spec define stage-1 as first level then switch to first stage.
> 
So 'flts' is 'first level then switch'.

>>
>> Patch 12: Why isn't HostIOMMUDevice extended to have another iommu-specif
>> element, opaque in HostIOMMUDevice, but set to specific IOMMU in use?   e.g.
>> void *hostiommustate;
> 
> Yes, that's possible, but we want to make a generic interface between VFIO/VDPA and vIOMMU.
> 
ok. I don't understand how VFIO & VDPA complicate that add.

>>
>> Patch 13: Isn't PASID just an extension/addition of BDF id? and doesn't each
>> PASID have its own address space?
> 
> Yes, it is.
> 
>> So, why isn't it handle via a uniqe AS cache like 'any other device'?  Maybe I'm
>> thinking too SMMU-StreamID, which can be varying length, depending on
>> subsystem support.  I see what appears to be sid+pasid calls to do the AS lookups;
>> hmm, so maybe this is the generalized BDF+pasid AS lookup?  if so, maybe a
>> better description stating this transition to a wider stream-id would set the code
>> context better.
> 
> Not quite get..
> 
I'm looking for a better description that states the AS cache lookup is broadened from bdf
to bdf+pasid.

>> As for the rest of the (400 intel-iommu) code, I'm not that in-depth in intel-iommu
>> to determine if its all correct or not.
>>
>> Patch 14: Define PGTT; the second paragraph seem self-contradicting -- it says it
>> uses a 2-stage page table in each case, but it implies it should be different.  At 580
>> lines of code changes, you win! ;-)
> 
> The host side's using nested or only stage-2 page table depends on PGTT's setting in guest.
> 
Thanks for clarification.

>>
>> Patch 15: Read-only and Read/write areas have different IOMMUFDs?  is that an
>> intel-iommu requriement?
>>            At least this intel-iommu-errata code is only in hw/i386/<> modules.
> 
> No, if ERRATA_772415, read-only areas should not be mapped, so we allocate a new VTDIOASContainer to hold only read/write areas mapping.
> We can use same IOMMUFDs for different VTDIOASContainer.
> 
ah yes; I got hung-up on different mappings, and didn't back up to AS-container split & same IOMMUFD.

>>
>> Patch 16: Looks reasonable.  What does the 'SI' mean after "CACHE_DEV",
>> "CACHE_DOM" & "CACHE_PASID" ? -- stream-invalidation?
> 
> VTD_PASID_CACHE_DEVSI stands for 'pasid cache device selective invalidation',
> VTD_PASID_CACHE_DOMSI means 'pasid cache domain selective invalidation'.
> 
That explanation helps. :)  maybe put a short blurb in the commit log, or code,
so one doesn't have to be a ninja-VTD spec consumer to comprehend those (important) diffs.

> Thanks
> Zhenzhong
> 
Again, thanks for the reply.
Looking fwd to the rfcv3 (on list) or move to v1-POST.
- Don




* RE: [PATCH rfcv2 00/20] intel_iommu: Enable stage-1 translation for passthrough device
  2025-05-19 15:39     ` Donald Dutile
@ 2025-05-20  9:13       ` Duan, Zhenzhong
  2025-05-20 10:47         ` Donald Dutile
  0 siblings, 1 reply; 68+ messages in thread
From: Duan, Zhenzhong @ 2025-05-20  9:13 UTC (permalink / raw)
  To: Donald Dutile, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, eric.auger@redhat.com,
	mst@redhat.com, jasowang@redhat.com, peterx@redhat.com,
	jgg@nvidia.com, nicolinc@nvidia.com,
	shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
	clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
	Peng, Chao P



>-----Original Message-----
>From: Donald Dutile <ddutile@redhat.com>
>Subject: Re: [PATCH rfcv2 00/20] intel_iommu: Enable stage-1 translation for
>passthrough device
>
>Hey Zhenzhong,
>Thanks for feedback. replies below.
>- Don
>
>On 5/19/25 4:37 AM, Duan, Zhenzhong wrote:
>> Hi Donald,
>>
>>> -----Original Message-----
>>> From: Donald Dutile <ddutile@redhat.com>
>>> Subject: Re: [PATCH rfcv2 00/20] intel_iommu: Enable stage-1 translation for
>>> passthrough device
>>>
>>> Zhenzhong,
>>>
>>> Hi!
>>> Eric asked me to review this series.
>>> Since it's rather late since you posted will summarize review feedback
>>> below/bottom.
>>>
>>> - Don
>>>
>>> On 2/19/25 3:22 AM, Zhenzhong Duan wrote:
...

>>>            Did you ever put some tracing in to capture avg hits in cache? ... if so,
>add
>>> as a comment.
>>>            Otherwise, looks good.
>>>
>>> Patch 11: Apologies, I don't know what 'flts' stands for, and why it is relative
>to 2-
>>> stage mapping, or SIOV.  Could you add verbage to explain the use of it, as the
>>> rest of this patch doesn't make any sense to me without the background.
>>> The patch introduces hw-info-type (none or intel), and then goes on to add a
>>> large set of checks; seems like the caps & this checking should go together
>(split
>>> for each cap; put all caps together & the check...).
>>
>> OK, will do. There are some explanations in cover-letter.
>> For history reason, old vtd spec define stage-1 as first level then switch to first
>stage.
>>
>So 'flts' is 'first level then switch' .

Sorry for the confusion; it stands for 'first level translation support'.

>
>>>
>>> Patch 12: Why isn't HostIOMMUDevice extended to have another iommu-
>specif
>>> element, opaque in HostIOMMUDevice, but set to specific IOMMU in use?
>e.g.
>>> void *hostiommustate;
>>
>> Yes, that's possible, but we want to make a generic interface between
>VFIO/VDPA and vIOMMU.
>>
>ok. I don't understand how VFIO & VPDA complicate that add.

IIUC, the hostiommustate provided by VFIO and VDPA may be in different formats.
By using a general interface like .get_cap(), we hide the resolving under the VFIO and
VDPA backends. This is like the KVM extension checking between QEMU and KVM.

FYI, there was some discussion on the interface before,
see https://lists.gnu.org/archive/html/qemu-devel/2024-04/msg02658.html
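
For example, the compat check on the vIOMMU side only goes through the
class hook, something like this hand-written sketch (not a verbatim quote
from the series; error handling trimmed):

    HostIOMMUDeviceClass *hiodc = HOST_IOMMU_DEVICE_GET_CLASS(hiod);

    /* ask the backend (VFIO or VDPA) a capability question instead of
     * peeking into backend-specific state */
    if (hiodc->get_cap(hiod, HOST_IOMMU_DEVICE_CAP_NESTING, errp) < 1) {
        return false;   /* host IOMMU cannot do nested translation */
    }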

>
>>>
>>> Patch 13: Isn't PASID just an extension/addition of BDF id? and doesn't each
>>> PASID have its own address space?
>>
>> Yes, it is.
>>
>>> So, why isn't it handle via a uniqe AS cache like 'any other device'?  Maybe I'm
>>> thinking too SMMU-StreamID, which can be varying length, depending on
>>> subsystem support.  I see what appears to be sid+pasid calls to do the AS
>lookups;
>>> hmm, so maybe this is the generalized BDF+pasid AS lookup?  if so, maybe a
>>> better description stating this transition to a wider stream-id would set the
>code
>>> context better.
>>
>> Not quite get..
>>
>I'm looking for a better description that states the AS cache lookup is broadened
>from bdf
>to bdf+pasid.

I guess you mean vtd_as_from_iommu_pasid(); it's a variant of vtd_find_add_as().
We have supported AS cache lookup by bdf+pasid for a long time; see vtd_find_add_as().
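
In other words, the address space lookup key is conceptually:

    /* conceptual only, not literal code from the series: one
     * VTDAddressSpace is cached per (bus, devfn, pasid) tuple, i.e.
     * pasid just widens the existing bdf key */
    struct {
        PCIBus  *bus;
        int      devfn;
        uint32_t pasid;
    } key;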

>
>>> As for the rest of the (400 intel-iommu) code, I'm not that in-depth in intel-
>iommu
>>> to determine if its all correct or not.
>>>
>>> Patch 14: Define PGTT; the second paragraph seem self-contradicting -- it says
>it
>>> uses a 2-stage page table in each case, but it implies it should be different.  At
>580
>>> lines of code changes, you win! ;-)
>>
>> The host side's using nested or only stage-2 page table depends on PGTT's
>setting in guest.
>>
>Thanks for clarification.
>
>>>
>>> Patch 15: Read-only and Read/write areas have different IOMMUFDs?  is that
>an
>>> intel-iommu requriement?
>>>            At least this intel-iommu-errata code is only in hw/i386/<> modules.
>>
>> No, if ERRATA_772415, read-only areas should not be mapped, so we allocate a
>new VTDIOASContainer to hold only read/write areas mapping.
>> We can use same IOMMUFDs for different VTDIOASContainer.
>>
>ah yes; I got hung-up on different mappings, and didn't back up to AS-container
>split & same IOMMUFD.
>
>>>
>>> Patch 16: Looks reasonable.  What does the 'SI' mean after "CACHE_DEV",
>>> "CACHE_DOM" & "CACHE_PASID" ? -- stream-invalidation?
>>
>> VTD_PASID_CACHE_DEVSI stands for 'pasid cache device selective invalidation',
>> VTD_PASID_CACHE_DOMSI means 'pasid cache domain selective invalidation'.
>>
>That explanation helps. :)  maybe put a short blurb in the commit log, or code,
>so one doesn't have to be a ninja-VTD spec consumer to comprehend those
>(important) diffs.

Good idea, will do.

>
>> Thanks
>> Zhenzhong
>>
>Again, thanks for the reply.
>Looking fwd to the rfcv3 (on list) or move to v1-POST.

Thanks for your comments!

BRs,
Zhenzhong



* Re: [PATCH rfcv2 00/20] intel_iommu: Enable stage-1 translation for passthrough device
  2025-05-20  9:13       ` Duan, Zhenzhong
@ 2025-05-20 10:47         ` Donald Dutile
  0 siblings, 0 replies; 68+ messages in thread
From: Donald Dutile @ 2025-05-20 10:47 UTC (permalink / raw)
  To: Duan, Zhenzhong, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, eric.auger@redhat.com,
	mst@redhat.com, jasowang@redhat.com, peterx@redhat.com,
	jgg@nvidia.com, nicolinc@nvidia.com,
	shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
	clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
	Peng, Chao P



On 5/20/25 5:13 AM, Duan, Zhenzhong wrote:
> 
> 
>> -----Original Message-----
>> From: Donald Dutile <ddutile@redhat.com>
>> Subject: Re: [PATCH rfcv2 00/20] intel_iommu: Enable stage-1 translation for
>> passthrough device
>>
>> Hey Zhenzhong,
>> Thanks for feedback. replies below.
>> - Don
>>
>> On 5/19/25 4:37 AM, Duan, Zhenzhong wrote:
>>> Hi Donald,
>>>
>>>> -----Original Message-----
>>>> From: Donald Dutile <ddutile@redhat.com>
>>>> Subject: Re: [PATCH rfcv2 00/20] intel_iommu: Enable stage-1 translation for
>>>> passthrough device
>>>>
>>>> Zhenzhong,
>>>>
>>>> Hi!
>>>> Eric asked me to review this series.
>>>> Since it's rather late since you posted will summarize review feedback
>>>> below/bottom.
>>>>
>>>> - Don
>>>>
>>>> On 2/19/25 3:22 AM, Zhenzhong Duan wrote:
> ...
> 
>>>>             Did you ever put some tracing in to capture avg hits in cache? ... if so,
>> add
>>>> as a comment.
>>>>             Otherwise, looks good.
>>>>
>>>> Patch 11: Apologies, I don't know what 'flts' stands for, and why it is relative
>> to 2-
>>>> stage mapping, or SIOV.  Could you add verbage to explain the use of it, as the
>>>> rest of this patch doesn't make any sense to me without the background.
>>>> The patch introduces hw-info-type (none or intel), and then goes on to add a
>>>> large set of checks; seems like the caps & this checking should go together
>> (split
>>>> for each cap; put all caps together & the check...).
>>>
>>> OK, will do. There are some explanations in cover-letter.
>>> For history reason, old vtd spec define stage-1 as first level then switch to first
>> stage.
>>>
>> So 'flts' is 'first level then switch' .
> 
> Sorry for confusion, it stands for 'first level translation support'.
> 
Thanks.

>>
>>>>
>>>> Patch 12: Why isn't HostIOMMUDevice extended to have another iommu-
>> specif
>>>> element, opaque in HostIOMMUDevice, but set to specific IOMMU in use?
>> e.g.
>>>> void *hostiommustate;
>>>
>>> Yes, that's possible, but we want to make a generic interface between
>> VFIO/VDPA and vIOMMU.
>>>
>> ok. I don't understand how VFIO & VPDA complicate that add.
> 
> IIUC, the hostiommustate provided by VFIO and VDPA may be different format.
> By using a general interface like .get_cap(), we hide the resolving under VFIO and
> VDPA backend. This is like the KVM extension checking between QEMU and KVM.
> 
> FYI, there was some discuss on the interface before,
> see https://lists.gnu.org/archive/html/qemu-devel/2024-04/msg02658.html
> 
Good analogy, thanks. I'll reach out to Cedric on the above discussion as well.

>>
>>>>
>>>> Patch 13: Isn't PASID just an extension/addition of BDF id? and doesn't each
>>>> PASID have its own address space?
>>>
>>> Yes, it is.
>>>
>>>> So, why isn't it handle via a uniqe AS cache like 'any other device'?  Maybe I'm
>>>> thinking too SMMU-StreamID, which can be varying length, depending on
>>>> subsystem support.  I see what appears to be sid+pasid calls to do the AS
>> lookups;
>>>> hmm, so maybe this is the generalized BDF+pasid AS lookup?  if so, maybe a
>>>> better description stating this transition to a wider stream-id would set the
>> code
>>>> context better.
>>>
>>> Not quite get..
>>>
>> I'm looking for a better description that states the AS cache lookup is broadened
>>from bdf
>> to bdf+pasid.
> 
> Guess you mean vtd_as_from_iommu_pasid(), it's a variant of vtd_find_add_as().
> We support AS cache lookup by bdf+pasid for a long time, see vtd_find_add_as().
> 
Thanks for clarification.

>>
>>>> As for the rest of the (400 intel-iommu) code, I'm not that in-depth in intel-
>> iommu
>>>> to determine if its all correct or not.
>>>>
>>>> Patch 14: Define PGTT; the second paragraph seem self-contradicting -- it says
>> it
>>>> uses a 2-stage page table in each case, but it implies it should be different.  At
>> 580
>>>> lines of code changes, you win! ;-)
>>>
>>> The host side's using nested or only stage-2 page table depends on PGTT's
>> setting in guest.
>>>
>> Thanks for clarification.
>>
>>>>
>>>> Patch 15: Read-only and Read/write areas have different IOMMUFDs?  is that
>> an
>>>> intel-iommu requriement?
>>>>             At least this intel-iommu-errata code is only in hw/i386/<> modules.
>>>
>>> No, if ERRATA_772415, read-only areas should not be mapped, so we allocate a
>> new VTDIOASContainer to hold only read/write areas mapping.
>>> We can use same IOMMUFDs for different VTDIOASContainer.
>>>
>> ah yes; I got hung-up on different mappings, and didn't back up to AS-container
>> split & same IOMMUFD.
>>
>>>>
>>>> Patch 16: Looks reasonable.  What does the 'SI' mean after "CACHE_DEV",
>>>> "CACHE_DOM" & "CACHE_PASID" ? -- stream-invalidation?
>>>
>>> VTD_PASID_CACHE_DEVSI stands for 'pasid cache device selective invalidation',
>>> VTD_PASID_CACHE_DOMSI means 'pasid cache domain selective invalidation'.
>>>
>> That explanation helps. :)  maybe put a short blurb in the commit log, or code,
>> so one doesn't have to be a ninja-VTD spec consumer to comprehend those
>> (important) diffs.
> 
> Good idea, will do.
> 
>>
>>> Thanks
>>> Zhenzhong
>>>
>> Again, thanks for the reply.
>> Looking fwd to the rfcv3 (on list) or move to v1-POST.
> 
> Thanks for your comments!
> 
> BRs,
> Zhenzhong
> 
Thanks for the added clarifications.
- Don





Thread overview: 68+ messages
2025-02-19  8:22 [PATCH rfcv2 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
2025-02-19  8:22 ` [PATCH rfcv2 01/20] backends/iommufd: Add helpers for invalidating user-managed HWPT Zhenzhong Duan
2025-02-20 16:47   ` Eric Auger
2025-02-28  2:26     ` Duan, Zhenzhong
2025-02-24 10:03   ` Shameerali Kolothum Thodi via
2025-02-28  9:36     ` Duan, Zhenzhong
2025-02-19  8:22 ` [PATCH rfcv2 02/20] vfio/iommufd: Add properties and handlers to TYPE_HOST_IOMMU_DEVICE_IOMMUFD Zhenzhong Duan
2025-02-20 17:42   ` Eric Auger
2025-02-28  5:39     ` Duan, Zhenzhong
2025-02-19  8:22 ` [PATCH rfcv2 03/20] HostIOMMUDevice: Introduce realize_late callback Zhenzhong Duan
2025-02-20 17:48   ` Eric Auger
2025-02-28  8:16     ` Duan, Zhenzhong
2025-03-06 15:53       ` Eric Auger
2025-04-07 11:19   ` Cédric Le Goater
2025-04-08  8:00     ` Cédric Le Goater
2025-04-09  8:27       ` Duan, Zhenzhong
2025-04-09  9:58         ` Cédric Le Goater
2025-02-19  8:22 ` [PATCH rfcv2 04/20] vfio/iommufd: Implement HostIOMMUDeviceClass::realize_late() handler Zhenzhong Duan
2025-02-20 18:07   ` Eric Auger
2025-02-28  8:23     ` Duan, Zhenzhong
2025-02-19  8:22 ` [PATCH rfcv2 05/20] vfio/iommufd: Implement [at|de]tach_hwpt handlers Zhenzhong Duan
2025-02-20 18:13   ` Eric Auger
2025-02-28  8:24     ` Duan, Zhenzhong
2025-03-06 15:56       ` Eric Auger
2025-02-19  8:22 ` [PATCH rfcv2 06/20] host_iommu_device: Define two new capabilities HOST_IOMMU_DEVICE_CAP_[NESTING|FS1GP] Zhenzhong Duan
2025-02-20 18:41   ` Eric Auger
2025-02-20 18:44     ` Eric Auger
2025-02-28  8:29     ` Duan, Zhenzhong
2025-03-06 15:59       ` Eric Auger
2025-03-06 19:45         ` Nicolin Chen
2025-03-10  3:48           ` Duan, Zhenzhong
2025-02-19  8:22 ` [PATCH rfcv2 07/20] iommufd: Implement query of HOST_IOMMU_DEVICE_CAP_[NESTING|FS1GP] Zhenzhong Duan
2025-02-20 19:00   ` Eric Auger
2025-02-28  8:32     ` Duan, Zhenzhong
2025-02-19  8:22 ` [PATCH rfcv2 08/20] iommufd: Implement query of HOST_IOMMU_DEVICE_CAP_ERRATA Zhenzhong Duan
2025-02-20 18:55   ` Eric Auger
2025-02-28  8:31     ` Duan, Zhenzhong
2025-02-19  8:22 ` [PATCH rfcv2 09/20] intel_iommu: Rename vtd_ce_get_rid2pasid_entry to vtd_ce_get_pasid_entry Zhenzhong Duan
2025-02-21  6:39   ` CLEMENT MATHIEU--DRIF
2025-02-21 10:11   ` Eric Auger
2025-02-28  8:47     ` Duan, Zhenzhong
2025-02-19  8:22 ` [PATCH rfcv2 10/20] intel_iommu: Optimize context entry cache utilization Zhenzhong Duan
2025-02-21 10:00   ` Eric Auger
2025-02-28  8:34     ` Duan, Zhenzhong
2025-02-19  8:22 ` [PATCH rfcv2 11/20] intel_iommu: Check for compatibility with IOMMUFD backed device when x-flts=on Zhenzhong Duan
2025-02-21 12:49   ` Eric Auger
2025-02-21 14:18     ` Eric Auger
2025-02-28  8:57     ` Duan, Zhenzhong
2025-02-19  8:22 ` [PATCH rfcv2 12/20] intel_iommu: Introduce a new structure VTDHostIOMMUDevice Zhenzhong Duan
2025-02-21 13:03   ` Eric Auger
2025-02-28  8:58     ` Duan, Zhenzhong
2025-02-19  8:22 ` [PATCH rfcv2 13/20] intel_iommu: Add PASID cache management infrastructure Zhenzhong Duan
2025-02-21 17:02   ` Eric Auger
2025-02-28  9:35     ` Duan, Zhenzhong
2025-02-19  8:22 ` [PATCH rfcv2 14/20] intel_iommu: Bind/unbind guest page table to host Zhenzhong Duan
2025-02-19  8:22 ` [PATCH rfcv2 15/20] intel_iommu: ERRATA_772415 workaround Zhenzhong Duan
2025-02-19  8:22 ` [PATCH rfcv2 16/20] intel_iommu: Replay pasid binds after context cache invalidation Zhenzhong Duan
2025-02-19  8:22 ` [PATCH rfcv2 17/20] intel_iommu: Propagate PASID-based iotlb invalidation to host Zhenzhong Duan
2025-02-19  8:22 ` [PATCH rfcv2 18/20] intel_iommu: Refresh pasid bind when either SRTP or TE bit is changed Zhenzhong Duan
2025-02-19  8:22 ` [PATCH rfcv2 19/20] intel_iommu: Bypass replay in stage-1 page table mode Zhenzhong Duan
2025-02-19  8:22 ` [PATCH rfcv2 20/20] intel_iommu: Enable host device when x-flts=on in scalable mode Zhenzhong Duan
2025-02-20 19:03 ` [PATCH rfcv2 00/20] intel_iommu: Enable stage-1 translation for passthrough device Eric Auger
2025-02-21  6:08   ` Duan, Zhenzhong
2025-04-05  3:01 ` Donald Dutile
2025-05-19  8:37   ` Duan, Zhenzhong
2025-05-19 15:39     ` Donald Dutile
2025-05-20  9:13       ` Duan, Zhenzhong
2025-05-20 10:47         ` Donald Dutile
