* [PATCH v7 00/23] intel_iommu: Enable first stage translation for passthrough device
@ 2025-10-24  8:43 Zhenzhong Duan
From: Zhenzhong Duan @ 2025-10-24  8:43 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, skolothumtho, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan

Hi,

For a passthrough device with intel_iommu.x-flts=on, we don't shadow the
guest page table; instead, the first stage page table is passed to the host
side to construct a nested HWPT. There was an earlier effort to enable this
feature, see [1] for details.

The key design is to utilize the dual-stage IOMMU translation (also known as
IOMMU nested translation) capability of the host IOMMU. As the diagram below
shows, the guest I/O page table pointer in GPA (guest physical address) is
passed to the host and used to perform the first stage address translation.
Along with it, modifications to present mappings in the guest I/O page table
must be followed by an IOTLB invalidation.

        .-------------.  .---------------------------.
        |   vIOMMU    |  | Guest I/O page table      |
        |             |  '---------------------------'
        .----------------/
        | PASID Entry |--- PASID cache flush --+
        '-------------'                        |
        |             |                        V
        |             |           I/O page table pointer in GPA
        '-------------'
    Guest
    ------| Shadow |---------------------------|--------
          v        v                           v
    Host
        .-------------.  .-----------------------------.
        |   pIOMMU    |  | First stage for GIOVA->GPA  |
        |             |  '-----------------------------'
        .----------------/  |
        | PASID Entry |     V (Nested xlate)
        '----------------\.--------------------------------------------.
        |             |   | Second stage for GPA->HPA, unmanaged domain|
        |             |   '--------------------------------------------'
        '-------------'
<Intel VT-d Nested translation>

This series reuses the VFIO device's default HWPT as the nesting parent
instead of creating a new one. This avoids duplicating the code of a new
memory listener; all existing features of the VFIO listener can be shared,
e.g., RAM discard, dirty tracking, etc. Two limitations are: 1) a VFIO
device under a PCI bridge together with an emulated device is not supported,
because the emulated device wants the IOMMU AS while the VFIO device sticks
to the system AS; 2) kexec or reboot from "intel_iommu=on,sm_on" to
"intel_iommu=on,sm_off" is not supported on platforms with
ERRATA_772415_SPR17, because the VFIO device's default HWPT is created with
the NEST_PARENT flag and the kernel inhibits RO mappings when switching to
shadow mode.

This series is also prerequisite work for vSVA, i.e., sharing guest
application address space with passthrough devices.

There are some interactions between VFIO and vIOMMU:
* vIOMMU registers PCIIOMMUOps [set|unset]_iommu_device to the PCI
  subsystem. VFIO calls them to register/unregister a HostIOMMUDevice
  instance to vIOMMU at the VFIO device realize stage.
* vIOMMU registers PCIIOMMUOps get_viommu_flags to the PCI subsystem.
  VFIO calls it to get the flags exposed by vIOMMU.
* vIOMMU calls the HostIOMMUDeviceIOMMUFD interface [at|de]tach_hwpt
  to bind/unbind a device to/from IOMMUFD backed domains, either nested
  domains or not.

See below diagram:

        VFIO Device                                 Intel IOMMU
    .-----------------.                         .-------------------.
    |                 |                         |                   |
    |       .---------|PCIIOMMUOps              |.-------------.    |
    |       | IOMMUFD |(set/unset_iommu_device) || Host IOMMU  |    |
    |       | Device  |------------------------>|| Device list |    |
    |       .---------|(get_viommu_flags)       |.-------------.    |
    |                 |                         |       |           |
    |                 |                         |       V           |
    |       .---------|  HostIOMMUDeviceIOMMUFD |  .-------------.  |
    |       | IOMMUFD |            (attach_hwpt)|  | Host IOMMU  |  |
    |       | link    |<------------------------|  |   Device    |  |
    |       .---------|            (detach_hwpt)|  .-------------.  |
    |                 |                         |       |           |
    |                 |                         |       ...         |
    .-----------------.                         .-------------------.

Below is an example of enabling first stage translation for a passthrough device:

    -M q35,...
    -device intel-iommu,x-scalable-mode=on,x-flts=on...
    -object iommufd,id=iommufd0 -device vfio-pci,iommufd=iommufd0,...

Tests done:
- VFIO devices hotplug/unplug
- different VFIO devices linked to different iommufds
- vhost net device ping test
- migration with QAT passthrough

PATCH01-09: Some preparing work
PATCH10-11: Compatibility check between vIOMMU and Host IOMMU
PATCH12-16: Implement first stage page table for passthrough device
PATCH17-19: Workaround for ERRATA_772415_SPR17
PATCH20-21: Add migration support and optimization
PATCH22:    Enable first stage translation for passthrough device
PATCH23:    Add doc

QEMU code can be found at [2]; it's based on
vfio-next + domain_switch_series[3] + migration_relax_series[4].

Fault event injection to the guest isn't supported in this series; we presume
the guest kernel always constructs correct first stage page tables for
passthrough devices. For emulated devices, the emulation code already
provides first stage fault injection.

TODO:
- Fault event injection to guest when HW first stage page table faults

[1] https://patchwork.kernel.org/project/kvm/cover/20210302203827.437645-1-yi.l.liu@intel.com/
[2] https://github.com/yiliu1765/qemu/tree/zhenzhong/iommufd_nesting.v7
[3] https://lore.kernel.org/qemu-devel/20251017093602.525338-1-zhenzhong.duan@intel.com/
[4] https://lore.kernel.org/qemu-devel/20251024020922.13053-1-zhenzhong.duan@intel.com/

Thanks
Zhenzhong

Changelog:
v7:
- s/host_iommu_extract_vendor_caps/host_iommu_extract_quirks (Nicolin)
- s/RID_PASID/PASID_0 (Eric)
- drop rid2pasid check in vtd_do_iommu_translate (Eric)
- refine DID check in vtd_pasid_cache_sync_locked (Liuyi)
- refine commit log (Nicolin, Eric, Liuyi)
- Fix doc build (Cedric)
- add migration support

v6:
- delete RPS capability related supporting code (Eric, Yi)
- use terminology 'first/second stage' to replace 'first/second level' (Eric, Yi)
- use get_viommu_flags() instead of get_viommu_caps() (Nicolin)
- drop non-RID_PASID related code and simplify pasid invalidation handling (Eric, Yi)
- drop the patch that handle pasid replay when context invalidation (Eric)
- move vendor specific cap check from VFIO core to backend/iommufd.c (Nicolin)

v5:
- refine commit log of patch2 (Cedric, Nicolin)
- introduce helper vfio_pci_from_vfio_device() (Cedric)
- introduce helper vfio_device_viommu_get_nested() (Cedric)
- pass 'bool bypass_ro' argument to vfio_listener_valid_section() instead of 'VFIOContainerBase *' (Cedric)
- fix a potential build error reported by Jim Shu

v4:
- s/VIOMMU_CAP_STAGE1/VIOMMU_CAP_HW_NESTED (Eric, Nicolin, Donald, Shameer)
- clarify get_viommu_cap() return pure emulated caps and explain reason in commit log (Eric)
- retrieve the ce only if vtd_as->pasid in vtd_as_to_iommu_pasid_locked (Eric)
- refine doc comment and commit log in patch10-11 (Eric)

v3:
- define enum type for VIOMMU_CAP_* (Eric)
- drop inline flag in the patch which uses the helper (Eric)
- use extract64 in new introduced MACRO (Eric)
- polish comments and fix typo error (Eric)
- split workaround patch for ERRATA_772415_SPR17 to two patches (Eric)
- optimize bind/unbind error path processing

v2:
- introduce get_viommu_cap() to get STAGE1 flag to create nesting parent HWPT (Liuyi)
- reuse VFIO's default HWPT as parent HWPT of nested translation (Nicolin, Liuyi)
- abandon support of VFIO device under pcie-to-pci bridge to simplify design (Liuyi)
- bypass RO mapping in VFIO's default HWPT if ERRATA_772415_SPR17 (Liuyi)
- drop vtd_dev_to_context_entry optimization (Liuyi)

v1:
- simplify vendor specific checking in vtd_check_hiod (Cedric, Nicolin)
- rebase to master

rfcv3:
- s/hwpt_id/id in iommufd_backend_invalidate_cache()'s parameter (Shameer)
- hide vtd vendor specific caps in a wrapper union (Eric, Nicolin)
- simplify return value check of get_cap() (Eric)
- drop realize_late (Cedric, Eric)
- split patch13:intel_iommu: Add PASID cache management infrastructure (Eric)
- s/vtd_pasid_cache_reset/vtd_pasid_cache_reset_locked (Eric)
- s/vtd_pe_get_domain_id/vtd_pe_get_did (Eric)
- refine comments (Eric, Donald)

rfcv2:
- Drop VTDPASIDAddressSpace and use VTDAddressSpace (Eric, Liuyi)
- Move HWPT uAPI patches ahead(patch1-8) so arm nesting could easily rebase
- add two cleanup patches(patch9-10)
- VFIO passes iommufd/devid/hwpt_id to vIOMMU instead of iommufd/devid/ioas_id
- add vtd_as_[from|to]_iommu_pasid() helper to translate between vtd_as and
  iommu pasid, this is important for dropping VTDPASIDAddressSpace


Yi Liu (3):
  intel_iommu: Propagate PASID-based iotlb invalidation to host
  intel_iommu: Replay all pasid bindings when either SRTP or TE bit is
    changed
  intel_iommu: Replay pasid bindings after context cache invalidation

Zhenzhong Duan (20):
  intel_iommu: Rename vtd_ce_get_rid2pasid_entry to
    vtd_ce_get_pasid_entry
  intel_iommu: Delete RPS capability related supporting code
  intel_iommu: Update terminology to match VTD spec
  hw/pci: Export pci_device_get_iommu_bus_devfn() and return bool
  hw/pci: Introduce pci_device_get_viommu_flags()
  intel_iommu: Implement get_viommu_flags() callback
  intel_iommu: Introduce a new structure VTDHostIOMMUDevice
  vfio/iommufd: Force creating nesting parent HWPT
  intel_iommu: Stick to system MR for IOMMUFD backed host device when
    x-flts=on
  intel_iommu: Check for compatibility with IOMMUFD backed device when
    x-flts=on
  intel_iommu: Fail passthrough device under PCI bridge if x-flts=on
  intel_iommu: Add some macros and inline functions
  intel_iommu: Bind/unbind guest page table to host
  iommufd: Introduce a helper function to extract vendor capabilities
  vfio: Add a new element bypass_ro in VFIOContainer
  Workaround for ERRATA_772415_SPR17
  vfio: Bypass readonly region for dirty tracking
  intel_iommu: Add migration support with x-flts=on
  intel_iommu: Enable host device when x-flts=on in scalable mode
  docs/devel: Add IOMMUFD nesting documentation

 MAINTAINERS                        |   1 +
 docs/devel/vfio-iommufd.rst        |  25 +
 hw/i386/intel_iommu_internal.h     | 107 ++--
 include/hw/i386/intel_iommu.h      |   5 +-
 include/hw/iommu.h                 |  30 ++
 include/hw/pci/pci.h               |  24 +
 include/hw/vfio/vfio-container.h   |   1 +
 include/hw/vfio/vfio-device.h      |   2 +
 include/system/host_iommu_device.h |  15 +
 backends/iommufd.c                 |  13 +
 hw/i386/intel_iommu.c              | 785 +++++++++++++++++++++--------
 hw/pci/pci.c                       |  23 +-
 hw/vfio/device.c                   |  12 +
 hw/vfio/iommufd.c                  |  19 +-
 hw/vfio/listener.c                 |  28 +-
 tests/qtest/intel-iommu-test.c     |   4 +-
 hw/i386/trace-events               |   4 +
 17 files changed, 831 insertions(+), 267 deletions(-)
 create mode 100644 include/hw/iommu.h

-- 
2.47.1




* [PATCH v7 01/23] intel_iommu: Rename vtd_ce_get_rid2pasid_entry to vtd_ce_get_pasid_entry
From: Zhenzhong Duan @ 2025-10-24  8:43 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, skolothumtho, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan

In the early days, vtd_ce_get_rid2pasid_entry() was used to get the pasid
entry of rid2pasid; it was later extended to get any pasid entry. The new
name vtd_ce_get_pasid_entry better matches what it actually does.

No functional change intended.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Clément Mathieu--Drif <clement.mathieu--drif@eviden.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
---
 hw/i386/intel_iommu.c | 22 ++++++++++------------
 1 file changed, 10 insertions(+), 12 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index b00fdecaf8..70746e3080 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -956,10 +956,8 @@ static int vtd_get_pe_from_pasid_table(IntelIOMMUState *s,
     return 0;
 }
 
-static int vtd_ce_get_rid2pasid_entry(IntelIOMMUState *s,
-                                      VTDContextEntry *ce,
-                                      VTDPASIDEntry *pe,
-                                      uint32_t pasid)
+static int vtd_ce_get_pasid_entry(IntelIOMMUState *s, VTDContextEntry *ce,
+                                  VTDPASIDEntry *pe, uint32_t pasid)
 {
     dma_addr_t pasid_dir_base;
     int ret = 0;
@@ -1037,7 +1035,7 @@ static uint32_t vtd_get_iova_level(IntelIOMMUState *s,
     VTDPASIDEntry pe;
 
     if (s->root_scalable) {
-        vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
+        vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
         if (s->flts) {
             return VTD_PE_GET_FL_LEVEL(&pe);
         } else {
@@ -1060,7 +1058,7 @@ static uint32_t vtd_get_iova_agaw(IntelIOMMUState *s,
     VTDPASIDEntry pe;
 
     if (s->root_scalable) {
-        vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
+        vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
         return 30 + ((pe.val[0] >> 2) & VTD_SM_PASID_ENTRY_AW) * 9;
     }
 
@@ -1128,7 +1126,7 @@ static dma_addr_t vtd_get_iova_pgtbl_base(IntelIOMMUState *s,
     VTDPASIDEntry pe;
 
     if (s->root_scalable) {
-        vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
+        vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
         if (s->flts) {
             return pe.val[2] & VTD_SM_PASID_ENTRY_FLPTPTR;
         } else {
@@ -1534,7 +1532,7 @@ static int vtd_ce_rid2pasid_check(IntelIOMMUState *s,
      * has valid rid2pasid setting, which includes valid
      * rid2pasid field and corresponding pasid entry setting
      */
-    return vtd_ce_get_rid2pasid_entry(s, ce, &pe, PCI_NO_PASID);
+    return vtd_ce_get_pasid_entry(s, ce, &pe, PCI_NO_PASID);
 }
 
 /* Map a device to its corresponding domain (context-entry) */
@@ -1623,7 +1621,7 @@ static uint16_t vtd_get_domain_id(IntelIOMMUState *s,
     VTDPASIDEntry pe;
 
     if (s->root_scalable) {
-        vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
+        vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
         return VTD_SM_PASID_ENTRY_DID(pe.val[1]);
     }
 
@@ -1699,7 +1697,7 @@ static bool vtd_dev_pt_enabled(IntelIOMMUState *s, VTDContextEntry *ce,
     int ret;
 
     if (s->root_scalable) {
-        ret = vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
+        ret = vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
         if (ret) {
             /*
              * This error is guest triggerable. We should assumt PT
@@ -3085,7 +3083,7 @@ static inline int vtd_dev_get_pe_from_pasid(VTDAddressSpace *vtd_as,
         return ret;
     }
 
-    return vtd_ce_get_rid2pasid_entry(s, &ce, pe, vtd_as->pasid);
+    return vtd_ce_get_pasid_entry(s, &ce, pe, vtd_as->pasid);
 }
 
 static int vtd_pasid_entry_compare(VTDPASIDEntry *p1, VTDPASIDEntry *p2)
@@ -5204,7 +5202,7 @@ static int vtd_pri_perform_implicit_invalidation(VTDAddressSpace *vtd_as,
     if (ret) {
         return -EINVAL;
     }
-    ret = vtd_ce_get_rid2pasid_entry(s, &ce, &pe, vtd_as->pasid);
+    ret = vtd_ce_get_pasid_entry(s, &ce, &pe, vtd_as->pasid);
     if (ret) {
         return -EINVAL;
     }
-- 
2.47.1




* [PATCH v7 02/23] intel_iommu: Delete RPS capability related supporting code
From: Zhenzhong Duan @ 2025-10-24  8:43 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, skolothumtho, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan

RID-PASID Support (RPS) is not set in the vIOMMU ECAP register; the
supporting code is there but never takes effect.

Meanwhile, according to VTD spec section 3.4.3:
"Implementations not supporting RID_PASID capability (ECAP_REG.RPS is 0b),
use a PASID value of 0 to perform address translation for requests without
PASID."

We should delete the supporting code which fetches the RID_PASID field from
the scalable context entry and use 0 as RID_PASID directly, because the
RID_PASID field is ignored without RPS support, according to the spec.

This simplifies the code and doesn't bring any penalty.

Suggested-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/i386/intel_iommu_internal.h |  1 -
 hw/i386/intel_iommu.c          | 82 +++++++++++-----------------------
 2 files changed, 27 insertions(+), 56 deletions(-)

diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 75bafdf0cd..bf8fb2aa80 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -609,7 +609,6 @@ typedef struct VTDRootEntry VTDRootEntry;
 #define VTD_CTX_ENTRY_LEGACY_SIZE     16
 #define VTD_CTX_ENTRY_SCALABLE_SIZE   32
 
-#define VTD_SM_CONTEXT_ENTRY_RID2PASID_MASK 0xfffff
 #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw)  (0x1e0ULL | ~VTD_HAW_MASK(aw))
 #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1      0xffffffffffe00000ULL
 #define VTD_SM_CONTEXT_ENTRY_PRE            0x10ULL
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 70746e3080..06065d16b6 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -42,8 +42,7 @@
 #include "trace.h"
 
 /* context entry operations */
-#define VTD_CE_GET_RID2PASID(ce) \
-    ((ce)->val[1] & VTD_SM_CONTEXT_ENTRY_RID2PASID_MASK)
+#define PASID_0    0
 #define VTD_CE_GET_PASID_DIR_TABLE(ce) \
     ((ce)->val[0] & VTD_PASID_DIR_BASE_ADDR_MASK)
 #define VTD_CE_GET_PRE(ce) \
@@ -963,7 +962,7 @@ static int vtd_ce_get_pasid_entry(IntelIOMMUState *s, VTDContextEntry *ce,
     int ret = 0;
 
     if (pasid == PCI_NO_PASID) {
-        pasid = VTD_CE_GET_RID2PASID(ce);
+        pasid = PASID_0;
     }
     pasid_dir_base = VTD_CE_GET_PASID_DIR_TABLE(ce);
     ret = vtd_get_pe_from_pasid_table(s, pasid_dir_base, pasid, pe);
@@ -982,7 +981,7 @@ static int vtd_ce_get_pasid_fpd(IntelIOMMUState *s,
     VTDPASIDEntry pe;
 
     if (pasid == PCI_NO_PASID) {
-        pasid = VTD_CE_GET_RID2PASID(ce);
+        pasid = PASID_0;
     }
     pasid_dir_base = VTD_CE_GET_PASID_DIR_TABLE(ce);
 
@@ -1522,17 +1521,15 @@ static inline int vtd_context_entry_rsvd_bits_check(IntelIOMMUState *s,
     return 0;
 }
 
-static int vtd_ce_rid2pasid_check(IntelIOMMUState *s,
-                                  VTDContextEntry *ce)
+static int vtd_ce_pasid_0_check(IntelIOMMUState *s, VTDContextEntry *ce)
 {
     VTDPASIDEntry pe;
 
     /*
      * Make sure in Scalable Mode, a present context entry
-     * has valid rid2pasid setting, which includes valid
-     * rid2pasid field and corresponding pasid entry setting
+     * has valid pasid entry setting at PASID_0.
      */
-    return vtd_ce_get_pasid_entry(s, ce, &pe, PCI_NO_PASID);
+    return vtd_ce_get_pasid_entry(s, ce, &pe, PASID_0);
 }
 
 /* Map a device to its corresponding domain (context-entry) */
@@ -1593,12 +1590,11 @@ static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
         }
     } else {
         /*
-         * Check if the programming of context-entry.rid2pasid
-         * and corresponding pasid setting is valid, and thus
-         * avoids to check pasid entry fetching result in future
-         * helper function calling.
+         * Check if the programming of pasid setting of PASID_0
+         * is valid, and thus avoids to check pasid entry fetching
+         * result in future helper function calling.
          */
-        ret_fr = vtd_ce_rid2pasid_check(s, ce);
+        ret_fr = vtd_ce_pasid_0_check(s, ce);
         if (ret_fr) {
             return ret_fr;
         }
@@ -2110,7 +2106,6 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
     bool reads = true;
     bool writes = true;
     uint8_t access_flags, pgtt;
-    bool rid2pasid = (pasid == PCI_NO_PASID) && s->root_scalable;
     VTDIOTLBEntry *iotlb_entry;
     uint64_t xlat, size;
 
@@ -2122,21 +2117,23 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
 
     vtd_iommu_lock(s);
 
-    cc_entry = &vtd_as->context_cache_entry;
+    if (pasid == PCI_NO_PASID && s->root_scalable) {
+        pasid = PASID_0;
+    }
 
-    /* Try to fetch pte from IOTLB, we don't need RID2PASID logic */
-    if (!rid2pasid) {
-        iotlb_entry = vtd_lookup_iotlb(s, source_id, pasid, addr);
-        if (iotlb_entry) {
-            trace_vtd_iotlb_page_hit(source_id, addr, iotlb_entry->pte,
-                                     iotlb_entry->domain_id);
-            pte = iotlb_entry->pte;
-            access_flags = iotlb_entry->access_flags;
-            page_mask = iotlb_entry->mask;
-            goto out;
-        }
+    /* Try to fetch pte from IOTLB */
+    iotlb_entry = vtd_lookup_iotlb(s, source_id, pasid, addr);
+    if (iotlb_entry) {
+        trace_vtd_iotlb_page_hit(source_id, addr, iotlb_entry->pte,
+                                 iotlb_entry->domain_id);
+        pte = iotlb_entry->pte;
+        access_flags = iotlb_entry->access_flags;
+        page_mask = iotlb_entry->mask;
+        goto out;
     }
 
+    cc_entry = &vtd_as->context_cache_entry;
+
     /* Try to fetch context-entry from cache first */
     if (cc_entry->context_cache_gen == s->context_cache_gen) {
         trace_vtd_iotlb_cc_hit(bus_num, devfn, cc_entry->context_entry.hi,
@@ -2173,10 +2170,6 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
         cc_entry->context_cache_gen = s->context_cache_gen;
     }
 
-    if (rid2pasid) {
-        pasid = VTD_CE_GET_RID2PASID(&ce);
-    }
-
     /*
      * We don't need to translate for pass-through context entries.
      * Also, let's ignore IOTLB caching as well for PT devices.
@@ -2202,19 +2195,6 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
         return true;
     }
 
-    /* Try to fetch pte from IOTLB for RID2PASID slow path */
-    if (rid2pasid) {
-        iotlb_entry = vtd_lookup_iotlb(s, source_id, pasid, addr);
-        if (iotlb_entry) {
-            trace_vtd_iotlb_page_hit(source_id, addr, iotlb_entry->pte,
-                                     iotlb_entry->domain_id);
-            pte = iotlb_entry->pte;
-            access_flags = iotlb_entry->access_flags;
-            page_mask = iotlb_entry->mask;
-            goto out;
-        }
-    }
-
     if (s->flts && s->root_scalable) {
         ret_fr = vtd_iova_to_flpte(s, &ce, addr, is_write, &pte, &level,
                                    &reads, &writes, s->aw_bits, pasid);
@@ -2477,20 +2457,14 @@ static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
         ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
                                        vtd_as->devfn, &ce);
         if (!ret && domain_id == vtd_get_domain_id(s, &ce, vtd_as->pasid)) {
-            uint32_t rid2pasid = PCI_NO_PASID;
-
-            if (s->root_scalable) {
-                rid2pasid = VTD_CE_GET_RID2PASID(&ce);
-            }
-
             /*
              * In legacy mode, vtd_as->pasid == pasid is always true.
              * In scalable mode, for vtd address space backing a PCI
              * device without pasid, needs to compare pasid with
-             * rid2pasid of this device.
+             * PASID_0 of this device.
              */
             if (!(vtd_as->pasid == pasid ||
-                  (vtd_as->pasid == PCI_NO_PASID && pasid == rid2pasid))) {
+                  (vtd_as->pasid == PCI_NO_PASID && pasid == PASID_0))) {
                 continue;
             }
 
@@ -2995,9 +2969,7 @@ static void vtd_piotlb_pasid_invalidate(IntelIOMMUState *s,
         if (!vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
                                       vtd_as->devfn, &ce) &&
             domain_id == vtd_get_domain_id(s, &ce, vtd_as->pasid)) {
-            uint32_t rid2pasid = VTD_CE_GET_RID2PASID(&ce);
-
-            if ((vtd_as->pasid != PCI_NO_PASID || pasid != rid2pasid) &&
+            if ((vtd_as->pasid != PCI_NO_PASID || pasid != PASID_0) &&
                 vtd_as->pasid != pasid) {
                 continue;
             }
-- 
2.47.1




* [PATCH v7 03/23] intel_iommu: Update terminology to match VTD spec
From: Zhenzhong Duan @ 2025-10-24  8:43 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, skolothumtho, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan, Paolo Bonzini

VTD spec revision 3.4, released in December 2021, renamed "First-level" to
"First-stage" and "Second-level" to "Second-stage".

Do the same in the intel_iommu code to match the spec: change all existing
"fl/sl/FL/SL/first level/second level/stage-1/stage-2" terminology to
"fs/ss/FS/SS/first stage/second stage".

Opportunistically fix an error message printing "flts=on" instead of "x-flts=on".

No functional changes intended.

Suggested-by: Yi Liu <yi.l.liu@intel.com>
Suggested-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
---
 hw/i386/intel_iommu_internal.h |  85 ++++++------
 include/hw/i386/intel_iommu.h  |   2 +-
 hw/i386/intel_iommu.c          | 247 +++++++++++++++++----------------
 tests/qtest/intel-iommu-test.c |   4 +-
 4 files changed, 170 insertions(+), 168 deletions(-)

diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index bf8fb2aa80..ba0f1f5096 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -196,8 +196,8 @@
 #define VTD_ECAP_PSS                (7ULL << 35) /* limit: MemTxAttrs::pid */
 #define VTD_ECAP_PASID              (1ULL << 40)
 #define VTD_ECAP_SMTS               (1ULL << 43)
-#define VTD_ECAP_SLTS               (1ULL << 46)
-#define VTD_ECAP_FLTS               (1ULL << 47)
+#define VTD_ECAP_SSTS               (1ULL << 46)
+#define VTD_ECAP_FSTS               (1ULL << 47)
 
 /* CAP_REG */
 /* (offset >> 4) << 24 */
@@ -211,7 +211,7 @@
 #define VTD_MAMV                    18ULL
 #define VTD_CAP_MAMV                (VTD_MAMV << 48)
 #define VTD_CAP_PSI                 (1ULL << 39)
-#define VTD_CAP_SLLPS               ((1ULL << 34) | (1ULL << 35))
+#define VTD_CAP_SSLPS               ((1ULL << 34) | (1ULL << 35))
 #define VTD_CAP_DRAIN_WRITE         (1ULL << 54)
 #define VTD_CAP_DRAIN_READ          (1ULL << 55)
 #define VTD_CAP_FS1GP               (1ULL << 56)
@@ -284,7 +284,7 @@ typedef enum VTDFaultReason {
     VTD_FR_ADDR_BEYOND_MGAW,    /* Input-address above (2^x-1) */
     VTD_FR_WRITE,               /* No write permission */
     VTD_FR_READ,                /* No read permission */
-    /* Fail to access a second-level paging entry (not SL_PML4E) */
+    /* Fail to access a second-stage paging entry (not SS_PML4E) */
     VTD_FR_PAGING_ENTRY_INV,
     VTD_FR_ROOT_TABLE_INV,      /* Fail to access a root-entry */
     VTD_FR_CONTEXT_TABLE_INV,   /* Fail to access a context-entry */
@@ -292,7 +292,8 @@ typedef enum VTDFaultReason {
     VTD_FR_ROOT_ENTRY_RSVD,
     /* Non-zero reserved field in a present context-entry */
     VTD_FR_CONTEXT_ENTRY_RSVD,
-    /* Non-zero reserved field in a second-level paging entry with at lease one
+    /*
+     * Non-zero reserved field in a second-stage paging entry with at lease one
      * Read(R) and Write(W) or Execute(E) field is Set.
      */
     VTD_FR_PAGING_ENTRY_RSVD,
@@ -329,7 +330,7 @@ typedef enum VTDFaultReason {
     VTD_FR_PASID_ENTRY_P = 0x59,
     VTD_FR_PASID_TABLE_ENTRY_INV = 0x5b,  /*Invalid PASID table entry */
 
-    /* Fail to access a first-level paging entry (not FS_PML4E) */
+    /* Fail to access a first-stage paging entry (not FS_PML4E) */
     VTD_FR_FS_PAGING_ENTRY_INV = 0x70,
     VTD_FR_FS_PAGING_ENTRY_P = 0x71,
     /* Non-zero reserved field in present first-stage paging entry */
@@ -473,23 +474,23 @@ typedef union VTDPRDesc VTDPRDesc;
 
 #define VTD_SPTE_PAGE_L1_RSVD_MASK(aw, stale_tm) \
         stale_tm ? \
-        (0x800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM | VTD_SL_TM)) : \
-        (0x800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
+        (0x800ULL | ~(VTD_HAW_MASK(aw) | VTD_SS_IGN_COM | VTD_SS_TM)) : \
+        (0x800ULL | ~(VTD_HAW_MASK(aw) | VTD_SS_IGN_COM))
 #define VTD_SPTE_PAGE_L2_RSVD_MASK(aw) \
-        (0x800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
+        (0x800ULL | ~(VTD_HAW_MASK(aw) | VTD_SS_IGN_COM))
 #define VTD_SPTE_PAGE_L3_RSVD_MASK(aw) \
-        (0x800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
+        (0x800ULL | ~(VTD_HAW_MASK(aw) | VTD_SS_IGN_COM))
 #define VTD_SPTE_PAGE_L4_RSVD_MASK(aw) \
-        (0x880ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
+        (0x880ULL | ~(VTD_HAW_MASK(aw) | VTD_SS_IGN_COM))
 
 #define VTD_SPTE_LPAGE_L2_RSVD_MASK(aw, stale_tm) \
         stale_tm ? \
-        (0x1ff800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM | VTD_SL_TM)) : \
-        (0x1ff800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
+        (0x1ff800ULL | ~(VTD_HAW_MASK(aw) | VTD_SS_IGN_COM | VTD_SS_TM)) : \
+        (0x1ff800ULL | ~(VTD_HAW_MASK(aw) | VTD_SS_IGN_COM))
 #define VTD_SPTE_LPAGE_L3_RSVD_MASK(aw, stale_tm) \
         stale_tm ? \
-        (0x3ffff800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM | VTD_SL_TM)) : \
-        (0x3ffff800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
+        (0x3ffff800ULL | ~(VTD_HAW_MASK(aw) | VTD_SS_IGN_COM | VTD_SS_TM)) : \
+        (0x3ffff800ULL | ~(VTD_HAW_MASK(aw) | VTD_SS_IGN_COM))
 
 /* Rsvd field masks for fpte */
 #define VTD_FS_UPPER_IGNORED 0xfff0000000000000ULL
@@ -596,8 +597,8 @@ typedef struct VTDRootEntry VTDRootEntry;
 #define VTD_CONTEXT_TT_MULTI_LEVEL  0
 #define VTD_CONTEXT_TT_DEV_IOTLB    (1ULL << 2)
 #define VTD_CONTEXT_TT_PASS_THROUGH (2ULL << 2)
-/* Second Level Page Translation Pointer*/
-#define VTD_CONTEXT_ENTRY_SLPTPTR   (~0xfffULL)
+/* Second Stage Page Translation Pointer */
+#define VTD_CONTEXT_ENTRY_SSPTPTR   (~0xfffULL)
 #define VTD_CONTEXT_ENTRY_RSVD_LO(aw) (0xff0ULL | ~VTD_HAW_MASK(aw))
 /* hi */
 #define VTD_CONTEXT_ENTRY_AW        7ULL /* Adjusted guest-address-width */
@@ -634,37 +635,37 @@ typedef struct VTDPASIDCacheInfo {
 /* PASID Granular Translation Type Mask */
 #define VTD_PASID_ENTRY_P              1ULL
 #define VTD_SM_PASID_ENTRY_PGTT        (7ULL << 6)
-#define VTD_SM_PASID_ENTRY_FLT         (1ULL << 6)
-#define VTD_SM_PASID_ENTRY_SLT         (2ULL << 6)
+#define VTD_SM_PASID_ENTRY_FST         (1ULL << 6)
+#define VTD_SM_PASID_ENTRY_SST         (2ULL << 6)
 #define VTD_SM_PASID_ENTRY_NESTED      (3ULL << 6)
 #define VTD_SM_PASID_ENTRY_PT          (4ULL << 6)
 
 #define VTD_SM_PASID_ENTRY_AW          7ULL /* Adjusted guest-address-width */
 #define VTD_SM_PASID_ENTRY_DID(val)    ((val) & VTD_DOMAIN_ID_MASK)
 
-#define VTD_SM_PASID_ENTRY_FLPM          3ULL
-#define VTD_SM_PASID_ENTRY_FLPTPTR       (~0xfffULL)
-
-/* First Level Paging Structure */
-/* Masks for First Level Paging Entry */
-#define VTD_FL_P                    1ULL
-#define VTD_FL_RW                   (1ULL << 1)
-#define VTD_FL_US                   (1ULL << 2)
-#define VTD_FL_A                    (1ULL << 5)
-#define VTD_FL_D                    (1ULL << 6)
-
-/* Second Level Page Translation Pointer*/
-#define VTD_SM_PASID_ENTRY_SLPTPTR     (~0xfffULL)
-
-/* Second Level Paging Structure */
-/* Masks for Second Level Paging Entry */
-#define VTD_SL_RW_MASK              3ULL
-#define VTD_SL_R                    1ULL
-#define VTD_SL_W                    (1ULL << 1)
-#define VTD_SL_IGN_COM              0xbff0000000000000ULL
-#define VTD_SL_TM                   (1ULL << 62)
-
-/* Common for both First Level and Second Level */
+#define VTD_SM_PASID_ENTRY_FSPM          3ULL
+#define VTD_SM_PASID_ENTRY_FSPTPTR       (~0xfffULL)
+
+/* First Stage Paging Structure */
+/* Masks for First Stage Paging Entry */
+#define VTD_FS_P                    1ULL
+#define VTD_FS_RW                   (1ULL << 1)
+#define VTD_FS_US                   (1ULL << 2)
+#define VTD_FS_A                    (1ULL << 5)
+#define VTD_FS_D                    (1ULL << 6)
+
+/* Second Stage Page Translation Pointer */
+#define VTD_SM_PASID_ENTRY_SSPTPTR     (~0xfffULL)
+
+/* Second Stage Paging Structure */
+/* Masks for Second Stage Paging Entry */
+#define VTD_SS_RW_MASK              3ULL
+#define VTD_SS_R                    1ULL
+#define VTD_SS_W                    (1ULL << 1)
+#define VTD_SS_IGN_COM              0xbff0000000000000ULL
+#define VTD_SS_TM                   (1ULL << 62)
+
+/* Common for both First Stage and Second Stage */
 #define VTD_PML4_LEVEL           4
 #define VTD_PDP_LEVEL            3
 #define VTD_PD_LEVEL             2
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 6e68734b3c..a84d6965a4 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -271,7 +271,7 @@ struct IntelIOMMUState {
 
     bool caching_mode;              /* RO - is cap CM enabled? */
     bool scalable_mode;             /* RO - is Scalable Mode supported? */
-    bool flts;                      /* RO - is stage-1 translation supported? */
+    bool fsts;                      /* RO - is first stage translation supported? */
     bool snoop_control;             /* RO - is SNP filed supported? */
 
     dma_addr_t root;                /* Current root table pointer */
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 06065d16b6..d6a4e21972 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -50,9 +50,9 @@
 
 /* pe operations */
 #define VTD_PE_GET_TYPE(pe) ((pe)->val[0] & VTD_SM_PASID_ENTRY_PGTT)
-#define VTD_PE_GET_FL_LEVEL(pe) \
-    (4 + (((pe)->val[2] >> 2) & VTD_SM_PASID_ENTRY_FLPM))
-#define VTD_PE_GET_SL_LEVEL(pe) \
+#define VTD_PE_GET_FS_LEVEL(pe) \
+    (4 + (((pe)->val[2] >> 2) & VTD_SM_PASID_ENTRY_FSPM))
+#define VTD_PE_GET_SS_LEVEL(pe) \
     (2 + (((pe)->val[0] >> 2) & VTD_SM_PASID_ENTRY_AW))
 
 /*
@@ -330,7 +330,7 @@ static gboolean vtd_hash_remove_by_page(gpointer key, gpointer value,
      * nested (PGTT=011b) mapping associated with specified domain-id are
      * invalidated. Nested isn't supported yet, so only need to check 001b.
      */
-    if (entry->pgtt == VTD_SM_PASID_ENTRY_FLT) {
+    if (entry->pgtt == VTD_SM_PASID_ENTRY_FST) {
         return true;
     }
 
@@ -351,7 +351,7 @@ static gboolean vtd_hash_remove_by_page_piotlb(gpointer key, gpointer value,
      * or pass-through (PGTT=100b) mappings. Nested isn't supported yet,
      * so only need to check first-stage (PGTT=001b) mappings.
      */
-    if (entry->pgtt != VTD_SM_PASID_ENTRY_FLT) {
+    if (entry->pgtt != VTD_SM_PASID_ENTRY_FST) {
         return false;
     }
 
@@ -759,9 +759,9 @@ static int vtd_get_context_entry_from_root(IntelIOMMUState *s,
     return 0;
 }
 
-static inline dma_addr_t vtd_ce_get_slpt_base(VTDContextEntry *ce)
+static inline dma_addr_t vtd_ce_get_sspt_base(VTDContextEntry *ce)
 {
-    return ce->lo & VTD_CONTEXT_ENTRY_SLPTPTR;
+    return ce->lo & VTD_CONTEXT_ENTRY_SSPTPTR;
 }
 
 static inline uint64_t vtd_get_pte_addr(uint64_t pte, uint8_t aw)
@@ -802,13 +802,13 @@ static inline uint32_t vtd_iova_level_offset(uint64_t iova, uint32_t level)
 }
 
 /* Check Capability Register to see if the @level of page-table is supported */
-static inline bool vtd_is_sl_level_supported(IntelIOMMUState *s, uint32_t level)
+static inline bool vtd_is_ss_level_supported(IntelIOMMUState *s, uint32_t level)
 {
     return VTD_CAP_SAGAW_MASK & s->cap &
            (1ULL << (level - 2 + VTD_CAP_SAGAW_SHIFT));
 }
 
-static inline bool vtd_is_fl_level_supported(IntelIOMMUState *s, uint32_t level)
+static inline bool vtd_is_fs_level_supported(IntelIOMMUState *s, uint32_t level)
 {
     return level == VTD_PML4_LEVEL;
 }
@@ -817,10 +817,10 @@ static inline bool vtd_is_fl_level_supported(IntelIOMMUState *s, uint32_t level)
 static inline bool vtd_pe_type_check(IntelIOMMUState *s, VTDPASIDEntry *pe)
 {
     switch (VTD_PE_GET_TYPE(pe)) {
-    case VTD_SM_PASID_ENTRY_FLT:
-        return !!(s->ecap & VTD_ECAP_FLTS);
-    case VTD_SM_PASID_ENTRY_SLT:
-        return !!(s->ecap & VTD_ECAP_SLTS);
+    case VTD_SM_PASID_ENTRY_FST:
+        return !!(s->ecap & VTD_ECAP_FSTS);
+    case VTD_SM_PASID_ENTRY_SST:
+        return !!(s->ecap & VTD_ECAP_SSTS);
     case VTD_SM_PASID_ENTRY_NESTED:
         /* Not support NESTED page table type yet */
         return false;
@@ -892,13 +892,13 @@ static int vtd_get_pe_in_pasid_leaf_table(IntelIOMMUState *s,
     }
 
     pgtt = VTD_PE_GET_TYPE(pe);
-    if (pgtt == VTD_SM_PASID_ENTRY_SLT &&
-        !vtd_is_sl_level_supported(s, VTD_PE_GET_SL_LEVEL(pe))) {
+    if (pgtt == VTD_SM_PASID_ENTRY_SST &&
+        !vtd_is_ss_level_supported(s, VTD_PE_GET_SS_LEVEL(pe))) {
             return -VTD_FR_PASID_TABLE_ENTRY_INV;
     }
 
-    if (pgtt == VTD_SM_PASID_ENTRY_FLT &&
-        !vtd_is_fl_level_supported(s, VTD_PE_GET_FL_LEVEL(pe))) {
+    if (pgtt == VTD_SM_PASID_ENTRY_FST &&
+        !vtd_is_fs_level_supported(s, VTD_PE_GET_FS_LEVEL(pe))) {
             return -VTD_FR_PASID_TABLE_ENTRY_INV;
     }
 
@@ -1019,7 +1019,8 @@ static int vtd_ce_get_pasid_fpd(IntelIOMMUState *s,
     return 0;
 }
 
-/* Get the page-table level that hardware should use for the second-level
+/*
+ * Get the page-table level that hardware should use for the second-stage
  * page-table walk from the Address Width field of context-entry.
  */
 static inline uint32_t vtd_ce_get_level(VTDContextEntry *ce)
@@ -1035,10 +1036,10 @@ static uint32_t vtd_get_iova_level(IntelIOMMUState *s,
 
     if (s->root_scalable) {
         vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
-        if (s->flts) {
-            return VTD_PE_GET_FL_LEVEL(&pe);
+        if (s->fsts) {
+            return VTD_PE_GET_FS_LEVEL(&pe);
         } else {
-            return VTD_PE_GET_SL_LEVEL(&pe);
+            return VTD_PE_GET_SS_LEVEL(&pe);
         }
     }
 
@@ -1107,7 +1108,7 @@ static inline uint64_t vtd_iova_limit(IntelIOMMUState *s,
 }
 
 /* Return true if IOVA passes range check, otherwise false. */
-static inline bool vtd_iova_sl_range_check(IntelIOMMUState *s,
+static inline bool vtd_iova_ss_range_check(IntelIOMMUState *s,
                                            uint64_t iova, VTDContextEntry *ce,
                                            uint8_t aw, uint32_t pasid)
 {
@@ -1126,14 +1127,14 @@ static dma_addr_t vtd_get_iova_pgtbl_base(IntelIOMMUState *s,
 
     if (s->root_scalable) {
         vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
-        if (s->flts) {
-            return pe.val[2] & VTD_SM_PASID_ENTRY_FLPTPTR;
+        if (s->fsts) {
+            return pe.val[2] & VTD_SM_PASID_ENTRY_FSPTPTR;
         } else {
-            return pe.val[0] & VTD_SM_PASID_ENTRY_SLPTPTR;
+            return pe.val[0] & VTD_SM_PASID_ENTRY_SSPTPTR;
         }
     }
 
-    return vtd_ce_get_slpt_base(ce);
+    return vtd_ce_get_sspt_base(ce);
 }
 
 /*
@@ -1148,13 +1149,13 @@ static dma_addr_t vtd_get_iova_pgtbl_base(IntelIOMMUState *s,
 static uint64_t vtd_spte_rsvd[VTD_SPTE_RSVD_LEN];
 static uint64_t vtd_spte_rsvd_large[VTD_SPTE_RSVD_LEN];
 
-static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
+static bool vtd_sspte_nonzero_rsvd(uint64_t sspte, uint32_t level)
 {
     uint64_t rsvd_mask;
 
     /*
      * We should have caught a guest-mis-programmed level earlier,
-     * via vtd_is_sl_level_supported.
+     * via vtd_is_ss_level_supported.
      */
     assert(level < VTD_SPTE_RSVD_LEN);
     /*
@@ -1164,46 +1165,47 @@ static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
     assert(level);
 
     if ((level == VTD_PD_LEVEL || level == VTD_PDP_LEVEL) &&
-        (slpte & VTD_PT_PAGE_SIZE_MASK)) {
+        (sspte & VTD_PT_PAGE_SIZE_MASK)) {
         /* large page */
         rsvd_mask = vtd_spte_rsvd_large[level];
     } else {
         rsvd_mask = vtd_spte_rsvd[level];
     }
 
-    return slpte & rsvd_mask;
+    return sspte & rsvd_mask;
 }
 
-/* Given the @iova, get relevant @slptep. @slpte_level will be the last level
+/*
+ * Given the @iova, get relevant @ssptep. @sspte_level will be the last level
  * of the translation, can be used for deciding the size of large page.
  */
-static int vtd_iova_to_slpte(IntelIOMMUState *s, VTDContextEntry *ce,
+static int vtd_iova_to_sspte(IntelIOMMUState *s, VTDContextEntry *ce,
                              uint64_t iova, bool is_write,
-                             uint64_t *slptep, uint32_t *slpte_level,
+                             uint64_t *ssptep, uint32_t *sspte_level,
                              bool *reads, bool *writes, uint8_t aw_bits,
                              uint32_t pasid)
 {
     dma_addr_t addr = vtd_get_iova_pgtbl_base(s, ce, pasid);
     uint32_t level = vtd_get_iova_level(s, ce, pasid);
     uint32_t offset;
-    uint64_t slpte;
+    uint64_t sspte;
     uint64_t access_right_check;
 
-    if (!vtd_iova_sl_range_check(s, iova, ce, aw_bits, pasid)) {
+    if (!vtd_iova_ss_range_check(s, iova, ce, aw_bits, pasid)) {
         error_report_once("%s: detected IOVA overflow (iova=0x%" PRIx64 ","
                           "pasid=0x%" PRIx32 ")", __func__, iova, pasid);
         return -VTD_FR_ADDR_BEYOND_MGAW;
     }
 
     /* FIXME: what is the Atomics request here? */
-    access_right_check = is_write ? VTD_SL_W : VTD_SL_R;
+    access_right_check = is_write ? VTD_SS_W : VTD_SS_R;
 
     while (true) {
         offset = vtd_iova_level_offset(iova, level);
-        slpte = vtd_get_pte(addr, offset);
+        sspte = vtd_get_pte(addr, offset);
 
-        if (slpte == (uint64_t)-1) {
-            error_report_once("%s: detected read error on DMAR slpte "
+        if (sspte == (uint64_t)-1) {
+            error_report_once("%s: detected read error on DMAR sspte "
                               "(iova=0x%" PRIx64 ", pasid=0x%" PRIx32 ")",
                               __func__, iova, pasid);
             if (level == vtd_get_iova_level(s, ce, pasid)) {
@@ -1213,30 +1215,30 @@ static int vtd_iova_to_slpte(IntelIOMMUState *s, VTDContextEntry *ce,
                 return -VTD_FR_PAGING_ENTRY_INV;
             }
         }
-        *reads = (*reads) && (slpte & VTD_SL_R);
-        *writes = (*writes) && (slpte & VTD_SL_W);
-        if (!(slpte & access_right_check)) {
-            error_report_once("%s: detected slpte permission error "
+        *reads = (*reads) && (sspte & VTD_SS_R);
+        *writes = (*writes) && (sspte & VTD_SS_W);
+        if (!(sspte & access_right_check)) {
+            error_report_once("%s: detected sspte permission error "
                               "(iova=0x%" PRIx64 ", level=0x%" PRIx32 ", "
-                              "slpte=0x%" PRIx64 ", write=%d, pasid=0x%"
+                              "sspte=0x%" PRIx64 ", write=%d, pasid=0x%"
                               PRIx32 ")", __func__, iova, level,
-                              slpte, is_write, pasid);
+                              sspte, is_write, pasid);
             return is_write ? -VTD_FR_WRITE : -VTD_FR_READ;
         }
-        if (vtd_slpte_nonzero_rsvd(slpte, level)) {
+        if (vtd_sspte_nonzero_rsvd(sspte, level)) {
             error_report_once("%s: detected splte reserve non-zero "
                               "iova=0x%" PRIx64 ", level=0x%" PRIx32
-                              "slpte=0x%" PRIx64 ", pasid=0x%" PRIX32 ")",
-                              __func__, iova, level, slpte, pasid);
+                              "sspte=0x%" PRIx64 ", pasid=0x%" PRIX32 ")",
+                              __func__, iova, level, sspte, pasid);
             return -VTD_FR_PAGING_ENTRY_RSVD;
         }
 
-        if (vtd_is_last_pte(slpte, level)) {
-            *slptep = slpte;
-            *slpte_level = level;
+        if (vtd_is_last_pte(sspte, level)) {
+            *ssptep = sspte;
+            *sspte_level = level;
             break;
         }
-        addr = vtd_get_pte_addr(slpte, aw_bits);
+        addr = vtd_get_pte_addr(sspte, aw_bits);
         level--;
     }
 
@@ -1362,7 +1364,7 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
 {
     bool read_cur, write_cur, entry_valid;
     uint32_t offset;
-    uint64_t slpte;
+    uint64_t sspte;
     uint64_t subpage_size, subpage_mask;
     IOMMUTLBEvent event;
     uint64_t iova = start;
@@ -1378,21 +1380,21 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
         iova_next = (iova & subpage_mask) + subpage_size;
 
         offset = vtd_iova_level_offset(iova, level);
-        slpte = vtd_get_pte(addr, offset);
+        sspte = vtd_get_pte(addr, offset);
 
-        if (slpte == (uint64_t)-1) {
+        if (sspte == (uint64_t)-1) {
             trace_vtd_page_walk_skip_read(iova, iova_next);
             goto next;
         }
 
-        if (vtd_slpte_nonzero_rsvd(slpte, level)) {
+        if (vtd_sspte_nonzero_rsvd(sspte, level)) {
             trace_vtd_page_walk_skip_reserve(iova, iova_next);
             goto next;
         }
 
         /* Permissions are stacked with parents' */
-        read_cur = read && (slpte & VTD_SL_R);
-        write_cur = write && (slpte & VTD_SL_W);
+        read_cur = read && (sspte & VTD_SS_R);
+        write_cur = write && (sspte & VTD_SS_W);
 
         /*
          * As long as we have either read/write permission, this is a
@@ -1401,12 +1403,12 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
          */
         entry_valid = read_cur | write_cur;
 
-        if (!vtd_is_last_pte(slpte, level) && entry_valid) {
+        if (!vtd_is_last_pte(sspte, level) && entry_valid) {
             /*
              * This is a valid PDE (or even bigger than PDE).  We need
              * to walk one further level.
              */
-            ret = vtd_page_walk_level(vtd_get_pte_addr(slpte, info->aw),
+            ret = vtd_page_walk_level(vtd_get_pte_addr(sspte, info->aw),
                                       iova, MIN(iova_next, end), level - 1,
                                       read_cur, write_cur, info);
         } else {
@@ -1423,7 +1425,7 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
             event.entry.perm = IOMMU_ACCESS_FLAG(read_cur, write_cur);
             event.entry.addr_mask = ~subpage_mask;
             /* NOTE: this is only meaningful if entry_valid == true */
-            event.entry.translated_addr = vtd_get_pte_addr(slpte, info->aw);
+            event.entry.translated_addr = vtd_get_pte_addr(sspte, info->aw);
             event.type = event.entry.perm ? IOMMU_NOTIFIER_MAP :
                                             IOMMU_NOTIFIER_UNMAP;
             ret = vtd_page_walk_one(&event, info);
@@ -1457,11 +1459,11 @@ static int vtd_page_walk(IntelIOMMUState *s, VTDContextEntry *ce,
     dma_addr_t addr = vtd_get_iova_pgtbl_base(s, ce, pasid);
     uint32_t level = vtd_get_iova_level(s, ce, pasid);
 
-    if (!vtd_iova_sl_range_check(s, start, ce, info->aw, pasid)) {
+    if (!vtd_iova_ss_range_check(s, start, ce, info->aw, pasid)) {
         return -VTD_FR_ADDR_BEYOND_MGAW;
     }
 
-    if (!vtd_iova_sl_range_check(s, end, ce, info->aw, pasid)) {
+    if (!vtd_iova_ss_range_check(s, end, ce, info->aw, pasid)) {
         /* Fix end so that it reaches the maximum */
         end = vtd_iova_limit(s, ce, info->aw, pasid);
     }
@@ -1574,7 +1576,7 @@ static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
 
     /* Check if the programming of context-entry is valid */
     if (!s->root_scalable &&
-        !vtd_is_sl_level_supported(s, vtd_ce_get_level(ce))) {
+        !vtd_is_ss_level_supported(s, vtd_ce_get_level(ce))) {
         error_report_once("%s: invalid context entry: hi=%"PRIx64
                           ", lo=%"PRIx64" (level %d not supported)",
                           __func__, ce->hi, ce->lo,
@@ -1681,10 +1683,9 @@ static int vtd_address_space_sync(VTDAddressSpace *vtd_as)
 }
 
 /*
- * Check if specific device is configured to bypass address
- * translation for DMA requests. In Scalable Mode, bypass
- * 1st-level translation or 2nd-level translation, it depends
- * on PGTT setting.
+ * Check if a specific device is configured to bypass address translation
+ * for DMA requests. In Scalable Mode, whether first stage or second
+ * stage translation is bypassed depends on the PGTT setting.
  */
 static bool vtd_dev_pt_enabled(IntelIOMMUState *s, VTDContextEntry *ce,
                                uint32_t pasid)
@@ -1922,13 +1923,13 @@ out:
 static uint64_t vtd_fpte_rsvd[VTD_FPTE_RSVD_LEN];
 static uint64_t vtd_fpte_rsvd_large[VTD_FPTE_RSVD_LEN];
 
-static bool vtd_flpte_nonzero_rsvd(uint64_t flpte, uint32_t level)
+static bool vtd_fspte_nonzero_rsvd(uint64_t fspte, uint32_t level)
 {
     uint64_t rsvd_mask;
 
     /*
      * We should have caught a guest-mis-programmed level earlier,
-     * via vtd_is_fl_level_supported.
+     * via vtd_is_fs_level_supported.
      */
     assert(level < VTD_FPTE_RSVD_LEN);
     /*
@@ -1938,23 +1939,23 @@ static bool vtd_flpte_nonzero_rsvd(uint64_t flpte, uint32_t level)
     assert(level);
 
     if ((level == VTD_PD_LEVEL || level == VTD_PDP_LEVEL) &&
-        (flpte & VTD_PT_PAGE_SIZE_MASK)) {
+        (fspte & VTD_PT_PAGE_SIZE_MASK)) {
         /* large page */
         rsvd_mask = vtd_fpte_rsvd_large[level];
     } else {
         rsvd_mask = vtd_fpte_rsvd[level];
     }
 
-    return flpte & rsvd_mask;
+    return fspte & rsvd_mask;
 }
 
-static inline bool vtd_flpte_present(uint64_t flpte)
+static inline bool vtd_fspte_present(uint64_t fspte)
 {
-    return !!(flpte & VTD_FL_P);
+    return !!(fspte & VTD_FS_P);
 }
 
 /* Return true if IOVA is canonical, otherwise false. */
-static bool vtd_iova_fl_check_canonical(IntelIOMMUState *s, uint64_t iova,
+static bool vtd_iova_fs_check_canonical(IntelIOMMUState *s, uint64_t iova,
                                         VTDContextEntry *ce, uint32_t pasid)
 {
     uint64_t iova_limit = vtd_iova_limit(s, ce, s->aw_bits, pasid);
@@ -1984,32 +1985,32 @@ static MemTxResult vtd_set_flag_in_pte(dma_addr_t base_addr, uint32_t index,
 }
 
 /*
- * Given the @iova, get relevant @flptep. @flpte_level will be the last level
+ * Given the @iova, get relevant @fsptep. @fspte_level will be the last level
  * of the translation, can be used for deciding the size of large page.
  */
-static int vtd_iova_to_flpte(IntelIOMMUState *s, VTDContextEntry *ce,
+static int vtd_iova_to_fspte(IntelIOMMUState *s, VTDContextEntry *ce,
                              uint64_t iova, bool is_write,
-                             uint64_t *flptep, uint32_t *flpte_level,
+                             uint64_t *fsptep, uint32_t *fspte_level,
                              bool *reads, bool *writes, uint8_t aw_bits,
                              uint32_t pasid)
 {
     dma_addr_t addr = vtd_get_iova_pgtbl_base(s, ce, pasid);
     uint32_t offset;
-    uint64_t flpte, flag_ad = VTD_FL_A;
-    *flpte_level = vtd_get_iova_level(s, ce, pasid);
+    uint64_t fspte, flag_ad = VTD_FS_A;
+    *fspte_level = vtd_get_iova_level(s, ce, pasid);
 
-    if (!vtd_iova_fl_check_canonical(s, iova, ce, pasid)) {
+    if (!vtd_iova_fs_check_canonical(s, iova, ce, pasid)) {
         error_report_once("%s: detected non canonical IOVA (iova=0x%" PRIx64 ","
                           "pasid=0x%" PRIx32 ")", __func__, iova, pasid);
         return -VTD_FR_FS_NON_CANONICAL;
     }
 
     while (true) {
-        offset = vtd_iova_level_offset(iova, *flpte_level);
-        flpte = vtd_get_pte(addr, offset);
+        offset = vtd_iova_level_offset(iova, *fspte_level);
+        fspte = vtd_get_pte(addr, offset);
 
-        if (flpte == (uint64_t)-1) {
-            if (*flpte_level == vtd_get_iova_level(s, ce, pasid)) {
+        if (fspte == (uint64_t)-1) {
+            if (*fspte_level == vtd_get_iova_level(s, ce, pasid)) {
                 /* Invalid programming of pasid-entry */
                 return -VTD_FR_PASID_ENTRY_FSPTPTR_INV;
             } else {
@@ -2017,47 +2018,47 @@ static int vtd_iova_to_flpte(IntelIOMMUState *s, VTDContextEntry *ce,
             }
         }
 
-        if (!vtd_flpte_present(flpte)) {
+        if (!vtd_fspte_present(fspte)) {
             *reads = false;
             *writes = false;
             return -VTD_FR_FS_PAGING_ENTRY_P;
         }
 
         /* No emulated device supports supervisor privilege request yet */
-        if (!(flpte & VTD_FL_US)) {
+        if (!(fspte & VTD_FS_US)) {
             *reads = false;
             *writes = false;
             return -VTD_FR_FS_PAGING_ENTRY_US;
         }
 
         *reads = true;
-        *writes = (*writes) && (flpte & VTD_FL_RW);
-        if (is_write && !(flpte & VTD_FL_RW)) {
+        *writes = (*writes) && (fspte & VTD_FS_RW);
+        if (is_write && !(fspte & VTD_FS_RW)) {
             return -VTD_FR_SM_WRITE;
         }
-        if (vtd_flpte_nonzero_rsvd(flpte, *flpte_level)) {
-            error_report_once("%s: detected flpte reserved non-zero "
+        if (vtd_fspte_nonzero_rsvd(fspte, *fspte_level)) {
+            error_report_once("%s: detected fspte reserved non-zero "
                               "iova=0x%" PRIx64 ", level=0x%" PRIx32
-                              "flpte=0x%" PRIx64 ", pasid=0x%" PRIX32 ")",
-                              __func__, iova, *flpte_level, flpte, pasid);
+                              "fspte=0x%" PRIx64 ", pasid=0x%" PRIX32 ")",
+                              __func__, iova, *fspte_level, fspte, pasid);
             return -VTD_FR_FS_PAGING_ENTRY_RSVD;
         }
 
-        if (vtd_is_last_pte(flpte, *flpte_level) && is_write) {
-            flag_ad |= VTD_FL_D;
+        if (vtd_is_last_pte(fspte, *fspte_level) && is_write) {
+            flag_ad |= VTD_FS_D;
         }
 
-        if (vtd_set_flag_in_pte(addr, offset, flpte, flag_ad) != MEMTX_OK) {
+        if (vtd_set_flag_in_pte(addr, offset, fspte, flag_ad) != MEMTX_OK) {
             return -VTD_FR_FS_BIT_UPDATE_FAILED;
         }
 
-        if (vtd_is_last_pte(flpte, *flpte_level)) {
-            *flptep = flpte;
+        if (vtd_is_last_pte(fspte, *fspte_level)) {
+            *fsptep = fspte;
             return 0;
         }
 
-        addr = vtd_get_pte_addr(flpte, aw_bits);
-        (*flpte_level)--;
+        addr = vtd_get_pte_addr(fspte, aw_bits);
+        (*fspte_level)--;
     }
 }
 
@@ -2195,14 +2196,14 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
         return true;
     }
 
-    if (s->flts && s->root_scalable) {
-        ret_fr = vtd_iova_to_flpte(s, &ce, addr, is_write, &pte, &level,
+    if (s->fsts && s->root_scalable) {
+        ret_fr = vtd_iova_to_fspte(s, &ce, addr, is_write, &pte, &level,
                                    &reads, &writes, s->aw_bits, pasid);
-        pgtt = VTD_SM_PASID_ENTRY_FLT;
+        pgtt = VTD_SM_PASID_ENTRY_FST;
     } else {
-        ret_fr = vtd_iova_to_slpte(s, &ce, addr, is_write, &pte, &level,
+        ret_fr = vtd_iova_to_sspte(s, &ce, addr, is_write, &pte, &level,
                                    &reads, &writes, s->aw_bits, pasid);
-        pgtt = VTD_SM_PASID_ENTRY_SLT;
+        pgtt = VTD_SM_PASID_ENTRY_SST;
     }
     if (!ret_fr) {
         xlat = vtd_get_pte_addr(pte, s->aw_bits);
@@ -2470,13 +2471,13 @@ static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
 
             if (vtd_as_has_map_notifier(vtd_as)) {
                 /*
-                 * When stage-1 translation is off, as long as we have MAP
+                 * When first stage translation is off, as long as we have MAP
                  * notifications registered in any of our IOMMU notifiers,
                  * we need to sync the shadow page table. Otherwise VFIO
                  * device attaches to nested page table instead of shadow
                  * page table, so no need to sync.
                  */
-                if (!s->flts || !s->root_scalable) {
+                if (!s->fsts || !s->root_scalable) {
                     vtd_sync_shadow_page_table_range(vtd_as, &ce, addr, size);
                 }
             } else {
@@ -2974,7 +2975,7 @@ static void vtd_piotlb_pasid_invalidate(IntelIOMMUState *s,
                 continue;
             }
 
-            if (!s->flts || !vtd_as_has_map_notifier(vtd_as)) {
+            if (!s->fsts || !vtd_as_has_map_notifier(vtd_as)) {
                 vtd_address_space_sync(vtd_as);
             }
         }
@@ -4069,7 +4070,7 @@ static const Property vtd_properties[] = {
                       VTD_HOST_ADDRESS_WIDTH),
     DEFINE_PROP_BOOL("caching-mode", IntelIOMMUState, caching_mode, FALSE),
     DEFINE_PROP_BOOL("x-scalable-mode", IntelIOMMUState, scalable_mode, FALSE),
-    DEFINE_PROP_BOOL("x-flts", IntelIOMMUState, flts, FALSE),
+    DEFINE_PROP_BOOL("x-flts", IntelIOMMUState, fsts, FALSE),
     DEFINE_PROP_BOOL("snoop-control", IntelIOMMUState, snoop_control, false),
     DEFINE_PROP_BOOL("x-pasid-mode", IntelIOMMUState, pasid, false),
     DEFINE_PROP_BOOL("dma-drain", IntelIOMMUState, dma_drain, true),
@@ -4594,12 +4595,13 @@ static bool vtd_check_hiod(IntelIOMMUState *s, HostIOMMUDevice *hiod,
         return false;
     }
 
-    if (!s->flts) {
-        /* All checks requested by VTD stage-2 translation pass */
+    if (!s->fsts) {
+        /* All checks requested by VTD second stage translation pass */
         return true;
     }
 
-    error_setg(errp, "host device is uncompatible with stage-1 translation");
+    error_setg(errp,
+               "host device is incompatible with first stage translation");
     return false;
 }
 
@@ -4832,7 +4834,7 @@ static void vtd_cap_init(IntelIOMMUState *s)
     X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
 
     s->cap = VTD_CAP_FRO | VTD_CAP_NFR | VTD_CAP_ND |
-             VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS |
+             VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SSLPS |
              VTD_CAP_ESRTPS | VTD_CAP_MGAW(s->aw_bits);
     if (s->dma_drain) {
         s->cap |= VTD_CAP_DRAIN;
@@ -4868,13 +4870,13 @@ static void vtd_cap_init(IntelIOMMUState *s)
     }
 
     /* TODO: read cap/ecap from host to decide which cap to be exposed. */
-    if (s->flts) {
-        s->ecap |= VTD_ECAP_SMTS | VTD_ECAP_FLTS;
+    if (s->fsts) {
+        s->ecap |= VTD_ECAP_SMTS | VTD_ECAP_FSTS;
         if (s->fs1gp) {
             s->cap |= VTD_CAP_FS1GP;
         }
     } else if (s->scalable_mode) {
-        s->ecap |= VTD_ECAP_SMTS | VTD_ECAP_SRS | VTD_ECAP_SLTS;
+        s->ecap |= VTD_ECAP_SMTS | VTD_ECAP_SRS | VTD_ECAP_SSTS;
     }
 
     if (s->snoop_control) {
@@ -5182,7 +5184,7 @@ static int vtd_pri_perform_implicit_invalidation(VTDAddressSpace *vtd_as,
     domain_id = VTD_SM_PASID_ENTRY_DID(pe.val[1]);
     ret = 0;
     switch (pgtt) {
-    case VTD_SM_PASID_ENTRY_FLT:
+    case VTD_SM_PASID_ENTRY_FST:
         vtd_piotlb_page_invalidate(s, domain_id, vtd_as->pasid, addr, 0);
         break;
     /* Room for other pgtt values */
@@ -5384,12 +5386,12 @@ static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
         }
     }
 
-    if (!s->scalable_mode && s->flts) {
+    if (!s->scalable_mode && s->fsts) {
         error_setg(errp, "x-flts is only available in scalable mode");
         return false;
     }
 
-    if (!s->flts && s->aw_bits != VTD_HOST_AW_39BIT &&
+    if (!s->fsts && s->aw_bits != VTD_HOST_AW_39BIT &&
         s->aw_bits != VTD_HOST_AW_48BIT) {
         error_setg(errp, "%s: supported values for aw-bits are: %d, %d",
                    s->scalable_mode ? "Scalable mode(flts=off)" : "Legacy mode",
@@ -5397,10 +5399,9 @@ static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
         return false;
     }
 
-    if (s->flts && s->aw_bits != VTD_HOST_AW_48BIT) {
-        error_setg(errp,
-                   "Scalable mode(flts=on): supported value for aw-bits is: %d",
-                   VTD_HOST_AW_48BIT);
+    if (s->fsts && s->aw_bits != VTD_HOST_AW_48BIT) {
+        error_setg(errp, "Scalable mode(x-flts=on): supported value for "
+                   "aw-bits is: %d", VTD_HOST_AW_48BIT);
         return false;
     }
 
diff --git a/tests/qtest/intel-iommu-test.c b/tests/qtest/intel-iommu-test.c
index c521b3796e..e5cc6acaf0 100644
--- a/tests/qtest/intel-iommu-test.c
+++ b/tests/qtest/intel-iommu-test.c
@@ -13,9 +13,9 @@
 #include "hw/i386/intel_iommu_internal.h"
 
 #define CAP_STAGE_1_FIXED1    (VTD_CAP_FRO | VTD_CAP_NFR | VTD_CAP_ND | \
-                              VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS)
+                              VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SSLPS)
 #define ECAP_STAGE_1_FIXED1   (VTD_ECAP_QI |  VTD_ECAP_IR | VTD_ECAP_IRO | \
-                              VTD_ECAP_MHMV | VTD_ECAP_SMTS | VTD_ECAP_FLTS)
+                              VTD_ECAP_MHMV | VTD_ECAP_SMTS | VTD_ECAP_FSTS)
 
 static inline uint64_t vtd_reg_readq(QTestState *s, uint64_t offset)
 {
-- 
2.47.1




* [PATCH v7 04/23] hw/pci: Export pci_device_get_iommu_bus_devfn() and return bool
  2025-10-24  8:43 [PATCH v7 00/23] intel_iommu: Enable first stage translation for passthrough device Zhenzhong Duan
                   ` (2 preceding siblings ...)
  2025-10-24  8:43 ` [PATCH v7 03/23] intel_iommu: Update terminology to match VTD spec Zhenzhong Duan
@ 2025-10-24  8:43 ` Zhenzhong Duan
  2025-10-24  8:43 ` [PATCH v7 05/23] hw/pci: Introduce pci_device_get_viommu_flags() Zhenzhong Duan
                   ` (18 subsequent siblings)
  22 siblings, 0 replies; 48+ messages in thread
From: Zhenzhong Duan @ 2025-10-24  8:43 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, skolothumtho, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan

Return true if the PCI device's requester ID is aliased, false otherwise.
This will be used in a following patch to determine whether a PCI device
sits under a PCI bridge.
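
The aliasing rule this helper reports can be sketched with a toy model (hypothetical `ToyBus` type and `rid_is_aliased()` helper, not QEMU's real PCIBus topology, which also handles PCIe-to-PCI bridges and root ports):

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy model of the bus walk done by pci_device_get_iommu_bus_devfn():
 * climb towards the IOMMU root bus; crossing a conventional (non-express)
 * PCI bridge on the way means requests carry the bridge's BDF instead of
 * the device's, so the device's requester ID is "aliased".
 */
typedef struct ToyBus {
    struct ToyBus *parent;      /* NULL for the IOMMU root bus */
    bool parent_is_pcie;        /* express bridges preserve the RID */
} ToyBus;

static bool rid_is_aliased(const ToyBus *bus)
{
    bool aliased = false;

    for (; bus && bus->parent; bus = bus->parent) {
        if (!bus->parent_is_pcie) {
            aliased = true;     /* crossed a conventional PCI bridge */
        }
    }
    return aliased;
}
```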

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
---
 include/hw/pci/pci.h |  2 ++
 hw/pci/pci.c         | 12 ++++++++----
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index 6bccb25ac2..bde9dca8e2 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -637,6 +637,8 @@ typedef struct PCIIOMMUOps {
                             bool is_write);
 } PCIIOMMUOps;
 
+bool pci_device_get_iommu_bus_devfn(PCIDevice *dev, PCIBus **piommu_bus,
+                                    PCIBus **aliased_bus, int *aliased_devfn);
 AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
 bool pci_device_set_iommu_device(PCIDevice *dev, HostIOMMUDevice *hiod,
                                  Error **errp);
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index acc03fd470..d0e81651aa 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -2858,20 +2858,21 @@ static void pci_device_class_base_init(ObjectClass *klass, const void *data)
  * For call sites which don't need aliased BDF, passing NULL to
  * aliased_[bus|devfn] is allowed.
  *
+ * Returns true if the PCI device RID is aliased, false otherwise.
+ *
  * @piommu_bus: return root #PCIBus backed by an IOMMU for the PCI device.
  *
  * @aliased_bus: return aliased #PCIBus of the PCI device, optional.
  *
  * @aliased_devfn: return aliased devfn of the PCI device, optional.
  */
-static void pci_device_get_iommu_bus_devfn(PCIDevice *dev,
-                                           PCIBus **piommu_bus,
-                                           PCIBus **aliased_bus,
-                                           int *aliased_devfn)
+bool pci_device_get_iommu_bus_devfn(PCIDevice *dev, PCIBus **piommu_bus,
+                                    PCIBus **aliased_bus, int *aliased_devfn)
 {
     PCIBus *bus = pci_get_bus(dev);
     PCIBus *iommu_bus = bus;
     int devfn = dev->devfn;
+    bool aliased = false;
 
     while (iommu_bus && !iommu_bus->iommu_ops && iommu_bus->parent_dev) {
         PCIBus *parent_bus = pci_get_bus(iommu_bus->parent_dev);
@@ -2908,6 +2909,7 @@ static void pci_device_get_iommu_bus_devfn(PCIDevice *dev,
                 devfn = parent->devfn;
                 bus = parent_bus;
             }
+            aliased = true;
         }
 
         /*
@@ -2942,6 +2944,8 @@ static void pci_device_get_iommu_bus_devfn(PCIDevice *dev,
     if (aliased_devfn) {
         *aliased_devfn = devfn;
     }
+
+    return aliased;
 }
 
 AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v7 05/23] hw/pci: Introduce pci_device_get_viommu_flags()
  2025-10-24  8:43 [PATCH v7 00/23] intel_iommu: Enable first stage translation for passthrough device Zhenzhong Duan
                   ` (3 preceding siblings ...)
  2025-10-24  8:43 ` [PATCH v7 04/23] hw/pci: Export pci_device_get_iommu_bus_devfn() and return bool Zhenzhong Duan
@ 2025-10-24  8:43 ` Zhenzhong Duan
  2025-10-24 17:18   ` Cédric Le Goater
  2025-10-24  8:43 ` [PATCH v7 06/23] intel_iommu: Implement get_viommu_flags() callback Zhenzhong Duan
                   ` (17 subsequent siblings)
  22 siblings, 1 reply; 48+ messages in thread
From: Zhenzhong Duan @ 2025-10-24  8:43 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, skolothumtho, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan

Introduce a new optional PCIIOMMUOps callback, get_viommu_flags(), which
allows retrieving flags exposed by a vIOMMU. The first planned vIOMMU
device flag is VIOMMU_FLAG_WANT_NESTING_PARENT, which advertises support
for the HW nested stage translation scheme and requests the cooperation
of other sub-systems like VFIO to create the nesting parent HWPT.

pci_device_get_viommu_flags() is a wrapper that can be called on a PCI
device potentially protected by a vIOMMU.

get_viommu_flags() is designed to return a 64bit bitmap of pure vIOMMU
flags which are determined only by the user's configuration, with no host
capabilities involved. The reasons are:

1. the host may have heterogeneous IOMMUs, each with different capabilities
2. this is migration friendly, as the return value is consistent between
   source and target.

Note that this op is invoked at the attach_device() stage, at which point
host IOMMU capabilities have not yet been forwarded to the vIOMMU through
the set_iommu_device() callback, which runs after attach_device().

See below sequence:

  vfio_device_attach():
      iommufd_cdev_attach():
          pci_device_get_viommu_flags() for HW nesting cap
          create a nesting parent HWPT
          attach device to the HWPT
          vfio_device_hiod_create_and_realize() creating hiod
  ...
  pci_device_set_iommu_device(hiod)
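
The flag contract can be sketched as follows (the enum value matches this patch; `toy_get_viommu_flags()` and `wants_nesting_parent()` are hypothetical stand-ins for the vIOMMU callback and its VFIO-side consumer):

```c
#include <stdbool.h>
#include <stdint.h>

#define BIT_ULL(nr) (1ULL << (nr))

/* flag definition as introduced in include/hw/iommu.h by this patch */
enum viommu_flags {
    VIOMMU_FLAG_WANT_NESTING_PARENT = BIT_ULL(0),
};

/*
 * Hypothetical stand-in for a vIOMMU's get_viommu_flags() callback: the
 * result depends only on the device configuration (here, whether first
 * stage translation is enabled), never on host capabilities, which keeps
 * it consistent across migration source and target.
 */
static uint64_t toy_get_viommu_flags(bool fsts_enabled)
{
    return fsts_enabled ? VIOMMU_FLAG_WANT_NESTING_PARENT : 0;
}

/* how a consumer such as VFIO would test the returned bitmap */
static bool wants_nesting_parent(uint64_t flags)
{
    return !!(flags & VIOMMU_FLAG_WANT_NESTING_PARENT);
}
```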

Suggested-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
---
 MAINTAINERS          |  1 +
 include/hw/iommu.h   | 25 +++++++++++++++++++++++++
 include/hw/pci/pci.h | 22 ++++++++++++++++++++++
 hw/pci/pci.c         | 11 +++++++++++
 4 files changed, 59 insertions(+)
 create mode 100644 include/hw/iommu.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 36eef27b41..d94fbcbdfb 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2338,6 +2338,7 @@ F: include/system/iommufd.h
 F: backends/host_iommu_device.c
 F: include/system/host_iommu_device.h
 F: include/qemu/chardev_open.h
+F: include/hw/iommu.h
 F: util/chardev_open.c
 F: docs/devel/vfio-iommufd.rst
 
diff --git a/include/hw/iommu.h b/include/hw/iommu.h
new file mode 100644
index 0000000000..9b8bb94fc2
--- /dev/null
+++ b/include/hw/iommu.h
@@ -0,0 +1,25 @@
+/*
+ * General vIOMMU flags
+ *
+ * Copyright (C) 2025 Intel Corporation.
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#ifndef HW_IOMMU_H
+#define HW_IOMMU_H
+
+#include "qemu/bitops.h"
+
+/*
+ * Theoretical vIOMMU flags, determined only by the vIOMMU device properties
+ * and independent of the actual host IOMMU capabilities they may depend on.
+ * Each flag can be an expectation of or a request to another sub-system, or
+ * just a pure vIOMMU capability. A vIOMMU can choose which flags to expose.
+ */
+enum viommu_flags {
+    /* vIOMMU needs nesting parent HWPT to create nested HWPT */
+    VIOMMU_FLAG_WANT_NESTING_PARENT = BIT_ULL(0),
+};
+
+#endif /* HW_IOMMU_H */
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index bde9dca8e2..cf99b5bb68 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -462,6 +462,18 @@ typedef struct PCIIOMMUOps {
      * @devfn: device and function number of the PCI device.
      */
     void (*unset_iommu_device)(PCIBus *bus, void *opaque, int devfn);
+    /**
+     * @get_viommu_flags: get vIOMMU flags
+     *
+     * Optional callback; if not implemented, the vIOMMU doesn't support
+     * exposing flags to other sub-systems, e.g., VFIO.
+     *
+     * @opaque: the data passed to pci_setup_iommu().
+     *
+     * Returns: a bitmap with each bit representing a vIOMMU flag defined
+     * in enum viommu_flags.
+     */
+    uint64_t (*get_viommu_flags)(void *opaque);
     /**
      * @get_iotlb_info: get properties required to initialize a device IOTLB.
      *
@@ -644,6 +656,16 @@ bool pci_device_set_iommu_device(PCIDevice *dev, HostIOMMUDevice *hiod,
                                  Error **errp);
 void pci_device_unset_iommu_device(PCIDevice *dev);
 
+/**
+ * pci_device_get_viommu_flags: get vIOMMU flags.
+ *
+ * Returns: a bitmap with each bit representing a vIOMMU flag defined in
+ * enum viommu_flags, or 0 if the vIOMMU doesn't report any.
+ *
+ * @dev: PCI device pointer.
+ */
+uint64_t pci_device_get_viommu_flags(PCIDevice *dev);
+
 /**
  * pci_iommu_get_iotlb_info: get properties required to initialize a
  * device IOTLB.
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index d0e81651aa..c9932c87e3 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -3010,6 +3010,17 @@ void pci_device_unset_iommu_device(PCIDevice *dev)
     }
 }
 
+uint64_t pci_device_get_viommu_flags(PCIDevice *dev)
+{
+    PCIBus *iommu_bus;
+
+    pci_device_get_iommu_bus_devfn(dev, &iommu_bus, NULL, NULL);
+    if (iommu_bus && iommu_bus->iommu_ops->get_viommu_flags) {
+        return iommu_bus->iommu_ops->get_viommu_flags(iommu_bus->iommu_opaque);
+    }
+    return 0;
+}
+
 int pci_pri_request_page(PCIDevice *dev, uint32_t pasid, bool priv_req,
                          bool exec_req, hwaddr addr, bool lpig,
                          uint16_t prgi, bool is_read, bool is_write)
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v7 06/23] intel_iommu: Implement get_viommu_flags() callback
  2025-10-24  8:43 [PATCH v7 00/23] intel_iommu: Enable first stage translation for passthrough device Zhenzhong Duan
                   ` (4 preceding siblings ...)
  2025-10-24  8:43 ` [PATCH v7 05/23] hw/pci: Introduce pci_device_get_viommu_flags() Zhenzhong Duan
@ 2025-10-24  8:43 ` Zhenzhong Duan
  2025-10-24  8:43 ` [PATCH v7 07/23] intel_iommu: Introduce a new structure VTDHostIOMMUDevice Zhenzhong Duan
                   ` (16 subsequent siblings)
  22 siblings, 0 replies; 48+ messages in thread
From: Zhenzhong Duan @ 2025-10-24  8:43 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, skolothumtho, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan

Implement the get_viommu_flags() callback, exposing for now a request for
a nesting parent HWPT.

VFIO uses it to create the nesting parent HWPT, which is further used to
create the nested HWPT in the vIOMMU. All of this will be implemented in
the following patches.

Suggested-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
---
 hw/i386/intel_iommu.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index d6a4e21972..db5be065bf 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -24,6 +24,7 @@
 #include "qemu/main-loop.h"
 #include "qapi/error.h"
 #include "hw/sysbus.h"
+#include "hw/iommu.h"
 #include "intel_iommu_internal.h"
 #include "hw/pci/pci.h"
 #include "hw/pci/pci_bus.h"
@@ -4704,6 +4705,16 @@ static void vtd_address_space_unmap_in_migration(VTDAddressSpace *as,
     }
 }
 
+static uint64_t vtd_get_viommu_flags(void *opaque)
+{
+    IntelIOMMUState *s = opaque;
+    uint64_t flags;
+
+    flags = s->fsts ? VIOMMU_FLAG_WANT_NESTING_PARENT : 0;
+
+    return flags;
+}
+
 /* Unmap the whole range in the notifier's scope. */
 static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
 {
@@ -5362,6 +5373,7 @@ static PCIIOMMUOps vtd_iommu_ops = {
     .pri_register_notifier = vtd_pri_register_notifier,
     .pri_unregister_notifier = vtd_pri_unregister_notifier,
     .pri_request_page = vtd_pri_request_page,
+    .get_viommu_flags = vtd_get_viommu_flags,
 };
 
 static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v7 07/23] intel_iommu: Introduce a new structure VTDHostIOMMUDevice
  2025-10-24  8:43 [PATCH v7 00/23] intel_iommu: Enable first stage translation for passthrough device Zhenzhong Duan
                   ` (5 preceding siblings ...)
  2025-10-24  8:43 ` [PATCH v7 06/23] intel_iommu: Implement get_viommu_flags() callback Zhenzhong Duan
@ 2025-10-24  8:43 ` Zhenzhong Duan
  2025-10-24  8:43 ` [PATCH v7 08/23] vfio/iommufd: Force creating nesting parent HWPT Zhenzhong Duan
                   ` (15 subsequent siblings)
  22 siblings, 0 replies; 48+ messages in thread
From: Zhenzhong Duan @ 2025-10-24  8:43 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, skolothumtho, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan

Introduce a new structure, VTDHostIOMMUDevice, which replaces
HostIOMMUDevice as the value stored in the hash table.

It includes a reference to the HostIOMMUDevice and the IntelIOMMUState,
as well as BDF information, which will be used in future patches.
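
The ownership rule this introduces can be modeled with toy types (hypothetical, standing in for QOM's object_ref()/object_unref() and the GLib destroy notify): the hash-table value is now a wrapper that holds one reference on the wrapped device, and destroying the value must drop that reference and free the wrapper.

```c
#include <stdlib.h>

/* toy stand-in for a refcounted HostIOMMUDevice */
typedef struct ToyHiod {
    int refcount;
} ToyHiod;

/* toy stand-in for VTDHostIOMMUDevice */
typedef struct ToyVtdHiod {
    ToyHiod *hiod;
} ToyVtdHiod;

static ToyVtdHiod *toy_vtd_hiod_new(ToyHiod *hiod)
{
    ToyVtdHiod *v = calloc(1, sizeof(*v));

    hiod->refcount++;           /* object_ref() in the real code */
    v->hiod = hiod;
    return v;
}

/* mirrors vtd_hiod_destroy(): unref the wrapped device, free the wrapper */
static void toy_vtd_hiod_destroy(ToyVtdHiod *v)
{
    v->hiod->refcount--;        /* object_unref() in the real code */
    free(v);
}
```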

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
---
 hw/i386/intel_iommu_internal.h |  7 +++++++
 include/hw/i386/intel_iommu.h  |  2 +-
 hw/i386/intel_iommu.c          | 15 +++++++++++++--
 3 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index ba0f1f5096..09edba81e2 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -28,6 +28,7 @@
 #ifndef HW_I386_INTEL_IOMMU_INTERNAL_H
 #define HW_I386_INTEL_IOMMU_INTERNAL_H
 #include "hw/i386/intel_iommu.h"
+#include "system/host_iommu_device.h"
 
 /*
  * Intel IOMMU register specification
@@ -677,4 +678,10 @@ typedef struct VTDPASIDCacheInfo {
 /* Bits to decide the offset for each level */
 #define VTD_LEVEL_BITS           9
 
+typedef struct VTDHostIOMMUDevice {
+    IntelIOMMUState *iommu_state;
+    PCIBus *bus;
+    uint8_t devfn;
+    HostIOMMUDevice *hiod;
+} VTDHostIOMMUDevice;
 #endif
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index a84d6965a4..3758ac239c 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -302,7 +302,7 @@ struct IntelIOMMUState {
     /* list of registered notifiers */
     QLIST_HEAD(, VTDAddressSpace) vtd_as_with_notifiers;
 
-    GHashTable *vtd_host_iommu_dev;             /* HostIOMMUDevice */
+    GHashTable *vtd_host_iommu_dev;             /* VTDHostIOMMUDevice */
 
     /* interrupt remapping */
     bool intr_enabled;              /* Whether guest enabled IR */
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index db5be065bf..4c83578c54 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -291,7 +291,10 @@ static gboolean vtd_hiod_equal(gconstpointer v1, gconstpointer v2)
 
 static void vtd_hiod_destroy(gpointer v)
 {
-    object_unref(v);
+    VTDHostIOMMUDevice *vtd_hiod = v;
+
+    object_unref(vtd_hiod->hiod);
+    g_free(vtd_hiod);
 }
 
 static gboolean vtd_hash_remove_by_domain(gpointer key, gpointer value,
@@ -4610,6 +4613,7 @@ static bool vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
                                      HostIOMMUDevice *hiod, Error **errp)
 {
     IntelIOMMUState *s = opaque;
+    VTDHostIOMMUDevice *vtd_hiod;
     struct vtd_as_key key = {
         .bus = bus,
         .devfn = devfn,
@@ -4632,7 +4636,14 @@ static bool vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
         return false;
     }
 
+    vtd_hiod = g_malloc0(sizeof(VTDHostIOMMUDevice));
+    vtd_hiod->bus = bus;
+    vtd_hiod->devfn = (uint8_t)devfn;
+    vtd_hiod->iommu_state = s;
+    vtd_hiod->hiod = hiod;
+
     if (!vtd_check_hiod(s, hiod, errp)) {
+        g_free(vtd_hiod);
         vtd_iommu_unlock(s);
         return false;
     }
@@ -4642,7 +4653,7 @@ static bool vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
     new_key->devfn = devfn;
 
     object_ref(hiod);
-    g_hash_table_insert(s->vtd_host_iommu_dev, new_key, hiod);
+    g_hash_table_insert(s->vtd_host_iommu_dev, new_key, vtd_hiod);
 
     vtd_iommu_unlock(s);
 
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v7 08/23] vfio/iommufd: Force creating nesting parent HWPT
  2025-10-24  8:43 [PATCH v7 00/23] intel_iommu: Enable first stage translation for passthrough device Zhenzhong Duan
                   ` (6 preceding siblings ...)
  2025-10-24  8:43 ` [PATCH v7 07/23] intel_iommu: Introduce a new structure VTDHostIOMMUDevice Zhenzhong Duan
@ 2025-10-24  8:43 ` Zhenzhong Duan
  2025-10-24 16:23   ` Cédric Le Goater
  2025-10-24  8:43 ` [PATCH v7 09/23] intel_iommu: Stick to system MR for IOMMUFD backed host device when x-flts=on Zhenzhong Duan
                   ` (14 subsequent siblings)
  22 siblings, 1 reply; 48+ messages in thread
From: Zhenzhong Duan @ 2025-10-24  8:43 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, skolothumtho, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan

Call pci_device_get_viommu_flags() to check whether the vIOMMU advertises
VIOMMU_FLAG_WANT_NESTING_PARENT.

If it does, create a nesting parent HWPT and add it to the container's
hwpt_list, letting this parent HWPT cover the entire second stage mappings
(GPA=>HPA).

This allows a VFIO passthrough device to attach directly to this default
HWPT and then use the system address space and its listener.

Introduce a vfio_device_get_viommu_flags_want_nesting() helper to facilitate
this implementation.

It is safe to do so because a vIOMMU is still able to fail the
set_iommu_device() call if something else related to the VFIO device or
vIOMMU isn't compatible.
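
The flag computation this patch adds can be sketched as below (a hedged sketch of the logic in iommufd_cdev_autodomains_get(); the two flag values are mirrored from the Linux iommufd uapi and `toy_hwpt_alloc_flags()` is a hypothetical helper):

```c
#include <stdbool.h>
#include <stdint.h>

/* values as in the Linux uapi enum iommufd_hwpt_alloc_flags (assumption) */
#define IOMMU_HWPT_ALLOC_NEST_PARENT    (1U << 0)
#define IOMMU_HWPT_ALLOC_DIRTY_TRACKING (1U << 1)

/*
 * Dirty tracking is requested when the container supports it, and the
 * nesting parent flag is forced whenever the vIOMMU asks for it; if the
 * host IOMMU cannot provide a nesting parent, the HWPT allocation ioctl
 * simply fails.
 */
static uint32_t toy_hwpt_alloc_flags(bool supports_dirty_tracking,
                                     bool viommu_wants_nesting)
{
    uint32_t flags = 0;

    if (supports_dirty_tracking) {
        flags |= IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
    }
    if (viommu_wants_nesting) {
        flags |= IOMMU_HWPT_ALLOC_NEST_PARENT;
    }
    return flags;
}
```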

Suggested-by: Nicolin Chen <nicolinc@nvidia.com>
Suggested-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
---
 include/hw/vfio/vfio-device.h |  2 ++
 hw/vfio/device.c              | 12 ++++++++++++
 hw/vfio/iommufd.c             |  9 +++++++++
 3 files changed, 23 insertions(+)

diff --git a/include/hw/vfio/vfio-device.h b/include/hw/vfio/vfio-device.h
index a0b8fc2eb6..48d00c7bc4 100644
--- a/include/hw/vfio/vfio-device.h
+++ b/include/hw/vfio/vfio-device.h
@@ -267,6 +267,8 @@ void vfio_device_prepare(VFIODevice *vbasedev, VFIOContainer *bcontainer,
 
 void vfio_device_unprepare(VFIODevice *vbasedev);
 
+bool vfio_device_get_viommu_flags_want_nesting(VFIODevice *vbasedev);
+
 int vfio_device_get_region_info(VFIODevice *vbasedev, int index,
                                 struct vfio_region_info **info);
 int vfio_device_get_region_info_type(VFIODevice *vbasedev, uint32_t type,
diff --git a/hw/vfio/device.c b/hw/vfio/device.c
index 5ed3103e72..be94947623 100644
--- a/hw/vfio/device.c
+++ b/hw/vfio/device.c
@@ -23,6 +23,7 @@
 
 #include "hw/vfio/vfio-device.h"
 #include "hw/vfio/pci.h"
+#include "hw/iommu.h"
 #include "hw/hw.h"
 #include "trace.h"
 #include "qapi/error.h"
@@ -521,6 +522,17 @@ void vfio_device_unprepare(VFIODevice *vbasedev)
     vbasedev->bcontainer = NULL;
 }
 
+bool vfio_device_get_viommu_flags_want_nesting(VFIODevice *vbasedev)
+{
+    VFIOPCIDevice *vdev = vfio_pci_from_vfio_device(vbasedev);
+
+    if (vdev) {
+        return !!(pci_device_get_viommu_flags(&vdev->parent_obj) &
+                  VIOMMU_FLAG_WANT_NESTING_PARENT);
+    }
+    return false;
+}
+
 /*
  * Traditional ioctl() based io
  */
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index 8de765c769..f9d0926274 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -404,6 +404,15 @@ static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
         flags = IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
     }
 
+    /*
+     * If the vIOMMU requests VFIO's cooperation to create a nesting parent
+     * HWPT, force its creation so that it can be reused by the vIOMMU to
+     * create the nested HWPT.
+     */
+    if (vfio_device_get_viommu_flags_want_nesting(vbasedev)) {
+        flags |= IOMMU_HWPT_ALLOC_NEST_PARENT;
+    }
+
     if (cpr_is_incoming()) {
         hwpt_id = vbasedev->cpr.hwpt_id;
         goto skip_alloc;
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v7 09/23] intel_iommu: Stick to system MR for IOMMUFD backed host device when x-flts=on
  2025-10-24  8:43 [PATCH v7 00/23] intel_iommu: Enable first stage translation for passthrough device Zhenzhong Duan
                   ` (7 preceding siblings ...)
  2025-10-24  8:43 ` [PATCH v7 08/23] vfio/iommufd: Force creating nesting parent HWPT Zhenzhong Duan
@ 2025-10-24  8:43 ` Zhenzhong Duan
  2025-10-31  8:09   ` Eric Auger
  2025-10-24  8:43 ` [PATCH v7 10/23] intel_iommu: Check for compatibility with IOMMUFD backed " Zhenzhong Duan
                   ` (13 subsequent siblings)
  22 siblings, 1 reply; 48+ messages in thread
From: Zhenzhong Duan @ 2025-10-24  8:43 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, skolothumtho, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan

When the guest enables scalable mode and sets up a first stage page table,
we don't want to use the IOMMU MR but rather continue using the system MR
for an IOMMUFD backed host device.

The default HWPT in VFIO then contains GPA->HPA mappings, which can be
reused as the nesting parent HWPT to construct the nested HWPT in the vIOMMU.
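
The decision this adds can be condensed into a small predicate (a hedged sketch of the logic in vtd_switch_address_space(); `toy_use_iommu_mr()` and its boolean parameters are hypothetical stand-ins for the real state fields):

```c
#include <stdbool.h>

/*
 * Normally the IOMMU MR is used whenever DMAR is enabled and the device
 * is not in passthrough mode. With first stage translation, an IOMMUFD
 * backed device instead stays on the system MR so that VFIO's default
 * HWPT (GPA->HPA) can serve directly (PGTT=PT) or as nesting parent
 * (PGTT=FST).
 */
static bool toy_use_iommu_mr(bool dmar_enabled, bool pt_enabled,
                             bool root_scalable, bool fsts,
                             bool iommufd_backed)
{
    bool use_iommu = dmar_enabled && !pt_enabled;

    if (root_scalable && fsts && iommufd_backed) {
        use_iommu = false;      /* stick to the system MR */
    }
    return use_iommu;
}
```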

Suggested-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
---
 hw/i386/intel_iommu.c | 36 ++++++++++++++++++++++++++++++++++--
 1 file changed, 34 insertions(+), 2 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 4c83578c54..ce4c54165e 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -41,6 +41,7 @@
 #include "migration/misc.h"
 #include "migration/vmstate.h"
 #include "trace.h"
+#include "system/iommufd.h"
 
 /* context entry operations */
 #define PASID_0    0
@@ -1713,6 +1714,24 @@ static bool vtd_dev_pt_enabled(IntelIOMMUState *s, VTDContextEntry *ce,
 
 }
 
+static VTDHostIOMMUDevice *vtd_find_hiod_iommufd(VTDAddressSpace *as)
+{
+    IntelIOMMUState *s = as->iommu_state;
+    struct vtd_as_key key = {
+        .bus = as->bus,
+        .devfn = as->devfn,
+    };
+    VTDHostIOMMUDevice *vtd_hiod = g_hash_table_lookup(s->vtd_host_iommu_dev,
+                                                       &key);
+
+    if (vtd_hiod && vtd_hiod->hiod &&
+        object_dynamic_cast(OBJECT(vtd_hiod->hiod),
+                            TYPE_HOST_IOMMU_DEVICE_IOMMUFD)) {
+        return vtd_hiod;
+    }
+    return NULL;
+}
+
 static bool vtd_as_pt_enabled(VTDAddressSpace *as)
 {
     IntelIOMMUState *s;
@@ -1738,12 +1757,25 @@ static bool vtd_as_pt_enabled(VTDAddressSpace *as)
 /* Return whether the device is using IOMMU translation. */
 static bool vtd_switch_address_space(VTDAddressSpace *as)
 {
+    IntelIOMMUState *s;
     bool use_iommu, pt;
 
     assert(as);
 
-    use_iommu = as->iommu_state->dmar_enabled && !vtd_as_pt_enabled(as);
-    pt = as->iommu_state->dmar_enabled && vtd_as_pt_enabled(as);
+    s = as->iommu_state;
+    use_iommu = s->dmar_enabled && !vtd_as_pt_enabled(as);
+    pt = s->dmar_enabled && vtd_as_pt_enabled(as);
+
+    /*
+     * When the guest enables scalable mode and sets up a first stage page
+     * table, we stick to the system MR for an IOMMUFD backed host device.
+     * Its default hwpt then contains GPA->HPA mappings which are used
+     * directly if PGTT=PT and used as nesting parent if PGTT=FST. Otherwise
+     * fall back to the original processing.
+     */
+    if (s->root_scalable && s->fsts && vtd_find_hiod_iommufd(as)) {
+        use_iommu = false;
+    }
 
     trace_vtd_switch_address_space(pci_bus_num(as->bus),
                                    VTD_PCI_SLOT(as->devfn),
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v7 10/23] intel_iommu: Check for compatibility with IOMMUFD backed device when x-flts=on
  2025-10-24  8:43 [PATCH v7 00/23] intel_iommu: Enable first stage translation for passthrough device Zhenzhong Duan
                   ` (8 preceding siblings ...)
  2025-10-24  8:43 ` [PATCH v7 09/23] intel_iommu: Stick to system MR for IOMMUFD backed host device when x-flts=on Zhenzhong Duan
@ 2025-10-24  8:43 ` Zhenzhong Duan
  2025-10-24 17:29   ` Cédric Le Goater
  2025-10-24  8:43 ` [PATCH v7 11/23] intel_iommu: Fail passthrough device under PCI bridge if x-flts=on Zhenzhong Duan
                   ` (12 subsequent siblings)
  22 siblings, 1 reply; 48+ messages in thread
From: Zhenzhong Duan @ 2025-10-24  8:43 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, skolothumtho, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan

When the vIOMMU is configured with x-flts=on in scalable mode, the first
stage page table is passed to the host to construct the nested page table
for passthrough devices.

We need to check the compatibility of some critical IOMMU capabilities
between the vIOMMU and the host IOMMU to ensure the guest first stage page
table is usable by the host.

For instance, if the vIOMMU supports first stage 1GB large page mapping but
the host does not, then this IOMMUFD backed device should fail.

Even if the checks pass, for now we willingly reject the association because
not all the bits are there yet; this will be relaxed at the end of this series.

Note the vIOMMU has exposed the IOMMU_HWPT_ALLOC_NEST_PARENT flag to force
the VFIO core to create a nesting parent HWPT; if the host doesn't support
nested translation, that creation will fail. So there is no need to check
the nested capability here.
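
The shape of such a capability check can be sketched as follows (a hedged sketch of one vtd_check_hiod() rule; the FS1GP bit position follows my reading of the VT-d spec's CAP_REG and should be treated as an assumption, as should the `toy_fs1gp_compatible()` helper):

```c
#include <stdbool.h>
#include <stdint.h>

/* first stage 1GB page support, CAP_REG bit 56 per the VT-d spec (assumed) */
#define VTD_CAP_FS1GP (1ULL << 56)

/*
 * A guest-visible feature may only stay enabled if the host IOMMU
 * reports the matching capability; otherwise the device attach fails.
 */
static bool toy_fs1gp_compatible(bool guest_fs1gp, uint64_t host_cap_reg)
{
    return !guest_fs1gp || (host_cap_reg & VTD_CAP_FS1GP);
}
```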

Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
---
 hw/i386/intel_iommu.c | 25 ++++++++++++++++++++++++-
 1 file changed, 24 insertions(+), 1 deletion(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index ce4c54165e..7d908cdb58 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -4636,8 +4636,31 @@ static bool vtd_check_hiod(IntelIOMMUState *s, HostIOMMUDevice *hiod,
         return true;
     }
 
+#ifdef CONFIG_IOMMUFD
+    struct HostIOMMUDeviceCaps *caps = &hiod->caps;
+    struct iommu_hw_info_vtd *vtd = &caps->vendor_caps.vtd;
+
+    /* Remaining checks are all first stage translation specific */
+    if (!object_dynamic_cast(OBJECT(hiod), TYPE_HOST_IOMMU_DEVICE_IOMMUFD)) {
+        error_setg(errp, "Need IOMMUFD backend when x-flts=on");
+        return false;
+    }
+
+    if (caps->type != IOMMU_HW_INFO_TYPE_INTEL_VTD) {
+        error_setg(errp, "Incompatible host platform IOMMU type %d",
+                   caps->type);
+        return false;
+    }
+
+    if (s->fs1gp && !(vtd->cap_reg & VTD_CAP_FS1GP)) {
+        error_setg(errp,
+                   "First stage 1GB large page is unsupported by host IOMMU");
+        return false;
+    }
+#endif
+
     error_setg(errp,
-               "host device is uncompatible with first stage translation");
+               "host IOMMU is incompatible with guest first stage translation");
     return false;
 }
 
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v7 11/23] intel_iommu: Fail passthrough device under PCI bridge if x-flts=on
  2025-10-24  8:43 [PATCH v7 00/23] intel_iommu: Enable first stage translation for passthrough device Zhenzhong Duan
                   ` (9 preceding siblings ...)
  2025-10-24  8:43 ` [PATCH v7 10/23] intel_iommu: Check for compatibility with IOMMUFD backed " Zhenzhong Duan
@ 2025-10-24  8:43 ` Zhenzhong Duan
  2025-10-24  8:43 ` [PATCH v7 12/23] intel_iommu: Add some macros and inline functions Zhenzhong Duan
                   ` (11 subsequent siblings)
  22 siblings, 0 replies; 48+ messages in thread
From: Zhenzhong Duan @ 2025-10-24  8:43 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, skolothumtho, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan

Currently we don't support nested translation for a passthrough device that
shares a PCI bridge with an emulated device, because they require different
address spaces when x-flts=on.

In theory, we could support the case where the devices under the same PCI
bridge are all passthrough devices. But an emulated device can be hotplugged
under the same bridge. To simplify, just forbid passthrough devices under a
PCI bridge, no matter whether there are, or will be, emulated devices under
the same bridge. This is acceptable because PCIe bridges are much more
common than PCI bridges nowadays.

Suggested-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
---
 hw/i386/intel_iommu.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 7d908cdb58..56abbb991d 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -4610,9 +4610,10 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus,
     return vtd_dev_as;
 }
 
-static bool vtd_check_hiod(IntelIOMMUState *s, HostIOMMUDevice *hiod,
+static bool vtd_check_hiod(IntelIOMMUState *s, VTDHostIOMMUDevice *vtd_hiod,
                            Error **errp)
 {
+    HostIOMMUDevice *hiod = vtd_hiod->hiod;
     HostIOMMUDeviceClass *hiodc = HOST_IOMMU_DEVICE_GET_CLASS(hiod);
     int ret;
 
@@ -4639,6 +4640,8 @@ static bool vtd_check_hiod(IntelIOMMUState *s, HostIOMMUDevice *hiod,
 #ifdef CONFIG_IOMMUFD
     struct HostIOMMUDeviceCaps *caps = &hiod->caps;
     struct iommu_hw_info_vtd *vtd = &caps->vendor_caps.vtd;
+    PCIBus *bus = vtd_hiod->bus;
+    PCIDevice *pdev = bus->devices[vtd_hiod->devfn];
 
     /* Remaining checks are all first stage translation specific */
     if (!object_dynamic_cast(OBJECT(hiod), TYPE_HOST_IOMMU_DEVICE_IOMMUFD)) {
@@ -4657,6 +4660,12 @@ static bool vtd_check_hiod(IntelIOMMUState *s, HostIOMMUDevice *hiod,
                    "First stage 1GB large page is unsupported by host IOMMU");
         return false;
     }
+
+    if (pci_device_get_iommu_bus_devfn(pdev, &bus, NULL, NULL)) {
+        error_setg(errp, "Host device under PCI bridge is unsupported "
+                   "when x-flts=on");
+        return false;
+    }
 #endif
 
     error_setg(errp,
@@ -4697,7 +4706,7 @@ static bool vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
     vtd_hiod->iommu_state = s;
     vtd_hiod->hiod = hiod;
 
-    if (!vtd_check_hiod(s, hiod, errp)) {
+    if (!vtd_check_hiod(s, vtd_hiod, errp)) {
         g_free(vtd_hiod);
         vtd_iommu_unlock(s);
         return false;
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v7 12/23] intel_iommu: Add some macros and inline functions
  2025-10-24  8:43 [PATCH v7 00/23] intel_iommu: Enable first stage translation for passthrough device Zhenzhong Duan
                   ` (10 preceding siblings ...)
  2025-10-24  8:43 ` [PATCH v7 11/23] intel_iommu: Fail passthrough device under PCI bridge if x-flts=on Zhenzhong Duan
@ 2025-10-24  8:43 ` Zhenzhong Duan
  2025-10-24 16:39   ` Cédric Le Goater
  2025-10-24  8:43 ` [PATCH v7 13/23] intel_iommu: Bind/unbind guest page table to host Zhenzhong Duan
                   ` (10 subsequent siblings)
  22 siblings, 1 reply; 48+ messages in thread
From: Zhenzhong Duan @ 2025-10-24  8:43 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, skolothumtho, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan

Add some macros and inline functions that will be used by the following
patches.

This patch also cleans up the VTD_SM_PASID_ENTRY_DID and
VTD_SM_PASID_ENTRY_FSPM macros to use extract64(), just like what SMMU does,
because they are either used in the following patches or used indirectly by
the newly introduced inline functions. But we don't aim to change the huge
amount of bit-mask style macro definitions in this patch; that should be done
in a separate patch.
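
To illustrate the extract64() idiom this cleanup moves to, here is a minimal
standalone sketch. extract64() below is a simplified re-implementation of the
helper from QEMU's include/qemu/bitops.h, and the wrapper function names are
illustrative only, not the actual macros:

```c
#include <stdint.h>

/* Simplified stand-in for QEMU's extract64() from include/qemu/bitops.h */
static uint64_t extract64(uint64_t value, int start, int length)
{
    return (value >> start) & (~0ULL >> (64 - length));
}

/* FSPM sits at bits 3:2 of the PASID entry's val[2] word:
 * 00 -> 4-level paging (48-bit IOVA), 01 -> 5-level paging (57-bit IOVA) */
static uint32_t fs_level(uint64_t val2)
{
    return 4 + (uint32_t)extract64(val2, 2, 2);      /* cf. VTD_PE_GET_FS_LEVEL */
}

static uint32_t fs_aw(uint64_t val2)
{
    return 48 + (uint32_t)extract64(val2, 2, 2) * 9; /* cf. vtd_pe_get_fs_aw */
}
```

With FSPM=01 (val[2] bit 2 set), fs_level() yields 5 and fs_aw() yields 57,
matching the comment on vtd_pe_get_fs_aw() in the hunk below.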

Suggested-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
---
 hw/i386/intel_iommu_internal.h |  8 +++++--
 hw/i386/intel_iommu.c          | 38 +++++++++++++++++++++++++++-------
 2 files changed, 37 insertions(+), 9 deletions(-)

diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 09edba81e2..df80af839d 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -642,10 +642,14 @@ typedef struct VTDPASIDCacheInfo {
 #define VTD_SM_PASID_ENTRY_PT          (4ULL << 6)
 
 #define VTD_SM_PASID_ENTRY_AW          7ULL /* Adjusted guest-address-width */
-#define VTD_SM_PASID_ENTRY_DID(val)    ((val) & VTD_DOMAIN_ID_MASK)
+#define VTD_SM_PASID_ENTRY_DID(x)      extract64((x)->val[1], 0, 16)
 
-#define VTD_SM_PASID_ENTRY_FSPM          3ULL
 #define VTD_SM_PASID_ENTRY_FSPTPTR       (~0xfffULL)
+#define VTD_SM_PASID_ENTRY_SRE_BIT(x)    extract64((x)->val[2], 0, 1)
+/* 00: 4-level paging, 01: 5-level paging, 10-11: Reserved */
+#define VTD_SM_PASID_ENTRY_FSPM(x)       extract64((x)->val[2], 2, 2)
+#define VTD_SM_PASID_ENTRY_WPE_BIT(x)    extract64((x)->val[2], 4, 1)
+#define VTD_SM_PASID_ENTRY_EAFE_BIT(x)   extract64((x)->val[2], 7, 1)
 
 /* First Stage Paging Structure */
 /* Masks for First Stage Paging Entry */
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 56abbb991d..871e6aad19 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -52,8 +52,7 @@
 
 /* pe operations */
 #define VTD_PE_GET_TYPE(pe) ((pe)->val[0] & VTD_SM_PASID_ENTRY_PGTT)
-#define VTD_PE_GET_FS_LEVEL(pe) \
-    (4 + (((pe)->val[2] >> 2) & VTD_SM_PASID_ENTRY_FSPM))
+#define VTD_PE_GET_FS_LEVEL(pe) (VTD_SM_PASID_ENTRY_FSPM(pe) + 4)
 #define VTD_PE_GET_SS_LEVEL(pe) \
     (2 + (((pe)->val[0] >> 2) & VTD_SM_PASID_ENTRY_AW))
 
@@ -837,6 +836,31 @@ static inline bool vtd_pe_type_check(IntelIOMMUState *s, VTDPASIDEntry *pe)
     }
 }
 
+static inline dma_addr_t vtd_pe_get_fspt_base(VTDPASIDEntry *pe)
+{
+    return pe->val[2] & VTD_SM_PASID_ENTRY_FSPTPTR;
+}
+
+/*
+ * First stage IOVA address width: 48 bits for 4-level paging(FSPM=00)
+ *                                 57 bits for 5-level paging(FSPM=01)
+ */
+static inline uint32_t vtd_pe_get_fs_aw(VTDPASIDEntry *pe)
+{
+    return 48 + VTD_SM_PASID_ENTRY_FSPM(pe) * 9;
+}
+
+static inline bool vtd_pe_pgtt_is_pt(VTDPASIDEntry *pe)
+{
+    return (VTD_PE_GET_TYPE(pe) == VTD_SM_PASID_ENTRY_PT);
+}
+
+/* check if pgtt is first stage translation */
+static inline bool vtd_pe_pgtt_is_fst(VTDPASIDEntry *pe)
+{
+    return (VTD_PE_GET_TYPE(pe) == VTD_SM_PASID_ENTRY_FST);
+}
+
 static inline bool vtd_pdire_present(VTDPASIDDirEntry *pdire)
 {
     return pdire->val & 1;
@@ -1625,7 +1649,7 @@ static uint16_t vtd_get_domain_id(IntelIOMMUState *s,
 
     if (s->root_scalable) {
         vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
-        return VTD_SM_PASID_ENTRY_DID(pe.val[1]);
+        return VTD_SM_PASID_ENTRY_DID(&pe);
     }
 
     return VTD_CONTEXT_ENTRY_DID(ce->hi);
@@ -1707,7 +1731,7 @@ static bool vtd_dev_pt_enabled(IntelIOMMUState *s, VTDContextEntry *ce,
              */
             return false;
         }
-        return (VTD_PE_GET_TYPE(&pe) == VTD_SM_PASID_ENTRY_PT);
+        return vtd_pe_pgtt_is_pt(&pe);
     }
 
     return (vtd_ce_get_type(ce) == VTD_CONTEXT_TT_PASS_THROUGH);
@@ -3146,9 +3170,9 @@ static void vtd_pasid_cache_sync_locked(gpointer key, gpointer value,
         /* Fall through */
     case VTD_INV_DESC_PASIDC_G_DSI:
         if (pc_entry->valid) {
-            did = VTD_SM_PASID_ENTRY_DID(pc_entry->pasid_entry.val[1]);
+            did = VTD_SM_PASID_ENTRY_DID(&pc_entry->pasid_entry);
         } else {
-            did = VTD_SM_PASID_ENTRY_DID(pe.val[1]);
+            did = VTD_SM_PASID_ENTRY_DID(&pe);
         }
         if (pc_info->did != did) {
             return;
@@ -5267,7 +5291,7 @@ static int vtd_pri_perform_implicit_invalidation(VTDAddressSpace *vtd_as,
         return -EINVAL;
     }
     pgtt = VTD_PE_GET_TYPE(&pe);
-    domain_id = VTD_SM_PASID_ENTRY_DID(pe.val[1]);
+    domain_id = VTD_SM_PASID_ENTRY_DID(&pe);
     ret = 0;
     switch (pgtt) {
     case VTD_SM_PASID_ENTRY_FST:
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v7 13/23] intel_iommu: Bind/unbind guest page table to host
  2025-10-24  8:43 [PATCH v7 00/23] intel_iommu: Enable first stage translation for passthrough device Zhenzhong Duan
                   ` (11 preceding siblings ...)
  2025-10-24  8:43 ` [PATCH v7 12/23] intel_iommu: Add some macros and inline functions Zhenzhong Duan
@ 2025-10-24  8:43 ` Zhenzhong Duan
  2025-10-24 17:01   ` Cédric Le Goater
  2025-10-24 17:33   ` Cédric Le Goater
  2025-10-24  8:43 ` [PATCH v7 14/23] intel_iommu: Propagate PASID-based iotlb invalidation " Zhenzhong Duan
                   ` (9 subsequent siblings)
  22 siblings, 2 replies; 48+ messages in thread
From: Zhenzhong Duan @ 2025-10-24  8:43 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, skolothumtho, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan, Yi Sun

This captures guest PASID table entry modifications and propagates the
changes to the host to attach a HWPT whose type is determined by the guest
IOMMU PGTT configuration.

When PGTT=PT, attach PASID_0 to a second stage HWPT (GPA->HPA).
When PGTT=FST, attach PASID_0 to a nested HWPT whose nesting parent HWPT
comes from VFIO.
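
The PGTT-based selection described above can be sketched as follows. This is
a simplified standalone model; the enum, struct, and function names are
hypothetical, not the actual QEMU code:

```c
#include <stdint.h>

/* Hypothetical, simplified model of the PGTT-based HWPT selection */
enum pgtt { PGTT_FST, PGTT_PT, PGTT_OTHER };

struct idev { uint32_t hwpt_id; };   /* default second stage HWPT id */

/* Return the hwpt id to attach PASID_0 to, or 0 on an invalid PGTT */
static uint32_t select_hwpt(enum pgtt pgtt, const struct idev *idev,
                            uint32_t nested_hwpt_id)
{
    switch (pgtt) {
    case PGTT_PT:
        return idev->hwpt_id;    /* second stage HWPT, GPA->HPA */
    case PGTT_FST:
        return nested_hwpt_id;   /* nested HWPT over the VFIO nesting parent */
    default:
        return 0;                /* caught and reported as an error */
    }
}
```

In the real patch the invalid-PGTT case is rejected with -EINVAL before any
attach, and the nested HWPT is allocated on demand by vtd_create_fs_hwpt().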

Co-Authored-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 include/hw/i386/intel_iommu.h |   1 +
 hw/i386/intel_iommu.c         | 150 +++++++++++++++++++++++++++++++++-
 hw/i386/trace-events          |   3 +
 3 files changed, 151 insertions(+), 3 deletions(-)

diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 3758ac239c..b5f8a9fc29 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -104,6 +104,7 @@ struct VTDAddressSpace {
     PCIBus *bus;
     uint8_t devfn;
     uint32_t pasid;
+    uint32_t fs_hwpt;
     AddressSpace as;
     IOMMUMemoryRegion iommu;
     MemoryRegion root;          /* The root container of the device */
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 871e6aad19..3789a36147 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -20,6 +20,7 @@
  */
 
 #include "qemu/osdep.h"
+#include CONFIG_DEVICES /* CONFIG_IOMMUFD */
 #include "qemu/error-report.h"
 #include "qemu/main-loop.h"
 #include "qapi/error.h"
@@ -42,6 +43,9 @@
 #include "migration/vmstate.h"
 #include "trace.h"
 #include "system/iommufd.h"
+#ifdef CONFIG_IOMMUFD
+#include <linux/iommufd.h>
+#endif
 
 /* context entry operations */
 #define PASID_0    0
@@ -87,6 +91,7 @@ struct vtd_iotlb_key {
 
 static void vtd_address_space_refresh_all(IntelIOMMUState *s);
 static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
+static int vtd_bind_guest_pasid(VTDAddressSpace *vtd_as, Error **errp);
 
 static void vtd_pasid_cache_reset_locked(IntelIOMMUState *s)
 {
@@ -98,7 +103,11 @@ static void vtd_pasid_cache_reset_locked(IntelIOMMUState *s)
     g_hash_table_iter_init(&as_it, s->vtd_address_spaces);
     while (g_hash_table_iter_next(&as_it, NULL, (void **)&vtd_as)) {
         VTDPASIDCacheEntry *pc_entry = &vtd_as->pasid_cache_entry;
-        pc_entry->valid = false;
+        if (pc_entry->valid) {
+            pc_entry->valid = false;
+            /* It's fatal to get failure during reset */
+            vtd_bind_guest_pasid(vtd_as, &error_fatal);
+        }
     }
 }
 
@@ -2380,6 +2389,128 @@ static void vtd_context_global_invalidate(IntelIOMMUState *s)
     vtd_iommu_replay_all(s);
 }
 
+#ifdef CONFIG_IOMMUFD
+static int vtd_create_fs_hwpt(HostIOMMUDeviceIOMMUFD *idev,
+                              VTDPASIDEntry *pe, uint32_t *fs_hwpt,
+                              Error **errp)
+{
+    struct iommu_hwpt_vtd_s1 vtd = {};
+
+    vtd.flags = (VTD_SM_PASID_ENTRY_SRE_BIT(pe) ? IOMMU_VTD_S1_SRE : 0) |
+                (VTD_SM_PASID_ENTRY_WPE_BIT(pe) ? IOMMU_VTD_S1_WPE : 0) |
+                (VTD_SM_PASID_ENTRY_EAFE_BIT(pe) ? IOMMU_VTD_S1_EAFE : 0);
+    vtd.addr_width = vtd_pe_get_fs_aw(pe);
+    vtd.pgtbl_addr = (uint64_t)vtd_pe_get_fspt_base(pe);
+
+    return !iommufd_backend_alloc_hwpt(idev->iommufd, idev->devid,
+                                       idev->hwpt_id, 0, IOMMU_HWPT_DATA_VTD_S1,
+                                       sizeof(vtd), &vtd, fs_hwpt, errp);
+}
+
+static void vtd_destroy_old_fs_hwpt(HostIOMMUDeviceIOMMUFD *idev,
+                                    VTDAddressSpace *vtd_as)
+{
+    if (!vtd_as->fs_hwpt) {
+        return;
+    }
+    iommufd_backend_free_id(idev->iommufd, vtd_as->fs_hwpt);
+    vtd_as->fs_hwpt = 0;
+}
+
+static int vtd_device_attach_iommufd(VTDHostIOMMUDevice *vtd_hiod,
+                                     VTDAddressSpace *vtd_as, Error **errp)
+{
+    HostIOMMUDeviceIOMMUFD *idev = HOST_IOMMU_DEVICE_IOMMUFD(vtd_hiod->hiod);
+    VTDPASIDEntry *pe = &vtd_as->pasid_cache_entry.pasid_entry;
+    uint32_t hwpt_id;
+    bool ret;
+
+    /*
+     * We can get here only with flts=on; the supported PGTTs are FST and PT.
+     * Catch invalid PGTT when processing invalidation request to avoid
+     * attaching to wrong hwpt.
+     */
+    if (!vtd_pe_pgtt_is_fst(pe) && !vtd_pe_pgtt_is_pt(pe)) {
+        error_setg(errp, "Invalid PGTT type");
+        return -EINVAL;
+    }
+
+    if (vtd_pe_pgtt_is_pt(pe)) {
+        hwpt_id = idev->hwpt_id;
+    } else if (vtd_create_fs_hwpt(idev, pe, &hwpt_id, errp)) {
+        return -EINVAL;
+    }
+
+    ret = host_iommu_device_iommufd_attach_hwpt(idev, hwpt_id, errp);
+    trace_vtd_device_attach_hwpt(idev->devid, vtd_as->pasid, hwpt_id, !ret);
+    if (ret) {
+        /* Destroy old fs_hwpt if it's a replacement */
+        vtd_destroy_old_fs_hwpt(idev, vtd_as);
+        if (vtd_pe_pgtt_is_fst(pe)) {
+            vtd_as->fs_hwpt = hwpt_id;
+        }
+    } else if (vtd_pe_pgtt_is_fst(pe)) {
+        iommufd_backend_free_id(idev->iommufd, hwpt_id);
+    }
+
+    return !ret;
+}
+
+static int vtd_device_detach_iommufd(VTDHostIOMMUDevice *vtd_hiod,
+                                     VTDAddressSpace *vtd_as, Error **errp)
+{
+    HostIOMMUDeviceIOMMUFD *idev = HOST_IOMMU_DEVICE_IOMMUFD(vtd_hiod->hiod);
+    IntelIOMMUState *s = vtd_as->iommu_state;
+    uint32_t pasid = vtd_as->pasid;
+    bool ret;
+
+    if (s->dmar_enabled && s->root_scalable) {
+        ret = host_iommu_device_iommufd_detach_hwpt(idev, errp);
+        trace_vtd_device_detach_hwpt(idev->devid, pasid, !ret);
+    } else {
+        /*
+         * If DMAR remapping is disabled or guest switches to legacy mode,
+         * we fall back to the default HWPT which contains the shadow page
+         * table, so guest DMA can still work.
+         */
+        ret = host_iommu_device_iommufd_attach_hwpt(idev, idev->hwpt_id, errp);
+        trace_vtd_device_reattach_def_hwpt(idev->devid, pasid, idev->hwpt_id,
+                                           !ret);
+    }
+
+    if (ret) {
+        vtd_destroy_old_fs_hwpt(idev, vtd_as);
+    }
+
+    return !ret;
+}
+
+static int vtd_bind_guest_pasid(VTDAddressSpace *vtd_as, Error **errp)
+{
+    VTDPASIDCacheEntry *pc_entry = &vtd_as->pasid_cache_entry;
+    VTDHostIOMMUDevice *vtd_hiod = vtd_find_hiod_iommufd(vtd_as);
+    int ret;
+
+    /* Ignore emulated device or legacy VFIO backed device */
+    if (!vtd_hiod) {
+        return 0;
+    }
+
+    if (pc_entry->valid) {
+        ret = vtd_device_attach_iommufd(vtd_hiod, vtd_as, errp);
+    } else {
+        ret = vtd_device_detach_iommufd(vtd_hiod, vtd_as, errp);
+    }
+
+    return ret;
+}
+#else
+static int vtd_bind_guest_pasid(VTDAddressSpace *vtd_as, Error **errp)
+{
+    return 0;
+}
+#endif
+
 /* Do a context-cache device-selective invalidation.
  * @func_mask: FM field after shifting
  */
@@ -3134,6 +3265,8 @@ static void vtd_pasid_cache_sync_locked(gpointer key, gpointer value,
     VTDPASIDEntry pe;
     IOMMUNotifier *n;
     uint16_t did;
+    const char *err_prefix;
+    Error *local_err = NULL;
 
     if (vtd_dev_get_pe_from_pasid(vtd_as, &pe)) {
         if (!pc_entry->valid) {
@@ -3154,7 +3287,9 @@ static void vtd_pasid_cache_sync_locked(gpointer key, gpointer value,
             vtd_address_space_unmap(vtd_as, n);
         }
         vtd_switch_address_space(vtd_as);
-        return;
+
+        err_prefix = "Detaching from HWPT failed: ";
+        goto do_bind_unbind;
     }
 
     /*
@@ -3182,12 +3317,21 @@ static void vtd_pasid_cache_sync_locked(gpointer key, gpointer value,
     if (!pc_entry->valid) {
         pc_entry->pasid_entry = pe;
         pc_entry->valid = true;
-    } else if (!vtd_pasid_entry_compare(&pe, &pc_entry->pasid_entry)) {
+        err_prefix = "Attaching to HWPT failed: ";
+    } else if (vtd_pasid_entry_compare(&pe, &pc_entry->pasid_entry)) {
+        err_prefix = "Replacing HWPT attachment failed: ";
+    } else {
         return;
     }
 
     vtd_switch_address_space(vtd_as);
     vtd_address_space_sync(vtd_as);
+
+do_bind_unbind:
+    /* TODO: Fault event injection into guest, report error to QEMU for now */
+    if (vtd_bind_guest_pasid(vtd_as, &local_err)) {
+        error_reportf_err(local_err, "%s", err_prefix);
+    }
 }
 
 static void vtd_pasid_cache_sync(IntelIOMMUState *s, VTDPASIDCacheInfo *pc_info)
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index b704f4f90c..5a3ee1cf64 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -73,6 +73,9 @@ vtd_warn_invalid_qi_tail(uint16_t tail) "tail 0x%"PRIx16
 vtd_warn_ir_vector(uint16_t sid, int index, int vec, int target) "sid 0x%"PRIx16" index %d vec %d (should be: %d)"
 vtd_warn_ir_trigger(uint16_t sid, int index, int trig, int target) "sid 0x%"PRIx16" index %d trigger %d (should be: %d)"
 vtd_reset_exit(void) ""
+vtd_device_attach_hwpt(uint32_t dev_id, uint32_t pasid, uint32_t hwpt_id, int ret) "dev_id %d pasid %d hwpt_id %d, ret: %d"
+vtd_device_detach_hwpt(uint32_t dev_id, uint32_t pasid, int ret) "dev_id %d pasid %d ret: %d"
+vtd_device_reattach_def_hwpt(uint32_t dev_id, uint32_t pasid, uint32_t hwpt_id, int ret) "dev_id %d pasid %d hwpt_id %d, ret: %d"
 
 # amd_iommu.c
 amdvi_evntlog_fail(uint64_t addr, uint32_t head) "error: fail to write at addr 0x%"PRIx64" +  offset 0x%"PRIx32
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v7 14/23] intel_iommu: Propagate PASID-based iotlb invalidation to host
  2025-10-24  8:43 [PATCH v7 00/23] intel_iommu: Enable first stage translation for passthrough device Zhenzhong Duan
                   ` (12 preceding siblings ...)
  2025-10-24  8:43 ` [PATCH v7 13/23] intel_iommu: Bind/unbind guest page table to host Zhenzhong Duan
@ 2025-10-24  8:43 ` Zhenzhong Duan
  2025-10-24  8:43 ` [PATCH v7 15/23] intel_iommu: Replay all pasid bindings when either SRTP or TE bit is changed Zhenzhong Duan
                   ` (8 subsequent siblings)
  22 siblings, 0 replies; 48+ messages in thread
From: Zhenzhong Duan @ 2025-10-24  8:43 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, skolothumtho, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng, Yi Sun,
	Zhenzhong Duan

From: Yi Liu <yi.l.liu@intel.com>

This traps the guest PASID-based iotlb invalidation request and propagates it
to the host.

Intel VT-d 3.0 supports nested translation at PASID granularity. Guest SVA
support could be implemented by configuring nested translation on a specific
PASID. This is also known as dual stage DMA translation.

Under such a configuration, the guest owns the GVA->GPA translation, which is
configured as the first stage page table on the host side for a specific
PASID, and the host owns the GPA->HPA translation. As the guest owns the first
stage translation table, piotlb invalidations should be propagated to the
host, since the host IOMMU will cache first stage page table related mappings
during DMA address translation.
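
The descriptor-to-invalidation mapping used in the patch (address, 2^AM
pages, IH hint -> leaf flag) can be sketched standalone. The struct and macro
below are hypothetical mirrors of the uAPI fields, not the real definitions:

```c
#include <stdint.h>

/* Hypothetical mirror of the iommufd uAPI fields used by the patch */
struct s1_invalidate {
    uint64_t addr;
    uint64_t npages;
    uint32_t flags;
};
#define INV_FLAGS_LEAF 1u   /* placeholder for IOMMU_VTD_INV_FLAGS_LEAF */

/* Translate a PSI descriptor (addr, address-mask am, ih hint) into a
 * first stage invalidation request, as vtd_piotlb_page_invalidate() does */
static struct s1_invalidate make_inv(uint64_t addr, uint8_t am, int ih)
{
    struct s1_invalidate inv = {
        .addr = addr,
        .npages = 1ULL << am,           /* AM encodes 2^am pages */
        .flags = ih ? INV_FLAGS_LEAF : 0,
    };
    return inv;
}
```

A PASID-wide flush (vtd_piotlb_pasid_invalidate) instead passes addr 0 and
npages (uint64_t)-1 to cover the whole address space.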

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/i386/intel_iommu_internal.h |  6 +++
 hw/i386/intel_iommu.c          | 87 ++++++++++++++++++++++++++++++++--
 2 files changed, 90 insertions(+), 3 deletions(-)

diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index df80af839d..97b48544d2 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -621,6 +621,12 @@ typedef struct VTDPASIDCacheInfo {
     uint32_t pasid;
 } VTDPASIDCacheInfo;
 
+typedef struct VTDPIOTLBInvInfo {
+    uint16_t domain_id;
+    uint32_t pasid;
+    struct iommu_hwpt_vtd_s1_invalidate *inv_data;
+} VTDPIOTLBInvInfo;
+
 /* PASID Table Related Definitions */
 #define VTD_PASID_DIR_BASE_ADDR_MASK  (~0xfffULL)
 #define VTD_PASID_TABLE_BASE_ADDR_MASK (~0xfffULL)
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 3789a36147..ef6477de53 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -2504,11 +2504,88 @@ static int vtd_bind_guest_pasid(VTDAddressSpace *vtd_as, Error **errp)
 
     return ret;
 }
+
+/*
+ * This function iterates over the s->vtd_address_spaces hash table,
+ * using VTDPIOTLBInvInfo as a filter. It propagates the piotlb
+ * invalidation to the host.
+ */
+static void vtd_flush_host_piotlb_locked(gpointer key, gpointer value,
+                                         gpointer user_data)
+{
+    VTDPIOTLBInvInfo *piotlb_info = user_data;
+    VTDAddressSpace *vtd_as = value;
+    VTDHostIOMMUDevice *vtd_hiod = vtd_find_hiod_iommufd(vtd_as);
+    VTDPASIDCacheEntry *pc_entry = &vtd_as->pasid_cache_entry;
+    uint16_t did;
+
+    if (!vtd_hiod) {
+        return;
+    }
+
+    assert(vtd_as->pasid == PCI_NO_PASID);
+
+    /* Nothing to do if there is no first stage HWPT attached */
+    if (!pc_entry->valid ||
+        !vtd_pe_pgtt_is_fst(&pc_entry->pasid_entry)) {
+        return;
+    }
+
+    did = VTD_SM_PASID_ENTRY_DID(&pc_entry->pasid_entry);
+
+    if (piotlb_info->domain_id == did && piotlb_info->pasid == PASID_0) {
+        HostIOMMUDeviceIOMMUFD *idev =
+            HOST_IOMMU_DEVICE_IOMMUFD(vtd_hiod->hiod);
+        uint32_t entry_num = 1; /* Only implement one request for simplicity */
+        Error *local_err = NULL;
+        struct iommu_hwpt_vtd_s1_invalidate *cache = piotlb_info->inv_data;
+
+        if (!iommufd_backend_invalidate_cache(idev->iommufd, vtd_as->fs_hwpt,
+                                              IOMMU_HWPT_INVALIDATE_DATA_VTD_S1,
+                                              sizeof(*cache), &entry_num, cache,
+                                              &local_err)) {
+            /* Something went wrong in the kernel, but try to continue */
+            error_report_err(local_err);
+        }
+    }
+}
+
+static void
+vtd_flush_host_piotlb_all_locked(IntelIOMMUState *s,
+                                 uint16_t domain_id, uint32_t pasid,
+                                 hwaddr addr, uint64_t npages, bool ih)
+{
+    struct iommu_hwpt_vtd_s1_invalidate cache_info = { 0 };
+    VTDPIOTLBInvInfo piotlb_info;
+
+    cache_info.addr = addr;
+    cache_info.npages = npages;
+    cache_info.flags = ih ? IOMMU_VTD_INV_FLAGS_LEAF : 0;
+
+    piotlb_info.domain_id = domain_id;
+    piotlb_info.pasid = pasid;
+    piotlb_info.inv_data = &cache_info;
+
+    /*
+     * Go through each vtd_as instance in s->vtd_address_spaces, find out
+     * the affected host device which need host piotlb invalidation. Piotlb
+     * invalidation should check pasid cache per architecture point of view.
+     */
+    g_hash_table_foreach(s->vtd_address_spaces,
+                         vtd_flush_host_piotlb_locked, &piotlb_info);
+}
 #else
 static int vtd_bind_guest_pasid(VTDAddressSpace *vtd_as, Error **errp)
 {
     return 0;
 }
+
+static void
+vtd_flush_host_piotlb_all_locked(IntelIOMMUState *s,
+                                 uint16_t domain_id, uint32_t pasid,
+                                 hwaddr addr, uint64_t npages, bool ih)
+{
+}
 #endif
 
 /* Do a context-cache device-selective invalidation.
@@ -3155,6 +3232,7 @@ static void vtd_piotlb_pasid_invalidate(IntelIOMMUState *s,
     vtd_iommu_lock(s);
     g_hash_table_foreach_remove(s->iotlb, vtd_hash_remove_by_pasid,
                                 &info);
+    vtd_flush_host_piotlb_all_locked(s, domain_id, pasid, 0, (uint64_t)-1, 0);
     vtd_iommu_unlock(s);
 
     QLIST_FOREACH(vtd_as, &s->vtd_as_with_notifiers, next) {
@@ -3174,7 +3252,8 @@ static void vtd_piotlb_pasid_invalidate(IntelIOMMUState *s,
 }
 
 static void vtd_piotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
-                                       uint32_t pasid, hwaddr addr, uint8_t am)
+                                       uint32_t pasid, hwaddr addr, uint8_t am,
+                                       bool ih)
 {
     VTDIOTLBPageInvInfo info;
 
@@ -3186,6 +3265,7 @@ static void vtd_piotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
     vtd_iommu_lock(s);
     g_hash_table_foreach_remove(s->iotlb,
                                 vtd_hash_remove_by_page_piotlb, &info);
+    vtd_flush_host_piotlb_all_locked(s, domain_id, pasid, addr, 1 << am, ih);
     vtd_iommu_unlock(s);
 
     vtd_iotlb_page_invalidate_notify(s, domain_id, addr, am, pasid);
@@ -3217,7 +3297,8 @@ static bool vtd_process_piotlb_desc(IntelIOMMUState *s,
     case VTD_INV_DESC_PIOTLB_PSI_IN_PASID:
         am = VTD_INV_DESC_PIOTLB_AM(inv_desc->val[1]);
         addr = (hwaddr) VTD_INV_DESC_PIOTLB_ADDR(inv_desc->val[1]);
-        vtd_piotlb_page_invalidate(s, domain_id, pasid, addr, am);
+        vtd_piotlb_page_invalidate(s, domain_id, pasid, addr, am,
+                                   VTD_INV_DESC_PIOTLB_IH(inv_desc->val[1]));
         break;
 
     default:
@@ -5439,7 +5520,7 @@ static int vtd_pri_perform_implicit_invalidation(VTDAddressSpace *vtd_as,
     ret = 0;
     switch (pgtt) {
     case VTD_SM_PASID_ENTRY_FST:
-        vtd_piotlb_page_invalidate(s, domain_id, vtd_as->pasid, addr, 0);
+        vtd_piotlb_page_invalidate(s, domain_id, vtd_as->pasid, addr, 0, 0);
         break;
     /* Room for other pgtt values */
     default:
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v7 15/23] intel_iommu: Replay all pasid bindings when either SRTP or TE bit is changed
  2025-10-24  8:43 [PATCH v7 00/23] intel_iommu: Enable first stage translation for passthrough device Zhenzhong Duan
                   ` (13 preceding siblings ...)
  2025-10-24  8:43 ` [PATCH v7 14/23] intel_iommu: Propagate PASID-based iotlb invalidation " Zhenzhong Duan
@ 2025-10-24  8:43 ` Zhenzhong Duan
  2025-10-24  8:43 ` [PATCH v7 16/23] intel_iommu: Replay pasid bindings after context cache invalidation Zhenzhong Duan
                   ` (7 subsequent siblings)
  22 siblings, 0 replies; 48+ messages in thread
From: Zhenzhong Duan @ 2025-10-24  8:43 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, skolothumtho, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan

From: Yi Liu <yi.l.liu@intel.com>

When either the 'Set Root Table Pointer' or 'Translation Enable' bit is
changed, all pasid bindings on the host side become stale and need to be
updated.

Introduce a helper function vtd_replay_pasid_bindings_all() to go through all
pasid entries of all passthrough devices and update the host side bindings.

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
---
 hw/i386/intel_iommu.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index ef6477de53..1f78274204 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -92,6 +92,7 @@ struct vtd_iotlb_key {
 static void vtd_address_space_refresh_all(IntelIOMMUState *s);
 static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
 static int vtd_bind_guest_pasid(VTDAddressSpace *vtd_as, Error **errp);
+static void vtd_replay_pasid_bindings_all(IntelIOMMUState *s);
 
 static void vtd_pasid_cache_reset_locked(IntelIOMMUState *s)
 {
@@ -2894,6 +2895,7 @@ static void vtd_handle_gcmd_srtp(IntelIOMMUState *s)
     vtd_set_clear_mask_long(s, DMAR_GSTS_REG, 0, VTD_GSTS_RTPS);
     vtd_reset_caches(s);
     vtd_address_space_refresh_all(s);
+    vtd_replay_pasid_bindings_all(s);
 }
 
 /* Set Interrupt Remap Table Pointer */
@@ -2928,6 +2930,7 @@ static void vtd_handle_gcmd_te(IntelIOMMUState *s, bool en)
 
     vtd_reset_caches(s);
     vtd_address_space_refresh_all(s);
+    vtd_replay_pasid_bindings_all(s);
 }
 
 /* Handle Interrupt Remap Enable/Disable */
@@ -3427,6 +3430,13 @@ static void vtd_pasid_cache_sync(IntelIOMMUState *s, VTDPASIDCacheInfo *pc_info)
     vtd_iommu_unlock(s);
 }
 
+static void vtd_replay_pasid_bindings_all(IntelIOMMUState *s)
+{
+    VTDPASIDCacheInfo pc_info = { .type = VTD_INV_DESC_PASIDC_G_GLOBAL };
+
+    vtd_pasid_cache_sync(s, &pc_info);
+}
+
 static bool vtd_process_pasid_desc(IntelIOMMUState *s,
                                    VTDInvDesc *inv_desc)
 {
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v7 16/23] intel_iommu: Replay pasid bindings after context cache invalidation
  2025-10-24  8:43 [PATCH v7 00/23] intel_iommu: Enable first stage translation for passthrough device Zhenzhong Duan
                   ` (14 preceding siblings ...)
  2025-10-24  8:43 ` [PATCH v7 15/23] intel_iommu: Replay all pasid bindings when either SRTP or TE bit is changed Zhenzhong Duan
@ 2025-10-24  8:43 ` Zhenzhong Duan
  2025-10-24  8:43 ` [PATCH v7 17/23] iommufd: Introduce a helper function to extract vendor capabilities Zhenzhong Duan
                   ` (6 subsequent siblings)
  22 siblings, 0 replies; 48+ messages in thread
From: Zhenzhong Duan @ 2025-10-24  8:43 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, skolothumtho, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan

From: Yi Liu <yi.l.liu@intel.com>

This replays guest pasid bindings after a context cache invalidation. Per the
spec, the programmer should issue a pasid cache invalidation with the proper
granularity after issuing a context cache invalidation.

We have seen old Linux kernels such as 6.7.0-rc2 not following the spec: they
send the pasid cache invalidation before the context cache invalidation, so
QEMU depends on the context cache invalidation to get the pasid entry and set
up the binding.

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/i386/intel_iommu.c | 47 +++++++++++++++++++++++++++++++++++++++++++
 hw/i386/trace-events  |  1 +
 2 files changed, 48 insertions(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 1f78274204..edd1416382 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -93,6 +93,8 @@ static void vtd_address_space_refresh_all(IntelIOMMUState *s);
 static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
 static int vtd_bind_guest_pasid(VTDAddressSpace *vtd_as, Error **errp);
 static void vtd_replay_pasid_bindings_all(IntelIOMMUState *s);
+static void vtd_pasid_cache_sync_locked(gpointer key, gpointer value,
+                                        gpointer user_data);
 
 static void vtd_pasid_cache_reset_locked(IntelIOMMUState *s)
 {
@@ -2388,6 +2390,13 @@ static void vtd_context_global_invalidate(IntelIOMMUState *s)
      * VT-d emulation codes.
      */
     vtd_iommu_replay_all(s);
+    /*
+     * Same for pasid cache invalidation, per VT-d spec 6.5.2.1, a global
+     * context cache invalidation should be followed by global PASID cache
+     * invalidation. In order to work with a guest not following the spec,
+     * handle global PASID cache invalidation here.
+     */
+    vtd_replay_pasid_bindings_all(s);
 }
 
 #ifdef CONFIG_IOMMUFD
@@ -2589,6 +2598,35 @@ vtd_flush_host_piotlb_all_locked(IntelIOMMUState *s,
 }
 #endif
 
+static void vtd_pasid_cache_devsi(VTDAddressSpace *vtd_as)
+{
+    IntelIOMMUState *s = vtd_as->iommu_state;
+    PCIBus *bus = vtd_as->bus;
+    uint8_t devfn = vtd_as->devfn;
+    struct vtd_as_key key = {
+        .bus = bus,
+        .devfn = devfn,
+        .pasid = vtd_as->pasid,
+    };
+    VTDPASIDCacheInfo pc_info;
+
+    if (!s->fsts || !s->root_scalable || !s->dmar_enabled) {
+        return;
+    }
+
+    trace_vtd_pasid_cache_devsi(pci_bus_num(bus),
+                                VTD_PCI_SLOT(devfn), VTD_PCI_FUNC(devfn));
+
+    /* Fake a global invalidation just to bypass all checks */
+    pc_info.type = VTD_INV_DESC_PASIDC_G_GLOBAL;
+
+    /*
+     * We already get vtd_as of the device whose PASID cache is invalidated,
+     * so just call vtd_pasid_cache_sync_locked() once.
+     */
+    vtd_pasid_cache_sync_locked(&key, vtd_as, &pc_info);
+}
+
 /* Do a context-cache device-selective invalidation.
  * @func_mask: FM field after shifting
  */
@@ -2647,6 +2685,15 @@ static void vtd_context_device_invalidate(IntelIOMMUState *s,
              * happened.
              */
             vtd_address_space_sync(vtd_as);
+            /*
+             * Per spec 6.5.2.1, context flush should be followed by PASID
+             * cache and iotlb flush. In order to work with a guest which does
+             * not follow the spec and misses the PASID cache flush, e.g. linux
+             * 6.7.0-rc2, we have vtd_pasid_cache_devsi() to invalidate the
+             * PASID cache of the passthrough device. The host IOMMU driver
+             * would flush the piotlb when a pasid unbind is passed down to it.
+             */
+            vtd_pasid_cache_devsi(vtd_as);
         }
     }
 }
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index 5a3ee1cf64..5fa5e93b68 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -28,6 +28,7 @@ vtd_pasid_cache_reset(void) ""
 vtd_inv_desc_pasid_cache_gsi(void) ""
 vtd_inv_desc_pasid_cache_dsi(uint16_t domain) "Domain selective PC invalidation domain 0x%"PRIx16
 vtd_inv_desc_pasid_cache_psi(uint16_t domain, uint32_t pasid) "PASID selective PC invalidation domain 0x%"PRIx16" pasid 0x%"PRIx32
+vtd_pasid_cache_devsi(uint8_t bus, uint8_t dev, uint8_t fn) "Dev selective PC invalidation dev: %02"PRIx8":%02"PRIx8".%02"PRIx8
 vtd_re_not_present(uint8_t bus) "Root entry bus %"PRIu8" not present"
 vtd_ce_not_present(uint8_t bus, uint8_t devfn) "Context entry bus %"PRIu8" devfn %"PRIu8" not present"
 vtd_iotlb_page_hit(uint16_t sid, uint64_t addr, uint64_t slpte, uint16_t domain) "IOTLB page hit sid 0x%"PRIx16" iova 0x%"PRIx64" slpte 0x%"PRIx64" domain 0x%"PRIx16
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v7 17/23] iommufd: Introduce a helper function to extract vendor capabilities
  2025-10-24  8:43 [PATCH v7 00/23] intel_iommu: Enable first stage translation for passthrough device Zhenzhong Duan
                   ` (15 preceding siblings ...)
  2025-10-24  8:43 ` [PATCH v7 16/23] intel_iommu: Replay pasid bindings after context cache invalidation Zhenzhong Duan
@ 2025-10-24  8:43 ` Zhenzhong Duan
  2025-10-24 16:44   ` Cédric Le Goater
  2025-10-24 17:34   ` Cédric Le Goater
  2025-10-24  8:43 ` [PATCH v7 18/23] vfio: Add a new element bypass_ro in VFIOContainer Zhenzhong Duan
                   ` (5 subsequent siblings)
  22 siblings, 2 replies; 48+ messages in thread
From: Zhenzhong Duan @ 2025-10-24  8:43 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, skolothumtho, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan

In VFIO core, we call iommufd_backend_get_device_info() to return vendor
specific hardware information data, but VFIO core is not a good place to
extract this raw data.

Introduce host_iommu_extract_quirks() in iommufd.c to extract the raw data
and return a bitmap; iommufd.c is the natural place since it already
defines iommufd_backend_get_device_info().

The other choice would be to put the vendor data extraction code in each
vendor's vIOMMU emulation file, but that would mix vIOMMU emulation with
host IOMMU extraction code in those files and would also need a new
callback in PCIIOMMUOps. So we choose the simpler way as above.

Suggested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
---
 include/hw/iommu.h                 |  5 +++++
 include/system/host_iommu_device.h | 15 +++++++++++++++
 backends/iommufd.c                 | 13 +++++++++++++
 3 files changed, 33 insertions(+)

diff --git a/include/hw/iommu.h b/include/hw/iommu.h
index 9b8bb94fc2..6d61410703 100644
--- a/include/hw/iommu.h
+++ b/include/hw/iommu.h
@@ -22,4 +22,9 @@ enum viommu_flags {
     VIOMMU_FLAG_WANT_NESTING_PARENT = BIT_ULL(0),
 };
 
+/* Host IOMMU quirks. Extracted from host IOMMU capabilities */
+enum host_iommu_quirks {
+    HOST_IOMMU_QUIRK_NESTING_PARENT_BYPASS_RO = BIT_ULL(0),
+};
+
 #endif /* HW_IOMMU_H */
diff --git a/include/system/host_iommu_device.h b/include/system/host_iommu_device.h
index ab849a4a82..9ae7f4cc6d 100644
--- a/include/system/host_iommu_device.h
+++ b/include/system/host_iommu_device.h
@@ -39,6 +39,21 @@ typedef struct HostIOMMUDeviceCaps {
     uint64_t hw_caps;
     VendorCaps vendor_caps;
 } HostIOMMUDeviceCaps;
+
+/**
+ * host_iommu_extract_quirks: Extract host IOMMU quirks
+ *
+ * This function converts @type specific hardware information data
+ * into a standard bitmap format.
+ *
+ * @type: IOMMU Hardware Info Types
+ *
+ * @caps: IOMMU @type specific hardware information data
+ *
+ * Returns: a bitmap with each bit representing a host IOMMU quirk
+ * defined in enum host_iommu_quirks
+ */
+uint64_t host_iommu_extract_quirks(uint32_t type, VendorCaps *caps);
 #endif
 
 #define TYPE_HOST_IOMMU_DEVICE "host-iommu-device"
diff --git a/backends/iommufd.c b/backends/iommufd.c
index 086bd67aea..61b991ec53 100644
--- a/backends/iommufd.c
+++ b/backends/iommufd.c
@@ -19,6 +19,7 @@
 #include "migration/cpr.h"
 #include "monitor/monitor.h"
 #include "trace.h"
+#include "hw/iommu.h"
 #include "hw/vfio/vfio-device.h"
 #include <sys/ioctl.h>
 #include <linux/iommufd.h>
@@ -411,6 +412,18 @@ bool iommufd_backend_get_device_info(IOMMUFDBackend *be, uint32_t devid,
     return true;
 }
 
+uint64_t host_iommu_extract_quirks(uint32_t type, VendorCaps *caps)
+{
+    uint64_t quirks = 0;
+
+    if (type == IOMMU_HW_INFO_TYPE_INTEL_VTD &&
+        caps->vtd.flags & IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17) {
+        quirks |= HOST_IOMMU_QUIRK_NESTING_PARENT_BYPASS_RO;
+    }
+
+    return quirks;
+}
+
 bool iommufd_backend_invalidate_cache(IOMMUFDBackend *be, uint32_t id,
                                       uint32_t data_type, uint32_t entry_len,
                                       uint32_t *entry_num, void *data,
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v7 18/23] vfio: Add a new element bypass_ro in VFIOContainer
  2025-10-24  8:43 [PATCH v7 00/23] intel_iommu: Enable first stage translation for passthrough device Zhenzhong Duan
                   ` (16 preceding siblings ...)
  2025-10-24  8:43 ` [PATCH v7 17/23] iommufd: Introduce a helper function to extract vendor capabilities Zhenzhong Duan
@ 2025-10-24  8:43 ` Zhenzhong Duan
  2025-10-24  8:43 ` [PATCH v7 19/23] Workaround for ERRATA_772415_SPR17 Zhenzhong Duan
                   ` (4 subsequent siblings)
  22 siblings, 0 replies; 48+ messages in thread
From: Zhenzhong Duan @ 2025-10-24  8:43 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, skolothumtho, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan

When bypass_ro is true, read-only memory sections are bypassed from being
mapped in the container.

This is a preparatory patch to work around Intel ERRATA_772415; see the
changelog of the next patch for details about the errata.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
---
 include/hw/vfio/vfio-container.h |  1 +
 hw/vfio/listener.c               | 21 ++++++++++++++-------
 2 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/include/hw/vfio/vfio-container.h b/include/hw/vfio/vfio-container.h
index 9f6e8cedfc..a7d5c5ed67 100644
--- a/include/hw/vfio/vfio-container.h
+++ b/include/hw/vfio/vfio-container.h
@@ -52,6 +52,7 @@ struct VFIOContainer {
     QLIST_HEAD(, VFIODevice) device_list;
     GList *iova_ranges;
     NotifierWithReturn cpr_reboot_notifier;
+    bool bypass_ro;
 };
 
 #define TYPE_VFIO_IOMMU "vfio-iommu"
diff --git a/hw/vfio/listener.c b/hw/vfio/listener.c
index 2109101158..0862b2b834 100644
--- a/hw/vfio/listener.c
+++ b/hw/vfio/listener.c
@@ -76,8 +76,13 @@ static bool vfio_log_sync_needed(const VFIOContainer *bcontainer)
     return true;
 }
 
-static bool vfio_listener_skipped_section(MemoryRegionSection *section)
+static bool vfio_listener_skipped_section(MemoryRegionSection *section,
+                                          bool bypass_ro)
 {
+    if (bypass_ro && section->readonly) {
+        return true;
+    }
+
     return (!memory_region_is_ram(section->mr) &&
             !memory_region_is_iommu(section->mr)) ||
            memory_region_is_protected(section->mr) ||
@@ -368,9 +373,9 @@ static bool vfio_known_safe_misalignment(MemoryRegionSection *section)
 }
 
 static bool vfio_listener_valid_section(MemoryRegionSection *section,
-                                        const char *name)
+                                        bool bypass_ro, const char *name)
 {
-    if (vfio_listener_skipped_section(section)) {
+    if (vfio_listener_skipped_section(section, bypass_ro)) {
         trace_vfio_listener_region_skip(name,
                 section->offset_within_address_space,
                 section->offset_within_address_space +
@@ -497,7 +502,8 @@ void vfio_container_region_add(VFIOContainer *bcontainer,
     int ret;
     Error *err = NULL;
 
-    if (!vfio_listener_valid_section(section, "region_add")) {
+    if (!vfio_listener_valid_section(section, bcontainer->bypass_ro,
+                                     "region_add")) {
         return;
     }
 
@@ -663,7 +669,8 @@ static void vfio_listener_region_del(MemoryListener *listener,
     int ret;
     bool try_unmap = true;
 
-    if (!vfio_listener_valid_section(section, "region_del")) {
+    if (!vfio_listener_valid_section(section, bcontainer->bypass_ro,
+                                     "region_del")) {
         return;
     }
 
@@ -821,7 +828,7 @@ static void vfio_dirty_tracking_update(MemoryListener *listener,
         container_of(listener, VFIODirtyRangesListener, listener);
     hwaddr iova, end;
 
-    if (!vfio_listener_valid_section(section, "tracking_update") ||
+    if (!vfio_listener_valid_section(section, false, "tracking_update") ||
         !vfio_get_section_iova_range(dirty->bcontainer, section,
                                      &iova, &end, NULL)) {
         return;
@@ -1215,7 +1222,7 @@ static void vfio_listener_log_sync(MemoryListener *listener,
     int ret;
     Error *local_err = NULL;
 
-    if (vfio_listener_skipped_section(section)) {
+    if (vfio_listener_skipped_section(section, false)) {
         return;
     }
 
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v7 19/23] Workaround for ERRATA_772415_SPR17
  2025-10-24  8:43 [PATCH v7 00/23] intel_iommu: Enable first stage translation for passthrough device Zhenzhong Duan
                   ` (17 preceding siblings ...)
  2025-10-24  8:43 ` [PATCH v7 18/23] vfio: Add a new element bypass_ro in VFIOContainer Zhenzhong Duan
@ 2025-10-24  8:43 ` Zhenzhong Duan
  2025-10-24 17:36   ` Cédric Le Goater
  2025-10-24 17:38   ` Cédric Le Goater
  2025-10-24  8:43 ` [PATCH v7 20/23] vfio: Bypass readonly region for dirty tracking Zhenzhong Duan
                   ` (3 subsequent siblings)
  22 siblings, 2 replies; 48+ messages in thread
From: Zhenzhong Duan @ 2025-10-24  8:43 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, skolothumtho, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan

On a system affected by ERRATA_772415, IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17
is reported by IOMMU_DEVICE_GET_HW_INFO. Due to this erratum, even a readonly
range mapped in the second stage page table could still be written.

Reference from 4th Gen Intel Xeon Processor Scalable Family Specification
Update, Errata Details, SPR17.
https://edc.intel.com/content/www/us/en/design/products-and-solutions/processors-and-chipsets/eagle-stream/sapphire-rapids-specification-update/

The SPR17 details, copied from the above link:
"Problem: When remapping hardware is configured by system software in
scalable mode as Nested (PGTT=011b) and with PWSNP field Set in the
PASID-table-entry, it may Set Accessed bit and Dirty bit (and Extended
Access bit if enabled) in first-stage page-table entries even when
second-stage mappings indicate that corresponding first-stage page-table
is Read-Only.

Implication: Due to this erratum, pages mapped as Read-only in second-stage
page-tables may be modified by remapping hardware Access/Dirty bit updates.

Workaround: None identified. System software enabling nested translations
for a VM should ensure that there are no read-only pages in the
corresponding second-stage mappings."

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/vfio/iommufd.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index f9d0926274..f9da0e79cc 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -15,6 +15,7 @@
 #include <linux/vfio.h>
 #include <linux/iommufd.h>
 
+#include "hw/iommu.h"
 #include "hw/vfio/vfio-device.h"
 #include "qemu/error-report.h"
 #include "trace.h"
@@ -351,6 +352,7 @@ static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
     VFIOContainer *bcontainer = VFIO_IOMMU(container);
     uint32_t type, flags = 0;
     uint64_t hw_caps;
+    VendorCaps caps;
     VFIOIOASHwpt *hwpt;
     uint32_t hwpt_id;
     int ret;
@@ -396,7 +398,8 @@ static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
      * instead.
      */
     if (!iommufd_backend_get_device_info(vbasedev->iommufd, vbasedev->devid,
-                                         &type, NULL, 0, &hw_caps, errp)) {
+                                         &type, &caps, sizeof(caps), &hw_caps,
+                                         errp)) {
         return false;
     }
 
@@ -411,6 +414,11 @@ static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
      */
     if (vfio_device_get_viommu_flags_want_nesting(vbasedev)) {
         flags |= IOMMU_HWPT_ALLOC_NEST_PARENT;
+
+        if (host_iommu_extract_quirks(type, &caps) &
+            HOST_IOMMU_QUIRK_NESTING_PARENT_BYPASS_RO) {
+            bcontainer->bypass_ro = true;
+        }
     }
 
     if (cpr_is_incoming()) {
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v7 20/23] vfio: Bypass readonly region for dirty tracking
  2025-10-24  8:43 [PATCH v7 00/23] intel_iommu: Enable first stage translation for passthrough device Zhenzhong Duan
                   ` (18 preceding siblings ...)
  2025-10-24  8:43 ` [PATCH v7 19/23] Workaround for ERRATA_772415_SPR17 Zhenzhong Duan
@ 2025-10-24  8:43 ` Zhenzhong Duan
  2025-10-24 16:32   ` Cédric Le Goater
  2025-10-24  8:43 ` [PATCH v7 21/23] intel_iommu: Add migration support with x-flts=on Zhenzhong Duan
                   ` (2 subsequent siblings)
  22 siblings, 1 reply; 48+ messages in thread
From: Zhenzhong Duan @ 2025-10-24  8:43 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, skolothumtho, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan

When doing dirty tracking or calculating the dirty tracking range, read-only
regions can be bypassed, because the corresponding DMA mappings are read-only
and never become dirty.

This optimizes dirty tracking a bit for passthrough devices.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/vfio/listener.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/hw/vfio/listener.c b/hw/vfio/listener.c
index 0862b2b834..cbd86c79af 100644
--- a/hw/vfio/listener.c
+++ b/hw/vfio/listener.c
@@ -828,7 +828,8 @@ static void vfio_dirty_tracking_update(MemoryListener *listener,
         container_of(listener, VFIODirtyRangesListener, listener);
     hwaddr iova, end;
 
-    if (!vfio_listener_valid_section(section, false, "tracking_update") ||
+    /* Bypass readonly section as it never becomes dirty */
+    if (!vfio_listener_valid_section(section, true, "tracking_update") ||
         !vfio_get_section_iova_range(dirty->bcontainer, section,
                                      &iova, &end, NULL)) {
         return;
@@ -1087,6 +1088,12 @@ static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
     if (!mr) {
         goto out_unlock;
     }
+
+    if (!(iotlb->perm & IOMMU_WO) || mr->readonly) {
+        rcu_read_unlock();
+        return;
+    }
+
     translated_addr = memory_region_get_ram_addr(mr) + xlat;
 
     ret = vfio_container_query_dirty_bitmap(bcontainer, iova, iotlb->addr_mask + 1,
@@ -1222,7 +1229,7 @@ static void vfio_listener_log_sync(MemoryListener *listener,
     int ret;
     Error *local_err = NULL;
 
-    if (vfio_listener_skipped_section(section, false)) {
+    if (vfio_listener_skipped_section(section, true)) {
         return;
     }
 
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v7 21/23] intel_iommu: Add migration support with x-flts=on
  2025-10-24  8:43 [PATCH v7 00/23] intel_iommu: Enable first stage translation for passthrough device Zhenzhong Duan
                   ` (19 preceding siblings ...)
  2025-10-24  8:43 ` [PATCH v7 20/23] vfio: Bypass readonly region for dirty tracking Zhenzhong Duan
@ 2025-10-24  8:43 ` Zhenzhong Duan
  2025-10-24  8:43 ` [PATCH v7 22/23] intel_iommu: Enable host device when x-flts=on in scalable mode Zhenzhong Duan
  2025-10-24  8:43 ` [PATCH v7 23/23] docs/devel: Add IOMMUFD nesting documentation Zhenzhong Duan
  22 siblings, 0 replies; 48+ messages in thread
From: Zhenzhong Duan @ 2025-10-24  8:43 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, skolothumtho, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan

When x-flts=on, we set up bindings to the nested HWPT on the host. After
migration, a VFIO device binds to the nesting parent HWPT by default, so
we need to re-establish the bindings to the nested HWPT, or else device
DMA will break.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/i386/intel_iommu.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index edd1416382..8fec61be3e 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -4360,6 +4360,13 @@ static int vtd_post_load(void *opaque, int version_id)
      */
     vtd_switch_address_space_all(iommu);
 
+    /*
+     * Bindings to the nested HWPT on the host are set up dynamically per
+     * the PASID entry configuration from the guest. After migration, we
+     * need to re-establish the bindings before restoring the device's DMA.
+     */
+    vtd_replay_pasid_bindings_all(iommu);
+
     return 0;
 }
 
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v7 22/23] intel_iommu: Enable host device when x-flts=on in scalable mode
  2025-10-24  8:43 [PATCH v7 00/23] intel_iommu: Enable first stage translation for passthrough device Zhenzhong Duan
                   ` (20 preceding siblings ...)
  2025-10-24  8:43 ` [PATCH v7 21/23] intel_iommu: Add migration support with x-flts=on Zhenzhong Duan
@ 2025-10-24  8:43 ` Zhenzhong Duan
  2025-10-24  8:43 ` [PATCH v7 23/23] docs/devel: Add IOMMUFD nesting documentation Zhenzhong Duan
  22 siblings, 0 replies; 48+ messages in thread
From: Zhenzhong Duan @ 2025-10-24  8:43 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, skolothumtho, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan

Now that all the infrastructure to support a passthrough device running
with first stage translation is in place, enable it.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
---
 hw/i386/intel_iommu.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 8fec61be3e..356623ef13 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -4979,6 +4979,8 @@ static bool vtd_check_hiod(IntelIOMMUState *s, VTDHostIOMMUDevice *vtd_hiod,
                    "when x-flts=on");
         return false;
     }
+
+    return true;
 #endif
 
     error_setg(errp,
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v7 23/23] docs/devel: Add IOMMUFD nesting documentation
  2025-10-24  8:43 [PATCH v7 00/23] intel_iommu: Enable first stage translation for passthrough device Zhenzhong Duan
                   ` (21 preceding siblings ...)
  2025-10-24  8:43 ` [PATCH v7 22/23] intel_iommu: Enable host device when x-flts=on in scalable mode Zhenzhong Duan
@ 2025-10-24  8:43 ` Zhenzhong Duan
  22 siblings, 0 replies; 48+ messages in thread
From: Zhenzhong Duan @ 2025-10-24  8:43 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, skolothumtho, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan

Add documentation about using an IOMMUFD backed VFIO device with intel_iommu
configured with x-flts=on.

Suggested-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 docs/devel/vfio-iommufd.rst | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/docs/devel/vfio-iommufd.rst b/docs/devel/vfio-iommufd.rst
index 3d1c11f175..f1e4d940e6 100644
--- a/docs/devel/vfio-iommufd.rst
+++ b/docs/devel/vfio-iommufd.rst
@@ -164,3 +164,28 @@ RAM discarding for mdev.
 
 ``vfio-ap`` and ``vfio-ccw`` devices don't have same issue as their backend
 devices are always mdev and RAM discarding is force enabled.
+
+Usage with intel_iommu with x-flts=on
+-------------------------------------
+
+Only IOMMUFD backed VFIO devices are supported when intel_iommu is configured
+with x-flts=on; for a legacy container backed VFIO device, the below error shows:
+
+.. code-block:: none
+
+    qemu-system-x86_64: -device vfio-pci,host=0000:02:00.0: vfio 0000:02:00.0: Failed to set vIOMMU: Need IOMMUFD backend when x-flts=on
+
+A VFIO device under a PCI bridge is unsupported; use a PCIe bridge if
+necessary, or else the below error shows:
+
+.. code-block:: none
+
+    qemu-system-x86_64: -device vfio-pci,host=0000:02:00.0,bus=bridge1,iommufd=iommufd0: vfio 0000:02:00.0: Failed to set vIOMMU: Host device under PCI bridge is unsupported when x-flts=on
+
+If the host IOMMU has ERRATA_772415_SPR17, kexec or reboot from
+"intel_iommu=on,sm_on" to "intel_iommu=on,sm_off" in the guest is also
+unsupported. Configure scalable mode off as below if it's not needed by the guest.
+
+.. code-block:: bash
+
+    -device intel-iommu,x-scalable-mode=off
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 08/23] vfio/iommufd: Force creating nesting parent HWPT
  2025-10-24  8:43 ` [PATCH v7 08/23] vfio/iommufd: Force creating nesting parent HWPT Zhenzhong Duan
@ 2025-10-24 16:23   ` Cédric Le Goater
  2025-10-28  6:00     ` Duan, Zhenzhong
  0 siblings, 1 reply; 48+ messages in thread
From: Cédric Le Goater @ 2025-10-24 16:23 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, eric.auger, mst, jasowang, peterx, ddutile, jgg,
	nicolinc, skolothumtho, joao.m.martins, clement.mathieu--drif,
	kevin.tian, yi.l.liu, chao.p.peng

On 10/24/25 10:43, Zhenzhong Duan wrote:
> Call pci_device_get_viommu_flags() to check whether the vIOMMU supports
> VIOMMU_FLAG_WANT_NESTING_PARENT.
> 
> If yes, create a nesting parent HWPT and add it to the container's hwpt_list,
> letting this parent HWPT cover the entire second stage mappings (GPA=>HPA).
> 
> This allows a VFIO passthrough device to directly attach to this default HWPT
> and then to use the system address space and its listener.
> 
> Introduce a vfio_device_get_viommu_flags_want_nesting() helper to facilitate
> this implementation.
> 
> It is safe to do so because a vIOMMU will be able to fail in set_iommu_device()
> call, if something else related to the VFIO device or vIOMMU isn't compatible.
> 
> Suggested-by: Nicolin Chen <nicolinc@nvidia.com>
> Suggested-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
> Reviewed-by: Eric Auger <eric.auger@redhat.com>
> Reviewed-by: Yi Liu <yi.l.liu@intel.com>
> ---
>   include/hw/vfio/vfio-device.h |  2 ++
>   hw/vfio/device.c              | 12 ++++++++++++
>   hw/vfio/iommufd.c             |  9 +++++++++
>   3 files changed, 23 insertions(+)
> 
> diff --git a/include/hw/vfio/vfio-device.h b/include/hw/vfio/vfio-device.h
> index a0b8fc2eb6..48d00c7bc4 100644
> --- a/include/hw/vfio/vfio-device.h
> +++ b/include/hw/vfio/vfio-device.h
> @@ -267,6 +267,8 @@ void vfio_device_prepare(VFIODevice *vbasedev, VFIOContainer *bcontainer,
>   
>   void vfio_device_unprepare(VFIODevice *vbasedev);
>   
> +bool vfio_device_get_viommu_flags_want_nesting(VFIODevice *vbasedev);
> +
>   int vfio_device_get_region_info(VFIODevice *vbasedev, int index,
>                                   struct vfio_region_info **info);
>   int vfio_device_get_region_info_type(VFIODevice *vbasedev, uint32_t type,
> diff --git a/hw/vfio/device.c b/hw/vfio/device.c
> index 5ed3103e72..be94947623 100644
> --- a/hw/vfio/device.c
> +++ b/hw/vfio/device.c
> @@ -23,6 +23,7 @@
>   
>   #include "hw/vfio/vfio-device.h"
>   #include "hw/vfio/pci.h"
> +#include "hw/iommu.h"
>   #include "hw/hw.h"
>   #include "trace.h"
>   #include "qapi/error.h"
> @@ -521,6 +522,17 @@ void vfio_device_unprepare(VFIODevice *vbasedev)
>       vbasedev->bcontainer = NULL;
>   }
>   
> +bool vfio_device_get_viommu_flags_want_nesting(VFIODevice *vbasedev)
> +{
> +    VFIOPCIDevice *vdev = vfio_pci_from_vfio_device(vbasedev);
> +
> +    if (vdev) {
> +        return !!(pci_device_get_viommu_flags(&vdev->parent_obj) &

Using PCI_DEVICE(vdev) would be more appropriate. It can come later.

Thanks,

C.



> +                  VIOMMU_FLAG_WANT_NESTING_PARENT);
> +    }
> +    return false;
> +}
> +
>   /*
>    * Traditional ioctl() based io
>    */
> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
> index 8de765c769..f9d0926274 100644
> --- a/hw/vfio/iommufd.c
> +++ b/hw/vfio/iommufd.c
> @@ -404,6 +404,15 @@ static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
>           flags = IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
>       }
>   
> +    /*
> +     * If vIOMMU requests VFIO's cooperation to create nesting parent HWPT,
> +     * force to create it so that it could be reused by vIOMMU to create
> +     * nested HWPT.
> +     */
> +    if (vfio_device_get_viommu_flags_want_nesting(vbasedev)) {
> +        flags |= IOMMU_HWPT_ALLOC_NEST_PARENT;
> +    }
> +
>       if (cpr_is_incoming()) {
>           hwpt_id = vbasedev->cpr.hwpt_id;
>           goto skip_alloc;



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 20/23] vfio: Bypass readonly region for dirty tracking
  2025-10-24  8:43 ` [PATCH v7 20/23] vfio: Bypass readonly region for dirty tracking Zhenzhong Duan
@ 2025-10-24 16:32   ` Cédric Le Goater
  2025-10-28  9:47     ` Duan, Zhenzhong
  0 siblings, 1 reply; 48+ messages in thread
From: Cédric Le Goater @ 2025-10-24 16:32 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, eric.auger, mst, jasowang, peterx, ddutile, jgg,
	nicolinc, skolothumtho, joao.m.martins, clement.mathieu--drif,
	kevin.tian, yi.l.liu, chao.p.peng

On 10/24/25 10:43, Zhenzhong Duan wrote:
> When doing dirty tracking or calculating the dirty tracking range, read-only
> regions can be bypassed, because the corresponding DMA mappings are read-only
> and never become dirty.
> 
> This optimizes dirty tracking a bit for passthrough devices.
> 
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>   hw/vfio/listener.c | 11 +++++++++--
>   1 file changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/vfio/listener.c b/hw/vfio/listener.c
> index 0862b2b834..cbd86c79af 100644
> --- a/hw/vfio/listener.c
> +++ b/hw/vfio/listener.c
> @@ -828,7 +828,8 @@ static void vfio_dirty_tracking_update(MemoryListener *listener,
>           container_of(listener, VFIODirtyRangesListener, listener);
>       hwaddr iova, end;
>   
> -    if (!vfio_listener_valid_section(section, false, "tracking_update") ||
> +    /* Bypass readonly section as it never become dirty */
> +    if (!vfio_listener_valid_section(section, true, "tracking_update") ||
>           !vfio_get_section_iova_range(dirty->bcontainer, section,
>                                        &iova, &end, NULL)) {
>           return;
> @@ -1087,6 +1088,12 @@ static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>       if (!mr) {
>           goto out_unlock;
>       }
> +
> +    if (!(iotlb->perm & IOMMU_WO) || mr->readonly) {


In case you resend, please add a trace event.

Anyhow,

Reviewed-by: Cédric Le Goater <clg@redhat.com>

Thanks,

C.


> +        rcu_read_unlock();
> +        return;
> +    }
> +
>       translated_addr = memory_region_get_ram_addr(mr) + xlat;
>   
>       ret = vfio_container_query_dirty_bitmap(bcontainer, iova, iotlb->addr_mask + 1,
> @@ -1222,7 +1229,7 @@ static void vfio_listener_log_sync(MemoryListener *listener,
>       int ret;
>       Error *local_err = NULL;
>   
> -    if (vfio_listener_skipped_section(section, false)) {
> +    if (vfio_listener_skipped_section(section, true)) {
>           return;
>       }
>   



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 12/23] intel_iommu: Add some macros and inline functions
  2025-10-24  8:43 ` [PATCH v7 12/23] intel_iommu: Add some macros and inline functions Zhenzhong Duan
@ 2025-10-24 16:39   ` Cédric Le Goater
  2025-10-28  6:01     ` Duan, Zhenzhong
  0 siblings, 1 reply; 48+ messages in thread
From: Cédric Le Goater @ 2025-10-24 16:39 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, eric.auger, mst, jasowang, peterx, ddutile, jgg,
	nicolinc, skolothumtho, joao.m.martins, clement.mathieu--drif,
	kevin.tian, yi.l.liu, chao.p.peng

On 10/24/25 10:43, Zhenzhong Duan wrote:
> Add some macros and inline functions that will be used by a following
> patch.
> 
> This patch also makes a cleanup to change the macros VTD_SM_PASID_ENTRY_DID
> and VTD_SM_PASID_ENTRY_FSPM to use extract64() just like what smmu does,
> because they are either used in following patches or used indirectly by
> the newly introduced inline functions. But we don't aim to change the huge
> amount of bit-mask style macro definitions in this patch; that should be
> in a separate patch.
> 
> Suggested-by: Eric Auger <eric.auger@redhat.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> Reviewed-by: Yi Liu <yi.l.liu@intel.com>
> ---
>   hw/i386/intel_iommu_internal.h |  8 +++++--
>   hw/i386/intel_iommu.c          | 38 +++++++++++++++++++++++++++-------
>   2 files changed, 37 insertions(+), 9 deletions(-)
> 
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> index 09edba81e2..df80af839d 100644
> --- a/hw/i386/intel_iommu_internal.h
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -642,10 +642,14 @@ typedef struct VTDPASIDCacheInfo {
>   #define VTD_SM_PASID_ENTRY_PT          (4ULL << 6)
>   
>   #define VTD_SM_PASID_ENTRY_AW          7ULL /* Adjusted guest-address-width */
> -#define VTD_SM_PASID_ENTRY_DID(val)    ((val) & VTD_DOMAIN_ID_MASK)
> +#define VTD_SM_PASID_ENTRY_DID(x)      extract64((x)->val[1], 0, 16)
>   
> -#define VTD_SM_PASID_ENTRY_FSPM          3ULL
>   #define VTD_SM_PASID_ENTRY_FSPTPTR       (~0xfffULL)
> +#define VTD_SM_PASID_ENTRY_SRE_BIT(x)    extract64((x)->val[2], 0, 1)
> +/* 00: 4-level paging, 01: 5-level paging, 10-11: Reserved */
> +#define VTD_SM_PASID_ENTRY_FSPM(x)       extract64((x)->val[2], 2, 2)
> +#define VTD_SM_PASID_ENTRY_WPE_BIT(x)    extract64((x)->val[2], 4, 1)
> +#define VTD_SM_PASID_ENTRY_EAFE_BIT(x)   extract64((x)->val[2], 7, 1)
>   
>   /* First Stage Paging Structure */
>   /* Masks for First Stage Paging Entry */
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 56abbb991d..871e6aad19 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -52,8 +52,7 @@
>   
>   /* pe operations */
>   #define VTD_PE_GET_TYPE(pe) ((pe)->val[0] & VTD_SM_PASID_ENTRY_PGTT)
> -#define VTD_PE_GET_FS_LEVEL(pe) \
> -    (4 + (((pe)->val[2] >> 2) & VTD_SM_PASID_ENTRY_FSPM))
> +#define VTD_PE_GET_FS_LEVEL(pe) (VTD_SM_PASID_ENTRY_FSPM(pe) + 4)
>   #define VTD_PE_GET_SS_LEVEL(pe) \
>       (2 + (((pe)->val[0] >> 2) & VTD_SM_PASID_ENTRY_AW))
>   
> @@ -837,6 +836,31 @@ static inline bool vtd_pe_type_check(IntelIOMMUState *s, VTDPASIDEntry *pe)
>       }
>   }
>   
> +static inline dma_addr_t vtd_pe_get_fspt_base(VTDPASIDEntry *pe)
> +{
> +    return pe->val[2] & VTD_SM_PASID_ENTRY_FSPTPTR;
> +}
> +
> +/*
> + * First stage IOVA address width: 48 bits for 4-level paging(FSPM=00)
> + *                                 57 bits for 5-level paging(FSPM=01)
> + */
> +static inline uint32_t vtd_pe_get_fs_aw(VTDPASIDEntry *pe)
> +{
> +    return 48 + VTD_SM_PASID_ENTRY_FSPM(pe) * 9;


Can't we use VTD_HOST_AW_48BIT here ?


Thanks,

C.


> +}
> +
> +static inline bool vtd_pe_pgtt_is_pt(VTDPASIDEntry *pe)
> +{
> +    return (VTD_PE_GET_TYPE(pe) == VTD_SM_PASID_ENTRY_PT);
> +}
> +
> +/* check if pgtt is first stage translation */
> +static inline bool vtd_pe_pgtt_is_fst(VTDPASIDEntry *pe)
> +{
> +    return (VTD_PE_GET_TYPE(pe) == VTD_SM_PASID_ENTRY_FST);
> +}
> +
>   static inline bool vtd_pdire_present(VTDPASIDDirEntry *pdire)
>   {
>       return pdire->val & 1;
> @@ -1625,7 +1649,7 @@ static uint16_t vtd_get_domain_id(IntelIOMMUState *s,
>   
>       if (s->root_scalable) {
>           vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
> -        return VTD_SM_PASID_ENTRY_DID(pe.val[1]);
> +        return VTD_SM_PASID_ENTRY_DID(&pe);
>       }
>   
>       return VTD_CONTEXT_ENTRY_DID(ce->hi);
> @@ -1707,7 +1731,7 @@ static bool vtd_dev_pt_enabled(IntelIOMMUState *s, VTDContextEntry *ce,
>                */
>               return false;
>           }
> -        return (VTD_PE_GET_TYPE(&pe) == VTD_SM_PASID_ENTRY_PT);
> +        return vtd_pe_pgtt_is_pt(&pe);
>       }
>   
>       return (vtd_ce_get_type(ce) == VTD_CONTEXT_TT_PASS_THROUGH);
> @@ -3146,9 +3170,9 @@ static void vtd_pasid_cache_sync_locked(gpointer key, gpointer value,
>           /* Fall through */
>       case VTD_INV_DESC_PASIDC_G_DSI:
>           if (pc_entry->valid) {
> -            did = VTD_SM_PASID_ENTRY_DID(pc_entry->pasid_entry.val[1]);
> +            did = VTD_SM_PASID_ENTRY_DID(&pc_entry->pasid_entry);
>           } else {
> -            did = VTD_SM_PASID_ENTRY_DID(pe.val[1]);
> +            did = VTD_SM_PASID_ENTRY_DID(&pe);
>           }
>           if (pc_info->did != did) {
>               return;
> @@ -5267,7 +5291,7 @@ static int vtd_pri_perform_implicit_invalidation(VTDAddressSpace *vtd_as,
>           return -EINVAL;
>       }
>       pgtt = VTD_PE_GET_TYPE(&pe);
> -    domain_id = VTD_SM_PASID_ENTRY_DID(pe.val[1]);
> +    domain_id = VTD_SM_PASID_ENTRY_DID(&pe);
>       ret = 0;
>       switch (pgtt) {
>       case VTD_SM_PASID_ENTRY_FST:



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 17/23] iommufd: Introduce a helper function to extract vendor capabilities
  2025-10-24  8:43 ` [PATCH v7 17/23] iommufd: Introduce a helper function to extract vendor capabilities Zhenzhong Duan
@ 2025-10-24 16:44   ` Cédric Le Goater
  2025-10-28  9:43     ` Duan, Zhenzhong
  2025-10-24 17:34   ` Cédric Le Goater
  1 sibling, 1 reply; 48+ messages in thread
From: Cédric Le Goater @ 2025-10-24 16:44 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, eric.auger, mst, jasowang, peterx, ddutile, jgg,
	nicolinc, skolothumtho, joao.m.martins, clement.mathieu--drif,
	kevin.tian, yi.l.liu, chao.p.peng

On 10/24/25 10:43, Zhenzhong Duan wrote:
> In VFIO core, we call iommufd_backend_get_device_info() to return vendor
> specific hardware information data, but it's not good to extract this raw
> data in VFIO core.
> 
> Introduce host_iommu_extract_quirks() to help extract the raw data and
> return a bitmap. It lives in iommufd.c because that is the place defining
> iommufd_backend_get_device_info().
> 
> The other choice is to put the vendor data extraction code in each vendor
> vIOMMU emulation file, but that would mix vIOMMU emulation and host IOMMU
> extraction code in those files and would also need a new callback in
> PCIIOMMUOps. So we choose the simpler way as above.
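> 
> [Editor's note: a sketch of how a sub-system such as VFIO might consume the
> returned bitmap. The caller below is hypothetical and not part of this
> patch; only the quirk bit value mirrors the patch.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mirrors HOST_IOMMU_QUIRK_NESTING_PARENT_BYPASS_RO from the patch */
#define HOST_IOMMU_QUIRK_NESTING_PARENT_BYPASS_RO (1ULL << 0)

/*
 * Hypothetical consumer: decide whether read-only mappings must be
 * bypassed when populating a nesting parent HWPT.
 */
bool need_bypass_ro(uint64_t quirks)
{
    return quirks & HOST_IOMMU_QUIRK_NESTING_PARENT_BYPASS_RO;
}
```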
> 
> Suggested-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
>   include/hw/iommu.h                 |  5 +++++
>   include/system/host_iommu_device.h | 15 +++++++++++++++
>   backends/iommufd.c                 | 13 +++++++++++++
>   3 files changed, 33 insertions(+)
> 
> diff --git a/include/hw/iommu.h b/include/hw/iommu.h
> index 9b8bb94fc2..6d61410703 100644
> --- a/include/hw/iommu.h
> +++ b/include/hw/iommu.h
> @@ -22,4 +22,9 @@ enum viommu_flags {
>       VIOMMU_FLAG_WANT_NESTING_PARENT = BIT_ULL(0),
>   };
>   
> +/* Host IOMMU quirks. Extracted from host IOMMU capabilities */
> +enum host_iommu_quirks {
> +    HOST_IOMMU_QUIRK_NESTING_PARENT_BYPASS_RO = BIT_ULL(0),
> +};
> +
>   #endif /* HW_IOMMU_H */
> diff --git a/include/system/host_iommu_device.h b/include/system/host_iommu_device.h
> index ab849a4a82..9ae7f4cc6d 100644
> --- a/include/system/host_iommu_device.h
> +++ b/include/system/host_iommu_device.h
> @@ -39,6 +39,21 @@ typedef struct HostIOMMUDeviceCaps {
>       uint64_t hw_caps;
>       VendorCaps vendor_caps;
>   } HostIOMMUDeviceCaps;
> +
> +/**
> + * host_iommu_extract_quirk: Extract host IOMMU quirks
> + *
> + * This function converts @type specific hardware information data
> + * into a standard bitmap format.
> + *
> + * @type: IOMMU Hardware Info Types
> + *
> + * @VendorCaps: IOMMU @type specific hardware information data
> + *
> + * Returns: bitmap with each representing a host IOMMU quirk defined in
> + * enum host_iommu_quirks
> + */
> +uint64_t host_iommu_extract_quirks(uint32_t type, VendorCaps *caps);

..._get_quirks() sounds nicer. This is minor.


Thanks,

C.




>   #endif
>   
>   #define TYPE_HOST_IOMMU_DEVICE "host-iommu-device"
> diff --git a/backends/iommufd.c b/backends/iommufd.c
> index 086bd67aea..61b991ec53 100644
> --- a/backends/iommufd.c
> +++ b/backends/iommufd.c
> @@ -19,6 +19,7 @@
>   #include "migration/cpr.h"
>   #include "monitor/monitor.h"
>   #include "trace.h"
> +#include "hw/iommu.h"
>   #include "hw/vfio/vfio-device.h"
>   #include <sys/ioctl.h>
>   #include <linux/iommufd.h>
> @@ -411,6 +412,18 @@ bool iommufd_backend_get_device_info(IOMMUFDBackend *be, uint32_t devid,
>       return true;
>   }
>   
> +uint64_t host_iommu_extract_quirks(uint32_t type, VendorCaps *caps)
> +{
> +    uint64_t quirks = 0;
> +
> +    if (type == IOMMU_HW_INFO_TYPE_INTEL_VTD &&
> +        caps->vtd.flags & IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17) {
> +        quirks |= HOST_IOMMU_QUIRK_NESTING_PARENT_BYPASS_RO;
> +    }
> +
> +    return quirks;
> +}
> +
>   bool iommufd_backend_invalidate_cache(IOMMUFDBackend *be, uint32_t id,
>                                         uint32_t data_type, uint32_t entry_len,
>                                         uint32_t *entry_num, void *data,



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 13/23] intel_iommu: Bind/unbind guest page table to host
  2025-10-24  8:43 ` [PATCH v7 13/23] intel_iommu: Bind/unbind guest page table to host Zhenzhong Duan
@ 2025-10-24 17:01   ` Cédric Le Goater
  2025-10-24 17:33   ` Cédric Le Goater
  1 sibling, 0 replies; 48+ messages in thread
From: Cédric Le Goater @ 2025-10-24 17:01 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, eric.auger, mst, jasowang, peterx, ddutile, jgg,
	nicolinc, skolothumtho, joao.m.martins, clement.mathieu--drif,
	kevin.tian, yi.l.liu, chao.p.peng, Yi Sun

On 10/24/25 10:43, Zhenzhong Duan wrote:
> This captures guest PASID table entry modifications and propagates the
> changes to the host to attach a HWPT whose type is determined by the
> guest IOMMU PGTT configuration.
> 
> When PGTT=PT, attach PASID_0 to a second stage HWPT (GPA->HPA).
> When PGTT=FST, attach PASID_0 to a nested HWPT whose nesting parent
> HWPT comes from VFIO.
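> 
> [Editor's note: the attach decision described above can be reduced to a
> small standalone model. It is hypothetical and simplified; the names below
> do not come from the patch.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum pgtt { PGTT_FST, PGTT_PT, PGTT_OTHER };

/*
 * PGTT=PT attaches PASID_0 to the existing second stage (nesting parent)
 * HWPT; PGTT=FST attaches it to a freshly allocated nested HWPT. Any
 * other PGTT is rejected; returns false in that case.
 */
bool select_hwpt(enum pgtt t, uint32_t parent_hwpt_id,
                 uint32_t nested_hwpt_id, uint32_t *out)
{
    switch (t) {
    case PGTT_PT:
        *out = parent_hwpt_id;
        return true;
    case PGTT_FST:
        *out = nested_hwpt_id;
        return true;
    default:
        return false;
    }
}
```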
> 
> Co-Authored-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>   include/hw/i386/intel_iommu.h |   1 +
>   hw/i386/intel_iommu.c         | 150 +++++++++++++++++++++++++++++++++-
>   hw/i386/trace-events          |   3 +
>   3 files changed, 151 insertions(+), 3 deletions(-)
> 
> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> index 3758ac239c..b5f8a9fc29 100644
> --- a/include/hw/i386/intel_iommu.h
> +++ b/include/hw/i386/intel_iommu.h
> @@ -104,6 +104,7 @@ struct VTDAddressSpace {
>       PCIBus *bus;
>       uint8_t devfn;
>       uint32_t pasid;
> +    uint32_t fs_hwpt;
>       AddressSpace as;
>       IOMMUMemoryRegion iommu;
>       MemoryRegion root;          /* The root container of the device */
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 871e6aad19..3789a36147 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -20,6 +20,7 @@
>    */
>   
>   #include "qemu/osdep.h"
> +#include CONFIG_DEVICES /* CONFIG_IOMMUFD */
>   #include "qemu/error-report.h"
>   #include "qemu/main-loop.h"
>   #include "qapi/error.h"
> @@ -42,6 +43,9 @@
>   #include "migration/vmstate.h"
>   #include "trace.h"
>   #include "system/iommufd.h"
> +#ifdef CONFIG_IOMMUFD
> +#include <linux/iommufd.h>
> +#endif


Exposing IOMMUFD in the Intel vIOMMU is unexpected. Initially, we
introduced HostIOMMUDeviceClass to avoid exposing the IOMMU backends.
Are we OK to bypass this abstract layer now ?


Thanks,

C.



>   /* context entry operations */
>   #define PASID_0    0
> @@ -87,6 +91,7 @@ struct vtd_iotlb_key {
>   
>   static void vtd_address_space_refresh_all(IntelIOMMUState *s);
>   static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
> +static int vtd_bind_guest_pasid(VTDAddressSpace *vtd_as, Error **errp);
>   
>   static void vtd_pasid_cache_reset_locked(IntelIOMMUState *s)
>   {
> @@ -98,7 +103,11 @@ static void vtd_pasid_cache_reset_locked(IntelIOMMUState *s)
>       g_hash_table_iter_init(&as_it, s->vtd_address_spaces);
>       while (g_hash_table_iter_next(&as_it, NULL, (void **)&vtd_as)) {
>           VTDPASIDCacheEntry *pc_entry = &vtd_as->pasid_cache_entry;
> -        pc_entry->valid = false;
> +        if (pc_entry->valid) {
> +            pc_entry->valid = false;
> +            /* It's fatal to get failure during reset */
> +            vtd_bind_guest_pasid(vtd_as, &error_fatal);
> +        }
>       }
>   }
>   
> @@ -2380,6 +2389,128 @@ static void vtd_context_global_invalidate(IntelIOMMUState *s)
>       vtd_iommu_replay_all(s);
>   }
>   
> +#ifdef CONFIG_IOMMUFD
> +static int vtd_create_fs_hwpt(HostIOMMUDeviceIOMMUFD *idev,
> +                              VTDPASIDEntry *pe, uint32_t *fs_hwpt,
> +                              Error **errp)
> +{
> +    struct iommu_hwpt_vtd_s1 vtd = {};
> +
> +    vtd.flags = (VTD_SM_PASID_ENTRY_SRE_BIT(pe) ? IOMMU_VTD_S1_SRE : 0) |
> +                (VTD_SM_PASID_ENTRY_WPE_BIT(pe) ? IOMMU_VTD_S1_WPE : 0) |
> +                (VTD_SM_PASID_ENTRY_EAFE_BIT(pe) ? IOMMU_VTD_S1_EAFE : 0);
> +    vtd.addr_width = vtd_pe_get_fs_aw(pe);
> +    vtd.pgtbl_addr = (uint64_t)vtd_pe_get_fspt_base(pe);
> +
> +    return !iommufd_backend_alloc_hwpt(idev->iommufd, idev->devid,
> +                                       idev->hwpt_id, 0, IOMMU_HWPT_DATA_VTD_S1,
> +                                       sizeof(vtd), &vtd, fs_hwpt, errp);
> +}
> +
> +static void vtd_destroy_old_fs_hwpt(HostIOMMUDeviceIOMMUFD *idev,
> +                                    VTDAddressSpace *vtd_as)
> +{
> +    if (!vtd_as->fs_hwpt) {
> +        return;
> +    }
> +    iommufd_backend_free_id(idev->iommufd, vtd_as->fs_hwpt);
> +    vtd_as->fs_hwpt = 0;
> +}
> +
> +static int vtd_device_attach_iommufd(VTDHostIOMMUDevice *vtd_hiod,
> +                                     VTDAddressSpace *vtd_as, Error **errp)
> +{
> +    HostIOMMUDeviceIOMMUFD *idev = HOST_IOMMU_DEVICE_IOMMUFD(vtd_hiod->hiod);
> +    VTDPASIDEntry *pe = &vtd_as->pasid_cache_entry.pasid_entry;
> +    uint32_t hwpt_id;
> +    bool ret;
> +
> +    /*
> +     * We can get here only if flts=on, the supported PGTT is FST and PT.
> +     * Catch invalid PGTT when processing invalidation request to avoid
> +     * attaching to wrong hwpt.
> +     */
> +    if (!vtd_pe_pgtt_is_fst(pe) && !vtd_pe_pgtt_is_pt(pe)) {
> +        error_setg(errp, "Invalid PGTT type");
> +        return -EINVAL;
> +    }
> +
> +    if (vtd_pe_pgtt_is_pt(pe)) {
> +        hwpt_id = idev->hwpt_id;
> +    } else if (vtd_create_fs_hwpt(idev, pe, &hwpt_id, errp)) {
> +        return -EINVAL;
> +    }
> +
> +    ret = host_iommu_device_iommufd_attach_hwpt(idev, hwpt_id, errp);
> +    trace_vtd_device_attach_hwpt(idev->devid, vtd_as->pasid, hwpt_id, !ret);
> +    if (ret) {
> +        /* Destroy old fs_hwpt if it's a replacement */
> +        vtd_destroy_old_fs_hwpt(idev, vtd_as);
> +        if (vtd_pe_pgtt_is_fst(pe)) {
> +            vtd_as->fs_hwpt = hwpt_id;
> +        }
> +    } else if (vtd_pe_pgtt_is_fst(pe)) {
> +        iommufd_backend_free_id(idev->iommufd, hwpt_id);
> +    }
> +
> +    return !ret;
> +}
> +
> +static int vtd_device_detach_iommufd(VTDHostIOMMUDevice *vtd_hiod,
> +                                     VTDAddressSpace *vtd_as, Error **errp)
> +{
> +    HostIOMMUDeviceIOMMUFD *idev = HOST_IOMMU_DEVICE_IOMMUFD(vtd_hiod->hiod);
> +    IntelIOMMUState *s = vtd_as->iommu_state;
> +    uint32_t pasid = vtd_as->pasid;
> +    bool ret;
> +
> +    if (s->dmar_enabled && s->root_scalable) {
> +        ret = host_iommu_device_iommufd_detach_hwpt(idev, errp);
> +        trace_vtd_device_detach_hwpt(idev->devid, pasid, !ret);
> +    } else {
> +        /*
> +         * If DMAR remapping is disabled or guest switches to legacy mode,
> +         * we fallback to the default HWPT which contains shadow page table.
> +         * So guest DMA could still work.
> +         */
> +        ret = host_iommu_device_iommufd_attach_hwpt(idev, idev->hwpt_id, errp);
> +        trace_vtd_device_reattach_def_hwpt(idev->devid, pasid, idev->hwpt_id,
> +                                           !ret);
> +    }
> +
> +    if (ret) {
> +        vtd_destroy_old_fs_hwpt(idev, vtd_as);
> +    }
> +
> +    return !ret;
> +}
> +
> +static int vtd_bind_guest_pasid(VTDAddressSpace *vtd_as, Error **errp)
> +{
> +    VTDPASIDCacheEntry *pc_entry = &vtd_as->pasid_cache_entry;
> +    VTDHostIOMMUDevice *vtd_hiod = vtd_find_hiod_iommufd(vtd_as);
> +    int ret;
> +
> +    /* Ignore emulated device or legacy VFIO backed device */
> +    if (!vtd_hiod) {
> +        return 0;
> +    }
> +
> +    if (pc_entry->valid) {
> +        ret = vtd_device_attach_iommufd(vtd_hiod, vtd_as, errp);
> +    } else {
> +        ret = vtd_device_detach_iommufd(vtd_hiod, vtd_as, errp);
> +    }
> +
> +    return ret;
> +}
> +#else
> +static int vtd_bind_guest_pasid(VTDAddressSpace *vtd_as, Error **errp)
> +{
> +    return 0;
> +}
> +#endif
> +
>   /* Do a context-cache device-selective invalidation.
>    * @func_mask: FM field after shifting
>    */
> @@ -3134,6 +3265,8 @@ static void vtd_pasid_cache_sync_locked(gpointer key, gpointer value,
>       VTDPASIDEntry pe;
>       IOMMUNotifier *n;
>       uint16_t did;
> +    const char *err_prefix;
> +    Error *local_err = NULL;
>   
>       if (vtd_dev_get_pe_from_pasid(vtd_as, &pe)) {
>           if (!pc_entry->valid) {
> @@ -3154,7 +3287,9 @@ static void vtd_pasid_cache_sync_locked(gpointer key, gpointer value,
>               vtd_address_space_unmap(vtd_as, n);
>           }
>           vtd_switch_address_space(vtd_as);
> -        return;
> +
> +        err_prefix = "Detaching from HWPT failed: ";
> +        goto do_bind_unbind;
>       }
>   
>       /*
> @@ -3182,12 +3317,21 @@ static void vtd_pasid_cache_sync_locked(gpointer key, gpointer value,
>       if (!pc_entry->valid) {
>           pc_entry->pasid_entry = pe;
>           pc_entry->valid = true;
> -    } else if (!vtd_pasid_entry_compare(&pe, &pc_entry->pasid_entry)) {
> +        err_prefix = "Attaching to HWPT failed: ";
> +    } else if (vtd_pasid_entry_compare(&pe, &pc_entry->pasid_entry)) {
> +        err_prefix = "Replacing HWPT attachment failed: ";
> +    } else {
>           return;
>       }
>   
>       vtd_switch_address_space(vtd_as);
>       vtd_address_space_sync(vtd_as);
> +
> +do_bind_unbind:
> +    /* TODO: Fault event injection into guest, report error to QEMU for now */
> +    if (vtd_bind_guest_pasid(vtd_as, &local_err)) {
> +        error_reportf_err(local_err, "%s", err_prefix);
> +    }
>   }
>   
>   static void vtd_pasid_cache_sync(IntelIOMMUState *s, VTDPASIDCacheInfo *pc_info)
> diff --git a/hw/i386/trace-events b/hw/i386/trace-events
> index b704f4f90c..5a3ee1cf64 100644
> --- a/hw/i386/trace-events
> +++ b/hw/i386/trace-events
> @@ -73,6 +73,9 @@ vtd_warn_invalid_qi_tail(uint16_t tail) "tail 0x%"PRIx16
>   vtd_warn_ir_vector(uint16_t sid, int index, int vec, int target) "sid 0x%"PRIx16" index %d vec %d (should be: %d)"
>   vtd_warn_ir_trigger(uint16_t sid, int index, int trig, int target) "sid 0x%"PRIx16" index %d trigger %d (should be: %d)"
>   vtd_reset_exit(void) ""
> +vtd_device_attach_hwpt(uint32_t dev_id, uint32_t pasid, uint32_t hwpt_id, int ret) "dev_id %d pasid %d hwpt_id %d, ret: %d"
> +vtd_device_detach_hwpt(uint32_t dev_id, uint32_t pasid, int ret) "dev_id %d pasid %d ret: %d"
> +vtd_device_reattach_def_hwpt(uint32_t dev_id, uint32_t pasid, uint32_t hwpt_id, int ret) "dev_id %d pasid %d hwpt_id %d, ret: %d"
>   
>   # amd_iommu.c
>   amdvi_evntlog_fail(uint64_t addr, uint32_t head) "error: fail to write at addr 0x%"PRIx64" +  offset 0x%"PRIx32



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 05/23] hw/pci: Introduce pci_device_get_viommu_flags()
  2025-10-24  8:43 ` [PATCH v7 05/23] hw/pci: Introduce pci_device_get_viommu_flags() Zhenzhong Duan
@ 2025-10-24 17:18   ` Cédric Le Goater
  2025-10-28  6:57     ` Duan, Zhenzhong
  0 siblings, 1 reply; 48+ messages in thread
From: Cédric Le Goater @ 2025-10-24 17:18 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, eric.auger, mst, jasowang, peterx, ddutile, jgg,
	nicolinc, skolothumtho, joao.m.martins, clement.mathieu--drif,
	kevin.tian, yi.l.liu, chao.p.peng

On 10/24/25 10:43, Zhenzhong Duan wrote:
> Introduce a new optional PCIIOMMUOps callback, get_viommu_flags(), which
> allows retrieving flags exposed by a vIOMMU. The first planned vIOMMU
> device flag is VIOMMU_FLAG_WANT_NESTING_PARENT, which advertises support
> for the HW nested stage translation scheme and requests cooperation from
> other sub-systems, such as VFIO, to create the nesting parent HWPT.
> 
> pci_device_get_viommu_flags() is a wrapper that can be called on a PCI
> device potentially protected by a vIOMMU.
> 
> get_viommu_flags() is designed to return a 64-bit bitmap of purely vIOMMU
> flags which are only determined by the user's configuration, with no host
> capabilities involved. The reasons are:
> 
> 1. host may have heterogeneous IOMMUs, each with different capabilities
> 2. this is migration friendly, the return value is consistent between
>     source and target.
> 
> Note that this op will be invoked at the attach_device() stage, at which
> point host IOMMU capabilities are not yet forwarded to the vIOMMU through
> the set_iommu_device() callback, which runs after attach_device().
> 
> See below sequence:
> 
>    vfio_device_attach():
>        iommufd_cdev_attach():
>            pci_device_get_viommu_flags() for HW nesting cap
>            create a nesting parent HWPT
>            attach device to the HWPT
>            vfio_device_hiod_create_and_realize() creating hiod
>    ...
>    pci_device_set_iommu_device(hiod)
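> 
> [Editor's note: a vIOMMU would typically wire the new callback roughly as
> below. This is a sketch under assumptions: the state struct and property
> name are simplified stand-ins, not the actual intel_iommu code.]

```c
#include <assert.h>
#include <stdint.h>

#define BIT_ULL(n) (1ULL << (n))
#define VIOMMU_FLAG_WANT_NESTING_PARENT BIT_ULL(0)

/* Simplified stand-in for the vIOMMU state passed via pci_setup_iommu() */
typedef struct {
    int flts; /* user-configured first stage translation property */
} FakeIOMMUState;

/*
 * Hypothetical get_viommu_flags() implementation: the result depends only
 * on user configuration, never on host IOMMU capabilities, which keeps it
 * consistent across migration source and target.
 */
uint64_t fake_get_viommu_flags(void *opaque)
{
    FakeIOMMUState *s = opaque;
    return s->flts ? VIOMMU_FLAG_WANT_NESTING_PARENT : 0;
}
```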
> 
> Suggested-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
> Reviewed-by: Eric Auger <eric.auger@redhat.com>
> Reviewed-by: Yi Liu <yi.l.liu@intel.com>
> ---
>   MAINTAINERS          |  1 +
>   include/hw/iommu.h   | 25 +++++++++++++++++++++++++


Hmm, why not under include/hw/pci/ ? Was this discussed ?


Reviewed-by: Cédric Le Goater <clg@redhat.com>

Thanks,

C.



>   include/hw/pci/pci.h | 22 ++++++++++++++++++++++
>   hw/pci/pci.c         | 11 +++++++++++
>   4 files changed, 59 insertions(+)
>   create mode 100644 include/hw/iommu.h
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 36eef27b41..d94fbcbdfb 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -2338,6 +2338,7 @@ F: include/system/iommufd.h
>   F: backends/host_iommu_device.c
>   F: include/system/host_iommu_device.h
>   F: include/qemu/chardev_open.h
> +F: include/hw/iommu.h
>   F: util/chardev_open.c
>   F: docs/devel/vfio-iommufd.rst
>   
> diff --git a/include/hw/iommu.h b/include/hw/iommu.h
> new file mode 100644
> index 0000000000..9b8bb94fc2
> --- /dev/null
> +++ b/include/hw/iommu.h
> @@ -0,0 +1,25 @@
> +/*
> + * General vIOMMU flags
> + *
> + * Copyright (C) 2025 Intel Corporation.
> + *
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + */
> +
> +#ifndef HW_IOMMU_H
> +#define HW_IOMMU_H
> +
> +#include "qemu/bitops.h"
> +
> +/*
> + * Theoretical vIOMMU flags. Only determined by the vIOMMU device properties and
> + * independent on the actual host IOMMU capabilities they may depend on. Each
> + * flag can be an expectation or request to other sub-system or just a pure
> + * vIOMMU capability. vIOMMU can choose which flags to expose.
> + */
> +enum viommu_flags {
> +    /* vIOMMU needs nesting parent HWPT to create nested HWPT */
> +    VIOMMU_FLAG_WANT_NESTING_PARENT = BIT_ULL(0),
> +};
> +
> +#endif /* HW_IOMMU_H */
> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> index bde9dca8e2..cf99b5bb68 100644
> --- a/include/hw/pci/pci.h
> +++ b/include/hw/pci/pci.h
> @@ -462,6 +462,18 @@ typedef struct PCIIOMMUOps {
>        * @devfn: device and function number of the PCI device.
>        */
>       void (*unset_iommu_device)(PCIBus *bus, void *opaque, int devfn);
> +    /**
> +     * @get_viommu_flags: get vIOMMU flags
> +     *
> +     * Optional callback, if not implemented, then vIOMMU doesn't support
> +     * exposing flags to other sub-system, e.g., VFIO.
> +     *
> +     * @opaque: the data passed to pci_setup_iommu().
> +     *
> +     * Returns: bitmap with each representing a vIOMMU flag defined in
> +     * enum viommu_flags.
> +     */
> +    uint64_t (*get_viommu_flags)(void *opaque);
>       /**
>        * @get_iotlb_info: get properties required to initialize a device IOTLB.
>        *
> @@ -644,6 +656,16 @@ bool pci_device_set_iommu_device(PCIDevice *dev, HostIOMMUDevice *hiod,
>                                    Error **errp);
>   void pci_device_unset_iommu_device(PCIDevice *dev);
>   
> +/**
> + * pci_device_get_viommu_flags: get vIOMMU flags.
> + *
> + * Returns: bitmap with each representing a vIOMMU flag defined in
> + * enum viommu_flags. Or 0 if vIOMMU doesn't report any.
> + *
> + * @dev: PCI device pointer.
> + */
> +uint64_t pci_device_get_viommu_flags(PCIDevice *dev);
> +
>   /**
>    * pci_iommu_get_iotlb_info: get properties required to initialize a
>    * device IOTLB.
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index d0e81651aa..c9932c87e3 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -3010,6 +3010,17 @@ void pci_device_unset_iommu_device(PCIDevice *dev)
>       }
>   }
>   
> +uint64_t pci_device_get_viommu_flags(PCIDevice *dev)
> +{
> +    PCIBus *iommu_bus;
> +
> +    pci_device_get_iommu_bus_devfn(dev, &iommu_bus, NULL, NULL);
> +    if (iommu_bus && iommu_bus->iommu_ops->get_viommu_flags) {
> +        return iommu_bus->iommu_ops->get_viommu_flags(iommu_bus->iommu_opaque);
> +    }
> +    return 0;
> +}
> +
>   int pci_pri_request_page(PCIDevice *dev, uint32_t pasid, bool priv_req,
>                            bool exec_req, hwaddr addr, bool lpig,
>                            uint16_t prgi, bool is_read, bool is_write)



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 10/23] intel_iommu: Check for compatibility with IOMMUFD backed device when x-flts=on
  2025-10-24  8:43 ` [PATCH v7 10/23] intel_iommu: Check for compatibility with IOMMUFD backed " Zhenzhong Duan
@ 2025-10-24 17:29   ` Cédric Le Goater
  2025-10-29  7:37     ` Duan, Zhenzhong
  0 siblings, 1 reply; 48+ messages in thread
From: Cédric Le Goater @ 2025-10-24 17:29 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, eric.auger, mst, jasowang, peterx, ddutile, jgg,
	nicolinc, skolothumtho, joao.m.martins, clement.mathieu--drif,
	kevin.tian, yi.l.liu, chao.p.peng

On 10/24/25 10:43, Zhenzhong Duan wrote:
> When vIOMMU is configured with x-flts=on in scalable mode, the first stage
> page table is passed to the host to construct a nested page table for
> passthrough devices.
> 
> We need to check the compatibility of some critical IOMMU capabilities
> between vIOMMU and host IOMMU to ensure the guest first stage page table
> can be used by the host.
> 
> For instance, if vIOMMU supports first stage 1GB large page mapping but the
> host does not, then this IOMMUFD backed device should fail to attach.
> 
> Even if the checks pass, for now we willingly reject the association because
> all the bits are not there yet; this will be relaxed at the end of this
> series.
> 
> Note vIOMMU has exposed the IOMMU_HWPT_ALLOC_NEST_PARENT flag to force VFIO
> core to create a nesting parent HWPT; if the host doesn't support nested
> translation, that creation will fail. So there is no need to check the
> nested capability here.
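> 
> [Editor's note: the 1GB large page check can be modeled as below. The
> capability bit position is an assumption for illustration, taken from the
> VT-d FL1GP field, and the helper itself is hypothetical.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit position (VT-d capability register FL1GP); illustrative only */
#define VTD_CAP_FS1GP (1ULL << 56)

/*
 * A guest-visible feature (fs1gp) is only compatible when the host
 * capability register reports the matching support.
 */
bool fs1gp_compatible(bool guest_fs1gp, uint64_t host_cap_reg)
{
    return !guest_fs1gp || (host_cap_reg & VTD_CAP_FS1GP);
}
```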
> 
> Reviewed-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> Reviewed-by: Eric Auger <eric.auger@redhat.com>
> ---
>   hw/i386/intel_iommu.c | 25 ++++++++++++++++++++++++-
>   1 file changed, 24 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index ce4c54165e..7d908cdb58 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -4636,8 +4636,31 @@ static bool vtd_check_hiod(IntelIOMMUState *s, HostIOMMUDevice *hiod,
>           return true;
>       }
>   
> +#ifdef CONFIG_IOMMUFD


Before using CONFIG_IOMMUFD, '#include CONFIG_DEVICES' should be done
first. But as said earlier, this is something we wanted to avoid in the
intel-iommu model which can have different host IOMMU backends.

At first glance, it seems to me that these changes take the fast path
and avoid an abstract layer. Is it too complex to keep on using
HostIOMMUDeviceClass ?


Thanks,

C.





> +    struct HostIOMMUDeviceCaps *caps = &hiod->caps;
> +    struct iommu_hw_info_vtd *vtd = &caps->vendor_caps.vtd;
> +
> +    /* Remaining checks are all first stage translation specific */
> +    if (!object_dynamic_cast(OBJECT(hiod), TYPE_HOST_IOMMU_DEVICE_IOMMUFD)) {
> +        error_setg(errp, "Need IOMMUFD backend when x-flts=on");
> +        return false;
> +    }
> +
> +    if (caps->type != IOMMU_HW_INFO_TYPE_INTEL_VTD) {
> +        error_setg(errp, "Incompatible host platform IOMMU type %d",
> +                   caps->type);
> +        return false;
> +    }
> +
> +    if (s->fs1gp && !(vtd->cap_reg & VTD_CAP_FS1GP)) {
> +        error_setg(errp,
> +                   "First stage 1GB large page is unsupported by host IOMMU");
> +        return false;
> +    }
> +#endif
> +
>       error_setg(errp,
> -               "host device is uncompatible with first stage translation");
> +               "host IOMMU is incompatible with guest first stage translation");
>       return false;
>   }
>   



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 13/23] intel_iommu: Bind/unbind guest page table to host
  2025-10-24  8:43 ` [PATCH v7 13/23] intel_iommu: Bind/unbind guest page table to host Zhenzhong Duan
  2025-10-24 17:01   ` Cédric Le Goater
@ 2025-10-24 17:33   ` Cédric Le Goater
  2025-10-29  9:56     ` Duan, Zhenzhong
  1 sibling, 1 reply; 48+ messages in thread
From: Cédric Le Goater @ 2025-10-24 17:33 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, eric.auger, mst, jasowang, peterx, ddutile, jgg,
	nicolinc, skolothumtho, joao.m.martins, clement.mathieu--drif,
	kevin.tian, yi.l.liu, chao.p.peng, Yi Sun

On 10/24/25 10:43, Zhenzhong Duan wrote:
> This captures guest PASID table entry modifications and propagates the
> changes to the host to attach a HWPT whose type is determined by the
> guest IOMMU PGTT configuration.
> 
> When PGTT=PT, attach PASID_0 to a second stage HWPT (GPA->HPA).
> When PGTT=FST, attach PASID_0 to a nested HWPT whose nesting parent
> HWPT comes from VFIO.
> 
> Co-Authored-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>   include/hw/i386/intel_iommu.h |   1 +
>   hw/i386/intel_iommu.c         | 150 +++++++++++++++++++++++++++++++++-
>   hw/i386/trace-events          |   3 +
>   3 files changed, 151 insertions(+), 3 deletions(-)
> 
> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> index 3758ac239c..b5f8a9fc29 100644
> --- a/include/hw/i386/intel_iommu.h
> +++ b/include/hw/i386/intel_iommu.h
> @@ -104,6 +104,7 @@ struct VTDAddressSpace {
>       PCIBus *bus;
>       uint8_t devfn;
>       uint32_t pasid;
> +    uint32_t fs_hwpt;
>       AddressSpace as;
>       IOMMUMemoryRegion iommu;
>       MemoryRegion root;          /* The root container of the device */
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 871e6aad19..3789a36147 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -20,6 +20,7 @@
>    */
>   
>   #include "qemu/osdep.h"
> +#include CONFIG_DEVICES /* CONFIG_IOMMUFD */
>   #include "qemu/error-report.h"
>   #include "qemu/main-loop.h"
>   #include "qapi/error.h"
> @@ -42,6 +43,9 @@
>   #include "migration/vmstate.h"
>   #include "trace.h"
>   #include "system/iommufd.h"
> +#ifdef CONFIG_IOMMUFD
> +#include <linux/iommufd.h>
> +#endif
>   
>   /* context entry operations */
>   #define PASID_0    0
> @@ -87,6 +91,7 @@ struct vtd_iotlb_key {
>   
>   static void vtd_address_space_refresh_all(IntelIOMMUState *s);
>   static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
> +static int vtd_bind_guest_pasid(VTDAddressSpace *vtd_as, Error **errp);
>   
>   static void vtd_pasid_cache_reset_locked(IntelIOMMUState *s)
>   {
> @@ -98,7 +103,11 @@ static void vtd_pasid_cache_reset_locked(IntelIOMMUState *s)
>       g_hash_table_iter_init(&as_it, s->vtd_address_spaces);
>       while (g_hash_table_iter_next(&as_it, NULL, (void **)&vtd_as)) {
>           VTDPASIDCacheEntry *pc_entry = &vtd_as->pasid_cache_entry;
> -        pc_entry->valid = false;
> +        if (pc_entry->valid) {
> +            pc_entry->valid = false;
> +            /* It's fatal to get failure during reset */
> +            vtd_bind_guest_pasid(vtd_as, &error_fatal);
> +        }
>       }
>   }
>   
> @@ -2380,6 +2389,128 @@ static void vtd_context_global_invalidate(IntelIOMMUState *s)
>       vtd_iommu_replay_all(s);
>   }
>   
> +#ifdef CONFIG_IOMMUFD
> +static int vtd_create_fs_hwpt(HostIOMMUDeviceIOMMUFD *idev,
> +                              VTDPASIDEntry *pe, uint32_t *fs_hwpt,
> +                              Error **errp)

Returning a bool is better. Same for the routines below.

> +{
> +    struct iommu_hwpt_vtd_s1 vtd = {};
> +
> +    vtd.flags = (VTD_SM_PASID_ENTRY_SRE_BIT(pe) ? IOMMU_VTD_S1_SRE : 0) |
> +                (VTD_SM_PASID_ENTRY_WPE_BIT(pe) ? IOMMU_VTD_S1_WPE : 0) |
> +                (VTD_SM_PASID_ENTRY_EAFE_BIT(pe) ? IOMMU_VTD_S1_EAFE : 0);
> +    vtd.addr_width = vtd_pe_get_fs_aw(pe);
> +    vtd.pgtbl_addr = (uint64_t)vtd_pe_get_fspt_base(pe);
> +
> +    return !iommufd_backend_alloc_hwpt(idev->iommufd, idev->devid,
> +                                       idev->hwpt_id, 0, IOMMU_HWPT_DATA_VTD_S1,
> +                                       sizeof(vtd), &vtd, fs_hwpt, errp);
> +}
> +
> +static void vtd_destroy_old_fs_hwpt(HostIOMMUDeviceIOMMUFD *idev,
> +                                    VTDAddressSpace *vtd_as)
> +{
> +    if (!vtd_as->fs_hwpt) {
> +        return;
> +    }
> +    iommufd_backend_free_id(idev->iommufd, vtd_as->fs_hwpt);
> +    vtd_as->fs_hwpt = 0;
> +}
> +
> +static int vtd_device_attach_iommufd(VTDHostIOMMUDevice *vtd_hiod,
> +                                     VTDAddressSpace *vtd_as, Error **errp)
> +{
> +    HostIOMMUDeviceIOMMUFD *idev = HOST_IOMMU_DEVICE_IOMMUFD(vtd_hiod->hiod);
> +    VTDPASIDEntry *pe = &vtd_as->pasid_cache_entry.pasid_entry;
> +    uint32_t hwpt_id;
> +    bool ret;
> +
> +    /*
> +     * We can get here only if flts=on, the supported PGTT is FST and PT.
> +     * Catch invalid PGTT when processing invalidation request to avoid
> +     * attaching to wrong hwpt.
> +     */
> +    if (!vtd_pe_pgtt_is_fst(pe) && !vtd_pe_pgtt_is_pt(pe)) {
> +        error_setg(errp, "Invalid PGTT type");
> +        return -EINVAL;
> +    }
> +
> +    if (vtd_pe_pgtt_is_pt(pe)) {
> +        hwpt_id = idev->hwpt_id;
> +    } else if (vtd_create_fs_hwpt(idev, pe, &hwpt_id, errp)) {
> +        return -EINVAL;
> +    }
> +
> +    ret = host_iommu_device_iommufd_attach_hwpt(idev, hwpt_id, errp);
> +    trace_vtd_device_attach_hwpt(idev->devid, vtd_as->pasid, hwpt_id, !ret);
> +    if (ret) {
> +        /* Destroy old fs_hwpt if it's a replacement */
> +        vtd_destroy_old_fs_hwpt(idev, vtd_as);
> +        if (vtd_pe_pgtt_is_fst(pe)) {
> +            vtd_as->fs_hwpt = hwpt_id;
> +        }
> +    } else if (vtd_pe_pgtt_is_fst(pe)) {
> +        iommufd_backend_free_id(idev->iommufd, hwpt_id);
> +    }
> +
> +    return !ret;
> +}
> +
> +static int vtd_device_detach_iommufd(VTDHostIOMMUDevice *vtd_hiod,
> +                                     VTDAddressSpace *vtd_as, Error **errp)
> +{
> +    HostIOMMUDeviceIOMMUFD *idev = HOST_IOMMU_DEVICE_IOMMUFD(vtd_hiod->hiod);
> +    IntelIOMMUState *s = vtd_as->iommu_state;
> +    uint32_t pasid = vtd_as->pasid;
> +    bool ret;
> +
> +    if (s->dmar_enabled && s->root_scalable) {
> +        ret = host_iommu_device_iommufd_detach_hwpt(idev, errp);
> +        trace_vtd_device_detach_hwpt(idev->devid, pasid, !ret);
> +    } else {
> +        /*
> +         * If DMAR remapping is disabled or guest switches to legacy mode,
> +         * we fallback to the default HWPT which contains shadow page table.
> +         * So guest DMA could still work.
> +         */
> +        ret = host_iommu_device_iommufd_attach_hwpt(idev, idev->hwpt_id, errp);
> +        trace_vtd_device_reattach_def_hwpt(idev->devid, pasid, idev->hwpt_id,
> +                                           !ret);
> +    }
> +
> +    if (ret) {
> +        vtd_destroy_old_fs_hwpt(idev, vtd_as);
> +    }
> +
> +    return !ret;
> +}
> +
> +static int vtd_bind_guest_pasid(VTDAddressSpace *vtd_as, Error **errp)
> +{
> +    VTDPASIDCacheEntry *pc_entry = &vtd_as->pasid_cache_entry;
> +    VTDHostIOMMUDevice *vtd_hiod = vtd_find_hiod_iommufd(vtd_as);
> +    int ret;
> +
> +    /* Ignore emulated device or legacy VFIO backed device */
> +    if (!vtd_hiod) {
> +        return 0;
> +    }
> +
> +    if (pc_entry->valid) {
> +        ret = vtd_device_attach_iommufd(vtd_hiod, vtd_as, errp);
> +    } else {
> +        ret = vtd_device_detach_iommufd(vtd_hiod, vtd_as, errp);
> +    }
> +
> +    return ret;
> +}
> +#else
> +static int vtd_bind_guest_pasid(VTDAddressSpace *vtd_as, Error **errp)
> +{
> +    return 0;
> +}
> +#endif
> +
>   /* Do a context-cache device-selective invalidation.
>    * @func_mask: FM field after shifting
>    */
> @@ -3134,6 +3265,8 @@ static void vtd_pasid_cache_sync_locked(gpointer key, gpointer value,
>       VTDPASIDEntry pe;
>       IOMMUNotifier *n;
>       uint16_t did;
> +    const char *err_prefix;

Setting this prefix looks a bit fragile. Maybe add a default value here.


Thanks,

C.


> +    Error *local_err = NULL;
>   
>       if (vtd_dev_get_pe_from_pasid(vtd_as, &pe)) {
>           if (!pc_entry->valid) {
> @@ -3154,7 +3287,9 @@ static void vtd_pasid_cache_sync_locked(gpointer key, gpointer value,
>               vtd_address_space_unmap(vtd_as, n);
>           }
>           vtd_switch_address_space(vtd_as);
> -        return;
> +
> +        err_prefix = "Detaching from HWPT failed: ";
> +        goto do_bind_unbind;
>       }
>   
>       /*
> @@ -3182,12 +3317,21 @@ static void vtd_pasid_cache_sync_locked(gpointer key, gpointer value,
>       if (!pc_entry->valid) {
>           pc_entry->pasid_entry = pe;
>           pc_entry->valid = true;
> -    } else if (!vtd_pasid_entry_compare(&pe, &pc_entry->pasid_entry)) {
> +        err_prefix = "Attaching to HWPT failed: ";
> +    } else if (vtd_pasid_entry_compare(&pe, &pc_entry->pasid_entry)) {
> +        err_prefix = "Replacing HWPT attachment failed: ";
> +    } else {
>           return;
>       }
>   
>       vtd_switch_address_space(vtd_as);
>       vtd_address_space_sync(vtd_as);
> +
> +do_bind_unbind:
> +    /* TODO: Fault event injection into guest, report error to QEMU for now */
> +    if (vtd_bind_guest_pasid(vtd_as, &local_err)) {
> +        error_reportf_err(local_err, "%s", err_prefix);
> +    }
>   }
>   
>   static void vtd_pasid_cache_sync(IntelIOMMUState *s, VTDPASIDCacheInfo *pc_info)
> diff --git a/hw/i386/trace-events b/hw/i386/trace-events
> index b704f4f90c..5a3ee1cf64 100644
> --- a/hw/i386/trace-events
> +++ b/hw/i386/trace-events
> @@ -73,6 +73,9 @@ vtd_warn_invalid_qi_tail(uint16_t tail) "tail 0x%"PRIx16
>   vtd_warn_ir_vector(uint16_t sid, int index, int vec, int target) "sid 0x%"PRIx16" index %d vec %d (should be: %d)"
>   vtd_warn_ir_trigger(uint16_t sid, int index, int trig, int target) "sid 0x%"PRIx16" index %d trigger %d (should be: %d)"
>   vtd_reset_exit(void) ""
> +vtd_device_attach_hwpt(uint32_t dev_id, uint32_t pasid, uint32_t hwpt_id, int ret) "dev_id %d pasid %d hwpt_id %d, ret: %d"
> +vtd_device_detach_hwpt(uint32_t dev_id, uint32_t pasid, int ret) "dev_id %d pasid %d ret: %d"
> +vtd_device_reattach_def_hwpt(uint32_t dev_id, uint32_t pasid, uint32_t hwpt_id, int ret) "dev_id %d pasid %d hwpt_id %d, ret: %d"
>   
>   # amd_iommu.c
>   amdvi_evntlog_fail(uint64_t addr, uint32_t head) "error: fail to write at addr 0x%"PRIx64" +  offset 0x%"PRIx32



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 17/23] iommufd: Introduce a helper function to extract vendor capabilities
  2025-10-24  8:43 ` [PATCH v7 17/23] iommufd: Introduce a helper function to extract vendor capabilities Zhenzhong Duan
  2025-10-24 16:44   ` Cédric Le Goater
@ 2025-10-24 17:34   ` Cédric Le Goater
  2025-10-28  9:28     ` Duan, Zhenzhong
  1 sibling, 1 reply; 48+ messages in thread
From: Cédric Le Goater @ 2025-10-24 17:34 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, eric.auger, mst, jasowang, peterx, ddutile, jgg,
	nicolinc, skolothumtho, joao.m.martins, clement.mathieu--drif,
	kevin.tian, yi.l.liu, chao.p.peng

On 10/24/25 10:43, Zhenzhong Duan wrote:
> In VFIO core, we call iommufd_backend_get_device_info() to return vendor
> specific hardware information data, but it's not good to extract this raw
> data in VFIO core.
> 
> Introduce host_iommu_extract_quirks() to help extract the raw data and
> return a bitmap. It lives in iommufd.c because that is the place defining
> iommufd_backend_get_device_info().
> 
> The other choice is to put the vendor data extraction code in each vendor's
> vIOMMU emulation file, but that would mix vIOMMU emulation with host IOMMU
> extraction code and would also need a new callback in PCIIOMMUOps. So we
> choose the simpler way above.
> 
> Suggested-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
>   include/hw/iommu.h                 |  5 +++++
>   include/system/host_iommu_device.h | 15 +++++++++++++++
>   backends/iommufd.c                 | 13 +++++++++++++
>   3 files changed, 33 insertions(+)
> 
> diff --git a/include/hw/iommu.h b/include/hw/iommu.h
> index 9b8bb94fc2..6d61410703 100644
> --- a/include/hw/iommu.h
> +++ b/include/hw/iommu.h
> @@ -22,4 +22,9 @@ enum viommu_flags {
>       VIOMMU_FLAG_WANT_NESTING_PARENT = BIT_ULL(0),
>   };
>   
> +/* Host IOMMU quirks. Extracted from host IOMMU capabilities */
> +enum host_iommu_quirks {
> +    HOST_IOMMU_QUIRK_NESTING_PARENT_BYPASS_RO = BIT_ULL(0),


This host IOMMU quirk definition in a vIOMMU header file is not very
consistent.


Thanks,

C.

   > +};
> +
>   #endif /* HW_IOMMU_H */
> diff --git a/include/system/host_iommu_device.h b/include/system/host_iommu_device.h
> index ab849a4a82..9ae7f4cc6d 100644
> --- a/include/system/host_iommu_device.h
> +++ b/include/system/host_iommu_device.h
> @@ -39,6 +39,21 @@ typedef struct HostIOMMUDeviceCaps {
>       uint64_t hw_caps;
>       VendorCaps vendor_caps;
>   } HostIOMMUDeviceCaps;
> +
> +/**
> + * host_iommu_extract_quirk: Extract host IOMMU quirks
> + *
> + * This function converts @type specific hardware information data
> + * into a standard bitmap format.
> + *
> + * @type: IOMMU Hardware Info Types
> + *
> + * @VendorCaps: IOMMU @type specific hardware information data
> + *
> + * Returns: bitmap with each representing a host IOMMU quirk defined in
> + * enum host_iommu_quirks
> + */
> +uint64_t host_iommu_extract_quirks(uint32_t type, VendorCaps *caps);
>   #endif
>   
>   #define TYPE_HOST_IOMMU_DEVICE "host-iommu-device"
> diff --git a/backends/iommufd.c b/backends/iommufd.c
> index 086bd67aea..61b991ec53 100644
> --- a/backends/iommufd.c
> +++ b/backends/iommufd.c
> @@ -19,6 +19,7 @@
>   #include "migration/cpr.h"
>   #include "monitor/monitor.h"
>   #include "trace.h"
> +#include "hw/iommu.h"
>   #include "hw/vfio/vfio-device.h"
>   #include <sys/ioctl.h>
>   #include <linux/iommufd.h>
> @@ -411,6 +412,18 @@ bool iommufd_backend_get_device_info(IOMMUFDBackend *be, uint32_t devid,
>       return true;
>   }
>   
> +uint64_t host_iommu_extract_quirks(uint32_t type, VendorCaps *caps)
> +{
> +    uint64_t quirks = 0;
> +
> +    if (type == IOMMU_HW_INFO_TYPE_INTEL_VTD &&
> +        caps->vtd.flags & IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17) {
> +        quirks |= HOST_IOMMU_QUIRK_NESTING_PARENT_BYPASS_RO;
> +    }
> +
> +    return quirks;
> +}
> +
>   bool iommufd_backend_invalidate_cache(IOMMUFDBackend *be, uint32_t id,
>                                         uint32_t data_type, uint32_t entry_len,
>                                         uint32_t *entry_num, void *data,



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 19/23] Workaround for ERRATA_772415_SPR17
  2025-10-24  8:43 ` [PATCH v7 19/23] Workaround for ERRATA_772415_SPR17 Zhenzhong Duan
@ 2025-10-24 17:36   ` Cédric Le Goater
  2025-10-24 17:38   ` Cédric Le Goater
  1 sibling, 0 replies; 48+ messages in thread
From: Cédric Le Goater @ 2025-10-24 17:36 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, eric.auger, mst, jasowang, peterx, ddutile, jgg,
	nicolinc, skolothumtho, joao.m.martins, clement.mathieu--drif,
	kevin.tian, yi.l.liu, chao.p.peng

On 10/24/25 10:43, Zhenzhong Duan wrote:
> On a system affected by ERRATA_772415, IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17
> is reported by IOMMU_DEVICE_GET_HW_INFO. Due to this erratum, even a read-only
> range mapped in the second-stage page table can still be written.
> 
> Reference from 4th Gen Intel Xeon Processor Scalable Family Specification
> Update, Errata Details, SPR17.
> https://edc.intel.com/content/www/us/en/design/products-and-solutions/processors-and-chipsets/eagle-stream/sapphire-rapids-specification-update/
> 
> Also copied the SPR17 details from above link:
> "Problem: When remapping hardware is configured by system software in
> scalable mode as Nested (PGTT=011b) and with PWSNP field Set in the
> PASID-table-entry, it may Set Accessed bit and Dirty bit (and Extended
> Access bit if enabled) in first-stage page-table entries even when
> second-stage mappings indicate that corresponding first-stage page-table
> is Read-Only.
> 
> Implication: Due to this erratum, pages mapped as Read-only in second-stage
> page-tables may be modified by remapping hardware Access/Dirty bit updates.
> 
> Workaround: None identified. System software enabling nested translations
> for a VM should ensure that there are no read-only pages in the
> corresponding second-stage mappings."
> 
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>   hw/vfio/iommufd.c | 10 +++++++++-
>   1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
> index f9d0926274..f9da0e79cc 100644
> --- a/hw/vfio/iommufd.c
> +++ b/hw/vfio/iommufd.c
> @@ -15,6 +15,7 @@
>   #include <linux/vfio.h>
>   #include <linux/iommufd.h>
>   
> +#include "hw/iommu.h"

Changes look ok apart from this include.


Thanks,

C.



>   #include "hw/vfio/vfio-device.h"
>   #include "qemu/error-report.h"
>   #include "trace.h"
> @@ -351,6 +352,7 @@ static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
>       VFIOContainer *bcontainer = VFIO_IOMMU(container);
>       uint32_t type, flags = 0;
>       uint64_t hw_caps;
> +    VendorCaps caps;
>       VFIOIOASHwpt *hwpt;
>       uint32_t hwpt_id;
>       int ret;
> @@ -396,7 +398,8 @@ static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
>        * instead.
>        */
>       if (!iommufd_backend_get_device_info(vbasedev->iommufd, vbasedev->devid,
> -                                         &type, NULL, 0, &hw_caps, errp)) {
> +                                         &type, &caps, sizeof(caps), &hw_caps,
> +                                         errp)) {
>           return false;
>       }
>   
> @@ -411,6 +414,11 @@ static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
>        */
>       if (vfio_device_get_viommu_flags_want_nesting(vbasedev)) {
>           flags |= IOMMU_HWPT_ALLOC_NEST_PARENT;
> +
> +        if (host_iommu_extract_quirks(type, &caps) &
> +            HOST_IOMMU_QUIRK_NESTING_PARENT_BYPASS_RO) {
> +            bcontainer->bypass_ro = true;
> +        }
>       }
>   
>       if (cpr_is_incoming()) {



^ permalink raw reply	[flat|nested] 48+ messages in thread

* RE: [PATCH v7 08/23] vfio/iommufd: Force creating nesting parent HWPT
  2025-10-24 16:23   ` Cédric Le Goater
@ 2025-10-28  6:00     ` Duan, Zhenzhong
  0 siblings, 0 replies; 48+ messages in thread
From: Duan, Zhenzhong @ 2025-10-28  6:00 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, eric.auger@redhat.com, mst@redhat.com,
	jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
	jgg@nvidia.com, nicolinc@nvidia.com, skolothumtho@nvidia.com,
	joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
	Tian, Kevin, Liu, Yi L, Peng, Chao P



>-----Original Message-----
>From: Cédric Le Goater <clg@redhat.com>
>Subject: Re: [PATCH v7 08/23] vfio/iommufd: Force creating nesting parent
>HWPT
>
>On 10/24/25 10:43, Zhenzhong Duan wrote:
>> Call pci_device_get_viommu_flags() to query whether the vIOMMU supports
>> VIOMMU_FLAG_WANT_NESTING_PARENT.
>>
>> If yes, create a nesting parent HWPT and add it to the container's hwpt_list,
>> letting this parent HWPT cover the entire second stage mappings (GPA=>HPA).
>>
>> This allows a VFIO passthrough device to directly attach to this default HWPT
>> and then to use the system address space and its listener.
>>
>> Introduce a vfio_device_get_viommu_flags_want_nesting() helper to facilitate
>> this implementation.
>>
>> It is safe to do so because a vIOMMU will be able to fail the
>> set_iommu_device() call if something else related to the VFIO device or
>> vIOMMU isn't compatible.
>>
>> Suggested-by: Nicolin Chen <nicolinc@nvidia.com>
>> Suggested-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
>> Reviewed-by: Eric Auger <eric.auger@redhat.com>
>> Reviewed-by: Yi Liu <yi.l.liu@intel.com>
>> ---
>>   include/hw/vfio/vfio-device.h |  2 ++
>>   hw/vfio/device.c              | 12 ++++++++++++
>>   hw/vfio/iommufd.c             |  9 +++++++++
>>   3 files changed, 23 insertions(+)
>>
>> diff --git a/include/hw/vfio/vfio-device.h b/include/hw/vfio/vfio-device.h
>> index a0b8fc2eb6..48d00c7bc4 100644
>> --- a/include/hw/vfio/vfio-device.h
>> +++ b/include/hw/vfio/vfio-device.h
>> @@ -267,6 +267,8 @@ void vfio_device_prepare(VFIODevice *vbasedev, VFIOContainer *bcontainer,
>>
>>   void vfio_device_unprepare(VFIODevice *vbasedev);
>>
>> +bool vfio_device_get_viommu_flags_want_nesting(VFIODevice *vbasedev);
>> +
>>   int vfio_device_get_region_info(VFIODevice *vbasedev, int index,
>>                                   struct vfio_region_info **info);
>>   int vfio_device_get_region_info_type(VFIODevice *vbasedev, uint32_t type,
>> diff --git a/hw/vfio/device.c b/hw/vfio/device.c
>> index 5ed3103e72..be94947623 100644
>> --- a/hw/vfio/device.c
>> +++ b/hw/vfio/device.c
>> @@ -23,6 +23,7 @@
>>
>>   #include "hw/vfio/vfio-device.h"
>>   #include "hw/vfio/pci.h"
>> +#include "hw/iommu.h"
>>   #include "hw/hw.h"
>>   #include "trace.h"
>>   #include "qapi/error.h"
>> @@ -521,6 +522,17 @@ void vfio_device_unprepare(VFIODevice *vbasedev)
>>       vbasedev->bcontainer = NULL;
>>   }
>>
>> +bool vfio_device_get_viommu_flags_want_nesting(VFIODevice *vbasedev)
>> +{
>> +    VFIOPCIDevice *vdev = vfio_pci_from_vfio_device(vbasedev);
>> +
>> +    if (vdev) {
>> +        return !!(pci_device_get_viommu_flags(&vdev->parent_obj) &
>
>Using PCI_DEVICE(vdev) would be more appropriate. It can come later.

Yes, will do.

Thanks
Zhenzhong

^ permalink raw reply	[flat|nested] 48+ messages in thread

* RE: [PATCH v7 12/23] intel_iommu: Add some macros and inline functions
  2025-10-24 16:39   ` Cédric Le Goater
@ 2025-10-28  6:01     ` Duan, Zhenzhong
  0 siblings, 0 replies; 48+ messages in thread
From: Duan, Zhenzhong @ 2025-10-28  6:01 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, eric.auger@redhat.com, mst@redhat.com,
	jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
	jgg@nvidia.com, nicolinc@nvidia.com, skolothumtho@nvidia.com,
	joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
	Tian, Kevin, Liu, Yi L, Peng, Chao P



>-----Original Message-----
>From: Cédric Le Goater <clg@redhat.com>
>Subject: Re: [PATCH v7 12/23] intel_iommu: Add some macros and inline
>functions
>
>On 10/24/25 10:43, Zhenzhong Duan wrote:
>> Add some macros and inline functions that will be used by following
>> patch.
>>
>> This patch also makes a cleanup to change the macros VTD_SM_PASID_ENTRY_DID
>> and VTD_SM_PASID_ENTRY_FSPM to use extract64(), just like what smmu does,
>> because they are either used in following patches or used indirectly by
>> newly introduced inline functions. But we don't aim to change the huge
>> amount of bit-mask style macro definitions in this patch; that should be
>> done in a separate patch.
>>
>> Suggested-by: Eric Auger <eric.auger@redhat.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> Reviewed-by: Yi Liu <yi.l.liu@intel.com>
>> ---
>>   hw/i386/intel_iommu_internal.h |  8 +++++--
>>   hw/i386/intel_iommu.c          | 38 +++++++++++++++++++++++++++-------
>>   2 files changed, 37 insertions(+), 9 deletions(-)
>>
>> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
>> index 09edba81e2..df80af839d 100644
>> --- a/hw/i386/intel_iommu_internal.h
>> +++ b/hw/i386/intel_iommu_internal.h
>> @@ -642,10 +642,14 @@ typedef struct VTDPASIDCacheInfo {
>>   #define VTD_SM_PASID_ENTRY_PT          (4ULL << 6)
>>
>>   #define VTD_SM_PASID_ENTRY_AW          7ULL /* Adjusted
>guest-address-width */
>> -#define VTD_SM_PASID_ENTRY_DID(val)    ((val) & VTD_DOMAIN_ID_MASK)
>> +#define VTD_SM_PASID_ENTRY_DID(x)      extract64((x)->val[1], 0, 16)
>>
>> -#define VTD_SM_PASID_ENTRY_FSPM          3ULL
>>   #define VTD_SM_PASID_ENTRY_FSPTPTR       (~0xfffULL)
>> +#define VTD_SM_PASID_ENTRY_SRE_BIT(x)    extract64((x)->val[2], 0, 1)
>> +/* 00: 4-level paging, 01: 5-level paging, 10-11: Reserved */
>> +#define VTD_SM_PASID_ENTRY_FSPM(x)       extract64((x)->val[2], 2, 2)
>> +#define VTD_SM_PASID_ENTRY_WPE_BIT(x)    extract64((x)->val[2], 4, 1)
>> +#define VTD_SM_PASID_ENTRY_EAFE_BIT(x)   extract64((x)->val[2], 7, 1)
>>
>>   /* First Stage Paging Structure */
>>   /* Masks for First Stage Paging Entry */
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index 56abbb991d..871e6aad19 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -52,8 +52,7 @@
>>
>>   /* pe operations */
>>   #define VTD_PE_GET_TYPE(pe) ((pe)->val[0] & VTD_SM_PASID_ENTRY_PGTT)
>> -#define VTD_PE_GET_FS_LEVEL(pe) \
>> -    (4 + (((pe)->val[2] >> 2) & VTD_SM_PASID_ENTRY_FSPM))
>> +#define VTD_PE_GET_FS_LEVEL(pe) (VTD_SM_PASID_ENTRY_FSPM(pe) + 4)
>>   #define VTD_PE_GET_SS_LEVEL(pe) \
>>       (2 + (((pe)->val[0] >> 2) & VTD_SM_PASID_ENTRY_AW))
>>
>> @@ -837,6 +836,31 @@ static inline bool vtd_pe_type_check(IntelIOMMUState *s, VTDPASIDEntry *pe)
>>       }
>>   }
>>
>> +static inline dma_addr_t vtd_pe_get_fspt_base(VTDPASIDEntry *pe)
>> +{
>> +    return pe->val[2] & VTD_SM_PASID_ENTRY_FSPTPTR;
>> +}
>> +
>> +/*
>> + * First stage IOVA address width: 48 bits for 4-level paging (FSPM=00)
>> + *                                 57 bits for 5-level paging (FSPM=01)
>> + */
>> +static inline uint32_t vtd_pe_get_fs_aw(VTDPASIDEntry *pe)
>> +{
>> +    return 48 + VTD_SM_PASID_ENTRY_FSPM(pe) * 9;
>
>
>Can't we use VTD_HOST_AW_48BIT here ?

Will do.

Thanks
Zhenzhong

^ permalink raw reply	[flat|nested] 48+ messages in thread

* RE: [PATCH v7 05/23] hw/pci: Introduce pci_device_get_viommu_flags()
  2025-10-24 17:18   ` Cédric Le Goater
@ 2025-10-28  6:57     ` Duan, Zhenzhong
  2025-10-28 15:19       ` Eric Auger
  0 siblings, 1 reply; 48+ messages in thread
From: Duan, Zhenzhong @ 2025-10-28  6:57 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, eric.auger@redhat.com, mst@redhat.com,
	jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
	jgg@nvidia.com, nicolinc@nvidia.com, skolothumtho@nvidia.com,
	joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
	Tian, Kevin, Liu, Yi L, Peng, Chao P



>-----Original Message-----
>From: Cédric Le Goater <clg@redhat.com>
>Subject: Re: [PATCH v7 05/23] hw/pci: Introduce
>pci_device_get_viommu_flags()
>
>On 10/24/25 10:43, Zhenzhong Duan wrote:
>> Introduce a new PCIIOMMUOps optional callback, get_viommu_flags(), which
>> allows retrieving the flags exposed by a vIOMMU. The first planned vIOMMU
>> device flag is VIOMMU_FLAG_WANT_NESTING_PARENT, which advertises support
>> for the HW nested stage translation scheme and requests other sub-systems'
>> (like VFIO's) cooperation to create the nesting parent HWPT.
>>
>> pci_device_get_viommu_flags() is a wrapper that can be called on a PCI
>> device potentially protected by a vIOMMU.
>>
>> get_viommu_flags() is designed to return a 64-bit bitmap of pure vIOMMU
>> flags which are only determined by the user's configuration, with no host
>> capabilities involved. Reasons are:
>>
>> 1. host may have heterogeneous IOMMUs, each with different capabilities
>> 2. this is migration friendly; the return value is consistent between
>>     source and target.
>>
>> Note that this op will be invoked at the attach_device() stage, at which
>> point host IOMMU capabilities have not yet been forwarded to the vIOMMU
>> through the set_iommu_device() callback, which runs after attach_device().
>>
>> See below sequence:
>>
>>    vfio_device_attach():
>>        iommufd_cdev_attach():
>>            pci_device_get_viommu_flags() for HW nesting cap
>>            create a nesting parent HWPT
>>            attach device to the HWPT
>>            vfio_device_hiod_create_and_realize() creating hiod
>>    ...
>>    pci_device_set_iommu_device(hiod)
>>
>> Suggested-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
>> Reviewed-by: Eric Auger <eric.auger@redhat.com>
>> Reviewed-by: Yi Liu <yi.l.liu@intel.com>
>> ---
>>   MAINTAINERS          |  1 +
>>   include/hw/iommu.h   | 25 +++++++++++++++++++++++++
>
>
>Hmm, why not under include/hw/pci/ ?

I'm not sure it's better to restrict the iommu header to the PCI subsystem.
I have a vague memory that there are IOMMUs supporting non-PCI devices.

> Was this discussed ?

No.

Thanks
Zhenzhong

>
>
>Reviewed-by: Cédric Le Goater <clg@redhat.com>
>
>Thanks,
>
>C.
>
>
>
>>   include/hw/pci/pci.h | 22 ++++++++++++++++++++++
>>   hw/pci/pci.c         | 11 +++++++++++
>>   4 files changed, 59 insertions(+)
>>   create mode 100644 include/hw/iommu.h
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index 36eef27b41..d94fbcbdfb 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -2338,6 +2338,7 @@ F: include/system/iommufd.h
>>   F: backends/host_iommu_device.c
>>   F: include/system/host_iommu_device.h
>>   F: include/qemu/chardev_open.h
>> +F: include/hw/iommu.h
>>   F: util/chardev_open.c
>>   F: docs/devel/vfio-iommufd.rst
>>
>> diff --git a/include/hw/iommu.h b/include/hw/iommu.h
>> new file mode 100644
>> index 0000000000..9b8bb94fc2
>> --- /dev/null
>> +++ b/include/hw/iommu.h
>> @@ -0,0 +1,25 @@
>> +/*
>> + * General vIOMMU flags
>> + *
>> + * Copyright (C) 2025 Intel Corporation.
>> + *
>> + * SPDX-License-Identifier: GPL-2.0-or-later
>> + */
>> +
>> +#ifndef HW_IOMMU_H
>> +#define HW_IOMMU_H
>> +
>> +#include "qemu/bitops.h"
>> +
>> +/*
>> + * Theoretical vIOMMU flags. Only determined by the vIOMMU device properties
>> + * and independent of the actual host IOMMU capabilities they may depend on.
>> + * Each flag can be an expectation or request to another sub-system, or just
>> + * a pure vIOMMU capability. vIOMMU can choose which flags to expose.
>> + */
>> +enum viommu_flags {
>> +    /* vIOMMU needs nesting parent HWPT to create nested HWPT */
>> +    VIOMMU_FLAG_WANT_NESTING_PARENT = BIT_ULL(0),
>> +};
>> +
>> +#endif /* HW_IOMMU_H */
>> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
>> index bde9dca8e2..cf99b5bb68 100644
>> --- a/include/hw/pci/pci.h
>> +++ b/include/hw/pci/pci.h
>> @@ -462,6 +462,18 @@ typedef struct PCIIOMMUOps {
>>        * @devfn: device and function number of the PCI device.
>>        */
>>       void (*unset_iommu_device)(PCIBus *bus, void *opaque, int devfn);
>> +    /**
>> +     * @get_viommu_flags: get vIOMMU flags
>> +     *
>> +     * Optional callback; if not implemented, the vIOMMU doesn't support
>> +     * exposing flags to other sub-systems, e.g., VFIO.
>> +     *
>> +     * @opaque: the data passed to pci_setup_iommu().
>> +     *
>> +     * Returns: bitmap with each representing a vIOMMU flag defined in
>> +     * enum viommu_flags.
>> +     */
>> +    uint64_t (*get_viommu_flags)(void *opaque);
>>       /**
>>        * @get_iotlb_info: get properties required to initialize a device
>IOTLB.
>>        *
>> @@ -644,6 +656,16 @@ bool pci_device_set_iommu_device(PCIDevice *dev, HostIOMMUDevice *hiod,
>>                                    Error **errp);
>>   void pci_device_unset_iommu_device(PCIDevice *dev);
>>
>> +/**
>> + * pci_device_get_viommu_flags: get vIOMMU flags.
>> + *
>> + * Returns: bitmap with each representing a vIOMMU flag defined in
>> + * enum viommu_flags. Or 0 if vIOMMU doesn't report any.
>> + *
>> + * @dev: PCI device pointer.
>> + */
>> +uint64_t pci_device_get_viommu_flags(PCIDevice *dev);
>> +
>>   /**
>>    * pci_iommu_get_iotlb_info: get properties required to initialize a
>>    * device IOTLB.
>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
>> index d0e81651aa..c9932c87e3 100644
>> --- a/hw/pci/pci.c
>> +++ b/hw/pci/pci.c
>> @@ -3010,6 +3010,17 @@ void pci_device_unset_iommu_device(PCIDevice *dev)
>>       }
>>   }
>>
>> +uint64_t pci_device_get_viommu_flags(PCIDevice *dev)
>> +{
>> +    PCIBus *iommu_bus;
>> +
>> +    pci_device_get_iommu_bus_devfn(dev, &iommu_bus, NULL, NULL);
>> +    if (iommu_bus && iommu_bus->iommu_ops->get_viommu_flags) {
>> +        return iommu_bus->iommu_ops->get_viommu_flags(iommu_bus->iommu_opaque);
>> +    }
>> +    return 0;
>> +}
>> +
>>   int pci_pri_request_page(PCIDevice *dev, uint32_t pasid, bool priv_req,
>>                            bool exec_req, hwaddr addr, bool lpig,
>>                            uint16_t prgi, bool is_read, bool is_write)


^ permalink raw reply	[flat|nested] 48+ messages in thread

* RE: [PATCH v7 17/23] iommufd: Introduce a helper function to extract vendor capabilities
  2025-10-24 17:34   ` Cédric Le Goater
@ 2025-10-28  9:28     ` Duan, Zhenzhong
  0 siblings, 0 replies; 48+ messages in thread
From: Duan, Zhenzhong @ 2025-10-28  9:28 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, eric.auger@redhat.com, mst@redhat.com,
	jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
	jgg@nvidia.com, nicolinc@nvidia.com, skolothumtho@nvidia.com,
	joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
	Tian, Kevin, Liu, Yi L, Peng, Chao P



>-----Original Message-----
>From: Cédric Le Goater <clg@redhat.com>
>Subject: Re: [PATCH v7 17/23] iommufd: Introduce a helper function to
>extract vendor capabilities
>
>On 10/24/25 10:43, Zhenzhong Duan wrote:
>> In VFIO core, we call iommufd_backend_get_device_info() to return vendor
>> specific hardware information data, but it's not good to extract this raw
>> data in VFIO core.
>>
>> Introduce host_iommu_extract_quirks() to help extract the raw data and
>> return a bitmap. It lives in iommufd.c because that is where
>> iommufd_backend_get_device_info() is defined.
>>
>> The other choice is to put the vendor data extraction code in the vendor
>> vIOMMU emulation files, but that would mix vIOMMU emulation and host
>> IOMMU extraction code in those files, and would also need a new callback
>> in PCIIOMMUOps. So we choose the simpler way above.
>>
>> Suggested-by: Nicolin Chen <nicolinc@nvidia.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
>> ---
>>   include/hw/iommu.h                 |  5 +++++
>>   include/system/host_iommu_device.h | 15 +++++++++++++++
>>   backends/iommufd.c                 | 13 +++++++++++++
>>   3 files changed, 33 insertions(+)
>>
>> diff --git a/include/hw/iommu.h b/include/hw/iommu.h
>> index 9b8bb94fc2..6d61410703 100644
>> --- a/include/hw/iommu.h
>> +++ b/include/hw/iommu.h
>> @@ -22,4 +22,9 @@ enum viommu_flags {
>>       VIOMMU_FLAG_WANT_NESTING_PARENT = BIT_ULL(0),
>>   };
>>
>> +/* Host IOMMU quirks. Extracted from host IOMMU capabilities */
>> +enum host_iommu_quirks {
>> +    HOST_IOMMU_QUIRK_NESTING_PARENT_BYPASS_RO = BIT_ULL(0),
>
>
>This host IOMMU quirk definition in a vIOMMU header file is not very
>consistent.

I take it as a general IOMMU header; both vIOMMU info and host IOMMU info can live there.

Thanks
Zhenzhong


^ permalink raw reply	[flat|nested] 48+ messages in thread

* RE: [PATCH v7 17/23] iommufd: Introduce a helper function to extract vendor capabilities
  2025-10-24 16:44   ` Cédric Le Goater
@ 2025-10-28  9:43     ` Duan, Zhenzhong
  0 siblings, 0 replies; 48+ messages in thread
From: Duan, Zhenzhong @ 2025-10-28  9:43 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, eric.auger@redhat.com, mst@redhat.com,
	jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
	jgg@nvidia.com, nicolinc@nvidia.com, skolothumtho@nvidia.com,
	joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
	Tian, Kevin, Liu, Yi L, Peng, Chao P



>-----Original Message-----
>From: Cédric Le Goater <clg@redhat.com>
>Subject: Re: [PATCH v7 17/23] iommufd: Introduce a helper function to
>extract vendor capabilities
>
>On 10/24/25 10:43, Zhenzhong Duan wrote:
>> In VFIO core, we call iommufd_backend_get_device_info() to return vendor
>> specific hardware information data, but it's not good to extract this raw
>> data in VFIO core.
>>
>> Introduce host_iommu_extract_quirks() to help extract the raw data and
>> return a bitmap. It lives in iommufd.c because that is where
>> iommufd_backend_get_device_info() is defined.
>>
>> The other choice is to put the vendor data extraction code in the vendor
>> vIOMMU emulation files, but that would mix vIOMMU emulation and host
>> IOMMU extraction code in those files, and would also need a new callback
>> in PCIIOMMUOps. So we choose the simpler way above.
>>
>> Suggested-by: Nicolin Chen <nicolinc@nvidia.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
>> ---
>>   include/hw/iommu.h                 |  5 +++++
>>   include/system/host_iommu_device.h | 15 +++++++++++++++
>>   backends/iommufd.c                 | 13 +++++++++++++
>>   3 files changed, 33 insertions(+)
>>
>> diff --git a/include/hw/iommu.h b/include/hw/iommu.h
>> index 9b8bb94fc2..6d61410703 100644
>> --- a/include/hw/iommu.h
>> +++ b/include/hw/iommu.h
>> @@ -22,4 +22,9 @@ enum viommu_flags {
>>       VIOMMU_FLAG_WANT_NESTING_PARENT = BIT_ULL(0),
>>   };
>>
>> +/* Host IOMMU quirks. Extracted from host IOMMU capabilities */
>> +enum host_iommu_quirks {
>> +    HOST_IOMMU_QUIRK_NESTING_PARENT_BYPASS_RO = BIT_ULL(0),
>> +};
>> +
>>   #endif /* HW_IOMMU_H */
>> diff --git a/include/system/host_iommu_device.h
>b/include/system/host_iommu_device.h
>> index ab849a4a82..9ae7f4cc6d 100644
>> --- a/include/system/host_iommu_device.h
>> +++ b/include/system/host_iommu_device.h
>> @@ -39,6 +39,21 @@ typedef struct HostIOMMUDeviceCaps {
>>       uint64_t hw_caps;
>>       VendorCaps vendor_caps;
>>   } HostIOMMUDeviceCaps;
>> +
>> +/**
>> + * host_iommu_extract_quirks: Extract host IOMMU quirks
>> + *
>> + * This function converts @type specific hardware information data
>> + * into a standard bitmap format.
>> + *
>> + * @type: IOMMU Hardware Info Types
>> + *
>> + * @caps: IOMMU @type specific hardware information data
>> + *
>> + * Returns: bitmap with each bit representing a host IOMMU quirk defined in
>> + * enum host_iommu_quirks
>> + */
>> +uint64_t host_iommu_extract_quirks(uint32_t type, VendorCaps *caps);
>
>..._get_quirks() sounds nicer. This is minor.

OK, will do.

Thanks
Zhenzhong
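[Editor's note: the body of host_iommu_extract_quirks()/..._get_quirks() is not shown in this thread. A rough sketch of the intended shape — a type switch that folds vendor capability bits into the generic quirk bitmap. The hw-info type values, the VtdCaps layout, and the VTD_FLAG_ERRATA_RO_BYPASS bit are made up for illustration; only the quirk enum comes from the patch.]

```c
#include <assert.h>
#include <stdint.h>

#define BIT_ULL(n) (1ULL << (n))

/* From the patch: generic host IOMMU quirk bitmap */
enum host_iommu_quirks {
    HOST_IOMMU_QUIRK_NESTING_PARENT_BYPASS_RO = BIT_ULL(0),
};

/* Hypothetical stand-ins for the iommufd hw_info types and VendorCaps */
enum { HW_INFO_TYPE_INTEL_VTD = 1, HW_INFO_TYPE_ARM_SMMUV3 = 2 };

typedef struct VtdCaps { uint64_t flags; } VtdCaps;
#define VTD_FLAG_ERRATA_RO_BYPASS BIT_ULL(3)   /* hypothetical bit */

typedef union VendorCaps {
    VtdCaps vtd;
} VendorCaps;

/* Fold vendor-specific raw data into the generic quirk bitmap, so VFIO
 * core never has to parse vendor data itself. */
uint64_t host_iommu_get_quirks(uint32_t type, const VendorCaps *caps)
{
    uint64_t quirks = 0;

    switch (type) {
    case HW_INFO_TYPE_INTEL_VTD:
        if (caps->vtd.flags & VTD_FLAG_ERRATA_RO_BYPASS) {
            quirks |= HOST_IOMMU_QUIRK_NESTING_PARENT_BYPASS_RO;
        }
        break;
    default:
        break; /* unknown type: no quirks reported */
    }
    return quirks;
}

/* Fixtures for the two interesting cases */
uint64_t demo_quirky(void)
{
    VendorCaps c = { .vtd = { .flags = VTD_FLAG_ERRATA_RO_BYPASS } };
    return host_iommu_get_quirks(HW_INFO_TYPE_INTEL_VTD, &c);
}

uint64_t demo_clean(void)
{
    VendorCaps c = { .vtd = { .flags = 0 } };
    return host_iommu_get_quirks(HW_INFO_TYPE_INTEL_VTD, &c);
}
```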

^ permalink raw reply	[flat|nested] 48+ messages in thread

* RE: [PATCH v7 20/23] vfio: Bypass readonly region for dirty tracking
  2025-10-24 16:32   ` Cédric Le Goater
@ 2025-10-28  9:47     ` Duan, Zhenzhong
  0 siblings, 0 replies; 48+ messages in thread
From: Duan, Zhenzhong @ 2025-10-28  9:47 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, eric.auger@redhat.com, mst@redhat.com,
	jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
	jgg@nvidia.com, nicolinc@nvidia.com, skolothumtho@nvidia.com,
	joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
	Tian, Kevin, Liu, Yi L, Peng, Chao P



>-----Original Message-----
>From: Cédric Le Goater <clg@redhat.com>
>Subject: Re: [PATCH v7 20/23] vfio: Bypass readonly region for dirty tracking
>
>On 10/24/25 10:43, Zhenzhong Duan wrote:
>> When doing dirty tracking or calculating the dirty tracking range, readonly
>> regions can be bypassed, because the corresponding DMA mappings are
>> readonly and never become dirty.
>>
>> This can optimize dirty tracking a bit for passthrough devices.
>>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>>   hw/vfio/listener.c | 11 +++++++++--
>>   1 file changed, 9 insertions(+), 2 deletions(-)
>>
>> diff --git a/hw/vfio/listener.c b/hw/vfio/listener.c
>> index 0862b2b834..cbd86c79af 100644
>> --- a/hw/vfio/listener.c
>> +++ b/hw/vfio/listener.c
>> @@ -828,7 +828,8 @@ static void
>vfio_dirty_tracking_update(MemoryListener *listener,
>>           container_of(listener, VFIODirtyRangesListener, listener);
>>       hwaddr iova, end;
>>
>> -    if (!vfio_listener_valid_section(section, false, "tracking_update") ||
>> +    /* Bypass readonly section as it never becomes dirty */
>> +    if (!vfio_listener_valid_section(section, true, "tracking_update") ||
>>           !vfio_get_section_iova_range(dirty->bcontainer, section,
>>                                        &iova, &end, NULL)) {
>>           return;
>> @@ -1087,6 +1088,12 @@ static void
>vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>>       if (!mr) {
>>           goto out_unlock;
>>       }
>> +
>> +    if (!(iotlb->perm & IOMMU_WO) || mr->readonly) {
>
>
>In case you resend, please add a trace event.

OK, will add:

  trace_vfio_iommu_map_dirty_notify_skip_ro(iova, iova + iotlb->addr_mask);

Thanks
Zhenzhong

>
>Anyhow,
>
>Reviewed-by: Cédric Le Goater <clg@redhat.com>
>
>Thanks,
>
>C.
>
>
>> +        rcu_read_unlock();
>> +        return;
>> +    }
>> +
>>       translated_addr = memory_region_get_ram_addr(mr) + xlat;
>>
>>       ret = vfio_container_query_dirty_bitmap(bcontainer, iova,
>iotlb->addr_mask + 1,
>> @@ -1222,7 +1229,7 @@ static void
>vfio_listener_log_sync(MemoryListener *listener,
>>       int ret;
>>       Error *local_err = NULL;
>>
>> -    if (vfio_listener_skipped_section(section, false)) {
>> +    if (vfio_listener_skipped_section(section, true)) {
>>           return;
>>       }
>>
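[Editor's note: the idea in this patch — skip regions that can never be dirtied — boils down to a simple predicate. A hedged sketch; the types below are simplified stand-ins for QEMU's MemoryRegionSection/IOMMUTLBEntry, and only the IOMMU_WO write-permission concept is taken from the patch.]

```c
#include <assert.h>
#include <stdbool.h>

#define IOMMU_WO (1 << 1)   /* write-permission bit, as in IOMMUAccessFlags */

typedef struct Section { bool readonly; } Section;     /* stand-in */
typedef struct TlbEntry { int perm; } TlbEntry;        /* stand-in */

/*
 * A mapping can only become dirty if the device may write through it:
 * readonly sections and IOTLB entries without IOMMU_WO are safe to skip,
 * both when computing dirty-tracking ranges and when syncing the bitmap.
 */
bool needs_dirty_tracking(const Section *s, const TlbEntry *e)
{
    if (s && s->readonly) {
        return false;   /* readonly region: never dirty */
    }
    if (e && !(e->perm & IOMMU_WO)) {
        return false;   /* read-only mapping: never dirty */
    }
    return true;
}
```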


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 05/23] hw/pci: Introduce pci_device_get_viommu_flags()
  2025-10-28  6:57     ` Duan, Zhenzhong
@ 2025-10-28 15:19       ` Eric Auger
  0 siblings, 0 replies; 48+ messages in thread
From: Eric Auger @ 2025-10-28 15:19 UTC (permalink / raw)
  To: Duan, Zhenzhong, Cédric Le Goater, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, mst@redhat.com, jasowang@redhat.com,
	peterx@redhat.com, ddutile@redhat.com, jgg@nvidia.com,
	nicolinc@nvidia.com, skolothumtho@nvidia.com,
	joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
	Tian, Kevin, Liu, Yi L, Peng, Chao P



On 10/28/25 7:57 AM, Duan, Zhenzhong wrote:
>
>> -----Original Message-----
>> From: Cédric Le Goater <clg@redhat.com>
>> Subject: Re: [PATCH v7 05/23] hw/pci: Introduce
>> pci_device_get_viommu_flags()
>>
>> On 10/24/25 10:43, Zhenzhong Duan wrote:
>>> Introduce a new PCIIOMMUOps optional callback, get_viommu_flags(), which
>>> allows retrieving flags exposed by a vIOMMU. The first planned vIOMMU
>>> device flag is VIOMMU_FLAG_WANT_NESTING_PARENT, which advertises support
>>> of the HW nested stage translation scheme and requests cooperation from
>>> other sub-systems such as VFIO to create the nesting parent HWPT.
>>>
>>> pci_device_get_viommu_flags() is a wrapper that can be called on a PCI
>>> device potentially protected by a vIOMMU.
>>>
>>> get_viommu_flags() is designed to return a 64bit bitmap of purely vIOMMU
>>> flags which are determined only by the user's configuration, with no host
>>> capabilities involved. The reasons are:
>>>
>>> 1. the host may have heterogeneous IOMMUs, each with different capabilities
>>> 2. this is migration friendly, as the return value is consistent between
>>>     source and target.
>>>
>>> Note that this op will be invoked at the attach_device() stage, at which
>>> point host IOMMU capabilities are not yet forwarded to the vIOMMU through
>>> the set_iommu_device() callback, which is invoked after attach_device().
>>>
>>> See below sequence:
>>>
>>>    vfio_device_attach():
>>>        iommufd_cdev_attach():
>>>            pci_device_get_viommu_flags() for HW nesting cap
>>>            create a nesting parent HWPT
>>>            attach device to the HWPT
>>>            vfio_device_hiod_create_and_realize() creating hiod
>>>    ...
>>>    pci_device_set_iommu_device(hiod)
>>>
>>> Suggested-by: Yi Liu <yi.l.liu@intel.com>
>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
>>> Reviewed-by: Eric Auger <eric.auger@redhat.com>
>>> Reviewed-by: Yi Liu <yi.l.liu@intel.com>
>>> ---
>>>   MAINTAINERS          |  1 +
>>>   include/hw/iommu.h   | 25 +++++++++++++++++++++++++
>>
>> Hmm, why not under include/hw/pci/ ?
> I'm not sure it's better to restrict the IOMMU header to the PCI subsystem.
> I have a vague memory that some IOMMUs support non-PCI devices.

Effectively, on ARM we may need to support SMMU for platform devices too.

Eric
>
>> Was this discussed ?
> No.
>
> Thanks
> Zhenzhong
>
>>
>> Reviewed-by: Cédric Le Goater <clg@redhat.com>
>>
>> Thanks,
>>
>> C.
>>
>>
>>
>>>   include/hw/pci/pci.h | 22 ++++++++++++++++++++++
>>>   hw/pci/pci.c         | 11 +++++++++++
>>>   4 files changed, 59 insertions(+)
>>>   create mode 100644 include/hw/iommu.h
>>>
>>> diff --git a/MAINTAINERS b/MAINTAINERS
>>> index 36eef27b41..d94fbcbdfb 100644
>>> --- a/MAINTAINERS
>>> +++ b/MAINTAINERS
>>> @@ -2338,6 +2338,7 @@ F: include/system/iommufd.h
>>>   F: backends/host_iommu_device.c
>>>   F: include/system/host_iommu_device.h
>>>   F: include/qemu/chardev_open.h
>>> +F: include/hw/iommu.h
>>>   F: util/chardev_open.c
>>>   F: docs/devel/vfio-iommufd.rst
>>>
>>> diff --git a/include/hw/iommu.h b/include/hw/iommu.h
>>> new file mode 100644
>>> index 0000000000..9b8bb94fc2
>>> --- /dev/null
>>> +++ b/include/hw/iommu.h
>>> @@ -0,0 +1,25 @@
>>> +/*
>>> + * General vIOMMU flags
>>> + *
>>> + * Copyright (C) 2025 Intel Corporation.
>>> + *
>>> + * SPDX-License-Identifier: GPL-2.0-or-later
>>> + */
>>> +
>>> +#ifndef HW_IOMMU_H
>>> +#define HW_IOMMU_H
>>> +
>>> +#include "qemu/bitops.h"
>>> +
>>> +/*
>>> + * Theoretical vIOMMU flags. Determined only by the vIOMMU device
>>> + * properties and independent of the actual host IOMMU capabilities.
>>> + * Each flag can be an expectation of or request to another sub-system,
>>> + * or just a pure vIOMMU capability. A vIOMMU can choose which flags to
>>> + * expose.
>>> + */
>>> +enum viommu_flags {
>>> +    /* vIOMMU needs nesting parent HWPT to create nested HWPT */
>>> +    VIOMMU_FLAG_WANT_NESTING_PARENT = BIT_ULL(0),
>>> +};
>>> +
>>> +#endif /* HW_IOMMU_H */
>>> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
>>> index bde9dca8e2..cf99b5bb68 100644
>>> --- a/include/hw/pci/pci.h
>>> +++ b/include/hw/pci/pci.h
>>> @@ -462,6 +462,18 @@ typedef struct PCIIOMMUOps {
>>>        * @devfn: device and function number of the PCI device.
>>>        */
>>>       void (*unset_iommu_device)(PCIBus *bus, void *opaque, int devfn);
>>> +    /**
>>> +     * @get_viommu_flags: get vIOMMU flags
>>> +     *
>>> +     * Optional callback; if not implemented, the vIOMMU doesn't support
>>> +     * exposing flags to other sub-systems, e.g., VFIO.
>>> +     *
>>> +     * @opaque: the data passed to pci_setup_iommu().
>>> +     *
>>> +     * Returns: bitmap with each bit representing a vIOMMU flag defined in
>>> +     * enum viommu_flags.
>>> +     */
>>> +    uint64_t (*get_viommu_flags)(void *opaque);
>>>       /**
>>>        * @get_iotlb_info: get properties required to initialize a device
>> IOTLB.
>>>        *
>>> @@ -644,6 +656,16 @@ bool pci_device_set_iommu_device(PCIDevice
>> *dev, HostIOMMUDevice *hiod,
>>>                                    Error **errp);
>>>   void pci_device_unset_iommu_device(PCIDevice *dev);
>>>
>>> +/**
>>> + * pci_device_get_viommu_flags: get vIOMMU flags.
>>> + *
>>> + * Returns: bitmap with each bit representing a vIOMMU flag defined in
>>> + * enum viommu_flags. Or 0 if vIOMMU doesn't report any.
>>> + *
>>> + * @dev: PCI device pointer.
>>> + */
>>> +uint64_t pci_device_get_viommu_flags(PCIDevice *dev);
>>> +
>>>   /**
>>>    * pci_iommu_get_iotlb_info: get properties required to initialize a
>>>    * device IOTLB.
>>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
>>> index d0e81651aa..c9932c87e3 100644
>>> --- a/hw/pci/pci.c
>>> +++ b/hw/pci/pci.c
>>> @@ -3010,6 +3010,17 @@ void
>> pci_device_unset_iommu_device(PCIDevice *dev)
>>>       }
>>>   }
>>>
>>> +uint64_t pci_device_get_viommu_flags(PCIDevice *dev)
>>> +{
>>> +    PCIBus *iommu_bus;
>>> +
>>> +    pci_device_get_iommu_bus_devfn(dev, &iommu_bus, NULL, NULL);
>>> +    if (iommu_bus && iommu_bus->iommu_ops->get_viommu_flags) {
>>> +        return
>> iommu_bus->iommu_ops->get_viommu_flags(iommu_bus->iommu_opaque);
>>> +    }
>>> +    return 0;
>>> +}
>>> +
>>>   int pci_pri_request_page(PCIDevice *dev, uint32_t pasid, bool priv_req,
>>>                            bool exec_req, hwaddr addr, bool lpig,
>>>                            uint16_t prgi, bool is_read, bool is_write)



^ permalink raw reply	[flat|nested] 48+ messages in thread

* RE: [PATCH v7 10/23] intel_iommu: Check for compatibility with IOMMUFD backed device when x-flts=on
  2025-10-24 17:29   ` Cédric Le Goater
@ 2025-10-29  7:37     ` Duan, Zhenzhong
  0 siblings, 0 replies; 48+ messages in thread
From: Duan, Zhenzhong @ 2025-10-29  7:37 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, eric.auger@redhat.com, mst@redhat.com,
	jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
	jgg@nvidia.com, nicolinc@nvidia.com, skolothumtho@nvidia.com,
	joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
	Tian, Kevin, Liu, Yi L, Peng, Chao P



>-----Original Message-----
>From: Cédric Le Goater <clg@redhat.com>
>Subject: Re: [PATCH v7 10/23] intel_iommu: Check for compatibility with
>IOMMUFD backed device when x-flts=on
>
>On 10/24/25 10:43, Zhenzhong Duan wrote:
>> When the vIOMMU is configured with x-flts=on in scalable mode, the first
>> stage page table is passed to the host to construct a nested page table
>> for passthrough devices.
>>
>> We need to check the compatibility of some critical IOMMU capabilities
>> between the vIOMMU and the host IOMMU to ensure the guest first stage
>> page table can be used by the host.
>>
>> For instance, if the vIOMMU supports first stage 1GB large page mapping
>> but the host does not, then attaching this IOMMUFD backed device should
>> fail.
>>
>> Even if the checks pass, for now we willingly reject the association
>> because all the bits are not there yet; this will be relaxed at the end
>> of this series.
>>
>> Note the vIOMMU has exposed the IOMMU_HWPT_ALLOC_NEST_PARENT flag to
>> force the VFIO core to create a nesting parent HWPT; if the host doesn't
>> support nested translation, the creation will fail. So there is no need
>> to check the nested capability here.
>>
>> Reviewed-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> Reviewed-by: Eric Auger <eric.auger@redhat.com>
>> ---
>>   hw/i386/intel_iommu.c | 25 ++++++++++++++++++++++++-
>>   1 file changed, 24 insertions(+), 1 deletion(-)
>>
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index ce4c54165e..7d908cdb58 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -4636,8 +4636,31 @@ static bool vtd_check_hiod(IntelIOMMUState *s,
>HostIOMMUDevice *hiod,
>>           return true;
>>       }
>>
>> +#ifdef CONFIG_IOMMUFD
>
>
>Before using CONFIG_IOMMUFD, '#include CONFIG_DEVICES' should be done
>first. But as said earlier, this is something we wanted to avoid in the
>intel-iommu model which can have different host IOMMU backends.

Ah, yes, should have '#include CONFIG_DEVICES' in this patch.
>
>At first glance, it seems to me that these changes take the fast path
>and avoid an abstract layer. Is it too complex to keep on using
>HostIOMMUDeviceClass ?

It looks like the question in patch 13 is the same as here, so I'll reply to both here.

We can benefit from exposing IOMMUFD in the vIOMMU because, in the foreseeable
future, it's the only backend supporting nested HWPT; it's straightforward for
the vIOMMU to create a nested HWPT and do the attachment through IOMMUFD.

Most of the code guarded by CONFIG_IOMMUFD is cooperation between the vIOMMU
and the IOMMUFD backend. It's hard to abstract it with common callbacks into
VFIO, and we need to take both VT-d and SMMU into consideration.

We are using HostIOMMUDevice whenever suitable; it serves as a connection between
VFIO and the vIOMMU, and we do the capability check and call the attach/detach_dev
callbacks through it.

Thanks
Zhenzhong


^ permalink raw reply	[flat|nested] 48+ messages in thread

* RE: [PATCH v7 13/23] intel_iommu: Bind/unbind guest page table to host
  2025-10-24 17:33   ` Cédric Le Goater
@ 2025-10-29  9:56     ` Duan, Zhenzhong
  0 siblings, 0 replies; 48+ messages in thread
From: Duan, Zhenzhong @ 2025-10-29  9:56 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, eric.auger@redhat.com, mst@redhat.com,
	jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
	jgg@nvidia.com, nicolinc@nvidia.com, skolothumtho@nvidia.com,
	joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
	Tian, Kevin, Liu, Yi L, Peng, Chao P, Yi Sun



>-----Original Message-----
>From: Cédric Le Goater <clg@redhat.com>
>Subject: Re: [PATCH v7 13/23] intel_iommu: Bind/unbind guest page table to
>host
>
>On 10/24/25 10:43, Zhenzhong Duan wrote:
>> This captures guest PASID table entry modifications and propagates the
>> changes to the host to attach an HWPT whose type is determined by the
>> guest IOMMU PGTT configuration.
>>
>> When PGTT=PT, attach PASID_0 to a second stage HWPT(GPA->HPA).
>> When PGTT=FST, attach PASID_0 to nested HWPT with nesting parent HWPT
>> coming from VFIO.
>>
>> Co-Authored-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>>   include/hw/i386/intel_iommu.h |   1 +
>>   hw/i386/intel_iommu.c         | 150
>+++++++++++++++++++++++++++++++++-
>>   hw/i386/trace-events          |   3 +
>>   3 files changed, 151 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/hw/i386/intel_iommu.h
>b/include/hw/i386/intel_iommu.h
>> index 3758ac239c..b5f8a9fc29 100644
>> --- a/include/hw/i386/intel_iommu.h
>> +++ b/include/hw/i386/intel_iommu.h
>> @@ -104,6 +104,7 @@ struct VTDAddressSpace {
>>       PCIBus *bus;
>>       uint8_t devfn;
>>       uint32_t pasid;
>> +    uint32_t fs_hwpt;
>>       AddressSpace as;
>>       IOMMUMemoryRegion iommu;
>>       MemoryRegion root;          /* The root container of the
>device */
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index 871e6aad19..3789a36147 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -20,6 +20,7 @@
>>    */
>>
>>   #include "qemu/osdep.h"
>> +#include CONFIG_DEVICES /* CONFIG_IOMMUFD */
>>   #include "qemu/error-report.h"
>>   #include "qemu/main-loop.h"
>>   #include "qapi/error.h"
>> @@ -42,6 +43,9 @@
>>   #include "migration/vmstate.h"
>>   #include "trace.h"
>>   #include "system/iommufd.h"
>> +#ifdef CONFIG_IOMMUFD
>> +#include <linux/iommufd.h>
>> +#endif
>>
>>   /* context entry operations */
>>   #define PASID_0    0
>> @@ -87,6 +91,7 @@ struct vtd_iotlb_key {
>>
>>   static void vtd_address_space_refresh_all(IntelIOMMUState *s);
>>   static void vtd_address_space_unmap(VTDAddressSpace *as,
>IOMMUNotifier *n);
>> +static int vtd_bind_guest_pasid(VTDAddressSpace *vtd_as, Error **errp);
>>
>>   static void vtd_pasid_cache_reset_locked(IntelIOMMUState *s)
>>   {
>> @@ -98,7 +103,11 @@ static void
>vtd_pasid_cache_reset_locked(IntelIOMMUState *s)
>>       g_hash_table_iter_init(&as_it, s->vtd_address_spaces);
>>       while (g_hash_table_iter_next(&as_it, NULL, (void **)&vtd_as)) {
>>           VTDPASIDCacheEntry *pc_entry =
>&vtd_as->pasid_cache_entry;
>> -        pc_entry->valid = false;
>> +        if (pc_entry->valid) {
>> +            pc_entry->valid = false;
>> +            /* It's fatal to get failure during reset */
>> +            vtd_bind_guest_pasid(vtd_as, &error_fatal);
>> +        }
>>       }
>>   }
>>
>> @@ -2380,6 +2389,128 @@ static void
>vtd_context_global_invalidate(IntelIOMMUState *s)
>>       vtd_iommu_replay_all(s);
>>   }
>>
>> +#ifdef CONFIG_IOMMUFD
>> +static int vtd_create_fs_hwpt(HostIOMMUDeviceIOMMUFD *idev,
>> +                              VTDPASIDEntry *pe, uint32_t
>*fs_hwpt,
>> +                              Error **errp)
>
>Returning a bool is better. Same for the routines below.

We will need the returned error code to determine the fault event to inject
into the guest in a future fault event series.

>
>> +{
>> +    struct iommu_hwpt_vtd_s1 vtd = {};
>> +
>> +    vtd.flags = (VTD_SM_PASID_ENTRY_SRE_BIT(pe) ?
>IOMMU_VTD_S1_SRE : 0) |
>> +                (VTD_SM_PASID_ENTRY_WPE_BIT(pe) ?
>IOMMU_VTD_S1_WPE : 0) |
>> +                (VTD_SM_PASID_ENTRY_EAFE_BIT(pe) ?
>IOMMU_VTD_S1_EAFE : 0);
>> +    vtd.addr_width = vtd_pe_get_fs_aw(pe);
>> +    vtd.pgtbl_addr = (uint64_t)vtd_pe_get_fspt_base(pe);
>> +
>> +    return !iommufd_backend_alloc_hwpt(idev->iommufd, idev->devid,
>> +                                       idev->hwpt_id, 0,
>IOMMU_HWPT_DATA_VTD_S1,
>> +                                       sizeof(vtd), &vtd, fs_hwpt,
>errp);
>> +}
>> +
>> +static void vtd_destroy_old_fs_hwpt(HostIOMMUDeviceIOMMUFD *idev,
>> +                                    VTDAddressSpace *vtd_as)
>> +{
>> +    if (!vtd_as->fs_hwpt) {
>> +        return;
>> +    }
>> +    iommufd_backend_free_id(idev->iommufd, vtd_as->fs_hwpt);
>> +    vtd_as->fs_hwpt = 0;
>> +}
>> +
>> +static int vtd_device_attach_iommufd(VTDHostIOMMUDevice *vtd_hiod,
>> +                                     VTDAddressSpace *vtd_as,
>Error **errp)
>> +{
>> +    HostIOMMUDeviceIOMMUFD *idev =
>HOST_IOMMU_DEVICE_IOMMUFD(vtd_hiod->hiod);
>> +    VTDPASIDEntry *pe = &vtd_as->pasid_cache_entry.pasid_entry;
>> +    uint32_t hwpt_id;
>> +    bool ret;
>> +
>> +    /*
>> +     * We can get here only if flts=on, the supported PGTT is FST and PT.
>> +     * Catch invalid PGTT when processing invalidation request to avoid
>> +     * attaching to wrong hwpt.
>> +     */
>> +    if (!vtd_pe_pgtt_is_fst(pe) && !vtd_pe_pgtt_is_pt(pe)) {
>> +        error_setg(errp, "Invalid PGTT type");
>> +        return -EINVAL;
>> +    }
>> +
>> +    if (vtd_pe_pgtt_is_pt(pe)) {
>> +        hwpt_id = idev->hwpt_id;
>> +    } else if (vtd_create_fs_hwpt(idev, pe, &hwpt_id, errp)) {
>> +        return -EINVAL;
>> +    }
>> +
>> +    ret = host_iommu_device_iommufd_attach_hwpt(idev, hwpt_id,
>errp);
>> +    trace_vtd_device_attach_hwpt(idev->devid, vtd_as->pasid,
>hwpt_id, !ret);
>> +    if (ret) {
>> +        /* Destroy old fs_hwpt if it's a replacement */
>> +        vtd_destroy_old_fs_hwpt(idev, vtd_as);
>> +        if (vtd_pe_pgtt_is_fst(pe)) {
>> +            vtd_as->fs_hwpt = hwpt_id;
>> +        }
>> +    } else if (vtd_pe_pgtt_is_fst(pe)) {
>> +        iommufd_backend_free_id(idev->iommufd, hwpt_id);
>> +    }
>> +
>> +    return !ret;
>> +}
>> +
>> +static int vtd_device_detach_iommufd(VTDHostIOMMUDevice *vtd_hiod,
>> +                                     VTDAddressSpace *vtd_as,
>Error **errp)
>> +{
>> +    HostIOMMUDeviceIOMMUFD *idev =
>HOST_IOMMU_DEVICE_IOMMUFD(vtd_hiod->hiod);
>> +    IntelIOMMUState *s = vtd_as->iommu_state;
>> +    uint32_t pasid = vtd_as->pasid;
>> +    bool ret;
>> +
>> +    if (s->dmar_enabled && s->root_scalable) {
>> +        ret = host_iommu_device_iommufd_detach_hwpt(idev, errp);
>> +        trace_vtd_device_detach_hwpt(idev->devid, pasid, !ret);
>> +    } else {
>> +        /*
>> +         * If DMAR remapping is disabled or guest switches to legacy
>mode,
>> +         * we fallback to the default HWPT which contains shadow page
>table.
>> +         * So guest DMA could still work.
>> +         */
>> +        ret = host_iommu_device_iommufd_attach_hwpt(idev,
>idev->hwpt_id, errp);
>> +        trace_vtd_device_reattach_def_hwpt(idev->devid, pasid,
>idev->hwpt_id,
>> +                                           !ret);
>> +    }
>> +
>> +    if (ret) {
>> +        vtd_destroy_old_fs_hwpt(idev, vtd_as);
>> +    }
>> +
>> +    return !ret;
>> +}
>> +
>> +static int vtd_bind_guest_pasid(VTDAddressSpace *vtd_as, Error **errp)
>> +{
>> +    VTDPASIDCacheEntry *pc_entry = &vtd_as->pasid_cache_entry;
>> +    VTDHostIOMMUDevice *vtd_hiod = vtd_find_hiod_iommufd(vtd_as);
>> +    int ret;
>> +
>> +    /* Ignore emulated device or legacy VFIO backed device */
>> +    if (!vtd_hiod) {
>> +        return 0;
>> +    }
>> +
>> +    if (pc_entry->valid) {
>> +        ret = vtd_device_attach_iommufd(vtd_hiod, vtd_as, errp);
>> +    } else {
>> +        ret = vtd_device_detach_iommufd(vtd_hiod, vtd_as, errp);
>> +    }
>> +
>> +    return ret;
>> +}
>> +#else
>> +static int vtd_bind_guest_pasid(VTDAddressSpace *vtd_as, Error **errp)
>> +{
>> +    return 0;
>> +}
>> +#endif
>> +
>>   /* Do a context-cache device-selective invalidation.
>>    * @func_mask: FM field after shifting
>>    */
>> @@ -3134,6 +3265,8 @@ static void
>vtd_pasid_cache_sync_locked(gpointer key, gpointer value,
>>       VTDPASIDEntry pe;
>>       IOMMUNotifier *n;
>>       uint16_t did;
>> +    const char *err_prefix;
>
>Setting this prefix looks a bit fragile. Maybe add a default value here.

OK, like:

    const char *err_prefix = "Attaching to HWPT failed: ";

Thanks
Zhenzhong
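[Editor's note: the attach-path decision implemented in vtd_device_attach_iommufd() — PGTT=PT reuses the nesting parent HWPT id, PGTT=FST first allocates a nested HWPT from the guest page table, anything else is rejected — can be summarized as below. The types are simplified stand-ins, not the QEMU structures.]

```c
#include <assert.h>
#include <stdint.h>

/* Guest page-table type from the PASID entry (simplified) */
enum pgtt { PGTT_FST, PGTT_PT, PGTT_SST /* invalid with x-flts=on */ };

/*
 * Pick the HWPT id to attach for a given guest PGTT configuration.
 * Returns 0 on an invalid PGTT (ids are assumed non-zero here).
 */
uint32_t pick_hwpt(enum pgtt pgtt, uint32_t parent_hwpt_id,
                   uint32_t nested_hwpt_id)
{
    switch (pgtt) {
    case PGTT_PT:
        /* pass-through: attach directly to the nesting parent (GPA->HPA) */
        return parent_hwpt_id;
    case PGTT_FST:
        /* first stage: attach the nested HWPT built from the guest table */
        return nested_hwpt_id;
    default:
        return 0; /* caught earlier as "Invalid PGTT type" */
    }
}
```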

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 02/23] intel_iommu: Delete RPS capability related supporting code
  2025-10-24  8:43 ` [PATCH v7 02/23] intel_iommu: Delete RPS capability related supporting code Zhenzhong Duan
@ 2025-10-31  7:50   ` Eric Auger
  2025-10-31  9:49     ` Duan, Zhenzhong
  0 siblings, 1 reply; 48+ messages in thread
From: Eric Auger @ 2025-10-31  7:50 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, mst, jasowang, peterx, ddutile, jgg,
	nicolinc, skolothumtho, joao.m.martins, clement.mathieu--drif,
	kevin.tian, yi.l.liu, chao.p.peng

Hi Zhenzhong,

On 10/24/25 10:43 AM, Zhenzhong Duan wrote:
> RID-PASID Support (RPS) is not set in the vIOMMU ECAP register; the
> supporting code is there but never takes effect.
>
> Meanwhile, according to VTD spec section 3.4.3:
> "Implementations not supporting RID_PASID capability (ECAP_REG.RPS is 0b),
> use a PASID value of 0 to perform address translation for requests without
> PASID."
>
> We should delete the supporting code that fetches the RID_PASID field from
> the scalable context entry and use 0 as RID_PASID directly, because the
> RID_PASID field is ignored when there is no RPS support, according to the
> spec.
>
> This simplifies the code and doesn't bring any penalty.
>
> Suggested-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>  hw/i386/intel_iommu_internal.h |  1 -
>  hw/i386/intel_iommu.c          | 82 +++++++++++-----------------------
>  2 files changed, 27 insertions(+), 56 deletions(-)
>
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> index 75bafdf0cd..bf8fb2aa80 100644
> --- a/hw/i386/intel_iommu_internal.h
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -609,7 +609,6 @@ typedef struct VTDRootEntry VTDRootEntry;
>  #define VTD_CTX_ENTRY_LEGACY_SIZE     16
>  #define VTD_CTX_ENTRY_SCALABLE_SIZE   32
>  
> -#define VTD_SM_CONTEXT_ENTRY_RID2PASID_MASK 0xfffff
>  #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw)  (0x1e0ULL | ~VTD_HAW_MASK(aw))
>  #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1      0xffffffffffe00000ULL
>  #define VTD_SM_CONTEXT_ENTRY_PRE            0x10ULL
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 70746e3080..06065d16b6 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -42,8 +42,7 @@
>  #include "trace.h"
>  
>  /* context entry operations */
> -#define VTD_CE_GET_RID2PASID(ce) \
> -    ((ce)->val[1] & VTD_SM_CONTEXT_ENTRY_RID2PASID_MASK)
> +#define PASID_0    0
>  #define VTD_CE_GET_PASID_DIR_TABLE(ce) \
>      ((ce)->val[0] & VTD_PASID_DIR_BASE_ADDR_MASK)
>  #define VTD_CE_GET_PRE(ce) \
> @@ -963,7 +962,7 @@ static int vtd_ce_get_pasid_entry(IntelIOMMUState *s, VTDContextEntry *ce,
>      int ret = 0;
while you are at it, get rid of ret and simply return
vtd_get_pe_from_pasid_table()?
>  
>      if (pasid == PCI_NO_PASID) {
> -        pasid = VTD_CE_GET_RID2PASID(ce);
> +        pasid = PASID_0;
>      }
>      pasid_dir_base = VTD_CE_GET_PASID_DIR_TABLE(ce);
>      ret = vtd_get_pe_from_pasid_table(s, pasid_dir_base, pasid, pe);
> @@ -982,7 +981,7 @@ static int vtd_ce_get_pasid_fpd(IntelIOMMUState *s,
>      VTDPASIDEntry pe;
>  
>      if (pasid == PCI_NO_PASID) {
> -        pasid = VTD_CE_GET_RID2PASID(ce);
> +        pasid = PASID_0;
>      }
>      pasid_dir_base = VTD_CE_GET_PASID_DIR_TABLE(ce);
>  
> @@ -1522,17 +1521,15 @@ static inline int vtd_context_entry_rsvd_bits_check(IntelIOMMUState *s,
>      return 0;
>  }
>  
> -static int vtd_ce_rid2pasid_check(IntelIOMMUState *s,
> -                                  VTDContextEntry *ce)
> +static int vtd_ce_pasid_0_check(IntelIOMMUState *s, VTDContextEntry *ce)
>  {
>      VTDPASIDEntry pe;
>  
>      /*
>       * Make sure in Scalable Mode, a present context entry
> -     * has valid rid2pasid setting, which includes valid
> -     * rid2pasid field and corresponding pasid entry setting
> +     * has valid pasid entry setting at PASID_0.
>       */
> -    return vtd_ce_get_pasid_entry(s, ce, &pe, PCI_NO_PASID);
> +    return vtd_ce_get_pasid_entry(s, ce, &pe, PASID_0);
>  }
>  
>  /* Map a device to its corresponding domain (context-entry) */
> @@ -1593,12 +1590,11 @@ static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
>          }
>      } else {
>          /*
> -         * Check if the programming of context-entry.rid2pasid
> -         * and corresponding pasid setting is valid, and thus
> -         * avoids to check pasid entry fetching result in future
> -         * helper function calling.
> +         * Check if the programming of pasid setting of PASID_0
> +         * is valid, and thus avoids to check pasid entry fetching
> +         * result in future helper function calling.
>           */
> -        ret_fr = vtd_ce_rid2pasid_check(s, ce);
> +        ret_fr = vtd_ce_pasid_0_check(s, ce);
I guess you should be able to return vtd_ce_pasid_0_check(s, ce)
directly too.
>          if (ret_fr) {
>              return ret_fr;
>          }
> @@ -2110,7 +2106,6 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
>      bool reads = true;
>      bool writes = true;
>      uint8_t access_flags, pgtt;
> -    bool rid2pasid = (pasid == PCI_NO_PASID) && s->root_scalable;
>      VTDIOTLBEntry *iotlb_entry;
>      uint64_t xlat, size;
>  
> @@ -2122,21 +2117,23 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
>  
>      vtd_iommu_lock(s);
>  
> -    cc_entry = &vtd_as->context_cache_entry;
any reason why cc_entry setting was moved? Seems a spurious change.
> +    if (pasid == PCI_NO_PASID && s->root_scalable) {
> +        pasid = PASID_0;
> +    }
>  
> -    /* Try to fetch pte from IOTLB, we don't need RID2PASID logic */
> -    if (!rid2pasid) {
> -        iotlb_entry = vtd_lookup_iotlb(s, source_id, pasid, addr);
> -        if (iotlb_entry) {
> -            trace_vtd_iotlb_page_hit(source_id, addr, iotlb_entry->pte,
> -                                     iotlb_entry->domain_id);
> -            pte = iotlb_entry->pte;
> -            access_flags = iotlb_entry->access_flags;
> -            page_mask = iotlb_entry->mask;
> -            goto out;
> -        }
> +    /* Try to fetch pte from IOTLB */
> +    iotlb_entry = vtd_lookup_iotlb(s, source_id, pasid, addr);
> +    if (iotlb_entry) {
> +        trace_vtd_iotlb_page_hit(source_id, addr, iotlb_entry->pte,
> +                                 iotlb_entry->domain_id);
> +        pte = iotlb_entry->pte;
> +        access_flags = iotlb_entry->access_flags;
> +        page_mask = iotlb_entry->mask;
> +        goto out;
>      }
>  
> +    cc_entry = &vtd_as->context_cache_entry;
> +
>      /* Try to fetch context-entry from cache first */
>      if (cc_entry->context_cache_gen == s->context_cache_gen) {
>          trace_vtd_iotlb_cc_hit(bus_num, devfn, cc_entry->context_entry.hi,
> @@ -2173,10 +2170,6 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
>          cc_entry->context_cache_gen = s->context_cache_gen;
>      }
>  
> -    if (rid2pasid) {
> -        pasid = VTD_CE_GET_RID2PASID(&ce);
> -    }
> -
>      /*
>       * We don't need to translate for pass-through context entries.
>       * Also, let's ignore IOTLB caching as well for PT devices.
> @@ -2202,19 +2195,6 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
>          return true;
>      }
>  
> -    /* Try to fetch pte from IOTLB for RID2PASID slow path */
> -    if (rid2pasid) {
> -        iotlb_entry = vtd_lookup_iotlb(s, source_id, pasid, addr);
> -        if (iotlb_entry) {
> -            trace_vtd_iotlb_page_hit(source_id, addr, iotlb_entry->pte,
> -                                     iotlb_entry->domain_id);
> -            pte = iotlb_entry->pte;
> -            access_flags = iotlb_entry->access_flags;
> -            page_mask = iotlb_entry->mask;
> -            goto out;
> -        }
> -    }
> -
>      if (s->flts && s->root_scalable) {
>          ret_fr = vtd_iova_to_flpte(s, &ce, addr, is_write, &pte, &level,
>                                     &reads, &writes, s->aw_bits, pasid);
> @@ -2477,20 +2457,14 @@ static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
>          ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
>                                         vtd_as->devfn, &ce);
>          if (!ret && domain_id == vtd_get_domain_id(s, &ce, vtd_as->pasid)) {
> -            uint32_t rid2pasid = PCI_NO_PASID;
> -
> -            if (s->root_scalable) {
> -                rid2pasid = VTD_CE_GET_RID2PASID(&ce);
> -            }
> -
>              /*
>               * In legacy mode, vtd_as->pasid == pasid is always true.
>               * In scalable mode, for vtd address space backing a PCI
>               * device without pasid, needs to compare pasid with
> -             * rid2pasid of this device.
> +             * PASID_0 of this device.
>               */
>              if (!(vtd_as->pasid == pasid ||
> -                  (vtd_as->pasid == PCI_NO_PASID && pasid == rid2pasid))) {
> +                  (vtd_as->pasid == PCI_NO_PASID && pasid == PASID_0))) {
don't you need to check you are in s->root_scalable mode too?

Thanks

Eric
>                  continue;
>              }
>  
> @@ -2995,9 +2969,7 @@ static void vtd_piotlb_pasid_invalidate(IntelIOMMUState *s,
>          if (!vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
>                                        vtd_as->devfn, &ce) &&
>              domain_id == vtd_get_domain_id(s, &ce, vtd_as->pasid)) {
> -            uint32_t rid2pasid = VTD_CE_GET_RID2PASID(&ce);
> -
> -            if ((vtd_as->pasid != PCI_NO_PASID || pasid != rid2pasid) &&
> +            if ((vtd_as->pasid != PCI_NO_PASID || pasid != PASID_0) &&
>                  vtd_as->pasid != pasid) {
>                  continue;
>              }



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v7 09/23] intel_iommu: Stick to system MR for IOMMUFD backed host device when x-flts=on
  2025-10-24  8:43 ` [PATCH v7 09/23] intel_iommu: Stick to system MR for IOMMUFD backed host device when x-flts=on Zhenzhong Duan
@ 2025-10-31  8:09   ` Eric Auger
  2025-10-31  9:52     ` Duan, Zhenzhong
  0 siblings, 1 reply; 48+ messages in thread
From: Eric Auger @ 2025-10-31  8:09 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, mst, jasowang, peterx, ddutile, jgg,
	nicolinc, skolothumtho, joao.m.martins, clement.mathieu--drif,
	kevin.tian, yi.l.liu, chao.p.peng

Hi Zhenzhong,

On 10/24/25 10:43 AM, Zhenzhong Duan wrote:
> When guest enables scalable mode and setup first stage page table, we don't
> want to use IOMMU MR but rather continue using the system MR for IOMMUFD
> backed host device.
>
> Then default HWPT in VFIO contains GPA->HPA mappings which could be reused
> as nesting parent HWPT to construct nested HWPT in vIOMMU.

We had a discussion thread with Nicolin and Shameer about the usage of
AS objects for nested SMMU
(https://lore.kernel.org/all/add07edd-3652-430d-b52c-cb2bdbc7f587@redhat.com/).
If I understand correctly you also rely on the system MR for nesting. I
am not sure this is a good usage of the API/AS objects, as in practice
you have an actual translation in place (although implemented by HW)
while by using the system MR you do not reflect that. I encouraged
Shameer to try using a dummy dedicated AS that can be shared. I think it
would be better if we could align the strategies.

>
> Suggested-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> Reviewed-by: Eric Auger <eric.auger@redhat.com>
> Reviewed-by: Yi Liu <yi.l.liu@intel.com>
> ---
>  hw/i386/intel_iommu.c | 36 ++++++++++++++++++++++++++++++++++--
>  1 file changed, 34 insertions(+), 2 deletions(-)
>
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 4c83578c54..ce4c54165e 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -41,6 +41,7 @@
>  #include "migration/misc.h"
>  #include "migration/vmstate.h"
>  #include "trace.h"
> +#include "system/iommufd.h"
>  
>  /* context entry operations */
>  #define PASID_0    0
> @@ -1713,6 +1714,24 @@ static bool vtd_dev_pt_enabled(IntelIOMMUState *s, VTDContextEntry *ce,
>  
>  }
>  
> +static VTDHostIOMMUDevice *vtd_find_hiod_iommufd(VTDAddressSpace *as)
> +{
> +    IntelIOMMUState *s = as->iommu_state;
> +    struct vtd_as_key key = {
> +        .bus = as->bus,
> +        .devfn = as->devfn,
> +    };
> +    VTDHostIOMMUDevice *vtd_hiod = g_hash_table_lookup(s->vtd_host_iommu_dev,
> +                                                       &key);
> +
> +    if (vtd_hiod && vtd_hiod->hiod &&
> +        object_dynamic_cast(OBJECT(vtd_hiod->hiod),
> +                            TYPE_HOST_IOMMU_DEVICE_IOMMUFD)) {
> +        return vtd_hiod;
> +    }
> +    return NULL;
> +}
> +
>  static bool vtd_as_pt_enabled(VTDAddressSpace *as)
>  {
>      IntelIOMMUState *s;
> @@ -1738,12 +1757,25 @@ static bool vtd_as_pt_enabled(VTDAddressSpace *as)
>  /* Return whether the device is using IOMMU translation. */
>  static bool vtd_switch_address_space(VTDAddressSpace *as)
>  {
> +    IntelIOMMUState *s;
>      bool use_iommu, pt;
>  
>      assert(as);
>  
> -    use_iommu = as->iommu_state->dmar_enabled && !vtd_as_pt_enabled(as);
> -    pt = as->iommu_state->dmar_enabled && vtd_as_pt_enabled(as);
> +    s = as->iommu_state;
> +    use_iommu = s->dmar_enabled && !vtd_as_pt_enabled(as);
> +    pt = s->dmar_enabled && vtd_as_pt_enabled(as);
> +
> +    /*
> +     * When guest enables scalable mode and sets up first stage page table,
> +     * we stick to system MR for IOMMUFD backed host device. Then its
> +     * default hwpt contains GPA->HPA mappings which is used directly if
> +     * PGTT=PT and used as nesting parent if PGTT=FST. Otherwise fall back
> +     * to original processing.
According to the above comment you have an S1 translation in place, but
you set use_iommu = false and use the system MR?

Revoking my R-b for now because I am not convinced we should use the
system MR when S1+S2 is set up. I may be wrong, but at least I need more
explanations ;-)

Eric
> +     */
> +    if (s->root_scalable && s->fsts && vtd_find_hiod_iommufd(as)) {
> +        use_iommu = false;
> +    }
>  
>      trace_vtd_switch_address_space(pci_bus_num(as->bus),
>                                     VTD_PCI_SLOT(as->devfn),




* RE: [PATCH v7 02/23] intel_iommu: Delete RPS capability related supporting code
  2025-10-31  7:50   ` Eric Auger
@ 2025-10-31  9:49     ` Duan, Zhenzhong
  0 siblings, 0 replies; 48+ messages in thread
From: Duan, Zhenzhong @ 2025-10-31  9:49 UTC (permalink / raw)
  To: eric.auger@redhat.com, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
	jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
	jgg@nvidia.com, nicolinc@nvidia.com, skolothumtho@nvidia.com,
	joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
	Tian,  Kevin, Liu, Yi L, Peng, Chao P

Hi Eric,

>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: Re: [PATCH v7 02/23] intel_iommu: Delete RPS capability related
>supporting code
>
>Hi Zhenzhong,
>
>On 10/24/25 10:43 AM, Zhenzhong Duan wrote:
>> RID-PASID Support(RPS) is not set in vIOMMU ECAP register, the supporting
>> code is there but never takes effect.
>>
>> Meanwhile, according to VTD spec section 3.4.3:
>> "Implementations not supporting RID_PASID capability (ECAP_REG.RPS is
>0b),
>> use a PASID value of 0 to perform address translation for requests without
>> PASID."
>>
>> We should delete the supporting code which fetches RID_PASID field from
>> scalable context entry and use 0 as RID_PASID directly, because RID_PASID
>> field is ignored if no RPS support according to spec.
>>
>> This simplifies the code and doesn't bring any penalty.
>>
>> Suggested-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>>  hw/i386/intel_iommu_internal.h |  1 -
>>  hw/i386/intel_iommu.c          | 82 +++++++++++-----------------------
>>  2 files changed, 27 insertions(+), 56 deletions(-)
>>
>> diff --git a/hw/i386/intel_iommu_internal.h
>b/hw/i386/intel_iommu_internal.h
>> index 75bafdf0cd..bf8fb2aa80 100644
>> --- a/hw/i386/intel_iommu_internal.h
>> +++ b/hw/i386/intel_iommu_internal.h
>> @@ -609,7 +609,6 @@ typedef struct VTDRootEntry VTDRootEntry;
>>  #define VTD_CTX_ENTRY_LEGACY_SIZE     16
>>  #define VTD_CTX_ENTRY_SCALABLE_SIZE   32
>>
>> -#define VTD_SM_CONTEXT_ENTRY_RID2PASID_MASK 0xfffff
>>  #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw)  (0x1e0ULL |
>~VTD_HAW_MASK(aw))
>>  #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1
>0xffffffffffe00000ULL
>>  #define VTD_SM_CONTEXT_ENTRY_PRE            0x10ULL
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index 70746e3080..06065d16b6 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -42,8 +42,7 @@
>>  #include "trace.h"
>>
>>  /* context entry operations */
>> -#define VTD_CE_GET_RID2PASID(ce) \
>> -    ((ce)->val[1] & VTD_SM_CONTEXT_ENTRY_RID2PASID_MASK)
>> +#define PASID_0    0
>>  #define VTD_CE_GET_PASID_DIR_TABLE(ce) \
>>      ((ce)->val[0] & VTD_PASID_DIR_BASE_ADDR_MASK)
>>  #define VTD_CE_GET_PRE(ce) \
>> @@ -963,7 +962,7 @@ static int vtd_ce_get_pasid_entry(IntelIOMMUState
>*s, VTDContextEntry *ce,
>>      int ret = 0;
>while you are at it, get rid of ret and simply return
>vtd_get_pe_from_pasid_table()?

Sure

>>
>>      if (pasid == PCI_NO_PASID) {
>> -        pasid = VTD_CE_GET_RID2PASID(ce);
>> +        pasid = PASID_0;
>>      }
>>      pasid_dir_base = VTD_CE_GET_PASID_DIR_TABLE(ce);
>>      ret = vtd_get_pe_from_pasid_table(s, pasid_dir_base, pasid, pe);
>> @@ -982,7 +981,7 @@ static int vtd_ce_get_pasid_fpd(IntelIOMMUState
>*s,
>>      VTDPASIDEntry pe;
>>
>>      if (pasid == PCI_NO_PASID) {
>> -        pasid = VTD_CE_GET_RID2PASID(ce);
>> +        pasid = PASID_0;
>>      }
>>      pasid_dir_base = VTD_CE_GET_PASID_DIR_TABLE(ce);
>>
>> @@ -1522,17 +1521,15 @@ static inline int
>vtd_context_entry_rsvd_bits_check(IntelIOMMUState *s,
>>      return 0;
>>  }
>>
>> -static int vtd_ce_rid2pasid_check(IntelIOMMUState *s,
>> -                                  VTDContextEntry *ce)
>> +static int vtd_ce_pasid_0_check(IntelIOMMUState *s, VTDContextEntry
>*ce)
>>  {
>>      VTDPASIDEntry pe;
>>
>>      /*
>>       * Make sure in Scalable Mode, a present context entry
>> -     * has valid rid2pasid setting, which includes valid
>> -     * rid2pasid field and corresponding pasid entry setting
>> +     * has valid pasid entry setting at PASID_0.
>>       */
>> -    return vtd_ce_get_pasid_entry(s, ce, &pe, PCI_NO_PASID);
>> +    return vtd_ce_get_pasid_entry(s, ce, &pe, PASID_0);
>>  }
>>
>>  /* Map a device to its corresponding domain (context-entry) */
>> @@ -1593,12 +1590,11 @@ static int
>vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
>>          }
>>      } else {
>>          /*
>> -         * Check if the programming of context-entry.rid2pasid
>> -         * and corresponding pasid setting is valid, and thus
>> -         * avoids to check pasid entry fetching result in future
>> -         * helper function calling.
>> +         * Check if the programming of pasid setting of PASID_0
>> +         * is valid, and thus avoids to check pasid entry fetching
>> +         * result in future helper function calling.
>>           */
>> -        ret_fr = vtd_ce_rid2pasid_check(s, ce);
>> +        ret_fr = vtd_ce_pasid_0_check(s, ce);
>I guess you should be able to return vtd_ce_pasid_0_check(s, ce)
>directly too.

Yes.

>>          if (ret_fr) {
>>              return ret_fr;
>>          }
>> @@ -2110,7 +2106,6 @@ static bool
>vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
>>      bool reads = true;
>>      bool writes = true;
>>      uint8_t access_flags, pgtt;
>> -    bool rid2pasid = (pasid == PCI_NO_PASID) && s->root_scalable;
>>      VTDIOTLBEntry *iotlb_entry;
>>      uint64_t xlat, size;
>>
>> @@ -2122,21 +2117,23 @@ static bool
>vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
>>
>>      vtd_iommu_lock(s);
>>
>> -    cc_entry = &vtd_as->context_cache_entry;
>any reason why cc_entry setting was moved? Seems a spurious change.

I'd like to initialize cc_entry right before it is dereferenced; there
is no need to initialize it earlier because we may 'goto out' before it
is used.

>> +    if (pasid == PCI_NO_PASID && s->root_scalable) {
>> +        pasid = PASID_0;
>> +    }
>>
>> -    /* Try to fetch pte from IOTLB, we don't need RID2PASID logic */
>> -    if (!rid2pasid) {
>> -        iotlb_entry = vtd_lookup_iotlb(s, source_id, pasid, addr);
>> -        if (iotlb_entry) {
>> -            trace_vtd_iotlb_page_hit(source_id, addr, iotlb_entry->pte,
>> -                                     iotlb_entry->domain_id);
>> -            pte = iotlb_entry->pte;
>> -            access_flags = iotlb_entry->access_flags;
>> -            page_mask = iotlb_entry->mask;
>> -            goto out;
>> -        }
>> +    /* Try to fetch pte from IOTLB */
>> +    iotlb_entry = vtd_lookup_iotlb(s, source_id, pasid, addr);
>> +    if (iotlb_entry) {
>> +        trace_vtd_iotlb_page_hit(source_id, addr, iotlb_entry->pte,
>> +                                 iotlb_entry->domain_id);
>> +        pte = iotlb_entry->pte;
>> +        access_flags = iotlb_entry->access_flags;
>> +        page_mask = iotlb_entry->mask;
>> +        goto out;
>>      }
>>
>> +    cc_entry = &vtd_as->context_cache_entry;
>> +
>>      /* Try to fetch context-entry from cache first */
>>      if (cc_entry->context_cache_gen == s->context_cache_gen) {
>>          trace_vtd_iotlb_cc_hit(bus_num, devfn,
>cc_entry->context_entry.hi,
>> @@ -2173,10 +2170,6 @@ static bool
>vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
>>          cc_entry->context_cache_gen = s->context_cache_gen;
>>      }
>>
>> -    if (rid2pasid) {
>> -        pasid = VTD_CE_GET_RID2PASID(&ce);
>> -    }
>> -
>>      /*
>>       * We don't need to translate for pass-through context entries.
>>       * Also, let's ignore IOTLB caching as well for PT devices.
>> @@ -2202,19 +2195,6 @@ static bool
>vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
>>          return true;
>>      }
>>
>> -    /* Try to fetch pte from IOTLB for RID2PASID slow path */
>> -    if (rid2pasid) {
>> -        iotlb_entry = vtd_lookup_iotlb(s, source_id, pasid, addr);
>> -        if (iotlb_entry) {
>> -            trace_vtd_iotlb_page_hit(source_id, addr, iotlb_entry->pte,
>> -                                     iotlb_entry->domain_id);
>> -            pte = iotlb_entry->pte;
>> -            access_flags = iotlb_entry->access_flags;
>> -            page_mask = iotlb_entry->mask;
>> -            goto out;
>> -        }
>> -    }
>> -
>>      if (s->flts && s->root_scalable) {
>>          ret_fr = vtd_iova_to_flpte(s, &ce, addr, is_write, &pte, &level,
>>                                     &reads, &writes, s->aw_bits,
>pasid);
>> @@ -2477,20 +2457,14 @@ static void
>vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
>>          ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
>>                                         vtd_as->devfn, &ce);
>>          if (!ret && domain_id == vtd_get_domain_id(s, &ce,
>vtd_as->pasid)) {
>> -            uint32_t rid2pasid = PCI_NO_PASID;
>> -
>> -            if (s->root_scalable) {
>> -                rid2pasid = VTD_CE_GET_RID2PASID(&ce);
>> -            }
>> -
>>              /*
>>               * In legacy mode, vtd_as->pasid == pasid is always true.
>>               * In scalable mode, for vtd address space backing a PCI
>>               * device without pasid, needs to compare pasid with
>> -             * rid2pasid of this device.
>> +             * PASID_0 of this device.
>>               */
>>              if (!(vtd_as->pasid == pasid ||
>> -                  (vtd_as->pasid == PCI_NO_PASID && pasid ==
>rid2pasid))) {
>> +                  (vtd_as->pasid == PCI_NO_PASID && pasid ==
>PASID_0))) {
>don't you need to check you are in s->root_scalable mode too?

I think there is no need: this combined check handles both scalable and
legacy modes. If s->root_scalable=false, pasid is always PCI_NO_PASID,
so 'vtd_as->pasid == pasid' becomes 'vtd_as->pasid == PCI_NO_PASID',
which is a superset of the remaining check.

So the remaining check only takes effect when s->root_scalable=true.

Thanks
Zhenzhong


* RE: [PATCH v7 09/23] intel_iommu: Stick to system MR for IOMMUFD backed host device when x-flts=on
  2025-10-31  8:09   ` Eric Auger
@ 2025-10-31  9:52     ` Duan, Zhenzhong
  0 siblings, 0 replies; 48+ messages in thread
From: Duan, Zhenzhong @ 2025-10-31  9:52 UTC (permalink / raw)
  To: eric.auger@redhat.com, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
	jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
	jgg@nvidia.com, nicolinc@nvidia.com, skolothumtho@nvidia.com,
	joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
	Tian,  Kevin, Liu, Yi L, Peng, Chao P

Hi Eric,

>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: Re: [PATCH v7 09/23] intel_iommu: Stick to system MR for
>IOMMUFD backed host device when x-flts=on
>
>Hi Zhenzhong,
>
>On 10/24/25 10:43 AM, Zhenzhong Duan wrote:
>> When guest enables scalable mode and setup first stage page table, we
>don't
>> want to use IOMMU MR but rather continue using the system MR for
>IOMMUFD
>> backed host device.
>>
>> Then default HWPT in VFIO contains GPA->HPA mappings which could be
>reused
>> as nesting parent HWPT to construct nested HWPT in vIOMMU.
>
>we had a discussion thread with Nicolin and Shameer about usage of AS
>for nested SMMU
>(https://lore.kernel.org/all/add07edd-3652-430d-b52c-cb2bdbc7f587@redha
>t.com/)
>If I understand correctly you also rely on system MR for nested. I am
>not sure this is a good usage of the API/AS objects as in practice you
>have an actual translation in place (although implemented by HW) while by
>using the system MR you do not reflect that. I encouraged Shameer to try
>using a dummy dedicated AS that can be shared. I think it would be
>better if we could align the strategies.

Hmm, I think it's hard for VT-d to use a dedicated AS the way SMMU does,
because VT-d still supports legacy mode even with
'x-scalable-mode=on,x-flts=on', and we don't know the guest's choice at
runtime. So we always return the IOMMU AS to VFIO; we should never
return address_space_memory or a dedicated AS in vtd_find_add_as().

There was an earlier discussion with Nicolin and Yi Liu on this:

https://lore.kernel.org/qemu-devel/SJ0PR11MB6744340B889FF65D3BD5B8459267A@SJ0PR11MB6744.namprd11.prod.outlook.com/

>
>>
>> Suggested-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> Reviewed-by: Eric Auger <eric.auger@redhat.com>
>> Reviewed-by: Yi Liu <yi.l.liu@intel.com>
>> ---
>>  hw/i386/intel_iommu.c | 36 ++++++++++++++++++++++++++++++++++--
>>  1 file changed, 34 insertions(+), 2 deletions(-)
>>
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index 4c83578c54..ce4c54165e 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -41,6 +41,7 @@
>>  #include "migration/misc.h"
>>  #include "migration/vmstate.h"
>>  #include "trace.h"
>> +#include "system/iommufd.h"
>>
>>  /* context entry operations */
>>  #define PASID_0    0
>> @@ -1713,6 +1714,24 @@ static bool
>vtd_dev_pt_enabled(IntelIOMMUState *s, VTDContextEntry *ce,
>>
>>  }
>>
>> +static VTDHostIOMMUDevice *vtd_find_hiod_iommufd(VTDAddressSpace
>*as)
>> +{
>> +    IntelIOMMUState *s = as->iommu_state;
>> +    struct vtd_as_key key = {
>> +        .bus = as->bus,
>> +        .devfn = as->devfn,
>> +    };
>> +    VTDHostIOMMUDevice *vtd_hiod =
>g_hash_table_lookup(s->vtd_host_iommu_dev,
>> +                                                       &key);
>> +
>> +    if (vtd_hiod && vtd_hiod->hiod &&
>> +        object_dynamic_cast(OBJECT(vtd_hiod->hiod),
>> +
>TYPE_HOST_IOMMU_DEVICE_IOMMUFD)) {
>> +        return vtd_hiod;
>> +    }
>> +    return NULL;
>> +}
>> +
>>  static bool vtd_as_pt_enabled(VTDAddressSpace *as)
>>  {
>>      IntelIOMMUState *s;
>> @@ -1738,12 +1757,25 @@ static bool
>vtd_as_pt_enabled(VTDAddressSpace *as)
>>  /* Return whether the device is using IOMMU translation. */
>>  static bool vtd_switch_address_space(VTDAddressSpace *as)
>>  {
>> +    IntelIOMMUState *s;
>>      bool use_iommu, pt;
>>
>>      assert(as);
>>
>> -    use_iommu = as->iommu_state->dmar_enabled
>&& !vtd_as_pt_enabled(as);
>> -    pt = as->iommu_state->dmar_enabled && vtd_as_pt_enabled(as);
>> +    s = as->iommu_state;
>> +    use_iommu = s->dmar_enabled && !vtd_as_pt_enabled(as);
>> +    pt = s->dmar_enabled && vtd_as_pt_enabled(as);
>> +
>> +    /*
>> +     * When guest enables scalable mode and sets up first stage page
>table,
>> +     * we stick to system MR for IOMMUFD backed host device. Then its
>> +     * default hwpt contains GPA->HPA mappings which is used directly if
>> +     * PGTT=PT and used as nesting parent if PGTT=FST. Otherwise fall
>back
>> +     * to original processing.
>According to the above comment you have a S1 translation in place but
>you set use_iommu = false and use system MR?

Yes, with nesting we have extended the usage of the MRs under the IOMMU
AS. In nesting mode, whether the system MR is on or off is no longer
aligned with whether an S1 translation is in place.

>
>Revoking my R-bs for now because I am not convinced we shall use system
>MR when S1+S2 is setup. I may be wrong but at least I need more
>explanations ;-)

Okay, let's discuss further.

Thanks
Zhenzhong


end of thread [~2025-10-31  9:52 UTC]

Thread overview: 48+ messages
2025-10-24  8:43 [PATCH v7 00/23] intel_iommu: Enable first stage translation for passthrough device Zhenzhong Duan
2025-10-24  8:43 ` [PATCH v7 01/23] intel_iommu: Rename vtd_ce_get_rid2pasid_entry to vtd_ce_get_pasid_entry Zhenzhong Duan
2025-10-24  8:43 ` [PATCH v7 02/23] intel_iommu: Delete RPS capability related supporting code Zhenzhong Duan
2025-10-31  7:50   ` Eric Auger
2025-10-31  9:49     ` Duan, Zhenzhong
2025-10-24  8:43 ` [PATCH v7 03/23] intel_iommu: Update terminology to match VTD spec Zhenzhong Duan
2025-10-24  8:43 ` [PATCH v7 04/23] hw/pci: Export pci_device_get_iommu_bus_devfn() and return bool Zhenzhong Duan
2025-10-24  8:43 ` [PATCH v7 05/23] hw/pci: Introduce pci_device_get_viommu_flags() Zhenzhong Duan
2025-10-24 17:18   ` Cédric Le Goater
2025-10-28  6:57     ` Duan, Zhenzhong
2025-10-28 15:19       ` Eric Auger
2025-10-24  8:43 ` [PATCH v7 06/23] intel_iommu: Implement get_viommu_flags() callback Zhenzhong Duan
2025-10-24  8:43 ` [PATCH v7 07/23] intel_iommu: Introduce a new structure VTDHostIOMMUDevice Zhenzhong Duan
2025-10-24  8:43 ` [PATCH v7 08/23] vfio/iommufd: Force creating nesting parent HWPT Zhenzhong Duan
2025-10-24 16:23   ` Cédric Le Goater
2025-10-28  6:00     ` Duan, Zhenzhong
2025-10-24  8:43 ` [PATCH v7 09/23] intel_iommu: Stick to system MR for IOMMUFD backed host device when x-flts=on Zhenzhong Duan
2025-10-31  8:09   ` Eric Auger
2025-10-31  9:52     ` Duan, Zhenzhong
2025-10-24  8:43 ` [PATCH v7 10/23] intel_iommu: Check for compatibility with IOMMUFD backed " Zhenzhong Duan
2025-10-24 17:29   ` Cédric Le Goater
2025-10-29  7:37     ` Duan, Zhenzhong
2025-10-24  8:43 ` [PATCH v7 11/23] intel_iommu: Fail passthrough device under PCI bridge if x-flts=on Zhenzhong Duan
2025-10-24  8:43 ` [PATCH v7 12/23] intel_iommu: Add some macros and inline functions Zhenzhong Duan
2025-10-24 16:39   ` Cédric Le Goater
2025-10-28  6:01     ` Duan, Zhenzhong
2025-10-24  8:43 ` [PATCH v7 13/23] intel_iommu: Bind/unbind guest page table to host Zhenzhong Duan
2025-10-24 17:01   ` Cédric Le Goater
2025-10-24 17:33   ` Cédric Le Goater
2025-10-29  9:56     ` Duan, Zhenzhong
2025-10-24  8:43 ` [PATCH v7 14/23] intel_iommu: Propagate PASID-based iotlb invalidation " Zhenzhong Duan
2025-10-24  8:43 ` [PATCH v7 15/23] intel_iommu: Replay all pasid bindings when either SRTP or TE bit is changed Zhenzhong Duan
2025-10-24  8:43 ` [PATCH v7 16/23] intel_iommu: Replay pasid bindings after context cache invalidation Zhenzhong Duan
2025-10-24  8:43 ` [PATCH v7 17/23] iommufd: Introduce a helper function to extract vendor capabilities Zhenzhong Duan
2025-10-24 16:44   ` Cédric Le Goater
2025-10-28  9:43     ` Duan, Zhenzhong
2025-10-24 17:34   ` Cédric Le Goater
2025-10-28  9:28     ` Duan, Zhenzhong
2025-10-24  8:43 ` [PATCH v7 18/23] vfio: Add a new element bypass_ro in VFIOContainer Zhenzhong Duan
2025-10-24  8:43 ` [PATCH v7 19/23] Workaround for ERRATA_772415_SPR17 Zhenzhong Duan
2025-10-24 17:36   ` Cédric Le Goater
2025-10-24 17:38   ` Cédric Le Goater
2025-10-24  8:43 ` [PATCH v7 20/23] vfio: Bypass readonly region for dirty tracking Zhenzhong Duan
2025-10-24 16:32   ` Cédric Le Goater
2025-10-28  9:47     ` Duan, Zhenzhong
2025-10-24  8:43 ` [PATCH v7 21/23] intel_iommu: Add migration support with x-flts=on Zhenzhong Duan
2025-10-24  8:43 ` [PATCH v7 22/23] intel_iommu: Enable host device when x-flts=on in scalable mode Zhenzhong Duan
2025-10-24  8:43 ` [PATCH v7 23/23] docs/devel: Add IOMMUFD nesting documentation Zhenzhong Duan
