qemu-devel.nongnu.org archive mirror
* [PATCH v6 00/22] intel_iommu: Enable first stage translation for passthrough device
@ 2025-09-18  8:57 Zhenzhong Duan
  2025-09-18  8:57 ` [PATCH v6 01/22] intel_iommu: Rename vtd_ce_get_rid2pasid_entry to vtd_ce_get_pasid_entry Zhenzhong Duan
                   ` (21 more replies)
  0 siblings, 22 replies; 57+ messages in thread
From: Zhenzhong Duan @ 2025-09-18  8:57 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, skolothumtho, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan

Hi,

For a passthrough device with intel_iommu.x-flts=on, we don't shadow the
guest page table; instead, the first stage page table is passed to the host
side to construct a nested HWPT. There was an earlier effort to enable this
feature, see [1] for details.

The key design is to utilize the dual-stage IOMMU translation (also known as
IOMMU nested translation) capability of the host IOMMU. As the diagram below
shows, the guest I/O page table pointer in GPA (guest physical address) is
passed to the host and used to perform the first stage address translation.
Along with it, any modification of present mappings in the guest I/O page
table must be followed by an IOTLB invalidation.

        .-------------.  .---------------------------.
        |   vIOMMU    |  | Guest I/O page table      |
        |             |  '---------------------------'
        .----------------/
        | PASID Entry |--- PASID cache flush --+
        '-------------'                        |
        |             |                        V
        |             |           I/O page table pointer in GPA
        '-------------'
    Guest
    ------| Shadow |---------------------------|--------
          v        v                           v
    Host
        .-------------.  .-----------------------------.
        |   pIOMMU    |  | First stage for GIOVA->GPA  |
        |             |  '-----------------------------'
        .----------------/  |
        | PASID Entry |     V (Nested xlate)
        '----------------\.--------------------------------------------.
        |             |   | Second stage for GPA->HPA, unmanaged domain|
        |             |   '--------------------------------------------'
        '-------------'
<Intel VT-d Nested translation>
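
The two-stage composition above can be sketched as a toy model (the stage
functions and frame offsets are made up for illustration; real VT-d walks
multi-level I/O page tables):

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of VT-d nested translation: each stage is reduced to a
 * simple page-frame remapping function. Names and offsets are
 * illustrative only, not real VT-d behavior. */

/* First stage (guest-owned): GIOVA -> GPA */
static uint64_t stage1_xlate(uint64_t giova_pfn)
{
    return giova_pfn + 0x100;   /* pretend guest mapping */
}

/* Second stage (host-owned, the nesting parent HWPT): GPA -> HPA */
static uint64_t stage2_xlate(uint64_t gpa_pfn)
{
    return gpa_pfn + 0x1000;    /* pretend host mapping */
}

/* Nested translation composes the two stages: every first-stage
 * output (a GPA) is itself translated by the second stage. */
static uint64_t nested_xlate(uint64_t giova_pfn)
{
    return stage2_xlate(stage1_xlate(giova_pfn));
}
```

This is why the guest page table pointer can stay in GPA: the hardware
resolves each GPA it encounters through the second stage.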

This series reuses the VFIO device's default HWPT as the nesting parent
instead of creating a new one. This avoids duplicating code in a new memory
listener, and all existing features of the VFIO listener can be shared,
e.g., RAM discard, dirty tracking, etc. There are two limitations: 1) a
VFIO device under a PCI bridge together with an emulated device is not
supported, because the emulated device wants the IOMMU AS while the VFIO
device sticks to the system AS; 2) kexec or reboot from
"intel_iommu=on,sm_on" to "intel_iommu=on,sm_off" is not supported on
platforms with ERRATA_772415_SPR17, because the VFIO device's default HWPT
is created with the NEST_PARENT flag and the kernel inhibits RO mappings
when switching to shadow mode.

This series is also prerequisite work for vSVA, i.e., sharing guest
application address space with passthrough devices.

There are some interactions between VFIO and the vIOMMU:
* vIOMMU registers the PCIIOMMUOps [set|unset]_iommu_device callbacks with
  the PCI subsystem. VFIO calls them to register/unregister a
  HostIOMMUDevice instance with the vIOMMU at VFIO device realize stage.
* vIOMMU registers the PCIIOMMUOps get_viommu_flags callback with the PCI
  subsystem. VFIO calls it to get the flags exposed by the vIOMMU.
* vIOMMU calls the HostIOMMUDeviceIOMMUFD interface [at|de]tach_hwpt to
  bind/unbind the device to IOMMUFD backed domains, nested or not.

See below diagram:

        VFIO Device                                 Intel IOMMU
    .-----------------.                         .-------------------.
    |                 |                         |                   |
    |       .---------|PCIIOMMUOps              |.-------------.    |
    |       | IOMMUFD |(set/unset_iommu_device) || Host IOMMU  |    |
    |       | Device  |------------------------>|| Device list |    |
    |       .---------|(get_viommu_flags)       |.-------------.    |
    |                 |                         |       |           |
    |                 |                         |       V           |
    |       .---------|  HostIOMMUDeviceIOMMUFD |  .-------------.  |
    |       | IOMMUFD |            (attach_hwpt)|  | Host IOMMU  |  |
    |       | link    |<------------------------|  |   Device    |  |
    |       .---------|            (detach_hwpt)|  .-------------.  |
    |                 |                         |       |           |
    |                 |                         |       ...         |
    .-----------------.                         .-------------------.
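
The registration/callback flow can be sketched in a few lines of C. The
struct and callback names mirror QEMU's PCIIOMMUOps, but all types, the
flag value, and the helper names below are stand-ins for illustration, not
the real QEMU definitions:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in types; the real ones live in QEMU's hw/pci and backends code */
typedef struct HostIOMMUDevice { int devid; } HostIOMMUDevice;

typedef struct PCIIOMMUOps {
    /* registered by the vIOMMU, called by VFIO at device realize */
    void (*set_iommu_device)(void *viommu, HostIOMMUDevice *hiod);
    void (*unset_iommu_device)(void *viommu, HostIOMMUDevice *hiod);
    /* called by VFIO to learn what the vIOMMU exposes */
    uint64_t (*get_viommu_flags)(void *viommu);
} PCIIOMMUOps;

#define VIOMMU_FLAG_HW_NESTED (1ULL << 0)  /* illustrative flag value */

typedef struct VIOMMU {
    HostIOMMUDevice *dev;      /* "Host IOMMU Device list" of one */
} VIOMMU;

static void viommu_set_dev(void *opaque, HostIOMMUDevice *hiod)
{
    ((VIOMMU *)opaque)->dev = hiod;  /* track the host device */
}

static void viommu_unset_dev(void *opaque, HostIOMMUDevice *hiod)
{
    ((VIOMMU *)opaque)->dev = NULL;  /* drop it at unrealize */
}

static uint64_t viommu_flags(void *opaque)
{
    return VIOMMU_FLAG_HW_NESTED;    /* vIOMMU advertises HW nesting */
}

static const PCIIOMMUOps ops = {
    .set_iommu_device   = viommu_set_dev,
    .unset_iommu_device = viommu_unset_dev,
    .get_viommu_flags   = viommu_flags,
};

/* Emulates VFIO's side of the handshake during realize/unrealize */
static int demo(void)
{
    VIOMMU v = { NULL };
    HostIOMMUDevice d = { 42 };

    ops.set_iommu_device(&v, &d);
    int ok = (v.dev == &d) &&
             (ops.get_viommu_flags(&v) & VIOMMU_FLAG_HW_NESTED);
    ops.unset_iommu_device(&v, &d);
    return ok && v.dev == NULL;
}
```

The direction of each arrow in the diagram corresponds to who owns the
function pointer: VFIO invokes the PCIIOMMUOps callbacks, while the vIOMMU
invokes the HostIOMMUDeviceIOMMUFD attach/detach interface.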

Below is an example of enabling first stage translation for a passthrough device:

    -M q35,...
    -device intel-iommu,x-scalable-mode=on,x-flts=on...
    -object iommufd,id=iommufd0 -device vfio-pci,iommufd=iommufd0,...

Tests done:
- VFIO devices hotplug/unplug
- different VFIO devices linked to different iommufds
- vhost net device ping test

PATCH01-09: Some preparing work
PATCH10-11: Compatibility check between vIOMMU and Host IOMMU
PATCH12-17: Implement first stage page table for passthrough device
PATCH18-20: Workaround for ERRATA_772415_SPR17
PATCH21:    Enable first stage translation for passthrough device
PATCH22:    Add doc

QEMU code can be found at [2].

Fault event injection into the guest isn't supported in this series; we
presume the guest kernel always constructs correct first stage page tables
for passthrough devices. For emulated devices, the emulation code already
provides first stage fault injection.

TODO:
- Fault event injection into the guest when the HW first stage page table faults

[1] https://patchwork.kernel.org/project/kvm/cover/20210302203827.437645-1-yi.l.liu@intel.com/
[2] https://github.com/yiliu1765/qemu/tree/zhenzhong/iommufd_nesting.v6

Thanks
Zhenzhong

Changelog:
v6:
- delete RPS capability related supporting code (Eric, Yi)
- use terminology 'first/second stage' to replace 'first/second level' (Eric, Yi)
- use get_viommu_flags() instead of get_viommu_caps() (Nicolin)
- drop non-RID_PASID related code and simplify pasid invalidation handling (Eric, Yi)
- drop the patch that handle pasid replay when context invalidation (Eric)
- move vendor specific cap check from VFIO core to backend/iommufd.c (Nicolin)

v5:
- refine commit log of patch2 (Cedric, Nicolin)
- introduce helper vfio_pci_from_vfio_device() (Cedric)
- introduce helper vfio_device_viommu_get_nested() (Cedric)
- pass 'bool bypass_ro' argument to vfio_listener_valid_section() instead of 'VFIOContainerBase *' (Cedric)
- fix a potential build error reported by Jim Shu

v4:
- s/VIOMMU_CAP_STAGE1/VIOMMU_CAP_HW_NESTED (Eric, Nicolin, Donald, Shameer)
- clarify get_viommu_cap() return pure emulated caps and explain reason in commit log (Eric)
- retrieve the ce only if vtd_as->pasid in vtd_as_to_iommu_pasid_locked (Eric)
- refine doc comment and commit log in patch10-11 (Eric)

v3:
- define enum type for VIOMMU_CAP_* (Eric)
- drop inline flag in the patch which uses the helper (Eric)
- use extract64 in new introduced MACRO (Eric)
- polish comments and fix typo error (Eric)
- split workaround patch for ERRATA_772415_SPR17 to two patches (Eric)
- optimize bind/unbind error path processing

v2:
- introduce get_viommu_cap() to get STAGE1 flag to create nesting parent HWPT (Liuyi)
- reuse VFIO's default HWPT as parent HWPT of nested translation (Nicolin, Liuyi)
- abandon support of VFIO device under pcie-to-pci bridge to simplify design (Liuyi)
- bypass RO mapping in VFIO's default HWPT if ERRATA_772415_SPR17 (Liuyi)
- drop vtd_dev_to_context_entry optimization (Liuyi)

v1:
- simplify vendor specific checking in vtd_check_hiod (Cedric, Nicolin)
- rebase to master

rfcv3:
- s/hwpt_id/id in iommufd_backend_invalidate_cache()'s parameter (Shameer)
- hide vtd vendor specific caps in a wrapper union (Eric, Nicolin)
- simplify return value check of get_cap() (Eric)
- drop realize_late (Cedric, Eric)
- split patch13:intel_iommu: Add PASID cache management infrastructure (Eric)
- s/vtd_pasid_cache_reset/vtd_pasid_cache_reset_locked (Eric)
- s/vtd_pe_get_domain_id/vtd_pe_get_did (Eric)
- refine comments (Eric, Donald)

rfcv2:
- Drop VTDPASIDAddressSpace and use VTDAddressSpace (Eric, Liuyi)
- Move HWPT uAPI patches ahead(patch1-8) so arm nesting could easily rebase
- add two cleanup patches(patch9-10)
- VFIO passes iommufd/devid/hwpt_id to vIOMMU instead of iommufd/devid/ioas_id
- add vtd_as_[from|to]_iommu_pasid() helper to translate between vtd_as and
  iommu pasid, this is important for dropping VTDPASIDAddressSpace

Yi Liu (2):
  intel_iommu: Propagate PASID-based iotlb invalidation to host
  intel_iommu: Replay all pasid bindings when either SRTP or TE bit is
    changed

Zhenzhong Duan (20):
  intel_iommu: Rename vtd_ce_get_rid2pasid_entry to
    vtd_ce_get_pasid_entry
  intel_iommu: Delete RPS capability related supporting code
  intel_iommu: Update terminology to match VTD spec
  hw/pci: Export pci_device_get_iommu_bus_devfn() and return bool
  hw/pci: Introduce pci_device_get_viommu_flags()
  intel_iommu: Implement get_viommu_flags() callback
  intel_iommu: Introduce a new structure VTDHostIOMMUDevice
  vfio/iommufd: Force creating nesting parent HWPT
  intel_iommu: Stick to system MR for IOMMUFD backed host device when
    x-fls=on
  intel_iommu: Check for compatibility with IOMMUFD backed device when
    x-flts=on
  intel_iommu: Fail passthrough device under PCI bridge if x-flts=on
  intel_iommu: Handle PASID cache invalidation
  intel_iommu: Reset pasid cache when system level reset
  intel_iommu: Add some macros and inline functions
  intel_iommu: Bind/unbind guest page table to host
  iommufd: Introduce a helper function to extract vendor capabilities
  vfio: Add a new element bypass_ro in VFIOContainerBase
  Workaround for ERRATA_772415_SPR17
  intel_iommu: Enable host device when x-flts=on in scalable mode
  docs/devel: Add IOMMUFD nesting documentation

 MAINTAINERS                           |   1 +
 docs/devel/vfio-iommufd.rst           |  24 +
 hw/i386/intel_iommu_internal.h        | 100 ++-
 include/hw/i386/intel_iommu.h         |  11 +-
 include/hw/iommu.h                    |  24 +
 include/hw/pci/pci.h                  |  29 +
 include/hw/vfio/vfio-container-base.h |   1 +
 include/hw/vfio/vfio-device.h         |   2 +
 include/system/host_iommu_device.h    |  16 +
 backends/iommufd.c                    |  13 +
 hw/i386/intel_iommu.c                 | 848 ++++++++++++++++++++------
 hw/pci/pci.c                          |  23 +-
 hw/vfio/device.c                      |  12 +
 hw/vfio/iommufd.c                     |  19 +-
 hw/vfio/listener.c                    |  21 +-
 tests/qtest/intel-iommu-test.c        |   4 +-
 hw/i386/trace-events                  |   7 +
 17 files changed, 927 insertions(+), 228 deletions(-)
 create mode 100644 include/hw/iommu.h

-- 
2.47.1




* [PATCH v6 01/22] intel_iommu: Rename vtd_ce_get_rid2pasid_entry to vtd_ce_get_pasid_entry
  2025-09-18  8:57 [PATCH v6 00/22] intel_iommu: Enable first stage translation for passthrough device Zhenzhong Duan
@ 2025-09-18  8:57 ` Zhenzhong Duan
  2025-09-18  8:57 ` [PATCH v6 02/22] intel_iommu: Delete RPS capability related supporting code Zhenzhong Duan
                   ` (20 subsequent siblings)
  21 siblings, 0 replies; 57+ messages in thread
From: Zhenzhong Duan @ 2025-09-18  8:57 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, skolothumtho, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan

In the early days vtd_ce_get_rid2pasid_entry() was used to get the pasid
entry of rid2pasid; later it was extended to get any pasid entry. So the
new name vtd_ce_get_pasid_entry better matches what it actually does.

No functional change intended.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Clément Mathieu--Drif <clement.mathieu--drif@eviden.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
---
 hw/i386/intel_iommu.c | 18 ++++++++----------
 1 file changed, 8 insertions(+), 10 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 83c5e44413..71b70b795d 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -944,10 +944,8 @@ static int vtd_get_pe_from_pasid_table(IntelIOMMUState *s,
     return 0;
 }
 
-static int vtd_ce_get_rid2pasid_entry(IntelIOMMUState *s,
-                                      VTDContextEntry *ce,
-                                      VTDPASIDEntry *pe,
-                                      uint32_t pasid)
+static int vtd_ce_get_pasid_entry(IntelIOMMUState *s, VTDContextEntry *ce,
+                                  VTDPASIDEntry *pe, uint32_t pasid)
 {
     dma_addr_t pasid_dir_base;
     int ret = 0;
@@ -1025,7 +1023,7 @@ static uint32_t vtd_get_iova_level(IntelIOMMUState *s,
     VTDPASIDEntry pe;
 
     if (s->root_scalable) {
-        vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
+        vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
         if (s->flts) {
             return VTD_PE_GET_FL_LEVEL(&pe);
         } else {
@@ -1048,7 +1046,7 @@ static uint32_t vtd_get_iova_agaw(IntelIOMMUState *s,
     VTDPASIDEntry pe;
 
     if (s->root_scalable) {
-        vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
+        vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
         return 30 + ((pe.val[0] >> 2) & VTD_SM_PASID_ENTRY_AW) * 9;
     }
 
@@ -1116,7 +1114,7 @@ static dma_addr_t vtd_get_iova_pgtbl_base(IntelIOMMUState *s,
     VTDPASIDEntry pe;
 
     if (s->root_scalable) {
-        vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
+        vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
         if (s->flts) {
             return pe.val[2] & VTD_SM_PASID_ENTRY_FLPTPTR;
         } else {
@@ -1522,7 +1520,7 @@ static int vtd_ce_rid2pasid_check(IntelIOMMUState *s,
      * has valid rid2pasid setting, which includes valid
      * rid2pasid field and corresponding pasid entry setting
      */
-    return vtd_ce_get_rid2pasid_entry(s, ce, &pe, PCI_NO_PASID);
+    return vtd_ce_get_pasid_entry(s, ce, &pe, PCI_NO_PASID);
 }
 
 /* Map a device to its corresponding domain (context-entry) */
@@ -1611,7 +1609,7 @@ static uint16_t vtd_get_domain_id(IntelIOMMUState *s,
     VTDPASIDEntry pe;
 
     if (s->root_scalable) {
-        vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
+        vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
         return VTD_SM_PASID_ENTRY_DID(pe.val[1]);
     }
 
@@ -1687,7 +1685,7 @@ static bool vtd_dev_pt_enabled(IntelIOMMUState *s, VTDContextEntry *ce,
     int ret;
 
     if (s->root_scalable) {
-        ret = vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
+        ret = vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
         if (ret) {
             /*
              * This error is guest triggerable. We should assumt PT
-- 
2.47.1




* [PATCH v6 02/22] intel_iommu: Delete RPS capability related supporting code
  2025-09-18  8:57 [PATCH v6 00/22] intel_iommu: Enable first stage translation for passthrough device Zhenzhong Duan
  2025-09-18  8:57 ` [PATCH v6 01/22] intel_iommu: Rename vtd_ce_get_rid2pasid_entry to vtd_ce_get_pasid_entry Zhenzhong Duan
@ 2025-09-18  8:57 ` Zhenzhong Duan
  2025-09-30 13:49   ` Eric Auger
  2025-09-18  8:57 ` [PATCH v6 03/22] intel_iommu: Update terminology to match VTD spec Zhenzhong Duan
                   ` (19 subsequent siblings)
  21 siblings, 1 reply; 57+ messages in thread
From: Zhenzhong Duan @ 2025-09-18  8:57 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, skolothumtho, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan

RID-PASID Support (RPS) is not set in the vIOMMU ECAP register; the
supporting code is there but never takes effect.

Meanwhile, according to VT-d spec section 3.4.3:
"Implementations not supporting RID_PASID capability (ECAP_REG.RPS is 0b),
use a PASID value of 0 to perform address translation for requests without
PASID."

We should delete the supporting code that fetches the RID_PASID field from
the scalable context entry and use 0 as RID_PASID directly, because the
RID_PASID field is ignored when there is no RPS support according to the
spec.

This simplifies the code and doesn't bring any penalty.

Opportunistically, s/rid2pasid/rid_pasid and s/RID2PASID/RID_PASID as the
VT-d spec uses the RID_PASID terminology.
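
The effect of the change can be sketched as a tiny standalone function.
PCI_NO_PASID's value below mirrors QEMU's sentinel but is an assumption of
this sketch, and vtd_effective_pasid() is a made-up name, not a helper
introduced by this patch:

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: -1 sentinel as in QEMU's pci.h */
#define PCI_NO_PASID ((uint32_t)-1)
/* With ECAP_REG.RPS clear, requests without PASID use PASID 0 */
#define RID_PASID    0

static uint32_t vtd_effective_pasid(uint32_t pasid)
{
    /* Previously the RID_PASID value was fetched from the scalable
     * context entry; without RPS it is the constant 0. */
    return (pasid == PCI_NO_PASID) ? RID_PASID : pasid;
}
```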

Suggested-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/i386/intel_iommu_internal.h |  1 -
 hw/i386/intel_iommu.c          | 49 +++++++++++++---------------------
 2 files changed, 19 insertions(+), 31 deletions(-)

diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 360e937989..6abe76556a 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -547,7 +547,6 @@ typedef struct VTDRootEntry VTDRootEntry;
 #define VTD_CTX_ENTRY_LEGACY_SIZE     16
 #define VTD_CTX_ENTRY_SCALABLE_SIZE   32
 
-#define VTD_SM_CONTEXT_ENTRY_RID2PASID_MASK 0xfffff
 #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw)  (0x1e0ULL | ~VTD_HAW_MASK(aw))
 #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1      0xffffffffffe00000ULL
 
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 71b70b795d..b976b251bc 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -41,8 +41,7 @@
 #include "trace.h"
 
 /* context entry operations */
-#define VTD_CE_GET_RID2PASID(ce) \
-    ((ce)->val[1] & VTD_SM_CONTEXT_ENTRY_RID2PASID_MASK)
+#define RID_PASID    0
 #define VTD_CE_GET_PASID_DIR_TABLE(ce) \
     ((ce)->val[0] & VTD_PASID_DIR_BASE_ADDR_MASK)
 
@@ -951,7 +950,7 @@ static int vtd_ce_get_pasid_entry(IntelIOMMUState *s, VTDContextEntry *ce,
     int ret = 0;
 
     if (pasid == PCI_NO_PASID) {
-        pasid = VTD_CE_GET_RID2PASID(ce);
+        pasid = RID_PASID;
     }
     pasid_dir_base = VTD_CE_GET_PASID_DIR_TABLE(ce);
     ret = vtd_get_pe_from_pasid_table(s, pasid_dir_base, pasid, pe);
@@ -970,7 +969,7 @@ static int vtd_ce_get_pasid_fpd(IntelIOMMUState *s,
     VTDPASIDEntry pe;
 
     if (pasid == PCI_NO_PASID) {
-        pasid = VTD_CE_GET_RID2PASID(ce);
+        pasid = RID_PASID;
     }
     pasid_dir_base = VTD_CE_GET_PASID_DIR_TABLE(ce);
 
@@ -1510,15 +1509,14 @@ static inline int vtd_context_entry_rsvd_bits_check(IntelIOMMUState *s,
     return 0;
 }
 
-static int vtd_ce_rid2pasid_check(IntelIOMMUState *s,
+static int vtd_ce_rid_pasid_check(IntelIOMMUState *s,
                                   VTDContextEntry *ce)
 {
     VTDPASIDEntry pe;
 
     /*
      * Make sure in Scalable Mode, a present context entry
-     * has valid rid2pasid setting, which includes valid
-     * rid2pasid field and corresponding pasid entry setting
+     * has valid pasid entry setting at RID_PASID(0).
      */
     return vtd_ce_get_pasid_entry(s, ce, &pe, PCI_NO_PASID);
 }
@@ -1581,12 +1579,11 @@ static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
         }
     } else {
         /*
-         * Check if the programming of context-entry.rid2pasid
-         * and corresponding pasid setting is valid, and thus
-         * avoids to check pasid entry fetching result in future
-         * helper function calling.
+         * Check if the programming of pasid setting at RID_PASID(0)
+         * is valid, and thus avoids to check pasid entry fetching
+         * result in future helper function calling.
          */
-        ret_fr = vtd_ce_rid2pasid_check(s, ce);
+        ret_fr = vtd_ce_rid_pasid_check(s, ce);
         if (ret_fr) {
             return ret_fr;
         }
@@ -2097,7 +2094,7 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
     bool reads = true;
     bool writes = true;
     uint8_t access_flags, pgtt;
-    bool rid2pasid = (pasid == PCI_NO_PASID) && s->root_scalable;
+    bool rid_pasid = (pasid == PCI_NO_PASID) && s->root_scalable;
     VTDIOTLBEntry *iotlb_entry;
     uint64_t xlat, size;
 
@@ -2111,8 +2108,8 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
 
     cc_entry = &vtd_as->context_cache_entry;
 
-    /* Try to fetch pte from IOTLB, we don't need RID2PASID logic */
-    if (!rid2pasid) {
+    /* Try to fetch pte from IOTLB, we don't need RID_PASID(0) logic */
+    if (!rid_pasid) {
         iotlb_entry = vtd_lookup_iotlb(s, source_id, pasid, addr);
         if (iotlb_entry) {
             trace_vtd_iotlb_page_hit(source_id, addr, iotlb_entry->pte,
@@ -2160,8 +2157,8 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
         cc_entry->context_cache_gen = s->context_cache_gen;
     }
 
-    if (rid2pasid) {
-        pasid = VTD_CE_GET_RID2PASID(&ce);
+    if (rid_pasid) {
+        pasid = RID_PASID;
     }
 
     /*
@@ -2189,8 +2186,8 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
         return true;
     }
 
-    /* Try to fetch pte from IOTLB for RID2PASID slow path */
-    if (rid2pasid) {
+    /* Try to fetch pte from IOTLB for RID_PASID(0) slow path */
+    if (rid_pasid) {
         iotlb_entry = vtd_lookup_iotlb(s, source_id, pasid, addr);
         if (iotlb_entry) {
             trace_vtd_iotlb_page_hit(source_id, addr, iotlb_entry->pte,
@@ -2464,20 +2461,14 @@ static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
         ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
                                        vtd_as->devfn, &ce);
         if (!ret && domain_id == vtd_get_domain_id(s, &ce, vtd_as->pasid)) {
-            uint32_t rid2pasid = PCI_NO_PASID;
-
-            if (s->root_scalable) {
-                rid2pasid = VTD_CE_GET_RID2PASID(&ce);
-            }
-
             /*
              * In legacy mode, vtd_as->pasid == pasid is always true.
              * In scalable mode, for vtd address space backing a PCI
              * device without pasid, needs to compare pasid with
-             * rid2pasid of this device.
+             * RID_PASID(0) of this device.
              */
             if (!(vtd_as->pasid == pasid ||
-                  (vtd_as->pasid == PCI_NO_PASID && pasid == rid2pasid))) {
+                  (vtd_as->pasid == PCI_NO_PASID && pasid == RID_PASID))) {
                 continue;
             }
 
@@ -2976,9 +2967,7 @@ static void vtd_piotlb_pasid_invalidate(IntelIOMMUState *s,
         if (!vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
                                       vtd_as->devfn, &ce) &&
             domain_id == vtd_get_domain_id(s, &ce, vtd_as->pasid)) {
-            uint32_t rid2pasid = VTD_CE_GET_RID2PASID(&ce);
-
-            if ((vtd_as->pasid != PCI_NO_PASID || pasid != rid2pasid) &&
+            if ((vtd_as->pasid != PCI_NO_PASID || pasid != RID_PASID) &&
                 vtd_as->pasid != pasid) {
                 continue;
             }
-- 
2.47.1




* [PATCH v6 03/22] intel_iommu: Update terminology to match VTD spec
  2025-09-18  8:57 [PATCH v6 00/22] intel_iommu: Enable first stage translation for passthrough device Zhenzhong Duan
  2025-09-18  8:57 ` [PATCH v6 01/22] intel_iommu: Rename vtd_ce_get_rid2pasid_entry to vtd_ce_get_pasid_entry Zhenzhong Duan
  2025-09-18  8:57 ` [PATCH v6 02/22] intel_iommu: Delete RPS capability related supporting code Zhenzhong Duan
@ 2025-09-18  8:57 ` Zhenzhong Duan
  2025-09-30  7:45   ` Eric Auger
  2025-10-12 12:30   ` Yi Liu
  2025-09-18  8:57 ` [PATCH v6 04/22] hw/pci: Export pci_device_get_iommu_bus_devfn() and return bool Zhenzhong Duan
                   ` (18 subsequent siblings)
  21 siblings, 2 replies; 57+ messages in thread
From: Zhenzhong Duan @ 2025-09-18  8:57 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, skolothumtho, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan, Paolo Bonzini

VT-d spec revision 3.4, released in December 2021, renamed "First-level" to
"First-stage" and "Second-level" to "Second-stage".

Do the same in the intel_iommu code to match the spec: change all existing
"fl/sl/FL/SL/first level/second level/stage-1/stage-2" terminology to
"fs/ss/FS/SS/first stage/second stage".

No functional changes intended.

Suggested-by: Yi Liu <yi.l.liu@intel.com>
Suggested-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/i386/intel_iommu_internal.h |  63 ++++-----
 include/hw/i386/intel_iommu.h  |   2 +-
 hw/i386/intel_iommu.c          | 240 +++++++++++++++++----------------
 tests/qtest/intel-iommu-test.c |   4 +-
 4 files changed, 156 insertions(+), 153 deletions(-)

diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 6abe76556a..86b8bfc71f 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -195,8 +195,8 @@
 #define VTD_ECAP_PSS                (7ULL << 35) /* limit: MemTxAttrs::pid */
 #define VTD_ECAP_PASID              (1ULL << 40)
 #define VTD_ECAP_SMTS               (1ULL << 43)
-#define VTD_ECAP_SLTS               (1ULL << 46)
-#define VTD_ECAP_FLTS               (1ULL << 47)
+#define VTD_ECAP_SSTS               (1ULL << 46)
+#define VTD_ECAP_FSTS               (1ULL << 47)
 
 /* CAP_REG */
 /* (offset >> 4) << 24 */
@@ -210,7 +210,7 @@
 #define VTD_MAMV                    18ULL
 #define VTD_CAP_MAMV                (VTD_MAMV << 48)
 #define VTD_CAP_PSI                 (1ULL << 39)
-#define VTD_CAP_SLLPS               ((1ULL << 34) | (1ULL << 35))
+#define VTD_CAP_SSLPS               ((1ULL << 34) | (1ULL << 35))
 #define VTD_CAP_DRAIN_WRITE         (1ULL << 54)
 #define VTD_CAP_DRAIN_READ          (1ULL << 55)
 #define VTD_CAP_FS1GP               (1ULL << 56)
@@ -282,7 +282,7 @@ typedef enum VTDFaultReason {
     VTD_FR_ADDR_BEYOND_MGAW,    /* Input-address above (2^x-1) */
     VTD_FR_WRITE,               /* No write permission */
     VTD_FR_READ,                /* No read permission */
-    /* Fail to access a second-level paging entry (not SL_PML4E) */
+    /* Fail to access a second-stage paging entry (not SS_PML4E) */
     VTD_FR_PAGING_ENTRY_INV,
     VTD_FR_ROOT_TABLE_INV,      /* Fail to access a root-entry */
     VTD_FR_CONTEXT_TABLE_INV,   /* Fail to access a context-entry */
@@ -290,7 +290,8 @@ typedef enum VTDFaultReason {
     VTD_FR_ROOT_ENTRY_RSVD,
     /* Non-zero reserved field in a present context-entry */
     VTD_FR_CONTEXT_ENTRY_RSVD,
-    /* Non-zero reserved field in a second-level paging entry with at lease one
+    /*
+     * Non-zero reserved field in a second-stage paging entry with at lease one
      * Read(R) and Write(W) or Execute(E) field is Set.
      */
     VTD_FR_PAGING_ENTRY_RSVD,
@@ -323,7 +324,7 @@ typedef enum VTDFaultReason {
     VTD_FR_PASID_ENTRY_P = 0x59,
     VTD_FR_PASID_TABLE_ENTRY_INV = 0x5b,  /*Invalid PASID table entry */
 
-    /* Fail to access a first-level paging entry (not FS_PML4E) */
+    /* Fail to access a first-stage paging entry (not FS_PML4E) */
     VTD_FR_FS_PAGING_ENTRY_INV = 0x70,
     VTD_FR_FS_PAGING_ENTRY_P = 0x71,
     /* Non-zero reserved field in present first-stage paging entry */
@@ -445,23 +446,23 @@ typedef union VTDInvDesc VTDInvDesc;
 
 #define VTD_SPTE_PAGE_L1_RSVD_MASK(aw, stale_tm) \
         stale_tm ? \
-        (0x800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM | VTD_SL_TM)) : \
-        (0x800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
+        (0x800ULL | ~(VTD_HAW_MASK(aw) | VTD_SS_IGN_COM | VTD_SS_TM)) : \
+        (0x800ULL | ~(VTD_HAW_MASK(aw) | VTD_SS_IGN_COM))
 #define VTD_SPTE_PAGE_L2_RSVD_MASK(aw) \
-        (0x800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
+        (0x800ULL | ~(VTD_HAW_MASK(aw) | VTD_SS_IGN_COM))
 #define VTD_SPTE_PAGE_L3_RSVD_MASK(aw) \
-        (0x800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
+        (0x800ULL | ~(VTD_HAW_MASK(aw) | VTD_SS_IGN_COM))
 #define VTD_SPTE_PAGE_L4_RSVD_MASK(aw) \
-        (0x880ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
+        (0x880ULL | ~(VTD_HAW_MASK(aw) | VTD_SS_IGN_COM))
 
 #define VTD_SPTE_LPAGE_L2_RSVD_MASK(aw, stale_tm) \
         stale_tm ? \
-        (0x1ff800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM | VTD_SL_TM)) : \
-        (0x1ff800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
+        (0x1ff800ULL | ~(VTD_HAW_MASK(aw) | VTD_SS_IGN_COM | VTD_SS_TM)) : \
+        (0x1ff800ULL | ~(VTD_HAW_MASK(aw) | VTD_SS_IGN_COM))
 #define VTD_SPTE_LPAGE_L3_RSVD_MASK(aw, stale_tm) \
         stale_tm ? \
-        (0x3ffff800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM | VTD_SL_TM)) : \
-        (0x3ffff800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
+        (0x3ffff800ULL | ~(VTD_HAW_MASK(aw) | VTD_SS_IGN_COM | VTD_SS_TM)) : \
+        (0x3ffff800ULL | ~(VTD_HAW_MASK(aw) | VTD_SS_IGN_COM))
 
 /* Rsvd field masks for fpte */
 #define VTD_FS_UPPER_IGNORED 0xfff0000000000000ULL
@@ -535,7 +536,7 @@ typedef struct VTDRootEntry VTDRootEntry;
 #define VTD_CONTEXT_TT_DEV_IOTLB    (1ULL << 2)
 #define VTD_CONTEXT_TT_PASS_THROUGH (2ULL << 2)
 /* Second Level Page Translation Pointer*/
-#define VTD_CONTEXT_ENTRY_SLPTPTR   (~0xfffULL)
+#define VTD_CONTEXT_ENTRY_SSPTPTR   (~0xfffULL)
 #define VTD_CONTEXT_ENTRY_RSVD_LO(aw) (0xff0ULL | ~VTD_HAW_MASK(aw))
 /* hi */
 #define VTD_CONTEXT_ENTRY_AW        7ULL /* Adjusted guest-address-width */
@@ -565,35 +566,35 @@ typedef struct VTDRootEntry VTDRootEntry;
 /* PASID Granular Translation Type Mask */
 #define VTD_PASID_ENTRY_P              1ULL
 #define VTD_SM_PASID_ENTRY_PGTT        (7ULL << 6)
-#define VTD_SM_PASID_ENTRY_FLT         (1ULL << 6)
-#define VTD_SM_PASID_ENTRY_SLT         (2ULL << 6)
+#define VTD_SM_PASID_ENTRY_FST         (1ULL << 6)
+#define VTD_SM_PASID_ENTRY_SST         (2ULL << 6)
 #define VTD_SM_PASID_ENTRY_NESTED      (3ULL << 6)
 #define VTD_SM_PASID_ENTRY_PT          (4ULL << 6)
 
 #define VTD_SM_PASID_ENTRY_AW          7ULL /* Adjusted guest-address-width */
 #define VTD_SM_PASID_ENTRY_DID(val)    ((val) & VTD_DOMAIN_ID_MASK)
 
-#define VTD_SM_PASID_ENTRY_FLPM          3ULL
-#define VTD_SM_PASID_ENTRY_FLPTPTR       (~0xfffULL)
+#define VTD_SM_PASID_ENTRY_FSPM          3ULL
+#define VTD_SM_PASID_ENTRY_FSPTPTR       (~0xfffULL)
 
 /* First Level Paging Structure */
 /* Masks for First Level Paging Entry */
-#define VTD_FL_P                    1ULL
-#define VTD_FL_RW                   (1ULL << 1)
-#define VTD_FL_US                   (1ULL << 2)
-#define VTD_FL_A                    (1ULL << 5)
-#define VTD_FL_D                    (1ULL << 6)
+#define VTD_FS_P                    1ULL
+#define VTD_FS_RW                   (1ULL << 1)
+#define VTD_FS_US                   (1ULL << 2)
+#define VTD_FS_A                    (1ULL << 5)
+#define VTD_FS_D                    (1ULL << 6)
 
 /* Second Level Page Translation Pointer*/
-#define VTD_SM_PASID_ENTRY_SLPTPTR     (~0xfffULL)
+#define VTD_SM_PASID_ENTRY_SSPTPTR     (~0xfffULL)
 
 /* Second Level Paging Structure */
 /* Masks for Second Level Paging Entry */
-#define VTD_SL_RW_MASK              3ULL
-#define VTD_SL_R                    1ULL
-#define VTD_SL_W                    (1ULL << 1)
-#define VTD_SL_IGN_COM              0xbff0000000000000ULL
-#define VTD_SL_TM                   (1ULL << 62)
+#define VTD_SS_RW_MASK              3ULL
+#define VTD_SS_R                    1ULL
+#define VTD_SS_W                    (1ULL << 1)
+#define VTD_SS_IGN_COM              0xbff0000000000000ULL
+#define VTD_SS_TM                   (1ULL << 62)
 
 /* Common for both First Level and Second Level */
 #define VTD_PML4_LEVEL           4
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index e95477e855..564d4d4236 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -264,7 +264,7 @@ struct IntelIOMMUState {
 
     bool caching_mode;              /* RO - is cap CM enabled? */
     bool scalable_mode;             /* RO - is Scalable Mode supported? */
-    bool flts;                      /* RO - is stage-1 translation supported? */
+    bool fsts;                      /* RO - is first stage translation supported? */
     bool snoop_control;             /* RO - is SNP filed supported? */
 
     dma_addr_t root;                /* Current root table pointer */
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index b976b251bc..a47482ba9d 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -47,9 +47,9 @@
 
 /* pe operations */
 #define VTD_PE_GET_TYPE(pe) ((pe)->val[0] & VTD_SM_PASID_ENTRY_PGTT)
-#define VTD_PE_GET_FL_LEVEL(pe) \
-    (4 + (((pe)->val[2] >> 2) & VTD_SM_PASID_ENTRY_FLPM))
-#define VTD_PE_GET_SL_LEVEL(pe) \
+#define VTD_PE_GET_FS_LEVEL(pe) \
+    (4 + (((pe)->val[2] >> 2) & VTD_SM_PASID_ENTRY_FSPM))
+#define VTD_PE_GET_SS_LEVEL(pe) \
     (2 + (((pe)->val[0] >> 2) & VTD_SM_PASID_ENTRY_AW))
 
 /*
@@ -319,7 +319,7 @@ static gboolean vtd_hash_remove_by_page(gpointer key, gpointer value,
      * nested (PGTT=011b) mapping associated with specified domain-id are
      * invalidated. Nested isn't supported yet, so only need to check 001b.
      */
-    if (entry->pgtt == VTD_SM_PASID_ENTRY_FLT) {
+    if (entry->pgtt == VTD_SM_PASID_ENTRY_FST) {
         return true;
     }
 
@@ -340,7 +340,7 @@ static gboolean vtd_hash_remove_by_page_piotlb(gpointer key, gpointer value,
      * or pass-through (PGTT=100b) mappings. Nested isn't supported yet,
      * so only need to check first-stage (PGTT=001b) mappings.
      */
-    if (entry->pgtt != VTD_SM_PASID_ENTRY_FLT) {
+    if (entry->pgtt != VTD_SM_PASID_ENTRY_FST) {
         return false;
     }
 
@@ -747,9 +747,9 @@ static int vtd_get_context_entry_from_root(IntelIOMMUState *s,
     return 0;
 }
 
-static inline dma_addr_t vtd_ce_get_slpt_base(VTDContextEntry *ce)
+static inline dma_addr_t vtd_ce_get_sspt_base(VTDContextEntry *ce)
 {
-    return ce->lo & VTD_CONTEXT_ENTRY_SLPTPTR;
+    return ce->lo & VTD_CONTEXT_ENTRY_SSPTPTR;
 }
 
 static inline uint64_t vtd_get_pte_addr(uint64_t pte, uint8_t aw)
@@ -790,13 +790,13 @@ static inline uint32_t vtd_iova_level_offset(uint64_t iova, uint32_t level)
 }
 
 /* Check Capability Register to see if the @level of page-table is supported */
-static inline bool vtd_is_sl_level_supported(IntelIOMMUState *s, uint32_t level)
+static inline bool vtd_is_ss_level_supported(IntelIOMMUState *s, uint32_t level)
 {
     return VTD_CAP_SAGAW_MASK & s->cap &
            (1ULL << (level - 2 + VTD_CAP_SAGAW_SHIFT));
 }
 
-static inline bool vtd_is_fl_level_supported(IntelIOMMUState *s, uint32_t level)
+static inline bool vtd_is_fs_level_supported(IntelIOMMUState *s, uint32_t level)
 {
     return level == VTD_PML4_LEVEL;
 }
@@ -805,10 +805,10 @@ static inline bool vtd_is_fl_level_supported(IntelIOMMUState *s, uint32_t level)
 static inline bool vtd_pe_type_check(IntelIOMMUState *s, VTDPASIDEntry *pe)
 {
     switch (VTD_PE_GET_TYPE(pe)) {
-    case VTD_SM_PASID_ENTRY_FLT:
-        return !!(s->ecap & VTD_ECAP_FLTS);
-    case VTD_SM_PASID_ENTRY_SLT:
-        return !!(s->ecap & VTD_ECAP_SLTS);
+    case VTD_SM_PASID_ENTRY_FST:
+        return !!(s->ecap & VTD_ECAP_FSTS);
+    case VTD_SM_PASID_ENTRY_SST:
+        return !!(s->ecap & VTD_ECAP_SSTS);
     case VTD_SM_PASID_ENTRY_NESTED:
         /* Not support NESTED page table type yet */
         return false;
@@ -880,13 +880,13 @@ static int vtd_get_pe_in_pasid_leaf_table(IntelIOMMUState *s,
     }
 
     pgtt = VTD_PE_GET_TYPE(pe);
-    if (pgtt == VTD_SM_PASID_ENTRY_SLT &&
-        !vtd_is_sl_level_supported(s, VTD_PE_GET_SL_LEVEL(pe))) {
+    if (pgtt == VTD_SM_PASID_ENTRY_SST &&
+        !vtd_is_ss_level_supported(s, VTD_PE_GET_SS_LEVEL(pe))) {
             return -VTD_FR_PASID_TABLE_ENTRY_INV;
     }
 
-    if (pgtt == VTD_SM_PASID_ENTRY_FLT &&
-        !vtd_is_fl_level_supported(s, VTD_PE_GET_FL_LEVEL(pe))) {
+    if (pgtt == VTD_SM_PASID_ENTRY_FST &&
+        !vtd_is_fs_level_supported(s, VTD_PE_GET_FS_LEVEL(pe))) {
             return -VTD_FR_PASID_TABLE_ENTRY_INV;
     }
 
@@ -1007,7 +1007,8 @@ static int vtd_ce_get_pasid_fpd(IntelIOMMUState *s,
     return 0;
 }
 
-/* Get the page-table level that hardware should use for the second-level
+/*
+ * Get the page-table level that hardware should use for the second-stage
  * page-table walk from the Address Width field of context-entry.
  */
 static inline uint32_t vtd_ce_get_level(VTDContextEntry *ce)
@@ -1023,10 +1024,10 @@ static uint32_t vtd_get_iova_level(IntelIOMMUState *s,
 
     if (s->root_scalable) {
         vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
-        if (s->flts) {
-            return VTD_PE_GET_FL_LEVEL(&pe);
+        if (s->fsts) {
+            return VTD_PE_GET_FS_LEVEL(&pe);
         } else {
-            return VTD_PE_GET_SL_LEVEL(&pe);
+            return VTD_PE_GET_SS_LEVEL(&pe);
         }
     }
 
@@ -1095,7 +1096,7 @@ static inline uint64_t vtd_iova_limit(IntelIOMMUState *s,
 }
 
 /* Return true if IOVA passes range check, otherwise false. */
-static inline bool vtd_iova_sl_range_check(IntelIOMMUState *s,
+static inline bool vtd_iova_ss_range_check(IntelIOMMUState *s,
                                            uint64_t iova, VTDContextEntry *ce,
                                            uint8_t aw, uint32_t pasid)
 {
@@ -1114,14 +1115,14 @@ static dma_addr_t vtd_get_iova_pgtbl_base(IntelIOMMUState *s,
 
     if (s->root_scalable) {
         vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
-        if (s->flts) {
-            return pe.val[2] & VTD_SM_PASID_ENTRY_FLPTPTR;
+        if (s->fsts) {
+            return pe.val[2] & VTD_SM_PASID_ENTRY_FSPTPTR;
         } else {
-            return pe.val[0] & VTD_SM_PASID_ENTRY_SLPTPTR;
+            return pe.val[0] & VTD_SM_PASID_ENTRY_SSPTPTR;
         }
     }
 
-    return vtd_ce_get_slpt_base(ce);
+    return vtd_ce_get_sspt_base(ce);
 }
 
 /*
@@ -1136,13 +1137,13 @@ static dma_addr_t vtd_get_iova_pgtbl_base(IntelIOMMUState *s,
 static uint64_t vtd_spte_rsvd[VTD_SPTE_RSVD_LEN];
 static uint64_t vtd_spte_rsvd_large[VTD_SPTE_RSVD_LEN];
 
-static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
+static bool vtd_sspte_nonzero_rsvd(uint64_t sspte, uint32_t level)
 {
     uint64_t rsvd_mask;
 
     /*
      * We should have caught a guest-mis-programmed level earlier,
-     * via vtd_is_sl_level_supported.
+     * via vtd_is_ss_level_supported.
      */
     assert(level < VTD_SPTE_RSVD_LEN);
     /*
@@ -1152,46 +1153,47 @@ static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
     assert(level);
 
     if ((level == VTD_PD_LEVEL || level == VTD_PDP_LEVEL) &&
-        (slpte & VTD_PT_PAGE_SIZE_MASK)) {
+        (sspte & VTD_PT_PAGE_SIZE_MASK)) {
         /* large page */
         rsvd_mask = vtd_spte_rsvd_large[level];
     } else {
         rsvd_mask = vtd_spte_rsvd[level];
     }
 
-    return slpte & rsvd_mask;
+    return sspte & rsvd_mask;
 }
 
-/* Given the @iova, get relevant @slptep. @slpte_level will be the last level
+/*
+ * Given the @iova, get relevant @ssptep. @sspte_level will be the last level
  * of the translation, can be used for deciding the size of large page.
  */
-static int vtd_iova_to_slpte(IntelIOMMUState *s, VTDContextEntry *ce,
+static int vtd_iova_to_sspte(IntelIOMMUState *s, VTDContextEntry *ce,
                              uint64_t iova, bool is_write,
-                             uint64_t *slptep, uint32_t *slpte_level,
+                             uint64_t *ssptep, uint32_t *sspte_level,
                              bool *reads, bool *writes, uint8_t aw_bits,
                              uint32_t pasid)
 {
     dma_addr_t addr = vtd_get_iova_pgtbl_base(s, ce, pasid);
     uint32_t level = vtd_get_iova_level(s, ce, pasid);
     uint32_t offset;
-    uint64_t slpte;
+    uint64_t sspte;
     uint64_t access_right_check;
 
-    if (!vtd_iova_sl_range_check(s, iova, ce, aw_bits, pasid)) {
+    if (!vtd_iova_ss_range_check(s, iova, ce, aw_bits, pasid)) {
         error_report_once("%s: detected IOVA overflow (iova=0x%" PRIx64 ","
                           "pasid=0x%" PRIx32 ")", __func__, iova, pasid);
         return -VTD_FR_ADDR_BEYOND_MGAW;
     }
 
     /* FIXME: what is the Atomics request here? */
-    access_right_check = is_write ? VTD_SL_W : VTD_SL_R;
+    access_right_check = is_write ? VTD_SS_W : VTD_SS_R;
 
     while (true) {
         offset = vtd_iova_level_offset(iova, level);
-        slpte = vtd_get_pte(addr, offset);
+        sspte = vtd_get_pte(addr, offset);
 
-        if (slpte == (uint64_t)-1) {
-            error_report_once("%s: detected read error on DMAR slpte "
+        if (sspte == (uint64_t)-1) {
+            error_report_once("%s: detected read error on DMAR sspte "
                               "(iova=0x%" PRIx64 ", pasid=0x%" PRIx32 ")",
                               __func__, iova, pasid);
             if (level == vtd_get_iova_level(s, ce, pasid)) {
@@ -1201,30 +1203,30 @@ static int vtd_iova_to_slpte(IntelIOMMUState *s, VTDContextEntry *ce,
                 return -VTD_FR_PAGING_ENTRY_INV;
             }
         }
-        *reads = (*reads) && (slpte & VTD_SL_R);
-        *writes = (*writes) && (slpte & VTD_SL_W);
-        if (!(slpte & access_right_check)) {
-            error_report_once("%s: detected slpte permission error "
+        *reads = (*reads) && (sspte & VTD_SS_R);
+        *writes = (*writes) && (sspte & VTD_SS_W);
+        if (!(sspte & access_right_check)) {
+            error_report_once("%s: detected sspte permission error "
                               "(iova=0x%" PRIx64 ", level=0x%" PRIx32 ", "
-                              "slpte=0x%" PRIx64 ", write=%d, pasid=0x%"
+                              "sspte=0x%" PRIx64 ", write=%d, pasid=0x%"
                               PRIx32 ")", __func__, iova, level,
-                              slpte, is_write, pasid);
+                              sspte, is_write, pasid);
             return is_write ? -VTD_FR_WRITE : -VTD_FR_READ;
         }
-        if (vtd_slpte_nonzero_rsvd(slpte, level)) {
+        if (vtd_sspte_nonzero_rsvd(sspte, level)) {
             error_report_once("%s: detected splte reserve non-zero "
                               "iova=0x%" PRIx64 ", level=0x%" PRIx32
-                              "slpte=0x%" PRIx64 ", pasid=0x%" PRIX32 ")",
-                              __func__, iova, level, slpte, pasid);
+                              "sspte=0x%" PRIx64 ", pasid=0x%" PRIX32 ")",
+                              __func__, iova, level, sspte, pasid);
             return -VTD_FR_PAGING_ENTRY_RSVD;
         }
 
-        if (vtd_is_last_pte(slpte, level)) {
-            *slptep = slpte;
-            *slpte_level = level;
+        if (vtd_is_last_pte(sspte, level)) {
+            *ssptep = sspte;
+            *sspte_level = level;
             break;
         }
-        addr = vtd_get_pte_addr(slpte, aw_bits);
+        addr = vtd_get_pte_addr(sspte, aw_bits);
         level--;
     }
 
@@ -1350,7 +1352,7 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
 {
     bool read_cur, write_cur, entry_valid;
     uint32_t offset;
-    uint64_t slpte;
+    uint64_t sspte;
     uint64_t subpage_size, subpage_mask;
     IOMMUTLBEvent event;
     uint64_t iova = start;
@@ -1366,21 +1368,21 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
         iova_next = (iova & subpage_mask) + subpage_size;
 
         offset = vtd_iova_level_offset(iova, level);
-        slpte = vtd_get_pte(addr, offset);
+        sspte = vtd_get_pte(addr, offset);
 
-        if (slpte == (uint64_t)-1) {
+        if (sspte == (uint64_t)-1) {
             trace_vtd_page_walk_skip_read(iova, iova_next);
             goto next;
         }
 
-        if (vtd_slpte_nonzero_rsvd(slpte, level)) {
+        if (vtd_sspte_nonzero_rsvd(sspte, level)) {
             trace_vtd_page_walk_skip_reserve(iova, iova_next);
             goto next;
         }
 
         /* Permissions are stacked with parents' */
-        read_cur = read && (slpte & VTD_SL_R);
-        write_cur = write && (slpte & VTD_SL_W);
+        read_cur = read && (sspte & VTD_SS_R);
+        write_cur = write && (sspte & VTD_SS_W);
 
         /*
          * As long as we have either read/write permission, this is a
@@ -1389,12 +1391,12 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
          */
         entry_valid = read_cur | write_cur;
 
-        if (!vtd_is_last_pte(slpte, level) && entry_valid) {
+        if (!vtd_is_last_pte(sspte, level) && entry_valid) {
             /*
              * This is a valid PDE (or even bigger than PDE).  We need
              * to walk one further level.
              */
-            ret = vtd_page_walk_level(vtd_get_pte_addr(slpte, info->aw),
+            ret = vtd_page_walk_level(vtd_get_pte_addr(sspte, info->aw),
                                       iova, MIN(iova_next, end), level - 1,
                                       read_cur, write_cur, info);
         } else {
@@ -1411,7 +1413,7 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
             event.entry.perm = IOMMU_ACCESS_FLAG(read_cur, write_cur);
             event.entry.addr_mask = ~subpage_mask;
             /* NOTE: this is only meaningful if entry_valid == true */
-            event.entry.translated_addr = vtd_get_pte_addr(slpte, info->aw);
+            event.entry.translated_addr = vtd_get_pte_addr(sspte, info->aw);
             event.type = event.entry.perm ? IOMMU_NOTIFIER_MAP :
                                             IOMMU_NOTIFIER_UNMAP;
             ret = vtd_page_walk_one(&event, info);
@@ -1445,11 +1447,11 @@ static int vtd_page_walk(IntelIOMMUState *s, VTDContextEntry *ce,
     dma_addr_t addr = vtd_get_iova_pgtbl_base(s, ce, pasid);
     uint32_t level = vtd_get_iova_level(s, ce, pasid);
 
-    if (!vtd_iova_sl_range_check(s, start, ce, info->aw, pasid)) {
+    if (!vtd_iova_ss_range_check(s, start, ce, info->aw, pasid)) {
         return -VTD_FR_ADDR_BEYOND_MGAW;
     }
 
-    if (!vtd_iova_sl_range_check(s, end, ce, info->aw, pasid)) {
+    if (!vtd_iova_ss_range_check(s, end, ce, info->aw, pasid)) {
         /* Fix end so that it reaches the maximum */
         end = vtd_iova_limit(s, ce, info->aw, pasid);
     }
@@ -1563,7 +1565,7 @@ static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
 
     /* Check if the programming of context-entry is valid */
     if (!s->root_scalable &&
-        !vtd_is_sl_level_supported(s, vtd_ce_get_level(ce))) {
+        !vtd_is_ss_level_supported(s, vtd_ce_get_level(ce))) {
         error_report_once("%s: invalid context entry: hi=%"PRIx64
                           ", lo=%"PRIx64" (level %d not supported)",
                           __func__, ce->hi, ce->lo,
@@ -1670,10 +1672,9 @@ static int vtd_address_space_sync(VTDAddressSpace *vtd_as)
 }
 
 /*
- * Check if specific device is configured to bypass address
- * translation for DMA requests. In Scalable Mode, bypass
- * 1st-level translation or 2nd-level translation, it depends
- * on PGTT setting.
+ * Check if a specific device is configured to bypass address translation
+ * for DMA requests. In Scalable Mode, whether first stage or second
+ * stage translation is bypassed depends on the PGTT setting.
  */
 static bool vtd_dev_pt_enabled(IntelIOMMUState *s, VTDContextEntry *ce,
                                uint32_t pasid)
@@ -1910,13 +1911,13 @@ out:
 static uint64_t vtd_fpte_rsvd[VTD_FPTE_RSVD_LEN];
 static uint64_t vtd_fpte_rsvd_large[VTD_FPTE_RSVD_LEN];
 
-static bool vtd_flpte_nonzero_rsvd(uint64_t flpte, uint32_t level)
+static bool vtd_fspte_nonzero_rsvd(uint64_t fspte, uint32_t level)
 {
     uint64_t rsvd_mask;
 
     /*
      * We should have caught a guest-mis-programmed level earlier,
-     * via vtd_is_fl_level_supported.
+     * via vtd_is_fs_level_supported.
      */
     assert(level < VTD_FPTE_RSVD_LEN);
     /*
@@ -1926,23 +1927,23 @@ static bool vtd_flpte_nonzero_rsvd(uint64_t flpte, uint32_t level)
     assert(level);
 
     if ((level == VTD_PD_LEVEL || level == VTD_PDP_LEVEL) &&
-        (flpte & VTD_PT_PAGE_SIZE_MASK)) {
+        (fspte & VTD_PT_PAGE_SIZE_MASK)) {
         /* large page */
         rsvd_mask = vtd_fpte_rsvd_large[level];
     } else {
         rsvd_mask = vtd_fpte_rsvd[level];
     }
 
-    return flpte & rsvd_mask;
+    return fspte & rsvd_mask;
 }
 
-static inline bool vtd_flpte_present(uint64_t flpte)
+static inline bool vtd_fspte_present(uint64_t fspte)
 {
-    return !!(flpte & VTD_FL_P);
+    return !!(fspte & VTD_FS_P);
 }
 
 /* Return true if IOVA is canonical, otherwise false. */
-static bool vtd_iova_fl_check_canonical(IntelIOMMUState *s, uint64_t iova,
+static bool vtd_iova_fs_check_canonical(IntelIOMMUState *s, uint64_t iova,
                                         VTDContextEntry *ce, uint32_t pasid)
 {
     uint64_t iova_limit = vtd_iova_limit(s, ce, s->aw_bits, pasid);
@@ -1972,32 +1973,32 @@ static MemTxResult vtd_set_flag_in_pte(dma_addr_t base_addr, uint32_t index,
 }
 
 /*
- * Given the @iova, get relevant @flptep. @flpte_level will be the last level
+ * Given the @iova, get relevant @fsptep. @fspte_level will be the last level
  * of the translation, can be used for deciding the size of large page.
  */
-static int vtd_iova_to_flpte(IntelIOMMUState *s, VTDContextEntry *ce,
+static int vtd_iova_to_fspte(IntelIOMMUState *s, VTDContextEntry *ce,
                              uint64_t iova, bool is_write,
-                             uint64_t *flptep, uint32_t *flpte_level,
+                             uint64_t *fsptep, uint32_t *fspte_level,
                              bool *reads, bool *writes, uint8_t aw_bits,
                              uint32_t pasid)
 {
     dma_addr_t addr = vtd_get_iova_pgtbl_base(s, ce, pasid);
     uint32_t offset;
-    uint64_t flpte, flag_ad = VTD_FL_A;
-    *flpte_level = vtd_get_iova_level(s, ce, pasid);
+    uint64_t fspte, flag_ad = VTD_FS_A;
+    *fspte_level = vtd_get_iova_level(s, ce, pasid);
 
-    if (!vtd_iova_fl_check_canonical(s, iova, ce, pasid)) {
+    if (!vtd_iova_fs_check_canonical(s, iova, ce, pasid)) {
         error_report_once("%s: detected non canonical IOVA (iova=0x%" PRIx64 ","
                           "pasid=0x%" PRIx32 ")", __func__, iova, pasid);
         return -VTD_FR_FS_NON_CANONICAL;
     }
 
     while (true) {
-        offset = vtd_iova_level_offset(iova, *flpte_level);
-        flpte = vtd_get_pte(addr, offset);
+        offset = vtd_iova_level_offset(iova, *fspte_level);
+        fspte = vtd_get_pte(addr, offset);
 
-        if (flpte == (uint64_t)-1) {
-            if (*flpte_level == vtd_get_iova_level(s, ce, pasid)) {
+        if (fspte == (uint64_t)-1) {
+            if (*fspte_level == vtd_get_iova_level(s, ce, pasid)) {
                 /* Invalid programming of pasid-entry */
                 return -VTD_FR_PASID_ENTRY_FSPTPTR_INV;
             } else {
@@ -2005,47 +2006,47 @@ static int vtd_iova_to_flpte(IntelIOMMUState *s, VTDContextEntry *ce,
             }
         }
 
-        if (!vtd_flpte_present(flpte)) {
+        if (!vtd_fspte_present(fspte)) {
             *reads = false;
             *writes = false;
             return -VTD_FR_FS_PAGING_ENTRY_P;
         }
 
         /* No emulated device supports supervisor privilege request yet */
-        if (!(flpte & VTD_FL_US)) {
+        if (!(fspte & VTD_FS_US)) {
             *reads = false;
             *writes = false;
             return -VTD_FR_FS_PAGING_ENTRY_US;
         }
 
         *reads = true;
-        *writes = (*writes) && (flpte & VTD_FL_RW);
-        if (is_write && !(flpte & VTD_FL_RW)) {
+        *writes = (*writes) && (fspte & VTD_FS_RW);
+        if (is_write && !(fspte & VTD_FS_RW)) {
             return -VTD_FR_SM_WRITE;
         }
-        if (vtd_flpte_nonzero_rsvd(flpte, *flpte_level)) {
-            error_report_once("%s: detected flpte reserved non-zero "
+        if (vtd_fspte_nonzero_rsvd(fspte, *fspte_level)) {
+            error_report_once("%s: detected fspte reserved non-zero "
                               "iova=0x%" PRIx64 ", level=0x%" PRIx32
-                              "flpte=0x%" PRIx64 ", pasid=0x%" PRIX32 ")",
-                              __func__, iova, *flpte_level, flpte, pasid);
+                              "fspte=0x%" PRIx64 ", pasid=0x%" PRIX32 ")",
+                              __func__, iova, *fspte_level, fspte, pasid);
             return -VTD_FR_FS_PAGING_ENTRY_RSVD;
         }
 
-        if (vtd_is_last_pte(flpte, *flpte_level) && is_write) {
-            flag_ad |= VTD_FL_D;
+        if (vtd_is_last_pte(fspte, *fspte_level) && is_write) {
+            flag_ad |= VTD_FS_D;
         }
 
-        if (vtd_set_flag_in_pte(addr, offset, flpte, flag_ad) != MEMTX_OK) {
+        if (vtd_set_flag_in_pte(addr, offset, fspte, flag_ad) != MEMTX_OK) {
             return -VTD_FR_FS_BIT_UPDATE_FAILED;
         }
 
-        if (vtd_is_last_pte(flpte, *flpte_level)) {
-            *flptep = flpte;
+        if (vtd_is_last_pte(fspte, *fspte_level)) {
+            *fsptep = fspte;
             return 0;
         }
 
-        addr = vtd_get_pte_addr(flpte, aw_bits);
-        (*flpte_level)--;
+        addr = vtd_get_pte_addr(fspte, aw_bits);
+        (*fspte_level)--;
     }
 }
 
@@ -2199,14 +2200,14 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
         }
     }
 
-    if (s->flts && s->root_scalable) {
-        ret_fr = vtd_iova_to_flpte(s, &ce, addr, is_write, &pte, &level,
+    if (s->fsts && s->root_scalable) {
+        ret_fr = vtd_iova_to_fspte(s, &ce, addr, is_write, &pte, &level,
                                    &reads, &writes, s->aw_bits, pasid);
-        pgtt = VTD_SM_PASID_ENTRY_FLT;
+        pgtt = VTD_SM_PASID_ENTRY_FST;
     } else {
-        ret_fr = vtd_iova_to_slpte(s, &ce, addr, is_write, &pte, &level,
+        ret_fr = vtd_iova_to_sspte(s, &ce, addr, is_write, &pte, &level,
                                    &reads, &writes, s->aw_bits, pasid);
-        pgtt = VTD_SM_PASID_ENTRY_SLT;
+        pgtt = VTD_SM_PASID_ENTRY_SST;
     }
     if (!ret_fr) {
         xlat = vtd_get_pte_addr(pte, s->aw_bits);
@@ -2474,13 +2475,13 @@ static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
 
             if (vtd_as_has_map_notifier(vtd_as)) {
                 /*
-                 * When stage-1 translation is off, as long as we have MAP
+                 * When first stage translation is off, as long as we have MAP
                  * notifications registered in any of our IOMMU notifiers,
                  * we need to sync the shadow page table. Otherwise VFIO
                  * device attaches to nested page table instead of shadow
                  * page table, so no need to sync.
                  */
-                if (!s->flts || !s->root_scalable) {
+                if (!s->fsts || !s->root_scalable) {
                     vtd_sync_shadow_page_table_range(vtd_as, &ce, addr, size);
                 }
             } else {
@@ -2972,7 +2973,7 @@ static void vtd_piotlb_pasid_invalidate(IntelIOMMUState *s,
                 continue;
             }
 
-            if (!s->flts || !vtd_as_has_map_notifier(vtd_as)) {
+            if (!s->fsts || !vtd_as_has_map_notifier(vtd_as)) {
                 vtd_address_space_sync(vtd_as);
             }
         }
@@ -3818,7 +3819,7 @@ static const Property vtd_properties[] = {
                       VTD_HOST_ADDRESS_WIDTH),
     DEFINE_PROP_BOOL("caching-mode", IntelIOMMUState, caching_mode, FALSE),
     DEFINE_PROP_BOOL("x-scalable-mode", IntelIOMMUState, scalable_mode, FALSE),
-    DEFINE_PROP_BOOL("x-flts", IntelIOMMUState, flts, FALSE),
+    DEFINE_PROP_BOOL("x-flts", IntelIOMMUState, fsts, FALSE),
     DEFINE_PROP_BOOL("snoop-control", IntelIOMMUState, snoop_control, false),
     DEFINE_PROP_BOOL("x-pasid-mode", IntelIOMMUState, pasid, false),
     DEFINE_PROP_BOOL("dma-drain", IntelIOMMUState, dma_drain, true),
@@ -4344,12 +4345,13 @@ static bool vtd_check_hiod(IntelIOMMUState *s, HostIOMMUDevice *hiod,
         return false;
     }
 
-    if (!s->flts) {
-        /* All checks requested by VTD stage-2 translation pass */
+    if (!s->fsts) {
+        /* All checks requested by VTD second stage translation pass */
         return true;
     }
 
-    error_setg(errp, "host device is uncompatible with stage-1 translation");
+    error_setg(errp,
+               "host device is incompatible with first stage translation");
     return false;
 }
 
@@ -4535,7 +4537,7 @@ static void vtd_cap_init(IntelIOMMUState *s)
     X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
 
     s->cap = VTD_CAP_FRO | VTD_CAP_NFR | VTD_CAP_ND |
-             VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS |
+             VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SSLPS |
              VTD_CAP_MGAW(s->aw_bits);
     if (s->dma_drain) {
         s->cap |= VTD_CAP_DRAIN;
@@ -4571,13 +4573,13 @@ static void vtd_cap_init(IntelIOMMUState *s)
     }
 
     /* TODO: read cap/ecap from host to decide which cap to be exposed. */
-    if (s->flts) {
-        s->ecap |= VTD_ECAP_SMTS | VTD_ECAP_FLTS;
+    if (s->fsts) {
+        s->ecap |= VTD_ECAP_SMTS | VTD_ECAP_FSTS;
         if (s->fs1gp) {
             s->cap |= VTD_CAP_FS1GP;
         }
     } else if (s->scalable_mode) {
-        s->ecap |= VTD_ECAP_SMTS | VTD_ECAP_SRS | VTD_ECAP_SLTS;
+        s->ecap |= VTD_ECAP_SMTS | VTD_ECAP_SRS | VTD_ECAP_SSTS;
     }
 
     if (s->snoop_control) {
@@ -4864,12 +4866,12 @@ static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
         }
     }
 
-    if (!s->scalable_mode && s->flts) {
+    if (!s->scalable_mode && s->fsts) {
         error_setg(errp, "x-flts is only available in scalable mode");
         return false;
     }
 
-    if (!s->flts && s->aw_bits != VTD_HOST_AW_39BIT &&
+    if (!s->fsts && s->aw_bits != VTD_HOST_AW_39BIT &&
         s->aw_bits != VTD_HOST_AW_48BIT) {
         error_setg(errp, "%s: supported values for aw-bits are: %d, %d",
                    s->scalable_mode ? "Scalable mode(flts=off)" : "Legacy mode",
@@ -4877,7 +4879,7 @@ static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
         return false;
     }
 
-    if (s->flts && s->aw_bits != VTD_HOST_AW_48BIT) {
+    if (s->fsts && s->aw_bits != VTD_HOST_AW_48BIT) {
         error_setg(errp,
                    "Scalable mode(flts=on): supported value for aw-bits is: %d",
                    VTD_HOST_AW_48BIT);
diff --git a/tests/qtest/intel-iommu-test.c b/tests/qtest/intel-iommu-test.c
index c521b3796e..e5cc6acaf0 100644
--- a/tests/qtest/intel-iommu-test.c
+++ b/tests/qtest/intel-iommu-test.c
@@ -13,9 +13,9 @@
 #include "hw/i386/intel_iommu_internal.h"
 
 #define CAP_STAGE_1_FIXED1    (VTD_CAP_FRO | VTD_CAP_NFR | VTD_CAP_ND | \
-                              VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS)
+                              VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SSLPS)
 #define ECAP_STAGE_1_FIXED1   (VTD_ECAP_QI |  VTD_ECAP_IR | VTD_ECAP_IRO | \
-                              VTD_ECAP_MHMV | VTD_ECAP_SMTS | VTD_ECAP_FLTS)
+                              VTD_ECAP_MHMV | VTD_ECAP_SMTS | VTD_ECAP_FSTS)
 
 static inline uint64_t vtd_reg_readq(QTestState *s, uint64_t offset)
 {
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v6 04/22] hw/pci: Export pci_device_get_iommu_bus_devfn() and return bool
  2025-09-18  8:57 [PATCH v6 00/22] intel_iommu: Enable first stage translation for passthrough device Zhenzhong Duan
                   ` (2 preceding siblings ...)
  2025-09-18  8:57 ` [PATCH v6 03/22] intel_iommu: Update terminology to match VTD spec Zhenzhong Duan
@ 2025-09-18  8:57 ` Zhenzhong Duan
  2025-09-18  8:57 ` [PATCH v6 05/22] hw/pci: Introduce pci_device_get_viommu_flags() Zhenzhong Duan
                   ` (17 subsequent siblings)
  21 siblings, 0 replies; 57+ messages in thread
From: Zhenzhong Duan @ 2025-09-18  8:57 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, skolothumtho, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan

Return true if the PCI device's RID is aliased, or false otherwise. This
will be used in a following patch to determine whether a PCI device is
under a PCI bridge.
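
As an illustration (outside the patch itself), the RID-aliasing walk can
be sketched with toy types; ToyBus and its fields are hypothetical
simplifications, not QEMU's actual PCIBus API. The sketch assumes only
conventional (non-PCIe) bridges alias the requester ID to the bridge's
own devfn, and returns whether any aliasing occurred:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a PCI bus hierarchy; not QEMU's PCIBus. */
typedef struct ToyBus ToyBus;
struct ToyBus {
    ToyBus *parent_bus;   /* NULL for the root bus */
    int parent_devfn;     /* devfn of the bridge on the parent bus */
    bool express;         /* PCIe buses forward the original RID */
};

/*
 * Walk toward the root bus. Crossing a conventional bridge aliases the
 * RID to the bridge's devfn; crossing a PCIe bridge does not. Returns
 * true if the device's RID was aliased along the way.
 */
static bool toy_get_iommu_bus_devfn(ToyBus *bus, int devfn,
                                    ToyBus **aliased_bus,
                                    int *aliased_devfn)
{
    bool aliased = false;

    while (bus->parent_bus) {
        if (!bus->express) {
            /* conventional PCI: requests carry the bridge's RID */
            devfn = bus->parent_devfn;
            aliased = true;
        }
        bus = bus->parent_bus;
    }
    if (aliased_bus) {
        *aliased_bus = bus;
    }
    if (aliased_devfn) {
        *aliased_devfn = devfn;
    }
    return aliased;
}
```

A device behind a conventional bridge reports the bridge's devfn and
returns true; a device on a pure PCIe chain keeps its own devfn and
returns false.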

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
---
 include/hw/pci/pci.h |  2 ++
 hw/pci/pci.c         | 12 ++++++++----
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index 6bccb25ac2..bde9dca8e2 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -637,6 +637,8 @@ typedef struct PCIIOMMUOps {
                             bool is_write);
 } PCIIOMMUOps;
 
+bool pci_device_get_iommu_bus_devfn(PCIDevice *dev, PCIBus **piommu_bus,
+                                    PCIBus **aliased_bus, int *aliased_devfn);
 AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
 bool pci_device_set_iommu_device(PCIDevice *dev, HostIOMMUDevice *hiod,
                                  Error **errp);
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index c3df9d6656..4d4b9dda4d 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -2860,20 +2860,21 @@ static void pci_device_class_base_init(ObjectClass *klass, const void *data)
  * For call sites which don't need aliased BDF, passing NULL to
  * aliased_[bus|devfn] is allowed.
  *
+ * Returns true if PCI device RID is aliased or false otherwise.
+ *
  * @piommu_bus: return root #PCIBus backed by an IOMMU for the PCI device.
  *
  * @aliased_bus: return aliased #PCIBus of the PCI device, optional.
  *
  * @aliased_devfn: return aliased devfn of the PCI device, optional.
  */
-static void pci_device_get_iommu_bus_devfn(PCIDevice *dev,
-                                           PCIBus **piommu_bus,
-                                           PCIBus **aliased_bus,
-                                           int *aliased_devfn)
+bool pci_device_get_iommu_bus_devfn(PCIDevice *dev, PCIBus **piommu_bus,
+                                    PCIBus **aliased_bus, int *aliased_devfn)
 {
     PCIBus *bus = pci_get_bus(dev);
     PCIBus *iommu_bus = bus;
     int devfn = dev->devfn;
+    bool aliased = false;
 
     while (iommu_bus && !iommu_bus->iommu_ops && iommu_bus->parent_dev) {
         PCIBus *parent_bus = pci_get_bus(iommu_bus->parent_dev);
@@ -2910,6 +2911,7 @@ static void pci_device_get_iommu_bus_devfn(PCIDevice *dev,
                 devfn = parent->devfn;
                 bus = parent_bus;
             }
+            aliased = true;
         }
 
         /*
@@ -2944,6 +2946,8 @@ static void pci_device_get_iommu_bus_devfn(PCIDevice *dev,
     if (aliased_devfn) {
         *aliased_devfn = devfn;
     }
+
+    return aliased;
 }
 
 AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
-- 
2.47.1




* [PATCH v6 05/22] hw/pci: Introduce pci_device_get_viommu_flags()
  2025-09-18  8:57 [PATCH v6 00/22] intel_iommu: Enable first stage translation for passthrough device Zhenzhong Duan
                   ` (3 preceding siblings ...)
  2025-09-18  8:57 ` [PATCH v6 04/22] hw/pci: Export pci_device_get_iommu_bus_devfn() and return bool Zhenzhong Duan
@ 2025-09-18  8:57 ` Zhenzhong Duan
  2025-09-23 18:47   ` Nicolin Chen
  2025-10-12 12:26   ` Yi Liu
  2025-09-18  8:57 ` [PATCH v6 06/22] intel_iommu: Implement get_viommu_flags() callback Zhenzhong Duan
                   ` (16 subsequent siblings)
  21 siblings, 2 replies; 57+ messages in thread
From: Zhenzhong Duan @ 2025-09-18  8:57 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, skolothumtho, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan

Introduce a new optional PCIIOMMUOps callback, get_viommu_flags(), which
allows retrieving flags exposed by a vIOMMU. The first planned vIOMMU
device flag is VIOMMU_FLAG_WANT_NESTING_PARENT, which advertises support
for the HW nested stage translation scheme and requests the cooperation
of other subsystems, e.g. VFIO, to create the nesting parent HWPT.

pci_device_get_viommu_flags() is a wrapper that can be called on a PCI
device potentially protected by a vIOMMU.

get_viommu_flags() is designed to return a 64-bit bitmap of pure vIOMMU
flags which are determined only by the user's configuration, with no host
capabilities involved. The reasons are:

1. the host may have heterogeneous IOMMUs, each with different capabilities;
2. this is migration friendly: the return value is consistent between source
   and target;
3. host IOMMU capabilities are passed to the vIOMMU through the
   set_iommu_device() interface, which has to happen after attach_device().
   When get_viommu_flags() is called in attach_device(), there is no way
   for the vIOMMU to know the host IOMMU capabilities yet, so only pure
   vIOMMU flags can be returned. See the sequence below:

     vfio_device_attach():
         iommufd_cdev_attach():
             pci_device_get_viommu_flags() for HW nesting cap
             create a nesting parent HWPT
             attach device to the HWPT
             vfio_device_hiod_create_and_realize() creating hiod
     ...
     pci_device_set_iommu_device(hiod)
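
The optional-callback dispatch above can be sketched in standalone form;
the type and function names below are illustrative stand-ins for the QEMU
structures, not the actual API:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define BIT_ULL(n) (1ULL << (n))

enum {
    VIOMMU_FLAG_WANT_NESTING_PARENT = BIT_ULL(0),
};

/* Illustrative stand-in for QEMU's PCIIOMMUOps */
typedef struct PCIIOMMUOps {
    uint64_t (*get_viommu_flags)(void *opaque); /* optional */
} PCIIOMMUOps;

/* Wrapper: returns 0 when there is no vIOMMU or the callback is absent */
uint64_t device_get_viommu_flags(const PCIIOMMUOps *ops, void *opaque)
{
    if (ops && ops->get_viommu_flags) {
        return ops->get_viommu_flags(opaque);
    }
    return 0;
}

/* Mimics vtd_get_viommu_flags(): pure config, no host capabilities */
uint64_t fake_vtd_get_viommu_flags(void *opaque)
{
    int *fsts = opaque; /* stands in for s->fsts */

    return *fsts ? VIOMMU_FLAG_WANT_NESTING_PARENT : 0;
}
```

A caller such as VFIO would then test the returned bitmap against
VIOMMU_FLAG_WANT_NESTING_PARENT before deciding how to allocate the HWPT.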

Suggested-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 MAINTAINERS          |  1 +
 include/hw/iommu.h   | 19 +++++++++++++++++++
 include/hw/pci/pci.h | 27 +++++++++++++++++++++++++++
 hw/pci/pci.c         | 11 +++++++++++
 4 files changed, 58 insertions(+)
 create mode 100644 include/hw/iommu.h

diff --git a/MAINTAINERS b/MAINTAINERS
index f8cd513d8b..71457e4cde 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2307,6 +2307,7 @@ F: include/system/iommufd.h
 F: backends/host_iommu_device.c
 F: include/system/host_iommu_device.h
 F: include/qemu/chardev_open.h
+F: include/hw/iommu.h
 F: util/chardev_open.c
 F: docs/devel/vfio-iommufd.rst
 
diff --git a/include/hw/iommu.h b/include/hw/iommu.h
new file mode 100644
index 0000000000..65d652950a
--- /dev/null
+++ b/include/hw/iommu.h
@@ -0,0 +1,19 @@
+/*
+ * General vIOMMU flags
+ *
+ * Copyright (C) 2025 Intel Corporation.
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#ifndef HW_IOMMU_H
+#define HW_IOMMU_H
+
+#include "qemu/bitops.h"
+
+enum {
+    /* Nesting parent HWPT will be reused by vIOMMU to create nested HWPT */
+    VIOMMU_FLAG_WANT_NESTING_PARENT = BIT_ULL(0),
+};
+
+#endif /* HW_IOMMU_H */
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index bde9dca8e2..c54f2b53ae 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -462,6 +462,23 @@ typedef struct PCIIOMMUOps {
      * @devfn: device and function number of the PCI device.
      */
     void (*unset_iommu_device)(PCIBus *bus, void *opaque, int devfn);
+    /**
+     * @get_viommu_flags: get vIOMMU flags
+     *
+     * Optional callback. If not implemented, the vIOMMU doesn't support
+     * exposing flags to other subsystems, e.g., VFIO. Each flag can be
+     * an expectation or request to another subsystem, or just a pure
+     * vIOMMU capability. The vIOMMU can choose which flags to expose.
+     *
+     * @opaque: the data passed to pci_setup_iommu().
+     *
+     * Returns: a 64-bit bitmap where each bit represents a flag the vIOMMU
+     * wants to expose. See VIOMMU_FLAG_* in include/hw/iommu.h for all
+     * possible flags currently used. These flags are theoretical: they are
+     * determined only by vIOMMU device properties, independent of the
+     * actual host capabilities they may depend on.
+     */
+    uint64_t (*get_viommu_flags)(void *opaque);
     /**
      * @get_iotlb_info: get properties required to initialize a device IOTLB.
      *
@@ -644,6 +661,16 @@ bool pci_device_set_iommu_device(PCIDevice *dev, HostIOMMUDevice *hiod,
                                  Error **errp);
 void pci_device_unset_iommu_device(PCIDevice *dev);
 
+/**
+ * pci_device_get_viommu_flags: get vIOMMU flags.
+ *
+ * Returns a 64-bit bitmap where each bit represents a vIOMMU exposed
+ * flag, or 0 if the vIOMMU doesn't expose any.
+ *
+ * @dev: PCI device pointer.
+ */
+uint64_t pci_device_get_viommu_flags(PCIDevice *dev);
+
 /**
  * pci_iommu_get_iotlb_info: get properties required to initialize a
  * device IOTLB.
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 4d4b9dda4d..1315ef13ea 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -3012,6 +3012,17 @@ void pci_device_unset_iommu_device(PCIDevice *dev)
     }
 }
 
+uint64_t pci_device_get_viommu_flags(PCIDevice *dev)
+{
+    PCIBus *iommu_bus;
+
+    pci_device_get_iommu_bus_devfn(dev, &iommu_bus, NULL, NULL);
+    if (iommu_bus && iommu_bus->iommu_ops->get_viommu_flags) {
+        return iommu_bus->iommu_ops->get_viommu_flags(iommu_bus->iommu_opaque);
+    }
+    return 0;
+}
+
 int pci_pri_request_page(PCIDevice *dev, uint32_t pasid, bool priv_req,
                          bool exec_req, hwaddr addr, bool lpig,
                          uint16_t prgi, bool is_read, bool is_write)
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v6 06/22] intel_iommu: Implement get_viommu_flags() callback
  2025-09-18  8:57 [PATCH v6 00/22] intel_iommu: Enable first stage translation for passthrough device Zhenzhong Duan
                   ` (4 preceding siblings ...)
  2025-09-18  8:57 ` [PATCH v6 05/22] hw/pci: Introduce pci_device_get_viommu_flags() Zhenzhong Duan
@ 2025-09-18  8:57 ` Zhenzhong Duan
  2025-10-12 12:28   ` Yi Liu
  2025-09-18  8:57 ` [PATCH v6 07/22] intel_iommu: Introduce a new structure VTDHostIOMMUDevice Zhenzhong Duan
                   ` (15 subsequent siblings)
  21 siblings, 1 reply; 57+ messages in thread
From: Zhenzhong Duan @ 2025-09-18  8:57 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, skolothumtho, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan

Implement the get_viommu_flags() callback and, for now, expose a request
for a nesting parent HWPT.

VFIO uses it to create a nesting parent HWPT, which is further used to
create a nested HWPT in the vIOMMU. All of this will be implemented in
the following patches.

Suggested-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
---
 hw/i386/intel_iommu.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index a47482ba9d..83c40975cc 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -24,6 +24,7 @@
 #include "qemu/main-loop.h"
 #include "qapi/error.h"
 #include "hw/sysbus.h"
+#include "hw/iommu.h"
 #include "intel_iommu_internal.h"
 #include "hw/pci/pci.h"
 #include "hw/pci/pci_bus.h"
@@ -4412,6 +4413,16 @@ static void vtd_dev_unset_iommu_device(PCIBus *bus, void *opaque, int devfn)
     vtd_iommu_unlock(s);
 }
 
+static uint64_t vtd_get_viommu_flags(void *opaque)
+{
+    IntelIOMMUState *s = opaque;
+    uint64_t caps;
+
+    caps = s->fsts ? VIOMMU_FLAG_WANT_NESTING_PARENT : 0;
+
+    return caps;
+}
+
 /* Unmap the whole range in the notifier's scope. */
 static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
 {
@@ -4842,6 +4853,7 @@ static PCIIOMMUOps vtd_iommu_ops = {
     .register_iotlb_notifier = vtd_register_iotlb_notifier,
     .unregister_iotlb_notifier = vtd_unregister_iotlb_notifier,
     .ats_request_translation = vtd_ats_request_translation,
+    .get_viommu_flags = vtd_get_viommu_flags,
 };
 
 static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v6 07/22] intel_iommu: Introduce a new structure VTDHostIOMMUDevice
  2025-09-18  8:57 [PATCH v6 00/22] intel_iommu: Enable first stage translation for passthrough device Zhenzhong Duan
                   ` (5 preceding siblings ...)
  2025-09-18  8:57 ` [PATCH v6 06/22] intel_iommu: Implement get_viommu_flags() callback Zhenzhong Duan
@ 2025-09-18  8:57 ` Zhenzhong Duan
  2025-09-18  8:57 ` [PATCH v6 08/22] vfio/iommufd: Force creating nesting parent HWPT Zhenzhong Duan
                   ` (14 subsequent siblings)
  21 siblings, 0 replies; 57+ messages in thread
From: Zhenzhong Duan @ 2025-09-18  8:57 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, skolothumtho, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan

Introduce a new structure, VTDHostIOMMUDevice, which replaces
HostIOMMUDevice as the object stored in the hash table.

It includes references to the HostIOMMUDevice and IntelIOMMUState,
as well as BDF information which will be used in future
patches.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
---
 hw/i386/intel_iommu_internal.h |  7 +++++++
 include/hw/i386/intel_iommu.h  |  2 +-
 hw/i386/intel_iommu.c          | 15 +++++++++++++--
 3 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 86b8bfc71f..9cdc8d5dbb 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -28,6 +28,7 @@
 #ifndef HW_I386_INTEL_IOMMU_INTERNAL_H
 #define HW_I386_INTEL_IOMMU_INTERNAL_H
 #include "hw/i386/intel_iommu.h"
+#include "system/host_iommu_device.h"
 
 /*
  * Intel IOMMU register specification
@@ -608,4 +609,10 @@ typedef struct VTDRootEntry VTDRootEntry;
 /* Bits to decide the offset for each level */
 #define VTD_LEVEL_BITS           9
 
+typedef struct VTDHostIOMMUDevice {
+    IntelIOMMUState *iommu_state;
+    PCIBus *bus;
+    uint8_t devfn;
+    HostIOMMUDevice *hiod;
+} VTDHostIOMMUDevice;
 #endif
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 564d4d4236..3351892da0 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -295,7 +295,7 @@ struct IntelIOMMUState {
     /* list of registered notifiers */
     QLIST_HEAD(, VTDAddressSpace) vtd_as_with_notifiers;
 
-    GHashTable *vtd_host_iommu_dev;             /* HostIOMMUDevice */
+    GHashTable *vtd_host_iommu_dev;             /* VTDHostIOMMUDevice */
 
     /* interrupt remapping */
     bool intr_enabled;              /* Whether guest enabled IR */
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 83c40975cc..ba40649c85 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -280,7 +280,10 @@ static gboolean vtd_hiod_equal(gconstpointer v1, gconstpointer v2)
 
 static void vtd_hiod_destroy(gpointer v)
 {
-    object_unref(v);
+    VTDHostIOMMUDevice *vtd_hiod = v;
+
+    object_unref(vtd_hiod->hiod);
+    g_free(vtd_hiod);
 }
 
 static gboolean vtd_hash_remove_by_domain(gpointer key, gpointer value,
@@ -4360,6 +4363,7 @@ static bool vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
                                      HostIOMMUDevice *hiod, Error **errp)
 {
     IntelIOMMUState *s = opaque;
+    VTDHostIOMMUDevice *vtd_hiod;
     struct vtd_as_key key = {
         .bus = bus,
         .devfn = devfn,
@@ -4376,7 +4380,14 @@ static bool vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
         return false;
     }
 
+    vtd_hiod = g_malloc0(sizeof(VTDHostIOMMUDevice));
+    vtd_hiod->bus = bus;
+    vtd_hiod->devfn = (uint8_t)devfn;
+    vtd_hiod->iommu_state = s;
+    vtd_hiod->hiod = hiod;
+
     if (!vtd_check_hiod(s, hiod, errp)) {
+        g_free(vtd_hiod);
         vtd_iommu_unlock(s);
         return false;
     }
@@ -4386,7 +4397,7 @@ static bool vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
     new_key->devfn = devfn;
 
     object_ref(hiod);
-    g_hash_table_insert(s->vtd_host_iommu_dev, new_key, hiod);
+    g_hash_table_insert(s->vtd_host_iommu_dev, new_key, vtd_hiod);
 
     vtd_iommu_unlock(s);
 
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v6 08/22] vfio/iommufd: Force creating nesting parent HWPT
  2025-09-18  8:57 [PATCH v6 00/22] intel_iommu: Enable first stage translation for passthrough device Zhenzhong Duan
                   ` (6 preceding siblings ...)
  2025-09-18  8:57 ` [PATCH v6 07/22] intel_iommu: Introduce a new structure VTDHostIOMMUDevice Zhenzhong Duan
@ 2025-09-18  8:57 ` Zhenzhong Duan
  2025-09-30 14:19   ` Eric Auger
  2025-10-12 12:33   ` Yi Liu
  2025-09-18  8:57 ` [PATCH v6 09/22] intel_iommu: Stick to system MR for IOMMUFD backed host device when x-flts=on Zhenzhong Duan
                   ` (13 subsequent siblings)
  21 siblings, 2 replies; 57+ messages in thread
From: Zhenzhong Duan @ 2025-09-18  8:57 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, skolothumtho, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan

Call pci_device_get_viommu_flags() to query whether the vIOMMU exposes
VIOMMU_FLAG_WANT_NESTING_PARENT.

If it does, create a nesting parent HWPT and add it to the container's
hwpt_list, letting this parent HWPT cover the entire second stage mappings
(GPA=>HPA).

This allows a VFIO passthrough device to directly attach to this default
HWPT and then to use the system address space and its listener.

Introduce a vfio_device_get_viommu_flags_want_nesting() helper to
facilitate this implementation.

It is safe to do so because the vIOMMU can still fail the
set_iommu_device() call if something else related to the VFIO device or
vIOMMU isn't compatible.

Suggested-by: Nicolin Chen <nicolinc@nvidia.com>
Suggested-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
---
 include/hw/vfio/vfio-device.h |  2 ++
 hw/vfio/device.c              | 12 ++++++++++++
 hw/vfio/iommufd.c             |  9 +++++++++
 3 files changed, 23 insertions(+)

diff --git a/include/hw/vfio/vfio-device.h b/include/hw/vfio/vfio-device.h
index e7e6243e2d..a964091135 100644
--- a/include/hw/vfio/vfio-device.h
+++ b/include/hw/vfio/vfio-device.h
@@ -257,6 +257,8 @@ void vfio_device_prepare(VFIODevice *vbasedev, VFIOContainerBase *bcontainer,
 
 void vfio_device_unprepare(VFIODevice *vbasedev);
 
+bool vfio_device_get_viommu_flags_want_nesting(VFIODevice *vbasedev);
+
 int vfio_device_get_region_info(VFIODevice *vbasedev, int index,
                                 struct vfio_region_info **info);
 int vfio_device_get_region_info_type(VFIODevice *vbasedev, uint32_t type,
diff --git a/hw/vfio/device.c b/hw/vfio/device.c
index 08f12ac31f..620cc78b77 100644
--- a/hw/vfio/device.c
+++ b/hw/vfio/device.c
@@ -23,6 +23,7 @@
 
 #include "hw/vfio/vfio-device.h"
 #include "hw/vfio/pci.h"
+#include "hw/iommu.h"
 #include "hw/hw.h"
 #include "trace.h"
 #include "qapi/error.h"
@@ -504,6 +505,17 @@ void vfio_device_unprepare(VFIODevice *vbasedev)
     vbasedev->bcontainer = NULL;
 }
 
+bool vfio_device_get_viommu_flags_want_nesting(VFIODevice *vbasedev)
+{
+    VFIOPCIDevice *vdev = vfio_pci_from_vfio_device(vbasedev);
+
+    if (vdev) {
+        return !!(pci_device_get_viommu_flags(&vdev->parent_obj) &
+                  VIOMMU_FLAG_WANT_NESTING_PARENT);
+    }
+    return false;
+}
+
 /*
  * Traditional ioctl() based io
  */
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index 8c27222f75..f1684a39b7 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -379,6 +379,15 @@ static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
         flags = IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
     }
 
+    /*
+     * If the vIOMMU requests VFIO's cooperation to create a nesting parent
+     * HWPT, force creating it so that it can be reused by the vIOMMU to
+     * create nested HWPTs.
+     */
+    if (vfio_device_get_viommu_flags_want_nesting(vbasedev)) {
+        flags |= IOMMU_HWPT_ALLOC_NEST_PARENT;
+    }
+
     if (cpr_is_incoming()) {
         hwpt_id = vbasedev->cpr.hwpt_id;
         goto skip_alloc;
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v6 09/22] intel_iommu: Stick to system MR for IOMMUFD backed host device when x-flts=on
  2025-09-18  8:57 [PATCH v6 00/22] intel_iommu: Enable first stage translation for passthrough device Zhenzhong Duan
                   ` (7 preceding siblings ...)
  2025-09-18  8:57 ` [PATCH v6 08/22] vfio/iommufd: Force creating nesting parent HWPT Zhenzhong Duan
@ 2025-09-18  8:57 ` Zhenzhong Duan
  2025-09-30 15:04   ` Eric Auger
  2025-10-12 12:51   ` Yi Liu
  2025-09-18  8:57 ` [PATCH v6 10/22] intel_iommu: Check for compatibility with IOMMUFD backed device when x-flts=on Zhenzhong Duan
                   ` (12 subsequent siblings)
  21 siblings, 2 replies; 57+ messages in thread
From: Zhenzhong Duan @ 2025-09-18  8:57 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, skolothumtho, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan

When the guest enables scalable mode and sets up a first stage page table,
we don't want to use the IOMMU MR but rather continue using the system MR
for an IOMMUFD backed host device.

Then the default HWPT in VFIO contains GPA->HPA mappings, which can be
reused as the nesting parent HWPT to construct a nested HWPT in the vIOMMU.
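
The resulting address-space decision can be modelled as a small pure
function; the struct below is an illustrative condensation of
IntelIOMMUState/VTDAddressSpace state, not the real QEMU types:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative condensation of the state consulted by
 * vtd_switch_address_space() */
typedef struct {
    bool dmar_enabled;
    bool root_scalable;
    bool fsts;            /* guest first stage translation enabled */
    bool pt_enabled;      /* PGTT=PT for this device */
    bool iommufd_backed;  /* device has an IOMMUFD backed HostIOMMUDevice */
} AsSwitchInput;

/* true: use the vIOMMU MR; false: stay on the system MR */
bool use_iommu_mr(const AsSwitchInput *in)
{
    bool use_iommu = in->dmar_enabled && !in->pt_enabled;

    /*
     * With x-flts=on and scalable mode, an IOMMUFD backed device keeps
     * using the system MR so VFIO's default HWPT (GPA->HPA) can serve
     * as the nesting parent.
     */
    if (in->root_scalable && in->fsts && in->iommufd_backed) {
        use_iommu = false;
    }
    return use_iommu;
}
```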

Suggested-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/i386/intel_iommu.c | 37 +++++++++++++++++++++++++++++++++++--
 1 file changed, 35 insertions(+), 2 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index ba40649c85..bd80de1670 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -40,6 +40,7 @@
 #include "kvm/kvm_i386.h"
 #include "migration/vmstate.h"
 #include "trace.h"
+#include "system/iommufd.h"
 
 /* context entry operations */
 #define RID_PASID    0
@@ -1702,6 +1703,24 @@ static bool vtd_dev_pt_enabled(IntelIOMMUState *s, VTDContextEntry *ce,
 
 }
 
+static VTDHostIOMMUDevice *vtd_find_hiod_iommufd(VTDAddressSpace *as)
+{
+    IntelIOMMUState *s = as->iommu_state;
+    struct vtd_as_key key = {
+        .bus = as->bus,
+        .devfn = as->devfn,
+    };
+    VTDHostIOMMUDevice *vtd_hiod = g_hash_table_lookup(s->vtd_host_iommu_dev,
+                                                       &key);
+
+    if (vtd_hiod && vtd_hiod->hiod &&
+        object_dynamic_cast(OBJECT(vtd_hiod->hiod),
+                            TYPE_HOST_IOMMU_DEVICE_IOMMUFD)) {
+        return vtd_hiod;
+    }
+    return NULL;
+}
+
 static bool vtd_as_pt_enabled(VTDAddressSpace *as)
 {
     IntelIOMMUState *s;
@@ -1710,6 +1729,7 @@ static bool vtd_as_pt_enabled(VTDAddressSpace *as)
     assert(as);
 
     s = as->iommu_state;
+
     if (vtd_dev_to_context_entry(s, pci_bus_num(as->bus), as->devfn,
                                  &ce)) {
         /*
@@ -1727,12 +1747,25 @@ static bool vtd_as_pt_enabled(VTDAddressSpace *as)
 /* Return whether the device is using IOMMU translation. */
 static bool vtd_switch_address_space(VTDAddressSpace *as)
 {
+    IntelIOMMUState *s;
     bool use_iommu, pt;
 
     assert(as);
 
-    use_iommu = as->iommu_state->dmar_enabled && !vtd_as_pt_enabled(as);
-    pt = as->iommu_state->dmar_enabled && vtd_as_pt_enabled(as);
+    s = as->iommu_state;
+    use_iommu = s->dmar_enabled && !vtd_as_pt_enabled(as);
+    pt = s->dmar_enabled && vtd_as_pt_enabled(as);
+
+    /*
+     * When the guest enables scalable mode and sets up a first stage page
+     * table, we stick to the system MR for an IOMMUFD backed host device.
+     * Then its default hwpt contains GPA->HPA mappings, which are used
+     * directly if PGTT=PT and used as the nesting parent if PGTT=FST.
+     * Otherwise fall back to the original processing.
+     */
+    if (s->root_scalable && s->fsts && vtd_find_hiod_iommufd(as)) {
+        use_iommu = false;
+    }
 
     trace_vtd_switch_address_space(pci_bus_num(as->bus),
                                    VTD_PCI_SLOT(as->devfn),
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v6 10/22] intel_iommu: Check for compatibility with IOMMUFD backed device when x-flts=on
  2025-09-18  8:57 [PATCH v6 00/22] intel_iommu: Enable first stage translation for passthrough device Zhenzhong Duan
                   ` (8 preceding siblings ...)
  2025-09-18  8:57 ` [PATCH v6 09/22] intel_iommu: Stick to system MR for IOMMUFD backed host device when x-flts=on Zhenzhong Duan
@ 2025-09-18  8:57 ` Zhenzhong Duan
  2025-10-12 12:55   ` Yi Liu
  2025-09-18  8:57 ` [PATCH v6 11/22] intel_iommu: Fail passthrough device under PCI bridge if x-flts=on Zhenzhong Duan
                   ` (11 subsequent siblings)
  21 siblings, 1 reply; 57+ messages in thread
From: Zhenzhong Duan @ 2025-09-18  8:57 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, skolothumtho, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan

When the vIOMMU is configured with x-flts=on in scalable mode, the first
stage page table is passed to the host to construct a nested page table
for passthrough devices.

We need to check the compatibility of some critical IOMMU capabilities
between the vIOMMU and the host IOMMU to ensure the guest first stage
page table can be used by the host.

For instance, if the vIOMMU supports first stage 1GB large page mapping
but the host does not, then this IOMMUFD backed device should fail.

Even if the checks pass, for now we willingly reject the association
because all the bits are not there yet; this will be relaxed at the end
of this series.
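
The 1GB-large-page check is a simple mask test against the host
capability register. A hedged sketch, where the FS1GP bit position is a
placeholder (the real definition is VTD_CAP_FS1GP in
intel_iommu_internal.h):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit position for illustration only */
#define VTD_CAP_FS1GP (1ULL << 17)

/*
 * Returns true if the host capability register satisfies the vIOMMU's
 * first-stage 1GB page configuration; otherwise fills *err.
 */
bool check_fs1gp_compat(bool viommu_fs1gp, uint64_t host_cap_reg,
                        const char **err)
{
    if (viommu_fs1gp && !(host_cap_reg & VTD_CAP_FS1GP)) {
        *err = "First stage 1GB large page is unsupported by host IOMMU";
        return false;
    }
    return true;
}
```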

Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
---
 hw/i386/intel_iommu.c | 25 ++++++++++++++++++++++++-
 1 file changed, 24 insertions(+), 1 deletion(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index bd80de1670..bcfbc5dd46 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -4387,8 +4387,31 @@ static bool vtd_check_hiod(IntelIOMMUState *s, HostIOMMUDevice *hiod,
         return true;
     }
 
+#ifdef CONFIG_IOMMUFD
+    struct HostIOMMUDeviceCaps *caps = &hiod->caps;
+    struct iommu_hw_info_vtd *vtd = &caps->vendor_caps.vtd;
+
+    /* Remaining checks are all first stage translation specific */
+    if (!object_dynamic_cast(OBJECT(hiod), TYPE_HOST_IOMMU_DEVICE_IOMMUFD)) {
+        error_setg(errp, "Need IOMMUFD backend when x-flts=on");
+        return false;
+    }
+
+    if (caps->type != IOMMU_HW_INFO_TYPE_INTEL_VTD) {
+        error_setg(errp, "Incompatible host platform IOMMU type %d",
+                   caps->type);
+        return false;
+    }
+
+    if (s->fs1gp && !(vtd->cap_reg & VTD_CAP_FS1GP)) {
+        error_setg(errp,
+                   "First stage 1GB large page is unsupported by host IOMMU");
+        return false;
+    }
+#endif
+
     error_setg(errp,
-               "host device is uncompatible with first stage translation");
+               "host IOMMU is incompatible with guest first stage translation");
     return false;
 }
 
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v6 11/22] intel_iommu: Fail passthrough device under PCI bridge if x-flts=on
  2025-09-18  8:57 [PATCH v6 00/22] intel_iommu: Enable first stage translation for passthrough device Zhenzhong Duan
                   ` (9 preceding siblings ...)
  2025-09-18  8:57 ` [PATCH v6 10/22] intel_iommu: Check for compatibility with IOMMUFD backed device when x-flts=on Zhenzhong Duan
@ 2025-09-18  8:57 ` Zhenzhong Duan
  2025-09-18  8:57 ` [PATCH v6 12/22] intel_iommu: Handle PASID cache invalidation Zhenzhong Duan
                   ` (10 subsequent siblings)
  21 siblings, 0 replies; 57+ messages in thread
From: Zhenzhong Duan @ 2025-09-18  8:57 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, skolothumtho, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan

Currently we don't support nested translation for a passthrough device
that shares a PCI bridge with an emulated device, because they require
different address spaces when x-flts=on.

In theory, we could support the case where all devices under the same PCI
bridge are passthrough devices. But an emulated device can be hotplugged
under the same bridge. To simplify, just forbid passthrough devices under
a PCI bridge no matter whether there are, or will be, emulated devices
under the same bridge. This is acceptable because PCIe bridges are more
common than PCI bridges now.
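
The bridge detection reuses the aliasing walk from
pci_device_get_iommu_bus_devfn(). A minimal model of that walk, with
illustrative types (the real code also handles PCIe-to-PCI and
PCI-to-PCIe transitions in more detail):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-in for PCIBus */
typedef struct Bus {
    struct Bus *parent;   /* NULL at the root */
    bool has_iommu_ops;   /* bus directly backed by an IOMMU */
    bool is_express;      /* PCIe bridge: no requester-ID aliasing */
} Bus;

/*
 * Climb toward the IOMMU root; any hop through a conventional PCI
 * bridge aliases the requester ID, which is what vtd_check_hiod()
 * rejects when x-flts=on.
 */
bool device_is_aliased(const Bus *bus)
{
    bool aliased = false;

    while (bus && !bus->has_iommu_ops && bus->parent) {
        if (!bus->is_express) {
            aliased = true;
        }
        bus = bus->parent;
    }
    return aliased;
}
```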

Suggested-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
---
 hw/i386/intel_iommu.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index bcfbc5dd46..d37d47115a 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -4361,9 +4361,10 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus,
     return vtd_dev_as;
 }
 
-static bool vtd_check_hiod(IntelIOMMUState *s, HostIOMMUDevice *hiod,
+static bool vtd_check_hiod(IntelIOMMUState *s, VTDHostIOMMUDevice *vtd_hiod,
                            Error **errp)
 {
+    HostIOMMUDevice *hiod = vtd_hiod->hiod;
     HostIOMMUDeviceClass *hiodc = HOST_IOMMU_DEVICE_GET_CLASS(hiod);
     int ret;
 
@@ -4390,6 +4391,8 @@ static bool vtd_check_hiod(IntelIOMMUState *s, HostIOMMUDevice *hiod,
 #ifdef CONFIG_IOMMUFD
     struct HostIOMMUDeviceCaps *caps = &hiod->caps;
     struct iommu_hw_info_vtd *vtd = &caps->vendor_caps.vtd;
+    PCIBus *bus = vtd_hiod->bus;
+    PCIDevice *pdev = bus->devices[vtd_hiod->devfn];
 
     /* Remaining checks are all first stage translation specific */
     if (!object_dynamic_cast(OBJECT(hiod), TYPE_HOST_IOMMU_DEVICE_IOMMUFD)) {
@@ -4408,6 +4411,12 @@ static bool vtd_check_hiod(IntelIOMMUState *s, HostIOMMUDevice *hiod,
                    "First stage 1GB large page is unsupported by host IOMMU");
         return false;
     }
+
+    if (pci_device_get_iommu_bus_devfn(pdev, &bus, NULL, NULL)) {
+        error_setg(errp, "Host device under PCI bridge is unsupported "
+                   "when x-flts=on");
+        return false;
+    }
 #endif
 
     error_setg(errp,
@@ -4442,7 +4451,7 @@ static bool vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
     vtd_hiod->iommu_state = s;
     vtd_hiod->hiod = hiod;
 
-    if (!vtd_check_hiod(s, hiod, errp)) {
+    if (!vtd_check_hiod(s, vtd_hiod, errp)) {
         g_free(vtd_hiod);
         vtd_iommu_unlock(s);
         return false;
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v6 12/22] intel_iommu: Handle PASID cache invalidation
  2025-09-18  8:57 [PATCH v6 00/22] intel_iommu: Enable first stage translation for passthrough device Zhenzhong Duan
                   ` (10 preceding siblings ...)
  2025-09-18  8:57 ` [PATCH v6 11/22] intel_iommu: Fail passthrough device under PCI bridge if x-flts=on Zhenzhong Duan
@ 2025-09-18  8:57 ` Zhenzhong Duan
  2025-10-12 14:58   ` Yi Liu
  2025-09-18  8:57 ` [PATCH v6 13/22] intel_iommu: Reset pasid cache when system level reset Zhenzhong Duan
                   ` (9 subsequent siblings)
  21 siblings, 1 reply; 57+ messages in thread
From: Zhenzhong Duan @ 2025-09-18  8:57 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, skolothumtho, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan

This adds PASID cache sync for RID_PASID; non-RID_PASID isn't supported.

Add a new entry, VTDPASIDCacheEntry, in VTDAddressSpace to cache the pasid
entry and track PASID usage, in preparation for future PASID tagged DMA
address translation support in the vIOMMU.

When the guest triggers a pasid cache invalidation, QEMU captures it and
updates or invalidates the pasid cache.

The vIOMMU emulator can figure out the reason by fetching the latest guest
pasid entry in memory and comparing it with the cached PASID entry, if the
latter is valid.
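
The VTD_INV_DESC_PASIDC_* field extraction added by this patch can be
exercised standalone. extract64() below follows the semantics of the QEMU
qemu/bitops.h helper, while InvDesc is a simplified stand-in for
VTDInvDesc:

```c
#include <assert.h>
#include <stdint.h>

/* Same semantics as QEMU's extract64() for 0 < length <= 64 */
uint64_t extract64(uint64_t value, int start, int length)
{
    return (value >> start) & (~0ULL >> (64 - length));
}

/* Simplified stand-in for VTDInvDesc */
typedef struct {
    uint64_t val[4];
} InvDesc;

/* Field layout per the VTD_INV_DESC_PASIDC_* macros in this patch:
 * granularity at bits 5:4, DID at 31:16, PASID at 51:32 of val[0]. */
unsigned pasidc_granularity(const InvDesc *d)
{
    return extract64(d->val[0], 4, 2);
}

unsigned pasidc_did(const InvDesc *d)
{
    return extract64(d->val[0], 16, 16);
}

uint32_t pasidc_pasid(const InvDesc *d)
{
    return extract64(d->val[0], 32, 20);
}
```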

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/i386/intel_iommu_internal.h |  19 +++-
 include/hw/i386/intel_iommu.h  |   6 ++
 hw/i386/intel_iommu.c          | 157 ++++++++++++++++++++++++++++++---
 hw/i386/trace-events           |   3 +
 4 files changed, 173 insertions(+), 12 deletions(-)

diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 9cdc8d5dbb..d400bcee21 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -316,6 +316,7 @@ typedef enum VTDFaultReason {
                                   * request while disabled */
     VTD_FR_IR_SID_ERR = 0x26,   /* Invalid Source-ID */
 
+    VTD_FR_RTADDR_INV_TTM = 0x31,  /* Invalid TTM in RTADDR */
     /* PASID directory entry access failure */
     VTD_FR_PASID_DIR_ACCESS_ERR = 0x50,
     /* The Present(P) field of pasid directory entry is 0 */
@@ -493,6 +494,15 @@ typedef union VTDInvDesc VTDInvDesc;
 #define VTD_INV_DESC_PIOTLB_RSVD_VAL0     0xfff000000000f1c0ULL
 #define VTD_INV_DESC_PIOTLB_RSVD_VAL1     0xf80ULL
 
+/* PASID-cache Invalidate Descriptor (pc_inv_dsc) fields */
+#define VTD_INV_DESC_PASIDC_G(x)        extract64((x)->val[0], 4, 2)
+#define VTD_INV_DESC_PASIDC_G_DSI       0
+#define VTD_INV_DESC_PASIDC_G_PASID_SI  1
+#define VTD_INV_DESC_PASIDC_G_GLOBAL    3
+#define VTD_INV_DESC_PASIDC_DID(x)      extract64((x)->val[0], 16, 16)
+#define VTD_INV_DESC_PASIDC_PASID(x)    extract64((x)->val[0], 32, 20)
+#define VTD_INV_DESC_PASIDC_RSVD_VAL0   0xfff000000000f1c0ULL
+
 /* Information about page-selective IOTLB invalidate */
 struct VTDIOTLBPageInvInfo {
     uint16_t domain_id;
@@ -552,6 +562,13 @@ typedef struct VTDRootEntry VTDRootEntry;
 #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw)  (0x1e0ULL | ~VTD_HAW_MASK(aw))
 #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1      0xffffffffffe00000ULL
 
+typedef struct VTDPASIDCacheInfo {
+    uint8_t type;
+    uint16_t did;
+    uint32_t pasid;
+    bool reset;
+} VTDPASIDCacheInfo;
+
 /* PASID Table Related Definitions */
 #define VTD_PASID_DIR_BASE_ADDR_MASK  (~0xfffULL)
 #define VTD_PASID_TABLE_BASE_ADDR_MASK (~0xfffULL)
@@ -573,7 +590,7 @@ typedef struct VTDRootEntry VTDRootEntry;
 #define VTD_SM_PASID_ENTRY_PT          (4ULL << 6)
 
 #define VTD_SM_PASID_ENTRY_AW          7ULL /* Adjusted guest-address-width */
-#define VTD_SM_PASID_ENTRY_DID(val)    ((val) & VTD_DOMAIN_ID_MASK)
+#define VTD_SM_PASID_ENTRY_DID(x)      extract64((x)->val[1], 0, 16)
 
 #define VTD_SM_PASID_ENTRY_FSPM          3ULL
 #define VTD_SM_PASID_ENTRY_FSPTPTR       (~0xfffULL)
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 3351892da0..ff01e5c82d 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -95,6 +95,11 @@ struct VTDPASIDEntry {
     uint64_t val[8];
 };
 
+typedef struct VTDPASIDCacheEntry {
+    struct VTDPASIDEntry pasid_entry;
+    bool valid;
+} VTDPASIDCacheEntry;
+
 struct VTDAddressSpace {
     PCIBus *bus;
     uint8_t devfn;
@@ -107,6 +112,7 @@ struct VTDAddressSpace {
     MemoryRegion iommu_ir_fault; /* Interrupt region for catching fault */
     IntelIOMMUState *iommu_state;
     VTDContextCacheEntry context_cache_entry;
+    VTDPASIDCacheEntry pasid_cache_entry;
     QLIST_ENTRY(VTDAddressSpace) next;
     /* Superset of notifier flags that this address space has */
     IOMMUNotifierFlag notifier_flags;
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index d37d47115a..24061f6dc6 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -1614,7 +1614,7 @@ static uint16_t vtd_get_domain_id(IntelIOMMUState *s,
 
     if (s->root_scalable) {
         vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
-        return VTD_SM_PASID_ENTRY_DID(pe.val[1]);
+        return VTD_SM_PASID_ENTRY_DID(&pe);
     }
 
     return VTD_CONTEXT_ENTRY_DID(ce->hi);
@@ -3074,6 +3074,144 @@ static bool vtd_process_piotlb_desc(IntelIOMMUState *s,
     return true;
 }
 
+static inline int vtd_dev_get_pe_from_pasid(VTDAddressSpace *vtd_as,
+                                            VTDPASIDEntry *pe)
+{
+    IntelIOMMUState *s = vtd_as->iommu_state;
+    VTDContextEntry ce;
+    int ret;
+
+    if (!s->root_scalable) {
+        return -VTD_FR_RTADDR_INV_TTM;
+    }
+
+    ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus), vtd_as->devfn,
+                                   &ce);
+    if (ret) {
+        return ret;
+    }
+
+    return vtd_ce_get_pasid_entry(s, &ce, pe, vtd_as->pasid);
+}
+
+/*
+ * For each IOMMUFD backed device, update or invalidate pasid cache based on
+ * the value in memory.
+ */
+static void vtd_pasid_cache_sync_locked(gpointer key, gpointer value,
+                                        gpointer user_data)
+{
+    VTDPASIDCacheInfo *pc_info = user_data;
+    VTDAddressSpace *vtd_as = value;
+    VTDPASIDCacheEntry *pc_entry = &vtd_as->pasid_cache_entry;
+    VTDPASIDEntry pe;
+    uint16_t did;
+
+    /* Ignore emulated device or legacy VFIO backed device */
+    if (!vtd_find_hiod_iommufd(vtd_as)) {
+        return;
+    }
+
+    /* non-RID_PASID isn't supported yet */
+    assert(vtd_as->pasid == PCI_NO_PASID);
+
+    if (vtd_dev_get_pe_from_pasid(vtd_as, &pe)) {
+        /*
+         * No valid pasid entry in guest memory. e.g. pasid entry was modified
+         * to be either all-zero or non-present. Either case means existing
+         * pasid cache should be invalidated.
+         */
+        pc_entry->valid = false;
+        return;
+    }
+
+    /*
+     * VTD_INV_DESC_PASIDC_G_DSI and VTD_INV_DESC_PASIDC_G_PASID_SI require
+     * DID check. If DID doesn't match the value in cache or memory, then
+     * it's not a pasid entry we want to invalidate.
+     */
+    switch (pc_info->type) {
+    case VTD_INV_DESC_PASIDC_G_PASID_SI:
+    case VTD_INV_DESC_PASIDC_G_DSI:
+        if (pc_entry->valid) {
+            did = VTD_SM_PASID_ENTRY_DID(&pc_entry->pasid_entry);
+            if (pc_info->did == did) {
+                break;
+            }
+        }
+        did = VTD_SM_PASID_ENTRY_DID(&pe);
+        if (pc_info->did == did) {
+            break;
+        }
+        return;
+    }
+
+    pc_entry->pasid_entry = pe;
+    pc_entry->valid = true;
+}
+
+static void vtd_pasid_cache_sync(IntelIOMMUState *s, VTDPASIDCacheInfo *pc_info)
+{
+    if (!s->fsts || !s->root_scalable || !s->dmar_enabled) {
+        return;
+    }
+
+    vtd_iommu_lock(s);
+    g_hash_table_foreach(s->vtd_address_spaces, vtd_pasid_cache_sync_locked,
+                         pc_info);
+    vtd_iommu_unlock(s);
+}
+
+static bool vtd_process_pasid_desc(IntelIOMMUState *s,
+                                   VTDInvDesc *inv_desc)
+{
+    uint16_t did;
+    uint32_t pasid;
+    VTDPASIDCacheInfo pc_info = {};
+    uint64_t mask[4] = {VTD_INV_DESC_PASIDC_RSVD_VAL0, VTD_INV_DESC_ALL_ONE,
+                        VTD_INV_DESC_ALL_ONE, VTD_INV_DESC_ALL_ONE};
+
+    if (!vtd_inv_desc_reserved_check(s, inv_desc, mask, true,
+                                     __func__, "pasid cache inv")) {
+        return false;
+    }
+
+    did = VTD_INV_DESC_PASIDC_DID(inv_desc);
+    pasid = VTD_INV_DESC_PASIDC_PASID(inv_desc);
+    pc_info.type = VTD_INV_DESC_PASIDC_G(inv_desc);
+
+    switch (pc_info.type) {
+    case VTD_INV_DESC_PASIDC_G_DSI:
+        trace_vtd_inv_desc_pasid_cache_dsi(did);
+        pc_info.did = did;
+        break;
+
+    case VTD_INV_DESC_PASIDC_G_PASID_SI:
+        /* PASID selective implies a DID selective */
+        trace_vtd_inv_desc_pasid_cache_psi(did, pasid);
+        /* Currently non-RID_PASID invalidation requests are ignored */
+        if (pasid != RID_PASID) {
+            return true;
+        }
+        pc_info.did = did;
+        pc_info.pasid = pasid;
+        break;
+
+    case VTD_INV_DESC_PASIDC_G_GLOBAL:
+        trace_vtd_inv_desc_pasid_cache_gsi();
+        break;
+
+    default:
+        error_report_once("invalid granularity field in PASID-cache invalidate "
+                          "descriptor, hi: 0x%"PRIx64" lo: 0x%" PRIx64,
+                           inv_desc->val[1], inv_desc->val[0]);
+        return false;
+    }
+
+    vtd_pasid_cache_sync(s, &pc_info);
+    return true;
+}
+
 static bool vtd_process_inv_iec_desc(IntelIOMMUState *s,
                                      VTDInvDesc *inv_desc)
 {
@@ -3236,6 +3374,13 @@ static bool vtd_process_inv_desc(IntelIOMMUState *s)
         }
         break;
 
+    case VTD_INV_DESC_PC:
+        trace_vtd_inv_desc("pasid-cache", inv_desc.val[1], inv_desc.val[0]);
+        if (!vtd_process_pasid_desc(s, &inv_desc)) {
+            return false;
+        }
+        break;
+
     case VTD_INV_DESC_PIOTLB:
         trace_vtd_inv_desc("p-iotlb", inv_desc.val[1], inv_desc.val[0]);
         if (!vtd_process_piotlb_desc(s, &inv_desc)) {
@@ -3271,16 +3416,6 @@ static bool vtd_process_inv_desc(IntelIOMMUState *s)
         }
         break;
 
-    /*
-     * TODO: the entity of below two cases will be implemented in future series.
-     * To make guest (which integrates scalable mode support patch set in
-     * iommu driver) work, just return true is enough so far.
-     */
-    case VTD_INV_DESC_PC:
-        if (s->scalable_mode) {
-            break;
-        }
-    /* fallthrough */
     default:
         error_report_once("%s: invalid inv desc: hi=%"PRIx64", lo=%"PRIx64
                           " (unknown type)", __func__, inv_desc.hi,
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index ac9e1a10aa..298addb24d 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -24,6 +24,9 @@ vtd_inv_qi_head(uint16_t head) "read head %d"
 vtd_inv_qi_tail(uint16_t head) "write tail %d"
 vtd_inv_qi_fetch(void) ""
 vtd_context_cache_reset(void) ""
+vtd_inv_desc_pasid_cache_gsi(void) ""
+vtd_inv_desc_pasid_cache_dsi(uint16_t domain) "Domain selective PC invalidation domain 0x%"PRIx16
+vtd_inv_desc_pasid_cache_psi(uint16_t domain, uint32_t pasid) "PASID selective PC invalidation domain 0x%"PRIx16" pasid 0x%"PRIx32
 vtd_re_not_present(uint8_t bus) "Root entry bus %"PRIu8" not present"
 vtd_ce_not_present(uint8_t bus, uint8_t devfn) "Context entry bus %"PRIu8" devfn %"PRIu8" not present"
 vtd_iotlb_page_hit(uint16_t sid, uint64_t addr, uint64_t slpte, uint16_t domain) "IOTLB page hit sid 0x%"PRIx16" iova 0x%"PRIx64" slpte 0x%"PRIx64" domain 0x%"PRIx16
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v6 13/22] intel_iommu: Reset pasid cache when system level reset
  2025-09-18  8:57 [PATCH v6 00/22] intel_iommu: Enable first stage translation for passthrough device Zhenzhong Duan
                   ` (11 preceding siblings ...)
  2025-09-18  8:57 ` [PATCH v6 12/22] intel_iommu: Handle PASID cache invalidation Zhenzhong Duan
@ 2025-09-18  8:57 ` Zhenzhong Duan
  2025-10-13 10:25   ` Yi Liu
  2025-09-18  8:57 ` [PATCH v6 14/22] intel_iommu: Add some macros and inline functions Zhenzhong Duan
                   ` (8 subsequent siblings)
  21 siblings, 1 reply; 57+ messages in thread
From: Zhenzhong Duan @ 2025-09-18  8:57 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, skolothumtho, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan

Reset the pasid cache on system level reset. For RID_PASID, its vtd_as is
allocated by the PCI subsystem and never removed, so just mark the pasid
cache entry invalid.

As we already have vtd_pasid_cache_sync_locked() to handle pasid cache
invalidation, reuse it to invalidate the pasid cache at system reset.

Currently only IOMMUFD backed VFIO devices cache pasid entries, so we
don't need to care about emulated devices.

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/i386/intel_iommu.c | 15 ++++++++++++++-
 hw/i386/trace-events  |  1 +
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 24061f6dc6..a6638e13be 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -85,6 +85,18 @@ struct vtd_iotlb_key {
 
 static void vtd_address_space_refresh_all(IntelIOMMUState *s);
 static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
+static void vtd_pasid_cache_sync_locked(gpointer key, gpointer value,
+                                        gpointer user_data);
+
+static void vtd_pasid_cache_reset_locked(IntelIOMMUState *s)
+{
+    VTDPASIDCacheInfo pc_info = { .reset = true };
+
+    trace_vtd_pasid_cache_reset();
+    g_hash_table_foreach(s->vtd_address_spaces,
+                         vtd_pasid_cache_sync_locked, &pc_info);
+}
+
 
 static void vtd_panic_require_caching_mode(void)
 {
@@ -390,6 +402,7 @@ static void vtd_reset_caches(IntelIOMMUState *s)
     vtd_iommu_lock(s);
     vtd_reset_iotlb_locked(s);
     vtd_reset_context_cache_locked(s);
+    vtd_pasid_cache_reset_locked(s);
     vtd_iommu_unlock(s);
 }
 
@@ -3115,7 +3128,7 @@ static void vtd_pasid_cache_sync_locked(gpointer key, gpointer value,
     /* non-RID_PASID isn't supported yet */
     assert(vtd_as->pasid == PCI_NO_PASID);
 
-    if (vtd_dev_get_pe_from_pasid(vtd_as, &pe)) {
+    if (pc_info->reset || vtd_dev_get_pe_from_pasid(vtd_as, &pe)) {
         /*
          * No valid pasid entry in guest memory. e.g. pasid entry was modified
          * to be either all-zero or non-present. Either case means existing
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index 298addb24d..b704f4f90c 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -24,6 +24,7 @@ vtd_inv_qi_head(uint16_t head) "read head %d"
 vtd_inv_qi_tail(uint16_t head) "write tail %d"
 vtd_inv_qi_fetch(void) ""
 vtd_context_cache_reset(void) ""
+vtd_pasid_cache_reset(void) ""
 vtd_inv_desc_pasid_cache_gsi(void) ""
 vtd_inv_desc_pasid_cache_dsi(uint16_t domain) "Domain selective PC invalidation domain 0x%"PRIx16
 vtd_inv_desc_pasid_cache_psi(uint16_t domain, uint32_t pasid) "PASID selective PC invalidation domain 0x%"PRIx16" pasid 0x%"PRIx32
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v6 14/22] intel_iommu: Add some macros and inline functions
  2025-09-18  8:57 [PATCH v6 00/22] intel_iommu: Enable first stage translation for passthrough device Zhenzhong Duan
                   ` (12 preceding siblings ...)
  2025-09-18  8:57 ` [PATCH v6 13/22] intel_iommu: Reset pasid cache when system level reset Zhenzhong Duan
@ 2025-09-18  8:57 ` Zhenzhong Duan
  2025-10-13 10:25   ` Yi Liu
  2025-09-18  8:57 ` [PATCH v6 15/22] intel_iommu: Bind/unbind guest page table to host Zhenzhong Duan
                   ` (7 subsequent siblings)
  21 siblings, 1 reply; 57+ messages in thread
From: Zhenzhong Duan @ 2025-09-18  8:57 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, skolothumtho, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan

Add some macros and inline functions that will be used by the following
patch.

This patch also makes a cleanup to change the macro VTD_SM_PASID_ENTRY_FSPM
to use extract64(), just like what SMMU does, because this macro is used
indirectly by the newly introduced inline functions. But we don't aim to
change the huge amount of bit-mask style macro definitions in this patch;
that should be done in a separate patch.

Suggested-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/i386/intel_iommu_internal.h |  6 +++++-
 hw/i386/intel_iommu.c          | 30 +++++++++++++++++++++++++++---
 2 files changed, 32 insertions(+), 4 deletions(-)

diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index d400bcee21..3d5ee5ed52 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -592,8 +592,12 @@ typedef struct VTDPASIDCacheInfo {
 #define VTD_SM_PASID_ENTRY_AW          7ULL /* Adjusted guest-address-width */
 #define VTD_SM_PASID_ENTRY_DID(x)      extract64((x)->val[1], 0, 16)
 
-#define VTD_SM_PASID_ENTRY_FSPM          3ULL
 #define VTD_SM_PASID_ENTRY_FSPTPTR       (~0xfffULL)
+#define VTD_SM_PASID_ENTRY_SRE_BIT(x)    extract64((x)->val[2], 0, 1)
+/* 00: 4-level paging, 01: 5-level paging, 10-11: Reserved */
+#define VTD_SM_PASID_ENTRY_FSPM(x)       extract64((x)->val[2], 2, 2)
+#define VTD_SM_PASID_ENTRY_WPE_BIT(x)    extract64((x)->val[2], 4, 1)
+#define VTD_SM_PASID_ENTRY_EAFE_BIT(x)   extract64((x)->val[2], 7, 1)
 
 /* First Level Paging Structure */
 /* Masks for First Level Paging Entry */
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index a6638e13be..5908368c44 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -49,8 +49,7 @@
 
 /* pe operations */
 #define VTD_PE_GET_TYPE(pe) ((pe)->val[0] & VTD_SM_PASID_ENTRY_PGTT)
-#define VTD_PE_GET_FS_LEVEL(pe) \
-    (4 + (((pe)->val[2] >> 2) & VTD_SM_PASID_ENTRY_FSPM))
+#define VTD_PE_GET_FS_LEVEL(pe) (VTD_SM_PASID_ENTRY_FSPM(pe) + 4)
 #define VTD_PE_GET_SS_LEVEL(pe) \
     (2 + (((pe)->val[0] >> 2) & VTD_SM_PASID_ENTRY_AW))
 
@@ -838,6 +837,31 @@ static inline bool vtd_pe_type_check(IntelIOMMUState *s, VTDPASIDEntry *pe)
     }
 }
 
+static inline dma_addr_t vtd_pe_get_fspt_base(VTDPASIDEntry *pe)
+{
+    return pe->val[2] & VTD_SM_PASID_ENTRY_FSPTPTR;
+}
+
+/*
+ * First stage IOVA address width: 48 bits for 4-level paging(FSPM=00)
+ *                                 57 bits for 5-level paging(FSPM=01)
+ */
+static inline uint32_t vtd_pe_get_fs_aw(VTDPASIDEntry *pe)
+{
+    return 48 + VTD_SM_PASID_ENTRY_FSPM(pe) * 9;
+}
+
+static inline bool vtd_pe_pgtt_is_pt(VTDPASIDEntry *pe)
+{
+    return (VTD_PE_GET_TYPE(pe) == VTD_SM_PASID_ENTRY_PT);
+}
+
+/* check if pgtt is first stage translation */
+static inline bool vtd_pe_pgtt_is_fst(VTDPASIDEntry *pe)
+{
+    return (VTD_PE_GET_TYPE(pe) == VTD_SM_PASID_ENTRY_FST);
+}
+
 static inline bool vtd_pdire_present(VTDPASIDDirEntry *pdire)
 {
     return pdire->val & 1;
@@ -1709,7 +1733,7 @@ static bool vtd_dev_pt_enabled(IntelIOMMUState *s, VTDContextEntry *ce,
              */
             return false;
         }
-        return (VTD_PE_GET_TYPE(&pe) == VTD_SM_PASID_ENTRY_PT);
+        return vtd_pe_pgtt_is_pt(&pe);
     }
 
     return (vtd_ce_get_type(ce) == VTD_CONTEXT_TT_PASS_THROUGH);
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v6 15/22] intel_iommu: Bind/unbind guest page table to host
  2025-09-18  8:57 [PATCH v6 00/22] intel_iommu: Enable first stage translation for passthrough device Zhenzhong Duan
                   ` (13 preceding siblings ...)
  2025-09-18  8:57 ` [PATCH v6 14/22] intel_iommu: Add some macros and inline functions Zhenzhong Duan
@ 2025-09-18  8:57 ` Zhenzhong Duan
  2025-09-18  8:57 ` [PATCH v6 16/22] intel_iommu: Propagate PASID-based iotlb invalidation " Zhenzhong Duan
                   ` (6 subsequent siblings)
  21 siblings, 0 replies; 57+ messages in thread
From: Zhenzhong Duan @ 2025-09-18  8:57 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, skolothumtho, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan, Yi Sun

This captures guest PASID table entry modifications and propagates the
changes to the host to attach a HWPT, with the type determined by the
guest IOMMU PGTT configuration.

When PGTT=PT, attach RID_PASID to a second stage HWPT (GPA->HPA).
When PGTT=FST, attach RID_PASID to a nested HWPT, with the nesting parent
HWPT coming from VFIO.

Co-Authored-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 include/hw/i386/intel_iommu.h |   1 +
 hw/i386/intel_iommu.c         | 152 +++++++++++++++++++++++++++++++++-
 hw/i386/trace-events          |   3 +
 3 files changed, 154 insertions(+), 2 deletions(-)

diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index ff01e5c82d..86614fbb31 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -104,6 +104,7 @@ struct VTDAddressSpace {
     PCIBus *bus;
     uint8_t devfn;
     uint32_t pasid;
+    uint32_t fs_hwpt;
     AddressSpace as;
     IOMMUMemoryRegion iommu;
     MemoryRegion root;          /* The root container of the device */
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 5908368c44..bfe229d0dc 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -20,6 +20,7 @@
  */
 
 #include "qemu/osdep.h"
+#include CONFIG_DEVICES /* CONFIG_IOMMUFD */
 #include "qemu/error-report.h"
 #include "qemu/main-loop.h"
 #include "qapi/error.h"
@@ -41,6 +42,9 @@
 #include "migration/vmstate.h"
 #include "trace.h"
 #include "system/iommufd.h"
+#ifdef CONFIG_IOMMUFD
+#include <linux/iommufd.h>
+#endif
 
 /* context entry operations */
 #define RID_PASID    0
@@ -2398,6 +2402,125 @@ static void vtd_context_global_invalidate(IntelIOMMUState *s)
     vtd_iommu_replay_all(s);
 }
 
+#ifdef CONFIG_IOMMUFD
+static int vtd_create_fs_hwpt(HostIOMMUDeviceIOMMUFD *idev,
+                              VTDPASIDEntry *pe, uint32_t *fs_hwpt,
+                              Error **errp)
+{
+    struct iommu_hwpt_vtd_s1 vtd = {};
+
+    vtd.flags = (VTD_SM_PASID_ENTRY_SRE_BIT(pe) ? IOMMU_VTD_S1_SRE : 0) |
+                (VTD_SM_PASID_ENTRY_WPE_BIT(pe) ? IOMMU_VTD_S1_WPE : 0) |
+                (VTD_SM_PASID_ENTRY_EAFE_BIT(pe) ? IOMMU_VTD_S1_EAFE : 0);
+    vtd.addr_width = vtd_pe_get_fs_aw(pe);
+    vtd.pgtbl_addr = (uint64_t)vtd_pe_get_fspt_base(pe);
+
+    return !iommufd_backend_alloc_hwpt(idev->iommufd, idev->devid,
+                                       idev->hwpt_id, 0, IOMMU_HWPT_DATA_VTD_S1,
+                                       sizeof(vtd), &vtd, fs_hwpt, errp);
+}
+
+static void vtd_destroy_old_fs_hwpt(HostIOMMUDeviceIOMMUFD *idev,
+                                    VTDAddressSpace *vtd_as)
+{
+    if (!vtd_as->fs_hwpt) {
+        return;
+    }
+    iommufd_backend_free_id(idev->iommufd, vtd_as->fs_hwpt);
+    vtd_as->fs_hwpt = 0;
+}
+
+static int vtd_device_attach_iommufd(VTDHostIOMMUDevice *vtd_hiod,
+                                     VTDAddressSpace *vtd_as, Error **errp)
+{
+    HostIOMMUDeviceIOMMUFD *idev = HOST_IOMMU_DEVICE_IOMMUFD(vtd_hiod->hiod);
+    VTDPASIDEntry *pe = &vtd_as->pasid_cache_entry.pasid_entry;
+    uint32_t hwpt_id;
+    bool ret;
+
+    /*
+     * We can get here only if flts=on, the supported PGTT is FST and PT.
+     * Catch invalid PGTT when processing invalidation request to avoid
+     * attaching to wrong hwpt.
+     */
+    if (!vtd_pe_pgtt_is_fst(pe) && !vtd_pe_pgtt_is_pt(pe)) {
+        error_setg(errp, "Invalid PGTT type");
+        return -EINVAL;
+    }
+
+    if (vtd_pe_pgtt_is_pt(pe)) {
+        hwpt_id = idev->hwpt_id;
+    } else if (vtd_create_fs_hwpt(idev, pe, &hwpt_id, errp)) {
+        return -EINVAL;
+    }
+
+    ret = host_iommu_device_iommufd_attach_hwpt(idev, hwpt_id, errp);
+    trace_vtd_device_attach_hwpt(idev->devid, vtd_as->pasid, hwpt_id, !ret);
+    if (ret) {
+        /* Destroy old fs_hwpt if it's a replacement */
+        vtd_destroy_old_fs_hwpt(idev, vtd_as);
+        if (vtd_pe_pgtt_is_fst(pe)) {
+            vtd_as->fs_hwpt = hwpt_id;
+        }
+    } else if (vtd_pe_pgtt_is_fst(pe)) {
+        iommufd_backend_free_id(idev->iommufd, hwpt_id);
+    }
+
+    return !ret;
+}
+
+static int vtd_device_detach_iommufd(VTDHostIOMMUDevice *vtd_hiod,
+                                     VTDAddressSpace *vtd_as, Error **errp)
+{
+    HostIOMMUDeviceIOMMUFD *idev = HOST_IOMMU_DEVICE_IOMMUFD(vtd_hiod->hiod);
+    IntelIOMMUState *s = vtd_as->iommu_state;
+    uint32_t pasid = vtd_as->pasid;
+    bool ret;
+
+    if (s->dmar_enabled && s->root_scalable) {
+        ret = host_iommu_device_iommufd_detach_hwpt(idev, errp);
+        trace_vtd_device_detach_hwpt(idev->devid, pasid, !ret);
+    } else {
+        /*
+         * If DMAR remapping is disabled or guest switches to legacy mode,
+         * we fallback to the default HWPT which contains shadow page table.
+         * So guest DMA could still work.
+         */
+        ret = host_iommu_device_iommufd_attach_hwpt(idev, idev->hwpt_id, errp);
+        trace_vtd_device_reattach_def_hwpt(idev->devid, pasid, idev->hwpt_id,
+                                           !ret);
+    }
+
+    if (ret) {
+        vtd_destroy_old_fs_hwpt(idev, vtd_as);
+    }
+
+    return !ret;
+}
+
+static int vtd_bind_guest_pasid(VTDAddressSpace *vtd_as, Error **errp)
+{
+    VTDPASIDCacheEntry *pc_entry = &vtd_as->pasid_cache_entry;
+    VTDHostIOMMUDevice *vtd_hiod = vtd_find_hiod_iommufd(vtd_as);
+    int ret;
+
+    assert(vtd_hiod);
+
+    if (pc_entry->valid) {
+        ret = vtd_device_attach_iommufd(vtd_hiod, vtd_as, errp);
+    } else {
+        ret = vtd_device_detach_iommufd(vtd_hiod, vtd_as, errp);
+    }
+
+    return ret;
+}
+#else
+static int vtd_bind_guest_pasid(VTDAddressSpace *vtd_as, Error **errp)
+{
+    return 0;
+}
+#endif
+
 /* Do a context-cache device-selective invalidation.
  * @func_mask: FM field after shifting
  */
@@ -3131,6 +3254,11 @@ static inline int vtd_dev_get_pe_from_pasid(VTDAddressSpace *vtd_as,
     return vtd_ce_get_pasid_entry(s, &ce, pe, vtd_as->pasid);
 }
 
+static int vtd_pasid_entry_compare(VTDPASIDEntry *p1, VTDPASIDEntry *p2)
+{
+    return memcmp(p1, p2, sizeof(*p1));
+}
+
 /*
  * For each IOMMUFD backed device, update or invalidate pasid cache based on
  * the value in memory.
@@ -3143,6 +3271,8 @@ static void vtd_pasid_cache_sync_locked(gpointer key, gpointer value,
     VTDPASIDCacheEntry *pc_entry = &vtd_as->pasid_cache_entry;
     VTDPASIDEntry pe;
     uint16_t did;
+    const char *err_prefix;
+    Error *local_err = NULL;
 
     /* Ignore emulated device or legacy VFIO backed device */
     if (!vtd_find_hiod_iommufd(vtd_as)) {
@@ -3153,13 +3283,18 @@ static void vtd_pasid_cache_sync_locked(gpointer key, gpointer value,
     assert(vtd_as->pasid == PCI_NO_PASID);
 
     if (pc_info->reset || vtd_dev_get_pe_from_pasid(vtd_as, &pe)) {
+        if (!pc_entry->valid) {
+            return;
+        }
+
         /*
          * No valid pasid entry in guest memory. e.g. pasid entry was modified
          * to be either all-zero or non-present. Either case means existing
          * pasid cache should be invalidated.
          */
         pc_entry->valid = false;
-        return;
+        err_prefix = "Detaching from HWPT failed: ";
+        goto do_bind_unbind;
     }
 
     /*
@@ -3184,7 +3319,20 @@ static void vtd_pasid_cache_sync_locked(gpointer key, gpointer value,
     }
 
     pc_entry->pasid_entry = pe;
-    pc_entry->valid = true;
+    if (!pc_entry->valid) {
+        pc_entry->valid = true;
+        err_prefix = "Attaching to HWPT failed: ";
+    } else if (vtd_pasid_entry_compare(&pe, &pc_entry->pasid_entry)) {
+        err_prefix = "Replacing HWPT attachment failed: ";
+    } else {
+        return;
+    }
+
+do_bind_unbind:
+    /* TODO: Fault event injection into guest, report error to QEMU for now */
+    if (vtd_bind_guest_pasid(vtd_as, &local_err)) {
+        error_reportf_err(local_err, "%s", err_prefix);
+    }
 }
 
 static void vtd_pasid_cache_sync(IntelIOMMUState *s, VTDPASIDCacheInfo *pc_info)
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index b704f4f90c..5a3ee1cf64 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -73,6 +73,9 @@ vtd_warn_invalid_qi_tail(uint16_t tail) "tail 0x%"PRIx16
 vtd_warn_ir_vector(uint16_t sid, int index, int vec, int target) "sid 0x%"PRIx16" index %d vec %d (should be: %d)"
 vtd_warn_ir_trigger(uint16_t sid, int index, int trig, int target) "sid 0x%"PRIx16" index %d trigger %d (should be: %d)"
 vtd_reset_exit(void) ""
+vtd_device_attach_hwpt(uint32_t dev_id, uint32_t pasid, uint32_t hwpt_id, int ret) "dev_id %d pasid %d hwpt_id %d, ret: %d"
+vtd_device_detach_hwpt(uint32_t dev_id, uint32_t pasid, int ret) "dev_id %d pasid %d ret: %d"
+vtd_device_reattach_def_hwpt(uint32_t dev_id, uint32_t pasid, uint32_t hwpt_id, int ret) "dev_id %d pasid %d hwpt_id %d, ret: %d"
 
 # amd_iommu.c
 amdvi_evntlog_fail(uint64_t addr, uint32_t head) "error: fail to write at addr 0x%"PRIx64" +  offset 0x%"PRIx32
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v6 16/22] intel_iommu: Propagate PASID-based iotlb invalidation to host
  2025-09-18  8:57 [PATCH v6 00/22] intel_iommu: Enable first stage translation for passthrough device Zhenzhong Duan
                   ` (14 preceding siblings ...)
  2025-09-18  8:57 ` [PATCH v6 15/22] intel_iommu: Bind/unbind guest page table to host Zhenzhong Duan
@ 2025-09-18  8:57 ` Zhenzhong Duan
  2025-09-18  8:57 ` [PATCH v6 17/22] intel_iommu: Replay all pasid bindings when either SRTP or TE bit is changed Zhenzhong Duan
                   ` (5 subsequent siblings)
  21 siblings, 0 replies; 57+ messages in thread
From: Zhenzhong Duan @ 2025-09-18  8:57 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, skolothumtho, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng, Yi Sun,
	Zhenzhong Duan

From: Yi Liu <yi.l.liu@intel.com>

This traps the guest PASID-based iotlb invalidation request and propagates
it to the host.

Intel VT-d 3.0 supports nested translation at PASID granularity. Guest SVA
support could be implemented by configuring nested translation on a
specific pasid. This is also known as dual stage DMA translation.

Under such configuration, the guest owns the GVA->GPA translation, which is
configured as the first stage page table on the host side for a specific
pasid, and the host owns the GPA->HPA translation. As the guest owns the
first stage translation table, piotlb invalidation should be propagated to
the host, since the host IOMMU will cache first stage page table related
mappings during DMA address translation.

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/i386/intel_iommu_internal.h |  6 +++
 hw/i386/intel_iommu.c          | 85 +++++++++++++++++++++++++++++++++-
 2 files changed, 89 insertions(+), 2 deletions(-)

diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 3d5ee5ed52..d7c1ff4382 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -569,6 +569,12 @@ typedef struct VTDPASIDCacheInfo {
     bool reset;
 } VTDPASIDCacheInfo;
 
+typedef struct VTDPIOTLBInvInfo {
+    uint16_t domain_id;
+    uint32_t pasid;
+    struct iommu_hwpt_vtd_s1_invalidate *inv_data;
+} VTDPIOTLBInvInfo;
+
 /* PASID Table Related Definitions */
 #define VTD_PASID_DIR_BASE_ADDR_MASK  (~0xfffULL)
 #define VTD_PASID_TABLE_BASE_ADDR_MASK (~0xfffULL)
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index bfe229d0dc..92548f9573 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -2514,11 +2514,88 @@ static int vtd_bind_guest_pasid(VTDAddressSpace *vtd_as, Error **errp)
 
     return ret;
 }
+
+/*
+ * This function is a loop function for the s->vtd_address_spaces
+ * list with VTDPIOTLBInvInfo as execution filter. It propagates
+ * the piotlb invalidation to host.
+ */
+static void vtd_flush_host_piotlb_locked(gpointer key, gpointer value,
+                                         gpointer user_data)
+{
+    VTDPIOTLBInvInfo *piotlb_info = user_data;
+    VTDAddressSpace *vtd_as = value;
+    VTDHostIOMMUDevice *vtd_hiod = vtd_find_hiod_iommufd(vtd_as);
+    VTDPASIDCacheEntry *pc_entry = &vtd_as->pasid_cache_entry;
+    uint16_t did;
+
+    if (!vtd_hiod) {
+        return;
+    }
+
+    assert(vtd_as->pasid == PCI_NO_PASID);
+
+    /* Nothing to do if there is no first stage HWPT attached */
+    if (!pc_entry->valid ||
+        !vtd_pe_pgtt_is_fst(&pc_entry->pasid_entry)) {
+        return;
+    }
+
+    did = VTD_SM_PASID_ENTRY_DID(&pc_entry->pasid_entry);
+
+    if (piotlb_info->domain_id == did && piotlb_info->pasid == RID_PASID) {
+        HostIOMMUDeviceIOMMUFD *idev =
+            HOST_IOMMU_DEVICE_IOMMUFD(vtd_hiod->hiod);
+        uint32_t entry_num = 1; /* Only implement one request for simplicity */
+        Error *local_err = NULL;
+        struct iommu_hwpt_vtd_s1_invalidate *cache = piotlb_info->inv_data;
+
+        if (!iommufd_backend_invalidate_cache(idev->iommufd, vtd_as->fs_hwpt,
+                                              IOMMU_HWPT_INVALIDATE_DATA_VTD_S1,
+                                              sizeof(*cache), &entry_num, cache,
+                                              &local_err)) {
+            /* Something wrong in kernel, but trying to continue */
+            error_report_err(local_err);
+        }
+    }
+}
+
+static void
+vtd_flush_host_piotlb_all_locked(IntelIOMMUState *s,
+                                 uint16_t domain_id, uint32_t pasid,
+                                 hwaddr addr, uint64_t npages, bool ih)
+{
+    struct iommu_hwpt_vtd_s1_invalidate cache_info = { 0 };
+    VTDPIOTLBInvInfo piotlb_info;
+
+    cache_info.addr = addr;
+    cache_info.npages = npages;
+    cache_info.flags = ih ? IOMMU_VTD_INV_FLAGS_LEAF : 0;
+
+    piotlb_info.domain_id = domain_id;
+    piotlb_info.pasid = pasid;
+    piotlb_info.inv_data = &cache_info;
+
+    /*
+     * Go through each vtd_as instance in s->vtd_address_spaces, find out
+     * the affected host device which need host piotlb invalidation. Piotlb
+     * invalidation should check pasid cache per architecture point of view.
+     */
+    g_hash_table_foreach(s->vtd_address_spaces,
+                         vtd_flush_host_piotlb_locked, &piotlb_info);
+}
 #else
 static int vtd_bind_guest_pasid(VTDAddressSpace *vtd_as, Error **errp)
 {
     return 0;
 }
+
+static void
+vtd_flush_host_piotlb_all_locked(IntelIOMMUState *s,
+                                 uint16_t domain_id, uint32_t pasid,
+                                 hwaddr addr, uint64_t npages, bool ih)
+{
+}
 #endif
 
 /* Do a context-cache device-selective invalidation.
@@ -3159,6 +3236,7 @@ static void vtd_piotlb_pasid_invalidate(IntelIOMMUState *s,
     vtd_iommu_lock(s);
     g_hash_table_foreach_remove(s->iotlb, vtd_hash_remove_by_pasid,
                                 &info);
+    vtd_flush_host_piotlb_all_locked(s, domain_id, pasid, 0, (uint64_t)-1, 0);
     vtd_iommu_unlock(s);
 
     QLIST_FOREACH(vtd_as, &s->vtd_as_with_notifiers, next) {
@@ -3178,7 +3256,8 @@ static void vtd_piotlb_pasid_invalidate(IntelIOMMUState *s,
 }
 
 static void vtd_piotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
-                                       uint32_t pasid, hwaddr addr, uint8_t am)
+                                       uint32_t pasid, hwaddr addr, uint8_t am,
+                                       bool ih)
 {
     VTDIOTLBPageInvInfo info;
 
@@ -3190,6 +3269,7 @@ static void vtd_piotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
     vtd_iommu_lock(s);
     g_hash_table_foreach_remove(s->iotlb,
                                 vtd_hash_remove_by_page_piotlb, &info);
+    vtd_flush_host_piotlb_all_locked(s, domain_id, pasid, addr, 1 << am, ih);
     vtd_iommu_unlock(s);
 
     vtd_iotlb_page_invalidate_notify(s, domain_id, addr, am, pasid);
@@ -3221,7 +3301,8 @@ static bool vtd_process_piotlb_desc(IntelIOMMUState *s,
     case VTD_INV_DESC_PIOTLB_PSI_IN_PASID:
         am = VTD_INV_DESC_PIOTLB_AM(inv_desc->val[1]);
         addr = (hwaddr) VTD_INV_DESC_PIOTLB_ADDR(inv_desc->val[1]);
-        vtd_piotlb_page_invalidate(s, domain_id, pasid, addr, am);
+        vtd_piotlb_page_invalidate(s, domain_id, pasid, addr, am,
+                                   VTD_INV_DESC_PIOTLB_IH(inv_desc->val[1]));
         break;
 
     default:
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v6 17/22] intel_iommu: Replay all pasid bindings when either SRTP or TE bit is changed
  2025-09-18  8:57 [PATCH v6 00/22] intel_iommu: Enable first stage translation for passthrough device Zhenzhong Duan
                   ` (15 preceding siblings ...)
  2025-09-18  8:57 ` [PATCH v6 16/22] intel_iommu: Propagate PASID-based iotlb invalidation " Zhenzhong Duan
@ 2025-09-18  8:57 ` Zhenzhong Duan
  2025-09-18  8:57 ` [PATCH v6 18/22] iommufd: Introduce a helper function to extract vendor capabilities Zhenzhong Duan
                   ` (4 subsequent siblings)
  21 siblings, 0 replies; 57+ messages in thread
From: Zhenzhong Duan @ 2025-09-18  8:57 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, skolothumtho, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan

From: Yi Liu <yi.l.liu@intel.com>

When either the 'Set Root Table Pointer' or the 'Translation Enable' bit is
changed, all pasid bindings on the host side become stale and need to be updated.

Introduce a helper function vtd_replay_pasid_bindings_all() to go through all
pasid entries in all passthrough devices to update host side bindings.
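The global replay pattern above can be sketched as a minimal, self-contained
C model (all names here are hypothetical stand-ins, not QEMU's actual types):
a sync routine filters bindings by invalidation granularity, and the SRTP/TE
handlers simply call it with the GLOBAL granularity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the VT-d pasid-cache invalidation granularities. */
enum { PASIDC_G_GLOBAL = 0, PASIDC_G_DOMAIN = 1, PASIDC_G_PASID = 2 };

typedef struct {
    int type;
    unsigned domain_id;  /* valid for DOMAIN/PASID granularity */
    unsigned pasid;      /* valid for PASID granularity */
} PASIDCacheInfo;

typedef struct {
    unsigned domain_id;
    unsigned pasid;
    bool synced;         /* models "host binding up to date" */
} Binding;

/* Re-sync every binding matched by @info, mirroring a pasid cache sync. */
static void pasid_cache_sync(Binding *b, size_t n, const PASIDCacheInfo *info)
{
    for (size_t i = 0; i < n; i++) {
        switch (info->type) {
        case PASIDC_G_GLOBAL:
            b[i].synced = true;
            break;
        case PASIDC_G_DOMAIN:
            if (b[i].domain_id == info->domain_id) {
                b[i].synced = true;
            }
            break;
        case PASIDC_G_PASID:
            if (b[i].domain_id == info->domain_id &&
                b[i].pasid == info->pasid) {
                b[i].synced = true;
            }
            break;
        }
    }
}

/* The SRTP/TE handlers only need the GLOBAL case: refresh everything. */
static void replay_pasid_bindings_all(Binding *b, size_t n)
{
    PASIDCacheInfo info = { .type = PASIDC_G_GLOBAL };
    pasid_cache_sync(b, n, &info);
}
```

This is why the helper in the patch is a one-liner: all the filtering logic
already lives in the sync path, so a global replay just reuses it.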

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
---
 hw/i386/intel_iommu.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 92548f9573..74496c7d3b 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -90,6 +90,7 @@ static void vtd_address_space_refresh_all(IntelIOMMUState *s);
 static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
 static void vtd_pasid_cache_sync_locked(gpointer key, gpointer value,
                                         gpointer user_data);
+static void vtd_replay_pasid_bindings_all(IntelIOMMUState *s);
 
 static void vtd_pasid_cache_reset_locked(IntelIOMMUState *s)
 {
@@ -2904,6 +2905,7 @@ static void vtd_handle_gcmd_srtp(IntelIOMMUState *s)
     vtd_set_clear_mask_long(s, DMAR_GSTS_REG, 0, VTD_GSTS_RTPS);
     vtd_reset_caches(s);
     vtd_address_space_refresh_all(s);
+    vtd_replay_pasid_bindings_all(s);
 }
 
 /* Set Interrupt Remap Table Pointer */
@@ -2938,6 +2940,7 @@ static void vtd_handle_gcmd_te(IntelIOMMUState *s, bool en)
 
     vtd_reset_caches(s);
     vtd_address_space_refresh_all(s);
+    vtd_replay_pasid_bindings_all(s);
 }
 
 /* Handle Interrupt Remap Enable/Disable */
@@ -3428,6 +3431,13 @@ static void vtd_pasid_cache_sync(IntelIOMMUState *s, VTDPASIDCacheInfo *pc_info)
     vtd_iommu_unlock(s);
 }
 
+static void vtd_replay_pasid_bindings_all(IntelIOMMUState *s)
+{
+    VTDPASIDCacheInfo pc_info = { .type = VTD_INV_DESC_PASIDC_G_GLOBAL };
+
+    vtd_pasid_cache_sync(s, &pc_info);
+}
+
 static bool vtd_process_pasid_desc(IntelIOMMUState *s,
                                    VTDInvDesc *inv_desc)
 {
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v6 18/22] iommufd: Introduce a helper function to extract vendor capabilities
  2025-09-18  8:57 [PATCH v6 00/22] intel_iommu: Enable first stage translation for passthrough device Zhenzhong Duan
                   ` (16 preceding siblings ...)
  2025-09-18  8:57 ` [PATCH v6 17/22] intel_iommu: Replay all pasid bindings when either SRTP or TE bit is changed Zhenzhong Duan
@ 2025-09-18  8:57 ` Zhenzhong Duan
  2025-09-23 19:45   ` Nicolin Chen
  2025-09-18  8:57 ` [PATCH v6 19/22] vfio: Add a new element bypass_ro in VFIOContainerBase Zhenzhong Duan
                   ` (3 subsequent siblings)
  21 siblings, 1 reply; 57+ messages in thread
From: Zhenzhong Duan @ 2025-09-18  8:57 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, skolothumtho, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan

In VFIO core, we call iommufd_backend_get_device_info() to retrieve
vendor-specific hardware information, but it is not appropriate to parse
this raw data in VFIO core.

Introduce host_iommu_extract_vendor_caps() to extract the raw data and
return a bitmap. It lives in iommufd.c because that is where
iommufd_backend_get_device_info() is defined.

The other choice would be to put the vendor-data extraction code in each
vendor's vIOMMU emulation file, but that would mix vIOMMU emulation with
host IOMMU extraction code and would also need a new callback in
PCIIOMMUOps. So we choose the simpler way above.
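The conversion from vendor-specific raw data to a generic bitmap can be
sketched standalone as below (the constants and struct names are hypothetical
stand-ins for the kernel iommufd uAPI, not the real definitions):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's iommufd uAPI constants. */
#define HW_INFO_TYPE_INTEL_VTD    1u
#define VTD_ERRATA_772415_SPR17   (1ULL << 0)   /* vendor-specific flag bit */

/* Generic capability bitmap consumed by VFIO core. */
#define NESTING_PARENT_BYPASS_RO  (1ULL << 0)

struct vtd_hw_info { uint64_t flags; };
union vendor_caps  { struct vtd_hw_info vtd; };

/*
 * Convert @type-specific raw hardware info into the generic bitmap,
 * so VFIO core never has to interpret vendor data directly.
 */
static uint64_t extract_vendor_caps(uint32_t type, const union vendor_caps *caps)
{
    uint64_t out = 0;

    if (type == HW_INFO_TYPE_INTEL_VTD &&
        (caps->vtd.flags & VTD_ERRATA_772415_SPR17)) {
        out |= NESTING_PARENT_BYPASS_RO;
    }
    return out;
}
```

The design choice is that only the backend knows the layout of each vendor's
hw_info struct; callers only ever test generic bits.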

Suggested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 include/hw/iommu.h                 |  5 +++++
 include/system/host_iommu_device.h | 16 ++++++++++++++++
 backends/iommufd.c                 | 13 +++++++++++++
 3 files changed, 34 insertions(+)

diff --git a/include/hw/iommu.h b/include/hw/iommu.h
index 65d652950a..9b343e64b0 100644
--- a/include/hw/iommu.h
+++ b/include/hw/iommu.h
@@ -16,4 +16,9 @@ enum {
      VIOMMU_FLAG_WANT_NESTING_PARENT = BIT_ULL(0),
 };
 
+enum {
+    /* Nesting parent HWPT shouldn't have readonly mapping, due to errata */
+     IOMMU_HW_NESTING_PARENT_BYPASS_RO = BIT_ULL(0),
+};
+
 #endif /* HW_IOMMU_H */
diff --git a/include/system/host_iommu_device.h b/include/system/host_iommu_device.h
index ab849a4a82..41c9159605 100644
--- a/include/system/host_iommu_device.h
+++ b/include/system/host_iommu_device.h
@@ -39,6 +39,22 @@ typedef struct HostIOMMUDeviceCaps {
     uint64_t hw_caps;
     VendorCaps vendor_caps;
 } HostIOMMUDeviceCaps;
+
+/**
+ * host_iommu_extract_vendor_caps: Extract vendor capabilities
+ *
+ * This function converts @type specific hardware information data
+ * into a standard bitmap format.
+ *
+ * @type: IOMMU Hardware Info Types
+ *
+ * @VendorCaps: IOMMU @type specific hardware information data
+ *
+ * Returns: 64bit bitmap with each bit represents a capability of host
+ * IOMMU that we want to expose. See IOMMU_HW_* in include/hw/iommu.h
+ * for all possible capabilities currently exposed.
+ */
+uint64_t host_iommu_extract_vendor_caps(uint32_t type, VendorCaps *caps);
 #endif
 
 #define TYPE_HOST_IOMMU_DEVICE "host-iommu-device"
diff --git a/backends/iommufd.c b/backends/iommufd.c
index 2a33c7ab0b..0bb1ed40d3 100644
--- a/backends/iommufd.c
+++ b/backends/iommufd.c
@@ -19,6 +19,7 @@
 #include "migration/cpr.h"
 #include "monitor/monitor.h"
 #include "trace.h"
+#include "hw/iommu.h"
 #include "hw/vfio/vfio-device.h"
 #include <sys/ioctl.h>
 #include <linux/iommufd.h>
@@ -410,6 +411,18 @@ bool iommufd_backend_get_device_info(IOMMUFDBackend *be, uint32_t devid,
     return true;
 }
 
+uint64_t host_iommu_extract_vendor_caps(uint32_t type, VendorCaps *caps)
+{
+    uint64_t vendor_caps = 0;
+
+    if (type == IOMMU_HW_INFO_TYPE_INTEL_VTD &&
+        caps->vtd.flags & IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17) {
+        vendor_caps |= IOMMU_HW_NESTING_PARENT_BYPASS_RO;
+    }
+
+    return vendor_caps;
+}
+
 bool iommufd_backend_invalidate_cache(IOMMUFDBackend *be, uint32_t id,
                                       uint32_t data_type, uint32_t entry_len,
                                       uint32_t *entry_num, void *data,
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v6 19/22] vfio: Add a new element bypass_ro in VFIOContainerBase
  2025-09-18  8:57 [PATCH v6 00/22] intel_iommu: Enable first stage translation for passthrough device Zhenzhong Duan
                   ` (17 preceding siblings ...)
  2025-09-18  8:57 ` [PATCH v6 18/22] iommufd: Introduce a helper function to extract vendor capabilities Zhenzhong Duan
@ 2025-09-18  8:57 ` Zhenzhong Duan
  2025-09-26 12:25   ` Cédric Le Goater
  2025-09-18  8:57 ` [PATCH v6 20/22] Workaround for ERRATA_772415_SPR17 Zhenzhong Duan
                   ` (2 subsequent siblings)
  21 siblings, 1 reply; 57+ messages in thread
From: Zhenzhong Duan @ 2025-09-18  8:57 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, skolothumtho, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan

When bypass_ro is true, read-only memory sections are not mapped in the
container.

This is a preparatory patch to work around Intel ERRATA_772415; see the
changelog in the next patch for details about the errata.
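The filtering added to the listener can be modeled in a few lines (a minimal
sketch with hypothetical types; the real vfio_listener_skipped_section() also
checks IOMMU, protected, and vga regions):

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal model of a memory section as seen by the VFIO listener. */
typedef struct {
    bool is_ram;
    bool readonly;
} Section;

/*
 * Mirror of the extension in this patch: when the container runs with
 * bypass_ro, read-only sections are skipped before any other check.
 */
static bool listener_skipped_section(const Section *s, bool bypass_ro)
{
    if (bypass_ro && s->readonly) {
        return true;
    }
    return !s->is_ram;  /* the original non-RAM checks, condensed */
}
```

Note that dirty-tracking and log-sync callers pass bypass_ro=false, matching
the patch: read-only sections were never mapped, so there is nothing to track.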

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 include/hw/vfio/vfio-container-base.h |  1 +
 hw/vfio/listener.c                    | 21 ++++++++++++++-------
 2 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/include/hw/vfio/vfio-container-base.h b/include/hw/vfio/vfio-container-base.h
index acbd48a18a..2b9fec217a 100644
--- a/include/hw/vfio/vfio-container-base.h
+++ b/include/hw/vfio/vfio-container-base.h
@@ -52,6 +52,7 @@ struct VFIOContainerBase {
     QLIST_HEAD(, VFIODevice) device_list;
     GList *iova_ranges;
     NotifierWithReturn cpr_reboot_notifier;
+    bool bypass_ro;
 };
 
 #define TYPE_VFIO_IOMMU "vfio-iommu"
diff --git a/hw/vfio/listener.c b/hw/vfio/listener.c
index e093833165..581ebfda36 100644
--- a/hw/vfio/listener.c
+++ b/hw/vfio/listener.c
@@ -76,8 +76,13 @@ static bool vfio_log_sync_needed(const VFIOContainerBase *bcontainer)
     return true;
 }
 
-static bool vfio_listener_skipped_section(MemoryRegionSection *section)
+static bool vfio_listener_skipped_section(MemoryRegionSection *section,
+                                          bool bypass_ro)
 {
+    if (bypass_ro && section->readonly) {
+        return true;
+    }
+
     return (!memory_region_is_ram(section->mr) &&
             !memory_region_is_iommu(section->mr)) ||
            memory_region_is_protected(section->mr) ||
@@ -368,9 +373,9 @@ static bool vfio_known_safe_misalignment(MemoryRegionSection *section)
 }
 
 static bool vfio_listener_valid_section(MemoryRegionSection *section,
-                                        const char *name)
+                                        bool bypass_ro, const char *name)
 {
-    if (vfio_listener_skipped_section(section)) {
+    if (vfio_listener_skipped_section(section, bypass_ro)) {
         trace_vfio_listener_region_skip(name,
                 section->offset_within_address_space,
                 section->offset_within_address_space +
@@ -497,7 +502,8 @@ void vfio_container_region_add(VFIOContainerBase *bcontainer,
     int ret;
     Error *err = NULL;
 
-    if (!vfio_listener_valid_section(section, "region_add")) {
+    if (!vfio_listener_valid_section(section, bcontainer->bypass_ro,
+                                     "region_add")) {
         return;
     }
 
@@ -663,7 +669,8 @@ static void vfio_listener_region_del(MemoryListener *listener,
     int ret;
     bool try_unmap = true;
 
-    if (!vfio_listener_valid_section(section, "region_del")) {
+    if (!vfio_listener_valid_section(section, bcontainer->bypass_ro,
+                                     "region_del")) {
         return;
     }
 
@@ -820,7 +827,7 @@ static void vfio_dirty_tracking_update(MemoryListener *listener,
         container_of(listener, VFIODirtyRangesListener, listener);
     hwaddr iova, end;
 
-    if (!vfio_listener_valid_section(section, "tracking_update") ||
+    if (!vfio_listener_valid_section(section, false, "tracking_update") ||
         !vfio_get_section_iova_range(dirty->bcontainer, section,
                                      &iova, &end, NULL)) {
         return;
@@ -1214,7 +1221,7 @@ static void vfio_listener_log_sync(MemoryListener *listener,
     int ret;
     Error *local_err = NULL;
 
-    if (vfio_listener_skipped_section(section)) {
+    if (vfio_listener_skipped_section(section, false)) {
         return;
     }
 
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v6 20/22] Workaround for ERRATA_772415_SPR17
  2025-09-18  8:57 [PATCH v6 00/22] intel_iommu: Enable first stage translation for passthrough device Zhenzhong Duan
                   ` (18 preceding siblings ...)
  2025-09-18  8:57 ` [PATCH v6 19/22] vfio: Add a new element bypass_ro in VFIOContainerBase Zhenzhong Duan
@ 2025-09-18  8:57 ` Zhenzhong Duan
  2025-09-18  8:58 ` [PATCH v6 21/22] intel_iommu: Enable host device when x-flts=on in scalable mode Zhenzhong Duan
  2025-09-18  8:58 ` [PATCH v6 22/22] docs/devel: Add IOMMUFD nesting documentation Zhenzhong Duan
  21 siblings, 0 replies; 57+ messages in thread
From: Zhenzhong Duan @ 2025-09-18  8:57 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, skolothumtho, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan

On a system affected by ERRATA_772415, IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17
is reported by IOMMU_DEVICE_GET_HW_INFO. Due to this errata, even a read-only
range mapped in the second stage page table could still be written.

Reference from 4th Gen Intel Xeon Processor Scalable Family Specification
Update, Errata Details, SPR17.
https://edc.intel.com/content/www/us/en/design/products-and-solutions/processors-and-chipsets/eagle-stream/sapphire-rapids-specification-update/

The SPR17 details, copied from the above link:
"Problem: When remapping hardware is configured by system software in
scalable mode as Nested (PGTT=011b) and with PWSNP field Set in the
PASID-table-entry, it may Set Accessed bit and Dirty bit (and Extended
Access bit if enabled) in first-stage page-table entries even when
second-stage mappings indicate that corresponding first-stage page-table
is Read-Only.

Implication: Due to this erratum, pages mapped as Read-only in second-stage
page-tables may be modified by remapping hardware Access/Dirty bit updates.

Workaround: None identified. System software enabling nested translations
for a VM should ensure that there are no read-only pages in the
corresponding second-stage mappings."
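The decision flow of this workaround can be condensed into one predicate
(a sketch with hypothetical names, not the actual QEMU symbols): bypass_ro
is enabled only when a nesting parent HWPT is requested AND the host reports
the errata quirk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the generic capability bit. */
#define NESTING_PARENT_BYPASS_RO (1ULL << 0)

typedef struct {
    bool want_nesting_parent; /* vIOMMU asked for a nesting parent HWPT */
    uint64_t host_caps;       /* bitmap extracted from vendor hw info */
} AttachCtx;

/*
 * Bypass read-only mappings only for nesting parent containers on
 * hosts affected by ERRATA_772415; plain second-stage-only setups
 * keep their read-only mappings.
 */
static bool need_bypass_ro(const AttachCtx *ctx)
{
    return ctx->want_nesting_parent &&
           (ctx->host_caps & NESTING_PARENT_BYPASS_RO);
}
```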

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/vfio/iommufd.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index f1684a39b7..5d25ce6f97 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -15,6 +15,7 @@
 #include <linux/vfio.h>
 #include <linux/iommufd.h>
 
+#include "hw/iommu.h"
 #include "hw/vfio/vfio-device.h"
 #include "qemu/error-report.h"
 #include "trace.h"
@@ -326,6 +327,7 @@ static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
     IOMMUFDBackend *iommufd = vbasedev->iommufd;
     uint32_t type, flags = 0;
     uint64_t hw_caps;
+    VendorCaps caps;
     VFIOIOASHwpt *hwpt;
     uint32_t hwpt_id;
     int ret;
@@ -371,7 +373,8 @@ static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
      * instead.
      */
     if (!iommufd_backend_get_device_info(vbasedev->iommufd, vbasedev->devid,
-                                         &type, NULL, 0, &hw_caps, errp)) {
+                                         &type, &caps, sizeof(caps), &hw_caps,
+                                         errp)) {
         return false;
     }
 
@@ -386,6 +389,11 @@ static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
      */
     if (vfio_device_get_viommu_flags_want_nesting(vbasedev)) {
         flags |= IOMMU_HWPT_ALLOC_NEST_PARENT;
+
+        if (host_iommu_extract_vendor_caps(type, &caps) &
+            IOMMU_HW_NESTING_PARENT_BYPASS_RO) {
+            container->bcontainer.bypass_ro = true;
+        }
     }
 
     if (cpr_is_incoming()) {
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v6 21/22] intel_iommu: Enable host device when x-flts=on in scalable mode
  2025-09-18  8:57 [PATCH v6 00/22] intel_iommu: Enable first stage translation for passthrough device Zhenzhong Duan
                   ` (19 preceding siblings ...)
  2025-09-18  8:57 ` [PATCH v6 20/22] Workaround for ERRATA_772415_SPR17 Zhenzhong Duan
@ 2025-09-18  8:58 ` Zhenzhong Duan
  2025-09-18  8:58 ` [PATCH v6 22/22] docs/devel: Add IOMMUFD nesting documentation Zhenzhong Duan
  21 siblings, 0 replies; 57+ messages in thread
From: Zhenzhong Duan @ 2025-09-18  8:58 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, skolothumtho, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan

Now that all the infrastructure for running a passthrough device with
first stage translation is in place, enable it.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
---
 hw/i386/intel_iommu.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 74496c7d3b..4bed115017 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -4828,6 +4828,8 @@ static bool vtd_check_hiod(IntelIOMMUState *s, VTDHostIOMMUDevice *vtd_hiod,
                    "when x-flts=on");
         return false;
     }
+
+    return true;
 #endif
 
     error_setg(errp,
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v6 22/22] docs/devel: Add IOMMUFD nesting documentation
  2025-09-18  8:57 [PATCH v6 00/22] intel_iommu: Enable first stage translation for passthrough device Zhenzhong Duan
                   ` (20 preceding siblings ...)
  2025-09-18  8:58 ` [PATCH v6 21/22] intel_iommu: Enable host device when x-flts=on in scalable mode Zhenzhong Duan
@ 2025-09-18  8:58 ` Zhenzhong Duan
  2025-09-18 10:00   ` Cédric Le Goater
  21 siblings, 1 reply; 57+ messages in thread
From: Zhenzhong Duan @ 2025-09-18  8:58 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, skolothumtho, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
	Zhenzhong Duan

Add documentation about using an IOMMUFD backed VFIO device with intel_iommu
configured with x-flts=on.

Suggested-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 docs/devel/vfio-iommufd.rst | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/docs/devel/vfio-iommufd.rst b/docs/devel/vfio-iommufd.rst
index 3d1c11f175..d9cb9e7f5e 100644
--- a/docs/devel/vfio-iommufd.rst
+++ b/docs/devel/vfio-iommufd.rst
@@ -164,3 +164,27 @@ RAM discarding for mdev.
 
 ``vfio-ap`` and ``vfio-ccw`` devices don't have same issue as their backend
 devices are always mdev and RAM discarding is force enabled.
+
+Usage with intel_iommu with x-flts=on
+-------------------------------------
+
+Only IOMMUFD backed VFIO device is supported when intel_iommu is configured
+with x-flts=on, for legacy container backed VFIO device, below error shows:
+
+.. code-block:: none
+
+    qemu-system-x86_64: -device vfio-pci,host=0000:02:00.0: vfio 0000:02:00.0: Failed to set vIOMMU: Need IOMMUFD backend when x-flts=on
+
+VFIO device under PCI bridge is unsupported, use PCIE bridge if necessary,
+or else below error shows:
+
+.. code-block:: none
+
+    qemu-system-x86_64: -device vfio-pci,host=0000:02:00.0,bus=bridge1,iommufd=iommufd0: vfio 0000:02:00.0: Failed to set vIOMMU: Host device under PCI bridge is unsupported when x-flts=on
+
+If host IOMMU has ERRATA_772415_SPR17, kexec or reboot from "intel_iommu=on,sm_on"
+to "intel_iommu=on,sm_off" in guest is also unsupported. Configure scalable mode
+off as below if it's not needed by guest.
+
+.. code-block:: bash
+    -device intel-iommu,x-scalable-mode=off
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [PATCH v6 22/22] docs/devel: Add IOMMUFD nesting documentation
  2025-09-18  8:58 ` [PATCH v6 22/22] docs/devel: Add IOMMUFD nesting documentation Zhenzhong Duan
@ 2025-09-18 10:00   ` Cédric Le Goater
  2025-09-19  2:17     ` Duan, Zhenzhong
  0 siblings, 1 reply; 57+ messages in thread
From: Cédric Le Goater @ 2025-09-18 10:00 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, eric.auger, mst, jasowang, peterx, ddutile, jgg,
	nicolinc, skolothumtho, joao.m.martins, clement.mathieu--drif,
	kevin.tian, yi.l.liu, chao.p.peng

Hello Zhenzhong

On 9/18/25 10:58, Zhenzhong Duan wrote:
> Add documentation about using IOMMUFD backed VFIO device with intel_iommu with
> x-flts=on.
> 
> Suggested-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>   docs/devel/vfio-iommufd.rst | 24 ++++++++++++++++++++++++
>   1 file changed, 24 insertions(+)
> 
> diff --git a/docs/devel/vfio-iommufd.rst b/docs/devel/vfio-iommufd.rst
> index 3d1c11f175..d9cb9e7f5e 100644
> --- a/docs/devel/vfio-iommufd.rst
> +++ b/docs/devel/vfio-iommufd.rst
> @@ -164,3 +164,27 @@ RAM discarding for mdev.
>   
>   ``vfio-ap`` and ``vfio-ccw`` devices don't have same issue as their backend
>   devices are always mdev and RAM discarding is force enabled.
> +
> +Usage with intel_iommu with x-flts=on
> +-------------------------------------
> +
> +Only IOMMUFD backed VFIO device is supported when intel_iommu is configured
> +with x-flts=on, for legacy container backed VFIO device, below error shows:
> +
> +.. code-block:: none
> +
> +    qemu-system-x86_64: -device vfio-pci,host=0000:02:00.0: vfio 0000:02:00.0: Failed to set vIOMMU: Need IOMMUFD backend when x-flts=on
> +
> +VFIO device under PCI bridge is unsupported, use PCIE bridge if necessary,
> +or else below error shows:
> +
> +.. code-block:: none
> +
> +    qemu-system-x86_64: -device vfio-pci,host=0000:02:00.0,bus=bridge1,iommufd=iommufd0: vfio 0000:02:00.0: Failed to set vIOMMU: Host device under PCI bridge is unsupported when x-flts=on
> +
> +If host IOMMU has ERRATA_772415_SPR17, kexec or reboot from "intel_iommu=on,sm_on"
> +to "intel_iommu=on,sm_off" in guest is also unsupported. Configure scalable mode
> +off as below if it's not needed by guest.
> +
> +.. code-block:: bash

a new line is missing after the code-block.


Thanks,

C.



> +    -device intel-iommu,x-scalable-mode=off



^ permalink raw reply	[flat|nested] 57+ messages in thread

* RE: [PATCH v6 22/22] docs/devel: Add IOMMUFD nesting documentation
  2025-09-18 10:00   ` Cédric Le Goater
@ 2025-09-19  2:17     ` Duan, Zhenzhong
  0 siblings, 0 replies; 57+ messages in thread
From: Duan, Zhenzhong @ 2025-09-19  2:17 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, eric.auger@redhat.com, mst@redhat.com,
	jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
	jgg@nvidia.com, nicolinc@nvidia.com, skolothumtho@nvidia.com,
	joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
	Tian, Kevin, Liu, Yi L, Peng, Chao P

Hi Cédric,

>-----Original Message-----
>From: Cédric Le Goater <clg@redhat.com>
>Subject: Re: [PATCH v6 22/22] docs/devel: Add IOMMUFD nesting
>documentation
>
>Hello Zhenzhong
>
>On 9/18/25 10:58, Zhenzhong Duan wrote:
>> Add documentation about using IOMMUFD backed VFIO device with
>intel_iommu with
>> x-flts=on.
>>
>> Suggested-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>>   docs/devel/vfio-iommufd.rst | 24 ++++++++++++++++++++++++
>>   1 file changed, 24 insertions(+)
>>
>> diff --git a/docs/devel/vfio-iommufd.rst b/docs/devel/vfio-iommufd.rst
>> index 3d1c11f175..d9cb9e7f5e 100644
>> --- a/docs/devel/vfio-iommufd.rst
>> +++ b/docs/devel/vfio-iommufd.rst
>> @@ -164,3 +164,27 @@ RAM discarding for mdev.
>>
>>   ``vfio-ap`` and ``vfio-ccw`` devices don't have same issue as their
>backend
>>   devices are always mdev and RAM discarding is force enabled.
>> +
>> +Usage with intel_iommu with x-flts=on
>> +-------------------------------------
>> +
>> +Only IOMMUFD backed VFIO device is supported when intel_iommu is
>configured
>> +with x-flts=on, for legacy container backed VFIO device, below error
>shows:
>> +
>> +.. code-block:: none
>> +
>> +    qemu-system-x86_64: -device vfio-pci,host=0000:02:00.0: vfio
>0000:02:00.0: Failed to set vIOMMU: Need IOMMUFD backend when
>x-flts=on
>> +
>> +VFIO device under PCI bridge is unsupported, use PCIE bridge if necessary,
>> +or else below error shows:
>> +
>> +.. code-block:: none
>> +
>> +    qemu-system-x86_64: -device
>vfio-pci,host=0000:02:00.0,bus=bridge1,iommufd=iommufd0: vfio
>0000:02:00.0: Failed to set vIOMMU: Host device under PCI bridge is
>unsupported when x-flts=on
>> +
>> +If host IOMMU has ERRATA_772415_SPR17, kexec or reboot from
>"intel_iommu=on,sm_on"
>> +to "intel_iommu=on,sm_off" in guest is also unsupported. Configure
>scalable mode
>> +off as below if it's not needed by guest.
>> +
>> +.. code-block:: bash
>
>an new line lacks after code-block.

Thanks for pointing that out, will fix; I forgot to run make docs once again☹

BRs,
Zhenzhong

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v6 05/22] hw/pci: Introduce pci_device_get_viommu_flags()
  2025-09-18  8:57 ` [PATCH v6 05/22] hw/pci: Introduce pci_device_get_viommu_flags() Zhenzhong Duan
@ 2025-09-23 18:47   ` Nicolin Chen
  2025-09-24  7:05     ` Duan, Zhenzhong
  2025-10-12 12:26   ` Yi Liu
  1 sibling, 1 reply; 57+ messages in thread
From: Nicolin Chen @ 2025-09-23 18:47 UTC (permalink / raw)
  To: Zhenzhong Duan
  Cc: qemu-devel, alex.williamson, clg, eric.auger, mst, jasowang,
	peterx, ddutile, jgg, skolothumtho, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng

On Thu, Sep 18, 2025 at 04:57:44AM -0400, Zhenzhong Duan wrote:
> Introduce a new PCIIOMMUOps optional callback, get_viommu_flags() which
> allows to retrieve flags exposed by a vIOMMU. The first planned vIOMMU
> device flag is VIOMMU_FLAG_WANT_NESTING_PARENT that advertises the
> support of HW nested stage translation scheme and wants other sub-system
> like VFIO's cooperation to create nesting parent HWPT.
> 
> pci_device_get_viommu_flags() is a wrapper that can be called on a PCI
> device potentially protected by a vIOMMU.
> 
> get_viommu_flags() is designed to return 64bit bitmap of purely vIOMMU
> flags which are only determined by user's configuration, no host
> capabilities involved. Reasons are:
> 
> 1. host may have heterogeneous IOMMUs, each with different capabilities
> 2. this is migration friendly, return value is consistent between source
>    and target.
> 3. host IOMMU capabilities are passed to vIOMMU through set_iommu_device()
>    interface which have to be after attach_device(), when get_viommu_flags()
>    is called in attach_device(), there is no way for vIOMMU to get host
>    IOMMU capabilities yet, so only pure vIOMMU flags can be returned.

"no way" sounds too strong..

There is an iommufd_backend_get_device_info() call there. So, we
could have passed the host IOMMU capabilities to a vIOMMU. Just,
we chose not to (assuming for migration reason?).

>    See below sequence:
> 
>      vfio_device_attach():
>          iommufd_cdev_attach():
>              pci_device_get_viommu_flags() for HW nesting cap
>              create a nesting parent HWPT
>              attach device to the HWPT
>              vfio_device_hiod_create_and_realize() creating hiod
>      ...
>      pci_device_set_iommu_device(hiod)
> 
> Suggested-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>

Despite some nits, patch looks good to me:

Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>

> +enum {
> +    /* Nesting parent HWPT will be reused by vIOMMU to create nested HWPT */
> +     VIOMMU_FLAG_WANT_NESTING_PARENT = BIT_ULL(0),
> +};

How about adding a name and move the note here:

/*
 * Theoretical vIOMMU flags. Only determined by the vIOMMU device properties and
 * independent on the actual host IOMMU capabilities they may depend on.
 */
enum viommu_flags {
	...
};

> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> index bde9dca8e2..c54f2b53ae 100644
> --- a/include/hw/pci/pci.h
> +++ b/include/hw/pci/pci.h
> @@ -462,6 +462,23 @@ typedef struct PCIIOMMUOps {
>       * @devfn: device and function number of the PCI device.
>       */
>      void (*unset_iommu_device)(PCIBus *bus, void *opaque, int devfn);
> +    /**
> +     * @get_viommu_flags: get vIOMMU flags
> +     *
> +     * Optional callback, if not implemented, then vIOMMU doesn't support
> +     * exposing flags to other sub-system, e.g., VFIO. Each flag can be
> +     * an expectation or request to other sub-system or just a pure vIOMMU
> +     * capability. vIOMMU can choose which flags to expose.

The 2nd statement is somewhat redundant. Perhaps we could squash
it into the notes at enum viommu_flags above, if we really need.

> +     *
> +     * @opaque: the data passed to pci_setup_iommu().
> +     *
> +     * Returns: 64bit bitmap with each bit represents a flag that vIOMMU
> +     * wants to expose. See VIOMMU_FLAG_* in include/hw/iommu.h for all
> +     * possible flags currently used. These flags are theoretical which
> +     * are only determined by vIOMMU device properties and independent on
> +     * the actual host capabilities they may depend on.
> +     */
> +    uint64_t (*get_viommu_flags)(void *opaque);

With the notes above, we could simplify this:

     * Returns: bitmap with each representing a vIOMMU flag defined in
     * enum viommu_flags

> +/**
> + * pci_device_get_viommu_flags: get vIOMMU flags.
> + *
> + * Returns a 64bit bitmap with each bit represents a vIOMMU exposed
> + * flags, 0 if vIOMMU doesn't support that.
> + *
> + * @dev: PCI device pointer.
> + */
> +uint64_t pci_device_get_viommu_flags(PCIDevice *dev);
 
and could make this aligned too:

     * Returns: bitmap with each representing a vIOMMU flag defined in
     * enum viommu_flags. Or 0 if vIOMMU doesn't report any.

Nicolin


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v6 18/22] iommufd: Introduce a helper function to extract vendor capabilities
  2025-09-18  8:57 ` [PATCH v6 18/22] iommufd: Introduce a helper function to extract vendor capabilities Zhenzhong Duan
@ 2025-09-23 19:45   ` Nicolin Chen
  2025-09-24  8:05     ` Duan, Zhenzhong
  0 siblings, 1 reply; 57+ messages in thread
From: Nicolin Chen @ 2025-09-23 19:45 UTC (permalink / raw)
  To: Zhenzhong Duan
  Cc: qemu-devel, alex.williamson, clg, eric.auger, mst, jasowang,
	peterx, ddutile, jgg, skolothumtho, joao.m.martins,
	clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng

On Thu, Sep 18, 2025 at 04:57:57AM -0400, Zhenzhong Duan wrote:
> In VFIO core, we call iommufd_backend_get_device_info() to return vendor
> specific hardware information data, but it's not good to extract this raw
> data in VFIO core.
> 
> Introduce host_iommu_extract_vendor_caps() to help extracting the raw
> data and return a bitmap in iommufd.c because it's the place defining
> iommufd_backend_get_device_info().
> 
> The other choice is to put vendor data extracting code in vendor vIOMMU
> emulation file, but that will make those files mixed with vIOMMU
> emulation and host IOMMU extracting code, also need a new callback in
> PCIIOMMUOps. So we choose a simpler way as above.
> 
> Suggested-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>

Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>

With some nits:

> +enum {
> +    /* Nesting parent HWPT shouldn't have readonly mapping, due to errata */
> +     IOMMU_HW_NESTING_PARENT_BYPASS_RO = BIT_ULL(0),
> +};

I would put a name here too. And given this is defined generically:

/* Host IOMMU quirks. Extracted from host IOMMU capabilities */
enum host_iommu_quirks {
	HOST_IOMMU_QUIRK_NESTING_PARENT_BYPASS_RO = BIT_ULL(0),
};

> +/**
> + * host_iommu_extract_vendor_caps: Extract vendor capabilities

Then:

 * host_iommu_extract_quirks: Extract host IOMMU quirks

> + * This function converts @type specific hardware information data
> + * into a standard bitmap format.
> + *
> + * @type: IOMMU Hardware Info Types
> + *
> + * @VendorCaps: IOMMU @type specific hardware information data
> + *
> + * Returns: 64bit bitmap with each bit represents a capability of host
> + * IOMMU that we want to expose. See IOMMU_HW_* in include/hw/iommu.h
> + * for all possible capabilities currently exposed.

And simplify this:

 * Returns: bitmap with each bit representing a host IOMMU quirk defined in
 * enum host_iommu_quirks

> +uint64_t host_iommu_extract_vendor_caps(uint32_t type, VendorCaps *caps)
> +{
> +    uint64_t vendor_caps = 0;
> +
> +    if (type == IOMMU_HW_INFO_TYPE_INTEL_VTD &&
> +        caps->vtd.flags & IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17) {
> +        vendor_caps |= IOMMU_HW_NESTING_PARENT_BYPASS_RO;
> +    }
> +
> +    return vendor_caps;
> +}

uint64_t host_iommu_extract_quirks(enum iommu_hw_info_type, VendorCaps *caps)
{
    uint64_t quirks = 0;

#if defined(CONFIG_VTD)
    if (type == IOMMU_HW_INFO_TYPE_INTEL_VTD) {
        if (caps->vtd.flags & IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17) {
            quirks |= HOST_IOMMU_QUIRK_NESTING_PARENT_BYPASS_RO;
        }
    }
#endif

    return quirks;
}
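The capability-to-quirk translation being suggested here can be compiled and tested standalone. The constants and the VendorCaps layout below are simplified stand-ins for the shared uAPI definitions, not the real values:

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for the iommufd uAPI types (values illustrative) */
enum iommu_hw_info_type {
    IOMMU_HW_INFO_TYPE_NONE,
    IOMMU_HW_INFO_TYPE_INTEL_VTD,
};

#define IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17 (1U << 0)

typedef union VendorCaps {
    struct {
        uint32_t flags;
    } vtd;
} VendorCaps;

/* Host IOMMU quirks. Extracted from host IOMMU capabilities */
enum host_iommu_quirks {
    HOST_IOMMU_QUIRK_NESTING_PARENT_BYPASS_RO = 1ULL << 0,
};

/* Converts @type specific hardware info into a generic quirk bitmap,
 * mirroring the suggested host_iommu_extract_quirks() shape. */
uint64_t host_iommu_extract_quirks(enum iommu_hw_info_type type,
                                   const VendorCaps *caps)
{
    uint64_t quirks = 0;

    if (type == IOMMU_HW_INFO_TYPE_INTEL_VTD &&
        (caps->vtd.flags & IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17)) {
        quirks |= HOST_IOMMU_QUIRK_NESTING_PARENT_BYPASS_RO;
    }

    return quirks;
}
```

Keeping the vendor-specific branch inside one helper means callers in VFIO core only ever see the generic bitmap, which is the point of the patch.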

Thanks
Nicolin


^ permalink raw reply	[flat|nested] 57+ messages in thread

* RE: [PATCH v6 05/22] hw/pci: Introduce pci_device_get_viommu_flags()
  2025-09-23 18:47   ` Nicolin Chen
@ 2025-09-24  7:05     ` Duan, Zhenzhong
  2025-09-24  8:21       ` Nicolin Chen
  0 siblings, 1 reply; 57+ messages in thread
From: Duan, Zhenzhong @ 2025-09-24  7:05 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: qemu-devel@nongnu.org, alex.williamson@redhat.com, clg@redhat.com,
	eric.auger@redhat.com, mst@redhat.com, jasowang@redhat.com,
	peterx@redhat.com, ddutile@redhat.com, jgg@nvidia.com,
	skolothumtho@nvidia.com, joao.m.martins@oracle.com,
	clement.mathieu--drif@eviden.com, Tian,  Kevin, Liu, Yi L,
	Peng, Chao P



>-----Original Message-----
>From: Nicolin Chen <nicolinc@nvidia.com>
>Subject: Re: [PATCH v6 05/22] hw/pci: Introduce
>pci_device_get_viommu_flags()
>
>On Thu, Sep 18, 2025 at 04:57:44AM -0400, Zhenzhong Duan wrote:
>> Introduce a new PCIIOMMUOps optional callback, get_viommu_flags()
>which
>> allows to retrieve flags exposed by a vIOMMU. The first planned vIOMMU
>> device flag is VIOMMU_FLAG_WANT_NESTING_PARENT that advertises the
>> support of HW nested stage translation scheme and wants other sub-system
>> like VFIO's cooperation to create nesting parent HWPT.
>>
>> pci_device_get_viommu_flags() is a wrapper that can be called on a PCI
>> device potentially protected by a vIOMMU.
>>
>> get_viommu_flags() is designed to return 64bit bitmap of purely vIOMMU
>> flags which are only determined by user's configuration, no host
>> capabilities involved. Reasons are:
>>
>> 1. host may have heterogeneous IOMMUs, each with different capabilities
>> 2. this is migration friendly, return value is consistent between source
>>    and target.
>> 3. host IOMMU capabilities are passed to vIOMMU through
>set_iommu_device()
>>    interface which have to be after attach_device(), when
>get_viommu_flags()
>>    is called in attach_device(), there is no way for vIOMMU to get host
>>    IOMMU capabilities yet, so only pure vIOMMU flags can be returned.
>
>"no way" sounds too strong..
>
>There is an iommufd_backend_get_device_info() call there. So, we
>could have passed the host IOMMU capabilities to a vIOMMU. Just,
>we chose not to (assuming for migration reason?).

What about 'it's hard for vIOMMU to get host IOMMU...'?

>
>>    See below sequence:
>>
>>      vfio_device_attach():
>>          iommufd_cdev_attach():
>>              pci_device_get_viommu_flags() for HW nesting cap
>>              create a nesting parent HWPT
>>              attach device to the HWPT
>>              vfio_device_hiod_create_and_realize() creating hiod
>>      ...
>>      pci_device_set_iommu_device(hiod)
>>
>> Suggested-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>
>Despite some nits, patch looks good to me:
>
>Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
>
>> +enum {
>> +    /* Nesting parent HWPT will be reused by vIOMMU to create nested
>HWPT */
>> +     VIOMMU_FLAG_WANT_NESTING_PARENT = BIT_ULL(0),
>> +};
>
>How about adding a name and move the note here:
>
>/*
> * Theoretical vIOMMU flags. Only determined by the vIOMMU device
> * properties and independent of the actual host IOMMU capabilities
> * they may depend on.
> */
>enum viommu_flags {
>	...
>};
>
>> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
>> index bde9dca8e2..c54f2b53ae 100644
>> --- a/include/hw/pci/pci.h
>> +++ b/include/hw/pci/pci.h
>> @@ -462,6 +462,23 @@ typedef struct PCIIOMMUOps {
>>       * @devfn: device and function number of the PCI device.
>>       */
>>      void (*unset_iommu_device)(PCIBus *bus, void *opaque, int devfn);
>> +    /**
>> +     * @get_viommu_flags: get vIOMMU flags
>> +     *
>> +     * Optional callback, if not implemented, then vIOMMU doesn't
>support
>> +     * exposing flags to other sub-system, e.g., VFIO. Each flag can be
>> +     * an expectation or request to other sub-system or just a pure
>vIOMMU
>> +     * capability. vIOMMU can choose which flags to expose.
>
>The 2nd statement is somewhat redundant. Perhaps we could squash
>it into the notes at enum viommu_flags above, if we really need to.
>
>> +     *
>> +     * @opaque: the data passed to pci_setup_iommu().
>> +     *
>> +     * Returns: 64bit bitmap with each bit represents a flag that vIOMMU
>> +     * wants to expose. See VIOMMU_FLAG_* in include/hw/iommu.h
>for all
>> +     * possible flags currently used. These flags are theoretical which
>> +     * are only determined by vIOMMU device properties and
>independent on
>> +     * the actual host capabilities they may depend on.
>> +     */
>> +    uint64_t (*get_viommu_flags)(void *opaque);
>
>With the notes above, we could simplify this:
>
>     * Returns: bitmap with each representing a vIOMMU flag defined in
>     * enum viommu_flags
>
>> +/**
>> + * pci_device_get_viommu_flags: get vIOMMU flags.
>> + *
>> + * Returns a 64bit bitmap with each bit represents a vIOMMU exposed
>> + * flags, 0 if vIOMMU doesn't support that.
>> + *
>> + * @dev: PCI device pointer.
>> + */
>> +uint64_t pci_device_get_viommu_flags(PCIDevice *dev);
>
>and could make this aligned too:
>
>     * Returns: bitmap with each representing a vIOMMU flag defined in
>     * enum viommu_flags. Or 0 if vIOMMU doesn't report any.

Will do all suggested changes above.

Thanks
Zhenzhong


^ permalink raw reply	[flat|nested] 57+ messages in thread

* RE: [PATCH v6 18/22] iommufd: Introduce a helper function to extract vendor capabilities
  2025-09-23 19:45   ` Nicolin Chen
@ 2025-09-24  8:05     ` Duan, Zhenzhong
  2025-09-24  8:27       ` Nicolin Chen
  0 siblings, 1 reply; 57+ messages in thread
From: Duan, Zhenzhong @ 2025-09-24  8:05 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: qemu-devel@nongnu.org, alex.williamson@redhat.com, clg@redhat.com,
	eric.auger@redhat.com, mst@redhat.com, jasowang@redhat.com,
	peterx@redhat.com, ddutile@redhat.com, jgg@nvidia.com,
	skolothumtho@nvidia.com, joao.m.martins@oracle.com,
	clement.mathieu--drif@eviden.com, Tian,  Kevin, Liu, Yi L,
	Peng, Chao P



>-----Original Message-----
>From: Nicolin Chen <nicolinc@nvidia.com>
>Subject: Re: [PATCH v6 18/22] iommufd: Introduce a helper function to
>extract vendor capabilities
>
>On Thu, Sep 18, 2025 at 04:57:57AM -0400, Zhenzhong Duan wrote:
>> In VFIO core, we call iommufd_backend_get_device_info() to return vendor
>> specific hardware information data, but it's not good to extract this raw
>> data in VFIO core.
>>
>> Introduce host_iommu_extract_vendor_caps() to help extracting the raw
>> data and return a bitmap in iommufd.c because it's the place defining
>> iommufd_backend_get_device_info().
>>
>> The other choice is to put vendor data extracting code in vendor vIOMMU
>> emulation file, but that will make those files mixed with vIOMMU
>> emulation and host IOMMU extracting code, also need a new callback in
>> PCIIOMMUOps. So we choose a simpler way as above.
>>
>> Suggested-by: Nicolin Chen <nicolinc@nvidia.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>
>Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
>
>With some nits:
>
>> +enum {
>> +    /* Nesting parent HWPT shouldn't have readonly mapping, due to
>errata */
>> +     IOMMU_HW_NESTING_PARENT_BYPASS_RO = BIT_ULL(0),
>> +};
>
>I would put a name here too. And given this is defined generically:
>
>/* Host IOMMU quirks. Extracted from host IOMMU capabilities */
>enum host_iommu_quirks {
>	HOST_IOMMU_QUIRK_NESTING_PARENT_BYPASS_RO = BIT_ULL(0),
>};
>
>> +/**
>> + * host_iommu_extract_vendor_caps: Extract vendor capabilities
>
>Then:
>
> * host_iommu_extract_quirks: Extract host IOMMU quirks
>
>> + * This function converts @type specific hardware information data
>> + * into a standard bitmap format.
>> + *
>> + * @type: IOMMU Hardware Info Types
>> + *
>> + * @VendorCaps: IOMMU @type specific hardware information data
>> + *
>> + * Returns: 64bit bitmap with each bit represents a capability of host
>> + * IOMMU that we want to expose. See IOMMU_HW_* in
>include/hw/iommu.h
>> + * for all possible capabilities currently exposed.
>
>And simplify this:
>
> * Returns: bitmap with each representing a host IOMMU quirk defined in
> * enum host_iommu_quirks
>
>> +uint64_t host_iommu_extract_vendor_caps(uint32_t type, VendorCaps
>*caps)
>> +{
>> +    uint64_t vendor_caps = 0;
>> +
>> +    if (type == IOMMU_HW_INFO_TYPE_INTEL_VTD &&
>> +        caps->vtd.flags &
>IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17) {
>> +        vendor_caps |= IOMMU_HW_NESTING_PARENT_BYPASS_RO;
>> +    }
>> +
>> +    return vendor_caps;
>> +}
>
>uint64_t host_iommu_extract_quirks(enum iommu_hw_info_type,
>VendorCaps *caps)
>{
>    uint64_t quirks = 0;
>
>#if defined(CONFIG_VTD)

I have applied all suggested changes except CONFIG_VTD here, as it's a device config option and iommufd.c is device agnostic, so it doesn't recognize CONFIG_VTD.

../backends/iommufd.c:419:13: error: attempt to use poisoned "CONFIG_VTD"

I thought this was trivial and OK without CONFIG_VTD?

Thanks
Zhenzhong

>    if (type == IOMMU_HW_INFO_TYPE_INTEL_VTD) {
>        if (caps->vtd.flags &
>IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17) {
>            quirks |=
>HOST_IOMMU_QUIRK_NESTING_PARENT_BYPASS_RO;
>        }
>    }
>#endif
>
>    return quirks;
>}
>
>Thanks
>Nicolin


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v6 05/22] hw/pci: Introduce pci_device_get_viommu_flags()
  2025-09-24  7:05     ` Duan, Zhenzhong
@ 2025-09-24  8:21       ` Nicolin Chen
  2025-09-26  2:54         ` Duan, Zhenzhong
  0 siblings, 1 reply; 57+ messages in thread
From: Nicolin Chen @ 2025-09-24  8:21 UTC (permalink / raw)
  To: Duan, Zhenzhong
  Cc: qemu-devel@nongnu.org, alex.williamson@redhat.com, clg@redhat.com,
	eric.auger@redhat.com, mst@redhat.com, jasowang@redhat.com,
	peterx@redhat.com, ddutile@redhat.com, jgg@nvidia.com,
	skolothumtho@nvidia.com, joao.m.martins@oracle.com,
	clement.mathieu--drif@eviden.com, Tian,  Kevin, Liu, Yi L,
	Peng, Chao P

On Wed, Sep 24, 2025 at 07:05:42AM +0000, Duan, Zhenzhong wrote:
> >From: Nicolin Chen <nicolinc@nvidia.com>
> >Subject: Re: [PATCH v6 05/22] hw/pci: Introduce
> >> get_viommu_flags() is designed to return 64bit bitmap of purely vIOMMU
> >> flags which are only determined by user's configuration, no host
> >> capabilities involved. Reasons are:
> >>
> >> 1. host may have heterogeneous IOMMUs, each with different capabilities
> >> 2. this is migration friendly, return value is consistent between source
> >>    and target.
> >> 3. host IOMMU capabilities are passed to vIOMMU through
> >set_iommu_device()
> >>    interface which have to be after attach_device(), when
> >get_viommu_flags()
> >>    is called in attach_device(), there is no way for vIOMMU to get host
> >>    IOMMU capabilities yet, so only pure vIOMMU flags can be returned.
> >
> >"no way" sounds too strong..
> >
> >There is an iommufd_backend_get_device_info() call there. So, we
> >could have passed the host IOMMU capabilities to a vIOMMU. Just,
> >we chose not to (assuming for migration reason?).
> 
> What about 'it's hard for vIOMMU to get host IOMMU...'?

vfio-iommufd core code gets all the host IOMMU caps via the vfio
device but chooses not to forward them to the vIOMMU. So, it's
neither "no way" nor "hard" :)

To be honest, I don't feel this is relevant enough to stand as reason 3
for justifying the new op/API. Reasons 1 and 2 are quite okay on their own?

Having said that, it's probably good to add a side note:

"
Note that this op will be invoked at the attach_device() stage, at which
point host IOMMU capabilities have not yet been forwarded to the vIOMMU
through the set_iommu_device() callback, which is called after
attach_device().

See the below sequence:
"

Nicolin


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v6 18/22] iommufd: Introduce a helper function to extract vendor capabilities
  2025-09-24  8:05     ` Duan, Zhenzhong
@ 2025-09-24  8:27       ` Nicolin Chen
  2025-09-26  2:54         ` Duan, Zhenzhong
  0 siblings, 1 reply; 57+ messages in thread
From: Nicolin Chen @ 2025-09-24  8:27 UTC (permalink / raw)
  To: Duan, Zhenzhong
  Cc: qemu-devel@nongnu.org, alex.williamson@redhat.com, clg@redhat.com,
	eric.auger@redhat.com, mst@redhat.com, jasowang@redhat.com,
	peterx@redhat.com, ddutile@redhat.com, jgg@nvidia.com,
	skolothumtho@nvidia.com, joao.m.martins@oracle.com,
	clement.mathieu--drif@eviden.com, Tian,  Kevin, Liu, Yi L,
	Peng, Chao P

On Wed, Sep 24, 2025 at 08:05:36AM +0000, Duan, Zhenzhong wrote:
> >uint64_t host_iommu_extract_quirks(enum iommu_hw_info_type,
> >VendorCaps *caps)
> >{
> >    uint64_t quirks = 0;
> >
> >#if defined(CONFIG_VTD)
> 
> I have applied all suggested change except CONFIG_VTD here as it's a device config and iommufd.c is device agnostic, it doesn't recognize CONFIG_VTD.
> 
> ../backends/iommufd.c:419:13: error: attempt to use poisoned "CONFIG_VTD"
> 
> I thought this is trivial and OK for not having CONFIG_VTD?

Hmm.. I didn't expect that. It seems that QEMU does encourage
moving all vendor specific code to vendor specific files :-/

Anyway, I think it's fine to drop the ifdef. The VTD type and
cap structure are defined in the shared uAPI header.

Nicolin


^ permalink raw reply	[flat|nested] 57+ messages in thread

* RE: [PATCH v6 18/22] iommufd: Introduce a helper function to extract vendor capabilities
  2025-09-24  8:27       ` Nicolin Chen
@ 2025-09-26  2:54         ` Duan, Zhenzhong
  0 siblings, 0 replies; 57+ messages in thread
From: Duan, Zhenzhong @ 2025-09-26  2:54 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: qemu-devel@nongnu.org, alex.williamson@redhat.com, clg@redhat.com,
	eric.auger@redhat.com, mst@redhat.com, jasowang@redhat.com,
	peterx@redhat.com, ddutile@redhat.com, jgg@nvidia.com,
	skolothumtho@nvidia.com, joao.m.martins@oracle.com,
	clement.mathieu--drif@eviden.com, Tian,  Kevin, Liu, Yi L,
	Peng, Chao P



>-----Original Message-----
>From: Nicolin Chen <nicolinc@nvidia.com>
>Subject: Re: [PATCH v6 18/22] iommufd: Introduce a helper function to
>extract vendor capabilities
>
>On Wed, Sep 24, 2025 at 08:05:36AM +0000, Duan, Zhenzhong wrote:
>> >uint64_t host_iommu_extract_quirks(enum iommu_hw_info_type,
>> >VendorCaps *caps)
>> >{
>> >    uint64_t quirks = 0;
>> >
>> >#if defined(CONFIG_VTD)
>>
>> I have applied all suggested change except CONFIG_VTD here as it's a
>device config and iommufd.c is device agnostic, it doesn't recognize
>CONFIG_VTD.
>>
>> ../backends/iommufd.c:419:13: error: attempt to use poisoned
>"CONFIG_VTD"
>>
>> I thought this is trivial and OK for not having CONFIG_VTD?
>
>Hmm.. I didn't expect that. It seems that QEMU does encourage
>moving all vendor specific code to vendor specific file :-/

This makes me think CONFIG_VTD should not be used here: it controls
intel_iommu emulation, and we may have a custom virtual machine without
a virtual intel_iommu that still needs to check the host IOMMU.

If we do want a control, maybe something like CONFIG_HOST_VTD rather than CONFIG_VTD.

>
>Anyway, I think it's fine to drop the ifdef. The VTD type and
>cap structure are defined in the shared uAPI header.

Agree.

Thanks
Zhenzhong


^ permalink raw reply	[flat|nested] 57+ messages in thread

* RE: [PATCH v6 05/22] hw/pci: Introduce pci_device_get_viommu_flags()
  2025-09-24  8:21       ` Nicolin Chen
@ 2025-09-26  2:54         ` Duan, Zhenzhong
  2025-09-30 13:55           ` Eric Auger
  0 siblings, 1 reply; 57+ messages in thread
From: Duan, Zhenzhong @ 2025-09-26  2:54 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: qemu-devel@nongnu.org, alex.williamson@redhat.com, clg@redhat.com,
	eric.auger@redhat.com, mst@redhat.com, jasowang@redhat.com,
	peterx@redhat.com, ddutile@redhat.com, jgg@nvidia.com,
	skolothumtho@nvidia.com, joao.m.martins@oracle.com,
	clement.mathieu--drif@eviden.com, Tian,  Kevin, Liu, Yi L,
	Peng, Chao P



>-----Original Message-----
>From: Nicolin Chen <nicolinc@nvidia.com>
>Subject: Re: [PATCH v6 05/22] hw/pci: Introduce
>pci_device_get_viommu_flags()
>
>On Wed, Sep 24, 2025 at 07:05:42AM +0000, Duan, Zhenzhong wrote:
>> >From: Nicolin Chen <nicolinc@nvidia.com>
>> >Subject: Re: [PATCH v6 05/22] hw/pci: Introduce
>> >> get_viommu_flags() is designed to return 64bit bitmap of purely
>vIOMMU
>> >> flags which are only determined by user's configuration, no host
>> >> capabilities involved. Reasons are:
>> >>
>> >> 1. host may have heterogeneous IOMMUs, each with different capabilities
>> >> 2. this is migration friendly, return value is consistent between source
>> >>    and target.
>> >> 3. host IOMMU capabilities are passed to vIOMMU through
>> >set_iommu_device()
>> >>    interface which have to be after attach_device(), when
>> >get_viommu_flags()
>> >>    is called in attach_device(), there is no way for vIOMMU to get host
>> >>    IOMMU capabilities yet, so only pure vIOMMU flags can be
>returned.
>> >
>> >"no way" sounds too strong..
>> >
>> >There is an iommufd_backend_get_device_info() call there. So, we
>> >could have passed the host IOMMU capabilities to a vIOMMU. Just,
>> >we chose not to (assuming for migration reason?).
>>
>> What about 'it's hard for vIOMMU to get host IOMMU...'?
>
>vfio-iommufd core code gets all the host IOMMU caps via the vfio
>device but chooses to not forward to vIOMMU. So, it's neither "no
>way" nor "hard" :)

Yes, that would require introducing another callback to forward the caps
early, which is unnecessarily complex.

>
>To be honest, I don't feel this very related to be the reason 3
>to justify for the new op/API. 1 and 2 are quite okay?
>
>Having said that, it's probably good to add as a side note:
>
>"
>Note that this op will be invoked at the attach_device() stage, at which
>point host IOMMU capabilities are not yet forwarded to the vIOMMU through
>the set_iommu_device() callback that will be after the attach_device().
>
>See the below sequence:
>"

OK, will drop 3 and add the side note.

Thanks
Zhenzhong


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v6 19/22] vfio: Add a new element bypass_ro in VFIOContainerBase
  2025-09-18  8:57 ` [PATCH v6 19/22] vfio: Add a new element bypass_ro in VFIOContainerBase Zhenzhong Duan
@ 2025-09-26 12:25   ` Cédric Le Goater
  0 siblings, 0 replies; 57+ messages in thread
From: Cédric Le Goater @ 2025-09-26 12:25 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, eric.auger, mst, jasowang, peterx, ddutile, jgg,
	nicolinc, skolothumtho, joao.m.martins, clement.mathieu--drif,
	kevin.tian, yi.l.liu, chao.p.peng

On 9/18/25 10:57, Zhenzhong Duan wrote:
> When bypass_ro is true, readonly memory sections are bypassed, i.e. not
> mapped in the container.
> 
> This is a preparatory patch to work around Intel ERRATA_772415; see the
> changelog in the next patch for details about the errata.
> 
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>


Reviewed-by: Cédric Le Goater <clg@redhat.com>

Thanks,

C.


> ---
>   include/hw/vfio/vfio-container-base.h |  1 +
>   hw/vfio/listener.c                    | 21 ++++++++++++++-------
>   2 files changed, 15 insertions(+), 7 deletions(-)
> 
> diff --git a/include/hw/vfio/vfio-container-base.h b/include/hw/vfio/vfio-container-base.h
> index acbd48a18a..2b9fec217a 100644
> --- a/include/hw/vfio/vfio-container-base.h
> +++ b/include/hw/vfio/vfio-container-base.h
> @@ -52,6 +52,7 @@ struct VFIOContainerBase {
>       QLIST_HEAD(, VFIODevice) device_list;
>       GList *iova_ranges;
>       NotifierWithReturn cpr_reboot_notifier;
> +    bool bypass_ro;
>   };
>   
>   #define TYPE_VFIO_IOMMU "vfio-iommu"
> diff --git a/hw/vfio/listener.c b/hw/vfio/listener.c
> index e093833165..581ebfda36 100644
> --- a/hw/vfio/listener.c
> +++ b/hw/vfio/listener.c
> @@ -76,8 +76,13 @@ static bool vfio_log_sync_needed(const VFIOContainerBase *bcontainer)
>       return true;
>   }
>   
> -static bool vfio_listener_skipped_section(MemoryRegionSection *section)
> +static bool vfio_listener_skipped_section(MemoryRegionSection *section,
> +                                          bool bypass_ro)
>   {
> +    if (bypass_ro && section->readonly) {
> +        return true;
> +    }
> +
>       return (!memory_region_is_ram(section->mr) &&
>               !memory_region_is_iommu(section->mr)) ||
>              memory_region_is_protected(section->mr) ||
> @@ -368,9 +373,9 @@ static bool vfio_known_safe_misalignment(MemoryRegionSection *section)
>   }
>   
>   static bool vfio_listener_valid_section(MemoryRegionSection *section,
> -                                        const char *name)
> +                                        bool bypass_ro, const char *name)
>   {
> -    if (vfio_listener_skipped_section(section)) {
> +    if (vfio_listener_skipped_section(section, bypass_ro)) {
>           trace_vfio_listener_region_skip(name,
>                   section->offset_within_address_space,
>                   section->offset_within_address_space +
> @@ -497,7 +502,8 @@ void vfio_container_region_add(VFIOContainerBase *bcontainer,
>       int ret;
>       Error *err = NULL;
>   
> -    if (!vfio_listener_valid_section(section, "region_add")) {
> +    if (!vfio_listener_valid_section(section, bcontainer->bypass_ro,
> +                                     "region_add")) {
>           return;
>       }
>   
> @@ -663,7 +669,8 @@ static void vfio_listener_region_del(MemoryListener *listener,
>       int ret;
>       bool try_unmap = true;
>   
> -    if (!vfio_listener_valid_section(section, "region_del")) {
> +    if (!vfio_listener_valid_section(section, bcontainer->bypass_ro,
> +                                     "region_del")) {
>           return;
>       }
>   
> @@ -820,7 +827,7 @@ static void vfio_dirty_tracking_update(MemoryListener *listener,
>           container_of(listener, VFIODirtyRangesListener, listener);
>       hwaddr iova, end;
>   
> -    if (!vfio_listener_valid_section(section, "tracking_update") ||
> +    if (!vfio_listener_valid_section(section, false, "tracking_update") ||
>           !vfio_get_section_iova_range(dirty->bcontainer, section,
>                                        &iova, &end, NULL)) {
>           return;
> @@ -1214,7 +1221,7 @@ static void vfio_listener_log_sync(MemoryListener *listener,
>       int ret;
>       Error *local_err = NULL;
>   
> -    if (vfio_listener_skipped_section(section)) {
> +    if (vfio_listener_skipped_section(section, false)) {
>           return;
>       }
>   



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v6 03/22] intel_iommu: Update terminology to match VTD spec
  2025-09-18  8:57 ` [PATCH v6 03/22] intel_iommu: Update terminology to match VTD spec Zhenzhong Duan
@ 2025-09-30  7:45   ` Eric Auger
  2025-10-12 12:30   ` Yi Liu
  1 sibling, 0 replies; 57+ messages in thread
From: Eric Auger @ 2025-09-30  7:45 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, mst, jasowang, peterx, ddutile, jgg,
	nicolinc, skolothumtho, joao.m.martins, clement.mathieu--drif,
	kevin.tian, yi.l.liu, chao.p.peng, Paolo Bonzini

Hi Zhenzhong,

On 9/18/25 10:57 AM, Zhenzhong Duan wrote:
> VTD spec revision 3.4 released in December 2021 renamed "First-level" to
> "First-stage" and "Second-level" to "Second-stage".
>
> Do the same in intel_iommu code to match spec, change all existing
> "fl/sl/FL/SL/first level/second level/stage-1/stage-2" terminology to
> "fs/ss/FS/SS/first stage/second stage".
>
> No functional changes intended.
>
> Suggested-by: Yi Liu <yi.l.liu@intel.com>
> Suggested-by: Eric Auger <eric.auger@redhat.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>

Eric
> ---
>  hw/i386/intel_iommu_internal.h |  63 ++++-----
>  include/hw/i386/intel_iommu.h  |   2 +-
>  hw/i386/intel_iommu.c          | 240 +++++++++++++++++----------------
>  tests/qtest/intel-iommu-test.c |   4 +-
>  4 files changed, 156 insertions(+), 153 deletions(-)
>
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> index 6abe76556a..86b8bfc71f 100644
> --- a/hw/i386/intel_iommu_internal.h
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -195,8 +195,8 @@
>  #define VTD_ECAP_PSS                (7ULL << 35) /* limit: MemTxAttrs::pid */
>  #define VTD_ECAP_PASID              (1ULL << 40)
>  #define VTD_ECAP_SMTS               (1ULL << 43)
> -#define VTD_ECAP_SLTS               (1ULL << 46)
> -#define VTD_ECAP_FLTS               (1ULL << 47)
> +#define VTD_ECAP_SSTS               (1ULL << 46)
> +#define VTD_ECAP_FSTS               (1ULL << 47)
>  
>  /* CAP_REG */
>  /* (offset >> 4) << 24 */
> @@ -210,7 +210,7 @@
>  #define VTD_MAMV                    18ULL
>  #define VTD_CAP_MAMV                (VTD_MAMV << 48)
>  #define VTD_CAP_PSI                 (1ULL << 39)
> -#define VTD_CAP_SLLPS               ((1ULL << 34) | (1ULL << 35))
> +#define VTD_CAP_SSLPS               ((1ULL << 34) | (1ULL << 35))
>  #define VTD_CAP_DRAIN_WRITE         (1ULL << 54)
>  #define VTD_CAP_DRAIN_READ          (1ULL << 55)
>  #define VTD_CAP_FS1GP               (1ULL << 56)
> @@ -282,7 +282,7 @@ typedef enum VTDFaultReason {
>      VTD_FR_ADDR_BEYOND_MGAW,    /* Input-address above (2^x-1) */
>      VTD_FR_WRITE,               /* No write permission */
>      VTD_FR_READ,                /* No read permission */
> -    /* Fail to access a second-level paging entry (not SL_PML4E) */
> +    /* Fail to access a second-stage paging entry (not SS_PML4E) */
>      VTD_FR_PAGING_ENTRY_INV,
>      VTD_FR_ROOT_TABLE_INV,      /* Fail to access a root-entry */
>      VTD_FR_CONTEXT_TABLE_INV,   /* Fail to access a context-entry */
> @@ -290,7 +290,8 @@ typedef enum VTDFaultReason {
>      VTD_FR_ROOT_ENTRY_RSVD,
>      /* Non-zero reserved field in a present context-entry */
>      VTD_FR_CONTEXT_ENTRY_RSVD,
> -    /* Non-zero reserved field in a second-level paging entry with at lease one
> +    /*
> +     * Non-zero reserved field in a second-stage paging entry with at lease one
>       * Read(R) and Write(W) or Execute(E) field is Set.
>       */
>      VTD_FR_PAGING_ENTRY_RSVD,
> @@ -323,7 +324,7 @@ typedef enum VTDFaultReason {
>      VTD_FR_PASID_ENTRY_P = 0x59,
>      VTD_FR_PASID_TABLE_ENTRY_INV = 0x5b,  /*Invalid PASID table entry */
>  
> -    /* Fail to access a first-level paging entry (not FS_PML4E) */
> +    /* Fail to access a first-stage paging entry (not FS_PML4E) */
>      VTD_FR_FS_PAGING_ENTRY_INV = 0x70,
>      VTD_FR_FS_PAGING_ENTRY_P = 0x71,
>      /* Non-zero reserved field in present first-stage paging entry */
> @@ -445,23 +446,23 @@ typedef union VTDInvDesc VTDInvDesc;
>  
>  #define VTD_SPTE_PAGE_L1_RSVD_MASK(aw, stale_tm) \
>          stale_tm ? \
> -        (0x800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM | VTD_SL_TM)) : \
> -        (0x800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> +        (0x800ULL | ~(VTD_HAW_MASK(aw) | VTD_SS_IGN_COM | VTD_SS_TM)) : \
> +        (0x800ULL | ~(VTD_HAW_MASK(aw) | VTD_SS_IGN_COM))
>  #define VTD_SPTE_PAGE_L2_RSVD_MASK(aw) \
> -        (0x800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> +        (0x800ULL | ~(VTD_HAW_MASK(aw) | VTD_SS_IGN_COM))
>  #define VTD_SPTE_PAGE_L3_RSVD_MASK(aw) \
> -        (0x800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> +        (0x800ULL | ~(VTD_HAW_MASK(aw) | VTD_SS_IGN_COM))
>  #define VTD_SPTE_PAGE_L4_RSVD_MASK(aw) \
> -        (0x880ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> +        (0x880ULL | ~(VTD_HAW_MASK(aw) | VTD_SS_IGN_COM))
>  
>  #define VTD_SPTE_LPAGE_L2_RSVD_MASK(aw, stale_tm) \
>          stale_tm ? \
> -        (0x1ff800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM | VTD_SL_TM)) : \
> -        (0x1ff800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> +        (0x1ff800ULL | ~(VTD_HAW_MASK(aw) | VTD_SS_IGN_COM | VTD_SS_TM)) : \
> +        (0x1ff800ULL | ~(VTD_HAW_MASK(aw) | VTD_SS_IGN_COM))
>  #define VTD_SPTE_LPAGE_L3_RSVD_MASK(aw, stale_tm) \
>          stale_tm ? \
> -        (0x3ffff800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM | VTD_SL_TM)) : \
> -        (0x3ffff800ULL | ~(VTD_HAW_MASK(aw) | VTD_SL_IGN_COM))
> +        (0x3ffff800ULL | ~(VTD_HAW_MASK(aw) | VTD_SS_IGN_COM | VTD_SS_TM)) : \
> +        (0x3ffff800ULL | ~(VTD_HAW_MASK(aw) | VTD_SS_IGN_COM))
>  
>  /* Rsvd field masks for fpte */
>  #define VTD_FS_UPPER_IGNORED 0xfff0000000000000ULL
> @@ -535,7 +536,7 @@ typedef struct VTDRootEntry VTDRootEntry;
>  #define VTD_CONTEXT_TT_DEV_IOTLB    (1ULL << 2)
>  #define VTD_CONTEXT_TT_PASS_THROUGH (2ULL << 2)
>  /* Second Level Page Translation Pointer*/
> -#define VTD_CONTEXT_ENTRY_SLPTPTR   (~0xfffULL)
> +#define VTD_CONTEXT_ENTRY_SSPTPTR   (~0xfffULL)
>  #define VTD_CONTEXT_ENTRY_RSVD_LO(aw) (0xff0ULL | ~VTD_HAW_MASK(aw))
>  /* hi */
>  #define VTD_CONTEXT_ENTRY_AW        7ULL /* Adjusted guest-address-width */
> @@ -565,35 +566,35 @@ typedef struct VTDRootEntry VTDRootEntry;
>  /* PASID Granular Translation Type Mask */
>  #define VTD_PASID_ENTRY_P              1ULL
>  #define VTD_SM_PASID_ENTRY_PGTT        (7ULL << 6)
> -#define VTD_SM_PASID_ENTRY_FLT         (1ULL << 6)
> -#define VTD_SM_PASID_ENTRY_SLT         (2ULL << 6)
> +#define VTD_SM_PASID_ENTRY_FST         (1ULL << 6)
> +#define VTD_SM_PASID_ENTRY_SST         (2ULL << 6)
>  #define VTD_SM_PASID_ENTRY_NESTED      (3ULL << 6)
>  #define VTD_SM_PASID_ENTRY_PT          (4ULL << 6)
>  
>  #define VTD_SM_PASID_ENTRY_AW          7ULL /* Adjusted guest-address-width */
>  #define VTD_SM_PASID_ENTRY_DID(val)    ((val) & VTD_DOMAIN_ID_MASK)
>  
> -#define VTD_SM_PASID_ENTRY_FLPM          3ULL
> -#define VTD_SM_PASID_ENTRY_FLPTPTR       (~0xfffULL)
> +#define VTD_SM_PASID_ENTRY_FSPM          3ULL
> +#define VTD_SM_PASID_ENTRY_FSPTPTR       (~0xfffULL)
>  
>  /* First Level Paging Structure */
>  /* Masks for First Level Paging Entry */
> -#define VTD_FL_P                    1ULL
> -#define VTD_FL_RW                   (1ULL << 1)
> -#define VTD_FL_US                   (1ULL << 2)
> -#define VTD_FL_A                    (1ULL << 5)
> -#define VTD_FL_D                    (1ULL << 6)
> +#define VTD_FS_P                    1ULL
> +#define VTD_FS_RW                   (1ULL << 1)
> +#define VTD_FS_US                   (1ULL << 2)
> +#define VTD_FS_A                    (1ULL << 5)
> +#define VTD_FS_D                    (1ULL << 6)
>  
>  /* Second Level Page Translation Pointer*/
> -#define VTD_SM_PASID_ENTRY_SLPTPTR     (~0xfffULL)
> +#define VTD_SM_PASID_ENTRY_SSPTPTR     (~0xfffULL)
>  
>  /* Second Level Paging Structure */
>  /* Masks for Second Level Paging Entry */
> -#define VTD_SL_RW_MASK              3ULL
> -#define VTD_SL_R                    1ULL
> -#define VTD_SL_W                    (1ULL << 1)
> -#define VTD_SL_IGN_COM              0xbff0000000000000ULL
> -#define VTD_SL_TM                   (1ULL << 62)
> +#define VTD_SS_RW_MASK              3ULL
> +#define VTD_SS_R                    1ULL
> +#define VTD_SS_W                    (1ULL << 1)
> +#define VTD_SS_IGN_COM              0xbff0000000000000ULL
> +#define VTD_SS_TM                   (1ULL << 62)
>  
>  /* Common for both First Level and Second Level */
>  #define VTD_PML4_LEVEL           4
> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> index e95477e855..564d4d4236 100644
> --- a/include/hw/i386/intel_iommu.h
> +++ b/include/hw/i386/intel_iommu.h
> @@ -264,7 +264,7 @@ struct IntelIOMMUState {
>  
>      bool caching_mode;              /* RO - is cap CM enabled? */
>      bool scalable_mode;             /* RO - is Scalable Mode supported? */
> -    bool flts;                      /* RO - is stage-1 translation supported? */
> +    bool fsts;                      /* RO - is first stage translation supported? */
>      bool snoop_control;             /* RO - is SNP filed supported? */
>  
>      dma_addr_t root;                /* Current root table pointer */
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index b976b251bc..a47482ba9d 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -47,9 +47,9 @@
>  
>  /* pe operations */
>  #define VTD_PE_GET_TYPE(pe) ((pe)->val[0] & VTD_SM_PASID_ENTRY_PGTT)
> -#define VTD_PE_GET_FL_LEVEL(pe) \
> -    (4 + (((pe)->val[2] >> 2) & VTD_SM_PASID_ENTRY_FLPM))
> -#define VTD_PE_GET_SL_LEVEL(pe) \
> +#define VTD_PE_GET_FS_LEVEL(pe) \
> +    (4 + (((pe)->val[2] >> 2) & VTD_SM_PASID_ENTRY_FSPM))
> +#define VTD_PE_GET_SS_LEVEL(pe) \
>      (2 + (((pe)->val[0] >> 2) & VTD_SM_PASID_ENTRY_AW))
>  
>  /*
> @@ -319,7 +319,7 @@ static gboolean vtd_hash_remove_by_page(gpointer key, gpointer value,
>       * nested (PGTT=011b) mapping associated with specified domain-id are
>       * invalidated. Nested isn't supported yet, so only need to check 001b.
>       */
> -    if (entry->pgtt == VTD_SM_PASID_ENTRY_FLT) {
> +    if (entry->pgtt == VTD_SM_PASID_ENTRY_FST) {
>          return true;
>      }
>  
> @@ -340,7 +340,7 @@ static gboolean vtd_hash_remove_by_page_piotlb(gpointer key, gpointer value,
>       * or pass-through (PGTT=100b) mappings. Nested isn't supported yet,
>       * so only need to check first-stage (PGTT=001b) mappings.
>       */
> -    if (entry->pgtt != VTD_SM_PASID_ENTRY_FLT) {
> +    if (entry->pgtt != VTD_SM_PASID_ENTRY_FST) {
>          return false;
>      }
>  
> @@ -747,9 +747,9 @@ static int vtd_get_context_entry_from_root(IntelIOMMUState *s,
>      return 0;
>  }
>  
> -static inline dma_addr_t vtd_ce_get_slpt_base(VTDContextEntry *ce)
> +static inline dma_addr_t vtd_ce_get_sspt_base(VTDContextEntry *ce)
>  {
> -    return ce->lo & VTD_CONTEXT_ENTRY_SLPTPTR;
> +    return ce->lo & VTD_CONTEXT_ENTRY_SSPTPTR;
>  }
>  
>  static inline uint64_t vtd_get_pte_addr(uint64_t pte, uint8_t aw)
> @@ -790,13 +790,13 @@ static inline uint32_t vtd_iova_level_offset(uint64_t iova, uint32_t level)
>  }
>  
>  /* Check Capability Register to see if the @level of page-table is supported */
> -static inline bool vtd_is_sl_level_supported(IntelIOMMUState *s, uint32_t level)
> +static inline bool vtd_is_ss_level_supported(IntelIOMMUState *s, uint32_t level)
>  {
>      return VTD_CAP_SAGAW_MASK & s->cap &
>             (1ULL << (level - 2 + VTD_CAP_SAGAW_SHIFT));
>  }
>  
> -static inline bool vtd_is_fl_level_supported(IntelIOMMUState *s, uint32_t level)
> +static inline bool vtd_is_fs_level_supported(IntelIOMMUState *s, uint32_t level)
>  {
>      return level == VTD_PML4_LEVEL;
>  }
> @@ -805,10 +805,10 @@ static inline bool vtd_is_fl_level_supported(IntelIOMMUState *s, uint32_t level)
>  static inline bool vtd_pe_type_check(IntelIOMMUState *s, VTDPASIDEntry *pe)
>  {
>      switch (VTD_PE_GET_TYPE(pe)) {
> -    case VTD_SM_PASID_ENTRY_FLT:
> -        return !!(s->ecap & VTD_ECAP_FLTS);
> -    case VTD_SM_PASID_ENTRY_SLT:
> -        return !!(s->ecap & VTD_ECAP_SLTS);
> +    case VTD_SM_PASID_ENTRY_FST:
> +        return !!(s->ecap & VTD_ECAP_FSTS);
> +    case VTD_SM_PASID_ENTRY_SST:
> +        return !!(s->ecap & VTD_ECAP_SSTS);
>      case VTD_SM_PASID_ENTRY_NESTED:
>          /* Not support NESTED page table type yet */
>          return false;
> @@ -880,13 +880,13 @@ static int vtd_get_pe_in_pasid_leaf_table(IntelIOMMUState *s,
>      }
>  
>      pgtt = VTD_PE_GET_TYPE(pe);
> -    if (pgtt == VTD_SM_PASID_ENTRY_SLT &&
> -        !vtd_is_sl_level_supported(s, VTD_PE_GET_SL_LEVEL(pe))) {
> +    if (pgtt == VTD_SM_PASID_ENTRY_SST &&
> +        !vtd_is_ss_level_supported(s, VTD_PE_GET_SS_LEVEL(pe))) {
>              return -VTD_FR_PASID_TABLE_ENTRY_INV;
>      }
>  
> -    if (pgtt == VTD_SM_PASID_ENTRY_FLT &&
> -        !vtd_is_fl_level_supported(s, VTD_PE_GET_FL_LEVEL(pe))) {
> +    if (pgtt == VTD_SM_PASID_ENTRY_FST &&
> +        !vtd_is_fs_level_supported(s, VTD_PE_GET_FS_LEVEL(pe))) {
>              return -VTD_FR_PASID_TABLE_ENTRY_INV;
>      }
>  
> @@ -1007,7 +1007,8 @@ static int vtd_ce_get_pasid_fpd(IntelIOMMUState *s,
>      return 0;
>  }
>  
> -/* Get the page-table level that hardware should use for the second-level
> +/*
> + * Get the page-table level that hardware should use for the second-stage
>   * page-table walk from the Address Width field of context-entry.
>   */
>  static inline uint32_t vtd_ce_get_level(VTDContextEntry *ce)
> @@ -1023,10 +1024,10 @@ static uint32_t vtd_get_iova_level(IntelIOMMUState *s,
>  
>      if (s->root_scalable) {
>          vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
> -        if (s->flts) {
> -            return VTD_PE_GET_FL_LEVEL(&pe);
> +        if (s->fsts) {
> +            return VTD_PE_GET_FS_LEVEL(&pe);
>          } else {
> -            return VTD_PE_GET_SL_LEVEL(&pe);
> +            return VTD_PE_GET_SS_LEVEL(&pe);
>          }
>      }
>  
> @@ -1095,7 +1096,7 @@ static inline uint64_t vtd_iova_limit(IntelIOMMUState *s,
>  }
>  
>  /* Return true if IOVA passes range check, otherwise false. */
> -static inline bool vtd_iova_sl_range_check(IntelIOMMUState *s,
> +static inline bool vtd_iova_ss_range_check(IntelIOMMUState *s,
>                                             uint64_t iova, VTDContextEntry *ce,
>                                             uint8_t aw, uint32_t pasid)
>  {
> @@ -1114,14 +1115,14 @@ static dma_addr_t vtd_get_iova_pgtbl_base(IntelIOMMUState *s,
>  
>      if (s->root_scalable) {
>          vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
> -        if (s->flts) {
> -            return pe.val[2] & VTD_SM_PASID_ENTRY_FLPTPTR;
> +        if (s->fsts) {
> +            return pe.val[2] & VTD_SM_PASID_ENTRY_FSPTPTR;
>          } else {
> -            return pe.val[0] & VTD_SM_PASID_ENTRY_SLPTPTR;
> +            return pe.val[0] & VTD_SM_PASID_ENTRY_SSPTPTR;
>          }
>      }
>  
> -    return vtd_ce_get_slpt_base(ce);
> +    return vtd_ce_get_sspt_base(ce);
>  }
>  
>  /*
> @@ -1136,13 +1137,13 @@ static dma_addr_t vtd_get_iova_pgtbl_base(IntelIOMMUState *s,
>  static uint64_t vtd_spte_rsvd[VTD_SPTE_RSVD_LEN];
>  static uint64_t vtd_spte_rsvd_large[VTD_SPTE_RSVD_LEN];
>  
> -static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
> +static bool vtd_sspte_nonzero_rsvd(uint64_t sspte, uint32_t level)
>  {
>      uint64_t rsvd_mask;
>  
>      /*
>       * We should have caught a guest-mis-programmed level earlier,
> -     * via vtd_is_sl_level_supported.
> +     * via vtd_is_ss_level_supported.
>       */
>      assert(level < VTD_SPTE_RSVD_LEN);
>      /*
> @@ -1152,46 +1153,47 @@ static bool vtd_slpte_nonzero_rsvd(uint64_t slpte, uint32_t level)
>      assert(level);
>  
>      if ((level == VTD_PD_LEVEL || level == VTD_PDP_LEVEL) &&
> -        (slpte & VTD_PT_PAGE_SIZE_MASK)) {
> +        (sspte & VTD_PT_PAGE_SIZE_MASK)) {
>          /* large page */
>          rsvd_mask = vtd_spte_rsvd_large[level];
>      } else {
>          rsvd_mask = vtd_spte_rsvd[level];
>      }
>  
> -    return slpte & rsvd_mask;
> +    return sspte & rsvd_mask;
>  }
>  
> -/* Given the @iova, get relevant @slptep. @slpte_level will be the last level
> +/*
> + * Given the @iova, get relevant @ssptep. @sspte_level will be the last level
>   * of the translation, can be used for deciding the size of large page.
>   */
> -static int vtd_iova_to_slpte(IntelIOMMUState *s, VTDContextEntry *ce,
> +static int vtd_iova_to_sspte(IntelIOMMUState *s, VTDContextEntry *ce,
>                               uint64_t iova, bool is_write,
> -                             uint64_t *slptep, uint32_t *slpte_level,
> +                             uint64_t *ssptep, uint32_t *sspte_level,
>                               bool *reads, bool *writes, uint8_t aw_bits,
>                               uint32_t pasid)
>  {
>      dma_addr_t addr = vtd_get_iova_pgtbl_base(s, ce, pasid);
>      uint32_t level = vtd_get_iova_level(s, ce, pasid);
>      uint32_t offset;
> -    uint64_t slpte;
> +    uint64_t sspte;
>      uint64_t access_right_check;
>  
> -    if (!vtd_iova_sl_range_check(s, iova, ce, aw_bits, pasid)) {
> +    if (!vtd_iova_ss_range_check(s, iova, ce, aw_bits, pasid)) {
>          error_report_once("%s: detected IOVA overflow (iova=0x%" PRIx64 ","
>                            "pasid=0x%" PRIx32 ")", __func__, iova, pasid);
>          return -VTD_FR_ADDR_BEYOND_MGAW;
>      }
>  
>      /* FIXME: what is the Atomics request here? */
> -    access_right_check = is_write ? VTD_SL_W : VTD_SL_R;
> +    access_right_check = is_write ? VTD_SS_W : VTD_SS_R;
>  
>      while (true) {
>          offset = vtd_iova_level_offset(iova, level);
> -        slpte = vtd_get_pte(addr, offset);
> +        sspte = vtd_get_pte(addr, offset);
>  
> -        if (slpte == (uint64_t)-1) {
> -            error_report_once("%s: detected read error on DMAR slpte "
> +        if (sspte == (uint64_t)-1) {
> +            error_report_once("%s: detected read error on DMAR sspte "
>                                "(iova=0x%" PRIx64 ", pasid=0x%" PRIx32 ")",
>                                __func__, iova, pasid);
>              if (level == vtd_get_iova_level(s, ce, pasid)) {
> @@ -1201,30 +1203,30 @@ static int vtd_iova_to_slpte(IntelIOMMUState *s, VTDContextEntry *ce,
>                  return -VTD_FR_PAGING_ENTRY_INV;
>              }
>          }
> -        *reads = (*reads) && (slpte & VTD_SL_R);
> -        *writes = (*writes) && (slpte & VTD_SL_W);
> -        if (!(slpte & access_right_check)) {
> -            error_report_once("%s: detected slpte permission error "
> +        *reads = (*reads) && (sspte & VTD_SS_R);
> +        *writes = (*writes) && (sspte & VTD_SS_W);
> +        if (!(sspte & access_right_check)) {
> +            error_report_once("%s: detected sspte permission error "
>                                "(iova=0x%" PRIx64 ", level=0x%" PRIx32 ", "
> -                              "slpte=0x%" PRIx64 ", write=%d, pasid=0x%"
> +                              "sspte=0x%" PRIx64 ", write=%d, pasid=0x%"
>                                PRIx32 ")", __func__, iova, level,
> -                              slpte, is_write, pasid);
> +                              sspte, is_write, pasid);
>              return is_write ? -VTD_FR_WRITE : -VTD_FR_READ;
>          }
> -        if (vtd_slpte_nonzero_rsvd(slpte, level)) {
> +        if (vtd_sspte_nonzero_rsvd(sspte, level)) {
>              error_report_once("%s: detected splte reserve non-zero "
>                                "iova=0x%" PRIx64 ", level=0x%" PRIx32
> -                              "slpte=0x%" PRIx64 ", pasid=0x%" PRIX32 ")",
> -                              __func__, iova, level, slpte, pasid);
> +                              "sspte=0x%" PRIx64 ", pasid=0x%" PRIX32 ")",
> +                              __func__, iova, level, sspte, pasid);
>              return -VTD_FR_PAGING_ENTRY_RSVD;
>          }
>  
> -        if (vtd_is_last_pte(slpte, level)) {
> -            *slptep = slpte;
> -            *slpte_level = level;
> +        if (vtd_is_last_pte(sspte, level)) {
> +            *ssptep = sspte;
> +            *sspte_level = level;
>              break;
>          }
> -        addr = vtd_get_pte_addr(slpte, aw_bits);
> +        addr = vtd_get_pte_addr(sspte, aw_bits);
>          level--;
>      }
>  
> @@ -1350,7 +1352,7 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
>  {
>      bool read_cur, write_cur, entry_valid;
>      uint32_t offset;
> -    uint64_t slpte;
> +    uint64_t sspte;
>      uint64_t subpage_size, subpage_mask;
>      IOMMUTLBEvent event;
>      uint64_t iova = start;
> @@ -1366,21 +1368,21 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
>          iova_next = (iova & subpage_mask) + subpage_size;
>  
>          offset = vtd_iova_level_offset(iova, level);
> -        slpte = vtd_get_pte(addr, offset);
> +        sspte = vtd_get_pte(addr, offset);
>  
> -        if (slpte == (uint64_t)-1) {
> +        if (sspte == (uint64_t)-1) {
>              trace_vtd_page_walk_skip_read(iova, iova_next);
>              goto next;
>          }
>  
> -        if (vtd_slpte_nonzero_rsvd(slpte, level)) {
> +        if (vtd_sspte_nonzero_rsvd(sspte, level)) {
>              trace_vtd_page_walk_skip_reserve(iova, iova_next);
>              goto next;
>          }
>  
>          /* Permissions are stacked with parents' */
> -        read_cur = read && (slpte & VTD_SL_R);
> -        write_cur = write && (slpte & VTD_SL_W);
> +        read_cur = read && (sspte & VTD_SS_R);
> +        write_cur = write && (sspte & VTD_SS_W);
>  
>          /*
>           * As long as we have either read/write permission, this is a
> @@ -1389,12 +1391,12 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
>           */
>          entry_valid = read_cur | write_cur;
>  
> -        if (!vtd_is_last_pte(slpte, level) && entry_valid) {
> +        if (!vtd_is_last_pte(sspte, level) && entry_valid) {
>              /*
>               * This is a valid PDE (or even bigger than PDE).  We need
>               * to walk one further level.
>               */
> -            ret = vtd_page_walk_level(vtd_get_pte_addr(slpte, info->aw),
> +            ret = vtd_page_walk_level(vtd_get_pte_addr(sspte, info->aw),
>                                        iova, MIN(iova_next, end), level - 1,
>                                        read_cur, write_cur, info);
>          } else {
> @@ -1411,7 +1413,7 @@ static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
>              event.entry.perm = IOMMU_ACCESS_FLAG(read_cur, write_cur);
>              event.entry.addr_mask = ~subpage_mask;
>              /* NOTE: this is only meaningful if entry_valid == true */
> -            event.entry.translated_addr = vtd_get_pte_addr(slpte, info->aw);
> +            event.entry.translated_addr = vtd_get_pte_addr(sspte, info->aw);
>              event.type = event.entry.perm ? IOMMU_NOTIFIER_MAP :
>                                              IOMMU_NOTIFIER_UNMAP;
>              ret = vtd_page_walk_one(&event, info);
> @@ -1445,11 +1447,11 @@ static int vtd_page_walk(IntelIOMMUState *s, VTDContextEntry *ce,
>      dma_addr_t addr = vtd_get_iova_pgtbl_base(s, ce, pasid);
>      uint32_t level = vtd_get_iova_level(s, ce, pasid);
>  
> -    if (!vtd_iova_sl_range_check(s, start, ce, info->aw, pasid)) {
> +    if (!vtd_iova_ss_range_check(s, start, ce, info->aw, pasid)) {
>          return -VTD_FR_ADDR_BEYOND_MGAW;
>      }
>  
> -    if (!vtd_iova_sl_range_check(s, end, ce, info->aw, pasid)) {
> +    if (!vtd_iova_ss_range_check(s, end, ce, info->aw, pasid)) {
>          /* Fix end so that it reaches the maximum */
>          end = vtd_iova_limit(s, ce, info->aw, pasid);
>      }
> @@ -1563,7 +1565,7 @@ static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
>  
>      /* Check if the programming of context-entry is valid */
>      if (!s->root_scalable &&
> -        !vtd_is_sl_level_supported(s, vtd_ce_get_level(ce))) {
> +        !vtd_is_ss_level_supported(s, vtd_ce_get_level(ce))) {
>          error_report_once("%s: invalid context entry: hi=%"PRIx64
>                            ", lo=%"PRIx64" (level %d not supported)",
>                            __func__, ce->hi, ce->lo,
> @@ -1670,10 +1672,9 @@ static int vtd_address_space_sync(VTDAddressSpace *vtd_as)
>  }
>  
>  /*
> - * Check if specific device is configured to bypass address
> - * translation for DMA requests. In Scalable Mode, bypass
> - * 1st-level translation or 2nd-level translation, it depends
> - * on PGTT setting.
> + * Check if a specific device is configured to bypass address translation
> + * for DMA requests. In Scalable Mode, whether first stage or second stage
> + * translation is bypassed depends on the PGTT setting.
>   */
>  static bool vtd_dev_pt_enabled(IntelIOMMUState *s, VTDContextEntry *ce,
>                                 uint32_t pasid)
> @@ -1910,13 +1911,13 @@ out:
>  static uint64_t vtd_fpte_rsvd[VTD_FPTE_RSVD_LEN];
>  static uint64_t vtd_fpte_rsvd_large[VTD_FPTE_RSVD_LEN];
>  
> -static bool vtd_flpte_nonzero_rsvd(uint64_t flpte, uint32_t level)
> +static bool vtd_fspte_nonzero_rsvd(uint64_t fspte, uint32_t level)
>  {
>      uint64_t rsvd_mask;
>  
>      /*
>       * We should have caught a guest-mis-programmed level earlier,
> -     * via vtd_is_fl_level_supported.
> +     * via vtd_is_fs_level_supported.
>       */
>      assert(level < VTD_FPTE_RSVD_LEN);
>      /*
> @@ -1926,23 +1927,23 @@ static bool vtd_flpte_nonzero_rsvd(uint64_t flpte, uint32_t level)
>      assert(level);
>  
>      if ((level == VTD_PD_LEVEL || level == VTD_PDP_LEVEL) &&
> -        (flpte & VTD_PT_PAGE_SIZE_MASK)) {
> +        (fspte & VTD_PT_PAGE_SIZE_MASK)) {
>          /* large page */
>          rsvd_mask = vtd_fpte_rsvd_large[level];
>      } else {
>          rsvd_mask = vtd_fpte_rsvd[level];
>      }
>  
> -    return flpte & rsvd_mask;
> +    return fspte & rsvd_mask;
>  }
>  
> -static inline bool vtd_flpte_present(uint64_t flpte)
> +static inline bool vtd_fspte_present(uint64_t fspte)
>  {
> -    return !!(flpte & VTD_FL_P);
> +    return !!(fspte & VTD_FS_P);
>  }
>  
>  /* Return true if IOVA is canonical, otherwise false. */
> -static bool vtd_iova_fl_check_canonical(IntelIOMMUState *s, uint64_t iova,
> +static bool vtd_iova_fs_check_canonical(IntelIOMMUState *s, uint64_t iova,
>                                          VTDContextEntry *ce, uint32_t pasid)
>  {
>      uint64_t iova_limit = vtd_iova_limit(s, ce, s->aw_bits, pasid);
> @@ -1972,32 +1973,32 @@ static MemTxResult vtd_set_flag_in_pte(dma_addr_t base_addr, uint32_t index,
>  }
>  
>  /*
> - * Given the @iova, get relevant @flptep. @flpte_level will be the last level
> + * Given the @iova, get relevant @fsptep. @fspte_level will be the last level
>   * of the translation, can be used for deciding the size of large page.
>   */
> -static int vtd_iova_to_flpte(IntelIOMMUState *s, VTDContextEntry *ce,
> +static int vtd_iova_to_fspte(IntelIOMMUState *s, VTDContextEntry *ce,
>                               uint64_t iova, bool is_write,
> -                             uint64_t *flptep, uint32_t *flpte_level,
> +                             uint64_t *fsptep, uint32_t *fspte_level,
>                               bool *reads, bool *writes, uint8_t aw_bits,
>                               uint32_t pasid)
>  {
>      dma_addr_t addr = vtd_get_iova_pgtbl_base(s, ce, pasid);
>      uint32_t offset;
> -    uint64_t flpte, flag_ad = VTD_FL_A;
> -    *flpte_level = vtd_get_iova_level(s, ce, pasid);
> +    uint64_t fspte, flag_ad = VTD_FS_A;
> +    *fspte_level = vtd_get_iova_level(s, ce, pasid);
>  
> -    if (!vtd_iova_fl_check_canonical(s, iova, ce, pasid)) {
> +    if (!vtd_iova_fs_check_canonical(s, iova, ce, pasid)) {
>          error_report_once("%s: detected non canonical IOVA (iova=0x%" PRIx64 ","
>                            "pasid=0x%" PRIx32 ")", __func__, iova, pasid);
>          return -VTD_FR_FS_NON_CANONICAL;
>      }
>  
>      while (true) {
> -        offset = vtd_iova_level_offset(iova, *flpte_level);
> -        flpte = vtd_get_pte(addr, offset);
> +        offset = vtd_iova_level_offset(iova, *fspte_level);
> +        fspte = vtd_get_pte(addr, offset);
>  
> -        if (flpte == (uint64_t)-1) {
> -            if (*flpte_level == vtd_get_iova_level(s, ce, pasid)) {
> +        if (fspte == (uint64_t)-1) {
> +            if (*fspte_level == vtd_get_iova_level(s, ce, pasid)) {
>                  /* Invalid programming of pasid-entry */
>                  return -VTD_FR_PASID_ENTRY_FSPTPTR_INV;
>              } else {
> @@ -2005,47 +2006,47 @@ static int vtd_iova_to_flpte(IntelIOMMUState *s, VTDContextEntry *ce,
>              }
>          }
>  
> -        if (!vtd_flpte_present(flpte)) {
> +        if (!vtd_fspte_present(fspte)) {
>              *reads = false;
>              *writes = false;
>              return -VTD_FR_FS_PAGING_ENTRY_P;
>          }
>  
>          /* No emulated device supports supervisor privilege request yet */
> -        if (!(flpte & VTD_FL_US)) {
> +        if (!(fspte & VTD_FS_US)) {
>              *reads = false;
>              *writes = false;
>              return -VTD_FR_FS_PAGING_ENTRY_US;
>          }
>  
>          *reads = true;
> -        *writes = (*writes) && (flpte & VTD_FL_RW);
> -        if (is_write && !(flpte & VTD_FL_RW)) {
> +        *writes = (*writes) && (fspte & VTD_FS_RW);
> +        if (is_write && !(fspte & VTD_FS_RW)) {
>              return -VTD_FR_SM_WRITE;
>          }
> -        if (vtd_flpte_nonzero_rsvd(flpte, *flpte_level)) {
> -            error_report_once("%s: detected flpte reserved non-zero "
> +        if (vtd_fspte_nonzero_rsvd(fspte, *fspte_level)) {
> +            error_report_once("%s: detected fspte reserved non-zero "
>                                "iova=0x%" PRIx64 ", level=0x%" PRIx32
> -                              "flpte=0x%" PRIx64 ", pasid=0x%" PRIX32 ")",
> -                              __func__, iova, *flpte_level, flpte, pasid);
> +                              "fspte=0x%" PRIx64 ", pasid=0x%" PRIX32 ")",
> +                              __func__, iova, *fspte_level, fspte, pasid);
>              return -VTD_FR_FS_PAGING_ENTRY_RSVD;
>          }
>  
> -        if (vtd_is_last_pte(flpte, *flpte_level) && is_write) {
> -            flag_ad |= VTD_FL_D;
> +        if (vtd_is_last_pte(fspte, *fspte_level) && is_write) {
> +            flag_ad |= VTD_FS_D;
>          }
>  
> -        if (vtd_set_flag_in_pte(addr, offset, flpte, flag_ad) != MEMTX_OK) {
> +        if (vtd_set_flag_in_pte(addr, offset, fspte, flag_ad) != MEMTX_OK) {
>              return -VTD_FR_FS_BIT_UPDATE_FAILED;
>          }
>  
> -        if (vtd_is_last_pte(flpte, *flpte_level)) {
> -            *flptep = flpte;
> +        if (vtd_is_last_pte(fspte, *fspte_level)) {
> +            *fsptep = fspte;
>              return 0;
>          }
>  
> -        addr = vtd_get_pte_addr(flpte, aw_bits);
> -        (*flpte_level)--;
> +        addr = vtd_get_pte_addr(fspte, aw_bits);
> +        (*fspte_level)--;
>      }
>  }
>  
> @@ -2199,14 +2200,14 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
>          }
>      }
>  
> -    if (s->flts && s->root_scalable) {
> -        ret_fr = vtd_iova_to_flpte(s, &ce, addr, is_write, &pte, &level,
> +    if (s->fsts && s->root_scalable) {
> +        ret_fr = vtd_iova_to_fspte(s, &ce, addr, is_write, &pte, &level,
>                                     &reads, &writes, s->aw_bits, pasid);
> -        pgtt = VTD_SM_PASID_ENTRY_FLT;
> +        pgtt = VTD_SM_PASID_ENTRY_FST;
>      } else {
> -        ret_fr = vtd_iova_to_slpte(s, &ce, addr, is_write, &pte, &level,
> +        ret_fr = vtd_iova_to_sspte(s, &ce, addr, is_write, &pte, &level,
>                                     &reads, &writes, s->aw_bits, pasid);
> -        pgtt = VTD_SM_PASID_ENTRY_SLT;
> +        pgtt = VTD_SM_PASID_ENTRY_SST;
>      }
>      if (!ret_fr) {
>          xlat = vtd_get_pte_addr(pte, s->aw_bits);
> @@ -2474,13 +2475,13 @@ static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
>  
>              if (vtd_as_has_map_notifier(vtd_as)) {
>                  /*
> -                 * When stage-1 translation is off, as long as we have MAP
> +                 * When first stage translation is off, as long as we have MAP
>                   * notifications registered in any of our IOMMU notifiers,
>                   * we need to sync the shadow page table. Otherwise VFIO
>                   * device attaches to nested page table instead of shadow
>                   * page table, so no need to sync.
>                   */
> -                if (!s->flts || !s->root_scalable) {
> +                if (!s->fsts || !s->root_scalable) {
>                      vtd_sync_shadow_page_table_range(vtd_as, &ce, addr, size);
>                  }
>              } else {
> @@ -2972,7 +2973,7 @@ static void vtd_piotlb_pasid_invalidate(IntelIOMMUState *s,
>                  continue;
>              }
>  
> -            if (!s->flts || !vtd_as_has_map_notifier(vtd_as)) {
> +            if (!s->fsts || !vtd_as_has_map_notifier(vtd_as)) {
>                  vtd_address_space_sync(vtd_as);
>              }
>          }
> @@ -3818,7 +3819,7 @@ static const Property vtd_properties[] = {
>                        VTD_HOST_ADDRESS_WIDTH),
>      DEFINE_PROP_BOOL("caching-mode", IntelIOMMUState, caching_mode, FALSE),
>      DEFINE_PROP_BOOL("x-scalable-mode", IntelIOMMUState, scalable_mode, FALSE),
> -    DEFINE_PROP_BOOL("x-flts", IntelIOMMUState, flts, FALSE),
> +    DEFINE_PROP_BOOL("x-flts", IntelIOMMUState, fsts, FALSE),
>      DEFINE_PROP_BOOL("snoop-control", IntelIOMMUState, snoop_control, false),
>      DEFINE_PROP_BOOL("x-pasid-mode", IntelIOMMUState, pasid, false),
>      DEFINE_PROP_BOOL("dma-drain", IntelIOMMUState, dma_drain, true),
> @@ -4344,12 +4345,13 @@ static bool vtd_check_hiod(IntelIOMMUState *s, HostIOMMUDevice *hiod,
>          return false;
>      }
>  
> -    if (!s->flts) {
> -        /* All checks requested by VTD stage-2 translation pass */
> +    if (!s->fsts) {
> +        /* All checks requested by VTD second stage translation pass */
>          return true;
>      }
>  
> -    error_setg(errp, "host device is uncompatible with stage-1 translation");
> +    error_setg(errp,
> +               "host device is incompatible with first stage translation");
>      return false;
>  }
>  
> @@ -4535,7 +4537,7 @@ static void vtd_cap_init(IntelIOMMUState *s)
>      X86IOMMUState *x86_iommu = X86_IOMMU_DEVICE(s);
>  
>      s->cap = VTD_CAP_FRO | VTD_CAP_NFR | VTD_CAP_ND |
> -             VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS |
> +             VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SSLPS |
>               VTD_CAP_MGAW(s->aw_bits);
>      if (s->dma_drain) {
>          s->cap |= VTD_CAP_DRAIN;
> @@ -4571,13 +4573,13 @@ static void vtd_cap_init(IntelIOMMUState *s)
>      }
>  
>      /* TODO: read cap/ecap from host to decide which cap to be exposed. */
> -    if (s->flts) {
> -        s->ecap |= VTD_ECAP_SMTS | VTD_ECAP_FLTS;
> +    if (s->fsts) {
> +        s->ecap |= VTD_ECAP_SMTS | VTD_ECAP_FSTS;
>          if (s->fs1gp) {
>              s->cap |= VTD_CAP_FS1GP;
>          }
>      } else if (s->scalable_mode) {
> -        s->ecap |= VTD_ECAP_SMTS | VTD_ECAP_SRS | VTD_ECAP_SLTS;
> +        s->ecap |= VTD_ECAP_SMTS | VTD_ECAP_SRS | VTD_ECAP_SSTS;
>      }
>  
>      if (s->snoop_control) {
> @@ -4864,12 +4866,12 @@ static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
>          }
>      }
>  
> -    if (!s->scalable_mode && s->flts) {
> +    if (!s->scalable_mode && s->fsts) {
>          error_setg(errp, "x-flts is only available in scalable mode");
>          return false;
>      }
>  
> -    if (!s->flts && s->aw_bits != VTD_HOST_AW_39BIT &&
> +    if (!s->fsts && s->aw_bits != VTD_HOST_AW_39BIT &&
>          s->aw_bits != VTD_HOST_AW_48BIT) {
>          error_setg(errp, "%s: supported values for aw-bits are: %d, %d",
>                     s->scalable_mode ? "Scalable mode(flts=off)" : "Legacy mode",
> @@ -4877,7 +4879,7 @@ static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
>          return false;
>      }
>  
> -    if (s->flts && s->aw_bits != VTD_HOST_AW_48BIT) {
> +    if (s->fsts && s->aw_bits != VTD_HOST_AW_48BIT) {
>          error_setg(errp,
>                     "Scalable mode(flts=on): supported value for aw-bits is: %d",
>                     VTD_HOST_AW_48BIT);
> diff --git a/tests/qtest/intel-iommu-test.c b/tests/qtest/intel-iommu-test.c
> index c521b3796e..e5cc6acaf0 100644
> --- a/tests/qtest/intel-iommu-test.c
> +++ b/tests/qtest/intel-iommu-test.c
> @@ -13,9 +13,9 @@
>  #include "hw/i386/intel_iommu_internal.h"
>  
>  #define CAP_STAGE_1_FIXED1    (VTD_CAP_FRO | VTD_CAP_NFR | VTD_CAP_ND | \
> -                              VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SLLPS)
> +                              VTD_CAP_MAMV | VTD_CAP_PSI | VTD_CAP_SSLPS)
>  #define ECAP_STAGE_1_FIXED1   (VTD_ECAP_QI |  VTD_ECAP_IR | VTD_ECAP_IRO | \
> -                              VTD_ECAP_MHMV | VTD_ECAP_SMTS | VTD_ECAP_FLTS)
> +                              VTD_ECAP_MHMV | VTD_ECAP_SMTS | VTD_ECAP_FSTS)
>  
>  static inline uint64_t vtd_reg_readq(QTestState *s, uint64_t offset)
>  {




* Re: [PATCH v6 02/22] intel_iommu: Delete RPS capability related supporting code
  2025-09-18  8:57 ` [PATCH v6 02/22] intel_iommu: Delete RPS capability related supporting code Zhenzhong Duan
@ 2025-09-30 13:49   ` Eric Auger
  2025-10-09 10:10     ` Duan, Zhenzhong
  0 siblings, 1 reply; 57+ messages in thread
From: Eric Auger @ 2025-09-30 13:49 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, mst, jasowang, peterx, ddutile, jgg,
	nicolinc, skolothumtho, joao.m.martins, clement.mathieu--drif,
	kevin.tian, yi.l.liu, chao.p.peng

Hi Zhenzhong,

On 9/18/25 10:57 AM, Zhenzhong Duan wrote:
> RID-PASID Support(RPS) is not set in vIOMMU ECAP register, the supporting
> code is there but never take effect.
takes
>
> Meanwhile, according to VTD spec section 3.4.3:
> "Implementations not supporting RID_PASID capability (ECAP_REG.RPS is 0b),
> use a PASID value of 0 to perform address translation for requests without
> PASID."
>
> We should delete the supporting code which fetches RID_PASID field from
> scalable context entry and use 0 as RID_PASID directly, because RID_PASID
> field is ignored if no RPS support according to spec.
>
> This simplify the code and doesn't bring any penalty.
simplifies
>
> Opportunistically, s/rid2pasid/rid_pasid and s/RID2PASID/RID_PASID as
> VTD spec uses RID_PASID terminology.
>
> Suggested-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>  hw/i386/intel_iommu_internal.h |  1 -
>  hw/i386/intel_iommu.c          | 49 +++++++++++++---------------------
>  2 files changed, 19 insertions(+), 31 deletions(-)
>
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> index 360e937989..6abe76556a 100644
> --- a/hw/i386/intel_iommu_internal.h
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -547,7 +547,6 @@ typedef struct VTDRootEntry VTDRootEntry;
>  #define VTD_CTX_ENTRY_LEGACY_SIZE     16
>  #define VTD_CTX_ENTRY_SCALABLE_SIZE   32
>  
> -#define VTD_SM_CONTEXT_ENTRY_RID2PASID_MASK 0xfffff
>  #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw)  (0x1e0ULL | ~VTD_HAW_MASK(aw))
>  #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1      0xffffffffffe00000ULL
>  
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 71b70b795d..b976b251bc 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -41,8 +41,7 @@
>  #include "trace.h"
>  
>  /* context entry operations */
> -#define VTD_CE_GET_RID2PASID(ce) \
> -    ((ce)->val[1] & VTD_SM_CONTEXT_ENTRY_RID2PASID_MASK)
> +#define RID_PASID    0
I would call that RID_PASID_0 to make it more explicit in the code,
or, since it is a PASID, PASID_0 would do the job too.
>  #define VTD_CE_GET_PASID_DIR_TABLE(ce) \
>      ((ce)->val[0] & VTD_PASID_DIR_BASE_ADDR_MASK)
>  
> @@ -951,7 +950,7 @@ static int vtd_ce_get_pasid_entry(IntelIOMMUState *s, VTDContextEntry *ce,
>      int ret = 0;
>  
>      if (pasid == PCI_NO_PASID) {
> -        pasid = VTD_CE_GET_RID2PASID(ce);
> +        pasid = RID_PASID;
>      }
>      pasid_dir_base = VTD_CE_GET_PASID_DIR_TABLE(ce);
>      ret = vtd_get_pe_from_pasid_table(s, pasid_dir_base, pasid, pe);
> @@ -970,7 +969,7 @@ static int vtd_ce_get_pasid_fpd(IntelIOMMUState *s,
>      VTDPASIDEntry pe;
>  
>      if (pasid == PCI_NO_PASID) {
> -        pasid = VTD_CE_GET_RID2PASID(ce);
> +        pasid = RID_PASID;
>      }
>      pasid_dir_base = VTD_CE_GET_PASID_DIR_TABLE(ce);
>  
> @@ -1510,15 +1509,14 @@ static inline int vtd_context_entry_rsvd_bits_check(IntelIOMMUState *s,
>      return 0;
>  }
>  
> -static int vtd_ce_rid2pasid_check(IntelIOMMUState *s,
> +static int vtd_ce_rid_pasid_check(IntelIOMMUState *s,
>                                    VTDContextEntry *ce)
>  {
>      VTDPASIDEntry pe;
>  
>      /*
>       * Make sure in Scalable Mode, a present context entry
> -     * has valid rid2pasid setting, which includes valid
> -     * rid2pasid field and corresponding pasid entry setting
> +     * has valid pasid entry setting at RID_PASID(0).
s/at RID_PASID(0) /for PASID_0?
>       */
>      return vtd_ce_get_pasid_entry(s, ce, &pe, PCI_NO_PASID);
>  }
> @@ -1581,12 +1579,11 @@ static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
>          }
>      } else {
>          /*
> -         * Check if the programming of context-entry.rid2pasid
> -         * and corresponding pasid setting is valid, and thus
> -         * avoids to check pasid entry fetching result in future
> -         * helper function calling.
> +         * Check if the programming of pasid setting at RID_PASID(0)
of pasid 0?
> +         * is valid, and thus avoids to check pasid entry fetching
> +         * result in future helper function calling.
>           */
> -        ret_fr = vtd_ce_rid2pasid_check(s, ce);
> +        ret_fr = vtd_ce_rid_pasid_check(s, ce);
>          if (ret_fr) {
>              return ret_fr;
>          }
> @@ -2097,7 +2094,7 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
>      bool reads = true;
>      bool writes = true;
>      uint8_t access_flags, pgtt;
> -    bool rid2pasid = (pasid == PCI_NO_PASID) && s->root_scalable;
> +    bool rid_pasid = (pasid == PCI_NO_PASID) && s->root_scalable;
I am not keen on the rid_pasid name. It does not tell what the
semantics of the variable are. rid_pasid is an actual field in the CE.
Does that check whether we face a request without pasid in scalable
mode? If so I would call that request_wo_pasid_sm or something alike
>      VTDIOTLBEntry *iotlb_entry;
>      uint64_t xlat, size;
>  
> @@ -2111,8 +2108,8 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
>  
>      cc_entry = &vtd_as->context_cache_entry;
>  
> -    /* Try to fetch pte from IOTLB, we don't need RID2PASID logic */
> -    if (!rid2pasid) {
> +    /* Try to fetch pte from IOTLB, we don't need RID_PASID(0) logic */
It is unclear what the "RID_PASID(0) logic" is. All the more so we now
just have to set the pasid to PASID_0.
> +    if (!rid_pasid) {
>          iotlb_entry = vtd_lookup_iotlb(s, source_id, pasid, addr);
>          if (iotlb_entry) {
>              trace_vtd_iotlb_page_hit(source_id, addr, iotlb_entry->pte,
> @@ -2160,8 +2157,8 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
>          cc_entry->context_cache_gen = s->context_cache_gen;
>      }
>  
> -    if (rid2pasid) {
> -        pasid = VTD_CE_GET_RID2PASID(&ce);
> +    if (rid_pasid) {
> +        pasid = RID_PASID;
>      }
>  
>      /*
> @@ -2189,8 +2186,8 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
>          return true;
>      }
>  
> -    /* Try to fetch pte from IOTLB for RID2PASID slow path */
> -    if (rid2pasid) {
> +    /* Try to fetch pte from IOTLB for RID_PASID(0) slow path */
PASID_0?
> +    if (rid_pasid) {
>          iotlb_entry = vtd_lookup_iotlb(s, source_id, pasid, addr);
>          if (iotlb_entry) {
>              trace_vtd_iotlb_page_hit(source_id, addr, iotlb_entry->pte,
> @@ -2464,20 +2461,14 @@ static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
>          ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
>                                         vtd_as->devfn, &ce);
>          if (!ret && domain_id == vtd_get_domain_id(s, &ce, vtd_as->pasid)) {
> -            uint32_t rid2pasid = PCI_NO_PASID;
> -
> -            if (s->root_scalable) {
> -                rid2pasid = VTD_CE_GET_RID2PASID(&ce);
> -            }
> -
>              /*
>               * In legacy mode, vtd_as->pasid == pasid is always true.
>               * In scalable mode, for vtd address space backing a PCI
>               * device without pasid, needs to compare pasid with
> -             * rid2pasid of this device.
> +             * RID_PASID(0) of this device.
>               */
>              if (!(vtd_as->pasid == pasid ||
> -                  (vtd_as->pasid == PCI_NO_PASID && pasid == rid2pasid))) {
> +                  (vtd_as->pasid == PCI_NO_PASID && pasid == RID_PASID))) {
would strongly suggest using PASID_0 instead
>                  continue;
>              }
>  
> @@ -2976,9 +2967,7 @@ static void vtd_piotlb_pasid_invalidate(IntelIOMMUState *s,
>          if (!vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
>                                        vtd_as->devfn, &ce) &&
>              domain_id == vtd_get_domain_id(s, &ce, vtd_as->pasid)) {
> -            uint32_t rid2pasid = VTD_CE_GET_RID2PASID(&ce);
> -
> -            if ((vtd_as->pasid != PCI_NO_PASID || pasid != rid2pasid) &&
> +            if ((vtd_as->pasid != PCI_NO_PASID || pasid != RID_PASID) &&
>                  vtd_as->pasid != pasid) {
>                  continue;
>              }
Thanks

Eric




* Re: [PATCH v6 05/22] hw/pci: Introduce pci_device_get_viommu_flags()
  2025-09-26  2:54         ` Duan, Zhenzhong
@ 2025-09-30 13:55           ` Eric Auger
  0 siblings, 0 replies; 57+ messages in thread
From: Eric Auger @ 2025-09-30 13:55 UTC (permalink / raw)
  To: Duan, Zhenzhong, Nicolin Chen
  Cc: qemu-devel@nongnu.org, alex.williamson@redhat.com, clg@redhat.com,
	mst@redhat.com, jasowang@redhat.com, peterx@redhat.com,
	ddutile@redhat.com, jgg@nvidia.com, skolothumtho@nvidia.com,
	joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
	Tian, Kevin, Liu, Yi L, Peng, Chao P



On 9/26/25 4:54 AM, Duan, Zhenzhong wrote:
>
>> -----Original Message-----
>> From: Nicolin Chen <nicolinc@nvidia.com>
>> Subject: Re: [PATCH v6 05/22] hw/pci: Introduce
>> pci_device_get_viommu_flags()
>>
>> On Wed, Sep 24, 2025 at 07:05:42AM +0000, Duan, Zhenzhong wrote:
>>>> From: Nicolin Chen <nicolinc@nvidia.com>
>>>> Subject: Re: [PATCH v6 05/22] hw/pci: Introduce
>>>>> get_viommu_flags() is designed to return 64bit bitmap of purely
>> vIOMMU
>>>>> flags which are only determined by user's configuration, no host
>>>>> capabilities involved. Reasons are:
>>>>>
>>>>> 1. host may have heterogeneous IOMMUs, each with different capabilities
>>>>> 2. this is migration friendly, return value is consistent between source
>>>>>    and target.
>>>>> 3. host IOMMU capabilities are passed to vIOMMU through
>>>> set_iommu_device()
>>>>>    interface which have to be after attach_device(), when
>>>> get_viommu_flags()
>>>>>    is called in attach_device(), there is no way for vIOMMU to get host
>>>>>    IOMMU capabilities yet, so only pure vIOMMU flags can be
>> returned.
>>>> "no way" sounds too strong..
>>>>
>>>> There is an iommufd_backend_get_device_info() call there. So, we
>>>> could have passed the host IOMMU capabilities to a vIOMMU. Just,
>>>> we chose not to (assuming for migration reason?).
>>> What about 'it's hard for vIOMMU to get host IOMMU...'?
>> vfio-iommufd core code gets all the host IOMMU caps via the vfio
>> device but chooses to not forward to vIOMMU. So, it's neither "no
>> way" nor "hard" :)
> Yes, that would need introducing another callback to forward the caps early,
> which is unnecessarily complex.
>
>> To be honest, I don't feel this very related to be the reason 3
>> to justify for the new op/API. 1 and 2 are quite okay?
>>
>> Having said that, it's probably good to add as a side note:
>>
>> "
>> Note that this op will be invoked at the attach_device() stage, at which
>> point host IOMMU capabilities are not yet forwarded to the vIOMMU through
>> the set_iommu_device() callback that will be after the attach_device().
>>
>> See the below sequence:
>> "
> OK, will drop 3 and add the side note.

With Nicolin's suggestions:
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Eric
>
> Thanks
> Zhenzhong
>




* Re: [PATCH v6 08/22] vfio/iommufd: Force creating nesting parent HWPT
  2025-09-18  8:57 ` [PATCH v6 08/22] vfio/iommufd: Force creating nesting parent HWPT Zhenzhong Duan
@ 2025-09-30 14:19   ` Eric Auger
  2025-10-12 12:33   ` Yi Liu
  1 sibling, 0 replies; 57+ messages in thread
From: Eric Auger @ 2025-09-30 14:19 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, mst, jasowang, peterx, ddutile, jgg,
	nicolinc, skolothumtho, joao.m.martins, clement.mathieu--drif,
	kevin.tian, yi.l.liu, chao.p.peng



On 9/18/25 10:57 AM, Zhenzhong Duan wrote:
> Call pci_device_get_viommu_flags() to check whether the vIOMMU supports
> VIOMMU_FLAG_WANT_NESTING_PARENT.
>
> If yes, create a nesting parent HWPT and add it to the container's hwpt_list,
> letting this parent HWPT cover the entire second stage mappings (GPA=>HPA).
>
> This allows a VFIO passthrough device to directly attach to this default HWPT
> and then to use the system address space and its listener.
>
> Introduce a vfio_device_get_viommu_flags_want_nesting() helper to facilitate
> this implementation.
>
> It is safe to do so because a vIOMMU will be able to fail in set_iommu_device()
> call, if something else related to the VFIO device or vIOMMU isn't compatible.
>
> Suggested-by: Nicolin Chen <nicolinc@nvidia.com>
> Suggested-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>

Eric
> ---
>  include/hw/vfio/vfio-device.h |  2 ++
>  hw/vfio/device.c              | 12 ++++++++++++
>  hw/vfio/iommufd.c             |  9 +++++++++
>  3 files changed, 23 insertions(+)
>
> diff --git a/include/hw/vfio/vfio-device.h b/include/hw/vfio/vfio-device.h
> index e7e6243e2d..a964091135 100644
> --- a/include/hw/vfio/vfio-device.h
> +++ b/include/hw/vfio/vfio-device.h
> @@ -257,6 +257,8 @@ void vfio_device_prepare(VFIODevice *vbasedev, VFIOContainerBase *bcontainer,
>  
>  void vfio_device_unprepare(VFIODevice *vbasedev);
>  
> +bool vfio_device_get_viommu_flags_want_nesting(VFIODevice *vbasedev);
> +
>  int vfio_device_get_region_info(VFIODevice *vbasedev, int index,
>                                  struct vfio_region_info **info);
>  int vfio_device_get_region_info_type(VFIODevice *vbasedev, uint32_t type,
> diff --git a/hw/vfio/device.c b/hw/vfio/device.c
> index 08f12ac31f..620cc78b77 100644
> --- a/hw/vfio/device.c
> +++ b/hw/vfio/device.c
> @@ -23,6 +23,7 @@
>  
>  #include "hw/vfio/vfio-device.h"
>  #include "hw/vfio/pci.h"
> +#include "hw/iommu.h"
>  #include "hw/hw.h"
>  #include "trace.h"
>  #include "qapi/error.h"
> @@ -504,6 +505,17 @@ void vfio_device_unprepare(VFIODevice *vbasedev)
>      vbasedev->bcontainer = NULL;
>  }
>  
> +bool vfio_device_get_viommu_flags_want_nesting(VFIODevice *vbasedev)
> +{
> +    VFIOPCIDevice *vdev = vfio_pci_from_vfio_device(vbasedev);
> +
> +    if (vdev) {
> +        return !!(pci_device_get_viommu_flags(&vdev->parent_obj) &
> +                  VIOMMU_FLAG_WANT_NESTING_PARENT);
> +    }
> +    return false;
> +}
> +
>  /*
>   * Traditional ioctl() based io
>   */
> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
> index 8c27222f75..f1684a39b7 100644
> --- a/hw/vfio/iommufd.c
> +++ b/hw/vfio/iommufd.c
> @@ -379,6 +379,15 @@ static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
>          flags = IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
>      }
>  
> +    /*
> +     * If vIOMMU requests VFIO's cooperation to create nesting parent HWPT,
> +     * force to create it so that it could be reused by vIOMMU to create
> +     * nested HWPT.
> +     */
> +    if (vfio_device_get_viommu_flags_want_nesting(vbasedev)) {
> +        flags |= IOMMU_HWPT_ALLOC_NEST_PARENT;
> +    }
> +
>      if (cpr_is_incoming()) {
>          hwpt_id = vbasedev->cpr.hwpt_id;
>          goto skip_alloc;




* Re: [PATCH v6 09/22] intel_iommu: Stick to system MR for IOMMUFD backed host device when x-fls=on
  2025-09-18  8:57 ` [PATCH v6 09/22] intel_iommu: Stick to system MR for IOMMUFD backed host device when x-fls=on Zhenzhong Duan
@ 2025-09-30 15:04   ` Eric Auger
  2025-10-09 10:10     ` Duan, Zhenzhong
  2025-10-12 12:51   ` Yi Liu
  1 sibling, 1 reply; 57+ messages in thread
From: Eric Auger @ 2025-09-30 15:04 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, mst, jasowang, peterx, ddutile, jgg,
	nicolinc, skolothumtho, joao.m.martins, clement.mathieu--drif,
	kevin.tian, yi.l.liu, chao.p.peng

Hi Zhenzhong,

On 9/18/25 10:57 AM, Zhenzhong Duan wrote:
> When guest enables scalable mode and setup first stage page table, we don't
> want to use IOMMU MR but rather continue using the system MR for IOMMUFD
> backed host device.
>
> Then default HWPT in VFIO contains GPA->HPA mappings which could be reused
> as nesting parent HWPT to construct nested HWPT in vIOMMU.
>
> Suggested-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>  hw/i386/intel_iommu.c | 37 +++++++++++++++++++++++++++++++++++--
>  1 file changed, 35 insertions(+), 2 deletions(-)
>
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index ba40649c85..bd80de1670 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -40,6 +40,7 @@
>  #include "kvm/kvm_i386.h"
>  #include "migration/vmstate.h"
>  #include "trace.h"
> +#include "system/iommufd.h"
>  
>  /* context entry operations */
>  #define RID_PASID    0
> @@ -1702,6 +1703,24 @@ static bool vtd_dev_pt_enabled(IntelIOMMUState *s, VTDContextEntry *ce,
>  
>  }
>  
> +static VTDHostIOMMUDevice *vtd_find_hiod_iommufd(VTDAddressSpace *as)
> +{
> +    IntelIOMMUState *s = as->iommu_state;
> +    struct vtd_as_key key = {
> +        .bus = as->bus,
> +        .devfn = as->devfn,
> +    };
> +    VTDHostIOMMUDevice *vtd_hiod = g_hash_table_lookup(s->vtd_host_iommu_dev,
> +                                                       &key);
> +
> +    if (vtd_hiod && vtd_hiod->hiod &&
> +        object_dynamic_cast(OBJECT(vtd_hiod->hiod),
> +                            TYPE_HOST_IOMMU_DEVICE_IOMMUFD)) {
> +        return vtd_hiod;
> +    }
> +    return NULL;
> +}
> +
>  static bool vtd_as_pt_enabled(VTDAddressSpace *as)
>  {
>      IntelIOMMUState *s;
> @@ -1710,6 +1729,7 @@ static bool vtd_as_pt_enabled(VTDAddressSpace *as)
>      assert(as);
>  
>      s = as->iommu_state;
> +
not needed
>      if (vtd_dev_to_context_entry(s, pci_bus_num(as->bus), as->devfn,
>                                   &ce)) {
>          /*
> @@ -1727,12 +1747,25 @@ static bool vtd_as_pt_enabled(VTDAddressSpace *as)
>  /* Return whether the device is using IOMMU translation. */
>  static bool vtd_switch_address_space(VTDAddressSpace *as)
>  {
> +    IntelIOMMUState *s;
>      bool use_iommu, pt;
>  
>      assert(as);
>  
> -    use_iommu = as->iommu_state->dmar_enabled && !vtd_as_pt_enabled(as);
> -    pt = as->iommu_state->dmar_enabled && vtd_as_pt_enabled(as);
> +    s = as->iommu_state;
nit: init could be done at declaration
> +    use_iommu = s->dmar_enabled && !vtd_as_pt_enabled(as);
> +    pt = s->dmar_enabled && vtd_as_pt_enabled(as);
> +
> +    /*
> +     * When guest enables scalable mode and setup first stage page table,
sets up?
> +     * we stick to system MR for IOMMUFD backed host device. Then its
> +     * default hwpt contains GPA->HPA mappings which is used directly
> +     * if PGTT=PT and used as nesting parent if PGTT=FST. Otherwise
> +     * fallback to original processing.
fall back?
> +     */
> +    if (s->root_scalable && s->fsts && vtd_find_hiod_iommufd(as)) {
> +        use_iommu = false;
> +    }
>  
>      trace_vtd_switch_address_space(pci_bus_num(as->bus),
>                                     VTD_PCI_SLOT(as->devfn),
Besides
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Eric





* RE: [PATCH v6 02/22] intel_iommu: Delete RPS capability related supporting code
  2025-09-30 13:49   ` Eric Auger
@ 2025-10-09 10:10     ` Duan, Zhenzhong
  2025-10-12 12:30       ` Yi Liu
  0 siblings, 1 reply; 57+ messages in thread
From: Duan, Zhenzhong @ 2025-10-09 10:10 UTC (permalink / raw)
  To: eric.auger@redhat.com, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
	jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
	jgg@nvidia.com, nicolinc@nvidia.com, skolothumtho@nvidia.com,
	joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
	Tian,  Kevin, Liu, Yi L, Peng, Chao P

Hi Eric,

>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: Re: [PATCH v6 02/22] intel_iommu: Delete RPS capability related
>supporting code
>
>Hi Zhenzhong,
>
>On 9/18/25 10:57 AM, Zhenzhong Duan wrote:
>> RID-PASID Support(RPS) is not set in vIOMMU ECAP register, the supporting
>> code is there but never take effect.
>takes

Will do

>>
>> Meanwhile, according to VTD spec section 3.4.3:
>> "Implementations not supporting RID_PASID capability (ECAP_REG.RPS is
>0b),
>> use a PASID value of 0 to perform address translation for requests without
>> PASID."
>>
>> We should delete the supporting code which fetches RID_PASID field from
>> scalable context entry and use 0 as RID_PASID directly, because RID_PASID
>> field is ignored if no RPS support according to spec.
>>
>> This simplify the code and doesn't bring any penalty.
>simplifies

Will do

>>
>> Opportunistically, s/rid2pasid/rid_pasid and s/RID2PASID/RID_PASID as
>> VTD spec uses RID_PASID terminology.
>>
>> Suggested-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>>  hw/i386/intel_iommu_internal.h |  1 -
>>  hw/i386/intel_iommu.c          | 49 +++++++++++++---------------------
>>  2 files changed, 19 insertions(+), 31 deletions(-)
>>
>> diff --git a/hw/i386/intel_iommu_internal.h
>b/hw/i386/intel_iommu_internal.h
>> index 360e937989..6abe76556a 100644
>> --- a/hw/i386/intel_iommu_internal.h
>> +++ b/hw/i386/intel_iommu_internal.h
>> @@ -547,7 +547,6 @@ typedef struct VTDRootEntry VTDRootEntry;
>>  #define VTD_CTX_ENTRY_LEGACY_SIZE     16
>>  #define VTD_CTX_ENTRY_SCALABLE_SIZE   32
>>
>> -#define VTD_SM_CONTEXT_ENTRY_RID2PASID_MASK 0xfffff
>>  #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw)  (0x1e0ULL |
>~VTD_HAW_MASK(aw))
>>  #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1
>0xffffffffffe00000ULL
>>
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index 71b70b795d..b976b251bc 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -41,8 +41,7 @@
>>  #include "trace.h"
>>
>>  /* context entry operations */
>> -#define VTD_CE_GET_RID2PASID(ce) \
>> -    ((ce)->val[1] & VTD_SM_CONTEXT_ENTRY_RID2PASID_MASK)
>> +#define RID_PASID    0
>I would call that RID_PASID_0 to make it more explicit in the code,
>or, since it is a PASID, PASID_0 would do the job too.

OK, will use PASID_0

>>  #define VTD_CE_GET_PASID_DIR_TABLE(ce) \
>>      ((ce)->val[0] & VTD_PASID_DIR_BASE_ADDR_MASK)
>>
>> @@ -951,7 +950,7 @@ static int vtd_ce_get_pasid_entry(IntelIOMMUState
>*s, VTDContextEntry *ce,
>>      int ret = 0;
>>
>>      if (pasid == PCI_NO_PASID) {
>> -        pasid = VTD_CE_GET_RID2PASID(ce);
>> +        pasid = RID_PASID;
>>      }
>>      pasid_dir_base = VTD_CE_GET_PASID_DIR_TABLE(ce);
>>      ret = vtd_get_pe_from_pasid_table(s, pasid_dir_base, pasid, pe);
>> @@ -970,7 +969,7 @@ static int vtd_ce_get_pasid_fpd(IntelIOMMUState
>*s,
>>      VTDPASIDEntry pe;
>>
>>      if (pasid == PCI_NO_PASID) {
>> -        pasid = VTD_CE_GET_RID2PASID(ce);
>> +        pasid = RID_PASID;
>>      }
>>      pasid_dir_base = VTD_CE_GET_PASID_DIR_TABLE(ce);
>>
>> @@ -1510,15 +1509,14 @@ static inline int
>vtd_context_entry_rsvd_bits_check(IntelIOMMUState *s,
>>      return 0;
>>  }
>>
>> -static int vtd_ce_rid2pasid_check(IntelIOMMUState *s,
>> +static int vtd_ce_rid_pasid_check(IntelIOMMUState *s,
>>                                    VTDContextEntry *ce)
>>  {
>>      VTDPASIDEntry pe;
>>
>>      /*
>>       * Make sure in Scalable Mode, a present context entry
>> -     * has valid rid2pasid setting, which includes valid
>> -     * rid2pasid field and corresponding pasid entry setting
>> +     * has valid pasid entry setting at RID_PASID(0).
>s/at RID_PASID(0) /for PASID_0?

Sure

>>       */
>>      return vtd_ce_get_pasid_entry(s, ce, &pe, PCI_NO_PASID);
>>  }
>> @@ -1581,12 +1579,11 @@ static int
>vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
>>          }
>>      } else {
>>          /*
>> -         * Check if the programming of context-entry.rid2pasid
>> -         * and corresponding pasid setting is valid, and thus
>> -         * avoids to check pasid entry fetching result in future
>> -         * helper function calling.
>> +         * Check if the programming of pasid setting at RID_PASID(0)
>of pasid 0?

OK

>> +         * is valid, and thus avoids to check pasid entry fetching
>> +         * result in future helper function calling.
>>           */
>> -        ret_fr = vtd_ce_rid2pasid_check(s, ce);
>> +        ret_fr = vtd_ce_rid_pasid_check(s, ce);
>>          if (ret_fr) {
>>              return ret_fr;
>>          }
>> @@ -2097,7 +2094,7 @@ static bool
>vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
>>      bool reads = true;
>>      bool writes = true;
>>      uint8_t access_flags, pgtt;
>> -    bool rid2pasid = (pasid == PCI_NO_PASID) && s->root_scalable;
>> +    bool rid_pasid = (pasid == PCI_NO_PASID) && s->root_scalable;
>I am not keen on the rid_pasid name. It does not tell what the
>semantics of the variable are. rid_pasid is an actual field in the CE.
>Does that check whether we face a request without pasid in scalable
>mode? If so I would call that request_wo_pasid_sm or something alike

OK

>>      VTDIOTLBEntry *iotlb_entry;
>>      uint64_t xlat, size;
>>
>> @@ -2111,8 +2108,8 @@ static bool
>vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
>>
>>      cc_entry = &vtd_as->context_cache_entry;
>>
>> -    /* Try to fetch pte from IOTLB, we don't need RID2PASID logic */
>> -    if (!rid2pasid) {
>> +    /* Try to fetch pte from IOTLB, we don't need RID_PASID(0) logic */
>It is unclear what the "RID_PASID(0) logic" is. All the more so we now
>just have to set the pasid to PASID_0.

You have keen insight, yes, this piece of code could be further simplified.
We don't need to check rid2pasid anymore, just index the iotlb cache even for PASID_0.

>> +    if (!rid_pasid) {
>>          iotlb_entry = vtd_lookup_iotlb(s, source_id, pasid, addr);
>>          if (iotlb_entry) {
>>              trace_vtd_iotlb_page_hit(source_id, addr,
>iotlb_entry->pte,
>> @@ -2160,8 +2157,8 @@ static bool
>vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
>>          cc_entry->context_cache_gen = s->context_cache_gen;
>>      }
>>
>> -    if (rid2pasid) {
>> -        pasid = VTD_CE_GET_RID2PASID(&ce);
>> +    if (rid_pasid) {
>> +        pasid = RID_PASID;
>>      }
>>
>>      /*
>> @@ -2189,8 +2186,8 @@ static bool
>vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
>>          return true;
>>      }
>>
>> -    /* Try to fetch pte from IOTLB for RID2PASID slow path */
>> -    if (rid2pasid) {
>> +    /* Try to fetch pte from IOTLB for RID_PASID(0) slow path */
>PASID_0?

With simplification as above, this code is useless and will be deleted.

>> +    if (rid_pasid) {
>>          iotlb_entry = vtd_lookup_iotlb(s, source_id, pasid, addr);
>>          if (iotlb_entry) {
>>              trace_vtd_iotlb_page_hit(source_id, addr,
>iotlb_entry->pte,
>> @@ -2464,20 +2461,14 @@ static void
>vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
>>          ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
>>                                         vtd_as->devfn, &ce);
>>          if (!ret && domain_id == vtd_get_domain_id(s, &ce,
>vtd_as->pasid)) {
>> -            uint32_t rid2pasid = PCI_NO_PASID;
>> -
>> -            if (s->root_scalable) {
>> -                rid2pasid = VTD_CE_GET_RID2PASID(&ce);
>> -            }
>> -
>>              /*
>>               * In legacy mode, vtd_as->pasid == pasid is always true.
>>               * In scalable mode, for vtd address space backing a PCI
>>               * device without pasid, needs to compare pasid with
>> -             * rid2pasid of this device.
>> +             * RID_PASID(0) of this device.
>>               */
>>              if (!(vtd_as->pasid == pasid ||
>> -                  (vtd_as->pasid == PCI_NO_PASID && pasid ==
>rid2pasid))) {
>> +                  (vtd_as->pasid == PCI_NO_PASID && pasid ==
>RID_PASID))) {
>would strongly suggest using PASID_0 instead

Sure.

Thanks
Zhenzhong

>>                  continue;
>>              }
>>
>> @@ -2976,9 +2967,7 @@ static void
>vtd_piotlb_pasid_invalidate(IntelIOMMUState *s,
>>          if (!vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
>>                                        vtd_as->devfn, &ce) &&
>>              domain_id == vtd_get_domain_id(s, &ce, vtd_as->pasid)) {
>> -            uint32_t rid2pasid = VTD_CE_GET_RID2PASID(&ce);
>> -
>> -            if ((vtd_as->pasid != PCI_NO_PASID || pasid != rid2pasid) &&
>> +            if ((vtd_as->pasid != PCI_NO_PASID || pasid != RID_PASID)
>&&
>>                  vtd_as->pasid != pasid) {
>>                  continue;
>>              }
>Thanks
>
>Eric



* RE: [PATCH v6 09/22] intel_iommu: Stick to system MR for IOMMUFD backed host device when x-fls=on
  2025-09-30 15:04   ` Eric Auger
@ 2025-10-09 10:10     ` Duan, Zhenzhong
  0 siblings, 0 replies; 57+ messages in thread
From: Duan, Zhenzhong @ 2025-10-09 10:10 UTC (permalink / raw)
  To: eric.auger@redhat.com, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
	jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
	jgg@nvidia.com, nicolinc@nvidia.com, skolothumtho@nvidia.com,
	joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
	Tian,  Kevin, Liu, Yi L, Peng, Chao P

Hi Eric,

>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: Re: [PATCH v6 09/22] intel_iommu: Stick to system MR for
>IOMMUFD backed host device when x-fls=on
>
>Hi Zhenzhong,
>
>On 9/18/25 10:57 AM, Zhenzhong Duan wrote:
>> When guest enables scalable mode and setup first stage page table, we
>don't
>> want to use IOMMU MR but rather continue using the system MR for
>IOMMUFD
>> backed host device.
>>
>> Then default HWPT in VFIO contains GPA->HPA mappings which could be
>reused
>> as nesting parent HWPT to construct nested HWPT in vIOMMU.
>>
>> Suggested-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>>  hw/i386/intel_iommu.c | 37 +++++++++++++++++++++++++++++++++++--
>>  1 file changed, 35 insertions(+), 2 deletions(-)
>>
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index ba40649c85..bd80de1670 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -40,6 +40,7 @@
>>  #include "kvm/kvm_i386.h"
>>  #include "migration/vmstate.h"
>>  #include "trace.h"
>> +#include "system/iommufd.h"
>>
>>  /* context entry operations */
>>  #define RID_PASID    0
>> @@ -1702,6 +1703,24 @@ static bool
>vtd_dev_pt_enabled(IntelIOMMUState *s, VTDContextEntry *ce,
>>
>>  }
>>
>> +static VTDHostIOMMUDevice *vtd_find_hiod_iommufd(VTDAddressSpace
>*as)
>> +{
>> +    IntelIOMMUState *s = as->iommu_state;
>> +    struct vtd_as_key key = {
>> +        .bus = as->bus,
>> +        .devfn = as->devfn,
>> +    };
>> +    VTDHostIOMMUDevice *vtd_hiod =
>g_hash_table_lookup(s->vtd_host_iommu_dev,
>> +                                                       &key);
>> +
>> +    if (vtd_hiod && vtd_hiod->hiod &&
>> +        object_dynamic_cast(OBJECT(vtd_hiod->hiod),
>> +
>TYPE_HOST_IOMMU_DEVICE_IOMMUFD)) {
>> +        return vtd_hiod;
>> +    }
>> +    return NULL;
>> +}
>> +
>>  static bool vtd_as_pt_enabled(VTDAddressSpace *as)
>>  {
>>      IntelIOMMUState *s;
>> @@ -1710,6 +1729,7 @@ static bool vtd_as_pt_enabled(VTDAddressSpace
>*as)
>>      assert(as);
>>
>>      s = as->iommu_state;
>> +
>not needed

Will do.

>>      if (vtd_dev_to_context_entry(s, pci_bus_num(as->bus), as->devfn,
>>                                   &ce)) {
>>          /*
>> @@ -1727,12 +1747,25 @@ static bool
>vtd_as_pt_enabled(VTDAddressSpace *as)
>>  /* Return whether the device is using IOMMU translation. */
>>  static bool vtd_switch_address_space(VTDAddressSpace *as)
>>  {
>> +    IntelIOMMUState *s;
>>      bool use_iommu, pt;
>>
>>      assert(as);
>>
>> -    use_iommu = as->iommu_state->dmar_enabled
>&& !vtd_as_pt_enabled(as);
>> -    pt = as->iommu_state->dmar_enabled && vtd_as_pt_enabled(as);
>> +    s = as->iommu_state;
>nit: init could be done at declaration

Not exactly, it must be after "assert(as);"

>> +    use_iommu = s->dmar_enabled && !vtd_as_pt_enabled(as);
>> +    pt = s->dmar_enabled && vtd_as_pt_enabled(as);
>> +
>> +    /*
>> +     * When guest enables scalable mode and setup first stage page
>table,
>sets up?

Sure

>> +     * we stick to system MR for IOMMUFD backed host device. Then its
>> +     * default hwpt contains GPA->HPA mappings which is used directly
>> +     * if PGTT=PT and used as nesting parent if PGTT=FST. Otherwise
>> +     * fallback to original processing.
>fall back?

Sure

Thanks
Zhenzhong

>> +     */
>> +    if (s->root_scalable && s->fsts && vtd_find_hiod_iommufd(as)) {
>> +        use_iommu = false;
>> +    }
>>
>>      trace_vtd_switch_address_space(pci_bus_num(as->bus),
>>                                     VTD_PCI_SLOT(as->devfn),
>Besides
>Reviewed-by: Eric Auger <eric.auger@redhat.com>
>Eric
>




* Re: [PATCH v6 05/22] hw/pci: Introduce pci_device_get_viommu_flags()
  2025-09-18  8:57 ` [PATCH v6 05/22] hw/pci: Introduce pci_device_get_viommu_flags() Zhenzhong Duan
  2025-09-23 18:47   ` Nicolin Chen
@ 2025-10-12 12:26   ` Yi Liu
  2025-10-13  6:24     ` Duan, Zhenzhong
  1 sibling, 1 reply; 57+ messages in thread
From: Yi Liu @ 2025-10-12 12:26 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, skolothumtho, joao.m.martins,
	clement.mathieu--drif, kevin.tian, chao.p.peng

On 2025/9/18 16:57, Zhenzhong Duan wrote:
> Introduce a new PCIIOMMUOps optional callback, get_viommu_flags(), which
> allows retrieving flags exposed by a vIOMMU. The first planned vIOMMU
> device flag is VIOMMU_FLAG_WANT_NESTING_PARENT, which advertises support
> for the HW nested stage translation scheme and requests the cooperation
> of other sub-systems like VFIO to create the nesting parent HWPT.
> 
> pci_device_get_viommu_flags() is a wrapper that can be called on a PCI
> device potentially protected by a vIOMMU.
> 
> get_viommu_flags() is designed to return a 64-bit bitmap of pure vIOMMU
> flags which are determined only by the user's configuration, with no host
> capabilities involved. Reasons are:
> 
> 1. host may have heterogeneous IOMMUs, each with different capabilities
> 2. this is migration friendly, the return value is consistent between
>     source and target.
> 3. host IOMMU capabilities are passed to vIOMMU through the
>     set_iommu_device() interface, which has to come after attach_device();
>     when get_viommu_flags() is called in attach_device(), there is no way
>     for vIOMMU to get host IOMMU capabilities yet, so only pure vIOMMU
>     flags can be returned.
>     See below sequence:
> 
>       vfio_device_attach():
>           iommufd_cdev_attach():
>               pci_device_get_viommu_flags() for HW nesting cap
>               create a nesting parent HWPT
>               attach device to the HWPT
>               vfio_device_hiod_create_and_realize() creating hiod
>       ...
>       pci_device_set_iommu_device(hiod)
> 
> Suggested-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>   MAINTAINERS          |  1 +
>   include/hw/iommu.h   | 19 +++++++++++++++++++
>   include/hw/pci/pci.h | 27 +++++++++++++++++++++++++++
>   hw/pci/pci.c         | 11 +++++++++++
>   4 files changed, 58 insertions(+)
>   create mode 100644 include/hw/iommu.h
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index f8cd513d8b..71457e4cde 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -2307,6 +2307,7 @@ F: include/system/iommufd.h
>   F: backends/host_iommu_device.c
>   F: include/system/host_iommu_device.h
>   F: include/qemu/chardev_open.h
> +F: include/hw/iommu.h
>   F: util/chardev_open.c
>   F: docs/devel/vfio-iommufd.rst
>   
> diff --git a/include/hw/iommu.h b/include/hw/iommu.h
> new file mode 100644
> index 0000000000..65d652950a
> --- /dev/null
> +++ b/include/hw/iommu.h
> @@ -0,0 +1,19 @@
> +/*
> + * General vIOMMU flags
> + *
> + * Copyright (C) 2025 Intel Corporation.
> + *
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + */
> +
> +#ifndef HW_IOMMU_H
> +#define HW_IOMMU_H
> +
> +#include "qemu/bitops.h"
> +
> +enum {
> +    /* Nesting parent HWPT will be reused by vIOMMU to create nested HWPT */

vIOMMU needs nesting parent HWPT to create nested HWPT

> +     VIOMMU_FLAG_WANT_NESTING_PARENT = BIT_ULL(0),
> +};
> +
> +#endif /* HW_IOMMU_H */
> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> index bde9dca8e2..c54f2b53ae 100644
> --- a/include/hw/pci/pci.h
> +++ b/include/hw/pci/pci.h
> @@ -462,6 +462,23 @@ typedef struct PCIIOMMUOps {
>        * @devfn: device and function number of the PCI device.
>        */
>       void (*unset_iommu_device)(PCIBus *bus, void *opaque, int devfn);
> +    /**
> +     * @get_viommu_flags: get vIOMMU flags
> +     *
> +     * Optional callback; if not implemented, the vIOMMU doesn't support
> +     * exposing flags to other sub-systems, e.g., VFIO. Each flag can be
> +     * an expectation of or a request to another sub-system, or just a
> +     * pure vIOMMU capability. vIOMMU can choose which flags to expose.
> +     *
> +     * @opaque: the data passed to pci_setup_iommu().
> +     *
> +     * Returns: a 64-bit bitmap in which each bit represents a flag that
> +     * vIOMMU wants to expose. See VIOMMU_FLAG_* in include/hw/iommu.h for
> +     * all possible flags currently used. These flags are theoretical:
> +     * they are determined only by vIOMMU device properties, independent
> +     * of the actual host capabilities they may depend on.
> +     */
> +    uint64_t (*get_viommu_flags)(void *opaque);
>       /**
>        * @get_iotlb_info: get properties required to initialize a device IOTLB.
>        *
> @@ -644,6 +661,16 @@ bool pci_device_set_iommu_device(PCIDevice *dev, HostIOMMUDevice *hiod,
>                                    Error **errp);
>   void pci_device_unset_iommu_device(PCIDevice *dev);
>   
> +/**
> + * pci_device_get_viommu_flags: get vIOMMU flags.
> + *
> + * Returns a 64-bit bitmap in which each bit represents a vIOMMU exposed
> + * flag, or 0 if the vIOMMU doesn't support any.
> + *
> + * @dev: PCI device pointer.
> + */
> +uint64_t pci_device_get_viommu_flags(PCIDevice *dev);
> +
>   /**
>    * pci_iommu_get_iotlb_info: get properties required to initialize a
>    * device IOTLB.
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index 4d4b9dda4d..1315ef13ea 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -3012,6 +3012,17 @@ void pci_device_unset_iommu_device(PCIDevice *dev)
>       }
>   }
>   
> +uint64_t pci_device_get_viommu_flags(PCIDevice *dev)
> +{
> +    PCIBus *iommu_bus;
> +
> +    pci_device_get_iommu_bus_devfn(dev, &iommu_bus, NULL, NULL);
> +    if (iommu_bus && iommu_bus->iommu_ops->get_viommu_flags) {
> +        return iommu_bus->iommu_ops->get_viommu_flags(iommu_bus->iommu_opaque);
> +    }
> +    return 0;
> +}
> +
>   int pci_pri_request_page(PCIDevice *dev, uint32_t pasid, bool priv_req,
>                            bool exec_req, hwaddr addr, bool lpig,
>                            uint16_t prgi, bool is_read, bool is_write)

The patch LGTM.

Reviewed-by: Yi Liu <yi.l.liu@intel.com>


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v6 06/22] intel_iommu: Implement get_viommu_flags() callback
  2025-09-18  8:57 ` [PATCH v6 06/22] intel_iommu: Implement get_viommu_flags() callback Zhenzhong Duan
@ 2025-10-12 12:28   ` Yi Liu
  2025-10-13  6:26     ` Duan, Zhenzhong
  0 siblings, 1 reply; 57+ messages in thread
From: Yi Liu @ 2025-10-12 12:28 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, skolothumtho, joao.m.martins,
	clement.mathieu--drif, kevin.tian, chao.p.peng

On 2025/9/18 16:57, Zhenzhong Duan wrote:
> Implement get_viommu_flags() callback and expose a request for nesting
> parent HWPT for now.
> 
> VFIO uses it to create the nesting parent HWPT, which is further used to
> create nested HWPT in the vIOMMU. All these will be implemented in the
> following patches.
> 
> Suggested-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> Reviewed-by: Eric Auger <eric.auger@redhat.com>
> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
>   hw/i386/intel_iommu.c | 12 ++++++++++++
>   1 file changed, 12 insertions(+)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index a47482ba9d..83c40975cc 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -24,6 +24,7 @@
>   #include "qemu/main-loop.h"
>   #include "qapi/error.h"
>   #include "hw/sysbus.h"
> +#include "hw/iommu.h"
>   #include "intel_iommu_internal.h"
>   #include "hw/pci/pci.h"
>   #include "hw/pci/pci_bus.h"
> @@ -4412,6 +4413,16 @@ static void vtd_dev_unset_iommu_device(PCIBus *bus, void *opaque, int devfn)
>       vtd_iommu_unlock(s);
>   }
>   
> +static uint64_t vtd_get_viommu_flags(void *opaque)
> +{
> +    IntelIOMMUState *s = opaque;
> +    uint64_t caps;

s/caps/flags

> +
> +    caps = s->fsts ? VIOMMU_FLAG_WANT_NESTING_PARENT : 0;
> +
> +    return caps;
> +}
> +
>   /* Unmap the whole range in the notifier's scope. */
>   static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
>   {
> @@ -4842,6 +4853,7 @@ static PCIIOMMUOps vtd_iommu_ops = {
>       .register_iotlb_notifier = vtd_register_iotlb_notifier,
>       .unregister_iotlb_notifier = vtd_unregister_iotlb_notifier,
>       .ats_request_translation = vtd_ats_request_translation,
> +    .get_viommu_flags = vtd_get_viommu_flags,
>   };
>   
>   static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)

Reviewed-by: Yi Liu <yi.l.liu@intel.com>

Regards,
Yi Liu


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v6 02/22] intel_iommu: Delete RPS capability related supporting code
  2025-10-09 10:10     ` Duan, Zhenzhong
@ 2025-10-12 12:30       ` Yi Liu
  0 siblings, 0 replies; 57+ messages in thread
From: Yi Liu @ 2025-10-12 12:30 UTC (permalink / raw)
  To: Duan, Zhenzhong, eric.auger@redhat.com, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
	jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
	jgg@nvidia.com, nicolinc@nvidia.com, skolothumtho@nvidia.com,
	joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
	Tian,  Kevin, Peng, Chao P

On 2025/10/9 18:10, Duan, Zhenzhong wrote:
> Hi Eric,
> 
>> -----Original Message-----
>> From: Eric Auger <eric.auger@redhat.com>
>> Subject: Re: [PATCH v6 02/22] intel_iommu: Delete RPS capability related
>> supporting code
>>
>> Hi Zhenzhong,
>>
>> On 9/18/25 10:57 AM, Zhenzhong Duan wrote:
>>> RID-PASID Support(RPS) is not set in vIOMMU ECAP register, the supporting
>>> code is there but never take effect.
>> takes
> 
> Will do
> 
>>>
>>> Meanwhile, according to VTD spec section 3.4.3:
>>> "Implementations not supporting RID_PASID capability (ECAP_REG.RPS is
>> 0b),
>>> use a PASID value of 0 to perform address translation for requests without
>>> PASID."
>>>
>>> We should delete the supporting code which fetches RID_PASID field from
>>> scalable context entry and use 0 as RID_PASID directly, because RID_PASID
>>> field is ignored if no RPS support according to spec.
>>>
>>> This simplify the code and doesn't bring any penalty.
>> simplifies
> 
> Will do
> 
>>>
>>> Opportunistically, s/rid2pasid/rid_pasid and s/RID2PASID/RID_PASID as
>>> VTD spec uses RID_PASID terminology.
>>>
>>> Suggested-by: Yi Liu <yi.l.liu@intel.com>
>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>> ---
>>>   hw/i386/intel_iommu_internal.h |  1 -
>>>   hw/i386/intel_iommu.c          | 49 +++++++++++++---------------------
>>>   2 files changed, 19 insertions(+), 31 deletions(-)
>>>
>>> diff --git a/hw/i386/intel_iommu_internal.h
>> b/hw/i386/intel_iommu_internal.h
>>> index 360e937989..6abe76556a 100644
>>> --- a/hw/i386/intel_iommu_internal.h
>>> +++ b/hw/i386/intel_iommu_internal.h
>>> @@ -547,7 +547,6 @@ typedef struct VTDRootEntry VTDRootEntry;
>>>   #define VTD_CTX_ENTRY_LEGACY_SIZE     16
>>>   #define VTD_CTX_ENTRY_SCALABLE_SIZE   32
>>>
>>> -#define VTD_SM_CONTEXT_ENTRY_RID2PASID_MASK 0xfffff
>>>   #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw)  (0x1e0ULL |
>> ~VTD_HAW_MASK(aw))
>>>   #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1
>> 0xffffffffffe00000ULL
>>>
>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>>> index 71b70b795d..b976b251bc 100644
>>> --- a/hw/i386/intel_iommu.c
>>> +++ b/hw/i386/intel_iommu.c
>>> @@ -41,8 +41,7 @@
>>>   #include "trace.h"
>>>
>>>   /* context entry operations */
>>> -#define VTD_CE_GET_RID2PASID(ce) \
>>> -    ((ce)->val[1] & VTD_SM_CONTEXT_ENTRY_RID2PASID_MASK)
>>> +#define RID_PASID    0
>> I would call that RID_PASID_0 to make it more explicit in the code
>> or even it is a PASID to PASID_0 would do the job too.
> 
> OK, will use PASID_0
> 
>>>   #define VTD_CE_GET_PASID_DIR_TABLE(ce) \
>>>       ((ce)->val[0] & VTD_PASID_DIR_BASE_ADDR_MASK)
>>>
>>> @@ -951,7 +950,7 @@ static int vtd_ce_get_pasid_entry(IntelIOMMUState
>> *s, VTDContextEntry *ce,
>>>       int ret = 0;
>>>
>>>       if (pasid == PCI_NO_PASID) {
>>> -        pasid = VTD_CE_GET_RID2PASID(ce);
>>> +        pasid = RID_PASID;
>>>       }
>>>       pasid_dir_base = VTD_CE_GET_PASID_DIR_TABLE(ce);
>>>       ret = vtd_get_pe_from_pasid_table(s, pasid_dir_base, pasid, pe);
>>> @@ -970,7 +969,7 @@ static int vtd_ce_get_pasid_fpd(IntelIOMMUState
>> *s,
>>>       VTDPASIDEntry pe;
>>>
>>>       if (pasid == PCI_NO_PASID) {
>>> -        pasid = VTD_CE_GET_RID2PASID(ce);
>>> +        pasid = RID_PASID;
>>>       }
>>>       pasid_dir_base = VTD_CE_GET_PASID_DIR_TABLE(ce);
>>>
>>> @@ -1510,15 +1509,14 @@ static inline int
>> vtd_context_entry_rsvd_bits_check(IntelIOMMUState *s,
>>>       return 0;
>>>   }
>>>
>>> -static int vtd_ce_rid2pasid_check(IntelIOMMUState *s,
>>> +static int vtd_ce_rid_pasid_check(IntelIOMMUState *s,
>>>                                     VTDContextEntry *ce)
>>>   {
>>>       VTDPASIDEntry pe;
>>>
>>>       /*
>>>        * Make sure in Scalable Mode, a present context entry
>>> -     * has valid rid2pasid setting, which includes valid
>>> -     * rid2pasid field and corresponding pasid entry setting
>>> +     * has valid pasid entry setting at RID_PASID(0).
>> s/at RID_PASID(0) /for PASID_0?
> 
> Sure
> 
>>>        */
>>>       return vtd_ce_get_pasid_entry(s, ce, &pe, PCI_NO_PASID);
>>>   }
>>> @@ -1581,12 +1579,11 @@ static int
>> vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
>>>           }
>>>       } else {
>>>           /*
>>> -         * Check if the programming of context-entry.rid2pasid
>>> -         * and corresponding pasid setting is valid, and thus
>>> -         * avoids to check pasid entry fetching result in future
>>> -         * helper function calling.
>>> +         * Check if the programming of pasid setting at RID_PASID(0)
>> of pasid 0?
> 
> OK
> 
>>> +         * is valid, and thus avoids to check pasid entry fetching
>>> +         * result in future helper function calling.
>>>            */
>>> -        ret_fr = vtd_ce_rid2pasid_check(s, ce);
>>> +        ret_fr = vtd_ce_rid_pasid_check(s, ce);
>>>           if (ret_fr) {
>>>               return ret_fr;
>>>           }
>>> @@ -2097,7 +2094,7 @@ static bool
>> vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
>>>       bool reads = true;
>>>       bool writes = true;
>>>       uint8_t access_flags, pgtt;
>>> -    bool rid2pasid = (pasid == PCI_NO_PASID) && s->root_scalable;
>>> +    bool rid_pasid = (pasid == PCI_NO_PASID) && s->root_scalable;
>> I am not keen of the rid_pasid name. It does not tell what is the
>> semantic of the variable. rid_pasid is an actual field in the CE.
>> does that check whether we face a request without pasid in scalable
>> mode. If so I would call that request_wo_pasid_sm or somethink alike
> 
> OK
> 
>>>       VTDIOTLBEntry *iotlb_entry;
>>>       uint64_t xlat, size;
>>>
>>> @@ -2111,8 +2108,8 @@ static bool
>> vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
>>>
>>>       cc_entry = &vtd_as->context_cache_entry;
>>>
>>> -    /* Try to fetch pte from IOTLB, we don't need RID2PASID logic */
>>> -    if (!rid2pasid) {
>>> +    /* Try to fetch pte from IOTLB, we don't need RID_PASID(0) logic */
>> It is unclear what the "RID_PASID(0) logic" is. All the more so we now
>> just have to set the pasid to PASID_0.
> 
> You have keen insight, yes, this piece of code could be further simplified.
> We don't need to check rid2_pasid anymore, just index iotlb cache even for PASID_0.
> 
>>> +    if (!rid_pasid) {
>>>           iotlb_entry = vtd_lookup_iotlb(s, source_id, pasid, addr);
>>>           if (iotlb_entry) {
>>>               trace_vtd_iotlb_page_hit(source_id, addr,
>> iotlb_entry->pte,
>>> @@ -2160,8 +2157,8 @@ static bool
>> vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
>>>           cc_entry->context_cache_gen = s->context_cache_gen;
>>>       }
>>>
>>> -    if (rid2pasid) {
>>> -        pasid = VTD_CE_GET_RID2PASID(&ce);
>>> +    if (rid_pasid) {
>>> +        pasid = RID_PASID;
>>>       }
>>>
>>>       /*
>>> @@ -2189,8 +2186,8 @@ static bool
>> vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
>>>           return true;
>>>       }
>>>
>>> -    /* Try to fetch pte from IOTLB for RID2PASID slow path */
>>> -    if (rid2pasid) {
>>> +    /* Try to fetch pte from IOTLB for RID_PASID(0) slow path */
>> PASID_0?
> 
> With simplification as above, this code is useless and will be deleted.

yeah, this code is really confusing. I saw "if (!rid_pasid) {" and
"if (rid_pasid) {"; the two if branches have almost the same code,
differing only in the pasid value, so the two should be possible to
consolidate.

Regards,
Yi Liu


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v6 03/22] intel_iommu: Update terminology to match VTD spec
  2025-09-18  8:57 ` [PATCH v6 03/22] intel_iommu: Update terminology to match VTD spec Zhenzhong Duan
  2025-09-30  7:45   ` Eric Auger
@ 2025-10-12 12:30   ` Yi Liu
  2025-10-13  6:20     ` Duan, Zhenzhong
  1 sibling, 1 reply; 57+ messages in thread
From: Yi Liu @ 2025-10-12 12:30 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, skolothumtho, joao.m.martins,
	clement.mathieu--drif, kevin.tian, chao.p.peng, Paolo Bonzini

On 2025/9/18 16:57, Zhenzhong Duan wrote:
> VTD spec revision 3.4 released in December 2021 renamed "First-level" to
> "First-stage" and "Second-level" to "Second-stage".
> 
> Do the same in intel_iommu code to match spec, change all existing
> "fl/sl/FL/SL/first level/second level/stage-1/stage-2" terminology to
> "fs/ss/FS/SS/first stage/second stage".
> 
> No functional changes intended.

LGTM.

Reviewed-by: Yi Liu <yi.l.liu@intel.com>

[...]

>   
> -    if (!s->scalable_mode && s->flts) {
> +    if (!s->scalable_mode && s->fsts) {
>           error_setg(errp, "x-flts is only available in scalable mode");

just realized we don't have a chance to rename x-flts..

>           return false;
>       }
>   
> -    if (!s->flts && s->aw_bits != VTD_HOST_AW_39BIT &&
> +    if (!s->fsts && s->aw_bits != VTD_HOST_AW_39BIT &&
>           s->aw_bits != VTD_HOST_AW_48BIT) {
>           error_setg(errp, "%s: supported values for aw-bits are: %d, %d",
>                      s->scalable_mode ? "Scalable mode(flts=off)" : "Legacy mode",
> @@ -4877,7 +4879,7 @@ static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
>           return false;
>       }
>   
> -    if (s->flts && s->aw_bits != VTD_HOST_AW_48BIT) {
> +    if (s->fsts && s->aw_bits != VTD_HOST_AW_48BIT) {
>           error_setg(errp,
>                      "Scalable mode(flts=on): supported value for aw-bits is: %d",

this should be x-flts=on. right? not a fault of this patch.

Regards,
Yi Liu


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v6 08/22] vfio/iommufd: Force creating nesting parent HWPT
  2025-09-18  8:57 ` [PATCH v6 08/22] vfio/iommufd: Force creating nesting parent HWPT Zhenzhong Duan
  2025-09-30 14:19   ` Eric Auger
@ 2025-10-12 12:33   ` Yi Liu
  1 sibling, 0 replies; 57+ messages in thread
From: Yi Liu @ 2025-10-12 12:33 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, skolothumtho, joao.m.martins,
	clement.mathieu--drif, kevin.tian, chao.p.peng

On 2025/9/18 16:57, Zhenzhong Duan wrote:
> Call pci_device_get_viommu_flags() to check whether the vIOMMU supports
> VIOMMU_FLAG_WANT_NESTING_PARENT.
> 
> If yes, create a nesting parent HWPT and add it to the container's hwpt_list,
> letting this parent HWPT cover the entire second stage mappings (GPA=>HPA).
> 
> This allows a VFIO passthrough device to directly attach to this default HWPT
> and then to use the system address space and its listener.
> 
> Introduce a vfio_device_get_viommu_flags_want_nesting() helper to facilitate
> this implementation.
> 
> It is safe to do so because the vIOMMU can still fail the set_iommu_device()
> call if something else related to the VFIO device or vIOMMU isn't compatible.
> 
> Suggested-by: Nicolin Chen <nicolinc@nvidia.com>
> Suggested-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
>   include/hw/vfio/vfio-device.h |  2 ++
>   hw/vfio/device.c              | 12 ++++++++++++
>   hw/vfio/iommufd.c             |  9 +++++++++
>   3 files changed, 23 insertions(+)
>

Reviewed-by: Yi Liu <yi.l.liu@intel.com>

> diff --git a/include/hw/vfio/vfio-device.h b/include/hw/vfio/vfio-device.h
> index e7e6243e2d..a964091135 100644
> --- a/include/hw/vfio/vfio-device.h
> +++ b/include/hw/vfio/vfio-device.h
> @@ -257,6 +257,8 @@ void vfio_device_prepare(VFIODevice *vbasedev, VFIOContainerBase *bcontainer,
>   
>   void vfio_device_unprepare(VFIODevice *vbasedev);
>   
> +bool vfio_device_get_viommu_flags_want_nesting(VFIODevice *vbasedev);
> +
>   int vfio_device_get_region_info(VFIODevice *vbasedev, int index,
>                                   struct vfio_region_info **info);
>   int vfio_device_get_region_info_type(VFIODevice *vbasedev, uint32_t type,
> diff --git a/hw/vfio/device.c b/hw/vfio/device.c
> index 08f12ac31f..620cc78b77 100644
> --- a/hw/vfio/device.c
> +++ b/hw/vfio/device.c
> @@ -23,6 +23,7 @@
>   
>   #include "hw/vfio/vfio-device.h"
>   #include "hw/vfio/pci.h"
> +#include "hw/iommu.h"
>   #include "hw/hw.h"
>   #include "trace.h"
>   #include "qapi/error.h"
> @@ -504,6 +505,17 @@ void vfio_device_unprepare(VFIODevice *vbasedev)
>       vbasedev->bcontainer = NULL;
>   }
>   
> +bool vfio_device_get_viommu_flags_want_nesting(VFIODevice *vbasedev)
> +{
> +    VFIOPCIDevice *vdev = vfio_pci_from_vfio_device(vbasedev);
> +
> +    if (vdev) {
> +        return !!(pci_device_get_viommu_flags(&vdev->parent_obj) &
> +                  VIOMMU_FLAG_WANT_NESTING_PARENT);
> +    }
> +    return false;
> +}
> +
>   /*
>    * Traditional ioctl() based io
>    */
> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
> index 8c27222f75..f1684a39b7 100644
> --- a/hw/vfio/iommufd.c
> +++ b/hw/vfio/iommufd.c
> @@ -379,6 +379,15 @@ static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
>           flags = IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
>       }
>   
> +    /*
> +     * If vIOMMU requests VFIO's cooperation to create nesting parent HWPT,
> +     * force to create it so that it could be reused by vIOMMU to create
> +     * nested HWPT.
> +     */
> +    if (vfio_device_get_viommu_flags_want_nesting(vbasedev)) {
> +        flags |= IOMMU_HWPT_ALLOC_NEST_PARENT;
> +    }
> +
>       if (cpr_is_incoming()) {
>           hwpt_id = vbasedev->cpr.hwpt_id;
>           goto skip_alloc;


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v6 09/22] intel_iommu: Stick to system MR for IOMMUFD backed host device when x-fls=on
  2025-09-18  8:57 ` [PATCH v6 09/22] intel_iommu: Stick to system MR for IOMMUFD backed host device when x-fls=on Zhenzhong Duan
  2025-09-30 15:04   ` Eric Auger
@ 2025-10-12 12:51   ` Yi Liu
  1 sibling, 0 replies; 57+ messages in thread
From: Yi Liu @ 2025-10-12 12:51 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, skolothumtho, joao.m.martins,
	clement.mathieu--drif, kevin.tian, chao.p.peng

On 2025/9/18 16:57, Zhenzhong Duan wrote:
> When the guest enables scalable mode and sets up a first stage page table,
> we don't want to use the IOMMU MR but rather continue using the system MR
> for IOMMUFD backed host devices.
> 
> Then default HWPT in VFIO contains GPA->HPA mappings which could be reused
> as nesting parent HWPT to construct nested HWPT in vIOMMU.
> 
> Suggested-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>   hw/i386/intel_iommu.c | 37 +++++++++++++++++++++++++++++++++++--
>   1 file changed, 35 insertions(+), 2 deletions(-)

Reviewed-by: Yi Liu <yi.l.liu@intel.com>

> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index ba40649c85..bd80de1670 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -40,6 +40,7 @@
>   #include "kvm/kvm_i386.h"
>   #include "migration/vmstate.h"
>   #include "trace.h"
> +#include "system/iommufd.h"
>   
>   /* context entry operations */
>   #define RID_PASID    0
> @@ -1702,6 +1703,24 @@ static bool vtd_dev_pt_enabled(IntelIOMMUState *s, VTDContextEntry *ce,
>   
>   }
>   
> +static VTDHostIOMMUDevice *vtd_find_hiod_iommufd(VTDAddressSpace *as)
> +{
> +    IntelIOMMUState *s = as->iommu_state;
> +    struct vtd_as_key key = {
> +        .bus = as->bus,
> +        .devfn = as->devfn,
> +    };
> +    VTDHostIOMMUDevice *vtd_hiod = g_hash_table_lookup(s->vtd_host_iommu_dev,
> +                                                       &key);
> +
> +    if (vtd_hiod && vtd_hiod->hiod &&
> +        object_dynamic_cast(OBJECT(vtd_hiod->hiod),
> +                            TYPE_HOST_IOMMU_DEVICE_IOMMUFD)) {
> +        return vtd_hiod;
> +    }
> +    return NULL;
> +}
> +
>   static bool vtd_as_pt_enabled(VTDAddressSpace *as)
>   {
>       IntelIOMMUState *s;
> @@ -1710,6 +1729,7 @@ static bool vtd_as_pt_enabled(VTDAddressSpace *as)
>       assert(as);
>   
>       s = as->iommu_state;
> +
>       if (vtd_dev_to_context_entry(s, pci_bus_num(as->bus), as->devfn,
>                                    &ce)) {
>           /*
> @@ -1727,12 +1747,25 @@ static bool vtd_as_pt_enabled(VTDAddressSpace *as)
>   /* Return whether the device is using IOMMU translation. */
>   static bool vtd_switch_address_space(VTDAddressSpace *as)
>   {
> +    IntelIOMMUState *s;
>       bool use_iommu, pt;
>   
>       assert(as);
>   
> -    use_iommu = as->iommu_state->dmar_enabled && !vtd_as_pt_enabled(as);
> -    pt = as->iommu_state->dmar_enabled && vtd_as_pt_enabled(as);
> +    s = as->iommu_state;
> +    use_iommu = s->dmar_enabled && !vtd_as_pt_enabled(as);
> +    pt = s->dmar_enabled && vtd_as_pt_enabled(as);
> +
> +    /*
> +     * When guest enables scalable mode and setup first stage page table,
> +     * we stick to system MR for IOMMUFD backed host device. Then its
> +     * default hwpt contains GPA->HPA mappings which is used directly
> +     * if PGTT=PT and used as nesting parent if PGTT=FST. Otherwise
> +     * fallback to original processing.
> +     */
> +    if (s->root_scalable && s->fsts && vtd_find_hiod_iommufd(as)) {
> +        use_iommu = false;
> +    }
>   
>       trace_vtd_switch_address_space(pci_bus_num(as->bus),
>                                      VTD_PCI_SLOT(as->devfn),


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v6 10/22] intel_iommu: Check for compatibility with IOMMUFD backed device when x-flts=on
  2025-09-18  8:57 ` [PATCH v6 10/22] intel_iommu: Check for compatibility with IOMMUFD backed device when x-flts=on Zhenzhong Duan
@ 2025-10-12 12:55   ` Yi Liu
  2025-10-13  6:48     ` Duan, Zhenzhong
  0 siblings, 1 reply; 57+ messages in thread
From: Yi Liu @ 2025-10-12 12:55 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, skolothumtho, joao.m.martins,
	clement.mathieu--drif, kevin.tian, chao.p.peng

On 2025/9/18 16:57, Zhenzhong Duan wrote:
> When the vIOMMU is configured with x-flts=on in scalable mode, the first
> stage page table is passed to the host to construct a nested page table
> for passthrough devices.
> 
> We need to check compatibility of some critical IOMMU capabilities between
> the vIOMMU and the host IOMMU to ensure the guest first stage page table
> could be used by the host.
> 
> For instance, if the vIOMMU supports first stage 1GB large page mapping
> but the host does not, then this IOMMUFD backed device should fail.
> 
> Even if the checks pass, for now we willingly reject the association because
> all the bits are not there yet; this will be relaxed at the end of this series.

might be good to note that the nested cap is required but is already
covered in the core, so this patch does not check it. Otherwise, readers
would wonder why that check is not added here without digging through
previous review comments.

Regards,
Yi Liu


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v6 12/22] intel_iommu: Handle PASID cache invalidation
  2025-09-18  8:57 ` [PATCH v6 12/22] intel_iommu: Handle PASID cache invalidation Zhenzhong Duan
@ 2025-10-12 14:58   ` Yi Liu
  2025-10-13  7:37     ` Duan, Zhenzhong
  0 siblings, 1 reply; 57+ messages in thread
From: Yi Liu @ 2025-10-12 14:58 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, skolothumtho, joao.m.martins,
	clement.mathieu--drif, kevin.tian, chao.p.peng

On 2025/9/18 16:57, Zhenzhong Duan wrote:
> This adds PASID cache sync for RID_PASID; non-RID_PASID isn't supported.
> 
> Adds a new entry VTDPASIDCacheEntry in VTDAddressSpace to cache the pasid
> entry, tracking PASID usage for future PASID tagged DMA address translation
> support in the vIOMMU.
> 
> When the guest triggers a pasid cache invalidation, QEMU will capture it and
> update or invalidate the pasid cache.
> 
> The vIOMMU emulator can figure out the reason by fetching the latest guest
> pasid entry in memory and comparing it with the cached PASID entry if it's
> valid.
> 
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>   hw/i386/intel_iommu_internal.h |  19 +++-
>   include/hw/i386/intel_iommu.h  |   6 ++
>   hw/i386/intel_iommu.c          | 157 ++++++++++++++++++++++++++++++---
>   hw/i386/trace-events           |   3 +
>   4 files changed, 173 insertions(+), 12 deletions(-)
> 
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> index 9cdc8d5dbb..d400bcee21 100644
> --- a/hw/i386/intel_iommu_internal.h
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -316,6 +316,7 @@ typedef enum VTDFaultReason {
>                                     * request while disabled */
>       VTD_FR_IR_SID_ERR = 0x26,   /* Invalid Source-ID */
>   
> +    VTD_FR_RTADDR_INV_TTM = 0x31,  /* Invalid TTM in RTADDR */
>       /* PASID directory entry access failure */
>       VTD_FR_PASID_DIR_ACCESS_ERR = 0x50,
>       /* The Present(P) field of pasid directory entry is 0 */
> @@ -493,6 +494,15 @@ typedef union VTDInvDesc VTDInvDesc;
>   #define VTD_INV_DESC_PIOTLB_RSVD_VAL0     0xfff000000000f1c0ULL
>   #define VTD_INV_DESC_PIOTLB_RSVD_VAL1     0xf80ULL
>   
> +/* PASID-cache Invalidate Descriptor (pc_inv_dsc) fields */
> +#define VTD_INV_DESC_PASIDC_G(x)        extract64((x)->val[0], 4, 2)
> +#define VTD_INV_DESC_PASIDC_G_DSI       0
> +#define VTD_INV_DESC_PASIDC_G_PASID_SI  1
> +#define VTD_INV_DESC_PASIDC_G_GLOBAL    3
> +#define VTD_INV_DESC_PASIDC_DID(x)      extract64((x)->val[0], 16, 16)
> +#define VTD_INV_DESC_PASIDC_PASID(x)    extract64((x)->val[0], 32, 20)
> +#define VTD_INV_DESC_PASIDC_RSVD_VAL0   0xfff000000000f1c0ULL
> +
>   /* Information about page-selective IOTLB invalidate */
>   struct VTDIOTLBPageInvInfo {
>       uint16_t domain_id;
> @@ -552,6 +562,13 @@ typedef struct VTDRootEntry VTDRootEntry;
>   #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw)  (0x1e0ULL | ~VTD_HAW_MASK(aw))
>   #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1      0xffffffffffe00000ULL
>   
> +typedef struct VTDPASIDCacheInfo {
> +    uint8_t type;
> +    uint16_t did;
> +    uint32_t pasid;
> +    bool reset;
> +} VTDPASIDCacheInfo;
> +
>   /* PASID Table Related Definitions */
>   #define VTD_PASID_DIR_BASE_ADDR_MASK  (~0xfffULL)
>   #define VTD_PASID_TABLE_BASE_ADDR_MASK (~0xfffULL)
> @@ -573,7 +590,7 @@ typedef struct VTDRootEntry VTDRootEntry;
>   #define VTD_SM_PASID_ENTRY_PT          (4ULL << 6)
>   
>   #define VTD_SM_PASID_ENTRY_AW          7ULL /* Adjusted guest-address-width */
> -#define VTD_SM_PASID_ENTRY_DID(val)    ((val) & VTD_DOMAIN_ID_MASK)
> +#define VTD_SM_PASID_ENTRY_DID(x)      extract64((x)->val[1], 0, 16)
>   
>   #define VTD_SM_PASID_ENTRY_FSPM          3ULL
>   #define VTD_SM_PASID_ENTRY_FSPTPTR       (~0xfffULL)
> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> index 3351892da0..ff01e5c82d 100644
> --- a/include/hw/i386/intel_iommu.h
> +++ b/include/hw/i386/intel_iommu.h
> @@ -95,6 +95,11 @@ struct VTDPASIDEntry {
>       uint64_t val[8];
>   };
>   
> +typedef struct VTDPASIDCacheEntry {
> +    struct VTDPASIDEntry pasid_entry;
> +    bool valid;
> +} VTDPASIDCacheEntry;
> +
>   struct VTDAddressSpace {
>       PCIBus *bus;
>       uint8_t devfn;
> @@ -107,6 +112,7 @@ struct VTDAddressSpace {
>       MemoryRegion iommu_ir_fault; /* Interrupt region for catching fault */
>       IntelIOMMUState *iommu_state;
>       VTDContextCacheEntry context_cache_entry;
> +    VTDPASIDCacheEntry pasid_cache_entry;
>       QLIST_ENTRY(VTDAddressSpace) next;
>       /* Superset of notifier flags that this address space has */
>       IOMMUNotifierFlag notifier_flags;
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index d37d47115a..24061f6dc6 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -1614,7 +1614,7 @@ static uint16_t vtd_get_domain_id(IntelIOMMUState *s,
>   
>       if (s->root_scalable) {
>           vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
> -        return VTD_SM_PASID_ENTRY_DID(pe.val[1]);
> +        return VTD_SM_PASID_ENTRY_DID(&pe);
>       }
>   
>       return VTD_CONTEXT_ENTRY_DID(ce->hi);
> @@ -3074,6 +3074,144 @@ static bool vtd_process_piotlb_desc(IntelIOMMUState *s,
>       return true;
>   }
>   
> +static inline int vtd_dev_get_pe_from_pasid(VTDAddressSpace *vtd_as,
> +                                            VTDPASIDEntry *pe)
> +{
> +    IntelIOMMUState *s = vtd_as->iommu_state;
> +    VTDContextEntry ce;
> +    int ret;
> +
> +    if (!s->root_scalable) {
> +        return -VTD_FR_RTADDR_INV_TTM;
> +    }
> +
> +    ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus), vtd_as->devfn,
> +                                   &ce);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    return vtd_ce_get_pasid_entry(s, &ce, pe, vtd_as->pasid);
> +}
> +
> +/*
> + * For each IOMMUFD backed device, update or invalidate pasid cache based on
> + * the value in memory.
> + */
> +static void vtd_pasid_cache_sync_locked(gpointer key, gpointer value,
> +                                        gpointer user_data)
> +{
> +    VTDPASIDCacheInfo *pc_info = user_data;
> +    VTDAddressSpace *vtd_as = value;
> +    VTDPASIDCacheEntry *pc_entry = &vtd_as->pasid_cache_entry;
> +    VTDPASIDEntry pe;
> +    uint16_t did;
> +
> +    /* Ignore emulated device or legacy VFIO backed device */
> +    if (!vtd_find_hiod_iommufd(vtd_as)) {
> +        return;
> +    }
> +
> +    /* non-RID_PASID isn't supported yet */
> +    assert(vtd_as->pasid == PCI_NO_PASID);
> +
> +    if (vtd_dev_get_pe_from_pasid(vtd_as, &pe)) {
> +        /*
> +         * No valid pasid entry in guest memory. e.g. pasid entry was modified
> +         * to be either all-zero or non-present. Either case means existing
> +         * pasid cache should be invalidated.
> +         */
> +        pc_entry->valid = false;
> +        return;
> +    }
> +
> +    /*
> +     * VTD_INV_DESC_PASIDC_G_DSI and VTD_INV_DESC_PASIDC_G_PASID_SI require
> +     * DID check. If DID doesn't match the value in cache or memory, then
> +     * it's not a pasid entry we want to invalidate.

I think comparing the DID applies to the case in which pc_entry->valid is
true. If pc_entry->valid is false, there is no cached pc_entry yet. If the
pe in guest memory is valid, the pc_entry should be updated/set, and hence
the bind_pasid operation (added in a later patch) would be conducted.

> +     */
> +    switch (pc_info->type) {
> +    case VTD_INV_DESC_PASIDC_G_PASID_SI:
> +    case VTD_INV_DESC_PASIDC_G_DSI:
> +        if (pc_entry->valid) {
> +            did = VTD_SM_PASID_ENTRY_DID(&pc_entry->pasid_entry);
> +            if (pc_info->did == did) {
> +                break;
> +            }
> +        }
> +        did = VTD_SM_PASID_ENTRY_DID(&pe);
> +        if (pc_info->did == did) {
> +            break;
> +        }
> +        return;
> +    }
> +
> +    pc_entry->pasid_entry = pe;
> +    pc_entry->valid = true;
> +}
> +
> +static void vtd_pasid_cache_sync(IntelIOMMUState *s, VTDPASIDCacheInfo *pc_info)
> +{
> +    if (!s->fsts || !s->root_scalable || !s->dmar_enabled) {
> +        return;
> +    }
> +
> +    vtd_iommu_lock(s);
> +    g_hash_table_foreach(s->vtd_address_spaces, vtd_pasid_cache_sync_locked,
> +                         pc_info);
> +    vtd_iommu_unlock(s);
> +}
> +
> +static bool vtd_process_pasid_desc(IntelIOMMUState *s,
> +                                   VTDInvDesc *inv_desc)
> +{
> +    uint16_t did;
> +    uint32_t pasid;
> +    VTDPASIDCacheInfo pc_info = {};
> +    uint64_t mask[4] = {VTD_INV_DESC_PASIDC_RSVD_VAL0, VTD_INV_DESC_ALL_ONE,
> +                        VTD_INV_DESC_ALL_ONE, VTD_INV_DESC_ALL_ONE};
> +
> +    if (!vtd_inv_desc_reserved_check(s, inv_desc, mask, true,
> +                                     __func__, "pasid cache inv")) {
> +        return false;
> +    }
> +
> +    did = VTD_INV_DESC_PASIDC_DID(inv_desc);
> +    pasid = VTD_INV_DESC_PASIDC_PASID(inv_desc);
> +    pc_info.type = VTD_INV_DESC_PASIDC_G(inv_desc);
> +
> +    switch (pc_info.type) {
> +    case VTD_INV_DESC_PASIDC_G_DSI:
> +        trace_vtd_inv_desc_pasid_cache_dsi(did);
> +        pc_info.did = did;
> +        break;
> +
> +    case VTD_INV_DESC_PASIDC_G_PASID_SI:
> +        /* PASID selective implies a DID selective */
> +        trace_vtd_inv_desc_pasid_cache_psi(did, pasid);
> +        /* Currently non-RID_PASID invalidation requests are ignored */

I have some doubt whether this is safe given the ATS path (for emulated
devices) is merged. The ATS path supports non-RID_PASID if the emulated
device has the PASID cap. The lucky thing is that the ATS path does not
have a pasid-level cache, so skipping invalidation for non-RID_PASID is
not harmful so far. Just a note for other reviewers, although I didn't
see a problem here.

> +        if (pasid != RID_PASID) {
> +            return true;
> +        }
> +        pc_info.did = did;
> +        pc_info.pasid = pasid;
> +        break;
> +

Regards,
Yi Liu
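
A quick way for readers to sanity-check the new descriptor field macros above: the snippet below re-implements QEMU's extract64() from qemu/bitops.h in plain C and decodes a hypothetical PASID-cache invalidate descriptor word. The helper names (pasidc_g, make_val0, etc.) and the sample values are illustrative only; the bit offsets mirror the VTD_INV_DESC_PASIDC_* definitions in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Plain-C stand-in for QEMU's extract64() from qemu/bitops.h. */
static uint64_t extract64(uint64_t value, int start, int length)
{
    return (value >> start) & (~0ULL >> (64 - length));
}

/* Field accessors mirroring VTD_INV_DESC_PASIDC_* in the patch. */
static uint64_t pasidc_g(uint64_t val0)     { return extract64(val0, 4, 2); }
static uint64_t pasidc_did(uint64_t val0)   { return extract64(val0, 16, 16); }
static uint64_t pasidc_pasid(uint64_t val0) { return extract64(val0, 32, 20); }

/* Hypothetical descriptor word: G=1 (PASID-SI), DID=0x1234, PASID=0x56789. */
static uint64_t make_val0(void)
{
    return (0x56789ULL << 32) | (0x1234ULL << 16) | (1ULL << 4);
}
```

Decoding make_val0() yields granularity 1 (PASID-selective), DID 0x1234 and PASID 0x56789.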


^ permalink raw reply	[flat|nested] 57+ messages in thread

* RE: [PATCH v6 03/22] intel_iommu: Update terminology to match VTD spec
  2025-10-12 12:30   ` Yi Liu
@ 2025-10-13  6:20     ` Duan, Zhenzhong
  0 siblings, 0 replies; 57+ messages in thread
From: Duan, Zhenzhong @ 2025-10-13  6:20 UTC (permalink / raw)
  To: Liu, Yi L, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, eric.auger@redhat.com,
	mst@redhat.com, jasowang@redhat.com, peterx@redhat.com,
	ddutile@redhat.com, jgg@nvidia.com, nicolinc@nvidia.com,
	skolothumtho@nvidia.com, joao.m.martins@oracle.com,
	clement.mathieu--drif@eviden.com, Tian,  Kevin, Peng, Chao P,
	Paolo Bonzini



>-----Original Message-----
>From: Liu, Yi L <yi.l.liu@intel.com>
>Subject: Re: [PATCH v6 03/22] intel_iommu: Update terminology to match
>VTD spec
>
>On 2025/9/18 16:57, Zhenzhong Duan wrote:
>> VTD spec revision 3.4 released in December 2021 renamed "First-level" to
>> "First-stage" and "Second-level" to "Second-stage".
>>
>> Do the same in intel_iommu code to match spec, change all existing
>> "fl/sl/FL/SL/first level/second level/stage-1/stage-2" terminology to
>> "fs/ss/FS/SS/first stage/second stage".
>>
>> No functional changes intended.
>
>LGTM.
>
>Reviewed-by: Yi Liu <yi.l.liu@intel.com>
>
>[...]
>
>>
>> -    if (!s->scalable_mode && s->flts) {
>> +    if (!s->scalable_mode && s->fsts) {
>>           error_setg(errp, "x-flts is only available in scalable mode");
>
>just realized we don't have a chance to rename x-flts..

Ah, yes.

>
>>           return false;
>>       }
>>
>> -    if (!s->flts && s->aw_bits != VTD_HOST_AW_39BIT &&
>> +    if (!s->fsts && s->aw_bits != VTD_HOST_AW_39BIT &&
>>           s->aw_bits != VTD_HOST_AW_48BIT) {
>>           error_setg(errp, "%s: supported values for aw-bits
>are: %d, %d",
>>                      s->scalable_mode ? "Scalable mode(flts=off)" :
>"Legacy mode",
>> @@ -4877,7 +4879,7 @@ static bool vtd_decide_config(IntelIOMMUState
>*s, Error **errp)
>>           return false;
>>       }
>>
>> -    if (s->flts && s->aw_bits != VTD_HOST_AW_48BIT) {
>> +    if (s->fsts && s->aw_bits != VTD_HOST_AW_48BIT) {
>>           error_setg(errp,
>>                      "Scalable mode(flts=on): supported value for
>aw-bits is: %d",
>
>this should be x-flts=on. right? not a fault of this patch.

Yes, let me fix it in this patch opportunistically.

Thanks
Zhenzhong


^ permalink raw reply	[flat|nested] 57+ messages in thread

* RE: [PATCH v6 05/22] hw/pci: Introduce pci_device_get_viommu_flags()
  2025-10-12 12:26   ` Yi Liu
@ 2025-10-13  6:24     ` Duan, Zhenzhong
  0 siblings, 0 replies; 57+ messages in thread
From: Duan, Zhenzhong @ 2025-10-13  6:24 UTC (permalink / raw)
  To: Liu, Yi L, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, eric.auger@redhat.com,
	mst@redhat.com, jasowang@redhat.com, peterx@redhat.com,
	ddutile@redhat.com, jgg@nvidia.com, nicolinc@nvidia.com,
	skolothumtho@nvidia.com, joao.m.martins@oracle.com,
	clement.mathieu--drif@eviden.com, Tian,  Kevin, Peng, Chao P



>-----Original Message-----
>From: Liu, Yi L <yi.l.liu@intel.com>
>Subject: Re: [PATCH v6 05/22] hw/pci: Introduce
>pci_device_get_viommu_flags()
>
>On 2025/9/18 16:57, Zhenzhong Duan wrote:
>> Introduce a new PCIIOMMUOps optional callback, get_viommu_flags()
>which
>> allows to retrieve flags exposed by a vIOMMU. The first planned vIOMMU
>> device flag is VIOMMU_FLAG_WANT_NESTING_PARENT that advertises the
>> support of HW nested stage translation scheme and wants other sub-system
>> like VFIO's cooperation to create nesting parent HWPT.
>>
>> pci_device_get_viommu_flags() is a wrapper that can be called on a PCI
>> device potentially protected by a vIOMMU.
>>
>> get_viommu_flags() is designed to return 64bit bitmap of purely vIOMMU
>> flags which are only determined by user's configuration, no host
>> capabilities involved. Reasons are:
>>
>> 1. host may have heterogeneous IOMMUs, each with different capabilities
>> 2. this is migration friendly, return value is consistent between source
>>     and target.
>> 3. host IOMMU capabilities are passed to vIOMMU through
>set_iommu_device()
>>     interface which have to be after attach_device(), when
>get_viommu_flags()
>>     is called in attach_device(), there is no way for vIOMMU to get host
>>     IOMMU capabilities yet, so only pure vIOMMU flags can be returned.
>>     See below sequence:
>>
>>       vfio_device_attach():
>>           iommufd_cdev_attach():
>>               pci_device_get_viommu_flags() for HW nesting cap
>>               create a nesting parent HWPT
>>               attach device to the HWPT
>>               vfio_device_hiod_create_and_realize() creating hiod
>>       ...
>>       pci_device_set_iommu_device(hiod)
>>
>> Suggested-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>>   MAINTAINERS          |  1 +
>>   include/hw/iommu.h   | 19 +++++++++++++++++++
>>   include/hw/pci/pci.h | 27 +++++++++++++++++++++++++++
>>   hw/pci/pci.c         | 11 +++++++++++
>>   4 files changed, 58 insertions(+)
>>   create mode 100644 include/hw/iommu.h
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index f8cd513d8b..71457e4cde 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -2307,6 +2307,7 @@ F: include/system/iommufd.h
>>   F: backends/host_iommu_device.c
>>   F: include/system/host_iommu_device.h
>>   F: include/qemu/chardev_open.h
>> +F: include/hw/iommu.h
>>   F: util/chardev_open.c
>>   F: docs/devel/vfio-iommufd.rst
>>
>> diff --git a/include/hw/iommu.h b/include/hw/iommu.h
>> new file mode 100644
>> index 0000000000..65d652950a
>> --- /dev/null
>> +++ b/include/hw/iommu.h
>> @@ -0,0 +1,19 @@
>> +/*
>> + * General vIOMMU flags
>> + *
>> + * Copyright (C) 2025 Intel Corporation.
>> + *
>> + * SPDX-License-Identifier: GPL-2.0-or-later
>> + */
>> +
>> +#ifndef HW_IOMMU_H
>> +#define HW_IOMMU_H
>> +
>> +#include "qemu/bitops.h"
>> +
>> +enum {
>> +    /* Nesting parent HWPT will be reused by vIOMMU to create nested
>HWPT */
>
>vIOMMU needs nesting parent HWPT to create nested HWPT

Will do.

Thanks
Zhenzhong

^ permalink raw reply	[flat|nested] 57+ messages in thread

* RE: [PATCH v6 06/22] intel_iommu: Implement get_viommu_flags() callback
  2025-10-12 12:28   ` Yi Liu
@ 2025-10-13  6:26     ` Duan, Zhenzhong
  0 siblings, 0 replies; 57+ messages in thread
From: Duan, Zhenzhong @ 2025-10-13  6:26 UTC (permalink / raw)
  To: Liu, Yi L, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, eric.auger@redhat.com,
	mst@redhat.com, jasowang@redhat.com, peterx@redhat.com,
	ddutile@redhat.com, jgg@nvidia.com, nicolinc@nvidia.com,
	skolothumtho@nvidia.com, joao.m.martins@oracle.com,
	clement.mathieu--drif@eviden.com, Tian,  Kevin, Peng, Chao P



>-----Original Message-----
>From: Liu, Yi L <yi.l.liu@intel.com>
>Subject: Re: [PATCH v6 06/22] intel_iommu: Implement get_viommu_flags()
>callback
>
>On 2025/9/18 16:57, Zhenzhong Duan wrote:
>> Implement get_viommu_flags() callback and expose a request for nesting
>> parent HWPT for now.
>>
>> VFIO uses it to create nesting parent HWPT which is further used to create
>> nested HWPT in vIOMMU. All these will be implemented in following
>patches.
>>
>> Suggested-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> Reviewed-by: Eric Auger <eric.auger@redhat.com>
>> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
>> ---
>>   hw/i386/intel_iommu.c | 12 ++++++++++++
>>   1 file changed, 12 insertions(+)
>>
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index a47482ba9d..83c40975cc 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -24,6 +24,7 @@
>>   #include "qemu/main-loop.h"
>>   #include "qapi/error.h"
>>   #include "hw/sysbus.h"
>> +#include "hw/iommu.h"
>>   #include "intel_iommu_internal.h"
>>   #include "hw/pci/pci.h"
>>   #include "hw/pci/pci_bus.h"
>> @@ -4412,6 +4413,16 @@ static void
>vtd_dev_unset_iommu_device(PCIBus *bus, void *opaque, int devfn)
>>       vtd_iommu_unlock(s);
>>   }
>>
>> +static uint64_t vtd_get_viommu_flags(void *opaque)
>> +{
>> +    IntelIOMMUState *s = opaque;
>> +    uint64_t caps;
>
>s/caps/flags

done

Thanks
Zhenzhong

^ permalink raw reply	[flat|nested] 57+ messages in thread

* RE: [PATCH v6 10/22] intel_iommu: Check for compatibility with IOMMUFD backed device when x-flts=on
  2025-10-12 12:55   ` Yi Liu
@ 2025-10-13  6:48     ` Duan, Zhenzhong
  0 siblings, 0 replies; 57+ messages in thread
From: Duan, Zhenzhong @ 2025-10-13  6:48 UTC (permalink / raw)
  To: Liu, Yi L, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, eric.auger@redhat.com,
	mst@redhat.com, jasowang@redhat.com, peterx@redhat.com,
	ddutile@redhat.com, jgg@nvidia.com, nicolinc@nvidia.com,
	skolothumtho@nvidia.com, joao.m.martins@oracle.com,
	clement.mathieu--drif@eviden.com, Tian,  Kevin, Peng, Chao P



>-----Original Message-----
>From: Liu, Yi L <yi.l.liu@intel.com>
>Subject: Re: [PATCH v6 10/22] intel_iommu: Check for compatibility with
>IOMMUFD backed device when x-flts=on
>
>On 2025/9/18 16:57, Zhenzhong Duan wrote:
>> When vIOMMU is configured x-flts=on in scalable mode, first stage page
>table
>> is passed to host to construct nested page table for passthrough devices.
>>
>> We need to check compatibility of some critical IOMMU capabilities
>between
>> vIOMMU and host IOMMU to ensure guest first stage page table could be
>used by
>> host.
>>
>> For instance, vIOMMU supports first stage 1GB large page mapping, but
>host does
>> not, then this IOMMUFD backed device should fail.
>>
>> Even if the checks pass, for now we willingly reject the association because
>> all the bits are not there yet; this will be relaxed at the end of this series.
>
>might be good to note that the nested cap is required but it's already
>covered in the core, so this patch does not check it. Otherwise, readers
>who haven't checked previous comments may wonder why the check is not added.

Sure, will add:

"Note vIOMMU has exposed the IOMMU_HWPT_ALLOC_NEST_PARENT flag to force VFIO core
to create a nesting parent HWPT; if the host doesn't support nested translation,
the creation will fail. So there is no need to check the nested capability here."

Thanks
Zhenzhong

^ permalink raw reply	[flat|nested] 57+ messages in thread

* RE: [PATCH v6 12/22] intel_iommu: Handle PASID cache invalidation
  2025-10-12 14:58   ` Yi Liu
@ 2025-10-13  7:37     ` Duan, Zhenzhong
  2025-10-13 12:53       ` Yi Liu
  0 siblings, 1 reply; 57+ messages in thread
From: Duan, Zhenzhong @ 2025-10-13  7:37 UTC (permalink / raw)
  To: Liu, Yi L, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, eric.auger@redhat.com,
	mst@redhat.com, jasowang@redhat.com, peterx@redhat.com,
	ddutile@redhat.com, jgg@nvidia.com, nicolinc@nvidia.com,
	skolothumtho@nvidia.com, joao.m.martins@oracle.com,
	clement.mathieu--drif@eviden.com, Tian,  Kevin, Peng, Chao P



>-----Original Message-----
>From: Liu, Yi L <yi.l.liu@intel.com>
>Subject: Re: [PATCH v6 12/22] intel_iommu: Handle PASID cache invalidation
>
>On 2025/9/18 16:57, Zhenzhong Duan wrote:
>> This adds PASID cache sync for RID_PASID, non-RID_PASID isn't supported.
>>
>> Adds an new entry VTDPASIDCacheEntry in VTDAddressSpace to cache the
>pasid
>> entry and track PASID usage and future PASID tagged DMA address
>translation
>> support in vIOMMU.
>>
>> When guest triggers pasid cache invalidation, QEMU will capture it and
>> update or invalidate pasid cache.
>>
>> vIOMMU emulator could figure out the reason by fetching latest guest pasid
>> entry in memory and compare it with cached PASID entry if it's valid.
>>
>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>>   hw/i386/intel_iommu_internal.h |  19 +++-
>>   include/hw/i386/intel_iommu.h  |   6 ++
>>   hw/i386/intel_iommu.c          | 157
>++++++++++++++++++++++++++++++---
>>   hw/i386/trace-events           |   3 +
>>   4 files changed, 173 insertions(+), 12 deletions(-)
>>
>> diff --git a/hw/i386/intel_iommu_internal.h
>b/hw/i386/intel_iommu_internal.h
>> index 9cdc8d5dbb..d400bcee21 100644
>> --- a/hw/i386/intel_iommu_internal.h
>> +++ b/hw/i386/intel_iommu_internal.h
>> @@ -316,6 +316,7 @@ typedef enum VTDFaultReason {
>>                                     * request while disabled */
>>       VTD_FR_IR_SID_ERR = 0x26,   /* Invalid Source-ID */
>>
>> +    VTD_FR_RTADDR_INV_TTM = 0x31,  /* Invalid TTM in RTADDR */
>>       /* PASID directory entry access failure */
>>       VTD_FR_PASID_DIR_ACCESS_ERR = 0x50,
>>       /* The Present(P) field of pasid directory entry is 0 */
>> @@ -493,6 +494,15 @@ typedef union VTDInvDesc VTDInvDesc;
>>   #define VTD_INV_DESC_PIOTLB_RSVD_VAL0
>0xfff000000000f1c0ULL
>>   #define VTD_INV_DESC_PIOTLB_RSVD_VAL1     0xf80ULL
>>
>> +/* PASID-cache Invalidate Descriptor (pc_inv_dsc) fields */
>> +#define VTD_INV_DESC_PASIDC_G(x)        extract64((x)->val[0], 4, 2)
>> +#define VTD_INV_DESC_PASIDC_G_DSI       0
>> +#define VTD_INV_DESC_PASIDC_G_PASID_SI  1
>> +#define VTD_INV_DESC_PASIDC_G_GLOBAL    3
>> +#define VTD_INV_DESC_PASIDC_DID(x)      extract64((x)->val[0], 16,
>16)
>> +#define VTD_INV_DESC_PASIDC_PASID(x)    extract64((x)->val[0], 32,
>20)
>> +#define VTD_INV_DESC_PASIDC_RSVD_VAL0   0xfff000000000f1c0ULL
>> +
>>   /* Information about page-selective IOTLB invalidate */
>>   struct VTDIOTLBPageInvInfo {
>>       uint16_t domain_id;
>> @@ -552,6 +562,13 @@ typedef struct VTDRootEntry VTDRootEntry;
>>   #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw)  (0x1e0ULL |
>~VTD_HAW_MASK(aw))
>>   #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1
>0xffffffffffe00000ULL
>>
>> +typedef struct VTDPASIDCacheInfo {
>> +    uint8_t type;
>> +    uint16_t did;
>> +    uint32_t pasid;
>> +    bool reset;
>> +} VTDPASIDCacheInfo;
>> +
>>   /* PASID Table Related Definitions */
>>   #define VTD_PASID_DIR_BASE_ADDR_MASK  (~0xfffULL)
>>   #define VTD_PASID_TABLE_BASE_ADDR_MASK (~0xfffULL)
>> @@ -573,7 +590,7 @@ typedef struct VTDRootEntry VTDRootEntry;
>>   #define VTD_SM_PASID_ENTRY_PT          (4ULL << 6)
>>
>>   #define VTD_SM_PASID_ENTRY_AW          7ULL /* Adjusted
>guest-address-width */
>> -#define VTD_SM_PASID_ENTRY_DID(val)    ((val) &
>VTD_DOMAIN_ID_MASK)
>> +#define VTD_SM_PASID_ENTRY_DID(x)      extract64((x)->val[1], 0, 16)
>>
>>   #define VTD_SM_PASID_ENTRY_FSPM          3ULL
>>   #define VTD_SM_PASID_ENTRY_FSPTPTR       (~0xfffULL)
>> diff --git a/include/hw/i386/intel_iommu.h
>b/include/hw/i386/intel_iommu.h
>> index 3351892da0..ff01e5c82d 100644
>> --- a/include/hw/i386/intel_iommu.h
>> +++ b/include/hw/i386/intel_iommu.h
>> @@ -95,6 +95,11 @@ struct VTDPASIDEntry {
>>       uint64_t val[8];
>>   };
>>
>> +typedef struct VTDPASIDCacheEntry {
>> +    struct VTDPASIDEntry pasid_entry;
>> +    bool valid;
>> +} VTDPASIDCacheEntry;
>> +
>>   struct VTDAddressSpace {
>>       PCIBus *bus;
>>       uint8_t devfn;
>> @@ -107,6 +112,7 @@ struct VTDAddressSpace {
>>       MemoryRegion iommu_ir_fault; /* Interrupt region for catching
>fault */
>>       IntelIOMMUState *iommu_state;
>>       VTDContextCacheEntry context_cache_entry;
>> +    VTDPASIDCacheEntry pasid_cache_entry;
>>       QLIST_ENTRY(VTDAddressSpace) next;
>>       /* Superset of notifier flags that this address space has */
>>       IOMMUNotifierFlag notifier_flags;
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index d37d47115a..24061f6dc6 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -1614,7 +1614,7 @@ static uint16_t
>vtd_get_domain_id(IntelIOMMUState *s,
>>
>>       if (s->root_scalable) {
>>           vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
>> -        return VTD_SM_PASID_ENTRY_DID(pe.val[1]);
>> +        return VTD_SM_PASID_ENTRY_DID(&pe);
>>       }
>>
>>       return VTD_CONTEXT_ENTRY_DID(ce->hi);
>> @@ -3074,6 +3074,144 @@ static bool
>vtd_process_piotlb_desc(IntelIOMMUState *s,
>>       return true;
>>   }
>>
>> +static inline int vtd_dev_get_pe_from_pasid(VTDAddressSpace *vtd_as,
>> +                                            VTDPASIDEntry *pe)
>> +{
>> +    IntelIOMMUState *s = vtd_as->iommu_state;
>> +    VTDContextEntry ce;
>> +    int ret;
>> +
>> +    if (!s->root_scalable) {
>> +        return -VTD_FR_RTADDR_INV_TTM;
>> +    }
>> +
>> +    ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
>vtd_as->devfn,
>> +                                   &ce);
>> +    if (ret) {
>> +        return ret;
>> +    }
>> +
>> +    return vtd_ce_get_pasid_entry(s, &ce, pe, vtd_as->pasid);
>> +}
>> +
>> +/*
>> + * For each IOMMUFD backed device, update or invalidate pasid cache
>based on
>> + * the value in memory.
>> + */
>> +static void vtd_pasid_cache_sync_locked(gpointer key, gpointer value,
>> +                                        gpointer user_data)
>> +{
>> +    VTDPASIDCacheInfo *pc_info = user_data;
>> +    VTDAddressSpace *vtd_as = value;
>> +    VTDPASIDCacheEntry *pc_entry = &vtd_as->pasid_cache_entry;
>> +    VTDPASIDEntry pe;
>> +    uint16_t did;
>> +
>> +    /* Ignore emulated device or legacy VFIO backed device */
>> +    if (!vtd_find_hiod_iommufd(vtd_as)) {
>> +        return;
>> +    }
>> +
>> +    /* non-RID_PASID isn't supported yet */
>> +    assert(vtd_as->pasid == PCI_NO_PASID);
>> +
>> +    if (vtd_dev_get_pe_from_pasid(vtd_as, &pe)) {
>> +        /*
>> +         * No valid pasid entry in guest memory. e.g. pasid entry was
>modified
>> +         * to be either all-zero or non-present. Either case means
>existing
>> +         * pasid cache should be invalidated.
>> +         */
>> +        pc_entry->valid = false;
>> +        return;
>> +    }
>> +
>> +    /*
>> +     * VTD_INV_DESC_PASIDC_G_DSI and
>VTD_INV_DESC_PASIDC_G_PASID_SI require
>> +     * DID check. If DID doesn't match the value in cache or memory,
>then
>> +     * it's not a pasid entry we want to invalidate.
>
>I think comparing DID applies to the case in which pc_entry->valid is
>true. If pc_entry->valid is false, this means no cached pc_entry yet. If
>pe in guest memory is valid, the pc_entry should be updated/set hence
>the bind_pasid operation (added in later patch) would be conducted.

We get here only when the pe in guest memory is valid, otherwise we would have
returned at the "if (vtd_dev_get_pe_from_pasid(vtd_as, &pe)) {" check.

If there is no cached pe but a valid pe in guest memory, that means a new pe.
For a new entry, the guest constructs the pasid cache invalidation request with
the DID field filled with the DID from the pe in memory. We don't
unconditionally cache a new pe for all devices on one pasid cache invalidation
unless it's a global invalidation.
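
To make that concrete for other reviewers, the DID-match decision described above can be sketched like this. The types and helper name (PasidCacheEntry, pasidc_did_matches) are simplified, hypothetical stand-ins, not the actual QEMU structs:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for VTDPASIDCacheEntry; illustrative only. */
typedef struct {
    bool valid;
    uint16_t cached_did;    /* DID taken from the cached pasid entry */
} PasidCacheEntry;

/*
 * DSI/PASID-SI invalidation with request DID 'req_did': match against the
 * cached DID when a cache entry exists, and always against the DID of the
 * pasid entry currently in guest memory ('mem_did'), so that a newly
 * written entry whose DID matches the request still gets cached.
 */
static bool pasidc_did_matches(const PasidCacheEntry *pc, uint16_t mem_did,
                               uint16_t req_did)
{
    if (pc->valid && pc->cached_did == req_did) {
        return true;
    }
    return mem_did == req_did;
}
```

A global invalidation skips this check entirely and refreshes the entry for every device.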

>
>> +     */
>> +    switch (pc_info->type) {
>> +    case VTD_INV_DESC_PASIDC_G_PASID_SI:
>> +    case VTD_INV_DESC_PASIDC_G_DSI:
>> +        if (pc_entry->valid) {
>> +            did = VTD_SM_PASID_ENTRY_DID(&pc_entry->pasid_entry);
>> +            if (pc_info->did == did) {
>> +                break;
>> +            }
>> +        }
>> +        did = VTD_SM_PASID_ENTRY_DID(&pe);
>> +        if (pc_info->did == did) {
>> +            break;
>> +        }
>> +        return;
>> +    }
>> +
>> +    pc_entry->pasid_entry = pe;
>> +    pc_entry->valid = true;
>> +}
>> +
>> +static void vtd_pasid_cache_sync(IntelIOMMUState *s,
>VTDPASIDCacheInfo *pc_info)
>> +{
>> +    if (!s->fsts || !s->root_scalable || !s->dmar_enabled) {
>> +        return;
>> +    }
>> +
>> +    vtd_iommu_lock(s);
>> +    g_hash_table_foreach(s->vtd_address_spaces,
>vtd_pasid_cache_sync_locked,
>> +                         pc_info);
>> +    vtd_iommu_unlock(s);
>> +}
>> +
>> +static bool vtd_process_pasid_desc(IntelIOMMUState *s,
>> +                                   VTDInvDesc *inv_desc)
>> +{
>> +    uint16_t did;
>> +    uint32_t pasid;
>> +    VTDPASIDCacheInfo pc_info = {};
>> +    uint64_t mask[4] = {VTD_INV_DESC_PASIDC_RSVD_VAL0,
>VTD_INV_DESC_ALL_ONE,
>> +                        VTD_INV_DESC_ALL_ONE,
>VTD_INV_DESC_ALL_ONE};
>> +
>> +    if (!vtd_inv_desc_reserved_check(s, inv_desc, mask, true,
>> +                                     __func__, "pasid cache inv"))
>{
>> +        return false;
>> +    }
>> +
>> +    did = VTD_INV_DESC_PASIDC_DID(inv_desc);
>> +    pasid = VTD_INV_DESC_PASIDC_PASID(inv_desc);
>> +    pc_info.type = VTD_INV_DESC_PASIDC_G(inv_desc);
>> +
>> +    switch (pc_info.type) {
>> +    case VTD_INV_DESC_PASIDC_G_DSI:
>> +        trace_vtd_inv_desc_pasid_cache_dsi(did);
>> +        pc_info.did = did;
>> +        break;
>> +
>> +    case VTD_INV_DESC_PASIDC_G_PASID_SI:
>> +        /* PASID selective implies a DID selective */
>> +        trace_vtd_inv_desc_pasid_cache_psi(did, pasid);
>> +        /* Currently non-RID_PASID invalidation requests are ignored */
>
>I'm a bit doubting if this is safe given the ATS path (for emulated
>device) is merged. ATS path supports non-RID_PASID if emulated device
>has PASID cap. The lucky thing is that the ATS path does not have
>pasid level cache. So skipping invalidation for non-RID_PASID is not
>harmful so far. Just a note to other reviewers although I didn't see a
>problem here.

Yes, there is no emulated device supporting the PASID cap currently,
so I don't cache the pasid entry for emulated devices for now.

Thanks
Zhenzhong

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v6 13/22] intel_iommu: Reset pasid cache when system level reset
  2025-09-18  8:57 ` [PATCH v6 13/22] intel_iommu: Reset pasid cache when system level reset Zhenzhong Duan
@ 2025-10-13 10:25   ` Yi Liu
  0 siblings, 0 replies; 57+ messages in thread
From: Yi Liu @ 2025-10-13 10:25 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, skolothumtho, joao.m.martins,
	clement.mathieu--drif, kevin.tian, chao.p.peng

On 2025/9/18 16:57, Zhenzhong Duan wrote:
> Reset the pasid cache at system level reset. For RID_PASID, its vtd_as is
> allocated by the PCI system and never removed, so just mark the pasid cache
> invalid.
> 
> As we already have vtd_pasid_cache_sync_locked() to handle pasid cache
> invalidation, reuse it to do pasid cache invalidation at system reset
> level.
> 
> Currently only IOMMUFD backed VFIO device caches pasid entry, so we don't
> need to care about emulated device.
> 
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>   hw/i386/intel_iommu.c | 15 ++++++++++++++-
>   hw/i386/trace-events  |  1 +
>   2 files changed, 15 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 24061f6dc6..a6638e13be 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -85,6 +85,18 @@ struct vtd_iotlb_key {
>   
>   static void vtd_address_space_refresh_all(IntelIOMMUState *s);
>   static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
> +static void vtd_pasid_cache_sync_locked(gpointer key, gpointer value,
> +                                        gpointer user_data);
> +
> +static void vtd_pasid_cache_reset_locked(IntelIOMMUState *s)
> +{
> +    VTDPASIDCacheInfo pc_info = { .reset = true };
> +
> +    trace_vtd_pasid_cache_reset();
> +    g_hash_table_foreach(s->vtd_address_spaces,
> +                         vtd_pasid_cache_sync_locked, &pc_info);
> +}
> +
>   
>   static void vtd_panic_require_caching_mode(void)
>   {
> @@ -390,6 +402,7 @@ static void vtd_reset_caches(IntelIOMMUState *s)
>       vtd_iommu_lock(s);
>       vtd_reset_iotlb_locked(s);
>       vtd_reset_context_cache_locked(s);
> +    vtd_pasid_cache_reset_locked(s);
>       vtd_iommu_unlock(s);
>   }
>   
> @@ -3115,7 +3128,7 @@ static void vtd_pasid_cache_sync_locked(gpointer key, gpointer value,
>       /* non-RID_PASID isn't supported yet */
>       assert(vtd_as->pasid == PCI_NO_PASID);
>   
> -    if (vtd_dev_get_pe_from_pasid(vtd_as, &pe)) {
> +    if (pc_info->reset || vtd_dev_get_pe_from_pasid(vtd_as, &pe)) {
>           /*
>            * No valid pasid entry in guest memory. e.g. pasid entry was modified
>            * to be either all-zero or non-present. Either case means existing

do you want to update the comment accordingly? otherwise, the patch
looks good.

Reviewed-by: Yi Liu <yi.l.liu@intel.com>

Regards,
Yi Liu


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v6 14/22] intel_iommu: Add some macros and inline functions
  2025-09-18  8:57 ` [PATCH v6 14/22] intel_iommu: Add some macros and inline functions Zhenzhong Duan
@ 2025-10-13 10:25   ` Yi Liu
  0 siblings, 0 replies; 57+ messages in thread
From: Yi Liu @ 2025-10-13 10:25 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, skolothumtho, joao.m.martins,
	clement.mathieu--drif, kevin.tian, chao.p.peng

On 2025/9/18 16:57, Zhenzhong Duan wrote:
> Add some macros and inline functions that will be used by following
> patch.
> 
> This patch also makes a cleanup to change the macro VTD_SM_PASID_ENTRY_FSPM
> to use extract64() just like what smmu does, because this macro is used
> indirectly by newly introduced inline functions. But we don't aim to
> change the huge amount of bit-mask-style macro definitions in this patch;
> that should be in a separate patch.
> 
> Suggested-by: Eric Auger <eric.auger@redhat.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>   hw/i386/intel_iommu_internal.h |  6 +++++-
>   hw/i386/intel_iommu.c          | 30 +++++++++++++++++++++++++++---
>   2 files changed, 32 insertions(+), 4 deletions(-)
> 
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> index d400bcee21..3d5ee5ed52 100644
> --- a/hw/i386/intel_iommu_internal.h
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -592,8 +592,12 @@ typedef struct VTDPASIDCacheInfo {
>   #define VTD_SM_PASID_ENTRY_AW          7ULL /* Adjusted guest-address-width */
>   #define VTD_SM_PASID_ENTRY_DID(x)      extract64((x)->val[1], 0, 16)
>   
> -#define VTD_SM_PASID_ENTRY_FSPM          3ULL
>   #define VTD_SM_PASID_ENTRY_FSPTPTR       (~0xfffULL)
> +#define VTD_SM_PASID_ENTRY_SRE_BIT(x)    extract64((x)->val[2], 0, 1)
> +/* 00: 4-level paging, 01: 5-level paging, 10-11: Reserved */
> +#define VTD_SM_PASID_ENTRY_FSPM(x)       extract64((x)->val[2], 2, 2)
> +#define VTD_SM_PASID_ENTRY_WPE_BIT(x)    extract64((x)->val[2], 4, 1)
> +#define VTD_SM_PASID_ENTRY_EAFE_BIT(x)   extract64((x)->val[2], 7, 1)
>   
>   /* First Level Paging Structure */
>   /* Masks for First Level Paging Entry */

Hmmm, is this missed by patch 02, which cleans up the FL/SL naming to
FS/SS?

> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index a6638e13be..5908368c44 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -49,8 +49,7 @@
>   
>   /* pe operations */
>   #define VTD_PE_GET_TYPE(pe) ((pe)->val[0] & VTD_SM_PASID_ENTRY_PGTT)
> -#define VTD_PE_GET_FS_LEVEL(pe) \
> -    (4 + (((pe)->val[2] >> 2) & VTD_SM_PASID_ENTRY_FSPM))
> +#define VTD_PE_GET_FS_LEVEL(pe) (VTD_SM_PASID_ENTRY_FSPM(pe) + 4)
>   #define VTD_PE_GET_SS_LEVEL(pe) \
>       (2 + (((pe)->val[0] >> 2) & VTD_SM_PASID_ENTRY_AW))
>   
> @@ -838,6 +837,31 @@ static inline bool vtd_pe_type_check(IntelIOMMUState *s, VTDPASIDEntry *pe)
>       }
>   }
>   
> +static inline dma_addr_t vtd_pe_get_fspt_base(VTDPASIDEntry *pe)
> +{
> +    return pe->val[2] & VTD_SM_PASID_ENTRY_FSPTPTR;
> +}
> +
> +/*
> + * First stage IOVA address width: 48 bits for 4-level paging(FSPM=00)
> + *                                 57 bits for 5-level paging(FSPM=01)
> + */
> +static inline uint32_t vtd_pe_get_fs_aw(VTDPASIDEntry *pe)
> +{
> +    return 48 + VTD_SM_PASID_ENTRY_FSPM(pe) * 9;
> +}
> +
> +static inline bool vtd_pe_pgtt_is_pt(VTDPASIDEntry *pe)
> +{
> +    return (VTD_PE_GET_TYPE(pe) == VTD_SM_PASID_ENTRY_PT);
> +}
> +
> +/* check if pgtt is first stage translation */
> +static inline bool vtd_pe_pgtt_is_fst(VTDPASIDEntry *pe)
> +{
> +    return (VTD_PE_GET_TYPE(pe) == VTD_SM_PASID_ENTRY_FST);
> +}
> +
>   static inline bool vtd_pdire_present(VTDPASIDDirEntry *pdire)
>   {
>       return pdire->val & 1;
> @@ -1709,7 +1733,7 @@ static bool vtd_dev_pt_enabled(IntelIOMMUState *s, VTDContextEntry *ce,
>                */
>               return false;
>           }
> -        return (VTD_PE_GET_TYPE(&pe) == VTD_SM_PASID_ENTRY_PT);
> +        return vtd_pe_pgtt_is_pt(&pe);
>       }
>   
>       return (vtd_ce_get_type(ce) == VTD_CONTEXT_TT_PASS_THROUGH);

Reviewed-by: Yi Liu <yi.l.liu@intel.com>



* Re: [PATCH v6 12/22] intel_iommu: Handle PASID cache invalidation
  2025-10-13  7:37     ` Duan, Zhenzhong
@ 2025-10-13 12:53       ` Yi Liu
  0 siblings, 0 replies; 57+ messages in thread
From: Yi Liu @ 2025-10-13 12:53 UTC (permalink / raw)
  To: Duan, Zhenzhong, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, eric.auger@redhat.com,
	mst@redhat.com, jasowang@redhat.com, peterx@redhat.com,
	ddutile@redhat.com, jgg@nvidia.com, nicolinc@nvidia.com,
	skolothumtho@nvidia.com, joao.m.martins@oracle.com,
	clement.mathieu--drif@eviden.com, Tian,  Kevin, Peng, Chao P

On 2025/10/13 15:37, Duan, Zhenzhong wrote:
> 
> 
>> -----Original Message-----
>> From: Liu, Yi L <yi.l.liu@intel.com>
>> Subject: Re: [PATCH v6 12/22] intel_iommu: Handle PASID cache invalidation
>>
>> On 2025/9/18 16:57, Zhenzhong Duan wrote:
>>> This adds PASID cache sync for RID_PASID, non-RID_PASID isn't supported.
>>>
>>> Adds a new entry VTDPASIDCacheEntry in VTDAddressSpace to cache the pasid
>>> entry and track PASID usage and future PASID-tagged DMA address translation
>>> support in vIOMMU.
>>>
>>> When guest triggers pasid cache invalidation, QEMU will capture it and
>>> update or invalidate pasid cache.
>>>
>>> The vIOMMU emulator could figure out the reason by fetching the latest guest
>>> pasid entry in memory and comparing it with the cached PASID entry, if that
>>> is valid.
>>>
>>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>> ---
>>>    hw/i386/intel_iommu_internal.h |  19 +++-
>>>    include/hw/i386/intel_iommu.h  |   6 ++
>>>    hw/i386/intel_iommu.c          | 157 ++++++++++++++++++++++++++++++---
>>>    hw/i386/trace-events           |   3 +
>>>    4 files changed, 173 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
>>> index 9cdc8d5dbb..d400bcee21 100644
>>> --- a/hw/i386/intel_iommu_internal.h
>>> +++ b/hw/i386/intel_iommu_internal.h
>>> @@ -316,6 +316,7 @@ typedef enum VTDFaultReason {
>>>                                      * request while disabled */
>>>        VTD_FR_IR_SID_ERR = 0x26,   /* Invalid Source-ID */
>>>
>>> +    VTD_FR_RTADDR_INV_TTM = 0x31,  /* Invalid TTM in RTADDR */
>>>        /* PASID directory entry access failure */
>>>        VTD_FR_PASID_DIR_ACCESS_ERR = 0x50,
>>>        /* The Present(P) field of pasid directory entry is 0 */
>>> @@ -493,6 +494,15 @@ typedef union VTDInvDesc VTDInvDesc;
>>>    #define VTD_INV_DESC_PIOTLB_RSVD_VAL0     0xfff000000000f1c0ULL
>>>    #define VTD_INV_DESC_PIOTLB_RSVD_VAL1     0xf80ULL
>>>
>>> +/* PASID-cache Invalidate Descriptor (pc_inv_dsc) fields */
>>> +#define VTD_INV_DESC_PASIDC_G(x)        extract64((x)->val[0], 4, 2)
>>> +#define VTD_INV_DESC_PASIDC_G_DSI       0
>>> +#define VTD_INV_DESC_PASIDC_G_PASID_SI  1
>>> +#define VTD_INV_DESC_PASIDC_G_GLOBAL    3
>>> +#define VTD_INV_DESC_PASIDC_DID(x)      extract64((x)->val[0], 16, 16)
>>> +#define VTD_INV_DESC_PASIDC_PASID(x)    extract64((x)->val[0], 32, 20)
>>> +#define VTD_INV_DESC_PASIDC_RSVD_VAL0   0xfff000000000f1c0ULL
>>> +
>>>    /* Information about page-selective IOTLB invalidate */
>>>    struct VTDIOTLBPageInvInfo {
>>>        uint16_t domain_id;
>>> @@ -552,6 +562,13 @@ typedef struct VTDRootEntry VTDRootEntry;
>>>    #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw)  (0x1e0ULL | ~VTD_HAW_MASK(aw))
>>>    #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1      0xffffffffffe00000ULL
>>>
>>> +typedef struct VTDPASIDCacheInfo {
>>> +    uint8_t type;
>>> +    uint16_t did;
>>> +    uint32_t pasid;
>>> +    bool reset;
>>> +} VTDPASIDCacheInfo;
>>> +
>>>    /* PASID Table Related Definitions */
>>>    #define VTD_PASID_DIR_BASE_ADDR_MASK  (~0xfffULL)
>>>    #define VTD_PASID_TABLE_BASE_ADDR_MASK (~0xfffULL)
>>> @@ -573,7 +590,7 @@ typedef struct VTDRootEntry VTDRootEntry;
>>>    #define VTD_SM_PASID_ENTRY_PT          (4ULL << 6)
>>>
>>>    #define VTD_SM_PASID_ENTRY_AW          7ULL /* Adjusted guest-address-width */
>>> -#define VTD_SM_PASID_ENTRY_DID(val)    ((val) & VTD_DOMAIN_ID_MASK)
>>> +#define VTD_SM_PASID_ENTRY_DID(x)      extract64((x)->val[1], 0, 16)
>>>
>>>    #define VTD_SM_PASID_ENTRY_FSPM          3ULL
>>>    #define VTD_SM_PASID_ENTRY_FSPTPTR       (~0xfffULL)
>>> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
>>> index 3351892da0..ff01e5c82d 100644
>>> --- a/include/hw/i386/intel_iommu.h
>>> +++ b/include/hw/i386/intel_iommu.h
>>> @@ -95,6 +95,11 @@ struct VTDPASIDEntry {
>>>        uint64_t val[8];
>>>    };
>>>
>>> +typedef struct VTDPASIDCacheEntry {
>>> +    struct VTDPASIDEntry pasid_entry;
>>> +    bool valid;
>>> +} VTDPASIDCacheEntry;
>>> +
>>>    struct VTDAddressSpace {
>>>        PCIBus *bus;
>>>        uint8_t devfn;
>>> @@ -107,6 +112,7 @@ struct VTDAddressSpace {
>>>        MemoryRegion iommu_ir_fault; /* Interrupt region for catching fault */
>>>        IntelIOMMUState *iommu_state;
>>>        VTDContextCacheEntry context_cache_entry;
>>> +    VTDPASIDCacheEntry pasid_cache_entry;
>>>        QLIST_ENTRY(VTDAddressSpace) next;
>>>        /* Superset of notifier flags that this address space has */
>>>        IOMMUNotifierFlag notifier_flags;
>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>>> index d37d47115a..24061f6dc6 100644
>>> --- a/hw/i386/intel_iommu.c
>>> +++ b/hw/i386/intel_iommu.c
>>> @@ -1614,7 +1614,7 @@ static uint16_t vtd_get_domain_id(IntelIOMMUState *s,
>>>
>>>        if (s->root_scalable) {
>>>            vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
>>> -        return VTD_SM_PASID_ENTRY_DID(pe.val[1]);
>>> +        return VTD_SM_PASID_ENTRY_DID(&pe);
>>>        }
>>>
>>>        return VTD_CONTEXT_ENTRY_DID(ce->hi);
>>> @@ -3074,6 +3074,144 @@ static bool vtd_process_piotlb_desc(IntelIOMMUState *s,
>>>        return true;
>>>    }
>>>
>>> +static inline int vtd_dev_get_pe_from_pasid(VTDAddressSpace *vtd_as,
>>> +                                            VTDPASIDEntry *pe)
>>> +{
>>> +    IntelIOMMUState *s = vtd_as->iommu_state;
>>> +    VTDContextEntry ce;
>>> +    int ret;
>>> +
>>> +    if (!s->root_scalable) {
>>> +        return -VTD_FR_RTADDR_INV_TTM;
>>> +    }
>>> +
>>> +    ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus), vtd_as->devfn,
>>> +                                   &ce);
>>> +    if (ret) {
>>> +        return ret;
>>> +    }
>>> +
>>> +    return vtd_ce_get_pasid_entry(s, &ce, pe, vtd_as->pasid);
>>> +}
>>> +
>>> +/*
>>> + * For each IOMMUFD backed device, update or invalidate pasid cache based on
>>> + * the value in memory.
>>> + */
>>> +static void vtd_pasid_cache_sync_locked(gpointer key, gpointer value,
>>> +                                        gpointer user_data)
>>> +{
>>> +    VTDPASIDCacheInfo *pc_info = user_data;
>>> +    VTDAddressSpace *vtd_as = value;
>>> +    VTDPASIDCacheEntry *pc_entry = &vtd_as->pasid_cache_entry;
>>> +    VTDPASIDEntry pe;
>>> +    uint16_t did;
>>> +
>>> +    /* Ignore emulated device or legacy VFIO backed device */
>>> +    if (!vtd_find_hiod_iommufd(vtd_as)) {
>>> +        return;
>>> +    }
>>> +
>>> +    /* non-RID_PASID isn't supported yet */
>>> +    assert(vtd_as->pasid == PCI_NO_PASID);
>>> +
>>> +    if (vtd_dev_get_pe_from_pasid(vtd_as, &pe)) {
>>> +        /*
>>> +         * No valid pasid entry in guest memory. e.g. pasid entry was modified
>>> +         * to be either all-zero or non-present. Either case means existing
>>> +         * pasid cache should be invalidated.
>>> +         */
>>> +        pc_entry->valid = false;
>>> +        return;
>>> +    }
>>> +
>>> +    /*
>>> +     * VTD_INV_DESC_PASIDC_G_DSI and VTD_INV_DESC_PASIDC_G_PASID_SI require
>>> +     * DID check. If DID doesn't match the value in cache or memory, then
>>> +     * it's not a pasid entry we want to invalidate.
>>
>> I think comparing DID applies to the case in which pc_entry->valid is
>> true. If pc_entry->valid is false, this means no cached pc_entry yet. If
>> pe in guest memory is valid, the pc_entry should be updated/set hence
>> the bind_pasid operation (added in later patch) would be conducted.
> 
> We get here only when the pe in guest memory is valid; otherwise we would have
> returned at the "if (vtd_dev_get_pe_from_pasid(vtd_as, &pe))" check.
> 
> If there is no cached pe but a valid pe in guest memory, that means a new pe.
> For a new entry, the guest constructs the pasid cache invalidation request with
> the DID field filled with the DID from the pe in memory. We don't unconditionally
> cache a new pe for all devices on one pasid cache invalidation unless it's a
> global invalidation.

I see. Yes, the intel iommu driver already uses the did configured in the
pasid entry to flush the pasid cache per caching mode. But there seems to be
no wording stating this in the spec. Anyway, I don't see any reason why a
guest iommu driver would want to use a did unequal to the one in the pasid
entry when this is a newly set pasid entry. So it's fine to me now.

Btw, it would be nice to note how you support the global invalidation
since it's no longer part of pc_info->type.

Regards,
Yi Liu



end of thread, other threads:[~2025-10-13 12:54 UTC | newest]

Thread overview: 57+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-09-18  8:57 [PATCH v6 00/22] intel_iommu: Enable first stage translation for passthrough device Zhenzhong Duan
2025-09-18  8:57 ` [PATCH v6 01/22] intel_iommu: Rename vtd_ce_get_rid2pasid_entry to vtd_ce_get_pasid_entry Zhenzhong Duan
2025-09-18  8:57 ` [PATCH v6 02/22] intel_iommu: Delete RPS capability related supporting code Zhenzhong Duan
2025-09-30 13:49   ` Eric Auger
2025-10-09 10:10     ` Duan, Zhenzhong
2025-10-12 12:30       ` Yi Liu
2025-09-18  8:57 ` [PATCH v6 03/22] intel_iommu: Update terminology to match VTD spec Zhenzhong Duan
2025-09-30  7:45   ` Eric Auger
2025-10-12 12:30   ` Yi Liu
2025-10-13  6:20     ` Duan, Zhenzhong
2025-09-18  8:57 ` [PATCH v6 04/22] hw/pci: Export pci_device_get_iommu_bus_devfn() and return bool Zhenzhong Duan
2025-09-18  8:57 ` [PATCH v6 05/22] hw/pci: Introduce pci_device_get_viommu_flags() Zhenzhong Duan
2025-09-23 18:47   ` Nicolin Chen
2025-09-24  7:05     ` Duan, Zhenzhong
2025-09-24  8:21       ` Nicolin Chen
2025-09-26  2:54         ` Duan, Zhenzhong
2025-09-30 13:55           ` Eric Auger
2025-10-12 12:26   ` Yi Liu
2025-10-13  6:24     ` Duan, Zhenzhong
2025-09-18  8:57 ` [PATCH v6 06/22] intel_iommu: Implement get_viommu_flags() callback Zhenzhong Duan
2025-10-12 12:28   ` Yi Liu
2025-10-13  6:26     ` Duan, Zhenzhong
2025-09-18  8:57 ` [PATCH v6 07/22] intel_iommu: Introduce a new structure VTDHostIOMMUDevice Zhenzhong Duan
2025-09-18  8:57 ` [PATCH v6 08/22] vfio/iommufd: Force creating nesting parent HWPT Zhenzhong Duan
2025-09-30 14:19   ` Eric Auger
2025-10-12 12:33   ` Yi Liu
2025-09-18  8:57 ` [PATCH v6 09/22] intel_iommu: Stick to system MR for IOMMUFD backed host device when x-fls=on Zhenzhong Duan
2025-09-30 15:04   ` Eric Auger
2025-10-09 10:10     ` Duan, Zhenzhong
2025-10-12 12:51   ` Yi Liu
2025-09-18  8:57 ` [PATCH v6 10/22] intel_iommu: Check for compatibility with IOMMUFD backed device when x-flts=on Zhenzhong Duan
2025-10-12 12:55   ` Yi Liu
2025-10-13  6:48     ` Duan, Zhenzhong
2025-09-18  8:57 ` [PATCH v6 11/22] intel_iommu: Fail passthrough device under PCI bridge if x-flts=on Zhenzhong Duan
2025-09-18  8:57 ` [PATCH v6 12/22] intel_iommu: Handle PASID cache invalidation Zhenzhong Duan
2025-10-12 14:58   ` Yi Liu
2025-10-13  7:37     ` Duan, Zhenzhong
2025-10-13 12:53       ` Yi Liu
2025-09-18  8:57 ` [PATCH v6 13/22] intel_iommu: Reset pasid cache when system level reset Zhenzhong Duan
2025-10-13 10:25   ` Yi Liu
2025-09-18  8:57 ` [PATCH v6 14/22] intel_iommu: Add some macros and inline functions Zhenzhong Duan
2025-10-13 10:25   ` Yi Liu
2025-09-18  8:57 ` [PATCH v6 15/22] intel_iommu: Bind/unbind guest page table to host Zhenzhong Duan
2025-09-18  8:57 ` [PATCH v6 16/22] intel_iommu: Propagate PASID-based iotlb invalidation " Zhenzhong Duan
2025-09-18  8:57 ` [PATCH v6 17/22] intel_iommu: Replay all pasid bindings when either SRTP or TE bit is changed Zhenzhong Duan
2025-09-18  8:57 ` [PATCH v6 18/22] iommufd: Introduce a helper function to extract vendor capabilities Zhenzhong Duan
2025-09-23 19:45   ` Nicolin Chen
2025-09-24  8:05     ` Duan, Zhenzhong
2025-09-24  8:27       ` Nicolin Chen
2025-09-26  2:54         ` Duan, Zhenzhong
2025-09-18  8:57 ` [PATCH v6 19/22] vfio: Add a new element bypass_ro in VFIOContainerBase Zhenzhong Duan
2025-09-26 12:25   ` Cédric Le Goater
2025-09-18  8:57 ` [PATCH v6 20/22] Workaround for ERRATA_772415_SPR17 Zhenzhong Duan
2025-09-18  8:58 ` [PATCH v6 21/22] intel_iommu: Enable host device when x-flts=on in scalable mode Zhenzhong Duan
2025-09-18  8:58 ` [PATCH v6 22/22] docs/devel: Add IOMMUFD nesting documentation Zhenzhong Duan
2025-09-18 10:00   ` Cédric Le Goater
2025-09-19  2:17     ` Duan, Zhenzhong
