qemu-devel.nongnu.org archive mirror
* [PATCH v5 00/21] intel_iommu: Enable stage-1 translation for passthrough device
@ 2025-08-22  6:40 Zhenzhong Duan
  2025-08-22  6:40 ` [PATCH v5 01/21] intel_iommu: Rename vtd_ce_get_rid2pasid_entry to vtd_ce_get_pasid_entry Zhenzhong Duan
                   ` (21 more replies)
  0 siblings, 22 replies; 113+ messages in thread
From: Zhenzhong Duan @ 2025-08-22  6:40 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, joao.m.martins, clement.mathieu--drif, kevin.tian,
	yi.l.liu, chao.p.peng, Zhenzhong Duan

Hi,

For a passthrough device with intel_iommu.x-flts=on, we don't shadow the guest
page table; instead we pass the stage-1 page table to the host side to construct
a nested domain. There was an earlier effort to enable this feature, see [1]
for details.

The key design is to utilize the dual-stage IOMMU translation (also known as
IOMMU nested translation) capability in the host IOMMU. As the diagram below
shows, the guest I/O page table pointer in GPA (guest physical address) is
passed to the host and used to perform the stage-1 address translation. Along
with it, modifications to present mappings in the guest I/O page table should
be followed by an IOTLB invalidation.

        .-------------.  .---------------------------.
        |   vIOMMU    |  | Guest I/O page table      |
        |             |  '---------------------------'
        .----------------/
        | PASID Entry |--- PASID cache flush --+
        '-------------'                        |
        |             |                        V
        |             |           I/O page table pointer in GPA
        '-------------'
    Guest
    ------| Shadow |---------------------------|--------
          v        v                           v
    Host
        .-------------.  .------------------------.
        |   pIOMMU    |  | Stage1 for GIOVA->GPA  |
        |             |  '------------------------'
        .----------------/  |
        | PASID Entry |     V (Nested xlate)
        '----------------\.--------------------------------------.
        |             |   | Stage2 for GPA->HPA, unmanaged domain|
        |             |   '--------------------------------------'
        '-------------'
For historical reasons, the naming differs across VT-d spec revisions,
where:
 - Stage1 = First stage = First level = flts
 - Stage2 = Second stage = Second level = slts
<Intel VT-d Nested translation>
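
The two-stage walk in the diagram can be illustrated with a toy model. This is
only a sketch, not code from the series: the ToyTable type and the toy_*
functions are hypothetical, each stage is a flat lookup rather than a real
multi-level page table, and the model ignores that real hardware also
translates the stage-1 table pointer itself (a GPA) through stage 2.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy model: 4 KiB pages, one flat lookup table per stage. */
#define TOY_PAGE_SHIFT 12
#define TOY_PAGE_MASK  ((UINT64_C(1) << TOY_PAGE_SHIFT) - 1)

typedef struct ToyMapping {
    uint64_t in_pfn;   /* input page frame number */
    uint64_t out_pfn;  /* output page frame number */
} ToyMapping;

typedef struct ToyTable {
    const ToyMapping *entries;
    size_t count;
} ToyTable;

/* Translate one address through one stage; returns false on a miss. */
static bool toy_translate(const ToyTable *t, uint64_t in, uint64_t *out)
{
    uint64_t pfn = in >> TOY_PAGE_SHIFT;

    for (size_t i = 0; i < t->count; i++) {
        if (t->entries[i].in_pfn == pfn) {
            /* Swap the frame number, keep the offset within the page. */
            *out = (t->entries[i].out_pfn << TOY_PAGE_SHIFT) |
                   (in & TOY_PAGE_MASK);
            return true;
        }
    }
    return false;
}

/* Nested translation: stage 1 (GIOVA -> GPA), then stage 2 (GPA -> HPA). */
static bool toy_nested_translate(const ToyTable *stage1,
                                 const ToyTable *stage2,
                                 uint64_t giova, uint64_t *hpa)
{
    uint64_t gpa;

    if (!toy_translate(stage1, giova, &gpa)) {
        return false;  /* stage-1 miss: the guest-owned table has no mapping */
    }
    return toy_translate(stage2, gpa, hpa);  /* false means a stage-2 miss */
}
```

In this model the stage-1 table plays the role of the guest I/O page table and
the stage-2 table the role of the host-managed GPA->HPA mapping; a caller
passes both tables and a GIOVA to toy_nested_translate() and gets the HPA back
only if both stages have a mapping.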

This series reuses the VFIO device's default hwpt as the nested parent instead
of creating a new one. This avoids duplicating code for a new memory listener,
and all existing features of the VFIO listener can be shared, e.g., ram discard,
dirty tracking, etc. Two limitations are: 1) a VFIO device under a PCI bridge
together with an emulated device is not supported, because the emulated device
wants the IOMMU AS while the VFIO device sticks to the system AS; 2) kexec or
reboot from "intel_iommu=on,sm_on" to "intel_iommu=on,sm_off" is not supported,
because the VFIO device's default hwpt is created with the NEST_PARENT flag and
the kernel inhibits RO mappings when switching to shadow mode.

This series is also prerequisite work for vSVA, i.e., sharing a guest
application's address space with passthrough devices.

There are some interactions between VFIO and vIOMMU:
* vIOMMU registers PCIIOMMUOps [set|unset]_iommu_device to PCI
  subsystem. VFIO calls them to register/unregister HostIOMMUDevice
  instance to vIOMMU at vfio device realize stage.
* vIOMMU registers PCIIOMMUOps get_viommu_cap to PCI subsystem.
  VFIO calls it to get vIOMMU exposed capabilities.
* vIOMMU calls HostIOMMUDeviceIOMMUFD interface [at|de]tach_hwpt
  to bind/unbind device to IOMMUFD backed domains, either nested
  domain or not.

See the diagram below:

        VFIO Device                                 Intel IOMMU
    .-----------------.                         .-------------------.
    |                 |                         |                   |
    |       .---------|PCIIOMMUOps              |.-------------.    |
    |       | IOMMUFD |(set/unset_iommu_device) || Host IOMMU  |    |
    |       | Device  |------------------------>|| Device list |    |
    |       .---------|(get_viommu_cap)         |.-------------.    |
    |                 |                         |       |           |
    |                 |                         |       V           |
    |       .---------|  HostIOMMUDeviceIOMMUFD |  .-------------.  |
    |       | IOMMUFD |            (attach_hwpt)|  | Host IOMMU  |  |
    |       | link    |<------------------------|  |   Device    |  |
    |       .---------|            (detach_hwpt)|  .-------------.  |
    |                 |                         |       |           |
    |                 |                         |       ...         |
    .-----------------.                         .-------------------.

Below is an example to enable stage-1 translation for a passthrough device:

    -M q35,...
    -device intel-iommu,x-scalable-mode=on,x-flts=on...
    -object iommufd,id=iommufd0 -device vfio-pci,iommufd=iommufd0,...

Tests done:
- VFIO devices hotplug/unplug
- different VFIO devices linked to different iommufds
- vhost net device ping test

PATCH1-7:  Some preparing work
PATCH8-9:  Compatibility check between vIOMMU and Host IOMMU
PATCH10-18:Implement stage-1 page table for passthrough device
PATCH19-20:Workaround for ERRATA_772415_SPR17
PATCH21:   Enable stage-1 translation for passthrough device

QEMU code can be found at [2].

Fault reporting isn't supported in this series; we presume the guest kernel
always constructs a correct stage-1 page table for the passthrough device. For
emulated devices, the emulation code already provides stage-1 fault injection.

TODO:
- Fault report to guest when HW stage1 faults

[1] https://patchwork.kernel.org/project/kvm/cover/20210302203827.437645-1-yi.l.liu@intel.com/
[2] https://github.com/yiliu1765/qemu/tree/zhenzhong/iommufd_nesting.v5

Thanks
Zhenzhong

Changelog:
v5:
- refine commit log of patch2 (Cedric, Nicolin)
- introduce helper vfio_pci_from_vfio_device() (Cedric)
- introduce helper vfio_device_viommu_get_nested() (Cedric)
- pass 'bool bypass_ro' argument to vfio_listener_valid_section() instead of 'VFIOContainerBase *' (Cedric)
- fix a potential build error reported by Jim Shu

v4:
- s/VIOMMU_CAP_STAGE1/VIOMMU_CAP_HW_NESTED (Eric, Nicolin, Donald, Shameer)
- clarify get_viommu_cap() return pure emulated caps and explain reason in commit log (Eric)
- retrieve the ce only if vtd_as->pasid in vtd_as_to_iommu_pasid_locked (Eric)
- refine doc comment and commit log in patch10-11 (Eric)

v3:
- define enum type for VIOMMU_CAP_* (Eric)
- drop inline flag in the patch which uses the helper (Eric)
- use extract64 in new introduced MACRO (Eric)
- polish comments and fix typo error (Eric)
- split workaround patch for ERRATA_772415_SPR17 to two patches (Eric)
- optimize bind/unbind error path processing

v2:
- introduce get_viommu_cap() to get STAGE1 flag to create nested parent hwpt (Liuyi)
- reuse VFIO's default hwpt as parent hwpt of nested translation (Nicolin, Liuyi)
- abandon support of VFIO device under pcie-to-pci bridge to simplify design (Liuyi)
- bypass RO mapping in VFIO's default hwpt if ERRATA_772415_SPR17 (Liuyi)
- drop vtd_dev_to_context_entry optimization (Liuyi)

v1:
- simplify vendor specific checking in vtd_check_hiod (Cedric, Nicolin)
- rebase to master

rfcv3:
- s/hwpt_id/id in iommufd_backend_invalidate_cache()'s parameter (Shameer)
- hide vtd vendor specific caps in a wrapper union (Eric, Nicolin)
- simplify return value check of get_cap() (Eric)
- drop realize_late (Cedric, Eric)
- split patch13:intel_iommu: Add PASID cache management infrastructure (Eric)
- s/vtd_pasid_cache_reset/vtd_pasid_cache_reset_locked (Eric)
- s/vtd_pe_get_domain_id/vtd_pe_get_did (Eric)
- refine comments (Eric, Donald)

rfcv2:
- Drop VTDPASIDAddressSpace and use VTDAddressSpace (Eric, Liuyi)
- Move HWPT uAPI patches ahead(patch1-8) so arm nesting could easily rebase
- add two cleanup patches(patch9-10)
- VFIO passes iommufd/devid/hwpt_id to vIOMMU instead of iommufd/devid/ioas_id
- add vtd_as_[from|to]_iommu_pasid() helper to translate between vtd_as and
  iommu pasid, this is important for dropping VTDPASIDAddressSpace


Yi Liu (3):
  intel_iommu: Replay pasid bindings after context cache invalidation
  intel_iommu: Propagate PASID-based iotlb invalidation to host
  intel_iommu: Replay all pasid bindings when either SRTP or TE bit is
    changed

Zhenzhong Duan (18):
  intel_iommu: Rename vtd_ce_get_rid2pasid_entry to
    vtd_ce_get_pasid_entry
  hw/pci: Introduce pci_device_get_viommu_cap()
  intel_iommu: Implement get_viommu_cap() callback
  vfio: Introduce helper vfio_pci_from_vfio_device()
  vfio/iommufd: Force creating nested parent domain
  hw/pci: Export pci_device_get_iommu_bus_devfn() and return bool
  intel_iommu: Introduce a new structure VTDHostIOMMUDevice
  intel_iommu: Check for compatibility with IOMMUFD backed device when
    x-flts=on
  intel_iommu: Fail passthrough device under PCI bridge if x-flts=on
  intel_iommu: Introduce two helpers vtd_as_from/to_iommu_pasid_locked
  intel_iommu: Handle PASID entry removal and update
  intel_iommu: Handle PASID entry addition
  intel_iommu: Introduce a new pasid cache invalidation type FORCE_RESET
  intel_iommu: Stick to system MR for IOMMUFD backed host device when
    x-flts=on
  intel_iommu: Bind/unbind guest page table to host
  vfio: Add a new element bypass_ro in VFIOContainerBase
  Workaround for ERRATA_772415_SPR17
  intel_iommu: Enable host device when x-flts=on in scalable mode

 MAINTAINERS                           |   1 +
 hw/i386/intel_iommu_internal.h        |  68 +-
 hw/vfio/pci.h                         |  12 +
 include/hw/i386/intel_iommu.h         |   9 +-
 include/hw/iommu.h                    |  19 +
 include/hw/pci/pci.h                  |  27 +
 include/hw/vfio/vfio-container-base.h |   1 +
 include/hw/vfio/vfio-device.h         |   2 +
 hw/i386/intel_iommu.c                 | 941 +++++++++++++++++++++++++-
 hw/pci/pci.c                          |  23 +-
 hw/vfio/container.c                   |   4 +-
 hw/vfio/device.c                      |  14 +-
 hw/vfio/iommufd.c                     |  20 +-
 hw/vfio/listener.c                    |  25 +-
 hw/vfio/pci.c                         |   9 +
 hw/i386/trace-events                  |   8 +
 16 files changed, 1131 insertions(+), 52 deletions(-)
 create mode 100644 include/hw/iommu.h


base-commit: 88f72048d2f5835a1b9eaba690c7861393aef283
-- 
2.47.1




* [PATCH v5 01/21] intel_iommu: Rename vtd_ce_get_rid2pasid_entry to vtd_ce_get_pasid_entry
  2025-08-22  6:40 [PATCH v5 00/21] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
@ 2025-08-22  6:40 ` Zhenzhong Duan
  2025-08-22 22:19   ` Nicolin Chen via
  2025-08-22  6:40 ` [PATCH v5 02/21] hw/pci: Introduce pci_device_get_viommu_cap() Zhenzhong Duan
                   ` (20 subsequent siblings)
  21 siblings, 1 reply; 113+ messages in thread
From: Zhenzhong Duan @ 2025-08-22  6:40 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, joao.m.martins, clement.mathieu--drif, kevin.tian,
	yi.l.liu, chao.p.peng, Zhenzhong Duan

In the early days, vtd_ce_get_rid2pasid_entry() was used to get the pasid entry
of rid2pasid; it was later extended to get any pasid entry. The new name
vtd_ce_get_pasid_entry() better matches what it actually does.

No functional change intended.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Clément Mathieu--Drif <clement.mathieu--drif@eviden.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
---
 hw/i386/intel_iommu.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 83c5e44413..04809bd776 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -944,7 +944,7 @@ static int vtd_get_pe_from_pasid_table(IntelIOMMUState *s,
     return 0;
 }
 
-static int vtd_ce_get_rid2pasid_entry(IntelIOMMUState *s,
+static int vtd_ce_get_pasid_entry(IntelIOMMUState *s,
                                       VTDContextEntry *ce,
                                       VTDPASIDEntry *pe,
                                       uint32_t pasid)
@@ -1025,7 +1025,7 @@ static uint32_t vtd_get_iova_level(IntelIOMMUState *s,
     VTDPASIDEntry pe;
 
     if (s->root_scalable) {
-        vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
+        vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
         if (s->flts) {
             return VTD_PE_GET_FL_LEVEL(&pe);
         } else {
@@ -1048,7 +1048,7 @@ static uint32_t vtd_get_iova_agaw(IntelIOMMUState *s,
     VTDPASIDEntry pe;
 
     if (s->root_scalable) {
-        vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
+        vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
         return 30 + ((pe.val[0] >> 2) & VTD_SM_PASID_ENTRY_AW) * 9;
     }
 
@@ -1116,7 +1116,7 @@ static dma_addr_t vtd_get_iova_pgtbl_base(IntelIOMMUState *s,
     VTDPASIDEntry pe;
 
     if (s->root_scalable) {
-        vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
+        vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
         if (s->flts) {
             return pe.val[2] & VTD_SM_PASID_ENTRY_FLPTPTR;
         } else {
@@ -1522,7 +1522,7 @@ static int vtd_ce_rid2pasid_check(IntelIOMMUState *s,
      * has valid rid2pasid setting, which includes valid
      * rid2pasid field and corresponding pasid entry setting
      */
-    return vtd_ce_get_rid2pasid_entry(s, ce, &pe, PCI_NO_PASID);
+    return vtd_ce_get_pasid_entry(s, ce, &pe, PCI_NO_PASID);
 }
 
 /* Map a device to its corresponding domain (context-entry) */
@@ -1611,7 +1611,7 @@ static uint16_t vtd_get_domain_id(IntelIOMMUState *s,
     VTDPASIDEntry pe;
 
     if (s->root_scalable) {
-        vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
+        vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
         return VTD_SM_PASID_ENTRY_DID(pe.val[1]);
     }
 
@@ -1687,7 +1687,7 @@ static bool vtd_dev_pt_enabled(IntelIOMMUState *s, VTDContextEntry *ce,
     int ret;
 
     if (s->root_scalable) {
-        ret = vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
+        ret = vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
         if (ret) {
             /*
              * This error is guest triggerable. We should assumt PT
-- 
2.47.1




* [PATCH v5 02/21] hw/pci: Introduce pci_device_get_viommu_cap()
  2025-08-22  6:40 [PATCH v5 00/21] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
  2025-08-22  6:40 ` [PATCH v5 01/21] intel_iommu: Rename vtd_ce_get_rid2pasid_entry to vtd_ce_get_pasid_entry Zhenzhong Duan
@ 2025-08-22  6:40 ` Zhenzhong Duan
  2025-08-22 22:22   ` Nicolin Chen
  2025-08-27 11:13   ` Yi Liu
  2025-08-22  6:40 ` [PATCH v5 03/21] intel_iommu: Implement get_viommu_cap() callback Zhenzhong Duan
                   ` (19 subsequent siblings)
  21 siblings, 2 replies; 113+ messages in thread
From: Zhenzhong Duan @ 2025-08-22  6:40 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, joao.m.martins, clement.mathieu--drif, kevin.tian,
	yi.l.liu, chao.p.peng, Zhenzhong Duan

Introduce a new optional PCIIOMMUOps callback, get_viommu_cap(), which
allows retrieving the capabilities exposed by a vIOMMU. The first planned
vIOMMU device capability is VIOMMU_CAP_HW_NESTED, which advertises support
for the HW nested stage translation scheme. pci_device_get_viommu_cap() is a
wrapper that can be called on a PCI device potentially protected by
a vIOMMU.

get_viommu_cap() is designed to return a 64bit bitmap of purely emulated
capabilities which are determined only by the user's configuration; no host
capabilities are involved. The reasons are:

1. the host may have heterogeneous IOMMUs, each with different capabilities
2. this is migration friendly: the return value is consistent between source
   and target.
3. host IOMMU capabilities are passed to the vIOMMU through the
   set_iommu_device() interface, which has to be called after attach_device().
   When get_viommu_cap() is called in attach_device(), there is no way for the
   vIOMMU to get host IOMMU capabilities yet, so only emulated capabilities
   can be returned. See the sequence below:

     vfio_device_attach():
         iommufd_cdev_attach():
             pci_device_get_viommu_cap() for HW nesting cap
             create a nesting parent hwpt
             attach device to the hwpt
             vfio_device_hiod_create_and_realize() creating hiod
     ...
     pci_device_set_iommu_device(hiod)
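
The optional-callback contract described above can be sketched in isolation.
This is a minimal model under stated assumptions, not the code added by this
patch: ToyIOMMUOps, ToyVIOMMU and the toy_* functions are hypothetical
stand-ins for PCIIOMMUOps, the vIOMMU state and the real wrapper, and
TOY_VIOMMU_CAP_HW_NESTED only mirrors the bit position of
VIOMMU_CAP_HW_NESTED.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for VIOMMU_CAP_HW_NESTED (bit 0 of the bitmap). */
#define TOY_VIOMMU_CAP_HW_NESTED (UINT64_C(1) << 0)

/* Hypothetical stand-in for PCIIOMMUOps: the callback may be NULL. */
typedef struct ToyIOMMUOps {
    uint64_t (*get_viommu_cap)(void *opaque);
} ToyIOMMUOps;

/* Hypothetical vIOMMU state; 'flts' models a user-configured property. */
typedef struct ToyVIOMMU {
    bool flts;
} ToyVIOMMU;

/* Reports only what the user configured; no host state is consulted. */
static uint64_t toy_get_viommu_cap(void *opaque)
{
    const ToyVIOMMU *s = opaque;

    return s->flts ? TOY_VIOMMU_CAP_HW_NESTED : 0;
}

/* Wrapper: returns 0 when there is no ops table or no callback. */
static uint64_t toy_device_get_viommu_cap(const ToyIOMMUOps *ops, void *opaque)
{
    if (ops && ops->get_viommu_cap) {
        return ops->get_viommu_cap(opaque);
    }
    return 0;
}
```

Because the result depends only on configuration, a caller can query it before
any host IOMMU device has been registered, which is the ordering constraint
the sequence above describes.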

Suggested-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 MAINTAINERS          |  1 +
 include/hw/iommu.h   | 19 +++++++++++++++++++
 include/hw/pci/pci.h | 25 +++++++++++++++++++++++++
 hw/pci/pci.c         | 11 +++++++++++
 4 files changed, 56 insertions(+)
 create mode 100644 include/hw/iommu.h

diff --git a/MAINTAINERS b/MAINTAINERS
index a07086ed76..54fb878128 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2305,6 +2305,7 @@ F: include/system/iommufd.h
 F: backends/host_iommu_device.c
 F: include/system/host_iommu_device.h
 F: include/qemu/chardev_open.h
+F: include/hw/iommu.h
 F: util/chardev_open.c
 F: docs/devel/vfio-iommufd.rst
 
diff --git a/include/hw/iommu.h b/include/hw/iommu.h
new file mode 100644
index 0000000000..7dd0c11b16
--- /dev/null
+++ b/include/hw/iommu.h
@@ -0,0 +1,19 @@
+/*
+ * General vIOMMU capabilities, flags, etc
+ *
+ * Copyright (C) 2025 Intel Corporation.
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#ifndef HW_IOMMU_H
+#define HW_IOMMU_H
+
+#include "qemu/bitops.h"
+
+enum {
+    /* hardware nested stage-1 page table support */
+    VIOMMU_CAP_HW_NESTED = BIT_ULL(0),
+};
+
+#endif /* HW_IOMMU_H */
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index 6b7d3ac8a3..cde7a54a69 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -462,6 +462,21 @@ typedef struct PCIIOMMUOps {
      * @devfn: device and function number of the PCI device.
      */
     void (*unset_iommu_device)(PCIBus *bus, void *opaque, int devfn);
+    /**
+     * @get_viommu_cap: get vIOMMU capabilities
+     *
+     * Optional callback, if not implemented, then vIOMMU doesn't
+     * support exposing capabilities to other subsystem, e.g., VFIO.
+     * vIOMMU can choose which capabilities to expose.
+     *
+     * @opaque: the data passed to pci_setup_iommu().
+     *
+     * Returns: 64bit bitmap in which each bit represents a capability
+     * defined by VIOMMU_CAP_* in include/hw/iommu.h. These capabilities are
+     * theoretical: they are determined only by vIOMMU device properties and
+     * are independent of the actual host capabilities they may depend on.
+     */
+    uint64_t (*get_viommu_cap)(void *opaque);
     /**
      * @get_iotlb_info: get properties required to initialize a device IOTLB.
      *
@@ -642,6 +657,16 @@ bool pci_device_set_iommu_device(PCIDevice *dev, HostIOMMUDevice *hiod,
                                  Error **errp);
 void pci_device_unset_iommu_device(PCIDevice *dev);
 
+/**
+ * pci_device_get_viommu_cap: get vIOMMU capabilities.
+ *
+ * Returns a 64bit bitmap in which each bit represents a vIOMMU exposed
+ * capability, 0 if vIOMMU doesn't support exposing capabilities.
+ *
+ * @dev: PCI device pointer.
+ */
+uint64_t pci_device_get_viommu_cap(PCIDevice *dev);
+
 /**
  * pci_iommu_get_iotlb_info: get properties required to initialize a
  * device IOTLB.
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index c70b5ceeba..df1fb615a8 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -2992,6 +2992,17 @@ void pci_device_unset_iommu_device(PCIDevice *dev)
     }
 }
 
+uint64_t pci_device_get_viommu_cap(PCIDevice *dev)
+{
+    PCIBus *iommu_bus;
+
+    pci_device_get_iommu_bus_devfn(dev, &iommu_bus, NULL, NULL);
+    if (iommu_bus && iommu_bus->iommu_ops->get_viommu_cap) {
+        return iommu_bus->iommu_ops->get_viommu_cap(iommu_bus->iommu_opaque);
+    }
+    return 0;
+}
+
 int pci_pri_request_page(PCIDevice *dev, uint32_t pasid, bool priv_req,
                          bool exec_req, hwaddr addr, bool lpig,
                          uint16_t prgi, bool is_read, bool is_write)
-- 
2.47.1




* [PATCH v5 03/21] intel_iommu: Implement get_viommu_cap() callback
  2025-08-22  6:40 [PATCH v5 00/21] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
  2025-08-22  6:40 ` [PATCH v5 01/21] intel_iommu: Rename vtd_ce_get_rid2pasid_entry to vtd_ce_get_pasid_entry Zhenzhong Duan
  2025-08-22  6:40 ` [PATCH v5 02/21] hw/pci: Introduce pci_device_get_viommu_cap() Zhenzhong Duan
@ 2025-08-22  6:40 ` Zhenzhong Duan
  2025-08-22 22:23   ` Nicolin Chen
  2025-08-22  6:40 ` [PATCH v5 04/21] vfio: Introduce helper vfio_pci_from_vfio_device() Zhenzhong Duan
                   ` (18 subsequent siblings)
  21 siblings, 1 reply; 113+ messages in thread
From: Zhenzhong Duan @ 2025-08-22  6:40 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, joao.m.martins, clement.mathieu--drif, kevin.tian,
	yi.l.liu, chao.p.peng, Zhenzhong Duan

Implement the get_viommu_cap() callback and expose the stage-1 capability for
now.

VFIO uses it to create a nested parent domain, which is further used to create
a nested domain in the vIOMMU. All of this will be implemented in the following
patches.

Suggested-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
---
 hw/i386/intel_iommu.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 04809bd776..e3b871de70 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -24,6 +24,7 @@
 #include "qemu/main-loop.h"
 #include "qapi/error.h"
 #include "hw/sysbus.h"
+#include "hw/iommu.h"
 #include "intel_iommu_internal.h"
 #include "hw/pci/pci.h"
 #include "hw/pci/pci_bus.h"
@@ -4423,6 +4424,16 @@ static void vtd_dev_unset_iommu_device(PCIBus *bus, void *opaque, int devfn)
     vtd_iommu_unlock(s);
 }
 
+static uint64_t vtd_get_viommu_cap(void *opaque)
+{
+    IntelIOMMUState *s = opaque;
+    uint64_t caps;
+
+    caps = s->flts ? VIOMMU_CAP_HW_NESTED : 0;
+
+    return caps;
+}
+
 /* Unmap the whole range in the notifier's scope. */
 static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
 {
@@ -4853,6 +4864,7 @@ static PCIIOMMUOps vtd_iommu_ops = {
     .register_iotlb_notifier = vtd_register_iotlb_notifier,
     .unregister_iotlb_notifier = vtd_unregister_iotlb_notifier,
     .ats_request_translation = vtd_ats_request_translation,
+    .get_viommu_cap = vtd_get_viommu_cap,
 };
 
 static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
-- 
2.47.1




* [PATCH v5 04/21] vfio: Introduce helper vfio_pci_from_vfio_device()
  2025-08-22  6:40 [PATCH v5 00/21] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
                   ` (2 preceding siblings ...)
  2025-08-22  6:40 ` [PATCH v5 03/21] intel_iommu: Implement get_viommu_cap() callback Zhenzhong Duan
@ 2025-08-22  6:40 ` Zhenzhong Duan
  2025-08-22 22:40   ` Nicolin Chen via
                     ` (3 more replies)
  2025-08-22  6:40 ` [PATCH v5 05/21] vfio/iommufd: Force creating nested parent domain Zhenzhong Duan
                   ` (17 subsequent siblings)
  21 siblings, 4 replies; 113+ messages in thread
From: Zhenzhong Duan @ 2025-08-22  6:40 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, joao.m.martins, clement.mathieu--drif, kevin.tian,
	yi.l.liu, chao.p.peng, Zhenzhong Duan

Introduce the helper vfio_pci_from_vfio_device() to convert a VFIODevice to a
VFIOPCIDevice and to hide the low-level VFIO_DEVICE_TYPE_PCI type check.
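
The checked-downcast pattern behind such a helper can be shown standalone.
This is a sketch with hypothetical types (ToyDevice, ToyPCIDevice) and a local
toy_container_of() macro built on offsetof(); it is not the QEMU
implementation, only an illustration of combining the type check and the
container_of() step in one place.

```c
#include <stddef.h>

/* Hypothetical device kinds, standing in for the VFIO device type enum. */
typedef enum ToyDeviceType {
    TOY_TYPE_PCI,
    TOY_TYPE_PLATFORM,
} ToyDeviceType;

/* Hypothetical base object embedded in every concrete device. */
typedef struct ToyDevice {
    ToyDeviceType type;
} ToyDevice;

typedef struct ToyPCIDevice {
    int bar_count;
    ToyDevice base;  /* embedded base, deliberately not the first member */
} ToyPCIDevice;

/* Recover the enclosing structure from a pointer to one of its members. */
#define toy_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Returns the enclosing ToyPCIDevice, or NULL if dev is NULL or not PCI. */
static ToyPCIDevice *toy_pci_from_device(ToyDevice *dev)
{
    if (dev && dev->type == TOY_TYPE_PCI) {
        return toy_container_of(dev, ToyPCIDevice, base);
    }
    return NULL;
}
```

Callers can then write a single NULL test instead of repeating the type
comparison, and the NULL check on the base pointer comes for free.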

Suggested-by: Cédric Le Goater <clg@redhat.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Link: https://lore.kernel.org/qemu-devel/20250801023533.1458644-1-zhenzhong.duan@intel.com
[ clg: Added documentation ]
Signed-off-by: Cédric Le Goater <clg@redhat.com>
---
 hw/vfio/pci.h       | 12 ++++++++++++
 hw/vfio/container.c |  4 ++--
 hw/vfio/device.c    |  2 +-
 hw/vfio/iommufd.c   |  4 ++--
 hw/vfio/listener.c  |  4 ++--
 hw/vfio/pci.c       |  9 +++++++++
 6 files changed, 28 insertions(+), 7 deletions(-)

diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index 810a842f4a..beb8fb9ee7 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -221,6 +221,18 @@ void vfio_pci_write_config(PCIDevice *pdev,
 uint64_t vfio_vga_read(void *opaque, hwaddr addr, unsigned size);
 void vfio_vga_write(void *opaque, hwaddr addr, uint64_t data, unsigned size);
 
+/**
+ * vfio_pci_from_vfio_device: Transform from VFIODevice to
+ * VFIOPCIDevice
+ *
+ * This function checks if the given @vbasedev is a VFIO PCI device.
+ * If it is, it returns the containing VFIOPCIDevice.
+ *
+ * @vbasedev: The VFIODevice to transform
+ *
+ * Return: The VFIOPCIDevice on success, NULL on failure.
+ */
+VFIOPCIDevice *vfio_pci_from_vfio_device(VFIODevice *vbasedev);
 void vfio_sub_page_bar_update_mappings(VFIOPCIDevice *vdev);
 bool vfio_opt_rom_in_denylist(VFIOPCIDevice *vdev);
 bool vfio_config_quirk_setup(VFIOPCIDevice *vdev, Error **errp);
diff --git a/hw/vfio/container.c b/hw/vfio/container.c
index 3e13feaa74..134ddccc52 100644
--- a/hw/vfio/container.c
+++ b/hw/vfio/container.c
@@ -1087,7 +1087,7 @@ static int vfio_legacy_pci_hot_reset(VFIODevice *vbasedev, bool single)
         /* Prep dependent devices for reset and clear our marker. */
         QLIST_FOREACH(vbasedev_iter, &group->device_list, next) {
             if (!vbasedev_iter->dev->realized ||
-                vbasedev_iter->type != VFIO_DEVICE_TYPE_PCI) {
+                !vfio_pci_from_vfio_device(vbasedev_iter)) {
                 continue;
             }
             tmp = container_of(vbasedev_iter, VFIOPCIDevice, vbasedev);
@@ -1172,7 +1172,7 @@ out:
 
         QLIST_FOREACH(vbasedev_iter, &group->device_list, next) {
             if (!vbasedev_iter->dev->realized ||
-                vbasedev_iter->type != VFIO_DEVICE_TYPE_PCI) {
+                !vfio_pci_from_vfio_device(vbasedev_iter)) {
                 continue;
             }
             tmp = container_of(vbasedev_iter, VFIOPCIDevice, vbasedev);
diff --git a/hw/vfio/device.c b/hw/vfio/device.c
index 52a1996dc4..08f12ac31f 100644
--- a/hw/vfio/device.c
+++ b/hw/vfio/device.c
@@ -129,7 +129,7 @@ static inline const char *action_to_str(int action)
 
 static const char *index_to_str(VFIODevice *vbasedev, int index)
 {
-    if (vbasedev->type != VFIO_DEVICE_TYPE_PCI) {
+    if (!vfio_pci_from_vfio_device(vbasedev)) {
         return NULL;
     }
 
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index 48c590b6a9..8c27222f75 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -737,8 +737,8 @@ iommufd_cdev_dep_get_realized_vpdev(struct vfio_pci_dependent_device *dep_dev,
     }
 
     vbasedev_tmp = iommufd_cdev_pci_find_by_devid(dep_dev->devid);
-    if (!vbasedev_tmp || !vbasedev_tmp->dev->realized ||
-        vbasedev_tmp->type != VFIO_DEVICE_TYPE_PCI) {
+    if (!vfio_pci_from_vfio_device(vbasedev_tmp) ||
+        !vbasedev_tmp->dev->realized) {
         return NULL;
     }
 
diff --git a/hw/vfio/listener.c b/hw/vfio/listener.c
index f498e23a93..903dfd8bf2 100644
--- a/hw/vfio/listener.c
+++ b/hw/vfio/listener.c
@@ -450,7 +450,7 @@ static void vfio_device_error_append(VFIODevice *vbasedev, Error **errp)
      * MMIO region mapping failures are not fatal but in this case PCI
      * peer-to-peer transactions are broken.
      */
-    if (vbasedev && vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
+    if (vfio_pci_from_vfio_device(vbasedev)) {
         error_append_hint(errp, "%s: PCI peer-to-peer transactions "
                           "on BARs are not supported.\n", vbasedev->name);
     }
@@ -751,7 +751,7 @@ static bool vfio_section_is_vfio_pci(MemoryRegionSection *section,
     owner = memory_region_owner(section->mr);
 
     QLIST_FOREACH(vbasedev, &bcontainer->device_list, container_next) {
-        if (vbasedev->type != VFIO_DEVICE_TYPE_PCI) {
+        if (!vfio_pci_from_vfio_device(vbasedev)) {
             continue;
         }
         pcidev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 07257d0fa0..3fe5b03eb1 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2833,6 +2833,15 @@ static int vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
     return ret;
 }
 
+/* Transform from VFIODevice to VFIOPCIDevice. Return NULL if fails. */
+VFIOPCIDevice *vfio_pci_from_vfio_device(VFIODevice *vbasedev)
+{
+    if (vbasedev && vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
+        return container_of(vbasedev, VFIOPCIDevice, vbasedev);
+    }
+    return NULL;
+}
+
 void vfio_sub_page_bar_update_mappings(VFIOPCIDevice *vdev)
 {
     PCIDevice *pdev = &vdev->pdev;
-- 
2.47.1




* [PATCH v5 05/21] vfio/iommufd: Force creating nested parent domain
  2025-08-22  6:40 [PATCH v5 00/21] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
                   ` (3 preceding siblings ...)
  2025-08-22  6:40 ` [PATCH v5 04/21] vfio: Introduce helper vfio_pci_from_vfio_device() Zhenzhong Duan
@ 2025-08-22  6:40 ` Zhenzhong Duan
  2025-08-22 23:12   ` Nicolin Chen
  2025-08-27 11:48   ` Eric Auger
  2025-08-22  6:40 ` [PATCH v5 06/21] hw/pci: Export pci_device_get_iommu_bus_devfn() and return bool Zhenzhong Duan
                   ` (16 subsequent siblings)
  21 siblings, 2 replies; 113+ messages in thread
From: Zhenzhong Duan @ 2025-08-22  6:40 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, joao.m.martins, clement.mathieu--drif, kevin.tian,
	yi.l.liu, chao.p.peng, Zhenzhong Duan

Call pci_device_get_viommu_cap() to check whether the vIOMMU supports
VIOMMU_CAP_HW_NESTED; if it does, create a nested parent domain which can be
reused by the vIOMMU to create a nested domain.

Introduce the helper vfio_device_viommu_get_nested() to facilitate this
implementation.

This is safe because even if VIOMMU_CAP_HW_NESTED is returned, s->flts is
forbidden and the VFIO device fails in the set_iommu_device() call, until we
support passthrough devices with x-flts=on.
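
How the capability feeds into the allocation flags can be sketched as a pure
function. The TOY_HWPT_ALLOC_* names and their values are illustrative
stand-ins chosen for this example, not the kernel uAPI definitions, and
toy_hwpt_alloc_flags() is a hypothetical helper rather than code from this
patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-ins for the hwpt allocation flags (values assumed). */
#define TOY_HWPT_ALLOC_NEST_PARENT    (UINT32_C(1) << 0)
#define TOY_HWPT_ALLOC_DIRTY_TRACKING (UINT32_C(1) << 1)

/*
 * Compute allocation flags for a device's default hwpt. The nested parent
 * flag is added on top of whatever else was requested, so an existing
 * dirty-tracking request is preserved.
 */
static uint32_t toy_hwpt_alloc_flags(bool dirty_tracking, bool viommu_nested)
{
    uint32_t flags = 0;

    if (dirty_tracking) {
        flags |= TOY_HWPT_ALLOC_DIRTY_TRACKING;
    }
    if (viommu_nested) {
        flags |= TOY_HWPT_ALLOC_NEST_PARENT;
    }
    return flags;
}
```

The point of OR-ing rather than assigning is that the nested parent request
composes with the existing flag instead of replacing it.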

Suggested-by: Nicolin Chen <nicolinc@nvidia.com>
Suggested-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 include/hw/vfio/vfio-device.h |  2 ++
 hw/vfio/device.c              | 12 ++++++++++++
 hw/vfio/iommufd.c             |  8 ++++++++
 3 files changed, 22 insertions(+)

diff --git a/include/hw/vfio/vfio-device.h b/include/hw/vfio/vfio-device.h
index 6e4d5ccdac..ecd82c16c7 100644
--- a/include/hw/vfio/vfio-device.h
+++ b/include/hw/vfio/vfio-device.h
@@ -257,6 +257,8 @@ void vfio_device_prepare(VFIODevice *vbasedev, VFIOContainerBase *bcontainer,
 
 void vfio_device_unprepare(VFIODevice *vbasedev);
 
+bool vfio_device_viommu_get_nested(VFIODevice *vbasedev);
+
 int vfio_device_get_region_info(VFIODevice *vbasedev, int index,
                                 struct vfio_region_info **info);
 int vfio_device_get_region_info_type(VFIODevice *vbasedev, uint32_t type,
diff --git a/hw/vfio/device.c b/hw/vfio/device.c
index 08f12ac31f..3eeb71bd51 100644
--- a/hw/vfio/device.c
+++ b/hw/vfio/device.c
@@ -23,6 +23,7 @@
 
 #include "hw/vfio/vfio-device.h"
 #include "hw/vfio/pci.h"
+#include "hw/iommu.h"
 #include "hw/hw.h"
 #include "trace.h"
 #include "qapi/error.h"
@@ -504,6 +505,17 @@ void vfio_device_unprepare(VFIODevice *vbasedev)
     vbasedev->bcontainer = NULL;
 }
 
+bool vfio_device_viommu_get_nested(VFIODevice *vbasedev)
+{
+    VFIOPCIDevice *vdev = vfio_pci_from_vfio_device(vbasedev);
+
+    if (vdev) {
+        return !!(pci_device_get_viommu_cap(&vdev->pdev) &
+                  VIOMMU_CAP_HW_NESTED);
+    }
+    return false;
+}
+
 /*
  * Traditional ioctl() based io
  */
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index 8c27222f75..e503c232e1 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -379,6 +379,14 @@ static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
         flags = IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
     }
 
+    /*
+     * If vIOMMU supports stage-1 translation, force creating a nested parent
+     * domain which can be reused by vIOMMU to create a nested domain.
+     */
+    if (vfio_device_viommu_get_nested(vbasedev)) {
+        flags |= IOMMU_HWPT_ALLOC_NEST_PARENT;
+    }
+
     if (cpr_is_incoming()) {
         hwpt_id = vbasedev->cpr.hwpt_id;
         goto skip_alloc;
-- 
2.47.1




* [PATCH v5 06/21] hw/pci: Export pci_device_get_iommu_bus_devfn() and return bool
  2025-08-22  6:40 [PATCH v5 00/21] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
                   ` (4 preceding siblings ...)
  2025-08-22  6:40 ` [PATCH v5 05/21] vfio/iommufd: Force creating nested parent domain Zhenzhong Duan
@ 2025-08-22  6:40 ` Zhenzhong Duan
  2025-08-22 23:13   ` Nicolin Chen
  2025-08-27 11:14   ` Yi Liu
  2025-08-22  6:40 ` [PATCH v5 07/21] intel_iommu: Introduce a new structure VTDHostIOMMUDevice Zhenzhong Duan
                   ` (15 subsequent siblings)
  21 siblings, 2 replies; 113+ messages in thread
From: Zhenzhong Duan @ 2025-08-22  6:40 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, joao.m.martins, clement.mathieu--drif, kevin.tian,
	yi.l.liu, chao.p.peng, Zhenzhong Duan

Return true if the PCI device's RID is aliased, or false otherwise. A
following patch uses this to determine whether a PCI device is under a PCI
bridge.
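
A toy model of the walk may help when reading the diff below. It is a
simplified sketch, not QEMU code: struct sketch_bus and
sketch_get_iommu_bus() are hypothetical names, and the model assumes that
crossing any conventional (non-express) bus on the way up is what makes the
RID aliased.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a PCI bus node; not QEMU's PCIBus. */
struct sketch_bus {
    struct sketch_bus *parent;  /* bus the owning bridge sits on, NULL at root */
    bool has_iommu_ops;         /* an IOMMU is registered on this bus */
    bool is_express;            /* true for PCIe, false for conventional PCI */
};

/*
 * Walk upward from the device's bus until a bus with IOMMU ops (or the root)
 * is reached. Store where the walk stopped in *iommu_bus and return true if
 * a conventional PCI bus was crossed on the way, i.e. the requester ID seen
 * by the IOMMU would be an alias rather than the device's own.
 */
bool sketch_get_iommu_bus(struct sketch_bus *bus, struct sketch_bus **iommu_bus)
{
    bool aliased = false;

    assert(bus && iommu_bus);

    while (!bus->has_iommu_ops && bus->parent) {
        if (!bus->is_express) {
            aliased = true;
        }
        bus = bus->parent;
    }

    *iommu_bus = bus;
    return aliased;
}
```

A caller that only cares about the "under a PCI bridge" question can ignore
the bus written to *iommu_bus and test the return value alone.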

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
---
 include/hw/pci/pci.h |  2 ++
 hw/pci/pci.c         | 12 ++++++++----
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index cde7a54a69..34b4edbf1a 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -652,6 +652,8 @@ typedef struct PCIIOMMUOps {
                             bool is_write);
 } PCIIOMMUOps;
 
+bool pci_device_get_iommu_bus_devfn(PCIDevice *dev, PCIBus **piommu_bus,
+                                    PCIBus **aliased_bus, int *aliased_devfn);
 AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
 bool pci_device_set_iommu_device(PCIDevice *dev, HostIOMMUDevice *hiod,
                                  Error **errp);
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index df1fb615a8..151c27088b 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -2857,20 +2857,21 @@ static void pci_device_class_base_init(ObjectClass *klass, const void *data)
  * For call sites which don't need aliased BDF, passing NULL to
  * aliased_[bus|devfn] is allowed.
  *
+ * Returns true if PCI device RID is aliased or false otherwise.
+ *
  * @piommu_bus: return root #PCIBus backed by an IOMMU for the PCI device.
  *
  * @aliased_bus: return aliased #PCIBus of the PCI device, optional.
  *
  * @aliased_devfn: return aliased devfn of the PCI device, optional.
  */
-static void pci_device_get_iommu_bus_devfn(PCIDevice *dev,
-                                           PCIBus **piommu_bus,
-                                           PCIBus **aliased_bus,
-                                           int *aliased_devfn)
+bool pci_device_get_iommu_bus_devfn(PCIDevice *dev, PCIBus **piommu_bus,
+                                    PCIBus **aliased_bus, int *aliased_devfn)
 {
     PCIBus *bus = pci_get_bus(dev);
     PCIBus *iommu_bus = bus;
     int devfn = dev->devfn;
+    bool aliased = false;
 
     while (iommu_bus && !iommu_bus->iommu_ops && iommu_bus->parent_dev) {
         PCIBus *parent_bus = pci_get_bus(iommu_bus->parent_dev);
@@ -2907,6 +2908,7 @@ static void pci_device_get_iommu_bus_devfn(PCIDevice *dev,
                 devfn = parent->devfn;
                 bus = parent_bus;
             }
+            aliased = true;
         }
 
         iommu_bus = parent_bus;
@@ -2928,6 +2930,8 @@ static void pci_device_get_iommu_bus_devfn(PCIDevice *dev,
     if (aliased_devfn) {
         *aliased_devfn = devfn;
     }
+
+    return aliased;
 }
 
 AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
-- 
2.47.1




* [PATCH v5 07/21] intel_iommu: Introduce a new structure VTDHostIOMMUDevice
  2025-08-22  6:40 [PATCH v5 00/21] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
                   ` (5 preceding siblings ...)
  2025-08-22  6:40 ` [PATCH v5 06/21] hw/pci: Export pci_device_get_iommu_bus_devfn() and return bool Zhenzhong Duan
@ 2025-08-22  6:40 ` Zhenzhong Duan
  2025-08-22 23:17   ` Nicolin Chen
                     ` (2 more replies)
  2025-08-22  6:40 ` [PATCH v5 08/21] intel_iommu: Check for compatibility with IOMMUFD backed device when x-flts=on Zhenzhong Duan
                   ` (14 subsequent siblings)
  21 siblings, 3 replies; 113+ messages in thread
From: Zhenzhong Duan @ 2025-08-22  6:40 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, joao.m.martins, clement.mathieu--drif, kevin.tian,
	yi.l.liu, chao.p.peng, Zhenzhong Duan

Introduce a new structure VTDHostIOMMUDevice which replaces
HostIOMMUDevice to be stored in hash table.

It includes a reference to HostIOMMUDevice and IntelIOMMUState,
also includes BDF information which will be used in future
patches.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
---
 hw/i386/intel_iommu_internal.h |  7 +++++++
 include/hw/i386/intel_iommu.h  |  2 +-
 hw/i386/intel_iommu.c          | 15 +++++++++++++--
 3 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 360e937989..c7046eb4e2 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -28,6 +28,7 @@
 #ifndef HW_I386_INTEL_IOMMU_INTERNAL_H
 #define HW_I386_INTEL_IOMMU_INTERNAL_H
 #include "hw/i386/intel_iommu.h"
+#include "system/host_iommu_device.h"
 
 /*
  * Intel IOMMU register specification
@@ -608,4 +609,10 @@ typedef struct VTDRootEntry VTDRootEntry;
 /* Bits to decide the offset for each level */
 #define VTD_LEVEL_BITS           9
 
+typedef struct VTDHostIOMMUDevice {
+    IntelIOMMUState *iommu_state;
+    PCIBus *bus;
+    uint8_t devfn;
+    HostIOMMUDevice *hiod;
+} VTDHostIOMMUDevice;
 #endif
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index e95477e855..50f9b27a45 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -295,7 +295,7 @@ struct IntelIOMMUState {
     /* list of registered notifiers */
     QLIST_HEAD(, VTDAddressSpace) vtd_as_with_notifiers;
 
-    GHashTable *vtd_host_iommu_dev;             /* HostIOMMUDevice */
+    GHashTable *vtd_host_iommu_dev;             /* VTDHostIOMMUDevice */
 
     /* interrupt remapping */
     bool intr_enabled;              /* Whether guest enabled IR */
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index e3b871de70..512ca4fdc5 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -281,7 +281,10 @@ static gboolean vtd_hiod_equal(gconstpointer v1, gconstpointer v2)
 
 static void vtd_hiod_destroy(gpointer v)
 {
-    object_unref(v);
+    VTDHostIOMMUDevice *vtd_hiod = v;
+
+    object_unref(vtd_hiod->hiod);
+    g_free(vtd_hiod);
 }
 
 static gboolean vtd_hash_remove_by_domain(gpointer key, gpointer value,
@@ -4371,6 +4374,7 @@ static bool vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
                                      HostIOMMUDevice *hiod, Error **errp)
 {
     IntelIOMMUState *s = opaque;
+    VTDHostIOMMUDevice *vtd_hiod;
     struct vtd_as_key key = {
         .bus = bus,
         .devfn = devfn,
@@ -4387,7 +4391,14 @@ static bool vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
         return false;
     }
 
+    vtd_hiod = g_malloc0(sizeof(VTDHostIOMMUDevice));
+    vtd_hiod->bus = bus;
+    vtd_hiod->devfn = (uint8_t)devfn;
+    vtd_hiod->iommu_state = s;
+    vtd_hiod->hiod = hiod;
+
     if (!vtd_check_hiod(s, hiod, errp)) {
+        g_free(vtd_hiod);
         vtd_iommu_unlock(s);
         return false;
     }
@@ -4397,7 +4408,7 @@ static bool vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
     new_key->devfn = devfn;
 
     object_ref(hiod);
-    g_hash_table_insert(s->vtd_host_iommu_dev, new_key, hiod);
+    g_hash_table_insert(s->vtd_host_iommu_dev, new_key, vtd_hiod);
 
     vtd_iommu_unlock(s);
 
-- 
2.47.1




* [PATCH v5 08/21] intel_iommu: Check for compatibility with IOMMUFD backed device when x-flts=on
  2025-08-22  6:40 [PATCH v5 00/21] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
                   ` (6 preceding siblings ...)
  2025-08-22  6:40 ` [PATCH v5 07/21] intel_iommu: Introduce a new structure VTDHostIOMMUDevice Zhenzhong Duan
@ 2025-08-22  6:40 ` Zhenzhong Duan
  2025-08-27 11:42   ` Yi Liu
  2025-08-27 11:55   ` Eric Auger
  2025-08-22  6:40 ` [PATCH v5 09/21] intel_iommu: Fail passthrough device under PCI bridge if x-flts=on Zhenzhong Duan
                   ` (13 subsequent siblings)
  21 siblings, 2 replies; 113+ messages in thread
From: Zhenzhong Duan @ 2025-08-22  6:40 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, joao.m.martins, clement.mathieu--drif, kevin.tian,
	yi.l.liu, chao.p.peng, Zhenzhong Duan

When vIOMMU is configured with x-flts=on in scalable mode, the stage-1 page
table is passed to the host to construct a nested page table. We need to check
compatibility of some critical IOMMU capabilities between vIOMMU and
host IOMMU to ensure the guest stage-1 page table can be used by the host.

For instance, if vIOMMU supports stage-1 1GB huge page mapping but the host
does not, then this IOMMUFD backed device should fail.

Even if the checks pass, for now we willingly reject the association
because not all the bits are there yet.
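
The capability comparison can be pictured with the following standalone
sketch. It is not the patch itself: struct sketch_host_caps and
sketch_check_stage1_compat() are hypothetical, SKETCH_ECAP_NEST reuses the
bit position of VTD_ECAP_NEST from this patch, and the bit position used for
SKETCH_CAP_FS1GP is an assumption made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_ECAP_NEST  (1ULL << 26)  /* same bit as VTD_ECAP_NEST above */
#define SKETCH_CAP_FS1GP  (1ULL << 56)  /* assumed bit for stage-1 1GB pages */

/* Hypothetical container for the host IOMMU capability registers. */
struct sketch_host_caps {
    uint64_t cap_reg;
    uint64_t ecap_reg;
};

/*
 * Return NULL if the host can consume the guest stage-1 page table,
 * otherwise a short reason string. guest_fs1gp says whether the vIOMMU
 * exposes stage-1 1GB pages to the guest.
 */
const char *sketch_check_stage1_compat(bool guest_fs1gp,
                                       const struct sketch_host_caps *host)
{
    assert(host);

    if (!(host->ecap_reg & SKETCH_ECAP_NEST)) {
        return "host IOMMU lacks nested translation";
    }
    if (guest_fs1gp && !(host->cap_reg & SKETCH_CAP_FS1GP)) {
        return "host IOMMU lacks stage-1 1GB pages";
    }
    return NULL;
}
```

Note the asymmetry: a host that supports more than the guest was offered is
fine, only a guest-visible feature that the host lacks is a failure.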

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/i386/intel_iommu_internal.h |  1 +
 hw/i386/intel_iommu.c          | 30 +++++++++++++++++++++++++++++-
 2 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index c7046eb4e2..f7510861d1 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -192,6 +192,7 @@
 #define VTD_ECAP_PT                 (1ULL << 6)
 #define VTD_ECAP_SC                 (1ULL << 7)
 #define VTD_ECAP_MHMV               (15ULL << 20)
+#define VTD_ECAP_NEST               (1ULL << 26)
 #define VTD_ECAP_SRS                (1ULL << 31)
 #define VTD_ECAP_PSS                (7ULL << 35) /* limit: MemTxAttrs::pid */
 #define VTD_ECAP_PASID              (1ULL << 40)
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 512ca4fdc5..da355bda79 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -40,6 +40,7 @@
 #include "kvm/kvm_i386.h"
 #include "migration/vmstate.h"
 #include "trace.h"
+#include "system/iommufd.h"
 
 /* context entry operations */
 #define VTD_CE_GET_RID2PASID(ce) \
@@ -4366,7 +4367,34 @@ static bool vtd_check_hiod(IntelIOMMUState *s, HostIOMMUDevice *hiod,
         return true;
     }
 
-    error_setg(errp, "host device is uncompatible with stage-1 translation");
+#ifdef CONFIG_IOMMUFD
+    struct HostIOMMUDeviceCaps *caps = &hiod->caps;
+    struct iommu_hw_info_vtd *vtd = &caps->vendor_caps.vtd;
+
+    /* Remaining checks are all stage-1 translation specific */
+    if (!object_dynamic_cast(OBJECT(hiod), TYPE_HOST_IOMMU_DEVICE_IOMMUFD)) {
+        error_setg(errp, "Need IOMMUFD backend when x-flts=on");
+        return false;
+    }
+
+    if (caps->type != IOMMU_HW_INFO_TYPE_INTEL_VTD) {
+        error_setg(errp, "Incompatible host platform IOMMU type %d",
+                   caps->type);
+        return false;
+    }
+
+    if (!(vtd->ecap_reg & VTD_ECAP_NEST)) {
+        error_setg(errp, "Host IOMMU doesn't support nested translation");
+        return false;
+    }
+
+    if (s->fs1gp && !(vtd->cap_reg & VTD_CAP_FS1GP)) {
+        error_setg(errp, "Stage-1 1GB huge page is unsupported by host IOMMU");
+        return false;
+    }
+#endif
+
+    error_setg(errp, "host IOMMU is incompatible with stage-1 translation");
     return false;
 }
 
-- 
2.47.1




* [PATCH v5 09/21] intel_iommu: Fail passthrough device under PCI bridge if x-flts=on
  2025-08-22  6:40 [PATCH v5 00/21] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
                   ` (7 preceding siblings ...)
  2025-08-22  6:40 ` [PATCH v5 08/21] intel_iommu: Check for compatibility with IOMMUFD backed device when x-flts=on Zhenzhong Duan
@ 2025-08-22  6:40 ` Zhenzhong Duan
  2025-08-28 10:33   ` Yi Liu
  2025-08-22  6:40 ` [PATCH v5 10/21] intel_iommu: Introduce two helpers vtd_as_from/to_iommu_pasid_locked Zhenzhong Duan
                   ` (12 subsequent siblings)
  21 siblings, 1 reply; 113+ messages in thread
From: Zhenzhong Duan @ 2025-08-22  6:40 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, joao.m.martins, clement.mathieu--drif, kevin.tian,
	yi.l.liu, chao.p.peng, Zhenzhong Duan

Currently we don't support nested translation for a passthrough device that
shares a PCI bridge with an emulated device, because the two require different
address spaces when x-flts=on.

In theory, we could support the case where all devices under the same PCI
bridge are passthrough devices. But an emulated device can be hotplugged under
the same bridge. To simplify, just forbid passthrough devices under a PCI
bridge, whether or not emulated devices are, or will be, under the same bridge.
This is acceptable because PCIe bridges are more common than PCI bridges now.

Suggested-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
---
 hw/i386/intel_iommu.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index da355bda79..6edd91d94e 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -4341,9 +4341,10 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus,
     return vtd_dev_as;
 }
 
-static bool vtd_check_hiod(IntelIOMMUState *s, HostIOMMUDevice *hiod,
+static bool vtd_check_hiod(IntelIOMMUState *s, VTDHostIOMMUDevice *vtd_hiod,
                            Error **errp)
 {
+    HostIOMMUDevice *hiod = vtd_hiod->hiod;
     HostIOMMUDeviceClass *hiodc = HOST_IOMMU_DEVICE_GET_CLASS(hiod);
     int ret;
 
@@ -4370,6 +4371,8 @@ static bool vtd_check_hiod(IntelIOMMUState *s, HostIOMMUDevice *hiod,
 #ifdef CONFIG_IOMMUFD
     struct HostIOMMUDeviceCaps *caps = &hiod->caps;
     struct iommu_hw_info_vtd *vtd = &caps->vendor_caps.vtd;
+    PCIBus *bus = vtd_hiod->bus;
+    PCIDevice *pdev = pci_find_device(bus, pci_bus_num(bus), vtd_hiod->devfn);
 
     /* Remaining checks are all stage-1 translation specific */
     if (!object_dynamic_cast(OBJECT(hiod), TYPE_HOST_IOMMU_DEVICE_IOMMUFD)) {
@@ -4392,6 +4395,12 @@ static bool vtd_check_hiod(IntelIOMMUState *s, HostIOMMUDevice *hiod,
         error_setg(errp, "Stage-1 1GB huge page is unsupported by host IOMMU");
         return false;
     }
+
+    if (pci_device_get_iommu_bus_devfn(pdev, &bus, NULL, NULL)) {
+        error_setg(errp, "Host device under PCI bridge is unsupported "
+                   "when x-flts=on");
+        return false;
+    }
 #endif
 
     error_setg(errp, "host IOMMU is incompatible with stage-1 translation");
@@ -4425,7 +4434,7 @@ static bool vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
     vtd_hiod->iommu_state = s;
     vtd_hiod->hiod = hiod;
 
-    if (!vtd_check_hiod(s, hiod, errp)) {
+    if (!vtd_check_hiod(s, vtd_hiod, errp)) {
         g_free(vtd_hiod);
         vtd_iommu_unlock(s);
         return false;
-- 
2.47.1




* [PATCH v5 10/21] intel_iommu: Introduce two helpers vtd_as_from/to_iommu_pasid_locked
  2025-08-22  6:40 [PATCH v5 00/21] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
                   ` (8 preceding siblings ...)
  2025-08-22  6:40 ` [PATCH v5 09/21] intel_iommu: Fail passthrough device under PCI bridge if x-flts=on Zhenzhong Duan
@ 2025-08-22  6:40 ` Zhenzhong Duan
  2025-08-28 11:36   ` Yi Liu
  2025-08-22  6:40 ` [PATCH v5 11/21] intel_iommu: Handle PASID entry removal and update Zhenzhong Duan
                   ` (11 subsequent siblings)
  21 siblings, 1 reply; 113+ messages in thread
From: Zhenzhong Duan @ 2025-08-22  6:40 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, joao.m.martins, clement.mathieu--drif, kevin.tian,
	yi.l.liu, chao.p.peng, Zhenzhong Duan

PCI devices support two request types: Requests-without-PASID and
Requests-with-PASID. A Request-without-PASID doesn't include a PASID TLP
prefix, so the IOMMU fetches rid_pasid from the context entry and uses it as
the IOMMU's pasid to index the pasid table.

So we need to translate between PCI's pasid and IOMMU's pasid specifically
for Requests-without-PASID, e.g., PCI_NO_PASID(-1) <-> rid_pasid.
For Requests-with-PASID, PCI's pasid and IOMMU's pasid have the same value.

vtd_as_from_iommu_pasid_locked() translates from BDF+iommu_pasid to vtd_as
which contains PCI's pasid vtd_as->pasid.

vtd_as_to_iommu_pasid_locked() translates from BDF+vtd_as->pasid to iommu_pasid.
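
The mapping itself is small enough to show as a standalone sketch. The
names below are hypothetical and the context-entry lookup is replaced by a
plain rid_pasid argument, so this only illustrates the translation rule,
not the locking or caching in the helpers added by this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PCI_NO_PASID  UINT32_MAX  /* stand-in for PCI_NO_PASID (-1) */
#define SKETCH_PASID_MAX     0xfffffu    /* PASIDs are at most 20 bits wide */

/*
 * PCI pasid -> IOMMU pasid. Requests-without-PASID borrow rid_pasid from
 * the context entry; Requests-with-PASID use their own value unchanged.
 */
uint32_t sketch_pci_to_iommu_pasid(uint32_t pci_pasid, uint32_t rid_pasid)
{
    assert(rid_pasid <= SKETCH_PASID_MAX);

    if (pci_pasid == SKETCH_PCI_NO_PASID) {
        return rid_pasid;
    }
    return pci_pasid;
}

/*
 * Reverse direction, phrased as a match predicate: does an address space
 * tagged with as_pci_pasid correspond to the given IOMMU pasid? This is the
 * kind of test a hash table search callback would apply to each entry.
 */
bool sketch_as_matches(uint32_t as_pci_pasid, uint32_t rid_pasid,
                       uint32_t iommu_pasid)
{
    return sketch_pci_to_iommu_pasid(as_pci_pasid, rid_pasid) == iommu_pasid;
}
```

Expressing the reverse lookup as a predicate avoids keeping a second index
keyed by IOMMU pasid, at the cost of a linear scan over address spaces.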

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
---
 hw/i386/intel_iommu.c | 58 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 58 insertions(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 6edd91d94e..1801f1cdf6 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -1602,6 +1602,64 @@ static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
     return 0;
 }
 
+static int vtd_as_to_iommu_pasid_locked(VTDAddressSpace *vtd_as,
+                                        uint32_t *pasid)
+{
+    VTDContextCacheEntry *cc_entry = &vtd_as->context_cache_entry;
+    IntelIOMMUState *s = vtd_as->iommu_state;
+    uint8_t bus_num = pci_bus_num(vtd_as->bus);
+    uint8_t devfn = vtd_as->devfn;
+    VTDContextEntry ce;
+    int ret;
+
+    /* For Requests-with-PASID, its pasid value is used by vIOMMU directly */
+    if (vtd_as->pasid != PCI_NO_PASID) {
+        *pasid = vtd_as->pasid;
+        return 0;
+    }
+
+    if (cc_entry->context_cache_gen == s->context_cache_gen) {
+        ce = cc_entry->context_entry;
+    } else {
+        ret = vtd_dev_to_context_entry(s, bus_num, devfn, &ce);
+        if (ret) {
+            return ret;
+        }
+    }
+
+    *pasid = VTD_CE_GET_RID2PASID(&ce);
+    return 0;
+}
+
+static gboolean vtd_find_as_by_sid_and_iommu_pasid(gpointer key, gpointer value,
+                                                   gpointer user_data)
+{
+    VTDAddressSpace *vtd_as = (VTDAddressSpace *)value;
+    struct vtd_as_raw_key *target = (struct vtd_as_raw_key *)user_data;
+    uint16_t sid = PCI_BUILD_BDF(pci_bus_num(vtd_as->bus), vtd_as->devfn);
+    uint32_t pasid;
+
+    if (vtd_as_to_iommu_pasid_locked(vtd_as, &pasid)) {
+        return false;
+    }
+
+    return (pasid == target->pasid) && (sid == target->sid);
+}
+
+/* Translate iommu pasid to vtd_as */
+static inline
+VTDAddressSpace *vtd_as_from_iommu_pasid_locked(IntelIOMMUState *s,
+                                                uint16_t sid, uint32_t pasid)
+{
+    struct vtd_as_raw_key key = {
+        .sid = sid,
+        .pasid = pasid
+    };
+
+    return g_hash_table_find(s->vtd_address_spaces,
+                             vtd_find_as_by_sid_and_iommu_pasid, &key);
+}
+
 static int vtd_sync_shadow_page_hook(const IOMMUTLBEvent *event,
                                      void *private)
 {
-- 
2.47.1




* [PATCH v5 11/21] intel_iommu: Handle PASID entry removal and update
  2025-08-22  6:40 [PATCH v5 00/21] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
                   ` (9 preceding siblings ...)
  2025-08-22  6:40 ` [PATCH v5 10/21] intel_iommu: Introduce two helpers vtd_as_from/to_iommu_pasid_locked Zhenzhong Duan
@ 2025-08-22  6:40 ` Zhenzhong Duan
  2025-08-27 14:25   ` Eric Auger
  2025-08-28 12:05   ` Yi Liu
  2025-08-22  6:40 ` [PATCH v5 12/21] intel_iommu: Handle PASID entry addition Zhenzhong Duan
                   ` (10 subsequent siblings)
  21 siblings, 2 replies; 113+ messages in thread
From: Zhenzhong Duan @ 2025-08-22  6:40 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, joao.m.martins, clement.mathieu--drif, kevin.tian,
	yi.l.liu, chao.p.peng, Zhenzhong Duan, Yi Sun

This adds a new entry VTDPASIDCacheEntry in VTDAddressSpace to cache the
pasid entry and track PASID usage, in preparation for future PASID tagged DMA
address translation support in vIOMMU.

The VTDAddressSpace of PCI_NO_PASID is allocated when the device is plugged and
is never freed. For other pasids, a VTDAddressSpace instance is created or
destroyed as the guest pasid entry is set up or destroyed.

When the guest removes or updates a PASID entry, QEMU captures the guest's
pasid-selective pasid cache invalidation and either removes the
VTDAddressSpace or updates the cached PASID entry.

The vIOMMU emulator can figure out which case applies by fetching the latest
guest pasid entry and comparing it with the cached PASID entry.
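
The per-entry decision can be sketched in isolation as follows. All names
are hypothetical, and the sketch leaves out one detail of the patch: for
PCI_NO_PASID the address space is kept and only the cached entry is
invalidated, which is not modelled here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum sketch_inv_type { SKETCH_INV_DOMAIN, SKETCH_INV_PASID, SKETCH_INV_GLOBAL };
enum sketch_action { SKETCH_KEEP, SKETCH_UPDATE, SKETCH_REMOVE };

struct sketch_pasid_entry { uint64_t val[8]; };

struct sketch_inv {
    enum sketch_inv_type type;
    uint16_t did;     /* used by domain- and pasid-selective requests */
    uint32_t pasid;   /* used by pasid-selective requests only */
};

/*
 * Decide what one invalidation means for one cached entry. A NULL guest
 * pointer models "no valid pasid entry in guest memory any more".
 */
enum sketch_action sketch_flush_decision(const struct sketch_inv *inv,
                                         uint16_t cached_did,
                                         uint32_t cached_pasid,
                                         const struct sketch_pasid_entry *cached,
                                         const struct sketch_pasid_entry *guest)
{
    assert(inv && cached);

    switch (inv->type) {
    case SKETCH_INV_PASID:
        if (inv->pasid != cached_pasid) {
            return SKETCH_KEEP;          /* not the targeted pasid */
        }
        /* fall through */
    case SKETCH_INV_DOMAIN:
        if (inv->did != cached_did) {
            return SKETCH_KEEP;          /* not the targeted domain */
        }
        /* fall through */
    case SKETCH_INV_GLOBAL:
        break;
    }

    if (!guest) {
        return SKETCH_REMOVE;            /* entry vanished in guest memory */
    }
    if (memcmp(guest, cached, sizeof(*cached)) != 0) {
        return SKETCH_UPDATE;            /* cached copy is stale */
    }
    return SKETCH_KEEP;                  /* nothing changed */
}
```

The fall-through ordering encodes that a pasid-selective request is also
domain-selective, and that a global request matches every cached entry.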

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/i386/intel_iommu_internal.h |  27 ++++-
 include/hw/i386/intel_iommu.h  |   6 +
 hw/i386/intel_iommu.c          | 196 +++++++++++++++++++++++++++++++--
 hw/i386/trace-events           |   3 +
 4 files changed, 220 insertions(+), 12 deletions(-)

diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index f7510861d1..b9b76dd996 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -316,6 +316,7 @@ typedef enum VTDFaultReason {
                                   * request while disabled */
     VTD_FR_IR_SID_ERR = 0x26,   /* Invalid Source-ID */
 
+    VTD_FR_RTADDR_INV_TTM = 0x31,  /* Invalid TTM in RTADDR */
     /* PASID directory entry access failure */
     VTD_FR_PASID_DIR_ACCESS_ERR = 0x50,
     /* The Present(P) field of pasid directory entry is 0 */
@@ -493,6 +494,15 @@ typedef union VTDInvDesc VTDInvDesc;
 #define VTD_INV_DESC_PIOTLB_RSVD_VAL0     0xfff000000000f1c0ULL
 #define VTD_INV_DESC_PIOTLB_RSVD_VAL1     0xf80ULL
 
+/* PASID-cache Invalidate Descriptor (pc_inv_dsc) fields */
+#define VTD_INV_DESC_PASIDC_G(x)        extract64((x)->val[0], 4, 2)
+#define VTD_INV_DESC_PASIDC_G_DSI       0
+#define VTD_INV_DESC_PASIDC_G_PASID_SI  1
+#define VTD_INV_DESC_PASIDC_G_GLOBAL    3
+#define VTD_INV_DESC_PASIDC_DID(x)      extract64((x)->val[0], 16, 16)
+#define VTD_INV_DESC_PASIDC_PASID(x)    extract64((x)->val[0], 32, 20)
+#define VTD_INV_DESC_PASIDC_RSVD_VAL0   0xfff000000000f1c0ULL
+
 /* Information about page-selective IOTLB invalidate */
 struct VTDIOTLBPageInvInfo {
     uint16_t domain_id;
@@ -553,6 +563,21 @@ typedef struct VTDRootEntry VTDRootEntry;
 #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw)  (0x1e0ULL | ~VTD_HAW_MASK(aw))
 #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1      0xffffffffffe00000ULL
 
+typedef enum VTDPCInvType {
+    /* VTD spec defined PASID cache invalidation type */
+    VTD_PASID_CACHE_DOMSI = VTD_INV_DESC_PASIDC_G_DSI,
+    VTD_PASID_CACHE_PASIDSI = VTD_INV_DESC_PASIDC_G_PASID_SI,
+    VTD_PASID_CACHE_GLOBAL_INV = VTD_INV_DESC_PASIDC_G_GLOBAL,
+} VTDPCInvType;
+
+typedef struct VTDPASIDCacheInfo {
+    VTDPCInvType type;
+    uint16_t did;
+    uint32_t pasid;
+    PCIBus *bus;
+    uint16_t devfn;
+} VTDPASIDCacheInfo;
+
 /* PASID Table Related Definitions */
 #define VTD_PASID_DIR_BASE_ADDR_MASK  (~0xfffULL)
 #define VTD_PASID_TABLE_BASE_ADDR_MASK (~0xfffULL)
@@ -574,7 +599,7 @@ typedef struct VTDRootEntry VTDRootEntry;
 #define VTD_SM_PASID_ENTRY_PT          (4ULL << 6)
 
 #define VTD_SM_PASID_ENTRY_AW          7ULL /* Adjusted guest-address-width */
-#define VTD_SM_PASID_ENTRY_DID(val)    ((val) & VTD_DOMAIN_ID_MASK)
+#define VTD_SM_PASID_ENTRY_DID(x)      extract64((x)->val[1], 0, 16)
 
 #define VTD_SM_PASID_ENTRY_FLPM          3ULL
 #define VTD_SM_PASID_ENTRY_FLPTPTR       (~0xfffULL)
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 50f9b27a45..0e3826f6f0 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -95,6 +95,11 @@ struct VTDPASIDEntry {
     uint64_t val[8];
 };
 
+typedef struct VTDPASIDCacheEntry {
+    struct VTDPASIDEntry pasid_entry;
+    bool valid;
+} VTDPASIDCacheEntry;
+
 struct VTDAddressSpace {
     PCIBus *bus;
     uint8_t devfn;
@@ -107,6 +112,7 @@ struct VTDAddressSpace {
     MemoryRegion iommu_ir_fault; /* Interrupt region for catching fault */
     IntelIOMMUState *iommu_state;
     VTDContextCacheEntry context_cache_entry;
+    VTDPASIDCacheEntry pasid_cache_entry;
     QLIST_ENTRY(VTDAddressSpace) next;
     /* Superset of notifier flags that this address space has */
     IOMMUNotifierFlag notifier_flags;
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 1801f1cdf6..a2ee6d684e 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -1675,7 +1675,7 @@ static uint16_t vtd_get_domain_id(IntelIOMMUState *s,
 
     if (s->root_scalable) {
         vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
-        return VTD_SM_PASID_ENTRY_DID(pe.val[1]);
+        return VTD_SM_PASID_ENTRY_DID(&pe);
     }
 
     return VTD_CONTEXT_ENTRY_DID(ce->hi);
@@ -3112,6 +3112,183 @@ static bool vtd_process_piotlb_desc(IntelIOMMUState *s,
     return true;
 }
 
+static inline int vtd_dev_get_pe_from_pasid(VTDAddressSpace *vtd_as,
+                                            uint32_t pasid, VTDPASIDEntry *pe)
+{
+    IntelIOMMUState *s = vtd_as->iommu_state;
+    VTDContextEntry ce;
+    int ret;
+
+    if (!s->root_scalable) {
+        return -VTD_FR_RTADDR_INV_TTM;
+    }
+
+    ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus), vtd_as->devfn,
+                                   &ce);
+    if (ret) {
+        return ret;
+    }
+
+    return vtd_ce_get_pasid_entry(s, &ce, pe, pasid);
+}
+
+static bool vtd_pasid_entry_compare(VTDPASIDEntry *p1, VTDPASIDEntry *p2)
+{
+    return !memcmp(p1, p2, sizeof(*p1));
+}
+
+/*
+ * This is a callback for g_hash_table_foreach_remove(); its return value
+ * determines whether vtd_as, including its cached pasid entry, is removed.
+ *
+ * For PCI_NO_PASID, when the corresponding cached pasid entry is cleared,
+ * it returns false so that vtd_as is kept, as it's owned by the PCI
+ * sub-system. For other pasids, it returns true so vtd_as is removed.
+ */
+static gboolean vtd_flush_pasid_locked(gpointer key, gpointer value,
+                                       gpointer user_data)
+{
+    VTDPASIDCacheInfo *pc_info = user_data;
+    VTDAddressSpace *vtd_as = value;
+    VTDPASIDCacheEntry *pc_entry = &vtd_as->pasid_cache_entry;
+    VTDPASIDEntry pe;
+    uint16_t did;
+    uint32_t pasid;
+    int ret;
+
+    if (!pc_entry->valid) {
+        return false;
+    }
+    did = VTD_SM_PASID_ENTRY_DID(&pc_entry->pasid_entry);
+
+    if (vtd_as_to_iommu_pasid_locked(vtd_as, &pasid)) {
+        goto remove;
+    }
+
+    switch (pc_info->type) {
+    case VTD_PASID_CACHE_PASIDSI:
+        if (pc_info->pasid != pasid) {
+            return false;
+        }
+        /* fall through */
+    case VTD_PASID_CACHE_DOMSI:
+        if (pc_info->did != did) {
+            return false;
+        }
+        /* fall through */
+    case VTD_PASID_CACHE_GLOBAL_INV:
+        break;
+    default:
+        error_setg(&error_fatal, "invalid pc_info->type for flush");
+    }
+
+    /*
+     * A pasid cache invalidation may indicate that a present pasid entry
+     * was changed into another present pasid entry. To cover this case, the
+     * vIOMMU emulator fetches the latest guest pasid entry, compares it with
+     * the cached pasid entry, then updates the pasid cache.
+     */
+    ret = vtd_dev_get_pe_from_pasid(vtd_as, pasid, &pe);
+    if (ret) {
+        /*
+         * No valid pasid entry in guest memory. e.g. pasid entry was modified
+         * to be either all-zero or non-present. Either case means existing
+         * pasid cache should be removed.
+         */
+        goto remove;
+    }
+
+    /*
+     * Update cached pasid entry if it's stale compared to what's in guest
+     * memory.
+     */
+    if (!vtd_pasid_entry_compare(&pe, &pc_entry->pasid_entry)) {
+        pc_entry->pasid_entry = pe;
+    }
+    return false;
+
+remove:
+    pc_entry->valid = false;
+
+    /*
+     * Don't remove address space of PCI_NO_PASID which is created for PCI
+     * sub-system.
+     */
+    if (vtd_as->pasid == PCI_NO_PASID) {
+        return false;
+    }
+    return true;
+}
+
+/*
+ * For a PASID cache invalidation, this function handles below scenarios:
+ * a) a present cached pasid entry needs to be removed
+ * b) a present cached pasid entry needs to be updated
+ */
+static void vtd_pasid_cache_sync(IntelIOMMUState *s, VTDPASIDCacheInfo *pc_info)
+{
+    if (!s->flts || !s->root_scalable || !s->dmar_enabled) {
+        return;
+    }
+
+    vtd_iommu_lock(s);
+    /*
+     * a,b): loop all the existing vtd_as instances for pasid cache removal
+     * or update.
+     */
+    g_hash_table_foreach_remove(s->vtd_address_spaces, vtd_flush_pasid_locked,
+                                pc_info);
+    vtd_iommu_unlock(s);
+}
+
+static bool vtd_process_pasid_desc(IntelIOMMUState *s,
+                                   VTDInvDesc *inv_desc)
+{
+    uint16_t did;
+    uint32_t pasid;
+    VTDPASIDCacheInfo pc_info;
+    uint64_t mask[4] = {VTD_INV_DESC_PASIDC_RSVD_VAL0, VTD_INV_DESC_ALL_ONE,
+                        VTD_INV_DESC_ALL_ONE, VTD_INV_DESC_ALL_ONE};
+
+    if (!vtd_inv_desc_reserved_check(s, inv_desc, mask, true,
+                                     __func__, "pasid cache inv")) {
+        return false;
+    }
+
+    did = VTD_INV_DESC_PASIDC_DID(inv_desc);
+    pasid = VTD_INV_DESC_PASIDC_PASID(inv_desc);
+
+    switch (VTD_INV_DESC_PASIDC_G(inv_desc)) {
+    case VTD_INV_DESC_PASIDC_G_DSI:
+        trace_vtd_pasid_cache_dsi(did);
+        pc_info.type = VTD_PASID_CACHE_DOMSI;
+        pc_info.did = did;
+        break;
+
+    case VTD_INV_DESC_PASIDC_G_PASID_SI:
+        /* PASID selective implies a DID selective */
+        trace_vtd_pasid_cache_psi(did, pasid);
+        pc_info.type = VTD_PASID_CACHE_PASIDSI;
+        pc_info.did = did;
+        pc_info.pasid = pasid;
+        break;
+
+    case VTD_INV_DESC_PASIDC_G_GLOBAL:
+        trace_vtd_pasid_cache_gsi();
+        pc_info.type = VTD_PASID_CACHE_GLOBAL_INV;
+        break;
+
+    default:
+        error_report_once("invalid granularity field in PASID-cache invalidate "
+                          "descriptor, hi: 0x%"PRIx64" lo: 0x%" PRIx64,
+                           inv_desc->val[1], inv_desc->val[0]);
+        return false;
+    }
+
+    vtd_pasid_cache_sync(s, &pc_info);
+    return true;
+}
+
 static bool vtd_process_inv_iec_desc(IntelIOMMUState *s,
                                      VTDInvDesc *inv_desc)
 {
@@ -3274,6 +3451,13 @@ static bool vtd_process_inv_desc(IntelIOMMUState *s)
         }
         break;
 
+    case VTD_INV_DESC_PC:
+        trace_vtd_inv_desc("pasid-cache", inv_desc.val[1], inv_desc.val[0]);
+        if (!vtd_process_pasid_desc(s, &inv_desc)) {
+            return false;
+        }
+        break;
+
     case VTD_INV_DESC_PIOTLB:
         trace_vtd_inv_desc("p-iotlb", inv_desc.val[1], inv_desc.val[0]);
         if (!vtd_process_piotlb_desc(s, &inv_desc)) {
@@ -3309,16 +3493,6 @@ static bool vtd_process_inv_desc(IntelIOMMUState *s)
         }
         break;
 
-    /*
-     * TODO: the entity of below two cases will be implemented in future series.
-     * To make guest (which integrates scalable mode support patch set in
-     * iommu driver) work, just return true is enough so far.
-     */
-    case VTD_INV_DESC_PC:
-        if (s->scalable_mode) {
-            break;
-        }
-    /* fallthrough */
     default:
         error_report_once("%s: invalid inv desc: hi=%"PRIx64", lo=%"PRIx64
                           " (unknown type)", __func__, inv_desc.hi,
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index ac9e1a10aa..ae5bbfcdc0 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -24,6 +24,9 @@ vtd_inv_qi_head(uint16_t head) "read head %d"
 vtd_inv_qi_tail(uint16_t head) "write tail %d"
 vtd_inv_qi_fetch(void) ""
 vtd_context_cache_reset(void) ""
+vtd_pasid_cache_gsi(void) ""
+vtd_pasid_cache_dsi(uint16_t domain) "Domain selective PC invalidation domain 0x%"PRIx16
+vtd_pasid_cache_psi(uint16_t domain, uint32_t pasid) "PASID selective PC invalidation domain 0x%"PRIx16" pasid 0x%"PRIx32
 vtd_re_not_present(uint8_t bus) "Root entry bus %"PRIu8" not present"
 vtd_ce_not_present(uint8_t bus, uint8_t devfn) "Context entry bus %"PRIu8" devfn %"PRIu8" not present"
 vtd_iotlb_page_hit(uint16_t sid, uint64_t addr, uint64_t slpte, uint16_t domain) "IOTLB page hit sid 0x%"PRIx16" iova 0x%"PRIx64" slpte 0x%"PRIx64" domain 0x%"PRIx16
-- 
2.47.1




* [PATCH v5 12/21] intel_iommu: Handle PASID entry addition
  2025-08-22  6:40 [PATCH v5 00/21] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
                   ` (10 preceding siblings ...)
  2025-08-22  6:40 ` [PATCH v5 11/21] intel_iommu: Handle PASID entry removal and update Zhenzhong Duan
@ 2025-08-22  6:40 ` Zhenzhong Duan
  2025-08-27 16:22   ` Eric Auger
  2025-08-29  5:46   ` Yi Liu
  2025-08-22  6:40 ` [PATCH v5 13/21] intel_iommu: Introduce a new pasid cache invalidation type FORCE_RESET Zhenzhong Duan
                   ` (9 subsequent siblings)
  21 siblings, 2 replies; 113+ messages in thread
From: Zhenzhong Duan @ 2025-08-22  6:40 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, joao.m.martins, clement.mathieu--drif, kevin.tian,
	yi.l.liu, chao.p.peng, Zhenzhong Duan, Yi Sun

When the guest creates new PASID entries, QEMU captures the guest's
PASID-selective PASID cache invalidation and walks through each
passthrough device and each PASID. When a match is found, it identifies
an existing vtd_as or creates a new one and updates its cached PASID entry.

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
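As a reading aid for the two-level walk described above, here is a minimal
sketch of the range arithmetic only. It assumes the same constants as the
patch (64 entries per PASID table, 2^(PDTS + 7) directory entries); the
function names are hypothetical and it does not touch guest memory.

```c
#include <stdint.h>

/* Mirrors VTD_PASID_TBL_ENTRY_NUM (1 << 6): entries per leaf PASID table. */
#define TOY_PASID_TBL_ENTRY_NUM 64u

/* Directory entries for a PDTS field value, as in
 * vtd_sm_ce_get_pdt_entry_num(): 2^(PDTS + 7). PDTS is a 3-bit field. */
static uint32_t toy_pdt_entry_num(uint32_t pdts)
{
    return 1u << (pdts + 7);
}

/* Exclusive upper bound of the PASID space covered by the directory. */
static uint32_t toy_max_pasid(uint32_t pdts)
{
    return toy_pdt_entry_num(pdts) * TOY_PASID_TBL_ENTRY_NUM;
}

/* End (exclusive) of the leaf-table chunk containing pasid, clamped to end.
 * Rounds up to the next multiple of 64, like the pasid_next computation. */
static uint32_t toy_next_chunk_end(uint32_t pasid, uint32_t end)
{
    uint32_t next = (pasid + TOY_PASID_TBL_ENTRY_NUM) &
                    ~(TOY_PASID_TBL_ENTRY_NUM - 1);

    return next < end ? next : end;
}

/* Number of leaf tables a walk over [start, end) would visit. */
static unsigned toy_count_chunks(uint32_t start, uint32_t end)
{
    unsigned n = 0;
    uint32_t pasid = start;

    while (pasid < end) {
        pasid = toy_next_chunk_end(pasid, end);
        n++;
    }
    return n;
}
```

With the range fixed to [0, 1) as in this series, the walk visits a single
leaf table; a walk over [60, 200) would be split at 64, 128 and 192.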
 hw/i386/intel_iommu_internal.h |   2 +
 hw/i386/intel_iommu.c          | 176 ++++++++++++++++++++++++++++++++-
 2 files changed, 175 insertions(+), 3 deletions(-)

diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index b9b76dd996..fb2a919e87 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -559,6 +559,7 @@ typedef struct VTDRootEntry VTDRootEntry;
 #define VTD_CTX_ENTRY_LEGACY_SIZE     16
 #define VTD_CTX_ENTRY_SCALABLE_SIZE   32
 
+#define VTD_SM_CONTEXT_ENTRY_PDTS(x)        extract64((x)->val[0], 9, 3)
 #define VTD_SM_CONTEXT_ENTRY_RID2PASID_MASK 0xfffff
 #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw)  (0x1e0ULL | ~VTD_HAW_MASK(aw))
 #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1      0xffffffffffe00000ULL
@@ -589,6 +590,7 @@ typedef struct VTDPASIDCacheInfo {
 #define VTD_PASID_TABLE_BITS_MASK     (0x3fULL)
 #define VTD_PASID_TABLE_INDEX(pasid)  ((pasid) & VTD_PASID_TABLE_BITS_MASK)
 #define VTD_PASID_ENTRY_FPD           (1ULL << 1) /* Fault Processing Disable */
+#define VTD_PASID_TBL_ENTRY_NUM       (1ULL << 6)
 
 /* PASID Granular Translation Type Mask */
 #define VTD_PASID_ENTRY_P              1ULL
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index a2ee6d684e..7d2c9feae7 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -826,6 +826,11 @@ static inline bool vtd_pe_type_check(IntelIOMMUState *s, VTDPASIDEntry *pe)
     }
 }
 
+static inline uint32_t vtd_sm_ce_get_pdt_entry_num(VTDContextEntry *ce)
+{
+    return 1U << (VTD_SM_CONTEXT_ENTRY_PDTS(ce) + 7);
+}
+
 static inline bool vtd_pdire_present(VTDPASIDDirEntry *pdire)
 {
     return pdire->val & 1;
@@ -1647,9 +1652,9 @@ static gboolean vtd_find_as_by_sid_and_iommu_pasid(gpointer key, gpointer value,
 }
 
 /* Translate iommu pasid to vtd_as */
-static inline
-VTDAddressSpace *vtd_as_from_iommu_pasid_locked(IntelIOMMUState *s,
-                                                uint16_t sid, uint32_t pasid)
+static VTDAddressSpace *vtd_as_from_iommu_pasid_locked(IntelIOMMUState *s,
+                                                       uint16_t sid,
+                                                       uint32_t pasid)
 {
     struct vtd_as_raw_key key = {
         .sid = sid,
@@ -3220,10 +3225,172 @@ remove:
     return true;
 }
 
+/*
+ * This function walks over PASID range within [start, end) in a single
+ * PASID table for entries matching @info type/did, then retrieve/create
+ * vtd_as and fill associated pasid entry cache.
+ */
+static void vtd_sm_pasid_table_walk_one(IntelIOMMUState *s,
+                                        dma_addr_t pt_base,
+                                        int start,
+                                        int end,
+                                        VTDPASIDCacheInfo *info)
+{
+    VTDPASIDEntry pe;
+    int pasid = start;
+
+    while (pasid < end) {
+        if (!vtd_get_pe_in_pasid_leaf_table(s, pasid, pt_base, &pe)
+            && vtd_pe_present(&pe)) {
+            int bus_n = pci_bus_num(info->bus), devfn = info->devfn;
+            uint16_t sid = PCI_BUILD_BDF(bus_n, devfn);
+            VTDPASIDCacheEntry *pc_entry;
+            VTDAddressSpace *vtd_as;
+
+            vtd_iommu_lock(s);
+            /*
+             * When indexed by rid2pasid, vtd_as should have been created,
+             * e.g., by PCI subsystem. For other iommu pasid, we need to
+             * create vtd_as dynamically. Other iommu pasid is same value
+             * as PCI's pasid, so it's used as input of vtd_find_add_as().
+             */
+            vtd_as = vtd_as_from_iommu_pasid_locked(s, sid, pasid);
+            vtd_iommu_unlock(s);
+            if (!vtd_as) {
+                vtd_as = vtd_find_add_as(s, info->bus, devfn, pasid);
+            }
+
+            if ((info->type == VTD_PASID_CACHE_DOMSI ||
+                 info->type == VTD_PASID_CACHE_PASIDSI) &&
+                (info->did != VTD_SM_PASID_ENTRY_DID(&pe))) {
+                /*
+                 * VTD_PASID_CACHE_DOMSI and VTD_PASID_CACHE_PASIDSI
+                 * require a domain id check. If the check fails,
+                 * go to the next pasid.
+                 */
+                pasid++;
+                continue;
+            }
+
+            pc_entry = &vtd_as->pasid_cache_entry;
+            /*
+             * pasid cache update and clear are handled in
+             * vtd_flush_pasid_locked(); only new pasid entries matter here.
+             */
+            if (!pc_entry->valid) {
+                pc_entry->pasid_entry = pe;
+                pc_entry->valid = true;
+            }
+        }
+        pasid++;
+    }
+}
+
+/*
+ * In VT-d scalable mode translation, PASID dir + PASID table is used.
+ * This function aims at looping over a range of PASIDs in the given
+ * two level table to identify the pasid config in guest.
+ */
+static void vtd_sm_pasid_table_walk(IntelIOMMUState *s,
+                                    dma_addr_t pdt_base,
+                                    int start, int end,
+                                    VTDPASIDCacheInfo *info)
+{
+    VTDPASIDDirEntry pdire;
+    int pasid = start;
+    int pasid_next;
+    dma_addr_t pt_base;
+
+    while (pasid < end) {
+        pasid_next =
+             (pasid + VTD_PASID_TBL_ENTRY_NUM) & ~(VTD_PASID_TBL_ENTRY_NUM - 1);
+        pasid_next = pasid_next < end ? pasid_next : end;
+
+        if (!vtd_get_pdire_from_pdir_table(pdt_base, pasid, &pdire)
+            && vtd_pdire_present(&pdire)) {
+            pt_base = pdire.val & VTD_PASID_TABLE_BASE_ADDR_MASK;
+            vtd_sm_pasid_table_walk_one(s, pt_base, pasid, pasid_next, info);
+        }
+        pasid = pasid_next;
+    }
+}
+
+static void vtd_replay_pasid_bind_for_dev(IntelIOMMUState *s,
+                                          int start, int end,
+                                          VTDPASIDCacheInfo *info)
+{
+    VTDContextEntry ce;
+
+    if (!vtd_dev_to_context_entry(s, pci_bus_num(info->bus), info->devfn,
+                                  &ce)) {
+        uint32_t max_pasid;
+
+        max_pasid = vtd_sm_ce_get_pdt_entry_num(&ce) * VTD_PASID_TBL_ENTRY_NUM;
+        if (end > max_pasid) {
+            end = max_pasid;
+        }
+        vtd_sm_pasid_table_walk(s,
+                                VTD_CE_GET_PASID_DIR_TABLE(&ce),
+                                start,
+                                end,
+                                info);
+    }
+}
+
+/*
+ * This function replays the guest pasid bindings by walking the two level
+ * guest PASID table. For each valid pasid entry, it finds or creates a
+ * vtd_as and caches pasid entry in vtd_as.
+ */
+static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
+                                            VTDPASIDCacheInfo *pc_info)
+{
+    /*
+     * Currently only Requests-without-PASID is supported, as vIOMMU doesn't
+     * support RPS(RID-PASID Support), pasid scope is fixed to [0, 1).
+     */
+    int start = 0, end = 1;
+    VTDHostIOMMUDevice *vtd_hiod;
+    VTDPASIDCacheInfo walk_info;
+    GHashTableIter as_it;
+
+    switch (pc_info->type) {
+    case VTD_PASID_CACHE_PASIDSI:
+        start = pc_info->pasid;
+        end = pc_info->pasid + 1;
+        /* fall through */
+    case VTD_PASID_CACHE_DOMSI:
+    case VTD_PASID_CACHE_GLOBAL_INV:
+        /* loop all assigned devices */
+        break;
+    default:
+        error_setg(&error_fatal, "invalid pc_info->type for replay");
+    }
+
+    /*
+     * In this replay, one only needs to care about the devices which are
+     * backed by host IOMMU. Those devices have a corresponding vtd_hiod
+     * in s->vtd_host_iommu_dev. For devices not backed by host IOMMU, it
+     * is not necessary to replay the bindings since their cache could be
+     * re-created in the future DMA address translation.
+     *
+     * VTD translation callback never accesses vtd_hiod and its corresponding
+     * cached pasid entry, so no iommu lock needed here.
+     */
+    walk_info = *pc_info;
+    g_hash_table_iter_init(&as_it, s->vtd_host_iommu_dev);
+    while (g_hash_table_iter_next(&as_it, NULL, (void **)&vtd_hiod)) {
+        walk_info.bus = vtd_hiod->bus;
+        walk_info.devfn = vtd_hiod->devfn;
+        vtd_replay_pasid_bind_for_dev(s, start, end, &walk_info);
+    }
+}
+
 /*
  * For a PASID cache invalidation, this function handles below scenarios:
  * a) a present cached pasid entry needs to be removed
  * b) a present cached pasid entry needs to be updated
+ * c) a present pasid entry needs to be cached
  */
 static void vtd_pasid_cache_sync(IntelIOMMUState *s, VTDPASIDCacheInfo *pc_info)
 {
@@ -3239,6 +3406,9 @@ static void vtd_pasid_cache_sync(IntelIOMMUState *s, VTDPASIDCacheInfo *pc_info)
     g_hash_table_foreach_remove(s->vtd_address_spaces, vtd_flush_pasid_locked,
                                 pc_info);
     vtd_iommu_unlock(s);
+
+    /* c): loop all passthrough devices for new pasid entries */
+    vtd_replay_guest_pasid_bindings(s, pc_info);
 }
 
 static bool vtd_process_pasid_desc(IntelIOMMUState *s,
-- 
2.47.1




* [PATCH v5 13/21] intel_iommu: Introduce a new pasid cache invalidation type FORCE_RESET
  2025-08-22  6:40 [PATCH v5 00/21] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
                   ` (11 preceding siblings ...)
  2025-08-22  6:40 ` [PATCH v5 12/21] intel_iommu: Handle PASID entry addition Zhenzhong Duan
@ 2025-08-22  6:40 ` Zhenzhong Duan
  2025-08-27 16:28   ` Eric Auger
  2025-08-22  6:40 ` [PATCH v5 14/21] intel_iommu: Stick to system MR for IOMMUFD backed host device when x-flts=on Zhenzhong Duan
                   ` (8 subsequent siblings)
  21 siblings, 1 reply; 113+ messages in thread
From: Zhenzhong Duan @ 2025-08-22  6:40 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, joao.m.martins, clement.mathieu--drif, kevin.tian,
	yi.l.liu, chao.p.peng, Zhenzhong Duan, Yi Sun

FORCE_RESET differs from GLOBAL_INV: GLOBAL_INV updates a pasid cache
entry if the underlying pasid entry is still valid, whereas FORCE_RESET
drops all pasid cache entries unconditionally.

FORCE_RESET isn't an invalidation type defined by the VT-d spec for the
pasid cache; it is only used internally for system level reset.

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
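The difference between the two types can be sketched with a toy flush
predicate. This is a simplified model, not the code in
vtd_flush_pasid_locked(): the struct, the function and the numeric value
chosen for the global type are placeholders, and only 0x10 is taken from
this patch.

```c
#include <stdbool.h>

enum toy_inv_type {
    TOY_INV_GLOBAL = 3,         /* placeholder for a spec-defined 2-bit value */
    TOY_INV_FORCE_RESET = 0x10, /* internal value, as in this patch */
};

struct toy_entry {
    bool cache_valid;    /* a cached copy of the pasid entry exists */
    bool guest_present;  /* the guest pasid entry is still present */
    bool pci_no_pasid;   /* address space created for the PCI sub-system */
};

/* Internal types must not collide with the 2-bit granularity field. */
static bool toy_is_internal_type(enum toy_inv_type type)
{
    return ((unsigned)type & ~0x3u) != 0;
}

/*
 * Returns true when the address space should be dropped from the table,
 * matching the remove semantics of a g_hash_table_foreach_remove() callback.
 */
static bool toy_flush(struct toy_entry *e, enum toy_inv_type type)
{
    if (type == TOY_INV_GLOBAL && e->guest_present) {
        /* Entry survives; the cached copy would be refreshed from the guest. */
        return false;
    }

    /* FORCE_RESET, or a guest entry that went away: drop the cached copy. */
    e->cache_valid = false;

    /* Address spaces created for the PCI sub-system are never removed. */
    return !e->pci_no_pasid;
}
```

In this model a global invalidation leaves a still-present entry cached,
while a force reset always clears the cache and only spares the address
spaces that belong to the PCI sub-system.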
 hw/i386/intel_iommu_internal.h |  9 +++++++++
 hw/i386/intel_iommu.c          | 25 +++++++++++++++++++++++++
 hw/i386/trace-events           |  1 +
 3 files changed, 35 insertions(+)

diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index fb2a919e87..c510b09d1a 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -569,6 +569,15 @@ typedef enum VTDPCInvType {
     VTD_PASID_CACHE_DOMSI = VTD_INV_DESC_PASIDC_G_DSI,
     VTD_PASID_CACHE_PASIDSI = VTD_INV_DESC_PASIDC_G_PASID_SI,
     VTD_PASID_CACHE_GLOBAL_INV = VTD_INV_DESC_PASIDC_G_GLOBAL,
+
+    /*
+     * Internally used PASID cache invalidation types start here.
+     * 0x10 is large enough because the invalidation type field in
+     * pc_inv_desc is only 2 bits wide.
+     */
+
+    /* Reset all PASID cache entries, used in system level reset */
+    VTD_PASID_CACHE_FORCE_RESET = 0x10,
 } VTDPCInvType;
 
 typedef struct VTDPASIDCacheInfo {
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 7d2c9feae7..af384ce7f0 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -87,6 +87,8 @@ struct vtd_iotlb_key {
 static void vtd_address_space_refresh_all(IntelIOMMUState *s);
 static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
 
+static void vtd_pasid_cache_reset_locked(IntelIOMMUState *s);
+
 static void vtd_panic_require_caching_mode(void)
 {
     error_report("We need to set caching-mode=on for intel-iommu to enable "
@@ -391,6 +393,7 @@ static void vtd_reset_caches(IntelIOMMUState *s)
     vtd_iommu_lock(s);
     vtd_reset_iotlb_locked(s);
     vtd_reset_context_cache_locked(s);
+    vtd_pasid_cache_reset_locked(s);
     vtd_iommu_unlock(s);
 }
 
@@ -3183,6 +3186,8 @@ static gboolean vtd_flush_pasid_locked(gpointer key, gpointer value,
         /* fall through */
     case VTD_PASID_CACHE_GLOBAL_INV:
         break;
+    case VTD_PASID_CACHE_FORCE_RESET:
+        goto remove;
     default:
         error_setg(&error_fatal, "invalid pc_info->type for flush");
     }
@@ -3225,6 +3230,23 @@ remove:
     return true;
 }
 
+static void vtd_pasid_cache_reset_locked(IntelIOMMUState *s)
+{
+    VTDPASIDCacheInfo pc_info;
+
+    trace_vtd_pasid_cache_reset();
+
+    pc_info.type = VTD_PASID_CACHE_FORCE_RESET;
+
+    /*
+     * Reset pasid cache is a big hammer, so use g_hash_table_foreach_remove
+     * which will free all vtd_as instances except those created for PCI
+     * sub-system.
+     */
+    g_hash_table_foreach_remove(s->vtd_address_spaces,
+                                vtd_flush_pasid_locked, &pc_info);
+}
+
 /*
  * This function walks over PASID range within [start, end) in a single
  * PASID table for entries matching @info type/did, then retrieve/create
@@ -3363,6 +3385,9 @@ static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
     case VTD_PASID_CACHE_GLOBAL_INV:
         /* loop all assigned devices */
         break;
+    case VTD_PASID_CACHE_FORCE_RESET:
+        /* A force reset needs no further replay */
+        return;
     default:
         error_setg(&error_fatal, "invalid pc_info->type for replay");
     }
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index ae5bbfcdc0..c8a936eb46 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -24,6 +24,7 @@ vtd_inv_qi_head(uint16_t head) "read head %d"
 vtd_inv_qi_tail(uint16_t head) "write tail %d"
 vtd_inv_qi_fetch(void) ""
 vtd_context_cache_reset(void) ""
+vtd_pasid_cache_reset(void) ""
 vtd_pasid_cache_gsi(void) ""
 vtd_pasid_cache_dsi(uint16_t domain) "Domain selective PC invalidation domain 0x%"PRIx16
 vtd_pasid_cache_psi(uint16_t domain, uint32_t pasid) "PASID selective PC invalidation domain 0x%"PRIx16" pasid 0x%"PRIx32
-- 
2.47.1




* [PATCH v5 14/21] intel_iommu: Stick to system MR for IOMMUFD backed host device when x-flts=on
  2025-08-22  6:40 [PATCH v5 00/21] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
                   ` (12 preceding siblings ...)
  2025-08-22  6:40 ` [PATCH v5 13/21] intel_iommu: Introduce a new pasid cache invalidation type FORCE_RESET Zhenzhong Duan
@ 2025-08-22  6:40 ` Zhenzhong Duan
  2025-08-27 17:14   ` Eric Auger
  2025-08-29  6:06   ` Yi Liu
  2025-08-22  6:40 ` [PATCH v5 15/21] intel_iommu: Bind/unbind guest page table to host Zhenzhong Duan
                   ` (7 subsequent siblings)
  21 siblings, 2 replies; 113+ messages in thread
From: Zhenzhong Duan @ 2025-08-22  6:40 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, joao.m.martins, clement.mathieu--drif, kevin.tian,
	yi.l.liu, chao.p.peng, Zhenzhong Duan

When the guest is in scalable mode and x-flts=on, we stick to the system MR
for an IOMMUFD backed host device. Its default hwpt then contains the
GPA->HPA mappings, which are used directly if PGTT=PT and as the nested
parent if PGTT=FLT. Otherwise, fall back to the original processing.

Suggested-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
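The resulting choice of memory region can be summarised as a small
decision function. This is a sketch under stated assumptions: the struct
and its field names are hypothetical, and guest_pt stands in for whatever
the pre-existing pass-through check in vtd_as_pt_enabled() returns.

```c
#include <stdbool.h>

struct toy_as_state {
    bool root_scalable;   /* guest programmed scalable mode */
    bool flts;            /* x-flts=on */
    bool iommufd_backed;  /* device has an IOMMUFD backed host IOMMU device */
    bool guest_pt;        /* result of the original pass-through check */
};

/* Returns true for the system MR, false for the IOMMU MR. */
static bool toy_as_pt_enabled(const struct toy_as_state *st)
{
    /* New short-circuit: stay on the system MR regardless of guest PGTT. */
    if (st->root_scalable && st->flts && st->iommufd_backed) {
        return true;
    }

    /* Otherwise fall back to the original processing. */
    return st->guest_pt;
}
```

Only the combination of all three conditions takes the new path; an
emulated device, or a guest in legacy mode, still follows the original
check.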
 hw/i386/intel_iommu.c | 34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index af384ce7f0..15582977b8 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -1773,6 +1773,28 @@ static bool vtd_dev_pt_enabled(IntelIOMMUState *s, VTDContextEntry *ce,
 
 }
 
+static VTDHostIOMMUDevice *vtd_find_hiod_iommufd(IntelIOMMUState *s,
+                                                 VTDAddressSpace *as)
+{
+    struct vtd_as_key key = {
+        .bus = as->bus,
+        .devfn = as->devfn,
+    };
+    VTDHostIOMMUDevice *vtd_hiod = g_hash_table_lookup(s->vtd_host_iommu_dev,
+                                                       &key);
+
+    if (vtd_hiod && vtd_hiod->hiod &&
+        object_dynamic_cast(OBJECT(vtd_hiod->hiod),
+                            TYPE_HOST_IOMMU_DEVICE_IOMMUFD)) {
+        return vtd_hiod;
+    }
+    return NULL;
+}
+
+/*
+ * vtd_switch_address_space() calls vtd_as_pt_enabled() to determine which
+ * MR to switch to: the system MR if it returns true, the IOMMU MR otherwise.
+ */
 static bool vtd_as_pt_enabled(VTDAddressSpace *as)
 {
     IntelIOMMUState *s;
@@ -1781,6 +1803,18 @@ static bool vtd_as_pt_enabled(VTDAddressSpace *as)
     assert(as);
 
     s = as->iommu_state;
+
+    /*
+     * When the guest is in scalable mode and x-flts=on, we stick to the
+     * system MR for an IOMMUFD backed host device. Its default hwpt then
+     * contains the GPA->HPA mappings, which are used directly if PGTT=PT
+     * and as the nested parent if PGTT=FLT. Otherwise, fall back to the
+     * original processing.
+     */
+    if (s->root_scalable && s->flts && vtd_find_hiod_iommufd(s, as)) {
+        return true;
+    }
+
     if (vtd_dev_to_context_entry(s, pci_bus_num(as->bus), as->devfn,
                                  &ce)) {
         /*
-- 
2.47.1




* [PATCH v5 15/21] intel_iommu: Bind/unbind guest page table to host
  2025-08-22  6:40 [PATCH v5 00/21] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
                   ` (13 preceding siblings ...)
  2025-08-22  6:40 ` [PATCH v5 14/21] intel_iommu: Stick to system MR for IOMMUFD backed host device when x-flts=on Zhenzhong Duan
@ 2025-08-22  6:40 ` Zhenzhong Duan
  2025-08-28  8:37   ` Eric Auger
  2025-08-29  7:05   ` Yi Liu
  2025-08-22  6:40 ` [PATCH v5 16/21] intel_iommu: Replay pasid bindings after context cache invalidation Zhenzhong Duan
                   ` (6 subsequent siblings)
  21 siblings, 2 replies; 113+ messages in thread
From: Zhenzhong Duan @ 2025-08-22  6:40 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, joao.m.martins, clement.mathieu--drif, kevin.tian,
	yi.l.liu, chao.p.peng, Zhenzhong Duan, Yi Sun

This captures guest PASID table entry modifications and propagates the
changes to the host, attaching a hwpt whose type is determined by the
guest IOMMU mode and PGTT configuration.

When PGTT is Pass-through (100b), the hwpt on the host side is a stage-2
page table (GPA->HPA). When PGTT is First-stage Translation only (001b),
vIOMMU reuses the hwpt (GPA->HPA) provided by VFIO as the nested parent to
construct a nested page table.

When the guest decides to use legacy mode, vIOMMU switches the MRs of the
device's AS to the IOMMU MR, so the IOAS created by the VFIO container
switches to using IOMMU_NOTIFIER_IOTLB_EVENTS. This makes it possible to
shadow the guest IO page table.

Co-Authored-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
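For readers following the bit layout, the sketch below shows how the
stage-1 parameters are derived from val[2] of a PASID entry and how PGTT
selects the hwpt. The bit positions and the 100b/001b PGTT encodings are
taken from this patch; the toy_ names and the struct are hypothetical, and
no iommufd call is made.

```c
#include <stdbool.h>
#include <stdint.h>

/* Same semantics as QEMU's extract64(): length bits starting at bit start. */
static uint64_t toy_extract64(uint64_t value, int start, int length)
{
    return (value >> start) & (~0ULL >> (64 - length));
}

struct toy_s1_cfg {
    bool sre, wpe, eafe;   /* flag bits 0, 4 and 7 of val[2] */
    uint32_t addr_width;   /* 48 for 4-level paging, 57 for 5-level */
    uint64_t pgtbl_addr;   /* stage-1 page table base, 4KiB aligned */
};

static struct toy_s1_cfg toy_s1_cfg_from_val2(uint64_t val2)
{
    struct toy_s1_cfg cfg;
    uint64_t fspm = toy_extract64(val2, 2, 2);  /* 00: 4-level, 01: 5-level */

    cfg.sre = toy_extract64(val2, 0, 1);
    cfg.wpe = toy_extract64(val2, 4, 1);
    cfg.eafe = toy_extract64(val2, 7, 1);
    cfg.addr_width = 48 + (uint32_t)fspm * 9;
    cfg.pgtbl_addr = val2 & ~0xfffULL;
    return cfg;
}

enum toy_hwpt_kind {
    TOY_HWPT_INVALID,     /* unsupported PGTT: attach is refused */
    TOY_HWPT_NESTED_S1,   /* PGTT=FLT: new stage-1 hwpt on the nested parent */
    TOY_HWPT_DEFAULT_S2,  /* PGTT=PT: reuse the default GPA->HPA hwpt */
};

static enum toy_hwpt_kind toy_pick_hwpt(unsigned pgtt)
{
    switch (pgtt) {
    case 0x1: /* 001b, First-stage Translation only */
        return TOY_HWPT_NESTED_S1;
    case 0x4: /* 100b, Pass-through */
        return TOY_HWPT_DEFAULT_S2;
    default:
        return TOY_HWPT_INVALID;
    }
}
```

For example, a val[2] of 0x12345085 has SRE and EAFE set, WPE clear and
FSPM=01, so the sketch reports a 57-bit address width and a table base of
0x12345000.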
 hw/i386/intel_iommu_internal.h |  14 ++-
 include/hw/i386/intel_iommu.h  |   1 +
 hw/i386/intel_iommu.c          | 221 ++++++++++++++++++++++++++++++++-
 hw/i386/trace-events           |   3 +
 4 files changed, 233 insertions(+), 6 deletions(-)

diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index c510b09d1a..61e35dbdc0 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -564,6 +564,12 @@ typedef struct VTDRootEntry VTDRootEntry;
 #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw)  (0x1e0ULL | ~VTD_HAW_MASK(aw))
 #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1      0xffffffffffe00000ULL
 
+typedef enum VTDPASIDOp {
+    VTD_PASID_BIND,
+    VTD_PASID_UPDATE,
+    VTD_PASID_UNBIND,
+} VTDPASIDOp;
+
 typedef enum VTDPCInvType {
     /* VTD spec defined PASID cache invalidation type */
     VTD_PASID_CACHE_DOMSI = VTD_INV_DESC_PASIDC_G_DSI,
@@ -612,8 +618,12 @@ typedef struct VTDPASIDCacheInfo {
 #define VTD_SM_PASID_ENTRY_AW          7ULL /* Adjusted guest-address-width */
 #define VTD_SM_PASID_ENTRY_DID(x)      extract64((x)->val[1], 0, 16)
 
-#define VTD_SM_PASID_ENTRY_FLPM          3ULL
-#define VTD_SM_PASID_ENTRY_FLPTPTR       (~0xfffULL)
+#define VTD_SM_PASID_ENTRY_FSPTPTR       (~0xfffULL)
+#define VTD_SM_PASID_ENTRY_SRE_BIT(x)    extract64((x)->val[2], 0, 1)
+/* 00: 4-level paging, 01: 5-level paging, 10-11: Reserved */
+#define VTD_SM_PASID_ENTRY_FSPM(x)       extract64((x)->val[2], 2, 2)
+#define VTD_SM_PASID_ENTRY_WPE_BIT(x)    extract64((x)->val[2], 4, 1)
+#define VTD_SM_PASID_ENTRY_EAFE_BIT(x)   extract64((x)->val[2], 7, 1)
 
 /* First Level Paging Structure */
 /* Masks for First Level Paging Entry */
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 0e3826f6f0..2affab36b2 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -104,6 +104,7 @@ struct VTDAddressSpace {
     PCIBus *bus;
     uint8_t devfn;
     uint32_t pasid;
+    uint32_t s1_hwpt;
     AddressSpace as;
     IOMMUMemoryRegion iommu;
     MemoryRegion root;          /* The root container of the device */
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 15582977b8..a10ee8eb4f 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -20,6 +20,7 @@
  */
 
 #include "qemu/osdep.h"
+#include CONFIG_DEVICES /* CONFIG_IOMMUFD */
 #include "qemu/error-report.h"
 #include "qemu/main-loop.h"
 #include "qapi/error.h"
@@ -41,6 +42,9 @@
 #include "migration/vmstate.h"
 #include "trace.h"
 #include "system/iommufd.h"
+#ifdef CONFIG_IOMMUFD
+#include <linux/iommufd.h>
+#endif
 
 /* context entry operations */
 #define VTD_CE_GET_RID2PASID(ce) \
@@ -50,10 +54,9 @@
 
 /* pe operations */
 #define VTD_PE_GET_TYPE(pe) ((pe)->val[0] & VTD_SM_PASID_ENTRY_PGTT)
-#define VTD_PE_GET_FL_LEVEL(pe) \
-    (4 + (((pe)->val[2] >> 2) & VTD_SM_PASID_ENTRY_FLPM))
 #define VTD_PE_GET_SL_LEVEL(pe) \
     (2 + (((pe)->val[0] >> 2) & VTD_SM_PASID_ENTRY_AW))
+#define VTD_PE_GET_FL_LEVEL(pe) (VTD_SM_PASID_ENTRY_FSPM(pe) + 4)
 
 /*
  * PCI bus number (or SID) is not reliable since the device is usaully
@@ -834,6 +837,31 @@ static inline uint32_t vtd_sm_ce_get_pdt_entry_num(VTDContextEntry *ce)
     return 1U << (VTD_SM_CONTEXT_ENTRY_PDTS(ce) + 7);
 }
 
+static inline dma_addr_t vtd_pe_get_flpt_base(VTDPASIDEntry *pe)
+{
+    return pe->val[2] & VTD_SM_PASID_ENTRY_FSPTPTR;
+}
+
+/*
+ * Stage-1 IOVA address width: 48 bits for 4-level paging(FSPM=00)
+ *                             57 bits for 5-level paging(FSPM=01)
+ */
+static inline uint32_t vtd_pe_get_fl_aw(VTDPASIDEntry *pe)
+{
+    return 48 + VTD_SM_PASID_ENTRY_FSPM(pe) * 9;
+}
+
+static inline bool vtd_pe_pgtt_is_pt(VTDPASIDEntry *pe)
+{
+    return (VTD_PE_GET_TYPE(pe) == VTD_SM_PASID_ENTRY_PT);
+}
+
+/* check if pgtt is first stage translation */
+static inline bool vtd_pe_pgtt_is_flt(VTDPASIDEntry *pe)
+{
+    return (VTD_PE_GET_TYPE(pe) == VTD_SM_PASID_ENTRY_FLT);
+}
+
 static inline bool vtd_pdire_present(VTDPASIDDirEntry *pdire)
 {
     return pdire->val & 1;
@@ -1131,7 +1159,7 @@ static dma_addr_t vtd_get_iova_pgtbl_base(IntelIOMMUState *s,
     if (s->root_scalable) {
         vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
         if (s->flts) {
-            return pe.val[2] & VTD_SM_PASID_ENTRY_FLPTPTR;
+            return pe.val[2] & VTD_SM_PASID_ENTRY_FSPTPTR;
         } else {
             return pe.val[0] & VTD_SM_PASID_ENTRY_SLPTPTR;
         }
@@ -1766,7 +1794,7 @@ static bool vtd_dev_pt_enabled(IntelIOMMUState *s, VTDContextEntry *ce,
              */
             return false;
         }
-        return (VTD_PE_GET_TYPE(&pe) == VTD_SM_PASID_ENTRY_PT);
+        return vtd_pe_pgtt_is_pt(&pe);
     }
 
     return (vtd_ce_get_type(ce) == VTD_CONTEXT_TT_PASS_THROUGH);
@@ -2433,6 +2461,178 @@ static void vtd_context_global_invalidate(IntelIOMMUState *s)
     vtd_iommu_replay_all(s);
 }
 
+#ifdef CONFIG_IOMMUFD
+static void vtd_init_s1_hwpt_data(struct iommu_hwpt_vtd_s1 *vtd,
+                                  VTDPASIDEntry *pe)
+{
+    memset(vtd, 0, sizeof(*vtd));
+
+    vtd->flags =  (VTD_SM_PASID_ENTRY_SRE_BIT(pe) ? IOMMU_VTD_S1_SRE : 0) |
+                  (VTD_SM_PASID_ENTRY_WPE_BIT(pe) ? IOMMU_VTD_S1_WPE : 0) |
+                  (VTD_SM_PASID_ENTRY_EAFE_BIT(pe) ? IOMMU_VTD_S1_EAFE : 0);
+    vtd->addr_width = vtd_pe_get_fl_aw(pe);
+    vtd->pgtbl_addr = (uint64_t)vtd_pe_get_flpt_base(pe);
+}
+
+static int vtd_create_s1_hwpt(HostIOMMUDeviceIOMMUFD *idev,
+                              VTDPASIDEntry *pe, uint32_t *s1_hwpt,
+                              Error **errp)
+{
+    struct iommu_hwpt_vtd_s1 vtd;
+
+    vtd_init_s1_hwpt_data(&vtd, pe);
+
+    return !iommufd_backend_alloc_hwpt(idev->iommufd, idev->devid,
+                                       idev->hwpt_id, 0, IOMMU_HWPT_DATA_VTD_S1,
+                                       sizeof(vtd), &vtd, s1_hwpt, errp);
+}
+
+static void vtd_destroy_s1_hwpt(HostIOMMUDeviceIOMMUFD *idev,
+                                VTDAddressSpace *vtd_as)
+{
+    if (!vtd_as->s1_hwpt) {
+        return;
+    }
+    iommufd_backend_free_id(idev->iommufd, vtd_as->s1_hwpt);
+    vtd_as->s1_hwpt = 0;
+}
+
+static int vtd_device_attach_iommufd(VTDHostIOMMUDevice *vtd_hiod,
+                                     VTDAddressSpace *vtd_as, Error **errp)
+{
+    HostIOMMUDeviceIOMMUFD *idev = HOST_IOMMU_DEVICE_IOMMUFD(vtd_hiod->hiod);
+    VTDPASIDEntry *pe = &vtd_as->pasid_cache_entry.pasid_entry;
+    uint32_t hwpt_id;
+    int ret;
+
+    /*
+     * We can get here only if flts=on; the supported PGTTs are FLT and PT.
+     * Catch an invalid PGTT when processing an invalidation request to
+     * avoid attaching to the wrong hwpt.
+     */
+    if (!vtd_pe_pgtt_is_flt(pe) && !vtd_pe_pgtt_is_pt(pe)) {
+        error_setg(errp, "Invalid PGTT type");
+        return -EINVAL;
+    }
+
+    if (vtd_pe_pgtt_is_flt(pe)) {
+        /* Should fail if the FLPT base is 0 */
+        if (!vtd_pe_get_flpt_base(pe)) {
+            error_setg(errp, "FLPT base is 0");
+            return -EINVAL;
+        }
+
+        if (vtd_create_s1_hwpt(idev, pe, &hwpt_id, errp)) {
+            return -EINVAL;
+        }
+    } else {
+        hwpt_id = idev->hwpt_id;
+    }
+
+    ret = !host_iommu_device_iommufd_attach_hwpt(idev, hwpt_id, errp);
+    trace_vtd_device_attach_hwpt(idev->devid, vtd_as->pasid, hwpt_id, ret);
+    if (!ret) {
+        vtd_destroy_s1_hwpt(idev, vtd_as);
+        if (vtd_pe_pgtt_is_flt(pe)) {
+            vtd_as->s1_hwpt = hwpt_id;
+        }
+    } else if (vtd_pe_pgtt_is_flt(pe)) {
+        iommufd_backend_free_id(idev->iommufd, hwpt_id);
+    }
+
+    return ret;
+}
+
+static int vtd_device_detach_iommufd(VTDHostIOMMUDevice *vtd_hiod,
+                                     VTDAddressSpace *vtd_as, Error **errp)
+{
+    HostIOMMUDeviceIOMMUFD *idev = HOST_IOMMU_DEVICE_IOMMUFD(vtd_hiod->hiod);
+    uint32_t pasid = vtd_as->pasid;
+    int ret;
+
+    if (vtd_hiod->iommu_state->dmar_enabled) {
+        ret = !host_iommu_device_iommufd_detach_hwpt(idev, errp);
+        trace_vtd_device_detach_hwpt(idev->devid, pasid, ret);
+    } else {
+        ret = !host_iommu_device_iommufd_attach_hwpt(idev, idev->hwpt_id, errp);
+        trace_vtd_device_reattach_def_hwpt(idev->devid, pasid, idev->hwpt_id,
+                                           ret);
+    }
+
+    if (!ret) {
+        vtd_destroy_s1_hwpt(idev, vtd_as);
+    }
+
+    return ret;
+}
+
+static int vtd_bind_guest_pasid(VTDAddressSpace *vtd_as, VTDPASIDOp op,
+                                Error **errp)
+{
+    IntelIOMMUState *s = vtd_as->iommu_state;
+    VTDHostIOMMUDevice *vtd_hiod = vtd_find_hiod_iommufd(s, vtd_as);
+    int ret;
+
+    if (!vtd_hiod) {
+        /* No need to go further, e.g. for emulated device */
+        return 0;
+    }
+
+    if (vtd_as->pasid != PCI_NO_PASID) {
+        error_setg(errp, "Non-rid_pasid %d not supported yet", vtd_as->pasid);
+        return -EINVAL;
+    }
+
+    switch (op) {
+    case VTD_PASID_UPDATE:
+    case VTD_PASID_BIND:
+    {
+        ret = vtd_device_attach_iommufd(vtd_hiod, vtd_as, errp);
+        break;
+    }
+    case VTD_PASID_UNBIND:
+    {
+        ret = vtd_device_detach_iommufd(vtd_hiod, vtd_as, errp);
+        break;
+    }
+    default:
+        error_setg(errp, "Unknown VTDPASIDOp!!!");
+        return -EINVAL;
+    }
+
+    return ret;
+}
+#else
+static int vtd_bind_guest_pasid(VTDAddressSpace *vtd_as, VTDPASIDOp op,
+                                Error **errp)
+{
+    return 0;
+}
+#endif
+
+static int vtd_bind_guest_pasid_report_err(VTDAddressSpace *vtd_as,
+                                           VTDPASIDOp op)
+{
+    Error *local_err = NULL;
+    int ret;
+
+    /*
+     * vIOMMU calls into the kernel to do BIND/UNBIND; a failure can come
+     * from a kernel bug, a QEMU bug or an invalid guest config. None of
+     * these is reported to the guest in the PASID cache invalidation
+     * processing path, but we can at least report it on the QEMU console.
+     *
+     * TODO: for invalid guest config, the DMA translation fault will be
+     * caught by the host and passed to QEMU to inject into the guest.
+     */
+    ret = vtd_bind_guest_pasid(vtd_as, op, &local_err);
+    if (ret) {
+        error_report_err(local_err);
+    }
+
+    return ret;
+}
+
 /* Do a context-cache device-selective invalidation.
  * @func_mask: FM field after shifting
  */
@@ -3248,10 +3448,20 @@ static gboolean vtd_flush_pasid_locked(gpointer key, gpointer value,
      */
     if (!vtd_pasid_entry_compare(&pe, &pc_entry->pasid_entry)) {
         pc_entry->pasid_entry = pe;
+        if (vtd_bind_guest_pasid_report_err(vtd_as, VTD_PASID_UPDATE)) {
+            /*
+             * In case update binding fails, tear down existing binding to
+             * catch invalid pasid entry config during DMA translation.
+             */
+            goto remove;
+        }
     }
     return false;
 
 remove:
+    if (vtd_bind_guest_pasid_report_err(vtd_as, VTD_PASID_UNBIND)) {
+        return false;
+    }
     pc_entry->valid = false;
 
     /*
@@ -3336,6 +3546,9 @@ static void vtd_sm_pasid_table_walk_one(IntelIOMMUState *s,
             if (!pc_entry->valid) {
                 pc_entry->pasid_entry = pe;
                 pc_entry->valid = true;
+                if (vtd_bind_guest_pasid_report_err(vtd_as, VTD_PASID_BIND)) {
+                    pc_entry->valid = false;
+                }
             }
         }
         pasid++;
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index c8a936eb46..1c31b9a873 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -73,6 +73,9 @@ vtd_warn_invalid_qi_tail(uint16_t tail) "tail 0x%"PRIx16
 vtd_warn_ir_vector(uint16_t sid, int index, int vec, int target) "sid 0x%"PRIx16" index %d vec %d (should be: %d)"
 vtd_warn_ir_trigger(uint16_t sid, int index, int trig, int target) "sid 0x%"PRIx16" index %d trigger %d (should be: %d)"
 vtd_reset_exit(void) ""
+vtd_device_attach_hwpt(uint32_t dev_id, uint32_t pasid, uint32_t hwpt_id, int ret) "dev_id %d pasid %d hwpt_id %d, ret: %d"
+vtd_device_detach_hwpt(uint32_t dev_id, uint32_t pasid, int ret) "dev_id %d pasid %d ret: %d"
+vtd_device_reattach_def_hwpt(uint32_t dev_id, uint32_t pasid, uint32_t hwpt_id, int ret) "dev_id %d pasid %d hwpt_id %d, ret: %d"
 
 # amd_iommu.c
 amdvi_evntlog_fail(uint64_t addr, uint32_t head) "error: fail to write at addr 0x%"PRIx64" +  offset 0x%"PRIx32
-- 
2.47.1




* [PATCH v5 16/21] intel_iommu: Replay pasid bindings after context cache invalidation
  2025-08-22  6:40 [PATCH v5 00/21] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
                   ` (14 preceding siblings ...)
  2025-08-22  6:40 ` [PATCH v5 15/21] intel_iommu: Bind/unbind guest page table to host Zhenzhong Duan
@ 2025-08-22  6:40 ` Zhenzhong Duan
  2025-08-28  9:43   ` Eric Auger
  2025-08-22  6:40 ` [PATCH v5 17/21] intel_iommu: Propagate PASID-based iotlb invalidation to host Zhenzhong Duan
                   ` (5 subsequent siblings)
  21 siblings, 1 reply; 113+ messages in thread
From: Zhenzhong Duan @ 2025-08-22  6:40 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, joao.m.martins, clement.mathieu--drif, kevin.tian,
	yi.l.liu, chao.p.peng, Yi Sun, Zhenzhong Duan

From: Yi Liu <yi.l.liu@intel.com>

This replays guest pasid bindings after a context cache invalidation.
It is a safety measure: strictly speaking, the programmer should issue a
pasid cache invalidation with proper granularity after issuing a context
cache invalidation.
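
As a rough illustration of the device-selective filtering this patch adds,
the sketch below decides whether a cached entry is covered by an
invalidation. All names (sketch_inv_type, sketch_pc_info, sketch_as,
sketch_inv_covers) are hypothetical stand-ins, not the QEMU definitions, and
only the global and device-selective cases are modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for VTDPCInvType / VTDPASIDCacheInfo. */
enum sketch_inv_type {
    SKETCH_INV_GLOBAL,  /* every cached PASID entry is affected */
    SKETCH_INV_DEVSI,   /* only entries of one device are affected */
};

struct sketch_pc_info {
    enum sketch_inv_type type;
    const void *bus;    /* identity of the PCI bus, compared by pointer */
    uint16_t devfn;
};

/* Hypothetical stand-in for the bus/devfn part of VTDAddressSpace. */
struct sketch_as {
    const void *bus;
    uint16_t devfn;
};

/* Return true if the invalidation described by @info covers @as. */
static bool sketch_inv_covers(const struct sketch_pc_info *info,
                              const struct sketch_as *as)
{
    assert(info != NULL && as != NULL);

    switch (info->type) {
    case SKETCH_INV_GLOBAL:
        return true;
    case SKETCH_INV_DEVSI:
        return info->bus == as->bus && info->devfn == as->devfn;
    }
    return false;
}
```

Comparing both bus and devfn mirrors the check in the flush callback of this
patch; entries of other devices are left untouched.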

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/i386/intel_iommu_internal.h |  2 ++
 hw/i386/intel_iommu.c          | 42 ++++++++++++++++++++++++++++++++++
 hw/i386/trace-events           |  1 +
 3 files changed, 45 insertions(+)

diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 61e35dbdc0..8af1004888 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -584,6 +584,8 @@ typedef enum VTDPCInvType {
 
     /* Reset all PASID cache entries, used in system level reset */
     VTD_PASID_CACHE_FORCE_RESET = 0x10,
+    /* Invalidate all PASID entries in a device */
+    VTD_PASID_CACHE_DEVSI,
 } VTDPCInvType;
 
 typedef struct VTDPASIDCacheInfo {
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index a10ee8eb4f..6c0e502d1c 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -91,6 +91,10 @@ static void vtd_address_space_refresh_all(IntelIOMMUState *s);
 static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
 
 static void vtd_pasid_cache_reset_locked(IntelIOMMUState *s);
+static void vtd_pasid_cache_sync(IntelIOMMUState *s,
+                                 VTDPASIDCacheInfo *pc_info);
+static void vtd_pasid_cache_devsi(IntelIOMMUState *s,
+                                  PCIBus *bus, uint16_t devfn);
 
 static void vtd_panic_require_caching_mode(void)
 {
@@ -2442,6 +2446,8 @@ static void vtd_iommu_replay_all(IntelIOMMUState *s)
 
 static void vtd_context_global_invalidate(IntelIOMMUState *s)
 {
+    VTDPASIDCacheInfo pc_info;
+
     trace_vtd_inv_desc_cc_global();
     /* Protects context cache */
     vtd_iommu_lock(s);
@@ -2459,6 +2465,9 @@ static void vtd_context_global_invalidate(IntelIOMMUState *s)
      * VT-d emulation codes.
      */
     vtd_iommu_replay_all(s);
+
+    pc_info.type = VTD_PASID_CACHE_GLOBAL_INV;
+    vtd_pasid_cache_sync(s, &pc_info);
 }
 
 #ifdef CONFIG_IOMMUFD
@@ -2691,6 +2700,15 @@ static void vtd_context_device_invalidate(IntelIOMMUState *s,
              * happened.
              */
             vtd_address_space_sync(vtd_as);
+            /*
+             * Per spec, a context flush should also be followed by PASID
+             * cache and iotlb flushes. In order to work with a guest which
+             * doesn't follow the spec and misses the PASID cache flush, we
+             * have vtd_pasid_cache_devsi() to invalidate PASID caches of the
+             * passthrough device. Host iommu driver would flush piotlb
+             * when a pasid unbind is passed down to it.
+             */
+            vtd_pasid_cache_devsi(s, vtd_as->bus, devfn);
         }
     }
 }
@@ -3422,6 +3440,11 @@ static gboolean vtd_flush_pasid_locked(gpointer key, gpointer value,
         break;
     case VTD_PASID_CACHE_FORCE_RESET:
         goto remove;
+    case VTD_PASID_CACHE_DEVSI:
+        if (pc_info->bus != vtd_as->bus || pc_info->devfn != vtd_as->devfn) {
+            return false;
+        }
+        break;
     default:
         error_setg(&error_fatal, "invalid pc_info->type for flush");
     }
@@ -3635,6 +3658,11 @@ static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
     case VTD_PASID_CACHE_FORCE_RESET:
         /* For force reset, no need to go further replay */
         return;
+    case VTD_PASID_CACHE_DEVSI:
+        walk_info.bus = pc_info->bus;
+        walk_info.devfn = pc_info->devfn;
+        vtd_replay_pasid_bind_for_dev(s, start, end, &walk_info);
+        return;
     default:
         error_setg(&error_fatal, "invalid pc_info->type for replay");
     }
@@ -3683,6 +3711,20 @@ static void vtd_pasid_cache_sync(IntelIOMMUState *s, VTDPASIDCacheInfo *pc_info)
     vtd_replay_guest_pasid_bindings(s, pc_info);
 }
 
+static void vtd_pasid_cache_devsi(IntelIOMMUState *s,
+                                  PCIBus *bus, uint16_t devfn)
+{
+    VTDPASIDCacheInfo pc_info;
+
+    trace_vtd_pasid_cache_devsi(devfn);
+
+    pc_info.type = VTD_PASID_CACHE_DEVSI;
+    pc_info.bus = bus;
+    pc_info.devfn = devfn;
+
+    vtd_pasid_cache_sync(s, &pc_info);
+}
+
 static bool vtd_process_pasid_desc(IntelIOMMUState *s,
                                    VTDInvDesc *inv_desc)
 {
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index 1c31b9a873..830b11f68b 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -28,6 +28,7 @@ vtd_pasid_cache_reset(void) ""
 vtd_pasid_cache_gsi(void) ""
 vtd_pasid_cache_dsi(uint16_t domain) "Domain selective PC invalidation domain 0x%"PRIx16
 vtd_pasid_cache_psi(uint16_t domain, uint32_t pasid) "PASID selective PC invalidation domain 0x%"PRIx16" pasid 0x%"PRIx32
+vtd_pasid_cache_devsi(uint16_t devfn) "Dev selective PC invalidation dev: 0x%"PRIx16
 vtd_re_not_present(uint8_t bus) "Root entry bus %"PRIu8" not present"
 vtd_ce_not_present(uint8_t bus, uint8_t devfn) "Context entry bus %"PRIu8" devfn %"PRIu8" not present"
 vtd_iotlb_page_hit(uint16_t sid, uint64_t addr, uint64_t slpte, uint16_t domain) "IOTLB page hit sid 0x%"PRIx16" iova 0x%"PRIx64" slpte 0x%"PRIx64" domain 0x%"PRIx16
-- 
2.47.1




* [PATCH v5 17/21] intel_iommu: Propagate PASID-based iotlb invalidation to host
  2025-08-22  6:40 [PATCH v5 00/21] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
                   ` (15 preceding siblings ...)
  2025-08-22  6:40 ` [PATCH v5 16/21] intel_iommu: Replay pasid bindings after context cache invalidation Zhenzhong Duan
@ 2025-08-22  6:40 ` Zhenzhong Duan
  2025-08-28 10:00   ` Eric Auger
  2025-08-22  6:40 ` [PATCH v5 18/21] intel_iommu: Replay all pasid bindings when either SRTP or TE bit is changed Zhenzhong Duan
                   ` (4 subsequent siblings)
  21 siblings, 1 reply; 113+ messages in thread
From: Zhenzhong Duan @ 2025-08-22  6:40 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, joao.m.martins, clement.mathieu--drif, kevin.tian,
	yi.l.liu, chao.p.peng, Yi Sun, Zhenzhong Duan

From: Yi Liu <yi.l.liu@intel.com>

This traps the guest PASID-based iotlb invalidation request and propagates
it to the host.

Intel VT-d 3.0 supports nested translation at PASID granularity. Guest SVA
support could be implemented by configuring nested translation on a specific
pasid. This is also known as dual stage DMA translation.

Under such a configuration, the guest owns the GVA->GPA translation, which is
configured as the stage-1 page table on the host side for a specific pasid,
and the host owns the GPA->HPA translation. As the guest owns the stage-1
translation table, piotlb invalidation should be propagated to the host,
since the host IOMMU caches first level page table related mappings during
DMA address translation.
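
A minimal sketch of how the two guest invalidation granularities map onto a
host request, following the arguments the patch passes (a page count of
2^AM for page-selective, address 0 with an all-ones page count for
PASID-selective). The struct and flag below are hypothetical mirrors for
illustration only, not the iommufd uAPI definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of the fields the patch fills in; not the uAPI struct. */
struct sketch_s1_inv {
    uint64_t addr;
    uint64_t npages;
    uint32_t flags;
};

#define SKETCH_INV_FLAG_LEAF (1u << 0)  /* assumed value, for illustration */

/* Page-selective request: AM encodes 2^am pages starting at @addr. */
static struct sketch_s1_inv sketch_psi_request(uint64_t addr, uint8_t am,
                                               bool ih)
{
    struct sketch_s1_inv inv = { 0 };

    assert(am < 64);            /* keep the 64-bit shift well defined */
    inv.addr = addr;
    inv.npages = UINT64_C(1) << am;
    inv.flags = ih ? SKETCH_INV_FLAG_LEAF : 0;
    return inv;
}

/* PASID-wide request: address 0 and an all-ones page count cover everything. */
static struct sketch_s1_inv sketch_pasid_request(void)
{
    struct sketch_s1_inv inv = { 0 };

    inv.npages = UINT64_MAX;
    return inv;
}
```

The sketch shifts a 64-bit one so that large AM values do not overflow an
int; whether the address should also be aligned to the AM range is left out
here.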

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/i386/intel_iommu_internal.h |  6 +++
 hw/i386/intel_iommu.c          | 95 +++++++++++++++++++++++++++++++++-
 2 files changed, 99 insertions(+), 2 deletions(-)

diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 8af1004888..c1a9263651 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -596,6 +596,12 @@ typedef struct VTDPASIDCacheInfo {
     uint16_t devfn;
 } VTDPASIDCacheInfo;
 
+typedef struct VTDPIOTLBInvInfo {
+    uint16_t domain_id;
+    uint32_t pasid;
+    struct iommu_hwpt_vtd_s1_invalidate *inv_data;
+} VTDPIOTLBInvInfo;
+
 /* PASID Table Related Definitions */
 #define VTD_PASID_DIR_BASE_ADDR_MASK  (~0xfffULL)
 #define VTD_PASID_TABLE_BASE_ADDR_MASK (~0xfffULL)
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 6c0e502d1c..7efa22f4ec 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -2611,12 +2611,99 @@ static int vtd_bind_guest_pasid(VTDAddressSpace *vtd_as, VTDPASIDOp op,
 
     return ret;
 }
+
+static void
+vtd_invalidate_piotlb_locked(VTDAddressSpace *vtd_as,
+                             struct iommu_hwpt_vtd_s1_invalidate *cache)
+{
+    IntelIOMMUState *s = vtd_as->iommu_state;
+    VTDHostIOMMUDevice *vtd_hiod = vtd_find_hiod_iommufd(s, vtd_as);
+    HostIOMMUDeviceIOMMUFD *idev;
+    uint32_t entry_num = 1; /* Only implement one request for simplicity */
+    Error *local_err = NULL;
+
+    if (!vtd_hiod || !vtd_as->s1_hwpt) {
+        return;
+    }
+    idev = HOST_IOMMU_DEVICE_IOMMUFD(vtd_hiod->hiod);
+
+    if (!iommufd_backend_invalidate_cache(idev->iommufd, vtd_as->s1_hwpt,
+                                          IOMMU_HWPT_INVALIDATE_DATA_VTD_S1,
+                                          sizeof(*cache), &entry_num, cache,
+                                          &local_err)) {
+        /* Something wrong in kernel, but trying to continue */
+        error_report_err(local_err);
+    }
+}
+
+/*
+ * This is a GHFunc callback invoked for each entry in the
+ * s->vtd_address_spaces hash table, with VTDPIOTLBInvInfo as the
+ * filter. It propagates the piotlb invalidation to the host.
+ */
+static void vtd_flush_host_piotlb_locked(gpointer key, gpointer value,
+                                         gpointer user_data)
+{
+    VTDPIOTLBInvInfo *piotlb_info = user_data;
+    VTDAddressSpace *vtd_as = value;
+    VTDPASIDCacheEntry *pc_entry = &vtd_as->pasid_cache_entry;
+    uint32_t pasid;
+    uint16_t did;
+
+    /* Replay only fills pasid entry cache for passthrough device */
+    if (!pc_entry->valid ||
+        !vtd_pe_pgtt_is_flt(&pc_entry->pasid_entry)) {
+        return;
+    }
+
+    if (vtd_as_to_iommu_pasid_locked(vtd_as, &pasid)) {
+        return;
+    }
+
+    did = VTD_SM_PASID_ENTRY_DID(&pc_entry->pasid_entry);
+
+    if (piotlb_info->domain_id == did && piotlb_info->pasid == pasid) {
+        vtd_invalidate_piotlb_locked(vtd_as, piotlb_info->inv_data);
+    }
+}
+
+static void
+vtd_flush_host_piotlb_all_locked(IntelIOMMUState *s,
+                                 uint16_t domain_id, uint32_t pasid,
+                                 hwaddr addr, uint64_t npages, bool ih)
+{
+    struct iommu_hwpt_vtd_s1_invalidate cache_info = { 0 };
+    VTDPIOTLBInvInfo piotlb_info;
+
+    cache_info.addr = addr;
+    cache_info.npages = npages;
+    cache_info.flags = ih ? IOMMU_VTD_INV_FLAGS_LEAF : 0;
+
+    piotlb_info.domain_id = domain_id;
+    piotlb_info.pasid = pasid;
+    piotlb_info.inv_data = &cache_info;
+
+    /*
+     * Go through each vtd_as instance in s->vtd_address_spaces and find
+     * the affected host devices which need host piotlb invalidation.
+     * Architecturally, piotlb invalidation should check the pasid cache.
+     */
+    g_hash_table_foreach(s->vtd_address_spaces,
+                         vtd_flush_host_piotlb_locked, &piotlb_info);
+}
 #else
 static int vtd_bind_guest_pasid(VTDAddressSpace *vtd_as, VTDPASIDOp op,
                                 Error **errp)
 {
     return 0;
 }
+
+static void
+vtd_flush_host_piotlb_all_locked(IntelIOMMUState *s,
+                                 uint16_t domain_id, uint32_t pasid,
+                                 hwaddr addr, uint64_t npages, bool ih)
+{
+}
 #endif
 
 static int vtd_bind_guest_pasid_report_err(VTDAddressSpace *vtd_as,
@@ -3295,6 +3382,7 @@ static void vtd_piotlb_pasid_invalidate(IntelIOMMUState *s,
     vtd_iommu_lock(s);
     g_hash_table_foreach_remove(s->iotlb, vtd_hash_remove_by_pasid,
                                 &info);
+    vtd_flush_host_piotlb_all_locked(s, domain_id, pasid, 0, (uint64_t)-1, 0);
     vtd_iommu_unlock(s);
 
     QLIST_FOREACH(vtd_as, &s->vtd_as_with_notifiers, next) {
@@ -3316,7 +3404,8 @@ static void vtd_piotlb_pasid_invalidate(IntelIOMMUState *s,
 }
 
 static void vtd_piotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
-                                       uint32_t pasid, hwaddr addr, uint8_t am)
+                                       uint32_t pasid, hwaddr addr, uint8_t am,
+                                       bool ih)
 {
     VTDIOTLBPageInvInfo info;
 
@@ -3328,6 +3417,7 @@ static void vtd_piotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
     vtd_iommu_lock(s);
     g_hash_table_foreach_remove(s->iotlb,
                                 vtd_hash_remove_by_page_piotlb, &info);
+    vtd_flush_host_piotlb_all_locked(s, domain_id, pasid, addr, 1ULL << am, ih);
     vtd_iommu_unlock(s);
 
     vtd_iotlb_page_invalidate_notify(s, domain_id, addr, am, pasid);
@@ -3359,7 +3449,8 @@ static bool vtd_process_piotlb_desc(IntelIOMMUState *s,
     case VTD_INV_DESC_PIOTLB_PSI_IN_PASID:
         am = VTD_INV_DESC_PIOTLB_AM(inv_desc->val[1]);
         addr = (hwaddr) VTD_INV_DESC_PIOTLB_ADDR(inv_desc->val[1]);
-        vtd_piotlb_page_invalidate(s, domain_id, pasid, addr, am);
+        vtd_piotlb_page_invalidate(s, domain_id, pasid, addr, am,
+                                   VTD_INV_DESC_PIOTLB_IH(inv_desc->val[1]));
         break;
 
     default:
-- 
2.47.1




* [PATCH v5 18/21] intel_iommu: Replay all pasid bindings when either SRTP or TE bit is changed
  2025-08-22  6:40 [PATCH v5 00/21] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
                   ` (16 preceding siblings ...)
  2025-08-22  6:40 ` [PATCH v5 17/21] intel_iommu: Propagate PASID-based iotlb invalidation to host Zhenzhong Duan
@ 2025-08-22  6:40 ` Zhenzhong Duan
  2025-08-28 10:02   ` Eric Auger
  2025-08-22  6:40 ` [PATCH v5 19/21] vfio: Add a new element bypass_ro in VFIOContainerBase Zhenzhong Duan
                   ` (3 subsequent siblings)
  21 siblings, 1 reply; 113+ messages in thread
From: Zhenzhong Duan @ 2025-08-22  6:40 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, joao.m.martins, clement.mathieu--drif, kevin.tian,
	yi.l.liu, chao.p.peng, Zhenzhong Duan

From: Yi Liu <yi.l.liu@intel.com>

When either 'Set Root Table Pointer' or 'Translation Enable' bit is changed,
all pasid bindings on host side become stale and need to be updated.

Introduce a helper function vtd_replay_pasid_bindings_all() to go through all
pasid entries in all passthrough devices to update host side bindings.
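
The helper in this patch only replays when stage-1 translation, scalable
mode and DMA remapping are all enabled. A small sketch of that guard, using
a hypothetical struct (sketch_state) in place of IntelIOMMUState:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical subset of IntelIOMMUState consulted by the guard. */
struct sketch_state {
    bool flts;            /* x-flts=on: stage-1 translation exposed */
    bool root_scalable;   /* root table is in scalable mode */
    bool dmar_enabled;    /* Translation Enable is set */
};

/* Return true if host-side PASID bindings need to be replayed. */
static bool sketch_should_replay(const struct sketch_state *s)
{
    assert(s != NULL);
    return s->flts && s->root_scalable && s->dmar_enabled;
}
```

If any of the three conditions is false there is no nested binding on the
host side to refresh, so the replay is skipped.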

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/i386/intel_iommu.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 7efa22f4ec..f9cb13e945 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -89,6 +89,7 @@ struct vtd_iotlb_key {
 
 static void vtd_address_space_refresh_all(IntelIOMMUState *s);
 static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
+static void vtd_replay_pasid_bindings_all(IntelIOMMUState *s);
 
 static void vtd_pasid_cache_reset_locked(IntelIOMMUState *s);
 static void vtd_pasid_cache_sync(IntelIOMMUState *s,
@@ -3050,6 +3051,7 @@ static void vtd_handle_gcmd_srtp(IntelIOMMUState *s)
     vtd_set_clear_mask_long(s, DMAR_GSTS_REG, 0, VTD_GSTS_RTPS);
     vtd_reset_caches(s);
     vtd_address_space_refresh_all(s);
+    vtd_replay_pasid_bindings_all(s);
 }
 
 /* Set Interrupt Remap Table Pointer */
@@ -3084,6 +3086,7 @@ static void vtd_handle_gcmd_te(IntelIOMMUState *s, bool en)
 
     vtd_reset_caches(s);
     vtd_address_space_refresh_all(s);
+    vtd_replay_pasid_bindings_all(s);
 }
 
 /* Handle Interrupt Remap Enable/Disable */
@@ -3777,6 +3780,17 @@ static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
     }
 }
 
+static void vtd_replay_pasid_bindings_all(IntelIOMMUState *s)
+{
+    VTDPASIDCacheInfo pc_info = { .type = VTD_PASID_CACHE_GLOBAL_INV };
+
+    if (!s->flts || !s->root_scalable || !s->dmar_enabled) {
+        return;
+    }
+
+    vtd_replay_guest_pasid_bindings(s, &pc_info);
+}
+
 /*
  * For a PASID cache invalidation, this function handles below scenarios:
  * a) a present cached pasid entry needs to be removed
-- 
2.47.1




* [PATCH v5 19/21] vfio: Add a new element bypass_ro in VFIOContainerBase
  2025-08-22  6:40 [PATCH v5 00/21] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
                   ` (17 preceding siblings ...)
  2025-08-22  6:40 ` [PATCH v5 18/21] intel_iommu: Replay all pasid bindings when either SRTP or TE bit is changed Zhenzhong Duan
@ 2025-08-22  6:40 ` Zhenzhong Duan
  2025-08-28 12:47   ` Eric Auger
  2025-08-22  6:40 ` [PATCH v5 20/21] Workaround for ERRATA_772415_SPR17 Zhenzhong Duan
                   ` (2 subsequent siblings)
  21 siblings, 1 reply; 113+ messages in thread
From: Zhenzhong Duan @ 2025-08-22  6:40 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, joao.m.martins, clement.mathieu--drif, kevin.tian,
	yi.l.liu, chao.p.peng, Zhenzhong Duan

When bypass_ro is true, read-only memory sections are skipped when
mapping into the container.

This is a preparatory patch to work around Intel ERRATA_772415.
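
A simplified sketch of the resulting skip predicate. The struct
sketch_section and its boolean fields are hypothetical stand-ins for
MemoryRegionSection and the memory_region_is_*() checks, and the real
function tests more conditions than are modelled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, flattened view of the properties the listener looks at. */
struct sketch_section {
    bool is_ram;
    bool is_iommu;
    bool is_protected;
    bool readonly;
};

/* Return true if the section must not be mapped into the container. */
static bool sketch_section_skipped(const struct sketch_section *sec,
                                   bool bypass_ro)
{
    assert(sec != NULL);

    /* New in this patch: read-only sections are skipped on request. */
    if (bypass_ro && sec->readonly) {
        return true;
    }

    return (!sec->is_ram && !sec->is_iommu) || sec->is_protected;
}
```

Passing bypass_ro as a parameter lets the dirty tracking paths keep the old
behaviour by passing false, as the diff below does.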

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 include/hw/vfio/vfio-container-base.h |  1 +
 hw/vfio/listener.c                    | 21 ++++++++++++++-------
 2 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/include/hw/vfio/vfio-container-base.h b/include/hw/vfio/vfio-container-base.h
index bded6e993f..31fd784d76 100644
--- a/include/hw/vfio/vfio-container-base.h
+++ b/include/hw/vfio/vfio-container-base.h
@@ -51,6 +51,7 @@ typedef struct VFIOContainerBase {
     QLIST_HEAD(, VFIODevice) device_list;
     GList *iova_ranges;
     NotifierWithReturn cpr_reboot_notifier;
+    bool bypass_ro;
 } VFIOContainerBase;
 
 typedef struct VFIOGuestIOMMU {
diff --git a/hw/vfio/listener.c b/hw/vfio/listener.c
index 903dfd8bf2..5fa2bb7f1a 100644
--- a/hw/vfio/listener.c
+++ b/hw/vfio/listener.c
@@ -76,8 +76,13 @@ static bool vfio_log_sync_needed(const VFIOContainerBase *bcontainer)
     return true;
 }
 
-static bool vfio_listener_skipped_section(MemoryRegionSection *section)
+static bool vfio_listener_skipped_section(MemoryRegionSection *section,
+                                          bool bypass_ro)
 {
+    if (bypass_ro && section->readonly) {
+        return true;
+    }
+
     return (!memory_region_is_ram(section->mr) &&
             !memory_region_is_iommu(section->mr)) ||
            memory_region_is_protected(section->mr) ||
@@ -365,9 +370,9 @@ static bool vfio_known_safe_misalignment(MemoryRegionSection *section)
 }
 
 static bool vfio_listener_valid_section(MemoryRegionSection *section,
-                                        const char *name)
+                                        bool bypass_ro, const char *name)
 {
-    if (vfio_listener_skipped_section(section)) {
+    if (vfio_listener_skipped_section(section, bypass_ro)) {
         trace_vfio_listener_region_skip(name,
                 section->offset_within_address_space,
                 section->offset_within_address_space +
@@ -494,7 +499,8 @@ void vfio_container_region_add(VFIOContainerBase *bcontainer,
     int ret;
     Error *err = NULL;
 
-    if (!vfio_listener_valid_section(section, "region_add")) {
+    if (!vfio_listener_valid_section(section, bcontainer->bypass_ro,
+                                     "region_add")) {
         return;
     }
 
@@ -655,7 +661,8 @@ static void vfio_listener_region_del(MemoryListener *listener,
     int ret;
     bool try_unmap = true;
 
-    if (!vfio_listener_valid_section(section, "region_del")) {
+    if (!vfio_listener_valid_section(section, bcontainer->bypass_ro,
+                                     "region_del")) {
         return;
     }
 
@@ -812,7 +819,7 @@ static void vfio_dirty_tracking_update(MemoryListener *listener,
         container_of(listener, VFIODirtyRangesListener, listener);
     hwaddr iova, end;
 
-    if (!vfio_listener_valid_section(section, "tracking_update") ||
+    if (!vfio_listener_valid_section(section, false, "tracking_update") ||
         !vfio_get_section_iova_range(dirty->bcontainer, section,
                                      &iova, &end, NULL)) {
         return;
@@ -1206,7 +1213,7 @@ static void vfio_listener_log_sync(MemoryListener *listener,
     int ret;
     Error *local_err = NULL;
 
-    if (vfio_listener_skipped_section(section)) {
+    if (vfio_listener_skipped_section(section, false)) {
         return;
     }
 
-- 
2.47.1




* [PATCH v5 20/21] Workaround for ERRATA_772415_SPR17
  2025-08-22  6:40 [PATCH v5 00/21] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
                   ` (18 preceding siblings ...)
  2025-08-22  6:40 ` [PATCH v5 19/21] vfio: Add a new element bypass_ro in VFIOContainerBase Zhenzhong Duan
@ 2025-08-22  6:40 ` Zhenzhong Duan
  2025-08-22 23:55   ` Nicolin Chen
  2025-08-22  6:40 ` [PATCH v5 21/21] intel_iommu: Enable host device when x-flts=on in scalable mode Zhenzhong Duan
  2025-08-27 11:13 ` [PATCH v5 00/21] intel_iommu: Enable stage-1 translation for passthrough device Yi Liu
  21 siblings, 1 reply; 113+ messages in thread
From: Zhenzhong Duan @ 2025-08-22  6:40 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, joao.m.martins, clement.mathieu--drif, kevin.tian,
	yi.l.liu, chao.p.peng, Zhenzhong Duan

On a system affected by ERRATA_772415, IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17
is reported by IOMMU_DEVICE_GET_HW_INFO. Due to this erratum, even a read-only
range mapped in the stage-2 page table could still be written.

Reference from 4th Gen Intel Xeon Processor Scalable Family Specification
Update, Errata Details, SPR17.
https://edc.intel.com/content/www/us/en/design/products-and-solutions/processors-and-chipsets/eagle-stream/sapphire-rapids-specification-update/

Also copied the SPR17 details from above link:
"Problem: When remapping hardware is configured by system software in
scalable mode as Nested (PGTT=011b) and with PWSNP field Set in the
PASID-table-entry, it may Set Accessed bit and Dirty bit (and Extended
Access bit if enabled) in first-stage page-table entries even when
second-stage mappings indicate that corresponding first-stage page-table
is Read-Only.

Implication: Due to this erratum, pages mapped as Read-only in second-stage
page-tables may be modified by remapping hardware Access/Dirty bit updates.

Workaround: None identified. System software enabling nested translations
for a VM should ensure that there are no read-only pages in the
corresponding second-stage mappings."
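
A sketch of the decision this patch makes from the reported hardware info.
The struct and constants are hypothetical stand-ins, not the iommufd uAPI
names or values. Note that the sketch also checks the info type before
reading the VT-d flags; that check is an assumption of the sketch and is not
part of the diff below.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical constants; the real values come from the iommufd uAPI. */
#define SKETCH_HW_INFO_TYPE_VTD   1u
#define SKETCH_VTD_ERRATA_772415  (1u << 0)

/* Hypothetical, flattened view of the reported hardware info. */
struct sketch_hw_info {
    uint32_t type;       /* which vendor-specific layout was returned */
    uint32_t vtd_flags;  /* only meaningful when type is VT-d */
};

/* Return true if read-only sections must be kept out of stage-2. */
static bool sketch_needs_bypass_ro(const struct sketch_hw_info *info)
{
    assert(info != NULL);
    return info->type == SKETCH_HW_INFO_TYPE_VTD &&
           (info->vtd_flags & SKETCH_VTD_ERRATA_772415) != 0;
}
```

When this returns true, the container would set bypass_ro so that no
read-only range ends up in the second-stage mappings, as the erratum text
asks of system software.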

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/vfio/iommufd.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index e503c232e1..59735e878c 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -324,6 +324,7 @@ static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
 {
     ERRP_GUARD();
     IOMMUFDBackend *iommufd = vbasedev->iommufd;
+    struct iommu_hw_info_vtd vtd;
     uint32_t type, flags = 0;
     uint64_t hw_caps;
     VFIOIOASHwpt *hwpt;
@@ -371,10 +372,15 @@ static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
      * instead.
      */
     if (!iommufd_backend_get_device_info(vbasedev->iommufd, vbasedev->devid,
-                                         &type, NULL, 0, &hw_caps, errp)) {
+                                         &type, &vtd, sizeof(vtd), &hw_caps,
+                                         errp)) {
         return false;
     }
 
+    if (vtd.flags & IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17) {
+        container->bcontainer.bypass_ro = true;
+    }
+
     if (hw_caps & IOMMU_HW_CAP_DIRTY_TRACKING) {
         flags = IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
     }
-- 
2.47.1




* [PATCH v5 21/21] intel_iommu: Enable host device when x-flts=on in scalable mode
  2025-08-22  6:40 [PATCH v5 00/21] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
                   ` (19 preceding siblings ...)
  2025-08-22  6:40 ` [PATCH v5 20/21] Workaround for ERRATA_772415_SPR17 Zhenzhong Duan
@ 2025-08-22  6:40 ` Zhenzhong Duan
  2025-08-28 12:51   ` Eric Auger
  2025-08-29  7:42   ` Yi Liu
  2025-08-27 11:13 ` [PATCH v5 00/21] intel_iommu: Enable stage-1 translation for passthrough device Yi Liu
  21 siblings, 2 replies; 113+ messages in thread
From: Zhenzhong Duan @ 2025-08-22  6:40 UTC (permalink / raw)
  To: qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, joao.m.martins, clement.mathieu--drif, kevin.tian,
	yi.l.liu, chao.p.peng, Zhenzhong Duan

Now that all the infrastructure for running a passthrough device with
stage-1 translation is in place, enable it.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
 hw/i386/intel_iommu.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index f9cb13e945..04a412d460 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -5222,6 +5222,8 @@ static bool vtd_check_hiod(IntelIOMMUState *s, VTDHostIOMMUDevice *vtd_hiod,
                    "when x-flts=on");
         return false;
     }
+
+    return true;
 #endif
 
     error_setg(errp, "host IOMMU is incompatible with stage-1 translation");
-- 
2.47.1




* Re: [PATCH v5 01/21] intel_iommu: Rename vtd_ce_get_rid2pasid_entry to vtd_ce_get_pasid_entry
  2025-08-22  6:40 ` [PATCH v5 01/21] intel_iommu: Rename vtd_ce_get_rid2pasid_entry to vtd_ce_get_pasid_entry Zhenzhong Duan
@ 2025-08-22 22:19   ` Nicolin Chen via
  2025-08-25  6:01     ` Duan, Zhenzhong
  0 siblings, 1 reply; 113+ messages in thread
From: Nicolin Chen via @ 2025-08-22 22:19 UTC (permalink / raw)
  To: Zhenzhong Duan
  Cc: qemu-devel, alex.williamson, clg, eric.auger, mst, jasowang,
	peterx, ddutile, jgg, joao.m.martins, clement.mathieu--drif,
	kevin.tian, yi.l.liu, chao.p.peng

On Fri, Aug 22, 2025 at 02:40:39AM -0400, Zhenzhong Duan wrote:
> In early days vtd_ce_get_rid2pasid_entry() was used to get pasid entry
> of rid2pasid, then it was extended to get any pasid entry. So a new name
> vtd_ce_get_pasid_entry is better to match what it actually does.
> 
> No functional change intended.
> 
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> Reviewed-by: Clément Mathieu--Drif<clement.mathieu--drif@eviden.com>
> Reviewed-by: Yi Liu <yi.l.liu@intel.com>
> Reviewed-by: Eric Auger <eric.auger@redhat.com>

Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>

> @@ -944,7 +944,7 @@ static int vtd_get_pe_from_pasid_table(IntelIOMMUState *s,
>      return 0;
>  }
>  
> -static int vtd_ce_get_rid2pasid_entry(IntelIOMMUState *s,
> +static int vtd_ce_get_pasid_entry(IntelIOMMUState *s,
>                                        VTDContextEntry *ce,
>                                        VTDPASIDEntry *pe,
>                                        uint32_t pasid)
 
Nit: it could be re-organized a bit with the shrunk indentation.

static int vtd_ce_get_pasid_entry(IntelIOMMUState *s, VTDContextEntry *ce,
                                  VTDPASIDEntry *pe, uint32_t pasid)



* Re: [PATCH v5 02/21] hw/pci: Introduce pci_device_get_viommu_cap()
  2025-08-22  6:40 ` [PATCH v5 02/21] hw/pci: Introduce pci_device_get_viommu_cap() Zhenzhong Duan
@ 2025-08-22 22:22   ` Nicolin Chen
  2025-08-27 11:13   ` Yi Liu
  1 sibling, 0 replies; 113+ messages in thread
From: Nicolin Chen @ 2025-08-22 22:22 UTC (permalink / raw)
  To: Zhenzhong Duan
  Cc: qemu-devel, alex.williamson, clg, eric.auger, mst, jasowang,
	peterx, ddutile, jgg, joao.m.martins, clement.mathieu--drif,
	kevin.tian, yi.l.liu, chao.p.peng

On Fri, Aug 22, 2025 at 02:40:40AM -0400, Zhenzhong Duan wrote:
> Introduce a new PCIIOMMUOps optional callback, get_viommu_cap() which
> allows to retrieve capabilities exposed by a vIOMMU. The first planned
> vIOMMU device capability is VIOMMU_CAP_HW_NESTED that advertises the
> support of HW nested stage translation scheme. pci_device_get_viommu_cap
> is a wrapper that can be called on a PCI device potentially protected by
> a vIOMMU.
> 
> get_viommu_cap() is designed to return 64bit bitmap of purely emulated
> capabilities which are only determined by user's configuration, no host
> capabilities involved. Reasons are:
> 
> 1. host may have heterogeneous IOMMUs, each with different capabilities
> 2. this is migration friendly, return value is consistent between source
>    and target.
> 3. host IOMMU capabilities are passed to vIOMMU through set_iommu_device()
>    interface which have to be after attach_device(), when get_viommu_cap()
>    is called in attach_device(), there is no way for vIOMMU to get host
>    IOMMU capabilities yet, so only emulated capabilities can be returned.
>    See below sequence:
> 
>      vfio_device_attach():
>          iommufd_cdev_attach():
>              pci_device_get_viommu_cap() for HW nesting cap
>              create a nesting parent hwpt
>              attach device to the hwpt
>              vfio_device_hiod_create_and_realize() creating hiod
>      ...
>      pci_device_set_iommu_device(hiod)
> 
> Suggested-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
 
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
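
For readers following the thread, the contract described in the commit
message can be sketched as below. This is a standalone illustration, not
the patch itself: the Mock* types and mock_* functions are hypothetical
stand-ins for QEMU's PCIIOMMUOps plumbing, the bit position is assumed,
and returning 0 when no callback is registered is an assumption about
the wrapper's behaviour.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Capability named in the commit message; the bit position is assumed. */
#define MOCK_VIOMMU_CAP_HW_NESTED (1ULL << 0)

/* Hypothetical stand-in for PCIIOMMUOps with the optional callback. */
typedef struct MockIOMMUOps {
    uint64_t (*get_viommu_cap)(void *opaque);
} MockIOMMUOps;

/* Hypothetical stand-in for a PCI device that may sit behind a vIOMMU. */
typedef struct MockPCIDevice {
    const MockIOMMUOps *iommu_ops;   /* NULL when no vIOMMU protects it */
    void *iommu_opaque;
} MockPCIDevice;

/* Hypothetical vIOMMU state: user configuration only, no host data. */
typedef struct MockVIOMMUState {
    bool flts;
} MockVIOMMUState;

/* The bitmap depends only on emulated configuration, so it is the same
 * on migration source and target. */
static uint64_t mock_viommu_get_cap(void *opaque)
{
    MockVIOMMUState *s = opaque;

    return s->flts ? MOCK_VIOMMU_CAP_HW_NESTED : 0;
}

static const MockIOMMUOps mock_viommu_ops = {
    .get_viommu_cap = mock_viommu_get_cap,
};

/* Wrapper callable on any device; reports no capabilities when the
 * device has no vIOMMU or the vIOMMU lacks the optional callback. */
static uint64_t mock_pci_device_get_viommu_cap(MockPCIDevice *dev)
{
    if (dev->iommu_ops && dev->iommu_ops->get_viommu_cap) {
        return dev->iommu_ops->get_viommu_cap(dev->iommu_opaque);
    }
    return 0;
}
```

Because the callback never consults host state, it can be called from
attach_device() before set_iommu_device() has handed over the hiod.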



* Re: [PATCH v5 03/21] intel_iommu: Implement get_viommu_cap() callback
  2025-08-22  6:40 ` [PATCH v5 03/21] intel_iommu: Implement get_viommu_cap() callback Zhenzhong Duan
@ 2025-08-22 22:23   ` Nicolin Chen
  0 siblings, 0 replies; 113+ messages in thread
From: Nicolin Chen @ 2025-08-22 22:23 UTC (permalink / raw)
  To: Zhenzhong Duan
  Cc: qemu-devel, alex.williamson, clg, eric.auger, mst, jasowang,
	peterx, ddutile, jgg, joao.m.martins, clement.mathieu--drif,
	kevin.tian, yi.l.liu, chao.p.peng

On Fri, Aug 22, 2025 at 02:40:41AM -0400, Zhenzhong Duan wrote:
> Implement get_viommu_cap() callback and expose stage-1 capability for now.
> 
> VFIO uses it to create nested parent domain which is further used to create
> nested domain in vIOMMU. All these will be implemented in following patches.
> 
> Suggested-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> Reviewed-by: Eric Auger <eric.auger@redhat.com>
 
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>



* Re: [PATCH v5 04/21] vfio: Introduce helper vfio_pci_from_vfio_device()
  2025-08-22  6:40 ` [PATCH v5 04/21] vfio: Introduce helper vfio_pci_from_vfio_device() Zhenzhong Duan
@ 2025-08-22 22:40   ` Nicolin Chen via
  2025-08-25  6:06     ` Duan, Zhenzhong
  2025-08-27 11:13   ` Yi Liu
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 113+ messages in thread
From: Nicolin Chen via @ 2025-08-22 22:40 UTC (permalink / raw)
  To: Zhenzhong Duan
  Cc: qemu-devel, alex.williamson, clg, eric.auger, mst, jasowang,
	peterx, ddutile, jgg, joao.m.martins, clement.mathieu--drif,
	kevin.tian, yi.l.liu, chao.p.peng

On Fri, Aug 22, 2025 at 02:40:42AM -0400, Zhenzhong Duan wrote:
> Introduce helper vfio_pci_from_vfio_device() to transform from VFIODevice
> to VFIOPCIDevice, also to hide low level VFIO_DEVICE_TYPE_PCI type check.
> 
> Suggested-by: Cédric Le Goater <clg@redhat.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> Reviewed-by: Cédric Le Goater <clg@redhat.com>
> Link: https://lore.kernel.org/qemu-devel/20250801023533.1458644-1-zhenzhong.duan@intel.com
> [ clg: Added documentation ]
> Signed-off-by: Cédric Le Goater <clg@redhat.com>

I think we should drop the link? The link points to the v3 that
is not the officially accepted one now, as this PATCH-04 would
be? IOW, the commit should probably have a link to this patch
instead.

Also, in general, your "Signed-off-by" should be the last line,
when you submit a patch.

With that,

Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>

> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> index 810a842f4a..beb8fb9ee7 100644
> --- a/hw/vfio/pci.h
> +++ b/hw/vfio/pci.h
> @@ -221,6 +221,18 @@ void vfio_pci_write_config(PCIDevice *pdev,
>  uint64_t vfio_vga_read(void *opaque, hwaddr addr, unsigned size);
>  void vfio_vga_write(void *opaque, hwaddr addr, uint64_t data, unsigned size);
>  
> +/**
> + * vfio_pci_from_vfio_device: Transform from VFIODevice to
> + * VFIOPCIDevice

Nit: this could fit into one line.



* Re: [PATCH v5 05/21] vfio/iommufd: Force creating nested parent domain
  2025-08-22  6:40 ` [PATCH v5 05/21] vfio/iommufd: Force creating nested parent domain Zhenzhong Duan
@ 2025-08-22 23:12   ` Nicolin Chen
  2025-08-25  8:28     ` Duan, Zhenzhong
  2025-08-27 11:48   ` Eric Auger
  1 sibling, 1 reply; 113+ messages in thread
From: Nicolin Chen @ 2025-08-22 23:12 UTC (permalink / raw)
  To: Zhenzhong Duan
  Cc: qemu-devel, alex.williamson, clg, eric.auger, mst, jasowang,
	peterx, ddutile, jgg, joao.m.martins, clement.mathieu--drif,
	kevin.tian, yi.l.liu, chao.p.peng

On Fri, Aug 22, 2025 at 02:40:43AM -0400, Zhenzhong Duan wrote:
> Call pci_device_get_viommu_cap() to get if vIOMMU supports VIOMMU_CAP_HW_NESTED,
> if yes, create nested parent domain which could be reused by vIOMMU to create
> nested domain.
>
> Introduce helper vfio_device_viommu_get_nested to facilitate this
> implementation.

It'd be nicer to slightly mention the benefit of having it. Assuming
that QEMU commit message can be as long as 80 characters:

-------------------------
Call pci_device_get_viommu_cap() to get if vIOMMU supports VIOMMU_CAP_HW_NESTED.

If yes, create a nesting parent domain and add it to the container's hwpt_list,
letting this parent domain cover the entire stage-2 mappings (gPA=>PA).

This allows a VFIO passthrough device to directly attach to this default domain
and then to use the system address space and its listener.

Introduce a vfio_device_viommu_get_nested() helper to facilitate this
implementation.
-------------------------
 
> It is safe because even if VIOMMU_CAP_HW_NESTED is returned, s->flts is
> forbidden and VFIO device fails in set_iommu_device() call, until we support
> passthrough device with x-flts=on.

I think this is too vendor specific to be mentioned here. Likely
the previous VTD patch is the place to have this.

Or you could say:

--------------------------
It is safe to do so because a vIOMMU will be able to fail in set_iommu_device()
call, if something else related to the VFIO device or vIOMMU isn't compatible.
--------------------------

> +bool vfio_device_viommu_get_nested(VFIODevice *vbasedev)
> +{
> +    VFIOPCIDevice *vdev = vfio_pci_from_vfio_device(vbasedev);
> +
> +    if (vdev) {
> +        return !!(pci_device_get_viommu_cap(&vdev->pdev) &
> +                  VIOMMU_CAP_HW_NESTED);

"get_nested" feels too general. Here it particularly means the cap:

bool vfio_device_get_viommu_cap_hw_nested(VFIODevice *vbasedev)

> @@ -379,6 +379,14 @@ static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
>          flags = IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
>      }
>  
> +    /*
> +     * If vIOMMU supports stage-1 translation, force to create nested parent

"nested parent" is a contradictory phrase. Parent is a container
holding some nested items. A nested parent sounds like a "parent"
item that lives inside another parent container.

In kernel kdoc/uAPI, we use:
 - "nesting parent" for stage-2 object
 - "nested hwpt", "nested domain" for stage-1 object

> +     * domain which could be reused by vIOMMU to create nested domain.
> +     */
> +    if (vfio_device_viommu_get_nested(vbasedev)) {

With these addressed,

Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>



* Re: [PATCH v5 06/21] hw/pci: Export pci_device_get_iommu_bus_devfn() and return bool
  2025-08-22  6:40 ` [PATCH v5 06/21] hw/pci: Export pci_device_get_iommu_bus_devfn() and return bool Zhenzhong Duan
@ 2025-08-22 23:13   ` Nicolin Chen
  2025-08-27 11:14   ` Yi Liu
  1 sibling, 0 replies; 113+ messages in thread
From: Nicolin Chen @ 2025-08-22 23:13 UTC (permalink / raw)
  To: Zhenzhong Duan
  Cc: qemu-devel, alex.williamson, clg, eric.auger, mst, jasowang,
	peterx, ddutile, jgg, joao.m.martins, clement.mathieu--drif,
	kevin.tian, yi.l.liu, chao.p.peng

On Fri, Aug 22, 2025 at 02:40:44AM -0400, Zhenzhong Duan wrote:
> Returns true if PCI device is aliased or false otherwise. This will be
> used in following patch to determine if a PCI device is under a PCI
> bridge.
> 
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> Reviewed-by: Eric Auger <eric.auger@redhat.com>
 
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>



* Re: [PATCH v5 07/21] intel_iommu: Introduce a new structure VTDHostIOMMUDevice
  2025-08-22  6:40 ` [PATCH v5 07/21] intel_iommu: Introduce a new structure VTDHostIOMMUDevice Zhenzhong Duan
@ 2025-08-22 23:17   ` Nicolin Chen
  2025-08-26 17:21   ` Nicolin Chen
  2025-08-27 11:14   ` Yi Liu
  2 siblings, 0 replies; 113+ messages in thread
From: Nicolin Chen @ 2025-08-22 23:17 UTC (permalink / raw)
  To: Zhenzhong Duan
  Cc: qemu-devel, alex.williamson, clg, eric.auger, mst, jasowang,
	peterx, ddutile, jgg, joao.m.martins, clement.mathieu--drif,
	kevin.tian, yi.l.liu, chao.p.peng

On Fri, Aug 22, 2025 at 02:40:45AM -0400, Zhenzhong Duan wrote:
> Introduce a new structure VTDHostIOMMUDevice which replaces
> HostIOMMUDevice to be stored in hash table.
> 
> It includes a reference to HostIOMMUDevice and IntelIOMMUState,
> also includes BDF information which will be used in future
> patches.
> 
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> Reviewed-by: Eric Auger <eric.auger@redhat.com>

Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>



* Re: [PATCH v5 20/21] Workaround for ERRATA_772415_SPR17
  2025-08-22  6:40 ` [PATCH v5 20/21] Workaround for ERRATA_772415_SPR17 Zhenzhong Duan
@ 2025-08-22 23:55   ` Nicolin Chen
  2025-08-25  9:21     ` Duan, Zhenzhong
  2025-08-27 11:56     ` Yi Liu
  0 siblings, 2 replies; 113+ messages in thread
From: Nicolin Chen @ 2025-08-22 23:55 UTC (permalink / raw)
  To: Zhenzhong Duan
  Cc: qemu-devel, alex.williamson, clg, eric.auger, mst, jasowang,
	peterx, ddutile, jgg, joao.m.martins, clement.mathieu--drif,
	kevin.tian, yi.l.liu, chao.p.peng

On Fri, Aug 22, 2025 at 02:40:58AM -0400, Zhenzhong Duan wrote:
> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
> index e503c232e1..59735e878c 100644
> --- a/hw/vfio/iommufd.c
> +++ b/hw/vfio/iommufd.c
> @@ -324,6 +324,7 @@ static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
>  {
>      ERRP_GUARD();
>      IOMMUFDBackend *iommufd = vbasedev->iommufd;
> +    struct iommu_hw_info_vtd vtd;

VendorCaps vendor_caps;

>      uint32_t type, flags = 0;
>      uint64_t hw_caps;
>      VFIOIOASHwpt *hwpt;
> @@ -371,10 +372,15 @@ static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
>       * instead.
>       */
>      if (!iommufd_backend_get_device_info(vbasedev->iommufd, vbasedev->devid,
> -                                         &type, NULL, 0, &hw_caps, errp)) {
> +                                         &type, &vtd, sizeof(vtd), &hw_caps,

s/vtd/vendor_caps/g

> +                                         errp)) {
>          return false;
>      }
>  
> +    if (vtd.flags & IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17) {
> +        container->bcontainer.bypass_ro = true;

This circles back to checking a vendor-specific flag in the core.

Perhaps we could upgrade the get_viommu_cap op and its API:

enum viommu_flags {
    VIOMMU_FLAG_HW_NESTED = BIT_ULL(0),
    VIOMMU_FLAG_BYPASS_RO = BIT_ULL(1),
};

bool vfio_device_get_viommu_flags(VFIODevice *vbasedev, VendorCaps *vendor_caps,
                                  uint64_t *viommu_flags);

Then:
    if (viommu_flags & VIOMMU_FLAG_BYPASS_RO) {
        container->bcontainer.bypass_ro = true;
    }
...
    if (viommu_flags & VIOMMU_FLAG_HW_NESTED) {
        flags |= IOMMU_HWPT_ALLOC_NEST_PARENT;
    }
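
To make the core-side idea concrete, here is a minimal standalone sketch
of how the core could consume such generic flags. All Sketch*/SKETCH_*
names are hypothetical; the HWPT constants are placeholders for the
IOMMU_HWPT_ALLOC_* values from the kernel uAPI header, not the real
definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Generic vIOMMU flags, following the sample enum above. */
#define SKETCH_VIOMMU_FLAG_HW_NESTED (1ULL << 0)
#define SKETCH_VIOMMU_FLAG_BYPASS_RO (1ULL << 1)

/* Placeholders for IOMMU_HWPT_ALLOC_* from <linux/iommufd.h>. */
#define SKETCH_HWPT_ALLOC_NEST_PARENT    (1U << 0)
#define SKETCH_HWPT_ALLOC_DIRTY_TRACKING (1U << 1)

/* Hypothetical stand-in for the base container. */
typedef struct SketchContainer {
    bool bypass_ro;
} SketchContainer;

/* Core code looks only at generic flags: no vendor structure is
 * dereferenced here. Returns the updated hwpt allocation flags. */
static uint32_t sketch_apply_viommu_flags(SketchContainer *c,
                                          uint64_t viommu_flags,
                                          uint32_t hwpt_flags)
{
    if (viommu_flags & SKETCH_VIOMMU_FLAG_BYPASS_RO) {
        c->bypass_ro = true;
    }
    if (viommu_flags & SKETCH_VIOMMU_FLAG_HW_NESTED) {
        hwpt_flags |= SKETCH_HWPT_ALLOC_NEST_PARENT;
    }
    return hwpt_flags;
}
```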

Nicolin



* RE: [PATCH v5 01/21] intel_iommu: Rename vtd_ce_get_rid2pasid_entry to vtd_ce_get_pasid_entry
  2025-08-22 22:19   ` Nicolin Chen via
@ 2025-08-25  6:01     ` Duan, Zhenzhong
  0 siblings, 0 replies; 113+ messages in thread
From: Duan, Zhenzhong @ 2025-08-25  6:01 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: qemu-devel@nongnu.org, alex.williamson@redhat.com, clg@redhat.com,
	eric.auger@redhat.com, mst@redhat.com, jasowang@redhat.com,
	peterx@redhat.com, ddutile@redhat.com, jgg@nvidia.com,
	joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
	Tian,  Kevin, Liu, Yi L, Peng, Chao P



>-----Original Message-----
>From: Nicolin Chen <nicolinc@nvidia.com>
>Subject: Re: [PATCH v5 01/21] intel_iommu: Rename
>vtd_ce_get_rid2pasid_entry to vtd_ce_get_pasid_entry
>
>On Fri, Aug 22, 2025 at 02:40:39AM -0400, Zhenzhong Duan wrote:
>> In early days vtd_ce_get_rid2pasid_entry() was used to get pasid entry
>> of rid2pasid, then it was extended to get any pasid entry. So a new name
>> vtd_ce_get_pasid_entry is better to match what it actually does.
>>
>> No functional change intended.
>>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> Reviewed-by: Clément Mathieu--Drif<clement.mathieu--drif@eviden.com>
>> Reviewed-by: Yi Liu <yi.l.liu@intel.com>
>> Reviewed-by: Eric Auger <eric.auger@redhat.com>
>
>Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
>
>> @@ -944,7 +944,7 @@ static int vtd_get_pe_from_pasid_table(IntelIOMMUState *s,
>>      return 0;
>>  }
>>
>> -static int vtd_ce_get_rid2pasid_entry(IntelIOMMUState *s,
>> +static int vtd_ce_get_pasid_entry(IntelIOMMUState *s,
>>                                        VTDContextEntry *ce,
>>                                        VTDPASIDEntry *pe,
>>                                        uint32_t pasid)
>
>Nit: it could be re-organized a bit with the shrunk indentation.

Good catch, will do.

Thanks
Zhenzhong



* RE: [PATCH v5 04/21] vfio: Introduce helper vfio_pci_from_vfio_device()
  2025-08-22 22:40   ` Nicolin Chen via
@ 2025-08-25  6:06     ` Duan, Zhenzhong
  0 siblings, 0 replies; 113+ messages in thread
From: Duan, Zhenzhong @ 2025-08-25  6:06 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: qemu-devel@nongnu.org, alex.williamson@redhat.com, clg@redhat.com,
	eric.auger@redhat.com, mst@redhat.com, jasowang@redhat.com,
	peterx@redhat.com, ddutile@redhat.com, jgg@nvidia.com,
	joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
	Tian,  Kevin, Liu, Yi L, Peng, Chao P



>-----Original Message-----
>From: Nicolin Chen <nicolinc@nvidia.com>
>Subject: Re: [PATCH v5 04/21] vfio: Introduce helper
>vfio_pci_from_vfio_device()
>
>On Fri, Aug 22, 2025 at 02:40:42AM -0400, Zhenzhong Duan wrote:
>> Introduce helper vfio_pci_from_vfio_device() to transform from VFIODevice
>> to VFIOPCIDevice, also to hide low level VFIO_DEVICE_TYPE_PCI type check.
>>
>> Suggested-by: Cédric Le Goater <clg@redhat.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> Reviewed-by: Cédric Le Goater <clg@redhat.com>
>> Link: https://lore.kernel.org/qemu-devel/20250801023533.1458644-1-zhenzhong.duan@intel.com
>> [ clg: Added documentation ]
>> Signed-off-by: Cédric Le Goater <clg@redhat.com>
>
>I think we should drop the link? The link points to the v3 that
>is not the officially accepted one now, as this PATCH-04 would
>be? IOW, the commit should probably have a link to this patch
>instead.

OK, will remove them.

>
>Also, in general, your "Signed-off-by" should be the last line,
>when you submit a patch.

Will do.

>
>With that,
>
>Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
>
>> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
>> index 810a842f4a..beb8fb9ee7 100644
>> --- a/hw/vfio/pci.h
>> +++ b/hw/vfio/pci.h
>> @@ -221,6 +221,18 @@ void vfio_pci_write_config(PCIDevice *pdev,
>>  uint64_t vfio_vga_read(void *opaque, hwaddr addr, unsigned size);
>>  void vfio_vga_write(void *opaque, hwaddr addr, uint64_t data, unsigned size);
>>
>> +/**
>> + * vfio_pci_from_vfio_device: Transform from VFIODevice to
>> + * VFIOPCIDevice
>
>Nit: this could fit into one line.

Sure, will do.

Thanks
Zhenzhong



* RE: [PATCH v5 05/21] vfio/iommufd: Force creating nested parent domain
  2025-08-22 23:12   ` Nicolin Chen
@ 2025-08-25  8:28     ` Duan, Zhenzhong
  2025-08-27 11:51       ` Eric Auger
  0 siblings, 1 reply; 113+ messages in thread
From: Duan, Zhenzhong @ 2025-08-25  8:28 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: qemu-devel@nongnu.org, alex.williamson@redhat.com, clg@redhat.com,
	eric.auger@redhat.com, mst@redhat.com, jasowang@redhat.com,
	peterx@redhat.com, ddutile@redhat.com, jgg@nvidia.com,
	joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
	Tian,  Kevin, Liu, Yi L, Peng, Chao P



>-----Original Message-----
>From: Nicolin Chen <nicolinc@nvidia.com>
>Subject: Re: [PATCH v5 05/21] vfio/iommufd: Force creating nested parent
>domain
>
>On Fri, Aug 22, 2025 at 02:40:43AM -0400, Zhenzhong Duan wrote:
>> Call pci_device_get_viommu_cap() to get if vIOMMU supports VIOMMU_CAP_HW_NESTED,
>> if yes, create nested parent domain which could be reused by vIOMMU to create
>> nested domain.
>>
>> Introduce helper vfio_device_viommu_get_nested to facilitate this
>> implementation.
>
>It'd be nicer to slightly mention the benefit of having it. Assuming
>that QEMU commit message can be as long as 80 characters:
>
>-------------------------
>Call pci_device_get_viommu_cap() to get if vIOMMU supports VIOMMU_CAP_HW_NESTED.
>
>If yes, create a nesting parent domain and add it to the container's hwpt_list,
>letting this parent domain cover the entire stage-2 mappings (gPA=>PA).
>
>This allows a VFIO passthrough device to directly attach to this default domain
>and then to use the system address space and its listener.
>
>Introduce a vfio_device_viommu_get_nested() helper to facilitate this
>implementation.
>-------------------------

Thanks, will do.

>
>> It is safe because even if VIOMMU_CAP_HW_NESTED is returned, s->flts is
>> forbidden and VFIO device fails in set_iommu_device() call, until we support
>> passthrough device with x-flts=on.
>
>I think this is too vendor specific to be mentioned here. Likely
>the previous VTD patch is the place to have this.
>
>Or you could say:
>
>--------------------------
>It is safe to do so because a vIOMMU will be able to fail in set_iommu_device()
>call, if something else related to the VFIO device or vIOMMU isn't compatible.
>--------------------------

Will do.

>
>> +bool vfio_device_viommu_get_nested(VFIODevice *vbasedev)
>> +{
>> +    VFIOPCIDevice *vdev = vfio_pci_from_vfio_device(vbasedev);
>> +
>> +    if (vdev) {
>> +        return !!(pci_device_get_viommu_cap(&vdev->pdev) &
>> +                  VIOMMU_CAP_HW_NESTED);
>
>"get_nested" feels too general. Here it particularly means the cap:
>
>bool vfio_device_get_viommu_cap_hw_nested(VFIODevice *vbasedev)

Will use vfio_device_get_viommu_cap_hw_nested()

>
>> @@ -379,6 +379,14 @@ static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
>>          flags = IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
>>      }
>>
>> +    /*
>> +     * If vIOMMU supports stage-1 translation, force to create nested parent
>
>"nested parent" is a contradictory phrase. Parent is a container
>holding some nested items. A nested parent sounds like a "parent"
>item that lives inside another parent container.
>
>In kernel kdoc/uAPI, we use:
> - "nesting parent" for stage-2 object
> - "nested hwpt", "nested domain" for stage-1 object

Thanks for sharing this info, I didn't notice that. I will fix the whole series to use 'nesting parent'.

BRs,
Zhenzhong



* RE: [PATCH v5 20/21] Workaround for ERRATA_772415_SPR17
  2025-08-22 23:55   ` Nicolin Chen
@ 2025-08-25  9:21     ` Duan, Zhenzhong
  2025-08-25 16:58       ` Nicolin Chen
  2025-08-27 11:56     ` Yi Liu
  1 sibling, 1 reply; 113+ messages in thread
From: Duan, Zhenzhong @ 2025-08-25  9:21 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: qemu-devel@nongnu.org, alex.williamson@redhat.com, clg@redhat.com,
	eric.auger@redhat.com, mst@redhat.com, jasowang@redhat.com,
	peterx@redhat.com, ddutile@redhat.com, jgg@nvidia.com,
	joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
	Tian,  Kevin, Liu, Yi L, Peng, Chao P



>-----Original Message-----
>From: Nicolin Chen <nicolinc@nvidia.com>
>Subject: Re: [PATCH v5 20/21] Workaround for ERRATA_772415_SPR17
>
>On Fri, Aug 22, 2025 at 02:40:58AM -0400, Zhenzhong Duan wrote:
>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>> index e503c232e1..59735e878c 100644
>> --- a/hw/vfio/iommufd.c
>> +++ b/hw/vfio/iommufd.c
>> @@ -324,6 +324,7 @@ static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
>>  {
>>      ERRP_GUARD();
>>      IOMMUFDBackend *iommufd = vbasedev->iommufd;
>> +    struct iommu_hw_info_vtd vtd;
>
>VendorCaps vendor_caps;
>
>>      uint32_t type, flags = 0;
>>      uint64_t hw_caps;
>>      VFIOIOASHwpt *hwpt;
>> @@ -371,10 +372,15 @@ static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
>>       * instead.
>>       */
>>      if (!iommufd_backend_get_device_info(vbasedev->iommufd, vbasedev->devid,
>> -                                         &type, NULL, 0, &hw_caps, errp)) {
>> +                                         &type, &vtd, sizeof(vtd), &hw_caps,
>
>s/vtd/vendor_caps/g
>
>> +                                         errp)) {
>>          return false;
>>      }
>>
>> +    if (vtd.flags & IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17) {
>> +        container->bcontainer.bypass_ro = true;
>
>This circled back to checking a vendor specific flag in the core..

I'm not sure whether a VendorCaps struct wrapper would be over-engineering, as this erratum is VTD specific only. We would still need to check the VendorCaps.vtd.flags bit.

>
>Perhaps we could upgrade the get_viommu_cap op and its API:
>
>enum viommu_flags {
>    VIOMMU_FLAG_HW_NESTED = BIT_ULL(0),
>    VIOMMU_FLAG_BYPASS_RO = BIT_ULL(1),
>};
>
>bool vfio_device_get_viommu_flags(VFIODevice *vbasedev, VendorCaps *vendor_caps,
>                                  uint64_t *viommu_flags);
>
>Then:
>    if (viommu_flags & VIOMMU_FLAG_BYPASS_RO) {
>        container->bcontainer.bypass_ro = true;
>    }
>...
>    if (viommu_flags & VIOMMU_FLAG_HW_NESTED) {
>        flags |= IOMMU_HWPT_ALLOC_NEST_PARENT;
>    }

IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17 is a VTD-specific flag bit reported by the host IOMMU. We have defined get_viommu_cap() to return pure vIOMMU capability bits, so no host IOMMU flag bit can be returned here. See the patch 2 commit log for the reason.

Thanks
Zhenzhong



* Re: [PATCH v5 20/21] Workaround for ERRATA_772415_SPR17
  2025-08-25  9:21     ` Duan, Zhenzhong
@ 2025-08-25 16:58       ` Nicolin Chen
  2025-08-27  7:11         ` Duan, Zhenzhong
  0 siblings, 1 reply; 113+ messages in thread
From: Nicolin Chen @ 2025-08-25 16:58 UTC (permalink / raw)
  To: Duan, Zhenzhong
  Cc: qemu-devel@nongnu.org, alex.williamson@redhat.com, clg@redhat.com,
	eric.auger@redhat.com, mst@redhat.com, jasowang@redhat.com,
	peterx@redhat.com, ddutile@redhat.com, jgg@nvidia.com,
	joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
	Tian,  Kevin, Liu, Yi L, Peng, Chao P

On Mon, Aug 25, 2025 at 09:21:48AM +0000, Duan, Zhenzhong wrote:
> 
> 
> >-----Original Message-----
> >From: Nicolin Chen <nicolinc@nvidia.com>
> >Subject: Re: [PATCH v5 20/21] Workaround for ERRATA_772415_SPR17
> >
> >On Fri, Aug 22, 2025 at 02:40:58AM -0400, Zhenzhong Duan wrote:
> >> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
> >> index e503c232e1..59735e878c 100644
> >> --- a/hw/vfio/iommufd.c
> >> +++ b/hw/vfio/iommufd.c
> >> @@ -324,6 +324,7 @@ static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
> >>  {
> >>      ERRP_GUARD();
> >>      IOMMUFDBackend *iommufd = vbasedev->iommufd;
> >> +    struct iommu_hw_info_vtd vtd;
> >
> >VendorCaps vendor_caps;
> >
> >>      uint32_t type, flags = 0;
> >>      uint64_t hw_caps;
> >>      VFIOIOASHwpt *hwpt;
> >> @@ -371,10 +372,15 @@ static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
> >>       * instead.
> >>       */
> >>      if (!iommufd_backend_get_device_info(vbasedev->iommufd, vbasedev->devid,
> >> -                                         &type, NULL, 0, &hw_caps, errp)) {
> >> +                                         &type, &vtd, sizeof(vtd), &hw_caps,
> >
> >s/vtd/vendor_caps/g
> >
> >> +                                         errp)) {
> >>          return false;
> >>      }
> >>
> >> +    if (vtd.flags & IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17) {
> >> +        container->bcontainer.bypass_ro = true;
> >
> >This circled back to checking a vendor specific flag in the core..
> 
> I'm not sure if VendorCaps struct wrapper is overprogramming as this
> ERRATA is only VTD specific. We still need to check VendorCaps.vtd.flags bit.

Look, the HW_INFO call is done by the core.

Then, the core needs:
  1 HW caps for dirty tracking and PASID
  2 IOMMU_HWPT_ALLOC_NEST_PARENT (vIOMMU cap)
  3 bcontainer.bypass_ro (vIOMMU workaround)

Both 2 and 3 need to come from the vIOMMU, while 3 also needs VendorCaps.
Arguably 2 could do a bit of validation using the VendorCaps too.

> >Perhaps we could upgrade the get_viommu_cap op and its API:
> >
> >enum viommu_flags {
> >    VIOMMU_FLAG_HW_NESTED = BIT_ULL(0),
> >    VIOMMU_FLAG_BYPASS_RO = BIT_ULL(1),
> >};
> >
> >bool vfio_device_get_viommu_flags(VFIODevice *vbasedev, VendorCaps *vendor_caps,
> >                                  uint64_t *viommu_flags);
> >
> >Then:
> >    if (viommu_flags & VIOMMU_FLAG_BYPASS_RO) {
> >        container->bcontainer.bypass_ro = true;
> >    }
> >...
> >    if (viommu_flags & VIOMMU_FLAG_HW_NESTED) {
> >        flags |= IOMMU_HWPT_ALLOC_NEST_PARENT;
> >    }
> 
> IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17 is a VTD specific flag bit
> from host IOMMU, we have defined get_viommu_cap() to return pure
> vIOMMU capability bits, so no host IOMMU flag bit can be returned
> here. See patch2 commit log for the reason.

VIOMMU_FLAG_BYPASS_RO is a "pure" vIOMMU flag, not confined to
VTD. IOW, if some other vIOMMU has a similar issue, they can use
it as well. Since we define a "bypass_ro" in the core bcontainer
structure, it makes sense to have a core-level flag for it, v.s.
checking the vendor flag in the core.

My sample code is turning this get_viommu_cap to something like
get_viommu_flags, which could include both "cap" and "errata".
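
A rough sketch of what the vIOMMU side of such a get_viommu_flags op
could look like, purely as an illustration of the proposal: the demo_*
and Demo*/DEMO_* names are hypothetical, and DEMO_VTD_ERRATA_772415 is a
placeholder for IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17, not its real
value.

```c
#include <stdbool.h>
#include <stdint.h>

/* Generic, vendor-neutral flags visible to the core. */
#define DEMO_VIOMMU_FLAG_HW_NESTED (1ULL << 0)
#define DEMO_VIOMMU_FLAG_BYPASS_RO (1ULL << 1)

/* Placeholder for the vendor erratum bit in the host HW info. */
#define DEMO_VTD_ERRATA_772415 (1U << 0)

/* Hypothetical stand-in for the VT-d member of a VendorCaps wrapper. */
typedef struct DemoVtdCaps {
    uint32_t flags;
} DemoVtdCaps;

/* The vendor-specific check lives inside the vIOMMU: it translates its
 * own erratum bit into the generic BYPASS_RO flag, so the core never
 * has to interpret vendor data. */
static uint64_t demo_vtd_get_viommu_flags(bool flts, const DemoVtdCaps *caps)
{
    uint64_t flags = 0;

    if (flts) {
        flags |= DEMO_VIOMMU_FLAG_HW_NESTED;
    }
    if (caps && (caps->flags & DEMO_VTD_ERRATA_772415)) {
        flags |= DEMO_VIOMMU_FLAG_BYPASS_RO;
    }
    return flags;
}
```

Another vIOMMU with a similar issue could set the same generic flag from
its own vendor data.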

Nicolin



* Re: [PATCH v5 07/21] intel_iommu: Introduce a new structure VTDHostIOMMUDevice
  2025-08-22  6:40 ` [PATCH v5 07/21] intel_iommu: Introduce a new structure VTDHostIOMMUDevice Zhenzhong Duan
  2025-08-22 23:17   ` Nicolin Chen
@ 2025-08-26 17:21   ` Nicolin Chen
  2025-08-27  6:45     ` Duan, Zhenzhong
  2025-08-27 16:36     ` Eric Auger
  2025-08-27 11:14   ` Yi Liu
  2 siblings, 2 replies; 113+ messages in thread
From: Nicolin Chen @ 2025-08-26 17:21 UTC (permalink / raw)
  To: Zhenzhong Duan, yi.l.liu
  Cc: qemu-devel, alex.williamson, clg, eric.auger, mst, jasowang,
	peterx, ddutile, jgg, joao.m.martins, clement.mathieu--drif,
	kevin.tian, chao.p.peng

Hi Zhenzhong/Yi,

On Fri, Aug 22, 2025 at 02:40:45AM -0400, Zhenzhong Duan wrote:
> @@ -4371,6 +4374,7 @@ static bool vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
>                                       HostIOMMUDevice *hiod, Error **errp)
>  {
>      IntelIOMMUState *s = opaque;
> +    VTDHostIOMMUDevice *vtd_hiod;
>      struct vtd_as_key key = {
>          .bus = bus,
>          .devfn = devfn,

I wonder if the bus/devfn here would always reflect the actual BDF
numbers in this function, on an x86 VM.

With ARM, when the device is attached to a pxb bus, the bus/devfn
here are both 0, so PCI_BUILD_BDF() using these two returns 0 too.

QEMU command for the device:
 -device pxb-pcie,id=pcie.1,bus_nr=1,bus=pcie.0 \
 -device arm-smmuv3,primary-bus=pcie.1,id=smmuv3.1,accel=on \
 -device pcie-root-port,id=pcie.port1,bus=pcie.1,chassis=1,io-reserve=0 \
 -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.port1,rombar=0,id=dev0,iommufd=iommufd0

QEMU log:
smmuv3_accel_set_iommu_device: bus=0, devfn=0, sid=0

The set_iommu_device op is invoked by vfio_pci_realize() where the
BDF number won't get ready for this kind of PCI setup until a
later stage that I can't identify yet..

Given that VTD wants the BDF number too, I start to wonder whether
the set_iommu_device op is invoked in the right place or not..

Maybe VTD works because it saves the bus pointer v.s. bus_num(=0),
so its bus_num would be updated when later code calculates the BDF
number using the saved bus pointer (in the key). Nonetheless, the
saved devfn (in the key) is 0, which wouldn't be updated later as
the bus_num. So, if the device is supposed to have a devfn (!=0),
this wouldn't work?

Thanks
Nicolin



* RE: [PATCH v5 07/21] intel_iommu: Introduce a new structure VTDHostIOMMUDevice
  2025-08-26 17:21   ` Nicolin Chen
@ 2025-08-27  6:45     ` Duan, Zhenzhong
  2025-08-27  8:51       ` Nicolin Chen
  2025-08-27 16:36     ` Eric Auger
  1 sibling, 1 reply; 113+ messages in thread
From: Duan, Zhenzhong @ 2025-08-27  6:45 UTC (permalink / raw)
  To: Nicolin Chen, Liu, Yi L
  Cc: qemu-devel@nongnu.org, alex.williamson@redhat.com, clg@redhat.com,
	eric.auger@redhat.com, mst@redhat.com, jasowang@redhat.com,
	peterx@redhat.com, ddutile@redhat.com, jgg@nvidia.com,
	joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
	Tian,  Kevin, Peng, Chao P

Hi

>-----Original Message-----
>From: Nicolin Chen <nicolinc@nvidia.com>
>Subject: Re: [PATCH v5 07/21] intel_iommu: Introduce a new structure
>VTDHostIOMMUDevice
>
>Hi Zhenzhong/Yi,
>
>On Fri, Aug 22, 2025 at 02:40:45AM -0400, Zhenzhong Duan wrote:
>> @@ -4371,6 +4374,7 @@ static bool vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
>>                                       HostIOMMUDevice *hiod, Error **errp)
>>  {
>>      IntelIOMMUState *s = opaque;
>> +    VTDHostIOMMUDevice *vtd_hiod;
>>      struct vtd_as_key key = {
>>          .bus = bus,
>>          .devfn = devfn,
>
>I wonder if the bus/devfn here would always reflect the actual BDF
>numbers in this function, on an x86 VM.

devfn is assigned by QEMU (see do_pci_register_device()), while the bus number is enumerated by the BIOS or the guest kernel.
So we can't use the BDF number as the key; we use the PCIBus pointer + devfn as the key instead.
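
A minimal sketch of why a pointer-based key stays valid across bus
enumeration, assuming simplified stand-in types. FakeBus, DevKey,
dev_key_equal() and dev_key_hash() are hypothetical names used only
for illustration; they are not QEMU's vtd_as_key helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a PCI bus object; bus_num is assigned late. */
typedef struct FakeBus {
    uint8_t bus_num;   /* stays 0 until firmware or the kernel enumerates it */
} FakeBus;

/* Lookup key modelled on the "bus pointer + devfn" pair described above. */
typedef struct DevKey {
    const FakeBus *bus;
    uint8_t devfn;
} DevKey;

/* Two keys match when they name the same bus object and the same devfn. */
static bool dev_key_equal(const DevKey *a, const DevKey *b)
{
    assert(a && b);
    return a->bus == b->bus && a->devfn == b->devfn;
}

/* Simple hash over the pointer value and devfn; it never reads bus_num. */
static uint32_t dev_key_hash(const DevKey *k)
{
    uintptr_t p = (uintptr_t)k->bus;

    return (uint32_t)(p ^ (p >> 16)) * 31u + k->devfn;
}
```

Since neither helper reads bus_num, a key built during realize() still
matches after the guest assigns the bus number.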

>
>With ARM, when the device is attached to a pxb bus, the bus/devfn
>here are both 0, so PCI_BUILD_BDF() using these two returns 0 too.
>
>QEMU command for the device:
> -device pxb-pcie,id=pcie.1,bus_nr=1,bus=pcie.0 \
> -device arm-smmuv3,primary-bus=pcie.1,id=smmuv3.1,accel=on \
> -device pcie-root-port,id=pcie.port1,bus=pcie.1,chassis=1,io-reserve=0 \
> -device
>vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.port1,rombar=0,id=dev0,iom
>mufd=iommufd0
>
>QEMU log:
>smmuv3_accel_set_iommu_device: bus=0, devfn=0, sid=0

There is only one device under pcie.port1, so its devfn is initialized to 0; the bus number isn't enumerated yet during realize(), so it is 0 as well.

>
>The set_iommu_device op is invoked by vfio_pci_realize() where the
>the BDF number won't get ready for this kind of PCI setup until a
>later stage that I can't identify yet..
>
>Given that VTD wants the BDF number too, I start to wonder whether
>the set_iommu_device op is invoked in the right place or not..
>
>Maybe VTD works because it saves the bus pointer v.s. bus_num(=0),
>so its bus_num would be updated when later code calculates the BDF
>number using the saved bus pointer (in the key). Nonetheless, the
>saved devfn (in the key) is 0, which wouldn't be updated later as
>the bus_num. So, if the device is supposed to have a devfn (!=0),
>this wouldn't work?

Both the PCIBus pointer and devfn are fixed values for a QEMU instance; they never change.

Thanks
Zhenzhong



* RE: [PATCH v5 20/21] Workaround for ERRATA_772415_SPR17
  2025-08-25 16:58       ` Nicolin Chen
@ 2025-08-27  7:11         ` Duan, Zhenzhong
  2025-08-27  8:42           ` Nicolin Chen
  0 siblings, 1 reply; 113+ messages in thread
From: Duan, Zhenzhong @ 2025-08-27  7:11 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: qemu-devel@nongnu.org, alex.williamson@redhat.com, clg@redhat.com,
	eric.auger@redhat.com, mst@redhat.com, jasowang@redhat.com,
	peterx@redhat.com, ddutile@redhat.com, jgg@nvidia.com,
	joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
	Tian,  Kevin, Liu, Yi L, Peng, Chao P



>-----Original Message-----
>From: Nicolin Chen <nicolinc@nvidia.com>
>Subject: Re: [PATCH v5 20/21] Workaround for ERRATA_772415_SPR17
>
>On Mon, Aug 25, 2025 at 09:21:48AM +0000, Duan, Zhenzhong wrote:
>>
>>
>> >-----Original Message-----
>> >From: Nicolin Chen <nicolinc@nvidia.com>
>> >Subject: Re: [PATCH v5 20/21] Workaround for ERRATA_772415_SPR17
>> >
>> >On Fri, Aug 22, 2025 at 02:40:58AM -0400, Zhenzhong Duan wrote:
>> >> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>> >> index e503c232e1..59735e878c 100644
>> >> --- a/hw/vfio/iommufd.c
>> >> +++ b/hw/vfio/iommufd.c
>> >> @@ -324,6 +324,7 @@ static bool
>> >iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
>> >>  {
>> >>      ERRP_GUARD();
>> >>      IOMMUFDBackend *iommufd = vbasedev->iommufd;
>> >> +    struct iommu_hw_info_vtd vtd;
>> >
>> >VendorCaps vendor_caps;
>> >
>> >>      uint32_t type, flags = 0;
>> >>      uint64_t hw_caps;
>> >>      VFIOIOASHwpt *hwpt;
>> >> @@ -371,10 +372,15 @@ static bool
>> >iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
>> >>       * instead.
>> >>       */
>> >>      if (!iommufd_backend_get_device_info(vbasedev->iommufd,
>> >vbasedev->devid,
>> >> -                                         &type, NULL, 0,
>> >&hw_caps, errp)) {
>> >> +                                         &type, &vtd,
>sizeof(vtd),
>> >&hw_caps,
>> >
>> >s/vtd/vendor_caps/g
>> >
>> >> +                                         errp)) {
>> >>          return false;
>> >>      }
>> >>
>> >> +    if (vtd.flags & IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17) {
>> >> +        container->bcontainer.bypass_ro = true;
>> >
>> >This circled back to checking a vendor specific flag in the core..
>>
>> I'm not sure if VendorCaps struct wrapper is overprogramming as this
>> ERRARA is only VTD specific. We still need to check VendorCaps.vtd.flags
>bit.
>
>Look, the HW_INFO call is done by the core.
>
>Then, the core needs:
>  1 HW caps for dirty tracking and PASID
>  2 IOMMU_HWPT_ALLOC_NEST_PARENT (vIOMMU cap)
>  3 bcontainer.bypass_ro (vIOMMU workaround)

Why a vIOMMU workaround? The erratum comes from the host IOMMU. In a heterogeneous environment, some host IOMMUs can have this erratum while other, newer IOMMUs do not.

IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17 may only exist on old IOMMUs with nesting capability. The vIOMMU doesn't support nesting emulation yet, and it also makes no sense to emulate an erratum in the vIOMMU.

>
>Both 2 and 3 need to get from vIOMMU, while 3 needs VendorCaps.
>Arguably 2 could do a bit validation using the VendorCaps too.
>
>> >Perhaps we could upgrade the get_viommu_cap op and its API:
>> >
>> >enum viommu_flags {
>> >    VIOMMU_FLAG_HW_NESTED = BIT_ULL(0),
>> >    VIOMMU_FLAG_BYPASS_RO = BIT_ULL(1),
>> >};
>> >
>> >bool vfio_device_get_viommu_flags(VFIODevice *vbasedev, VendorCaps
>> >*vendor_caps,
>> >                                  uint64_t *viommu_flags);
>> >
>> >Then:
>> >    if (viommu_flags & VIOMMU_FLAG_BYPASS_RO) {
>> >        container->bcontainer.bypass_ro = true;
>> >    }
>> >...
>> >    if (viommu_flags & VIOMMU_FLAG_HW_NESTED) {
>> >        flags |= IOMMU_HWPT_ALLOC_NEST_PARENT;
>> >    }
>>
>> IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17 is a VTD specific flag bit
>> from host IOMMU, we have defined get_viommu_cap() to return pure
>> vIOMMU capability bits, so no host IOMMU flag bit can be returned
>> here. See patch2 commit log for the reason.
>
>VIOMMU_FLAG_BYPASS_RO is a "pure" vIOMMU flag, not confined to
>VTD. IOW, if some other vIOMMU has a similar issue, they can use
>it as well. Since we define a "bypass_ro" in the core bcontainer
>structure, it makes sense to have a core-level flag for it, v.s.
>checking the vendor flag in the core.

It's not a vIOMMU flag but a host IOMMU flag, unless the vIOMMU wants to emulate that erratum.

Due to patch9, there is only one VFIO device under a container, so bypass_ro is set based on the IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17 flag of the host IOMMU backing that VFIO device.
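
A rough sketch of that per-container decision, under the assumption
that each container holds exactly one device. HostInfo, Container and
container_apply_host_info() are hypothetical, and the
SKETCH_VTD_ERRATA_772415_SPR17 bit is a local stand-in for the host
flag rather than the uAPI definition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local stand-in for the host erratum bit; the value is an assumption. */
#define SKETCH_VTD_ERRATA_772415_SPR17 (1u << 0)

/* Hypothetical view of what the host IOMMU behind one device reported. */
typedef struct HostInfo {
    uint32_t vtd_flags;
} HostInfo;

/* Hypothetical container that holds exactly one VFIO device. */
typedef struct Container {
    bool bypass_ro;   /* skip read-only mappings when the erratum is present */
} Container;

/* Derive bypass_ro from the host flags of the single device it contains. */
static void container_apply_host_info(Container *c, const HostInfo *info)
{
    assert(c && info);
    c->bypass_ro = (info->vtd_flags & SKETCH_VTD_ERRATA_772415_SPR17) != 0;
}
```

With one device per container, two devices behind different host
IOMMUs end up with independent bypass_ro settings.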

Thanks
Zhenzhong

>
>My sample code is turning this get_viommu_cap to something like
>get_viommu_flags, which could include both "cap" and "errata".
>
>Nicolin



* Re: [PATCH v5 20/21] Workaround for ERRATA_772415_SPR17
  2025-08-27  7:11         ` Duan, Zhenzhong
@ 2025-08-27  8:42           ` Nicolin Chen
  0 siblings, 0 replies; 113+ messages in thread
From: Nicolin Chen @ 2025-08-27  8:42 UTC (permalink / raw)
  To: Duan, Zhenzhong
  Cc: qemu-devel@nongnu.org, alex.williamson@redhat.com, clg@redhat.com,
	eric.auger@redhat.com, mst@redhat.com, jasowang@redhat.com,
	peterx@redhat.com, ddutile@redhat.com, jgg@nvidia.com,
	joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
	Tian,  Kevin, Liu, Yi L, Peng, Chao P

On Wed, Aug 27, 2025 at 07:11:54AM +0000, Duan, Zhenzhong wrote:
> >> >> +    if (vtd.flags & IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17) {
> >> >> +        container->bcontainer.bypass_ro = true;
> >> >
> >> >This circled back to checking a vendor specific flag in the core..
> >>
> >> I'm not sure if VendorCaps struct wrapper is overprogramming as this
> >> ERRARA is only VTD specific. We still need to check VendorCaps.vtd.flags
> >bit.
> >
> >Look, the HW_INFO call is done by the core.
> >
> >Then, the core needs:
> >  1 HW caps for dirty tracking and PASID
> >  2 IOMMU_HWPT_ALLOC_NEST_PARENT (vIOMMU cap)
> >  3 bcontainer.bypass_ro (vIOMMU workaround)
> 
> Why vIOMMU workaround? ERRATA is from host IOMMU.
> In a heterogeneous environment, some host IOMMUs can have
> this ERRATA while other newer IOMMUs not.

To be fair, the subject of your patch is "Workaround". Though it
might be inaccurate to call it "vIOMMU Workaround", the idea was
to let vendor code decode vendor bits and flags.

Arguably, when a host IOMMU has an erratum while requiring a HW
acceleration like nesting, vIOMMU can be a partner to help apply
the workaround.

> IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17 may only exist on old IOMMUs
> with nesting capability, vIOMMU doesn't support nesting emulation yet,
> it's also no sense to emulate an ERRATA in vIOMMU.

Certainly, I never suggested that we "emulate an ERRATA".

> >Both 2 and 3 need to get from vIOMMU, while 3 needs VendorCaps.
> >Arguably 2 could do a bit validation using the VendorCaps too.
> >
> >> >Perhaps we could upgrade the get_viommu_cap op and its API:
> >> >
> >> >enum viommu_flags {
> >> >    VIOMMU_FLAG_HW_NESTED = BIT_ULL(0),
> >> >    VIOMMU_FLAG_BYPASS_RO = BIT_ULL(1),
> >> >};
> >> >
> >> >bool vfio_device_get_viommu_flags(VFIODevice *vbasedev, VendorCaps
> >> >*vendor_caps,
> >> >                                  uint64_t *viommu_flags);
> >> >
> >> >Then:
> >> >    if (viommu_flags & VIOMMU_FLAG_BYPASS_RO) {
> >> >        container->bcontainer.bypass_ro = true;
> >> >    }
> >> >...
> >> >    if (viommu_flags & VIOMMU_FLAG_HW_NESTED) {
> >> >        flags |= IOMMU_HWPT_ALLOC_NEST_PARENT;
> >> >    }
> >>
> >> IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17 is a VTD specific flag bit
> >> from host IOMMU, we have defined get_viommu_cap() to return pure
> >> vIOMMU capability bits, so no host IOMMU flag bit can be returned
> >> here. See patch2 commit log for the reason.
> >
> >VIOMMU_FLAG_BYPASS_RO is a "pure" vIOMMU flag, not confined to
> >VTD. IOW, if some other vIOMMU has a similar issue, they can use
> >it as well. Since we define a "bypass_ro" in the core bcontainer
> >structure, it makes sense to have a core-level flag for it, v.s.
> >checking the vendor flag in the core.
> 
> It's not a vIOMMU flag but host IOMMU flag except vIOMMU want to emulate that ERRATA.

Again, the idea here is not to blame vIOMMU for every flag, nor
to emulate the erratum, but to use vIOMMU as a vendor specific
place to translate a vendor specific flag (either host IOMMU or
vIOMMU) to something that QEMU core code can understand.

The rule of thumb is to avoid checking vendor flags in the core,
which both Eric and I have noted a couple of times..

It's okay that you don't like the way of grouping the host flag
with vIOMMU cap. Please find some other place in the vendor zone
to load the vendor flag? Maybe add another op/API to decode the
host IOMMU flags/caps exclusively?
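
One possible shape for such an op, as a sketch only. VendorOps,
decode_host_flags, CORE_FLAG_BYPASS_RO and core_get_flags() are all
hypothetical names, and SKETCH_VTD_ERRATA_772415_SPR17 is a local
stand-in for the raw host bit; none of this is an existing QEMU API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical vendor-neutral flags that the core code would consume. */
enum {
    CORE_FLAG_BYPASS_RO = 1u << 0,
};

/* Local stand-in for the raw, vendor-specific host IOMMU bit. */
#define SKETCH_VTD_ERRATA_772415_SPR17 (1u << 0)

/* Hypothetical vendor hook: translate raw host flags into core flags. */
typedef struct VendorOps {
    uint64_t (*decode_host_flags)(uint32_t raw_flags);
} VendorOps;

/* Vendor side: the only place that knows what the raw bit means. */
static uint64_t vtd_decode_host_flags(uint32_t raw_flags)
{
    uint64_t core_flags = 0;

    if (raw_flags & SKETCH_VTD_ERRATA_772415_SPR17) {
        core_flags |= CORE_FLAG_BYPASS_RO;
    }
    return core_flags;
}

static const VendorOps vtd_ops = {
    .decode_host_flags = vtd_decode_host_flags,
};

/* Core side: never inspects vendor bits; a missing op means "no flags". */
static uint64_t core_get_flags(const VendorOps *ops, uint32_t raw_flags)
{
    assert(ops);
    return ops->decode_host_flags ? ops->decode_host_flags(raw_flags) : 0;
}
```

The core would then test only CORE_FLAG_BYPASS_RO, and a vendor
without such an erratum simply leaves the op unset.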

Nicolin



* Re: [PATCH v5 07/21] intel_iommu: Introduce a new structure VTDHostIOMMUDevice
  2025-08-27  6:45     ` Duan, Zhenzhong
@ 2025-08-27  8:51       ` Nicolin Chen
  0 siblings, 0 replies; 113+ messages in thread
From: Nicolin Chen @ 2025-08-27  8:51 UTC (permalink / raw)
  To: Duan, Zhenzhong
  Cc: Liu, Yi L, qemu-devel@nongnu.org, alex.williamson@redhat.com,
	clg@redhat.com, eric.auger@redhat.com, mst@redhat.com,
	jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
	jgg@nvidia.com, joao.m.martins@oracle.com,
	clement.mathieu--drif@eviden.com, Tian,  Kevin, Peng, Chao P

On Wed, Aug 27, 2025 at 06:45:42AM +0000, Duan, Zhenzhong wrote:
> Hi
> 
> >-----Original Message-----
> >From: Nicolin Chen <nicolinc@nvidia.com>
> >Subject: Re: [PATCH v5 07/21] intel_iommu: Introduce a new structure
> >VTDHostIOMMUDevice
> >
> >Hi Zhenzhong/Yi,
> >
> >On Fri, Aug 22, 2025 at 02:40:45AM -0400, Zhenzhong Duan wrote:
> >> @@ -4371,6 +4374,7 @@ static bool vtd_dev_set_iommu_device(PCIBus
> >*bus, void *opaque, int devfn,
> >>                                       HostIOMMUDevice *hiod,
> >Error **errp)
> >>  {
> >>      IntelIOMMUState *s = opaque;
> >> +    VTDHostIOMMUDevice *vtd_hiod;
> >>      struct vtd_as_key key = {
> >>          .bus = bus,
> >>          .devfn = devfn,
> >
> >I wonder if the bus/devfn here would always reflect the actual BDF
> >numbers in this function, on an x86 VM.
> 
> devfn is enumerated by QEMU, see do_pci_register_device(),

Oh, thanks for the direction.

> bus number is enumerated in BIOS or kernel.
> So we can't use BDF number as key, we use PCIBus pointer + devfn
> as the key instead.

Yea, I figured that out.

> >With ARM, when the device is attached to a pxb bus, the bus/devfn
> >here are both 0, so PCI_BUILD_BDF() using these two returns 0 too.
> >
> >QEMU command for the device:
> > -device pxb-pcie,id=pcie.1,bus_nr=1,bus=pcie.0 \
> > -device arm-smmuv3,primary-bus=pcie.1,id=smmuv3.1,accel=on \
> > -device pcie-root-port,id=pcie.port1,bus=pcie.1,chassis=1,io-reserve=0 \
> > -device
> >vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.port1,rombar=0,id=dev0,iom
> >mufd=iommufd0
> >
> >QEMU log:
> >smmuv3_accel_set_iommu_device: bus=0, devfn=0, sid=0
> 
> There is only one device under pcie.port1, devfn is initialized to 0,
> bus number isn't enumerated yet during realize() so 0.

That's a pain for ARM... It needs to set the BDF number early for
some use cases. Shameer's current solution does that after the guest
kernel boots, which is very late. So we might want to move it earlier..

So, ideally we'd have the BDF in the set_iommu_device callback.
Otherwise, we'd have to add something like a set_iommu_vdevice op
that is invoked in the PCI core.

> >The set_iommu_device op is invoked by vfio_pci_realize() where the
> >the BDF number won't get ready for this kind of PCI setup until a
> >later stage that I can't identify yet..
> >
> >Given that VTD wants the BDF number too, I start to wonder whether
> >the set_iommu_device op is invoked in the right place or not..
> >
> >Maybe VTD works because it saves the bus pointer v.s. bus_num(=0),
> >so its bus_num would be updated when later code calculates the BDF
> >number using the saved bus pointer (in the key). Nonetheless, the
> >saved devfn (in the key) is 0, which wouldn't be updated later as
> >the bus_num. So, if the device is supposed to have a devfn (!=0),
> >this wouldn't work?
> 
> Both PCIBus pointer and devfn are fixed value for a QEMU instance,
> never changed.

I see. devfn wouldn't be changed. Only the bus_num will be updated
in the later stage. So, it's not a problem for Intel.
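
A small sketch of that late-binding behaviour, assuming stand-in
types. FakeBus, SavedDev, make_devfn() and saved_dev_bdf() are
hypothetical; the bit layout follows the usual PCI convention (bus in
bits 15:8, slot in bits 7:3, function in bits 2:0).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a PCI bus whose number is assigned late. */
typedef struct FakeBus {
    uint8_t bus_num;
} FakeBus;

/* What this sketch saves at set_iommu_device time. */
typedef struct SavedDev {
    const FakeBus *bus;  /* a pointer, so later bus_num updates are visible */
    uint8_t devfn;       /* fixed once the device has been registered */
} SavedDev;

/* devfn layout: slot in bits 7:3, function in bits 2:0. */
static uint8_t make_devfn(uint8_t slot, uint8_t func)
{
    return (uint8_t)(((slot & 0x1f) << 3) | (func & 0x07));
}

/* Compute the BDF on demand instead of caching it at realize time. */
static uint16_t saved_dev_bdf(const SavedDev *dev)
{
    assert(dev && dev->bus);
    return (uint16_t)((dev->bus->bus_num << 8) | dev->devfn);
}
```

Reading bus_num through the pointer at lookup time is what lets the
BDF pick up the enumerated bus number, while devfn never has to change.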

Thanks
Nicolin



* Re: [PATCH v5 00/21] intel_iommu: Enable stage-1 translation for passthrough device
  2025-08-22  6:40 [PATCH v5 00/21] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
                   ` (20 preceding siblings ...)
  2025-08-22  6:40 ` [PATCH v5 21/21] intel_iommu: Enable host device when x-flts=on in scalable mode Zhenzhong Duan
@ 2025-08-27 11:13 ` Yi Liu
  2025-08-28  5:53   ` Duan, Zhenzhong
  21 siblings, 1 reply; 113+ messages in thread
From: Yi Liu @ 2025-08-27 11:13 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, joao.m.martins, clement.mathieu--drif, kevin.tian,
	chao.p.peng

On 2025/8/22 14:40, Zhenzhong Duan wrote:
> Hi,
> 
> For passthrough device with intel_iommu.x-flts=on, we don't do shadowing of
> guest page table for passthrough device but pass stage-1 page table to host
> side to construct a nested domain. There was some effort to enable this feature
> in old days, see [1] for details.
> 
> The key design is to utilize the dual-stage IOMMU translation (also known as
> IOMMU nested translation) capability in host IOMMU. As the below diagram shows,
> guest I/O page table pointer in GPA (guest physical address) is passed to host
> and be used to perform the stage-1 address translation. Along with it,
> modifications to present mappings in the guest I/O page table should be followed
> with an IOTLB invalidation.
> 
>          .-------------.  .---------------------------.
>          |   vIOMMU    |  | Guest I/O page table      |
>          |             |  '---------------------------'
>          .----------------/
>          | PASID Entry |--- PASID cache flush --+
>          '-------------'                        |
>          |             |                        V
>          |             |           I/O page table pointer in GPA
>          '-------------'
>      Guest
>      ------| Shadow |---------------------------|--------
>            v        v                           v
>      Host
>          .-------------.  .------------------------.
>          |   pIOMMU    |  | Stage1 for GIOVA->GPA  |
>          |             |  '------------------------'
>          .----------------/  |
>          | PASID Entry |     V (Nested xlate)
>          '----------------\.--------------------------------------.
>          |             |   | Stage2 for GPA->HPA, unmanaged domain|
>          |             |   '--------------------------------------'
>          '-------------'
> For history reason, there are different namings in different VTD spec rev,
> Where:
>   - Stage1 = First stage = First level = flts
>   - Stage2 = Second stage = Second level = slts
> <Intel VT-d Nested translation>
> 
> This series reuse VFIO device's default hwpt as nested parent instead of
> creating new one. This way avoids duplicate code of a new memory listener,
> all existing feature from VFIO listener can be shared, e.g., ram discard,
> dirty tracking, etc. Two limitations are: 1) not supporting VFIO device
> under a PCI bridge with emulated device, because emulated device wants
> IOMMU AS and VFIO device stick to system AS;

should we document it somewhere?

> 2) not supporting kexec or
> reboot from "intel_iommu=on,sm_on" to "intel_iommu=on,sm_off", because
> VFIO device's default hwpt is created with NEST_PARENT flag, kernel
> inhibit RO mappings when switch to shadow mode.

How does the guest know about this limitation and hold off on such attempts?

> 
> This series is also a prerequisite work for vSVA, i.e. Sharing guest
> application address space with passthrough devices.
> 
> There are some interactions between VFIO and vIOMMU
> * vIOMMU registers PCIIOMMUOps [set|unset]_iommu_device to PCI
>    subsystem. VFIO calls them to register/unregister HostIOMMUDevice
>    instance to vIOMMU at vfio device realize stage.
> * vIOMMU registers PCIIOMMUOps get_viommu_cap to PCI subsystem.
>    VFIO calls it to get vIOMMU exposed capabilities.
> * vIOMMU calls HostIOMMUDeviceIOMMUFD interface [at|de]tach_hwpt
>    to bind/unbind device to IOMMUFD backed domains, either nested
>    domain or not.
> 
> See below diagram:
> 
>          VFIO Device                                 Intel IOMMU
>      .-----------------.                         .-------------------.
>      |                 |                         |                   |
>      |       .---------|PCIIOMMUOps              |.-------------.    |
>      |       | IOMMUFD |(set/unset_iommu_device) || Host IOMMU  |    |
>      |       | Device  |------------------------>|| Device list |    |
>      |       .---------|(get_viommu_cap)         |.-------------.    |
>      |                 |                         |       |           |
>      |                 |                         |       V           |
>      |       .---------|  HostIOMMUDeviceIOMMUFD |  .-------------.  |
>      |       | IOMMUFD |            (attach_hwpt)|  | Host IOMMU  |  |
>      |       | link    |<------------------------|  |   Device    |  |
>      |       .---------|            (detach_hwpt)|  .-------------.  |
>      |                 |                         |       |           |
>      |                 |                         |       ...         |
>      .-----------------.                         .-------------------.
> 
> Below is an example to enable stage-1 translation for passthrough device:
> 
>      -M q35,...
>      -device intel-iommu,x-scalable-mode=on,x-flts=on...
>      -object iommufd,id=iommufd0 -device vfio-pci,iommufd=iommufd0,...
> 
> Test done:
> - VFIO devices hotplug/unplug
> - different VFIO devices linked to different iommufds
> - vhost net device ping test
> 
> PATCH1-7:  Some preparing work
> PATCH8-9:  Compatibility check between vIOMMU and Host IOMMU
> PATCH10-18:Implement stage-1 page table for passthrough device
> PATCH19-20:Workaround for ERRATA_772415_SPR17
> PATCH21:   Enable stage-1 translation for passthrough device
> 
> Qemu code can be found at [2]
> 
> Fault report isn't supported in this series, we presume guest kernel always
> construct correct stage1 page table for passthrough device. For emulated
> devices, the emulation code already provided stage1 fault injection.

Just call out that this series is limited to gIOVA usage so far; vSVA
comes later. :)

Regards,
Yi Liu



* Re: [PATCH v5 02/21] hw/pci: Introduce pci_device_get_viommu_cap()
  2025-08-22  6:40 ` [PATCH v5 02/21] hw/pci: Introduce pci_device_get_viommu_cap() Zhenzhong Duan
  2025-08-22 22:22   ` Nicolin Chen
@ 2025-08-27 11:13   ` Yi Liu
  2025-08-27 11:22     ` Eric Auger
  1 sibling, 1 reply; 113+ messages in thread
From: Yi Liu @ 2025-08-27 11:13 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, joao.m.martins, clement.mathieu--drif, kevin.tian,
	chao.p.peng

On 2025/8/22 14:40, Zhenzhong Duan wrote:
> Introduce a new PCIIOMMUOps optional callback, get_viommu_cap() which
> allows to retrieve capabilities exposed by a vIOMMU. The first planned
> vIOMMU device capability is VIOMMU_CAP_HW_NESTED that advertises the
> support of HW nested stage translation scheme. pci_device_get_viommu_cap
> is a wrapper that can be called on a PCI device potentially protected by
> a vIOMMU.
> 
> get_viommu_cap() is designed to return 64bit bitmap of purely emulated
> capabilities which are only determined by user's configuration, no host
> capabilities involved. Reasons are:
> 
> 1. host may has heterogeneous IOMMUs, each with different capabilities
> 2. this is migration friendly, return value is consistent between source
>     and target.
> 3. host IOMMU capabilities are passed to vIOMMU through set_iommu_device()
>     interface which have to be after attach_device(), when get_viommu_cap()
>     is called in attach_device(), there is no way for vIOMMU to get host
>     IOMMU capabilities yet, so only emulated capabilities can be returned.
>     See below sequence:
> 
>       vfio_device_attach():
>           iommufd_cdev_attach():
>               pci_device_get_viommu_cap() for HW nesting cap
>               create a nesting parent hwpt
>               attach device to the hwpt
>               vfio_device_hiod_create_and_realize() creating hiod
>       ...
>       pci_device_set_iommu_device(hiod)

> Suggested-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>   MAINTAINERS          |  1 +
>   include/hw/iommu.h   | 19 +++++++++++++++++++
>   include/hw/pci/pci.h | 25 +++++++++++++++++++++++++
>   hw/pci/pci.c         | 11 +++++++++++
>   4 files changed, 56 insertions(+)
>   create mode 100644 include/hw/iommu.h
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index a07086ed76..54fb878128 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -2305,6 +2305,7 @@ F: include/system/iommufd.h
>   F: backends/host_iommu_device.c
>   F: include/system/host_iommu_device.h
>   F: include/qemu/chardev_open.h
> +F: include/hw/iommu.h
>   F: util/chardev_open.c
>   F: docs/devel/vfio-iommufd.rst
>   
> diff --git a/include/hw/iommu.h b/include/hw/iommu.h
> new file mode 100644
> index 0000000000..7dd0c11b16
> --- /dev/null
> +++ b/include/hw/iommu.h
> @@ -0,0 +1,19 @@
> +/*
> + * General vIOMMU capabilities, flags, etc
> + *
> + * Copyright (C) 2025 Intel Corporation.
> + *
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + */
> +
> +#ifndef HW_IOMMU_H
> +#define HW_IOMMU_H
> +
> +#include "qemu/bitops.h"
> +
> +enum {
> +    /* hardware nested stage-1 page table support */
> +    VIOMMU_CAP_HW_NESTED = BIT_ULL(0),

This naming is a bit confusing. get_viommu_cap indicates it will return
the vIOMMU's capability, while this one is named HW_NESTED. That conflicts
with the commit message, which claims only emulated capabilities will be
returned.

TBH, I'm hesitant to name it get_viommu_cap. The scope is a little
larger than what we want so far, so I'm wondering if it can be done in a
more straightforward way, e.g. just a bool op named
iommu_nested_wanted(). That's just an example; there may be a better
name. We can extend the op to return a u64 value in the future when we
see another request from the vIOMMU to VFIO.
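
A sketch of what such a bool op could look like, with every name
hypothetical (SketchIOMMUOps, SketchVIOMMU, sketch_vtd_nested_wanted(),
sketch_device_nested_wanted()). It assumes the answer depends only on
user configuration, here modelled as scalable mode plus x-flts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical ops table with a single yes/no question for the vIOMMU. */
typedef struct SketchIOMMUOps {
    bool (*nested_wanted)(void *opaque);
} SketchIOMMUOps;

/* Hypothetical vIOMMU state holding only user-configured properties. */
typedef struct SketchVIOMMU {
    bool scalable_mode;   /* stands for x-scalable-mode=on */
    bool x_flts;          /* stands for x-flts=on */
} SketchVIOMMU;

/* vIOMMU side: the answer depends on configuration, not on host caps. */
static bool sketch_vtd_nested_wanted(void *opaque)
{
    const SketchVIOMMU *s = opaque;

    assert(s);
    return s->scalable_mode && s->x_flts;
}

/* Caller side: no vIOMMU, or a vIOMMU without the op, means "not wanted". */
static bool sketch_device_nested_wanted(const SketchIOMMUOps *ops, void *opaque)
{
    if (!ops || !ops->nested_wanted) {
        return false;
    }
    return ops->nested_wanted(opaque);
}
```

Widening the return type to a u64 bitmap later would only change the
op signature and the wrapper, not the call sites that test one answer.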

Regards,
Yi Liu



* Re: [PATCH v5 04/21] vfio: Introduce helper vfio_pci_from_vfio_device()
  2025-08-22  6:40 ` [PATCH v5 04/21] vfio: Introduce helper vfio_pci_from_vfio_device() Zhenzhong Duan
  2025-08-22 22:40   ` Nicolin Chen via
@ 2025-08-27 11:13   ` Yi Liu
  2025-08-27 11:34   ` Eric Auger
  2025-09-01 16:36   ` Cédric Le Goater
  3 siblings, 0 replies; 113+ messages in thread
From: Yi Liu @ 2025-08-27 11:13 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, joao.m.martins, clement.mathieu--drif, kevin.tian,
	chao.p.peng

On 2025/8/22 14:40, Zhenzhong Duan wrote:
> Introduce helper vfio_pci_from_vfio_device() to transform from VFIODevice
> to VFIOPCIDevice, also to hide low level VFIO_DEVICE_TYPE_PCI type check.
> 
> Suggested-by: Cédric Le Goater <clg@redhat.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> Reviewed-by: Cédric Le Goater <clg@redhat.com>
> Link: https://lore.kernel.org/qemu-devel/20250801023533.1458644-1-zhenzhong.duan@intel.com
> [ clg: Added documentation ]
> Signed-off-by: Cédric Le Goater <clg@redhat.com>
> ---
>   hw/vfio/pci.h       | 12 ++++++++++++
>   hw/vfio/container.c |  4 ++--
>   hw/vfio/device.c    |  2 +-
>   hw/vfio/iommufd.c   |  4 ++--
>   hw/vfio/listener.c  |  4 ++--
>   hw/vfio/pci.c       |  9 +++++++++
>   6 files changed, 28 insertions(+), 7 deletions(-)

Reviewed-by: Yi Liu <yi.l.liu@intel.com>



* Re: [PATCH v5 06/21] hw/pci: Export pci_device_get_iommu_bus_devfn() and return bool
  2025-08-22  6:40 ` [PATCH v5 06/21] hw/pci: Export pci_device_get_iommu_bus_devfn() and return bool Zhenzhong Duan
  2025-08-22 23:13   ` Nicolin Chen
@ 2025-08-27 11:14   ` Yi Liu
  1 sibling, 0 replies; 113+ messages in thread
From: Yi Liu @ 2025-08-27 11:14 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, joao.m.martins, clement.mathieu--drif, kevin.tian,
	chao.p.peng

On 2025/8/22 14:40, Zhenzhong Duan wrote:
> Returns true if PCI device is aliased or false otherwise. This will be
> used in following patch to determine if a PCI device is under a PCI
> bridge.
> 
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> Reviewed-by: Eric Auger <eric.auger@redhat.com>
> ---
>   include/hw/pci/pci.h |  2 ++
>   hw/pci/pci.c         | 12 ++++++++----
>   2 files changed, 10 insertions(+), 4 deletions(-)


Reviewed-by: Yi Liu <yi.l.liu@intel.com>



* Re: [PATCH v5 07/21] intel_iommu: Introduce a new structure VTDHostIOMMUDevice
  2025-08-22  6:40 ` [PATCH v5 07/21] intel_iommu: Introduce a new structure VTDHostIOMMUDevice Zhenzhong Duan
  2025-08-22 23:17   ` Nicolin Chen
  2025-08-26 17:21   ` Nicolin Chen
@ 2025-08-27 11:14   ` Yi Liu
  2025-08-28  9:17     ` Duan, Zhenzhong
  2 siblings, 1 reply; 113+ messages in thread
From: Yi Liu @ 2025-08-27 11:14 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, joao.m.martins, clement.mathieu--drif, kevin.tian,
	chao.p.peng

On 2025/8/22 14:40, Zhenzhong Duan wrote:
> Introduce a new structure VTDHostIOMMUDevice which replaces
> HostIOMMUDevice to be stored in hash table.
> 
> It includes a reference to HostIOMMUDevice and IntelIOMMUState,
> also includes BDF information which will be used in future
> patches.
> 
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> Reviewed-by: Eric Auger <eric.auger@redhat.com>
> ---
>   hw/i386/intel_iommu_internal.h |  7 +++++++
>   include/hw/i386/intel_iommu.h  |  2 +-
>   hw/i386/intel_iommu.c          | 15 +++++++++++++--
>   3 files changed, 21 insertions(+), 3 deletions(-)
> 
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> index 360e937989..c7046eb4e2 100644
> --- a/hw/i386/intel_iommu_internal.h
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -28,6 +28,7 @@
>   #ifndef HW_I386_INTEL_IOMMU_INTERNAL_H
>   #define HW_I386_INTEL_IOMMU_INTERNAL_H
>   #include "hw/i386/intel_iommu.h"
> +#include "system/host_iommu_device.h"
>   
>   /*
>    * Intel IOMMU register specification
> @@ -608,4 +609,10 @@ typedef struct VTDRootEntry VTDRootEntry;
>   /* Bits to decide the offset for each level */
>   #define VTD_LEVEL_BITS           9
>   
> +typedef struct VTDHostIOMMUDevice {
> +    IntelIOMMUState *iommu_state;
> +    PCIBus *bus;
> +    uint8_t devfn;
> +    HostIOMMUDevice *hiod;
> +} VTDHostIOMMUDevice;
>   #endif
> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> index e95477e855..50f9b27a45 100644
> --- a/include/hw/i386/intel_iommu.h
> +++ b/include/hw/i386/intel_iommu.h
> @@ -295,7 +295,7 @@ struct IntelIOMMUState {
>       /* list of registered notifiers */
>       QLIST_HEAD(, VTDAddressSpace) vtd_as_with_notifiers;
>   
> -    GHashTable *vtd_host_iommu_dev;             /* HostIOMMUDevice */
> +    GHashTable *vtd_host_iommu_dev;             /* VTDHostIOMMUDevice */
>   
>       /* interrupt remapping */
>       bool intr_enabled;              /* Whether guest enabled IR */
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index e3b871de70..512ca4fdc5 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -281,7 +281,10 @@ static gboolean vtd_hiod_equal(gconstpointer v1, gconstpointer v2)
>   
>   static void vtd_hiod_destroy(gpointer v)
>   {
> -    object_unref(v);
> +    VTDHostIOMMUDevice *vtd_hiod = v;
> +
> +    object_unref(vtd_hiod->hiod);
> +    g_free(vtd_hiod);
>   }
>   
>   static gboolean vtd_hash_remove_by_domain(gpointer key, gpointer value,
> @@ -4371,6 +4374,7 @@ static bool vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
>                                        HostIOMMUDevice *hiod, Error **errp)
>   {
>       IntelIOMMUState *s = opaque;
> +    VTDHostIOMMUDevice *vtd_hiod;
>       struct vtd_as_key key = {
>           .bus = bus,
>           .devfn = devfn,
> @@ -4387,7 +4391,14 @@ static bool vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
>           return false;
>       }
>   
> +    vtd_hiod = g_malloc0(sizeof(VTDHostIOMMUDevice));
> +    vtd_hiod->bus = bus;
> +    vtd_hiod->devfn = (uint8_t)devfn;
> +    vtd_hiod->iommu_state = s;
> +    vtd_hiod->hiod = hiod;

How about moving it after the if branch below? :)

>       if (!vtd_check_hiod(s, hiod, errp)) {
> +        g_free(vtd_hiod);
>           vtd_iommu_unlock(s);
>           return false;
>       }
> @@ -4397,7 +4408,7 @@ static bool vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
>       new_key->devfn = devfn;
>   
>       object_ref(hiod);
> -    g_hash_table_insert(s->vtd_host_iommu_dev, new_key, hiod);
> +    g_hash_table_insert(s->vtd_host_iommu_dev, new_key, vtd_hiod);
>   
>       vtd_iommu_unlock(s);
>   

LGTM.

Reviewed-by: Yi Liu <yi.l.liu@intel.com>
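
The reordering suggested above (allocate only after the compatibility
check passes) can be sketched as follows. The types and helpers are
simplified stand-ins (SketchHiod, SketchVtdHiod, sketch_check_hiod(),
sketch_set_iommu_device()), and plain calloc() is used here in place of
g_malloc0().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the host IOMMU device object. */
typedef struct SketchHiod {
    bool compatible;   /* result the compatibility check will report */
} SketchHiod;

/* Hypothetical wrapper that would be stored in the hash table. */
typedef struct SketchVtdHiod {
    SketchHiod *hiod;
    uint8_t devfn;
} SketchVtdHiod;

/* Stand-in for the vIOMMU vs. host IOMMU compatibility check. */
static bool sketch_check_hiod(const SketchHiod *hiod)
{
    return hiod->compatible;
}

/* Check first, allocate second: the failure path has nothing to free. */
static SketchVtdHiod *sketch_set_iommu_device(SketchHiod *hiod, int devfn)
{
    SketchVtdHiod *vtd_hiod;

    assert(hiod);
    if (!sketch_check_hiod(hiod)) {
        return NULL;
    }

    vtd_hiod = calloc(1, sizeof(*vtd_hiod));
    if (!vtd_hiod) {
        return NULL;
    }
    vtd_hiod->hiod = hiod;
    vtd_hiod->devfn = (uint8_t)devfn;
    return vtd_hiod;
}
```

With this ordering the error branch no longer needs a matching free
call, which is the cleanup the comment above is hinting at.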



* Re: [PATCH v5 02/21] hw/pci: Introduce pci_device_get_viommu_cap()
  2025-08-27 11:13   ` Yi Liu
@ 2025-08-27 11:22     ` Eric Auger
  2025-08-27 12:30       ` Yi Liu
  0 siblings, 1 reply; 113+ messages in thread
From: Eric Auger @ 2025-08-27 11:22 UTC (permalink / raw)
  To: Yi Liu, Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, mst, jasowang, peterx, ddutile, jgg,
	nicolinc, joao.m.martins, clement.mathieu--drif, kevin.tian,
	chao.p.peng

Hi

On 8/27/25 1:13 PM, Yi Liu wrote:
> On 2025/8/22 14:40, Zhenzhong Duan wrote:
>> Introduce a new PCIIOMMUOps optional callback, get_viommu_cap() which
>> allows to retrieve capabilities exposed by a vIOMMU. The first planned
>> vIOMMU device capability is VIOMMU_CAP_HW_NESTED that advertises the
>> support of HW nested stage translation scheme. pci_device_get_viommu_cap
>> is a wrapper that can be called on a PCI device potentially protected by
>> a vIOMMU.
>>
>> get_viommu_cap() is designed to return 64bit bitmap of purely emulated
>> capabilities which are only determined by user's configuration, no host
>> capabilities involved. Reasons are:
>>
>> 1. host may has heterogeneous IOMMUs, each with different capabilities
>> 2. this is migration friendly, return value is consistent between source
>>     and target.
>> 3. host IOMMU capabilities are passed to vIOMMU through
>> set_iommu_device()
>>     interface which have to be after attach_device(), when
>> get_viommu_cap()
>>     is called in attach_device(), there is no way for vIOMMU to get host
>>     IOMMU capabilities yet, so only emulated capabilities can be
>> returned.
>>     See below sequence:
>>
>>       vfio_device_attach():
>>           iommufd_cdev_attach():
>>               pci_device_get_viommu_cap() for HW nesting cap
>>               create a nesting parent hwpt
>>               attach device to the hwpt
>>               vfio_device_hiod_create_and_realize() creating hiod
>>       ...
>>       pci_device_set_iommu_device(hiod)
>
>> Suggested-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>>   MAINTAINERS          |  1 +
>>   include/hw/iommu.h   | 19 +++++++++++++++++++
>>   include/hw/pci/pci.h | 25 +++++++++++++++++++++++++
>>   hw/pci/pci.c         | 11 +++++++++++
>>   4 files changed, 56 insertions(+)
>>   create mode 100644 include/hw/iommu.h
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index a07086ed76..54fb878128 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -2305,6 +2305,7 @@ F: include/system/iommufd.h
>>   F: backends/host_iommu_device.c
>>   F: include/system/host_iommu_device.h
>>   F: include/qemu/chardev_open.h
>> +F: include/hw/iommu.h
>>   F: util/chardev_open.c
>>   F: docs/devel/vfio-iommufd.rst
>>   diff --git a/include/hw/iommu.h b/include/hw/iommu.h
>> new file mode 100644
>> index 0000000000..7dd0c11b16
>> --- /dev/null
>> +++ b/include/hw/iommu.h
>> @@ -0,0 +1,19 @@
>> +/*
>> + * General vIOMMU capabilities, flags, etc
>> + *
>> + * Copyright (C) 2025 Intel Corporation.
>> + *
>> + * SPDX-License-Identifier: GPL-2.0-or-later
>> + */
>> +
>> +#ifndef HW_IOMMU_H
>> +#define HW_IOMMU_H
>> +
>> +#include "qemu/bitops.h"
>> +
>> +enum {
>> +    /* hardware nested stage-1 page table support */
>> +    VIOMMU_CAP_HW_NESTED = BIT_ULL(0),
>
> This naming is a bit confusing. get_viommu_cap indicates it will return
> the viommu's capability while this naming is HW_NESTED. It's conflict
> with the commit message which claims only emulated capability will be
> returned.

It actually means the vIOMMU has the code to handle the HW nested case,
independently of the actual HW support.
Maybe remove the "emulation" wording.

Otherwise we could also use the virtio has_feature naming?


>
> TBH. I'm hesitating to name it as get_viommu_cap. The scope is a little
> larger than what we want so far. So I'm wondering if it can be done in a
> more straightforward way. e.g. just a bool op named
> iommu_nested_wanted(). Just an example, maybe better naming. We can
> extend the op to be returning a u64 value in the future when we see
> another request on VFIO from vIOMMU.
Personally I am fine with the bitmask, which looks more future-proof.

Besides,

Reviewed-by: Eric Auger <eric.auger@redhat.com>
>
> Regards,
> Yi Liu
>
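
A minimal sketch of how the two naming proposals can coexist: a u64 capability bitmap with a has_feature-style query layered on top. VIOMMU_CAP_HW_NESTED mirrors the quoted enum; ExampleVIOMMU, example_get_viommu_cap() and example_viommu_has_feature() are hypothetical names, and deriving the cap purely from an x-flts-like property is an assumption made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mirrors the quoted enum; BIT_ULL(0) is spelled out to stay standalone. */
enum {
    VIOMMU_CAP_HW_NESTED = 1ULL << 0,
};

/* Hypothetical vIOMMU model: caps depend only on user configuration. */
typedef struct ExampleVIOMMU {
    bool flts;  /* stand-in for the x-flts=on property */
} ExampleVIOMMU;

/* Bitmap getter: no host capability is consulted here. */
static uint64_t example_get_viommu_cap(const ExampleVIOMMU *s)
{
    return s->flts ? VIOMMU_CAP_HW_NESTED : 0;
}

/* has_feature-style query layered on the bitmap. */
static bool example_viommu_has_feature(const ExampleVIOMMU *s,
                                       uint64_t feature)
{
    assert(feature != 0);
    return (example_get_viommu_cap(s) & feature) == feature;
}
```

The bitmap keeps room for further bits, while callers only ever ask a yes/no question about one feature.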




* Re: [PATCH v5 04/21] vfio: Introduce helper vfio_pci_from_vfio_device()
  2025-08-22  6:40 ` [PATCH v5 04/21] vfio: Introduce helper vfio_pci_from_vfio_device() Zhenzhong Duan
  2025-08-22 22:40   ` Nicolin Chen via
  2025-08-27 11:13   ` Yi Liu
@ 2025-08-27 11:34   ` Eric Auger
  2025-09-01 16:36   ` Cédric Le Goater
  3 siblings, 0 replies; 113+ messages in thread
From: Eric Auger @ 2025-08-27 11:34 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, mst, jasowang, peterx, ddutile, jgg,
	nicolinc, joao.m.martins, clement.mathieu--drif, kevin.tian,
	yi.l.liu, chao.p.peng



On 8/22/25 8:40 AM, Zhenzhong Duan wrote:
> Introduce helper vfio_pci_from_vfio_device() to transform from VFIODevice
> to VFIOPCIDevice, also to hide low level VFIO_DEVICE_TYPE_PCI type check.
>
> Suggested-by: Cédric Le Goater <clg@redhat.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> Reviewed-by: Cédric Le Goater <clg@redhat.com>
> Link: https://lore.kernel.org/qemu-devel/20250801023533.1458644-1-zhenzhong.duan@intel.com
> [ clg: Added documentation ]
> Signed-off-by: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>

Eric
> ---
>  hw/vfio/pci.h       | 12 ++++++++++++
>  hw/vfio/container.c |  4 ++--
>  hw/vfio/device.c    |  2 +-
>  hw/vfio/iommufd.c   |  4 ++--
>  hw/vfio/listener.c  |  4 ++--
>  hw/vfio/pci.c       |  9 +++++++++
>  6 files changed, 28 insertions(+), 7 deletions(-)
>
> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> index 810a842f4a..beb8fb9ee7 100644
> --- a/hw/vfio/pci.h
> +++ b/hw/vfio/pci.h
> @@ -221,6 +221,18 @@ void vfio_pci_write_config(PCIDevice *pdev,
>  uint64_t vfio_vga_read(void *opaque, hwaddr addr, unsigned size);
>  void vfio_vga_write(void *opaque, hwaddr addr, uint64_t data, unsigned size);
>  
> +/**
> + * vfio_pci_from_vfio_device: Transform from VFIODevice to
> + * VFIOPCIDevice
> + *
> + * This function checks if the given @vbasedev is a VFIO PCI device.
> + * If it is, it returns the containing VFIOPCIDevice.
> + *
> + * @vbasedev: The VFIODevice to transform
> + *
> + * Return: The VFIOPCIDevice on success, NULL on failure.
> + */
> +VFIOPCIDevice *vfio_pci_from_vfio_device(VFIODevice *vbasedev);
>  void vfio_sub_page_bar_update_mappings(VFIOPCIDevice *vdev);
>  bool vfio_opt_rom_in_denylist(VFIOPCIDevice *vdev);
>  bool vfio_config_quirk_setup(VFIOPCIDevice *vdev, Error **errp);
> diff --git a/hw/vfio/container.c b/hw/vfio/container.c
> index 3e13feaa74..134ddccc52 100644
> --- a/hw/vfio/container.c
> +++ b/hw/vfio/container.c
> @@ -1087,7 +1087,7 @@ static int vfio_legacy_pci_hot_reset(VFIODevice *vbasedev, bool single)
>          /* Prep dependent devices for reset and clear our marker. */
>          QLIST_FOREACH(vbasedev_iter, &group->device_list, next) {
>              if (!vbasedev_iter->dev->realized ||
> -                vbasedev_iter->type != VFIO_DEVICE_TYPE_PCI) {
> +                !vfio_pci_from_vfio_device(vbasedev_iter)) {
>                  continue;
>              }
>              tmp = container_of(vbasedev_iter, VFIOPCIDevice, vbasedev);
> @@ -1172,7 +1172,7 @@ out:
>  
>          QLIST_FOREACH(vbasedev_iter, &group->device_list, next) {
>              if (!vbasedev_iter->dev->realized ||
> -                vbasedev_iter->type != VFIO_DEVICE_TYPE_PCI) {
> +                !vfio_pci_from_vfio_device(vbasedev_iter)) {
>                  continue;
>              }
>              tmp = container_of(vbasedev_iter, VFIOPCIDevice, vbasedev);
> diff --git a/hw/vfio/device.c b/hw/vfio/device.c
> index 52a1996dc4..08f12ac31f 100644
> --- a/hw/vfio/device.c
> +++ b/hw/vfio/device.c
> @@ -129,7 +129,7 @@ static inline const char *action_to_str(int action)
>  
>  static const char *index_to_str(VFIODevice *vbasedev, int index)
>  {
> -    if (vbasedev->type != VFIO_DEVICE_TYPE_PCI) {
> +    if (!vfio_pci_from_vfio_device(vbasedev)) {
>          return NULL;
>      }
>  
> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
> index 48c590b6a9..8c27222f75 100644
> --- a/hw/vfio/iommufd.c
> +++ b/hw/vfio/iommufd.c
> @@ -737,8 +737,8 @@ iommufd_cdev_dep_get_realized_vpdev(struct vfio_pci_dependent_device *dep_dev,
>      }
>  
>      vbasedev_tmp = iommufd_cdev_pci_find_by_devid(dep_dev->devid);
> -    if (!vbasedev_tmp || !vbasedev_tmp->dev->realized ||
> -        vbasedev_tmp->type != VFIO_DEVICE_TYPE_PCI) {
> +    if (!vfio_pci_from_vfio_device(vbasedev_tmp) ||
> +        !vbasedev_tmp->dev->realized) {
>          return NULL;
>      }
>  
> diff --git a/hw/vfio/listener.c b/hw/vfio/listener.c
> index f498e23a93..903dfd8bf2 100644
> --- a/hw/vfio/listener.c
> +++ b/hw/vfio/listener.c
> @@ -450,7 +450,7 @@ static void vfio_device_error_append(VFIODevice *vbasedev, Error **errp)
>       * MMIO region mapping failures are not fatal but in this case PCI
>       * peer-to-peer transactions are broken.
>       */
> -    if (vbasedev && vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
> +    if (vfio_pci_from_vfio_device(vbasedev)) {
>          error_append_hint(errp, "%s: PCI peer-to-peer transactions "
>                            "on BARs are not supported.\n", vbasedev->name);
>      }
> @@ -751,7 +751,7 @@ static bool vfio_section_is_vfio_pci(MemoryRegionSection *section,
>      owner = memory_region_owner(section->mr);
>  
>      QLIST_FOREACH(vbasedev, &bcontainer->device_list, container_next) {
> -        if (vbasedev->type != VFIO_DEVICE_TYPE_PCI) {
> +        if (!vfio_pci_from_vfio_device(vbasedev)) {
>              continue;
>          }
>          pcidev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 07257d0fa0..3fe5b03eb1 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -2833,6 +2833,15 @@ static int vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
>      return ret;
>  }
>  
> +/* Transform from VFIODevice to VFIOPCIDevice. Return NULL if fails. */
> +VFIOPCIDevice *vfio_pci_from_vfio_device(VFIODevice *vbasedev)
> +{
> +    if (vbasedev && vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
> +        return container_of(vbasedev, VFIOPCIDevice, vbasedev);
> +    }
> +    return NULL;
> +}
> +
>  void vfio_sub_page_bar_update_mappings(VFIOPCIDevice *vdev)
>  {
>      PCIDevice *pdev = &vdev->pdev;
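
A minimal standalone sketch of the pattern the helper wraps: a type check followed by a container_of-style downcast that returns NULL for anything that is not a PCI device (including a NULL pointer). All names prefixed with Ex/EX_ are hypothetical stand-ins, and the local macro only approximates the container_of() that QEMU provides.

```c
#include <stddef.h>

/* Local container_of equivalent; QEMU provides its own macro. */
#define EX_CONTAINER_OF(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

enum { EX_TYPE_PCI = 0, EX_TYPE_PLATFORM = 1 };

/* Stand-in for the generic base device. */
typedef struct ExDevice {
    int type;
} ExDevice;

/* Stand-in for the PCI device; the base is embedded, not at offset 0. */
typedef struct ExPCIDevice {
    int bar_count;
    ExDevice vbasedev;
} ExPCIDevice;

/* Return the containing PCI device, or NULL if @vbasedev is not PCI. */
static ExPCIDevice *ex_pci_from_device(ExDevice *vbasedev)
{
    if (vbasedev && vbasedev->type == EX_TYPE_PCI) {
        return EX_CONTAINER_OF(vbasedev, ExPCIDevice, vbasedev);
    }
    return NULL;
}
```

Because the helper tolerates NULL, callers such as the iommufd.c hunk above can fold their own NULL check into the single call.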




* Re: [PATCH v5 08/21] intel_iommu: Check for compatibility with IOMMUFD backed device when x-flts=on
  2025-08-22  6:40 ` [PATCH v5 08/21] intel_iommu: Check for compatibility with IOMMUFD backed device when x-flts=on Zhenzhong Duan
@ 2025-08-27 11:42   ` Yi Liu
  2025-08-28  9:37     ` Duan, Zhenzhong
  2025-08-27 11:55   ` Eric Auger
  1 sibling, 1 reply; 113+ messages in thread
From: Yi Liu @ 2025-08-27 11:42 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, joao.m.martins, clement.mathieu--drif, kevin.tian,
	chao.p.peng

On 2025/8/22 14:40, Zhenzhong Duan wrote:
> When vIOMMU is configured x-flts=on in scalable mode, stage-1 page table
> is passed to host to construct nested page table.

for passthrough devices :)

> We need to check
> compatibility of some critical IOMMU capabilities between vIOMMU and
> host IOMMU to ensure guest stage-1 page table could be used by host.
> 
> For instance, vIOMMU supports stage-1 1GB huge page mapping, but host
> does not, then this IOMMUFD backed device should fail.

Do you have a list of which caps should be checked to ensure the guest
stage-1 page table works on hw? I can see EAFS, but it is not yet exposed
to the guest, so there is no need to check it for now.

> 
> Even of the checks pass, for now we willingly reject the association
> because all the bits are not there yet.

Better to call out that this will be relaxed at the end of this series.
Otherwise it's a little confusing. :)

> 
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>   hw/i386/intel_iommu_internal.h |  1 +
>   hw/i386/intel_iommu.c          | 30 +++++++++++++++++++++++++++++-
>   2 files changed, 30 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> index c7046eb4e2..f7510861d1 100644
> --- a/hw/i386/intel_iommu_internal.h
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -192,6 +192,7 @@
>   #define VTD_ECAP_PT                 (1ULL << 6)
>   #define VTD_ECAP_SC                 (1ULL << 7)
>   #define VTD_ECAP_MHMV               (15ULL << 20)
> +#define VTD_ECAP_NEST               (1ULL << 26)
>   #define VTD_ECAP_SRS                (1ULL << 31)
>   #define VTD_ECAP_PSS                (7ULL << 35) /* limit: MemTxAttrs::pid */
>   #define VTD_ECAP_PASID              (1ULL << 40)
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 512ca4fdc5..da355bda79 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -40,6 +40,7 @@
>   #include "kvm/kvm_i386.h"
>   #include "migration/vmstate.h"
>   #include "trace.h"
> +#include "system/iommufd.h"
>   
>   /* context entry operations */
>   #define VTD_CE_GET_RID2PASID(ce) \
> @@ -4366,7 +4367,34 @@ static bool vtd_check_hiod(IntelIOMMUState *s, HostIOMMUDevice *hiod,
>           return true;
>       }
>   
> -    error_setg(errp, "host device is uncompatible with stage-1 translation");
> +#ifdef CONFIG_IOMMUFD
> +    struct HostIOMMUDeviceCaps *caps = &hiod->caps;
> +    struct iommu_hw_info_vtd *vtd = &caps->vendor_caps.vtd;
> +
> +    /* Remaining checks are all stage-1 translation specific */
> +    if (!object_dynamic_cast(OBJECT(hiod), TYPE_HOST_IOMMU_DEVICE_IOMMUFD)) {
> +        error_setg(errp, "Need IOMMUFD backend when x-flts=on");
> +        return false;
> +    }
> +
> +    if (caps->type != IOMMU_HW_INFO_TYPE_INTEL_VTD) {
> +        error_setg(errp, "Incompatible host platform IOMMU type %d",
> +                   caps->type);
> +        return false;
> +    }
> +
> +    if (!(vtd->ecap_reg & VTD_ECAP_NEST)) {
> +        error_setg(errp, "Host IOMMU doesn't support nested translation");
> +        return false;
> +    }

This check may already be covered by the sync in patch 05, as the
set_iommu_device op is called after attach_device. Without the NESTED cap,
allocating the nested hwpt would fail.

> +
> +    if (s->fs1gp && !(vtd->cap_reg & VTD_CAP_FS1GP)) {
> +        error_setg(errp, "Stage-1 1GB huge page is unsupported by host IOMMU");

s/huge page/large page/ as the VT-d spec uses "large page".

> +        return false;
> +    }
> +#endif
> +
> +    error_setg(errp, "host IOMMU is incompatible with stage-1 translation");

s/stage-1 translation/guest stage-1 translation/

>       return false;
>   }
>   

With the above minor nits addressed, the patch looks good to me. Hence,

Reviewed-by: Yi Liu <yi.l.liu@intel.com>
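
A minimal standalone sketch of the compatibility checks discussed above, using the "large page" wording. EX_ECAP_NEST matches the VTD_ECAP_NEST value in the quoted hunk; the EX_CAP_FS1GP bit position is an assumption for illustration, and ExHostCaps and ex_check_stage1_compat() are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EX_ECAP_NEST (1ULL << 26) /* same value as VTD_ECAP_NEST above */
#define EX_CAP_FS1GP (1ULL << 56) /* assumed bit position, illustration only */

/* Stand-in for the host capability registers reported via iommufd. */
typedef struct ExHostCaps {
    uint64_t cap_reg;
    uint64_t ecap_reg;
} ExHostCaps;

/* Returns NULL when compatible, else a message describing the mismatch. */
static const char *ex_check_stage1_compat(bool viommu_fs1gp,
                                          const ExHostCaps *host)
{
    if (!(host->ecap_reg & EX_ECAP_NEST)) {
        return "Host IOMMU doesn't support nested translation";
    }
    /* Only a feature the vIOMMU exposes must be backed by the host. */
    if (viommu_fs1gp && !(host->cap_reg & EX_CAP_FS1GP)) {
        return "Stage-1 1GB large page is unsupported by host IOMMU";
    }
    return NULL;
}
```

The check is asymmetric on purpose: a host that supports more than the vIOMMU exposes is fine, the reverse is not.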



* Re: [PATCH v5 05/21] vfio/iommufd: Force creating nested parent domain
  2025-08-22  6:40 ` [PATCH v5 05/21] vfio/iommufd: Force creating nested parent domain Zhenzhong Duan
  2025-08-22 23:12   ` Nicolin Chen
@ 2025-08-27 11:48   ` Eric Auger
  2025-08-28  9:53     ` Duan, Zhenzhong
  1 sibling, 1 reply; 113+ messages in thread
From: Eric Auger @ 2025-08-27 11:48 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, mst, jasowang, peterx, ddutile, jgg,
	nicolinc, joao.m.martins, clement.mathieu--drif, kevin.tian,
	yi.l.liu, chao.p.peng

Hi Zhenzhong,

On 8/22/25 8:40 AM, Zhenzhong Duan wrote:
> Call pci_device_get_viommu_cap() to get if vIOMMU supports VIOMMU_CAP_HW_NESTED,
> if yes, create nested parent domain which could be reused by vIOMMU to create
> nested domain.
>
> Introduce helper vfio_device_viommu_get_nested to facilitate this
> implementation.
>
> It is safe because even if VIOMMU_CAP_HW_NESTED is returned, s->flts is
> forbidden and VFIO device fails in set_iommu_device() call, until we support
> passthrough device with x-flts=on.
>
> Suggested-by: Nicolin Chen <nicolinc@nvidia.com>
> Suggested-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>  include/hw/vfio/vfio-device.h |  2 ++
>  hw/vfio/device.c              | 12 ++++++++++++
>  hw/vfio/iommufd.c             |  8 ++++++++
>  3 files changed, 22 insertions(+)
>
> diff --git a/include/hw/vfio/vfio-device.h b/include/hw/vfio/vfio-device.h
> index 6e4d5ccdac..ecd82c16c7 100644
> --- a/include/hw/vfio/vfio-device.h
> +++ b/include/hw/vfio/vfio-device.h
> @@ -257,6 +257,8 @@ void vfio_device_prepare(VFIODevice *vbasedev, VFIOContainerBase *bcontainer,
>  
>  void vfio_device_unprepare(VFIODevice *vbasedev);
>  
> +bool vfio_device_viommu_get_nested(VFIODevice *vbasedev);
I would suggest vfio_device_viommu_has_feature_hw_nested or something similar.
"get" usually means you take a ref count that is paired with a "put".
> +
>  int vfio_device_get_region_info(VFIODevice *vbasedev, int index,
>                                  struct vfio_region_info **info);
>  int vfio_device_get_region_info_type(VFIODevice *vbasedev, uint32_t type,
> diff --git a/hw/vfio/device.c b/hw/vfio/device.c
> index 08f12ac31f..3eeb71bd51 100644
> --- a/hw/vfio/device.c
> +++ b/hw/vfio/device.c
> @@ -23,6 +23,7 @@
>  
>  #include "hw/vfio/vfio-device.h"
>  #include "hw/vfio/pci.h"
> +#include "hw/iommu.h"
>  #include "hw/hw.h"
>  #include "trace.h"
>  #include "qapi/error.h"
> @@ -504,6 +505,17 @@ void vfio_device_unprepare(VFIODevice *vbasedev)
>      vbasedev->bcontainer = NULL;
>  }
>  
> +bool vfio_device_viommu_get_nested(VFIODevice *vbasedev)
> +{
> +    VFIOPCIDevice *vdev = vfio_pci_from_vfio_device(vbasedev);
> +
> +    if (vdev) {
> +        return !!(pci_device_get_viommu_cap(&vdev->pdev) &
> +                  VIOMMU_CAP_HW_NESTED);
> +    }
> +    return false;
> +}
> +
>  /*
>   * Traditional ioctl() based io
>   */
> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
> index 8c27222f75..e503c232e1 100644
> --- a/hw/vfio/iommufd.c
> +++ b/hw/vfio/iommufd.c
> @@ -379,6 +379,14 @@ static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
>          flags = IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
>      }
>  
> +    /*
> +     * If vIOMMU supports stage-1 translation, force to create nested parent
I would rather not introduce different terminology here. You previously used
hw_nested, and I think that's better. Also bear in mind that smmu supports
S1, S2 and S1+S2 in emulated code.

Thanks

Eric
> +     * domain which could be reused by vIOMMU to create nested domain.
> +     */
> +    if (vfio_device_viommu_get_nested(vbasedev)) {
> +        flags |= IOMMU_HWPT_ALLOC_NEST_PARENT;
> +    }
> +
>      if (cpr_is_incoming()) {
>          hwpt_id = vbasedev->cpr.hwpt_id;
>          goto skip_alloc;




* Re: [PATCH v5 05/21] vfio/iommufd: Force creating nested parent domain
  2025-08-25  8:28     ` Duan, Zhenzhong
@ 2025-08-27 11:51       ` Eric Auger
  0 siblings, 0 replies; 113+ messages in thread
From: Eric Auger @ 2025-08-27 11:51 UTC (permalink / raw)
  To: Duan, Zhenzhong, Nicolin Chen
  Cc: qemu-devel@nongnu.org, alex.williamson@redhat.com, clg@redhat.com,
	mst@redhat.com, jasowang@redhat.com, peterx@redhat.com,
	ddutile@redhat.com, jgg@nvidia.com, joao.m.martins@oracle.com,
	clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
	Peng, Chao P



On 8/25/25 10:28 AM, Duan, Zhenzhong wrote:
>
>> -----Original Message-----
>> From: Nicolin Chen <nicolinc@nvidia.com>
>> Subject: Re: [PATCH v5 05/21] vfio/iommufd: Force creating nested parent
>> domain
>>
>> On Fri, Aug 22, 2025 at 02:40:43AM -0400, Zhenzhong Duan wrote:
>>> Call pci_device_get_viommu_cap() to get if vIOMMU supports
>> VIOMMU_CAP_HW_NESTED,
>>> if yes, create nested parent domain which could be reused by vIOMMU to
>> create
>>> nested domain.
>>>
>>> Introduce helper vfio_device_viommu_get_nested to facilitate this
>>> implementation.
>> It'd be nicer to slightly mention the benefit of having it. Assuming
>> that QEMU commit message can be as long as 80 characters:
>>
>> -------------------------
>> Call pci_device_get_viommu_cap() to get if vIOMMU supports
>> VIOMMU_CAP_HW_NESTED.
>>
>> If yes, create a nesting parent domain and add it to the container's hwpt_list,
>> letting this parent domain cover the entire stage-2 mappings (gPA=>PA).
>>
>> This allows a VFIO passthrough device to directly attach to this default
>> domain
>> and then to use the system address space and its listener.
>>
>> Introduce a vfio_device_viommu_get_nested() helper to facilitate this
>> implementation.
>> -------------------------
> Thanks, will do.
>
>>> It is safe because even if VIOMMU_CAP_HW_NESTED is returned, s->flts is
>>> forbidden and VFIO device fails in set_iommu_device() call, until we support
>>> passthrough device with x-flts=on.
>> I think this is too vendor specific to be mentioned here. Likely
>> the previous VTD patch is the place to have this.
>>
>> Or you could say:
>>
>> --------------------------
>> It is safe to do so because a vIOMMU will be able to fail in
>> set_iommu_device()
>> call, if something else related to the VFIO device or vIOMMU isn't compatible.
>> --------------------------
> Will do.
I would say: it is safe to add flags |= IOMMU_HWPT_ALLOC_NEST_PARENT at
this stage, even though the whole functionality is not in place yet,
because HW_NESTED is currently forced off in set_iommu_device().

Eric

>
>>> +bool vfio_device_viommu_get_nested(VFIODevice *vbasedev)
>>> +{
>>> +    VFIOPCIDevice *vdev = vfio_pci_from_vfio_device(vbasedev);
>>> +
>>> +    if (vdev) {
>>> +        return !!(pci_device_get_viommu_cap(&vdev->pdev) &
>>> +                  VIOMMU_CAP_HW_NESTED);
>> "get_nested" feels too general. Here it particularly means the cap:
>>
>> bool vfio_device_get_viommu_cap_hw_nested(VFIODevice *vbasedev)
> Will use vfio_device_get_viommu_cap_hw_nested()
>
>>> @@ -379,6 +379,14 @@ static bool
>> iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
>>>          flags = IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
>>>      }
>>>
>>> +    /*
>>> +     * If vIOMMU supports stage-1 translation, force to create nested
>> parent
>>
>> "nested parent" is a contradictory phrase. Parent is a container
>> holding some nested items. A nested parent sounds like a "parent"
>> item that lives inside another parent container.
>>
>> In kernel kdoc/uAPI, we use:
>> - "nesting parent" for stage-2 object
>> - "nested hwpt", "nested domain" for stage-1 object
> Thanks for sharing this info, I didn't notice that. I will fix the whole series to use 'nesting parent'.
>
> BRs,
> Zhenzhong
>




* Re: [PATCH v5 08/21] intel_iommu: Check for compatibility with IOMMUFD backed device when x-flts=on
  2025-08-22  6:40 ` [PATCH v5 08/21] intel_iommu: Check for compatibility with IOMMUFD backed device when x-flts=on Zhenzhong Duan
  2025-08-27 11:42   ` Yi Liu
@ 2025-08-27 11:55   ` Eric Auger
  1 sibling, 0 replies; 113+ messages in thread
From: Eric Auger @ 2025-08-27 11:55 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, mst, jasowang, peterx, ddutile, jgg,
	nicolinc, joao.m.martins, clement.mathieu--drif, kevin.tian,
	yi.l.liu, chao.p.peng



On 8/22/25 8:40 AM, Zhenzhong Duan wrote:
> When vIOMMU is configured x-flts=on in scalable mode, stage-1 page table
> is passed to host to construct nested page table. We need to check
> compatibility of some critical IOMMU capabilities between vIOMMU and
> host IOMMU to ensure guest stage-1 page table could be used by host.
>
> For instance, vIOMMU supports stage-1 1GB huge page mapping, but host
> does not, then this IOMMUFD backed device should fail.
>
> Even of the checks pass, for now we willingly reject the association
> because all the bits are not there yet.
>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>

Reviewed-by: Eric Auger <eric.auger@redhat.com>

Eric
> ---
>  hw/i386/intel_iommu_internal.h |  1 +
>  hw/i386/intel_iommu.c          | 30 +++++++++++++++++++++++++++++-
>  2 files changed, 30 insertions(+), 1 deletion(-)
>
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> index c7046eb4e2..f7510861d1 100644
> --- a/hw/i386/intel_iommu_internal.h
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -192,6 +192,7 @@
>  #define VTD_ECAP_PT                 (1ULL << 6)
>  #define VTD_ECAP_SC                 (1ULL << 7)
>  #define VTD_ECAP_MHMV               (15ULL << 20)
> +#define VTD_ECAP_NEST               (1ULL << 26)
>  #define VTD_ECAP_SRS                (1ULL << 31)
>  #define VTD_ECAP_PSS                (7ULL << 35) /* limit: MemTxAttrs::pid */
>  #define VTD_ECAP_PASID              (1ULL << 40)
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 512ca4fdc5..da355bda79 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -40,6 +40,7 @@
>  #include "kvm/kvm_i386.h"
>  #include "migration/vmstate.h"
>  #include "trace.h"
> +#include "system/iommufd.h"
>  
>  /* context entry operations */
>  #define VTD_CE_GET_RID2PASID(ce) \
> @@ -4366,7 +4367,34 @@ static bool vtd_check_hiod(IntelIOMMUState *s, HostIOMMUDevice *hiod,
>          return true;
>      }
>  
> -    error_setg(errp, "host device is uncompatible with stage-1 translation");
> +#ifdef CONFIG_IOMMUFD
> +    struct HostIOMMUDeviceCaps *caps = &hiod->caps;
> +    struct iommu_hw_info_vtd *vtd = &caps->vendor_caps.vtd;
> +
> +    /* Remaining checks are all stage-1 translation specific */
> +    if (!object_dynamic_cast(OBJECT(hiod), TYPE_HOST_IOMMU_DEVICE_IOMMUFD)) {
> +        error_setg(errp, "Need IOMMUFD backend when x-flts=on");
> +        return false;
> +    }
> +
> +    if (caps->type != IOMMU_HW_INFO_TYPE_INTEL_VTD) {
> +        error_setg(errp, "Incompatible host platform IOMMU type %d",
> +                   caps->type);
> +        return false;
> +    }
> +
> +    if (!(vtd->ecap_reg & VTD_ECAP_NEST)) {
> +        error_setg(errp, "Host IOMMU doesn't support nested translation");
> +        return false;
> +    }
> +
> +    if (s->fs1gp && !(vtd->cap_reg & VTD_CAP_FS1GP)) {
> +        error_setg(errp, "Stage-1 1GB huge page is unsupported by host IOMMU");
> +        return false;
> +    }
> +#endif
> +
> +    error_setg(errp, "host IOMMU is incompatible with stage-1 translation");
>      return false;
>  }
>  




* Re: [PATCH v5 20/21] Workaround for ERRATA_772415_SPR17
  2025-08-22 23:55   ` Nicolin Chen
  2025-08-25  9:21     ` Duan, Zhenzhong
@ 2025-08-27 11:56     ` Yi Liu
  2025-08-27 15:09       ` Nicolin Chen
  1 sibling, 1 reply; 113+ messages in thread
From: Yi Liu @ 2025-08-27 11:56 UTC (permalink / raw)
  To: Nicolin Chen, Zhenzhong Duan
  Cc: qemu-devel, alex.williamson, clg, eric.auger, mst, jasowang,
	peterx, ddutile, jgg, joao.m.martins, clement.mathieu--drif,
	kevin.tian, chao.p.peng



On 2025/8/23 07:55, Nicolin Chen wrote:
> On Fri, Aug 22, 2025 at 02:40:58AM -0400, Zhenzhong Duan wrote:
>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>> index e503c232e1..59735e878c 100644
>> --- a/hw/vfio/iommufd.c
>> +++ b/hw/vfio/iommufd.c
>> @@ -324,6 +324,7 @@ static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
>>   {
>>       ERRP_GUARD();
>>       IOMMUFDBackend *iommufd = vbasedev->iommufd;
>> +    struct iommu_hw_info_vtd vtd;
> 
> VendorCaps vendor_caps;
> 
>>       uint32_t type, flags = 0;
>>       uint64_t hw_caps;
>>       VFIOIOASHwpt *hwpt;
>> @@ -371,10 +372,15 @@ static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
>>        * instead.
>>        */
>>       if (!iommufd_backend_get_device_info(vbasedev->iommufd, vbasedev->devid,
>> -                                         &type, NULL, 0, &hw_caps, errp)) {
>> +                                         &type, &vtd, sizeof(vtd), &hw_caps,
> 
> s/vtd/vendor_caps/g
> 
>> +                                         errp)) {
>>           return false;
>>       }
>>   
>> +    if (vtd.flags & IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17) {
>> +        container->bcontainer.bypass_ro = true;
> 
> This circled back to checking a vendor specific flag in the core..
> 
> Perhaps we could upgrade the get_viommu_cap op and its API:
> 
> enum viommu_flags {
>      VIOMMU_FLAG_HW_NESTED = BIT_ULL(0),
>      VIOMMU_FLAG_BYPASS_RO = BIT_ULL(1),

Hmm. I'm not quite sold on this idea, as the two flags have different
sources: one is determined by the vIOMMU config, the other by a hardware
limitation. Reporting them through one API is strange. I think bypass RO
can be determined in VFIO, just as the patch does. But it should also check
whether the vIOMMU has requested a nested hwpt and whether the reported
hw_info::type is IOMMU_HW_INFO_TYPE_INTEL_VTD.

    if ((flags & IOMMU_HWPT_ALLOC_NEST_PARENT) &&
        type == IOMMU_HW_INFO_TYPE_INTEL_VTD &&
        (vtd.flags & IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17)) {
        container->bcontainer.bypass_ro = true;
    }

Regards,
Yi Liu
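
A minimal standalone sketch of the three-part condition proposed above, written as a pure predicate so each term can be exercised on its own. The EX_ constants are illustrative values, not taken from the iommufd uAPI header, and ex_need_bypass_ro() is a hypothetical helper name.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative values; the real constants live in the iommufd uAPI header. */
#define EX_HWPT_ALLOC_NEST_PARENT  (1u << 0)
#define EX_HW_INFO_TYPE_INTEL_VTD  1u
#define EX_VTD_ERRATA_772415_SPR17 (1u << 0)

/*
 * Bypass read-only mappings only when all three hold:
 *  - the vIOMMU asked for a nesting parent hwpt,
 *  - the host IOMMU reports the VT-d info type,
 *  - the VT-d info flags carry the erratum bit.
 * The type test precedes the flags test, so vendor-specific flags are
 * interpreted only after the vendor type is known to match.
 */
static bool ex_need_bypass_ro(uint32_t alloc_flags, uint32_t hw_type,
                              uint32_t vtd_flags)
{
    return (alloc_flags & EX_HWPT_ALLOC_NEST_PARENT) &&
           hw_type == EX_HW_INFO_TYPE_INTEL_VTD &&
           (vtd_flags & EX_VTD_ERRATA_772415_SPR17);
}
```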



* Re: [PATCH v5 02/21] hw/pci: Introduce pci_device_get_viommu_cap()
  2025-08-27 11:22     ` Eric Auger
@ 2025-08-27 12:30       ` Yi Liu
  2025-08-27 12:32         ` Eric Auger
  0 siblings, 1 reply; 113+ messages in thread
From: Yi Liu @ 2025-08-27 12:30 UTC (permalink / raw)
  To: eric.auger, Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, mst, jasowang, peterx, ddutile, jgg,
	nicolinc, joao.m.martins, clement.mathieu--drif, kevin.tian,
	chao.p.peng

Hi Eric,

On 2025/8/27 19:22, Eric Auger wrote:
> Hi
> 
> On 8/27/25 1:13 PM, Yi Liu wrote:
>> On 2025/8/22 14:40, Zhenzhong Duan wrote:
>>> Introduce a new PCIIOMMUOps optional callback, get_viommu_cap() which
>>> allows to retrieve capabilities exposed by a vIOMMU. The first planned
>>> vIOMMU device capability is VIOMMU_CAP_HW_NESTED that advertises the
>>> support of HW nested stage translation scheme. pci_device_get_viommu_cap
>>> is a wrapper that can be called on a PCI device potentially protected by
>>> a vIOMMU.
>>>
>>> get_viommu_cap() is designed to return 64bit bitmap of purely emulated
>>> capabilities which are only determined by user's configuration, no host
>>> capabilities involved. Reasons are:
>>>
>>> 1. host may has heterogeneous IOMMUs, each with different capabilities
>>> 2. this is migration friendly, return value is consistent between source
>>>      and target.
>>> 3. host IOMMU capabilities are passed to vIOMMU through
>>> set_iommu_device()
>>>      interface which have to be after attach_device(), when
>>> get_viommu_cap()
>>>      is called in attach_device(), there is no way for vIOMMU to get host
>>>      IOMMU capabilities yet, so only emulated capabilities can be
>>> returned.
>>>      See below sequence:
>>>
>>>        vfio_device_attach():
>>>            iommufd_cdev_attach():
>>>                pci_device_get_viommu_cap() for HW nesting cap
>>>                create a nesting parent hwpt
>>>                attach device to the hwpt
>>>                vfio_device_hiod_create_and_realize() creating hiod
>>>        ...
>>>        pci_device_set_iommu_device(hiod)
>>
>>> Suggested-by: Yi Liu <yi.l.liu@intel.com>
>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>> ---
>>>    MAINTAINERS          |  1 +
>>>    include/hw/iommu.h   | 19 +++++++++++++++++++
>>>    include/hw/pci/pci.h | 25 +++++++++++++++++++++++++
>>>    hw/pci/pci.c         | 11 +++++++++++
>>>    4 files changed, 56 insertions(+)
>>>    create mode 100644 include/hw/iommu.h
>>>
>>> diff --git a/MAINTAINERS b/MAINTAINERS
>>> index a07086ed76..54fb878128 100644
>>> --- a/MAINTAINERS
>>> +++ b/MAINTAINERS
>>> @@ -2305,6 +2305,7 @@ F: include/system/iommufd.h
>>>    F: backends/host_iommu_device.c
>>>    F: include/system/host_iommu_device.h
>>>    F: include/qemu/chardev_open.h
>>> +F: include/hw/iommu.h
>>>    F: util/chardev_open.c
>>>    F: docs/devel/vfio-iommufd.rst
>>>    diff --git a/include/hw/iommu.h b/include/hw/iommu.h
>>> new file mode 100644
>>> index 0000000000..7dd0c11b16
>>> --- /dev/null
>>> +++ b/include/hw/iommu.h
>>> @@ -0,0 +1,19 @@
>>> +/*
>>> + * General vIOMMU capabilities, flags, etc
>>> + *
>>> + * Copyright (C) 2025 Intel Corporation.
>>> + *
>>> + * SPDX-License-Identifier: GPL-2.0-or-later
>>> + */
>>> +
>>> +#ifndef HW_IOMMU_H
>>> +#define HW_IOMMU_H
>>> +
>>> +#include "qemu/bitops.h"
>>> +
>>> +enum {
>>> +    /* hardware nested stage-1 page table support */
>>> +    VIOMMU_CAP_HW_NESTED = BIT_ULL(0),
>>
>> This naming is a bit confusing. get_viommu_cap indicates it will return
>> the viommu's capability while this naming is HW_NESTED. It's conflict
>> with the commit message which claims only emulated capability will be
>> returned.
> 
> it actually means the viommu has the code to handle HW nested case,
> independently on the actual HW support.
> maybe remove the "emulation" wording.

yeah, I know the meaning and the purpose here. Just not quite satisfied
with the naming.

> 
> Otherwise we may also use the virtio has_feature naming?

has_feature seems better. Looks to ask if vIOMMU has something and then
do something.

> 
>>
>> TBH. I'm hesitating to name it as get_viommu_cap. The scope is a little
>> larger than what we want so far. So I'm wondering if it can be done in a
>> more straightforward way. e.g. just a bool op named
>> iommu_nested_wanted(). Just an example, maybe better naming. We can
>> extend the op to be returning a u64 value in the future when we see
>> another request on VFIO from vIOMMU.
> personnally I am fine with the bitmask which looks more future proof.

not quite sure if there is another info that needs to be checked in
this "VFIO asks vIOMMU" manner. Have you seen one beside this
nested hwpt requirement by vIOMMU?

Regards,
Yi Liu


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v5 02/21] hw/pci: Introduce pci_device_get_viommu_cap()
  2025-08-27 12:30       ` Yi Liu
@ 2025-08-27 12:32         ` Eric Auger
  2025-08-27 15:30           ` Nicolin Chen
  0 siblings, 1 reply; 113+ messages in thread
From: Eric Auger @ 2025-08-27 12:32 UTC (permalink / raw)
  To: Yi Liu, Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, mst, jasowang, peterx, ddutile, jgg,
	nicolinc, joao.m.martins, clement.mathieu--drif, kevin.tian,
	chao.p.peng

Hi Yi,

On 8/27/25 2:30 PM, Yi Liu wrote:
> Hi Eric,
>
> On 2025/8/27 19:22, Eric Auger wrote:
>> Hi
>>
>> On 8/27/25 1:13 PM, Yi Liu wrote:
>>> On 2025/8/22 14:40, Zhenzhong Duan wrote:
>>>> Introduce a new PCIIOMMUOps optional callback, get_viommu_cap() which
>>>> allows to retrieve capabilities exposed by a vIOMMU. The first planned
>>>> vIOMMU device capability is VIOMMU_CAP_HW_NESTED that advertises the
>>>> support of HW nested stage translation scheme.
>>>> pci_device_get_viommu_cap
>>>> is a wrapper that can be called on a PCI device potentially
>>>> protected by
>>>> a vIOMMU.
>>>>
>>>> get_viommu_cap() is designed to return 64bit bitmap of purely emulated
>>>> capabilities which are only determined by user's configuration, no
>>>> host
>>>> capabilities involved. Reasons are:
>>>>
>>>> 1. host may has heterogeneous IOMMUs, each with different capabilities
>>>> 2. this is migration friendly, return value is consistent between
>>>> source
>>>>      and target.
>>>> 3. host IOMMU capabilities are passed to vIOMMU through
>>>> set_iommu_device()
>>>>      interface which have to be after attach_device(), when
>>>> get_viommu_cap()
>>>>      is called in attach_device(), there is no way for vIOMMU to
>>>> get host
>>>>      IOMMU capabilities yet, so only emulated capabilities can be
>>>> returned.
>>>>      See below sequence:
>>>>
>>>>        vfio_device_attach():
>>>>            iommufd_cdev_attach():
>>>>                pci_device_get_viommu_cap() for HW nesting cap
>>>>                create a nesting parent hwpt
>>>>                attach device to the hwpt
>>>>                vfio_device_hiod_create_and_realize() creating hiod
>>>>        ...
>>>>        pci_device_set_iommu_device(hiod)
>>>
>>>> Suggested-by: Yi Liu <yi.l.liu@intel.com>
>>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>>> ---
>>>>    MAINTAINERS          |  1 +
>>>>    include/hw/iommu.h   | 19 +++++++++++++++++++
>>>>    include/hw/pci/pci.h | 25 +++++++++++++++++++++++++
>>>>    hw/pci/pci.c         | 11 +++++++++++
>>>>    4 files changed, 56 insertions(+)
>>>>    create mode 100644 include/hw/iommu.h
>>>>
>>>> diff --git a/MAINTAINERS b/MAINTAINERS
>>>> index a07086ed76..54fb878128 100644
>>>> --- a/MAINTAINERS
>>>> +++ b/MAINTAINERS
>>>> @@ -2305,6 +2305,7 @@ F: include/system/iommufd.h
>>>>    F: backends/host_iommu_device.c
>>>>    F: include/system/host_iommu_device.h
>>>>    F: include/qemu/chardev_open.h
>>>> +F: include/hw/iommu.h
>>>>    F: util/chardev_open.c
>>>>    F: docs/devel/vfio-iommufd.rst
>>>>    diff --git a/include/hw/iommu.h b/include/hw/iommu.h
>>>> new file mode 100644
>>>> index 0000000000..7dd0c11b16
>>>> --- /dev/null
>>>> +++ b/include/hw/iommu.h
>>>> @@ -0,0 +1,19 @@
>>>> +/*
>>>> + * General vIOMMU capabilities, flags, etc
>>>> + *
>>>> + * Copyright (C) 2025 Intel Corporation.
>>>> + *
>>>> + * SPDX-License-Identifier: GPL-2.0-or-later
>>>> + */
>>>> +
>>>> +#ifndef HW_IOMMU_H
>>>> +#define HW_IOMMU_H
>>>> +
>>>> +#include "qemu/bitops.h"
>>>> +
>>>> +enum {
>>>> +    /* hardware nested stage-1 page table support */
>>>> +    VIOMMU_CAP_HW_NESTED = BIT_ULL(0),
>>>
>>> This naming is a bit confusing. get_viommu_cap indicates it will return
>>> the viommu's capability while this naming is HW_NESTED. It's conflict
>>> with the commit message which claims only emulated capability will be
>>> returned.
>>
>> it actually means the viommu has the code to handle HW nested case,
>> independently on the actual HW support.
>> maybe remove the "emulation" wording.
>
> yeah, I know the meaning and the purpose here. Just not quite satisfied
> with the naming.
>
>>
>> Otherwise we may also use the virtio has_feature naming?
>
> has_feature seems better. Looks to ask if vIOMMU has something and then
> do something.
>
>>
>>>
>>> TBH. I'm hesitating to name it as get_viommu_cap. The scope is a little
>>> larger than what we want so far. So I'm wondering if it can be done
>>> in a
>>> more straightforward way. e.g. just a bool op named
>>> iommu_nested_wanted(). Just an example, maybe better naming. We can
>>> extend the op to be returning a u64 value in the future when we see
>>> another request on VFIO from vIOMMU.
>> personnally I am fine with the bitmask which looks more future proof.
>
> not quite sure if there is another info that needs to be checked in
> this "VFIO asks vIOMMU" manner. Have you seen one beside this
> nested hwpt requirement by vIOMMU?

I don't remember any at this point. But I guess with ARM CCA device
passthrough we might have other needs.

Eric
>
> Regards,
> Yi Liu
>



^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v5 11/21] intel_iommu: Handle PASID entry removal and update
  2025-08-22  6:40 ` [PATCH v5 11/21] intel_iommu: Handle PASID entry removal and update Zhenzhong Duan
@ 2025-08-27 14:25   ` Eric Auger
  2025-09-01  3:17     ` Duan, Zhenzhong
  2025-08-28 12:05   ` Yi Liu
  1 sibling, 1 reply; 113+ messages in thread
From: Eric Auger @ 2025-08-27 14:25 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, mst, jasowang, peterx, ddutile, jgg,
	nicolinc, joao.m.martins, clement.mathieu--drif, kevin.tian,
	yi.l.liu, chao.p.peng, Yi Sun

Hi Zhenzhong,

On 8/22/25 8:40 AM, Zhenzhong Duan wrote:
> This adds an new entry VTDPASIDCacheEntry in VTDAddressSpace to cache the
> pasid entry and track PASID usage and future PASID tagged DMA address
> translation support in vIOMMU.
>
> VTDAddressSpace of PCI_NO_PASID is allocated when device is plugged and
> never freed. For other pasid, VTDAddressSpace instance is created/destroyed
> per the guest pasid entry set up/destroy.
>
> When guest removes or updates a PASID entry, QEMU will capture the guest pasid
> selective pasid cache invalidation, removes VTDAddressSpace or update cached
> PASID entry.
>
> vIOMMU emulator could figure out the reason by fetching latest guest pasid entry
> and compare it with cached PASID entry.
>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>  hw/i386/intel_iommu_internal.h |  27 ++++-
>  include/hw/i386/intel_iommu.h  |   6 +
>  hw/i386/intel_iommu.c          | 196 +++++++++++++++++++++++++++++++--
>  hw/i386/trace-events           |   3 +
>  4 files changed, 220 insertions(+), 12 deletions(-)
>
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> index f7510861d1..b9b76dd996 100644
> --- a/hw/i386/intel_iommu_internal.h
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -316,6 +316,7 @@ typedef enum VTDFaultReason {
>                                    * request while disabled */
>      VTD_FR_IR_SID_ERR = 0x26,   /* Invalid Source-ID */
>  
> +    VTD_FR_RTADDR_INV_TTM = 0x31,  /* Invalid TTM in RTADDR */
>      /* PASID directory entry access failure */
>      VTD_FR_PASID_DIR_ACCESS_ERR = 0x50,
>      /* The Present(P) field of pasid directory entry is 0 */
> @@ -493,6 +494,15 @@ typedef union VTDInvDesc VTDInvDesc;
>  #define VTD_INV_DESC_PIOTLB_RSVD_VAL0     0xfff000000000f1c0ULL
>  #define VTD_INV_DESC_PIOTLB_RSVD_VAL1     0xf80ULL
>  
> +/* PASID-cache Invalidate Descriptor (pc_inv_dsc) fields */
> +#define VTD_INV_DESC_PASIDC_G(x)        extract64((x)->val[0], 4, 2)
> +#define VTD_INV_DESC_PASIDC_G_DSI       0
> +#define VTD_INV_DESC_PASIDC_G_PASID_SI  1
> +#define VTD_INV_DESC_PASIDC_G_GLOBAL    3
> +#define VTD_INV_DESC_PASIDC_DID(x)      extract64((x)->val[0], 16, 16)
> +#define VTD_INV_DESC_PASIDC_PASID(x)    extract64((x)->val[0], 32, 20)
> +#define VTD_INV_DESC_PASIDC_RSVD_VAL0   0xfff000000000f1c0ULL
> +
>  /* Information about page-selective IOTLB invalidate */
>  struct VTDIOTLBPageInvInfo {
>      uint16_t domain_id;
> @@ -553,6 +563,21 @@ typedef struct VTDRootEntry VTDRootEntry;
>  #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw)  (0x1e0ULL | ~VTD_HAW_MASK(aw))
>  #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1      0xffffffffffe00000ULL
>  
> +typedef enum VTDPCInvType {
> +    /* VTD spec defined PASID cache invalidation type */
> +    VTD_PASID_CACHE_DOMSI = VTD_INV_DESC_PASIDC_G_DSI,
> +    VTD_PASID_CACHE_PASIDSI = VTD_INV_DESC_PASIDC_G_PASID_SI,
> +    VTD_PASID_CACHE_GLOBAL_INV = VTD_INV_DESC_PASIDC_G_GLOBAL,
> +} VTDPCInvType;
> +
> +typedef struct VTDPASIDCacheInfo {
> +    VTDPCInvType type;
> +    uint16_t did;
> +    uint32_t pasid;
> +    PCIBus *bus;
> +    uint16_t devfn;
> +} VTDPASIDCacheInfo;
> +
>  /* PASID Table Related Definitions */
>  #define VTD_PASID_DIR_BASE_ADDR_MASK  (~0xfffULL)
>  #define VTD_PASID_TABLE_BASE_ADDR_MASK (~0xfffULL)
> @@ -574,7 +599,7 @@ typedef struct VTDRootEntry VTDRootEntry;
>  #define VTD_SM_PASID_ENTRY_PT          (4ULL << 6)
>  
>  #define VTD_SM_PASID_ENTRY_AW          7ULL /* Adjusted guest-address-width */
> -#define VTD_SM_PASID_ENTRY_DID(val)    ((val) & VTD_DOMAIN_ID_MASK)
> +#define VTD_SM_PASID_ENTRY_DID(x)      extract64((x)->val[1], 0, 16)
>  
>  #define VTD_SM_PASID_ENTRY_FLPM          3ULL
>  #define VTD_SM_PASID_ENTRY_FLPTPTR       (~0xfffULL)
> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> index 50f9b27a45..0e3826f6f0 100644
> --- a/include/hw/i386/intel_iommu.h
> +++ b/include/hw/i386/intel_iommu.h
> @@ -95,6 +95,11 @@ struct VTDPASIDEntry {
>      uint64_t val[8];
>  };
>  
> +typedef struct VTDPASIDCacheEntry {
> +    struct VTDPASIDEntry pasid_entry;
> +    bool valid;
> +} VTDPASIDCacheEntry;
> +
>  struct VTDAddressSpace {
>      PCIBus *bus;
>      uint8_t devfn;
> @@ -107,6 +112,7 @@ struct VTDAddressSpace {
>      MemoryRegion iommu_ir_fault; /* Interrupt region for catching fault */
>      IntelIOMMUState *iommu_state;
>      VTDContextCacheEntry context_cache_entry;
> +    VTDPASIDCacheEntry pasid_cache_entry;
>      QLIST_ENTRY(VTDAddressSpace) next;
>      /* Superset of notifier flags that this address space has */
>      IOMMUNotifierFlag notifier_flags;
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 1801f1cdf6..a2ee6d684e 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -1675,7 +1675,7 @@ static uint16_t vtd_get_domain_id(IntelIOMMUState *s,
>  
>      if (s->root_scalable) {
>          vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
> -        return VTD_SM_PASID_ENTRY_DID(pe.val[1]);
> +        return VTD_SM_PASID_ENTRY_DID(&pe);
>      }
>  
>      return VTD_CONTEXT_ENTRY_DID(ce->hi);
> @@ -3112,6 +3112,183 @@ static bool vtd_process_piotlb_desc(IntelIOMMUState *s,
>      return true;
>  }
>  
> +static inline int vtd_dev_get_pe_from_pasid(VTDAddressSpace *vtd_as,
> +                                            uint32_t pasid, VTDPASIDEntry *pe)
> +{
> +    IntelIOMMUState *s = vtd_as->iommu_state;
> +    VTDContextEntry ce;
> +    int ret;
> +
> +    if (!s->root_scalable) {
> +        return -VTD_FR_RTADDR_INV_TTM;
> +    }
> +
> +    ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus), vtd_as->devfn,
> +                                   &ce);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    return vtd_ce_get_pasid_entry(s, &ce, pe, pasid);
> +}
> +
> +static bool vtd_pasid_entry_compare(VTDPASIDEntry *p1, VTDPASIDEntry *p2)
> +{
> +    return !memcmp(p1, p2, sizeof(*p1));
> +}
> +
> +/*
> + * This function is a loop function which return value determines if
whose return value determines whether the current vtd_as iterator matches
the pasid cache entry info passed in user_data and needs to be removed
from the pasid cache.
> + * vtd_as including cached pasid entry is removed.
> + *
> + * For PCI_NO_PASID, when corresponding cached pasid entry is cleared,
> + * it returns false so that vtd_as is reserved as it's owned by PCI
> + * sub-system. For other pasid, it returns true so vtd_as is removed.
> + */
> +static gboolean vtd_flush_pasid_locked(gpointer key, gpointer value,
> +                                       gpointer user_data)
> +{
> +    VTDPASIDCacheInfo *pc_info = user_data;
> +    VTDAddressSpace *vtd_as = value;
> +    VTDPASIDCacheEntry *pc_entry = &vtd_as->pasid_cache_entry;
> +    VTDPASIDEntry pe;
> +    uint16_t did;
> +    uint32_t pasid;
> +    int ret;
> +
> +    if (!pc_entry->valid) {
> +        return false;
> +    }
> +    did = VTD_SM_PASID_ENTRY_DID(&pc_entry->pasid_entry);
> +
> +    if (vtd_as_to_iommu_pasid_locked(vtd_as, &pasid)) {
> +        goto remove;
> +    }
> +
> +    switch (pc_info->type) {
> +    case VTD_PASID_CACHE_PASIDSI:
> +        if (pc_info->pasid != pasid) {
> +            return false;
> +        }
> +        /* fall through */
> +    case VTD_PASID_CACHE_DOMSI:
> +        if (pc_info->did != did) {
> +            return false;
> +        }
> +        /* fall through */
> +    case VTD_PASID_CACHE_GLOBAL_INV:
> +        break;
> +    default:
> +        error_setg(&error_fatal, "invalid pc_info->type for flush");
> +    }
> +
> +    /*
> +     * pasid cache invalidation may indicate a present pasid entry to present
> +     * pasid entry modification. To cover such case, vIOMMU emulator needs to
> +     * fetch latest guest pasid entry and compares with cached pasid entry,
> +     * then update pasid cache.
> +     */
> +    ret = vtd_dev_get_pe_from_pasid(vtd_as, pasid, &pe);
> +    if (ret) {
> +        /*
> +         * No valid pasid entry in guest memory. e.g. pasid entry was modified
> +         * to be either all-zero or non-present. Either case means existing
> +         * pasid cache should be removed.
> +         */
> +        goto remove;
> +    }
> +
> +    /*
> +     * Update cached pasid entry if it's stale compared to what's in guest
> +     * memory.
> +     */
> +    if (!vtd_pasid_entry_compare(&pe, &pc_entry->pasid_entry)) {
> +        pc_entry->pasid_entry = pe;
> +    }
> +    return false;
> +
> +remove:
> +    pc_entry->valid = false;
> +
> +    /*
> +     * Don't remove address space of PCI_NO_PASID which is created for PCI
> +     * sub-system.
> +     */
> +    if (vtd_as->pasid == PCI_NO_PASID) {
> +        return false;
> +    }
> +    return true;
> +}
> +
> +/*
> + * For a PASID cache invalidation, this function handles below scenarios:
> + * a) a present cached pasid entry needs to be removed
> + * b) a present cached pasid entry needs to be updated
> + */
> +static void vtd_pasid_cache_sync(IntelIOMMUState *s, VTDPASIDCacheInfo *pc_info)
> +{
> +    if (!s->flts || !s->root_scalable || !s->dmar_enabled) {
> +        return;
> +    }
> +
> +    vtd_iommu_lock(s);
> +    /*
> +     * a,b): loop all the existing vtd_as instances for pasid cache removal
> +       or update.
> +     */
> +    g_hash_table_foreach_remove(s->vtd_address_spaces, vtd_flush_pasid_locked,
> +                                pc_info);
> +    vtd_iommu_unlock(s);
> +}
> +
> +static bool vtd_process_pasid_desc(IntelIOMMUState *s,
> +                                   VTDInvDesc *inv_desc)
> +{
> +    uint16_t did;
> +    uint32_t pasid;
> +    VTDPASIDCacheInfo pc_info;
> +    uint64_t mask[4] = {VTD_INV_DESC_PASIDC_RSVD_VAL0, VTD_INV_DESC_ALL_ONE,
> +                        VTD_INV_DESC_ALL_ONE, VTD_INV_DESC_ALL_ONE};
> +
> +    if (!vtd_inv_desc_reserved_check(s, inv_desc, mask, true,
> +                                     __func__, "pasid cache inv")) {
> +        return false;
> +    }
> +
> +    did = VTD_INV_DESC_PASIDC_DID(inv_desc);
> +    pasid = VTD_INV_DESC_PASIDC_PASID(inv_desc);
> +
> +    switch (VTD_INV_DESC_PASIDC_G(inv_desc)) {
> +    case VTD_INV_DESC_PASIDC_G_DSI:
> +        trace_vtd_pasid_cache_dsi(did);
> +        pc_info.type = VTD_PASID_CACHE_DOMSI;
> +        pc_info.did = did;
> +        break;
> +
> +    case VTD_INV_DESC_PASIDC_G_PASID_SI:
> +        /* PASID selective implies a DID selective */
> +        trace_vtd_pasid_cache_psi(did, pasid);
> +        pc_info.type = VTD_PASID_CACHE_PASIDSI;
> +        pc_info.did = did;
> +        pc_info.pasid = pasid;
> +        break;
> +
> +    case VTD_INV_DESC_PASIDC_G_GLOBAL:
> +        trace_vtd_pasid_cache_gsi();
> +        pc_info.type = VTD_PASID_CACHE_GLOBAL_INV;
> +        break;
> +
> +    default:
> +        error_report_once("invalid granularity field in PASID-cache invalidate "
> +                          "descriptor, hi: 0x%"PRIx64" lo: 0x%" PRIx64,
> +                           inv_desc->val[1], inv_desc->val[0]);
What's the point of printing the second 64 bits? Looking at Figure 6-2 in
the spec (6.5.2.2. PASID-cache invalidate descriptor), it does not seem to
contain anything.

Besides I read in the spec:
Domain-ID (DID): The DID field indicates the target domain-id. Hardware
ignores bits 31:(16+N), where N is the domain-id width reported in the
Capability Register.

How do you make sure N is the same on both pIOMMU and vIOMMU?
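
To make the question concrete, here is a minimal standalone sketch (not
code from the patch) of decoding the descriptor fields and masking the DID
to an N-bit width. The bit positions are the ones used by the
VTD_INV_DESC_PASIDC_* macros in this patch. extract64_sketch() is a local
stand-in for QEMU's extract64(); PCDescFields, decode_pc_desc() and the
did_width parameter are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-in for QEMU's extract64(): 'length' bits starting at 'start'. */
static inline uint64_t extract64_sketch(uint64_t value, int start, int length)
{
    assert(start >= 0 && length > 0 && length <= 64 - start);
    return (value >> start) & (~0ULL >> (64 - length));
}

/* Hypothetical container for the decoded PASID-cache descriptor fields. */
typedef struct PCDescFields {
    uint8_t granularity;   /* bits 5:4 of the low qword */
    uint16_t did;          /* bits 31:16, masked down to did_width bits */
    uint32_t pasid;        /* bits 51:32 */
} PCDescFields;

/*
 * Decode the low qword. did_width plays the role of N (the domain-id
 * width): DID bits above it are dropped, mirroring "hardware ignores
 * the upper bits".
 */
static inline PCDescFields decode_pc_desc(uint64_t lo, unsigned did_width)
{
    PCDescFields f;

    assert(did_width >= 1 && did_width <= 16);
    f.granularity = (uint8_t)extract64_sketch(lo, 4, 2);
    f.did = (uint16_t)(extract64_sketch(lo, 16, 16) &
                       ((1u << did_width) - 1));
    f.pasid = (uint32_t)extract64_sketch(lo, 32, 20);
    return f;
}
```

With two different widths the same descriptor can decode to two different
DIDs, which is the mismatch I am worried about if the vIOMMU and the
pIOMMU disagree on N.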


> +        return false;
> +    }
> +
> +    vtd_pasid_cache_sync(s, &pc_info);
> +    return true;
> +}
> +
>  static bool vtd_process_inv_iec_desc(IntelIOMMUState *s,
>                                       VTDInvDesc *inv_desc)
>  {
> @@ -3274,6 +3451,13 @@ static bool vtd_process_inv_desc(IntelIOMMUState *s)
>          }
>          break;
>  
> +    case VTD_INV_DESC_PC:
> +        trace_vtd_inv_desc("pasid-cache", inv_desc.val[1], inv_desc.val[0]);
same here
> +        if (!vtd_process_pasid_desc(s, &inv_desc)) {
> +            return false;
> +        }
> +        break;
> +
>      case VTD_INV_DESC_PIOTLB:
>          trace_vtd_inv_desc("p-iotlb", inv_desc.val[1], inv_desc.val[0]);
>          if (!vtd_process_piotlb_desc(s, &inv_desc)) {
> @@ -3309,16 +3493,6 @@ static bool vtd_process_inv_desc(IntelIOMMUState *s)
>          }
>          break;
>  
> -    /*
> -     * TODO: the entity of below two cases will be implemented in future series.
> -     * To make guest (which integrates scalable mode support patch set in
> -     * iommu driver) work, just return true is enough so far.
> -     */
> -    case VTD_INV_DESC_PC:
> -        if (s->scalable_mode) {
> -            break;
> -        }
> -    /* fallthrough */
>      default:
>          error_report_once("%s: invalid inv desc: hi=%"PRIx64", lo=%"PRIx64
>                            " (unknown type)", __func__, inv_desc.hi,
> diff --git a/hw/i386/trace-events b/hw/i386/trace-events
> index ac9e1a10aa..ae5bbfcdc0 100644
> --- a/hw/i386/trace-events
> +++ b/hw/i386/trace-events
> @@ -24,6 +24,9 @@ vtd_inv_qi_head(uint16_t head) "read head %d"
>  vtd_inv_qi_tail(uint16_t head) "write tail %d"
>  vtd_inv_qi_fetch(void) ""
>  vtd_context_cache_reset(void) ""
> +vtd_pasid_cache_gsi(void) ""
> +vtd_pasid_cache_dsi(uint16_t domain) "Domain selective PC invalidation domain 0x%"PRIx16
> +vtd_pasid_cache_psi(uint16_t domain, uint32_t pasid) "PASID selective PC invalidation domain 0x%"PRIx16" pasid 0x%"PRIx32
>  vtd_re_not_present(uint8_t bus) "Root entry bus %"PRIu8" not present"
>  vtd_ce_not_present(uint8_t bus, uint8_t devfn) "Context entry bus %"PRIu8" devfn %"PRIu8" not present"
>  vtd_iotlb_page_hit(uint16_t sid, uint64_t addr, uint64_t slpte, uint16_t domain) "IOTLB page hit sid 0x%"PRIx16" iova 0x%"PRIx64" slpte 0x%"PRIx64" domain 0x%"PRIx16
Other than that, the code looks good to me.

Eric



^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v5 20/21] Workaround for ERRATA_772415_SPR17
  2025-08-27 11:56     ` Yi Liu
@ 2025-08-27 15:09       ` Nicolin Chen
  2025-08-29  8:16         ` Yi Liu
  0 siblings, 1 reply; 113+ messages in thread
From: Nicolin Chen @ 2025-08-27 15:09 UTC (permalink / raw)
  To: Yi Liu
  Cc: Zhenzhong Duan, qemu-devel, alex.williamson, clg, eric.auger, mst,
	jasowang, peterx, ddutile, jgg, joao.m.martins,
	clement.mathieu--drif, kevin.tian, chao.p.peng

On Wed, Aug 27, 2025 at 07:56:38PM +0800, Yi Liu wrote:
> On 2025/8/23 07:55, Nicolin Chen wrote:
> > > +    if (vtd.flags & IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17) {
> > > +        container->bcontainer.bypass_ro = true;
> > 
> > This circles back to checking a vendor-specific flag in the core...
> > 
> > Perhaps we could upgrade the get_viommu_cap op and its API:
> > 
> > enum viommu_flags {
> >      VIOMMU_FLAG_HW_NESTED = BIT_ULL(0),
> >      VIOMMU_FLAG_BYPASS_RO = BIT_ULL(1),
> 
> hmmm. I'm not quite on this idea as the two flags have different sources.
> One determined by vIOMMU config, one by the hardware limit. Reporting
> them in one API is strange.

It's fair enough that we want to make such a clear boundary between
a vIOMMU flag and a HW IOMMU flag of the same vendor..

> I think the bypass RO can be determined in
> VFIO just like the patch has done. But it should check if vIOMMU has
> requested nested hwpt and also the reported hw_info::type is
> IOMMU_HW_INFO_TYPE_INTEL_VTD.
> 
> 	if ((flags & IOMMU_HWPT_ALLOC_NEST_PARENT) &&
>             type == IOMMU_HW_INFO_TYPE_INTEL_VTD &&
>             vtd.flags & IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17) {
>             container->bcontainer.bypass_ro = true;
>          }

Then, it feels odd to me that we don't have a clear boundary between
a generic flag and a vendor flag :-/

It's fine if we want to keep all the host-level vendor flags outside
the vIOMMU code, but at least could we please have a generic looking
function outside this iommufd_cdev_autodomains_get() to translate a
vendor flag to a generic looking flag?

We could start with a function that loads the HostIOMMUDeviceCaps (or
just VendorCaps) dealing with vendor types and outputs generic ones:

        host_iommu_flags = host_iommu_decode_vendor_caps(&vendor_caps);

        if (hwpt_flags & IOMMU_HWPT_ALLOC_NEST_PARENT &&
            host_iommu_flags & HOST_IOMMU_FLAG_BYPASS_RO) {
             container->bcontainer.bypass_ro = true;
        }

Over time, it can even grow into a separate file, if there are more
vendor-specific requirements.
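
A slightly fuller standalone sketch of that idea, with everything stubbed
out so it compiles on its own. All SKETCH_* constants and the
SketchVendorCaps type are placeholders I made up for illustration (the
real definitions live in the iommufd uAPI header and in
HostIOMMUDeviceCaps), and host_iommu_decode_vendor_caps() and
HOST_IOMMU_FLAG_BYPASS_RO are only the proposed names, not existing API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder stand-ins for the uAPI definitions. */
enum { SKETCH_HW_INFO_TYPE_NONE = 0, SKETCH_HW_INFO_TYPE_INTEL_VTD = 1 };
#define SKETCH_VTD_ERRATA_772415_SPR17  (1u << 0)
#define SKETCH_HWPT_ALLOC_NEST_PARENT   (1u << 0)

/* Proposed generic flag (hypothetical). */
#define HOST_IOMMU_FLAG_BYPASS_RO       (1ULL << 0)

/* Hypothetical, trimmed-down view of the vendor capability data. */
typedef struct SketchVendorCaps {
    uint32_t type;       /* hw_info type reported by the host */
    uint32_t vtd_flags;  /* only meaningful when type is VT-d */
} SketchVendorCaps;

/* Translate vendor-specific bits into generic host IOMMU flags. */
static uint64_t host_iommu_decode_vendor_caps(const SketchVendorCaps *caps)
{
    uint64_t flags = 0;

    assert(caps != NULL);
    switch (caps->type) {
    case SKETCH_HW_INFO_TYPE_INTEL_VTD:
        if (caps->vtd_flags & SKETCH_VTD_ERRATA_772415_SPR17) {
            flags |= HOST_IOMMU_FLAG_BYPASS_RO;
        }
        break;
    default:
        /* Other vendors would add their own cases here. */
        break;
    }
    return flags;
}

/* The core then only looks at generic flags. */
static bool sketch_want_bypass_ro(uint32_t hwpt_flags,
                                  const SketchVendorCaps *caps)
{
    return (hwpt_flags & SKETCH_HWPT_ALLOC_NEST_PARENT) &&
           (host_iommu_decode_vendor_caps(caps) & HOST_IOMMU_FLAG_BYPASS_RO);
}
```

The point is only that the vendor switch sits in one helper, so the
caller never has to test a vendor type or an errata bit directly.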

Nicolin


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v5 02/21] hw/pci: Introduce pci_device_get_viommu_cap()
  2025-08-27 12:32         ` Eric Auger
@ 2025-08-27 15:30           ` Nicolin Chen
  2025-08-28  8:26             ` Yi Liu
  0 siblings, 1 reply; 113+ messages in thread
From: Nicolin Chen @ 2025-08-27 15:30 UTC (permalink / raw)
  To: Eric Auger
  Cc: Yi Liu, Zhenzhong Duan, qemu-devel, alex.williamson, clg, mst,
	jasowang, peterx, ddutile, jgg, joao.m.martins,
	clement.mathieu--drif, kevin.tian, chao.p.peng

On Wed, Aug 27, 2025 at 02:32:42PM +0200, Eric Auger wrote:
> On 8/27/25 2:30 PM, Yi Liu wrote:
> > On 2025/8/27 19:22, Eric Auger wrote:
> >>> TBH. I'm hesitating to name it as get_viommu_cap. The scope is a little
> >>> larger than what we want so far. So I'm wondering if it can be done
> >>> in a
> >>> more straightforward way. e.g. just a bool op named
> >>> iommu_nested_wanted(). Just an example, maybe better naming. We can
> >>> extend the op to be returning a u64 value in the future when we see
> >>> another request on VFIO from vIOMMU.
> >> personnally I am fine with the bitmask which looks more future proof.
> >
> > not quite sure if there is another info that needs to be checked in
> > this "VFIO asks vIOMMU" manner. Have you seen one beside this
> > nested hwpt requirement by vIOMMU?
> 
> I don't remember any at this point. But I guess with ARM CCA device
> passthrough we might have other needs

Yea. A Realm vSMMU instance won't allocate IOAS/HWPT. So it will
ask the core to bypass those allocations, via the same op.

I don't know: would "get_viommu_flags" be a better fit, since it gives
a clear meaning of "want"?

  VIOMMU_FLAG_WANT_NESTING_PARENT
  VIOMMU_FLAG_WANT_NO_IOAS

At least, the 2nd one being a "cap" wouldn't sound nice to me..
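
Roughly what I have in mind, as a standalone sketch rather than a patch.
The two flag names are the ones proposed above; the bit values, the
SketchVIOMMUOps struct and all sketch_* helpers are hypothetical and only
stand in for PCIIOMMUOps and its wrapper:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Proposed "want" flags; the bit assignments here are an assumption. */
#define VIOMMU_FLAG_WANT_NESTING_PARENT  (1ULL << 0)
#define VIOMMU_FLAG_WANT_NO_IOAS         (1ULL << 1)

/* Hypothetical stand-in for the optional op in PCIIOMMUOps. */
typedef struct SketchVIOMMUOps {
    uint64_t (*get_viommu_flags)(void *opaque);
} SketchVIOMMUOps;

/* Wrapper: a missing vIOMMU or a missing op means "no wishes". */
static uint64_t sketch_get_viommu_flags(const SketchVIOMMUOps *ops,
                                        void *opaque)
{
    if (ops && ops->get_viommu_flags) {
        return ops->get_viommu_flags(opaque);
    }
    return 0;
}

/* True when every bit in 'want' is set in 'flags'. */
static bool sketch_viommu_wants(uint64_t flags, uint64_t want)
{
    assert(want != 0);
    return (flags & want) == want;
}

/* Example callback: a vIOMMU that asks for a nesting parent HWPT. */
static uint64_t sketch_nested_viommu_flags(void *opaque)
{
    (void)opaque;
    return VIOMMU_FLAG_WANT_NESTING_PARENT;
}
```

The core would call the wrapper once at attach time and branch on the
returned bits, so a later request such as WANT_NO_IOAS would not need a
new op.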

Nicolin


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v5 12/21] intel_iommu: Handle PASID entry addition
  2025-08-22  6:40 ` [PATCH v5 12/21] intel_iommu: Handle PASID entry addition Zhenzhong Duan
@ 2025-08-27 16:22   ` Eric Auger
  2025-09-01  9:03     ` Duan, Zhenzhong
  2025-08-29  5:46   ` Yi Liu
  1 sibling, 1 reply; 113+ messages in thread
From: Eric Auger @ 2025-08-27 16:22 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, mst, jasowang, peterx, ddutile, jgg,
	nicolinc, joao.m.martins, clement.mathieu--drif, kevin.tian,
	yi.l.liu, chao.p.peng, Yi Sun

Hi Zhenzhong,

On 8/22/25 8:40 AM, Zhenzhong Duan wrote:
> When guest creates new PASID entries, QEMU will capture the guest pasid
> selective pasid cache invalidation, walk through each passthrough device
> and each pasid, when a match is found, identify an existing vtd_as or
> create a new one and update its corresponding cached pasid entry.

You need to emphasize that the support is currently limited to
Requests-without-PASID.
>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>  hw/i386/intel_iommu_internal.h |   2 +
>  hw/i386/intel_iommu.c          | 176 ++++++++++++++++++++++++++++++++-
>  2 files changed, 175 insertions(+), 3 deletions(-)
>
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> index b9b76dd996..fb2a919e87 100644
> --- a/hw/i386/intel_iommu_internal.h
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -559,6 +559,7 @@ typedef struct VTDRootEntry VTDRootEntry;
>  #define VTD_CTX_ENTRY_LEGACY_SIZE     16
>  #define VTD_CTX_ENTRY_SCALABLE_SIZE   32
>  
> +#define VTD_SM_CONTEXT_ENTRY_PDTS(x)        extract64((x)->val[0], 9, 3)
>  #define VTD_SM_CONTEXT_ENTRY_RID2PASID_MASK 0xfffff
>  #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw)  (0x1e0ULL | ~VTD_HAW_MASK(aw))
>  #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1      0xffffffffffe00000ULL
> @@ -589,6 +590,7 @@ typedef struct VTDPASIDCacheInfo {
>  #define VTD_PASID_TABLE_BITS_MASK     (0x3fULL)
>  #define VTD_PASID_TABLE_INDEX(pasid)  ((pasid) & VTD_PASID_TABLE_BITS_MASK)
>  #define VTD_PASID_ENTRY_FPD           (1ULL << 1) /* Fault Processing Disable */
> +#define VTD_PASID_TBL_ENTRY_NUM       (1ULL << 6)
>  
>  /* PASID Granular Translation Type Mask */
>  #define VTD_PASID_ENTRY_P              1ULL
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index a2ee6d684e..7d2c9feae7 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -826,6 +826,11 @@ static inline bool vtd_pe_type_check(IntelIOMMUState *s, VTDPASIDEntry *pe)
>      }
>  }
>  
> +static inline uint32_t vtd_sm_ce_get_pdt_entry_num(VTDContextEntry *ce)
> +{
> +    return 1U << (VTD_SM_CONTEXT_ENTRY_PDTS(ce) + 7);
> +}
> +
>  static inline bool vtd_pdire_present(VTDPASIDDirEntry *pdire)
>  {
>      return pdire->val & 1;
> @@ -1647,9 +1652,9 @@ static gboolean vtd_find_as_by_sid_and_iommu_pasid(gpointer key, gpointer value,
>  }
>  
>  /* Translate iommu pasid to vtd_as */
> -static inline
> -VTDAddressSpace *vtd_as_from_iommu_pasid_locked(IntelIOMMUState *s,
> -                                                uint16_t sid, uint32_t pasid)
> +static VTDAddressSpace *vtd_as_from_iommu_pasid_locked(IntelIOMMUState *s,
> +                                                       uint16_t sid,
> +                                                       uint32_t pasid)
>  {
>      struct vtd_as_raw_key key = {
>          .sid = sid,
> @@ -3220,10 +3225,172 @@ remove:
>      return true;
>  }
>  
> +/*
> + * This function walks over PASID range within [start, end) in a single
> + * PASID table for entries matching @info type/did, then retrieve/create
> + * vtd_as and fill associated pasid entry cache.
> + */
> +static void vtd_sm_pasid_table_walk_one(IntelIOMMUState *s,
> +                                        dma_addr_t pt_base,
> +                                        int start,
> +                                        int end,
> +                                        VTDPASIDCacheInfo *info)
> +{
> +    VTDPASIDEntry pe;
> +    int pasid = start;
> +
> +    while (pasid < end) {
> +        if (!vtd_get_pe_in_pasid_leaf_table(s, pasid, pt_base, &pe)
> +            && vtd_pe_present(&pe)) {
> +            int bus_n = pci_bus_num(info->bus), devfn = info->devfn;
> +            uint16_t sid = PCI_BUILD_BDF(bus_n, devfn);
> +            VTDPASIDCacheEntry *pc_entry;
> +            VTDAddressSpace *vtd_as;
> +
> +            vtd_iommu_lock(s);
> +            /*
> +             * When indexed by rid2pasid, vtd_as should have been created,
> +             * e.g., by PCI subsystem. For other iommu pasid, we need to
> +             * create vtd_as dynamically. Other iommu pasid is same value
Since you don't support anything other than rid2pasid, I would drop that
and simplify the code. See below.
> +             * as PCI's pasid, so it's used as input of vtd_find_add_as().
> +             */
> +            vtd_as = vtd_as_from_iommu_pasid_locked(s, sid, pasid);
> +            vtd_iommu_unlock(s);
> +            if (!vtd_as) {
> +                vtd_as = vtd_find_add_as(s, info->bus, devfn, pasid);
You could check that the vtd_as already exists here, given the
rid2pasid-only support limitation.
> +            }
> +
> +            if ((info->type == VTD_PASID_CACHE_DOMSI ||
> +                 info->type == VTD_PASID_CACHE_PASIDSI) &&
> +                (info->did != VTD_SM_PASID_ENTRY_DID(&pe))) {
> +                /*
> +                 * VTD_PASID_CACHE_DOMSI and VTD_PASID_CACHE_PASIDSI
> +                 * requires domain id check. If domain id check fail,
fails
> +                 * go to next pasid.
> +                 */
> +                pasid++;
> +                continue;
> +            }
> +
> +            pc_entry = &vtd_as->pasid_cache_entry;
> +            /*
> +             * pasid cache update and clear are handled in
> +             * vtd_flush_pasid_locked(), only care new pasid entry here.
> +             */
> +            if (!pc_entry->valid) {
> +                pc_entry->pasid_entry = pe;
> +                pc_entry->valid = true;
> +            }
> +        }
> +        pasid++;
> +    }
> +}
> +
> +/*
> + * In VT-d scalable mode translation, PASID dir + PASID table is used.
> + * This function aims at looping over a range of PASIDs in the given
> + * two level table to identify the pasid config in guest.
> + */
> +static void vtd_sm_pasid_table_walk(IntelIOMMUState *s,
> +                                    dma_addr_t pdt_base,
> +                                    int start, int end,
> +                                    VTDPASIDCacheInfo *info)
> +{
> +    VTDPASIDDirEntry pdire;
> +    int pasid = start;
> +    int pasid_next;
> +    dma_addr_t pt_base;
> +
> +    while (pasid < end) {
> +        pasid_next =
> +             (pasid + VTD_PASID_TBL_ENTRY_NUM) & ~(VTD_PASID_TBL_ENTRY_NUM - 1);
> +        pasid_next = pasid_next < end ? pasid_next : end;
> +
> +        if (!vtd_get_pdire_from_pdir_table(pdt_base, pasid, &pdire)
> +            && vtd_pdire_present(&pdire)) {
> +            pt_base = pdire.val & VTD_PASID_TABLE_BASE_ADDR_MASK;
> +            vtd_sm_pasid_table_walk_one(s, pt_base, pasid, pasid_next, info);
> +        }
> +        pasid = pasid_next;
> +    }
> +}
> +
> +static void vtd_replay_pasid_bind_for_dev(IntelIOMMUState *s,
> +                                          int start, int end,
> +                                          VTDPASIDCacheInfo *info)
> +{
> +    VTDContextEntry ce;
> +
> +    if (!vtd_dev_to_context_entry(s, pci_bus_num(info->bus), info->devfn,
> +                                  &ce)) {
> +        uint32_t max_pasid;
> +
> +        max_pasid = vtd_sm_ce_get_pdt_entry_num(&ce) * VTD_PASID_TBL_ENTRY_NUM;
> +        if (end > max_pasid) {
> +            end = max_pasid;
> +        }
> +        vtd_sm_pasid_table_walk(s,
> +                                VTD_CE_GET_PASID_DIR_TABLE(&ce),
> +                                start,
> +                                end,
> +                                info);
> +    }
> +}
> +
> +/*
> + * This function replays the guest pasid bindings by walking the two level
> + * guest PASID table. For each valid pasid entry, it finds or creates a
> + * vtd_as and caches pasid entry in vtd_as.
> + */
> +static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
> +                                            VTDPASIDCacheInfo *pc_info)
> +{
> +    /*
> +     * Currently only Requests-without-PASID is supported, as vIOMMU doesn't
> +     * support RPS(RID-PASID Support), pasid scope is fixed to [0, 1).
> +     */
> +    int start = 0, end = 1;
> +    VTDHostIOMMUDevice *vtd_hiod;
> +    VTDPASIDCacheInfo walk_info;
> +    GHashTableIter as_it;
> +
> +    switch (pc_info->type) {
> +    case VTD_PASID_CACHE_PASIDSI:
> +        start = pc_info->pasid;
> +        end = pc_info->pasid + 1;
if you never replay a range, you could simplify the code for now because
some code paths are not properly tested
> +       /* fall through */
> +    case VTD_PASID_CACHE_DOMSI:
Why can't we have other invalidation types along with request without
PASID? It is not obvious to me at least why it couldn't be used by the
guest. Would deserve a comment in the commit desc I think.
> +    case VTD_PASID_CACHE_GLOBAL_INV:
> +        /* loop all assigned devices */
> +        break;
> +    default:
> +        error_setg(&error_fatal, "invalid pc_info->type for replay");
> +    }
> +
> +    /*
> +     * In this replay, one only needs to care about the devices which are
> +     * backed by host IOMMU. Those devices have a corresponding vtd_hiod
> +     * in s->vtd_host_iommu_dev. For devices not backed by host IOMMU, it
> +     * is not necessary to replay the bindings since their cache could be
> +     * re-created in the future DMA address translation.
> +     *
> +     * VTD translation callback never accesses vtd_hiod and its corresponding
> +     * cached pasid entry, so no iommu lock needed here.
> +     */
> +    walk_info = *pc_info;
> +    g_hash_table_iter_init(&as_it, s->vtd_host_iommu_dev);
> +    while (g_hash_table_iter_next(&as_it, NULL, (void **)&vtd_hiod)) {
> +        walk_info.bus = vtd_hiod->bus;
> +        walk_info.devfn = vtd_hiod->devfn;
> +        vtd_replay_pasid_bind_for_dev(s, start, end, &walk_info);
> +    }
> +}
> +
>  /*
>   * For a PASID cache invalidation, this function handles below scenarios:
>   * a) a present cached pasid entry needs to be removed
>   * b) a present cached pasid entry needs to be updated
> + * c) a present cached pasid entry needs to be created
>   */
>  static void vtd_pasid_cache_sync(IntelIOMMUState *s, VTDPASIDCacheInfo *pc_info)
>  {
> @@ -3239,6 +3406,9 @@ static void vtd_pasid_cache_sync(IntelIOMMUState *s, VTDPASIDCacheInfo *pc_info)
>      g_hash_table_foreach_remove(s->vtd_address_spaces, vtd_flush_pasid_locked,
>                                  pc_info);
>      vtd_iommu_unlock(s);
> +
> +    /* c): loop all passthrough device for new pasid entries */
> +    vtd_replay_guest_pasid_bindings(s, pc_info);
>  }
>  
>  static bool vtd_process_pasid_desc(IntelIOMMUState *s,
Thanks

Eric
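
For readers following the range handling in vtd_sm_pasid_table_walk(), here is a minimal standalone sketch of the chunking arithmetic only. It assumes 64 entries per leaf PASID table; the toy_/TOY_ names are hypothetical and not part of QEMU. Each step advances to the next leaf-table boundary and clamps to the end of the range, so a single-PASID range such as [0, 1) yields exactly one chunk.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for this sketch: 64 PASID entries per leaf table (power of two). */
#define TOY_PASID_TBL_ENTRY_NUM 64u

/*
 * Return the end of the chunk starting at @pasid, so that [pasid, result)
 * stays inside one leaf table and never goes past @end.
 */
static uint32_t toy_pasid_chunk_end(uint32_t pasid, uint32_t end)
{
    uint32_t next;

    assert(pasid < end);
    /* Round up to the next multiple of the table size. */
    next = (pasid + TOY_PASID_TBL_ENTRY_NUM) & ~(TOY_PASID_TBL_ENTRY_NUM - 1);
    return next < end ? next : end;
}

/* Count how many leaf tables a walk over [start, end) would visit. */
static unsigned toy_pasid_walk_chunks(uint32_t start, uint32_t end)
{
    unsigned chunks = 0;
    uint32_t pasid = start;

    while (pasid < end) {
        pasid = toy_pasid_chunk_end(pasid, end);
        chunks++;
    }
    return chunks;
}
```

With the range fixed to [0, 1), the loop body runs once, which is why the range walk could be collapsed for now if only rid2pasid is supported.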



^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v5 13/21] intel_iommu: Introduce a new pasid cache invalidation type FORCE_RESET
  2025-08-22  6:40 ` [PATCH v5 13/21] intel_iommu: Introduce a new pasid cache invalidation type FORCE_RESET Zhenzhong Duan
@ 2025-08-27 16:28   ` Eric Auger
  2025-08-29  5:56     ` Yi Liu
  2025-09-01  9:04     ` Duan, Zhenzhong
  0 siblings, 2 replies; 113+ messages in thread
From: Eric Auger @ 2025-08-27 16:28 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, mst, jasowang, peterx, ddutile, jgg,
	nicolinc, joao.m.martins, clement.mathieu--drif, kevin.tian,
	yi.l.liu, chao.p.peng, Yi Sun

Hi Zhenzhong,

On 8/22/25 8:40 AM, Zhenzhong Duan wrote:
> FORCE_RESET is different from GLOBAL_INV which updates pasid cache if
> underlying pasid entry is still valid, it drops all the pasid caches.
>
> FORCE_RESET isn't a VTD spec defined invalidation type for pasid cache,
> only used internally in system level reset.
>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>  hw/i386/intel_iommu_internal.h |  9 +++++++++
>  hw/i386/intel_iommu.c          | 25 +++++++++++++++++++++++++
>  hw/i386/trace-events           |  1 +
>  3 files changed, 35 insertions(+)
>
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> index fb2a919e87..c510b09d1a 100644
> --- a/hw/i386/intel_iommu_internal.h
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -569,6 +569,15 @@ typedef enum VTDPCInvType {
>      VTD_PASID_CACHE_DOMSI = VTD_INV_DESC_PASIDC_G_DSI,
>      VTD_PASID_CACHE_PASIDSI = VTD_INV_DESC_PASIDC_G_PASID_SI,
>      VTD_PASID_CACHE_GLOBAL_INV = VTD_INV_DESC_PASIDC_G_GLOBAL,
> +
> +    /*
> +     * Internally used PASID cache invalidation type starts here,
> +     * 0x10 is large enough as invalidation type in pc_inv_desc
> +     * is 2bits in size.
> +     */
> +
> +    /* Reset all PASID cache entries, used in system level reset */
> +    VTD_PASID_CACHE_FORCE_RESET = 0x10,
I am not very keen on adding such an artificial enum value that does not
exist in the spec.

Why not simply introduce another function (instead of
vtd_flush_pasid_locked) that does the cleanup? To me it would be
cleaner. Thanks Eric
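
A rough sketch of that shape, using a toy array instead of the real hash table: the selective flush keeps its spec-defined matching, while reset gets its own helper and needs no extra invalidation type. ToyPasidCache and both toy_ functions are hypothetical stand-ins, not QEMU code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a cached PASID entry. */
typedef struct {
    uint16_t did;   /* domain id recorded in the cached entry */
    bool valid;
} ToyPasidCache;

/* Spec-style selective flush: drop only valid entries matching @did. */
static size_t toy_flush_domain(ToyPasidCache *c, size_t n, uint16_t did)
{
    size_t dropped = 0;

    assert(c || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (c[i].valid && c[i].did == did) {
            c[i].valid = false;
            dropped++;
        }
    }
    return dropped;
}

/* Reset path: a separate helper that drops every valid entry. */
static size_t toy_reset_all(ToyPasidCache *c, size_t n)
{
    size_t dropped = 0;

    assert(c || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (c[i].valid) {
            c[i].valid = false;
            dropped++;
        }
    }
    return dropped;
}
```

The point of the split is that the invalidation-type enum can then mirror the spec exactly, and the reset path no longer has to pass through the type switch.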
>  } VTDPCInvType;
>  
>  typedef struct VTDPASIDCacheInfo {
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 7d2c9feae7..af384ce7f0 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -87,6 +87,8 @@ struct vtd_iotlb_key {
>  static void vtd_address_space_refresh_all(IntelIOMMUState *s);
>  static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
>  
> +static void vtd_pasid_cache_reset_locked(IntelIOMMUState *s);
> +
>  static void vtd_panic_require_caching_mode(void)
>  {
>      error_report("We need to set caching-mode=on for intel-iommu to enable "
> @@ -391,6 +393,7 @@ static void vtd_reset_caches(IntelIOMMUState *s)
>      vtd_iommu_lock(s);
>      vtd_reset_iotlb_locked(s);
>      vtd_reset_context_cache_locked(s);
> +    vtd_pasid_cache_reset_locked(s);
>      vtd_iommu_unlock(s);
>  }
>  
> @@ -3183,6 +3186,8 @@ static gboolean vtd_flush_pasid_locked(gpointer key, gpointer value,
>          /* fall through */
>      case VTD_PASID_CACHE_GLOBAL_INV:
>          break;
> +    case VTD_PASID_CACHE_FORCE_RESET:
> +        goto remove;
>      default:
>          error_setg(&error_fatal, "invalid pc_info->type for flush");
>      }
> @@ -3225,6 +3230,23 @@ remove:
>      return true;
>  }
>  
> +static void vtd_pasid_cache_reset_locked(IntelIOMMUState *s)
> +{
> +    VTDPASIDCacheInfo pc_info;
> +
> +    trace_vtd_pasid_cache_reset();
> +
> +    pc_info.type = VTD_PASID_CACHE_FORCE_RESET;
> +
> +    /*
> +     * Reset pasid cache is a big hammer, so use g_hash_table_foreach_remove
> +     * which will free all vtd_as instances except those created for PCI
> +     * sub-system.
> +     */
> +    g_hash_table_foreach_remove(s->vtd_address_spaces,
> +                                vtd_flush_pasid_locked, &pc_info);
> +}
> +
>  /*
>   * This function walks over PASID range within [start, end) in a single
>   * PASID table for entries matching @info type/did, then retrieve/create
> @@ -3363,6 +3385,9 @@ static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
>      case VTD_PASID_CACHE_GLOBAL_INV:
>          /* loop all assigned devices */
>          break;
> +    case VTD_PASID_CACHE_FORCE_RESET:
> +        /* For force reset, no need to go further replay */
> +        return;
>      default:
>          error_setg(&error_fatal, "invalid pc_info->type for replay");
>      }
> diff --git a/hw/i386/trace-events b/hw/i386/trace-events
> index ae5bbfcdc0..c8a936eb46 100644
> --- a/hw/i386/trace-events
> +++ b/hw/i386/trace-events
> @@ -24,6 +24,7 @@ vtd_inv_qi_head(uint16_t head) "read head %d"
>  vtd_inv_qi_tail(uint16_t head) "write tail %d"
>  vtd_inv_qi_fetch(void) ""
>  vtd_context_cache_reset(void) ""
> +vtd_pasid_cache_reset(void) ""
>  vtd_pasid_cache_gsi(void) ""
>  vtd_pasid_cache_dsi(uint16_t domain) "Domain selective PC invalidation domain 0x%"PRIx16
>  vtd_pasid_cache_psi(uint16_t domain, uint32_t pasid) "PASID selective PC invalidation domain 0x%"PRIx16" pasid 0x%"PRIx32



^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v5 07/21] intel_iommu: Introduce a new structure VTDHostIOMMUDevice
  2025-08-26 17:21   ` Nicolin Chen
  2025-08-27  6:45     ` Duan, Zhenzhong
@ 2025-08-27 16:36     ` Eric Auger
  2025-08-27 16:57       ` Nicolin Chen
  1 sibling, 1 reply; 113+ messages in thread
From: Eric Auger @ 2025-08-27 16:36 UTC (permalink / raw)
  To: Nicolin Chen, Zhenzhong Duan, yi.l.liu
  Cc: qemu-devel, alex.williamson, clg, mst, jasowang, peterx, ddutile,
	jgg, joao.m.martins, clement.mathieu--drif, kevin.tian,
	chao.p.peng

Hi Nicolin,

On 8/26/25 7:21 PM, Nicolin Chen wrote:
> Hi Zhenzhong/Yi,
>
> On Fri, Aug 22, 2025 at 02:40:45AM -0400, Zhenzhong Duan wrote:
>> @@ -4371,6 +4374,7 @@ static bool vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
>>                                       HostIOMMUDevice *hiod, Error **errp)
>>  {
>>      IntelIOMMUState *s = opaque;
>> +    VTDHostIOMMUDevice *vtd_hiod;
>>      struct vtd_as_key key = {
>>          .bus = bus,
>>          .devfn = devfn,
> I wonder if the bus/devfn here would always reflect the actual BDF
> numbers in this function, on an x86 VM.
>
> With ARM, when the device is attached to a pxb bus, the bus/devfn
> here are both 0, so PCI_BUILD_BDF() using these two returns 0 too.
>
> QEMU command for the device:
>  -device pxb-pcie,id=pcie.1,bus_nr=1,bus=pcie.0 \
>  -device arm-smmuv3,primary-bus=pcie.1,id=smmuv3.1,accel=on \
>  -device pcie-root-port,id=pcie.port1,bus=pcie.1,chassis=1,io-reserve=0 \
>  -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.port1,rombar=0,id=dev0,iommufd=iommufd0
>
> QEMU log:
> smmuv3_accel_set_iommu_device: bus=0, devfn=0, sid=0
>
> The set_iommu_device op is invoked by vfio_pci_realize() where the
> the BDF number won't get ready for this kind of PCI setup until a
> later stage that I can't identify yet..
>
> Given that VTD wants the BDF number too, I start to wonder whether
> the set_iommu_device op is invoked in the right place or not..
>
> Maybe VTD works because it saves the bus pointer v.s. bus_num(=0),
> so its bus_num would be updated when later code calculates the BDF
> number using the saved bus pointer (in the key). Nonetheless, the
> saved devfn (in the key) is 0, which wouldn't be updated later as
> the bus_num. So, if the device is supposed to have a devfn (!=0),
> this wouldn't work?

In hw/arm/smmu-common.c, next to smmu_find_smmu_pcibus(), there is a
comment about the late computation of the bus number. That looks like a
safe place where the bus_num is known.

Thanks

Eric
>
> Thanks
> Nicolin
>



^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v5 07/21] intel_iommu: Introduce a new structure VTDHostIOMMUDevice
  2025-08-27 16:36     ` Eric Auger
@ 2025-08-27 16:57       ` Nicolin Chen
  0 siblings, 0 replies; 113+ messages in thread
From: Nicolin Chen @ 2025-08-27 16:57 UTC (permalink / raw)
  To: Eric Auger
  Cc: Zhenzhong Duan, yi.l.liu, qemu-devel, alex.williamson, clg, mst,
	jasowang, peterx, ddutile, jgg, joao.m.martins,
	clement.mathieu--drif, kevin.tian, chao.p.peng

Hi Eric,

On Wed, Aug 27, 2025 at 06:36:09PM +0200, Eric Auger wrote:
> On 8/26/25 7:21 PM, Nicolin Chen wrote:
> > QEMU log:
> > smmuv3_accel_set_iommu_device: bus=0, devfn=0, sid=0
> >
> > The set_iommu_device op is invoked by vfio_pci_realize() where the
> > the BDF number won't get ready for this kind of PCI setup until a
> > later stage that I can't identify yet..
> >
> > Given that VTD wants the BDF number too, I start to wonder whether
> > the set_iommu_device op is invoked in the right place or not..
> >
> > Maybe VTD works because it saves the bus pointer v.s. bus_num(=0),
> > so its bus_num would be updated when later code calculates the BDF
> > number using the saved bus pointer (in the key). Nonetheless, the
> > saved devfn (in the key) is 0, which wouldn't be updated later as
> > the bus_num. So, if the device is supposed to have a devfn (!=0),
> > this wouldn't work?
> 
> in hw/arm/smmu-common.c, along with smmu_find_smmu_pcibus() there is a
> comment about late computation of bus number. This looks like a safe
> place where the bus_num is known.

Yea, sid is a parameter of that smmu_find_smmu_pcibus() function,
so bus_num must be known.

What I want here is to allocate a vDEVICE (needs vSID), as early
as possible. This will be potentially a requirement for CCA.

Thanks
Nicolin
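
To make the timing issue above concrete, here is a small standalone sketch. It assumes a bus whose number is only assigned after the device key was created; ToyBus, ToyDevKey and toy_key_sid() are hypothetical, and TOY_BUILD_BDF only mirrors the shape of PCI_BUILD_BDF(). A key that stores the bus pointer sees the later renumbering, whereas a SID computed at realize time would stay stale.

```c
#include <assert.h>
#include <stdint.h>

/* Same shape as PCI_BUILD_BDF(); redefined so the sketch builds standalone. */
#define TOY_BUILD_BDF(bus, devfn) ((uint16_t)(((bus) << 8) | (devfn)))

/* Hypothetical bus object whose number is assigned late. */
typedef struct {
    uint8_t bus_num;
} ToyBus;

/* Key holding the bus pointer rather than a snapshot of the bus number. */
typedef struct {
    const ToyBus *bus;
    uint8_t devfn;
} ToyDevKey;

/* Compute the source id lazily, at the time it is actually needed. */
static uint16_t toy_key_sid(const ToyDevKey *key)
{
    assert(key && key->bus);
    return TOY_BUILD_BDF(key->bus->bus_num, key->devfn);
}
```

This only illustrates the bus-number half of the question; a devfn captured too early would not be refreshed by this scheme.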


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v5 14/21] intel_iommu: Stick to system MR for IOMMUFD backed host device when x-fls=on
  2025-08-22  6:40 ` [PATCH v5 14/21] intel_iommu: Stick to system MR for IOMMUFD backed host device when x-fls=on Zhenzhong Duan
@ 2025-08-27 17:14   ` Eric Auger
  2025-08-29  6:06   ` Yi Liu
  1 sibling, 0 replies; 113+ messages in thread
From: Eric Auger @ 2025-08-27 17:14 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, mst, jasowang, peterx, ddutile, jgg,
	nicolinc, joao.m.martins, clement.mathieu--drif, kevin.tian,
	yi.l.liu, chao.p.peng

Hi Zhenzhong,

On 8/22/25 8:40 AM, Zhenzhong Duan wrote:
> When guest in scalable mode and x-flts=on, we stick to system MR for IOMMUFD
When x-flts is set on the IOMMU and the guest uses scalable mode, we don't
want to use the IOMMU MR but rather continue using the system MR, or
something along those lines.

To me this deserves more explanation of why we don't want the IOMMU MR
anymore, from a QEMU infrastructure point of view.
> backed host device. Then its default hwpt contains GPA->HPA mappings which is
> used directly if PGTT=PT and used as nested parent if PGTT=FLT. Otherwise
> fallback to original processing.
>
> Suggested-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>  hw/i386/intel_iommu.c | 34 ++++++++++++++++++++++++++++++++++
>  1 file changed, 34 insertions(+)
>
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index af384ce7f0..15582977b8 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -1773,6 +1773,28 @@ static bool vtd_dev_pt_enabled(IntelIOMMUState *s, VTDContextEntry *ce,
>  
>  }
>  
> +static VTDHostIOMMUDevice *vtd_find_hiod_iommufd(IntelIOMMUState *s,
> +                                                 VTDAddressSpace *as)
> +{
> +    struct vtd_as_key key = {
> +        .bus = as->bus,
> +        .devfn = as->devfn,
> +    };
> +    VTDHostIOMMUDevice *vtd_hiod = g_hash_table_lookup(s->vtd_host_iommu_dev,
> +                                                       &key);
> +
> +    if (vtd_hiod && vtd_hiod->hiod &&
> +        object_dynamic_cast(OBJECT(vtd_hiod->hiod),
> +                            TYPE_HOST_IOMMU_DEVICE_IOMMUFD)) {
> +        return vtd_hiod;
> +    }
> +    return NULL;
> +}
> +
> +/*
> + * vtd_switch_address_space() calls vtd_as_pt_enabled() to determine which
> + * MR to switch to. Switch to system MR if return true, iommu MR otherwise.
I would use a proper doc comment and refer to this function first
> + */
>  static bool vtd_as_pt_enabled(VTDAddressSpace *as)
>  {
>      IntelIOMMUState *s;
> @@ -1781,6 +1803,18 @@ static bool vtd_as_pt_enabled(VTDAddressSpace *as)
>      assert(as);
>  
>      s = as->iommu_state;
> +
> +    /*
> +     * When guest in scalable mode and x-flts=on, we stick to system MR
> +     * for IOMMUFD backed host device. Then its default hwpt contains
> +     * GPA->HPA mappings which is used directly if PGTT=PT and used as
> +     * nested parent if PGTT=FLT. Otherwise fallback to original
> +     * processing.
> +     */
> +    if (s->root_scalable && s->flts && vtd_find_hiod_iommufd(s, as)) {
> +        return true;
> +    }
> +
>      if (vtd_dev_to_context_entry(s, pci_bus_num(as->bus), as->devfn,
>                                   &ce)) {
>          /*
Thanks

Eric
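
As a reading aid, the decision added to vtd_as_pt_enabled() can be modelled as a pure function. This is a simplified sketch under the assumption that the pre-existing logic reduces to a single "guest configured pass-through" boolean; ToyState, ToyMR and toy_pick_mr() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { TOY_MR_SYSTEM, TOY_MR_IOMMU } ToyMR;

/* Hypothetical condensed view of the inputs to the decision. */
typedef struct {
    bool root_scalable;   /* guest enabled scalable mode */
    bool flts;            /* x-flts=on */
    bool iommufd_backed;  /* device has an IOMMUFD-backed host IOMMU device */
    bool guest_pt;        /* guest tables request pass-through (assumed summary) */
} ToyState;

/* Pick the memory region the device address space would switch to. */
static ToyMR toy_pick_mr(const ToyState *st)
{
    assert(st);
    /* New rule from the patch: stay on the system MR in this configuration. */
    if (st->root_scalable && st->flts && st->iommufd_backed) {
        return TOY_MR_SYSTEM;
    }
    /* Otherwise fall back to the original processing. */
    return st->guest_pt ? TOY_MR_SYSTEM : TOY_MR_IOMMU;
}
```

Note that the new rule does not consult the guest page-table type at all, which is the part the commit message would ideally justify.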



^ permalink raw reply	[flat|nested] 113+ messages in thread

* RE: [PATCH v5 00/21] intel_iommu: Enable stage-1 translation for passthrough device
  2025-08-27 11:13 ` [PATCH v5 00/21] intel_iommu: Enable stage-1 translation for passthrough device Yi Liu
@ 2025-08-28  5:53   ` Duan, Zhenzhong
  0 siblings, 0 replies; 113+ messages in thread
From: Duan, Zhenzhong @ 2025-08-28  5:53 UTC (permalink / raw)
  To: Liu, Yi L, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, eric.auger@redhat.com,
	mst@redhat.com, jasowang@redhat.com, peterx@redhat.com,
	ddutile@redhat.com, jgg@nvidia.com, nicolinc@nvidia.com,
	joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
	Tian, Kevin, Peng, Chao P



>-----Original Message-----
>From: Liu, Yi L <yi.l.liu@intel.com>
>Subject: Re: [PATCH v5 00/21] intel_iommu: Enable stage-1 translation for
>passthrough device
>
>On 2025/8/22 14:40, Zhenzhong Duan wrote:
>> Hi,
>>
>> For passthrough device with intel_iommu.x-flts=on, we don't do shadowing of
>> guest page table for passthrough device but pass stage-1 page table to host
>> side to construct a nested domain. There was some effort to enable this feature
>> in old days, see [1] for details.
>>
>> The key design is to utilize the dual-stage IOMMU translation (also known as
>> IOMMU nested translation) capability in host IOMMU. As the below diagram shows,
>> guest I/O page table pointer in GPA (guest physical address) is passed to host
>> and be used to perform the stage-1 address translation. Along with it,
>> modifications to present mappings in the guest I/O page table should be followed
>> with an IOTLB invalidation.
>>
>>          .-------------.  .---------------------------.
>>          |   vIOMMU    |  | Guest I/O page table      |
>>          |             |  '---------------------------'
>>          .----------------/
>>          | PASID Entry |--- PASID cache flush --+
>>          '-------------'                        |
>>          |             |                        V
>>          |             |           I/O page table pointer in GPA
>>          '-------------'
>>      Guest
>>      ------| Shadow |---------------------------|--------
>>            v        v                           v
>>      Host
>>          .-------------.  .------------------------.
>>          |   pIOMMU    |  | Stage1 for GIOVA->GPA  |
>>          |             |  '------------------------'
>>          .----------------/  |
>>          | PASID Entry |     V (Nested xlate)
>>          '----------------\.--------------------------------------.
>>          |             |   | Stage2 for GPA->HPA, unmanaged domain|
>>          |             |   '--------------------------------------'
>>          '-------------'
>> For history reason, there are different namings in different VTD spec rev,
>> Where:
>>   - Stage1 = First stage = First level = flts
>>   - Stage2 = Second stage = Second level = slts
>> <Intel VT-d Nested translation>
>>
>> This series reuse VFIO device's default hwpt as nested parent instead of
>> creating new one. This way avoids duplicate code of a new memory listener,
>> all existing feature from VFIO listener can be shared, e.g., ram discard,
>> dirty tracking, etc. Two limitations are: 1) not supporting VFIO device
>> under a PCI bridge with emulated device, because emulated device wants
>> IOMMU AS and VFIO device stick to system AS;
>
>should we document it somewhere?

Sure, docs/devel/vfio-iommufd.rst may be a good place for that.

>
>> 2) not supporting kexec or
>> reboot from "intel_iommu=on,sm_on" to "intel_iommu=on,sm_off", because
>> VFIO device's default hwpt is created with NEST_PARENT flag, kernel
>> inhibit RO mappings when switch to shadow mode.
>
>how does guest know this limitation and hold on such attempts?

There is no way for the guest to know about an erratum in the host IOMMU.

>
>>
>> This series is also a prerequisite work for vSVA, i.e. Sharing guest
>> application address space with passthrough devices.
>>
>> There are some interactions between VFIO and vIOMMU
>> * vIOMMU registers PCIIOMMUOps [set|unset]_iommu_device to PCI
>>    subsystem. VFIO calls them to register/unregister HostIOMMUDevice
>>    instance to vIOMMU at vfio device realize stage.
>> * vIOMMU registers PCIIOMMUOps get_viommu_cap to PCI subsystem.
>>    VFIO calls it to get vIOMMU exposed capabilities.
>> * vIOMMU calls HostIOMMUDeviceIOMMUFD interface [at|de]tach_hwpt
>>    to bind/unbind device to IOMMUFD backed domains, either nested
>>    domain or not.
>>
>> See below diagram:
>>
>>          VFIO Device                                 Intel IOMMU
>>      .-----------------.                         .-------------------.
>>      |                 |                         |                   |
>>      |       .---------|PCIIOMMUOps              |.-------------.    |
>>      |       | IOMMUFD |(set/unset_iommu_device) || Host IOMMU  |    |
>>      |       | Device  |------------------------>|| Device list |    |
>>      |       .---------|(get_viommu_cap)         |.-------------.    |
>>      |                 |                         |       |           |
>>      |                 |                         |       V           |
>>      |       .---------|  HostIOMMUDeviceIOMMUFD |  .-------------.  |
>>      |       | IOMMUFD |            (attach_hwpt)|  | Host IOMMU  |  |
>>      |       | link    |<------------------------|  |   Device    |  |
>>      |       .---------|            (detach_hwpt)|  .-------------.  |
>>      |                 |                         |       |           |
>>      |                 |                         |       ...         |
>>      .-----------------.                         .-------------------.
>>
>> Below is an example to enable stage-1 translation for passthrough device:
>>
>>      -M q35,...
>>      -device intel-iommu,x-scalable-mode=on,x-flts=on...
>>      -object iommufd,id=iommufd0 -device vfio-pci,iommufd=iommufd0,...
>>
>> Test done:
>> - VFIO devices hotplug/unplug
>> - different VFIO devices linked to different iommufds
>> - vhost net device ping test
>>
>> PATCH1-7:  Some preparing work
>> PATCH8-9:  Compatibility check between vIOMMU and Host IOMMU
>> PATCH10-18:Implement stage-1 page table for passthrough device
>> PATCH19-20:Workaround for ERRATA_772415_SPR17
>> PATCH21:   Enable stage-1 translation for passthrough device
>>
>> Qemu code can be found at [2]
>>
>> Fault report isn't supported in this series, we presume guest kernel always
>> construct correct stage1 page table for passthrough device. For emulated
>> devices, the emulation code already provided stage1 fault injection.
>
>just call out this series is only limited to gIOVA usage so far. vSVA is
>later. :)

Will do.

Thanks
Zhenzhong

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v5 02/21] hw/pci: Introduce pci_device_get_viommu_cap()
  2025-08-27 15:30           ` Nicolin Chen
@ 2025-08-28  8:26             ` Yi Liu
  2025-08-28  9:06               ` Duan, Zhenzhong
  0 siblings, 1 reply; 113+ messages in thread
From: Yi Liu @ 2025-08-28  8:26 UTC (permalink / raw)
  To: Nicolin Chen, Eric Auger
  Cc: Zhenzhong Duan, qemu-devel, alex.williamson, clg, mst, jasowang,
	peterx, ddutile, jgg, joao.m.martins, clement.mathieu--drif,
	kevin.tian, chao.p.peng

On 2025/8/27 23:30, Nicolin Chen wrote:
> On Wed, Aug 27, 2025 at 02:32:42PM +0200, Eric Auger wrote:
>> On 8/27/25 2:30 PM, Yi Liu wrote:
>>> On 2025/8/27 19:22, Eric Auger wrote:
>>>>> TBH. I'm hesitating to name it as get_viommu_cap. The scope is a little
>>>>> larger than what we want so far. So I'm wondering if it can be done
>>>>> in a
>>>>> more straightforward way. e.g. just a bool op named
>>>>> iommu_nested_wanted(). Just an example, maybe better naming. We can
>>>>> extend the op to be returning a u64 value in the future when we see
>>>>> another request on VFIO from vIOMMU.
>>>> personnally I am fine with the bitmask which looks more future proof.
>>>
>>> not quite sure if there is another info that needs to be checked in
>>> this "VFIO asks vIOMMU" manner. Have you seen one beside this
>>> nested hwpt requirement by vIOMMU?
>>
>> I don't remember any at this point. But I guess with ARM CCA device
>> passthrough we might have other needs
> 
> Yea. A Realm vSMMU instance won't allocate IOAS/HWPT. So it will
> ask the core to bypass those allocations, via the same op.
> 
> I don't know: does "get_viommu_flags" sound more fitting to have
> a clear meaning of "want"?
> 
>    VIOMMU_FLAG_WANT_NESTING_PARENT
>    VIOMMU_FLAG_WANT_NO_IOAS
> 
> At least, the 2nd one being a "cap" wouldn't sound nice to me..

this looks good to me.

Regards,
Yi Liu
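
A minimal sketch of what the flags-style op discussed above could look like. The two flag names come from the proposal in this thread; the bit positions, the callback typedef and the toy_ helpers are assumptions for illustration only, not an existing QEMU interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Names from the proposal above; bit positions are assumed. */
#define VIOMMU_FLAG_WANT_NESTING_PARENT (1ULL << 0)
#define VIOMMU_FLAG_WANT_NO_IOAS        (1ULL << 1)

/* Hypothetical per-vIOMMU callback returning a bitmask of requests. */
typedef uint64_t (*toy_get_viommu_flags_fn)(void *opaque);

/* Example vIOMMU side: request a nesting parent only when @opaque says so. */
static uint64_t toy_vtd_flags(void *opaque)
{
    const bool *flts = opaque;

    assert(flts);
    return *flts ? VIOMMU_FLAG_WANT_NESTING_PARENT : 0;
}

/* Caller side: a missing callback means "no special requests". */
static bool toy_wants_nesting_parent(toy_get_viommu_flags_fn fn, void *opaque)
{
    uint64_t flags = fn ? fn(opaque) : 0;

    return (flags & VIOMMU_FLAG_WANT_NESTING_PARENT) != 0;
}
```

A bitmask keeps the op extensible: a later request such as the no-IOAS case only adds a bit and a check on the caller side.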


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v5 15/21] intel_iommu: Bind/unbind guest page table to host
  2025-08-22  6:40 ` [PATCH v5 15/21] intel_iommu: Bind/unbind guest page table to host Zhenzhong Duan
@ 2025-08-28  8:37   ` Eric Auger
  2025-08-29  7:05   ` Yi Liu
  1 sibling, 0 replies; 113+ messages in thread
From: Eric Auger @ 2025-08-28  8:37 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, mst, jasowang, peterx, ddutile, jgg,
	nicolinc, joao.m.martins, clement.mathieu--drif, kevin.tian,
	yi.l.liu, chao.p.peng, Yi Sun

Hi Zhenzhong,

On 8/22/25 8:40 AM, Zhenzhong Duan wrote:
> This captures the guest PASID table entry modifications and
> propagates the changes to host to attach a hwpt with type determined
> per guest IOMMU mode and PGTT configuration.
>
> When PGTT is Pass-through(100b), the hwpt on host side is a stage-2
> page table(GPA->HPA). When PGTT is First-stage Translation only(001b),
> vIOMMU reuse hwpt(GPA->HPA) provided by VFIO as nested parent to
> construct nested page table.
>
> When guest decides to use legacy mode then vIOMMU switches the MRs of
> the device's AS, hence the IOAS created by VFIO container would be
> switched to using the IOMMU_NOTIFIER_IOTLB_EVENTS since the MR is
> switched to IOMMU MR. So it is able to support shadowing the guest IO
> page table.
>
> Co-Authored-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>  hw/i386/intel_iommu_internal.h |  14 ++-
>  include/hw/i386/intel_iommu.h  |   1 +
>  hw/i386/intel_iommu.c          | 221 ++++++++++++++++++++++++++++++++-
>  hw/i386/trace-events           |   3 +
>  4 files changed, 233 insertions(+), 6 deletions(-)
>
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> index c510b09d1a..61e35dbdc0 100644
> --- a/hw/i386/intel_iommu_internal.h
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -564,6 +564,12 @@ typedef struct VTDRootEntry VTDRootEntry;
>  #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw)  (0x1e0ULL | ~VTD_HAW_MASK(aw))
>  #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1      0xffffffffffe00000ULL
>  
> +typedef enum VTDPASIDOp {
> +    VTD_PASID_BIND,
> +    VTD_PASID_UPDATE,
> +    VTD_PASID_UNBIND,
> +} VTDPASIDOp;
> +
>  typedef enum VTDPCInvType {
>      /* VTD spec defined PASID cache invalidation type */
>      VTD_PASID_CACHE_DOMSI = VTD_INV_DESC_PASIDC_G_DSI,
> @@ -612,8 +618,12 @@ typedef struct VTDPASIDCacheInfo {
>  #define VTD_SM_PASID_ENTRY_AW          7ULL /* Adjusted guest-address-width */
>  #define VTD_SM_PASID_ENTRY_DID(x)      extract64((x)->val[1], 0, 16)
>  
> -#define VTD_SM_PASID_ENTRY_FLPM          3ULL
> -#define VTD_SM_PASID_ENTRY_FLPTPTR       (~0xfffULL)
> +#define VTD_SM_PASID_ENTRY_FSPTPTR       (~0xfffULL)
Along with the renaming, using extract64() would be simpler than
intel_iommu.c:    return pe->val[2] & VTD_SM_PASID_ENTRY_FSPTPTR;
intel_iommu.c:            return pe.val[2] & VTD_SM_PASID_ENTRY_FSPTPTR;

Also, this would take care of the upper bits.
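
One possible reading of this suggestion, as a standalone sketch: toy_extract64() copies the semantics of extract64() so the snippet builds on its own, the FSPM position (bits 3:2 of val[2]) is taken from the patch, and the address-width parameter aw is a hypothetical input used to show how bits above the width get dropped.

```c
#include <assert.h>
#include <stdint.h>

/* Local copy of extract64() semantics: @length bits starting at bit @start. */
static uint64_t toy_extract64(uint64_t value, int start, int length)
{
    assert(start >= 0 && length > 0 && length <= 64 - start);
    return (value >> start) & (~0ULL >> (64 - length));
}

/* FSPM lives in bits 3:2 of val[2]: 0 means 4-level, 1 means 5-level paging. */
static uint32_t toy_fs_levels(uint64_t val2)
{
    return 4 + (uint32_t)toy_extract64(val2, 2, 2);
}

/* Stage-1 address width: 48 bits for 4-level, 57 bits for 5-level paging. */
static uint32_t toy_fs_aw(uint64_t val2)
{
    return 48 + 9 * (uint32_t)toy_extract64(val2, 2, 2);
}

/*
 * Table pointer limited to bits [aw-1:12]; @aw is an assumed address width.
 * Unlike a plain "& ~0xfff" mask, bits at or above @aw are cleared as well.
 */
static uint64_t toy_fsptptr(uint64_t val2, int aw)
{
    assert(aw > 12 && aw <= 64);
    return toy_extract64(val2, 12, aw - 12) << 12;
}
```

Whether the pointer should be clamped to an address width at this point, or reserved bits checked elsewhere, is a design choice this sketch does not settle.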

> +#define VTD_SM_PASID_ENTRY_SRE_BIT(x)    extract64((x)->val[2], 0, 1)
> +/* 00: 4-level paging, 01: 5-level paging, 10-11: Reserved */
> +#define VTD_SM_PASID_ENTRY_FSPM(x)       extract64((x)->val[2], 2, 2)
> +#define VTD_SM_PASID_ENTRY_WPE_BIT(x)    extract64((x)->val[2], 4, 1)
> +#define VTD_SM_PASID_ENTRY_EAFE_BIT(x)   extract64((x)->val[2], 7, 1)
>  
>  /* First Level Paging Structure */
>  /* Masks for First Level Paging Entry */
> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> index 0e3826f6f0..2affab36b2 100644
> --- a/include/hw/i386/intel_iommu.h
> +++ b/include/hw/i386/intel_iommu.h
> @@ -104,6 +104,7 @@ struct VTDAddressSpace {
>      PCIBus *bus;
>      uint8_t devfn;
>      uint32_t pasid;
> +    uint32_t s1_hwpt;
>      AddressSpace as;
>      IOMMUMemoryRegion iommu;
>      MemoryRegion root;          /* The root container of the device */
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 15582977b8..a10ee8eb4f 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -20,6 +20,7 @@
>   */
>  
>  #include "qemu/osdep.h"
> +#include CONFIG_DEVICES /* CONFIG_IOMMUFD */
>  #include "qemu/error-report.h"
>  #include "qemu/main-loop.h"
>  #include "qapi/error.h"
> @@ -41,6 +42,9 @@
>  #include "migration/vmstate.h"
>  #include "trace.h"
>  #include "system/iommufd.h"
> +#ifdef CONFIG_IOMMUFD
> +#include <linux/iommufd.h>
> +#endif
>  
>  /* context entry operations */
>  #define VTD_CE_GET_RID2PASID(ce) \
> @@ -50,10 +54,9 @@
>  
>  /* pe operations */
>  #define VTD_PE_GET_TYPE(pe) ((pe)->val[0] & VTD_SM_PASID_ENTRY_PGTT)
> -#define VTD_PE_GET_FL_LEVEL(pe) \
> -    (4 + (((pe)->val[2] >> 2) & VTD_SM_PASID_ENTRY_FLPM))
>  #define VTD_PE_GET_SL_LEVEL(pe) \
>      (2 + (((pe)->val[0] >> 2) & VTD_SM_PASID_ENTRY_AW))
> +#define VTD_PE_GET_FL_LEVEL(pe) (VTD_SM_PASID_ENTRY_FSPM(pe) + 4)
this change and above one are cleanups. They can easily be put in a
separate patch to ease the review.
>  
>  /*
>   * PCI bus number (or SID) is not reliable since the device is usaully
> @@ -834,6 +837,31 @@ static inline uint32_t vtd_sm_ce_get_pdt_entry_num(VTDContextEntry *ce)
>      return 1U << (VTD_SM_CONTEXT_ENTRY_PDTS(ce) + 7);
>  }
>  
> +static inline dma_addr_t vtd_pe_get_flpt_base(VTDPASIDEntry *pe)
> +{
> +    return pe->val[2] & VTD_SM_PASID_ENTRY_FSPTPTR;
> +}
> +
> +/*
> + * Stage-1 IOVA address width: 48 bits for 4-level paging(FSPM=00)
> + *                             57 bits for 5-level paging(FSPM=01)
> + */
> +static inline uint32_t vtd_pe_get_fl_aw(VTDPASIDEntry *pe)
fl = first level? You may prefer fs (first stage), which is used in the
FSPM terminology.
> +{
> +    return 48 + VTD_SM_PASID_ENTRY_FSPM(pe) * 9;
> +}
> +
> +static inline bool vtd_pe_pgtt_is_pt(VTDPASIDEntry *pe)
> +{
> +    return (VTD_PE_GET_TYPE(pe) == VTD_SM_PASID_ENTRY_PT);
> +}
> +
> +/* check if pgtt is first stage translation */
> +static inline bool vtd_pe_pgtt_is_flt(VTDPASIDEntry *pe)
> +{
> +    return (VTD_PE_GET_TYPE(pe) == VTD_SM_PASID_ENTRY_FLT);
> +}
> +
>  static inline bool vtd_pdire_present(VTDPASIDDirEntry *pdire)
>  {
>      return pdire->val & 1;
> @@ -1131,7 +1159,7 @@ static dma_addr_t vtd_get_iova_pgtbl_base(IntelIOMMUState *s,
>      if (s->root_scalable) {
>          vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
>          if (s->flts) {
> -            return pe.val[2] & VTD_SM_PASID_ENTRY_FLPTPTR;
> +            return pe.val[2] & VTD_SM_PASID_ENTRY_FSPTPTR;
>          } else {
>              return pe.val[0] & VTD_SM_PASID_ENTRY_SLPTPTR;
>          }
> @@ -1766,7 +1794,7 @@ static bool vtd_dev_pt_enabled(IntelIOMMUState *s, VTDContextEntry *ce,
>               */
>              return false;
>          }
> -        return (VTD_PE_GET_TYPE(&pe) == VTD_SM_PASID_ENTRY_PT);
> +        return vtd_pe_pgtt_is_pt(&pe);
that change can be put in the cleanup patch too
>      }
>  
>      return (vtd_ce_get_type(ce) == VTD_CONTEXT_TT_PASS_THROUGH);
> @@ -2433,6 +2461,178 @@ static void vtd_context_global_invalidate(IntelIOMMUState *s)
>      vtd_iommu_replay_all(s);
>  }
>  
> +#ifdef CONFIG_IOMMUFD
> +static void vtd_init_s1_hwpt_data(struct iommu_hwpt_vtd_s1 *vtd,
> +                                  VTDPASIDEntry *pe)
> +{
> +    memset(vtd, 0, sizeof(*vtd));
> +
> +    vtd->flags =  (VTD_SM_PASID_ENTRY_SRE_BIT(pe) ? IOMMU_VTD_S1_SRE : 0) |
> +                  (VTD_SM_PASID_ENTRY_WPE_BIT(pe) ? IOMMU_VTD_S1_WPE : 0) |
> +                  (VTD_SM_PASID_ENTRY_EAFE_BIT(pe) ? IOMMU_VTD_S1_EAFE : 0);
> +    vtd->addr_width = vtd_pe_get_fl_aw(pe);
> +    vtd->pgtbl_addr = (uint64_t)vtd_pe_get_flpt_base(pe);
> +}
> +
> +static int vtd_create_s1_hwpt(HostIOMMUDeviceIOMMUFD *idev,
> +                              VTDPASIDEntry *pe, uint32_t *s1_hwpt,
> +                              Error **errp)
> +{
> +    struct iommu_hwpt_vtd_s1 vtd;
= {};
and get rid of above memset?
> +
> +    vtd_init_s1_hwpt_data(&vtd, pe);
not sure the helper is needed. Is it reused somewhere else?
> +
> +    return !iommufd_backend_alloc_hwpt(idev->iommufd, idev->devid,
> +                                       idev->hwpt_id, 0, IOMMU_HWPT_DATA_VTD_S1,
> +                                       sizeof(vtd), &vtd, s1_hwpt, errp);
> +}
> +
> +static void vtd_destroy_s1_hwpt(HostIOMMUDeviceIOMMUFD *idev,
> +                                VTDAddressSpace *vtd_as)
> +{
> +    if (!vtd_as->s1_hwpt) {
> +        return;
> +    }
> +    iommufd_backend_free_id(idev->iommufd, vtd_as->s1_hwpt);
> +    vtd_as->s1_hwpt = 0;
> +}
> +
> +static int vtd_device_attach_iommufd(VTDHostIOMMUDevice *vtd_hiod,
> +                                     VTDAddressSpace *vtd_as, Error **errp)
> +{
> +    HostIOMMUDeviceIOMMUFD *idev = HOST_IOMMU_DEVICE_IOMMUFD(vtd_hiod->hiod);
> +    VTDPASIDEntry *pe = &vtd_as->pasid_cache_entry.pasid_entry;
> +    uint32_t hwpt_id;
> +    int ret;
> +
> +    /*
> +     * We can get here only if flts=on, the supported PGTT is FLT and PT.
> +     * Catch invalid PGTT when processing invalidation request to avoid
> +     * attaching to wrong hwpt.
> +     */
> +    if (!vtd_pe_pgtt_is_flt(pe) && !vtd_pe_pgtt_is_pt(pe)) {
> +        error_setg(errp, "Invalid PGTT type");
> +        return -EINVAL;
> +    }
> +
> +    if (vtd_pe_pgtt_is_flt(pe)) {
> +        /* Should fail if the FLPT base is 0 */
OK I see there is a mix of FL and FS terminology. Forget about my
previous comment.
> +        if (!vtd_pe_get_flpt_base(pe)) {
> +            error_setg(errp, "FLPT base is 0");
> +            return -EINVAL;
> +        }
> +
> +        if (vtd_create_s1_hwpt(idev, pe, &hwpt_id, errp)) {
> +            return -EINVAL;
> +        }
> +    } else {
> +        hwpt_id = idev->hwpt_id;
> +    }
> +
> +    ret = !host_iommu_device_iommufd_attach_hwpt(idev, hwpt_id, errp);
> +    trace_vtd_device_attach_hwpt(idev->devid, vtd_as->pasid, hwpt_id, ret);
> +    if (!ret) {
that double ! looks pretty bad ;-)
> +        vtd_destroy_s1_hwpt(idev, vtd_as);
> +        if (vtd_pe_pgtt_is_flt(pe)) {
> +            vtd_as->s1_hwpt = hwpt_id;
would deserve some comments
> +        }
> +    } else if (vtd_pe_pgtt_is_flt(pe)) {
> +        iommufd_backend_free_id(idev->iommufd, hwpt_id);
> +    }
> +
> +    return ret;
> +}
> +
> +static int vtd_device_detach_iommufd(VTDHostIOMMUDevice *vtd_hiod,
> +                                     VTDAddressSpace *vtd_as, Error **errp)
> +{
> +    HostIOMMUDeviceIOMMUFD *idev = HOST_IOMMU_DEVICE_IOMMUFD(vtd_hiod->hiod);
> +    uint32_t pasid = vtd_as->pasid;
> +    int ret;
> +
> +    if (vtd_hiod->iommu_state->dmar_enabled) {
> +        ret = !host_iommu_device_iommufd_detach_hwpt(idev, errp);
> +        trace_vtd_device_detach_hwpt(idev->devid, pasid, ret);
> +    } else {
> +        ret = !host_iommu_device_iommufd_attach_hwpt(idev, idev->hwpt_id, errp);
> +        trace_vtd_device_reattach_def_hwpt(idev->devid, pasid, idev->hwpt_id,
> +                                           ret);
> +    }
> +
> +    if (!ret) {
here as well, the !! lost me, sorry
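
To spell out the concern: the attach and detach paths store the negation of a bool-returning call in an int and then test `!ret` for success, so the reader has to undo two negations. A minimal sketch of one alternative, assuming a single "0 on success, negative errno on failure" convention. backend_attach(), attach_hwpt() and attach_and_track() are hypothetical stand-ins, not QEMU functions.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bool-returning backend call: true on success. */
static bool backend_attach(uint32_t hwpt_id)
{
    return hwpt_id != 0; /* pretend that id 0 cannot be attached */
}

/* Convert once, at the call site, to "0 or -errno". */
static int attach_hwpt(uint32_t hwpt_id)
{
    return backend_attach(hwpt_id) ? 0 : -EINVAL;
}

/* Callers then read naturally: non-zero ret means failure. */
static int attach_and_track(uint32_t hwpt_id, uint32_t *tracked)
{
    int ret;

    assert(tracked != NULL);

    ret = attach_hwpt(hwpt_id);
    if (ret) {
        return ret; /* failure: leave *tracked untouched */
    }

    *tracked = hwpt_id; /* success: remember the attached hwpt */
    return 0;
}
```

With this shape the success branch and the failure branch are each reached through one plain test of ret.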
> +        vtd_destroy_s1_hwpt(idev, vtd_as);
> +    }
> +
> +    return ret;
> +}
> +
> +static int vtd_bind_guest_pasid(VTDAddressSpace *vtd_as, VTDPASIDOp op,
> +                                Error **errp)
> +{
> +    IntelIOMMUState *s = vtd_as->iommu_state;
> +    VTDHostIOMMUDevice *vtd_hiod = vtd_find_hiod_iommufd(s, vtd_as);
> +    int ret;
> +
> +    if (!vtd_hiod) {
> +        /* No need to go further, e.g. for emulated device */
> +        return 0;
> +    }
> +
> +    if (vtd_as->pasid != PCI_NO_PASID) {
> +        error_setg(errp, "Non-rid_pasid %d not supported yet", vtd_as->pasid);
> +        return -EINVAL;
> +    }
> +
> +    switch (op) {
> +    case VTD_PASID_UPDATE:
> +    case VTD_PASID_BIND:
> +    {
> +        ret = vtd_device_attach_iommufd(vtd_hiod, vtd_as, errp);
> +        break;
> +    }
> +    case VTD_PASID_UNBIND:
> +    {
> +        ret = vtd_device_detach_iommufd(vtd_hiod, vtd_as, errp);
> +        break;
> +    }
> +    default:
> +        error_setg(errp, "Unknown VTDPASIDOp!!!");
> +        break;
> +    }
> +
> +    return ret;
> +}
> +#else
> +static int vtd_bind_guest_pasid(VTDAddressSpace *vtd_as, VTDPASIDOp op,
> +                                Error **errp)
> +{
> +    return 0;
> +}
> +#endif
> +
> +static int vtd_bind_guest_pasid_report_err(VTDAddressSpace *vtd_as,
> +                                           VTDPASIDOp op)
> +{
> +    Error *local_err = NULL;
> +    int ret;
> +
> +    /*
> +     * vIOMMU calls into kernel to do BIND/UNBIND, the failure reason
> +     * can be kernel, QEMU bug or invalid guest config. None of them
> +     * should be reported to guest in PASID cache invalidation
> +     * processing path. But at least, we can report it to QEMU console.
> +     *
> +     * TODO: for invalid guest config, DMA translation fault will be
> +     * caught by host and passed to QEMU to inject to guest in future.
> +     */
> +    ret = vtd_bind_guest_pasid(vtd_as, op, &local_err);
> +    if (ret) {
> +        error_report_err(local_err);
> +    }
> +
> +    return ret;
> +}
> +
>  /* Do a context-cache device-selective invalidation.
>   * @func_mask: FM field after shifting
>   */
> @@ -3248,10 +3448,20 @@ static gboolean vtd_flush_pasid_locked(gpointer key, gpointer value,
>       */
>      if (!vtd_pasid_entry_compare(&pe, &pc_entry->pasid_entry)) {
>          pc_entry->pasid_entry = pe;
> +        if (vtd_bind_guest_pasid_report_err(vtd_as, VTD_PASID_UPDATE)) {
I would remove that helper.
> +            /*
> +             * In case update binding fails, tear down existing binding to
> +             * catch invalid pasid entry config during DMA translation.
> +             */
> +            goto remove;
> +        }
>      }
>      return false;
>  
>  remove:
> +    if (vtd_bind_guest_pasid_report_err(vtd_as, VTD_PASID_UNBIND)) {
> +        return false;
> +    }
>      pc_entry->valid = false;
>  
>      /*
> @@ -3336,6 +3546,9 @@ static void vtd_sm_pasid_table_walk_one(IntelIOMMUState *s,
>              if (!pc_entry->valid) {
>                  pc_entry->pasid_entry = pe;
>                  pc_entry->valid = true;
> +                if (vtd_bind_guest_pasid_report_err(vtd_as, VTD_PASID_BIND)) {
> +                    pc_entry->valid = false;
> +                }
>              }
>          }
>          pasid++;
> diff --git a/hw/i386/trace-events b/hw/i386/trace-events
> index c8a936eb46..1c31b9a873 100644
> --- a/hw/i386/trace-events
> +++ b/hw/i386/trace-events
> @@ -73,6 +73,9 @@ vtd_warn_invalid_qi_tail(uint16_t tail) "tail 0x%"PRIx16
>  vtd_warn_ir_vector(uint16_t sid, int index, int vec, int target) "sid 0x%"PRIx16" index %d vec %d (should be: %d)"
>  vtd_warn_ir_trigger(uint16_t sid, int index, int trig, int target) "sid 0x%"PRIx16" index %d trigger %d (should be: %d)"
>  vtd_reset_exit(void) ""
> +vtd_device_attach_hwpt(uint32_t dev_id, uint32_t pasid, uint32_t hwpt_id, int ret) "dev_id %d pasid %d hwpt_id %d, ret: %d"
> +vtd_device_detach_hwpt(uint32_t dev_id, uint32_t pasid, int ret) "dev_id %d pasid %d ret: %d"
> +vtd_device_reattach_def_hwpt(uint32_t dev_id, uint32_t pasid, uint32_t hwpt_id, int ret) "dev_id %d pasid %d hwpt_id %d, ret: %d"
>  
>  # amd_iommu.c
>  amdvi_evntlog_fail(uint64_t addr, uint32_t head) "error: fail to write at addr 0x%"PRIx64" +  offset 0x%"PRIx32
Thanks

Eric



^ permalink raw reply	[flat|nested] 113+ messages in thread

* RE: [PATCH v5 02/21] hw/pci: Introduce pci_device_get_viommu_cap()
  2025-08-28  8:26             ` Yi Liu
@ 2025-08-28  9:06               ` Duan, Zhenzhong
  2025-08-29  1:54                 ` Duan, Zhenzhong
  0 siblings, 1 reply; 113+ messages in thread
From: Duan, Zhenzhong @ 2025-08-28  9:06 UTC (permalink / raw)
  To: Liu, Yi L, Nicolin Chen, Eric Auger
  Cc: qemu-devel@nongnu.org, alex.williamson@redhat.com, clg@redhat.com,
	mst@redhat.com, jasowang@redhat.com, peterx@redhat.com,
	ddutile@redhat.com, jgg@nvidia.com, joao.m.martins@oracle.com,
	clement.mathieu--drif@eviden.com, Tian,  Kevin, Peng, Chao P



>-----Original Message-----
>From: Liu, Yi L <yi.l.liu@intel.com>
>Subject: Re: [PATCH v5 02/21] hw/pci: Introduce
>pci_device_get_viommu_cap()
>
>On 2025/8/27 23:30, Nicolin Chen wrote:
>> On Wed, Aug 27, 2025 at 02:32:42PM +0200, Eric Auger wrote:
>>> On 8/27/25 2:30 PM, Yi Liu wrote:
>>>> On 2025/8/27 19:22, Eric Auger wrote:
>>>>>> TBH. I'm hesitating to name it as get_viommu_cap. The scope is a little
>>>>>> larger than what we want so far. So I'm wondering if it can be done
>>>>>> in a
>>>>>> more straightforward way. e.g. just a bool op named
>>>>>> iommu_nested_wanted(). Just an example, maybe better naming. We
>can
>>>>>> extend the op to be returning a u64 value in the future when we see
>>>>>> another request on VFIO from vIOMMU.
>>>>> personnally I am fine with the bitmask which looks more future proof.
>>>>
>>>> not quite sure if there is another info that needs to be checked in
>>>> this "VFIO asks vIOMMU" manner. Have you seen one beside this
>>>> nested hwpt requirement by vIOMMU?
>>>
>>> I don't remember any at this point. But I guess with ARM CCA device
>>> passthrough we might have other needs
>>
>> Yea. A Realm vSMMU instance won't allocate IOAS/HWPT. So it will
>> ask the core to bypass those allocations, via the same op.
>>
>> I don't know: does "get_viommu_flags" sound more fitting to have
>> a clear meaning of "want"?
>>
>>    VIOMMU_FLAG_WANT_NESTING_PARENT
>>    VIOMMU_FLAG_WANT_NO_IOAS
>>
>> At least, the 2nd one being a "cap" wouldn't sound nice to me..
>
>this looks good to me.

OK, will do s/get_viommu_cap/get_viommu_flags/ and s/VIOMMU_CAP_HW_NESTED/VIOMMU_FLAG_WANT_NESTING_PARENT/ if there are no more suggestions.
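
A minimal sketch of how the renamed op could read, for illustration only. The two flag names are the ones proposed in this thread; the bit positions, ViommuOpsSketch, query_viommu_flags() and vtd_flags_sketch() are assumptions and not the final interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Flag names from the thread; bit positions are assumed. */
#define VIOMMU_FLAG_WANT_NESTING_PARENT (1ULL << 0)
#define VIOMMU_FLAG_WANT_NO_IOAS        (1ULL << 1)

/* Hypothetical ops table exposing the "what does the vIOMMU want" query. */
typedef struct ViommuOpsSketch {
    uint64_t (*get_viommu_flags)(void *opaque);
} ViommuOpsSketch;

/* Caller side: a vIOMMU that does not implement the op wants nothing. */
static uint64_t query_viommu_flags(const ViommuOpsSketch *ops, void *opaque)
{
    assert(ops != NULL);
    return ops->get_viommu_flags ? ops->get_viommu_flags(opaque) : 0;
}

/* Hypothetical implementation: opaque points at a bool "stage-1 enabled". */
static uint64_t vtd_flags_sketch(void *opaque)
{
    const bool *flts = opaque;

    return *flts ? VIOMMU_FLAG_WANT_NESTING_PARENT : 0;
}
```

Returning a bitmask keeps room for a second "want", such as the NO_IOAS case mentioned for Realm vSMMU, without adding another op.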

Thanks
Zhenzhong

^ permalink raw reply	[flat|nested] 113+ messages in thread

* RE: [PATCH v5 07/21] intel_iommu: Introduce a new structure VTDHostIOMMUDevice
  2025-08-27 11:14   ` Yi Liu
@ 2025-08-28  9:17     ` Duan, Zhenzhong
  2025-08-29  2:57       ` Yi Liu
  0 siblings, 1 reply; 113+ messages in thread
From: Duan, Zhenzhong @ 2025-08-28  9:17 UTC (permalink / raw)
  To: Liu, Yi L, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, eric.auger@redhat.com,
	mst@redhat.com, jasowang@redhat.com, peterx@redhat.com,
	ddutile@redhat.com, jgg@nvidia.com, nicolinc@nvidia.com,
	joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
	Tian, Kevin, Peng, Chao P



>-----Original Message-----
>From: Liu, Yi L <yi.l.liu@intel.com>
>Subject: Re: [PATCH v5 07/21] intel_iommu: Introduce a new structure
>VTDHostIOMMUDevice
>
>On 2025/8/22 14:40, Zhenzhong Duan wrote:
>> Introduce a new structure VTDHostIOMMUDevice which replaces
>> HostIOMMUDevice to be stored in hash table.
>>
>> It includes a reference to HostIOMMUDevice and IntelIOMMUState,
>> also includes BDF information which will be used in future
>> patches.
>>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> Reviewed-by: Eric Auger <eric.auger@redhat.com>
>> ---
>>   hw/i386/intel_iommu_internal.h |  7 +++++++
>>   include/hw/i386/intel_iommu.h  |  2 +-
>>   hw/i386/intel_iommu.c          | 15 +++++++++++++--
>>   3 files changed, 21 insertions(+), 3 deletions(-)
>>
>> diff --git a/hw/i386/intel_iommu_internal.h
>b/hw/i386/intel_iommu_internal.h
>> index 360e937989..c7046eb4e2 100644
>> --- a/hw/i386/intel_iommu_internal.h
>> +++ b/hw/i386/intel_iommu_internal.h
>> @@ -28,6 +28,7 @@
>>   #ifndef HW_I386_INTEL_IOMMU_INTERNAL_H
>>   #define HW_I386_INTEL_IOMMU_INTERNAL_H
>>   #include "hw/i386/intel_iommu.h"
>> +#include "system/host_iommu_device.h"
>>
>>   /*
>>    * Intel IOMMU register specification
>> @@ -608,4 +609,10 @@ typedef struct VTDRootEntry VTDRootEntry;
>>   /* Bits to decide the offset for each level */
>>   #define VTD_LEVEL_BITS           9
>>
>> +typedef struct VTDHostIOMMUDevice {
>> +    IntelIOMMUState *iommu_state;
>> +    PCIBus *bus;
>> +    uint8_t devfn;
>> +    HostIOMMUDevice *hiod;
>> +} VTDHostIOMMUDevice;
>>   #endif
>> diff --git a/include/hw/i386/intel_iommu.h
>b/include/hw/i386/intel_iommu.h
>> index e95477e855..50f9b27a45 100644
>> --- a/include/hw/i386/intel_iommu.h
>> +++ b/include/hw/i386/intel_iommu.h
>> @@ -295,7 +295,7 @@ struct IntelIOMMUState {
>>       /* list of registered notifiers */
>>       QLIST_HEAD(, VTDAddressSpace) vtd_as_with_notifiers;
>>
>> -    GHashTable *vtd_host_iommu_dev;             /*
>HostIOMMUDevice */
>> +    GHashTable *vtd_host_iommu_dev;             /*
>VTDHostIOMMUDevice */
>>
>>       /* interrupt remapping */
>>       bool intr_enabled;              /* Whether guest enabled IR */
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index e3b871de70..512ca4fdc5 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -281,7 +281,10 @@ static gboolean vtd_hiod_equal(gconstpointer v1,
>gconstpointer v2)
>>
>>   static void vtd_hiod_destroy(gpointer v)
>>   {
>> -    object_unref(v);
>> +    VTDHostIOMMUDevice *vtd_hiod = v;
>> +
>> +    object_unref(vtd_hiod->hiod);
>> +    g_free(vtd_hiod);
>>   }
>>
>>   static gboolean vtd_hash_remove_by_domain(gpointer key, gpointer
>value,
>> @@ -4371,6 +4374,7 @@ static bool vtd_dev_set_iommu_device(PCIBus
>*bus, void *opaque, int devfn,
>>                                        HostIOMMUDevice *hiod,
>Error **errp)
>>   {
>>       IntelIOMMUState *s = opaque;
>> +    VTDHostIOMMUDevice *vtd_hiod;
>>       struct vtd_as_key key = {
>>           .bus = bus,
>>           .devfn = devfn,
>> @@ -4387,7 +4391,14 @@ static bool vtd_dev_set_iommu_device(PCIBus
>*bus, void *opaque, int devfn,
>>           return false;
>>       }
>>
>> +    vtd_hiod = g_malloc0(sizeof(VTDHostIOMMUDevice));
>> +    vtd_hiod->bus = bus;
>> +    vtd_hiod->devfn = (uint8_t)devfn;
>> +    vtd_hiod->iommu_state = s;
>> +    vtd_hiod->hiod = hiod;
>
>how about moving it after the below if branch? :)

They will be used in vtd_check_hiod(), so they need to be initialized early.
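
The ordering can be sketched as follows: the wrapper is filled in before the compatibility check because the check reads it, and it is freed again on the failure path. All names here (HiodWrapperSketch, check_wrapper(), register_device()) are hypothetical stand-ins, and the sketch assumes the check takes the wrapper as its argument.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for VTDHostIOMMUDevice. */
typedef struct HiodWrapperSketch {
    uint8_t devfn;
    const void *hiod; /* stand-in for HostIOMMUDevice * */
} HiodWrapperSketch;

/* Hypothetical check that reads the already-initialized wrapper. */
static bool check_wrapper(const HiodWrapperSketch *w)
{
    return w->hiod != NULL;
}

/* Returns the wrapper to be stored, or NULL if allocation or the check fails. */
static HiodWrapperSketch *register_device(const void *hiod, int devfn)
{
    HiodWrapperSketch *w;

    assert(devfn >= 0 && devfn <= 255);

    w = calloc(1, sizeof(*w));
    if (!w) {
        return NULL;
    }

    /* Initialize first: check_wrapper() below depends on these fields. */
    w->devfn = (uint8_t)devfn;
    w->hiod = hiod;

    if (!check_wrapper(w)) {
        free(w); /* nothing has been published yet, so just drop it */
        return NULL;
    }

    return w;
}
```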

Thanks
Zhenzhong

>
>>       if (!vtd_check_hiod(s, hiod, errp)) {
>> +        g_free(vtd_hiod);
>>           vtd_iommu_unlock(s);
>>           return false;
>>       }
>> @@ -4397,7 +4408,7 @@ static bool vtd_dev_set_iommu_device(PCIBus
>*bus, void *opaque, int devfn,
>>       new_key->devfn = devfn;
>>
>>       object_ref(hiod);
>> -    g_hash_table_insert(s->vtd_host_iommu_dev, new_key, hiod);
>> +    g_hash_table_insert(s->vtd_host_iommu_dev, new_key, vtd_hiod);
>>
>>       vtd_iommu_unlock(s);
>>
>
>LGTM.
>
>Reviewed-by: Yi Liu <yi.l.liu@intel.com>

^ permalink raw reply	[flat|nested] 113+ messages in thread

* RE: [PATCH v5 08/21] intel_iommu: Check for compatibility with IOMMUFD backed device when x-flts=on
  2025-08-27 11:42   ` Yi Liu
@ 2025-08-28  9:37     ` Duan, Zhenzhong
  0 siblings, 0 replies; 113+ messages in thread
From: Duan, Zhenzhong @ 2025-08-28  9:37 UTC (permalink / raw)
  To: Liu, Yi L, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, eric.auger@redhat.com,
	mst@redhat.com, jasowang@redhat.com, peterx@redhat.com,
	ddutile@redhat.com, jgg@nvidia.com, nicolinc@nvidia.com,
	joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
	Tian, Kevin, Peng, Chao P



>-----Original Message-----
>From: Liu, Yi L <yi.l.liu@intel.com>
>Subject: Re: [PATCH v5 08/21] intel_iommu: Check for compatibility with
>IOMMUFD backed device when x-flts=on
>
>On 2025/8/22 14:40, Zhenzhong Duan wrote:
>> When vIOMMU is configured x-flts=on in scalable mode, stage-1 page table
>> is passed to host to construct nested page table.
>
>for passthrough devices :)

OK.

>
>> We need to check
>> compatibility of some critical IOMMU capabilities between vIOMMU and
>> host IOMMU to ensure guest stage-1 page table could be used by host.
>>
>> For instance, vIOMMU supports stage-1 1GB huge page mapping, but host
>> does not, then this IOMMUFD backed device should fail.
>
>do you have a list of what caps should be checked to ensure guest
>stage-1 page table work on hw? I can see EAFS. But it is not yet exposed
>to guest, so no need to check it for now.

Currently I only see FS1GP; ATS, PRQ and PASID aren't supported yet. vIOMMU only enables a small set of capabilities when x-flts=on.
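
The rule behind this can be sketched as: a host capability only needs checking when the vIOMMU actually exposes the matching feature to the guest. The FS1GP bit position (56) and the check_stage1_caps() helper are assumptions for illustration, and the error text is taken from the patch with the "large page" wording suggested in this review.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed position of the FS1GP bit in the capability register. */
#define VTD_CAP_FS1GP_SKETCH (1ULL << 56)

/*
 * Hypothetical helper: returns true when the host can back the guest's
 * stage-1 configuration; otherwise sets *reason and returns false.
 */
static bool check_stage1_caps(bool guest_fs1gp, uint64_t host_cap_reg,
                              const char **reason)
{
    assert(reason != NULL);
    *reason = NULL;

    /* Only relevant if the guest was told 1GB stage-1 pages exist. */
    if (guest_fs1gp && !(host_cap_reg & VTD_CAP_FS1GP_SKETCH)) {
        *reason = "Stage-1 1GB large page is unsupported by host IOMMU";
        return false;
    }

    return true;
}
```

Further capabilities would add one such guarded test each as they become exposed to the guest.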

>
>>
>> Even of the checks pass, for now we willingly reject the association
>> because all the bits are not there yet.
>
>better call out it would be relaxed in the end of this series. Otherwise
>it's a little confused. :)

This comment is per Eric's suggestion; I'll combine yours with his, like:

"Even if the checks pass, for now we willingly reject the association
because all the bits are not there yet; it will be relaxed at the end of this series."

>
>>
>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>>   hw/i386/intel_iommu_internal.h |  1 +
>>   hw/i386/intel_iommu.c          | 30
>+++++++++++++++++++++++++++++-
>>   2 files changed, 30 insertions(+), 1 deletion(-)
>>
>> diff --git a/hw/i386/intel_iommu_internal.h
>b/hw/i386/intel_iommu_internal.h
>> index c7046eb4e2..f7510861d1 100644
>> --- a/hw/i386/intel_iommu_internal.h
>> +++ b/hw/i386/intel_iommu_internal.h
>> @@ -192,6 +192,7 @@
>>   #define VTD_ECAP_PT                 (1ULL << 6)
>>   #define VTD_ECAP_SC                 (1ULL << 7)
>>   #define VTD_ECAP_MHMV               (15ULL << 20)
>> +#define VTD_ECAP_NEST               (1ULL << 26)
>>   #define VTD_ECAP_SRS                (1ULL << 31)
>>   #define VTD_ECAP_PSS                (7ULL << 35) /* limit:
>MemTxAttrs::pid */
>>   #define VTD_ECAP_PASID              (1ULL << 40)
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index 512ca4fdc5..da355bda79 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -40,6 +40,7 @@
>>   #include "kvm/kvm_i386.h"
>>   #include "migration/vmstate.h"
>>   #include "trace.h"
>> +#include "system/iommufd.h"
>>
>>   /* context entry operations */
>>   #define VTD_CE_GET_RID2PASID(ce) \
>> @@ -4366,7 +4367,34 @@ static bool vtd_check_hiod(IntelIOMMUState *s,
>HostIOMMUDevice *hiod,
>>           return true;
>>       }
>>
>> -    error_setg(errp, "host device is uncompatible with stage-1
>translation");
>> +#ifdef CONFIG_IOMMUFD
>> +    struct HostIOMMUDeviceCaps *caps = &hiod->caps;
>> +    struct iommu_hw_info_vtd *vtd = &caps->vendor_caps.vtd;
>> +
>> +    /* Remaining checks are all stage-1 translation specific */
>> +    if (!object_dynamic_cast(OBJECT(hiod),
>TYPE_HOST_IOMMU_DEVICE_IOMMUFD)) {
>> +        error_setg(errp, "Need IOMMUFD backend when x-flts=on");
>> +        return false;
>> +    }
>> +
>> +    if (caps->type != IOMMU_HW_INFO_TYPE_INTEL_VTD) {
>> +        error_setg(errp, "Incompatible host platform IOMMU type %d",
>> +                   caps->type);
>> +        return false;
>> +    }
>> +
>> +    if (!(vtd->ecap_reg & VTD_ECAP_NEST)) {
>> +        error_setg(errp, "Host IOMMU doesn't support nested
>translation");
>> +        return false;
>> +    }
>
>this check may be already been covered by the sync in patch 05 as
>the set_iommu_device op is called after attach_device. If no NESTED cap,
>allocating nested hwpt would be failed.

Indeed, will drop the check.

>
>> +
>> +    if (s->fs1gp && !(vtd->cap_reg & VTD_CAP_FS1GP)) {
>> +        error_setg(errp, "Stage-1 1GB huge page is unsupported by host
>IOMMU");
>
>s/huge page/large page/ as VT-d spec use large page.

Will do.

>
>> +        return false;
>> +    }
>> +#endif
>> +
>> +    error_setg(errp, "host IOMMU is incompatible with stage-1
>translation");
>
>s/stage-1 translation/guest stage-1 translation/

Will do

Thanks
Zhenzhong

>
>>       return false;
>>   }
>>
>
>with above minor nits done, the patch looks good to me. Hence,
>
>Reviewed-by: Yi Liu <yi.l.liu@intel.com>

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v5 16/21] intel_iommu: Replay pasid bindings after context cache invalidation
  2025-08-22  6:40 ` [PATCH v5 16/21] intel_iommu: Replay pasid bindings after context cache invalidation Zhenzhong Duan
@ 2025-08-28  9:43   ` Eric Auger
  2025-08-29  7:35     ` Yi Liu
  0 siblings, 1 reply; 113+ messages in thread
From: Eric Auger @ 2025-08-28  9:43 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, mst, jasowang, peterx, ddutile, jgg,
	nicolinc, joao.m.martins, clement.mathieu--drif, kevin.tian,
	yi.l.liu, chao.p.peng, Yi Sun



On 8/22/25 8:40 AM, Zhenzhong Duan wrote:
> From: Yi Liu <yi.l.liu@intel.com>
>
> This replays guest pasid bindings after context cache invalidation.
> This is a behavior to ensure safety. Actually, programmer should issue
> pasid cache invalidation with proper granularity after issuing a context
> cache invalidation.
So is this mandated? If the spec mandates specific invalidations and the
guest does not comply with the expected invalidation sequence shall we
do that behind the curtain?
>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>  hw/i386/intel_iommu_internal.h |  2 ++
>  hw/i386/intel_iommu.c          | 42 ++++++++++++++++++++++++++++++++++
>  hw/i386/trace-events           |  1 +
>  3 files changed, 45 insertions(+)
>
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> index 61e35dbdc0..8af1004888 100644
> --- a/hw/i386/intel_iommu_internal.h
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -584,6 +584,8 @@ typedef enum VTDPCInvType {
>  
>      /* Reset all PASID cache entries, used in system level reset */
>      VTD_PASID_CACHE_FORCE_RESET = 0x10,
> +    /* Invalidate all PASID entries in a device */
> +    VTD_PASID_CACHE_DEVSI,
This invalidation type is not defined in the spec. I would avoid it and
find another solution if you really need this kind of invalidation.
>  } VTDPCInvType;
>  
>  typedef struct VTDPASIDCacheInfo {
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index a10ee8eb4f..6c0e502d1c 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -91,6 +91,10 @@ static void vtd_address_space_refresh_all(IntelIOMMUState *s);
>  static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
>  
>  static void vtd_pasid_cache_reset_locked(IntelIOMMUState *s);
> +static void vtd_pasid_cache_sync(IntelIOMMUState *s,
> +                                 VTDPASIDCacheInfo *pc_info);
> +static void vtd_pasid_cache_devsi(IntelIOMMUState *s,
> +                                  PCIBus *bus, uint16_t devfn);
>  
>  static void vtd_panic_require_caching_mode(void)
>  {
> @@ -2442,6 +2446,8 @@ static void vtd_iommu_replay_all(IntelIOMMUState *s)
>  
>  static void vtd_context_global_invalidate(IntelIOMMUState *s)
>  {
> +    VTDPASIDCacheInfo pc_info;
> +
>      trace_vtd_inv_desc_cc_global();
>      /* Protects context cache */
>      vtd_iommu_lock(s);
> @@ -2459,6 +2465,9 @@ static void vtd_context_global_invalidate(IntelIOMMUState *s)
>       * VT-d emulation codes.
>       */
>      vtd_iommu_replay_all(s);
> +
> +    pc_info.type = VTD_PASID_CACHE_GLOBAL_INV;
> +    vtd_pasid_cache_sync(s, &pc_info);
I would put this addition in a separate patch because it does not need
the new VTD_PASID_CACHE_DEVSI stuff.

>  }
>  
>  #ifdef CONFIG_IOMMUFD
> @@ -2691,6 +2700,15 @@ static void vtd_context_device_invalidate(IntelIOMMUState *s,
>               * happened.
>               */
>              vtd_address_space_sync(vtd_as);
> +            /*
> +             * Per spec, context flush should also be followed with PASID
> +             * cache and iotlb flush. In order to work with a guest which
> +             * doesn't follow spec and missed PASID cache flush, we have
> +             * vtd_pasid_cache_devsi() to invalidate PASID caches of the
> +             * passthrough device. Host iommu driver would flush piotlb
> +             * when a pasid unbind is pass down to it.
> +             */
> +             vtd_pasid_cache_devsi(s, vtd_as->bus, devfn);
>          }
>      }
>  }
> @@ -3422,6 +3440,11 @@ static gboolean vtd_flush_pasid_locked(gpointer key, gpointer value,
>          break;
>      case VTD_PASID_CACHE_FORCE_RESET:
>          goto remove;
> +    case VTD_PASID_CACHE_DEVSI:
> +        if (pc_info->bus != vtd_as->bus || pc_info->devfn != vtd_as->devfn) {
> +            return false;
> +        }
> +        break;
>      default:
>          error_setg(&error_fatal, "invalid pc_info->type for flush");
>      }
> @@ -3635,6 +3658,11 @@ static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
>      case VTD_PASID_CACHE_FORCE_RESET:
>          /* For force reset, no need to go further replay */
>          return;
> +    case VTD_PASID_CACHE_DEVSI:
> +        walk_info.bus = pc_info->bus;
> +        walk_info.devfn = pc_info->devfn;
> +        vtd_replay_pasid_bind_for_dev(s, start, end, &walk_info);
> +        return;
>      default:
>          error_setg(&error_fatal, "invalid pc_info->type for replay");
>      }
> @@ -3683,6 +3711,20 @@ static void vtd_pasid_cache_sync(IntelIOMMUState *s, VTDPASIDCacheInfo *pc_info)
>      vtd_replay_guest_pasid_bindings(s, pc_info);
>  }
>  
> +static void vtd_pasid_cache_devsi(IntelIOMMUState *s,
> +                                  PCIBus *bus, uint16_t devfn)
> +{
> +    VTDPASIDCacheInfo pc_info;
> +
> +    trace_vtd_pasid_cache_devsi(devfn);
> +
> +    pc_info.type = VTD_PASID_CACHE_DEVSI;
> +    pc_info.bus = bus;
> +    pc_info.devfn = devfn;
> +
> +    vtd_pasid_cache_sync(s, &pc_info);
> +}
> +
>  static bool vtd_process_pasid_desc(IntelIOMMUState *s,
>                                     VTDInvDesc *inv_desc)
>  {
> diff --git a/hw/i386/trace-events b/hw/i386/trace-events
> index 1c31b9a873..830b11f68b 100644
> --- a/hw/i386/trace-events
> +++ b/hw/i386/trace-events
> @@ -28,6 +28,7 @@ vtd_pasid_cache_reset(void) ""
>  vtd_pasid_cache_gsi(void) ""
>  vtd_pasid_cache_dsi(uint16_t domain) "Domain selective PC invalidation domain 0x%"PRIx16
>  vtd_pasid_cache_psi(uint16_t domain, uint32_t pasid) "PASID selective PC invalidation domain 0x%"PRIx16" pasid 0x%"PRIx32
> +vtd_pasid_cache_devsi(uint16_t devfn) "Dev selective PC invalidation dev: 0x%"PRIx16
>  vtd_re_not_present(uint8_t bus) "Root entry bus %"PRIu8" not present"
>  vtd_ce_not_present(uint8_t bus, uint8_t devfn) "Context entry bus %"PRIu8" devfn %"PRIu8" not present"
>  vtd_iotlb_page_hit(uint16_t sid, uint64_t addr, uint64_t slpte, uint16_t domain) "IOTLB page hit sid 0x%"PRIx16" iova 0x%"PRIx64" slpte 0x%"PRIx64" domain 0x%"PRIx16
Eric



^ permalink raw reply	[flat|nested] 113+ messages in thread

* RE: [PATCH v5 05/21] vfio/iommufd: Force creating nested parent domain
  2025-08-27 11:48   ` Eric Auger
@ 2025-08-28  9:53     ` Duan, Zhenzhong
  2025-08-28 13:00       ` Eric Auger
  0 siblings, 1 reply; 113+ messages in thread
From: Duan, Zhenzhong @ 2025-08-28  9:53 UTC (permalink / raw)
  To: eric.auger@redhat.com, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
	jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
	jgg@nvidia.com, nicolinc@nvidia.com, joao.m.martins@oracle.com,
	clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
	Peng, Chao P

Hi Eric,

>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: Re: [PATCH v5 05/21] vfio/iommufd: Force creating nested parent
>domain
>
>Hi Zhenzhong,
>
>On 8/22/25 8:40 AM, Zhenzhong Duan wrote:
>> Call pci_device_get_viommu_cap() to get if vIOMMU supports
>VIOMMU_CAP_HW_NESTED,
>> if yes, create nested parent domain which could be reused by vIOMMU to
>create
>> nested domain.
>>
>> Introduce helper vfio_device_viommu_get_nested to facilitate this
>> implementation.
>>
>> It is safe because even if VIOMMU_CAP_HW_NESTED is returned, s->flts is
>> forbidden and VFIO device fails in set_iommu_device() call, until we support
>> passthrough device with x-flts=on.
>>
>> Suggested-by: Nicolin Chen <nicolinc@nvidia.com>
>> Suggested-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>>  include/hw/vfio/vfio-device.h |  2 ++
>>  hw/vfio/device.c              | 12 ++++++++++++
>>  hw/vfio/iommufd.c             |  8 ++++++++
>>  3 files changed, 22 insertions(+)
>>
>> diff --git a/include/hw/vfio/vfio-device.h b/include/hw/vfio/vfio-device.h
>> index 6e4d5ccdac..ecd82c16c7 100644
>> --- a/include/hw/vfio/vfio-device.h
>> +++ b/include/hw/vfio/vfio-device.h
>> @@ -257,6 +257,8 @@ void vfio_device_prepare(VFIODevice *vbasedev,
>VFIOContainerBase *bcontainer,
>>
>>  void vfio_device_unprepare(VFIODevice *vbasedev);
>>
>> +bool vfio_device_viommu_get_nested(VFIODevice *vbasedev);
>I would suggest vfio_device_viommu_has_feature_hw_nested or something
>similar. "get" usually means you take a ref count that is later dropped
>by a matching "put".

Sure.

>> +
>>  int vfio_device_get_region_info(VFIODevice *vbasedev, int index,
>>                                  struct vfio_region_info **info);
>>  int vfio_device_get_region_info_type(VFIODevice *vbasedev, uint32_t
>type,
>> diff --git a/hw/vfio/device.c b/hw/vfio/device.c
>> index 08f12ac31f..3eeb71bd51 100644
>> --- a/hw/vfio/device.c
>> +++ b/hw/vfio/device.c
>> @@ -23,6 +23,7 @@
>>
>>  #include "hw/vfio/vfio-device.h"
>>  #include "hw/vfio/pci.h"
>> +#include "hw/iommu.h"
>>  #include "hw/hw.h"
>>  #include "trace.h"
>>  #include "qapi/error.h"
>> @@ -504,6 +505,17 @@ void vfio_device_unprepare(VFIODevice
>*vbasedev)
>>      vbasedev->bcontainer = NULL;
>>  }
>>
>> +bool vfio_device_viommu_get_nested(VFIODevice *vbasedev)
>> +{
>> +    VFIOPCIDevice *vdev = vfio_pci_from_vfio_device(vbasedev);
>> +
>> +    if (vdev) {
>> +        return !!(pci_device_get_viommu_cap(&vdev->pdev) &
>> +                  VIOMMU_CAP_HW_NESTED);
>> +    }
>> +    return false;
>> +}
>> +
>>  /*
>>   * Traditional ioctl() based io
>>   */
>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>> index 8c27222f75..e503c232e1 100644
>> --- a/hw/vfio/iommufd.c
>> +++ b/hw/vfio/iommufd.c
>> @@ -379,6 +379,14 @@ static bool
>iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
>>          flags = IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
>>      }
>>
>> +    /*
>> +     * If vIOMMU supports stage-1 translation, force to create nested
>parent
>I would rather not use another terminology here. You previously used
>hw_nested, I think that's better. Also bear in mind that smmu supports
>S1, S2 and S1+S2 in emulated code.

What about 'nesting parent' to match kernel side terminology, per Nicolin's suggestion:

In kernel kdoc/uAPI, we use:
 - "nesting parent" for stage-2 object
 - "nested hwpt", "nested domain" for stage-1 object

Thanks
Zhenzhong
>
>Thanks
>
>Eric
>> +     * domain which could be reused by vIOMMU to create nested
>domain.
>> +     */
>> +    if (vfio_device_viommu_get_nested(vbasedev)) {
>> +        flags |= IOMMU_HWPT_ALLOC_NEST_PARENT;
>> +    }
>> +
>>      if (cpr_is_incoming()) {
>>          hwpt_id = vbasedev->cpr.hwpt_id;
>>          goto skip_alloc;
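
For reference, a minimal sketch of the predicate-style helper and the
flag composition discussed above. All EX_* names and values are
hypothetical stand-ins, not the real QEMU or iommufd uAPI definitions;
only the shape of the logic is taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins; the real values live in QEMU and the iommufd uAPI. */
#define EX_VIOMMU_CAP_HW_NESTED       (1ULL << 0)
#define EX_HWPT_ALLOC_NEST_PARENT     (1U << 0)
#define EX_HWPT_ALLOC_DIRTY_TRACKING  (1U << 1)

static_assert(EX_HWPT_ALLOC_NEST_PARENT != EX_HWPT_ALLOC_DIRTY_TRACKING,
              "allocation flags must not overlap");

/* Predicate-style helper: reports a capability and takes no reference. */
static bool ex_viommu_has_hw_nested(uint64_t viommu_caps)
{
    return !!(viommu_caps & EX_VIOMMU_CAP_HW_NESTED);
}

/* Build the allocation flags for the stage-2 ("nesting parent") domain. */
static uint32_t ex_hwpt_alloc_flags(uint64_t viommu_caps, bool dirty_tracking)
{
    uint32_t flags = 0;

    if (dirty_tracking) {
        flags |= EX_HWPT_ALLOC_DIRTY_TRACKING;
    }
    if (ex_viommu_has_hw_nested(viommu_caps)) {
        flags |= EX_HWPT_ALLOC_NEST_PARENT;
    }
    return flags;
}
```

A "has"-style name makes it clear at the call site that nothing needs
to be released afterwards.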


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v5 17/21] intel_iommu: Propagate PASID-based iotlb invalidation to host
  2025-08-22  6:40 ` [PATCH v5 17/21] intel_iommu: Propagate PASID-based iotlb invalidation to host Zhenzhong Duan
@ 2025-08-28 10:00   ` Eric Auger
  2025-08-28 12:11     ` Yi Liu
  2025-09-01  8:32     ` Duan, Zhenzhong
  0 siblings, 2 replies; 113+ messages in thread
From: Eric Auger @ 2025-08-28 10:00 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, mst, jasowang, peterx, ddutile, jgg,
	nicolinc, joao.m.martins, clement.mathieu--drif, kevin.tian,
	yi.l.liu, chao.p.peng, Yi Sun



On 8/22/25 8:40 AM, Zhenzhong Duan wrote:
> From: Yi Liu <yi.l.liu@intel.com>
>
> This traps the guest PASID-based iotlb invalidation request and propagates
> it to host.
>
> Intel VT-d 3.0 supports nested translation in PASID granularity. Guest SVA
> support could be implemented by configuring nested translation on specific
> pasid. This is also known as dual stage DMA translation.
>
> Under such configuration, guest owns the GVA->GPA translation which is
> configured as stage-1 page table on host side for a specific pasid, and host
> owns GPA->HPA translation. As guest owns stage-1 translation table, piotlb
> invalidation should be propagated to host since host IOMMU will cache first
> level page table related mappings during DMA address translation.
>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>  hw/i386/intel_iommu_internal.h |  6 +++
>  hw/i386/intel_iommu.c          | 95 +++++++++++++++++++++++++++++++++-
>  2 files changed, 99 insertions(+), 2 deletions(-)
>
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> index 8af1004888..c1a9263651 100644
> --- a/hw/i386/intel_iommu_internal.h
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -596,6 +596,12 @@ typedef struct VTDPASIDCacheInfo {
>      uint16_t devfn;
>  } VTDPASIDCacheInfo;
>  
> +typedef struct VTDPIOTLBInvInfo {
> +    uint16_t domain_id;
> +    uint32_t pasid;
> +    struct iommu_hwpt_vtd_s1_invalidate *inv_data;
> +} VTDPIOTLBInvInfo;
> +
>  /* PASID Table Related Definitions */
>  #define VTD_PASID_DIR_BASE_ADDR_MASK  (~0xfffULL)
>  #define VTD_PASID_TABLE_BASE_ADDR_MASK (~0xfffULL)
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 6c0e502d1c..7efa22f4ec 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -2611,12 +2611,99 @@ static int vtd_bind_guest_pasid(VTDAddressSpace *vtd_as, VTDPASIDOp op,
>  
>      return ret;
>  }
> +
> +static void
> +vtd_invalidate_piotlb_locked(VTDAddressSpace *vtd_as,
> +                             struct iommu_hwpt_vtd_s1_invalidate *cache)
> +{
> +    IntelIOMMUState *s = vtd_as->iommu_state;
> +    VTDHostIOMMUDevice *vtd_hiod = vtd_find_hiod_iommufd(s, vtd_as);
> +    HostIOMMUDeviceIOMMUFD *idev;
> +    uint32_t entry_num = 1; /* Only implement one request for simplicity */
Can you remind me what this is used for? One what?
> +    Error *local_err = NULL;
> +
> +    if (!vtd_hiod || !vtd_as->s1_hwpt) {
> +        return;
> +    }
> +    idev = HOST_IOMMU_DEVICE_IOMMUFD(vtd_hiod->hiod);
> +
> +    if (!iommufd_backend_invalidate_cache(idev->iommufd, vtd_as->s1_hwpt,
> +                                          IOMMU_HWPT_INVALIDATE_DATA_VTD_S1,
> +                                          sizeof(*cache), &entry_num, cache,
> +                                          &local_err)) {
> +        /* Something wrong in kernel, but trying to continue */
> +        error_report_err(local_err);
> +    }
> +}
> +
> +/*
> + * This function is a loop function for the s->vtd_address_spaces
> + * list with VTDPIOTLBInvInfo as execution filter. It propagates
> + * the piotlb invalidation to host.
> + */
> +static void vtd_flush_host_piotlb_locked(gpointer key, gpointer value,
> +                                         gpointer user_data)
> +{
> +    VTDPIOTLBInvInfo *piotlb_info = user_data;
> +    VTDAddressSpace *vtd_as = value;
> +    VTDPASIDCacheEntry *pc_entry = &vtd_as->pasid_cache_entry;
> +    uint32_t pasid;
> +    uint16_t did;
> +
> +    /* Replay only fills pasid entry cache for passthrough device */
> +    if (!pc_entry->valid ||
> +        !vtd_pe_pgtt_is_flt(&pc_entry->pasid_entry)) {
> +        return;
> +    }
> +
> +    if (vtd_as_to_iommu_pasid_locked(vtd_as, &pasid)) {
> +        return;
> +    }
> +
> +    did = VTD_SM_PASID_ENTRY_DID(&pc_entry->pasid_entry);
> +
> +    if (piotlb_info->domain_id == did && piotlb_info->pasid == pasid) {
> +        vtd_invalidate_piotlb_locked(vtd_as, piotlb_info->inv_data);
> +    }
> +}
> +
> +static void
> +vtd_flush_host_piotlb_all_locked(IntelIOMMUState *s,
> +                                 uint16_t domain_id, uint32_t pasid,
> +                                 hwaddr addr, uint64_t npages, bool ih)
> +{
> +    struct iommu_hwpt_vtd_s1_invalidate cache_info = { 0 };
> +    VTDPIOTLBInvInfo piotlb_info;
> +
> +    cache_info.addr = addr;
> +    cache_info.npages = npages;
> +    cache_info.flags = ih ? IOMMU_VTD_INV_FLAGS_LEAF : 0;
> +
> +    piotlb_info.domain_id = domain_id;
> +    piotlb_info.pasid = pasid;
> +    piotlb_info.inv_data = &cache_info;
> +
> +    /*
> +     * Go through each vtd_as instance in s->vtd_address_spaces, find out
> +     * the affected host device which need host piotlb invalidation. Piotlb
Are you likely to find several vtd_as instances that match the invalidation params?
> +     * invalidation should check pasid cache per architecture point of view.
> +     */
> +    g_hash_table_foreach(s->vtd_address_spaces,
> +                         vtd_flush_host_piotlb_locked, &piotlb_info);
> +}
>  #else
>  static int vtd_bind_guest_pasid(VTDAddressSpace *vtd_as, VTDPASIDOp op,
>                                  Error **errp)
>  {
>      return 0;
>  }
> +
> +static void
> +vtd_flush_host_piotlb_all_locked(IntelIOMMUState *s,
> +                                 uint16_t domain_id, uint32_t pasid,
> +                                 hwaddr addr, uint64_t npages, bool ih)
> +{
> +}
>  #endif
Can't you put these stubs in a dedicated header, as is usually done?
>  
>  static int vtd_bind_guest_pasid_report_err(VTDAddressSpace *vtd_as,
> @@ -3295,6 +3382,7 @@ static void vtd_piotlb_pasid_invalidate(IntelIOMMUState *s,
>      vtd_iommu_lock(s);
>      g_hash_table_foreach_remove(s->iotlb, vtd_hash_remove_by_pasid,
>                                  &info);
> +    vtd_flush_host_piotlb_all_locked(s, domain_id, pasid, 0, (uint64_t)-1, 0);
>      vtd_iommu_unlock(s);
>  
>      QLIST_FOREACH(vtd_as, &s->vtd_as_with_notifiers, next) {
> @@ -3316,7 +3404,8 @@ static void vtd_piotlb_pasid_invalidate(IntelIOMMUState *s,
>  }
>  
>  static void vtd_piotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
> -                                       uint32_t pasid, hwaddr addr, uint8_t am)
> +                                       uint32_t pasid, hwaddr addr, uint8_t am,
> +                                       bool ih)
>  {
>      VTDIOTLBPageInvInfo info;
>  
> @@ -3328,6 +3417,7 @@ static void vtd_piotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
>      vtd_iommu_lock(s);
>      g_hash_table_foreach_remove(s->iotlb,
>                                  vtd_hash_remove_by_page_piotlb, &info);
> +    vtd_flush_host_piotlb_all_locked(s, domain_id, pasid, addr, 1 << am, ih);
>      vtd_iommu_unlock(s);
>  
>      vtd_iotlb_page_invalidate_notify(s, domain_id, addr, am, pasid);
> @@ -3359,7 +3449,8 @@ static bool vtd_process_piotlb_desc(IntelIOMMUState *s,
>      case VTD_INV_DESC_PIOTLB_PSI_IN_PASID:
>          am = VTD_INV_DESC_PIOTLB_AM(inv_desc->val[1]);
>          addr = (hwaddr) VTD_INV_DESC_PIOTLB_ADDR(inv_desc->val[1]);
> -        vtd_piotlb_page_invalidate(s, domain_id, pasid, addr, am);
> +        vtd_piotlb_page_invalidate(s, domain_id, pasid, addr, am,
> +                                   VTD_INV_DESC_PIOTLB_IH(inv_desc->val[1]));
>          break;
>  
>      default:
Thanks

Eric
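
A minimal sketch of how the invalidation range in the patch above is
derived, assuming the address mask (am) means "2^am pages" and that
(uint64_t)-1 means "the whole PASID". The struct and the ex_* helpers
are hypothetical and are not the kernel's iommu_hwpt_vtd_s1_invalidate.
The sketch uses a 64-bit constant for the shift because a plain
"1 << am" is an int shift, which overflows once am reaches 31; whether
the patch needs the same is for the authors to judge.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical request shape used only for this illustration. */
struct ex_inv_request {
    uint64_t addr;      /* start address of the range */
    uint64_t npages;    /* page count, UINT64_MAX meaning "everything" */
    bool leaf_only;     /* mirrors the invalidation hint (IH) */
};

/* Convert an address mask (am) into a page count: 2^am pages. */
static uint64_t ex_am_to_npages(uint8_t am)
{
    assert(am < 64);            /* shifting a 64-bit value by >= 64 is undefined */
    return UINT64_C(1) << am;   /* 64-bit constant, so am >= 31 is safe */
}

/* Page-selective request: a naturally sized range starting at addr. */
static struct ex_inv_request ex_page_inv(uint64_t addr, uint8_t am, bool ih)
{
    struct ex_inv_request r = {
        .addr = addr,
        .npages = ex_am_to_npages(am),
        .leaf_only = ih,
    };
    return r;
}

/* PASID-wide request: address 0 and an all-ones page count. */
static struct ex_inv_request ex_pasid_wide_inv(void)
{
    struct ex_inv_request r = { .addr = 0, .npages = UINT64_MAX,
                                .leaf_only = false };
    return r;
}

/* A cached entry is affected only if both domain ID and PASID match. */
static bool ex_inv_matches(uint16_t req_did, uint32_t req_pasid,
                           uint16_t did, uint32_t pasid)
{
    return req_did == did && req_pasid == pasid;
}
```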



^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v5 18/21] intel_iommu: Replay all pasid bindings when either SRTP or TE bit is changed
  2025-08-22  6:40 ` [PATCH v5 18/21] intel_iommu: Replay all pasid bindings when either SRTP or TE bit is changed Zhenzhong Duan
@ 2025-08-28 10:02   ` Eric Auger
  0 siblings, 0 replies; 113+ messages in thread
From: Eric Auger @ 2025-08-28 10:02 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, mst, jasowang, peterx, ddutile, jgg,
	nicolinc, joao.m.martins, clement.mathieu--drif, kevin.tian,
	yi.l.liu, chao.p.peng



On 8/22/25 8:40 AM, Zhenzhong Duan wrote:
> From: Yi Liu <yi.l.liu@intel.com>
>
> When either 'Set Root Table Pointer' or 'Translation Enable' bit is changed,
> all pasid bindings on host side become stale and need to be updated.
>
> Introduce a helper function vtd_replay_pasid_bindings_all() to go through all
> pasid entries in all passthrough devices to update host side bindings.
>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>

Eric
> ---
>  hw/i386/intel_iommu.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
>
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 7efa22f4ec..f9cb13e945 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -89,6 +89,7 @@ struct vtd_iotlb_key {
>  
>  static void vtd_address_space_refresh_all(IntelIOMMUState *s);
>  static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
> +static void vtd_replay_pasid_bindings_all(IntelIOMMUState *s);
>  
>  static void vtd_pasid_cache_reset_locked(IntelIOMMUState *s);
>  static void vtd_pasid_cache_sync(IntelIOMMUState *s,
> @@ -3050,6 +3051,7 @@ static void vtd_handle_gcmd_srtp(IntelIOMMUState *s)
>      vtd_set_clear_mask_long(s, DMAR_GSTS_REG, 0, VTD_GSTS_RTPS);
>      vtd_reset_caches(s);
>      vtd_address_space_refresh_all(s);
> +    vtd_replay_pasid_bindings_all(s);
>  }
>  
>  /* Set Interrupt Remap Table Pointer */
> @@ -3084,6 +3086,7 @@ static void vtd_handle_gcmd_te(IntelIOMMUState *s, bool en)
>  
>      vtd_reset_caches(s);
>      vtd_address_space_refresh_all(s);
> +    vtd_replay_pasid_bindings_all(s);
>  }
>  
>  /* Handle Interrupt Remap Enable/Disable */
> @@ -3777,6 +3780,17 @@ static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
>      }
>  }
>  
> +static void vtd_replay_pasid_bindings_all(IntelIOMMUState *s)
> +{
> +    VTDPASIDCacheInfo pc_info = { .type = VTD_PASID_CACHE_GLOBAL_INV };
> +
> +    if (!s->flts || !s->root_scalable || !s->dmar_enabled) {
> +        return;
> +    }
> +
> +    vtd_replay_guest_pasid_bindings(s, &pc_info);
> +}
> +
>  /*
>   * For a PASID cache invalidation, this function handles below scenarios:
>   * a) a present cached pasid entry needs to be removed



^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v5 09/21] intel_iommu: Fail passthrough device under PCI bridge if x-flts=on
  2025-08-22  6:40 ` [PATCH v5 09/21] intel_iommu: Fail passthrough device under PCI bridge if x-flts=on Zhenzhong Duan
@ 2025-08-28 10:33   ` Yi Liu
  2025-09-01  5:14     ` Duan, Zhenzhong
  0 siblings, 1 reply; 113+ messages in thread
From: Yi Liu @ 2025-08-28 10:33 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, joao.m.martins, clement.mathieu--drif, kevin.tian,
	chao.p.peng

On 2025/8/22 14:40, Zhenzhong Duan wrote:
> Currently we don't support nested translation for passthrough device with
> emulated device under same PCI bridge, because they require different address
> space when x-flts=on.
> 
> In theory, we do support if devices under same PCI bridge are all passthrough
> devices. But emulated device can be hotplugged under same bridge. To simplify,
> just forbid passthrough device under PCI bridge no matter if there is, or will
> be emulated devices under same bridge. This is acceptable because PCIE bridge
> is more popular than PCI bridge now.
> 
> Suggested-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> Reviewed-by: Eric Auger <eric.auger@redhat.com>
> ---
>   hw/i386/intel_iommu.c | 13 +++++++++++--
>   1 file changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index da355bda79..6edd91d94e 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -4341,9 +4341,10 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus,
>       return vtd_dev_as;
>   }
>   
> -static bool vtd_check_hiod(IntelIOMMUState *s, HostIOMMUDevice *hiod,
> +static bool vtd_check_hiod(IntelIOMMUState *s, VTDHostIOMMUDevice *vtd_hiod,
>                              Error **errp)
>   {
> +    HostIOMMUDevice *hiod = vtd_hiod->hiod;
>       HostIOMMUDeviceClass *hiodc = HOST_IOMMU_DEVICE_GET_CLASS(hiod);
>       int ret;
>   
> @@ -4370,6 +4371,8 @@ static bool vtd_check_hiod(IntelIOMMUState *s, HostIOMMUDevice *hiod,
>   #ifdef CONFIG_IOMMUFD
>       struct HostIOMMUDeviceCaps *caps = &hiod->caps;
>       struct iommu_hw_info_vtd *vtd = &caps->vendor_caps.vtd;
> +    PCIBus *bus = vtd_hiod->bus;
> +    PCIDevice *pdev = pci_find_device(bus, pci_bus_num(bus), vtd_hiod->devfn);

pci_find_device() looks up the bus pointer from bus_num, which can be
avoided since you already have the bus pointer. Perhaps wrap
bus->devices[devfn] in a helper instead. In particular, pci_bus_num() may
not return the correct bus number at this point.

>   
>       /* Remaining checks are all stage-1 translation specific */
>       if (!object_dynamic_cast(OBJECT(hiod), TYPE_HOST_IOMMU_DEVICE_IOMMUFD)) {
> @@ -4392,6 +4395,12 @@ static bool vtd_check_hiod(IntelIOMMUState *s, HostIOMMUDevice *hiod,
>           error_setg(errp, "Stage-1 1GB huge page is unsupported by host IOMMU");
>           return false;
>       }
> +
> +    if (pci_device_get_iommu_bus_devfn(pdev, &bus, NULL, NULL)) {
> +        error_setg(errp, "Host device under PCI bridge is unsupported "
> +                   "when x-flts=on");
> +        return false;
> +    }
>   #endif
>   
>       error_setg(errp, "host IOMMU is incompatible with stage-1 translation");
> @@ -4425,7 +4434,7 @@ static bool vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
>       vtd_hiod->iommu_state = s;
>       vtd_hiod->hiod = hiod;
>   
> -    if (!vtd_check_hiod(s, hiod, errp)) {
> +    if (!vtd_check_hiod(s, vtd_hiod, errp)) {
>           g_free(vtd_hiod);
>           vtd_iommu_unlock(s);
>           return false;

The other parts look good to me.

Reviewed-by: Yi Liu <yi.l.liu@intel.com>
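
The helper suggested above could look roughly like the sketch below.
The ex_* types are simplified, hypothetical stand-ins for PCIBus and
PCIDevice; the only point is that the lookup indexes the bus's own
device array and never goes through a bus number.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EX_DEVFN_MAX 256    /* 32 slots x 8 functions */

/* Simplified stand-ins for PCIDevice and PCIBus. */
struct ex_pci_device {
    uint8_t devfn;
};

struct ex_pci_bus {
    struct ex_pci_device *devices[EX_DEVFN_MAX];
};

/*
 * Return the device at devfn on this bus, or NULL if the slot is empty.
 * No bus-number lookup is involved, so the result does not depend on
 * the guest having programmed bus numbers yet.
 */
static struct ex_pci_device *ex_bus_get_device(struct ex_pci_bus *bus,
                                               uint8_t devfn)
{
    assert(bus != NULL);
    return bus->devices[devfn];     /* uint8_t index is always < 256 */
}
```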


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v5 10/21] intel_iommu: Introduce two helpers vtd_as_from/to_iommu_pasid_locked
  2025-08-22  6:40 ` [PATCH v5 10/21] intel_iommu: Introduce two helpers vtd_as_from/to_iommu_pasid_locked Zhenzhong Duan
@ 2025-08-28 11:36   ` Yi Liu
  2025-09-01  5:33     ` Duan, Zhenzhong
  0 siblings, 1 reply; 113+ messages in thread
From: Yi Liu @ 2025-08-28 11:36 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, joao.m.martins, clement.mathieu--drif, kevin.tian,
	chao.p.peng

On 2025/8/22 14:40, Zhenzhong Duan wrote:
> PCI device supports two request types, Requests-without-PASID and
> Requests-with-PASID. Requests-without-PASID doesn't include a PASID TLP
> prefix, IOMMU fetches rid_pasid from context entry and use it as IOMMU's
> pasid to index pasid table.
> 
> So we need to translate between PCI's pasid and IOMMU's pasid specially
> for Requests-without-PASID, e.g., PCI_NO_PASID(-1) <-> rid_pasid.
> For Requests-with-PASID, PCI's pasid and IOMMU's pasid are same value.
> 
> vtd_as_from_iommu_pasid_locked() translates from BDF+iommu_pasid to vtd_as
> which contains PCI's pasid vtd_as->pasid.
> 
> vtd_as_to_iommu_pasid_locked() translates from BDF+vtd_as->pasid to iommu_pasid.

"Translate" reads a bit strange here; would "convert" or "get" be better?
The same applies to the other uses of "translate" in the patch.

> 
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> Reviewed-by: Eric Auger <eric.auger@redhat.com>
> ---
>   hw/i386/intel_iommu.c | 58 +++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 58 insertions(+)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 6edd91d94e..1801f1cdf6 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -1602,6 +1602,64 @@ static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
>       return 0;
>   }
>   
> +static int vtd_as_to_iommu_pasid_locked(VTDAddressSpace *vtd_as,
> +                                        uint32_t *pasid)
> +{
> +    VTDContextCacheEntry *cc_entry = &vtd_as->context_cache_entry;
> +    IntelIOMMUState *s = vtd_as->iommu_state;
> +    uint8_t bus_num = pci_bus_num(vtd_as->bus);
> +    uint8_t devfn = vtd_as->devfn;
> +    VTDContextEntry ce;
> +    int ret;
> +
> +    /* For Requests-with-PASID, its pasid value is used by vIOMMU directly */
> +    if (vtd_as->pasid != PCI_NO_PASID) {
> +        *pasid = vtd_as->pasid;
> +        return 0;
> +    }
> +
> +    if (cc_entry->context_cache_gen == s->context_cache_gen) {
> +        ce = cc_entry->context_entry;
> +    } else {
> +        ret = vtd_dev_to_context_entry(s, bus_num, devfn, &ce);
> +        if (ret) {
> +            return ret;
> +        }
> +    }
> +    *pasid = VTD_CE_GET_RID2PASID(&ce);

It looks like quite a few places fetch rid_pasid from the context
entry. I think we could simplify this by using PASID #0, since the vIOMMU
does not report the ECAP.RPS bit at all. That could be done as a separate
cleanup.

> +    return 0;
> +}
> +
> +static gboolean vtd_find_as_by_sid_and_iommu_pasid(gpointer key, gpointer value,
> +                                                   gpointer user_data)
> +{
> +    VTDAddressSpace *vtd_as = (VTDAddressSpace *)value;
> +    struct vtd_as_raw_key *target = (struct vtd_as_raw_key *)user_data;
> +    uint16_t sid = PCI_BUILD_BDF(pci_bus_num(vtd_as->bus), vtd_as->devfn);
> +    uint32_t pasid;
> +
> +    if (vtd_as_to_iommu_pasid_locked(vtd_as, &pasid)) {
> +        return false;
> +    }
> +
> +    return (pasid == target->pasid) && (sid == target->sid);
> +}
> +
> +/* Translate iommu pasid to vtd_as */
> +static inline
> +VTDAddressSpace *vtd_as_from_iommu_pasid_locked(IntelIOMMUState *s,
> +                                                uint16_t sid, uint32_t pasid)
> +{
> +    struct vtd_as_raw_key key = {
> +        .sid = sid,
> +        .pasid = pasid
> +    };
> +
> +    return g_hash_table_find(s->vtd_address_spaces,
> +                             vtd_find_as_by_sid_and_iommu_pasid, &key);
> +}
> +
>   static int vtd_sync_shadow_page_hook(const IOMMUTLBEvent *event,
>                                        void *private)
>   {

the code looks good to me.

Reviewed-by: Yi Liu <yi.l.liu@intel.com>
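
A minimal sketch of the PCI pasid <-> IOMMU pasid conversion described
in the commit message, assuming the rid_pasid has already been read
from the context entry. The ex_* names are hypothetical; the real
helpers also handle the context cache and error returns.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EX_PCI_NO_PASID UINT32_MAX  /* stand-in for PCI_NO_PASID (-1) */

/* Simplified address-space record used only for this illustration. */
struct ex_as {
    uint16_t sid;        /* source ID (bus/devfn) */
    uint32_t pci_pasid;  /* PASID as seen by the PCI layer */
    uint32_t rid_pasid;  /* rid_pasid taken from the context entry */
};

/*
 * Requests-without-PASID use rid_pasid on the IOMMU side;
 * Requests-with-PASID carry the same value on both sides.
 */
static uint32_t ex_to_iommu_pasid(uint32_t pci_pasid, uint32_t rid_pasid)
{
    return pci_pasid == EX_PCI_NO_PASID ? rid_pasid : pci_pasid;
}

/* Reverse lookup predicate: does this address space own (sid, iommu_pasid)? */
static bool ex_as_matches(const struct ex_as *as, uint16_t sid,
                          uint32_t iommu_pasid)
{
    assert(as != NULL);
    return as->sid == sid &&
           ex_to_iommu_pasid(as->pci_pasid, as->rid_pasid) == iommu_pasid;
}
```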


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v5 11/21] intel_iommu: Handle PASID entry removal and update
  2025-08-22  6:40 ` [PATCH v5 11/21] intel_iommu: Handle PASID entry removal and update Zhenzhong Duan
  2025-08-27 14:25   ` Eric Auger
@ 2025-08-28 12:05   ` Yi Liu
  2025-09-01  3:31     ` Duan, Zhenzhong
  1 sibling, 1 reply; 113+ messages in thread
From: Yi Liu @ 2025-08-28 12:05 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, joao.m.martins, clement.mathieu--drif, kevin.tian,
	chao.p.peng, Yi Sun

On 2025/8/22 14:40, Zhenzhong Duan wrote:
> This adds a new entry VTDPASIDCacheEntry in VTDAddressSpace to cache the
> pasid entry and track PASID usage and future PASID tagged DMA address
> translation support in vIOMMU.

Have you seen any extra code needed on top of this series to support
non-rid_pasid PASIDs? If not, you could simply relax the scope of this
series. Otherwise, you may need to tweak the patch a little, e.g. factor
out setting x-flts and x-pasid-mode at the same time.

> 
> VTDAddressSpace of PCI_NO_PASID is allocated when device is plugged and
> never freed. For other pasid, VTDAddressSpace instance is created/destroyed
> per the guest pasid entry set up/destroy.

> When guest removes or updates a PASID entry, QEMU will capture the guest
> pasid selective pasid cache invalidation, then remove the VTDAddressSpace
> or update the cached PASID entry.
> 
> vIOMMU emulator could figure out the reason by fetching latest guest pasid entry
> and compare it with cached PASID entry.
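
A minimal sketch of the sync logic described above, assuming a NULL
guest entry stands for "no valid PASID entry in guest memory". The
ex_* types are hypothetical simplifications, not the structures added
by this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Granularities of a PASID-cache invalidation (values follow the patch). */
enum ex_inv_type { EX_INV_DOMAIN = 0, EX_INV_PASID = 1, EX_INV_GLOBAL = 3 };

struct ex_inv {
    enum ex_inv_type type;
    uint16_t did;
    uint32_t pasid;
};

struct ex_pasid_entry { uint64_t val[8]; };
struct ex_cache { struct ex_pasid_entry entry; bool valid; };

enum ex_action { EX_KEEP, EX_UPDATE, EX_REMOVE };

/* PASID-selective implies domain-selective, hence the fall-throughs. */
static bool ex_inv_selects(const struct ex_inv *inv, uint16_t did,
                           uint32_t pasid)
{
    switch (inv->type) {
    case EX_INV_PASID:
        if (inv->pasid != pasid) {
            return false;
        }
        /* fall through */
    case EX_INV_DOMAIN:
        if (inv->did != did) {
            return false;
        }
        /* fall through */
    case EX_INV_GLOBAL:
        return true;
    }
    return false;
}

/* Compare the cached entry with the latest guest entry and act on it. */
static enum ex_action ex_cache_sync(struct ex_cache *c,
                                    const struct ex_pasid_entry *guest)
{
    assert(c != NULL);
    if (!c->valid) {
        return EX_KEEP;             /* nothing cached, nothing to do */
    }
    if (guest == NULL) {
        c->valid = false;           /* entry vanished in guest memory */
        return EX_REMOVE;
    }
    if (memcmp(&c->entry, guest, sizeof(*guest)) != 0) {
        c->entry = *guest;          /* stale cache, refresh it */
        return EX_UPDATE;
    }
    return EX_KEEP;
}
```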
> 
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>   hw/i386/intel_iommu_internal.h |  27 ++++-
>   include/hw/i386/intel_iommu.h  |   6 +
>   hw/i386/intel_iommu.c          | 196 +++++++++++++++++++++++++++++++--
>   hw/i386/trace-events           |   3 +
>   4 files changed, 220 insertions(+), 12 deletions(-)
> 
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> index f7510861d1..b9b76dd996 100644
> --- a/hw/i386/intel_iommu_internal.h
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -316,6 +316,7 @@ typedef enum VTDFaultReason {
>                                     * request while disabled */
>       VTD_FR_IR_SID_ERR = 0x26,   /* Invalid Source-ID */
>   
> +    VTD_FR_RTADDR_INV_TTM = 0x31,  /* Invalid TTM in RTADDR */
>       /* PASID directory entry access failure */
>       VTD_FR_PASID_DIR_ACCESS_ERR = 0x50,
>       /* The Present(P) field of pasid directory entry is 0 */
> @@ -493,6 +494,15 @@ typedef union VTDInvDesc VTDInvDesc;
>   #define VTD_INV_DESC_PIOTLB_RSVD_VAL0     0xfff000000000f1c0ULL
>   #define VTD_INV_DESC_PIOTLB_RSVD_VAL1     0xf80ULL
>   
> +/* PASID-cache Invalidate Descriptor (pc_inv_dsc) fields */
> +#define VTD_INV_DESC_PASIDC_G(x)        extract64((x)->val[0], 4, 2)
> +#define VTD_INV_DESC_PASIDC_G_DSI       0
> +#define VTD_INV_DESC_PASIDC_G_PASID_SI  1
> +#define VTD_INV_DESC_PASIDC_G_GLOBAL    3
> +#define VTD_INV_DESC_PASIDC_DID(x)      extract64((x)->val[0], 16, 16)
> +#define VTD_INV_DESC_PASIDC_PASID(x)    extract64((x)->val[0], 32, 20)
> +#define VTD_INV_DESC_PASIDC_RSVD_VAL0   0xfff000000000f1c0ULL
> +
>   /* Information about page-selective IOTLB invalidate */
>   struct VTDIOTLBPageInvInfo {
>       uint16_t domain_id;
> @@ -553,6 +563,21 @@ typedef struct VTDRootEntry VTDRootEntry;
>   #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw)  (0x1e0ULL | ~VTD_HAW_MASK(aw))
>   #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1      0xffffffffffe00000ULL
>   
> +typedef enum VTDPCInvType {
> +    /* VTD spec defined PASID cache invalidation type */
> +    VTD_PASID_CACHE_DOMSI = VTD_INV_DESC_PASIDC_G_DSI,
> +    VTD_PASID_CACHE_PASIDSI = VTD_INV_DESC_PASIDC_G_PASID_SI,
> +    VTD_PASID_CACHE_GLOBAL_INV = VTD_INV_DESC_PASIDC_G_GLOBAL,
> +} VTDPCInvType;
> +
> +typedef struct VTDPASIDCacheInfo {
> +    VTDPCInvType type;
> +    uint16_t did;
> +    uint32_t pasid;
> +    PCIBus *bus;
> +    uint16_t devfn;
> +} VTDPASIDCacheInfo;
> +
>   /* PASID Table Related Definitions */
>   #define VTD_PASID_DIR_BASE_ADDR_MASK  (~0xfffULL)
>   #define VTD_PASID_TABLE_BASE_ADDR_MASK (~0xfffULL)
> @@ -574,7 +599,7 @@ typedef struct VTDRootEntry VTDRootEntry;
>   #define VTD_SM_PASID_ENTRY_PT          (4ULL << 6)
>   
>   #define VTD_SM_PASID_ENTRY_AW          7ULL /* Adjusted guest-address-width */
> -#define VTD_SM_PASID_ENTRY_DID(val)    ((val) & VTD_DOMAIN_ID_MASK)
> +#define VTD_SM_PASID_ENTRY_DID(x)      extract64((x)->val[1], 0, 16)
>   
>   #define VTD_SM_PASID_ENTRY_FLPM          3ULL
>   #define VTD_SM_PASID_ENTRY_FLPTPTR       (~0xfffULL)
> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> index 50f9b27a45..0e3826f6f0 100644
> --- a/include/hw/i386/intel_iommu.h
> +++ b/include/hw/i386/intel_iommu.h
> @@ -95,6 +95,11 @@ struct VTDPASIDEntry {
>       uint64_t val[8];
>   };
>   
> +typedef struct VTDPASIDCacheEntry {
> +    struct VTDPASIDEntry pasid_entry;
> +    bool valid;
> +} VTDPASIDCacheEntry;
> +
>   struct VTDAddressSpace {
>       PCIBus *bus;
>       uint8_t devfn;
> @@ -107,6 +112,7 @@ struct VTDAddressSpace {
>       MemoryRegion iommu_ir_fault; /* Interrupt region for catching fault */
>       IntelIOMMUState *iommu_state;
>       VTDContextCacheEntry context_cache_entry;
> +    VTDPASIDCacheEntry pasid_cache_entry;
>       QLIST_ENTRY(VTDAddressSpace) next;
>       /* Superset of notifier flags that this address space has */
>       IOMMUNotifierFlag notifier_flags;
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 1801f1cdf6..a2ee6d684e 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -1675,7 +1675,7 @@ static uint16_t vtd_get_domain_id(IntelIOMMUState *s,
>   
>       if (s->root_scalable) {
>           vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
> -        return VTD_SM_PASID_ENTRY_DID(pe.val[1]);
> +        return VTD_SM_PASID_ENTRY_DID(&pe);
>       }
>   
>       return VTD_CONTEXT_ENTRY_DID(ce->hi);
> @@ -3112,6 +3112,183 @@ static bool vtd_process_piotlb_desc(IntelIOMMUState *s,
>       return true;
>   }
>   
> +static inline int vtd_dev_get_pe_from_pasid(VTDAddressSpace *vtd_as,
> +                                            uint32_t pasid, VTDPASIDEntry *pe)
> +{
> +    IntelIOMMUState *s = vtd_as->iommu_state;
> +    VTDContextEntry ce;
> +    int ret;
> +
> +    if (!s->root_scalable) {
> +        return -VTD_FR_RTADDR_INV_TTM;
> +    }
> +
> +    ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus), vtd_as->devfn,
> +                                   &ce);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    return vtd_ce_get_pasid_entry(s, &ce, pe, pasid);
> +}
> +
> +static bool vtd_pasid_entry_compare(VTDPASIDEntry *p1, VTDPASIDEntry *p2)
> +{
> +    return !memcmp(p1, p2, sizeof(*p1));
> +}
> +
> +/*
> + * This function is a loop function which return value determines if
> + * vtd_as including cached pasid entry is removed.
> + *
> + * For PCI_NO_PASID, when corresponding cached pasid entry is cleared,
> + * it returns false so that vtd_as is reserved as it's owned by PCI
> + * sub-system. For other pasid, it returns true so vtd_as is removed.

also, this helper will always return true if this series does not
support non-rid_pasid PASID.

> + */
> +static gboolean vtd_flush_pasid_locked(gpointer key, gpointer value,
> +                                       gpointer user_data)
> +{
> +    VTDPASIDCacheInfo *pc_info = user_data;
> +    VTDAddressSpace *vtd_as = value;
> +    VTDPASIDCacheEntry *pc_entry = &vtd_as->pasid_cache_entry;
> +    VTDPASIDEntry pe;
> +    uint16_t did;
> +    uint32_t pasid;
> +    int ret;
> +
> +    if (!pc_entry->valid) {
> +        return false;
> +    }
> +    did = VTD_SM_PASID_ENTRY_DID(&pc_entry->pasid_entry);
> +
> +    if (vtd_as_to_iommu_pasid_locked(vtd_as, &pasid)) {
> +        goto remove;
> +    }
> +
> +    switch (pc_info->type) {
> +    case VTD_PASID_CACHE_PASIDSI:
> +        if (pc_info->pasid != pasid) {
> +            return false;
> +        }
> +        /* fall through */
> +    case VTD_PASID_CACHE_DOMSI:
> +        if (pc_info->did != did) {
> +            return false;
> +        }
> +        /* fall through */
> +    case VTD_PASID_CACHE_GLOBAL_INV:
> +        break;
> +    default:
> +        error_setg(&error_fatal, "invalid pc_info->type for flush");
> +    }
> +
> +    /*
> +     * pasid cache invalidation may indicate a present pasid entry to present
> +     * pasid entry modification. To cover such case, vIOMMU emulator needs to
> +     * fetch latest guest pasid entry and compares with cached pasid entry,
> +     * then update pasid cache.
> +     */
> +    ret = vtd_dev_get_pe_from_pasid(vtd_as, pasid, &pe);
> +    if (ret) {
> +        /*
> +         * No valid pasid entry in guest memory. e.g. pasid entry was modified
> +         * to be either all-zero or non-present. Either case means existing
> +         * pasid cache should be removed.
> +         */
> +        goto remove;
> +    }
> +
> +    /*
> +     * Update cached pasid entry if it's stale compared to what's in guest
> +     * memory.
> +     */
> +    if (!vtd_pasid_entry_compare(&pe, &pc_entry->pasid_entry)) {
> +        pc_entry->pasid_entry = pe;
> +    }
> +    return false;
> +
> +remove:
> +    pc_entry->valid = false;
> +
> +    /*
> +     * Don't remove address space of PCI_NO_PASID which is created for PCI
> +     * sub-system.
> +     */
> +    if (vtd_as->pasid == PCI_NO_PASID) {
> +        return false;
> +    }
> +    return true;
> +}
> +
> +/*
> + * For a PASID cache invalidation, this function handles below scenarios:
> + * a) a present cached pasid entry needs to be removed
> + * b) a present cached pasid entry needs to be updated
> + */
> +static void vtd_pasid_cache_sync(IntelIOMMUState *s, VTDPASIDCacheInfo *pc_info)
> +{
> +    if (!s->flts || !s->root_scalable || !s->dmar_enabled) {
> +        return;
> +    }
> +
> +    vtd_iommu_lock(s);
> +    /*
> +     * a,b): loop all the existing vtd_as instances for pasid cache removal
> +       or update.
> +     */
> +    g_hash_table_foreach_remove(s->vtd_address_spaces, vtd_flush_pasid_locked,
> +                                pc_info);
> +    vtd_iommu_unlock(s);
> +}
> +
> +static bool vtd_process_pasid_desc(IntelIOMMUState *s,
> +                                   VTDInvDesc *inv_desc)
> +{
> +    uint16_t did;
> +    uint32_t pasid;
> +    VTDPASIDCacheInfo pc_info;
> +    uint64_t mask[4] = {VTD_INV_DESC_PASIDC_RSVD_VAL0, VTD_INV_DESC_ALL_ONE,
> +                        VTD_INV_DESC_ALL_ONE, VTD_INV_DESC_ALL_ONE};
> +
> +    if (!vtd_inv_desc_reserved_check(s, inv_desc, mask, true,
> +                                     __func__, "pasid cache inv")) {
> +        return false;
> +    }
> +
> +    did = VTD_INV_DESC_PASIDC_DID(inv_desc);
> +    pasid = VTD_INV_DESC_PASIDC_PASID(inv_desc);
> +
> +    switch (VTD_INV_DESC_PASIDC_G(inv_desc)) {
> +    case VTD_INV_DESC_PASIDC_G_DSI:
> +        trace_vtd_pasid_cache_dsi(did);
> +        pc_info.type = VTD_PASID_CACHE_DOMSI;
> +        pc_info.did = did;
> +        break;
> +
> +    case VTD_INV_DESC_PASIDC_G_PASID_SI:
> +        /* PASID selective implies a DID selective */
> +        trace_vtd_pasid_cache_psi(did, pasid);
> +        pc_info.type = VTD_PASID_CACHE_PASIDSI;
> +        pc_info.did = did;
> +        pc_info.pasid = pasid;
> +        break;
> +
> +    case VTD_INV_DESC_PASIDC_G_GLOBAL:
> +        trace_vtd_pasid_cache_gsi();
> +        pc_info.type = VTD_PASID_CACHE_GLOBAL_INV;
> +        break;
> +
> +    default:
> +        error_report_once("invalid granularity field in PASID-cache invalidate "
> +                          "descriptor, hi: 0x%"PRIx64" lo: 0x%" PRIx64,
> +                           inv_desc->val[1], inv_desc->val[0]);
> +        return false;
> +    }
> +
> +    vtd_pasid_cache_sync(s, &pc_info);
> +    return true;
> +}
> +
>   static bool vtd_process_inv_iec_desc(IntelIOMMUState *s,
>                                        VTDInvDesc *inv_desc)
>   {
> @@ -3274,6 +3451,13 @@ static bool vtd_process_inv_desc(IntelIOMMUState *s)
>           }
>           break;
>   
> +    case VTD_INV_DESC_PC:
> +        trace_vtd_inv_desc("pasid-cache", inv_desc.val[1], inv_desc.val[0]);
> +        if (!vtd_process_pasid_desc(s, &inv_desc)) {
> +            return false;
> +        }
> +        break;
> +
>       case VTD_INV_DESC_PIOTLB:
>           trace_vtd_inv_desc("p-iotlb", inv_desc.val[1], inv_desc.val[0]);
>           if (!vtd_process_piotlb_desc(s, &inv_desc)) {
> @@ -3309,16 +3493,6 @@ static bool vtd_process_inv_desc(IntelIOMMUState *s)
>           }
>           break;
>   
> -    /*
> -     * TODO: the entity of below two cases will be implemented in future series.
> -     * To make guest (which integrates scalable mode support patch set in
> -     * iommu driver) work, just return true is enough so far.
> -     */
> -    case VTD_INV_DESC_PC:
> -        if (s->scalable_mode) {
> -            break;
> -        }
> -    /* fallthrough */
>       default:
>           error_report_once("%s: invalid inv desc: hi=%"PRIx64", lo=%"PRIx64
>                             " (unknown type)", __func__, inv_desc.hi,
> diff --git a/hw/i386/trace-events b/hw/i386/trace-events
> index ac9e1a10aa..ae5bbfcdc0 100644
> --- a/hw/i386/trace-events
> +++ b/hw/i386/trace-events
> @@ -24,6 +24,9 @@ vtd_inv_qi_head(uint16_t head) "read head %d"
>   vtd_inv_qi_tail(uint16_t head) "write tail %d"
>   vtd_inv_qi_fetch(void) ""
>   vtd_context_cache_reset(void) ""
> +vtd_pasid_cache_gsi(void) ""
> +vtd_pasid_cache_dsi(uint16_t domain) "Domain selective PC invalidation domain 0x%"PRIx16
> +vtd_pasid_cache_psi(uint16_t domain, uint32_t pasid) "PASID selective PC invalidation domain 0x%"PRIx16" pasid 0x%"PRIx32
>   vtd_re_not_present(uint8_t bus) "Root entry bus %"PRIu8" not present"
>   vtd_ce_not_present(uint8_t bus, uint8_t devfn) "Context entry bus %"PRIu8" devfn %"PRIu8" not present"
>   vtd_iotlb_page_hit(uint16_t sid, uint64_t addr, uint64_t slpte, uint16_t domain) "IOTLB page hit sid 0x%"PRIx16" iova 0x%"PRIx64" slpte 0x%"PRIx64" domain 0x%"PRIx16


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v5 17/21] intel_iommu: Propagate PASID-based iotlb invalidation to host
  2025-08-28 10:00   ` Eric Auger
@ 2025-08-28 12:11     ` Yi Liu
  2025-09-01  8:32     ` Duan, Zhenzhong
  1 sibling, 0 replies; 113+ messages in thread
From: Yi Liu @ 2025-08-28 12:11 UTC (permalink / raw)
  To: eric.auger, Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, mst, jasowang, peterx, ddutile, jgg,
	nicolinc, joao.m.martins, clement.mathieu--drif, kevin.tian,
	chao.p.peng, Yi Sun

On 2025/8/28 18:00, Eric Auger wrote:
> 
> 
> On 8/22/25 8:40 AM, Zhenzhong Duan wrote:
>> From: Yi Liu <yi.l.liu@intel.com>
>>
>> Trap the guest PASID-based iotlb invalidation request and propagate it to
>> the host.
>>
>> Intel VT-d 3.0 supports nested translation in PASID granularity. Guest SVA
>> support could be implemented by configuring nested translation on specific
>> pasid. This is also known as dual stage DMA translation.
>>
>> Under such configuration, guest owns the GVA->GPA translation which is
>> configured as stage-1 page table on host side for a specific pasid, and host
>> owns GPA->HPA translation. As guest owns stage-1 translation table, piotlb
>> invalidation should be propagated to host since host IOMMU will cache first
>> level page table related mappings during DMA address translation.
>>
>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>>   hw/i386/intel_iommu_internal.h |  6 +++
>>   hw/i386/intel_iommu.c          | 95 +++++++++++++++++++++++++++++++++-
>>   2 files changed, 99 insertions(+), 2 deletions(-)
>>
>> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
>> index 8af1004888..c1a9263651 100644
>> --- a/hw/i386/intel_iommu_internal.h
>> +++ b/hw/i386/intel_iommu_internal.h
>> @@ -596,6 +596,12 @@ typedef struct VTDPASIDCacheInfo {
>>       uint16_t devfn;
>>   } VTDPASIDCacheInfo;
>>   
>> +typedef struct VTDPIOTLBInvInfo {
>> +    uint16_t domain_id;
>> +    uint32_t pasid;
>> +    struct iommu_hwpt_vtd_s1_invalidate *inv_data;
>> +} VTDPIOTLBInvInfo;
>> +
>>   /* PASID Table Related Definitions */
>>   #define VTD_PASID_DIR_BASE_ADDR_MASK  (~0xfffULL)
>>   #define VTD_PASID_TABLE_BASE_ADDR_MASK (~0xfffULL)
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index 6c0e502d1c..7efa22f4ec 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -2611,12 +2611,99 @@ static int vtd_bind_guest_pasid(VTDAddressSpace *vtd_as, VTDPASIDOp op,
>>   
>>       return ret;
>>   }
>> +
>> +static void
>> +vtd_invalidate_piotlb_locked(VTDAddressSpace *vtd_as,
>> +                             struct iommu_hwpt_vtd_s1_invalidate *cache)
>> +{
>> +    IntelIOMMUState *s = vtd_as->iommu_state;
>> +    VTDHostIOMMUDevice *vtd_hiod = vtd_find_hiod_iommufd(s, vtd_as);
>> +    HostIOMMUDeviceIOMMUFD *idev;
>> +    uint32_t entry_num = 1; /* Only implement one request for simplicity */
> Can you remind me what it is used for? Why 1?

The iommufd cache invalidation interface supports passing an array
of invalidation requests; entry_num is the number of entries in that
array. For simplicity, we start with 1.
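
To make the "array of requests" point concrete, here is a rough sketch.
It is not the iommufd uAPI: struct inv_req, submit_batch() and
invalidate_one() are hypothetical stand-ins that only show the shape of
the call, where the caller passes a pointer to the requests plus an
in/out entry count.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical request layout; the real uAPI structure differs. */
struct inv_req {
    uint64_t addr;
    uint64_t npages;
};

/*
 * Hypothetical backend: consumes up to *entry_num requests and writes
 * back the number actually handled. A request with npages == 0 is
 * treated as invalid and stops processing.
 */
static int submit_batch(const struct inv_req *reqs, uint32_t *entry_num)
{
    uint32_t done = 0;
    int ret;

    assert(reqs != NULL && entry_num != NULL);

    for (uint32_t i = 0; i < *entry_num; i++) {
        if (reqs[i].npages == 0) {
            break;
        }
        done++;
    }

    ret = (done == *entry_num) ? 0 : -1;
    *entry_num = done;
    return ret;
}

/* The single-request case: an array of length 1. */
static int invalidate_one(uint64_t addr, uint64_t npages)
{
    struct inv_req req = { .addr = addr, .npages = npages };
    uint32_t entry_num = 1;

    return submit_batch(&req, &entry_num);
}
```

Batching several requests later would only change the caller, which
would fill a larger array and pass a larger count.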

Regards,
Yi Liu


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v5 19/21] vfio: Add a new element bypass_ro in VFIOContainerBase
  2025-08-22  6:40 ` [PATCH v5 19/21] vfio: Add a new element bypass_ro in VFIOContainerBase Zhenzhong Duan
@ 2025-08-28 12:47   ` Eric Auger
  0 siblings, 0 replies; 113+ messages in thread
From: Eric Auger @ 2025-08-28 12:47 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, mst, jasowang, peterx, ddutile, jgg,
	nicolinc, joao.m.martins, clement.mathieu--drif, kevin.tian,
	yi.l.liu, chao.p.peng

Hi Zhenzhong,

On 8/22/25 8:40 AM, Zhenzhong Duan wrote:
> When bypass_ro is true, readonly memory section is bypassed from
> mapping in the container.
>
> This is a preparatory patch to work around Intel ERRATA_772415.
I would explain what this erratum requires us to implement.

Thanks

Eric
>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>  include/hw/vfio/vfio-container-base.h |  1 +
>  hw/vfio/listener.c                    | 21 ++++++++++++++-------
>  2 files changed, 15 insertions(+), 7 deletions(-)
>
> diff --git a/include/hw/vfio/vfio-container-base.h b/include/hw/vfio/vfio-container-base.h
> index bded6e993f..31fd784d76 100644
> --- a/include/hw/vfio/vfio-container-base.h
> +++ b/include/hw/vfio/vfio-container-base.h
> @@ -51,6 +51,7 @@ typedef struct VFIOContainerBase {
>      QLIST_HEAD(, VFIODevice) device_list;
>      GList *iova_ranges;
>      NotifierWithReturn cpr_reboot_notifier;
> +    bool bypass_ro;
>  } VFIOContainerBase;
>  
>  typedef struct VFIOGuestIOMMU {
> diff --git a/hw/vfio/listener.c b/hw/vfio/listener.c
> index 903dfd8bf2..5fa2bb7f1a 100644
> --- a/hw/vfio/listener.c
> +++ b/hw/vfio/listener.c
> @@ -76,8 +76,13 @@ static bool vfio_log_sync_needed(const VFIOContainerBase *bcontainer)
>      return true;
>  }
>  
> -static bool vfio_listener_skipped_section(MemoryRegionSection *section)
> +static bool vfio_listener_skipped_section(MemoryRegionSection *section,
> +                                          bool bypass_ro)
>  {
> +    if (bypass_ro && section->readonly) {
> +        return true;
> +    }
> +
>      return (!memory_region_is_ram(section->mr) &&
>              !memory_region_is_iommu(section->mr)) ||
>             memory_region_is_protected(section->mr) ||
> @@ -365,9 +370,9 @@ static bool vfio_known_safe_misalignment(MemoryRegionSection *section)
>  }
>  
>  static bool vfio_listener_valid_section(MemoryRegionSection *section,
> -                                        const char *name)
> +                                        bool bypass_ro, const char *name)
>  {
> -    if (vfio_listener_skipped_section(section)) {
> +    if (vfio_listener_skipped_section(section, bypass_ro)) {
>          trace_vfio_listener_region_skip(name,
>                  section->offset_within_address_space,
>                  section->offset_within_address_space +
> @@ -494,7 +499,8 @@ void vfio_container_region_add(VFIOContainerBase *bcontainer,
>      int ret;
>      Error *err = NULL;
>  
> -    if (!vfio_listener_valid_section(section, "region_add")) {
> +    if (!vfio_listener_valid_section(section, bcontainer->bypass_ro,
> +                                     "region_add")) {
>          return;
>      }
>  
> @@ -655,7 +661,8 @@ static void vfio_listener_region_del(MemoryListener *listener,
>      int ret;
>      bool try_unmap = true;
>  
> -    if (!vfio_listener_valid_section(section, "region_del")) {
> +    if (!vfio_listener_valid_section(section, bcontainer->bypass_ro,
> +                                     "region_del")) {
>          return;
>      }
>  
> @@ -812,7 +819,7 @@ static void vfio_dirty_tracking_update(MemoryListener *listener,
>          container_of(listener, VFIODirtyRangesListener, listener);
>      hwaddr iova, end;
>  
> -    if (!vfio_listener_valid_section(section, "tracking_update") ||
> +    if (!vfio_listener_valid_section(section, false, "tracking_update") ||
>          !vfio_get_section_iova_range(dirty->bcontainer, section,
>                                       &iova, &end, NULL)) {
>          return;
> @@ -1206,7 +1213,7 @@ static void vfio_listener_log_sync(MemoryListener *listener,
>      int ret;
>      Error *local_err = NULL;
>  
> -    if (vfio_listener_skipped_section(section)) {
> +    if (vfio_listener_skipped_section(section, false)) {
>          return;
>      }
>  
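
A reduced model of the skip predicate may help when reading the hunks
above. This is a sketch under assumptions: struct sketch_section and its
boolean fields are hypothetical stand-ins for MemoryRegionSection and the
memory_region_is_*() helpers, and only the conditions visible in the
quoted hunk are modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flattened view of a memory region section. */
struct sketch_section {
    bool is_ram;
    bool is_iommu;
    bool is_protected;
    bool readonly;
};

/*
 * Return true if the section must not be mapped into the container.
 * With bypass_ro set, read-only sections are skipped up front.
 */
static bool sketch_section_skipped(const struct sketch_section *s,
                                   bool bypass_ro)
{
    assert(s != NULL);

    if (bypass_ro && s->readonly) {
        return true;
    }

    return (!s->is_ram && !s->is_iommu) || s->is_protected;
}
```

Note that the dirty tracking and log sync callers in the patch pass
false, so in this model they see the same result as before the change.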



^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v5 21/21] intel_iommu: Enable host device when x-flts=on in scalable mode
  2025-08-22  6:40 ` [PATCH v5 21/21] intel_iommu: Enable host device when x-flts=on in scalable mode Zhenzhong Duan
@ 2025-08-28 12:51   ` Eric Auger
  2025-08-29  7:42   ` Yi Liu
  1 sibling, 0 replies; 113+ messages in thread
From: Eric Auger @ 2025-08-28 12:51 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, mst, jasowang, peterx, ddutile, jgg,
	nicolinc, joao.m.martins, clement.mathieu--drif, kevin.tian,
	yi.l.liu, chao.p.peng



On 8/22/25 8:40 AM, Zhenzhong Duan wrote:
> Now that all the infrastructure for supporting a passthrough device running
> with stage-1 translation is in place, enable it.
>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>  hw/i386/intel_iommu.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index f9cb13e945..04a412d460 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -5222,6 +5222,8 @@ static bool vtd_check_hiod(IntelIOMMUState *s, VTDHostIOMMUDevice *vtd_hiod,
>                     "when x-flts=on");
>          return false;
>      }
> +
> +    return true;
The easiest one ;-)

Reviewed-by: Eric Auger <eric.auger@redhat.com>

Eric
>  #endif
>  
>      error_setg(errp, "host IOMMU is incompatible with stage-1 translation");



^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v5 05/21] vfio/iommufd: Force creating nested parent domain
  2025-08-28  9:53     ` Duan, Zhenzhong
@ 2025-08-28 13:00       ` Eric Auger
  2025-08-29  1:40         ` Duan, Zhenzhong
  0 siblings, 1 reply; 113+ messages in thread
From: Eric Auger @ 2025-08-28 13:00 UTC (permalink / raw)
  To: Duan, Zhenzhong, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
	jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
	jgg@nvidia.com, nicolinc@nvidia.com, joao.m.martins@oracle.com,
	clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
	Peng, Chao P



On 8/28/25 11:53 AM, Duan, Zhenzhong wrote:
> Hi Eric,
>
>> -----Original Message-----
>> From: Eric Auger <eric.auger@redhat.com>
>> Subject: Re: [PATCH v5 05/21] vfio/iommufd: Force creating nested parent
>> domain
>>
>> Hi Zhenzhong,
>>
>> On 8/22/25 8:40 AM, Zhenzhong Duan wrote:
>>> Call pci_device_get_viommu_cap() to get if vIOMMU supports
>> VIOMMU_CAP_HW_NESTED,
>>> if yes, create nested parent domain which could be reused by vIOMMU to
>> create
>>> nested domain.
>>>
>>> Introduce helper vfio_device_viommu_get_nested to facilitate this
>>> implementation.
>>>
>>> It is safe because even if VIOMMU_CAP_HW_NESTED is returned, s->flts is
>>> forbidden and VFIO device fails in set_iommu_device() call, until we support
>>> passthrough device with x-flts=on.
>>>
>>> Suggested-by: Nicolin Chen <nicolinc@nvidia.com>
>>> Suggested-by: Yi Liu <yi.l.liu@intel.com>
>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>> ---
>>>  include/hw/vfio/vfio-device.h |  2 ++
>>>  hw/vfio/device.c              | 12 ++++++++++++
>>>  hw/vfio/iommufd.c             |  8 ++++++++
>>>  3 files changed, 22 insertions(+)
>>>
>>> diff --git a/include/hw/vfio/vfio-device.h b/include/hw/vfio/vfio-device.h
>>> index 6e4d5ccdac..ecd82c16c7 100644
>>> --- a/include/hw/vfio/vfio-device.h
>>> +++ b/include/hw/vfio/vfio-device.h
>>> @@ -257,6 +257,8 @@ void vfio_device_prepare(VFIODevice *vbasedev,
>> VFIOContainerBase *bcontainer,
>>>  void vfio_device_unprepare(VFIODevice *vbasedev);
>>>
>>> +bool vfio_device_viommu_get_nested(VFIODevice *vbasedev);
>> I would suggest vfio_device_viommu_has_feature_hw_nested or something
>> alike.
>> "get" usually means you take a ref count that is paired with a put.
> Sure.
>
>>> +
>>>  int vfio_device_get_region_info(VFIODevice *vbasedev, int index,
>>>                                  struct vfio_region_info **info);
>>>  int vfio_device_get_region_info_type(VFIODevice *vbasedev, uint32_t
>> type,
>>> diff --git a/hw/vfio/device.c b/hw/vfio/device.c
>>> index 08f12ac31f..3eeb71bd51 100644
>>> --- a/hw/vfio/device.c
>>> +++ b/hw/vfio/device.c
>>> @@ -23,6 +23,7 @@
>>>
>>>  #include "hw/vfio/vfio-device.h"
>>>  #include "hw/vfio/pci.h"
>>> +#include "hw/iommu.h"
>>>  #include "hw/hw.h"
>>>  #include "trace.h"
>>>  #include "qapi/error.h"
>>> @@ -504,6 +505,17 @@ void vfio_device_unprepare(VFIODevice
>> *vbasedev)
>>>      vbasedev->bcontainer = NULL;
>>>  }
>>>
>>> +bool vfio_device_viommu_get_nested(VFIODevice *vbasedev)
>>> +{
>>> +    VFIOPCIDevice *vdev = vfio_pci_from_vfio_device(vbasedev);
>>> +
>>> +    if (vdev) {
>>> +        return !!(pci_device_get_viommu_cap(&vdev->pdev) &
>>> +                  VIOMMU_CAP_HW_NESTED);
>>> +    }
>>> +    return false;
>>> +}
>>> +
>>>  /*
>>>   * Traditional ioctl() based io
>>>   */
>>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>>> index 8c27222f75..e503c232e1 100644
>>> --- a/hw/vfio/iommufd.c
>>> +++ b/hw/vfio/iommufd.c
>>> @@ -379,6 +379,14 @@ static bool
>> iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
>>>          flags = IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
>>>      }
>>>
>>> +    /*
>>> +     * If vIOMMU supports stage-1 translation, force to create nested
>> parent
>> I would rather not use another terminology here. You previously used
>> hw_nested, I think that's better. Also bear in mind that smmu supports
>> S1, S2 and S1+S2 in emulated code.
> What about 'nesting parent' to match kernel side terminology, per Nicolin's suggestion:
>
> In kernel kdoc/uAPI, we use:
>  - "nesting parent" for stage-2 object
>  - "nested hwpt", "nested domain" for stage-1 object
I still think that since you queried the HW_NESTED cap it makes sense to
continue using it. This can come along with the kernel terminology though.

Eric
>
> Thanks
> Zhenzhong
>> Thanks
>>
>> Eric
>>> +     * domain which could be reused by vIOMMU to create nested
>> domain.
>>> +     */
>>> +    if (vfio_device_viommu_get_nested(vbasedev)) {
>>> +        flags |= IOMMU_HWPT_ALLOC_NEST_PARENT;
>>> +    }
>>> +
>>>      if (cpr_is_incoming()) {
>>>          hwpt_id = vbasedev->cpr.hwpt_id;
>>>          goto skip_alloc;



^ permalink raw reply	[flat|nested] 113+ messages in thread

* RE: [PATCH v5 05/21] vfio/iommufd: Force creating nested parent domain
  2025-08-28 13:00       ` Eric Auger
@ 2025-08-29  1:40         ` Duan, Zhenzhong
  2025-08-29  3:47           ` Nicolin Chen
  0 siblings, 1 reply; 113+ messages in thread
From: Duan, Zhenzhong @ 2025-08-29  1:40 UTC (permalink / raw)
  To: eric.auger@redhat.com, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
	jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
	jgg@nvidia.com, nicolinc@nvidia.com, joao.m.martins@oracle.com,
	clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
	Peng, Chao P



>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: Re: [PATCH v5 05/21] vfio/iommufd: Force creating nested parent
>domain
>
>
>
>On 8/28/25 11:53 AM, Duan, Zhenzhong wrote:
>> Hi Eric,
>>
>>> -----Original Message-----
>>> From: Eric Auger <eric.auger@redhat.com>
>>> Subject: Re: [PATCH v5 05/21] vfio/iommufd: Force creating nested parent
>>> domain
>>>
>>> Hi Zhenzhong,
>>>
>>> On 8/22/25 8:40 AM, Zhenzhong Duan wrote:
>>>> Call pci_device_get_viommu_cap() to get if vIOMMU supports
>>> VIOMMU_CAP_HW_NESTED,
>>>> if yes, create nested parent domain which could be reused by vIOMMU to
>>> create
>>>> nested domain.
>>>>
>>>> Introduce helper vfio_device_viommu_get_nested to facilitate this
>>>> implementation.
>>>>
>>>> It is safe because even if VIOMMU_CAP_HW_NESTED is returned, s->flts
>is
>>>> forbidden and VFIO device fails in set_iommu_device() call, until we
>support
>>>> passthrough device with x-flts=on.
>>>>
>>>> Suggested-by: Nicolin Chen <nicolinc@nvidia.com>
>>>> Suggested-by: Yi Liu <yi.l.liu@intel.com>
>>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>>> ---
>>>>  include/hw/vfio/vfio-device.h |  2 ++
>>>>  hw/vfio/device.c              | 12 ++++++++++++
>>>>  hw/vfio/iommufd.c             |  8 ++++++++
>>>>  3 files changed, 22 insertions(+)
>>>>
>>>> diff --git a/include/hw/vfio/vfio-device.h b/include/hw/vfio/vfio-device.h
>>>> index 6e4d5ccdac..ecd82c16c7 100644
>>>> --- a/include/hw/vfio/vfio-device.h
>>>> +++ b/include/hw/vfio/vfio-device.h
>>>> @@ -257,6 +257,8 @@ void vfio_device_prepare(VFIODevice *vbasedev,
>>> VFIOContainerBase *bcontainer,
>>>>  void vfio_device_unprepare(VFIODevice *vbasedev);
>>>>
>>>> +bool vfio_device_viommu_get_nested(VFIODevice *vbasedev);
>>> I would suggest vfio_device_viommu_has_feature_hw_nested or
>something
>>> alike
>>> "get" usually means you take a ref count that is paired with a put.
>> Sure.
>>
>>>> +
>>>>  int vfio_device_get_region_info(VFIODevice *vbasedev, int index,
>>>>                                  struct vfio_region_info **info);
>>>>  int vfio_device_get_region_info_type(VFIODevice *vbasedev, uint32_t
>>> type,
>>>> diff --git a/hw/vfio/device.c b/hw/vfio/device.c
>>>> index 08f12ac31f..3eeb71bd51 100644
>>>> --- a/hw/vfio/device.c
>>>> +++ b/hw/vfio/device.c
>>>> @@ -23,6 +23,7 @@
>>>>
>>>>  #include "hw/vfio/vfio-device.h"
>>>>  #include "hw/vfio/pci.h"
>>>> +#include "hw/iommu.h"
>>>>  #include "hw/hw.h"
>>>>  #include "trace.h"
>>>>  #include "qapi/error.h"
>>>> @@ -504,6 +505,17 @@ void vfio_device_unprepare(VFIODevice
>>> *vbasedev)
>>>>      vbasedev->bcontainer = NULL;
>>>>  }
>>>>
>>>> +bool vfio_device_viommu_get_nested(VFIODevice *vbasedev)
>>>> +{
>>>> +    VFIOPCIDevice *vdev = vfio_pci_from_vfio_device(vbasedev);
>>>> +
>>>> +    if (vdev) {
>>>> +        return !!(pci_device_get_viommu_cap(&vdev->pdev) &
>>>> +                  VIOMMU_CAP_HW_NESTED);
>>>> +    }
>>>> +    return false;
>>>> +}
>>>> +
>>>>  /*
>>>>   * Traditional ioctl() based io
>>>>   */
>>>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>>>> index 8c27222f75..e503c232e1 100644
>>>> --- a/hw/vfio/iommufd.c
>>>> +++ b/hw/vfio/iommufd.c
>>>> @@ -379,6 +379,14 @@ static bool
>>> iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
>>>>          flags = IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
>>>>      }
>>>>
>>>> +    /*
>>>> +     * If vIOMMU supports stage-1 translation, force to create nested
>>> parent
>>> I would rather not use another terminology here. You previously used
>>> hw_nested, I think that's better. Also bear in mind that smmu supports
>>> S1, S2 and S1+S2 in emulated code.
>> What about 'nesting parent' to match kernel side terminology, per Nicolin's
>suggestion:
>>
>> In kernel kdoc/uAPI, we use:
>>  - "nesting parent" for stage-2 object
>>  - "nested hwpt", "nested domain" for stage-1 object
>I still think that since you queried the HW_NESTED cap it makes sense to
>continue using it. This can come along with the kernel terminology though.

OK, like below? Do I understand correctly?

+    /*
+     * If vIOMMU supports stage-1 translation, force to create hw_nested
+     * (aka. nesting parent in kernel) domain which could be reused by
+     * vIOMMU to create nested domain.
+     */

Thanks
Zhenzhong

^ permalink raw reply	[flat|nested] 113+ messages in thread

* RE: [PATCH v5 02/21] hw/pci: Introduce pci_device_get_viommu_cap()
  2025-08-28  9:06               ` Duan, Zhenzhong
@ 2025-08-29  1:54                 ` Duan, Zhenzhong
  2025-08-29  3:26                   ` Nicolin Chen
  0 siblings, 1 reply; 113+ messages in thread
From: Duan, Zhenzhong @ 2025-08-29  1:54 UTC (permalink / raw)
  To: Liu, Yi L, Nicolin Chen, Eric Auger
  Cc: qemu-devel@nongnu.org, alex.williamson@redhat.com, clg@redhat.com,
	mst@redhat.com, jasowang@redhat.com, peterx@redhat.com,
	ddutile@redhat.com, jgg@nvidia.com, joao.m.martins@oracle.com,
	clement.mathieu--drif@eviden.com, Tian,  Kevin, Peng, Chao P

Hi Eric,

>-----Original Message-----
>From: Duan, Zhenzhong
>Sent: Thursday, August 28, 2025 5:07 PM
>Subject: RE: [PATCH v5 02/21] hw/pci: Introduce
>pci_device_get_viommu_cap()
>
>
>
>>-----Original Message-----
>>From: Liu, Yi L <yi.l.liu@intel.com>
>>Subject: Re: [PATCH v5 02/21] hw/pci: Introduce
>>pci_device_get_viommu_cap()
>>
>>On 2025/8/27 23:30, Nicolin Chen wrote:
>>> On Wed, Aug 27, 2025 at 02:32:42PM +0200, Eric Auger wrote:
>>>> On 8/27/25 2:30 PM, Yi Liu wrote:
>>>>> On 2025/8/27 19:22, Eric Auger wrote:
>>>>>>> TBH. I'm hesitating to name it as get_viommu_cap. The scope is a
>little
>>>>>>> larger than what we want so far. So I'm wondering if it can be done
>>>>>>> in a
>>>>>>> more straightforward way. e.g. just a bool op named
>>>>>>> iommu_nested_wanted(). Just an example, maybe better naming. We
>>can
>>>>>>> extend the op to be returning a u64 value in the future when we see
>>>>>>> another request on VFIO from vIOMMU.
>>>>>> personally I am fine with the bitmask which looks more future proof.
>>>>>
>>>>> not quite sure if there is another info that needs to be checked in
>>>>> this "VFIO asks vIOMMU" manner. Have you seen one beside this
>>>>> nested hwpt requirement by vIOMMU?
>>>>
>>>> I don't remember any at this point. But I guess with ARM CCA device
>>>> passthrough we might have other needs
>>>
>>> Yea. A Realm vSMMU instance won't allocate IOAS/HWPT. So it will
>>> ask the core to bypass those allocations, via the same op.
>>>
>>> I don't know: does "get_viommu_flags" sound more fitting to have
>>> a clear meaning of "want"?
>>>
>>>    VIOMMU_FLAG_WANT_NESTING_PARENT
>>>    VIOMMU_FLAG_WANT_NO_IOAS
>>>
>>> At least, the 2nd one being a "cap" wouldn't sound nice to me..
>>
>>this looks good to me.
>
>OK, will do s/get_viommu_cap/get_viommu_flags and
>s/VIOMMU_CAP_HW_NESTED/ VIOMMU_FLAG_WANT_NESTING_PARENT if
>no more suggestions.

I just noticed that this change conflicts with your suggestion to keep the HW_NESTED terminology.
Please let me know whether you agree with this change.

Thanks
Zhenzhong

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v5 07/21] intel_iommu: Introduce a new structure VTDHostIOMMUDevice
  2025-08-28  9:17     ` Duan, Zhenzhong
@ 2025-08-29  2:57       ` Yi Liu
  0 siblings, 0 replies; 113+ messages in thread
From: Yi Liu @ 2025-08-29  2:57 UTC (permalink / raw)
  To: Duan, Zhenzhong, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, eric.auger@redhat.com,
	mst@redhat.com, jasowang@redhat.com, peterx@redhat.com,
	ddutile@redhat.com, jgg@nvidia.com, nicolinc@nvidia.com,
	joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
	Tian, Kevin, Peng, Chao P

On 2025/8/28 17:17, Duan, Zhenzhong wrote:
> 
> 
>> -----Original Message-----
>> From: Liu, Yi L <yi.l.liu@intel.com>
>> Subject: Re: [PATCH v5 07/21] intel_iommu: Introduce a new structure
>> VTDHostIOMMUDevice
>>
>> On 2025/8/22 14:40, Zhenzhong Duan wrote:
>>> Introduce a new structure VTDHostIOMMUDevice which replaces
>>> HostIOMMUDevice to be stored in hash table.
>>>
>>> It includes a reference to HostIOMMUDevice and IntelIOMMUState,
>>> also includes BDF information which will be used in future
>>> patches.
>>>
>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>> Reviewed-by: Eric Auger <eric.auger@redhat.com>
>>> ---
>>>    hw/i386/intel_iommu_internal.h |  7 +++++++
>>>    include/hw/i386/intel_iommu.h  |  2 +-
>>>    hw/i386/intel_iommu.c          | 15 +++++++++++++--
>>>    3 files changed, 21 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/hw/i386/intel_iommu_internal.h
>> b/hw/i386/intel_iommu_internal.h
>>> index 360e937989..c7046eb4e2 100644
>>> --- a/hw/i386/intel_iommu_internal.h
>>> +++ b/hw/i386/intel_iommu_internal.h
>>> @@ -28,6 +28,7 @@
>>>    #ifndef HW_I386_INTEL_IOMMU_INTERNAL_H
>>>    #define HW_I386_INTEL_IOMMU_INTERNAL_H
>>>    #include "hw/i386/intel_iommu.h"
>>> +#include "system/host_iommu_device.h"
>>>
>>>    /*
>>>     * Intel IOMMU register specification
>>> @@ -608,4 +609,10 @@ typedef struct VTDRootEntry VTDRootEntry;
>>>    /* Bits to decide the offset for each level */
>>>    #define VTD_LEVEL_BITS           9
>>>
>>> +typedef struct VTDHostIOMMUDevice {
>>> +    IntelIOMMUState *iommu_state;
>>> +    PCIBus *bus;
>>> +    uint8_t devfn;
>>> +    HostIOMMUDevice *hiod;
>>> +} VTDHostIOMMUDevice;
>>>    #endif
>>> diff --git a/include/hw/i386/intel_iommu.h
>> b/include/hw/i386/intel_iommu.h
>>> index e95477e855..50f9b27a45 100644
>>> --- a/include/hw/i386/intel_iommu.h
>>> +++ b/include/hw/i386/intel_iommu.h
>>> @@ -295,7 +295,7 @@ struct IntelIOMMUState {
>>>        /* list of registered notifiers */
>>>        QLIST_HEAD(, VTDAddressSpace) vtd_as_with_notifiers;
>>>
>>> -    GHashTable *vtd_host_iommu_dev;             /*
>> HostIOMMUDevice */
>>> +    GHashTable *vtd_host_iommu_dev;             /*
>> VTDHostIOMMUDevice */
>>>
>>>        /* interrupt remapping */
>>>        bool intr_enabled;              /* Whether guest enabled IR */
>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>>> index e3b871de70..512ca4fdc5 100644
>>> --- a/hw/i386/intel_iommu.c
>>> +++ b/hw/i386/intel_iommu.c
>>> @@ -281,7 +281,10 @@ static gboolean vtd_hiod_equal(gconstpointer v1,
>> gconstpointer v2)
>>>
>>>    static void vtd_hiod_destroy(gpointer v)
>>>    {
>>> -    object_unref(v);
>>> +    VTDHostIOMMUDevice *vtd_hiod = v;
>>> +
>>> +    object_unref(vtd_hiod->hiod);
>>> +    g_free(vtd_hiod);
>>>    }
>>>
>>>    static gboolean vtd_hash_remove_by_domain(gpointer key, gpointer
>> value,
>>> @@ -4371,6 +4374,7 @@ static bool vtd_dev_set_iommu_device(PCIBus
>> *bus, void *opaque, int devfn,
>>>                                         HostIOMMUDevice *hiod,
>> Error **errp)
>>>    {
>>>        IntelIOMMUState *s = opaque;
>>> +    VTDHostIOMMUDevice *vtd_hiod;
>>>        struct vtd_as_key key = {
>>>            .bus = bus,
>>>            .devfn = devfn,
>>> @@ -4387,7 +4391,14 @@ static bool vtd_dev_set_iommu_device(PCIBus
>> *bus, void *opaque, int devfn,
>>>            return false;
>>>        }
>>>
>>> +    vtd_hiod = g_malloc0(sizeof(VTDHostIOMMUDevice));
>>> +    vtd_hiod->bus = bus;
>>> +    vtd_hiod->devfn = (uint8_t)devfn;
>>> +    vtd_hiod->iommu_state = s;
>>> +    vtd_hiod->hiod = hiod;
>>
>> how about moving it after the below if branch? :)
> 
> They will be used in vtd_check_hiod(), so need to initialize them early.

Got it. It's needed by the following patch.
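
For the archive, the ordering constraint can be sketched as follows.
This is a hypothetical reduction, not the QEMU code: struct
sketch_host_dev, struct sketch_wrapper and the helpers are stand-ins, and
the refcount only mimics what object_ref()/object_unref() would do. The
point is that the wrapper is fully initialized before the check runs, and
is freed again if the check fails.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the refcounted host device object. */
struct sketch_host_dev {
    int refcount;
};

/* Hypothetical stand-in for the per-device wrapper. */
struct sketch_wrapper {
    void *iommu_state;
    uint8_t devfn;
    struct sketch_host_dev *hiod;
};

/* The check reads wrapper fields, so they must be set beforehand. */
static bool sketch_check(const struct sketch_wrapper *w)
{
    return w->iommu_state != NULL && w->hiod != NULL;
}

static struct sketch_wrapper *sketch_wrapper_new(void *state, int devfn,
                                                 struct sketch_host_dev *hiod)
{
    struct sketch_wrapper *w = calloc(1, sizeof(*w));

    if (!w) {
        return NULL;
    }

    /* Initialize first: sketch_check() below depends on these fields. */
    w->iommu_state = state;
    w->devfn = (uint8_t)devfn;
    w->hiod = hiod;

    if (!sketch_check(w)) {
        free(w);
        return NULL;
    }

    hiod->refcount++;
    return w;
}

/* Drop the reference on the wrapped object, then free the wrapper. */
static void sketch_wrapper_destroy(struct sketch_wrapper *w)
{
    assert(w != NULL);
    w->hiod->refcount--;
    free(w);
}
```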



^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v5 02/21] hw/pci: Introduce pci_device_get_viommu_cap()
  2025-08-29  1:54                 ` Duan, Zhenzhong
@ 2025-08-29  3:26                   ` Nicolin Chen
  2025-09-01  2:35                     ` Duan, Zhenzhong
  0 siblings, 1 reply; 113+ messages in thread
From: Nicolin Chen @ 2025-08-29  3:26 UTC (permalink / raw)
  To: Duan, Zhenzhong
  Cc: Liu, Yi L, Eric Auger, qemu-devel@nongnu.org,
	alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
	jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
	jgg@nvidia.com, joao.m.martins@oracle.com,
	clement.mathieu--drif@eviden.com, Tian,  Kevin, Peng, Chao P

On Fri, Aug 29, 2025 at 01:54:50AM +0000, Duan, Zhenzhong wrote:
> >>On 2025/8/27 23:30, Nicolin Chen wrote:
> >>> On Wed, Aug 27, 2025 at 02:32:42PM +0200, Eric Auger wrote:
> >>>> On 8/27/25 2:30 PM, Yi Liu wrote:
> >>>>> On 2025/8/27 19:22, Eric Auger wrote:
> >>>>>>> TBH. I'm hesitating to name it as get_viommu_cap. The scope is a
> >little
> >>>>>>> larger than what we want so far. So I'm wondering if it can be done
> >>>>>>> in a
> >>>>>>> more straightforward way. e.g. just a bool op named
> >>>>>>> iommu_nested_wanted(). Just an example, maybe better naming. We
> >>can
> >>>>>>> extend the op to be returning a u64 value in the future when we see
> >>>>>>> another request on VFIO from vIOMMU.
> >>>>>> personally I am fine with the bitmask which looks more future proof.
> >>>>>
> >>>>> not quite sure if there is another info that needs to be checked in
> >>>>> this "VFIO asks vIOMMU" manner. Have you seen one beside this
> >>>>> nested hwpt requirement by vIOMMU?
> >>>>
> >>>> I don't remember any at this point. But I guess with ARM CCA device
> >>>> passthrough we might have other needs
> >>>
> >>> Yea. A Realm vSMMU instance won't allocate IOAS/HWPT. So it will
> >>> ask the core to bypass those allocations, via the same op.
> >>>
> >>> I don't know: does "get_viommu_flags" sound more fitting to have
> >>> a clear meaning of "want"?
> >>>
> >>>    VIOMMU_FLAG_WANT_NESTING_PARENT
> >>>    VIOMMU_FLAG_WANT_NO_IOAS
> >>>
> >>> At least, the 2nd one being a "cap" wouldn't sound nice to me..
> >>
> >>this looks good to me.
> >
> >OK, will do s/get_viommu_cap/get_viommu_flags and
> >s/VIOMMU_CAP_HW_NESTED/ VIOMMU_FLAG_WANT_NESTING_PARENT if
> >no more suggestions.
> 
> I just noticed this change will conflict with your suggestion of using HW_NESTED terminology.
> Let me know if you agree with this change or not?

It wouldn't necessarily conflict. VIOMMU_FLAG_WANT_NESTING_PARENT
is a request, interchangeable with VIOMMU_FLAG_SUPPORT_HW_NESTED,
i.e. a cap.

At the end of the day, they are fundamentally the same thing: both
tell the core to allocate a nesting parent HWPT. The former one is
just more straightforward, avoiding confusing terms such as "stage-1"
and "nested".

IMHO, you wouldn't even need the comments in the other thread, as
the flag explains clearly what it wants and what the core is doing.

Also, once you use the "want" one, the "HW_NESTED" terminology will
not exist in the code.
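
For illustration only, a minimal C sketch of how such a "want" bitmask
could be consumed by the core. The two flag names are the ones proposed
in this thread; the bit positions, the op signature and every demo_*
name are assumptions made for the example, not QEMU code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Flag names as proposed in this thread; the bit positions are assumed. */
#define VIOMMU_FLAG_WANT_NESTING_PARENT (1ULL << 0)
#define VIOMMU_FLAG_WANT_NO_IOAS        (1ULL << 1)

/* Hypothetical shape of a get_viommu_flags op: returns a bitmask of wants. */
typedef uint64_t (*demo_get_viommu_flags_fn)(void *opaque);

/* Demo vIOMMU: wants a nesting parent only when its flts knob is on. */
static uint64_t demo_vtd_get_viommu_flags(void *opaque)
{
    bool flts = *(bool *)opaque;

    return flts ? VIOMMU_FLAG_WANT_NESTING_PARENT : 0;
}

/* Core side: a missing op means "no wants", so default allocation applies. */
static bool demo_core_wants_nesting_parent(demo_get_viommu_flags_fn op,
                                           void *opaque)
{
    return op != NULL && (op(opaque) & VIOMMU_FLAG_WANT_NESTING_PARENT) != 0;
}
```

With this shape, a later request such as VIOMMU_FLAG_WANT_NO_IOAS only
adds a bit and does not change the op signature.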

Nicolin



* Re: [PATCH v5 05/21] vfio/iommufd: Force creating nested parent domain
  2025-08-29  1:40         ` Duan, Zhenzhong
@ 2025-08-29  3:47           ` Nicolin Chen
  0 siblings, 0 replies; 113+ messages in thread
From: Nicolin Chen @ 2025-08-29  3:47 UTC (permalink / raw)
  To: Duan, Zhenzhong
  Cc: eric.auger@redhat.com, qemu-devel@nongnu.org,
	alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
	jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
	jgg@nvidia.com, joao.m.martins@oracle.com,
	clement.mathieu--drif@eviden.com, Tian,  Kevin, Liu, Yi L,
	Peng, Chao P

On Fri, Aug 29, 2025 at 01:40:01AM +0000, Duan, Zhenzhong wrote:
> >-----Original Message-----
> >On 8/28/25 11:53 AM, Duan, Zhenzhong wrote:
> >>>> +    /*
> >>>> +     * If vIOMMU supports stage-1 translation, force to create nested
> >>> parent
> >>> I would rather not use another terminology here. You previously used
> >>> hw_nested, I think that's better. Also bear in mind that smmu supports
> >>> S1, S2 and S1+S2 in emulated code.
> >> What about 'nesting parent' to match kernel side terminology, per Nicolin's
> >suggestion:
> >>
> >> In kernel kdoc/uAPI, we use:
> >>  - "nesting parent" for stage-2 object
> >>  - "nested hwpt", "nested domain" for stage-1 object
> >I still think that since you queried the HW_NESTED cap it makes sense to
> >continue using it. This can come along with the kernel terminology though.
> 
> OK, like below, do I understand right?
> 
> +    /*
> +     * If vIOMMU supports stage-1 translation, force to create hw_nested
> +     * (aka. nesting parent in kernel) domain which could be reused by
> +     * vIOMMU to create nested domain.
> +     */

FWIW, while I was targeting the word "nested parent", I think Eric
was commenting on the word "stage-1 translation".

The vSMMU code supports "stage-1", "stage-2", and even "nested" as
its full emulation modes (no HW acceleration). So, any word like
"stage-1 translation" or "nested S1" can be confusing to the vSMMU
folks, as neither of them necessarily means "HW_NESTED" that stands
for "HW-accelerated nested stage-1".

Also, "HW_NESTED" != "nesting parent". They're two different things.
Thus, "force to create hw_nested" isn't accurate. Here, we want to
create a "nesting parent" HWPT. There is no other alternative name,
IMHO, given this is essentially a kernel-defined object.

Anyway, if we all agree on the VIOMMU_FLAG_WANT_NESTING_PARENT, it
is not necessary to have this comment (at least the first part) --
we could still note that the nesting parent HWPT will be reused by
vIOMMU to create nested HWPTs, if you'd like to.

Thanks
Nicolin



* Re: [PATCH v5 12/21] intel_iommu: Handle PASID entry addition
  2025-08-22  6:40 ` [PATCH v5 12/21] intel_iommu: Handle PASID entry addition Zhenzhong Duan
  2025-08-27 16:22   ` Eric Auger
@ 2025-08-29  5:46   ` Yi Liu
  1 sibling, 0 replies; 113+ messages in thread
From: Yi Liu @ 2025-08-29  5:46 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, joao.m.martins, clement.mathieu--drif, kevin.tian,
	chao.p.peng, Yi Sun

On 2025/8/22 14:40, Zhenzhong Duan wrote:
> When guest creates new PASID entries, QEMU will capture the guest pasid
> selective pasid cache invalidation, walk through each passthrough device
> and each pasid, when a match is found, identify an existing vtd_as or
> create a new one and update its corresponding cached pasid entry.
> 
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>   hw/i386/intel_iommu_internal.h |   2 +
>   hw/i386/intel_iommu.c          | 176 ++++++++++++++++++++++++++++++++-
>   2 files changed, 175 insertions(+), 3 deletions(-)
> 
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> index b9b76dd996..fb2a919e87 100644
> --- a/hw/i386/intel_iommu_internal.h
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -559,6 +559,7 @@ typedef struct VTDRootEntry VTDRootEntry;
>   #define VTD_CTX_ENTRY_LEGACY_SIZE     16
>   #define VTD_CTX_ENTRY_SCALABLE_SIZE   32
>   
> +#define VTD_SM_CONTEXT_ENTRY_PDTS(x)        extract64((x)->val[0], 9, 3)
>   #define VTD_SM_CONTEXT_ENTRY_RID2PASID_MASK 0xfffff
>   #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw)  (0x1e0ULL | ~VTD_HAW_MASK(aw))
>   #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1      0xffffffffffe00000ULL
> @@ -589,6 +590,7 @@ typedef struct VTDPASIDCacheInfo {
>   #define VTD_PASID_TABLE_BITS_MASK     (0x3fULL)
>   #define VTD_PASID_TABLE_INDEX(pasid)  ((pasid) & VTD_PASID_TABLE_BITS_MASK)
>   #define VTD_PASID_ENTRY_FPD           (1ULL << 1) /* Fault Processing Disable */
> +#define VTD_PASID_TBL_ENTRY_NUM       (1ULL << 6)
>   
>   /* PASID Granular Translation Type Mask */
>   #define VTD_PASID_ENTRY_P              1ULL
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index a2ee6d684e..7d2c9feae7 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -826,6 +826,11 @@ static inline bool vtd_pe_type_check(IntelIOMMUState *s, VTDPASIDEntry *pe)
>       }
>   }
>   
> +static inline uint32_t vtd_sm_ce_get_pdt_entry_num(VTDContextEntry *ce)
> +{
> +    return 1U << (VTD_SM_CONTEXT_ENTRY_PDTS(ce) + 7);
> +}
> +
>   static inline bool vtd_pdire_present(VTDPASIDDirEntry *pdire)
>   {
>       return pdire->val & 1;
> @@ -1647,9 +1652,9 @@ static gboolean vtd_find_as_by_sid_and_iommu_pasid(gpointer key, gpointer value,
>   }
>   
>   /* Translate iommu pasid to vtd_as */
> -static inline
> -VTDAddressSpace *vtd_as_from_iommu_pasid_locked(IntelIOMMUState *s,
> -                                                uint16_t sid, uint32_t pasid)
> +static VTDAddressSpace *vtd_as_from_iommu_pasid_locked(IntelIOMMUState *s,
> +                                                       uint16_t sid,
> +                                                       uint32_t pasid)
>   {
>       struct vtd_as_raw_key key = {
>           .sid = sid,

This hunk can be merged into the patch that introduces
vtd_as_from_iommu_pasid_locked().

> @@ -3220,10 +3225,172 @@ remove:
>       return true;
>   }
>   
> +/*
> + * This function walks over PASID range within [start, end) in a single
> + * PASID table for entries matching @info type/did, then retrieve/create
> + * vtd_as and fill associated pasid entry cache.
> + */
> +static void vtd_sm_pasid_table_walk_one(IntelIOMMUState *s,
> +                                        dma_addr_t pt_base,
> +                                        int start,
> +                                        int end,
> +                                        VTDPASIDCacheInfo *info)
> +{
> +    VTDPASIDEntry pe;
> +    int pasid = start;
> +
> +    while (pasid < end) {
> +        if (!vtd_get_pe_in_pasid_leaf_table(s, pasid, pt_base, &pe)
> +            && vtd_pe_present(&pe)) {
> +            int bus_n = pci_bus_num(info->bus), devfn = info->devfn;
> +            uint16_t sid = PCI_BUILD_BDF(bus_n, devfn);
> +            VTDPASIDCacheEntry *pc_entry;
> +            VTDAddressSpace *vtd_as;
> +
> +            vtd_iommu_lock(s);
> +            /*
> +             * When indexed by rid2pasid, vtd_as should have been created,
> +             * e.g., by PCI subsystem. For other iommu pasid, we need to
> +             * create vtd_as dynamically. Other iommu pasid is same value
> +             * as PCI's pasid, so it's used as input of vtd_find_add_as().
> +             */
> +            vtd_as = vtd_as_from_iommu_pasid_locked(s, sid, pasid);
> +            vtd_iommu_unlock(s);
> +            if (!vtd_as) {
> +                vtd_as = vtd_find_add_as(s, info->bus, devfn, pasid);
> +            }
> +
> +            if ((info->type == VTD_PASID_CACHE_DOMSI ||
> +                 info->type == VTD_PASID_CACHE_PASIDSI) &&
> +                (info->did != VTD_SM_PASID_ENTRY_DID(&pe))) {
> +                /*
> +                 * VTD_PASID_CACHE_DOMSI and VTD_PASID_CACHE_PASIDSI
> +                 * requires domain id check. If domain id check fail,
> +                 * go to next pasid.
> +                 */
> +                pasid++;
> +                continue;
> +            }
> +
> +            pc_entry = &vtd_as->pasid_cache_entry;
> +            /*
> +             * pasid cache update and clear are handled in
> +             * vtd_flush_pasid_locked(), only care new pasid entry here.
> +             */
> +            if (!pc_entry->valid) {
> +                pc_entry->pasid_entry = pe;
> +                pc_entry->valid = true;
> +            }
> +        }
> +        pasid++;
> +    }
> +}
> +
> +/*
> + * In VT-d scalable mode translation, PASID dir + PASID table is used.

Remove the word "translation".

> + * This function aims at looping over a range of PASIDs in the given
> + * two level table to identify the pasid config in guest.
> + */
> +static void vtd_sm_pasid_table_walk(IntelIOMMUState *s,
> +                                    dma_addr_t pdt_base,
> +                                    int start, int end,
> +                                    VTDPASIDCacheInfo *info)
> +{
> +    VTDPASIDDirEntry pdire;
> +    int pasid = start;
> +    int pasid_next;
> +    dma_addr_t pt_base;
> +
> +    while (pasid < end) {
> +        pasid_next =
> +             (pasid + VTD_PASID_TBL_ENTRY_NUM) & ~(VTD_PASID_TBL_ENTRY_NUM - 1);
> +        pasid_next = pasid_next < end ? pasid_next : end;
> +
> +        if (!vtd_get_pdire_from_pdir_table(pdt_base, pasid, &pdire)
> +            && vtd_pdire_present(&pdire)) {
> +            pt_base = pdire.val & VTD_PASID_TABLE_BASE_ADDR_MASK;
> +            vtd_sm_pasid_table_walk_one(s, pt_base, pasid, pasid_next, info);
> +        }
> +        pasid = pasid_next;
> +    }
> +}
> +
> +static void vtd_replay_pasid_bind_for_dev(IntelIOMMUState *s,
> +                                          int start, int end,
> +                                          VTDPASIDCacheInfo *info)
> +{
> +    VTDContextEntry ce;
> +
> +    if (!vtd_dev_to_context_entry(s, pci_bus_num(info->bus), info->devfn,
> +                                  &ce)) {
> +        uint32_t max_pasid;
> +
> +        max_pasid = vtd_sm_ce_get_pdt_entry_num(&ce) * VTD_PASID_TBL_ENTRY_NUM;
> +        if (end > max_pasid) {
> +            end = max_pasid;
> +        }
> +        vtd_sm_pasid_table_walk(s,
> +                                VTD_CE_GET_PASID_DIR_TABLE(&ce),
> +                                start,
> +                                end,
> +                                info);
> +    }
> +}
> +
> +/*
> + * This function replays the guest pasid bindings by walking the two level
> + * guest PASID table. For each valid pasid entry, it finds or creates a
> + * vtd_as and caches pasid entry in vtd_as.
> + */
> +static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
> +                                            VTDPASIDCacheInfo *pc_info)
> +{
> +    /*
> +     * Currently only Requests-without-PASID is supported, as vIOMMU doesn't
> +     * support RPS(RID-PASID Support), pasid scope is fixed to [0, 1).

RPS is not the reason for supporting only requests-without-PASID. RPS=0
means rid_pasid is fixed to 0, while with RPS=1 rid_pasid can take
other values. That is a separate matter from requests-with-PASID. If
the aim is only to support requests-without-PASID with RPS=0, we may
want to reorganize the code here: there is no need to pretend to loop
over a range of PASIDs, just use PASID 0 since we know rid_pasid is 0.

To me, this series should be able to support non-rid_pasid PASIDs.
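
As a reference for the range arithmetic in the quoted walk, here is a
small standalone C sketch. The constants mirror the quoted macros
(64 entries per PASID table, 1 << (PDTS + 7) directory entries); all
demo_* names are hypothetical and the code does no guest memory access.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors VTD_PASID_TBL_ENTRY_NUM (1 << 6) from the quoted patch. */
#define DEMO_PASID_TBL_ENTRY_NUM 64u

/* PASID directory entries for a 3-bit PDTS value, as in the quoted helper. */
static uint32_t demo_pdt_entry_num(uint32_t pdts)
{
    return 1u << (pdts + 7);
}

/* Upper bound of the PASID space described by a context entry. */
static uint32_t demo_max_pasid(uint32_t pdts)
{
    return demo_pdt_entry_num(pdts) * DEMO_PASID_TBL_ENTRY_NUM;
}

/* First PASID of the next leaf table, clamped to the end of the range. */
static uint32_t demo_next_table_boundary(uint32_t pasid, uint32_t end)
{
    uint32_t next = (pasid + DEMO_PASID_TBL_ENTRY_NUM) &
                    ~(DEMO_PASID_TBL_ENTRY_NUM - 1);

    return next < end ? next : end;
}

/* Number of leaf tables a walk over [start, end) would visit. */
static uint32_t demo_count_leaf_tables(uint32_t start, uint32_t end)
{
    uint32_t n = 0;

    for (uint32_t p = start; p < end; p = demo_next_table_boundary(p, end)) {
        n++;
    }
    return n;
}
```

With the fixed scope [0, 1) the walk visits exactly one leaf table and
one entry, which is why a direct lookup of PASID 0 would be equivalent.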

> +     */
> +    int start = 0, end = 1;
> +    VTDHostIOMMUDevice *vtd_hiod;
> +    VTDPASIDCacheInfo walk_info;
> +    GHashTableIter as_it;
> +
> +    switch (pc_info->type) {
> +    case VTD_PASID_CACHE_PASIDSI:
> +        start = pc_info->pasid;
> +        end = pc_info->pasid + 1;
> +       /* fall through */
> +    case VTD_PASID_CACHE_DOMSI:
> +    case VTD_PASID_CACHE_GLOBAL_INV:
> +        /* loop all assigned devices */
> +        break;
> +    default:
> +        error_setg(&error_fatal, "invalid pc_info->type for replay");
> +    }
> +
> +    /*
> +     * In this replay, one only needs to care about the devices which are
> +     * backed by host IOMMU. Those devices have a corresponding vtd_hiod
> +     * in s->vtd_host_iommu_dev. For devices not backed by host IOMMU, it
> +     * is not necessary to replay the bindings since their cache could be
> +     * re-created in the future DMA address translation.
> +     *
> +     * VTD translation callback never accesses vtd_hiod and its corresponding
> +     * cached pasid entry, so no iommu lock needed here.
> +     */
> +    walk_info = *pc_info;
> +    g_hash_table_iter_init(&as_it, s->vtd_host_iommu_dev);
> +    while (g_hash_table_iter_next(&as_it, NULL, (void **)&vtd_hiod)) {
> +        walk_info.bus = vtd_hiod->bus;
> +        walk_info.devfn = vtd_hiod->devfn;
> +        vtd_replay_pasid_bind_for_dev(s, start, end, &walk_info);
> +    }
> +}
> +
>   /*
>    * For a PASID cache invalidation, this function handles below scenarios:
>    * a) a present cached pasid entry needs to be removed
>    * b) a present cached pasid entry needs to be updated
> + * c) a present cached pasid entry needs to be created
>    */
>   static void vtd_pasid_cache_sync(IntelIOMMUState *s, VTDPASIDCacheInfo *pc_info)
>   {
> @@ -3239,6 +3406,9 @@ static void vtd_pasid_cache_sync(IntelIOMMUState *s, VTDPASIDCacheInfo *pc_info)
>       g_hash_table_foreach_remove(s->vtd_address_spaces, vtd_flush_pasid_locked,
>                                   pc_info);
>       vtd_iommu_unlock(s);
> +
> +    /* c): loop all passthrough device for new pasid entries */
> +    vtd_replay_guest_pasid_bindings(s, pc_info);
>   }
>   
>   static bool vtd_process_pasid_desc(IntelIOMMUState *s,



* Re: [PATCH v5 13/21] intel_iommu: Introduce a new pasid cache invalidation type FORCE_RESET
  2025-08-27 16:28   ` Eric Auger
@ 2025-08-29  5:56     ` Yi Liu
  2025-09-01  9:04     ` Duan, Zhenzhong
  1 sibling, 0 replies; 113+ messages in thread
From: Yi Liu @ 2025-08-29  5:56 UTC (permalink / raw)
  To: eric.auger, Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, mst, jasowang, peterx, ddutile, jgg,
	nicolinc, joao.m.martins, clement.mathieu--drif, kevin.tian,
	chao.p.peng, Yi Sun

On 2025/8/28 00:28, Eric Auger wrote:
> Hi Zhenzhong,
> 
> On 8/22/25 8:40 AM, Zhenzhong Duan wrote:
>> FORCE_RESET is different from GLOBAL_INV which updates pasid cache if
>> underlying pasid entry is still valid, it drops all the pasid caches.
>>
>> FORCE_RESET isn't a VTD spec defined invalidation type for pasid cache,
>> only used internally in system level reset.
>>
>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>>   hw/i386/intel_iommu_internal.h |  9 +++++++++
>>   hw/i386/intel_iommu.c          | 25 +++++++++++++++++++++++++
>>   hw/i386/trace-events           |  1 +
>>   3 files changed, 35 insertions(+)
>>
>> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
>> index fb2a919e87..c510b09d1a 100644
>> --- a/hw/i386/intel_iommu_internal.h
>> +++ b/hw/i386/intel_iommu_internal.h
>> @@ -569,6 +569,15 @@ typedef enum VTDPCInvType {
>>       VTD_PASID_CACHE_DOMSI = VTD_INV_DESC_PASIDC_G_DSI,
>>       VTD_PASID_CACHE_PASIDSI = VTD_INV_DESC_PASIDC_G_PASID_SI,
>>       VTD_PASID_CACHE_GLOBAL_INV = VTD_INV_DESC_PASIDC_G_GLOBAL,
>> +
>> +    /*
>> +     * Internally used PASID cache invalidation type starts here,
>> +     * 0x10 is large enough as invalidation type in pc_inv_desc
>> +     * is 2bits in size.
>> +     */
>> +
>> +    /* Reset all PASID cache entries, used in system level reset */
>> +    VTD_PASID_CACHE_FORCE_RESET = 0x10,
> I am not very keen on adding such an artifical enum value that does not
> exist in the spec.
> 
> Why not simply introduce another function (instead of
> vtd_flush_pasid_locked) that does the cleanup. To me it would be
> cleaner. Thanks Eric

This makes sense. Just wrap the code after the "remove:" label into a helper.
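
A rough C sketch of that direction, assuming a simplified cache entry:
the per-entry cleanup lives in one helper, and system reset calls a
dedicated function instead of passing an internal-only invalidation
type. The struct and all demo_* names are hypothetical stand-ins, not
the real VTDPASIDCacheEntry code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for a cached PASID entry. */
typedef struct DemoPasidCacheEntry {
    bool valid;
    unsigned did;
} DemoPasidCacheEntry;

/* The cleanup that would otherwise sit behind a goto label. */
static void demo_pasid_cache_entry_remove(DemoPasidCacheEntry *pc)
{
    pc->valid = false;
    pc->did = 0;
}

/* Reset path: drop every valid entry without any invalidation type. */
static size_t demo_pasid_cache_reset_all(DemoPasidCacheEntry *entries,
                                         size_t n)
{
    size_t dropped = 0;

    for (size_t i = 0; i < n; i++) {
        if (entries[i].valid) {
            demo_pasid_cache_entry_remove(&entries[i]);
            dropped++;
        }
    }
    return dropped;
}
```

The invalidation path can call the same per-entry helper, so the enum
keeps only the spec-defined values.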



* Re: [PATCH v5 14/21] intel_iommu: Stick to system MR for IOMMUFD backed host device when x-fls=on
  2025-08-22  6:40 ` [PATCH v5 14/21] intel_iommu: Stick to system MR for IOMMUFD backed host device when x-fls=on Zhenzhong Duan
  2025-08-27 17:14   ` Eric Auger
@ 2025-08-29  6:06   ` Yi Liu
  1 sibling, 0 replies; 113+ messages in thread
From: Yi Liu @ 2025-08-29  6:06 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, joao.m.martins, clement.mathieu--drif, kevin.tian,
	chao.p.peng

On 2025/8/22 14:40, Zhenzhong Duan wrote:
> When guest in scalable mode and x-flts=on, we stick to system MR for IOMMUFD
> backed host device. Then its default hwpt contains GPA->HPA mappings which is
> used directly if PGTT=PT and used as nested parent if PGTT=FLT. Otherwise
> fallback to original processing.
> 
> Suggested-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>   hw/i386/intel_iommu.c | 34 ++++++++++++++++++++++++++++++++++
>   1 file changed, 34 insertions(+)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index af384ce7f0..15582977b8 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -1773,6 +1773,28 @@ static bool vtd_dev_pt_enabled(IntelIOMMUState *s, VTDContextEntry *ce,
>   
>   }
>   
> +static VTDHostIOMMUDevice *vtd_find_hiod_iommufd(IntelIOMMUState *s,
> +                                                 VTDAddressSpace *as)
> +{
> +    struct vtd_as_key key = {
> +        .bus = as->bus,
> +        .devfn = as->devfn,
> +    };
> +    VTDHostIOMMUDevice *vtd_hiod = g_hash_table_lookup(s->vtd_host_iommu_dev,
> +                                                       &key);
> +
> +    if (vtd_hiod && vtd_hiod->hiod &&
> +        object_dynamic_cast(OBJECT(vtd_hiod->hiod),
> +                            TYPE_HOST_IOMMU_DEVICE_IOMMUFD)) {
> +        return vtd_hiod;
> +    }
> +    return NULL;
> +}
> +
> +/*
> + * vtd_switch_address_space() calls vtd_as_pt_enabled() to determine which
> + * MR to switch to. Switch to system MR if return true, iommu MR otherwise.
> + */
>   static bool vtd_as_pt_enabled(VTDAddressSpace *as)
>   {
>       IntelIOMMUState *s;
> @@ -1781,6 +1803,18 @@ static bool vtd_as_pt_enabled(VTDAddressSpace *as)
>       assert(as);
>   
>       s = as->iommu_state;
> +
> +    /*
> +     * When guest in scalable mode and x-flts=on, we stick to system MR
> +     * for IOMMUFD backed host device. Then its default hwpt contains
> +     * GPA->HPA mappings which is used directly if PGTT=PT and used as
> +     * nested parent if PGTT=FLT. Otherwise fallback to original
> +     * processing.
> +     */
> +    if (s->root_scalable && s->flts && vtd_find_hiod_iommufd(s, as)) {
> +        return true;
> +    }
> +

I think you should add this logic in vtd_switch_address_space()
instead, as the return value of this helper is meant to reflect whether
the guest has enabled PT. Changing it here may break logic on the
caller side.
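
A minimal C sketch of that split, using a reduced model of the state
involved. The DemoAS fields and demo_* functions are assumptions made
for illustration; they only stand in for the s->root_scalable, s->flts
and vtd_find_hiod_iommufd() checks quoted above.

```c
#include <assert.h>
#include <stdbool.h>

/* Reduced model of the inputs to the decision. */
typedef struct DemoAS {
    bool root_scalable;   /* guest uses scalable mode */
    bool flts;            /* x-flts=on */
    bool iommufd_backed;  /* device has an IOMMUFD-backed host IOMMU device */
    bool guest_pt;        /* guest programmed pass-through for this device */
} DemoAS;

/* Stays a pure query: did the guest enable pass-through? */
static bool demo_as_pt_enabled(const DemoAS *as)
{
    return as->guest_pt;
}

/* The MR choice is made by the switch logic, where the override belongs. */
static bool demo_use_system_mr(const DemoAS *as)
{
    if (as->root_scalable && as->flts && as->iommufd_backed) {
        return true;
    }
    return demo_as_pt_enabled(as);
}
```

Other callers of the query then keep seeing the guest's real PT state.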

Regards,
Yi Liu

>       if (vtd_dev_to_context_entry(s, pci_bus_num(as->bus), as->devfn,
>                                    &ce)) {
>           /*



* Re: [PATCH v5 15/21] intel_iommu: Bind/unbind guest page table to host
  2025-08-22  6:40 ` [PATCH v5 15/21] intel_iommu: Bind/unbind guest page table to host Zhenzhong Duan
  2025-08-28  8:37   ` Eric Auger
@ 2025-08-29  7:05   ` Yi Liu
  1 sibling, 0 replies; 113+ messages in thread
From: Yi Liu @ 2025-08-29  7:05 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, joao.m.martins, clement.mathieu--drif, kevin.tian,
	chao.p.peng, Yi Sun

On 2025/8/22 14:40, Zhenzhong Duan wrote:
> This captures the guest PASID table entry modifications and
> propagates the changes to host to attach a hwpt with type determined
> per guest IOMMU mode and PGTT configuration.
> 
> When PGTT is Pass-through(100b), the hwpt on host side is a stage-2
> page table(GPA->HPA). When PGTT is First-stage Translation only(001b),
> vIOMMU reuse hwpt(GPA->HPA) provided by VFIO as nested parent to
> construct nested page table.

Maybe more straightforward: if PGTT is xxxx, attach the PASID to a xxx
hwpt, e.g. if PGTT==PT, attach to the default hwpt.
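
Spelled out as a hedged C sketch: the PGTT encodings (001b for
first-stage only, 100b for pass-through) come from the commit message
above, while demo_pick_hwpt() and its parameters are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* PGTT encodings quoted in the commit message: FLT = 001b, PT = 100b. */
enum {
    DEMO_PGTT_FLT = 0x1,
    DEMO_PGTT_PT  = 0x4,
};

/*
 * Hypothetical helper: choose the hwpt a PASID gets attached to.
 * Returns 0 and stores the id in *out, or -EINVAL for an unsupported PGTT.
 */
static int demo_pick_hwpt(unsigned pgtt, uint32_t default_hwpt,
                          uint32_t nested_hwpt, uint32_t *out)
{
    switch (pgtt) {
    case DEMO_PGTT_PT:
        *out = default_hwpt;   /* stage-2 only, GPA->HPA */
        return 0;
    case DEMO_PGTT_FLT:
        *out = nested_hwpt;    /* stage-1 hwpt on top of the nesting parent */
        return 0;
    default:
        return -EINVAL;
    }
}
```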

> When guest decides to use legacy mode then vIOMMU switches the MRs of
> the device's AS, hence the IOAS created by VFIO container would be
> switched to using the IOMMU_NOTIFIER_IOTLB_EVENTS since the MR is
> switched to IOMMU MR. So it is able to support shadowing the guest IO
> page table.

This is not quite related to this patch, as PASID bind/unbind only
happens when the guest is operating in scalable mode. I think you may
drop it and consider adding it to another, more related patch.

> Co-Authored-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>   hw/i386/intel_iommu_internal.h |  14 ++-
>   include/hw/i386/intel_iommu.h  |   1 +
>   hw/i386/intel_iommu.c          | 221 ++++++++++++++++++++++++++++++++-
>   hw/i386/trace-events           |   3 +
>   4 files changed, 233 insertions(+), 6 deletions(-)
> 
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> index c510b09d1a..61e35dbdc0 100644
> --- a/hw/i386/intel_iommu_internal.h
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -564,6 +564,12 @@ typedef struct VTDRootEntry VTDRootEntry;
>   #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw)  (0x1e0ULL | ~VTD_HAW_MASK(aw))
>   #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1      0xffffffffffe00000ULL
>   
> +typedef enum VTDPASIDOp {
> +    VTD_PASID_BIND,
> +    VTD_PASID_UPDATE,
> +    VTD_PASID_UNBIND,
> +} VTDPASIDOp;
> +
>   typedef enum VTDPCInvType {
>       /* VTD spec defined PASID cache invalidation type */
>       VTD_PASID_CACHE_DOMSI = VTD_INV_DESC_PASIDC_G_DSI,
> @@ -612,8 +618,12 @@ typedef struct VTDPASIDCacheInfo {
>   #define VTD_SM_PASID_ENTRY_AW          7ULL /* Adjusted guest-address-width */
>   #define VTD_SM_PASID_ENTRY_DID(x)      extract64((x)->val[1], 0, 16)
>   
> -#define VTD_SM_PASID_ENTRY_FLPM          3ULL
> -#define VTD_SM_PASID_ENTRY_FLPTPTR       (~0xfffULL)
> +#define VTD_SM_PASID_ENTRY_FSPTPTR       (~0xfffULL)
> +#define VTD_SM_PASID_ENTRY_SRE_BIT(x)    extract64((x)->val[2], 0, 1)
> +/* 00: 4-level paging, 01: 5-level paging, 10-11: Reserved */
> +#define VTD_SM_PASID_ENTRY_FSPM(x)       extract64((x)->val[2], 2, 2)
> +#define VTD_SM_PASID_ENTRY_WPE_BIT(x)    extract64((x)->val[2], 4, 1)
> +#define VTD_SM_PASID_ENTRY_EAFE_BIT(x)   extract64((x)->val[2], 7, 1)
>   
>   /* First Level Paging Structure */
>   /* Masks for First Level Paging Entry */
> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> index 0e3826f6f0..2affab36b2 100644
> --- a/include/hw/i386/intel_iommu.h
> +++ b/include/hw/i386/intel_iommu.h
> @@ -104,6 +104,7 @@ struct VTDAddressSpace {
>       PCIBus *bus;
>       uint8_t devfn;
>       uint32_t pasid;
> +    uint32_t s1_hwpt;
>       AddressSpace as;
>       IOMMUMemoryRegion iommu;
>       MemoryRegion root;          /* The root container of the device */
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 15582977b8..a10ee8eb4f 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -20,6 +20,7 @@
>    */
>   
>   #include "qemu/osdep.h"
> +#include CONFIG_DEVICES /* CONFIG_IOMMUFD */
>   #include "qemu/error-report.h"
>   #include "qemu/main-loop.h"
>   #include "qapi/error.h"
> @@ -41,6 +42,9 @@
>   #include "migration/vmstate.h"
>   #include "trace.h"
>   #include "system/iommufd.h"
> +#ifdef CONFIG_IOMMUFD
> +#include <linux/iommufd.h>
> +#endif
>   
>   /* context entry operations */
>   #define VTD_CE_GET_RID2PASID(ce) \
> @@ -50,10 +54,9 @@
>   
>   /* pe operations */
>   #define VTD_PE_GET_TYPE(pe) ((pe)->val[0] & VTD_SM_PASID_ENTRY_PGTT)
> -#define VTD_PE_GET_FL_LEVEL(pe) \
> -    (4 + (((pe)->val[2] >> 2) & VTD_SM_PASID_ENTRY_FLPM))
>   #define VTD_PE_GET_SL_LEVEL(pe) \
>       (2 + (((pe)->val[0] >> 2) & VTD_SM_PASID_ENTRY_AW))
> +#define VTD_PE_GET_FL_LEVEL(pe) (VTD_SM_PASID_ENTRY_FSPM(pe) + 4)
>   
>   /*
>    * PCI bus number (or SID) is not reliable since the device is usaully
> @@ -834,6 +837,31 @@ static inline uint32_t vtd_sm_ce_get_pdt_entry_num(VTDContextEntry *ce)
>       return 1U << (VTD_SM_CONTEXT_ENTRY_PDTS(ce) + 7);
>   }
>   
> +static inline dma_addr_t vtd_pe_get_flpt_base(VTDPASIDEntry *pe)
> +{
> +    return pe->val[2] & VTD_SM_PASID_ENTRY_FSPTPTR;
> +}
> +
> +/*
> + * Stage-1 IOVA address width: 48 bits for 4-level paging(FSPM=00)
> + *                             57 bits for 5-level paging(FSPM=01)
> + */
> +static inline uint32_t vtd_pe_get_fl_aw(VTDPASIDEntry *pe)
> +{
> +    return 48 + VTD_SM_PASID_ENTRY_FSPM(pe) * 9;
> +}
> +
> +static inline bool vtd_pe_pgtt_is_pt(VTDPASIDEntry *pe)
> +{
> +    return (VTD_PE_GET_TYPE(pe) == VTD_SM_PASID_ENTRY_PT);
> +}

I think the existing PT-related code can use this helper as well. A
separate patch that adds this helper and replaces the existing
open-coded checks would be helpful.

> +/* check if pgtt is first stage translation */
> +static inline bool vtd_pe_pgtt_is_flt(VTDPASIDEntry *pe)
> +{
> +    return (VTD_PE_GET_TYPE(pe) == VTD_SM_PASID_ENTRY_FLT);
> +}
> +
>   static inline bool vtd_pdire_present(VTDPASIDDirEntry *pdire)
>   {
>       return pdire->val & 1;
> @@ -1131,7 +1159,7 @@ static dma_addr_t vtd_get_iova_pgtbl_base(IntelIOMMUState *s,
>       if (s->root_scalable) {
>           vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
>           if (s->flts) {
> -            return pe.val[2] & VTD_SM_PASID_ENTRY_FLPTPTR;
> +            return pe.val[2] & VTD_SM_PASID_ENTRY_FSPTPTR;

Such renaming had better be a separate patch. If it's not too heavy to
afford, we may consider consolidating the "fs" and "fl" terms into "fs". :)

>           } else {
>               return pe.val[0] & VTD_SM_PASID_ENTRY_SLPTPTR;
>           }
> @@ -1766,7 +1794,7 @@ static bool vtd_dev_pt_enabled(IntelIOMMUState *s, VTDContextEntry *ce,
>                */
>               return false;
>           }
> -        return (VTD_PE_GET_TYPE(&pe) == VTD_SM_PASID_ENTRY_PT);
> +        return vtd_pe_pgtt_is_pt(&pe);
>       }
>   
>       return (vtd_ce_get_type(ce) == VTD_CONTEXT_TT_PASS_THROUGH);
> @@ -2433,6 +2461,178 @@ static void vtd_context_global_invalidate(IntelIOMMUState *s)
>       vtd_iommu_replay_all(s);
>   }
>   
> +#ifdef CONFIG_IOMMUFD
> +static void vtd_init_s1_hwpt_data(struct iommu_hwpt_vtd_s1 *vtd,
> +                                  VTDPASIDEntry *pe)
> +{
> +    memset(vtd, 0, sizeof(*vtd));
> +
> +    vtd->flags =  (VTD_SM_PASID_ENTRY_SRE_BIT(pe) ? IOMMU_VTD_S1_SRE : 0) |
> +                  (VTD_SM_PASID_ENTRY_WPE_BIT(pe) ? IOMMU_VTD_S1_WPE : 0) |
> +                  (VTD_SM_PASID_ENTRY_EAFE_BIT(pe) ? IOMMU_VTD_S1_EAFE : 0);
> +    vtd->addr_width = vtd_pe_get_fl_aw(pe);
> +    vtd->pgtbl_addr = (uint64_t)vtd_pe_get_flpt_base(pe);
> +}
> +
> +static int vtd_create_s1_hwpt(HostIOMMUDeviceIOMMUFD *idev,
> +                              VTDPASIDEntry *pe, uint32_t *s1_hwpt,
> +                              Error **errp)
> +{
> +    struct iommu_hwpt_vtd_s1 vtd;
> +
> +    vtd_init_s1_hwpt_data(&vtd, pe);
> +
> +    return !iommufd_backend_alloc_hwpt(idev->iommufd, idev->devid,
> +                                       idev->hwpt_id, 0, IOMMU_HWPT_DATA_VTD_S1,
> +                                       sizeof(vtd), &vtd, s1_hwpt, errp);
> +}
> +
> +static void vtd_destroy_s1_hwpt(HostIOMMUDeviceIOMMUFD *idev,
> +                                VTDAddressSpace *vtd_as)
> +{
> +    if (!vtd_as->s1_hwpt) {
> +        return;
> +    }
> +    iommufd_backend_free_id(idev->iommufd, vtd_as->s1_hwpt);
> +    vtd_as->s1_hwpt = 0;
> +}
> +
> +static int vtd_device_attach_iommufd(VTDHostIOMMUDevice *vtd_hiod,
> +                                     VTDAddressSpace *vtd_as, Error **errp)
> +{
> +    HostIOMMUDeviceIOMMUFD *idev = HOST_IOMMU_DEVICE_IOMMUFD(vtd_hiod->hiod);

Should it check whether idev is valid, as vtd_hiod->hiod may be of a
type other than TYPE_HOST_IOMMU_DEVICE_IOMMUFD?

> +    VTDPASIDEntry *pe = &vtd_as->pasid_cache_entry.pasid_entry;
> +    uint32_t hwpt_id;
> +    int ret;
> +
> +    /*
> +     * We can get here only if flts=on, the supported PGTT is FLT and PT.
> +     * Catch invalid PGTT when processing invalidation request to avoid
> +     * attaching to wrong hwpt.

I think it is necessary to check x-flts=on in vtd_process_pasid_desc()
to guarantee the above comment. The existing vIOMMU has reported scalable
mode to the guest without x-flts. For that configuration, vIOMMU just
skips the PASID cache flush, as that configuration depends on shadowing
the guest I/O page table to the host. Now we are dealing with PASID cache
invalidation because we need to bind the guest I/O page table to the host
if the guest uses FS translation.

> +     */
> +    if (!vtd_pe_pgtt_is_flt(pe) && !vtd_pe_pgtt_is_pt(pe)) {
> +        error_setg(errp, "Invalid PGTT type");
> +        return -EINVAL;
> +    }
> +
> +    if (vtd_pe_pgtt_is_flt(pe)) {
> +        /* Should fail if the FLPT base is 0 */
> +        if (!vtd_pe_get_flpt_base(pe)) {

Aha, I cannot recall why the 0 check is special and was added here. If
we want to keep this check, I think flpt_base should also be checked to
lie within the address width (AW) of the s2_hwpt to make the check
complete. :)
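
A small C sketch of such a combined check, assuming the stage-2 address
width is available as a bit count. demo_flpt_base_valid() and its s2_aw
parameter are hypothetical; how the real code would obtain the width of
the s2_hwpt is not shown in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical validity check for a first-stage page table base (a GPA):
 * it must be non-zero and must not have bits set at or above s2_aw.
 */
static bool demo_flpt_base_valid(uint64_t base, unsigned s2_aw)
{
    if (base == 0) {
        return false;
    }
    if (s2_aw >= 64) {
        return true;   /* every 64-bit address fits; avoids an invalid shift */
    }
    return (base >> s2_aw) == 0;
}
```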

> +            error_setg(errp, "FLPT base is 0");
> +            return -EINVAL;
> +        }
> +
> +        if (vtd_create_s1_hwpt(idev, pe, &hwpt_id, errp)) {
> +            return -EINVAL;
> +        }
> +    } else {
> +        hwpt_id = idev->hwpt_id;
> +    }
> +
> +    ret = !host_iommu_device_iommufd_attach_hwpt(idev, hwpt_id, errp);
> +    trace_vtd_device_attach_hwpt(idev->devid, vtd_as->pasid, hwpt_id, ret);
> +    if (!ret) {

The above three lines contain two "!" negations. Consider simplifying.

> +        vtd_destroy_s1_hwpt(idev, vtd_as);

I suppose this is the success branch. Why destroy s1_hwpt here?
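
One possible shape without the double negation, as a hedged C sketch.
It assumes the attach primitive returns true on success (as the quoted
"ret = !..." pattern suggests) and models the destroy call as releasing
the previously recorded stage-1 hwpt before recording the new one. All
demo_* names and globals are hypothetical test scaffolding.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct DemoAS {
    uint32_t s1_hwpt;   /* currently attached stage-1 hwpt, 0 if none */
} DemoAS;

static bool demo_attach_should_fail;  /* scaffolding: force attach failure */
static uint32_t demo_last_freed;      /* scaffolding: last hwpt id released */

static void demo_free_id(uint32_t id)
{
    demo_last_freed = id;
}

/* Stand-in for the attach primitive: true on success. */
static bool demo_attach_hwpt(uint32_t hwpt_id)
{
    (void)hwpt_id;
    return !demo_attach_should_fail;
}

/* Returns 0 on success, -EINVAL on failure; no inverted flag in between. */
static int demo_attach(DemoAS *as, uint32_t new_hwpt, bool is_flt)
{
    if (!demo_attach_hwpt(new_hwpt)) {
        if (is_flt) {
            demo_free_id(new_hwpt);   /* undo the allocation made for FLT */
        }
        return -EINVAL;
    }

    /* Attached to the new hwpt: the old stage-1 hwpt is no longer used. */
    if (as->s1_hwpt) {
        demo_free_id(as->s1_hwpt);
    }
    as->s1_hwpt = is_flt ? new_hwpt : 0;
    return 0;
}
```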

> +        if (vtd_pe_pgtt_is_flt(pe)) {
> +            vtd_as->s1_hwpt = hwpt_id;
> +        }
> +    } else if (vtd_pe_pgtt_is_flt(pe)) {
> +        iommufd_backend_free_id(idev->iommufd, hwpt_id);
> +    }
> +
> +    return ret;
> +}
> +
> +static int vtd_device_detach_iommufd(VTDHostIOMMUDevice *vtd_hiod,
> +                                     VTDAddressSpace *vtd_as, Error **errp)
> +{
> +    HostIOMMUDeviceIOMMUFD *idev = HOST_IOMMU_DEVICE_IOMMUFD(vtd_hiod->hiod);
> +    uint32_t pasid = vtd_as->pasid;
> +    int ret;
> +
> +    if (vtd_hiod->iommu_state->dmar_enabled) {
> +        ret = !host_iommu_device_iommufd_detach_hwpt(idev, errp);
> +        trace_vtd_device_detach_hwpt(idev->devid, pasid, ret);
> +    } else {
> +        ret = !host_iommu_device_iommufd_attach_hwpt(idev, idev->hwpt_id, errp);
> +        trace_vtd_device_reattach_def_hwpt(idev->devid, pasid, idev->hwpt_id,
> +                                           ret);
> +    }

You need a comment explaining why the behavior differs depending on
iommu_state->dmar_enabled.

> +
> +    if (!ret) {
> +        vtd_destroy_s1_hwpt(idev, vtd_as);
> +    }
> +
> +    return ret;
> +}
> +
> +static int vtd_bind_guest_pasid(VTDAddressSpace *vtd_as, VTDPASIDOp op,
> +                                Error **errp)
> +{
> +    IntelIOMMUState *s = vtd_as->iommu_state;
> +    VTDHostIOMMUDevice *vtd_hiod = vtd_find_hiod_iommufd(s, vtd_as);
> +    int ret;
> +
> +    if (!vtd_hiod) {
> +        /* No need to go further, e.g. for emulated device */
> +        return 0;
> +    }
> +
> +    if (vtd_as->pasid != PCI_NO_PASID) {
> +        error_setg(errp, "Non-rid_pasid %d not supported yet", vtd_as->pasid);

I see. This series is really only for rid_pasid. Then some prior patches
may need to be reorganized. Please refer to my related comments on the prior
patches.

> +        return -EINVAL;
> +    }
> +
> +    switch (op) {
> +    case VTD_PASID_UPDATE:
> +    case VTD_PASID_BIND:

I doubt we really need two types. BIND might be enough, since UPDATE binds
the device/pasid to a new page table and the kernel supports binding to a
new page table without unbinding the old one.

> +    {
> +        ret = vtd_device_attach_iommufd(vtd_hiod, vtd_as, errp);
> +        break;
> +    }
> +    case VTD_PASID_UNBIND:
> +    {
> +        ret = vtd_device_detach_iommufd(vtd_hiod, vtd_as, errp);
> +        break;
> +    }
> +    default:
> +        error_setg(errp, "Unknown VTDPASIDOp!!!");
> +        break;
> +    }
> +
> +    return ret;
> +}
> +#else
> +static int vtd_bind_guest_pasid(VTDAddressSpace *vtd_as, VTDPASIDOp op,
> +                                Error **errp)
> +{
> +    return 0;
> +}
> +#endif
> +
> +static int vtd_bind_guest_pasid_report_err(VTDAddressSpace *vtd_as,
> +                                           VTDPASIDOp op)
> +{
> +    Error *local_err = NULL;
> +    int ret;
> +
> +    /*
> +     * vIOMMU calls into kernel to do BIND/UNBIND, the failure reason
> +     * can be kernel, QEMU bug or invalid guest config. None of them
> +     * should be reported to guest in PASID cache invalidation
> +     * processing path. But at least, we can report it to QEMU console.
> +     *

Could you elaborate on the reason for the above comment? I agree that we
lack a formal way to report the failure. But suppressing the failure
does not seem correct to me.

Regards,
Yi Liu

> +     * TODO: for invalid guest config, DMA translation fault will be
> +     * caught by host and passed to QEMU to inject to guest in future.
> +     */
> +    ret = vtd_bind_guest_pasid(vtd_as, op, &local_err);
> +    if (ret) {
> +        error_report_err(local_err);
> +    }
> +
> +    return ret;
> +}
> +
>   /* Do a context-cache device-selective invalidation.
>    * @func_mask: FM field after shifting
>    */
> @@ -3248,10 +3448,20 @@ static gboolean vtd_flush_pasid_locked(gpointer key, gpointer value,
>        */
>       if (!vtd_pasid_entry_compare(&pe, &pc_entry->pasid_entry)) {
>           pc_entry->pasid_entry = pe;
> +        if (vtd_bind_guest_pasid_report_err(vtd_as, VTD_PASID_UPDATE)) {
> +            /*
> +             * In case update binding fails, tear down existing binding to
> +             * catch invalid pasid entry config during DMA translation.
> +             */
> +            goto remove;
> +        }
>       }
>       return false;
>   
>   remove:
> +    if (vtd_bind_guest_pasid_report_err(vtd_as, VTD_PASID_UNBIND)) {
> +        return false;
> +    }
>       pc_entry->valid = false;
>   
>       /*
> @@ -3336,6 +3546,9 @@ static void vtd_sm_pasid_table_walk_one(IntelIOMMUState *s,
>               if (!pc_entry->valid) {
>                   pc_entry->pasid_entry = pe;
>                   pc_entry->valid = true;
> +                if (vtd_bind_guest_pasid_report_err(vtd_as, VTD_PASID_BIND)) {
> +                    pc_entry->valid = false;
> +                }
>               }
>           }
>           pasid++;
> diff --git a/hw/i386/trace-events b/hw/i386/trace-events
> index c8a936eb46..1c31b9a873 100644
> --- a/hw/i386/trace-events
> +++ b/hw/i386/trace-events
> @@ -73,6 +73,9 @@ vtd_warn_invalid_qi_tail(uint16_t tail) "tail 0x%"PRIx16
>   vtd_warn_ir_vector(uint16_t sid, int index, int vec, int target) "sid 0x%"PRIx16" index %d vec %d (should be: %d)"
>   vtd_warn_ir_trigger(uint16_t sid, int index, int trig, int target) "sid 0x%"PRIx16" index %d trigger %d (should be: %d)"
>   vtd_reset_exit(void) ""
> +vtd_device_attach_hwpt(uint32_t dev_id, uint32_t pasid, uint32_t hwpt_id, int ret) "dev_id %d pasid %d hwpt_id %d, ret: %d"
> +vtd_device_detach_hwpt(uint32_t dev_id, uint32_t pasid, int ret) "dev_id %d pasid %d ret: %d"
> +vtd_device_reattach_def_hwpt(uint32_t dev_id, uint32_t pasid, uint32_t hwpt_id, int ret) "dev_id %d pasid %d hwpt_id %d, ret: %d"
>   
>   # amd_iommu.c
>   amdvi_evntlog_fail(uint64_t addr, uint32_t head) "error: fail to write at addr 0x%"PRIx64" +  offset 0x%"PRIx32



* Re: [PATCH v5 16/21] intel_iommu: Replay pasid bindings after context cache invalidation
  2025-08-28  9:43   ` Eric Auger
@ 2025-08-29  7:35     ` Yi Liu
  2025-09-01  8:11       ` Duan, Zhenzhong
  0 siblings, 1 reply; 113+ messages in thread
From: Yi Liu @ 2025-08-29  7:35 UTC (permalink / raw)
  To: eric.auger, Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, mst, jasowang, peterx, ddutile, jgg,
	nicolinc, joao.m.martins, clement.mathieu--drif, kevin.tian,
	chao.p.peng, Yi Sun

On 2025/8/28 17:43, Eric Auger wrote:
> 
> 
> On 8/22/25 8:40 AM, Zhenzhong Duan wrote:
>> From: Yi Liu <yi.l.liu@intel.com>
>>
>> This replays guest pasid bindings after context cache invalidation.
>> This is a behavior to ensure safety. Actually, programmer should issue
>> pasid cache invalidation with proper granularity after issuing a context
>> cache invalidation.
> So is this mandated? If the spec mandates specific invalidations and the
> guest does not comply with the expected invalidation sequence shall we
> do that behind the curtain?

I think this follows the decision below. We can discuss whether replaying
the pasid bindings is really needed.

dd4d607e40d (Peter Xu 2017-04-07 18:59:15 +0800 2321)     /*
dd4d607e40d (Peter Xu 2017-04-07 18:59:15 +0800 2322)      * From VT-d spec 6.5.2.1, a global context entry invalidation
dd4d607e40d (Peter Xu 2017-04-07 18:59:15 +0800 2323)      * should be followed by a IOTLB global invalidation, so we should
dd4d607e40d (Peter Xu 2017-04-07 18:59:15 +0800 2324)      * be safe even without this. Hoewever, let's replay the region as
dd4d607e40d (Peter Xu 2017-04-07 18:59:15 +0800 2325)      * well to be safer, and go back here when we need finer tunes for
dd4d607e40d (Peter Xu 2017-04-07 18:59:15 +0800 2326)      * VT-d emulation codes.
dd4d607e40d (Peter Xu 2017-04-07 18:59:15 +0800 2327)      */
dd4d607e40d (Peter Xu 2017-04-07 18:59:15 +0800 2328)     vtd_iommu_replay_all(s);

Regards,
Yi Liu



* Re: [PATCH v5 21/21] intel_iommu: Enable host device when x-flts=on in scalable mode
  2025-08-22  6:40 ` [PATCH v5 21/21] intel_iommu: Enable host device when x-flts=on in scalable mode Zhenzhong Duan
  2025-08-28 12:51   ` Eric Auger
@ 2025-08-29  7:42   ` Yi Liu
  1 sibling, 0 replies; 113+ messages in thread
From: Yi Liu @ 2025-08-29  7:42 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
	jgg, nicolinc, joao.m.martins, clement.mathieu--drif, kevin.tian,
	chao.p.peng

On 2025/8/22 14:40, Zhenzhong Duan wrote:
> Now that all infrastructures of supporting passthrough device running
> with stage-1 translation are there, enable it now.
> 
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
>   hw/i386/intel_iommu.c | 2 ++
>   1 file changed, 2 insertions(+)
> 
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index f9cb13e945..04a412d460 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -5222,6 +5222,8 @@ static bool vtd_check_hiod(IntelIOMMUState *s, VTDHostIOMMUDevice *vtd_hiod,
>                      "when x-flts=on");
>           return false;
>       }
> +
> +    return true;
>   #endif

Just to echo my earlier comment: if this series does not support
non-rid_pasid PASIDs, then factor out the configuration that has both
x-flts=on and x-pasid-mode=on.

>   
>       error_setg(errp, "host IOMMU is incompatible with stage-1 translation");

This patch itself looks good to me.

Reviewed-by: Yi Liu <yi.l.liu@intel.com>



* Re: [PATCH v5 20/21] Workaround for ERRATA_772415_SPR17
  2025-08-27 15:09       ` Nicolin Chen
@ 2025-08-29  8:16         ` Yi Liu
  2025-08-29  8:54           ` Nicolin Chen
  0 siblings, 1 reply; 113+ messages in thread
From: Yi Liu @ 2025-08-29  8:16 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Zhenzhong Duan, qemu-devel, alex.williamson, clg, eric.auger, mst,
	jasowang, peterx, ddutile, jgg, joao.m.martins,
	clement.mathieu--drif, kevin.tian, chao.p.peng

On 2025/8/27 23:09, Nicolin Chen wrote:
> On Wed, Aug 27, 2025 at 07:56:38PM +0800, Yi Liu wrote:
>> On 2025/8/23 07:55, Nicolin Chen wrote:
>>>> +    if (vtd.flags & IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17) {
>>>> +        container->bcontainer.bypass_ro = true;
>>>
>>> This circled back to checking a vendor specific flag in the core..
>>>
>>> Perhaps we could upgrade the get_viommu_cap op and its API:
>>>
>>> enum viommu_flags {
>>>       VIOMMU_FLAG_HW_NESTED = BIT_ULL(0),
>>>       VIOMMU_FLAG_BYPASS_RO = BIT_ULL(1),
>>
>> hmmm. I'm not quite on this idea as the two flags have different sources.
>> One determined by vIOMMU config, one by the hardware limit. Reporting
>> them in one API is strange.
> 
> It's fair enough that we want to make such a clear boundary between
> a vIOMMU flag and a HW IOMMU flag of the same vendor..
> 
>> I think the bypass RO can be determined in
>> VFIO just like the patch has done. But it should check if vIOMMU has
>> requested nested hwpt and also the reported hw_info::type is
>> IOMMU_HW_INFO_TYPE_INTEL_VTD.
>>
>> 	if ((flags & IOMMU_HWPT_ALLOC_NEST_PARENT) &&
>>              type == IOMMU_HW_INFO_TYPE_INTEL_VTD &&
>>              vtd.flags & IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17) {
>>              container->bcontainer.bypass_ro = true;
>>           }
> 
> Then, it feels odd to me that we don't have a clear boundary between
> a generic flag and a vendor flag :-/
> 
> It's fine if we want to keep all the host-level vendor flags outside
> the vIOMMU code, but at least could we please have a generic looking
> function outside this iommufd_cdev_autodomains_get() to translate a
> vendor flag to a generic looking flag?
> 
> We could start with a function that loads the HostIOMMUDeviceCaps (or
> just VendorCaps) dealing with vendor types and outputs generic ones:
> 
>          host_iommu_flags = host_iommu_decode_vendor_caps(&vendor_caps);
> 
>          if (hwpt_flags & IOMMU_HWPT_ALLOC_NEST_PARENT &&
>              host_iommu_flags & HOST_IOMMU_FLAG_BYPASS_RO) {
>               container->bcontainer.bypass_ro = true;
>          }
> 
> Over time, it can even grow into a separate file, if there are more
> vendor specific requirement.

You also have a valid point. I also considered letting the vIOMMU invoke
vfio_listener_register(), but this might require changing the VFIO logic a
lot. Conceptually, it does not stand up very well, and it is too heavy for
working around an erratum. So may we just start with a function as you
proposed?

Regards,
Yi Liu



* Re: [PATCH v5 20/21] Workaround for ERRATA_772415_SPR17
  2025-08-29  8:16         ` Yi Liu
@ 2025-08-29  8:54           ` Nicolin Chen
  0 siblings, 0 replies; 113+ messages in thread
From: Nicolin Chen @ 2025-08-29  8:54 UTC (permalink / raw)
  To: Yi Liu
  Cc: Zhenzhong Duan, qemu-devel, alex.williamson, clg, eric.auger, mst,
	jasowang, peterx, ddutile, jgg, joao.m.martins,
	clement.mathieu--drif, kevin.tian, chao.p.peng

On Fri, Aug 29, 2025 at 04:16:41PM +0800, Yi Liu wrote:
> On 2025/8/27 23:09, Nicolin Chen wrote:
> > On Wed, Aug 27, 2025 at 07:56:38PM +0800, Yi Liu wrote:
> > > On 2025/8/23 07:55, Nicolin Chen wrote:
> > We could start with a function that loads the HostIOMMUDeviceCaps (or
> > just VendorCaps) dealing with vendor types and outputs generic ones:
> > 
> >          host_iommu_flags = host_iommu_decode_vendor_caps(&vendor_caps);
> > 
> >          if (hwpt_flags & IOMMU_HWPT_ALLOC_NEST_PARENT &&
> >              host_iommu_flags & HOST_IOMMU_FLAG_BYPASS_RO) {
> >               container->bcontainer.bypass_ro = true;
> >          }
> > 
> > Over time, it can even grow into a separate file, if there are more
> > vendor specific requirement.
> 
> you also have valid point. I've also considered to let vIOMMU to invoke
> the vfio_listener_register(). This might need to change the VFIO logic a
> lot. Conceptually, it does not stand very well... And it is too heavy
> for WA an errata...

I think it's fine to use a flag. Zhenzhong's bypass_ro patch looks
quite clean to me.

> So may we just start with a function as you proposed?

Yea.

I imagined that kind of decoding function in the backend/iommufd.c
or somewhere closer to HostIOMMU structure/function in this file.
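
As a rough illustration of such a decoding function, here is a minimal
standalone sketch. The names host_iommu_decode_vendor_caps() and
HOST_IOMMU_FLAG_BYPASS_RO come from the proposal quoted above; everything
else (struct sketch_vendor_caps, the SKETCH_* constants and their values)
is a hypothetical placeholder, not the real iommufd uAPI or QEMU types.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder values; the real ones would come from the iommufd uAPI. */
enum sketch_hw_info_type {
    SKETCH_HW_INFO_TYPE_NONE = 0,
    SKETCH_HW_INFO_TYPE_INTEL_VTD = 1,
};
#define SKETCH_VTD_ERRATA_772415_SPR17  (1u << 0)

/* Generic, vendor-neutral flag consumed by the core code. */
#define HOST_IOMMU_FLAG_BYPASS_RO       (1ull << 0)

/* Hypothetical container for the vendor-specific hardware info. */
struct sketch_vendor_caps {
    uint32_t type;        /* which vendor layout the flags follow */
    uint32_t vtd_flags;   /* only meaningful for the VT-d type */
};

/* Translate vendor-specific bits into generic host IOMMU flags. */
uint64_t host_iommu_decode_vendor_caps(const struct sketch_vendor_caps *caps)
{
    uint64_t flags = 0;

    switch (caps->type) {
    case SKETCH_HW_INFO_TYPE_INTEL_VTD:
        if (caps->vtd_flags & SKETCH_VTD_ERRATA_772415_SPR17) {
            flags |= HOST_IOMMU_FLAG_BYPASS_RO;
        }
        break;
    default:
        break;            /* unknown vendors contribute no generic flags */
    }
    return flags;
}
```

The caller then only tests generic flags, so the vendor switch stays in one
place and can grow a case per vendor later.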

And I expected that the WANTS_NESTING_PARENT flag would need some
validation from the vendor specific hw info too, by ensuring IOMMU
HW does support nesting. Then we could reject the allocation of a
nesting parent at an earlier stage.

Now, we pre-allocate a nesting parent HWPT (so long as the vIOMMU wants
one) and let the set_iommu_device callback validate the HW info. I think
that's fine as well..

Yet one way or another, we do put iommu_hw_info_vtd (HW IOMMU caps)
in the vIOMMU code and validate it there, right? So, arguably the whole
separation between vIOMMU and HW IOMMU things is not that perfectly
implemented? :-/

Thanks
Nic



* RE: [PATCH v5 02/21] hw/pci: Introduce pci_device_get_viommu_cap()
  2025-08-29  3:26                   ` Nicolin Chen
@ 2025-09-01  2:35                     ` Duan, Zhenzhong
  2025-09-01  2:59                       ` Nicolin Chen
  0 siblings, 1 reply; 113+ messages in thread
From: Duan, Zhenzhong @ 2025-09-01  2:35 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Liu, Yi L, Eric Auger, qemu-devel@nongnu.org,
	alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
	jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
	jgg@nvidia.com, joao.m.martins@oracle.com,
	clement.mathieu--drif@eviden.com, Tian,  Kevin, Peng, Chao P



>-----Original Message-----
>From: Nicolin Chen <nicolinc@nvidia.com>
>Subject: Re: [PATCH v5 02/21] hw/pci: Introduce
>pci_device_get_viommu_cap()
>
>On Fri, Aug 29, 2025 at 01:54:50AM +0000, Duan, Zhenzhong wrote:
>> >>On 2025/8/27 23:30, Nicolin Chen wrote:
>> >>> On Wed, Aug 27, 2025 at 02:32:42PM +0200, Eric Auger wrote:
>> >>>> On 8/27/25 2:30 PM, Yi Liu wrote:
>> >>>>> On 2025/8/27 19:22, Eric Auger wrote:
>> >>>>>>> TBH. I'm hesitating to name it as get_viommu_cap. The scope is a
>> >little
>> >>>>>>> larger than what we want so far. So I'm wondering if it can be
>done
>> >>>>>>> in a
>> >>>>>>> more straightforward way. e.g. just a bool op named
>> >>>>>>> iommu_nested_wanted(). Just an example, maybe better naming.
>We
>> >>can
>> >>>>>>> extend the op to be returning a u64 value in the future when we
>see
>> >>>>>>> another request on VFIO from vIOMMU.
>> >>>>>> personnally I am fine with the bitmask which looks more future
>proof.
>> >>>>>
>> >>>>> not quite sure if there is another info that needs to be checked in
>> >>>>> this "VFIO asks vIOMMU" manner. Have you seen one beside this
>> >>>>> nested hwpt requirement by vIOMMU?
>> >>>>
>> >>>> I don't remember any at this point. But I guess with ARM CCA device
>> >>>> passthrough we might have other needs
>> >>>
>> >>> Yea. A Realm vSMMU instance won't allocate IOAS/HWPT. So it will
>> >>> ask the core to bypass those allocations, via the same op.
>> >>>
>> >>> I don't know: does "get_viommu_flags" sound more fitting to have
>> >>> a clear meaning of "want"?
>> >>>
>> >>>    VIOMMU_FLAG_WANT_NESTING_PARENT
>> >>>    VIOMMU_FLAG_WANT_NO_IOAS
>> >>>
>> >>> At least, the 2nd one being a "cap" wouldn't sound nice to me..
>> >>
>> >>this looks good to me.
>> >
>> >OK, will do s/get_viommu_cap/get_viommu_flags and
>> >s/VIOMMU_CAP_HW_NESTED/
>VIOMMU_FLAG_WANT_NESTING_PARENT if
>> >no more suggestions.
>>
>> I just noticed this change will conflict with your suggestion of using
>HW_NESTED terminology.
>> Let me know if you agree with this change or not?
>
>It wouldn't necessarily conflict. VIOMMU_FLAG_WANT_NESTING_PARENT
>is a request, interchangeable with VIOMMU_FLAG_SUPPORT_HW_NESTED,
>i.e. a cap.
>
>At the end of the day, they are fundamentally the same thing that
>is to tell the core to allocate a nesting parent HWPT. The former
>one is just more straightforward, avoiding confusing terms such as
>"stage-1" and "nested".
>
>IMHO, you wouldn't even need the comments in the other thread, as
>the flag explains clearly what it wants and what the core is doing.
>
>Also, once you use the "want" one, the "HW_NESTED" terminology will
>not exist in the code.

OK, will use the *_flags and _WANT_* style. Do you have suggestions for the name of vfio_device_viommu_get_nested(), since the "HW_NESTED" terminology will no longer exist? What about vfio_device_get_viommu_flags_W_N_P()?

Thanks
Zhenzhong



* Re: [PATCH v5 02/21] hw/pci: Introduce pci_device_get_viommu_cap()
  2025-09-01  2:35                     ` Duan, Zhenzhong
@ 2025-09-01  2:59                       ` Nicolin Chen
  2025-09-01  3:31                         ` Duan, Zhenzhong
  0 siblings, 1 reply; 113+ messages in thread
From: Nicolin Chen @ 2025-09-01  2:59 UTC (permalink / raw)
  To: Duan, Zhenzhong
  Cc: Liu, Yi L, Eric Auger, qemu-devel@nongnu.org,
	alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
	jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
	jgg@nvidia.com, joao.m.martins@oracle.com,
	clement.mathieu--drif@eviden.com, Tian,  Kevin, Peng, Chao P

On Mon, Sep 01, 2025 at 02:35:29AM +0000, Duan, Zhenzhong wrote:
> >> I just noticed this change will conflict with your suggestion of using
> >HW_NESTED terminology.
> >> Let me know if you agree with this change or not?
> >
> >It wouldn't necessarily conflict. VIOMMU_FLAG_WANT_NESTING_PARENT
> >is a request, interchangeable with VIOMMU_FLAG_SUPPORT_HW_NESTED,
> >i.e. a cap.
> >
> >At the end of the day, they are fundamentally the same thing that
> >is to tell the core to allocate a nesting parent HWPT. The former
> >one is just more straightforward, avoiding confusing terms such as
> >"stage-1" and "nested".
> >
> >IMHO, you wouldn't even need the comments in the other thread, as
> >the flag explains clearly what it wants and what the core is doing.
> >
> >Also, once you use the "want" one, the "HW_NESTED" terminology will
> >not exist in the code.
> 
> OK, will use the *_flags and _WANT_* style, do you have suggestions
> for the name of vfio_device_viommu_get_nested() since "HW_NESTED"
> terminology will not exist, what about vfio_device_get_viommu_flags_W_N_P()?

I don't think it is necessary to have a specific API per flag. So,
it could be just:

    uint64_t viommu_flags = vfio_device_get_viommu_flags(vbasedev);

    if (viommu_flags & VIOMMU_FLAG_WANT_NESTING_PARENT) {
        flags |= IOMMU_HWPT_ALLOC_NEST_PARENT;
    }
?

Thanks
Nic



* RE: [PATCH v5 11/21] intel_iommu: Handle PASID entry removal and update
  2025-08-27 14:25   ` Eric Auger
@ 2025-09-01  3:17     ` Duan, Zhenzhong
  0 siblings, 0 replies; 113+ messages in thread
From: Duan, Zhenzhong @ 2025-09-01  3:17 UTC (permalink / raw)
  To: eric.auger@redhat.com, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
	jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
	jgg@nvidia.com, nicolinc@nvidia.com, joao.m.martins@oracle.com,
	clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
	Peng, Chao P, Yi Sun



>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: Re: [PATCH v5 11/21] intel_iommu: Handle PASID entry removal and
>update
snip...

>> +
>> +/*
>> + * This function is a loop function which return value determines if
>whose returned value determines whether current vtd_as iterator matches
>the pasid cache entry info passed in user_data and needs to be removed
>from the pasid cache.

Will do.

>> + * vtd_as including cached pasid entry is removed.
>> + *
>> + * For PCI_NO_PASID, when corresponding cached pasid entry is cleared,
>> + * it returns false so that vtd_as is reserved as it's owned by PCI
>> + * sub-system. For other pasid, it returns true so vtd_as is removed.
>> + */
>> +static gboolean vtd_flush_pasid_locked(gpointer key, gpointer value,
>> +                                       gpointer user_data)
>> +{
>> +    VTDPASIDCacheInfo *pc_info = user_data;
>> +    VTDAddressSpace *vtd_as = value;
>> +    VTDPASIDCacheEntry *pc_entry = &vtd_as->pasid_cache_entry;
>> +    VTDPASIDEntry pe;
>> +    uint16_t did;
>> +    uint32_t pasid;
>> +    int ret;
>> +
>> +    if (!pc_entry->valid) {
>> +        return false;
>> +    }
>> +    did = VTD_SM_PASID_ENTRY_DID(&pc_entry->pasid_entry);
>> +
>> +    if (vtd_as_to_iommu_pasid_locked(vtd_as, &pasid)) {
>> +        goto remove;
>> +    }
>> +
>> +    switch (pc_info->type) {
>> +    case VTD_PASID_CACHE_PASIDSI:
>> +        if (pc_info->pasid != pasid) {
>> +            return false;
>> +        }
>> +        /* fall through */
>> +    case VTD_PASID_CACHE_DOMSI:
>> +        if (pc_info->did != did) {
>> +            return false;
>> +        }
>> +        /* fall through */
>> +    case VTD_PASID_CACHE_GLOBAL_INV:
>> +        break;
>> +    default:
>> +        error_setg(&error_fatal, "invalid pc_info->type for flush");
>> +    }
>> +
>> +    /*
>> +     * pasid cache invalidation may indicate a present pasid entry to
>present
>> +     * pasid entry modification. To cover such case, vIOMMU emulator
>needs to
>> +     * fetch latest guest pasid entry and compares with cached pasid
>entry,
>> +     * then update pasid cache.
>> +     */
>> +    ret = vtd_dev_get_pe_from_pasid(vtd_as, pasid, &pe);
>> +    if (ret) {
>> +        /*
>> +         * No valid pasid entry in guest memory. e.g. pasid entry was
>modified
>> +         * to be either all-zero or non-present. Either case means
>existing
>> +         * pasid cache should be removed.
>> +         */
>> +        goto remove;
>> +    }
>> +
>> +    /*
>> +     * Update cached pasid entry if it's stale compared to what's in guest
>> +     * memory.
>> +     */
>> +    if (!vtd_pasid_entry_compare(&pe, &pc_entry->pasid_entry)) {
>> +        pc_entry->pasid_entry = pe;
>> +    }
>> +    return false;
>> +
>> +remove:
>> +    pc_entry->valid = false;
>> +
>> +    /*
>> +     * Don't remove address space of PCI_NO_PASID which is created for
>PCI
>> +     * sub-system.
>> +     */
>> +    if (vtd_as->pasid == PCI_NO_PASID) {
>> +        return false;
>> +    }
>> +    return true;
>> +}
>> +
>> +/*
>> + * For a PASID cache invalidation, this function handles below scenarios:
>> + * a) a present cached pasid entry needs to be removed
>> + * b) a present cached pasid entry needs to be updated
>> + */
>> +static void vtd_pasid_cache_sync(IntelIOMMUState *s,
>VTDPASIDCacheInfo *pc_info)
>> +{
>> +    if (!s->flts || !s->root_scalable || !s->dmar_enabled) {
>> +        return;
>> +    }
>> +
>> +    vtd_iommu_lock(s);
>> +    /*
>> +     * a,b): loop all the existing vtd_as instances for pasid cache removal
>> +       or update.
>> +     */
>> +    g_hash_table_foreach_remove(s->vtd_address_spaces,
>vtd_flush_pasid_locked,
>> +                                pc_info);
>> +    vtd_iommu_unlock(s);
>> +}
>> +
>> +static bool vtd_process_pasid_desc(IntelIOMMUState *s,
>> +                                   VTDInvDesc *inv_desc)
>> +{
>> +    uint16_t did;
>> +    uint32_t pasid;
>> +    VTDPASIDCacheInfo pc_info;
>> +    uint64_t mask[4] = {VTD_INV_DESC_PASIDC_RSVD_VAL0,
>VTD_INV_DESC_ALL_ONE,
>> +                        VTD_INV_DESC_ALL_ONE,
>VTD_INV_DESC_ALL_ONE};
>> +
>> +    if (!vtd_inv_desc_reserved_check(s, inv_desc, mask, true,
>> +                                     __func__, "pasid cache inv"))
>{
>> +        return false;
>> +    }
>> +
>> +    did = VTD_INV_DESC_PASIDC_DID(inv_desc);
>> +    pasid = VTD_INV_DESC_PASIDC_PASID(inv_desc);
>> +
>> +    switch (VTD_INV_DESC_PASIDC_G(inv_desc)) {
>> +    case VTD_INV_DESC_PASIDC_G_DSI:
>> +        trace_vtd_pasid_cache_dsi(did);
>> +        pc_info.type = VTD_PASID_CACHE_DOMSI;
>> +        pc_info.did = did;
>> +        break;
>> +
>> +    case VTD_INV_DESC_PASIDC_G_PASID_SI:
>> +        /* PASID selective implies a DID selective */
>> +        trace_vtd_pasid_cache_psi(did, pasid);
>> +        pc_info.type = VTD_PASID_CACHE_PASIDSI;
>> +        pc_info.did = did;
>> +        pc_info.pasid = pasid;
>> +        break;
>> +
>> +    case VTD_INV_DESC_PASIDC_G_GLOBAL:
>> +        trace_vtd_pasid_cache_gsi();
>> +        pc_info.type = VTD_PASID_CACHE_GLOBAL_INV;
>> +        break;
>> +
>> +    default:
>> +        error_report_once("invalid granularity field in PASID-cache
>invalidate "
>> +                          "descriptor, hi: 0x%"PRIx64" lo: 0x%"
>PRIx64,
>> +                           inv_desc->val[1], inv_desc->val[0]);
>what's the point of printing the 2nd 64b? Looking at Figure 6-2 in the
>spec (6.5.2.2. PASID-cache invalidate descriptor) it does not seem to
>contain anything?

I think it's a tradition in intel_iommu.c to print hi and lo for a 128-bit inv_desc, or val[3..0] for a 256-bit one, even though hi may be reserved.
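
To make the point about the low qword concrete, here is a minimal
standalone sketch of decoding the PASID-cache invalidate descriptor fields.
The bit positions (G at bits 5:4, DID at 31:16, PASID at 51:32) are taken
from the VTD_INV_DESC_PASIDC_* macros in this patch; sketch_extract64(),
struct sketch_pasidc_fields and sketch_decode_pasidc() are hypothetical
names, with sketch_extract64() standing in for QEMU's extract64().

```c
#include <assert.h>
#include <stdint.h>

/* Extract 'length' bits starting at bit 'start'; length must be 1..63. */
uint64_t sketch_extract64(uint64_t value, unsigned start, unsigned length)
{
    return (value >> start) & ((1ULL << length) - 1);
}

struct sketch_pasidc_fields {
    unsigned g;       /* granularity */
    uint16_t did;     /* domain-id */
    uint32_t pasid;   /* 20-bit PASID */
};

/* All three fields live in the low 64 bits of the descriptor. */
struct sketch_pasidc_fields sketch_decode_pasidc(uint64_t lo)
{
    struct sketch_pasidc_fields f;

    f.g = (unsigned)sketch_extract64(lo, 4, 2);
    f.did = (uint16_t)sketch_extract64(lo, 16, 16);
    f.pasid = (uint32_t)sketch_extract64(lo, 32, 20);
    return f;
}
```

Nothing here reads the high qword, which is why printing it mainly helps
when a guest sets reserved bits.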

>
>Besides I read in the spec:
>Domain-ID (DID): The DID field indicates the target domain-id. Hardware
>ignores bits 31:(16+N), where N is the domain-id width reported in the
>Capability Register.
>
>How do you make sure N is same on both pIOMMU and vIOMMU?

There is no relationship between the pIOMMU's and the vIOMMU's DID. The host and guest kernels manage their DIDs separately.

Thanks
Zhenzhong


* RE: [PATCH v5 11/21] intel_iommu: Handle PASID entry removal and update
  2025-08-28 12:05   ` Yi Liu
@ 2025-09-01  3:31     ` Duan, Zhenzhong
  2025-09-03  7:58       ` Yi Liu
  0 siblings, 1 reply; 113+ messages in thread
From: Duan, Zhenzhong @ 2025-09-01  3:31 UTC (permalink / raw)
  To: Liu, Yi L, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, eric.auger@redhat.com,
	mst@redhat.com, jasowang@redhat.com, peterx@redhat.com,
	ddutile@redhat.com, jgg@nvidia.com, nicolinc@nvidia.com,
	joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
	Tian, Kevin, Peng, Chao P, Yi Sun



>-----Original Message-----
>From: Liu, Yi L <yi.l.liu@intel.com>
>Subject: Re: [PATCH v5 11/21] intel_iommu: Handle PASID entry removal and
>update
>
>On 2025/8/22 14:40, Zhenzhong Duan wrote:
>> This adds an new entry VTDPASIDCacheEntry in VTDAddressSpace to cache
>the
>> pasid entry and track PASID usage and future PASID tagged DMA address
>> translation support in vIOMMU.
>
>Have you seen any extra code needed based on this series to support non
>rid_pasid PASIDs? If no, may just relax the scope of this series.
>otherwise, you may need to tweak the patch a little bit. e.g. factor
>out setting x-flts and x-pasid-mode at the same time.

Quite a lot of code is common to both non-rid_pasid and rid_pasid, so this series contains some infrastructure code that looks like it is for non-rid_pasid.

But supporting non-rid_pasid requires pasid_attach/detach(), which is not implemented in this series.

Even if x-flts and x-pasid-mode are both on, PASID isn't enabled, since the VFIO device doesn't expose the PASID capability to the guest, so the guest never uses non-rid_pasid with this VFIO device.

>
>>
>> VTDAddressSpace of PCI_NO_PASID is allocated when device is plugged and
>> never freed. For other pasid, VTDAddressSpace instance is
>created/destroyed
>> per the guest pasid entry set up/destroy.
>
>> When guest removes or updates a PASID entry, QEMU will capture the guest
>pasid
>> selective pasid cache invalidation, removes VTDAddressSpace or update
>cached
>> PASID entry.
>>
>> vIOMMU emulator could figure out the reason by fetching latest guest pasid
>entry
>> and compare it with cached PASID entry.
>>
>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>>   hw/i386/intel_iommu_internal.h |  27 ++++-
>>   include/hw/i386/intel_iommu.h  |   6 +
>>   hw/i386/intel_iommu.c          | 196
>+++++++++++++++++++++++++++++++--
>>   hw/i386/trace-events           |   3 +
>>   4 files changed, 220 insertions(+), 12 deletions(-)
>>
>> diff --git a/hw/i386/intel_iommu_internal.h
>b/hw/i386/intel_iommu_internal.h
>> index f7510861d1..b9b76dd996 100644
>> --- a/hw/i386/intel_iommu_internal.h
>> +++ b/hw/i386/intel_iommu_internal.h
>> @@ -316,6 +316,7 @@ typedef enum VTDFaultReason {
>>                                     * request while disabled */
>>       VTD_FR_IR_SID_ERR = 0x26,   /* Invalid Source-ID */
>>
>> +    VTD_FR_RTADDR_INV_TTM = 0x31,  /* Invalid TTM in RTADDR */
>>       /* PASID directory entry access failure */
>>       VTD_FR_PASID_DIR_ACCESS_ERR = 0x50,
>>       /* The Present(P) field of pasid directory entry is 0 */
>> @@ -493,6 +494,15 @@ typedef union VTDInvDesc VTDInvDesc;
>>   #define VTD_INV_DESC_PIOTLB_RSVD_VAL0
>0xfff000000000f1c0ULL
>>   #define VTD_INV_DESC_PIOTLB_RSVD_VAL1     0xf80ULL
>>
>> +/* PASID-cache Invalidate Descriptor (pc_inv_dsc) fields */
>> +#define VTD_INV_DESC_PASIDC_G(x)        extract64((x)->val[0], 4, 2)
>> +#define VTD_INV_DESC_PASIDC_G_DSI       0
>> +#define VTD_INV_DESC_PASIDC_G_PASID_SI  1
>> +#define VTD_INV_DESC_PASIDC_G_GLOBAL    3
>> +#define VTD_INV_DESC_PASIDC_DID(x)      extract64((x)->val[0], 16,
>16)
>> +#define VTD_INV_DESC_PASIDC_PASID(x)    extract64((x)->val[0], 32,
>20)
>> +#define VTD_INV_DESC_PASIDC_RSVD_VAL0   0xfff000000000f1c0ULL
>> +
>>   /* Information about page-selective IOTLB invalidate */
>>   struct VTDIOTLBPageInvInfo {
>>       uint16_t domain_id;
>> @@ -553,6 +563,21 @@ typedef struct VTDRootEntry VTDRootEntry;
>>   #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw)  (0x1e0ULL |
>~VTD_HAW_MASK(aw))
>>   #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1
>0xffffffffffe00000ULL
>>
>> +typedef enum VTDPCInvType {
>> +    /* VTD spec defined PASID cache invalidation type */
>> +    VTD_PASID_CACHE_DOMSI = VTD_INV_DESC_PASIDC_G_DSI,
>> +    VTD_PASID_CACHE_PASIDSI = VTD_INV_DESC_PASIDC_G_PASID_SI,
>> +    VTD_PASID_CACHE_GLOBAL_INV =
>VTD_INV_DESC_PASIDC_G_GLOBAL,
>> +} VTDPCInvType;
>> +
>> +typedef struct VTDPASIDCacheInfo {
>> +    VTDPCInvType type;
>> +    uint16_t did;
>> +    uint32_t pasid;
>> +    PCIBus *bus;
>> +    uint16_t devfn;
>> +} VTDPASIDCacheInfo;
>> +
>>   /* PASID Table Related Definitions */
>>   #define VTD_PASID_DIR_BASE_ADDR_MASK  (~0xfffULL)
>>   #define VTD_PASID_TABLE_BASE_ADDR_MASK (~0xfffULL)
>> @@ -574,7 +599,7 @@ typedef struct VTDRootEntry VTDRootEntry;
>>   #define VTD_SM_PASID_ENTRY_PT          (4ULL << 6)
>>
>>   #define VTD_SM_PASID_ENTRY_AW          7ULL /* Adjusted
>guest-address-width */
>> -#define VTD_SM_PASID_ENTRY_DID(val)    ((val) &
>VTD_DOMAIN_ID_MASK)
>> +#define VTD_SM_PASID_ENTRY_DID(x)      extract64((x)->val[1], 0, 16)
>>
>>   #define VTD_SM_PASID_ENTRY_FLPM          3ULL
>>   #define VTD_SM_PASID_ENTRY_FLPTPTR       (~0xfffULL)
>> diff --git a/include/hw/i386/intel_iommu.h
>b/include/hw/i386/intel_iommu.h
>> index 50f9b27a45..0e3826f6f0 100644
>> --- a/include/hw/i386/intel_iommu.h
>> +++ b/include/hw/i386/intel_iommu.h
>> @@ -95,6 +95,11 @@ struct VTDPASIDEntry {
>>       uint64_t val[8];
>>   };
>>
>> +typedef struct VTDPASIDCacheEntry {
>> +    struct VTDPASIDEntry pasid_entry;
>> +    bool valid;
>> +} VTDPASIDCacheEntry;
>> +
>>   struct VTDAddressSpace {
>>       PCIBus *bus;
>>       uint8_t devfn;
>> @@ -107,6 +112,7 @@ struct VTDAddressSpace {
>>       MemoryRegion iommu_ir_fault; /* Interrupt region for catching
>fault */
>>       IntelIOMMUState *iommu_state;
>>       VTDContextCacheEntry context_cache_entry;
>> +    VTDPASIDCacheEntry pasid_cache_entry;
>>       QLIST_ENTRY(VTDAddressSpace) next;
>>       /* Superset of notifier flags that this address space has */
>>       IOMMUNotifierFlag notifier_flags;
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index 1801f1cdf6..a2ee6d684e 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -1675,7 +1675,7 @@ static uint16_t
>vtd_get_domain_id(IntelIOMMUState *s,
>>
>>       if (s->root_scalable) {
>>           vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
>> -        return VTD_SM_PASID_ENTRY_DID(pe.val[1]);
>> +        return VTD_SM_PASID_ENTRY_DID(&pe);
>>       }
>>
>>       return VTD_CONTEXT_ENTRY_DID(ce->hi);
>> @@ -3112,6 +3112,183 @@ static bool
>vtd_process_piotlb_desc(IntelIOMMUState *s,
>>       return true;
>>   }
>>
>> +static inline int vtd_dev_get_pe_from_pasid(VTDAddressSpace *vtd_as,
>> +                                            uint32_t pasid,
>VTDPASIDEntry *pe)
>> +{
>> +    IntelIOMMUState *s = vtd_as->iommu_state;
>> +    VTDContextEntry ce;
>> +    int ret;
>> +
>> +    if (!s->root_scalable) {
>> +        return -VTD_FR_RTADDR_INV_TTM;
>> +    }
>> +
>> +    ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
>vtd_as->devfn,
>> +                                   &ce);
>> +    if (ret) {
>> +        return ret;
>> +    }
>> +
>> +    return vtd_ce_get_pasid_entry(s, &ce, pe, pasid);
>> +}
>> +
>> +static bool vtd_pasid_entry_compare(VTDPASIDEntry *p1, VTDPASIDEntry
>*p2)
>> +{
>> +    return !memcmp(p1, p2, sizeof(*p1));
>> +}
>> +
>> +/*
>> + * This function is a loop function which return value determines if
>> + * vtd_as including cached pasid entry is removed.
>> + *
>> + * For PCI_NO_PASID, when corresponding cached pasid entry is cleared,
>> + * it returns false so that vtd_as is reserved as it's owned by PCI
>> + * sub-system. For other pasid, it returns true so vtd_as is removed.
>
>also, this helper will always return true if this series does not
>support non-rid_pasid PASID.

Do you mean return false? I don't think it will return true.
For non-rid_pasid, it may return false.

Thanks
Zhenzhong


* RE: [PATCH v5 02/21] hw/pci: Introduce pci_device_get_viommu_cap()
  2025-09-01  2:59                       ` Nicolin Chen
@ 2025-09-01  3:31                         ` Duan, Zhenzhong
  0 siblings, 0 replies; 113+ messages in thread
From: Duan, Zhenzhong @ 2025-09-01  3:31 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Liu, Yi L, Eric Auger, qemu-devel@nongnu.org,
	alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
	jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
	jgg@nvidia.com, joao.m.martins@oracle.com,
	clement.mathieu--drif@eviden.com, Tian,  Kevin, Peng, Chao P



>-----Original Message-----
>From: Nicolin Chen <nicolinc@nvidia.com>
>Subject: Re: [PATCH v5 02/21] hw/pci: Introduce
>pci_device_get_viommu_cap()
>
>On Mon, Sep 01, 2025 at 02:35:29AM +0000, Duan, Zhenzhong wrote:
>> >> I just noticed this change will conflict with your suggestion of using
>> >HW_NESTED terminology.
>> >> Let me know if you agree with this change or not?
>> >
>> >It wouldn't necessarily conflict. VIOMMU_FLAG_WANT_NESTING_PARENT
>> >is a request, interchangeable with
>VIOMMU_FLAG_SUPPORT_HW_NESTED,
>> >i.e. a cap.
>> >
>> >At the end of the day, they are fundamentally the same thing that
>> >is to tell the core to allocate a nesting parent HWPT. The former
>> >one is just more straightforward, avoiding confusing terms such as
>> >"stage-1" and "nested".
>> >
>> >IMHO, you wouldn't even need the comments in the other thread, as
>> >the flag explains clearly what it wants and what the core is doing.
>> >
>> >Also, once you use the "want" one, the "HW_NESTED" terminology will
>> >not exist in the code.
>>
>> OK, will use the *_flags and _WANT_* style, do you have suggestions
>> for the name of vfio_device_viommu_get_nested() since "HW_NESTED"
>> terminology will not exist, what about
>vfio_device_get_viommu_flags_W_N_P()?
>
>I don't see it very necessary to have a specific API per flag. So,
>it could be just:
>
>    uint64_t viommu_flags = vfio_device_get_viommu_flags(vbasedev);
>
>    if (viommu_flags & VIOMMU_FLAG_WANT_NEST_PARENT) {
>        flags |= IOMMU_HWPT_ALLOC_NEST_PARENT;
>    }
>?

OK, makes sense.
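
A minimal sketch of the single-getter shape suggested above. The struct,
the DEMO_* bit values and the demo_* function names are placeholders for
illustration only; they are not the QEMU or iommufd definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder bit values; the real flag definitions live elsewhere. */
#define DEMO_VIOMMU_FLAG_WANT_NEST_PARENT  (1ULL << 0)
#define DEMO_HWPT_ALLOC_NEST_PARENT        (1U << 0)

/* Hypothetical stand-in for the device handle. */
typedef struct DemoDevice {
    uint64_t viommu_flags;   /* what the vIOMMU asks the core to do */
} DemoDevice;

/* One generic getter instead of one API per flag. */
static uint64_t demo_get_viommu_flags(const DemoDevice *dev)
{
    return dev ? dev->viommu_flags : 0;
}

/* Translate vIOMMU requests into HWPT allocation flags. */
static uint32_t demo_hwpt_alloc_flags(const DemoDevice *dev)
{
    uint32_t flags = 0;

    if (demo_get_viommu_flags(dev) & DEMO_VIOMMU_FLAG_WANT_NEST_PARENT) {
        flags |= DEMO_HWPT_ALLOC_NEST_PARENT;
    }
    return flags;
}
```

With this shape, a later "want" flag only needs a new bit and one more
test in the caller, not a new accessor.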

Thanks
Zhenzhong



* RE: [PATCH v5 09/21] intel_iommu: Fail passthrough device under PCI bridge if x-flts=on
  2025-08-28 10:33   ` Yi Liu
@ 2025-09-01  5:14     ` Duan, Zhenzhong
  0 siblings, 0 replies; 113+ messages in thread
From: Duan, Zhenzhong @ 2025-09-01  5:14 UTC (permalink / raw)
  To: Liu, Yi L, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, eric.auger@redhat.com,
	mst@redhat.com, jasowang@redhat.com, peterx@redhat.com,
	ddutile@redhat.com, jgg@nvidia.com, nicolinc@nvidia.com,
	joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
	Tian, Kevin, Peng, Chao P



>-----Original Message-----
>From: Liu, Yi L <yi.l.liu@intel.com>
>Subject: Re: [PATCH v5 09/21] intel_iommu: Fail passthrough device under PCI
>bridge if x-flts=on
>
>On 2025/8/22 14:40, Zhenzhong Duan wrote:
>> Currently we don't support nested translation for passthrough device with
>> emulated device under same PCI bridge, because they require different
>address
>> space when x-flts=on.
>>
>> In theory, we do support if devices under same PCI bridge are all
>passthrough
>> devices. But emulated device can be hotplugged under same bridge. To
>simplify,
>> just forbid passthrough device under PCI bridge no matter if there is, or will
>> be emulated devices under same bridge. This is acceptable because PCIE
>bridge
>> is more popular than PCI bridge now.
>>
>> Suggested-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> Reviewed-by: Eric Auger <eric.auger@redhat.com>
>> ---
>>   hw/i386/intel_iommu.c | 13 +++++++++++--
>>   1 file changed, 11 insertions(+), 2 deletions(-)
>>
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index da355bda79..6edd91d94e 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -4341,9 +4341,10 @@ VTDAddressSpace
>*vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus,
>>       return vtd_dev_as;
>>   }
>>
>> -static bool vtd_check_hiod(IntelIOMMUState *s, HostIOMMUDevice
>*hiod,
>> +static bool vtd_check_hiod(IntelIOMMUState *s, VTDHostIOMMUDevice
>*vtd_hiod,
>>                              Error **errp)
>>   {
>> +    HostIOMMUDevice *hiod = vtd_hiod->hiod;
>>       HostIOMMUDeviceClass *hiodc =
>HOST_IOMMU_DEVICE_GET_CLASS(hiod);
>>       int ret;
>>
>> @@ -4370,6 +4371,8 @@ static bool vtd_check_hiod(IntelIOMMUState *s,
>HostIOMMUDevice *hiod,
>>   #ifdef CONFIG_IOMMUFD
>>       struct HostIOMMUDeviceCaps *caps = &hiod->caps;
>>       struct iommu_hw_info_vtd *vtd = &caps->vendor_caps.vtd;
>> +    PCIBus *bus = vtd_hiod->bus;
>> +    PCIDevice *pdev = pci_find_device(bus, pci_bus_num(bus),
>vtd_hiod->devfn);
>
>pci_find_device() finds bus pointer with bus_num, this can be avoided
>as you already have bus pointer. Perhaps this may be done by wrapping
>bus->devices[devfn] to a helper. Especially, pci_bus_num() may not have
>the correct bus number at this point.

Indeed, will do. Just pdev = bus->devices[devfn] should work.
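
A rough sketch of the helper idea, using mock types. DemoPCIBus,
DemoPCIDevice and demo_bus_get_device() are hypothetical stand-ins, not
the QEMU structures; the point is only that the lookup indexes the bus
it already holds and never goes through a bus number.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_DEVFN_MAX 256   /* 32 slots x 8 functions */

/* Mock device and bus, for illustration only. */
typedef struct DemoPCIDevice {
    int id;
} DemoPCIDevice;

typedef struct DemoPCIBus {
    DemoPCIDevice *devices[DEMO_DEVFN_MAX];
} DemoPCIBus;

/*
 * Look the device up directly on the bus pointer. No bus number is
 * involved, so it does not matter whether the guest has programmed
 * bus numbers yet.
 */
static DemoPCIDevice *demo_bus_get_device(DemoPCIBus *bus, uint8_t devfn)
{
    return bus ? bus->devices[devfn] : NULL;
}
```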

Thanks
Zhenzhong

>
>>
>>       /* Remaining checks are all stage-1 translation specific */
>>       if (!object_dynamic_cast(OBJECT(hiod),
>TYPE_HOST_IOMMU_DEVICE_IOMMUFD)) {
>> @@ -4392,6 +4395,12 @@ static bool vtd_check_hiod(IntelIOMMUState *s,
>HostIOMMUDevice *hiod,
>>           error_setg(errp, "Stage-1 1GB huge page is unsupported by
>host IOMMU");
>>           return false;
>>       }
>> +
>> +    if (pci_device_get_iommu_bus_devfn(pdev, &bus, NULL, NULL)) {
>> +        error_setg(errp, "Host device under PCI bridge is unsupported "
>> +                   "when x-flts=on");
>> +        return false;
>> +    }
>>   #endif
>>
>>       error_setg(errp, "host IOMMU is incompatible with stage-1
>translation");
>> @@ -4425,7 +4434,7 @@ static bool vtd_dev_set_iommu_device(PCIBus
>*bus, void *opaque, int devfn,
>>       vtd_hiod->iommu_state = s;
>>       vtd_hiod->hiod = hiod;
>>
>> -    if (!vtd_check_hiod(s, hiod, errp)) {
>> +    if (!vtd_check_hiod(s, vtd_hiod, errp)) {
>>           g_free(vtd_hiod);
>>           vtd_iommu_unlock(s);
>>           return false;
>
>other parts looks good to me.
>
>Reviewed-by: Yi Liu <yi.l.liu@intel.com>


* RE: [PATCH v5 10/21] intel_iommu: Introduce two helpers vtd_as_from/to_iommu_pasid_locked
  2025-08-28 11:36   ` Yi Liu
@ 2025-09-01  5:33     ` Duan, Zhenzhong
  2025-09-03  6:30       ` Yi Liu
  0 siblings, 1 reply; 113+ messages in thread
From: Duan, Zhenzhong @ 2025-09-01  5:33 UTC (permalink / raw)
  To: Liu, Yi L, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, eric.auger@redhat.com,
	mst@redhat.com, jasowang@redhat.com, peterx@redhat.com,
	ddutile@redhat.com, jgg@nvidia.com, nicolinc@nvidia.com,
	joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
	Tian, Kevin, Peng, Chao P



>-----Original Message-----
>From: Liu, Yi L <yi.l.liu@intel.com>
>Subject: Re: [PATCH v5 10/21] intel_iommu: Introduce two helpers
>vtd_as_from/to_iommu_pasid_locked
>
>On 2025/8/22 14:40, Zhenzhong Duan wrote:
>> PCI device supports two request types, Requests-without-PASID and
>> Requests-with-PASID. Requests-without-PASID doesn't include a PASID TLP
>> prefix, IOMMU fetches rid_pasid from context entry and use it as IOMMU's
>> pasid to index pasid table.
>>
>> So we need to translate between PCI's pasid and IOMMU's pasid specially
>> for Requests-without-PASID, e.g., PCI_NO_PASID(-1) <-> rid_pasid.
>> For Requests-with-PASID, PCI's pasid and IOMMU's pasid are same value.
>>
>> vtd_as_from_iommu_pasid_locked() translates from BDF+iommu_pasid to
>vtd_as
>> which contains PCI's pasid vtd_as->pasid.
>>
>> vtd_as_to_iommu_pasid_locked() translates from BDF+vtd_as->pasid to
>iommu_pasid.
>
>translate is somehow strange. convert or get might be better? Same to
>the translate terms in the patch.

OK, will use 'convert' terminology.

>
>>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> Reviewed-by: Eric Auger <eric.auger@redhat.com>
>> ---
>>   hw/i386/intel_iommu.c | 58
>+++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 58 insertions(+)
>>
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index 6edd91d94e..1801f1cdf6 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -1602,6 +1602,64 @@ static int
>vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
>>       return 0;
>>   }
>>
>> +static int vtd_as_to_iommu_pasid_locked(VTDAddressSpace *vtd_as,
>> +                                        uint32_t *pasid)
>> +{
>> +    VTDContextCacheEntry *cc_entry = &vtd_as->context_cache_entry;
>> +    IntelIOMMUState *s = vtd_as->iommu_state;
>> +    uint8_t bus_num = pci_bus_num(vtd_as->bus);
>> +    uint8_t devfn = vtd_as->devfn;
>> +    VTDContextEntry ce;
>> +    int ret;
>> +
>> +    /* For Requests-with-PASID, its pasid value is used by vIOMMU
>directly */
>> +    if (vtd_as->pasid != PCI_NO_PASID) {
>> +        *pasid = vtd_as->pasid;
>> +        return 0;
>> +    }
>> +
>> +    if (cc_entry->context_cache_gen == s->context_cache_gen) {
>> +        ce = cc_entry->context_entry;
>> +    } else {
>> +        ret = vtd_dev_to_context_entry(s, bus_num, devfn, &ce);
>> +        if (ret) {
>> +            return ret;
>> +        }
>> +    }
>> +    *pasid = VTD_CE_GET_RID2PASID(&ce);
>
>looks like we have quite a few code get rid_pasid from the context
>entry. I think we may simplify it by using PASID #0 since vIOMMU does
>not report ECAP.RPS bit at all. It could be done as a separate cleanup.

Yes, but we already have all the code supporting the RPS capability, though
RPS isn't enabled in the ECAP register. In theory we can enable RPS easily by
setting the bit in the ECAP register. So I would like to stay consistent with
this instead of dropping all the existing code for the RPS capability.
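
To make the trade-off concrete, a minimal sketch of the conversion with
mock types. DemoContextEntry, DEMO_PCI_NO_PASID and
demo_to_iommu_pasid() are hypothetical names, not the patch's code.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PCI_NO_PASID  UINT32_MAX   /* stands in for PCI_NO_PASID (-1) */

/* Mock context entry holding only the field this sketch needs. */
typedef struct DemoContextEntry {
    uint32_t rid_pasid;   /* PASID used for Requests-without-PASID */
} DemoContextEntry;

/*
 * Convert the PCI-side pasid to the pasid the IOMMU uses to index the
 * PASID table. Requests-with-PASID use their pasid as is;
 * Requests-without-PASID use rid_pasid from the context entry.
 */
static uint32_t demo_to_iommu_pasid(uint32_t pci_pasid,
                                    const DemoContextEntry *ce)
{
    if (pci_pasid != DEMO_PCI_NO_PASID) {
        return pci_pasid;
    }
    return ce->rid_pasid;
}
```

While RPS is not reported, rid_pasid is 0 and this reduces to the
"PASID #0" shortcut; reading it from the context entry keeps the same
code valid if RPS is enabled later.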

Thanks
Zhenzhong

>
>> +    return 0;
>> +}
>> +
>> +static gboolean vtd_find_as_by_sid_and_iommu_pasid(gpointer key,
>gpointer value,
>> +                                                   gpointer
>user_data)
>> +{
>> +    VTDAddressSpace *vtd_as = (VTDAddressSpace *)value;
>> +    struct vtd_as_raw_key *target = (struct vtd_as_raw_key *)user_data;
>> +    uint16_t sid = PCI_BUILD_BDF(pci_bus_num(vtd_as->bus),
>vtd_as->devfn);
>> +    uint32_t pasid;
>> +
>> +    if (vtd_as_to_iommu_pasid_locked(vtd_as, &pasid)) {
>> +        return false;
>> +    }
>> +
>> +    return (pasid == target->pasid) && (sid == target->sid);
>> +}
>> +
>> +/* Translate iommu pasid to vtd_as */
>> +static inline
>> +VTDAddressSpace *vtd_as_from_iommu_pasid_locked(IntelIOMMUState
>*s,
>> +                                                uint16_t sid,
>uint32_t pasid)
>> +{
>> +    struct vtd_as_raw_key key = {
>> +        .sid = sid,
>> +        .pasid = pasid
>> +    };
>> +
>> +    return g_hash_table_find(s->vtd_address_spaces,
>> +                             vtd_find_as_by_sid_and_iommu_pasid,
>&key);
>> +}
>> +
>>   static int vtd_sync_shadow_page_hook(const IOMMUTLBEvent *event,
>>                                        void *private)
>>   {
>
>the code looks good to me.
>
>Reviewed-by: Yi Liu <yi.l.liu@intel.com>


* RE: [PATCH v5 16/21] intel_iommu: Replay pasid bindings after context cache invalidation
  2025-08-29  7:35     ` Yi Liu
@ 2025-09-01  8:11       ` Duan, Zhenzhong
  2025-09-03 10:18         ` Yi Liu
  0 siblings, 1 reply; 113+ messages in thread
From: Duan, Zhenzhong @ 2025-09-01  8:11 UTC (permalink / raw)
  To: Liu, Yi L, eric.auger@redhat.com, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
	jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
	jgg@nvidia.com, nicolinc@nvidia.com, joao.m.martins@oracle.com,
	clement.mathieu--drif@eviden.com, Tian, Kevin, Peng, Chao P,
	Yi Sun



>-----Original Message-----
>From: Liu, Yi L <yi.l.liu@intel.com>
>Subject: Re: [PATCH v5 16/21] intel_iommu: Replay pasid bindings after
>context cache invalidation
>
>On 2025/8/28 17:43, Eric Auger wrote:
>>
>>
>> On 8/22/25 8:40 AM, Zhenzhong Duan wrote:
>>> From: Yi Liu <yi.l.liu@intel.com>
>>>
>>> This replays guest pasid bindings after context cache invalidation.
>>> This is a behavior to ensure safety. Actually, programmer should issue
>>> pasid cache invalidation with proper granularity after issuing a context
>>> cache invalidation.
>> So is this mandated? If the spec mandates specific invalidations and the
>> guest does not comply with the expected invalidation sequence shall we
>> do that behind the curtain?
>
>I think this is following the below decision. We can discuss if it's
>really needed to replay the pasid bind.
>
>dd4d607e40d (Peter Xu 2017-04-07 18:59:15 +0800 2321)     /*
>dd4d607e40d (Peter Xu 2017-04-07 18:59:15 +0800 2322)      * From VT-d spec 6.5.2.1, a global context entry invalidation
>dd4d607e40d (Peter Xu 2017-04-07 18:59:15 +0800 2323)      * should be followed by a IOTLB global invalidation, so we should
>dd4d607e40d (Peter Xu 2017-04-07 18:59:15 +0800 2324)      * be safe even without this. Hoewever, let's replay the region as
>dd4d607e40d (Peter Xu 2017-04-07 18:59:15 +0800 2325)      * well to be safer, and go back here when we need finer tunes for
>dd4d607e40d (Peter Xu 2017-04-07 18:59:15 +0800 2326)      * VT-d emulation codes.
>dd4d607e40d (Peter Xu 2017-04-07 18:59:15 +0800 2327)      */
>dd4d607e40d (Peter Xu 2017-04-07 18:59:15 +0800 2328)     vtd_iommu_replay_all(s);

I have tested this series with this patch reverted, and it works with a Linux guest kernel.

Personally, I am inclined to stop adding workarounds for guest kernel bugs; there will be more and more of them over time, and they make the code unnecessarily complex. @Eric, @Liu, Yi L, your thoughts?

Thanks
Zhenzhong


* RE: [PATCH v5 17/21] intel_iommu: Propagate PASID-based iotlb invalidation to host
  2025-08-28 10:00   ` Eric Auger
  2025-08-28 12:11     ` Yi Liu
@ 2025-09-01  8:32     ` Duan, Zhenzhong
  1 sibling, 0 replies; 113+ messages in thread
From: Duan, Zhenzhong @ 2025-09-01  8:32 UTC (permalink / raw)
  To: eric.auger@redhat.com, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
	jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
	jgg@nvidia.com, nicolinc@nvidia.com, joao.m.martins@oracle.com,
	clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
	Peng, Chao P, Yi Sun



>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: Re: [PATCH v5 17/21] intel_iommu: Propagate PASID-based iotlb
>invalidation to host
>
>
>
>On 8/22/25 8:40 AM, Zhenzhong Duan wrote:
>> From: Yi Liu <yi.l.liu@intel.com>
>>
>> This traps the guest PASID-based iotlb invalidation request and propagate it
>> to host.
>>
>> Intel VT-d 3.0 supports nested translation in PASID granularity. Guest SVA
>> support could be implemented by configuring nested translation on specific
>> pasid. This is also known as dual stage DMA translation.
>>
>> Under such configuration, guest owns the GVA->GPA translation which is
>> configured as stage-1 page table on host side for a specific pasid, and host
>> owns GPA->HPA translation. As guest owns stage-1 translation table, piotlb
>> invalidation should be propagated to host since host IOMMU will cache first
>> level page table related mappings during DMA address translation.
>>
>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>>  hw/i386/intel_iommu_internal.h |  6 +++
>>  hw/i386/intel_iommu.c          | 95
>+++++++++++++++++++++++++++++++++-
>>  2 files changed, 99 insertions(+), 2 deletions(-)
>>
>> diff --git a/hw/i386/intel_iommu_internal.h
>b/hw/i386/intel_iommu_internal.h
>> index 8af1004888..c1a9263651 100644
>> --- a/hw/i386/intel_iommu_internal.h
>> +++ b/hw/i386/intel_iommu_internal.h
>> @@ -596,6 +596,12 @@ typedef struct VTDPASIDCacheInfo {
>>      uint16_t devfn;
>>  } VTDPASIDCacheInfo;
>>
>> +typedef struct VTDPIOTLBInvInfo {
>> +    uint16_t domain_id;
>> +    uint32_t pasid;
>> +    struct iommu_hwpt_vtd_s1_invalidate *inv_data;
>> +} VTDPIOTLBInvInfo;
>> +
>>  /* PASID Table Related Definitions */
>>  #define VTD_PASID_DIR_BASE_ADDR_MASK  (~0xfffULL)
>>  #define VTD_PASID_TABLE_BASE_ADDR_MASK (~0xfffULL)
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index 6c0e502d1c..7efa22f4ec 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -2611,12 +2611,99 @@ static int
>vtd_bind_guest_pasid(VTDAddressSpace *vtd_as, VTDPASIDOp op,
>>
>>      return ret;
>>  }
>> +
>> +static void
>> +vtd_invalidate_piotlb_locked(VTDAddressSpace *vtd_as,
>> +                             struct iommu_hwpt_vtd_s1_invalidate
>*cache)
>> +{
>> +    IntelIOMMUState *s = vtd_as->iommu_state;
>> +    VTDHostIOMMUDevice *vtd_hiod = vtd_find_hiod_iommufd(s,
>vtd_as);
>> +    HostIOMMUDeviceIOMMUFD *idev;
>> +    uint32_t entry_num = 1; /* Only implement one request for simplicity
>*/
>can you remind me what it is used for. What 1?

I see Yi has answered this question.

>> +    Error *local_err = NULL;
>> +
>> +    if (!vtd_hiod || !vtd_as->s1_hwpt) {
>> +        return;
>> +    }
>> +    idev = HOST_IOMMU_DEVICE_IOMMUFD(vtd_hiod->hiod);
>> +
>> +    if (!iommufd_backend_invalidate_cache(idev->iommufd,
>vtd_as->s1_hwpt,
>> +
>IOMMU_HWPT_INVALIDATE_DATA_VTD_S1,
>> +                                          sizeof(*cache),
>&entry_num, cache,
>> +                                          &local_err)) {
>> +        /* Something wrong in kernel, but trying to continue */
>> +        error_report_err(local_err);
>> +    }
>> +}
>> +
>> +/*
>> + * This function is a loop function for the s->vtd_address_spaces
>> + * list with VTDPIOTLBInvInfo as execution filter. It propagates
>> + * the piotlb invalidation to host.
>> + */
>> +static void vtd_flush_host_piotlb_locked(gpointer key, gpointer value,
>> +                                         gpointer user_data)
>> +{
>> +    VTDPIOTLBInvInfo *piotlb_info = user_data;
>> +    VTDAddressSpace *vtd_as = value;
>> +    VTDPASIDCacheEntry *pc_entry = &vtd_as->pasid_cache_entry;
>> +    uint32_t pasid;
>> +    uint16_t did;
>> +
>> +    /* Replay only fills pasid entry cache for passthrough device */
>> +    if (!pc_entry->valid ||
>> +        !vtd_pe_pgtt_is_flt(&pc_entry->pasid_entry)) {
>> +        return;
>> +    }
>> +
>> +    if (vtd_as_to_iommu_pasid_locked(vtd_as, &pasid)) {
>> +        return;
>> +    }
>> +
>> +    did = VTD_SM_PASID_ENTRY_DID(&pc_entry->pasid_entry);
>> +
>> +    if (piotlb_info->domain_id == did && piotlb_info->pasid == pasid) {
>> +        vtd_invalidate_piotlb_locked(vtd_as, piotlb_info->inv_data);
>> +    }
>> +}
>> +
>> +static void
>> +vtd_flush_host_piotlb_all_locked(IntelIOMMUState *s,
>> +                                 uint16_t domain_id, uint32_t
>pasid,
>> +                                 hwaddr addr, uint64_t npages,
>bool ih)
>> +{
>> +    struct iommu_hwpt_vtd_s1_invalidate cache_info = { 0 };
>> +    VTDPIOTLBInvInfo piotlb_info;
>> +
>> +    cache_info.addr = addr;
>> +    cache_info.npages = npages;
>> +    cache_info.flags = ih ? IOMMU_VTD_INV_FLAGS_LEAF : 0;
>> +
>> +    piotlb_info.domain_id = domain_id;
>> +    piotlb_info.pasid = pasid;
>> +    piotlb_info.inv_data = &cache_info;
>> +
>> +    /*
>> +     * Go through each vtd_as instance in s->vtd_address_spaces, find
>out
>> +     * the affected host device which need host piotlb invalidation. Piotlb
>Are you likely to find several vts_as that match invalidation params?

This is possible; it depends on the guest kernel implementation. There can be N devices
attached to one domain in the guest, in which case QEMU creates N nested HWPTs and attaches them to the N devices on the host side.

>> +     * invalidation should check pasid cache per architecture point of
>view.
>> +     */
>> +    g_hash_table_foreach(s->vtd_address_spaces,
>> +                         vtd_flush_host_piotlb_locked,
>&piotlb_info);
>> +}
>>  #else
>>  static int vtd_bind_guest_pasid(VTDAddressSpace *vtd_as, VTDPASIDOp
>op,
>>                                  Error **errp)
>>  {
>>      return 0;
>>  }
>> +
>> +static void
>> +vtd_flush_host_piotlb_all_locked(IntelIOMMUState *s,
>> +                                 uint16_t domain_id, uint32_t
>pasid,
>> +                                 hwaddr addr, uint64_t npages,
>bool ih)
>> +{
>> +}
>>  #endif
>Can't you put those stub stuff in a specific header as it is usually done?

That's usually true for public functions, but vtd_flush_host_piotlb_all_locked() is a static function. Do we really want to put it in a header and expose it to other C files unnecessarily?
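
A minimal sketch of the in-file pattern being defended.
DEMO_HAVE_BACKEND is a hypothetical switch standing in for
CONFIG_IOMMUFD, and the demo_* functions are placeholders.

```c
#include <assert.h>
#include <stdint.h>

#ifdef DEMO_HAVE_BACKEND
/* Real variant: would forward the invalidation to the host. */
static int demo_flush_host(uint16_t domain_id, uint32_t pasid)
{
    (void)domain_id;
    (void)pasid;
    return 1;
}
#else
/* Stub variant: same static signature, does nothing. */
static int demo_flush_host(uint16_t domain_id, uint32_t pasid)
{
    (void)domain_id;
    (void)pasid;
    return 0;
}
#endif

/* The caller is identical in both builds and needs no #ifdef. */
static int demo_process_desc(uint16_t domain_id, uint32_t pasid)
{
    return demo_flush_host(domain_id, pasid);
}
```

Both variants stay static, so nothing leaks outside the translation
unit, which a header-based stub could not guarantee.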

Thanks
Zhenzhong


* RE: [PATCH v5 12/21] intel_iommu: Handle PASID entry addition
  2025-08-27 16:22   ` Eric Auger
@ 2025-09-01  9:03     ` Duan, Zhenzhong
  2025-09-03  8:52       ` Yi Liu
  0 siblings, 1 reply; 113+ messages in thread
From: Duan, Zhenzhong @ 2025-09-01  9:03 UTC (permalink / raw)
  To: eric.auger@redhat.com, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
	jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
	jgg@nvidia.com, nicolinc@nvidia.com, joao.m.martins@oracle.com,
	clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
	Peng, Chao P, Yi Sun

Hi Eric,

>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: Re: [PATCH v5 12/21] intel_iommu: Handle PASID entry addition
>
>Hi Zhenzhong,
>
>On 8/22/25 8:40 AM, Zhenzhong Duan wrote:
>> When guest creates new PASID entries, QEMU will capture the guest pasid
>> selective pasid cache invalidation, walk through each passthrough device
>> and each pasid, when a match is found, identify an existing vtd_as or
>> create a new one and update its corresponding cached pasid entry.
>
>You need to emphasize that the support is currently limited to
>Requests-without-PASID

OK.

>>
>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>>  hw/i386/intel_iommu_internal.h |   2 +
>>  hw/i386/intel_iommu.c          | 176
>++++++++++++++++++++++++++++++++-
>>  2 files changed, 175 insertions(+), 3 deletions(-)
>>
>> diff --git a/hw/i386/intel_iommu_internal.h
>b/hw/i386/intel_iommu_internal.h
>> index b9b76dd996..fb2a919e87 100644
>> --- a/hw/i386/intel_iommu_internal.h
>> +++ b/hw/i386/intel_iommu_internal.h
>> @@ -559,6 +559,7 @@ typedef struct VTDRootEntry VTDRootEntry;
>>  #define VTD_CTX_ENTRY_LEGACY_SIZE     16
>>  #define VTD_CTX_ENTRY_SCALABLE_SIZE   32
>>
>> +#define VTD_SM_CONTEXT_ENTRY_PDTS(x)        extract64((x)->val[0],
>9, 3)
>>  #define VTD_SM_CONTEXT_ENTRY_RID2PASID_MASK 0xfffff
>>  #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw)  (0x1e0ULL |
>~VTD_HAW_MASK(aw))
>>  #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1
>0xffffffffffe00000ULL
>> @@ -589,6 +590,7 @@ typedef struct VTDPASIDCacheInfo {
>>  #define VTD_PASID_TABLE_BITS_MASK     (0x3fULL)
>>  #define VTD_PASID_TABLE_INDEX(pasid)  ((pasid) &
>VTD_PASID_TABLE_BITS_MASK)
>>  #define VTD_PASID_ENTRY_FPD           (1ULL << 1) /* Fault
>Processing Disable */
>> +#define VTD_PASID_TBL_ENTRY_NUM       (1ULL << 6)
>>
>>  /* PASID Granular Translation Type Mask */
>>  #define VTD_PASID_ENTRY_P              1ULL
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index a2ee6d684e..7d2c9feae7 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -826,6 +826,11 @@ static inline bool
>vtd_pe_type_check(IntelIOMMUState *s, VTDPASIDEntry *pe)
>>      }
>>  }
>>
>> +static inline uint32_t vtd_sm_ce_get_pdt_entry_num(VTDContextEntry
>*ce)
>> +{
>> +    return 1U << (VTD_SM_CONTEXT_ENTRY_PDTS(ce) + 7);
>> +}
>> +
>>  static inline bool vtd_pdire_present(VTDPASIDDirEntry *pdire)
>>  {
>>      return pdire->val & 1;
>> @@ -1647,9 +1652,9 @@ static gboolean
>vtd_find_as_by_sid_and_iommu_pasid(gpointer key, gpointer value,
>>  }
>>
>>  /* Translate iommu pasid to vtd_as */
>> -static inline
>> -VTDAddressSpace *vtd_as_from_iommu_pasid_locked(IntelIOMMUState
>*s,
>> -                                                uint16_t sid,
>uint32_t pasid)
>> +static VTDAddressSpace
>*vtd_as_from_iommu_pasid_locked(IntelIOMMUState *s,
>> +
>uint16_t sid,
>> +
>uint32_t pasid)
>>  {
>>      struct vtd_as_raw_key key = {
>>          .sid = sid,
>> @@ -3220,10 +3225,172 @@ remove:
>>      return true;
>>  }
>>
>> +/*
>> + * This function walks over PASID range within [start, end) in a single
>> + * PASID table for entries matching @info type/did, then retrieve/create
>> + * vtd_as and fill associated pasid entry cache.
>> + */
>> +static void vtd_sm_pasid_table_walk_one(IntelIOMMUState *s,
>> +                                        dma_addr_t pt_base,
>> +                                        int start,
>> +                                        int end,
>> +                                        VTDPASIDCacheInfo
>*info)
>> +{
>> +    VTDPASIDEntry pe;
>> +    int pasid = start;
>> +
>> +    while (pasid < end) {
>> +        if (!vtd_get_pe_in_pasid_leaf_table(s, pasid, pt_base, &pe)
>> +            && vtd_pe_present(&pe)) {
>> +            int bus_n = pci_bus_num(info->bus), devfn = info->devfn;
>> +            uint16_t sid = PCI_BUILD_BDF(bus_n, devfn);
>> +            VTDPASIDCacheEntry *pc_entry;
>> +            VTDAddressSpace *vtd_as;
>> +
>> +            vtd_iommu_lock(s);
>> +            /*
>> +             * When indexed by rid2pasid, vtd_as should have been
>created,
>> +             * e.g., by PCI subsystem. For other iommu pasid, we need
>to
>> +             * create vtd_as dynamically. Other iommu pasid is same
>value
>since you don't support somthing else than rid2pasid, I would drop that
>and simplify the code. See below.
>> +             * as PCI's pasid, so it's used as input of vtd_find_add_as().
>> +             */
>> +            vtd_as = vtd_as_from_iommu_pasid_locked(s, sid, pasid);
>> +            vtd_iommu_unlock(s);
>> +            if (!vtd_as) {
>> +                vtd_as = vtd_find_add_as(s, info->bus, devfn, pasid);
>you could check the vtd_as already exists here per the rid2pasid support
>limitation

In this series, I do include some basic code for non-rid2pasid because it shares common code with rid2pasid, and we have had emulated rid2pasid support in vIOMMU for a long time. It's not bad to accumulate some supporting code for non-rid2pasid for passthrough devices. But I can factor it out if you insist on having only rid_pasid code.
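
For reference, a sketch of the range arithmetic the walk in this patch
relies on, which is the same for rid2pasid and for other pasids. The
demo_* names are hypothetical; the constants mirror
VTD_PASID_TBL_ENTRY_NUM and the PDTS encoding from the diff.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PASID_TBL_ENTRY_NUM 64u   /* entries per leaf PASID table */

/* Number of PASID directory entries encoded by the 3-bit PDTS field. */
static uint32_t demo_pdt_entry_num(uint32_t pdts)
{
    return 1u << (pdts + 7);
}

/* Exclusive upper bound of the PASIDs a context entry can address. */
static uint32_t demo_max_pasid(uint32_t pdts)
{
    return demo_pdt_entry_num(pdts) * DEMO_PASID_TBL_ENTRY_NUM;
}

/*
 * End of the leaf-table chunk containing pasid, clamped to end. The
 * directory walk advances in steps of this size.
 */
static uint32_t demo_next_chunk(uint32_t pasid, uint32_t end)
{
    uint32_t next = (pasid + DEMO_PASID_TBL_ENTRY_NUM) &
                    ~(DEMO_PASID_TBL_ENTRY_NUM - 1);

    return next < end ? next : end;
}
```

With the scope fixed to [0, 1), the first chunk is clamped to 1, so the
walk touches a single directory entry and a single PASID.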

>> +            }
>> +
>> +            if ((info->type == VTD_PASID_CACHE_DOMSI ||
>> +                 info->type == VTD_PASID_CACHE_PASIDSI) &&
>> +                (info->did != VTD_SM_PASID_ENTRY_DID(&pe))) {
>> +                /*
>> +                 * VTD_PASID_CACHE_DOMSI and
>VTD_PASID_CACHE_PASIDSI
>> +                 * requires domain id check. If domain id check fail,
>fails
>> +                 * go to next pasid.
>> +                 */
>> +                pasid++;
>> +                continue;
>> +            }
>> +
>> +            pc_entry = &vtd_as->pasid_cache_entry;
>> +            /*
>> +             * pasid cache update and clear are handled in
>> +             * vtd_flush_pasid_locked(), only care new pasid entry
>here.
>> +             */
>> +            if (!pc_entry->valid) {
>> +                pc_entry->pasid_entry = pe;
>> +                pc_entry->valid = true;
>> +            }
>> +        }
>> +        pasid++;
>> +    }
>> +}
>> +
>> +/*
>> + * In VT-d scalable mode translation, PASID dir + PASID table is used.
>> + * This function aims at looping over a range of PASIDs in the given
>> + * two level table to identify the pasid config in guest.
>> + */
>> +static void vtd_sm_pasid_table_walk(IntelIOMMUState *s,
>> +                                    dma_addr_t pdt_base,
>> +                                    int start, int end,
>> +                                    VTDPASIDCacheInfo *info)
>> +{
>> +    VTDPASIDDirEntry pdire;
>> +    int pasid = start;
>> +    int pasid_next;
>> +    dma_addr_t pt_base;
>> +
>> +    while (pasid < end) {
>> +        pasid_next =
>> +             (pasid + VTD_PASID_TBL_ENTRY_NUM) &
>~(VTD_PASID_TBL_ENTRY_NUM - 1);
>> +        pasid_next = pasid_next < end ? pasid_next : end;
>> +
>> +        if (!vtd_get_pdire_from_pdir_table(pdt_base, pasid, &pdire)
>> +            && vtd_pdire_present(&pdire)) {
>> +            pt_base = pdire.val &
>VTD_PASID_TABLE_BASE_ADDR_MASK;
>> +            vtd_sm_pasid_table_walk_one(s, pt_base, pasid,
>pasid_next, info);
>> +        }
>> +        pasid = pasid_next;
>> +    }
>> +}
>> +
>> +static void vtd_replay_pasid_bind_for_dev(IntelIOMMUState *s,
>> +                                          int start, int end,
>> +                                          VTDPASIDCacheInfo
>*info)
>> +{
>> +    VTDContextEntry ce;
>> +
>> +    if (!vtd_dev_to_context_entry(s, pci_bus_num(info->bus),
>info->devfn,
>> +                                  &ce)) {
>> +        uint32_t max_pasid;
>> +
>> +        max_pasid = vtd_sm_ce_get_pdt_entry_num(&ce) *
>VTD_PASID_TBL_ENTRY_NUM;
>> +        if (end > max_pasid) {
>> +            end = max_pasid;
>> +        }
>> +        vtd_sm_pasid_table_walk(s,
>> +
>VTD_CE_GET_PASID_DIR_TABLE(&ce),
>> +                                start,
>> +                                end,
>> +                                info);
>> +    }
>> +}
>> +
>> +/*
>> + * This function replays the guest pasid bindings by walking the two level
>> + * guest PASID table. For each valid pasid entry, it finds or creates a
>> + * vtd_as and caches pasid entry in vtd_as.
>> + */
>> +static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
>> +                                            VTDPASIDCacheInfo
>*pc_info)
>> +{
>> +    /*
>> +     * Currently only Requests-without-PASID is supported, as vIOMMU
>doesn't
>> +     * support RPS(RID-PASID Support), pasid scope is fixed to [0, 1).
>> +     */
>> +    int start = 0, end = 1;
>> +    VTDHostIOMMUDevice *vtd_hiod;
>> +    VTDPASIDCacheInfo walk_info;
>> +    GHashTableIter as_it;
>> +
>> +    switch (pc_info->type) {
>> +    case VTD_PASID_CACHE_PASIDSI:
>> +        start = pc_info->pasid;
>> +        end = pc_info->pasid + 1;
>if you never replay a range, you could simplify the code for now because
>some code paths are not properly tested

OK. Instead of assigning the start and end variables, maybe just use an assert(!pc_info->pasid).

>> +       /* fall through */
>> +    case VTD_PASID_CACHE_DOMSI:
>Why can't we have other invalidation types along with request without
>PASID? It is not obvious to me at least why it couldn't be used by the
>guest. Would deserve a comment in the commit desc I think.

Other invalidation types are indeed used, just within pasid scope [0, 1). Because [start, end) is already initialized to [0, 1), there is nothing more to do here, so we just break.
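
To make the point concrete, here is a minimal standalone sketch (not the patch itself; pc_inv_type, pasid_scope and replay_scope() are made-up names for illustration) of how every invalidation type ends up with the same [0, 1) scope as long as only rid2pasid is supported. It assumes rid2pasid is PASID 0 and uses the assert proposed above for the PASID-selective case:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the VTD_PASID_CACHE_* types in the patch. */
enum pc_inv_type {
    PC_INV_DOMSI,
    PC_INV_PASIDSI,
    PC_INV_GLOBAL,
};

struct pasid_scope {
    uint32_t start; /* inclusive */
    uint32_t end;   /* exclusive */
};

/*
 * Compute the PASID range to replay. Returns false for an unknown type.
 * Assumption: only Requests-without-PASID are supported, so rid2pasid is 0.
 */
static bool replay_scope(enum pc_inv_type type, uint32_t pasid,
                         struct pasid_scope *scope)
{
    scope->start = 0;
    scope->end = 1;

    switch (type) {
    case PC_INV_PASIDSI:
        /* A PASID-selective invalidation can only target rid2pasid here. */
        assert(pasid == 0);
        return true;
    case PC_INV_DOMSI:
    case PC_INV_GLOBAL:
        /* Nothing to narrow down: the scope is already [0, 1). */
        return true;
    default:
        return false;
    }
}
```

All three spec-defined types return the same range; only the unknown-type case differs.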

Thanks
Zhenzhong

>> +    case VTD_PASID_CACHE_GLOBAL_INV:
>> +        /* loop all assigned devices */
>> +        break;
>> +    default:
>> +        error_setg(&error_fatal, "invalid pc_info->type for replay");
>> +    }
>> +
>> +    /*
>> +     * In this replay, one only needs to care about the devices which are
>> +     * backed by host IOMMU. Those devices have a corresponding
>vtd_hiod
>> +     * in s->vtd_host_iommu_dev. For devices not backed by host
>IOMMU, it
>> +     * is not necessary to replay the bindings since their cache could be
>> +     * re-created in the future DMA address translation.
>> +     *
>> +     * VTD translation callback never accesses vtd_hiod and its
>corresponding
>> +     * cached pasid entry, so no iommu lock needed here.
>> +     */
>> +    walk_info = *pc_info;
>> +    g_hash_table_iter_init(&as_it, s->vtd_host_iommu_dev);
>> +    while (g_hash_table_iter_next(&as_it, NULL, (void **)&vtd_hiod)) {
>> +        walk_info.bus = vtd_hiod->bus;
>> +        walk_info.devfn = vtd_hiod->devfn;
>> +        vtd_replay_pasid_bind_for_dev(s, start, end, &walk_info);
>> +    }
>> +}
>> +
>>  /*
>>   * For a PASID cache invalidation, this function handles below scenarios:
>>   * a) a present cached pasid entry needs to be removed
>>   * b) a present cached pasid entry needs to be updated
>> + * c) a present cached pasid entry needs to be created
>>   */
>>  static void vtd_pasid_cache_sync(IntelIOMMUState *s,
>VTDPASIDCacheInfo *pc_info)
>>  {
>> @@ -3239,6 +3406,9 @@ static void
>vtd_pasid_cache_sync(IntelIOMMUState *s, VTDPASIDCacheInfo *pc_info)
>>      g_hash_table_foreach_remove(s->vtd_address_spaces,
>vtd_flush_pasid_locked,
>>                                  pc_info);
>>      vtd_iommu_unlock(s);
>> +
>> +    /* c): loop all passthrough device for new pasid entries */
>> +    vtd_replay_guest_pasid_bindings(s, pc_info);
>>  }
>>
>>  static bool vtd_process_pasid_desc(IntelIOMMUState *s,
>Thanks
>
>Eric


^ permalink raw reply	[flat|nested] 113+ messages in thread

* RE: [PATCH v5 13/21] intel_iommu: Introduce a new pasid cache invalidation type FORCE_RESET
  2025-08-27 16:28   ` Eric Auger
  2025-08-29  5:56     ` Yi Liu
@ 2025-09-01  9:04     ` Duan, Zhenzhong
  1 sibling, 0 replies; 113+ messages in thread
From: Duan, Zhenzhong @ 2025-09-01  9:04 UTC (permalink / raw)
  To: eric.auger@redhat.com, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
	jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
	jgg@nvidia.com, nicolinc@nvidia.com, joao.m.martins@oracle.com,
	clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
	Peng, Chao P, Yi Sun



>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: Re: [PATCH v5 13/21] intel_iommu: Introduce a new pasid cache
>invalidation type FORCE_RESET
>
>Hi Zhenzhong,
>
>On 8/22/25 8:40 AM, Zhenzhong Duan wrote:
>> FORCE_RESET is different from GLOBAL_INV which updates pasid cache if
>> underlying pasid entry is still valid, it drops all the pasid caches.
>>
>> FORCE_RESET isn't a VTD spec defined invalidation type for pasid cache,
>> only used internally in system level reset.
>>
>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>>  hw/i386/intel_iommu_internal.h |  9 +++++++++
>>  hw/i386/intel_iommu.c          | 25 +++++++++++++++++++++++++
>>  hw/i386/trace-events           |  1 +
>>  3 files changed, 35 insertions(+)
>>
>> diff --git a/hw/i386/intel_iommu_internal.h
>b/hw/i386/intel_iommu_internal.h
>> index fb2a919e87..c510b09d1a 100644
>> --- a/hw/i386/intel_iommu_internal.h
>> +++ b/hw/i386/intel_iommu_internal.h
>> @@ -569,6 +569,15 @@ typedef enum VTDPCInvType {
>>      VTD_PASID_CACHE_DOMSI = VTD_INV_DESC_PASIDC_G_DSI,
>>      VTD_PASID_CACHE_PASIDSI = VTD_INV_DESC_PASIDC_G_PASID_SI,
>>      VTD_PASID_CACHE_GLOBAL_INV =
>VTD_INV_DESC_PASIDC_G_GLOBAL,
>> +
>> +    /*
>> +     * Internally used PASID cache invalidation type starts here,
>> +     * 0x10 is large enough as invalidation type in pc_inv_desc
>> +     * is 2bits in size.
>> +     */
>> +
>> +    /* Reset all PASID cache entries, used in system level reset */
>> +    VTD_PASID_CACHE_FORCE_RESET = 0x10,
>I am not very keen on adding such an artificial enum value that does not
>exist in the spec.
>
>Why not simply introduce another function (instead of
>vtd_flush_pasid_locked) that does the cleanup. To me it would be
>cleaner. Thanks Eric

Good suggestion, will do.

Thanks
Zhenzhong

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v5 04/21] vfio: Introduce helper vfio_pci_from_vfio_device()
  2025-08-22  6:40 ` [PATCH v5 04/21] vfio: Introduce helper vfio_pci_from_vfio_device() Zhenzhong Duan
                     ` (2 preceding siblings ...)
  2025-08-27 11:34   ` Eric Auger
@ 2025-09-01 16:36   ` Cédric Le Goater
  2025-09-02  2:12     ` Duan, Zhenzhong
  3 siblings, 1 reply; 113+ messages in thread
From: Cédric Le Goater @ 2025-09-01 16:36 UTC (permalink / raw)
  To: Zhenzhong Duan, qemu-devel
  Cc: alex.williamson, eric.auger, mst, jasowang, peterx, ddutile, jgg,
	nicolinc, joao.m.martins, clement.mathieu--drif, kevin.tian,
	yi.l.liu, chao.p.peng

Zhenzhong,

On 8/22/25 08:40, Zhenzhong Duan wrote:
> Introduce helper vfio_pci_from_vfio_device() to transform from VFIODevice
> to VFIOPCIDevice, also to hide low level VFIO_DEVICE_TYPE_PCI type check.
> 
> Suggested-by: Cédric Le Goater <clg@redhat.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> Reviewed-by: Cédric Le Goater <clg@redhat.com>
> Link: https://lore.kernel.org/qemu-devel/20250801023533.1458644-1-zhenzhong.duan@intel.com
> [ clg: Added documentation ]
> Signed-off-by: Cédric Le Goater <clg@redhat.com>

Would you like me to merge these VFIO changes upfront ?

They do not seem to conflict with the series I have queued for 10.2 so far.

C.



> ---
>   hw/vfio/pci.h       | 12 ++++++++++++
>   hw/vfio/container.c |  4 ++--
>   hw/vfio/device.c    |  2 +-
>   hw/vfio/iommufd.c   |  4 ++--
>   hw/vfio/listener.c  |  4 ++--
>   hw/vfio/pci.c       |  9 +++++++++
>   6 files changed, 28 insertions(+), 7 deletions(-)
> 
> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> index 810a842f4a..beb8fb9ee7 100644
> --- a/hw/vfio/pci.h
> +++ b/hw/vfio/pci.h
> @@ -221,6 +221,18 @@ void vfio_pci_write_config(PCIDevice *pdev,
>   uint64_t vfio_vga_read(void *opaque, hwaddr addr, unsigned size);
>   void vfio_vga_write(void *opaque, hwaddr addr, uint64_t data, unsigned size);
>   
> +/**
> + * vfio_pci_from_vfio_device: Transform from VFIODevice to
> + * VFIOPCIDevice
> + *
> + * This function checks if the given @vbasedev is a VFIO PCI device.
> + * If it is, it returns the containing VFIOPCIDevice.
> + *
> + * @vbasedev: The VFIODevice to transform
> + *
> + * Return: The VFIOPCIDevice on success, NULL on failure.
> + */
> +VFIOPCIDevice *vfio_pci_from_vfio_device(VFIODevice *vbasedev);
>   void vfio_sub_page_bar_update_mappings(VFIOPCIDevice *vdev);
>   bool vfio_opt_rom_in_denylist(VFIOPCIDevice *vdev);
>   bool vfio_config_quirk_setup(VFIOPCIDevice *vdev, Error **errp);
> diff --git a/hw/vfio/container.c b/hw/vfio/container.c
> index 3e13feaa74..134ddccc52 100644
> --- a/hw/vfio/container.c
> +++ b/hw/vfio/container.c
> @@ -1087,7 +1087,7 @@ static int vfio_legacy_pci_hot_reset(VFIODevice *vbasedev, bool single)
>           /* Prep dependent devices for reset and clear our marker. */
>           QLIST_FOREACH(vbasedev_iter, &group->device_list, next) {
>               if (!vbasedev_iter->dev->realized ||
> -                vbasedev_iter->type != VFIO_DEVICE_TYPE_PCI) {
> +                !vfio_pci_from_vfio_device(vbasedev_iter)) {
>                   continue;
>               }
>               tmp = container_of(vbasedev_iter, VFIOPCIDevice, vbasedev);
> @@ -1172,7 +1172,7 @@ out:
>   
>           QLIST_FOREACH(vbasedev_iter, &group->device_list, next) {
>               if (!vbasedev_iter->dev->realized ||
> -                vbasedev_iter->type != VFIO_DEVICE_TYPE_PCI) {
> +                !vfio_pci_from_vfio_device(vbasedev_iter)) {
>                   continue;
>               }
>               tmp = container_of(vbasedev_iter, VFIOPCIDevice, vbasedev);
> diff --git a/hw/vfio/device.c b/hw/vfio/device.c
> index 52a1996dc4..08f12ac31f 100644
> --- a/hw/vfio/device.c
> +++ b/hw/vfio/device.c
> @@ -129,7 +129,7 @@ static inline const char *action_to_str(int action)
>   
>   static const char *index_to_str(VFIODevice *vbasedev, int index)
>   {
> -    if (vbasedev->type != VFIO_DEVICE_TYPE_PCI) {
> +    if (!vfio_pci_from_vfio_device(vbasedev)) {
>           return NULL;
>       }
>   
> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
> index 48c590b6a9..8c27222f75 100644
> --- a/hw/vfio/iommufd.c
> +++ b/hw/vfio/iommufd.c
> @@ -737,8 +737,8 @@ iommufd_cdev_dep_get_realized_vpdev(struct vfio_pci_dependent_device *dep_dev,
>       }
>   
>       vbasedev_tmp = iommufd_cdev_pci_find_by_devid(dep_dev->devid);
> -    if (!vbasedev_tmp || !vbasedev_tmp->dev->realized ||
> -        vbasedev_tmp->type != VFIO_DEVICE_TYPE_PCI) {
> +    if (!vfio_pci_from_vfio_device(vbasedev_tmp) ||
> +        !vbasedev_tmp->dev->realized) {
>           return NULL;
>       }
>   
> diff --git a/hw/vfio/listener.c b/hw/vfio/listener.c
> index f498e23a93..903dfd8bf2 100644
> --- a/hw/vfio/listener.c
> +++ b/hw/vfio/listener.c
> @@ -450,7 +450,7 @@ static void vfio_device_error_append(VFIODevice *vbasedev, Error **errp)
>        * MMIO region mapping failures are not fatal but in this case PCI
>        * peer-to-peer transactions are broken.
>        */
> -    if (vbasedev && vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
> +    if (vfio_pci_from_vfio_device(vbasedev)) {
>           error_append_hint(errp, "%s: PCI peer-to-peer transactions "
>                             "on BARs are not supported.\n", vbasedev->name);
>       }
> @@ -751,7 +751,7 @@ static bool vfio_section_is_vfio_pci(MemoryRegionSection *section,
>       owner = memory_region_owner(section->mr);
>   
>       QLIST_FOREACH(vbasedev, &bcontainer->device_list, container_next) {
> -        if (vbasedev->type != VFIO_DEVICE_TYPE_PCI) {
> +        if (!vfio_pci_from_vfio_device(vbasedev)) {
>               continue;
>           }
>           pcidev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 07257d0fa0..3fe5b03eb1 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -2833,6 +2833,15 @@ static int vfio_pci_load_config(VFIODevice *vbasedev, QEMUFile *f)
>       return ret;
>   }
>   
> +/* Transform from VFIODevice to VFIOPCIDevice. Return NULL if fails. */
> +VFIOPCIDevice *vfio_pci_from_vfio_device(VFIODevice *vbasedev)
> +{
> +    if (vbasedev && vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
> +        return container_of(vbasedev, VFIOPCIDevice, vbasedev);
> +    }
> +    return NULL;
> +}
> +
>   void vfio_sub_page_bar_update_mappings(VFIOPCIDevice *vdev)
>   {
>       PCIDevice *pdev = &vdev->pdev;



^ permalink raw reply	[flat|nested] 113+ messages in thread

* RE: [PATCH v5 04/21] vfio: Introduce helper vfio_pci_from_vfio_device()
  2025-09-01 16:36   ` Cédric Le Goater
@ 2025-09-02  2:12     ` Duan, Zhenzhong
  0 siblings, 0 replies; 113+ messages in thread
From: Duan, Zhenzhong @ 2025-09-02  2:12 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, eric.auger@redhat.com, mst@redhat.com,
	jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
	jgg@nvidia.com, nicolinc@nvidia.com, joao.m.martins@oracle.com,
	clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
	Peng, Chao P

Hi Cédric,

>-----Original Message-----
>From: Cédric Le Goater <clg@redhat.com>
>Subject: Re: [PATCH v5 04/21] vfio: Introduce helper
>vfio_pci_from_vfio_device()
>
>Zhenzhong,
>
>On 8/22/25 08:40, Zhenzhong Duan wrote:
>> Introduce helper vfio_pci_from_vfio_device() to transform from VFIODevice
>> to VFIOPCIDevice, also to hide low level VFIO_DEVICE_TYPE_PCI type
>check.
>>
>> Suggested-by: Cédric Le Goater <clg@redhat.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> Reviewed-by: Cédric Le Goater <clg@redhat.com>
>> Link:
>https://lore.kernel.org/qemu-devel/20250801023533.1458644-1-zhenzhong.
>duan@intel.com
>> [ clg: Added documentation ]
>> Signed-off-by: Cédric Le Goater <clg@redhat.com>
>
>Would you like me to merge these VFIO changes upfront ?
>
>They do not seem to conflict with the series I have queued for 10.2 so far.

Yes, I think it's fine to pick this patch upfront.

Thanks
Zhenzhong

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v5 10/21] intel_iommu: Introduce two helpers vtd_as_from/to_iommu_pasid_locked
  2025-09-01  5:33     ` Duan, Zhenzhong
@ 2025-09-03  6:30       ` Yi Liu
  2025-09-03  7:13         ` Duan, Zhenzhong
  0 siblings, 1 reply; 113+ messages in thread
From: Yi Liu @ 2025-09-03  6:30 UTC (permalink / raw)
  To: Duan, Zhenzhong, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, eric.auger@redhat.com,
	mst@redhat.com, jasowang@redhat.com, peterx@redhat.com,
	ddutile@redhat.com, jgg@nvidia.com, nicolinc@nvidia.com,
	joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
	Tian, Kevin, Peng, Chao P

On 2025/9/1 13:33, Duan, Zhenzhong wrote:

>>> +static int vtd_as_to_iommu_pasid_locked(VTDAddressSpace *vtd_as,
>>> +                                        uint32_t *pasid)
>>> +{
>>> +    VTDContextCacheEntry *cc_entry = &vtd_as->context_cache_entry;
>>> +    IntelIOMMUState *s = vtd_as->iommu_state;
>>> +    uint8_t bus_num = pci_bus_num(vtd_as->bus);
>>> +    uint8_t devfn = vtd_as->devfn;
>>> +    VTDContextEntry ce;
>>> +    int ret;
>>> +
>>> +    /* For Requests-with-PASID, its pasid value is used by vIOMMU
>> directly */
>>> +    if (vtd_as->pasid != PCI_NO_PASID) {
>>> +        *pasid = vtd_as->pasid;
>>> +        return 0;
>>> +    }
>>> +
>>> +    if (cc_entry->context_cache_gen == s->context_cache_gen) {
>>> +        ce = cc_entry->context_entry;

Just realized: if you don't record the context_entry in the branch
below, then this flow will always take the branch below for a
passthrough device, won't it?

>>> +    } else {
>>> +        ret = vtd_dev_to_context_entry(s, bus_num, devfn, &ce);
>>> +        if (ret) {
>>> +            return ret;
>>> +        }
>>> +    }
>>> +    *pasid = VTD_CE_GET_RID2PASID(&ce);
>>
>> looks like we have quite a few code get rid_pasid from the context
>> entry. I think we may simplify it by using PASID #0 since vIOMMU does
>> not report ECAP.RPS bit at all. It could be done as a separate cleanup.
> 
> Yes, but we already have all code supporting RPS capability though RPS
> isn't enabled in CAP register. In theory we can enable RPS easily by setting
> the bit in CAP register. So I would like to be consistent with this instead of
> dropping all the existing code about RPS cap.

Right. The code is almost there. But I haven't seen any possibility of
reporting RPS==1 to the guest. It's more or less agreed that pasid#0 is
used as rid_pasid. You may have noticed that Linux does not even check
the RPS bit, so such a guest will ignore RPS. This means reading
rid_pasid from the context entry is not necessary. This is not an
urgent task anyhow.


Regards,
Yi Liu


^ permalink raw reply	[flat|nested] 113+ messages in thread

* RE: [PATCH v5 10/21] intel_iommu: Introduce two helpers vtd_as_from/to_iommu_pasid_locked
  2025-09-03  6:30       ` Yi Liu
@ 2025-09-03  7:13         ` Duan, Zhenzhong
  0 siblings, 0 replies; 113+ messages in thread
From: Duan, Zhenzhong @ 2025-09-03  7:13 UTC (permalink / raw)
  To: Liu, Yi L, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, eric.auger@redhat.com,
	mst@redhat.com, jasowang@redhat.com, peterx@redhat.com,
	ddutile@redhat.com, jgg@nvidia.com, nicolinc@nvidia.com,
	joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
	Tian, Kevin, Peng, Chao P



>-----Original Message-----
>From: Liu, Yi L <yi.l.liu@intel.com>
>Subject: Re: [PATCH v5 10/21] intel_iommu: Introduce two helpers
>vtd_as_from/to_iommu_pasid_locked
>
>On 2025/9/1 13:33, Duan, Zhenzhong wrote:
>
>>>> +static int vtd_as_to_iommu_pasid_locked(VTDAddressSpace *vtd_as,
>>>> +                                        uint32_t *pasid)
>>>> +{
>>>> +    VTDContextCacheEntry *cc_entry =
>&vtd_as->context_cache_entry;
>>>> +    IntelIOMMUState *s = vtd_as->iommu_state;
>>>> +    uint8_t bus_num = pci_bus_num(vtd_as->bus);
>>>> +    uint8_t devfn = vtd_as->devfn;
>>>> +    VTDContextEntry ce;
>>>> +    int ret;
>>>> +
>>>> +    /* For Requests-with-PASID, its pasid value is used by vIOMMU
>>> directly */
>>>> +    if (vtd_as->pasid != PCI_NO_PASID) {
>>>> +        *pasid = vtd_as->pasid;
>>>> +        return 0;
>>>> +    }
>>>> +
>>>> +    if (cc_entry->context_cache_gen == s->context_cache_gen) {
>>>> +        ce = cc_entry->context_entry;
>
>Just realized: if you don't record the context_entry in the branch
>below, then this flow will always take the branch below for a
>passthrough device, won't it?

Yes.
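
For readers following along, a toy model of the generation check may help. This is only a sketch with invented toy_* names, not QEMU code. It shows why a lookup that never stores its result keeps taking the slow path once the generations differ:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct toy_iommu {
    uint32_t cache_gen;     /* bumped on every context cache invalidation */
    unsigned slow_lookups;  /* counts walks of the in-memory tables */
};

struct toy_cache_entry {
    uint32_t gen;           /* generation at which 'value' was recorded */
    uint64_t value;
};

/* Stand-in for the slow path that reads the context entry from memory. */
static uint64_t toy_slow_lookup(struct toy_iommu *s)
{
    s->slow_lookups++;
    return 0x1234;
}

/*
 * Return the cached value when the generations match, otherwise take the
 * slow path. With record == false the entry is never refreshed, so every
 * later call takes the slow path again.
 */
static uint64_t toy_get_entry(struct toy_iommu *s, struct toy_cache_entry *e,
                              bool record)
{
    uint64_t v;

    assert(s && e);
    if (e->gen == s->cache_gen) {
        return e->value;
    }
    v = toy_slow_lookup(s);
    if (record) {
        e->value = v;
        e->gen = s->cache_gen;
    }
    return v;
}
```

Calling toy_get_entry() twice with record == false performs two slow lookups; with record == true the second call is served from the cached entry.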

>
>>>> +    } else {
>>>> +        ret = vtd_dev_to_context_entry(s, bus_num, devfn, &ce);
>>>> +        if (ret) {
>>>> +            return ret;
>>>> +        }
>>>> +    }
>>>> +    *pasid = VTD_CE_GET_RID2PASID(&ce);
>>>
>>> looks like we have quite a few code get rid_pasid from the context
>>> entry. I think we may simplify it by using PASID #0 since vIOMMU does
>>> not report ECAP.RPS bit at all. It could be done as a separate cleanup.
>>
>> Yes, but we already have all code supporting RPS capability though RPS
>> isn't enabled in CAP register. In theory we can enable RPS easily by setting
>> the bit in CAP register. So I would like to be consistent with this instead of
>> dropping all the existing code about RPS cap.
>
>Right. The code is almost there. But I haven't seen any possibility of
>reporting RPS==1 to the guest. It's more or less agreed that pasid#0 is
>used as rid_pasid. You may have noticed that Linux does not even check
>the RPS bit, so such a guest will ignore RPS. This means reading
>rid_pasid from the context entry is not necessary. This is not an
>urgent task anyhow.

OK, if we have no plan to support RPS, a lot of code can be simplified. I'll do it in this series if there is no objection.
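
A rough, untested sketch of what that simplification could look like, using stand-in definitions (TOY_NO_PASID and TOY_RID_PASID are placeholders for illustration, not the real PCI_NO_PASID or any existing QEMU macro, and toy_as_to_iommu_pasid() is a hypothetical name):

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NO_PASID   UINT32_MAX  /* placeholder for PCI_NO_PASID */
#define TOY_RID_PASID  0u          /* fixed rid_pasid when RPS is not reported */

/*
 * Map the PASID of an address space to the PASID used by the vIOMMU.
 * Requests-with-PASID use their own value; Requests-without-PASID use
 * PASID #0, so no context entry lookup is needed and the call cannot fail.
 */
static uint32_t toy_as_to_iommu_pasid(uint32_t as_pasid)
{
    if (as_pasid != TOY_NO_PASID) {
        return as_pasid;
    }
    return TOY_RID_PASID;
}
```

Compared with the helper quoted above, the context cache generation check and the vtd_dev_to_context_entry() fallback both disappear, which is where most of the saving would come from.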

Thanks
Zhenzhong

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v5 11/21] intel_iommu: Handle PASID entry removal and update
  2025-09-01  3:31     ` Duan, Zhenzhong
@ 2025-09-03  7:58       ` Yi Liu
  2025-09-04  2:37         ` Duan, Zhenzhong
  0 siblings, 1 reply; 113+ messages in thread
From: Yi Liu @ 2025-09-03  7:58 UTC (permalink / raw)
  To: Duan, Zhenzhong, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, eric.auger@redhat.com,
	mst@redhat.com, jasowang@redhat.com, peterx@redhat.com,
	ddutile@redhat.com, jgg@nvidia.com, nicolinc@nvidia.com,
	joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
	Tian, Kevin, Peng, Chao P, Yi Sun

On 2025/9/1 11:31, Duan, Zhenzhong wrote:
> 
> 
>> -----Original Message-----
>> From: Liu, Yi L <yi.l.liu@intel.com>
>> Subject: Re: [PATCH v5 11/21] intel_iommu: Handle PASID entry removal and
>> update
>>
>> On 2025/8/22 14:40, Zhenzhong Duan wrote:
>>> This adds an new entry VTDPASIDCacheEntry in VTDAddressSpace to cache
>> the
>>> pasid entry and track PASID usage and future PASID tagged DMA address
>>> translation support in vIOMMU.
>>
>> Have you seen any extra code needed based on this series to support non
>> rid_pasid PASIDs? If no, may just relax the scope of this series.
>> otherwise, you may need to tweak the patch a little bit. e.g. factor
>> out setting x-flts and x-pasid-mode at the same time.
> 
> There are quite a few code are common for both non-rid_pasid and rid_pasid.
> So in this series, there are some infrastructure code that looks like it's for non-rid_pasid.
> 
> But to support non-rid_pasid, we need pasid_attach/detach() which is not implemented in this series.

I see. Besides that, the vIOMMU internal infrastructure should be ready
for non-rid_pasid after this series.

> Even if x-flts and x-pasid-mode both on, pasid isn't enabled since VFIO device doesn't
> expose pasid capability to guest, so guest never use non-rid_pasid with this VFIO device.

ok. Given that 1st stage for emulated devices is already supported,
it's fine to rely on the knob on the device side.

>>>
>>> VTDAddressSpace of PCI_NO_PASID is allocated when device is plugged and
>>> never freed. For other pasid, VTDAddressSpace instance is
>> created/destroyed
>>> per the guest pasid entry set up/destroy.
>>
>>> When guest removes or updates a PASID entry, QEMU will capture the guest
>> pasid
>>> selective pasid cache invalidation, removes VTDAddressSpace or update
>> cached
>>> PASID entry.
>>>
>>> vIOMMU emulator could figure out the reason by fetching latest guest pasid
>> entry
>>> and compare it with cached PASID entry.
>>>
>>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>>> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>> ---
>>>    hw/i386/intel_iommu_internal.h |  27 ++++-
>>>    include/hw/i386/intel_iommu.h  |   6 +
>>>    hw/i386/intel_iommu.c          | 196
>> +++++++++++++++++++++++++++++++--
>>>    hw/i386/trace-events           |   3 +
>>>    4 files changed, 220 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/hw/i386/intel_iommu_internal.h
>> b/hw/i386/intel_iommu_internal.h
>>> index f7510861d1..b9b76dd996 100644
>>> --- a/hw/i386/intel_iommu_internal.h
>>> +++ b/hw/i386/intel_iommu_internal.h
>>> @@ -316,6 +316,7 @@ typedef enum VTDFaultReason {
>>>                                      * request while disabled */
>>>        VTD_FR_IR_SID_ERR = 0x26,   /* Invalid Source-ID */
>>>
>>> +    VTD_FR_RTADDR_INV_TTM = 0x31,  /* Invalid TTM in RTADDR */
>>>        /* PASID directory entry access failure */
>>>        VTD_FR_PASID_DIR_ACCESS_ERR = 0x50,
>>>        /* The Present(P) field of pasid directory entry is 0 */
>>> @@ -493,6 +494,15 @@ typedef union VTDInvDesc VTDInvDesc;
>>>    #define VTD_INV_DESC_PIOTLB_RSVD_VAL0
>> 0xfff000000000f1c0ULL
>>>    #define VTD_INV_DESC_PIOTLB_RSVD_VAL1     0xf80ULL
>>>
>>> +/* PASID-cache Invalidate Descriptor (pc_inv_dsc) fields */
>>> +#define VTD_INV_DESC_PASIDC_G(x)        extract64((x)->val[0], 4, 2)
>>> +#define VTD_INV_DESC_PASIDC_G_DSI       0
>>> +#define VTD_INV_DESC_PASIDC_G_PASID_SI  1
>>> +#define VTD_INV_DESC_PASIDC_G_GLOBAL    3
>>> +#define VTD_INV_DESC_PASIDC_DID(x)      extract64((x)->val[0], 16,
>> 16)
>>> +#define VTD_INV_DESC_PASIDC_PASID(x)    extract64((x)->val[0], 32,
>> 20)
>>> +#define VTD_INV_DESC_PASIDC_RSVD_VAL0   0xfff000000000f1c0ULL
>>> +
>>>    /* Information about page-selective IOTLB invalidate */
>>>    struct VTDIOTLBPageInvInfo {
>>>        uint16_t domain_id;
>>> @@ -553,6 +563,21 @@ typedef struct VTDRootEntry VTDRootEntry;
>>>    #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw)  (0x1e0ULL |
>> ~VTD_HAW_MASK(aw))
>>>    #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1
>> 0xffffffffffe00000ULL
>>>
>>> +typedef enum VTDPCInvType {
>>> +    /* VTD spec defined PASID cache invalidation type */
>>> +    VTD_PASID_CACHE_DOMSI = VTD_INV_DESC_PASIDC_G_DSI,
>>> +    VTD_PASID_CACHE_PASIDSI = VTD_INV_DESC_PASIDC_G_PASID_SI,
>>> +    VTD_PASID_CACHE_GLOBAL_INV =
>> VTD_INV_DESC_PASIDC_G_GLOBAL,
>>> +} VTDPCInvType;
>>> +
>>> +typedef struct VTDPASIDCacheInfo {
>>> +    VTDPCInvType type;
>>> +    uint16_t did;
>>> +    uint32_t pasid;
>>> +    PCIBus *bus;
>>> +    uint16_t devfn;
>>> +} VTDPASIDCacheInfo;
>>> +
>>>    /* PASID Table Related Definitions */
>>>    #define VTD_PASID_DIR_BASE_ADDR_MASK  (~0xfffULL)
>>>    #define VTD_PASID_TABLE_BASE_ADDR_MASK (~0xfffULL)
>>> @@ -574,7 +599,7 @@ typedef struct VTDRootEntry VTDRootEntry;
>>>    #define VTD_SM_PASID_ENTRY_PT          (4ULL << 6)
>>>
>>>    #define VTD_SM_PASID_ENTRY_AW          7ULL /* Adjusted
>> guest-address-width */
>>> -#define VTD_SM_PASID_ENTRY_DID(val)    ((val) &
>> VTD_DOMAIN_ID_MASK)
>>> +#define VTD_SM_PASID_ENTRY_DID(x)      extract64((x)->val[1], 0, 16)
>>>
>>>    #define VTD_SM_PASID_ENTRY_FLPM          3ULL
>>>    #define VTD_SM_PASID_ENTRY_FLPTPTR       (~0xfffULL)
>>> diff --git a/include/hw/i386/intel_iommu.h
>> b/include/hw/i386/intel_iommu.h
>>> index 50f9b27a45..0e3826f6f0 100644
>>> --- a/include/hw/i386/intel_iommu.h
>>> +++ b/include/hw/i386/intel_iommu.h
>>> @@ -95,6 +95,11 @@ struct VTDPASIDEntry {
>>>        uint64_t val[8];
>>>    };
>>>
>>> +typedef struct VTDPASIDCacheEntry {
>>> +    struct VTDPASIDEntry pasid_entry;
>>> +    bool valid;
>>> +} VTDPASIDCacheEntry;
>>> +
>>>    struct VTDAddressSpace {
>>>        PCIBus *bus;
>>>        uint8_t devfn;
>>> @@ -107,6 +112,7 @@ struct VTDAddressSpace {
>>>        MemoryRegion iommu_ir_fault; /* Interrupt region for catching
>> fault */
>>>        IntelIOMMUState *iommu_state;
>>>        VTDContextCacheEntry context_cache_entry;
>>> +    VTDPASIDCacheEntry pasid_cache_entry;
>>>        QLIST_ENTRY(VTDAddressSpace) next;
>>>        /* Superset of notifier flags that this address space has */
>>>        IOMMUNotifierFlag notifier_flags;
>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>>> index 1801f1cdf6..a2ee6d684e 100644
>>> --- a/hw/i386/intel_iommu.c
>>> +++ b/hw/i386/intel_iommu.c
>>> @@ -1675,7 +1675,7 @@ static uint16_t
>> vtd_get_domain_id(IntelIOMMUState *s,
>>>
>>>        if (s->root_scalable) {
>>>            vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
>>> -        return VTD_SM_PASID_ENTRY_DID(pe.val[1]);
>>> +        return VTD_SM_PASID_ENTRY_DID(&pe);
>>>        }
>>>
>>>        return VTD_CONTEXT_ENTRY_DID(ce->hi);
>>> @@ -3112,6 +3112,183 @@ static bool
>> vtd_process_piotlb_desc(IntelIOMMUState *s,
>>>        return true;
>>>    }
>>>
>>> +static inline int vtd_dev_get_pe_from_pasid(VTDAddressSpace *vtd_as,
>>> +                                            uint32_t pasid,
>> VTDPASIDEntry *pe)
>>> +{
>>> +    IntelIOMMUState *s = vtd_as->iommu_state;
>>> +    VTDContextEntry ce;
>>> +    int ret;
>>> +
>>> +    if (!s->root_scalable) {
>>> +        return -VTD_FR_RTADDR_INV_TTM;
>>> +    }
>>> +
>>> +    ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
>> vtd_as->devfn,
>>> +                                   &ce);
>>> +    if (ret) {
>>> +        return ret;
>>> +    }
>>> +
>>> +    return vtd_ce_get_pasid_entry(s, &ce, pe, pasid);
>>> +}
>>> +
>>> +static bool vtd_pasid_entry_compare(VTDPASIDEntry *p1, VTDPASIDEntry
>> *p2)
>>> +{
>>> +    return !memcmp(p1, p2, sizeof(*p1));
>>> +}
>>> +
>>> +/*
>>> + * This function is a loop function which return value determines if
>>> + * vtd_as including cached pasid entry is removed.
>>> + *
>>> + * For PCI_NO_PASID, when corresponding cached pasid entry is cleared,
>>> + * it returns false so that vtd_as is reserved as it's owned by PCI
>>> + * sub-system. For other pasid, it returns true so vtd_as is removed.
>>
>> also, this helper will always return true if this series does not
>> support non-rid_pasid PASID.
> 
> Do you mean return false? I don't think it will return true.
> For non-rid_pasid, it may return false.

aha, yes. for rid_pasid, you need to keep the vtd_as instance.
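
The return-value convention can be illustrated with a toy callback. The names are invented for this sketch; the real code is a g_hash_table_foreach_remove() callback, and GLib removes an entry when the callback returns TRUE:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_NO_PASID UINT32_MAX  /* placeholder for PCI_NO_PASID */

struct toy_as {
    uint32_t pasid;
    bool cache_valid;
};

/*
 * Called when the guest PASID entry behind 'as' has gone away.
 * Clears the cached entry and returns true if the caller should also
 * drop the address space itself.
 */
static bool toy_flush_pasid(struct toy_as *as)
{
    assert(as);
    as->cache_valid = false;
    /* The PCI_NO_PASID address space is owned by the PCI subsystem: keep it. */
    return as->pasid != TOY_NO_PASID;
}
```

With only rid_pasid in use, every address space has pasid == PCI_NO_PASID, so the callback always takes the "keep" path and returns false.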

Regards,
Yi Liu


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v5 12/21] intel_iommu: Handle PASID entry addition
  2025-09-01  9:03     ` Duan, Zhenzhong
@ 2025-09-03  8:52       ` Yi Liu
  2025-09-04  2:45         ` Duan, Zhenzhong
  0 siblings, 1 reply; 113+ messages in thread
From: Yi Liu @ 2025-09-03  8:52 UTC (permalink / raw)
  To: Duan, Zhenzhong, eric.auger@redhat.com, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
	jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
	jgg@nvidia.com, nicolinc@nvidia.com, joao.m.martins@oracle.com,
	clement.mathieu--drif@eviden.com, Tian, Kevin, Peng, Chao P,
	Yi Sun

On 2025/9/1 17:03, Duan, Zhenzhong wrote:

>>>   }
>>>
>>> +/*
>>> + * This function walks over PASID range within [start, end) in a single
>>> + * PASID table for entries matching @info type/did, then retrieve/create
>>> + * vtd_as and fill associated pasid entry cache.
>>> + */
>>> +static void vtd_sm_pasid_table_walk_one(IntelIOMMUState *s,
>>> +                                        dma_addr_t pt_base,
>>> +                                        int start,
>>> +                                        int end,
>>> +                                        VTDPASIDCacheInfo
>> *info)
>>> +{
>>> +    VTDPASIDEntry pe;
>>> +    int pasid = start;
>>> +
>>> +    while (pasid < end) {
>>> +        if (!vtd_get_pe_in_pasid_leaf_table(s, pasid, pt_base, &pe)
>>> +            && vtd_pe_present(&pe)) {
>>> +            int bus_n = pci_bus_num(info->bus), devfn = info->devfn;
>>> +            uint16_t sid = PCI_BUILD_BDF(bus_n, devfn);
>>> +            VTDPASIDCacheEntry *pc_entry;
>>> +            VTDAddressSpace *vtd_as;
>>> +
>>> +            vtd_iommu_lock(s);
>>> +            /*
>>> +             * When indexed by rid2pasid, vtd_as should have been
>> created,
>>> +             * e.g., by PCI subsystem. For other iommu pasid, we need
>> to
>>> +             * create vtd_as dynamically. Other iommu pasid is same
>> value
>> since you don't support somthing else than rid2pasid, I would drop that
>> and simplify the code. See below.
>>> +             * as PCI's pasid, so it's used as input of vtd_find_add_as().
>>> +             */
>>> +            vtd_as = vtd_as_from_iommu_pasid_locked(s, sid, pasid);
>>> +            vtd_iommu_unlock(s);
>>> +            if (!vtd_as) {
>>> +                vtd_as = vtd_find_add_as(s, info->bus, devfn, pasid);
>> you could check the vtd_as already exists here per the rid2pasid support
>> limitation
> 
> In this series, I do include some basic codes for non-rid2pasid because they share some common code with rid2pasid and we already have emulated rid2pasid support in vIOMMU for a long time, it's not bad to accumulate some supporting code for non-rid2pasid for passthrough device. But I can do the factor out if you insist to have only rid_pasid code.

I think it's a reasonable ask. :)

> 
>>> +            }
>>> +
>>> +            if ((info->type == VTD_PASID_CACHE_DOMSI ||
>>> +                 info->type == VTD_PASID_CACHE_PASIDSI) &&
>>> +                (info->did != VTD_SM_PASID_ENTRY_DID(&pe))) {
>>> +                /*
>>> +                 * VTD_PASID_CACHE_DOMSI and
>> VTD_PASID_CACHE_PASIDSI
>>> +                 * requires domain id check. If domain id check fail,
>> fails
>>> +                 * go to next pasid.
>>> +                 */
>>> +                pasid++;
>>> +                continue;
>>> +            }
>>> +
>>> +            pc_entry = &vtd_as->pasid_cache_entry;
>>> +            /*
>>> +             * pasid cache update and clear are handled in
>>> +             * vtd_flush_pasid_locked(), only care new pasid entry
>> here.
>>> +             */
>>> +            if (!pc_entry->valid) {
>>> +                pc_entry->pasid_entry = pe;
>>> +                pc_entry->valid = true;
>>> +            }
>>> +        }
>>> +        pasid++;
>>> +    }
>>> +}
>>> +
>>> +/*
>>> + * In VT-d scalable mode translation, PASID dir + PASID table is used.
>>> + * This function aims at looping over a range of PASIDs in the given
>>> + * two level table to identify the pasid config in guest.
>>> + */
>>> +static void vtd_sm_pasid_table_walk(IntelIOMMUState *s,
>>> +                                    dma_addr_t pdt_base,
>>> +                                    int start, int end,
>>> +                                    VTDPASIDCacheInfo *info)
>>> +{
>>> +    VTDPASIDDirEntry pdire;
>>> +    int pasid = start;
>>> +    int pasid_next;
>>> +    dma_addr_t pt_base;
>>> +
>>> +    while (pasid < end) {
>>> +        pasid_next =
>>> +             (pasid + VTD_PASID_TBL_ENTRY_NUM) &
>> ~(VTD_PASID_TBL_ENTRY_NUM - 1);
>>> +        pasid_next = pasid_next < end ? pasid_next : end;
>>> +
>>> +        if (!vtd_get_pdire_from_pdir_table(pdt_base, pasid, &pdire)
>>> +            && vtd_pdire_present(&pdire)) {
>>> +            pt_base = pdire.val &
>> VTD_PASID_TABLE_BASE_ADDR_MASK;
>>> +            vtd_sm_pasid_table_walk_one(s, pt_base, pasid,
>> pasid_next, info);
>>> +        }
>>> +        pasid = pasid_next;
>>> +    }
>>> +}
>>> +
>>> +static void vtd_replay_pasid_bind_for_dev(IntelIOMMUState *s,
>>> +                                          int start, int end,
>>> +                                          VTDPASIDCacheInfo
>> *info)
>>> +{
>>> +    VTDContextEntry ce;
>>> +
>>> +    if (!vtd_dev_to_context_entry(s, pci_bus_num(info->bus),
>> info->devfn,
>>> +                                  &ce)) {
>>> +        uint32_t max_pasid;
>>> +
>>> +        max_pasid = vtd_sm_ce_get_pdt_entry_num(&ce) *
>> VTD_PASID_TBL_ENTRY_NUM;
>>> +        if (end > max_pasid) {
>>> +            end = max_pasid;
>>> +        }
>>> +        vtd_sm_pasid_table_walk(s,
>>> +
>> VTD_CE_GET_PASID_DIR_TABLE(&ce),
>>> +                                start,
>>> +                                end,
>>> +                                info);
>>> +    }
>>> +}
>>> +
>>> +/*
>>> + * This function replays the guest pasid bindings by walking the two level
>>> + * guest PASID table. For each valid pasid entry, it finds or creates a
>>> + * vtd_as and caches pasid entry in vtd_as.
>>> + */
>>> +static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
>>> +                                            VTDPASIDCacheInfo
>> *pc_info)
>>> +{
>>> +    /*
>>> +     * Currently only Requests-without-PASID is supported, as vIOMMU
>> doesn't
>>> +     * support RPS(RID-PASID Support), pasid scope is fixed to [0, 1).
>>> +     */
>>> +    int start = 0, end = 1;
>>> +    VTDHostIOMMUDevice *vtd_hiod;
>>> +    VTDPASIDCacheInfo walk_info;
>>> +    GHashTableIter as_it;
>>> +
>>> +    switch (pc_info->type) {
>>> +    case VTD_PASID_CACHE_PASIDSI:
>>> +        start = pc_info->pasid;
>>> +        end = pc_info->pasid + 1;
>> if you never replay a range, you could simplify the code for now because
>> some code paths are not properly tested
> 
> OK. Instead of assignment of start and end variable, maybe just an assert(!pc_info->pasid).

I think there are two reasons for this range replay.

1) as a preparation for patch 16 of this series.
2) support domain selective or global pasid cache invalidation

> 
>>> +       /* fall through */
>>> +    case VTD_PASID_CACHE_DOMSI:
>> Why can't we have other invalidation types along with request without
>> PASID? It is not obvious to me at least why it couldn't be used by the
>> guest. Would deserve a comment in the commit desc I think.
> 
> Other invalidation types are indeed used, just in pasid scope [0, 1), because [start, end) are already initialized to [0, 1), nothing more here, so just break.

Hmm. The fixed scope makes the range replay a fake one. It's better to
hold off on the range replay logic for now and add it when there is
non-rid_pasid support for passthrough devices.
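
To make the point concrete, here is a minimal sketch of the chunking that
the quoted vtd_sm_pasid_table_walk() performs on [start, end). It is an
illustration only: SKETCH_TBL_ENTRY_NUM is an assumed value standing in for
VTD_PASID_TBL_ENTRY_NUM, and both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed leaf-table size; stands in for VTD_PASID_TBL_ENTRY_NUM. */
#define SKETCH_TBL_ENTRY_NUM 64u

/*
 * Return the end of the chunk that starts at @pasid: the next
 * leaf-table boundary, clamped to @end. Requires pasid < end.
 */
static uint32_t sketch_chunk_end(uint32_t pasid, uint32_t end)
{
    uint32_t next;

    assert(pasid < end);
    next = (pasid + SKETCH_TBL_ENTRY_NUM) & ~(SKETCH_TBL_ENTRY_NUM - 1);
    return next < end ? next : end;
}

/* Count how many leaf tables a walk over [start, end) would visit. */
static unsigned sketch_count_chunks(uint32_t start, uint32_t end)
{
    unsigned chunks = 0;
    uint32_t pasid = start;

    while (pasid < end) {
        pasid = sketch_chunk_end(pasid, end);
        chunks++;
    }
    return chunks;
}
```

With the scope fixed to [0, 1) the loop runs exactly once and visits a single
entry, so the range machinery does no real work until other PASIDs are
supported.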

Regards,
Yi Liu



* Re: [PATCH v5 16/21] intel_iommu: Replay pasid bindings after context cache invalidation
  2025-09-01  8:11       ` Duan, Zhenzhong
@ 2025-09-03 10:18         ` Yi Liu
  2025-09-04  6:42           ` Duan, Zhenzhong
  0 siblings, 1 reply; 113+ messages in thread
From: Yi Liu @ 2025-09-03 10:18 UTC (permalink / raw)
  To: Duan, Zhenzhong, eric.auger@redhat.com, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
	jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
	jgg@nvidia.com, nicolinc@nvidia.com, joao.m.martins@oracle.com,
	clement.mathieu--drif@eviden.com, Tian, Kevin, Peng, Chao P,
	Yi Sun

On 2025/9/1 16:11, Duan, Zhenzhong wrote:
> 
> 
>> -----Original Message-----
>> From: Liu, Yi L <yi.l.liu@intel.com>
>> Subject: Re: [PATCH v5 16/21] intel_iommu: Replay pasid bindings after
>> context cache invalidation
>>
>> On 2025/8/28 17:43, Eric Auger wrote:
>>>
>>>
>>> On 8/22/25 8:40 AM, Zhenzhong Duan wrote:
>>>> From: Yi Liu <yi.l.liu@intel.com>
>>>>
>>>> This replays guest pasid bindings after context cache invalidation.
>>>> This is a behavior to ensure safety. Actually, programmer should issue
>>>> pasid cache invalidation with proper granularity after issuing a context
>>>> cache invalidation.
>>> So is this mandated? If the spec mandates specific invalidations and the
>>> guest does not comply with the expected invalidation sequence shall we
>>> do that behind the curtain?
>>
>> I think this is following the below decision. We can discuss if it's
>> really needed to replay the pasid bind.
>>
>> d4d607e40d (Peter Xu 2017-04-07 18:59:15 +0800 2321)     /*
>> dd4d607e40d (Peter Xu 2017-04-07 18:59:15 +0800 2322)      * From VT-d spec 6.5.2.1, a global context entry invalidation
>> dd4d607e40d (Peter Xu 2017-04-07 18:59:15 +0800 2323)      * should be followed by a IOTLB global invalidation, so we should
>> dd4d607e40d (Peter Xu 2017-04-07 18:59:15 +0800 2324)      * be safe even without this. Hoewever, let's replay the region as
>> dd4d607e40d (Peter Xu 2017-04-07 18:59:15 +0800 2325)      * well to be safer, and go back here when we need finer tunes for
>> dd4d607e40d (Peter Xu 2017-04-07 18:59:15 +0800 2326)      * VT-d emulation codes.
>> dd4d607e40d (Peter Xu 2017-04-07 18:59:15 +0800 2327)      */
>> dd4d607e40d (Peter Xu 2017-04-07 18:59:15 +0800 2328)     vtd_iommu_replay_all(s);
> 
> I have tested this series with this patch reverted, it works with guest linux kernel.
> 
> Personally, I am inclined to stop adding workarounds for guest kernel bugs; there will be more and more over time and they make the current code unnecessarily complex. @Eric, @Liu, Yi L, your thoughts?

Let's go back to the original purpose of this. Peter identified a
case in which a context modification is not followed by an IOTLB
invalidation. [1] This is valid behavior: since the old domain is still
in use, there is no need to invalidate the IOTLB. As a result, the shadow
page table of the changed device is not updated. So the vIOMMU chose to
enforce a synchronization of the shadow page table on every context entry
modification. Let's see whether there is a similar requirement for the
PASID table.

Let me ask one question: since PASID cache is also tagged with domain
ID, if the DID has not changed, maybe iommu driver will skip the PASID
cache flush?
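
For reference, here is a minimal sketch of how the domain-ID tag enters the
matching of a PASID-cache invalidation, using the field positions from the
VTD_INV_DESC_PASIDC_* macros in patch 11. It does not answer the question
above. sketch_extract64() is a local stand-in for QEMU's extract64(), and the
other names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local stand-in for extract64(): @length bits of @value from bit @start. */
static uint64_t sketch_extract64(uint64_t value, int start, int length)
{
    assert(start >= 0 && length > 0 && length <= 64 - start);
    return (value >> start) & (~0ULL >> (64 - length));
}

/* Granularity encodings, mirroring VTD_INV_DESC_PASIDC_G_*. */
enum { SKETCH_G_DSI = 0, SKETCH_G_PASID_SI = 1, SKETCH_G_GLOBAL = 3 };

/*
 * Does a PASID-cache invalidation descriptor (low qword @desc_lo) cover
 * a cached entry tagged with domain ID @did and PASID @pasid?
 */
static bool sketch_pc_inv_matches(uint64_t desc_lo, uint16_t did,
                                  uint32_t pasid)
{
    uint64_t g = sketch_extract64(desc_lo, 4, 2);
    uint64_t inv_did = sketch_extract64(desc_lo, 16, 16);
    uint64_t inv_pasid = sketch_extract64(desc_lo, 32, 20);

    switch (g) {
    case SKETCH_G_GLOBAL:
        return true;                      /* everything is covered */
    case SKETCH_G_DSI:
        return inv_did == did;            /* domain-selective */
    case SKETCH_G_PASID_SI:
        return inv_did == did && inv_pasid == pasid;
    default:
        return false;                     /* reserved granularity */
    }
}
```

Both selective granularities compare the DID, which is the same domain id
check the quoted vtd_sm_pasid_table_walk_one() applies to DOMSI and PASIDSI.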

[1] https://lore.kernel.org/qemu-devel/20170117084604.2b1f5e50@t450s.home/

Regards,
Yi Liu



* RE: [PATCH v5 11/21] intel_iommu: Handle PASID entry removal and update
  2025-09-03  7:58       ` Yi Liu
@ 2025-09-04  2:37         ` Duan, Zhenzhong
  0 siblings, 0 replies; 113+ messages in thread
From: Duan, Zhenzhong @ 2025-09-04  2:37 UTC (permalink / raw)
  To: Liu, Yi L, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, eric.auger@redhat.com,
	mst@redhat.com, jasowang@redhat.com, peterx@redhat.com,
	ddutile@redhat.com, jgg@nvidia.com, nicolinc@nvidia.com,
	joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
	Tian, Kevin, Peng, Chao P, Yi Sun



>-----Original Message-----
>From: Liu, Yi L <yi.l.liu@intel.com>
>Subject: Re: [PATCH v5 11/21] intel_iommu: Handle PASID entry removal and
>update
>
>On 2025/9/1 11:31, Duan, Zhenzhong wrote:
>>
>>
>>> -----Original Message-----
>>> From: Liu, Yi L <yi.l.liu@intel.com>
>>> Subject: Re: [PATCH v5 11/21] intel_iommu: Handle PASID entry removal
>and
>>> update
>>>
>>> On 2025/8/22 14:40, Zhenzhong Duan wrote:
>>>> This adds an new entry VTDPASIDCacheEntry in VTDAddressSpace to
>cache
>>> the
>>>> pasid entry and track PASID usage and future PASID tagged DMA address
>>>> translation support in vIOMMU.
>>>
>>> Have you seen any extra code needed based on this series to support non
>>> rid_pasid PASIDs? If no, may just relax the scope of this series.
>>> otherwise, you may need to tweak the patch a little bit. e.g. factor
>>> out setting x-flts and x-pasid-mode at the same time.
>>
>> There are quite a few code are common for both non-rid_pasid and
>rid_pasid.
>> So in this series, there are some infrastructure code that looks like it's for
>non-rid_pasid.
>>
>> But to support non-rid_pasid, we need pasid_attach/detach() which is not
>implemented in this series.
>
>I see. Besides that, the vIOMMU internal infrastructure should be ready
>for non-rid_pasid after this series.

Okay, I'll follow your and Eric's suggestion and simplify this series to support only rid_pasid.

Thanks

>
>> Even if x-flts and x-pasid-mode both on, pasid isn't enabled since VFIO
>> device doesn't expose pasid capability to guest, so guest never use
>> non-rid_pasid with this VFIO device.
>
>ok. Given that 1st stage for emulated device has already been supported,
>it's fine to rely on the knob in device side.
>
>>>>
>>>> VTDAddressSpace of PCI_NO_PASID is allocated when device is plugged
>and
>>>> never freed. For other pasid, VTDAddressSpace instance is
>>> created/destroyed
>>>> per the guest pasid entry set up/destroy.
>>>
>>>> When guest removes or updates a PASID entry, QEMU will capture the
>guest
>>> pasid
>>>> selective pasid cache invalidation, removes VTDAddressSpace or update
>>> cached
>>>> PASID entry.
>>>>
>>>> vIOMMU emulator could figure out the reason by fetching latest guest
>pasid
>>> entry
>>>> and compare it with cached PASID entry.
>>>>
>>>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>>>> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
>>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>>> ---
>>>>    hw/i386/intel_iommu_internal.h |  27 ++++-
>>>>    include/hw/i386/intel_iommu.h  |   6 +
>>>>    hw/i386/intel_iommu.c          | 196
>>> +++++++++++++++++++++++++++++++--
>>>>    hw/i386/trace-events           |   3 +
>>>>    4 files changed, 220 insertions(+), 12 deletions(-)
>>>>
>>>> diff --git a/hw/i386/intel_iommu_internal.h
>>> b/hw/i386/intel_iommu_internal.h
>>>> index f7510861d1..b9b76dd996 100644
>>>> --- a/hw/i386/intel_iommu_internal.h
>>>> +++ b/hw/i386/intel_iommu_internal.h
>>>> @@ -316,6 +316,7 @@ typedef enum VTDFaultReason {
>>>>                                      * request while disabled */
>>>>        VTD_FR_IR_SID_ERR = 0x26,   /* Invalid Source-ID */
>>>>
>>>> +    VTD_FR_RTADDR_INV_TTM = 0x31,  /* Invalid TTM in RTADDR */
>>>>        /* PASID directory entry access failure */
>>>>        VTD_FR_PASID_DIR_ACCESS_ERR = 0x50,
>>>>        /* The Present(P) field of pasid directory entry is 0 */
>>>> @@ -493,6 +494,15 @@ typedef union VTDInvDesc VTDInvDesc;
>>>>    #define VTD_INV_DESC_PIOTLB_RSVD_VAL0
>>> 0xfff000000000f1c0ULL
>>>>    #define VTD_INV_DESC_PIOTLB_RSVD_VAL1     0xf80ULL
>>>>
>>>> +/* PASID-cache Invalidate Descriptor (pc_inv_dsc) fields */
>>>> +#define VTD_INV_DESC_PASIDC_G(x)        extract64((x)->val[0], 4,
>2)
>>>> +#define VTD_INV_DESC_PASIDC_G_DSI       0
>>>> +#define VTD_INV_DESC_PASIDC_G_PASID_SI  1
>>>> +#define VTD_INV_DESC_PASIDC_G_GLOBAL    3
>>>> +#define VTD_INV_DESC_PASIDC_DID(x)      extract64((x)->val[0], 16,
>>> 16)
>>>> +#define VTD_INV_DESC_PASIDC_PASID(x)    extract64((x)->val[0], 32,
>>> 20)
>>>> +#define VTD_INV_DESC_PASIDC_RSVD_VAL0
>0xfff000000000f1c0ULL
>>>> +
>>>>    /* Information about page-selective IOTLB invalidate */
>>>>    struct VTDIOTLBPageInvInfo {
>>>>        uint16_t domain_id;
>>>> @@ -553,6 +563,21 @@ typedef struct VTDRootEntry VTDRootEntry;
>>>>    #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw)  (0x1e0ULL |
>>> ~VTD_HAW_MASK(aw))
>>>>    #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1
>>> 0xffffffffffe00000ULL
>>>>
>>>> +typedef enum VTDPCInvType {
>>>> +    /* VTD spec defined PASID cache invalidation type */
>>>> +    VTD_PASID_CACHE_DOMSI = VTD_INV_DESC_PASIDC_G_DSI,
>>>> +    VTD_PASID_CACHE_PASIDSI =
>VTD_INV_DESC_PASIDC_G_PASID_SI,
>>>> +    VTD_PASID_CACHE_GLOBAL_INV =
>>> VTD_INV_DESC_PASIDC_G_GLOBAL,
>>>> +} VTDPCInvType;
>>>> +
>>>> +typedef struct VTDPASIDCacheInfo {
>>>> +    VTDPCInvType type;
>>>> +    uint16_t did;
>>>> +    uint32_t pasid;
>>>> +    PCIBus *bus;
>>>> +    uint16_t devfn;
>>>> +} VTDPASIDCacheInfo;
>>>> +
>>>>    /* PASID Table Related Definitions */
>>>>    #define VTD_PASID_DIR_BASE_ADDR_MASK  (~0xfffULL)
>>>>    #define VTD_PASID_TABLE_BASE_ADDR_MASK (~0xfffULL)
>>>> @@ -574,7 +599,7 @@ typedef struct VTDRootEntry VTDRootEntry;
>>>>    #define VTD_SM_PASID_ENTRY_PT          (4ULL << 6)
>>>>
>>>>    #define VTD_SM_PASID_ENTRY_AW          7ULL /* Adjusted
>>> guest-address-width */
>>>> -#define VTD_SM_PASID_ENTRY_DID(val)    ((val) &
>>> VTD_DOMAIN_ID_MASK)
>>>> +#define VTD_SM_PASID_ENTRY_DID(x)      extract64((x)->val[1], 0,
>16)
>>>>
>>>>    #define VTD_SM_PASID_ENTRY_FLPM          3ULL
>>>>    #define VTD_SM_PASID_ENTRY_FLPTPTR       (~0xfffULL)
>>>> diff --git a/include/hw/i386/intel_iommu.h
>>> b/include/hw/i386/intel_iommu.h
>>>> index 50f9b27a45..0e3826f6f0 100644
>>>> --- a/include/hw/i386/intel_iommu.h
>>>> +++ b/include/hw/i386/intel_iommu.h
>>>> @@ -95,6 +95,11 @@ struct VTDPASIDEntry {
>>>>        uint64_t val[8];
>>>>    };
>>>>
>>>> +typedef struct VTDPASIDCacheEntry {
>>>> +    struct VTDPASIDEntry pasid_entry;
>>>> +    bool valid;
>>>> +} VTDPASIDCacheEntry;
>>>> +
>>>>    struct VTDAddressSpace {
>>>>        PCIBus *bus;
>>>>        uint8_t devfn;
>>>> @@ -107,6 +112,7 @@ struct VTDAddressSpace {
>>>>        MemoryRegion iommu_ir_fault; /* Interrupt region for catching
>>> fault */
>>>>        IntelIOMMUState *iommu_state;
>>>>        VTDContextCacheEntry context_cache_entry;
>>>> +    VTDPASIDCacheEntry pasid_cache_entry;
>>>>        QLIST_ENTRY(VTDAddressSpace) next;
>>>>        /* Superset of notifier flags that this address space has */
>>>>        IOMMUNotifierFlag notifier_flags;
>>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>>>> index 1801f1cdf6..a2ee6d684e 100644
>>>> --- a/hw/i386/intel_iommu.c
>>>> +++ b/hw/i386/intel_iommu.c
>>>> @@ -1675,7 +1675,7 @@ static uint16_t
>>> vtd_get_domain_id(IntelIOMMUState *s,
>>>>
>>>>        if (s->root_scalable) {
>>>>            vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
>>>> -        return VTD_SM_PASID_ENTRY_DID(pe.val[1]);
>>>> +        return VTD_SM_PASID_ENTRY_DID(&pe);
>>>>        }
>>>>
>>>>        return VTD_CONTEXT_ENTRY_DID(ce->hi);
>>>> @@ -3112,6 +3112,183 @@ static bool
>>> vtd_process_piotlb_desc(IntelIOMMUState *s,
>>>>        return true;
>>>>    }
>>>>
>>>> +static inline int vtd_dev_get_pe_from_pasid(VTDAddressSpace *vtd_as,
>>>> +                                            uint32_t pasid,
>>> VTDPASIDEntry *pe)
>>>> +{
>>>> +    IntelIOMMUState *s = vtd_as->iommu_state;
>>>> +    VTDContextEntry ce;
>>>> +    int ret;
>>>> +
>>>> +    if (!s->root_scalable) {
>>>> +        return -VTD_FR_RTADDR_INV_TTM;
>>>> +    }
>>>> +
>>>> +    ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
>>> vtd_as->devfn,
>>>> +                                   &ce);
>>>> +    if (ret) {
>>>> +        return ret;
>>>> +    }
>>>> +
>>>> +    return vtd_ce_get_pasid_entry(s, &ce, pe, pasid);
>>>> +}
>>>> +
>>>> +static bool vtd_pasid_entry_compare(VTDPASIDEntry *p1,
>VTDPASIDEntry
>>> *p2)
>>>> +{
>>>> +    return !memcmp(p1, p2, sizeof(*p1));
>>>> +}
>>>> +
>>>> +/*
>>>> + * This function is a loop function which return value determines if
>>>> + * vtd_as including cached pasid entry is removed.
>>>> + *
>>>> + * For PCI_NO_PASID, when corresponding cached pasid entry is
>cleared,
>>>> + * it returns false so that vtd_as is reserved as it's owned by PCI
>>>> + * sub-system. For other pasid, it returns true so vtd_as is removed.
>>>
>>> also, this helper will always return true if this series does not
>>> support non-rid_pasid PASID.
>>
>> Do you mean return false? I don't think it will return true.
>> For non-rid_pasid, it may return false.
>
>aha, yes. for rid_pasid, you need to keep the vtd_as instance.
>
>Regards,
>Yi Liu


* RE: [PATCH v5 12/21] intel_iommu: Handle PASID entry addition
  2025-09-03  8:52       ` Yi Liu
@ 2025-09-04  2:45         ` Duan, Zhenzhong
  0 siblings, 0 replies; 113+ messages in thread
From: Duan, Zhenzhong @ 2025-09-04  2:45 UTC (permalink / raw)
  To: Liu, Yi L, eric.auger@redhat.com, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
	jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
	jgg@nvidia.com, nicolinc@nvidia.com, joao.m.martins@oracle.com,
	clement.mathieu--drif@eviden.com, Tian, Kevin, Peng, Chao P,
	Yi Sun



>-----Original Message-----
>From: Liu, Yi L <yi.l.liu@intel.com>
>Subject: Re: [PATCH v5 12/21] intel_iommu: Handle PASID entry addition
>
>On 2025/9/1 17:03, Duan, Zhenzhong wrote:
>
>>>>   }
>>>>
>>>> +/*
>>>> + * This function walks over PASID range within [start, end) in a single
>>>> + * PASID table for entries matching @info type/did, then retrieve/create
>>>> + * vtd_as and fill associated pasid entry cache.
>>>> + */
>>>> +static void vtd_sm_pasid_table_walk_one(IntelIOMMUState *s,
>>>> +                                        dma_addr_t pt_base,
>>>> +                                        int start,
>>>> +                                        int end,
>>>> +                                        VTDPASIDCacheInfo
>>> *info)
>>>> +{
>>>> +    VTDPASIDEntry pe;
>>>> +    int pasid = start;
>>>> +
>>>> +    while (pasid < end) {
>>>> +        if (!vtd_get_pe_in_pasid_leaf_table(s, pasid, pt_base, &pe)
>>>> +            && vtd_pe_present(&pe)) {
>>>> +            int bus_n = pci_bus_num(info->bus), devfn = info->devfn;
>>>> +            uint16_t sid = PCI_BUILD_BDF(bus_n, devfn);
>>>> +            VTDPASIDCacheEntry *pc_entry;
>>>> +            VTDAddressSpace *vtd_as;
>>>> +
>>>> +            vtd_iommu_lock(s);
>>>> +            /*
>>>> +             * When indexed by rid2pasid, vtd_as should have been
>>> created,
>>>> +             * e.g., by PCI subsystem. For other iommu pasid, we
>need
>>> to
>>>> +             * create vtd_as dynamically. Other iommu pasid is same
>>> value
>>> since you don't support somthing else than rid2pasid, I would drop that
>>> and simplify the code. See below.
>>>> +             * as PCI's pasid, so it's used as input of
>vtd_find_add_as().
>>>> +             */
>>>> +            vtd_as = vtd_as_from_iommu_pasid_locked(s, sid, pasid);
>>>> +            vtd_iommu_unlock(s);
>>>> +            if (!vtd_as) {
>>>> +                vtd_as = vtd_find_add_as(s, info->bus, devfn, pasid);
>>> you could check the vtd_as already exists here per the rid2pasid support
>>> limitation
>>
>> In this series, I do include some basic codes for non-rid2pasid because they
>share some common code with rid2pasid and we already have emulated
>rid2pasid support in vIOMMU for a long time, it's not bad to accumulate some
>supporting code for non-rid2pasid for passthrough device. But I can do the
>factor out if you insist to have only rid_pasid code.
>
>I think it's a reasonable ask. :)

OK, will do.

>
>>
>>>> +            }
>>>> +
>>>> +            if ((info->type == VTD_PASID_CACHE_DOMSI ||
>>>> +                 info->type == VTD_PASID_CACHE_PASIDSI) &&
>>>> +                (info->did != VTD_SM_PASID_ENTRY_DID(&pe))) {
>>>> +                /*
>>>> +                 * VTD_PASID_CACHE_DOMSI and
>>> VTD_PASID_CACHE_PASIDSI
>>>> +                 * requires domain id check. If domain id check fail,
>>> fails
>>>> +                 * go to next pasid.
>>>> +                 */
>>>> +                pasid++;
>>>> +                continue;
>>>> +            }
>>>> +
>>>> +            pc_entry = &vtd_as->pasid_cache_entry;
>>>> +            /*
>>>> +             * pasid cache update and clear are handled in
>>>> +             * vtd_flush_pasid_locked(), only care new pasid entry
>>> here.
>>>> +             */
>>>> +            if (!pc_entry->valid) {
>>>> +                pc_entry->pasid_entry = pe;
>>>> +                pc_entry->valid = true;
>>>> +            }
>>>> +        }
>>>> +        pasid++;
>>>> +    }
>>>> +}
>>>> +
>>>> +/*
>>>> + * In VT-d scalable mode translation, PASID dir + PASID table is used.
>>>> + * This function aims at looping over a range of PASIDs in the given
>>>> + * two level table to identify the pasid config in guest.
>>>> + */
>>>> +static void vtd_sm_pasid_table_walk(IntelIOMMUState *s,
>>>> +                                    dma_addr_t pdt_base,
>>>> +                                    int start, int end,
>>>> +                                    VTDPASIDCacheInfo *info)
>>>> +{
>>>> +    VTDPASIDDirEntry pdire;
>>>> +    int pasid = start;
>>>> +    int pasid_next;
>>>> +    dma_addr_t pt_base;
>>>> +
>>>> +    while (pasid < end) {
>>>> +        pasid_next =
>>>> +             (pasid + VTD_PASID_TBL_ENTRY_NUM) &
>>> ~(VTD_PASID_TBL_ENTRY_NUM - 1);
>>>> +        pasid_next = pasid_next < end ? pasid_next : end;
>>>> +
>>>> +        if (!vtd_get_pdire_from_pdir_table(pdt_base, pasid, &pdire)
>>>> +            && vtd_pdire_present(&pdire)) {
>>>> +            pt_base = pdire.val &
>>> VTD_PASID_TABLE_BASE_ADDR_MASK;
>>>> +            vtd_sm_pasid_table_walk_one(s, pt_base, pasid,
>>> pasid_next, info);
>>>> +        }
>>>> +        pasid = pasid_next;
>>>> +    }
>>>> +}
>>>> +
>>>> +static void vtd_replay_pasid_bind_for_dev(IntelIOMMUState *s,
>>>> +                                          int start, int end,
>>>> +                                          VTDPASIDCacheInfo
>>> *info)
>>>> +{
>>>> +    VTDContextEntry ce;
>>>> +
>>>> +    if (!vtd_dev_to_context_entry(s, pci_bus_num(info->bus),
>>> info->devfn,
>>>> +                                  &ce)) {
>>>> +        uint32_t max_pasid;
>>>> +
>>>> +        max_pasid = vtd_sm_ce_get_pdt_entry_num(&ce) *
>>> VTD_PASID_TBL_ENTRY_NUM;
>>>> +        if (end > max_pasid) {
>>>> +            end = max_pasid;
>>>> +        }
>>>> +        vtd_sm_pasid_table_walk(s,
>>>> +
>>> VTD_CE_GET_PASID_DIR_TABLE(&ce),
>>>> +                                start,
>>>> +                                end,
>>>> +                                info);
>>>> +    }
>>>> +}
>>>> +
>>>> +/*
>>>> + * This function replays the guest pasid bindings by walking the two level
>>>> + * guest PASID table. For each valid pasid entry, it finds or creates a
>>>> + * vtd_as and caches pasid entry in vtd_as.
>>>> + */
>>>> +static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
>>>> +
>VTDPASIDCacheInfo
>>> *pc_info)
>>>> +{
>>>> +    /*
>>>> +     * Currently only Requests-without-PASID is supported, as
>vIOMMU
>>> doesn't
>>>> +     * support RPS(RID-PASID Support), pasid scope is fixed to [0, 1).
>>>> +     */
>>>> +    int start = 0, end = 1;
>>>> +    VTDHostIOMMUDevice *vtd_hiod;
>>>> +    VTDPASIDCacheInfo walk_info;
>>>> +    GHashTableIter as_it;
>>>> +
>>>> +    switch (pc_info->type) {
>>>> +    case VTD_PASID_CACHE_PASIDSI:
>>>> +        start = pc_info->pasid;
>>>> +        end = pc_info->pasid + 1;
>>> if you never replay a range, you could simplify the code for now because
>>> some code paths are not properly tested
>>
>> OK. Instead of assignment of start and end variable, maybe just an
>assert(!pc_info->pasid).
>
>I think there are two reasons for this range replay.
>
>1) as a preparation for patch 16 of this series.
>2) support domain selective or global pasid cache invalidation

Exactly.

>
>>
>>>> +       /* fall through */
>>>> +    case VTD_PASID_CACHE_DOMSI:
>>> Why can't we have other invalidation types along with request without
>>> PASID? It is not obvious to me at least why it couldn't be used by the
>>> guest. Would deserve a comment in the commit desc I think.
>>
>> Other invalidation types are indeed used, just in pasid scope [0, 1), because
>[start, end) are already initialized to [0, 1), nothing more here, so just break.
>
>hmmm. The fixed scope makes the range replay a fake one. It's better
>holding the range replay logic for now and add it when there is
>non-rid_pasid support for passthrough devices.

Yes, will delete the non-rid_pasid code.

Thanks
Zhenzhong


* RE: [PATCH v5 16/21] intel_iommu: Replay pasid bindings after context cache invalidation
  2025-09-03 10:18         ` Yi Liu
@ 2025-09-04  6:42           ` Duan, Zhenzhong
  0 siblings, 0 replies; 113+ messages in thread
From: Duan, Zhenzhong @ 2025-09-04  6:42 UTC (permalink / raw)
  To: Liu, Yi L, eric.auger@redhat.com, qemu-devel@nongnu.org
  Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
	jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
	jgg@nvidia.com, nicolinc@nvidia.com, joao.m.martins@oracle.com,
	clement.mathieu--drif@eviden.com, Tian, Kevin, Peng, Chao P,
	Yi Sun



>-----Original Message-----
>From: Liu, Yi L <yi.l.liu@intel.com>
>Subject: Re: [PATCH v5 16/21] intel_iommu: Replay pasid bindings after
>context cache invalidation
>
>On 2025/9/1 16:11, Duan, Zhenzhong wrote:
>>
>>
>>> -----Original Message-----
>>> From: Liu, Yi L <yi.l.liu@intel.com>
>>> Subject: Re: [PATCH v5 16/21] intel_iommu: Replay pasid bindings after
>>> context cache invalidation
>>>
>>> On 2025/8/28 17:43, Eric Auger wrote:
>>>>
>>>>
>>>> On 8/22/25 8:40 AM, Zhenzhong Duan wrote:
>>>>> From: Yi Liu <yi.l.liu@intel.com>
>>>>>
>>>>> This replays guest pasid bindings after context cache invalidation.
>>>>> This is a behavior to ensure safety. Actually, programmer should issue
>>>>> pasid cache invalidation with proper granularity after issuing a context
>>>>> cache invalidation.
>>>> So is this mandated? If the spec mandates specific invalidations and the
>>>> guest does not comply with the expected invalidation sequence shall we
>>>> do that behind the curtain?
>>>
>>> I think this is following the below decision. We can discuss if it's
>>> really needed to replay the pasid bind.
>>>
>>> d4d607e40d (Peter Xu                     2017-04-07 18:59:15
>+0800
>>> 2321)
>>>      /*
>>> dd4d607e40d (Peter Xu                     2017-04-07 18:59:15
>+0800
>>> 2322)      * From VT-d spec 6.5.2.1, a global context entry invalidation
>>> dd4d607e40d (Peter Xu                     2017-04-07 18:59:15
>+0800
>>> 2323)      * should be followed by a IOTLB global invalidation, so we
>>> should
>>> dd4d607e40d (Peter Xu                     2017-04-07 18:59:15
>+0800
>>> 2324)      * be safe even without this. Hoewever, let's replay the region
>as
>>> dd4d607e40d (Peter Xu                     2017-04-07 18:59:15
>+0800
>>> 2325)      * well to be safer, and go back here when we need finer tunes
>>> for
>>> dd4d607e40d (Peter Xu                     2017-04-07 18:59:15
>+0800
>>> 2326)      * VT-d emulation codes.
>>> dd4d607e40d (Peter Xu                     2017-04-07 18:59:15
>+0800
>>> 2327)      */
>>> dd4d607e40d (Peter Xu                     2017-04-07 18:59:15
>+0800
>>> 2328)     vtd_iommu_replay_all(s);
>>
>> I have tested this series with this patch reverted, it works with guest linux
>kernel.
>>
>> Personally, I am inclined to stop adding workaround for guest kenrel bug,
>there will be more and more over time and it makes current code complex
>unnecessarily. @Eric, @Liu, Yi L your thought?
>
>Let's go back to the original purpose of this. Peter has identified a
>case in which a context modification is not followed by IOTLB
>invalidation. [1] This is a valid behavior since the old domain is still
>in use, no need to invalidate IOTLB. Hence the shadow page of the
>changed device has not been updated. So the vIOMMU chose to enforce a
>synchronization on the shadow page per context entry modification. Let's
>see if similar requirement on PASID table.

Different devices can share one domain, but it is rare for different devices to share the same PASID table unless they are in the same iommu group, in which case they should always use a common PASID table. I don't think we need to support such a rare case. At least Linux does not work this way.

>
>Let me ask one question: since PASID cache is also tagged with domain
>ID, if the DID has not changed, maybe iommu driver will skip the PASID
>cache flush?

My understanding is that no matter what changes in a PASID entry, there should be a PASID cache invalidation, whether domain-selective, PASID-selective or global.
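
On receiving such an invalidation, the emulator can work out what happened by comparing the cached entry with the latest guest entry, as the commit message of patch 11 describes. A minimal sketch of that comparison follows, as an illustration only: the types and names are hypothetical, and the Present bit is assumed to be bit 0 of the first qword.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Reduced model of a scalable-mode PASID entry (8 qwords). */
typedef struct SketchPASIDEntry {
    uint64_t val[8];
} SketchPASIDEntry;

/* Assumption: the Present bit is bit 0 of val[0]. */
#define SKETCH_PE_PRESENT 1ULL

typedef enum {
    SKETCH_PE_UNCHANGED,
    SKETCH_PE_UPDATED,
    SKETCH_PE_REMOVED,
} SketchPEChange;

/*
 * Compare the cached entry with the entry just fetched from the guest.
 * @guest_fetch_ok is false when the guest table walk itself failed,
 * which is treated like a removed entry.
 */
static SketchPEChange sketch_classify(const SketchPASIDEntry *cached,
                                      const SketchPASIDEntry *guest,
                                      bool guest_fetch_ok)
{
    assert(cached != NULL);

    if (!guest_fetch_ok || guest == NULL ||
        !(guest->val[0] & SKETCH_PE_PRESENT)) {
        return SKETCH_PE_REMOVED;
    }
    if (memcmp(cached, guest, sizeof(*cached)) == 0) {
        return SKETCH_PE_UNCHANGED;
    }
    return SKETCH_PE_UPDATED;
}
```

A caller would drop the cached entry on SKETCH_PE_REMOVED, refresh it on SKETCH_PE_UPDATED, and do nothing otherwise.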

Thanks
Zhenzhong

>
>[1] https://lore.kernel.org/qemu-devel/20170117084604.2b1f5e50@t450s.home/
>
>Regards,
>Yi Liu


end of thread, other threads:[~2025-09-04  6:43 UTC | newest]

Thread overview: 113+ messages
2025-08-22  6:40 [PATCH v5 00/21] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
2025-08-22  6:40 ` [PATCH v5 01/21] intel_iommu: Rename vtd_ce_get_rid2pasid_entry to vtd_ce_get_pasid_entry Zhenzhong Duan
2025-08-22 22:19   ` Nicolin Chen via
2025-08-25  6:01     ` Duan, Zhenzhong
2025-08-22  6:40 ` [PATCH v5 02/21] hw/pci: Introduce pci_device_get_viommu_cap() Zhenzhong Duan
2025-08-22 22:22   ` Nicolin Chen
2025-08-27 11:13   ` Yi Liu
2025-08-27 11:22     ` Eric Auger
2025-08-27 12:30       ` Yi Liu
2025-08-27 12:32         ` Eric Auger
2025-08-27 15:30           ` Nicolin Chen
2025-08-28  8:26             ` Yi Liu
2025-08-28  9:06               ` Duan, Zhenzhong
2025-08-29  1:54                 ` Duan, Zhenzhong
2025-08-29  3:26                   ` Nicolin Chen
2025-09-01  2:35                     ` Duan, Zhenzhong
2025-09-01  2:59                       ` Nicolin Chen
2025-09-01  3:31                         ` Duan, Zhenzhong
2025-08-22  6:40 ` [PATCH v5 03/21] intel_iommu: Implement get_viommu_cap() callback Zhenzhong Duan
2025-08-22 22:23   ` Nicolin Chen
2025-08-22  6:40 ` [PATCH v5 04/21] vfio: Introduce helper vfio_pci_from_vfio_device() Zhenzhong Duan
2025-08-22 22:40   ` Nicolin Chen via
2025-08-25  6:06     ` Duan, Zhenzhong
2025-08-27 11:13   ` Yi Liu
2025-08-27 11:34   ` Eric Auger
2025-09-01 16:36   ` Cédric Le Goater
2025-09-02  2:12     ` Duan, Zhenzhong
2025-08-22  6:40 ` [PATCH v5 05/21] vfio/iommufd: Force creating nested parent domain Zhenzhong Duan
2025-08-22 23:12   ` Nicolin Chen
2025-08-25  8:28     ` Duan, Zhenzhong
2025-08-27 11:51       ` Eric Auger
2025-08-27 11:48   ` Eric Auger
2025-08-28  9:53     ` Duan, Zhenzhong
2025-08-28 13:00       ` Eric Auger
2025-08-29  1:40         ` Duan, Zhenzhong
2025-08-29  3:47           ` Nicolin Chen
2025-08-22  6:40 ` [PATCH v5 06/21] hw/pci: Export pci_device_get_iommu_bus_devfn() and return bool Zhenzhong Duan
2025-08-22 23:13   ` Nicolin Chen
2025-08-27 11:14   ` Yi Liu
2025-08-22  6:40 ` [PATCH v5 07/21] intel_iommu: Introduce a new structure VTDHostIOMMUDevice Zhenzhong Duan
2025-08-22 23:17   ` Nicolin Chen
2025-08-26 17:21   ` Nicolin Chen
2025-08-27  6:45     ` Duan, Zhenzhong
2025-08-27  8:51       ` Nicolin Chen
2025-08-27 16:36     ` Eric Auger
2025-08-27 16:57       ` Nicolin Chen
2025-08-27 11:14   ` Yi Liu
2025-08-28  9:17     ` Duan, Zhenzhong
2025-08-29  2:57       ` Yi Liu
2025-08-22  6:40 ` [PATCH v5 08/21] intel_iommu: Check for compatibility with IOMMUFD backed device when x-flts=on Zhenzhong Duan
2025-08-27 11:42   ` Yi Liu
2025-08-28  9:37     ` Duan, Zhenzhong
2025-08-27 11:55   ` Eric Auger
2025-08-22  6:40 ` [PATCH v5 09/21] intel_iommu: Fail passthrough device under PCI bridge if x-flts=on Zhenzhong Duan
2025-08-28 10:33   ` Yi Liu
2025-09-01  5:14     ` Duan, Zhenzhong
2025-08-22  6:40 ` [PATCH v5 10/21] intel_iommu: Introduce two helpers vtd_as_from/to_iommu_pasid_locked Zhenzhong Duan
2025-08-28 11:36   ` Yi Liu
2025-09-01  5:33     ` Duan, Zhenzhong
2025-09-03  6:30       ` Yi Liu
2025-09-03  7:13         ` Duan, Zhenzhong
2025-08-22  6:40 ` [PATCH v5 11/21] intel_iommu: Handle PASID entry removal and update Zhenzhong Duan
2025-08-27 14:25   ` Eric Auger
2025-09-01  3:17     ` Duan, Zhenzhong
2025-08-28 12:05   ` Yi Liu
2025-09-01  3:31     ` Duan, Zhenzhong
2025-09-03  7:58       ` Yi Liu
2025-09-04  2:37         ` Duan, Zhenzhong
2025-08-22  6:40 ` [PATCH v5 12/21] intel_iommu: Handle PASID entry addition Zhenzhong Duan
2025-08-27 16:22   ` Eric Auger
2025-09-01  9:03     ` Duan, Zhenzhong
2025-09-03  8:52       ` Yi Liu
2025-09-04  2:45         ` Duan, Zhenzhong
2025-08-29  5:46   ` Yi Liu
2025-08-22  6:40 ` [PATCH v5 13/21] intel_iommu: Introduce a new pasid cache invalidation type FORCE_RESET Zhenzhong Duan
2025-08-27 16:28   ` Eric Auger
2025-08-29  5:56     ` Yi Liu
2025-09-01  9:04     ` Duan, Zhenzhong
2025-08-22  6:40 ` [PATCH v5 14/21] intel_iommu: Stick to system MR for IOMMUFD backed host device when x-fls=on Zhenzhong Duan
2025-08-27 17:14   ` Eric Auger
2025-08-29  6:06   ` Yi Liu
2025-08-22  6:40 ` [PATCH v5 15/21] intel_iommu: Bind/unbind guest page table to host Zhenzhong Duan
2025-08-28  8:37   ` Eric Auger
2025-08-29  7:05   ` Yi Liu
2025-08-22  6:40 ` [PATCH v5 16/21] intel_iommu: Replay pasid bindings after context cache invalidation Zhenzhong Duan
2025-08-28  9:43   ` Eric Auger
2025-08-29  7:35     ` Yi Liu
2025-09-01  8:11       ` Duan, Zhenzhong
2025-09-03 10:18         ` Yi Liu
2025-09-04  6:42           ` Duan, Zhenzhong
2025-08-22  6:40 ` [PATCH v5 17/21] intel_iommu: Propagate PASID-based iotlb invalidation to host Zhenzhong Duan
2025-08-28 10:00   ` Eric Auger
2025-08-28 12:11     ` Yi Liu
2025-09-01  8:32     ` Duan, Zhenzhong
2025-08-22  6:40 ` [PATCH v5 18/21] intel_iommu: Replay all pasid bindings when either SRTP or TE bit is changed Zhenzhong Duan
2025-08-28 10:02   ` Eric Auger
2025-08-22  6:40 ` [PATCH v5 19/21] vfio: Add a new element bypass_ro in VFIOContainerBase Zhenzhong Duan
2025-08-28 12:47   ` Eric Auger
2025-08-22  6:40 ` [PATCH v5 20/21] Workaround for ERRATA_772415_SPR17 Zhenzhong Duan
2025-08-22 23:55   ` Nicolin Chen
2025-08-25  9:21     ` Duan, Zhenzhong
2025-08-25 16:58       ` Nicolin Chen
2025-08-27  7:11         ` Duan, Zhenzhong
2025-08-27  8:42           ` Nicolin Chen
2025-08-27 11:56     ` Yi Liu
2025-08-27 15:09       ` Nicolin Chen
2025-08-29  8:16         ` Yi Liu
2025-08-29  8:54           ` Nicolin Chen
2025-08-22  6:40 ` [PATCH v5 21/21] intel_iommu: Enable host device when x-flts=on in scalable mode Zhenzhong Duan
2025-08-28 12:51   ` Eric Auger
2025-08-29  7:42   ` Yi Liu
2025-08-27 11:13 ` [PATCH v5 00/21] intel_iommu: Enable stage-1 translation for passthrough device Yi Liu
2025-08-28  5:53   ` Duan, Zhenzhong
