* [PATCH v3 01/20] intel_iommu: Rename vtd_ce_get_rid2pasid_entry to vtd_ce_get_pasid_entry
2025-07-08 11:05 [PATCH v3 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
@ 2025-07-08 11:05 ` Zhenzhong Duan
2025-07-08 11:05 ` [PATCH v3 02/20] hw/pci: Introduce pci_device_get_viommu_cap() Zhenzhong Duan
` (18 subsequent siblings)
19 siblings, 0 replies; 55+ messages in thread
From: Zhenzhong Duan @ 2025-07-08 11:05 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan
In early days vtd_ce_get_rid2pasid_entry() was used to get pasid entry
of rid2pasid, then it was extended to get any pasid entry. So a new name
vtd_ce_get_pasid_entry is better to match what it actually does.
No functional change intended.
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Clément Mathieu--Drif <clement.mathieu--drif@eviden.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
---
hw/i386/intel_iommu.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 69d72ad35c..f0b1f90eff 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -944,7 +944,7 @@ static int vtd_get_pe_from_pasid_table(IntelIOMMUState *s,
return 0;
}
-static int vtd_ce_get_rid2pasid_entry(IntelIOMMUState *s,
+static int vtd_ce_get_pasid_entry(IntelIOMMUState *s,
VTDContextEntry *ce,
VTDPASIDEntry *pe,
uint32_t pasid)
@@ -1025,7 +1025,7 @@ static uint32_t vtd_get_iova_level(IntelIOMMUState *s,
VTDPASIDEntry pe;
if (s->root_scalable) {
- vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
+ vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
if (s->flts) {
return VTD_PE_GET_FL_LEVEL(&pe);
} else {
@@ -1048,7 +1048,7 @@ static uint32_t vtd_get_iova_agaw(IntelIOMMUState *s,
VTDPASIDEntry pe;
if (s->root_scalable) {
- vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
+ vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
return 30 + ((pe.val[0] >> 2) & VTD_SM_PASID_ENTRY_AW) * 9;
}
@@ -1116,7 +1116,7 @@ static dma_addr_t vtd_get_iova_pgtbl_base(IntelIOMMUState *s,
VTDPASIDEntry pe;
if (s->root_scalable) {
- vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
+ vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
if (s->flts) {
return pe.val[2] & VTD_SM_PASID_ENTRY_FLPTPTR;
} else {
@@ -1522,7 +1522,7 @@ static int vtd_ce_rid2pasid_check(IntelIOMMUState *s,
* has valid rid2pasid setting, which includes valid
* rid2pasid field and corresponding pasid entry setting
*/
- return vtd_ce_get_rid2pasid_entry(s, ce, &pe, PCI_NO_PASID);
+ return vtd_ce_get_pasid_entry(s, ce, &pe, PCI_NO_PASID);
}
/* Map a device to its corresponding domain (context-entry) */
@@ -1611,7 +1611,7 @@ static uint16_t vtd_get_domain_id(IntelIOMMUState *s,
VTDPASIDEntry pe;
if (s->root_scalable) {
- vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
+ vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
return VTD_SM_PASID_ENTRY_DID(pe.val[1]);
}
@@ -1687,7 +1687,7 @@ static bool vtd_dev_pt_enabled(IntelIOMMUState *s, VTDContextEntry *ce,
int ret;
if (s->root_scalable) {
- ret = vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
+ ret = vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
if (ret) {
/*
* This error is guest triggerable. We should assume PT
--
2.47.1
^ permalink raw reply related [flat|nested] 55+ messages in thread
* [PATCH v3 02/20] hw/pci: Introduce pci_device_get_viommu_cap()
2025-07-08 11:05 [PATCH v3 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
2025-07-08 11:05 ` [PATCH v3 01/20] intel_iommu: Rename vtd_ce_get_rid2pasid_entry to vtd_ce_get_pasid_entry Zhenzhong Duan
@ 2025-07-08 11:05 ` Zhenzhong Duan
2025-07-09 0:39 ` Nicolin Chen
2025-07-15 15:36 ` Eric Auger
2025-07-08 11:05 ` [PATCH v3 03/20] intel_iommu: Implement get_viommu_cap() callback Zhenzhong Duan
` (17 subsequent siblings)
19 siblings, 2 replies; 55+ messages in thread
From: Zhenzhong Duan @ 2025-07-08 11:05 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan
pci_device_get_viommu_cap() calls pci_device_get_iommu_bus_devfn()
to get iommu_bus->iommu_ops and calls the get_viommu_cap() callback to
get a bitmap in which each bit represents a vIOMMU exposed capability.
Suggested-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
MAINTAINERS | 1 +
hw/pci/pci.c | 11 +++++++++++
include/hw/iommu.h | 16 ++++++++++++++++
include/hw/pci/pci.h | 23 +++++++++++++++++++++++
4 files changed, 51 insertions(+)
create mode 100644 include/hw/iommu.h
diff --git a/MAINTAINERS b/MAINTAINERS
index 1842c3dd83..d9fc977b81 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2302,6 +2302,7 @@ F: include/system/iommufd.h
F: backends/host_iommu_device.c
F: include/system/host_iommu_device.h
F: include/qemu/chardev_open.h
+F: include/hw/iommu.h
F: util/chardev_open.c
F: docs/devel/vfio-iommufd.rst
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index c70b5ceeba..df1fb615a8 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -2992,6 +2992,17 @@ void pci_device_unset_iommu_device(PCIDevice *dev)
}
}
+uint64_t pci_device_get_viommu_cap(PCIDevice *dev)
+{
+ PCIBus *iommu_bus;
+
+ pci_device_get_iommu_bus_devfn(dev, &iommu_bus, NULL, NULL);
+ if (iommu_bus && iommu_bus->iommu_ops->get_viommu_cap) {
+ return iommu_bus->iommu_ops->get_viommu_cap(iommu_bus->iommu_opaque);
+ }
+ return 0;
+}
+
int pci_pri_request_page(PCIDevice *dev, uint32_t pasid, bool priv_req,
bool exec_req, hwaddr addr, bool lpig,
uint16_t prgi, bool is_read, bool is_write)
diff --git a/include/hw/iommu.h b/include/hw/iommu.h
new file mode 100644
index 0000000000..e80aaf4431
--- /dev/null
+++ b/include/hw/iommu.h
@@ -0,0 +1,16 @@
+/*
+ * General vIOMMU capabilities, flags, etc
+ *
+ * Copyright (C) 2025 Intel Corporation.
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#ifndef HW_IOMMU_H
+#define HW_IOMMU_H
+
+enum {
+ VIOMMU_CAP_STAGE1 = BIT_ULL(0), /* stage1 page table supported */
+};
+
+#endif /* HW_IOMMU_H */
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index df3cc7b875..a11ab14bdc 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -453,6 +453,19 @@ typedef struct PCIIOMMUOps {
* @devfn: device and function number of the PCI device.
*/
void (*unset_iommu_device)(PCIBus *bus, void *opaque, int devfn);
+ /**
+ * @get_viommu_cap: get vIOMMU capabilities
+ *
+ * Optional callback, if not implemented, then vIOMMU doesn't
+ * support exposing capabilities to other subsystem, e.g., VFIO.
+ * vIOMMU can choose which capabilities to expose.
+ *
+ * @opaque: the data passed to pci_setup_iommu().
+ *
+ * Returns: a 64-bit bitmap in which each bit represents a capability
+ * defined by VIOMMU_CAP_* in include/hw/iommu.h
+ */
+ uint64_t (*get_viommu_cap)(void *opaque);
/**
* @get_iotlb_info: get properties required to initialize a device IOTLB.
*
@@ -633,6 +646,16 @@ bool pci_device_set_iommu_device(PCIDevice *dev, HostIOMMUDevice *hiod,
Error **errp);
void pci_device_unset_iommu_device(PCIDevice *dev);
+/**
+ * pci_device_get_viommu_cap: get vIOMMU capabilities.
+ *
+ * Returns a 64-bit bitmap in which each bit represents a vIOMMU exposed
+ * capability, or 0 if the vIOMMU doesn't support exposing capabilities.
+ *
+ * @dev: PCI device pointer.
+ */
+uint64_t pci_device_get_viommu_cap(PCIDevice *dev);
+
/**
* pci_iommu_get_iotlb_info: get properties required to initialize a
* device IOTLB.
--
2.47.1
^ permalink raw reply related [flat|nested] 55+ messages in thread
* Re: [PATCH v3 02/20] hw/pci: Introduce pci_device_get_viommu_cap()
2025-07-08 11:05 ` [PATCH v3 02/20] hw/pci: Introduce pci_device_get_viommu_cap() Zhenzhong Duan
@ 2025-07-09 0:39 ` Nicolin Chen
2025-07-09 3:38 ` Duan, Zhenzhong
2025-07-09 17:55 ` Donald Dutile
2025-07-15 15:36 ` Eric Auger
1 sibling, 2 replies; 55+ messages in thread
From: Nicolin Chen @ 2025-07-09 0:39 UTC (permalink / raw)
To: Zhenzhong Duan
Cc: qemu-devel, alex.williamson, clg, eric.auger, mst, jasowang,
peterx, ddutile, jgg, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng
On Tue, Jul 08, 2025 at 07:05:43AM -0400, Zhenzhong Duan wrote:
> diff --git a/include/hw/iommu.h b/include/hw/iommu.h
> new file mode 100644
> index 0000000000..e80aaf4431
> --- /dev/null
> +++ b/include/hw/iommu.h
> @@ -0,0 +1,16 @@
> +/*
> + * General vIOMMU capabilities, flags, etc
> + *
> + * Copyright (C) 2025 Intel Corporation.
> + *
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + */
> +
> +#ifndef HW_IOMMU_H
> +#define HW_IOMMU_H
> +
> +enum {
> + VIOMMU_CAP_STAGE1 = BIT_ULL(0), /* stage1 page table supported */
> +};
Thanks for this work. I am happy to see that we can share the
common code that allocates a NESTING_PARENT in the core using
this flag.
Yet on ARM, a STAGE1 page table isn't always a nested S1, the
hardware accelerated one. More often, it can be just a regular
1-stage translation table via emulated translation code and an
emulated iotlb.
I think this flag should indicate that the vIOMMU supports a
HW-accelerated nested S1 HWPT allocation/invalidation.
So, perhaps:
/* hardware-accelerated nested stage-1 page table support */
VIOMMU_CAP_NESTED_S1 = BIT_ULL(0),
?
Nicolin
^ permalink raw reply [flat|nested] 55+ messages in thread
* RE: [PATCH v3 02/20] hw/pci: Introduce pci_device_get_viommu_cap()
2025-07-09 0:39 ` Nicolin Chen
@ 2025-07-09 3:38 ` Duan, Zhenzhong
2025-07-09 4:03 ` Nicolin Chen
2025-07-09 17:55 ` Donald Dutile
1 sibling, 1 reply; 55+ messages in thread
From: Duan, Zhenzhong @ 2025-07-09 3:38 UTC (permalink / raw)
To: Nicolin Chen
Cc: qemu-devel@nongnu.org, alex.williamson@redhat.com, clg@redhat.com,
eric.auger@redhat.com, mst@redhat.com, jasowang@redhat.com,
peterx@redhat.com, ddutile@redhat.com, jgg@nvidia.com,
shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
Peng, Chao P
>-----Original Message-----
>From: Nicolin Chen <nicolinc@nvidia.com>
>Subject: Re: [PATCH v3 02/20] hw/pci: Introduce
>pci_device_get_viommu_cap()
>
>On Tue, Jul 08, 2025 at 07:05:43AM -0400, Zhenzhong Duan wrote:
>> diff --git a/include/hw/iommu.h b/include/hw/iommu.h
>> new file mode 100644
>> index 0000000000..e80aaf4431
>> --- /dev/null
>> +++ b/include/hw/iommu.h
>> @@ -0,0 +1,16 @@
>> +/*
>> + * General vIOMMU capabilities, flags, etc
>> + *
>> + * Copyright (C) 2025 Intel Corporation.
>> + *
>> + * SPDX-License-Identifier: GPL-2.0-or-later
>> + */
>> +
>> +#ifndef HW_IOMMU_H
>> +#define HW_IOMMU_H
>> +
>> +enum {
>> + VIOMMU_CAP_STAGE1 = BIT_ULL(0), /* stage1 page table
>supported */
>> +};
>
>Thanks for this work. I am happy to see that we can share the
>common code that allocates a NESTING_PARENT in the core using
>this flag.
>
>Yet on ARM, a STAGE1 page table isn't always a nested S1, the
>hardware accelerated one. More often, it can be just a regular
>1-stage translation table via emulated translation code and an
>emulated iotlb.
What to return for an emulated device, VIOMMU_CAP_NESTED_S1 or 0?
>
>I think this flag should indicate that the vIOMMU supports a
>HW-accelerated nested S1 HWPT allocation/invalidation.
I'm OK to use the VIOMMU_CAP_NESTED_S1 name. But it is still unclear which checks we should have in the .get_viommu_cap() callback to determine whether to return VIOMMU_CAP_NESTED_S1 or not,
as we know nesting support is determined by the host IOMMU, not the vIOMMU.
Thanks
Zhenzhong
>
>So, perhaps:
> /* hardware-accelerated nested stage-1 page table support */
> VIOMMU_CAP_NESTED_S1 = BIT_ULL(0),
>?
>
>Nicolin
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v3 02/20] hw/pci: Introduce pci_device_get_viommu_cap()
2025-07-09 3:38 ` Duan, Zhenzhong
@ 2025-07-09 4:03 ` Nicolin Chen
2025-07-09 5:52 ` Duan, Zhenzhong
0 siblings, 1 reply; 55+ messages in thread
From: Nicolin Chen @ 2025-07-09 4:03 UTC (permalink / raw)
To: Duan, Zhenzhong
Cc: qemu-devel@nongnu.org, alex.williamson@redhat.com, clg@redhat.com,
eric.auger@redhat.com, mst@redhat.com, jasowang@redhat.com,
peterx@redhat.com, ddutile@redhat.com, jgg@nvidia.com,
shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
Peng, Chao P
On Wed, Jul 09, 2025 at 03:38:49AM +0000, Duan, Zhenzhong wrote:
> >> +enum {
> >> + VIOMMU_CAP_STAGE1 = BIT_ULL(0), /* stage1 page table
> >supported */
> >> +};
> >
> >Thanks for this work. I am happy to see that we can share the
> >common code that allocates a NESTING_PARENT in the core using
> >this flag.
> >
> >Yet on ARM, a STAGE1 page table isn't always a nested S1, the
> >hardware accelerated one. More often, it can be just a regular
> >1-stage translation table via emulated translation code and an
> >emulated iotlb.
>
> What to return for an emulated device, VIOMMU_CAP_NESTED_S1 or 0?
It would be ideally 0, since the get_viommu_cap is a per VFIO/PCI
device op. But I see the caller of pci_device_get_viommu_cap() in
this series has already done the identification between "emulated"
and "passthrough" devices?
That being said, it doesn't hurt to run it again in the callback
function.
> >I think this flag should indicate that the vIOMMU supports a
> >HW-accelerated nested S1 HWPT allocation/invalidation.
>
> I'm OK to use VIOMMU_CAP_NESTED_S1 name. But still unclear which
> checks should we have in .get_viommu_cap() callback to determine
> returning VIOMMU_CAP_NESTED_S1 or not,
> as we know nesting support is determined by host IOMMU not vIOMMU.
I would say it's determined by both the host and vIOMMU.
This is a common API that may return other caps too: eventually
there can be a case where vIOMMU has its get_viommu_cap callback
function but doesn't implement HW nesting via iommufd, even if
the host IOMMU driver supports nesting.
Thanks
Nicolin
^ permalink raw reply [flat|nested] 55+ messages in thread
* RE: [PATCH v3 02/20] hw/pci: Introduce pci_device_get_viommu_cap()
2025-07-09 4:03 ` Nicolin Chen
@ 2025-07-09 5:52 ` Duan, Zhenzhong
0 siblings, 0 replies; 55+ messages in thread
From: Duan, Zhenzhong @ 2025-07-09 5:52 UTC (permalink / raw)
To: Nicolin Chen
Cc: qemu-devel@nongnu.org, alex.williamson@redhat.com, clg@redhat.com,
eric.auger@redhat.com, mst@redhat.com, jasowang@redhat.com,
peterx@redhat.com, ddutile@redhat.com, jgg@nvidia.com,
shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
Peng, Chao P
>-----Original Message-----
>From: Nicolin Chen <nicolinc@nvidia.com>
>Subject: Re: [PATCH v3 02/20] hw/pci: Introduce
>pci_device_get_viommu_cap()
>
>On Wed, Jul 09, 2025 at 03:38:49AM +0000, Duan, Zhenzhong wrote:
>> >> +enum {
>> >> + VIOMMU_CAP_STAGE1 = BIT_ULL(0), /* stage1 page table
>> >supported */
>> >> +};
>> >
>> >Thanks for this work. I am happy to see that we can share the
>> >common code that allocates a NESTING_PARENT in the core using
>> >this flag.
>> >
>> >Yet on ARM, a STAGE1 page table isn't always a nested S1, the
>> >hardware accelerated one. More often, it can be just a regular
>> >1-stage translation table via emulated translation code and an
>> >emulated iotlb.
>>
>> What to return for an emulated device, VIOMMU_CAP_NESTED_S1 or 0?
>
>It would be ideally 0, since the get_viommu_cap is a per VFIO/PCI
>device op. But I see the caller of pci_device_get_viommu_cap() in
>this series has already done the identification between "emulated"
>and "passthrough" devices?
We do allow an emulated PCI device to call pci_device_get_viommu_cap().
It's just that there is no such request for now.
>
>That being said, it doesn't hurt to run it again in the callback
>function.
>
>> >I think this flag should indicate that the vIOMMU supports a
>> >HW-accelerated nested S1 HWPT allocation/invalidation.
>>
>> I'm OK to use VIOMMU_CAP_NESTED_S1 name. But still unclear which
>> checks should we have in .get_viommu_cap() callback to determine
>> returning VIOMMU_CAP_NESTED_S1 or not,
>> as we know nesting support is determined by host IOMMU not vIOMMU.
>
>I would say it's determined by both the host and vIOMMU.
>
>This is a common API that may return other caps too: eventually
>there can be a case where vIOMMU has its get_viommu_cap callback
>function but doesn't implement HW nesting via iommufd, even if
>the host IOMMU driver supports nesting.
OK, will use VIOMMU_CAP_NESTED_S1.
But for VTD, we check nesting compatibility in the .set_iommu_device callback;
in .get_viommu_cap, we will return VIOMMU_CAP_NESTED_S1 if the vIOMMU supports it, not checking the host IOMMU, for simpler code.
Thanks
Zhenzhong
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v3 02/20] hw/pci: Introduce pci_device_get_viommu_cap()
2025-07-09 0:39 ` Nicolin Chen
2025-07-09 3:38 ` Duan, Zhenzhong
@ 2025-07-09 17:55 ` Donald Dutile
2025-07-09 19:20 ` Nicolin Chen
1 sibling, 1 reply; 55+ messages in thread
From: Donald Dutile @ 2025-07-09 17:55 UTC (permalink / raw)
To: Nicolin Chen, Zhenzhong Duan
Cc: qemu-devel, alex.williamson, clg, eric.auger, mst, jasowang,
peterx, jgg, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng
On 7/8/25 8:39 PM, Nicolin Chen wrote:
> On Tue, Jul 08, 2025 at 07:05:43AM -0400, Zhenzhong Duan wrote:
>> diff --git a/include/hw/iommu.h b/include/hw/iommu.h
>> new file mode 100644
>> index 0000000000..e80aaf4431
>> --- /dev/null
>> +++ b/include/hw/iommu.h
>> @@ -0,0 +1,16 @@
>> +/*
>> + * General vIOMMU capabilities, flags, etc
>> + *
>> + * Copyright (C) 2025 Intel Corporation.
>> + *
>> + * SPDX-License-Identifier: GPL-2.0-or-later
>> + */
>> +
>> +#ifndef HW_IOMMU_H
>> +#define HW_IOMMU_H
>> +
>> +enum {
>> + VIOMMU_CAP_STAGE1 = BIT_ULL(0), /* stage1 page table supported */
>> +};
>
> Thanks for this work. I am happy to see that we can share the
> common code that allocates a NESTING_PARENT in the core using
> this flag.
>
> Yet on ARM, a STAGE1 page table isn't always a nested S1, the
> hardware accelerated one. More often, it can be just a regular
> 1-stage translation table via emulated translation code and an
> emulated iotlb.
>
Because the user-created smmuv3 started as 'accelerated smmuv3',
and had been 'de-accelerated' to simply 'user created smmuv3',
I'm looking for some clarification in the above statement/request.
Is the above supposed to reflect that a nested IOMMU has some hw-acceleration
in its Stage1 implementation?
If so, then call it that: STAGE1_ACCEL.
If it's supposed to represent that an IOMMU has nested/2-stage support,
then the above is a valid cap; -but-, having a nested/2-stage support IOMMU
doesn't necessarily mean its accelerated.
So, let's ensure terminology, esp. bit-macro names(pace) reflects
the (exact) meaning, and not any indirect meaning.
A recent kvm-arm email thread had such indirect naming cleaned up
when referring to pfn-map/device-map/struct-page-mapped pages, which
I'd like not to start down a similar mis-naming/indirect-naming path here.
Look forward to the clarification.
- Don
> I think this flag should indicate that the vIOMMU supports a
> HW-accelerated nested S1 HWPT allocation/invalidation.
>
> So, perhaps:
> /* hardware-accelerated nested stage-1 page table support */
> VIOMMU_CAP_NESTED_S1 = BIT_ULL(0),
> ?
>
> Nicolin
>
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v3 02/20] hw/pci: Introduce pci_device_get_viommu_cap()
2025-07-09 17:55 ` Donald Dutile
@ 2025-07-09 19:20 ` Nicolin Chen
2025-07-10 1:22 ` Donald Dutile
2025-07-10 8:11 ` Shameerali Kolothum Thodi via
0 siblings, 2 replies; 55+ messages in thread
From: Nicolin Chen @ 2025-07-09 19:20 UTC (permalink / raw)
To: Donald Dutile
Cc: Zhenzhong Duan, qemu-devel, alex.williamson, clg, eric.auger, mst,
jasowang, peterx, jgg, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng
On Wed, Jul 09, 2025 at 01:55:46PM -0400, Donald Dutile wrote:
> > > +enum {
> > > + VIOMMU_CAP_STAGE1 = BIT_ULL(0), /* stage1 page table supported */
> > > +};
> >
> > Thanks for this work. I am happy to see that we can share the
> > common code that allocates a NESTING_PARENT in the core using
> > this flag.
> >
> > Yet on ARM, a STAGE1 page table isn't always a nested S1, the
> > hardware accelerated one. More often, it can be just a regular
> > 1-stage translation table via emulated translation code and an
> > emulated iotlb.
> >
> Because the user-created smmuv3 started as 'accelerated smmuv3',
> and had been 'de-accelerated' to simply 'user created smmuv3',
> I'm looking for some clarification in the above statement/request.
>
> Is the above suppose to reflect that a nested IOMMU has some hw-acceleration
> in its Stage1 implementation?
> If so, then call it that: STAGE1_ACCEL.
> If it's suppose to represent that an IOMMU has nested/2-stage support,
> then the above is a valid cap; -but-, having a nested/2-stage support IOMMU
> doesn't necessarily mean its accelerated.
Well, there are an emulated "nested" mode and an hw-accelerated
"nested" mode in the smmuv3 code, so we had to choose something
like "accel" over "nested".
Here, on the other hand, I think the core using this CAP would
unlikely care about an emulated "nested" mode in the individual
vIOMMU..
So I suggested:
/* hardware-accelerated nested stage-1 page table support */
VIOMMU_CAP_NESTED_S1 = BIT_ULL(0),
which it should be clear IMHO.
If not, maybe go a bit further like "VIOMMU_CAP_HW_NESTED_S1"?
Thanks
Nicolin
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v3 02/20] hw/pci: Introduce pci_device_get_viommu_cap()
2025-07-09 19:20 ` Nicolin Chen
@ 2025-07-10 1:22 ` Donald Dutile
2025-07-15 15:28 ` Eric Auger
2025-07-10 8:11 ` Shameerali Kolothum Thodi via
1 sibling, 1 reply; 55+ messages in thread
From: Donald Dutile @ 2025-07-10 1:22 UTC (permalink / raw)
To: Nicolin Chen
Cc: Zhenzhong Duan, qemu-devel, alex.williamson, clg, eric.auger, mst,
jasowang, peterx, jgg, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng
On 7/9/25 3:20 PM, Nicolin Chen wrote:
> On Wed, Jul 09, 2025 at 01:55:46PM -0400, Donald Dutile wrote:
>>>> +enum {
>>>> + VIOMMU_CAP_STAGE1 = BIT_ULL(0), /* stage1 page table supported */
>>>> +};
>>>
>>> Thanks for this work. I am happy to see that we can share the
>>> common code that allocates a NESTING_PARENT in the core using
>>> this flag.
>>>
>>> Yet on ARM, a STAGE1 page table isn't always a nested S1, the
>>> hardware accelerated one. More often, it can be just a regular
>>> 1-stage translation table via emulated translation code and an
>>> emulated iotlb.
>>>
>> Because the user-created smmuv3 started as 'accelerated smmuv3',
>> and had been 'de-accelerated' to simply 'user created smmuv3',
>> I'm looking for some clarification in the above statement/request.
>>
>> Is the above suppose to reflect that a nested IOMMU has some hw-acceleration
>> in its Stage1 implementation?
>> If so, then call it that: STAGE1_ACCEL.
>> If it's suppose to represent that an IOMMU has nested/2-stage support,
>> then the above is a valid cap; -but-, having a nested/2-stage support IOMMU
>> doesn't necessarily mean its accelerated.
>
> Well, there are an emulated "nested" mode and an hw-accelerated
> "nested" mode in the smmuv3 code, so we had to choose something
> like "accel" over "nested".
>
> Here, on the other hand, I think the core using this CAP would
> unlikely care about an emulated "nested" mode in the individual
> vIOMMU..
>
> So I suggested:
> /* hardware-accelerated nested stage-1 page table support */
> VIOMMU_CAP_NESTED_S1 = BIT_ULL(0),
>
> which it should be clear IMHO.
>
> If not, maybe go a bit further like "VIOMMU_CAP_HW_NESTED_S1"?
>
> Thanks
> Nicolin
>
If the distinction is hw-based s1 vs emulated-based s1, than
I'd prefer the use of VIOMMU_CAP_HW_NESTED_S1, and avoid the use
of 'accel'/'ACCEL' unless it is an explicitly stated 'acceleration'
feature/option in the SMMU spec.
Thanks,
- Don
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v3 02/20] hw/pci: Introduce pci_device_get_viommu_cap()
2025-07-10 1:22 ` Donald Dutile
@ 2025-07-15 15:28 ` Eric Auger
2025-07-15 16:42 ` Donald Dutile
0 siblings, 1 reply; 55+ messages in thread
From: Eric Auger @ 2025-07-15 15:28 UTC (permalink / raw)
To: Donald Dutile, Nicolin Chen
Cc: Zhenzhong Duan, qemu-devel, alex.williamson, clg, mst, jasowang,
peterx, jgg, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng
Hi,
On 7/10/25 3:22 AM, Donald Dutile wrote:
>
>
> On 7/9/25 3:20 PM, Nicolin Chen wrote:
>> On Wed, Jul 09, 2025 at 01:55:46PM -0400, Donald Dutile wrote:
>>>>> +enum {
>>>>> + VIOMMU_CAP_STAGE1 = BIT_ULL(0), /* stage1 page table
>>>>> supported */
>>>>> +};
>>>>
>>>> Thanks for this work. I am happy to see that we can share the
>>>> common code that allocates a NESTING_PARENT in the core using
>>>> this flag.
>>>>
>>>> Yet on ARM, a STAGE1 page table isn't always a nested S1, the
>>>> hardware accelerated one. More often, it can be just a regular
>>>> 1-stage translation table via emulated translation code and an
>>>> emulated iotlb.
>>>>
>>> Because the user-created smmuv3 started as 'accelerated smmuv3',
>>> and had been 'de-accelerated' to simply 'user created smmuv3',
>>> I'm looking for some clarification in the above statement/request.
>>>
>>> Is the above suppose to reflect that a nested IOMMU has some
>>> hw-acceleration
>>> in its Stage1 implementation?
>>> If so, then call it that: STAGE1_ACCEL.
>>> If it's suppose to represent that an IOMMU has nested/2-stage support,
>>> then the above is a valid cap; -but-, having a nested/2-stage
>>> support IOMMU
>>> doesn't necessarily mean its accelerated.
>>
>> Well, there are an emulated "nested" mode and an hw-accelerated
>> "nested" mode in the smmuv3 code, so we had to choose something
>> like "accel" over "nested".
>>
>> Here, on the other hand, I think the core using this CAP would
>> unlikely care about an emulated "nested" mode in the individual
>> vIOMMU..
>>
>> So I suggested:
>> /* hardware-accelerated nested stage-1 page table support */
>> VIOMMU_CAP_NESTED_S1 = BIT_ULL(0),
>>
>> which it should be clear IMHO.
>>
>> If not, maybe go a bit further like "VIOMMU_CAP_HW_NESTED_S1"?
>>
>> Thanks
>> Nicolin
>>
> If the distinction is hw-based s1 vs emulated-based s1, than
> I'd prefer the use of VIOMMU_CAP_HW_NESTED_S1, and avoid the use
VIOMMU_CAP_HW_NESTED_S1 or even VIOMMU_CAP_HW_NESTED look good to me too
Thanks
Eric
> of 'accel'/'ACCEL' unless it is an explicitly stated 'acceleration'
> feature/option in the SMMU spec.
>
> Thanks,
> - Don
>
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v3 02/20] hw/pci: Introduce pci_device_get_viommu_cap()
2025-07-15 15:28 ` Eric Auger
@ 2025-07-15 16:42 ` Donald Dutile
0 siblings, 0 replies; 55+ messages in thread
From: Donald Dutile @ 2025-07-15 16:42 UTC (permalink / raw)
To: eric.auger, Nicolin Chen
Cc: Zhenzhong Duan, qemu-devel, alex.williamson, clg, mst, jasowang,
peterx, jgg, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng
On 7/15/25 11:28 AM, Eric Auger wrote:
> Hi,
>
> On 7/10/25 3:22 AM, Donald Dutile wrote:
>>
>>
>> On 7/9/25 3:20 PM, Nicolin Chen wrote:
>>> On Wed, Jul 09, 2025 at 01:55:46PM -0400, Donald Dutile wrote:
>>>>>> +enum {
>>>>>> + VIOMMU_CAP_STAGE1 = BIT_ULL(0), /* stage1 page table
>>>>>> supported */
>>>>>> +};
>>>>>
>>>>> Thanks for this work. I am happy to see that we can share the
>>>>> common code that allocates a NESTING_PARENT in the core using
>>>>> this flag.
>>>>>
>>>>> Yet on ARM, a STAGE1 page table isn't always a nested S1, the
>>>>> hardware accelerated one. More often, it can be just a regular
>>>>> 1-stage translation table via emulated translation code and an
>>>>> emulated iotlb.
>>>>>
>>>> Because the user-created smmuv3 started as 'accelerated smmuv3',
>>>> and had been 'de-accelerated' to simply 'user created smmuv3',
>>>> I'm looking for some clarification in the above statement/request.
>>>>
>>>> Is the above suppose to reflect that a nested IOMMU has some
>>>> hw-acceleration
>>>> in its Stage1 implementation?
>>>> If so, then call it that: STAGE1_ACCEL.
>>>> If it's suppose to represent that an IOMMU has nested/2-stage support,
>>>> then the above is a valid cap; -but-, having a nested/2-stage
>>>> support IOMMU
>>>> doesn't necessarily mean its accelerated.
>>>
>>> Well, there are an emulated "nested" mode and an hw-accelerated
>>> "nested" mode in the smmuv3 code, so we had to choose something
>>> like "accel" over "nested".
>>>
>>> Here, on the other hand, I think the core using this CAP would
>>> unlikely care about an emulated "nested" mode in the individual
>>> vIOMMU..
>>>
>>> So I suggested:
>>> /* hardware-accelerated nested stage-1 page table support */
>>> VIOMMU_CAP_NESTED_S1 = BIT_ULL(0),
>>>
>>> which it should be clear IMHO.
>>>
>>> If not, maybe go a bit further like "VIOMMU_CAP_HW_NESTED_S1"?
>>>
>>> Thanks
>>> Nicolin
>>>
>> If the distinction is hw-based s1 vs emulated-based s1, than
>> I'd prefer the use of VIOMMU_CAP_HW_NESTED_S1, and avoid the use
> VIOMMU_CAP_HW_NESTED_S1 or even VIOMMU_CAP_HW_NESTED look good to me too
>
> Thanks
>
> Eric
+1 to VIOMMU_CAP_HW_NESTED ; the S1 is unnecessary if nested IOMMU is the info being conveyed.
- Don
>> of 'accel'/'ACCEL' unless it is an explicitly stated 'acceleration'
>> feature/option in the SMMU spec.
>>
>> Thanks,
>> - Don
>>
>
^ permalink raw reply [flat|nested] 55+ messages in thread
* RE: [PATCH v3 02/20] hw/pci: Introduce pci_device_get_viommu_cap()
2025-07-09 19:20 ` Nicolin Chen
2025-07-10 1:22 ` Donald Dutile
@ 2025-07-10 8:11 ` Shameerali Kolothum Thodi via
2025-07-10 17:01 ` Donald Dutile
2025-07-10 17:16 ` Nicolin Chen
1 sibling, 2 replies; 55+ messages in thread
From: Shameerali Kolothum Thodi via @ 2025-07-10 8:11 UTC (permalink / raw)
To: Nicolin Chen, Donald Dutile
Cc: Zhenzhong Duan, qemu-devel@nongnu.org, alex.williamson@redhat.com,
clg@redhat.com, eric.auger@redhat.com, mst@redhat.com,
jasowang@redhat.com, peterx@redhat.com, jgg@nvidia.com,
joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
kevin.tian@intel.com, yi.l.liu@intel.com, chao.p.peng@intel.com
> -----Original Message-----
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Wednesday, July 9, 2025 8:20 PM
> To: Donald Dutile <ddutile@redhat.com>
> Cc: Zhenzhong Duan <zhenzhong.duan@intel.com>; qemu-
> devel@nongnu.org; alex.williamson@redhat.com; clg@redhat.com;
> eric.auger@redhat.com; mst@redhat.com; jasowang@redhat.com;
> peterx@redhat.com; jgg@nvidia.com; Shameerali Kolothum Thodi
> <shameerali.kolothum.thodi@huawei.com>; joao.m.martins@oracle.com;
> clement.mathieu--drif@eviden.com; kevin.tian@intel.com;
> yi.l.liu@intel.com; chao.p.peng@intel.com
> Subject: Re: [PATCH v3 02/20] hw/pci: Introduce
> pci_device_get_viommu_cap()
>
> On Wed, Jul 09, 2025 at 01:55:46PM -0400, Donald Dutile wrote:
> > > > +enum {
> > > > + VIOMMU_CAP_STAGE1 = BIT_ULL(0), /* stage1 page table
> supported */
> > > > +};
> > >
> > > Thanks for this work. I am happy to see that we can share the
> > > common code that allocates a NESTING_PARENT in the core using
> > > this flag.
> > >
> > > Yet on ARM, a STAGE1 page table isn't always a nested S1, the
> > > hardware accelerated one. More often, it can be just a regular
> > > 1-stage translation table via emulated translation code and an
> > > emulated iotlb.
> > >
> > Because the user-created smmuv3 started as 'accelerated smmuv3',
> > and had been 'de-accelerated' to simply 'user created smmuv3',
> > I'm looking for some clarification in the above statement/request.
> >
> > Is the above suppose to reflect that a nested IOMMU has some hw-
> acceleration
> > in its Stage1 implementation?
> > If so, then call it that: STAGE1_ACCEL.
> > If it's suppose to represent that an IOMMU has nested/2-stage support,
> > then the above is a valid cap; -but-, having a nested/2-stage support
> IOMMU
> > doesn't necessarily mean its accelerated.
>
> Well, there are an emulated "nested" mode and an hw-accelerated
> "nested" mode in the smmuv3 code, so we had to choose something
> like "accel" over "nested".
>
> Here, on the other hand, I think the core using this CAP would
> unlikely care about an emulated "nested" mode in the individual
> vIOMMU..
>
> So I suggested:
> /* hardware-accelerated nested stage-1 page table support */
> VIOMMU_CAP_NESTED_S1 = BIT_ULL(0),
>
> which it should be clear IMHO.
>
> If not, maybe go a bit further like "VIOMMU_CAP_HW_NESTED_S1"?
I am not sure the _S1 part makes much sense in ARM case. It doesn't matter
whether the Guest SMMUv3 is configured in s1/s2 or nested for this CAP.
With the new SMMUv3 dev support, the user can pretty much specify,
-device arm-smmuv3,primary-bus=pcie.0,id=smmuv3.1,accel=on,stage={stage1|stage2|nested}
And I think it will work with a host SMMUv3 nested configuration in all the
above cases. Unless I am missing something and we need to restrict its
use with stage=stage1 only.
Thanks,
Shameer
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v3 02/20] hw/pci: Introduce pci_device_get_viommu_cap()
2025-07-10 8:11 ` Shameerali Kolothum Thodi via
@ 2025-07-10 17:01 ` Donald Dutile
2025-07-10 17:07 ` Donald Dutile
2025-07-10 17:16 ` Nicolin Chen
1 sibling, 1 reply; 55+ messages in thread
From: Donald Dutile @ 2025-07-10 17:01 UTC (permalink / raw)
To: Shameerali Kolothum Thodi, Nicolin Chen
Cc: Zhenzhong Duan, qemu-devel@nongnu.org, alex.williamson@redhat.com,
clg@redhat.com, eric.auger@redhat.com, mst@redhat.com,
jasowang@redhat.com, peterx@redhat.com, jgg@nvidia.com,
joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
kevin.tian@intel.com, yi.l.liu@intel.com, chao.p.peng@intel.com
On 7/10/25 4:11 AM, Shameerali Kolothum Thodi wrote:
>
>
>> -----Original Message-----
>> From: Nicolin Chen <nicolinc@nvidia.com>
>> Sent: Wednesday, July 9, 2025 8:20 PM
>> To: Donald Dutile <ddutile@redhat.com>
>> Cc: Zhenzhong Duan <zhenzhong.duan@intel.com>; qemu-
>> devel@nongnu.org; alex.williamson@redhat.com; clg@redhat.com;
>> eric.auger@redhat.com; mst@redhat.com; jasowang@redhat.com;
>> peterx@redhat.com; jgg@nvidia.com; Shameerali Kolothum Thodi
>> <shameerali.kolothum.thodi@huawei.com>; joao.m.martins@oracle.com;
>> clement.mathieu--drif@eviden.com; kevin.tian@intel.com;
>> yi.l.liu@intel.com; chao.p.peng@intel.com
>> Subject: Re: [PATCH v3 02/20] hw/pci: Introduce
>> pci_device_get_viommu_cap()
>>
>> On Wed, Jul 09, 2025 at 01:55:46PM -0400, Donald Dutile wrote:
>>>>> +enum {
>>>>> + VIOMMU_CAP_STAGE1 = BIT_ULL(0), /* stage1 page table
>> supported */
>>>>> +};
>>>>
>>>> Thanks for this work. I am happy to see that we can share the
>>>> common code that allocates a NESTING_PARENT in the core using
>>>> this flag.
>>>>
>>>> Yet on ARM, a STAGE1 page table isn't always a nested S1, the
>>>> hardware accelerated one. More often, it can be just a regular
>>>> 1-stage translation table via emulated translation code and an
>>>> emulated iotlb.
>>>>
>>> Because the user-created smmuv3 started as 'accelerated smmuv3',
>>> and had been 'de-accelerated' to simply 'user created smmuv3',
>>> I'm looking for some clarification in the above statement/request.
>>>
>>> Is the above suppose to reflect that a nested IOMMU has some hw-
>> acceleration
>>> in its Stage1 implementation?
>>> If so, then call it that: STAGE1_ACCEL.
>>> If it's suppose to represent that an IOMMU has nested/2-stage support,
>>> then the above is a valid cap; -but-, having a nested/2-stage support
>> IOMMU
>>> doesn't necessarily mean its accelerated.
>>
>> Well, there are an emulated "nested" mode and an hw-accelerated
>> "nested" mode in the smmuv3 code, so we had to choose something
>> like "accel" over "nested".
>>
>> Here, on the other hand, I think the core using this CAP would
>> unlikely care about an emulated "nested" mode in the individual
>> vIOMMU..
>>
>> So I suggested:
>> /* hardware-accelerated nested stage-1 page table support */
>> VIOMMU_CAP_NESTED_S1 = BIT_ULL(0),
>>
>> which it should be clear IMHO.
>>
>> If not, maybe go a bit further like "VIOMMU_CAP_HW_NESTED_S1"?
>
> I am not sure the _S1 part makes much sense in ARM case. It doesn't matter
> whether the Guest SMMUv3 is configured in s1/s2 or nested for this CAP.
> With the new SMMUv3 dev support, the user can pretty much specify,
>
> -device arm-smmuv3,primary-bus=pcie.0,id=smmuv3.1,accel=on,stage={stage1|stage2|nested}
>
> And I think it will work with a host SMMUv3 nested configuration in all the
> above cases. Unless I am missing something and we need to restrict its
> use with stage=stage1 only.
>
> Thanks,
> Shameer
>
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v3 02/20] hw/pci: Introduce pci_device_get_viommu_cap()
2025-07-10 17:01 ` Donald Dutile
@ 2025-07-10 17:07 ` Donald Dutile
2025-07-10 17:25 ` Nicolin Chen
0 siblings, 1 reply; 55+ messages in thread
From: Donald Dutile @ 2025-07-10 17:07 UTC (permalink / raw)
To: Shameerali Kolothum Thodi, Nicolin Chen
Cc: Zhenzhong Duan, qemu-devel@nongnu.org, alex.williamson@redhat.com,
clg@redhat.com, eric.auger@redhat.com, mst@redhat.com,
jasowang@redhat.com, peterx@redhat.com, jgg@nvidia.com,
joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
kevin.tian@intel.com, yi.l.liu@intel.com, chao.p.peng@intel.com
oops, inadvertently hit send before I could add reply... apologies... reply below...
On 7/10/25 1:01 PM, Donald Dutile wrote:
>
>
> On 7/10/25 4:11 AM, Shameerali Kolothum Thodi wrote:
>>
>>
>>> -----Original Message-----
>>> From: Nicolin Chen <nicolinc@nvidia.com>
>>> Sent: Wednesday, July 9, 2025 8:20 PM
>>> To: Donald Dutile <ddutile@redhat.com>
>>> Cc: Zhenzhong Duan <zhenzhong.duan@intel.com>; qemu-
>>> devel@nongnu.org; alex.williamson@redhat.com; clg@redhat.com;
>>> eric.auger@redhat.com; mst@redhat.com; jasowang@redhat.com;
>>> peterx@redhat.com; jgg@nvidia.com; Shameerali Kolothum Thodi
>>> <shameerali.kolothum.thodi@huawei.com>; joao.m.martins@oracle.com;
>>> clement.mathieu--drif@eviden.com; kevin.tian@intel.com;
>>> yi.l.liu@intel.com; chao.p.peng@intel.com
>>> Subject: Re: [PATCH v3 02/20] hw/pci: Introduce
>>> pci_device_get_viommu_cap()
>>>
>>> On Wed, Jul 09, 2025 at 01:55:46PM -0400, Donald Dutile wrote:
>>>>>> +enum {
>>>>>> + VIOMMU_CAP_STAGE1 = BIT_ULL(0), /* stage1 page table
>>> supported */
>>>>>> +};
>>>>>
>>>>> Thanks for this work. I am happy to see that we can share the
>>>>> common code that allocates a NESTING_PARENT in the core using
>>>>> this flag.
>>>>>
>>>>> Yet on ARM, a STAGE1 page table isn't always a nested S1, the
>>>>> hardware accelerated one. More often, it can be just a regular
>>>>> 1-stage translation table via emulated translation code and an
>>>>> emulated iotlb.
>>>>>
>>>> Because the user-created smmuv3 started as 'accelerated smmuv3',
>>>> and had been 'de-accelerated' to simply 'user created smmuv3',
>>>> I'm looking for some clarification in the above statement/request.
>>>>
>>>> Is the above suppose to reflect that a nested IOMMU has some hw-
>>> acceleration
>>>> in its Stage1 implementation?
>>>> If so, then call it that: STAGE1_ACCEL.
>>>> If it's suppose to represent that an IOMMU has nested/2-stage support,
>>>> then the above is a valid cap; -but-, having a nested/2-stage support
>>> IOMMU
>>>> doesn't necessarily mean its accelerated.
>>>
>>> Well, there are an emulated "nested" mode and an hw-accelerated
>>> "nested" mode in the smmuv3 code, so we had to choose something
>>> like "accel" over "nested".
>>>
>>> Here, on the other hand, I think the core using this CAP would
>>> unlikely care about an emulated "nested" mode in the individual
>>> vIOMMU..
>>>
>>> So I suggested:
>>> /* hardware-accelerated nested stage-1 page table support */
>>> VIOMMU_CAP_NESTED_S1 = BIT_ULL(0),
>>>
>>> which it should be clear IMHO.
>>>
>>> If not, maybe go a bit further like "VIOMMU_CAP_HW_NESTED_S1"?
>>
>> I am not sure the _S1 part makes much sense in ARM case. It doesn't matter
>> whether the Guest SMMUv3 is configured in s1/s2 or nested for this CAP.
>> With the new SMMUv3 dev support, the user can pretty much specify,
>>
>> -device arm-smmuv3,primary-bus=pcie.0,id=smmuv3.1,accel=on,stage={stage1|stage2|nested}
>>
>> And I think it will work with a host SMMUv3 nested configuration in all the
>> above cases. Unless I am missing something and we need to restrict its
>> use with stage=stage1 only.
>>
>> Thanks,
>> Shameer
>>
Shameer,
I didn't think of these config options this way... so is this CAP always 0 for smmuv3's associated VIOMMU?
or ... should intel-iommu follow this same/smmuv3 config option(s) and avoid this VIOMMU CAP req. ?
- Don
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v3 02/20] hw/pci: Introduce pci_device_get_viommu_cap()
2025-07-10 17:07 ` Donald Dutile
@ 2025-07-10 17:25 ` Nicolin Chen
0 siblings, 0 replies; 55+ messages in thread
From: Nicolin Chen @ 2025-07-10 17:25 UTC (permalink / raw)
To: Donald Dutile
Cc: Shameerali Kolothum Thodi, Zhenzhong Duan, qemu-devel@nongnu.org,
alex.williamson@redhat.com, clg@redhat.com, eric.auger@redhat.com,
mst@redhat.com, jasowang@redhat.com, peterx@redhat.com,
jgg@nvidia.com, joao.m.martins@oracle.com,
clement.mathieu--drif@eviden.com, kevin.tian@intel.com,
yi.l.liu@intel.com, chao.p.peng@intel.com
On Thu, Jul 10, 2025 at 01:07:11PM -0400, Donald Dutile wrote:
> > > > If not, maybe go a bit further like "VIOMMU_CAP_HW_NESTED_S1"?
> > >
> > > I am not sure the _S1 part makes much sense in ARM case. It doesn't matter
> > > whether the Guest SMMUv3 is configured in s1/s2 or nested for this CAP.
> > > With the new SMMUv3 dev support, the user can pretty much specify,
> > >
> > > -device arm-smmuv3,primary-bus=pcie.0,id=smmuv3.1,accel=on,stage={stage1|stage2|nested}
> > >
> > > And I think it will work with a host SMMUv3 nested configuration in all the
> > > above cases. Unless I am missing something and we need to restrict its
> > > use with stage=stage1 only.
> > >
> > > Thanks,
> > > Shameer
> > >
> Shameer,
> I didn't think of these config options this way... so is this CAP always 0 for smmuv3's associated VIOMMU?
There should be a "bool hw_nesting" flag (or some other name) in
the SMMUState or SMMUDevice structure, that's set earlier by the
support of nested HWPT allocation/invalidation and other factors.
E.g. if "accel=on" is missing, set hw_nesting to false.
Then SMMU's get_viommu_cap() callback function will simply check
that flag.
> or ... should intel-iommu follow this same/smmuv3 config option(s) and avoid this VIOMMU CAP req. ?
We need this CAP for the vfio/pci core to allocate an S2 nesting
parent HWPT.
Thanks
Nicolin
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v3 02/20] hw/pci: Introduce pci_device_get_viommu_cap()
2025-07-10 8:11 ` Shameerali Kolothum Thodi via
2025-07-10 17:01 ` Donald Dutile
@ 2025-07-10 17:16 ` Nicolin Chen
2025-07-11 13:18 ` Shameerali Kolothum Thodi via
1 sibling, 1 reply; 55+ messages in thread
From: Nicolin Chen @ 2025-07-10 17:16 UTC (permalink / raw)
To: Shameerali Kolothum Thodi
Cc: Donald Dutile, Zhenzhong Duan, qemu-devel@nongnu.org,
alex.williamson@redhat.com, clg@redhat.com, eric.auger@redhat.com,
mst@redhat.com, jasowang@redhat.com, peterx@redhat.com,
jgg@nvidia.com, joao.m.martins@oracle.com,
clement.mathieu--drif@eviden.com, kevin.tian@intel.com,
yi.l.liu@intel.com, chao.p.peng@intel.com
On Thu, Jul 10, 2025 at 08:11:28AM +0000, Shameerali Kolothum Thodi wrote:
> > So I suggested:
> > /* hardware-accelerated nested stage-1 page table support */
> > VIOMMU_CAP_NESTED_S1 = BIT_ULL(0),
> >
> > which it should be clear IMHO.
> >
> > If not, maybe go a bit further like "VIOMMU_CAP_HW_NESTED_S1"?
>
> I am not sure the _S1 part makes much sense in ARM case. It doesn't matter
> whether the Guest SMMUv3 is configured in s1/s2 or nested for this CAP.
> With the new SMMUv3 dev support, the user can pretty much specify,
>
> -device arm-smmuv3,primary-bus=pcie.0,id=smmuv3.1,accel=on,stage={stage1|stage2|nested}
>
> And I think it will work with a host SMMUv3 nested configuration in all the
> above cases. Unless I am missing something and
[...]
> we need to restrict its
> use with stage=stage1 only.
I think we do.
The HW nesting works when we forward the s1ctxptr and make sure
that "stage-1" is the last stage, in other words, the only stage.
Otherwise how can we support stage2/nested in a HW nesting case?
Thanks
Nicolin
^ permalink raw reply [flat|nested] 55+ messages in thread
* RE: [PATCH v3 02/20] hw/pci: Introduce pci_device_get_viommu_cap()
2025-07-10 17:16 ` Nicolin Chen
@ 2025-07-11 13:18 ` Shameerali Kolothum Thodi via
0 siblings, 0 replies; 55+ messages in thread
From: Shameerali Kolothum Thodi via @ 2025-07-11 13:18 UTC (permalink / raw)
To: Nicolin Chen
Cc: Donald Dutile, Zhenzhong Duan, qemu-devel@nongnu.org,
alex.williamson@redhat.com, clg@redhat.com, eric.auger@redhat.com,
mst@redhat.com, jasowang@redhat.com, peterx@redhat.com,
jgg@nvidia.com, joao.m.martins@oracle.com,
clement.mathieu--drif@eviden.com, kevin.tian@intel.com,
yi.l.liu@intel.com, chao.p.peng@intel.com
> -----Original Message-----
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Thursday, July 10, 2025 6:16 PM
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> Cc: Donald Dutile <ddutile@redhat.com>; Zhenzhong Duan
> <zhenzhong.duan@intel.com>; qemu-devel@nongnu.org;
> alex.williamson@redhat.com; clg@redhat.com; eric.auger@redhat.com;
> mst@redhat.com; jasowang@redhat.com; peterx@redhat.com;
> jgg@nvidia.com; joao.m.martins@oracle.com; clement.mathieu--
> drif@eviden.com; kevin.tian@intel.com; yi.l.liu@intel.com;
> chao.p.peng@intel.com
> Subject: Re: [PATCH v3 02/20] hw/pci: Introduce
> pci_device_get_viommu_cap()
>
> On Thu, Jul 10, 2025 at 08:11:28AM +0000, Shameerali Kolothum Thodi
> wrote:
> > > So I suggested:
> > > /* hardware-accelerated nested stage-1 page table support */
> > > VIOMMU_CAP_NESTED_S1 = BIT_ULL(0),
> > >
> > > which it should be clear IMHO.
> > >
> > > If not, maybe go a bit further like "VIOMMU_CAP_HW_NESTED_S1"?
> >
> > I am not sure the _S1 part makes much sense in ARM case. It doesn't
> matter
> > whether the Guest SMMUv3 is configured in s1/s2 or nested for this CAP.
> > With the new SMMUv3 dev support, the user can pretty much specify,
> >
> > -device arm-smmuv3,primary-
> bus=pcie.0,id=smmuv3.1,accel=on,stage={stage1|stage2|nested}
> >
> > And I think it will work with a host SMMUv3 nested configuration in all
> the
> > above cases. Unless I am missing something and
> [...]
> > we need to restrict its
> > use with stage=stage1 only.
>
> I think we do..
>
> The HW nesting works when we forward the s1ctxptr and make sure
> that "stage-1" is the last stage, in other word, the only stage.
>
> Otherwise how can we support stage2/nested in a HW nesting case?
Yep. That's right. Stage 2 is a no. But nesting may work if the Guest only
uses S1. But it's better to restrict it to S1.
Thanks,
Shameer
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v3 02/20] hw/pci: Introduce pci_device_get_viommu_cap()
2025-07-08 11:05 ` [PATCH v3 02/20] hw/pci: Introduce pci_device_get_viommu_cap() Zhenzhong Duan
2025-07-09 0:39 ` Nicolin Chen
@ 2025-07-15 15:36 ` Eric Auger
1 sibling, 0 replies; 55+ messages in thread
From: Eric Auger @ 2025-07-15 15:36 UTC (permalink / raw)
To: Zhenzhong Duan, qemu-devel
Cc: alex.williamson, clg, mst, jasowang, peterx, ddutile, jgg,
nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng
Hi Zhenzhong,
On 7/8/25 1:05 PM, Zhenzhong Duan wrote:
> pci_device_get_viommu_cap() call pci_device_get_iommu_bus_devfn()
> to get iommu_bus->iommu_ops and call get_viommu_cap() callback to
> get a bitmap with each bit represents a vIOMMU exposed capability.
Suggesting:
Introduce a new PCIIOMMUOps optional callback, get_viommu_cap(), which
allows retrieving capabilities exposed by a vIOMMU. The first planned
capability is VIOMMU_CAP_HW_NESTED that advertises the support of HW
nested stage translation scheme. pci_device_get_viommu_cap is a wrapper
that can be called on a PCI device potentially protected by a vIOMMU.
>
> Suggested-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
> MAINTAINERS | 1 +
> hw/pci/pci.c | 11 +++++++++++
> include/hw/iommu.h | 16 ++++++++++++++++
> include/hw/pci/pci.h | 23 +++++++++++++++++++++++
> 4 files changed, 51 insertions(+)
> create mode 100644 include/hw/iommu.h
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 1842c3dd83..d9fc977b81 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -2302,6 +2302,7 @@ F: include/system/iommufd.h
> F: backends/host_iommu_device.c
> F: include/system/host_iommu_device.h
> F: include/qemu/chardev_open.h
> +F: include/hw/iommu.h
> F: util/chardev_open.c
> F: docs/devel/vfio-iommufd.rst
>
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index c70b5ceeba..df1fb615a8 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -2992,6 +2992,17 @@ void pci_device_unset_iommu_device(PCIDevice *dev)
> }
> }
>
> +uint64_t pci_device_get_viommu_cap(PCIDevice *dev)
> +{
> + PCIBus *iommu_bus;
> +
> + pci_device_get_iommu_bus_devfn(dev, &iommu_bus, NULL, NULL);
> + if (iommu_bus && iommu_bus->iommu_ops->get_viommu_cap) {
> + return iommu_bus->iommu_ops->get_viommu_cap(iommu_bus->iommu_opaque);
> + }
> + return 0;
> +}
> +
> int pci_pri_request_page(PCIDevice *dev, uint32_t pasid, bool priv_req,
> bool exec_req, hwaddr addr, bool lpig,
> uint16_t prgi, bool is_read, bool is_write)
> diff --git a/include/hw/iommu.h b/include/hw/iommu.h
> new file mode 100644
> index 0000000000..e80aaf4431
> --- /dev/null
> +++ b/include/hw/iommu.h
> @@ -0,0 +1,16 @@
> +/*
> + * General vIOMMU capabilities, flags, etc
> + *
> + * Copyright (C) 2025 Intel Corporation.
> + *
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + */
> +
> +#ifndef HW_IOMMU_H
> +#define HW_IOMMU_H
> +
> +enum {
> + VIOMMU_CAP_STAGE1 = BIT_ULL(0), /* stage1 page table supported */
with the enum name change,
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Thanks
Eric
> +};
> +
> +#endif /* HW_IOMMU_H */
> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> index df3cc7b875..a11ab14bdc 100644
> --- a/include/hw/pci/pci.h
> +++ b/include/hw/pci/pci.h
> @@ -453,6 +453,19 @@ typedef struct PCIIOMMUOps {
> * @devfn: device and function number of the PCI device.
> */
> void (*unset_iommu_device)(PCIBus *bus, void *opaque, int devfn);
> + /**
> + * @get_viommu_cap: get vIOMMU capabilities
> + *
> + * Optional callback, if not implemented, then vIOMMU doesn't
> + * support exposing capabilities to other subsystem, e.g., VFIO.
> + * vIOMMU can choose which capabilities to expose.
> + *
> + * @opaque: the data passed to pci_setup_iommu().
> + *
> + * Returns: 64bit bitmap with each bit represents a capability emulated
> + * by VIOMMU_CAP_* in include/hw/iommu.h
> + */
> + uint64_t (*get_viommu_cap)(void *opaque);
> /**
> * @get_iotlb_info: get properties required to initialize a device IOTLB.
> *
> @@ -633,6 +646,16 @@ bool pci_device_set_iommu_device(PCIDevice *dev, HostIOMMUDevice *hiod,
> Error **errp);
> void pci_device_unset_iommu_device(PCIDevice *dev);
>
> +/**
> + * pci_device_get_viommu_cap: get vIOMMU capabilities.
> + *
> + * Returns a 64bit bitmap with each bit represents a vIOMMU exposed
> + * capability, 0 if vIOMMU doesn't support exposing capabilities.
> + *
> + * @dev: PCI device pointer.
> + */
> +uint64_t pci_device_get_viommu_cap(PCIDevice *dev);
> +
> /**
> * pci_iommu_get_iotlb_info: get properties required to initialize a
> * device IOTLB.
^ permalink raw reply [flat|nested] 55+ messages in thread
* [PATCH v3 03/20] intel_iommu: Implement get_viommu_cap() callback
2025-07-08 11:05 [PATCH v3 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
2025-07-08 11:05 ` [PATCH v3 01/20] intel_iommu: Rename vtd_ce_get_rid2pasid_entry to vtd_ce_get_pasid_entry Zhenzhong Duan
2025-07-08 11:05 ` [PATCH v3 02/20] hw/pci: Introduce pci_device_get_viommu_cap() Zhenzhong Duan
@ 2025-07-08 11:05 ` Zhenzhong Duan
2025-07-15 16:32 ` Eric Auger
2025-07-08 11:05 ` [PATCH v3 04/20] vfio/iommufd: Force creating nested parent domain Zhenzhong Duan
` (16 subsequent siblings)
19 siblings, 1 reply; 55+ messages in thread
From: Zhenzhong Duan @ 2025-07-08 11:05 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan
Implement get_viommu_cap() callback and expose stage-1 capability for now.
VFIO uses it to create nested parent domain which is further used to create
nested domain in vIOMMU. All these will be implemented in following patches.
Suggested-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
hw/i386/intel_iommu.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index f0b1f90eff..702973da5c 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -24,6 +24,7 @@
#include "qemu/main-loop.h"
#include "qapi/error.h"
#include "hw/sysbus.h"
+#include "hw/iommu.h"
#include "intel_iommu_internal.h"
#include "hw/pci/pci.h"
#include "hw/pci/pci_bus.h"
@@ -4412,6 +4413,16 @@ static void vtd_dev_unset_iommu_device(PCIBus *bus, void *opaque, int devfn)
vtd_iommu_unlock(s);
}
+static uint64_t vtd_get_viommu_cap(void *opaque)
+{
+ IntelIOMMUState *s = opaque;
+ uint64_t caps;
+
+ caps = s->flts ? VIOMMU_CAP_STAGE1 : 0;
+
+ return caps;
+}
+
/* Unmap the whole range in the notifier's scope. */
static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
{
@@ -4734,6 +4745,7 @@ static PCIIOMMUOps vtd_iommu_ops = {
.get_address_space = vtd_host_dma_iommu,
.set_iommu_device = vtd_dev_set_iommu_device,
.unset_iommu_device = vtd_dev_unset_iommu_device,
+ .get_viommu_cap = vtd_get_viommu_cap,
};
static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
--
2.47.1
^ permalink raw reply related [flat|nested] 55+ messages in thread
* Re: [PATCH v3 03/20] intel_iommu: Implement get_viommu_cap() callback
2025-07-08 11:05 ` [PATCH v3 03/20] intel_iommu: Implement get_viommu_cap() callback Zhenzhong Duan
@ 2025-07-15 16:32 ` Eric Auger
0 siblings, 0 replies; 55+ messages in thread
From: Eric Auger @ 2025-07-15 16:32 UTC (permalink / raw)
To: Zhenzhong Duan, qemu-devel
Cc: alex.williamson, clg, mst, jasowang, peterx, ddutile, jgg,
nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng
Hi,
On 7/8/25 1:05 PM, Zhenzhong Duan wrote:
> Implement get_viommu_cap() callback and expose stage-1 capability for now.
>
> VFIO uses it to create nested parent domain which is further used to create
> nested domain in vIOMMU. All these will be implemented in following patches.
>
> Suggested-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
> hw/i386/intel_iommu.c | 12 ++++++++++++
> 1 file changed, 12 insertions(+)
>
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index f0b1f90eff..702973da5c 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -24,6 +24,7 @@
> #include "qemu/main-loop.h"
> #include "qapi/error.h"
> #include "hw/sysbus.h"
> +#include "hw/iommu.h"
> #include "intel_iommu_internal.h"
> #include "hw/pci/pci.h"
> #include "hw/pci/pci_bus.h"
> @@ -4412,6 +4413,16 @@ static void vtd_dev_unset_iommu_device(PCIBus *bus, void *opaque, int devfn)
> vtd_iommu_unlock(s);
> }
>
> +static uint64_t vtd_get_viommu_cap(void *opaque)
> +{
> + IntelIOMMUState *s = opaque;
> + uint64_t caps;
> +
> + caps = s->flts ? VIOMMU_CAP_STAGE1 : 0;
with CAP_HW_NESTED
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Eric
> +
> + return caps;
> +}
> +
> /* Unmap the whole range in the notifier's scope. */
> static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
> {
> @@ -4734,6 +4745,7 @@ static PCIIOMMUOps vtd_iommu_ops = {
> .get_address_space = vtd_host_dma_iommu,
> .set_iommu_device = vtd_dev_set_iommu_device,
> .unset_iommu_device = vtd_dev_unset_iommu_device,
> + .get_viommu_cap = vtd_get_viommu_cap,
> };
>
> static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
^ permalink raw reply [flat|nested] 55+ messages in thread
* [PATCH v3 04/20] vfio/iommufd: Force creating nested parent domain
2025-07-08 11:05 [PATCH v3 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
` (2 preceding siblings ...)
2025-07-08 11:05 ` [PATCH v3 03/20] intel_iommu: Implement get_viommu_cap() callback Zhenzhong Duan
@ 2025-07-08 11:05 ` Zhenzhong Duan
2025-07-15 16:36 ` Eric Auger
2025-07-08 11:05 ` [PATCH v3 05/20] hw/pci: Export pci_device_get_iommu_bus_devfn() and return bool Zhenzhong Duan
` (15 subsequent siblings)
19 siblings, 1 reply; 55+ messages in thread
From: Zhenzhong Duan @ 2025-07-08 11:05 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan
Call pci_device_get_viommu_cap() to get if vIOMMU supports VIOMMU_CAP_STAGE1,
if yes, create nested parent domain which could be reused by vIOMMU to create
nested domain.
It is safe because hw_caps & VIOMMU_CAP_STAGE1 cannot be set yet, as
s->flts is forbidden until we support a passthrough device with x-flts=on.
Suggested-by: Nicolin Chen <nicolinc@nvidia.com>
Suggested-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
hw/vfio/iommufd.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index 48c590b6a9..c172c177fc 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -20,6 +20,7 @@
#include "trace.h"
#include "qapi/error.h"
#include "system/iommufd.h"
+#include "hw/iommu.h"
#include "hw/qdev-core.h"
#include "hw/vfio/vfio-cpr.h"
#include "system/reset.h"
@@ -379,6 +380,19 @@ static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
flags = IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
}
+ /*
+ * If vIOMMU supports stage-1 translation, force to create nested parent
+ * domain which could be reused by vIOMMU to create nested domain.
+ */
+ if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
+ VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
+
+ hw_caps = pci_device_get_viommu_cap(&vdev->pdev);
+ if (hw_caps & VIOMMU_CAP_STAGE1) {
+ flags |= IOMMU_HWPT_ALLOC_NEST_PARENT;
+ }
+ }
+
if (cpr_is_incoming()) {
hwpt_id = vbasedev->cpr.hwpt_id;
goto skip_alloc;
--
2.47.1
^ permalink raw reply related [flat|nested] 55+ messages in thread
* Re: [PATCH v3 04/20] vfio/iommufd: Force creating nested parent domain
2025-07-08 11:05 ` [PATCH v3 04/20] vfio/iommufd: Force creating nested parent domain Zhenzhong Duan
@ 2025-07-15 16:36 ` Eric Auger
0 siblings, 0 replies; 55+ messages in thread
From: Eric Auger @ 2025-07-15 16:36 UTC (permalink / raw)
To: Zhenzhong Duan, qemu-devel
Cc: alex.williamson, clg, mst, jasowang, peterx, ddutile, jgg,
nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng
On 7/8/25 1:05 PM, Zhenzhong Duan wrote:
> Call pci_device_get_viommu_cap() to get if vIOMMU supports VIOMMU_CAP_STAGE1,
> if yes, create nested parent domain which could be reused by vIOMMU to create
> nested domain.
>
> It is safe because hw_caps & VIOMMU_CAP_STAGE1 cannot be set yet because
> s->flts is forbidden until we support passthrough device with x-flts=on.
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Eric
>
> Suggested-by: Nicolin Chen <nicolinc@nvidia.com>
> Suggested-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
> hw/vfio/iommufd.c | 14 ++++++++++++++
> 1 file changed, 14 insertions(+)
>
> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
> index 48c590b6a9..c172c177fc 100644
> --- a/hw/vfio/iommufd.c
> +++ b/hw/vfio/iommufd.c
> @@ -20,6 +20,7 @@
> #include "trace.h"
> #include "qapi/error.h"
> #include "system/iommufd.h"
> +#include "hw/iommu.h"
> #include "hw/qdev-core.h"
> #include "hw/vfio/vfio-cpr.h"
> #include "system/reset.h"
> @@ -379,6 +380,19 @@ static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
> flags = IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
> }
>
> + /*
> + * If vIOMMU supports stage-1 translation, force to create nested parent
> + * domain which could be reused by vIOMMU to create nested domain.
> + */
> + if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
> + VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> +
> + hw_caps = pci_device_get_viommu_cap(&vdev->pdev);
> + if (hw_caps & VIOMMU_CAP_STAGE1) {
> + flags |= IOMMU_HWPT_ALLOC_NEST_PARENT;
> + }
> + }
> +
> if (cpr_is_incoming()) {
> hwpt_id = vbasedev->cpr.hwpt_id;
> goto skip_alloc;
^ permalink raw reply [flat|nested] 55+ messages in thread
* [PATCH v3 05/20] hw/pci: Export pci_device_get_iommu_bus_devfn() and return bool
2025-07-08 11:05 [PATCH v3 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
` (3 preceding siblings ...)
2025-07-08 11:05 ` [PATCH v3 04/20] vfio/iommufd: Force creating nested parent domain Zhenzhong Duan
@ 2025-07-08 11:05 ` Zhenzhong Duan
2025-07-15 16:40 ` Eric Auger
2025-07-08 11:05 ` [PATCH v3 06/20] intel_iommu: Introduce a new structure VTDHostIOMMUDevice Zhenzhong Duan
` (14 subsequent siblings)
19 siblings, 1 reply; 55+ messages in thread
From: Zhenzhong Duan @ 2025-07-08 11:05 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan
Returns true if PCI device is aliased or false otherwise. This will be
used in following patch to determine if a PCI device is under a PCI
bridge.
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
hw/pci/pci.c | 12 ++++++++----
include/hw/pci/pci.h | 2 ++
2 files changed, 10 insertions(+), 4 deletions(-)
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index df1fb615a8..87f7c942b3 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -2857,20 +2857,21 @@ static void pci_device_class_base_init(ObjectClass *klass, const void *data)
* For call sites which don't need aliased BDF, passing NULL to
* aliased_[bus|devfn] is allowed.
*
+ * Returns true if PCI device is aliased or false otherwise.
+ *
* @piommu_bus: return root #PCIBus backed by an IOMMU for the PCI device.
*
* @aliased_bus: return aliased #PCIBus of the PCI device, optional.
*
* @aliased_devfn: return aliased devfn of the PCI device, optional.
*/
-static void pci_device_get_iommu_bus_devfn(PCIDevice *dev,
- PCIBus **piommu_bus,
- PCIBus **aliased_bus,
- int *aliased_devfn)
+bool pci_device_get_iommu_bus_devfn(PCIDevice *dev, PCIBus **piommu_bus,
+ PCIBus **aliased_bus, int *aliased_devfn)
{
PCIBus *bus = pci_get_bus(dev);
PCIBus *iommu_bus = bus;
int devfn = dev->devfn;
+ bool aliased = false;
while (iommu_bus && !iommu_bus->iommu_ops && iommu_bus->parent_dev) {
PCIBus *parent_bus = pci_get_bus(iommu_bus->parent_dev);
@@ -2907,6 +2908,7 @@ static void pci_device_get_iommu_bus_devfn(PCIDevice *dev,
devfn = parent->devfn;
bus = parent_bus;
}
+ aliased = true;
}
iommu_bus = parent_bus;
@@ -2928,6 +2930,8 @@ static void pci_device_get_iommu_bus_devfn(PCIDevice *dev,
if (aliased_devfn) {
*aliased_devfn = devfn;
}
+
+ return aliased;
}
AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index a11ab14bdc..8795808155 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -641,6 +641,8 @@ typedef struct PCIIOMMUOps {
bool is_write);
} PCIIOMMUOps;
+bool pci_device_get_iommu_bus_devfn(PCIDevice *dev, PCIBus **piommu_bus,
+ PCIBus **aliased_bus, int *aliased_devfn);
AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
bool pci_device_set_iommu_device(PCIDevice *dev, HostIOMMUDevice *hiod,
Error **errp);
--
2.47.1
^ permalink raw reply related [flat|nested] 55+ messages in thread
* Re: [PATCH v3 05/20] hw/pci: Export pci_device_get_iommu_bus_devfn() and return bool
2025-07-08 11:05 ` [PATCH v3 05/20] hw/pci: Export pci_device_get_iommu_bus_devfn() and return bool Zhenzhong Duan
@ 2025-07-15 16:40 ` Eric Auger
2025-07-16 3:29 ` Duan, Zhenzhong
0 siblings, 1 reply; 55+ messages in thread
From: Eric Auger @ 2025-07-15 16:40 UTC (permalink / raw)
To: Zhenzhong Duan, qemu-devel
Cc: alex.williamson, clg, mst, jasowang, peterx, ddutile, jgg,
nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng
On 7/8/25 1:05 PM, Zhenzhong Duan wrote:
> Returns true if PCI device is aliased or false otherwise. This will be
> used in following patch to determine if a PCI device is under a PCI
> bridge.
>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
> hw/pci/pci.c | 12 ++++++++----
> include/hw/pci/pci.h | 2 ++
> 2 files changed, 10 insertions(+), 4 deletions(-)
>
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index df1fb615a8..87f7c942b3 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -2857,20 +2857,21 @@ static void pci_device_class_base_init(ObjectClass *klass, const void *data)
> * For call sites which don't need aliased BDF, passing NULL to
> * aliased_[bus|devfn] is allowed.
> *
> + * Returns true if PCI device is aliased or false otherwise.
s/PCI device/PCI device RID
Besides
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Thanks
Eric
> + *
> * @piommu_bus: return root #PCIBus backed by an IOMMU for the PCI device.
> *
> * @aliased_bus: return aliased #PCIBus of the PCI device, optional.
> *
> * @aliased_devfn: return aliased devfn of the PCI device, optional.
> */
> -static void pci_device_get_iommu_bus_devfn(PCIDevice *dev,
> - PCIBus **piommu_bus,
> - PCIBus **aliased_bus,
> - int *aliased_devfn)
> +bool pci_device_get_iommu_bus_devfn(PCIDevice *dev, PCIBus **piommu_bus,
> + PCIBus **aliased_bus, int *aliased_devfn)
> {
> PCIBus *bus = pci_get_bus(dev);
> PCIBus *iommu_bus = bus;
> int devfn = dev->devfn;
> + bool aliased = false;
>
> while (iommu_bus && !iommu_bus->iommu_ops && iommu_bus->parent_dev) {
> PCIBus *parent_bus = pci_get_bus(iommu_bus->parent_dev);
> @@ -2907,6 +2908,7 @@ static void pci_device_get_iommu_bus_devfn(PCIDevice *dev,
> devfn = parent->devfn;
> bus = parent_bus;
> }
> + aliased = true;
> }
>
> iommu_bus = parent_bus;
> @@ -2928,6 +2930,8 @@ static void pci_device_get_iommu_bus_devfn(PCIDevice *dev,
> if (aliased_devfn) {
> *aliased_devfn = devfn;
> }
> +
> + return aliased;
> }
>
> AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> index a11ab14bdc..8795808155 100644
> --- a/include/hw/pci/pci.h
> +++ b/include/hw/pci/pci.h
> @@ -641,6 +641,8 @@ typedef struct PCIIOMMUOps {
> bool is_write);
> } PCIIOMMUOps;
>
> +bool pci_device_get_iommu_bus_devfn(PCIDevice *dev, PCIBus **piommu_bus,
> + PCIBus **aliased_bus, int *aliased_devfn);
> AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
> bool pci_device_set_iommu_device(PCIDevice *dev, HostIOMMUDevice *hiod,
> Error **errp);
^ permalink raw reply [flat|nested] 55+ messages in thread
* RE: [PATCH v3 05/20] hw/pci: Export pci_device_get_iommu_bus_devfn() and return bool
2025-07-15 16:40 ` Eric Auger
@ 2025-07-16 3:29 ` Duan, Zhenzhong
0 siblings, 0 replies; 55+ messages in thread
From: Duan, Zhenzhong @ 2025-07-16 3:29 UTC (permalink / raw)
To: eric.auger@redhat.com, qemu-devel@nongnu.org
Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
jgg@nvidia.com, nicolinc@nvidia.com,
shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
Peng, Chao P
Hi Eric,
>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: Re: [PATCH v3 05/20] hw/pci: Export
>pci_device_get_iommu_bus_devfn() and return bool
>
>
>
>On 7/8/25 1:05 PM, Zhenzhong Duan wrote:
>> Returns true if PCI device is aliased or false otherwise. This will be
>> used in following patch to determine if a PCI device is under a PCI
>> bridge.
>>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>> hw/pci/pci.c | 12 ++++++++----
>> include/hw/pci/pci.h | 2 ++
>> 2 files changed, 10 insertions(+), 4 deletions(-)
>>
>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
>> index df1fb615a8..87f7c942b3 100644
>> --- a/hw/pci/pci.c
>> +++ b/hw/pci/pci.c
>> @@ -2857,20 +2857,21 @@ static void
>pci_device_class_base_init(ObjectClass *klass, const void *data)
>> * For call sites which don't need aliased BDF, passing NULL to
>> * aliased_[bus|devfn] is allowed.
>> *
>> + * Returns true if PCI device is aliased or false otherwise.
>s/PCI device/PCI device RID
>
>Besides
>Reviewed-by: Eric Auger <eric.auger@redhat.com>
Applied your suggestions for PATCH2-5, see https://github.com/yiliu1765/qemu/commits/zhenzhong/iommufd_nesting.v4.wip/
Thanks
Zhenzhong
^ permalink raw reply [flat|nested] 55+ messages in thread
* [PATCH v3 06/20] intel_iommu: Introduce a new structure VTDHostIOMMUDevice
2025-07-08 11:05 [PATCH v3 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
` (4 preceding siblings ...)
2025-07-08 11:05 ` [PATCH v3 05/20] hw/pci: Export pci_device_get_iommu_bus_devfn() and return bool Zhenzhong Duan
@ 2025-07-08 11:05 ` Zhenzhong Duan
2025-07-08 11:05 ` [PATCH v3 07/20] intel_iommu: Check for compatibility with IOMMUFD backed device when x-flts=on Zhenzhong Duan
` (13 subsequent siblings)
19 siblings, 0 replies; 55+ messages in thread
From: Zhenzhong Duan @ 2025-07-08 11:05 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan
Introduce a new structure VTDHostIOMMUDevice which replaces
HostIOMMUDevice to be stored in hash table.
It includes a reference to HostIOMMUDevice and IntelIOMMUState,
also includes BDF information which will be used in future
patches.
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
---
hw/i386/intel_iommu.c | 15 +++++++++++++--
hw/i386/intel_iommu_internal.h | 7 +++++++
include/hw/i386/intel_iommu.h | 2 +-
3 files changed, 21 insertions(+), 3 deletions(-)
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 702973da5c..e90fd2f28f 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -281,7 +281,10 @@ static gboolean vtd_hiod_equal(gconstpointer v1, gconstpointer v2)
static void vtd_hiod_destroy(gpointer v)
{
- object_unref(v);
+ VTDHostIOMMUDevice *vtd_hiod = v;
+
+ object_unref(vtd_hiod->hiod);
+ g_free(vtd_hiod);
}
static gboolean vtd_hash_remove_by_domain(gpointer key, gpointer value,
@@ -4360,6 +4363,7 @@ static bool vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
HostIOMMUDevice *hiod, Error **errp)
{
IntelIOMMUState *s = opaque;
+ VTDHostIOMMUDevice *vtd_hiod;
struct vtd_as_key key = {
.bus = bus,
.devfn = devfn,
@@ -4376,7 +4380,14 @@ static bool vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
return false;
}
+ vtd_hiod = g_malloc0(sizeof(VTDHostIOMMUDevice));
+ vtd_hiod->bus = bus;
+ vtd_hiod->devfn = (uint8_t)devfn;
+ vtd_hiod->iommu_state = s;
+ vtd_hiod->hiod = hiod;
+
if (!vtd_check_hiod(s, hiod, errp)) {
+ g_free(vtd_hiod);
vtd_iommu_unlock(s);
return false;
}
@@ -4386,7 +4397,7 @@ static bool vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
new_key->devfn = devfn;
object_ref(hiod);
- g_hash_table_insert(s->vtd_host_iommu_dev, new_key, hiod);
+ g_hash_table_insert(s->vtd_host_iommu_dev, new_key, vtd_hiod);
vtd_iommu_unlock(s);
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index e8b211e8b0..7aba259ef8 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -28,6 +28,7 @@
#ifndef HW_I386_INTEL_IOMMU_INTERNAL_H
#define HW_I386_INTEL_IOMMU_INTERNAL_H
#include "hw/i386/intel_iommu.h"
+#include "system/host_iommu_device.h"
/*
* Intel IOMMU register specification
@@ -607,4 +608,10 @@ typedef struct VTDRootEntry VTDRootEntry;
/* Bits to decide the offset for each level */
#define VTD_LEVEL_BITS 9
+typedef struct VTDHostIOMMUDevice {
+ IntelIOMMUState *iommu_state;
+ PCIBus *bus;
+ uint8_t devfn;
+ HostIOMMUDevice *hiod;
+} VTDHostIOMMUDevice;
#endif
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index e95477e855..50f9b27a45 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -295,7 +295,7 @@ struct IntelIOMMUState {
/* list of registered notifiers */
QLIST_HEAD(, VTDAddressSpace) vtd_as_with_notifiers;
- GHashTable *vtd_host_iommu_dev; /* HostIOMMUDevice */
+ GHashTable *vtd_host_iommu_dev; /* VTDHostIOMMUDevice */
/* interrupt remapping */
bool intr_enabled; /* Whether guest enabled IR */
--
2.47.1
^ permalink raw reply related [flat|nested] 55+ messages in thread
* [PATCH v3 07/20] intel_iommu: Check for compatibility with IOMMUFD backed device when x-flts=on
2025-07-08 11:05 [PATCH v3 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
` (5 preceding siblings ...)
2025-07-08 11:05 ` [PATCH v3 06/20] intel_iommu: Introduce a new structure VTDHostIOMMUDevice Zhenzhong Duan
@ 2025-07-08 11:05 ` Zhenzhong Duan
2025-07-16 9:22 ` Eric Auger
2025-07-08 11:05 ` [PATCH v3 08/20] intel_iommu: Fail passthrough device under PCI bridge if x-flts=on Zhenzhong Duan
` (12 subsequent siblings)
19 siblings, 1 reply; 55+ messages in thread
From: Zhenzhong Duan @ 2025-07-08 11:05 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan
When vIOMMU is configured x-flts=on in scalable mode, stage-1 page table
is passed to host to construct nested page table. We need to check
compatibility of some critical IOMMU capabilities between vIOMMU and
host IOMMU to ensure guest stage-1 page table could be used by host.
For instance, if vIOMMU supports stage-1 1GB huge page mapping but the host
does not, then this IOMMUFD backed device should fail.
Even if the checks pass, for now we willingly reject the association
because all the bits are not there yet.
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
hw/i386/intel_iommu.c | 30 +++++++++++++++++++++++++++++-
hw/i386/intel_iommu_internal.h | 1 +
2 files changed, 30 insertions(+), 1 deletion(-)
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index e90fd2f28f..c57ca02cdd 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -40,6 +40,7 @@
#include "kvm/kvm_i386.h"
#include "migration/vmstate.h"
#include "trace.h"
+#include "system/iommufd.h"
/* context entry operations */
#define VTD_CE_GET_RID2PASID(ce) \
@@ -4355,7 +4356,34 @@ static bool vtd_check_hiod(IntelIOMMUState *s, HostIOMMUDevice *hiod,
return true;
}
- error_setg(errp, "host device is uncompatible with stage-1 translation");
+#ifdef CONFIG_IOMMUFD
+ struct HostIOMMUDeviceCaps *caps = &hiod->caps;
+ struct iommu_hw_info_vtd *vtd = &caps->vendor_caps.vtd;
+
+ /* Remaining checks are all stage-1 translation specific */
+ if (!object_dynamic_cast(OBJECT(hiod), TYPE_HOST_IOMMU_DEVICE_IOMMUFD)) {
+ error_setg(errp, "Need IOMMUFD backend when x-flts=on");
+ return false;
+ }
+
+ if (caps->type != IOMMU_HW_INFO_TYPE_INTEL_VTD) {
+ error_setg(errp, "Incompatible host platform IOMMU type %d",
+ caps->type);
+ return false;
+ }
+
+ if (!(vtd->ecap_reg & VTD_ECAP_NEST)) {
+ error_setg(errp, "Host IOMMU doesn't support nested translation");
+ return false;
+ }
+
+ if (s->fs1gp && !(vtd->cap_reg & VTD_CAP_FS1GP)) {
+ error_setg(errp, "Stage-1 1GB huge page is unsupported by host IOMMU");
+ return false;
+ }
+#endif
+
+ error_setg(errp, "host IOMMU is incompatible with stage-1 translation");
return false;
}
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 7aba259ef8..18bc22fc72 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -192,6 +192,7 @@
#define VTD_ECAP_PT (1ULL << 6)
#define VTD_ECAP_SC (1ULL << 7)
#define VTD_ECAP_MHMV (15ULL << 20)
+#define VTD_ECAP_NEST (1ULL << 26)
#define VTD_ECAP_SRS (1ULL << 31)
#define VTD_ECAP_PASID (1ULL << 40)
#define VTD_ECAP_SMTS (1ULL << 43)
--
2.47.1
^ permalink raw reply related [flat|nested] 55+ messages in thread
* Re: [PATCH v3 07/20] intel_iommu: Check for compatibility with IOMMUFD backed device when x-flts=on
2025-07-08 11:05 ` [PATCH v3 07/20] intel_iommu: Check for compatibility with IOMMUFD backed device when x-flts=on Zhenzhong Duan
@ 2025-07-16 9:22 ` Eric Auger
2025-07-16 10:31 ` Duan, Zhenzhong
0 siblings, 1 reply; 55+ messages in thread
From: Eric Auger @ 2025-07-16 9:22 UTC (permalink / raw)
To: Zhenzhong Duan, qemu-devel
Cc: alex.williamson, clg, mst, jasowang, peterx, ddutile, jgg,
nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng
Hi Zhenzhong,
On 7/8/25 1:05 PM, Zhenzhong Duan wrote:
> When vIOMMU is configured x-flts=on in scalable mode, stage-1 page table
> is passed to host to construct nested page table. We need to check
> compatibility of some critical IOMMU capabilities between vIOMMU and
> host IOMMU to ensure guest stage-1 page table could be used by host.
>
> For instance, vIOMMU supports stage-1 1GB huge page mapping, but host
> does not, then this IOMMUFD backed device should fail.
>
> Even of the checks pass, for now we willingly reject the association
> because all the bits are not there yet.
>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
> hw/i386/intel_iommu.c | 30 +++++++++++++++++++++++++++++-
> hw/i386/intel_iommu_internal.h | 1 +
> 2 files changed, 30 insertions(+), 1 deletion(-)
>
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index e90fd2f28f..c57ca02cdd 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -40,6 +40,7 @@
> #include "kvm/kvm_i386.h"
> #include "migration/vmstate.h"
> #include "trace.h"
> +#include "system/iommufd.h"
>
> /* context entry operations */
> #define VTD_CE_GET_RID2PASID(ce) \
> @@ -4355,7 +4356,34 @@ static bool vtd_check_hiod(IntelIOMMUState *s, HostIOMMUDevice *hiod,
> return true;
> }
>
> - error_setg(errp, "host device is uncompatible with stage-1 translation");
> +#ifdef CONFIG_IOMMUFD
> + struct HostIOMMUDeviceCaps *caps = &hiod->caps;
> + struct iommu_hw_info_vtd *vtd = &caps->vendor_caps.vtd;
I am now confused about how this relates to vtd_get_viommu_cap().
PCIIOMMUOps.set_iommu_device = vtd_dev_set_iommu_device calls
vtd_check_hiod()
viommu might return HW_NESTED_CAP through PCIIOMMUOps.get_viommu_cap
without making sure the underlying HW IOMMU does support it. Is that a
correct understanding? Maybe we should clarify the calling order between
set_iommu_device vs get_viommu_cap? Could we check HW IOMMU
prerequisites in vtd_get_viommu_cap() by enforcing this is called after
set_iommu_device. I think we should clarify the exact semantic of
get_viommu_cap(). Thanks, Eric
> +
> + /* Remaining checks are all stage-1 translation specific */
> + if (!object_dynamic_cast(OBJECT(hiod), TYPE_HOST_IOMMU_DEVICE_IOMMUFD)) {
> + error_setg(errp, "Need IOMMUFD backend when x-flts=on");
> + return false;
> + }
> +
> + if (caps->type != IOMMU_HW_INFO_TYPE_INTEL_VTD) {
> + error_setg(errp, "Incompatible host platform IOMMU type %d",
> + caps->type);
> + return false;
> + }
> +
> + if (!(vtd->ecap_reg & VTD_ECAP_NEST)) {
> + error_setg(errp, "Host IOMMU doesn't support nested translation");
> + return false;
> + }
> +
> + if (s->fs1gp && !(vtd->cap_reg & VTD_CAP_FS1GP)) {
> + error_setg(errp, "Stage-1 1GB huge page is unsupported by host IOMMU");
> + return false;
> + }
> +#endif
> +
> + error_setg(errp, "host IOMMU is incompatible with stage-1 translation");
> return false;
> }
>
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> index 7aba259ef8..18bc22fc72 100644
> --- a/hw/i386/intel_iommu_internal.h
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -192,6 +192,7 @@
> #define VTD_ECAP_PT (1ULL << 6)
> #define VTD_ECAP_SC (1ULL << 7)
> #define VTD_ECAP_MHMV (15ULL << 20)
> +#define VTD_ECAP_NEST (1ULL << 26)
> #define VTD_ECAP_SRS (1ULL << 31)
> #define VTD_ECAP_PASID (1ULL << 40)
> #define VTD_ECAP_SMTS (1ULL << 43)
^ permalink raw reply [flat|nested] 55+ messages in thread
* RE: [PATCH v3 07/20] intel_iommu: Check for compatibility with IOMMUFD backed device when x-flts=on
2025-07-16 9:22 ` Eric Auger
@ 2025-07-16 10:31 ` Duan, Zhenzhong
2025-07-16 12:09 ` Eric Auger
0 siblings, 1 reply; 55+ messages in thread
From: Duan, Zhenzhong @ 2025-07-16 10:31 UTC (permalink / raw)
To: eric.auger@redhat.com, qemu-devel@nongnu.org
Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
jgg@nvidia.com, nicolinc@nvidia.com,
shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
Peng, Chao P
Hi Eric,
>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: Re: [PATCH v3 07/20] intel_iommu: Check for compatibility with
>IOMMUFD backed device when x-flts=on
>
>Hi Zhenzhong,
>
>On 7/8/25 1:05 PM, Zhenzhong Duan wrote:
>> When vIOMMU is configured x-flts=on in scalable mode, stage-1 page table
>> is passed to host to construct nested page table. We need to check
>> compatibility of some critical IOMMU capabilities between vIOMMU and
>> host IOMMU to ensure guest stage-1 page table could be used by host.
>>
>> For instance, vIOMMU supports stage-1 1GB huge page mapping, but host
>> does not, then this IOMMUFD backed device should fail.
>>
>> Even of the checks pass, for now we willingly reject the association
>> because all the bits are not there yet.
>>
>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>> hw/i386/intel_iommu.c | 30
>+++++++++++++++++++++++++++++-
>> hw/i386/intel_iommu_internal.h | 1 +
>> 2 files changed, 30 insertions(+), 1 deletion(-)
>>
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index e90fd2f28f..c57ca02cdd 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -40,6 +40,7 @@
>> #include "kvm/kvm_i386.h"
>> #include "migration/vmstate.h"
>> #include "trace.h"
>> +#include "system/iommufd.h"
>>
>> /* context entry operations */
>> #define VTD_CE_GET_RID2PASID(ce) \
>> @@ -4355,7 +4356,34 @@ static bool vtd_check_hiod(IntelIOMMUState *s,
>HostIOMMUDevice *hiod,
>> return true;
>> }
>>
>> - error_setg(errp, "host device is uncompatible with stage-1
>translation");
>> +#ifdef CONFIG_IOMMUFD
>> + struct HostIOMMUDeviceCaps *caps = &hiod->caps;
>> + struct iommu_hw_info_vtd *vtd = &caps->vendor_caps.vtd;
>
>I am now confused about how this relates to vtd_get_viommu_cap().
>PCIIOMMUOps.set_iommu_device = vtd_dev_set_iommu_device calls
>vtd_check_hiod()
>viommu might return HW_NESTED_CAP through
>PCIIOMMUOps.get_viommu_cap
>without making sure the underlying HW IOMMU does support it. Is that a
>correct understanding? Maybe we should clarify the calling order between
>set_iommu_device vs get_viommu_cap? Could we check HW IOMMU
>prerequisites in vtd_get_viommu_cap() by enforcing this is called after
>set_iommu_device. I think we should clarify the exact semantic of
>get_viommu_cap().Thanks Eric
My understanding get_viommu_cap() returns pure vIOMMU's capabilities
with no host IOMMU's capabilities involved.
For example, returned HW_NESTED_CAP means this vIOMMU has code
to support creating nested hwpt and attaching, no matter if host IOMMU
supports nesting or not.
The compatibility check between host IOMMU vs vIOMMU is done in
set_iommu_device(), see vtd_check_hiod().
It's too late for VFIO to call get_viommu_cap() after set_iommu_device()
because we need get_viommu_cap() to determine if creating nested parent
hwpt or not at attaching stage.
If host doesn't support nesting, then creating nested parent hwpt will fail
early in vfio realize before vtd_check_hiod() is called and we fail the hotplug.
Thanks
Zhenzhong
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v3 07/20] intel_iommu: Check for compatibility with IOMMUFD backed device when x-flts=on
2025-07-16 10:31 ` Duan, Zhenzhong
@ 2025-07-16 12:09 ` Eric Auger
2025-07-17 3:47 ` Duan, Zhenzhong
0 siblings, 1 reply; 55+ messages in thread
From: Eric Auger @ 2025-07-16 12:09 UTC (permalink / raw)
To: Duan, Zhenzhong, qemu-devel@nongnu.org
Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
jgg@nvidia.com, nicolinc@nvidia.com,
shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
Peng, Chao P
Hi Zhenzhong,
On 7/16/25 12:31 PM, Duan, Zhenzhong wrote:
> Hi Eric,
>
>> -----Original Message-----
>> From: Eric Auger <eric.auger@redhat.com>
>> Subject: Re: [PATCH v3 07/20] intel_iommu: Check for compatibility with
>> IOMMUFD backed device when x-flts=on
>>
>> Hi Zhenzhong,
>>
>> On 7/8/25 1:05 PM, Zhenzhong Duan wrote:
>>> When vIOMMU is configured x-flts=on in scalable mode, stage-1 page table
>>> is passed to host to construct nested page table. We need to check
>>> compatibility of some critical IOMMU capabilities between vIOMMU and
>>> host IOMMU to ensure guest stage-1 page table could be used by host.
>>>
>>> For instance, vIOMMU supports stage-1 1GB huge page mapping, but host
>>> does not, then this IOMMUFD backed device should fail.
>>>
>>> Even of the checks pass, for now we willingly reject the association
>>> because all the bits are not there yet.
>>>
>>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>> ---
>>> hw/i386/intel_iommu.c | 30
>> +++++++++++++++++++++++++++++-
>>> hw/i386/intel_iommu_internal.h | 1 +
>>> 2 files changed, 30 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>>> index e90fd2f28f..c57ca02cdd 100644
>>> --- a/hw/i386/intel_iommu.c
>>> +++ b/hw/i386/intel_iommu.c
>>> @@ -40,6 +40,7 @@
>>> #include "kvm/kvm_i386.h"
>>> #include "migration/vmstate.h"
>>> #include "trace.h"
>>> +#include "system/iommufd.h"
>>>
>>> /* context entry operations */
>>> #define VTD_CE_GET_RID2PASID(ce) \
>>> @@ -4355,7 +4356,34 @@ static bool vtd_check_hiod(IntelIOMMUState *s,
>> HostIOMMUDevice *hiod,
>>> return true;
>>> }
>>>
>>> - error_setg(errp, "host device is uncompatible with stage-1
>> translation");
>>> +#ifdef CONFIG_IOMMUFD
>>> + struct HostIOMMUDeviceCaps *caps = &hiod->caps;
>>> + struct iommu_hw_info_vtd *vtd = &caps->vendor_caps.vtd;
>> I am now confused about how this relates to vtd_get_viommu_cap().
>> PCIIOMMUOps.set_iommu_device = vtd_dev_set_iommu_device calls
>> vtd_check_hiod()
>> viommu might return HW_NESTED_CAP through
>> PCIIOMMUOps.get_viommu_cap
>> without making sure the underlying HW IOMMU does support it. Is that a
>> correct understanding? Maybe we should clarify the calling order between
>> set_iommu_device vs get_viommu_cap? Could we check HW IOMMU
>> prerequisites in vtd_get_viommu_cap() by enforcing this is called after
>> set_iommu_device. I think we should clarify the exact semantic of
>> get_viommu_cap().Thanks Eric
> My understanding get_viommu_cap() returns pure vIOMMU's capabilities
> with no host IOMMU's capabilities involved.
>
> For example, returned HW_NESTED_CAP means this vIOMMU has code
> to support creating nested hwpt and attaching, no matter if host IOMMU
> supports nesting or not.
Then I think you need to refine the description in 2/20 to make this clear.
stating explicitly get_viommu_cap returns theoretical capabilities which
are independent of the actual host capabilities they may depend on.
>
> The compatibility check between host IOMMU vs vIOMMU is done in
> set_iommu_device(), see vtd_check_hiod().
>
> It's too late for VFIO to call get_viommu_cap() after set_iommu_device()
> because we need get_viommu_cap() to determine if creating nested parent
> hwpt or not at attaching stage.
isn't it possible to rework the call sequence?
Eric
>
> If host doesn't support nesting, then creating nested parent hwpt will fail
> early in vfio realize before vtd_check_hiod() is called and we fail the hotplug.
>
> Thanks
> Zhenzhong
^ permalink raw reply [flat|nested] 55+ messages in thread
* RE: [PATCH v3 07/20] intel_iommu: Check for compatibility with IOMMUFD backed device when x-flts=on
2025-07-16 12:09 ` Eric Auger
@ 2025-07-17 3:47 ` Duan, Zhenzhong
2025-07-17 6:48 ` Eric Auger
0 siblings, 1 reply; 55+ messages in thread
From: Duan, Zhenzhong @ 2025-07-17 3:47 UTC (permalink / raw)
To: eric.auger@redhat.com, qemu-devel@nongnu.org
Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
jgg@nvidia.com, nicolinc@nvidia.com,
shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
Peng, Chao P
Hi Eric,
>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Sent: Wednesday, July 16, 2025 8:09 PM
>To: Duan, Zhenzhong <zhenzhong.duan@intel.com>;
>qemu-devel@nongnu.org
>Cc: alex.williamson@redhat.com; clg@redhat.com; mst@redhat.com;
>jasowang@redhat.com; peterx@redhat.com; ddutile@redhat.com;
>jgg@nvidia.com; nicolinc@nvidia.com;
>shameerali.kolothum.thodi@huawei.com; joao.m.martins@oracle.com;
>clement.mathieu--drif@eviden.com; Tian, Kevin <kevin.tian@intel.com>; Liu,
>Yi L <yi.l.liu@intel.com>; Peng, Chao P <chao.p.peng@intel.com>
>Subject: Re: [PATCH v3 07/20] intel_iommu: Check for compatibility with
>IOMMUFD backed device when x-flts=on
>
>Hi Zhenzhong,
>
>On 7/16/25 12:31 PM, Duan, Zhenzhong wrote:
>> Hi Eric,
>>
>>> -----Original Message-----
>>> From: Eric Auger <eric.auger@redhat.com>
>>> Subject: Re: [PATCH v3 07/20] intel_iommu: Check for compatibility with
>>> IOMMUFD backed device when x-flts=on
>>>
>>> Hi Zhenzhong,
>>>
>>> On 7/8/25 1:05 PM, Zhenzhong Duan wrote:
>>>> When vIOMMU is configured x-flts=on in scalable mode, stage-1 page
>table
>>>> is passed to host to construct nested page table. We need to check
>>>> compatibility of some critical IOMMU capabilities between vIOMMU and
>>>> host IOMMU to ensure guest stage-1 page table could be used by host.
>>>>
>>>> For instance, vIOMMU supports stage-1 1GB huge page mapping, but
>host
>>>> does not, then this IOMMUFD backed device should fail.
>>>>
>>>> Even of the checks pass, for now we willingly reject the association
>>>> because all the bits are not there yet.
>>>>
>>>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>>> ---
>>>> hw/i386/intel_iommu.c | 30
>>> +++++++++++++++++++++++++++++-
>>>> hw/i386/intel_iommu_internal.h | 1 +
>>>> 2 files changed, 30 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>>>> index e90fd2f28f..c57ca02cdd 100644
>>>> --- a/hw/i386/intel_iommu.c
>>>> +++ b/hw/i386/intel_iommu.c
>>>> @@ -40,6 +40,7 @@
>>>> #include "kvm/kvm_i386.h"
>>>> #include "migration/vmstate.h"
>>>> #include "trace.h"
>>>> +#include "system/iommufd.h"
>>>>
>>>> /* context entry operations */
>>>> #define VTD_CE_GET_RID2PASID(ce) \
>>>> @@ -4355,7 +4356,34 @@ static bool vtd_check_hiod(IntelIOMMUState
>*s,
>>> HostIOMMUDevice *hiod,
>>>> return true;
>>>> }
>>>>
>>>> - error_setg(errp, "host device is uncompatible with stage-1
>>> translation");
>>>> +#ifdef CONFIG_IOMMUFD
>>>> + struct HostIOMMUDeviceCaps *caps = &hiod->caps;
>>>> + struct iommu_hw_info_vtd *vtd = &caps->vendor_caps.vtd;
>>> I am now confused about how this relates to vtd_get_viommu_cap().
>>> PCIIOMMUOps.set_iommu_device = vtd_dev_set_iommu_device calls
>>> vtd_check_hiod()
>>> viommu might return HW_NESTED_CAP through
>>> PCIIOMMUOps.get_viommu_cap
>>> without making sure the underlying HW IOMMU does support it. Is that a
>>> correct understanding? Maybe we should clarify the calling order between
>>> set_iommu_device vs get_viommu_cap? Could we check HW IOMMU
>>> prerequisites in vtd_get_viommu_cap() by enforcing this is called after
>>> set_iommu_device. I think we should clarify the exact semantic of
>>> get_viommu_cap().Thanks Eric
>> My understanding get_viommu_cap() returns pure vIOMMU's capabilities
>> with no host IOMMU's capabilities involved.
>>
>> For example, returned HW_NESTED_CAP means this vIOMMU has code
>> to support creating nested hwpt and attaching, no matter if host IOMMU
>> supports nesting or not.
>
>Then I think you need to refine the description in 2/20 to make this clear.
>stating explicitly get_viommu_cap returns theoretical capabilities which
>are independent on the actual host capabilities they may depend on.
Will do.
For virtual vtd, we are unable to return capabilities depending on host capabilities,
because different host IOMMUs may have different capabilities; we want to return
a consistent result that depends only on the user's cmdline config.
>>
>> The compatibility check between host IOMMU vs vIOMMU is done in
>> set_iommu_device(), see vtd_check_hiod().
>>
>> It's too late for VFIO to call get_viommu_cap() after set_iommu_device()
>> because we need get_viommu_cap() to determine if creating nested parent
>> hwpt or not at attaching stage.
>isn't it possible to rework the call sequence?
I think not. Current sequence:
attach_device()
get_viommu_cap()
create hwpt
...
create hiod
set_iommu_device(hiod)
Hiod realize needs iommufd, devid and hwpt_id, which are ready after attach_device().
Thanks
Zhenzhong
^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: [PATCH v3 07/20] intel_iommu: Check for compatibility with IOMMUFD backed device when x-flts=on
2025-07-17 3:47 ` Duan, Zhenzhong
@ 2025-07-17 6:48 ` Eric Auger
2025-07-17 7:03 ` Duan, Zhenzhong
0 siblings, 1 reply; 55+ messages in thread
From: Eric Auger @ 2025-07-17 6:48 UTC (permalink / raw)
To: Duan, Zhenzhong, qemu-devel@nongnu.org
Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
jgg@nvidia.com, nicolinc@nvidia.com,
shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
Peng, Chao P
Hi Zhenzhong,
On 7/17/25 5:47 AM, Duan, Zhenzhong wrote:
> Hi Eric,
>
>> -----Original Message-----
>> From: Eric Auger <eric.auger@redhat.com>
>> Sent: Wednesday, July 16, 2025 8:09 PM
>> To: Duan, Zhenzhong <zhenzhong.duan@intel.com>;
>> qemu-devel@nongnu.org
>> Cc: alex.williamson@redhat.com; clg@redhat.com; mst@redhat.com;
>> jasowang@redhat.com; peterx@redhat.com; ddutile@redhat.com;
>> jgg@nvidia.com; nicolinc@nvidia.com;
>> shameerali.kolothum.thodi@huawei.com; joao.m.martins@oracle.com;
>> clement.mathieu--drif@eviden.com; Tian, Kevin <kevin.tian@intel.com>; Liu,
>> Yi L <yi.l.liu@intel.com>; Peng, Chao P <chao.p.peng@intel.com>
>> Subject: Re: [PATCH v3 07/20] intel_iommu: Check for compatibility with
>> IOMMUFD backed device when x-flts=on
>>
>> Hi Zhenzhong,
>>
>> On 7/16/25 12:31 PM, Duan, Zhenzhong wrote:
>>> Hi Eric,
>>>
>>>> -----Original Message-----
>>>> From: Eric Auger <eric.auger@redhat.com>
>>>> Subject: Re: [PATCH v3 07/20] intel_iommu: Check for compatibility with
>>>> IOMMUFD backed device when x-flts=on
>>>>
>>>> Hi Zhenzhong,
>>>>
>>>> On 7/8/25 1:05 PM, Zhenzhong Duan wrote:
>>>>> When vIOMMU is configured x-flts=on in scalable mode, stage-1 page
>> table
>>>>> is passed to host to construct nested page table. We need to check
>>>>> compatibility of some critical IOMMU capabilities between vIOMMU and
>>>>> host IOMMU to ensure guest stage-1 page table could be used by host.
>>>>>
>>>>> For instance, vIOMMU supports stage-1 1GB huge page mapping, but
>> host
>>>>> does not, then this IOMMUFD backed device should fail.
>>>>>
>>>>> Even if the checks pass, for now we willingly reject the association
>>>>> because all the bits are not there yet.
>>>>>
>>>>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>>>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>>>> ---
>>>>> hw/i386/intel_iommu.c | 30
>>>> +++++++++++++++++++++++++++++-
>>>>> hw/i386/intel_iommu_internal.h | 1 +
>>>>> 2 files changed, 30 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>>>>> index e90fd2f28f..c57ca02cdd 100644
>>>>> --- a/hw/i386/intel_iommu.c
>>>>> +++ b/hw/i386/intel_iommu.c
>>>>> @@ -40,6 +40,7 @@
>>>>> #include "kvm/kvm_i386.h"
>>>>> #include "migration/vmstate.h"
>>>>> #include "trace.h"
>>>>> +#include "system/iommufd.h"
>>>>>
>>>>> /* context entry operations */
>>>>> #define VTD_CE_GET_RID2PASID(ce) \
>>>>> @@ -4355,7 +4356,34 @@ static bool vtd_check_hiod(IntelIOMMUState
>> *s,
>>>> HostIOMMUDevice *hiod,
>>>>> return true;
>>>>> }
>>>>>
>>>>> - error_setg(errp, "host device is uncompatible with stage-1
>>>> translation");
>>>>> +#ifdef CONFIG_IOMMUFD
>>>>> + struct HostIOMMUDeviceCaps *caps = &hiod->caps;
>>>>> + struct iommu_hw_info_vtd *vtd = &caps->vendor_caps.vtd;
>>>> I am now confused about how this relates to vtd_get_viommu_cap().
>>>> PCIIOMMUOps.set_iommu_device = vtd_dev_set_iommu_device calls
>>>> vtd_check_hiod()
>>>> viommu might return HW_NESTED_CAP through
>>>> PCIIOMMUOps.get_viommu_cap
>>>> without making sure the underlying HW IOMMU does support it. Is that a
>>>> correct understanding? Maybe we should clarify the calling order between
>>>> set_iommu_device vs get_viommu_cap? Could we check HW IOMMU
>>>> prerequisites in vtd_get_viommu_cap() by enforcing this is called after
>>>> set_iommu_device. I think we should clarify the exact semantic of
>>>> get_viommu_cap().Thanks Eric
>>> My understanding get_viommu_cap() returns pure vIOMMU's capabilities
>>> with no host IOMMU's capabilities involved.
>>>
>>> For example, returned HW_NESTED_CAP means this vIOMMU has code
>>> to support creating nested hwpt and attaching, no matter if host IOMMU
>>> supports nesting or not.
>> Then I think you need to refine the description in 2/20 to make this clear.
>> stating explicitly get_viommu_cap returns theoretical capabilities which
>> are independent on the actual host capabilities they may depend on.
> Will do.
>
> For virtual vtd, we are unable to return capabilities depending on host capacities,
> Because different host IOMMU may have different capabilities, we want to return
> a consistent result only depending on user's cmdline config.
ok
>
>>> The compatibility check between host IOMMU vs vIOMMU is done in
>>> set_iommu_device(), see vtd_check_hiod().
>>>
>>> It's too late for VFIO to call get_viommu_cap() after set_iommu_device()
>>> because we need get_viommu_cap() to determine if creating nested parent
>>> hwpt or not at attaching stage.
>> isn't it possible to rework the call sequence?
> I think not. Current sequence:
>
> attach_device()
> get_viommu_cap()
> create hwpt
> ...
> create hiod
> set_iommu_device(hiod)
>
> Hiod realize needs iommufd,devid and hwpt_id which are ready after attach_device().
OK. I would add this explanation in the commit msg too.
>
> Thanks
> Zhenzhong
Thanks
Eric
^ permalink raw reply [flat|nested] 55+ messages in thread
* RE: [PATCH v3 07/20] intel_iommu: Check for compatibility with IOMMUFD backed device when x-flts=on
2025-07-17 6:48 ` Eric Auger
@ 2025-07-17 7:03 ` Duan, Zhenzhong
0 siblings, 0 replies; 55+ messages in thread
From: Duan, Zhenzhong @ 2025-07-17 7:03 UTC (permalink / raw)
To: eric.auger@redhat.com, qemu-devel@nongnu.org
Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
jgg@nvidia.com, nicolinc@nvidia.com,
shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
Peng, Chao P
Hi Eric,
>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: Re: [PATCH v3 07/20] intel_iommu: Check for compatibility with
>IOMMUFD backed device when x-flts=on
>
>Hi Zhenzhong,
>
>On 7/17/25 5:47 AM, Duan, Zhenzhong wrote:
>> Hi Eric,
>>
>>> -----Original Message-----
>>> From: Eric Auger <eric.auger@redhat.com>
>>> Sent: Wednesday, July 16, 2025 8:09 PM
>>> To: Duan, Zhenzhong <zhenzhong.duan@intel.com>;
>>> qemu-devel@nongnu.org
>>> Cc: alex.williamson@redhat.com; clg@redhat.com; mst@redhat.com;
>>> jasowang@redhat.com; peterx@redhat.com; ddutile@redhat.com;
>>> jgg@nvidia.com; nicolinc@nvidia.com;
>>> shameerali.kolothum.thodi@huawei.com; joao.m.martins@oracle.com;
>>> clement.mathieu--drif@eviden.com; Tian, Kevin <kevin.tian@intel.com>;
>Liu,
>>> Yi L <yi.l.liu@intel.com>; Peng, Chao P <chao.p.peng@intel.com>
>>> Subject: Re: [PATCH v3 07/20] intel_iommu: Check for compatibility with
>>> IOMMUFD backed device when x-flts=on
>>>
>>> Hi Zhenzhong,
>>>
>>> On 7/16/25 12:31 PM, Duan, Zhenzhong wrote:
>>>> Hi Eric,
>>>>
>>>>> -----Original Message-----
>>>>> From: Eric Auger <eric.auger@redhat.com>
>>>>> Subject: Re: [PATCH v3 07/20] intel_iommu: Check for compatibility with
>>>>> IOMMUFD backed device when x-flts=on
>>>>>
>>>>> Hi Zhenzhong,
>>>>>
>>>>> On 7/8/25 1:05 PM, Zhenzhong Duan wrote:
>>>>>> When vIOMMU is configured x-flts=on in scalable mode, stage-1 page
>>> table
>>>>>> is passed to host to construct nested page table. We need to check
>>>>>> compatibility of some critical IOMMU capabilities between vIOMMU
>and
>>>>>> host IOMMU to ensure guest stage-1 page table could be used by host.
>>>>>>
>>>>>> For instance, vIOMMU supports stage-1 1GB huge page mapping, but
>>> host
>>>>>> does not, then this IOMMUFD backed device should fail.
>>>>>>
>>>>>> Even of the checks pass, for now we willingly reject the association
>>>>>> because all the bits are not there yet.
>>>>>>
>>>>>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>>>>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>>>>> ---
>>>>>> hw/i386/intel_iommu.c | 30
>>>>> +++++++++++++++++++++++++++++-
>>>>>> hw/i386/intel_iommu_internal.h | 1 +
>>>>>> 2 files changed, 30 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>>>>>> index e90fd2f28f..c57ca02cdd 100644
>>>>>> --- a/hw/i386/intel_iommu.c
>>>>>> +++ b/hw/i386/intel_iommu.c
>>>>>> @@ -40,6 +40,7 @@
>>>>>> #include "kvm/kvm_i386.h"
>>>>>> #include "migration/vmstate.h"
>>>>>> #include "trace.h"
>>>>>> +#include "system/iommufd.h"
>>>>>>
>>>>>> /* context entry operations */
>>>>>> #define VTD_CE_GET_RID2PASID(ce) \
>>>>>> @@ -4355,7 +4356,34 @@ static bool
>vtd_check_hiod(IntelIOMMUState
>>> *s,
>>>>> HostIOMMUDevice *hiod,
>>>>>> return true;
>>>>>> }
>>>>>>
>>>>>> - error_setg(errp, "host device is uncompatible with stage-1
>>>>> translation");
>>>>>> +#ifdef CONFIG_IOMMUFD
>>>>>> + struct HostIOMMUDeviceCaps *caps = &hiod->caps;
>>>>>> + struct iommu_hw_info_vtd *vtd = &caps->vendor_caps.vtd;
>>>>> I am now confused about how this relates to vtd_get_viommu_cap().
>>>>> PCIIOMMUOps.set_iommu_device = vtd_dev_set_iommu_device calls
>>>>> vtd_check_hiod()
>>>>> viommu might return HW_NESTED_CAP through
>>>>> PCIIOMMUOps.get_viommu_cap
>>>>> without making sure the underlying HW IOMMU does support it. Is that
>a
>>>>> correct understanding? Maybe we should clarify the calling order
>between
>>>>> set_iommu_device vs get_viommu_cap? Could we check HW IOMMU
>>>>> prerequisites in vtd_get_viommu_cap() by enforcing this is called after
>>>>> set_iommu_device. I think we should clarify the exact semantic of
>>>>> get_viommu_cap().Thanks Eric
>>>> My understanding get_viommu_cap() returns pure vIOMMU's capabilities
>>>> with no host IOMMU's capabilities involved.
>>>>
>>>> For example, returned HW_NESTED_CAP means this vIOMMU has code
>>>> to support creating nested hwpt and attaching, no matter if host IOMMU
>>>> supports nesting or not.
>>> Then I think you need to refine the description in 2/20 to make this clear.
>>> stating explicitly get_viommu_cap returns theoretical capabilities which
>>> are independent on the actual host capabilities they may depend on.
>> Will do.
>>
>> For virtual vtd, we are unable to return capabilities depending on host
>capacities,
>> Because different host IOMMU may have different capabilities, we want to
>return
>> a consistent result only depending on user's cmdline config.
>ok
>>
>>>> The compatibility check between host IOMMU vs vIOMMU is done in
>>>> set_iommu_device(), see vtd_check_hiod().
>>>>
>>>> It's too late for VFIO to call get_viommu_cap() after set_iommu_device()
>>>> because we need get_viommu_cap() to determine if creating nested
>parent
>>>> hwpt or not at attaching stage.
>>> isn't it possible to rework the call sequence?
>> I think not. Current sequence:
>>
>> attach_device()
>> get_viommu_cap()
>> create hwpt
>> ...
>> create hiod
>> set_iommu_device(hiod)
>>
>> Hiod realize needs iommufd,devid and hwpt_id which are ready after
>attach_device().
>OK. I would add this explanation in the commit msg too.
Got it, will do.
Thanks
Zhenzhong
^ permalink raw reply [flat|nested] 55+ messages in thread
* [PATCH v3 08/20] intel_iommu: Fail passthrough device under PCI bridge if x-flts=on
2025-07-08 11:05 [PATCH v3 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
` (6 preceding siblings ...)
2025-07-08 11:05 ` [PATCH v3 07/20] intel_iommu: Check for compatibility with IOMMUFD backed device when x-flts=on Zhenzhong Duan
@ 2025-07-08 11:05 ` Zhenzhong Duan
2025-07-16 9:53 ` Eric Auger
2025-07-08 11:05 ` [PATCH v3 09/20] intel_iommu: Introduce two helpers vtd_as_from/to_iommu_pasid_locked Zhenzhong Duan
` (11 subsequent siblings)
19 siblings, 1 reply; 55+ messages in thread
From: Zhenzhong Duan @ 2025-07-08 11:05 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan
Currently we don't support nested translation for passthrough device with
emulated device under same PCI bridge, because they require different address
space when x-flts=on.
In theory, we do support the case where the devices under the same PCI bridge are
all passthrough devices. But an emulated device can be hotplugged under the same
bridge. To simplify, just forbid passthrough devices under a PCI bridge no matter
whether there are, or will be, emulated devices under the same bridge. This is
acceptable because PCIE bridges are more popular than PCI bridges now.
Suggested-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
hw/i386/intel_iommu.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index c57ca02cdd..15f4393d6f 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -4330,9 +4330,10 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus,
return vtd_dev_as;
}
-static bool vtd_check_hiod(IntelIOMMUState *s, HostIOMMUDevice *hiod,
+static bool vtd_check_hiod(IntelIOMMUState *s, VTDHostIOMMUDevice *vtd_hiod,
Error **errp)
{
+ HostIOMMUDevice *hiod = vtd_hiod->hiod;
HostIOMMUDeviceClass *hiodc = HOST_IOMMU_DEVICE_GET_CLASS(hiod);
int ret;
@@ -4359,6 +4360,8 @@ static bool vtd_check_hiod(IntelIOMMUState *s, HostIOMMUDevice *hiod,
#ifdef CONFIG_IOMMUFD
struct HostIOMMUDeviceCaps *caps = &hiod->caps;
struct iommu_hw_info_vtd *vtd = &caps->vendor_caps.vtd;
+ PCIBus *bus = vtd_hiod->bus;
+ PCIDevice *pdev = pci_find_device(bus, pci_bus_num(bus), vtd_hiod->devfn);
/* Remaining checks are all stage-1 translation specific */
if (!object_dynamic_cast(OBJECT(hiod), TYPE_HOST_IOMMU_DEVICE_IOMMUFD)) {
@@ -4381,6 +4384,12 @@ static bool vtd_check_hiod(IntelIOMMUState *s, HostIOMMUDevice *hiod,
error_setg(errp, "Stage-1 1GB huge page is unsupported by host IOMMU");
return false;
}
+
+ if (pci_device_get_iommu_bus_devfn(pdev, &bus, NULL, NULL)) {
+ error_setg(errp, "Host device under PCI bridge is unsupported "
+ "when x-flts=on");
+ return false;
+ }
#endif
error_setg(errp, "host IOMMU is incompatible with stage-1 translation");
@@ -4414,7 +4423,7 @@ static bool vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
vtd_hiod->iommu_state = s;
vtd_hiod->hiod = hiod;
- if (!vtd_check_hiod(s, hiod, errp)) {
+ if (!vtd_check_hiod(s, vtd_hiod, errp)) {
g_free(vtd_hiod);
vtd_iommu_unlock(s);
return false;
--
2.47.1
^ permalink raw reply related [flat|nested] 55+ messages in thread
* Re: [PATCH v3 08/20] intel_iommu: Fail passthrough device under PCI bridge if x-flts=on
2025-07-08 11:05 ` [PATCH v3 08/20] intel_iommu: Fail passthrough device under PCI bridge if x-flts=on Zhenzhong Duan
@ 2025-07-16 9:53 ` Eric Auger
2025-07-17 3:24 ` Duan, Zhenzhong
0 siblings, 1 reply; 55+ messages in thread
From: Eric Auger @ 2025-07-16 9:53 UTC (permalink / raw)
To: Zhenzhong Duan, qemu-devel
Cc: alex.williamson, clg, mst, jasowang, peterx, ddutile, jgg,
nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng
Hi Zhenzhong,
On 7/8/25 1:05 PM, Zhenzhong Duan wrote:
> Currently we don't support nested translation for passthrough device with
> emulated device under same PCI bridge, because they require different address
> space when x-flts=on.
>
> In theory, we do support if devices under same PCI bridge are all passthrough
> devices. But emulated device can be hotplugged under same bridge. For simplify,
to simplify
> just forbid passthrough device under PCI bridge no matter if there is, or will
> be emulated devices under same bridge. This is acceptable because PCIE bridge
> is more popular than PCI bridge now.
>
> Suggested-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
> hw/i386/intel_iommu.c | 13 +++++++++++--
> 1 file changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index c57ca02cdd..15f4393d6f 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -4330,9 +4330,10 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus,
> return vtd_dev_as;
> }
>
> -static bool vtd_check_hiod(IntelIOMMUState *s, HostIOMMUDevice *hiod,
> +static bool vtd_check_hiod(IntelIOMMUState *s, VTDHostIOMMUDevice *vtd_hiod,
> Error **errp)
> {
> + HostIOMMUDevice *hiod = vtd_hiod->hiod;
> HostIOMMUDeviceClass *hiodc = HOST_IOMMU_DEVICE_GET_CLASS(hiod);
> int ret;
>
> @@ -4359,6 +4360,8 @@ static bool vtd_check_hiod(IntelIOMMUState *s, HostIOMMUDevice *hiod,
> #ifdef CONFIG_IOMMUFD
> struct HostIOMMUDeviceCaps *caps = &hiod->caps;
> struct iommu_hw_info_vtd *vtd = &caps->vendor_caps.vtd;
> + PCIBus *bus = vtd_hiod->bus;
> + PCIDevice *pdev = pci_find_device(bus, pci_bus_num(bus), vtd_hiod->devfn);
>
> /* Remaining checks are all stage-1 translation specific */
> if (!object_dynamic_cast(OBJECT(hiod), TYPE_HOST_IOMMU_DEVICE_IOMMUFD)) {
> @@ -4381,6 +4384,12 @@ static bool vtd_check_hiod(IntelIOMMUState *s, HostIOMMUDevice *hiod,
> error_setg(errp, "Stage-1 1GB huge page is unsupported by host IOMMU");
> return false;
> }
> +
> + if (pci_device_get_iommu_bus_devfn(pdev, &bus, NULL, NULL)) {
> + error_setg(errp, "Host device under PCI bridge is unsupported "
> + "when x-flts=on");
> + return false;
> + }
> #endif
>
> error_setg(errp, "host IOMMU is incompatible with stage-1 translation");
> @@ -4414,7 +4423,7 @@ static bool vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
> vtd_hiod->iommu_state = s;
> vtd_hiod->hiod = hiod;
>
> - if (!vtd_check_hiod(s, hiod, errp)) {
> + if (!vtd_check_hiod(s, vtd_hiod, errp)) {
> g_free(vtd_hiod);
> vtd_iommu_unlock(s);
> return false;
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Eric
^ permalink raw reply [flat|nested] 55+ messages in thread
* RE: [PATCH v3 08/20] intel_iommu: Fail passthrough device under PCI bridge if x-flts=on
2025-07-16 9:53 ` Eric Auger
@ 2025-07-17 3:24 ` Duan, Zhenzhong
0 siblings, 0 replies; 55+ messages in thread
From: Duan, Zhenzhong @ 2025-07-17 3:24 UTC (permalink / raw)
To: eric.auger@redhat.com, qemu-devel@nongnu.org
Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
jgg@nvidia.com, nicolinc@nvidia.com,
shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
Peng, Chao P
>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: Re: [PATCH v3 08/20] intel_iommu: Fail passthrough device under PCI
>bridge if x-flts=on
>
>Hi Zhenzhong,
>
>On 7/8/25 1:05 PM, Zhenzhong Duan wrote:
>> Currently we don't support nested translation for passthrough device with
>> emulated device under same PCI bridge, because they require different
>address
>> space when x-flts=on.
>>
>> In theory, we do support if devices under same PCI bridge are all
>passthrough
>> devices. But emulated device can be hotplugged under same bridge. For
>simplify,
>to simplify
Done.
Thanks
Zhenzhong
^ permalink raw reply [flat|nested] 55+ messages in thread
* [PATCH v3 09/20] intel_iommu: Introduce two helpers vtd_as_from/to_iommu_pasid_locked
2025-07-08 11:05 [PATCH v3 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
` (7 preceding siblings ...)
2025-07-08 11:05 ` [PATCH v3 08/20] intel_iommu: Fail passthrough device under PCI bridge if x-flts=on Zhenzhong Duan
@ 2025-07-08 11:05 ` Zhenzhong Duan
2025-07-16 12:53 ` Eric Auger
2025-07-08 11:05 ` [PATCH v3 10/20] intel_iommu: Handle PASID entry removing and updating Zhenzhong Duan
` (10 subsequent siblings)
19 siblings, 1 reply; 55+ messages in thread
From: Zhenzhong Duan @ 2025-07-08 11:05 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan
A PCI device supports two request types, Requests-without-PASID and
Requests-with-PASID. Requests-without-PASID doesn't include a PASID TLP
prefix; the IOMMU fetches rid_pasid from the context entry and uses it as
the IOMMU's pasid to index the pasid table.
So we need to translate between PCI's pasid and IOMMU's pasid specially
for Requests-without-PASID, e.g., PCI_NO_PASID(-1) <-> rid_pasid.
For Requests-with-PASID, PCI's pasid and IOMMU's pasid are same value.
vtd_as_from_iommu_pasid_locked() translates from BDF+iommu_pasid to vtd_as
which contains PCI's pasid vtd_as->pasid.
vtd_as_to_iommu_pasid_locked() translates from BDF+vtd_as->pasid to iommu_pasid.
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
hw/i386/intel_iommu.c | 58 +++++++++++++++++++++++++++++++++++++++++++
1 file changed, 58 insertions(+)
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 15f4393d6f..38e7f7b7be 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -1602,6 +1602,64 @@ static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
return 0;
}
+static int vtd_as_to_iommu_pasid_locked(VTDAddressSpace *vtd_as,
+ uint32_t *pasid)
+{
+ VTDContextCacheEntry *cc_entry = &vtd_as->context_cache_entry;
+ IntelIOMMUState *s = vtd_as->iommu_state;
+ uint8_t bus_num = pci_bus_num(vtd_as->bus);
+ uint8_t devfn = vtd_as->devfn;
+ VTDContextEntry ce;
+ int ret;
+
+ if (cc_entry->context_cache_gen == s->context_cache_gen) {
+ ce = cc_entry->context_entry;
+ } else {
+ ret = vtd_dev_to_context_entry(s, bus_num, devfn, &ce);
+ if (ret) {
+ return ret;
+ }
+ }
+
+ /* Translate to iommu pasid if PCI_NO_PASID */
+ if (vtd_as->pasid == PCI_NO_PASID) {
+ *pasid = VTD_CE_GET_RID2PASID(&ce);
+ } else {
+ *pasid = vtd_as->pasid;
+ }
+
+ return 0;
+}
+
+static gboolean vtd_find_as_by_sid_and_iommu_pasid(gpointer key, gpointer value,
+ gpointer user_data)
+{
+ VTDAddressSpace *vtd_as = (VTDAddressSpace *)value;
+ struct vtd_as_raw_key *target = (struct vtd_as_raw_key *)user_data;
+ uint16_t sid = PCI_BUILD_BDF(pci_bus_num(vtd_as->bus), vtd_as->devfn);
+ uint32_t pasid;
+
+ if (vtd_as_to_iommu_pasid_locked(vtd_as, &pasid)) {
+ return false;
+ }
+
+ return (pasid == target->pasid) && (sid == target->sid);
+}
+
+/* Translate iommu pasid to vtd_as */
+static inline
+VTDAddressSpace *vtd_as_from_iommu_pasid_locked(IntelIOMMUState *s,
+ uint16_t sid, uint32_t pasid)
+{
+ struct vtd_as_raw_key key = {
+ .sid = sid,
+ .pasid = pasid
+ };
+
+ return g_hash_table_find(s->vtd_address_spaces,
+ vtd_find_as_by_sid_and_iommu_pasid, &key);
+}
+
static int vtd_sync_shadow_page_hook(const IOMMUTLBEvent *event,
void *private)
{
--
2.47.1
^ permalink raw reply related [flat|nested] 55+ messages in thread
* Re: [PATCH v3 09/20] intel_iommu: Introduce two helpers vtd_as_from/to_iommu_pasid_locked
2025-07-08 11:05 ` [PATCH v3 09/20] intel_iommu: Introduce two helpers vtd_as_from/to_iommu_pasid_locked Zhenzhong Duan
@ 2025-07-16 12:53 ` Eric Auger
2025-07-17 3:48 ` Duan, Zhenzhong
0 siblings, 1 reply; 55+ messages in thread
From: Eric Auger @ 2025-07-16 12:53 UTC (permalink / raw)
To: Zhenzhong Duan, qemu-devel
Cc: alex.williamson, clg, mst, jasowang, peterx, ddutile, jgg,
nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng
Hi Zhenzhong,
On 7/8/25 1:05 PM, Zhenzhong Duan wrote:
> PCI device supports two request types, Requests-without-PASID and
> Requests-with-PASID. Requests-without-PASID doesn't include a PASID TLP
> prefix, IOMMU fetches rid_pasid from context entry and use it as IOMMU's
> pasid to index pasid table.
>
> So we need to translate between PCI's pasid and IOMMU's pasid specially
> for Requests-without-PASID, e.g., PCI_NO_PASID(-1) <-> rid_pasid.
> For Requests-with-PASID, PCI's pasid and IOMMU's pasid are same value.
>
> vtd_as_from_iommu_pasid_locked() translates from BDF+iommu_pasid to vtd_as
> which contains PCI's pasid vtd_as->pasid.
>
> vtd_as_to_iommu_pasid_locked() translates from BDF+vtd_as->pasid to iommu_pasid.
>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
> hw/i386/intel_iommu.c | 58 +++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 58 insertions(+)
>
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 15f4393d6f..38e7f7b7be 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -1602,6 +1602,64 @@ static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
> return 0;
> }
>
> +static int vtd_as_to_iommu_pasid_locked(VTDAddressSpace *vtd_as,
> + uint32_t *pasid)
> +{
> + VTDContextCacheEntry *cc_entry = &vtd_as->context_cache_entry;
> + IntelIOMMUState *s = vtd_as->iommu_state;
> + uint8_t bus_num = pci_bus_num(vtd_as->bus);
> + uint8_t devfn = vtd_as->devfn;
> + VTDContextEntry ce;
> + int ret;
> +
> + if (cc_entry->context_cache_gen == s->context_cache_gen) {
> + ce = cc_entry->context_entry;
> + } else {
> + ret = vtd_dev_to_context_entry(s, bus_num, devfn, &ce);
> + if (ret) {
> + return ret;
you need to retrieve the ce only if vtd_as->pasid. So this can be moved
to the conditional block below.
> + }
> + }
> +
> + /* Translate to iommu pasid if PCI_NO_PASID */
> + if (vtd_as->pasid == PCI_NO_PASID) {
> + *pasid = VTD_CE_GET_RID2PASID(&ce);
> + } else {
> + *pasid = vtd_as->pasid;
> + }
> +
> + return 0;
> +}
> +
> +static gboolean vtd_find_as_by_sid_and_iommu_pasid(gpointer key, gpointer value,
> + gpointer user_data)
> +{
> + VTDAddressSpace *vtd_as = (VTDAddressSpace *)value;
> + struct vtd_as_raw_key *target = (struct vtd_as_raw_key *)user_data;
> + uint16_t sid = PCI_BUILD_BDF(pci_bus_num(vtd_as->bus), vtd_as->devfn);
> + uint32_t pasid;
> +
> + if (vtd_as_to_iommu_pasid_locked(vtd_as, &pasid)) {
> + return false;
> + }
> +
> + return (pasid == target->pasid) && (sid == target->sid);
> +}
> +
> +/* Translate iommu pasid to vtd_as */
> +static inline
> +VTDAddressSpace *vtd_as_from_iommu_pasid_locked(IntelIOMMUState *s,
> + uint16_t sid, uint32_t pasid)
> +{
> + struct vtd_as_raw_key key = {
> + .sid = sid,
> + .pasid = pasid
> + };
> +
> + return g_hash_table_find(s->vtd_address_spaces,
> + vtd_find_as_by_sid_and_iommu_pasid, &key);
> +}
> +
> static int vtd_sync_shadow_page_hook(const IOMMUTLBEvent *event,
> void *private)
> {
With that addressed,
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Eric
^ permalink raw reply [flat|nested] 55+ messages in thread
* RE: [PATCH v3 09/20] intel_iommu: Introduce two helpers vtd_as_from/to_iommu_pasid_locked
2025-07-16 12:53 ` Eric Auger
@ 2025-07-17 3:48 ` Duan, Zhenzhong
0 siblings, 0 replies; 55+ messages in thread
From: Duan, Zhenzhong @ 2025-07-17 3:48 UTC (permalink / raw)
To: eric.auger@redhat.com, qemu-devel@nongnu.org
Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
jgg@nvidia.com, nicolinc@nvidia.com,
shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
Peng, Chao P
>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: Re: [PATCH v3 09/20] intel_iommu: Introduce two helpers
>vtd_as_from/to_iommu_pasid_locked
>
>Hi Zhenzhong,
>
>On 7/8/25 1:05 PM, Zhenzhong Duan wrote:
>> PCI device supports two request types, Requests-without-PASID and
>> Requests-with-PASID. Requests-without-PASID doesn't include a PASID TLP
>> prefix, IOMMU fetches rid_pasid from context entry and use it as IOMMU's
>> pasid to index pasid table.
>>
>> So we need to translate between PCI's pasid and IOMMU's pasid specially
>> for Requests-without-PASID, e.g., PCI_NO_PASID(-1) <-> rid_pasid.
>> For Requests-with-PASID, PCI's pasid and IOMMU's pasid are same value.
>>
>> vtd_as_from_iommu_pasid_locked() translates from BDF+iommu_pasid to
>vtd_as
>> which contains PCI's pasid vtd_as->pasid.
>>
>> vtd_as_to_iommu_pasid_locked() translates from BDF+vtd_as->pasid to
>iommu_pasid.
>>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>> hw/i386/intel_iommu.c | 58
>+++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 58 insertions(+)
>>
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index 15f4393d6f..38e7f7b7be 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -1602,6 +1602,64 @@ static int
>vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
>> return 0;
>> }
>>
>> +static int vtd_as_to_iommu_pasid_locked(VTDAddressSpace *vtd_as,
>> + uint32_t *pasid)
>> +{
>> + VTDContextCacheEntry *cc_entry = &vtd_as->context_cache_entry;
>> + IntelIOMMUState *s = vtd_as->iommu_state;
>> + uint8_t bus_num = pci_bus_num(vtd_as->bus);
>> + uint8_t devfn = vtd_as->devfn;
>> + VTDContextEntry ce;
>> + int ret;
>> +
>> + if (cc_entry->context_cache_gen == s->context_cache_gen) {
>> + ce = cc_entry->context_entry;
>> + } else {
>> + ret = vtd_dev_to_context_entry(s, bus_num, devfn, &ce);
>> + if (ret) {
>> + return ret;
>you need to retrieve the ce only if vtd_as->pasid. So this can be moved
>to the conditional block below.
Good idea! Will do.
Thanks
Zhenzhong
^ permalink raw reply [flat|nested] 55+ messages in thread
* [PATCH v3 10/20] intel_iommu: Handle PASID entry removing and updating
2025-07-08 11:05 [PATCH v3 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
` (8 preceding siblings ...)
2025-07-08 11:05 ` [PATCH v3 09/20] intel_iommu: Introduce two helpers vtd_as_from/to_iommu_pasid_locked Zhenzhong Duan
@ 2025-07-08 11:05 ` Zhenzhong Duan
2025-07-16 15:10 ` Eric Auger
2025-07-08 11:05 ` [PATCH v3 11/20] intel_iommu: Handle PASID entry adding Zhenzhong Duan
` (9 subsequent siblings)
19 siblings, 1 reply; 55+ messages in thread
From: Zhenzhong Duan @ 2025-07-08 11:05 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan, Yi Sun
This adds a new entry VTDPASIDCacheEntry in VTDAddressSpace to cache the
pasid entry and track PASID usage and future PASID tagged DMA address
translation support in vIOMMU.
VTDAddressSpace of PCI_NO_PASID is allocated when device is plugged and
never freed. For other pasid, VTDAddressSpace instance is created/destroyed
per the guest pasid entry set up/destroy.
When guest modifies a PASID entry, QEMU will capture the guest pasid selective
pasid cache invalidation, allocate or remove a VTDAddressSpace instance per the
invalidation reasons:
a) a present pasid entry moved to non-present
b) a present pasid entry to be a present entry
c) a non-present pasid entry moved to present
This patch handles a) and b), following patch will handle c).
vIOMMU emulator could figure out the reason by fetching latest guest pasid entry
and compare it with cached PASID entry.
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
hw/i386/intel_iommu.c | 194 +++++++++++++++++++++++++++++++--
hw/i386/intel_iommu_internal.h | 27 ++++-
hw/i386/trace-events | 3 +
include/hw/i386/intel_iommu.h | 6 +
4 files changed, 218 insertions(+), 12 deletions(-)
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 38e7f7b7be..5bda439de6 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -1675,7 +1675,7 @@ static uint16_t vtd_get_domain_id(IntelIOMMUState *s,
if (s->root_scalable) {
vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
- return VTD_SM_PASID_ENTRY_DID(pe.val[1]);
+ return VTD_SM_PASID_ENTRY_DID(&pe);
}
return VTD_CONTEXT_ENTRY_DID(ce->hi);
@@ -3103,6 +3103,181 @@ static bool vtd_process_piotlb_desc(IntelIOMMUState *s,
return true;
}
+static inline int vtd_dev_get_pe_from_pasid(VTDAddressSpace *vtd_as,
+ uint32_t pasid, VTDPASIDEntry *pe)
+{
+ IntelIOMMUState *s = vtd_as->iommu_state;
+ VTDContextEntry ce;
+ int ret;
+
+ if (!s->root_scalable) {
+ return -VTD_FR_RTADDR_INV_TTM;
+ }
+
+ ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus), vtd_as->devfn,
+ &ce);
+ if (ret) {
+ return ret;
+ }
+
+ return vtd_ce_get_pasid_entry(s, &ce, pe, pasid);
+}
+
+static bool vtd_pasid_entry_compare(VTDPASIDEntry *p1, VTDPASIDEntry *p2)
+{
+ return !memcmp(p1, p2, sizeof(*p1));
+}
+
+/*
+ * This function is used to update or clear cached pasid entry in vtd_as.
+ * vtd_as is released when corresponding cached pasid entry is cleared,
+ * except for PCI_NO_PASID which vtd_as is owen by PCI sub-system.
+ */
+static gboolean vtd_flush_pasid_locked(gpointer key, gpointer value,
+ gpointer user_data)
+{
+ VTDPASIDCacheInfo *pc_info = user_data;
+ VTDAddressSpace *vtd_as = value;
+ VTDPASIDCacheEntry *pc_entry = &vtd_as->pasid_cache_entry;
+ VTDPASIDEntry pe;
+ uint16_t did;
+ uint32_t pasid;
+ int ret;
+
+ if (!pc_entry->valid) {
+ return false;
+ }
+ did = VTD_SM_PASID_ENTRY_DID(&pc_entry->pasid_entry);
+
+ if (vtd_as_to_iommu_pasid_locked(vtd_as, &pasid)) {
+ goto remove;
+ }
+
+ switch (pc_info->type) {
+ case VTD_PASID_CACHE_PASIDSI:
+ if (pc_info->pasid != pasid) {
+ return false;
+ }
+ /* fall through */
+ case VTD_PASID_CACHE_DOMSI:
+ if (pc_info->did != did) {
+ return false;
+ }
+ /* fall through */
+ case VTD_PASID_CACHE_GLOBAL_INV:
+ break;
+ default:
+ error_setg(&error_fatal, "invalid pc_info->type for flush");
+ }
+
+ /*
+ * pasid cache invalidation may indicate a present pasid
+ * entry to present pasid entry modification. To cover such
+ * case, vIOMMU emulator needs to fetch latest guest pasid
+ * entry and compares with cached pasid entry, then update
+ * pasid cache.
+ */
+ ret = vtd_dev_get_pe_from_pasid(vtd_as, pasid, &pe);
+ if (ret) {
+ /*
+ * No valid pasid entry in guest memory. e.g. pasid entry
+ * was modified to be either all-zero or non-present. Either
+ * case means existing pasid cache should be removed.
+ */
+ goto remove;
+ }
+
+ /* No need to update if cached pasid entry is latest */
+ if (!vtd_pasid_entry_compare(&pe, &pc_entry->pasid_entry)) {
+ pc_entry->pasid_entry = pe;
+ }
+ return false;
+
+remove:
+ pc_entry->valid = false;
+
+ /*
+ * Don't remove address space of PCI_NO_PASID which is created for PCI
+ * sub-system.
+ */
+ if (vtd_as->pasid == PCI_NO_PASID) {
+ return false;
+ }
+ return true;
+}
+
+/* Update the pasid cache in vIOMMU */
+static void vtd_pasid_cache_sync(IntelIOMMUState *s, VTDPASIDCacheInfo *pc_info)
+{
+ if (!s->flts || !s->root_scalable || !s->dmar_enabled) {
+ return;
+ }
+
+ /*
+ * Regards to a pasid cache invalidation, e.g. a PSI.
+ * It could be either cases of below:
+ * a) a present pasid entry moved to non-present
+ * b) a present pasid entry to be a present entry
+ * c) a non-present pasid entry moved to present
+ */
+ vtd_iommu_lock(s);
+ /*
+ * a,b): loop all the existing vtd_as instances for pasid cache remove
+ or update.
+ */
+ g_hash_table_foreach_remove(s->vtd_address_spaces, vtd_flush_pasid_locked,
+ pc_info);
+ vtd_iommu_unlock(s);
+}
+
+static bool vtd_process_pasid_desc(IntelIOMMUState *s,
+ VTDInvDesc *inv_desc)
+{
+ uint16_t did;
+ uint32_t pasid;
+ VTDPASIDCacheInfo pc_info;
+ uint64_t mask[4] = {VTD_INV_DESC_PASIDC_RSVD_VAL0, VTD_INV_DESC_ALL_ONE,
+ VTD_INV_DESC_ALL_ONE, VTD_INV_DESC_ALL_ONE};
+
+ if (!vtd_inv_desc_reserved_check(s, inv_desc, mask, true,
+ __func__, "pasid cache inv")) {
+ return false;
+ }
+
+ did = VTD_INV_DESC_PASIDC_DID(inv_desc);
+ pasid = VTD_INV_DESC_PASIDC_PASID(inv_desc);
+
+ switch (VTD_INV_DESC_PASIDC_G(inv_desc)) {
+ case VTD_INV_DESC_PASIDC_G_DSI:
+ trace_vtd_pasid_cache_dsi(did);
+ pc_info.type = VTD_PASID_CACHE_DOMSI;
+ pc_info.did = did;
+ break;
+
+ case VTD_INV_DESC_PASIDC_G_PASID_SI:
+ /* PASID selective implies a DID selective */
+ trace_vtd_pasid_cache_psi(did, pasid);
+ pc_info.type = VTD_PASID_CACHE_PASIDSI;
+ pc_info.did = did;
+ pc_info.pasid = pasid;
+ break;
+
+ case VTD_INV_DESC_PASIDC_G_GLOBAL:
+ trace_vtd_pasid_cache_gsi();
+ pc_info.type = VTD_PASID_CACHE_GLOBAL_INV;
+ break;
+
+ default:
+ error_report_once("invalid granularity field in PASID-cache invalidate "
+ "descriptor, hi: 0x%"PRIx64" lo: 0x%" PRIx64,
+ inv_desc->val[1], inv_desc->val[0]);
+ return false;
+ }
+
+ vtd_pasid_cache_sync(s, &pc_info);
+ return true;
+}
+
static bool vtd_process_inv_iec_desc(IntelIOMMUState *s,
VTDInvDesc *inv_desc)
{
@@ -3264,6 +3439,13 @@ static bool vtd_process_inv_desc(IntelIOMMUState *s)
}
break;
+ case VTD_INV_DESC_PC:
+ trace_vtd_inv_desc("pasid-cache", inv_desc.val[1], inv_desc.val[0]);
+ if (!vtd_process_pasid_desc(s, &inv_desc)) {
+ return false;
+ }
+ break;
+
case VTD_INV_DESC_PIOTLB:
trace_vtd_inv_desc("p-iotlb", inv_desc.val[1], inv_desc.val[0]);
if (!vtd_process_piotlb_desc(s, &inv_desc)) {
@@ -3299,16 +3481,6 @@ static bool vtd_process_inv_desc(IntelIOMMUState *s)
}
break;
- /*
- * TODO: the entity of below two cases will be implemented in future series.
- * To make guest (which integrates scalable mode support patch set in
- * iommu driver) work, just return true is enough so far.
- */
- case VTD_INV_DESC_PC:
- if (s->scalable_mode) {
- break;
- }
- /* fallthrough */
default:
error_report_once("%s: invalid inv desc: hi=%"PRIx64", lo=%"PRIx64
" (unknown type)", __func__, inv_desc.hi,
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 18bc22fc72..87059d26aa 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -315,6 +315,7 @@ typedef enum VTDFaultReason {
* request while disabled */
VTD_FR_IR_SID_ERR = 0x26, /* Invalid Source-ID */
+ VTD_FR_RTADDR_INV_TTM = 0x31, /* Invalid TTM in RTADDR */
/* PASID directory entry access failure */
VTD_FR_PASID_DIR_ACCESS_ERR = 0x50,
/* The Present(P) field of pasid directory entry is 0 */
@@ -492,6 +493,15 @@ typedef union VTDInvDesc VTDInvDesc;
#define VTD_INV_DESC_PIOTLB_RSVD_VAL0 0xfff000000000f1c0ULL
#define VTD_INV_DESC_PIOTLB_RSVD_VAL1 0xf80ULL
+/* PASID-cache Invalidate Descriptor (pc_inv_dsc) fields */
+#define VTD_INV_DESC_PASIDC_G(x) extract64((x)->val[0], 4, 2)
+#define VTD_INV_DESC_PASIDC_G_DSI 0
+#define VTD_INV_DESC_PASIDC_G_PASID_SI 1
+#define VTD_INV_DESC_PASIDC_G_GLOBAL 3
+#define VTD_INV_DESC_PASIDC_DID(x) extract64((x)->val[0], 16, 16)
+#define VTD_INV_DESC_PASIDC_PASID(x) extract64((x)->val[0], 32, 20)
+#define VTD_INV_DESC_PASIDC_RSVD_VAL0 0xfff000000000f1c0ULL
+
/* Information about page-selective IOTLB invalidate */
struct VTDIOTLBPageInvInfo {
uint16_t domain_id;
@@ -552,6 +562,21 @@ typedef struct VTDRootEntry VTDRootEntry;
#define VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw) (0x1e0ULL | ~VTD_HAW_MASK(aw))
#define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1 0xffffffffffe00000ULL
+typedef enum VTDPCInvType {
+ /* VTD spec defined PASID cache invalidation type */
+ VTD_PASID_CACHE_DOMSI = VTD_INV_DESC_PASIDC_G_DSI,
+ VTD_PASID_CACHE_PASIDSI = VTD_INV_DESC_PASIDC_G_PASID_SI,
+ VTD_PASID_CACHE_GLOBAL_INV = VTD_INV_DESC_PASIDC_G_GLOBAL,
+} VTDPCInvType;
+
+typedef struct VTDPASIDCacheInfo {
+ VTDPCInvType type;
+ uint16_t did;
+ uint32_t pasid;
+ PCIBus *bus;
+ uint16_t devfn;
+} VTDPASIDCacheInfo;
+
/* PASID Table Related Definitions */
#define VTD_PASID_DIR_BASE_ADDR_MASK (~0xfffULL)
#define VTD_PASID_TABLE_BASE_ADDR_MASK (~0xfffULL)
@@ -573,7 +598,7 @@ typedef struct VTDRootEntry VTDRootEntry;
#define VTD_SM_PASID_ENTRY_PT (4ULL << 6)
#define VTD_SM_PASID_ENTRY_AW 7ULL /* Adjusted guest-address-width */
-#define VTD_SM_PASID_ENTRY_DID(val) ((val) & VTD_DOMAIN_ID_MASK)
+#define VTD_SM_PASID_ENTRY_DID(x) extract64((x)->val[1], 0, 16)
#define VTD_SM_PASID_ENTRY_FLPM 3ULL
#define VTD_SM_PASID_ENTRY_FLPTPTR (~0xfffULL)
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index ac9e1a10aa..ae5bbfcdc0 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -24,6 +24,9 @@ vtd_inv_qi_head(uint16_t head) "read head %d"
vtd_inv_qi_tail(uint16_t head) "write tail %d"
vtd_inv_qi_fetch(void) ""
vtd_context_cache_reset(void) ""
+vtd_pasid_cache_gsi(void) ""
+vtd_pasid_cache_dsi(uint16_t domain) "Domain selective PC invalidation domain 0x%"PRIx16
+vtd_pasid_cache_psi(uint16_t domain, uint32_t pasid) "PASID selective PC invalidation domain 0x%"PRIx16" pasid 0x%"PRIx32
vtd_re_not_present(uint8_t bus) "Root entry bus %"PRIu8" not present"
vtd_ce_not_present(uint8_t bus, uint8_t devfn) "Context entry bus %"PRIu8" devfn %"PRIu8" not present"
vtd_iotlb_page_hit(uint16_t sid, uint64_t addr, uint64_t slpte, uint16_t domain) "IOTLB page hit sid 0x%"PRIx16" iova 0x%"PRIx64" slpte 0x%"PRIx64" domain 0x%"PRIx16
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 50f9b27a45..0e3826f6f0 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -95,6 +95,11 @@ struct VTDPASIDEntry {
uint64_t val[8];
};
+typedef struct VTDPASIDCacheEntry {
+ struct VTDPASIDEntry pasid_entry;
+ bool valid;
+} VTDPASIDCacheEntry;
+
struct VTDAddressSpace {
PCIBus *bus;
uint8_t devfn;
@@ -107,6 +112,7 @@ struct VTDAddressSpace {
MemoryRegion iommu_ir_fault; /* Interrupt region for catching fault */
IntelIOMMUState *iommu_state;
VTDContextCacheEntry context_cache_entry;
+ VTDPASIDCacheEntry pasid_cache_entry;
QLIST_ENTRY(VTDAddressSpace) next;
/* Superset of notifier flags that this address space has */
IOMMUNotifierFlag notifier_flags;
--
2.47.1
^ permalink raw reply related [flat|nested] 55+ messages in thread
* Re: [PATCH v3 10/20] intel_iommu: Handle PASID entry removing and updating
2025-07-08 11:05 ` [PATCH v3 10/20] intel_iommu: Handle PASID entry removing and updating Zhenzhong Duan
@ 2025-07-16 15:10 ` Eric Auger
2025-07-17 7:02 ` Duan, Zhenzhong
0 siblings, 1 reply; 55+ messages in thread
From: Eric Auger @ 2025-07-16 15:10 UTC (permalink / raw)
To: Zhenzhong Duan, qemu-devel
Cc: alex.williamson, clg, mst, jasowang, peterx, ddutile, jgg,
nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng, Yi Sun
On 7/8/25 1:05 PM, Zhenzhong Duan wrote:
> This adds an new entry VTDPASIDCacheEntry in VTDAddressSpace to cache the
> pasid entry and track PASID usage and future PASID tagged DMA address
> translation support in vIOMMU.
>
> VTDAddressSpace of PCI_NO_PASID is allocated when device is plugged and
> never freed. For other pasid, VTDAddressSpace instance is created/destroyed
> per the guest pasid entry set up/destroy.
>
> When guest modifies a PASID entry, QEMU will capture the guest pasid selective
> pasid cache invalidation, allocate or remove a VTDAddressSpace instance per the
> invalidation reasons:
>
> a) a present pasid entry moved to non-present
> b) a present pasid entry to be a present entry
> c) a non-present pasid entry moved to present
>
> This patch handles a) and b), following patch will handle c).
>
> vIOMMU emulator could figure out the reason by fetching latest guest pasid entry
> and compare it with cached PASID entry.
>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
> hw/i386/intel_iommu.c | 194 +++++++++++++++++++++++++++++++--
> hw/i386/intel_iommu_internal.h | 27 ++++-
> hw/i386/trace-events | 3 +
> include/hw/i386/intel_iommu.h | 6 +
> 4 files changed, 218 insertions(+), 12 deletions(-)
>
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 38e7f7b7be..5bda439de6 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -1675,7 +1675,7 @@ static uint16_t vtd_get_domain_id(IntelIOMMUState *s,
>
> if (s->root_scalable) {
> vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
> - return VTD_SM_PASID_ENTRY_DID(pe.val[1]);
> + return VTD_SM_PASID_ENTRY_DID(&pe);
> }
>
> return VTD_CONTEXT_ENTRY_DID(ce->hi);
> @@ -3103,6 +3103,181 @@ static bool vtd_process_piotlb_desc(IntelIOMMUState *s,
> return true;
> }
>
> +static inline int vtd_dev_get_pe_from_pasid(VTDAddressSpace *vtd_as,
> + uint32_t pasid, VTDPASIDEntry *pe)
> +{
> + IntelIOMMUState *s = vtd_as->iommu_state;
> + VTDContextEntry ce;
> + int ret;
> +
> + if (!s->root_scalable) {
> + return -VTD_FR_RTADDR_INV_TTM;
> + }
> +
> + ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus), vtd_as->devfn,
> + &ce);
> + if (ret) {
> + return ret;
> + }
> +
> + return vtd_ce_get_pasid_entry(s, &ce, pe, pasid);
> +}
> +
> +static bool vtd_pasid_entry_compare(VTDPASIDEntry *p1, VTDPASIDEntry *p2)
> +{
> + return !memcmp(p1, p2, sizeof(*p1));
> +}
> +
> +/*
> + * This function is used to update or clear cached pasid entry in vtd_as.
> + * vtd_as is released when corresponding cached pasid entry is cleared,
> + * except for PCI_NO_PASID which vtd_as is owen by PCI sub-system.
I would document when it is supposed to return true (indicates that the
cached pasid entry needs to be removed).
> + */
> +static gboolean vtd_flush_pasid_locked(gpointer key, gpointer value,
> + gpointer user_data)
> +{
> + VTDPASIDCacheInfo *pc_info = user_data;
> + VTDAddressSpace *vtd_as = value;
> + VTDPASIDCacheEntry *pc_entry = &vtd_as->pasid_cache_entry;
> + VTDPASIDEntry pe;
> + uint16_t did;
> + uint32_t pasid;
> + int ret;
> +
> + if (!pc_entry->valid) {
> + return false;
> + }
> + did = VTD_SM_PASID_ENTRY_DID(&pc_entry->pasid_entry);
> +
> + if (vtd_as_to_iommu_pasid_locked(vtd_as, &pasid)) {
> + goto remove;
> + }
> +
> + switch (pc_info->type) {
> + case VTD_PASID_CACHE_PASIDSI:
> + if (pc_info->pasid != pasid) {
> + return false;
> + }
> + /* fall through */
> + case VTD_PASID_CACHE_DOMSI:
> + if (pc_info->did != did) {
> + return false;
> + }
> + /* fall through */
> + case VTD_PASID_CACHE_GLOBAL_INV:
> + break;
> + default:
> + error_setg(&error_fatal, "invalid pc_info->type for flush");
> + }
> +
> + /*
> + * pasid cache invalidation may indicate a present pasid
> + * entry to present pasid entry modification. To cover such
> + * case, vIOMMU emulator needs to fetch latest guest pasid
> + * entry and compares with cached pasid entry, then update
> + * pasid cache.
> + */
> + ret = vtd_dev_get_pe_from_pasid(vtd_as, pasid, &pe);
> + if (ret) {
> + /*
> + * No valid pasid entry in guest memory. e.g. pasid entry
> + * was modified to be either all-zero or non-present. Either
> + * case means existing pasid cache should be removed.
> + */
> + goto remove;
> + }
> +
> + /* No need to update if cached pasid entry is latest */
that comment sounds really weird to me. In case the cache entry does not
match the one in guest mem, you update it below - at least that's what I
understand ;-) - However you want to return false because you don't want
g_hash_table_foreach_remove() to remove the entry.
> + if (!vtd_pasid_entry_compare(&pe, &pc_entry->pasid_entry)) {
> + pc_entry->pasid_entry = pe;
> + }
> + return false;
> +
> +remove:
> + pc_entry->valid = false;
> +
> + /*
> + * Don't remove address space of PCI_NO_PASID which is created for PCI
> + * sub-system.
> + */
> + if (vtd_as->pasid == PCI_NO_PASID) {
> + return false;
> + }
> + return true;
> +}
> +
> +/* Update the pasid cache in vIOMMU */
> +static void vtd_pasid_cache_sync(IntelIOMMUState *s, VTDPASIDCacheInfo *pc_info)
> +{
> + if (!s->flts || !s->root_scalable || !s->dmar_enabled) {
> + return;
> + }
> +
> + /*
> + * Regards to a pasid cache invalidation, e.g. a PSI.
Regarded PASID cache invalidation?
> + * It could be either cases of below:
> + * a) a present pasid entry moved to non-present
a present cache pasid entry needs to be removed
> + * b) a present pasid entry to be a present entry
above sounds a bit weird. A present cached pasid entry needs to be updated
> + * c) a non-present pasid entry moved to present
a present cached pasid entry needs to be created. As this is not handled
in this patch I would move this to next one.
Besides since there is another comment below I am not even sure this is
requested or at least I would put this in a doc comment for the function
and not within the code.
> + */
> + vtd_iommu_lock(s);
> + /*
> + * a,b): loop all the existing vtd_as instances for pasid cache remove
> + or update.
see above
> + */
> + g_hash_table_foreach_remove(s->vtd_address_spaces, vtd_flush_pasid_locked,
> + pc_info);
> + vtd_iommu_unlock(s);
> +}
> +
> +static bool vtd_process_pasid_desc(IntelIOMMUState *s,
> + VTDInvDesc *inv_desc)
> +{
> + uint16_t did;
> + uint32_t pasid;
> + VTDPASIDCacheInfo pc_info;
> + uint64_t mask[4] = {VTD_INV_DESC_PASIDC_RSVD_VAL0, VTD_INV_DESC_ALL_ONE,
> + VTD_INV_DESC_ALL_ONE, VTD_INV_DESC_ALL_ONE};
> +
> + if (!vtd_inv_desc_reserved_check(s, inv_desc, mask, true,
> + __func__, "pasid cache inv")) {
> + return false;
> + }
> +
> + did = VTD_INV_DESC_PASIDC_DID(inv_desc);
> + pasid = VTD_INV_DESC_PASIDC_PASID(inv_desc);
> +
> + switch (VTD_INV_DESC_PASIDC_G(inv_desc)) {
> + case VTD_INV_DESC_PASIDC_G_DSI:
> + trace_vtd_pasid_cache_dsi(did);
> + pc_info.type = VTD_PASID_CACHE_DOMSI;
> + pc_info.did = did;
> + break;
> +
> + case VTD_INV_DESC_PASIDC_G_PASID_SI:
> + /* PASID selective implies a DID selective */
> + trace_vtd_pasid_cache_psi(did, pasid);
> + pc_info.type = VTD_PASID_CACHE_PASIDSI;
> + pc_info.did = did;
> + pc_info.pasid = pasid;
> + break;
> +
> + case VTD_INV_DESC_PASIDC_G_GLOBAL:
> + trace_vtd_pasid_cache_gsi();
> + pc_info.type = VTD_PASID_CACHE_GLOBAL_INV;
> + break;
> +
> + default:
> + error_report_once("invalid granularity field in PASID-cache invalidate "
> + "descriptor, hi: 0x%"PRIx64" lo: 0x%" PRIx64,
> + inv_desc->val[1], inv_desc->val[0]);
> + return false;
> + }
> +
> + vtd_pasid_cache_sync(s, &pc_info);
> + return true;
> +}
> +
> static bool vtd_process_inv_iec_desc(IntelIOMMUState *s,
> VTDInvDesc *inv_desc)
> {
> @@ -3264,6 +3439,13 @@ static bool vtd_process_inv_desc(IntelIOMMUState *s)
> }
> break;
>
> + case VTD_INV_DESC_PC:
> + trace_vtd_inv_desc("pasid-cache", inv_desc.val[1], inv_desc.val[0]);
> + if (!vtd_process_pasid_desc(s, &inv_desc)) {
> + return false;
> + }
> + break;
> +
> case VTD_INV_DESC_PIOTLB:
> trace_vtd_inv_desc("p-iotlb", inv_desc.val[1], inv_desc.val[0]);
> if (!vtd_process_piotlb_desc(s, &inv_desc)) {
> @@ -3299,16 +3481,6 @@ static bool vtd_process_inv_desc(IntelIOMMUState *s)
> }
> break;
>
> - /*
> - * TODO: the entity of below two cases will be implemented in future series.
> - * To make guest (which integrates scalable mode support patch set in
> - * iommu driver) work, just return true is enough so far.
> - */
> - case VTD_INV_DESC_PC:
> - if (s->scalable_mode) {
> - break;
> - }
> - /* fallthrough */
> default:
> error_report_once("%s: invalid inv desc: hi=%"PRIx64", lo=%"PRIx64
> " (unknown type)", __func__, inv_desc.hi,
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> index 18bc22fc72..87059d26aa 100644
> --- a/hw/i386/intel_iommu_internal.h
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -315,6 +315,7 @@ typedef enum VTDFaultReason {
> * request while disabled */
> VTD_FR_IR_SID_ERR = 0x26, /* Invalid Source-ID */
>
> + VTD_FR_RTADDR_INV_TTM = 0x31, /* Invalid TTM in RTADDR */
> /* PASID directory entry access failure */
> VTD_FR_PASID_DIR_ACCESS_ERR = 0x50,
> /* The Present(P) field of pasid directory entry is 0 */
> @@ -492,6 +493,15 @@ typedef union VTDInvDesc VTDInvDesc;
> #define VTD_INV_DESC_PIOTLB_RSVD_VAL0 0xfff000000000f1c0ULL
> #define VTD_INV_DESC_PIOTLB_RSVD_VAL1 0xf80ULL
>
> +/* PASID-cache Invalidate Descriptor (pc_inv_dsc) fields */
> +#define VTD_INV_DESC_PASIDC_G(x) extract64((x)->val[0], 4, 2)
> +#define VTD_INV_DESC_PASIDC_G_DSI 0
> +#define VTD_INV_DESC_PASIDC_G_PASID_SI 1
> +#define VTD_INV_DESC_PASIDC_G_GLOBAL 3
> +#define VTD_INV_DESC_PASIDC_DID(x) extract64((x)->val[0], 16, 16)
> +#define VTD_INV_DESC_PASIDC_PASID(x) extract64((x)->val[0], 32, 20)
thanks for converting to extract64 ;-)
> +#define VTD_INV_DESC_PASIDC_RSVD_VAL0 0xfff000000000f1c0ULL
nit: I find such mask definition error prone and difficult to review.
combined MAKE_64BIT_MASK() would make it clearer I think
> +
> /* Information about page-selective IOTLB invalidate */
> struct VTDIOTLBPageInvInfo {
> uint16_t domain_id;
> @@ -552,6 +562,21 @@ typedef struct VTDRootEntry VTDRootEntry;
> #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw) (0x1e0ULL | ~VTD_HAW_MASK(aw))
> #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1 0xffffffffffe00000ULL
>
> +typedef enum VTDPCInvType {
> + /* VTD spec defined PASID cache invalidation type */
> + VTD_PASID_CACHE_DOMSI = VTD_INV_DESC_PASIDC_G_DSI,
> + VTD_PASID_CACHE_PASIDSI = VTD_INV_DESC_PASIDC_G_PASID_SI,
> + VTD_PASID_CACHE_GLOBAL_INV = VTD_INV_DESC_PASIDC_G_GLOBAL,
> +} VTDPCInvType;
> +
> +typedef struct VTDPASIDCacheInfo {
> + VTDPCInvType type;
> + uint16_t did;
> + uint32_t pasid;
> + PCIBus *bus;
> + uint16_t devfn;
> +} VTDPASIDCacheInfo;
> +
> /* PASID Table Related Definitions */
> #define VTD_PASID_DIR_BASE_ADDR_MASK (~0xfffULL)
> #define VTD_PASID_TABLE_BASE_ADDR_MASK (~0xfffULL)
> @@ -573,7 +598,7 @@ typedef struct VTDRootEntry VTDRootEntry;
> #define VTD_SM_PASID_ENTRY_PT (4ULL << 6)
>
> #define VTD_SM_PASID_ENTRY_AW 7ULL /* Adjusted guest-address-width */
> -#define VTD_SM_PASID_ENTRY_DID(val) ((val) & VTD_DOMAIN_ID_MASK)
> +#define VTD_SM_PASID_ENTRY_DID(x) extract64((x)->val[1], 0, 16)
>
> #define VTD_SM_PASID_ENTRY_FLPM 3ULL
> #define VTD_SM_PASID_ENTRY_FLPTPTR (~0xfffULL)
> diff --git a/hw/i386/trace-events b/hw/i386/trace-events
> index ac9e1a10aa..ae5bbfcdc0 100644
> --- a/hw/i386/trace-events
> +++ b/hw/i386/trace-events
> @@ -24,6 +24,9 @@ vtd_inv_qi_head(uint16_t head) "read head %d"
> vtd_inv_qi_tail(uint16_t head) "write tail %d"
> vtd_inv_qi_fetch(void) ""
> vtd_context_cache_reset(void) ""
> +vtd_pasid_cache_gsi(void) ""
> +vtd_pasid_cache_dsi(uint16_t domain) "Domain selective PC invalidation domain 0x%"PRIx16
> +vtd_pasid_cache_psi(uint16_t domain, uint32_t pasid) "PASID selective PC invalidation domain 0x%"PRIx16" pasid 0x%"PRIx32
> vtd_re_not_present(uint8_t bus) "Root entry bus %"PRIu8" not present"
> vtd_ce_not_present(uint8_t bus, uint8_t devfn) "Context entry bus %"PRIu8" devfn %"PRIu8" not present"
> vtd_iotlb_page_hit(uint16_t sid, uint64_t addr, uint64_t slpte, uint16_t domain) "IOTLB page hit sid 0x%"PRIx16" iova 0x%"PRIx64" slpte 0x%"PRIx64" domain 0x%"PRIx16
> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> index 50f9b27a45..0e3826f6f0 100644
> --- a/include/hw/i386/intel_iommu.h
> +++ b/include/hw/i386/intel_iommu.h
> @@ -95,6 +95,11 @@ struct VTDPASIDEntry {
> uint64_t val[8];
> };
>
> +typedef struct VTDPASIDCacheEntry {
> + struct VTDPASIDEntry pasid_entry;
> + bool valid;
> +} VTDPASIDCacheEntry;
> +
> struct VTDAddressSpace {
> PCIBus *bus;
> uint8_t devfn;
> @@ -107,6 +112,7 @@ struct VTDAddressSpace {
> MemoryRegion iommu_ir_fault; /* Interrupt region for catching fault */
> IntelIOMMUState *iommu_state;
> VTDContextCacheEntry context_cache_entry;
> + VTDPASIDCacheEntry pasid_cache_entry;
> QLIST_ENTRY(VTDAddressSpace) next;
> /* Superset of notifier flags that this address space has */
> IOMMUNotifierFlag notifier_flags;
Thanks
Eric
^ permalink raw reply [flat|nested] 55+ messages in thread
* RE: [PATCH v3 10/20] intel_iommu: Handle PASID entry removing and updating
2025-07-16 15:10 ` Eric Auger
@ 2025-07-17 7:02 ` Duan, Zhenzhong
0 siblings, 0 replies; 55+ messages in thread
From: Duan, Zhenzhong @ 2025-07-17 7:02 UTC (permalink / raw)
To: eric.auger@redhat.com, qemu-devel@nongnu.org
Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
jgg@nvidia.com, nicolinc@nvidia.com,
shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
Peng, Chao P, Yi Sun
>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: Re: [PATCH v3 10/20] intel_iommu: Handle PASID entry removing and
>updating
...
>> +static bool vtd_pasid_entry_compare(VTDPASIDEntry *p1, VTDPASIDEntry
>*p2)
>> +{
>> + return !memcmp(p1, p2, sizeof(*p1));
>> +}
>> +
>> +/*
>> + * This function is used to update or clear cached pasid entry in vtd_as.
>> + * vtd_as is released when corresponding cached pasid entry is cleared,
>> + * except for PCI_NO_PASID which vtd_as is owen by PCI sub-system.
>I would document when it is supposed to return true (indicates that the
>cached pasid entry needs to be removed).
Will do.
>> + */
>> +static gboolean vtd_flush_pasid_locked(gpointer key, gpointer value,
>> + gpointer user_data)
>> +{
>> + VTDPASIDCacheInfo *pc_info = user_data;
>> + VTDAddressSpace *vtd_as = value;
>> + VTDPASIDCacheEntry *pc_entry = &vtd_as->pasid_cache_entry;
>> + VTDPASIDEntry pe;
>> + uint16_t did;
>> + uint32_t pasid;
>> + int ret;
>> +
>> + if (!pc_entry->valid) {
>> + return false;
>> + }
>> + did = VTD_SM_PASID_ENTRY_DID(&pc_entry->pasid_entry);
>> +
>> + if (vtd_as_to_iommu_pasid_locked(vtd_as, &pasid)) {
>> + goto remove;
>> + }
>> +
>> + switch (pc_info->type) {
>> + case VTD_PASID_CACHE_PASIDSI:
>> + if (pc_info->pasid != pasid) {
>> + return false;
>> + }
>> + /* fall through */
>> + case VTD_PASID_CACHE_DOMSI:
>> + if (pc_info->did != did) {
>> + return false;
>> + }
>> + /* fall through */
>> + case VTD_PASID_CACHE_GLOBAL_INV:
>> + break;
>> + default:
>> + error_setg(&error_fatal, "invalid pc_info->type for flush");
>> + }
>> +
>> + /*
>> + * pasid cache invalidation may indicate a present pasid
>> + * entry to present pasid entry modification. To cover such
>> + * case, vIOMMU emulator needs to fetch latest guest pasid
>> + * entry and compares with cached pasid entry, then update
>> + * pasid cache.
>> + */
>> + ret = vtd_dev_get_pe_from_pasid(vtd_as, pasid, &pe);
>> + if (ret) {
>> + /*
>> + * No valid pasid entry in guest memory. e.g. pasid entry
>> + * was modified to be either all-zero or non-present. Either
>> + * case means existing pasid cache should be removed.
>> + */
>> + goto remove;
>> + }
>> +
>> + /* No need to update if cached pasid entry is latest */
>that comment sounds really weird to me. In case the cache entry does not
>match the one in guest mem, you update it below - at least that's what I
>understand ;-) - However you want to return false because you don't want
>g_hash_table_foreach_remove() to remove the entry.
You understand totally right😊, will rephrase to:
/* Update cached pasid entry if it's stale compared to what's in guest memory */
>> + if (!vtd_pasid_entry_compare(&pe, &pc_entry->pasid_entry)) {
>> + pc_entry->pasid_entry = pe;
>> + }
>> + return false;
>> +
>> +remove:
>> + pc_entry->valid = false;
>> +
>> + /*
>> + * Don't remove address space of PCI_NO_PASID which is created for
>PCI
>> + * sub-system.
>> + */
>> + if (vtd_as->pasid == PCI_NO_PASID) {
>> + return false;
>> + }
>> + return true;
>> +}
>> +
>> +/* Update the pasid cache in vIOMMU */
>> +static void vtd_pasid_cache_sync(IntelIOMMUState *s,
>VTDPASIDCacheInfo *pc_info)
>> +{
>> + if (!s->flts || !s->root_scalable || !s->dmar_enabled) {
>> + return;
>> + }
>> +
>> + /*
>> + * Regards to a pasid cache invalidation, e.g. a PSI.
>Regarded PASID cache invalidation?
>> + * It could be either cases of below:
>> + * a) a present pasid entry moved to non-present
>a present cache pasid entry needs to be removed
>> + * b) a present pasid entry to be a present entry
>above sounds a bit weird. A present cached pasid entry needs to be updated
>> + * c) a non-present pasid entry moved to present
>a present cached pasid entry needs to be created. As this is not handled
>in this patch I would move this to next one.
>Besides since there is another comment below I am not even sure this is
>requested or at least I would put this in a doc comment for the function
>and not within the code.
Will do.
>> --- a/hw/i386/intel_iommu_internal.h
>> +++ b/hw/i386/intel_iommu_internal.h
>> @@ -315,6 +315,7 @@ typedef enum VTDFaultReason {
>> * request while disabled */
>> VTD_FR_IR_SID_ERR = 0x26, /* Invalid Source-ID */
>>
>> + VTD_FR_RTADDR_INV_TTM = 0x31, /* Invalid TTM in RTADDR */
>> /* PASID directory entry access failure */
>> VTD_FR_PASID_DIR_ACCESS_ERR = 0x50,
>> /* The Present(P) field of pasid directory entry is 0 */
>> @@ -492,6 +493,15 @@ typedef union VTDInvDesc VTDInvDesc;
>> #define VTD_INV_DESC_PIOTLB_RSVD_VAL0
>0xfff000000000f1c0ULL
>> #define VTD_INV_DESC_PIOTLB_RSVD_VAL1 0xf80ULL
>>
>> +/* PASID-cache Invalidate Descriptor (pc_inv_dsc) fields */
>> +#define VTD_INV_DESC_PASIDC_G(x) extract64((x)->val[0], 4, 2)
>> +#define VTD_INV_DESC_PASIDC_G_DSI 0
>> +#define VTD_INV_DESC_PASIDC_G_PASID_SI 1
>> +#define VTD_INV_DESC_PASIDC_G_GLOBAL 3
>> +#define VTD_INV_DESC_PASIDC_DID(x) extract64((x)->val[0], 16,
>16)
>> +#define VTD_INV_DESC_PASIDC_PASID(x) extract64((x)->val[0], 32,
>20)
>thanks for converting to extract64 ;-)
>> +#define VTD_INV_DESC_PASIDC_RSVD_VAL0 0xfff000000000f1c0ULL
>nit: I find such mask definition error prone and difficult to review.
>combined MAKE_64BIT_MASK() would make it clearer I think
Yes, so we need to be careful to ensure its correctness when coding.
But I think a single 0xfff000000000f1c0ULL is cleaner than many MAKE_64BIT_MASK() combined.
Thanks
Zhenzhong
^ permalink raw reply [flat|nested] 55+ messages in thread
* [PATCH v3 11/20] intel_iommu: Handle PASID entry adding
2025-07-08 11:05 [PATCH v3 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
` (9 preceding siblings ...)
2025-07-08 11:05 ` [PATCH v3 10/20] intel_iommu: Handle PASID entry removing and updating Zhenzhong Duan
@ 2025-07-08 11:05 ` Zhenzhong Duan
2025-07-16 16:44 ` Eric Auger
2025-07-08 11:05 ` [PATCH v3 12/20] intel_iommu: Introduce a new pasid cache invalidation type FORCE_RESET Zhenzhong Duan
` (8 subsequent siblings)
19 siblings, 1 reply; 55+ messages in thread
From: Zhenzhong Duan @ 2025-07-08 11:05 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan, Yi Sun
When guest modifies a PASID entry, QEMU will capture the guest pasid selective
pasid cache invalidation, allocate or remove a VTDAddressSpace instance per the
invalidation reasons:
a) a present pasid entry moved to non-present
b) a present pasid entry to be a present entry
c) a non-present pasid entry moved to present
This handles c) by going through each passthrough device and each pasid. When
a new valid pasid entry is found, find or create a vtd_as and cache the pasid
entry in it.
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
hw/i386/intel_iommu.c | 170 ++++++++++++++++++++++++++++++++-
hw/i386/intel_iommu_internal.h | 2 +
2 files changed, 169 insertions(+), 3 deletions(-)
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 5bda439de6..cf2c959b60 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -826,6 +826,11 @@ static inline bool vtd_pe_type_check(IntelIOMMUState *s, VTDPASIDEntry *pe)
}
}
+static inline uint32_t vtd_sm_ce_get_pdt_entry_num(VTDContextEntry *ce)
+{
+ return 1U << (VTD_SM_CONTEXT_ENTRY_PDTS(ce) + 7);
+}
+
static inline bool vtd_pdire_present(VTDPASIDDirEntry *pdire)
{
return pdire->val & 1;
@@ -1647,9 +1652,9 @@ static gboolean vtd_find_as_by_sid_and_iommu_pasid(gpointer key, gpointer value,
}
/* Translate iommu pasid to vtd_as */
-static inline
-VTDAddressSpace *vtd_as_from_iommu_pasid_locked(IntelIOMMUState *s,
- uint16_t sid, uint32_t pasid)
+static VTDAddressSpace *vtd_as_from_iommu_pasid_locked(IntelIOMMUState *s,
+ uint16_t sid,
+ uint32_t pasid)
{
struct vtd_as_raw_key key = {
.sid = sid,
@@ -3206,6 +3211,162 @@ remove:
return true;
}
+static void vtd_sm_pasid_table_walk_one(IntelIOMMUState *s,
+ dma_addr_t pt_base,
+ int start,
+ int end,
+ VTDPASIDCacheInfo *info)
+{
+ VTDPASIDEntry pe;
+ int pasid = start;
+
+ while (pasid < end) {
+ if (!vtd_get_pe_in_pasid_leaf_table(s, pasid, pt_base, &pe)
+ && vtd_pe_present(&pe)) {
+ int bus_n = pci_bus_num(info->bus), devfn = info->devfn;
+ uint16_t sid = PCI_BUILD_BDF(bus_n, devfn);
+ VTDPASIDCacheEntry *pc_entry;
+ VTDAddressSpace *vtd_as;
+
+ vtd_iommu_lock(s);
+ /*
+ * When indexed by rid2pasid, vtd_as should have been created,
+ * e.g., by PCI subsystem. For other iommu pasid, we need to
+ * create vtd_as dynamically. The other iommu pasid is same as
+ * PCI's pasid, so it's used as input of vtd_find_add_as().
+ */
+ vtd_as = vtd_as_from_iommu_pasid_locked(s, sid, pasid);
+ vtd_iommu_unlock(s);
+ if (!vtd_as) {
+ vtd_as = vtd_find_add_as(s, info->bus, devfn, pasid);
+ }
+
+ if ((info->type == VTD_PASID_CACHE_DOMSI ||
+ info->type == VTD_PASID_CACHE_PASIDSI) &&
+ (info->did != VTD_SM_PASID_ENTRY_DID(&pe))) {
+ /*
+ * VTD_PASID_CACHE_DOMSI and VTD_PASID_CACHE_PASIDSI
+ * requires domain id check. If domain id check fail,
+ * go to next pasid.
+ */
+ pasid++;
+ continue;
+ }
+
+ pc_entry = &vtd_as->pasid_cache_entry;
+ /*
+ * pasic cache update and clear are handled in
+ * vtd_flush_pasid_locked(), only care new pasid entry here.
+ */
+ if (!pc_entry->valid) {
+ pc_entry->pasid_entry = pe;
+ pc_entry->valid = true;
+ }
+ }
+ pasid++;
+ }
+}
+
+/*
+ * In VT-d scalable mode translation, PASID dir + PASID table is used.
+ * This function aims at looping over a range of PASIDs in the given
+ * two level table to identify the pasid config in guest.
+ */
+static void vtd_sm_pasid_table_walk(IntelIOMMUState *s,
+ dma_addr_t pdt_base,
+ int start, int end,
+ VTDPASIDCacheInfo *info)
+{
+ VTDPASIDDirEntry pdire;
+ int pasid = start;
+ int pasid_next;
+ dma_addr_t pt_base;
+
+ while (pasid < end) {
+ pasid_next =
+ (pasid + VTD_PASID_TBL_ENTRY_NUM) & ~(VTD_PASID_TBL_ENTRY_NUM - 1);
+ pasid_next = pasid_next < end ? pasid_next : end;
+
+ if (!vtd_get_pdire_from_pdir_table(pdt_base, pasid, &pdire)
+ && vtd_pdire_present(&pdire)) {
+ pt_base = pdire.val & VTD_PASID_TABLE_BASE_ADDR_MASK;
+ vtd_sm_pasid_table_walk_one(s, pt_base, pasid, pasid_next, info);
+ }
+ pasid = pasid_next;
+ }
+}
+
+static void vtd_replay_pasid_bind_for_dev(IntelIOMMUState *s,
+ int start, int end,
+ VTDPASIDCacheInfo *info)
+{
+ VTDContextEntry ce;
+
+ if (!vtd_dev_to_context_entry(s, pci_bus_num(info->bus), info->devfn,
+ &ce)) {
+ uint32_t max_pasid;
+
+ max_pasid = vtd_sm_ce_get_pdt_entry_num(&ce) * VTD_PASID_TBL_ENTRY_NUM;
+ if (end > max_pasid) {
+ end = max_pasid;
+ }
+ vtd_sm_pasid_table_walk(s,
+ VTD_CE_GET_PASID_DIR_TABLE(&ce),
+ start,
+ end,
+ info);
+ }
+}
+
+/*
+ * This function replays the guest pasid bindings by walking the two level
+ * guest PASID table. For each valid pasid entry, it finds or creates a
+ * vtd_as and caches pasid entry in vtd_as.
+ */
+static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
+ VTDPASIDCacheInfo *pc_info)
+{
+ /*
+ * Currently only Requests-without-PASID is supported, as vIOMMU doesn't
+ * support RPS(RID-PASID Support), pasid scope is fixed to [0, 1).
+ */
+ int start = 0, end = 1;
+ VTDHostIOMMUDevice *vtd_hiod;
+ VTDPASIDCacheInfo walk_info;
+ GHashTableIter as_it;
+
+ switch (pc_info->type) {
+ case VTD_PASID_CACHE_PASIDSI:
+ start = pc_info->pasid;
+ end = pc_info->pasid + 1;
+ /* fall through */
+ case VTD_PASID_CACHE_DOMSI:
+ case VTD_PASID_CACHE_GLOBAL_INV:
+ /* loop all assigned devices */
+ break;
+ default:
+ error_setg(&error_fatal, "invalid pc_info->type for replay");
+ }
+
+ /*
+ * In this replay, one only needs to care about the devices which are
+ * backed by host IOMMU. Those devices have a corresponding vtd_hiod
+ * in s->vtd_host_iommu_dev. For devices not backed by host IOMMU, it
+ * is not necessary to replay the bindings since their cache could be
+ * re-created in the future DMA address translation.
+ *
+ * VTD translation callback never accesses vtd_hiod and its corresponding
+ * cached pasid entry, so no iommu lock needed here.
+ */
+ walk_info = *pc_info;
+ g_hash_table_iter_init(&as_it, s->vtd_host_iommu_dev);
+ while (g_hash_table_iter_next(&as_it, NULL, (void **)&vtd_hiod)) {
+ walk_info.bus = vtd_hiod->bus;
+ walk_info.devfn = vtd_hiod->devfn;
+ vtd_replay_pasid_bind_for_dev(s, start, end, &walk_info);
+ }
+}
+
/* Update the pasid cache in vIOMMU */
static void vtd_pasid_cache_sync(IntelIOMMUState *s, VTDPASIDCacheInfo *pc_info)
{
@@ -3228,6 +3389,9 @@ static void vtd_pasid_cache_sync(IntelIOMMUState *s, VTDPASIDCacheInfo *pc_info)
g_hash_table_foreach_remove(s->vtd_address_spaces, vtd_flush_pasid_locked,
pc_info);
vtd_iommu_unlock(s);
+
+ /* c): loop all passthrough device for new pasid entries */
+ vtd_replay_guest_pasid_bindings(s, pc_info);
}
static bool vtd_process_pasid_desc(IntelIOMMUState *s,
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 87059d26aa..621e1f6947 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -558,6 +558,7 @@ typedef struct VTDRootEntry VTDRootEntry;
#define VTD_CTX_ENTRY_LEGACY_SIZE 16
#define VTD_CTX_ENTRY_SCALABLE_SIZE 32
+#define VTD_SM_CONTEXT_ENTRY_PDTS(x) extract64((x)->val[0], 9, 3)
#define VTD_SM_CONTEXT_ENTRY_RID2PASID_MASK 0xfffff
#define VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw) (0x1e0ULL | ~VTD_HAW_MASK(aw))
#define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1 0xffffffffffe00000ULL
@@ -588,6 +589,7 @@ typedef struct VTDPASIDCacheInfo {
#define VTD_PASID_TABLE_BITS_MASK (0x3fULL)
#define VTD_PASID_TABLE_INDEX(pasid) ((pasid) & VTD_PASID_TABLE_BITS_MASK)
#define VTD_PASID_ENTRY_FPD (1ULL << 1) /* Fault Processing Disable */
+#define VTD_PASID_TBL_ENTRY_NUM (1ULL << 6)
/* PASID Granular Translation Type Mask */
#define VTD_PASID_ENTRY_P 1ULL
--
2.47.1
^ permalink raw reply related [flat|nested] 55+ messages in thread
* Re: [PATCH v3 11/20] intel_iommu: Handle PASID entry adding
2025-07-08 11:05 ` [PATCH v3 11/20] intel_iommu: Handle PASID entry adding Zhenzhong Duan
@ 2025-07-16 16:44 ` Eric Auger
2025-07-17 7:15 ` Duan, Zhenzhong
0 siblings, 1 reply; 55+ messages in thread
From: Eric Auger @ 2025-07-16 16:44 UTC (permalink / raw)
To: Zhenzhong Duan, qemu-devel
Cc: alex.williamson, clg, mst, jasowang, peterx, ddutile, jgg,
nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng, Yi Sun
Hi Zhenzhong,
On 7/8/25 1:05 PM, Zhenzhong Duan wrote:
> When guest modifies a PASID entry, QEMU will capture the guest pasid selective
> pasid cache invalidation, allocate or remove a VTDAddressSpace instance per the
> invalidation reasons:
commit msg: intel_iommu: Handle PASID entry addition?
Same for previous patch, ie. use nouns
>
> a) a present pasid entry moved to non-present
> b) a present pasid entry to be a present entry
> c) a non-present pasid entry moved to present
just focus on c to avoid extra noise?
>
> This handles c) by going through each passthrough device and each pasid. When
> a new valid pasid entry is founded, find or create a vtd_as and cache pasid
is found, identify an existing vtd_as or create a new one and update its
corresponding cached pasid entry.
> entry in it.
>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
> hw/i386/intel_iommu.c | 170 ++++++++++++++++++++++++++++++++-
> hw/i386/intel_iommu_internal.h | 2 +
> 2 files changed, 169 insertions(+), 3 deletions(-)
>
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 5bda439de6..cf2c959b60 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -826,6 +826,11 @@ static inline bool vtd_pe_type_check(IntelIOMMUState *s, VTDPASIDEntry *pe)
> }
> }
>
> +static inline uint32_t vtd_sm_ce_get_pdt_entry_num(VTDContextEntry *ce)
> +{
> + return 1U << (VTD_SM_CONTEXT_ENTRY_PDTS(ce) + 7);
> +}
> +
> static inline bool vtd_pdire_present(VTDPASIDDirEntry *pdire)
> {
> return pdire->val & 1;
> @@ -1647,9 +1652,9 @@ static gboolean vtd_find_as_by_sid_and_iommu_pasid(gpointer key, gpointer value,
> }
>
> /* Translate iommu pasid to vtd_as */
> -static inline
> -VTDAddressSpace *vtd_as_from_iommu_pasid_locked(IntelIOMMUState *s,
> - uint16_t sid, uint32_t pasid)
> +static VTDAddressSpace *vtd_as_from_iommu_pasid_locked(IntelIOMMUState *s,
> + uint16_t sid,
> + uint32_t pasid)
> {
> struct vtd_as_raw_key key = {
> .sid = sid,
> @@ -3206,6 +3211,162 @@ remove:
> return true;
> }
>
I think this would deserve a doc comment explaining it retrieves/creates
vtd_as and associated cached PE for PASID range within [start, end] that
matches @info type/did. Also you may document that the "_one" means it
walks a single SM page table between start/end (if my understanding is
correct)
> +static void vtd_sm_pasid_table_walk_one(IntelIOMMUState *s,
> + dma_addr_t pt_base,
> + int start,
> + int end,
> + VTDPASIDCacheInfo *info)
> +{
> + VTDPASIDEntry pe;
> + int pasid = start;
> +
> + while (pasid < end) {
> + if (!vtd_get_pe_in_pasid_leaf_table(s, pasid, pt_base, &pe)
> + && vtd_pe_present(&pe)) {
> + int bus_n = pci_bus_num(info->bus), devfn = info->devfn;
> + uint16_t sid = PCI_BUILD_BDF(bus_n, devfn);
> + VTDPASIDCacheEntry *pc_entry;
> + VTDAddressSpace *vtd_as;
> +
> + vtd_iommu_lock(s);
> + /*
> + * When indexed by rid2pasid, vtd_as should have been created,
> + * e.g., by PCI subsystem. For other iommu pasid, we need to
I thought iommu pasid was rid2pasid, as opposed to regular PCI pasids.
> + * create vtd_as dynamically. The other iommu pasid is same as
> + * PCI's pasid, so it's used as input of vtd_find_add_as().
> + */
> + vtd_as = vtd_as_from_iommu_pasid_locked(s, sid, pasid);
> + vtd_iommu_unlock(s);
> + if (!vtd_as) {
> + vtd_as = vtd_find_add_as(s, info->bus, devfn, pasid);
> + }
> +
> + if ((info->type == VTD_PASID_CACHE_DOMSI ||
> + info->type == VTD_PASID_CACHE_PASIDSI) &&
> + (info->did != VTD_SM_PASID_ENTRY_DID(&pe))) {
> + /*
> + * VTD_PASID_CACHE_DOMSI and VTD_PASID_CACHE_PASIDSI
> + * requires domain id check. If domain id check fail,
require
> + * go to next pasid.
> + */
> + pasid++;
> + continue;
> + }
> +
> + pc_entry = &vtd_as->pasid_cache_entry;
> + /*
> + * pasic cache update and clear are handled in
> + * vtd_flush_pasid_locked(), only care new pasid entry here.
> + */
> + if (!pc_entry->valid) {
> + pc_entry->pasid_entry = pe;
> + pc_entry->valid = true;
> + }
> + }
> + pasid++;
> + }
> +}
> +
> +/*
> + * In VT-d scalable mode translation, PASID dir + PASID table is used.
> + * This function aims at looping over a range of PASIDs in the given
> + * two level table to identify the pasid config in guest.
> + */
> +static void vtd_sm_pasid_table_walk(IntelIOMMUState *s,
> + dma_addr_t pdt_base,
> + int start, int end,
> + VTDPASIDCacheInfo *info)
> +{
> + VTDPASIDDirEntry pdire;
> + int pasid = start;
> + int pasid_next;
> + dma_addr_t pt_base;
> +
> + while (pasid < end) {
> + pasid_next =
> + (pasid + VTD_PASID_TBL_ENTRY_NUM) & ~(VTD_PASID_TBL_ENTRY_NUM - 1);
> + pasid_next = pasid_next < end ? pasid_next : end;
> +
> + if (!vtd_get_pdire_from_pdir_table(pdt_base, pasid, &pdire)
> + && vtd_pdire_present(&pdire)) {
> + pt_base = pdire.val & VTD_PASID_TABLE_BASE_ADDR_MASK;
> + vtd_sm_pasid_table_walk_one(s, pt_base, pasid, pasid_next, info);
> + }
> + pasid = pasid_next;
> + }
> +}
> +
> +static void vtd_replay_pasid_bind_for_dev(IntelIOMMUState *s,
> + int start, int end,
> + VTDPASIDCacheInfo *info)
> +{
> + VTDContextEntry ce;
> +
> + if (!vtd_dev_to_context_entry(s, pci_bus_num(info->bus), info->devfn,
> + &ce)) {
> + uint32_t max_pasid;
> +
> + max_pasid = vtd_sm_ce_get_pdt_entry_num(&ce) * VTD_PASID_TBL_ENTRY_NUM;
> + if (end > max_pasid) {
> + end = max_pasid;
> + }
> + vtd_sm_pasid_table_walk(s,
> + VTD_CE_GET_PASID_DIR_TABLE(&ce),
> + start,
> + end,
> + info);
> + }
> +}
> +
> +/*
> + * This function replays the guest pasid bindings by walking the two level
> + * guest PASID table. For each valid pasid entry, it finds or creates a
> + * vtd_as and caches pasid entry in vtd_as.
> + */
> +static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
> + VTDPASIDCacheInfo *pc_info)
> +{
> + /*
> + * Currently only Requests-without-PASID is supported, as vIOMMU doesn't
> + * support RPS(RID-PASID Support), pasid scope is fixed to [0, 1).
> + */
> + int start = 0, end = 1;
> + VTDHostIOMMUDevice *vtd_hiod;
> + VTDPASIDCacheInfo walk_info;
> + GHashTableIter as_it;
> +
> + switch (pc_info->type) {
> + case VTD_PASID_CACHE_PASIDSI:
> + start = pc_info->pasid;
> + end = pc_info->pasid + 1;
> + /* fall through */
> + case VTD_PASID_CACHE_DOMSI:
> + case VTD_PASID_CACHE_GLOBAL_INV:
> + /* loop all assigned devices */
> + break;
> + default:
> + error_setg(&error_fatal, "invalid pc_info->type for replay");
> + }
> +
> + /*
> + * In this replay, one only needs to care about the devices which are
> + * backed by host IOMMU. Those devices have a corresponding vtd_hiod
> + * in s->vtd_host_iommu_dev. For devices not backed by host IOMMU, it
> + * is not necessary to replay the bindings since their cache could be
> + * re-created in the future DMA address translation.
> + *
> + * VTD translation callback never accesses vtd_hiod and its corresponding
> + * cached pasid entry, so no iommu lock needed here.
> + */
> + walk_info = *pc_info;
> + g_hash_table_iter_init(&as_it, s->vtd_host_iommu_dev);
> + while (g_hash_table_iter_next(&as_it, NULL, (void **)&vtd_hiod)) {
> + walk_info.bus = vtd_hiod->bus;
> + walk_info.devfn = vtd_hiod->devfn;
> + vtd_replay_pasid_bind_for_dev(s, start, end, &walk_info);
> + }
> +}
> +
> /* Update the pasid cache in vIOMMU */
> static void vtd_pasid_cache_sync(IntelIOMMUState *s, VTDPASIDCacheInfo *pc_info)
> {
> @@ -3228,6 +3389,9 @@ static void vtd_pasid_cache_sync(IntelIOMMUState *s, VTDPASIDCacheInfo *pc_info)
> g_hash_table_foreach_remove(s->vtd_address_spaces, vtd_flush_pasid_locked,
> pc_info);
> vtd_iommu_unlock(s);
> +
> + /* c): loop all passthrough device for new pasid entries */
> + vtd_replay_guest_pasid_bindings(s, pc_info);
> }
>
> static bool vtd_process_pasid_desc(IntelIOMMUState *s,
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> index 87059d26aa..621e1f6947 100644
> --- a/hw/i386/intel_iommu_internal.h
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -558,6 +558,7 @@ typedef struct VTDRootEntry VTDRootEntry;
> #define VTD_CTX_ENTRY_LEGACY_SIZE 16
> #define VTD_CTX_ENTRY_SCALABLE_SIZE 32
>
> +#define VTD_SM_CONTEXT_ENTRY_PDTS(x) extract64((x)->val[0], 9, 3)
> #define VTD_SM_CONTEXT_ENTRY_RID2PASID_MASK 0xfffff
> #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw) (0x1e0ULL | ~VTD_HAW_MASK(aw))
> #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1 0xffffffffffe00000ULL
> @@ -588,6 +589,7 @@ typedef struct VTDPASIDCacheInfo {
> #define VTD_PASID_TABLE_BITS_MASK (0x3fULL)
> #define VTD_PASID_TABLE_INDEX(pasid) ((pasid) & VTD_PASID_TABLE_BITS_MASK)
> #define VTD_PASID_ENTRY_FPD (1ULL << 1) /* Fault Processing Disable */
> +#define VTD_PASID_TBL_ENTRY_NUM (1ULL << 6)
>
> /* PASID Granular Translation Type Mask */
> #define VTD_PASID_ENTRY_P 1ULL
Thanks
Eric
^ permalink raw reply [flat|nested] 55+ messages in thread
* RE: [PATCH v3 11/20] intel_iommu: Handle PASID entry adding
2025-07-16 16:44 ` Eric Auger
@ 2025-07-17 7:15 ` Duan, Zhenzhong
0 siblings, 0 replies; 55+ messages in thread
From: Duan, Zhenzhong @ 2025-07-17 7:15 UTC (permalink / raw)
To: eric.auger@redhat.com, qemu-devel@nongnu.org
Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
jgg@nvidia.com, nicolinc@nvidia.com,
shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
Peng, Chao P, Yi Sun
Hi Eric,
>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: Re: [PATCH v3 11/20] intel_iommu: Handle PASID entry adding
>
>Hi Zhenzhong,
>
>On 7/8/25 1:05 PM, Zhenzhong Duan wrote:
>> When guest modifies a PASID entry, QEMU will capture the guest pasid
>selective
>> pasid cache invalidation, allocate or remove a VTDAddressSpace instance
>per the
>> invalidation reasons:
>
>commit msg: intel_iommu: Handle PASID entry addition?
>Same for previous patch, ie. use nouns
>>
>> a) a present pasid entry moved to non-present
>> b) a present pasid entry to be a present entry
>> c) a non-present pasid entry moved to present
>just focus on c to avoid extra noise?
>>
>> This handles c) by going through each passthrough device and each pasid.
>When
>> a new valid pasid entry is founded, find or create a vtd_as and cache pasid
>is found, identify an existing vtd_as or create a new one and update its
>corresponding cached pasid entry.
Will do all above.
>> entry in it.
>>
>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>> hw/i386/intel_iommu.c | 170
>++++++++++++++++++++++++++++++++-
>> hw/i386/intel_iommu_internal.h | 2 +
>> 2 files changed, 169 insertions(+), 3 deletions(-)
>>
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index 5bda439de6..cf2c959b60 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -826,6 +826,11 @@ static inline bool
>vtd_pe_type_check(IntelIOMMUState *s, VTDPASIDEntry *pe)
>> }
>> }
>>
>> +static inline uint32_t vtd_sm_ce_get_pdt_entry_num(VTDContextEntry
>*ce)
>> +{
>> + return 1U << (VTD_SM_CONTEXT_ENTRY_PDTS(ce) + 7);
>> +}
>> +
>> static inline bool vtd_pdire_present(VTDPASIDDirEntry *pdire)
>> {
>> return pdire->val & 1;
>> @@ -1647,9 +1652,9 @@ static gboolean
>vtd_find_as_by_sid_and_iommu_pasid(gpointer key, gpointer value,
>> }
>>
>> /* Translate iommu pasid to vtd_as */
>> -static inline
>> -VTDAddressSpace *vtd_as_from_iommu_pasid_locked(IntelIOMMUState
>*s,
>> - uint16_t sid,
>uint32_t pasid)
>> +static VTDAddressSpace
>*vtd_as_from_iommu_pasid_locked(IntelIOMMUState *s,
>> +
>uint16_t sid,
>> +
>uint32_t pasid)
>> {
>> struct vtd_as_raw_key key = {
>> .sid = sid,
>> @@ -3206,6 +3211,162 @@ remove:
>> return true;
>> }
>>
>I think this would deserve a doc comment explaining it retrieves/creates
>vtd_as and associated cached PE for PASID range within [start, end] that
>matches @info type/did. Also you may document that the "_one" means it
>walks a single SM page table between start/end (if my understanding is
>correct)
Will do.
>> +static void vtd_sm_pasid_table_walk_one(IntelIOMMUState *s,
>> + dma_addr_t pt_base,
>> + int start,
>> + int end,
>> + VTDPASIDCacheInfo
>*info)
>> +{
>> + VTDPASIDEntry pe;
>> + int pasid = start;
>> +
>> + while (pasid < end) {
>> + if (!vtd_get_pe_in_pasid_leaf_table(s, pasid, pt_base, &pe)
>> + && vtd_pe_present(&pe)) {
>> + int bus_n = pci_bus_num(info->bus), devfn = info->devfn;
>> + uint16_t sid = PCI_BUILD_BDF(bus_n, devfn);
>> + VTDPASIDCacheEntry *pc_entry;
>> + VTDAddressSpace *vtd_as;
>> +
>> + vtd_iommu_lock(s);
>> + /*
>> + * When indexed by rid2pasid, vtd_as should have been
>created,
>> + * e.g., by PCI subsystem. For other iommu pasid, we need
>to
>I thought iommu pasid was rid2pasid, as opposed to regular PCI pasids.
Yes, for PCI pasid, it's PCI_NO_PASID.
Thanks
Zhenzhong
^ permalink raw reply [flat|nested] 55+ messages in thread
* [PATCH v3 12/20] intel_iommu: Introduce a new pasid cache invalidation type FORCE_RESET
2025-07-08 11:05 [PATCH v3 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
` (10 preceding siblings ...)
2025-07-08 11:05 ` [PATCH v3 11/20] intel_iommu: Handle PASID entry adding Zhenzhong Duan
@ 2025-07-08 11:05 ` Zhenzhong Duan
2025-07-08 11:05 ` [PATCH v3 13/20] intel_iommu: Stick to system MR for IOMMUFD backed host device when x-flts=on Zhenzhong Duan
` (7 subsequent siblings)
19 siblings, 0 replies; 55+ messages in thread
From: Zhenzhong Duan @ 2025-07-08 11:05 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan, Yi Sun
FORCE_RESET is different from GLOBAL_INV, which updates the pasid cache only if
the underlying pasid entry is still valid; FORCE_RESET drops all the pasid caches.
FORCE_RESET isn't a VTD spec defined invalidation type for pasid cache,
only used internally in system level reset.
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
hw/i386/intel_iommu.c | 25 +++++++++++++++++++++++++
hw/i386/intel_iommu_internal.h | 9 +++++++++
hw/i386/trace-events | 1 +
3 files changed, 35 insertions(+)
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index cf2c959b60..cf263498db 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -87,6 +87,8 @@ struct vtd_iotlb_key {
static void vtd_address_space_refresh_all(IntelIOMMUState *s);
static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
+static void vtd_pasid_cache_reset_locked(IntelIOMMUState *s);
+
static void vtd_panic_require_caching_mode(void)
{
error_report("We need to set caching-mode=on for intel-iommu to enable "
@@ -391,6 +393,7 @@ static void vtd_reset_caches(IntelIOMMUState *s)
vtd_iommu_lock(s);
vtd_reset_iotlb_locked(s);
vtd_reset_context_cache_locked(s);
+ vtd_pasid_cache_reset_locked(s);
vtd_iommu_unlock(s);
}
@@ -3171,6 +3174,8 @@ static gboolean vtd_flush_pasid_locked(gpointer key, gpointer value,
/* fall through */
case VTD_PASID_CACHE_GLOBAL_INV:
break;
+ case VTD_PASID_CACHE_FORCE_RESET:
+ goto remove;
default:
error_setg(&error_fatal, "invalid pc_info->type for flush");
}
@@ -3211,6 +3216,23 @@ remove:
return true;
}
+static void vtd_pasid_cache_reset_locked(IntelIOMMUState *s)
+{
+ VTDPASIDCacheInfo pc_info;
+
+ trace_vtd_pasid_cache_reset();
+
+ pc_info.type = VTD_PASID_CACHE_FORCE_RESET;
+
+ /*
+ * Reset pasid cache is a big hammer, so use g_hash_table_foreach_remove
+ * which will free all vtd_as instances except those created for PCI
+ * sub-system.
+ */
+ g_hash_table_foreach_remove(s->vtd_address_spaces,
+ vtd_flush_pasid_locked, &pc_info);
+}
+
static void vtd_sm_pasid_table_walk_one(IntelIOMMUState *s,
dma_addr_t pt_base,
int start,
@@ -3344,6 +3366,9 @@ static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
case VTD_PASID_CACHE_GLOBAL_INV:
/* loop all assigned devices */
break;
+ case VTD_PASID_CACHE_FORCE_RESET:
+ /* For force reset, no need to go further replay */
+ return;
default:
error_setg(&error_fatal, "invalid pc_info->type for replay");
}
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 621e1f6947..887f93bac9 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -568,6 +568,15 @@ typedef enum VTDPCInvType {
VTD_PASID_CACHE_DOMSI = VTD_INV_DESC_PASIDC_G_DSI,
VTD_PASID_CACHE_PASIDSI = VTD_INV_DESC_PASIDC_G_PASID_SI,
VTD_PASID_CACHE_GLOBAL_INV = VTD_INV_DESC_PASIDC_G_GLOBAL,
+
+ /*
+ * Internally used PASID cache invalidation type starts here,
+ * 0x10 is large enough as invalidation type in pc_inv_desc
+ * is 2bits in size.
+ */
+
+ /* Reset all PASID cache entries, used in system level reset */
+ VTD_PASID_CACHE_FORCE_RESET = 0x10,
} VTDPCInvType;
typedef struct VTDPASIDCacheInfo {
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index ae5bbfcdc0..c8a936eb46 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -24,6 +24,7 @@ vtd_inv_qi_head(uint16_t head) "read head %d"
vtd_inv_qi_tail(uint16_t head) "write tail %d"
vtd_inv_qi_fetch(void) ""
vtd_context_cache_reset(void) ""
+vtd_pasid_cache_reset(void) ""
vtd_pasid_cache_gsi(void) ""
vtd_pasid_cache_dsi(uint16_t domain) "Domain selective PC invalidation domain 0x%"PRIx16
vtd_pasid_cache_psi(uint16_t domain, uint32_t pasid) "PASID selective PC invalidation domain 0x%"PRIx16" pasid 0x%"PRIx32
--
2.47.1
^ permalink raw reply related [flat|nested] 55+ messages in thread
* [PATCH v3 13/20] intel_iommu: Stick to system MR for IOMMUFD backed host device when x-flts=on
2025-07-08 11:05 [PATCH v3 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
` (11 preceding siblings ...)
2025-07-08 11:05 ` [PATCH v3 12/20] intel_iommu: Introduce a new pasid cache invalidation type FORCE_RESET Zhenzhong Duan
@ 2025-07-08 11:05 ` Zhenzhong Duan
2025-07-08 11:05 ` [PATCH v3 14/20] intel_iommu: Bind/unbind guest page table to host Zhenzhong Duan
` (6 subsequent siblings)
19 siblings, 0 replies; 55+ messages in thread
From: Zhenzhong Duan @ 2025-07-08 11:05 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan
When the guest is in scalable mode and x-flts=on, we stick to the system MR for an
IOMMUFD backed host device. Then its default hwpt contains GPA->HPA mappings which
are used directly if PGTT=PT and used as nested parent if PGTT=FLT. Otherwise
fall back to the original processing.
Suggested-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
hw/i386/intel_iommu.c | 34 ++++++++++++++++++++++++++++++++++
1 file changed, 34 insertions(+)
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index cf263498db..030862fb2f 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -1773,6 +1773,28 @@ static bool vtd_dev_pt_enabled(IntelIOMMUState *s, VTDContextEntry *ce,
}
+static VTDHostIOMMUDevice *vtd_find_hiod_iommufd(IntelIOMMUState *s,
+ VTDAddressSpace *as)
+{
+ struct vtd_as_key key = {
+ .bus = as->bus,
+ .devfn = as->devfn,
+ };
+ VTDHostIOMMUDevice *vtd_hiod = g_hash_table_lookup(s->vtd_host_iommu_dev,
+ &key);
+
+ if (vtd_hiod && vtd_hiod->hiod &&
+ object_dynamic_cast(OBJECT(vtd_hiod->hiod),
+ TYPE_HOST_IOMMU_DEVICE_IOMMUFD)) {
+ return vtd_hiod;
+ }
+ return NULL;
+}
+
+/*
+ * vtd_switch_address_space() calls vtd_as_pt_enabled() to determine which
+ * MR to switch to. Switch to system MR if return true, iommu MR otherwise.
+ */
static bool vtd_as_pt_enabled(VTDAddressSpace *as)
{
IntelIOMMUState *s;
@@ -1781,6 +1803,18 @@ static bool vtd_as_pt_enabled(VTDAddressSpace *as)
assert(as);
s = as->iommu_state;
+
+ /*
+ * When guest in scalable mode and x-flts=on, we stick to system MR
+ * for IOMMUFD backed host device. Then its default hwpt contains
+ * GPA->HPA mappings which is used directly if PGTT=PT and used as
+ * nested parent if PGTT=FLT. Otherwise fallback to original
+ * processing.
+ */
+ if (s->root_scalable && s->flts && vtd_find_hiod_iommufd(s, as)) {
+ return true;
+ }
+
if (vtd_dev_to_context_entry(s, pci_bus_num(as->bus), as->devfn,
&ce)) {
/*
--
2.47.1
^ permalink raw reply related [flat|nested] 55+ messages in thread
* [PATCH v3 14/20] intel_iommu: Bind/unbind guest page table to host
2025-07-08 11:05 [PATCH v3 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
` (12 preceding siblings ...)
2025-07-08 11:05 ` [PATCH v3 13/20] intel_iommu: Stick to system MR for IOMMUFD backed host device when x-flts=on Zhenzhong Duan
@ 2025-07-08 11:05 ` Zhenzhong Duan
2025-07-08 11:05 ` [PATCH v3 15/20] intel_iommu: Replay pasid bindings after context cache invalidation Zhenzhong Duan
` (5 subsequent siblings)
19 siblings, 0 replies; 55+ messages in thread
From: Zhenzhong Duan @ 2025-07-08 11:05 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan, Yi Sun
This captures the guest PASID table entry modifications and
propagates the changes to host to attach a hwpt with type determined
per guest IOMMU mode and PGTT configuration.
When PGTT is Pass-through(100b), the hwpt on host side is a stage-2
page table(GPA->HPA). When PGTT is First-stage Translation only(001b),
vIOMMU reuse hwpt(GPA->HPA) provided by VFIO as nested parent to
construct nested page table.
When guest decides to use legacy mode then vIOMMU switches the MRs of
the device's AS, hence the IOAS created by VFIO container would be
switched to using the IOMMU_NOTIFIER_IOTLB_EVENTS since the MR is
switched to IOMMU MR. So it is able to support shadowing the guest IO
page table.
Co-Authored-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
hw/i386/intel_iommu.c | 221 ++++++++++++++++++++++++++++++++-
hw/i386/intel_iommu_internal.h | 14 ++-
hw/i386/trace-events | 3 +
include/hw/i386/intel_iommu.h | 1 +
4 files changed, 233 insertions(+), 6 deletions(-)
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 030862fb2f..c35673eb58 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -20,6 +20,7 @@
*/
#include "qemu/osdep.h"
+#include CONFIG_DEVICES /* CONFIG_IOMMUFD */
#include "qemu/error-report.h"
#include "qemu/main-loop.h"
#include "qapi/error.h"
@@ -41,6 +42,9 @@
#include "migration/vmstate.h"
#include "trace.h"
#include "system/iommufd.h"
+#ifdef CONFIG_IOMMUFD
+#include <linux/iommufd.h>
+#endif
/* context entry operations */
#define VTD_CE_GET_RID2PASID(ce) \
@@ -50,10 +54,9 @@
/* pe operations */
#define VTD_PE_GET_TYPE(pe) ((pe)->val[0] & VTD_SM_PASID_ENTRY_PGTT)
-#define VTD_PE_GET_FL_LEVEL(pe) \
- (4 + (((pe)->val[2] >> 2) & VTD_SM_PASID_ENTRY_FLPM))
#define VTD_PE_GET_SL_LEVEL(pe) \
(2 + (((pe)->val[0] >> 2) & VTD_SM_PASID_ENTRY_AW))
+#define VTD_PE_GET_FL_LEVEL(pe) (VTD_SM_PASID_ENTRY_FSPM(pe) + 4)
/*
* PCI bus number (or SID) is not reliable since the device is usaully
@@ -834,6 +837,31 @@ static inline uint32_t vtd_sm_ce_get_pdt_entry_num(VTDContextEntry *ce)
return 1U << (VTD_SM_CONTEXT_ENTRY_PDTS(ce) + 7);
}
+static inline dma_addr_t vtd_pe_get_flpt_base(VTDPASIDEntry *pe)
+{
+ return pe->val[2] & VTD_SM_PASID_ENTRY_FSPTPTR;
+}
+
+/*
+ * Stage-1 IOVA address width: 48 bits for 4-level paging(FSPM=00)
+ * 57 bits for 5-level paging(FSPM=01)
+ */
+static inline uint32_t vtd_pe_get_fl_aw(VTDPASIDEntry *pe)
+{
+ return 48 + VTD_SM_PASID_ENTRY_FSPM(pe) * 9;
+}
+
+static inline bool vtd_pe_pgtt_is_pt(VTDPASIDEntry *pe)
+{
+ return (VTD_PE_GET_TYPE(pe) == VTD_SM_PASID_ENTRY_PT);
+}
+
+/* check if pgtt is first stage translation */
+static inline bool vtd_pe_pgtt_is_flt(VTDPASIDEntry *pe)
+{
+ return (VTD_PE_GET_TYPE(pe) == VTD_SM_PASID_ENTRY_FLT);
+}
+
static inline bool vtd_pdire_present(VTDPASIDDirEntry *pdire)
{
return pdire->val & 1;
@@ -1131,7 +1159,7 @@ static dma_addr_t vtd_get_iova_pgtbl_base(IntelIOMMUState *s,
if (s->root_scalable) {
vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
if (s->flts) {
- return pe.val[2] & VTD_SM_PASID_ENTRY_FLPTPTR;
+ return pe.val[2] & VTD_SM_PASID_ENTRY_FSPTPTR;
} else {
return pe.val[0] & VTD_SM_PASID_ENTRY_SLPTPTR;
}
@@ -1766,7 +1794,7 @@ static bool vtd_dev_pt_enabled(IntelIOMMUState *s, VTDContextEntry *ce,
*/
return false;
}
- return (VTD_PE_GET_TYPE(&pe) == VTD_SM_PASID_ENTRY_PT);
+ return vtd_pe_pgtt_is_pt(&pe);
}
return (vtd_ce_get_type(ce) == VTD_CONTEXT_TT_PASS_THROUGH);
@@ -2428,6 +2456,178 @@ static void vtd_context_global_invalidate(IntelIOMMUState *s)
vtd_iommu_replay_all(s);
}
+#ifdef CONFIG_IOMMUFD
+static void vtd_init_s1_hwpt_data(struct iommu_hwpt_vtd_s1 *vtd,
+ VTDPASIDEntry *pe)
+{
+ memset(vtd, 0, sizeof(*vtd));
+
+ vtd->flags = (VTD_SM_PASID_ENTRY_SRE_BIT(pe) ? IOMMU_VTD_S1_SRE : 0) |
+ (VTD_SM_PASID_ENTRY_WPE_BIT(pe) ? IOMMU_VTD_S1_WPE : 0) |
+ (VTD_SM_PASID_ENTRY_EAFE_BIT(pe) ? IOMMU_VTD_S1_EAFE : 0);
+ vtd->addr_width = vtd_pe_get_fl_aw(pe);
+ vtd->pgtbl_addr = (uint64_t)vtd_pe_get_flpt_base(pe);
+}
+
+static int vtd_create_s1_hwpt(HostIOMMUDeviceIOMMUFD *idev,
+ VTDPASIDEntry *pe, uint32_t *s1_hwpt,
+ Error **errp)
+{
+ struct iommu_hwpt_vtd_s1 vtd;
+
+ vtd_init_s1_hwpt_data(&vtd, pe);
+
+ return !iommufd_backend_alloc_hwpt(idev->iommufd, idev->devid,
+ idev->hwpt_id, 0, IOMMU_HWPT_DATA_VTD_S1,
+ sizeof(vtd), &vtd, s1_hwpt, errp);
+}
+
+static void vtd_destroy_s1_hwpt(HostIOMMUDeviceIOMMUFD *idev,
+ VTDAddressSpace *vtd_as)
+{
+ if (!vtd_as->s1_hwpt) {
+ return;
+ }
+ iommufd_backend_free_id(idev->iommufd, vtd_as->s1_hwpt);
+ vtd_as->s1_hwpt = 0;
+}
+
+static int vtd_device_attach_iommufd(VTDHostIOMMUDevice *vtd_hiod,
+ VTDAddressSpace *vtd_as, Error **errp)
+{
+ HostIOMMUDeviceIOMMUFD *idev = HOST_IOMMU_DEVICE_IOMMUFD(vtd_hiod->hiod);
+ VTDPASIDEntry *pe = &vtd_as->pasid_cache_entry.pasid_entry;
+ uint32_t hwpt_id;
+ int ret;
+
+ /*
+ * We can get here only if flts=on, the supported PGTT is FLT and PT.
+ * Catch invalid PGTT when processing invalidation request to avoid
+ * attaching to wrong hwpt.
+ */
+ if (!vtd_pe_pgtt_is_flt(pe) && !vtd_pe_pgtt_is_pt(pe)) {
+ error_setg(errp, "Invalid PGTT type");
+ return -EINVAL;
+ }
+
+ if (vtd_pe_pgtt_is_flt(pe)) {
+ /* Should fail if the FLPT base is 0 */
+ if (!vtd_pe_get_flpt_base(pe)) {
+ error_setg(errp, "FLPT base is 0");
+ return -EINVAL;
+ }
+
+ if (vtd_create_s1_hwpt(idev, pe, &hwpt_id, errp)) {
+ return -EINVAL;
+ }
+ } else {
+ hwpt_id = idev->hwpt_id;
+ }
+
+ ret = !host_iommu_device_iommufd_attach_hwpt(idev, hwpt_id, errp);
+ trace_vtd_device_attach_hwpt(idev->devid, vtd_as->pasid, hwpt_id, ret);
+ if (!ret) {
+ vtd_destroy_s1_hwpt(idev, vtd_as);
+ if (vtd_pe_pgtt_is_flt(pe)) {
+ vtd_as->s1_hwpt = hwpt_id;
+ }
+ } else if (vtd_pe_pgtt_is_flt(pe)) {
+ iommufd_backend_free_id(idev->iommufd, hwpt_id);
+ }
+
+ return ret;
+}
+
+static int vtd_device_detach_iommufd(VTDHostIOMMUDevice *vtd_hiod,
+ VTDAddressSpace *vtd_as, Error **errp)
+{
+ HostIOMMUDeviceIOMMUFD *idev = HOST_IOMMU_DEVICE_IOMMUFD(vtd_hiod->hiod);
+ uint32_t pasid = vtd_as->pasid;
+ int ret;
+
+ if (vtd_hiod->iommu_state->dmar_enabled) {
+ ret = !host_iommu_device_iommufd_detach_hwpt(idev, errp);
+ trace_vtd_device_detach_hwpt(idev->devid, pasid, ret);
+ } else {
+ ret = !host_iommu_device_iommufd_attach_hwpt(idev, idev->hwpt_id, errp);
+ trace_vtd_device_reattach_def_hwpt(idev->devid, pasid, idev->hwpt_id,
+ ret);
+ }
+
+ if (!ret) {
+ vtd_destroy_s1_hwpt(idev, vtd_as);
+ }
+
+ return ret;
+}
+
+static int vtd_bind_guest_pasid(VTDAddressSpace *vtd_as, VTDPASIDOp op,
+ Error **errp)
+{
+ IntelIOMMUState *s = vtd_as->iommu_state;
+ VTDHostIOMMUDevice *vtd_hiod = vtd_find_hiod_iommufd(s, vtd_as);
+ int ret;
+
+ if (!vtd_hiod) {
+ /* No need to go further, e.g. for emulated device */
+ return 0;
+ }
+
+ if (vtd_as->pasid != PCI_NO_PASID) {
+ error_setg(errp, "Non-rid_pasid %d not supported yet", vtd_as->pasid);
+ return -EINVAL;
+ }
+
+ switch (op) {
+ case VTD_PASID_UPDATE:
+ case VTD_PASID_BIND:
+ {
+ ret = vtd_device_attach_iommufd(vtd_hiod, vtd_as, errp);
+ break;
+ }
+ case VTD_PASID_UNBIND:
+ {
+ ret = vtd_device_detach_iommufd(vtd_hiod, vtd_as, errp);
+ break;
+ }
+ default:
+ error_setg(errp, "Unknown VTDPASIDOp!!!");
+ break;
+ }
+
+ return ret;
+}
+#else
+static int vtd_bind_guest_pasid(VTDAddressSpace *vtd_as, VTDPASIDOp op,
+ Error **errp)
+{
+ return 0;
+}
+#endif
+
+static int vtd_bind_guest_pasid_report_err(VTDAddressSpace *vtd_as,
+ VTDPASIDOp op)
+{
+ Error *local_err = NULL;
+ int ret;
+
+ /*
+ * vIOMMU calls into kernel to do BIND/UNBIND, the failure reason
+ * can be kernel, QEMU bug or invalid guest config. None of them
+ * should be reported to guest in PASID cache invalidation
+ * processing path. But at least, we can report it to QEMU console.
+ *
+ * TODO: for invalid guest config, DMA translation fault will be
+ * caught by host and passed to QEMU to inject to guest in future.
+ */
+ ret = vtd_bind_guest_pasid(vtd_as, op, &local_err);
+ if (ret) {
+ error_report_err(local_err);
+ }
+
+ return ret;
+}
+
/* Do a context-cache device-selective invalidation.
* @func_mask: FM field after shifting
*/
@@ -3234,10 +3434,20 @@ static gboolean vtd_flush_pasid_locked(gpointer key, gpointer value,
/* No need to update if cached pasid entry is latest */
if (!vtd_pasid_entry_compare(&pe, &pc_entry->pasid_entry)) {
pc_entry->pasid_entry = pe;
+ if (vtd_bind_guest_pasid_report_err(vtd_as, VTD_PASID_UPDATE)) {
+ /*
+ * In case update binding fails, tear down existing binding to
+ * catch invalid pasid entry config during DMA translation.
+ */
+ goto remove;
+ }
}
return false;
remove:
+ if (vtd_bind_guest_pasid_report_err(vtd_as, VTD_PASID_UNBIND)) {
+ return false;
+ }
pc_entry->valid = false;
/*
@@ -3317,6 +3527,9 @@ static void vtd_sm_pasid_table_walk_one(IntelIOMMUState *s,
if (!pc_entry->valid) {
pc_entry->pasid_entry = pe;
pc_entry->valid = true;
+ if (vtd_bind_guest_pasid_report_err(vtd_as, VTD_PASID_BIND)) {
+ pc_entry->valid = false;
+ }
}
}
pasid++;
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 887f93bac9..c87e59554d 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -563,6 +563,12 @@ typedef struct VTDRootEntry VTDRootEntry;
#define VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw) (0x1e0ULL | ~VTD_HAW_MASK(aw))
#define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1 0xffffffffffe00000ULL
+typedef enum VTDPASIDOp {
+ VTD_PASID_BIND,
+ VTD_PASID_UPDATE,
+ VTD_PASID_UNBIND,
+} VTDPASIDOp;
+
typedef enum VTDPCInvType {
/* VTD spec defined PASID cache invalidation type */
VTD_PASID_CACHE_DOMSI = VTD_INV_DESC_PASIDC_G_DSI,
@@ -611,8 +617,12 @@ typedef struct VTDPASIDCacheInfo {
#define VTD_SM_PASID_ENTRY_AW 7ULL /* Adjusted guest-address-width */
#define VTD_SM_PASID_ENTRY_DID(x) extract64((x)->val[1], 0, 16)
-#define VTD_SM_PASID_ENTRY_FLPM 3ULL
-#define VTD_SM_PASID_ENTRY_FLPTPTR (~0xfffULL)
+#define VTD_SM_PASID_ENTRY_FSPTPTR (~0xfffULL)
+#define VTD_SM_PASID_ENTRY_SRE_BIT(x) extract64((x)->val[2], 0, 1)
+/* 00: 4-level paging, 01: 5-level paging, 10-11: Reserved */
+#define VTD_SM_PASID_ENTRY_FSPM(x) extract64((x)->val[2], 2, 2)
+#define VTD_SM_PASID_ENTRY_WPE_BIT(x) extract64((x)->val[2], 4, 1)
+#define VTD_SM_PASID_ENTRY_EAFE_BIT(x) extract64((x)->val[2], 7, 1)
/* First Level Paging Structure */
/* Masks for First Level Paging Entry */
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index c8a936eb46..1c31b9a873 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -73,6 +73,9 @@ vtd_warn_invalid_qi_tail(uint16_t tail) "tail 0x%"PRIx16
vtd_warn_ir_vector(uint16_t sid, int index, int vec, int target) "sid 0x%"PRIx16" index %d vec %d (should be: %d)"
vtd_warn_ir_trigger(uint16_t sid, int index, int trig, int target) "sid 0x%"PRIx16" index %d trigger %d (should be: %d)"
vtd_reset_exit(void) ""
+vtd_device_attach_hwpt(uint32_t dev_id, uint32_t pasid, uint32_t hwpt_id, int ret) "dev_id %d pasid %d hwpt_id %d, ret: %d"
+vtd_device_detach_hwpt(uint32_t dev_id, uint32_t pasid, int ret) "dev_id %d pasid %d ret: %d"
+vtd_device_reattach_def_hwpt(uint32_t dev_id, uint32_t pasid, uint32_t hwpt_id, int ret) "dev_id %d pasid %d hwpt_id %d, ret: %d"
# amd_iommu.c
amdvi_evntlog_fail(uint64_t addr, uint32_t head) "error: fail to write at addr 0x%"PRIx64" + offset 0x%"PRIx32
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 0e3826f6f0..2affab36b2 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -104,6 +104,7 @@ struct VTDAddressSpace {
PCIBus *bus;
uint8_t devfn;
uint32_t pasid;
+ uint32_t s1_hwpt;
AddressSpace as;
IOMMUMemoryRegion iommu;
MemoryRegion root; /* The root container of the device */
--
2.47.1
^ permalink raw reply related [flat|nested] 55+ messages in thread
* [PATCH v3 15/20] intel_iommu: Replay pasid bindings after context cache invalidation
2025-07-08 11:05 [PATCH v3 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
` (13 preceding siblings ...)
2025-07-08 11:05 ` [PATCH v3 14/20] intel_iommu: Bind/unbind guest page table to host Zhenzhong Duan
@ 2025-07-08 11:05 ` Zhenzhong Duan
2025-07-08 11:05 ` [PATCH v3 16/20] intel_iommu: Propagate PASID-based iotlb invalidation to host Zhenzhong Duan
` (4 subsequent siblings)
19 siblings, 0 replies; 55+ messages in thread
From: Zhenzhong Duan @ 2025-07-08 11:05 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng, Yi Sun,
Zhenzhong Duan
From: Yi Liu <yi.l.liu@intel.com>
This replays guest pasid bindings after context cache invalidation.
This is a safety measure; in principle, the programmer should issue a
pasid cache invalidation with proper granularity after issuing a context
cache invalidation.
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
hw/i386/intel_iommu.c | 42 ++++++++++++++++++++++++++++++++++
hw/i386/intel_iommu_internal.h | 2 ++
hw/i386/trace-events | 1 +
3 files changed, 45 insertions(+)
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index c35673eb58..887830a855 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -91,6 +91,10 @@ static void vtd_address_space_refresh_all(IntelIOMMUState *s);
static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
static void vtd_pasid_cache_reset_locked(IntelIOMMUState *s);
+static void vtd_pasid_cache_sync(IntelIOMMUState *s,
+ VTDPASIDCacheInfo *pc_info);
+static void vtd_pasid_cache_devsi(IntelIOMMUState *s,
+ PCIBus *bus, uint16_t devfn);
static void vtd_panic_require_caching_mode(void)
{
@@ -2437,6 +2441,8 @@ static void vtd_iommu_replay_all(IntelIOMMUState *s)
static void vtd_context_global_invalidate(IntelIOMMUState *s)
{
+ VTDPASIDCacheInfo pc_info;
+
trace_vtd_inv_desc_cc_global();
/* Protects context cache */
vtd_iommu_lock(s);
@@ -2454,6 +2460,9 @@ static void vtd_context_global_invalidate(IntelIOMMUState *s)
* VT-d emulation codes.
*/
vtd_iommu_replay_all(s);
+
+ pc_info.type = VTD_PASID_CACHE_GLOBAL_INV;
+ vtd_pasid_cache_sync(s, &pc_info);
}
#ifdef CONFIG_IOMMUFD
@@ -2686,6 +2695,15 @@ static void vtd_context_device_invalidate(IntelIOMMUState *s,
* happened.
*/
vtd_address_space_sync(vtd_as);
+ /*
+ * Per spec, context flush should also be followed with PASID
+ * cache and iotlb flush. In order to work with a guest which
+ * doesn't follow spec and missed PASID cache flush, we have
+ * vtd_pasid_cache_devsi() to invalidate PASID caches of the
+ * passthrough device. Host iommu driver would flush piotlb
+ * when a pasid unbind is pass down to it.
+ */
+ vtd_pasid_cache_devsi(s, vtd_as->bus, devfn);
}
}
}
@@ -3410,6 +3428,11 @@ static gboolean vtd_flush_pasid_locked(gpointer key, gpointer value,
break;
case VTD_PASID_CACHE_FORCE_RESET:
goto remove;
+ case VTD_PASID_CACHE_DEVSI:
+ if (pc_info->bus != vtd_as->bus || pc_info->devfn != vtd_as->devfn) {
+ return false;
+ }
+ break;
default:
error_setg(&error_fatal, "invalid pc_info->type for flush");
}
@@ -3616,6 +3639,11 @@ static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
case VTD_PASID_CACHE_FORCE_RESET:
/* For force reset, no need to go further replay */
return;
+ case VTD_PASID_CACHE_DEVSI:
+ walk_info.bus = pc_info->bus;
+ walk_info.devfn = pc_info->devfn;
+ vtd_replay_pasid_bind_for_dev(s, start, end, &walk_info);
+ return;
default:
error_setg(&error_fatal, "invalid pc_info->type for replay");
}
@@ -3666,6 +3694,20 @@ static void vtd_pasid_cache_sync(IntelIOMMUState *s, VTDPASIDCacheInfo *pc_info)
vtd_replay_guest_pasid_bindings(s, pc_info);
}
+static void vtd_pasid_cache_devsi(IntelIOMMUState *s,
+ PCIBus *bus, uint16_t devfn)
+{
+ VTDPASIDCacheInfo pc_info;
+
+ trace_vtd_pasid_cache_devsi(devfn);
+
+ pc_info.type = VTD_PASID_CACHE_DEVSI;
+ pc_info.bus = bus;
+ pc_info.devfn = devfn;
+
+ vtd_pasid_cache_sync(s, &pc_info);
+}
+
static bool vtd_process_pasid_desc(IntelIOMMUState *s,
VTDInvDesc *inv_desc)
{
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index c87e59554d..9af073d843 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -583,6 +583,8 @@ typedef enum VTDPCInvType {
/* Reset all PASID cache entries, used in system level reset */
VTD_PASID_CACHE_FORCE_RESET = 0x10,
+ /* Invalidate all PASID entries in a device */
+ VTD_PASID_CACHE_DEVSI,
} VTDPCInvType;
typedef struct VTDPASIDCacheInfo {
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index 1c31b9a873..830b11f68b 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -28,6 +28,7 @@ vtd_pasid_cache_reset(void) ""
vtd_pasid_cache_gsi(void) ""
vtd_pasid_cache_dsi(uint16_t domain) "Domain selective PC invalidation domain 0x%"PRIx16
vtd_pasid_cache_psi(uint16_t domain, uint32_t pasid) "PASID selective PC invalidation domain 0x%"PRIx16" pasid 0x%"PRIx32
+vtd_pasid_cache_devsi(uint16_t devfn) "Dev selective PC invalidation dev: 0x%"PRIx16
vtd_re_not_present(uint8_t bus) "Root entry bus %"PRIu8" not present"
vtd_ce_not_present(uint8_t bus, uint8_t devfn) "Context entry bus %"PRIu8" devfn %"PRIu8" not present"
vtd_iotlb_page_hit(uint16_t sid, uint64_t addr, uint64_t slpte, uint16_t domain) "IOTLB page hit sid 0x%"PRIx16" iova 0x%"PRIx64" slpte 0x%"PRIx64" domain 0x%"PRIx16
--
2.47.1
^ permalink raw reply related [flat|nested] 55+ messages in thread
* [PATCH v3 16/20] intel_iommu: Propagate PASID-based iotlb invalidation to host
2025-07-08 11:05 [PATCH v3 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
` (14 preceding siblings ...)
2025-07-08 11:05 ` [PATCH v3 15/20] intel_iommu: Replay pasid bindings after context cache invalidation Zhenzhong Duan
@ 2025-07-08 11:05 ` Zhenzhong Duan
2025-07-08 11:05 ` [PATCH v3 17/20] intel_iommu: Replay all pasid bindings when either SRTP or TE bit is changed Zhenzhong Duan
` (3 subsequent siblings)
19 siblings, 0 replies; 55+ messages in thread
From: Zhenzhong Duan @ 2025-07-08 11:05 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng, Yi Sun,
Zhenzhong Duan
From: Yi Liu <yi.l.liu@intel.com>
This traps the guest PASID-based iotlb invalidation request and propagates
it to the host.
Intel VT-d 3.0 supports nested translation at PASID granularity. Guest SVA
support could be implemented by configuring nested translation on a specific
pasid. This is also known as dual-stage DMA translation.
Under such configuration, guest owns the GVA->GPA translation which is
configured as stage-1 page table on host side for a specific pasid, and host
owns GPA->HPA translation. As guest owns stage-1 translation table, piotlb
invalidation should be propagated to host since host IOMMU will cache first
level page table related mappings during DMA address translation.
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
hw/i386/intel_iommu.c | 95 +++++++++++++++++++++++++++++++++-
hw/i386/intel_iommu_internal.h | 6 +++
2 files changed, 99 insertions(+), 2 deletions(-)
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 887830a855..d8b4296fe4 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -2606,12 +2606,99 @@ static int vtd_bind_guest_pasid(VTDAddressSpace *vtd_as, VTDPASIDOp op,
return ret;
}
+
+static void
+vtd_invalidate_piotlb_locked(VTDAddressSpace *vtd_as,
+ struct iommu_hwpt_vtd_s1_invalidate *cache)
+{
+ IntelIOMMUState *s = vtd_as->iommu_state;
+ VTDHostIOMMUDevice *vtd_hiod = vtd_find_hiod_iommufd(s, vtd_as);
+ HostIOMMUDeviceIOMMUFD *idev;
+ uint32_t entry_num = 1; /* Only implement one request for simplicity */
+ Error *local_err = NULL;
+
+ if (!vtd_hiod || !vtd_as->s1_hwpt) {
+ return;
+ }
+ idev = HOST_IOMMU_DEVICE_IOMMUFD(vtd_hiod->hiod);
+
+ if (!iommufd_backend_invalidate_cache(idev->iommufd, vtd_as->s1_hwpt,
+ IOMMU_HWPT_INVALIDATE_DATA_VTD_S1,
+ sizeof(*cache), &entry_num, cache,
+ &local_err)) {
+ /* Something wrong in kernel, but trying to continue */
+ error_report_err(local_err);
+ }
+}
+
+/*
+ * This function is a loop function for the s->vtd_address_spaces
+ * list with VTDPIOTLBInvInfo as execution filter. It propagates
+ * the piotlb invalidation to host.
+ */
+static void vtd_flush_host_piotlb_locked(gpointer key, gpointer value,
+ gpointer user_data)
+{
+ VTDPIOTLBInvInfo *piotlb_info = user_data;
+ VTDAddressSpace *vtd_as = value;
+ VTDPASIDCacheEntry *pc_entry = &vtd_as->pasid_cache_entry;
+ uint32_t pasid;
+ uint16_t did;
+
+ /* Replay only fills pasid entry cache for passthrough device */
+ if (!pc_entry->valid ||
+ !vtd_pe_pgtt_is_flt(&pc_entry->pasid_entry)) {
+ return;
+ }
+
+ if (vtd_as_to_iommu_pasid_locked(vtd_as, &pasid)) {
+ return;
+ }
+
+ did = VTD_SM_PASID_ENTRY_DID(&pc_entry->pasid_entry);
+
+ if (piotlb_info->domain_id == did && piotlb_info->pasid == pasid) {
+ vtd_invalidate_piotlb_locked(vtd_as, piotlb_info->inv_data);
+ }
+}
+
+static void
+vtd_flush_host_piotlb_all_locked(IntelIOMMUState *s,
+ uint16_t domain_id, uint32_t pasid,
+ hwaddr addr, uint64_t npages, bool ih)
+{
+ struct iommu_hwpt_vtd_s1_invalidate cache_info = { 0 };
+ VTDPIOTLBInvInfo piotlb_info;
+
+ cache_info.addr = addr;
+ cache_info.npages = npages;
+ cache_info.flags = ih ? IOMMU_VTD_INV_FLAGS_LEAF : 0;
+
+ piotlb_info.domain_id = domain_id;
+ piotlb_info.pasid = pasid;
+ piotlb_info.inv_data = &cache_info;
+
+ /*
+ * Go through each vtd_as instance in s->vtd_address_spaces, find out
+ * the affected host device which need host piotlb invalidation. Piotlb
+ * invalidation should check pasid cache per architecture point of view.
+ */
+ g_hash_table_foreach(s->vtd_address_spaces,
+ vtd_flush_host_piotlb_locked, &piotlb_info);
+}
#else
static int vtd_bind_guest_pasid(VTDAddressSpace *vtd_as, VTDPASIDOp op,
Error **errp)
{
return 0;
}
+
+static void
+vtd_flush_host_piotlb_all_locked(IntelIOMMUState *s,
+ uint16_t domain_id, uint32_t pasid,
+ hwaddr addr, uint64_t npages, bool ih)
+{
+}
#endif
static int vtd_bind_guest_pasid_report_err(VTDAddressSpace *vtd_as,
@@ -3286,6 +3373,7 @@ static void vtd_piotlb_pasid_invalidate(IntelIOMMUState *s,
vtd_iommu_lock(s);
g_hash_table_foreach_remove(s->iotlb, vtd_hash_remove_by_pasid,
&info);
+ vtd_flush_host_piotlb_all_locked(s, domain_id, pasid, 0, (uint64_t)-1, 0);
vtd_iommu_unlock(s);
QLIST_FOREACH(vtd_as, &s->vtd_as_with_notifiers, next) {
@@ -3307,7 +3395,8 @@ static void vtd_piotlb_pasid_invalidate(IntelIOMMUState *s,
}
static void vtd_piotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
- uint32_t pasid, hwaddr addr, uint8_t am)
+ uint32_t pasid, hwaddr addr, uint8_t am,
+ bool ih)
{
VTDIOTLBPageInvInfo info;
@@ -3319,6 +3408,7 @@ static void vtd_piotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
vtd_iommu_lock(s);
g_hash_table_foreach_remove(s->iotlb,
vtd_hash_remove_by_page_piotlb, &info);
+ vtd_flush_host_piotlb_all_locked(s, domain_id, pasid, addr, 1 << am, ih);
vtd_iommu_unlock(s);
vtd_iotlb_page_invalidate_notify(s, domain_id, addr, am, pasid);
@@ -3350,7 +3440,8 @@ static bool vtd_process_piotlb_desc(IntelIOMMUState *s,
case VTD_INV_DESC_PIOTLB_PSI_IN_PASID:
am = VTD_INV_DESC_PIOTLB_AM(inv_desc->val[1]);
addr = (hwaddr) VTD_INV_DESC_PIOTLB_ADDR(inv_desc->val[1]);
- vtd_piotlb_page_invalidate(s, domain_id, pasid, addr, am);
+ vtd_piotlb_page_invalidate(s, domain_id, pasid, addr, am,
+ VTD_INV_DESC_PIOTLB_IH(inv_desc->val[1]));
break;
default:
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 9af073d843..f4982b19a2 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -595,6 +595,12 @@ typedef struct VTDPASIDCacheInfo {
uint16_t devfn;
} VTDPASIDCacheInfo;
+typedef struct VTDPIOTLBInvInfo {
+ uint16_t domain_id;
+ uint32_t pasid;
+ struct iommu_hwpt_vtd_s1_invalidate *inv_data;
+} VTDPIOTLBInvInfo;
+
/* PASID Table Related Definitions */
#define VTD_PASID_DIR_BASE_ADDR_MASK (~0xfffULL)
#define VTD_PASID_TABLE_BASE_ADDR_MASK (~0xfffULL)
--
2.47.1
^ permalink raw reply related [flat|nested] 55+ messages in thread
* [PATCH v3 17/20] intel_iommu: Replay all pasid bindings when either SRTP or TE bit is changed
2025-07-08 11:05 [PATCH v3 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
` (15 preceding siblings ...)
2025-07-08 11:05 ` [PATCH v3 16/20] intel_iommu: Propagate PASID-based iotlb invalidation to host Zhenzhong Duan
@ 2025-07-08 11:05 ` Zhenzhong Duan
2025-07-08 11:05 ` [PATCH v3 18/20] vfio: Add a new element bypass_ro in VFIOContainerBase Zhenzhong Duan
` (2 subsequent siblings)
19 siblings, 0 replies; 55+ messages in thread
From: Zhenzhong Duan @ 2025-07-08 11:05 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan
From: Yi Liu <yi.l.liu@intel.com>
When either 'Set Root Table Pointer' or 'Translation Enable' bit is changed,
all pasid bindings on host side become stale and need to be updated.
Introduce a helper function vtd_replay_pasid_bindings_all() to go through all
pasid entries in all passthrough devices to update host side bindings.
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
hw/i386/intel_iommu.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index d8b4296fe4..0a86bd47b2 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -89,6 +89,7 @@ struct vtd_iotlb_key {
static void vtd_address_space_refresh_all(IntelIOMMUState *s);
static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
+static void vtd_replay_pasid_bindings_all(IntelIOMMUState *s);
static void vtd_pasid_cache_reset_locked(IntelIOMMUState *s);
static void vtd_pasid_cache_sync(IntelIOMMUState *s,
@@ -3044,6 +3045,7 @@ static void vtd_handle_gcmd_srtp(IntelIOMMUState *s)
vtd_set_clear_mask_long(s, DMAR_GSTS_REG, 0, VTD_GSTS_RTPS);
vtd_reset_caches(s);
vtd_address_space_refresh_all(s);
+ vtd_replay_pasid_bindings_all(s);
}
/* Set Interrupt Remap Table Pointer */
@@ -3078,6 +3080,7 @@ static void vtd_handle_gcmd_te(IntelIOMMUState *s, bool en)
vtd_reset_caches(s);
vtd_address_space_refresh_all(s);
+ vtd_replay_pasid_bindings_all(s);
}
/* Handle Interrupt Remap Enable/Disable */
@@ -3758,6 +3761,17 @@ static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
}
}
+static void vtd_replay_pasid_bindings_all(IntelIOMMUState *s)
+{
+ VTDPASIDCacheInfo pc_info = { .type = VTD_PASID_CACHE_GLOBAL_INV };
+
+ if (!s->flts || !s->root_scalable || !s->dmar_enabled) {
+ return;
+ }
+
+ vtd_replay_guest_pasid_bindings(s, &pc_info);
+}
+
/* Update the pasid cache in vIOMMU */
static void vtd_pasid_cache_sync(IntelIOMMUState *s, VTDPASIDCacheInfo *pc_info)
{
--
2.47.1
^ permalink raw reply related [flat|nested] 55+ messages in thread
* [PATCH v3 18/20] vfio: Add a new element bypass_ro in VFIOContainerBase
2025-07-08 11:05 [PATCH v3 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
` (16 preceding siblings ...)
2025-07-08 11:05 ` [PATCH v3 17/20] intel_iommu: Replay all pasid bindings when either SRTP or TE bit is changed Zhenzhong Duan
@ 2025-07-08 11:05 ` Zhenzhong Duan
2025-07-08 11:06 ` [PATCH v3 19/20] Workaround for ERRATA_772415_SPR17 Zhenzhong Duan
2025-07-08 11:06 ` [PATCH v3 20/20] intel_iommu: Enable host device when x-flts=on in scalable mode Zhenzhong Duan
19 siblings, 0 replies; 55+ messages in thread
From: Zhenzhong Duan @ 2025-07-08 11:05 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan
When bypass_ro is true, read-only memory sections are not mapped in
the container.
This is a preparatory patch to work around Intel ERRATA_772415.
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
hw/vfio/listener.c | 13 +++++++++----
include/hw/vfio/vfio-container-base.h | 1 +
2 files changed, 10 insertions(+), 4 deletions(-)
diff --git a/hw/vfio/listener.c b/hw/vfio/listener.c
index f498e23a93..c64aa4539e 100644
--- a/hw/vfio/listener.c
+++ b/hw/vfio/listener.c
@@ -364,7 +364,8 @@ static bool vfio_known_safe_misalignment(MemoryRegionSection *section)
return true;
}
-static bool vfio_listener_valid_section(MemoryRegionSection *section,
+static bool vfio_listener_valid_section(VFIOContainerBase *bcontainer,
+ MemoryRegionSection *section,
const char *name)
{
if (vfio_listener_skipped_section(section)) {
@@ -375,6 +376,10 @@ static bool vfio_listener_valid_section(MemoryRegionSection *section,
return false;
}
+ if (bcontainer && bcontainer->bypass_ro && section->readonly) {
+ return false;
+ }
+
if (unlikely((section->offset_within_address_space &
~qemu_real_host_page_mask()) !=
(section->offset_within_region & ~qemu_real_host_page_mask()))) {
@@ -494,7 +499,7 @@ void vfio_container_region_add(VFIOContainerBase *bcontainer,
int ret;
Error *err = NULL;
- if (!vfio_listener_valid_section(section, "region_add")) {
+ if (!vfio_listener_valid_section(bcontainer, section, "region_add")) {
return;
}
@@ -655,7 +660,7 @@ static void vfio_listener_region_del(MemoryListener *listener,
int ret;
bool try_unmap = true;
- if (!vfio_listener_valid_section(section, "region_del")) {
+ if (!vfio_listener_valid_section(bcontainer, section, "region_del")) {
return;
}
@@ -812,7 +817,7 @@ static void vfio_dirty_tracking_update(MemoryListener *listener,
container_of(listener, VFIODirtyRangesListener, listener);
hwaddr iova, end;
- if (!vfio_listener_valid_section(section, "tracking_update") ||
+ if (!vfio_listener_valid_section(NULL, section, "tracking_update") ||
!vfio_get_section_iova_range(dirty->bcontainer, section,
&iova, &end, NULL)) {
return;
diff --git a/include/hw/vfio/vfio-container-base.h b/include/hw/vfio/vfio-container-base.h
index bded6e993f..31fd784d76 100644
--- a/include/hw/vfio/vfio-container-base.h
+++ b/include/hw/vfio/vfio-container-base.h
@@ -51,6 +51,7 @@ typedef struct VFIOContainerBase {
QLIST_HEAD(, VFIODevice) device_list;
GList *iova_ranges;
NotifierWithReturn cpr_reboot_notifier;
+ bool bypass_ro;
} VFIOContainerBase;
typedef struct VFIOGuestIOMMU {
--
2.47.1
^ permalink raw reply related [flat|nested] 55+ messages in thread
* [PATCH v3 19/20] Workaround for ERRATA_772415_SPR17
2025-07-08 11:05 [PATCH v3 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
` (17 preceding siblings ...)
2025-07-08 11:05 ` [PATCH v3 18/20] vfio: Add a new element bypass_ro in VFIOContainerBase Zhenzhong Duan
@ 2025-07-08 11:06 ` Zhenzhong Duan
2025-07-08 11:06 ` [PATCH v3 20/20] intel_iommu: Enable host device when x-flts=on in scalable mode Zhenzhong Duan
19 siblings, 0 replies; 55+ messages in thread
From: Zhenzhong Duan @ 2025-07-08 11:06 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan
On a system affected by ERRATA_772415, IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17
is reported by IOMMU_DEVICE_GET_HW_INFO. Due to this erratum, even a read-only
range mapped in the stage-2 page table could still be written.
Reference from 4th Gen Intel Xeon Processor Scalable Family Specification
Update, Errata Details, SPR17.
https://edc.intel.com/content/www/us/en/design/products-and-solutions/processors-and-chipsets/eagle-stream/sapphire-rapids-specification-update/
Also copied the SPR17 details from above link:
"Problem: When remapping hardware is configured by system software in
scalable mode as Nested (PGTT=011b) and with PWSNP field Set in the
PASID-table-entry, it may Set Accessed bit and Dirty bit (and Extended
Access bit if enabled) in first-stage page-table entries even when
second-stage mappings indicate that corresponding first-stage page-table
is Read-Only.
Implication: Due to this erratum, pages mapped as Read-only in second-stage
page-tables may be modified by remapping hardware Access/Dirty bit updates.
Workaround: None identified. System software enabling nested translations
for a VM should ensure that there are no read-only pages in the
corresponding second-stage mappings."
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
hw/vfio/iommufd.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index c172c177fc..a28641b5f5 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -325,6 +325,7 @@ static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
{
ERRP_GUARD();
IOMMUFDBackend *iommufd = vbasedev->iommufd;
+ struct iommu_hw_info_vtd vtd;
uint32_t type, flags = 0;
uint64_t hw_caps;
VFIOIOASHwpt *hwpt;
@@ -372,10 +373,15 @@ static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
* instead.
*/
if (!iommufd_backend_get_device_info(vbasedev->iommufd, vbasedev->devid,
- &type, NULL, 0, &hw_caps, errp)) {
+ &type, &vtd, sizeof(vtd), &hw_caps,
+ errp)) {
return false;
}
+ if (vtd.flags & IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17) {
+ container->bcontainer.bypass_ro = true;
+ }
+
if (hw_caps & IOMMU_HW_CAP_DIRTY_TRACKING) {
flags = IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
}
--
2.47.1
^ permalink raw reply related [flat|nested] 55+ messages in thread
* [PATCH v3 20/20] intel_iommu: Enable host device when x-flts=on in scalable mode
2025-07-08 11:05 [PATCH v3 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
` (18 preceding siblings ...)
2025-07-08 11:06 ` [PATCH v3 19/20] Workaround for ERRATA_772415_SPR17 Zhenzhong Duan
@ 2025-07-08 11:06 ` Zhenzhong Duan
19 siblings, 0 replies; 55+ messages in thread
From: Zhenzhong Duan @ 2025-07-08 11:06 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan
Now that all infrastructures of supporting passthrough device running
with stage-1 translation are there, enable it now.
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
hw/i386/intel_iommu.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 0a86bd47b2..08e4d5cfe7 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -5203,6 +5203,8 @@ static bool vtd_check_hiod(IntelIOMMUState *s, VTDHostIOMMUDevice *vtd_hiod,
"when x-flts=on");
return false;
}
+
+ return true;
#endif
error_setg(errp, "host IOMMU is incompatible with stage-1 translation");
--
2.47.1
^ permalink raw reply related [flat|nested] 55+ messages in thread