* [PATCH v2 00/19] intel_iommu: Enable stage-1 translation for passthrough device
@ 2025-06-20 7:17 Zhenzhong Duan
2025-06-20 7:17 ` [PATCH v2 01/19] intel_iommu: Rename vtd_ce_get_rid2pasid_entry to vtd_ce_get_pasid_entry Zhenzhong Duan
` (18 more replies)
0 siblings, 19 replies; 58+ messages in thread
From: Zhenzhong Duan @ 2025-06-20 7:17 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan
Hi,
For a passthrough device with intel_iommu.x-flts=on, we don't shadow the
guest page table; instead the stage-1 page table is passed to the host
side to construct a nested domain. There were earlier efforts to enable
this feature, see [1] for details.
The key design is to utilize the dual-stage IOMMU translation (also known as
IOMMU nested translation) capability in the host IOMMU. As the diagram below
shows, the guest I/O page table pointer in GPA (guest physical address) is
passed to the host and used to perform the stage-1 address translation. Along
with it, modifications to present mappings in the guest I/O page table should
be followed by an IOTLB invalidation.
.-------------. .---------------------------.
| vIOMMU | | Guest I/O page table |
| | '---------------------------'
.----------------/
| PASID Entry |--- PASID cache flush --+
'-------------' |
| | V
| | I/O page table pointer in GPA
'-------------'
Guest
------| Shadow |---------------------------|--------
v v v
Host
.-------------. .------------------------.
| pIOMMU | | Stage1 for GIOVA->GPA |
| | '------------------------'
.----------------/ |
| PASID Entry | V (Nested xlate)
'----------------\.--------------------------------------.
| | | Stage2 for GPA->HPA, unmanaged domain|
| | '--------------------------------------'
'-------------'
For historical reasons, different VT-d spec revisions use different naming,
where:
- Stage1 = First stage = First level = flts
- Stage2 = Second stage = Second level = slts
<Intel VT-d Nested translation>
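In case the composition is easier to read as code, below is a minimal,
purely illustrative sketch of the nested (two-stage) lookup. It uses toy
single-level tables, not real VT-d page-table walks; all names and values
are made up for the example:

```c
#include <stdint.h>

/* Toy single-level lookup tables standing in for the real multi-level
 * I/O page tables; indices and values are page-frame numbers. */
#define TOY_NPAGES 4

static const uint64_t stage1_gpa[TOY_NPAGES] = { 2, 0, 3, 1 }; /* GIOVA -> GPA, guest-managed */
static const uint64_t stage2_hpa[TOY_NPAGES] = { 7, 5, 6, 4 }; /* GPA -> HPA, host-managed */

/* Nested translation: the stage-1 output (a GPA) is itself translated
 * by stage-2, so a device-visible GIOVA ends up at an HPA. */
uint64_t nested_xlate(uint64_t giova_pfn)
{
    uint64_t gpa_pfn = stage1_gpa[giova_pfn]; /* stage-1: GIOVA -> GPA */
    return stage2_hpa[gpa_pfn];               /* stage-2: GPA -> HPA  */
}
```

This also shows why guest writes to stage-1 mappings must be followed by an
IOTLB invalidation: the host caches the composed GIOVA->HPA result.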
This series reuses the VFIO device's default hwpt as the nested parent instead
of creating a new one. This avoids duplicating code for a new memory listener;
all existing features of the VFIO listener can be shared, e.g., ram discard,
dirty tracking, etc. Two limitations are: 1) a VFIO device under a PCI bridge
together with an emulated device is not supported, because the emulated device
wants the IOMMU AS while the VFIO device sticks to the system AS; 2) kexec or
reboot from "intel_iommu=on,sm_on" to "intel_iommu=on,sm_off" is not
supported, because the VFIO device's default hwpt is created with the
NEST_PARENT flag and the kernel inhibits RO mappings when switching to shadow
mode.
This series is also prerequisite work for vSVA, i.e., sharing a guest
application's address space with passthrough devices.
There are some interactions between VFIO and vIOMMU:
* vIOMMU registers the PCIIOMMUOps [set|unset]_iommu_device callbacks with
the PCI subsystem. VFIO calls them to register/unregister a HostIOMMUDevice
instance with the vIOMMU at VFIO device realize time.
* vIOMMU registers the PCIIOMMUOps get_viommu_cap callback with the PCI
subsystem. VFIO calls it to get the capabilities exposed by the vIOMMU.
* vIOMMU calls the HostIOMMUDeviceIOMMUFD interface [at|de]tach_hwpt
to bind/unbind the device to/from IOMMUFD backed domains, either nested
or not.
See below diagram:
VFIO Device Intel IOMMU
.-----------------. .-------------------.
| | | |
| .---------|PCIIOMMUOps |.-------------. |
| | IOMMUFD |(set/unset_iommu_device) || Host IOMMU | |
| | Device |------------------------>|| Device list | |
| .---------|(get_viommu_cap) |.-------------. |
| | | | |
| | | V |
| .---------| HostIOMMUDeviceIOMMUFD | .-------------. |
| | IOMMUFD | (attach_hwpt)| | Host IOMMU | |
| | link |<------------------------| | Device | |
| .---------| (detach_hwpt)| .-------------. |
| | | | |
| | | ... |
.-----------------. .-------------------.
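The consumer side of the capability bitmap can be sketched as follows. This
is a minimal sketch only: hwpt_alloc_flags() is an illustrative helper, and
HWPT_ALLOC_NEST_PARENT is an assumed stand-in value; the real flag comes
from the iommufd uAPI header (enum iommufd_hwpt_alloc_flags):

```c
#include <stdint.h>

/* Capability bit as defined by include/hw/iommu.h in this series. */
#define VIOMMU_CAP_STAGE1 (1ULL << 0)

/* Assumed value for illustration only; the real allocation flag is
 * IOMMU_HWPT_ALLOC_NEST_PARENT from the iommufd uAPI. */
#define HWPT_ALLOC_NEST_PARENT (1U << 0)

/* Mirrors the decision this series adds to VFIO's hwpt allocation path:
 * request a nested-parent hwpt only when the vIOMMU advertises stage-1. */
uint32_t hwpt_alloc_flags(uint64_t viommu_caps, uint32_t flags)
{
    if (viommu_caps & VIOMMU_CAP_STAGE1) {
        flags |= HWPT_ALLOC_NEST_PARENT;
    }
    return flags;
}
```

Other allocation flags (e.g., dirty tracking) pass through unchanged, which
is why the helper takes and returns the whole flags word.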
Below is an example to enable stage-1 translation for a passthrough device:
-M q35,...
-device intel-iommu,x-scalable-mode=on,x-flts=on...
-object iommufd,id=iommufd0 -device vfio-pci,iommufd=iommufd0,...
Tests done:
- VFIO devices hotplug/unplug
- different VFIO devices linked to different iommufds
- vhost net device ping test
PATCH1-6: Some preparatory work
PATCH7-8: Compatibility checks between vIOMMU and host IOMMU
PATCH9-18: Implement stage-1 page table support for passthrough devices
PATCH19: Enable stage-1 translation for passthrough devices
QEMU code can be found at [2].
Fault reporting isn't supported in this series; we presume the guest kernel
always constructs a correct stage-1 page table for the passthrough device.
For emulated devices, the emulation code already provides stage-1 fault
injection.
TODO:
- Fault report to guest when HW stage1 faults
[1] https://patchwork.kernel.org/project/kvm/cover/20210302203827.437645-1-yi.l.liu@intel.com/
[2] https://github.com/yiliu1765/qemu/tree/zhenzhong/iommufd_nesting.v2
Thanks
Zhenzhong
Changelog:
v2:
- introduce get_viommu_cap() to get STAGE1 flag to create nested parent hwpt (Liuyi)
- reuse VFIO's default hwpt as parent hwpt of nested translation (Nicolin, Liuyi)
- abandon support of VFIO device under pcie-to-pci bridge to simplify design (Liuyi)
- bypass RO mapping in VFIO's default hwpt if ERRATA_772415_SPR17 (Liuyi)
- drop vtd_dev_to_context_entry optimization (Liuyi)
v1:
- simplify vendor specific checking in vtd_check_hiod (Cedric, Nicolin)
- rebase to master
rfcv3:
- s/hwpt_id/id in iommufd_backend_invalidate_cache()'s parameter (Shameer)
- hide vtd vendor specific caps in a wrapper union (Eric, Nicolin)
- simplify return value check of get_cap() (Eric)
- drop realize_late (Cedric, Eric)
- split patch13:intel_iommu: Add PASID cache management infrastructure (Eric)
- s/vtd_pasid_cache_reset/vtd_pasid_cache_reset_locked (Eric)
- s/vtd_pe_get_domain_id/vtd_pe_get_did (Eric)
- refine comments (Eric, Donald)
rfcv2:
- Drop VTDPASIDAddressSpace and use VTDAddressSpace (Eric, Liuyi)
- Move HWPT uAPI patches ahead(patch1-8) so arm nesting could easily rebase
- add two cleanup patches(patch9-10)
- VFIO passes iommufd/devid/hwpt_id to vIOMMU instead of iommufd/devid/ioas_id
- add vtd_as_[from|to]_iommu_pasid() helper to translate between vtd_as and
iommu pasid, this is important for dropping VTDPASIDAddressSpace
Yi Liu (3):
intel_iommu: Replay pasid binds after context cache invalidation
intel_iommu: Propagate PASID-based iotlb invalidation to host
intel_iommu: Refresh pasid bind when either SRTP or TE bit is changed
Zhenzhong Duan (16):
intel_iommu: Rename vtd_ce_get_rid2pasid_entry to
vtd_ce_get_pasid_entry
hw/pci: Introduce pci_device_get_viommu_cap()
intel_iommu: Implement get_viommu_cap() callback
vfio/iommufd: Force creating nested parent domain
hw/pci: Export pci_device_get_iommu_bus_devfn() and return bool
intel_iommu: Introduce a new structure VTDHostIOMMUDevice
intel_iommu: Check for compatibility with IOMMUFD backed device when
x-flts=on
intel_iommu: Fail passthrough device under PCI bridge if x-flts=on
intel_iommu: Introduce two helpers vtd_as_from/to_iommu_pasid_locked
intel_iommu: Handle PASID entry removing and updating
intel_iommu: Handle PASID entry adding
intel_iommu: Introduce a new pasid cache invalidation type FORCE_RESET
intel_iommu: Stick to system MR for IOMMUFD backed host device when
x-flts=on
intel_iommu: Bind/unbind guest page table to host
Workaround for ERRATA_772415_SPR17
intel_iommu: Enable host device when x-flts=on in scalable mode
MAINTAINERS | 1 +
hw/i386/intel_iommu_internal.h | 56 ++
include/hw/i386/intel_iommu.h | 8 +-
include/hw/iommu.h | 14 +
include/hw/pci/pci.h | 24 +
include/hw/vfio/vfio-container-base.h | 1 +
hw/i386/intel_iommu.c | 1023 ++++++++++++++++++++++++-
hw/pci/pci.c | 23 +-
hw/vfio/iommufd.c | 22 +-
hw/vfio/listener.c | 13 +-
hw/i386/trace-events | 8 +
11 files changed, 1160 insertions(+), 33 deletions(-)
create mode 100644 include/hw/iommu.h
--
2.34.1
^ permalink raw reply [flat|nested] 58+ messages in thread
* [PATCH v2 01/19] intel_iommu: Rename vtd_ce_get_rid2pasid_entry to vtd_ce_get_pasid_entry
2025-06-20 7:17 [PATCH v2 00/19] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
@ 2025-06-20 7:17 ` Zhenzhong Duan
2025-06-20 7:17 ` [PATCH v2 02/19] hw/pci: Introduce pci_device_get_viommu_cap() Zhenzhong Duan
` (17 subsequent siblings)
18 siblings, 0 replies; 58+ messages in thread
From: Zhenzhong Duan @ 2025-06-20 7:17 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan, Paolo Bonzini, Richard Henderson, Eduardo Habkost,
Marcel Apfelbaum
In the early days, vtd_ce_get_rid2pasid_entry() was used to get the pasid
entry of rid2pasid; it was later extended to get any pasid entry. So the
new name vtd_ce_get_pasid_entry better matches what it actually does.
No functional change intended.
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Clément Mathieu--Drif<clement.mathieu--drif@eviden.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
---
hw/i386/intel_iommu.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 69d72ad35c..f0b1f90eff 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -944,7 +944,7 @@ static int vtd_get_pe_from_pasid_table(IntelIOMMUState *s,
return 0;
}
-static int vtd_ce_get_rid2pasid_entry(IntelIOMMUState *s,
+static int vtd_ce_get_pasid_entry(IntelIOMMUState *s,
VTDContextEntry *ce,
VTDPASIDEntry *pe,
uint32_t pasid)
@@ -1025,7 +1025,7 @@ static uint32_t vtd_get_iova_level(IntelIOMMUState *s,
VTDPASIDEntry pe;
if (s->root_scalable) {
- vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
+ vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
if (s->flts) {
return VTD_PE_GET_FL_LEVEL(&pe);
} else {
@@ -1048,7 +1048,7 @@ static uint32_t vtd_get_iova_agaw(IntelIOMMUState *s,
VTDPASIDEntry pe;
if (s->root_scalable) {
- vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
+ vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
return 30 + ((pe.val[0] >> 2) & VTD_SM_PASID_ENTRY_AW) * 9;
}
@@ -1116,7 +1116,7 @@ static dma_addr_t vtd_get_iova_pgtbl_base(IntelIOMMUState *s,
VTDPASIDEntry pe;
if (s->root_scalable) {
- vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
+ vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
if (s->flts) {
return pe.val[2] & VTD_SM_PASID_ENTRY_FLPTPTR;
} else {
@@ -1522,7 +1522,7 @@ static int vtd_ce_rid2pasid_check(IntelIOMMUState *s,
* has valid rid2pasid setting, which includes valid
* rid2pasid field and corresponding pasid entry setting
*/
- return vtd_ce_get_rid2pasid_entry(s, ce, &pe, PCI_NO_PASID);
+ return vtd_ce_get_pasid_entry(s, ce, &pe, PCI_NO_PASID);
}
/* Map a device to its corresponding domain (context-entry) */
@@ -1611,7 +1611,7 @@ static uint16_t vtd_get_domain_id(IntelIOMMUState *s,
VTDPASIDEntry pe;
if (s->root_scalable) {
- vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
+ vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
return VTD_SM_PASID_ENTRY_DID(pe.val[1]);
}
@@ -1687,7 +1687,7 @@ static bool vtd_dev_pt_enabled(IntelIOMMUState *s, VTDContextEntry *ce,
int ret;
if (s->root_scalable) {
- ret = vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
+ ret = vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
if (ret) {
/*
* This error is guest triggerable. We should assumt PT
--
2.34.1
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [PATCH v2 02/19] hw/pci: Introduce pci_device_get_viommu_cap()
2025-06-20 7:17 [PATCH v2 00/19] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
2025-06-20 7:17 ` [PATCH v2 01/19] intel_iommu: Rename vtd_ce_get_rid2pasid_entry to vtd_ce_get_pasid_entry Zhenzhong Duan
@ 2025-06-20 7:17 ` Zhenzhong Duan
2025-06-20 7:53 ` Eric Auger
2025-06-20 7:17 ` [PATCH v2 03/19] intel_iommu: Implement get_viommu_cap() callback Zhenzhong Duan
` (16 subsequent siblings)
18 siblings, 1 reply; 58+ messages in thread
From: Zhenzhong Duan @ 2025-06-20 7:17 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan, Marcel Apfelbaum
pci_device_get_viommu_cap() calls pci_device_get_iommu_bus_devfn()
to get iommu_bus->iommu_ops and calls the get_viommu_cap() callback to
get a bitmap in which each bit represents a vIOMMU exposed capability.
Suggested-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
include/hw/pci/pci.h | 22 ++++++++++++++++++++++
hw/pci/pci.c | 11 +++++++++++
2 files changed, 33 insertions(+)
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index df3cc7b875..829757b2c2 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -453,6 +453,18 @@ typedef struct PCIIOMMUOps {
* @devfn: device and function number of the PCI device.
*/
void (*unset_iommu_device)(PCIBus *bus, void *opaque, int devfn);
+ /**
+ * @get_viommu_cap: get vIOMMU capabilities
+ *
+ * Optional callback, if not implemented, then vIOMMU doesn't
+ * support exposing capabilities to other subsystem, e.g., VFIO.
+ * vIOMMU can choose which capabilities to expose.
+ *
+ * @opaque: the data passed to pci_setup_iommu().
+ *
+ * Returns: 64bit bitmap with each bit represents a capability.
+ */
+ uint64_t (*get_viommu_cap)(void *opaque);
/**
* @get_iotlb_info: get properties required to initialize a device IOTLB.
*
@@ -633,6 +645,16 @@ bool pci_device_set_iommu_device(PCIDevice *dev, HostIOMMUDevice *hiod,
Error **errp);
void pci_device_unset_iommu_device(PCIDevice *dev);
+/**
+ * pci_device_get_viommu_cap: get vIOMMU capabilities.
+ *
+ * Returns a 64bit bitmap with each bit represents a vIOMMU exposed
+ * capability, 0 if vIOMMU doesn't support exposing capabilities.
+ *
+ * @dev: PCI device pointer.
+ */
+uint64_t pci_device_get_viommu_cap(PCIDevice *dev);
+
/**
* pci_iommu_get_iotlb_info: get properties required to initialize a
* device IOTLB.
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index c70b5ceeba..df1fb615a8 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -2992,6 +2992,17 @@ void pci_device_unset_iommu_device(PCIDevice *dev)
}
}
+uint64_t pci_device_get_viommu_cap(PCIDevice *dev)
+{
+ PCIBus *iommu_bus;
+
+ pci_device_get_iommu_bus_devfn(dev, &iommu_bus, NULL, NULL);
+ if (iommu_bus && iommu_bus->iommu_ops->get_viommu_cap) {
+ return iommu_bus->iommu_ops->get_viommu_cap(iommu_bus->iommu_opaque);
+ }
+ return 0;
+}
+
int pci_pri_request_page(PCIDevice *dev, uint32_t pasid, bool priv_req,
bool exec_req, hwaddr addr, bool lpig,
uint16_t prgi, bool is_read, bool is_write)
--
2.34.1
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [PATCH v2 03/19] intel_iommu: Implement get_viommu_cap() callback
2025-06-20 7:17 [PATCH v2 00/19] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
2025-06-20 7:17 ` [PATCH v2 01/19] intel_iommu: Rename vtd_ce_get_rid2pasid_entry to vtd_ce_get_pasid_entry Zhenzhong Duan
2025-06-20 7:17 ` [PATCH v2 02/19] hw/pci: Introduce pci_device_get_viommu_cap() Zhenzhong Duan
@ 2025-06-20 7:17 ` Zhenzhong Duan
2025-06-20 8:10 ` Eric Auger
2025-06-20 7:17 ` [PATCH v2 04/19] vfio/iommufd: Force creating nested parent domain Zhenzhong Duan
` (15 subsequent siblings)
18 siblings, 1 reply; 58+ messages in thread
From: Zhenzhong Duan @ 2025-06-20 7:17 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan, Paolo Bonzini, Richard Henderson, Eduardo Habkost,
Marcel Apfelbaum
Implement the get_viommu_cap() callback and expose the stage-1 capability for
now. VFIO uses it to create a nested parent domain, which is further used to
create a nested domain in the vIOMMU. All this will be implemented in the
following patches.
Suggested-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
MAINTAINERS | 1 +
include/hw/iommu.h | 14 ++++++++++++++
hw/i386/intel_iommu.c | 12 ++++++++++++
3 files changed, 27 insertions(+)
create mode 100644 include/hw/iommu.h
diff --git a/MAINTAINERS b/MAINTAINERS
index 94c4076127..27817974a6 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2277,6 +2277,7 @@ F: include/system/iommufd.h
F: backends/host_iommu_device.c
F: include/system/host_iommu_device.h
F: include/qemu/chardev_open.h
+F: include/hw/iommu.h
F: util/chardev_open.c
F: docs/devel/vfio-iommufd.rst
diff --git a/include/hw/iommu.h b/include/hw/iommu.h
new file mode 100644
index 0000000000..3c1c08f05d
--- /dev/null
+++ b/include/hw/iommu.h
@@ -0,0 +1,14 @@
+/*
+ * General vIOMMU capabilities, flags, etc
+ *
+ * Copyright (C) 2025 Intel Corporation.
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#ifndef HW_IOMMU_H
+#define HW_IOMMU_H
+
+#define VIOMMU_CAP_STAGE1 BIT_ULL(0)
+
+#endif /* HW_IOMMU_H */
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index f0b1f90eff..702973da5c 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -24,6 +24,7 @@
#include "qemu/main-loop.h"
#include "qapi/error.h"
#include "hw/sysbus.h"
+#include "hw/iommu.h"
#include "intel_iommu_internal.h"
#include "hw/pci/pci.h"
#include "hw/pci/pci_bus.h"
@@ -4412,6 +4413,16 @@ static void vtd_dev_unset_iommu_device(PCIBus *bus, void *opaque, int devfn)
vtd_iommu_unlock(s);
}
+static uint64_t vtd_get_viommu_cap(void *opaque)
+{
+ IntelIOMMUState *s = opaque;
+ uint64_t caps;
+
+ caps = s->flts ? VIOMMU_CAP_STAGE1 : 0;
+
+ return caps;
+}
+
/* Unmap the whole range in the notifier's scope. */
static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
{
@@ -4734,6 +4745,7 @@ static PCIIOMMUOps vtd_iommu_ops = {
.get_address_space = vtd_host_dma_iommu,
.set_iommu_device = vtd_dev_set_iommu_device,
.unset_iommu_device = vtd_dev_unset_iommu_device,
+ .get_viommu_cap = vtd_get_viommu_cap,
};
static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
--
2.34.1
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [PATCH v2 04/19] vfio/iommufd: Force creating nested parent domain
2025-06-20 7:17 [PATCH v2 00/19] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
` (2 preceding siblings ...)
2025-06-20 7:17 ` [PATCH v2 03/19] intel_iommu: Implement get_viommu_cap() callback Zhenzhong Duan
@ 2025-06-20 7:17 ` Zhenzhong Duan
2025-06-20 8:08 ` Eric Auger
2025-06-20 7:17 ` [PATCH v2 05/19] hw/pci: Export pci_device_get_iommu_bus_devfn() and return bool Zhenzhong Duan
` (14 subsequent siblings)
18 siblings, 1 reply; 58+ messages in thread
From: Zhenzhong Duan @ 2025-06-20 7:17 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan
Call pci_device_get_viommu_cap() to check whether the vIOMMU supports
VIOMMU_CAP_STAGE1; if so, create a nested parent domain which can be reused
by the vIOMMU to create a nested domain.
Suggested-by: Nicolin Chen <nicolinc@nvidia.com>
Suggested-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
hw/vfio/iommufd.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index d3efef71af..83a632bdee 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -20,6 +20,7 @@
#include "trace.h"
#include "qapi/error.h"
#include "system/iommufd.h"
+#include "hw/iommu.h"
#include "hw/qdev-core.h"
#include "hw/vfio/vfio-cpr.h"
#include "system/reset.h"
@@ -352,6 +353,19 @@ static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
flags = IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
}
+ /*
+ * If vIOMMU supports stage-1 translation, force to create nested parent
+ * domain which could be reused by vIOMMU to create nested domain.
+ */
+ if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
+ VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
+
+ hw_caps = pci_device_get_viommu_cap(&vdev->pdev);
+ if (hw_caps & VIOMMU_CAP_STAGE1) {
+ flags |= IOMMU_HWPT_ALLOC_NEST_PARENT;
+ }
+ }
+
if (!iommufd_backend_alloc_hwpt(iommufd, vbasedev->devid,
container->ioas_id, flags,
IOMMU_HWPT_DATA_NONE, 0, NULL,
--
2.34.1
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [PATCH v2 05/19] hw/pci: Export pci_device_get_iommu_bus_devfn() and return bool
2025-06-20 7:17 [PATCH v2 00/19] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
` (3 preceding siblings ...)
2025-06-20 7:17 ` [PATCH v2 04/19] vfio/iommufd: Force creating nested parent domain Zhenzhong Duan
@ 2025-06-20 7:17 ` Zhenzhong Duan
2025-06-20 11:59 ` Eric Auger
2025-06-20 7:18 ` [PATCH v2 06/19] intel_iommu: Introduce a new structure VTDHostIOMMUDevice Zhenzhong Duan
` (13 subsequent siblings)
18 siblings, 1 reply; 58+ messages in thread
From: Zhenzhong Duan @ 2025-06-20 7:17 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan, Marcel Apfelbaum
Return true if the PCI device is aliased, false otherwise. This will be
used in a following patch to determine whether a PCI device is under a
PCI bridge.
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
include/hw/pci/pci.h | 2 ++
hw/pci/pci.c | 12 ++++++++----
2 files changed, 10 insertions(+), 4 deletions(-)
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index 829757b2c2..3029cdf26f 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -640,6 +640,8 @@ typedef struct PCIIOMMUOps {
bool is_write);
} PCIIOMMUOps;
+bool pci_device_get_iommu_bus_devfn(PCIDevice *dev, PCIBus **piommu_bus,
+ PCIBus **aliased_bus, int *aliased_devfn);
AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
bool pci_device_set_iommu_device(PCIDevice *dev, HostIOMMUDevice *hiod,
Error **errp);
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index df1fb615a8..87f7c942b3 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -2857,20 +2857,21 @@ static void pci_device_class_base_init(ObjectClass *klass, const void *data)
* For call sites which don't need aliased BDF, passing NULL to
* aliased_[bus|devfn] is allowed.
*
+ * Returns true if PCI device is aliased or false otherwise.
+ *
* @piommu_bus: return root #PCIBus backed by an IOMMU for the PCI device.
*
* @aliased_bus: return aliased #PCIBus of the PCI device, optional.
*
* @aliased_devfn: return aliased devfn of the PCI device, optional.
*/
-static void pci_device_get_iommu_bus_devfn(PCIDevice *dev,
- PCIBus **piommu_bus,
- PCIBus **aliased_bus,
- int *aliased_devfn)
+bool pci_device_get_iommu_bus_devfn(PCIDevice *dev, PCIBus **piommu_bus,
+ PCIBus **aliased_bus, int *aliased_devfn)
{
PCIBus *bus = pci_get_bus(dev);
PCIBus *iommu_bus = bus;
int devfn = dev->devfn;
+ bool aliased = false;
while (iommu_bus && !iommu_bus->iommu_ops && iommu_bus->parent_dev) {
PCIBus *parent_bus = pci_get_bus(iommu_bus->parent_dev);
@@ -2907,6 +2908,7 @@ static void pci_device_get_iommu_bus_devfn(PCIDevice *dev,
devfn = parent->devfn;
bus = parent_bus;
}
+ aliased = true;
}
iommu_bus = parent_bus;
@@ -2928,6 +2930,8 @@ static void pci_device_get_iommu_bus_devfn(PCIDevice *dev,
if (aliased_devfn) {
*aliased_devfn = devfn;
}
+
+ return aliased;
}
AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
--
2.34.1
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [PATCH v2 06/19] intel_iommu: Introduce a new structure VTDHostIOMMUDevice
2025-06-20 7:17 [PATCH v2 00/19] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
` (4 preceding siblings ...)
2025-06-20 7:17 ` [PATCH v2 05/19] hw/pci: Export pci_device_get_iommu_bus_devfn() and return bool Zhenzhong Duan
@ 2025-06-20 7:18 ` Zhenzhong Duan
2025-06-20 7:18 ` [PATCH v2 07/19] intel_iommu: Check for compatibility with IOMMUFD backed device when x-flts=on Zhenzhong Duan
` (12 subsequent siblings)
18 siblings, 0 replies; 58+ messages in thread
From: Zhenzhong Duan @ 2025-06-20 7:18 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan, Marcel Apfelbaum, Paolo Bonzini,
Richard Henderson, Eduardo Habkost
Introduce a new structure, VTDHostIOMMUDevice, which replaces
HostIOMMUDevice as the element stored in the hash table.
It includes a reference to the HostIOMMUDevice and the IntelIOMMUState,
and also includes BDF information which will be used in future patches.
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
---
hw/i386/intel_iommu_internal.h | 7 +++++++
include/hw/i386/intel_iommu.h | 2 +-
hw/i386/intel_iommu.c | 15 +++++++++++++--
3 files changed, 21 insertions(+), 3 deletions(-)
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index e8b211e8b0..7aba259ef8 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -28,6 +28,7 @@
#ifndef HW_I386_INTEL_IOMMU_INTERNAL_H
#define HW_I386_INTEL_IOMMU_INTERNAL_H
#include "hw/i386/intel_iommu.h"
+#include "system/host_iommu_device.h"
/*
* Intel IOMMU register specification
@@ -607,4 +608,10 @@ typedef struct VTDRootEntry VTDRootEntry;
/* Bits to decide the offset for each level */
#define VTD_LEVEL_BITS 9
+typedef struct VTDHostIOMMUDevice {
+ IntelIOMMUState *iommu_state;
+ PCIBus *bus;
+ uint8_t devfn;
+ HostIOMMUDevice *hiod;
+} VTDHostIOMMUDevice;
#endif
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index e95477e855..50f9b27a45 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -295,7 +295,7 @@ struct IntelIOMMUState {
/* list of registered notifiers */
QLIST_HEAD(, VTDAddressSpace) vtd_as_with_notifiers;
- GHashTable *vtd_host_iommu_dev; /* HostIOMMUDevice */
+ GHashTable *vtd_host_iommu_dev; /* VTDHostIOMMUDevice */
/* interrupt remapping */
bool intr_enabled; /* Whether guest enabled IR */
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 702973da5c..e90fd2f28f 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -281,7 +281,10 @@ static gboolean vtd_hiod_equal(gconstpointer v1, gconstpointer v2)
static void vtd_hiod_destroy(gpointer v)
{
- object_unref(v);
+ VTDHostIOMMUDevice *vtd_hiod = v;
+
+ object_unref(vtd_hiod->hiod);
+ g_free(vtd_hiod);
}
static gboolean vtd_hash_remove_by_domain(gpointer key, gpointer value,
@@ -4360,6 +4363,7 @@ static bool vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
HostIOMMUDevice *hiod, Error **errp)
{
IntelIOMMUState *s = opaque;
+ VTDHostIOMMUDevice *vtd_hiod;
struct vtd_as_key key = {
.bus = bus,
.devfn = devfn,
@@ -4376,7 +4380,14 @@ static bool vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
return false;
}
+ vtd_hiod = g_malloc0(sizeof(VTDHostIOMMUDevice));
+ vtd_hiod->bus = bus;
+ vtd_hiod->devfn = (uint8_t)devfn;
+ vtd_hiod->iommu_state = s;
+ vtd_hiod->hiod = hiod;
+
if (!vtd_check_hiod(s, hiod, errp)) {
+ g_free(vtd_hiod);
vtd_iommu_unlock(s);
return false;
}
@@ -4386,7 +4397,7 @@ static bool vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
new_key->devfn = devfn;
object_ref(hiod);
- g_hash_table_insert(s->vtd_host_iommu_dev, new_key, hiod);
+ g_hash_table_insert(s->vtd_host_iommu_dev, new_key, vtd_hiod);
vtd_iommu_unlock(s);
--
2.34.1
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [PATCH v2 07/19] intel_iommu: Check for compatibility with IOMMUFD backed device when x-flts=on
2025-06-20 7:17 [PATCH v2 00/19] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
` (5 preceding siblings ...)
2025-06-20 7:18 ` [PATCH v2 06/19] intel_iommu: Introduce a new structure VTDHostIOMMUDevice Zhenzhong Duan
@ 2025-06-20 7:18 ` Zhenzhong Duan
2025-06-20 12:05 ` Eric Auger
2025-06-20 7:18 ` [PATCH v2 08/19] intel_iommu: Fail passthrough device under PCI bridge if x-flts=on Zhenzhong Duan
` (11 subsequent siblings)
18 siblings, 1 reply; 58+ messages in thread
From: Zhenzhong Duan @ 2025-06-20 7:18 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan, Paolo Bonzini, Richard Henderson, Eduardo Habkost,
Marcel Apfelbaum
When the vIOMMU is configured with x-flts=on in scalable mode, the stage-1
page table is passed to the host to construct a nested page table. We need
to check the compatibility of some critical IOMMU capabilities between the
vIOMMU and the host IOMMU to ensure the guest stage-1 page table can be
used by the host.
For instance, if the vIOMMU supports stage-1 1GB huge page mapping but the
host does not, this IOMMUFD backed device should fail.
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
hw/i386/intel_iommu_internal.h | 1 +
hw/i386/intel_iommu.c | 28 ++++++++++++++++++++++++++++
2 files changed, 29 insertions(+)
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 7aba259ef8..18bc22fc72 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -192,6 +192,7 @@
#define VTD_ECAP_PT (1ULL << 6)
#define VTD_ECAP_SC (1ULL << 7)
#define VTD_ECAP_MHMV (15ULL << 20)
+#define VTD_ECAP_NEST (1ULL << 26)
#define VTD_ECAP_SRS (1ULL << 31)
#define VTD_ECAP_PASID (1ULL << 40)
#define VTD_ECAP_SMTS (1ULL << 43)
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index e90fd2f28f..1c79efc1cb 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -40,6 +40,7 @@
#include "kvm/kvm_i386.h"
#include "migration/vmstate.h"
#include "trace.h"
+#include "system/iommufd.h"
/* context entry operations */
#define VTD_CE_GET_RID2PASID(ce) \
@@ -4355,6 +4356,33 @@ static bool vtd_check_hiod(IntelIOMMUState *s, HostIOMMUDevice *hiod,
return true;
}
+#ifdef CONFIG_IOMMUFD
+ struct HostIOMMUDeviceCaps *caps = &hiod->caps;
+ struct iommu_hw_info_vtd *vtd = &caps->vendor_caps.vtd;
+
+ /* Remaining checks are all stage-1 translation specific */
+ if (!object_dynamic_cast(OBJECT(hiod), TYPE_HOST_IOMMU_DEVICE_IOMMUFD)) {
+ error_setg(errp, "Need IOMMUFD backend when x-flts=on");
+ return false;
+ }
+
+ if (caps->type != IOMMU_HW_INFO_TYPE_INTEL_VTD) {
+ error_setg(errp, "Incompatible host platform IOMMU type %d",
+ caps->type);
+ return false;
+ }
+
+ if (!(vtd->ecap_reg & VTD_ECAP_NEST)) {
+ error_setg(errp, "Host IOMMU doesn't support nested translation");
+ return false;
+ }
+
+ if (s->fs1gp && !(vtd->cap_reg & VTD_CAP_FS1GP)) {
+ error_setg(errp, "Stage-1 1GB huge page is unsupported by host IOMMU");
+ return false;
+ }
+#endif
+
error_setg(errp, "host device is uncompatible with stage-1 translation");
return false;
}
--
2.34.1
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [PATCH v2 08/19] intel_iommu: Fail passthrough device under PCI bridge if x-flts=on
2025-06-20 7:17 [PATCH v2 00/19] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
` (6 preceding siblings ...)
2025-06-20 7:18 ` [PATCH v2 07/19] intel_iommu: Check for compatibility with IOMMUFD backed device when x-flts=on Zhenzhong Duan
@ 2025-06-20 7:18 ` Zhenzhong Duan
2025-06-20 12:18 ` Eric Auger
2025-06-20 7:18 ` [PATCH v2 09/19] intel_iommu: Introduce two helpers vtd_as_from/to_iommu_pasid_locked Zhenzhong Duan
` (10 subsequent siblings)
18 siblings, 1 reply; 58+ messages in thread
From: Zhenzhong Duan @ 2025-06-20 7:18 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan, Marcel Apfelbaum, Paolo Bonzini,
Richard Henderson, Eduardo Habkost
Currently we don't support nested translation for a passthrough device
that shares a PCI bridge with an emulated device.
The reason is that for emulated devices the AS should switch to the
iommu MR, while for passthrough devices the AS needs to stick with the
system MR so that the VFIO container IOAS can remain a GPA IOAS. To
support both, we would need to switch the AS to the iommu MR while
keeping a separate GPA IOAS, but that brings in a new memory listener
which duplicates the VFIO memory listener. As a trade-off, we choose
not to support this special scenario, since PCIe bridges are far more
common than PCI bridges nowadays.
Suggested-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
hw/i386/intel_iommu.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 1c79efc1cb..9d4adc9458 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -4330,9 +4330,10 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus,
return vtd_dev_as;
}
-static bool vtd_check_hiod(IntelIOMMUState *s, HostIOMMUDevice *hiod,
+static bool vtd_check_hiod(IntelIOMMUState *s, VTDHostIOMMUDevice *vtd_hiod,
Error **errp)
{
+ HostIOMMUDevice *hiod = vtd_hiod->hiod;
HostIOMMUDeviceClass *hiodc = HOST_IOMMU_DEVICE_GET_CLASS(hiod);
int ret;
@@ -4359,6 +4360,8 @@ static bool vtd_check_hiod(IntelIOMMUState *s, HostIOMMUDevice *hiod,
#ifdef CONFIG_IOMMUFD
struct HostIOMMUDeviceCaps *caps = &hiod->caps;
struct iommu_hw_info_vtd *vtd = &caps->vendor_caps.vtd;
+ PCIBus *bus = vtd_hiod->bus;
+ PCIDevice *pdev = pci_find_device(bus, pci_bus_num(bus), vtd_hiod->devfn);
/* Remaining checks are all stage-1 translation specific */
if (!object_dynamic_cast(OBJECT(hiod), TYPE_HOST_IOMMU_DEVICE_IOMMUFD)) {
@@ -4381,6 +4384,12 @@ static bool vtd_check_hiod(IntelIOMMUState *s, HostIOMMUDevice *hiod,
error_setg(errp, "Stage-1 1GB huge page is unsupported by host IOMMU");
return false;
}
+
+ if (pci_device_get_iommu_bus_devfn(pdev, &bus, NULL, NULL)) {
+ error_setg(errp, "Host device under PCI bridge is unsupported "
+ "when x-flts=on");
+ return false;
+ }
#endif
error_setg(errp, "host device is uncompatible with stage-1 translation");
@@ -4414,7 +4423,7 @@ static bool vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
vtd_hiod->iommu_state = s;
vtd_hiod->hiod = hiod;
- if (!vtd_check_hiod(s, hiod, errp)) {
+ if (!vtd_check_hiod(s, vtd_hiod, errp)) {
g_free(vtd_hiod);
vtd_iommu_unlock(s);
return false;
--
2.34.1
* [PATCH v2 09/19] intel_iommu: Introduce two helpers vtd_as_from/to_iommu_pasid_locked
2025-06-20 7:17 [PATCH v2 00/19] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
` (7 preceding siblings ...)
2025-06-20 7:18 ` [PATCH v2 08/19] intel_iommu: Fail passthrough device under PCI bridge if x-flts=on Zhenzhong Duan
@ 2025-06-20 7:18 ` Zhenzhong Duan
2025-06-20 12:46 ` Eric Auger
2025-06-20 7:18 ` [PATCH v2 10/19] intel_iommu: Handle PASID entry removing and updating Zhenzhong Duan
` (9 subsequent siblings)
18 siblings, 1 reply; 58+ messages in thread
From: Zhenzhong Duan @ 2025-06-20 7:18 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan, Paolo Bonzini, Richard Henderson, Eduardo Habkost,
Marcel Apfelbaum
A PCI device supports two request types: Requests-without-PASID and
Requests-with-PASID. A Request-without-PASID doesn't include a PASID
TLP prefix; the IOMMU fetches rid_pasid from the context entry and uses
it as the IOMMU's pasid to index the pasid table.
So we need to translate between PCI's pasid and the IOMMU's pasid,
specifically for Requests-without-PASID, e.g., PCI_NO_PASID(-1) <-> rid_pasid.
For Requests-with-PASID, PCI's pasid and the IOMMU's pasid have the same value.
vtd_as_from_iommu_pasid_locked() translates from BDF+iommu_pasid to the
vtd_as which holds PCI's pasid in vtd_as->pasid.
vtd_as_to_iommu_pasid_locked() translates from BDF+vtd_as->pasid to iommu_pasid.
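For illustration, the two directions can be sketched as a standalone helper
pair (the helper names and the explicit rid_pasid parameter are illustrative,
not the QEMU API; the real helpers fetch rid_pasid from the scalable-mode
context entry):

```c
#include <assert.h>
#include <stdint.h>

#define PCI_NO_PASID ((uint32_t)-1)

/*
 * Requests-without-PASID (PCI_NO_PASID) borrow rid_pasid as the
 * IOMMU pasid; Requests-with-PASID use the same value on both sides.
 */
static uint32_t pci_to_iommu_pasid(uint32_t pci_pasid, uint32_t rid_pasid)
{
    return (pci_pasid == PCI_NO_PASID) ? rid_pasid : pci_pasid;
}

static uint32_t iommu_to_pci_pasid(uint32_t iommu_pasid, uint32_t rid_pasid)
{
    /* The reverse direction maps rid_pasid back to PCI_NO_PASID */
    return (iommu_pasid == rid_pasid) ? PCI_NO_PASID : iommu_pasid;
}
```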
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
hw/i386/intel_iommu.c | 58 +++++++++++++++++++++++++++++++++++++++++++
1 file changed, 58 insertions(+)
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 9d4adc9458..8948b8370f 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -1602,6 +1602,64 @@ static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
return 0;
}
+static inline int vtd_as_to_iommu_pasid_locked(VTDAddressSpace *vtd_as,
+ uint32_t *pasid)
+{
+ VTDContextCacheEntry *cc_entry = &vtd_as->context_cache_entry;
+ IntelIOMMUState *s = vtd_as->iommu_state;
+ uint8_t bus_num = pci_bus_num(vtd_as->bus);
+ uint8_t devfn = vtd_as->devfn;
+ VTDContextEntry ce;
+ int ret;
+
+ if (cc_entry->context_cache_gen == s->context_cache_gen) {
+ ce = cc_entry->context_entry;
+ } else {
+ ret = vtd_dev_to_context_entry(s, bus_num, devfn, &ce);
+ if (ret) {
+ return ret;
+ }
+ }
+
+ /* Translate to iommu pasid if PCI_NO_PASID */
+ if (vtd_as->pasid == PCI_NO_PASID) {
+ *pasid = VTD_CE_GET_RID2PASID(&ce);
+ } else {
+ *pasid = vtd_as->pasid;
+ }
+
+ return 0;
+}
+
+static gboolean vtd_find_as_by_sid_and_iommu_pasid(gpointer key, gpointer value,
+ gpointer user_data)
+{
+ VTDAddressSpace *vtd_as = (VTDAddressSpace *)value;
+ struct vtd_as_raw_key *target = (struct vtd_as_raw_key *)user_data;
+ uint16_t sid = PCI_BUILD_BDF(pci_bus_num(vtd_as->bus), vtd_as->devfn);
+ uint32_t pasid;
+
+ if (vtd_as_to_iommu_pasid_locked(vtd_as, &pasid)) {
+ return false;
+ }
+
+ return (pasid == target->pasid) && (sid == target->sid);
+}
+
+/* Translate iommu pasid to vtd_as */
+static inline
+VTDAddressSpace *vtd_as_from_iommu_pasid_locked(IntelIOMMUState *s,
+ uint16_t sid, uint32_t pasid)
+{
+ struct vtd_as_raw_key key = {
+ .sid = sid,
+ .pasid = pasid
+ };
+
+ return g_hash_table_find(s->vtd_address_spaces,
+ vtd_find_as_by_sid_and_iommu_pasid, &key);
+}
+
static int vtd_sync_shadow_page_hook(const IOMMUTLBEvent *event,
void *private)
{
--
2.34.1
* [PATCH v2 10/19] intel_iommu: Handle PASID entry removing and updating
2025-06-20 7:17 [PATCH v2 00/19] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
` (8 preceding siblings ...)
2025-06-20 7:18 ` [PATCH v2 09/19] intel_iommu: Introduce two helpers vtd_as_from/to_iommu_pasid_locked Zhenzhong Duan
@ 2025-06-20 7:18 ` Zhenzhong Duan
2025-06-20 15:44 ` Eric Auger
2025-06-20 7:18 ` [PATCH v2 11/19] intel_iommu: Handle PASID entry adding Zhenzhong Duan
` (8 subsequent siblings)
18 siblings, 1 reply; 58+ messages in thread
From: Zhenzhong Duan @ 2025-06-20 7:18 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan, Yi Sun, Paolo Bonzini, Richard Henderson,
Eduardo Habkost, Marcel Apfelbaum
This adds a new entry VTDPASIDCacheEntry in VTDAddressSpace to cache
the pasid entry, track PASID usage and support future PASID tagged DMA
address translation in vIOMMU.
The VTDAddressSpace of PCI_NO_PASID is allocated when the device is
plugged and never freed. For other pasids, a VTDAddressSpace instance
is created/destroyed as the guest pasid entry is set up/destroyed for
passthrough devices. For emulated devices, a VTDAddressSpace instance
is created during PASID tagged DMA translation and destroyed on guest
PASID cache invalidation. This patch focuses on PASID cache management
for passthrough devices, as there are no PASID capable emulated devices
yet.
When the guest modifies a PASID entry, QEMU captures the guest pasid
selective pasid cache invalidation and allocates or removes a
VTDAddressSpace instance according to the invalidation reason:
a) a present pasid entry moved to non-present
b) a present pasid entry modified to another present entry
c) a non-present pasid entry moved to present
This patch handles a) and b); the following patch will handle c).
The vIOMMU emulator figures out the reason by fetching the latest guest
pasid entry and comparing it with the cached PASID entry.
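The compare-and-classify step can be sketched standalone (the type and
function names below are simplified stand-ins for the QEMU ones, not the
actual implementation):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

typedef struct { unsigned long long val[8]; } PasidEntry;

typedef enum { PE_UNBIND, PE_REBIND, PE_BIND, PE_NOOP } PeAction;

/*
 * Compare the cached pasid entry with the latest guest entry to pick
 * the action matching invalidation reasons a), b) and c).
 */
static PeAction classify(bool cached, const PasidEntry *old_pe,
                         bool guest_present, const PasidEntry *new_pe)
{
    if (cached && !guest_present) {
        return PE_UNBIND;               /* a): present -> non-present */
    }
    if (cached && guest_present) {
        return memcmp(old_pe, new_pe, sizeof(*old_pe)) ?
               PE_REBIND : PE_NOOP;     /* b): present -> present */
    }
    if (!cached && guest_present) {
        return PE_BIND;                 /* c): non-present -> present */
    }
    return PE_NOOP;
}
```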
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
hw/i386/intel_iommu_internal.h | 27 ++++
include/hw/i386/intel_iommu.h | 6 +
hw/i386/intel_iommu.c | 265 +++++++++++++++++++++++++++++++--
hw/i386/trace-events | 3 +
4 files changed, 291 insertions(+), 10 deletions(-)
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 18bc22fc72..01c881ed4d 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -315,6 +315,7 @@ typedef enum VTDFaultReason {
* request while disabled */
VTD_FR_IR_SID_ERR = 0x26, /* Invalid Source-ID */
+ VTD_FR_RTADDR_INV_TTM = 0x31, /* Invalid TTM in RTADDR */
/* PASID directory entry access failure */
VTD_FR_PASID_DIR_ACCESS_ERR = 0x50,
/* The Present(P) field of pasid directory entry is 0 */
@@ -492,6 +493,15 @@ typedef union VTDInvDesc VTDInvDesc;
#define VTD_INV_DESC_PIOTLB_RSVD_VAL0 0xfff000000000f1c0ULL
#define VTD_INV_DESC_PIOTLB_RSVD_VAL1 0xf80ULL
+#define VTD_INV_DESC_PASIDC_G (3ULL << 4)
+#define VTD_INV_DESC_PASIDC_PASID(val) (((val) >> 32) & 0xfffffULL)
+#define VTD_INV_DESC_PASIDC_DID(val) (((val) >> 16) & VTD_DOMAIN_ID_MASK)
+#define VTD_INV_DESC_PASIDC_RSVD_VAL0 0xfff000000000f1c0ULL
+
+#define VTD_INV_DESC_PASIDC_DSI (0ULL << 4)
+#define VTD_INV_DESC_PASIDC_PASID_SI (1ULL << 4)
+#define VTD_INV_DESC_PASIDC_GLOBAL (3ULL << 4)
+
/* Information about page-selective IOTLB invalidate */
struct VTDIOTLBPageInvInfo {
uint16_t domain_id;
@@ -552,6 +562,22 @@ typedef struct VTDRootEntry VTDRootEntry;
#define VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw) (0x1e0ULL | ~VTD_HAW_MASK(aw))
#define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1 0xffffffffffe00000ULL
+typedef enum VTDPCInvType {
+ /* pasid cache invalidation relies on the guest PASID entry */
+ VTD_PASID_CACHE_GLOBAL_INV, /* pasid cache global invalidation */
+ VTD_PASID_CACHE_DOMSI, /* pasid cache domain selective invalidation */
+ VTD_PASID_CACHE_PASIDSI, /* pasid cache pasid selective invalidation */
+} VTDPCInvType;
+
+typedef struct VTDPASIDCacheInfo {
+ VTDPCInvType type;
+ uint16_t domain_id;
+ uint32_t pasid;
+ PCIBus *bus;
+ uint16_t devfn;
+ bool error_happened;
+} VTDPASIDCacheInfo;
+
/* PASID Table Related Definitions */
#define VTD_PASID_DIR_BASE_ADDR_MASK (~0xfffULL)
#define VTD_PASID_TABLE_BASE_ADDR_MASK (~0xfffULL)
@@ -563,6 +589,7 @@ typedef struct VTDRootEntry VTDRootEntry;
#define VTD_PASID_TABLE_BITS_MASK (0x3fULL)
#define VTD_PASID_TABLE_INDEX(pasid) ((pasid) & VTD_PASID_TABLE_BITS_MASK)
#define VTD_PASID_ENTRY_FPD (1ULL << 1) /* Fault Processing Disable */
+#define VTD_PASID_TBL_ENTRY_NUM (1ULL << 6)
/* PASID Granular Translation Type Mask */
#define VTD_PASID_ENTRY_P 1ULL
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 50f9b27a45..fbc9da903a 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -95,6 +95,11 @@ struct VTDPASIDEntry {
uint64_t val[8];
};
+typedef struct VTDPASIDCacheEntry {
+ struct VTDPASIDEntry pasid_entry;
+ bool cache_filled;
+} VTDPASIDCacheEntry;
+
struct VTDAddressSpace {
PCIBus *bus;
uint8_t devfn;
@@ -107,6 +112,7 @@ struct VTDAddressSpace {
MemoryRegion iommu_ir_fault; /* Interrupt region for catching fault */
IntelIOMMUState *iommu_state;
VTDContextCacheEntry context_cache_entry;
+ VTDPASIDCacheEntry pasid_cache_entry;
QLIST_ENTRY(VTDAddressSpace) next;
/* Superset of notifier flags that this address space has */
IOMMUNotifierFlag notifier_flags;
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 8948b8370f..1db581d14a 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -826,11 +826,24 @@ static inline bool vtd_pe_type_check(IntelIOMMUState *s, VTDPASIDEntry *pe)
}
}
+static inline uint16_t vtd_pe_get_did(VTDPASIDEntry *pe)
+{
+ return VTD_SM_PASID_ENTRY_DID((pe)->val[1]);
+}
+
static inline bool vtd_pdire_present(VTDPASIDDirEntry *pdire)
{
return pdire->val & 1;
}
+static inline void pasid_cache_info_set_error(VTDPASIDCacheInfo *pc_info)
+{
+ if (pc_info->error_happened) {
+ return;
+ }
+ pc_info->error_happened = true;
+}
+
/**
* Caller of this function should check present bit if wants
* to use pdir entry for further usage except for fpd bit check.
@@ -3103,6 +3116,241 @@ static bool vtd_process_piotlb_desc(IntelIOMMUState *s,
return true;
}
+static inline int vtd_dev_get_pe_from_pasid(VTDAddressSpace *vtd_as,
+ uint32_t pasid, VTDPASIDEntry *pe)
+{
+ IntelIOMMUState *s = vtd_as->iommu_state;
+ VTDContextEntry ce;
+ int ret;
+
+ if (!s->root_scalable) {
+ return -VTD_FR_RTADDR_INV_TTM;
+ }
+
+ ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus), vtd_as->devfn,
+ &ce);
+ if (ret) {
+ return ret;
+ }
+
+ return vtd_ce_get_pasid_entry(s, &ce, pe, pasid);
+}
+
+static bool vtd_pasid_entry_compare(VTDPASIDEntry *p1, VTDPASIDEntry *p2)
+{
+ return !memcmp(p1, p2, sizeof(*p1));
+}
+
+/*
+ * This function fills in the pasid entry in &vtd_as. Caller
+ * of this function should hold iommu_lock.
+ */
+static int vtd_fill_pe_in_cache(IntelIOMMUState *s, VTDAddressSpace *vtd_as,
+ VTDPASIDEntry *pe)
+{
+ VTDPASIDCacheEntry *pc_entry = &vtd_as->pasid_cache_entry;
+
+ if (vtd_pasid_entry_compare(pe, &pc_entry->pasid_entry)) {
+ /* No need to go further as cached pasid entry is latest */
+ return 0;
+ }
+
+ pc_entry->pasid_entry = *pe;
+ pc_entry->cache_filled = true;
+
+ /*
+ * TODO: send pasid bind to host for passthru devices
+ */
+
+ return 0;
+}
+
+/*
+ * This function is used to update or clear cached pasid entry in vtd_as
+ * instances. Caller of this function should hold iommu_lock.
+ */
+static gboolean vtd_flush_pasid(gpointer key, gpointer value,
+ gpointer user_data)
+{
+ VTDPASIDCacheInfo *pc_info = user_data;
+ VTDAddressSpace *vtd_as = value;
+ IntelIOMMUState *s = vtd_as->iommu_state;
+ VTDPASIDCacheEntry *pc_entry = &vtd_as->pasid_cache_entry;
+ VTDPASIDEntry pe;
+ uint16_t did;
+ uint32_t pasid;
+ int ret;
+
+ if (!pc_entry->cache_filled) {
+ return false;
+ }
+ did = vtd_pe_get_did(&pc_entry->pasid_entry);
+
+ if (vtd_as_to_iommu_pasid_locked(vtd_as, &pasid)) {
+ goto remove;
+ }
+
+ switch (pc_info->type) {
+ case VTD_PASID_CACHE_PASIDSI:
+ if (pc_info->pasid != pasid) {
+ return false;
+ }
+ /* Fall through */
+ case VTD_PASID_CACHE_DOMSI:
+ if (pc_info->domain_id != did) {
+ return false;
+ }
+ /* Fall through */
+ case VTD_PASID_CACHE_GLOBAL_INV:
+ break;
+ default:
+ error_report("invalid pc_info->type");
+ abort();
+ }
+
+ /*
+ * A pasid cache invalidation may indicate a present-to-present pasid
+ * entry modification. To cover such a case, the vIOMMU emulator needs
+ * to fetch the latest guest pasid entry, check it against the cached
+ * pasid entry, then update the pasid cache and send a pasid
+ * bind/unbind to the host as appropriate.
+ */
+ ret = vtd_dev_get_pe_from_pasid(vtd_as, pasid, &pe);
+ if (ret) {
+ /*
+ * No valid pasid entry in guest memory. e.g. pasid entry
+ * was modified to be either all-zero or non-present. Either
+ * case means existing pasid cache should be removed.
+ */
+ goto remove;
+ }
+
+ if (vtd_fill_pe_in_cache(s, vtd_as, &pe)) {
+ pasid_cache_info_set_error(pc_info);
+ }
+ return false;
+
+remove:
+ /*
+ * TODO: send pasid unbind to host for passthru devices
+ */
+ pc_entry->cache_filled = false;
+
+ /*
+ * Don't remove address space of PCI_NO_PASID which is created by PCI
+ * sub-system.
+ */
+ if (vtd_as->pasid == PCI_NO_PASID) {
+ return false;
+ }
+ return true;
+}
+
+/*
+ * This function syncs the pasid bindings between guest and host.
+ * It includes updating the pasid cache in vIOMMU and updating the
+ * pasid bindings per guest's latest pasid entry presence.
+ */
+static void vtd_pasid_cache_sync(IntelIOMMUState *s,
+ VTDPASIDCacheInfo *pc_info)
+{
+ if (!s->flts || !s->root_scalable || !s->dmar_enabled) {
+ return;
+ }
+
+ /*
+ * Regarding a pasid cache invalidation, e.g. a PSI,
+ * it could be any of the cases below:
+ * a) a present pasid entry moved to non-present
+ * b) a present pasid entry modified to another present entry
+ * c) a non-present pasid entry moved to present
+ *
+ * Different invalidation granularities may affect different device
+ * and pasid scopes, but for each granularity two steps are needed
+ * to sync the host and guest pasid bindings.
+ *
+ * Here is the handling of a PSI:
+ * 1) loop over all the existing vtd_as instances and update them
+ * according to the latest guest pasid entry in the pasid table.
+ * This makes sure affected existing vtd_as instances cache the
+ * latest pasid entries. Also, during the loop, the host should
+ * be notified if needed, e.g. pasid unbind or pasid update.
+ * This covers case a) and case b).
+ *
+ * 2) loop over all devices to cover case c)
+ * - For devices which are backed by HostIOMMUDeviceIOMMUFD
+ * instances, loop over them and check if a guest pasid entry
+ * exists. If yes, it is case c): update the pasid cache and
+ * also notify the host.
+ * - For devices which are not backed by HostIOMMUDeviceIOMMUFD,
+ * it is not necessary to create the pasid cache at this phase
+ * since it can be created when the vIOMMU does DMA address
+ * translation. This is not yet implemented since there are no
+ * emulated pasid-capable devices today. If we have such devices
+ * in the future, the pasid cache shall be created there.
+ * Other granularities follow the same steps, just with different
+ * scopes.
+ */
+
+ vtd_iommu_lock(s);
+ /*
+ * Step 1: loop all the existing vtd_as instances for pasid unbind and
+ * update.
+ */
+ g_hash_table_foreach_remove(s->vtd_address_spaces, vtd_flush_pasid,
+ pc_info);
+ vtd_iommu_unlock(s);
+
+ /* TODO: Step 2: loop all the existing vtd_hiod instances for pasid bind. */
+}
+
+static bool vtd_process_pasid_desc(IntelIOMMUState *s,
+ VTDInvDesc *inv_desc)
+{
+ uint16_t domain_id;
+ uint32_t pasid;
+ VTDPASIDCacheInfo pc_info = {};
+ uint64_t mask[4] = {VTD_INV_DESC_PASIDC_RSVD_VAL0, VTD_INV_DESC_ALL_ONE,
+ VTD_INV_DESC_ALL_ONE, VTD_INV_DESC_ALL_ONE};
+
+ if (!vtd_inv_desc_reserved_check(s, inv_desc, mask, true,
+ __func__, "pasid cache inv")) {
+ return false;
+ }
+
+ domain_id = VTD_INV_DESC_PASIDC_DID(inv_desc->val[0]);
+ pasid = VTD_INV_DESC_PASIDC_PASID(inv_desc->val[0]);
+
+ switch (inv_desc->val[0] & VTD_INV_DESC_PASIDC_G) {
+ case VTD_INV_DESC_PASIDC_DSI:
+ trace_vtd_pasid_cache_dsi(domain_id);
+ pc_info.type = VTD_PASID_CACHE_DOMSI;
+ pc_info.domain_id = domain_id;
+ break;
+
+ case VTD_INV_DESC_PASIDC_PASID_SI:
+ /* PASID selective implies a DID selective */
+ trace_vtd_pasid_cache_psi(domain_id, pasid);
+ pc_info.type = VTD_PASID_CACHE_PASIDSI;
+ pc_info.domain_id = domain_id;
+ pc_info.pasid = pasid;
+ break;
+
+ case VTD_INV_DESC_PASIDC_GLOBAL:
+ trace_vtd_pasid_cache_gsi();
+ pc_info.type = VTD_PASID_CACHE_GLOBAL_INV;
+ break;
+
+ default:
+ error_report_once("invalid-inv-granu-in-pc_inv_desc hi: 0x%" PRIx64
+ " lo: 0x%" PRIx64, inv_desc->val[1], inv_desc->val[0]);
+ return false;
+ }
+
+ vtd_pasid_cache_sync(s, &pc_info);
+ return !pc_info.error_happened;
+}
+
static bool vtd_process_inv_iec_desc(IntelIOMMUState *s,
VTDInvDesc *inv_desc)
{
@@ -3264,6 +3512,13 @@ static bool vtd_process_inv_desc(IntelIOMMUState *s)
}
break;
+ case VTD_INV_DESC_PC:
+ trace_vtd_inv_desc("pasid-cache", inv_desc.val[1], inv_desc.val[0]);
+ if (!vtd_process_pasid_desc(s, &inv_desc)) {
+ return false;
+ }
+ break;
+
case VTD_INV_DESC_PIOTLB:
trace_vtd_inv_desc("p-iotlb", inv_desc.val[1], inv_desc.val[0]);
if (!vtd_process_piotlb_desc(s, &inv_desc)) {
@@ -3299,16 +3554,6 @@ static bool vtd_process_inv_desc(IntelIOMMUState *s)
}
break;
- /*
- * TODO: the entity of below two cases will be implemented in future series.
- * To make guest (which integrates scalable mode support patch set in
- * iommu driver) work, just return true is enough so far.
- */
- case VTD_INV_DESC_PC:
- if (s->scalable_mode) {
- break;
- }
- /* fallthrough */
default:
error_report_once("%s: invalid inv desc: hi=%"PRIx64", lo=%"PRIx64
" (unknown type)", __func__, inv_desc.hi,
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index ac9e1a10aa..ae5bbfcdc0 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -24,6 +24,9 @@ vtd_inv_qi_head(uint16_t head) "read head %d"
vtd_inv_qi_tail(uint16_t head) "write tail %d"
vtd_inv_qi_fetch(void) ""
vtd_context_cache_reset(void) ""
+vtd_pasid_cache_gsi(void) ""
+vtd_pasid_cache_dsi(uint16_t domain) "Domain selective PC invalidation domain 0x%"PRIx16
+vtd_pasid_cache_psi(uint16_t domain, uint32_t pasid) "PASID selective PC invalidation domain 0x%"PRIx16" pasid 0x%"PRIx32
vtd_re_not_present(uint8_t bus) "Root entry bus %"PRIu8" not present"
vtd_ce_not_present(uint8_t bus, uint8_t devfn) "Context entry bus %"PRIu8" devfn %"PRIu8" not present"
vtd_iotlb_page_hit(uint16_t sid, uint64_t addr, uint64_t slpte, uint16_t domain) "IOTLB page hit sid 0x%"PRIx16" iova 0x%"PRIx64" slpte 0x%"PRIx64" domain 0x%"PRIx16
--
2.34.1
* [PATCH v2 11/19] intel_iommu: Handle PASID entry adding
2025-06-20 7:17 [PATCH v2 00/19] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
` (9 preceding siblings ...)
2025-06-20 7:18 ` [PATCH v2 10/19] intel_iommu: Handle PASID entry removing and updating Zhenzhong Duan
@ 2025-06-20 7:18 ` Zhenzhong Duan
2025-06-23 11:47 ` Eric Auger
2025-06-20 7:18 ` [PATCH v2 12/19] intel_iommu: Introduce a new pasid cache invalidation type FORCE_RESET Zhenzhong Duan
` (7 subsequent siblings)
18 siblings, 1 reply; 58+ messages in thread
From: Zhenzhong Duan @ 2025-06-20 7:18 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan, Yi Sun, Paolo Bonzini, Richard Henderson,
Eduardo Habkost, Marcel Apfelbaum
When the guest modifies a PASID entry, QEMU captures the guest pasid
selective pasid cache invalidation and allocates or removes a
VTDAddressSpace instance according to the invalidation reason:
a) a present pasid entry moved to non-present
b) a present pasid entry modified to another present entry
c) a non-present pasid entry moved to present
This patch handles c).
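The walk below bounds the scan with sizing derived from the context entry's
PDTS field. That arithmetic can be checked standalone; the two macros are
taken from this series, the helper names are illustrative:

```c
#include <assert.h>
#include <stdint.h>

#define VTD_SM_CONTEXT_ENTRY_PDTS(val)  (((val) >> 9) & 0x7)
#define VTD_PASID_TBL_ENTRY_NUM         (1ULL << 6)  /* 64 entries per leaf table */

/* Number of PASID directory entries encoded by the PDTS field */
static uint32_t pdt_entry_num(uint64_t ce_val0)
{
    return 1U << (VTD_SM_CONTEXT_ENTRY_PDTS(ce_val0) + 7);
}

/* Largest pasid (exclusive) reachable through this context entry */
static uint32_t max_pasid(uint64_t ce_val0)
{
    return pdt_entry_num(ce_val0) * VTD_PASID_TBL_ENTRY_NUM;
}
```

With PDTS at its maximum value of 7 this covers the full 2^20 PASID space.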
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
hw/i386/intel_iommu_internal.h | 1 +
hw/i386/intel_iommu.c | 169 ++++++++++++++++++++++++++++++++-
2 files changed, 169 insertions(+), 1 deletion(-)
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 01c881ed4d..025787b3b9 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -558,6 +558,7 @@ typedef struct VTDRootEntry VTDRootEntry;
#define VTD_CTX_ENTRY_LEGACY_SIZE 16
#define VTD_CTX_ENTRY_SCALABLE_SIZE 32
+#define VTD_SM_CONTEXT_ENTRY_PDTS(val) (((val) >> 9) & 0x7)
#define VTD_SM_CONTEXT_ENTRY_RID2PASID_MASK 0xfffff
#define VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw) (0x1e0ULL | ~VTD_HAW_MASK(aw))
#define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1 0xffffffffffe00000ULL
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 1db581d14a..f4273dc640 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -826,6 +826,11 @@ static inline bool vtd_pe_type_check(IntelIOMMUState *s, VTDPASIDEntry *pe)
}
}
+static inline uint32_t vtd_sm_ce_get_pdt_entry_num(VTDContextEntry *ce)
+{
+ return 1U << (VTD_SM_CONTEXT_ENTRY_PDTS(ce->val[0]) + 7);
+}
+
static inline uint16_t vtd_pe_get_did(VTDPASIDEntry *pe)
{
return VTD_SM_PASID_ENTRY_DID((pe)->val[1]);
@@ -3246,6 +3251,159 @@ remove:
return true;
}
+static void vtd_sm_pasid_table_walk_one(IntelIOMMUState *s,
+ dma_addr_t pt_base,
+ int start,
+ int end,
+ VTDPASIDCacheInfo *info)
+{
+ VTDPASIDEntry pe;
+ int pasid = start;
+ int pasid_next;
+
+ while (pasid < end) {
+ pasid_next = pasid + 1;
+
+ if (!vtd_get_pe_in_pasid_leaf_table(s, pasid, pt_base, &pe)
+ && vtd_pe_present(&pe)) {
+ int bus_n = pci_bus_num(info->bus), devfn = info->devfn;
+ uint16_t sid = PCI_BUILD_BDF(bus_n, devfn);
+ VTDAddressSpace *vtd_as;
+
+ vtd_iommu_lock(s);
+ /*
+ * When indexed by rid2pasid, the vtd_as should already have been
+ * created, e.g., by the PCI subsystem. For other iommu pasids, we
+ * need to create the vtd_as dynamically. Such an iommu pasid is the
+ * same as PCI's pasid, so it is used as input to vtd_find_add_as().
+ */
+ vtd_as = vtd_as_from_iommu_pasid_locked(s, sid, pasid);
+ vtd_iommu_unlock(s);
+ if (!vtd_as) {
+ vtd_as = vtd_find_add_as(s, info->bus, devfn, pasid);
+ }
+
+ if ((info->type == VTD_PASID_CACHE_DOMSI ||
+ info->type == VTD_PASID_CACHE_PASIDSI) &&
+ !(info->domain_id == vtd_pe_get_did(&pe))) {
+ /*
+ * VTD_PASID_CACHE_DOMSI and VTD_PASID_CACHE_PASIDSI
+ * requires domain ID check. If domain Id check fail,
+ * go to next pasid.
+ */
+ pasid = pasid_next;
+ continue;
+ }
+ if (vtd_fill_pe_in_cache(s, vtd_as, &pe)) {
+ pasid_cache_info_set_error(info);
+ }
+ }
+ pasid = pasid_next;
+ }
+}
+
+/*
+ * Currently, VT-d scalable mode pasid table is a two level table,
+ * this function aims to loop a range of PASIDs in a given pasid
+ * table to identify the pasid config in guest.
+ */
+static void vtd_sm_pasid_table_walk(IntelIOMMUState *s,
+ dma_addr_t pdt_base,
+ int start,
+ int end,
+ VTDPASIDCacheInfo *info)
+{
+ VTDPASIDDirEntry pdire;
+ int pasid = start;
+ int pasid_next;
+ dma_addr_t pt_base;
+
+ while (pasid < end) {
+ pasid_next = ((end - pasid) > VTD_PASID_TBL_ENTRY_NUM) ?
+ (pasid + VTD_PASID_TBL_ENTRY_NUM) : end;
+ if (!vtd_get_pdire_from_pdir_table(pdt_base, pasid, &pdire)
+ && vtd_pdire_present(&pdire)) {
+ pt_base = pdire.val & VTD_PASID_TABLE_BASE_ADDR_MASK;
+ vtd_sm_pasid_table_walk_one(s, pt_base, pasid, pasid_next, info);
+ }
+ pasid = pasid_next;
+ }
+}
+
+static void vtd_replay_pasid_bind_for_dev(IntelIOMMUState *s,
+ int start, int end,
+ VTDPASIDCacheInfo *info)
+{
+ VTDContextEntry ce;
+
+ if (!vtd_dev_to_context_entry(s, pci_bus_num(info->bus), info->devfn,
+ &ce)) {
+ uint32_t max_pasid;
+
+ max_pasid = vtd_sm_ce_get_pdt_entry_num(&ce) * VTD_PASID_TBL_ENTRY_NUM;
+ if (end > max_pasid) {
+ end = max_pasid;
+ }
+ vtd_sm_pasid_table_walk(s,
+ VTD_CE_GET_PASID_DIR_TABLE(&ce),
+ start,
+ end,
+ info);
+ }
+}
+
+/*
+ * This function replays the guest pasid bindings to the host by
+ * walking the guest PASID table. This ensures the host has the
+ * latest guest pasid bindings.
+ */
+static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
+ VTDPASIDCacheInfo *pc_info)
+{
+ VTDHostIOMMUDevice *vtd_hiod;
+ int start = 0, end = 1; /* only rid2pasid is supported */
+ VTDPASIDCacheInfo walk_info;
+ GHashTableIter as_it;
+
+ switch (pc_info->type) {
+ case VTD_PASID_CACHE_PASIDSI:
+ start = pc_info->pasid;
+ end = pc_info->pasid + 1;
+ /*
+ * PASID selective invalidation is within domain,
+ * thus fall through.
+ */
+ case VTD_PASID_CACHE_DOMSI:
+ case VTD_PASID_CACHE_GLOBAL_INV:
+ /* loop all assigned devices */
+ break;
+ default:
+ error_report("invalid pc_info->type for replay");
+ abort();
+ }
+
+ /*
+ * In this replay, we only need to care about the devices which
+ * are backed by a host IOMMU. For such devices, their vtd_hiod
+ * instances are in s->vtd_host_iommu_dev. For devices which are
+ * not backed by a host IOMMU, it is not necessary to replay the
+ * bindings since their cache can be re-created during a future
+ * DMA address translation. Access to vtd_host_iommu_dev is already
+ * protected by the BQL, so no iommu lock is needed here.
+ */
+ walk_info = *pc_info;
+ g_hash_table_iter_init(&as_it, s->vtd_host_iommu_dev);
+ while (g_hash_table_iter_next(&as_it, NULL, (void **)&vtd_hiod)) {
+ /* bus|devfn fields are not identical with pc_info */
+ walk_info.bus = vtd_hiod->bus;
+ walk_info.devfn = vtd_hiod->devfn;
+ vtd_replay_pasid_bind_for_dev(s, start, end, &walk_info);
+ }
+ if (walk_info.error_happened) {
+ pasid_cache_info_set_error(pc_info);
+ }
+}
+
/*
* This function syncs the pasid bindings between guest and host.
* It includes updating the pasid cache in vIOMMU and updating the
@@ -3301,7 +3459,16 @@ static void vtd_pasid_cache_sync(IntelIOMMUState *s,
pc_info);
vtd_iommu_unlock(s);
- /* TODO: Step 2: loop all the existing vtd_hiod instances for pasid bind. */
+ /*
+ * Step 2: loop over all the existing vtd_hiod instances for pasid
+ * bind. Ideally, we would need to loop over all devices to find any
+ * new PASID bindings related to the PASID cache invalidation
+ * request. But it is enough to loop over the devices which are
+ * backed by a host IOMMU. For devices backed by the vIOMMU (a.k.a.
+ * emulated devices), if a new PASID appears on them, their vtd_as
+ * instance can be created during a future vIOMMU DMA translation.
+ */
+ vtd_replay_guest_pasid_bindings(s, pc_info);
}
static bool vtd_process_pasid_desc(IntelIOMMUState *s,
--
2.34.1
* [PATCH v2 12/19] intel_iommu: Introduce a new pasid cache invalidation type FORCE_RESET
2025-06-20 7:17 [PATCH v2 00/19] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
` (10 preceding siblings ...)
2025-06-20 7:18 ` [PATCH v2 11/19] intel_iommu: Handle PASID entry adding Zhenzhong Duan
@ 2025-06-20 7:18 ` Zhenzhong Duan
2025-06-23 11:55 ` Eric Auger
2025-06-20 7:18 ` [PATCH v2 13/19] intel_iommu: Stick to system MR for IOMMUFD backed host device when x-fls=on Zhenzhong Duan
` (6 subsequent siblings)
18 siblings, 1 reply; 58+ messages in thread
From: Zhenzhong Duan @ 2025-06-20 7:18 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan, Yi Sun, Paolo Bonzini, Richard Henderson,
Eduardo Habkost, Marcel Apfelbaum
FORCE_RESET differs from GLOBAL_INV: GLOBAL_INV updates the pasid
cache when the underlying pasid entry is still valid, whereas
FORCE_RESET drops all pasid caches unconditionally.
FORCE_RESET isn't a VT-d spec defined invalidation type for the pasid
cache; it is only used internally during system level reset.
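The distinction boils down to one predicate, sketched here standalone (the
names are illustrative, not the QEMU ones):

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { INV_FORCE_RESET, INV_GLOBAL } InvType;

/*
 * Decide whether a cached pasid entry survives an invalidation:
 * GLOBAL_INV keeps it when the guest pasid entry is still valid,
 * FORCE_RESET drops it unconditionally.
 */
static bool keep_cached_entry(InvType t, bool guest_entry_valid)
{
    return t == INV_GLOBAL && guest_entry_valid;
}
```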
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
hw/i386/intel_iommu_internal.h | 2 ++
hw/i386/intel_iommu.c | 28 ++++++++++++++++++++++++++++
hw/i386/trace-events | 1 +
3 files changed, 31 insertions(+)
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 025787b3b9..5ed76864be 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -564,6 +564,8 @@ typedef struct VTDRootEntry VTDRootEntry;
#define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1 0xffffffffffe00000ULL
typedef enum VTDPCInvType {
+ /* Force reset all */
+ VTD_PASID_CACHE_FORCE_RESET = 0,
/* pasid cache invalidation rely on guest PASID entry */
VTD_PASID_CACHE_GLOBAL_INV, /* pasid cache global invalidation */
VTD_PASID_CACHE_DOMSI, /* pasid cache domain selective invalidation */
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index f4273dc640..ed71bb8ec7 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -87,6 +87,8 @@ struct vtd_iotlb_key {
static void vtd_address_space_refresh_all(IntelIOMMUState *s);
static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
+static void vtd_pasid_cache_reset_locked(IntelIOMMUState *s);
+
static void vtd_panic_require_caching_mode(void)
{
error_report("We need to set caching-mode=on for intel-iommu to enable "
@@ -391,6 +393,7 @@ static void vtd_reset_caches(IntelIOMMUState *s)
vtd_iommu_lock(s);
vtd_reset_iotlb_locked(s);
vtd_reset_context_cache_locked(s);
+ vtd_pasid_cache_reset_locked(s);
vtd_iommu_unlock(s);
}
@@ -3196,6 +3199,8 @@ static gboolean vtd_flush_pasid(gpointer key, gpointer value,
}
switch (pc_info->type) {
+ case VTD_PASID_CACHE_FORCE_RESET:
+ goto remove;
case VTD_PASID_CACHE_PASIDSI:
if (pc_info->pasid != pasid) {
return false;
@@ -3251,6 +3256,26 @@ remove:
return true;
}
+/* Caller of this function should hold iommu_lock */
+static void vtd_pasid_cache_reset_locked(IntelIOMMUState *s)
+{
+ VTDPASIDCacheInfo pc_info = {};
+
+ trace_vtd_pasid_cache_reset();
+
+ pc_info.type = VTD_PASID_CACHE_FORCE_RESET;
+
+ /*
+ * Resetting the pasid cache is a big hammer, so use
+ * g_hash_table_foreach_remove which frees the vtd_as instances.
+ * Also, as a big hammer, use VTD_PASID_CACHE_FORCE_RESET to ensure
+ * all the vtd_as instances are dropped; meanwhile the change is
+ * passed to the host if HostIOMMUDeviceIOMMUFD is available.
+ */
+ g_hash_table_foreach_remove(s->vtd_address_spaces,
+ vtd_flush_pasid, &pc_info);
+}
+
static void vtd_sm_pasid_table_walk_one(IntelIOMMUState *s,
dma_addr_t pt_base,
int start,
@@ -3377,6 +3402,9 @@ static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
case VTD_PASID_CACHE_GLOBAL_INV:
/* loop all assigned devices */
break;
+ case VTD_PASID_CACHE_FORCE_RESET:
+ /* For force reset, no need to go further replay */
+ return;
default:
error_report("invalid pc_info->type for replay");
abort();
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index ae5bbfcdc0..c8a936eb46 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -24,6 +24,7 @@ vtd_inv_qi_head(uint16_t head) "read head %d"
vtd_inv_qi_tail(uint16_t head) "write tail %d"
vtd_inv_qi_fetch(void) ""
vtd_context_cache_reset(void) ""
+vtd_pasid_cache_reset(void) ""
vtd_pasid_cache_gsi(void) ""
vtd_pasid_cache_dsi(uint16_t domain) "Domain selective PC invalidation domain 0x%"PRIx16
vtd_pasid_cache_psi(uint16_t domain, uint32_t pasid) "PASID selective PC invalidation domain 0x%"PRIx16" pasid 0x%"PRIx32
--
2.34.1
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [PATCH v2 13/19] intel_iommu: Stick to system MR for IOMMUFD backed host device when x-flts=on
2025-06-20 7:17 [PATCH v2 00/19] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
` (11 preceding siblings ...)
2025-06-20 7:18 ` [PATCH v2 12/19] intel_iommu: Introduce a new pasid cache invalidation type FORCE_RESET Zhenzhong Duan
@ 2025-06-20 7:18 ` Zhenzhong Duan
2025-06-23 12:02 ` Eric Auger
2025-06-20 7:18 ` [PATCH v2 14/19] intel_iommu: Bind/unbind guest page table to host Zhenzhong Duan
` (5 subsequent siblings)
18 siblings, 1 reply; 58+ messages in thread
From: Zhenzhong Duan @ 2025-06-20 7:18 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan, Paolo Bonzini, Richard Henderson, Eduardo Habkost,
Marcel Apfelbaum
When the guest is in scalable mode and x-flts=on, we stick to the system MR for
IOMMUFD backed host devices. Their default hwpt then contains GPA->HPA mappings,
which are used directly if PGTT=PT and used as the nested parent if PGTT=FLT.
Otherwise we fall back to the original processing.
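The MR selection described above boils down to a simple predicate. A minimal sketch, assuming plain bools standing in for the IntelIOMMUState fields and the IOMMUFD type check; this is illustrative, not the real vtd_as_pt_enabled():

```c
#include <assert.h>
#include <stdbool.h>

/*
 * With x-flts=on and the guest in scalable mode, an IOMMUFD backed
 * host device stays on the system MR, so its default hwpt keeps
 * GPA->HPA mappings (used directly for PGTT=PT, or as the nested
 * parent for PGTT=FLT).
 */
static bool stick_to_system_mr(bool root_scalable, bool flts,
                               bool iommufd_backed)
{
    return root_scalable && flts && iommufd_backed;
}
```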
Suggested-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
hw/i386/intel_iommu.c | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index ed71bb8ec7..be01f8885f 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -1779,6 +1779,7 @@ static bool vtd_dev_pt_enabled(IntelIOMMUState *s, VTDContextEntry *ce,
*/
return false;
}
+
return (VTD_PE_GET_TYPE(&pe) == VTD_SM_PASID_ENTRY_PT);
}
@@ -1790,10 +1791,33 @@ static bool vtd_as_pt_enabled(VTDAddressSpace *as)
{
IntelIOMMUState *s;
VTDContextEntry ce;
+ struct vtd_as_key key = {
+ .bus = as->bus,
+ .devfn = as->devfn,
+ };
assert(as);
s = as->iommu_state;
+
+ /*
+ * When guest in scalable mode and x-flts=on, we stick to system MR
+ * for IOMMUFD backed host device. Then its default hwpt contains
+ * GPA->HPA mappings which is used directly if PGTT=PT and used as
+ * nested parent if PGTT=FLT. Otherwise fallback to original
+ * processing.
+ */
+ if (s->root_scalable && s->flts) {
+ VTDHostIOMMUDevice *vtd_hiod;
+
+ vtd_hiod = g_hash_table_lookup(s->vtd_host_iommu_dev, &key);
+ if (vtd_hiod && vtd_hiod->hiod &&
+ object_dynamic_cast(OBJECT(vtd_hiod->hiod),
+ TYPE_HOST_IOMMU_DEVICE_IOMMUFD)) {
+ return true;
+ }
+ }
+
if (vtd_dev_to_context_entry(s, pci_bus_num(as->bus), as->devfn,
&ce)) {
/*
--
2.34.1
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [PATCH v2 14/19] intel_iommu: Bind/unbind guest page table to host
2025-06-20 7:17 [PATCH v2 00/19] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
` (12 preceding siblings ...)
2025-06-20 7:18 ` [PATCH v2 13/19] intel_iommu: Stick to system MR for IOMMUFD backed host device when x-flts=on Zhenzhong Duan
@ 2025-06-20 7:18 ` Zhenzhong Duan
2025-06-23 13:17 ` Eric Auger
2025-06-20 7:18 ` [PATCH v2 15/19] intel_iommu: Replay pasid binds after context cache invalidation Zhenzhong Duan
` (4 subsequent siblings)
18 siblings, 1 reply; 58+ messages in thread
From: Zhenzhong Duan @ 2025-06-20 7:18 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan, Yi Sun, Marcel Apfelbaum, Paolo Bonzini,
Richard Henderson, Eduardo Habkost
This captures guest PASID table entry modifications and propagates
the changes to the host to attach a hwpt whose type is determined
by the guest IOMMU mode and PGTT configuration.
When PGTT is Pass-through (100b), the hwpt on the host side is a stage-2
page table (GPA->HPA). When PGTT is First-stage Translation only (001b),
vIOMMU reuses the hwpt (GPA->HPA) provided by VFIO as the nested parent
to construct a nested page table.
When the guest decides to use legacy mode, vIOMMU switches the MRs of
the device's AS, hence the IOAS created by the VFIO container is
switched to using IOMMU_NOTIFIER_IOTLB_EVENTS since the MR is switched
to the IOMMU MR. So it is able to support shadowing the guest IO
page table.
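The PASID-entry fields this binding path consumes mirror the VTD_SM_PASID_ENTRY_* macros the patch adds (SRE in bit 0, WPE in bit 4, EAFE in bit 7, and FLPM in bits 3:2 of the entry's third qword). A standalone re-derivation for illustration, not the QEMU headers themselves:

```c
#include <assert.h>
#include <stdint.h>

/* Field extraction from the third qword (val[2]) of a scalable-mode
 * PASID entry, as used to fill struct iommu_hwpt_vtd_s1. */
static inline int pe_sre(uint64_t v)  { return (int)(v & 1ULL); }
static inline int pe_wpe(uint64_t v)  { return (int)((v >> 4) & 1ULL); }
static inline int pe_eafe(uint64_t v) { return (int)((v >> 7) & 1ULL); }

/* First-level address width: FLPM (bits 3:2) selects 4-level (48-bit)
 * or 5-level (57-bit) first-stage paging. */
static inline uint32_t pe_fl_aw(uint64_t v)
{
    return 48 + (uint32_t)((v >> 2) & 3ULL) * 9;
}
```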
Co-Authored-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
hw/i386/intel_iommu_internal.h | 11 ++
hw/i386/intel_iommu.c | 244 +++++++++++++++++++++++++++++++--
hw/i386/trace-events | 3 +
3 files changed, 243 insertions(+), 15 deletions(-)
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 5ed76864be..92a533db54 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -563,6 +563,13 @@ typedef struct VTDRootEntry VTDRootEntry;
#define VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw) (0x1e0ULL | ~VTD_HAW_MASK(aw))
#define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1 0xffffffffffe00000ULL
+typedef enum VTDPASIDOp {
+ VTD_PASID_BIND,
+ VTD_PASID_UPDATE,
+ VTD_PASID_UNBIND,
+ VTD_OP_NUM
+} VTDPASIDOp;
+
typedef enum VTDPCInvType {
/* Force reset all */
VTD_PASID_CACHE_FORCE_RESET = 0,
@@ -607,6 +614,9 @@ typedef struct VTDPASIDCacheInfo {
#define VTD_SM_PASID_ENTRY_FLPM 3ULL
#define VTD_SM_PASID_ENTRY_FLPTPTR (~0xfffULL)
+#define VTD_SM_PASID_ENTRY_SRE_BIT(val) (!!((val) & 1ULL))
+#define VTD_SM_PASID_ENTRY_WPE_BIT(val) (!!(((val) >> 4) & 1ULL))
+#define VTD_SM_PASID_ENTRY_EAFE_BIT(val) (!!(((val) >> 7) & 1ULL))
/* First Level Paging Structure */
/* Masks for First Level Paging Entry */
@@ -644,5 +654,6 @@ typedef struct VTDHostIOMMUDevice {
PCIBus *bus;
uint8_t devfn;
HostIOMMUDevice *hiod;
+ uint32_t s1_hwpt;
} VTDHostIOMMUDevice;
#endif
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index be01f8885f..1c94a0033c 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -20,6 +20,7 @@
*/
#include "qemu/osdep.h"
+#include CONFIG_DEVICES /* CONFIG_IOMMUFD */
#include "qemu/error-report.h"
#include "qemu/main-loop.h"
#include "qapi/error.h"
@@ -41,6 +42,9 @@
#include "migration/vmstate.h"
#include "trace.h"
#include "system/iommufd.h"
+#ifdef CONFIG_IOMMUFD
+#include <linux/iommufd.h>
+#endif
/* context entry operations */
#define VTD_CE_GET_RID2PASID(ce) \
@@ -839,6 +843,27 @@ static inline uint16_t vtd_pe_get_did(VTDPASIDEntry *pe)
return VTD_SM_PASID_ENTRY_DID((pe)->val[1]);
}
+static inline dma_addr_t vtd_pe_get_flpt_base(VTDPASIDEntry *pe)
+{
+ return pe->val[2] & VTD_SM_PASID_ENTRY_FLPTPTR;
+}
+
+static inline uint32_t vtd_pe_get_fl_aw(VTDPASIDEntry *pe)
+{
+ return 48 + ((pe->val[2] >> 2) & VTD_SM_PASID_ENTRY_FLPM) * 9;
+}
+
+static inline bool vtd_pe_pgtt_is_pt(VTDPASIDEntry *pe)
+{
+ return (VTD_PE_GET_TYPE(pe) == VTD_SM_PASID_ENTRY_PT);
+}
+
+/* check if pgtt is first stage translation */
+static inline bool vtd_pe_pgtt_is_flt(VTDPASIDEntry *pe)
+{
+ return (VTD_PE_GET_TYPE(pe) == VTD_SM_PASID_ENTRY_FLT);
+}
+
static inline bool vtd_pdire_present(VTDPASIDDirEntry *pdire)
{
return pdire->val & 1;
@@ -2431,6 +2456,188 @@ static void vtd_context_global_invalidate(IntelIOMMUState *s)
vtd_iommu_replay_all(s);
}
+#ifdef CONFIG_IOMMUFD
+static void vtd_init_s1_hwpt_data(struct iommu_hwpt_vtd_s1 *vtd,
+ VTDPASIDEntry *pe)
+{
+ memset(vtd, 0, sizeof(*vtd));
+
+ vtd->flags = (VTD_SM_PASID_ENTRY_SRE_BIT(pe->val[2]) ?
+ IOMMU_VTD_S1_SRE : 0) |
+ (VTD_SM_PASID_ENTRY_WPE_BIT(pe->val[2]) ?
+ IOMMU_VTD_S1_WPE : 0) |
+ (VTD_SM_PASID_ENTRY_EAFE_BIT(pe->val[2]) ?
+ IOMMU_VTD_S1_EAFE : 0);
+ vtd->addr_width = vtd_pe_get_fl_aw(pe);
+ vtd->pgtbl_addr = (uint64_t)vtd_pe_get_flpt_base(pe);
+}
+
+static int vtd_create_s1_hwpt(VTDHostIOMMUDevice *vtd_hiod,
+ VTDPASIDEntry *pe, Error **errp)
+{
+ HostIOMMUDeviceIOMMUFD *idev = HOST_IOMMU_DEVICE_IOMMUFD(vtd_hiod->hiod);
+ struct iommu_hwpt_vtd_s1 vtd;
+ uint32_t s1_hwpt;
+
+ vtd_init_s1_hwpt_data(&vtd, pe);
+
+ if (!iommufd_backend_alloc_hwpt(idev->iommufd, idev->devid,
+ idev->hwpt_id, 0, IOMMU_HWPT_DATA_VTD_S1,
+ sizeof(vtd), &vtd, &s1_hwpt, errp)) {
+ return -EINVAL;
+ }
+
+ vtd_hiod->s1_hwpt = s1_hwpt;
+
+ return 0;
+}
+
+static void vtd_destroy_s1_hwpt(VTDHostIOMMUDevice *vtd_hiod)
+{
+ HostIOMMUDeviceIOMMUFD *idev = HOST_IOMMU_DEVICE_IOMMUFD(vtd_hiod->hiod);
+
+ iommufd_backend_free_id(idev->iommufd, vtd_hiod->s1_hwpt);
+ vtd_hiod->s1_hwpt = 0;
+}
+
+static int vtd_device_attach_iommufd(VTDHostIOMMUDevice *vtd_hiod,
+ uint32_t pasid, VTDPASIDEntry *pe,
+ Error **errp)
+{
+ HostIOMMUDeviceIOMMUFD *idev = HOST_IOMMU_DEVICE_IOMMUFD(vtd_hiod->hiod);
+ uint32_t hwpt_id;
+ int ret;
+
+ if (vtd_pe_pgtt_is_flt(pe)) {
+ ret = vtd_create_s1_hwpt(vtd_hiod, pe, errp);
+ if (ret) {
+ return ret;
+ }
+ hwpt_id = vtd_hiod->s1_hwpt;
+ } else {
+ hwpt_id = idev->hwpt_id;
+ }
+
+ ret = !host_iommu_device_iommufd_attach_hwpt(idev, hwpt_id, errp);
+ trace_vtd_device_attach_hwpt(idev->devid, pasid, hwpt_id, ret);
+ if (ret && vtd_pe_pgtt_is_flt(pe)) {
+ vtd_destroy_s1_hwpt(vtd_hiod);
+ }
+
+ return ret;
+}
+
+static int vtd_device_detach_iommufd(VTDHostIOMMUDevice *vtd_hiod,
+ uint32_t pasid, VTDPASIDEntry *pe,
+ Error **errp)
+{
+ HostIOMMUDeviceIOMMUFD *idev = HOST_IOMMU_DEVICE_IOMMUFD(vtd_hiod->hiod);
+ int ret;
+
+ if (vtd_hiod->iommu_state->dmar_enabled) {
+ ret = !host_iommu_device_iommufd_detach_hwpt(idev, errp);
+ trace_vtd_device_detach_hwpt(idev->devid, pasid, ret);
+ } else {
+ ret = !host_iommu_device_iommufd_attach_hwpt(idev, idev->hwpt_id, errp);
+ trace_vtd_device_reattach_def_hwpt(idev->devid, pasid, idev->hwpt_id,
+ ret);
+ }
+
+ if (vtd_pe_pgtt_is_flt(pe)) {
+ vtd_destroy_s1_hwpt(vtd_hiod);
+ }
+
+ return ret;
+}
+
+static int vtd_device_attach_pgtbl(VTDHostIOMMUDevice *vtd_hiod,
+ VTDAddressSpace *vtd_as, VTDPASIDEntry *pe)
+{
+ /*
+ * If pe->pgtt == FLT, we should go ahead and bind, as host only
+ * accepts guest FLT under nesting. If pe->pgtt == PT, we should
+ * set up the pasid with a GPA page table. Otherwise return failure.
+ */
+ if (!vtd_pe_pgtt_is_flt(pe) && !vtd_pe_pgtt_is_pt(pe)) {
+ return -EINVAL;
+ }
+
+ /* Should fail if the FLPT base is 0 */
+ if (vtd_pe_pgtt_is_flt(pe) && !vtd_pe_get_flpt_base(pe)) {
+ return -EINVAL;
+ }
+
+ return vtd_device_attach_iommufd(vtd_hiod, vtd_as->pasid, pe, &error_abort);
+}
+
+static int vtd_device_detach_pgtbl(VTDHostIOMMUDevice *vtd_hiod,
+ VTDAddressSpace *vtd_as)
+{
+ VTDPASIDEntry *cached_pe = vtd_as->pasid_cache_entry.cache_filled ?
+ &vtd_as->pasid_cache_entry.pasid_entry : NULL;
+
+ if (!cached_pe ||
+ (!vtd_pe_pgtt_is_flt(cached_pe) && !vtd_pe_pgtt_is_pt(cached_pe))) {
+ return 0;
+ }
+
+ return vtd_device_detach_iommufd(vtd_hiod, vtd_as->pasid, cached_pe,
+ &error_abort);
+}
+
+/**
+ * Caller should hold iommu_lock.
+ */
+static int vtd_bind_guest_pasid(VTDAddressSpace *vtd_as,
+ VTDPASIDEntry *pe, VTDPASIDOp op)
+{
+ IntelIOMMUState *s = vtd_as->iommu_state;
+ VTDHostIOMMUDevice *vtd_hiod;
+ int devfn = vtd_as->devfn;
+ int ret = -EINVAL;
+ struct vtd_as_key key = {
+ .bus = vtd_as->bus,
+ .devfn = devfn,
+ };
+
+ vtd_hiod = g_hash_table_lookup(s->vtd_host_iommu_dev, &key);
+ if (!vtd_hiod || !vtd_hiod->hiod) {
+ /* means no need to go further, e.g. for emulated devices */
+ return 0;
+ }
+
+ if (vtd_as->pasid != PCI_NO_PASID) {
+ error_report("Non-rid_pasid %d not supported yet", vtd_as->pasid);
+ return ret;
+ }
+
+ switch (op) {
+ case VTD_PASID_UPDATE:
+ case VTD_PASID_BIND:
+ {
+ ret = vtd_device_attach_pgtbl(vtd_hiod, vtd_as, pe);
+ break;
+ }
+ case VTD_PASID_UNBIND:
+ {
+ ret = vtd_device_detach_pgtbl(vtd_hiod, vtd_as);
+ break;
+ }
+ default:
+ error_report_once("Unknown VTDPASIDOp");
+ break;
+ }
+
+ return ret;
+}
+#else
+static int vtd_bind_guest_pasid(VTDAddressSpace *vtd_as,
+ VTDPASIDEntry *pe, VTDPASIDOp op)
+{
+ return 0;
+}
+#endif
+
/* Do a context-cache device-selective invalidation.
* @func_mask: FM field after shifting
*/
@@ -3181,20 +3388,23 @@ static int vtd_fill_pe_in_cache(IntelIOMMUState *s, VTDAddressSpace *vtd_as,
VTDPASIDEntry *pe)
{
VTDPASIDCacheEntry *pc_entry = &vtd_as->pasid_cache_entry;
+ int ret;
- if (vtd_pasid_entry_compare(pe, &pc_entry->pasid_entry)) {
- /* No need to go further as cached pasid entry is latest */
- return 0;
+ if (pc_entry->cache_filled) {
+ if (vtd_pasid_entry_compare(pe, &pc_entry->pasid_entry)) {
+ /* No need to go further as cached pasid entry is latest */
+ return 0;
+ }
+ ret = vtd_bind_guest_pasid(vtd_as, pe, VTD_PASID_UPDATE);
+ } else {
+ ret = vtd_bind_guest_pasid(vtd_as, pe, VTD_PASID_BIND);
}
- pc_entry->pasid_entry = *pe;
- pc_entry->cache_filled = true;
-
- /*
- * TODO: send pasid bind to host for passthru devices
- */
-
- return 0;
+ if (!ret) {
+ pc_entry->pasid_entry = *pe;
+ pc_entry->cache_filled = true;
+ }
+ return ret;
}
/*
@@ -3265,10 +3475,14 @@ static gboolean vtd_flush_pasid(gpointer key, gpointer value,
return false;
remove:
- /*
- * TODO: send pasid unbind to host for passthru devices
- */
- pc_entry->cache_filled = false;
+ if (pc_entry->cache_filled) {
+ if (vtd_bind_guest_pasid(vtd_as, NULL, VTD_PASID_UNBIND)) {
+ pasid_cache_info_set_error(pc_info);
+ return false;
+ } else {
+ pc_entry->cache_filled = false;
+ }
+ }
/*
* Don't remove address space of PCI_NO_PASID which is created by PCI
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index c8a936eb46..1c31b9a873 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -73,6 +73,9 @@ vtd_warn_invalid_qi_tail(uint16_t tail) "tail 0x%"PRIx16
vtd_warn_ir_vector(uint16_t sid, int index, int vec, int target) "sid 0x%"PRIx16" index %d vec %d (should be: %d)"
vtd_warn_ir_trigger(uint16_t sid, int index, int trig, int target) "sid 0x%"PRIx16" index %d trigger %d (should be: %d)"
vtd_reset_exit(void) ""
+vtd_device_attach_hwpt(uint32_t dev_id, uint32_t pasid, uint32_t hwpt_id, int ret) "dev_id %d pasid %d hwpt_id %d, ret: %d"
+vtd_device_detach_hwpt(uint32_t dev_id, uint32_t pasid, int ret) "dev_id %d pasid %d ret: %d"
+vtd_device_reattach_def_hwpt(uint32_t dev_id, uint32_t pasid, uint32_t hwpt_id, int ret) "dev_id %d pasid %d hwpt_id %d, ret: %d"
# amd_iommu.c
amdvi_evntlog_fail(uint64_t addr, uint32_t head) "error: fail to write at addr 0x%"PRIx64" + offset 0x%"PRIx32
--
2.34.1
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [PATCH v2 15/19] intel_iommu: Replay pasid binds after context cache invalidation
2025-06-20 7:17 [PATCH v2 00/19] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
` (13 preceding siblings ...)
2025-06-20 7:18 ` [PATCH v2 14/19] intel_iommu: Bind/unbind guest page table to host Zhenzhong Duan
@ 2025-06-20 7:18 ` Zhenzhong Duan
2025-06-23 13:25 ` Eric Auger
2025-06-20 7:18 ` [PATCH v2 16/19] intel_iommu: Propagate PASID-based iotlb invalidation to host Zhenzhong Duan
` (3 subsequent siblings)
18 siblings, 1 reply; 58+ messages in thread
From: Zhenzhong Duan @ 2025-06-20 7:18 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng, Yi Sun,
Zhenzhong Duan, Paolo Bonzini, Richard Henderson, Eduardo Habkost,
Marcel Apfelbaum
From: Yi Liu <yi.l.liu@intel.com>
This replays guest pasid attachments after a context cache invalidation.
This is defensive behavior to ensure safety: strictly speaking, the
programmer should issue a pasid cache invalidation with the proper
granularity after issuing a context cache invalidation.
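The device-selective filter the replay relies on can be sketched as follows. A minimal model under stated assumptions: opaque ints stand in for the PCIBus pointer and devfn of the real vtd_as key comparison:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A cached pasid entry is only touched when both bus and devfn
 * match the device targeted by the selective invalidation. */
static bool devsi_matches(int info_bus, uint16_t info_devfn,
                          int as_bus, uint16_t as_devfn)
{
    return info_bus == as_bus && info_devfn == as_devfn;
}
```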
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
hw/i386/intel_iommu_internal.h | 1 +
hw/i386/intel_iommu.c | 51 ++++++++++++++++++++++++++++++++--
hw/i386/trace-events | 1 +
3 files changed, 51 insertions(+), 2 deletions(-)
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 92a533db54..b3e4aa23f1 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -575,6 +575,7 @@ typedef enum VTDPCInvType {
VTD_PASID_CACHE_FORCE_RESET = 0,
/* pasid cache invalidation rely on guest PASID entry */
VTD_PASID_CACHE_GLOBAL_INV, /* pasid cache global invalidation */
+ VTD_PASID_CACHE_DEVSI, /* pasid cache device selective invalidation */
VTD_PASID_CACHE_DOMSI, /* pasid cache domain selective invalidation */
VTD_PASID_CACHE_PASIDSI, /* pasid cache pasid selective invalidation */
} VTDPCInvType;
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 1c94a0033c..621b07aa02 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -92,6 +92,10 @@ static void vtd_address_space_refresh_all(IntelIOMMUState *s);
static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
static void vtd_pasid_cache_reset_locked(IntelIOMMUState *s);
+static void vtd_pasid_cache_sync(IntelIOMMUState *s,
+ VTDPASIDCacheInfo *pc_info);
+static void vtd_pasid_cache_devsi(IntelIOMMUState *s,
+ PCIBus *bus, uint16_t devfn);
static void vtd_panic_require_caching_mode(void)
{
@@ -2437,6 +2441,8 @@ static void vtd_iommu_replay_all(IntelIOMMUState *s)
static void vtd_context_global_invalidate(IntelIOMMUState *s)
{
+ VTDPASIDCacheInfo pc_info = { .error_happened = false, };
+
trace_vtd_inv_desc_cc_global();
/* Protects context cache */
vtd_iommu_lock(s);
@@ -2454,6 +2460,9 @@ static void vtd_context_global_invalidate(IntelIOMMUState *s)
* VT-d emulation codes.
*/
vtd_iommu_replay_all(s);
+
+ pc_info.type = VTD_PASID_CACHE_GLOBAL_INV;
+ vtd_pasid_cache_sync(s, &pc_info);
}
#ifdef CONFIG_IOMMUFD
@@ -2696,6 +2705,21 @@ static void vtd_context_device_invalidate(IntelIOMMUState *s,
* happened.
*/
vtd_address_space_sync(vtd_as);
+ /*
+ * Per spec, a context flush should also be followed by a PASID
+ * cache and iotlb flush. For a device selective context cache
+ * invalidation:
+ * if (emulated_device)
+ * invalidate pasid cache and pasid-based iotlb
+ * else if (assigned_device)
+ * check if the device has been bound to any pasid
+ * invoke pasid_unbind for each bound pasid
+ * Here, vtd_pasid_cache_devsi() invalidates the pasid caches; for
+ * the piotlb there is no QEMU-side cache yet, so no handling. For
+ * an assigned device, the host iommu driver flushes the piotlb
+ * when a pasid unbind is passed down to it.
+ */
+ vtd_pasid_cache_devsi(s, vtd_as->bus, devfn);
}
}
}
@@ -3447,6 +3471,11 @@ static gboolean vtd_flush_pasid(gpointer key, gpointer value,
/* Fall through */
case VTD_PASID_CACHE_GLOBAL_INV:
break;
+ case VTD_PASID_CACHE_DEVSI:
+ if (pc_info->bus != vtd_as->bus || pc_info->devfn != vtd_as->devfn) {
+ return false;
+ }
+ break;
default:
error_report("invalid pc_info->type");
abort();
@@ -3640,6 +3669,11 @@ static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
case VTD_PASID_CACHE_GLOBAL_INV:
/* loop all assigned devices */
break;
+ case VTD_PASID_CACHE_DEVSI:
+ walk_info.bus = pc_info->bus;
+ walk_info.devfn = pc_info->devfn;
+ vtd_replay_pasid_bind_for_dev(s, start, end, &walk_info);
+ return;
case VTD_PASID_CACHE_FORCE_RESET:
/* For force reset, no need to go further replay */
return;
@@ -3675,8 +3709,7 @@ static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
* It includes updating the pasid cache in vIOMMU and updating the
* pasid bindings per guest's latest pasid entry presence.
*/
-static void vtd_pasid_cache_sync(IntelIOMMUState *s,
- VTDPASIDCacheInfo *pc_info)
+static void vtd_pasid_cache_sync(IntelIOMMUState *s, VTDPASIDCacheInfo *pc_info)
{
if (!s->flts || !s->root_scalable || !s->dmar_enabled) {
return;
@@ -3737,6 +3770,20 @@ static void vtd_pasid_cache_sync(IntelIOMMUState *s,
vtd_replay_guest_pasid_bindings(s, pc_info);
}
+static void vtd_pasid_cache_devsi(IntelIOMMUState *s,
+ PCIBus *bus, uint16_t devfn)
+{
+ VTDPASIDCacheInfo pc_info = { .error_happened = false, };
+
+ trace_vtd_pasid_cache_devsi(devfn);
+
+ pc_info.type = VTD_PASID_CACHE_DEVSI;
+ pc_info.bus = bus;
+ pc_info.devfn = devfn;
+
+ vtd_pasid_cache_sync(s, &pc_info);
+}
+
static bool vtd_process_pasid_desc(IntelIOMMUState *s,
VTDInvDesc *inv_desc)
{
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index 1c31b9a873..830b11f68b 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -28,6 +28,7 @@ vtd_pasid_cache_reset(void) ""
vtd_pasid_cache_gsi(void) ""
vtd_pasid_cache_dsi(uint16_t domain) "Domain selective PC invalidation domain 0x%"PRIx16
vtd_pasid_cache_psi(uint16_t domain, uint32_t pasid) "PASID selective PC invalidation domain 0x%"PRIx16" pasid 0x%"PRIx32
+vtd_pasid_cache_devsi(uint16_t devfn) "Dev selective PC invalidation dev: 0x%"PRIx16
vtd_re_not_present(uint8_t bus) "Root entry bus %"PRIu8" not present"
vtd_ce_not_present(uint8_t bus, uint8_t devfn) "Context entry bus %"PRIu8" devfn %"PRIu8" not present"
vtd_iotlb_page_hit(uint16_t sid, uint64_t addr, uint64_t slpte, uint16_t domain) "IOTLB page hit sid 0x%"PRIx16" iova 0x%"PRIx64" slpte 0x%"PRIx64" domain 0x%"PRIx16
--
2.34.1
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [PATCH v2 16/19] intel_iommu: Propagate PASID-based iotlb invalidation to host
2025-06-20 7:17 [PATCH v2 00/19] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
` (14 preceding siblings ...)
2025-06-20 7:18 ` [PATCH v2 15/19] intel_iommu: Replay pasid binds after context cache invalidation Zhenzhong Duan
@ 2025-06-20 7:18 ` Zhenzhong Duan
2025-06-23 13:41 ` Eric Auger
2025-06-20 7:18 ` [PATCH v2 17/19] intel_iommu: Refresh pasid bind when either SRTP or TE bit is changed Zhenzhong Duan
` (2 subsequent siblings)
18 siblings, 1 reply; 58+ messages in thread
From: Zhenzhong Duan @ 2025-06-20 7:18 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng, Yi Sun,
Zhenzhong Duan, Paolo Bonzini, Richard Henderson, Eduardo Habkost,
Marcel Apfelbaum
From: Yi Liu <yi.l.liu@intel.com>
This traps the guest PASID-based iotlb invalidation request and propagates it
to the host.
Intel VT-d 3.0 supports nested translation at PASID granularity. Guest SVA
support can be implemented by configuring nested translation on a specific
PASID. This is also known as dual stage DMA translation.
Under such a configuration, the guest owns the GVA->GPA translation, which is
configured as the stage-1 page table on the host side for a specific pasid, and
the host owns the GPA->HPA translation. As the guest owns the stage-1
translation table, piotlb invalidations should be propagated to the host, since
the host IOMMU caches first stage page table related mappings during DMA
address translation.
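The address-mask arithmetic used when forwarding a page-selective invalidation is worth spelling out: the descriptor's AM field encodes 2^am pages, which feeds both the local iotlb flush mask and the npages count sent to the host. A minimal sketch (the patch itself computes `~((1 << am) - 1)` and `1 << am`; 64-bit constants are used here for safety):

```c
#include <assert.h>
#include <stdint.h>

/* Number of pages covered by an invalidation with address-mask am. */
static uint64_t inv_npages(uint8_t am) { return 1ULL << am; }

/* Mask applied to the invalidation address: low am bits cleared. */
static uint64_t inv_mask(uint8_t am)   { return ~((1ULL << am) - 1); }
```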
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
hw/i386/intel_iommu_internal.h | 6 ++
hw/i386/intel_iommu.c | 113 ++++++++++++++++++++++++++++++++-
2 files changed, 117 insertions(+), 2 deletions(-)
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index b3e4aa23f1..07bfb97499 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -589,6 +589,12 @@ typedef struct VTDPASIDCacheInfo {
bool error_happened;
} VTDPASIDCacheInfo;
+typedef struct VTDPIOTLBInvInfo {
+ uint16_t domain_id;
+ uint32_t pasid;
+ struct iommu_hwpt_vtd_s1_invalidate *inv_data;
+} VTDPIOTLBInvInfo;
+
/* PASID Table Related Definitions */
#define VTD_PASID_DIR_BASE_ADDR_MASK (~0xfffULL)
#define VTD_PASID_TABLE_BASE_ADDR_MASK (~0xfffULL)
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 621b07aa02..d1fa395274 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -2639,12 +2639,105 @@ static int vtd_bind_guest_pasid(VTDAddressSpace *vtd_as,
return ret;
}
+
+/*
+ * Caller of this function should hold iommu_lock.
+ */
+static void vtd_invalidate_piotlb(VTDAddressSpace *vtd_as,
+ struct iommu_hwpt_vtd_s1_invalidate *cache)
+{
+ VTDHostIOMMUDevice *vtd_hiod;
+ HostIOMMUDeviceIOMMUFD *idev;
+ int devfn = vtd_as->devfn;
+ struct vtd_as_key key = {
+ .bus = vtd_as->bus,
+ .devfn = devfn,
+ };
+ IntelIOMMUState *s = vtd_as->iommu_state;
+ uint32_t entry_num = 1; /* Only implement one request for simplicity */
+ Error *err = NULL;
+
+ vtd_hiod = g_hash_table_lookup(s->vtd_host_iommu_dev, &key);
+ if (!vtd_hiod || !vtd_hiod->hiod) {
+ return;
+ }
+ idev = HOST_IOMMU_DEVICE_IOMMUFD(vtd_hiod->hiod);
+
+ if (!iommufd_backend_invalidate_cache(idev->iommufd, vtd_hiod->s1_hwpt,
+ IOMMU_HWPT_INVALIDATE_DATA_VTD_S1,
+ sizeof(*cache), &entry_num, cache,
+ &err)) {
+ error_report_err(err);
+ }
+}
+
+/*
+ * This function is the iteration callback over the
+ * s->vtd_address_spaces table, with VTDPIOTLBInvInfo as the filter.
+ * It propagates the piotlb invalidation to the host. The caller
+ * should hold iommu_lock.
+ */
+static void vtd_flush_pasid_iotlb(gpointer key, gpointer value,
+ gpointer user_data)
+{
+ VTDPIOTLBInvInfo *piotlb_info = user_data;
+ VTDAddressSpace *vtd_as = value;
+ VTDPASIDCacheEntry *pc_entry = &vtd_as->pasid_cache_entry;
+ uint32_t pasid;
+ uint16_t did;
+
+ /* Replay only fills the pasid entry cache for passthrough devices */
+ if (!pc_entry->cache_filled ||
+ !vtd_pe_pgtt_is_flt(&pc_entry->pasid_entry)) {
+ return;
+ }
+
+ if (vtd_as_to_iommu_pasid_locked(vtd_as, &pasid)) {
+ return;
+ }
+
+ did = vtd_pe_get_did(&pc_entry->pasid_entry);
+
+ if (piotlb_info->domain_id == did && piotlb_info->pasid == pasid) {
+ vtd_invalidate_piotlb(vtd_as, piotlb_info->inv_data);
+ }
+}
+
+static void vtd_flush_pasid_iotlb_all(IntelIOMMUState *s,
+ uint16_t domain_id, uint32_t pasid,
+ hwaddr addr, uint64_t npages, bool ih)
+{
+ struct iommu_hwpt_vtd_s1_invalidate cache_info = { 0 };
+ VTDPIOTLBInvInfo piotlb_info;
+
+ cache_info.addr = addr;
+ cache_info.npages = npages;
+ cache_info.flags = ih ? IOMMU_VTD_INV_FLAGS_LEAF : 0;
+
+ piotlb_info.domain_id = domain_id;
+ piotlb_info.pasid = pasid;
+ piotlb_info.inv_data = &cache_info;
+
+ /*
+ * Loop over all the vtd_as instances in s->vtd_address_spaces
+ * to find the affected devices, since piotlb invalidation should
+ * check the pasid cache from an architecture point of view.
+ */
+ g_hash_table_foreach(s->vtd_address_spaces,
+ vtd_flush_pasid_iotlb, &piotlb_info);
+}
#else
static int vtd_bind_guest_pasid(VTDAddressSpace *vtd_as,
VTDPASIDEntry *pe, VTDPASIDOp op)
{
return 0;
}
+
+static void vtd_flush_pasid_iotlb_all(IntelIOMMUState *s,
+ uint16_t domain_id, uint32_t pasid,
+ hwaddr addr, uint64_t npages, bool ih)
+{
+}
#endif
/* Do a context-cache device-selective invalidation.
@@ -3300,6 +3393,13 @@ static void vtd_piotlb_pasid_invalidate(IntelIOMMUState *s,
info.pasid = pasid;
vtd_iommu_lock(s);
+ /*
+ * Loop over all the vtd_as instances in s->vtd_address_spaces
+ * to find the affected devices, since piotlb invalidation should
+ * check the pasid cache from an architecture point of view.
+ */
+ vtd_flush_pasid_iotlb_all(s, domain_id, pasid, 0, (uint64_t)-1, 0);
+
g_hash_table_foreach_remove(s->iotlb, vtd_hash_remove_by_pasid,
&info);
vtd_iommu_unlock(s);
@@ -3323,7 +3423,8 @@ static void vtd_piotlb_pasid_invalidate(IntelIOMMUState *s,
}
static void vtd_piotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
- uint32_t pasid, hwaddr addr, uint8_t am)
+ uint32_t pasid, hwaddr addr, uint8_t am,
+ bool ih)
{
VTDIOTLBPageInvInfo info;
@@ -3333,6 +3434,13 @@ static void vtd_piotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
info.mask = ~((1 << am) - 1);
vtd_iommu_lock(s);
+ /*
+ * Loop over all the vtd_as instances in s->vtd_address_spaces
+ * to find the affected devices, since piotlb invalidation should
+ * check the pasid cache from an architecture point of view.
+ */
+ vtd_flush_pasid_iotlb_all(s, domain_id, pasid, addr, 1 << am, ih);
+
g_hash_table_foreach_remove(s->iotlb,
vtd_hash_remove_by_page_piotlb, &info);
vtd_iommu_unlock(s);
@@ -3366,7 +3474,8 @@ static bool vtd_process_piotlb_desc(IntelIOMMUState *s,
case VTD_INV_DESC_PIOTLB_PSI_IN_PASID:
am = VTD_INV_DESC_PIOTLB_AM(inv_desc->val[1]);
addr = (hwaddr) VTD_INV_DESC_PIOTLB_ADDR(inv_desc->val[1]);
- vtd_piotlb_page_invalidate(s, domain_id, pasid, addr, am);
+ vtd_piotlb_page_invalidate(s, domain_id, pasid, addr, am,
+ VTD_INV_DESC_PIOTLB_IH(inv_desc->val[1]));
break;
default:
--
2.34.1
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [PATCH v2 17/19] intel_iommu: Refresh pasid bind when either SRTP or TE bit is changed
2025-06-20 7:17 [PATCH v2 00/19] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
` (15 preceding siblings ...)
2025-06-20 7:18 ` [PATCH v2 16/19] intel_iommu: Propagate PASID-based iotlb invalidation to host Zhenzhong Duan
@ 2025-06-20 7:18 ` Zhenzhong Duan
2025-06-23 13:48 ` Eric Auger
2025-06-20 7:18 ` [PATCH v2 18/19] Workaround for ERRATA_772415_SPR17 Zhenzhong Duan
2025-06-20 7:18 ` [PATCH v2 19/19] intel_iommu: Enable host device when x-flts=on in scalable mode Zhenzhong Duan
18 siblings, 1 reply; 58+ messages in thread
From: Zhenzhong Duan @ 2025-06-20 7:18 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan, Paolo Bonzini, Richard Henderson, Eduardo Habkost,
Marcel Apfelbaum
From: Yi Liu <yi.l.liu@intel.com>
When either the 'Set Root Table Pointer' or the 'Translation Enable' bit is
changed, the pasid bindings on the host side become stale and need to be
updated. Introduce a helper function vtd_refresh_pasid_bind() for that purpose.
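The guard in the new helper reduces to a three-way conjunction. A sketch with plain bools standing in for the IntelIOMMUState fields, for illustration only:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Replaying the guest pasid bindings only makes sense when DMAR is
 * enabled and the stage-1 path (x-flts=on with a scalable root) is
 * active; otherwise vtd_refresh_pasid_bind() returns early.
 */
static bool should_refresh(bool dmar_enabled, bool flts, bool root_scalable)
{
    return dmar_enabled && flts && root_scalable;
}
```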
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
hw/i386/intel_iommu.c | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index d1fa395274..0b322078cc 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -90,6 +90,7 @@ struct vtd_iotlb_key {
static void vtd_address_space_refresh_all(IntelIOMMUState *s);
static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
+static void vtd_refresh_pasid_bind(IntelIOMMUState *s);
static void vtd_pasid_cache_reset_locked(IntelIOMMUState *s);
static void vtd_pasid_cache_sync(IntelIOMMUState *s,
@@ -3066,6 +3067,7 @@ static void vtd_handle_gcmd_srtp(IntelIOMMUState *s)
vtd_set_clear_mask_long(s, DMAR_GSTS_REG, 0, VTD_GSTS_RTPS);
vtd_reset_caches(s);
vtd_address_space_refresh_all(s);
+ vtd_refresh_pasid_bind(s);
}
/* Set Interrupt Remap Table Pointer */
@@ -3100,6 +3102,7 @@ static void vtd_handle_gcmd_te(IntelIOMMUState *s, bool en)
vtd_reset_caches(s);
vtd_address_space_refresh_all(s);
+ vtd_refresh_pasid_bind(s);
}
/* Handle Interrupt Remap Enable/Disable */
@@ -3813,6 +3816,26 @@ static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
}
}
+static void vtd_refresh_pasid_bind(IntelIOMMUState *s)
+{
+ VTDPASIDCacheInfo pc_info = { .error_happened = false,
+ .type = VTD_PASID_CACHE_GLOBAL_INV };
+
+ /*
+ * Only when DMAR is enabled should pasid bindings be replayed;
+ * otherwise there is no need to replay.
+ */
+ if (!s->dmar_enabled) {
+ return;
+ }
+
+ if (!s->flts || !s->root_scalable) {
+ return;
+ }
+
+ vtd_replay_guest_pasid_bindings(s, &pc_info);
+}
+
/*
* This function syncs the pasid bindings between guest and host.
* It includes updating the pasid cache in vIOMMU and updating the
--
2.34.1
^ permalink raw reply related [flat|nested] 58+ messages in thread
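The guard logic in vtd_refresh_pasid_bind() introduced above can be modeled as a small standalone predicate. This is a simplified sketch with a hypothetical state struct, not the actual QEMU IntelIOMMUState; only the three fields the early returns consult are kept:

```c
#include <stdbool.h>
#include <assert.h>

/* Hypothetical, minimal stand-in for the vIOMMU state fields consulted
 * by vtd_refresh_pasid_bind(); the real IntelIOMMUState is far larger. */
struct viommu_state {
    bool dmar_enabled;   /* Translation Enable (TE) in effect */
    bool flts;           /* stage-1 (first-level) translation configured */
    bool root_scalable;  /* scalable-mode root table in use */
};

/* Returns true when a global PASID-binding replay should be issued,
 * mirroring the early-return checks in the patch above. */
static bool should_replay_pasid_bindings(const struct viommu_state *s)
{
    if (!s->dmar_enabled) {
        return false;        /* DMA remapping off: nothing to bind */
    }
    if (!s->flts || !s->root_scalable) {
        return false;        /* no guest stage-1 page tables to pass down */
    }
    return true;             /* replay with VTD_PASID_CACHE_GLOBAL_INV */
}
```

When either SRTP or TE is toggled, both call sites in the patch run the same predicate, so the replay happens at most once per register write.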
* [PATCH v2 18/19] Workaround for ERRATA_772415_SPR17
2025-06-20 7:17 [PATCH v2 00/19] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
` (16 preceding siblings ...)
2025-06-20 7:18 ` [PATCH v2 17/19] intel_iommu: Refresh pasid bind when either SRTP or TE bit is changed Zhenzhong Duan
@ 2025-06-20 7:18 ` Zhenzhong Duan
2025-06-20 16:01 ` Eric Auger
2025-06-20 7:18 ` [PATCH v2 19/19] intel_iommu: Enable host device when x-flts=on in scalable mode Zhenzhong Duan
18 siblings, 1 reply; 58+ messages in thread
From: Zhenzhong Duan @ 2025-06-20 7:18 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan
On a system affected by ERRATA_772415, IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17
is reported by IOMMU_DEVICE_GET_HW_INFO. Due to this erratum, even a readonly
range mapped in the stage-2 page table can still be written.
Reference from 4th Gen Intel Xeon Processor Scalable Family Specification
Update, Errata Details, SPR17.
https://www.intel.com/content/www/us/en/content-details/772415/content-details.html
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
include/hw/vfio/vfio-container-base.h | 1 +
hw/vfio/iommufd.c | 8 +++++++-
hw/vfio/listener.c | 13 +++++++++----
3 files changed, 17 insertions(+), 5 deletions(-)
diff --git a/include/hw/vfio/vfio-container-base.h b/include/hw/vfio/vfio-container-base.h
index f0232654ee..e5c51a51ac 100644
--- a/include/hw/vfio/vfio-container-base.h
+++ b/include/hw/vfio/vfio-container-base.h
@@ -51,6 +51,7 @@ typedef struct VFIOContainerBase {
QLIST_HEAD(, VFIODevice) device_list;
GList *iova_ranges;
NotifierWithReturn cpr_reboot_notifier;
+ bool bypass_ro;
} VFIOContainerBase;
typedef struct VFIOGuestIOMMU {
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index 83a632bdee..23839a511a 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -306,6 +306,7 @@ static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
{
ERRP_GUARD();
IOMMUFDBackend *iommufd = vbasedev->iommufd;
+ struct iommu_hw_info_vtd vtd;
uint32_t type, flags = 0;
uint64_t hw_caps;
VFIOIOASHwpt *hwpt;
@@ -345,10 +346,15 @@ static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
* instead.
*/
if (!iommufd_backend_get_device_info(vbasedev->iommufd, vbasedev->devid,
- &type, NULL, 0, &hw_caps, errp)) {
+ &type, &vtd, sizeof(vtd), &hw_caps,
+ errp)) {
return false;
}
+ if (vtd.flags & IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17) {
+ container->bcontainer.bypass_ro = true;
+ }
+
if (hw_caps & IOMMU_HW_CAP_DIRTY_TRACKING) {
flags = IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
}
diff --git a/hw/vfio/listener.c b/hw/vfio/listener.c
index f498e23a93..c64aa4539e 100644
--- a/hw/vfio/listener.c
+++ b/hw/vfio/listener.c
@@ -364,7 +364,8 @@ static bool vfio_known_safe_misalignment(MemoryRegionSection *section)
return true;
}
-static bool vfio_listener_valid_section(MemoryRegionSection *section,
+static bool vfio_listener_valid_section(VFIOContainerBase *bcontainer,
+ MemoryRegionSection *section,
const char *name)
{
if (vfio_listener_skipped_section(section)) {
@@ -375,6 +376,10 @@ static bool vfio_listener_valid_section(MemoryRegionSection *section,
return false;
}
+ if (bcontainer && bcontainer->bypass_ro && section->readonly) {
+ return false;
+ }
+
if (unlikely((section->offset_within_address_space &
~qemu_real_host_page_mask()) !=
(section->offset_within_region & ~qemu_real_host_page_mask()))) {
@@ -494,7 +499,7 @@ void vfio_container_region_add(VFIOContainerBase *bcontainer,
int ret;
Error *err = NULL;
- if (!vfio_listener_valid_section(section, "region_add")) {
+ if (!vfio_listener_valid_section(bcontainer, section, "region_add")) {
return;
}
@@ -655,7 +660,7 @@ static void vfio_listener_region_del(MemoryListener *listener,
int ret;
bool try_unmap = true;
- if (!vfio_listener_valid_section(section, "region_del")) {
+ if (!vfio_listener_valid_section(bcontainer, section, "region_del")) {
return;
}
@@ -812,7 +817,7 @@ static void vfio_dirty_tracking_update(MemoryListener *listener,
container_of(listener, VFIODirtyRangesListener, listener);
hwaddr iova, end;
- if (!vfio_listener_valid_section(section, "tracking_update") ||
+ if (!vfio_listener_valid_section(NULL, section, "tracking_update") ||
!vfio_get_section_iova_range(dirty->bcontainer, section,
&iova, &end, NULL)) {
return;
--
2.34.1
^ permalink raw reply related [flat|nested] 58+ messages in thread
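The listener-side decision this patch adds can be sketched in isolation. The structs below are hypothetical stand-ins for VFIOContainerBase and MemoryRegionSection, keeping only the fields the new check reads:

```c
#include <stdbool.h>
#include <stddef.h>
#include <assert.h>

/* Hypothetical, pared-down stand-ins for VFIOContainerBase and
 * MemoryRegionSection; only the fields the errata check consults. */
struct container { bool bypass_ro; };
struct section   { bool readonly;  };

/* Mirrors the new check in vfio_listener_valid_section(): on parts hit
 * by SPR17, a read-only stage-2 mapping can still be written, so mapping
 * it grants no protection; the listener declines to map it entirely.
 * The dirty-tracking path passes a NULL container and keeps the old
 * behavior of accepting read-only sections. */
static bool section_accepted(const struct container *c,
                             const struct section *s)
{
    if (c != NULL && c->bypass_ro && s->readonly) {
        return false;
    }
    return true;
}
```

Note the asymmetry in the patch: region_add/region_del pass the container, while vfio_dirty_tracking_update passes NULL, matching the sketch's NULL branch.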
* [PATCH v2 19/19] intel_iommu: Enable host device when x-flts=on in scalable mode
2025-06-20 7:17 [PATCH v2 00/19] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
` (17 preceding siblings ...)
2025-06-20 7:18 ` [PATCH v2 18/19] Workaround for ERRATA_772415_SPR17 Zhenzhong Duan
@ 2025-06-20 7:18 ` Zhenzhong Duan
18 siblings, 0 replies; 58+ messages in thread
From: Zhenzhong Duan @ 2025-06-20 7:18 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan, Paolo Bonzini, Richard Henderson, Eduardo Habkost,
Marcel Apfelbaum
Now that all the infrastructure for supporting a passthrough device running
with stage-1 translation is in place, enable it.
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
hw/i386/intel_iommu.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 0b322078cc..88ccce1f94 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -5305,6 +5305,8 @@ static bool vtd_check_hiod(IntelIOMMUState *s, VTDHostIOMMUDevice *vtd_hiod,
"when x-flts=on");
return false;
}
+
+ return true;
#endif
error_setg(errp, "host device is uncompatible with stage-1 translation");
--
2.34.1
^ permalink raw reply related [flat|nested] 58+ messages in thread
* Re: [PATCH v2 02/19] hw/pci: Introduce pci_device_get_viommu_cap()
2025-06-20 7:17 ` [PATCH v2 02/19] hw/pci: Introduce pci_device_get_viommu_cap() Zhenzhong Duan
@ 2025-06-20 7:53 ` Eric Auger
2025-06-23 2:20 ` Duan, Zhenzhong
0 siblings, 1 reply; 58+ messages in thread
From: Eric Auger @ 2025-06-20 7:53 UTC (permalink / raw)
To: Zhenzhong Duan, qemu-devel
Cc: alex.williamson, clg, mst, jasowang, peterx, ddutile, jgg,
nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Marcel Apfelbaum
Hi Zhenzhong,
On 6/20/25 9:17 AM, Zhenzhong Duan wrote:
> pci_device_get_viommu_cap() calls pci_device_get_iommu_bus_devfn()
> to get iommu_bus->iommu_ops and calls the get_viommu_cap() callback to
> get a bitmap in which each bit represents a vIOMMU exposed capability.
>
> Suggested-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
> include/hw/pci/pci.h | 22 ++++++++++++++++++++++
> hw/pci/pci.c | 11 +++++++++++
> 2 files changed, 33 insertions(+)
>
> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> index df3cc7b875..829757b2c2 100644
> --- a/include/hw/pci/pci.h
> +++ b/include/hw/pci/pci.h
> @@ -453,6 +453,18 @@ typedef struct PCIIOMMUOps {
> * @devfn: device and function number of the PCI device.
> */
> void (*unset_iommu_device)(PCIBus *bus, void *opaque, int devfn);
> + /**
> + * @get_viommu_cap: get vIOMMU capabilities
> + *
> + * Optional callback, if not implemented, then vIOMMU doesn't
> + * support exposing capabilities to other subsystem, e.g., VFIO.
> + * vIOMMU can choose which capabilities to expose.
> + *
> + * @opaque: the data passed to pci_setup_iommu().
> + *
> + * Returns: 64bit bitmap with each bit representing a capability.
I think we need to clarify what this bitmap contains as capability bits
(enum type)
Thanks
Eric
> + */
> + uint64_t (*get_viommu_cap)(void *opaque);
> /**
> * @get_iotlb_info: get properties required to initialize a device IOTLB.
> *
> @@ -633,6 +645,16 @@ bool pci_device_set_iommu_device(PCIDevice *dev, HostIOMMUDevice *hiod,
> Error **errp);
> void pci_device_unset_iommu_device(PCIDevice *dev);
>
> +/**
> + * pci_device_get_viommu_cap: get vIOMMU capabilities.
> + *
> + * Returns a 64bit bitmap with each bit representing a vIOMMU exposed
> + * capability, or 0 if the vIOMMU doesn't support exposing capabilities.
> + *
> + * @dev: PCI device pointer.
> + */
> +uint64_t pci_device_get_viommu_cap(PCIDevice *dev);
> +
> /**
> * pci_iommu_get_iotlb_info: get properties required to initialize a
> * device IOTLB.
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index c70b5ceeba..df1fb615a8 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -2992,6 +2992,17 @@ void pci_device_unset_iommu_device(PCIDevice *dev)
> }
> }
>
> +uint64_t pci_device_get_viommu_cap(PCIDevice *dev)
> +{
> + PCIBus *iommu_bus;
> +
> + pci_device_get_iommu_bus_devfn(dev, &iommu_bus, NULL, NULL);
> + if (iommu_bus && iommu_bus->iommu_ops->get_viommu_cap) {
> + return iommu_bus->iommu_ops->get_viommu_cap(iommu_bus->iommu_opaque);
> + }
> + return 0;
> +}
> +
> int pci_pri_request_page(PCIDevice *dev, uint32_t pasid, bool priv_req,
> bool exec_req, hwaddr addr, bool lpig,
> uint16_t prgi, bool is_read, bool is_write)
^ permalink raw reply [flat|nested] 58+ messages in thread
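The dispatch pattern pci_device_get_viommu_cap() implements, an optional ops callback that defaults to 0, can be sketched outside QEMU like this. The struct and names are illustrative stand-ins, not the real PCIIOMMUOps:

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

#define VIOMMU_CAP_STAGE1 (1ULL << 0)   /* bit 0, as defined in the series */

/* Hypothetical reduced PCIIOMMUOps: an optional callback returning a
 * capability bitmap; left NULL when the vIOMMU exposes nothing. */
struct iommu_ops {
    uint64_t (*get_viommu_cap)(void *opaque);
};

/* Mirrors pci_device_get_viommu_cap(): returns 0 when no IOMMU is
 * found on the bus or the callback is not implemented. */
static uint64_t device_get_viommu_cap(const struct iommu_ops *ops,
                                      void *opaque)
{
    if (ops != NULL && ops->get_viommu_cap != NULL) {
        return ops->get_viommu_cap(opaque);
    }
    return 0;
}

/* A stage-1-capable vIOMMU would implement the callback like this. */
static uint64_t stage1_cap(void *opaque)
{
    (void)opaque;
    return VIOMMU_CAP_STAGE1;
}
```

Callers then test individual bits, e.g. `device_get_viommu_cap(...) & VIOMMU_CAP_STAGE1`, which is the consumer pattern patch 04 of this series uses.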
* Re: [PATCH v2 04/19] vfio/iommufd: Force creating nested parent domain
2025-06-20 7:17 ` [PATCH v2 04/19] vfio/iommufd: Force creating nested parent domain Zhenzhong Duan
@ 2025-06-20 8:08 ` Eric Auger
2025-06-23 2:33 ` Duan, Zhenzhong
0 siblings, 1 reply; 58+ messages in thread
From: Eric Auger @ 2025-06-20 8:08 UTC (permalink / raw)
To: Zhenzhong Duan, qemu-devel
Cc: alex.williamson, clg, mst, jasowang, peterx, ddutile, jgg,
nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng
On 6/20/25 9:17 AM, Zhenzhong Duan wrote:
> Call pci_device_get_viommu_cap() to get if vIOMMU supports VIOMMU_CAP_STAGE1,
> if yes, create nested parent domain which could be reused by vIOMMU to create
could? Is there a choice?
> nested domain.
I would add in the commit msg that it is safe because
hw_caps & VIOMMU_CAP_STAGE1 cannot be set yet, since s->flts is forbidden until the last patch of the series.
Eric
>
> Suggested-by: Nicolin Chen <nicolinc@nvidia.com>
> Suggested-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
> hw/vfio/iommufd.c | 14 ++++++++++++++
> 1 file changed, 14 insertions(+)
>
> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
> index d3efef71af..83a632bdee 100644
> --- a/hw/vfio/iommufd.c
> +++ b/hw/vfio/iommufd.c
> @@ -20,6 +20,7 @@
> #include "trace.h"
> #include "qapi/error.h"
> #include "system/iommufd.h"
> +#include "hw/iommu.h"
> #include "hw/qdev-core.h"
> #include "hw/vfio/vfio-cpr.h"
> #include "system/reset.h"
> @@ -352,6 +353,19 @@ static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
> flags = IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
> }
>
> + /*
> + * If vIOMMU supports stage-1 translation, force to create nested parent
> + * domain which could be reused by vIOMMU to create nested domain.
> + */
> + if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
> + VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> +
> + hw_caps = pci_device_get_viommu_cap(&vdev->pdev);
> + if (hw_caps & VIOMMU_CAP_STAGE1) {
> + flags |= IOMMU_HWPT_ALLOC_NEST_PARENT;
> + }
> + }
> +
> if (!iommufd_backend_alloc_hwpt(iommufd, vbasedev->devid,
> container->ioas_id, flags,
> IOMMU_HWPT_DATA_NONE, 0, NULL,
^ permalink raw reply [flat|nested] 58+ messages in thread
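The flag-selection step this patch adds to iommufd_cdev_autodomains_get() reduces to a small pure function. The flag values below are illustrative only, not the real iommufd uapi numbers:

```c
#include <stdint.h>
#include <stdbool.h>
#include <assert.h>

#define VIOMMU_CAP_STAGE1               (1ULL << 0) /* as in the series */
/* Illustrative values; check the real iommufd uapi header for the ABI. */
#define IOMMU_HWPT_ALLOC_DIRTY_TRACKING (1u << 0)
#define IOMMU_HWPT_ALLOC_NEST_PARENT    (1u << 1)

/* Mirrors the logic in the patch: dirty tracking comes from the host
 * hw_caps, while NEST_PARENT is forced whenever the vIOMMU advertises
 * stage-1 support, so the resulting hwpt can parent a nested domain. */
static uint32_t hwpt_alloc_flags(bool hw_dirty_tracking, uint64_t viommu_caps)
{
    uint32_t flags = 0;

    if (hw_dirty_tracking) {
        flags |= IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
    }
    if (viommu_caps & VIOMMU_CAP_STAGE1) {
        flags |= IOMMU_HWPT_ALLOC_NEST_PARENT;
    }
    return flags;
}
```

As Eric notes, at this point in the series the stage-1 branch cannot trigger yet, so the flags are unchanged until the final patch flips the capability on.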
* Re: [PATCH v2 03/19] intel_iommu: Implement get_viommu_cap() callback
2025-06-20 7:17 ` [PATCH v2 03/19] intel_iommu: Implement get_viommu_cap() callback Zhenzhong Duan
@ 2025-06-20 8:10 ` Eric Auger
2025-06-23 2:20 ` Duan, Zhenzhong
0 siblings, 1 reply; 58+ messages in thread
From: Eric Auger @ 2025-06-20 8:10 UTC (permalink / raw)
To: Zhenzhong Duan, qemu-devel
Cc: alex.williamson, clg, mst, jasowang, peterx, ddutile, jgg,
nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Paolo Bonzini, Richard Henderson, Eduardo Habkost,
Marcel Apfelbaum
Hi Zhenzhong,
On 6/20/25 9:17 AM, Zhenzhong Duan wrote:
> Implement get_viommu_cap() callback and expose stage-1 capability for now.
>
> VFIO uses it to create nested parent domain which is further used to create
> nested domain in vIOMMU. All these will be implemented in following patches.
>
> Suggested-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
> MAINTAINERS | 1 +
> include/hw/iommu.h | 14 ++++++++++++++
> hw/i386/intel_iommu.c | 12 ++++++++++++
> 3 files changed, 27 insertions(+)
> create mode 100644 include/hw/iommu.h
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 94c4076127..27817974a6 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -2277,6 +2277,7 @@ F: include/system/iommufd.h
> F: backends/host_iommu_device.c
> F: include/system/host_iommu_device.h
> F: include/qemu/chardev_open.h
> +F: include/hw/iommu.h
> F: util/chardev_open.c
> F: docs/devel/vfio-iommufd.rst
>
> diff --git a/include/hw/iommu.h b/include/hw/iommu.h
> new file mode 100644
> index 0000000000..3c1c08f05d
> --- /dev/null
> +++ b/include/hw/iommu.h
> @@ -0,0 +1,14 @@
> +/*
> + * General vIOMMU capabilities, flags, etc
> + *
> + * Copyright (C) 2025 Intel Corporation.
> + *
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + */
> +
> +#ifndef HW_IOMMU_H
> +#define HW_IOMMU_H
> +
> +#define VIOMMU_CAP_STAGE1 BIT_ULL(0)
I think you shall rather introduce the header and the first enum value
in the previous patch.
Eric
> +
> +#endif /* HW_IOMMU_H */
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index f0b1f90eff..702973da5c 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -24,6 +24,7 @@
> #include "qemu/main-loop.h"
> #include "qapi/error.h"
> #include "hw/sysbus.h"
> +#include "hw/iommu.h"
> #include "intel_iommu_internal.h"
> #include "hw/pci/pci.h"
> #include "hw/pci/pci_bus.h"
> @@ -4412,6 +4413,16 @@ static void vtd_dev_unset_iommu_device(PCIBus *bus, void *opaque, int devfn)
> vtd_iommu_unlock(s);
> }
>
> +static uint64_t vtd_get_viommu_cap(void *opaque)
> +{
> + IntelIOMMUState *s = opaque;
> + uint64_t caps;
> +
> + caps = s->flts ? VIOMMU_CAP_STAGE1 : 0;
> +
> + return caps;
> +}
> +
> /* Unmap the whole range in the notifier's scope. */
> static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
> {
> @@ -4734,6 +4745,7 @@ static PCIIOMMUOps vtd_iommu_ops = {
> .get_address_space = vtd_host_dma_iommu,
> .set_iommu_device = vtd_dev_set_iommu_device,
> .unset_iommu_device = vtd_dev_unset_iommu_device,
> + .get_viommu_cap = vtd_get_viommu_cap,
> };
>
> static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH v2 05/19] hw/pci: Export pci_device_get_iommu_bus_devfn() and return bool
2025-06-20 7:17 ` [PATCH v2 05/19] hw/pci: Export pci_device_get_iommu_bus_devfn() and return bool Zhenzhong Duan
@ 2025-06-20 11:59 ` Eric Auger
2025-06-23 2:47 ` Duan, Zhenzhong
0 siblings, 1 reply; 58+ messages in thread
From: Eric Auger @ 2025-06-20 11:59 UTC (permalink / raw)
To: Zhenzhong Duan, qemu-devel
Cc: alex.williamson, clg, mst, jasowang, peterx, ddutile, jgg,
nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Marcel Apfelbaum
Hi Zhenzhong,
On 6/20/25 9:17 AM, Zhenzhong Duan wrote:
> Returns true if PCI device is aliased or false otherwise. This will be
> used in following patch to determine if a PCI device is under a PCI
> bridge.
>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
> include/hw/pci/pci.h | 2 ++
> hw/pci/pci.c | 12 ++++++++----
> 2 files changed, 10 insertions(+), 4 deletions(-)
>
> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> index 829757b2c2..3029cdf26f 100644
> --- a/include/hw/pci/pci.h
> +++ b/include/hw/pci/pci.h
> @@ -640,6 +640,8 @@ typedef struct PCIIOMMUOps {
> bool is_write);
> } PCIIOMMUOps;
>
> +bool pci_device_get_iommu_bus_devfn(PCIDevice *dev, PCIBus **piommu_bus,
> + PCIBus **aliased_bus, int *aliased_devfn);
If I am correct you have a single caller of the helper using the
returned value, in intel_iommu.c, whereas none of the existing callers
use the returned value. You may simply pass a non-NULL aliased_bus
and aliased_devfn and check whether they differ from the original
bus/devfn. Otherwise the patch looks OK to me.
Thanks
Eric
> AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
> bool pci_device_set_iommu_device(PCIDevice *dev, HostIOMMUDevice *hiod,
> Error **errp);
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index df1fb615a8..87f7c942b3 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -2857,20 +2857,21 @@ static void pci_device_class_base_init(ObjectClass *klass, const void *data)
> * For call sites which don't need aliased BDF, passing NULL to
> * aliased_[bus|devfn] is allowed.
> *
> + * Returns true if PCI device is aliased or false otherwise.
> + *
> * @piommu_bus: return root #PCIBus backed by an IOMMU for the PCI device.
> *
> * @aliased_bus: return aliased #PCIBus of the PCI device, optional.
> *
> * @aliased_devfn: return aliased devfn of the PCI device, optional.
> */
> -static void pci_device_get_iommu_bus_devfn(PCIDevice *dev,
> - PCIBus **piommu_bus,
> - PCIBus **aliased_bus,
> - int *aliased_devfn)
> +bool pci_device_get_iommu_bus_devfn(PCIDevice *dev, PCIBus **piommu_bus,
> + PCIBus **aliased_bus, int *aliased_devfn)
> {
> PCIBus *bus = pci_get_bus(dev);
> PCIBus *iommu_bus = bus;
> int devfn = dev->devfn;
> + bool aliased = false;
>
> while (iommu_bus && !iommu_bus->iommu_ops && iommu_bus->parent_dev) {
> PCIBus *parent_bus = pci_get_bus(iommu_bus->parent_dev);
> @@ -2907,6 +2908,7 @@ static void pci_device_get_iommu_bus_devfn(PCIDevice *dev,
> devfn = parent->devfn;
> bus = parent_bus;
> }
> + aliased = true;
> }
>
> iommu_bus = parent_bus;
> @@ -2928,6 +2930,8 @@ static void pci_device_get_iommu_bus_devfn(PCIDevice *dev,
> if (aliased_devfn) {
> *aliased_devfn = devfn;
> }
> +
> + return aliased;
> }
>
> AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
^ permalink raw reply [flat|nested] 58+ messages in thread
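Eric's alternative, deriving the aliased flag on the caller side instead of changing the helper's return type, amounts to a comparison like this (a hypothetical sketch; names are not from QEMU):

```c
#include <stdint.h>
#include <stdbool.h>
#include <assert.h>

/* Under a conventional PCI bridge, requests from downstream devices are
 * taken by the IOMMU as originating from the bridge itself, so the
 * (bus, devfn) seen by the IOMMU differs from the device's own. The
 * caller-side check Eric suggests: a device is aliased exactly when the
 * aliased BDF returned by the helper differs from the original BDF. */
static bool device_is_aliased(uint8_t orig_bus, uint8_t orig_devfn,
                              uint8_t aliased_bus, uint8_t aliased_devfn)
{
    return orig_bus != aliased_bus || orig_devfn != aliased_devfn;
}
```

The trade-off is that the in-helper flag (the patch's approach) also covers corner cases where the walk rewrites the devfn without changing the bus, which this simple comparison handles identically, at the cost of exporting a changed signature.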
* Re: [PATCH v2 07/19] intel_iommu: Check for compatibility with IOMMUFD backed device when x-flts=on
2025-06-20 7:18 ` [PATCH v2 07/19] intel_iommu: Check for compatibility with IOMMUFD backed device when x-flts=on Zhenzhong Duan
@ 2025-06-20 12:05 ` Eric Auger
2025-06-23 2:44 ` Duan, Zhenzhong
0 siblings, 1 reply; 58+ messages in thread
From: Eric Auger @ 2025-06-20 12:05 UTC (permalink / raw)
To: Zhenzhong Duan, qemu-devel
Cc: alex.williamson, clg, mst, jasowang, peterx, ddutile, jgg,
nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Paolo Bonzini, Richard Henderson, Eduardo Habkost,
Marcel Apfelbaum
Hi Zhenzhong,
On 6/20/25 9:18 AM, Zhenzhong Duan wrote:
> When vIOMMU is configured x-flts=on in scalable mode, stage-1 page table
> is passed to host to construct nested page table. We need to check
> compatibility of some critical IOMMU capabilities between vIOMMU and
> host IOMMU to ensure guest stage-1 page table could be used by host.
>
> For instance, vIOMMU supports stage-1 1GB huge page mapping, but host
> does not, then this IOMMUFD backed device should be failed.
s/should be/should
You may add in the commit message that even if the checks pass, for now
we willingly reject the association because all the bits are not there yet.
>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
> hw/i386/intel_iommu_internal.h | 1 +
> hw/i386/intel_iommu.c | 28 ++++++++++++++++++++++++++++
> 2 files changed, 29 insertions(+)
>
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> index 7aba259ef8..18bc22fc72 100644
> --- a/hw/i386/intel_iommu_internal.h
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -192,6 +192,7 @@
> #define VTD_ECAP_PT (1ULL << 6)
> #define VTD_ECAP_SC (1ULL << 7)
> #define VTD_ECAP_MHMV (15ULL << 20)
> +#define VTD_ECAP_NEST (1ULL << 26)
> #define VTD_ECAP_SRS (1ULL << 31)
> #define VTD_ECAP_PASID (1ULL << 40)
> #define VTD_ECAP_SMTS (1ULL << 43)
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index e90fd2f28f..1c79efc1cb 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -40,6 +40,7 @@
> #include "kvm/kvm_i386.h"
> #include "migration/vmstate.h"
> #include "trace.h"
> +#include "system/iommufd.h"
>
> /* context entry operations */
> #define VTD_CE_GET_RID2PASID(ce) \
> @@ -4355,6 +4356,33 @@ static bool vtd_check_hiod(IntelIOMMUState *s, HostIOMMUDevice *hiod,
> return true;
> }
>
> +#ifdef CONFIG_IOMMUFD
> + struct HostIOMMUDeviceCaps *caps = &hiod->caps;
> + struct iommu_hw_info_vtd *vtd = &caps->vendor_caps.vtd;
> +
> + /* Remaining checks are all stage-1 translation specific */
> + if (!object_dynamic_cast(OBJECT(hiod), TYPE_HOST_IOMMU_DEVICE_IOMMUFD)) {
> + error_setg(errp, "Need IOMMUFD backend when x-flts=on");
> + return false;
> + }
> +
> + if (caps->type != IOMMU_HW_INFO_TYPE_INTEL_VTD) {
> + error_setg(errp, "Incompatible host platform IOMMU type %d",
> + caps->type);
> + return false;
> + }
> +
> + if (!(vtd->ecap_reg & VTD_ECAP_NEST)) {
> + error_setg(errp, "Host IOMMU doesn't support nested translation");
> + return false;
> + }
> +
> + if (s->fs1gp && !(vtd->cap_reg & VTD_CAP_FS1GP)) {
> + error_setg(errp, "Stage-1 1GB huge page is unsupported by host IOMMU");
> + return false;
> + }
> +#endif
> +
> error_setg(errp, "host device is uncompatible with stage-1 translation");
s/uncompatible/incompatible
s/device/IOMMU
> return false;
> }
^ permalink raw reply [flat|nested] 58+ messages in thread
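The two capability checks this patch adds can be modeled as a standalone validator returning an error string, or NULL on success. VTD_ECAP_NEST matches the patch; the VTD_CAP_FS1GP bit position below is illustrative, not taken from the spec:

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>
#include <assert.h>

#define VTD_ECAP_NEST (1ULL << 26)   /* as defined in the patch */
#define VTD_CAP_FS1GP (1ULL << 55)   /* bit position illustrative only */

/* Mirrors the stage-1 compatibility checks in vtd_check_hiod():
 * the host must support nested translation, and if the guest vIOMMU
 * advertises stage-1 1GB pages (fs1gp), the host must too. Returns
 * NULL when compatible, or a description of the failed check. */
static const char *stage1_compat_check(uint64_t host_ecap, uint64_t host_cap,
                                       bool guest_fs1gp)
{
    if (!(host_ecap & VTD_ECAP_NEST)) {
        return "Host IOMMU doesn't support nested translation";
    }
    if (guest_fs1gp && !(host_cap & VTD_CAP_FS1GP)) {
        return "Stage-1 1GB huge page is unsupported by host IOMMU";
    }
    return NULL;
}
```

The direction of the huge-page check matters: the guest advertising a feature the host lacks must fail, while the host supporting more than the guest uses is harmless, which is why only the `guest_fs1gp && !host` combination rejects.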
* Re: [PATCH v2 08/19] intel_iommu: Fail passthrough device under PCI bridge if x-flts=on
2025-06-20 7:18 ` [PATCH v2 08/19] intel_iommu: Fail passthrough device under PCI bridge if x-flts=on Zhenzhong Duan
@ 2025-06-20 12:18 ` Eric Auger
2025-06-23 3:20 ` Duan, Zhenzhong
0 siblings, 1 reply; 58+ messages in thread
From: Eric Auger @ 2025-06-20 12:18 UTC (permalink / raw)
To: Zhenzhong Duan, qemu-devel
Cc: alex.williamson, clg, mst, jasowang, peterx, ddutile, jgg,
nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
Eduardo Habkost
Hi Zhenzhong,
On 6/20/25 9:18 AM, Zhenzhong Duan wrote:
> Currently we don't support nested translation for passthrough device
> with emulated device under same PCI bridge.
I don't understand the above. Do you want to forbid a configuration
where both a passthrough'ed device and an emulated device would be
downstream to a PCI bridge and protected by nested IOMMU?
If I am not wrong, in the patch you don't check the coexistence of emulated
and passthrough'ed devices but simply check whether the host assigned
device is downstream to a PCI bridge. So if I am not wrong this is not
really aligned with the commit msg description.
>
> Reason is for emulated devices, AS should switch to iommu MR, while for
> passthrough devices, it needs the AS stick with the system MR hence be
> able to keep the VFIO container IOAS as a GPA IOAS. To support this, let
> AS switch to iommu MR and have a separate GPA IOAS is needed, but that
> brings a new memory listener which duplicates with VFIO memory listener.
I have difficulty parsing the above sentence
>
> For trade off, we choose to not support this special scenario because
> PCIE bridge is more popular than PCI bridge now.
>
> Suggested-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
> hw/i386/intel_iommu.c | 13 +++++++++++--
> 1 file changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 1c79efc1cb..9d4adc9458 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -4330,9 +4330,10 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus,
> return vtd_dev_as;
> }
>
> -static bool vtd_check_hiod(IntelIOMMUState *s, HostIOMMUDevice *hiod,
> +static bool vtd_check_hiod(IntelIOMMUState *s, VTDHostIOMMUDevice *vtd_hiod,
> Error **errp)
> {
> + HostIOMMUDevice *hiod = vtd_hiod->hiod;
> HostIOMMUDeviceClass *hiodc = HOST_IOMMU_DEVICE_GET_CLASS(hiod);
> int ret;
>
> @@ -4359,6 +4360,8 @@ static bool vtd_check_hiod(IntelIOMMUState *s, HostIOMMUDevice *hiod,
> #ifdef CONFIG_IOMMUFD
> struct HostIOMMUDeviceCaps *caps = &hiod->caps;
> struct iommu_hw_info_vtd *vtd = &caps->vendor_caps.vtd;
> + PCIBus *bus = vtd_hiod->bus;
> + PCIDevice *pdev = pci_find_device(bus, pci_bus_num(bus), vtd_hiod->devfn);
>
> /* Remaining checks are all stage-1 translation specific */
> if (!object_dynamic_cast(OBJECT(hiod), TYPE_HOST_IOMMU_DEVICE_IOMMUFD)) {
> @@ -4381,6 +4384,12 @@ static bool vtd_check_hiod(IntelIOMMUState *s, HostIOMMUDevice *hiod,
> error_setg(errp, "Stage-1 1GB huge page is unsupported by host IOMMU");
> return false;
> }
> +
> + if (pci_device_get_iommu_bus_devfn(pdev, &bus, NULL, NULL)) {
> + error_setg(errp, "Host device under PCI bridge is unsupported "
> + "when x-flts=on");
so now the compatibility also comes from the device and not only from
the host IOMMU caps. (referring to my previous comment about
s/device/iommu in error msg)
> + return false;
> + }
> #endif
>
> error_setg(errp, "host device is uncompatible with stage-1 translation");
> @@ -4414,7 +4423,7 @@ static bool vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
> vtd_hiod->iommu_state = s;
> vtd_hiod->hiod = hiod;
>
> - if (!vtd_check_hiod(s, hiod, errp)) {
> + if (!vtd_check_hiod(s, vtd_hiod, errp)) {
> g_free(vtd_hiod);
> vtd_iommu_unlock(s);
> return false;
Thanks
Eric
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH v2 09/19] intel_iommu: Introduce two helpers vtd_as_from/to_iommu_pasid_locked
2025-06-20 7:18 ` [PATCH v2 09/19] intel_iommu: Introduce two helpers vtd_as_from/to_iommu_pasid_locked Zhenzhong Duan
@ 2025-06-20 12:46 ` Eric Auger
2025-06-24 2:48 ` Duan, Zhenzhong
0 siblings, 1 reply; 58+ messages in thread
From: Eric Auger @ 2025-06-20 12:46 UTC (permalink / raw)
To: Zhenzhong Duan, qemu-devel
Cc: alex.williamson, clg, mst, jasowang, peterx, ddutile, jgg,
nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Paolo Bonzini, Richard Henderson, Eduardo Habkost,
Marcel Apfelbaum
Hi Zhenzhong,
On 6/20/25 9:18 AM, Zhenzhong Duan wrote:
> PCI device supports two request types, Requests-without-PASID and
> Requests-with-PASID. Requests-without-PASID doesn't include a PASID TLP
> prefix, IOMMU fetches rid_pasid from context entry and use it as IOMMU's
> pasid to index pasid table.
>
> So we need to translate between PCI's pasid and IOMMU's pasid specially
> for Requests-without-PASID, e.g., PCI_NO_PASID(-1) <-> rid_pasid.
> For Requests-with-PASID, PCI's pasid and IOMMU's pasid are same value.
>
> vtd_as_from_iommu_pasid_locked() translates from BDF+iommu_pasid to vtd_as
> which contains PCI's pasid vtd_as->pasid.
>
> vtd_as_to_iommu_pasid_locked() translates from BDF+vtd_as->pasid to iommu_pasid.
>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
> hw/i386/intel_iommu.c | 58 +++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 58 insertions(+)
>
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 9d4adc9458..8948b8370f 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -1602,6 +1602,64 @@ static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
> return 0;
> }
>
> +static inline int vtd_as_to_iommu_pasid_locked(VTDAddressSpace *vtd_as,
> + uint32_t *pasid)
Is it meaningful to use inline here and below? Below I guess you do so
to avoid a "defined but not used" compilation error, but I don't think
it should stay as is.
I don't really understand the iommu_pasid terminology. Either it is a
pasid passed through the PCI transaction or it is the default pasid
found in the rid2pasid ce field. So that's a pasid both ways ;-) can't you
simply call it pasid?
> +{
> + VTDContextCacheEntry *cc_entry = &vtd_as->context_cache_entry;
> + IntelIOMMUState *s = vtd_as->iommu_state;
> + uint8_t bus_num = pci_bus_num(vtd_as->bus);
> + uint8_t devfn = vtd_as->devfn;
> + VTDContextEntry ce;
> + int ret;
> +
> + if (cc_entry->context_cache_gen == s->context_cache_gen) {
> + ce = cc_entry->context_entry;
> + } else {
> + ret = vtd_dev_to_context_entry(s, bus_num, devfn, &ce);
> + if (ret) {
> + return ret;
> + }
> + }
if the above pattern is used at many locations I still think it may be
valuable to have a _locked helper.
> +
> + /* Translate to iommu pasid if PCI_NO_PASID */
> + if (vtd_as->pasid == PCI_NO_PASID) {
> + *pasid = VTD_CE_GET_RID2PASID(&ce);
> + } else {
> + *pasid = vtd_as->pasid;
> + }
> +
> + return 0;
> +}
> +
> +static gboolean vtd_find_as_by_sid_and_iommu_pasid(gpointer key, gpointer value,
> + gpointer user_data)
> +{
> + VTDAddressSpace *vtd_as = (VTDAddressSpace *)value;
> + struct vtd_as_raw_key *target = (struct vtd_as_raw_key *)user_data;
why target? can't you name it key instead?
> + uint16_t sid = PCI_BUILD_BDF(pci_bus_num(vtd_as->bus), vtd_as->devfn);
> + uint32_t pasid;
> +
> + if (vtd_as_to_iommu_pasid_locked(vtd_as, &pasid)) {
> + return false;
> + }
> +
> + return (pasid == target->pasid) && (sid == target->sid);
> +}
> +
> +/* Translate iommu pasid to vtd_as */
same here
> +static inline
> +VTDAddressSpace *vtd_as_from_iommu_pasid_locked(IntelIOMMUState *s,
> + uint16_t sid, uint32_t pasid)
> +{
> + struct vtd_as_raw_key key = {
> + .sid = sid,
> + .pasid = pasid
> + };
> +
> + return g_hash_table_find(s->vtd_address_spaces,
> + vtd_find_as_by_sid_and_iommu_pasid, &key);
> +}
> +
> static int vtd_sync_shadow_page_hook(const IOMMUTLBEvent *event,
> void *private)
> {
Thanks
Eric
^ permalink raw reply [flat|nested] 58+ messages in thread
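The PASID translation the commit message describes, PCI_NO_PASID mapping to the context entry's RID2PASID value, is a one-line rule that can be sketched as follows. The sentinel definition is hypothetical; the real QEMU macro and types may differ:

```c
#include <stdint.h>
#include <assert.h>

/* Hypothetical sentinel for "request without PASID"; QEMU's actual
 * PCI_NO_PASID definition may differ. A signed type is used here so
 * the sentinel cannot collide with a valid 20-bit PASID value. */
#define PCI_NO_PASID ((int64_t)-1)

/* Requests-without-PASID carry no PASID TLP prefix, so the IOMMU falls
 * back to the RID2PASID value fetched from the context entry; for
 * Requests-with-PASID the transaction's PASID is used unchanged. This
 * mirrors the PCI_NO_PASID branch in vtd_as_to_iommu_pasid_locked(). */
static uint32_t as_to_iommu_pasid(int64_t as_pasid, uint32_t rid2pasid)
{
    return as_pasid == PCI_NO_PASID ? rid2pasid : (uint32_t)as_pasid;
}
```

The reverse lookup in the patch (vtd_as_from_iommu_pasid_locked) is a hash-table scan applying this same rule per address space and comparing against the target SID/PASID pair.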
* Re: [PATCH v2 10/19] intel_iommu: Handle PASID entry removing and updating
2025-06-20 7:18 ` [PATCH v2 10/19] intel_iommu: Handle PASID entry removing and updating Zhenzhong Duan
@ 2025-06-20 15:44 ` Eric Auger
2025-06-24 3:34 ` Duan, Zhenzhong
0 siblings, 1 reply; 58+ messages in thread
From: Eric Auger @ 2025-06-20 15:44 UTC (permalink / raw)
To: Zhenzhong Duan, qemu-devel
Cc: alex.williamson, clg, mst, jasowang, peterx, ddutile, jgg,
nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng, Yi Sun,
Paolo Bonzini, Richard Henderson, Eduardo Habkost,
Marcel Apfelbaum
Hi Zhenzhong,
On 6/20/25 9:18 AM, Zhenzhong Duan wrote:
I would suggest: "Handle PASID entry removal and update" instead of the verb forms.
> This adds a new entry VTDPASIDCacheEntry in VTDAddressSpace to cache the
> pasid entry and track PASID usage and future PASID tagged DMA address
> translation support in vIOMMU.
>
> VTDAddressSpace of PCI_NO_PASID is allocated when device is plugged and
> never freed. For other pasid, VTDAddressSpace instance is created/destroyed
> per the guest pasid entry set up/destroy for passthrough devices. While for
> emulated devices, VTDAddressSpace instance is created in the PASID tagged DMA
> translation and be destroyed per guest PASID cache invalidation. This focuses
s/be destroyed/destroyed
> on the PASID cache management for passthrough devices as there is no PASID
> capable emulated devices yet.
if you don't handle the emulated device case, maybe omit talking about
it here.
>
> When guest modifies a PASID entry, QEMU will capture the guest pasid selective
> pasid cache invalidation, allocate or remove a VTDAddressSpace instance per the
> invalidation reasons:
>
> a) a present pasid entry moved to non-present
> b) a present pasid entry to be a present entry
> c) a non-present pasid entry moved to present
>
> This handles a) and b), following patch will handle c).
This -> This patch
>
> vIOMMU emulator could figure out the reason by fetching latest guest pasid entry
> and compare it with the PASID cache.
>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
> hw/i386/intel_iommu_internal.h | 27 ++++
> include/hw/i386/intel_iommu.h | 6 +
> hw/i386/intel_iommu.c | 265 +++++++++++++++++++++++++++++++--
> hw/i386/trace-events | 3 +
> 4 files changed, 291 insertions(+), 10 deletions(-)
>
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> index 18bc22fc72..01c881ed4d 100644
> --- a/hw/i386/intel_iommu_internal.h
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -315,6 +315,7 @@ typedef enum VTDFaultReason {
> * request while disabled */
> VTD_FR_IR_SID_ERR = 0x26, /* Invalid Source-ID */
>
> + VTD_FR_RTADDR_INV_TTM = 0x31, /* Invalid TTM in RTADDR */
> /* PASID directory entry access failure */
> VTD_FR_PASID_DIR_ACCESS_ERR = 0x50,
> /* The Present(P) field of pasid directory entry is 0 */
> @@ -492,6 +493,15 @@ typedef union VTDInvDesc VTDInvDesc;
> #define VTD_INV_DESC_PIOTLB_RSVD_VAL0 0xfff000000000f1c0ULL
> #define VTD_INV_DESC_PIOTLB_RSVD_VAL1 0xf80ULL
>
Adding some basic comments for readers like me who are less versed in
the spec would help make the link with the VT-d spec:
PASID-cache Invalidate Descriptor (pc_inv_dsc) fields
> +#define VTD_INV_DESC_PASIDC_G (3ULL << 4)
> +#define VTD_INV_DESC_PASIDC_PASID(val) (((val) >> 32) & 0xfffffULL)
> +#define VTD_INV_DESC_PASIDC_DID(val) (((val) >> 16) & VTD_DOMAIN_ID_MASK)
> +#define VTD_INV_DESC_PASIDC_RSVD_VAL0 0xfff000000000f1c0ULL
> +
> +#define VTD_INV_DESC_PASIDC_DSI (0ULL << 4)
> +#define VTD_INV_DESC_PASIDC_PASID_SI (1ULL << 4)
> +#define VTD_INV_DESC_PASIDC_GLOBAL (3ULL << 4)
As those are values for the granularity field, using a
VTD_INV_DESC_PASIDC_G_* prefix looks relevant to me.
I also think you would gain in readability by adopting the extract32/64
syntax used in hw/arm/smmuv3-internal.h.
> +
> /* Information about page-selective IOTLB invalidate */
> struct VTDIOTLBPageInvInfo {
> uint16_t domain_id;
> @@ -552,6 +562,22 @@ typedef struct VTDRootEntry VTDRootEntry;
> #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw) (0x1e0ULL | ~VTD_HAW_MASK(aw))
> #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1 0xffffffffffe00000ULL
>
> +typedef enum VTDPCInvType {
> + /* pasid cache invalidation rely on guest PASID entry */
> + VTD_PASID_CACHE_GLOBAL_INV, /* pasid cache global invalidation */
> + VTD_PASID_CACHE_DOMSI, /* pasid cache domain selective invalidation */
> + VTD_PASID_CACHE_PASIDSI, /* pasid cache pasid selective invalidation */
> +} VTDPCInvType;
> +
> +typedef struct VTDPASIDCacheInfo {
> + VTDPCInvType type;
> + uint16_t domain_id;
did?
> + uint32_t pasid;
> + PCIBus *bus;
> + uint16_t devfn;
> + bool error_happened;
wouldn't it make more sense to store an Error * directly so that you can
fill it whenever an error occurs?
Something like VFIOContainerBase::error
> +} VTDPASIDCacheInfo;
> +
> /* PASID Table Related Definitions */
> #define VTD_PASID_DIR_BASE_ADDR_MASK (~0xfffULL)
> #define VTD_PASID_TABLE_BASE_ADDR_MASK (~0xfffULL)
> @@ -563,6 +589,7 @@ typedef struct VTDRootEntry VTDRootEntry;
> #define VTD_PASID_TABLE_BITS_MASK (0x3fULL)
> #define VTD_PASID_TABLE_INDEX(pasid) ((pasid) & VTD_PASID_TABLE_BITS_MASK)
> #define VTD_PASID_ENTRY_FPD (1ULL << 1) /* Fault Processing Disable */
> +#define VTD_PASID_TBL_ENTRY_NUM (1ULL << 6)
>
> /* PASID Granular Translation Type Mask */
> #define VTD_PASID_ENTRY_P 1ULL
> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> index 50f9b27a45..fbc9da903a 100644
> --- a/include/hw/i386/intel_iommu.h
> +++ b/include/hw/i386/intel_iommu.h
> @@ -95,6 +95,11 @@ struct VTDPASIDEntry {
> uint64_t val[8];
> };
>
> +typedef struct VTDPASIDCacheEntry {
> + struct VTDPASIDEntry pasid_entry;
> + bool cache_filled;
> +} VTDPASIDCacheEntry;
> +
> struct VTDAddressSpace {
> PCIBus *bus;
> uint8_t devfn;
> @@ -107,6 +112,7 @@ struct VTDAddressSpace {
> MemoryRegion iommu_ir_fault; /* Interrupt region for catching fault */
> IntelIOMMUState *iommu_state;
> VTDContextCacheEntry context_cache_entry;
> + VTDPASIDCacheEntry pasid_cache_entry;
> QLIST_ENTRY(VTDAddressSpace) next;
> /* Superset of notifier flags that this address space has */
> IOMMUNotifierFlag notifier_flags;
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 8948b8370f..1db581d14a 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -826,11 +826,24 @@ static inline bool vtd_pe_type_check(IntelIOMMUState *s, VTDPASIDEntry *pe)
> }
> }
>
> +static inline uint16_t vtd_pe_get_did(VTDPASIDEntry *pe)
> +{
> + return VTD_SM_PASID_ENTRY_DID((pe)->val[1]);
see my previous comment about extract*
> +}
> +
> static inline bool vtd_pdire_present(VTDPASIDDirEntry *pdire)
> {
> return pdire->val & 1;
> }
>
> +static inline void pasid_cache_info_set_error(VTDPASIDCacheInfo *pc_info)
not sure whether this helper is really needed
> +{
> + if (pc_info->error_happened) {
Besides, why not simply overwrite it unconditionally?
> + return;
> + }
> + pc_info->error_happened = true;
> +}
> +
> /**
> * Caller of this function should check present bit if wants
> * to use pdir entry for further usage except for fpd bit check.
> @@ -3103,6 +3116,241 @@ static bool vtd_process_piotlb_desc(IntelIOMMUState *s,
> return true;
> }
>
> +static inline int vtd_dev_get_pe_from_pasid(VTDAddressSpace *vtd_as,
> + uint32_t pasid, VTDPASIDEntry *pe)
> +{
> + IntelIOMMUState *s = vtd_as->iommu_state;
> + VTDContextEntry ce;
> + int ret;
> +
> + if (!s->root_scalable) {
> + return -VTD_FR_RTADDR_INV_TTM;
> + }
> +
> + ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus), vtd_as->devfn,
> + &ce);
> + if (ret) {
> + return ret;
> + }
> +
> + return vtd_ce_get_pasid_entry(s, &ce, pe, pasid);
> +}
> +
> +static bool vtd_pasid_entry_compare(VTDPASIDEntry *p1, VTDPASIDEntry *p2)
> +{
> + return !memcmp(p1, p2, sizeof(*p1));
> +}
> +
> +/*
> + * This function fills in the pasid entry in &vtd_as. Caller
> + * of this function should hold iommu_lock.
> + */
> +static int vtd_fill_pe_in_cache(IntelIOMMUState *s, VTDAddressSpace *vtd_as,
> + VTDPASIDEntry *pe)
> +{
> + VTDPASIDCacheEntry *pc_entry = &vtd_as->pasid_cache_entry;
> +
> + if (vtd_pasid_entry_compare(pe, &pc_entry->pasid_entry)) {
> + /* No need to go further as cached pasid entry is latest */
> + return 0;
> + }
> +
> + pc_entry->pasid_entry = *pe;
> + pc_entry->cache_filled = true;
> +
> + /*
> + * TODO: send pasid bind to host for passthru devices
> + */
> +
> + return 0;
> +}
> +
> +/*
> + * This function is used to update or clear cached pasid entry in vtd_as
> + * instances. Caller of this function should hold iommu_lock.
> + */
> +static gboolean vtd_flush_pasid(gpointer key, gpointer value,
> + gpointer user_data)
> +{
> + VTDPASIDCacheInfo *pc_info = user_data;
> + VTDAddressSpace *vtd_as = value;
> + IntelIOMMUState *s = vtd_as->iommu_state;
> + VTDPASIDCacheEntry *pc_entry = &vtd_as->pasid_cache_entry;
> + VTDPASIDEntry pe;
> + uint16_t did;
> + uint32_t pasid;
> + int ret;
> +
> + if (!pc_entry->cache_filled) {
> + return false;
> + }
> + did = vtd_pe_get_did(&pc_entry->pasid_entry);
> +
> + if (vtd_as_to_iommu_pasid_locked(vtd_as, &pasid)) {
> + goto remove;
> + }
> +
> + switch (pc_info->type) {
> + case VTD_PASID_CACHE_PASIDSI:
> + if (pc_info->pasid != pasid) {
> + return false;
> + }
> + /* Fall through */
> + case VTD_PASID_CACHE_DOMSI:
> + if (pc_info->domain_id != did) {
> + return false;
> + }
> + /* Fall through */
> + case VTD_PASID_CACHE_GLOBAL_INV:
> + break;
> + default:
> + error_report("invalid pc_info->type");
> + abort();
> + }
> +
> + /*
> + * pasid cache invalidation may indicate a present pasid
> + * entry to present pasid entry modification. To cover such
> + * case, vIOMMU emulator needs to fetch latest guest pasid
> + * entry and check cached pasid entry, then update pasid
> + * cache and send pasid bind/unbind to host properly.
if you don't do that in this patch, I would put it in a subsequent
patch. Focus on the PASID cache in this patch. See my subsequent comment.
> + */
> + ret = vtd_dev_get_pe_from_pasid(vtd_as, pasid, &pe);
> + if (ret) {
> + /*
> + * No valid pasid entry in guest memory. e.g. pasid entry
> + * was modified to be either all-zero or non-present. Either
> + * case means existing pasid cache should be removed.
> + */
> + goto remove;
> + }
> +
> + if (vtd_fill_pe_in_cache(s, vtd_as, &pe)) {
> + pasid_cache_info_set_error(pc_info);
> + }
> + return false;
> +
> +remove:
> + /*
> + * TODO: send pasid unbind to host for passthru devices
> + */
> + pc_entry->cache_filled = false;
> +
> + /*
> + * Don't remove address space of PCI_NO_PASID which is created by PCI
> + * sub-system.
> + */
> + if (vtd_as->pasid == PCI_NO_PASID) {
> + return false;
> + }
> + return true;
> +}
> +
> +/*
> + * This function syncs the pasid bindings between guest and host.
> + * It includes updating the pasid cache in vIOMMU and updating the
> + * pasid bindings per guest's latest pasid entry presence.
> + */
> +static void vtd_pasid_cache_sync(IntelIOMMUState *s,
> + VTDPASIDCacheInfo *pc_info)
> +{
> + if (!s->flts || !s->root_scalable || !s->dmar_enabled) {
you don't update error_happened. Is that OK?
> + return;
> + }
> +
> + /*
> + * Regards to a pasid cache invalidation, e.g. a PSI.
> + * it could be either cases of below:
> + * a) a present pasid entry moved to non-present
> + * b) a present pasid entry to be a present entry
> + * c) a non-present pasid entry moved to present
> + *
> + * Different invalidation granularity may affect different device
> + * scope and pasid scope. But for each invalidation granularity,
> + * it needs to do two steps to sync host and guest pasid binding.
> + *
> + * Here is the handling of a PSI:
> + * 1) loop all the existing vtd_as instances to update them
> + * according to the latest guest pasid entry in pasid table.
> + * this will make sure affected existing vtd_as instances
> + * cached the latest pasid entries. Also, during the loop, the
> + * host should be notified if needed. e.g. pasid unbind or pasid
> + * update. Should be able to cover case a) and case b).
> + *
> + * 2) loop all devices to cover case c)
> + * - For devices which are backed by HostIOMMUDeviceIOMMUFD instances,
> + * we loop them and check if guest pasid entry exists. If yes,
> + * it is case c), we update the pasid cache and also notify
> + * host.
> + * - For devices which are not backed by HostIOMMUDeviceIOMMUFD,
> + * it is not necessary to create pasid cache at this phase since
> + * it could be created when vIOMMU does DMA address translation.
> + * This is not yet implemented since there is no emulated
> + * pasid-capable devices today. If we have such devices in
> + * future, the pasid cache shall be created there.
> + * Other granularity follow the same steps, just with different scope
I would put all the stuff related to interactions with host in a
subsequent patch. This patch could concentrate on the IOMMU PASID cache
only. And then you would add the extra complexity of syncs with the
host. I think it would simplify the review.
> + *
> + */
> +
> + vtd_iommu_lock(s);
> + /*
> + * Step 1: loop all the existing vtd_as instances for pasid unbind and
> + * update.
> + */
> + g_hash_table_foreach_remove(s->vtd_address_spaces, vtd_flush_pasid,
> + pc_info);
> + vtd_iommu_unlock(s);
> +
> + /* TODO: Step 2: loop all the existing vtd_hiod instances for pasid bind. */
> +}
> +
> +static bool vtd_process_pasid_desc(IntelIOMMUState *s,
> + VTDInvDesc *inv_desc)
> +{
> + uint16_t domain_id;
> + uint32_t pasid;
> + VTDPASIDCacheInfo pc_info = {};
> + uint64_t mask[4] = {VTD_INV_DESC_PASIDC_RSVD_VAL0, VTD_INV_DESC_ALL_ONE,
> + VTD_INV_DESC_ALL_ONE, VTD_INV_DESC_ALL_ONE};
> +
> + if (!vtd_inv_desc_reserved_check(s, inv_desc, mask, true,
> + __func__, "pasid cache inv")) {
> + return false;
> + }
> +
> + domain_id = VTD_INV_DESC_PASIDC_DID(inv_desc->val[0]);
> + pasid = VTD_INV_DESC_PASIDC_PASID(inv_desc->val[0]);
> +
> + switch (inv_desc->val[0] & VTD_INV_DESC_PASIDC_G) {
> + case VTD_INV_DESC_PASIDC_DSI:
> + trace_vtd_pasid_cache_dsi(domain_id);
> + pc_info.type = VTD_PASID_CACHE_DOMSI;
> + pc_info.domain_id = domain_id;
> + break;
> +
> + case VTD_INV_DESC_PASIDC_PASID_SI:
> + /* PASID selective implies a DID selective */
> + trace_vtd_pasid_cache_psi(domain_id, pasid);
> + pc_info.type = VTD_PASID_CACHE_PASIDSI;
> + pc_info.domain_id = domain_id;
> + pc_info.pasid = pasid;
> + break;
> +
> + case VTD_INV_DESC_PASIDC_GLOBAL:
> + trace_vtd_pasid_cache_gsi();
> + pc_info.type = VTD_PASID_CACHE_GLOBAL_INV;
> + break;
> +
> + default:
> + error_report_once("invalid-inv-granu-in-pc_inv_desc hi: 0x%" PRIx64
Make it end-user understandable? e.g. "invalid granularity field in
PASID-cache invalidate descriptor".
> + " lo: 0x%" PRIx64, inv_desc->val[1], inv_desc->val[0]);
> + return false;
> + }
> +
> + vtd_pasid_cache_sync(s, &pc_info);
> + return !pc_info.error_happened ? true : false;
> +}
> +
> static bool vtd_process_inv_iec_desc(IntelIOMMUState *s,
> VTDInvDesc *inv_desc)
> {
> @@ -3264,6 +3512,13 @@ static bool vtd_process_inv_desc(IntelIOMMUState *s)
> }
> break;
>
> + case VTD_INV_DESC_PC:
> + trace_vtd_inv_desc("pasid-cache", inv_desc.val[1], inv_desc.val[0]);
> + if (!vtd_process_pasid_desc(s, &inv_desc)) {
> + return false;
> + }
> + break;
> +
> case VTD_INV_DESC_PIOTLB:
> trace_vtd_inv_desc("p-iotlb", inv_desc.val[1], inv_desc.val[0]);
> if (!vtd_process_piotlb_desc(s, &inv_desc)) {
> @@ -3299,16 +3554,6 @@ static bool vtd_process_inv_desc(IntelIOMMUState *s)
> }
> break;
>
> - /*
> - * TODO: the entity of below two cases will be implemented in future series.
> - * To make guest (which integrates scalable mode support patch set in
> - * iommu driver) work, just return true is enough so far.
> - */
> - case VTD_INV_DESC_PC:
> - if (s->scalable_mode) {
> - break;
> - }
> - /* fallthrough */
> default:
> error_report_once("%s: invalid inv desc: hi=%"PRIx64", lo=%"PRIx64
> " (unknown type)", __func__, inv_desc.hi,
> diff --git a/hw/i386/trace-events b/hw/i386/trace-events
> index ac9e1a10aa..ae5bbfcdc0 100644
> --- a/hw/i386/trace-events
> +++ b/hw/i386/trace-events
> @@ -24,6 +24,9 @@ vtd_inv_qi_head(uint16_t head) "read head %d"
> vtd_inv_qi_tail(uint16_t head) "write tail %d"
> vtd_inv_qi_fetch(void) ""
> vtd_context_cache_reset(void) ""
> +vtd_pasid_cache_gsi(void) ""
> +vtd_pasid_cache_dsi(uint16_t domain) "Domain selective PC invalidation domain 0x%"PRIx16
Would use pasid cache instead of PC
> +vtd_pasid_cache_psi(uint16_t domain, uint32_t pasid) "PASID selective PC invalidation domain 0x%"PRIx16" pasid 0x%"PRIx32
> vtd_re_not_present(uint8_t bus) "Root entry bus %"PRIu8" not present"
> vtd_ce_not_present(uint8_t bus, uint8_t devfn) "Context entry bus %"PRIu8" devfn %"PRIu8" not present"
> vtd_iotlb_page_hit(uint16_t sid, uint64_t addr, uint64_t slpte, uint16_t domain) "IOTLB page hit sid 0x%"PRIx16" iova 0x%"PRIx64" slpte 0x%"PRIx64" domain 0x%"PRIx16
Thanks
Eric
* Re: [PATCH v2 18/19] Workaround for ERRATA_772415_SPR17
2025-06-20 7:18 ` [PATCH v2 18/19] Workaround for ERRATA_772415_SPR17 Zhenzhong Duan
@ 2025-06-20 16:01 ` Eric Auger
2025-06-23 3:29 ` Duan, Zhenzhong
0 siblings, 1 reply; 58+ messages in thread
From: Eric Auger @ 2025-06-20 16:01 UTC (permalink / raw)
To: Zhenzhong Duan, qemu-devel
Cc: alex.williamson, clg, mst, jasowang, peterx, ddutile, jgg,
nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng
Hi Zhenzhong,
On 6/20/25 9:18 AM, Zhenzhong Duan wrote:
> On a system influenced by ERRATA_772415, IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17
> is reported by IOMMU_DEVICE_GET_HW_INFO. Due to this errata, even the readonly
> range mapped on stage-2 page table could still be written.
I would split this patch into a vfio only patch and an iommu one that
sets bcontainer->readonly according to the fetched info.
>
> Reference from 4th Gen Intel Xeon Processor Scalable Family Specification
> Update, Errata Details, SPR17.
> https://www.intel.com/content/www/us/en/content-details/772415/content-details.html
the link does not work for me.
Please could you explain in English what the errata is about and what
actions need to be taken in VFIO?
Sorry I failed to understand "
Due to this errata, even the readonly
range mapped on stage-2 page table could still be written.
"
>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
> include/hw/vfio/vfio-container-base.h | 1 +
> hw/vfio/iommufd.c | 8 +++++++-
> hw/vfio/listener.c | 13 +++++++++----
> 3 files changed, 17 insertions(+), 5 deletions(-)
>
> diff --git a/include/hw/vfio/vfio-container-base.h b/include/hw/vfio/vfio-container-base.h
> index f0232654ee..e5c51a51ac 100644
> --- a/include/hw/vfio/vfio-container-base.h
> +++ b/include/hw/vfio/vfio-container-base.h
> @@ -51,6 +51,7 @@ typedef struct VFIOContainerBase {
> QLIST_HEAD(, VFIODevice) device_list;
> GList *iova_ranges;
> NotifierWithReturn cpr_reboot_notifier;
> + bool bypass_ro;
> } VFIOContainerBase;
>
> typedef struct VFIOGuestIOMMU {
> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
> index 83a632bdee..23839a511a 100644
> --- a/hw/vfio/iommufd.c
> +++ b/hw/vfio/iommufd.c
> @@ -306,6 +306,7 @@ static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
> {
> ERRP_GUARD();
> IOMMUFDBackend *iommufd = vbasedev->iommufd;
> + struct iommu_hw_info_vtd vtd;
= {}; ?
> uint32_t type, flags = 0;
> uint64_t hw_caps;
> VFIOIOASHwpt *hwpt;
> @@ -345,10 +346,15 @@ static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
> * instead.
> */
> if (!iommufd_backend_get_device_info(vbasedev->iommufd, vbasedev->devid,
> - &type, NULL, 0, &hw_caps, errp)) {
> + &type, &vtd, sizeof(vtd), &hw_caps,
> + errp)) {
> return false;
> }
>
> + if (vtd.flags & IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17) {
> + container->bcontainer.bypass_ro = true;
> + }
> +
> if (hw_caps & IOMMU_HW_CAP_DIRTY_TRACKING) {
> flags = IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
> }
> diff --git a/hw/vfio/listener.c b/hw/vfio/listener.c
> index f498e23a93..c64aa4539e 100644
> --- a/hw/vfio/listener.c
> +++ b/hw/vfio/listener.c
> @@ -364,7 +364,8 @@ static bool vfio_known_safe_misalignment(MemoryRegionSection *section)
> return true;
> }
>
> -static bool vfio_listener_valid_section(MemoryRegionSection *section,
> +static bool vfio_listener_valid_section(VFIOContainerBase *bcontainer,
> + MemoryRegionSection *section,
> const char *name)
> {
> if (vfio_listener_skipped_section(section)) {
> @@ -375,6 +376,10 @@ static bool vfio_listener_valid_section(MemoryRegionSection *section,
> return false;
> }
>
> + if (bcontainer && bcontainer->bypass_ro && section->readonly) {
> + return false;
> + }
> +
> if (unlikely((section->offset_within_address_space &
> ~qemu_real_host_page_mask()) !=
> (section->offset_within_region & ~qemu_real_host_page_mask()))) {
> @@ -494,7 +499,7 @@ void vfio_container_region_add(VFIOContainerBase *bcontainer,
> int ret;
> Error *err = NULL;
>
> - if (!vfio_listener_valid_section(section, "region_add")) {
> + if (!vfio_listener_valid_section(bcontainer, section, "region_add")) {
> return;
> }
>
> @@ -655,7 +660,7 @@ static void vfio_listener_region_del(MemoryListener *listener,
> int ret;
> bool try_unmap = true;
>
> - if (!vfio_listener_valid_section(section, "region_del")) {
> + if (!vfio_listener_valid_section(bcontainer, section, "region_del")) {
> return;
> }
>
> @@ -812,7 +817,7 @@ static void vfio_dirty_tracking_update(MemoryListener *listener,
> container_of(listener, VFIODirtyRangesListener, listener);
> hwaddr iova, end;
>
> - if (!vfio_listener_valid_section(section, "tracking_update") ||
> + if (!vfio_listener_valid_section(NULL, section, "tracking_update") ||
> !vfio_get_section_iova_range(dirty->bcontainer, section,
> &iova, &end, NULL)) {
> return;
Thanks
Eric
* RE: [PATCH v2 02/19] hw/pci: Introduce pci_device_get_viommu_cap()
2025-06-20 7:53 ` Eric Auger
@ 2025-06-23 2:20 ` Duan, Zhenzhong
2025-06-23 9:30 ` Eric Auger
0 siblings, 1 reply; 58+ messages in thread
From: Duan, Zhenzhong @ 2025-06-23 2:20 UTC (permalink / raw)
To: eric.auger@redhat.com, qemu-devel@nongnu.org
Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
jgg@nvidia.com, nicolinc@nvidia.com,
shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
Peng, Chao P, Marcel Apfelbaum
Hi Eric,
>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: Re: [PATCH v2 02/19] hw/pci: Introduce pci_device_get_viommu_cap()
>
>Hi Zhenzhong,
>
>On 6/20/25 9:17 AM, Zhenzhong Duan wrote:
>> pci_device_get_viommu_cap() call pci_device_get_iommu_bus_devfn()
>> to get iommu_bus->iommu_ops and call get_viommu_cap() callback to
>> get a bitmap with each bit represents a vIOMMU exposed capability.
>>
>> Suggested-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>> include/hw/pci/pci.h | 22 ++++++++++++++++++++++
>> hw/pci/pci.c | 11 +++++++++++
>> 2 files changed, 33 insertions(+)
>>
>> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
>> index df3cc7b875..829757b2c2 100644
>> --- a/include/hw/pci/pci.h
>> +++ b/include/hw/pci/pci.h
>> @@ -453,6 +453,18 @@ typedef struct PCIIOMMUOps {
>> * @devfn: device and function number of the PCI device.
>> */
>> void (*unset_iommu_device)(PCIBus *bus, void *opaque, int devfn);
>> + /**
>> + * @get_viommu_cap: get vIOMMU capabilities
>> + *
>> + * Optional callback, if not implemented, then vIOMMU doesn't
>> + * support exposing capabilities to other subsystem, e.g., VFIO.
>> + * vIOMMU can choose which capabilities to expose.
>> + *
>> + * @opaque: the data passed to pci_setup_iommu().
>> + *
>> + * Returns: 64bit bitmap with each bit represents a capability.
>I think we need to clarify what this bitmap contains as capability bits
>(enum type)
Sure, will be like:
* Returns: 64bit bitmap with each bit represents a capability emulated
* by VIOMMU_CAP_* in include/hw/iommu.h
enum {
VIOMMU_CAP_STAGE1 = BIT_ULL(0), /* stage1 page table supported */
};
Thanks
Zhenzhong
* RE: [PATCH v2 03/19] intel_iommu: Implement get_viommu_cap() callback
2025-06-20 8:10 ` Eric Auger
@ 2025-06-23 2:20 ` Duan, Zhenzhong
0 siblings, 0 replies; 58+ messages in thread
From: Duan, Zhenzhong @ 2025-06-23 2:20 UTC (permalink / raw)
To: eric.auger@redhat.com, qemu-devel@nongnu.org
Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
jgg@nvidia.com, nicolinc@nvidia.com,
shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
Peng, Chao P, Paolo Bonzini, Richard Henderson, Eduardo Habkost,
Marcel Apfelbaum
>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: Re: [PATCH v2 03/19] intel_iommu: Implement get_viommu_cap()
>callback
>
>Hi Zhenzhong,
>
>On 6/20/25 9:17 AM, Zhenzhong Duan wrote:
>> Implement get_viommu_cap() callback and expose stage-1 capability for now.
>>
>> VFIO uses it to create nested parent domain which is further used to create
>> nested domain in vIOMMU. All these will be implemented in following patches.
>>
>> Suggested-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>> MAINTAINERS | 1 +
>> include/hw/iommu.h | 14 ++++++++++++++
>> hw/i386/intel_iommu.c | 12 ++++++++++++
>> 3 files changed, 27 insertions(+)
>> create mode 100644 include/hw/iommu.h
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index 94c4076127..27817974a6 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -2277,6 +2277,7 @@ F: include/system/iommufd.h
>> F: backends/host_iommu_device.c
>> F: include/system/host_iommu_device.h
>> F: include/qemu/chardev_open.h
>> +F: include/hw/iommu.h
>> F: util/chardev_open.c
>> F: docs/devel/vfio-iommufd.rst
>>
>> diff --git a/include/hw/iommu.h b/include/hw/iommu.h
>> new file mode 100644
>> index 0000000000..3c1c08f05d
>> --- /dev/null
>> +++ b/include/hw/iommu.h
>> @@ -0,0 +1,14 @@
>> +/*
>> + * General vIOMMU capabilities, flags, etc
>> + *
>> + * Copyright (C) 2025 Intel Corporation.
>> + *
>> + * SPDX-License-Identifier: GPL-2.0-or-later
>> + */
>> +
>> +#ifndef HW_IOMMU_H
>> +#define HW_IOMMU_H
>> +
>> +#define VIOMMU_CAP_STAGE1 BIT_ULL(0)
>I think you shall rather introduce the header and the first enum value
>in the previous patch.
Sure, will do.
Thanks
Zhenzhong
* RE: [PATCH v2 04/19] vfio/iommufd: Force creating nested parent domain
2025-06-20 8:08 ` Eric Auger
@ 2025-06-23 2:33 ` Duan, Zhenzhong
0 siblings, 0 replies; 58+ messages in thread
From: Duan, Zhenzhong @ 2025-06-23 2:33 UTC (permalink / raw)
To: eric.auger@redhat.com, qemu-devel@nongnu.org
Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
jgg@nvidia.com, nicolinc@nvidia.com,
shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
Peng, Chao P
Hi Eric,
>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: Re: [PATCH v2 04/19] vfio/iommufd: Force creating nested parent
>domain
>
>
>
>On 6/20/25 9:17 AM, Zhenzhong Duan wrote:
>> Call pci_device_get_viommu_cap() to get if vIOMMU supports
>VIOMMU_CAP_STAGE1,
>> if yes, create nested parent domain which could be reused by vIOMMU to
>create
>could? Is there a choice?
I think no other choice.
If vIOMMU returns VIOMMU_CAP_STAGE1, both smmuv3 and vtd want nested
translation.
>> nested domain.
>
>I would add in the commit msg that it is safe because
>
>hw_caps & VIOMMU_CAP_STAGE1 cannot be set yet because s->flts is forbidden
>until last patch of the series.
Will do.
Thanks
Zhenzhong
* RE: [PATCH v2 07/19] intel_iommu: Check for compatibility with IOMMUFD backed device when x-flts=on
2025-06-20 12:05 ` Eric Auger
@ 2025-06-23 2:44 ` Duan, Zhenzhong
0 siblings, 0 replies; 58+ messages in thread
From: Duan, Zhenzhong @ 2025-06-23 2:44 UTC (permalink / raw)
To: eric.auger@redhat.com, qemu-devel@nongnu.org
Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
jgg@nvidia.com, nicolinc@nvidia.com,
shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
Peng, Chao P, Paolo Bonzini, Richard Henderson, Eduardo Habkost,
Marcel Apfelbaum
>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: Re: [PATCH v2 07/19] intel_iommu: Check for compatibility with
>IOMMUFD backed device when x-flts=on
>
>Hi Zhenzhong,
>
>On 6/20/25 9:18 AM, Zhenzhong Duan wrote:
>> When vIOMMU is configured x-flts=on in scalable mode, stage-1 page table
>> is passed to host to construct nested page table. We need to check
>> compatibility of some critical IOMMU capabilities between vIOMMU and
>> host IOMMU to ensure guest stage-1 page table could be used by host.
>>
>> For instance, vIOMMU supports stage-1 1GB huge page mapping, but host
>> does not, then this IOMMUFD backed device should be failed.
>s/should be/should
>
>You may add in the commit message than even of the checks pass, for now
>we willingly reject the association because all the bits are not there yet.
Will do.
>>
>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>> hw/i386/intel_iommu_internal.h | 1 +
>> hw/i386/intel_iommu.c | 28 ++++++++++++++++++++++++++++
>> 2 files changed, 29 insertions(+)
>>
>> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
>> index 7aba259ef8..18bc22fc72 100644
>> --- a/hw/i386/intel_iommu_internal.h
>> +++ b/hw/i386/intel_iommu_internal.h
>> @@ -192,6 +192,7 @@
>> #define VTD_ECAP_PT (1ULL << 6)
>> #define VTD_ECAP_SC (1ULL << 7)
>> #define VTD_ECAP_MHMV (15ULL << 20)
>> +#define VTD_ECAP_NEST (1ULL << 26)
>> #define VTD_ECAP_SRS (1ULL << 31)
>> #define VTD_ECAP_PASID (1ULL << 40)
>> #define VTD_ECAP_SMTS (1ULL << 43)
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index e90fd2f28f..1c79efc1cb 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -40,6 +40,7 @@
>> #include "kvm/kvm_i386.h"
>> #include "migration/vmstate.h"
>> #include "trace.h"
>> +#include "system/iommufd.h"
>>
>> /* context entry operations */
>> #define VTD_CE_GET_RID2PASID(ce) \
>> @@ -4355,6 +4356,33 @@ static bool vtd_check_hiod(IntelIOMMUState *s,
>HostIOMMUDevice *hiod,
>> return true;
>> }
>>
>> +#ifdef CONFIG_IOMMUFD
>> + struct HostIOMMUDeviceCaps *caps = &hiod->caps;
>> + struct iommu_hw_info_vtd *vtd = &caps->vendor_caps.vtd;
>> +
>> + /* Remaining checks are all stage-1 translation specific */
>> + if (!object_dynamic_cast(OBJECT(hiod),
>TYPE_HOST_IOMMU_DEVICE_IOMMUFD)) {
>> + error_setg(errp, "Need IOMMUFD backend when x-flts=on");
>> + return false;
>> + }
>> +
>> + if (caps->type != IOMMU_HW_INFO_TYPE_INTEL_VTD) {
>> + error_setg(errp, "Incompatible host platform IOMMU type %d",
>> + caps->type);
>> + return false;
>> + }
>> +
>> + if (!(vtd->ecap_reg & VTD_ECAP_NEST)) {
>> + error_setg(errp, "Host IOMMU doesn't support nested translation");
>> + return false;
>> + }
>> +
>> + if (s->fs1gp && !(vtd->cap_reg & VTD_CAP_FS1GP)) {
>> + error_setg(errp, "Stage-1 1GB huge page is unsupported by host IOMMU");
>> + return false;
>> + }
>> +#endif
>> +
>> error_setg(errp, "host device is uncompatible with stage-1 translation");
>s/uncompatible/incompatible
>s/device/IOMMU
Will do.
Thanks
Zhenzhong
* RE: [PATCH v2 05/19] hw/pci: Export pci_device_get_iommu_bus_devfn() and return bool
2025-06-20 11:59 ` Eric Auger
@ 2025-06-23 2:47 ` Duan, Zhenzhong
2025-06-23 9:31 ` Eric Auger
0 siblings, 1 reply; 58+ messages in thread
From: Duan, Zhenzhong @ 2025-06-23 2:47 UTC (permalink / raw)
To: eric.auger@redhat.com, qemu-devel@nongnu.org
Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
jgg@nvidia.com, nicolinc@nvidia.com,
shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
Peng, Chao P, Marcel Apfelbaum
>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: Re: [PATCH v2 05/19] hw/pci: Export pci_device_get_iommu_bus_devfn()
>and return bool
>
>Hi Zhenzhong,
>
>On 6/20/25 9:17 AM, Zhenzhong Duan wrote:
>> Returns true if PCI device is aliased or false otherwise. This will be
>> used in following patch to determine if a PCI device is under a PCI
>> bridge.
>>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>> include/hw/pci/pci.h | 2 ++
>> hw/pci/pci.c | 12 ++++++++----
>> 2 files changed, 10 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
>> index 829757b2c2..3029cdf26f 100644
>> --- a/include/hw/pci/pci.h
>> +++ b/include/hw/pci/pci.h
>> @@ -640,6 +640,8 @@ typedef struct PCIIOMMUOps {
>> bool is_write);
>> } PCIIOMMUOps;
>>
>> +bool pci_device_get_iommu_bus_devfn(PCIDevice *dev, PCIBus
>**piommu_bus,
>> + PCIBus **aliased_bus, int *aliased_devfn);
>if I am correct you have a single caller of the helper using the
>returned value, in intel_iommu.c, whereas all the existing callers are
>not using the returned value. You may simply pass a non NULL aliased_bus
>and aliased_devfn and check whether they differ from the original
>bus/devfn. Besides the patch looks ok to me.
I did it that way initially, but it doesn't work if the PCI device is the first
device under a PCIe-to-PCI bridge, e.g., 01:00.0; in that case the aliased BDF == the real BDF.
Thanks
Zhenzhong
* RE: [PATCH v2 08/19] intel_iommu: Fail passthrough device under PCI bridge if x-flts=on
2025-06-20 12:18 ` Eric Auger
@ 2025-06-23 3:20 ` Duan, Zhenzhong
0 siblings, 0 replies; 58+ messages in thread
From: Duan, Zhenzhong @ 2025-06-23 3:20 UTC (permalink / raw)
To: eric.auger@redhat.com, qemu-devel@nongnu.org
Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
jgg@nvidia.com, nicolinc@nvidia.com,
shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
Peng, Chao P, Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
Eduardo Habkost
Hi Eric,
>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: Re: [PATCH v2 08/19] intel_iommu: Fail passthrough device under PCI
>bridge if x-flts=on
>
>Hi Zhenzhong,
>
>On 6/20/25 9:18 AM, Zhenzhong Duan wrote:
>> Currently we don't support nested translation for passthrough device
>> with emulated device under same PCI bridge.
>
>I don't understand the above. Do you want to forbid a configuration
>where both a passthrough'ed device and an emulated device would be
>downstream to a PCI bridge and protected by nested IOMMU?
>
>If I am not wrong in the path you don't check coexistence of emulated
>and passthrough'ed device but simply check whether the host assigned
>device is downstream to a PCI bridge. So if I am not wrong this is not
>really aligned to the commit msg description.
You are right; I just realized I didn't make that clear. It will read:
"Currently we don't support nested translation for a passthrough device
with an emulated device under the same PCI bridge, because they require
different address spaces when x-flts=on.
In theory, we could support the case where all devices under the same PCI
bridge are passthrough devices. But an emulated device can be hotplugged
under the same bridge. For simplicity, just forbid a passthrough device
under a PCI bridge regardless of whether there are, or will be, emulated
devices under the same bridge. This is acceptable because PCIe bridges
are more common than PCI bridges now."
>>
>> Reason is for emulated devices, AS should switch to iommu MR, while for
>> passthrough devices, it needs the AS stick with the system MR hence be
>> able to keep the VFIO container IOAS as a GPA IOAS. To support this, let
>> AS switch to iommu MR and have a separate GPA IOAS is needed, but that
>> brings a new memory listener which duplicates with VFIO memory listener.
>I have difficulties parsing the above sentence
I'll drop this, presuming the above comment is clear.
>>
>> For trade off, we choose to not support this special scenario because
>> PCIE bridge is more popular than PCI bridge now.
>>
>> Suggested-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>> hw/i386/intel_iommu.c | 13 +++++++++++--
>> 1 file changed, 11 insertions(+), 2 deletions(-)
>>
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index 1c79efc1cb..9d4adc9458 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -4330,9 +4330,10 @@ VTDAddressSpace
>*vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus,
>> return vtd_dev_as;
>> }
>>
>> -static bool vtd_check_hiod(IntelIOMMUState *s, HostIOMMUDevice *hiod,
>> +static bool vtd_check_hiod(IntelIOMMUState *s, VTDHostIOMMUDevice
>*vtd_hiod,
>> Error **errp)
>> {
>> + HostIOMMUDevice *hiod = vtd_hiod->hiod;
>> HostIOMMUDeviceClass *hiodc = HOST_IOMMU_DEVICE_GET_CLASS(hiod);
>> int ret;
>>
>> @@ -4359,6 +4360,8 @@ static bool vtd_check_hiod(IntelIOMMUState *s,
>HostIOMMUDevice *hiod,
>> #ifdef CONFIG_IOMMUFD
>> struct HostIOMMUDeviceCaps *caps = &hiod->caps;
>> struct iommu_hw_info_vtd *vtd = &caps->vendor_caps.vtd;
>> + PCIBus *bus = vtd_hiod->bus;
>> + PCIDevice *pdev = pci_find_device(bus, pci_bus_num(bus), vtd_hiod->devfn);
>>
>> /* Remaining checks are all stage-1 translation specific */
>> if (!object_dynamic_cast(OBJECT(hiod),
>TYPE_HOST_IOMMU_DEVICE_IOMMUFD)) {
>> @@ -4381,6 +4384,12 @@ static bool vtd_check_hiod(IntelIOMMUState *s,
>HostIOMMUDevice *hiod,
>> error_setg(errp, "Stage-1 1GB huge page is unsupported by host IOMMU");
>> return false;
>> }
>> +
>> + if (pci_device_get_iommu_bus_devfn(pdev, &bus, NULL, NULL)) {
>> + error_setg(errp, "Host device under PCI bridge is unsupported "
>> + "when x-flts=on");
>so now the compatibility also comes from the device and not only from
>the host IOMMU caps. (refering to my previous comment about
>s/device/iommu in error msg)
Yes, there are three types of checks:
1) host IOMMU caps compatibility
2) host device not under PCI bridge
3) CONFIG_IOMMUFD configured
Thanks
Zhenzhong
* RE: [PATCH v2 18/19] Workaround for ERRATA_772415_SPR17
2025-06-20 16:01 ` Eric Auger
@ 2025-06-23 3:29 ` Duan, Zhenzhong
2025-06-23 9:33 ` Eric Auger
0 siblings, 1 reply; 58+ messages in thread
From: Duan, Zhenzhong @ 2025-06-23 3:29 UTC (permalink / raw)
To: eric.auger@redhat.com, qemu-devel@nongnu.org
Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
jgg@nvidia.com, nicolinc@nvidia.com,
shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
Peng, Chao P
>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: Re: [PATCH v2 18/19] Workaround for ERRATA_772415_SPR17
>
>Hi Zhenzhong,
>
>On 6/20/25 9:18 AM, Zhenzhong Duan wrote:
>> On a system influenced by ERRATA_772415,
>IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17
>> is reported by IOMMU_DEVICE_GET_HW_INFO. Due to this errata, even the
>readonly
>> range mapped on stage-2 page table could still be written.
>
>I would split this patch into a vfio only patch and an iommu one that
>sets bcontainer->readonly according to the fetched info.
Will do.
>>
>> Reference from 4th Gen Intel Xeon Processor Scalable Family Specification
>> Update, Errata Details, SPR17.
>> https://www.intel.com/content/www/us/en/content-details/772415/content-
>details.html
>the link does not work for me.
That's strange, what about:
https://edc.intel.com/content/www/us/en/design/products-and-solutions/processors-and-chipsets/eagle-stream/sapphire-rapids-specification-update/
>
>Please could you explain in english what the errata is about and what
>actions need to be taken care in VFIO?
>
>Sorry I failed to understand "
>
>Due to this errata, even the readonly
>range mapped on stage-2 page table could still be written.
Copying the explanation from the above link:
"SPR17. Remapping Hardware May Set Access/Dirty Bits in a First-stage
Page-table Entry
Problem: When remapping hardware is configured by system software in
scalable mode as Nested (PGTT=011b) and with PWSNP field Set in the
PASID-table-entry, it may Set Accessed bit and Dirty bit (and Extended
Access bit if enabled) in first-stage page-table entries even when
second-stage mappings indicate that corresponding first-stage page-table
is Read-Only.
Implication: Due to this erratum, pages mapped as Read-only in second-stage
page-tables may be modified by remapping hardware Access/Dirty bit updates.
Workaround: None identified. System software enabling nested translations
for a VM should ensure that there are no read-only pages in the
corresponding second-stage mappings."
>
>"
>>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>> include/hw/vfio/vfio-container-base.h | 1 +
>> hw/vfio/iommufd.c | 8 +++++++-
>> hw/vfio/listener.c | 13 +++++++++----
>> 3 files changed, 17 insertions(+), 5 deletions(-)
>>
>> diff --git a/include/hw/vfio/vfio-container-base.h b/include/hw/vfio/vfio-
>container-base.h
>> index f0232654ee..e5c51a51ac 100644
>> --- a/include/hw/vfio/vfio-container-base.h
>> +++ b/include/hw/vfio/vfio-container-base.h
>> @@ -51,6 +51,7 @@ typedef struct VFIOContainerBase {
>> QLIST_HEAD(, VFIODevice) device_list;
>> GList *iova_ranges;
>> NotifierWithReturn cpr_reboot_notifier;
>> + bool bypass_ro;
>> } VFIOContainerBase;
>>
>> typedef struct VFIOGuestIOMMU {
>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>> index 83a632bdee..23839a511a 100644
>> --- a/hw/vfio/iommufd.c
>> +++ b/hw/vfio/iommufd.c
>> @@ -306,6 +306,7 @@ static bool
>iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
>> {
>> ERRP_GUARD();
>> IOMMUFDBackend *iommufd = vbasedev->iommufd;
>> + struct iommu_hw_info_vtd vtd;
>= {}; ?
Will do.
>> uint32_t type, flags = 0;
>> uint64_t hw_caps;
>> VFIOIOASHwpt *hwpt;
>> @@ -345,10 +346,15 @@ static bool
>iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
>> * instead.
>> */
>> if (!iommufd_backend_get_device_info(vbasedev->iommufd, vbasedev-
>>devid,
>> - &type, NULL, 0, &hw_caps, errp)) {
>> + &type, &vtd, sizeof(vtd), &hw_caps,
>> + errp)) {
>> return false;
>> }
>>
>> + if (vtd.flags & IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17) {
>> + container->bcontainer.bypass_ro = true;
>> + }
>> +
>> if (hw_caps & IOMMU_HW_CAP_DIRTY_TRACKING) {
>> flags = IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
>> }
>> diff --git a/hw/vfio/listener.c b/hw/vfio/listener.c
>> index f498e23a93..c64aa4539e 100644
>> --- a/hw/vfio/listener.c
>> +++ b/hw/vfio/listener.c
>> @@ -364,7 +364,8 @@ static bool
>vfio_known_safe_misalignment(MemoryRegionSection *section)
>> return true;
>> }
>>
>> -static bool vfio_listener_valid_section(MemoryRegionSection *section,
>> +static bool vfio_listener_valid_section(VFIOContainerBase *bcontainer,
>> + MemoryRegionSection *section,
>> const char *name)
>> {
>> if (vfio_listener_skipped_section(section)) {
>> @@ -375,6 +376,10 @@ static bool
>vfio_listener_valid_section(MemoryRegionSection *section,
>> return false;
>> }
>>
>> + if (bcontainer && bcontainer->bypass_ro && section->readonly) {
>> + return false;
>> + }
>> +
>> if (unlikely((section->offset_within_address_space &
>> ~qemu_real_host_page_mask()) !=
>> (section->offset_within_region & ~qemu_real_host_page_mask()))) {
>> @@ -494,7 +499,7 @@ void vfio_container_region_add(VFIOContainerBase
>*bcontainer,
>> int ret;
>> Error *err = NULL;
>>
>> - if (!vfio_listener_valid_section(section, "region_add")) {
>> + if (!vfio_listener_valid_section(bcontainer, section, "region_add")) {
>> return;
>> }
>>
>> @@ -655,7 +660,7 @@ static void vfio_listener_region_del(MemoryListener
>*listener,
>> int ret;
>> bool try_unmap = true;
>>
>> - if (!vfio_listener_valid_section(section, "region_del")) {
>> + if (!vfio_listener_valid_section(bcontainer, section, "region_del")) {
>> return;
>> }
>>
>> @@ -812,7 +817,7 @@ static void vfio_dirty_tracking_update(MemoryListener
>*listener,
>> container_of(listener, VFIODirtyRangesListener, listener);
>> hwaddr iova, end;
>>
>> - if (!vfio_listener_valid_section(section, "tracking_update") ||
>> + if (!vfio_listener_valid_section(NULL, section, "tracking_update") ||
>> !vfio_get_section_iova_range(dirty->bcontainer, section,
>> &iova, &end, NULL)) {
>> return;
>Thanks
>
>Eric
* Re: [PATCH v2 02/19] hw/pci: Introduce pci_device_get_viommu_cap()
2025-06-23 2:20 ` Duan, Zhenzhong
@ 2025-06-23 9:30 ` Eric Auger
0 siblings, 0 replies; 58+ messages in thread
From: Eric Auger @ 2025-06-23 9:30 UTC (permalink / raw)
To: Duan, Zhenzhong, qemu-devel@nongnu.org
Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
jgg@nvidia.com, nicolinc@nvidia.com,
shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
Peng, Chao P, Marcel Apfelbaum
Hi Zhenzhong,
On 6/23/25 4:20 AM, Duan, Zhenzhong wrote:
> Hi Eric,
>
>> -----Original Message-----
>> From: Eric Auger <eric.auger@redhat.com>
>> Subject: Re: [PATCH v2 02/19] hw/pci: Introduce pci_device_get_viommu_cap()
>>
>> Hi Zhenzhong,
>>
>> On 6/20/25 9:17 AM, Zhenzhong Duan wrote:
>>> pci_device_get_viommu_cap() call pci_device_get_iommu_bus_devfn()
>>> to get iommu_bus->iommu_ops and call get_viommu_cap() callback to
>>> get a bitmap with each bit represents a vIOMMU exposed capability.
>>>
>>> Suggested-by: Yi Liu <yi.l.liu@intel.com>
>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>> ---
>>> include/hw/pci/pci.h | 22 ++++++++++++++++++++++
>>> hw/pci/pci.c | 11 +++++++++++
>>> 2 files changed, 33 insertions(+)
>>>
>>> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
>>> index df3cc7b875..829757b2c2 100644
>>> --- a/include/hw/pci/pci.h
>>> +++ b/include/hw/pci/pci.h
>>> @@ -453,6 +453,18 @@ typedef struct PCIIOMMUOps {
>>> * @devfn: device and function number of the PCI device.
>>> */
>>> void (*unset_iommu_device)(PCIBus *bus, void *opaque, int devfn);
>>> + /**
>>> + * @get_viommu_cap: get vIOMMU capabilities
>>> + *
>>> + * Optional callback, if not implemented, then vIOMMU doesn't
>>> + * support exposing capabilities to other subsystem, e.g., VFIO.
>>> + * vIOMMU can choose which capabilities to expose.
>>> + *
>>> + * @opaque: the data passed to pci_setup_iommu().
>>> + *
>>> + * Returns: 64bit bitmap with each bit represents a capability.
>> I think we need to clarify what this bitmap contains as capability bits
>> (enum type)
> Sure, it will read:
>
> * Returns: 64bit bitmap with each bit representing a capability
> * defined by VIOMMU_CAP_* in include/hw/iommu.h
>
> enum {
> VIOMMU_CAP_STAGE1 = BIT_ULL(0), /* stage1 page table supported */
> };
looks good to me
Eric
>
> Thanks
> Zhenzhong
>
* Re: [PATCH v2 05/19] hw/pci: Export pci_device_get_iommu_bus_devfn() and return bool
2025-06-23 2:47 ` Duan, Zhenzhong
@ 2025-06-23 9:31 ` Eric Auger
0 siblings, 0 replies; 58+ messages in thread
From: Eric Auger @ 2025-06-23 9:31 UTC (permalink / raw)
To: Duan, Zhenzhong, qemu-devel@nongnu.org
Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
jgg@nvidia.com, nicolinc@nvidia.com,
shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
Peng, Chao P, Marcel Apfelbaum
On 6/23/25 4:47 AM, Duan, Zhenzhong wrote:
>
>> -----Original Message-----
>> From: Eric Auger <eric.auger@redhat.com>
>> Subject: Re: [PATCH v2 05/19] hw/pci: Export pci_device_get_iommu_bus_devfn()
>> and return bool
>>
>> Hi Zhenzhong,
>>
>> On 6/20/25 9:17 AM, Zhenzhong Duan wrote:
>>> Returns true if PCI device is aliased or false otherwise. This will be
>>> used in following patch to determine if a PCI device is under a PCI
>>> bridge.
>>>
>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>> ---
>>> include/hw/pci/pci.h | 2 ++
>>> hw/pci/pci.c | 12 ++++++++----
>>> 2 files changed, 10 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
>>> index 829757b2c2..3029cdf26f 100644
>>> --- a/include/hw/pci/pci.h
>>> +++ b/include/hw/pci/pci.h
>>> @@ -640,6 +640,8 @@ typedef struct PCIIOMMUOps {
>>> bool is_write);
>>> } PCIIOMMUOps;
>>>
>>> +bool pci_device_get_iommu_bus_devfn(PCIDevice *dev, PCIBus
>> **piommu_bus,
>>> + PCIBus **aliased_bus, int *aliased_devfn);
>> if I am correct you have a single caller of the helper using the
>> returned value, in intel_iommu.c, whereas all the existing callers are
>> not using the returned value. You may simply pass a non NULL aliased_bus
>> and aliased_devfn and check whether they differ from the original
>> bus/devfn. Besides the patch looks ok to me.
> I do this way initially, but it doesn't work if PCI device is the first device under
> PCIE-to-PCI bridge, e.g., 01:00.0, in this case aliased BDF==real BDF.
I see.
Thank you for the explanation
Eric
>
> Thanks
> Zhenzhong
* Re: [PATCH v2 18/19] Workaround for ERRATA_772415_SPR17
2025-06-23 3:29 ` Duan, Zhenzhong
@ 2025-06-23 9:33 ` Eric Auger
0 siblings, 0 replies; 58+ messages in thread
From: Eric Auger @ 2025-06-23 9:33 UTC (permalink / raw)
To: Duan, Zhenzhong, qemu-devel@nongnu.org
Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
jgg@nvidia.com, nicolinc@nvidia.com,
shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
Peng, Chao P
On 6/23/25 5:29 AM, Duan, Zhenzhong wrote:
>
>> -----Original Message-----
>> From: Eric Auger <eric.auger@redhat.com>
>> Subject: Re: [PATCH v2 18/19] Workaround for ERRATA_772415_SPR17
>>
>> Hi Zhenzhong,
>>
>> On 6/20/25 9:18 AM, Zhenzhong Duan wrote:
>>> On a system influenced by ERRATA_772415,
>> IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17
>>> is reported by IOMMU_DEVICE_GET_HW_INFO. Due to this errata, even the
>> readonly
>>> range mapped on stage-2 page table could still be written.
>> I would split this patch into a vfio only patch and an iommu one that
>> sets bcontainer->readonly according to the fetched info.
> Will do.
>
>>> Reference from 4th Gen Intel Xeon Processor Scalable Family Specification
>>> Update, Errata Details, SPR17.
>>> https://www.intel.com/content/www/us/en/content-details/772415/content-
>> details.html
>> the link does not work for me.
> That's strange, what about:
>
> https://edc.intel.com/content/www/us/en/design/products-and-solutions/processors-and-chipsets/eagle-stream/sapphire-rapids-specification-update/
yes this link works for me.
>
>> Please could you explain in english what the errata is about and what
>> actions need to be taken care in VFIO?
>>
>> Sorry I failed to understand "
>>
>> Due to this errata, even the readonly
>> range mapped on stage-2 page table could still be written.
> Copy the explanation in above link:
>
> "SPR17. Remapping Hardware May Set Access/Dirty Bits in a First-stage
> Page-table Entry
>
> Problem: When remapping hardware is configured by system software in
> scalable mode as Nested (PGTT=011b) and with PWSNP field Set in the
> PASID-table-entry, it may Set Accessed bit and Dirty bit (and Extended
> Access bit if enabled) in first-stage page-table entries even when
> second-stage mappings indicate that corresponding first-stage page-table
> is Read-Only.
>
> Implication: Due to this erratum, pages mapped as Read-only in second-stage
> page-tables may be modified by remapping hardware Access/Dirty bit updates.
>
> Workaround: None identified. System software enabling nested translations
> for a VM should ensure that there are no read-only pages in the
> corresponding second-stage mappings."
OK, this definitely helps. I will further look at the patch with that
background
Eric
>
>> "
>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>> ---
>>> include/hw/vfio/vfio-container-base.h | 1 +
>>> hw/vfio/iommufd.c | 8 +++++++-
>>> hw/vfio/listener.c | 13 +++++++++----
>>> 3 files changed, 17 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/include/hw/vfio/vfio-container-base.h b/include/hw/vfio/vfio-
>> container-base.h
>>> index f0232654ee..e5c51a51ac 100644
>>> --- a/include/hw/vfio/vfio-container-base.h
>>> +++ b/include/hw/vfio/vfio-container-base.h
>>> @@ -51,6 +51,7 @@ typedef struct VFIOContainerBase {
>>> QLIST_HEAD(, VFIODevice) device_list;
>>> GList *iova_ranges;
>>> NotifierWithReturn cpr_reboot_notifier;
>>> + bool bypass_ro;
>>> } VFIOContainerBase;
>>>
>>> typedef struct VFIOGuestIOMMU {
>>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>>> index 83a632bdee..23839a511a 100644
>>> --- a/hw/vfio/iommufd.c
>>> +++ b/hw/vfio/iommufd.c
>>> @@ -306,6 +306,7 @@ static bool
>> iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
>>> {
>>> ERRP_GUARD();
>>> IOMMUFDBackend *iommufd = vbasedev->iommufd;
>>> + struct iommu_hw_info_vtd vtd;
>> = {}; ?
> Will do.
>
>>> uint32_t type, flags = 0;
>>> uint64_t hw_caps;
>>> VFIOIOASHwpt *hwpt;
>>> @@ -345,10 +346,15 @@ static bool
>> iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
>>> * instead.
>>> */
>>> if (!iommufd_backend_get_device_info(vbasedev->iommufd, vbasedev-
>>> devid,
>>> - &type, NULL, 0, &hw_caps, errp)) {
>>> + &type, &vtd, sizeof(vtd), &hw_caps,
>>> + errp)) {
>>> return false;
>>> }
>>>
>>> + if (vtd.flags & IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17) {
>>> + container->bcontainer.bypass_ro = true;
>>> + }
>>> +
>>> if (hw_caps & IOMMU_HW_CAP_DIRTY_TRACKING) {
>>> flags = IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
>>> }
>>> diff --git a/hw/vfio/listener.c b/hw/vfio/listener.c
>>> index f498e23a93..c64aa4539e 100644
>>> --- a/hw/vfio/listener.c
>>> +++ b/hw/vfio/listener.c
>>> @@ -364,7 +364,8 @@ static bool
>> vfio_known_safe_misalignment(MemoryRegionSection *section)
>>> return true;
>>> }
>>>
>>> -static bool vfio_listener_valid_section(MemoryRegionSection *section,
>>> +static bool vfio_listener_valid_section(VFIOContainerBase *bcontainer,
>>> + MemoryRegionSection *section,
>>> const char *name)
>>> {
>>> if (vfio_listener_skipped_section(section)) {
>>> @@ -375,6 +376,10 @@ static bool
>> vfio_listener_valid_section(MemoryRegionSection *section,
>>> return false;
>>> }
>>>
>>> + if (bcontainer && bcontainer->bypass_ro && section->readonly) {
>>> + return false;
>>> + }
>>> +
>>> if (unlikely((section->offset_within_address_space &
>>> ~qemu_real_host_page_mask()) !=
>>> (section->offset_within_region & ~qemu_real_host_page_mask()))) {
>>> @@ -494,7 +499,7 @@ void vfio_container_region_add(VFIOContainerBase
>> *bcontainer,
>>> int ret;
>>> Error *err = NULL;
>>>
>>> - if (!vfio_listener_valid_section(section, "region_add")) {
>>> + if (!vfio_listener_valid_section(bcontainer, section, "region_add")) {
>>> return;
>>> }
>>>
>>> @@ -655,7 +660,7 @@ static void vfio_listener_region_del(MemoryListener
>> *listener,
>>> int ret;
>>> bool try_unmap = true;
>>>
>>> - if (!vfio_listener_valid_section(section, "region_del")) {
>>> + if (!vfio_listener_valid_section(bcontainer, section, "region_del")) {
>>> return;
>>> }
>>>
>>> @@ -812,7 +817,7 @@ static void vfio_dirty_tracking_update(MemoryListener
>> *listener,
>>> container_of(listener, VFIODirtyRangesListener, listener);
>>> hwaddr iova, end;
>>>
>>> - if (!vfio_listener_valid_section(section, "tracking_update") ||
>>> + if (!vfio_listener_valid_section(NULL, section, "tracking_update") ||
>>> !vfio_get_section_iova_range(dirty->bcontainer, section,
>>> &iova, &end, NULL)) {
>>> return;
>> Thanks
>>
>> Eric
* Re: [PATCH v2 11/19] intel_iommu: Handle PASID entry adding
2025-06-20 7:18 ` [PATCH v2 11/19] intel_iommu: Handle PASID entry adding Zhenzhong Duan
@ 2025-06-23 11:47 ` Eric Auger
2025-06-24 10:56 ` Duan, Zhenzhong
0 siblings, 1 reply; 58+ messages in thread
From: Eric Auger @ 2025-06-23 11:47 UTC (permalink / raw)
To: Zhenzhong Duan, qemu-devel
Cc: alex.williamson, clg, mst, jasowang, peterx, ddutile, jgg,
nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng, Yi Sun,
Paolo Bonzini, Richard Henderson, Eduardo Habkost,
Marcel Apfelbaum
Hi Zhenzhong,
On 6/20/25 9:18 AM, Zhenzhong Duan wrote:
> When guest modifies a PASID entry, QEMU will capture the guest pasid selective
> pasid cache invalidation, allocate or remove a VTDAddressSpace instance per the
> invalidation reasons:
>
> a) a present pasid entry moved to non-present
> b) a present pasid entry to be a present entry
> c) a non-present pasid entry moved to present
>
> This handles c).
As you use the replay terminology in the patch, please explain what it
means in that case and how the patch achieves the above goal.
>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
> hw/i386/intel_iommu_internal.h | 1 +
> hw/i386/intel_iommu.c | 169 ++++++++++++++++++++++++++++++++-
> 2 files changed, 169 insertions(+), 1 deletion(-)
>
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> index 01c881ed4d..025787b3b9 100644
> --- a/hw/i386/intel_iommu_internal.h
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -558,6 +558,7 @@ typedef struct VTDRootEntry VTDRootEntry;
> #define VTD_CTX_ENTRY_LEGACY_SIZE 16
> #define VTD_CTX_ENTRY_SCALABLE_SIZE 32
>
> +#define VTD_SM_CONTEXT_ENTRY_PDTS(val) (((val) >> 9) & 0x7)
> #define VTD_SM_CONTEXT_ENTRY_RID2PASID_MASK 0xfffff
> #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw) (0x1e0ULL | ~VTD_HAW_MASK(aw))
> #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1 0xffffffffffe00000ULL
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 1db581d14a..f4273dc640 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -826,6 +826,11 @@ static inline bool vtd_pe_type_check(IntelIOMMUState *s, VTDPASIDEntry *pe)
> }
> }
>
> +static inline uint32_t vtd_sm_ce_get_pdt_entry_num(VTDContextEntry *ce)
> +{
> + return 1U << (VTD_SM_CONTEXT_ENTRY_PDTS(ce->val[0]) + 7);
> +}
> +
> static inline uint16_t vtd_pe_get_did(VTDPASIDEntry *pe)
> {
> return VTD_SM_PASID_ENTRY_DID((pe)->val[1]);
> @@ -3246,6 +3251,159 @@ remove:
> return true;
> }
>
> +static void vtd_sm_pasid_table_walk_one(IntelIOMMUState *s,
> + dma_addr_t pt_base,
> + int start,
> + int end,
> + VTDPASIDCacheInfo *info)
> +{
> + VTDPASIDEntry pe;
> + int pasid = start;
> + int pasid_next;
> +
> + while (pasid < end) {
> + pasid_next = pasid + 1;
> +
> + if (!vtd_get_pe_in_pasid_leaf_table(s, pasid, pt_base, &pe)
> + && vtd_pe_present(&pe)) {
> + int bus_n = pci_bus_num(info->bus), devfn = info->devfn;
> + uint16_t sid = PCI_BUILD_BDF(bus_n, devfn);
> + VTDAddressSpace *vtd_as;
> +
> + vtd_iommu_lock(s);
> + /*
> + * When indexed by rid2pasid, vtd_as should have been created,
> + * e.g., by PCI subsystem. For other iommu pasid, we need to
> + * create vtd_as dynamically. The other iommu pasid is same as
> + * PCI's pasid, so it's used as input of vtd_find_add_as().
> + */
> + vtd_as = vtd_as_from_iommu_pasid_locked(s, sid, pasid);
> + vtd_iommu_unlock(s);
> + if (!vtd_as) {
> + vtd_as = vtd_find_add_as(s, info->bus, devfn, pasid);
> + }
> +
> + if ((info->type == VTD_PASID_CACHE_DOMSI ||
> + info->type == VTD_PASID_CACHE_PASIDSI) &&
> + !(info->domain_id == vtd_pe_get_did(&pe))) {
> + /*
> + * VTD_PASID_CACHE_DOMSI and VTD_PASID_CACHE_PASIDSI
> + * requires domain ID check. If domain Id check fail,
> + * go to next pasid.
> + */
> + pasid = pasid_next;
> + continue;
> + }
> + if (vtd_fill_pe_in_cache(s, vtd_as, &pe)) {
> + pasid_cache_info_set_error(info);
> + }
> + }
> + pasid = pasid_next;
> + }
> +}
> +
> +/*
> + * Currently, VT-d scalable mode pasid table is a two level table,
do you mean PASID dir + PASID table? If so, I would use that
terminology directly.
> + * this function aims to loop a range of PASIDs in a given pasid
aims at looping over a range of PASIDs in a given table?
So what do you call a table here? Is it a PASID directory or a PASID table?
> + * table to identify the pasid config in guest.
> + */
> +static void vtd_sm_pasid_table_walk(IntelIOMMUState *s,
> + dma_addr_t pdt_base,
> + int start,
> + int end,
> + VTDPASIDCacheInfo *info)
> +{
> + VTDPASIDDirEntry pdire;
> + int pasid = start;
> + int pasid_next;
> + dma_addr_t pt_base;
> +
> + while (pasid < end) {
> + pasid_next = ((end - pasid) > VTD_PASID_TBL_ENTRY_NUM) ?
> + (pasid + VTD_PASID_TBL_ENTRY_NUM) : end;
> + if (!vtd_get_pdire_from_pdir_table(pdt_base, pasid, &pdire)
> + && vtd_pdire_present(&pdire)) {
> + pt_base = pdire.val & VTD_PASID_TABLE_BASE_ADDR_MASK;
> + vtd_sm_pasid_table_walk_one(s, pt_base, pasid, pasid_next, info);
> + }
> + pasid = pasid_next;
> + }
> +}
> +
> +static void vtd_replay_pasid_bind_for_dev(IntelIOMMUState *s,
> + int start, int end,
> + VTDPASIDCacheInfo *info)
> +{
> + VTDContextEntry ce;
> +
> + if (!vtd_dev_to_context_entry(s, pci_bus_num(info->bus), info->devfn,
> + &ce)) {
> + uint32_t max_pasid;
> +
> + max_pasid = vtd_sm_ce_get_pdt_entry_num(&ce) * VTD_PASID_TBL_ENTRY_NUM;
> + if (end > max_pasid) {
> + end = max_pasid;
> + }
> + vtd_sm_pasid_table_walk(s,
> + VTD_CE_GET_PASID_DIR_TABLE(&ce),
> + start,
> + end,
> + info);
> + }
> +}
> +
> +/*
> + * This function replay the guest pasid bindings to hosts by
replays, host
> + * walking the guest PASID table. This ensures host will have
> + * latest guest pasid bindings.
> + */
> +static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
> + VTDPASIDCacheInfo *pc_info)
> +{
> + VTDHostIOMMUDevice *vtd_hiod;
> + int start = 0, end = 1; /* only rid2pasid is supported */
> + VTDPASIDCacheInfo walk_info;
> + GHashTableIter as_it;
> +
> + switch (pc_info->type) {
> + case VTD_PASID_CACHE_PASIDSI:
> + start = pc_info->pasid;
> + end = pc_info->pasid + 1;
> + /*
> + * PASID selective invalidation is within domain,
> + * thus fall through.
this is still not totally clear to me. For domsi I would have expected a
different setting of start and end?
> + */
> + case VTD_PASID_CACHE_DOMSI:
fall-through clause here and above?
> + case VTD_PASID_CACHE_GLOBAL_INV:
> + /* loop all assigned devices */
> + break;
> + default:
> + error_report("invalid pc_info->type for replay");
> + abort();
> + }
> +
> + /*
> + * In this replay, only needs to care about the devices which
one only needs to care
> + * are backed by host IOMMU. For such devices, their vtd_hiod
Those devices have a corresponding vtd_hiod in s->vtd_host_iommu_dev
> + * instances are in the s->vtd_host_iommu_dev. For devices which
> + * are not backed by host IOMMU, it is not necessary to replay
> + * the bindings since their cache could be re-created in the future
> + * DMA address translation. Access to vtd_host_iommu_dev is already
> + * protected by BQL, so no iommu lock needed here.
> + */
> + walk_info = *pc_info;
> + g_hash_table_iter_init(&as_it, s->vtd_host_iommu_dev);
> + while (g_hash_table_iter_next(&as_it, NULL, (void **)&vtd_hiod)) {
> + /* bus|devfn fields are not identical with pc_info */
Can you clarify?
> + walk_info.bus = vtd_hiod->bus;
> + walk_info.devfn = vtd_hiod->devfn;
> + vtd_replay_pasid_bind_for_dev(s, start, end, &walk_info);
> + }
> + if (walk_info.error_happened) {
> + pasid_cache_info_set_error(pc_info);
> + }
> +}
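To make the fall-through question concrete, here is a minimal sketch (illustrative names, not the QEMU API) of how each invalidation type narrows the PASID walk range. The domain-id filter for DOMSI is applied per entry inside the table walk, not via start/end, which is why the PASIDSI case can fall through harmlessly:

```c
#include <assert.h>
#include <stdint.h>

enum inv_type { INV_PASIDSI, INV_DOMSI, INV_GLOBAL };

/*
 * The default range [0, 1) covers only RID2PASID, the lone PASID the
 * series supports so far. PASID-selective invalidation narrows the walk
 * to a single PASID; domain-selective and global invalidation keep the
 * default range because their per-entry filters (e.g. domain-id match)
 * run during the walk itself.
 */
static void pick_walk_range(enum inv_type t, uint32_t pasid,
                            uint32_t *start, uint32_t *end)
{
    *start = 0;
    *end = 1; /* only rid2pasid is supported */

    switch (t) {
    case INV_PASIDSI:
        *start = pasid;
        *end = pasid + 1;
        /* fall through: nothing further to narrow */
    case INV_DOMSI:
    case INV_GLOBAL:
        break;
    }
}
```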
> +
> /*
> * This function syncs the pasid bindings between guest and host.
> * It includes updating the pasid cache in vIOMMU and updating the
> @@ -3301,7 +3459,16 @@ static void vtd_pasid_cache_sync(IntelIOMMUState *s,
> pc_info);
> vtd_iommu_unlock(s);
>
> - /* TODO: Step 2: loop all the existing vtd_hiod instances for pasid bind. */
> + /*
> + * Step 2: loop all the existing vtd_hiod instances for pasid bind.
> + * Ideally, needs to loop all devices to find if there is any new
> + * PASID binding regards to the PASID cache invalidation request.
> + * But it is enough to loop the devices which are backed by host
> + * IOMMU. For devices backed by vIOMMU (a.k.a emulated devices),
> + * if new PASID happened on them, their vtd_as instance could
> + * be created during future vIOMMU DMA translation.
the above comment is somewhat redundant with the previous one
> + */
> + vtd_replay_guest_pasid_bindings(s, pc_info);
> }
>
> static bool vtd_process_pasid_desc(IntelIOMMUState *s,
Thanks
Eric
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH v2 12/19] intel_iommu: Introduce a new pasid cache invalidation type FORCE_RESET
2025-06-20 7:18 ` [PATCH v2 12/19] intel_iommu: Introduce a new pasid cache invalidation type FORCE_RESET Zhenzhong Duan
@ 2025-06-23 11:55 ` Eric Auger
2025-06-26 8:28 ` Duan, Zhenzhong
0 siblings, 1 reply; 58+ messages in thread
From: Eric Auger @ 2025-06-23 11:55 UTC (permalink / raw)
To: Zhenzhong Duan, qemu-devel
Cc: alex.williamson, clg, mst, jasowang, peterx, ddutile, jgg,
nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng, Yi Sun,
Paolo Bonzini, Richard Henderson, Eduardo Habkost,
Marcel Apfelbaum
Hi Zhenzhong,
On 6/20/25 9:18 AM, Zhenzhong Duan wrote:
> FORCE_RESET is different from GLOBAL_INV which updates pasid cache if
> underlying pasid entry is still valid, it drops all the pasid caches.
>
> FORCE_RESET isn't a VTD spec defined invalidation type for pasid cache,
> only used internally in system level reset.
this comment shall be put in the code, along with the VTDPCInvType
because the reader may look for that cmd in the spec.
>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
> hw/i386/intel_iommu_internal.h | 2 ++
> hw/i386/intel_iommu.c | 28 ++++++++++++++++++++++++++++
> hw/i386/trace-events | 1 +
> 3 files changed, 31 insertions(+)
>
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> index 025787b3b9..5ed76864be 100644
> --- a/hw/i386/intel_iommu_internal.h
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -564,6 +564,8 @@ typedef struct VTDRootEntry VTDRootEntry;
> #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1 0xffffffffffe00000ULL
>
> typedef enum VTDPCInvType {
> + /* Force reset all */
> + VTD_PASID_CACHE_FORCE_RESET = 0,
> /* pasid cache invalidation rely on guest PASID entry */
> VTD_PASID_CACHE_GLOBAL_INV, /* pasid cache global invalidation */
> VTD_PASID_CACHE_DOMSI, /* pasid cache domain selective invalidation */
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index f4273dc640..ed71bb8ec7 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -87,6 +87,8 @@ struct vtd_iotlb_key {
> static void vtd_address_space_refresh_all(IntelIOMMUState *s);
> static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
>
> +static void vtd_pasid_cache_reset_locked(IntelIOMMUState *s);
> +
> static void vtd_panic_require_caching_mode(void)
> {
> error_report("We need to set caching-mode=on for intel-iommu to enable "
> @@ -391,6 +393,7 @@ static void vtd_reset_caches(IntelIOMMUState *s)
> vtd_iommu_lock(s);
> vtd_reset_iotlb_locked(s);
> vtd_reset_context_cache_locked(s);
> + vtd_pasid_cache_reset_locked(s);
> vtd_iommu_unlock(s);
> }
>
> @@ -3196,6 +3199,8 @@ static gboolean vtd_flush_pasid(gpointer key, gpointer value,
> }
>
> switch (pc_info->type) {
> + case VTD_PASID_CACHE_FORCE_RESET:
> + goto remove;
> case VTD_PASID_CACHE_PASIDSI:
> if (pc_info->pasid != pasid) {
> return false;
> @@ -3251,6 +3256,26 @@ remove:
> return true;
> }
>
> +/* Caller of this function should hold iommu_lock */
> +static void vtd_pasid_cache_reset_locked(IntelIOMMUState *s)
> +{
> + VTDPASIDCacheInfo pc_info = {};
> +
> + trace_vtd_pasid_cache_reset();
> +
> + pc_info.type = VTD_PASID_CACHE_FORCE_RESET;
> +
> + /*
> + * Reset pasid cache is a big hammer, so use g_hash_table_foreach_remove
> + * which will free the vtd_as instances. Also, as a big hammer, use
> + * VTD_PASID_CACHE_FORCE_RESET to ensure all the vtd_as instances are
> + * dropped, meanwhile the change will be passed to host if
> + * HostIOMMUDeviceIOMMUFD is available.
> + */
> + g_hash_table_foreach_remove(s->vtd_address_spaces,
> + vtd_flush_pasid, &pc_info);
> +}
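The "big hammer" semantics of the g_hash_table_foreach_remove() sweep can be modeled with a small stand-in (plain C, illustrative names; the real code iterates a GHashTable of vtd_as instances and the predicate is vtd_flush_pasid):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { CACHE_FORCE_RESET, CACHE_PASIDSI };

struct entry { int pasid; bool live; };

/* FORCE_RESET makes the flush predicate match every cached entry,
 * so a foreach_remove-style sweep drops them all; selective types
 * drop only on a match. */
static bool flush_pred(int inv_type, int inv_pasid, const struct entry *e)
{
    if (inv_type == CACHE_FORCE_RESET) {
        return true;               /* big hammer: drop unconditionally */
    }
    return e->pasid == inv_pasid;  /* selective: drop on match only */
}

/* Returns the number of entries removed by this sweep. */
static size_t sweep(struct entry *tbl, size_t n, int type, int pasid)
{
    size_t removed = 0;
    for (size_t i = 0; i < n; i++) {
        if (tbl[i].live && flush_pred(type, pasid, &tbl[i])) {
            tbl[i].live = false;
            removed++;
        }
    }
    return removed;
}
```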
> +
> static void vtd_sm_pasid_table_walk_one(IntelIOMMUState *s,
> dma_addr_t pt_base,
> int start,
> @@ -3377,6 +3402,9 @@ static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
> case VTD_PASID_CACHE_GLOBAL_INV:
> /* loop all assigned devices */
> break;
> + case VTD_PASID_CACHE_FORCE_RESET:
> + /* For force reset, no need to go further replay */
> + return;
> default:
> error_report("invalid pc_info->type for replay");
> abort();
> diff --git a/hw/i386/trace-events b/hw/i386/trace-events
> index ae5bbfcdc0..c8a936eb46 100644
> --- a/hw/i386/trace-events
> +++ b/hw/i386/trace-events
> @@ -24,6 +24,7 @@ vtd_inv_qi_head(uint16_t head) "read head %d"
> vtd_inv_qi_tail(uint16_t head) "write tail %d"
> vtd_inv_qi_fetch(void) ""
> vtd_context_cache_reset(void) ""
> +vtd_pasid_cache_reset(void) ""
> vtd_pasid_cache_gsi(void) ""
> vtd_pasid_cache_dsi(uint16_t domain) "Domain selective PC invalidation domain 0x%"PRIx16
> vtd_pasid_cache_psi(uint16_t domain, uint32_t pasid) "PASID selective PC invalidation domain 0x%"PRIx16" pasid 0x%"PRIx32
Eric
* Re: [PATCH v2 13/19] intel_iommu: Stick to system MR for IOMMUFD backed host device when x-fls=on
2025-06-20 7:18 ` [PATCH v2 13/19] intel_iommu: Stick to system MR for IOMMUFD backed host device when x-fls=on Zhenzhong Duan
@ 2025-06-23 12:02 ` Eric Auger
2025-06-26 8:37 ` Duan, Zhenzhong
0 siblings, 1 reply; 58+ messages in thread
From: Eric Auger @ 2025-06-23 12:02 UTC (permalink / raw)
To: Zhenzhong Duan, qemu-devel
Cc: alex.williamson, clg, mst, jasowang, peterx, ddutile, jgg,
nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Paolo Bonzini, Richard Henderson, Eduardo Habkost,
Marcel Apfelbaum
On 6/20/25 9:18 AM, Zhenzhong Duan wrote:
> When guest in scalable mode and x-flts=on, we stick to system MR for IOMMUFD
when the guest works in scalable mode?
> backed host device. Then its default hwpt contains GPA->HPA mappings which is
> used directly if PGTT=PT and used as nested parent if PGTT=FLT. Otherwise
> fallback to original processing.
>
> Suggested-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
> hw/i386/intel_iommu.c | 24 ++++++++++++++++++++++++
> 1 file changed, 24 insertions(+)
>
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index ed71bb8ec7..be01f8885f 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -1779,6 +1779,7 @@ static bool vtd_dev_pt_enabled(IntelIOMMUState *s, VTDContextEntry *ce,
> */
> return false;
> }
> +
spurious newline
> return (VTD_PE_GET_TYPE(&pe) == VTD_SM_PASID_ENTRY_PT);
> }
>
> @@ -1790,10 +1791,33 @@ static bool vtd_as_pt_enabled(VTDAddressSpace *as)
> {
> IntelIOMMUState *s;
> VTDContextEntry ce;
> + struct vtd_as_key key = {
> + .bus = as->bus,
> + .devfn = as->devfn,
> + };
>
> assert(as);
>
> s = as->iommu_state;
> +
> + /*
> + * When guest in scalable mode and x-flts=on, we stick to system MR
ditto
> + * for IOMMUFD backed host device. Then its default hwpt contains
> + * GPA->HPA mappings which is used directly if PGTT=PT and used as
> + * nested parent if PGTT=FLT. Otherwise fallback to original
fall back (two words) here and above.
This comment sounds a little bit unrelated to the implementation below,
which does not refer to the system MR. How does the search for the hiod
relate to that? I would refocus the comment.
> + * processing.
> + */
> + if (s->root_scalable && s->flts) {
> + VTDHostIOMMUDevice *vtd_hiod;
> +
> + vtd_hiod = g_hash_table_lookup(s->vtd_host_iommu_dev, &key);
> + if (vtd_hiod && vtd_hiod->hiod &&
> + object_dynamic_cast(OBJECT(vtd_hiod->hiod),
> + TYPE_HOST_IOMMU_DEVICE_IOMMUFD)) {
> + return true;
> + }
> + }
> +
> if (vtd_dev_to_context_entry(s, pci_bus_num(as->bus), as->devfn,
> &ce)) {
> /*
Eric
* Re: [PATCH v2 14/19] intel_iommu: Bind/unbind guest page table to host
2025-06-20 7:18 ` [PATCH v2 14/19] intel_iommu: Bind/unbind guest page table to host Zhenzhong Duan
@ 2025-06-23 13:17 ` Eric Auger
2025-06-26 9:17 ` Duan, Zhenzhong
0 siblings, 1 reply; 58+ messages in thread
From: Eric Auger @ 2025-06-23 13:17 UTC (permalink / raw)
To: Zhenzhong Duan, qemu-devel
Cc: alex.williamson, clg, mst, jasowang, peterx, ddutile, jgg,
nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng, Yi Sun,
Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
Eduardo Habkost
On 6/20/25 9:18 AM, Zhenzhong Duan wrote:
> This captures the guest PASID table entry modifications and
> propagates the changes to host to attach a hwpt with type determined
> per guest IOMMU mode and PGTT configuration.
>
> When PGTT is Pass-through(100b), the hwpt on host side is a stage-2
> page table(GPA->HPA). When PGTT is First-stage Translation only(001b),
> vIOMMU reuses hwpt(GPA->HPA) provided by VFIO as nested parent to
> construct nested page table.
>
> When guest decides to use legacy mode then vIOMMU switches the MRs of
> the device's AS, hence the IOAS created by VFIO container would be
> switched to using the IOMMU_NOTIFIER_IOTLB_EVENTS since the MR is
> switched to IOMMU MR. So it is able to support shadowing the guest IO
> page table.
>
> Co-Authored-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
> hw/i386/intel_iommu_internal.h | 11 ++
> hw/i386/intel_iommu.c | 244 +++++++++++++++++++++++++++++++--
> hw/i386/trace-events | 3 +
> 3 files changed, 243 insertions(+), 15 deletions(-)
>
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> index 5ed76864be..92a533db54 100644
> --- a/hw/i386/intel_iommu_internal.h
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -563,6 +563,13 @@ typedef struct VTDRootEntry VTDRootEntry;
> #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw) (0x1e0ULL | ~VTD_HAW_MASK(aw))
> #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1 0xffffffffffe00000ULL
>
> +typedef enum VTDPASIDOp {
> + VTD_PASID_BIND,
> + VTD_PASID_UPDATE,
> + VTD_PASID_UNBIND,
> + VTD_OP_NUM
> +} VTDPASIDOp;
> +
> typedef enum VTDPCInvType {
> /* Force reset all */
> VTD_PASID_CACHE_FORCE_RESET = 0,
> @@ -607,6 +614,9 @@ typedef struct VTDPASIDCacheInfo {
>
> #define VTD_SM_PASID_ENTRY_FLPM 3ULL
> #define VTD_SM_PASID_ENTRY_FLPTPTR (~0xfffULL)
> +#define VTD_SM_PASID_ENTRY_SRE_BIT(val) (!!((val) & 1ULL))
> +#define VTD_SM_PASID_ENTRY_WPE_BIT(val) (!!(((val) >> 4) & 1ULL))
> +#define VTD_SM_PASID_ENTRY_EAFE_BIT(val) (!!(((val) >> 7) & 1ULL))
>
> /* First Level Paging Structure */
> /* Masks for First Level Paging Entry */
> @@ -644,5 +654,6 @@ typedef struct VTDHostIOMMUDevice {
> PCIBus *bus;
> uint8_t devfn;
> HostIOMMUDevice *hiod;
> + uint32_t s1_hwpt;
> } VTDHostIOMMUDevice;
> #endif
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index be01f8885f..1c94a0033c 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -20,6 +20,7 @@
> */
>
> #include "qemu/osdep.h"
> +#include CONFIG_DEVICES /* CONFIG_IOMMUFD */
> #include "qemu/error-report.h"
> #include "qemu/main-loop.h"
> #include "qapi/error.h"
> @@ -41,6 +42,9 @@
> #include "migration/vmstate.h"
> #include "trace.h"
> #include "system/iommufd.h"
> +#ifdef CONFIG_IOMMUFD
> +#include <linux/iommufd.h>
> +#endif
>
> /* context entry operations */
> #define VTD_CE_GET_RID2PASID(ce) \
> @@ -839,6 +843,27 @@ static inline uint16_t vtd_pe_get_did(VTDPASIDEntry *pe)
> return VTD_SM_PASID_ENTRY_DID((pe)->val[1]);
> }
>
> +static inline dma_addr_t vtd_pe_get_flpt_base(VTDPASIDEntry *pe)
> +{
> + return pe->val[2] & VTD_SM_PASID_ENTRY_FLPTPTR;
Isn't it called FSPTPTR in the spec? If so, I would use the same
terminology.
> +}
> +
> +static inline uint32_t vtd_pe_get_fl_aw(VTDPASIDEntry *pe)
> +{
> + return 48 + ((pe->val[2] >> 2) & VTD_SM_PASID_ENTRY_FLPM) * 9;
here again I am a bit lost, as you seem to look at the 3rd 64b (FSPM) while
there is an AW field in the first 64b; please add a comment.
Also it is not clear where this computation comes from. Can you quote the
spec?
> +}
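For what it's worth, the "48 + FLPM * 9" arithmetic matches the usual x86 paging geometry: 4-level first-stage paging covers 48 bits, and each extra table level resolves 9 more address bits. A minimal sketch, assuming (as the patch's mask implies) the paging-mode field sits in pasid-entry bits 3:2:

```c
#include <assert.h>
#include <stdint.h>

/*
 * First-stage paging mode selects the page-table depth; each extra level
 * adds 9 bits of virtual address on top of the 48-bit 4-level base:
 *   0 -> 4-level paging -> 48-bit address width
 *   1 -> 5-level paging -> 57-bit address width
 */
static uint32_t fl_addr_width(uint64_t pe_val2)
{
    uint64_t flpm = (pe_val2 >> 2) & 3ULL; /* bits 3:2 of the 3rd 64b */
    return 48 + (uint32_t)flpm * 9;
}
```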
> +
> +static inline bool vtd_pe_pgtt_is_pt(VTDPASIDEntry *pe)
> +{
> + return (VTD_PE_GET_TYPE(pe) == VTD_SM_PASID_ENTRY_PT);
> +}
> +
> +/* check if pgtt is first stage translation */
> +static inline bool vtd_pe_pgtt_is_flt(VTDPASIDEntry *pe)
> +{
> + return (VTD_PE_GET_TYPE(pe) == VTD_SM_PASID_ENTRY_FLT);
> +}
> +
> static inline bool vtd_pdire_present(VTDPASIDDirEntry *pdire)
> {
> return pdire->val & 1;
> @@ -2431,6 +2456,188 @@ static void vtd_context_global_invalidate(IntelIOMMUState *s)
> vtd_iommu_replay_all(s);
> }
>
> +#ifdef CONFIG_IOMMUFD
> +static void vtd_init_s1_hwpt_data(struct iommu_hwpt_vtd_s1 *vtd,
> + VTDPASIDEntry *pe)
> +{
> + memset(vtd, 0, sizeof(*vtd));
> +
> + vtd->flags = (VTD_SM_PASID_ENTRY_SRE_BIT(pe->val[2]) ?
> + IOMMU_VTD_S1_SRE : 0) |
> + (VTD_SM_PASID_ENTRY_WPE_BIT(pe->val[2]) ?
> + IOMMU_VTD_S1_WPE : 0) |
> + (VTD_SM_PASID_ENTRY_EAFE_BIT(pe->val[2]) ?
> + IOMMU_VTD_S1_EAFE : 0);
> + vtd->addr_width = vtd_pe_get_fl_aw(pe);
> + vtd->pgtbl_addr = (uint64_t)vtd_pe_get_flpt_base(pe);
> +}
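The flag packing above is a bit-by-bit translation from the pasid entry into the hwpt allocation data; a self-contained sketch with placeholder constants (the real values come from linux/iommufd.h and the VTD_SM_PASID_ENTRY_* macros) makes the mapping testable:

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder bit positions mirroring the patch; not the kernel UAPI. */
#define SRE_BIT(v)  (!!((v) & 1ULL))        /* Supervisor Request Enable */
#define WPE_BIT(v)  (!!(((v) >> 4) & 1ULL)) /* Write Protect Enable */
#define EAFE_BIT(v) (!!(((v) >> 7) & 1ULL)) /* Extended Access Flag Enable */

#define S1_SRE  (1u << 0)
#define S1_WPE  (1u << 1)
#define S1_EAFE (1u << 2)

/* Translate pasid-entry bits (3rd 64b) into stage-1 hwpt flags. */
static uint32_t pe_val2_to_s1_flags(uint64_t val2)
{
    return (SRE_BIT(val2) ? S1_SRE : 0) |
           (WPE_BIT(val2) ? S1_WPE : 0) |
           (EAFE_BIT(val2) ? S1_EAFE : 0);
}
```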
> +
> +static int vtd_create_s1_hwpt(VTDHostIOMMUDevice *vtd_hiod,
> + VTDPASIDEntry *pe, Error **errp)
> +{
> + HostIOMMUDeviceIOMMUFD *idev = HOST_IOMMU_DEVICE_IOMMUFD(vtd_hiod->hiod);
> + struct iommu_hwpt_vtd_s1 vtd;
> + uint32_t s1_hwpt;
> +
> + vtd_init_s1_hwpt_data(&vtd, pe);
> +
> + if (!iommufd_backend_alloc_hwpt(idev->iommufd, idev->devid,
> + idev->hwpt_id, 0, IOMMU_HWPT_DATA_VTD_S1,
> + sizeof(vtd), &vtd, &s1_hwpt, errp)) {
> + return -EINVAL;
> + }
> +
> + vtd_hiod->s1_hwpt = s1_hwpt;
> +
> + return 0;
> +}
> +
> +static void vtd_destroy_s1_hwpt(VTDHostIOMMUDevice *vtd_hiod)
> +{
> + HostIOMMUDeviceIOMMUFD *idev = HOST_IOMMU_DEVICE_IOMMUFD(vtd_hiod->hiod);
> +
> + iommufd_backend_free_id(idev->iommufd, vtd_hiod->s1_hwpt);
> + vtd_hiod->s1_hwpt = 0;
> +}
> +
> +static int vtd_device_attach_iommufd(VTDHostIOMMUDevice *vtd_hiod,
> + uint32_t pasid, VTDPASIDEntry *pe,
> + Error **errp)
> +{
> + HostIOMMUDeviceIOMMUFD *idev = HOST_IOMMU_DEVICE_IOMMUFD(vtd_hiod->hiod);
> + uint32_t hwpt_id;
> + int ret;
> +
> + if (vtd_pe_pgtt_is_flt(pe)) {
> + ret = vtd_create_s1_hwpt(vtd_hiod, pe, errp);
> + if (ret) {
> + return ret;
> + }
> + hwpt_id = vtd_hiod->s1_hwpt;
> + } else {
> + hwpt_id = idev->hwpt_id;
> + }
> +
> + ret = !host_iommu_device_iommufd_attach_hwpt(idev, hwpt_id, errp);
> + trace_vtd_device_attach_hwpt(idev->devid, pasid, hwpt_id, ret);
> + if (ret && vtd_pe_pgtt_is_flt(pe)) {
> + vtd_destroy_s1_hwpt(vtd_hiod);
> + }
> +
> + return ret;
> +}
> +
> +static int vtd_device_detach_iommufd(VTDHostIOMMUDevice *vtd_hiod,
> + uint32_t pasid, VTDPASIDEntry *pe,
> + Error **errp)
> +{
> + HostIOMMUDeviceIOMMUFD *idev = HOST_IOMMU_DEVICE_IOMMUFD(vtd_hiod->hiod);
> + int ret;
> +
> + if (vtd_hiod->iommu_state->dmar_enabled) {
> + ret = !host_iommu_device_iommufd_detach_hwpt(idev, errp);
> + trace_vtd_device_detach_hwpt(idev->devid, pasid, ret);
> + } else {
> + ret = !host_iommu_device_iommufd_attach_hwpt(idev, idev->hwpt_id, errp);
> + trace_vtd_device_reattach_def_hwpt(idev->devid, pasid, idev->hwpt_id,
> + ret);
> + }
> +
> + if (vtd_pe_pgtt_is_flt(pe)) {
> + vtd_destroy_s1_hwpt(vtd_hiod);
> + }
> +
> + return ret;
> +}
> +
> +static int vtd_device_attach_pgtbl(VTDHostIOMMUDevice *vtd_hiod,
> + VTDAddressSpace *vtd_as, VTDPASIDEntry *pe)
> +{
> + /*
> + * If pe->gptt == FLT, should be go ahead to do bind as host only
PGTT. The rest of the sentence is difficult to parse.
> + * accepts guest FLT under nesting. If pe->pgtt==PT, should setup
> + * the pasid with GPA page table. Otherwise should return failure.
> + */
> + if (!vtd_pe_pgtt_is_flt(pe) && !vtd_pe_pgtt_is_pt(pe)) {
> + return -EINVAL;
> + }
> +
> + /* Should fail if the FLPT base is 0 */
> + if (vtd_pe_pgtt_is_flt(pe) && !vtd_pe_get_flpt_base(pe)) {
> + return -EINVAL;
> + }
> +
> + return vtd_device_attach_iommufd(vtd_hiod, vtd_as->pasid, pe, &error_abort);
> +}
> +
> +static int vtd_device_detach_pgtbl(VTDHostIOMMUDevice *vtd_hiod,
> + VTDAddressSpace *vtd_as)
> +{
> + VTDPASIDEntry *cached_pe = vtd_as->pasid_cache_entry.cache_filled ?
> + &vtd_as->pasid_cache_entry.pasid_entry : NULL;
> +
> + if (!cached_pe ||
> + (!vtd_pe_pgtt_is_flt(cached_pe) && !vtd_pe_pgtt_is_pt(cached_pe))) {
> + return 0;
> + }
> +
> + return vtd_device_detach_iommufd(vtd_hiod, vtd_as->pasid, cached_pe,
> + &error_abort);
> +}
> +
> +/**
> + * Caller should hold iommu_lock.
> + */
> +static int vtd_bind_guest_pasid(VTDAddressSpace *vtd_as,
> + VTDPASIDEntry *pe, VTDPASIDOp op)
> +{
> + IntelIOMMUState *s = vtd_as->iommu_state;
> + VTDHostIOMMUDevice *vtd_hiod;
> + int devfn = vtd_as->devfn;
> + int ret = -EINVAL;
> + struct vtd_as_key key = {
> + .bus = vtd_as->bus,
> + .devfn = devfn,
> + };
> +
> + vtd_hiod = g_hash_table_lookup(s->vtd_host_iommu_dev, &key);
> + if (!vtd_hiod || !vtd_hiod->hiod) {
> + /* means no need to go further, e.g. for emulated devices */
Don't you want to check
object_dynamic_cast(OBJECT(vtd_hiod->hiod),
TYPE_HOST_IOMMU_DEVICE_IOMMUFD)
as well?
If so, you may introduce a helper that returns the vtd_hiod or NULL. It could also be used in the previous patch and maybe at other locations as well.
> + return 0;
> + }
> +
> + if (vtd_as->pasid != PCI_NO_PASID) {
> + error_report("Non-rid_pasid %d not supported yet", vtd_as->pasid);
> + return ret;
> + }
> +
> + switch (op) {
> + case VTD_PASID_UPDATE:
> + case VTD_PASID_BIND:
> + {
> + ret = vtd_device_attach_pgtbl(vtd_hiod, vtd_as, pe);
> + break;
> + }
> + case VTD_PASID_UNBIND:
> + {
> + ret = vtd_device_detach_pgtbl(vtd_hiod, vtd_as);
> + break;
> + }
> + default:
> + error_report_once("Unknown VTDPASIDOp!!!\n");
> + break;
> + }
> +
> + return ret;
> +}
> +#else
> +static int vtd_bind_guest_pasid(VTDAddressSpace *vtd_as,
> + VTDPASIDEntry *pe, VTDPASIDOp op)
> +{
> + return 0;
> +}
> +#endif
> +
> /* Do a context-cache device-selective invalidation.
> * @func_mask: FM field after shifting
> */
> @@ -3181,20 +3388,23 @@ static int vtd_fill_pe_in_cache(IntelIOMMUState *s, VTDAddressSpace *vtd_as,
> VTDPASIDEntry *pe)
> {
> VTDPASIDCacheEntry *pc_entry = &vtd_as->pasid_cache_entry;
> + int ret;
>
> - if (vtd_pasid_entry_compare(pe, &pc_entry->pasid_entry)) {
> - /* No need to go further as cached pasid entry is latest */
> - return 0;
> + if (pc_entry->cache_filled) {
> + if (vtd_pasid_entry_compare(pe, &pc_entry->pasid_entry)) {
> + /* No need to go further as cached pasid entry is latest */
> + return 0;
> + }
> + ret = vtd_bind_guest_pasid(vtd_as, pe, VTD_PASID_UPDATE);
> + } else {
> + ret = vtd_bind_guest_pasid(vtd_as, pe, VTD_PASID_BIND);
> }
>
> - pc_entry->pasid_entry = *pe;
> - pc_entry->cache_filled = true;
> -
> - /*
> - * TODO: send pasid bind to host for passthru devices
> - */
> -
> - return 0;
> + if (!ret) {
> + pc_entry->pasid_entry = *pe;
> + pc_entry->cache_filled = true;
> + }
> + return ret;
> }
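The new compare-then-bind logic in this hunk reduces to a small decision table; here is a sketch with illustrative names (not the QEMU functions):

```c
#include <assert.h>
#include <stdbool.h>

enum op { OP_NONE, OP_BIND, OP_UPDATE };

/*
 * Compare the guest pasid entry against the cached copy and pick the
 * host-side operation: first sighting binds, an unchanged entry is a
 * no-op, a changed entry updates the existing binding.
 */
static enum op pick_op(bool cache_filled, bool same_as_cached)
{
    if (!cache_filled) {
        return OP_BIND;      /* no cached entry yet: bind to host */
    }
    if (same_as_cached) {
        return OP_NONE;      /* cached pasid entry is latest */
    }
    return OP_UPDATE;        /* entry changed: update host binding */
}
```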
>
> /*
> @@ -3265,10 +3475,14 @@ static gboolean vtd_flush_pasid(gpointer key, gpointer value,
> return false;
>
> remove:
> - /*
> - * TODO: send pasid unbind to host for passthru devices
> - */
> - pc_entry->cache_filled = false;
> + if (pc_entry->cache_filled) {
> + if (vtd_bind_guest_pasid(vtd_as, NULL, VTD_PASID_UNBIND)) {
> + pasid_cache_info_set_error(pc_info);
> + return false;
> + } else {
> + pc_entry->cache_filled = false;
> + }
> + }
>
> /*
> * Don't remove address space of PCI_NO_PASID which is created by PCI
> diff --git a/hw/i386/trace-events b/hw/i386/trace-events
> index c8a936eb46..1c31b9a873 100644
> --- a/hw/i386/trace-events
> +++ b/hw/i386/trace-events
> @@ -73,6 +73,9 @@ vtd_warn_invalid_qi_tail(uint16_t tail) "tail 0x%"PRIx16
> vtd_warn_ir_vector(uint16_t sid, int index, int vec, int target) "sid 0x%"PRIx16" index %d vec %d (should be: %d)"
> vtd_warn_ir_trigger(uint16_t sid, int index, int trig, int target) "sid 0x%"PRIx16" index %d trigger %d (should be: %d)"
> vtd_reset_exit(void) ""
> +vtd_device_attach_hwpt(uint32_t dev_id, uint32_t pasid, uint32_t hwpt_id, int ret) "dev_id %d pasid %d hwpt_id %d, ret: %d"
> +vtd_device_detach_hwpt(uint32_t dev_id, uint32_t pasid, int ret) "dev_id %d pasid %d ret: %d"
> +vtd_device_reattach_def_hwpt(uint32_t dev_id, uint32_t pasid, uint32_t hwpt_id, int ret) "dev_id %d pasid %d hwpt_id %d, ret: %d"
>
> # amd_iommu.c
> amdvi_evntlog_fail(uint64_t addr, uint32_t head) "error: fail to write at addr 0x%"PRIx64" + offset 0x%"PRIx32
Eric
* Re: [PATCH v2 15/19] intel_iommu: Replay pasid binds after context cache invalidation
2025-06-20 7:18 ` [PATCH v2 15/19] intel_iommu: Replay pasid binds after context cache invalidation Zhenzhong Duan
@ 2025-06-23 13:25 ` Eric Auger
2025-06-26 9:27 ` Duan, Zhenzhong
0 siblings, 1 reply; 58+ messages in thread
From: Eric Auger @ 2025-06-23 13:25 UTC (permalink / raw)
To: Zhenzhong Duan, qemu-devel
Cc: alex.williamson, clg, mst, jasowang, peterx, ddutile, jgg,
nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng, Yi Sun,
Paolo Bonzini, Richard Henderson, Eduardo Habkost,
Marcel Apfelbaum
On 6/20/25 9:18 AM, Zhenzhong Duan wrote:
> From: Yi Liu <yi.l.liu@intel.com>
>
> This replays guest pasid attachments after context cache invalidation.
> This is a behavior to ensure safety. Actually, programmer should issue
> pasid cache invalidation with proper granularity after issuing a context
> cache invalidation.
so is it mandated to do the invalidation twice?
>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
> hw/i386/intel_iommu_internal.h | 1 +
> hw/i386/intel_iommu.c | 51 ++++++++++++++++++++++++++++++++--
> hw/i386/trace-events | 1 +
> 3 files changed, 51 insertions(+), 2 deletions(-)
>
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> index 92a533db54..b3e4aa23f1 100644
> --- a/hw/i386/intel_iommu_internal.h
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -575,6 +575,7 @@ typedef enum VTDPCInvType {
> VTD_PASID_CACHE_FORCE_RESET = 0,
> /* pasid cache invalidation rely on guest PASID entry */
> VTD_PASID_CACHE_GLOBAL_INV, /* pasid cache global invalidation */
> + VTD_PASID_CACHE_DEVSI, /* pasid cache device selective invalidation */
> VTD_PASID_CACHE_DOMSI, /* pasid cache domain selective invalidation */
> VTD_PASID_CACHE_PASIDSI, /* pasid cache pasid selective invalidation */
> } VTDPCInvType;
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 1c94a0033c..621b07aa02 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -92,6 +92,10 @@ static void vtd_address_space_refresh_all(IntelIOMMUState *s);
> static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
>
> static void vtd_pasid_cache_reset_locked(IntelIOMMUState *s);
> +static void vtd_pasid_cache_sync(IntelIOMMUState *s,
> + VTDPASIDCacheInfo *pc_info);
> +static void vtd_pasid_cache_devsi(IntelIOMMUState *s,
> + PCIBus *bus, uint16_t devfn);
>
> static void vtd_panic_require_caching_mode(void)
> {
> @@ -2437,6 +2441,8 @@ static void vtd_iommu_replay_all(IntelIOMMUState *s)
>
> static void vtd_context_global_invalidate(IntelIOMMUState *s)
> {
> + VTDPASIDCacheInfo pc_info = { .error_happened = false, };
> +
> trace_vtd_inv_desc_cc_global();
> /* Protects context cache */
> vtd_iommu_lock(s);
> @@ -2454,6 +2460,9 @@ static void vtd_context_global_invalidate(IntelIOMMUState *s)
> * VT-d emulation codes.
> */
> vtd_iommu_replay_all(s);
> +
> + pc_info.type = VTD_PASID_CACHE_GLOBAL_INV;
> + vtd_pasid_cache_sync(s, &pc_info);
> }
>
> #ifdef CONFIG_IOMMUFD
> @@ -2696,6 +2705,21 @@ static void vtd_context_device_invalidate(IntelIOMMUState *s,
> * happened.
> */
> vtd_address_space_sync(vtd_as);
> + /*
> + * Per spec, context flush should also followed with PASID
be followed
> + * cache and iotlb flush. Regards to a device selective
s/Regards to/Regarding/?
> + * context cache invalidation:
> + * if (emaulted_device)
emulated
> + * invalidate pasid cache and pasid-based iotlb
> + * else if (assigned_device)
> + * check if the device has been bound to any pasid
> + * invoke pasid_unbind regards to each bound pasid
> + * Here, we have vtd_pasid_cache_devsi() to invalidate pasid
> + * caches, while for piotlb in QEMU, we don't have it yet, so
> + * no handling. For assigned device, host iommu driver would
> + * flush piotlb when a pasid unbind is pass down to it.
> + */
> + vtd_pasid_cache_devsi(s, vtd_as->bus, devfn);
> }
> }
> }
> @@ -3447,6 +3471,11 @@ static gboolean vtd_flush_pasid(gpointer key, gpointer value,
> /* Fall through */
> case VTD_PASID_CACHE_GLOBAL_INV:
> break;
> + case VTD_PASID_CACHE_DEVSI:
> + if (pc_info->bus != vtd_as->bus || pc_info->devfn != vtd_as->devfn) {
> + return false;
> + }
> + break;
> default:
> error_report("invalid pc_info->type");
> abort();
> @@ -3640,6 +3669,11 @@ static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
> case VTD_PASID_CACHE_GLOBAL_INV:
> /* loop all assigned devices */
> break;
> + case VTD_PASID_CACHE_DEVSI:
> + walk_info.bus = pc_info->bus;
> + walk_info.devfn = pc_info->devfn;
> + vtd_replay_pasid_bind_for_dev(s, start, end, &walk_info);
> + return;
> case VTD_PASID_CACHE_FORCE_RESET:
> /* For force reset, no need to go further replay */
> return;
> @@ -3675,8 +3709,7 @@ static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
> * It includes updating the pasid cache in vIOMMU and updating the
> * pasid bindings per guest's latest pasid entry presence.
> */
> -static void vtd_pasid_cache_sync(IntelIOMMUState *s,
> - VTDPASIDCacheInfo *pc_info)
> +static void vtd_pasid_cache_sync(IntelIOMMUState *s, VTDPASIDCacheInfo *pc_info)
> {
> if (!s->flts || !s->root_scalable || !s->dmar_enabled) {
> return;
> @@ -3737,6 +3770,20 @@ static void vtd_pasid_cache_sync(IntelIOMMUState *s,
> vtd_replay_guest_pasid_bindings(s, pc_info);
> }
>
> +static void vtd_pasid_cache_devsi(IntelIOMMUState *s,
> + PCIBus *bus, uint16_t devfn)
> +{
> + VTDPASIDCacheInfo pc_info = { .error_happened = false, };
> +
> + trace_vtd_pasid_cache_devsi(devfn);
> +
> + pc_info.type = VTD_PASID_CACHE_DEVSI;
> + pc_info.bus = bus;
> + pc_info.devfn = devfn;
> +
> + vtd_pasid_cache_sync(s, &pc_info);
> +}
> +
> static bool vtd_process_pasid_desc(IntelIOMMUState *s,
> VTDInvDesc *inv_desc)
> {
> diff --git a/hw/i386/trace-events b/hw/i386/trace-events
> index 1c31b9a873..830b11f68b 100644
> --- a/hw/i386/trace-events
> +++ b/hw/i386/trace-events
> @@ -28,6 +28,7 @@ vtd_pasid_cache_reset(void) ""
> vtd_pasid_cache_gsi(void) ""
> vtd_pasid_cache_dsi(uint16_t domain) "Domain selective PC invalidation domain 0x%"PRIx16
> vtd_pasid_cache_psi(uint16_t domain, uint32_t pasid) "PASID selective PC invalidation domain 0x%"PRIx16" pasid 0x%"PRIx32
> +vtd_pasid_cache_devsi(uint16_t devfn) "Dev selective PC invalidation dev: 0x%"PRIx16
> vtd_re_not_present(uint8_t bus) "Root entry bus %"PRIu8" not present"
> vtd_ce_not_present(uint8_t bus, uint8_t devfn) "Context entry bus %"PRIu8" devfn %"PRIu8" not present"
> vtd_iotlb_page_hit(uint16_t sid, uint64_t addr, uint64_t slpte, uint16_t domain) "IOTLB page hit sid 0x%"PRIx16" iova 0x%"PRIx64" slpte 0x%"PRIx64" domain 0x%"PRIx16
Eric
* Re: [PATCH v2 16/19] intel_iommu: Propagate PASID-based iotlb invalidation to host
2025-06-20 7:18 ` [PATCH v2 16/19] intel_iommu: Propagate PASID-based iotlb invalidation to host Zhenzhong Duan
@ 2025-06-23 13:41 ` Eric Auger
2025-06-26 9:42 ` Duan, Zhenzhong
0 siblings, 1 reply; 58+ messages in thread
From: Eric Auger @ 2025-06-23 13:41 UTC (permalink / raw)
To: Zhenzhong Duan, qemu-devel
Cc: alex.williamson, clg, mst, jasowang, peterx, ddutile, jgg,
nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng, Yi Sun,
Paolo Bonzini, Richard Henderson, Eduardo Habkost,
Marcel Apfelbaum
Hi Zhenzhong,
On 6/20/25 9:18 AM, Zhenzhong Duan wrote:
> From: Yi Liu <yi.l.liu@intel.com>
>
> This traps the guest PASID-based iotlb invalidation request and propagate it
> to host.
>
> Intel VT-d 3.0 supports nested translation in PASID granular. Guest SVA support
s/granular/granularity
> could be implemented by configuring nested translation on specific PASID. This
> is also known as dual stage DMA translation.
>
> Under such configuration, guest owns the GVA->GPA translation which is
> configured as stage-1 page table in host side for a specific pasid, and host
> owns GPA->HPA translation. As guest owns stage-1 translation table, piotlb
> invalidation should be propagated to host since host IOMMU will cache first
> level page table related mappings during DMA address translation.
>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
> hw/i386/intel_iommu_internal.h | 6 ++
> hw/i386/intel_iommu.c | 113 ++++++++++++++++++++++++++++++++-
> 2 files changed, 117 insertions(+), 2 deletions(-)
>
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> index b3e4aa23f1..07bfb97499 100644
> --- a/hw/i386/intel_iommu_internal.h
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -589,6 +589,12 @@ typedef struct VTDPASIDCacheInfo {
> bool error_happened;
> } VTDPASIDCacheInfo;
>
> +typedef struct VTDPIOTLBInvInfo {
> + uint16_t domain_id;
> + uint32_t pasid;
> + struct iommu_hwpt_vtd_s1_invalidate *inv_data;
> +} VTDPIOTLBInvInfo;
> +
> /* PASID Table Related Definitions */
> #define VTD_PASID_DIR_BASE_ADDR_MASK (~0xfffULL)
> #define VTD_PASID_TABLE_BASE_ADDR_MASK (~0xfffULL)
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 621b07aa02..d1fa395274 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -2639,12 +2639,105 @@ static int vtd_bind_guest_pasid(VTDAddressSpace *vtd_as,
>
> return ret;
> }
> +
> +/*
> + * Caller of this function should hold iommu_lock.
> + */
> +static void vtd_invalidate_piotlb(VTDAddressSpace *vtd_as,
> + struct iommu_hwpt_vtd_s1_invalidate *cache)
> +{
> + VTDHostIOMMUDevice *vtd_hiod;
> + HostIOMMUDeviceIOMMUFD *idev;
> + int devfn = vtd_as->devfn;
> + struct vtd_as_key key = {
> + .bus = vtd_as->bus,
> + .devfn = devfn,
> + };
> + IntelIOMMUState *s = vtd_as->iommu_state;
> + uint32_t entry_num = 1; /* Only implement one request for simplicity */
> + Error *err;
> +
> + vtd_hiod = g_hash_table_lookup(s->vtd_host_iommu_dev, &key);
> + if (!vtd_hiod || !vtd_hiod->hiod) {
> + return;
> + }
> + idev = HOST_IOMMU_DEVICE_IOMMUFD(vtd_hiod->hiod);
> +
> + if (!iommufd_backend_invalidate_cache(idev->iommufd, vtd_hiod->s1_hwpt,
> + IOMMU_HWPT_INVALIDATE_DATA_VTD_S1,
> + sizeof(*cache), &entry_num, cache,
> + &err)) {
> + error_report_err(err);
> + }
> +}
> +
> +/*
> + * This function is a loop function for the s->vtd_address_spaces
> + * list with VTDPIOTLBInvInfo as execution filter. It propagates
> + * the piotlb invalidation to host. Caller of this function
> + * should hold iommu_lock.
Instead of repeating this mention everywhere, it may be more efficient to
suffix each such function with _locked; I don't know whether a checker
exists for that convention.
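For illustration, here is a minimal standalone sketch (not QEMU code; all names are made up) of the _locked naming convention backed by a runtime assertion, which can stand in for a static checker:

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/*
 * Hypothetical sketch: encode the locking requirement in the function
 * name instead of a comment, and assert it at runtime in debug builds.
 */
typedef struct DemoState {
    pthread_mutex_t lock;
    bool lock_held;     /* debug aid, only updated under the lock */
    int counter;
} DemoState;

static void demo_lock(DemoState *s)
{
    pthread_mutex_lock(&s->lock);
    s->lock_held = true;
}

static void demo_unlock(DemoState *s)
{
    s->lock_held = false;
    pthread_mutex_unlock(&s->lock);
}

/* Convention: a _locked suffix means the caller must hold s->lock. */
static void demo_counter_inc_locked(DemoState *s)
{
    assert(s->lock_held);
    s->counter++;
}
```

The assertion catches misuse at runtime even without a dedicated checker.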
> + */
> +static void vtd_flush_pasid_iotlb(gpointer key, gpointer value,
> + gpointer user_data)
> +{
> + VTDPIOTLBInvInfo *piotlb_info = user_data;
> + VTDAddressSpace *vtd_as = value;
> + VTDPASIDCacheEntry *pc_entry = &vtd_as->pasid_cache_entry;
> + uint32_t pasid;
> + uint16_t did;
> +
> + /* Replay only fill pasid entry cache for passthrough device */
fills
> + if (!pc_entry->cache_filled ||
> + !vtd_pe_pgtt_is_flt(&pc_entry->pasid_entry)) {
> + return;
> + }
> +
> + if (vtd_as_to_iommu_pasid_locked(vtd_as, &pasid)) {
> + return;
> + }
> +
> + did = vtd_pe_get_did(&pc_entry->pasid_entry);
> +
> + if (piotlb_info->domain_id == did && piotlb_info->pasid == pasid) {
> + vtd_invalidate_piotlb(vtd_as, piotlb_info->inv_data);
> + }
> +}
> +
> +static void vtd_flush_pasid_iotlb_all(IntelIOMMUState *s,
> + uint16_t domain_id, uint32_t pasid,
> + hwaddr addr, uint64_t npages, bool ih)
> +{
> + struct iommu_hwpt_vtd_s1_invalidate cache_info = { 0 };
> + VTDPIOTLBInvInfo piotlb_info;
> +
> + cache_info.addr = addr;
> + cache_info.npages = npages;
> + cache_info.flags = ih ? IOMMU_VTD_INV_FLAGS_LEAF : 0;
> +
> + piotlb_info.domain_id = domain_id;
> + piotlb_info.pasid = pasid;
> + piotlb_info.inv_data = &cache_info;
> +
> + /*
> + * Here loops all the vtd_as instances in s->vtd_address_spaces
I am not a native English speaker, but I am not sure "loop something" is
correct. "Go through each"?
Besides that, the comment is not that useful as it paraphrases the code.
> + * to find out the affected devices since piotlb invalidation
> + * should check pasid cache per architecture point of view.
> + */
> + g_hash_table_foreach(s->vtd_address_spaces,
> + vtd_flush_pasid_iotlb, &piotlb_info);
> +}
> #else
> static int vtd_bind_guest_pasid(VTDAddressSpace *vtd_as,
> VTDPASIDEntry *pe, VTDPASIDOp op)
> {
> return 0;
> }
> +
> +static void vtd_flush_pasid_iotlb_all(IntelIOMMUState *s,
> + uint16_t domain_id, uint32_t pasid,
> + hwaddr addr, uint64_t npages, bool ih)
> +{
> +}
> #endif
>
> /* Do a context-cache device-selective invalidation.
> @@ -3300,6 +3393,13 @@ static void vtd_piotlb_pasid_invalidate(IntelIOMMUState *s,
> info.pasid = pasid;
>
> vtd_iommu_lock(s);
> + /*
> + * Here loops all the vtd_as instances in s->vtd_as
I would drop the above comment.
> + * to find out the affected devices since piotlb invalidation
Find out ...
> + * should check pasid cache per architecture point of view.
> + */
> + vtd_flush_pasid_iotlb_all(s, domain_id, pasid, 0, (uint64_t)-1, 0);
> +
> g_hash_table_foreach_remove(s->iotlb, vtd_hash_remove_by_pasid,
> &info);
> vtd_iommu_unlock(s);
> @@ -3323,7 +3423,8 @@ static void vtd_piotlb_pasid_invalidate(IntelIOMMUState *s,
> }
>
> static void vtd_piotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
> - uint32_t pasid, hwaddr addr, uint8_t am)
> + uint32_t pasid, hwaddr addr, uint8_t am,
> + bool ih)
> {
> VTDIOTLBPageInvInfo info;
>
> @@ -3333,6 +3434,13 @@ static void vtd_piotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
> info.mask = ~((1 << am) - 1);
>
> vtd_iommu_lock(s);
> + /*
> + * Here loops all the vtd_as instances in s->vtd_as
> + * to find out the affected devices since piotlb invalidation
> + * should check pasid cache per architecture point of view.
> + */
> + vtd_flush_pasid_iotlb_all(s, domain_id, pasid, addr, 1 << am, ih);
> +
> g_hash_table_foreach_remove(s->iotlb,
> vtd_hash_remove_by_page_piotlb, &info);
> vtd_iommu_unlock(s);
> @@ -3366,7 +3474,8 @@ static bool vtd_process_piotlb_desc(IntelIOMMUState *s,
> case VTD_INV_DESC_PIOTLB_PSI_IN_PASID:
> am = VTD_INV_DESC_PIOTLB_AM(inv_desc->val[1]);
> addr = (hwaddr) VTD_INV_DESC_PIOTLB_ADDR(inv_desc->val[1]);
> - vtd_piotlb_page_invalidate(s, domain_id, pasid, addr, am);
> + vtd_piotlb_page_invalidate(s, domain_id, pasid, addr, am,
> + VTD_INV_DESC_PIOTLB_IH(inv_desc->val[1]));
> break;
>
> default:
Eric
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH v2 17/19] intel_iommu: Refresh pasid bind when either SRTP or TE bit is changed
2025-06-20 7:18 ` [PATCH v2 17/19] intel_iommu: Refresh pasid bind when either SRTP or TE bit is changed Zhenzhong Duan
@ 2025-06-23 13:48 ` Eric Auger
2025-06-26 10:16 ` Duan, Zhenzhong
0 siblings, 1 reply; 58+ messages in thread
From: Eric Auger @ 2025-06-23 13:48 UTC (permalink / raw)
To: Zhenzhong Duan, qemu-devel
Cc: alex.williamson, clg, mst, jasowang, peterx, ddutile, jgg,
nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Paolo Bonzini, Richard Henderson, Eduardo Habkost,
Marcel Apfelbaum
Hi Zhenzhong,
On 6/20/25 9:18 AM, Zhenzhong Duan wrote:
> From: Yi Liu <yi.l.liu@intel.com>
>
> When either 'Set Root Table Pointer' or 'Translation Enable' bit is changed,
> the pasid bindings on host side become stale and need to be updated.
>
> Introduce a helper function vtd_refresh_pasid_bind() for that purpose.
nit: I would avoid introducing yet another term, i.e. "refresh". If it
is a replay, let's keep "replay" and precisely explain what the replay does.
vtd_replay_pasid_bindings?
>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
> hw/i386/intel_iommu.c | 23 +++++++++++++++++++++++
> 1 file changed, 23 insertions(+)
>
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index d1fa395274..0b322078cc 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -90,6 +90,7 @@ struct vtd_iotlb_key {
>
> static void vtd_address_space_refresh_all(IntelIOMMUState *s);
> static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
> +static void vtd_refresh_pasid_bind(IntelIOMMUState *s);
>
> static void vtd_pasid_cache_reset_locked(IntelIOMMUState *s);
> static void vtd_pasid_cache_sync(IntelIOMMUState *s,
> @@ -3066,6 +3067,7 @@ static void vtd_handle_gcmd_srtp(IntelIOMMUState *s)
> vtd_set_clear_mask_long(s, DMAR_GSTS_REG, 0, VTD_GSTS_RTPS);
> vtd_reset_caches(s);
> vtd_address_space_refresh_all(s);
> + vtd_refresh_pasid_bind(s);
> }
>
> /* Set Interrupt Remap Table Pointer */
> @@ -3100,6 +3102,7 @@ static void vtd_handle_gcmd_te(IntelIOMMUState *s, bool en)
>
> vtd_reset_caches(s);
> vtd_address_space_refresh_all(s);
> + vtd_refresh_pasid_bind(s);
> }
>
> /* Handle Interrupt Remap Enable/Disable */
> @@ -3813,6 +3816,26 @@ static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
> }
> }
>
> +static void vtd_refresh_pasid_bind(IntelIOMMUState *s)
> +{
> + VTDPASIDCacheInfo pc_info = { .error_happened = false,
> + .type = VTD_PASID_CACHE_GLOBAL_INV };
> +
> + /*
> + * Only when dmar is enabled, should pasid bindings replayed,
> + * otherwise no need to replay.
> + */
I am not sure the above comment is necessary.
> + if (!s->dmar_enabled) {
> + return;
> + }
> +
> + if (!s->flts || !s->root_scalable) {
> + return;
> + }
> +
> + vtd_replay_guest_pasid_bindings(s, &pc_info);
> +}
> +
> /*
> * This function syncs the pasid bindings between guest and host.
> * It includes updating the pasid cache in vIOMMU and updating the
Thanks
Eric
^ permalink raw reply [flat|nested] 58+ messages in thread
* RE: [PATCH v2 09/19] intel_iommu: Introduce two helpers vtd_as_from/to_iommu_pasid_locked
2025-06-20 12:46 ` Eric Auger
@ 2025-06-24 2:48 ` Duan, Zhenzhong
2025-07-07 3:12 ` Duan, Zhenzhong
0 siblings, 1 reply; 58+ messages in thread
From: Duan, Zhenzhong @ 2025-06-24 2:48 UTC (permalink / raw)
To: eric.auger@redhat.com, qemu-devel@nongnu.org
Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
jgg@nvidia.com, nicolinc@nvidia.com,
shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
Peng, Chao P, Paolo Bonzini, Richard Henderson, Eduardo Habkost,
Marcel Apfelbaum
>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: Re: [PATCH v2 09/19] intel_iommu: Introduce two helpers
>vtd_as_from/to_iommu_pasid_locked
>
>Hi Zhenzhong,
>
>On 6/20/25 9:18 AM, Zhenzhong Duan wrote:
>> PCI device supports two request types, Requests-without-PASID and
>> Requests-with-PASID. Requests-without-PASID doesn't include a PASID TLP
>> prefix, IOMMU fetches rid_pasid from context entry and use it as IOMMU's
>> pasid to index pasid table.
>>
>> So we need to translate between PCI's pasid and IOMMU's pasid specially
>> for Requests-without-PASID, e.g., PCI_NO_PASID(-1) <-> rid_pasid.
>> For Requests-with-PASID, PCI's pasid and IOMMU's pasid are same value.
>>
>> vtd_as_from_iommu_pasid_locked() translates from BDF+iommu_pasid to
>vtd_as
>> which contains PCI's pasid vtd_as->pasid.
>>
>> vtd_as_to_iommu_pasid_locked() translates from BDF+vtd_as->pasid to
>iommu_pasid.
>>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>> hw/i386/intel_iommu.c | 58
>+++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 58 insertions(+)
>>
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index 9d4adc9458..8948b8370f 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -1602,6 +1602,64 @@ static int
>vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
>> return 0;
>> }
>>
>> +static inline int vtd_as_to_iommu_pasid_locked(VTDAddressSpace *vtd_as,
>> + uint32_t *pasid)
>Is it meaningful to use inline here and below? Below I guess you do so
>to avoid a "defined but not used" compilation error, but I don't think it
>should stay as is.
Yes, that's the only reason I defined them both inline.
Do you have any other suggestion to avoid the compilation error without using inline?
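For what it's worth, one common alternative (an assumption on my side that it fits here) is to mark the temporarily-uncalled static helper with an unused attribute rather than declaring it inline; GLib's G_GNUC_UNUSED expands to this attribute on GCC/Clang. A minimal sketch outside QEMU, with a made-up helper name:

```c
/*
 * Sketch: silence -Wunused-function for a static helper that only
 * gains a caller in a later patch, without declaring it inline.
 * GLib's G_GNUC_UNUSED wraps the same attribute.
 */
#define DEMO_UNUSED __attribute__((unused))

static DEMO_UNUSED int demo_helper(int x)
{
    return x + 1;
}
```

The attribute can then simply be dropped in the patch that adds the first caller.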
>
>I don't really understand the iommu_pasid terminology. Either it is a
>pasid passed through the PCI transaction or it is the default pasid
>found in the rid2pasid ce field. So that's a pasid both ways ;-) Can't you
>simply call it pasid?
Yes, on the PCI side we call it just pasid; on the other side I name it
iommu pasid to distinguish the two.
Does that work for you?
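To make the two namings concrete, here is a minimal sketch of the translation for Requests-without-PASID (names are illustrative; the PCI_NO_PASID <-> rid_pasid mapping is per the commit message):

```c
#include <stdint.h>

#define PCI_NO_PASID ((uint32_t)-1)

/*
 * Sketch of the PCI pasid -> IOMMU pasid translation described above:
 * Requests-without-PASID carry no PASID TLP prefix, so the IOMMU
 * substitutes the context entry's rid_pasid; for Requests-with-PASID
 * the two values are identical.
 */
static uint32_t iommu_pasid_from_pci(uint32_t pci_pasid, uint32_t rid_pasid)
{
    return (pci_pasid == PCI_NO_PASID) ? rid_pasid : pci_pasid;
}
```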
>> +{
>> + VTDContextCacheEntry *cc_entry = &vtd_as->context_cache_entry;
>> + IntelIOMMUState *s = vtd_as->iommu_state;
>> + uint8_t bus_num = pci_bus_num(vtd_as->bus);
>> + uint8_t devfn = vtd_as->devfn;
>> + VTDContextEntry ce;
>> + int ret;
>> +
>> + if (cc_entry->context_cache_gen == s->context_cache_gen) {
>> + ce = cc_entry->context_entry;
>> + } else {
>> + ret = vtd_dev_to_context_entry(s, bus_num, devfn, &ce);
>> + if (ret) {
>> + return ret;
>> + }
>> + }
>if the above pattern is used at many locations I still think it may be
>valuable to have a _locked helper.
I don't get it; both vtd_as_to_iommu_pasid_locked() and vtd_as_from_iommu_pasid_locked()
are already _locked helpers, aren't they?
Or do you mean adding a comment saying "Caller of this function should hold iommu_lock."?
>> +
>> + /* Translate to iommu pasid if PCI_NO_PASID */
>> + if (vtd_as->pasid == PCI_NO_PASID) {
>> + *pasid = VTD_CE_GET_RID2PASID(&ce);
>> + } else {
>> + *pasid = vtd_as->pasid;
>> + }
>> +
>> + return 0;
>> +}
>> +
>> +static gboolean vtd_find_as_by_sid_and_iommu_pasid(gpointer key, gpointer
>value,
>> + gpointer user_data)
>> +{
>> + VTDAddressSpace *vtd_as = (VTDAddressSpace *)value;
>> + struct vtd_as_raw_key *target = (struct vtd_as_raw_key *)user_data;
>why target? can't you name it key instead?
There is already a parameter named key, maybe target_key?
Thanks
Zhenzhong
>> + uint16_t sid = PCI_BUILD_BDF(pci_bus_num(vtd_as->bus), vtd_as->devfn);
>> + uint32_t pasid;
>> +
>> + if (vtd_as_to_iommu_pasid_locked(vtd_as, &pasid)) {
>> + return false;
>> + }
>> +
>> + return (pasid == target->pasid) && (sid == target->sid);
>> +}
>> +
>> +/* Translate iommu pasid to vtd_as */
>same here
>> +static inline
>> +VTDAddressSpace *vtd_as_from_iommu_pasid_locked(IntelIOMMUState *s,
>> + uint16_t sid, uint32_t pasid)
>> +{
>> + struct vtd_as_raw_key key = {
>> + .sid = sid,
>> + .pasid = pasid
>> + };
>> +
>> + return g_hash_table_find(s->vtd_address_spaces,
>> + vtd_find_as_by_sid_and_iommu_pasid, &key);
>> +}
>> +
>> static int vtd_sync_shadow_page_hook(const IOMMUTLBEvent *event,
>> void *private)
>> {
>Thanks
>
>Eric
* RE: [PATCH v2 10/19] intel_iommu: Handle PASID entry removing and updating
2025-06-20 15:44 ` Eric Auger
@ 2025-06-24 3:34 ` Duan, Zhenzhong
0 siblings, 0 replies; 58+ messages in thread
From: Duan, Zhenzhong @ 2025-06-24 3:34 UTC (permalink / raw)
To: eric.auger@redhat.com, qemu-devel@nongnu.org
Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
jgg@nvidia.com, nicolinc@nvidia.com,
shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
Peng, Chao P, Yi Sun, Paolo Bonzini, Richard Henderson,
Eduardo Habkost, Marcel Apfelbaum
Hi Eric,
>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: Re: [PATCH v2 10/19] intel_iommu: Handle PASID entry removing and
>updating
>
>Hi Zhenzhong,
>
>On 6/20/25 9:18 AM, Zhenzhong Duan wrote:
>I would suggest: Handle PASID entry removal and update, instead of the gerund forms.
>> This adds an new entry VTDPASIDCacheEntry in VTDAddressSpace to cache the
>> pasid entry and track PASID usage and future PASID tagged DMA address
>> translation support in vIOMMU.
>>
>> VTDAddressSpace of PCI_NO_PASID is allocated when device is plugged and
>> never freed. For other pasid, VTDAddressSpace instance is created/destroyed
>> per the guest pasid entry set up/destroy for passthrough devices. While for
>> emulated devices, VTDAddressSpace instance is created in the PASID tagged
>DMA
>> translation and be destroyed per guest PASID cache invalidation. This focuses
>s/be detroyed/destroyed
>> on the PASID cache management for passthrough devices as there is no PASID
>> capable emulated devices yet.
>if you don't handle the emulated device case, maybe omit talking about
>them here.
OK, will drop the sentence starting from "While for emulated devices..."
>>
>> When guest modifies a PASID entry, QEMU will capture the guest pasid
>selective
>> pasid cache invalidation, allocate or remove a VTDAddressSpace instance per
>the
>> invalidation reasons:
>>
>> a) a present pasid entry moved to non-present
>> b) a present pasid entry to be a present entry
>> c) a non-present pasid entry moved to present
>>
>> This handles a) and b), following patch will handle c).
>This -> This patch
Will do.
>>
>> vIOMMU emulator could figure out the reason by fetching latest guest pasid
>entry
>> and compare it with the PASID cache.
>>
>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>> hw/i386/intel_iommu_internal.h | 27 ++++
>> include/hw/i386/intel_iommu.h | 6 +
>> hw/i386/intel_iommu.c | 265 +++++++++++++++++++++++++++++++--
>> hw/i386/trace-events | 3 +
>> 4 files changed, 291 insertions(+), 10 deletions(-)
>>
>> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
>> index 18bc22fc72..01c881ed4d 100644
>> --- a/hw/i386/intel_iommu_internal.h
>> +++ b/hw/i386/intel_iommu_internal.h
>> @@ -315,6 +315,7 @@ typedef enum VTDFaultReason {
>> * request while disabled */
>> VTD_FR_IR_SID_ERR = 0x26, /* Invalid Source-ID */
>>
>> + VTD_FR_RTADDR_INV_TTM = 0x31, /* Invalid TTM in RTADDR */
>> /* PASID directory entry access failure */
>> VTD_FR_PASID_DIR_ACCESS_ERR = 0x50,
>> /* The Present(P) field of pasid directory entry is 0 */
>> @@ -492,6 +493,15 @@ typedef union VTDInvDesc VTDInvDesc;
>> #define VTD_INV_DESC_PIOTLB_RSVD_VAL0 0xfff000000000f1c0ULL
>> #define VTD_INV_DESC_PIOTLB_RSVD_VAL1 0xf80ULL
>>
>Adding some basic comments for the uncultured reader like me would help
>make the link with the VT-d spec:
>PASID-cache Invalidate Descriptor (pc_inv_dsc) fields
Will do.
>> +#define VTD_INV_DESC_PASIDC_G (3ULL << 4)
>> +#define VTD_INV_DESC_PASIDC_PASID(val) (((val) >> 32) & 0xfffffULL)
>> +#define VTD_INV_DESC_PASIDC_DID(val) (((val) >> 16) &
>VTD_DOMAIN_ID_MASK)
>> +#define VTD_INV_DESC_PASIDC_RSVD_VAL0 0xfff000000000f1c0ULL
>> +
>> +#define VTD_INV_DESC_PASIDC_DSI (0ULL << 4)
>> +#define VTD_INV_DESC_PASIDC_PASID_SI (1ULL << 4)
>> +#define VTD_INV_DESC_PASIDC_GLOBAL (3ULL << 4)
>As those are values for the granularity field, using
>VTD_INV_DESC_PASIDC_G_* looks more relevant to me.
Will do.
>
>
>I think you would gain in readability by adopting the extract32/64 syntax
>used in hw/arm/smmuv3-internal.h; it looks more readable to me.
This is different from the existing style in intel_iommu_internal.h,
but I do agree with your suggestion; let me try it with this series.
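To make the suggestion concrete, a sketch using a minimal stand-in for extract64() (the real one lives in include/qemu/bitops.h); the field position is taken from the PASID macro in this patch:

```c
#include <stdint.h>
#include <assert.h>

/*
 * Minimal stand-in for QEMU's extract64() from include/qemu/bitops.h:
 * return the 'length'-bit field of 'value' starting at bit 'start'.
 */
static uint64_t extract64(uint64_t value, int start, int length)
{
    assert(start >= 0 && length > 0 && length <= 64 - start);
    return (value >> start) & (~0ULL >> (64 - length));
}

/* Shift-and-mask style used by the patch: */
#define VTD_INV_DESC_PASIDC_PASID(val)   (((val) >> 32) & 0xfffffULL)

/*
 * Equivalent extract64() style, as in hw/arm/smmuv3-internal.h:
 * the PASID field is the 20-bit field starting at bit 32.
 */
static uint32_t vtd_pasidc_pasid(uint64_t val)
{
    return extract64(val, 32, 20);
}
```

Both forms return the same value; the extract64() form makes the bit offset and width explicit.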
>> +
>> /* Information about page-selective IOTLB invalidate */
>> struct VTDIOTLBPageInvInfo {
>> uint16_t domain_id;
>> @@ -552,6 +562,22 @@ typedef struct VTDRootEntry VTDRootEntry;
>> #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw) (0x1e0ULL |
>~VTD_HAW_MASK(aw))
>> #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1 0xffffffffffe00000ULL
>>
>> +typedef enum VTDPCInvType {
>> + /* pasid cache invalidation rely on guest PASID entry */
>> + VTD_PASID_CACHE_GLOBAL_INV, /* pasid cache global invalidation */
>> + VTD_PASID_CACHE_DOMSI, /* pasid cache domain selective invalidation
>*/
>> + VTD_PASID_CACHE_PASIDSI, /* pasid cache pasid selective invalidation */
>> +} VTDPCInvType;
>> +
>> +typedef struct VTDPASIDCacheInfo {
>> + VTDPCInvType type;
>> + uint16_t domain_id;
>did?
OK
>> + uint32_t pasid;
>> + PCIBus *bus;
>> + uint16_t devfn;
>> + bool error_happened;
>wouldn't it make more sense to store an Error * directly so that you can
>fill it whenever an error occurs?
>Something like VFIOContainerBase::error
Will do
>> +} VTDPASIDCacheInfo;
>> +
>> /* PASID Table Related Definitions */
>> #define VTD_PASID_DIR_BASE_ADDR_MASK (~0xfffULL)
>> #define VTD_PASID_TABLE_BASE_ADDR_MASK (~0xfffULL)
>> @@ -563,6 +589,7 @@ typedef struct VTDRootEntry VTDRootEntry;
>> #define VTD_PASID_TABLE_BITS_MASK (0x3fULL)
>> #define VTD_PASID_TABLE_INDEX(pasid) ((pasid) &
>VTD_PASID_TABLE_BITS_MASK)
>> #define VTD_PASID_ENTRY_FPD (1ULL << 1) /* Fault Processing Disable
>*/
>> +#define VTD_PASID_TBL_ENTRY_NUM (1ULL << 6)
>>
>> /* PASID Granular Translation Type Mask */
>> #define VTD_PASID_ENTRY_P 1ULL
>> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
>> index 50f9b27a45..fbc9da903a 100644
>> --- a/include/hw/i386/intel_iommu.h
>> +++ b/include/hw/i386/intel_iommu.h
>> @@ -95,6 +95,11 @@ struct VTDPASIDEntry {
>> uint64_t val[8];
>> };
>>
>> +typedef struct VTDPASIDCacheEntry {
>> + struct VTDPASIDEntry pasid_entry;
>> + bool cache_filled;
>> +} VTDPASIDCacheEntry;
>> +
>> struct VTDAddressSpace {
>> PCIBus *bus;
>> uint8_t devfn;
>> @@ -107,6 +112,7 @@ struct VTDAddressSpace {
>> MemoryRegion iommu_ir_fault; /* Interrupt region for catching fault */
>> IntelIOMMUState *iommu_state;
>> VTDContextCacheEntry context_cache_entry;
>> + VTDPASIDCacheEntry pasid_cache_entry;
>> QLIST_ENTRY(VTDAddressSpace) next;
>> /* Superset of notifier flags that this address space has */
>> IOMMUNotifierFlag notifier_flags;
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index 8948b8370f..1db581d14a 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -826,11 +826,24 @@ static inline bool
>vtd_pe_type_check(IntelIOMMUState *s, VTDPASIDEntry *pe)
>> }
>> }
>>
>> +static inline uint16_t vtd_pe_get_did(VTDPASIDEntry *pe)
>> +{
>> + return VTD_SM_PASID_ENTRY_DID((pe)->val[1]);
>see my previous comment about extract*
>> +}
>> +
>> static inline bool vtd_pdire_present(VTDPASIDDirEntry *pdire)
>> {
>> return pdire->val & 1;
>> }
>>
>> +static inline void pasid_cache_info_set_error(VTDPASIDCacheInfo *pc_info)
>not sure whether this helper is really needed
>> +{
>> + if (pc_info->error_happened) {
>Besides, why not simply overwrite it unconditionally?
I will follow the VFIO code: save the first Error and drop the following ones.
>> + return;
>> + }
>> + pc_info->error_happened = true;
>> +}
>> +
>> /**
>> * Caller of this function should check present bit if wants
>> * to use pdir entry for further usage except for fpd bit check.
>> @@ -3103,6 +3116,241 @@ static bool
>vtd_process_piotlb_desc(IntelIOMMUState *s,
>> return true;
>> }
>>
>> +static inline int vtd_dev_get_pe_from_pasid(VTDAddressSpace *vtd_as,
>> + uint32_t pasid, VTDPASIDEntry *pe)
>> +{
>> + IntelIOMMUState *s = vtd_as->iommu_state;
>> + VTDContextEntry ce;
>> + int ret;
>> +
>> + if (!s->root_scalable) {
>> + return -VTD_FR_RTADDR_INV_TTM;
>> + }
>> +
>> + ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus), vtd_as-
>>devfn,
>> + &ce);
>> + if (ret) {
>> + return ret;
>> + }
>> +
>> + return vtd_ce_get_pasid_entry(s, &ce, pe, pasid);
>> +}
>> +
>> +static bool vtd_pasid_entry_compare(VTDPASIDEntry *p1, VTDPASIDEntry *p2)
>> +{
>> + return !memcmp(p1, p2, sizeof(*p1));
>> +}
>> +
>> +/*
>> + * This function fills in the pasid entry in &vtd_as. Caller
>> + * of this function should hold iommu_lock.
>> + */
>> +static int vtd_fill_pe_in_cache(IntelIOMMUState *s, VTDAddressSpace *vtd_as,
>> + VTDPASIDEntry *pe)
>> +{
>> + VTDPASIDCacheEntry *pc_entry = &vtd_as->pasid_cache_entry;
>> +
>> + if (vtd_pasid_entry_compare(pe, &pc_entry->pasid_entry)) {
>> + /* No need to go further as cached pasid entry is latest */
>> + return 0;
>> + }
>> +
>> + pc_entry->pasid_entry = *pe;
>> + pc_entry->cache_filled = true;
>> +
>> + /*
>> + * TODO: send pasid bind to host for passthru devices
>> + */
>> +
>> + return 0;
>> +}
>> +
>> +/*
>> + * This function is used to update or clear cached pasid entry in vtd_as
>> + * instances. Caller of this function should hold iommu_lock.
>> + */
>> +static gboolean vtd_flush_pasid(gpointer key, gpointer value,
>> + gpointer user_data)
>> +{
>> + VTDPASIDCacheInfo *pc_info = user_data;
>> + VTDAddressSpace *vtd_as = value;
>> + IntelIOMMUState *s = vtd_as->iommu_state;
>> + VTDPASIDCacheEntry *pc_entry = &vtd_as->pasid_cache_entry;
>> + VTDPASIDEntry pe;
>> + uint16_t did;
>> + uint32_t pasid;
>> + int ret;
>> +
>> + if (!pc_entry->cache_filled) {
>> + return false;
>> + }
>> + did = vtd_pe_get_did(&pc_entry->pasid_entry);
>> +
>> + if (vtd_as_to_iommu_pasid_locked(vtd_as, &pasid)) {
>> + goto remove;
>> + }
>> +
>> + switch (pc_info->type) {
>> + case VTD_PASID_CACHE_PASIDSI:
>> + if (pc_info->pasid != pasid) {
>> + return false;
>> + }
>> + /* Fall through */
>> + case VTD_PASID_CACHE_DOMSI:
>> + if (pc_info->domain_id != did) {
>> + return false;
>> + }
>> + /* Fall through */
>> + case VTD_PASID_CACHE_GLOBAL_INV:
>> + break;
>> + default:
>> + error_report("invalid pc_info->type");
>> + abort();
>> + }
>> +
>> + /*
>> + * pasid cache invalidation may indicate a present pasid
>> + * entry to present pasid entry modification. To cover such
>> + * case, vIOMMU emulator needs to fetch latest guest pasid
>> + * entry and check cached pasid entry, then update pasid
>> + * cache and send pasid bind/unbind to host properly.
>if you don't do that in that patch I would put that in a subsequent
>patch. Focus on the PASID cache in this patch. See my subsequent comment
Sure.
>> + */
>> + ret = vtd_dev_get_pe_from_pasid(vtd_as, pasid, &pe);
>> + if (ret) {
>> + /*
>> + * No valid pasid entry in guest memory. e.g. pasid entry
>> + * was modified to be either all-zero or non-present. Either
>> + * case means existing pasid cache should be removed.
>> + */
>> + goto remove;
>> + }
>> +
>> + if (vtd_fill_pe_in_cache(s, vtd_as, &pe)) {
>> + pasid_cache_info_set_error(pc_info);
>> + }
>> + return false;
>> +
>> +remove:
>> + /*
>> + * TODO: send pasid unbind to host for passthru devices
>> + */
>> + pc_entry->cache_filled = false;
>> +
>> + /*
>> + * Don't remove address space of PCI_NO_PASID which is created by PCI
>> + * sub-system.
>> + */
>> + if (vtd_as->pasid == PCI_NO_PASID) {
>> + return false;
>> + }
>> + return true;
>> +}
>> +
>> +/*
>> + * This function syncs the pasid bindings between guest and host.
>> + * It includes updating the pasid cache in vIOMMU and updating the
>> + * pasid bindings per guest's latest pasid entry presence.
>> + */
>> +static void vtd_pasid_cache_sync(IntelIOMMUState *s,
>> + VTDPASIDCacheInfo *pc_info)
>> +{
>> + if (!s->flts || !s->root_scalable || !s->dmar_enabled) {
>you don't update error_happened. Is that OK?
This is the case when no pasid cache is used but the guest sends a pasid cache
invalidation request; it is not an error, so we just ignore the request.
>> + return;
>> + }
>> +
>> + /*
>> + * Regards to a pasid cache invalidation, e.g. a PSI.
>> + * it could be either cases of below:
>> + * a) a present pasid entry moved to non-present
>> + * b) a present pasid entry to be a present entry
>> + * c) a non-present pasid entry moved to present
>> + *
>> + * Different invalidation granularity may affect different device
>> + * scope and pasid scope. But for each invalidation granularity,
>> + * it needs to do two steps to sync host and guest pasid binding.
>> + *
>> + * Here is the handling of a PSI:
>> + * 1) loop all the existing vtd_as instances to update them
>> + * according to the latest guest pasid entry in pasid table.
>> + * this will make sure affected existing vtd_as instances
>> + * cached the latest pasid entries. Also, during the loop, the
>> + * host should be notified if needed. e.g. pasid unbind or pasid
>> + * update. Should be able to cover case a) and case b).
>> + *
>> + * 2) loop all devices to cover case c)
>> + * - For devices which are backed by HostIOMMUDeviceIOMMUFD
>instances,
>> + * we loop them and check if guest pasid entry exists. If yes,
>> + * it is case c), we update the pasid cache and also notify
>> + * host.
>> + * - For devices which are not backed by HostIOMMUDeviceIOMMUFD,
>> + * it is not necessary to create pasid cache at this phase since
>> + * it could be created when vIOMMU does DMA address translation.
>> + * This is not yet implemented since there is no emulated
>> + * pasid-capable devices today. If we have such devices in
>> + * future, the pasid cache shall be created there.
>> + * Other granularity follow the same steps, just with different scope
>I would put all the stuff related to interactions with host in a
>subsequent patch. This patch could concentrate on the IOMMU PASID cache
>only. And then you would add the extra complexity of syncs with the
>host. I think it would simplify the review.
Sure.
>> + *
>> + */
>> +
>> + vtd_iommu_lock(s);
>> + /*
>> + * Step 1: loop all the existing vtd_as instances for pasid unbind and
>> + * update.
>> + */
>> + g_hash_table_foreach_remove(s->vtd_address_spaces, vtd_flush_pasid,
>> + pc_info);
>> + vtd_iommu_unlock(s);
>> +
>> + /* TODO: Step 2: loop all the existing vtd_hiod instances for pasid bind. */
>> +}
>> +
>> +static bool vtd_process_pasid_desc(IntelIOMMUState *s,
>> + VTDInvDesc *inv_desc)
>> +{
>> + uint16_t domain_id;
>> + uint32_t pasid;
>> + VTDPASIDCacheInfo pc_info = {};
>> + uint64_t mask[4] = {VTD_INV_DESC_PASIDC_RSVD_VAL0,
>VTD_INV_DESC_ALL_ONE,
>> + VTD_INV_DESC_ALL_ONE, VTD_INV_DESC_ALL_ONE};
>> +
>> + if (!vtd_inv_desc_reserved_check(s, inv_desc, mask, true,
>> + __func__, "pasid cache inv")) {
>> + return false;
>> + }
>> +
>> + domain_id = VTD_INV_DESC_PASIDC_DID(inv_desc->val[0]);
>> + pasid = VTD_INV_DESC_PASIDC_PASID(inv_desc->val[0]);
>> +
>> + switch (inv_desc->val[0] & VTD_INV_DESC_PASIDC_G) {
>> + case VTD_INV_DESC_PASIDC_DSI:
>> + trace_vtd_pasid_cache_dsi(domain_id);
>> + pc_info.type = VTD_PASID_CACHE_DOMSI;
>> + pc_info.domain_id = domain_id;
>> + break;
>> +
>> + case VTD_INV_DESC_PASIDC_PASID_SI:
>> + /* PASID selective implies a DID selective */
>> + trace_vtd_pasid_cache_psi(domain_id, pasid);
>> + pc_info.type = VTD_PASID_CACHE_PASIDSI;
>> + pc_info.domain_id = domain_id;
>> + pc_info.pasid = pasid;
>> + break;
>> +
>> + case VTD_INV_DESC_PASIDC_GLOBAL:
>> + trace_vtd_pasid_cache_gsi();
>> + pc_info.type = VTD_PASID_CACHE_GLOBAL_INV;
>> + break;
>> +
>> + default:
>> + error_report_once("invalid-inv-granu-in-pc_inv_desc hi: 0x%" PRIx64
>Make it end-user understandable? "invalid granularity field in PASID-cache
>invalidate descriptor".
Sure.
Thanks
Zhenzhong
* RE: [PATCH v2 11/19] intel_iommu: Handle PASID entry adding
2025-06-23 11:47 ` Eric Auger
@ 2025-06-24 10:56 ` Duan, Zhenzhong
0 siblings, 0 replies; 58+ messages in thread
From: Duan, Zhenzhong @ 2025-06-24 10:56 UTC (permalink / raw)
To: eric.auger@redhat.com, qemu-devel@nongnu.org
Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
jgg@nvidia.com, nicolinc@nvidia.com,
shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
Peng, Chao P, Yi Sun, Paolo Bonzini, Richard Henderson,
Eduardo Habkost, Marcel Apfelbaum
Hi Eric,
>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
><marcel.apfelbaum@gmail.com>
>Subject: Re: [PATCH v2 11/19] intel_iommu: Handle PASID entry adding
>
>Hi Zhenzhong,
>
>On 6/20/25 9:18 AM, Zhenzhong Duan wrote:
>> When guest modifies a PASID entry, QEMU will capture the guest pasid
>selective
>> pasid cache invalidation, allocate or remove a VTDAddressSpace instance per
>the
>> invalidation reasons:
>>
>> a) a present pasid entry moved to non-present
>> b) a present pasid entry to be a present entry
>> c) a non-present pasid entry moved to present
>>
>> This handles c).
>
>As you use the replay terminology in the patch, please explain what it
>means in that case and how the patch achieves the above goal.
Will do.
>>
>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>> hw/i386/intel_iommu_internal.h | 1 +
>> hw/i386/intel_iommu.c | 169 ++++++++++++++++++++++++++++++++-
>> 2 files changed, 169 insertions(+), 1 deletion(-)
>>
>> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
>> index 01c881ed4d..025787b3b9 100644
>> --- a/hw/i386/intel_iommu_internal.h
>> +++ b/hw/i386/intel_iommu_internal.h
>> @@ -558,6 +558,7 @@ typedef struct VTDRootEntry VTDRootEntry;
>> #define VTD_CTX_ENTRY_LEGACY_SIZE 16
>> #define VTD_CTX_ENTRY_SCALABLE_SIZE 32
>>
>> +#define VTD_SM_CONTEXT_ENTRY_PDTS(val) (((val) >> 9) & 0x7)
>> #define VTD_SM_CONTEXT_ENTRY_RID2PASID_MASK 0xfffff
>> #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw) (0x1e0ULL |
>~VTD_HAW_MASK(aw))
>> #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1 0xffffffffffe00000ULL
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index 1db581d14a..f4273dc640 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -826,6 +826,11 @@ static inline bool
>vtd_pe_type_check(IntelIOMMUState *s, VTDPASIDEntry *pe)
>> }
>> }
>>
>> +static inline uint32_t vtd_sm_ce_get_pdt_entry_num(VTDContextEntry *ce)
>> +{
>> + return 1U << (VTD_SM_CONTEXT_ENTRY_PDTS(ce->val[0]) + 7);
>> +}
>> +
>> static inline uint16_t vtd_pe_get_did(VTDPASIDEntry *pe)
>> {
>> return VTD_SM_PASID_ENTRY_DID((pe)->val[1]);
>> @@ -3246,6 +3251,159 @@ remove:
>> return true;
>> }
>>
>> +static void vtd_sm_pasid_table_walk_one(IntelIOMMUState *s,
>> + dma_addr_t pt_base,
>> + int start,
>> + int end,
>> + VTDPASIDCacheInfo *info)
>> +{
>> + VTDPASIDEntry pe;
>> + int pasid = start;
>> + int pasid_next;
>> +
>> + while (pasid < end) {
>> + pasid_next = pasid + 1;
>> +
>> + if (!vtd_get_pe_in_pasid_leaf_table(s, pasid, pt_base, &pe)
>> + && vtd_pe_present(&pe)) {
>> + int bus_n = pci_bus_num(info->bus), devfn = info->devfn;
>> + uint16_t sid = PCI_BUILD_BDF(bus_n, devfn);
>> + VTDAddressSpace *vtd_as;
>> +
>> + vtd_iommu_lock(s);
>> + /*
>> + * When indexed by rid2pasid, vtd_as should have been created,
>> + * e.g., by PCI subsystem. For other iommu pasid, we need to
>> + * create vtd_as dynamically. The other iommu pasid is same as
>> + * PCI's pasid, so it's used as input of vtd_find_add_as().
>> + */
>> + vtd_as = vtd_as_from_iommu_pasid_locked(s, sid, pasid);
>> + vtd_iommu_unlock(s);
>> + if (!vtd_as) {
>> + vtd_as = vtd_find_add_as(s, info->bus, devfn, pasid);
>> + }
>> +
>> + if ((info->type == VTD_PASID_CACHE_DOMSI ||
>> + info->type == VTD_PASID_CACHE_PASIDSI) &&
>> + !(info->domain_id == vtd_pe_get_did(&pe))) {
>> + /*
>> + * VTD_PASID_CACHE_DOMSI and VTD_PASID_CACHE_PASIDSI
>> + * requires domain ID check. If domain Id check fail,
>> + * go to next pasid.
>> + */
>> + pasid = pasid_next;
>> + continue;
>> + }
>> + if (vtd_fill_pe_in_cache(s, vtd_as, &pe)) {
>> + pasid_cache_info_set_error(info);
>> + }
>> + }
>> + pasid = pasid_next;
>> + }
>> +}
>> +
>> +/*
>> + * Currently, VT-d scalable mode pasid table is a two level table,
>do you mean PASID dir + PASID table? in the positive I would use that
>terminology directly.
Yes, OK.
>> + * this function aims to loop a range of PASIDs in a given pasid
>aims at looping over a range of PASIDs in a given table?
>so what do you call a table here? is a a PASID directory ot a PASID table.
Yes, this is confusing, will be:
* In VT-d scalable mode translation, PASID dir + PASID table is used.
* This function aims at looping over a range of PASIDs in a given two
* level table to identify the pasid config in guest.
>> + * table to identify the pasid config in guest.
>> + */
>> +static void vtd_sm_pasid_table_walk(IntelIOMMUState *s,
>> + dma_addr_t pdt_base,
>> + int start,
>> + int end,
>> + VTDPASIDCacheInfo *info)
>> +{
>> + VTDPASIDDirEntry pdire;
>> + int pasid = start;
>> + int pasid_next;
>> + dma_addr_t pt_base;
>> +
>> + while (pasid < end) {
>> + pasid_next = ((end - pasid) > VTD_PASID_TBL_ENTRY_NUM) ?
>> + (pasid + VTD_PASID_TBL_ENTRY_NUM) : end;
>> + if (!vtd_get_pdire_from_pdir_table(pdt_base, pasid, &pdire)
>> + && vtd_pdire_present(&pdire)) {
>> + pt_base = pdire.val & VTD_PASID_TABLE_BASE_ADDR_MASK;
>> + vtd_sm_pasid_table_walk_one(s, pt_base, pasid, pasid_next, info);
>> + }
>> + pasid = pasid_next;
>> + }
>> +}
>> +
>> +static void vtd_replay_pasid_bind_for_dev(IntelIOMMUState *s,
>> + int start, int end,
>> + VTDPASIDCacheInfo *info)
>> +{
>> + VTDContextEntry ce;
>> +
>> + if (!vtd_dev_to_context_entry(s, pci_bus_num(info->bus), info->devfn,
>> + &ce)) {
>> + uint32_t max_pasid;
>> +
>> + max_pasid = vtd_sm_ce_get_pdt_entry_num(&ce) *
>VTD_PASID_TBL_ENTRY_NUM;
>> + if (end > max_pasid) {
>> + end = max_pasid;
>> + }
>> + vtd_sm_pasid_table_walk(s,
>> + VTD_CE_GET_PASID_DIR_TABLE(&ce),
>> + start,
>> + end,
>> + info);
>> + }
>> +}
>> +
>> +/*
>> + * This function replay the guest pasid bindings to hosts by
>replays, host
OK
>> + * walking the guest PASID table. This ensures host will have
>> + * latest guest pasid bindings.
>> + */
>> +static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
>> + VTDPASIDCacheInfo *pc_info)
>> +{
>> + VTDHostIOMMUDevice *vtd_hiod;
>> + int start = 0, end = 1; /* only rid2pasid is supported */
>> + VTDPASIDCacheInfo walk_info;
>> + GHashTableIter as_it;
>> +
>> + switch (pc_info->type) {
>> + case VTD_PASID_CACHE_PASIDSI:
>> + start = pc_info->pasid;
>> + end = pc_info->pasid + 1;
>> + /*
>> + * PASID selective invalidation is within domain,
>> + * thus fall through.
>this is still not totally clean to me. For domsi I would have expected a
>different setting of start and end?
For DOMSI, it's initialized above, "start = 0, end = 1".
We only support rid2pasid in this series, so end=1.
>> + */
>> + case VTD_PASID_CACHE_DOMSI:
>fall though clause here and above?
Sure
>> + case VTD_PASID_CACHE_GLOBAL_INV:
>> + /* loop all assigned devices */
>> + break;
>> + default:
>> + error_report("invalid pc_info->type for replay");
>> + abort();
>> + }
>> +
>> + /*
>> + * In this replay, only needs to care about the devices which
>one only needs to care
OK
>> + * are backed by host IOMMU. For such devices, their vtd_hiod
>Those devices have a corresponding vtd_hiod in s->vtd_host_iommu_dev
OK
>> + * instances are in the s->vtd_host_iommu_dev. For devices which
>> + * are not backed by host IOMMU, it is not necessary to replay
>> + * the bindings since their cache could be re-created in the future
>> + * DMA address translation. Access to vtd_host_iommu_dev is already
>> + * protected by BQL, so no iommu lock needed here.
>> + */
>> + walk_info = *pc_info;
>> + g_hash_table_iter_init(&as_it, s->vtd_host_iommu_dev);
>> + while (g_hash_table_iter_next(&as_it, NULL, (void **)&vtd_hiod)) {
>> + /* bus|devfn fields are not identical with pc_info */
>Can you clarify?
I mean the bus|devfn fields in pc_info are not used here; I'll drop this line as it's unrelated and confusing.
>> + walk_info.bus = vtd_hiod->bus;
>> + walk_info.devfn = vtd_hiod->devfn;
>> + vtd_replay_pasid_bind_for_dev(s, start, end, &walk_info);
>> + }
>> + if (walk_info.error_happened) {
>> + pasid_cache_info_set_error(pc_info);
>> + }
>> +}
>> +
>> /*
>> * This function syncs the pasid bindings between guest and host.
>> * It includes updating the pasid cache in vIOMMU and updating the
>> @@ -3301,7 +3459,16 @@ static void vtd_pasid_cache_sync(IntelIOMMUState
>*s,
>> pc_info);
>> vtd_iommu_unlock(s);
>>
>> - /* TODO: Step 2: loop all the existing vtd_hiod instances for pasid bind. */
>> + /*
>> + * Step 2: loop all the existing vtd_hiod instances for pasid bind.
>> + * Ideally, needs to loop all devices to find if there is any new
>> + * PASID binding regards to the PASID cache invalidation request.
>> + * But it is enough to loop the devices which are backed by host
>> + * IOMMU. For devices backed by vIOMMU (a.k.a emulated devices),
>> + * if new PASID happened on them, their vtd_as instance could
>> + * be created during future vIOMMU DMA translation.
>the above comment is somehow redundant with the previous one
Yes, will drop this one.
Thanks
Zhenzhong
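The directory/table split done by vtd_sm_pasid_table_walk() above (step through the PASID range one directory entry at a time, then walk each leaf table) can be sketched as below. This is a minimal model, not the QEMU code: the entries-per-table constant is an illustrative value standing in for VTD_PASID_TBL_ENTRY_NUM, and the chunk list replaces the per-entry walk callback.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative value: a 4 KiB page holds 64 of the 64-byte scalable-mode
 * PASID table entries; the real constant is VTD_PASID_TBL_ENTRY_NUM. */
#define PASID_TBL_ENTRY_NUM 64u

/*
 * Split [start, end) into per-directory-entry chunks, mirroring the
 * pasid_next computation in vtd_sm_pasid_table_walk(): each chunk is at
 * most one PASID table's worth of entries, and the real code would look
 * up one PASID directory entry per chunk before walking its leaf table.
 * Returns the number of chunks and records each chunk's end PASID.
 */
static unsigned walk_chunks(unsigned start, unsigned end,
                            unsigned *chunk_ends, unsigned max_chunks)
{
    unsigned pasid = start, n = 0;

    while (pasid < end && n < max_chunks) {
        unsigned next = (end - pasid > PASID_TBL_ENTRY_NUM)
                            ? pasid + PASID_TBL_ENTRY_NUM : end;
        chunk_ends[n++] = next;
        pasid = next;
    }
    return n;
}
```

With the replay path limited to rid2pasid (start=0, end=1), a single chunk covers the whole range; the chunking only matters for DOMSI/GLOBAL invalidations over wider PASID ranges.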
^ permalink raw reply [flat|nested] 58+ messages in thread
* RE: [PATCH v2 12/19] intel_iommu: Introduce a new pasid cache invalidation type FORCE_RESET
2025-06-23 11:55 ` Eric Auger
@ 2025-06-26 8:28 ` Duan, Zhenzhong
0 siblings, 0 replies; 58+ messages in thread
From: Duan, Zhenzhong @ 2025-06-26 8:28 UTC (permalink / raw)
To: eric.auger@redhat.com, qemu-devel@nongnu.org
Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
jgg@nvidia.com, nicolinc@nvidia.com,
shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
Peng, Chao P, Yi Sun, Paolo Bonzini, Richard Henderson,
Eduardo Habkost, Marcel Apfelbaum
Hi Eric,
>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: Re: [PATCH v2 12/19] intel_iommu: Introduce a new pasid cache
>invalidation type FORCE_RESET
>
>Hi Zhenzhong,
>
>On 6/20/25 9:18 AM, Zhenzhong Duan wrote:
>> FORCE_RESET is different from GLOBAL_INV which updates pasid cache if
>> underlying pasid entry is still valid, it drops all the pasid caches.
>>
>> FORCE_RESET isn't a VTD spec defined invalidation type for pasid cache,
>> only used internally in system level reset.
>this comment shall be put in the code, along with the VTDPCInvType
>because the reader may look for that cmd in the spec.
Good idea, will do.
Thanks
Zhenzhong
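One way the agreed-upon comment could sit next to the type definition is sketched below. The enumerator names are taken from the patch; the exact comment wording and the later DEVSI addition (patch 15) are left out, so treat this as an illustration rather than the final code.

```c
#include <assert.h>

typedef enum VTDPCInvType {
    /*
     * VTD_PASID_CACHE_FORCE_RESET is not an invalidation type defined by
     * the VT-d spec; it is used internally on system-level reset. Unlike
     * GLOBAL_INV, which re-validates entries whose underlying guest pasid
     * entry is still present, FORCE_RESET drops all pasid caches.
     */
    VTD_PASID_CACHE_FORCE_RESET = 0,
    /* The remaining types rely on the guest PASID entry */
    VTD_PASID_CACHE_GLOBAL_INV,
    VTD_PASID_CACHE_DOMSI,
    VTD_PASID_CACHE_PASIDSI,
} VTDPCInvType;
```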
* RE: [PATCH v2 13/19] intel_iommu: Stick to system MR for IOMMUFD backed host device when x-fls=on
2025-06-23 12:02 ` Eric Auger
@ 2025-06-26 8:37 ` Duan, Zhenzhong
0 siblings, 0 replies; 58+ messages in thread
From: Duan, Zhenzhong @ 2025-06-26 8:37 UTC (permalink / raw)
To: eric.auger@redhat.com, qemu-devel@nongnu.org
Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
jgg@nvidia.com, nicolinc@nvidia.com,
shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
Peng, Chao P, Paolo Bonzini, Richard Henderson, Eduardo Habkost,
Marcel Apfelbaum
>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: Re: [PATCH v2 13/19] intel_iommu: Stick to system MR for
>IOMMUFD backed host device when x-fls=on
>
>
>
>On 6/20/25 9:18 AM, Zhenzhong Duan wrote:
>> When guest in scalable mode and x-flts=on, we stick to system MR for
>IOMMUFD
>when guest works in scalable mode ?
>> backed host device. Then its default hwpt contains GPA->HPA mappings
>which is
>> used directly if PGTT=PT and used as nested parent if PGTT=FLT. Otherwise
>> fallback to original processing.
>>
>> Suggested-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>> hw/i386/intel_iommu.c | 24 ++++++++++++++++++++++++
>> 1 file changed, 24 insertions(+)
>>
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index ed71bb8ec7..be01f8885f 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -1779,6 +1779,7 @@ static bool
>vtd_dev_pt_enabled(IntelIOMMUState *s, VTDContextEntry *ce,
>> */
>> return false;
>> }
>> +
>spurious new line
Will fix.
>> return (VTD_PE_GET_TYPE(&pe) == VTD_SM_PASID_ENTRY_PT);
>> }
>>
>> @@ -1790,10 +1791,33 @@ static bool
>vtd_as_pt_enabled(VTDAddressSpace *as)
>> {
>> IntelIOMMUState *s;
>> VTDContextEntry ce;
>> + struct vtd_as_key key = {
>> + .bus = as->bus,
>> + .devfn = as->devfn,
>> + };
>>
>> assert(as);
>>
>> s = as->iommu_state;
>> +
>> + /*
>> + * When guest in scalable mode and x-flts=on, we stick to system MR
>ditto
That's strange, I don't see a spurious new line here.
>> + * for IOMMUFD backed host device. Then its default hwpt contains
>> + * GPA->HPA mappings which is used directly if PGTT=PT and used as
>> + * nested parent if PGTT=FLT. Otherwise fallback to original
>fall back here and above
>
>This comment sounds a little bit unrelated to the below implementation
>which does not refer to system MR. how does the search for the hiod
>relate to that. I would refocus the comment.
vtd_as_pt_enabled()'s return value determines which MR to switch to.
See vtd_switch_address_space(). How about adding a comment:
+/*
+ * vtd_switch_address_space() calls vtd_as_pt_enabled() to determine which
+ * MR to switch to. Switch to the system MR if it returns true, the iommu MR otherwise.
+ */
static bool vtd_as_pt_enabled(VTDAddressSpace *as)
{
With this comment, we know the return value impacts the MR switch and that this patch
deals with that return value.
Thanks
Zhenzhong
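The decision discussed above can be modeled as a small predicate. This is a simplified stand-in for the control flow in vtd_switch_address_space()/vtd_as_pt_enabled(), with the parameters as assumptions, not the actual QEMU function signature:

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { MR_SYSTEM, MR_IOMMU } MRKind;

/*
 * Simplified model: with x-flts=on and the guest in scalable mode, an
 * IOMMUFD-backed host device sticks to the system MR (its default hwpt
 * holds GPA->HPA, used directly for PGTT=PT and as nested parent for
 * PGTT=FLT). Otherwise fall back to the original PGTT=PT check.
 */
static MRKind vtd_pick_mr(bool scalable_flts, bool iommufd_backed,
                          bool pgtt_pt)
{
    if (scalable_flts && iommufd_backed) {
        return MR_SYSTEM;
    }
    return pgtt_pt ? MR_SYSTEM : MR_IOMMU;
}
```

Only devices ending up on the IOMMU MR get shadow-page-table handling via IOMMU_NOTIFIER_IOTLB_EVENTS.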
* RE: [PATCH v2 14/19] intel_iommu: Bind/unbind guest page table to host
2025-06-23 13:17 ` Eric Auger
@ 2025-06-26 9:17 ` Duan, Zhenzhong
0 siblings, 0 replies; 58+ messages in thread
From: Duan, Zhenzhong @ 2025-06-26 9:17 UTC (permalink / raw)
To: eric.auger@redhat.com, qemu-devel@nongnu.org
Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
jgg@nvidia.com, nicolinc@nvidia.com,
shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
Peng, Chao P, Yi Sun, Marcel Apfelbaum, Paolo Bonzini,
Richard Henderson, Eduardo Habkost
>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: Re: [PATCH v2 14/19] intel_iommu: Bind/unbind guest page table to
>host
>
>
>
>On 6/20/25 9:18 AM, Zhenzhong Duan wrote:
>> This captures the guest PASID table entry modifications and
>> propagates the changes to host to attach a hwpt with type determined
>> per guest IOMMU mode and PGTT configuration.
>>
>> When PGTT is Pass-through(100b), the hwpt on host side is a stage-2
>> page table(GPA->HPA). When PGTT is First-stage Translation only(001b),
>> vIOMMU reuse hwpt(GPA->HPA) provided by VFIO as nested parent to
>> construct nested page table.
>>
>> When guest decides to use legacy mode then vIOMMU switches the MRs of
>> the device's AS, hence the IOAS created by VFIO container would be
>> switched to using the IOMMU_NOTIFIER_IOTLB_EVENTS since the MR is
>> switched to IOMMU MR. So it is able to support shadowing the guest IO
>> page table.
>>
>> Co-Authored-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>> hw/i386/intel_iommu_internal.h | 11 ++
>> hw/i386/intel_iommu.c | 244
>+++++++++++++++++++++++++++++++--
>> hw/i386/trace-events | 3 +
>> 3 files changed, 243 insertions(+), 15 deletions(-)
>>
>> diff --git a/hw/i386/intel_iommu_internal.h
>b/hw/i386/intel_iommu_internal.h
>> index 5ed76864be..92a533db54 100644
>> --- a/hw/i386/intel_iommu_internal.h
>> +++ b/hw/i386/intel_iommu_internal.h
>> @@ -563,6 +563,13 @@ typedef struct VTDRootEntry VTDRootEntry;
>> #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw) (0x1e0ULL |
>~VTD_HAW_MASK(aw))
>> #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1
>0xffffffffffe00000ULL
>>
>> +typedef enum VTDPASIDOp {
>> + VTD_PASID_BIND,
>> + VTD_PASID_UPDATE,
>> + VTD_PASID_UNBIND,
>> + VTD_OP_NUM
>> +} VTDPASIDOp;
>> +
>> typedef enum VTDPCInvType {
>> /* Force reset all */
>> VTD_PASID_CACHE_FORCE_RESET = 0,
>> @@ -607,6 +614,9 @@ typedef struct VTDPASIDCacheInfo {
>>
>> #define VTD_SM_PASID_ENTRY_FLPM 3ULL
>> #define VTD_SM_PASID_ENTRY_FLPTPTR (~0xfffULL)
>> +#define VTD_SM_PASID_ENTRY_SRE_BIT(val) (!!((val) & 1ULL))
>> +#define VTD_SM_PASID_ENTRY_WPE_BIT(val) (!!(((val) >> 4) & 1ULL))
>> +#define VTD_SM_PASID_ENTRY_EAFE_BIT(val) (!!(((val) >> 7) & 1ULL))
>>
>> /* First Level Paging Structure */
>> /* Masks for First Level Paging Entry */
>> @@ -644,5 +654,6 @@ typedef struct VTDHostIOMMUDevice {
>> PCIBus *bus;
>> uint8_t devfn;
>> HostIOMMUDevice *hiod;
>> + uint32_t s1_hwpt;
>> } VTDHostIOMMUDevice;
>> #endif
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index be01f8885f..1c94a0033c 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -20,6 +20,7 @@
>> */
>>
>> #include "qemu/osdep.h"
>> +#include CONFIG_DEVICES /* CONFIG_IOMMUFD */
>> #include "qemu/error-report.h"
>> #include "qemu/main-loop.h"
>> #include "qapi/error.h"
>> @@ -41,6 +42,9 @@
>> #include "migration/vmstate.h"
>> #include "trace.h"
>> #include "system/iommufd.h"
>> +#ifdef CONFIG_IOMMUFD
>> +#include <linux/iommufd.h>
>> +#endif
>>
>> /* context entry operations */
>> #define VTD_CE_GET_RID2PASID(ce) \
>> @@ -839,6 +843,27 @@ static inline uint16_t
>vtd_pe_get_did(VTDPASIDEntry *pe)
>> return VTD_SM_PASID_ENTRY_DID((pe)->val[1]);
>> }
>>
>> +static inline dma_addr_t vtd_pe_get_flpt_base(VTDPASIDEntry *pe)
>> +{
>> + return pe->val[2] & VTD_SM_PASID_ENTRY_FLPTPTR;
>Isn'it called FSPTPTR in the spec. In the positive I would use the same
>terminology.
Sure.
>> +}
>> +
>> +static inline uint32_t vtd_pe_get_fl_aw(VTDPASIDEntry *pe)
>> +{
>> + return 48 + ((pe->val[2] >> 2) & VTD_SM_PASID_ENTRY_FLPM) * 9;
>here again I am bit lost as you seem to look at 3d 64b FSPM while there
>is an AW field in the first 64b, please add a comment.
Sure, will add. Per spec, this helper is for the stage-1 address width;
the AW field in the first 64 bits is for the stage-2 page table:
"This field is treated as Reserved(0) for implementations not supporting Second-stage
Translation (SSTS=0 in the Extended Capability Register)."
>Also it is not clear where this computation comes from. Can you quote the
>spec?
In Figure 9-6. Scalable-Mode PASID Table Entry Format:
This field specifies the paging mode for first-stage translation.
* 00: 4-level paging (FSPTPTR is base of FS-PML4)
* 01: 5-level paging (FSPTPTR is base of FS-PML5)
* 10-11: Reserved
For 4-level paging, the IOVA width is 48 bits; for 5-level paging it's (48+9) = 57 bits.
>> +}
>> +
>> +static inline bool vtd_pe_pgtt_is_pt(VTDPASIDEntry *pe)
>> +{
>> + return (VTD_PE_GET_TYPE(pe) == VTD_SM_PASID_ENTRY_PT);
>> +}
>> +
>> +/* check if pgtt is first stage translation */
>> +static inline bool vtd_pe_pgtt_is_flt(VTDPASIDEntry *pe)
>> +{
>> + return (VTD_PE_GET_TYPE(pe) == VTD_SM_PASID_ENTRY_FLT);
>> +}
>> +
>> static inline bool vtd_pdire_present(VTDPASIDDirEntry *pdire)
>> {
>> return pdire->val & 1;
>> @@ -2431,6 +2456,188 @@ static void
>vtd_context_global_invalidate(IntelIOMMUState *s)
>> vtd_iommu_replay_all(s);
>> }
>>
>> +#ifdef CONFIG_IOMMUFD
>> +static void vtd_init_s1_hwpt_data(struct iommu_hwpt_vtd_s1 *vtd,
>> + VTDPASIDEntry *pe)
>> +{
>> + memset(vtd, 0, sizeof(*vtd));
>> +
>> + vtd->flags = (VTD_SM_PASID_ENTRY_SRE_BIT(pe->val[2]) ?
>> + IOMMU_VTD_S1_SRE : 0)
>|
>> + (VTD_SM_PASID_ENTRY_WPE_BIT(pe->val[2]) ?
>> + IOMMU_VTD_S1_WPE :
>0) |
>> + (VTD_SM_PASID_ENTRY_EAFE_BIT(pe->val[2]) ?
>> + IOMMU_VTD_S1_EAFE :
>0);
>> + vtd->addr_width = vtd_pe_get_fl_aw(pe);
>> + vtd->pgtbl_addr = (uint64_t)vtd_pe_get_flpt_base(pe);
>> +}
>> +
>> +static int vtd_create_s1_hwpt(VTDHostIOMMUDevice *vtd_hiod,
>> + VTDPASIDEntry *pe, Error **errp)
>> +{
>> + HostIOMMUDeviceIOMMUFD *idev =
>HOST_IOMMU_DEVICE_IOMMUFD(vtd_hiod->hiod);
>> + struct iommu_hwpt_vtd_s1 vtd;
>> + uint32_t s1_hwpt;
>> +
>> + vtd_init_s1_hwpt_data(&vtd, pe);
>> +
>> + if (!iommufd_backend_alloc_hwpt(idev->iommufd, idev->devid,
>> + idev->hwpt_id, 0,
>IOMMU_HWPT_DATA_VTD_S1,
>> + sizeof(vtd), &vtd, &s1_hwpt,
>errp)) {
>> + return -EINVAL;
>> + }
>> +
>> + vtd_hiod->s1_hwpt = s1_hwpt;
>> +
>> + return 0;
>> +}
>> +
>> +static void vtd_destroy_s1_hwpt(VTDHostIOMMUDevice *vtd_hiod)
>> +{
>> + HostIOMMUDeviceIOMMUFD *idev =
>HOST_IOMMU_DEVICE_IOMMUFD(vtd_hiod->hiod);
>> +
>> + iommufd_backend_free_id(idev->iommufd, vtd_hiod->s1_hwpt);
>> + vtd_hiod->s1_hwpt = 0;
>> +}
>> +
>> +static int vtd_device_attach_iommufd(VTDHostIOMMUDevice *vtd_hiod,
>> + uint32_t pasid,
>VTDPASIDEntry *pe,
>> + Error **errp)
>> +{
>> + HostIOMMUDeviceIOMMUFD *idev =
>HOST_IOMMU_DEVICE_IOMMUFD(vtd_hiod->hiod);
>> + uint32_t hwpt_id;
>> + int ret;
>> +
>> + if (vtd_pe_pgtt_is_flt(pe)) {
>> + ret = vtd_create_s1_hwpt(vtd_hiod, pe, errp);
>> + if (ret) {
>> + return ret;
>> + }
>> + hwpt_id = vtd_hiod->s1_hwpt;
>> + } else {
>> + hwpt_id = idev->hwpt_id;
>> + }
>> +
>> + ret = !host_iommu_device_iommufd_attach_hwpt(idev, hwpt_id,
>errp);
>> + trace_vtd_device_attach_hwpt(idev->devid, pasid, hwpt_id, ret);
>> + if (ret && vtd_pe_pgtt_is_flt(pe)) {
>> + vtd_destroy_s1_hwpt(vtd_hiod);
>> + }
>> +
>> + return ret;
>> +}
>> +
>> +static int vtd_device_detach_iommufd(VTDHostIOMMUDevice *vtd_hiod,
>> + uint32_t pasid,
>VTDPASIDEntry *pe,
>> + Error **errp)
>> +{
>> + HostIOMMUDeviceIOMMUFD *idev =
>HOST_IOMMU_DEVICE_IOMMUFD(vtd_hiod->hiod);
>> + int ret;
>> +
>> + if (vtd_hiod->iommu_state->dmar_enabled) {
>> + ret = !host_iommu_device_iommufd_detach_hwpt(idev, errp);
>> + trace_vtd_device_detach_hwpt(idev->devid, pasid, ret);
>> + } else {
>> + ret = !host_iommu_device_iommufd_attach_hwpt(idev,
>idev->hwpt_id, errp);
>> + trace_vtd_device_reattach_def_hwpt(idev->devid, pasid,
>idev->hwpt_id,
>> + ret);
>> + }
>> +
>> + if (vtd_pe_pgtt_is_flt(pe)) {
>> + vtd_destroy_s1_hwpt(vtd_hiod);
>> + }
>> +
>> + return ret;
>> +}
>> +
>> +static int vtd_device_attach_pgtbl(VTDHostIOMMUDevice *vtd_hiod,
>> + VTDAddressSpace *vtd_as,
>VTDPASIDEntry *pe)
>> +{
>> + /*
>> + * If pe->gptt == FLT, should be go ahead to do bind as host only
>PGTT. The rest of the sentence is difficult to parse.
Sure, will rephrase.
>> + * accepts guest FLT under nesting. If pe->pgtt==PT, should setup
>> + * the pasid with GPA page table. Otherwise should return failure.
>> + */
>> + if (!vtd_pe_pgtt_is_flt(pe) && !vtd_pe_pgtt_is_pt(pe)) {
>> + return -EINVAL;
>> + }
>> +
>> + /* Should fail if the FLPT base is 0 */
>> + if (vtd_pe_pgtt_is_flt(pe) && !vtd_pe_get_flpt_base(pe)) {
>> + return -EINVAL;
>> + }
>> +
>> + return vtd_device_attach_iommufd(vtd_hiod, vtd_as->pasid, pe,
>&error_abort);
>> +}
>> +
>> +static int vtd_device_detach_pgtbl(VTDHostIOMMUDevice *vtd_hiod,
>> + VTDAddressSpace *vtd_as)
>> +{
>> + VTDPASIDEntry *cached_pe =
>vtd_as->pasid_cache_entry.cache_filled ?
>> + &vtd_as->pasid_cache_entry.pasid_entry :
>NULL;
>> +
>> + if (!cached_pe ||
>> + (!vtd_pe_pgtt_is_flt(cached_pe)
>&& !vtd_pe_pgtt_is_pt(cached_pe))) {
>> + return 0;
>> + }
>> +
>> + return vtd_device_detach_iommufd(vtd_hiod, vtd_as->pasid,
>cached_pe,
>> + &error_abort);
>> +}
>> +
>> +/**
>> + * Caller should hold iommu_lock.
>> + */
>> +static int vtd_bind_guest_pasid(VTDAddressSpace *vtd_as,
>> + VTDPASIDEntry *pe, VTDPASIDOp
>op)
>> +{
>> + IntelIOMMUState *s = vtd_as->iommu_state;
>> + VTDHostIOMMUDevice *vtd_hiod;
>> + int devfn = vtd_as->devfn;
>> + int ret = -EINVAL;
>> + struct vtd_as_key key = {
>> + .bus = vtd_as->bus,
>> + .devfn = devfn,
>> + };
>> +
>> + vtd_hiod = g_hash_table_lookup(s->vtd_host_iommu_dev, &key);
>> + if (!vtd_hiod || !vtd_hiod->hiod) {
>> + /* means no need to go further, e.g. for emulated devices */
>don't you want to check
>
> object_dynamic_cast(OBJECT(vtd_hiod->hiod),
>
>TYPE_HOST_IOMMU_DEVICE_IOMMUFD)
>as well.
>In the positive you may introduce a helper that returns the vtd_hiod or NULL.
>It could also be used in previous patch and maybe at other locations as well.
After further thinking, it looks like checking only vtd_hiod is enough: vtd_hiod is
created based on hiod, and the check below in vtd_check_hiod() ensures hiod is
TYPE_HOST_IOMMU_DEVICE_IOMMUFD.
/* Remaining checks are all stage-1 translation specific */
if (!object_dynamic_cast(OBJECT(hiod), TYPE_HOST_IOMMU_DEVICE_IOMMUFD)) {
error_setg(errp, "Need IOMMUFD backend when x-flts=on");
return false;
}
Note we only run into these functions when hiod is TYPE_HOST_IOMMU_DEVICE_IOMMUFD and x-flts=on.
Thanks
Zhenzhong
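The stage-1 address-width computation discussed earlier in this message (FSPM in bits 3:2 of the third 64-bit word selecting 4-level vs 5-level paging) can be sketched as follows. The mask value is what the patch's VTD_SM_PASID_ENTRY_FLPM (3ULL) expands to; this mirrors vtd_pe_get_fl_aw() but is a standalone sketch, not the QEMU helper:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Decode FSPM from bits 3:2 of val[2] of a scalable-mode PASID table
 * entry, as quoted from the spec above:
 *   00b: 4-level paging -> 48-bit IOVA
 *   01b: 5-level paging -> 57-bit IOVA
 * (10b-11b are reserved; validity checking is out of scope here.)
 */
static uint32_t fl_addr_width(uint64_t pe_val2)
{
    return 48 + ((pe_val2 >> 2) & 0x3) * 9;
}
```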
* RE: [PATCH v2 15/19] intel_iommu: Replay pasid binds after context cache invalidation
2025-06-23 13:25 ` Eric Auger
@ 2025-06-26 9:27 ` Duan, Zhenzhong
0 siblings, 0 replies; 58+ messages in thread
From: Duan, Zhenzhong @ 2025-06-26 9:27 UTC (permalink / raw)
To: eric.auger@redhat.com, qemu-devel@nongnu.org
Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
jgg@nvidia.com, nicolinc@nvidia.com,
shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
Peng, Chao P, Yi Sun, Paolo Bonzini, Richard Henderson,
Eduardo Habkost, Marcel Apfelbaum
>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: Re: [PATCH v2 15/19] intel_iommu: Replay pasid binds after context
>cache invalidation
>
>
>
>On 6/20/25 9:18 AM, Zhenzhong Duan wrote:
>> From: Yi Liu <yi.l.liu@intel.com>
>>
>> This replays guest pasid attachments after context cache invalidation.
>> This is a behavior to ensure safety. Actually, programmer should issue
>> pasid cache invalidation with proper granularity after issuing a context
>> cache invalidation.
>so is it mandated to do the invalidation twice?
Yes, because we have no way to know whether a pasid cache invalidation following a
context cache invalidation is a duplicate of one from the guest.
It depends on whether we want to support a guest with such a flaw; if there is no
need to support it, we can drop this patch.
>>
>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>> hw/i386/intel_iommu_internal.h | 1 +
>> hw/i386/intel_iommu.c | 51
>++++++++++++++++++++++++++++++++--
>> hw/i386/trace-events | 1 +
>> 3 files changed, 51 insertions(+), 2 deletions(-)
>>
>> diff --git a/hw/i386/intel_iommu_internal.h
>b/hw/i386/intel_iommu_internal.h
>> index 92a533db54..b3e4aa23f1 100644
>> --- a/hw/i386/intel_iommu_internal.h
>> +++ b/hw/i386/intel_iommu_internal.h
>> @@ -575,6 +575,7 @@ typedef enum VTDPCInvType {
>> VTD_PASID_CACHE_FORCE_RESET = 0,
>> /* pasid cache invalidation rely on guest PASID entry */
>> VTD_PASID_CACHE_GLOBAL_INV, /* pasid cache global invalidation
>*/
>> + VTD_PASID_CACHE_DEVSI, /* pasid cache device selective
>invalidation */
>> VTD_PASID_CACHE_DOMSI, /* pasid cache domain selective
>invalidation */
>> VTD_PASID_CACHE_PASIDSI, /* pasid cache pasid selective
>invalidation */
>> } VTDPCInvType;
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index 1c94a0033c..621b07aa02 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -92,6 +92,10 @@ static void
>vtd_address_space_refresh_all(IntelIOMMUState *s);
>> static void vtd_address_space_unmap(VTDAddressSpace *as,
>IOMMUNotifier *n);
>>
>> static void vtd_pasid_cache_reset_locked(IntelIOMMUState *s);
>> +static void vtd_pasid_cache_sync(IntelIOMMUState *s,
>> + VTDPASIDCacheInfo *pc_info);
>> +static void vtd_pasid_cache_devsi(IntelIOMMUState *s,
>> + PCIBus *bus, uint16_t devfn);
>>
>> static void vtd_panic_require_caching_mode(void)
>> {
>> @@ -2437,6 +2441,8 @@ static void
>vtd_iommu_replay_all(IntelIOMMUState *s)
>>
>> static void vtd_context_global_invalidate(IntelIOMMUState *s)
>> {
>> + VTDPASIDCacheInfo pc_info = { .error_happened = false, };
>> +
>> trace_vtd_inv_desc_cc_global();
>> /* Protects context cache */
>> vtd_iommu_lock(s);
>> @@ -2454,6 +2460,9 @@ static void
>vtd_context_global_invalidate(IntelIOMMUState *s)
>> * VT-d emulation codes.
>> */
>> vtd_iommu_replay_all(s);
>> +
>> + pc_info.type = VTD_PASID_CACHE_GLOBAL_INV;
>> + vtd_pasid_cache_sync(s, &pc_info);
>> }
>>
>> #ifdef CONFIG_IOMMUFD
>> @@ -2696,6 +2705,21 @@ static void
>vtd_context_device_invalidate(IntelIOMMUState *s,
>> * happened.
>> */
>> vtd_address_space_sync(vtd_as);
>> + /*
>> + * Per spec, context flush should also followed with PASID
>be followed
>> + * cache and iotlb flush. Regards to a device selective
>regarding to?
>> + * context cache invalidation:
>> + * if (emaulted_device)
>emulated
Will fix above three.
Thanks
Zhenzhong
* RE: [PATCH v2 16/19] intel_iommu: Propagate PASID-based iotlb invalidation to host
2025-06-23 13:41 ` Eric Auger
@ 2025-06-26 9:42 ` Duan, Zhenzhong
0 siblings, 0 replies; 58+ messages in thread
From: Duan, Zhenzhong @ 2025-06-26 9:42 UTC (permalink / raw)
To: eric.auger@redhat.com, qemu-devel@nongnu.org
Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
jgg@nvidia.com, nicolinc@nvidia.com,
shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
Peng, Chao P, Yi Sun, Paolo Bonzini, Richard Henderson,
Eduardo Habkost, Marcel Apfelbaum
Hi Eric,
>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: Re: [PATCH v2 16/19] intel_iommu: Propagate PASID-based iotlb
>invalidation to host
>
>Hi Zhenzhong,
>
>On 6/20/25 9:18 AM, Zhenzhong Duan wrote:
>> From: Yi Liu <yi.l.liu@intel.com>
>>
>> This traps the guest PASID-based iotlb invalidation request and propagate it
>> to host.
>>
>> Intel VT-d 3.0 supports nested translation in PASID granular. Guest SVA
>support
>s/granular/granularity
Will do
>> could be implemented by configuring nested translation on specific PASID.
>This
>> is also known as dual stage DMA translation.
>>
>> Under such configuration, guest owns the GVA->GPA translation which is
>> configured as stage-1 page table in host side for a specific pasid, and host
>> owns GPA->HPA translation. As guest owns stage-1 translation table, piotlb
>> invalidation should be propagated to host since host IOMMU will cache first
>> level page table related mappings during DMA address translation.
>>
>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>> hw/i386/intel_iommu_internal.h | 6 ++
>> hw/i386/intel_iommu.c | 113
>++++++++++++++++++++++++++++++++-
>> 2 files changed, 117 insertions(+), 2 deletions(-)
>>
>> diff --git a/hw/i386/intel_iommu_internal.h
>b/hw/i386/intel_iommu_internal.h
>> index b3e4aa23f1..07bfb97499 100644
>> --- a/hw/i386/intel_iommu_internal.h
>> +++ b/hw/i386/intel_iommu_internal.h
>> @@ -589,6 +589,12 @@ typedef struct VTDPASIDCacheInfo {
>> bool error_happened;
>> } VTDPASIDCacheInfo;
>>
>> +typedef struct VTDPIOTLBInvInfo {
>> + uint16_t domain_id;
>> + uint32_t pasid;
>> + struct iommu_hwpt_vtd_s1_invalidate *inv_data;
>> +} VTDPIOTLBInvInfo;
>> +
>> /* PASID Table Related Definitions */
>> #define VTD_PASID_DIR_BASE_ADDR_MASK (~0xfffULL)
>> #define VTD_PASID_TABLE_BASE_ADDR_MASK (~0xfffULL)
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index 621b07aa02..d1fa395274 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -2639,12 +2639,105 @@ static int
>vtd_bind_guest_pasid(VTDAddressSpace *vtd_as,
>>
>> return ret;
>> }
>> +
>> +/*
>> + * Caller of this function should hold iommu_lock.
>> + */
>> +static void vtd_invalidate_piotlb(VTDAddressSpace *vtd_as,
>> + struct
>iommu_hwpt_vtd_s1_invalidate *cache)
>> +{
>> + VTDHostIOMMUDevice *vtd_hiod;
>> + HostIOMMUDeviceIOMMUFD *idev;
>> + int devfn = vtd_as->devfn;
>> + struct vtd_as_key key = {
>> + .bus = vtd_as->bus,
>> + .devfn = devfn,
>> + };
>> + IntelIOMMUState *s = vtd_as->iommu_state;
>> + uint32_t entry_num = 1; /* Only implement one request for simplicity
>*/
>> + Error *err;
>> +
>> + vtd_hiod = g_hash_table_lookup(s->vtd_host_iommu_dev, &key);
>> + if (!vtd_hiod || !vtd_hiod->hiod) {
>> + return;
>> + }
>> + idev = HOST_IOMMU_DEVICE_IOMMUFD(vtd_hiod->hiod);
>> +
>> + if (!iommufd_backend_invalidate_cache(idev->iommufd,
>vtd_hiod->s1_hwpt,
>> +
>IOMMU_HWPT_INVALIDATE_DATA_VTD_S1,
>> + sizeof(*cache),
>&entry_num, cache,
>> + &err)) {
>> + error_report_err(err);
>> + }
>> +}
>> +
>> +/*
>> + * This function is a loop function for the s->vtd_address_spaces
>> + * list with VTDPIOTLBInvInfo as execution filter. It propagates
>> + * the piotlb invalidation to host. Caller of this function
>> + * should hold iommu_lock.
>instead of having this mention everywhere may be more efficient to
>postfix each function with _locked and I don't know if it exists have a
>checker.
OK, will add _locked and checker if necessary.
>> + */
>> +static void vtd_flush_pasid_iotlb(gpointer key, gpointer value,
>> + gpointer user_data)
>> +{
>> + VTDPIOTLBInvInfo *piotlb_info = user_data;
>> + VTDAddressSpace *vtd_as = value;
>> + VTDPASIDCacheEntry *pc_entry = &vtd_as->pasid_cache_entry;
>> + uint32_t pasid;
>> + uint16_t did;
>> +
>> + /* Replay only fill pasid entry cache for passthrough device */
>fills
>> + if (!pc_entry->cache_filled ||
>> + !vtd_pe_pgtt_is_flt(&pc_entry->pasid_entry)) {
>> + return;
>> + }
>> +
>> + if (vtd_as_to_iommu_pasid_locked(vtd_as, &pasid)) {
>> + return;
>> + }
>> +
>> + did = vtd_pe_get_did(&pc_entry->pasid_entry);
>> +
>> + if (piotlb_info->domain_id == did && piotlb_info->pasid == pasid) {
>> + vtd_invalidate_piotlb(vtd_as, piotlb_info->inv_data);
>> + }
>> +}
>> +
>> +static void vtd_flush_pasid_iotlb_all(IntelIOMMUState *s,
>> + uint16_t domain_id,
>uint32_t pasid,
>> + hwaddr addr, uint64_t
>npages, bool ih)
>> +{
>> + struct iommu_hwpt_vtd_s1_invalidate cache_info = { 0 };
>> + VTDPIOTLBInvInfo piotlb_info;
>> +
>> + cache_info.addr = addr;
>> + cache_info.npages = npages;
>> + cache_info.flags = ih ? IOMMU_VTD_INV_FLAGS_LEAF : 0;
>> +
>> + piotlb_info.domain_id = domain_id;
>> + piotlb_info.pasid = pasid;
>> + piotlb_info.inv_data = &cache_info;
>> +
>> + /*
>> + * Here loops all the vtd_as instances in s->vtd_address_spaces
>I am not a native english speaker but I am not sure loop something is
>OK. Go though each?
Will do.
>Besides, that comment is not that useful as it paraphrases the code.
OK, will try to simplify it.
>> + * to find out the affected devices since piotlb invalidation
>> + * should check pasid cache per architecture point of view.
>> + */
>> + g_hash_table_foreach(s->vtd_address_spaces,
>> + vtd_flush_pasid_iotlb, &piotlb_info);
>> +}
>> #else
>> static int vtd_bind_guest_pasid(VTDAddressSpace *vtd_as,
>> VTDPASIDEntry *pe, VTDPASIDOp
>op)
>> {
>> return 0;
>> }
>> +
>> +static void vtd_flush_pasid_iotlb_all(IntelIOMMUState *s,
>> + uint16_t domain_id,
>uint32_t pasid,
>> + hwaddr addr, uint64_t
>npages, bool ih)
>> +{
>> +}
>> #endif
>>
>> /* Do a context-cache device-selective invalidation.
>> @@ -3300,6 +3393,13 @@ static void
>vtd_piotlb_pasid_invalidate(IntelIOMMUState *s,
>> info.pasid = pasid;
>>
>> vtd_iommu_lock(s);
>> + /*
>> + * Here loops all the vtd_as instances in s->vtd_as
>would drop the above.
>> + * to find out the affected devices since piotlb invalidation
>Find out ...
Will do.
Thanks
Zhenzhong
^ permalink raw reply [flat|nested] 58+ messages in thread
* RE: [PATCH v2 17/19] intel_iommu: Refresh pasid bind when either SRTP or TE bit is changed
2025-06-23 13:48 ` Eric Auger
@ 2025-06-26 10:16 ` Duan, Zhenzhong
0 siblings, 0 replies; 58+ messages in thread
From: Duan, Zhenzhong @ 2025-06-26 10:16 UTC (permalink / raw)
To: eric.auger@redhat.com, qemu-devel@nongnu.org
Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
jgg@nvidia.com, nicolinc@nvidia.com,
shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
Peng, Chao P, Paolo Bonzini, Richard Henderson, Eduardo Habkost,
Marcel Apfelbaum
Hi Eric,
>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: Re: [PATCH v2 17/19] intel_iommu: Refresh pasid bind when either
>SRTP or TE bit is changed
>
>Hi Zhenzhong,
>
>On 6/20/25 9:18 AM, Zhenzhong Duan wrote:
>> From: Yi Liu <yi.l.liu@intel.com>
>>
>> When either 'Set Root Table Pointer' or 'Translation Enable' bit is changed,
>> the pasid bindings on host side become stale and need to be updated.
>>
>> Introduce a helper function vtd_refresh_pasid_bind() for that purpose.
>nit I would avoid introducing yet another terminology, ie. refresh. If
>it is a replay let's keep replay and precisely explain what the replay does.
>vtd_replay_pasid_bindings?
Will do
>
>
>>
>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>> hw/i386/intel_iommu.c | 23 +++++++++++++++++++++++
>> 1 file changed, 23 insertions(+)
>>
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index d1fa395274..0b322078cc 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -90,6 +90,7 @@ struct vtd_iotlb_key {
>>
>> static void vtd_address_space_refresh_all(IntelIOMMUState *s);
>> static void vtd_address_space_unmap(VTDAddressSpace *as,
>IOMMUNotifier *n);
>> +static void vtd_refresh_pasid_bind(IntelIOMMUState *s);
>>
>> static void vtd_pasid_cache_reset_locked(IntelIOMMUState *s);
>> static void vtd_pasid_cache_sync(IntelIOMMUState *s,
>> @@ -3066,6 +3067,7 @@ static void
>vtd_handle_gcmd_srtp(IntelIOMMUState *s)
>> vtd_set_clear_mask_long(s, DMAR_GSTS_REG, 0, VTD_GSTS_RTPS);
>> vtd_reset_caches(s);
>> vtd_address_space_refresh_all(s);
>> + vtd_refresh_pasid_bind(s);
>> }
>>
>> /* Set Interrupt Remap Table Pointer */
>> @@ -3100,6 +3102,7 @@ static void
>vtd_handle_gcmd_te(IntelIOMMUState *s, bool en)
>>
>> vtd_reset_caches(s);
>> vtd_address_space_refresh_all(s);
>> + vtd_refresh_pasid_bind(s);
>> }
>>
>> /* Handle Interrupt Remap Enable/Disable */
>> @@ -3813,6 +3816,26 @@ static void
>vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
>> }
>> }
>>
>> +static void vtd_refresh_pasid_bind(IntelIOMMUState *s)
>> +{
>> + VTDPASIDCacheInfo pc_info = { .error_happened = false,
>> + .type =
>VTD_PASID_CACHE_GLOBAL_INV };
>> +
>> + /*
>> + * Only when dmar is enabled, should pasid bindings replayed,
>> + * otherwise no need to replay.
>> + */
>I am not sure the above comment is necessary.
Will delete it
Thanks
Zhenzhong
* RE: [PATCH v2 09/19] intel_iommu: Introduce two helpers vtd_as_from/to_iommu_pasid_locked
2025-06-24 2:48 ` Duan, Zhenzhong
@ 2025-07-07 3:12 ` Duan, Zhenzhong
2025-07-07 16:54 ` Eric Auger
0 siblings, 1 reply; 58+ messages in thread
From: Duan, Zhenzhong @ 2025-07-07 3:12 UTC (permalink / raw)
To: eric.auger@redhat.com, qemu-devel@nongnu.org
Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
jgg@nvidia.com, nicolinc@nvidia.com,
shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
Peng, Chao P, Paolo Bonzini, Richard Henderson, Eduardo Habkost,
Marcel Apfelbaum
Hi Eric,
>-----Original Message-----
>From: Duan, Zhenzhong
>Subject: RE: [PATCH v2 09/19] intel_iommu: Introduce two helpers
>vtd_as_from/to_iommu_pasid_locked
>
>
>
>>-----Original Message-----
>>From: Eric Auger <eric.auger@redhat.com>
>>Subject: Re: [PATCH v2 09/19] intel_iommu: Introduce two helpers
>>vtd_as_from/to_iommu_pasid_locked
>>
>>Hi Zhenzhong,
>>
>>On 6/20/25 9:18 AM, Zhenzhong Duan wrote:
>>> PCI device supports two request types, Requests-without-PASID and
>>> Requests-with-PASID. Requests-without-PASID doesn't include a PASID TLP
>>> prefix, IOMMU fetches rid_pasid from context entry and use it as
>IOMMU's
>>> pasid to index pasid table.
>>>
>>> So we need to translate between PCI's pasid and IOMMU's pasid specially
>>> for Requests-without-PASID, e.g., PCI_NO_PASID(-1) <-> rid_pasid.
>>> For Requests-with-PASID, PCI's pasid and IOMMU's pasid are same value.
>>>
>>> vtd_as_from_iommu_pasid_locked() translates from BDF+iommu_pasid to
>>vtd_as
>>> which contains PCI's pasid vtd_as->pasid.
>>>
>>> vtd_as_to_iommu_pasid_locked() translates from BDF+vtd_as->pasid to
>>iommu_pasid.
>>>
>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>> ---
>>> hw/i386/intel_iommu.c | 58
>>+++++++++++++++++++++++++++++++++++++++++++
>>> 1 file changed, 58 insertions(+)
>>>
>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>>> index 9d4adc9458..8948b8370f 100644
>>> --- a/hw/i386/intel_iommu.c
>>> +++ b/hw/i386/intel_iommu.c
>>> @@ -1602,6 +1602,64 @@ static int
>>vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
>>> return 0;
>>> }
>>>
>>> +static inline int vtd_as_to_iommu_pasid_locked(VTDAddressSpace
>*vtd_as,
>>> + uint32_t
>*pasid)
>>Is it meaningful to use inline here and below? Below I guess you do so
>>to avoid "defined but not used" compilation error but I don't think it
>>should stay as is.
>
>Yes, that's the only reason I defined them both inline.
>Do you have other suggestions to avoid the compilation error without using inline?
I am still not clear on the above comments; do you just want to remove the inline keyword?
Maybe merge the two helpers into the later patches that use them, to avoid inline?
If I misunderstood, could you shed more light on what changes you want this piece of code to have?
Thanks
Zhenzhong
>
>>
>>I don't really understand the iommu_pasid terminology. Either it is a
>>pasid passed through the PCI transaction or it is the default pasid
>>found in the rid2pasid ce field. So that's a pasid both ways ;-) can't you
>>simply call it pasid.
>
>Yes, on the PCI side we call it just pasid; on the IOMMU side I name it
>iommu pasid to distinguish the two.
>Does that work for you?
>
>>> +{
>>> + VTDContextCacheEntry *cc_entry = &vtd_as->context_cache_entry;
>>> + IntelIOMMUState *s = vtd_as->iommu_state;
>>> + uint8_t bus_num = pci_bus_num(vtd_as->bus);
>>> + uint8_t devfn = vtd_as->devfn;
>>> + VTDContextEntry ce;
>>> + int ret;
>>> +
>>> + if (cc_entry->context_cache_gen == s->context_cache_gen) {
>>> + ce = cc_entry->context_entry;
>>> + } else {
>>> + ret = vtd_dev_to_context_entry(s, bus_num, devfn, &ce);
>>> + if (ret) {
>>> + return ret;
>>> + }
>>> + }
>>if the above pattern is used at many locations I still think it may be
>>valuable to have a _locked helper.
>
>I don't get it; both vtd_as_to_iommu_pasid_locked() and
>vtd_as_from_iommu_pasid_locked()
>are already _locked helpers, aren't they?
>
>Do you mean adding a comment saying "Caller of this function should hold
>iommu_lock."
>
>>> +
>>> + /* Translate to iommu pasid if PCI_NO_PASID */
>>> + if (vtd_as->pasid == PCI_NO_PASID) {
>>> + *pasid = VTD_CE_GET_RID2PASID(&ce);
>>> + } else {
>>> + *pasid = vtd_as->pasid;
>>> + }
>>> +
>>> + return 0;
>>> +}
>>> +
>>> +static gboolean vtd_find_as_by_sid_and_iommu_pasid(gpointer key,
>gpointer
>>value,
>>> + gpointer
>user_data)
>>> +{
>>> + VTDAddressSpace *vtd_as = (VTDAddressSpace *)value;
>>> + struct vtd_as_raw_key *target = (struct vtd_as_raw_key
>*)user_data;
>>why target? can't you name it key instead?
>
>There is already a parameter named key, maybe target_key?
>
>Thanks
>Zhenzhong
>
>>> + uint16_t sid = PCI_BUILD_BDF(pci_bus_num(vtd_as->bus),
>vtd_as->devfn);
>>> + uint32_t pasid;
>>> +
>>> + if (vtd_as_to_iommu_pasid_locked(vtd_as, &pasid)) {
>>> + return false;
>>> + }
>>> +
>>> + return (pasid == target->pasid) && (sid == target->sid);
>>> +}
>>> +
>>> +/* Translate iommu pasid to vtd_as */
>>same here
>>> +static inline
>>> +VTDAddressSpace *vtd_as_from_iommu_pasid_locked(IntelIOMMUState
>*s,
>>> + uint16_t sid,
>uint32_t pasid)
>>> +{
>>> + struct vtd_as_raw_key key = {
>>> + .sid = sid,
>>> + .pasid = pasid
>>> + };
>>> +
>>> + return g_hash_table_find(s->vtd_address_spaces,
>>> +
>vtd_find_as_by_sid_and_iommu_pasid, &key);
>>> +}
>>> +
>>> static int vtd_sync_shadow_page_hook(const IOMMUTLBEvent *event,
>>> void *private)
>>> {
>>Thanks
>>
>>Eric
* Re: [PATCH v2 09/19] intel_iommu: Introduce two helpers vtd_as_from/to_iommu_pasid_locked
2025-07-07 3:12 ` Duan, Zhenzhong
@ 2025-07-07 16:54 ` Eric Auger
2025-07-08 2:35 ` Duan, Zhenzhong
0 siblings, 1 reply; 58+ messages in thread
From: Eric Auger @ 2025-07-07 16:54 UTC (permalink / raw)
To: Duan, Zhenzhong, qemu-devel@nongnu.org
Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
jgg@nvidia.com, nicolinc@nvidia.com,
shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
Peng, Chao P, Paolo Bonzini, Richard Henderson, Eduardo Habkost,
Marcel Apfelbaum
Hi Zhenzhong,
On 7/7/25 5:12 AM, Duan, Zhenzhong wrote:
> Hi Eric,
>
>> -----Original Message-----
>> From: Duan, Zhenzhong
>> Subject: RE: [PATCH v2 09/19] intel_iommu: Introduce two helpers
>> vtd_as_from/to_iommu_pasid_locked
>>
>>
>>
>>> -----Original Message-----
>>> From: Eric Auger <eric.auger@redhat.com>
>>> Subject: Re: [PATCH v2 09/19] intel_iommu: Introduce two helpers
>>> vtd_as_from/to_iommu_pasid_locked
>>>
>>> Hi Zhenzhong,
>>>
>>> On 6/20/25 9:18 AM, Zhenzhong Duan wrote:
>>>> PCI device supports two request types, Requests-without-PASID and
>>>> Requests-with-PASID. Requests-without-PASID doesn't include a PASID TLP
>>>> prefix, IOMMU fetches rid_pasid from context entry and use it as
>> IOMMU's
>>>> pasid to index pasid table.
>>>>
>>>> So we need to translate between PCI's pasid and IOMMU's pasid specially
>>>> for Requests-without-PASID, e.g., PCI_NO_PASID(-1) <-> rid_pasid.
>>>> For Requests-with-PASID, PCI's pasid and IOMMU's pasid are same value.
>>>>
>>>> vtd_as_from_iommu_pasid_locked() translates from BDF+iommu_pasid to
>>> vtd_as
>>>> which contains PCI's pasid vtd_as->pasid.
>>>>
>>>> vtd_as_to_iommu_pasid_locked() translates from BDF+vtd_as->pasid to
>>> iommu_pasid.
>>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>>> ---
>>>> hw/i386/intel_iommu.c | 58
>>> +++++++++++++++++++++++++++++++++++++++++++
>>>> 1 file changed, 58 insertions(+)
>>>>
>>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>>>> index 9d4adc9458..8948b8370f 100644
>>>> --- a/hw/i386/intel_iommu.c
>>>> +++ b/hw/i386/intel_iommu.c
>>>> @@ -1602,6 +1602,64 @@ static int
>>> vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
>>>> return 0;
>>>> }
>>>>
>>>> +static inline int vtd_as_to_iommu_pasid_locked(VTDAddressSpace
>> *vtd_as,
>>>> + uint32_t
>> *pasid)
>>> Is it meaningful to use inline here and below? Below I guess you do so
>>> to avoid "defined but not used" compilation error but I don't think it
>>> should stay as is.
>> Yes, that's the only reason I defined them both inline.
>> Do you have other suggestions to avoid the compilation error without using inline?
> I am still not clear on the above comments; do you just want to remove the inline keyword?
> Maybe merge the two helpers into the later patches that use them, to avoid inline?
In the past what I did in such situation consisted in introducing a
declaration of the static function before its definition and when the
actual user is introduced, in a subsequent patch, remove the spurious
declaration.
Now, reading
https://www.reddit.com/r/cpp_questions/comments/15kfije/how_to_decide_if_a_function_should_be_inline_or/,
maybe adding the inline here is not a problem given the compiler may or
may not inline the function.
Thanks
Eric
>
> If I misunderstood, could you shed more light on what changes you want this piece of code to have?
>
> Thanks
> Zhenzhong
>
>>> I don't really understand the iommu_pasid terminology. Either it is a
>>> pasid passed through the PCI transaction or it is the default pasid
>>> found in the rid2pasid ce field. So that's a pasid both ways ;-) can't you
>>> simply call it pasid.
>> Yes, on the PCI side we call it just pasid; on the IOMMU side I name it
>> iommu pasid to distinguish the two.
>> Does that work for you?
>>
>>>> +{
>>>> + VTDContextCacheEntry *cc_entry = &vtd_as->context_cache_entry;
>>>> + IntelIOMMUState *s = vtd_as->iommu_state;
>>>> + uint8_t bus_num = pci_bus_num(vtd_as->bus);
>>>> + uint8_t devfn = vtd_as->devfn;
>>>> + VTDContextEntry ce;
>>>> + int ret;
>>>> +
>>>> + if (cc_entry->context_cache_gen == s->context_cache_gen) {
>>>> + ce = cc_entry->context_entry;
>>>> + } else {
>>>> + ret = vtd_dev_to_context_entry(s, bus_num, devfn, &ce);
>>>> + if (ret) {
>>>> + return ret;
>>>> + }
>>>> + }
>>> if the above pattern is used at many locations I still think it may be
>>> valuable to have a _locked helper.
>> I don't get it; both vtd_as_to_iommu_pasid_locked() and
>> vtd_as_from_iommu_pasid_locked()
>> are already _locked helpers, aren't they?
>>
>> Do you mean adding a comment saying "Caller of this function should hold
>> iommu_lock."
>>
>>>> +
>>>> + /* Translate to iommu pasid if PCI_NO_PASID */
>>>> + if (vtd_as->pasid == PCI_NO_PASID) {
>>>> + *pasid = VTD_CE_GET_RID2PASID(&ce);
>>>> + } else {
>>>> + *pasid = vtd_as->pasid;
>>>> + }
>>>> +
>>>> + return 0;
>>>> +}
>>>> +
>>>> +static gboolean vtd_find_as_by_sid_and_iommu_pasid(gpointer key,
>> gpointer
>>> value,
>>>> + gpointer
>> user_data)
>>>> +{
>>>> + VTDAddressSpace *vtd_as = (VTDAddressSpace *)value;
>>>> + struct vtd_as_raw_key *target = (struct vtd_as_raw_key
>> *)user_data;
>>> why target? can't you name it key instead?
>> There is already a parameter named key, maybe target_key?
>>
>> Thanks
>> Zhenzhong
>>
>>>> + uint16_t sid = PCI_BUILD_BDF(pci_bus_num(vtd_as->bus),
>> vtd_as->devfn);
>>>> + uint32_t pasid;
>>>> +
>>>> + if (vtd_as_to_iommu_pasid_locked(vtd_as, &pasid)) {
>>>> + return false;
>>>> + }
>>>> +
>>>> + return (pasid == target->pasid) && (sid == target->sid);
>>>> +}
>>>> +
>>>> +/* Translate iommu pasid to vtd_as */
>>> same here
>>>> +static inline
>>>> +VTDAddressSpace *vtd_as_from_iommu_pasid_locked(IntelIOMMUState
>> *s,
>>>> + uint16_t sid,
>> uint32_t pasid)
>>>> +{
>>>> + struct vtd_as_raw_key key = {
>>>> + .sid = sid,
>>>> + .pasid = pasid
>>>> + };
>>>> +
>>>> + return g_hash_table_find(s->vtd_address_spaces,
>>>> +
>> vtd_find_as_by_sid_and_iommu_pasid, &key);
>>>> +}
>>>> +
>>>> static int vtd_sync_shadow_page_hook(const IOMMUTLBEvent *event,
>>>> void *private)
>>>> {
>>> Thanks
>>>
>>> Eric
* RE: [PATCH v2 09/19] intel_iommu: Introduce two helpers vtd_as_from/to_iommu_pasid_locked
2025-07-07 16:54 ` Eric Auger
@ 2025-07-08 2:35 ` Duan, Zhenzhong
0 siblings, 0 replies; 58+ messages in thread
From: Duan, Zhenzhong @ 2025-07-08 2:35 UTC (permalink / raw)
To: eric.auger@redhat.com, qemu-devel@nongnu.org
Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
jgg@nvidia.com, nicolinc@nvidia.com,
shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
Peng, Chao P, Paolo Bonzini, Richard Henderson, Eduardo Habkost,
Marcel Apfelbaum
Hi Eric,
>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: Re: [PATCH v2 09/19] intel_iommu: Introduce two helpers
>vtd_as_from/to_iommu_pasid_locked
>
>Hi Zhenzhong,
>
>On 7/7/25 5:12 AM, Duan, Zhenzhong wrote:
>> Hi Eric,
>>
>>> -----Original Message-----
>>> From: Duan, Zhenzhong
>>> Subject: RE: [PATCH v2 09/19] intel_iommu: Introduce two helpers
>>> vtd_as_from/to_iommu_pasid_locked
>>>
>>>
>>>
>>>> -----Original Message-----
>>>> From: Eric Auger <eric.auger@redhat.com>
>>>> Subject: Re: [PATCH v2 09/19] intel_iommu: Introduce two helpers
>>>> vtd_as_from/to_iommu_pasid_locked
>>>>
>>>> Hi Zhenzhong,
>>>>
>>>> On 6/20/25 9:18 AM, Zhenzhong Duan wrote:
>>>>> PCI device supports two request types, Requests-without-PASID and
>>>>> Requests-with-PASID. Requests-without-PASID doesn't include a PASID
>TLP
>>>>> prefix, IOMMU fetches rid_pasid from context entry and use it as
>>> IOMMU's
>>>>> pasid to index pasid table.
>>>>>
>>>>> So we need to translate between PCI's pasid and IOMMU's pasid
>specially
>>>>> for Requests-without-PASID, e.g., PCI_NO_PASID(-1) <-> rid_pasid.
>>>>> For Requests-with-PASID, PCI's pasid and IOMMU's pasid are same
>value.
>>>>>
>>>>> vtd_as_from_iommu_pasid_locked() translates from BDF+iommu_pasid
>to
>>>> vtd_as
>>>>> which contains PCI's pasid vtd_as->pasid.
>>>>>
>>>>> vtd_as_to_iommu_pasid_locked() translates from BDF+vtd_as->pasid to
>>>> iommu_pasid.
>>>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>>>> ---
>>>>> hw/i386/intel_iommu.c | 58
>>>> +++++++++++++++++++++++++++++++++++++++++++
>>>>> 1 file changed, 58 insertions(+)
>>>>>
>>>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>>>>> index 9d4adc9458..8948b8370f 100644
>>>>> --- a/hw/i386/intel_iommu.c
>>>>> +++ b/hw/i386/intel_iommu.c
>>>>> @@ -1602,6 +1602,64 @@ static int
>>>> vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
>>>>> return 0;
>>>>> }
>>>>>
>>>>> +static inline int vtd_as_to_iommu_pasid_locked(VTDAddressSpace
>>> *vtd_as,
>>>>> + uint32_t
>>> *pasid)
>>>> Is it meaningful to use inline here and below? Below I guess you do so
>>>> to avoid "defined but not used" compilation error but I don't think it
>>>> should stay as is.
>>> Yes, that's the only reason I defined them both inline.
>>> Do you have other suggestions to avoid the compilation error without using inline?
>> I am still not clear on the above comments; do you just want to remove
>> the inline keyword?
>> Maybe merge the two helpers into the later patches that use them, to avoid inline?
>In the past what I did in such situation consisted in introducing a
>declaration of the static function before its definition and when the
>actual user is introduced, in a subsequent patch, remove the spurious
>declaration.
>Now, reading
>https://www.reddit.com/r/cpp_questions/comments/15kfije/how_to_decide
>_if_a_function_should_be_inline_or/,
>maybe adding the inline here is not a problem given the compiler may or
>may not inline the function.
Thanks for the link, it refreshes my understanding of inline.
BRs,
Zhenzhong
Thread overview: 58+ messages
2025-06-20 7:17 [PATCH v2 00/19] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
2025-06-20 7:17 ` [PATCH v2 01/19] intel_iommu: Rename vtd_ce_get_rid2pasid_entry to vtd_ce_get_pasid_entry Zhenzhong Duan
2025-06-20 7:17 ` [PATCH v2 02/19] hw/pci: Introduce pci_device_get_viommu_cap() Zhenzhong Duan
2025-06-20 7:53 ` Eric Auger
2025-06-23 2:20 ` Duan, Zhenzhong
2025-06-23 9:30 ` Eric Auger
2025-06-20 7:17 ` [PATCH v2 03/19] intel_iommu: Implement get_viommu_cap() callback Zhenzhong Duan
2025-06-20 8:10 ` Eric Auger
2025-06-23 2:20 ` Duan, Zhenzhong
2025-06-20 7:17 ` [PATCH v2 04/19] vfio/iommufd: Force creating nested parent domain Zhenzhong Duan
2025-06-20 8:08 ` Eric Auger
2025-06-23 2:33 ` Duan, Zhenzhong
2025-06-20 7:17 ` [PATCH v2 05/19] hw/pci: Export pci_device_get_iommu_bus_devfn() and return bool Zhenzhong Duan
2025-06-20 11:59 ` Eric Auger
2025-06-23 2:47 ` Duan, Zhenzhong
2025-06-23 9:31 ` Eric Auger
2025-06-20 7:18 ` [PATCH v2 06/19] intel_iommu: Introduce a new structure VTDHostIOMMUDevice Zhenzhong Duan
2025-06-20 7:18 ` [PATCH v2 07/19] intel_iommu: Check for compatibility with IOMMUFD backed device when x-flts=on Zhenzhong Duan
2025-06-20 12:05 ` Eric Auger
2025-06-23 2:44 ` Duan, Zhenzhong
2025-06-20 7:18 ` [PATCH v2 08/19] intel_iommu: Fail passthrough device under PCI bridge if x-flts=on Zhenzhong Duan
2025-06-20 12:18 ` Eric Auger
2025-06-23 3:20 ` Duan, Zhenzhong
2025-06-20 7:18 ` [PATCH v2 09/19] intel_iommu: Introduce two helpers vtd_as_from/to_iommu_pasid_locked Zhenzhong Duan
2025-06-20 12:46 ` Eric Auger
2025-06-24 2:48 ` Duan, Zhenzhong
2025-07-07 3:12 ` Duan, Zhenzhong
2025-07-07 16:54 ` Eric Auger
2025-07-08 2:35 ` Duan, Zhenzhong
2025-06-20 7:18 ` [PATCH v2 10/19] intel_iommu: Handle PASID entry removing and updating Zhenzhong Duan
2025-06-20 15:44 ` Eric Auger
2025-06-24 3:34 ` Duan, Zhenzhong
2025-06-20 7:18 ` [PATCH v2 11/19] intel_iommu: Handle PASID entry adding Zhenzhong Duan
2025-06-23 11:47 ` Eric Auger
2025-06-24 10:56 ` Duan, Zhenzhong
2025-06-20 7:18 ` [PATCH v2 12/19] intel_iommu: Introduce a new pasid cache invalidation type FORCE_RESET Zhenzhong Duan
2025-06-23 11:55 ` Eric Auger
2025-06-26 8:28 ` Duan, Zhenzhong
2025-06-20 7:18 ` [PATCH v2 13/19] intel_iommu: Stick to system MR for IOMMUFD backed host device when x-fls=on Zhenzhong Duan
2025-06-23 12:02 ` Eric Auger
2025-06-26 8:37 ` Duan, Zhenzhong
2025-06-20 7:18 ` [PATCH v2 14/19] intel_iommu: Bind/unbind guest page table to host Zhenzhong Duan
2025-06-23 13:17 ` Eric Auger
2025-06-26 9:17 ` Duan, Zhenzhong
2025-06-20 7:18 ` [PATCH v2 15/19] intel_iommu: Replay pasid binds after context cache invalidation Zhenzhong Duan
2025-06-23 13:25 ` Eric Auger
2025-06-26 9:27 ` Duan, Zhenzhong
2025-06-20 7:18 ` [PATCH v2 16/19] intel_iommu: Propagate PASID-based iotlb invalidation to host Zhenzhong Duan
2025-06-23 13:41 ` Eric Auger
2025-06-26 9:42 ` Duan, Zhenzhong
2025-06-20 7:18 ` [PATCH v2 17/19] intel_iommu: Refresh pasid bind when either SRTP or TE bit is changed Zhenzhong Duan
2025-06-23 13:48 ` Eric Auger
2025-06-26 10:16 ` Duan, Zhenzhong
2025-06-20 7:18 ` [PATCH v2 18/19] Workaround for ERRATA_772415_SPR17 Zhenzhong Duan
2025-06-20 16:01 ` Eric Auger
2025-06-23 3:29 ` Duan, Zhenzhong
2025-06-23 9:33 ` Eric Auger
2025-06-20 7:18 ` [PATCH v2 19/19] intel_iommu: Enable host device when x-flts=on in scalable mode Zhenzhong Duan