* [PATCH v4 00/20] intel_iommu: Enable stage-1 translation for passthrough device
@ 2025-07-29 9:20 Zhenzhong Duan
2025-07-29 9:20 ` [PATCH v4 01/20] intel_iommu: Rename vtd_ce_get_rid2pasid_entry to vtd_ce_get_pasid_entry Zhenzhong Duan
` (20 more replies)
0 siblings, 21 replies; 41+ messages in thread
From: Zhenzhong Duan @ 2025-07-29 9:20 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan
Hi,
For a passthrough device with intel_iommu.x-flts=on, we don't shadow the
guest page table; instead the stage-1 page table is passed to the host
side to construct a nested domain. There was an earlier effort to enable
this feature, see [1] for details.
The key design is to utilize the dual-stage IOMMU translation (also known as
IOMMU nested translation) capability of the host IOMMU. As the diagram below
shows, the guest I/O page table pointer in GPA (guest physical address) is
passed to the host and used to perform the stage-1 address translation.
Along with it, modifications to present mappings in the guest I/O page table
must be followed by an IOTLB invalidation.
 .-------------.  .---------------------------.
 |   vIOMMU    |  | Guest I/O page table      |
 |             |  '---------------------------'
 .----------------/
 | PASID Entry |--- PASID cache flush --+
 '-------------'                        |
 |             |                        V
 |             |           I/O page table pointer in GPA
 '-------------'
Guest
------| Shadow |---------------------------|--------
      v        v                           v
Host
 .-------------.  .------------------------.
 |   pIOMMU    |  | Stage1 for GIOVA->GPA  |
 |             |  '------------------------'
 .----------------/  |
 | PASID Entry |     V (Nested xlate)
 '----------------\.--------------------------------------.
 |             |   | Stage2 for GPA->HPA, unmanaged domain|
 |             |   '--------------------------------------'
 '-------------'
For historical reasons, different VT-d spec revisions use different naming,
where:
- Stage1 = First stage = First level = flts
- Stage2 = Second stage = Second level = slts
<Intel VT-d Nested translation>
This series reuses the VFIO device's default hwpt as the nested parent
instead of creating a new one. This avoids duplicating code for a new memory
listener, and all existing features of the VFIO listener can be shared,
e.g., RAM discard, dirty tracking, etc. Two limitations are: 1) a VFIO device
under a PCI bridge together with an emulated device is not supported, because
the emulated device wants the IOMMU AS while the VFIO device sticks to the
system AS; 2) kexec or reboot from "intel_iommu=on,sm_on" to
"intel_iommu=on,sm_off" is not supported, because the VFIO device's default
hwpt is created with the NEST_PARENT flag and the kernel inhibits RO mappings
when switching to shadow mode.
This series is also prerequisite work for vSVA, i.e. sharing guest
application address space with passthrough devices.
There are some interactions between VFIO and vIOMMU (see the diagram and the
sketch below):
* vIOMMU registers PCIIOMMUOps [set|unset]_iommu_device with the PCI
subsystem. VFIO calls them to register/unregister a HostIOMMUDevice
instance with the vIOMMU at VFIO device realize time.
* vIOMMU registers PCIIOMMUOps get_viommu_cap with the PCI subsystem.
VFIO calls it to get the capabilities exposed by the vIOMMU.
* vIOMMU calls the HostIOMMUDeviceIOMMUFD interface [at|de]tach_hwpt
to bind/unbind the device to/from an IOMMUFD backed domain, either a
nested domain or not.
See below diagram:
    VFIO Device                                   Intel IOMMU
.-----------------.                          .-------------------.
|                 |                          |                   |
|       .---------|PCIIOMMUOps               |.-------------.    |
|       | IOMMUFD |(set/unset_iommu_device)  || Host IOMMU  |    |
|       | Device  |------------------------->|| Device list |    |
|       .---------|(get_viommu_cap)          |.-------------.    |
|                 |                          |       |           |
|                 |                          |       V           |
|       .---------|  HostIOMMUDeviceIOMMUFD  |  .-------------.  |
|       | IOMMUFD |             (attach_hwpt)|  | Host IOMMU  |  |
|       |  link   |<-------------------------|  |   Device    |  |
|       .---------|             (detach_hwpt)|  .-------------.  |
|                 |                          |         |         |
|                 |                          |        ...        |
.-----------------.                          .-------------------.
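Putting the pieces together, the attach flow looks roughly like the sketch
below. This is illustrative only: variable names (pdev, hiod, idev, flags,
hw_caps, nested_hwpt_id) are placeholders, error handling is omitted, and
the real code lives in the individual patches (PATCH2-4 for the VFIO side,
the "Bind/unbind guest page table to host" patch for the vIOMMU side):
    /* VFIO, when attaching the device */
    hw_caps = pci_device_get_viommu_cap(pdev);
    if (hw_caps & VIOMMU_CAP_HW_NESTED) {
        flags |= IOMMU_HWPT_ALLOC_NEST_PARENT;
    }
    /* ... allocate the default hwpt with @flags and attach the device ... */

    /* VFIO, at device realize time */
    pci_device_set_iommu_device(pdev, hiod, errp);

    /* vIOMMU, when the guest binds a stage-1 page table for the device */
    attach_hwpt(idev, nested_hwpt_id, errp);  /* HostIOMMUDeviceIOMMUFD op */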
Below is an example of enabling stage-1 translation for a passthrough device:
-M q35,...
-device intel-iommu,x-scalable-mode=on,x-flts=on...
-object iommufd,id=iommufd0 -device vfio-pci,iommufd=iommufd0,...
Test done:
- VFIO devices hotplug/unplug
- different VFIO devices linked to different iommufds
- vhost net device ping test
PATCH1-6: Some preparatory work
PATCH7-8: Compatibility check between vIOMMU and host IOMMU
PATCH9-17: Implement stage-1 page table for passthrough device
PATCH18-19: Workaround for ERRATA_772415_SPR17
PATCH20: Enable stage-1 translation for passthrough device
QEMU code can be found at [2].
Fault reporting isn't supported in this series; we presume the guest kernel
always constructs a correct stage-1 page table for the passthrough device.
For emulated devices, the emulation code already provides stage-1 fault
injection.
TODO:
- Fault reporting to the guest when HW stage-1 faults occur
[1] https://patchwork.kernel.org/project/kvm/cover/20210302203827.437645-1-yi.l.liu@intel.com/
[2] https://github.com/yiliu1765/qemu/tree/zhenzhong/iommufd_nesting.v4
Thanks
Zhenzhong
Changelog:
v4:
- s/VIOMMU_CAP_STAGE1/VIOMMU_CAP_HW_NESTED (Eric, Nicolin, Donald, Shameer)
- clarify get_viommu_cap() return pure emulated caps and explain reason in commit log (Eric)
- retrieve the ce only if vtd_as->pasid in vtd_as_to_iommu_pasid_locked (Eric)
- refine doc comment and commit log in patch10-11 (Eric)
v3:
- define enum type for VIOMMU_CAP_* (Eric)
- drop inline flag in the patch which uses the helper (Eric)
- use extract64 in new introduced MACRO (Eric)
- polish comments and fix typo error (Eric)
- split workaround patch for ERRATA_772415_SPR17 to two patches (Eric)
- optimize bind/unbind error path processing
v2:
- introduce get_viommu_cap() to get STAGE1 flag to create nested parent hwpt (Liuyi)
- reuse VFIO's default hwpt as parent hwpt of nested translation (Nicolin, Liuyi)
- abandon support of VFIO device under pcie-to-pci bridge to simplify design (Liuyi)
- bypass RO mapping in VFIO's default hwpt if ERRATA_772415_SPR17 (Liuyi)
- drop vtd_dev_to_context_entry optimization (Liuyi)
v1:
- simplify vendor specific checking in vtd_check_hiod (Cedric, Nicolin)
- rebase to master
rfcv3:
- s/hwpt_id/id in iommufd_backend_invalidate_cache()'s parameter (Shameer)
- hide vtd vendor specific caps in a wrapper union (Eric, Nicolin)
- simplify return value check of get_cap() (Eric)
- drop realize_late (Cedric, Eric)
- split patch13:intel_iommu: Add PASID cache management infrastructure (Eric)
- s/vtd_pasid_cache_reset/vtd_pasid_cache_reset_locked (Eric)
- s/vtd_pe_get_domain_id/vtd_pe_get_did (Eric)
- refine comments (Eric, Donald)
rfcv2:
- Drop VTDPASIDAddressSpace and use VTDAddressSpace (Eric, Liuyi)
- Move HWPT uAPI patches ahead(patch1-8) so arm nesting could easily rebase
- add two cleanup patches(patch9-10)
- VFIO passes iommufd/devid/hwpt_id to vIOMMU instead of iommufd/devid/ioas_id
- add vtd_as_[from|to]_iommu_pasid() helper to translate between vtd_as and
iommu pasid, this is important for dropping VTDPASIDAddressSpace
Yi Liu (3):
intel_iommu: Replay pasid bindings after context cache invalidation
intel_iommu: Propagate PASID-based iotlb invalidation to host
intel_iommu: Replay all pasid bindings when either SRTP or TE bit is
changed
Zhenzhong Duan (17):
intel_iommu: Rename vtd_ce_get_rid2pasid_entry to
vtd_ce_get_pasid_entry
hw/pci: Introduce pci_device_get_viommu_cap()
intel_iommu: Implement get_viommu_cap() callback
vfio/iommufd: Force creating nested parent domain
hw/pci: Export pci_device_get_iommu_bus_devfn() and return bool
intel_iommu: Introduce a new structure VTDHostIOMMUDevice
intel_iommu: Check for compatibility with IOMMUFD backed device when
x-flts=on
intel_iommu: Fail passthrough device under PCI bridge if x-flts=on
intel_iommu: Introduce two helpers vtd_as_from/to_iommu_pasid_locked
intel_iommu: Handle PASID entry removal and update
intel_iommu: Handle PASID entry addition
intel_iommu: Introduce a new pasid cache invalidation type FORCE_RESET
intel_iommu: Stick to system MR for IOMMUFD backed host device when
x-flts=on
intel_iommu: Bind/unbind guest page table to host
vfio: Add a new element bypass_ro in VFIOContainerBase
Workaround for ERRATA_772415_SPR17
intel_iommu: Enable host device when x-flts=on in scalable mode
MAINTAINERS | 1 +
hw/i386/intel_iommu_internal.h | 68 +-
include/hw/i386/intel_iommu.h | 9 +-
include/hw/iommu.h | 17 +
include/hw/pci/pci.h | 27 +
include/hw/vfio/vfio-container-base.h | 1 +
hw/i386/intel_iommu.c | 941 +++++++++++++++++++++++++-
hw/pci/pci.c | 23 +-
hw/vfio/iommufd.c | 22 +-
hw/vfio/listener.c | 13 +-
hw/i386/trace-events | 8 +
11 files changed, 1088 insertions(+), 42 deletions(-)
create mode 100644 include/hw/iommu.h
base-commit: 92c05be4dfb59a71033d4c57dac944b29f7dabf0
--
2.47.1
^ permalink raw reply [flat|nested] 41+ messages in thread
* [PATCH v4 01/20] intel_iommu: Rename vtd_ce_get_rid2pasid_entry to vtd_ce_get_pasid_entry
2025-07-29 9:20 [PATCH v4 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
@ 2025-07-29 9:20 ` Zhenzhong Duan
2025-07-29 9:20 ` [PATCH v4 02/20] hw/pci: Introduce pci_device_get_viommu_cap() Zhenzhong Duan
` (19 subsequent siblings)
20 siblings, 0 replies; 41+ messages in thread
From: Zhenzhong Duan @ 2025-07-29 9:20 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan
Originally, vtd_ce_get_rid2pasid_entry() was used to get the pasid entry
of rid2pasid; later it was extended to get any pasid entry. So the new name
vtd_ce_get_pasid_entry better matches what it actually does.
No functional change intended.
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Clément Mathieu--Drif<clement.mathieu--drif@eviden.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
---
hw/i386/intel_iommu.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index fe9a5f2872..c97953d7a1 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -944,7 +944,7 @@ static int vtd_get_pe_from_pasid_table(IntelIOMMUState *s,
return 0;
}
-static int vtd_ce_get_rid2pasid_entry(IntelIOMMUState *s,
+static int vtd_ce_get_pasid_entry(IntelIOMMUState *s,
VTDContextEntry *ce,
VTDPASIDEntry *pe,
uint32_t pasid)
@@ -1025,7 +1025,7 @@ static uint32_t vtd_get_iova_level(IntelIOMMUState *s,
VTDPASIDEntry pe;
if (s->root_scalable) {
- vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
+ vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
if (s->flts) {
return VTD_PE_GET_FL_LEVEL(&pe);
} else {
@@ -1048,7 +1048,7 @@ static uint32_t vtd_get_iova_agaw(IntelIOMMUState *s,
VTDPASIDEntry pe;
if (s->root_scalable) {
- vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
+ vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
return 30 + ((pe.val[0] >> 2) & VTD_SM_PASID_ENTRY_AW) * 9;
}
@@ -1116,7 +1116,7 @@ static dma_addr_t vtd_get_iova_pgtbl_base(IntelIOMMUState *s,
VTDPASIDEntry pe;
if (s->root_scalable) {
- vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
+ vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
if (s->flts) {
return pe.val[2] & VTD_SM_PASID_ENTRY_FLPTPTR;
} else {
@@ -1522,7 +1522,7 @@ static int vtd_ce_rid2pasid_check(IntelIOMMUState *s,
* has valid rid2pasid setting, which includes valid
* rid2pasid field and corresponding pasid entry setting
*/
- return vtd_ce_get_rid2pasid_entry(s, ce, &pe, PCI_NO_PASID);
+ return vtd_ce_get_pasid_entry(s, ce, &pe, PCI_NO_PASID);
}
/* Map a device to its corresponding domain (context-entry) */
@@ -1611,7 +1611,7 @@ static uint16_t vtd_get_domain_id(IntelIOMMUState *s,
VTDPASIDEntry pe;
if (s->root_scalable) {
- vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
+ vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
return VTD_SM_PASID_ENTRY_DID(pe.val[1]);
}
@@ -1687,7 +1687,7 @@ static bool vtd_dev_pt_enabled(IntelIOMMUState *s, VTDContextEntry *ce,
int ret;
if (s->root_scalable) {
- ret = vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
+ ret = vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
if (ret) {
/*
* This error is guest triggerable. We should assumt PT
--
2.47.1
^ permalink raw reply related [flat|nested] 41+ messages in thread
* [PATCH v4 02/20] hw/pci: Introduce pci_device_get_viommu_cap()
2025-07-29 9:20 [PATCH v4 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
2025-07-29 9:20 ` [PATCH v4 01/20] intel_iommu: Rename vtd_ce_get_rid2pasid_entry to vtd_ce_get_pasid_entry Zhenzhong Duan
@ 2025-07-29 9:20 ` Zhenzhong Duan
2025-07-29 13:19 ` Cédric Le Goater
2025-08-13 6:43 ` Jim Shu
2025-07-29 9:20 ` [PATCH v4 03/20] intel_iommu: Implement get_viommu_cap() callback Zhenzhong Duan
` (18 subsequent siblings)
20 siblings, 2 replies; 41+ messages in thread
From: Zhenzhong Duan @ 2025-07-29 9:20 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan
Introduce a new optional PCIIOMMUOps callback, get_viommu_cap(), which
allows retrieving the capabilities exposed by a vIOMMU. The first planned
capability is VIOMMU_CAP_HW_NESTED, which advertises support for the HW
nested stage translation scheme. pci_device_get_viommu_cap() is a wrapper
that can be called on a PCI device potentially protected by a vIOMMU.
get_viommu_cap() is designed to return a 64-bit bitmap of purely emulated
capabilities which are only determined by the user's configuration, with no
host capabilities involved. The reasons are:
1. there can be more than one host IOMMU, each with different capabilities;
2. there can also be more than one vIOMMU with different user
configurations, e.g., arm smmuv3;
3. it is migration friendly, the return value is consistent between source
and target;
4. it's too late for VFIO to call get_viommu_cap() after set_iommu_device(),
because get_viommu_cap() is needed to decide whether to create a nested
parent hwpt at attach time, while hiod realize needs iommufd, devid and
hwpt_id, which are only ready after attach_device().
See below sequence:
  attach_device()
      get_viommu_cap()
      create hwpt
      ...
  create hiod
  set_iommu_device(hiod)
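To illustrate the "purely emulated capabilities" point above (a sketch only;
the actual intel_iommu implementation comes in a later patch of this series),
a vIOMMU callback derives the bitmap from the user's configuration alone:
    static uint64_t vtd_get_viommu_cap(void *opaque)
    {
        IntelIOMMUState *s = opaque;

        /* Only the user's x-flts setting is consulted, no host caps */
        return s->flts ? VIOMMU_CAP_HW_NESTED : 0;
    }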
Suggested-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
MAINTAINERS | 1 +
include/hw/iommu.h | 17 +++++++++++++++++
include/hw/pci/pci.h | 25 +++++++++++++++++++++++++
hw/pci/pci.c | 11 +++++++++++
4 files changed, 54 insertions(+)
create mode 100644 include/hw/iommu.h
diff --git a/MAINTAINERS b/MAINTAINERS
index 37879ab64e..840cb1e604 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2304,6 +2304,7 @@ F: include/system/iommufd.h
F: backends/host_iommu_device.c
F: include/system/host_iommu_device.h
F: include/qemu/chardev_open.h
+F: include/hw/iommu.h
F: util/chardev_open.c
F: docs/devel/vfio-iommufd.rst
diff --git a/include/hw/iommu.h b/include/hw/iommu.h
new file mode 100644
index 0000000000..021db50db5
--- /dev/null
+++ b/include/hw/iommu.h
@@ -0,0 +1,17 @@
+/*
+ * General vIOMMU capabilities, flags, etc
+ *
+ * Copyright (C) 2025 Intel Corporation.
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#ifndef HW_IOMMU_H
+#define HW_IOMMU_H
+
+enum {
+ /* hardware nested stage-1 page table support */
+ VIOMMU_CAP_HW_NESTED = BIT_ULL(0),
+};
+
+#endif /* HW_IOMMU_H */
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index 6b7d3ac8a3..d89aefc030 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -462,6 +462,21 @@ typedef struct PCIIOMMUOps {
* @devfn: device and function number of the PCI device.
*/
void (*unset_iommu_device)(PCIBus *bus, void *opaque, int devfn);
+ /**
+ * @get_viommu_cap: get vIOMMU capabilities
+ *
+ * Optional callback, if not implemented, then vIOMMU doesn't
+ * support exposing capabilities to other subsystem, e.g., VFIO.
+ * vIOMMU can choose which capabilities to expose.
+ *
+ * @opaque: the data passed to pci_setup_iommu().
+ *
+ * Returns: a 64bit bitmap with each bit representing a capability emulated
+ * by the vIOMMU, see VIOMMU_CAP_* in include/hw/iommu.h. These capabilities
+ * are theoretical, i.e. only determined by the user's configuration and
+ * independent of the actual host capabilities they may depend on.
+ */
+ uint64_t (*get_viommu_cap)(void *opaque);
/**
* @get_iotlb_info: get properties required to initialize a device IOTLB.
*
@@ -642,6 +657,16 @@ bool pci_device_set_iommu_device(PCIDevice *dev, HostIOMMUDevice *hiod,
Error **errp);
void pci_device_unset_iommu_device(PCIDevice *dev);
+/**
+ * pci_device_get_viommu_cap: get vIOMMU capabilities.
+ *
+ * Returns a 64bit bitmap with each bit representing a vIOMMU exposed
+ * capability, 0 if the vIOMMU doesn't support exposing capabilities.
+ *
+ * @dev: PCI device pointer.
+ */
+uint64_t pci_device_get_viommu_cap(PCIDevice *dev);
+
/**
* pci_iommu_get_iotlb_info: get properties required to initialize a
* device IOTLB.
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index c70b5ceeba..df1fb615a8 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -2992,6 +2992,17 @@ void pci_device_unset_iommu_device(PCIDevice *dev)
}
}
+uint64_t pci_device_get_viommu_cap(PCIDevice *dev)
+{
+ PCIBus *iommu_bus;
+
+ pci_device_get_iommu_bus_devfn(dev, &iommu_bus, NULL, NULL);
+ if (iommu_bus && iommu_bus->iommu_ops->get_viommu_cap) {
+ return iommu_bus->iommu_ops->get_viommu_cap(iommu_bus->iommu_opaque);
+ }
+ return 0;
+}
+
int pci_pri_request_page(PCIDevice *dev, uint32_t pasid, bool priv_req,
bool exec_req, hwaddr addr, bool lpig,
uint16_t prgi, bool is_read, bool is_write)
--
2.47.1
^ permalink raw reply related [flat|nested] 41+ messages in thread
* [PATCH v4 03/20] intel_iommu: Implement get_viommu_cap() callback
2025-07-29 9:20 [PATCH v4 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
2025-07-29 9:20 ` [PATCH v4 01/20] intel_iommu: Rename vtd_ce_get_rid2pasid_entry to vtd_ce_get_pasid_entry Zhenzhong Duan
2025-07-29 9:20 ` [PATCH v4 02/20] hw/pci: Introduce pci_device_get_viommu_cap() Zhenzhong Duan
@ 2025-07-29 9:20 ` Zhenzhong Duan
2025-07-29 9:20 ` [PATCH v4 04/20] vfio/iommufd: Force creating nested parent domain Zhenzhong Duan
` (17 subsequent siblings)
20 siblings, 0 replies; 41+ messages in thread
From: Zhenzhong Duan @ 2025-07-29 9:20 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan
Implement the get_viommu_cap() callback and expose the stage-1 capability
for now. VFIO uses it to create a nested parent domain, which is further
used by the vIOMMU to create a nested domain. All of this will be
implemented in the following patches.
Suggested-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
---
hw/i386/intel_iommu.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index c97953d7a1..178c15e31c 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -24,6 +24,7 @@
#include "qemu/main-loop.h"
#include "qapi/error.h"
#include "hw/sysbus.h"
+#include "hw/iommu.h"
#include "intel_iommu_internal.h"
#include "hw/pci/pci.h"
#include "hw/pci/pci_bus.h"
@@ -4420,6 +4421,16 @@ static void vtd_dev_unset_iommu_device(PCIBus *bus, void *opaque, int devfn)
vtd_iommu_unlock(s);
}
+static uint64_t vtd_get_viommu_cap(void *opaque)
+{
+ IntelIOMMUState *s = opaque;
+ uint64_t caps;
+
+ caps = s->flts ? VIOMMU_CAP_HW_NESTED : 0;
+
+ return caps;
+}
+
/* Unmap the whole range in the notifier's scope. */
static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n)
{
@@ -4850,6 +4861,7 @@ static PCIIOMMUOps vtd_iommu_ops = {
.register_iotlb_notifier = vtd_register_iotlb_notifier,
.unregister_iotlb_notifier = vtd_unregister_iotlb_notifier,
.ats_request_translation = vtd_ats_request_translation,
+ .get_viommu_cap = vtd_get_viommu_cap,
};
static bool vtd_decide_config(IntelIOMMUState *s, Error **errp)
--
2.47.1
^ permalink raw reply related [flat|nested] 41+ messages in thread
* [PATCH v4 04/20] vfio/iommufd: Force creating nested parent domain
2025-07-29 9:20 [PATCH v4 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
` (2 preceding siblings ...)
2025-07-29 9:20 ` [PATCH v4 03/20] intel_iommu: Implement get_viommu_cap() callback Zhenzhong Duan
@ 2025-07-29 9:20 ` Zhenzhong Duan
2025-07-29 13:44 ` Cédric Le Goater
2025-07-29 9:20 ` [PATCH v4 05/20] hw/pci: Export pci_device_get_iommu_bus_devfn() and return bool Zhenzhong Duan
` (16 subsequent siblings)
20 siblings, 1 reply; 41+ messages in thread
From: Zhenzhong Duan @ 2025-07-29 9:20 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan
Call pci_device_get_viommu_cap() to check whether the vIOMMU supports
VIOMMU_CAP_HW_NESTED; if so, create a nested parent domain which can be
reused by the vIOMMU to create a nested domain.
This is safe because hw_caps & VIOMMU_CAP_HW_NESTED cannot be set yet, as
passthrough devices with x-flts=on are still forbidden until the last patch
of this series adds the support.
Suggested-by: Nicolin Chen <nicolinc@nvidia.com>
Suggested-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
---
hw/vfio/iommufd.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index 48c590b6a9..61a548f13f 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -20,6 +20,7 @@
#include "trace.h"
#include "qapi/error.h"
#include "system/iommufd.h"
+#include "hw/iommu.h"
#include "hw/qdev-core.h"
#include "hw/vfio/vfio-cpr.h"
#include "system/reset.h"
@@ -379,6 +380,19 @@ static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
flags = IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
}
+ /*
+ * If vIOMMU supports stage-1 translation, force to create nested parent
+ * domain which could be reused by vIOMMU to create nested domain.
+ */
+ if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
+ VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
+
+ hw_caps = pci_device_get_viommu_cap(&vdev->pdev);
+ if (hw_caps & VIOMMU_CAP_HW_NESTED) {
+ flags |= IOMMU_HWPT_ALLOC_NEST_PARENT;
+ }
+ }
+
if (cpr_is_incoming()) {
hwpt_id = vbasedev->cpr.hwpt_id;
goto skip_alloc;
--
2.47.1
^ permalink raw reply related [flat|nested] 41+ messages in thread
* [PATCH v4 05/20] hw/pci: Export pci_device_get_iommu_bus_devfn() and return bool
2025-07-29 9:20 [PATCH v4 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
` (3 preceding siblings ...)
2025-07-29 9:20 ` [PATCH v4 04/20] vfio/iommufd: Force creating nested parent domain Zhenzhong Duan
@ 2025-07-29 9:20 ` Zhenzhong Duan
2025-07-29 9:20 ` [PATCH v4 06/20] intel_iommu: Introduce a new structure VTDHostIOMMUDevice Zhenzhong Duan
` (15 subsequent siblings)
20 siblings, 0 replies; 41+ messages in thread
From: Zhenzhong Duan @ 2025-07-29 9:20 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan
Return true if the PCI device's RID is aliased, false otherwise. This will
be used in a following patch to determine whether a PCI device is under a
PCI bridge.
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
---
include/hw/pci/pci.h | 2 ++
hw/pci/pci.c | 12 ++++++++----
2 files changed, 10 insertions(+), 4 deletions(-)
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index d89aefc030..f2df5c85d5 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -652,6 +652,8 @@ typedef struct PCIIOMMUOps {
bool is_write);
} PCIIOMMUOps;
+bool pci_device_get_iommu_bus_devfn(PCIDevice *dev, PCIBus **piommu_bus,
+ PCIBus **aliased_bus, int *aliased_devfn);
AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
bool pci_device_set_iommu_device(PCIDevice *dev, HostIOMMUDevice *hiod,
Error **errp);
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index df1fb615a8..151c27088b 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -2857,20 +2857,21 @@ static void pci_device_class_base_init(ObjectClass *klass, const void *data)
* For call sites which don't need aliased BDF, passing NULL to
* aliased_[bus|devfn] is allowed.
*
+ * Returns true if PCI device RID is aliased or false otherwise.
+ *
* @piommu_bus: return root #PCIBus backed by an IOMMU for the PCI device.
*
* @aliased_bus: return aliased #PCIBus of the PCI device, optional.
*
* @aliased_devfn: return aliased devfn of the PCI device, optional.
*/
-static void pci_device_get_iommu_bus_devfn(PCIDevice *dev,
- PCIBus **piommu_bus,
- PCIBus **aliased_bus,
- int *aliased_devfn)
+bool pci_device_get_iommu_bus_devfn(PCIDevice *dev, PCIBus **piommu_bus,
+ PCIBus **aliased_bus, int *aliased_devfn)
{
PCIBus *bus = pci_get_bus(dev);
PCIBus *iommu_bus = bus;
int devfn = dev->devfn;
+ bool aliased = false;
while (iommu_bus && !iommu_bus->iommu_ops && iommu_bus->parent_dev) {
PCIBus *parent_bus = pci_get_bus(iommu_bus->parent_dev);
@@ -2907,6 +2908,7 @@ static void pci_device_get_iommu_bus_devfn(PCIDevice *dev,
devfn = parent->devfn;
bus = parent_bus;
}
+ aliased = true;
}
iommu_bus = parent_bus;
@@ -2928,6 +2930,8 @@ static void pci_device_get_iommu_bus_devfn(PCIDevice *dev,
if (aliased_devfn) {
*aliased_devfn = devfn;
}
+
+ return aliased;
}
AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
--
2.47.1
^ permalink raw reply related [flat|nested] 41+ messages in thread
* [PATCH v4 06/20] intel_iommu: Introduce a new structure VTDHostIOMMUDevice
2025-07-29 9:20 [PATCH v4 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
` (4 preceding siblings ...)
2025-07-29 9:20 ` [PATCH v4 05/20] hw/pci: Export pci_device_get_iommu_bus_devfn() and return bool Zhenzhong Duan
@ 2025-07-29 9:20 ` Zhenzhong Duan
2025-07-29 9:20 ` [PATCH v4 07/20] intel_iommu: Check for compatibility with IOMMUFD backed device when x-flts=on Zhenzhong Duan
` (14 subsequent siblings)
20 siblings, 0 replies; 41+ messages in thread
From: Zhenzhong Duan @ 2025-07-29 9:20 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan
Introduce a new structure, VTDHostIOMMUDevice, which replaces
HostIOMMUDevice as the element stored in the hash table.
It includes references to the HostIOMMUDevice and the IntelIOMMUState, and
also the BDF information which will be used in future patches.
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
---
hw/i386/intel_iommu_internal.h | 7 +++++++
include/hw/i386/intel_iommu.h | 2 +-
hw/i386/intel_iommu.c | 15 +++++++++++++--
3 files changed, 21 insertions(+), 3 deletions(-)
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 360e937989..c7046eb4e2 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -28,6 +28,7 @@
#ifndef HW_I386_INTEL_IOMMU_INTERNAL_H
#define HW_I386_INTEL_IOMMU_INTERNAL_H
#include "hw/i386/intel_iommu.h"
+#include "system/host_iommu_device.h"
/*
* Intel IOMMU register specification
@@ -608,4 +609,10 @@ typedef struct VTDRootEntry VTDRootEntry;
/* Bits to decide the offset for each level */
#define VTD_LEVEL_BITS 9
+typedef struct VTDHostIOMMUDevice {
+ IntelIOMMUState *iommu_state;
+ PCIBus *bus;
+ uint8_t devfn;
+ HostIOMMUDevice *hiod;
+} VTDHostIOMMUDevice;
#endif
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index e95477e855..50f9b27a45 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -295,7 +295,7 @@ struct IntelIOMMUState {
/* list of registered notifiers */
QLIST_HEAD(, VTDAddressSpace) vtd_as_with_notifiers;
- GHashTable *vtd_host_iommu_dev; /* HostIOMMUDevice */
+ GHashTable *vtd_host_iommu_dev; /* VTDHostIOMMUDevice */
/* interrupt remapping */
bool intr_enabled; /* Whether guest enabled IR */
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 178c15e31c..870b293c14 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -281,7 +281,10 @@ static gboolean vtd_hiod_equal(gconstpointer v1, gconstpointer v2)
static void vtd_hiod_destroy(gpointer v)
{
- object_unref(v);
+ VTDHostIOMMUDevice *vtd_hiod = v;
+
+ object_unref(vtd_hiod->hiod);
+ g_free(vtd_hiod);
}
static gboolean vtd_hash_remove_by_domain(gpointer key, gpointer value,
@@ -4368,6 +4371,7 @@ static bool vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
HostIOMMUDevice *hiod, Error **errp)
{
IntelIOMMUState *s = opaque;
+ VTDHostIOMMUDevice *vtd_hiod;
struct vtd_as_key key = {
.bus = bus,
.devfn = devfn,
@@ -4384,7 +4388,14 @@ static bool vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
return false;
}
+ vtd_hiod = g_malloc0(sizeof(VTDHostIOMMUDevice));
+ vtd_hiod->bus = bus;
+ vtd_hiod->devfn = (uint8_t)devfn;
+ vtd_hiod->iommu_state = s;
+ vtd_hiod->hiod = hiod;
+
if (!vtd_check_hiod(s, hiod, errp)) {
+ g_free(vtd_hiod);
vtd_iommu_unlock(s);
return false;
}
@@ -4394,7 +4405,7 @@ static bool vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
new_key->devfn = devfn;
object_ref(hiod);
- g_hash_table_insert(s->vtd_host_iommu_dev, new_key, hiod);
+ g_hash_table_insert(s->vtd_host_iommu_dev, new_key, vtd_hiod);
vtd_iommu_unlock(s);
--
2.47.1
^ permalink raw reply related [flat|nested] 41+ messages in thread
* [PATCH v4 07/20] intel_iommu: Check for compatibility with IOMMUFD backed device when x-flts=on
2025-07-29 9:20 [PATCH v4 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
` (5 preceding siblings ...)
2025-07-29 9:20 ` [PATCH v4 06/20] intel_iommu: Introduce a new structure VTDHostIOMMUDevice Zhenzhong Duan
@ 2025-07-29 9:20 ` Zhenzhong Duan
2025-07-29 9:20 ` [PATCH v4 08/20] intel_iommu: Fail passthrough device under PCI bridge if x-flts=on Zhenzhong Duan
` (13 subsequent siblings)
20 siblings, 0 replies; 41+ messages in thread
From: Zhenzhong Duan @ 2025-07-29 9:20 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan
When the vIOMMU is configured with x-flts=on in scalable mode, the stage-1
page table is passed to the host to construct a nested page table. We need
to check the compatibility of some critical IOMMU capabilities between the
vIOMMU and the host IOMMU to ensure the guest stage-1 page table can be
used by the host.
For instance, if the vIOMMU supports stage-1 1GB huge page mapping but the
host does not, then this IOMMUFD backed device should fail.
Even if the checks pass, for now we willingly reject the association
because all the bits are not there yet.
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
hw/i386/intel_iommu_internal.h | 1 +
hw/i386/intel_iommu.c | 30 +++++++++++++++++++++++++++++-
2 files changed, 30 insertions(+), 1 deletion(-)
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index c7046eb4e2..f7510861d1 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -192,6 +192,7 @@
#define VTD_ECAP_PT (1ULL << 6)
#define VTD_ECAP_SC (1ULL << 7)
#define VTD_ECAP_MHMV (15ULL << 20)
+#define VTD_ECAP_NEST (1ULL << 26)
#define VTD_ECAP_SRS (1ULL << 31)
#define VTD_ECAP_PSS (7ULL << 35) /* limit: MemTxAttrs::pid */
#define VTD_ECAP_PASID (1ULL << 40)
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 870b293c14..33e498b3bd 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -40,6 +40,7 @@
#include "kvm/kvm_i386.h"
#include "migration/vmstate.h"
#include "trace.h"
+#include "system/iommufd.h"
/* context entry operations */
#define VTD_CE_GET_RID2PASID(ce) \
@@ -4363,7 +4364,34 @@ static bool vtd_check_hiod(IntelIOMMUState *s, HostIOMMUDevice *hiod,
return true;
}
- error_setg(errp, "host device is uncompatible with stage-1 translation");
+#ifdef CONFIG_IOMMUFD
+ struct HostIOMMUDeviceCaps *caps = &hiod->caps;
+ struct iommu_hw_info_vtd *vtd = &caps->vendor_caps.vtd;
+
+ /* Remaining checks are all stage-1 translation specific */
+ if (!object_dynamic_cast(OBJECT(hiod), TYPE_HOST_IOMMU_DEVICE_IOMMUFD)) {
+ error_setg(errp, "Need IOMMUFD backend when x-flts=on");
+ return false;
+ }
+
+ if (caps->type != IOMMU_HW_INFO_TYPE_INTEL_VTD) {
+ error_setg(errp, "Incompatible host platform IOMMU type %d",
+ caps->type);
+ return false;
+ }
+
+ if (!(vtd->ecap_reg & VTD_ECAP_NEST)) {
+ error_setg(errp, "Host IOMMU doesn't support nested translation");
+ return false;
+ }
+
+ if (s->fs1gp && !(vtd->cap_reg & VTD_CAP_FS1GP)) {
+ error_setg(errp, "Stage-1 1GB huge page is unsupported by host IOMMU");
+ return false;
+ }
+#endif
+
+ error_setg(errp, "host IOMMU is incompatible with stage-1 translation");
return false;
}
--
2.47.1
^ permalink raw reply related [flat|nested] 41+ messages in thread
* [PATCH v4 08/20] intel_iommu: Fail passthrough device under PCI bridge if x-flts=on
2025-07-29 9:20 [PATCH v4 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
` (6 preceding siblings ...)
2025-07-29 9:20 ` [PATCH v4 07/20] intel_iommu: Check for compatibility with IOMMUFD backed device when x-flts=on Zhenzhong Duan
@ 2025-07-29 9:20 ` Zhenzhong Duan
2025-07-29 9:20 ` [PATCH v4 09/20] intel_iommu: Introduce two helpers vtd_as_from/to_iommu_pasid_locked Zhenzhong Duan
` (12 subsequent siblings)
20 siblings, 0 replies; 41+ messages in thread
From: Zhenzhong Duan @ 2025-07-29 9:20 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan
Currently we don't support nested translation for a passthrough device that
shares a PCI bridge with an emulated device, because the two require
different address spaces when x-flts=on.
In theory, we could support it if the devices under the same PCI bridge are
all passthrough devices, but an emulated device can be hotplugged under the
same bridge. To simplify, just forbid passthrough devices under a PCI
bridge, regardless of whether there are, or will be, emulated devices under
the same bridge. This is acceptable because PCIe bridges are more common
than PCI bridges nowadays.
Suggested-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
---
hw/i386/intel_iommu.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 33e498b3bd..1410ddd65e 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -4338,9 +4338,10 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus,
return vtd_dev_as;
}
-static bool vtd_check_hiod(IntelIOMMUState *s, HostIOMMUDevice *hiod,
+static bool vtd_check_hiod(IntelIOMMUState *s, VTDHostIOMMUDevice *vtd_hiod,
Error **errp)
{
+ HostIOMMUDevice *hiod = vtd_hiod->hiod;
HostIOMMUDeviceClass *hiodc = HOST_IOMMU_DEVICE_GET_CLASS(hiod);
int ret;
@@ -4367,6 +4368,8 @@ static bool vtd_check_hiod(IntelIOMMUState *s, HostIOMMUDevice *hiod,
#ifdef CONFIG_IOMMUFD
struct HostIOMMUDeviceCaps *caps = &hiod->caps;
struct iommu_hw_info_vtd *vtd = &caps->vendor_caps.vtd;
+ PCIBus *bus = vtd_hiod->bus;
+ PCIDevice *pdev = pci_find_device(bus, pci_bus_num(bus), vtd_hiod->devfn);
/* Remaining checks are all stage-1 translation specific */
if (!object_dynamic_cast(OBJECT(hiod), TYPE_HOST_IOMMU_DEVICE_IOMMUFD)) {
@@ -4389,6 +4392,12 @@ static bool vtd_check_hiod(IntelIOMMUState *s, HostIOMMUDevice *hiod,
error_setg(errp, "Stage-1 1GB huge page is unsupported by host IOMMU");
return false;
}
+
+ if (pci_device_get_iommu_bus_devfn(pdev, &bus, NULL, NULL)) {
+ error_setg(errp, "Host device under PCI bridge is unsupported "
+ "when x-flts=on");
+ return false;
+ }
#endif
error_setg(errp, "host IOMMU is incompatible with stage-1 translation");
@@ -4422,7 +4431,7 @@ static bool vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
vtd_hiod->iommu_state = s;
vtd_hiod->hiod = hiod;
- if (!vtd_check_hiod(s, hiod, errp)) {
+ if (!vtd_check_hiod(s, vtd_hiod, errp)) {
g_free(vtd_hiod);
vtd_iommu_unlock(s);
return false;
--
2.47.1
^ permalink raw reply related [flat|nested] 41+ messages in thread
* [PATCH v4 09/20] intel_iommu: Introduce two helpers vtd_as_from/to_iommu_pasid_locked
2025-07-29 9:20 [PATCH v4 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
` (7 preceding siblings ...)
2025-07-29 9:20 ` [PATCH v4 08/20] intel_iommu: Fail passthrough device under PCI bridge if x-flts=on Zhenzhong Duan
@ 2025-07-29 9:20 ` Zhenzhong Duan
2025-07-29 9:20 ` [PATCH v4 10/20] intel_iommu: Handle PASID entry removal and update Zhenzhong Duan
` (11 subsequent siblings)
20 siblings, 0 replies; 41+ messages in thread
From: Zhenzhong Duan @ 2025-07-29 9:20 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan
A PCI device supports two request types, Requests-without-PASID and
Requests-with-PASID. A Request-without-PASID doesn't include a PASID TLP
prefix; the IOMMU fetches rid_pasid from the context entry and uses it as
the IOMMU's pasid to index the pasid table.
So we need to translate between PCI's pasid and the IOMMU's pasid,
specifically for Requests-without-PASID, e.g., PCI_NO_PASID(-1) <-> rid_pasid.
For Requests-with-PASID, PCI's pasid and the IOMMU's pasid have the same
value.
vtd_as_from_iommu_pasid_locked() translates from BDF+iommu_pasid to the
vtd_as which holds PCI's pasid in vtd_as->pasid.
vtd_as_to_iommu_pasid_locked() translates from BDF+vtd_as->pasid to
iommu_pasid.
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
---
hw/i386/intel_iommu.c | 58 +++++++++++++++++++++++++++++++++++++++++++
1 file changed, 58 insertions(+)
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 1410ddd65e..eb0e799e4a 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -1602,6 +1602,64 @@ static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
return 0;
}
+static int vtd_as_to_iommu_pasid_locked(VTDAddressSpace *vtd_as,
+ uint32_t *pasid)
+{
+ VTDContextCacheEntry *cc_entry = &vtd_as->context_cache_entry;
+ IntelIOMMUState *s = vtd_as->iommu_state;
+ uint8_t bus_num = pci_bus_num(vtd_as->bus);
+ uint8_t devfn = vtd_as->devfn;
+ VTDContextEntry ce;
+ int ret;
+
+ /* For Requests-with-PASID, its pasid value is used by vIOMMU directly */
+ if (vtd_as->pasid != PCI_NO_PASID) {
+ *pasid = vtd_as->pasid;
+ return 0;
+ }
+
+ if (cc_entry->context_cache_gen == s->context_cache_gen) {
+ ce = cc_entry->context_entry;
+ } else {
+ ret = vtd_dev_to_context_entry(s, bus_num, devfn, &ce);
+ if (ret) {
+ return ret;
+ }
+ }
+
+ *pasid = VTD_CE_GET_RID2PASID(&ce);
+ return 0;
+}
+
+static gboolean vtd_find_as_by_sid_and_iommu_pasid(gpointer key, gpointer value,
+ gpointer user_data)
+{
+ VTDAddressSpace *vtd_as = (VTDAddressSpace *)value;
+ struct vtd_as_raw_key *target = (struct vtd_as_raw_key *)user_data;
+ uint16_t sid = PCI_BUILD_BDF(pci_bus_num(vtd_as->bus), vtd_as->devfn);
+ uint32_t pasid;
+
+ if (vtd_as_to_iommu_pasid_locked(vtd_as, &pasid)) {
+ return false;
+ }
+
+ return (pasid == target->pasid) && (sid == target->sid);
+}
+
+/* Translate iommu pasid to vtd_as */
+static inline
+VTDAddressSpace *vtd_as_from_iommu_pasid_locked(IntelIOMMUState *s,
+ uint16_t sid, uint32_t pasid)
+{
+ struct vtd_as_raw_key key = {
+ .sid = sid,
+ .pasid = pasid
+ };
+
+ return g_hash_table_find(s->vtd_address_spaces,
+ vtd_find_as_by_sid_and_iommu_pasid, &key);
+}
+
static int vtd_sync_shadow_page_hook(const IOMMUTLBEvent *event,
void *private)
{
--
2.47.1
^ permalink raw reply related [flat|nested] 41+ messages in thread
* [PATCH v4 10/20] intel_iommu: Handle PASID entry removal and update
2025-07-29 9:20 [PATCH v4 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
` (8 preceding siblings ...)
2025-07-29 9:20 ` [PATCH v4 09/20] intel_iommu: Introduce two helpers vtd_as_from/to_iommu_pasid_locked Zhenzhong Duan
@ 2025-07-29 9:20 ` Zhenzhong Duan
2025-07-29 9:20 ` [PATCH v4 11/20] intel_iommu: Handle PASID entry addition Zhenzhong Duan
` (10 subsequent siblings)
20 siblings, 0 replies; 41+ messages in thread
From: Zhenzhong Duan @ 2025-07-29 9:20 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan, Yi Sun
This adds a new entry, VTDPASIDCacheEntry, in VTDAddressSpace to cache the
pasid entry and track PASID usage, and to prepare for future PASID tagged
DMA address translation support in the vIOMMU.
The VTDAddressSpace of PCI_NO_PASID is allocated when the device is plugged
and is never freed. For other pasids, a VTDAddressSpace instance is
created/destroyed as the guest pasid entry is set up/destroyed.
When the guest removes or updates a PASID entry, QEMU captures the guest
pasid selective pasid cache invalidation and removes the VTDAddressSpace or
updates the cached PASID entry.
The vIOMMU emulator can figure out which case it is by fetching the latest
guest pasid entry and comparing it with the cached PASID entry.
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
hw/i386/intel_iommu_internal.h | 27 ++++-
include/hw/i386/intel_iommu.h | 6 +
hw/i386/intel_iommu.c | 196 +++++++++++++++++++++++++++++++--
hw/i386/trace-events | 3 +
4 files changed, 220 insertions(+), 12 deletions(-)
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index f7510861d1..b9b76dd996 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -316,6 +316,7 @@ typedef enum VTDFaultReason {
* request while disabled */
VTD_FR_IR_SID_ERR = 0x26, /* Invalid Source-ID */
+ VTD_FR_RTADDR_INV_TTM = 0x31, /* Invalid TTM in RTADDR */
/* PASID directory entry access failure */
VTD_FR_PASID_DIR_ACCESS_ERR = 0x50,
/* The Present(P) field of pasid directory entry is 0 */
@@ -493,6 +494,15 @@ typedef union VTDInvDesc VTDInvDesc;
#define VTD_INV_DESC_PIOTLB_RSVD_VAL0 0xfff000000000f1c0ULL
#define VTD_INV_DESC_PIOTLB_RSVD_VAL1 0xf80ULL
+/* PASID-cache Invalidate Descriptor (pc_inv_dsc) fields */
+#define VTD_INV_DESC_PASIDC_G(x) extract64((x)->val[0], 4, 2)
+#define VTD_INV_DESC_PASIDC_G_DSI 0
+#define VTD_INV_DESC_PASIDC_G_PASID_SI 1
+#define VTD_INV_DESC_PASIDC_G_GLOBAL 3
+#define VTD_INV_DESC_PASIDC_DID(x) extract64((x)->val[0], 16, 16)
+#define VTD_INV_DESC_PASIDC_PASID(x) extract64((x)->val[0], 32, 20)
+#define VTD_INV_DESC_PASIDC_RSVD_VAL0 0xfff000000000f1c0ULL
+
/* Information about page-selective IOTLB invalidate */
struct VTDIOTLBPageInvInfo {
uint16_t domain_id;
@@ -553,6 +563,21 @@ typedef struct VTDRootEntry VTDRootEntry;
#define VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw) (0x1e0ULL | ~VTD_HAW_MASK(aw))
#define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1 0xffffffffffe00000ULL
+typedef enum VTDPCInvType {
+ /* VTD spec defined PASID cache invalidation type */
+ VTD_PASID_CACHE_DOMSI = VTD_INV_DESC_PASIDC_G_DSI,
+ VTD_PASID_CACHE_PASIDSI = VTD_INV_DESC_PASIDC_G_PASID_SI,
+ VTD_PASID_CACHE_GLOBAL_INV = VTD_INV_DESC_PASIDC_G_GLOBAL,
+} VTDPCInvType;
+
+typedef struct VTDPASIDCacheInfo {
+ VTDPCInvType type;
+ uint16_t did;
+ uint32_t pasid;
+ PCIBus *bus;
+ uint16_t devfn;
+} VTDPASIDCacheInfo;
+
/* PASID Table Related Definitions */
#define VTD_PASID_DIR_BASE_ADDR_MASK (~0xfffULL)
#define VTD_PASID_TABLE_BASE_ADDR_MASK (~0xfffULL)
@@ -574,7 +599,7 @@ typedef struct VTDRootEntry VTDRootEntry;
#define VTD_SM_PASID_ENTRY_PT (4ULL << 6)
#define VTD_SM_PASID_ENTRY_AW 7ULL /* Adjusted guest-address-width */
-#define VTD_SM_PASID_ENTRY_DID(val) ((val) & VTD_DOMAIN_ID_MASK)
+#define VTD_SM_PASID_ENTRY_DID(x) extract64((x)->val[1], 0, 16)
#define VTD_SM_PASID_ENTRY_FLPM 3ULL
#define VTD_SM_PASID_ENTRY_FLPTPTR (~0xfffULL)
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 50f9b27a45..0e3826f6f0 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -95,6 +95,11 @@ struct VTDPASIDEntry {
uint64_t val[8];
};
+typedef struct VTDPASIDCacheEntry {
+ struct VTDPASIDEntry pasid_entry;
+ bool valid;
+} VTDPASIDCacheEntry;
+
struct VTDAddressSpace {
PCIBus *bus;
uint8_t devfn;
@@ -107,6 +112,7 @@ struct VTDAddressSpace {
MemoryRegion iommu_ir_fault; /* Interrupt region for catching fault */
IntelIOMMUState *iommu_state;
VTDContextCacheEntry context_cache_entry;
+ VTDPASIDCacheEntry pasid_cache_entry;
QLIST_ENTRY(VTDAddressSpace) next;
/* Superset of notifier flags that this address space has */
IOMMUNotifierFlag notifier_flags;
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index eb0e799e4a..cc849acd20 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -1675,7 +1675,7 @@ static uint16_t vtd_get_domain_id(IntelIOMMUState *s,
if (s->root_scalable) {
vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
- return VTD_SM_PASID_ENTRY_DID(pe.val[1]);
+ return VTD_SM_PASID_ENTRY_DID(&pe);
}
return VTD_CONTEXT_ENTRY_DID(ce->hi);
@@ -3109,6 +3109,183 @@ static bool vtd_process_piotlb_desc(IntelIOMMUState *s,
return true;
}
+static inline int vtd_dev_get_pe_from_pasid(VTDAddressSpace *vtd_as,
+ uint32_t pasid, VTDPASIDEntry *pe)
+{
+ IntelIOMMUState *s = vtd_as->iommu_state;
+ VTDContextEntry ce;
+ int ret;
+
+ if (!s->root_scalable) {
+ return -VTD_FR_RTADDR_INV_TTM;
+ }
+
+ ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus), vtd_as->devfn,
+ &ce);
+ if (ret) {
+ return ret;
+ }
+
+ return vtd_ce_get_pasid_entry(s, &ce, pe, pasid);
+}
+
+static bool vtd_pasid_entry_compare(VTDPASIDEntry *p1, VTDPASIDEntry *p2)
+{
+ return !memcmp(p1, p2, sizeof(*p1));
+}
+
+/*
+ * This is a hash table iteration callback whose return value determines
+ * whether the vtd_as holding the cached pasid entry is removed.
+ *
+ * For PCI_NO_PASID, when the corresponding cached pasid entry is cleared,
+ * it returns false so that the vtd_as is preserved, as it's owned by the
+ * PCI sub-system. For other pasids, it returns true so the vtd_as is removed.
+ */
+static gboolean vtd_flush_pasid_locked(gpointer key, gpointer value,
+ gpointer user_data)
+{
+ VTDPASIDCacheInfo *pc_info = user_data;
+ VTDAddressSpace *vtd_as = value;
+ VTDPASIDCacheEntry *pc_entry = &vtd_as->pasid_cache_entry;
+ VTDPASIDEntry pe;
+ uint16_t did;
+ uint32_t pasid;
+ int ret;
+
+ if (!pc_entry->valid) {
+ return false;
+ }
+ did = VTD_SM_PASID_ENTRY_DID(&pc_entry->pasid_entry);
+
+ if (vtd_as_to_iommu_pasid_locked(vtd_as, &pasid)) {
+ goto remove;
+ }
+
+ switch (pc_info->type) {
+ case VTD_PASID_CACHE_PASIDSI:
+ if (pc_info->pasid != pasid) {
+ return false;
+ }
+ /* fall through */
+ case VTD_PASID_CACHE_DOMSI:
+ if (pc_info->did != did) {
+ return false;
+ }
+ /* fall through */
+ case VTD_PASID_CACHE_GLOBAL_INV:
+ break;
+ default:
+ error_setg(&error_fatal, "invalid pc_info->type for flush");
+ }
+
+ /*
+ * A pasid cache invalidation may indicate that a present pasid entry was
+ * modified into another present pasid entry. To cover such a case, the
+ * vIOMMU emulator needs to fetch the latest guest pasid entry, compare it
+ * with the cached pasid entry, then update the pasid cache.
+ */
+ ret = vtd_dev_get_pe_from_pasid(vtd_as, pasid, &pe);
+ if (ret) {
+ /*
+ * No valid pasid entry in guest memory. e.g. pasid entry was modified
+ * to be either all-zero or non-present. Either case means existing
+ * pasid cache should be removed.
+ */
+ goto remove;
+ }
+
+ /*
+ * Update cached pasid entry if it's stale compared to what's in guest
+ * memory.
+ */
+ if (!vtd_pasid_entry_compare(&pe, &pc_entry->pasid_entry)) {
+ pc_entry->pasid_entry = pe;
+ }
+ return false;
+
+remove:
+ pc_entry->valid = false;
+
+ /*
+ * Don't remove address space of PCI_NO_PASID which is created for PCI
+ * sub-system.
+ */
+ if (vtd_as->pasid == PCI_NO_PASID) {
+ return false;
+ }
+ return true;
+}
+
+/*
+ * For a PASID cache invalidation, this function handles below scenarios:
+ * a) a present cached pasid entry needs to be removed
+ * b) a present cached pasid entry needs to be updated
+ */
+static void vtd_pasid_cache_sync(IntelIOMMUState *s, VTDPASIDCacheInfo *pc_info)
+{
+ if (!s->flts || !s->root_scalable || !s->dmar_enabled) {
+ return;
+ }
+
+ vtd_iommu_lock(s);
+ /*
+ * a,b): loop all the existing vtd_as instances for pasid cache removal
+ * or update.
+ */
+ g_hash_table_foreach_remove(s->vtd_address_spaces, vtd_flush_pasid_locked,
+ pc_info);
+ vtd_iommu_unlock(s);
+}
+
+static bool vtd_process_pasid_desc(IntelIOMMUState *s,
+ VTDInvDesc *inv_desc)
+{
+ uint16_t did;
+ uint32_t pasid;
+ VTDPASIDCacheInfo pc_info;
+ uint64_t mask[4] = {VTD_INV_DESC_PASIDC_RSVD_VAL0, VTD_INV_DESC_ALL_ONE,
+ VTD_INV_DESC_ALL_ONE, VTD_INV_DESC_ALL_ONE};
+
+ if (!vtd_inv_desc_reserved_check(s, inv_desc, mask, true,
+ __func__, "pasid cache inv")) {
+ return false;
+ }
+
+ did = VTD_INV_DESC_PASIDC_DID(inv_desc);
+ pasid = VTD_INV_DESC_PASIDC_PASID(inv_desc);
+
+ switch (VTD_INV_DESC_PASIDC_G(inv_desc)) {
+ case VTD_INV_DESC_PASIDC_G_DSI:
+ trace_vtd_pasid_cache_dsi(did);
+ pc_info.type = VTD_PASID_CACHE_DOMSI;
+ pc_info.did = did;
+ break;
+
+ case VTD_INV_DESC_PASIDC_G_PASID_SI:
+ /* PASID selective implies a DID selective */
+ trace_vtd_pasid_cache_psi(did, pasid);
+ pc_info.type = VTD_PASID_CACHE_PASIDSI;
+ pc_info.did = did;
+ pc_info.pasid = pasid;
+ break;
+
+ case VTD_INV_DESC_PASIDC_G_GLOBAL:
+ trace_vtd_pasid_cache_gsi();
+ pc_info.type = VTD_PASID_CACHE_GLOBAL_INV;
+ break;
+
+ default:
+ error_report_once("invalid granularity field in PASID-cache invalidate "
+ "descriptor, hi: 0x%"PRIx64" lo: 0x%" PRIx64,
+ inv_desc->val[1], inv_desc->val[0]);
+ return false;
+ }
+
+ vtd_pasid_cache_sync(s, &pc_info);
+ return true;
+}
+
static bool vtd_process_inv_iec_desc(IntelIOMMUState *s,
VTDInvDesc *inv_desc)
{
@@ -3271,6 +3448,13 @@ static bool vtd_process_inv_desc(IntelIOMMUState *s)
}
break;
+ case VTD_INV_DESC_PC:
+ trace_vtd_inv_desc("pasid-cache", inv_desc.val[1], inv_desc.val[0]);
+ if (!vtd_process_pasid_desc(s, &inv_desc)) {
+ return false;
+ }
+ break;
+
case VTD_INV_DESC_PIOTLB:
trace_vtd_inv_desc("p-iotlb", inv_desc.val[1], inv_desc.val[0]);
if (!vtd_process_piotlb_desc(s, &inv_desc)) {
@@ -3306,16 +3490,6 @@ static bool vtd_process_inv_desc(IntelIOMMUState *s)
}
break;
- /*
- * TODO: the entity of below two cases will be implemented in future series.
- * To make guest (which integrates scalable mode support patch set in
- * iommu driver) work, just return true is enough so far.
- */
- case VTD_INV_DESC_PC:
- if (s->scalable_mode) {
- break;
- }
- /* fallthrough */
default:
error_report_once("%s: invalid inv desc: hi=%"PRIx64", lo=%"PRIx64
" (unknown type)", __func__, inv_desc.hi,
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index ac9e1a10aa..ae5bbfcdc0 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -24,6 +24,9 @@ vtd_inv_qi_head(uint16_t head) "read head %d"
vtd_inv_qi_tail(uint16_t head) "write tail %d"
vtd_inv_qi_fetch(void) ""
vtd_context_cache_reset(void) ""
+vtd_pasid_cache_gsi(void) ""
+vtd_pasid_cache_dsi(uint16_t domain) "Domain selective PC invalidation domain 0x%"PRIx16
+vtd_pasid_cache_psi(uint16_t domain, uint32_t pasid) "PASID selective PC invalidation domain 0x%"PRIx16" pasid 0x%"PRIx32
vtd_re_not_present(uint8_t bus) "Root entry bus %"PRIu8" not present"
vtd_ce_not_present(uint8_t bus, uint8_t devfn) "Context entry bus %"PRIu8" devfn %"PRIu8" not present"
vtd_iotlb_page_hit(uint16_t sid, uint64_t addr, uint64_t slpte, uint16_t domain) "IOTLB page hit sid 0x%"PRIx16" iova 0x%"PRIx64" slpte 0x%"PRIx64" domain 0x%"PRIx16
--
2.47.1
^ permalink raw reply related [flat|nested] 41+ messages in thread
* [PATCH v4 11/20] intel_iommu: Handle PASID entry addition
2025-07-29 9:20 [PATCH v4 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
` (9 preceding siblings ...)
2025-07-29 9:20 ` [PATCH v4 10/20] intel_iommu: Handle PASID entry removal and update Zhenzhong Duan
@ 2025-07-29 9:20 ` Zhenzhong Duan
2025-07-29 9:20 ` [PATCH v4 12/20] intel_iommu: Introduce a new pasid cache invalidation type FORCE_RESET Zhenzhong Duan
` (9 subsequent siblings)
20 siblings, 0 replies; 41+ messages in thread
From: Zhenzhong Duan @ 2025-07-29 9:20 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan, Yi Sun
When the guest creates new PASID entries, QEMU captures the guest pasid
selective pasid cache invalidation and walks through each passthrough device
and each pasid; when a match is found, it identifies an existing vtd_as or
creates a new one and updates its corresponding cached pasid entry.
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
hw/i386/intel_iommu_internal.h | 2 +
hw/i386/intel_iommu.c | 176 ++++++++++++++++++++++++++++++++-
2 files changed, 175 insertions(+), 3 deletions(-)
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index b9b76dd996..fb2a919e87 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -559,6 +559,7 @@ typedef struct VTDRootEntry VTDRootEntry;
#define VTD_CTX_ENTRY_LEGACY_SIZE 16
#define VTD_CTX_ENTRY_SCALABLE_SIZE 32
+#define VTD_SM_CONTEXT_ENTRY_PDTS(x) extract64((x)->val[0], 9, 3)
#define VTD_SM_CONTEXT_ENTRY_RID2PASID_MASK 0xfffff
#define VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw) (0x1e0ULL | ~VTD_HAW_MASK(aw))
#define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1 0xffffffffffe00000ULL
@@ -589,6 +590,7 @@ typedef struct VTDPASIDCacheInfo {
#define VTD_PASID_TABLE_BITS_MASK (0x3fULL)
#define VTD_PASID_TABLE_INDEX(pasid) ((pasid) & VTD_PASID_TABLE_BITS_MASK)
#define VTD_PASID_ENTRY_FPD (1ULL << 1) /* Fault Processing Disable */
+#define VTD_PASID_TBL_ENTRY_NUM (1ULL << 6)
/* PASID Granular Translation Type Mask */
#define VTD_PASID_ENTRY_P 1ULL
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index cc849acd20..95d1893a44 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -826,6 +826,11 @@ static inline bool vtd_pe_type_check(IntelIOMMUState *s, VTDPASIDEntry *pe)
}
}
+static inline uint32_t vtd_sm_ce_get_pdt_entry_num(VTDContextEntry *ce)
+{
+ return 1U << (VTD_SM_CONTEXT_ENTRY_PDTS(ce) + 7);
+}
+
static inline bool vtd_pdire_present(VTDPASIDDirEntry *pdire)
{
return pdire->val & 1;
@@ -1647,9 +1652,9 @@ static gboolean vtd_find_as_by_sid_and_iommu_pasid(gpointer key, gpointer value,
}
/* Translate iommu pasid to vtd_as */
-static inline
-VTDAddressSpace *vtd_as_from_iommu_pasid_locked(IntelIOMMUState *s,
- uint16_t sid, uint32_t pasid)
+static VTDAddressSpace *vtd_as_from_iommu_pasid_locked(IntelIOMMUState *s,
+ uint16_t sid,
+ uint32_t pasid)
{
struct vtd_as_raw_key key = {
.sid = sid,
@@ -3217,10 +3222,172 @@ remove:
return true;
}
+/*
+ * This function walks over the PASID range within [start, end) in a single
+ * PASID table for entries matching @info type/did, then retrieves/creates
+ * the vtd_as and fills the associated pasid entry cache.
+ */
+static void vtd_sm_pasid_table_walk_one(IntelIOMMUState *s,
+ dma_addr_t pt_base,
+ int start,
+ int end,
+ VTDPASIDCacheInfo *info)
+{
+ VTDPASIDEntry pe;
+ int pasid = start;
+
+ while (pasid < end) {
+ if (!vtd_get_pe_in_pasid_leaf_table(s, pasid, pt_base, &pe)
+ && vtd_pe_present(&pe)) {
+ int bus_n = pci_bus_num(info->bus), devfn = info->devfn;
+ uint16_t sid = PCI_BUILD_BDF(bus_n, devfn);
+ VTDPASIDCacheEntry *pc_entry;
+ VTDAddressSpace *vtd_as;
+
+ vtd_iommu_lock(s);
+ /*
+ * When indexed by rid2pasid, vtd_as should have been created,
+ * e.g., by PCI subsystem. For other iommu pasid, we need to
+ * create vtd_as dynamically. Other iommu pasid is same value
+ * as PCI's pasid, so it's used as input of vtd_find_add_as().
+ */
+ vtd_as = vtd_as_from_iommu_pasid_locked(s, sid, pasid);
+ vtd_iommu_unlock(s);
+ if (!vtd_as) {
+ vtd_as = vtd_find_add_as(s, info->bus, devfn, pasid);
+ }
+
+ if ((info->type == VTD_PASID_CACHE_DOMSI ||
+ info->type == VTD_PASID_CACHE_PASIDSI) &&
+ (info->did != VTD_SM_PASID_ENTRY_DID(&pe))) {
+ /*
+ * VTD_PASID_CACHE_DOMSI and VTD_PASID_CACHE_PASIDSI
+ * requires a domain id check. If the domain id check fails,
+ * go to next pasid.
+ */
+ pasid++;
+ continue;
+ }
+
+ pc_entry = &vtd_as->pasid_cache_entry;
+ /*
+ * pasid cache update and clear are handled in
+ * vtd_flush_pasid_locked(), only care new pasid entry here.
+ */
+ if (!pc_entry->valid) {
+ pc_entry->pasid_entry = pe;
+ pc_entry->valid = true;
+ }
+ }
+ pasid++;
+ }
+}
+
+/*
+ * In VT-d scalable mode translation, PASID dir + PASID table is used.
+ * This function aims at looping over a range of PASIDs in the given
+ * two level table to identify the pasid config in guest.
+ */
+static void vtd_sm_pasid_table_walk(IntelIOMMUState *s,
+ dma_addr_t pdt_base,
+ int start, int end,
+ VTDPASIDCacheInfo *info)
+{
+ VTDPASIDDirEntry pdire;
+ int pasid = start;
+ int pasid_next;
+ dma_addr_t pt_base;
+
+ while (pasid < end) {
+ pasid_next =
+ (pasid + VTD_PASID_TBL_ENTRY_NUM) & ~(VTD_PASID_TBL_ENTRY_NUM - 1);
+ pasid_next = pasid_next < end ? pasid_next : end;
+
+ if (!vtd_get_pdire_from_pdir_table(pdt_base, pasid, &pdire)
+ && vtd_pdire_present(&pdire)) {
+ pt_base = pdire.val & VTD_PASID_TABLE_BASE_ADDR_MASK;
+ vtd_sm_pasid_table_walk_one(s, pt_base, pasid, pasid_next, info);
+ }
+ pasid = pasid_next;
+ }
+}
+
+static void vtd_replay_pasid_bind_for_dev(IntelIOMMUState *s,
+ int start, int end,
+ VTDPASIDCacheInfo *info)
+{
+ VTDContextEntry ce;
+
+ if (!vtd_dev_to_context_entry(s, pci_bus_num(info->bus), info->devfn,
+ &ce)) {
+ uint32_t max_pasid;
+
+ max_pasid = vtd_sm_ce_get_pdt_entry_num(&ce) * VTD_PASID_TBL_ENTRY_NUM;
+ if (end > max_pasid) {
+ end = max_pasid;
+ }
+ vtd_sm_pasid_table_walk(s,
+ VTD_CE_GET_PASID_DIR_TABLE(&ce),
+ start,
+ end,
+ info);
+ }
+}
+
+/*
+ * This function replays the guest pasid bindings by walking the two level
+ * guest PASID table. For each valid pasid entry, it finds or creates a
+ * vtd_as and caches pasid entry in vtd_as.
+ */
+static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
+ VTDPASIDCacheInfo *pc_info)
+{
+ /*
+ * Currently only Requests-without-PASID is supported, as vIOMMU doesn't
+ * support RPS(RID-PASID Support), pasid scope is fixed to [0, 1).
+ */
+ int start = 0, end = 1;
+ VTDHostIOMMUDevice *vtd_hiod;
+ VTDPASIDCacheInfo walk_info;
+ GHashTableIter as_it;
+
+ switch (pc_info->type) {
+ case VTD_PASID_CACHE_PASIDSI:
+ start = pc_info->pasid;
+ end = pc_info->pasid + 1;
+ /* fall through */
+ case VTD_PASID_CACHE_DOMSI:
+ case VTD_PASID_CACHE_GLOBAL_INV:
+ /* loop all assigned devices */
+ break;
+ default:
+ error_setg(&error_fatal, "invalid pc_info->type for replay");
+ }
+
+ /*
+ * In this replay, one only needs to care about the devices which are
+ * backed by host IOMMU. Those devices have a corresponding vtd_hiod
+ * in s->vtd_host_iommu_dev. For devices not backed by host IOMMU, it
+ * is not necessary to replay the bindings since their cache could be
+ * re-created in the future DMA address translation.
+ *
+ * VTD translation callback never accesses vtd_hiod and its corresponding
+ * cached pasid entry, so no iommu lock needed here.
+ */
+ walk_info = *pc_info;
+ g_hash_table_iter_init(&as_it, s->vtd_host_iommu_dev);
+ while (g_hash_table_iter_next(&as_it, NULL, (void **)&vtd_hiod)) {
+ walk_info.bus = vtd_hiod->bus;
+ walk_info.devfn = vtd_hiod->devfn;
+ vtd_replay_pasid_bind_for_dev(s, start, end, &walk_info);
+ }
+}
+
/*
* For a PASID cache invalidation, this function handles below scenarios:
* a) a present cached pasid entry needs to be removed
* b) a present cached pasid entry needs to be updated
+ * c) a present cached pasid entry needs to be created
*/
static void vtd_pasid_cache_sync(IntelIOMMUState *s, VTDPASIDCacheInfo *pc_info)
{
@@ -3236,6 +3403,9 @@ static void vtd_pasid_cache_sync(IntelIOMMUState *s, VTDPASIDCacheInfo *pc_info)
g_hash_table_foreach_remove(s->vtd_address_spaces, vtd_flush_pasid_locked,
pc_info);
vtd_iommu_unlock(s);
+
+ /* c): loop all passthrough device for new pasid entries */
+ vtd_replay_guest_pasid_bindings(s, pc_info);
}
static bool vtd_process_pasid_desc(IntelIOMMUState *s,
--
2.47.1
^ permalink raw reply related [flat|nested] 41+ messages in thread
* [PATCH v4 12/20] intel_iommu: Introduce a new pasid cache invalidation type FORCE_RESET
2025-07-29 9:20 [PATCH v4 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
` (10 preceding siblings ...)
2025-07-29 9:20 ` [PATCH v4 11/20] intel_iommu: Handle PASID entry addition Zhenzhong Duan
@ 2025-07-29 9:20 ` Zhenzhong Duan
2025-07-29 9:20 ` [PATCH v4 13/20] intel_iommu: Stick to system MR for IOMMUFD backed host device when x-flts=on Zhenzhong Duan
` (8 subsequent siblings)
20 siblings, 0 replies; 41+ messages in thread
From: Zhenzhong Duan @ 2025-07-29 9:20 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan, Yi Sun
FORCE_RESET differs from GLOBAL_INV, which only updates the pasid cache when
the underlying pasid entry is still valid; FORCE_RESET unconditionally drops
all pasid cache entries. FORCE_RESET isn't a VT-d spec defined invalidation
type for the pasid cache; it is only used internally during system level reset.
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
hw/i386/intel_iommu_internal.h | 9 +++++++++
hw/i386/intel_iommu.c | 25 +++++++++++++++++++++++++
hw/i386/trace-events | 1 +
3 files changed, 35 insertions(+)
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index fb2a919e87..c510b09d1a 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -569,6 +569,15 @@ typedef enum VTDPCInvType {
VTD_PASID_CACHE_DOMSI = VTD_INV_DESC_PASIDC_G_DSI,
VTD_PASID_CACHE_PASIDSI = VTD_INV_DESC_PASIDC_G_PASID_SI,
VTD_PASID_CACHE_GLOBAL_INV = VTD_INV_DESC_PASIDC_G_GLOBAL,
+
+ /*
+ * Internally used PASID cache invalidation types start here,
+ * 0x10 is safely out of range since the invalidation type field
+ * in pc_inv_desc is only 2 bits in size.
+ */
+
+ /* Reset all PASID cache entries, used in system level reset */
+ VTD_PASID_CACHE_FORCE_RESET = 0x10,
} VTDPCInvType;
typedef struct VTDPASIDCacheInfo {
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 95d1893a44..2d8588d9fe 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -87,6 +87,8 @@ struct vtd_iotlb_key {
static void vtd_address_space_refresh_all(IntelIOMMUState *s);
static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
+static void vtd_pasid_cache_reset_locked(IntelIOMMUState *s);
+
static void vtd_panic_require_caching_mode(void)
{
error_report("We need to set caching-mode=on for intel-iommu to enable "
@@ -391,6 +393,7 @@ static void vtd_reset_caches(IntelIOMMUState *s)
vtd_iommu_lock(s);
vtd_reset_iotlb_locked(s);
vtd_reset_context_cache_locked(s);
+ vtd_pasid_cache_reset_locked(s);
vtd_iommu_unlock(s);
}
@@ -3180,6 +3183,8 @@ static gboolean vtd_flush_pasid_locked(gpointer key, gpointer value,
/* fall through */
case VTD_PASID_CACHE_GLOBAL_INV:
break;
+ case VTD_PASID_CACHE_FORCE_RESET:
+ goto remove;
default:
error_setg(&error_fatal, "invalid pc_info->type for flush");
}
@@ -3222,6 +3227,23 @@ remove:
return true;
}
+static void vtd_pasid_cache_reset_locked(IntelIOMMUState *s)
+{
+ VTDPASIDCacheInfo pc_info;
+
+ trace_vtd_pasid_cache_reset();
+
+ pc_info.type = VTD_PASID_CACHE_FORCE_RESET;
+
+ /*
+ * Resetting the pasid cache is a big hammer, so use g_hash_table_foreach_remove
+ * which will free all vtd_as instances except those created for PCI
+ * sub-system.
+ */
+ g_hash_table_foreach_remove(s->vtd_address_spaces,
+ vtd_flush_pasid_locked, &pc_info);
+}
+
/*
* This function walks over PASID range within [start, end) in a single
* PASID table for entries matching @info type/did, then retrieve/create
@@ -3360,6 +3382,9 @@ static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
case VTD_PASID_CACHE_GLOBAL_INV:
/* loop all assigned devices */
break;
+ case VTD_PASID_CACHE_FORCE_RESET:
+ /* For force reset, no need to go further replay */
+ return;
default:
error_setg(&error_fatal, "invalid pc_info->type for replay");
}
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index ae5bbfcdc0..c8a936eb46 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -24,6 +24,7 @@ vtd_inv_qi_head(uint16_t head) "read head %d"
vtd_inv_qi_tail(uint16_t head) "write tail %d"
vtd_inv_qi_fetch(void) ""
vtd_context_cache_reset(void) ""
+vtd_pasid_cache_reset(void) ""
vtd_pasid_cache_gsi(void) ""
vtd_pasid_cache_dsi(uint16_t domain) "Domain selective PC invalidation domain 0x%"PRIx16
vtd_pasid_cache_psi(uint16_t domain, uint32_t pasid) "PASID selective PC invalidation domain 0x%"PRIx16" pasid 0x%"PRIx32
--
2.47.1
^ permalink raw reply related [flat|nested] 41+ messages in thread
* [PATCH v4 13/20] intel_iommu: Stick to system MR for IOMMUFD backed host device when x-flts=on
2025-07-29 9:20 [PATCH v4 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
` (11 preceding siblings ...)
2025-07-29 9:20 ` [PATCH v4 12/20] intel_iommu: Introduce a new pasid cache invalidation type FORCE_RESET Zhenzhong Duan
@ 2025-07-29 9:20 ` Zhenzhong Duan
2025-07-29 9:20 ` [PATCH v4 14/20] intel_iommu: Bind/unbind guest page table to host Zhenzhong Duan
` (7 subsequent siblings)
20 siblings, 0 replies; 41+ messages in thread
From: Zhenzhong Duan @ 2025-07-29 9:20 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan
When the guest is in scalable mode and x-flts=on, we stick to the system MR
for IOMMUFD backed host devices. Their default hwpt then contains GPA->HPA
mappings, which are used directly if PGTT=PT and as the nested parent if
PGTT=FLT. Otherwise fall back to the original processing.
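As a rough model (example-local names, not the QEMU API surface), the MR
selection added below reduces to:
    #include <stdbool.h>
    /* Boolean sketch of the decision in vtd_as_pt_enabled(): with x-flts=on
     * in scalable mode, an IOMMUFD-backed device always stays on the system
     * MR; everything else keeps the existing PGTT/TT based decision. */
    static bool use_system_mr(bool root_scalable, bool flts, bool iommufd_backed,
                              bool passthrough_per_existing_checks)
    {
        if (root_scalable && flts && iommufd_backed) {
            return true;
        }
        return passthrough_per_existing_checks;  /* original processing */
    }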
Suggested-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
hw/i386/intel_iommu.c | 34 ++++++++++++++++++++++++++++++++++
1 file changed, 34 insertions(+)
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 2d8588d9fe..7c41fd76a8 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -1773,6 +1773,28 @@ static bool vtd_dev_pt_enabled(IntelIOMMUState *s, VTDContextEntry *ce,
}
+static VTDHostIOMMUDevice *vtd_find_hiod_iommufd(IntelIOMMUState *s,
+ VTDAddressSpace *as)
+{
+ struct vtd_as_key key = {
+ .bus = as->bus,
+ .devfn = as->devfn,
+ };
+ VTDHostIOMMUDevice *vtd_hiod = g_hash_table_lookup(s->vtd_host_iommu_dev,
+ &key);
+
+ if (vtd_hiod && vtd_hiod->hiod &&
+ object_dynamic_cast(OBJECT(vtd_hiod->hiod),
+ TYPE_HOST_IOMMU_DEVICE_IOMMUFD)) {
+ return vtd_hiod;
+ }
+ return NULL;
+}
+
+/*
+ * vtd_switch_address_space() calls vtd_as_pt_enabled() to determine which
+ * MR to switch to. Switch to system MR if return true, iommu MR otherwise.
+ */
static bool vtd_as_pt_enabled(VTDAddressSpace *as)
{
IntelIOMMUState *s;
@@ -1781,6 +1803,18 @@ static bool vtd_as_pt_enabled(VTDAddressSpace *as)
assert(as);
s = as->iommu_state;
+
+ /*
+ * When the guest is in scalable mode and x-flts=on, we stick to the
+ * system MR for IOMMUFD backed host devices. Their default hwpt then
+ * contains GPA->HPA mappings which are used directly if PGTT=PT and
+ * as the nested parent if PGTT=FLT. Otherwise fall back to the
+ * original processing.
+ */
+ if (s->root_scalable && s->flts && vtd_find_hiod_iommufd(s, as)) {
+ return true;
+ }
+
if (vtd_dev_to_context_entry(s, pci_bus_num(as->bus), as->devfn,
&ce)) {
/*
--
2.47.1
^ permalink raw reply related [flat|nested] 41+ messages in thread
* [PATCH v4 14/20] intel_iommu: Bind/unbind guest page table to host
2025-07-29 9:20 [PATCH v4 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
` (12 preceding siblings ...)
2025-07-29 9:20 ` [PATCH v4 13/20] intel_iommu: Stick to system MR for IOMMUFD backed host device when x-flts=on Zhenzhong Duan
@ 2025-07-29 9:20 ` Zhenzhong Duan
2025-07-29 9:20 ` [PATCH v4 15/20] intel_iommu: Replay pasid bindings after context cache invalidation Zhenzhong Duan
` (6 subsequent siblings)
20 siblings, 0 replies; 41+ messages in thread
From: Zhenzhong Duan @ 2025-07-29 9:20 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan, Yi Sun
This captures guest PASID table entry modifications and propagates the
changes to the host to attach a hwpt whose type is determined by the guest
IOMMU mode and the PGTT configuration.
When PGTT is Pass-through (100b), the hwpt on the host side is a stage-2
page table (GPA->HPA). When PGTT is First-stage Translation only (001b),
the vIOMMU reuses the hwpt (GPA->HPA) provided by VFIO as the nested parent
to construct a nested page table.
When the guest decides to use legacy mode, the vIOMMU switches the MRs of
the device's AS; the IOAS created by the VFIO container is then switched to
using IOMMU_NOTIFIER_IOTLB_EVENTS since the MR is switched to the IOMMU MR.
This keeps shadowing of the guest I/O page table working.
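Illustrative only: a standalone sketch of the fields that the stage-1 hwpt
data is derived from, following the PASID-entry macros added below. The real
uAPI struct is iommu_hwpt_vtd_s1 from <linux/iommufd.h>; the struct and
function names here are local to the example.
    #include <stdbool.h>
    #include <stdint.h>
    /* Example-local view of what vtd_init_s1_hwpt_data() decodes from
     * val[2] of a scalable-mode PASID entry. */
    struct s1_cfg {
        bool     sre, wpe, eafe;  /* forwarded as hwpt flags */
        uint32_t addr_width;      /* 48 (FSPM=00) or 57 (FSPM=01) */
        uint64_t pgtbl_addr;      /* FSPTPTR, 4KiB aligned */
    };
    static struct s1_cfg decode_pasid_entry_val2(uint64_t val2)
    {
        struct s1_cfg c;
        uint32_t fspm = (val2 >> 2) & 0x3;
        c.sre        = val2 & (1ULL << 0);
        c.wpe        = val2 & (1ULL << 4);
        c.eafe       = val2 & (1ULL << 7);
        c.addr_width = 48 + fspm * 9;     /* stage-1 IOVA address width */
        c.pgtbl_addr = val2 & ~0xfffULL;  /* first-stage page-table pointer */
        return c;
    }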
Co-Authored-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
hw/i386/intel_iommu_internal.h | 14 ++-
include/hw/i386/intel_iommu.h | 1 +
hw/i386/intel_iommu.c | 221 ++++++++++++++++++++++++++++++++-
hw/i386/trace-events | 3 +
4 files changed, 233 insertions(+), 6 deletions(-)
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index c510b09d1a..61e35dbdc0 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -564,6 +564,12 @@ typedef struct VTDRootEntry VTDRootEntry;
#define VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw) (0x1e0ULL | ~VTD_HAW_MASK(aw))
#define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1 0xffffffffffe00000ULL
+typedef enum VTDPASIDOp {
+ VTD_PASID_BIND,
+ VTD_PASID_UPDATE,
+ VTD_PASID_UNBIND,
+} VTDPASIDOp;
+
typedef enum VTDPCInvType {
/* VTD spec defined PASID cache invalidation type */
VTD_PASID_CACHE_DOMSI = VTD_INV_DESC_PASIDC_G_DSI,
@@ -612,8 +618,12 @@ typedef struct VTDPASIDCacheInfo {
#define VTD_SM_PASID_ENTRY_AW 7ULL /* Adjusted guest-address-width */
#define VTD_SM_PASID_ENTRY_DID(x) extract64((x)->val[1], 0, 16)
-#define VTD_SM_PASID_ENTRY_FLPM 3ULL
-#define VTD_SM_PASID_ENTRY_FLPTPTR (~0xfffULL)
+#define VTD_SM_PASID_ENTRY_FSPTPTR (~0xfffULL)
+#define VTD_SM_PASID_ENTRY_SRE_BIT(x) extract64((x)->val[2], 0, 1)
+/* 00: 4-level paging, 01: 5-level paging, 10-11: Reserved */
+#define VTD_SM_PASID_ENTRY_FSPM(x) extract64((x)->val[2], 2, 2)
+#define VTD_SM_PASID_ENTRY_WPE_BIT(x) extract64((x)->val[2], 4, 1)
+#define VTD_SM_PASID_ENTRY_EAFE_BIT(x) extract64((x)->val[2], 7, 1)
/* First Level Paging Structure */
/* Masks for First Level Paging Entry */
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 0e3826f6f0..2affab36b2 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -104,6 +104,7 @@ struct VTDAddressSpace {
PCIBus *bus;
uint8_t devfn;
uint32_t pasid;
+ uint32_t s1_hwpt;
AddressSpace as;
IOMMUMemoryRegion iommu;
MemoryRegion root; /* The root container of the device */
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 7c41fd76a8..0fdaf0b0bb 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -20,6 +20,7 @@
*/
#include "qemu/osdep.h"
+#include CONFIG_DEVICES /* CONFIG_IOMMUFD */
#include "qemu/error-report.h"
#include "qemu/main-loop.h"
#include "qapi/error.h"
@@ -41,6 +42,9 @@
#include "migration/vmstate.h"
#include "trace.h"
#include "system/iommufd.h"
+#ifdef CONFIG_IOMMUFD
+#include <linux/iommufd.h>
+#endif
/* context entry operations */
#define VTD_CE_GET_RID2PASID(ce) \
@@ -50,10 +54,9 @@
/* pe operations */
#define VTD_PE_GET_TYPE(pe) ((pe)->val[0] & VTD_SM_PASID_ENTRY_PGTT)
-#define VTD_PE_GET_FL_LEVEL(pe) \
- (4 + (((pe)->val[2] >> 2) & VTD_SM_PASID_ENTRY_FLPM))
#define VTD_PE_GET_SL_LEVEL(pe) \
(2 + (((pe)->val[0] >> 2) & VTD_SM_PASID_ENTRY_AW))
+#define VTD_PE_GET_FL_LEVEL(pe) (VTD_SM_PASID_ENTRY_FSPM(pe) + 4)
/*
* PCI bus number (or SID) is not reliable since the device is usaully
@@ -834,6 +837,31 @@ static inline uint32_t vtd_sm_ce_get_pdt_entry_num(VTDContextEntry *ce)
return 1U << (VTD_SM_CONTEXT_ENTRY_PDTS(ce) + 7);
}
+static inline dma_addr_t vtd_pe_get_flpt_base(VTDPASIDEntry *pe)
+{
+ return pe->val[2] & VTD_SM_PASID_ENTRY_FSPTPTR;
+}
+
+/*
+ * Stage-1 IOVA address width: 48 bits for 4-level paging(FSPM=00)
+ * 57 bits for 5-level paging(FSPM=01)
+ */
+static inline uint32_t vtd_pe_get_fl_aw(VTDPASIDEntry *pe)
+{
+ return 48 + VTD_SM_PASID_ENTRY_FSPM(pe) * 9;
+}
+
+static inline bool vtd_pe_pgtt_is_pt(VTDPASIDEntry *pe)
+{
+ return (VTD_PE_GET_TYPE(pe) == VTD_SM_PASID_ENTRY_PT);
+}
+
+/* check if pgtt is first stage translation */
+static inline bool vtd_pe_pgtt_is_flt(VTDPASIDEntry *pe)
+{
+ return (VTD_PE_GET_TYPE(pe) == VTD_SM_PASID_ENTRY_FLT);
+}
+
static inline bool vtd_pdire_present(VTDPASIDDirEntry *pdire)
{
return pdire->val & 1;
@@ -1131,7 +1159,7 @@ static dma_addr_t vtd_get_iova_pgtbl_base(IntelIOMMUState *s,
if (s->root_scalable) {
vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
if (s->flts) {
- return pe.val[2] & VTD_SM_PASID_ENTRY_FLPTPTR;
+ return pe.val[2] & VTD_SM_PASID_ENTRY_FSPTPTR;
} else {
return pe.val[0] & VTD_SM_PASID_ENTRY_SLPTPTR;
}
@@ -1766,7 +1794,7 @@ static bool vtd_dev_pt_enabled(IntelIOMMUState *s, VTDContextEntry *ce,
*/
return false;
}
- return (VTD_PE_GET_TYPE(&pe) == VTD_SM_PASID_ENTRY_PT);
+ return vtd_pe_pgtt_is_pt(&pe);
}
return (vtd_ce_get_type(ce) == VTD_CONTEXT_TT_PASS_THROUGH);
@@ -2433,6 +2461,178 @@ static void vtd_context_global_invalidate(IntelIOMMUState *s)
vtd_iommu_replay_all(s);
}
+#ifdef CONFIG_IOMMUFD
+static void vtd_init_s1_hwpt_data(struct iommu_hwpt_vtd_s1 *vtd,
+ VTDPASIDEntry *pe)
+{
+ memset(vtd, 0, sizeof(*vtd));
+
+ vtd->flags = (VTD_SM_PASID_ENTRY_SRE_BIT(pe) ? IOMMU_VTD_S1_SRE : 0) |
+ (VTD_SM_PASID_ENTRY_WPE_BIT(pe) ? IOMMU_VTD_S1_WPE : 0) |
+ (VTD_SM_PASID_ENTRY_EAFE_BIT(pe) ? IOMMU_VTD_S1_EAFE : 0);
+ vtd->addr_width = vtd_pe_get_fl_aw(pe);
+ vtd->pgtbl_addr = (uint64_t)vtd_pe_get_flpt_base(pe);
+}
+
+static int vtd_create_s1_hwpt(HostIOMMUDeviceIOMMUFD *idev,
+ VTDPASIDEntry *pe, uint32_t *s1_hwpt,
+ Error **errp)
+{
+ struct iommu_hwpt_vtd_s1 vtd;
+
+ vtd_init_s1_hwpt_data(&vtd, pe);
+
+ return !iommufd_backend_alloc_hwpt(idev->iommufd, idev->devid,
+ idev->hwpt_id, 0, IOMMU_HWPT_DATA_VTD_S1,
+ sizeof(vtd), &vtd, s1_hwpt, errp);
+}
+
+static void vtd_destroy_s1_hwpt(HostIOMMUDeviceIOMMUFD *idev,
+ VTDAddressSpace *vtd_as)
+{
+ if (!vtd_as->s1_hwpt) {
+ return;
+ }
+ iommufd_backend_free_id(idev->iommufd, vtd_as->s1_hwpt);
+ vtd_as->s1_hwpt = 0;
+}
+
+static int vtd_device_attach_iommufd(VTDHostIOMMUDevice *vtd_hiod,
+ VTDAddressSpace *vtd_as, Error **errp)
+{
+ HostIOMMUDeviceIOMMUFD *idev = HOST_IOMMU_DEVICE_IOMMUFD(vtd_hiod->hiod);
+ VTDPASIDEntry *pe = &vtd_as->pasid_cache_entry.pasid_entry;
+ uint32_t hwpt_id;
+ int ret;
+
+ /*
+ * We can get here only if flts=on, the supported PGTT is FLT and PT.
+ * Catch invalid PGTT when processing invalidation request to avoid
+ * attaching to wrong hwpt.
+ */
+ if (!vtd_pe_pgtt_is_flt(pe) && !vtd_pe_pgtt_is_pt(pe)) {
+ error_setg(errp, "Invalid PGTT type");
+ return -EINVAL;
+ }
+
+ if (vtd_pe_pgtt_is_flt(pe)) {
+ /* Should fail if the FLPT base is 0 */
+ if (!vtd_pe_get_flpt_base(pe)) {
+ error_setg(errp, "FLPT base is 0");
+ return -EINVAL;
+ }
+
+ if (vtd_create_s1_hwpt(idev, pe, &hwpt_id, errp)) {
+ return -EINVAL;
+ }
+ } else {
+ hwpt_id = idev->hwpt_id;
+ }
+
+ ret = !host_iommu_device_iommufd_attach_hwpt(idev, hwpt_id, errp);
+ trace_vtd_device_attach_hwpt(idev->devid, vtd_as->pasid, hwpt_id, ret);
+ if (!ret) {
+ vtd_destroy_s1_hwpt(idev, vtd_as);
+ if (vtd_pe_pgtt_is_flt(pe)) {
+ vtd_as->s1_hwpt = hwpt_id;
+ }
+ } else if (vtd_pe_pgtt_is_flt(pe)) {
+ iommufd_backend_free_id(idev->iommufd, hwpt_id);
+ }
+
+ return ret;
+}
+
+static int vtd_device_detach_iommufd(VTDHostIOMMUDevice *vtd_hiod,
+ VTDAddressSpace *vtd_as, Error **errp)
+{
+ HostIOMMUDeviceIOMMUFD *idev = HOST_IOMMU_DEVICE_IOMMUFD(vtd_hiod->hiod);
+ uint32_t pasid = vtd_as->pasid;
+ int ret;
+
+ if (vtd_hiod->iommu_state->dmar_enabled) {
+ ret = !host_iommu_device_iommufd_detach_hwpt(idev, errp);
+ trace_vtd_device_detach_hwpt(idev->devid, pasid, ret);
+ } else {
+ ret = !host_iommu_device_iommufd_attach_hwpt(idev, idev->hwpt_id, errp);
+ trace_vtd_device_reattach_def_hwpt(idev->devid, pasid, idev->hwpt_id,
+ ret);
+ }
+
+ if (!ret) {
+ vtd_destroy_s1_hwpt(idev, vtd_as);
+ }
+
+ return ret;
+}
+
+static int vtd_bind_guest_pasid(VTDAddressSpace *vtd_as, VTDPASIDOp op,
+ Error **errp)
+{
+ IntelIOMMUState *s = vtd_as->iommu_state;
+ VTDHostIOMMUDevice *vtd_hiod = vtd_find_hiod_iommufd(s, vtd_as);
+ int ret;
+
+ if (!vtd_hiod) {
+ /* No need to go further, e.g. for emulated device */
+ return 0;
+ }
+
+ if (vtd_as->pasid != PCI_NO_PASID) {
+ error_setg(errp, "Non-rid_pasid %d not supported yet", vtd_as->pasid);
+ return -EINVAL;
+ }
+
+ switch (op) {
+ case VTD_PASID_UPDATE:
+ case VTD_PASID_BIND:
+ {
+ ret = vtd_device_attach_iommufd(vtd_hiod, vtd_as, errp);
+ break;
+ }
+ case VTD_PASID_UNBIND:
+ {
+ ret = vtd_device_detach_iommufd(vtd_hiod, vtd_as, errp);
+ break;
+ }
+ default:
+ error_setg(errp, "Unknown VTDPASIDOp!!!");
+ break;
+ }
+
+ return ret;
+}
+#else
+static int vtd_bind_guest_pasid(VTDAddressSpace *vtd_as, VTDPASIDOp op,
+ Error **errp)
+{
+ return 0;
+}
+#endif
+
+static int vtd_bind_guest_pasid_report_err(VTDAddressSpace *vtd_as,
+ VTDPASIDOp op)
+{
+ Error *local_err = NULL;
+ int ret;
+
+ /*
+ * vIOMMU calls into the kernel to do BIND/UNBIND, the failure reason
+ * can be a kernel bug, a QEMU bug or an invalid guest config. None of them
+ * should be reported to guest in PASID cache invalidation
+ * processing path. But at least, we can report it to QEMU console.
+ *
+ * TODO: for invalid guest config, DMA translation fault will be
+ * caught by host and passed to QEMU to inject to guest in future.
+ */
+ ret = vtd_bind_guest_pasid(vtd_as, op, &local_err);
+ if (ret) {
+ error_report_err(local_err);
+ }
+
+ return ret;
+}
+
/* Do a context-cache device-selective invalidation.
* @func_mask: FM field after shifting
*/
@@ -3245,10 +3445,20 @@ static gboolean vtd_flush_pasid_locked(gpointer key, gpointer value,
*/
if (!vtd_pasid_entry_compare(&pe, &pc_entry->pasid_entry)) {
pc_entry->pasid_entry = pe;
+ if (vtd_bind_guest_pasid_report_err(vtd_as, VTD_PASID_UPDATE)) {
+ /*
+ * In case update binding fails, tear down existing binding to
+ * catch invalid pasid entry config during DMA translation.
+ */
+ goto remove;
+ }
}
return false;
remove:
+ if (vtd_bind_guest_pasid_report_err(vtd_as, VTD_PASID_UNBIND)) {
+ return false;
+ }
pc_entry->valid = false;
/*
@@ -3333,6 +3543,9 @@ static void vtd_sm_pasid_table_walk_one(IntelIOMMUState *s,
if (!pc_entry->valid) {
pc_entry->pasid_entry = pe;
pc_entry->valid = true;
+ if (vtd_bind_guest_pasid_report_err(vtd_as, VTD_PASID_BIND)) {
+ pc_entry->valid = false;
+ }
}
}
pasid++;
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index c8a936eb46..1c31b9a873 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -73,6 +73,9 @@ vtd_warn_invalid_qi_tail(uint16_t tail) "tail 0x%"PRIx16
vtd_warn_ir_vector(uint16_t sid, int index, int vec, int target) "sid 0x%"PRIx16" index %d vec %d (should be: %d)"
vtd_warn_ir_trigger(uint16_t sid, int index, int trig, int target) "sid 0x%"PRIx16" index %d trigger %d (should be: %d)"
vtd_reset_exit(void) ""
+vtd_device_attach_hwpt(uint32_t dev_id, uint32_t pasid, uint32_t hwpt_id, int ret) "dev_id %d pasid %d hwpt_id %d, ret: %d"
+vtd_device_detach_hwpt(uint32_t dev_id, uint32_t pasid, int ret) "dev_id %d pasid %d ret: %d"
+vtd_device_reattach_def_hwpt(uint32_t dev_id, uint32_t pasid, uint32_t hwpt_id, int ret) "dev_id %d pasid %d hwpt_id %d, ret: %d"
# amd_iommu.c
amdvi_evntlog_fail(uint64_t addr, uint32_t head) "error: fail to write at addr 0x%"PRIx64" + offset 0x%"PRIx32
--
2.47.1
^ permalink raw reply related [flat|nested] 41+ messages in thread
* [PATCH v4 15/20] intel_iommu: Replay pasid bindings after context cache invalidation
2025-07-29 9:20 [PATCH v4 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
` (13 preceding siblings ...)
2025-07-29 9:20 ` [PATCH v4 14/20] intel_iommu: Bind/unbind guest page table to host Zhenzhong Duan
@ 2025-07-29 9:20 ` Zhenzhong Duan
2025-07-29 9:20 ` [PATCH v4 16/20] intel_iommu: Propagate PASID-based iotlb invalidation to host Zhenzhong Duan
` (5 subsequent siblings)
20 siblings, 0 replies; 41+ messages in thread
From: Zhenzhong Duan @ 2025-07-29 9:20 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng, Yi Sun,
Zhenzhong Duan
From: Yi Liu <yi.l.liu@intel.com>
This replays guest pasid bindings after a context cache invalidation.
This is done to ensure safety; strictly speaking, the guest software should
issue a pasid cache invalidation with proper granularity after issuing a
context cache invalidation.
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
hw/i386/intel_iommu_internal.h | 2 ++
hw/i386/intel_iommu.c | 42 ++++++++++++++++++++++++++++++++++
hw/i386/trace-events | 1 +
3 files changed, 45 insertions(+)
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 61e35dbdc0..8af1004888 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -584,6 +584,8 @@ typedef enum VTDPCInvType {
/* Reset all PASID cache entries, used in system level reset */
VTD_PASID_CACHE_FORCE_RESET = 0x10,
+ /* Invalidate all PASID entries in a device */
+ VTD_PASID_CACHE_DEVSI,
} VTDPCInvType;
typedef struct VTDPASIDCacheInfo {
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 0fdaf0b0bb..6620e975f3 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -91,6 +91,10 @@ static void vtd_address_space_refresh_all(IntelIOMMUState *s);
static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
static void vtd_pasid_cache_reset_locked(IntelIOMMUState *s);
+static void vtd_pasid_cache_sync(IntelIOMMUState *s,
+ VTDPASIDCacheInfo *pc_info);
+static void vtd_pasid_cache_devsi(IntelIOMMUState *s,
+ PCIBus *bus, uint16_t devfn);
static void vtd_panic_require_caching_mode(void)
{
@@ -2442,6 +2446,8 @@ static void vtd_iommu_replay_all(IntelIOMMUState *s)
static void vtd_context_global_invalidate(IntelIOMMUState *s)
{
+ VTDPASIDCacheInfo pc_info;
+
trace_vtd_inv_desc_cc_global();
/* Protects context cache */
vtd_iommu_lock(s);
@@ -2459,6 +2465,9 @@ static void vtd_context_global_invalidate(IntelIOMMUState *s)
* VT-d emulation codes.
*/
vtd_iommu_replay_all(s);
+
+ pc_info.type = VTD_PASID_CACHE_GLOBAL_INV;
+ vtd_pasid_cache_sync(s, &pc_info);
}
#ifdef CONFIG_IOMMUFD
@@ -2691,6 +2700,15 @@ static void vtd_context_device_invalidate(IntelIOMMUState *s,
* happened.
*/
vtd_address_space_sync(vtd_as);
+ /*
+ * Per spec, context flush should also be followed with PASID
+ * cache and iotlb flush. In order to work with a guest which
+ * doesn't follow spec and missed PASID cache flush, we have
+ * vtd_pasid_cache_devsi() to invalidate PASID caches of the
+ * passthrough device. Host iommu driver would flush piotlb
+ * when a pasid unbind is passed down to it.
+ */
+ vtd_pasid_cache_devsi(s, vtd_as->bus, devfn);
}
}
}
@@ -3419,6 +3437,11 @@ static gboolean vtd_flush_pasid_locked(gpointer key, gpointer value,
break;
case VTD_PASID_CACHE_FORCE_RESET:
goto remove;
+ case VTD_PASID_CACHE_DEVSI:
+ if (pc_info->bus != vtd_as->bus || pc_info->devfn != vtd_as->devfn) {
+ return false;
+ }
+ break;
default:
error_setg(&error_fatal, "invalid pc_info->type for flush");
}
@@ -3632,6 +3655,11 @@ static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
case VTD_PASID_CACHE_FORCE_RESET:
/* For force reset, no need to go further replay */
return;
+ case VTD_PASID_CACHE_DEVSI:
+ walk_info.bus = pc_info->bus;
+ walk_info.devfn = pc_info->devfn;
+ vtd_replay_pasid_bind_for_dev(s, start, end, &walk_info);
+ return;
default:
error_setg(&error_fatal, "invalid pc_info->type for replay");
}
@@ -3680,6 +3708,20 @@ static void vtd_pasid_cache_sync(IntelIOMMUState *s, VTDPASIDCacheInfo *pc_info)
vtd_replay_guest_pasid_bindings(s, pc_info);
}
+static void vtd_pasid_cache_devsi(IntelIOMMUState *s,
+ PCIBus *bus, uint16_t devfn)
+{
+ VTDPASIDCacheInfo pc_info;
+
+ trace_vtd_pasid_cache_devsi(devfn);
+
+ pc_info.type = VTD_PASID_CACHE_DEVSI;
+ pc_info.bus = bus;
+ pc_info.devfn = devfn;
+
+ vtd_pasid_cache_sync(s, &pc_info);
+}
+
static bool vtd_process_pasid_desc(IntelIOMMUState *s,
VTDInvDesc *inv_desc)
{
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index 1c31b9a873..830b11f68b 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -28,6 +28,7 @@ vtd_pasid_cache_reset(void) ""
vtd_pasid_cache_gsi(void) ""
vtd_pasid_cache_dsi(uint16_t domain) "Domain selective PC invalidation domain 0x%"PRIx16
vtd_pasid_cache_psi(uint16_t domain, uint32_t pasid) "PASID selective PC invalidation domain 0x%"PRIx16" pasid 0x%"PRIx32
+vtd_pasid_cache_devsi(uint16_t devfn) "Dev selective PC invalidation dev: 0x%"PRIx16
vtd_re_not_present(uint8_t bus) "Root entry bus %"PRIu8" not present"
vtd_ce_not_present(uint8_t bus, uint8_t devfn) "Context entry bus %"PRIu8" devfn %"PRIu8" not present"
vtd_iotlb_page_hit(uint16_t sid, uint64_t addr, uint64_t slpte, uint16_t domain) "IOTLB page hit sid 0x%"PRIx16" iova 0x%"PRIx64" slpte 0x%"PRIx64" domain 0x%"PRIx16
--
2.47.1
^ permalink raw reply related [flat|nested] 41+ messages in thread
* [PATCH v4 16/20] intel_iommu: Propagate PASID-based iotlb invalidation to host
2025-07-29 9:20 [PATCH v4 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
` (14 preceding siblings ...)
2025-07-29 9:20 ` [PATCH v4 15/20] intel_iommu: Replay pasid bindings after context cache invalidation Zhenzhong Duan
@ 2025-07-29 9:20 ` Zhenzhong Duan
2025-07-29 9:20 ` [PATCH v4 17/20] intel_iommu: Replay all pasid bindings when either SRTP or TE bit is changed Zhenzhong Duan
` (4 subsequent siblings)
20 siblings, 0 replies; 41+ messages in thread
From: Zhenzhong Duan @ 2025-07-29 9:20 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng, Yi Sun,
Zhenzhong Duan
From: Yi Liu <yi.l.liu@intel.com>
This traps the guest PASID-based iotlb invalidation request and propagates
it to the host.
Intel VT-d 3.0 supports nested translation at PASID granularity. Guest SVA
support could be implemented by configuring nested translation on a specific
pasid. This is also known as dual stage DMA translation.
Under such a configuration, the guest owns the GVA->GPA translation, which
is configured as the stage-1 page table on the host side for a specific
pasid, while the host owns the GPA->HPA translation. Since the guest owns
the stage-1 translation table, piotlb invalidations must be propagated to
the host, because the host IOMMU caches first-stage page table related
mappings during DMA address translation.
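Illustrative only: a standalone sketch of how a PIOTLB descriptor's
(addr, am, ih) triple maps onto the range forwarded to the host. The real
request is an iommu_hwpt_vtd_s1_invalidate passed to
iommufd_backend_invalidate_cache(), see the diff below; the names here are
local to the example.
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>
    /* A page-selective PIOTLB invalidation covers 2^am pages starting at
     * addr; the IH bit is forwarded as a "leaf only" hint. A PASID-selective
     * flush uses addr = 0 and npages = (uint64_t)-1. */
    struct piotlb_range {
        uint64_t addr;
        uint64_t npages;
        bool     leaf;
    };
    static struct piotlb_range decode_piotlb(uint64_t addr, uint8_t am, bool ih)
    {
        struct piotlb_range r = { addr, 1ULL << am, ih };
        return r;
    }
    int main(void)
    {
        struct piotlb_range r = decode_piotlb(0x100000, 2, true);
        printf("addr 0x%llx npages %llu leaf %d\n",
               (unsigned long long)r.addr, (unsigned long long)r.npages, r.leaf);
        return 0;
    }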
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
hw/i386/intel_iommu_internal.h | 6 +++
hw/i386/intel_iommu.c | 95 +++++++++++++++++++++++++++++++++-
2 files changed, 99 insertions(+), 2 deletions(-)
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 8af1004888..c1a9263651 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -596,6 +596,12 @@ typedef struct VTDPASIDCacheInfo {
uint16_t devfn;
} VTDPASIDCacheInfo;
+typedef struct VTDPIOTLBInvInfo {
+ uint16_t domain_id;
+ uint32_t pasid;
+ struct iommu_hwpt_vtd_s1_invalidate *inv_data;
+} VTDPIOTLBInvInfo;
+
/* PASID Table Related Definitions */
#define VTD_PASID_DIR_BASE_ADDR_MASK (~0xfffULL)
#define VTD_PASID_TABLE_BASE_ADDR_MASK (~0xfffULL)
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 6620e975f3..27bd8c4c89 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -2611,12 +2611,99 @@ static int vtd_bind_guest_pasid(VTDAddressSpace *vtd_as, VTDPASIDOp op,
return ret;
}
+
+static void
+vtd_invalidate_piotlb_locked(VTDAddressSpace *vtd_as,
+ struct iommu_hwpt_vtd_s1_invalidate *cache)
+{
+ IntelIOMMUState *s = vtd_as->iommu_state;
+ VTDHostIOMMUDevice *vtd_hiod = vtd_find_hiod_iommufd(s, vtd_as);
+ HostIOMMUDeviceIOMMUFD *idev;
+ uint32_t entry_num = 1; /* Only implement one request for simplicity */
+ Error *local_err = NULL;
+
+ if (!vtd_hiod || !vtd_as->s1_hwpt) {
+ return;
+ }
+ idev = HOST_IOMMU_DEVICE_IOMMUFD(vtd_hiod->hiod);
+
+ if (!iommufd_backend_invalidate_cache(idev->iommufd, vtd_as->s1_hwpt,
+ IOMMU_HWPT_INVALIDATE_DATA_VTD_S1,
+ sizeof(*cache), &entry_num, cache,
+ &local_err)) {
+ /* Something wrong in kernel, but trying to continue */
+ error_report_err(local_err);
+ }
+}
+
+/*
+ * This function is a loop function for the s->vtd_address_spaces
+ * list with VTDPIOTLBInvInfo as execution filter. It propagates
+ * the piotlb invalidation to host.
+ */
+static void vtd_flush_host_piotlb_locked(gpointer key, gpointer value,
+ gpointer user_data)
+{
+ VTDPIOTLBInvInfo *piotlb_info = user_data;
+ VTDAddressSpace *vtd_as = value;
+ VTDPASIDCacheEntry *pc_entry = &vtd_as->pasid_cache_entry;
+ uint32_t pasid;
+ uint16_t did;
+
+ /* Replay only fills pasid entry cache for passthrough device */
+ if (!pc_entry->valid ||
+ !vtd_pe_pgtt_is_flt(&pc_entry->pasid_entry)) {
+ return;
+ }
+
+ if (vtd_as_to_iommu_pasid_locked(vtd_as, &pasid)) {
+ return;
+ }
+
+ did = VTD_SM_PASID_ENTRY_DID(&pc_entry->pasid_entry);
+
+ if (piotlb_info->domain_id == did && piotlb_info->pasid == pasid) {
+ vtd_invalidate_piotlb_locked(vtd_as, piotlb_info->inv_data);
+ }
+}
+
+static void
+vtd_flush_host_piotlb_all_locked(IntelIOMMUState *s,
+ uint16_t domain_id, uint32_t pasid,
+ hwaddr addr, uint64_t npages, bool ih)
+{
+ struct iommu_hwpt_vtd_s1_invalidate cache_info = { 0 };
+ VTDPIOTLBInvInfo piotlb_info;
+
+ cache_info.addr = addr;
+ cache_info.npages = npages;
+ cache_info.flags = ih ? IOMMU_VTD_INV_FLAGS_LEAF : 0;
+
+ piotlb_info.domain_id = domain_id;
+ piotlb_info.pasid = pasid;
+ piotlb_info.inv_data = &cache_info;
+
+ /*
+ * Go through each vtd_as instance in s->vtd_address_spaces, find out
+ * the affected host device which need host piotlb invalidation. Piotlb
+ * invalidation should check pasid cache per architecture point of view.
+ */
+ g_hash_table_foreach(s->vtd_address_spaces,
+ vtd_flush_host_piotlb_locked, &piotlb_info);
+}
#else
static int vtd_bind_guest_pasid(VTDAddressSpace *vtd_as, VTDPASIDOp op,
Error **errp)
{
return 0;
}
+
+static void
+vtd_flush_host_piotlb_all_locked(IntelIOMMUState *s,
+ uint16_t domain_id, uint32_t pasid,
+ hwaddr addr, uint64_t npages, bool ih)
+{
+}
#endif
static int vtd_bind_guest_pasid_report_err(VTDAddressSpace *vtd_as,
@@ -3292,6 +3379,7 @@ static void vtd_piotlb_pasid_invalidate(IntelIOMMUState *s,
vtd_iommu_lock(s);
g_hash_table_foreach_remove(s->iotlb, vtd_hash_remove_by_pasid,
&info);
+ vtd_flush_host_piotlb_all_locked(s, domain_id, pasid, 0, (uint64_t)-1, 0);
vtd_iommu_unlock(s);
QLIST_FOREACH(vtd_as, &s->vtd_as_with_notifiers, next) {
@@ -3313,7 +3401,8 @@ static void vtd_piotlb_pasid_invalidate(IntelIOMMUState *s,
}
static void vtd_piotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
- uint32_t pasid, hwaddr addr, uint8_t am)
+ uint32_t pasid, hwaddr addr, uint8_t am,
+ bool ih)
{
VTDIOTLBPageInvInfo info;
@@ -3325,6 +3414,7 @@ static void vtd_piotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
vtd_iommu_lock(s);
g_hash_table_foreach_remove(s->iotlb,
vtd_hash_remove_by_page_piotlb, &info);
+ vtd_flush_host_piotlb_all_locked(s, domain_id, pasid, addr, 1 << am, ih);
vtd_iommu_unlock(s);
vtd_iotlb_page_invalidate_notify(s, domain_id, addr, am, pasid);
@@ -3356,7 +3446,8 @@ static bool vtd_process_piotlb_desc(IntelIOMMUState *s,
case VTD_INV_DESC_PIOTLB_PSI_IN_PASID:
am = VTD_INV_DESC_PIOTLB_AM(inv_desc->val[1]);
addr = (hwaddr) VTD_INV_DESC_PIOTLB_ADDR(inv_desc->val[1]);
- vtd_piotlb_page_invalidate(s, domain_id, pasid, addr, am);
+ vtd_piotlb_page_invalidate(s, domain_id, pasid, addr, am,
+ VTD_INV_DESC_PIOTLB_IH(inv_desc->val[1]));
break;
default:
--
2.47.1
^ permalink raw reply related [flat|nested] 41+ messages in thread
* [PATCH v4 17/20] intel_iommu: Replay all pasid bindings when either SRTP or TE bit is changed
2025-07-29 9:20 [PATCH v4 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
` (15 preceding siblings ...)
2025-07-29 9:20 ` [PATCH v4 16/20] intel_iommu: Propagate PASID-based iotlb invalidation to host Zhenzhong Duan
@ 2025-07-29 9:20 ` Zhenzhong Duan
2025-07-29 9:20 ` [PATCH v4 18/20] vfio: Add a new element bypass_ro in VFIOContainerBase Zhenzhong Duan
` (3 subsequent siblings)
20 siblings, 0 replies; 41+ messages in thread
From: Zhenzhong Duan @ 2025-07-29 9:20 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan
From: Yi Liu <yi.l.liu@intel.com>
When either the 'Set Root Table Pointer' or the 'Translation Enable' bit is
changed, all pasid bindings on the host side become stale and need to be
updated.
Introduce a helper function vtd_replay_pasid_bindings_all() which goes
through all pasid entries of all passthrough devices and updates the host
side bindings.
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
hw/i386/intel_iommu.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 27bd8c4c89..d2442ff28d 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -89,6 +89,7 @@ struct vtd_iotlb_key {
static void vtd_address_space_refresh_all(IntelIOMMUState *s);
static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
+static void vtd_replay_pasid_bindings_all(IntelIOMMUState *s);
static void vtd_pasid_cache_reset_locked(IntelIOMMUState *s);
static void vtd_pasid_cache_sync(IntelIOMMUState *s,
@@ -3050,6 +3051,7 @@ static void vtd_handle_gcmd_srtp(IntelIOMMUState *s)
vtd_set_clear_mask_long(s, DMAR_GSTS_REG, 0, VTD_GSTS_RTPS);
vtd_reset_caches(s);
vtd_address_space_refresh_all(s);
+ vtd_replay_pasid_bindings_all(s);
}
/* Set Interrupt Remap Table Pointer */
@@ -3084,6 +3086,7 @@ static void vtd_handle_gcmd_te(IntelIOMMUState *s, bool en)
vtd_reset_caches(s);
vtd_address_space_refresh_all(s);
+ vtd_replay_pasid_bindings_all(s);
}
/* Handle Interrupt Remap Enable/Disable */
@@ -3774,6 +3777,17 @@ static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
}
}
+static void vtd_replay_pasid_bindings_all(IntelIOMMUState *s)
+{
+ VTDPASIDCacheInfo pc_info = { .type = VTD_PASID_CACHE_GLOBAL_INV };
+
+ if (!s->flts || !s->root_scalable || !s->dmar_enabled) {
+ return;
+ }
+
+ vtd_replay_guest_pasid_bindings(s, &pc_info);
+}
+
/*
* For a PASID cache invalidation, this function handles below scenarios:
* a) a present cached pasid entry needs to be removed
--
2.47.1
^ permalink raw reply related [flat|nested] 41+ messages in thread
* [PATCH v4 18/20] vfio: Add a new element bypass_ro in VFIOContainerBase
2025-07-29 9:20 [PATCH v4 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
` (16 preceding siblings ...)
2025-07-29 9:20 ` [PATCH v4 17/20] intel_iommu: Replay all pasid bindings when either SRTP or TE bit is changed Zhenzhong Duan
@ 2025-07-29 9:20 ` Zhenzhong Duan
2025-07-29 13:55 ` Cédric Le Goater
2025-07-29 9:20 ` [PATCH v4 19/20] Workaround for ERRATA_772415_SPR17 Zhenzhong Duan
` (2 subsequent siblings)
20 siblings, 1 reply; 41+ messages in thread
From: Zhenzhong Duan @ 2025-07-29 9:20 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan
When bypass_ro is true, read-only memory sections are bypassed and not
mapped in the container.
This is a preparatory patch to work around Intel ERRATA_772415.
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
include/hw/vfio/vfio-container-base.h | 1 +
hw/vfio/listener.c | 13 +++++++++----
2 files changed, 10 insertions(+), 4 deletions(-)
diff --git a/include/hw/vfio/vfio-container-base.h b/include/hw/vfio/vfio-container-base.h
index bded6e993f..31fd784d76 100644
--- a/include/hw/vfio/vfio-container-base.h
+++ b/include/hw/vfio/vfio-container-base.h
@@ -51,6 +51,7 @@ typedef struct VFIOContainerBase {
QLIST_HEAD(, VFIODevice) device_list;
GList *iova_ranges;
NotifierWithReturn cpr_reboot_notifier;
+ bool bypass_ro;
} VFIOContainerBase;
typedef struct VFIOGuestIOMMU {
diff --git a/hw/vfio/listener.c b/hw/vfio/listener.c
index f498e23a93..c64aa4539e 100644
--- a/hw/vfio/listener.c
+++ b/hw/vfio/listener.c
@@ -364,7 +364,8 @@ static bool vfio_known_safe_misalignment(MemoryRegionSection *section)
return true;
}
-static bool vfio_listener_valid_section(MemoryRegionSection *section,
+static bool vfio_listener_valid_section(VFIOContainerBase *bcontainer,
+ MemoryRegionSection *section,
const char *name)
{
if (vfio_listener_skipped_section(section)) {
@@ -375,6 +376,10 @@ static bool vfio_listener_valid_section(MemoryRegionSection *section,
return false;
}
+ if (bcontainer && bcontainer->bypass_ro && section->readonly) {
+ return false;
+ }
+
if (unlikely((section->offset_within_address_space &
~qemu_real_host_page_mask()) !=
(section->offset_within_region & ~qemu_real_host_page_mask()))) {
@@ -494,7 +499,7 @@ void vfio_container_region_add(VFIOContainerBase *bcontainer,
int ret;
Error *err = NULL;
- if (!vfio_listener_valid_section(section, "region_add")) {
+ if (!vfio_listener_valid_section(bcontainer, section, "region_add")) {
return;
}
@@ -655,7 +660,7 @@ static void vfio_listener_region_del(MemoryListener *listener,
int ret;
bool try_unmap = true;
- if (!vfio_listener_valid_section(section, "region_del")) {
+ if (!vfio_listener_valid_section(bcontainer, section, "region_del")) {
return;
}
@@ -812,7 +817,7 @@ static void vfio_dirty_tracking_update(MemoryListener *listener,
container_of(listener, VFIODirtyRangesListener, listener);
hwaddr iova, end;
- if (!vfio_listener_valid_section(section, "tracking_update") ||
+ if (!vfio_listener_valid_section(NULL, section, "tracking_update") ||
!vfio_get_section_iova_range(dirty->bcontainer, section,
&iova, &end, NULL)) {
return;
--
2.47.1
^ permalink raw reply related [flat|nested] 41+ messages in thread
* [PATCH v4 19/20] Workaround for ERRATA_772415_SPR17
2025-07-29 9:20 [PATCH v4 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
` (17 preceding siblings ...)
2025-07-29 9:20 ` [PATCH v4 18/20] vfio: Add a new element bypass_ro in VFIOContainerBase Zhenzhong Duan
@ 2025-07-29 9:20 ` Zhenzhong Duan
2025-07-29 9:20 ` [PATCH v4 20/20] intel_iommu: Enable host device when x-flts=on in scalable mode Zhenzhong Duan
2025-08-21 7:19 ` [PATCH v4 00/20] intel_iommu: Enable stage-1 translation for passthrough device Duan, Zhenzhong
20 siblings, 0 replies; 41+ messages in thread
From: Zhenzhong Duan @ 2025-07-29 9:20 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan
On a system affected by ERRATA_772415, IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17
is reported by IOMMU_DEVICE_GET_HW_INFO. Due to this erratum, even a read-only
range mapped in the stage-2 page table could still be written.
Reference: 4th Gen Intel Xeon Processor Scalable Family Specification
Update, Errata Details, SPR17.
https://edc.intel.com/content/www/us/en/design/products-and-solutions/processors-and-chipsets/eagle-stream/sapphire-rapids-specification-update/
The SPR17 details from the above link are copied below:
"Problem: When remapping hardware is configured by system software in
scalable mode as Nested (PGTT=011b) and with PWSNP field Set in the
PASID-table-entry, it may Set Accessed bit and Dirty bit (and Extended
Access bit if enabled) in first-stage page-table entries even when
second-stage mappings indicate that corresponding first-stage page-table
is Read-Only.
Implication: Due to this erratum, pages mapped as Read-only in second-stage
page-tables may be modified by remapping hardware Access/Dirty bit updates.
Workaround: None identified. System software enabling nested translations
for a VM should ensure that there are no read-only pages in the
corresponding second-stage mappings."
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
hw/vfio/iommufd.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index 61a548f13f..8156d81488 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -325,6 +325,7 @@ static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
{
ERRP_GUARD();
IOMMUFDBackend *iommufd = vbasedev->iommufd;
+ struct iommu_hw_info_vtd vtd;
uint32_t type, flags = 0;
uint64_t hw_caps;
VFIOIOASHwpt *hwpt;
@@ -372,10 +373,15 @@ static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
* instead.
*/
if (!iommufd_backend_get_device_info(vbasedev->iommufd, vbasedev->devid,
- &type, NULL, 0, &hw_caps, errp)) {
+ &type, &vtd, sizeof(vtd), &hw_caps,
+ errp)) {
return false;
}
+ if (vtd.flags & IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17) {
+ container->bcontainer.bypass_ro = true;
+ }
+
if (hw_caps & IOMMU_HW_CAP_DIRTY_TRACKING) {
flags = IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
}
--
2.47.1
^ permalink raw reply related [flat|nested] 41+ messages in thread
* [PATCH v4 20/20] intel_iommu: Enable host device when x-flts=on in scalable mode
2025-07-29 9:20 [PATCH v4 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
` (18 preceding siblings ...)
2025-07-29 9:20 ` [PATCH v4 19/20] Workaround for ERRATA_772415_SPR17 Zhenzhong Duan
@ 2025-07-29 9:20 ` Zhenzhong Duan
2025-08-21 7:19 ` [PATCH v4 00/20] intel_iommu: Enable stage-1 translation for passthrough device Duan, Zhenzhong
20 siblings, 0 replies; 41+ messages in thread
From: Zhenzhong Duan @ 2025-07-29 9:20 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan
Now that all the infrastructure for supporting passthrough devices running
with stage-1 translation is in place, enable it.
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
hw/i386/intel_iommu.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index d2442ff28d..c8375c33f2 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -5219,6 +5219,8 @@ static bool vtd_check_hiod(IntelIOMMUState *s, VTDHostIOMMUDevice *vtd_hiod,
"when x-flts=on");
return false;
}
+
+ return true;
#endif
error_setg(errp, "host IOMMU is incompatible with stage-1 translation");
--
2.47.1
^ permalink raw reply related [flat|nested] 41+ messages in thread
* Re: [PATCH v4 02/20] hw/pci: Introduce pci_device_get_viommu_cap()
2025-07-29 9:20 ` [PATCH v4 02/20] hw/pci: Introduce pci_device_get_viommu_cap() Zhenzhong Duan
@ 2025-07-29 13:19 ` Cédric Le Goater
2025-07-30 10:51 ` Duan, Zhenzhong
2025-08-13 6:43 ` Jim Shu
1 sibling, 1 reply; 41+ messages in thread
From: Cédric Le Goater @ 2025-07-29 13:19 UTC (permalink / raw)
To: Zhenzhong Duan, qemu-devel
Cc: alex.williamson, eric.auger, mst, jasowang, peterx, ddutile, jgg,
nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng
Hello Zhenzhong
On 7/29/25 11:20, Zhenzhong Duan wrote:
> Introduce a new PCIIOMMUOps optional callback, get_viommu_cap() which
> allows to retrieve capabilities exposed by a vIOMMU. The first planned
vIOMMU device
"device" is implied, but I find it easier to understand if it is stated
openly.
> capability is VIOMMU_CAP_HW_NESTED that advertises the support of HW
> nested stage translation scheme. pci_device_get_viommu_cap is a wrapper
> that can be called on a PCI device potentially protected by a vIOMMU.
>
> get_viommu_cap() is designed to return 64bit bitmap of purely emulated
> capabilities which are only derermined by user's configuration, no host
typo: determined
> capabilities involved. Reasons are:
>
> 1. there can be more than one host IOMMUs with different capabilities
"host IOMMU devices"
Such as iommufd and VFIO IOMMU Type1 ?
> 2. there can also be more than one vIOMMUs with different user
> configuration, e.g., arm smmuv3.
> 3. This is migration friendly, return value is consistent between source
> and target.
> 4. It's too late for VFIO to call get_viommu_cap() after set_iommu_device()
> because we need get_viommu_cap() to determine if creating nested parent
> hwpt or not at attaching stage, meanwhile hiod realize needs iommufd,
hiod -> "host IOMMU device"
> devid and hwpt_id which are ready after attach_device().
I find the above sentence difficult to understand.
> See below sequence:
>
> attach_device()
> get_viommu_cap()
> create hwpt
> ...
> create hiod
> set_iommu_device(hiod)
>
> Suggested-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
> MAINTAINERS | 1 +
> include/hw/iommu.h | 17 +++++++++++++++++
> include/hw/pci/pci.h | 25 +++++++++++++++++++++++++
> hw/pci/pci.c | 11 +++++++++++
> 4 files changed, 54 insertions(+)
> create mode 100644 include/hw/iommu.h
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 37879ab64e..840cb1e604 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -2304,6 +2304,7 @@ F: include/system/iommufd.h
> F: backends/host_iommu_device.c
> F: include/system/host_iommu_device.h
> F: include/qemu/chardev_open.h
> +F: include/hw/iommu.h
> F: util/chardev_open.c
> F: docs/devel/vfio-iommufd.rst
>
> diff --git a/include/hw/iommu.h b/include/hw/iommu.h
> new file mode 100644
> index 0000000000..021db50db5
> --- /dev/null
> +++ b/include/hw/iommu.h
> @@ -0,0 +1,17 @@
> +/*
> + * General vIOMMU capabilities, flags, etc
> + *
> + * Copyright (C) 2025 Intel Corporation.
> + *
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + */
> +
> +#ifndef HW_IOMMU_H
> +#define HW_IOMMU_H
> +
> +enum {
> + /* hardware nested stage-1 page table support */
> + VIOMMU_CAP_HW_NESTED = BIT_ULL(0),
> +};
> +
> +#endif /* HW_IOMMU_H */
> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> index 6b7d3ac8a3..d89aefc030 100644
> --- a/include/hw/pci/pci.h
> +++ b/include/hw/pci/pci.h
> @@ -462,6 +462,21 @@ typedef struct PCIIOMMUOps {
> * @devfn: device and function number of the PCI device.
> */
> void (*unset_iommu_device)(PCIBus *bus, void *opaque, int devfn);
> + /**
> + * @get_viommu_cap: get vIOMMU capabilities
> + *
> + * Optional callback, if not implemented, then vIOMMU doesn't
> + * support exposing capabilities to other subsystem, e.g., VFIO.
> + * vIOMMU can choose which capabilities to expose.
> + *
> + * @opaque: the data passed to pci_setup_iommu().
> + *
> + * Returns: 64bit bitmap with each bit represents a capability emulated by
> + * VIOMMU_CAP_* in include/hw/iommu.h, these capabilities are theoretical
> + * which are only determined by user's configuration and independent on the
What is meant by "user's configuration" ? the vIOMMU device properties ?
> + * actual host capabilities they may depend on.
> + */
> + uint64_t (*get_viommu_cap)(void *opaque);
> /**
> * @get_iotlb_info: get properties required to initialize a device IOTLB.
> *
> @@ -642,6 +657,16 @@ bool pci_device_set_iommu_device(PCIDevice *dev, HostIOMMUDevice *hiod,
> Error **errp);
> void pci_device_unset_iommu_device(PCIDevice *dev);
>
> +/**
> + * pci_device_get_viommu_cap: get vIOMMU capabilities.
> + *
> + * Returns a 64bit bitmap with each bit represents a vIOMMU exposed
> + * capability, 0 if vIOMMU doesn't support esposing capabilities.
exposing
Thanks,
C.
> + *
> + * @dev: PCI device pointer.
> + */
> +uint64_t pci_device_get_viommu_cap(PCIDevice *dev);
> +
> /**
> * pci_iommu_get_iotlb_info: get properties required to initialize a
> * device IOTLB.
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index c70b5ceeba..df1fb615a8 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -2992,6 +2992,17 @@ void pci_device_unset_iommu_device(PCIDevice *dev)
> }
> }
>
> +uint64_t pci_device_get_viommu_cap(PCIDevice *dev)
> +{
> + PCIBus *iommu_bus;
> +
> + pci_device_get_iommu_bus_devfn(dev, &iommu_bus, NULL, NULL);
> + if (iommu_bus && iommu_bus->iommu_ops->get_viommu_cap) {
> + return iommu_bus->iommu_ops->get_viommu_cap(iommu_bus->iommu_opaque);
> + }
> + return 0;
> +}
> +
> int pci_pri_request_page(PCIDevice *dev, uint32_t pasid, bool priv_req,
> bool exec_req, hwaddr addr, bool lpig,
> uint16_t prgi, bool is_read, bool is_write)
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH v4 04/20] vfio/iommufd: Force creating nested parent domain
2025-07-29 9:20 ` [PATCH v4 04/20] vfio/iommufd: Force creating nested parent domain Zhenzhong Duan
@ 2025-07-29 13:44 ` Cédric Le Goater
2025-07-30 10:55 ` Duan, Zhenzhong
0 siblings, 1 reply; 41+ messages in thread
From: Cédric Le Goater @ 2025-07-29 13:44 UTC (permalink / raw)
To: Zhenzhong Duan, qemu-devel
Cc: alex.williamson, eric.auger, mst, jasowang, peterx, ddutile, jgg,
nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng
On 7/29/25 11:20, Zhenzhong Duan wrote:
> Call pci_device_get_viommu_cap() to get if vIOMMU supports VIOMMU_CAP_HW_NESTED,
> if yes, create nested parent domain which could be reused by vIOMMU to create
> nested domain.
>
> It is safe because hw_caps & VIOMMU_CAP_HW_NESTED cannot be set yet because
> s->flts is forbidden until we support passthrough device with x-flts=on.
>
> Suggested-by: Nicolin Chen <nicolinc@nvidia.com>
> Suggested-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> Reviewed-by: Eric Auger <eric.auger@redhat.com>
> ---
> hw/vfio/iommufd.c | 14 ++++++++++++++
> 1 file changed, 14 insertions(+)
>
> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
> index 48c590b6a9..61a548f13f 100644
> --- a/hw/vfio/iommufd.c
> +++ b/hw/vfio/iommufd.c
> @@ -20,6 +20,7 @@
> #include "trace.h"
> #include "qapi/error.h"
> #include "system/iommufd.h"
> +#include "hw/iommu.h"
> #include "hw/qdev-core.h"
> #include "hw/vfio/vfio-cpr.h"
> #include "system/reset.h"
> @@ -379,6 +380,19 @@ static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
> flags = IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
> }
>
> + /*
> + * If vIOMMU supports stage-1 translation, force to create nested parent
> + * domain which could be reused by vIOMMU to create nested domain.
> + */
> + if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
> + VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
> +
> + hw_caps = pci_device_get_viommu_cap(&vdev->pdev);
> + if (hw_caps & VIOMMU_CAP_HW_NESTED) {
> + flags |= IOMMU_HWPT_ALLOC_NEST_PARENT;
> + }
> + }
>
Could you please add a wrapper for the above ? Something like :
static bool vfio_device_viommu_get_nested(VFIODevice *vbasedev)
{
if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
return !!(pci_device_get_viommu_cap(&vdev->pdev) & VIOMMU_CAP_HW_NESTED);
}
return false;
}
May be this routine belongs to hw/vfio/device.c.
Thanks,
C.
> if (cpr_is_incoming()) {
> hwpt_id = vbasedev->cpr.hwpt_id;
> goto skip_alloc;
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH v4 18/20] vfio: Add a new element bypass_ro in VFIOContainerBase
2025-07-29 9:20 ` [PATCH v4 18/20] vfio: Add a new element bypass_ro in VFIOContainerBase Zhenzhong Duan
@ 2025-07-29 13:55 ` Cédric Le Goater
2025-07-30 10:58 ` Duan, Zhenzhong
0 siblings, 1 reply; 41+ messages in thread
From: Cédric Le Goater @ 2025-07-29 13:55 UTC (permalink / raw)
To: Zhenzhong Duan, qemu-devel
Cc: alex.williamson, eric.auger, mst, jasowang, peterx, ddutile, jgg,
nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng
On 7/29/25 11:20, Zhenzhong Duan wrote:
> When bypass_ro is true, read only memory section is bypassed from
> mapping in the container.
>
> This is a preparing patch to workaround Intel ERRATA_772415.
>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
> include/hw/vfio/vfio-container-base.h | 1 +
> hw/vfio/listener.c | 13 +++++++++----
> 2 files changed, 10 insertions(+), 4 deletions(-)
>
> diff --git a/include/hw/vfio/vfio-container-base.h b/include/hw/vfio/vfio-container-base.h
> index bded6e993f..31fd784d76 100644
> --- a/include/hw/vfio/vfio-container-base.h
> +++ b/include/hw/vfio/vfio-container-base.h
> @@ -51,6 +51,7 @@ typedef struct VFIOContainerBase {
> QLIST_HEAD(, VFIODevice) device_list;
> GList *iova_ranges;
> NotifierWithReturn cpr_reboot_notifier;
> + bool bypass_ro;
> } VFIOContainerBase;
>
> typedef struct VFIOGuestIOMMU {
> diff --git a/hw/vfio/listener.c b/hw/vfio/listener.c
> index f498e23a93..c64aa4539e 100644
> --- a/hw/vfio/listener.c
> +++ b/hw/vfio/listener.c
> @@ -364,7 +364,8 @@ static bool vfio_known_safe_misalignment(MemoryRegionSection *section)
> return true;
> }
>
> -static bool vfio_listener_valid_section(MemoryRegionSection *section,
> +static bool vfio_listener_valid_section(VFIOContainerBase *bcontainer,
> + MemoryRegionSection *section,
> const char *name)
Instead of adding a 'VFIOContainerBase *' argument, I would add an
extra 'bool bypass_ro' argument.
Thanks,
C.
> {
> if (vfio_listener_skipped_section(section)) {
> @@ -375,6 +376,10 @@ static bool vfio_listener_valid_section(MemoryRegionSection *section,
> return false;
> }
>
> + if (bcontainer && bcontainer->bypass_ro && section->readonly) {
> + return false;
> + }
> +
> if (unlikely((section->offset_within_address_space &
> ~qemu_real_host_page_mask()) !=
> (section->offset_within_region & ~qemu_real_host_page_mask()))) {
> @@ -494,7 +499,7 @@ void vfio_container_region_add(VFIOContainerBase *bcontainer,
> int ret;
> Error *err = NULL;
>
> - if (!vfio_listener_valid_section(section, "region_add")) {
> + if (!vfio_listener_valid_section(bcontainer, section, "region_add")) {
> return;
> }
>
> @@ -655,7 +660,7 @@ static void vfio_listener_region_del(MemoryListener *listener,
> int ret;
> bool try_unmap = true;
>
> - if (!vfio_listener_valid_section(section, "region_del")) {
> + if (!vfio_listener_valid_section(bcontainer, section, "region_del")) {
> return;
> }
>
> @@ -812,7 +817,7 @@ static void vfio_dirty_tracking_update(MemoryListener *listener,
> container_of(listener, VFIODirtyRangesListener, listener);
> hwaddr iova, end;
>
> - if (!vfio_listener_valid_section(section, "tracking_update") ||
> + if (!vfio_listener_valid_section(NULL, section, "tracking_update") ||
> !vfio_get_section_iova_range(dirty->bcontainer, section,
> &iova, &end, NULL)) {
> return;
^ permalink raw reply [flat|nested] 41+ messages in thread
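For illustration, the alternative Cédric suggests would look roughly like the sketch below, threading a plain boolean through instead of the container pointer (names follow the quoted diff; this is not the final code):

static bool vfio_listener_valid_section(MemoryRegionSection *section,
                                        bool bypass_ro,
                                        const char *name)
{
    if (vfio_listener_skipped_section(section)) {
        return false;
    }

    /* Skip read-only sections, e.g. for the ERRATA_772415 workaround. */
    if (bypass_ro && section->readonly) {
        return false;
    }

    /* ... alignment checks as in the existing function ... */
    return true;
}

Callers in region_add/region_del would then pass bcontainer->bypass_ro, while the dirty-tracking update would pass false, mirroring the NULL it passes today.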
* RE: [PATCH v4 02/20] hw/pci: Introduce pci_device_get_viommu_cap()
2025-07-29 13:19 ` Cédric Le Goater
@ 2025-07-30 10:51 ` Duan, Zhenzhong
2025-08-04 18:45 ` Nicolin Chen
0 siblings, 1 reply; 41+ messages in thread
From: Duan, Zhenzhong @ 2025-07-30 10:51 UTC (permalink / raw)
To: Cédric Le Goater, qemu-devel@nongnu.org
Cc: alex.williamson@redhat.com, eric.auger@redhat.com, mst@redhat.com,
jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
jgg@nvidia.com, nicolinc@nvidia.com,
shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
Peng, Chao P
Hi Cédric,
>-----Original Message-----
>From: Cédric Le Goater <clg@redhat.com>
>Subject: Re: [PATCH v4 02/20] hw/pci: Introduce
>pci_device_get_viommu_cap()
>
>Hello Zhenzhong
>
>On 7/29/25 11:20, Zhenzhong Duan wrote:
>> Introduce a new PCIIOMMUOps optional callback, get_viommu_cap() which
>> allows to retrieve capabilities exposed by a vIOMMU. The first planned
> vIOMMU device
>
>"device" is implied, but I find it easier to understand if it is stated
>openly.
Will do.
>
>> capability is VIOMMU_CAP_HW_NESTED that advertises the support of HW
>> nested stage translation scheme. pci_device_get_viommu_cap is a wrapper
>> that can be called on a PCI device potentially protected by a vIOMMU.
>>
>> get_viommu_cap() is designed to return 64bit bitmap of purely emulated
>> capabilities which are only derermined by user's configuration, no host
>
>typo: determined
Will fix.
>
>> capabilities involved. Reasons are:
>>
>> 1. there can be more than one host IOMMUs with different capabilities
>
>"host IOMMU devices"
Will do.
>
>Such as iommufd and VFIO IOMMU Type1 ?
Not VFIO iommu backend, I mean host iommu hardware units.
>
>> 2. there can also be more than one vIOMMUs with different user
>> configuration, e.g., arm smmuv3.
>> 3. This is migration friendly, return value is consistent between source
>> and target.
>> 4. It's too late for VFIO to call get_viommu_cap() after set_iommu_device()
>> because we need get_viommu_cap() to determine if creating nested
>parent
>> hwpt or not at attaching stage, meanwhile hiod realize needs
>iommufd,
>
>hiod -> "host IOMMU device"
Will do.
>
>> devid and hwpt_id which are ready after attach_device().
>
>I find the above sentence difficult to understand.
This is trying to explain the reason for the ordering between attach_device(), get_viommu_cap() and hiod realization.
What about:
4. host IOMMU capabilities are passed to vIOMMU through set_iommu_device()
interface which have to be after attach_device(), when get_viommu_cap()
is called in attach_device(), there is no way for vIOMMU to get host
IOMMU capabilities yet, so only emulated capabilities can be returned.
See below sequence:
attach_device()
get_viommu_cap()
create hwpt
...
vfio_device_hiod_create_and_realize()
set_iommu_device(hiod)
>
>
>> See below sequence:
>>
>> attach_device()
>> get_viommu_cap()
>> create hwpt
>> ...
>> create hiod
>> set_iommu_device(hiod)
>>
>> Suggested-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>> MAINTAINERS | 1 +
>> include/hw/iommu.h | 17 +++++++++++++++++
>> include/hw/pci/pci.h | 25 +++++++++++++++++++++++++
>> hw/pci/pci.c | 11 +++++++++++
>> 4 files changed, 54 insertions(+)
>> create mode 100644 include/hw/iommu.h
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index 37879ab64e..840cb1e604 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -2304,6 +2304,7 @@ F: include/system/iommufd.h
>> F: backends/host_iommu_device.c
>> F: include/system/host_iommu_device.h
>> F: include/qemu/chardev_open.h
>> +F: include/hw/iommu.h
>> F: util/chardev_open.c
>> F: docs/devel/vfio-iommufd.rst
>>
>> diff --git a/include/hw/iommu.h b/include/hw/iommu.h
>> new file mode 100644
>> index 0000000000..021db50db5
>> --- /dev/null
>> +++ b/include/hw/iommu.h
>> @@ -0,0 +1,17 @@
>> +/*
>> + * General vIOMMU capabilities, flags, etc
>> + *
>> + * Copyright (C) 2025 Intel Corporation.
>> + *
>> + * SPDX-License-Identifier: GPL-2.0-or-later
>> + */
>> +
>> +#ifndef HW_IOMMU_H
>> +#define HW_IOMMU_H
>> +
>> +enum {
>> + /* hardware nested stage-1 page table support */
>> + VIOMMU_CAP_HW_NESTED = BIT_ULL(0),
>> +};
>> +
>> +#endif /* HW_IOMMU_H */
>> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
>> index 6b7d3ac8a3..d89aefc030 100644
>> --- a/include/hw/pci/pci.h
>> +++ b/include/hw/pci/pci.h
>> @@ -462,6 +462,21 @@ typedef struct PCIIOMMUOps {
>> * @devfn: device and function number of the PCI device.
>> */
>> void (*unset_iommu_device)(PCIBus *bus, void *opaque, int devfn);
>> + /**
>> + * @get_viommu_cap: get vIOMMU capabilities
>> + *
>> + * Optional callback, if not implemented, then vIOMMU doesn't
>> + * support exposing capabilities to other subsystem, e.g., VFIO.
>> + * vIOMMU can choose which capabilities to expose.
>> + *
>> + * @opaque: the data passed to pci_setup_iommu().
>> + *
>> + * Returns: 64bit bitmap with each bit represents a capability
>emulated by
>> + * VIOMMU_CAP_* in include/hw/iommu.h, these capabilities are
>theoretical
>> + * which are only determined by user's configuration and
>independent on the
>
>What is meant by "user's configuration" ? the vIOMMU device properties ?
Yes, I mean user's qemu cmdline configuration.
>
>> + * actual host capabilities they may depend on.
>> + */
>> + uint64_t (*get_viommu_cap)(void *opaque);
>> /**
>> * @get_iotlb_info: get properties required to initialize a device
>IOTLB.
>> *
>> @@ -642,6 +657,16 @@ bool pci_device_set_iommu_device(PCIDevice
>*dev, HostIOMMUDevice *hiod,
>> Error **errp);
>> void pci_device_unset_iommu_device(PCIDevice *dev);
>>
>> +/**
>> + * pci_device_get_viommu_cap: get vIOMMU capabilities.
>> + *
>> + * Returns a 64bit bitmap with each bit represents a vIOMMU exposed
>> + * capability, 0 if vIOMMU doesn't support esposing capabilities.
>
>exposing
Will fix.
Thanks
Zhenzhong
^ permalink raw reply [flat|nested] 41+ messages in thread
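To make the "purely emulated capabilities" point concrete, a vIOMMU-side callback could look roughly like the sketch below. This is only an illustration: the real implementation lands in patch 03/20, and the vtd_get_viommu_cap name and the use of s->flts (the x-flts property) are assumptions based on the discussion above.

static uint64_t vtd_get_viommu_cap(void *opaque)
{
    IntelIOMMUState *s = opaque;

    /*
     * Derived purely from the user's intel-iommu configuration;
     * no host IOMMU capability is consulted here.
     */
    return s->flts ? VIOMMU_CAP_HW_NESTED : 0;
}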
* RE: [PATCH v4 04/20] vfio/iommufd: Force creating nested parent domain
2025-07-29 13:44 ` Cédric Le Goater
@ 2025-07-30 10:55 ` Duan, Zhenzhong
2025-07-30 14:00 ` Cédric Le Goater
0 siblings, 1 reply; 41+ messages in thread
From: Duan, Zhenzhong @ 2025-07-30 10:55 UTC (permalink / raw)
To: Cédric Le Goater, qemu-devel@nongnu.org
Cc: alex.williamson@redhat.com, eric.auger@redhat.com, mst@redhat.com,
jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
jgg@nvidia.com, nicolinc@nvidia.com,
shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
Peng, Chao P
>-----Original Message-----
>From: Cédric Le Goater <clg@redhat.com>
>Subject: Re: [PATCH v4 04/20] vfio/iommufd: Force creating nested parent
>domain
>
>On 7/29/25 11:20, Zhenzhong Duan wrote:
>> Call pci_device_get_viommu_cap() to get if vIOMMU supports
>VIOMMU_CAP_HW_NESTED,
>> if yes, create nested parent domain which could be reused by vIOMMU to
>create
>> nested domain.
>>
>> It is safe because hw_caps & VIOMMU_CAP_HW_NESTED cannot be set yet
>because
>> s->flts is forbidden until we support passthrough device with x-flts=on.
>>
>> Suggested-by: Nicolin Chen <nicolinc@nvidia.com>
>> Suggested-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> Reviewed-by: Eric Auger <eric.auger@redhat.com>
>> ---
>> hw/vfio/iommufd.c | 14 ++++++++++++++
>> 1 file changed, 14 insertions(+)
>>
>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>> index 48c590b6a9..61a548f13f 100644
>> --- a/hw/vfio/iommufd.c
>> +++ b/hw/vfio/iommufd.c
>> @@ -20,6 +20,7 @@
>> #include "trace.h"
>> #include "qapi/error.h"
>> #include "system/iommufd.h"
>> +#include "hw/iommu.h"
>> #include "hw/qdev-core.h"
>> #include "hw/vfio/vfio-cpr.h"
>> #include "system/reset.h"
>> @@ -379,6 +380,19 @@ static bool
>iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
>> flags = IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
>> }
>>
>> + /*
>> + * If vIOMMU supports stage-1 translation, force to create nested
>parent
>> + * domain which could be reused by vIOMMU to create nested
>domain.
>> + */
>> + if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
>> + VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice,
>vbasedev);
>> +
>> + hw_caps = pci_device_get_viommu_cap(&vdev->pdev);
>> + if (hw_caps & VIOMMU_CAP_HW_NESTED) {
>> + flags |= IOMMU_HWPT_ALLOC_NEST_PARENT;
>> + }
>> + }
>>
>
>Could you please add a wrapper for the above ? Something like :
>
>static bool vfio_device_viommu_get_nested(VFIODevice *vbasedev)
>{
> if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
> VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice,
>vbasedev);
>
> return !!(pci_device_get_viommu_cap(&vdev->pdev) &
>VIOMMU_CAP_HW_NESTED);
> }
> return false;
>}
>
>May be this routine belongs to hw/vfio/device.c.
Done, see https://github.com/yiliu1765/qemu/commit/7ce1af90d5c1f418f23c9e397a16a3914b30f09f
I also introduced another wrapper vfio_device_to_vfio_pci(); let me know if you think it's unnecessary and I'll fold it.

Thanks
Zhenzhong
^ permalink raw reply [flat|nested] 41+ messages in thread
* RE: [PATCH v4 18/20] vfio: Add a new element bypass_ro in VFIOContainerBase
2025-07-29 13:55 ` Cédric Le Goater
@ 2025-07-30 10:58 ` Duan, Zhenzhong
2025-07-30 14:18 ` Cédric Le Goater
0 siblings, 1 reply; 41+ messages in thread
From: Duan, Zhenzhong @ 2025-07-30 10:58 UTC (permalink / raw)
To: Cédric Le Goater, qemu-devel@nongnu.org
Cc: alex.williamson@redhat.com, eric.auger@redhat.com, mst@redhat.com,
jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
jgg@nvidia.com, nicolinc@nvidia.com,
shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
Peng, Chao P
>-----Original Message-----
>From: Cédric Le Goater <clg@redhat.com>
>Subject: Re: [PATCH v4 18/20] vfio: Add a new element bypass_ro in
>VFIOContainerBase
>
>On 7/29/25 11:20, Zhenzhong Duan wrote:
>> When bypass_ro is true, read only memory section is bypassed from
>> mapping in the container.
>>
>> This is a preparing patch to workaround Intel ERRATA_772415.
>>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>> include/hw/vfio/vfio-container-base.h | 1 +
>> hw/vfio/listener.c | 13 +++++++++----
>> 2 files changed, 10 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/hw/vfio/vfio-container-base.h
>b/include/hw/vfio/vfio-container-base.h
>> index bded6e993f..31fd784d76 100644
>> --- a/include/hw/vfio/vfio-container-base.h
>> +++ b/include/hw/vfio/vfio-container-base.h
>> @@ -51,6 +51,7 @@ typedef struct VFIOContainerBase {
>> QLIST_HEAD(, VFIODevice) device_list;
>> GList *iova_ranges;
>> NotifierWithReturn cpr_reboot_notifier;
>> + bool bypass_ro;
>> } VFIOContainerBase;
>>
>> typedef struct VFIOGuestIOMMU {
>> diff --git a/hw/vfio/listener.c b/hw/vfio/listener.c
>> index f498e23a93..c64aa4539e 100644
>> --- a/hw/vfio/listener.c
>> +++ b/hw/vfio/listener.c
>> @@ -364,7 +364,8 @@ static bool
>vfio_known_safe_misalignment(MemoryRegionSection *section)
>> return true;
>> }
>>
>> -static bool vfio_listener_valid_section(MemoryRegionSection *section,
>> +static bool vfio_listener_valid_section(VFIOContainerBase *bcontainer,
>> + MemoryRegionSection
>*section,
>> const char *name)
>
>Instead of adding a 'VFIOContainerBase *' argument, I would add an
>extra 'bool bypass_ro' argument.
Done, see https://github.com/yiliu1765/qemu/commit/b0be8bfc9a899334819f3f4f0704e47116944a53
Opportunistically, I also introduced another patch to bypass dirty tracking for readonly region, see https://github.com/yiliu1765/qemu/commit/a21ce0afd8aec5d5a9e6de46cf46757530cb7d9f
Thanks
Zhenzhong
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH v4 04/20] vfio/iommufd: Force creating nested parent domain
2025-07-30 10:55 ` Duan, Zhenzhong
@ 2025-07-30 14:00 ` Cédric Le Goater
2025-07-31 3:29 ` Duan, Zhenzhong
0 siblings, 1 reply; 41+ messages in thread
From: Cédric Le Goater @ 2025-07-30 14:00 UTC (permalink / raw)
To: Duan, Zhenzhong, qemu-devel@nongnu.org
Cc: alex.williamson@redhat.com, eric.auger@redhat.com, mst@redhat.com,
jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
jgg@nvidia.com, nicolinc@nvidia.com,
shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
Peng, Chao P
On 7/30/25 12:55, Duan, Zhenzhong wrote:
>
>
>> -----Original Message-----
>> From: Cédric Le Goater <clg@redhat.com>
>> Subject: Re: [PATCH v4 04/20] vfio/iommufd: Force creating nested parent
>> domain
>>
>> On 7/29/25 11:20, Zhenzhong Duan wrote:
>>> Call pci_device_get_viommu_cap() to get if vIOMMU supports
>> VIOMMU_CAP_HW_NESTED,
>>> if yes, create nested parent domain which could be reused by vIOMMU to
>> create
>>> nested domain.
>>>
>>> It is safe because hw_caps & VIOMMU_CAP_HW_NESTED cannot be set yet
>> because
>>> s->flts is forbidden until we support passthrough device with x-flts=on.
>>>
>>> Suggested-by: Nicolin Chen <nicolinc@nvidia.com>
>>> Suggested-by: Yi Liu <yi.l.liu@intel.com>
>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>> Reviewed-by: Eric Auger <eric.auger@redhat.com>
>>> ---
>>> hw/vfio/iommufd.c | 14 ++++++++++++++
>>> 1 file changed, 14 insertions(+)
>>>
>>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>>> index 48c590b6a9..61a548f13f 100644
>>> --- a/hw/vfio/iommufd.c
>>> +++ b/hw/vfio/iommufd.c
>>> @@ -20,6 +20,7 @@
>>> #include "trace.h"
>>> #include "qapi/error.h"
>>> #include "system/iommufd.h"
>>> +#include "hw/iommu.h"
>>> #include "hw/qdev-core.h"
>>> #include "hw/vfio/vfio-cpr.h"
>>> #include "system/reset.h"
>>> @@ -379,6 +380,19 @@ static bool
>> iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
>>> flags = IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
>>> }
>>>
>>> + /*
>>> + * If vIOMMU supports stage-1 translation, force to create nested
>> parent
>>> + * domain which could be reused by vIOMMU to create nested
>> domain.
>>> + */
>>> + if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
>>> + VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice,
>> vbasedev);
>>> +
>>> + hw_caps = pci_device_get_viommu_cap(&vdev->pdev);
>>> + if (hw_caps & VIOMMU_CAP_HW_NESTED) {
>>> + flags |= IOMMU_HWPT_ALLOC_NEST_PARENT;
>>> + }
>>> + }
>>>
>>
>> Could you please add a wrapper for the above ? Something like :
>>
>> static bool vfio_device_viommu_get_nested(VFIODevice *vbasedev)
>> {
>> if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
>> VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice,
>> vbasedev);
>>
>> return !!(pci_device_get_viommu_cap(&vdev->pdev) &
>> VIOMMU_CAP_HW_NESTED);
>> }
>> return false;
>> }
>>
>> May be this routine belongs to hw/vfio/device.c.
>
> Done, see https://github.com/yiliu1765/qemu/commit/7ce1af90d5c1f418f23c9e397a16a3914b30f09f
>
> I also introduced another wrapper vfio_device_to_vfio_pci(), let me know if you think it's unnecessary, I'll fold it.
It's good to have it if you use it everywhere:
hw/vfio/device.c: if (vbasedev->type != VFIO_DEVICE_TYPE_PCI) {
hw/vfio/container.c: vbasedev_iter->type != VFIO_DEVICE_TYPE_PCI) {
hw/vfio/container.c: vbasedev_iter->type != VFIO_DEVICE_TYPE_PCI) {
hw/vfio/listener.c: if (vbasedev && vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
hw/vfio/listener.c: if (vbasedev->type != VFIO_DEVICE_TYPE_PCI) {
hw/vfio/iommufd.c: if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
hw/vfio/iommufd.c: vbasedev_tmp->type != VFIO_DEVICE_TYPE_PCI) {
So, I would address this cleanup separately.
FYI, We also have this helper :
VFIODevice *vfio_get_vfio_device(Object *obj)
{
if (object_dynamic_cast(obj, TYPE_VFIO_PCI)) {
return &VFIO_PCI_BASE(obj)->vbasedev;
} else {
return NULL;
}
}
Thanks,
C.
^ permalink raw reply [flat|nested] 41+ messages in thread
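For reference, the vfio_device_to_vfio_pci() wrapper mentioned above could be sketched along the same lines as vfio_get_vfio_device() (hypothetical; the exact version lives in Zhenzhong's branch):

VFIOPCIDevice *vfio_device_to_vfio_pci(VFIODevice *vbasedev)
{
    /* Callers are expected to have checked vbasedev->type already. */
    return container_of(vbasedev, VFIOPCIDevice, vbasedev);
}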
* Re: [PATCH v4 18/20] vfio: Add a new element bypass_ro in VFIOContainerBase
2025-07-30 10:58 ` Duan, Zhenzhong
@ 2025-07-30 14:18 ` Cédric Le Goater
2025-07-31 3:30 ` Duan, Zhenzhong
0 siblings, 1 reply; 41+ messages in thread
From: Cédric Le Goater @ 2025-07-30 14:18 UTC (permalink / raw)
To: Duan, Zhenzhong, qemu-devel@nongnu.org
Cc: alex.williamson@redhat.com, eric.auger@redhat.com, mst@redhat.com,
jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
jgg@nvidia.com, nicolinc@nvidia.com,
shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
Peng, Chao P
On 7/30/25 12:58, Duan, Zhenzhong wrote:
>
>
>> -----Original Message-----
>> From: Cédric Le Goater <clg@redhat.com>
>> Subject: Re: [PATCH v4 18/20] vfio: Add a new element bypass_ro in
>> VFIOContainerBase
>>
>> On 7/29/25 11:20, Zhenzhong Duan wrote:
>>> When bypass_ro is true, read only memory section is bypassed from
>>> mapping in the container.
>>>
>>> This is a preparing patch to workaround Intel ERRATA_772415.
>>>
>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>> ---
>>> include/hw/vfio/vfio-container-base.h | 1 +
>>> hw/vfio/listener.c | 13 +++++++++----
>>> 2 files changed, 10 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/include/hw/vfio/vfio-container-base.h
>> b/include/hw/vfio/vfio-container-base.h
>>> index bded6e993f..31fd784d76 100644
>>> --- a/include/hw/vfio/vfio-container-base.h
>>> +++ b/include/hw/vfio/vfio-container-base.h
>>> @@ -51,6 +51,7 @@ typedef struct VFIOContainerBase {
>>> QLIST_HEAD(, VFIODevice) device_list;
>>> GList *iova_ranges;
>>> NotifierWithReturn cpr_reboot_notifier;
>>> + bool bypass_ro;
>>> } VFIOContainerBase;
>>>
>>> typedef struct VFIOGuestIOMMU {
>>> diff --git a/hw/vfio/listener.c b/hw/vfio/listener.c
>>> index f498e23a93..c64aa4539e 100644
>>> --- a/hw/vfio/listener.c
>>> +++ b/hw/vfio/listener.c
>>> @@ -364,7 +364,8 @@ static bool
>> vfio_known_safe_misalignment(MemoryRegionSection *section)
>>> return true;
>>> }
>>>
>>> -static bool vfio_listener_valid_section(MemoryRegionSection *section,
>>> +static bool vfio_listener_valid_section(VFIOContainerBase *bcontainer,
>>> + MemoryRegionSection
>> *section,
>>> const char *name)
>>
>> Instead of adding a 'VFIOContainerBase *' argument, I would add an
>> extra 'bool bypass_ro' argument.
>
> Done, see https://github.com/yiliu1765/qemu/commit/b0be8bfc9a899334819f3f4f0704e47116944a53
I don't think the extra 'bool bypass_ro' is useful. Apart from that,
looks good.
Thanks,
C.
> Opportunistically, I also introduced another patch to bypass dirty tracking for readonly region, see https://github.com/yiliu1765/qemu/commit/a21ce0afd8aec5d5a9e6de46cf46757530cb7d9f
^ permalink raw reply [flat|nested] 41+ messages in thread
* RE: [PATCH v4 04/20] vfio/iommufd: Force creating nested parent domain
2025-07-30 14:00 ` Cédric Le Goater
@ 2025-07-31 3:29 ` Duan, Zhenzhong
0 siblings, 0 replies; 41+ messages in thread
From: Duan, Zhenzhong @ 2025-07-31 3:29 UTC (permalink / raw)
To: Cédric Le Goater, qemu-devel@nongnu.org
Cc: alex.williamson@redhat.com, eric.auger@redhat.com, mst@redhat.com,
jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
jgg@nvidia.com, nicolinc@nvidia.com,
shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
Peng, Chao P
>-----Original Message-----
>From: Cédric Le Goater <clg@redhat.com>
>Subject: Re: [PATCH v4 04/20] vfio/iommufd: Force creating nested parent
>domain
>
>On 7/30/25 12:55, Duan, Zhenzhong wrote:
>>
>>
>>> -----Original Message-----
>>> From: Cédric Le Goater <clg@redhat.com>
>>> Subject: Re: [PATCH v4 04/20] vfio/iommufd: Force creating nested parent
>>> domain
>>>
>>> On 7/29/25 11:20, Zhenzhong Duan wrote:
>>>> Call pci_device_get_viommu_cap() to get if vIOMMU supports
>>> VIOMMU_CAP_HW_NESTED,
>>>> if yes, create nested parent domain which could be reused by vIOMMU to
>>> create
>>>> nested domain.
>>>>
>>>> It is safe because hw_caps & VIOMMU_CAP_HW_NESTED cannot be set
>yet
>>> because
>>>> s->flts is forbidden until we support passthrough device with x-flts=on.
>>>>
>>>> Suggested-by: Nicolin Chen <nicolinc@nvidia.com>
>>>> Suggested-by: Yi Liu <yi.l.liu@intel.com>
>>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>>> Reviewed-by: Eric Auger <eric.auger@redhat.com>
>>>> ---
>>>> hw/vfio/iommufd.c | 14 ++++++++++++++
>>>> 1 file changed, 14 insertions(+)
>>>>
>>>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>>>> index 48c590b6a9..61a548f13f 100644
>>>> --- a/hw/vfio/iommufd.c
>>>> +++ b/hw/vfio/iommufd.c
>>>> @@ -20,6 +20,7 @@
>>>> #include "trace.h"
>>>> #include "qapi/error.h"
>>>> #include "system/iommufd.h"
>>>> +#include "hw/iommu.h"
>>>> #include "hw/qdev-core.h"
>>>> #include "hw/vfio/vfio-cpr.h"
>>>> #include "system/reset.h"
>>>> @@ -379,6 +380,19 @@ static bool
>>> iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
>>>> flags = IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
>>>> }
>>>>
>>>> + /*
>>>> + * If vIOMMU supports stage-1 translation, force to create nested
>>> parent
>>>> + * domain which could be reused by vIOMMU to create nested
>>> domain.
>>>> + */
>>>> + if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
>>>> + VFIOPCIDevice *vdev = container_of(vbasedev,
>VFIOPCIDevice,
>>> vbasedev);
>>>> +
>>>> + hw_caps = pci_device_get_viommu_cap(&vdev->pdev);
>>>> + if (hw_caps & VIOMMU_CAP_HW_NESTED) {
>>>> + flags |= IOMMU_HWPT_ALLOC_NEST_PARENT;
>>>> + }
>>>> + }
>>>>
>>>
>>> Could you please add a wrapper for the above ? Something like :
>>>
>>> static bool vfio_device_viommu_get_nested(VFIODevice *vbasedev)
>>> {
>>> if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
>>> VFIOPCIDevice *vdev = container_of(vbasedev,
>VFIOPCIDevice,
>>> vbasedev);
>>>
>>> return !!(pci_device_get_viommu_cap(&vdev->pdev) &
>>> VIOMMU_CAP_HW_NESTED);
>>> }
>>> return false;
>>> }
>>>
>>> May be this routine belongs to hw/vfio/device.c.
>>
>> Done, see
>https://github.com/yiliu1765/qemu/commit/7ce1af90d5c1f418f23c9e397a16
>a3914b30f09f
>>
>> I also introduced another wrapper vfio_device_to_vfio_pci(), let me know if
>you think it's unnecessary, I'll fold it.
>
>It's good to have it if you use it everywhere:
>
>hw/vfio/device.c: if (vbasedev->type != VFIO_DEVICE_TYPE_PCI) {
>hw/vfio/container.c: vbasedev_iter->type !=
>VFIO_DEVICE_TYPE_PCI) {
>hw/vfio/container.c: vbasedev_iter->type !=
>VFIO_DEVICE_TYPE_PCI) {
>hw/vfio/listener.c: if (vbasedev && vbasedev->type ==
>VFIO_DEVICE_TYPE_PCI) {
>hw/vfio/listener.c: if (vbasedev->type != VFIO_DEVICE_TYPE_PCI) {
>hw/vfio/iommufd.c: if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
>hw/vfio/iommufd.c: vbasedev_tmp->type !=
>VFIO_DEVICE_TYPE_PCI) {
>
>So, I would address this cleanup separately.
OK, will send a separate one.
>
>FYI, We also have this helper :
>
>VFIODevice *vfio_get_vfio_device(Object *obj)
>{
> if (object_dynamic_cast(obj, TYPE_VFIO_PCI)) {
> return &VFIO_PCI_BASE(obj)->vbasedev;
> } else {
> return NULL;
> }
>}
I'll put new helper right under it.
Thanks
Zhenzhong
^ permalink raw reply [flat|nested] 41+ messages in thread
* RE: [PATCH v4 18/20] vfio: Add a new element bypass_ro in VFIOContainerBase
2025-07-30 14:18 ` Cédric Le Goater
@ 2025-07-31 3:30 ` Duan, Zhenzhong
0 siblings, 0 replies; 41+ messages in thread
From: Duan, Zhenzhong @ 2025-07-31 3:30 UTC (permalink / raw)
To: Cédric Le Goater, qemu-devel@nongnu.org
Cc: alex.williamson@redhat.com, eric.auger@redhat.com, mst@redhat.com,
jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
jgg@nvidia.com, nicolinc@nvidia.com,
shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
Peng, Chao P
>-----Original Message-----
>From: Cédric Le Goater <clg@redhat.com>
>Subject: Re: [PATCH v4 18/20] vfio: Add a new element bypass_ro in
>VFIOContainerBase
>
>On 7/30/25 12:58, Duan, Zhenzhong wrote:
>>
>>
>>> -----Original Message-----
>>> From: Cédric Le Goater <clg@redhat.com>
>>> Subject: Re: [PATCH v4 18/20] vfio: Add a new element bypass_ro in
>>> VFIOContainerBase
>>>
>>> On 7/29/25 11:20, Zhenzhong Duan wrote:
>>>> When bypass_ro is true, read only memory section is bypassed from
>>>> mapping in the container.
>>>>
>>>> This is a preparing patch to workaround Intel ERRATA_772415.
>>>>
>>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>>> ---
>>>> include/hw/vfio/vfio-container-base.h | 1 +
>>>> hw/vfio/listener.c | 13 +++++++++----
>>>> 2 files changed, 10 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/include/hw/vfio/vfio-container-base.h
>>> b/include/hw/vfio/vfio-container-base.h
>>>> index bded6e993f..31fd784d76 100644
>>>> --- a/include/hw/vfio/vfio-container-base.h
>>>> +++ b/include/hw/vfio/vfio-container-base.h
>>>> @@ -51,6 +51,7 @@ typedef struct VFIOContainerBase {
>>>> QLIST_HEAD(, VFIODevice) device_list;
>>>> GList *iova_ranges;
>>>> NotifierWithReturn cpr_reboot_notifier;
>>>> + bool bypass_ro;
>>>> } VFIOContainerBase;
>>>>
>>>> typedef struct VFIOGuestIOMMU {
>>>> diff --git a/hw/vfio/listener.c b/hw/vfio/listener.c
>>>> index f498e23a93..c64aa4539e 100644
>>>> --- a/hw/vfio/listener.c
>>>> +++ b/hw/vfio/listener.c
>>>> @@ -364,7 +364,8 @@ static bool
>>> vfio_known_safe_misalignment(MemoryRegionSection *section)
>>>> return true;
>>>> }
>>>>
>>>> -static bool vfio_listener_valid_section(MemoryRegionSection *section,
>>>> +static bool vfio_listener_valid_section(VFIOContainerBase *bcontainer,
>>>> + MemoryRegionSection
>>> *section,
>>>> const char *name)
>>>
>>> Instead of adding a 'VFIOContainerBase *' argument, I would add an
>>> extra 'bool bypass_ro' argument.
>>
>> Done, see
>https://github.com/yiliu1765/qemu/commit/b0be8bfc9a899334819f3f4f0704
>e47116944a53
>
>I don't think the extra 'bool bypass_ro' are useful. A part from that,
>looks good.
Guess you mean "bypass_ro = bcontainer->bypass_ro"; I'll use bcontainer->bypass_ro directly.
Thanks
Zhenzhong
>
>
>Thanks,
>
>C.
>
>
>> Opportunistically, I also introduced another patch to bypass dirty tracking
>for readonly region, see
>https://github.com/yiliu1765/qemu/commit/a21ce0afd8aec5d5a9e6de46cf4
>6757530cb7d9f
>
^ permalink raw reply [flat|nested] 41+ messages in thread
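Following the resolution above, the listener call sites would then read roughly as below, using bcontainer->bypass_ro directly instead of an intermediate local (a sketch, assuming the bool-argument variant of vfio_listener_valid_section()):

    if (!vfio_listener_valid_section(section, bcontainer->bypass_ro,
                                     "region_add")) {
        return;
    }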
* Re: [PATCH v4 02/20] hw/pci: Introduce pci_device_get_viommu_cap()
2025-07-30 10:51 ` Duan, Zhenzhong
@ 2025-08-04 18:45 ` Nicolin Chen
2025-08-05 6:04 ` Duan, Zhenzhong
0 siblings, 1 reply; 41+ messages in thread
From: Nicolin Chen @ 2025-08-04 18:45 UTC (permalink / raw)
To: Duan, Zhenzhong
Cc: Cédric Le Goater, qemu-devel@nongnu.org,
alex.williamson@redhat.com, eric.auger@redhat.com, mst@redhat.com,
jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
jgg@nvidia.com, shameerali.kolothum.thodi@huawei.com,
joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
Tian, Kevin, Liu, Yi L, Peng, Chao P
On Wed, Jul 30, 2025 at 10:51:13AM +0000, Duan, Zhenzhong wrote:
> >> 2. there can also be more than one vIOMMUs with different user
> >> configuration, e.g., arm smmuv3.
That's correct. But would you please elaborate how different user
configurations would benefit from this new op? I don't see a good
reasoning behind that.
> >> 4. It's too late for VFIO to call get_viommu_cap() after set_iommu_device()
> >> because we need get_viommu_cap() to determine if creating nested
> >parent
> >> hwpt or not at attaching stage, meanwhile hiod realize needs
> >iommufd,
> >
> >hiod -> "host IOMMU device"
>
> Will do.
>
> >
> >> devid and hwpt_id which are ready after attach_device().
> >
> >I find the above sentence difficult to understand.
>
> This is trying to explain the reason of order between attach_device(), get_viommu_cap() and hiod realizing.
> What about:
>
> 4. host IOMMU capabilities are passed to vIOMMU through set_iommu_device()
> interface which have to be after attach_device(), when get_viommu_cap()
> is called in attach_device(), there is no way for vIOMMU to get host
> IOMMU capabilities yet, so only emulated capabilities can be returned.
> See below sequence:
>
> attach_device()
> get_viommu_cap()
> create hwpt
> ...
> vfio_device_hiod_create_and_realize()
> set_iommu_device(hiod)
I think it should be:
vfio_device_attach():
iommufd_cdev_attach():
pci_device_get_viommu_cap() for HW nesting cap
create a nesting parent hwpt
attach device to the hwpt
vfio_device_hiod_create_and_realize() creating hiod
...
pci_device_set_iommu_device(hiod)
?
Thanks
Nicolin
^ permalink raw reply [flat|nested] 41+ messages in thread
* RE: [PATCH v4 02/20] hw/pci: Introduce pci_device_get_viommu_cap()
2025-08-04 18:45 ` Nicolin Chen
@ 2025-08-05 6:04 ` Duan, Zhenzhong
0 siblings, 0 replies; 41+ messages in thread
From: Duan, Zhenzhong @ 2025-08-05 6:04 UTC (permalink / raw)
To: Nicolin Chen
Cc: Cédric Le Goater, qemu-devel@nongnu.org,
alex.williamson@redhat.com, eric.auger@redhat.com, mst@redhat.com,
jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
jgg@nvidia.com, shameerali.kolothum.thodi@huawei.com,
joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
Tian, Kevin, Liu, Yi L, Peng, Chao P
>-----Original Message-----
>From: Nicolin Chen <nicolinc@nvidia.com>
>Subject: Re: [PATCH v4 02/20] hw/pci: Introduce
>pci_device_get_viommu_cap()
>
>On Wed, Jul 30, 2025 at 10:51:13AM +0000, Duan, Zhenzhong wrote:
>> >> 2. there can also be more than one vIOMMUs with different user
>> >> configuration, e.g., arm smmuv3.
>
>That's correct. But would you please elaborate how different user
>configurations would benefit from this new op? I don't see a good
>reasoning behind that.
It is the reason to have get_viommu_cap(): get_viommu_cap() returns the
capabilities of the vIOMMU which the underlying VFIO device attaches to.
But indeed, it's not a reason for leaving out host capabilities, so I'll drop this line.
>
>> >> 4. It's too late for VFIO to call get_viommu_cap() after
>set_iommu_device()
>> >> because we need get_viommu_cap() to determine if creating
>nested
>> >parent
>> >> hwpt or not at attaching stage, meanwhile hiod realize needs
>> >iommufd,
>> >
>> >hiod -> "host IOMMU device"
>>
>> Will do.
>>
>> >
>> >> devid and hwpt_id which are ready after attach_device().
>> >
>> >I find the above sentence difficult to understand.
>>
>> This is trying to explain the reason of order between attach_device(),
>get_viommu_cap() and hiod realizing.
>> What about:
>>
>> 4. host IOMMU capabilities are passed to vIOMMU through
>set_iommu_device()
>> interface which have to be after attach_device(), when
>get_viommu_cap()
>> is called in attach_device(), there is no way for vIOMMU to get host
>> IOMMU capabilities yet, so only emulated capabilities can be returned.
>> See below sequence:
>>
>> attach_device()
>> get_viommu_cap()
>> create hwpt
>> ...
>> vfio_device_hiod_create_and_realize()
>> set_iommu_device(hiod)
>
>I think it should be:
>
> vfio_device_attach():
> iommufd_cdev_attach():
> pci_device_get_viommu_cap() for HW nesting cap
> create a nesting parent hwpt
> attach device to the hwpt
> vfio_device_hiod_create_and_realize() creating hiod
> ...
> pci_device_set_iommu_device(hiod)
Good catch, done, https://github.com/yiliu1765/qemu/commit/34b02bdc0d615c60d1b07bfee842aabf1f7e2b28
Thanks
Zhenzhong
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH v4 02/20] hw/pci: Introduce pci_device_get_viommu_cap()
2025-07-29 9:20 ` [PATCH v4 02/20] hw/pci: Introduce pci_device_get_viommu_cap() Zhenzhong Duan
2025-07-29 13:19 ` Cédric Le Goater
@ 2025-08-13 6:43 ` Jim Shu
2025-08-13 6:57 ` Duan, Zhenzhong
1 sibling, 1 reply; 41+ messages in thread
From: Jim Shu @ 2025-08-13 6:43 UTC (permalink / raw)
To: Zhenzhong Duan
Cc: qemu-devel, alex.williamson, clg, eric.auger, mst, jasowang,
peterx, ddutile, jgg, nicolinc, shameerali.kolothum.thodi,
joao.m.martins, clement.mathieu--drif, kevin.tian, yi.l.liu,
chao.p.peng
Hi Zhenzhong,
On Tue, Jul 29, 2025 at 5:21 PM Zhenzhong Duan <zhenzhong.duan@intel.com> wrote:
>
> Introduce a new PCIIOMMUOps optional callback, get_viommu_cap() which
> allows to retrieve capabilities exposed by a vIOMMU. The first planned
> capability is VIOMMU_CAP_HW_NESTED that advertises the support of HW
> nested stage translation scheme. pci_device_get_viommu_cap is a wrapper
> that can be called on a PCI device potentially protected by a vIOMMU.
>
> get_viommu_cap() is designed to return 64bit bitmap of purely emulated
> capabilities which are only derermined by user's configuration, no host
> capabilities involved. Reasons are:
>
> 1. there can be more than one host IOMMUs with different capabilities
> 2. there can also be more than one vIOMMUs with different user
> configuration, e.g., arm smmuv3.
> 3. This is migration friendly, return value is consistent between source
> and target.
> 4. It's too late for VFIO to call get_viommu_cap() after set_iommu_device()
> because we need get_viommu_cap() to determine if creating nested parent
> hwpt or not at attaching stage, meanwhile hiod realize needs iommufd,
> devid and hwpt_id which are ready after attach_device().
> See below sequence:
>
> attach_device()
> get_viommu_cap()
> create hwpt
> ...
> create hiod
> set_iommu_device(hiod)
>
> Suggested-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
> MAINTAINERS | 1 +
> include/hw/iommu.h | 17 +++++++++++++++++
> include/hw/pci/pci.h | 25 +++++++++++++++++++++++++
> hw/pci/pci.c | 11 +++++++++++
> 4 files changed, 54 insertions(+)
> create mode 100644 include/hw/iommu.h
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 37879ab64e..840cb1e604 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -2304,6 +2304,7 @@ F: include/system/iommufd.h
> F: backends/host_iommu_device.c
> F: include/system/host_iommu_device.h
> F: include/qemu/chardev_open.h
> +F: include/hw/iommu.h
> F: util/chardev_open.c
> F: docs/devel/vfio-iommufd.rst
>
> diff --git a/include/hw/iommu.h b/include/hw/iommu.h
> new file mode 100644
> index 0000000000..021db50db5
> --- /dev/null
> +++ b/include/hw/iommu.h
> @@ -0,0 +1,17 @@
> +/*
> + * General vIOMMU capabilities, flags, etc
> + *
> + * Copyright (C) 2025 Intel Corporation.
> + *
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + */
> +
> +#ifndef HW_IOMMU_H
> +#define HW_IOMMU_H
> +
Could we include the header "qemu/bitops.h" here to use `BIT_ULL()` in this file?
Otherwise it will cause a compile error when this header is included by other IOMMU code.
> +enum {
> + /* hardware nested stage-1 page table support */
> + VIOMMU_CAP_HW_NESTED = BIT_ULL(0),
> +};
> +
> +#endif /* HW_IOMMU_H */
> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> index 6b7d3ac8a3..d89aefc030 100644
> --- a/include/hw/pci/pci.h
> +++ b/include/hw/pci/pci.h
> @@ -462,6 +462,21 @@ typedef struct PCIIOMMUOps {
> * @devfn: device and function number of the PCI device.
> */
> void (*unset_iommu_device)(PCIBus *bus, void *opaque, int devfn);
> + /**
> + * @get_viommu_cap: get vIOMMU capabilities
> + *
> + * Optional callback, if not implemented, then vIOMMU doesn't
> + * support exposing capabilities to other subsystem, e.g., VFIO.
> + * vIOMMU can choose which capabilities to expose.
> + *
> + * @opaque: the data passed to pci_setup_iommu().
> + *
> + * Returns: 64bit bitmap with each bit represents a capability emulated by
> + * VIOMMU_CAP_* in include/hw/iommu.h, these capabilities are theoretical
> + * which are only determined by user's configuration and independent on the
> + * actual host capabilities they may depend on.
> + */
> + uint64_t (*get_viommu_cap)(void *opaque);
> /**
> * @get_iotlb_info: get properties required to initialize a device IOTLB.
> *
> @@ -642,6 +657,16 @@ bool pci_device_set_iommu_device(PCIDevice *dev, HostIOMMUDevice *hiod,
> Error **errp);
> void pci_device_unset_iommu_device(PCIDevice *dev);
>
> +/**
> + * pci_device_get_viommu_cap: get vIOMMU capabilities.
> + *
> + * Returns a 64bit bitmap with each bit represents a vIOMMU exposed
> + * capability, 0 if vIOMMU doesn't support esposing capabilities.
> + *
> + * @dev: PCI device pointer.
> + */
> +uint64_t pci_device_get_viommu_cap(PCIDevice *dev);
> +
> /**
> * pci_iommu_get_iotlb_info: get properties required to initialize a
> * device IOTLB.
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index c70b5ceeba..df1fb615a8 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -2992,6 +2992,17 @@ void pci_device_unset_iommu_device(PCIDevice *dev)
> }
> }
>
> +uint64_t pci_device_get_viommu_cap(PCIDevice *dev)
> +{
> + PCIBus *iommu_bus;
> +
> + pci_device_get_iommu_bus_devfn(dev, &iommu_bus, NULL, NULL);
> + if (iommu_bus && iommu_bus->iommu_ops->get_viommu_cap) {
> + return iommu_bus->iommu_ops->get_viommu_cap(iommu_bus->iommu_opaque);
> + }
> + return 0;
> +}
> +
> int pci_pri_request_page(PCIDevice *dev, uint32_t pasid, bool priv_req,
> bool exec_req, hwaddr addr, bool lpig,
> uint16_t prgi, bool is_read, bool is_write)
> --
> 2.47.1
>
>
Thanks,
Jim
^ permalink raw reply [flat|nested] 41+ messages in thread
* RE: [PATCH v4 02/20] hw/pci: Introduce pci_device_get_viommu_cap()
2025-08-13 6:43 ` Jim Shu
@ 2025-08-13 6:57 ` Duan, Zhenzhong
0 siblings, 0 replies; 41+ messages in thread
From: Duan, Zhenzhong @ 2025-08-13 6:57 UTC (permalink / raw)
To: Jim Shu
Cc: qemu-devel@nongnu.org, alex.williamson@redhat.com, clg@redhat.com,
eric.auger@redhat.com, mst@redhat.com, jasowang@redhat.com,
peterx@redhat.com, ddutile@redhat.com, jgg@nvidia.com,
nicolinc@nvidia.com, shameerali.kolothum.thodi@huawei.com,
joao.m.martins@oracle.com, clement.mathieu--drif@eviden.com,
Tian, Kevin, Liu, Yi L, Peng, Chao P
>-----Original Message-----
>From: Jim Shu <jim.shu@sifive.com>
>Subject: Re: [PATCH v4 02/20] hw/pci: Introduce
>pci_device_get_viommu_cap()
>
>Hi Zhenzhong,
>
>On Tue, Jul 29, 2025 at 5:21 PM Zhenzhong Duan
><zhenzhong.duan@intel.com> wrote:
>>
>> Introduce a new PCIIOMMUOps optional callback, get_viommu_cap() which
>> allows to retrieve capabilities exposed by a vIOMMU. The first planned
>> capability is VIOMMU_CAP_HW_NESTED that advertises the support of HW
>> nested stage translation scheme. pci_device_get_viommu_cap is a wrapper
>> that can be called on a PCI device potentially protected by a vIOMMU.
>>
>> get_viommu_cap() is designed to return 64bit bitmap of purely emulated
>> capabilities which are only derermined by user's configuration, no host
>> capabilities involved. Reasons are:
>>
>> 1. there can be more than one host IOMMUs with different capabilities
>> 2. there can also be more than one vIOMMUs with different user
>> configuration, e.g., arm smmuv3.
>> 3. This is migration friendly, return value is consistent between source
>> and target.
>> 4. It's too late for VFIO to call get_viommu_cap() after set_iommu_device()
>> because we need get_viommu_cap() to determine if creating nested
>parent
>> hwpt or not at attaching stage, meanwhile hiod realize needs iommufd,
>> devid and hwpt_id which are ready after attach_device().
>> See below sequence:
>>
>> attach_device()
>> get_viommu_cap()
>> create hwpt
>> ...
>> create hiod
>> set_iommu_device(hiod)
>>
>> Suggested-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>> MAINTAINERS | 1 +
>> include/hw/iommu.h | 17 +++++++++++++++++
>> include/hw/pci/pci.h | 25 +++++++++++++++++++++++++
>> hw/pci/pci.c | 11 +++++++++++
>> 4 files changed, 54 insertions(+)
>> create mode 100644 include/hw/iommu.h
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index 37879ab64e..840cb1e604 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -2304,6 +2304,7 @@ F: include/system/iommufd.h
>> F: backends/host_iommu_device.c
>> F: include/system/host_iommu_device.h
>> F: include/qemu/chardev_open.h
>> +F: include/hw/iommu.h
>> F: util/chardev_open.c
>> F: docs/devel/vfio-iommufd.rst
>>
>> diff --git a/include/hw/iommu.h b/include/hw/iommu.h
>> new file mode 100644
>> index 0000000000..021db50db5
>> --- /dev/null
>> +++ b/include/hw/iommu.h
>> @@ -0,0 +1,17 @@
>> +/*
>> + * General vIOMMU capabilities, flags, etc
>> + *
>> + * Copyright (C) 2025 Intel Corporation.
>> + *
>> + * SPDX-License-Identifier: GPL-2.0-or-later
>> + */
>> +
>> +#ifndef HW_IOMMU_H
>> +#define HW_IOMMU_H
>> +
>
>Could we include header "qemu/bitops.h" here to use `BIT_ULL()` in this file?
>Or it will get compile error when including it in the other IOMMU.
Oh, thanks for pointing that out, fixed at https://github.com/yiliu1765/qemu/commits/zhenzhong/iommufd_nesting.v5.wip/
BRs,
Zhenzhong
^ permalink raw reply [flat|nested] 41+ messages in thread
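With Jim's suggestion applied, include/hw/iommu.h would look roughly as below; apart from the added include, the content is taken from the quoted patch (the authoritative version is in the branch linked above):

/*
 * General vIOMMU capabilities, flags, etc
 *
 * Copyright (C) 2025 Intel Corporation.
 *
 * SPDX-License-Identifier: GPL-2.0-or-later
 */

#ifndef HW_IOMMU_H
#define HW_IOMMU_H

#include "qemu/bitops.h"

enum {
    /* hardware nested stage-1 page table support */
    VIOMMU_CAP_HW_NESTED = BIT_ULL(0),
};

#endif /* HW_IOMMU_H */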
* RE: [PATCH v4 00/20] intel_iommu: Enable stage-1 translation for passthrough device
2025-07-29 9:20 [PATCH v4 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
` (19 preceding siblings ...)
2025-07-29 9:20 ` [PATCH v4 20/20] intel_iommu: Enable host device when x-flts=on in scalable mode Zhenzhong Duan
@ 2025-08-21 7:19 ` Duan, Zhenzhong
2025-08-21 8:50 ` Yi Liu
2025-08-21 8:51 ` Michael S. Tsirkin
20 siblings, 2 replies; 41+ messages in thread
From: Duan, Zhenzhong @ 2025-08-21 7:19 UTC (permalink / raw)
To: qemu-devel@nongnu.org
Cc: alex.williamson@redhat.com, clg@redhat.com, eric.auger@redhat.com,
mst@redhat.com, jasowang@redhat.com, peterx@redhat.com,
ddutile@redhat.com, jgg@nvidia.com, nicolinc@nvidia.com,
shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
Peng, Chao P
Kindly ping, any more comments?
Thanks
Zhenzhong
>-----Original Message-----
>From: Duan, Zhenzhong <zhenzhong.duan@intel.com>
>Subject: [PATCH v4 00/20] intel_iommu: Enable stage-1 translation for
>passthrough device
>
>Hi,
>
>For passthrough device with intel_iommu.x-flts=on, we don't do shadowing
>of
>guest page table for passthrough device but pass stage-1 page table to host
>side to construct a nested domain. There was some effort to enable this
>feature
>in old days, see [1] for details.
>
>The key design is to utilize the dual-stage IOMMU translation (also known as
>IOMMU nested translation) capability in host IOMMU. As the below diagram
>shows,
>guest I/O page table pointer in GPA (guest physical address) is passed to host
>and be used to perform the stage-1 address translation. Along with it,
>modifications to present mappings in the guest I/O page table should be
>followed
>with an IOTLB invalidation.
>
> .-------------. .---------------------------.
> | vIOMMU | | Guest I/O page table |
> | | '---------------------------'
> .----------------/
> | PASID Entry |--- PASID cache flush --+
> '-------------' |
> | | V
> | | I/O page table pointer in GPA
> '-------------'
> Guest
> ------| Shadow |---------------------------|--------
> v v v
> Host
> .-------------. .------------------------.
> | pIOMMU | | Stage1 for GIOVA->GPA |
> | | '------------------------'
> .----------------/ |
> | PASID Entry | V (Nested xlate)
> '----------------\.--------------------------------------.
> | | | Stage2 for GPA->HPA, unmanaged domain|
> | | '--------------------------------------'
> '-------------'
>For history reason, there are different namings in different VTD spec rev,
>Where:
> - Stage1 = First stage = First level = flts
> - Stage2 = Second stage = Second level = slts
><Intel VT-d Nested translation>
>
>This series reuse VFIO device's default hwpt as nested parent instead of
>creating new one. This way avoids duplicate code of a new memory listener,
>all existing feature from VFIO listener can be shared, e.g., ram discard,
>dirty tracking, etc. Two limitations are: 1) not supporting VFIO device
>under a PCI bridge with emulated device, because emulated device wants
>IOMMU AS and VFIO device stick to system AS; 2) not supporting kexec or
>reboot from "intel_iommu=on,sm_on" to "intel_iommu=on,sm_off", because
>VFIO device's default hwpt is created with NEST_PARENT flag, kernel
>inhibit RO mappings when switch to shadow mode.
>
>This series is also a prerequisite work for vSVA, i.e. Sharing guest
>application address space with passthrough devices.
>
>There are some interactions between VFIO and vIOMMU
>* vIOMMU registers PCIIOMMUOps [set|unset]_iommu_device to PCI
> subsystem. VFIO calls them to register/unregister HostIOMMUDevice
> instance to vIOMMU at vfio device realize stage.
>* vIOMMU registers PCIIOMMUOps get_viommu_cap to PCI subsystem.
> VFIO calls it to get vIOMMU exposed capabilities.
>* vIOMMU calls HostIOMMUDeviceIOMMUFD interface [at|de]tach_hwpt
> to bind/unbind device to IOMMUFD backed domains, either nested
> domain or not.
>
>See below diagram:
>
> VFIO Device Intel IOMMU
> .-----------------. .-------------------.
> | | | |
> | .---------|PCIIOMMUOps |.-------------. |
> | | IOMMUFD |(set/unset_iommu_device) || Host IOMMU | |
> | | Device |------------------------>|| Device list | |
> | .---------|(get_viommu_cap) |.-------------. |
> | | | | |
> | | | V |
> | .---------| HostIOMMUDeviceIOMMUFD | .-------------. |
> | | IOMMUFD | (attach_hwpt)| | Host IOMMU | |
> | | link |<------------------------| | Device | |
> | .---------| (detach_hwpt)| .-------------. |
> | | | | |
> | | | ... |
> .-----------------. .-------------------.
>
>Below is an example to enable stage-1 translation for passthrough device:
>
> -M q35,...
> -device intel-iommu,x-scalable-mode=on,x-flts=on...
> -object iommufd,id=iommufd0 -device vfio-pci,iommufd=iommufd0,...
>
>Test done:
>- VFIO devices hotplug/unplug
>- different VFIO devices linked to different iommufds
>- vhost net device ping test
>
>PATCH1-6: Some preparing work
>PATCH7-8: Compatibility check between vIOMMU and Host IOMMU
>PATCH9-17: Implement stage-1 page table for passthrough device
>PATCH18-19:Workaround for ERRATA_772415_SPR17
>PATCH20: Enable stage-1 translation for passthrough device
>
>Qemu code can be found at [2]
>
>Fault report isn't supported in this series, we presume guest kernel always
>construct correct stage1 page table for passthrough device. For emulated
>devices, the emulation code already provided stage1 fault injection.
>
>TODO:
>- Fault report to guest when HW stage1 faults
>
>[1]
>https://patchwork.kernel.org/project/kvm/cover/20210302203827.437645-1
>-yi.l.liu@intel.com/
>[2] https://github.com/yiliu1765/qemu/tree/zhenzhong/iommufd_nesting.v4
>
>Thanks
>Zhenzhong
>
>Changelog:
>v4:
>- s/VIOMMU_CAP_STAGE1/VIOMMU_CAP_HW_NESTED (Eric, Nicolin,
>Donald, Shameer)
>- clarify get_viommu_cap() return pure emulated caps and explain reason in
>commit log (Eric)
>- retrieve the ce only if vtd_as->pasid in vtd_as_to_iommu_pasid_locked (Eric)
>- refine doc comment and commit log in patch10-11 (Eric)
>
>v3:
>- define enum type for VIOMMU_CAP_* (Eric)
>- drop inline flag in the patch which uses the helper (Eric)
>- use extract64 in new introduced MACRO (Eric)
>- polish comments and fix typo error (Eric)
>- split workaround patch for ERRATA_772415_SPR17 to two patches (Eric)
>- optimize bind/unbind error path processing
>
>v2:
>- introduce get_viommu_cap() to get STAGE1 flag to create nested parent
>hwpt (Liuyi)
>- reuse VFIO's default hwpt as parent hwpt of nested translation (Nicolin,
>Liuyi)
>- abandon support of VFIO device under pcie-to-pci bridge to simplify design
>(Liuyi)
>- bypass RO mapping in VFIO's default hwpt if ERRATA_772415_SPR17 (Liuyi)
>- drop vtd_dev_to_context_entry optimization (Liuyi)
>
>v1:
>- simplify vendor specific checking in vtd_check_hiod (Cedric, Nicolin)
>- rebase to master
>
>rfcv3:
>- s/hwpt_id/id in iommufd_backend_invalidate_cache()'s parameter
>(Shameer)
>- hide vtd vendor specific caps in a wrapper union (Eric, Nicolin)
>- simplify return value check of get_cap() (Eric)
>- drop realize_late (Cedric, Eric)
>- split patch13:intel_iommu: Add PASID cache management infrastructure
>(Eric)
>- s/vtd_pasid_cache_reset/vtd_pasid_cache_reset_locked (Eric)
>- s/vtd_pe_get_domain_id/vtd_pe_get_did (Eric)
>- refine comments (Eric, Donald)
>
>rfcv2:
>- Drop VTDPASIDAddressSpace and use VTDAddressSpace (Eric, Liuyi)
>- Move HWPT uAPI patches ahead(patch1-8) so arm nesting could easily
>rebase
>- add two cleanup patches(patch9-10)
>- VFIO passes iommufd/devid/hwpt_id to vIOMMU instead of
>iommufd/devid/ioas_id
>- add vtd_as_[from|to]_iommu_pasid() helper to translate between vtd_as
>and
> iommu pasid, this is important for dropping VTDPASIDAddressSpace
>
>
>Yi Liu (3):
> intel_iommu: Replay pasid bindings after context cache invalidation
> intel_iommu: Propagate PASID-based iotlb invalidation to host
> intel_iommu: Replay all pasid bindings when either SRTP or TE bit is
> changed
>
>Zhenzhong Duan (17):
> intel_iommu: Rename vtd_ce_get_rid2pasid_entry to
> vtd_ce_get_pasid_entry
> hw/pci: Introduce pci_device_get_viommu_cap()
> intel_iommu: Implement get_viommu_cap() callback
> vfio/iommufd: Force creating nested parent domain
> hw/pci: Export pci_device_get_iommu_bus_devfn() and return bool
> intel_iommu: Introduce a new structure VTDHostIOMMUDevice
> intel_iommu: Check for compatibility with IOMMUFD backed device when
> x-flts=on
> intel_iommu: Fail passthrough device under PCI bridge if x-flts=on
> intel_iommu: Introduce two helpers vtd_as_from/to_iommu_pasid_locked
> intel_iommu: Handle PASID entry removal and update
> intel_iommu: Handle PASID entry addition
> intel_iommu: Introduce a new pasid cache invalidation type FORCE_RESET
> intel_iommu: Stick to system MR for IOMMUFD backed host device when
> x-fls=on
> intel_iommu: Bind/unbind guest page table to host
> vfio: Add a new element bypass_ro in VFIOContainerBase
> Workaround for ERRATA_772415_SPR17
> intel_iommu: Enable host device when x-flts=on in scalable mode
>
> MAINTAINERS | 1 +
> hw/i386/intel_iommu_internal.h | 68 +-
> include/hw/i386/intel_iommu.h | 9 +-
> include/hw/iommu.h | 17 +
> include/hw/pci/pci.h | 27 +
> include/hw/vfio/vfio-container-base.h | 1 +
> hw/i386/intel_iommu.c | 941 +++++++++++++++++++++++++-
> hw/pci/pci.c | 23 +-
> hw/vfio/iommufd.c | 22 +-
> hw/vfio/listener.c | 13 +-
> hw/i386/trace-events | 8 +
> 11 files changed, 1088 insertions(+), 42 deletions(-)
> create mode 100644 include/hw/iommu.h
>
>
>base-commit: 92c05be4dfb59a71033d4c57dac944b29f7dabf0
>--
>2.47.1
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH v4 00/20] intel_iommu: Enable stage-1 translation for passthrough device
2025-08-21 8:50 ` Yi Liu
@ 2025-08-21 8:50 ` Eric Auger
2025-08-21 8:58 ` Duan, Zhenzhong
2025-08-21 8:50 ` Duan, Zhenzhong
1 sibling, 1 reply; 41+ messages in thread
From: Eric Auger @ 2025-08-21 8:50 UTC (permalink / raw)
To: Yi Liu, Duan, Zhenzhong, qemu-devel@nongnu.org
Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
jgg@nvidia.com, nicolinc@nvidia.com,
shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
clement.mathieu--drif@eviden.com, Tian, Kevin, Peng, Chao P
On 8/21/25 10:50 AM, Yi Liu wrote:
> On 2025/8/21 15:19, Duan, Zhenzhong wrote:
>> Kindly ping, any more comments?
>
> Do you have enough comments for a new version? I plan to have a look
> either this version or a new version next week. :)
same for me ;-)
Eric
>
> Regards,
> Yi Liu
>
>> Thanks
>> Zhenzhong
>>
>>> -----Original Message-----
>>> From: Duan, Zhenzhong <zhenzhong.duan@intel.com>
>>> Subject: [PATCH v4 00/20] intel_iommu: Enable stage-1 translation for
>>> passthrough device
>>>
>>> Hi,
>>>
>>> For passthrough device with intel_iommu.x-flts=on, we don't do
>>> shadowing
>>> of
>>> guest page table for passthrough device but pass stage-1 page table
>>> to host
>>> side to construct a nested domain. There was some effort to enable this
>>> feature
>>> in old days, see [1] for details.
>>>
>>> The key design is to utilize the dual-stage IOMMU translation (also
>>> known as
>>> IOMMU nested translation) capability in host IOMMU. As the below
>>> diagram
>>> shows,
>>> guest I/O page table pointer in GPA (guest physical address) is
>>> passed to host
>>> and be used to perform the stage-1 address translation. Along with it,
>>> modifications to present mappings in the guest I/O page table should be
>>> followed
>>> with an IOTLB invalidation.
>>>
>>> .-------------. .---------------------------.
>>> | vIOMMU | | Guest I/O page table |
>>> | | '---------------------------'
>>> .----------------/
>>> | PASID Entry |--- PASID cache flush --+
>>> '-------------' |
>>> | | V
>>> | | I/O page table pointer in GPA
>>> '-------------'
>>> Guest
>>> ------| Shadow |---------------------------|--------
>>> v v v
>>> Host
>>> .-------------. .------------------------.
>>> | pIOMMU | | Stage1 for GIOVA->GPA |
>>> | | '------------------------'
>>> .----------------/ |
>>> | PASID Entry | V (Nested xlate)
>>> '----------------\.--------------------------------------.
>>> | | | Stage2 for GPA->HPA, unmanaged domain|
>>> | | '--------------------------------------'
>>> '-------------'
>>> For history reason, there are different namings in different VTD
>>> spec rev,
>>> Where:
>>> - Stage1 = First stage = First level = flts
>>> - Stage2 = Second stage = Second level = slts
>>> <Intel VT-d Nested translation>
>>>
>>> This series reuse VFIO device's default hwpt as nested parent
>>> instead of
>>> creating new one. This way avoids duplicate code of a new memory
>>> listener,
>>> all existing feature from VFIO listener can be shared, e.g., ram
>>> discard,
>>> dirty tracking, etc. Two limitations are: 1) not supporting VFIO device
>>> under a PCI bridge with emulated device, because emulated device wants
>>> IOMMU AS and VFIO device stick to system AS; 2) not supporting kexec or
>>> reboot from "intel_iommu=on,sm_on" to "intel_iommu=on,sm_off", because
>>> VFIO device's default hwpt is created with NEST_PARENT flag, kernel
>>> inhibit RO mappings when switch to shadow mode.
>>>
>>> This series is also a prerequisite work for vSVA, i.e. Sharing guest
>>> application address space with passthrough devices.
>>>
>>> There are some interactions between VFIO and vIOMMU
>>> * vIOMMU registers PCIIOMMUOps [set|unset]_iommu_device to PCI
>>> subsystem. VFIO calls them to register/unregister HostIOMMUDevice
>>> instance to vIOMMU at vfio device realize stage.
>>> * vIOMMU registers PCIIOMMUOps get_viommu_cap to PCI subsystem.
>>> VFIO calls it to get vIOMMU exposed capabilities.
>>> * vIOMMU calls HostIOMMUDeviceIOMMUFD interface [at|de]tach_hwpt
>>> to bind/unbind device to IOMMUFD backed domains, either nested
>>> domain or not.
>>>
>>> See below diagram:
>>>
>>>        VFIO Device                           Intel IOMMU
>>>  .-----------------.                    .---------------------.
>>>  |                 |                    |                     |
>>>  |   .---------|PCIIOMMUOps             |  .-------------.    |
>>>  |   | IOMMUFD |(set/unset_iommu_device)|  | Host IOMMU  |    |
>>>  |   | Device  |------------------------>  | Device list |    |
>>>  |   .---------|(get_viommu_cap)        |  .-------------.    |
>>>  |   |         |                        |        |            |
>>>  |   |         |                        |        V            |
>>>  |   .---------| HostIOMMUDeviceIOMMUFD |  .-------------.    |
>>>  |   | IOMMUFD |          (attach_hwpt) |  | Host IOMMU  |    |
>>>  |   |  link   |<------------------------  |   Device    |    |
>>>  |   .---------|          (detach_hwpt) |  .-------------.    |
>>>  |   |         |                        |        |            |
>>>  |   |         |                        |        ...          |
>>>  .-----------------.                    .---------------------.
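
To make the call flow in the diagram concrete, below is a small
self-contained mock. It is not QEMU code: the Mock* names, the simplified
signatures and the capability bit value are invented for illustration, and
QEMU's real PCIIOMMUOps/HostIOMMUDeviceIOMMUFD interfaces carry more
parameters.

#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct MockHostIOMMUDevice { uint32_t devid; } MockHostIOMMUDevice;

/* Callbacks the vIOMMU registers with the PCI subsystem (simplified). */
typedef struct MockPCIIOMMUOps {
    bool (*set_iommu_device)(int devfn, MockHostIOMMUDevice *dev);
    void (*unset_iommu_device)(int devfn);
    uint64_t (*get_viommu_cap)(void);
} MockPCIIOMMUOps;

#define MOCK_VIOMMU_CAP_HW_NESTED (1ULL << 0)

static bool viommu_set_iommu_device(int devfn, MockHostIOMMUDevice *dev)
{
    printf("vIOMMU: host IOMMU device %u registered at devfn %d\n",
           dev->devid, devfn);
    return true;
}

static void viommu_unset_iommu_device(int devfn)
{
    printf("vIOMMU: device at devfn %d unregistered\n", devfn);
}

static uint64_t viommu_get_viommu_cap(void)
{
    return MOCK_VIOMMU_CAP_HW_NESTED; /* vIOMMU is configured for HW nesting */
}

static const MockPCIIOMMUOps viommu_ops = {
    .set_iommu_device   = viommu_set_iommu_device,
    .unset_iommu_device = viommu_unset_iommu_device,
    .get_viommu_cap     = viommu_get_viommu_cap,
};

/* VFIO side, at realize time: register the host IOMMU device, then query
 * the vIOMMU caps to decide whether the default hwpt must be a nesting
 * parent. */
int main(void)
{
    MockHostIOMMUDevice dev = { .devid = 3 };

    viommu_ops.set_iommu_device(0x10, &dev);
    if (viommu_ops.get_viommu_cap() & MOCK_VIOMMU_CAP_HW_NESTED) {
        printf("VFIO: create the default hwpt as a nesting parent\n");
    }
    viommu_ops.unset_iommu_device(0x10);
    return 0;
}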
>>>
>>> Below is an example to enable stage-1 translation for passthrough
>>> device:
>>>
>>> -M q35,...
>>> -device intel-iommu,x-scalable-mode=on,x-flts=on...
>>> -object iommufd,id=iommufd0 -device vfio-pci,iommufd=iommufd0,...
>>>
>>> Test done:
>>> - VFIO devices hotplug/unplug
>>> - different VFIO devices linked to different iommufds
>>> - vhost net device ping test
>>>
>>> PATCH1-6: Some preparing work
>>> PATCH7-8: Compatibility check between vIOMMU and Host IOMMU
>>> PATCH9-17: Implement stage-1 page table for passthrough device
>>> PATCH18-19: Workaround for ERRATA_772415_SPR17
>>> PATCH20: Enable stage-1 translation for passthrough device
>>>
>>> Qemu code can be found at [2]
>>>
>>> Fault reporting isn't supported in this series; we presume the guest
>>> kernel always constructs a correct stage-1 page table for the passthrough
>>> device. For emulated devices, the emulation code already provides stage-1
>>> fault injection.
>>>
>>> TODO:
>>> - Fault report to guest when HW stage1 faults
>>>
>>> [1] https://patchwork.kernel.org/project/kvm/cover/20210302203827.437645-1-yi.l.liu@intel.com/
>>> [2] https://github.com/yiliu1765/qemu/tree/zhenzhong/iommufd_nesting.v4
>>>
>>> Thanks
>>> Zhenzhong
>>>
>>> Changelog:
>>> v4:
>>> - s/VIOMMU_CAP_STAGE1/VIOMMU_CAP_HW_NESTED (Eric, Nicolin,
>>> Donald, Shameer)
>>> - clarify get_viommu_cap() return pure emulated caps and explain
>>> reason in
>>> commit log (Eric)
>>> - retrieve the ce only if vtd_as->pasid in
>>> vtd_as_to_iommu_pasid_locked (Eric)
>>> - refine doc comment and commit log in patch10-11 (Eric)
>>>
>>> v3:
>>> - define enum type for VIOMMU_CAP_* (Eric)
>>> - drop inline flag in the patch which uses the helper (Eric)
>>> - use extract64 in new introduced MACRO (Eric)
>>> - polish comments and fix typo error (Eric)
>>> - split workaround patch for ERRATA_772415_SPR17 to two patches (Eric)
>>> - optimize bind/unbind error path processing
>>>
>>> v2:
>>> - introduce get_viommu_cap() to get STAGE1 flag to create nested parent
>>> hwpt (Liuyi)
>>> - reuse VFIO's default hwpt as parent hwpt of nested translation
>>> (Nicolin,
>>> Liuyi)
>>> - abandon support of VFIO device under pcie-to-pci bridge to
>>> simplify design
>>> (Liuyi)
>>> - bypass RO mapping in VFIO's default hwpt if ERRATA_772415_SPR17
>>> (Liuyi)
>>> - drop vtd_dev_to_context_entry optimization (Liuyi)
>>>
>>> v1:
>>> - simplify vendor specific checking in vtd_check_hiod (Cedric, Nicolin)
>>> - rebase to master
>>>
>>> rfcv3:
>>> - s/hwpt_id/id in iommufd_backend_invalidate_cache()'s parameter
>>> (Shameer)
>>> - hide vtd vendor specific caps in a wrapper union (Eric, Nicolin)
>>> - simplify return value check of get_cap() (Eric)
>>> - drop realize_late (Cedric, Eric)
>>> - split patch13:intel_iommu: Add PASID cache management infrastructure
>>> (Eric)
>>> - s/vtd_pasid_cache_reset/vtd_pasid_cache_reset_locked (Eric)
>>> - s/vtd_pe_get_domain_id/vtd_pe_get_did (Eric)
>>> - refine comments (Eric, Donald)
>>>
>>> rfcv2:
>>> - Drop VTDPASIDAddressSpace and use VTDAddressSpace (Eric, Liuyi)
>>> - Move HWPT uAPI patches ahead(patch1-8) so arm nesting could easily
>>> rebase
>>> - add two cleanup patches(patch9-10)
>>> - VFIO passes iommufd/devid/hwpt_id to vIOMMU instead of
>>> iommufd/devid/ioas_id
>>> - add vtd_as_[from|to]_iommu_pasid() helper to translate between vtd_as
>>> and
>>> iommu pasid, this is important for dropping VTDPASIDAddressSpace
>>>
>>>
>>> Yi Liu (3):
>>> intel_iommu: Replay pasid bindings after context cache invalidation
>>> intel_iommu: Propagate PASID-based iotlb invalidation to host
>>> intel_iommu: Replay all pasid bindings when either SRTP or TE bit is
>>> changed
>>>
>>> Zhenzhong Duan (17):
>>> intel_iommu: Rename vtd_ce_get_rid2pasid_entry to
>>> vtd_ce_get_pasid_entry
>>> hw/pci: Introduce pci_device_get_viommu_cap()
>>> intel_iommu: Implement get_viommu_cap() callback
>>> vfio/iommufd: Force creating nested parent domain
>>> hw/pci: Export pci_device_get_iommu_bus_devfn() and return bool
>>> intel_iommu: Introduce a new structure VTDHostIOMMUDevice
>>> intel_iommu: Check for compatibility with IOMMUFD backed device when
>>> x-flts=on
>>> intel_iommu: Fail passthrough device under PCI bridge if x-flts=on
>>> intel_iommu: Introduce two helpers vtd_as_from/to_iommu_pasid_locked
>>> intel_iommu: Handle PASID entry removal and update
>>> intel_iommu: Handle PASID entry addition
>>> intel_iommu: Introduce a new pasid cache invalidation type
>>> FORCE_RESET
>>> intel_iommu: Stick to system MR for IOMMUFD backed host device when
>>> x-fls=on
>>> intel_iommu: Bind/unbind guest page table to host
>>> vfio: Add a new element bypass_ro in VFIOContainerBase
>>> Workaround for ERRATA_772415_SPR17
>>> intel_iommu: Enable host device when x-flts=on in scalable mode
>>>
>>> MAINTAINERS | 1 +
>>> hw/i386/intel_iommu_internal.h | 68 +-
>>> include/hw/i386/intel_iommu.h | 9 +-
>>> include/hw/iommu.h | 17 +
>>> include/hw/pci/pci.h | 27 +
>>> include/hw/vfio/vfio-container-base.h | 1 +
>>> hw/i386/intel_iommu.c | 941 +++++++++++++++++++++++++-
>>> hw/pci/pci.c | 23 +-
>>> hw/vfio/iommufd.c | 22 +-
>>> hw/vfio/listener.c | 13 +-
>>> hw/i386/trace-events | 8 +
>>> 11 files changed, 1088 insertions(+), 42 deletions(-)
>>> create mode 100644 include/hw/iommu.h
>>>
>>>
>>> base-commit: 92c05be4dfb59a71033d4c57dac944b29f7dabf0
>>> --
>>> 2.47.1
>>
>
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH v4 00/20] intel_iommu: Enable stage-1 translation for passthrough device
2025-08-21 7:19 ` [PATCH v4 00/20] intel_iommu: Enable stage-1 translation for passthrough device Duan, Zhenzhong
@ 2025-08-21 8:50 ` Yi Liu
2025-08-21 8:50 ` Eric Auger
2025-08-21 8:50 ` Duan, Zhenzhong
2025-08-21 8:51 ` Michael S. Tsirkin
1 sibling, 2 replies; 41+ messages in thread
From: Yi Liu @ 2025-08-21 8:50 UTC (permalink / raw)
To: Duan, Zhenzhong, qemu-devel@nongnu.org
Cc: alex.williamson@redhat.com, clg@redhat.com, eric.auger@redhat.com,
mst@redhat.com, jasowang@redhat.com, peterx@redhat.com,
ddutile@redhat.com, jgg@nvidia.com, nicolinc@nvidia.com,
shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
clement.mathieu--drif@eviden.com, Tian, Kevin, Peng, Chao P
On 2025/8/21 15:19, Duan, Zhenzhong wrote:
> Kindly ping, any more comments?
Do you have enough comments for a new version? I plan to have a look at
either this version or a new version next week. :)
Regards,
Yi Liu
> Thanks
> Zhenzhong
>
>> -----Original Message-----
>> From: Duan, Zhenzhong <zhenzhong.duan@intel.com>
>> Subject: [PATCH v4 00/20] intel_iommu: Enable stage-1 translation for
>> passthrough device
>>
>> Hi,
>>
>> For a passthrough device with intel_iommu.x-flts=on, we don't shadow the
>> guest page table; instead, the stage-1 page table is passed to the host
>> side to construct a nested domain. There was earlier work to enable this
>> feature, see [1] for details.
>>
>> The key design is to utilize the dual-stage IOMMU translation (also known
>> as IOMMU nested translation) capability of the host IOMMU. As the diagram
>> below shows, the guest I/O page table pointer in GPA (guest physical
>> address) is passed to the host and is used to perform the stage-1 address
>> translation. Along with it, modifications to present mappings in the guest
>> I/O page table should be followed by an IOTLB invalidation.
>>
>>     .-------------.  .---------------------------.
>>     |   vIOMMU    |  | Guest I/O page table      |
>>     |             |  '---------------------------'
>>     .----------------/
>>     | PASID Entry |--- PASID cache flush --+
>>     '-------------'                        |
>>     |             |                        V
>>     |             |        I/O page table pointer in GPA
>>     '-------------'
>> Guest
>> ------| Shadow |---------------------------|--------
>>       v        v                           v
>> Host
>>     .-------------.  .------------------------.
>>     |   pIOMMU    |  | Stage1 for GIOVA->GPA  |
>>     |             |  '------------------------'
>>     .----------------/  |
>>     | PASID Entry |     V (Nested xlate)
>>     '----------------\.--------------------------------------.
>>     |             |   | Stage2 for GPA->HPA, unmanaged domain|
>>     |             |   '--------------------------------------'
>>     '-------------'
>> For historical reasons, there are different namings in different VT-d spec
>> revisions, where:
>> - Stage1 = First stage = First level = flts
>> - Stage2 = Second stage = Second level = slts
>> <Intel VT-d Nested translation>
>>
>> This series reuses the VFIO device's default hwpt as the nested parent
>> instead of creating a new one. This avoids duplicating code for a new
>> memory listener, and all existing features of the VFIO listener can be
>> shared, e.g., RAM discard, dirty tracking, etc. Two limitations are:
>> 1) a VFIO device under a PCI bridge together with an emulated device is
>> not supported, because the emulated device wants the IOMMU AS while the
>> VFIO device sticks to the system AS; 2) kexec or reboot from
>> "intel_iommu=on,sm_on" to "intel_iommu=on,sm_off" is not supported,
>> because the VFIO device's default hwpt is created with the NEST_PARENT
>> flag and the kernel inhibits RO mappings when switching to shadow mode.
>>
>> This series is also prerequisite work for vSVA, i.e. sharing guest
>> application address space with passthrough devices.
>>
>> There are some interactions between VFIO and vIOMMU
>> * vIOMMU registers PCIIOMMUOps [set|unset]_iommu_device to PCI
>> subsystem. VFIO calls them to register/unregister HostIOMMUDevice
>> instance to vIOMMU at vfio device realize stage.
>> * vIOMMU registers PCIIOMMUOps get_viommu_cap to PCI subsystem.
>> VFIO calls it to get vIOMMU exposed capabilities.
>> * vIOMMU calls HostIOMMUDeviceIOMMUFD interface [at|de]tach_hwpt
>> to bind/unbind device to IOMMUFD backed domains, either nested
>> domain or not.
>>
>> See below diagram:
>>
>>        VFIO Device                           Intel IOMMU
>>  .-----------------.                    .---------------------.
>>  |                 |                    |                     |
>>  |   .---------|PCIIOMMUOps             |  .-------------.    |
>>  |   | IOMMUFD |(set/unset_iommu_device)|  | Host IOMMU  |    |
>>  |   | Device  |------------------------>  | Device list |    |
>>  |   .---------|(get_viommu_cap)        |  .-------------.    |
>>  |   |         |                        |        |            |
>>  |   |         |                        |        V            |
>>  |   .---------| HostIOMMUDeviceIOMMUFD |  .-------------.    |
>>  |   | IOMMUFD |          (attach_hwpt) |  | Host IOMMU  |    |
>>  |   |  link   |<------------------------  |   Device    |    |
>>  |   .---------|          (detach_hwpt) |  .-------------.    |
>>  |   |         |                        |        |            |
>>  |   |         |                        |        ...          |
>>  .-----------------.                    .---------------------.
>>
>> Below is an example to enable stage-1 translation for passthrough device:
>>
>> -M q35,...
>> -device intel-iommu,x-scalable-mode=on,x-flts=on...
>> -object iommufd,id=iommufd0 -device vfio-pci,iommufd=iommufd0,...
>>
>> Test done:
>> - VFIO devices hotplug/unplug
>> - different VFIO devices linked to different iommufds
>> - vhost net device ping test
>>
>> PATCH1-6: Some preparing work
>> PATCH7-8: Compatibility check between vIOMMU and Host IOMMU
>> PATCH9-17: Implement stage-1 page table for passthrough device
>> PATCH18-19: Workaround for ERRATA_772415_SPR17
>> PATCH20: Enable stage-1 translation for passthrough device
>>
>> Qemu code can be found at [2]
>>
>> Fault reporting isn't supported in this series; we presume the guest
>> kernel always constructs a correct stage-1 page table for the passthrough
>> device. For emulated devices, the emulation code already provides stage-1
>> fault injection.
>>
>> TODO:
>> - Fault report to guest when HW stage1 faults
>>
>> [1] https://patchwork.kernel.org/project/kvm/cover/20210302203827.437645-1-yi.l.liu@intel.com/
>> [2] https://github.com/yiliu1765/qemu/tree/zhenzhong/iommufd_nesting.v4
>>
>> Thanks
>> Zhenzhong
>>
>> Changelog:
>> v4:
>> - s/VIOMMU_CAP_STAGE1/VIOMMU_CAP_HW_NESTED (Eric, Nicolin,
>> Donald, Shameer)
>> - clarify get_viommu_cap() return pure emulated caps and explain reason in
>> commit log (Eric)
>> - retrieve the ce only if vtd_as->pasid in vtd_as_to_iommu_pasid_locked (Eric)
>> - refine doc comment and commit log in patch10-11 (Eric)
>>
>> v3:
>> - define enum type for VIOMMU_CAP_* (Eric)
>> - drop inline flag in the patch which uses the helper (Eric)
>> - use extract64 in new introduced MACRO (Eric)
>> - polish comments and fix typo error (Eric)
>> - split workaround patch for ERRATA_772415_SPR17 to two patches (Eric)
>> - optimize bind/unbind error path processing
>>
>> v2:
>> - introduce get_viommu_cap() to get STAGE1 flag to create nested parent
>> hwpt (Liuyi)
>> - reuse VFIO's default hwpt as parent hwpt of nested translation (Nicolin,
>> Liuyi)
>> - abandon support of VFIO device under pcie-to-pci bridge to simplify design
>> (Liuyi)
>> - bypass RO mapping in VFIO's default hwpt if ERRATA_772415_SPR17 (Liuyi)
>> - drop vtd_dev_to_context_entry optimization (Liuyi)
>>
>> v1:
>> - simplify vendor specific checking in vtd_check_hiod (Cedric, Nicolin)
>> - rebase to master
>>
>> rfcv3:
>> - s/hwpt_id/id in iommufd_backend_invalidate_cache()'s parameter
>> (Shameer)
>> - hide vtd vendor specific caps in a wrapper union (Eric, Nicolin)
>> - simplify return value check of get_cap() (Eric)
>> - drop realize_late (Cedric, Eric)
>> - split patch13:intel_iommu: Add PASID cache management infrastructure
>> (Eric)
>> - s/vtd_pasid_cache_reset/vtd_pasid_cache_reset_locked (Eric)
>> - s/vtd_pe_get_domain_id/vtd_pe_get_did (Eric)
>> - refine comments (Eric, Donald)
>>
>> rfcv2:
>> - Drop VTDPASIDAddressSpace and use VTDAddressSpace (Eric, Liuyi)
>> - Move HWPT uAPI patches ahead(patch1-8) so arm nesting could easily
>> rebase
>> - add two cleanup patches(patch9-10)
>> - VFIO passes iommufd/devid/hwpt_id to vIOMMU instead of
>> iommufd/devid/ioas_id
>> - add vtd_as_[from|to]_iommu_pasid() helper to translate between vtd_as
>> and
>> iommu pasid, this is important for dropping VTDPASIDAddressSpace
>>
>>
>> Yi Liu (3):
>> intel_iommu: Replay pasid bindings after context cache invalidation
>> intel_iommu: Propagate PASID-based iotlb invalidation to host
>> intel_iommu: Replay all pasid bindings when either SRTP or TE bit is
>> changed
>>
>> Zhenzhong Duan (17):
>> intel_iommu: Rename vtd_ce_get_rid2pasid_entry to
>> vtd_ce_get_pasid_entry
>> hw/pci: Introduce pci_device_get_viommu_cap()
>> intel_iommu: Implement get_viommu_cap() callback
>> vfio/iommufd: Force creating nested parent domain
>> hw/pci: Export pci_device_get_iommu_bus_devfn() and return bool
>> intel_iommu: Introduce a new structure VTDHostIOMMUDevice
>> intel_iommu: Check for compatibility with IOMMUFD backed device when
>> x-flts=on
>> intel_iommu: Fail passthrough device under PCI bridge if x-flts=on
>> intel_iommu: Introduce two helpers vtd_as_from/to_iommu_pasid_locked
>> intel_iommu: Handle PASID entry removal and update
>> intel_iommu: Handle PASID entry addition
>> intel_iommu: Introduce a new pasid cache invalidation type FORCE_RESET
>> intel_iommu: Stick to system MR for IOMMUFD backed host device when
>> x-fls=on
>> intel_iommu: Bind/unbind guest page table to host
>> vfio: Add a new element bypass_ro in VFIOContainerBase
>> Workaround for ERRATA_772415_SPR17
>> intel_iommu: Enable host device when x-flts=on in scalable mode
>>
>> MAINTAINERS | 1 +
>> hw/i386/intel_iommu_internal.h | 68 +-
>> include/hw/i386/intel_iommu.h | 9 +-
>> include/hw/iommu.h | 17 +
>> include/hw/pci/pci.h | 27 +
>> include/hw/vfio/vfio-container-base.h | 1 +
>> hw/i386/intel_iommu.c | 941 +++++++++++++++++++++++++-
>> hw/pci/pci.c | 23 +-
>> hw/vfio/iommufd.c | 22 +-
>> hw/vfio/listener.c | 13 +-
>> hw/i386/trace-events | 8 +
>> 11 files changed, 1088 insertions(+), 42 deletions(-)
>> create mode 100644 include/hw/iommu.h
>>
>>
>> base-commit: 92c05be4dfb59a71033d4c57dac944b29f7dabf0
>> --
>> 2.47.1
>
^ permalink raw reply [flat|nested] 41+ messages in thread
* RE: [PATCH v4 00/20] intel_iommu: Enable stage-1 translation for passthrough device
2025-08-21 8:50 ` Yi Liu
2025-08-21 8:50 ` Eric Auger
@ 2025-08-21 8:50 ` Duan, Zhenzhong
1 sibling, 0 replies; 41+ messages in thread
From: Duan, Zhenzhong @ 2025-08-21 8:50 UTC (permalink / raw)
To: Liu, Yi L, qemu-devel@nongnu.org
Cc: alex.williamson@redhat.com, clg@redhat.com, eric.auger@redhat.com,
mst@redhat.com, jasowang@redhat.com, peterx@redhat.com,
ddutile@redhat.com, jgg@nvidia.com, nicolinc@nvidia.com,
shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
clement.mathieu--drif@eviden.com, Tian, Kevin, Peng, Chao P
>-----Original Message-----
>From: Liu, Yi L <yi.l.liu@intel.com>
>Subject: Re: [PATCH v4 00/20] intel_iommu: Enable stage-1 translation for
>passthrough device
>
>On 2025/8/21 15:19, Duan, Zhenzhong wrote:
>> Kindly ping, any more comments?
>
>Do you have enough comments for a new version? I plan to have a look at
>either this version or a new version next week. :)
That's appreciated, thanks Yi. I think not; there are only a few comments from Cedric on the VFIO-related patches.
Zhenzhong
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [PATCH v4 00/20] intel_iommu: Enable stage-1 translation for passthrough device
2025-08-21 7:19 ` [PATCH v4 00/20] intel_iommu: Enable stage-1 translation for passthrough device Duan, Zhenzhong
2025-08-21 8:50 ` Yi Liu
@ 2025-08-21 8:51 ` Michael S. Tsirkin
1 sibling, 0 replies; 41+ messages in thread
From: Michael S. Tsirkin @ 2025-08-21 8:51 UTC (permalink / raw)
To: Duan, Zhenzhong
Cc: qemu-devel@nongnu.org, alex.williamson@redhat.com, clg@redhat.com,
eric.auger@redhat.com, jasowang@redhat.com, peterx@redhat.com,
ddutile@redhat.com, jgg@nvidia.com, nicolinc@nvidia.com,
shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
Peng, Chao P
On Thu, Aug 21, 2025 at 07:19:43AM +0000, Duan, Zhenzhong wrote:
> Kindly ping, any more comments?
>
> Thanks
> Zhenzhong
I think there have been enough comments to spin v5.
--
MST
^ permalink raw reply [flat|nested] 41+ messages in thread
* RE: [PATCH v4 00/20] intel_iommu: Enable stage-1 translation for passthrough device
2025-08-21 8:50 ` Eric Auger
@ 2025-08-21 8:58 ` Duan, Zhenzhong
0 siblings, 0 replies; 41+ messages in thread
From: Duan, Zhenzhong @ 2025-08-21 8:58 UTC (permalink / raw)
To: eric.auger@redhat.com, Liu, Yi L, qemu-devel@nongnu.org
Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
jgg@nvidia.com, nicolinc@nvidia.com,
shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
clement.mathieu--drif@eviden.com, Tian, Kevin, Peng, Chao P
Hi Eric, Yi,
>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: Re: [PATCH v4 00/20] intel_iommu: Enable stage-1 translation for
>passthrough device
>
>
>
>On 8/21/25 10:50 AM, Yi Liu wrote:
>> On 2025/8/21 15:19, Duan, Zhenzhong wrote:
>>> Kindly ping, any more comments?
>>
>> Do you have enough comments for a new version? I plan to have a look at
>> either this version or a new version next week. :)
>same for me ;-)
I'll send v5 per Michael's suggestion by end of this week.
Thanks
Zhenzhong
^ permalink raw reply [flat|nested] 41+ messages in thread
Thread overview: 41+ messages
2025-07-29 9:20 [PATCH v4 00/20] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
2025-07-29 9:20 ` [PATCH v4 01/20] intel_iommu: Rename vtd_ce_get_rid2pasid_entry to vtd_ce_get_pasid_entry Zhenzhong Duan
2025-07-29 9:20 ` [PATCH v4 02/20] hw/pci: Introduce pci_device_get_viommu_cap() Zhenzhong Duan
2025-07-29 13:19 ` Cédric Le Goater
2025-07-30 10:51 ` Duan, Zhenzhong
2025-08-04 18:45 ` Nicolin Chen
2025-08-05 6:04 ` Duan, Zhenzhong
2025-08-13 6:43 ` Jim Shu
2025-08-13 6:57 ` Duan, Zhenzhong
2025-07-29 9:20 ` [PATCH v4 03/20] intel_iommu: Implement get_viommu_cap() callback Zhenzhong Duan
2025-07-29 9:20 ` [PATCH v4 04/20] vfio/iommufd: Force creating nested parent domain Zhenzhong Duan
2025-07-29 13:44 ` Cédric Le Goater
2025-07-30 10:55 ` Duan, Zhenzhong
2025-07-30 14:00 ` Cédric Le Goater
2025-07-31 3:29 ` Duan, Zhenzhong
2025-07-29 9:20 ` [PATCH v4 05/20] hw/pci: Export pci_device_get_iommu_bus_devfn() and return bool Zhenzhong Duan
2025-07-29 9:20 ` [PATCH v4 06/20] intel_iommu: Introduce a new structure VTDHostIOMMUDevice Zhenzhong Duan
2025-07-29 9:20 ` [PATCH v4 07/20] intel_iommu: Check for compatibility with IOMMUFD backed device when x-flts=on Zhenzhong Duan
2025-07-29 9:20 ` [PATCH v4 08/20] intel_iommu: Fail passthrough device under PCI bridge if x-flts=on Zhenzhong Duan
2025-07-29 9:20 ` [PATCH v4 09/20] intel_iommu: Introduce two helpers vtd_as_from/to_iommu_pasid_locked Zhenzhong Duan
2025-07-29 9:20 ` [PATCH v4 10/20] intel_iommu: Handle PASID entry removal and update Zhenzhong Duan
2025-07-29 9:20 ` [PATCH v4 11/20] intel_iommu: Handle PASID entry addition Zhenzhong Duan
2025-07-29 9:20 ` [PATCH v4 12/20] intel_iommu: Introduce a new pasid cache invalidation type FORCE_RESET Zhenzhong Duan
2025-07-29 9:20 ` [PATCH v4 13/20] intel_iommu: Stick to system MR for IOMMUFD backed host device when x-fls=on Zhenzhong Duan
2025-07-29 9:20 ` [PATCH v4 14/20] intel_iommu: Bind/unbind guest page table to host Zhenzhong Duan
2025-07-29 9:20 ` [PATCH v4 15/20] intel_iommu: Replay pasid bindings after context cache invalidation Zhenzhong Duan
2025-07-29 9:20 ` [PATCH v4 16/20] intel_iommu: Propagate PASID-based iotlb invalidation to host Zhenzhong Duan
2025-07-29 9:20 ` [PATCH v4 17/20] intel_iommu: Replay all pasid bindings when either SRTP or TE bit is changed Zhenzhong Duan
2025-07-29 9:20 ` [PATCH v4 18/20] vfio: Add a new element bypass_ro in VFIOContainerBase Zhenzhong Duan
2025-07-29 13:55 ` Cédric Le Goater
2025-07-30 10:58 ` Duan, Zhenzhong
2025-07-30 14:18 ` Cédric Le Goater
2025-07-31 3:30 ` Duan, Zhenzhong
2025-07-29 9:20 ` [PATCH v4 19/20] Workaround for ERRATA_772415_SPR17 Zhenzhong Duan
2025-07-29 9:20 ` [PATCH v4 20/20] intel_iommu: Enable host device when x-flts=on in scalable mode Zhenzhong Duan
2025-08-21 7:19 ` [PATCH v4 00/20] intel_iommu: Enable stage-1 translation for passthrough device Duan, Zhenzhong
2025-08-21 8:50 ` Yi Liu
2025-08-21 8:50 ` Eric Auger
2025-08-21 8:58 ` Duan, Zhenzhong
2025-08-21 8:50 ` Duan, Zhenzhong
2025-08-21 8:51 ` Michael S. Tsirkin