* [PATCH v1 00/15] intel_iommu: Enable stage-1 translation for passthrough device
@ 2025-06-06 10:04 Zhenzhong Duan
2025-06-06 10:04 ` [PATCH v1 01/15] intel_iommu: Rename vtd_ce_get_rid2pasid_entry to vtd_ce_get_pasid_entry Zhenzhong Duan
` (14 more replies)
0 siblings, 15 replies; 36+ messages in thread
From: Zhenzhong Duan @ 2025-06-06 10:04 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan
Hi,
After the VFIO/IOMMUFD prerequisite patchset was accepted, this series now
focuses on stage-1 translation for passthrough devices in intel_iommu.
I think it's time to bump from rfcv3 to v1.
rfcv3 cover-letter:
Per Jason Wang's suggestion, iommufd nesting series[1] is split into
"Enable stage-1 translation for emulated device" series and
"Enable stage-1 translation for passthrough device" series.
This series is the 2nd part, focusing on passthrough devices. We don't shadow
the guest page table for passthrough devices; instead we pass the stage-1 page
table to the host side to construct a nested domain. There were earlier efforts
to enable this feature, see [2] for details.
The key design is to utilize the dual-stage IOMMU translation (also known as
IOMMU nested translation) capability of the host IOMMU. As the diagram below
shows, the guest I/O page table pointer in GPA (guest physical address) is
passed to the host and used to perform the stage-1 address translation. Along
with it, modifications to present mappings in the guest I/O page table must be
followed by an IOTLB invalidation.
    .-------------.  .---------------------------.
    |   vIOMMU    |  | Guest I/O page table      |
    |             |  '---------------------------'
    .----------------/
    | PASID Entry |--- PASID cache flush --+
    '-------------'                        |
    |             |                        V
    |             |           I/O page table pointer in GPA
    '-------------'
Guest
------| Shadow |---------------------------|--------
      v        v                           v
Host
    .-------------.  .------------------------.
    |   pIOMMU    |  | Stage1 for GIOVA->GPA  |
    |             |  '------------------------'
    .----------------/  |
    | PASID Entry |     V (Nested xlate)
    '----------------\.--------------------------------------.
    |             |   | Stage2 for GPA->HPA, unmanaged domain|
    |             |   '--------------------------------------'
    '-------------'
<Intel VT-d Nested translation>

For history reasons, there are different namings in different VT-d spec
revisions, where:
- Stage1 = First stage = First level = flts
- Stage2 = Second stage = Second level = slts
There are some interactions between VFIO and vIOMMU:
* vIOMMU registers PCIIOMMUOps [set|unset]_iommu_device with the PCI
  subsystem. VFIO calls them to register/unregister a HostIOMMUDevice
  instance with the vIOMMU at VFIO device realize time.
* vIOMMU calls the HostIOMMUDeviceIOMMUFD interface [at|de]tach_hwpt
  to bind/unbind the device to IOMMUFD backed domains, either nested
  domains or not.
See below diagram:
        VFIO Device                                 Intel IOMMU
    .-----------------.                         .-------------------.
    |                 |                         |                   |
    |       .---------|PCIIOMMUOps              |.-------------.    |
    |       | IOMMUFD |(set_iommu_device)       || Host IOMMU  |    |
    |       | Device  |------------------------>|| Device list |    |
    |       .---------|(unset_iommu_device)     |.-------------.    |
    |                 |                         |       |           |
    |                 |                         |       V           |
    |       .---------|  HostIOMMUDeviceIOMMUFD |  .-------------.  |
    |       | IOMMUFD |            (attach_hwpt)|  | Host IOMMU  |  |
    |       |  link   |<------------------------|  |   Device    |  |
    |       .---------|            (detach_hwpt)|  .-------------.  |
    |                 |                         |       |           |
    |                 |                         |       ...         |
    .-----------------.                         .-------------------.
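
For orientation, here is a minimal sketch of the vIOMMU side of this contract.
PCIIOMMUOps and pci_setup_iommu() are existing QEMU PCI-subsystem hooks, and
the callback names match the diff quoted in patch 4 below, but the bodies here
are elided and illustrative rather than the series' exact code:

    /* intel_iommu.c (sketch): register [set|unset]_iommu_device with PCI */
    static bool vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
                                         HostIOMMUDevice *hiod, Error **errp)
    {
        /* check compatibility, then record hiod in s->vtd_host_iommu_dev */
        return true;
    }

    static void vtd_dev_unset_iommu_device(PCIBus *bus, void *opaque, int devfn)
    {
        /* drop the HostIOMMUDevice reference kept for this bus/devfn */
    }

    static PCIIOMMUOps vtd_iommu_ops = {
        .get_address_space = vtd_host_dma_iommu,
        .set_iommu_device = vtd_dev_set_iommu_device,
        .unset_iommu_device = vtd_dev_unset_iommu_device,
    };

    /* at vIOMMU realize time: */
    pci_setup_iommu(bus, &vtd_iommu_ops, s);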
Based on Yi's suggestion, this design is optimal in sharing ioas/hwpt whenever
possible, creating new ones only on demand; it also supports multiple iommufd
objects and ERRATA_772415.

E.g., under one guest's scope, a stage-2 page table can be shared by different
devices if there is no conflict and the devices link to the same iommufd object,
i.e. devices under the same host IOMMU can share the same stage-2 page table.
If there is a conflict, e.g. one device is in non cache coherency (non-CC) mode
while the others are not, that device requires a separate stage-2 page table in
non-CC mode.
The SPR platform has ERRATA_772415, which requires that there be no readonly
mappings in the stage-2 page table. This series supports creating a
VTDIOASContainer with no readonly mappings. Even in the rare case where some
IOMMUs on a multi-IOMMU host have ERRATA_772415 and others do not, this design
still works.
See below example diagram for a full view:
                       IntelIOMMUState
                              |
                              V
    .------------------.    .------------------.    .-------------------.
    | VTDIOASContainer |--->| VTDIOASContainer |--->| VTDIOASContainer  |-->...
    | (iommufd0,RW&RO) |    | (iommufd1,RW&RO) |    | (iommufd0,only RW)|
    .------------------.    .------------------.    .-------------------.
             |                        |                        |
             |                        .-->...                  |
             V                                                 V
    .---------------.    .-------------------.           .---------------.
    | VTDS2Hwpt(CC) |--->| VTDS2Hwpt(non-CC) |-->...     | VTDS2Hwpt(CC) |-->...
    .---------------.    .-------------------.           .---------------.
       |         |                |                             |
       |         |                |                             |
    .-----------.  .-----------.  .------------.          .------------.
    | IOMMUFD   |  | IOMMUFD   |  | IOMMUFD    |          | IOMMUFD    |
    | Device(CC)|  | Device(CC)|  | Device     |          | Device(CC) |
    | (iommufd0)|  | (iommufd0)|  | (non-CC)   |          | (errata)   |
    |           |  |           |  | (iommufd0) |          | (iommufd0) |
    .-----------.  .-----------.  .------------.          .------------.
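
As a hedged sketch of the container selection rule above (the list head name,
the helper name and the errata flag below are assumed shapes following this
design, not quotes from the series):

    /* Sketch: reuse a VTDIOASContainer only when both the iommufd backend
     * and the ERRATA_772415 constraint (readonly mappings stripped or not)
     * match; otherwise the caller allocates a new one on demand. */
    static VTDIOASContainer *
    vtd_get_ioas_container(IntelIOMMUState *s, IOMMUFDBackend *iommufd,
                           bool errata)                    /* assumed helper */
    {
        VTDIOASContainer *container;

        QLIST_FOREACH(container, &s->containers, next) {   /* assumed field */
            if (container->iommufd == iommufd &&
                container->errata == errata) {             /* assumed field */
                return container;   /* share the existing IOAS */
            }
        }
        return NULL;                /* no match: create a new container */
    }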
This series is also prerequisite work for vSVA, i.e. sharing guest application
address space with passthrough devices.
To enable stage-1 translation, only "x-scalable-mode=on,x-flts=on" needs to be
added, i.e. -device intel-iommu,x-scalable-mode=on,x-flts=on...

A passthrough device should use the iommufd backend to work with stage-1
translation, i.e. -object iommufd,id=iommufd0 -device vfio-pci,iommufd=iommufd0,...

If the host doesn't support nested translation, QEMU fails with an error
reporting it as unsupported.
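
Putting the two together, a hypothetical full invocation could look like the
following (the host BDF 0000:3b:00.0 is a placeholder; the q35 machine type is
required for intel-iommu):

    qemu-system-x86_64 -M q35,accel=kvm,kernel-irqchip=split \
        -device intel-iommu,x-scalable-mode=on,x-flts=on \
        -object iommufd,id=iommufd0 \
        -device vfio-pci,host=0000:3b:00.0,iommufd=iommufd0 \
        ...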
Tests done:
- VFIO device hotplug/unplug
- different VFIO devices linked to different iommufds
- vhost net device ping test
- build on Windows

Fault reporting isn't supported in this series; we presume the guest kernel
always constructs correct stage-1 page tables for passthrough devices. For
emulated devices, the emulation code already provides stage-1 fault injection.
PATCH1-2: Some cleanup work
PATCH3: cap/ecap related compatibility check between vIOMMU and Host IOMMU
PATCH4-14: Implement stage-1 page table for passthrough device
PATCH15: Enable stage-1 translation for passthrough device
QEMU code can be found at [3]
TODO:
- RAM discard
- dirty tracking on stage-2 page table
- Fault report to guest when HW S1 faults
[1] https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg02740.html
[2] https://patchwork.kernel.org/project/kvm/cover/20210302203827.437645-1-yi.l.liu@intel.com/
[3] https://github.com/yiliu1765/qemu/tree/zhenzhong/iommufd_nesting.v1
Thanks
Zhenzhong
Changelog:
v1:
- simplify vendor specific checking in vtd_check_hiod (Cedric, Nicolin)
- rebase to master
rfcv3:
- s/hwpt_id/id in iommufd_backend_invalidate_cache()'s parameter (Shameer)
- hide vtd vendor specific caps in a wrapper union (Eric, Nicolin)
- simplify return value check of get_cap() (Eric)
- drop realize_late (Cedric, Eric)
- split patch13:intel_iommu: Add PASID cache management infrastructure (Eric)
- s/vtd_pasid_cache_reset/vtd_pasid_cache_reset_locked (Eric)
- s/vtd_pe_get_domain_id/vtd_pe_get_did (Eric)
- refine comments (Eric, Donald)
rfcv2:
- Drop VTDPASIDAddressSpace and use VTDAddressSpace (Eric, Liuyi)
- Move HWPT uAPI patches ahead(patch1-8) so arm nesting could easily rebase
- add two cleanup patches(patch9-10)
- VFIO passes iommufd/devid/hwpt_id to vIOMMU instead of iommufd/devid/ioas_id
- add vtd_as_[from|to]_iommu_pasid() helper to translate between vtd_as and
iommu pasid, this is important for dropping VTDPASIDAddressSpace
Yi Liu (3):
intel_iommu: Replay pasid binds after context cache invalidation
intel_iommu: Propagate PASID-based iotlb invalidation to host
intel_iommu: Refresh pasid bind when either SRTP or TE bit is changed
Zhenzhong Duan (12):
intel_iommu: Rename vtd_ce_get_rid2pasid_entry to
vtd_ce_get_pasid_entry
intel_iommu: Optimize context entry cache utilization
intel_iommu: Check for compatibility with IOMMUFD backed device when
x-flts=on
intel_iommu: Introduce a new structure VTDHostIOMMUDevice
intel_iommu: Introduce two helpers vtd_as_from/to_iommu_pasid_locked
intel_iommu: Handle PASID entry removing and updating
intel_iommu: Handle PASID entry adding
intel_iommu: Introduce a new pasid cache invalidation type FORCE_RESET
intel_iommu: Bind/unbind guest page table to host
intel_iommu: ERRATA_772415 workaround
intel_iommu: Bypass replay in stage-1 page table mode
intel_iommu: Enable host device when x-flts=on in scalable mode
hw/i386/intel_iommu_internal.h | 56 ++
include/hw/i386/intel_iommu.h | 33 +-
hw/i386/intel_iommu.c | 1659 ++++++++++++++++++++++++++++----
hw/i386/trace-events | 13 +
4 files changed, 1563 insertions(+), 198 deletions(-)
--
2.34.1
^ permalink raw reply [flat|nested] 36+ messages in thread
* [PATCH v1 01/15] intel_iommu: Rename vtd_ce_get_rid2pasid_entry to vtd_ce_get_pasid_entry
2025-06-06 10:04 [PATCH v1 00/15] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
@ 2025-06-06 10:04 ` Zhenzhong Duan
2025-06-11 7:20 ` Yi Liu
2025-06-17 17:16 ` Eric Auger
2025-06-06 10:04 ` [PATCH v1 02/15] intel_iommu: Optimize context entry cache utilization Zhenzhong Duan
` (13 subsequent siblings)
14 siblings, 2 replies; 36+ messages in thread
From: Zhenzhong Duan @ 2025-06-06 10:04 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan, Marcel Apfelbaum, Paolo Bonzini,
Richard Henderson, Eduardo Habkost
Originally vtd_ce_get_rid2pasid_entry() was used to get the pasid entry of
rid2pasid; it was later extended to get any pasid entry. The new name
vtd_ce_get_pasid_entry better matches what the function actually does.
No functional change intended.
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Clément Mathieu--Drif<clement.mathieu--drif@eviden.com>
---
hw/i386/intel_iommu.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 69d72ad35c..f0b1f90eff 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -944,7 +944,7 @@ static int vtd_get_pe_from_pasid_table(IntelIOMMUState *s,
return 0;
}
-static int vtd_ce_get_rid2pasid_entry(IntelIOMMUState *s,
+static int vtd_ce_get_pasid_entry(IntelIOMMUState *s,
VTDContextEntry *ce,
VTDPASIDEntry *pe,
uint32_t pasid)
@@ -1025,7 +1025,7 @@ static uint32_t vtd_get_iova_level(IntelIOMMUState *s,
VTDPASIDEntry pe;
if (s->root_scalable) {
- vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
+ vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
if (s->flts) {
return VTD_PE_GET_FL_LEVEL(&pe);
} else {
@@ -1048,7 +1048,7 @@ static uint32_t vtd_get_iova_agaw(IntelIOMMUState *s,
VTDPASIDEntry pe;
if (s->root_scalable) {
- vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
+ vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
return 30 + ((pe.val[0] >> 2) & VTD_SM_PASID_ENTRY_AW) * 9;
}
@@ -1116,7 +1116,7 @@ static dma_addr_t vtd_get_iova_pgtbl_base(IntelIOMMUState *s,
VTDPASIDEntry pe;
if (s->root_scalable) {
- vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
+ vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
if (s->flts) {
return pe.val[2] & VTD_SM_PASID_ENTRY_FLPTPTR;
} else {
@@ -1522,7 +1522,7 @@ static int vtd_ce_rid2pasid_check(IntelIOMMUState *s,
* has valid rid2pasid setting, which includes valid
* rid2pasid field and corresponding pasid entry setting
*/
- return vtd_ce_get_rid2pasid_entry(s, ce, &pe, PCI_NO_PASID);
+ return vtd_ce_get_pasid_entry(s, ce, &pe, PCI_NO_PASID);
}
/* Map a device to its corresponding domain (context-entry) */
@@ -1611,7 +1611,7 @@ static uint16_t vtd_get_domain_id(IntelIOMMUState *s,
VTDPASIDEntry pe;
if (s->root_scalable) {
- vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
+ vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
return VTD_SM_PASID_ENTRY_DID(pe.val[1]);
}
@@ -1687,7 +1687,7 @@ static bool vtd_dev_pt_enabled(IntelIOMMUState *s, VTDContextEntry *ce,
int ret;
if (s->root_scalable) {
- ret = vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
+ ret = vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
if (ret) {
/*
* This error is guest triggerable. We should assumt PT
--
2.34.1
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH v1 02/15] intel_iommu: Optimize context entry cache utilization
2025-06-06 10:04 [PATCH v1 00/15] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
2025-06-06 10:04 ` [PATCH v1 01/15] intel_iommu: Rename vtd_ce_get_rid2pasid_entry to vtd_ce_get_pasid_entry Zhenzhong Duan
@ 2025-06-06 10:04 ` Zhenzhong Duan
2025-06-11 7:48 ` Yi Liu
2025-06-17 17:24 ` Eric Auger
2025-06-06 10:04 ` [PATCH v1 03/15] intel_iommu: Check for compatibility with IOMMUFD backed device when x-flts=on Zhenzhong Duan
` (12 subsequent siblings)
14 siblings, 2 replies; 36+ messages in thread
From: Zhenzhong Duan @ 2025-06-06 10:04 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan, Marcel Apfelbaum, Paolo Bonzini,
Richard Henderson, Eduardo Habkost
There are many call sites that reference the context entry by calling
vtd_dev_to_context_entry(), which traverses the DMAR table. In most cases we
can use the cached context entry in vtd_as->context_cache_entry, except when
that entry is stale. Currently only global and domain context invalidation
make it stale.

So introduce a helper function vtd_as_to_context_entry() which fetches from
the cache before falling back to vtd_dev_to_context_entry().
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
hw/i386/intel_iommu.c | 36 +++++++++++++++++++++++-------------
1 file changed, 23 insertions(+), 13 deletions(-)
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index f0b1f90eff..a2f3250724 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -1597,6 +1597,22 @@ static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
return 0;
}
+static int vtd_as_to_context_entry(VTDAddressSpace *vtd_as, VTDContextEntry *ce)
+{
+ IntelIOMMUState *s = vtd_as->iommu_state;
+ uint8_t bus_num = pci_bus_num(vtd_as->bus);
+ uint8_t devfn = vtd_as->devfn;
+ VTDContextCacheEntry *cc_entry = &vtd_as->context_cache_entry;
+
+ /* Try to fetch context-entry from cache first */
+ if (cc_entry->context_cache_gen == s->context_cache_gen) {
+ *ce = cc_entry->context_entry;
+ return 0;
+ } else {
+ return vtd_dev_to_context_entry(s, bus_num, devfn, ce);
+ }
+}
+
static int vtd_sync_shadow_page_hook(const IOMMUTLBEvent *event,
void *private)
{
@@ -1649,9 +1665,7 @@ static int vtd_address_space_sync(VTDAddressSpace *vtd_as)
return 0;
}
- ret = vtd_dev_to_context_entry(vtd_as->iommu_state,
- pci_bus_num(vtd_as->bus),
- vtd_as->devfn, &ce);
+ ret = vtd_as_to_context_entry(vtd_as, &ce);
if (ret) {
if (ret == -VTD_FR_CONTEXT_ENTRY_P) {
/*
@@ -1710,8 +1724,7 @@ static bool vtd_as_pt_enabled(VTDAddressSpace *as)
assert(as);
s = as->iommu_state;
- if (vtd_dev_to_context_entry(s, pci_bus_num(as->bus), as->devfn,
- &ce)) {
+ if (vtd_as_to_context_entry(as, &ce)) {
/*
* Possibly failed to parse the context entry for some reason
* (e.g., during init, or any guest configuration errors on
@@ -2435,8 +2448,7 @@ static void vtd_iotlb_domain_invalidate(IntelIOMMUState *s, uint16_t domain_id)
vtd_iommu_unlock(s);
QLIST_FOREACH(vtd_as, &s->vtd_as_with_notifiers, next) {
- if (!vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
- vtd_as->devfn, &ce) &&
+ if (!vtd_as_to_context_entry(vtd_as, &ce) &&
domain_id == vtd_get_domain_id(s, &ce, vtd_as->pasid)) {
vtd_address_space_sync(vtd_as);
}
@@ -2458,8 +2470,7 @@ static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
hwaddr size = (1 << am) * VTD_PAGE_SIZE;
QLIST_FOREACH(vtd_as, &(s->vtd_as_with_notifiers), next) {
- ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
- vtd_as->devfn, &ce);
+ ret = vtd_as_to_context_entry(vtd_as, &ce);
if (!ret && domain_id == vtd_get_domain_id(s, &ce, vtd_as->pasid)) {
uint32_t rid2pasid = PCI_NO_PASID;
@@ -2966,8 +2977,7 @@ static void vtd_piotlb_pasid_invalidate(IntelIOMMUState *s,
vtd_iommu_unlock(s);
QLIST_FOREACH(vtd_as, &s->vtd_as_with_notifiers, next) {
- if (!vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
- vtd_as->devfn, &ce) &&
+ if (!vtd_as_to_context_entry(vtd_as, &ce) &&
domain_id == vtd_get_domain_id(s, &ce, vtd_as->pasid)) {
uint32_t rid2pasid = VTD_CE_GET_RID2PASID(&ce);
@@ -4146,7 +4156,7 @@ static void vtd_report_ir_illegal_access(VTDAddressSpace *vtd_as,
assert(vtd_as->pasid != PCI_NO_PASID);
/* Try out best to fetch FPD, we can't do anything more */
- if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {
+ if (vtd_as_to_context_entry(vtd_as, &ce) == 0) {
is_fpd_set = ce.lo & VTD_CONTEXT_ENTRY_FPD;
if (!is_fpd_set && s->root_scalable) {
vtd_ce_get_pasid_fpd(s, &ce, &is_fpd_set, vtd_as->pasid);
@@ -4506,7 +4516,7 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
/* replay is protected by BQL, page walk will re-setup it safely */
iova_tree_remove(vtd_as->iova_tree, map);
- if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {
+ if (vtd_as_to_context_entry(vtd_as, &ce) == 0) {
trace_vtd_replay_ce_valid(s->root_scalable ? "scalable mode" :
"legacy mode",
bus_n, PCI_SLOT(vtd_as->devfn),
--
2.34.1
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH v1 03/15] intel_iommu: Check for compatibility with IOMMUFD backed device when x-flts=on
2025-06-06 10:04 [PATCH v1 00/15] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
2025-06-06 10:04 ` [PATCH v1 01/15] intel_iommu: Rename vtd_ce_get_rid2pasid_entry to vtd_ce_get_pasid_entry Zhenzhong Duan
2025-06-06 10:04 ` [PATCH v1 02/15] intel_iommu: Optimize context entry cache utilization Zhenzhong Duan
@ 2025-06-06 10:04 ` Zhenzhong Duan
2025-06-17 17:49 ` Eric Auger
2025-06-06 10:04 ` [PATCH v1 04/15] intel_iommu: Introduce a new structure VTDHostIOMMUDevice Zhenzhong Duan
` (11 subsequent siblings)
14 siblings, 1 reply; 36+ messages in thread
From: Zhenzhong Duan @ 2025-06-06 10:04 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan, Paolo Bonzini, Richard Henderson, Eduardo Habkost,
Marcel Apfelbaum
When the vIOMMU is configured with x-flts=on in scalable mode, the stage-1
page table is passed to the host to construct a nested page table. We need to
check the compatibility of some critical IOMMU capabilities between the vIOMMU
and the host IOMMU to ensure the guest stage-1 page table can be used by the
host. For instance, if the vIOMMU supports stage-1 1GB huge page mapping but
the host does not, then realizing this IOMMUFD backed device should fail.
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
hw/i386/intel_iommu_internal.h | 1 +
hw/i386/intel_iommu.c | 28 ++++++++++++++++++++++++++++
2 files changed, 29 insertions(+)
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index e8b211e8b0..2cda744786 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -191,6 +191,7 @@
#define VTD_ECAP_PT (1ULL << 6)
#define VTD_ECAP_SC (1ULL << 7)
#define VTD_ECAP_MHMV (15ULL << 20)
+#define VTD_ECAP_NEST (1ULL << 26)
#define VTD_ECAP_SRS (1ULL << 31)
#define VTD_ECAP_PASID (1ULL << 40)
#define VTD_ECAP_SMTS (1ULL << 43)
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index a2f3250724..c42ef83ddc 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -39,6 +39,7 @@
#include "kvm/kvm_i386.h"
#include "migration/vmstate.h"
#include "trace.h"
+#include "system/iommufd.h"
/* context entry operations */
#define VTD_CE_GET_RID2PASID(ce) \
@@ -4361,6 +4362,33 @@ static bool vtd_check_hiod(IntelIOMMUState *s, HostIOMMUDevice *hiod,
return true;
}
+#ifdef CONFIG_IOMMUFD
+ struct HostIOMMUDeviceCaps *caps = &hiod->caps;
+ struct iommu_hw_info_vtd *vtd = &caps->vendor_caps.vtd;
+
+ /* Remaining checks are all stage-1 translation specific */
+ if (!object_dynamic_cast(OBJECT(hiod), TYPE_HOST_IOMMU_DEVICE_IOMMUFD)) {
+ error_setg(errp, "Need IOMMUFD backend when x-flts=on");
+ return false;
+ }
+
+ if (caps->type != IOMMU_HW_INFO_TYPE_INTEL_VTD) {
+ error_setg(errp, "Incompatible host platform IOMMU type %d",
+ caps->type);
+ return false;
+ }
+
+ if (!(vtd->ecap_reg & VTD_ECAP_NEST)) {
+ error_setg(errp, "Host IOMMU doesn't support nested translation");
+ return false;
+ }
+
+ if (s->fs1gp && !(vtd->cap_reg & VTD_CAP_FS1GP)) {
+ error_setg(errp, "Stage-1 1GB huge page is unsupported by host IOMMU");
+ return false;
+ }
+#endif
+
error_setg(errp, "host device is uncompatible with stage-1 translation");
return false;
}
--
2.34.1
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH v1 04/15] intel_iommu: Introduce a new structure VTDHostIOMMUDevice
2025-06-06 10:04 [PATCH v1 00/15] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
` (2 preceding siblings ...)
2025-06-06 10:04 ` [PATCH v1 03/15] intel_iommu: Check for compatibility with IOMMUFD backed device when x-flts=on Zhenzhong Duan
@ 2025-06-06 10:04 ` Zhenzhong Duan
2025-06-12 16:04 ` CLEMENT MATHIEU--DRIF
2025-06-06 10:04 ` [PATCH v1 05/15] intel_iommu: Introduce two helpers vtd_as_from/to_iommu_pasid_locked Zhenzhong Duan
` (10 subsequent siblings)
14 siblings, 1 reply; 36+ messages in thread
From: Zhenzhong Duan @ 2025-06-06 10:04 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan, Paolo Bonzini, Richard Henderson, Eduardo Habkost,
Marcel Apfelbaum
Introduce a new structure, VTDHostIOMMUDevice, which replaces HostIOMMUDevice
as the object stored in the hash table.

It includes references to the HostIOMMUDevice and the IntelIOMMUState, and
also BDF information which will be used in future patches.
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
hw/i386/intel_iommu_internal.h | 7 +++++++
include/hw/i386/intel_iommu.h | 2 +-
hw/i386/intel_iommu.c | 14 ++++++++++++--
3 files changed, 20 insertions(+), 3 deletions(-)
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 2cda744786..18bc22fc72 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -28,6 +28,7 @@
#ifndef HW_I386_INTEL_IOMMU_INTERNAL_H
#define HW_I386_INTEL_IOMMU_INTERNAL_H
#include "hw/i386/intel_iommu.h"
+#include "system/host_iommu_device.h"
/*
* Intel IOMMU register specification
@@ -608,4 +609,10 @@ typedef struct VTDRootEntry VTDRootEntry;
/* Bits to decide the offset for each level */
#define VTD_LEVEL_BITS 9
+typedef struct VTDHostIOMMUDevice {
+ IntelIOMMUState *iommu_state;
+ PCIBus *bus;
+ uint8_t devfn;
+ HostIOMMUDevice *hiod;
+} VTDHostIOMMUDevice;
#endif
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index e95477e855..50f9b27a45 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -295,7 +295,7 @@ struct IntelIOMMUState {
/* list of registered notifiers */
QLIST_HEAD(, VTDAddressSpace) vtd_as_with_notifiers;
- GHashTable *vtd_host_iommu_dev; /* HostIOMMUDevice */
+ GHashTable *vtd_host_iommu_dev; /* VTDHostIOMMUDevice */
/* interrupt remapping */
bool intr_enabled; /* Whether guest enabled IR */
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index c42ef83ddc..796b71605c 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -281,7 +281,10 @@ static gboolean vtd_hiod_equal(gconstpointer v1, gconstpointer v2)
static void vtd_hiod_destroy(gpointer v)
{
- object_unref(v);
+ VTDHostIOMMUDevice *vtd_hiod = v;
+
+ object_unref(vtd_hiod->hiod);
+ g_free(vtd_hiod);
}
static gboolean vtd_hash_remove_by_domain(gpointer key, gpointer value,
@@ -4397,6 +4400,7 @@ static bool vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
HostIOMMUDevice *hiod, Error **errp)
{
IntelIOMMUState *s = opaque;
+ VTDHostIOMMUDevice *vtd_hiod;
struct vtd_as_key key = {
.bus = bus,
.devfn = devfn,
@@ -4413,6 +4417,12 @@ static bool vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
return false;
}
+ vtd_hiod = g_malloc0(sizeof(VTDHostIOMMUDevice));
+ vtd_hiod->bus = bus;
+ vtd_hiod->devfn = (uint8_t)devfn;
+ vtd_hiod->iommu_state = s;
+ vtd_hiod->hiod = hiod;
+
if (!vtd_check_hiod(s, hiod, errp)) {
vtd_iommu_unlock(s);
return false;
@@ -4423,7 +4433,7 @@ static bool vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
new_key->devfn = devfn;
object_ref(hiod);
- g_hash_table_insert(s->vtd_host_iommu_dev, new_key, hiod);
+ g_hash_table_insert(s->vtd_host_iommu_dev, new_key, vtd_hiod);
vtd_iommu_unlock(s);
--
2.34.1
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH v1 05/15] intel_iommu: Introduce two helpers vtd_as_from/to_iommu_pasid_locked
2025-06-06 10:04 [PATCH v1 00/15] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
` (3 preceding siblings ...)
2025-06-06 10:04 ` [PATCH v1 04/15] intel_iommu: Introduce a new structure VTDHostIOMMUDevice Zhenzhong Duan
@ 2025-06-06 10:04 ` Zhenzhong Duan
2025-06-11 9:54 ` Yi Liu
2025-06-06 10:04 ` [PATCH v1 06/15] intel_iommu: Handle PASID entry removing and updating Zhenzhong Duan
` (9 subsequent siblings)
14 siblings, 1 reply; 36+ messages in thread
From: Zhenzhong Duan @ 2025-06-06 10:04 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan, Paolo Bonzini, Richard Henderson, Eduardo Habkost,
Marcel Apfelbaum
We already have vtd_find_add_as() to find an AS from BDF+pasid, but that pasid
is passed from the PCI subsystem. A PCI device supports two request types,
Requests-without-PASID and Requests-with-PASID. A Request-without-PASID doesn't
include a PASID TLP prefix; the IOMMU fetches rid_pasid from the context entry
and uses it as the IOMMU's pasid to index the pasid table.

So we need to translate between PCI's pasid and the IOMMU's pasid, specifically
for Requests-without-PASID, e.g., PCI_NO_PASID(-1) <-> rid_pasid. For
Requests-with-PASID, PCI's pasid and the IOMMU's pasid are the same value.

vtd_as_from_iommu_pasid_locked() translates from BDF+iommu_pasid to the vtd_as
holding PCI's pasid in vtd_as->pasid.
vtd_as_to_iommu_pasid_locked() translates from BDF+vtd_as->pasid to iommu_pasid.
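
For illustration, a hypothetical caller holding the iommu lock could resolve
both directions as below; error handling is elided and the snippet only
demonstrates the PCI_NO_PASID <-> rid_pasid rule:

    uint32_t iommu_pasid;
    uint16_t sid = PCI_BUILD_BDF(pci_bus_num(vtd_as->bus), vtd_as->devfn);

    /* PCI pasid -> iommu pasid: PCI_NO_PASID resolves to the context
     * entry's rid_pasid, any other value passes through unchanged. */
    if (!vtd_as_to_iommu_pasid_locked(vtd_as, &iommu_pasid)) {
        /* iommu pasid -> vtd_as: find the address space owning this pasid */
        vtd_as = vtd_as_from_iommu_pasid_locked(s, sid, iommu_pasid);
    }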
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
hw/i386/intel_iommu.c | 50 +++++++++++++++++++++++++++++++++++++++++++
1 file changed, 50 insertions(+)
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 796b71605c..112e09e305 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -1617,6 +1617,56 @@ static int vtd_as_to_context_entry(VTDAddressSpace *vtd_as, VTDContextEntry *ce)
}
}
+static inline int vtd_as_to_iommu_pasid_locked(VTDAddressSpace *vtd_as,
+ uint32_t *pasid)
+{
+ VTDContextEntry ce;
+ int ret;
+
+ ret = vtd_as_to_context_entry(vtd_as, &ce);
+ if (ret) {
+ return ret;
+ }
+
+ /* Translate to iommu pasid if PCI_NO_PASID */
+ if (vtd_as->pasid == PCI_NO_PASID) {
+ *pasid = VTD_CE_GET_RID2PASID(&ce);
+ } else {
+ *pasid = vtd_as->pasid;
+ }
+
+ return 0;
+}
+
+static gboolean vtd_find_as_by_sid_and_iommu_pasid(gpointer key, gpointer value,
+ gpointer user_data)
+{
+ VTDAddressSpace *vtd_as = (VTDAddressSpace *)value;
+ struct vtd_as_raw_key *target = (struct vtd_as_raw_key *)user_data;
+ uint16_t sid = PCI_BUILD_BDF(pci_bus_num(vtd_as->bus), vtd_as->devfn);
+ uint32_t pasid;
+
+ if (vtd_as_to_iommu_pasid_locked(vtd_as, &pasid)) {
+ return false;
+ }
+
+ return (pasid == target->pasid) && (sid == target->sid);
+}
+
+/* Translate iommu pasid to vtd_as */
+static inline
+VTDAddressSpace *vtd_as_from_iommu_pasid_locked(IntelIOMMUState *s,
+ uint16_t sid, uint32_t pasid)
+{
+ struct vtd_as_raw_key key = {
+ .sid = sid,
+ .pasid = pasid
+ };
+
+ return g_hash_table_find(s->vtd_address_spaces,
+ vtd_find_as_by_sid_and_iommu_pasid, &key);
+}
+
static int vtd_sync_shadow_page_hook(const IOMMUTLBEvent *event,
void *private)
{
--
2.34.1
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH v1 06/15] intel_iommu: Handle PASID entry removing and updating
2025-06-06 10:04 [PATCH v1 00/15] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
` (4 preceding siblings ...)
2025-06-06 10:04 ` [PATCH v1 05/15] intel_iommu: Introduce two helpers vtd_as_from/to_iommu_pasid_locked Zhenzhong Duan
@ 2025-06-06 10:04 ` Zhenzhong Duan
2025-06-17 12:29 ` Yi Liu
2025-06-06 10:04 ` [PATCH v1 07/15] intel_iommu: Handle PASID entry adding Zhenzhong Duan
` (8 subsequent siblings)
14 siblings, 1 reply; 36+ messages in thread
From: Zhenzhong Duan @ 2025-06-06 10:04 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan, Yi Sun, Paolo Bonzini, Richard Henderson,
Eduardo Habkost, Marcel Apfelbaum
This adds a new entry, VTDPASIDCacheEntry, in VTDAddressSpace to cache the
pasid entry and track PASID usage, in preparation for future PASID tagged DMA
address translation support in the vIOMMU.

The VTDAddressSpace for PCI_NO_PASID is allocated when the device is plugged
and never freed. For other pasids, a VTDAddressSpace instance is
created/destroyed as the guest pasid entry is set up/destroyed for passthrough
devices. For emulated devices, a VTDAddressSpace instance is created during
PASID tagged DMA translation and destroyed on guest PASID cache invalidation.
This patch focuses on PASID cache management for passthrough devices, as there
are no PASID capable emulated devices yet.

When the guest modifies a PASID entry, QEMU captures the guest pasid selective
pasid cache invalidation and allocates or removes a VTDAddressSpace instance
per the invalidation reason:
a) a present pasid entry moved to non-present
b) a present pasid entry modified to another present entry
c) a non-present pasid entry moved to present

This patch handles a) and b); the following patch will handle c). The vIOMMU
emulator figures out the reason by fetching the latest guest pasid entry and
comparing it with the cached one.
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
hw/i386/intel_iommu_internal.h | 26 ++++
include/hw/i386/intel_iommu.h | 6 +
hw/i386/intel_iommu.c | 252 +++++++++++++++++++++++++++++++--
hw/i386/trace-events | 3 +
4 files changed, 277 insertions(+), 10 deletions(-)
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 18bc22fc72..82b84db80f 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -315,6 +315,7 @@ typedef enum VTDFaultReason {
* request while disabled */
VTD_FR_IR_SID_ERR = 0x26, /* Invalid Source-ID */
+ VTD_FR_RTADDR_INV_TTM = 0x31, /* Invalid TTM in RTADDR */
/* PASID directory entry access failure */
VTD_FR_PASID_DIR_ACCESS_ERR = 0x50,
/* The Present(P) field of pasid directory entry is 0 */
@@ -492,6 +493,15 @@ typedef union VTDInvDesc VTDInvDesc;
#define VTD_INV_DESC_PIOTLB_RSVD_VAL0 0xfff000000000f1c0ULL
#define VTD_INV_DESC_PIOTLB_RSVD_VAL1 0xf80ULL
+#define VTD_INV_DESC_PASIDC_G (3ULL << 4)
+#define VTD_INV_DESC_PASIDC_PASID(val) (((val) >> 32) & 0xfffffULL)
+#define VTD_INV_DESC_PASIDC_DID(val) (((val) >> 16) & VTD_DOMAIN_ID_MASK)
+#define VTD_INV_DESC_PASIDC_RSVD_VAL0 0xfff000000000f1c0ULL
+
+#define VTD_INV_DESC_PASIDC_DSI (0ULL << 4)
+#define VTD_INV_DESC_PASIDC_PASID_SI (1ULL << 4)
+#define VTD_INV_DESC_PASIDC_GLOBAL (3ULL << 4)
+
/* Information about page-selective IOTLB invalidate */
struct VTDIOTLBPageInvInfo {
uint16_t domain_id;
@@ -552,6 +562,21 @@ typedef struct VTDRootEntry VTDRootEntry;
#define VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw) (0x1e0ULL | ~VTD_HAW_MASK(aw))
#define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1 0xffffffffffe00000ULL
+typedef enum VTDPCInvType {
+ /* pasid cache invalidation rely on guest PASID entry */
+ VTD_PASID_CACHE_GLOBAL_INV, /* pasid cache global invalidation */
+ VTD_PASID_CACHE_DOMSI, /* pasid cache domain selective invalidation */
+ VTD_PASID_CACHE_PASIDSI, /* pasid cache pasid selective invalidation */
+} VTDPCInvType;
+
+typedef struct VTDPASIDCacheInfo {
+ VTDPCInvType type;
+ uint16_t domain_id;
+ uint32_t pasid;
+ PCIBus *bus;
+ uint16_t devfn;
+} VTDPASIDCacheInfo;
+
/* PASID Table Related Definitions */
#define VTD_PASID_DIR_BASE_ADDR_MASK (~0xfffULL)
#define VTD_PASID_TABLE_BASE_ADDR_MASK (~0xfffULL)
@@ -563,6 +588,7 @@ typedef struct VTDRootEntry VTDRootEntry;
#define VTD_PASID_TABLE_BITS_MASK (0x3fULL)
#define VTD_PASID_TABLE_INDEX(pasid) ((pasid) & VTD_PASID_TABLE_BITS_MASK)
#define VTD_PASID_ENTRY_FPD (1ULL << 1) /* Fault Processing Disable */
+#define VTD_PASID_TBL_ENTRY_NUM (1ULL << 6)
/* PASID Granular Translation Type Mask */
#define VTD_PASID_ENTRY_P 1ULL
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 50f9b27a45..fbc9da903a 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -95,6 +95,11 @@ struct VTDPASIDEntry {
uint64_t val[8];
};
+typedef struct VTDPASIDCacheEntry {
+ struct VTDPASIDEntry pasid_entry;
+ bool cache_filled;
+} VTDPASIDCacheEntry;
+
struct VTDAddressSpace {
PCIBus *bus;
uint8_t devfn;
@@ -107,6 +112,7 @@ struct VTDAddressSpace {
MemoryRegion iommu_ir_fault; /* Interrupt region for catching fault */
IntelIOMMUState *iommu_state;
VTDContextCacheEntry context_cache_entry;
+ VTDPASIDCacheEntry pasid_cache_entry;
QLIST_ENTRY(VTDAddressSpace) next;
/* Superset of notifier flags that this address space has */
IOMMUNotifierFlag notifier_flags;
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 112e09e305..c7162647e6 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -825,6 +825,11 @@ static inline bool vtd_pe_type_check(IntelIOMMUState *s, VTDPASIDEntry *pe)
}
}
+static inline uint16_t vtd_pe_get_did(VTDPASIDEntry *pe)
+{
+ return VTD_SM_PASID_ENTRY_DID((pe)->val[1]);
+}
+
static inline bool vtd_pdire_present(VTDPASIDDirEntry *pdire)
{
return pdire->val & 1;
@@ -3104,6 +3109,236 @@ static bool vtd_process_piotlb_desc(IntelIOMMUState *s,
return true;
}
+static inline int vtd_dev_get_pe_from_pasid(VTDAddressSpace *vtd_as,
+ uint32_t pasid, VTDPASIDEntry *pe)
+{
+ IntelIOMMUState *s = vtd_as->iommu_state;
+ VTDContextEntry ce;
+ int ret;
+
+ if (!s->root_scalable) {
+ return -VTD_FR_RTADDR_INV_TTM;
+ }
+
+ ret = vtd_as_to_context_entry(vtd_as, &ce);
+ if (ret) {
+ return ret;
+ }
+
+ return vtd_ce_get_pasid_entry(s, &ce, pe, pasid);
+}
+
+static bool vtd_pasid_entry_compare(VTDPASIDEntry *p1, VTDPASIDEntry *p2)
+{
+ return !memcmp(p1, p2, sizeof(*p1));
+}
+
+/*
+ * This function fills in the pasid entry in &vtd_as. Caller
+ * of this function should hold iommu_lock.
+ */
+static void vtd_fill_pe_in_cache(IntelIOMMUState *s, VTDAddressSpace *vtd_as,
+ VTDPASIDEntry *pe)
+{
+ VTDPASIDCacheEntry *pc_entry = &vtd_as->pasid_cache_entry;
+
+ if (vtd_pasid_entry_compare(pe, &pc_entry->pasid_entry)) {
+ /* No need to go further as cached pasid entry is latest */
+ return;
+ }
+
+ pc_entry->pasid_entry = *pe;
+ pc_entry->cache_filled = true;
+ /*
+ * TODO: send pasid bind to host for passthru devices
+ */
+}
+
+/*
+ * This function is used to clear cached pasid entry in vtd_as
+ * instances. Caller of this function should hold iommu_lock.
+ */
+static gboolean vtd_flush_pasid(gpointer key, gpointer value,
+ gpointer user_data)
+{
+ VTDPASIDCacheInfo *pc_info = user_data;
+ VTDAddressSpace *vtd_as = value;
+ IntelIOMMUState *s = vtd_as->iommu_state;
+ VTDPASIDCacheEntry *pc_entry = &vtd_as->pasid_cache_entry;
+ VTDPASIDEntry pe;
+ uint16_t did;
+ uint32_t pasid;
+ int ret;
+
+ /* Replay only filled pasid entry cache for passthrough device */
+ if (!pc_entry->cache_filled) {
+ return false;
+ }
+ did = vtd_pe_get_did(&pc_entry->pasid_entry);
+
+ if (vtd_as_to_iommu_pasid_locked(vtd_as, &pasid)) {
+ goto remove;
+ }
+
+ switch (pc_info->type) {
+ case VTD_PASID_CACHE_PASIDSI:
+ if (pc_info->pasid != pasid) {
+ return false;
+ }
+ /* Fall through */
+ case VTD_PASID_CACHE_DOMSI:
+ if (pc_info->domain_id != did) {
+ return false;
+ }
+ /* Fall through */
+ case VTD_PASID_CACHE_GLOBAL_INV:
+ break;
+ default:
+ error_report("invalid pc_info->type");
+ abort();
+ }
+
+ /*
+ * pasid cache invalidation may indicate a present pasid
+ * entry to present pasid entry modification. To cover such
+ * case, vIOMMU emulator needs to fetch latest guest pasid
+ * entry and check cached pasid entry, then update pasid
+ * cache and send pasid bind/unbind to host properly.
+ */
+ ret = vtd_dev_get_pe_from_pasid(vtd_as, pasid, &pe);
+ if (ret) {
+ /*
+ * No valid pasid entry in guest memory. e.g. pasid entry
+ * was modified to be either all-zero or non-present. Either
+ * case means existing pasid cache should be removed.
+ */
+ goto remove;
+ }
+
+ vtd_fill_pe_in_cache(s, vtd_as, &pe);
+ return false;
+
+remove:
+ /*
+ * TODO: send pasid unbind to host for passthru devices
+ */
+ pc_entry->cache_filled = false;
+
+ /*
+ * Don't remove address space of PCI_NO_PASID which is created by PCI
+ * sub-system.
+ */
+ if (vtd_as->pasid == PCI_NO_PASID) {
+ return false;
+ }
+ return true;
+}
+
+/*
+ * This function syncs the pasid bindings between guest and host.
+ * It includes updating the pasid cache in vIOMMU and updating the
+ * pasid bindings per guest's latest pasid entry presence.
+ */
+static void vtd_pasid_cache_sync(IntelIOMMUState *s,
+ VTDPASIDCacheInfo *pc_info)
+{
+ if (!s->flts || !s->root_scalable || !s->dmar_enabled) {
+ return;
+ }
+
+ /*
+ * Regards to a pasid cache invalidation, e.g. a PSI.
+ * it could be either cases of below:
+ * a) a present pasid entry moved to non-present
+ * b) a present pasid entry to be a present entry
+ * c) a non-present pasid entry moved to present
+ *
+ * Different invalidation granularity may affect different device
+ * scope and pasid scope. But for each invalidation granularity,
+ * it needs to do two steps to sync host and guest pasid binding.
+ *
+ * Here is the handling of a PSI:
+ * 1) loop all the existing vtd_as instances to update them
+ * according to the latest guest pasid entry in pasid table.
+ * this will make sure affected existing vtd_as instances
+ * cached the latest pasid entries. Also, during the loop, the
+ * host should be notified if needed. e.g. pasid unbind or pasid
+ * update. Should be able to cover case a) and case b).
+ *
+ * 2) loop all devices to cover case c)
+ * - For devices which are backed by HostIOMMUDeviceIOMMUFD instances,
+ * we loop them and check if guest pasid entry exists. If yes,
+ * it is case c), we update the pasid cache and also notify
+ * host.
+ * - For devices which are not backed by HostIOMMUDeviceIOMMUFD,
+ * it is not necessary to create pasid cache at this phase since
+ * it could be created when vIOMMU does DMA address translation.
+ * This is not yet implemented since there is no emulated
+ * pasid-capable devices today. If we have such devices in
+ * future, the pasid cache shall be created there.
+ * Other granularity follow the same steps, just with different scope
+ *
+ */
+
+ vtd_iommu_lock(s);
+ /*
+ * Step 1: loop all the existing vtd_as instances for pasid unbind and
+ * update.
+ */
+ g_hash_table_foreach_remove(s->vtd_address_spaces, vtd_flush_pasid,
+ pc_info);
+ vtd_iommu_unlock(s);
+
+ /* TODO: Step 2: loop all the existing vtd_hiod instances for pasid bind. */
+}
+
+static bool vtd_process_pasid_desc(IntelIOMMUState *s,
+ VTDInvDesc *inv_desc)
+{
+ uint16_t domain_id;
+ uint32_t pasid;
+ VTDPASIDCacheInfo pc_info;
+ uint64_t mask[4] = {VTD_INV_DESC_PASIDC_RSVD_VAL0, VTD_INV_DESC_ALL_ONE,
+ VTD_INV_DESC_ALL_ONE, VTD_INV_DESC_ALL_ONE};
+
+ if (!vtd_inv_desc_reserved_check(s, inv_desc, mask, true,
+ __func__, "pasid cache inv")) {
+ return false;
+ }
+
+ domain_id = VTD_INV_DESC_PASIDC_DID(inv_desc->val[0]);
+ pasid = VTD_INV_DESC_PASIDC_PASID(inv_desc->val[0]);
+
+ switch (inv_desc->val[0] & VTD_INV_DESC_PASIDC_G) {
+ case VTD_INV_DESC_PASIDC_DSI:
+ trace_vtd_pasid_cache_dsi(domain_id);
+ pc_info.type = VTD_PASID_CACHE_DOMSI;
+ pc_info.domain_id = domain_id;
+ break;
+
+ case VTD_INV_DESC_PASIDC_PASID_SI:
+ /* PASID selective implies a DID selective */
+ trace_vtd_pasid_cache_psi(domain_id, pasid);
+ pc_info.type = VTD_PASID_CACHE_PASIDSI;
+ pc_info.domain_id = domain_id;
+ pc_info.pasid = pasid;
+ break;
+
+ case VTD_INV_DESC_PASIDC_GLOBAL:
+ trace_vtd_pasid_cache_gsi();
+ pc_info.type = VTD_PASID_CACHE_GLOBAL_INV;
+ break;
+
+ default:
+ error_report_once("invalid-inv-granu-in-pc_inv_desc hi: 0x%" PRIx64
+ " lo: 0x%" PRIx64, inv_desc->val[1], inv_desc->val[0]);
+ return false;
+ }
+
+ vtd_pasid_cache_sync(s, &pc_info);
+ return true;
+}
+
static bool vtd_process_inv_iec_desc(IntelIOMMUState *s,
VTDInvDesc *inv_desc)
{
@@ -3265,6 +3500,13 @@ static bool vtd_process_inv_desc(IntelIOMMUState *s)
}
break;
+ case VTD_INV_DESC_PC:
+ trace_vtd_inv_desc("pasid-cache", inv_desc.val[1], inv_desc.val[0]);
+ if (!vtd_process_pasid_desc(s, &inv_desc)) {
+ return false;
+ }
+ break;
+
case VTD_INV_DESC_PIOTLB:
trace_vtd_inv_desc("p-iotlb", inv_desc.val[1], inv_desc.val[0]);
if (!vtd_process_piotlb_desc(s, &inv_desc)) {
@@ -3300,16 +3542,6 @@ static bool vtd_process_inv_desc(IntelIOMMUState *s)
}
break;
- /*
- * TODO: the entity of below two cases will be implemented in future series.
- * To make guest (which integrates scalable mode support patch set in
- * iommu driver) work, just return true is enough so far.
- */
- case VTD_INV_DESC_PC:
- if (s->scalable_mode) {
- break;
- }
- /* fallthrough */
default:
error_report_once("%s: invalid inv desc: hi=%"PRIx64", lo=%"PRIx64
" (unknown type)", __func__, inv_desc.hi,
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index ac9e1a10aa..ae5bbfcdc0 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -24,6 +24,9 @@ vtd_inv_qi_head(uint16_t head) "read head %d"
vtd_inv_qi_tail(uint16_t head) "write tail %d"
vtd_inv_qi_fetch(void) ""
vtd_context_cache_reset(void) ""
+vtd_pasid_cache_gsi(void) ""
+vtd_pasid_cache_dsi(uint16_t domain) "Domain selective PC invalidation domain 0x%"PRIx16
+vtd_pasid_cache_psi(uint16_t domain, uint32_t pasid) "PASID selective PC invalidation domain 0x%"PRIx16" pasid 0x%"PRIx32
vtd_re_not_present(uint8_t bus) "Root entry bus %"PRIu8" not present"
vtd_ce_not_present(uint8_t bus, uint8_t devfn) "Context entry bus %"PRIu8" devfn %"PRIu8" not present"
vtd_iotlb_page_hit(uint16_t sid, uint64_t addr, uint64_t slpte, uint16_t domain) "IOTLB page hit sid 0x%"PRIx16" iova 0x%"PRIx64" slpte 0x%"PRIx64" domain 0x%"PRIx16
--
2.34.1
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH v1 07/15] intel_iommu: Handle PASID entry adding
2025-06-06 10:04 [PATCH v1 00/15] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
` (5 preceding siblings ...)
2025-06-06 10:04 ` [PATCH v1 06/15] intel_iommu: Handle PASID entry removing and updating Zhenzhong Duan
@ 2025-06-06 10:04 ` Zhenzhong Duan
2025-06-06 10:04 ` [PATCH v1 08/15] intel_iommu: Introduce a new pasid cache invalidation type FORCE_RESET Zhenzhong Duan
` (7 subsequent siblings)
14 siblings, 0 replies; 36+ messages in thread
From: Zhenzhong Duan @ 2025-06-06 10:04 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan, Yi Sun, Marcel Apfelbaum, Paolo Bonzini,
Richard Henderson, Eduardo Habkost
When the guest modifies a PASID entry, QEMU captures the guest pasid selective
pasid cache invalidation and allocates or removes a VTDAddressSpace instance
per the invalidation reason:
a) a present pasid entry moved to non-present
b) a present pasid entry modified to another present entry
c) a non-present pasid entry moved to present

This patch handles c).
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
hw/i386/intel_iommu_internal.h | 1 +
hw/i386/intel_iommu.c | 167 ++++++++++++++++++++++++++++++++-
2 files changed, 167 insertions(+), 1 deletion(-)
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 82b84db80f..4f6d9e9036 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -558,6 +558,7 @@ typedef struct VTDRootEntry VTDRootEntry;
#define VTD_CTX_ENTRY_LEGACY_SIZE 16
#define VTD_CTX_ENTRY_SCALABLE_SIZE 32
+#define VTD_SM_CONTEXT_ENTRY_PDTS(val) (((val) >> 9) & 0x7)
#define VTD_SM_CONTEXT_ENTRY_RID2PASID_MASK 0xfffff
#define VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw) (0x1e0ULL | ~VTD_HAW_MASK(aw))
#define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1 0xffffffffffe00000ULL
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index c7162647e6..fb3c740593 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -825,6 +825,11 @@ static inline bool vtd_pe_type_check(IntelIOMMUState *s, VTDPASIDEntry *pe)
}
}
+static inline uint32_t vtd_sm_ce_get_pdt_entry_num(VTDContextEntry *ce)
+{
+ return 1U << (VTD_SM_CONTEXT_ENTRY_PDTS(ce->val[0]) + 7);
+}
+
static inline uint16_t vtd_pe_get_did(VTDPASIDEntry *pe)
{
return VTD_SM_PASID_ENTRY_DID((pe)->val[1]);
@@ -3234,6 +3239,157 @@ remove:
return true;
}
+static void vtd_sm_pasid_table_walk_one(IntelIOMMUState *s,
+ dma_addr_t pt_base,
+ int start,
+ int end,
+ VTDPASIDCacheInfo *info)
+{
+ VTDPASIDEntry pe;
+ int pasid = start;
+ int pasid_next;
+
+ while (pasid < end) {
+ pasid_next = pasid + 1;
+
+ if (!vtd_get_pe_in_pasid_leaf_table(s, pasid, pt_base, &pe)
+ && vtd_pe_present(&pe)) {
+ int bus_n = pci_bus_num(info->bus), devfn = info->devfn;
+ uint16_t sid = PCI_BUILD_BDF(bus_n, devfn);
+ VTDAddressSpace *vtd_as;
+
+ vtd_iommu_lock(s);
+ /*
+ * When indexed by rid2pasid, vtd_as should have been created,
+ * e.g., by PCI subsystem. For other iommu pasid, we need to
+ * create vtd_as dynamically. The other iommu pasid is same as
+ * PCI's pasid, so it's used as input of vtd_find_add_as().
+ */
+ vtd_as = vtd_as_from_iommu_pasid_locked(s, sid, pasid);
+ if (!vtd_as) {
+ vtd_iommu_unlock(s);
+ vtd_as = vtd_find_add_as(s, info->bus, devfn, pasid);
+ }
+ vtd_iommu_unlock(s);
+
+ if ((info->type == VTD_PASID_CACHE_DOMSI ||
+ info->type == VTD_PASID_CACHE_PASIDSI) &&
+ !(info->domain_id == vtd_pe_get_did(&pe))) {
+ /*
+ * VTD_PASID_CACHE_DOMSI and VTD_PASID_CACHE_PASIDSI
+ * requires domain ID check. If domain Id check fail,
+ * go to next pasid.
+ */
+ pasid = pasid_next;
+ continue;
+ }
+ vtd_fill_pe_in_cache(s, vtd_as, &pe);
+ }
+ pasid = pasid_next;
+ }
+}
+
+/*
+ * Currently, VT-d scalable mode pasid table is a two level table,
+ * this function aims to loop a range of PASIDs in a given pasid
+ * table to identify the pasid config in guest.
+ */
+static void vtd_sm_pasid_table_walk(IntelIOMMUState *s,
+ dma_addr_t pdt_base,
+ int start,
+ int end,
+ VTDPASIDCacheInfo *info)
+{
+ VTDPASIDDirEntry pdire;
+ int pasid = start;
+ int pasid_next;
+ dma_addr_t pt_base;
+
+ while (pasid < end) {
+ pasid_next = ((end - pasid) > VTD_PASID_TBL_ENTRY_NUM) ?
+ (pasid + VTD_PASID_TBL_ENTRY_NUM) : end;
+ if (!vtd_get_pdire_from_pdir_table(pdt_base, pasid, &pdire)
+ && vtd_pdire_present(&pdire)) {
+ pt_base = pdire.val & VTD_PASID_TABLE_BASE_ADDR_MASK;
+ vtd_sm_pasid_table_walk_one(s, pt_base, pasid, pasid_next, info);
+ }
+ pasid = pasid_next;
+ }
+}
+
+static void vtd_replay_pasid_bind_for_dev(IntelIOMMUState *s,
+ int start, int end,
+ VTDPASIDCacheInfo *info)
+{
+ VTDContextEntry ce;
+ VTDAddressSpace *vtd_as;
+
+ vtd_as = vtd_find_add_as(s, info->bus, info->devfn, PCI_NO_PASID);
+
+ if (!vtd_as_to_context_entry(vtd_as, &ce)) {
+ uint32_t max_pasid;
+
+ max_pasid = vtd_sm_ce_get_pdt_entry_num(&ce) * VTD_PASID_TBL_ENTRY_NUM;
+ if (end > max_pasid) {
+ end = max_pasid;
+ }
+ vtd_sm_pasid_table_walk(s,
+ VTD_CE_GET_PASID_DIR_TABLE(&ce),
+ start,
+ end,
+ info);
+ }
+}
+
+/*
+ * This function replay the guest pasid bindings to hosts by
+ * walking the guest PASID table. This ensures host will have
+ * latest guest pasid bindings.
+ */
+static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
+ VTDPASIDCacheInfo *pc_info)
+{
+ VTDHostIOMMUDevice *vtd_hiod;
+ int start = 0, end = 1; /* only rid2pasid is supported */
+ VTDPASIDCacheInfo walk_info;
+ GHashTableIter as_it;
+
+ switch (pc_info->type) {
+ case VTD_PASID_CACHE_PASIDSI:
+ start = pc_info->pasid;
+ end = pc_info->pasid + 1;
+ /*
+ * PASID selective invalidation is within domain,
+ * thus fall through.
+ */
+ case VTD_PASID_CACHE_DOMSI:
+ case VTD_PASID_CACHE_GLOBAL_INV:
+ /* loop all assigned devices */
+ break;
+ default:
+ error_report("invalid pc_info->type for replay");
+ abort();
+ }
+
+ /*
+ * In this replay, only needs to care about the devices which
+ * are backed by host IOMMU. For such devices, their vtd_hiod
+ * instances are in the s->vtd_host_iommu_dev. For devices which
+ * are not backed by host IOMMU, it is not necessary to replay
+ * the bindings since their cache could be re-created in the future
+ * DMA address translation. Access to vtd_host_iommu_dev is already
+ * protected by BQL, so no iommu lock needed here.
+ */
+ walk_info = *pc_info;
+ g_hash_table_iter_init(&as_it, s->vtd_host_iommu_dev);
+ while (g_hash_table_iter_next(&as_it, NULL, (void **)&vtd_hiod)) {
+ /* bus|devfn fields are not identical with pc_info */
+ walk_info.bus = vtd_hiod->bus;
+ walk_info.devfn = vtd_hiod->devfn;
+ vtd_replay_pasid_bind_for_dev(s, start, end, &walk_info);
+ }
+}
+
/*
* This function syncs the pasid bindings between guest and host.
* It includes updating the pasid cache in vIOMMU and updating the
@@ -3289,7 +3445,16 @@ static void vtd_pasid_cache_sync(IntelIOMMUState *s,
pc_info);
vtd_iommu_unlock(s);
- /* TODO: Step 2: loop all the existing vtd_hiod instances for pasid bind. */
+ /*
+ * Step 2: loop all the existing vtd_hiod instances for pasid bind.
+ * Ideally, needs to loop all devices to find if there is any new
+ * PASID binding regards to the PASID cache invalidation request.
+ * But it is enough to loop the devices which are backed by host
+ * IOMMU. For devices backed by vIOMMU (a.k.a emulated devices),
+ * if new PASID happened on them, their vtd_as instance could
+ * be created during future vIOMMU DMA translation.
+ */
+ vtd_replay_guest_pasid_bindings(s, pc_info);
}
static bool vtd_process_pasid_desc(IntelIOMMUState *s,
--
2.34.1
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH v1 08/15] intel_iommu: Introduce a new pasid cache invalidation type FORCE_RESET
2025-06-06 10:04 [PATCH v1 00/15] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
` (6 preceding siblings ...)
2025-06-06 10:04 ` [PATCH v1 07/15] intel_iommu: Handle PASID entry adding Zhenzhong Duan
@ 2025-06-06 10:04 ` Zhenzhong Duan
2025-06-06 10:04 ` [PATCH v1 09/15] intel_iommu: Bind/unbind guest page table to host Zhenzhong Duan
` (6 subsequent siblings)
14 siblings, 0 replies; 36+ messages in thread
From: Zhenzhong Duan @ 2025-06-06 10:04 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan, Marcel Apfelbaum, Paolo Bonzini,
Richard Henderson, Eduardo Habkost
FORCE_RESET is different from GLOBAL_INV: GLOBAL_INV updates the pasid cache
if the underlying pasid entry is still valid, while FORCE_RESET drops all
pasid caches unconditionally.

FORCE_RESET isn't a VT-d spec defined invalidation type for the pasid cache;
it is only used internally on system level reset.
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
hw/i386/intel_iommu_internal.h | 2 ++
hw/i386/intel_iommu.c | 28 ++++++++++++++++++++++++++++
hw/i386/trace-events | 1 +
3 files changed, 31 insertions(+)
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 4f6d9e9036..5e5583d94a 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -564,6 +564,8 @@ typedef struct VTDRootEntry VTDRootEntry;
#define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1 0xffffffffffe00000ULL
typedef enum VTDPCInvType {
+ /* Force reset all */
+ VTD_PASID_CACHE_FORCE_RESET = 0,
/* pasid cache invalidation rely on guest PASID entry */
VTD_PASID_CACHE_GLOBAL_INV, /* pasid cache global invalidation */
VTD_PASID_CACHE_DOMSI, /* pasid cache domain selective invalidation */
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index fb3c740593..81b0bb978b 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -86,6 +86,8 @@ struct vtd_iotlb_key {
static void vtd_address_space_refresh_all(IntelIOMMUState *s);
static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
+static void vtd_pasid_cache_reset_locked(IntelIOMMUState *s);
+
static void vtd_panic_require_caching_mode(void)
{
error_report("We need to set caching-mode=on for intel-iommu to enable "
@@ -390,6 +392,7 @@ static void vtd_reset_caches(IntelIOMMUState *s)
vtd_iommu_lock(s);
vtd_reset_iotlb_locked(s);
vtd_reset_context_cache_locked(s);
+ vtd_pasid_cache_reset_locked(s);
vtd_iommu_unlock(s);
}
@@ -3186,6 +3189,8 @@ static gboolean vtd_flush_pasid(gpointer key, gpointer value,
}
switch (pc_info->type) {
+ case VTD_PASID_CACHE_FORCE_RESET:
+ goto remove;
case VTD_PASID_CACHE_PASIDSI:
if (pc_info->pasid != pasid) {
return false;
@@ -3239,6 +3244,26 @@ remove:
return true;
}
+/* Caller of this function should hold iommu_lock */
+static void vtd_pasid_cache_reset_locked(IntelIOMMUState *s)
+{
+ VTDPASIDCacheInfo pc_info;
+
+ trace_vtd_pasid_cache_reset();
+
+ pc_info.type = VTD_PASID_CACHE_FORCE_RESET;
+
+ /*
+ * Reset pasid cache is a big hammer, so use g_hash_table_foreach_remove
+ * which will free the vtd_as instances. Also, as a big hammer, use
+ * VTD_PASID_CACHE_FORCE_RESET to ensure all the vtd_as instances are
+ * dropped, meanwhile the change will be passed to host if
+ * HostIOMMUDeviceIOMMUFD is available.
+ */
+ g_hash_table_foreach_remove(s->vtd_address_spaces,
+ vtd_flush_pasid, &pc_info);
+}
+
static void vtd_sm_pasid_table_walk_one(IntelIOMMUState *s,
dma_addr_t pt_base,
int start,
@@ -3366,6 +3391,9 @@ static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
case VTD_PASID_CACHE_GLOBAL_INV:
/* loop all assigned devices */
break;
+ case VTD_PASID_CACHE_FORCE_RESET:
+ /* For force reset, no need to go further replay */
+ return;
default:
error_report("invalid pc_info->type for replay");
abort();
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index ae5bbfcdc0..c8a936eb46 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -24,6 +24,7 @@ vtd_inv_qi_head(uint16_t head) "read head %d"
vtd_inv_qi_tail(uint16_t head) "write tail %d"
vtd_inv_qi_fetch(void) ""
vtd_context_cache_reset(void) ""
+vtd_pasid_cache_reset(void) ""
vtd_pasid_cache_gsi(void) ""
vtd_pasid_cache_dsi(uint16_t domain) "Domain selective PC invalidation domain 0x%"PRIx16
vtd_pasid_cache_psi(uint16_t domain, uint32_t pasid) "PASID selective PC invalidation domain 0x%"PRIx16" pasid 0x%"PRIx32
--
2.34.1
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH v1 09/15] intel_iommu: Bind/unbind guest page table to host
2025-06-06 10:04 [PATCH v1 00/15] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
` (7 preceding siblings ...)
2025-06-06 10:04 ` [PATCH v1 08/15] intel_iommu: Introduce a new pasid cache invalidation type FORCE_RESET Zhenzhong Duan
@ 2025-06-06 10:04 ` Zhenzhong Duan
2025-06-06 10:04 ` [PATCH v1 10/15] intel_iommu: ERRATA_772415 workaround Zhenzhong Duan
` (5 subsequent siblings)
14 siblings, 0 replies; 36+ messages in thread
From: Zhenzhong Duan @ 2025-06-06 10:04 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan, Yi Sun, Paolo Bonzini, Richard Henderson,
Eduardo Habkost, Marcel Apfelbaum
This captures guest PASID table entry modifications and propagates the
changes to the host to attach a hwpt whose type is determined by the
guest PGTT configuration.
When PGTT is Pass-through (100b), the hwpt on the host side is a stage-2
page table (GPA->HPA). When PGTT is First-stage Translation only (001b),
the hwpt on the host side is a nested page table.
The guest page table is configured as a stage-1 page table (gIOVA->GPA)
whose translation result further goes through the host VT-d stage-2
page table (GPA->HPA) under nested translation mode. This is the key
to supporting gIOVA over a stage-1 page table for Intel VT-d in a
virtualization environment.
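In code terms this is a simple dispatch; the snippet below is a
condensed excerpt of vtd_device_attach_hwpt() from the diff in this
patch (error handling omitted):

    if (vtd_pe_pgtt_is_flt(pe)) {
        /* First-stage: allocate a nested s1 hwpt on top of the s2 hwpt */
        ret = vtd_create_s1_hwpt(idev, s2_hwpt, hwpt, pe, errp);
    } else {
        /* Pass-through: attach the stage-2 hwpt (GPA->HPA) directly */
        hwpt->hwpt_id = s2_hwpt->hwpt_id;
    }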
A stage-2 page table can be shared by different devices if there is no
conflict and the devices link to the same iommufd object, i.e. devices
under the same host IOMMU can share the same stage-2 page table. If there
is a conflict, e.g. one device is in non cache coherency (non-CC) mode
while the others are not, it requires a separate stage-2 page table in
non-CC mode.
See below example diagram:
IntelIOMMUState
|
V
.------------------. .------------------.
| VTDIOASContainer |--->| VTDIOASContainer |--->...
| (iommufd0) | | (iommufd1) |
.------------------. .------------------.
| |
| .-->...
V
.-------------------. .-------------------.
| VTDS2Hwpt(CC) |--->| VTDS2Hwpt(non-CC) |-->...
.-------------------. .-------------------.
| | |
| | |
.-----------. .-----------. .------------.
| IOMMUFD | | IOMMUFD | | IOMMUFD |
| Device(CC)| | Device(CC)| | Device |
| (iommufd0)| | (iommufd0)| | (non-CC) |
| | | | | (iommufd0) |
.-----------. .-----------. .------------.
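The sharing policy boils down to a reuse-or-create walk at two levels.
Below is a condensed sketch of vtd_device_attach_iommufd() and
vtd_device_attach_container() from this patch (locking and error paths
omitted):

    /* Level 1: reuse a container on the same iommufd, else allocate */
    QLIST_FOREACH(container, &s->containers, next) {
        if (container->iommufd == iommufd &&
            !vtd_device_attach_container(vtd_hiod, container, pasid,
                                         pe, hwpt, &err)) {
            return 0;                   /* reused an existing ioas */
        }
    }
    /* else: iommufd_backend_alloc_ioas() + a new VTDIOASContainer */

    /* Level 2, inside vtd_device_attach_container(): try each existing
     * stage-2 hwpt; e.g. a non-CC device fails to attach to a CC hwpt
     * and falls through, so a separate stage-2 hwpt is allocated with
     * IOMMU_HWPT_ALLOC_NEST_PARENT. */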
Co-Authored-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
hw/i386/intel_iommu_internal.h | 11 +
include/hw/i386/intel_iommu.h | 24 ++
hw/i386/intel_iommu.c | 581 +++++++++++++++++++++++++++++++--
hw/i386/trace-events | 8 +
4 files changed, 604 insertions(+), 20 deletions(-)
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 5e5583d94a..e76f43bb8f 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -563,6 +563,13 @@ typedef struct VTDRootEntry VTDRootEntry;
#define VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw) (0x1e0ULL | ~VTD_HAW_MASK(aw))
#define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1 0xffffffffffe00000ULL
+typedef enum VTDPASIDOp {
+ VTD_PASID_BIND,
+ VTD_PASID_UPDATE,
+ VTD_PASID_UNBIND,
+ VTD_OP_NUM
+} VTDPASIDOp;
+
typedef enum VTDPCInvType {
/* Force reset all */
VTD_PASID_CACHE_FORCE_RESET = 0,
@@ -578,6 +585,7 @@ typedef struct VTDPASIDCacheInfo {
uint32_t pasid;
PCIBus *bus;
uint16_t devfn;
+ bool error_happened;
} VTDPASIDCacheInfo;
/* PASID Table Related Definitions */
@@ -606,6 +614,9 @@ typedef struct VTDPASIDCacheInfo {
#define VTD_SM_PASID_ENTRY_FLPM 3ULL
#define VTD_SM_PASID_ENTRY_FLPTPTR (~0xfffULL)
+#define VTD_SM_PASID_ENTRY_SRE_BIT(val) (!!((val) & 1ULL))
+#define VTD_SM_PASID_ENTRY_WPE_BIT(val) (!!(((val) >> 4) & 1ULL))
+#define VTD_SM_PASID_ENTRY_EAFE_BIT(val) (!!(((val) >> 7) & 1ULL))
/* First Level Paging Structure */
/* Masks for First Level Paging Entry */
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index fbc9da903a..594281c1d3 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -100,10 +100,32 @@ typedef struct VTDPASIDCacheEntry {
bool cache_filled;
} VTDPASIDCacheEntry;
+typedef struct VTDIOASContainer {
+ struct IOMMUFDBackend *iommufd;
+ uint32_t ioas_id;
+ MemoryListener listener;
+ QLIST_HEAD(, VTDS2Hwpt) s2_hwpt_list;
+ QLIST_ENTRY(VTDIOASContainer) next;
+ Error *error;
+} VTDIOASContainer;
+
+typedef struct VTDS2Hwpt {
+ uint32_t users;
+ uint32_t hwpt_id;
+ VTDIOASContainer *container;
+ QLIST_ENTRY(VTDS2Hwpt) next;
+} VTDS2Hwpt;
+
+typedef struct VTDHwpt {
+ uint32_t hwpt_id;
+ VTDS2Hwpt *s2_hwpt;
+} VTDHwpt;
+
struct VTDAddressSpace {
PCIBus *bus;
uint8_t devfn;
uint32_t pasid;
+ VTDHwpt hwpt;
AddressSpace as;
IOMMUMemoryRegion iommu;
MemoryRegion root; /* The root container of the device */
@@ -303,6 +325,8 @@ struct IntelIOMMUState {
GHashTable *vtd_host_iommu_dev; /* VTDHostIOMMUDevice */
+ QLIST_HEAD(, VTDIOASContainer) containers;
+
/* interrupt remapping */
bool intr_enabled; /* Whether guest enabled IR */
dma_addr_t intr_root; /* Interrupt remapping table pointer */
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 81b0bb978b..7064d19be1 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -20,6 +20,7 @@
*/
#include "qemu/osdep.h"
+#include CONFIG_DEVICES /* CONFIG_IOMMUFD */
#include "qemu/error-report.h"
#include "qemu/main-loop.h"
#include "qapi/error.h"
@@ -40,6 +41,9 @@
#include "migration/vmstate.h"
#include "trace.h"
#include "system/iommufd.h"
+#ifdef CONFIG_IOMMUFD
+#include <linux/iommufd.h>
+#endif
/* context entry operations */
#define VTD_CE_GET_RID2PASID(ce) \
@@ -838,11 +842,40 @@ static inline uint16_t vtd_pe_get_did(VTDPASIDEntry *pe)
return VTD_SM_PASID_ENTRY_DID((pe)->val[1]);
}
+static inline dma_addr_t vtd_pe_get_flpt_base(VTDPASIDEntry *pe)
+{
+ return pe->val[2] & VTD_SM_PASID_ENTRY_FLPTPTR;
+}
+
+static inline uint32_t vtd_pe_get_fl_aw(VTDPASIDEntry *pe)
+{
+ return 48 + ((pe->val[2] >> 2) & VTD_SM_PASID_ENTRY_FLPM) * 9;
+}
+
+static inline bool vtd_pe_pgtt_is_pt(VTDPASIDEntry *pe)
+{
+ return (VTD_PE_GET_TYPE(pe) == VTD_SM_PASID_ENTRY_PT);
+}
+
+/* check if pgtt is first stage translation */
+static inline bool vtd_pe_pgtt_is_flt(VTDPASIDEntry *pe)
+{
+ return (VTD_PE_GET_TYPE(pe) == VTD_SM_PASID_ENTRY_FLT);
+}
+
static inline bool vtd_pdire_present(VTDPASIDDirEntry *pdire)
{
return pdire->val & 1;
}
+static inline void pasid_cache_info_set_error(VTDPASIDCacheInfo *pc_info)
+{
+ if (pc_info->error_happened) {
+ return;
+ }
+ pc_info->error_happened = true;
+}
+
/**
* Caller of this function should check present bit if wants
* to use pdir entry for further usage except for fpd bit check.
@@ -1776,7 +1809,7 @@ static bool vtd_dev_pt_enabled(IntelIOMMUState *s, VTDContextEntry *ce,
*/
return false;
}
- return (VTD_PE_GET_TYPE(&pe) == VTD_SM_PASID_ENTRY_PT);
+ return vtd_pe_pgtt_is_pt(&pe);
}
return (vtd_ce_get_type(ce) == VTD_CONTEXT_TT_PASS_THROUGH);
@@ -2403,6 +2436,497 @@ static void vtd_context_global_invalidate(IntelIOMMUState *s)
vtd_iommu_replay_all(s);
}
+#ifdef CONFIG_IOMMUFD
+static bool iommufd_listener_skipped_section(MemoryRegionSection *section)
+{
+ return !memory_region_is_ram(section->mr) ||
+ memory_region_is_protected(section->mr) ||
+ /*
+ * Sizing an enabled 64-bit BAR can cause spurious mappings to
+ * addresses in the upper part of the 64-bit address space. These
+ * are never accessed by the CPU and beyond the address width of
+ * some IOMMU hardware. TODO: VFIO should tell us the IOMMU width.
+ */
+ section->offset_within_address_space & (1ULL << 63);
+}
+
+static void iommufd_listener_region_add_s2domain(MemoryListener *listener,
+ MemoryRegionSection *section)
+{
+ VTDIOASContainer *container = container_of(listener,
+ VTDIOASContainer, listener);
+ IOMMUFDBackend *iommufd = container->iommufd;
+ uint32_t ioas_id = container->ioas_id;
+ hwaddr iova;
+ Int128 llend, llsize;
+ void *vaddr;
+ Error *err = NULL;
+ int ret;
+
+ if (iommufd_listener_skipped_section(section)) {
+ return;
+ }
+ iova = REAL_HOST_PAGE_ALIGN(section->offset_within_address_space);
+ llend = int128_make64(section->offset_within_address_space);
+ llend = int128_add(llend, section->size);
+ llend = int128_and(llend, int128_exts64(qemu_real_host_page_mask()));
+ llsize = int128_sub(llend, int128_make64(iova));
+ vaddr = memory_region_get_ram_ptr(section->mr) +
+ section->offset_within_region +
+ (iova - section->offset_within_address_space);
+
+ memory_region_ref(section->mr);
+
+ ret = iommufd_backend_map_dma(iommufd, ioas_id, iova, int128_get64(llsize),
+ vaddr, section->readonly);
+ if (!ret) {
+ return;
+ }
+
+ error_setg(&err,
+ "iommufd_listener_region_add_s2domain(%p, 0x%"HWADDR_PRIx", "
+ "0x%"HWADDR_PRIx", %p) = %d (%s)",
+ container, iova, int128_get64(llsize), vaddr, ret,
+ strerror(-ret));
+
+ if (memory_region_is_ram_device(section->mr)) {
+ /* Allow unexpected mappings not to be fatal for RAM devices */
+ error_report_err(err);
+ return;
+ }
+
+ if (!container->error) {
+ error_propagate_prepend(&container->error, err, "Region %s: ",
+ memory_region_name(section->mr));
+ } else {
+ error_free(err);
+ }
+}
+
+static void iommufd_listener_region_del_s2domain(MemoryListener *listener,
+ MemoryRegionSection *section)
+{
+ VTDIOASContainer *container = container_of(listener,
+ VTDIOASContainer, listener);
+ IOMMUFDBackend *iommufd = container->iommufd;
+ uint32_t ioas_id = container->ioas_id;
+ hwaddr iova;
+ Int128 llend, llsize;
+ int ret;
+
+ if (iommufd_listener_skipped_section(section)) {
+ return;
+ }
+ iova = REAL_HOST_PAGE_ALIGN(section->offset_within_address_space);
+ llend = int128_make64(section->offset_within_address_space);
+ llend = int128_add(llend, section->size);
+ llend = int128_and(llend, int128_exts64(qemu_real_host_page_mask()));
+ llsize = int128_sub(llend, int128_make64(iova));
+
+ ret = iommufd_backend_unmap_dma(iommufd, ioas_id,
+ iova, int128_get64(llsize));
+ if (ret) {
+ error_report("iommufd_listener_region_del_s2domain(%p, "
+ "0x%"HWADDR_PRIx", 0x%"HWADDR_PRIx") = %d (%s)",
+ container, iova, int128_get64(llsize), ret,
+ strerror(-ret));
+ }
+
+ memory_region_unref(section->mr);
+}
+
+static const MemoryListener iommufd_s2domain_memory_listener = {
+ .name = "iommufd_s2domain",
+ .priority = 1000,
+ .region_add = iommufd_listener_region_add_s2domain,
+ .region_del = iommufd_listener_region_del_s2domain,
+};
+
+static void vtd_init_s1_hwpt_data(struct iommu_hwpt_vtd_s1 *vtd,
+ VTDPASIDEntry *pe)
+{
+ memset(vtd, 0, sizeof(*vtd));
+
+ vtd->flags = (VTD_SM_PASID_ENTRY_SRE_BIT(pe->val[2]) ?
+ IOMMU_VTD_S1_SRE : 0) |
+ (VTD_SM_PASID_ENTRY_WPE_BIT(pe->val[2]) ?
+ IOMMU_VTD_S1_WPE : 0) |
+ (VTD_SM_PASID_ENTRY_EAFE_BIT(pe->val[2]) ?
+ IOMMU_VTD_S1_EAFE : 0);
+ vtd->addr_width = vtd_pe_get_fl_aw(pe);
+ vtd->pgtbl_addr = (uint64_t)vtd_pe_get_flpt_base(pe);
+}
+
+static int vtd_create_s1_hwpt(HostIOMMUDeviceIOMMUFD *idev,
+ VTDS2Hwpt *s2_hwpt, VTDHwpt *hwpt,
+ VTDPASIDEntry *pe, Error **errp)
+{
+ struct iommu_hwpt_vtd_s1 vtd;
+ uint32_t hwpt_id, s2_hwpt_id = s2_hwpt->hwpt_id;
+
+ vtd_init_s1_hwpt_data(&vtd, pe);
+
+ if (!iommufd_backend_alloc_hwpt(idev->iommufd, idev->devid,
+ s2_hwpt_id, 0, IOMMU_HWPT_DATA_VTD_S1,
+ sizeof(vtd), &vtd, &hwpt_id, errp)) {
+ return -EINVAL;
+ }
+
+ hwpt->hwpt_id = hwpt_id;
+
+ return 0;
+}
+
+static void vtd_destroy_s1_hwpt(HostIOMMUDeviceIOMMUFD *idev, VTDHwpt *hwpt)
+{
+ iommufd_backend_free_id(idev->iommufd, hwpt->hwpt_id);
+}
+
+static VTDS2Hwpt *vtd_ioas_container_get_s2_hwpt(VTDIOASContainer *container,
+ uint32_t hwpt_id)
+{
+ VTDS2Hwpt *s2_hwpt;
+
+ QLIST_FOREACH(s2_hwpt, &container->s2_hwpt_list, next) {
+ if (s2_hwpt->hwpt_id == hwpt_id) {
+ return s2_hwpt;
+ }
+ }
+
+ s2_hwpt = g_malloc0(sizeof(*s2_hwpt));
+
+ s2_hwpt->hwpt_id = hwpt_id;
+ s2_hwpt->container = container;
+ QLIST_INSERT_HEAD(&container->s2_hwpt_list, s2_hwpt, next);
+
+ return s2_hwpt;
+}
+
+static void vtd_ioas_container_put_s2_hwpt(VTDS2Hwpt *s2_hwpt)
+{
+ VTDIOASContainer *container = s2_hwpt->container;
+
+ if (s2_hwpt->users) {
+ return;
+ }
+
+ QLIST_REMOVE(s2_hwpt, next);
+ iommufd_backend_free_id(container->iommufd, s2_hwpt->hwpt_id);
+ g_free(s2_hwpt);
+}
+
+static void vtd_ioas_container_destroy(VTDIOASContainer *container)
+{
+ if (!QLIST_EMPTY(&container->s2_hwpt_list)) {
+ return;
+ }
+
+ QLIST_REMOVE(container, next);
+ memory_listener_unregister(&container->listener);
+ iommufd_backend_free_id(container->iommufd, container->ioas_id);
+ g_free(container);
+}
+
+static int vtd_device_attach_hwpt(VTDHostIOMMUDevice *vtd_hiod,
+ uint32_t pasid, VTDPASIDEntry *pe,
+ VTDS2Hwpt *s2_hwpt, VTDHwpt *hwpt,
+ Error **errp)
+{
+ HostIOMMUDeviceIOMMUFD *idev = HOST_IOMMU_DEVICE_IOMMUFD(vtd_hiod->hiod);
+ int ret;
+
+ if (vtd_pe_pgtt_is_flt(pe)) {
+ ret = vtd_create_s1_hwpt(idev, s2_hwpt, hwpt, pe, errp);
+ if (ret) {
+ return ret;
+ }
+ } else {
+ hwpt->hwpt_id = s2_hwpt->hwpt_id;
+ }
+
+ ret = !host_iommu_device_iommufd_attach_hwpt(idev, hwpt->hwpt_id, errp);
+ trace_vtd_device_attach_hwpt(idev->devid, pasid, hwpt->hwpt_id, ret);
+    if (ret) {
+        error_report("devid %d pasid %d failed to attach hwpt %d",
+                     idev->devid, pasid, hwpt->hwpt_id);
+        if (vtd_pe_pgtt_is_flt(pe)) {
+            vtd_destroy_s1_hwpt(idev, hwpt);
+        }
+        hwpt->hwpt_id = 0;
+        return ret;
+    }
+
+ s2_hwpt->users++;
+ hwpt->s2_hwpt = s2_hwpt;
+
+ return 0;
+}
+
+static void vtd_device_detach_hwpt(VTDHostIOMMUDevice *vtd_hiod,
+ uint32_t pasid, VTDPASIDEntry *pe,
+ VTDHwpt *hwpt, Error **errp)
+{
+ HostIOMMUDeviceIOMMUFD *idev = HOST_IOMMU_DEVICE_IOMMUFD(vtd_hiod->hiod);
+ int ret;
+
+ if (vtd_hiod->iommu_state->dmar_enabled) {
+ ret = !host_iommu_device_iommufd_detach_hwpt(idev, errp);
+ trace_vtd_device_detach_hwpt(idev->devid, pasid, ret);
+ } else {
+ ret = !host_iommu_device_iommufd_attach_hwpt(idev, idev->hwpt_id, errp);
+ trace_vtd_device_reattach_def_hwpt(idev->devid, pasid, idev->hwpt_id,
+ ret);
+ }
+
+ if (ret) {
+ error_report("devid %d pasid %d failed to attach hwpt %d",
+ idev->devid, pasid, hwpt->hwpt_id);
+ }
+
+ if (vtd_pe_pgtt_is_flt(pe)) {
+ vtd_destroy_s1_hwpt(idev, hwpt);
+ }
+
+ hwpt->s2_hwpt->users--;
+ hwpt->s2_hwpt = NULL;
+ hwpt->hwpt_id = 0;
+}
+
+static int vtd_device_attach_container(VTDHostIOMMUDevice *vtd_hiod,
+ VTDIOASContainer *container,
+ uint32_t pasid, VTDPASIDEntry *pe,
+ VTDHwpt *hwpt, Error **errp)
+{
+ HostIOMMUDeviceIOMMUFD *idev = HOST_IOMMU_DEVICE_IOMMUFD(vtd_hiod->hiod);
+ IOMMUFDBackend *iommufd = idev->iommufd;
+ VTDS2Hwpt *s2_hwpt;
+ uint32_t s2_hwpt_id;
+ Error *err = NULL;
+ int ret;
+
+ /* try to attach to an existing hwpt in this container */
+ QLIST_FOREACH(s2_hwpt, &container->s2_hwpt_list, next) {
+ ret = vtd_device_attach_hwpt(vtd_hiod, pasid, pe, s2_hwpt, hwpt, &err);
+ if (ret) {
+ const char *msg = error_get_pretty(err);
+
+ trace_vtd_device_fail_attach_existing_hwpt(msg);
+ error_free(err);
+ err = NULL;
+ } else {
+ goto found_hwpt;
+ }
+ }
+
+ if (!iommufd_backend_alloc_hwpt(iommufd, idev->devid, container->ioas_id,
+ IOMMU_HWPT_ALLOC_NEST_PARENT,
+ IOMMU_HWPT_DATA_NONE, 0, NULL,
+ &s2_hwpt_id, errp)) {
+ return -EINVAL;
+ }
+
+ s2_hwpt = vtd_ioas_container_get_s2_hwpt(container, s2_hwpt_id);
+
+    /* Attach vtd device to a newly allocated hwpt within iommufd */
+ ret = vtd_device_attach_hwpt(vtd_hiod, pasid, pe, s2_hwpt, hwpt, errp);
+ if (ret) {
+ goto err_attach_hwpt;
+ }
+
+found_hwpt:
+ trace_vtd_device_attach_container(iommufd->fd, idev->devid, pasid,
+ container->ioas_id, hwpt->hwpt_id);
+ return 0;
+
+err_attach_hwpt:
+ vtd_ioas_container_put_s2_hwpt(s2_hwpt);
+ return ret;
+}
+
+static void vtd_device_detach_container(VTDHostIOMMUDevice *vtd_hiod,
+ uint32_t pasid, VTDPASIDEntry *pe,
+ VTDHwpt *hwpt, Error **errp)
+{
+ HostIOMMUDeviceIOMMUFD *idev = HOST_IOMMU_DEVICE_IOMMUFD(vtd_hiod->hiod);
+ IOMMUFDBackend *iommufd = idev->iommufd;
+ VTDS2Hwpt *s2_hwpt = hwpt->s2_hwpt;
+
+ trace_vtd_device_detach_container(iommufd->fd, idev->devid, pasid);
+ vtd_device_detach_hwpt(vtd_hiod, pasid, pe, hwpt, errp);
+ vtd_ioas_container_put_s2_hwpt(s2_hwpt);
+}
+
+static int vtd_device_attach_iommufd(VTDHostIOMMUDevice *vtd_hiod,
+ uint32_t pasid, VTDPASIDEntry *pe,
+ VTDHwpt *hwpt, Error **errp)
+{
+ HostIOMMUDeviceIOMMUFD *idev = HOST_IOMMU_DEVICE_IOMMUFD(vtd_hiod->hiod);
+ IOMMUFDBackend *iommufd = idev->iommufd;
+ IntelIOMMUState *s = vtd_hiod->iommu_state;
+ VTDIOASContainer *container;
+ Error *err = NULL;
+ uint32_t ioas_id;
+ int ret;
+
+ /* try to attach to an existing container in this space */
+ QLIST_FOREACH(container, &s->containers, next) {
+ if (container->iommufd != iommufd) {
+ continue;
+ }
+
+ if (vtd_device_attach_container(vtd_hiod, container, pasid, pe, hwpt,
+ &err)) {
+ const char *msg = error_get_pretty(err);
+
+ trace_vtd_device_fail_attach_existing_container(msg);
+ error_free(err);
+ err = NULL;
+ } else {
+ return 0;
+ }
+ }
+
+ /* Need to allocate a new dedicated container */
+ ret = iommufd_backend_alloc_ioas(iommufd, &ioas_id, errp);
+ if (ret < 0) {
+ return ret;
+ }
+
+ trace_vtd_device_alloc_ioas(iommufd->fd, ioas_id);
+
+ container = g_malloc0(sizeof(*container));
+ container->iommufd = iommufd;
+ container->ioas_id = ioas_id;
+ QLIST_INIT(&container->s2_hwpt_list);
+
+ if (vtd_device_attach_container(vtd_hiod, container, pasid, pe, hwpt,
+ errp)) {
+ goto err_attach_container;
+ }
+
+ container->listener = iommufd_s2domain_memory_listener;
+ memory_listener_register(&container->listener, &address_space_memory);
+
+ if (container->error) {
+ ret = -1;
+ error_propagate_prepend(errp, container->error,
+ "memory listener initialization failed: ");
+ goto err_listener_register;
+ }
+
+ QLIST_INSERT_HEAD(&s->containers, container, next);
+
+ return 0;
+
+err_listener_register:
+ vtd_device_detach_container(vtd_hiod, pasid, pe, hwpt, errp);
+err_attach_container:
+ iommufd_backend_free_id(iommufd, container->ioas_id);
+ g_free(container);
+ return ret;
+}
+
+static void vtd_device_detach_iommufd(VTDHostIOMMUDevice *vtd_hiod,
+ uint32_t pasid, VTDPASIDEntry *pe,
+ VTDHwpt *hwpt, Error **errp)
+{
+ VTDIOASContainer *container = hwpt->s2_hwpt->container;
+
+ vtd_device_detach_container(vtd_hiod, pasid, pe, hwpt, errp);
+ vtd_ioas_container_destroy(container);
+}
+
+static int vtd_device_attach_pgtbl(VTDHostIOMMUDevice *vtd_hiod,
+ VTDAddressSpace *vtd_as, VTDPASIDEntry *pe)
+{
+    /*
+     * If pe->pgtt is FLT, go ahead to do bind as host only
+     * accepts guest FLT under nesting. If pe->pgtt is PT, set up
+     * the pasid with the GPA page table. Otherwise return failure.
+     */
+ if (!vtd_pe_pgtt_is_flt(pe) && !vtd_pe_pgtt_is_pt(pe)) {
+ return -EINVAL;
+ }
+
+ /* Should fail if the FLPT base is 0 */
+ if (vtd_pe_pgtt_is_flt(pe) && !vtd_pe_get_flpt_base(pe)) {
+ return -EINVAL;
+ }
+
+ return vtd_device_attach_iommufd(vtd_hiod, vtd_as->pasid, pe,
+ &vtd_as->hwpt, &error_abort);
+}
+
+static int vtd_device_detach_pgtbl(VTDHostIOMMUDevice *vtd_hiod,
+ VTDAddressSpace *vtd_as)
+{
+ VTDPASIDEntry *cached_pe = vtd_as->pasid_cache_entry.cache_filled ?
+ &vtd_as->pasid_cache_entry.pasid_entry : NULL;
+
+ if (!cached_pe ||
+ (!vtd_pe_pgtt_is_flt(cached_pe) && !vtd_pe_pgtt_is_pt(cached_pe))) {
+ return 0;
+ }
+
+ vtd_device_detach_iommufd(vtd_hiod, vtd_as->pasid, cached_pe,
+ &vtd_as->hwpt, &error_abort);
+
+ return 0;
+}
+
+/**
+ * Caller should hold iommu_lock.
+ */
+static int vtd_bind_guest_pasid(VTDAddressSpace *vtd_as,
+ VTDPASIDEntry *pe, VTDPASIDOp op)
+{
+ IntelIOMMUState *s = vtd_as->iommu_state;
+ VTDHostIOMMUDevice *vtd_hiod;
+ int devfn = vtd_as->devfn;
+ int ret = -EINVAL;
+ struct vtd_as_key key = {
+ .bus = vtd_as->bus,
+ .devfn = devfn,
+ };
+
+ vtd_hiod = g_hash_table_lookup(s->vtd_host_iommu_dev, &key);
+ if (!vtd_hiod || !vtd_hiod->hiod) {
+ /* means no need to go further, e.g. for emulated devices */
+ return 0;
+ }
+
+ if (vtd_as->pasid != PCI_NO_PASID) {
+ error_report("Non-rid_pasid %d not supported yet", vtd_as->pasid);
+ return ret;
+ }
+
+ switch (op) {
+ case VTD_PASID_UPDATE:
+ case VTD_PASID_BIND:
+ {
+ ret = vtd_device_attach_pgtbl(vtd_hiod, vtd_as, pe);
+ break;
+ }
+ case VTD_PASID_UNBIND:
+ {
+ ret = vtd_device_detach_pgtbl(vtd_hiod, vtd_as);
+ break;
+ }
+ default:
+        error_report_once("Unknown VTDPASIDOp");
+ break;
+ }
+
+ return ret;
+}
+#else
+static int vtd_bind_guest_pasid(VTDAddressSpace *vtd_as,
+ VTDPASIDEntry *pe, VTDPASIDOp op)
+{
+ return 0;
+}
+#endif
+
/* Do a context-cache device-selective invalidation.
* @func_mask: FM field after shifting
*/
@@ -3145,21 +3669,27 @@ static bool vtd_pasid_entry_compare(VTDPASIDEntry *p1, VTDPASIDEntry *p2)
* This function fills in the pasid entry in &vtd_as. Caller
* of this function should hold iommu_lock.
*/
-static void vtd_fill_pe_in_cache(IntelIOMMUState *s, VTDAddressSpace *vtd_as,
- VTDPASIDEntry *pe)
+static int vtd_fill_pe_in_cache(IntelIOMMUState *s, VTDAddressSpace *vtd_as,
+ VTDPASIDEntry *pe)
{
VTDPASIDCacheEntry *pc_entry = &vtd_as->pasid_cache_entry;
+ int ret;
- if (vtd_pasid_entry_compare(pe, &pc_entry->pasid_entry)) {
- /* No need to go further as cached pasid entry is latest */
- return;
+ if (pc_entry->cache_filled) {
+ if (vtd_pasid_entry_compare(pe, &pc_entry->pasid_entry)) {
+ /* No need to go further as cached pasid entry is latest */
+ return 0;
+ }
+ ret = vtd_bind_guest_pasid(vtd_as, pe, VTD_PASID_UPDATE);
+ } else {
+ ret = vtd_bind_guest_pasid(vtd_as, pe, VTD_PASID_BIND);
}
- pc_entry->pasid_entry = *pe;
- pc_entry->cache_filled = true;
- /*
- * TODO: send pasid bind to host for passthru devices
- */
+ if (!ret) {
+ pc_entry->pasid_entry = *pe;
+ pc_entry->cache_filled = true;
+ }
+ return ret;
}
/*
@@ -3225,14 +3755,20 @@ static gboolean vtd_flush_pasid(gpointer key, gpointer value,
goto remove;
}
- vtd_fill_pe_in_cache(s, vtd_as, &pe);
+ if (vtd_fill_pe_in_cache(s, vtd_as, &pe)) {
+ pasid_cache_info_set_error(pc_info);
+ }
return false;
remove:
- /*
- * TODO: send pasid unbind to host for passthru devices
- */
- pc_entry->cache_filled = false;
+ if (pc_entry->cache_filled) {
+ if (vtd_bind_guest_pasid(vtd_as, NULL, VTD_PASID_UNBIND)) {
+ pasid_cache_info_set_error(pc_info);
+ return false;
+ } else {
+ pc_entry->cache_filled = false;
+ }
+ }
/*
* Don't remove address space of PCI_NO_PASID which is created by PCI
@@ -3247,7 +3783,7 @@ remove:
/* Caller of this function should hold iommu_lock */
static void vtd_pasid_cache_reset_locked(IntelIOMMUState *s)
{
- VTDPASIDCacheInfo pc_info;
+ VTDPASIDCacheInfo pc_info = { .error_happened = false, };
trace_vtd_pasid_cache_reset();
@@ -3308,7 +3844,9 @@ static void vtd_sm_pasid_table_walk_one(IntelIOMMUState *s,
pasid = pasid_next;
continue;
}
- vtd_fill_pe_in_cache(s, vtd_as, &pe);
+ if (vtd_fill_pe_in_cache(s, vtd_as, &pe)) {
+ pasid_cache_info_set_error(info);
+ }
}
pasid = pasid_next;
}
@@ -3416,6 +3954,9 @@ static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
walk_info.devfn = vtd_hiod->devfn;
vtd_replay_pasid_bind_for_dev(s, start, end, &walk_info);
}
+ if (walk_info.error_happened) {
+ pasid_cache_info_set_error(pc_info);
+ }
}
/*
@@ -3488,9 +4029,9 @@ static void vtd_pasid_cache_sync(IntelIOMMUState *s,
static bool vtd_process_pasid_desc(IntelIOMMUState *s,
VTDInvDesc *inv_desc)
{
+ VTDPASIDCacheInfo pc_info = { .error_happened = false, };
uint16_t domain_id;
uint32_t pasid;
- VTDPASIDCacheInfo pc_info;
uint64_t mask[4] = {VTD_INV_DESC_PASIDC_RSVD_VAL0, VTD_INV_DESC_ALL_ONE,
VTD_INV_DESC_ALL_ONE, VTD_INV_DESC_ALL_ONE};
@@ -3529,7 +4070,7 @@ static bool vtd_process_pasid_desc(IntelIOMMUState *s,
}
vtd_pasid_cache_sync(s, &pc_info);
- return true;
+    return !pc_info.error_happened;
}
static bool vtd_process_inv_iec_desc(IntelIOMMUState *s,
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index c8a936eb46..de903a0033 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -73,6 +73,14 @@ vtd_warn_invalid_qi_tail(uint16_t tail) "tail 0x%"PRIx16
vtd_warn_ir_vector(uint16_t sid, int index, int vec, int target) "sid 0x%"PRIx16" index %d vec %d (should be: %d)"
vtd_warn_ir_trigger(uint16_t sid, int index, int trig, int target) "sid 0x%"PRIx16" index %d trigger %d (should be: %d)"
vtd_reset_exit(void) ""
+vtd_device_attach_hwpt(uint32_t dev_id, uint32_t pasid, uint32_t hwpt_id, int ret) "dev_id %d pasid %d hwpt_id %d, ret: %d"
+vtd_device_detach_hwpt(uint32_t dev_id, uint32_t pasid, int ret) "dev_id %d pasid %d ret: %d"
+vtd_device_reattach_def_hwpt(uint32_t dev_id, uint32_t pasid, uint32_t hwpt_id, int ret) "dev_id %d pasid %d hwpt_id %d, ret: %d"
+vtd_device_fail_attach_existing_hwpt(const char *msg) " %s"
+vtd_device_attach_container(int fd, uint32_t dev_id, uint32_t pasid, uint32_t ioas_id, uint32_t hwpt_id) "iommufd %d dev_id %d pasid %d ioas_id %d hwpt_id %d"
+vtd_device_detach_container(int fd, uint32_t dev_id, uint32_t pasid) "iommufd %d dev_id %d pasid %d"
+vtd_device_fail_attach_existing_container(const char *msg) " %s"
+vtd_device_alloc_ioas(int fd, uint32_t ioas_id) "iommufd %d ioas_id %d"
# amd_iommu.c
amdvi_evntlog_fail(uint64_t addr, uint32_t head) "error: fail to write at addr 0x%"PRIx64" + offset 0x%"PRIx32
--
2.34.1
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH v1 10/15] intel_iommu: ERRATA_772415 workaround
2025-06-06 10:04 [PATCH v1 00/15] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
` (8 preceding siblings ...)
2025-06-06 10:04 ` [PATCH v1 09/15] intel_iommu: Bind/unbind guest page table to host Zhenzhong Duan
@ 2025-06-06 10:04 ` Zhenzhong Duan
2025-06-06 10:04 ` [PATCH v1 11/15] intel_iommu: Replay pasid binds after context cache invalidation Zhenzhong Duan
` (4 subsequent siblings)
14 siblings, 0 replies; 36+ messages in thread
From: Zhenzhong Duan @ 2025-06-06 10:04 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan, Marcel Apfelbaum, Paolo Bonzini,
Richard Henderson, Eduardo Habkost
On a system affected by ERRATA_772415, IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17
is reported by IOMMU_DEVICE_GET_HW_INFO. Due to this errata, even a readonly
range mapped in the stage-2 page table can still be written.
Reference from 4th Gen Intel Xeon Processor Scalable Family Specification
Update, Errata Details, SPR17.
[0] https://edc.intel.com/content/www/us/en/design/products-and-solutions/processors-and-chipsets/eagle-stream/sapphire-rapids-specification-update
We utilize the newly added IOMMUFD container/ioas/hwpt management framework
in VT-d. Add a check to create a new VTDIOASContainer holding only RW
mappings, so that this VTDIOASContainer can be used as the backend for
devices with ERRATA_772415. See below diagram for details:
IntelIOMMUState
|
V
.------------------. .------------------. .-------------------.
| VTDIOASContainer |--->| VTDIOASContainer |--->| VTDIOASContainer |-->...
| (iommufd0,RW&RO) | | (iommufd1,RW&RO) | | (iommufd0,only RW)|
.------------------. .------------------. .-------------------.
| | |
| .-->... |
V V
.-------------------. .-------------------. .---------------.
| VTDS2Hwpt(CC) |--->| VTDS2Hwpt(non-CC) |-->... | VTDS2Hwpt(CC) |-->...
.-------------------. .-------------------. .---------------.
| | | |
| | | |
.-----------. .-----------. .------------. .------------.
| IOMMUFD | | IOMMUFD | | IOMMUFD | | IOMMUFD |
| Device(CC)| | Device(CC)| | Device | | Device(CC) |
| (iommufd0)| | (iommufd0)| | (non-CC) | | (errata) |
| | | | | (iommufd0) | | (iommufd0) |
.-----------. .-----------. .------------. .------------.
Change to pass the VTDHostIOMMUDevice pointer to vtd_check_hiod() so the
errata flag can be saved.
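Condensed from the diff below, the errata flag flows through three
points:

    /* probe: vtd_check_hiod() saves the flag from GET_HW_INFO */
    vtd_hiod->errata = vtd->flags & IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17;

    /* attach: the flag is part of the container match key */
    if (container->iommufd != iommufd ||
        container->errata != vtd_hiod->errata) {
        continue;
    }

    /* map: the listener skips read-only sections for errata containers */
    ... || (container->errata && section->readonly);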
Suggested-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
hw/i386/intel_iommu_internal.h | 1 +
include/hw/i386/intel_iommu.h | 1 +
hw/i386/intel_iommu.c | 22 +++++++++++++++-------
3 files changed, 17 insertions(+), 7 deletions(-)
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index e76f43bb8f..75d840f9fe 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -654,5 +654,6 @@ typedef struct VTDHostIOMMUDevice {
PCIBus *bus;
uint8_t devfn;
HostIOMMUDevice *hiod;
+ uint32_t errata;
} VTDHostIOMMUDevice;
#endif
diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
index 594281c1d3..9b156dc32e 100644
--- a/include/hw/i386/intel_iommu.h
+++ b/include/hw/i386/intel_iommu.h
@@ -103,6 +103,7 @@ typedef struct VTDPASIDCacheEntry {
typedef struct VTDIOASContainer {
struct IOMMUFDBackend *iommufd;
uint32_t ioas_id;
+ uint32_t errata;
MemoryListener listener;
QLIST_HEAD(, VTDS2Hwpt) s2_hwpt_list;
QLIST_ENTRY(VTDIOASContainer) next;
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 7064d19be1..b8c0d0dd2e 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -2437,7 +2437,8 @@ static void vtd_context_global_invalidate(IntelIOMMUState *s)
}
#ifdef CONFIG_IOMMUFD
-static bool iommufd_listener_skipped_section(MemoryRegionSection *section)
+static bool iommufd_listener_skipped_section(VTDIOASContainer *container,
+ MemoryRegionSection *section)
{
return !memory_region_is_ram(section->mr) ||
memory_region_is_protected(section->mr) ||
@@ -2447,7 +2448,8 @@ static bool iommufd_listener_skipped_section(MemoryRegionSection *section)
* are never accessed by the CPU and beyond the address width of
* some IOMMU hardware. TODO: VFIO should tell us the IOMMU width.
*/
- section->offset_within_address_space & (1ULL << 63);
+ section->offset_within_address_space & (1ULL << 63) ||
+ (container->errata && section->readonly);
}
static void iommufd_listener_region_add_s2domain(MemoryListener *listener,
@@ -2463,7 +2465,7 @@ static void iommufd_listener_region_add_s2domain(MemoryListener *listener,
Error *err = NULL;
int ret;
- if (iommufd_listener_skipped_section(section)) {
+ if (iommufd_listener_skipped_section(container, section)) {
return;
}
iova = REAL_HOST_PAGE_ALIGN(section->offset_within_address_space);
@@ -2514,7 +2516,7 @@ static void iommufd_listener_region_del_s2domain(MemoryListener *listener,
Int128 llend, llsize;
int ret;
- if (iommufd_listener_skipped_section(section)) {
+ if (iommufd_listener_skipped_section(container, section)) {
return;
}
iova = REAL_HOST_PAGE_ALIGN(section->offset_within_address_space);
@@ -2770,7 +2772,8 @@ static int vtd_device_attach_iommufd(VTDHostIOMMUDevice *vtd_hiod,
/* try to attach to an existing container in this space */
QLIST_FOREACH(container, &s->containers, next) {
- if (container->iommufd != iommufd) {
+ if (container->iommufd != iommufd ||
+ container->errata != vtd_hiod->errata) {
continue;
}
@@ -2797,6 +2800,7 @@ static int vtd_device_attach_iommufd(VTDHostIOMMUDevice *vtd_hiod,
container = g_malloc0(sizeof(*container));
container->iommufd = iommufd;
container->ioas_id = ioas_id;
+ container->errata = vtd_hiod->errata;
QLIST_INIT(&container->s2_hwpt_list);
if (vtd_device_attach_container(vtd_hiod, container, pasid, pe, hwpt,
@@ -5355,9 +5359,10 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus,
return vtd_dev_as;
}
-static bool vtd_check_hiod(IntelIOMMUState *s, HostIOMMUDevice *hiod,
+static bool vtd_check_hiod(IntelIOMMUState *s, VTDHostIOMMUDevice *vtd_hiod,
Error **errp)
{
+ HostIOMMUDevice *hiod = vtd_hiod->hiod;
HostIOMMUDeviceClass *hiodc = HOST_IOMMU_DEVICE_GET_CLASS(hiod);
int ret;
@@ -5406,6 +5411,8 @@ static bool vtd_check_hiod(IntelIOMMUState *s, HostIOMMUDevice *hiod,
error_setg(errp, "Stage-1 1GB huge page is unsupported by host IOMMU");
return false;
}
+
+ vtd_hiod->errata = vtd->flags & IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17;
#endif
error_setg(errp, "host device is uncompatible with stage-1 translation");
@@ -5439,7 +5446,8 @@ static bool vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
vtd_hiod->iommu_state = s;
vtd_hiod->hiod = hiod;
- if (!vtd_check_hiod(s, hiod, errp)) {
+ if (!vtd_check_hiod(s, vtd_hiod, errp)) {
+ g_free(vtd_hiod);
vtd_iommu_unlock(s);
return false;
}
--
2.34.1
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH v1 11/15] intel_iommu: Replay pasid binds after context cache invalidation
2025-06-06 10:04 [PATCH v1 00/15] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
` (9 preceding siblings ...)
2025-06-06 10:04 ` [PATCH v1 10/15] intel_iommu: ERRATA_772415 workaround Zhenzhong Duan
@ 2025-06-06 10:04 ` Zhenzhong Duan
2025-06-06 10:04 ` [PATCH v1 12/15] intel_iommu: Propagate PASID-based iotlb invalidation to host Zhenzhong Duan
` (3 subsequent siblings)
14 siblings, 0 replies; 36+ messages in thread
From: Zhenzhong Duan @ 2025-06-06 10:04 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng, Yi Sun,
Zhenzhong Duan, Paolo Bonzini, Richard Henderson, Eduardo Habkost,
Marcel Apfelbaum
From: Yi Liu <yi.l.liu@intel.com>
This replays guest pasid attachments after a context cache invalidation.
This is done to ensure safety; strictly speaking, the programmer should
issue a pasid cache invalidation with the proper granularity after
issuing a context cache invalidation.
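Condensed from the diff below, a device-selective context cache
invalidation now also kicks a device-selective pasid cache sync:

    pc_info.type  = VTD_PASID_CACHE_DEVSI;
    pc_info.bus   = bus;
    pc_info.devfn = devfn;
    vtd_pasid_cache_sync(s, &pc_info);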
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
hw/i386/intel_iommu_internal.h | 1 +
hw/i386/intel_iommu.c | 51 ++++++++++++++++++++++++++++++++--
hw/i386/trace-events | 1 +
3 files changed, 51 insertions(+), 2 deletions(-)
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 75d840f9fe..198726b48f 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -575,6 +575,7 @@ typedef enum VTDPCInvType {
VTD_PASID_CACHE_FORCE_RESET = 0,
/* pasid cache invalidation rely on guest PASID entry */
VTD_PASID_CACHE_GLOBAL_INV, /* pasid cache global invalidation */
+ VTD_PASID_CACHE_DEVSI, /* pasid cache device selective invalidation */
VTD_PASID_CACHE_DOMSI, /* pasid cache domain selective invalidation */
VTD_PASID_CACHE_PASIDSI, /* pasid cache pasid selective invalidation */
} VTDPCInvType;
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index b8c0d0dd2e..275b8bafef 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -91,6 +91,10 @@ static void vtd_address_space_refresh_all(IntelIOMMUState *s);
static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
static void vtd_pasid_cache_reset_locked(IntelIOMMUState *s);
+static void vtd_pasid_cache_sync(IntelIOMMUState *s,
+ VTDPASIDCacheInfo *pc_info);
+static void vtd_pasid_cache_devsi(IntelIOMMUState *s,
+ PCIBus *bus, uint16_t devfn);
static void vtd_panic_require_caching_mode(void)
{
@@ -2417,6 +2421,8 @@ static void vtd_iommu_replay_all(IntelIOMMUState *s)
static void vtd_context_global_invalidate(IntelIOMMUState *s)
{
+ VTDPASIDCacheInfo pc_info = { .error_happened = false, };
+
trace_vtd_inv_desc_cc_global();
/* Protects context cache */
vtd_iommu_lock(s);
@@ -2434,6 +2440,9 @@ static void vtd_context_global_invalidate(IntelIOMMUState *s)
* VT-d emulation codes.
*/
vtd_iommu_replay_all(s);
+
+ pc_info.type = VTD_PASID_CACHE_GLOBAL_INV;
+ vtd_pasid_cache_sync(s, &pc_info);
}
#ifdef CONFIG_IOMMUFD
@@ -2989,6 +2998,21 @@ static void vtd_context_device_invalidate(IntelIOMMUState *s,
* happened.
*/
vtd_address_space_sync(vtd_as);
+            /*
+             * Per spec, a context flush should also be followed by a PASID
+             * cache and iotlb flush. With regard to a device selective
+             * context cache invalidation:
+             * if (emulated_device)
+             *    invalidate pasid cache and pasid-based iotlb
+             * else if (assigned_device)
+             *    check if the device has been bound to any pasid
+             *    invoke pasid_unbind for each bound pasid
+             * Here, we have vtd_pasid_cache_devsi() to invalidate pasid
+             * caches, while for piotlb in QEMU, we don't have it yet, so
+             * no handling. For assigned devices, the host iommu driver
+             * flushes the piotlb when a pasid unbind is passed down to it.
+             */
+ vtd_pasid_cache_devsi(s, vtd_as->bus, devfn);
}
}
}
@@ -3737,6 +3761,11 @@ static gboolean vtd_flush_pasid(gpointer key, gpointer value,
/* Fall through */
case VTD_PASID_CACHE_GLOBAL_INV:
break;
+ case VTD_PASID_CACHE_DEVSI:
+ if (pc_info->bus != vtd_as->bus || pc_info->devfn != vtd_as->devfn) {
+ return false;
+ }
+ break;
default:
error_report("invalid pc_info->type");
abort();
@@ -3933,6 +3962,11 @@ static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
case VTD_PASID_CACHE_GLOBAL_INV:
/* loop all assigned devices */
break;
+ case VTD_PASID_CACHE_DEVSI:
+ walk_info.bus = pc_info->bus;
+ walk_info.devfn = pc_info->devfn;
+ vtd_replay_pasid_bind_for_dev(s, start, end, &walk_info);
+ return;
case VTD_PASID_CACHE_FORCE_RESET:
/* For force reset, no need to go further replay */
return;
@@ -3968,8 +4002,7 @@ static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
* It includes updating the pasid cache in vIOMMU and updating the
* pasid bindings per guest's latest pasid entry presence.
*/
-static void vtd_pasid_cache_sync(IntelIOMMUState *s,
- VTDPASIDCacheInfo *pc_info)
+static void vtd_pasid_cache_sync(IntelIOMMUState *s, VTDPASIDCacheInfo *pc_info)
{
if (!s->flts || !s->root_scalable || !s->dmar_enabled) {
return;
@@ -4030,6 +4063,20 @@ static void vtd_pasid_cache_sync(IntelIOMMUState *s,
vtd_replay_guest_pasid_bindings(s, pc_info);
}
+static void vtd_pasid_cache_devsi(IntelIOMMUState *s,
+ PCIBus *bus, uint16_t devfn)
+{
+ VTDPASIDCacheInfo pc_info = { .error_happened = false, };
+
+ trace_vtd_pasid_cache_devsi(devfn);
+
+ pc_info.type = VTD_PASID_CACHE_DEVSI;
+ pc_info.bus = bus;
+ pc_info.devfn = devfn;
+
+ vtd_pasid_cache_sync(s, &pc_info);
+}
+
static bool vtd_process_pasid_desc(IntelIOMMUState *s,
VTDInvDesc *inv_desc)
{
diff --git a/hw/i386/trace-events b/hw/i386/trace-events
index de903a0033..f001b820d9 100644
--- a/hw/i386/trace-events
+++ b/hw/i386/trace-events
@@ -28,6 +28,7 @@ vtd_pasid_cache_reset(void) ""
vtd_pasid_cache_gsi(void) ""
vtd_pasid_cache_dsi(uint16_t domain) "Domain selective PC invalidation domain 0x%"PRIx16
vtd_pasid_cache_psi(uint16_t domain, uint32_t pasid) "PASID selective PC invalidation domain 0x%"PRIx16" pasid 0x%"PRIx32
+vtd_pasid_cache_devsi(uint16_t devfn) "Dev selective PC invalidation dev: 0x%"PRIx16
vtd_re_not_present(uint8_t bus) "Root entry bus %"PRIu8" not present"
vtd_ce_not_present(uint8_t bus, uint8_t devfn) "Context entry bus %"PRIu8" devfn %"PRIu8" not present"
vtd_iotlb_page_hit(uint16_t sid, uint64_t addr, uint64_t slpte, uint16_t domain) "IOTLB page hit sid 0x%"PRIx16" iova 0x%"PRIx64" slpte 0x%"PRIx64" domain 0x%"PRIx16
--
2.34.1
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH v1 12/15] intel_iommu: Propagate PASID-based iotlb invalidation to host
2025-06-06 10:04 [PATCH v1 00/15] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
` (10 preceding siblings ...)
2025-06-06 10:04 ` [PATCH v1 11/15] intel_iommu: Replay pasid binds after context cache invalidation Zhenzhong Duan
@ 2025-06-06 10:04 ` Zhenzhong Duan
2025-06-06 10:04 ` [PATCH v1 13/15] intel_iommu: Refresh pasid bind when either SRTP or TE bit is changed Zhenzhong Duan
` (2 subsequent siblings)
14 siblings, 0 replies; 36+ messages in thread
From: Zhenzhong Duan @ 2025-06-06 10:04 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng, Yi Sun,
Zhenzhong Duan, Marcel Apfelbaum, Paolo Bonzini,
Richard Henderson, Eduardo Habkost
From: Yi Liu <yi.l.liu@intel.com>
This traps guest PASID-based iotlb invalidation requests and propagates
them to the host.
Intel VT-d 3.0 supports nested translation at PASID granularity. Guest SVA
support can be implemented by configuring nested translation on a specific
PASID. This is also known as dual stage DMA translation.
Under such a configuration, the guest owns the GVA->GPA translation, which
is configured as the stage-1 page table on the host side for a specific
pasid, and the host owns the GPA->HPA translation. As the guest owns the
stage-1 translation table, piotlb invalidations should be propagated to
the host since the host IOMMU will cache first level page table related
mappings during DMA address translation.
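For a page-selective request, the propagation amounts to repackaging
the descriptor fields into the iommufd invalidation structure and
forwarding it for each affected vtd_as; condensed from
vtd_flush_pasid_iotlb_all() in the diff below:

    struct iommu_hwpt_vtd_s1_invalidate cache_info = { 0 };

    cache_info.addr   = addr;               /* invalidation address */
    cache_info.npages = npages;             /* 1ULL << am, or -1 for all */
    cache_info.flags  = ih ? IOMMU_VTD_INV_FLAGS_LEAF : 0;

    /* then, for each vtd_as whose cached pasid entry matches
     * (domain_id, pasid): iommufd_backend_invalidate_cache() */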
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
hw/i386/intel_iommu_internal.h | 6 ++
hw/i386/intel_iommu.c | 118 ++++++++++++++++++++++++++++++++-
2 files changed, 122 insertions(+), 2 deletions(-)
diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
index 198726b48f..e4552ff9bd 100644
--- a/hw/i386/intel_iommu_internal.h
+++ b/hw/i386/intel_iommu_internal.h
@@ -589,6 +589,12 @@ typedef struct VTDPASIDCacheInfo {
bool error_happened;
} VTDPASIDCacheInfo;
+typedef struct VTDPIOTLBInvInfo {
+ uint16_t domain_id;
+ uint32_t pasid;
+ struct iommu_hwpt_vtd_s1_invalidate *inv_data;
+} VTDPIOTLBInvInfo;
+
/* PASID Table Related Definitions */
#define VTD_PASID_DIR_BASE_ADDR_MASK (~0xfffULL)
#define VTD_PASID_TABLE_BASE_ADDR_MASK (~0xfffULL)
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 275b8bafef..79e1cda364 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -2932,12 +2932,110 @@ static int vtd_bind_guest_pasid(VTDAddressSpace *vtd_as,
return ret;
}
+
+/*
+ * Caller of this function should hold iommu_lock.
+ */
+static void vtd_invalidate_piotlb(VTDAddressSpace *vtd_as,
+ struct iommu_hwpt_vtd_s1_invalidate *cache)
+{
+ VTDHostIOMMUDevice *vtd_hiod;
+ HostIOMMUDeviceIOMMUFD *idev;
+ VTDHwpt *hwpt = &vtd_as->hwpt;
+ int devfn = vtd_as->devfn;
+ struct vtd_as_key key = {
+ .bus = vtd_as->bus,
+ .devfn = devfn,
+ };
+ IntelIOMMUState *s = vtd_as->iommu_state;
+ uint32_t entry_num = 1; /* Only implement one request for simplicity */
+    Error *err = NULL;
+
+ if (!hwpt) {
+ return;
+ }
+
+ vtd_hiod = g_hash_table_lookup(s->vtd_host_iommu_dev, &key);
+ if (!vtd_hiod || !vtd_hiod->hiod) {
+ return;
+ }
+ idev = HOST_IOMMU_DEVICE_IOMMUFD(vtd_hiod->hiod);
+
+ if (!iommufd_backend_invalidate_cache(idev->iommufd, hwpt->hwpt_id,
+ IOMMU_HWPT_INVALIDATE_DATA_VTD_S1,
+ sizeof(*cache), &entry_num, cache,
+ &err)) {
+ error_report_err(err);
+ }
+}
+
+/*
+ * This function is a loop function for the s->vtd_address_spaces
+ * list with VTDPIOTLBInvInfo as execution filter. It propagates
+ * the piotlb invalidation to host. Caller of this function
+ * should hold iommu_lock.
+ */
+static void vtd_flush_pasid_iotlb(gpointer key, gpointer value,
+ gpointer user_data)
+{
+ VTDPIOTLBInvInfo *piotlb_info = user_data;
+ VTDAddressSpace *vtd_as = value;
+ VTDPASIDCacheEntry *pc_entry = &vtd_as->pasid_cache_entry;
+ uint32_t pasid;
+ uint16_t did;
+
+    /* Replay only fills the pasid entry cache for passthrough devices */
+ if (!pc_entry->cache_filled ||
+ !vtd_pe_pgtt_is_flt(&pc_entry->pasid_entry)) {
+ return;
+ }
+
+ if (vtd_as_to_iommu_pasid_locked(vtd_as, &pasid)) {
+ return;
+ }
+
+ did = vtd_pe_get_did(&pc_entry->pasid_entry);
+
+ if (piotlb_info->domain_id == did && piotlb_info->pasid == pasid) {
+ vtd_invalidate_piotlb(vtd_as, piotlb_info->inv_data);
+ }
+}
+
+static void vtd_flush_pasid_iotlb_all(IntelIOMMUState *s,
+ uint16_t domain_id, uint32_t pasid,
+ hwaddr addr, uint64_t npages, bool ih)
+{
+ struct iommu_hwpt_vtd_s1_invalidate cache_info = { 0 };
+ VTDPIOTLBInvInfo piotlb_info;
+
+ cache_info.addr = addr;
+ cache_info.npages = npages;
+ cache_info.flags = ih ? IOMMU_VTD_INV_FLAGS_LEAF : 0;
+
+ piotlb_info.domain_id = domain_id;
+ piotlb_info.pasid = pasid;
+ piotlb_info.inv_data = &cache_info;
+
+    /*
+     * This loops over all the vtd_as instances in s->vtd_address_spaces
+     * to find the affected devices, since a piotlb invalidation
+     * should check the pasid cache from an architectural point of view.
+     */
+ g_hash_table_foreach(s->vtd_address_spaces,
+ vtd_flush_pasid_iotlb, &piotlb_info);
+}
#else
static int vtd_bind_guest_pasid(VTDAddressSpace *vtd_as,
VTDPASIDEntry *pe, VTDPASIDOp op)
{
return 0;
}
+
+static void vtd_flush_pasid_iotlb_all(IntelIOMMUState *s,
+ uint16_t domain_id, uint32_t pasid,
+ hwaddr addr, uint64_t npages, bool ih)
+{
+}
#endif
/* Do a context-cache device-selective invalidation.
@@ -3591,6 +3689,13 @@ static void vtd_piotlb_pasid_invalidate(IntelIOMMUState *s,
info.pasid = pasid;
vtd_iommu_lock(s);
+    /*
+     * This loops over all the vtd_as instances in s->vtd_address_spaces
+     * to find the affected devices, since a piotlb invalidation
+     * should check the pasid cache from an architectural point of view.
+     */
+ vtd_flush_pasid_iotlb_all(s, domain_id, pasid, 0, (uint64_t)-1, 0);
+
g_hash_table_foreach_remove(s->iotlb, vtd_hash_remove_by_pasid,
&info);
vtd_iommu_unlock(s);
@@ -3613,7 +3718,8 @@ static void vtd_piotlb_pasid_invalidate(IntelIOMMUState *s,
}
static void vtd_piotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
- uint32_t pasid, hwaddr addr, uint8_t am)
+ uint32_t pasid, hwaddr addr, uint8_t am,
+ bool ih)
{
VTDIOTLBPageInvInfo info;
@@ -3623,6 +3729,13 @@ static void vtd_piotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
info.mask = ~((1 << am) - 1);
vtd_iommu_lock(s);
+    /*
+     * This loops over all the vtd_as instances in s->vtd_address_spaces
+     * to find the affected devices, since a piotlb invalidation
+     * should check the pasid cache from an architectural point of view.
+     */
+    vtd_flush_pasid_iotlb_all(s, domain_id, pasid, addr, 1ULL << am, ih);
+
g_hash_table_foreach_remove(s->iotlb,
vtd_hash_remove_by_page_piotlb, &info);
vtd_iommu_unlock(s);
@@ -3656,7 +3769,8 @@ static bool vtd_process_piotlb_desc(IntelIOMMUState *s,
case VTD_INV_DESC_PIOTLB_PSI_IN_PASID:
am = VTD_INV_DESC_PIOTLB_AM(inv_desc->val[1]);
addr = (hwaddr) VTD_INV_DESC_PIOTLB_ADDR(inv_desc->val[1]);
- vtd_piotlb_page_invalidate(s, domain_id, pasid, addr, am);
+ vtd_piotlb_page_invalidate(s, domain_id, pasid, addr, am,
+ VTD_INV_DESC_PIOTLB_IH(inv_desc->val[1]));
break;
default:
--
2.34.1
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH v1 13/15] intel_iommu: Refresh pasid bind when either SRTP or TE bit is changed
2025-06-06 10:04 [PATCH v1 00/15] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
` (11 preceding siblings ...)
2025-06-06 10:04 ` [PATCH v1 12/15] intel_iommu: Propagate PASID-based iotlb invalidation to host Zhenzhong Duan
@ 2025-06-06 10:04 ` Zhenzhong Duan
2025-06-06 10:04 ` [PATCH v1 14/15] intel_iommu: Bypass replay in stage-1 page table mode Zhenzhong Duan
2025-06-06 10:04 ` [PATCH v1 15/15] intel_iommu: Enable host device when x-flts=on in scalable mode Zhenzhong Duan
14 siblings, 0 replies; 36+ messages in thread
From: Zhenzhong Duan @ 2025-06-06 10:04 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan, Paolo Bonzini, Richard Henderson, Eduardo Habkost,
Marcel Apfelbaum
From: Yi Liu <yi.l.liu@intel.com>
When either the 'Set Root Table Pointer' or the 'Translation Enable' bit is
changed, the pasid bindings on the host side become stale and need to be
updated.
Introduce a helper function vtd_refresh_pasid_bind() for that purpose.
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
hw/i386/intel_iommu.c | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 79e1cda364..ec8b0ff13a 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -89,6 +89,7 @@ struct vtd_iotlb_key {
static void vtd_address_space_refresh_all(IntelIOMMUState *s);
static void vtd_address_space_unmap(VTDAddressSpace *as, IOMMUNotifier *n);
+static void vtd_refresh_pasid_bind(IntelIOMMUState *s);
static void vtd_pasid_cache_reset_locked(IntelIOMMUState *s);
static void vtd_pasid_cache_sync(IntelIOMMUState *s,
@@ -3362,6 +3363,7 @@ static void vtd_handle_gcmd_srtp(IntelIOMMUState *s)
vtd_set_clear_mask_long(s, DMAR_GSTS_REG, 0, VTD_GSTS_RTPS);
vtd_reset_caches(s);
vtd_address_space_refresh_all(s);
+ vtd_refresh_pasid_bind(s);
}
/* Set Interrupt Remap Table Pointer */
@@ -3396,6 +3398,7 @@ static void vtd_handle_gcmd_te(IntelIOMMUState *s, bool en)
vtd_reset_caches(s);
vtd_address_space_refresh_all(s);
+ vtd_refresh_pasid_bind(s);
}
/* Handle Interrupt Remap Enable/Disable */
@@ -4111,6 +4114,26 @@ static void vtd_replay_guest_pasid_bindings(IntelIOMMUState *s,
}
}
+static void vtd_refresh_pasid_bind(IntelIOMMUState *s)
+{
+ VTDPASIDCacheInfo pc_info = { .error_happened = false,
+ .type = VTD_PASID_CACHE_GLOBAL_INV };
+
+    /*
+     * Pasid bindings should be replayed only when dmar is enabled;
+     * otherwise there is no need to replay.
+     */
+ if (!s->dmar_enabled) {
+ return;
+ }
+
+ if (!s->flts || !s->root_scalable) {
+ return;
+ }
+
+ vtd_replay_guest_pasid_bindings(s, &pc_info);
+}
+
/*
* This function syncs the pasid bindings between guest and host.
* It includes updating the pasid cache in vIOMMU and updating the
--
2.34.1
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH v1 14/15] intel_iommu: Bypass replay in stage-1 page table mode
2025-06-06 10:04 [PATCH v1 00/15] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
` (12 preceding siblings ...)
2025-06-06 10:04 ` [PATCH v1 13/15] intel_iommu: Refresh pasid bind when either SRTP or TE bit is changed Zhenzhong Duan
@ 2025-06-06 10:04 ` Zhenzhong Duan
2025-06-06 10:04 ` [PATCH v1 15/15] intel_iommu: Enable host device when x-flts=on in scalable mode Zhenzhong Duan
14 siblings, 0 replies; 36+ messages in thread
From: Zhenzhong Duan @ 2025-06-06 10:04 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan, Paolo Bonzini, Richard Henderson, Eduardo Habkost,
Marcel Apfelbaum
VFIO utilizes replay to set up the initial shadow iommu mappings.
But when a stage-1 page table is configured, it is passed to the host to
construct a nested page table, so no replay is needed.
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
hw/i386/intel_iommu.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index ec8b0ff13a..165998896c 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -5759,6 +5759,14 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
VTDContextEntry ce;
DMAMap map = { .iova = 0, .size = HWADDR_MAX };
+    /*
+     * Replay on a stage-1 page table is meaningless as the stage-1 page
+     * table is passed through to the host to construct a nested page table.
+     */
+ if (s->flts && s->root_scalable) {
+ return;
+ }
+
/* replay is protected by BQL, page walk will re-setup it safely */
iova_tree_remove(vtd_as->iova_tree, map);
--
2.34.1
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH v1 15/15] intel_iommu: Enable host device when x-flts=on in scalable mode
2025-06-06 10:04 [PATCH v1 00/15] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
` (13 preceding siblings ...)
2025-06-06 10:04 ` [PATCH v1 14/15] intel_iommu: Bypass replay in stage-1 page table mode Zhenzhong Duan
@ 2025-06-06 10:04 ` Zhenzhong Duan
14 siblings, 0 replies; 36+ messages in thread
From: Zhenzhong Duan @ 2025-06-06 10:04 UTC (permalink / raw)
To: qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Zhenzhong Duan, Marcel Apfelbaum, Paolo Bonzini,
Richard Henderson, Eduardo Habkost
Now that all the infrastructure for running passthrough devices with
stage-1 translation is in place, enable it.
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
---
hw/i386/intel_iommu.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 165998896c..1df861ba90 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -5597,6 +5597,7 @@ static bool vtd_check_hiod(IntelIOMMUState *s, VTDHostIOMMUDevice *vtd_hiod,
}
vtd_hiod->errata = vtd->flags & IOMMU_HW_INFO_VTD_ERRATA_772415_SPR17;
+ return true;
#endif
error_setg(errp, "host device is uncompatible with stage-1 translation");
--
2.34.1
^ permalink raw reply related [flat|nested] 36+ messages in thread
* Re: [PATCH v1 01/15] intel_iommu: Rename vtd_ce_get_rid2pasid_entry to vtd_ce_get_pasid_entry
2025-06-06 10:04 ` [PATCH v1 01/15] intel_iommu: Rename vtd_ce_get_rid2pasid_entry to vtd_ce_get_pasid_entry Zhenzhong Duan
@ 2025-06-11 7:20 ` Yi Liu
2025-06-17 17:16 ` Eric Auger
1 sibling, 0 replies; 36+ messages in thread
From: Yi Liu @ 2025-06-11 7:20 UTC (permalink / raw)
To: Zhenzhong Duan, qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, chao.p.peng, Marcel Apfelbaum,
Paolo Bonzini, Richard Henderson, Eduardo Habkost
On 2025/6/6 18:04, Zhenzhong Duan wrote:
> In early days vtd_ce_get_rid2pasid_entry() was used to get pasid entry
> of rid2pasid, then it was extended to get any pasid entry. So a new name
> vtd_ce_get_pasid_entry is better to match what it actually does.
>
> No functional change intended.
>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> Reviewed-by: Clément Mathieu--Drif<clement.mathieu--drif@eviden.com>
> ---
> hw/i386/intel_iommu.c | 14 +++++++-------
> 1 file changed, 7 insertions(+), 7 deletions(-)
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 69d72ad35c..f0b1f90eff 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -944,7 +944,7 @@ static int vtd_get_pe_from_pasid_table(IntelIOMMUState *s,
> return 0;
> }
>
> -static int vtd_ce_get_rid2pasid_entry(IntelIOMMUState *s,
> +static int vtd_ce_get_pasid_entry(IntelIOMMUState *s,
> VTDContextEntry *ce,
> VTDPASIDEntry *pe,
> uint32_t pasid)
> @@ -1025,7 +1025,7 @@ static uint32_t vtd_get_iova_level(IntelIOMMUState *s,
> VTDPASIDEntry pe;
>
> if (s->root_scalable) {
> - vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
> + vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
> if (s->flts) {
> return VTD_PE_GET_FL_LEVEL(&pe);
> } else {
> @@ -1048,7 +1048,7 @@ static uint32_t vtd_get_iova_agaw(IntelIOMMUState *s,
> VTDPASIDEntry pe;
>
> if (s->root_scalable) {
> - vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
> + vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
> return 30 + ((pe.val[0] >> 2) & VTD_SM_PASID_ENTRY_AW) * 9;
> }
>
> @@ -1116,7 +1116,7 @@ static dma_addr_t vtd_get_iova_pgtbl_base(IntelIOMMUState *s,
> VTDPASIDEntry pe;
>
> if (s->root_scalable) {
> - vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
> + vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
> if (s->flts) {
> return pe.val[2] & VTD_SM_PASID_ENTRY_FLPTPTR;
> } else {
> @@ -1522,7 +1522,7 @@ static int vtd_ce_rid2pasid_check(IntelIOMMUState *s,
> * has valid rid2pasid setting, which includes valid
> * rid2pasid field and corresponding pasid entry setting
> */
> - return vtd_ce_get_rid2pasid_entry(s, ce, &pe, PCI_NO_PASID);
> + return vtd_ce_get_pasid_entry(s, ce, &pe, PCI_NO_PASID);
> }
>
> /* Map a device to its corresponding domain (context-entry) */
> @@ -1611,7 +1611,7 @@ static uint16_t vtd_get_domain_id(IntelIOMMUState *s,
> VTDPASIDEntry pe;
>
> if (s->root_scalable) {
> - vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
> + vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
> return VTD_SM_PASID_ENTRY_DID(pe.val[1]);
> }
>
> @@ -1687,7 +1687,7 @@ static bool vtd_dev_pt_enabled(IntelIOMMUState *s, VTDContextEntry *ce,
> int ret;
>
> if (s->root_scalable) {
> - ret = vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
> + ret = vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
> if (ret) {
> /*
> * This error is guest triggerable. We should assumt PT
a typo. not related to this patch though.
--
Regards,
Yi Liu
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v1 02/15] intel_iommu: Optimize context entry cache utilization
2025-06-06 10:04 ` [PATCH v1 02/15] intel_iommu: Optimize context entry cache utilization Zhenzhong Duan
@ 2025-06-11 7:48 ` Yi Liu
2025-06-11 10:06 ` Duan, Zhenzhong
2025-06-17 17:24 ` Eric Auger
1 sibling, 1 reply; 36+ messages in thread
From: Yi Liu @ 2025-06-11 7:48 UTC (permalink / raw)
To: Zhenzhong Duan, qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, chao.p.peng, Marcel Apfelbaum,
Paolo Bonzini, Richard Henderson, Eduardo Habkost
On 2025/6/6 18:04, Zhenzhong Duan wrote:
> There are many call sites referencing context entry by calling
> vtd_dev_to_context_entry() which will traverse the DMAR table.
>
> In most cases we can use cached context entry in vtd_as->context_cache_entry
> except when its entry is stale. Currently only global and domain context
> invalidation stale it.
>
> So introduce a helper function vtd_as_to_context_entry() to fetch from cache
> before trying with vtd_dev_to_context_entry().
The cached context entry is now protected by vtd_iommu_lock(), but not
all callers of vtd_dev_to_context_entry() hold this lock.
Also, the cached context entry is created in the translate path. IMHO,
this path is not supposed to be triggered for passthrough devices,
though this needs double-checking and may change in the future. But let's
see whether there is any locking issue with the current code.
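A rough sketch of the interleaving I'm worried about (not actual code; the
reader runs without the lock):

    /* reader (unlocked)                  writer (translate path, locked)
     * gen test passes
     *                                    cc_entry->context_entry = ce;
     * *ce = cc_entry->context_entry;        <- copy races with the update;
     *                                          the entry spans several
     *                                          uint64_t, so it can be torn
     */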
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
> hw/i386/intel_iommu.c | 36 +++++++++++++++++++++++-------------
> 1 file changed, 23 insertions(+), 13 deletions(-)
>
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index f0b1f90eff..a2f3250724 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -1597,6 +1597,22 @@ static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
> return 0;
> }
>
> +static int vtd_as_to_context_entry(VTDAddressSpace *vtd_as, VTDContextEntry *ce)
> +{
> + IntelIOMMUState *s = vtd_as->iommu_state;
> + uint8_t bus_num = pci_bus_num(vtd_as->bus);
> + uint8_t devfn = vtd_as->devfn;
> + VTDContextCacheEntry *cc_entry = &vtd_as->context_cache_entry;
> +
> + /* Try to fetch context-entry from cache first */
> + if (cc_entry->context_cache_gen == s->context_cache_gen) {
> + *ce = cc_entry->context_entry;
> + return 0;
> + } else {
> + return vtd_dev_to_context_entry(s, bus_num, devfn, ce);
> + }
> +}
> +
> static int vtd_sync_shadow_page_hook(const IOMMUTLBEvent *event,
> void *private)
> {
> @@ -1649,9 +1665,7 @@ static int vtd_address_space_sync(VTDAddressSpace *vtd_as)
> return 0;
> }
>
> - ret = vtd_dev_to_context_entry(vtd_as->iommu_state,
> - pci_bus_num(vtd_as->bus),
> - vtd_as->devfn, &ce);
> + ret = vtd_as_to_context_entry(vtd_as, &ce);
> if (ret) {
> if (ret == -VTD_FR_CONTEXT_ENTRY_P) {
> /*
> @@ -1710,8 +1724,7 @@ static bool vtd_as_pt_enabled(VTDAddressSpace *as)
> assert(as);
>
> s = as->iommu_state;
> - if (vtd_dev_to_context_entry(s, pci_bus_num(as->bus), as->devfn,
> - &ce)) {
> + if (vtd_as_to_context_entry(as, &ce)) {
> /*
> * Possibly failed to parse the context entry for some reason
> * (e.g., during init, or any guest configuration errors on
> @@ -2435,8 +2448,7 @@ static void vtd_iotlb_domain_invalidate(IntelIOMMUState *s, uint16_t domain_id)
> vtd_iommu_unlock(s);
>
> QLIST_FOREACH(vtd_as, &s->vtd_as_with_notifiers, next) {
> - if (!vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
> - vtd_as->devfn, &ce) &&
> + if (!vtd_as_to_context_entry(vtd_as, &ce) &&
This one apparently is called without holding the lock.
> domain_id == vtd_get_domain_id(s, &ce, vtd_as->pasid)) {
> vtd_address_space_sync(vtd_as);
> }
> @@ -2458,8 +2470,7 @@ static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
> hwaddr size = (1 << am) * VTD_PAGE_SIZE;
>
> QLIST_FOREACH(vtd_as, &(s->vtd_as_with_notifiers), next) {
> - ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
> - vtd_as->devfn, &ce);
> + ret = vtd_as_to_context_entry(vtd_as, &ce);
> if (!ret && domain_id == vtd_get_domain_id(s, &ce, vtd_as->pasid)) {
> uint32_t rid2pasid = PCI_NO_PASID;
>
> @@ -2966,8 +2977,7 @@ static void vtd_piotlb_pasid_invalidate(IntelIOMMUState *s,
> vtd_iommu_unlock(s);
>
> QLIST_FOREACH(vtd_as, &s->vtd_as_with_notifiers, next) {
> - if (!vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
> - vtd_as->devfn, &ce) &&
> + if (!vtd_as_to_context_entry(vtd_as, &ce) &&
> domain_id == vtd_get_domain_id(s, &ce, vtd_as->pasid)) {
> uint32_t rid2pasid = VTD_CE_GET_RID2PASID(&ce);
>
> @@ -4146,7 +4156,7 @@ static void vtd_report_ir_illegal_access(VTDAddressSpace *vtd_as,
> assert(vtd_as->pasid != PCI_NO_PASID);
>
> /* Try out best to fetch FPD, we can't do anything more */
> - if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {
> + if (vtd_as_to_context_entry(vtd_as, &ce) == 0) {
> is_fpd_set = ce.lo & VTD_CONTEXT_ENTRY_FPD;
> if (!is_fpd_set && s->root_scalable) {
> vtd_ce_get_pasid_fpd(s, &ce, &is_fpd_set, vtd_as->pasid);
> @@ -4506,7 +4516,7 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
> /* replay is protected by BQL, page walk will re-setup it safely */
> iova_tree_remove(vtd_as->iova_tree, map);
>
> - if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {
> + if (vtd_as_to_context_entry(vtd_as, &ce) == 0) {
> trace_vtd_replay_ce_valid(s->root_scalable ? "scalable mode" :
> "legacy mode",
> bus_n, PCI_SLOT(vtd_as->devfn),
--
Regards,
Yi Liu
* Re: [PATCH v1 05/15] intel_iommu: Introduce two helpers vtd_as_from/to_iommu_pasid_locked
2025-06-06 10:04 ` [PATCH v1 05/15] intel_iommu: Introduce two helpers vtd_as_from/to_iommu_pasid_locked Zhenzhong Duan
@ 2025-06-11 9:54 ` Yi Liu
2025-06-11 10:46 ` Duan, Zhenzhong
0 siblings, 1 reply; 36+ messages in thread
From: Yi Liu @ 2025-06-11 9:54 UTC (permalink / raw)
To: Zhenzhong Duan, qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, chao.p.peng, Paolo Bonzini,
Richard Henderson, Eduardo Habkost, Marcel Apfelbaum
On 2025/6/6 18:04, Zhenzhong Duan wrote:
> We already have vtd_find_add_as() to find an AS from BDF+pasid, but this
> pasid is passed from PCI subsystem. PCI device supports two request types,
> Requests-without-PASID and Requests-with-PASID. Requests-without-PASID
> doesn't include a PASID TLP prefix, IOMMU fetches rid_pasid from context
> entry and use it as IOMMU's pasid to index pasid table.
When reading the first line, it makes me wonder why the helpers are needed,
since there is already a helper for the lookup. The key is the later part: we
need to translate PCI_NO_PASID to rid_pasid.
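IOW, the expected net effect of the pair is something like below (hypothetical
usage; sid built from the vtd_as BDF, error handling omitted):

    uint16_t sid = PCI_BUILD_BDF(pci_bus_num(vtd_as->bus), vtd_as->devfn);
    uint32_t iommu_pasid;

    /* Request-without-PASID: vtd_as->pasid == PCI_NO_PASID, so the helper
     * hands back the RID2PASID from the context entry;
     * Request-with-PASID: it hands back vtd_as->pasid unchanged. */
    vtd_as_to_iommu_pasid_locked(vtd_as, &iommu_pasid);

    /* ... and the reverse lookup by sid + iommu pasid finds the same AS */
    assert(vtd_as_from_iommu_pasid_locked(s, sid, iommu_pasid) == vtd_as);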
> So we need to translate between PCI's pasid and IOMMU's pasid specially
> for Requests-without-PASID, e.g., PCI_NO_PASID(-1) <-> rid_pasid.
> For Requests-with-PASID, PCI's pasid and IOMMU's pasid are same value.
>
> vtd_as_from_iommu_pasid_locked() translates from BDF+iommu_pasid to vtd_as
> which contains PCI's pasid vtd_as->pasid.
>
> vtd_as_to_iommu_pasid_locked() translates from BDF+vtd_as->pasid to iommu_pasid.
>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
> hw/i386/intel_iommu.c | 50 +++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 50 insertions(+)
>
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 796b71605c..112e09e305 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -1617,6 +1617,56 @@ static int vtd_as_to_context_entry(VTDAddressSpace *vtd_as, VTDContextEntry *ce)
> }
> }
>
> +static inline int vtd_as_to_iommu_pasid_locked(VTDAddressSpace *vtd_as,
> + uint32_t *pasid)
> +{
> + VTDContextEntry ce;
> + int ret;
> +
> + ret = vtd_as_to_context_entry(vtd_as, &ce);
> + if (ret) {
> + return ret;
> + }
> +
> + /* Translate to iommu pasid if PCI_NO_PASID */
> + if (vtd_as->pasid == PCI_NO_PASID) {
> + *pasid = VTD_CE_GET_RID2PASID(&ce);
> + } else {
> + *pasid = vtd_as->pasid;
> + }
> +
> + return 0;
> +}
> +
> +static gboolean vtd_find_as_by_sid_and_iommu_pasid(gpointer key, gpointer value,
> + gpointer user_data)
> +{
> + VTDAddressSpace *vtd_as = (VTDAddressSpace *)value;
> + struct vtd_as_raw_key *target = (struct vtd_as_raw_key *)user_data;
> + uint16_t sid = PCI_BUILD_BDF(pci_bus_num(vtd_as->bus), vtd_as->devfn);
> + uint32_t pasid;
> +
> + if (vtd_as_to_iommu_pasid_locked(vtd_as, &pasid)) {
> + return false;
> + }
> +
> + return (pasid == target->pasid) && (sid == target->sid);
> +}
> +
> +/* Translate iommu pasid to vtd_as */
> +static inline
> +VTDAddressSpace *vtd_as_from_iommu_pasid_locked(IntelIOMMUState *s,
> + uint16_t sid, uint32_t pasid)
> +{
> + struct vtd_as_raw_key key = {
> + .sid = sid,
> + .pasid = pasid
> + };
> +
> + return g_hash_table_find(s->vtd_address_spaces,
> + vtd_find_as_by_sid_and_iommu_pasid, &key);
> +}
> +
> static int vtd_sync_shadow_page_hook(const IOMMUTLBEvent *event,
> void *private)
> {
--
Regards,
Yi Liu
* RE: [PATCH v1 02/15] intel_iommu: Optimize context entry cache utilization
2025-06-11 7:48 ` Yi Liu
@ 2025-06-11 10:06 ` Duan, Zhenzhong
2025-06-17 10:57 ` Yi Liu
0 siblings, 1 reply; 36+ messages in thread
From: Duan, Zhenzhong @ 2025-06-11 10:06 UTC (permalink / raw)
To: Liu, Yi L, qemu-devel@nongnu.org
Cc: alex.williamson@redhat.com, clg@redhat.com, eric.auger@redhat.com,
mst@redhat.com, jasowang@redhat.com, peterx@redhat.com,
ddutile@redhat.com, jgg@nvidia.com, nicolinc@nvidia.com,
shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
clement.mathieu--drif@eviden.com, Tian, Kevin, Peng, Chao P,
Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
Eduardo Habkost
>-----Original Message-----
>From: Liu, Yi L <yi.l.liu@intel.com>
>Subject: Re: [PATCH v1 02/15] intel_iommu: Optimize context entry cache
>utilization
>
>On 2025/6/6 18:04, Zhenzhong Duan wrote:
>> There are many call sites referencing context entry by calling
>> vtd_dev_to_context_entry() which will traverse the DMAR table.
>>
>> In most cases we can use cached context entry in vtd_as->context_cache_entry
>> except when its entry is stale. Currently only global and domain context
>> invalidation stale it.
>>
>> So introduce a helper function vtd_as_to_context_entry() to fetch from cache
>> before trying with vtd_dev_to_context_entry().
>
>The cached context entry is now protected by vtd_iommu_lock(), but not
>all callers of vtd_dev_to_context_entry() hold this lock.
>
>Also, the cached context entry is created in the translate path. IMHO,
>this path is not supposed to be triggered for passthrough devices,
>though this needs double-checking and may change in the future. But let's
>see whether there is any locking issue with the current code.
Good finding, yes.
Previously I thought there was no real race because the translation path
updates cc_entry->context_entry before cc_entry->context_cache_gen, and
vtd_as_to_context_entry() checks cc_entry->context_cache_gen first.
But I still missed a memory barrier like below:
@@ -2277,6 +2286,7 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
                                       cc_entry->context_cache_gen,
                                       s->context_cache_gen);
             cc_entry->context_entry = ce;
+            smp_wmb();
             cc_entry->context_cache_gen = s->context_cache_gen;
         }
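If we go with the barrier, I assume the read side also needs the pairing
smp_rmb() between the context_cache_gen check and the entry copy, e.g.:

@@ static int vtd_as_to_context_entry(VTDAddressSpace *vtd_as, VTDContextEntry *ce)
     /* Try to fetch context-entry from cache first */
     if (cc_entry->context_cache_gen == s->context_cache_gen) {
+        smp_rmb();
         *ce = cc_entry->context_entry;
         return 0;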
Another option I can think of is taking the lock around the cache read, like below:
@@ -1659,11 +1659,15 @@ static int vtd_as_to_context_entry(VTDAddressSpace *vtd_as, VTDContextEntry *ce)
     uint8_t devfn = vtd_as->devfn;
     VTDContextCacheEntry *cc_entry = &vtd_as->context_cache_entry;
 
+    vtd_iommu_lock(s);
+
     /* Try to fetch context-entry from cache first */
     if (cc_entry->context_cache_gen == s->context_cache_gen) {
         *ce = cc_entry->context_entry;
+        vtd_iommu_unlock(s);
         return 0;
     } else {
+        vtd_iommu_unlock(s);
         return vtd_dev_to_context_entry(s, bus_num, devfn, ce);
     }
 }
Which one do you prefer?
Thanks
Zhenzhong
* RE: [PATCH v1 05/15] intel_iommu: Introduce two helpers vtd_as_from/to_iommu_pasid_locked
2025-06-11 9:54 ` Yi Liu
@ 2025-06-11 10:46 ` Duan, Zhenzhong
2025-06-17 10:58 ` Yi Liu
0 siblings, 1 reply; 36+ messages in thread
From: Duan, Zhenzhong @ 2025-06-11 10:46 UTC (permalink / raw)
To: Liu, Yi L, qemu-devel@nongnu.org
Cc: alex.williamson@redhat.com, clg@redhat.com, eric.auger@redhat.com,
mst@redhat.com, jasowang@redhat.com, peterx@redhat.com,
ddutile@redhat.com, jgg@nvidia.com, nicolinc@nvidia.com,
shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
clement.mathieu--drif@eviden.com, Tian, Kevin, Peng, Chao P,
Paolo Bonzini, Richard Henderson, Eduardo Habkost,
Marcel Apfelbaum
>-----Original Message-----
>From: Liu, Yi L <yi.l.liu@intel.com>
>Subject: Re: [PATCH v1 05/15] intel_iommu: Introduce two helpers
>vtd_as_from/to_iommu_pasid_locked
>
>On 2025/6/6 18:04, Zhenzhong Duan wrote:
>> We already have vtd_find_add_as() to find an AS from BDF+pasid, but this
>> pasid is passed from PCI subsystem. PCI device supports two request types,
>> Requests-without-PASID and Requests-with-PASID. Requests-without-PASID
>> doesn't include a PASID TLP prefix, IOMMU fetches rid_pasid from context
>> entry and use it as IOMMU's pasid to index pasid table.
>
>When reading the first line, it makes me wonder why the helpers are needed,
>since there is already a helper for the lookup. The key is the later part: we
>need to translate PCI_NO_PASID to rid_pasid.
OK, I presume you want me to delete the first line.
Let me know if that's not what you meant.
Thanks
Zhenzhong
>
>> So we need to translate between PCI's pasid and IOMMU's pasid specially
>> for Requests-without-PASID, e.g., PCI_NO_PASID(-1) <-> rid_pasid.
>> For Requests-with-PASID, PCI's pasid and IOMMU's pasid are same value.
>>
>> vtd_as_from_iommu_pasid_locked() translates from BDF+iommu_pasid to
>vtd_as
>> which contains PCI's pasid vtd_as->pasid.
>>
>> vtd_as_to_iommu_pasid_locked() translates from BDF+vtd_as->pasid to
>iommu_pasid.
>>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>> hw/i386/intel_iommu.c | 50
>+++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 50 insertions(+)
>>
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index 796b71605c..112e09e305 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -1617,6 +1617,56 @@ static int
>vtd_as_to_context_entry(VTDAddressSpace *vtd_as, VTDContextEntry *ce)
>> }
>> }
>>
>> +static inline int vtd_as_to_iommu_pasid_locked(VTDAddressSpace *vtd_as,
>> + uint32_t *pasid)
>> +{
>> + VTDContextEntry ce;
>> + int ret;
>> +
>> + ret = vtd_as_to_context_entry(vtd_as, &ce);
>> + if (ret) {
>> + return ret;
>> + }
>> +
>> + /* Translate to iommu pasid if PCI_NO_PASID */
>> + if (vtd_as->pasid == PCI_NO_PASID) {
>> + *pasid = VTD_CE_GET_RID2PASID(&ce);
>> + } else {
>> + *pasid = vtd_as->pasid;
>> + }
>> +
>> + return 0;
>> +}
>> +
>> +static gboolean vtd_find_as_by_sid_and_iommu_pasid(gpointer key, gpointer
>value,
>> + gpointer user_data)
>> +{
>> + VTDAddressSpace *vtd_as = (VTDAddressSpace *)value;
>> + struct vtd_as_raw_key *target = (struct vtd_as_raw_key *)user_data;
>> + uint16_t sid = PCI_BUILD_BDF(pci_bus_num(vtd_as->bus), vtd_as->devfn);
>> + uint32_t pasid;
>> +
>> + if (vtd_as_to_iommu_pasid_locked(vtd_as, &pasid)) {
>> + return false;
>> + }
>> +
>> + return (pasid == target->pasid) && (sid == target->sid);
>> +}
>> +
>> +/* Translate iommu pasid to vtd_as */
>> +static inline
>> +VTDAddressSpace *vtd_as_from_iommu_pasid_locked(IntelIOMMUState *s,
>> + uint16_t sid, uint32_t pasid)
>> +{
>> + struct vtd_as_raw_key key = {
>> + .sid = sid,
>> + .pasid = pasid
>> + };
>> +
>> + return g_hash_table_find(s->vtd_address_spaces,
>> + vtd_find_as_by_sid_and_iommu_pasid, &key);
>> +}
>> +
>> static int vtd_sync_shadow_page_hook(const IOMMUTLBEvent *event,
>> void *private)
>> {
>
>--
>Regards,
>Yi Liu
* Re: [PATCH v1 04/15] intel_iommu: Introduce a new structure VTDHostIOMMUDevice
2025-06-06 10:04 ` [PATCH v1 04/15] intel_iommu: Introduce a new structure VTDHostIOMMUDevice Zhenzhong Duan
@ 2025-06-12 16:04 ` CLEMENT MATHIEU--DRIF
2025-06-13 9:08 ` Duan, Zhenzhong
0 siblings, 1 reply; 36+ messages in thread
From: CLEMENT MATHIEU--DRIF @ 2025-06-12 16:04 UTC (permalink / raw)
To: Zhenzhong Duan, qemu-devel@nongnu.org
Cc: alex.williamson@redhat.com, clg@redhat.com, eric.auger@redhat.com,
mst@redhat.com, jasowang@redhat.com, peterx@redhat.com,
ddutile@redhat.com, jgg@nvidia.com, nicolinc@nvidia.com,
shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
kevin.tian@intel.com, yi.l.liu@intel.com, chao.p.peng@intel.com,
Paolo Bonzini, Richard Henderson, Eduardo Habkost,
Marcel Apfelbaum
Hi,
On 06/06/2025 12:04 pm, Zhenzhong Duan wrote:
>
> Introduce a new structure VTDHostIOMMUDevice which replaces
> HostIOMMUDevice to be stored in hash table.
>
> It includes a reference to HostIOMMUDevice and IntelIOMMUState,
> also includes BDF information which will be used in future
> patches.
>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
> hw/i386/intel_iommu_internal.h | 7 +++++++
> include/hw/i386/intel_iommu.h | 2 +-
> hw/i386/intel_iommu.c | 14 ++++++++++++--
> 3 files changed, 20 insertions(+), 3 deletions(-)
>
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> index 2cda744786..18bc22fc72 100644
> --- a/hw/i386/intel_iommu_internal.h
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -28,6 +28,7 @@
> #ifndef HW_I386_INTEL_IOMMU_INTERNAL_H
> #define HW_I386_INTEL_IOMMU_INTERNAL_H
> #include "hw/i386/intel_iommu.h"
> +#include "system/host_iommu_device.h"
>
> /*
> * Intel IOMMU register specification
> @@ -608,4 +609,10 @@ typedef struct VTDRootEntry VTDRootEntry;
> /* Bits to decide the offset for each level */
> #define VTD_LEVEL_BITS 9
>
> +typedef struct VTDHostIOMMUDevice {
> + IntelIOMMUState *iommu_state;
> + PCIBus *bus;
> + uint8_t devfn;
> + HostIOMMUDevice *hiod;
> +} VTDHostIOMMUDevice;
> #endif
> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> index e95477e855..50f9b27a45 100644
> --- a/include/hw/i386/intel_iommu.h
> +++ b/include/hw/i386/intel_iommu.h
> @@ -295,7 +295,7 @@ struct IntelIOMMUState {
> /* list of registered notifiers */
> QLIST_HEAD(, VTDAddressSpace) vtd_as_with_notifiers;
>
> - GHashTable *vtd_host_iommu_dev; /* HostIOMMUDevice */
> + GHashTable *vtd_host_iommu_dev; /* VTDHostIOMMUDevice */
>
> /* interrupt remapping */
> bool intr_enabled; /* Whether guest enabled IR */
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index c42ef83ddc..796b71605c 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -281,7 +281,10 @@ static gboolean vtd_hiod_equal(gconstpointer v1, gconstpointer v2)
>
> static void vtd_hiod_destroy(gpointer v)
> {
> - object_unref(v);
> + VTDHostIOMMUDevice *vtd_hiod = v;
> +
> + object_unref(vtd_hiod->hiod);
> + g_free(vtd_hiod);
> }
>
> static gboolean vtd_hash_remove_by_domain(gpointer key, gpointer value,
> @@ -4397,6 +4400,7 @@ static bool vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
> HostIOMMUDevice *hiod, Error **errp)
> {
> IntelIOMMUState *s = opaque;
> + VTDHostIOMMUDevice *vtd_hiod;
> struct vtd_as_key key = {
> .bus = bus,
> .devfn = devfn,
> @@ -4413,6 +4417,12 @@ static bool vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
> return false;
> }
>
> + vtd_hiod = g_malloc0(sizeof(VTDHostIOMMUDevice));
> + vtd_hiod->bus = bus;
> + vtd_hiod->devfn = (uint8_t)devfn;
> + vtd_hiod->iommu_state = s;
> + vtd_hiod->hiod = hiod;
> +
> if (!vtd_check_hiod(s, hiod, errp)) {
Shouldn't we free vtd_hiod here?
> vtd_iommu_unlock(s);
> return false;
> @@ -4423,7 +4433,7 @@ static bool vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
> new_key->devfn = devfn;
>
> object_ref(hiod);
> - g_hash_table_insert(s->vtd_host_iommu_dev, new_key, hiod);
> + g_hash_table_insert(s->vtd_host_iommu_dev, new_key, vtd_hiod);
>
> vtd_iommu_unlock(s);
>
> --
> 2.34.1
>
* RE: [PATCH v1 04/15] intel_iommu: Introduce a new structure VTDHostIOMMUDevice
2025-06-12 16:04 ` CLEMENT MATHIEU--DRIF
@ 2025-06-13 9:08 ` Duan, Zhenzhong
2025-06-20 7:08 ` Eric Auger
0 siblings, 1 reply; 36+ messages in thread
From: Duan, Zhenzhong @ 2025-06-13 9:08 UTC (permalink / raw)
To: CLEMENT MATHIEU--DRIF, qemu-devel@nongnu.org
Cc: alex.williamson@redhat.com, clg@redhat.com, eric.auger@redhat.com,
mst@redhat.com, jasowang@redhat.com, peterx@redhat.com,
ddutile@redhat.com, jgg@nvidia.com, nicolinc@nvidia.com,
shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
Tian, Kevin, Liu, Yi L, Peng, Chao P, Paolo Bonzini,
Richard Henderson, Eduardo Habkost, Marcel Apfelbaum
Hi Clement,
>-----Original Message-----
>From: CLEMENT MATHIEU--DRIF <clement.mathieu--drif@eviden.com>
>Subject: Re: [PATCH v1 04/15] intel_iommu: Introduce a new structure
>VTDHostIOMMUDevice
>
>Hi,
>
>On 06/06/2025 12:04 pm, Zhenzhong Duan wrote:
>>
>> Introduce a new structure VTDHostIOMMUDevice which replaces
>> HostIOMMUDevice to be stored in hash table.
>>
>> It includes a reference to HostIOMMUDevice and IntelIOMMUState,
>> also includes BDF information which will be used in future
>> patches.
>>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>> hw/i386/intel_iommu_internal.h | 7 +++++++
>> include/hw/i386/intel_iommu.h | 2 +-
>> hw/i386/intel_iommu.c | 14 ++++++++++++--
>> 3 files changed, 20 insertions(+), 3 deletions(-)
>>
>> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
>> index 2cda744786..18bc22fc72 100644
>> --- a/hw/i386/intel_iommu_internal.h
>> +++ b/hw/i386/intel_iommu_internal.h
>> @@ -28,6 +28,7 @@
>> #ifndef HW_I386_INTEL_IOMMU_INTERNAL_H
>> #define HW_I386_INTEL_IOMMU_INTERNAL_H
>> #include "hw/i386/intel_iommu.h"
>> +#include "system/host_iommu_device.h"
>>
>> /*
>> * Intel IOMMU register specification
>> @@ -608,4 +609,10 @@ typedef struct VTDRootEntry VTDRootEntry;
>> /* Bits to decide the offset for each level */
>> #define VTD_LEVEL_BITS 9
>>
>> +typedef struct VTDHostIOMMUDevice {
>> + IntelIOMMUState *iommu_state;
>> + PCIBus *bus;
>> + uint8_t devfn;
>> + HostIOMMUDevice *hiod;
>> +} VTDHostIOMMUDevice;
>> #endif
>> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
>> index e95477e855..50f9b27a45 100644
>> --- a/include/hw/i386/intel_iommu.h
>> +++ b/include/hw/i386/intel_iommu.h
>> @@ -295,7 +295,7 @@ struct IntelIOMMUState {
>> /* list of registered notifiers */
>> QLIST_HEAD(, VTDAddressSpace) vtd_as_with_notifiers;
>>
>> - GHashTable *vtd_host_iommu_dev; /* HostIOMMUDevice */
>> + GHashTable *vtd_host_iommu_dev; /* VTDHostIOMMUDevice */
>>
>> /* interrupt remapping */
>> bool intr_enabled; /* Whether guest enabled IR */
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index c42ef83ddc..796b71605c 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -281,7 +281,10 @@ static gboolean vtd_hiod_equal(gconstpointer v1,
>gconstpointer v2)
>>
>> static void vtd_hiod_destroy(gpointer v)
>> {
>> - object_unref(v);
>> + VTDHostIOMMUDevice *vtd_hiod = v;
>> +
>> + object_unref(vtd_hiod->hiod);
>> + g_free(vtd_hiod);
>> }
>>
>> static gboolean vtd_hash_remove_by_domain(gpointer key, gpointer value,
>> @@ -4397,6 +4400,7 @@ static bool vtd_dev_set_iommu_device(PCIBus *bus,
>void *opaque, int devfn,
>> HostIOMMUDevice *hiod, Error **errp)
>> {
>> IntelIOMMUState *s = opaque;
>> + VTDHostIOMMUDevice *vtd_hiod;
>> struct vtd_as_key key = {
>> .bus = bus,
>> .devfn = devfn,
>> @@ -4413,6 +4417,12 @@ static bool vtd_dev_set_iommu_device(PCIBus *bus,
>void *opaque, int devfn,
>> return false;
>> }
>>
>> + vtd_hiod = g_malloc0(sizeof(VTDHostIOMMUDevice));
>> + vtd_hiod->bus = bus;
>> + vtd_hiod->devfn = (uint8_t)devfn;
>> + vtd_hiod->iommu_state = s;
>> + vtd_hiod->hiod = hiod;
>> +
>> if (!vtd_check_hiod(s, hiod, errp)) {
>
>Shouldn't we free vtd_hiod here?
Good catch, thanks.
I put the g_free() in patch 10, but it should be in this patch; will fix.
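I.e., something like below squashed into this patch (sketch):

@@ static bool vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
     if (!vtd_check_hiod(s, hiod, errp)) {
+        g_free(vtd_hiod);
         vtd_iommu_unlock(s);
         return false;
     }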
BRs,
Zhenzhong
* Re: [PATCH v1 02/15] intel_iommu: Optimize context entry cache utilization
2025-06-11 10:06 ` Duan, Zhenzhong
@ 2025-06-17 10:57 ` Yi Liu
2025-06-18 1:58 ` Duan, Zhenzhong
0 siblings, 1 reply; 36+ messages in thread
From: Yi Liu @ 2025-06-17 10:57 UTC (permalink / raw)
To: Duan, Zhenzhong, qemu-devel@nongnu.org
Cc: alex.williamson@redhat.com, clg@redhat.com, eric.auger@redhat.com,
mst@redhat.com, jasowang@redhat.com, peterx@redhat.com,
ddutile@redhat.com, jgg@nvidia.com, nicolinc@nvidia.com,
shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
clement.mathieu--drif@eviden.com, Tian, Kevin, Peng, Chao P,
Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
Eduardo Habkost
On 2025/6/11 18:06, Duan, Zhenzhong wrote:
>
>
>> -----Original Message-----
>> From: Liu, Yi L <yi.l.liu@intel.com>
>> Subject: Re: [PATCH v1 02/15] intel_iommu: Optimize context entry cache
>> utilization
>>
>> On 2025/6/6 18:04, Zhenzhong Duan wrote:
>>> There are many call sites referencing context entry by calling
>>> vtd_dev_to_context_entry() which will traverse the DMAR table.
>>>
>>> In most cases we can use cached context entry in vtd_as->context_cache_entry
>>> except when its entry is stale. Currently only global and domain context
>>> invalidation stale it.
>>>
>>> So introduce a helper function vtd_as_to_context_entry() to fetch from cache
>>> before trying with vtd_dev_to_context_entry().
>>
>> The cached context entry is now protected by vtd_iommu_lock(), but not
>> all callers of vtd_dev_to_context_entry() hold this lock.
>>
>> Also, the cached context entry is created in the translate path. IMHO,
>> this path is not supposed to be triggered for passthrough devices,
>> though this needs double-checking and may change in the future. But let's
>> see whether there is any locking issue with the current code.
>
> Good finding, yes.
> Previously I thought there was no real race because the translation path
> updates cc_entry->context_entry before cc_entry->context_cache_gen, and
> vtd_as_to_context_entry() checks cc_entry->context_cache_gen first.
> But I still missed a memory barrier like below:
Yeah, testing context_cache_gen is necessary. But without the lock, this
cannot guarantee cc_entry is still valid after the test.
> @@ -2277,6 +2286,7 @@ static bool vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
> cc_entry->context_cache_gen,
> s->context_cache_gen);
> cc_entry->context_entry = ce;
> + smp_wmb();
> cc_entry->context_cache_gen = s->context_cache_gen;
> }
>
> Another option I can think of is taking the lock around the cache read, like below:
This is insufficient as well, since cc_entry->context_entry can be modified
after the lock is released.
> @@ -1659,11 +1659,15 @@ static int vtd_as_to_context_entry(VTDAddressSpace *vtd_as, VTDContextEntry *ce)
> uint8_t devfn = vtd_as->devfn;
> VTDContextCacheEntry *cc_entry = &vtd_as->context_cache_entry;
>
> + vtd_iommu_lock(s);
> +
> /* Try to fetch context-entry from cache first */
> if (cc_entry->context_cache_gen == s->context_cache_gen) {
> *ce = cc_entry->context_entry;
> + vtd_iommu_unlock(s);
> return 0;
> } else {
> + vtd_iommu_unlock(s);
> return vtd_dev_to_context_entry(s, bus_num, devfn, ce);
> }
> }
>
> Which one do you prefer?
If it's just an optimization, perhaps just drop it. :)
--
Regards,
Yi Liu
* Re: [PATCH v1 05/15] intel_iommu: Introduce two helpers vtd_as_from/to_iommu_pasid_locked
2025-06-11 10:46 ` Duan, Zhenzhong
@ 2025-06-17 10:58 ` Yi Liu
0 siblings, 0 replies; 36+ messages in thread
From: Yi Liu @ 2025-06-17 10:58 UTC (permalink / raw)
To: Duan, Zhenzhong, qemu-devel@nongnu.org
Cc: alex.williamson@redhat.com, clg@redhat.com, eric.auger@redhat.com,
mst@redhat.com, jasowang@redhat.com, peterx@redhat.com,
ddutile@redhat.com, jgg@nvidia.com, nicolinc@nvidia.com,
shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
clement.mathieu--drif@eviden.com, Tian, Kevin, Peng, Chao P,
Paolo Bonzini, Richard Henderson, Eduardo Habkost,
Marcel Apfelbaum
On 2025/6/11 18:46, Duan, Zhenzhong wrote:
>
>
>> -----Original Message-----
>> From: Liu, Yi L <yi.l.liu@intel.com>
>> Subject: Re: [PATCH v1 05/15] intel_iommu: Introduce two helpers
>> vtd_as_from/to_iommu_pasid_locked
>>
>> On 2025/6/6 18:04, Zhenzhong Duan wrote:
>>> We already have vtd_find_add_as() to find an AS from BDF+pasid, but this
>>> pasid is passed from PCI subsystem. PCI device supports two request types,
>>> Requests-without-PASID and Requests-with-PASID. Requests-without-PASID
>>> doesn't include a PASID TLP prefix, IOMMU fetches rid_pasid from context
>>> entry and use it as IOMMU's pasid to index pasid table.
>>
>> When reading the first line, it makes me wonder why the helpers are needed,
>> since there is already a helper for the lookup. The key is the later part: we
>> need to translate PCI_NO_PASID to rid_pasid.
>
> OK, I presume you want me to delete the first line.
> Let me know if that's not what you meant.
>
yes. :)
Regards,
Yi Liu
* Re: [PATCH v1 06/15] intel_iommu: Handle PASID entry removing and updating
2025-06-06 10:04 ` [PATCH v1 06/15] intel_iommu: Handle PASID entry removing and updating Zhenzhong Duan
@ 2025-06-17 12:29 ` Yi Liu
2025-06-18 6:03 ` Duan, Zhenzhong
0 siblings, 1 reply; 36+ messages in thread
From: Yi Liu @ 2025-06-17 12:29 UTC (permalink / raw)
To: Zhenzhong Duan, qemu-devel
Cc: alex.williamson, clg, eric.auger, mst, jasowang, peterx, ddutile,
jgg, nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, chao.p.peng, Yi Sun,
Paolo Bonzini, Richard Henderson, Eduardo Habkost,
Marcel Apfelbaum
On 2025/6/6 18:04, Zhenzhong Duan wrote:
> This adds an new entry VTDPASIDCacheEntry in VTDAddressSpace to cache the
> pasid entry and track PASID usage and future PASID tagged DMA address
> translation support in vIOMMU.
>
> VTDAddressSpace of PCI_NO_PASID is allocated when device is plugged and
> never freed. For other pasid, VTDAddressSpace instance is created/destroyed
> per the guest pasid entry set up/destroy for passthrough devices. While for
> emulated devices, VTDAddressSpace instance is created in the PASID tagged DMA
> translation and be destroyed per guest PASID cache invalidation. This focuses
> on the PASID cache management for passthrough devices as there is no PASID
> capable emulated devices yet.
>
> When guest modifies a PASID entry, QEMU will capture the guest pasid selective
> pasid cache invalidation, allocate or remove a VTDAddressSpace instance per the
> invalidation reasons:
>
> a) a present pasid entry moved to non-present
> b) a present pasid entry to be a present entry
> c) a non-present pasid entry moved to present
>
> This handles a) and b), following patch will handle c).
>
> vIOMMU emulator could figure out the reason by fetching latest guest pasid entry
> and compare it with the PASID cache.
To avoid confusion, maybe better to say "cached pasid entry". :)
>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
> hw/i386/intel_iommu_internal.h | 26 ++++
> include/hw/i386/intel_iommu.h | 6 +
> hw/i386/intel_iommu.c | 252 +++++++++++++++++++++++++++++++--
> hw/i386/trace-events | 3 +
> 4 files changed, 277 insertions(+), 10 deletions(-)
>
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> index 18bc22fc72..82b84db80f 100644
> --- a/hw/i386/intel_iommu_internal.h
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -315,6 +315,7 @@ typedef enum VTDFaultReason {
> * request while disabled */
> VTD_FR_IR_SID_ERR = 0x26, /* Invalid Source-ID */
>
> + VTD_FR_RTADDR_INV_TTM = 0x31, /* Invalid TTM in RTADDR */
> /* PASID directory entry access failure */
> VTD_FR_PASID_DIR_ACCESS_ERR = 0x50,
> /* The Present(P) field of pasid directory entry is 0 */
> @@ -492,6 +493,15 @@ typedef union VTDInvDesc VTDInvDesc;
> #define VTD_INV_DESC_PIOTLB_RSVD_VAL0 0xfff000000000f1c0ULL
> #define VTD_INV_DESC_PIOTLB_RSVD_VAL1 0xf80ULL
>
> +#define VTD_INV_DESC_PASIDC_G (3ULL << 4)
> +#define VTD_INV_DESC_PASIDC_PASID(val) (((val) >> 32) & 0xfffffULL)
> +#define VTD_INV_DESC_PASIDC_DID(val) (((val) >> 16) & VTD_DOMAIN_ID_MASK)
> +#define VTD_INV_DESC_PASIDC_RSVD_VAL0 0xfff000000000f1c0ULL
> +
> +#define VTD_INV_DESC_PASIDC_DSI (0ULL << 4)
> +#define VTD_INV_DESC_PASIDC_PASID_SI (1ULL << 4)
> +#define VTD_INV_DESC_PASIDC_GLOBAL (3ULL << 4)
> +
> /* Information about page-selective IOTLB invalidate */
> struct VTDIOTLBPageInvInfo {
> uint16_t domain_id;
> @@ -552,6 +562,21 @@ typedef struct VTDRootEntry VTDRootEntry;
> #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw) (0x1e0ULL | ~VTD_HAW_MASK(aw))
> #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1 0xffffffffffe00000ULL
>
> +typedef enum VTDPCInvType {
> + /* pasid cache invalidation rely on guest PASID entry */
> + VTD_PASID_CACHE_GLOBAL_INV, /* pasid cache global invalidation */
> + VTD_PASID_CACHE_DOMSI, /* pasid cache domain selective invalidation */
> + VTD_PASID_CACHE_PASIDSI, /* pasid cache pasid selective invalidation */
> +} VTDPCInvType;
> +
> +typedef struct VTDPASIDCacheInfo {
> + VTDPCInvType type;
> + uint16_t domain_id;
> + uint32_t pasid;
> + PCIBus *bus;
> + uint16_t devfn;
> +} VTDPASIDCacheInfo;
> +
> /* PASID Table Related Definitions */
> #define VTD_PASID_DIR_BASE_ADDR_MASK (~0xfffULL)
> #define VTD_PASID_TABLE_BASE_ADDR_MASK (~0xfffULL)
> @@ -563,6 +588,7 @@ typedef struct VTDRootEntry VTDRootEntry;
> #define VTD_PASID_TABLE_BITS_MASK (0x3fULL)
> #define VTD_PASID_TABLE_INDEX(pasid) ((pasid) & VTD_PASID_TABLE_BITS_MASK)
> #define VTD_PASID_ENTRY_FPD (1ULL << 1) /* Fault Processing Disable */
> +#define VTD_PASID_TBL_ENTRY_NUM (1ULL << 6)
>
> /* PASID Granular Translation Type Mask */
> #define VTD_PASID_ENTRY_P 1ULL
> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
> index 50f9b27a45..fbc9da903a 100644
> --- a/include/hw/i386/intel_iommu.h
> +++ b/include/hw/i386/intel_iommu.h
> @@ -95,6 +95,11 @@ struct VTDPASIDEntry {
> uint64_t val[8];
> };
>
> +typedef struct VTDPASIDCacheEntry {
> + struct VTDPASIDEntry pasid_entry;
> + bool cache_filled;
> +} VTDPASIDCacheEntry;
> +
> struct VTDAddressSpace {
> PCIBus *bus;
> uint8_t devfn;
> @@ -107,6 +112,7 @@ struct VTDAddressSpace {
> MemoryRegion iommu_ir_fault; /* Interrupt region for catching fault */
> IntelIOMMUState *iommu_state;
> VTDContextCacheEntry context_cache_entry;
> + VTDPASIDCacheEntry pasid_cache_entry;
> QLIST_ENTRY(VTDAddressSpace) next;
> /* Superset of notifier flags that this address space has */
> IOMMUNotifierFlag notifier_flags;
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 112e09e305..c7162647e6 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -825,6 +825,11 @@ static inline bool vtd_pe_type_check(IntelIOMMUState *s, VTDPASIDEntry *pe)
> }
> }
>
> +static inline uint16_t vtd_pe_get_did(VTDPASIDEntry *pe)
> +{
> + return VTD_SM_PASID_ENTRY_DID((pe)->val[1]);
> +}
> +
> static inline bool vtd_pdire_present(VTDPASIDDirEntry *pdire)
> {
> return pdire->val & 1;
> @@ -3104,6 +3109,236 @@ static bool vtd_process_piotlb_desc(IntelIOMMUState *s,
> return true;
> }
>
> +static inline int vtd_dev_get_pe_from_pasid(VTDAddressSpace *vtd_as,
> + uint32_t pasid, VTDPASIDEntry *pe)
> +{
> + IntelIOMMUState *s = vtd_as->iommu_state;
> + VTDContextEntry ce;
> + int ret;
> +
> + if (!s->root_scalable) {
> + return -VTD_FR_RTADDR_INV_TTM;
> + }
> +
> + ret = vtd_as_to_context_entry(vtd_as, &ce);
> + if (ret) {
> + return ret;
> + }
> +
> + return vtd_ce_get_pasid_entry(s, &ce, pe, pasid);
> +}
> +
> +static bool vtd_pasid_entry_compare(VTDPASIDEntry *p1, VTDPASIDEntry *p2)
> +{
> + return !memcmp(p1, p2, sizeof(*p1));
> +}
> +
> +/*
> + * This function fills in the pasid entry in &vtd_as. Caller
> + * of this function should hold iommu_lock.
> + */
> +static void vtd_fill_pe_in_cache(IntelIOMMUState *s, VTDAddressSpace *vtd_as,
> + VTDPASIDEntry *pe)
> +{
> + VTDPASIDCacheEntry *pc_entry = &vtd_as->pasid_cache_entry;
> +
> + if (vtd_pasid_entry_compare(pe, &pc_entry->pasid_entry)) {
> + /* No need to go further as cached pasid entry is latest */
> + return;
> + }
> +
> + pc_entry->pasid_entry = *pe;
> + pc_entry->cache_filled = true;
> + /*
> + * TODO: send pasid bind to host for passthru devices
> + */
> +}
> +
> +/*
> + * This function is used to clear cached pasid entry in vtd_as
> + * instances. Caller of this function should hold iommu_lock.
It also covers pasid entry updates.
> + */
> +static gboolean vtd_flush_pasid(gpointer key, gpointer value,
> + gpointer user_data)
> +{
> + VTDPASIDCacheInfo *pc_info = user_data;
> + VTDAddressSpace *vtd_as = value;
> + IntelIOMMUState *s = vtd_as->iommu_state;
> + VTDPASIDCacheEntry *pc_entry = &vtd_as->pasid_cache_entry;
> + VTDPASIDEntry pe;
> + uint16_t did;
> + uint32_t pasid;
> + int ret;
> +
> + /* Replay only filled pasid entry cache for passthrough device */
The comment of this helper already implies we only continue if
pc_entry->cache_filled is true. Also, replay is a concept of the
upper-level helpers, no need to mention it here. I noticed "replay"
in low-level helpers in other patches as well, please drop those too. :)
> + if (!pc_entry->cache_filled) {
> + return false;
> + }
> + did = vtd_pe_get_did(&pc_entry->pasid_entry);
> +
> + if (vtd_as_to_iommu_pasid_locked(vtd_as, &pasid)) {
> + goto remove;
> + }
> +
> + switch (pc_info->type) {
> + case VTD_PASID_CACHE_PASIDSI:
> + if (pc_info->pasid != pasid) {
> + return false;
> + }
> + /* Fall through */
> + case VTD_PASID_CACHE_DOMSI:
> + if (pc_info->domain_id != did) {
> + return false;
> + }
> + /* Fall through */
> + case VTD_PASID_CACHE_GLOBAL_INV:
> + break;
> + default:
> + error_report("invalid pc_info->type");
> + abort();
> + }
> +
> + /*
> + * pasid cache invalidation may indicate a present pasid
> + * entry to present pasid entry modification. To cover such
> + * case, vIOMMU emulator needs to fetch latest guest pasid
> + * entry and check cached pasid entry, then update pasid
> + * cache and send pasid bind/unbind to host properly.
> + */
> + ret = vtd_dev_get_pe_from_pasid(vtd_as, pasid, &pe);
> + if (ret) {
> + /*
> + * No valid pasid entry in guest memory. e.g. pasid entry
> + * was modified to be either all-zero or non-present. Either
> + * case means existing pasid cache should be removed.
> + */
> + goto remove;
> + }
> +
> + vtd_fill_pe_in_cache(s, vtd_as, &pe);
> + return false;
> +
> +remove:
> + /*
> + * TODO: send pasid unbind to host for passthru devices
> + */
> + pc_entry->cache_filled = false;
> +
> + /*
> + * Don't remove address space of PCI_NO_PASID which is created by PCI
> + * sub-system.
> + */
I get why it cannot be removed. But I think the ce and pe fields of this
vtd_as might need to be updated, given that it would have been removed if
it were not PCI_NO_PASID.
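e.g. something like below on top of this patch (just a sketch; which fields
exactly to reset is an open question):

    if (vtd_as->pasid == PCI_NO_PASID) {
        /* Assumption: also drop the cached context entry so the retained
         * vtd_as holds no stale state; a gen of 0 never matches
         * s->context_cache_gen, which starts from 1. */
        vtd_as->context_cache_entry.context_cache_gen = 0;
        return false;
    }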
> + if (vtd_as->pasid == PCI_NO_PASID) {
> + return false;
> + }
> + return true;
> +}
> +
> +/*
> + * This function syncs the pasid bindings between guest and host.
> + * It includes updating the pasid cache in vIOMMU and updating the
> + * pasid bindings per guest's latest pasid entry presence.
> + */
> +static void vtd_pasid_cache_sync(IntelIOMMUState *s,
> + VTDPASIDCacheInfo *pc_info)
> +{
> + if (!s->flts || !s->root_scalable || !s->dmar_enabled) {
> + return;
> + }
> +
> + /*
> + * Regards to a pasid cache invalidation, e.g. a PSI.
> + * it could be either cases of below:
> + * a) a present pasid entry moved to non-present
> + * b) a present pasid entry to be a present entry
> + * c) a non-present pasid entry moved to present
> + *
> + * Different invalidation granularity may affect different device
> + * scope and pasid scope. But for each invalidation granularity,
> + * it needs to do two steps to sync host and guest pasid binding.
> + *
> + * Here is the handling of a PSI:
> + * 1) loop all the existing vtd_as instances to update them
> + * according to the latest guest pasid entry in pasid table.
> + * this will make sure affected existing vtd_as instances
> + * cached the latest pasid entries. Also, during the loop, the
> + * host should be notified if needed. e.g. pasid unbind or pasid
> + * update. Should be able to cover case a) and case b).
> + *
> + * 2) loop all devices to cover case c)
> + * - For devices which are backed by HostIOMMUDeviceIOMMUFD instances,
> + * we loop them and check if guest pasid entry exists. If yes,
> + * it is case c), we update the pasid cache and also notify
> + * host.
> + * - For devices which are not backed by HostIOMMUDeviceIOMMUFD,
> + * it is not necessary to create pasid cache at this phase since
> + * it could be created when vIOMMU does DMA address translation.
> + * This is not yet implemented since there is no emulated
> + * pasid-capable devices today. If we have such devices in
> + * future, the pasid cache shall be created there.
> + * Other granularity follow the same steps, just with different scope
> + *
> + */
> +
> + vtd_iommu_lock(s);
> + /*
> + * Step 1: loop all the existing vtd_as instances for pasid unbind and
> + * update.
> + */
> + g_hash_table_foreach_remove(s->vtd_address_spaces, vtd_flush_pasid,
> + pc_info);
> + vtd_iommu_unlock(s);
Just realized s->vtd_address_spaces is not protected by iommu_lock. If so,
would it be OK to drop the lock here and add it in the lower-level helpers
where the lock is needed?
Regards,
Yi Liu
* Re: [PATCH v1 01/15] intel_iommu: Rename vtd_ce_get_rid2pasid_entry to vtd_ce_get_pasid_entry
2025-06-06 10:04 ` [PATCH v1 01/15] intel_iommu: Rename vtd_ce_get_rid2pasid_entry to vtd_ce_get_pasid_entry Zhenzhong Duan
2025-06-11 7:20 ` Yi Liu
@ 2025-06-17 17:16 ` Eric Auger
1 sibling, 0 replies; 36+ messages in thread
From: Eric Auger @ 2025-06-17 17:16 UTC (permalink / raw)
To: Zhenzhong Duan, qemu-devel
Cc: alex.williamson, clg, mst, jasowang, peterx, ddutile, jgg,
nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
Eduardo Habkost
Hi Zhenzhong,
On 6/6/25 12:04 PM, Zhenzhong Duan wrote:
> In early days vtd_ce_get_rid2pasid_entry() was used to get pasid entry
> of rid2pasid, then it was extended to get any pasid entry. So a new name
> vtd_ce_get_pasid_entry is better to match what it actually does.
>
> No functional change intended.
>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> Reviewed-by: Clément Mathieu--Drif<clement.mathieu--drif@eviden.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Cheers
Eric
> ---
> hw/i386/intel_iommu.c | 14 +++++++-------
> 1 file changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 69d72ad35c..f0b1f90eff 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -944,7 +944,7 @@ static int vtd_get_pe_from_pasid_table(IntelIOMMUState *s,
> return 0;
> }
>
> -static int vtd_ce_get_rid2pasid_entry(IntelIOMMUState *s,
> +static int vtd_ce_get_pasid_entry(IntelIOMMUState *s,
> VTDContextEntry *ce,
> VTDPASIDEntry *pe,
> uint32_t pasid)
> @@ -1025,7 +1025,7 @@ static uint32_t vtd_get_iova_level(IntelIOMMUState *s,
> VTDPASIDEntry pe;
>
> if (s->root_scalable) {
> - vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
> + vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
> if (s->flts) {
> return VTD_PE_GET_FL_LEVEL(&pe);
> } else {
> @@ -1048,7 +1048,7 @@ static uint32_t vtd_get_iova_agaw(IntelIOMMUState *s,
> VTDPASIDEntry pe;
>
> if (s->root_scalable) {
> - vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
> + vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
> return 30 + ((pe.val[0] >> 2) & VTD_SM_PASID_ENTRY_AW) * 9;
> }
>
> @@ -1116,7 +1116,7 @@ static dma_addr_t vtd_get_iova_pgtbl_base(IntelIOMMUState *s,
> VTDPASIDEntry pe;
>
> if (s->root_scalable) {
> - vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
> + vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
> if (s->flts) {
> return pe.val[2] & VTD_SM_PASID_ENTRY_FLPTPTR;
> } else {
> @@ -1522,7 +1522,7 @@ static int vtd_ce_rid2pasid_check(IntelIOMMUState *s,
> * has valid rid2pasid setting, which includes valid
> * rid2pasid field and corresponding pasid entry setting
> */
> - return vtd_ce_get_rid2pasid_entry(s, ce, &pe, PCI_NO_PASID);
> + return vtd_ce_get_pasid_entry(s, ce, &pe, PCI_NO_PASID);
> }
>
> /* Map a device to its corresponding domain (context-entry) */
> @@ -1611,7 +1611,7 @@ static uint16_t vtd_get_domain_id(IntelIOMMUState *s,
> VTDPASIDEntry pe;
>
> if (s->root_scalable) {
> - vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
> + vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
> return VTD_SM_PASID_ENTRY_DID(pe.val[1]);
> }
>
> @@ -1687,7 +1687,7 @@ static bool vtd_dev_pt_enabled(IntelIOMMUState *s, VTDContextEntry *ce,
> int ret;
>
> if (s->root_scalable) {
> - ret = vtd_ce_get_rid2pasid_entry(s, ce, &pe, pasid);
> + ret = vtd_ce_get_pasid_entry(s, ce, &pe, pasid);
> if (ret) {
> /*
> * This error is guest triggerable. We should assumt PT
* Re: [PATCH v1 02/15] intel_iommu: Optimize context entry cache utilization
2025-06-06 10:04 ` [PATCH v1 02/15] intel_iommu: Optimize context entry cache utilization Zhenzhong Duan
2025-06-11 7:48 ` Yi Liu
@ 2025-06-17 17:24 ` Eric Auger
2025-06-18 2:10 ` Duan, Zhenzhong
1 sibling, 1 reply; 36+ messages in thread
From: Eric Auger @ 2025-06-17 17:24 UTC (permalink / raw)
To: Zhenzhong Duan, qemu-devel
Cc: alex.williamson, clg, mst, jasowang, peterx, ddutile, jgg,
nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
Eduardo Habkost
Hi Zhenzhong,
On 6/6/25 12:04 PM, Zhenzhong Duan wrote:
> There are many call sites referencing context entry by calling
> vtd_dev_to_context_entry() which will traverse the DMAR table.
>
> In most cases we can use cached context entry in vtd_as->context_cache_entry
> except when its entry is stale. Currently only global and domain context
> invalidation stale it.
>
> So introduce a helper function vtd_as_to_context_entry() to fetch from cache
> before trying with vtd_dev_to_context_entry().
>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
> hw/i386/intel_iommu.c | 36 +++++++++++++++++++++++-------------
> 1 file changed, 23 insertions(+), 13 deletions(-)
>
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index f0b1f90eff..a2f3250724 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -1597,6 +1597,22 @@ static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
> return 0;
> }
>
> +static int vtd_as_to_context_entry(VTDAddressSpace *vtd_as, VTDContextEntry *ce)
> +{
> + IntelIOMMUState *s = vtd_as->iommu_state;
> + uint8_t bus_num = pci_bus_num(vtd_as->bus);
> + uint8_t devfn = vtd_as->devfn;
> + VTDContextCacheEntry *cc_entry = &vtd_as->context_cache_entry;
> +
> + /* Try to fetch context-entry from cache first */
> + if (cc_entry->context_cache_gen == s->context_cache_gen) {
> + *ce = cc_entry->context_entry;
> + return 0;
> + } else {
> + return vtd_dev_to_context_entry(s, bus_num, devfn, ce);
> + }
> +}
> +
While the patch looks good to me, can't you use the helper in
vtd_do_iommu_translate() as well?
See "/* Try to fetch context-entry from cache first */".
If not, you may add a comment in the commit description explaining why it
can't be applied there.
Thanks
Eric
> static int vtd_sync_shadow_page_hook(const IOMMUTLBEvent *event,
> void *private)
> {
> @@ -1649,9 +1665,7 @@ static int vtd_address_space_sync(VTDAddressSpace *vtd_as)
> return 0;
> }
>
> - ret = vtd_dev_to_context_entry(vtd_as->iommu_state,
> - pci_bus_num(vtd_as->bus),
> - vtd_as->devfn, &ce);
> + ret = vtd_as_to_context_entry(vtd_as, &ce);
> if (ret) {
> if (ret == -VTD_FR_CONTEXT_ENTRY_P) {
> /*
> @@ -1710,8 +1724,7 @@ static bool vtd_as_pt_enabled(VTDAddressSpace *as)
> assert(as);
>
> s = as->iommu_state;
> - if (vtd_dev_to_context_entry(s, pci_bus_num(as->bus), as->devfn,
> - &ce)) {
> + if (vtd_as_to_context_entry(as, &ce)) {
> /*
> * Possibly failed to parse the context entry for some reason
> * (e.g., during init, or any guest configuration errors on
> @@ -2435,8 +2448,7 @@ static void vtd_iotlb_domain_invalidate(IntelIOMMUState *s, uint16_t domain_id)
> vtd_iommu_unlock(s);
>
> QLIST_FOREACH(vtd_as, &s->vtd_as_with_notifiers, next) {
> - if (!vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
> - vtd_as->devfn, &ce) &&
> + if (!vtd_as_to_context_entry(vtd_as, &ce) &&
> domain_id == vtd_get_domain_id(s, &ce, vtd_as->pasid)) {
> vtd_address_space_sync(vtd_as);
> }
> @@ -2458,8 +2470,7 @@ static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
> hwaddr size = (1 << am) * VTD_PAGE_SIZE;
>
> QLIST_FOREACH(vtd_as, &(s->vtd_as_with_notifiers), next) {
> - ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
> - vtd_as->devfn, &ce);
> + ret = vtd_as_to_context_entry(vtd_as, &ce);
> if (!ret && domain_id == vtd_get_domain_id(s, &ce, vtd_as->pasid)) {
> uint32_t rid2pasid = PCI_NO_PASID;
>
> @@ -2966,8 +2977,7 @@ static void vtd_piotlb_pasid_invalidate(IntelIOMMUState *s,
> vtd_iommu_unlock(s);
>
> QLIST_FOREACH(vtd_as, &s->vtd_as_with_notifiers, next) {
> - if (!vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
> - vtd_as->devfn, &ce) &&
> + if (!vtd_as_to_context_entry(vtd_as, &ce) &&
> domain_id == vtd_get_domain_id(s, &ce, vtd_as->pasid)) {
> uint32_t rid2pasid = VTD_CE_GET_RID2PASID(&ce);
>
> @@ -4146,7 +4156,7 @@ static void vtd_report_ir_illegal_access(VTDAddressSpace *vtd_as,
> assert(vtd_as->pasid != PCI_NO_PASID);
>
> /* Try out best to fetch FPD, we can't do anything more */
> - if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {
> + if (vtd_as_to_context_entry(vtd_as, &ce) == 0) {
> is_fpd_set = ce.lo & VTD_CONTEXT_ENTRY_FPD;
> if (!is_fpd_set && s->root_scalable) {
> vtd_ce_get_pasid_fpd(s, &ce, &is_fpd_set, vtd_as->pasid);
> @@ -4506,7 +4516,7 @@ static void vtd_iommu_replay(IOMMUMemoryRegion *iommu_mr, IOMMUNotifier *n)
> /* replay is protected by BQL, page walk will re-setup it safely */
> iova_tree_remove(vtd_as->iova_tree, map);
>
> - if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {
> + if (vtd_as_to_context_entry(vtd_as, &ce) == 0) {
> trace_vtd_replay_ce_valid(s->root_scalable ? "scalable mode" :
> "legacy mode",
> bus_n, PCI_SLOT(vtd_as->devfn),
* Re: [PATCH v1 03/15] intel_iommu: Check for compatibility with IOMMUFD backed device when x-flts=on
2025-06-06 10:04 ` [PATCH v1 03/15] intel_iommu: Check for compatibility with IOMMUFD backed device when x-flts=on Zhenzhong Duan
@ 2025-06-17 17:49 ` Eric Auger
2025-06-18 2:14 ` Duan, Zhenzhong
0 siblings, 1 reply; 36+ messages in thread
From: Eric Auger @ 2025-06-17 17:49 UTC (permalink / raw)
To: Zhenzhong Duan, qemu-devel
Cc: alex.williamson, clg, mst, jasowang, peterx, ddutile, jgg,
nicolinc, shameerali.kolothum.thodi, joao.m.martins,
clement.mathieu--drif, kevin.tian, yi.l.liu, chao.p.peng,
Paolo Bonzini, Richard Henderson, Eduardo Habkost,
Marcel Apfelbaum
Hi Zhenzhong,
On 6/6/25 12:04 PM, Zhenzhong Duan wrote:
> When vIOMMU is configured x-flts=on in scalable mode, stage-1 page table
> is passed to host to construct nested page table. We need to check
> compatibility of some critical IOMMU capabilities between vIOMMU and
> host IOMMU to ensure guest stage-1 page table could be used by host.
>
> For instance, vIOMMU supports stage-1 1GB huge page mapping, but host
> does not, then this IOMMUFD backed device should be failed.
>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> ---
> hw/i386/intel_iommu_internal.h | 1 +
> hw/i386/intel_iommu.c | 28 ++++++++++++++++++++++++++++
> 2 files changed, 29 insertions(+)
>
> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
> index e8b211e8b0..2cda744786 100644
> --- a/hw/i386/intel_iommu_internal.h
> +++ b/hw/i386/intel_iommu_internal.h
> @@ -191,6 +191,7 @@
> #define VTD_ECAP_PT (1ULL << 6)
> #define VTD_ECAP_SC (1ULL << 7)
> #define VTD_ECAP_MHMV (15ULL << 20)
> +#define VTD_ECAP_NEST (1ULL << 26)
> #define VTD_ECAP_SRS (1ULL << 31)
> #define VTD_ECAP_PASID (1ULL << 40)
> #define VTD_ECAP_SMTS (1ULL << 43)
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index a2f3250724..c42ef83ddc 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -39,6 +39,7 @@
> #include "kvm/kvm_i386.h"
> #include "migration/vmstate.h"
> #include "trace.h"
> +#include "system/iommufd.h"
>
> /* context entry operations */
> #define VTD_CE_GET_RID2PASID(ce) \
> @@ -4361,6 +4362,33 @@ static bool vtd_check_hiod(IntelIOMMUState *s, HostIOMMUDevice *hiod,
> return true;
> }
>
> +#ifdef CONFIG_IOMMUFD
is this #ifdef really needed?
Cheers
Eric
> + struct HostIOMMUDeviceCaps *caps = &hiod->caps;
> + struct iommu_hw_info_vtd *vtd = &caps->vendor_caps.vtd;
> +
> + /* Remaining checks are all stage-1 translation specific */
> + if (!object_dynamic_cast(OBJECT(hiod), TYPE_HOST_IOMMU_DEVICE_IOMMUFD)) {
> + error_setg(errp, "Need IOMMUFD backend when x-flts=on");
> + return false;
> + }
> +
> + if (caps->type != IOMMU_HW_INFO_TYPE_INTEL_VTD) {
> + error_setg(errp, "Incompatible host platform IOMMU type %d",
> + caps->type);
> + return false;
> + }
> +
> + if (!(vtd->ecap_reg & VTD_ECAP_NEST)) {
> + error_setg(errp, "Host IOMMU doesn't support nested translation");
> + return false;
> + }
> +
> + if (s->fs1gp && !(vtd->cap_reg & VTD_CAP_FS1GP)) {
> + error_setg(errp, "Stage-1 1GB huge page is unsupported by host IOMMU");
> + return false;
> + }
> +#endif
> +
> error_setg(errp, "host device is uncompatible with stage-1 translation");
> return false;
> }
* RE: [PATCH v1 02/15] intel_iommu: Optimize context entry cache utilization
2025-06-17 10:57 ` Yi Liu
@ 2025-06-18 1:58 ` Duan, Zhenzhong
0 siblings, 0 replies; 36+ messages in thread
From: Duan, Zhenzhong @ 2025-06-18 1:58 UTC (permalink / raw)
To: Liu, Yi L, qemu-devel@nongnu.org
Cc: alex.williamson@redhat.com, clg@redhat.com, eric.auger@redhat.com,
mst@redhat.com, jasowang@redhat.com, peterx@redhat.com,
ddutile@redhat.com, jgg@nvidia.com, nicolinc@nvidia.com,
shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
clement.mathieu--drif@eviden.com, Tian, Kevin, Peng, Chao P,
Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
Eduardo Habkost
>-----Original Message-----
>From: Liu, Yi L <yi.l.liu@intel.com>
>Subject: Re: [PATCH v1 02/15] intel_iommu: Optimize context entry cache
>utilization
>
>On 2025/6/11 18:06, Duan, Zhenzhong wrote:
>>
>>
>>> -----Original Message-----
>>> From: Liu, Yi L <yi.l.liu@intel.com>
>>> Subject: Re: [PATCH v1 02/15] intel_iommu: Optimize context entry cache
>>> utilization
>>>
>>> On 2025/6/6 18:04, Zhenzhong Duan wrote:
>>>> There are many call sites referencing context entry by calling
>>>> vtd_dev_to_context_entry() which will traverse the DMAR table.
>>>>
>>>> In most cases we can use cached context entry in vtd_as-
>>context_cache_entry
>>>> except when its entry is stale. Currently only global and domain context
>>>> invalidation stale it.
>>>>
>>>> So introduce a helper function vtd_as_to_context_entry() to fetch from
>cache
>>>> before trying with vtd_dev_to_context_entry().
>>>
>>> The cached context entry is now protected by vtd_iommu_lock(), but not
>>> all callers of vtd_dev_to_context_entry() hold this lock.
>>>
>>> Also, the cached context entry is created in the translate path. IMHO,
>>> this path is not supposed to be triggered for passthrough devices,
>>> though this needs double-checking and may change in the future. But let's
>>> see whether there is any locking issue with the current code.
>>
>> Good finding, yes.
>> Previously I thought there was no real race because the translation path
>> updates cc_entry->context_entry before cc_entry->context_cache_gen, and
>> vtd_as_to_context_entry() checks cc_entry->context_cache_gen first.
>> But I still missed a memory barrier like below:
>
>Yeah, testing context_cache_gen is necessary. But without the lock, this
>cannot guarantee cc_entry is still valid after the test.
>
>> @@ -2277,6 +2286,7 @@ static bool
>vtd_do_iommu_translate(VTDAddressSpace *vtd_as, PCIBus *bus,
>> cc_entry->context_cache_gen,
>> s->context_cache_gen);
>> cc_entry->context_entry = ce;
>> + smp_wmb();
>> cc_entry->context_cache_gen = s->context_cache_gen;
>> }
>>
>> Another option I can think of is taking the lock around the cache read, like below:
>
>This is insufficient as well, since cc_entry->context_entry can be modified
>after the lock is released.
>
>> @@ -1659,11 +1659,15 @@ static int
>vtd_as_to_context_entry(VTDAddressSpace *vtd_as, VTDContextEntry *ce)
>> uint8_t devfn = vtd_as->devfn;
>> VTDContextCacheEntry *cc_entry = &vtd_as->context_cache_entry;
>>
>> + vtd_iommu_lock(s);
>> +
>> /* Try to fetch context-entry from cache first */
>> if (cc_entry->context_cache_gen == s->context_cache_gen) {
>> *ce = cc_entry->context_entry;
>> + vtd_iommu_unlock(s);
>> return 0;
>> } else {
>> + vtd_iommu_unlock(s);
>> return vtd_dev_to_context_entry(s, bus_num, devfn, ce);
>> }
>> }
>>
>> Which one do you prefer?
>
>If it's just an optimization, perhaps just drop it. :)
Fine for me, will do.
Thanks
Zhenzhong
* RE: [PATCH v1 02/15] intel_iommu: Optimize context entry cache utilization
2025-06-17 17:24 ` Eric Auger
@ 2025-06-18 2:10 ` Duan, Zhenzhong
2025-06-18 7:08 ` Eric Auger
0 siblings, 1 reply; 36+ messages in thread
From: Duan, Zhenzhong @ 2025-06-18 2:10 UTC (permalink / raw)
To: eric.auger@redhat.com, qemu-devel@nongnu.org
Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
jgg@nvidia.com, nicolinc@nvidia.com,
shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
Peng, Chao P, Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
Eduardo Habkost
Hi Eric,
>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: Re: [PATCH v1 02/15] intel_iommu: Optimize context entry cache
>utilization
>
>Hi Zhenzhong,
>
>On 6/6/25 12:04 PM, Zhenzhong Duan wrote:
>> There are many call sites referencing context entry by calling
>> vtd_dev_to_context_entry() which will traverse the DMAR table.
>>
>> In most cases we can use the cached context entry in vtd_as->context_cache_entry
>> except when that entry is stale. Currently only global and domain context
>> invalidation make it stale.
>>
>> So introduce a helper function vtd_as_to_context_entry() to fetch from cache
>> before trying with vtd_dev_to_context_entry().
>>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>> hw/i386/intel_iommu.c | 36 +++++++++++++++++++++++-------------
>> 1 file changed, 23 insertions(+), 13 deletions(-)
>>
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index f0b1f90eff..a2f3250724 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -1597,6 +1597,22 @@ static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
>> return 0;
>> }
>>
>> +static int vtd_as_to_context_entry(VTDAddressSpace *vtd_as,
>> +                                   VTDContextEntry *ce)
>> +{
>> +    IntelIOMMUState *s = vtd_as->iommu_state;
>> +    uint8_t bus_num = pci_bus_num(vtd_as->bus);
>> +    uint8_t devfn = vtd_as->devfn;
>> +    VTDContextCacheEntry *cc_entry = &vtd_as->context_cache_entry;
>> +
>> +    /* Try to fetch context-entry from cache first */
>> +    if (cc_entry->context_cache_gen == s->context_cache_gen) {
>> +        *ce = cc_entry->context_entry;
>> +        return 0;
>> +    } else {
>> +        return vtd_dev_to_context_entry(s, bus_num, devfn, ce);
>> +    }
>> +}
>> +
>While the patch looks good to me, can't you use the helper also in
>vtd_do_iommu_translate()?
>See " /* Try to fetch context-entry from cache first */"
It can, but the helper finally calls into vtd_dev_to_context_entry(), so we can call vtd_dev_to_context_entry() directly.
I will drop this patch following Yi's suggestion.
Thanks
Zhenzhong
^ permalink raw reply [flat|nested] 36+ messages in thread
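For context on the exchange above: substituting the helper in vtd_do_iommu_translate() would collapse its open-coded cache check into one call, roughly (a hypothetical one-liner, not a change made in the series):

    ret_fr = vtd_as_to_context_entry(vtd_as, &ce);

Since the helper's cache-miss path is exactly vtd_dev_to_context_entry(), code that has already performed the generation test loses nothing by calling vtd_dev_to_context_entry() directly, which is the point made above.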
* RE: [PATCH v1 03/15] intel_iommu: Check for compatibility with IOMMUFD backed device when x-flts=on
2025-06-17 17:49 ` Eric Auger
@ 2025-06-18 2:14 ` Duan, Zhenzhong
2025-06-18 7:08 ` Eric Auger
0 siblings, 1 reply; 36+ messages in thread
From: Duan, Zhenzhong @ 2025-06-18 2:14 UTC (permalink / raw)
To: eric.auger@redhat.com, qemu-devel@nongnu.org
Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
jgg@nvidia.com, nicolinc@nvidia.com,
shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
Peng, Chao P, Paolo Bonzini, Richard Henderson, Eduardo Habkost,
Marcel Apfelbaum
>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: Re: [PATCH v1 03/15] intel_iommu: Check for compatibility with
>IOMMUFD backed device when x-flts=on
>
>Hi Zhenzhong,
>
>On 6/6/25 12:04 PM, Zhenzhong Duan wrote:
>> When vIOMMU is configured with x-flts=on in scalable mode, the stage-1 page
>> table is passed to host to construct a nested page table. We need to check
>> compatibility of some critical IOMMU capabilities between vIOMMU and
>> host IOMMU to ensure the guest stage-1 page table could be used by host.
>>
>> For instance, if vIOMMU supports stage-1 1GB huge page mapping but host
>> does not, then attaching this IOMMUFD backed device should fail.
>>
>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>> hw/i386/intel_iommu_internal.h | 1 +
>> hw/i386/intel_iommu.c | 28 ++++++++++++++++++++++++++++
>> 2 files changed, 29 insertions(+)
>>
>> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
>> index e8b211e8b0..2cda744786 100644
>> --- a/hw/i386/intel_iommu_internal.h
>> +++ b/hw/i386/intel_iommu_internal.h
>> @@ -191,6 +191,7 @@
>> #define VTD_ECAP_PT (1ULL << 6)
>> #define VTD_ECAP_SC (1ULL << 7)
>> #define VTD_ECAP_MHMV (15ULL << 20)
>> +#define VTD_ECAP_NEST (1ULL << 26)
>> #define VTD_ECAP_SRS (1ULL << 31)
>> #define VTD_ECAP_PASID (1ULL << 40)
>> #define VTD_ECAP_SMTS (1ULL << 43)
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index a2f3250724..c42ef83ddc 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -39,6 +39,7 @@
>> #include "kvm/kvm_i386.h"
>> #include "migration/vmstate.h"
>> #include "trace.h"
>> +#include "system/iommufd.h"
>>
>> /* context entry operations */
>> #define VTD_CE_GET_RID2PASID(ce) \
>> @@ -4361,6 +4362,33 @@ static bool vtd_check_hiod(IntelIOMMUState *s, HostIOMMUDevice *hiod,
>> return true;
>> }
>>
>> +#ifdef CONFIG_IOMMUFD
>is it required?
Yes, the Windows build needs it.
iommu_hw_info_vtd and IOMMU_HW_INFO_TYPE_INTEL_VTD are defined in linux/iommufd.h;
meanwhile all the checks below take effect only when IOMMUFD is supported.
Thanks
Zhenzhong
>
>Cheers
>
>Eric
>> +    struct HostIOMMUDeviceCaps *caps = &hiod->caps;
>> +    struct iommu_hw_info_vtd *vtd = &caps->vendor_caps.vtd;
>> +
>> +    /* Remaining checks are all stage-1 translation specific */
>> +    if (!object_dynamic_cast(OBJECT(hiod), TYPE_HOST_IOMMU_DEVICE_IOMMUFD)) {
>> +        error_setg(errp, "Need IOMMUFD backend when x-flts=on");
>> +        return false;
>> +    }
>> +
>> +    if (caps->type != IOMMU_HW_INFO_TYPE_INTEL_VTD) {
>> +        error_setg(errp, "Incompatible host platform IOMMU type %d",
>> +                   caps->type);
>> +        return false;
>> +    }
>> +
>> +    if (!(vtd->ecap_reg & VTD_ECAP_NEST)) {
>> +        error_setg(errp, "Host IOMMU doesn't support nested translation");
>> +        return false;
>> +    }
>> +
>> +    if (s->fs1gp && !(vtd->cap_reg & VTD_CAP_FS1GP)) {
>> +        error_setg(errp, "Stage-1 1GB huge page is unsupported by host IOMMU");
>> +        return false;
>> +    }
>> +#endif
>> +
>>      error_setg(errp, "host device is incompatible with stage-1 translation");
>>      return false;
>>  }
^ permalink raw reply [flat|nested] 36+ messages in thread
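As a usage note for the checks above: the code path is reached with a configuration along the lines below (a sketch; the host BDF and object ids are illustrative, and x-flts is the intel-iommu property this series introduces):

    qemu-system-x86_64 -M q35,accel=kvm \
        -device intel-iommu,x-scalable-mode=on,x-flts=on \
        -object iommufd,id=iommufd0 \
        -device vfio-pci,host=0000:3b:00.0,iommufd=iommufd0

If the host IOMMU lacks VTD_ECAP_NEST, or lacks VTD_CAP_FS1GP while fs1gp is enabled, vtd_check_hiod() fails the device realize with the corresponding error_setg() message.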
* RE: [PATCH v1 06/15] intel_iommu: Handle PASID entry removing and updating
2025-06-17 12:29 ` Yi Liu
@ 2025-06-18 6:03 ` Duan, Zhenzhong
0 siblings, 0 replies; 36+ messages in thread
From: Duan, Zhenzhong @ 2025-06-18 6:03 UTC (permalink / raw)
To: Liu, Yi L, qemu-devel@nongnu.org
Cc: alex.williamson@redhat.com, clg@redhat.com, eric.auger@redhat.com,
mst@redhat.com, jasowang@redhat.com, peterx@redhat.com,
ddutile@redhat.com, jgg@nvidia.com, nicolinc@nvidia.com,
shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
clement.mathieu--drif@eviden.com, Tian, Kevin, Peng, Chao P,
Yi Sun, Paolo Bonzini, Richard Henderson, Eduardo Habkost,
Marcel Apfelbaum
>-----Original Message-----
>From: Liu, Yi L <yi.l.liu@intel.com>
>Subject: Re: [PATCH v1 06/15] intel_iommu: Handle PASID entry removing and
>updating
>
>On 2025/6/6 18:04, Zhenzhong Duan wrote:
>> This adds a new entry VTDPASIDCacheEntry in VTDAddressSpace to cache the
>> pasid entry and track PASID usage for future PASID tagged DMA address
>> translation support in vIOMMU.
>>
>> VTDAddressSpace of PCI_NO_PASID is allocated when device is plugged and
>> never freed. For other pasid, a VTDAddressSpace instance is created/destroyed
>> per the guest pasid entry set up/destroy for passthrough devices. While for
>> emulated devices, a VTDAddressSpace instance is created in the PASID tagged
>> DMA translation and destroyed per guest PASID cache invalidation. This focuses
>> on the PASID cache management for passthrough devices as there are no PASID
>> capable emulated devices yet.
>>
>> When guest modifies a PASID entry, QEMU will capture the guest pasid selective
>> pasid cache invalidation, and allocate or remove a VTDAddressSpace instance per
>> the invalidation reasons:
>>
>> a) a present pasid entry moved to non-present
>> b) a present pasid entry to be a present entry
>> c) a non-present pasid entry moved to present
>>
>> This handles a) and b), a following patch will handle c).
>>
>> vIOMMU emulator could figure out the reason by fetching the latest guest pasid
>> entry and comparing it with the cached PASID entry.
>
>To avoid confusion, maybe better to say cached pasid entry. :)
Sure.
>
>>
>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>> Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>> ---
>> hw/i386/intel_iommu_internal.h | 26 ++++
>> include/hw/i386/intel_iommu.h | 6 +
>> hw/i386/intel_iommu.c | 252 +++++++++++++++++++++++++++++++--
>> hw/i386/trace-events | 3 +
>> 4 files changed, 277 insertions(+), 10 deletions(-)
>>
>> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
>> index 18bc22fc72..82b84db80f 100644
>> --- a/hw/i386/intel_iommu_internal.h
>> +++ b/hw/i386/intel_iommu_internal.h
>> @@ -315,6 +315,7 @@ typedef enum VTDFaultReason {
>> * request while disabled */
>> VTD_FR_IR_SID_ERR = 0x26, /* Invalid Source-ID */
>>
>> + VTD_FR_RTADDR_INV_TTM = 0x31, /* Invalid TTM in RTADDR */
>> /* PASID directory entry access failure */
>> VTD_FR_PASID_DIR_ACCESS_ERR = 0x50,
>> /* The Present(P) field of pasid directory entry is 0 */
>> @@ -492,6 +493,15 @@ typedef union VTDInvDesc VTDInvDesc;
>> #define VTD_INV_DESC_PIOTLB_RSVD_VAL0 0xfff000000000f1c0ULL
>> #define VTD_INV_DESC_PIOTLB_RSVD_VAL1 0xf80ULL
>>
>> +#define VTD_INV_DESC_PASIDC_G (3ULL << 4)
>> +#define VTD_INV_DESC_PASIDC_PASID(val) (((val) >> 32) & 0xfffffULL)
>> +#define VTD_INV_DESC_PASIDC_DID(val)   (((val) >> 16) & VTD_DOMAIN_ID_MASK)
>> +#define VTD_INV_DESC_PASIDC_RSVD_VAL0 0xfff000000000f1c0ULL
>> +
>> +#define VTD_INV_DESC_PASIDC_DSI (0ULL << 4)
>> +#define VTD_INV_DESC_PASIDC_PASID_SI (1ULL << 4)
>> +#define VTD_INV_DESC_PASIDC_GLOBAL (3ULL << 4)
>> +
>> /* Information about page-selective IOTLB invalidate */
>> struct VTDIOTLBPageInvInfo {
>> uint16_t domain_id;
>> @@ -552,6 +562,21 @@ typedef struct VTDRootEntry VTDRootEntry;
>> #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL0(aw) (0x1e0ULL | ~VTD_HAW_MASK(aw))
>> #define VTD_SM_CONTEXT_ENTRY_RSVD_VAL1 0xffffffffffe00000ULL
>>
>> +typedef enum VTDPCInvType {
>> +    /* pasid cache invalidation relies on guest PASID entry */
>> +    VTD_PASID_CACHE_GLOBAL_INV, /* pasid cache global invalidation */
>> +    VTD_PASID_CACHE_DOMSI,      /* pasid cache domain selective invalidation */
>> +    VTD_PASID_CACHE_PASIDSI,    /* pasid cache pasid selective invalidation */
>> +} VTDPCInvType;
>> +
>> +typedef struct VTDPASIDCacheInfo {
>> +    VTDPCInvType type;
>> +    uint16_t domain_id;
>> +    uint32_t pasid;
>> +    PCIBus *bus;
>> +    uint16_t devfn;
>> +} VTDPASIDCacheInfo;
>> +
>> /* PASID Table Related Definitions */
>> #define VTD_PASID_DIR_BASE_ADDR_MASK (~0xfffULL)
>> #define VTD_PASID_TABLE_BASE_ADDR_MASK (~0xfffULL)
>> @@ -563,6 +588,7 @@ typedef struct VTDRootEntry VTDRootEntry;
>> #define VTD_PASID_TABLE_BITS_MASK (0x3fULL)
>> #define VTD_PASID_TABLE_INDEX(pasid) ((pasid) & VTD_PASID_TABLE_BITS_MASK)
>> #define VTD_PASID_ENTRY_FPD (1ULL << 1) /* Fault Processing Disable */
>> +#define VTD_PASID_TBL_ENTRY_NUM (1ULL << 6)
>>
>> /* PASID Granular Translation Type Mask */
>> #define VTD_PASID_ENTRY_P 1ULL
>> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
>> index 50f9b27a45..fbc9da903a 100644
>> --- a/include/hw/i386/intel_iommu.h
>> +++ b/include/hw/i386/intel_iommu.h
>> @@ -95,6 +95,11 @@ struct VTDPASIDEntry {
>> uint64_t val[8];
>> };
>>
>> +typedef struct VTDPASIDCacheEntry {
>> +    struct VTDPASIDEntry pasid_entry;
>> +    bool cache_filled;
>> +} VTDPASIDCacheEntry;
>> +
>> struct VTDAddressSpace {
>> PCIBus *bus;
>> uint8_t devfn;
>> @@ -107,6 +112,7 @@ struct VTDAddressSpace {
>> MemoryRegion iommu_ir_fault; /* Interrupt region for catching fault */
>> IntelIOMMUState *iommu_state;
>> VTDContextCacheEntry context_cache_entry;
>> + VTDPASIDCacheEntry pasid_cache_entry;
>> QLIST_ENTRY(VTDAddressSpace) next;
>> /* Superset of notifier flags that this address space has */
>> IOMMUNotifierFlag notifier_flags;
>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>> index 112e09e305..c7162647e6 100644
>> --- a/hw/i386/intel_iommu.c
>> +++ b/hw/i386/intel_iommu.c
>> @@ -825,6 +825,11 @@ static inline bool vtd_pe_type_check(IntelIOMMUState *s, VTDPASIDEntry *pe)
>> }
>> }
>>
>> +static inline uint16_t vtd_pe_get_did(VTDPASIDEntry *pe)
>> +{
>> +    return VTD_SM_PASID_ENTRY_DID((pe)->val[1]);
>> +}
>> +
>> static inline bool vtd_pdire_present(VTDPASIDDirEntry *pdire)
>> {
>> return pdire->val & 1;
>> @@ -3104,6 +3109,236 @@ static bool vtd_process_piotlb_desc(IntelIOMMUState *s,
>> return true;
>> }
>>
>> +static inline int vtd_dev_get_pe_from_pasid(VTDAddressSpace *vtd_as,
>> + uint32_t pasid, VTDPASIDEntry *pe)
>> +{
>> + IntelIOMMUState *s = vtd_as->iommu_state;
>> + VTDContextEntry ce;
>> + int ret;
>> +
>> + if (!s->root_scalable) {
>> + return -VTD_FR_RTADDR_INV_TTM;
>> + }
>> +
>> + ret = vtd_as_to_context_entry(vtd_as, &ce);
>> + if (ret) {
>> + return ret;
>> + }
>> +
>> + return vtd_ce_get_pasid_entry(s, &ce, pe, pasid);
>> +}
>> +
>> +static bool vtd_pasid_entry_compare(VTDPASIDEntry *p1, VTDPASIDEntry *p2)
>> +{
>> + return !memcmp(p1, p2, sizeof(*p1));
>> +}
>> +
>> +/*
>> + * This function fills in the pasid entry in &vtd_as. Caller
>> + * of this function should hold iommu_lock.
>> + */
>> +static void vtd_fill_pe_in_cache(IntelIOMMUState *s, VTDAddressSpace *vtd_as,
>> + VTDPASIDEntry *pe)
>> +{
>> + VTDPASIDCacheEntry *pc_entry = &vtd_as->pasid_cache_entry;
>> +
>> + if (vtd_pasid_entry_compare(pe, &pc_entry->pasid_entry)) {
>> + /* No need to go further as cached pasid entry is latest */
>> + return;
>> + }
>> +
>> + pc_entry->pasid_entry = *pe;
>> + pc_entry->cache_filled = true;
>> + /*
>> + * TODO: send pasid bind to host for passthru devices
>> + */
>> +}
>> +
>> +/*
>> + * This function is used to clear cached pasid entry in vtd_as
>> + * instances. Caller of this function should hold iommu_lock.
>
>it also covers pasid entry update.
Will do like:
* This function is used to update or clear cached pasid entry in vtd_as
>
>> + */
>> +static gboolean vtd_flush_pasid(gpointer key, gpointer value,
>> + gpointer user_data)
>> +{
>> + VTDPASIDCacheInfo *pc_info = user_data;
>> + VTDAddressSpace *vtd_as = value;
>> + IntelIOMMUState *s = vtd_as->iommu_state;
>> + VTDPASIDCacheEntry *pc_entry = &vtd_as->pasid_cache_entry;
>> + VTDPASIDEntry pe;
>> + uint16_t did;
>> + uint32_t pasid;
>> + int ret;
>> +
>> + /* Replay only filled pasid entry cache for passthrough device */
>
>the comment of this helper already implies we only continue if
>pc_entry->cache_filled is true. Also, replay is a concept in the
>upper level helpers, no need to mention it here. I noticed replay
>in low level helpers in other patches as well, please drop them too. :)
Will do.
>
>> + if (!pc_entry->cache_filled) {
>> + return false;
>> + }
>> + did = vtd_pe_get_did(&pc_entry->pasid_entry);
>> +
>> + if (vtd_as_to_iommu_pasid_locked(vtd_as, &pasid)) {
>> + goto remove;
>> + }
>> +
>> + switch (pc_info->type) {
>> + case VTD_PASID_CACHE_PASIDSI:
>> + if (pc_info->pasid != pasid) {
>> + return false;
>> + }
>> + /* Fall through */
>> + case VTD_PASID_CACHE_DOMSI:
>> + if (pc_info->domain_id != did) {
>> + return false;
>> + }
>> + /* Fall through */
>> + case VTD_PASID_CACHE_GLOBAL_INV:
>> + break;
>> + default:
>> + error_report("invalid pc_info->type");
>> + abort();
>> + }
>> +
>> + /*
>> + * pasid cache invalidation may indicate a present-to-present
>> + * pasid entry modification. To cover such a case, vIOMMU emulator
>> + * needs to fetch the latest guest pasid entry and compare it with
>> + * the cached pasid entry, then update the pasid cache and send
>> + * pasid bind/unbind to host properly.
>> + */
>> + ret = vtd_dev_get_pe_from_pasid(vtd_as, pasid, &pe);
>> + if (ret) {
>> + /*
>> + * No valid pasid entry in guest memory. e.g. pasid entry
>> + * was modified to be either all-zero or non-present. Either
>> + * case means existing pasid cache should be removed.
>> + */
>> + goto remove;
>> + }
>> +
>> + vtd_fill_pe_in_cache(s, vtd_as, &pe);
>> + return false;
>> +
>> +remove:
>> + /*
>> + * TODO: send pasid unbind to host for passthru devices
>> + */
>> + pc_entry->cache_filled = false;
>> +
>> + /*
>> + * Don't remove address space of PCI_NO_PASID which is created by PCI
>> + * sub-system.
>> + */
>
>I get why it cannot be removed. But I think the ce and pe field of this
>vtd_as might need to be updated given it is supposed to be removed if it
>is not PCI_NO_PASID.
With patch2 dropped, we always call vtd_dev_to_context_entry() to get ce;
vtd_as->ce only takes effect for emulated devices.
About pe, when we jump to the remove label, that means pe.present == false.
We record that through pc_entry->cache_filled = false, so there looks to be
no need to update the non-present pe field?
>
>> + if (vtd_as->pasid == PCI_NO_PASID) {
>> + return false;
>> + }
>> + return true;
>> +}
>> +
>> +/*
>> + * This function syncs the pasid bindings between guest and host.
>> + * It includes updating the pasid cache in vIOMMU and updating the
>> + * pasid bindings per guest's latest pasid entry presence.
>> + */
>> +static void vtd_pasid_cache_sync(IntelIOMMUState *s,
>> + VTDPASIDCacheInfo *pc_info)
>> +{
>> + if (!s->flts || !s->root_scalable || !s->dmar_enabled) {
>> + return;
>> + }
>> +
>> + /*
>> + * Regards to a pasid cache invalidation, e.g. a PSI.
>> + * it could be either cases of below:
>> + * a) a present pasid entry moved to non-present
>> + * b) a present pasid entry to be a present entry
>> + * c) a non-present pasid entry moved to present
>> + *
>> + * Different invalidation granularity may affect different device
>> + * scope and pasid scope. But for each invalidation granularity,
>> + * it needs to do two steps to sync host and guest pasid binding.
>> + *
>> + * Here is the handling of a PSI:
>> + * 1) loop all the existing vtd_as instances to update them
>> + * according to the latest guest pasid entry in pasid table.
>> + * this will make sure affected existing vtd_as instances
>> + * cached the latest pasid entries. Also, during the loop, the
>> + * host should be notified if needed. e.g. pasid unbind or pasid
>> + * update. Should be able to cover case a) and case b).
>> + *
>> + * 2) loop all devices to cover case c)
>> + *    - For devices which are backed by HostIOMMUDeviceIOMMUFD instances,
>> + * we loop them and check if guest pasid entry exists. If yes,
>> + * it is case c), we update the pasid cache and also notify
>> + * host.
>> + *    - For devices which are not backed by HostIOMMUDeviceIOMMUFD,
>> + * it is not necessary to create pasid cache at this phase since
>> + * it could be created when vIOMMU does DMA address translation.
>> + * This is not yet implemented since there is no emulated
>> + * pasid-capable devices today. If we have such devices in
>> + * future, the pasid cache shall be created there.
>> + * Other granularities follow the same steps, just with different scope
>> + *
>> + */
>> +
>> + vtd_iommu_lock(s);
>> + /*
>> + * Step 1: loop all the existing vtd_as instances for pasid unbind and
>> + * update.
>> + */
>> + g_hash_table_foreach_remove(s->vtd_address_spaces, vtd_flush_pasid,
>> + pc_info);
>> + vtd_iommu_unlock(s);
>
>just realized s->vtd_address_spaces is not protected by iommu_lock. If so,
>will it be ok to drop the lock here and add it in the lower level helpers
>if lock is needed?
Could you elaborate on which lower level helper should take iommu_lock?
We want to protect updates to the vtd_address_spaces hash table with iommu_lock here.
Thanks
Zhenzhong
^ permalink raw reply [flat|nested] 36+ messages in thread
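To make the dispatch above concrete, the descriptor handling that would feed vtd_pasid_cache_sync() can be sketched with the VTD_INV_DESC_PASIDC_* macros from this patch (illustrative only; the actual handler is introduced later in the series and must also fault on reserved bits):

    static bool vtd_process_pasid_desc(IntelIOMMUState *s, VTDInvDesc *inv_desc)
    {
        VTDPASIDCacheInfo pc_info = {};

        switch (inv_desc->val[0] & VTD_INV_DESC_PASIDC_G) {
        case VTD_INV_DESC_PASIDC_DSI:
            /* domain selective: flush entries tagged with this DID */
            pc_info.type = VTD_PASID_CACHE_DOMSI;
            pc_info.domain_id = VTD_INV_DESC_PASIDC_DID(inv_desc->val[0]);
            break;
        case VTD_INV_DESC_PASIDC_PASID_SI:
            /* pasid selective: flush one (DID, PASID) pair */
            pc_info.type = VTD_PASID_CACHE_PASIDSI;
            pc_info.domain_id = VTD_INV_DESC_PASIDC_DID(inv_desc->val[0]);
            pc_info.pasid = VTD_INV_DESC_PASIDC_PASID(inv_desc->val[0]);
            break;
        case VTD_INV_DESC_PASIDC_GLOBAL:
            pc_info.type = VTD_PASID_CACHE_GLOBAL_INV;
            break;
        default:
            return false;   /* reserved granularity */
        }

        vtd_pasid_cache_sync(s, &pc_info);
        return true;
    }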
* Re: [PATCH v1 03/15] intel_iommu: Check for compatibility with IOMMUFD backed device when x-flts=on
2025-06-18 2:14 ` Duan, Zhenzhong
@ 2025-06-18 7:08 ` Eric Auger
0 siblings, 0 replies; 36+ messages in thread
From: Eric Auger @ 2025-06-18 7:08 UTC (permalink / raw)
To: Duan, Zhenzhong, qemu-devel@nongnu.org
Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
jgg@nvidia.com, nicolinc@nvidia.com,
shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
Peng, Chao P, Paolo Bonzini, Richard Henderson, Eduardo Habkost,
Marcel Apfelbaum
On 6/18/25 4:14 AM, Duan, Zhenzhong wrote:
>
>> -----Original Message-----
>> From: Eric Auger <eric.auger@redhat.com>
>> Subject: Re: [PATCH v1 03/15] intel_iommu: Check for compatibility with
>> IOMMUFD backed device when x-flts=on
>>
>> Hi Zhenzhong,
>>
>> On 6/6/25 12:04 PM, Zhenzhong Duan wrote:
>>> When vIOMMU is configured with x-flts=on in scalable mode, the stage-1 page
>>> table is passed to host to construct a nested page table. We need to check
>>> compatibility of some critical IOMMU capabilities between vIOMMU and
>>> host IOMMU to ensure the guest stage-1 page table could be used by host.
>>>
>>> For instance, if vIOMMU supports stage-1 1GB huge page mapping but host
>>> does not, then attaching this IOMMUFD backed device should fail.
>>>
>>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>> ---
>>> hw/i386/intel_iommu_internal.h | 1 +
>>> hw/i386/intel_iommu.c | 28 ++++++++++++++++++++++++++++
>>> 2 files changed, 29 insertions(+)
>>>
>>> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
>>> index e8b211e8b0..2cda744786 100644
>>> --- a/hw/i386/intel_iommu_internal.h
>>> +++ b/hw/i386/intel_iommu_internal.h
>>> @@ -191,6 +191,7 @@
>>> #define VTD_ECAP_PT (1ULL << 6)
>>> #define VTD_ECAP_SC (1ULL << 7)
>>> #define VTD_ECAP_MHMV (15ULL << 20)
>>> +#define VTD_ECAP_NEST (1ULL << 26)
>>> #define VTD_ECAP_SRS (1ULL << 31)
>>> #define VTD_ECAP_PASID (1ULL << 40)
>>> #define VTD_ECAP_SMTS (1ULL << 43)
>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>>> index a2f3250724..c42ef83ddc 100644
>>> --- a/hw/i386/intel_iommu.c
>>> +++ b/hw/i386/intel_iommu.c
>>> @@ -39,6 +39,7 @@
>>> #include "kvm/kvm_i386.h"
>>> #include "migration/vmstate.h"
>>> #include "trace.h"
>>> +#include "system/iommufd.h"
>>>
>>> /* context entry operations */
>>> #define VTD_CE_GET_RID2PASID(ce) \
>>> @@ -4361,6 +4362,33 @@ static bool vtd_check_hiod(IntelIOMMUState *s, HostIOMMUDevice *hiod,
>>> return true;
>>> }
>>>
>>> +#ifdef CONFIG_IOMMUFD
>> is it required?
> Yes, the Windows build needs it.
> iommu_hw_info_vtd and IOMMU_HW_INFO_TYPE_INTEL_VTD are defined in linux/iommufd.h,
which is a Linux dependency
> meanwhile all the checks below take effect only when IOMMUFD is supported.
OK. I tried to remove "imply IOMMUFD" in hw/i386/Kconfig and I did not
get any error, but then I did not build for Windows.
Eric
>
> Thanks
> Zhenzhong
>
>> Cheers
>>
>> Eric
>>> +    struct HostIOMMUDeviceCaps *caps = &hiod->caps;
>>> +    struct iommu_hw_info_vtd *vtd = &caps->vendor_caps.vtd;
>>> +
>>> +    /* Remaining checks are all stage-1 translation specific */
>>> +    if (!object_dynamic_cast(OBJECT(hiod), TYPE_HOST_IOMMU_DEVICE_IOMMUFD)) {
>>> +        error_setg(errp, "Need IOMMUFD backend when x-flts=on");
>>> +        return false;
>>> +    }
>>> +
>>> +    if (caps->type != IOMMU_HW_INFO_TYPE_INTEL_VTD) {
>>> +        error_setg(errp, "Incompatible host platform IOMMU type %d",
>>> +                   caps->type);
>>> +        return false;
>>> +    }
>>> +
>>> +    if (!(vtd->ecap_reg & VTD_ECAP_NEST)) {
>>> +        error_setg(errp, "Host IOMMU doesn't support nested translation");
>>> +        return false;
>>> +    }
>>> +
>>> +    if (s->fs1gp && !(vtd->cap_reg & VTD_CAP_FS1GP)) {
>>> +        error_setg(errp, "Stage-1 1GB huge page is unsupported by host IOMMU");
>>> +        return false;
>>> +    }
>>> +#endif
>>> +
>>>      error_setg(errp, "host device is incompatible with stage-1 translation");
>>>      return false;
>>>  }
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v1 02/15] intel_iommu: Optimize context entry cache utilization
2025-06-18 2:10 ` Duan, Zhenzhong
@ 2025-06-18 7:08 ` Eric Auger
0 siblings, 0 replies; 36+ messages in thread
From: Eric Auger @ 2025-06-18 7:08 UTC (permalink / raw)
To: Duan, Zhenzhong, qemu-devel@nongnu.org
Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
jgg@nvidia.com, nicolinc@nvidia.com,
shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
clement.mathieu--drif@eviden.com, Tian, Kevin, Liu, Yi L,
Peng, Chao P, Marcel Apfelbaum, Paolo Bonzini, Richard Henderson,
Eduardo Habkost
On 6/18/25 4:10 AM, Duan, Zhenzhong wrote:
> Hi Eric,
>
>> -----Original Message-----
>> From: Eric Auger <eric.auger@redhat.com>
>> Subject: Re: [PATCH v1 02/15] intel_iommu: Optimize context entry cache
>> utilization
>>
>> Hi Zhenzhong,
>>
>> On 6/6/25 12:04 PM, Zhenzhong Duan wrote:
>>> There are many call sites referencing context entry by calling
>>> vtd_dev_to_context_entry() which will traverse the DMAR table.
>>>
>>> In most cases we can use cached context entry in vtd_as->context_cache_entry
>>> except when its entry is stale. Currently only global and domain context
>>> invalidation stale it.
>>>
>>> So introduce a helper function vtd_as_to_context_entry() to fetch from cache
>>> before trying with vtd_dev_to_context_entry().
>>>
>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>> ---
>>> hw/i386/intel_iommu.c | 36 +++++++++++++++++++++++-------------
>>> 1 file changed, 23 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>>> index f0b1f90eff..a2f3250724 100644
>>> --- a/hw/i386/intel_iommu.c
>>> +++ b/hw/i386/intel_iommu.c
>>> @@ -1597,6 +1597,22 @@ static int
>> vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
>>> return 0;
>>> }
>>>
>>> +static int vtd_as_to_context_entry(VTDAddressSpace *vtd_as,
>> VTDContextEntry *ce)
>>> +{
>>> + IntelIOMMUState *s = vtd_as->iommu_state;
>>> + uint8_t bus_num = pci_bus_num(vtd_as->bus);
>>> + uint8_t devfn = vtd_as->devfn;
>>> + VTDContextCacheEntry *cc_entry = &vtd_as->context_cache_entry;
>>> +
>>> + /* Try to fetch context-entry from cache first */
>>> + if (cc_entry->context_cache_gen == s->context_cache_gen) {
>>> + *ce = cc_entry->context_entry;
>>> + return 0;
>>> + } else {
>>> + return vtd_dev_to_context_entry(s, bus_num, devfn, ce);
>>> + }
>>> +}
>>> +
>> While the patch looks good to me can't you use the helper also in
>> vtd_do_iommu_translate()?
>> See " /* Try to fetch context-entry from cache first */"
> It can, but it finally calls into vtd_dev_to_context_entry() so we can call vtd_dev_to_context_entry() directly.
> I will drop this patch following Yi's suggestion.
OK
Cheers
Eric
>
> Thanks
> Zhenzhong
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH v1 04/15] intel_iommu: Introduce a new structure VTDHostIOMMUDevice
2025-06-13 9:08 ` Duan, Zhenzhong
@ 2025-06-20 7:08 ` Eric Auger
0 siblings, 0 replies; 36+ messages in thread
From: Eric Auger @ 2025-06-20 7:08 UTC (permalink / raw)
To: Duan, Zhenzhong, CLEMENT MATHIEU--DRIF, qemu-devel@nongnu.org
Cc: alex.williamson@redhat.com, clg@redhat.com, mst@redhat.com,
jasowang@redhat.com, peterx@redhat.com, ddutile@redhat.com,
jgg@nvidia.com, nicolinc@nvidia.com,
shameerali.kolothum.thodi@huawei.com, joao.m.martins@oracle.com,
Tian, Kevin, Liu, Yi L, Peng, Chao P, Paolo Bonzini,
Richard Henderson, Eduardo Habkost, Marcel Apfelbaum
On 6/13/25 11:08 AM, Duan, Zhenzhong wrote:
> Hi Clement,
>
>> -----Original Message-----
>> From: CLEMENT MATHIEU--DRIF <clement.mathieu--drif@eviden.com>
>> Subject: Re: [PATCH v1 04/15] intel_iommu: Introduce a new structure
>> VTDHostIOMMUDevice
>>
>> Hi,
>>
>> On 06/06/2025 12:04 pm, Zhenzhong Duan wrote:
>>>
>>> Introduce a new structure VTDHostIOMMUDevice which replaces
>>> HostIOMMUDevice to be stored in hash table.
>>>
>>> It includes a reference to HostIOMMUDevice and IntelIOMMUState,
>>> also includes BDF information which will be used in future
>>> patches.
>>>
>>> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
>>> ---
>>> hw/i386/intel_iommu_internal.h | 7 +++++++
>>> include/hw/i386/intel_iommu.h | 2 +-
>>> hw/i386/intel_iommu.c | 14 ++++++++++++--
>>> 3 files changed, 20 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/hw/i386/intel_iommu_internal.h b/hw/i386/intel_iommu_internal.h
>>> index 2cda744786..18bc22fc72 100644
>>> --- a/hw/i386/intel_iommu_internal.h
>>> +++ b/hw/i386/intel_iommu_internal.h
>>> @@ -28,6 +28,7 @@
>>> #ifndef HW_I386_INTEL_IOMMU_INTERNAL_H
>>> #define HW_I386_INTEL_IOMMU_INTERNAL_H
>>> #include "hw/i386/intel_iommu.h"
>>> +#include "system/host_iommu_device.h"
>>>
>>> /*
>>> * Intel IOMMU register specification
>>> @@ -608,4 +609,10 @@ typedef struct VTDRootEntry VTDRootEntry;
>>> /* Bits to decide the offset for each level */
>>> #define VTD_LEVEL_BITS 9
>>>
>>> +typedef struct VTDHostIOMMUDevice {
>>> +    IntelIOMMUState *iommu_state;
>>> +    PCIBus *bus;
>>> +    uint8_t devfn;
>>> +    HostIOMMUDevice *hiod;
>>> +} VTDHostIOMMUDevice;
>>> #endif
>>> diff --git a/include/hw/i386/intel_iommu.h b/include/hw/i386/intel_iommu.h
>>> index e95477e855..50f9b27a45 100644
>>> --- a/include/hw/i386/intel_iommu.h
>>> +++ b/include/hw/i386/intel_iommu.h
>>> @@ -295,7 +295,7 @@ struct IntelIOMMUState {
>>> /* list of registered notifiers */
>>> QLIST_HEAD(, VTDAddressSpace) vtd_as_with_notifiers;
>>>
>>> - GHashTable *vtd_host_iommu_dev; /* HostIOMMUDevice */
>>> + GHashTable *vtd_host_iommu_dev; /* VTDHostIOMMUDevice */
>>>
>>> /* interrupt remapping */
>>> bool intr_enabled; /* Whether guest enabled IR */
>>> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
>>> index c42ef83ddc..796b71605c 100644
>>> --- a/hw/i386/intel_iommu.c
>>> +++ b/hw/i386/intel_iommu.c
>>> @@ -281,7 +281,10 @@ static gboolean vtd_hiod_equal(gconstpointer v1, gconstpointer v2)
>>> static void vtd_hiod_destroy(gpointer v)
>>> {
>>> -    object_unref(v);
>>> +    VTDHostIOMMUDevice *vtd_hiod = v;
>>> +
>>> +    object_unref(vtd_hiod->hiod);
>>> +    g_free(vtd_hiod);
>>> }
>>>
>>> static gboolean vtd_hash_remove_by_domain(gpointer key, gpointer value,
>>> @@ -4397,6 +4400,7 @@ static bool vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
>>> HostIOMMUDevice *hiod, Error **errp)
>>> {
>>> IntelIOMMUState *s = opaque;
>>> + VTDHostIOMMUDevice *vtd_hiod;
>>> struct vtd_as_key key = {
>>> .bus = bus,
>>> .devfn = devfn,
>>> @@ -4413,6 +4417,12 @@ static bool vtd_dev_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
>>> return false;
>>> }
>>>
>>> +    vtd_hiod = g_malloc0(sizeof(VTDHostIOMMUDevice));
>>> +    vtd_hiod->bus = bus;
>>> +    vtd_hiod->devfn = (uint8_t)devfn;
>>> +    vtd_hiod->iommu_state = s;
>>> +    vtd_hiod->hiod = hiod;
>>> +
>>> if (!vtd_check_hiod(s, hiod, errp)) {
>> Shouldn't we free vtd_hiod here?
> Good catch, thanks.
> I put the g_free() in patch 10, but it should be in this patch, will fix.
With this fixed,
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Eric
>
> BRs,
> Zhenzhong
>
^ permalink raw reply [flat|nested] 36+ messages in thread
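The fix agreed on above amounts to releasing the allocation on the failure path, roughly (a sketch of the discussed change in vtd_dev_set_iommu_device()):

    vtd_hiod = g_malloc0(sizeof(VTDHostIOMMUDevice));
    vtd_hiod->bus = bus;
    vtd_hiod->devfn = (uint8_t)devfn;
    vtd_hiod->iommu_state = s;
    vtd_hiod->hiod = hiod;

    if (!vtd_check_hiod(s, hiod, errp)) {
        g_free(vtd_hiod);   /* don't leak when the device is incompatible */
        return false;
    }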
end of thread, other threads: [~2025-06-20 7:09 UTC | newest]
Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
2025-06-06 10:04 [PATCH v1 00/15] intel_iommu: Enable stage-1 translation for passthrough device Zhenzhong Duan
2025-06-06 10:04 ` [PATCH v1 01/15] intel_iommu: Rename vtd_ce_get_rid2pasid_entry to vtd_ce_get_pasid_entry Zhenzhong Duan
2025-06-11 7:20 ` Yi Liu
2025-06-17 17:16 ` Eric Auger
2025-06-06 10:04 ` [PATCH v1 02/15] intel_iommu: Optimize context entry cache utilization Zhenzhong Duan
2025-06-11 7:48 ` Yi Liu
2025-06-11 10:06 ` Duan, Zhenzhong
2025-06-17 10:57 ` Yi Liu
2025-06-18 1:58 ` Duan, Zhenzhong
2025-06-17 17:24 ` Eric Auger
2025-06-18 2:10 ` Duan, Zhenzhong
2025-06-18 7:08 ` Eric Auger
2025-06-06 10:04 ` [PATCH v1 03/15] intel_iommu: Check for compatibility with IOMMUFD backed device when x-flts=on Zhenzhong Duan
2025-06-17 17:49 ` Eric Auger
2025-06-18 2:14 ` Duan, Zhenzhong
2025-06-18 7:08 ` Eric Auger
2025-06-06 10:04 ` [PATCH v1 04/15] intel_iommu: Introduce a new structure VTDHostIOMMUDevice Zhenzhong Duan
2025-06-12 16:04 ` CLEMENT MATHIEU--DRIF
2025-06-13 9:08 ` Duan, Zhenzhong
2025-06-20 7:08 ` Eric Auger
2025-06-06 10:04 ` [PATCH v1 05/15] intel_iommu: Introduce two helpers vtd_as_from/to_iommu_pasid_locked Zhenzhong Duan
2025-06-11 9:54 ` Yi Liu
2025-06-11 10:46 ` Duan, Zhenzhong
2025-06-17 10:58 ` Yi Liu
2025-06-06 10:04 ` [PATCH v1 06/15] intel_iommu: Handle PASID entry removing and updating Zhenzhong Duan
2025-06-17 12:29 ` Yi Liu
2025-06-18 6:03 ` Duan, Zhenzhong
2025-06-06 10:04 ` [PATCH v1 07/15] intel_iommu: Handle PASID entry adding Zhenzhong Duan
2025-06-06 10:04 ` [PATCH v1 08/15] intel_iommu: Introduce a new pasid cache invalidation type FORCE_RESET Zhenzhong Duan
2025-06-06 10:04 ` [PATCH v1 09/15] intel_iommu: Bind/unbind guest page table to host Zhenzhong Duan
2025-06-06 10:04 ` [PATCH v1 10/15] intel_iommu: ERRATA_772415 workaround Zhenzhong Duan
2025-06-06 10:04 ` [PATCH v1 11/15] intel_iommu: Replay pasid binds after context cache invalidation Zhenzhong Duan
2025-06-06 10:04 ` [PATCH v1 12/15] intel_iommu: Propagate PASID-based iotlb invalidation to host Zhenzhong Duan
2025-06-06 10:04 ` [PATCH v1 13/15] intel_iommu: Refresh pasid bind when either SRTP or TE bit is changed Zhenzhong Duan
2025-06-06 10:04 ` [PATCH v1 14/15] intel_iommu: Bypass replay in stage-1 page table mode Zhenzhong Duan
2025-06-06 10:04 ` [PATCH v1 15/15] intel_iommu: Enable host device when x-flts=on in scalable mode Zhenzhong Duan