* [PATCH v3 0/5] KVM: arm64: nv: Implement nested stage-2 reverse map
@ 2026-05-10 14:53 Wei-Lin Chang
2026-05-10 14:53 ` [PATCH v3 1/5] KVM: arm64: Use a variable for the canonical GPA in kvm_s2_fault_map() Wei-Lin Chang
` (4 more replies)
0 siblings, 5 replies; 6+ messages in thread
From: Wei-Lin Chang @ 2026-05-10 14:53 UTC (permalink / raw)
To: linux-arm-kernel, kvmarm, linux-kernel
Cc: Marc Zyngier, Oliver Upton, Joey Gouly, Suzuki K Poulose,
Zenghui Yu, Catalin Marinas, Will Deacon, Wei-Lin Chang
Hi,
This is v3 of the series optimizing shadow stage-2 MMU unmapping during
MMU notifier calls.
Two new preparatory patches are added: one reduces a hole in kvm_s2_mmu,
and another slightly refactors the stage-2 fault handling code. Other
changes are listed below.
* Changes from v2 [1]:
- Removed "polluted" terminology.
- Use xa_{mk, to}_value() when storing and retrieving values from maple
trees.
- Avoid using the 63rd bit in maple tree values so that xa_{mk, to}_value()
does not lose us a bit (a short illustration follows this list).
- Added reverse map removal during TLBI handling.
- Other suggested refactorings.
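As a quick illustration of the 63rd-bit point above (a sketch of the
existing xarray value-entry encoding, not new code): xa_mk_value() tags a
value entry by shifting the value left by one and setting bit 0, and
xa_to_value() undoes that, so only bits 0-62 of a stored value survive the
round trip on a 64-bit kernel, and bit 62 is the highest bit we can safely
use:

	/* Sketch only: the stored value must fit in bits 0-62. */
	u64 v = BIT(62) | 0x1234000;
	void *entry = xa_mk_value(v);		/* internally (v << 1) | 1 */
	WARN_ON(xa_to_value(entry) != v);	/* bit 62 round-trips fine */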
Thanks!
[1]: https://lore.kernel.org/kvmarm/20260411125024.3735989-1-weilin.chang@arm.com/
Wei-Lin Chang (5):
KVM: arm64: Use a variable for the canonical GPA in kvm_s2_fault_map()
KVM: arm64: Move shadow_pt_debugfs_dentry to reduce holes in
kvm_s2_mmu
KVM: arm64: nv: Avoid full shadow s2 unmap
KVM: arm64: nv: Remove reverse map entries during TLBI handling
KVM: arm64: nv: Create nested IPA direct map to speed up reverse map
removal
arch/arm64/include/asm/kvm_host.h | 17 +-
arch/arm64/include/asm/kvm_nested.h | 6 +
arch/arm64/kvm/mmu.c | 43 +++--
arch/arm64/kvm/nested.c | 238 +++++++++++++++++++++++++++-
arch/arm64/kvm/sys_regs.c | 3 +
5 files changed, 290 insertions(+), 17 deletions(-)
--
2.43.0
* [PATCH v3 1/5] KVM: arm64: Use a variable for the canonical GPA in kvm_s2_fault_map()
2026-05-10 14:53 [PATCH v3 0/5] KVM: arm64: nv: Implement nested stage-2 reverse map Wei-Lin Chang
@ 2026-05-10 14:53 ` Wei-Lin Chang
2026-05-10 14:53 ` [PATCH v3 2/5] KVM: arm64: Move shadow_pt_debugfs_dentry to reduce holes in kvm_s2_mmu Wei-Lin Chang
` (3 subsequent siblings)
4 siblings, 0 replies; 6+ messages in thread
From: Wei-Lin Chang @ 2026-05-10 14:53 UTC (permalink / raw)
To: linux-arm-kernel, kvmarm, linux-kernel
Cc: Marc Zyngier, Oliver Upton, Joey Gouly, Suzuki K Poulose,
Zenghui Yu, Catalin Marinas, Will Deacon, Wei-Lin Chang
Introduce a variable to hold the canonical GPA instead of computing it
at the point of use. This will be useful later, when the canonical GPA
is needed for the nested reverse map.
Signed-off-by: Wei-Lin Chang <weilin.chang@arm.com>
---
arch/arm64/kvm/mmu.c | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index d089c107d9b7..e4becd5cdf36 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1981,6 +1981,7 @@ static int kvm_s2_fault_map(const struct kvm_s2_fault_desc *s2fd,
long mapping_size;
kvm_pfn_t pfn;
gfn_t gfn;
+ phys_addr_t canonical_gpa;
int ret;
kvm_fault_lock(kvm);
@@ -1994,6 +1995,7 @@ static int kvm_s2_fault_map(const struct kvm_s2_fault_desc *s2fd,
mapping_size = s2vi->vma_pagesize;
pfn = s2vi->pfn;
gfn = s2vi->gfn;
+ canonical_gpa = gfn_to_gpa(get_canonical_gfn(s2fd, s2vi));
/*
* If we are not forced to use page mapping, check if we are
@@ -2012,6 +2014,7 @@ static int kvm_s2_fault_map(const struct kvm_s2_fault_desc *s2fd,
goto out_unlock;
}
}
+ canonical_gpa = ALIGN_DOWN(canonical_gpa, mapping_size);
}
if (!perm_fault_granule && !s2vi->map_non_cacheable && kvm_has_mte(kvm))
@@ -2045,11 +2048,9 @@ static int kvm_s2_fault_map(const struct kvm_s2_fault_desc *s2fd,
* making sure we adjust the canonical IPA if the mapping size has
* been updated (via a THP upgrade, for example).
*/
- if (writable && !ret) {
- phys_addr_t ipa = gfn_to_gpa(get_canonical_gfn(s2fd, s2vi));
- ipa &= ~(mapping_size - 1);
- mark_page_dirty_in_slot(kvm, s2fd->memslot, gpa_to_gfn(ipa));
- }
+ if (writable && !ret)
+ mark_page_dirty_in_slot(kvm, s2fd->memslot,
+ gpa_to_gfn(canonical_gpa));
if (ret != -EAGAIN)
return ret;
--
2.43.0
* [PATCH v3 2/5] KVM: arm64: Move shadow_pt_debugfs_dentry to reduce holes in kvm_s2_mmu
2026-05-10 14:53 [PATCH v3 0/5] KVM: arm64: nv: Implement nested stage-2 reverse map Wei-Lin Chang
2026-05-10 14:53 ` [PATCH v3 1/5] KVM: arm64: Use a variable for the canonical GPA in kvm_s2_fault_map() Wei-Lin Chang
@ 2026-05-10 14:53 ` Wei-Lin Chang
2026-05-10 14:53 ` [PATCH v3 3/5] KVM: arm64: nv: Avoid full shadow s2 unmap Wei-Lin Chang
` (2 subsequent siblings)
4 siblings, 0 replies; 6+ messages in thread
From: Wei-Lin Chang @ 2026-05-10 14:53 UTC (permalink / raw)
To: linux-arm-kernel, kvmarm, linux-kernel
Cc: Marc Zyngier, Oliver Upton, Joey Gouly, Suzuki K Poulose,
Zenghui Yu, Catalin Marinas, Will Deacon, Wei-Lin Chang
The dentry pointer shadow_pt_debugfs_dentry was placed between two
booleans in kvm_s2_mmu, which created unnecessary holes in the struct.
Move it so that the two booleans sit next to each other.
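For illustration only (field names shortened, other members omitted, and
sizes assuming a 64-bit kernel with 1-byte bool and natural alignment),
this is the kind of layout problem being fixed:

	struct dentry;

	/* Before: the pointer forces padding around both booleans. */
	struct layout_before {
		bool nested_stage2_enabled;	/* 1 byte + 7 bytes padding */
		struct dentry *dbgfs;		/* 8 bytes */
		bool pending_unmap;		/* 1 byte + 7 bytes tail padding */
	};					/* 24 bytes */

	/* After: the two booleans share a single padding region. */
	struct layout_after {
		bool nested_stage2_enabled;	/* 1 byte */
		bool pending_unmap;		/* 1 byte + 6 bytes padding */
		struct dentry *dbgfs;		/* 8 bytes */
	};					/* 16 bytes */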
Signed-off-by: Wei-Lin Chang <weilin.chang@arm.com>
---
arch/arm64/include/asm/kvm_host.h | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 851f6171751c..1a56d137df10 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -217,16 +217,16 @@ struct kvm_s2_mmu {
*/
bool nested_stage2_enabled;
-#ifdef CONFIG_PTDUMP_STAGE2_DEBUGFS
- struct dentry *shadow_pt_debugfs_dentry;
-#endif
-
/*
* true when this MMU needs to be unmapped before being used for a new
* purpose.
*/
bool pending_unmap;
+#ifdef CONFIG_PTDUMP_STAGE2_DEBUGFS
+ struct dentry *shadow_pt_debugfs_dentry;
+#endif
+
/*
* 0: Nobody is currently using this, check vttbr for validity
* >0: Somebody is actively using this.
--
2.43.0
* [PATCH v3 3/5] KVM: arm64: nv: Avoid full shadow s2 unmap
2026-05-10 14:53 [PATCH v3 0/5] KVM: arm64: nv: Implement nested stage-2 reverse map Wei-Lin Chang
2026-05-10 14:53 ` [PATCH v3 1/5] KVM: arm64: Use a variable for the canonical GPA in kvm_s2_fault_map() Wei-Lin Chang
2026-05-10 14:53 ` [PATCH v3 2/5] KVM: arm64: Move shadow_pt_debugfs_dentry to reduce holes in kvm_s2_mmu Wei-Lin Chang
@ 2026-05-10 14:53 ` Wei-Lin Chang
2026-05-10 14:53 ` [PATCH v3 4/5] KVM: arm64: nv: Remove reverse map entries during TLBI handling Wei-Lin Chang
2026-05-10 14:53 ` [PATCH v3 5/5] KVM: arm64: nv: Create nested IPA direct map to speed up reverse map removal Wei-Lin Chang
4 siblings, 0 replies; 6+ messages in thread
From: Wei-Lin Chang @ 2026-05-10 14:53 UTC (permalink / raw)
To: linux-arm-kernel, kvmarm, linux-kernel
Cc: Marc Zyngier, Oliver Upton, Joey Gouly, Suzuki K Poulose,
Zenghui Yu, Catalin Marinas, Will Deacon, Wei-Lin Chang
Currently we are forced to fully unmap all shadow stage-2 MMUs of a VM
when unmapping a page from the canonical stage-2, for example during an
MMU notifier call. This is because we do not track which canonical IPAs
are mapped in the shadow stage-2 page tables, so there is no way to know
what to unmap.
Create a per-kvm_s2_mmu maple tree to track canonical IPA range ->
nested IPA range mappings, so that the shadow stage-2 can be partially
unmapped when a canonical IPA range is unmapped. The algorithm is simple
and conservative:
At each shadow stage-2 map, insert the nested IPA range into the maple
tree, keyed by the canonical IPA range. If the canonical IPA range
doesn't overlap with any existing range in the tree, insert it as is,
and a reverse mapping for this range is established. But if the
canonical IPA range overlaps with existing ranges in the tree, create a
new range that spans all the overlapping ranges plus the input range,
and replace those existing ranges with it. At the same time, mark this
new spanning canonical IPA range with an "UNKNOWN_IPA" bit, indicating
that we give up tracking the nested IPA ranges that map to this
canonical IPA range.
A maple tree 64-bit value entry is enough to store the nested IPA and
the UNKNOWN_IPA status, so apart from the maple tree's internal
operations, no memory allocation is needed.
Example:
|||| means existing range, ---- means empty range
input: $$$$$$$$$$$$$$$$$$$$$$$$$$
tree: --||||-----|||||||---------||||||||||-----------
insert spanning range and replace overlapping ones:
--||||-----||||||||||||||||||||||||||-----------
^^^^marked UNKNOWN_IPA^^^^
With the reverse map in place, when a canonical IPA range gets
unmapped, walk each s2 mmu's maple tree looking for affected canonical
IPA ranges, and based on their UNKNOWN_IPA status:
UNKNOWN_IPA -> fall back to fully unmapping the current shadow
stage-2, and also clear the tree
not UNKNOWN_IPA -> unmap the nested IPA range, and remove the reverse
map entry
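A rough sketch of the resulting value encoding (the helper names below are
made up for illustration; the patch open-codes this in
kvm_record_nested_revmap() and the unmap path, using the VALID_ENTRY,
ADDR_MASK and UNKNOWN_IPA definitions it introduces):

	/* Illustrative helpers only, not part of the patch. */
	static void *revmap_pack_known(gpa_t nested_ipa)
	{
		/* nested IPA bits 55-12 plus a valid bit at bit 62 */
		return xa_mk_value(VALID_ENTRY | (nested_ipa & ADDR_MASK));
	}

	static void *revmap_pack_unknown(void)
	{
		/* nested IPA no longer tracked for this canonical range */
		return xa_mk_value(UNKNOWN_IPA);
	}

	static gpa_t revmap_unpack(void *entry, bool *unknown)
	{
		u64 v = xa_to_value(entry);

		*unknown = v & UNKNOWN_IPA;
		return v & ADDR_MASK;
	}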
Suggested-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Wei-Lin Chang <weilin.chang@arm.com>
---
arch/arm64/include/asm/kvm_host.h | 4 +
arch/arm64/include/asm/kvm_nested.h | 4 +
arch/arm64/kvm/mmu.c | 27 ++++--
arch/arm64/kvm/nested.c | 140 +++++++++++++++++++++++++++-
4 files changed, 167 insertions(+), 8 deletions(-)
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 1a56d137df10..dc4c0bce1bbb 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -223,6 +223,10 @@ struct kvm_s2_mmu {
*/
bool pending_unmap;
+ bool nested_revmap_broken;
+ /* canonical IPA to nested IPA range lookup */
+ struct maple_tree nested_revmap_mt;
+
#ifdef CONFIG_PTDUMP_STAGE2_DEBUGFS
struct dentry *shadow_pt_debugfs_dentry;
#endif
diff --git a/arch/arm64/include/asm/kvm_nested.h b/arch/arm64/include/asm/kvm_nested.h
index 091544e6af44..5cbf78dfc685 100644
--- a/arch/arm64/include/asm/kvm_nested.h
+++ b/arch/arm64/include/asm/kvm_nested.h
@@ -76,6 +76,8 @@ extern void kvm_s2_mmu_iterate_by_vmid(struct kvm *kvm, u16 vmid,
const union tlbi_info *info,
void (*)(struct kvm_s2_mmu *,
const union tlbi_info *));
+extern void kvm_record_nested_revmap(gpa_t gpa, struct kvm_s2_mmu *mmu,
+ gpa_t fault_ipa, size_t map_size);
extern void kvm_vcpu_load_hw_mmu(struct kvm_vcpu *vcpu);
extern void kvm_vcpu_put_hw_mmu(struct kvm_vcpu *vcpu);
@@ -164,6 +166,8 @@ extern int kvm_s2_handle_perm_fault(struct kvm_vcpu *vcpu,
struct kvm_s2_trans *trans);
extern int kvm_inject_s2_fault(struct kvm_vcpu *vcpu, u64 esr_el2);
extern void kvm_nested_s2_wp(struct kvm *kvm);
+extern void kvm_unmap_gfn_range_nested(struct kvm *kvm, gpa_t gpa, size_t size,
+ bool may_block);
extern void kvm_nested_s2_unmap(struct kvm *kvm, bool may_block);
extern void kvm_nested_s2_flush(struct kvm *kvm);
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index e4becd5cdf36..ce0bd88cd3c1 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -5,6 +5,7 @@
*/
#include <linux/acpi.h>
+#include <linux/maple_tree.h>
#include <linux/mman.h>
#include <linux/kvm_host.h>
#include <linux/io.h>
@@ -1099,6 +1100,7 @@ void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
{
struct kvm *kvm = kvm_s2_mmu_to_kvm(mmu);
struct kvm_pgtable *pgt = NULL;
+ struct maple_tree *revmap_mt = &mmu->nested_revmap_mt;
write_lock(&kvm->mmu_lock);
pgt = mmu->pgt;
@@ -1108,8 +1110,11 @@ void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
free_percpu(mmu->last_vcpu_ran);
}
- if (kvm_is_nested_s2_mmu(kvm, mmu))
+ if (kvm_is_nested_s2_mmu(kvm, mmu)) {
+ if (!mtree_empty(revmap_mt))
+ mtree_destroy(revmap_mt);
kvm_init_nested_s2_mmu(mmu);
+ }
write_unlock(&kvm->mmu_lock);
@@ -1631,6 +1636,10 @@ static int gmem_abort(const struct kvm_s2_fault_desc *s2fd)
goto out_unlock;
}
+ if (s2fd->nested)
+ kvm_record_nested_revmap(gfn << PAGE_SHIFT, pgt->mmu,
+ s2fd->fault_ipa, PAGE_SIZE);
+
ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, s2fd->fault_ipa, PAGE_SIZE,
__pfn_to_phys(pfn), prot,
memcache, flags);
@@ -2034,6 +2043,10 @@ static int kvm_s2_fault_map(const struct kvm_s2_fault_desc *s2fd,
ret = KVM_PGT_FN(kvm_pgtable_stage2_relax_perms)(pgt, gfn_to_gpa(gfn),
prot, flags);
} else {
+ if (s2fd->nested)
+ kvm_record_nested_revmap(canonical_gpa, pgt->mmu,
+ gfn_to_gpa(gfn), mapping_size);
+
ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, gfn_to_gpa(gfn), mapping_size,
__pfn_to_phys(pfn), prot,
memcache, flags);
@@ -2389,14 +2402,16 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
{
+ gpa_t gpa = range->start << PAGE_SHIFT;
+ size_t size = (range->end - range->start) << PAGE_SHIFT;
+ bool may_block = range->may_block;
+
if (!kvm->arch.mmu.pgt || kvm_vm_is_protected(kvm))
return false;
- __unmap_stage2_range(&kvm->arch.mmu, range->start << PAGE_SHIFT,
- (range->end - range->start) << PAGE_SHIFT,
- range->may_block);
+ __unmap_stage2_range(&kvm->arch.mmu, gpa, size, may_block);
+ kvm_unmap_gfn_range_nested(kvm, gpa, size, may_block);
- kvm_nested_s2_unmap(kvm, range->may_block);
return false;
}
@@ -2674,7 +2689,7 @@ void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
write_lock(&kvm->mmu_lock);
kvm_stage2_unmap_range(&kvm->arch.mmu, gpa, size, true);
- kvm_nested_s2_unmap(kvm, true);
+ kvm_unmap_gfn_range_nested(kvm, gpa, size, true);
write_unlock(&kvm->mmu_lock);
}
diff --git a/arch/arm64/kvm/nested.c b/arch/arm64/kvm/nested.c
index 883b6c1008fb..35b5d5f21a23 100644
--- a/arch/arm64/kvm/nested.c
+++ b/arch/arm64/kvm/nested.c
@@ -7,6 +7,7 @@
#include <linux/bitfield.h>
#include <linux/kvm.h>
#include <linux/kvm_host.h>
+#include <linux/maple_tree.h>
#include <asm/fixmap.h>
#include <asm/kvm_arm.h>
@@ -43,6 +44,20 @@ struct vncr_tlb {
*/
#define S2_MMU_PER_VCPU 2
+/*
+ * Per shadow S2 reverse map (IPA -> nested IPA range) maple tree payload
+ * layout:
+ *
+ * bit 62: valid, prevents the case where the nested IPA is 0 and turning
+ * the whole value to 0
+ * bits 55-12: nested IPA bits 55-12
+ * bit 0: UNKNOWN_IPA bit, 1 indicates we give up on tracking what nested
+ * IPA maps to this canonical IPA in the shadow stage-2
+ */
+#define VALID_ENTRY BIT(62)
+#define ADDR_MASK GENMASK_ULL(55, 12)
+#define UNKNOWN_IPA BIT(0)
+
void kvm_init_nested(struct kvm *kvm)
{
kvm->arch.nested_mmus = NULL;
@@ -769,12 +784,57 @@ static struct kvm_s2_mmu *get_s2_mmu_nested(struct kvm_vcpu *vcpu)
return s2_mmu;
}
+void kvm_record_nested_revmap(gpa_t ipa, struct kvm_s2_mmu *mmu,
+ gpa_t fault_ipa, size_t map_size)
+{
+ struct maple_tree *revmap_mt = &mmu->nested_revmap_mt;
+ gpa_t ipa_end = ipa + map_size - 1;
+ u64 entry, new_entry = 0;
+ MA_STATE(mas_rev, revmap_mt, ipa, ipa_end);
+
+ if (mmu->nested_revmap_broken)
+ return;
+
+ mtree_lock(revmap_mt);
+ entry = xa_to_value(mas_find_range(&mas_rev, ipa_end));
+
+ if (entry) {
+ /* maybe just a perm update... */
+ if (!(entry & UNKNOWN_IPA) && mas_rev.index == ipa &&
+ mas_rev.last == ipa_end &&
+ fault_ipa == (entry & ADDR_MASK))
+ goto unlock;
+ /*
+ * Create a "UNKNOWN_IPA" range that spans all the overlapping
+ * ranges and store it.
+ */
+ while (entry && mas_rev.index <= ipa_end) {
+ ipa = min(mas_rev.index, ipa);
+ ipa_end = max(mas_rev.last, ipa_end);
+ entry = xa_to_value(mas_find_range(&mas_rev, ipa_end));
+ }
+ new_entry |= UNKNOWN_IPA;
+ } else {
+ new_entry |= fault_ipa;
+ new_entry |= VALID_ENTRY;
+ }
+
+ mas_set_range(&mas_rev, ipa, ipa_end);
+ if (mas_store_gfp(&mas_rev, xa_mk_value(new_entry),
+ GFP_NOWAIT | __GFP_ACCOUNT))
+ mmu->nested_revmap_broken = true;
+unlock:
+ mtree_unlock(revmap_mt);
+}
+
void kvm_init_nested_s2_mmu(struct kvm_s2_mmu *mmu)
{
/* CnP being set denotes an invalid entry */
mmu->tlb_vttbr = VTTBR_CNP_BIT;
mmu->nested_stage2_enabled = false;
atomic_set(&mmu->refcnt, 0);
+ mt_init(&mmu->nested_revmap_mt);
+ mmu->nested_revmap_broken = false;
}
void kvm_vcpu_load_hw_mmu(struct kvm_vcpu *vcpu)
@@ -1150,6 +1210,82 @@ void kvm_nested_s2_wp(struct kvm *kvm)
kvm_invalidate_vncr_ipa(kvm, 0, BIT(kvm->arch.mmu.pgt->ia_bits));
}
+static void reset_revmap_and_unmap(struct kvm_s2_mmu *mmu, bool may_block)
+{
+ mtree_destroy(&mmu->nested_revmap_mt);
+ mmu->nested_revmap_broken = false;
+ kvm_stage2_unmap_range(mmu, 0, kvm_phys_size(mmu), may_block);
+}
+
+static void unmap_mmu_ipa_range(struct kvm_s2_mmu *mmu, gpa_t gpa,
+ size_t unmap_size, bool may_block)
+{
+ struct maple_tree *revmap_mt = &mmu->nested_revmap_mt;
+ gpa_t ipa = gpa;
+ gpa_t ipa_end = gpa + unmap_size - 1;
+ u64 entry;
+ size_t entry_size;
+ MA_STATE(mas_rev, revmap_mt, gpa, ipa_end);
+
+ if (mmu->nested_revmap_broken) {
+ reset_revmap_and_unmap(mmu, may_block);
+ return;
+ }
+
+ mtree_lock(revmap_mt);
+ entry = xa_to_value(mas_find_range(&mas_rev, ipa_end));
+
+ while (entry && mas_rev.index <= ipa_end) {
+ ipa = mas_rev.last + 1;
+ entry_size = mas_rev.last - mas_rev.index + 1;
+ /*
+ * Give up and invalidate this s2 mmu if the unmap range
+ * touches any UNKNOWN_IPA range.
+ */
+ if (entry & UNKNOWN_IPA) {
+ mtree_unlock(revmap_mt);
+ reset_revmap_and_unmap(mmu, may_block);
+ return;
+ }
+
+ /*
+ * Ignore result, it is okay if a reverse mapping erase
+ * fails.
+ */
+ mas_store_gfp(&mas_rev, NULL, GFP_NOWAIT | __GFP_ACCOUNT);
+
+ mtree_unlock(revmap_mt);
+ kvm_stage2_unmap_range(mmu, entry & ADDR_MASK, entry_size,
+ may_block);
+ mtree_lock(revmap_mt);
+ /*
+ * Other maple tree operations during preemption could render
+ * this ma_state invalid, so reset it.
+ */
+ mas_set_range(&mas_rev, ipa, ipa_end);
+ entry = xa_to_value(mas_find_range(&mas_rev, ipa_end));
+ }
+ mtree_unlock(revmap_mt);
+}
+
+void kvm_unmap_gfn_range_nested(struct kvm *kvm, gpa_t gpa, size_t size,
+ bool may_block)
+{
+ int i;
+
+ if (!kvm->arch.nested_mmus_size)
+ return;
+
+ for (i = 0; i < kvm->arch.nested_mmus_size; i++) {
+ struct kvm_s2_mmu *mmu = &kvm->arch.nested_mmus[i];
+
+ if (kvm_s2_mmu_valid(mmu))
+ unmap_mmu_ipa_range(mmu, gpa, size, may_block);
+ }
+
+ kvm_invalidate_vncr_ipa(kvm, gpa, gpa + size);
+}
+
void kvm_nested_s2_unmap(struct kvm *kvm, bool may_block)
{
int i;
@@ -1163,7 +1299,7 @@ void kvm_nested_s2_unmap(struct kvm *kvm, bool may_block)
struct kvm_s2_mmu *mmu = &kvm->arch.nested_mmus[i];
if (kvm_s2_mmu_valid(mmu))
- kvm_stage2_unmap_range(mmu, 0, kvm_phys_size(mmu), may_block);
+ reset_revmap_and_unmap(mmu, may_block);
}
kvm_invalidate_vncr_ipa(kvm, 0, BIT(kvm->arch.mmu.pgt->ia_bits));
@@ -1848,7 +1984,7 @@ void check_nested_vcpu_requests(struct kvm_vcpu *vcpu)
write_lock(&vcpu->kvm->mmu_lock);
if (mmu->pending_unmap) {
- kvm_stage2_unmap_range(mmu, 0, kvm_phys_size(mmu), true);
+ reset_revmap_and_unmap(mmu, true);
mmu->pending_unmap = false;
}
write_unlock(&vcpu->kvm->mmu_lock);
--
2.43.0
* [PATCH v3 4/5] KVM: arm64: nv: Remove reverse map entries during TLBI handling
2026-05-10 14:53 [PATCH v3 0/5] KVM: arm64: nv: Implement nested stage-2 reverse map Wei-Lin Chang
` (2 preceding siblings ...)
2026-05-10 14:53 ` [PATCH v3 3/5] KVM: arm64: nv: Avoid full shadow s2 unmap Wei-Lin Chang
@ 2026-05-10 14:53 ` Wei-Lin Chang
2026-05-10 14:53 ` [PATCH v3 5/5] KVM: arm64: nv: Create nested IPA direct map to speed up reverse map removal Wei-Lin Chang
4 siblings, 0 replies; 6+ messages in thread
From: Wei-Lin Chang @ 2026-05-10 14:53 UTC (permalink / raw)
To: linux-arm-kernel, kvmarm, linux-kernel
Cc: Marc Zyngier, Oliver Upton, Joey Gouly, Suzuki K Poulose,
Zenghui Yu, Catalin Marinas, Will Deacon, Wei-Lin Chang
When a guest hypervisor issues a TLBI for a specific IPA range, KVM
unmaps that range from all the affected shadow stage-2s. This gives us
the opportunity to remove the corresponding reverse map entries, which
lowers the probability of creating UNKNOWN_IPA reverse map ranges at
subsequent stage-2 faults.
However, the TLBI ranges are specified in nested IPA space, so in order
to locate the affected ranges in the reverse map maple tree, which maps
canonical IPA to nested IPA, we can only iterate through the entire tree
and check each entry.
Suggested-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Wei-Lin Chang <weilin.chang@arm.com>
---
arch/arm64/include/asm/kvm_nested.h | 2 ++
arch/arm64/kvm/nested.c | 38 +++++++++++++++++++++++++++++
arch/arm64/kvm/sys_regs.c | 3 +++
3 files changed, 43 insertions(+)
diff --git a/arch/arm64/include/asm/kvm_nested.h b/arch/arm64/include/asm/kvm_nested.h
index 5cbf78dfc685..b11925826b25 100644
--- a/arch/arm64/include/asm/kvm_nested.h
+++ b/arch/arm64/include/asm/kvm_nested.h
@@ -76,6 +76,8 @@ extern void kvm_s2_mmu_iterate_by_vmid(struct kvm *kvm, u16 vmid,
const union tlbi_info *info,
void (*)(struct kvm_s2_mmu *,
const union tlbi_info *));
+extern void kvm_remove_nested_revmap(struct kvm_s2_mmu *mmu, u64 nested_ipa,
+ size_t size);
extern void kvm_record_nested_revmap(gpa_t gpa, struct kvm_s2_mmu *mmu,
gpa_t fault_ipa, size_t map_size);
extern void kvm_vcpu_load_hw_mmu(struct kvm_vcpu *vcpu);
diff --git a/arch/arm64/kvm/nested.c b/arch/arm64/kvm/nested.c
index 35b5d5f21a23..96b88d9c0c2a 100644
--- a/arch/arm64/kvm/nested.c
+++ b/arch/arm64/kvm/nested.c
@@ -784,6 +784,44 @@ static struct kvm_s2_mmu *get_s2_mmu_nested(struct kvm_vcpu *vcpu)
return s2_mmu;
}
+void kvm_remove_nested_revmap(struct kvm_s2_mmu *mmu, u64 nested_ipa, size_t size)
+{
+ /*
+ * Iterate through the mt of this mmu, remove all canonical ipa ranges
+ * with !UNKNOWN_IPA that maps to ranges that are strictly within
+ * [addr, addr + size).
+ */
+ struct maple_tree *revmap_mt = &mmu->nested_revmap_mt;
+ void *entry;
+ u64 entry_val, nested_ipa_end = nested_ipa + size;
+ u64 this_nested_ipa, this_nested_ipa_end;
+ size_t revmap_size;
+
+ MA_STATE(mas_rev, revmap_mt, 0, ULONG_MAX);
+
+ mtree_lock(revmap_mt);
+ mas_for_each(&mas_rev, entry, ULONG_MAX) {
+ entry_val = xa_to_value(entry);
+ if (entry_val & UNKNOWN_IPA)
+ continue;
+
+ revmap_size = mas_rev.last - mas_rev.index + 1;
+ this_nested_ipa = entry_val & ADDR_MASK;
+ this_nested_ipa_end = this_nested_ipa + revmap_size;
+
+ if (this_nested_ipa >= nested_ipa &&
+ this_nested_ipa_end <= nested_ipa_end) {
+ /*
+ * As the shadow stage-2 is about to be unmapped
+ * after this function, it doesn't matter whether the
+ * removal of the reverse map failed or not.
+ */
+ mas_store_gfp(&mas_rev, NULL, GFP_NOWAIT | __GFP_ACCOUNT);
+ }
+ }
+ mtree_unlock(revmap_mt);
+}
+
void kvm_record_nested_revmap(gpa_t ipa, struct kvm_s2_mmu *mmu,
gpa_t fault_ipa, size_t map_size)
{
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 6a96cb7ba9a3..a97304680cee 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -4006,6 +4006,7 @@ union tlbi_info {
static void s2_mmu_unmap_range(struct kvm_s2_mmu *mmu,
const union tlbi_info *info)
{
+ kvm_remove_nested_revmap(mmu, info->range.start, info->range.size);
/*
* The unmap operation is allowed to drop the MMU lock and block, which
* means that @mmu could be used for a different context than the one
@@ -4104,6 +4105,8 @@ static void s2_mmu_unmap_ipa(struct kvm_s2_mmu *mmu,
max_size = compute_tlb_inval_range(mmu, info->ipa.addr);
base_addr &= ~(max_size - 1);
+ kvm_remove_nested_revmap(mmu, base_addr, max_size);
+
/*
* See comment in s2_mmu_unmap_range() for why this is allowed to
* reschedule.
--
2.43.0
* [PATCH v3 5/5] KVM: arm64: nv: Create nested IPA direct map to speed up reverse map removal
2026-05-10 14:53 [PATCH v3 0/5] KVM: arm64: nv: Implement nested stage-2 reverse map Wei-Lin Chang
` (3 preceding siblings ...)
2026-05-10 14:53 ` [PATCH v3 4/5] KVM: arm64: nv: Remove reverse map entries during TLBI handling Wei-Lin Chang
@ 2026-05-10 14:53 ` Wei-Lin Chang
4 siblings, 0 replies; 6+ messages in thread
From: Wei-Lin Chang @ 2026-05-10 14:53 UTC (permalink / raw)
To: linux-arm-kernel, kvmarm, linux-kernel
Cc: Marc Zyngier, Oliver Upton, Joey Gouly, Suzuki K Poulose,
Zenghui Yu, Catalin Marinas, Will Deacon, Wei-Lin Chang
Iterating through the whole reverse map to find which entries to remove
when handling guest hypervisor TLBIs is inefficient. Create a direct map
from nested IPA to canonical IPA so that the canonical IPA ranges
affected by a TLBI can be determined quickly, and then remove the
corresponding entries in the reverse map.
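As a rough sketch of the intended lookup pattern (not the exact patch
code, which folds this into kvm_remove_nested_revmap() together with
additional consistency checks), the TLBI handler now only needs to walk
the direct-map entries overlapping the invalidated nested IPA range:

	/* Sketch only: caller is assumed to hold the maple tree lock. */
	MA_STATE(mas_dir, &mmu->nested_direct_mt, nested_ipa, nested_ipa_end);
	void *entry;

	mas_for_each(&mas_dir, entry, nested_ipa_end) {
		gpa_t canonical_ipa = xa_to_value(entry) & ADDR_MASK;

		/* ... locate and drop the matching reverse map entry ... */
	}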
Suggested-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Wei-Lin Chang <weilin.chang@arm.com>
---
arch/arm64/include/asm/kvm_host.h | 5 ++
arch/arm64/kvm/mmu.c | 9 ++-
arch/arm64/kvm/nested.c | 124 ++++++++++++++++++++++--------
3 files changed, 104 insertions(+), 34 deletions(-)
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index dc4c0bce1bbb..f9e95a023ec4 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -226,6 +226,11 @@ struct kvm_s2_mmu {
bool nested_revmap_broken;
/* canonical IPA to nested IPA range lookup */
struct maple_tree nested_revmap_mt;
+ /*
+ * Nested IPA to canonical IPA range lookup, essentially a cache of
+ * the guest's stage-2.
+ */
+ struct maple_tree nested_direct_mt;
#ifdef CONFIG_PTDUMP_STAGE2_DEBUGFS
struct dentry *shadow_pt_debugfs_dentry;
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index ce0bd88cd3c1..77146431be6d 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1101,6 +1101,7 @@ void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
struct kvm *kvm = kvm_s2_mmu_to_kvm(mmu);
struct kvm_pgtable *pgt = NULL;
struct maple_tree *revmap_mt = &mmu->nested_revmap_mt;
+ struct maple_tree *direct_mt = &mmu->nested_direct_mt;
write_lock(&kvm->mmu_lock);
pgt = mmu->pgt;
@@ -1111,8 +1112,12 @@ void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
}
if (kvm_is_nested_s2_mmu(kvm, mmu)) {
- if (!mtree_empty(revmap_mt))
- mtree_destroy(revmap_mt);
+ if (!mtree_empty(revmap_mt) || !mtree_empty(direct_mt)) {
+ mtree_lock(revmap_mt);
+ __mt_destroy(revmap_mt);
+ __mt_destroy(direct_mt);
+ mtree_unlock(revmap_mt);
+ }
kvm_init_nested_s2_mmu(mmu);
}
diff --git a/arch/arm64/kvm/nested.c b/arch/arm64/kvm/nested.c
index 96b88d9c0c2a..fcb6a88047e1 100644
--- a/arch/arm64/kvm/nested.c
+++ b/arch/arm64/kvm/nested.c
@@ -45,14 +45,14 @@ struct vncr_tlb {
#define S2_MMU_PER_VCPU 2
/*
- * Per shadow S2 reverse map (IPA -> nested IPA range) maple tree payload
- * layout:
+ * Per shadow S2 reverse & direct map maple tree payload layout:
*
- * bit 62: valid, prevents the case where the nested IPA is 0 and turning
+ * bit 62: valid, prevents the case where the address is 0 and turning
* the whole value to 0
- * bits 55-12: nested IPA bits 55-12
+ * bits 55-12: {nested, canonical} IPA bits 55-12
* bit 0: UNKNOWN_IPA bit, 1 indicates we give up on tracking what nested
- * IPA maps to this canonical IPA in the shadow stage-2
+ * IPA maps to this canonical IPA in the shadow stage-2, only used
+ * in reverse map
*/
#define VALID_ENTRY BIT(62)
#define ADDR_MASK GENMASK_ULL(55, 12)
@@ -787,37 +787,67 @@ static struct kvm_s2_mmu *get_s2_mmu_nested(struct kvm_vcpu *vcpu)
void kvm_remove_nested_revmap(struct kvm_s2_mmu *mmu, u64 nested_ipa, size_t size)
{
/*
- * Iterate through the mt of this mmu, remove all canonical ipa ranges
- * with !UNKNOWN_IPA that maps to ranges that are strictly within
- * [addr, addr + size).
+ * For all ranges in direct_mt that are completely covered by the range
+ * we are TLBIing [gpa, gpa + size), remove the reverse map and its
+ * corresponding direct map together, when these conditions are met:
+ *
+ * 1. The reverse map is not UNKNOWN_IPA.
+ * 2. The reverse map is completely covered by the TLBI range.
+ * 3. The reverse map and the direct map are symmetric i.e. they map to
+ * each other, with the same size.
+ *
+ * Symmetry must be checked because there are three places where the
+ * direct map could become inconsistent:
+ *
+ * 1. Direct map removal failure during an mmu notifier in
+ * unmap_mmu_ipa_range().
+ * 2. Direct map insertion failure during an s2 fault in
+ * kvm_record_nested_revmap().
+ * 3. Direct map removal failure during a previous call of this very
+ * function.
*/
struct maple_tree *revmap_mt = &mmu->nested_revmap_mt;
- void *entry;
- u64 entry_val, nested_ipa_end = nested_ipa + size;
- u64 this_nested_ipa, this_nested_ipa_end;
- size_t revmap_size;
-
- MA_STATE(mas_rev, revmap_mt, 0, ULONG_MAX);
-
+ struct maple_tree *direct_mt = &mmu->nested_direct_mt;
+ gpa_t nested_ipa_end = nested_ipa + size - 1;
+ u64 entry_dir;
+ struct mapping {
+ u64 from;
+ u64 to;
+ size_t size;
+ };
+
+ MA_STATE(mas_dir, direct_mt, nested_ipa, nested_ipa_end);
mtree_lock(revmap_mt);
- mas_for_each(&mas_rev, entry, ULONG_MAX) {
- entry_val = xa_to_value(entry);
- if (entry_val & UNKNOWN_IPA)
- continue;
-
- revmap_size = mas_rev.last - mas_rev.index + 1;
- this_nested_ipa = entry_val & ADDR_MASK;
- this_nested_ipa_end = this_nested_ipa + revmap_size;
-
- if (this_nested_ipa >= nested_ipa &&
- this_nested_ipa_end <= nested_ipa_end) {
- /*
- * As the shadow stage-2 is about to be unmapped
- * after this function, it doesn't matter whether the
- * removal of the reverse map failed or not.
- */
+ entry_dir = xa_to_value(mas_find_range(&mas_dir, nested_ipa_end));
+
+ while (entry_dir && mas_dir.index <= nested_ipa_end) {
+ struct mapping dir, rev;
+ u64 entry_rev;
+
+ dir.from = mas_dir.index;
+ dir.to = entry_dir & ADDR_MASK;
+ dir.size = mas_dir.last - mas_dir.index + 1;
+
+ /* Use ipa range to find the corresponding entry in revmap. */
+ MA_STATE(mas_rev, revmap_mt, dir.to, dir.to + dir.size - 1);
+ entry_rev = xa_to_value(mas_find_range(&mas_rev,
+ dir.to + dir.size - 1));
+
+ rev.from = mas_rev.index;
+ rev.to = entry_rev & ADDR_MASK;
+ rev.size = mas_rev.last - mas_rev.index + 1;
+
+ /* The three conditions outlined above. */
+ if (entry_rev && !(entry_rev & UNKNOWN_IPA) &&
+ dir.from >= nested_ipa &&
+ dir.from + dir.size - 1 <= nested_ipa_end &&
+ dir.from == rev.to &&
+ rev.from == dir.to &&
+ dir.size == rev.size) {
+ mas_store_gfp(&mas_dir, NULL, GFP_NOWAIT | __GFP_ACCOUNT);
mas_store_gfp(&mas_rev, NULL, GFP_NOWAIT | __GFP_ACCOUNT);
}
+ entry_dir = xa_to_value(mas_find_range(&mas_dir, nested_ipa_end));
}
mtree_unlock(revmap_mt);
}
@@ -826,9 +856,12 @@ void kvm_record_nested_revmap(gpa_t ipa, struct kvm_s2_mmu *mmu,
gpa_t fault_ipa, size_t map_size)
{
struct maple_tree *revmap_mt = &mmu->nested_revmap_mt;
+ struct maple_tree *direct_mt = &mmu->nested_direct_mt;
gpa_t ipa_end = ipa + map_size - 1;
+ gpa_t fault_ipa_end = fault_ipa + map_size - 1;
u64 entry, new_entry = 0;
MA_STATE(mas_rev, revmap_mt, ipa, ipa_end);
+ MA_STATE(mas_dir, direct_mt, fault_ipa, fault_ipa_end);
if (mmu->nested_revmap_broken)
return;
@@ -861,6 +894,15 @@ void kvm_record_nested_revmap(gpa_t ipa, struct kvm_s2_mmu *mmu,
if (mas_store_gfp(&mas_rev, xa_mk_value(new_entry),
GFP_NOWAIT | __GFP_ACCOUNT))
mmu->nested_revmap_broken = true;
+
+ /*
+ * Add direct map but ignore the result, missing a direct map does not
+ * affect correctness.
+ */
+ if (new_entry & VALID_ENTRY && !mmu->nested_revmap_broken)
+ mas_store_gfp(&mas_dir, xa_mk_value(ipa | VALID_ENTRY),
+ GFP_NOWAIT | __GFP_ACCOUNT);
+
unlock:
mtree_unlock(revmap_mt);
}
@@ -872,6 +914,8 @@ void kvm_init_nested_s2_mmu(struct kvm_s2_mmu *mmu)
mmu->nested_stage2_enabled = false;
atomic_set(&mmu->refcnt, 0);
mt_init(&mmu->nested_revmap_mt);
+ mt_init_flags(&mmu->nested_direct_mt, MT_FLAGS_LOCK_EXTERN);
+ mt_set_external_lock(&mmu->nested_direct_mt, &mmu->nested_revmap_mt.ma_lock);
mmu->nested_revmap_broken = false;
}
@@ -1250,7 +1294,10 @@ void kvm_nested_s2_wp(struct kvm *kvm)
static void reset_revmap_and_unmap(struct kvm_s2_mmu *mmu, bool may_block)
{
- mtree_destroy(&mmu->nested_revmap_mt);
+ mtree_lock(&mmu->nested_revmap_mt);
+ __mt_destroy(&mmu->nested_revmap_mt);
+ __mt_destroy(&mmu->nested_direct_mt);
+ mtree_unlock(&mmu->nested_revmap_mt);
mmu->nested_revmap_broken = false;
kvm_stage2_unmap_range(mmu, 0, kvm_phys_size(mmu), may_block);
}
@@ -1259,11 +1306,14 @@ static void unmap_mmu_ipa_range(struct kvm_s2_mmu *mmu, gpa_t gpa,
size_t unmap_size, bool may_block)
{
struct maple_tree *revmap_mt = &mmu->nested_revmap_mt;
+ struct maple_tree *direct_mt = &mmu->nested_direct_mt;
gpa_t ipa = gpa;
gpa_t ipa_end = gpa + unmap_size - 1;
+ gpa_t nested_ipa, nested_ipa_end;
u64 entry;
size_t entry_size;
MA_STATE(mas_rev, revmap_mt, gpa, ipa_end);
+ MA_STATE(mas_dir, direct_mt, 0, ULONG_MAX);
if (mmu->nested_revmap_broken) {
reset_revmap_and_unmap(mmu, may_block);
@@ -1292,6 +1342,16 @@ static void unmap_mmu_ipa_range(struct kvm_s2_mmu *mmu, gpa_t gpa,
*/
mas_store_gfp(&mas_rev, NULL, GFP_NOWAIT | __GFP_ACCOUNT);
+ /*
+ * Try to also remove the direct map, it is okay if this fails,
+ * as we check for direct map consistency in
+ * kvm_remove_nested_revmap().
+ */
+ nested_ipa = entry & ADDR_MASK;
+ nested_ipa_end = nested_ipa + entry_size - 1;
+ mas_set_range(&mas_dir, nested_ipa, nested_ipa_end);
+ mas_store_gfp(&mas_dir, NULL, GFP_NOWAIT | __GFP_ACCOUNT);
+
mtree_unlock(revmap_mt);
kvm_stage2_unmap_range(mmu, entry & ADDR_MASK, entry_size,
may_block);
--
2.43.0