* [PATCH 0/1] KVM: arm64: nv: Implement nested stage-2 reverse map
@ 2026-04-11 12:50 Wei-Lin Chang
2026-04-11 12:50 ` [PATCH 1/1] KVM: arm64: nv: Avoid full shadow s2 unmap Wei-Lin Chang
2026-04-11 14:00 ` [PATCH v2 0/1] KVM: arm64: nv: Implement nested stage-2 reverse map Wei-Lin Chang
0 siblings, 2 replies; 5+ messages in thread
From: Wei-Lin Chang @ 2026-04-11 12:50 UTC (permalink / raw)
To: linux-arm-kernel, kvmarm, linux-kernel
Cc: Marc Zyngier, Oliver Upton, Joey Gouly, Suzuki K Poulose,
Zenghui Yu, Catalin Marinas, Will Deacon, Wei-Lin Chang
Hi,
This is v2 of the series optimizing shadow s2 mmu unmapping during MMU
notifier calls. Thanks to Sashiko, who pointed out the many problems [1]
in v1.
* Changes from v1 [2]:
- Rebased onto a newer kvmarm/next, where user_mem_abort() underwent
a significant refactor.
- Added a flag VALID_ENTRY (bit 63) to each non-polluted reverse map
entry, so that even if the nested IPA == 0, we still insert a non-zero
entry into the maple tree.
- Take the maple tree lock while using the tree. Previously I thought
I could piggyback on kvm->mmu_lock, but this doesn't work for two
reasons:
1. The maple tree advanced API (mas_*) expects the maple tree lock
to be held.
2. At stage-2 fault time, kvm->mmu_lock is only taken for read.
Therefore even if 1. did not matter, parallel accesses to the
maple tree could still happen.
- Changed from GFP_KERNEL_ACCOUNT to (GFP_NOWAIT | __GFP_ACCOUNT) in
maple tree operations, because GFP_KERNEL_ACCOUNT can sleep and
kvm->mmu_lock is held while doing the operations.
- Made the code tolerate reverse map creation failure. In v1, if a
maple tree operation failed, the error was reported back to the
caller, which in the end failed the vCPU run. It shouldn't be this
way: the reverse map is an optimization and shouldn't fail normal
operation, since we can fall back to a full unmap.
- Added a boolean nested_revmap_broken to struct kvm_s2_mmu. If
reverse map creation fails, the reverse map becomes unreliable. Keep
this failure information in nested_revmap_broken so that we can
fall back when we need to unmap.
- Removed patches 2, 3 and 4 for now. Once we use the maple tree
lock and track the reverse map failure state in
nested_revmap_broken, the s2 mmu lookup acceleration in v1 patch 2
becomes very complicated: the canonical maple tree used to speed
up s2 mmu lookup can also encounter allocation failures, which we
would also need to track and fall back from. Meanwhile, the
consistency between the trees is not easy to reason about when
errors happen. Additionally, the extra lock of the canonical maple
tree also needs to be considered, and care must be taken not to
introduce lock order inversion.
Given the above, I believe it is best to leave the reverse map
improvements out for now, rather than spend too much time on
optimization before the initial version of the reverse map is
solid.
Thanks!
[1]: https://sashiko.dev/#/patchset/20260330100633.2817076-1-weilin.chang%40arm.com
[2]: https://lore.kernel.org/kvmarm/20260330100633.2817076-1-weilin.chang@arm.com/
Wei-Lin Chang (1):
KVM: arm64: nv: Avoid full shadow s2 unmap
arch/arm64/include/asm/kvm_host.h | 4 +
arch/arm64/include/asm/kvm_nested.h | 4 +
arch/arm64/kvm/mmu.c | 30 ++++--
arch/arm64/kvm/nested.c | 147 +++++++++++++++++++++++++++-
4 files changed, 177 insertions(+), 8 deletions(-)
--
2.43.0
^ permalink raw reply [flat|nested] 5+ messages in thread
* [PATCH 1/1] KVM: arm64: nv: Avoid full shadow s2 unmap
2026-04-11 12:50 [PATCH 0/1] KVM: arm64: nv: Implement nested stage-2 reverse map Wei-Lin Chang
@ 2026-04-11 12:50 ` Wei-Lin Chang
2026-04-15 8:38 ` Marc Zyngier
2026-04-11 14:00 ` [PATCH v2 0/1] KVM: arm64: nv: Implement nested stage-2 reverse map Wei-Lin Chang
1 sibling, 1 reply; 5+ messages in thread
From: Wei-Lin Chang @ 2026-04-11 12:50 UTC (permalink / raw)
To: linux-arm-kernel, kvmarm, linux-kernel
Cc: Marc Zyngier, Oliver Upton, Joey Gouly, Suzuki K Poulose,
Zenghui Yu, Catalin Marinas, Will Deacon, Wei-Lin Chang
Currently we are forced to fully unmap all shadow stage-2 page tables
for a VM when unmapping a page from the canonical stage-2, for example
during an MMU notifier call. This is because we do not track which
canonical IPAs are mapped in the shadow stage-2 page tables, so there
is no way to know what to unmap.
Create a per-kvm_s2_mmu maple tree that tracks canonical IPA range ->
nested IPA range, so that it is possible to partially unmap the shadow
stage-2 when a canonical IPA range is unmapped. The algorithm is simple
and conservative:
At each shadow stage-2 map, insert the nested IPA range into the maple
tree, with the canonical IPA range as the key. If the canonical IPA
range doesn't overlap any existing range in the tree, insert it as is,
establishing a reverse mapping for the range. But if the canonical IPA
range overlaps any existing ranges in the tree, create a new range that
spans all the overlapping ranges, including the input range, and
replace those existing ranges. At the same time, mark this new spanning
canonical IPA range as "polluted", indicating we have lost track of the
nested IPA ranges that map to it.
The maple tree's 64-bit entry is enough to store the nested IPA and
the polluted status (stored as a bit called UNKNOWN_IPA), so apart from
the maple tree's internal operations, memory allocation is avoided.
Example:
|||| means existing range, ---- means empty range
input: $$$$$$$$$$$$$$$$$$$$$$$$$$
tree: --||||-----|||||||---------||||||||||-----------
insert spanning range and replace overlapping ones:
--||||-----||||||||||||||||||||||||||-----------
^^^^^^^^polluted!^^^^^^^^^
With the reverse map created, when a canonical IPA range gets unmapped,
look into each s2 mmu's maple tree for the affected canonical IPA
ranges and, based on their polluted status:
polluted -> fall back and fully invalidate the current shadow stage-2,
also clear the tree
not polluted -> unmap the nested IPA range, and remove the reverse map
entry
Suggested-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Wei-Lin Chang <weilin.chang@arm.com>
---
arch/arm64/include/asm/kvm_host.h | 4 +
arch/arm64/include/asm/kvm_nested.h | 4 +
arch/arm64/kvm/mmu.c | 30 ++++--
arch/arm64/kvm/nested.c | 147 +++++++++++++++++++++++++++-
4 files changed, 177 insertions(+), 8 deletions(-)
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 851f6171751c..a97bd461c1e1 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -217,6 +217,10 @@ struct kvm_s2_mmu {
*/
bool nested_stage2_enabled;
+ /* canonical IPA to nested IPA range lookup */
+ struct maple_tree nested_revmap_mt;
+ bool nested_revmap_broken;
+
#ifdef CONFIG_PTDUMP_STAGE2_DEBUGFS
struct dentry *shadow_pt_debugfs_dentry;
#endif
diff --git a/arch/arm64/include/asm/kvm_nested.h b/arch/arm64/include/asm/kvm_nested.h
index 091544e6af44..f039220e87a6 100644
--- a/arch/arm64/include/asm/kvm_nested.h
+++ b/arch/arm64/include/asm/kvm_nested.h
@@ -76,6 +76,8 @@ extern void kvm_s2_mmu_iterate_by_vmid(struct kvm *kvm, u16 vmid,
const union tlbi_info *info,
void (*)(struct kvm_s2_mmu *,
const union tlbi_info *));
+extern void kvm_record_nested_revmap(gpa_t gpa, struct kvm_s2_mmu *mmu,
+ gpa_t fault_gpa, size_t map_size);
extern void kvm_vcpu_load_hw_mmu(struct kvm_vcpu *vcpu);
extern void kvm_vcpu_put_hw_mmu(struct kvm_vcpu *vcpu);
@@ -164,6 +166,8 @@ extern int kvm_s2_handle_perm_fault(struct kvm_vcpu *vcpu,
struct kvm_s2_trans *trans);
extern int kvm_inject_s2_fault(struct kvm_vcpu *vcpu, u64 esr_el2);
extern void kvm_nested_s2_wp(struct kvm *kvm);
+extern void kvm_unmap_gfn_range_nested(struct kvm *kvm, gpa_t gpa, size_t size,
+ bool may_block);
extern void kvm_nested_s2_unmap(struct kvm *kvm, bool may_block);
extern void kvm_nested_s2_flush(struct kvm *kvm);
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index d089c107d9b7..4c9b9cf6dc43 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -5,6 +5,7 @@
*/
#include <linux/acpi.h>
+#include <linux/maple_tree.h>
#include <linux/mman.h>
#include <linux/kvm_host.h>
#include <linux/io.h>
@@ -1099,6 +1100,7 @@ void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
{
struct kvm *kvm = kvm_s2_mmu_to_kvm(mmu);
struct kvm_pgtable *pgt = NULL;
+ struct maple_tree *mt = &mmu->nested_revmap_mt;
write_lock(&kvm->mmu_lock);
pgt = mmu->pgt;
@@ -1108,8 +1110,11 @@ void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
free_percpu(mmu->last_vcpu_ran);
}
- if (kvm_is_nested_s2_mmu(kvm, mmu))
+ if (kvm_is_nested_s2_mmu(kvm, mmu)) {
+ if (!mtree_empty(mt))
+ mtree_destroy(mt);
kvm_init_nested_s2_mmu(mmu);
+ }
write_unlock(&kvm->mmu_lock);
@@ -1631,6 +1636,10 @@ static int gmem_abort(const struct kvm_s2_fault_desc *s2fd)
goto out_unlock;
}
+ if (s2fd->nested)
+ kvm_record_nested_revmap(gfn << PAGE_SHIFT, pgt->mmu,
+ s2fd->fault_ipa, PAGE_SIZE);
+
ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, s2fd->fault_ipa, PAGE_SIZE,
__pfn_to_phys(pfn), prot,
memcache, flags);
@@ -2031,6 +2040,13 @@ static int kvm_s2_fault_map(const struct kvm_s2_fault_desc *s2fd,
ret = KVM_PGT_FN(kvm_pgtable_stage2_relax_perms)(pgt, gfn_to_gpa(gfn),
prot, flags);
} else {
+ if (s2fd->nested) {
+ phys_addr_t ipa = gfn_to_gpa(get_canonical_gfn(s2fd, s2vi));
+
+ ipa &= ~(mapping_size - 1);
+ kvm_record_nested_revmap(ipa, pgt->mmu, gfn_to_gpa(gfn),
+ mapping_size);
+ }
ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, gfn_to_gpa(gfn), mapping_size,
__pfn_to_phys(pfn), prot,
memcache, flags);
@@ -2388,14 +2404,16 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
{
+ gpa_t gpa = range->start << PAGE_SHIFT;
+ size_t size = (range->end - range->start) << PAGE_SHIFT;
+ bool may_block = range->may_block;
+
if (!kvm->arch.mmu.pgt || kvm_vm_is_protected(kvm))
return false;
- __unmap_stage2_range(&kvm->arch.mmu, range->start << PAGE_SHIFT,
- (range->end - range->start) << PAGE_SHIFT,
- range->may_block);
+ __unmap_stage2_range(&kvm->arch.mmu, gpa, size, may_block);
+ kvm_unmap_gfn_range_nested(kvm, gpa, size, may_block);
- kvm_nested_s2_unmap(kvm, range->may_block);
return false;
}
@@ -2673,7 +2691,7 @@ void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
write_lock(&kvm->mmu_lock);
kvm_stage2_unmap_range(&kvm->arch.mmu, gpa, size, true);
- kvm_nested_s2_unmap(kvm, true);
+ kvm_unmap_gfn_range_nested(kvm, gpa, size, true);
write_unlock(&kvm->mmu_lock);
}
diff --git a/arch/arm64/kvm/nested.c b/arch/arm64/kvm/nested.c
index 883b6c1008fb..c9ebe969b453 100644
--- a/arch/arm64/kvm/nested.c
+++ b/arch/arm64/kvm/nested.c
@@ -7,6 +7,7 @@
#include <linux/bitfield.h>
#include <linux/kvm.h>
#include <linux/kvm_host.h>
+#include <linux/maple_tree.h>
#include <asm/fixmap.h>
#include <asm/kvm_arm.h>
@@ -43,6 +44,19 @@ struct vncr_tlb {
*/
#define S2_MMU_PER_VCPU 2
+/*
+ * Per shadow S2 reverse map (IPA -> nested IPA range) maple tree payload
+ * layout:
+ *
+ * bit 63: valid, 1 for non-polluted entries, prevents the case where the
+ * nested IPA is 0 and turns the whole value to 0
+ * bits 55-12: nested IPA bits 55-12
+ * bit 0: polluted, 1 for polluted, 0 for not
+ */
+#define VALID_ENTRY BIT(63)
+#define NESTED_IPA_MASK GENMASK_ULL(55, 12)
+#define UNKNOWN_IPA BIT(0)
+
void kvm_init_nested(struct kvm *kvm)
{
kvm->arch.nested_mmus = NULL;
@@ -769,12 +783,57 @@ static struct kvm_s2_mmu *get_s2_mmu_nested(struct kvm_vcpu *vcpu)
return s2_mmu;
}
+void kvm_record_nested_revmap(gpa_t ipa, struct kvm_s2_mmu *mmu,
+ gpa_t fault_ipa, size_t map_size)
+{
+ struct maple_tree *mt = &mmu->nested_revmap_mt;
+ gpa_t start = ipa;
+ gpa_t end = ipa + map_size - 1;
+ u64 entry, new_entry = 0;
+ MA_STATE(mas, mt, start, end);
+
+ if (mmu->nested_revmap_broken)
+ return;
+
+ mtree_lock(mt);
+ entry = (u64)mas_find_range(&mas, end);
+
+ if (entry) {
+ /* maybe just a perm update... */
+ if (!(entry & UNKNOWN_IPA) && mas.index == start &&
+ mas.last == end &&
+ fault_ipa == (entry & NESTED_IPA_MASK))
+ goto unlock;
+ /*
+ * Create a "polluted" range that spans all the overlapping
+ * ranges and store it.
+ */
+ while (entry && mas.index <= end) {
+ start = min(mas.index, start);
+ end = max(mas.last, end);
+ entry = (u64)mas_find_range(&mas, end);
+ }
+ new_entry |= UNKNOWN_IPA;
+ } else {
+ new_entry |= fault_ipa;
+ new_entry |= VALID_ENTRY;
+ }
+
+ mas_set_range(&mas, start, end);
+ if (mas_store_gfp(&mas, (void *)new_entry, GFP_NOWAIT | __GFP_ACCOUNT))
+ mmu->nested_revmap_broken = true;
+unlock:
+ mtree_unlock(mt);
+}
+
void kvm_init_nested_s2_mmu(struct kvm_s2_mmu *mmu)
{
/* CnP being set denotes an invalid entry */
mmu->tlb_vttbr = VTTBR_CNP_BIT;
mmu->nested_stage2_enabled = false;
atomic_set(&mmu->refcnt, 0);
+ mt_init(&mmu->nested_revmap_mt);
+ mmu->nested_revmap_broken = false;
}
void kvm_vcpu_load_hw_mmu(struct kvm_vcpu *vcpu)
@@ -1150,6 +1209,90 @@ void kvm_nested_s2_wp(struct kvm *kvm)
kvm_invalidate_vncr_ipa(kvm, 0, BIT(kvm->arch.mmu.pgt->ia_bits));
}
+static void reset_revmap_and_unmap(struct kvm_s2_mmu *mmu, bool may_block)
+{
+ mtree_destroy(&mmu->nested_revmap_mt);
+ kvm_stage2_unmap_range(mmu, 0, kvm_phys_size(mmu), may_block);
+ mmu->nested_revmap_broken = false;
+}
+
+static void unmap_mmu_ipa_range(struct kvm_s2_mmu *mmu, gpa_t gpa,
+ size_t unmap_size, bool may_block)
+{
+ struct maple_tree *mt = &mmu->nested_revmap_mt;
+ gpa_t start = gpa;
+ gpa_t end = gpa + unmap_size - 1;
+ u64 entry;
+ size_t entry_size;
+ bool unlock, fallback;
+ MA_STATE(mas, mt, gpa, end);
+
+ if (mmu->nested_revmap_broken) {
+ unlock = false;
+ fallback = true;
+ goto fin;
+ }
+
+ mtree_lock(mt);
+ entry = (u64)mas_find_range(&mas, end);
+
+ while (entry && mas.index <= end) {
+ start = mas.last + 1;
+ entry_size = mas.last - mas.index + 1;
+ /*
+ * Give up and invalidate this s2 mmu if the unmap range
+ * touches any polluted range.
+ */
+ if (entry & UNKNOWN_IPA) {
+ unlock = true;
+ fallback = true;
+ goto fin;
+ }
+
+ /*
+ * Ignore result, it is okay if a reverse mapping erase
+ * fails.
+ */
+ mas_store_gfp(&mas, NULL, GFP_NOWAIT | __GFP_ACCOUNT);
+
+ mtree_unlock(mt);
+ kvm_stage2_unmap_range(mmu, entry & NESTED_IPA_MASK, entry_size,
+ may_block);
+ mtree_lock(mt);
+ /*
+ * Other maple tree operations during preemption could render
+ * this ma_state invalid, so reset it.
+ */
+ mas_set_range(&mas, start, end);
+ entry = (u64)mas_find_range(&mas, end);
+ }
+ unlock = true;
+ fallback = false;
+
+fin:
+ if (unlock)
+ mtree_unlock(mt);
+ if (fallback)
+ reset_revmap_and_unmap(mmu, may_block);
+}
+
+void kvm_unmap_gfn_range_nested(struct kvm *kvm, gpa_t gpa, size_t size,
+ bool may_block)
+{
+ int i;
+
+ if (!kvm->arch.nested_mmus_size)
+ return;
+
+ /* TODO: accelerate this using mt of canonical s2 mmu */
+ for (i = 0; i < kvm->arch.nested_mmus_size; i++) {
+ struct kvm_s2_mmu *mmu = &kvm->arch.nested_mmus[i];
+
+ if (kvm_s2_mmu_valid(mmu))
+ unmap_mmu_ipa_range(mmu, gpa, size, may_block);
+ }
+}
+
void kvm_nested_s2_unmap(struct kvm *kvm, bool may_block)
{
int i;
@@ -1163,7 +1306,7 @@ void kvm_nested_s2_unmap(struct kvm *kvm, bool may_block)
struct kvm_s2_mmu *mmu = &kvm->arch.nested_mmus[i];
if (kvm_s2_mmu_valid(mmu))
- kvm_stage2_unmap_range(mmu, 0, kvm_phys_size(mmu), may_block);
+ reset_revmap_and_unmap(mmu, may_block);
}
kvm_invalidate_vncr_ipa(kvm, 0, BIT(kvm->arch.mmu.pgt->ia_bits));
@@ -1848,7 +1991,7 @@ void check_nested_vcpu_requests(struct kvm_vcpu *vcpu)
write_lock(&vcpu->kvm->mmu_lock);
if (mmu->pending_unmap) {
- kvm_stage2_unmap_range(mmu, 0, kvm_phys_size(mmu), true);
+ reset_revmap_and_unmap(mmu, true);
mmu->pending_unmap = false;
}
write_unlock(&vcpu->kvm->mmu_lock);
--
2.43.0
^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: [PATCH v2 0/1] KVM: arm64: nv: Implement nested stage-2 reverse map
2026-04-11 12:50 [PATCH 0/1] KVM: arm64: nv: Implement nested stage-2 reverse map Wei-Lin Chang
2026-04-11 12:50 ` [PATCH 1/1] KVM: arm64: nv: Avoid full shadow s2 unmap Wei-Lin Chang
@ 2026-04-11 14:00 ` Wei-Lin Chang
1 sibling, 0 replies; 5+ messages in thread
From: Wei-Lin Chang @ 2026-04-11 14:00 UTC (permalink / raw)
To: linux-arm-kernel, kvmarm, linux-kernel
Cc: Marc Zyngier, Oliver Upton, Joey Gouly, Suzuki K Poulose,
Zenghui Yu, Catalin Marinas, Will Deacon
Hi,
Sorry, I missed changing the title to v2.
I'll add this to my checklist before sending next time.
Thanks,
Wei-Lin Chang
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH 1/1] KVM: arm64: nv: Avoid full shadow s2 unmap
2026-04-11 12:50 ` [PATCH 1/1] KVM: arm64: nv: Avoid full shadow s2 unmap Wei-Lin Chang
@ 2026-04-15 8:38 ` Marc Zyngier
2026-04-15 23:05 ` Wei-Lin Chang
0 siblings, 1 reply; 5+ messages in thread
From: Marc Zyngier @ 2026-04-15 8:38 UTC (permalink / raw)
To: Wei-Lin Chang
Cc: linux-arm-kernel, kvmarm, linux-kernel, Oliver Upton, Joey Gouly,
Suzuki K Poulose, Zenghui Yu, Catalin Marinas, Will Deacon
On Sat, 11 Apr 2026 13:50:24 +0100,
Wei-Lin Chang <weilin.chang@arm.com> wrote:
>
> Currently we are forced to fully unmap all shadow stage-2 for a VM when
> unmapping a page from the canonical stage-2, for example during an MMU
> notifier call. This is because we are not tracking what canonical IPA
> are mapped in the shadow stage-2 page tables hence there is no way to
> know what to unmap.
>
> Create a per kvm_s2_mmu maple tree to track canonical IPA range ->
> nested IPA range, so that it is possible to partially unmap shadow
> stage-2 when a canonical IPA range is unmapped. The algorithm is simple
> and conservative:
>
> At each shadow stage-2 map, insert the nested IPA range into the maple
> tree, with the canonical IPA range as the key. If the canonical IPA
> range doesn't overlap with existing ranges in the tree, insert as is,
> and a reverse mapping for this range is established. But if the
> canonical IPA range overlaps with any existing ranges in the tree,
> create a new range that spans all the overlapping ranges including the
> input range and replace those existing ranges. In the mean time, mark
> this new spanning canonical IPA range as "polluted" indicating we lost
> track of the nested IPA ranges that map to this canonical IPA range.
>
> The maple tree's 64 bit entry is enough to store the nested IPA and
> polluted status (stored as a bit called UNKNOWN_IPA), therefore besides
> maple tree's internal operation, memory allocation is avoided.
>
> Example:
> |||| means existing range, ---- means empty range
>
> input: $$$$$$$$$$$$$$$$$$$$$$$$$$
> tree: --||||-----|||||||---------||||||||||-----------
>
> insert spanning range and replace overlapping ones:
> --||||-----||||||||||||||||||||||||||-----------
> ^^^^^^^^polluted!^^^^^^^^^
I think you should stick to a single terminology. It is either
"polluted", or "unknown IPA". My preference goes to the latter, as the
former is not very descriptive in this context.
>
> With the reverse map created, when a canonical IPA range gets unmapped,
> look into each s2 mmu's maple tree and look for canonical IPA ranges
> affected, and base on their polluted status:
>
> polluted -> fall back and fully invalidate the current shadow stage-2,
> also clear the tree
> not polluted -> unmap the nested IPA range, and remove the reverse map
> entry
>
> Suggested-by: Marc Zyngier <maz@kernel.org>
> Signed-off-by: Wei-Lin Chang <weilin.chang@arm.com>
> ---
> arch/arm64/include/asm/kvm_host.h | 4 +
> arch/arm64/include/asm/kvm_nested.h | 4 +
> arch/arm64/kvm/mmu.c | 30 ++++--
> arch/arm64/kvm/nested.c | 147 +++++++++++++++++++++++++++-
> 4 files changed, 177 insertions(+), 8 deletions(-)
>
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index 851f6171751c..a97bd461c1e1 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -217,6 +217,10 @@ struct kvm_s2_mmu {
> */
> bool nested_stage2_enabled;
>
> + /* canonical IPA to nested IPA range lookup */
> + struct maple_tree nested_revmap_mt;
> + bool nested_revmap_broken;
> +
Consider moving this boolean next to the other ones so that you don't
create too many holes in the kvm_s2_mmu structure (use pahole to find out).
But I have some misgivings about the way things are structured
here. Only NV needs a revmap, yet this is present irrelevant of the
nature of the VM and bloats the data structure a bit.
My naive approach would have been to only keep a pointer to the
revmap, and make that pointer NULL when the tree is "broken", and
freed under RCU if the context isn't the correct one.
This would have multiple benefits: no large-ish structure embedded in
the s2_mmu structure, no extra boolean to indicate an error condition,
memory reclaimed earlier.
> #ifdef CONFIG_PTDUMP_STAGE2_DEBUGFS
> struct dentry *shadow_pt_debugfs_dentry;
> #endif
> diff --git a/arch/arm64/include/asm/kvm_nested.h b/arch/arm64/include/asm/kvm_nested.h
> index 091544e6af44..f039220e87a6 100644
> --- a/arch/arm64/include/asm/kvm_nested.h
> +++ b/arch/arm64/include/asm/kvm_nested.h
> @@ -76,6 +76,8 @@ extern void kvm_s2_mmu_iterate_by_vmid(struct kvm *kvm, u16 vmid,
> const union tlbi_info *info,
> void (*)(struct kvm_s2_mmu *,
> const union tlbi_info *));
> +extern void kvm_record_nested_revmap(gpa_t gpa, struct kvm_s2_mmu *mmu,
> + gpa_t fault_gpa, size_t map_size);
> extern void kvm_vcpu_load_hw_mmu(struct kvm_vcpu *vcpu);
> extern void kvm_vcpu_put_hw_mmu(struct kvm_vcpu *vcpu);
>
> @@ -164,6 +166,8 @@ extern int kvm_s2_handle_perm_fault(struct kvm_vcpu *vcpu,
> struct kvm_s2_trans *trans);
> extern int kvm_inject_s2_fault(struct kvm_vcpu *vcpu, u64 esr_el2);
> extern void kvm_nested_s2_wp(struct kvm *kvm);
> +extern void kvm_unmap_gfn_range_nested(struct kvm *kvm, gpa_t gpa, size_t size,
> + bool may_block);
> extern void kvm_nested_s2_unmap(struct kvm *kvm, bool may_block);
> extern void kvm_nested_s2_flush(struct kvm *kvm);
>
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index d089c107d9b7..4c9b9cf6dc43 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -5,6 +5,7 @@
> */
>
> #include <linux/acpi.h>
> +#include <linux/maple_tree.h>
> #include <linux/mman.h>
> #include <linux/kvm_host.h>
> #include <linux/io.h>
> @@ -1099,6 +1100,7 @@ void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
> {
> struct kvm *kvm = kvm_s2_mmu_to_kvm(mmu);
> struct kvm_pgtable *pgt = NULL;
> + struct maple_tree *mt = &mmu->nested_revmap_mt;
>
> write_lock(&kvm->mmu_lock);
> pgt = mmu->pgt;
> @@ -1108,8 +1110,11 @@ void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
> free_percpu(mmu->last_vcpu_ran);
> }
>
> - if (kvm_is_nested_s2_mmu(kvm, mmu))
> + if (kvm_is_nested_s2_mmu(kvm, mmu)) {
> + if (!mtree_empty(mt))
> + mtree_destroy(mt);
> kvm_init_nested_s2_mmu(mmu);
> + }
>
> write_unlock(&kvm->mmu_lock);
>
> @@ -1631,6 +1636,10 @@ static int gmem_abort(const struct kvm_s2_fault_desc *s2fd)
> goto out_unlock;
> }
>
> + if (s2fd->nested)
> + kvm_record_nested_revmap(gfn << PAGE_SHIFT, pgt->mmu,
> + s2fd->fault_ipa, PAGE_SIZE);
> +
> ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, s2fd->fault_ipa, PAGE_SIZE,
> __pfn_to_phys(pfn), prot,
> memcache, flags);
> @@ -2031,6 +2040,13 @@ static int kvm_s2_fault_map(const struct kvm_s2_fault_desc *s2fd,
> ret = KVM_PGT_FN(kvm_pgtable_stage2_relax_perms)(pgt, gfn_to_gpa(gfn),
> prot, flags);
> } else {
> + if (s2fd->nested) {
> + phys_addr_t ipa = gfn_to_gpa(get_canonical_gfn(s2fd, s2vi));
> +
> + ipa &= ~(mapping_size - 1);
I guess it'd be worth adding a helper for this instead of duplicating
the existing code.
> + kvm_record_nested_revmap(ipa, pgt->mmu, gfn_to_gpa(gfn),
> + mapping_size);
This worries me a bit, see below.
> + }
> ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, gfn_to_gpa(gfn), mapping_size,
> __pfn_to_phys(pfn), prot,
> memcache, flags);
> @@ -2388,14 +2404,16 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
>
> bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> {
> + gpa_t gpa = range->start << PAGE_SHIFT;
> + size_t size = (range->end - range->start) << PAGE_SHIFT;
> + bool may_block = range->may_block;
> +
> if (!kvm->arch.mmu.pgt || kvm_vm_is_protected(kvm))
> return false;
>
> - __unmap_stage2_range(&kvm->arch.mmu, range->start << PAGE_SHIFT,
> - (range->end - range->start) << PAGE_SHIFT,
> - range->may_block);
> + __unmap_stage2_range(&kvm->arch.mmu, gpa, size, may_block);
> + kvm_unmap_gfn_range_nested(kvm, gpa, size, may_block);
>
> - kvm_nested_s2_unmap(kvm, range->may_block);
> return false;
> }
>
> @@ -2673,7 +2691,7 @@ void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
>
> write_lock(&kvm->mmu_lock);
> kvm_stage2_unmap_range(&kvm->arch.mmu, gpa, size, true);
> - kvm_nested_s2_unmap(kvm, true);
> + kvm_unmap_gfn_range_nested(kvm, gpa, size, true);
> write_unlock(&kvm->mmu_lock);
> }
>
> diff --git a/arch/arm64/kvm/nested.c b/arch/arm64/kvm/nested.c
> index 883b6c1008fb..c9ebe969b453 100644
> --- a/arch/arm64/kvm/nested.c
> +++ b/arch/arm64/kvm/nested.c
> @@ -7,6 +7,7 @@
> #include <linux/bitfield.h>
> #include <linux/kvm.h>
> #include <linux/kvm_host.h>
> +#include <linux/maple_tree.h>
>
> #include <asm/fixmap.h>
> #include <asm/kvm_arm.h>
> @@ -43,6 +44,19 @@ struct vncr_tlb {
> */
> #define S2_MMU_PER_VCPU 2
>
> +/*
> + * Per shadow S2 reverse map (IPA -> nested IPA range) maple tree payload
> + * layout:
> + *
> + * bit 63: valid, 1 for non-polluted entries, prevents the case where the
> + * nested IPA is 0 and turns the whole value to 0
> + * bits 55-12: nested IPA bits 55-12
> + * bit 0: polluted, 1 for polluted, 0 for not
> + */
> +#define VALID_ENTRY BIT(63)
> +#define NESTED_IPA_MASK GENMASK_ULL(55, 12)
> +#define UNKNOWN_IPA BIT(0)
> +
This only works because you are using the "advanced" API, right?
Otherwise, you'd be losing the high bit. It'd be good to add a comment
so that people keep that in mind.
> void kvm_init_nested(struct kvm *kvm)
> {
> kvm->arch.nested_mmus = NULL;
> @@ -769,12 +783,57 @@ static struct kvm_s2_mmu *get_s2_mmu_nested(struct kvm_vcpu *vcpu)
> return s2_mmu;
> }
>
> +void kvm_record_nested_revmap(gpa_t ipa, struct kvm_s2_mmu *mmu,
> + gpa_t fault_ipa, size_t map_size)
> +{
> + struct maple_tree *mt = &mmu->nested_revmap_mt;
> + gpa_t start = ipa;
> + gpa_t end = ipa + map_size - 1;
> + u64 entry, new_entry = 0;
> + MA_STATE(mas, mt, start, end);
> +
> + if (mmu->nested_revmap_broken)
> + return;
> +
> + mtree_lock(mt);
> + entry = (u64)mas_find_range(&mas, end);
> +
> + if (entry) {
> + /* maybe just a perm update... */
> + if (!(entry & UNKNOWN_IPA) && mas.index == start &&
> + mas.last == end &&
> + fault_ipa == (entry & NESTED_IPA_MASK))
> + goto unlock;
> + /*
> + * Create a "polluted" range that spans all the overlapping
> + * ranges and store it.
> + */
> + while (entry && mas.index <= end) {
> + start = min(mas.index, start);
> + end = max(mas.last, end);
> + entry = (u64)mas_find_range(&mas, end);
> + }
> + new_entry |= UNKNOWN_IPA;
> + } else {
> + new_entry |= fault_ipa;
> + new_entry |= VALID_ENTRY;
> + }
> +
> + mas_set_range(&mas, start, end);
> + if (mas_store_gfp(&mas, (void *)new_entry, GFP_NOWAIT | __GFP_ACCOUNT))
> + mmu->nested_revmap_broken = true;
Can we try and minimise the risk of allocation failure here?
user_mem_abort() tries very hard to pre-allocate pages for page
tables by maintaining an memcache. Can we have a similar approach for
the revmap?
> +unlock:
> + mtree_unlock(mt);
> +}
> +
> void kvm_init_nested_s2_mmu(struct kvm_s2_mmu *mmu)
> {
> /* CnP being set denotes an invalid entry */
> mmu->tlb_vttbr = VTTBR_CNP_BIT;
> mmu->nested_stage2_enabled = false;
> atomic_set(&mmu->refcnt, 0);
> + mt_init(&mmu->nested_revmap_mt);
> + mmu->nested_revmap_broken = false;
> }
>
> void kvm_vcpu_load_hw_mmu(struct kvm_vcpu *vcpu)
> @@ -1150,6 +1209,90 @@ void kvm_nested_s2_wp(struct kvm *kvm)
> kvm_invalidate_vncr_ipa(kvm, 0, BIT(kvm->arch.mmu.pgt->ia_bits));
> }
>
> +static void reset_revmap_and_unmap(struct kvm_s2_mmu *mmu, bool may_block)
> +{
> + mtree_destroy(&mmu->nested_revmap_mt);
> + kvm_stage2_unmap_range(mmu, 0, kvm_phys_size(mmu), may_block);
> + mmu->nested_revmap_broken = false;
> +}
> +
> +static void unmap_mmu_ipa_range(struct kvm_s2_mmu *mmu, gpa_t gpa,
> + size_t unmap_size, bool may_block)
> +{
> + struct maple_tree *mt = &mmu->nested_revmap_mt;
> + gpa_t start = gpa;
> + gpa_t end = gpa + unmap_size - 1;
> + u64 entry;
> + size_t entry_size;
> + bool unlock, fallback;
> + MA_STATE(mas, mt, gpa, end);
> +
> + if (mmu->nested_revmap_broken) {
> + unlock = false;
> + fallback = true;
> + goto fin;
> + }
Using booleans to affect the control flow reads really badly. I'd
expect this to simply be:
if (...) {
reset_revmap_and_unmap(mmu, may_block);
return;
}
> +
> + mtree_lock(mt);
> + entry = (u64)mas_find_range(&mas, end);
> +
> + while (entry && mas.index <= end) {
> + start = mas.last + 1;
> + entry_size = mas.last - mas.index + 1;
> + /*
> + * Give up and invalidate this s2 mmu if the unmap range
> + * touches any polluted range.
> + */
> + if (entry & UNKNOWN_IPA) {
> + unlock = true;
> + fallback = true;
> + goto fin;
> + }
and this to be:
if (entry & UNKNOWN_IPA) {
mtree_unlock(mt);
reset_revmap_and_unmap(mmu, may_block);
return;
}
> +
> + /*
> + * Ignore result, it is okay if a reverse mapping erase
> + * fails.
> + */
> + mas_store_gfp(&mas, NULL, GFP_NOWAIT | __GFP_ACCOUNT);
> +
> + mtree_unlock(mt);
> + kvm_stage2_unmap_range(mmu, entry & NESTED_IPA_MASK, entry_size,
> + may_block);
> + mtree_lock(mt);
> + /*
> + * Other maple tree operations during preemption could render
> + * this ma_state invalid, so reset it.
> + */
> + mas_set_range(&mas, start, end);
> + entry = (u64)mas_find_range(&mas, end);
> + }
> + unlock = true;
> + fallback = false;
> +
> +fin:
> + if (unlock)
> + mtree_unlock(mt);
> + if (fallback)
> + reset_revmap_and_unmap(mmu, may_block);
and this can eventually be greatly simplified.
> +}
> +
> +void kvm_unmap_gfn_range_nested(struct kvm *kvm, gpa_t gpa, size_t size,
> + bool may_block)
> +{
> + int i;
> +
> + if (!kvm->arch.nested_mmus_size)
> + return;
> +
> + /* TODO: accelerate this using mt of canonical s2 mmu */
> + for (i = 0; i < kvm->arch.nested_mmus_size; i++) {
> + struct kvm_s2_mmu *mmu = &kvm->arch.nested_mmus[i];
> +
> + if (kvm_s2_mmu_valid(mmu))
> + unmap_mmu_ipa_range(mmu, gpa, size, may_block);
> + }
> +}
> +
> void kvm_nested_s2_unmap(struct kvm *kvm, bool may_block)
> {
> int i;
> @@ -1163,7 +1306,7 @@ void kvm_nested_s2_unmap(struct kvm *kvm, bool may_block)
> struct kvm_s2_mmu *mmu = &kvm->arch.nested_mmus[i];
>
> if (kvm_s2_mmu_valid(mmu))
> - kvm_stage2_unmap_range(mmu, 0, kvm_phys_size(mmu), may_block);
> + reset_revmap_and_unmap(mmu, may_block);
> }
>
> kvm_invalidate_vncr_ipa(kvm, 0, BIT(kvm->arch.mmu.pgt->ia_bits));
> @@ -1848,7 +1991,7 @@ void check_nested_vcpu_requests(struct kvm_vcpu *vcpu)
>
> write_lock(&vcpu->kvm->mmu_lock);
> if (mmu->pending_unmap) {
> - kvm_stage2_unmap_range(mmu, 0, kvm_phys_size(mmu), true);
> + reset_revmap_and_unmap(mmu, true);
> mmu->pending_unmap = false;
> }
> write_unlock(&vcpu->kvm->mmu_lock);
My other concern here is related to TLB invalidation. As the guest
performs TLB invalidations that remove entries from the shadow S2,
there is no way to update the revmap to account for this.
This obviously means that the revmap becomes more and more inaccurate
over time, and that is likely to accumulate conflicting entries.
What is the plan to improve the situation on this front?
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
* Re: [PATCH 1/1] KVM: arm64: nv: Avoid full shadow s2 unmap
2026-04-15 8:38 ` Marc Zyngier
@ 2026-04-15 23:05 ` Wei-Lin Chang
0 siblings, 0 replies; 5+ messages in thread
From: Wei-Lin Chang @ 2026-04-15 23:05 UTC (permalink / raw)
To: Marc Zyngier
Cc: linux-arm-kernel, kvmarm, linux-kernel, Oliver Upton, Joey Gouly,
Suzuki K Poulose, Zenghui Yu, Catalin Marinas, Will Deacon
On Wed, Apr 15, 2026 at 09:38:55AM +0100, Marc Zyngier wrote:
> On Sat, 11 Apr 2026 13:50:24 +0100,
> Wei-Lin Chang <weilin.chang@arm.com> wrote:
> >
> > Currently we are forced to fully unmap all shadow stage-2 for a VM when
> > unmapping a page from the canonical stage-2, for example during an MMU
> > notifier call. This is because we are not tracking which canonical IPAs
> > are mapped in the shadow stage-2 page tables, hence there is no way to
> > know what to unmap.
> >
> > Create a per kvm_s2_mmu maple tree to track canonical IPA range ->
> > nested IPA range, so that it is possible to partially unmap shadow
> > stage-2 when a canonical IPA range is unmapped. The algorithm is simple
> > and conservative:
> >
> > At each shadow stage-2 map, insert the nested IPA range into the maple
> > tree, with the canonical IPA range as the key. If the canonical IPA
> > range doesn't overlap with existing ranges in the tree, insert as is,
> > and a reverse mapping for this range is established. But if the
> > canonical IPA range overlaps with any existing ranges in the tree,
> > create a new range that spans all the overlapping ranges including the
> > input range and replace those existing ranges. In the meantime, mark
> > this new spanning canonical IPA range as "polluted" indicating we lost
> > track of the nested IPA ranges that map to this canonical IPA range.
> >
> > The maple tree's 64 bit entry is enough to store the nested IPA and
> > polluted status (stored as a bit called UNKNOWN_IPA), therefore besides
> > maple tree's internal operation, memory allocation is avoided.
> >
> > Example:
> > |||| means existing range, ---- means empty range
> >
> > input: $$$$$$$$$$$$$$$$$$$$$$$$$$
> > tree: --||||-----|||||||---------||||||||||-----------
> >
> > insert spanning range and replace overlapping ones:
> > --||||-----||||||||||||||||||||||||||-----------
> > ^^^^^^^^polluted!^^^^^^^^^
>
> I think you should stick to a single terminology. It is either
> "polluted", or "unknown IPA". My preference goes to the latter, as the
> former is not very descriptive in this context.
Sure, I agree.
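[Editorial note: the insertion rule described in the quoted commit message
can be modeled in plain userspace C, with a small array standing in for the
maple tree. The names (`revmap_record`, `tree`) and the simplified merge are
illustrative only, and the patch's perm-update fast path is omitted:]

```c
#include <assert.h>
#include <stdint.h>

#define VALID_ENTRY (1ULL << 63)
#define UNKNOWN_IPA (1ULL << 0)

struct range { uint64_t start, end, entry; }; /* end is inclusive */

static struct range tree[16]; /* stand-in for the maple tree */
static int nranges;

/*
 * Model of the insertion rule: merge every overlapping range into one
 * spanning range marked UNKNOWN_IPA; otherwise store a valid entry
 * carrying the nested IPA.
 */
static void revmap_record(uint64_t start, uint64_t end, uint64_t nested_ipa)
{
	uint64_t new_entry;
	int i = 0, overlaps = 0;

	while (i < nranges) {
		if (tree[i].end >= start && tree[i].start <= end) {
			/* grow the span, then delete the overlapped range */
			if (tree[i].start < start)
				start = tree[i].start;
			if (tree[i].end > end)
				end = tree[i].end;
			tree[i] = tree[--nranges];
			overlaps++;
		} else {
			i++;
		}
	}

	if (overlaps)
		new_entry = UNKNOWN_IPA; /* lost track of the nested IPAs */
	else
		new_entry = VALID_ENTRY | nested_ipa;

	tree[nranges++] = (struct range){ start, end, new_entry };
}
```

A record that overlaps two existing valid entries collapses all three into a
single spanning UNKNOWN_IPA range, matching the commit message's diagram.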
>
> >
> > With the reverse map created, when a canonical IPA range gets unmapped,
> > look into each s2 mmu's maple tree and look for canonical IPA ranges
> > affected, and base on their polluted status:
> >
> > polluted -> fall back and fully invalidate the current shadow stage-2,
> > also clear the tree
> > not polluted -> unmap the nested IPA range, and remove the reverse map
> > entry
> >
> > Suggested-by: Marc Zyngier <maz@kernel.org>
> > Signed-off-by: Wei-Lin Chang <weilin.chang@arm.com>
> > ---
> > arch/arm64/include/asm/kvm_host.h | 4 +
> > arch/arm64/include/asm/kvm_nested.h | 4 +
> > arch/arm64/kvm/mmu.c | 30 ++++--
> > arch/arm64/kvm/nested.c | 147 +++++++++++++++++++++++++++-
> > 4 files changed, 177 insertions(+), 8 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> > index 851f6171751c..a97bd461c1e1 100644
> > --- a/arch/arm64/include/asm/kvm_host.h
> > +++ b/arch/arm64/include/asm/kvm_host.h
> > @@ -217,6 +217,10 @@ struct kvm_s2_mmu {
> > */
> > bool nested_stage2_enabled;
> >
> > + /* canonical IPA to nested IPA range lookup */
> > + struct maple_tree nested_revmap_mt;
> > + bool nested_revmap_broken;
> > +
>
> Consider moving this boolean next to the other ones so that you don't
> create too many holes in the kvm_s2_mmu structure (use pahole to find out).
>
> But I have some misgivings about the way things are structured
> here. Only NV needs a revmap, yet this is present irrelevant of the
> nature of the VM and bloats the data structure a bit.
>
> My naive approach would have been to only keep a pointer to the
> revmap, and make that pointer NULL when the tree is "broken", and
> freed under RCU if the context isn't the correct one.
Can you explain what you mean by "if the context isn't the correct one"?
If this refers to selecting a specific kvm_s2_mmu instance for another
context, then IIUC refcnt would already be 0 and there would be no other
user of the tree.
However, I do see that RCU can be used for parallel accesses to the tree
from parallel s2 faults: when one fault's revmap store fails, RCU defers
freeing the tree until the other fault finishes using it.
>
> This would have multiple benefits: no large-ish structure embedded in
> the s2_mmu structure, no extra boolean to indicate an error condition,
> memory reclaimed earlier.
Yes I see.
>
> > #ifdef CONFIG_PTDUMP_STAGE2_DEBUGFS
> > struct dentry *shadow_pt_debugfs_dentry;
> > #endif
> > diff --git a/arch/arm64/include/asm/kvm_nested.h b/arch/arm64/include/asm/kvm_nested.h
> > index 091544e6af44..f039220e87a6 100644
> > --- a/arch/arm64/include/asm/kvm_nested.h
> > +++ b/arch/arm64/include/asm/kvm_nested.h
> > @@ -76,6 +76,8 @@ extern void kvm_s2_mmu_iterate_by_vmid(struct kvm *kvm, u16 vmid,
> > const union tlbi_info *info,
> > void (*)(struct kvm_s2_mmu *,
> > const union tlbi_info *));
> > +extern void kvm_record_nested_revmap(gpa_t gpa, struct kvm_s2_mmu *mmu,
> > + gpa_t fault_gpa, size_t map_size);
> > extern void kvm_vcpu_load_hw_mmu(struct kvm_vcpu *vcpu);
> > extern void kvm_vcpu_put_hw_mmu(struct kvm_vcpu *vcpu);
> >
> > @@ -164,6 +166,8 @@ extern int kvm_s2_handle_perm_fault(struct kvm_vcpu *vcpu,
> > struct kvm_s2_trans *trans);
> > extern int kvm_inject_s2_fault(struct kvm_vcpu *vcpu, u64 esr_el2);
> > extern void kvm_nested_s2_wp(struct kvm *kvm);
> > +extern void kvm_unmap_gfn_range_nested(struct kvm *kvm, gpa_t gpa, size_t size,
> > + bool may_block);
> > extern void kvm_nested_s2_unmap(struct kvm *kvm, bool may_block);
> > extern void kvm_nested_s2_flush(struct kvm *kvm);
> >
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index d089c107d9b7..4c9b9cf6dc43 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -5,6 +5,7 @@
> > */
> >
> > #include <linux/acpi.h>
> > +#include <linux/maple_tree.h>
> > #include <linux/mman.h>
> > #include <linux/kvm_host.h>
> > #include <linux/io.h>
> > @@ -1099,6 +1100,7 @@ void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
> > {
> > struct kvm *kvm = kvm_s2_mmu_to_kvm(mmu);
> > struct kvm_pgtable *pgt = NULL;
> > + struct maple_tree *mt = &mmu->nested_revmap_mt;
> >
> > write_lock(&kvm->mmu_lock);
> > pgt = mmu->pgt;
> > @@ -1108,8 +1110,11 @@ void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
> > free_percpu(mmu->last_vcpu_ran);
> > }
> >
> > - if (kvm_is_nested_s2_mmu(kvm, mmu))
> > + if (kvm_is_nested_s2_mmu(kvm, mmu)) {
> > + if (!mtree_empty(mt))
> > + mtree_destroy(mt);
> > kvm_init_nested_s2_mmu(mmu);
> > + }
> >
> > write_unlock(&kvm->mmu_lock);
> >
> > @@ -1631,6 +1636,10 @@ static int gmem_abort(const struct kvm_s2_fault_desc *s2fd)
> > goto out_unlock;
> > }
> >
> > + if (s2fd->nested)
> > + kvm_record_nested_revmap(gfn << PAGE_SHIFT, pgt->mmu,
> > + s2fd->fault_ipa, PAGE_SIZE);
> > +
> > ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, s2fd->fault_ipa, PAGE_SIZE,
> > __pfn_to_phys(pfn), prot,
> > memcache, flags);
> > @@ -2031,6 +2040,13 @@ static int kvm_s2_fault_map(const struct kvm_s2_fault_desc *s2fd,
> > ret = KVM_PGT_FN(kvm_pgtable_stage2_relax_perms)(pgt, gfn_to_gpa(gfn),
> > prot, flags);
> > } else {
> > + if (s2fd->nested) {
> > + phys_addr_t ipa = gfn_to_gpa(get_canonical_gfn(s2fd, s2vi));
> > +
> > + ipa &= ~(mapping_size - 1);
>
> I guess it'd be worth adding a helper for this instead of duplicating
> the existing code.
Ack.
>
> > + kvm_record_nested_revmap(ipa, pgt->mmu, gfn_to_gpa(gfn),
> > + mapping_size);
>
> This worries me a bit, see below.
>
> > + }
> > ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, gfn_to_gpa(gfn), mapping_size,
> > __pfn_to_phys(pfn), prot,
> > memcache, flags);
> > @@ -2388,14 +2404,16 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
> >
> > bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
> > {
> > + gpa_t gpa = range->start << PAGE_SHIFT;
> > + size_t size = (range->end - range->start) << PAGE_SHIFT;
> > + bool may_block = range->may_block;
> > +
> > if (!kvm->arch.mmu.pgt || kvm_vm_is_protected(kvm))
> > return false;
> >
> > - __unmap_stage2_range(&kvm->arch.mmu, range->start << PAGE_SHIFT,
> > - (range->end - range->start) << PAGE_SHIFT,
> > - range->may_block);
> > + __unmap_stage2_range(&kvm->arch.mmu, gpa, size, may_block);
> > + kvm_unmap_gfn_range_nested(kvm, gpa, size, may_block);
> >
> > - kvm_nested_s2_unmap(kvm, range->may_block);
> > return false;
> > }
> >
> > @@ -2673,7 +2691,7 @@ void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
> >
> > write_lock(&kvm->mmu_lock);
> > kvm_stage2_unmap_range(&kvm->arch.mmu, gpa, size, true);
> > - kvm_nested_s2_unmap(kvm, true);
> > + kvm_unmap_gfn_range_nested(kvm, gpa, size, true);
> > write_unlock(&kvm->mmu_lock);
> > }
> >
> > diff --git a/arch/arm64/kvm/nested.c b/arch/arm64/kvm/nested.c
> > index 883b6c1008fb..c9ebe969b453 100644
> > --- a/arch/arm64/kvm/nested.c
> > +++ b/arch/arm64/kvm/nested.c
> > @@ -7,6 +7,7 @@
> > #include <linux/bitfield.h>
> > #include <linux/kvm.h>
> > #include <linux/kvm_host.h>
> > +#include <linux/maple_tree.h>
> >
> > #include <asm/fixmap.h>
> > #include <asm/kvm_arm.h>
> > @@ -43,6 +44,19 @@ struct vncr_tlb {
> > */
> > #define S2_MMU_PER_VCPU 2
> >
> > +/*
> > + * Per shadow S2 reverse map (IPA -> nested IPA range) maple tree payload
> > + * layout:
> > + *
> > + * bit 63: valid, 1 for non-polluted entries, prevents the case where the
> > + * nested IPA is 0 and turns the whole value to 0
> > + * bits 55-12: nested IPA bits 55-12
> > + * bit 0: polluted, 1 for polluted, 0 for not
> > + */
> > +#define VALID_ENTRY BIT(63)
> > +#define NESTED_IPA_MASK GENMASK_ULL(55, 12)
> > +#define UNKNOWN_IPA BIT(0)
> > +
>
> This only works because you are using the "advanced" API, right?
> Otherwise, you'd be losing the high bit. It'd be good to add a comment
> so that people keep that in mind.
Sorry, I can't find any relationship between the advanced API and the
topmost bit of the maple tree value; what am I missing?
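[Editorial note: the payload layout from the quoted comment can be exercised
in userspace C. The macro values mirror the patch's definitions; the helper
`revmap_entry` is hypothetical and only models the encoding:]

```c
#include <assert.h>
#include <stdint.h>

#define VALID_ENTRY     (1ULL << 63)
#define NESTED_IPA_MASK 0x00fffffffffff000ULL /* GENMASK_ULL(55, 12) */
#define UNKNOWN_IPA     (1ULL << 0)

/* Pack a page-aligned nested IPA into a non-NULL tree payload. */
static uint64_t revmap_entry(uint64_t nested_ipa)
{
	return VALID_ENTRY | (nested_ipa & NESTED_IPA_MASK);
}
```

Because bit 63 is always set for a non-polluted entry, a nested IPA of 0
still produces a non-zero payload, which is what keeps the entry
distinguishable from an empty maple tree slot.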
>
> > void kvm_init_nested(struct kvm *kvm)
> > {
> > kvm->arch.nested_mmus = NULL;
> > @@ -769,12 +783,57 @@ static struct kvm_s2_mmu *get_s2_mmu_nested(struct kvm_vcpu *vcpu)
> > return s2_mmu;
> > }
> >
> > +void kvm_record_nested_revmap(gpa_t ipa, struct kvm_s2_mmu *mmu,
> > + gpa_t fault_ipa, size_t map_size)
> > +{
> > + struct maple_tree *mt = &mmu->nested_revmap_mt;
> > + gpa_t start = ipa;
> > + gpa_t end = ipa + map_size - 1;
> > + u64 entry, new_entry = 0;
> > + MA_STATE(mas, mt, start, end);
> > +
> > + if (mmu->nested_revmap_broken)
> > + return;
> > +
> > + mtree_lock(mt);
> > + entry = (u64)mas_find_range(&mas, end);
> > +
> > + if (entry) {
> > + /* maybe just a perm update... */
> > + if (!(entry & UNKNOWN_IPA) && mas.index == start &&
> > + mas.last == end &&
> > + fault_ipa == (entry & NESTED_IPA_MASK))
> > + goto unlock;
> > + /*
> > + * Create a "polluted" range that spans all the overlapping
> > + * ranges and store it.
> > + */
> > + while (entry && mas.index <= end) {
> > + start = min(mas.index, start);
> > + end = max(mas.last, end);
> > + entry = (u64)mas_find_range(&mas, end);
> > + }
> > + new_entry |= UNKNOWN_IPA;
> > + } else {
> > + new_entry |= fault_ipa;
> > + new_entry |= VALID_ENTRY;
> > + }
> > +
> > + mas_set_range(&mas, start, end);
> > + if (mas_store_gfp(&mas, (void *)new_entry, GFP_NOWAIT | __GFP_ACCOUNT))
> > + mmu->nested_revmap_broken = true;
>
> Can we try and minimise the risk of allocation failure here?
>
> user_mem_abort() tries very hard to pre-allocate pages for page
> tables by maintaining a memcache. Can we have a similar approach for
> the revmap?
Unfortunately, as I understand it, the maple tree can only pre-allocate
for a store when the range and the entry to be stored are given, but in
this case we must inspect the tree to get that information after we hold
the mmu and maple tree locks. It is possible to do a two-pass approach:
pre-allocate -> take MMU lock -> take maple tree lock -> revalidate that
what we pre-allocated is still usable (nobody changed the tree before we
took the maple tree lock).
But I am not fond of this extra complexity...
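[Editorial note: the two-pass flow described above can be sketched as a
userspace model. Nothing here is real maple tree API; `tree_seq` merely
stands in for however a concurrent tree change would be detected:]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

static unsigned long tree_seq; /* bumped on every tree mutation */

struct prealloc { void *node; unsigned long seq; };

/* Pass 1: allocate with a sleepable allocation, before taking any lock. */
static bool revmap_prealloc(struct prealloc *p)
{
	p->node = malloc(64); /* stand-in for a maple node allocation */
	p->seq = tree_seq;    /* snapshot the tree's state */
	return p->node != NULL;
}

/* Pass 2: with the locks held, check the tree was left untouched. */
static bool revmap_prealloc_valid(const struct prealloc *p)
{
	return p->seq == tree_seq;
}
```

If the revalidation fails, the caller would have to either retry or fall
back to the GFP_NOWAIT path, which is the extra complexity being weighed
here.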
>
> > +unlock:
> > + mtree_unlock(mt);
> > +}
> > +
> > void kvm_init_nested_s2_mmu(struct kvm_s2_mmu *mmu)
> > {
> > /* CnP being set denotes an invalid entry */
> > mmu->tlb_vttbr = VTTBR_CNP_BIT;
> > mmu->nested_stage2_enabled = false;
> > atomic_set(&mmu->refcnt, 0);
> > + mt_init(&mmu->nested_revmap_mt);
> > + mmu->nested_revmap_broken = false;
> > }
> >
> > void kvm_vcpu_load_hw_mmu(struct kvm_vcpu *vcpu)
> > @@ -1150,6 +1209,90 @@ void kvm_nested_s2_wp(struct kvm *kvm)
> > kvm_invalidate_vncr_ipa(kvm, 0, BIT(kvm->arch.mmu.pgt->ia_bits));
> > }
> >
> > +static void reset_revmap_and_unmap(struct kvm_s2_mmu *mmu, bool may_block)
> > +{
> > + mtree_destroy(&mmu->nested_revmap_mt);
> > + kvm_stage2_unmap_range(mmu, 0, kvm_phys_size(mmu), may_block);
> > + mmu->nested_revmap_broken = false;
> > +}
> > +
> > +static void unmap_mmu_ipa_range(struct kvm_s2_mmu *mmu, gpa_t gpa,
> > + size_t unmap_size, bool may_block)
> > +{
> > + struct maple_tree *mt = &mmu->nested_revmap_mt;
> > + gpa_t start = gpa;
> > + gpa_t end = gpa + unmap_size - 1;
> > + u64 entry;
> > + size_t entry_size;
> > + bool unlock, fallback;
> > + MA_STATE(mas, mt, gpa, end);
> > +
> > + if (mmu->nested_revmap_broken) {
> > + unlock = false;
> > + fallback = true;
> > + goto fin;
> > + }
>
> Using booleans to affect the control flow reads really badly. I'd
> expect this to simply be:
>
> if (...) {
> reset_revmap_and_unmap(mmu, may_block);
> return;
> }
>
> > +
> > + mtree_lock(mt);
> > + entry = (u64)mas_find_range(&mas, end);
> > +
> > + while (entry && mas.index <= end) {
> > + start = mas.last + 1;
> > + entry_size = mas.last - mas.index + 1;
> > + /*
> > + * Give up and invalidate this s2 mmu if the unmap range
> > + * touches any polluted range.
> > + */
> > + if (entry & UNKNOWN_IPA) {
> > + unlock = true;
> > + fallback = true;
> > + goto fin;
> > + }
>
> and this to be:
>
> if (entry & UNKNOWN_IPA) {
> mtree_unlock(mt);
> reset_revmap_and_unmap(mmu, may_block);
> return;
> }
>
> > +
> > + /*
> > + * Ignore result, it is okay if a reverse mapping erase
> > + * fails.
> > + */
> > + mas_store_gfp(&mas, NULL, GFP_NOWAIT | __GFP_ACCOUNT);
> > +
> > + mtree_unlock(mt);
> > + kvm_stage2_unmap_range(mmu, entry & NESTED_IPA_MASK, entry_size,
> > + may_block);
> > + mtree_lock(mt);
> > + /*
> > + * Other maple tree operations during preemption could render
> > + * this ma_state invalid, so reset it.
> > + */
> > + mas_set_range(&mas, start, end);
> > + entry = (u64)mas_find_range(&mas, end);
> > + }
> > + unlock = true;
> > + fallback = false;
> > +
> > +fin:
> > + if (unlock)
> > + mtree_unlock(mt);
> > + if (fallback)
> > + reset_revmap_and_unmap(mmu, may_block);
>
> and this can eventually be greatly simplified.
Sure, I agree.
>
> > +}
> > +
> > +void kvm_unmap_gfn_range_nested(struct kvm *kvm, gpa_t gpa, size_t size,
> > + bool may_block)
> > +{
> > + int i;
> > +
> > + if (!kvm->arch.nested_mmus_size)
> > + return;
> > +
> > + /* TODO: accelerate this using mt of canonical s2 mmu */
> > + for (i = 0; i < kvm->arch.nested_mmus_size; i++) {
> > + struct kvm_s2_mmu *mmu = &kvm->arch.nested_mmus[i];
> > +
> > + if (kvm_s2_mmu_valid(mmu))
> > + unmap_mmu_ipa_range(mmu, gpa, size, may_block);
> > + }
> > +}
> > +
> > void kvm_nested_s2_unmap(struct kvm *kvm, bool may_block)
> > {
> > int i;
> > @@ -1163,7 +1306,7 @@ void kvm_nested_s2_unmap(struct kvm *kvm, bool may_block)
> > struct kvm_s2_mmu *mmu = &kvm->arch.nested_mmus[i];
> >
> > if (kvm_s2_mmu_valid(mmu))
> > - kvm_stage2_unmap_range(mmu, 0, kvm_phys_size(mmu), may_block);
> > + reset_revmap_and_unmap(mmu, may_block);
> > }
> >
> > kvm_invalidate_vncr_ipa(kvm, 0, BIT(kvm->arch.mmu.pgt->ia_bits));
> > @@ -1848,7 +1991,7 @@ void check_nested_vcpu_requests(struct kvm_vcpu *vcpu)
> >
> > write_lock(&vcpu->kvm->mmu_lock);
> > if (mmu->pending_unmap) {
> > - kvm_stage2_unmap_range(mmu, 0, kvm_phys_size(mmu), true);
> > + reset_revmap_and_unmap(mmu, true);
> > mmu->pending_unmap = false;
> > }
> > write_unlock(&vcpu->kvm->mmu_lock);
>
> My other concern here is related to TLB invalidation. As the guest
> performs TLB invalidations that remove entries from the shadow S2,
> there is no way to update the revmap to account for this.
>
> This obviously means that the revmap becomes more and more inaccurate
> over time, and that is likely to accumulate conflicting entries.
>
> What is the plan to improve the situation on this front?
Right now I think a direct map going from nested IPA to canonical IPA
could work without adding too much complexity, provided we keep the
reverse map and the direct map in lockstep (the direct map holding the
same mappings as the reverse map, just in reverse).
I'll try to do that and include it in the next iteration.
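[Editorial note: the lockstep idea can be modeled in userspace C with two
paired tables that are only ever updated together. With plain arrays a
linear scan suffices either way; in the kernel the point would be a second
maple tree keyed by nested IPA. All names here are hypothetical:]

```c
#include <assert.h>
#include <stdint.h>

#define MAX_MAPS 16

/* The two columns share an index, so they always stay in lockstep. */
static uint64_t rev_canonical[MAX_MAPS]; /* canonical IPA (revmap key) */
static uint64_t rev_nested[MAX_MAPS];    /* nested IPA (direct-map key) */
static int nmaps;

static void map_insert(uint64_t canonical, uint64_t nested)
{
	rev_canonical[nmaps] = canonical;
	rev_nested[nmaps] = nested;
	nmaps++;
}

/* A guest TLBI on a nested IPA can now locate and drop the revmap entry. */
static void map_erase_by_nested(uint64_t nested)
{
	for (int i = 0; i < nmaps; i++) {
		if (rev_nested[i] == nested) {
			/* swap-delete keeps both columns consistent */
			rev_canonical[i] = rev_canonical[nmaps - 1];
			rev_nested[i] = rev_nested[nmaps - 1];
			nmaps--;
			return;
		}
	}
}
```

This is what would let guest TLB invalidations prune stale revmap entries
instead of letting them accumulate, addressing the concern raised above.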
Thanks,
Wei-Lin Chang
>
> Thanks,
>
> M.
>
> --
> Without deviation from the norm, progress is not possible.