public inbox for linux-arm-kernel@lists.infradead.org
* [PATCH 0/4] KVM: arm64: nv: Implement nested stage-2 reverse map
@ 2026-03-30 10:06 Wei-Lin Chang
  2026-03-30 10:06 ` [PATCH 1/4] KVM: arm64: nv: Avoid full shadow s2 unmap Wei-Lin Chang
                   ` (3 more replies)
  0 siblings, 4 replies; 6+ messages in thread
From: Wei-Lin Chang @ 2026-03-30 10:06 UTC (permalink / raw)
  To: linux-arm-kernel, kvmarm, linux-kernel
  Cc: Marc Zyngier, Oliver Upton, Joey Gouly, Suzuki K Poulose,
	Zenghui Yu, Catalin Marinas, Will Deacon, Wei-Lin Chang

Hi,

This series optimizes shadow s2 MMU unmapping during MMU notifier calls.

Motivation
==========

KVM registers MMU notifiers to unmap stage-2 mappings for the guest when
the backing memory's userspace VA to PA translation has changed, for
example due to memory reclaim or migration. In the non-NV case this is
straightforward: the registered function simply unmaps the VM's IPA from
the stage-2 page tables. In the NV case, however, the nested MMUs store
nested IPA to PA mappings, and we have no clue which of these nested
mappings are backed by the memory that the MMU notifiers are unmapping.
The consequence is that, since we don't know which nested mappings
should be removed, to be safe we can only unmap every nested MMU in its
entirety. This kills performance when MMU notifiers are called often,
and we would like a better alternative to unmapping all shadow stage-2s
every time.

Design
======

The basic idea is to create a reverse map from the canonical IPA to the
nested IPA, so that when the MMU notifier informs us of a canonical IPA
range that must be unmapped, we can look up the reverse map to find the
affected nested IPA ranges and unmap them from the nested MMU. To
achieve fine-grained unmapping, each nested MMU is equipped with its own
reverse map.

A maple tree is chosen to store the reverse map, mainly for its good
support for dealing with ranges. Two methods of storing the reverse map
were considered: using either the canonical IPA or the PA as the key for
the tree; in both cases the stored value is the nested IPA range. This
series implements the method using the canonical IPA as the key, which I
believe is the better scheme. A comparison between the two is presented
in a later section.

It is possible for a nested context to have multiple nested IPA ranges
mapped to the same canonical IPA. In these cases our reverse map should
ideally contain 1-to-many relations, so that we are able to find all
nested IPA ranges to unmap during MMU notifiers. However, since this
requires more information than a 64-bit maple tree value can store, we
would be forced to keep the information in separately allocated data
pointed to by the maple tree value. This creates extra memory we have to
manage, and tracking 1-to-many mappings, for example with a linked list
of nested IPA ranges, increases the maintenance effort.

Instead, we introduce what we call "polluted" canonical IPA ranges:
ranges for which we have lost track of the nested IPA ranges mapped to
them. A polluted canonical IPA range is created when, at shadow stage-2
fault time, we find that the canonical IPA range we are trying to insert
into the reverse map overlaps one or more pre-existing ranges. In this
case, the minimum spanning range is calculated, marked polluted, and
inserted to replace all the pre-existing ranges.

Example:
|||| means existing range, ---- means empty range

input:            $$$$$$$$$$$$$$$$$$$$$$$$$$
tree:  --||||-----|||||||---------||||||||||-----------

free overlaps:
       --||||------------------------------------------
insert spanning polluted range:
       --||||-----||||||||||||||||||||||||||-----------
                  ^^^^^^^^polluted!^^^^^^^^^

Later, when a request to unmap a canonical IPA range arises that touches
a polluted canonical IPA range, we simply fall back to unmapping the
entire nested MMU.
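As a quick illustration, the merge-and-pollute step described above can
be simulated in plain userspace C, with a small array standing in for
the maple tree (struct range and insert_range are illustrative names,
not kernel API):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One reverse-map entry: a canonical IPA range plus its polluted flag. */
struct range { unsigned long start, end; bool polluted; };

/*
 * Insert [start, end] into an array of non-overlapping ranges. Any
 * overlapping entries are absorbed into a single spanning range that is
 * marked polluted, mirroring the scheme described above. Returns the
 * new number of entries.
 */
static size_t insert_range(struct range *r, size_t n,
			   unsigned long start, unsigned long end)
{
	bool overlap = false;
	size_t i = 0;

	while (i < n) {
		if (r[i].end >= start && r[i].start <= end) {
			/* Grow the span, drop the absorbed entry. */
			if (r[i].start < start)
				start = r[i].start;
			if (r[i].end > end)
				end = r[i].end;
			r[i] = r[--n];
			overlap = true;
		} else {
			i++;
		}
	}
	r[n].start = start;
	r[n].end = end;
	r[n].polluted = overlap;
	return n + 1;
}
```

Inserting a range that overlaps the second and third existing ranges
collapses all three lines of the figure above into one polluted span,
while a non-overlapping insert stays unpolluted.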

MMU notifier optimization
=========================

Every nested MMU keeps its own reverse map, so we would have to check
every nested MMU when unmapping canonical IPA ranges in the MMU
notifier, which is not efficient. Instead, we can leverage the canonical
stage-2 MMU's unused maple tree to point to the nested MMUs that hold
mappings for each stored canonical IPA range. This is implemented in
patch 2, with more detail in its commit message.

TLBI handling
=============

When a guest hypervisor issues a TLBI for a specific IPA range, KVM
unmaps that range from all the affected shadow stage-2s. This gives us
the opportunity to remove the corresponding reverse map entries,
lowering the probability of creating polluted reverse map ranges at
subsequent stage-2 faults.

However, the TLBI ranges are specified in nested IPA, so in order to
locate the affected ranges in the reverse map maple tree, which is a
mapping from canonical IPA to nested IPA, we can only iterate through
the entire tree and check each entry. This is implemented in patch 3.

In patch 4, we further improve this by introducing a direct map from
nested IPA to canonical IPA, allowing us to quickly locate which reverse
mapping to remove when handed a nested IPA range during TLBI handling.
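The patch-4 idea can be sketched as a pair of lookup structures kept in
sync, here simulated in userspace C with flat arrays instead of maple
trees (all names illustrative, and single addresses standing in for
ranges):

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ENTRIES 16

/* A tiny key -> value store; two of these stand in for the two trees. */
struct map { unsigned long key[MAX_ENTRIES], val[MAX_ENTRIES]; size_t n; };

static void map_put(struct map *m, unsigned long k, unsigned long v)
{
	m->key[m->n] = k;
	m->val[m->n] = v;
	m->n++;
}

static int map_del_by_key(struct map *m, unsigned long k, unsigned long *v)
{
	for (size_t i = 0; i < m->n; i++) {
		if (m->key[i] == k) {
			if (v)
				*v = m->val[i];
			m->key[i] = m->key[--m->n];
			m->val[i] = m->val[m->n];
			return 1;
		}
	}
	return 0;
}

/* At fault time, record the mapping in both directions. */
static void record(struct map *rev, struct map *dir,
		   unsigned long canonical, unsigned long nested)
{
	map_put(rev, canonical, nested);	/* reverse map */
	map_put(dir, nested, canonical);	/* direct map */
}

/*
 * TLBI handling: given only a nested IPA, the direct map yields the
 * canonical key, so the reverse-map entry is removed without scanning
 * the whole reverse map.
 */
static int tlbi_remove(struct map *rev, struct map *dir, unsigned long nested)
{
	unsigned long canonical;

	if (!map_del_by_key(dir, nested, &canonical))
		return 0;
	return map_del_by_key(rev, canonical, NULL);
}
```

The cost of keeping the second structure is one extra insert per fault,
traded against turning the TLBI-time full-tree scan into a direct lookup.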

Reverse map key, canonical IPA vs PA
====================================

This is a brief comparison of using either the canonical IPA or the PA
as the key to the reverse map, based on four aspects of the
implementation.

Reverse map creation
--------------------

Using either canonical IPA or PA as the key requires almost identical
operations.

Canonical IPA unmapping (MMU notifier)
--------------------------------------

For canonical IPA as the key, simply search the reverse map and
invalidate the retrieved nested IPA range.

For PA as the key, we must first translate the given canonical IPA range
into PA, either by

a) walking the userspace page table, or
b) calling kvm_gmem_get_pfn() if the memslot is a guest_memfd one

Further, kvm_gmem_get_pfn() forcefully allocates the physical page if
the queried canonical IPA is not faulted in. This is of course not
acceptable for our use case, so some guest_memfd changes would be
required for this to work.

Canonical IPA unmapping optimization
------------------------------------

Using either canonical IPA or PA as the key requires identical
operations.

TLBI handling
-------------

For canonical IPA as the key, as described above, we can either:

a) iterate the reverse map to find the entry to remove, or
b) create a direct map to find the canonical IPA range

For PA as the key, it is more straightforward: simply find the PA by
walking the shadow stage-2, then remove the PA range from the reverse
map. However, this still requires a page table walk.

Summary
-------

I believe it is clear that using canonical IPA as the key saves us a lot
of trouble:

a) no page table walks are required
b) we go from dealing with 3 address spaces (PA, canonical IPA, nested
IPA) to 2 (canonical IPA, nested IPA)
c) the problem with guest_memfd is circumvented

Locking
=======

All maple trees are protected by kvm.mmu_lock, therefore no maple tree
locks are taken.

Testing
=======

The current plan for testing is to enhance kselftest with NV capability,
so that we can instruct L1 and L2 to set up and access memory to
populate the shadow page tables; userspace can then trigger MMU
notifiers via e.g. munmap, mremap, etc. During these operations,
userspace can read the shadow page tables exposed in debugfs [1] to
check whether they are in the expected state.

Thanks!

[1]: https://lore.kernel.org/kvmarm/20260317182638.1592507-2-weilin.chang@arm.com

Wei-Lin Chang (4):
  KVM: arm64: nv: Avoid full shadow s2 unmap
  KVM: arm64: nv: Accelerate canonical IPA unmapping with canonical s2
    mmu maple tree
  KVM: arm64: nv: Remove reverse map entries during TLBI handling
  KVM: arm64: nv: Create nested IPA direct map to speed up reverse map
    removal

 arch/arm64/include/asm/kvm_host.h   |   7 +
 arch/arm64/include/asm/kvm_nested.h |   5 +
 arch/arm64/kvm/mmu.c                |  32 ++-
 arch/arm64/kvm/nested.c             | 342 +++++++++++++++++++++++++++-
 arch/arm64/kvm/sys_regs.c           |   3 +
 5 files changed, 382 insertions(+), 7 deletions(-)

-- 
2.43.0




* [PATCH 1/4] KVM: arm64: nv: Avoid full shadow s2 unmap
  2026-03-30 10:06 [PATCH 0/4] KVM: arm64: nv: Implement nested stage-2 reverse map Wei-Lin Chang
@ 2026-03-30 10:06 ` Wei-Lin Chang
  2026-03-31 15:16   ` kernel test robot
  2026-03-30 10:06 ` [PATCH 2/4] KVM: arm64: nv: Accelerate canonical IPA unmapping with canonical s2 mmu maple tree Wei-Lin Chang
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 6+ messages in thread
From: Wei-Lin Chang @ 2026-03-30 10:06 UTC (permalink / raw)
  To: linux-arm-kernel, kvmarm, linux-kernel
  Cc: Marc Zyngier, Oliver Upton, Joey Gouly, Suzuki K Poulose,
	Zenghui Yu, Catalin Marinas, Will Deacon, Wei-Lin Chang

Currently we are forced to fully unmap all shadow stage-2s for a VM when
unmapping a page from the canonical stage-2, for example during an MMU
notifier call. This is because we do not track which canonical IPAs are
mapped in the shadow stage-2 page tables, hence there is no way to know
what to unmap.

Create a per-kvm_s2_mmu maple tree tracking canonical IPA range ->
nested IPA range, so that it is possible to partially unmap a shadow
stage-2 when a canonical IPA range is unmapped. The algorithm is simple
and conservative:

At each shadow stage-2 map, insert the nested IPA range into the maple
tree, with the canonical IPA range as the key. If the canonical IPA
range doesn't overlap with any existing ranges in the tree, insert it as
is, and a reverse mapping for this range is established. But if the
canonical IPA range overlaps with any existing ranges in the tree, erase
those existing ranges, and create a new range that spans all the
overlapping ranges, including the input range. At the same time, mark
this new spanning canonical IPA range as "polluted", indicating that we
have lost track of the nested IPA ranges that map to it.

The maple tree's 64-bit entry is enough to store the nested IPA and the
polluted status (stored as a bit called UNKNOWN_IPA), so apart from the
maple tree's internal operations, memory allocation is avoided.

Example:
|||| means existing range, ---- means empty range

input:            $$$$$$$$$$$$$$$$$$$$$$$$$$
tree:  --||||-----|||||||---------||||||||||-----------

free overlaps:
       --||||------------------------------------------
insert spanning range:
       --||||-----||||||||||||||||||||||||||-----------
                  ^^^^^^^^polluted!^^^^^^^^^

With the reverse map in place, when a canonical IPA range gets unmapped,
look into each s2 mmu's maple tree for affected canonical IPA ranges,
and based on their polluted status:

polluted -> fall back and fully invalidate the current shadow stage-2,
            also clear the tree
not polluted -> unmap the nested IPA range, and remove the reverse map
                entry
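For illustration, the 64-bit payload layout could look like the
following userspace sketch, mirroring the NESTED_IPA_MASK/UNKNOWN_IPA
definitions in the patch (the helper function names are made up):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Mirrors the patch's payload layout: bits 55-12 hold the page-aligned
 * nested IPA, bit 0 flags a polluted range (UNKNOWN_IPA).
 */
#define NESTED_IPA_MASK	(((UINT64_C(1) << 44) - 1) << 12) /* GENMASK_ULL(55, 12) */
#define UNKNOWN_IPA	UINT64_C(1)

static uint64_t encode_entry(uint64_t nested_ipa, bool polluted)
{
	return (nested_ipa & NESTED_IPA_MASK) | (polluted ? UNKNOWN_IPA : 0);
}

static uint64_t entry_nested_ipa(uint64_t entry)
{
	return entry & NESTED_IPA_MASK;
}

static bool entry_polluted(uint64_t entry)
{
	return entry & UNKNOWN_IPA;
}
```

Since both pieces of information fit in one word, the value can be
stored directly in the tree with no side allocation.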

Suggested-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Wei-Lin Chang <weilin.chang@arm.com>
---
 arch/arm64/include/asm/kvm_host.h   |   3 +
 arch/arm64/include/asm/kvm_nested.h |   4 +
 arch/arm64/kvm/mmu.c                |  27 +++++--
 arch/arm64/kvm/nested.c             | 112 +++++++++++++++++++++++++++-
 4 files changed, 140 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 8545811e2238..1d0db7f268cc 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -217,6 +217,9 @@ struct kvm_s2_mmu {
 	 */
 	bool	nested_stage2_enabled;
 
+	/* canonical IPA to nested IPA range lookup, protected by kvm.mmu_lock */
+	struct maple_tree nested_revmap_mt;
+
 #ifdef CONFIG_PTDUMP_STAGE2_DEBUGFS
 	struct dentry *shadow_pt_debugfs_dentry;
 #endif
diff --git a/arch/arm64/include/asm/kvm_nested.h b/arch/arm64/include/asm/kvm_nested.h
index 091544e6af44..4d09d567d7f9 100644
--- a/arch/arm64/include/asm/kvm_nested.h
+++ b/arch/arm64/include/asm/kvm_nested.h
@@ -76,6 +76,8 @@ extern void kvm_s2_mmu_iterate_by_vmid(struct kvm *kvm, u16 vmid,
 				       const union tlbi_info *info,
 				       void (*)(struct kvm_s2_mmu *,
 						const union tlbi_info *));
+extern int kvm_record_nested_revmap(gpa_t gpa, struct kvm_s2_mmu *mmu,
+				    gpa_t fault_gpa, size_t map_size);
 extern void kvm_vcpu_load_hw_mmu(struct kvm_vcpu *vcpu);
 extern void kvm_vcpu_put_hw_mmu(struct kvm_vcpu *vcpu);
 
@@ -164,6 +166,8 @@ extern int kvm_s2_handle_perm_fault(struct kvm_vcpu *vcpu,
 				    struct kvm_s2_trans *trans);
 extern int kvm_inject_s2_fault(struct kvm_vcpu *vcpu, u64 esr_el2);
 extern void kvm_nested_s2_wp(struct kvm *kvm);
+extern void kvm_unmap_gfn_range_nested(struct kvm *kvm, gpa_t gpa, size_t size,
+				       bool may_block);
 extern void kvm_nested_s2_unmap(struct kvm *kvm, bool may_block);
 extern void kvm_nested_s2_flush(struct kvm *kvm);
 
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 17d64a1e11e5..6beb07d817c8 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1107,8 +1107,10 @@ void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
 		free_percpu(mmu->last_vcpu_ran);
 	}
 
-	if (kvm_is_nested_s2_mmu(kvm, mmu))
+	if (kvm_is_nested_s2_mmu(kvm, mmu)) {
+		mtree_destroy(&mmu->nested_revmap_mt);
 		kvm_init_nested_s2_mmu(mmu);
+	}
 
 	write_unlock(&kvm->mmu_lock);
 
@@ -1625,6 +1627,13 @@ static int gmem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 		goto out_unlock;
 	}
 
+	if (nested) {
+		ret = kvm_record_nested_revmap(gfn << PAGE_SHIFT, pgt->mmu,
+					       fault_ipa, PAGE_SIZE);
+		if (ret)
+			goto out_unlock;
+	}
+
 	ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, fault_ipa, PAGE_SIZE,
 						 __pfn_to_phys(pfn), prot,
 						 memcache, flags);
@@ -1922,6 +1931,12 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 		prot &= ~KVM_NV_GUEST_MAP_SZ;
 		ret = KVM_PGT_FN(kvm_pgtable_stage2_relax_perms)(pgt, fault_ipa, prot, flags);
 	} else {
+		if (nested) {
+			ret = kvm_record_nested_revmap(gfn << PAGE_SHIFT, pgt->mmu,
+						       fault_ipa, vma_pagesize);
+			if (ret)
+				goto out_unlock;
+		}
 		ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, fault_ipa, vma_pagesize,
 					     __pfn_to_phys(pfn), prot,
 					     memcache, flags);
@@ -2223,14 +2238,16 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
 
 bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 {
+	gpa_t gpa = range->start << PAGE_SHIFT;
+	size_t size = (range->end - range->start) << PAGE_SHIFT;
+	bool may_block = range->may_block;
+
 	if (!kvm->arch.mmu.pgt)
 		return false;
 
-	__unmap_stage2_range(&kvm->arch.mmu, range->start << PAGE_SHIFT,
-			     (range->end - range->start) << PAGE_SHIFT,
-			     range->may_block);
+	__unmap_stage2_range(&kvm->arch.mmu, gpa, size, may_block);
+	kvm_unmap_gfn_range_nested(kvm, gpa, size, may_block);
 
-	kvm_nested_s2_unmap(kvm, range->may_block);
 	return false;
 }
 
diff --git a/arch/arm64/kvm/nested.c b/arch/arm64/kvm/nested.c
index 883b6c1008fb..53392cc7dbae 100644
--- a/arch/arm64/kvm/nested.c
+++ b/arch/arm64/kvm/nested.c
@@ -7,6 +7,7 @@
 #include <linux/bitfield.h>
 #include <linux/kvm.h>
 #include <linux/kvm_host.h>
+#include <linux/maple_tree.h>
 
 #include <asm/fixmap.h>
 #include <asm/kvm_arm.h>
@@ -43,6 +44,16 @@ struct vncr_tlb {
  */
 #define S2_MMU_PER_VCPU		2
 
+/*
+ * Per shadow S2 reverse map (IPA -> nested IPA range) maple tree payload
+ * layout:
+ *
+ * bits 55-12: nested IPA bits 55-12
+ * bit 0: polluted, 1 for polluted, 0 for not
+ */
+#define NESTED_IPA_MASK		GENMASK_ULL(55, 12)
+#define UNKNOWN_IPA		BIT(0)
+
 void kvm_init_nested(struct kvm *kvm)
 {
 	kvm->arch.nested_mmus = NULL;
@@ -769,12 +780,54 @@ static struct kvm_s2_mmu *get_s2_mmu_nested(struct kvm_vcpu *vcpu)
 	return s2_mmu;
 }
 
+int kvm_record_nested_revmap(gpa_t ipa, struct kvm_s2_mmu *mmu,
+			     gpa_t fault_ipa, size_t map_size)
+{
+	struct maple_tree *mt = &mmu->nested_revmap_mt;
+	gpa_t start = ipa;
+	gpa_t end = ipa + map_size - 1;
+	u64 entry, new_entry = 0;
+	int r = 0;
+
+	lockdep_assert_held_write(kvm_s2_mmu_to_kvm(mmu)->mmu_lock);
+
+	MA_STATE(mas, mt, start, end);
+	entry = (u64)mas_find_range(&mas, end);
+
+	if (entry) {
+		/* maybe just a perm update... */
+		if (!(entry & UNKNOWN_IPA) && mas.index == start &&
+		    mas.last == end &&
+		    fault_ipa == (entry & NESTED_IPA_MASK))
+			goto out;
+		/*
+		 * Remove every overlapping range, then create a "polluted"
+		 * range that spans all these ranges and store it.
+		 */
+		while (entry && mas.index <= end) {
+			start = min(mas.index, start);
+			end = max(mas.last, end);
+			mas_erase(&mas);
+			entry = (u64)mas_find_range(&mas, end);
+		}
+		new_entry |= UNKNOWN_IPA;
+	} else {
+		new_entry |= fault_ipa;
+	}
+
+	mas_set_range(&mas, start, end);
+	r = mas_store_gfp(&mas, (void *)new_entry, GFP_KERNEL_ACCOUNT);
+out:
+	return r;
+}
+
 void kvm_init_nested_s2_mmu(struct kvm_s2_mmu *mmu)
 {
 	/* CnP being set denotes an invalid entry */
 	mmu->tlb_vttbr = VTTBR_CNP_BIT;
 	mmu->nested_stage2_enabled = false;
 	atomic_set(&mmu->refcnt, 0);
+	mt_init(&mmu->nested_revmap_mt);
 }
 
 void kvm_vcpu_load_hw_mmu(struct kvm_vcpu *vcpu)
@@ -1150,6 +1203,60 @@ void kvm_nested_s2_wp(struct kvm *kvm)
 	kvm_invalidate_vncr_ipa(kvm, 0, BIT(kvm->arch.mmu.pgt->ia_bits));
 }
 
+static void unmap_mmu_ipa_range(struct kvm_s2_mmu *mmu, gpa_t gpa,
+				  size_t unmap_size, bool may_block)
+{
+	struct maple_tree *mt = &mmu->nested_revmap_mt;
+	gpa_t start = gpa;
+	gpa_t end = gpa + unmap_size - 1;
+	u64 entry;
+	size_t entry_size;
+
+	MA_STATE(mas, mt, gpa, end);
+	entry = (u64)mas_find_range(&mas, end);
+
+	while (entry && mas.index <= end) {
+		start = mas.last + 1;
+		entry_size = mas.last - mas.index + 1;
+		/*
+		 * Give up and invalidate this s2 mmu if the unmap range
+		 * touches any polluted range.
+		 */
+		if (entry & UNKNOWN_IPA) {
+			mtree_destroy(mt);
+			kvm_stage2_unmap_range(mmu, 0, kvm_phys_size(mmu),
+					       may_block);
+			return;
+		}
+		mas_erase(&mas);
+		kvm_stage2_unmap_range(mmu, entry & NESTED_IPA_MASK, entry_size,
+				       may_block);
+		/*
+		 * Other maple tree operations during preemption could render
+		 * this ma_state invalid, so reset it.
+		 */
+		mas_set_range(&mas, start, end);
+		entry = (u64)mas_find_range(&mas, end);
+	}
+}
+
+void kvm_unmap_gfn_range_nested(struct kvm *kvm, gpa_t gpa, size_t size,
+				bool may_block)
+{
+	int i;
+
+	if (!kvm->arch.nested_mmus_size)
+		return;
+
+	/* TODO: accelerate this using mt of canonical s2 mmu */
+	for (i = 0; i < kvm->arch.nested_mmus_size; i++) {
+		struct kvm_s2_mmu *mmu = &kvm->arch.nested_mmus[i];
+
+		if (kvm_s2_mmu_valid(mmu))
+			unmap_mmu_ipa_range(mmu, gpa, size, may_block);
+	}
+}
+
 void kvm_nested_s2_unmap(struct kvm *kvm, bool may_block)
 {
 	int i;
@@ -1162,8 +1269,10 @@ void kvm_nested_s2_unmap(struct kvm *kvm, bool may_block)
 	for (i = 0; i < kvm->arch.nested_mmus_size; i++) {
 		struct kvm_s2_mmu *mmu = &kvm->arch.nested_mmus[i];
 
-		if (kvm_s2_mmu_valid(mmu))
+		if (kvm_s2_mmu_valid(mmu)) {
+			mtree_destroy(&mmu->nested_revmap_mt);
 			kvm_stage2_unmap_range(mmu, 0, kvm_phys_size(mmu), may_block);
+		}
 	}
 
 	kvm_invalidate_vncr_ipa(kvm, 0, BIT(kvm->arch.mmu.pgt->ia_bits));
@@ -1848,6 +1957,7 @@ void check_nested_vcpu_requests(struct kvm_vcpu *vcpu)
 
 		write_lock(&vcpu->kvm->mmu_lock);
 		if (mmu->pending_unmap) {
+			mtree_destroy(&mmu->nested_revmap_mt);
 			kvm_stage2_unmap_range(mmu, 0, kvm_phys_size(mmu), true);
 			mmu->pending_unmap = false;
 		}
-- 
2.43.0




* [PATCH 2/4] KVM: arm64: nv: Accelerate canonical IPA unmapping with canonical s2 mmu maple tree
  2026-03-30 10:06 [PATCH 0/4] KVM: arm64: nv: Implement nested stage-2 reverse map Wei-Lin Chang
  2026-03-30 10:06 ` [PATCH 1/4] KVM: arm64: nv: Avoid full shadow s2 unmap Wei-Lin Chang
@ 2026-03-30 10:06 ` Wei-Lin Chang
  2026-03-30 10:06 ` [PATCH 3/4] KVM: arm64: nv: Remove reverse map entries during TLBI handling Wei-Lin Chang
  2026-03-30 10:06 ` [PATCH 4/4] KVM: arm64: nv: Create nested IPA direct map to speed up reverse map removal Wei-Lin Chang
  3 siblings, 0 replies; 6+ messages in thread
From: Wei-Lin Chang @ 2026-03-30 10:06 UTC (permalink / raw)
  To: linux-arm-kernel, kvmarm, linux-kernel
  Cc: Marc Zyngier, Oliver Upton, Joey Gouly, Suzuki K Poulose,
	Zenghui Yu, Catalin Marinas, Will Deacon, Wei-Lin Chang

Checking every nested mmu during canonical IPA unmapping is slow,
especially when there are many valid nested mmus. We can leverage the
unused maple tree in the canonical kvm_s2_mmu to accelerate this
process.

At stage-2 fault time, besides recording the reverse map, also add an
entry in the canonical s2 mmu's maple tree, with the canonical IPA range
as the key, and the nested s2 mmu this fault is happening in encoded in
the entry.

With this acceleration information, at canonical IPA unmap time we can
look into the tree to retrieve the nested mmus affected by the unmap
much more quickly.

In terms of encoding the nested mmus in the entry, there are 62 bits
available per entry (bits 1 and 0 are reserved by the maple tree). Each
bit represents a number of nested mmus based on the total number of
nested mmus; this value grows in powers of 2, so for example:

total nested mmus: 1-62    -> each bit represents: 1 nested mmu
                   63-124  ->                      2 nested mmus
                   125-248 ->                      4 nested mmus
                   ...                             ...
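The sizing rule above can be sketched in userspace C, mirroring the
patch's mmus_per_bit_power calculation and bit mapping (function names
are illustrative):

```c
#include <assert.h>

/*
 * Each acceleration-tree entry has 62 usable bits (bits 0 and 1 are
 * reserved by the maple tree), so once the nested mmu count exceeds 62,
 * each bit must cover a power-of-2 span of mmus.
 */
static int mmus_per_bit_power(int nr_mmus)
{
	int power = 0;

	while (62 * (1 << power) < nr_mmus)
		power++;
	return power;
}

/* Map a nested mmu index to its bit in the 64-bit entry (bits 2..63). */
static int mmu_index_to_bit(int index, int power)
{
	return (index >> power) + 2;
}
```

At power > 0 a set bit no longer identifies a unique mmu, which is why
the per-range accel clearing in the patch gives up in that case.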

Suggested-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Wei-Lin Chang <weilin.chang@arm.com>
---
 arch/arm64/include/asm/kvm_host.h |   1 +
 arch/arm64/kvm/mmu.c              |   5 +-
 arch/arm64/kvm/nested.c           | 166 ++++++++++++++++++++++++++++--
 3 files changed, 163 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 1d0db7f268cc..06f83bb7ff1d 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -321,6 +321,7 @@ struct kvm_arch {
 	struct kvm_s2_mmu *nested_mmus;
 	size_t nested_mmus_size;
 	int nested_mmus_next;
+	int mmus_per_bit_power;
 
 	/* Interrupt controller */
 	struct vgic_dist	vgic;
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 6beb07d817c8..2b413d3dc790 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1009,6 +1009,8 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
 	if (kvm_is_nested_s2_mmu(kvm, mmu))
 		kvm_init_nested_s2_mmu(mmu);
 
+	mt_init(&mmu->nested_revmap_mt);
+
 	return 0;
 
 out_destroy_pgtable:
@@ -1107,8 +1109,9 @@ void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
 		free_percpu(mmu->last_vcpu_ran);
 	}
 
+	mtree_destroy(&mmu->nested_revmap_mt);
+
 	if (kvm_is_nested_s2_mmu(kvm, mmu)) {
-		mtree_destroy(&mmu->nested_revmap_mt);
 		kvm_init_nested_s2_mmu(mmu);
 	}
 
diff --git a/arch/arm64/kvm/nested.c b/arch/arm64/kvm/nested.c
index 53392cc7dbae..c7d00cb40ba5 100644
--- a/arch/arm64/kvm/nested.c
+++ b/arch/arm64/kvm/nested.c
@@ -80,7 +80,7 @@ int kvm_vcpu_init_nested(struct kvm_vcpu *vcpu)
 {
 	struct kvm *kvm = vcpu->kvm;
 	struct kvm_s2_mmu *tmp;
-	int num_mmus, ret = 0;
+	int num_mmus, power = 0, ret = 0;
 
 	if (test_bit(KVM_ARM_VCPU_HAS_EL2_E2H0, kvm->arch.vcpu_features) &&
 	    !cpus_have_final_cap(ARM64_HAS_HCR_NV1))
@@ -131,6 +131,25 @@ int kvm_vcpu_init_nested(struct kvm_vcpu *vcpu)
 
 	kvm->arch.nested_mmus_size = num_mmus;
 
+	/*
+	 * Calculate how many s2 mmus are represented by each bit in the
+	 * acceleration maple tree entries.
+	 *
+	 * power == 0 -> 1 s2 mmu
+	 * power == 1 -> 2 s2 mmus
+	 * power == 2 -> 4 s2 mmus
+	 * power == 3 -> 8 s2 mmus
+	 * etc.
+	 *
+	 * We use only the top 62 bits in the canonical s2 mmu maple tree
+	 * entries, bits 0 and 1 are not used, since maple trees reserve values
 + * with bit patterns ending in 10 that are also smaller than 4096.
+	 */
+	while (62 * (1 << power) < kvm->arch.nested_mmus_size)
+		power++;
+
+	kvm->arch.mmus_per_bit_power = power;
+
 	return 0;
 }
 
@@ -780,6 +799,119 @@ static struct kvm_s2_mmu *get_s2_mmu_nested(struct kvm_vcpu *vcpu)
 	return s2_mmu;
 }
 
+static int s2_mmu_to_accel_bit(struct kvm_s2_mmu *mmu)
+{
+	BUG_ON(&mmu->arch->mmu == mmu);
+
+	int index = mmu - mmu->arch->nested_mmus;
+	int power = mmu->arch->mmus_per_bit_power;
+
+	return (index >> power) + 2;
+}
+
+/* this returns the first s2 mmu from the span */
+static struct kvm_s2_mmu *accel_bit_to_s2_mmu(struct kvm *kvm, int bit)
+{
+	int power = kvm->arch.mmus_per_bit_power;
+	int index = (bit - 2) << power;
+
+	BUG_ON(index >= kvm->arch.nested_mmus_size);
+
+	return &kvm->arch.nested_mmus[index];
+}
+
+static void accel_clear_mmu_range(struct kvm_s2_mmu *mmu, gpa_t gpa,
+				  size_t size)
+{
+	struct maple_tree *mt = &mmu->arch->mmu.nested_revmap_mt;
+	int bit = s2_mmu_to_accel_bit(mmu);
+	void *entry, *new_entry;
+	gpa_t start = gpa;
+	gpa_t end = gpa + size - 1;
+
+	if (mmu->arch->mmus_per_bit_power > 0) {
+		/* sadly nothing we can do here... */
+		return;
+	}
+
+	MA_STATE(mas, mt, start, end);
+
+	entry = mas_find_range(&mas, end);
+	BUG_ON(!entry);
+
+	/*
+	 * 1. Ranges smaller than the queried range should not exist, because
+	 *    for the same mmu, the same ranges are added in both the accel mt
+	 *    and the mmu's mt at fault time.
+	 *
+	 * 2. Ranges larger than the queried range could exist, since
+	 *    another mmu could have a range mapped on top.
+	 *    However in this case we don't know whether there are other
+	 *    smaller ranges in this larger range that belongs to this same
+	 *    mmu, so we can't just remove the bit.
+	 */
+	if (mas.index == start && mas.last == end) {
+		new_entry = (void *)((unsigned long)entry & ~BIT(bit));
+		/*
+		 * This naturally clears the range from the mt if
+		 * new_entry == 0.
+		 */
+		mas_store_gfp(&mas, new_entry, GFP_KERNEL_ACCOUNT);
+	}
+}
+
+static void accel_clear_mmu(struct kvm_s2_mmu *mmu)
+{
+	struct maple_tree *mt = &mmu->arch->mmu.nested_revmap_mt;
+	int bit = s2_mmu_to_accel_bit(mmu);
+	void *entry, *new_entry;
+
+	if (mmu->arch->mmus_per_bit_power > 0) {
+		/* sadly nothing we can do here... */
+		return;
+	}
+
+	MA_STATE(mas, mt, 0, ULONG_MAX);
+
+	mas_for_each(&mas, entry, ULONG_MAX) {
+		new_entry = (void *)((unsigned long)entry & ~BIT(bit));
+		/*
+		 * This naturally clears the range from the mt if
+		 * new_entry == 0.
+		 */
+		mas_store_gfp(&mas, new_entry, GFP_KERNEL_ACCOUNT);
+	}
+}
+
+static int record_accel(struct kvm_s2_mmu *mmu, gpa_t gpa,
+			       size_t map_size)
+{
+	struct maple_tree *mt = &mmu->arch->mmu.nested_revmap_mt;
+	gpa_t start = gpa;
+	gpa_t end = gpa + map_size - 1;
+	u64 entry, new_entry = 0;
+
+	MA_STATE(mas, mt, start, end);
+	entry = (u64)mas_find_range(&mas, end);
+
+	/*
+	 * OR every overlapping range's entry, then create a
+	 * range that spans all these ranges and store it.
+	 */
+	while (entry && mas.index <= end) {
+		start = min(mas.index, start);
+		end = max(mas.last, end);
+		new_entry |= entry;
+		mas_erase(&mas);
+		entry = (u64)mas_find_range(&mas, end);
+	}
+
+	new_entry |= BIT(s2_mmu_to_accel_bit(mmu));
+	mas_set_range(&mas, start, end);
+
+	return mas_store_gfp(&mas, (void *)new_entry, GFP_KERNEL_ACCOUNT);
+}
+
 int kvm_record_nested_revmap(gpa_t ipa, struct kvm_s2_mmu *mmu,
 			     gpa_t fault_ipa, size_t map_size)
 {
@@ -792,6 +924,11 @@ int kvm_record_nested_revmap(gpa_t ipa, struct kvm_s2_mmu *mmu,
 	lockdep_assert_held_write(kvm_s2_mmu_to_kvm(mmu)->mmu_lock);
 
 	MA_STATE(mas, mt, start, end);
+
+	r = record_accel(mmu, ipa, map_size);
+	if (r)
+		goto out;
+
 	entry = (u64)mas_find_range(&mas, end);
 
 	if (entry) {
@@ -827,7 +964,6 @@ void kvm_init_nested_s2_mmu(struct kvm_s2_mmu *mmu)
 	mmu->tlb_vttbr = VTTBR_CNP_BIT;
 	mmu->nested_stage2_enabled = false;
 	atomic_set(&mmu->refcnt, 0);
-	mt_init(&mmu->nested_revmap_mt);
 }
 
 void kvm_vcpu_load_hw_mmu(struct kvm_vcpu *vcpu)
@@ -1224,11 +1360,13 @@ static void unmap_mmu_ipa_range(struct kvm_s2_mmu *mmu, gpa_t gpa,
 		 */
 		if (entry & UNKNOWN_IPA) {
 			mtree_destroy(mt);
+			accel_clear_mmu(mmu);
 			kvm_stage2_unmap_range(mmu, 0, kvm_phys_size(mmu),
 					       may_block);
 			return;
 		}
 		mas_erase(&mas);
+		accel_clear_mmu_range(mmu, mas.index, entry_size);
 		kvm_stage2_unmap_range(mmu, entry & NESTED_IPA_MASK, entry_size,
 				       may_block);
 		/*
@@ -1243,17 +1381,27 @@ static void unmap_mmu_ipa_range(struct kvm_s2_mmu *mmu, gpa_t gpa,
 void kvm_unmap_gfn_range_nested(struct kvm *kvm, gpa_t gpa, size_t size,
 				bool may_block)
 {
-	int i;
+	struct maple_tree *mt = &kvm->arch.mmu.nested_revmap_mt;
+	gpa_t start = gpa;
+	gpa_t end = gpa + size - 1;
+	u64 entry;
+	int bit, i = 0;
+	int power = kvm->arch.mmus_per_bit_power;
+	struct kvm_s2_mmu *mmu;
+	MA_STATE(mas, mt, start, end);
 
 	if (!kvm->arch.nested_mmus_size)
 		return;
 
-	/* TODO: accelerate this using mt of canonical s2 mmu */
-	for (i = 0; i < kvm->arch.nested_mmus_size; i++) {
-		struct kvm_s2_mmu *mmu = &kvm->arch.nested_mmus[i];
+	entry = (u64)mas_find_range(&mas, end);
 
-		if (kvm_s2_mmu_valid(mmu))
-			unmap_mmu_ipa_range(mmu, gpa, size, may_block);
+	while (entry && mas.index <= end) {
+		for_each_set_bit(bit, (unsigned long *)&entry, 64) {
+			mmu = accel_bit_to_s2_mmu(kvm, bit);
+			for (i = 0; i < (1 << power); i++)
+				unmap_mmu_ipa_range(mmu + i, gpa, size, may_block);
+		}
+		entry = (u64)mas_find_range(&mas, end);
 	}
 }
 
@@ -1274,6 +1422,7 @@ void kvm_nested_s2_unmap(struct kvm *kvm, bool may_block)
 			kvm_stage2_unmap_range(mmu, 0, kvm_phys_size(mmu), may_block);
 		}
 	}
+	mtree_destroy(&kvm->arch.mmu.nested_revmap_mt);
 
 	kvm_invalidate_vncr_ipa(kvm, 0, BIT(kvm->arch.mmu.pgt->ia_bits));
 }
@@ -1958,6 +2107,7 @@ void check_nested_vcpu_requests(struct kvm_vcpu *vcpu)
 		write_lock(&vcpu->kvm->mmu_lock);
 		if (mmu->pending_unmap) {
 			mtree_destroy(&mmu->nested_revmap_mt);
+			accel_clear_mmu(mmu);
 			kvm_stage2_unmap_range(mmu, 0, kvm_phys_size(mmu), true);
 			mmu->pending_unmap = false;
 		}
-- 
2.43.0




* [PATCH 3/4] KVM: arm64: nv: Remove reverse map entries during TLBI handling
  2026-03-30 10:06 [PATCH 0/4] KVM: arm64: nv: Implement nested stage-2 reverse map Wei-Lin Chang
  2026-03-30 10:06 ` [PATCH 1/4] KVM: arm64: nv: Avoid full shadow s2 unmap Wei-Lin Chang
  2026-03-30 10:06 ` [PATCH 2/4] KVM: arm64: nv: Accelerate canonical IPA unmapping with canonical s2 mmu maple tree Wei-Lin Chang
@ 2026-03-30 10:06 ` Wei-Lin Chang
  2026-03-30 10:06 ` [PATCH 4/4] KVM: arm64: nv: Create nested IPA direct map to speed up reverse map removal Wei-Lin Chang
  3 siblings, 0 replies; 6+ messages in thread
From: Wei-Lin Chang @ 2026-03-30 10:06 UTC (permalink / raw)
  To: linux-arm-kernel, kvmarm, linux-kernel
  Cc: Marc Zyngier, Oliver Upton, Joey Gouly, Suzuki K Poulose,
	Zenghui Yu, Catalin Marinas, Will Deacon, Wei-Lin Chang

When a guest hypervisor issues a TLBI for a specific IPA range, KVM
unmaps that range from all the affected shadow stage-2s. This gives us
the opportunity to remove the corresponding reverse map entries,
lowering the probability of creating polluted reverse map ranges at
subsequent stage-2 faults.

However, the TLBI ranges are specified in nested IPA, so in order to
locate the affected ranges in the reverse map maple tree, which is a
mapping from canonical IPA to nested IPA, we can only iterate through
the entire tree and check each entry.

Suggested-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Wei-Lin Chang <weilin.chang@arm.com>
---
 arch/arm64/include/asm/kvm_nested.h |  1 +
 arch/arm64/kvm/nested.c             | 29 +++++++++++++++++++++++++++++
 arch/arm64/kvm/sys_regs.c           |  3 +++
 3 files changed, 33 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_nested.h b/arch/arm64/include/asm/kvm_nested.h
index 4d09d567d7f9..376619cdc9d5 100644
--- a/arch/arm64/include/asm/kvm_nested.h
+++ b/arch/arm64/include/asm/kvm_nested.h
@@ -76,6 +76,7 @@ extern void kvm_s2_mmu_iterate_by_vmid(struct kvm *kvm, u16 vmid,
 				       const union tlbi_info *info,
 				       void (*)(struct kvm_s2_mmu *,
 						const union tlbi_info *));
+extern void kvm_remove_nested_revmap(struct kvm_s2_mmu *mmu, u64 addr, u64 size);
 extern int kvm_record_nested_revmap(gpa_t gpa, struct kvm_s2_mmu *mmu,
 				    gpa_t fault_gpa, size_t map_size);
 extern void kvm_vcpu_load_hw_mmu(struct kvm_vcpu *vcpu);
diff --git a/arch/arm64/kvm/nested.c b/arch/arm64/kvm/nested.c
index c7d00cb40ba5..125fa21ca2e7 100644
--- a/arch/arm64/kvm/nested.c
+++ b/arch/arm64/kvm/nested.c
@@ -912,6 +912,35 @@ static int record_accel(struct kvm_s2_mmu *mmu, gpa_t gpa,
 	return mas_store_gfp(&mas, (void *)new_entry, GFP_KERNEL_ACCOUNT);
 }
 
+void kvm_remove_nested_revmap(struct kvm_s2_mmu *mmu, u64 addr, u64 size)
+{
+	/*
+	 * Iterate through the mt of this mmu, remove all unpolluted canonical
+	 * ipa ranges that map to ranges strictly within
+	 * [addr, addr + size).
+	 */
+	struct maple_tree *mt = &mmu->nested_revmap_mt;
+	void *entry;
+	u64 nested_ipa, nested_ipa_end, addr_end = addr + size;
+	size_t revmap_size;
+
+	MA_STATE(mas, mt, 0, ULONG_MAX);
+
+	mas_for_each(&mas, entry, ULONG_MAX) {
+		if ((u64)entry & UNKNOWN_IPA)
+			continue;
+
+		revmap_size = mas.last - mas.index + 1;
+		nested_ipa = (u64)entry & NESTED_IPA_MASK;
+		nested_ipa_end = nested_ipa + revmap_size;
+
+		if (nested_ipa >= addr && nested_ipa_end <= addr_end) {
+			accel_clear_mmu_range(mmu, mas.index, revmap_size);
+			mas_erase(&mas);
+		}
+	}
+}
+
 int kvm_record_nested_revmap(gpa_t ipa, struct kvm_s2_mmu *mmu,
 			     gpa_t fault_ipa, size_t map_size)
 {
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index e1001544d4f4..c7af0eac9ee4 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -4006,6 +4006,7 @@ union tlbi_info {
 static void s2_mmu_unmap_range(struct kvm_s2_mmu *mmu,
 			       const union tlbi_info *info)
 {
+	kvm_remove_nested_revmap(mmu, info->range.start, info->range.size);
 	/*
 	 * The unmap operation is allowed to drop the MMU lock and block, which
 	 * means that @mmu could be used for a different context than the one
@@ -4104,6 +4105,8 @@ static void s2_mmu_unmap_ipa(struct kvm_s2_mmu *mmu,
 	max_size = compute_tlb_inval_range(mmu, info->ipa.addr);
 	base_addr &= ~(max_size - 1);
 
+	kvm_remove_nested_revmap(mmu, base_addr, max_size);
+
 	/*
 	 * See comment in s2_mmu_unmap_range() for why this is allowed to
 	 * reschedule.
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH 4/4] KVM: arm64: nv: Create nested IPA direct map to speed up reverse map removal
  2026-03-30 10:06 [PATCH 0/4] KVM: arm64: nv: Implement nested stage-2 reverse map Wei-Lin Chang
                   ` (2 preceding siblings ...)
  2026-03-30 10:06 ` [PATCH 3/4] KVM: arm64: nv: Remove reverse map entries during TLBI handling Wei-Lin Chang
@ 2026-03-30 10:06 ` Wei-Lin Chang
  3 siblings, 0 replies; 6+ messages in thread
From: Wei-Lin Chang @ 2026-03-30 10:06 UTC (permalink / raw)
  To: linux-arm-kernel, kvmarm, linux-kernel
  Cc: Marc Zyngier, Oliver Upton, Joey Gouly, Suzuki K Poulose,
	Zenghui Yu, Catalin Marinas, Will Deacon, Wei-Lin Chang

Iterating through the whole reverse map to find which entries to remove
when handling guest hypervisor TLBIs is inefficient. Create a direct map
from nested IPA to canonical IPA so that the canonical IPA range
affected by a TLBI can be quickly determined, then remove the
corresponding entries in the reverse map.
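The pairing of the two maps can be sketched with a standalone model (all
names illustrative, NOT kernel code): flat arrays stand in for the maple
trees, `record()` plays the role of the stage-2 fault handler installing
both entries, and `tlbi_remove()` plays the TLBI path that now consults
only the entries covered by the nested IPA range:

```c
/* Standalone model of the paired maps: 'direct_map' maps nested IPA ->
 * canonical IPA, 'rev_map' maps canonical IPA -> nested IPA; flat
 * arrays stand in for the maple trees. Illustrative only. */
#include <stdbool.h>
#include <stdint.h>

struct range { uint64_t start, last, val; bool present; };

#define NR 8
static struct range direct_map[NR];
static struct range rev_map[NR];

/* Record one unpolluted 1:1 mapping in both maps, as a stage-2 fault
 * handler would. */
static void record(int i, uint64_t ipa, uint64_t nipa, uint64_t size)
{
	rev_map[i] = (struct range){ ipa, ipa + size - 1, nipa, true };
	direct_map[i] = (struct range){ nipa, nipa + size - 1, ipa, true };
}

/* TLBI path: consult only direct-map entries fully covered by the
 * nested IPA range [addr, addr + size), and drop each together with
 * its reverse-map twin. With a real range tree the lookup visits only
 * overlapping entries instead of the whole reverse map; the linear
 * probe for the twin below is a simplification of that tree lookup. */
static int tlbi_remove(uint64_t addr, uint64_t size)
{
	int removed = 0;

	for (int i = 0; i < NR; i++) {
		if (!direct_map[i].present || direct_map[i].start < addr ||
		    direct_map[i].last >= addr + size)
			continue;
		/* 1:1, so exactly one twin keyed by the canonical IPA. */
		for (int j = 0; j < NR; j++)
			if (rev_map[j].present &&
			    rev_map[j].start == direct_map[i].val)
				rev_map[j].present = false;
		direct_map[i].present = false;
		removed++;
	}
	return removed;
}
```

Removing both entries together preserves the 1:1 invariant the patch
relies on for unpolluted ranges.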

Suggested-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Wei-Lin Chang <weilin.chang@arm.com>
---
 arch/arm64/include/asm/kvm_host.h |   3 +
 arch/arm64/kvm/mmu.c              |   2 +
 arch/arm64/kvm/nested.c           | 131 ++++++++++++++++++++----------
 3 files changed, 95 insertions(+), 41 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 06f83bb7ff1d..6b0858805530 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -220,6 +220,9 @@ struct kvm_s2_mmu {
 	/* canonical IPA to nested IPA range lookup, protected by kvm.mmu_lock */
 	struct maple_tree nested_revmap_mt;
 
+	/* nested IPA to canonical IPA range lookup, protected by kvm.mmu_lock */
+	struct maple_tree nested_direct_mt;
+
 #ifdef CONFIG_PTDUMP_STAGE2_DEBUGFS
 	struct dentry *shadow_pt_debugfs_dentry;
 #endif
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 2b413d3dc790..9f27a9669ec9 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1010,6 +1010,7 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
 		kvm_init_nested_s2_mmu(mmu);
 
 	mt_init(&mmu->nested_revmap_mt);
+	mt_init(&mmu->nested_direct_mt);
 
 	return 0;
 
@@ -1112,6 +1113,7 @@ void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
 	mtree_destroy(&mmu->nested_revmap_mt);
 
 	if (kvm_is_nested_s2_mmu(kvm, mmu)) {
+		mtree_destroy(&mmu->nested_direct_mt);
 		kvm_init_nested_s2_mmu(mmu);
 	}
 
diff --git a/arch/arm64/kvm/nested.c b/arch/arm64/kvm/nested.c
index 125fa21ca2e7..4c96130abf82 100644
--- a/arch/arm64/kvm/nested.c
+++ b/arch/arm64/kvm/nested.c
@@ -45,13 +45,12 @@ struct vncr_tlb {
 #define S2_MMU_PER_VCPU		2
 
 /*
- * Per shadow S2 reverse map (IPA -> nested IPA range) maple tree payload
- * layout:
+ * Per shadow S2 reverse & direct map maple tree payload layout:
  *
- * bits 55-12: nested IPA bits 55-12
- * bit 0: polluted, 1 for polluted, 0 for not
+ * bits 55-12: {nested, canonical} IPA bits 55-12
+ * bit 0: polluted, 1 for polluted, 0 for not, only used in reverse map
  */
-#define NESTED_IPA_MASK		GENMASK_ULL(55, 12)
+#define ADDR_MASK		GENMASK_ULL(55, 12)
 #define UNKNOWN_IPA		BIT(0)
 
 void kvm_init_nested(struct kvm *kvm)
@@ -915,74 +914,118 @@ static int record_accel(struct kvm_s2_mmu *mmu, gpa_t gpa,
 void kvm_remove_nested_revmap(struct kvm_s2_mmu *mmu, u64 addr, u64 size)
 {
 	/*
-	 * Iterate through the mt of this mmu, remove all unpolluted canonical
-	 * ipa ranges that map to ranges strictly within
-	 * [addr, addr + size).
+	 * For every range in direct_mt that is completely covered by the
+	 * range being TLBIed, [addr, addr + size), remove the reverse map
+	 * entry AND its corresponding direct map entry together, provided
+	 * these conditions are met:
+	 *
+	 * 1. The TLBI range completely covers the stored nested IPA range.
+	 * 2. The reverse map is not polluted. This ensures the reverse map
+	 *    and the direct map are 1:1.
 	 */
-	struct maple_tree *mt = &mmu->nested_revmap_mt;
-	void *entry;
-	u64 nested_ipa, nested_ipa_end, addr_end = addr + size;
-	size_t revmap_size;
+	struct maple_tree *direct_mt = &mmu->nested_direct_mt;
+	struct maple_tree *revmap_mt = &mmu->nested_revmap_mt;
+	gpa_t nested_ipa_start = addr;
+	gpa_t nested_ipa_end = addr + size - 1;
+	u64 entry_ipa, entry_nested_ipa;
+	u64 ipa, ipa_end;
 
-	MA_STATE(mas, mt, 0, ULONG_MAX);
+	MA_STATE(mas_nested_ipa, direct_mt, nested_ipa_start, nested_ipa_end);
+	entry_ipa = (u64)mas_find_range(&mas_nested_ipa, nested_ipa_end);
 
-	mas_for_each(&mas, entry, ULONG_MAX) {
-		if ((u64)entry & UNKNOWN_IPA)
-			continue;
+	while (entry_ipa && mas_nested_ipa.index <= nested_ipa_end) {
+		ipa = entry_ipa & ADDR_MASK;
+		ipa_end = ipa + mas_nested_ipa.last - mas_nested_ipa.index;
 
-		revmap_size = mas.last - mas.index + 1;
-		nested_ipa = (u64)entry & NESTED_IPA_MASK;
-		nested_ipa_end = nested_ipa + revmap_size;
+		/* Use ipa range to find the corresponding entry in revmap. */
+		MA_STATE(mas_ipa, revmap_mt, ipa, ipa_end);
+		entry_nested_ipa = (u64)mas_find_range(&mas_ipa, ipa_end);
 
-		if (nested_ipa >= addr && nested_ipa_end <= addr_end) {
-			accel_clear_mmu_range(mmu, mas.index, revmap_size);
-			mas_erase(&mas);
+		/*
+		 * Reverse and direct maps are created together at s2 faults,
+		 * so every direct map range should also have a corresponding
+		 * reverse map range, though that range may be polluted.
+		 */
+		BUG_ON(!entry_nested_ipa);
+
+		/* The two conditions outlined above. */
+		if (!(entry_nested_ipa & UNKNOWN_IPA) &&
+		    mas_nested_ipa.index >= addr &&
+		    mas_nested_ipa.last <= nested_ipa_end) {
+			/*
+			 * If the reverse map isn't polluted, the direct and
+			 * reverse map are expected to be 1:1, thus they must
+			 * have the same size.
+			 */
+			BUG_ON(mas_ipa.last - mas_ipa.index !=
+			       mas_nested_ipa.last - mas_nested_ipa.index);
+
+			accel_clear_mmu_range(mmu, mas_ipa.index,
+					      mas_ipa.last - mas_ipa.index + 1);
+			mas_erase(&mas_ipa);
+			mas_erase(&mas_nested_ipa);
 		}
+		entry_ipa = (u64)mas_find_range(&mas_nested_ipa, nested_ipa_end);
 	}
 }
 
 int kvm_record_nested_revmap(gpa_t ipa, struct kvm_s2_mmu *mmu,
 			     gpa_t fault_ipa, size_t map_size)
 {
-	struct maple_tree *mt = &mmu->nested_revmap_mt;
-	gpa_t start = ipa;
-	gpa_t end = ipa + map_size - 1;
+	struct maple_tree *direct_mt = &mmu->nested_direct_mt;
+	struct maple_tree *revmap_mt = &mmu->nested_revmap_mt;
+	gpa_t ipa_start = ipa;
+	gpa_t ipa_end = ipa + map_size - 1;
+	gpa_t fault_ipa_end = fault_ipa + map_size - 1;
 	u64 entry, new_entry = 0;
 	int r = 0;
 
 	lockdep_assert_held_write(kvm_s2_mmu_to_kvm(mmu)->mmu_lock);
 
-	MA_STATE(mas, mt, start, end);
+	MA_STATE(mas_ipa, revmap_mt, ipa_start, ipa_end);
+	MA_STATE(mas_nested_ipa, direct_mt, fault_ipa, fault_ipa_end);
 
 	r = record_accel(mmu, ipa, map_size);
 	if (r)
 		goto out;
 
-	entry = (u64)mas_find_range(&mas, end);
+	r = mas_store_gfp(&mas_nested_ipa, (void *)ipa, GFP_KERNEL_ACCOUNT);
+	/*
+	 * In the case of direct map store failure, don't clean up
+	 * record_accel()'s successfully installed accel mt entry. Keeping
+	 * it is fine as it will just cause us to check a few more s2 mmus
+	 * in the mmu notifier.
+	 */
+	if (r)
+		goto out;
+
+	entry = (u64)mas_find_range(&mas_ipa, ipa_end);
 
 	if (entry) {
 		/* maybe just a perm update... */
-		if (!(entry & UNKNOWN_IPA) && mas.index == start &&
-		    mas.last == end &&
-		    fault_ipa == (entry & NESTED_IPA_MASK))
+		if (!(entry & UNKNOWN_IPA) && mas_ipa.index == ipa_start &&
+		    mas_ipa.last == ipa_end &&
+		    fault_ipa == (entry & ADDR_MASK))
 			goto out;
 		/*
 		 * Remove every overlapping range, then create a "polluted"
 		 * range that spans all these ranges and store it.
 		 */
-		while (entry && mas.index <= end) {
-			start = min(mas.index, start);
-			end = max(mas.last, end);
-			mas_erase(&mas);
-			entry = (u64)mas_find_range(&mas, end);
+		while (entry && mas_ipa.index <= ipa_end) {
+			ipa_start = min(mas_ipa.index, ipa_start);
+			ipa_end = max(mas_ipa.last, ipa_end);
+			mas_erase(&mas_ipa);
+			entry = (u64)mas_find_range(&mas_ipa, ipa_end);
 		}
 		new_entry |= UNKNOWN_IPA;
 	} else {
 		new_entry |= fault_ipa;
 	}
 
-	mas_set_range(&mas, start, end);
-	r = mas_store_gfp(&mas, (void *)new_entry, GFP_KERNEL_ACCOUNT);
+	mas_set_range(&mas_ipa, ipa_start, ipa_end);
+	r = mas_store_gfp(&mas_ipa, (void *)new_entry, GFP_KERNEL_ACCOUNT);
+	if (r)
+		mas_erase(&mas_nested_ipa);
 out:
 	return r;
 }
@@ -1371,13 +1414,14 @@ void kvm_nested_s2_wp(struct kvm *kvm)
 static void unmap_mmu_ipa_range(struct kvm_s2_mmu *mmu, gpa_t gpa,
 				  size_t unmap_size, bool may_block)
 {
-	struct maple_tree *mt = &mmu->nested_revmap_mt;
+	struct maple_tree *direct_mt = &mmu->nested_direct_mt;
+	struct maple_tree *revmap_mt = &mmu->nested_revmap_mt;
 	gpa_t start = gpa;
 	gpa_t end = gpa + unmap_size - 1;
 	u64 entry;
 	size_t entry_size;
 
-	MA_STATE(mas, mt, gpa, end);
+	MA_STATE(mas, revmap_mt, gpa, end);
 	entry = (u64)mas_find_range(&mas, end);
 
 	while (entry && mas.index <= end) {
@@ -1388,15 +1432,18 @@ static void unmap_mmu_ipa_range(struct kvm_s2_mmu *mmu, gpa_t gpa,
 		 * touches any polluted range.
 		 */
 		if (entry & UNKNOWN_IPA) {
-			mtree_destroy(mt);
+			mtree_destroy(direct_mt);
+			mtree_destroy(revmap_mt);
 			accel_clear_mmu(mmu);
 			kvm_stage2_unmap_range(mmu, 0, kvm_phys_size(mmu),
 					       may_block);
 			return;
 		}
+		/* not polluted, direct map and reverse map must be 1:1 */
+		mtree_erase(direct_mt, entry & ADDR_MASK);
 		mas_erase(&mas);
 		accel_clear_mmu_range(mmu, mas.index, entry_size);
-		kvm_stage2_unmap_range(mmu, entry & NESTED_IPA_MASK, entry_size,
+		kvm_stage2_unmap_range(mmu, entry & ADDR_MASK, entry_size,
 				       may_block);
 		/*
 		 * Other maple tree operations during preemption could render
@@ -1447,6 +1494,7 @@ void kvm_nested_s2_unmap(struct kvm *kvm, bool may_block)
 		struct kvm_s2_mmu *mmu = &kvm->arch.nested_mmus[i];
 
 		if (kvm_s2_mmu_valid(mmu)) {
+			mtree_destroy(&mmu->nested_direct_mt);
 			mtree_destroy(&mmu->nested_revmap_mt);
 			kvm_stage2_unmap_range(mmu, 0, kvm_phys_size(mmu), may_block);
 		}
@@ -2135,6 +2183,7 @@ void check_nested_vcpu_requests(struct kvm_vcpu *vcpu)
 
 		write_lock(&vcpu->kvm->mmu_lock);
 		if (mmu->pending_unmap) {
+			mtree_destroy(&mmu->nested_direct_mt);
 			mtree_destroy(&mmu->nested_revmap_mt);
 			accel_clear_mmu(mmu);
 			kvm_stage2_unmap_range(mmu, 0, kvm_phys_size(mmu), true);
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH 1/4] KVM: arm64: nv: Avoid full shadow s2 unmap
  2026-03-30 10:06 ` [PATCH 1/4] KVM: arm64: nv: Avoid full shadow s2 unmap Wei-Lin Chang
@ 2026-03-31 15:16   ` kernel test robot
  0 siblings, 0 replies; 6+ messages in thread
From: kernel test robot @ 2026-03-31 15:16 UTC (permalink / raw)
  To: Wei-Lin Chang, linux-arm-kernel, kvmarm, linux-kernel
  Cc: llvm, oe-kbuild-all, Marc Zyngier, Oliver Upton, Joey Gouly,
	Suzuki K Poulose, Zenghui Yu, Catalin Marinas, Will Deacon,
	Wei-Lin Chang

Hi Wei-Lin,

kernel test robot noticed the following build errors:

[auto build test ERROR on next-20260327]
[cannot apply to kvmarm/next arm64/for-next/core arm/for-next arm/fixes soc/for-next v7.0-rc6 v7.0-rc5 v7.0-rc4 linus/master v7.0-rc6]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Wei-Lin-Chang/KVM-arm64-nv-Avoid-full-shadow-s2-unmap/20260330-230122
base:   next-20260327
patch link:    https://lore.kernel.org/r/20260330100633.2817076-2-weilin.chang%40arm.com
patch subject: [PATCH 1/4] KVM: arm64: nv: Avoid full shadow s2 unmap
config: arm64-randconfig-002-20260331 (https://download.01.org/0day-ci/archive/20260331/202603312322.Bli3MO76-lkp@intel.com/config)
compiler: clang version 18.1.8 (https://github.com/llvm/llvm-project 3b5b5c1ec4a3095ab096dd780e84d7ab81f3d7ff)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260331/202603312322.Bli3MO76-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202603312322.Bli3MO76-lkp@intel.com/

All errors (new ones prefixed by >>):

>> arch/arm64/kvm/nested.c:792:2: error: member reference type 'rwlock_t' (aka 'struct rwlock') is not a pointer; did you mean to use '.'?
     792 |         lockdep_assert_held_write(kvm_s2_mmu_to_kvm(mmu)->mmu_lock);
         |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/lockdep.h:291:22: note: expanded from macro 'lockdep_assert_held_write'
     291 |         do { lockdep_assert(lockdep_is_held_type(l, 0)); __assume_ctx_lock(l); } while (0)
         |              ~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/lockdep.h:253:64: note: expanded from macro 'lockdep_is_held_type'
     253 | #define lockdep_is_held_type(lock, r)   lock_is_held_type(&(lock)->dep_map, (r))
         |                                                                  ^
   include/linux/lockdep.h:279:32: note: expanded from macro 'lockdep_assert'
     279 |         do { WARN_ON(debug_locks && !(cond)); } while (0)
         |              ~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~
   include/asm-generic/bug.h:110:25: note: expanded from macro 'WARN_ON'
     110 |         int __ret_warn_on = !!(condition);                              \
         |                                ^~~~~~~~~
>> arch/arm64/kvm/nested.c:792:2: error: cannot take the address of an rvalue of type 'struct lockdep_map'
     792 |         lockdep_assert_held_write(kvm_s2_mmu_to_kvm(mmu)->mmu_lock);
         |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/lockdep.h:291:22: note: expanded from macro 'lockdep_assert_held_write'
     291 |         do { lockdep_assert(lockdep_is_held_type(l, 0)); __assume_ctx_lock(l); } while (0)
         |              ~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/lockdep.h:253:57: note: expanded from macro 'lockdep_is_held_type'
     253 | #define lockdep_is_held_type(lock, r)   lock_is_held_type(&(lock)->dep_map, (r))
         |                                                           ^
   include/linux/lockdep.h:279:32: note: expanded from macro 'lockdep_assert'
     279 |         do { WARN_ON(debug_locks && !(cond)); } while (0)
         |              ~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~
   include/asm-generic/bug.h:110:25: note: expanded from macro 'WARN_ON'
     110 |         int __ret_warn_on = !!(condition);                              \
         |                                ^~~~~~~~~
   2 errors generated.
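For context, both diagnostics stem from the same mistake:
lockdep_assert_held_write() ultimately applies `->dep_map` to its
argument, so it must be handed a pointer to the lock (presumably
`&kvm_s2_mmu_to_kvm(mmu)->mmu_lock` in a respin). A minimal sketch of
the failure mode, with toy stand-ins rather than the real kernel types:

```c
/* Toy stand-ins for rwlock_t and struct kvm; NOT the kernel types. */
struct toy_lock { int dep_map; };
struct toy_kvm { struct toy_lock mmu_lock; };

/* Like lockdep_assert_held_write(), this macro dereferences its
 * argument with '->', so the caller must pass a pointer to the lock. */
#define toy_assert_held(l) ((l)->dep_map)

static int check(struct toy_kvm *kvm)
{
	/* toy_assert_held(kvm->mmu_lock) would fail to compile, just
	 * like the report above: '->' applied to a non-pointer, and an
	 * rvalue whose address cannot be taken. Passing the address of
	 * the member instead compiles cleanly: */
	return toy_assert_held(&kvm->mmu_lock);
}
```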


vim +792 arch/arm64/kvm/nested.c

   782	
   783	int kvm_record_nested_revmap(gpa_t ipa, struct kvm_s2_mmu *mmu,
   784				     gpa_t fault_ipa, size_t map_size)
   785	{
   786		struct maple_tree *mt = &mmu->nested_revmap_mt;
   787		gpa_t start = ipa;
   788		gpa_t end = ipa + map_size - 1;
   789		u64 entry, new_entry = 0;
   790		int r = 0;
   791	
 > 792		lockdep_assert_held_write(kvm_s2_mmu_to_kvm(mmu)->mmu_lock);
   793	
   794		MA_STATE(mas, mt, start, end);
   795		entry = (u64)mas_find_range(&mas, end);
   796	
   797		if (entry) {
   798			/* maybe just a perm update... */
   799			if (!(entry & UNKNOWN_IPA) && mas.index == start &&
   800			    mas.last == end &&
   801			    fault_ipa == (entry & NESTED_IPA_MASK))
   802				goto out;
   803			/*
   804			 * Remove every overlapping range, then create a "polluted"
   805			 * range that spans all these ranges and store it.
   806			 */
   807			while (entry && mas.index <= end) {
   808				start = min(mas.index, start);
   809				end = max(mas.last, end);
   810				mas_erase(&mas);
   811				entry = (u64)mas_find_range(&mas, end);
   812			}
   813			new_entry |= UNKNOWN_IPA;
   814		} else {
   815			new_entry |= fault_ipa;
   816		}
   817	
   818		mas_set_range(&mas, start, end);
   819		r = mas_store_gfp(&mas, (void *)new_entry, GFP_KERNEL_ACCOUNT);
   820	out:
   821		return r;
   822	}
   823	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2026-03-31 15:16 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-03-30 10:06 [PATCH 0/4] KVM: arm64: nv: Implement nested stage-2 reverse map Wei-Lin Chang
2026-03-30 10:06 ` [PATCH 1/4] KVM: arm64: nv: Avoid full shadow s2 unmap Wei-Lin Chang
2026-03-31 15:16   ` kernel test robot
2026-03-30 10:06 ` [PATCH 2/4] KVM: arm64: nv: Accelerate canonical IPA unmapping with canonical s2 mmu maple tree Wei-Lin Chang
2026-03-30 10:06 ` [PATCH 3/4] KVM: arm64: nv: Remove reverse map entries during TLBI handling Wei-Lin Chang
2026-03-30 10:06 ` [PATCH 4/4] KVM: arm64: nv: Create nested IPA direct map to speed up reverse map removal Wei-Lin Chang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox