* [PATCH 0/2] KVM: arm64: Destroy the stage-2 page-table periodically
From: Raghavendra Rao Ananta @ 2025-07-24 23:51 UTC
To: Oliver Upton, Marc Zyngier
Cc: Raghavendra Rao Ananta, Mingwei Zhang, linux-arm-kernel, kvmarm,
linux-kernel, kvm
Hello,
When destroying a fully-mapped 128G VM abruptly, the following scheduler
warning is observed:
sched: CPU 0 need_resched set for > 100018840 ns (100 ticks) without schedule
CPU: 0 UID: 0 PID: 9617 Comm: kvm_page_table_ Tainted: G O 6.16.0-smp-DEV #3 NONE
Tainted: [O]=OOT_MODULE
Call trace:
show_stack+0x20/0x38 (C)
dump_stack_lvl+0x3c/0xb8
dump_stack+0x18/0x30
resched_latency_warn+0x7c/0x88
sched_tick+0x1c4/0x268
update_process_times+0xa8/0xd8
tick_nohz_handler+0xc8/0x168
__hrtimer_run_queues+0x11c/0x338
hrtimer_interrupt+0x104/0x308
arch_timer_handler_phys+0x40/0x58
handle_percpu_devid_irq+0x8c/0x1b0
generic_handle_domain_irq+0x48/0x78
gic_handle_irq+0x1b8/0x408
call_on_irq_stack+0x24/0x30
do_interrupt_handler+0x54/0x78
el1_interrupt+0x44/0x88
el1h_64_irq_handler+0x18/0x28
el1h_64_irq+0x84/0x88
stage2_free_walker+0x30/0xa0 (P)
__kvm_pgtable_walk+0x11c/0x258
__kvm_pgtable_walk+0x180/0x258
__kvm_pgtable_walk+0x180/0x258
__kvm_pgtable_walk+0x180/0x258
kvm_pgtable_walk+0xc4/0x140
kvm_pgtable_stage2_destroy+0x5c/0xf0
kvm_free_stage2_pgd+0x6c/0xe8
kvm_uninit_stage2_mmu+0x24/0x48
kvm_arch_flush_shadow_all+0x80/0xa0
kvm_mmu_notifier_release+0x38/0x78
__mmu_notifier_release+0x15c/0x250
exit_mmap+0x68/0x400
__mmput+0x38/0x1c8
mmput+0x30/0x68
exit_mm+0xd4/0x198
do_exit+0x1a4/0xb00
do_group_exit+0x8c/0x120
get_signal+0x6d4/0x778
do_signal+0x90/0x718
do_notify_resume+0x70/0x170
el0_svc+0x74/0xd8
el0t_64_sync_handler+0x60/0xc8
el0t_64_sync+0x1b0/0x1b8
The host kernel was running with CONFIG_PREEMPT_NONE=y, and since the
page-table walk takes a considerable amount of time for a VM with such
a large number of PTEs mapped, the warning is triggered.
To mitigate this, split the walk into smaller ranges and call
cond_resched() between them. Since the path is executed during
VM destruction, after the page-table structure is unlinked from the
KVM MMU, relying on cond_resched_rwlock_write() isn't necessary.
Patch-1 splits the kvm_pgtable_stage2_destroy() function into separate
'walk' and 'free PGD' parts.
Patch-2 leverages the split to perform the walk in smaller chunks,
calling cond_resched() between them (see the sketch below).
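For reference, the chunk boundaries come from stage2_range_addr_end(),
the helper mmu.c already uses for stage2_apply_range(). A minimal sketch
of that pattern, condensed from the upstream helpers (the chunk size is
one block-level granule, e.g. 1GiB with 4K pages):

	/* Cap the walk at the next block-granule-aligned boundary. */
	static phys_addr_t stage2_range_addr_end(phys_addr_t addr, phys_addr_t end)
	{
		phys_addr_t size = kvm_granule_size(KVM_PGTABLE_MIN_BLOCK_LEVEL);
		phys_addr_t boundary = ALIGN_DOWN(addr + size, size);

		/* Overflow-safe min(boundary, end). */
		return (boundary - 1 < end - 1) ? boundary : end;
	}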
Thank you.
Raghavendra
Raghavendra Rao Ananta (2):
KVM: arm64: Split kvm_pgtable_stage2_destroy()
KVM: arm64: Destroy the stage-2 page-table periodically
arch/arm64/include/asm/kvm_pgtable.h | 19 ++++++++++++
arch/arm64/kvm/hyp/pgtable.c | 23 ++++++++++++--
arch/arm64/kvm/mmu.c | 46 +++++++++++++++++++++++++---
3 files changed, 80 insertions(+), 8 deletions(-)
base-commit: 19272b37aa4f83ca52bdf9c16d5d81bdd1354494
--
2.50.1.470.g6ba607880d-goog
* [PATCH 2/2] KVM: arm64: Destroy the stage-2 page-table periodically
From: Raghavendra Rao Ananta @ 2025-07-24 23:51 UTC
To: Oliver Upton, Marc Zyngier
Cc: Raghavendra Rao Ananta, Mingwei Zhang, linux-arm-kernel, kvmarm,
linux-kernel, kvm
When a large VM, specifically one that holds a significant number of PTEs,
gets abruptly destroyed, the following warning is seen during the
page-table walk:
sched: CPU 0 need_resched set for > 100018840 ns (100 ticks) without schedule
CPU: 0 UID: 0 PID: 9617 Comm: kvm_page_table_ Tainted: G O 6.16.0-smp-DEV #3 NONE
Tainted: [O]=OOT_MODULE
Call trace:
show_stack+0x20/0x38 (C)
dump_stack_lvl+0x3c/0xb8
dump_stack+0x18/0x30
resched_latency_warn+0x7c/0x88
sched_tick+0x1c4/0x268
update_process_times+0xa8/0xd8
tick_nohz_handler+0xc8/0x168
__hrtimer_run_queues+0x11c/0x338
hrtimer_interrupt+0x104/0x308
arch_timer_handler_phys+0x40/0x58
handle_percpu_devid_irq+0x8c/0x1b0
generic_handle_domain_irq+0x48/0x78
gic_handle_irq+0x1b8/0x408
call_on_irq_stack+0x24/0x30
do_interrupt_handler+0x54/0x78
el1_interrupt+0x44/0x88
el1h_64_irq_handler+0x18/0x28
el1h_64_irq+0x84/0x88
stage2_free_walker+0x30/0xa0 (P)
__kvm_pgtable_walk+0x11c/0x258
__kvm_pgtable_walk+0x180/0x258
__kvm_pgtable_walk+0x180/0x258
__kvm_pgtable_walk+0x180/0x258
kvm_pgtable_walk+0xc4/0x140
kvm_pgtable_stage2_destroy+0x5c/0xf0
kvm_free_stage2_pgd+0x6c/0xe8
kvm_uninit_stage2_mmu+0x24/0x48
kvm_arch_flush_shadow_all+0x80/0xa0
kvm_mmu_notifier_release+0x38/0x78
__mmu_notifier_release+0x15c/0x250
exit_mmap+0x68/0x400
__mmput+0x38/0x1c8
mmput+0x30/0x68
exit_mm+0xd4/0x198
do_exit+0x1a4/0xb00
do_group_exit+0x8c/0x120
get_signal+0x6d4/0x778
do_signal+0x90/0x718
do_notify_resume+0x70/0x170
el0_svc+0x74/0xd8
el0t_64_sync_handler+0x60/0xc8
el0t_64_sync+0x1b0/0x1b8
The warning is mostly seen on host kernels that are not configured to
force preemption, such as with CONFIG_PREEMPT_NONE=y. To avoid this,
instead of walking the entire page-table in one go, split the walk
into smaller ranges and call cond_resched() between them.
Since the path is executed during VM destruction, after the
page-table structure is unlinked from the KVM MMU, relying on
cond_resched_rwlock_write() isn't necessary.
Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
---
arch/arm64/kvm/mmu.c | 38 ++++++++++++++++++++++++++++++++++++--
1 file changed, 36 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 2942ec92c5a4..6c4b9fb1211b 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -387,6 +387,40 @@ static void stage2_flush_vm(struct kvm *kvm)
srcu_read_unlock(&kvm->srcu, idx);
}
+/*
+ * Assume that @pgt is valid and unlinked from the KVM MMU to free the
+ * page-table without taking the kvm_mmu_lock and without performing any
+ * TLB invalidations.
+ *
+ * Also, the range of addresses can be large enough to cause need_resched
+ * warnings, for instance on CONFIG_PREEMPT_NONE kernels. Hence, invoke
+ * cond_resched() periodically to prevent hogging the CPU for a long time
+ * and schedule something else, if required.
+ */
+static void stage2_destroy_range(struct kvm_pgtable *pgt, phys_addr_t addr,
+ phys_addr_t end)
+{
+ u64 next;
+
+ do {
+ next = stage2_range_addr_end(addr, end);
+ kvm_pgtable_stage2_destroy_range(pgt, addr, next - addr);
+
+ if (next != end)
+ cond_resched();
+ } while (addr = next, addr != end);
+}
+
+static void kvm_destroy_stage2_pgt(struct kvm_pgtable *pgt)
+{
+ if (!is_protected_kvm_enabled()) {
+ stage2_destroy_range(pgt, 0, BIT(pgt->ia_bits));
+ kvm_pgtable_stage2_destroy_pgd(pgt);
+ } else {
+ pkvm_pgtable_stage2_destroy(pgt);
+ }
+}
+
/**
* free_hyp_pgds - free Hyp-mode page tables
*/
@@ -984,7 +1018,7 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
return 0;
out_destroy_pgtable:
- KVM_PGT_FN(kvm_pgtable_stage2_destroy)(pgt);
+ kvm_destroy_stage2_pgt(pgt);
out_free_pgtable:
kfree(pgt);
return err;
@@ -1081,7 +1115,7 @@ void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
write_unlock(&kvm->mmu_lock);
if (pgt) {
- KVM_PGT_FN(kvm_pgtable_stage2_destroy)(pgt);
+ kvm_destroy_stage2_pgt(pgt);
kfree(pgt);
}
}
--
2.50.1.470.g6ba607880d-goog
* Re: [PATCH 2/2] KVM: arm64: Destroy the stage-2 page-table periodically
From: Sean Christopherson @ 2025-07-25 14:59 UTC
To: Raghavendra Rao Ananta
Cc: Oliver Upton, Marc Zyngier, Mingwei Zhang, linux-arm-kernel,
kvmarm, linux-kernel, kvm
Heh, without full context, the shortlog reads like "destroy stage-2 page tables
from time to time". Something like this would be more appropriate:
KVM: arm64: Reschedule as needed when destroying stage-2 page-tables
* Re: [PATCH 2/2] KVM: arm64: Destroy the stage-2 page-table periodically
From: Raghavendra Rao Ananta @ 2025-07-25 16:22 UTC
To: Sean Christopherson
Cc: Oliver Upton, Marc Zyngier, Mingwei Zhang, linux-arm-kernel,
kvmarm, linux-kernel, kvm
On Fri, Jul 25, 2025 at 7:59 AM Sean Christopherson <seanjc@google.com> wrote:
>
> Heh, without full context, the shortlog reads like "destroy stage-2 page tables
> from time to time". Something like this would be more appropriate:
>
> KVM: arm64: Reschedule as needed when destroying stage-2 page-tables
This definitely sounds better :)
* Re: [PATCH 1/2] KVM: arm64: Split kvm_pgtable_stage2_destroy()
From: Oliver Upton @ 2025-07-29 15:57 UTC
To: Raghavendra Rao Ananta
Cc: Marc Zyngier, Mingwei Zhang, linux-arm-kernel, kvmarm,
linux-kernel, kvm
On Thu, Jul 24, 2025 at 11:51:43PM +0000, Raghavendra Rao Ananta wrote:
> Split kvm_pgtable_stage2_destroy() into two:
> - kvm_pgtable_stage2_destroy_range(), that performs the
> page-table walk and frees the entries over a range of addresses.
> - kvm_pgtable_stage2_destroy_pgd(), that frees the PGD.
>
> This refactoring enables subsequent patches to free large page-tables
> in chunks, calling cond_resched() between each chunk, to yield the CPU
> as necessary.
>
> Direct callers of kvm_pgtable_stage2_destroy() will continue to walk
> the entire range of the VM as before, ensuring no functional changes.
>
> Also, add equivalent pkvm_pgtable_stage2_*() stubs to maintain 1:1
> mapping of the page-table functions.
Uhh... We can't stub these functions out for protected mode, we already
have a load-bearing implementation of pkvm_pgtable_stage2_destroy().
Just reuse what's already there and provide a NOP for
pkvm_pgtable_stage2_destroy_pgd().
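Illustratively, that would make the range variant wrap the existing
unmap path, with a no-op for the PGD side (a sketch of the suggestion,
not the final form):

	void pkvm_pgtable_stage2_destroy_range(struct kvm_pgtable *pgt,
					       u64 addr, u64 size)
	{
		__pkvm_pgtable_stage2_unmap(pgt, addr, addr + size);
	}

	void pkvm_pgtable_stage2_destroy_pgd(struct kvm_pgtable *pgt)
	{
		/* Nothing to free here: pKVM reclaims the pages at EL2. */
	}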
> +void kvm_pgtable_stage2_destroy_pgd(struct kvm_pgtable *pgt)
> +{
> + /*
> + * We aren't doing a pgtable walk here, but the walker struct is needed
> + * for kvm_dereference_pteref(), which only looks at the ->flags.
> + */
> + struct kvm_pgtable_walker walker = {0};
This feels subtle and prone to error. I'd rather we have something that
boils down to rcu_dereference_raw() (with the appropriate n/hVHE awareness)
and add a comment why it is safe.
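A rough sketch of what such a helper could look like (the name is
hypothetical, and an nVHE variant would need to avoid RCU, which is
not usable at EL2):

	static kvm_pte_t *stage2_pgd_unlinked(struct kvm_pgtable *pgt)
	{
		/*
		 * Safe without RCU protection: the page-table has already
		 * been unlinked from the KVM MMU, so no concurrent walker
		 * can observe it.
		 */
		return rcu_dereference_raw(pgt->pgd);
	}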
> +void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt)
> +{
> + kvm_pgtable_stage2_destroy_range(pgt, 0, BIT(pgt->ia_bits));
> + kvm_pgtable_stage2_destroy_pgd(pgt);
> +}
> +
Move this to mmu.c as a static function and use KVM_PGT_FN()
Thanks,
Oliver
* Re: [PATCH 2/2] KVM: arm64: Destroy the stage-2 page-table periodically
From: Oliver Upton @ 2025-07-29 16:01 UTC
To: Raghavendra Rao Ananta
Cc: Marc Zyngier, Mingwei Zhang, linux-arm-kernel, kvmarm,
linux-kernel, kvm
On Thu, Jul 24, 2025 at 11:51:44PM +0000, Raghavendra Rao Ananta wrote:
> +/*
> + * Assume that @pgt is valid and unlinked from the KVM MMU to free the
> + * page-table without taking the kvm_mmu_lock and without performing any
> + * TLB invalidations.
> + *
> + * Also, the range of addresses can be large enough to cause need_resched
> + * warnings, for instance on CONFIG_PREEMPT_NONE kernels. Hence, invoke
> + * cond_resched() periodically to prevent hogging the CPU for a long time
> + * and schedule something else, if required.
> + */
> +static void stage2_destroy_range(struct kvm_pgtable *pgt, phys_addr_t addr,
> + phys_addr_t end)
> +{
> + u64 next;
> +
> + do {
> + next = stage2_range_addr_end(addr, end);
> + kvm_pgtable_stage2_destroy_range(pgt, addr, next - addr);
> +
> + if (next != end)
> + cond_resched();
> + } while (addr = next, addr != end);
> +}
> +
> +static void kvm_destroy_stage2_pgt(struct kvm_pgtable *pgt)
> +{
> + if (!is_protected_kvm_enabled()) {
> + stage2_destroy_range(pgt, 0, BIT(pgt->ia_bits));
> + kvm_pgtable_stage2_destroy_pgd(pgt);
> + } else {
> + pkvm_pgtable_stage2_destroy(pgt);
> + }
> +}
> +
Protected mode is affected by the same problem, potentially even worse
due to the overheads of calling into EL2. Both protected and
non-protected flows should use stage2_destroy_range().
Thanks,
Oliver
* Re: [PATCH 2/2] KVM: arm64: Destroy the stage-2 page-table periodically
From: Raghavendra Rao Ananta @ 2025-08-07 18:58 UTC
To: Oliver Upton
Cc: Marc Zyngier, Mingwei Zhang, linux-arm-kernel, kvmarm,
linux-kernel, kvm
Hi Oliver,
>
> Protected mode is affected by the same problem, potentially even worse
> due to the overheads of calling into EL2. Both protected and
> non-protected flows should use stage2_destroy_range().
>
I experimented with this (see diff below), and it looks like it takes
significantly longer to finish the destruction even for a very small
VM. For instance, it takes ~140 seconds on an Ampere Altra machine.
This is probably because we run cond_resched() for every chunk in
the entire sweep of the possible address range, 0 to ~(0ULL), even
though there are no actual mappings there, and we context switch out
more often.
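(Back-of-the-envelope, assuming ~1GiB chunks from stage2_range_addr_end()
with 4K pages: sweeping 0 to ~(0ULL) is 2^64 / 2^30 = 2^34, roughly 17
billion iterations, so even a few nanoseconds of empty lookup per chunk
adds up to minutes, which is consistent with the ~140 seconds observed.)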
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
+static void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt)
+{
+	u64 end = is_protected_kvm_enabled() ? ~(0ULL) : BIT(pgt->ia_bits);
+	u64 next, addr = 0;
+
+	do {
+		next = stage2_range_addr_end(addr, end);
+		KVM_PGT_FN(kvm_pgtable_stage2_destroy_range)(pgt, addr,
+							     next - addr);
+
+		if (next != end)
+			cond_resched();
+	} while (addr = next, addr != end);
+
+	KVM_PGT_FN(kvm_pgtable_stage2_destroy_pgd)(pgt);
+}
--- a/arch/arm64/kvm/pkvm.c
+++ b/arch/arm64/kvm/pkvm.c
@@ -316,9 +316,13 @@ static int __pkvm_pgtable_stage2_unmap(struct kvm_pgtable *pgt, u64 start, u64 e
 	return 0;
 }
 
-void pkvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt)
+void pkvm_pgtable_stage2_destroy_range(struct kvm_pgtable *pgt, u64 addr, u64 size)
+{
+	__pkvm_pgtable_stage2_unmap(pgt, addr, addr + size);
+}
+
+void pkvm_pgtable_stage2_destroy_pgd(struct kvm_pgtable *pgt)
+{
+}
Without cond_resched() in place, it takes about half the time.
I also tried moving cond_resched() to __pkvm_pgtable_stage2_unmap(),
as per the below diff, and calling pkvm_pgtable_stage2_destroy_range()
for the entire 0 to ~(0ULL) range (instead of breaking it up). Even
for a fully 4K-mapped 128G VM, I see it taking ~65 seconds, which is
close to the baseline (no cond_resched()).
--- a/arch/arm64/kvm/pkvm.c
+++ b/arch/arm64/kvm/pkvm.c
@@ -311,8 +311,11 @@ static int __pkvm_pgtable_stage2_unmap(struct kvm_pgtable *pgt, u64 start, u64 e
 			return ret;
 		pkvm_mapping_remove(mapping, &pgt->pkvm_mappings);
 		kfree(mapping);
+		cond_resched();
 	}
Does it make sense to call cond_resched() only when we actually unmap pages?
Thank you.
Raghavendra
* Re: [PATCH 2/2] KVM: arm64: Destroy the stage-2 page-table periodically
From: Oliver Upton @ 2025-08-08 18:56 UTC
To: Raghavendra Rao Ananta
Cc: Marc Zyngier, Mingwei Zhang, linux-arm-kernel, kvmarm,
linux-kernel, kvm
On Thu, Aug 07, 2025 at 11:58:01AM -0700, Raghavendra Rao Ananta wrote:
> Hi Oliver,
>
> >
> > Protected mode is affected by the same problem, potentially even worse
> > due to the overheads of calling into EL2. Both protected and
> > non-protected flows should use stage2_destroy_range().
> >
> I experimented with this (see diff below), and it looks like it takes
> significantly longer to finish the destruction even for a very small
> VM. For instance, it takes ~140 seconds on an Ampere Altra machine.
> This is probably because we run cond_resched() for every chunk in
> the entire sweep of the possible address range, 0 to ~(0ULL), even
> though there are no actual mappings there, and we context switch out
> more often.
This seems more like an issue with the upper bound on a pKVM walk rather
than a problem with the suggestion. The information in pgt->ia_bits is
actually derived from the VTCR value of the owning MMU.
Even though we never use the VTCR value in hardware, pKVM MMUs have a
valid VTCR value that encodes the size of the IPA space and we use that
in the common stage-2 abort path.
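(Concretely, VTCR_EL2_IPA() recovers the IPA size from the VTCR_EL2.T0SZ
field, i.e. ia_bits = 64 - T0SZ, so a pKVM walk can be bounded at
BIT(ia_bits) instead of sweeping up to ~(0ULL).)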
I'm attaching some fixups that I have on top of your series that'd allow
the resched logic to remain common, like it is in other MMU flows.
From 421468dcaa4692208c3f708682b058cfc072a984 Mon Sep 17 00:00:00 2001
From: Oliver Upton <oliver.upton@linux.dev>
Date: Fri, 8 Aug 2025 11:43:12 -0700
Subject: [PATCH 4/4] fixup! KVM: arm64: Destroy the stage-2 page-table
periodically
---
arch/arm64/kvm/mmu.c | 60 ++++++++++++++++++--------------------------
1 file changed, 25 insertions(+), 35 deletions(-)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index b82412323054..fc93cc256bd8 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -383,40 +383,6 @@ static void stage2_flush_vm(struct kvm *kvm)
srcu_read_unlock(&kvm->srcu, idx);
}
-/*
- * Assume that @pgt is valid and unlinked from the KVM MMU to free the
- * page-table without taking the kvm_mmu_lock and without performing any
- * TLB invalidations.
- *
- * Also, the range of addresses can be large enough to cause need_resched
- * warnings, for instance on CONFIG_PREEMPT_NONE kernels. Hence, invoke
- * cond_resched() periodically to prevent hogging the CPU for a long time
- * and schedule something else, if required.
- */
-static void stage2_destroy_range(struct kvm_pgtable *pgt, phys_addr_t addr,
- phys_addr_t end)
-{
- u64 next;
-
- do {
- next = stage2_range_addr_end(addr, end);
- kvm_pgtable_stage2_destroy_range(pgt, addr, next - addr);
-
- if (next != end)
- cond_resched();
- } while (addr = next, addr != end);
-}
-
-static void kvm_destroy_stage2_pgt(struct kvm_pgtable *pgt)
-{
- if (!is_protected_kvm_enabled()) {
- stage2_destroy_range(pgt, 0, BIT(pgt->ia_bits));
- kvm_pgtable_stage2_destroy_pgd(pgt);
- } else {
- pkvm_pgtable_stage2_destroy(pgt);
- }
-}
-
/**
* free_hyp_pgds - free Hyp-mode page tables
*/
@@ -938,11 +904,35 @@ static int kvm_init_ipa_range(struct kvm_s2_mmu *mmu, unsigned long type)
return 0;
}
+/*
+ * Assume that @pgt is valid and unlinked from the KVM MMU to free the
+ * page-table without taking the kvm_mmu_lock and without performing any
+ * TLB invalidations.
+ *
+ * Also, the range of addresses can be large enough to cause need_resched
+ * warnings, for instance on CONFIG_PREEMPT_NONE kernels. Hence, invoke
+ * cond_resched() periodically to prevent hogging the CPU for a long time
+ * and schedule something else, if required.
+ */
+static void stage2_destroy_range(struct kvm_pgtable *pgt, phys_addr_t addr,
+ phys_addr_t end)
+{
+ u64 next;
+
+ do {
+ next = stage2_range_addr_end(addr, end);
+ KVM_PGT_FN(kvm_pgtable_stage2_destroy_range)(pgt, addr, next - addr);
+
+ if (next != end)
+ cond_resched();
+ } while (addr = next, addr != end);
+}
+
static void kvm_stage2_destroy(struct kvm_pgtable *pgt)
{
unsigned int ia_bits = VTCR_EL2_IPA(pgt->mmu->vtcr);
- KVM_PGT_FN(kvm_pgtable_stage2_destroy_range)(pgt, 0, BIT(ia_bits));
+ stage2_destroy_range(pgt, 0, BIT(ia_bits));
KVM_PGT_FN(kvm_pgtable_stage2_destroy_pgd)(pgt);
}
--
2.39.5
* Re: [PATCH 1/2] KVM: arm64: Split kvm_pgtable_stage2_destroy()
From: Oliver Upton @ 2025-08-08 18:57 UTC
To: Raghavendra Rao Ananta
Cc: Marc Zyngier, Mingwei Zhang, linux-arm-kernel, kvmarm,
linux-kernel, kvm
On Thu, Jul 24, 2025 at 11:51:43PM +0000, Raghavendra Rao Ananta wrote:
> Split kvm_pgtable_stage2_destroy() into two:
> - kvm_pgtable_stage2_destroy_range(), that performs the
> page-table walk and frees the entries over a range of addresses.
> - kvm_pgtable_stage2_destroy_pgd(), that frees the PGD.
>
> This refactoring enables subsequent patches to free large page-tables
> in chunks, calling cond_resched() between each chunk, to yield the CPU
> as necessary.
>
> Direct callers of kvm_pgtable_stage2_destroy() will continue to walk
> the entire range of the VM as before, ensuring no functional changes.
>
> Also, add equivalent pkvm_pgtable_stage2_*() stubs to maintain 1:1
> mapping of the page-table functions.
>
> Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
Here's the other half of my fixups
From 7d3e948357d0d2568afc136906e1b973ed39deeb Mon Sep 17 00:00:00 2001
From: Oliver Upton <oliver.upton@linux.dev>
Date: Fri, 8 Aug 2025 11:35:43 -0700
Subject: [PATCH 2/4] fixup! KVM: arm64: Split kvm_pgtable_stage2_destroy()
---
arch/arm64/include/asm/kvm_pgtable.h | 4 ++--
arch/arm64/kvm/hyp/nvhe/mem_protect.c | 2 +-
arch/arm64/kvm/hyp/pgtable.c | 2 +-
arch/arm64/kvm/mmu.c | 12 ++++++++++--
arch/arm64/kvm/pkvm.c | 12 ++++--------
5 files changed, 18 insertions(+), 14 deletions(-)
diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index 20aea58eca18..fdae4685b9ac 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -562,13 +562,13 @@ void kvm_pgtable_stage2_destroy_range(struct kvm_pgtable *pgt,
void kvm_pgtable_stage2_destroy_pgd(struct kvm_pgtable *pgt);
/**
- * kvm_pgtable_stage2_destroy() - Destroy an unused guest stage-2 page-table.
+ * __kvm_pgtable_stage2_destroy() - Destroy an unused guest stage-2 page-table.
* @pgt: Page-table structure initialised by kvm_pgtable_stage2_init*().
*
* The page-table is assumed to be unreachable by any hardware walkers prior
* to freeing and therefore no TLB invalidation is performed.
*/
-void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt);
+void __kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt);
/**
* kvm_pgtable_stage2_free_unlinked() - Free an unlinked stage-2 paging structure.
diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
index 95d7534c9679..5eb8d6e29ac4 100644
--- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
+++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
@@ -297,7 +297,7 @@ void reclaim_pgtable_pages(struct pkvm_hyp_vm *vm, struct kvm_hyp_memcache *mc)
/* Dump all pgtable pages in the hyp_pool */
guest_lock_component(vm);
- kvm_pgtable_stage2_destroy(&vm->pgt);
+ __kvm_pgtable_stage2_destroy(&vm->pgt);
vm->kvm.arch.mmu.pgd_phys = 0ULL;
guest_unlock_component(vm);
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 7fad791cf40b..aa735ffe8d49 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -1577,7 +1577,7 @@ void kvm_pgtable_stage2_destroy_pgd(struct kvm_pgtable *pgt)
pgt->pgd = NULL;
}
-void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt)
+void __kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt)
{
kvm_pgtable_stage2_destroy_range(pgt, 0, BIT(pgt->ia_bits));
kvm_pgtable_stage2_destroy_pgd(pgt);
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 9a45daf817bf..6330a02c8418 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -904,6 +904,14 @@ static int kvm_init_ipa_range(struct kvm_s2_mmu *mmu, unsigned long type)
return 0;
}
+static void kvm_stage2_destroy(struct kvm_pgtable *pgt)
+{
+ unsigned int ia_bits = VTCR_EL2_IPA(pgt->mmu->vtcr);
+
+ KVM_PGT_FN(kvm_pgtable_stage2_destroy_range)(pgt, 0, BIT(ia_bits));
+ KVM_PGT_FN(kvm_pgtable_stage2_destroy_pgd)(pgt);
+}
+
/**
* kvm_init_stage2_mmu - Initialise a S2 MMU structure
* @kvm: The pointer to the KVM structure
@@ -980,7 +988,7 @@ int kvm_init_stage2_mmu(struct kvm *kvm, struct kvm_s2_mmu *mmu, unsigned long t
return 0;
out_destroy_pgtable:
- KVM_PGT_FN(kvm_pgtable_stage2_destroy)(pgt);
+ kvm_stage2_destroy(pgt);
out_free_pgtable:
kfree(pgt);
return err;
@@ -1077,7 +1085,7 @@ void kvm_free_stage2_pgd(struct kvm_s2_mmu *mmu)
write_unlock(&kvm->mmu_lock);
if (pgt) {
- KVM_PGT_FN(kvm_pgtable_stage2_destroy)(pgt);
+ kvm_stage2_destroy(pgt);
kfree(pgt);
}
}
diff --git a/arch/arm64/kvm/pkvm.c b/arch/arm64/kvm/pkvm.c
index bf737717ccb4..3be208449bd7 100644
--- a/arch/arm64/kvm/pkvm.c
+++ b/arch/arm64/kvm/pkvm.c
@@ -316,11 +316,6 @@ static int __pkvm_pgtable_stage2_unmap(struct kvm_pgtable *pgt, u64 start, u64 e
return 0;
}
-void pkvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt)
-{
- __pkvm_pgtable_stage2_unmap(pgt, 0, ~(0ULL));
-}
-
int pkvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 addr, u64 size,
u64 phys, enum kvm_pgtable_prot prot,
void *mc, enum kvm_pgtable_walk_flags flags)
@@ -452,12 +447,13 @@ int pkvm_pgtable_stage2_split(struct kvm_pgtable *pgt, u64 addr, u64 size,
}
void pkvm_pgtable_stage2_destroy_range(struct kvm_pgtable *pgt,
- u64 addr, u64 size)
+ u64 addr, u64 size)
{
- WARN_ON_ONCE(1);
+ __pkvm_pgtable_stage2_unmap(pgt, addr, size);
}
void pkvm_pgtable_stage2_destroy_pgd(struct kvm_pgtable *pgt)
{
- WARN_ON_ONCE(1);
+ /* Expected to be called after all pKVM mappings have been released. */
+ WARN_ON_ONCE(!RB_EMPTY_ROOT(&pgt->pkvm_mappings.rb_root));
}
--
2.39.5